# Pandas

## Task 1

Import the libraries you will need for the tasks below: `numpy` and `pandas`.

> Don't forget to import `numpy` and `pandas` under their conventional short aliases, `np` and `pd`.

In [12]:
import numpy as np
import pandas as pd

## Task 2

Create a `DataFrame` from the given dictionary `data`.

In [6]:
data = {
    "Name": ["John", "Emily", "Ryan"],
    "Age": [16, 28, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
}

df = pd.DataFrame(data)
df

## Task 3

In this task you should complete the following steps:

### Task 3.1
Display all data in the `Age` column of the `DataFrame` you created in the previous task.

In [9]:
df["Age"]

### Task 3.2

Add a `Salary` column to the `DataFrame` with the values `[50000, 60000, 45000]`.

In [10]:
df["Salary"] = [50000, 60000, 45000]
df

### Task 3.3

Filter the `DataFrame` to show only the rows where `Age` is greater than 18.

In [11]:
df_adults = df[df["Age"] > 18]
df_adults
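The filter above works by first building a boolean `Series` (a mask) and then using it to select rows. The sketch below makes that intermediate step visible; the names `mask` and `adults` are illustrative and not part of the task.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Emily", "Ryan"],
        "Age": [16, 28, 22],
        "City": ["New York", "Los Angeles", "Chicago"],
    }
)

# Comparing a column with a scalar yields one True/False value per row
mask = df["Age"] > 18
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
adults = df[mask]
print(adults["Name"].tolist())
```

Note that the selected rows keep their original index labels.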

## Task 4

In this task you should complete the following steps:

### Task 4.1

Add a new calculated field, `Birth year`.

In [16]:
import datetime

df["Birth year"] = datetime.datetime.now().year - df["Age"]
df

### Task 4.2

Add a new calculated field, `Average age`.

In [17]:
df["Average age"] = df["Age"].mean()
df

### Task 4.3

Calculate the absolute difference between `Age` and `Average age`.

In [1]:
df["Difference"] = abs(df["Age"] - df["Average age"])
df

## Task 5

Complete the tasks described below.

### Task 5.1

You have two dictionaries, `data1` and `data2`. Create a `DataFrame` from each of them. Then combine the two `DataFrame` objects into one, first with `merge` and then with `concat`, and compare the results.

In [20]:
# Data
data1 = {'Name': ['John', 'Emily', 'Ryan'],
         'Age': [25, 28, 22]}
data2 = {'Name': ['Emily', 'Ryan', 'Mike'],
         'City': ['Los Angeles', 'Chicago', 'Houston']}

# create DataFrame objects
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

In [22]:
# merge - combines only the records whose key appears in both DataFrames (inner join by default)
df_merged = pd.merge(df1, df2, on="Name")
print("Merged DataFrame:")
df_merged

In [24]:
# concatenate - simply stacks the DataFrames one after another
df_concatenated = pd.concat([df1, df2])
print("Concatenated DataFrame:")
df_concatenated
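To make the difference easier to see, here is a minimal sketch that compares the shapes of the results. It also shows two optional arguments not used above: `how="outer"` for `pd.merge` and `ignore_index=True` for `pd.concat`. The variable names `inner`, `outer`, and `stacked` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Emily", "Ryan"], "Age": [25, 28, 22]})
df2 = pd.DataFrame(
    {"Name": ["Emily", "Ryan", "Mike"], "City": ["Los Angeles", "Chicago", "Houston"]}
)

# Inner join (the default): only names present in both frames survive
inner = pd.merge(df1, df2, on="Name")

# Outer join: every name from either frame, with NaN where data is missing
outer = pd.merge(df1, df2, on="Name", how="outer")

# Concatenation: rows are stacked, columns are unioned, no key matching
stacked = pd.concat([df1, df2], ignore_index=True)

print(inner.shape, outer.shape, stacked.shape)
print(stacked.isna().sum())
```

`merge` matches rows by the `Name` key, so each person appears once. `concat` does no matching, so Emily and Ryan appear twice, each time with one of `Age` or `City` missing.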

### Task 5.2

You have two `Series` objects, `s1` and `s2`. Apply the standard arithmetic operators to them: `+`, `-`, `*`, `/`, `**`, `//`, and `%`. Then filter `s1` to print only its even numbers, and `s2` to print only its odd numbers.


In [None]:
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])

In [None]:
print("Addition (+):")
print(s1 + s2)

In [None]:
print("Subtraction (-):")
print(s1 - s2)

In [None]:
print("Multiplication (*):")
print(s1 * s2)

In [None]:
print("Division (/):")
print(s1 / s2)

In [None]:
print("Power (**):")
print(s1 ** s2)

In [None]:
print("Floor division (//):")
print(s1 // s2)

In [None]:
print("Modulo (%):")
print(s1 % s2)

In [None]:
# filter s1 for even numbers (pandas way)
s1_even = s1[s1 % 2 == 0]
print("Even numbers from s1:")
print(s1_even)

In [None]:
# filter s2 for odd numbers (pandas way)
s2_odd = s2[s2 % 2 == 1]
print("Odd numbers from s2:")
print(s2_odd)
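One detail worth knowing when mixing filtering with arithmetic: a filtered `Series` keeps its original index labels, and arithmetic between two `Series` aligns values by label, not by position. The following sketch illustrates this; the names `total` and `renumbered` are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])

# The even values 2 and 4 keep their original labels 1 and 3
s1_even = s1[s1 % 2 == 0]
print(list(s1_even.index))

# Addition aligns on labels; labels missing from s1_even produce NaN
total = s1_even + s2
print(total)

# reset_index(drop=True) renumbers the filtered values from 0
renumbered = s1_even.reset_index(drop=True)
print(list(renumbered.index))
```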

### Task 5.3

You have a `Series` object `s`. Perform the following operations on it:

* convert all words to uppercase without loops;
* get the length of each word without loops.


In [None]:
s = pd.Series(["numpy", "pandas", "matplotlib"])

In [None]:
s_upper = s.str.upper()
print("Uppercase:")
print(s_upper)

In [None]:
s_length = s.str.len()
print("Word lengths:")
print(s_length)

## Task 6 (Optional)

You have a large dataset consisting of a series of numbers. Your task is to calculate the moving average of the series using a window of `5` elements. The moving average is the average of a set of consecutive values in the series, where the window slides through the series to compute the average at each position. The goal is to calculate the moving average efficiently and accurately. Also compare the time performance of the Python and Pandas solutions.

> Please note: if you get `nan` values while calculating the moving average, remove them from the result. The Pandas solution should take 1-2 lines of code.


### Python Solution

In the Python solution, a large dataset is generated using random numbers. The moving average is computed by iterating over the dataset using a loop. At each position, a window of 5 elements is created, and the average is calculated by summing the values in the window and dividing by 5.

In [None]:
import random

def unknown_signature():
    # Generate a large dataset
    data = [random.randint(1, 100) for _ in range(10000000)]

    # Calculate the moving average with a window of 5
    moving_averages = []
    for i in range(4, len(data)):
        window = data[i - 4 : i + 1]
        average = sum(window) / 5
        moving_averages.append(average)
    return moving_averages[:10]

%time unknown_signature()
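The loop above re-sums all 5 elements at every position. As a hedged alternative, the sketch below keeps a running window sum and updates it by adding the entering value and subtracting the leaving one. The helper name `moving_average_running_sum` and the small input list are hypothetical, chosen so the result can be checked by hand and against `rolling`.

```python
import pandas as pd


def moving_average_running_sum(values, window=5):
    """Illustrative helper: moving average via a running window sum."""
    if len(values) < window:
        return []
    # Sum of the first full window
    window_sum = sum(values[:window])
    averages = [window_sum / window]
    for i in range(window, len(values)):
        # Slide the window: add the new element, drop the oldest one
        window_sum += values[i] - values[i - window]
        averages.append(window_sum / window)
    return averages


small = [1, 2, 3, 4, 5, 6, 7]
manual = moving_average_running_sum(small)
with_pandas = pd.Series(small).rolling(window=5).mean().dropna().tolist()

print(manual)       # [3.0, 4.0, 5.0]
print(with_pandas)  # [3.0, 4.0, 5.0]
```

For integer input the running sum is exact. With floating-point data, rounding errors can accumulate over very long series, so treat this as a sketch and not a drop-in replacement.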

### Solution using Pandas and Numpy


In [None]:
import time

def pandas_moving_average():
    # Generate large dataset using pandas
    data = pd.Series(np.random.randint(1, 101, 10000000))
    
    # Calculate moving average with window=5 and remove NaN values
    moving_avg = data.rolling(window=5).mean().dropna()
    return moving_avg.head(10).values

print("Pandas solution:")
start_time = time.time()
result_pandas = pandas_moving_average()
end_time = time.time()

print(f"Result: {result_pandas}")
print(f"Time taken: {end_time - start_time:.4f} seconds")

# Compare with the pure-Python solution
print("\nPython solution:")
start_time = time.time()
result_python = unknown_signature()
end_time = time.time()

print(f"Result: {result_python}")
print(f"Time taken: {end_time - start_time:.4f} seconds")
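Both timings above include the cost of generating the random data, and each solution works on a different random dataset. A fairer comparison, sketched below under those observations, generates the data once, times only the moving-average step with `time.perf_counter`, and checks that both approaches agree. The smaller dataset size and the names `values`, `loop_result`, and `pandas_result` are assumptions made to keep the example quick to run.

```python
import time

import numpy as np
import pandas as pd

# Generate the data once so both approaches see the same input
rng = np.random.default_rng(seed=0)
values = rng.integers(1, 101, size=100_000)
as_list = values.tolist()

# Pure-Python loop over windows of 5 elements
start = time.perf_counter()
loop_result = [sum(as_list[i - 4 : i + 1]) / 5 for i in range(4, len(as_list))]
loop_seconds = time.perf_counter() - start

# Pandas rolling mean, with the leading NaN values removed
start = time.perf_counter()
pandas_result = pd.Series(values).rolling(window=5).mean().dropna()
pandas_seconds = time.perf_counter() - start

print(f"Results match: {np.allclose(loop_result, pandas_result.to_numpy())}")
print(f"Python loop: {loop_seconds:.4f} seconds")
print(f"Pandas:      {pandas_seconds:.4f} seconds")
```

Exact timings depend on the machine and library versions, so no expected numbers are shown here.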
