# Pandas

## Task 1

Import the libraries you need for this notebook: `numpy` and `pandas`.


> Don't forget to import `numpy` and `pandas` under their conventional short aliases, `np` and `pd`.

In [1]:
# Write your code here
import pandas as pd
import numpy as np




## Task 2

Create a `DataFrame` from the given dictionary `data`.

In [3]:
data = {
    "Name": ["John", "Emily", "Ryan"],
    "Age": [16, 28, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
}

# write your code here
df = pd.DataFrame(data)
print(df)


    Name  Age         City
0   John   16     New York
1  Emily   28  Los Angeles
2   Ryan   22      Chicago


## Task 3

In this task you should complete the following steps:

### Task 3.1
Display all data for `Age` column for the DataFrame you created in the previous task.

In [4]:
# Write your code here
print(df['Age'])


0    16
1    28
2    22
Name: Age, dtype: int64


### Task 3.2

Add `Salary` column to the `DataFrame` with the values `[50000, 60000, 45000]`.

In [5]:
# Write your code here
df['Salary'] = [50000, 60000, 45000]
print(df)


    Name  Age         City  Salary
0   John   16     New York   50000
1  Emily   28  Los Angeles   60000
2   Ryan   22      Chicago   45000


### Task 3.3

Filter the `DataFrame` to show only the rows with the `Age` greater than 18.

In [6]:
# Write your code here
print(df[df['Age'] > 18])


    Name  Age         City  Salary
1  Emily   28  Los Angeles   60000
2   Ryan   22      Chicago   45000
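The filter above relies on a boolean mask: comparing a column with a scalar produces a `Series` of `True`/`False` values, and indexing the `DataFrame` with it keeps only the `True` rows. The following is a minimal sketch of that mechanism on the same data; the names `mask`, `adults`, and `adults_in_chicago` are illustrative and not part of the task.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Emily", "Ryan"],
    "Age": [16, 28, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Comparing a column with a scalar yields a boolean Series (the mask).
mask = df["Age"] > 18

# Indexing with the mask keeps only the rows where it is True.
adults = df[mask]

# Conditions combine with & (and) and | (or); each one needs its own parentheses.
adults_in_chicago = df[(df["Age"] > 18) & (df["City"] == "Chicago")]

print(mask.tolist())
print(adults["Name"].tolist())
print(adults_in_chicago["Name"].tolist())
```

The parentheses matter because `&` binds more tightly than `>` and `==` in Python.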


## Task 4

In this task you should complete the following steps:

### Task 4.1

Add a new calculated field `Birth year`, computed as the current year minus `Age`.

In [7]:
# Write your code here
from datetime import datetime

current_year = datetime.now().year
df['Birth year'] = current_year - df['Age']
print(df)


    Name  Age         City  Salary  Birth year
0   John   16     New York   50000        2008
1  Emily   28  Los Angeles   60000        1996
2   Ryan   22      Chicago   45000        2002


### Task 4.2

Add a new calculated field `Average age` that holds the mean of the `Age` column in every row.

In [8]:
# Write your code here
df['Average age'] = df['Age'].mean()
print(df)


    Name  Age         City  Salary  Birth year  Average age
0   John   16     New York   50000        2008         22.0
1  Emily   28  Los Angeles   60000        1996         22.0
2   Ryan   22      Chicago   45000        2002         22.0


### Task 4.3

Calculate the absolute difference between `Age` and `Average age`.

In [9]:
# Write your code here
absolute_diff = (df['Age'] - df['Average age']).abs()
print(absolute_diff)


0    6.0
1    6.0
2    0.0
dtype: float64
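The same result can be obtained without a helper column, because subtracting a scalar from a `Series` is applied to every element. This is a small sketch using the ages from the task; `ages`, `mean_age`, and `abs_diff` are illustrative names.

```python
import pandas as pd

ages = pd.Series([16, 28, 22], name="Age")

# The mean is a single float; pandas applies the subtraction element-wise.
mean_age = ages.mean()
abs_diff = (ages - mean_age).abs()

print(mean_age)
print(abs_diff.tolist())
```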


## Task 5

Complete the following tasks described below.

### Task 5.1

You have two dictionaries, `data1` and `data2`. Create a `DataFrame` from each of them, then combine the two `DataFrame` objects into one with `merge` and with `concat`, and compare the results.

In [20]:
# Data
data1 = {'Name': ['John', 'Emily', 'Ryan'],
         'Age': [25, 28, 22]}
data2 = {'Name': ['Emily', 'Ryan', 'Mike'],
         'City': ['Los Angeles', 'Chicago', 'Houston']}

# create DataFrame objects
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)


In [16]:
# merge: join rows on the shared `Name` column, keeping names from both frames
result = pd.merge(df1, df2, on='Name', how='outer')
print(result)


    Name   Age         City
0  Emily  28.0  Los Angeles
1   John  25.0          NaN
2   Mike   NaN      Houston
3   Ryan  22.0      Chicago


In [15]:
# concatenate: stack the rows of df2 below df1 and renumber the index
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)


    Name   Age         City
0   John  25.0          NaN
1  Emily  28.0          NaN
2   Ryan  22.0          NaN
3  Emily   NaN  Los Angeles
4   Ryan   NaN      Chicago
5   Mike   NaN      Houston
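To make the difference easier to see, the sketch below compares row counts for an inner merge, an outer merge, and a concatenation of the same two frames. It assumes the join key is `Name`; the `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Emily", "Ryan"], "Age": [25, 28, 22]})
df2 = pd.DataFrame({"Name": ["Emily", "Ryan", "Mike"],
                    "City": ["Los Angeles", "Chicago", "Houston"]})

# Inner merge: only names present in both frames.
inner = pd.merge(df1, df2, on="Name", how="inner")

# Outer merge: every name from either frame, matched where possible.
outer = pd.merge(df1, df2, on="Name", how="outer", indicator=True)

# Concatenation: rows are stacked, no matching on Name takes place.
stacked = pd.concat([df1, df2], ignore_index=True)

print(len(inner), len(outer), len(stacked))
print(outer[["Name", "_merge"]])
```

`merge` matches rows by key, so `Emily` and `Ryan` appear once each with both `Age` and `City`; `concat` only stacks rows, so they appear twice with gaps.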


### Task 5.2

You have two `Series` objects, `s1` and `s2`. Apply the standard arithmetic operators to them: `+`, `-`, `*`, `/`, `**`, `//`, and `%`. Then filter `s1` to keep only the even numbers and `s2` to keep only the odd numbers.


In [None]:
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])


In [18]:
# use + operator here
adding = s1 + s2
print(adding)


0     7
1     9
2    11
3    13
4    15
dtype: int64


In [None]:
# use - operator here
subtract = s1 - s2


In [None]:
# use * operator here
multiply = s1 * s2


In [None]:
# use / operator here
divide = s1 / s2


In [19]:
# use ** operator here
power = s1 ** s2


In [None]:
# use // operator here
divide2 = s1 // s2


In [20]:
# use % operator here
remainder = s1 % s2


In [22]:
# filter s1 here: keep only the even numbers
s1[s1 % 2 == 0]


1    2
3    4
dtype: int64


In [None]:
# filter s2 here: keep only the odd numbers
s2[s2 % 2 != 0]
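As a quick check on how `//`, `%`, and `/` relate, the sketch below divides `s2` by `s1` and rebuilds `s2` from the quotient and remainder. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])

# Floor division and modulo are element-wise and keep the integer dtype.
quotient = s2 // s1
remainder = s2 % s1

# For integers, quotient * divisor + remainder gives back the dividend.
reconstructed = quotient * s1 + remainder

# True division always produces floats, even when the result is whole.
true_division = s2 / s1

print(quotient.tolist())
print(remainder.tolist())
print(true_division.dtype)
```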


### Task 5.3

You have a `Series` object `s`. Perform the following operations on it:

* make all words in uppercase without loops;
* get length of each word without loops.


In [None]:
s = pd.Series(["numpy", "pandas", "matplotlib"])


In [None]:
# make all words in uppercase
upper = s.str.upper()


In [23]:
# get length of each word
length = s.str.len()
print(length)


0     5
1     6
2    10
dtype: int64
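The `.str` accessor applies string methods to every element and skips missing values instead of raising an error. The sketch below adds a `None` entry to the task's data to illustrate this; the extra entry is an assumption made for the example.

```python
import pandas as pd

# Same words as in the task, plus one missing value (added for illustration).
s = pd.Series(["numpy", "pandas", None, "matplotlib"])

# Each .str method runs element-wise; the missing entry stays missing.
upper = s.str.upper()
lengths = s.str.len()

print(upper.tolist())
print(lengths.tolist())
```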


## Task 6 (Optional)

You have a large dataset consisting of a series of numbers. Calculate the moving average of the series using a window of `5` elements. The moving average is the mean of a set of consecutive values: the window slides through the series one position at a time, and the mean is computed at each position. The goal is to compute the moving average efficiently and accurately. Also compare the running time of the pure Python and Pandas solutions.

> Please note: if you get `NaN` values while calculating the moving average, remove them from the result. The Pandas solution should take 1-2 lines of code.
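The `NaN` values mentioned in the note appear at the first positions, where fewer than 5 values are available to fill the window. The sketch below shows this on a short, made-up series; `values`, `rolled`, and `cleaned` are illustrative names.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7])

# With window=5, the first 4 positions have an incomplete window and become NaN.
rolled = values.rolling(window=5).mean()

# dropna() removes those leading NaN entries and keeps the original index labels.
cleaned = rolled.dropna()

print(rolled.tolist())
print(cleaned.tolist())
```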


### Python Solution

In the Python solution, a large dataset is generated using random numbers. The moving average is computed by iterating over the dataset using a loop. At each position, a window of 5 elements is created, and the average is calculated by summing the values in the window and dividing by 5.

In [None]:
import random

def python_moving_average():
    """Generate random data and return the first 10 moving averages (window of 5)."""
    # Generate a large dataset
    data = [random.randint(1, 100) for _ in range(10000000)]

    # Calculate the moving average with a window of 5
    moving_averages = []
    for i in range(4, len(data)):
        window = data[i - 4 : i + 1]
        average = sum(window) / 5
        moving_averages.append(average)
    return moving_averages[:10]

%time python_moving_average()
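The loop above re-sums all five values at every position. One common alternative, shown here as a sketch rather than as part of the original solution, keeps a running sum and updates it by adding the value that enters the window and subtracting the one that leaves. The function name `moving_average_running_sum` and its `window` parameter are hypothetical.

```python
def moving_average_running_sum(data, window=5):
    """Return the moving averages of `data` using a running window sum."""
    if window <= 0 or len(data) < window:
        return []

    # Sum of the first full window.
    window_sum = sum(data[:window])
    averages = [window_sum / window]

    # Slide the window: add the entering value, subtract the leaving one.
    for i in range(window, len(data)):
        window_sum += data[i] - data[i - window]
        averages.append(window_sum / window)
    return averages


print(moving_average_running_sum([1, 2, 3, 4, 5, 6, 7]))
```

This does a constant amount of work per position regardless of the window size, which matters more as the window grows.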


### Solution using Pandas and NumPy


In [25]:
# write your code here
# Time the creation of a 10,000,000-element Series of random integers
series = pd.Series(np.random.randint(0, 100, size=10000000))
%time pd.Series(np.random.randint(0, 100, size=10000000))
# Time the rolling mean; dropna() removes the first 4 incomplete windows
%time series.rolling(window=5).mean().dropna()


CPU times: user 109 ms, sys: 2.93 ms, total: 112 ms
Wall time: 131 ms
CPU times: user 266 ms, sys: 60.8 ms, total: 326 ms
Wall time: 328 ms


4          58.6
5          48.8
6          50.0
7          53.4
8          42.6
           ... 
9999995    58.4
9999996    41.6
9999997    48.2
9999998    37.8
9999999    38.2
Length: 9999996, dtype: float64
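Before comparing timings, it can help to confirm on a small sample that the different approaches agree. The sketch below checks the pandas rolling mean against a plain loop and against `np.convolve` with a uniform kernel; the sample size of 1,000 and the seed are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

# Small reproducible sample (size and seed are arbitrary).
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=1_000)

# pandas: rolling mean, then drop the 4 leading NaN values.
via_pandas = pd.Series(data).rolling(window=5).mean().dropna().to_numpy()

# NumPy: convolution with a uniform kernel; "valid" keeps only full windows.
via_numpy = np.convolve(data, np.ones(5) / 5, mode="valid")

# Plain loop, same logic as the Python solution.
via_loop = np.array([sum(data[i - 4 : i + 1]) / 5 for i in range(4, len(data))])

print(len(via_pandas), np.allclose(via_pandas, via_numpy), np.allclose(via_pandas, via_loop))
```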