# Exercise

The following exercises cover handling missing values (imputation) in pandas.

In these exercises, you will work with a small fictional dataset describing individuals. The dataset is built directly in code as a DataFrame and consists of the following columns:

1. Age: Age of the individuals.
2. Income: Annual income in USD.
3. Education: Education level of the individuals.
4. Experience: Years of professional experience.

In [1]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'Age': [25, 30, np.nan, 35, 40],  # Age of the individuals
    'Income': [50000, 60000, np.nan, 70000, 80000],  # Annual income in USD
    'Education': ['High School', 'Bachelor', np.nan, 'Master', 'Doctorate'],  # Educational qualification
    'Experience': [5, 8, np.nan, 12, 15]  # Years of professional experience
}
df = pd.DataFrame(data)


### 1. Impute Missing Values with Mean

- Impute missing values in the numeric columns ('Age', 'Income', and 'Experience') with the mean of each respective column.
- Display the DataFrame after imputation.

HINT: you can dynamically select columns by type using `df.select_dtypes(include=['int', 'int64', 'float']).columns`
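As a rough illustration of the hint, the sketch below uses a small hypothetical DataFrame (`toy`, with made-up columns that are not part of the exercise data) to show how `select_dtypes` picks out numeric columns. It uses `include='number'`, an alternative selector that matches all numeric dtypes, and then fills gaps with per-column means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, unrelated to the exercise data
toy = pd.DataFrame({
    'height': [1.70, np.nan, 1.82],   # float column with a gap
    'count': [3, 4, 5],               # integer column, no gaps
    'label': ['a', 'b', None],        # non-numeric column with a gap
})

# 'number' matches every numeric dtype (integers and floats alike)
numeric_columns = toy.select_dtypes(include='number').columns
print(list(numeric_columns))

# fillna accepts a Series of fill values, aligned by column name
filled = toy.copy()
filled[numeric_columns] = toy[numeric_columns].fillna(toy[numeric_columns].mean())
print(filled)
```

Only the numeric columns are touched here, so the gap in `label` is left as it was.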

In [2]:
# Identify numeric columns (integer and float dtypes)
numeric_columns = df.select_dtypes(include=['int', 'int64', 'float']).columns

# Calculate the mean of each numeric column (NaN values are skipped by default)
mean_values = df[numeric_columns].mean()

# Impute missing values in the numeric columns with the column means
df[numeric_columns] = df[numeric_columns].fillna(mean_values)
df

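|   | Age | Income | Education | Experience |
|---|-----|--------|-----------|------------|
| 0 | 25.0 | 50000.0 | High School | 5.0 |
| 1 | 30.0 | 60000.0 | Bachelor | 8.0 |
| 2 | 32.5 | 65000.0 | NaN | 10.0 |
| 3 | 35.0 | 70000.0 | Master | 12.0 |
| 4 | 40.0 | 80000.0 | Doctorate | 15.0 |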


### 2. Impute Missing Values with Median

- Impute missing values in the numeric columns ('Age', 'Income', and 'Experience') with the median of each respective column.
- Display the DataFrame after imputation.
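Mean and median imputation can give quite different results when the data is skewed. The following is a minimal sketch using a hypothetical `income` series (made-up values, not the exercise data) that contains one extreme value:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one missing value and one extreme outlier
income = pd.Series([40000, 45000, np.nan, 50000, 1000000])

# The mean is pulled upward by the outlier, the median is not
mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

print("Filled with mean:  ", mean_filled[2])
print("Filled with median:", median_filled[2])
```

In the exercise data the values are evenly spread, so the two strategies may well produce the same fill values.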

In [5]:
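# Identify numeric columns (integer and float dtypes)
numeric_columns = df.select_dtypes(include=['int', 'int64', 'float']).columns

# Calculate the median of each numeric column
median_values = df[numeric_columns].median()

# Impute missing values in the numeric columns with the column medians
# Note: if Exercise 1 has already been run on df, no numeric values are
# missing any more and this step leaves df unchanged
df[numeric_columns] = df[numeric_columns].fillna(median_values)
df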

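|   | Age | Income | Education | Experience |
|---|-----|--------|-----------|------------|
| 0 | 25.0 | 50000.0 | High School | 5.0 |
| 1 | 30.0 | 60000.0 | Bachelor | 8.0 |
| 2 | 32.5 | 65000.0 | NaN | 10.0 |
| 3 | 35.0 | 70000.0 | Master | 12.0 |
| 4 | 40.0 | 80000.0 | Doctorate | 15.0 |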


### 3. Forward Fill Missing Values

- Forward fill missing values in the 'Education' column.
- Display the DataFrame after imputation.

HINT: you can use `ffill` for this. It relies on the order of the rows, so you may want to sort the DataFrame first with `df.sort_values(by=['column_name'])`.
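To see why row order matters for forward filling, here is a small sketch with a hypothetical `readings` frame (the `date` and `status` columns are invented for illustration and are not part of the exercise data). The rows are deliberately out of chronological order.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, deliberately out of chronological order
readings = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-03', '2024-01-01', '2024-01-02']),
    'status': [np.nan, 'open', np.nan],
})

# Without sorting, the first row has no earlier row to copy from
unsorted_fill = readings['status'].ffill()
print(unsorted_fill)

# Sort first so that "previous row" means "previous date"
ordered = readings.sort_values(by=['date']).reset_index(drop=True)
ordered['status'] = ordered['status'].ffill()
print(ordered)
```

Forward fill only copies values downward, so a gap in the very first row stays missing unless sorting moves a known value above it.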

In [6]:
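# Exercise 3 Solution
# Forward fill only the 'Education' column, working on a copy of df
df_imputed = df.copy()
df_imputed['Education'] = df_imputed['Education'].ffill()
print("\nDataFrame after forward fill in 'Education' column:")
df_imputed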


DataFrame after forward fill in 'Education' column:


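|   | Age | Income | Education | Experience |
|---|-----|--------|-----------|------------|
| 0 | 25.0 | 50000.0 | High School | 5.0 |
| 1 | 30.0 | 60000.0 | Bachelor | 8.0 |
| 2 | 32.5 | 65000.0 | Bachelor | 10.0 |
| 3 | 35.0 | 70000.0 | Master | 12.0 |
| 4 | 40.0 | 80000.0 | Doctorate | 15.0 |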
