## Day 23: Preprocessing with Pandas and Matplotlib 

1. Create a pandas DataFrame from the lists below. Write 
code to check whether any column of the DataFrame 
contains missing values. 

In [8]:
import pandas as pd 
import numpy as np


names = ["Kelly", np.nan, 'Jon', 'Ken', 'Tim', 'Pel']
grades = [30, 40, 30, 67, np.nan, 55]
age = [15, np.nan, 18, 17, np.nan, 16]

df = pd.DataFrame({"Names":names, "Grades":grades, "Age":age})
df



   Names  Grades   Age
0  Kelly    30.0  15.0
1    NaN    40.0   NaN
2    Jon    30.0  18.0
3    Ken    67.0  17.0
4    Tim     NaN   NaN
5    Pel    55.0  16.0


In [9]:
# Count the missing values in each column
df.isnull().sum()

Names     1
Grades    1
Age       2
dtype: int64
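The per-column count can be complemented by a yes/no flag for each column and by a view of the rows that contain gaps. The following is a minimal sketch that rebuilds the DataFrame from question 1; the names `has_missing` and `rows_with_gaps` are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Same data as in question 1
df = pd.DataFrame({
    "Names": ["Kelly", np.nan, "Jon", "Ken", "Tim", "Pel"],
    "Grades": [30, 40, 30, 67, np.nan, 55],
    "Age": [15, np.nan, 18, 17, np.nan, 16],
})

# True for every column that contains at least one missing value
has_missing = df.isna().any()

# Rows in which at least one column is missing
rows_with_gaps = df[df.isna().any(axis=1)]

print(has_missing)
print(rows_with_gaps)
```

`isna()` and `isnull()` are aliases in pandas, so either spelling gives the same result.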

2. Your supervisor has asked you to show them how to fill 
in missing values using the scikit-learn library. Make a copy 
of the original DataFrame, then write code that fills in the 
missing values in the copy using this library. Replace the 
missing value in the "Names" column with the name "Paul". 
Use the mean strategy for the "Grades" column and the 
median strategy for the "Age" column. 

In [10]:
from sklearn.impute import SimpleImputer 

# Create a copy of the DataFrame

df_copy = df.copy()

# Create one SimpleImputer per column, each with its own strategy

imputer_names = SimpleImputer(strategy='constant', fill_value="Paul")
imputer_grades = SimpleImputer(strategy="mean")
imputer_age = SimpleImputer(strategy='median')

# Fit each imputer on its own column and write the filled values back
df_copy[["Names"]] = imputer_names.fit_transform(df_copy[["Names"]])
df_copy[["Grades"]] = imputer_grades.fit_transform(df_copy[["Grades"]])
df_copy[["Age"]] = imputer_age.fit_transform(df_copy[["Age"]])

df_copy

   Names  Grades   Age
0  Kelly    30.0  15.0
1   Paul    40.0  16.5
2    Jon    30.0  18.0
3    Ken    67.0  17.0
4    Tim    44.4  16.5
5    Pel    55.0  16.0
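As a cross-check on the scikit-learn result, the same fills can be reproduced with plain pandas. This is only a sketch of an alternative, not the approach the exercise asks for; `fill_values` and `df_filled` are illustrative names. The mean of the five known grades is 222 / 5 = 44.4 and the median of the four known ages (15, 16, 17, 18) is 16.5, which matches the imputed values.

```python
import numpy as np
import pandas as pd

# Same data as in question 1
df = pd.DataFrame({
    "Names": ["Kelly", np.nan, "Jon", "Ken", "Tim", "Pel"],
    "Grades": [30, 40, 30, 67, np.nan, 55],
    "Age": [15, np.nan, 18, 17, np.nan, 16],
})

# One fill value per column: a constant, the column mean, the column median
fill_values = {
    "Names": "Paul",
    "Grades": df["Grades"].mean(),
    "Age": df["Age"].median(),
}

# fillna returns a new DataFrame, so the original df keeps its gaps
df_filled = df.fillna(value=fill_values)
print(df_filled)
```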


3. Using the original DataFrame from question 1 (with its 
missing values), identify and drop any columns in which more 
than 30% of the values are missing. Save the result in a new variable. 

In [11]:
# Fraction of missing values in each column (0.0 to 1.0)
missing_values = df.isnull().mean()

# Select columns that have more than 30% missing values
columns_to_drop = missing_values[missing_values > 0.30].index

# Drop the selected columns; df itself is left unchanged
df_cleaned = df.drop(columns=columns_to_drop)
df_cleaned


   Names  Grades
0  Kelly    30.0
1    NaN    40.0
2    Jon    30.0
3    Ken    67.0
4    Tim     NaN
5    Pel    55.0
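Only "Age" is dropped here: it is missing 2 of 6 values (about 33%), while "Names" and "Grades" are each missing 1 of 6 (about 17%). The same logic can be wrapped in a small reusable helper. This is a sketch under the assumption that a column is dropped only when its missing fraction is strictly greater than the threshold; `drop_sparse_columns` is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.30):
    """Return a copy of frame without columns whose missing fraction exceeds threshold."""
    missing_fraction = frame.isna().mean()
    # Keep columns at or below the threshold; the boolean mask is aligned on column labels
    return frame.loc[:, missing_fraction <= threshold].copy()


# Same data as in question 1
df = pd.DataFrame({
    "Names": ["Kelly", np.nan, "Jon", "Ken", "Tim", "Pel"],
    "Grades": [30, 40, 30, 67, np.nan, 55],
    "Age": [15, np.nan, 18, 17, np.nan, 16],
})

df_cleaned = drop_sparse_columns(df, threshold=0.30)
print(list(df_cleaned.columns))
```

Raising the threshold to 0.50 would keep all three columns, since no column is missing more than half of its values.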


4. Using pandas, check for any duplicate values in the 
"Names" column of your DataFrame (question 3). 

In [12]:
# Count the rows whose "Names" value repeats an earlier row

df_cleaned.duplicated(subset="Names").sum()

np.int64(0)
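Because the exercise data contains no repeated names, the result is 0 and the behaviour of `duplicated()` is hard to see. The sketch below uses a made-up roster (hypothetical data, not from the exercise) in which "Jon" appears twice, to show the effect of the `keep` parameter and of `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical roster with one repeated name, for illustration only
roster = pd.DataFrame({
    "Names": ["Kelly", "Jon", "Ken", "Jon"],
    "Grades": [30.0, 30.0, 67.0, 45.0],
})

# Default keep="first": only the second "Jon" is flagged as a duplicate
n_repeats = roster.duplicated(subset="Names").sum()

# keep=False flags every row that shares its name with another row
all_repeated_rows = roster[roster.duplicated(subset="Names", keep=False)]

# Keep the first occurrence of each name and drop the rest
deduplicated = roster.drop_duplicates(subset="Names")

print(n_repeats)
print(all_repeated_rows)
print(deduplicated)
```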

5. Use the pandas groupby() and agg() methods to 
calculate the mean, minimum, and maximum of the 
"Grades" column, grouped by the "Names" column. Use the 
imputed copy of the DataFrame that you saved in question 2. 

In [15]:
agg_data = df_copy.groupby("Names").agg({"Grades": ["mean", 'min', "max"]})
agg_data

      Grades            
        mean   min   max
Names                   
Jon     30.0  30.0  30.0
Kelly   30.0  30.0  30.0
Ken     67.0  67.0  67.0
Paul    40.0  40.0  40.0
Pel     55.0  55.0  55.0
Tim     44.4  44.4  44.4
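Each name occurs once, so the mean, minimum, and maximum of every group are identical. The dictionary form of `agg()` also produces two-level column labels. If flat column names are preferred, pandas named aggregation is one option. The sketch below rebuilds the imputed DataFrame from question 2 by hand; the output column names `grade_mean`, `grade_min`, and `grade_max` are illustrative choices.

```python
import pandas as pd

# Imputed data as produced in question 2
df_copy = pd.DataFrame({
    "Names": ["Kelly", "Paul", "Jon", "Ken", "Tim", "Pel"],
    "Grades": [30.0, 40.0, 30.0, 67.0, 44.4, 55.0],
    "Age": [15.0, 16.5, 18.0, 17.0, 16.5, 16.0],
})

# Named aggregation: output_column=(input_column, aggregation function)
summary = df_copy.groupby("Names").agg(
    grade_mean=("Grades", "mean"),
    grade_min=("Grades", "min"),
    grade_max=("Grades", "max"),
)
print(summary)
```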
