In [1]:
import pandas as pd
import numpy as np

data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}

df = pd.DataFrame(data)
print("Sample DataFrame:")

print(df)

Sample DataFrame:
   School ID     Name         Address         City  Subject  Marks  Rank Grade
0      101.0    Alice     123 Main St  Los Angeles     Math   85.0     2     B
1      102.0      Bob     456 Oak Ave     New York  English   92.0     1     A
2      103.0  Charlie     789 Pine Ln      Houston  Science   78.0     4     C
3        NaN    David      101 Elm St  Los Angeles     Math   89.0     3     B
4      105.0      Eva             NaN        Miami  History    NaN     8     D
5      106.0    Frank    222 Maple Rd          NaN     Math   95.0     1     A
6      107.0    Grace  444 Cedar Blvd      Houston  Science   80.0     5     C
7      108.0    Henry    555 Birch Dr     New York  English   88.0     3     B


1. Removing Rows with Missing Values
    Removing rows with missing values is a simple, direct way to handle missing data. It is used when we want to keep the analysis clean and minimize complexity.

    Advantages:
    Simple and efficient: It’s easy to implement and quickly removes data points with missing values.   
    Cleans data: It removes potentially problematic data points, ensuring that only complete rows remain in the dataset.
    
    Disadvantages:
    Reduces sample size: When rows are removed, the overall dataset shrinks, which can affect the power and accuracy of our analysis.
    Potential bias: If missing data is not random (e.g., if certain groups are more likely to have missing values), removing rows could introduce bias.
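
    Before dropping anything, it can help to measure how much data would be lost. The snippet below is a minimal sketch on a small hypothetical DataFrame (toy, not the tutorial's df); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'City': ['Houston', np.nan, 'Miami', 'Houston'],
    'Marks': [85.0, 92.0, np.nan, 89.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing value (what dropna() would remove)
rows_with_any_missing = int(toy.isna().any(axis=1).sum())
fraction_lost = rows_with_any_missing / len(toy)

print(missing_per_column)
print(f"Rows dropna() would remove: {rows_with_any_missing} of {len(toy)} ({fraction_lost:.0%})")
```

    If the fraction is large, or the missing rows cluster in one group, imputation may be a safer choice than deletion.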

In [2]:
# Remove every row that contains at least one missing value with dropna(),
# then display the cleaned DataFrame (df_cleaned).
df_cleaned = df.dropna()

df_cleaned



   School ID     Name         Address         City  Subject  Marks  Rank Grade
0      101.0    Alice     123 Main St  Los Angeles     Math   85.0     2     B
1      102.0      Bob     456 Oak Ave     New York  English   92.0     1     A
2      103.0  Charlie     789 Pine Ln      Houston  Science   78.0     4     C
6      107.0    Grace  444 Cedar Blvd      Houston  Science   80.0     5     C
7      108.0    Henry    555 Birch Dr     New York  English   88.0     3     B
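
    Dropping every incomplete row is not the only option: dropna() also accepts the subset and thresh parameters, which can limit how much data is removed. The sketch below uses a small hypothetical DataFrame (toy) rather than df, purely to keep the result easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Address': ['123 Main St', np.nan, '789 Pine Ln', np.nan],
    'Marks': [85.0, 92.0, np.nan, np.nan],
})

# Drop a row only if 'Marks' is missing; a missing 'Address' is tolerated
only_marks_required = toy.dropna(subset=['Marks'])

# Keep rows that have at least 2 non-missing values
at_least_two_values = toy.dropna(thresh=2)

print(only_marks_required)
print(at_least_two_values)
```

    Which variant is appropriate depends on which columns the later analysis actually needs.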


2. Imputation Methods
2.1 Mean, Median and Mode Imputation

    This method involves replacing missing values with the mean, median or mode of the relevant variable. It's a simple approach but it doesn't account for the relationships between variables.

    In this example, we apply these imputation techniques to the 'Marks' column of the DataFrame (df). The code fills the missing values with the mean, median and mode of the existing values in that column and prints the results for comparison.

        df['Marks'].fillna(df['Marks'].mean()): Fills missing values in the 'Marks' column with the mean value.
        df['Marks'].fillna(df['Marks'].median()): Fills missing values in the 'Marks' column with the median value.
        df['Marks'].fillna(df['Marks'].mode().iloc[0]): Fills missing values in the 'Marks' column with the mode value.
        .iloc[0]: mode() returns a Series because several values can tie for most frequent; .iloc[0] selects the first (smallest) of them.

In [6]:
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
# All marks here are unique, so mode() returns every value in ascending
# order and .iloc[0] picks the smallest one (78.0).
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])

print("\nImputation using Mean:")
print(mean_imputation)

print("\nImputation using Median:")
print(median_imputation)

print("\nImputation using Mode:")
print(mode_imputation)


Imputation using Mean:
0    85.000000
1    92.000000
2    78.000000
3    89.000000
4    86.714286
5    95.000000
6    80.000000
7    88.000000
Name: Marks, dtype: float64

Imputation using Median:
0    85.0
1    92.0
2    78.0
3    89.0
4    88.0
5    95.0
6    80.0
7    88.0
Name: Marks, dtype: float64

Imputation using Mode:
0    85.0
1    92.0
2    78.0
3    89.0
4    78.0
5    95.0
6    80.0
7    88.0
Name: Marks, dtype: float64
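
    The same idea extends to non-numeric columns, where only the mode is meaningful. The following is a minimal sketch on a hypothetical DataFrame (toy); choosing the median for the numeric column and the mode for the text column is an assumption made for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    'City': ['Houston', 'Miami', np.nan, 'Houston'],
    'Marks': [80.0, np.nan, 90.0, 100.0],
})

filled = toy.copy()

# Numeric column: fill with the median of the observed values (90.0 here)
filled['Marks'] = filled['Marks'].fillna(filled['Marks'].median())

# Text column: fill with the most frequent value ('Houston' here)
filled['City'] = filled['City'].fillna(filled['City'].mode().iloc[0])

print(filled)
```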


2.2 Forward and Backward Fill

    Forward and backward fill techniques are used to replace missing values by filling them with the nearest non-missing values from the same column. This is useful when there’s an inherent order or sequence in the data.

    pandas provides the ffill() and bfill() methods for this. Older code passes method='ffill' or method='bfill' to fillna() instead, but recent pandas versions deprecate that parameter.

    df['Marks'].ffill(): Fills missing values in the 'Marks' column of the DataFrame (df) using a forward fill strategy. It replaces each missing value with the last observed non-missing value in the column.
    df['Marks'].bfill(): Fills missing values in the 'Marks' column using a backward fill strategy. It replaces each missing value with the next observed non-missing value in the column.

In [11]:
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()

print("\nForward Fill:")
print(forward_fill)

print("\nBackward Fill:")
print(backward_fill)


Forward Fill:
0    85.0
1    92.0
2    78.0
3    89.0
4    89.0
5    95.0
6    80.0
7    88.0
Name: Marks, dtype: float64

Backward Fill:
0    85.0
1    92.0
2    78.0
3    89.0
4    95.0
5    95.0
6    80.0
7    88.0
Name: Marks, dtype: float64
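
    Two edge cases are worth checking when filling this way: a missing value at the very start has nothing before it to copy, and long gaps get filled entirely unless a limit is set. The sketch below uses a hypothetical Series (s) chosen to show both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one leading gap and one gap of length two
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# Forward fill cannot fill the leading NaN (no earlier value exists)
forward = s.ffill()

# limit=1 fills at most one consecutive missing value per gap
forward_limited = s.ffill(limit=1)

# Chaining bfill() after ffill() also covers the leading gap
both = s.ffill().bfill()

print(forward)
print(forward_limited)
print(both)
```

    Here forward still has a NaN at index 0, forward_limited also leaves index 3 empty, and both has no missing values.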


3. Interpolation Techniques
    Interpolation is a technique used to estimate missing values based on the values of surrounding data points. Unlike simpler imputation methods (e.g., mean, median, mode), interpolation uses the relationship between neighboring values to make more informed estimates.

    The interpolate() method in pandas supports several interpolation methods. Two of them are shown here: linear and quadratic. The quadratic method relies on SciPy being installed.

    df['Marks'].interpolate(method='linear'): This method performs linear interpolation on the 'Marks' column of the DataFrame (df), placing each missing value on the straight line between its nearest known neighbors.
    df['Marks'].interpolate(method='quadratic'): This method performs quadratic interpolation on the 'Marks' column, fitting a curve through the known values.




In [14]:
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')

print("\nLinear Interpolation:")
print(linear_interpolation)

print("\nQuadratic Interpolation:")
print(quadratic_interpolation)


Linear Interpolation:
0    85.0
1    92.0
2    78.0
3    89.0
4    92.0
5    95.0
6    80.0
7    88.0
Name: Marks, dtype: float64

Quadratic Interpolation:
0    85.00000
1    92.00000
2    78.00000
3    89.00000
4    98.28024
5    95.00000
6    80.00000
7    88.00000
Name: Marks, dtype: float64
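
    To see what linear interpolation computes, the sketch below fills a gap of two values in a hypothetical Series (s) and compares the result with the same values worked out by hand. With the default integer index, the missing points are treated as equally spaced between the known neighbors.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive missing values
s = pd.Series([10.0, np.nan, np.nan, 40.0])

linear = s.interpolate(method='linear')

# Manual calculation: the gap spans 3 equal steps from 10 to 40
start, end, steps = 10.0, 40.0, 3
manual = [start + (end - start) * k / steps for k in range(steps + 1)]

print(linear.tolist())  # [10.0, 20.0, 30.0, 40.0]
print(manual)           # [10.0, 20.0, 30.0, 40.0]
```

    The same arithmetic explains the value 92.0 in the output above: it is the midpoint of the neighboring marks 89.0 and 95.0.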
