# Handling Data with Null Values Using Pandas

## Git repository for our tutorial Jupyter notebooks (including the code) and the data sets: https://github.com/learncodequiz/ipynb_files
## Jupyter notebook file for this video: Handling Data with some Null values using Pandas.ipynb
## Data set for this video: Data_with_nulls.csv. Keep the data set in the same folder as the Jupyter notebook file.

In [66]:
import pandas as pd

## Reading a CSV file with the pd.read_csv() function

In [67]:
# Read the CSV file 'data_with_nulls.csv' into a DataFrame
df_with_nulls_csv = pd.read_csv('data_with_nulls.csv')

# Display the DataFrame; missing entries appear as NaN (Not a Number)
print(df_with_nulls_csv)

    Age  Height  Weight
0  28.0   165.0      58
1  24.0     NaN      75
2  22.0   155.0      50
3   NaN   172.0      65
4  29.0   168.0      60
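
Before choosing a strategy, it can help to count how many values are missing and where. The sketch below is illustrative only: it rebuilds the same five rows inline (the name `df` is an assumption, not part of the notebook) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents shown above
df = pd.DataFrame({
    "Age": [28, 24, 22, np.nan, 29],
    "Height": [165, np.nan, 155, 172, 168],
    "Weight": [58, 75, 50, 65, 60],
})

# Number of missing values in each column
null_counts = df.isna().sum()
print(null_counts)

# Rows that contain at least one missing value
rows_with_nulls = df[df.isna().any(axis=1)]
print(rows_with_nulls)
```

Here `isna()` marks each cell as missing or not, `sum()` counts the missing cells per column, and `any(axis=1)` flags rows with at least one missing cell.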


## 1. Removing Rows with Null Values
## 2. Filling Null Values
## 3. Imputing Missing Values

# Removing Rows with Null Values

### .dropna() method

In [68]:
# Drop every row that contains at least one null value.
# dropna() returns a new DataFrame; the original is left unchanged.
df_cleaned = df_with_nulls_csv.dropna()
print(df_cleaned)

    Age  Height  Weight
0  28.0   165.0      58
2  22.0   155.0      50
4  29.0   168.0      60
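
Dropping every incomplete row can discard more data than necessary. As a hedged sketch, `dropna()` also accepts `subset` (only check the listed columns) and `axis=1` (drop columns instead of rows). The DataFrame `df` below is a hypothetical inline copy of the data, not a name from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents shown above
df = pd.DataFrame({
    "Age": [28, 24, 22, np.nan, 29],
    "Height": [165, np.nan, 155, 172, 168],
    "Weight": [58, 75, 50, 65, 60],
})

# Drop a row only when its Height is missing (row 3, missing Age, is kept)
by_height = df.dropna(subset=["Height"])
print(by_height)

# Drop every column that contains a missing value (only Weight is complete)
cols_complete = df.dropna(axis=1)
print(cols_complete)
```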


# Filling Null Values (Mean of each column)

### .fillna() method

### .mean() method 

In [69]:
# Fill null values with the mean of each column.
# .mean() computes the mean (average) of each numeric column, skipping NaN values by default.
# .fillna() replaces missing (NaN) values in a DataFrame or Series with the values given;
# passing the Series of column means fills each column with its own mean.
df_filled = df_with_nulls_csv.fillna(df_with_nulls_csv.mean())

# Column means used for filling
print(df_with_nulls_csv.mean())
print("\n")
# Original DataFrame (unchanged, because fillna() returns a new DataFrame)
print(df_with_nulls_csv)
print("\n")
# DataFrame with the nulls replaced by the column means
print(df_filled)

Age        25.75
Height    165.00
Weight     61.60
dtype: float64


    Age  Height  Weight
0  28.0   165.0      58
1  24.0     NaN      75
2  22.0   155.0      50
3   NaN   172.0      65
4  29.0   168.0      60


     Age  Height  Weight
0  28.00   165.0      58
1  24.00   165.0      75
2  22.00   155.0      50
3  25.75   172.0      65
4  29.00   168.0      60
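
The mean is not the only possible fill value. The sketch below shows two variations, assuming the same five rows rebuilt inline as a hypothetical `df`: filling with the column median, and passing a dictionary to `fillna()` so that each column gets its own fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents shown above
df = pd.DataFrame({
    "Age": [28, 24, 22, np.nan, 29],
    "Height": [165, np.nan, 155, 172, 168],
    "Weight": [58, 75, 50, 65, 60],
})

# Fill each column with its own median (less sensitive to outliers than the mean)
df_median = df.fillna(df.median())
print(df_median)

# Fill each column with a different statistic by passing a dict of column -> value
df_custom = df.fillna({"Age": df["Age"].median(), "Height": df["Height"].mean()})
print(df_custom)
```

Which statistic suits a column depends on the data; the choices here are only for illustration.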


### The .describe() method generates descriptive statistics for the numeric columns of a DataFrame (or for a Series).

In [70]:
print(df_filled.describe())

             Age      Height   Weight
count   5.000000    5.000000   5.0000
mean   25.750000  165.000000  61.6000
std     2.861381    6.284903   9.2358
min    22.000000  155.000000  50.0000
25%    24.000000  165.000000  58.0000
50%    25.750000  165.000000  60.0000
75%    28.000000  168.000000  65.0000
max    29.000000  172.000000  75.0000


### 25th Percentile (Q1 or First Quartile):
### The 25th percentile is the value below which 25% of the data falls.
### 50th Percentile (Median):
### The 50th percentile is the middle value of the dataset when it is sorted in ascending order.
### 75th Percentile (Q3 or Third Quartile):
### The 75th percentile is the value below which 75% of the data falls. 
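
The same quartiles can be computed directly with `.quantile()`. This is a small sketch using the mean-filled Age values from the table above, typed in by hand as a hypothetical Series named `age_filled`. By default pandas interpolates linearly between data points, so a percentile need not be one of the observed values in general.

```python
import pandas as pd

# Age column after mean filling, copied from the output above
age_filled = pd.Series([28.0, 24.0, 22.0, 25.75, 29.0], name="Age")

# 25th, 50th (median) and 75th percentiles
quartiles = age_filled.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 50th percentile is the same as the median
print(age_filled.median())
```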

# Imputing Missing Values (Using Statistical Tools)

### Another common approach is to impute missing values with more sophisticated methods, such as using the 
### k-nearest neighbors algorithm or regression models to predict the missing values.

### KNNImputer class
### .fit_transform method
### DataFrame constructor

In [71]:
from sklearn.impute import KNNImputer

# The KNNImputer is a class from scikit-learn that provides a method for imputing missing values 
# using the k-nearest neighbors algorithm.
# Initialize the KNNImputer with k=2 (use 2 nearest neighbors for imputation)

knn_imputer = KNNImputer(n_neighbors=2)

# n_neighbors specifies how many nearest neighbors are used to impute each missing value.
# With n_neighbors=2 and the default uniform weighting, each missing entry is replaced by
# the average of that column's values in the two most similar rows that have a value there.

print(knn_imputer.fit_transform(df_with_nulls_csv))
print("\n")
print(f"{df_with_nulls_csv}")
print("\n")

# Impute missing values using KNN imputation
df_imputed = pd.DataFrame(knn_imputer.fit_transform(df_with_nulls_csv), columns=df_with_nulls_csv.columns)


# The line above does the following:

# knn_imputer.fit_transform(df_with_nulls_csv): fits the imputer on the DataFrame
# df_with_nulls_csv, imputes the missing values using the KNN algorithm, and returns
# a new NumPy array containing the imputed values.

# pd.DataFrame(...): wraps the resulting NumPy array in a new pandas DataFrame, df_imputed.

# columns=df_with_nulls_csv.columns: gives df_imputed the same column labels as the
# original DataFrame, because the NumPy array itself carries no column names.
   

print(df_imputed.describe())

[[ 28.  165.   58. ]
 [ 24.  170.   75. ]
 [ 22.  155.   50. ]
 [ 28.5 172.   65. ]
 [ 29.  168.   60. ]]


    Age  Height  Weight
0  28.0   165.0      58
1  24.0     NaN      75
2  22.0   155.0      50
3   NaN   172.0      65
4  29.0   168.0      60


             Age      Height   Weight
count   5.000000    5.000000   5.0000
mean   26.300000  166.000000  61.6000
std     3.114482    6.670832   9.2358
min    22.000000  155.000000  50.0000
25%    24.000000  165.000000  58.0000
50%    28.000000  168.000000  60.0000
75%    28.500000  170.000000  65.0000
max    29.000000  172.000000  75.0000
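
To see where the imputed values 170 and 28.5 come from, the calculation can be reproduced by hand. The sketch below is not scikit-learn's implementation; it is a simplified re-creation, assuming the default settings of `KNNImputer` (the `nan_euclidean` distance, which skips missing coordinates and rescales by the share of coordinates present, and uniform weighting of the neighbors). The helper names `nan_euclidean` and `knn_impute_cell` are hypothetical.

```python
import numpy as np

# Same five rows as the DataFrame above: Age, Height, Weight
data = np.array([
    [28.0, 165.0, 58.0],
    [24.0, np.nan, 75.0],
    [22.0, 155.0, 50.0],
    [np.nan, 172.0, 65.0],
    [29.0, 168.0, 60.0],
])

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled for the missing ones."""
    both = ~np.isnan(a) & ~np.isnan(b)
    scale = a.size / both.sum()  # assumes the rows share at least one coordinate
    return np.sqrt(scale * np.sum((a[both] - b[both]) ** 2))

def knn_impute_cell(data, row, col, k=2):
    """Average the column values of the k nearest rows that have a value in that column."""
    donors = [i for i in range(len(data)) if i != row and not np.isnan(data[i, col])]
    donors.sort(key=lambda i: nan_euclidean(data[row], data[i]))
    return data[donors[:k], col].mean()

height_row1 = knn_impute_cell(data, row=1, col=1)  # nearest rows are 3 and 4
age_row3 = knn_impute_cell(data, row=3, col=0)     # nearest rows are 4 and 0
print(height_row1, age_row3)
```

For row 1, the two closest rows are 3 and 4, whose heights 172 and 168 average to 170. For row 3, the two closest rows are 4 and 0, whose ages 29 and 28 average to 28.5. Both match the array printed by `fit_transform` above.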
