# <font color=green> 1 - Dealing With Missing Data </font>
***
Let’s look at the different methods you can use to deal with missing data.

## <font color=green> 1.1 - Creating a Table </font>
***
Let’s get started by creating a DataFrame that contains some missing values.

In [34]:
import pandas as pd
import numpy as np

# Existing DataFrame
data = {
    '#Number': ['M1001', 'M1001', 'M1001', 'M1001', 'M1002', 'M1002', 'M1002', 'M1002', 'M1002', 'M1003'],
    'State': ['SP', 'RJ', 'MG', 'ES', 'BA', 'CE', 'PE', 'PI', 'RN', 'AM'],
    'Movie': ['Gremlins', 'Gremlins', 'Gremlins', 'Gremlins', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', np.nan],
    'Score': [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, np.nan],
    'Year': [1985.0, 1985.0, 1985.0, 1985.0, 2010.0, 2010.0, 2010.0, 2010.0, 2010.0, np.nan]
}

df = pd.DataFrame(data)
df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
| 9 | M1003 | AM | NaN | NaN | NaN |


Let's check whether any row contains a NaN value.

In [35]:
# isnull() and isna() are aliases: both return a boolean mask
# that is True wherever a value is missing
nan_values = df.isnull()
nan_values_alt = df.isna()

# Print the resulting DataFrames (the two outputs are identical)
print(nan_values)
print(nan_values_alt)

   #Number  State  Movie  Score   Year
0    False  False  False  False  False
1    False  False  False  False  False
2    False  False  False  False  False
3    False  False  False  False  False
4    False  False  False  False  False
5    False  False  False  False  False
6    False  False  False  False  False
7    False  False  False  False  False
8    False  False  False  False  False
9    False  False   True   True   True
   #Number  State  Movie  Score   Year
0    False  False  False  False  False
1    False  False  False  False  False
2    False  False  False  False  False
3    False  False  False  False  False
4    False  False  False  False  False
5    False  False  False  False  False
6    False  False  False  False  False
7    False  False  False  False  False
8    False  False  False  False  False
9    False  False   True   True   True
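The full boolean mask gets hard to read as a table grows. A minimal sketch of a more compact summary, assuming a small stand-in frame named `sample` (hypothetical, not part of the notebook above) with the same missing-value pattern as the last row:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame: the second row mimics row 9 above
sample = pd.DataFrame({
    'State': ['SP', 'AM'],
    'Movie': ['Gremlins', np.nan],
    'Score': [5.0, np.nan],
})

# isna() yields booleans; summing them counts the missing values per column
missing_per_column = sample.isna().sum()

# Reducing the mask twice answers "is anything missing at all?"
has_missing = bool(sample.isna().any().any())

print(missing_per_column)
print(has_missing)
```

The same two calls should work unchanged on `df`, where they would report one missing value each in `Movie`, `Score`, and `Year`.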


***
### <font color=red>Which rows have NaN values?</font>
***

In [36]:
# Check for NaN values using isnull()
nan_values = df.isnull()

# Filter rows with NaN values
filtered_df = df[nan_values.any(axis=1)]
filtered_df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 9 | M1003 | AM | NaN | NaN | NaN |
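Beyond filtering the rows, it can help to know which columns are missing in each of them. The sketch below is one possible way to do that; `sample`, `rows_with_nan`, and `missing_cols` are illustrative names, not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; only the last row has missing values
sample = pd.DataFrame({
    'State': ['SP', 'RJ', 'AM'],
    'Movie': ['Gremlins', 'Gremlins', np.nan],
    'Score': [5.0, 5.0, np.nan],
})

mask = sample.isna()

# Index labels of the rows that contain at least one NaN
rows_with_nan = sample.index[mask.any(axis=1)].tolist()

# For each of those rows, the names of the columns that are missing
missing_cols = {
    idx: mask.columns[mask.loc[idx].to_numpy()].tolist()
    for idx in rows_with_nan
}

print(rows_with_nan)
print(missing_cols)
```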


## <font color=green> 1.2 - Filling the Missing Values – Imputation </font>
***
Let’s replace the NaN values with a fill value chosen for each column: zero for the numeric columns and a placeholder string for the text column.

In [27]:
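# Data Cleaning
# Keep only the rows that have an identifier in '#Number'
highlighted_df = df.dropna(subset=['#Number'])
# Fill the remaining NaN values with a per-column default;
# fillna() returns a new DataFrame, so df itself is left unchanged
highlighted_df = highlighted_df.fillna({'Year': 0, 'Movie': 'There needs definition', 'Score': 0})
highlighted_df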

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
| 9 | M1003 | AM | There needs definition | 0.0 | 0.0 |
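Constants such as 0 are only one choice of fill value. As a hedged alternative, the sketch below fills each column with a statistic computed from its own non-missing values; the `ratings` frame and its values are made up for illustration and do not come from the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with varying scores so the statistics are meaningful
ratings = pd.DataFrame({
    'Movie': ['A', 'A', 'B', np.nan],
    'Score': [4.0, 5.0, 3.0, np.nan],
})

filled = ratings.fillna({
    'Score': ratings['Score'].mean(),          # mean() skips NaN by default
    'Movie': ratings['Movie'].mode().iloc[0],  # most frequent non-missing label
})

print(filled)
```

Whether a mean, a mode, or a constant is appropriate depends on what the column represents, so treat this as a starting point rather than a recommendation.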


## <font color=green> 1.3 - Filling All NaN Cells at Once </font>
***
Let’s re-create the table and, this time, fill every NaN cell with the same value, without naming individual columns.

In [28]:
# Existing DataFrame
data = {
    '#Number': ['M1001', 'M1001', 'M1001', 'M1001', 'M1002', 'M1002', 'M1002', 'M1002', 'M1002', 'M1003'],
    'State': ['SP', 'RJ', 'MG', 'ES', 'BA', 'CE', 'PE', 'PI', 'RN', 'AM'],
    'Movie': ['Gremlins', 'Gremlins', 'Gremlins', 'Gremlins', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', np.nan],
    'Score': [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, np.nan],
    'Year': [1985.0, 1985.0, 1985.0, 1985.0, 2010.0, 2010.0, 2010.0, 2010.0, 2010.0, np.nan]
}

df = pd.DataFrame(data)
df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
| 9 | M1003 | AM | NaN | NaN | NaN |


In [29]:
# Replace every NaN value with 0, regardless of column
# (note that this also puts the number 0 into the text column 'Movie')
df.fillna(0, inplace=True)
df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
| 9 | M1003 | AM | 0 | 0.0 | 0.0 |
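A single fill value leaves the number 0 in the `Movie` column, which otherwise holds text. One way to avoid mixing types without listing the columns by hand is to pick the fill value from each column's dtype. This is a sketch under assumptions: the `movies` frame and the `'Unknown'` placeholder are hypothetical choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one text and two numeric columns
movies = pd.DataFrame({
    'Movie': ['Gremlins', np.nan],
    'Score': [5.0, np.nan],
    'Year': [1985.0, np.nan],
})

# Split the columns into numeric and non-numeric groups
numeric_cols = movies.select_dtypes(include='number').columns
text_cols = movies.columns.difference(numeric_cols)

# Build one fill value per column: 0 for numbers, a placeholder for text
fill_values = {col: 0 for col in numeric_cols}
fill_values.update({col: 'Unknown' for col in text_cols})

filled = movies.fillna(fill_values)
print(filled)
```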


## <font color=green> 1.4 - Drop rows that have NaN values </font>
***
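Instead of filling the missing values, we can drop every row that contains at least one NaN. Because the previous step modified `df` in place, we re-create the table first.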

In [30]:
# Existing DataFrame
data = {
    '#Number': ['M1001', 'M1001', 'M1001', 'M1001', 'M1002', 'M1002', 'M1002', 'M1002', 'M1002', 'M1003'],
    'State': ['SP', 'RJ', 'MG', 'ES', 'BA', 'CE', 'PE', 'PI', 'RN', 'AM'],
    'Movie': ['Gremlins', 'Gremlins', 'Gremlins', 'Gremlins', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', 'The Last Samurai', np.nan],
    'Score': [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, np.nan],
    'Year': [1985.0, 1985.0, 1985.0, 1985.0, 2010.0, 2010.0, 2010.0, 2010.0, 2010.0, np.nan]
}

df = pd.DataFrame(data)
df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
| 9 | M1003 | AM | NaN | NaN | NaN |


In [31]:
# Drop rows with NaN values, then reset the index so it runs from 0 again
df = df.dropna().reset_index(drop=True)

df

| | #Number | State | Movie | Score | Year |
|---|---|---|---|---|---|
| 0 | M1001 | SP | Gremlins | 5.0 | 1985.0 |
| 1 | M1001 | RJ | Gremlins | 5.0 | 1985.0 |
| 2 | M1001 | MG | Gremlins | 5.0 | 1985.0 |
| 3 | M1001 | ES | Gremlins | 5.0 | 1985.0 |
| 4 | M1002 | BA | The Last Samurai | 5.0 | 2010.0 |
| 5 | M1002 | CE | The Last Samurai | 5.0 | 2010.0 |
| 6 | M1002 | PE | The Last Samurai | 5.0 | 2010.0 |
| 7 | M1002 | PI | The Last Samurai | 5.0 | 2010.0 |
| 8 | M1002 | RN | The Last Samurai | 5.0 | 2010.0 |
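By default, `dropna()` removes a row as soon as one value is missing. The sketch below shows how the `how`, `subset`, and `thresh` parameters can make that rule stricter or looser; the `records` frame is a made-up example, not data from the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 lacks a title, row 2 is empty
records = pd.DataFrame({
    'Movie': ['Gremlins', np.nan, np.nan],
    'Score': [5.0, 4.0, np.nan],
    'Year': [1985.0, 2010.0, np.nan],
})

# Default (how='any'): drop a row if any value is missing
any_missing = records.dropna()

# how='all': drop a row only if every value is missing
all_missing = records.dropna(how='all')

# subset: look for missing values only in the listed columns
need_movie = records.dropna(subset=['Movie'])

# thresh: keep rows that have at least this many non-missing values
at_least_two = records.dropna(thresh=2)

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(need_movie.index.tolist())
print(at_least_two.index.tolist())
```

Each call returns a new DataFrame and leaves `records` untouched, so the variants can be compared side by side before committing to one.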
