In [1]:
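# Import libraries
import pandas as pd
from pathlib import Path

# Build the CSV path with pathlib so it works on any operating system
csv_path = Path().resolve().parent / "4. DataFrame" / "sample-data" / "titanic-dataset.csv"
df = pd.read_csv(csv_path)
# Count the missing values in each column
df.isna().sum()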

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

##### Treating missing values is an important step in data science because missing values can reduce the accuracy of statistical analyses and machine learning models. There are several methods for treating missing values, including:

##### 1. **Deletion:** This involves removing the rows or columns that contain missing values. The method is simple, but it discards information and can bias the remaining data when the values are not missing at random.

![Alt text](image.png)

In [None]:
# axis=1 drops every column that contains a missing value (here Age, Cabin and Embarked)
data_without_missing_values = df.dropna(axis=1)
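
To see how much information each deletion strategy discards, the sketch below compares dropping rows with dropping columns. It uses a small made-up frame (`toy`, with Titanic-like column names) rather than the real dataset, so the shapes it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with Titanic-like columns (not the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# axis=0 (the default) drops every row that contains a missing value
rows_dropped = toy.dropna(axis=0)
# axis=1 drops every column that contains a missing value
cols_dropped = toy.dropna(axis=1)

print("original:", toy.shape)
print("after dropping rows:", rows_dropped.shape)
print("after dropping columns:", cols_dropped.shape)
```

In this toy frame only one row is complete and only `Fare` has no gaps, so either choice removes most of the data. Checking the resulting shape before committing to deletion is a cheap safeguard.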

##### 2. **Imputation:** This involves replacing missing values with estimated values based on the available data. There are several methods for imputing missing values, including mean imputation, median imputation, mode imputation, and regression imputation.
##### scikit-learn's `SimpleImputer` is a univariate imputer for completing missing values with simple strategies.
##### It replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

In [None]:
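from sklearn.impute import SimpleImputer
# The default strategy is "mean": each NaN is replaced by the column mean
my_imputer = SimpleImputer()
# fit_transform expects 2-D input (hence df[["Age"]]) and returns a NumPy array
data_with_imputed_values = my_imputer.fit_transform(df[["Age"]])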
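
The strategies named above can give noticeably different fill values on skewed data. The following is a minimal sketch that compares three of them on a made-up `ages` column (an assumption for illustration, not the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ages with two missing entries (not the real dataset)
ages = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 30.0, 70.0, np.nan]})

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform returns a 2-D NumPy array, so flatten it back to one column
    filled[strategy] = imputer.fit_transform(ages).ravel()

# The value written into the first missing slot (index 2) under each strategy
for strategy, column in filled.items():
    print(strategy, column[2])
```

Here the single large value (70) pulls the mean up to 37.5, while the median and the most frequent value both stay at 30. This is why the median is often preferred for skewed columns.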

##### 3. **Prediction:** This involves training a machine learning model on the rows where a value is present and using it to predict the value where it is missing. This method can be more accurate than simple statistical imputation, but it requires more computational resources and is more complex to implement.
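
One possible way to implement prediction-based treatment is sketched below: fit a regression model on the rows where the target is known, then predict it for the rows where it is missing. The toy frame, the choice of `Pclass` and `Fare` as features, and the use of `LinearRegression` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the real dataset): Age is missing in two rows
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 1, 3],
    "Fare": [80.0, 75.0, 30.0, 26.0, 8.0, 7.5, 90.0, 9.0],
    "Age": [45.0, 40.0, 30.0, np.nan, 22.0, 20.0, np.nan, 24.0],
})
features = ["Pclass", "Fare"]  # assumed predictors, chosen for illustration

# Split rows by whether the target column is known
known = toy[toy["Age"].notna()]
unknown = toy[toy["Age"].isna()]

# Fit on the complete rows, then predict the missing ages
model = LinearRegression().fit(known[features], known["Age"])
predicted = model.predict(unknown[features])

# Write the predictions back into a copy, leaving the original untouched
imputed = toy.copy()
imputed.loc[unknown.index, "Age"] = predicted
print(imputed)
```

scikit-learn also provides `KNNImputer` and the experimental `IterativeImputer`, which wrap the same idea in the imputer interface. Whether a model-based fill actually improves on the mean or median depends on how well the features explain the missing column.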


##### The choice of missing value treatment method depends on the nature of the data and the research question. It is important to carefully consider the implications of each method and to evaluate the impact of missing value treatment on the results of the analysis.
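
A simple way to evaluate that impact is to compare summary statistics before and after treatment. The sketch below does this for mean imputation on a made-up `age` series (an assumption, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with missing entries (not the real dataset)
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Mean imputation: every gap receives the same value
age_filled = age.fillna(age.mean())

# pandas skips NaN by default, so both summaries are directly comparable
summary = pd.DataFrame({
    "before": age.describe(),
    "after_mean_fill": age_filled.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

The mean is unchanged, but the standard deviation shrinks because every filled value sits exactly at the centre of the distribution. A model trained on the filled column would therefore see less spread than the real data has.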

##### In Python, the `pandas` library provides several functions for handling missing values, including `dropna()` for deleting rows or columns with missing values, `fillna()` for imputing missing values, and `interpolate()` for interpolating missing values based on the available data.

In [None]:
# Drop all rows that have any NaN values
df.dropna()
# Drop a row only if ALL of its columns are NaN
df.dropna(how='all')
# Drop a row if it does not have at least two non-NaN values
df.dropna(thresh=2)
# Drop a row only if it has NaN in a specific column
df.dropna(subset=['Age'])

# Let's learn how to fill NA values
# Fill NA values with 0
df.fillna(0)
# Forward fill: copy the last valid value from the row above
df.ffill()  # replaces the deprecated df.fillna(method='ffill')

# Fill each column with its own value given in a dictionary
# (illustrative fill values for the Titanic columns that contain NaN)
values = {'Age': 0, 'Cabin': 'Unknown', 'Embarked': 'Unknown'}
df.fillna(value=values)
# limit=1 fills at most one NaN per column
df.fillna(value=values, limit=1)
# Fill numeric columns with the column mean
df.fillna(df.mean(numeric_only=True))
# Fill numeric columns with the column median
df.fillna(df.median(numeric_only=True))
# Fill each column with its mode (mode() returns a DataFrame, so take its first row)
df.fillna(df.mode().iloc[0])
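
The cell above covers `dropna()` and `fillna()`. As a minimal sketch of the third function, `interpolate()`, the example below fills gaps in a made-up ordered series named `readings` (an assumption; the Titanic rows have no natural order, so interpolation is unlikely to be meaningful for a column such as `Age`).

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced readings with gaps (not the real dataset)
readings = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0, np.nan, 70.0])

# Linear interpolation fills each gap along a straight line between its neighbours
filled = readings.interpolate(method="linear")
print(filled.tolist())
```

The two gaps between 10 and 40 become 20 and 30, and the gap between 50 and 70 becomes 60. Interpolation is usually reserved for ordered data such as time series, where neighbouring rows carry information about each other.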