# Handling Missing Data
The Titanic dataset contains several columns with missing data, making it a good example for learning how to handle missing values. It includes information about passengers, such as age (Age), sex (Sex), ticket fare (Fare), passenger class (Pclass), and port of embarkation (Embarked).

In [1]:
# loading dataset
import pandas as pd
import numpy as np

titanic_df = pd.read_csv(r"C:\Users\Fuat\Learning Machine Learning/Titanic-Dataset.csv")

In [2]:
titanic_df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Detecting Missing Values

In [3]:
print(titanic_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


From this output:

The Age column has 177 missing values.

The Embarked column has 2 missing values.

The Cabin column has 687 missing values, so most of its entries are empty.
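Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are hypothetical, not the Titanic file) that turns the missing-value counts into percentages:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() yields booleans; their column mean is the fraction of missing values
missing_pct = toy_df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.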

## Methods for Handling Missing Data

### Dropping Missing Data
If the missing data is minimal, we can drop the rows containing missing values.

In [4]:
# Drop rows with missing values
titanic_df_dropped = titanic_df.dropna()

# Check for missing values again
print(titanic_df_dropped.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


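Because dropna() removes every row that has a missing value in any column, this method is not suitable when a column has a large number of missing values (e.g., the Cabin column, whose 687 missing values alone cause at least 687 rows to be dropped), as it results in significant data loss.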
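One way to limit the loss is to restrict which columns or how many gaps trigger a drop. The sketch below uses the `subset` and `thresh` parameters of `dropna()` on a made-up DataFrame (`toy_df` is hypothetical); which columns to list is an assumption that depends on the analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row has at least one gap
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Drop a row only if one of the listed columns is missing
dropped_subset = toy_df.dropna(subset=["Age", "Embarked"])

# Keep only rows that have at least 2 non-missing values
dropped_thresh = toy_df.dropna(thresh=2)

# A plain dropna() would remove every row of this toy frame
print(len(toy_df.dropna()), len(dropped_subset), len(dropped_thresh))
```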

### Filling Missing Data with Mean or Median
For numerical columns, we can fill missing values with the column's mean or median.

In [5]:
# Fill missing values in the 'Age' column with the median
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

# Fill missing values in the 'Fare' column with the mean
# ('Fare' has no missing values in this dataset, so this line changes nothing)
titanic_df['Fare'] = titanic_df['Fare'].fillna(titanic_df['Fare'].mean())

# Check for missing values again
print(titanic_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
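A single global median ignores differences between groups of passengers. As a hedged variation, the sketch below fills each gap with the median of the row's own group; grouping by `Pclass` is an assumption made for illustration, and `toy_df` is made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age per class
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each class, broadcast back to the original row order
class_median = toy_df.groupby("Pclass")["Age"].transform("median")

# fillna only replaces the missing entries; known ages are left untouched
toy_df["Age_filled"] = toy_df["Age"].fillna(class_median)
print(toy_df)
```

In this toy frame the global median would be 32 for both gaps, whereas the group-wise version fills 45 for class 1 and 22 for class 3.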


### Filling Categorical Data with Mode
For categorical columns, we can fill missing values with the mode (most frequent value).

In [8]:
# Fill missing values in the 'Embarked' column with the mode
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# Check for missing values again
print(titanic_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
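The `[0]` after `mode()` is needed because `mode()` returns a Series, which can hold several values when there is a tie. A small sketch on a made-up Series (`ports` is hypothetical) shows what happens in that case:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes where "C" and "S" are tied for most frequent
ports = pd.Series(["S", "C", "S", "C", "Q", np.nan], name="Embarked")

modes = ports.mode()       # missing values are ignored by default
fill_value = modes[0]      # modes are sorted, so a tie resolves to the first one
filled = ports.fillna(fill_value)

print(list(modes), fill_value)
```

With a tie, the choice of fill value is arbitrary, so it is worth checking `value_counts()` before relying on it.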


### Dropping Columns with Too Many Missing Values
If a column has too many missing values (e.g., the Cabin column), we can drop the entire column.


In [9]:
# Drop the 'Cabin' column
titanic_df = titanic_df.drop(columns=['Cabin'])

# Check for missing values again
print(titanic_df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
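Dropping a column discards the information about which rows had a value at all. The sketch below, on made-up data, keeps that information as a 0/1 flag and then drops every column whose missing fraction exceeds a cutoff; the `HasCabin` name and the 0.5 threshold are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where 'Cabin' is mostly empty
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Record whether a cabin was known before the column is removed
toy_df["HasCabin"] = toy_df["Cabin"].notnull().astype(int)

# Drop every column whose missing fraction is above the (assumed) threshold
threshold = 0.5
missing_fraction = toy_df.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced_df = toy_df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced_df.columns.tolist())
```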


## Advanced Methods: Imputation Using Prediction
We can predict missing values based on other columns using machine learning techniques, such as K-Nearest Neighbors (KNN) or Random Forest.

### K-Nearest Neighbors (KNN) Imputation
KNN imputation works by finding the k most similar rows (neighbors) to the row with the missing value and filling the gap with the average of those neighbors' values. scikit-learn's KNNImputer, used below, works on numerical columns only.

In [16]:
titanic_df = pd.read_csv(r"C:\Users\Fuat\Learning Machine Learning/Titanic-Dataset.csv")
print(titanic_df.isnull().sum())
print("********")
from sklearn.impute import KNNImputer

# Select numerical columns for KNN imputation
numerical_columns = ['Age', 'Fare', 'SibSp', 'Parch']

# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=5)  # Use 5 nearest neighbors

# Apply KNN imputation
titanic_df[numerical_columns] = imputer.fit_transform(titanic_df[numerical_columns])

# Check for missing values
print(titanic_df[numerical_columns].isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
********
Age      0
Fare     0
SibSp    0
Parch    0
dtype: int64
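To see the averaging step in isolation, here is a tiny worked example with `KNNImputer` on a made-up 4x3 array (the numbers are illustrative only). With `n_neighbors=2`, each gap is filled with the mean of that column over the two nearest rows that have a value there:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with two gaps
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Row 0, column 2 becomes (3 + 5) / 2 = 4.0
# Row 2, column 0 becomes (3 + 8) / 2 = 5.5
print(X_filled)
```

Because neighbors are chosen by distance, columns on very different scales (such as Fare versus SibSp) can dominate the result; scaling the features first is a common precaution.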


### Random Forest Imputation
Random Forest is a powerful machine learning algorithm that can be used to predict missing values by treating the column with missing values as the target variable and the other columns as features.

In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Load dataset
titanic_df = pd.read_csv(r"C:\Users\Fuat\Learning Machine Learning/Titanic-Dataset.csv")

print("Missing values before imputation:")
print(titanic_df.isnull().sum())
print("********")

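# Encode categorical variables
# Note: because of astype(str), the 2 missing 'Embarked' values are encoded as a
# category of their own, which is why 'Embarked' shows 0 missing values at the end
label_encoder = LabelEncoder()
categorical_columns = ['Sex', 'Embarked', 'Pclass']
for col in categorical_columns:
    titanic_df[col] = label_encoder.fit_transform(titanic_df[col].astype(str))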

# Drop irrelevant columns
titanic_df = titanic_df.drop(columns=['Name', 'Ticket', 'Cabin'], errors='ignore')


# Function for Random Forest Imputation
def random_forest_impute(df, target_column):
    # Check if the column has missing values
    if df[target_column].isnull().sum() == 0:
        print(f"No missing values in {target_column}. Skipping imputation.")
        return df

    # Separate rows with and without missing values
    df_missing = df[df[target_column].isnull()]
    df_not_missing = df[~df[target_column].isnull()]

    # Prepare features (X) and target (y)
    X = df_not_missing.drop(columns=[target_column])
    y = df_not_missing[target_column]

    # Ensure only numeric data is used
    X = X.select_dtypes(include=['number'])
    X_missing = df_missing.drop(columns=[target_column]).select_dtypes(include=['number'])

    # Check if data is sufficient for training
    if X.empty or y.empty or X_missing.empty:
        print(f"Insufficient data for training the model for {target_column}.")
        return df

    # Train the Random Forest model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Predict missing values
    predicted_values = model.predict(X_missing)
    df.loc[df[target_column].isnull(), target_column] = predicted_values

    return df


# Apply Random Forest Imputation
if 'Age' in titanic_df.columns:
    titanic_df = random_forest_impute(titanic_df, 'Age')

if 'Fare' in titanic_df.columns:
    titanic_df = random_forest_impute(titanic_df, 'Fare')

# Check for missing values after imputation
print("Missing values after imputation:")
print(titanic_df.isnull().sum())


Missing values before imputation:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
********
No missing values in Fare. Skipping imputation.
Missing values after imputation:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
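A missing-value count of zero does not say whether the predicted values are any good. One way to check, sketched below under stated assumptions, is to hide some known values, impute them, and compare against the truth. The data here is synthetic (`feature_a`, `feature_b`, and `target` are hypothetical names), and the target is built to depend strongly on one feature, so the gap between the two methods will look larger than on real data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
# Synthetic data: 'target' depends strongly on 'feature_a' (assumption of this demo)
demo = pd.DataFrame({
    "feature_a": rng.uniform(0, 10, n),
    "feature_b": rng.normal(0, 1, n),
})
demo["target"] = 3 * demo["feature_a"] + rng.normal(0, 1, n)

# Hide 20% of the known target values
hidden_idx = demo.sample(frac=0.2, random_state=0).index
masked = demo.copy()
masked.loc[hidden_idx, "target"] = np.nan

features = ["feature_a", "feature_b"]
known = masked[masked["target"].notnull()]
unknown = masked[masked["target"].isnull()]
# True values, taken in the same row order as 'unknown'
truth = demo.loc[unknown.index, "target"].to_numpy()

# Random forest imputation: train on known rows, predict the hidden ones
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(known[features], known["target"])
rf_pred = model.predict(unknown[features])

# Baseline: fill every hidden value with the median of the known values
median_pred = np.full(len(unknown), known["target"].median())

mae_rf = np.mean(np.abs(truth - rf_pred))
mae_median = np.mean(np.abs(truth - median_pred))
print(f"MAE random forest: {mae_rf:.2f}, MAE median fill: {mae_median:.2f}")
```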


### Handling Categorical Data with Random Forest
For categorical columns, you can use a RandomForestClassifier instead of a regressor.

In [39]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
titanic_df = pd.read_csv(r"C:\Users\Fuat\Learning Machine Learning/Titanic-Dataset.csv")

print("Missing values before imputation:")
print(titanic_df.isnull().sum())
print("********") 


def random_forest_impute_categorical(df, target_column):
    # Check if the target column exists
    if target_column not in df.columns:
        raise KeyError(f"Column '{target_column}' not found in DataFrame.")
    
    # Check for missing values in the target column
    if df[target_column].isnull().sum() == 0:
        print(f"No missing values in '{target_column}' to impute.")
        return df
    
    # Extract the target column and drop it from the features
    y = df[target_column]
    df_features = df.drop(columns=[target_column])
    
    # Drop free-text columns that would produce a huge number of dummy columns (customize this list as needed)
    df_features = df_features.drop(columns=['Name', 'Ticket', 'Cabin'], errors='ignore')
    
    # One-hot encode categorical features
    df_encoded = pd.get_dummies(df_features, drop_first=True)
    
    # Combine encoded features with the target column
    df_combined = pd.concat([df_encoded, y], axis=1)
    
    # Split into missing and non-missing data
    df_missing = df_combined[df_combined[target_column].isnull()]
    df_not_missing = df_combined[~df_combined[target_column].isnull()]
    
    # Check if there are samples with missing values to impute
    if df_missing.empty:
        print(f"No rows with missing '{target_column}' to impute.")
        return df
    
    # Features and target
    X = df_not_missing.drop(columns=[target_column])
    y = df_not_missing[target_column]
    
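    # Train a Random Forest model
    # Note: if the features still contain NaN (e.g. 'Age'), this requires scikit-learn 1.4 or later;
    # older versions raise an error, so impute the numerical columns first in that case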
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    
    # Predict missing values
    X_missing = df_missing.drop(columns=[target_column])
    predicted_values = model.predict(X_missing)
    
    # Fill the missing values in the ORIGINAL DataFrame (not the encoded one)
    df.loc[df[target_column].isnull(), target_column] = predicted_values
    
    return df

# Apply the function to 'Embarked' (ensure case matches your DataFrame)
titanic_df = random_forest_impute_categorical(titanic_df, 'Embarked')  # Use 'embarked' if column is lowercase

# Check for missing values
print("Missing Value in Embarked Column: ",  titanic_df['Embarked'].isnull().sum())



Missing values before imputation:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
********
Missing Value in Embarked Column:  0
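Since a classifier is doing the filling, its class probabilities give a rough idea of how certain each imputed label is. The following sketch uses `predict_proba` on a made-up DataFrame (`toy_df` and its values are hypothetical, and the features are already numeric to keep it short):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the last two rows have no 'Embarked' value
toy_df = pd.DataFrame({
    "Fare": [7.0, 8.0, 7.5, 80.0, 85.0, 90.0, 8.2, 82.0],
    "Pclass": [3, 3, 3, 1, 1, 1, 3, 1],
    "Embarked": ["S", "S", "S", "C", "C", "C", None, None],
})
features = ["Fare", "Pclass"]
known = toy_df[toy_df["Embarked"].notnull()]
unknown = toy_df[toy_df["Embarked"].isnull()]

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(known[features], known["Embarked"])

# One row per missing value, one column per class in model.classes_
proba = pd.DataFrame(
    model.predict_proba(unknown[features]),
    columns=model.classes_,
    index=unknown.index,
)
print(proba)
```

Rows where no class clearly dominates are candidates for manual review or for leaving the value missing.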


### Combining KNN and Random Forest
You can combine both methods to handle different types of missing data:

Use KNN for numerical columns.

Use Random Forest for categorical columns.

In [36]:
# Load dataset
titanic_df = pd.read_csv(r"C:\Users\Fuat\Learning Machine Learning/Titanic-Dataset.csv")

print("Missing values before imputation:")
print(titanic_df.isnull().sum())
print("********")

# Apply KNN imputation for numerical columns
numerical_columns = ['Age', 'Fare', 'SibSp', 'Parch']
imputer = KNNImputer(n_neighbors=5)
titanic_df[numerical_columns] = imputer.fit_transform(titanic_df[numerical_columns])

# Apply Random Forest imputation for categorical columns
categorical_columns = ['Embarked']
for column in categorical_columns:
    titanic_df = random_forest_impute_categorical(titanic_df, column)

# Check for missing values
print(titanic_df.isnull().sum())

Missing values before imputation:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
********
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
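The same "different imputer per column type" idea can also be expressed with scikit-learn's `ColumnTransformer`. The sketch below is a simplified variation, not the approach used above: it keeps `KNNImputer` for the numerical columns but swaps the Random Forest step for `SimpleImputer(strategy="most_frequent")` on the categorical column. `toy_df` is made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with one numerical and one categorical gap
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", np.nan, "S"],
})
numeric_cols = ["Age", "Fare"]
categorical_cols = ["Embarked"]

# Each imputer only sees the columns assigned to it
combined = ColumnTransformer([
    ("num", KNNImputer(n_neighbors=2), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# The transformer returns a plain array, so the column names are restored by hand
filled = pd.DataFrame(
    combined.fit_transform(toy_df),
    columns=numeric_cols + categorical_cols,
)
print(filled)
```

Wrapping the imputers this way makes it straightforward to fit them on training data only and reuse them on new data.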
