In [None]:
# to read the dataset into a dataframe and perform operations on it
import pandas as pd
# to perform basic array operations
import numpy as np
# for plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns
# for Q-Q plots
import scipy.stats as stats

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')

In [None]:
data.head()

# Delete Rows with Missing Values:


Missing values can be handled by deleting the rows or columns that contain nulls. A common rule of thumb is to drop an entire column when more than half of its rows are null. Rows that have a null in one or more columns can also be dropped.

Pros:
- A model trained on data with all missing values removed can be robust, since it only sees fully observed records.

Cons:
- Loss of a lot of information.
- Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.
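
The deletion rules described above can be sketched as follows. This is a minimal illustration on a small made-up frame (the column names mimic the Titanic data, but the values are assumptions, not taken from the dataset), and the 50% threshold is just the rule of thumb mentioned in the text.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the Titanic data (illustrative values only)
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, "E46"],
})

# Fraction of missing values in each column
missing_ratio = df.isnull().mean()

# Drop the columns where more than half of the rows are missing
cols_to_drop = missing_ratio[missing_ratio > 0.5].index
df_cols_dropped = df.drop(columns=cols_to_drop)

# Then drop every remaining row that still has at least one missing value
df_clean = df_cols_dropped.dropna(axis=0, how="any")

print(missing_ratio)
print(df_clean.shape)
```

Dropping sparse columns first usually preserves more rows than calling `dropna()` on the raw frame, because a mostly empty column would otherwise eliminate almost every row.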

# Impute missing values with Mean/Median:

Columns with continuous numeric values can have their missing entries replaced with the mean or median (or, for discrete values, the mode) of the remaining values in the column. Unlike deletion, this method keeps every row. Replacing missing values with one of these summary statistics is a simple statistical approach to handling them.

In the example below, the missing values are replaced by the mean; the median can be used in the same way.

In [None]:
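# Fill missing ages with the column mean
data["Age"] = data["Age"].fillna(data["Age"].mean())
# Alternative: use the median instead (run one or the other; after the first fill no NaN remains)
# data["Age"] = data["Age"].fillna(data["Age"].median())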

Pros:
- Prevents the data loss caused by deleting rows or columns.
- Easy to implement and works well with a small dataset.

Cons:
- Mean and median work only with numerical variables.
- Can cause data leakage if the statistic is computed on the full dataset before the train/test split.
- Does not take the covariance between features into account.
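
One way to limit the leakage concern is to compute the fill value on the training split only and reuse it on the test split. The sketch below assumes scikit-learn's `SimpleImputer` and `train_test_split`; the data frame is made up for illustration and is not the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up numeric data with a few missing ages (illustrative values only)
df = pd.DataFrame({
    "Age":  [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.08, 11.13],
})

# Split first, so the test rows never influence the fill value
train, test = train_test_split(df, test_size=0.25, random_state=0)

# Learn the per-column median from the training rows only
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Apply the same training medians to the test rows
test_imputed = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(imputer.statistics_)
```

Switching `strategy` to `"mean"` gives the mean-based variant with the same fit-then-transform pattern.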

# Imputation method for categorical columns

When the missing values are in a categorical column (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category of their own.

Pros:
- Prevents the data loss caused by deleting rows or columns.
- Easy to implement and works well with a small dataset.
- Adding a dedicated category keeps the rows that have missing values.

Cons:
- Works only with categorical variables.
- The new category adds features when the column is encoded, which may hurt model performance.
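
Both categorical strategies can be sketched in a few lines of pandas. The frame below is made up for illustration, and the `"Unknown"` label is an arbitrary choice for the new category, not something prescribed by the text.

```python
import numpy as np
import pandas as pd

# Made-up categorical data (illustrative values only)
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", "Q", "S", np.nan],
    "Cabin":    [np.nan, "C85", np.nan, np.nan, np.nan, "E46", np.nan],
})

# Few missing values: fill with the most frequent category
most_frequent = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_frequent)

# Many missing values: treat "missing" as a category of its own
df["Cabin"] = df["Cabin"].fillna("Unknown")

print(df)
```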

# Other Imputation Methods:


Depending on the nature of the data or its type, other imputation methods may be more appropriate.
For example, for longitudinal data (repeated measurements over time), it can make sense to fill a missing value with the last valid observation. This is known as the last observation carried forward (LOCF) method.


In [None]:
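# Illustration only: Age is not longitudinal, so row order carries no meaning here.
# fillna(method='ffill') is deprecated in recent pandas versions; ffill() is the direct equivalent.
data["Age"] = data["Age"].ffill()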

# Using Algorithms that support missing values:

Not all machine learning algorithms support missing values, but some are robust to them. In principle, the k-NN algorithm can ignore a column in the distance measure when its value is missing, and naive Bayes can skip a missing feature when making a prediction. Such algorithms can be used when the dataset contains null or missing values.
However, the scikit-learn implementations of naive Bayes and k-nearest neighbors do not support missing values.
Another algorithm that can be considered is random forest, which works well on non-linear and categorical data and tends to produce good results on large datasets; whether it accepts missing values directly depends on the implementation.

Pros:
- No need to handle missing values in each column, as the algorithm handles them itself.

Cons:
- The scikit-learn versions of k-NN and naive Bayes do not accept missing values, so the choice of estimators is limited.


# Prediction of missing values:

The earlier methods do not take advantage of the correlation between the variable containing missing values and the other variables. The features that have no nulls can be used to predict the missing values.
A regression or classification model can be used for this prediction, depending on whether the feature with missing values is continuous or categorical.

In [None]:
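from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("train.csv")
data = data[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]].copy()

# Encode Sex as a number: 1 for male, 0 otherwise
data["Sex"] = (data["Sex"] == "male").astype(int)

# Rows with a missing Age are the ones to predict; the complete rows are used for training
test_data = data[data["Age"].isnull()]
train_data = data.dropna()

y_train = train_data["Age"]
X_train = train_data.drop("Age", axis=1)
X_test = test_data.drop("Age", axis=1)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the missing ages and write them back into the original frame
y_pred = model.predict(X_test)
data.loc[test_data.index, "Age"] = y_pred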

Pros:
- Often gives better estimates than the earlier methods when the other features are informative.
- Takes into account the covariance between the column with missing values and the other columns.

Cons:
- The predicted values are only a proxy for the true values.
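
For a categorical feature, the same idea applies with a classifier in place of the regressor. The sketch below is a minimal illustration on made-up data; the choice of `DecisionTreeClassifier` and of `Pclass` and `Fare` as predictors are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data with a categorical column that has missing entries (illustrative values only)
df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3, 1, 2],
    "Fare":     [71.28, 7.25, 7.92, 53.10, 13.00, 8.05, 51.86, 21.08],
    "Embarked": ["C", "S", "S", "S", np.nan, "S", "C", np.nan],
})

features = ["Pclass", "Fare"]
known = df[df["Embarked"].notnull()]    # rows used to train the classifier
unknown = df[df["Embarked"].isnull()]   # rows whose category is predicted

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(known[features], known["Embarked"])

# Write the predicted categories back into the missing positions
df.loc[unknown.index, "Embarked"] = clf.predict(unknown[features])

print(df)
```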

# Imputation using Deep Learning Library — Datawig

In [None]:
#!pip install datawig

In [2]:
import datawig

# Assumes `data` is the Titanic DataFrame loaded earlier in the notebook
df_train, df_test = datawig.utils.random_split(data)

# Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['Pclass', 'SibSp', 'Parch'],  # column(s) containing information about the column we want to impute
    output_column='Age',                         # the column we'd like to impute values for
    output_path='imputer_model'                  # stores model data and metrics
)

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return the original dataframe with predictions
imputed = imputer.predict(df_test)
