# Data Cleaning

This tutorial shows you how to clean your data so that it is useful for building an ML model. It accompanies a video at _______ which provides more context. I recommend watching that video to understand what we're doing here and, more importantly, trying some of this yourself with another data set.

**[SCRIPT]** Before we can do any proper machine learning, it is essential to prepare our data, both test and training sets, for the model to work on. First, let's load some data. I've picked Kaggle's Titanic data set, at the suggestion of ChatGPT, because if I'm going to make a video about AI I'm sure as hell going to use AI.



## Imports

Let's import some libraries we'll need for this. All will become clear.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

## Get The Data

I've uploaded data to the files section (see left menu - folder icon - 'Data' folder).

In [2]:
train_df = pd.read_csv('/content/data/train.csv')
# Assumption: the test set sits next to the training file as test.csv
test_df = pd.read_csv('/content/data/test.csv')

## Explore The Data

Take some time to have a look at the data. We'll output a head for each one, but consider also selecting particular columns, trying to see what unique values each has, what values are missing and so on.
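
As a starting point for that exploration, here is a minimal sketch. It uses a small hand-built frame (`sample_df`, a hypothetical name) that copies a few columns from the first five training rows shown below, so it runs without the CSV. In the notebook you would call the same methods on `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df: a few columns from the first five training rows.
sample_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Select particular columns.
subset = sample_df[["Sex", "Age"]]

# Unique values in a column, and how often each one appears.
embarked_values = sample_df["Embarked"].unique().tolist()
embarked_counts = sample_df["Embarked"].value_counts()

# Number of missing values in each column.
missing_per_column = sample_df.isnull().sum()

print(subset)
print(embarked_values)
print(embarked_counts)
print(missing_per_column)
```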

### Training Data

In [None]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Test Data

In [None]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Missing Values

Some data sets will have missing values. It's inevitable. Consider a table of information about houses: a column holding the size of bedroom 3 in sq ft will be empty for any house that lacks a third bedroom. We need to find ways to handle that.
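
To make the house example concrete, here is a small sketch of how such a gap looks in pandas. The table, its column names and its values are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical house data: the second house has no third bedroom.
houses = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "bedroom_3_sqft": [110.0, np.nan, 95.0],
})

# isnull() flags the missing entry; sum() counts how many there are.
print(houses["bedroom_3_sqft"].isnull().tolist())
missing_sizes = int(houses["bedroom_3_sqft"].isnull().sum())
print(missing_sizes)
```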

### Which Columns Have Missing Values?
We'll look at the data and see where there is missing information.

In [None]:
# Training data
missing_training_data = train_df.isnull().any()
# Keep only the columns flagged True, i.e. those with at least one missing value
missing_training_data_columns = missing_training_data[missing_training_data].index.tolist()
print(missing_training_data_columns)

['Age', 'Cabin', 'Embarked']
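
The cell above only covers the training set. To run the same check on both sets, one option is a small helper. This is a sketch: `columns_with_missing` is a hypothetical name, and the two tiny frames stand in for `train_df` and `test_df` so that it runs on its own.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    has_missing = df.isnull().any()
    return has_missing[has_missing].index.tolist()


# Hypothetical stand-ins for train_df and test_df (made-up gaps, not the real data).
train_sample = pd.DataFrame({
    "Age": [22.0, None],
    "Fare": [7.25, 71.2833],
    "Embarked": ["S", None],
})
test_sample = pd.DataFrame({
    "Age": [34.5, None],
    "Fare": [7.8292, None],
    "Embarked": ["Q", "S"],
})

print(columns_with_missing(train_sample))
print(columns_with_missing(test_sample))
```

In the notebook you would call `columns_with_missing(train_df)` and `columns_with_missing(test_df)` instead.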


### So we know which columns have missing data
Age and Cabin have missing values in both sets, while Embarked and Fare have missing values only in the training and test sets respectively. Now we need to know how much data is missing. This will help us decide what to do.

In [None]:
missing_count_training = train_df.isnull().sum()
training_rows = len(train_df)

print(f'Training missing items:\n{missing_count_training[missing_count_training > 0]}\nfrom {training_rows} total rows')

Training missing items:
Age         177
Cabin       687
Embarked      2
dtype: int64
from 891 total rows


### Missing Data Conclusions

So about 77% of the Cabin column is missing (687 of 891 rows), a little under 20% of the Age column (177 rows), and just two Embarked values. Does this data matter to our goal of figuring out patterns around who survived and who didn't? One could argue that elderly people and small children might be more vulnerable than other groups, and that certain cabins were located in more dangerous locations than others.
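
The percentages can be checked with a short calculation. This sketch hard-codes the counts from the output above so that it runs without the CSV; the variable names are illustrative.

```python
import pandas as pd

# Missing counts and row total copied from the output above.
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
total_rows = 891

# Share of rows with a missing value, as a percentage rounded to 2 decimals.
missing_percent = (missing_counts / total_rows * 100).round(2)
print(missing_percent)
```

With the real frame loaded, `train_df.isnull().mean() * 100` gives the same percentages for every column in one step.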