# Data Preprocessing 

"Garbage In, Garbage Out." Our model is only as good as the data we put into it.

We've learned about several supervised learning algorithms, but they only work well when they are fed decent data. In this notebook you'll learn a few tricks for taking a subpar dataset and making it more usable for machine learning.

By the end of this lesson, you will be equipped with three data preprocessing tricks:

1. Handling NA Values
2. Feature Engineering
3. Label Encoding (strings -> numbers)

## Titanic Data

We will use the Titanic dataset as an example. Let's take a look at our data.

In [1]:
import pandas as pd

passengers = pd.read_csv("titanic.csv")

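# TODO Take a look at your data. Recall .head(), .describe(), .info(), etc.
# These are methods, so call them with parentheses; a bare `passengers.describe`
# only returns the bound method. Printing the frame gives the row preview below.
print(passengers)


     PassengerId  Survived  Pclass  \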
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                 

Let's make some observations about our data.

  - *Question 1*: Which features do you think will be most useful in our model? Which will be least useful?

Useful: Fare, Age, Sex (Survived is the label we are trying to predict, not an input feature)

Least useful in raw form: Cabin, Name, Embarked
  

  - *Question 2*: Which features have NA values? How many NA values are there? 

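Running `passengers.info()` and comparing each column's non-null count to the total number of rows shows that these features have NA values: Cabin, Embarked, Age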
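
To answer the "how many" part of the question, one option is to count the missing entries per column. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not the Titanic data); the same `isna().sum()` call should work on `passengers`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks missing cells as True; sum() counts the True values per column
na_counts = toy.isna().sum()
print(na_counts)

# Keep only the names of columns that actually contain missing values
columns_with_na = na_counts[na_counts > 0].index.tolist()
print(columns_with_na)
```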



## 1. Cleaning Up NA Values

Our supervised learning algorithms typically cannot handle NA values, so we have to clean them up. We have a few options here:

1. Remove all rows with NA values
2. Approximate the missing data (for example, with the column mean)
3. Fill with 0

I will show you all three methods, and you will have to choose the one you think is best.

In [2]:
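# Option 1: drop the missing ages. Note that this drops entries from the Age
# Series only; passengers.dropna(subset=["Age"]) would drop whole rows instead.

cleaned1 = passengers["Age"].dropna()

# Option 2: approximate the missing ages with the mean of the known ages

cleaned2 = passengers["Age"].fillna(passengers["Age"].mean())

# Option 3: fill the missing ages with 0

cleaned3 = passengers["Age"].fillna(0)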


Analyze the three methods above. What are the pros and cons of each? Which do you think will work best?
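
One way to compare the options is to look at how each one changes the row count and the mean. This is a minimal sketch on made-up ages (the `ages` Series and the `summary` table are illustrative, not the Titanic values), so treat the numbers as a demonstration of the effect rather than a result.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (not the real Titanic values)
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan], name="Age")

dropped = ages.dropna()                 # Option 1: fewer entries, mean of known ages kept
mean_filled = ages.fillna(ages.mean())  # Option 2: same entry count, mean kept
zero_filled = ages.fillna(0)            # Option 3: same entry count, mean pulled toward 0

# Collect the entry count and mean of each option side by side
summary = pd.DataFrame({
    "rows": [len(dropped), len(mean_filled), len(zero_filled)],
    "mean": [dropped.mean(), mean_filled.mean(), zero_filled.mean()],
}, index=["dropna", "fillna(mean)", "fillna(0)"])
print(summary)
```

In this toy case, dropping loses two of the five entries, filling with 0 drags the mean from 30 down to 18, and filling with the mean keeps both the entry count and the mean, at the cost of shrinking the spread of the values.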

In [3]:
# TODO Use .head, .describe, .info to compare the above options. 
cleaned3.sample(10)


378    20.00
657    32.00
431     0.00
604    35.00
113    20.00
721    17.00
426    28.00
344    36.00
803     0.42
190    32.00
Name: Age, dtype: float64

Once you have your answer, update `passengers` below so that all NA values in the Age column are handled.

In [4]:
# TODO handle NA values in the Age column
passengers["Age"] = cleaned2

In [5]:
# check your answer
assert passengers["Age"].isna().sum() == 0
print("You've successfully removed all NA values from the Age column")


# The original dataset has 891 rows; fewer means rows were dropped
if len(passengers) < 891:
    print("You might have lost some important data... ")
else:
    print("You also preserved important data")

You've successfully removed all NA values from the Age column
You also preserved important data


## 2. Feature Engineering

What if we could create a brand new feature? That's the idea behind feature engineering. 

Feature engineering is a powerful tool in machine learning: we transform raw data into new features that expose useful patterns more directly to the model, which can improve its performance.
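
As a quick illustration of the idea, here is a minimal sketch that combines two existing columns into a new one. It uses made-up rows, and it assumes a `Parch` (parents/children aboard) column alongside `SibSp`; the `FamilySize` and `IsAlone` names are hypothetical and not part of this notebook's exercises.

```python
import pandas as pd

# Hypothetical rows; SibSp = siblings/spouses aboard, Parch = parents/children aboard (assumed column)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# New feature: the passenger plus everyone travelling with them
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# A second derived feature: flag passengers travelling alone
toy["IsAlone"] = toy["FamilySize"] == 1
print(toy)
```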

### Titanic Titles

In [6]:
passengers["Name"].sample(20)

126                             McMahon, Mr. Martin
39                      Nicola-Yarred, Miss. Jamila
194       Brown, Mrs. James Joseph (Margaret Tobin)
127                       Madsen, Mr. Fridtjof Arne
325                        Young, Miss. Marie Grice
821                               Lulic, Mr. Nikola
691                              Karun, Miss. Manca
83                          Carrau, Mr. Francisco M
727                        Mannion, Miss. Margareth
674                      Watson, Mr. Ennis Hastings
760                              Garfirth, Mr. John
193                      Navratil, Master. Michel M
87                    Slocovski, Mr. Selman Francis
233                  Asplund, Miss. Lillian Gertrud
363                                 Asim, Mr. Adola
273                           Natsch, Mr. Charles H
146    Andersson, Mr. August Edvard ("Wennerstrom")
373                             Ringhini, Mr. Sante
800                            Ponesell, Mr. Martin
807         

Take a look at the "Name" feature in our Titanic dataset. It's not super usable at the moment... 

Let's try to create a brand new feature that makes this data more useful for machine learning.

In [7]:
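# Capture the run of letters right before a period, e.g. "Mr" in "Braund, Mr. Owen Harris"
pattern = r"([A-Za-z]+)\."
# expand=False returns a Series for a single capture group instead of a one-column DataFrame
titles = passengers["Name"].str.extract(pattern, expand=False)
passengers["Title"] = titles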

In [8]:
# Check out your new feature
passengers["Title"].head()

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
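
To see what the regular expression is doing, here is a small self-contained sketch that applies the same pattern to a few names from the samples above. The `names` Series is built by hand for illustration; on the real data you would keep using `passengers["Name"]`.

```python
import pandas as pd

# A few names taken from the samples shown above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Navratil, Master. Michel M",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# ([A-Za-z]+) captures a run of letters; \. requires a literal period right after it
pattern = r"([A-Za-z]+)\."
titles = names.str.extract(pattern, expand=False)
print(titles.tolist())

# value_counts() shows how often each title occurs
print(titles.value_counts())
```

Running `value_counts()` on the full `Title` column is a quick check that the extraction produced a small set of sensible categories.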

## 3. Label Encoding

We know our algorithms can only understand numbers, so what do we do with our string values? Well, we can map them to a number. For example:
- "Miss" -> 0
- "Mr" -> 1
- "Mrs" -> 2
- ... 


We already went over this process in our decision tree notebook using the label encoder from scikit-learn. Let's apply what we learned there, along with our new feature engineering knowledge, to this example.

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
passengers["Encoded_Title"] = encoder.fit_transform(passengers["Title"])
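
To check which number each title received, it can help to inspect the fitted encoder. The sketch below uses a short made-up list of titles (the `titles` Series is illustrative) rather than the full column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical titles standing in for the passengers["Title"] column
titles = pd.Series(["Mr", "Mrs", "Miss", "Mrs", "Mr"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(titles)

# classes_ holds the sorted original labels; each label's position is its integer code
classes = list(encoder.classes_)
print(classes)
print(list(encoded))

# inverse_transform maps the codes back to the original strings
decoded = list(encoder.inverse_transform(encoded))
print(decoded)
```

The codes follow the sorted order of the labels, so they are identifiers rather than a meaningful ranking of the titles.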






## Conclusion

That's it! Once your dataset looks nice and tidy and has some useful new features in it, you can run it through your supervised learning algorithms and train your model. Remember that machine learning is iterative, so you should keep coming back to this preprocessing step to see how you can further update your data to improve your accuracy. Happy coding!