# Introduction

Welcome to some real ML!

We'll be using the Titanic dataset to predict whether a passenger survived the ship's tragic maiden voyage.

For tools, we'll stick to scikit-learn to take advantage of its pre-built algorithms, and we'll also cover a few topics in preprocessing and data cleaning along the way.


### ***REMINDER!***

---

A machine learning problem is defined by a fixed input and a fixed output.

How would you fill in the statement below?

"Given \_\_\_, can we predict \_\_\_?"

# The Dataset
We'll be loading a CSV of 1309 passengers on the Titanic. Official accounts place the number of passengers slightly above this, and the full list including crew is about 2240 souls.



## Loading the dataset
Loading a CSV into Colab is quite easy.

0. Take the CSV file and upload it to your Google Drive. If possible, upload it to the "root" of your Google Drive (not inside any folder). If you don't upload it to the root, you'll need to adjust the path in the steps below.
1. Click the folder icon in the left sidebar (below the variables icon) to open the Files panel.
2. Connect the Colab runtime to your Google Drive with the Google Drive folder button, and grant it permission to view your drive.
3. Under `/content/drive/MyDrive`, your full Google Drive will be visible. Find your file, right-click on it, then click "Copy path". (If it's buried in some folders, browse into those folders to find it.)
4. Paste the path into the string variable `csv_loc`, then run the cell. If a table pops up, nice! Otherwise, you'll need to debug.

In [None]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

csv_loc = ""  # paste the path you copied between the quotes

df = pd.read_csv(csv_loc)

df
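
If no table pops up, the usual culprit is the path. Here is a minimal sketch of one way to get a friendlier error message. `load_csv_checked` is a hypothetical helper (it isn't part of this notebook), and a throwaway temp file stands in for your real Drive path:

```python
import os
import tempfile

import pandas as pd


def load_csv_checked(path):
    """Hypothetical helper: complain clearly if the path doesn't point at a file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No file at {path!r}. Try copying the path again.")
    return pd.read_csv(path)


# Stand-in CSV so this sketch runs anywhere (made-up rows, not the real data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("survived,age\n1,29\n0,2\n")
    demo_df = load_csv_checked(demo_path)

print(demo_df.shape)
```

In the notebook itself you would pass `csv_loc` instead of `demo_path`.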

## About the data
---
A data dictionary will sometimes be provided to explain the columns. This is especially helpful when a column contains an engineered feature where the math behind it is important to understand, or when a column simply has a confusing name.

Here is the data dictionary for the columns on the dataset:

* survived - Survival (0 = No; 1 = Yes)
* pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name - The passenger's name
* sex - Their sex (male/female in this case)
* age - Age
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* ticket - Ticket Number
* fare - Passenger Fare
* cabin - Cabin
* embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
* boat - Lifeboat (if survived)
* body - Body number (if did not survive and body was recovered)
* home.dest - home and destination (if known)

# Data Exploration

Exploring the data is an extremely important first step with any new dataset, and it should always come before anything else. The goal is to see what we'll be working with and to spot anything unusual ahead of time. Think of this as doing your due diligence; it will help guide your next steps.

### Food for thought
---
What would you first want to know about the dataset?

## Null-checking

Missing data can be a nightmare when it comes to data science. Null-checking every value as you go isn't fun, so missing values are generally cleaned out of the dataset up front using a variety of methods.

We'll get to data cleaning soon, so for now, let's see what's missing!



In [None]:
# This tells us which values are missing by converting every cell to a boolean (True if missing, False if not)

df.isnull()

That table isn't very readable, since it doesn't summarize anything. 

Looking through every single value in the table is a waste of time, so let's make the computer do it for us.

In [None]:
# Let's figure out what is *actually* missing
df.isnull().sum()

Yikes! That's a lot of missing data. Those columns may be a lot of work to clean up, but we'll come back to that down below.
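
Raw counts are easier to judge next to the size of the table. As a rough sketch on a made-up toy table (the values below are invented, not taken from the Titanic CSV), swapping `sum()` for `mean()` gives the fraction of each column that is missing:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3, 151.6, np.nan, 7.9],
    "sex": ["female", "male", "female", "male"],
})

# isnull() gives True/False per cell, and the mean of booleans is the fraction that are True
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

The same one-liner should work on `df`, and it puts the worst columns at the top.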

Let's also look into the data in some of the columns.

First off, that home.dest column is interesting, but how can we check what's in it?

### Do it yourself!

If you run the cell below, you'll get just the one column from the data. Look around online and figure out if there's a function you can use to get only the unique values

In [None]:
# Let's see what's in the home.dest column

df["home.dest"]

# df["home.dest"].?

In [None]:
# That's a lot of values... 
# let's check the age column

df['age'].?
# Don't ask why some ages are really specific decimals, I don't know either

For columns where we know the values already, it's more helpful to see the distribution. Specifically, how could we see what the most common ages were?

### Do it yourself!

There's a way to summarize the column to show the age and the number of times that age appears in the data. Look around for a function that lets you summarize

In [None]:
# It should sort by count descending, and if you look at the length, it tells you the number of unique values too :)
df['age'].?


In [None]:
# and let's check embarked to see where people came from. Use the same function as the cell above!

df['embarked'].?

There are tons of pandas operations you can use to look into data. Feel free to try out some of the other options.

For example, what does this one below do?

In [None]:
df[["sex", "age"]].groupby("sex").describe()

# Data Cleaning, Feature Analysis and Feature Selection
---
Alright, let's get into prepping the data for some ML! This section covers a few different subjects all at once (for the sake of time). As you can probably guess, we don't need to clean features we don't plan on using, so we'll select some features to use first, then clean those.

Also, when you see the word "feature" here, let's internally convert that to "column" when talking about the data. "Feature" is a generic term that covers all types of machine learning.


## Selecting features

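We'll select a handful of features by simply removing the ones we don't plan on using. We'll also remove features that give away the answer to our question. For example, according to the data dictionary, `boat` is only filled in for survivors and `body` only for passengers who did not survive.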

### Food for thought

1. What columns probably don't affect the outcome too much?
2. Which columns tell you the "same" information?
3. Which columns will be hard to use?

Let's remove some features. To simplify the deletion, we're going to provide the columns as a list, then drop them all at once.

If you make changes to this list, the steps below may change depending on what features you select.

If you delete columns and want them back, rerun the cell that loads the dataframe from your Google Drive to bring back all the columns. Just know that you'll have to run your data cleaning again!

To do this easily, click on the cell below, then go to `Runtime` -> `Run before` to run all the cells up to this point. It may ask you to connect your Google Drive again, so feel free to comment out the `drive.mount` line if that popup gets annoying.

In [None]:
columns_to_remove = ["name", "ticket", "boat", "body", "home.dest", "sibsp", "parch", "cabin"]

# columns= makes it explicit that we're dropping columns, not rows
df = df.drop(columns=columns_to_remove)

df
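
One gotcha: running the drop cell a second time raises a `KeyError`, because the columns are already gone. If that gets in your way, pandas' `errors="ignore"` option is one way around it. A small sketch on a made-up table (`toy` and `to_remove` are illustrative names only):

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({"name": ["A", "B"], "age": [30, 40], "ticket": ["x1", "x2"]})
to_remove = ["name", "ticket"]

toy = toy.drop(columns=to_remove, errors="ignore")
# Running the same drop again now does nothing instead of raising a KeyError
toy = toy.drop(columns=to_remove, errors="ignore")

print(list(toy.columns))
```

The trade-off is that a typo in a column name will also be ignored silently, so use it with care.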

## Encoding features

Features like embarked and sex hold words, which machine learning models can't use directly. Each one takes its values from a short, fixed list of options (see the data dictionary above), which makes them categorical features.
Let's encode them by assigning a number to each category and replacing the word with that number.

In [None]:
# To convert a categorical variable, switch the type to category, then grab the numeric code for each category
df["sex"]=df["sex"].astype('category').cat.codes

df
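
Wondering which number ended up meaning what? Here is a small sketch on a made-up column (not the real data) that prints the mapping. It also shows what `cat.codes` does with a missing value:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only, including one missing entry
toy = pd.Series(["S", "C", np.nan, "Q", "S"])

as_category = toy.astype("category")
codes = as_category.cat.codes

# Categories are stored in sorted order, and each code is a position in that list
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)
print(codes.tolist())
```

Notice that the missing value comes out as `-1` rather than staying null, so it will no longer be counted by `isnull()` after encoding.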

In [None]:
# Let's do the same for embarked
# (heads up: cat.codes turns missing values into -1, so they stop showing up as nulls)
df["embarked"]=?

df

## Removing those pesky null values


In [None]:
# Let's double back around and check where we are on those null values

df.isnull().sum()

A few are still there, and while we could take the time to work through them, we're just going to drop those rows for simplicity.

Notice that the row count shown below the dataframe goes down.

In [None]:
# dropna drops any row that has a null value in at least one column
df = df.dropna()

df
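
Dropping every row with any null is the blunt option. As a sketch of a gentler alternative, `dropna` also accepts a `subset` of columns to check. The toy table below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0],
    "fare": [211.3, 151.6, np.nan],
    "pclass": [1, 1, 3],
})

all_clean = toy.dropna()                 # drops any row with a null anywhere
age_clean = toy.dropna(subset=["age"])   # only drops rows where age is null

print(len(toy), len(all_clean), len(age_clean))
```

Rows kept this way can still contain nulls in other columns, so this only helps if you plan to handle those some other way.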

# Machine Learning!

## Prepping the data

To finish prepping the data, we need to separate the column we're predicting from the columns we're predicting with, then split everything into training data and test data.



In [None]:
# Let's break apart the column we're predicting and the columns we'll be using to make the prediction

# pop takes the single column out of the dataframe and returns it, so it makes our job pretty easy
target = df.pop("survived")

target

In [None]:
# as you can see the dataframe is missing the column

df

In [None]:
# sklearn can split the data into training data and test data in one line
from sklearn.model_selection import train_test_split

# This saves the output into 4 variables. test_size is the fraction of rows held out for the test set (0.2 = 20%),
# and random_state fixes the shuffle so the split is the same on every run
x_train, x_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)
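
To see what the split actually hands back, here is a sketch on a tiny made-up table of 10 rows (all names starting with `toy_` or `a_`/`b_` are illustrative, chosen so they don't overwrite the notebook's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 10 rows, 2 feature columns, and a 0/1 target
toy_x = pd.DataFrame({"age": range(10), "fare": range(10, 20)})
toy_y = pd.Series([0, 1] * 5)

a_train, a_test, b_train, b_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

# 20% of 10 rows go to the test set, the rest to training
print(a_train.shape, a_test.shape)

# Features and targets are shuffled together, so their row labels still line up
print(list(a_train.index) == list(b_train.index))
```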

## Running the algorithm!

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

model = SVC(kernel='linear', random_state=42)

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))

Boom! You've done machine learning!
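
So what is that accuracy number? It's the fraction of test rows where the prediction matched the real answer. A quick sketch with made-up labels shows that doing it by hand agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels for illustration only
true_labels = np.array([1, 0, 0, 1, 1])
predicted = np.array([1, 0, 1, 1, 0])

# Compare element by element, then take the fraction of matches
manual = (true_labels == predicted).mean()
library = accuracy_score(true_labels, predicted)

print(manual, library)
```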

### Some other algorithms

Let's try out some other options in the library and see how they do!

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))

In [None]:
# Decision Trees
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)

model.fit(x_train, y_train)

y_pred = model.predict(x_test)

print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))

This one has an extra twist. We can also visualize this one quite easily to see what the final decision tree looks like.

Run the cell below, and look at the output. Can you figure out what the values in each box mean?

In [None]:
from graphviz import Source
from sklearn import tree

Source(tree.export_graphviz(model, out_file=None, feature_names=x_train.columns.values))
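
If graphviz isn't available in your environment, scikit-learn can also print a tree as plain text with `export_text`. A minimal sketch on four made-up rows (the numbers are invented, not taken from the dataset):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented rows: [sex, age], where sex alone separates the two classes
toy_x = [[0, 22], [1, 38], [1, 26], [0, 35]]
toy_y = [0, 1, 1, 0]

toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_x, toy_y)

# Each indented line is a split rule, and leaf lines show the predicted class
report = export_text(toy_tree, feature_names=["sex", "age"])
print(report)
```

On the notebook's own model, passing `feature_names=list(x_train.columns)` should give the same kind of printout.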

## Writing less code

Before we go further, let's reduce the amount of code we're writing for each model.

Let's make a function that we can use to test models a bit quicker.

Look for the pattern shared by the Decision Tree and SVM cells, and make a function that replaces the duplicate code.

In [None]:
def train_and_test_model(model, x_train, x_test, y_train, y_test):
  # TODO: move the repeated steps from the cells above in here
  pass

Now let's use that to try out the random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)

train_and_test_model(model, x_train, x_test, y_train, y_test)

## Tuning hyperparameters

While just hitting run and seeing how algorithms do is fun, let's spice things up a bit.

The next algorithm is K-nearest neighbors, and you'll get to see how changing the value of k (the `n_neighbors` parameter) affects the accuracy.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)

train_and_test_model(model, x_train, x_test, y_train, y_test)
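
Why would k matter? With `n_neighbors=1` the model copies the label of the single closest training point, while larger values take a majority vote. A tiny sketch with made-up one-dimensional data, where one oddball point sits among the others:

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented data: the point at 3 is labeled 1, everything around it is labeled 0
toy_x = [[1], [2], [3], [4], [5]]
toy_y = [0, 0, 1, 0, 0]
query = [[3.1]]

# k=1 looks only at the closest point (3), k=3 lets its neighbors (2 and 4) vote too
pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(toy_x, toy_y).predict(query)[0]
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(toy_x, toy_y).predict(query)[0]

print(pred_k1, pred_k3)
```

The same query gets a different answer depending on k, which is why it's worth tuning.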

### Automating busy work

We could type every single value of `n_neighbors` in by hand until a good one shows up...

But we can just do that with loops!

### Do it yourself!
Add a loop in and see which value for n_neighbors performs the best! Print out the number of neighbors, then call the function you already wrote. 

Try out 1-25, and see which one performs the best!

In [None]:
# Fill in the blanks to try every value of n_neighbors from 1 to 25
for ???:
  model = ?
  
  print(?)

# Bonus work!

Try going back and adjusting the features you select, or removing cleaning steps by commenting out code. If you're feeling adventurous, even try engineering your own features! After making changes, you can rerun all the code in the notebook by going to `Runtime` -> `Run all`.

Can you improve the accuracy? Do certain models perform better than others when you make certain changes?