Welcome to the OT Machine Learning Training #1
We just heard a lot about basic ML and decision trees. While it is fun and challenging to code your own decision tree algo, I 
think we could better use this time to familiarize ourselves with some of the tools of the trade. This will also include 
learning how we prepare data for ML and how small changes to a model's parameters can make a huge difference in results.

So, what are these tools of the trade? Well, to start, the most popular language for ML is (arguably) Python. So we will be using
Python. Don't freak out, it's pretty intuitive. 

What else? 
- Well, we need a good way to represent and handle our data (what we will learn from). Pandas is a fantastic data analysis
library that is very popular in the ML world. Pandas will allow us to read data from csv files and represent it in an easy-to-handle way.
- Since we aren't going to write our own Decision Tree right now, we need a way to create one. Scikit-learn is one of the most widely 
used machine learning libraries in Python. It provides all types of classifiers, including Decision Trees. 
- And of course this is a Jupyter notebook. It will be your favorite thing. You can execute this notebook in parts, or 
'cells', which allows you to move at your own pace and allows me to provide you some structure.

The Task
The goal of this training is to read in a 'training' dataset, use it to train a Decision Tree, and successfully classify a 'test'
dataset. The dataset is about the Titanic. You will be predicting whether a given person survived or not. 
The dataset source is [here](https://www.openml.org/d/40945) but you don't need to download anything.

Let's get started


In [1]:
# this is how you import libraries in python
import pandas as pd
import sklearn
from sklearn import tree, preprocessing

Task 1: Get Some Data

Read 'training.csv' and 'testing.csv' into pandas dataframes. 
Change the code below to do that. If you do it right, when you run the 
notebook, it will print the training dataset (you'll see passenger information etc). You can
print the testing dataset too if you want.

In [2]:
training_data = pd.read_csv('training.csv')
testing_data = pd.read_csv('testing.csv')
print(training_data, '\n', testing_data)

     pclass  survived                                               name  \
0         3         0                        Alhomaki, Mr. Ilmari Rudolf   
1         1         0                            Kenyon, Mr. Frederick R   
2         3         1  Moubarek, Master. Halim Gonios ('William George')   
3         1         0                            Meyer, Mr. Edgar Joseph   
4         3         1                           Glynn, Miss. Mary Agatha   
5         3         0                     Hakkarainen, Mr. Pekka Pietari   
6         2         0                         Beauchamp, Mr. Henry James   
7         3         0                               Connolly, Miss. Kate   
8         3         0                              Saade, Mr. Jean Nassr   
9         2         1                         Navratil, Master. Michel M   
10        3         0   Katavelas, Mr. Vassilios ('Catavelas Vassilios')   
11        2         0                             Hunt, Mr. George Henry   
12        3 
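
If you want to poke at `read_csv` without the real files, here is a minimal sketch. The two CSV rows are made up for illustration (they are not taken from the Titanic data), and `io.StringIO` just stands in for a file on disk.

```python
import io
import pandas as pd

# made-up CSV text standing in for a file like 'training.csv' (hypothetical rows)
csv_text = (
    "pclass,survived,name,sex\n"
    '3,0,"Doe, Mr. John",male\n'
    '1,1,"Roe, Miss. Jane",female\n'
)

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first few rows, handy when the full frame is too big to print
print(df.dtypes)   # the type pandas inferred for each column
```

`shape` and `head()` are usually a quicker sanity check than printing the whole dataframe.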

Task 2: Awesome, you have the dataset. Now, a pandas dataframe is kind of like a table. It has 
columns with labels representing our attributes, like 'age', 'sex' and 'name', and rows 
representing single instances of data. Kate Winslet would have a row, Leo would have a row, etc.
One of our attributes/columns is called 'survived'. This is what you want to teach your model 
to predict. It is our 'target class'. It is okay to use these values for training; that is how
the model learns. But if you feed them to the model as input when testing, you are 'cheating'.

What we need to do is extract the 'survived' column from both datasets. The decision tree we 
are going to use accepts two parameters to its training function: x_train and y_train.

When you later go to score your tree (see how well it does on unknown data) it will need an
x_test and y_test.

So:
x_train/test is all rows and columns except the target column
y_train/test is your target column

Try to create these variables from your dataframes

In [3]:
y_train = training_data[['survived']]
x_train = training_data.drop('survived', axis=1)

y_test = testing_data[['survived']]
x_test = testing_data.drop('survived', axis=1)

print(y_train)

     survived
0           0
1           0
2           1
3           0
4           1
5           0
6           0
7           0
8           0
9           1
10          0
11          0
12          0
13          0
14          0
15          0
16          0
17          0
18          0
19          1
20          1
21          1
22          0
23          0
24          1
25          1
26          1
27          0
28          0
29          0
..        ...
842         0
843         0
844         1
845         1
846         1
847         1
848         0
849         0
850         1
851         0
852         1
853         1
854         0
855         0
856         0
857         0
858         0
859         1
860         1
861         1
862         0
863         0
864         1
865         1
866         1
867         0
868         0
869         1
870         0
871         0

[872 rows x 1 columns]
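
A small sketch of the same split on a toy dataframe (the three rows are invented for illustration). It also shows the difference between double and single brackets: `df[['survived']]` gives a one-column dataframe, which is why the output above says `[872 rows x 1 columns]`, while `df['survived']` gives a one-dimensional Series.

```python
import pandas as pd

# toy data, not the real Titanic rows
df = pd.DataFrame({
    "pclass": [3, 1, 3],
    "survived": [0, 0, 1],
    "fare": [7.9, 51.9, 15.2],
})

y_frame = df[["survived"]]        # DataFrame with one column, shape (3, 1)
y_series = df["survived"]         # Series, shape (3,)
x = df.drop("survived", axis=1)   # every column except the target

print(y_frame.shape, y_series.shape)
print(list(x.columns))

# drop returns a new dataframe; the original still has the target column
print("survived" in df.columns)
```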


Task 3: Awesome, you have your training data, nicely separated into instances and targets. Ready to train the tree, right?? Not so
fast. One of the tough things about sklearn is that its models, decision trees included, expect numbers, so they can't take
categorical variables as-is. What does that mean? It means attributes like 'name' whose values are something other than numerical.
Those confuse the model and it will reject you. No one likes rejection. So we need to do some more data manipulating. Getting a 
little tedious? This is a lesson in ML: it's not all about the models. Having good, clean data is a huge priority and takes work. 
Lucky for you, I'll handle this. Just run the cell below to make sure you get what you need.

In [4]:
# find the offensive columns
bad_form_attributes_train = x_train.select_dtypes(include=[object])
bad_form_attributes_test = x_test.select_dtypes(include=[object])
print('these are the categorical columns ', bad_form_attributes_test.columns, '\n')

# drop the badies from our nice x_train
x_train = x_train.drop(bad_form_attributes_train.columns, axis=1)
x_test = x_test.drop(bad_form_attributes_test.columns, axis=1)

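# turn each categorical column into integers (one integer per distinct value)
# note: the train and test encoders are fit separately, so the same value
# is not guaranteed to get the same integer in both sets
encoder_train = preprocessing.LabelEncoder()
encoded_columns_train = bad_form_attributes_train.apply(encoder_train.fit_transform)

encoder_test = preprocessing.LabelEncoder()
encoded_columns_test = bad_form_attributes_test.apply(encoder_test.fit_transform)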
print('check it out we encoded the badies \n', encoded_columns_test)

x_train = x_train.join(encoded_columns_train, how='left', lsuffix='_left', rsuffix='_right')
x_test = x_test.join(encoded_columns_test, how='left', lsuffix='_left', rsuffix='_right')
print('our new, no categorical training set \n', x_train)

these are the categorical columns  Index(['name', 'sex', 'age', 'ticket', 'cabin', 'embarked', 'boat', 'body',
       'home.dest'],
      dtype='object') 

check it out we encoded the badies 
      name  sex  age  ticket  cabin  embarked  boat  body  home.dest
0     273    1   16      57     66         3    20    38        113
1     213    0   56      41     42         3    14    38        111
2     304    1   22      31      7         3    16    38        107
3     130    1   75     286      0         3    16    38         17
4     142    1   72     336      3         1    16    38        111
5     204    1   49      27     43         3    11    38         14
6     126    1   13     218      0         3    16    38          0
7     365    1   39     119      0         3    16    38         33
8     197    1   38     153      0         2    16    33          0
9     282    1   28      50      0         3    16    20         56
10     47    1   52      19     70         3    16    38   
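
One thing to keep in mind with this kind of encoding: if the training and testing sets are encoded independently, the same category can end up with different integers in each. The sketch below shows that on toy values, then shows one possible alternative, which is to fit a single encoder on the training data and reuse it on the test data. It uses `OrdinalEncoder` rather than the `LabelEncoder` from the cell above; that is a swapped-in choice for illustration, not what this notebook does, and the `handle_unknown` option assumes scikit-learn 0.24 or newer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# toy categorical columns (sex, embarked); values invented for illustration
train = np.array([["female", "C"], ["male", "Q"], ["male", "S"]], dtype=object)
test = np.array([["male", "S"], ["male", "X"]], dtype=object)

# separate encoders: 'male' is 1 in the training codes but 0 in the test codes
sep_train = LabelEncoder().fit_transform(train[:, 0])
sep_test = LabelEncoder().fit_transform(test[:, 0])
print(sep_train, sep_test)

# one encoder fit on the training data only, then reused on the test data;
# categories never seen in training (like 'X') are mapped to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)
print(train_codes)
print(test_codes)
```

With the shared encoder, 'male' and 'S' get the same codes in both sets, so a tree trained on the training codes sees test values it can interpret.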

Task 3: Alright hotshot, time to create your Decision Tree. Go ahead and initialize one. Don't worry about setting any of the 
parameters right now, they all have defaults. Here's a helpful [link](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [5]:
decision_tree = tree.DecisionTreeClassifier()
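
For later, when you do want to touch the parameters, here is a minimal sketch of what that looks like. The values `max_depth=3` and `random_state=0` are arbitrary picks for illustration, not recommendations.

```python
from sklearn import tree

# all defaults, same as the cell above
default_tree = tree.DecisionTreeClassifier()

# limit how deep the tree may grow and fix the random seed for repeatable runs
shallow_tree = tree.DecisionTreeClassifier(max_depth=3, random_state=0)

# get_params() returns the settings as a dictionary
print(default_tree.get_params()["max_depth"])   # None means no depth limit
print(shallow_tree.get_params()["max_depth"])
```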

Task 4: Alright, you think you've got your tree? Time to train it with the x_train and y_train variables you created earlier.

In [6]:
decision_tree = decision_tree.fit(x_train, y_train)
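
To get a feel for what `fit` actually builds, here is a sketch on a tiny made-up dataset with one feature. The feature name `x` is just a label chosen for the printout.

```python
from sklearn import tree

# one feature, four rows: small values are class 0, large values are class 1
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

toy_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# a single split is enough to separate these classes
print(toy_tree.get_depth())
print(toy_tree.get_n_leaves())

# export_text prints the learned rules in plain text
rules = tree.export_text(toy_tree, feature_names=["x"])
print(rules)
```

Printing the rules of your real tree the same way can help you see which columns it leans on.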

Task 5: Your tree is trained! Hopefully! So let's use it to predict the survival of the testing set. What you should get from using your x_test dataset is an output of 1's and 0's representing the model's guess on survival for each row in the testing set.

In [7]:
predictions = decision_tree.predict(x_test)
print(predictions)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
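
A long wall of 1's and 0's is hard to read, so counting how often each value is predicted can be a useful first check. This is a minimal sketch on a toy tree and made-up inputs, not the Titanic model.

```python
import numpy as np
from sklearn import tree

# toy training data: one feature, two classes
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict returns one label per input row
X_new = [[0.5], [2.5], [3.5]]
predictions = clf.predict(X_new)
print(predictions)

# count how many times each label was predicted
values, counts = np.unique(predictions, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

If one label dominates the counts far more than it does in the training targets, that is a hint to go back and look at the data going into the model.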


Task 6: That's awesome, you made your first predictions with a machine learning model. But it's hard to really get a feel for what that means. So the model says that row x of the testing set died. Okay, fine. But is it accurate? Did row x really die? Let's score the model. We have the actual answers for our testing set in y_test. Let's use them to see how you did.

In [8]:
score = decision_tree.score(x_test, y_test)
print(score)

0.316513761468
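
For a classifier, `score` returns the mean accuracy: the fraction of rows where the prediction matches the true label. The sketch below reproduces that by hand on made-up labels and compares it against always guessing the most common class. A score near 0.32 on a two-class problem is worth a second look, since that always-guess-the-majority baseline can never fall below 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 1])

# accuracy by hand and with sklearn's helper
manual = (y_true == y_pred).mean()
acc = accuracy_score(y_true, y_pred)
print(manual, acc)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# accuracy of always guessing the most common class
baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))
print(baseline)
```

The confusion matrix shows where the mistakes are: here two true 0's were predicted as 1, and no true 1's were missed.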
