Welcome to the OT Machine Learning Training #1
We just heard a lot about basic ML and decision trees. Now while it is fun and challenging to code your own decision tree algo, I 
think we could better use this time to familiarize ourselves with some of the tools of the trade. This will also include 
learning about how we prepare data for ML and how tuning small parameters on the models can make a huge difference in results.

So, what are these tools of the trade? Well, to start, the most popular language for ML is (arguably) Python. So we will be using
Python. Don't freak out, it's pretty intuitive. 

What else? 
- Well, we need a good way to represent and handle our data (what we will learn from). Pandas is a fantastic data analysis
library that is very popular in the ML world. Pandas will allow us to read data from csv files and represent it in an easy-to-handle way.
- Since we aren't going to write our own Decision Tree right now, we need a way to create one. Scikit-learn is the leading machine 
learning library in Python. It provides all types of classifiers, including Decision Trees. 
- And of course this is a Jupyter notebook. It will be your favorite thing. You can execute this notebook in parts, or 'cells', which allows you to move at your own pace and allows me to provide you some structure.

The Task
The goal of this training is to read in a 'training' dataset, use it to train a Decision Tree, and successfully classify a 'test'
dataset. The dataset is about the Titanic. You will be predicting whether a given person survived or not. 
The dataset source is [here](https://www.openml.org/d/40945), but you don't need to download anything.

Let's get started


In [60]:
# this is how you import libraries in python
import pandas as pd
import sklearn
from sklearn import tree, preprocessing

Task 1: Read 'training.csv' into a pandas dataframe. Change the code below to do that. If you do it right, when you run the 
notebook, it will print the training dataset. You'll see passenger information.

In [61]:
training_data = pd.read_csv('training.csv')
print(training_data)

     pclass  survived                                               name  \
0         2         1                          Brown, Miss. Edith Eileen   
1         2         0                          Slemen, Mr. Richard James   
2         1         1                          Wilson, Miss. Helen Alice   
3         3         1             Johansson Palmquist, Mr. Oskar Leander   
4         3         1                           Cohen, Mr. Gurshon 'Gus'   
5         3         0                             Odahl, Mr. Nils Martin   
6         2         1                     Davies, Master. John Morgan Jr   
7         2         1              Abelson, Mrs. Samuel (Hannah Wizosky)   
8         1         1                 Cardeza, Mr. Thomas Drake Martinez   
9         3         0                    Lovell, Mr. John Hall ('Henry')   
10        2         1                      Brown, Miss. Amelia 'Mildred'   
11        2         1            Mallet, Mrs. Albert (Antoinette Magnin)   
12        3 

Task 2: Awesome, you have the dataset. Now, a pandas dataframe is kinda like a table. It has columns with labels representing our
attributes, like 'age', 'sex' and 'name', and rows representing single instances of data. Kate Winslet would have a row, Leo would 
have a row, etc.
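
To make the table picture concrete, here is a minimal sketch that builds a tiny dataframe from an inline CSV string. The three rows, the `csv_text` string, and the `toy` variable are all made up for illustration; none of it comes from 'training.csv'.

```python
import io

import pandas as pd

# Made-up rows for illustration only (NOT taken from training.csv)
csv_text = """name,sex,age,survived
Rose,female,17,1
Jack,male,20,0
Cal,male,30,1
"""

# read_csv accepts a file path or any file-like object, such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.columns.tolist())  # column labels = attributes
print(toy.shape)             # (number of rows, number of columns)
print(toy.loc[0])            # one row = one instance
```

Each column label is an attribute and each row is one person, which is the same layout you get back when you read the real training file.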

One of our attributes (one of the columns) is called 'survived'. This is what you want to teach your model to predict. Because we are training, we have those values in this dataset, but we won't when we move to the test dataset.

What we need to do is extract the 'survived' column from the training dataset. The Decision tree we are going to use accepts two 
parameters: x_train and y_train. 

x_train is all rows and columns except the target column
y_train is your target column

Try to create these two variables from your training_data dataframe

In [62]:
y_train = training_data[['survived']]
x_train = training_data.drop('survived', axis=1)

print(y_train)

     survived
0           1
1           0
2           1
3           1
4           1
5           0
6           1
7           1
8           1
9           0
10          1
11          1
12          0
13          0
14          0
15          0
16          0
17          1
18          0
19          0
20          0
21          0
22          0
23          0
24          0
25          0
26          1
27          1
28          1
29          0
..        ...
842         0
843         0
844         1
845         0
846         1
847         0
848         0
849         1
850         1
851         0
852         0
853         1
854         0
855         0
856         0
857         0
858         0
859         0
860         1
861         0
862         0
863         0
864         0
865         0
866         0
867         1
868         0
869         0
870         0
871         1

[872 rows x 1 columns]
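
A side note on the brackets: double brackets, as in the cell above, give back a one-column dataframe (hence the "872 rows x 1 columns" footer), while single brackets give back a one-dimensional Series. Here is a small sketch of the difference on made-up data. The `toy` dataframe and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({"age": [17, 20, 30], "survived": [1, 0, 1]})

y_frame = toy[["survived"]]  # double brackets: one-column DataFrame
y_series = toy["survived"]   # single brackets: 1-D Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)

# drop returns a new dataframe without the target column
x_toy = toy.drop("survived", axis=1)
print(x_toy.columns.tolist())
```

Either form holds the same values. Which one a given model prefers can vary, so it is worth knowing which one you are holding.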


Awesome, you have your training data, nicely separated into instances and targets. Ready to train the tree, right? Not so
fast. One of the tough things about sklearn (and many other Python libraries) is that the models expect numbers, so they don't
accept categorical variables as they are. What does that mean? It means attributes like 'name' whose values are something other
than numerical. That confuses the model and it will reject you. No one likes rejection. So we need to do some more data
manipulating. Getting a little tedious? This is a lesson in ML: it's not all about the models. Having good, clean data is a huge
priority and takes work. Lucky for you, I'll handle this. Just run the cell below to make sure you get what you need.

In [67]:
# find the offending (non-numeric) columns
bad_form_attributes = x_train.select_dtypes(include=[object])
print('these are the categorical columns', bad_form_attributes.columns, '\n')

# drop the baddies from our nice x_train
# (drop returns a new dataframe, so the result has to be assigned back)
x_train = x_train.drop(bad_form_attributes.columns, axis=1)

# encode every categorical column as integers; the encoder is refit for each column
encoder = preprocessing.LabelEncoder()
encoded_columns = bad_form_attributes.apply(encoder.fit_transform)
print('check it out we encoded the badies \n', encoded_columns)

# add the encoded columns back; after the drop there are no name clashes, so no suffixes are needed
x_train = x_train.join(encoded_columns)
print(x_train)

these are the categorical columns Index(['name', 'sex', 'age', 'ticket', 'fare', 'cabin', 'embarked', 'boat',
       'body', 'home.dest'],
      dtype='object') 

check it out we encoded the badies 
      name  sex  age  ticket  fare  cabin  embarked  boat  body  home.dest
0     113    0   12     248   123      0         3     6    84         51
1     744    1   42     229     3      0         3    18    84         65
2     855    0   36      80    28    123         1    11    84          0
3     393    1   28     327   186      0         3     7    84          0
4     175    1   15     529   208      0         3     3    84        169
5     599    1   24     501   231      0         3    18    84          0
6     220    1   86     556   121      0         3     6    84        251
7       6    0   30     581    78      0         1     1    84        233
8     139    1   43     614   139     31         1    11    84         12
9     485    1   20     524   171      0         3    18    
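
If you're curious what the encoder actually did, here is a minimal sketch on a handful of made-up values. The `ports` list is an assumption for illustration. The encoder sorts the unique labels and replaces each value with its position in that sorted list.

```python
from sklearn import preprocessing

# Made-up categorical values for illustration only
ports = ["S", "C", "S", "Q"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(ports)

print(list(encoder.classes_))  # sorted unique labels: ['C', 'Q', 'S']
print(list(codes))             # index of each value in classes_: [2, 0, 2, 1]

# inverse_transform maps the integers back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Keep in mind the integers are just labels. The ordering they suggest (C < Q < S here) is an artifact of sorting, not something real about the data.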

Task 3: Alright hotshot, time to create your Decision Tree. Go ahead and initialize one. Don't worry about setting any of the 
parameters right now, they all have defaults. Here's a helpful [link](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [None]:
decision_tree = tree.something_here_that_makes_a_tree

Task 4: Alright, you think you've got your tree? Time to train it with the x_train and y_train variables you created earlier.

In [None]:
decision_tree = decision_tree.train_me_somehow
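
If you get stuck on the last two cells, here is the general shape of creating and training a tree, sketched on made-up data rather than the Titanic set. The names `x_toy`, `y_toy` and `clf` are hypothetical, the feature values are invented, and `random_state=0` is only there to make the sketch repeatable. Try the cells yourself before peeking.

```python
from sklearn import tree

# Made-up, already-numeric features for illustration: [pclass, sex_code]
x_toy = [[1, 0], [1, 1], [3, 0], [3, 1]]
# Made-up targets: here they depend only on the first feature
y_toy = [1, 1, 0, 0]

# All parameters have defaults; random_state just fixes tie-breaking
clf = tree.DecisionTreeClassifier(random_state=0)

# fit learns the splits from the instances and their targets
clf.fit(x_toy, y_toy)

# predict takes rows shaped like the training instances
print(clf.predict([[1, 0], [3, 1]]))
```

The pattern is always the same: build the classifier, call `fit` with instances and targets, then call `predict` on rows that have the same columns, in the same order, as the training data.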

Task 5: Your tree is trained! Hopefully! So let's use it to predict the survival of the testing set. First we need to load the
testing set.

In [None]:
testing_data = []