Welcome to the OT Machine Learning Training #1
We just heard a lot about basic ML and decision trees. Now while it is fun and challenging to code your own decision tree algo, I 
think we could better use this time to familiarize ourselves with some of the tools of the trade. This will also include 
learning about how we prepare data for ML and how tuning small parameters on the models can make a huge difference in results.

So, what are these tools of the trade? Well, to start, the most popular language for ML is (arguably) Python. So we will be using
Python. Don't freak out, it's pretty intuitive. 

What else? 
- Well, we need a good way to represent and handle our data (what we will learn from). Pandas is a fantastic data analysis
library that is very popular in the ML world. Pandas will allow us to read data from csv files and represent it in an easy-to-handle way.
- Since we aren't going to write our own Decision Tree right now, we need a way to create one. Scikit-learn is the leading machine 
learning library in Python. It provides all types of classifiers, including Decision Trees. 
- And of course this is a Jupyter notebook. It will be your favorite thing. You can execute through this notebook in parts or 
'cells' which allows you to move at your own pace and allows me to provide you some structure.

The Task
The goal of this training is to read in a 'training' dataset, use it to train a Decision Tree, and successfully classify a 'testing'
dataset. The dataset is about the Titanic. You will be predicting whether a given person survived or not. 
The dataset source is [here](https://www.openml.org/d/40945) but you don't need to download anything.

Let's get started


In [2]:
# this is how you import libraries in python
import pandas as pd
import sklearn
from sklearn import tree, preprocessing
from sklearn.model_selection import train_test_split

Task 1: Get Some Data

Read the 'Titanic.csv' file into a pandas dataframe. Change the code below to do that. If you do it right, when you run the 
notebook, it will print the dataset (you'll see passenger information etc).

In [None]:
titanic_data = []
print(titanic_data)
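
If you want to see the idea on something small first, here is a minimal sketch of reading CSV text into a dataframe. The toy data and the names `toy_csv` and `toy_data` are made up for illustration and are not the Titanic file; `io.StringIO` just stands in for a file on disk.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for a real file on disk (not the Titanic data).
toy_csv = io.StringIO(
    "name,age,survived\n"
    "Ann,29,1\n"
    "Bob,41,0\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
toy_data = pd.read_csv(toy_csv)

print(toy_data.shape)          # (rows, columns)
print(list(toy_data.columns))  # column labels taken from the header row
```

With a real file you would pass the file name as a string instead of the `StringIO` object.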

Task 2: Awesome, you have the dataset. Now, a pandas dataframe is kind of like a table. It has 
columns with labels representing our attributes, like 'age', 'sex', and 'name', and rows 
representing single instances of data. Kate Winslet would have a row, Leo would have a row etc.
One of our attributes/columns is called 'survived'. This is what you want to teach your model 
to predict. It is our 'target class'. 

To use sklearn's Decision Tree, we need to separate our data into x and y. 
y will hold the target class values
x will hold everything else

Change the code below to do that.

In [None]:
# extract the target class values from the titanic_data dataframe
y = titanic_data['put something in here to get what you want']

# now that you have separated the target class into a new variable, a copy remains in the titanic_data. Remove it so you don't have duplicates
x = titanic_data.some_function_here_to_remove('survived', axis=1)
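
The same split into target and attributes can be sketched on a toy table. Everything here (`toy`, `height_cm`, `likes_tea`) is invented for illustration; the point is only how column selection and column removal behave in pandas.

```python
import pandas as pd

# Made-up table: pretend 'likes_tea' is the target class.
toy = pd.DataFrame({
    "height_cm": [150, 180, 165],
    "likes_tea": [1, 0, 1],
})

# Square brackets with a column label pull out that one column as a Series.
toy_y = toy["likes_tea"]

# drop(..., axis=1) returns a NEW DataFrame without the named column;
# the original `toy` is left unchanged.
toy_x = toy.drop("likes_tea", axis=1)

print(list(toy_x.columns))
print(toy_y.tolist())
```

Because `drop` returns a copy, you have to assign its result to a variable, which is why the exercise cell assigns to `x`.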

Task 3: Great, you now have the x and y parts of the data. Now you need to create a training set and a testing set. The training set holds all the instances of the titanic data that your model will use to 'learn'. The testing set is what we will use to test the accuracy of your model after it has been trained. Sklearn has a nice function that separates a dataset into x_train, x_test, y_train, y_test. Find it and change the code below to use it.

In [None]:
x_train, x_test, y_train, y_test = some_cool_function_here(x, y, train_size=0.75, test_size=0.25)

# you can use this print to inspect your results
print(y_test)
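
To get a feel for what a 75/25 split produces, here is a small sketch on made-up data using the `train_test_split` function imported at the top of the notebook. The toy variables and `random_state=0` are assumptions added for illustration; the exercise cell does not set a random state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 8 rows, one feature, alternating target values.
toy_x = pd.DataFrame({"feature": range(8)})
toy_y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="target")

# random_state pins the shuffle so the split is the same on every run.
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, train_size=0.75, test_size=0.25, random_state=0
)

# 8 rows at 75/25 gives 6 training rows and 2 testing rows.
print(len(tx_train), len(tx_test))

# Rows stay paired: each x row keeps the same index label as its y value.
print(tx_test.index.equals(ty_test.index))
```

Without a fixed `random_state` the rows are shuffled differently on each run, so your printed `y_test` may differ from your neighbor's.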

Task 4: Awesome, you have your training data, nicely separated into instances and targets. Ready to train the tree, right?? Not so
fast. One of the tough things about sklearn is that its models don't accept categorical variables directly. 
What does that mean? It means attributes like 'name' whose values are text rather than numbers. Those confuse the model and it
will reject you. No one likes rejection. So we need to do some more data manipulating. Getting a little tedious? This is a lesson
in ML - it's not all about the models. Having good, clean data is a huge priority and takes work. Lucky for you I'll handle this.
Just run the below to make sure you get what you need.

In [None]:
# find the offending (non-numeric) columns
bad_form_attributes_train = x_train.select_dtypes(include=[object])
bad_form_attributes_test = x_test.select_dtypes(include=[object])
print('these are the categorical columns ', bad_form_attributes_test.columns, '\n')

# drop the baddies from our nice x_train, x_test
x_train = x_train.drop(bad_form_attributes_train.columns, axis=1)
x_test = x_test.drop(bad_form_attributes_test.columns, axis=1)

# use sklearn's encoding tool to translate the categorical values into numeric codes.
# Fit ONE encoder on the training data and reuse it on the testing data, so a given
# value gets the same code in both sets. Values never seen in training become -1.
# astype(str) turns missing values into the string 'nan' so they get a code too.
encoder = preprocessing.OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoded_columns_train = pd.DataFrame(
    encoder.fit_transform(bad_form_attributes_train.astype(str)),
    columns=bad_form_attributes_train.columns,
    index=bad_form_attributes_train.index,
)
encoded_columns_test = pd.DataFrame(
    encoder.transform(bad_form_attributes_test.astype(str)),
    columns=bad_form_attributes_test.columns,
    index=bad_form_attributes_test.index,
)
print('check it out we encoded the baddies \n', encoded_columns_test)

# add the now transformed categorical variables back into your dataframes
# (the column names no longer overlap, so no suffixes are needed)
x_train = x_train.join(encoded_columns_train)
x_test = x_test.join(encoded_columns_test)
print('our new, no categorical training set \n', x_train)
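
To see what turning categories into numbers looks like in isolation, here is a rough sketch on a single made-up column using sklearn's `OrdinalEncoder`. The column name `port` and its values are invented for illustration. The thing to notice is that the encoder learns its mapping from the training rows only and then applies that same mapping to the testing rows, so a given category gets the same number in both.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical column; "X" never appears in the training rows.
toy_train = pd.DataFrame({"port": ["S", "C", "S", "Q"]})
toy_test = pd.DataFrame({"port": ["Q", "S", "X"]})

# Unknown categories at transform time are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the mapping from the training rows, then reuse it on the testing rows.
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)

print(encoder.categories_[0].tolist())  # categories in sorted order: code 0, 1, 2
print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

Keep in mind the codes are just labels assigned in sorted order; the numbers themselves carry no meaning such as size or rank.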

Task 5: Alright hotshot, time to create your Decision Tree. Go ahead and initialize one. Don't worry about setting any of the 
parameters right now, they all have defaults. Here's a helpful [link](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [None]:
decision_tree = tree.something_here_that_makes_a_tree
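
If you are curious what "they all have defaults" means in practice, sklearn estimators let you inspect their settings with `get_params()`. A small sketch (the name `toy_tree` is just for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Calling the class with no arguments builds an untrained tree with default settings.
toy_tree = DecisionTreeClassifier()

# get_params() returns the settings as a plain dictionary.
params = toy_tree.get_params()

print(params["criterion"])  # rule used to measure the quality of a split
print(params["max_depth"])  # None means the tree is allowed to keep growing
```

These are the knobs you can come back to later when you want to see how tuning changes the results.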

Task 6: Alright, you think you've got your tree? Time to train it with the x_train and y_train variables you created earlier.

In [None]:
decision_tree = decision_tree.train_me_somehow

Task 7: Your tree is trained! Hopefully! So let's use it to predict the survival of the testing set. What you should get from using your x_test dataset is an output of 1's and 0's representing the model's guess on survival for each row in the testing set.

In [None]:
predictions = decision_tree.do_something(x_test)
print(predictions)

Task 8: That's awesome, you made your first predictions with a machine learning model. But it's hard to really get a feel for what that means. So the model says that row x of the testing set died. Okay fine. But is it accurate? Did row x really die? Let's score the model. We have the actual answers for our testing set in y_test. Let's use them to see how you did.

In [None]:
score = decision_tree.score(x_test, y_test)
print(score)
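
As a recap of the train, predict, and score steps, here is a small sketch on made-up data showing what the score means for a classifier: the fraction of rows where the prediction matches the true answer. The toy arrays and names are invented for illustration and have nothing to do with the Titanic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: one feature, label is 1 when the feature is at least 5.
toy_x = np.arange(10).reshape(-1, 1)
toy_y = (toy_x.ravel() >= 5).astype(int)

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_x, toy_y)

# Made-up "testing" rows; the true label for 4 deliberately breaks the pattern.
new_x = np.array([[1], [4], [6], [9]])
true_y = np.array([0, 1, 1, 1])

predictions = toy_tree.predict(new_x)

# Accuracy by hand: the share of predictions that equal the true labels.
manual_accuracy = (predictions == true_y).mean()

print(predictions.tolist())
print(manual_accuracy, toy_tree.score(new_x, true_y))
```

The tree gets three of the four rows right, so both the hand-computed accuracy and `score` come out to 0.75.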