# Selecting And Training Lab

### Introduction

In this lab, we'll start by collecting our data from GitHub and finish by training a machine learning model.  Let's get started.

### Loading the data

Let's work with a movie dataset from IMDb.  The CSV file is [located here](https://raw.githubusercontent.com/jigsawlabs-student/decision-trees-intro/master/3-practical-ds-4/imdb_movies.csv).

In [2]:
url = 'https://raw.githubusercontent.com/jigsawlabs-student/decision-trees-intro/master/3-practical-ds-4/imdb_movies.csv'

Press `shift + return` to execute the cell above.

Now load the data at the `url` above into a pandas dataframe and assign it to the variable `df`.
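
As a rough sketch of how pandas reads CSV data, the example below parses a small in-memory CSV instead of the remote file. The rows and the names `csv_text` and `toy_df` are made up for illustration; `pd.read_csv` also accepts a URL or a file path in place of the buffer.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a real file.
csv_text = """title,budget,revenue
Movie A,100,250
Movie B,200,150
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.shape)
```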

In [10]:
import pandas as pd

df = None

In [4]:
df[:2]

# 	title	genre	budget	runtime	year	month	revenue
# 0	Avatar	Action	237000000	162.0	2009	12	2787965087
# 1	Pirates of the Caribbean: At World's End	Adventure	300000000	169.0	2007	5	961000000



Now that we have loaded the data, let's explore it.  We can start by viewing the list of columns in our dataframe.

In [11]:
df_cols = None
df_cols

# Index(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'], dtype='object')

Ok, now that we have our list of columns, let's select the single column that will be our target.

In this case, our target is revenue.  Go ahead and assign that column of data to the variable `y`.

In [None]:
y = None
y[:3]
# 0       2787965087
# 1        961000000
# 2        880674609
# Name: revenue, Length: 2000, dtype: int64

Now that we have selected our target and assigned it to the variable `y`, it's time to assign the appropriate features to `X`.  

Let's take another look at our columns to get a sense of which columns we can select as features.

In [16]:
df.columns

Index(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'], dtype='object')

It looks like `genre`, `budget`, `runtime`, `year` and `month` would each make good features.  However, remember that our model can only work with numbers.  Is each of these columns full of numbers?

> Select the first three rows of the dataframe to take a look.

In [6]:
# try it here 


# 	title	genre	budget	runtime	year	month	revenue
# 0	Avatar	Action	237000000	162.0	2009	12	2787965087
# 1	Pirates of the Caribbean: At World's End	Adventure	300000000	169.0	2007	5	961000000
# 2	Spectre	Action	245000000	148.0	2015	10	880674609

Well, `budget`, `runtime`, `year`, and `month` are all numbers.  Select those four columns and assign them to the variable `X`.

> Hint: Remember it is often easiest to assign a list of columns to a variable, and then select the columns.
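
To illustrate the hint on unrelated data, here is a minimal sketch of selecting columns with a list. The dataframe and the names `toy_df`, `feature_cols`, `toy_features`, and `single_column` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the movie dataset.
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'height': [1.0, 2.0, 3.0],
    'weight': [10, 20, 30],
})

# Store the column names in a list, then pass the list to the dataframe.
feature_cols = ['height', 'weight']
toy_features = toy_df[feature_cols]   # a list of names returns a DataFrame

single_column = toy_df['height']      # a single name returns a Series
print(toy_features.columns)
```

Keeping the column names in a variable makes it easy to reuse or change the feature list later.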

In [7]:


X = None

> Check that you did this correctly by pressing `shift + return` on the cell below.

In [24]:
X.columns

# Index(['budget', 'runtime', 'year', 'month'], dtype='object')


> We had to leave the `genre` column out of our features for now because we have not yet seen how to convert text data into numbers.
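
Besides looking at the first few rows, one way to check which columns hold numbers is to inspect the column dtypes. The sketch below assumes a small made-up dataframe (`toy_df`) with the same mix of text and numeric columns.

```python
import pandas as pd

# Hypothetical toy data mixing text and numeric columns.
toy_df = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'genre': ['Action', 'Drama'],
    'budget': [100, 200],
    'runtime': [90.0, 120.5],
})

# dtypes lists the type pandas inferred for each column.
print(toy_df.dtypes)

# select_dtypes keeps only the columns whose dtype is numeric.
numeric_cols = toy_df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```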

### Training our model

Now that we have our feature data stored as `X`, and our target data stored as the variable `y`, and everything in both datasets is a number, it's time to train our model.

1. From the sklearn tree module, import the `DecisionTreeRegressor` model.

> We use the `DecisionTreeRegressor` because our targets are continuous numbers.  We use the `DecisionTreeClassifier` when the targets are categories, such as customer or not a customer.

If you'd like to learn more about classification vs regression algorithms, we can [look to Wikipedia](https://en.wikipedia.org/wiki/Machine_learning#Machine_learning_tasks).
> Classification algorithms are used when the outputs are restricted to a limited set of values, and regression algorithms are used when the outputs may have any numerical value within a range. 

So here, because the revenue can be any number from 0 upward, we use the `DecisionTreeRegressor` instead of the `DecisionTreeClassifier`.  Ok, let's load it up.
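
The difference can be sketched on a tiny made-up dataset: the same inputs are paired once with numeric targets and once with category labels. The data and the names `toy_inputs`, `amounts`, and `labels` are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical inputs: one feature, four observations.
toy_inputs = [[1], [2], [3], [4]]
amounts = [10.0, 20.0, 30.0, 40.0]   # continuous targets -> regression
labels = ['no', 'no', 'yes', 'yes']  # category targets -> classification

regressor = DecisionTreeRegressor(random_state=0).fit(toy_inputs, amounts)
classifier = DecisionTreeClassifier(random_state=0).fit(toy_inputs, labels)

# The regressor returns a number, the classifier returns one of the labels.
amount_prediction = regressor.predict([[3]])
label_prediction = classifier.predict([[3]])
print(amount_prediction, label_prediction)
```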

> Press `shift + return` on the cell below.

In [25]:
from sklearn.tree import DecisionTreeRegressor

Next, we initialize an instance of the `DecisionTreeRegressor` model. 

In [26]:
model = DecisionTreeRegressor()


# DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
#                       max_leaf_nodes=None, min_impurity_decrease=0.0,
#                       min_impurity_split=None, min_samples_leaf=1,
#                       min_samples_split=2, min_weight_fraction_leaf=0.0,
#                       presort=False, random_state=None, splitter='best')


Then, train the model on our features `X` and target `y` with the model's `fit` method.

Finally, use the `predict` method to make predictions on the first two observations in the training data.

In [1]:
first_two_observations = X[:2]

predictions = None
predictions
# array([2.78796509e+09, 9.61000000e+08])


And if we do not like scientific notation, we can convert our numbers to a list to see the predictions.

In [49]:
list(predictions)
# [2787965087.0, 961000000.0]
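
Depending on the installed NumPy version, `list(...)` on an array may display each element wrapped as a NumPy scalar rather than as a bare number. A possible alternative is the array's `tolist` method, which returns plain Python floats. The array below is a stand-in built from the two values shown above, not the output of a trained model.

```python
import numpy as np

# Stand-in for the predictions array, using the two values shown in this lab.
toy_predictions = np.array([2787965087.0, 961000000.0])

# tolist converts the array into a list of built-in Python floats.
as_floats = toy_predictions.tolist()
print(as_floats)
```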


> **Disclaimer**: While we did excellent work training a model on some real data, this does not mean that we can begin relying on our predictions.  There is still more to learn before that.  For example, this model suffers from [overfitting](https://en.wikipedia.org/wiki/Overfitting), a concept this lab does not cover.
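
For a first impression of what overfitting looks like, here is a rough sketch on synthetic data, not the movie dataset. It assumes a noisy linear relationship and uses `train_test_split`, which this lab has not introduced, to hold out half of the rows. A fully grown tree reproduces its training targets exactly, much like the predictions above match the revenues we trained on, but it scores lower on rows it has not seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus random noise (an assumption for illustration).
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = 3 * toy_X[:, 0] + rng.normal(0, 5, size=200)

# Hold out half of the rows so the model can be scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# score returns R^2: 1.0 means the predictions match the targets exactly.
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print(train_score, test_score)
```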

### Summary

In this lesson, we went from collecting our data from a CSV file, to selecting our data with pandas, to modeling our data with `sklearn`.  If you were able to do this successfully, give yourself a nice smile and a big exhale.  Really feel that dopamine hit.  You did great work.


### Answers

In [8]:
df = pd.read_csv(url)

In [None]:
y = df.revenue
y[:3]

In [None]:
df[:3]

In [None]:
cols = ['budget', 'runtime', 'year', 'month']

X = df[cols]

In [None]:
model.fit(X, y)

In [None]:
model.predict(first_two_observations)