# Pandas Selecting Lab

### Introduction

In this lab, we'll start by collecting our data from GitHub and finish by training a machine learning model. Let's get started.

### Loading the data

Let's work with a different movie dataset, this one from IMDb. The path to the CSV file is assigned to `url` below.

In [2]:
url = './imdb_movies.csv'

Now load the data into a pandas dataframe and store it as `df`.

In [9]:
import pandas as pd
df = pd.read_csv(url)

In [10]:
df[:2]

| | title | genre | budget | runtime | year | month | revenue |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | Action | 237000000 | 162.0 | 2009 | 12 | 2787965087 |
| 1 | Pirates of the Caribbean: At World's End | Adventure | 300000000 | 169.0 | 2007 | 5 | 961000000 |
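For readers without the CSV at hand, the loading step can be sketched with an in-memory file. The two rows are copied from the output above; using `io.StringIO` in place of the file path, and the name `sample_df`, are assumptions made only so the snippet runs on its own.

```python
import io
import pandas as pd

# Two rows copied from the lab's output; StringIO stands in for './imdb_movies.csv'.
csv_text = """title,genre,budget,runtime,year,month,revenue
Avatar,Action,237000000,162.0,2009,12,2787965087
Pirates of the Caribbean: At World's End,Adventure,300000000,169.0,2007,5,961000000
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Slicing with [:2] returns the first two rows, the same rows .head(2) gives.
first_two = sample_df[:2]
print(first_two.equals(sample_df.head(2)))
print(sample_df.shape)
```

`pd.read_csv` accepts a path, a URL, or any file-like object, which is why the same call works for both the lab's file and this in-memory stand-in.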


Now that we have loaded the data, let's explore it.  We can start by viewing the list of columns in our dataframe.

In [13]:
df_cols = df.columns
df_cols

Index(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'], dtype='object')

OK, now that we have our list of columns, let's select a single column to be our target. In this case, our target is revenue. Go ahead and assign that column of data to the variable `y`.

In [15]:
y = df.revenue
y

0       2787965087
1        961000000
2        880674609
3       1084939099
4        284139100
           ...    
1995      53187659
1996             0
1997      47351251
1998      36642838
1999          6399
Name: revenue, Length: 2000, dtype: int64
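Selecting one column can be written two ways. The sketch below uses a small hypothetical frame (`toy`, with made-up values) to show that attribute access and bracket access return the same pandas Series.

```python
import pandas as pd

# Hypothetical three-row frame that borrows the lab's column names.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 200, 300],
    "revenue": [150, 180, 900],
})

by_attribute = toy.revenue     # attribute access, as used in the lab
by_bracket = toy["revenue"]    # bracket access, works for any column name

print(type(by_bracket).__name__)        # a single column is a Series
print(by_attribute.equals(by_bracket))  # both spellings select the same data
```

Bracket access is the safer habit when a column name contains spaces or collides with an existing dataframe attribute such as `count`.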

Ok, now it's time to assign the appropriate features to X.  Let's take another look at our columns.

In [16]:
df.columns

Index(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'], dtype='object')

OK, it looks like `genre`, `budget`, `runtime`, `year`, and `month` would each make good features. However, remember that our model can only work with numbers. Is each of these columns full of numbers? The easiest way to find out is to check the `dtypes` attribute of the dataframe.

In [17]:
df.dtypes

title       object
genre       object
budget       int64
runtime    float64
year         int64
month        int64
revenue      int64
dtype: object
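The same check can also be done in code rather than by eye. This is a minimal sketch on a hypothetical frame (`toy`, with made-up values); it uses `is_numeric_dtype` from `pandas.api.types` to list the columns a numeric-only model could accept.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame mixing text and numeric columns.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "genre": ["Action", "Drama"],
    "budget": [100, 200],
    "runtime": [90.0, 120.5],
})

# Keep only the columns whose dtype is numeric (integers or floats here).
numeric_cols = [col for col in toy.columns if is_numeric_dtype(toy[col])]
print(numeric_cols)
```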

Every time we see `object`, it means the data in the column is stored as generic Python objects, which in practice usually means strings. So `genre` is not a number, nor is `title`, but we can use the columns `budget`, `runtime`, `year`, and `month` as features. Select these columns and assign them to the variable `X`.

> Hint: Remember it is often easiest to assign a list of columns to a variable, and then select the columns.

In [20]:
cols = ['budget', 'runtime', 'year', 'month']

X = df[cols]
X.columns

Index(['budget', 'runtime', 'year', 'month'], dtype='object')
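Listing the columns by hand is the approach the lab uses. As a possible alternative, the sketch below lets pandas pick the numeric columns with `select_dtypes` and then drops the target. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one text column, two features, and the target.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "budget": [100, 200],
    "runtime": [90.0, 120.5],
    "revenue": [150, 180],
})

# Explicit list of columns, as in the lab: the result is a DataFrame.
cols = ["budget", "runtime"]
X_explicit = toy[cols]

# Alternative: keep every numeric column, then drop the target column.
X_auto = toy.select_dtypes(include="number").drop(columns="revenue")

print(X_explicit.equals(X_auto))
```

The explicit list is easier to audit, while `select_dtypes` saves typing on wide datasets but will also pick up numeric columns you may not want, such as IDs.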

### Training our model

Now that we have our feature data stored as `X`, and our target data stored as the variable `y`, and everything in both datasets is a number, it's time to train our model.

First, from the `sklearn.tree` module, import the `DecisionTreeRegressor` model.

In [24]:
from sklearn.tree import DecisionTreeRegressor

Next, initialize an instance of the `DecisionTreeRegressor` model, and then train the model on our features `X` and target `y`. 

In [25]:
model = DecisionTreeRegressor()
model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [27]:
model.predict(X[:2])

array([2.78796509e+09, 9.61000000e+08])
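The predictions above match the first two revenue values exactly. A likely reason is that the model is predicting rows it was trained on, and a decision tree with default settings keeps splitting until it has memorized the training targets. The sketch below reproduces this on hypothetical data (`X_toy`, `y_toy`); `random_state=0` is an assumption added only for repeatability.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: four movies, two features (budget, runtime).
X_toy = np.array([[100, 90.0], [200, 120.0], [300, 95.0], [400, 150.0]])
y_toy = np.array([150.0, 180.0, 900.0, 400.0])

tree = DecisionTreeRegressor(random_state=0)  # random_state only for repeatability
tree.fit(X_toy, y_toy)

# Predicting on training rows returns the training targets themselves.
train_predictions = tree.predict(X_toy[:2])
print(train_predictions)

# An unseen row still gets a prediction, taken from one of the memorized leaves.
unseen_prediction = tree.predict(np.array([[250, 100.0]]))
print(unseen_prediction)
```

A perfect score on training rows therefore says little about how the model handles new movies; checking predictions on rows held out from training is the usual way to find out.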

### Summary

In this lesson, we went from loading our data from a CSV file, to selecting our data with pandas, to modeling our data with `sklearn`. If you were able to do this successfully, give yourself a nice smile and a big exhale. Really feel that dopamine hit. In the next lessons, we'll learn a little more about data munging, and then we'll finish up with one more important concept in data science.