# **Supervised Learning with Decision Trees**

In this lab, you will get an introduction to supervised learning by solving a classification problem using **decision trees**.

Please open [this tutorial](https://youtu.be/7eh4d6sabA0) to follow the steps in this notebook.
This is a one-hour tutorial in which you will learn how to process a data set and use it to make predictions in a classification problem.

In the tutorial, the author installs Jupyter and supporting software. These are pre-installed in EdStem, so you will not need to follow those steps.

**Saturn shortcuts**

Press *Ctrl+return* to run each section separately. Some sections depend on earlier ones, so please run them in order.
You can also run the whole program at once by clicking the *Run All* button.



---



# Theory: Important libraries

In programming, libraries are collections of prewritten code that save you time because you do not have to write all the code yourself. The tutorial mentions a few libraries that are helpful when coding machine learning algorithms.

***Check point***



1. Provides a multi-dimensional array
2. Data analysis library that provides a data frame
3. 2-dimensional plotting library for creating graphs and plots
4. Provides common machine learning algorithms, such as decision trees
5. Popular machine learning project editor that makes it easy to inspect the data
6. Test


**In parentheses, write the number corresponding to the right item in the listing above as shown in the test example**
* pandas (2)
* matplotlib (3)
* sklearn (4)
* jupyter (5)
* numpy (1)
* test (**6**)


Note: skip the installation and setup of Jupyter.

# Warm Up: Data Frames
Go to kaggle.com and sign up with your Gmail account. This will be helpful in Lab II, where you will need to upload your own data set. This week, all data files have been provided for you.

The files relevant to supervised learning are *vgsales.csv* and *music.csv* saved in the SupervisedLearning folder.

To warm up, let's explore the file vgsales.csv. Feel free to try other ways of inspecting the data here, such as df.head(), len(df) and list(df).

In [2]:
# Importing the pandas module to process data files
import pandas as pd

# Using pandas to read in the input file, vgsales.csv saved under SupervisedLearning.
# The data file is saved in the data frame object named 'df'
df = pd.read_csv('SupervisedLearning/vgsales.csv')

Use this space to try different methods that inform you about the data set. Run the section to see the details.

In [3]:
# Write code below to inspect the data frame
df

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 259 | Asteroids | 2600 | 1980 | Shooter | Atari | 4.0 | 0.26 | 0 | 0.05 | 4.31 |
| 1 | 545 | Missile Command | 2600 | 1980 | Shooter | Atari | 2.56 | 0.17 | 0 | 0.03 | 2.76 |
| 2 | 1768 | Kaboom! | 2600 | 1980 | Misc | Activision | 1.07 | 0.07 | 0 | 0.01 | 1.15 |
| 3 | 1971 | Defender | 2600 | 1980 | Misc | Atari | 0.99 | 0.05 | 0 | 0.01 | 1.05 |
| 4 | 2671 | Boxing | 2600 | 1980 | Fighting | Activision | 0.72 | 0.04 | 0 | 0.01 | 0.77 |
| 5 | 4027 | Ice Hockey | 2600 | 1980 | Sports | Activision | 0.46 | 0.03 | 0 | 0.01 | 0.49 |
| 6 | 5368 | Freeway | 2600 | 1980 | Action | Activision | 0.32 | 0.02 | 0 | 0.0 | 0.34 |
| 7 | 6319 | Bridge | 2600 | 1980 | Misc | Activision | 0.25 | 0.02 | 0 | 0.0 | 0.27 |
| 8 | 6898 | Checkers | 2600 | 1980 | Misc | Atari | 0.22 | 0.01 | 0 | 0.0 | 0.24 |
| 9 | 240 | Pitfall! | 2600 | 1981 | Platform | Activision | 4.21 | 0.24 | 0 | 0.05 | 4.5 |


In [8]:
# Write code below to inspect the shape of the dataset 
df.shape

(49, 11)

In [6]:
# Write code to inspect some statistical information about each column
df.describe()

| | Rank | Platform | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|
| count | 49.0 | 49.0 | 49.0 | 49.0 | 49.0 | 49.0 | 49.0 | 49.0 |
| mean | 3525.285714 | 2600.0 | 1980.816327 | 0.875918 | 0.052449 | 0.0 | 0.00898 | 0.93898 |
| std | 2045.688118 | 0.0 | 0.39123 | 0.867467 | 0.053795 | 0.0 | 0.01141 | 0.930883 |
| min | 240.0 | 2600.0 | 1980.0 | 0.21 | 0.01 | 0.0 | 0.0 | 0.22 |
| 25% | 1850.0 | 2600.0 | 1981.0 | 0.33 | 0.02 | 0.0 | 0.0 | 0.36 |
| 50% | 3405.0 | 2600.0 | 1981.0 | 0.55 | 0.03 | 0.0 | 0.01 | 0.59 |
| 75% | 5248.0 | 2600.0 | 1981.0 | 1.03 | 0.06 | 0.0 | 0.01 | 1.1 |
| max | 7150.0 | 2600.0 | 1981.0 | 4.21 | 0.26 | 0.0 | 0.05 | 4.5 |


In [9]:
# Write code to output the values of the data frame
df.values

array([[259, 'Asteroids', 2600, 1980, 'Shooter', 'Atari', 4.0, 0.26, 0,
        0.05, 4.31],
       [545, 'Missile Command', 2600, 1980, 'Shooter', 'Atari', 2.56,
        0.17, 0, 0.03, 2.76],
       [1768, 'Kaboom!', 2600, 1980, 'Misc', 'Activision', 1.07, 0.07, 0,
        0.01, 1.15],
       [1971, 'Defender', 2600, 1980, 'Misc', 'Atari', 0.99, 0.05, 0,
        0.01, 1.05],
       [2671, 'Boxing', 2600, 1980, 'Fighting', 'Activision', 0.72, 0.04,
        0, 0.01, 0.77],
       [4027, 'Ice Hockey', 2600, 1980, 'Sports', 'Activision', 0.46,
        0.03, 0, 0.01, 0.49],
       [5368, 'Freeway', 2600, 1980, 'Action', 'Activision', 0.32, 0.02,
        0, 0.0, 0.34],
       [6319, 'Bridge', 2600, 1980, 'Misc', 'Activision', 0.25, 0.02, 0,
        0.0, 0.27],
       [6898, 'Checkers', 2600, 1980, 'Misc', 'Atari', 0.22, 0.01, 0,
        0.0, 0.24],
       [240, 'Pitfall!', 2600, 1981, 'Platform', 'Activision', 4.21,
        0.24, 0, 0.05, 4.5],
       [736, 'Frogger', 2600, 1981, 'Action', 

To learn more commands available to you in EdStem, please click on 'Commands' 
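The inspection calls mentioned in this warm-up (head(), len() and list()) can be tried on a tiny stand-in frame. The sketch below is only an illustration: the name `sample` and the choice of five columns are assumptions, and the three rows are copied from the output displayed above rather than read from vgsales.csv.

```python
import pandas as pd

# Stand-in frame built from three rows shown in the output above.
# In the lab itself, df comes from pd.read_csv('SupervisedLearning/vgsales.csv').
sample = pd.DataFrame({
    "Rank": [259, 545, 1768],
    "Name": ["Asteroids", "Missile Command", "Kaboom!"],
    "Year": [1980, 1980, 1980],
    "Genre": ["Shooter", "Shooter", "Misc"],
    "Global_Sales": [4.31, 2.76, 1.15],
})

first_rows = sample.head(2)   # the first two rows, returned as a new data frame
n_rows = len(sample)          # the number of rows
column_names = list(sample)   # iterating over a data frame yields its column labels

print(first_rows)
print(n_rows)
print(column_names)
```

The same three calls should work unchanged on `df` once the vgsales.csv cell has been run.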

# Classification Problem
We will follow these steps of solving a machine learning problem.


1. Import the Data
2. Clean the Data
3. Split the Data into Training and Testing sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and improve


# 1. Import the Data

In [16]:
# Using pandas to read in the input file, music.csv saved under SupervisedLearning.
# The data file is saved in a data frame object named 'music_data'
music_data = pd.read_csv('SupervisedLearning/music.csv')

# Write code to inspect music_data data frame
music_data

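| | age | gender | genre |
|---|---|---|---|
| 0 | 20 | 1 | HipHop |
| 1 | 23 | 1 | HipHop |
| 2 | 25 | 1 | HipHop |
| 3 | 26 | 1 | Jazz |
| 4 | 29 | 1 | Jazz |
| 5 | 30 | 1 | Jazz |
| 6 | 31 | 1 | Classical |
| 7 | 33 | 1 | Classical |
| 8 | 37 | 1 | Classical |
| 9 | 20 | 0 | Dance |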


# 2. Clean and Prepare the Data

The dataset does not contain any null values, so it does not need cleaning.
For now, we only need to split the data set into an input set (the features) and an output set (the labels to predict).
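One way to check the no-null claim is to count missing values per column. This is a minimal sketch, not part of the tutorial: `music_sample` is a hypothetical stand-in holding only the ten rows displayed in the previous section, so the result describes those rows rather than the full music.csv file.

```python
import pandas as pd

# Stand-in for music_data: the ten rows displayed in the previous section.
music_sample = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})

# Number of missing values in each column
null_counts = music_sample.isnull().sum()

# True if any cell in the frame is missing
has_nulls = bool(music_sample.isnull().values.any())

print(null_counts)
print(has_nulls)
```

Running `music_data.isnull().sum()` in the notebook applies the same check to the full data set.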

In [17]:
# Use the drop function to save the input set in a variable named X
X = music_data.drop(columns=['genre'])

# Write code to inspect X
X

| | age | gender |
|---|---|---|
| 0 | 20 | 1 |
| 1 | 23 | 1 |
| 2 | 25 | 1 |
| 3 | 26 | 1 |
| 4 | 29 | 1 |
| 5 | 30 | 1 |
| 6 | 31 | 1 |
| 7 | 33 | 1 |
| 8 | 37 | 1 |
| 9 | 20 | 0 |


In [18]:
# Save the output set in a variable named y
y = music_data['genre']

# Write code to inspect y
y

0        HipHop
1        HipHop
2        HipHop
3          Jazz
4          Jazz
5          Jazz
6     Classical
7     Classical
8     Classical
9         Dance
10        Dance
11        Dance
12     Acoustic
13     Acoustic
14     Acoustic
15    Classical
16    Classical
17    Classical
Name: genre, dtype: object
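Before training, it can help to see how many examples each genre has. The sketch below rebuilds the 18 labels exactly as printed in the output of y above (the name `genres` is an assumption for illustration) and counts them with value_counts().

```python
import pandas as pd

# The 18 genre labels, in the order printed in the output of y above.
genres = pd.Series(
    ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
    + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
    name="genre",
)

# Number of rows per genre, most frequent first
class_counts = genres.value_counts()
print(class_counts)
```

In the notebook, `y.value_counts()` gives the same summary directly from the data frame column.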

# 3. Learn and predict with the decision tree algorithm

In [20]:
# Importing the DecisionTreeClassifier class from sklearn
from sklearn.tree import DecisionTreeClassifier

# Define the decision tree classifier model
model = DecisionTreeClassifier()

# Fit the model with X and y
model.fit(X, y)

# With the fitted model, predict the genre for two people: a 21-year-old male and a 22-year-old female
predictions = model.predict([[21,1], [22, 0]])


# Write code to output the predictions array
predictions



array(['HipHop', 'Dance'], dtype=object)
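Assuming scikit-learn 1.0 or later, a model fitted on a data frame remembers the column names, and passing a plain list to predict() typically produces a warning about missing feature names. The sketch below shows one way to avoid that by wrapping the new inputs in a data frame with the same columns. The names `train`, `demo_model` and `new_people` are hypothetical, and the training data is only the ten rows displayed in section 1, so the second prediction may differ from the notebook's result.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data: the ten rows of music_data displayed in section 1.
train = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})
X_demo = train[["age", "gender"]]
y_demo = train["genre"]

# random_state fixes the tie-breaking inside the tree so reruns agree
demo_model = DecisionTreeClassifier(random_state=0)
demo_model.fit(X_demo, y_demo)

# New inputs with the same column names and order as the training input
new_people = pd.DataFrame([[21, 1], [22, 0]], columns=["age", "gender"])
demo_predictions = demo_model.predict(new_people)
print(demo_predictions)
```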

# 4. Evaluate and Modify the Model

In [23]:
# Import the train_test_split and accuracy_score functions from sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Use 80% of the data set for training and the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define the decision tree classifier model
model = DecisionTreeClassifier()

# Fit the model with training sets, X_train and y_train
model.fit(X_train, y_train)

# With the fitted model, predict the testing set, X_test
predictions = model.predict(X_test)

# Compute model accuracy using the accuracy_score metric from sklearn
score = accuracy_score(y_test, predictions)

# Write code to output the score value
score

1.0
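A score of 1.0 comes from a single random split. With 18 rows, test_size=0.2 leaves only about four rows for testing, and because train_test_split is called without a fixed seed the score can change from run to run. One way to get a steadier estimate is to repeat the split several times and average the scores. This is a minimal sketch under stated assumptions: `sample` holds only the ten rows displayed in section 1, and the choice of 20 repetitions is arbitrary.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the ten rows of music_data displayed in section 1.
sample = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})
X_demo = sample[["age", "gender"]]
y_demo = sample["genre"]

scores = []
for seed in range(20):
    # A different seed gives a different, but reproducible, split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

The spread of values in `scores` gives a rough sense of how much a single split can mislead on a data set this small.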

# 5. Persisting Models

a. Save the model to a file

In [24]:
# Import the joblib library to save the model for future reuse
import joblib

# Redefine input and output sets, X and y
X = music_data.drop(columns = ['genre'])
y = music_data['genre']

# Define the decision tree classifier model
model = DecisionTreeClassifier()

# Fit the model with X and y
model.fit(X, y)

# Using joblib, write code to save the model 
# to a file named music-recommender.joblib in the SupervisedLearning folder
joblib.dump(model, 'SupervisedLearning/music-recommender.joblib')

['SupervisedLearning/music-recommender.joblib']

b. Import the model and make predictions

In [27]:
# Make sure that you have run the previous
# section 5.a at least once, and that the file 'music-recommender.joblib' exists in the SupervisedLearning folder.

# Load saved model
model = joblib.load('SupervisedLearning/music-recommender.joblib')

# With the fitted model, predict a value
predictions = model.predict([[21,1]])


# Write code to output the predictions array
predictions



array(['HipHop'], dtype=object)
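To check that a saved model behaves like the original, one option is to dump it, load it back, and compare predictions. The sketch below writes to a temporary directory so it does not touch the SupervisedLearning folder. The file name `demo-model.joblib` and the variable names are hypothetical, and the training data is only the ten rows displayed in section 1.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data: the ten rows of music_data displayed in section 1.
sample = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})
X_demo = sample[["age", "gender"]]
y_demo = sample["genre"]

original = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load the model back into a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")  # hypothetical file name
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same_predictions = bool((original.predict(X_demo) == restored.predict(X_demo)).all())
print(same_predictions)
```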

# 6. Earn Your Wings: Visualize the Decision Tree

In [28]:
from sklearn import tree

tree.export_graphviz(model, out_file='SupervisedLearning/music-recommender.dot',
                     feature_names=['age', 'gender'],
                     class_names=sorted(y.unique()),
                     label='all',
                     rounded=True,
                     filled=True)

The dot file will be added to the list of files under SupervisedLearning.
Please follow the instructions in the tutorial to visualize the tree using VS Code.

**Note**: If the graphviz extension in the tutorial does not work for you, please use the other extension in the list (by Joao Pinto).  

If you have successfully opened the dot file in VS Code, take a screenshot, upload it into the SupervisedLearning directory, and insert it in the text box below, by using the image icon to embed it.

![alt text](https://raw.githubusercontent.com/martin-jimenez-01/AI4ALL/main/Screen%20Shot%202022-03-15%20at%206.18.50%20PM.png)
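As a complement to the VS Code view, scikit-learn can also print a plain-text rendering of a fitted tree with export_text. This is a minimal sketch rather than a step from the tutorial: `demo_model` is trained on only the ten rows displayed in section 1, so its splits will not match the tree in the screenshot.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in training data: the ten rows of music_data displayed in section 1.
sample = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})
X_demo = sample[["age", "gender"]]
y_demo = sample["genre"]

demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Each line of the result is one split or one leaf of the tree
tree_text = export_text(demo_model, feature_names=["age", "gender"])
print(tree_text)
```

Calling export_text on the notebook's own `model` with the same feature names should print the rules behind the graph in the dot file.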