This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
Enter in the text cell below what you will be predicting in this classification problem (y) and which columns will be used in the prediction (X).

The goal is to predict the level of humor in different people. In the code below, y is the `accuracy` column and X is every other column: the question responses, the four humor-style scores, age, and gender.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [2]:
df = pd.read_csv('cleanedfile.csv')

2. Display columns and describe the data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1071 entries, 0 to 1070
Data columns (total 39 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Q1             1071 non-null   int64  
 1   Q2             1071 non-null   int64  
 2   Q3             1071 non-null   int64  
 3   Q4             1071 non-null   int64  
 4   Q5             1071 non-null   int64  
 5   Q6             1071 non-null   int64  
 6   Q7             1071 non-null   int64  
 7   Q8             1071 non-null   int64  
 8   Q9             1071 non-null   int64  
 9   Q10            1071 non-null   int64  
 10  Q11            1071 non-null   int64  
 11  Q12            1071 non-null   int64  
 12  Q13            1071 non-null   int64  
 13  Q14            1071 non-null   int64  
 14  Q15            1071 non-null   int64  
 15  Q16            1071 non-null   int64  
 16  Q17            1071 non-null   int64  
 17  Q18            1071 non-null   int64  
 18  Q19     

In [4]:
df.describe()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender,accuracy
count,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,...,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0,1071.0
mean,2.02521,3.34267,3.078431,2.8338,3.59944,4.152194,3.277311,2.535014,2.582633,2.869281,...,3.945845,2.767507,2.838469,4.010644,3.375537,2.956583,2.762745,70.966387,1.455649,87.542484
std,1.075782,1.112898,1.167877,1.160252,1.061281,0.979315,1.099974,1.23138,1.22453,1.205013,...,1.135189,1.309601,1.233889,0.708479,0.661533,0.41087,0.645982,1371.989249,0.522076,12.038483
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,1.3,0.0,0.0,0.0,14.0,0.0,2.0
25%,1.0,3.0,2.0,2.0,3.0,4.0,3.0,2.0,2.0,2.0,...,3.0,2.0,2.0,3.6,2.9,2.8,2.3,18.5,1.0,80.0
50%,2.0,3.0,3.0,3.0,4.0,4.0,3.0,2.0,2.0,3.0,...,4.0,3.0,3.0,4.1,3.4,3.0,2.8,23.0,1.0,90.0
75%,3.0,4.0,4.0,4.0,4.0,5.0,4.0,3.0,3.0,4.0,...,5.0,4.0,4.0,4.5,3.8,3.3,3.1,31.0,2.0,95.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.1,5.0,5.0,5.0,44849.0,3.0,100.0
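The summary above shows a minimum of -1 in the question columns and a maximum age of 44849, which suggests the file still contains entries worth cleaning. The following is a minimal sketch of one way to filter them. It assumes that -1 marks an unanswered question and that ages above 100 are data-entry errors; neither is stated in the data itself. The helper `clean_survey`, its `max_age` parameter, and the small `demo` frame are hypothetical and only stand in for `cleanedfile.csv`.

```python
import pandas as pd


def clean_survey(df, question_cols, max_age=100):
    """Drop rows with an unanswered question (-1) or an implausible age.

    Assumptions: -1 means "no answer" and ages above max_age are errors.
    """
    answered = (df[question_cols] != -1).all(axis=1)
    plausible_age = df["age"].between(0, max_age)
    return df[answered & plausible_age].reset_index(drop=True)


# Tiny synthetic frame standing in for cleanedfile.csv
demo = pd.DataFrame({
    "Q1": [2, -1, 3, 4],
    "Q2": [3, 4, 5, 1],
    "age": [25, 30, 44849, 18],
    "accuracy": [100, 90, 75, 85],
})

cleaned = clean_survey(demo, question_cols=["Q1", "Q2"])
print(cleaned)
```

On the real data, `question_cols` would be the list of `Q1` to `Q32` column names. Whether to drop such rows or impute them is a design choice; dropping is simply the shortest option to show here.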


3. Prepare Data

In [6]:
# Run this section to inspect X
X = df.drop(columns = ['accuracy'])
X

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q29,Q30,Q31,Q32,affiliative,selfenhancing,agressive,selfdefeating,age,gender
0,2,2,3,1,4,5,4,3,4,3,...,2,4,2,2,4.0,3.5,3.0,2.3,25,2
1,2,3,2,2,4,4,4,3,4,3,...,4,4,3,1,3.3,3.5,3.3,2.4,44,2
2,3,4,3,3,4,4,3,1,2,4,...,2,5,4,2,3.9,3.9,3.1,2.3,50,1
3,3,3,3,4,3,5,4,3,-1,4,...,4,5,3,3,3.6,4.0,2.9,3.3,30,2
4,1,4,2,2,3,5,4,1,4,4,...,2,5,4,2,4.1,4.1,2.9,2.0,52,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1066,3,2,3,3,2,5,3,2,3,4,...,5,4,4,4,2.5,3.3,2.9,3.0,18,2
1067,1,4,5,2,4,4,1,2,2,5,...,1,4,1,2,4.8,3.9,2.5,2.4,31,1
1068,1,4,4,5,4,4,3,5,4,3,...,2,4,1,5,4.4,3.9,3.0,4.3,15,1
1069,3,4,4,3,3,4,3,2,4,3,...,3,4,3,3,3.1,3.6,2.9,2.8,21,2


In [9]:
# Run this section to inspect y
y = df['accuracy']
y

0       100
1        90
2        75
3        85
4        80
       ... 
1066     95
1067     95
1068     95
1069     87
1070     75
Name: accuracy, Length: 1071, dtype: int64

4. Train the model and calculate accuracy

In [10]:
# Train on 80% of the data set and use the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.13953488372093023
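A score of about 0.14 is low. One likely reason (an assumption, not verified on this data) is that `accuracy` takes many distinct integer values, and the classifier treats each value as a separate class. The sketch below shows one possible way to approach the "Evaluate and Improve" step: group the target into a few coarse bins, fix `random_state` so the split is reproducible, and compare the tree against a majority-class baseline. The bin edges, the labels, `max_depth=3`, and the synthetic `X_demo`/`y_demo` data are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (random, so no real signal is expected)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(300, 5)),
                      columns=[f"Q{i}" for i in range(1, 6)])
y_demo = pd.Series(rng.integers(50, 101, size=300), name="accuracy")

# Hypothetical bin edges: (0, 79] -> low, (79, 94] -> medium, (94, 100] -> high
y_binned = pd.cut(y_demo, bins=[0, 79, 94, 100],
                  labels=["low", "medium", "high"]).astype(str)

# A fixed random_state makes the split, and therefore the score, reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_binned, test_size=0.2, random_state=0, stratify=y_binned)

# A shallow tree is less prone to memorizing the training set
tree_model = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_model.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree_model.predict(X_test))

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print(f"tree: {tree_acc:.3f}  baseline: {baseline_acc:.3f}")
```

Comparing against the baseline helps judge whether the tree has learned anything beyond the class frequencies. If the exact numeric value matters, a regressor such as `DecisionTreeRegressor` may be a better fit than binning.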

5. Persisting Models

In [11]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']
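As a quick sanity check on persistence, a saved model can be loaded back and compared with the original. The sketch below does this on a toy model written to a temporary directory; the names `model_demo`, `restored`, and `demo_model.joblib` are hypothetical and unrelated to the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data and model standing in for the notebook's model
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3]])
y_demo = np.array([0, 0, 1, 1])
model_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(model_demo.predict(X_demo),
                                  restored.predict(X_demo))
print(same_predictions)
```

A file saved with joblib is generally only safe to load with a compatible scikit-learn version, and only from a source you trust.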

5.b. Import the model and make predictions

In [12]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array([ 90,  85,  90,  90,  60,  90,  70,  95,  85,  60,  80,  98,  66,
        90,  85,  96, 100,  90,  80,  99,  70, 100,  70, 100,  80, 100,
        90,  70,  75,  75, 100,  80,  90,  95,  90,  60,  70,  95,  98,
        90, 100,  90,  80,  99,  85,  80,  93,  90, 100, 100,  80,  98,
        70,  90,  80,  80,  80,  75,  80,  90,  99,  90,  99,  90,  99,
        90,  75,  75,  90,  85,  99,  90,  90,  95,  98,  80,  95,  75,
       100,  96,  95, 100, 100,  90,  90,  80,  90, 100, 100,  90,  95,
        85,  99,  75,  75,  95,  80,  95,  80, 100, 100,  70,  75,  80,
        90, 100, 100,  85,  95,  70,  80,  90, 100,  90, 100,  95,  96,
       100,  85,  80,  90,  75,  80,  98,  60, 100,  90,  60,  70,  80,
        90,  80,  85,  80,  90,  90,  99,   9, 100,  75,  75, 100,  75,
        93,  90,  85,  80,  75,  96,  90,  90,  60,  95,  90,  10, 100,
        75,  80,  80,  75,  90,  90,  95,  90, 100,  90,  85,  90,  90,
        75, 100,  90,  85,  85,  90,  86,  90,  90,  95,  85,  9

6. (Optional) Visualize decision trees

In [13]:
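# class_names must be a list of strings, one per class the model learned;
# str(sorted(...)) would produce a single string instead of a list.
tree.export_graphviz(model, out_file = 'MODELNAME.dot',
                    feature_names = list(X.columns), 
                    class_names = [str(c) for c in model.classes_], 
                    label = 'all',
                    rounded = True, 
                    filled = True)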

# Download the file MODELNAME.dot and open it in VS Code.
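If no Graphviz viewer is at hand, the tree's rules can also be printed as plain text with `sklearn.tree.export_text`. This is a minimal sketch on a toy model; `X_demo`, `y_demo`, and the two feature names are hypothetical stand-ins for the notebook's `X` and `y`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the survey features and target
X_demo = [[1, 5], [2, 4], [4, 2], [5, 1]]
y_demo = [0, 0, 1, 1]

model_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
model_demo.fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string
rules = export_text(model_demo, feature_names=["Q1", "Q2"])
print(rules)
```

For the notebook's model, the equivalent call would pass `model` and `list(X.columns)`. A fully grown tree on 38 features can be very long, so limiting the printed depth with the `max_depth` argument of `export_text` may help.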
