This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
In the text cell below, state what you will predict in this classification problem (y) and which columns you will use for the prediction (X).

In [13]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [15]:
df = pd.read_csv('cleanedfile.csv')

2. Display columns and describe the data set

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   social_support  138 non-null    float64
 1   health          138 non-null    float64
 2   continent       138 non-null    object 
dtypes: float64(2), object(1)
memory usage: 3.4+ KB


In [17]:
df.describe()

       social_support    health
count           138.0     138.0
mean         1.170502  0.710566
std          0.273588  0.236027
min          0.352428  0.108744
25%          1.003694  0.537377
50%          1.219703  0.766572
75%          1.393475  0.864669
max          1.547567  1.137814
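Before training a classifier, it can help to check how many rows belong to each class, because the share of the most frequent class is the accuracy you would get by always guessing it. The sketch below uses a small hypothetical stand-in frame (`toy`) with the same column names as `df`; the values are illustrative and are not taken from `cleanedfile.csv`.

```python
import pandas as pd

# Hypothetical stand-in for df; the real notebook loads 'cleanedfile.csv'.
toy = pd.DataFrame({
    "social_support": [1.50, 1.47, 1.09, 0.87, 0.52, 1.05],
    "health": [0.96, 1.04, 0.49, 0.44, 0.57, 0.38],
    "continent": ["Europe", "Europe", "Africa", "Africa", "Africa", "Asia"],
})

# Count how many rows belong to each class of the target column.
class_counts = toy["continent"].value_counts()
print(class_counts)

# Share of the most frequent class: the accuracy of always guessing it.
majority_baseline = class_counts.max() / len(toy)
print(f"Majority-class baseline: {majority_baseline:.2f}")
```

On the real data, the same two calls on `df['continent']` would give a reference point against which to read the accuracy score computed later.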


3. Prepare Data

In [30]:
# Run this section to inspect X
X = df.drop(columns = ['continent'])
X

     social_support    health
0          1.499526  0.961271
1          1.503449  0.979333
2          1.472403  1.040533
3          1.547567  1.000843
4          1.495173  1.008072
...             ...       ...
133        1.085695  0.494102
134        0.872675  0.442678
135        0.522876  0.572383
136        1.047835  0.375038


In [31]:
y = df['continent']
y

0      Europe
1      Europe
2      Europe
3      Europe
4      Europe
        ...  
133    Africa
134    Africa
135    Africa
136    Africa
137      Asia
Name: continent, Length: 138, dtype: object

4. Split the data, train the model, and calculate accuracy

In [66]:
# Train on 80% of the data set and hold out the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.6428571428571429
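The score above comes from a single random split, and because `train_test_split` shuffles the rows and no `random_state` is set, it will change from run to run. One possible way to approach the "evaluate and improve" step is to fix the seed, stratify the split by class, and average over several folds with cross-validation. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification` rather than the happiness data, and `max_depth=3` and the seed values are arbitrary choices, not settings from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 138 rows, 2 features, 3 classes (assumed shapes).
X_demo, y_demo = make_classification(
    n_samples=138, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Fixed seed makes the split repeatable; stratify keeps class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# A shallow tree is less prone to memorizing the training rows.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)
holdout_score = clf.score(X_te, y_te)
print(f"Hold-out accuracy: {holdout_score:.3f}")

# 5-fold cross-validation gives a less split-dependent estimate.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=42), X_demo, y_demo, cv=5
)
print(f"Cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

With the notebook's own `X` and `y` in place of the synthetic arrays, the same calls apply unchanged.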

5. Persisting Models

In [67]:
# Save the model to file
joblib.dump(model, 'happiness.joblib')


['happiness.joblib']

5.b. Import the model and make predictions

In [68]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('happiness.joblib')
predictions = model.predict(X_test)
predictions

array(['Europe', 'North America', 'Europe', 'Africa', 'Europe', 'Europe',
       'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Africa', 'Europe',
       'Africa', 'Africa', 'Asia', 'Europe', 'Europe', 'Europe', 'Asia',
       'Africa', 'Asia', 'Asia', 'Asia', 'North America', 'Asia',
       'Africa', 'Africa'], dtype=object)
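A loaded model can also score observations that were never part of the data set. The sketch below shows the idea on a tiny hypothetical model (`toy_model`) trained on made-up rows; the new observation is passed as a one-row DataFrame with the same column names used during training. None of the numbers come from `cleanedfile.csv`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows with the same feature names as X.
toy_X = pd.DataFrame({
    "social_support": [1.50, 1.47, 0.87, 0.52],
    "health": [0.96, 1.04, 0.44, 0.57],
})
toy_y = ["Europe", "Europe", "Africa", "Africa"]

toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# One new observation, with columns in the same order as the training data.
new_country = pd.DataFrame({"social_support": [1.45], "health": [1.00]})
predicted = toy_model.predict(new_country)
print(predicted[0])
```

In the notebook itself, the equivalent call would be `model.predict(...)` on a frame with the `social_support` and `health` columns.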

6. (Optional) Visualize decision trees

In [69]:
tree.export_graphviz(model, out_file = 'happiness.dot',
                    feature_names = list(X.columns),   # plain list of column names
                    class_names = sorted(y.unique()),  # list of labels, not its string form
                    label = 'all',
                    rounded = True,
                    filled = True)

# Download the file happiness.dot and open it in VS Code.
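If opening a `.dot` file is inconvenient, scikit-learn's `export_text` offers a plain-text view of the same decision rules directly in the notebook. The sketch below applies it to a small hypothetical model (`toy_model`) trained on made-up rows, so the printed rules are illustrative only and do not describe the happiness model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training rows with the same feature names as X.
toy_X = pd.DataFrame({
    "social_support": [1.50, 1.47, 0.87, 0.52],
    "health": [0.96, 1.04, 0.44, 0.57],
})
toy_y = ["Europe", "Europe", "Africa", "Africa"]

toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Render the fitted tree as indented if/else style rules.
rules = export_text(toy_model, feature_names=list(toy_X.columns))
print(rules)
```

For the notebook's model, the corresponding call would be `export_text(model, feature_names=list(X.columns))`.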
