This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem.


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and improve


# Problem description
In the text cell below, state what this classification problem predicts (y) and which columns are used to make the prediction (X).

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [7]:
df = pd.read_csv('Billionaire.csv')

2. Display columns and describe the data set

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2676 entries, 0 to 2675
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      2676 non-null   object
 1   NetWorth  2676 non-null   object
 2   Country   2676 non-null   object
 3   Source    2676 non-null   object
 4   Rank      2676 non-null   int64 
 5   Age       2676 non-null   int64 
 6   Industry  2676 non-null   object
dtypes: int64(2), object(5)
memory usage: 146.5+ KB


In [9]:
df.describe()

              Rank          Age
count       2676.0       2676.0
mean   1343.791106    63.113602
std     773.724884    13.445153
min            1.0         18.0
25%          680.0         54.0
50%         1362.0         63.0
75%         2035.0         73.0
max         2674.0         99.0


3. Prepare Data

In [11]:
# Run this section to inspect X

X = df.drop(columns = ['Name','Country','Industry','Source','NetWorth'])
X

      Rank  Age
0        1   57
1        2   49
2        3   72
3        4   65
4        5   36
...    ...  ...
2671  2674   49
2672  2674   65
2673  2674   58
2674  2674   58


In [12]:
# Run this section to inspect y
y = df['NetWorth']
y

0       $177 B
1       $151 B
2       $150 B
3       $124 B
4        $97 B
         ...  
2671      $1 B
2672      $1 B
2673      $1 B
2674      $1 B
2675      $1 B
Name: NetWorth, Length: 2676, dtype: object
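The target column is stored as text such as `$177 B`, so the classifier treats every distinct string as its own class. If a numeric value is wanted during the "Clean the Data" step, a helper along the following lines could be used. This is a minimal sketch: `parse_net_worth` is a hypothetical name, and it assumes every value follows the `$<number> B` pattern seen in the output above.

```python
def parse_net_worth(text):
    """Convert a string such as '$177 B' to a float in billions of dollars.

    Assumption: values always look like '$<number> B'; anything else raises.
    """
    cleaned = text.strip().lstrip("$")
    number, _, unit = cleaned.partition(" ")
    if unit.strip() != "B":
        raise ValueError(f"unexpected unit in {text!r}")
    return float(number)


# A few values copied from the y output above
samples = ["$177 B", "$12.9 B", "$1 B"]
parsed = [parse_net_worth(s) for s in samples]
print(parsed)
```

On the full data set, the same helper could be applied with `df['NetWorth'].map(parse_net_worth)`, provided the assumption about the format holds for every row.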

4. Train the model and calculate accuracy

In [16]:
# Train on 80% of the data set and use the remaining 20% to test
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()

# Fit on the training split, then predict NetWorth for the held-out rows
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy: the fraction of test rows predicted exactly
score = accuracy_score(y_test, predictions)
score

0.9552238805970149
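To make the score easier to interpret, the sketch below computes accuracy by hand as the fraction of predictions that match the true label exactly, which is the quantity `accuracy_score` reports for this kind of problem. The `accuracy` helper and the two short label lists are made up for illustration and are not taken from the data set.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction equals truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Illustrative labels only: three of the four predictions are correct
y_true_demo = ["$1 B", "$1.2 B", "$3 B", "$5 B"]
y_pred_demo = ["$1 B", "$1.2 B", "$3 B", "$4 B"]
print(accuracy(y_true_demo, y_pred_demo))
```

Because `train_test_split` shuffles the rows randomly, the score changes from run to run. Passing a fixed `random_state` (for example `random_state=42`) makes the split repeatable.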

5. Persisting Models

In [19]:
# Save the model to file
joblib.dump(model, 'Billionaire.joblib')


['Billionaire.joblib']

5.b. Load the model and make predictions

In [20]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('Billionaire.joblib')
predictions = model.predict(X_test)
predictions

array(['$12.9 B', '$2.9 B', '$1.7 B', '$1.9 B', '$1.6 B', '$2.7 B',
       '$1.2 B', '$1.3 B', '$7.6 B', '$1.7 B', '$5.4 B', '$1.9 B',
       '$1.1 B', '$2.2 B', '$2.8 B', '$5.2 B', '$2.8 B', '$1.5 B',
       '$1.1 B', '$4.6 B', '$4.7 B', '$3 B', '$33.7 B', '$1.6 B',
       '$7.6 B', '$1.1 B', '$2.5 B', '$3.6 B', '$5.3 B', '$33 B',
       '$1.3 B', '$2.3 B', '$2.3 B', '$1.6 B', '$3.8 B', '$1.2 B',
       '$1.4 B', '$1.2 B', '$1.4 B', '$3.5 B', '$2.1 B', '$5.1 B',
       '$5.8 B', '$5.9 B', '$1.3 B', '$1.6 B', '$17.5 B', '$5.8 B',
       '$1.2 B', '$4 B', '$1.3 B', '$1.2 B', '$1.2 B', '$5 B', '$1.2 B',
       '$1.4 B', '$3 B', '$9.4 B', '$4.3 B', '$3.8 B', '$2.9 B', '$2.4 B',
       '$1.4 B', '$1.8 B', '$8.2 B', '$2.1 B', '$1.1 B', '$3.3 B', '$2 B',
       '$1.1 B', '$5.5 B', '$2.3 B', '$1.2 B', '$1.6 B', '$1.4 B',
       '$1.8 B', '$2 B', '$3.4 B', '$3.7 B', '$2.3 B', '$1.4 B', '$5.6 B',
       '$1.4 B', '$25.8 B', '$5.9 B', '$3 B', '$2.8 B', '$2.9 B',
       '$9.6 B', '$2 B', '$2.7 B'
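A model can also predict for rows that are not part of `X_test`, as long as they have the same columns as the training data. The sketch below shows this on a tiny stand-in model. `X_demo`, `y_demo`, `demo_model` and `new_rows` are hypothetical names, and the four training rows are copied from the `X` and `y` outputs shown earlier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in training set with the same feature columns as X
X_demo = pd.DataFrame({"Rank": [1, 2, 3, 2674], "Age": [57, 49, 72, 58]})
y_demo = ["$177 B", "$151 B", "$150 B", "$1 B"]

# random_state is fixed only to keep the sketch repeatable
demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# New rows must use the same column names, in the same order
new_rows = pd.DataFrame({"Rank": [2, 2674], "Age": [49, 58]})
demo_predictions = demo_model.predict(new_rows)
print(demo_predictions)
```

With the real model, the same pattern would be `model.predict(new_rows)` after loading it with `joblib.load`.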

6. (Optional) Visualize decision trees

In [22]:
# class_names must be a list of labels in the model's class order
tree.export_graphviz(model, out_file = 'Billionaire.dot',
                    feature_names = list(X.columns),
                    class_names = list(model.classes_),
                    label = 'all',
                    rounded = True,
                    filled = True)

# Download the file Billionaire.dot and open it in VS Code.
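If Graphviz tooling is not available, scikit-learn's `export_text` gives a plain-text view of the same kind of tree. The sketch below is self-contained and uses a tiny stand-in model. `X_demo`, `y_demo` and `demo_model` are hypothetical names, with rows copied from the outputs shown earlier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny stand-in training set with the same feature columns as X
X_demo = pd.DataFrame({"Rank": [1, 2, 3, 2674], "Age": [57, 49, 72, 58]})
y_demo = ["$177 B", "$151 B", "$150 B", "$1 B"]
demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Render the fitted tree as indented text, one line per split or leaf
tree_text = export_text(demo_model, feature_names=list(X_demo.columns))
print(tree_text)
```

For the real model, `export_text(model, feature_names=list(X.columns))` would print the full tree, which may be very long because every distinct NetWorth string is a class. If Graphviz is installed, the `.dot` file can instead be rendered from a terminal with `dot -Tpng Billionaire.dot -o Billionaire.png`.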
