This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve
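
The steps above can be sketched end to end on a small dataset that has a handful of well-defined classes. This is only an illustration: it assumes scikit-learn's bundled Iris dataset as a stand-in for a CSV file, and the names `X_demo`, `y_demo`, `demo_model`, and `demo_score` are hypothetical and not part of this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: load a small dataset that needs no cleaning (stand-in for a CSV file).
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

# Step 3: hold out 20% of the rows for testing, keeping class proportions equal.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Steps 4-5: create the model and train it on the training rows only.
demo_model = DecisionTreeClassifier(random_state=0)
demo_model.fit(X_tr, y_tr)

# Step 6: predict labels for rows the model has not seen.
demo_predictions = demo_model.predict(X_te)

# Step 7: evaluate by comparing predictions with the held-out labels.
demo_score = accuracy_score(y_te, demo_predictions)
print(f"accuracy: {demo_score:.2f}")
```

Fixing `random_state` makes the split and the tree reproducible, which helps when comparing runs during the "improve" part of step 7.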


# Problem description
In the text cell below, state what this classification problem predicts (y) and which columns are used for the prediction (X).

In this classification problem, the model will predict the number of soldiers on active duty in a given year, using all columns of the dataset besides "Active Duty".

In [6]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [7]:
df = pd.read_csv('cleanedfile.csv')
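
Right after loading, it can help to confirm the shape, missing values, and column types before modeling. The sketch below assumes an in-memory CSV built from three rows and four columns copied from this notebook's outputs, so it runs without `cleanedfile.csv`; the name `sample` is hypothetical.

```python
import io

import pandas as pd

# Three rows copied from the notebook's outputs, trimmed to four columns.
csv_text = """Calendar Year,Active Duty,Total Military FTE,Total Deaths
1980,2050758,2159630,2392
1981,2093032,2206751,2380
1982,2112609,2251067,2319
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks: row/column counts, total missing cells, column dtypes.
print(sample.shape)
print(sample.isna().sum().sum())
print(sample.dtypes)
```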

2. Display columns and describe the data set

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Calendar Year                  31 non-null     int64  
 1   Active Duty                    31 non-null     int64  
 2   Full-Time (est) Guard+Reserve  31 non-null     int64  
 3   Selected Reserve FTE           31 non-null     int64  
 4   Total Military FTE             31 non-null     int64  
 5   Total Deaths                   31 non-null     int64  
 6   Accident                       31 non-null     float64
 7   Hostile Action                 31 non-null     int64  
 8   Homicide                       31 non-null     int64  
 9   Illness                        31 non-null     int64  
 10  Pending                        31 non-null     int64  
 11  Self-Inflicted                 31 non-null     int64  
 12  Terrorist Attack               31 non-null     int64

In [9]:
df.describe()

,Calendar Year,Active Duty,Full-Time (est) Guard+Reserve,Selected Reserve FTE,Total Military FTE,Total Deaths,Accident,Hostile Action,Homicide,Illness,Pending,Self-Inflicted,Terrorist Attack,Undetermined
count,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0
mean,1995.0,1702285.0,63614.516129,131157.935484,1897057.0,1575.290323,808.806452,155.290323,75.129032,276.741935,2.548387,222.935484,13.548387,20.290323
std,9.092121,337422.9,13263.522619,46394.447399,318690.9,526.564602,391.404345,272.073421,35.189716,89.154909,6.402285,43.044113,47.43616,8.82494
min,1980.0,1367838.0,22000.0,86872.0,1525942.0,796.0,424.0,0.0,26.0,154.0,0.0,150.0,0.0,4.0
25%,1987.5,1406714.0,65000.0,96033.5,1620408.0,1063.0,516.5,0.0,47.0,209.5,0.0,189.0,0.0,14.0
50%,1995.0,1502343.0,66000.0,111491.0,1732632.0,1515.0,605.0,1.0,67.0,256.0,0.0,231.0,1.0,19.0
75%,2002.5,2102580.0,71500.0,158971.0,2254696.0,1968.0,1126.0,229.5,103.5,342.0,0.0,255.0,5.5,26.5
max,2010.0,2177845.0,76000.0,243284.0,2359855.0,2465.0,1556.0,847.0,174.0,457.0,22.0,302.0,263.0,43.0


3. Prepare Data

In [65]:
# Run this section to inspect X
X = df.drop(columns = ['Active Duty'])
X

,Calendar Year,Full-Time (est) Guard+Reserve,Selected Reserve FTE,Total Military FTE,Total Deaths,Accident,Hostile Action,Homicide,Illness,Pending,Self-Inflicted,Terrorist Attack,Undetermined
0,1980,22000,86872,2159630,2392,1556.0,0,174,419,0,231,1,11
1,1981,22000,91719,2206751,2380,1524.0,0,145,457,0,241,0,13
2,1982,41000,97458,2251067,2319,1493.0,0,108,446,0,254,2,16
3,1983,49000,100455,2273364,2465,1413.0,18,115,419,0,218,263,19
4,1984,55000,104583,2297922,1999,1293.0,1,84,374,0,225,6,16
5,1985,64000,108806,2323185,2252,1476.0,0,111,363,0,275,5,22
6,1986,69000,113010,2359855,1984,1199.0,2,103,384,0,269,0,27
7,1987,71000,115086,2352697,1983,1172.0,37,104,383,0,260,2,25
8,1988,72000,115836,2309495,1819,1080.0,0,90,321,0,285,17,26
9,1989,74200,117056,2303384,1636,1000.0,23,58,294,0,224,0,37


In [17]:
# Run this section to inspect y
y = df['Active Duty']
y

0     2050758
1     2093032
2     2112609
3     2123909
4     2138339
5     2150379
6     2177845
7     2166611
8     2121659
9     2112128
10    2046806
11    1943937
12    1773996
13    1675269
14    1581649
15    1502343
16    1456266
17    1418773
18    1381034
19    1367838
20    1372352
21    1384812
22    1411200
23    1423348
24    1411287
25    1378014
26    1371533
27    1368226
28    1402227
29    1421668
30    1430985
Name: Active Duty, dtype: int64
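
A classifier treats every distinct target value as its own class, so it is worth counting how many distinct values `y` holds. The sketch below re-enters the 31 values printed above so that it runs on its own; the names `y_shown`, `n_rows`, and `n_classes` are hypothetical.

```python
import pandas as pd

# The 31 "Active Duty" values printed above, in the same order.
y_values = [
    2050758, 2093032, 2112609, 2123909, 2138339, 2150379, 2177845, 2166611,
    2121659, 2112128, 2046806, 1943937, 1773996, 1675269, 1581649, 1502343,
    1456266, 1418773, 1381034, 1367838, 1372352, 1384812, 1411200, 1423348,
    1411287, 1378014, 1371533, 1368226, 1402227, 1421668, 1430985,
]
y_shown = pd.Series(y_values, name="Active Duty")

n_rows = len(y_shown)
n_classes = y_shown.nunique()  # number of distinct labels a classifier would see

# When the two counts match, every row is the only member of its class.
print(n_rows, n_classes)
```

When the number of classes equals the number of rows, any row held out for testing carries a label that never occurs in the training rows.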

4. Train the model and calculate accuracy

In [74]:
# Train on 80% of the data set and test on the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.0
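
An accuracy of 0.0 is what one would expect when every target value is distinct, as the `y` output above suggests: a decision tree classifier can only output labels it saw during training, and `accuracy_score` counts exact matches only. The sketch below reproduces the effect on synthetic data (the arrays `years` and `counts` are made up for illustration) and, for comparison, swaps in `DecisionTreeRegressor` scored with mean absolute error. The regressor is not what this notebook uses; it is shown only as one possible alternative for a numeric target.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): one feature, a distinct target per row.
years = np.arange(1980, 2011).reshape(-1, 1)
counts = 2_000_000 - 20_000 * (years.ravel() - 1980)

X_tr, X_te, y_tr, y_te = train_test_split(
    years, counts, test_size=0.2, random_state=0)

# Classifier: predictions come from the training labels, and no test label
# appears among them, so no prediction can match exactly.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
clf_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(clf_accuracy)  # 0.0

# Regressor: treats the target as a number, so "close" predictions count.
reg = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
reg_mae = mean_absolute_error(y_te, reg.predict(X_te))
print(reg_mae)
```

Mean absolute error reports how far predictions are from the true values in the target's own units, which is usually more informative than exact-match accuracy for a count such as this one.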

5. Persisting Models

In [75]:
# Save the model to file
joblib.dump(model, 'active-duty-model.joblib')


['active-duty-model.joblib']

5.b. Import the model and make predictions

In [76]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('active-duty-model.joblib')
predictions = model.predict(X_test)
predictions

array([1411287, 1502343, 1368226, 1402227, 2050758, 2138339, 2138339])
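
The save and load steps above can be checked with a round trip: a model restored from disk should give the same predictions as the one that was saved. This is a minimal sketch on a tiny made-up dataset, written to a temporary directory so it leaves no files behind; `X_small`, `y_small`, and `demo-model.joblib` are hypothetical names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: one feature, two classes.
X_small = np.array([[0], [1], [2], [3]])
y_small = np.array([0, 0, 1, 1])
original = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")
    written = joblib.dump(original, path)  # returns the list of files written
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(
    original.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

Files written by `joblib.dump` are pickle-based, so they should only be loaded from trusted sources and are best loaded with the same library versions that saved them.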

6. (Optional) Visualize decision trees

In [77]:
tree.export_graphviz(model, out_file = 'active-duty-model.dot',
                    feature_names = list(X.columns),
                    # class_names needs a list of strings, one per trained class, in ascending order
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True, 
                    filled = True)

# Download the file active-duty-model.dot and open it in VS Code.
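
If no Graphviz viewer is at hand, scikit-learn's `export_text` prints a tree's rules as plain text instead. The sketch below swaps that function in on a small tree trained on the bundled Iris dataset, not on this notebook's model; `small_tree` and `rules` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree so the printed rules stay short.
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# export_text returns the decision rules as an indented string.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Limiting `max_depth` keeps the output readable; a fully grown tree with one leaf per training row produces a much longer listing.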
