# Machine learning basics - decision tree

pandas is used to read and interact with CSV files.
CSV simply stands for 'Comma-Separated Values'.
It gives us the concept of a data frame, i.e. a two-dimensional table with labeled rows and columns.
This makes it very useful for preparing data for ML.
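
As a minimal sketch of what a data frame looks like, the example below reads a few CSV lines from memory instead of from disk. The `csv_text` string is a hypothetical stand-in that only mirrors the column layout of music.csv.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV that mirrors the column layout of music.csv
csv_text = """age,gender,genre
20,1,HipHop
26,1,Jazz
31,1,Classical
20,0,Dance
"""

# read_csv accepts any file-like object, not only a path on disk
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels taken from the header row
print(df.dtypes)         # per-column types inferred by pandas
```

Each column has its own label and type, which is what makes it easy to select or drop columns by name in the following cells.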

In [1]:
import pandas as pd
data = pd.read_csv('music.csv')
data

,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


To train the model, the input data must not contain the output column.
So we drop that column and pass only the input features to the decision tree algorithm.
By convention, the variable name 'X' represents the input data without the output column.

In [2]:
X = data.drop(columns=['genre'])
X

,age,gender
0,20,1
1,23,1
2,25,1
3,26,1
4,29,1
5,30,1
6,31,1
7,33,1
8,37,1
9,20,0


By convention, the output column is represented by 'y'.

In [3]:
y = data['genre']
y

0        HipHop
1        HipHop
2        HipHop
3          Jazz
4          Jazz
5          Jazz
6     Classical
7     Classical
8     Classical
9         Dance
10        Dance
11        Dance
12     Acoustic
13     Acoustic
14     Acoustic
15    Classical
16    Classical
17    Classical
Name: genre, dtype: object

In [4]:
from sklearn.tree import DecisionTreeClassifier

In [5]:
model = DecisionTreeClassifier()
model.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Here we ask the model to predict the music preference of a 21-year-old male and a 22-year-old female (gender is encoded as 1 for male and 0 for female).

In [6]:
predictions = model.predict([[21, 1], [22, 0]])
predictions

array(['HipHop', 'Dance'], dtype=object)
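
To see how a decision tree arrives at predictions like these, the learned rules can be printed with `sklearn.tree.export_text`. The sketch below is an illustration under assumptions: it rebuilds only the ten rows displayed in the first output (the full music.csv appears to contain more), and it fixes `random_state` purely so the run is repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Only the ten rows displayed in the notebook output; the real file is assumed to be larger
data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3 + ["Dance"],
})
X = data.drop(columns=["genre"])
y = data["genre"]

# random_state is fixed only to make this sketch repeatable
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned tree as nested threshold rules on age and gender
rules = export_text(model, feature_names=list(X.columns))
print(rules)

# Query the same two people as in the cell above
queries = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
predictions = model.predict(queries)
print(predictions)
```

Each indented line in the printed rules is a threshold test on one feature, and each leaf names the genre assigned to rows that reach it.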

To measure accuracy, we need to split the given data into training and testing data, so that the model is evaluated on rows it has not seen during training.

In [7]:
from sklearn.model_selection import train_test_split

We will split the data set using a function that returns a tuple.
We unpack the tuple to access the split data sets.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
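
The effect of `test_size=0.2` is easier to see on concrete sizes. The following sketch uses a hypothetical 18-row stand-in (the same length as the `y` output above, with made-up values) and passes `random_state`, which the notebook does not, so that the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 18 rows; the values are made up for illustration
X = pd.DataFrame({"age": range(20, 38), "gender": [1, 0] * 9})
y = pd.Series(["a", "b"] * 9, name="genre")

# The function returns four objects in this fixed order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 20% of 18 is 3.6, which is rounded up to 4 test rows; the other 14 are for training
print(len(X_train), len(X_test))

# Inputs and outputs stay aligned: both halves of each split share the same row index
print(X_test.index.equals(y_test.index))
```

Without `random_state`, the rows are shuffled differently on every run, so the resulting accuracy can change from run to run.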

In [9]:
model = DecisionTreeClassifier()
model.fit(X_train,y_train)

prediction = model.predict(X_test)

In [10]:
from sklearn.metrics import accuracy_score

In [11]:
accuracy_score(y_test,prediction)

0.25
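
The accuracy score is the fraction of test rows whose predicted label matches the true label. Assuming the data set has the 18 rows shown in the `y` output, a 20% test split holds 4 rows, so 0.25 would correspond to 1 correct prediction out of 4. The labels below are hypothetical and only illustrate the arithmetic.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for a 4-row test set
y_true = ["HipHop", "Jazz", "Classical", "Dance"]
y_pred = ["HipHop", "Dance", "Jazz", "Classical"]

# Accuracy by hand: number of matches divided by number of rows
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual)  # 0.25

# accuracy_score computes the same fraction
score = accuracy_score(y_true, y_pred)
print(score)   # 0.25
```

With a test set this small, a single row changes the score by 0.25, so the value depends heavily on which rows the random split happens to pick.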

# Save the ML model as a joblib file 

In [12]:
import joblib

In [13]:
joblib.dump(model,'music-recommender.joblib')

['music-recommender.joblib']

In [14]:
mymodel = joblib.load('music-recommender.joblib')
mymodel.predict([[21,1]])

array(['HipHop'], dtype=object)
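
A quick way to check that a saved model behaves like the original is to compare their predictions after a round trip. This is a minimal sketch on a hypothetical four-row training set, written to a temporary directory so no file is left behind.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical four-row training set in the same [age, gender] layout
X = [[20, 1], [26, 1], [31, 1], [20, 0]]
y = ["HipHop", "Jazz", "Classical", "Dance"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "music-recommender.joblib")
    joblib.dump(model, path)      # serialize the fitted model to disk
    restored = joblib.load(path)  # read it back as a new object

# The restored model should give the same predictions as the original
same = list(restored.predict(X)) == list(model.predict(X))
print(same)
```

Saved models are generally meant to be loaded with the same scikit-learn version that produced them, so it helps to record that version alongside the file.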