Steps:
1. Import the Data
2. Clean the Data - Convert into numerical values
3. Division into Training Set / Test Set
4. Create a Model using an algorithm - e.g. a decision tree from scikit-learn
5. Make predictions
6. Evaluate and Improve ---> Change algorithm or Fine-Tune
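
The steps above can be sketched end to end on a tiny in-memory table. This is a minimal sketch, not a definitive recipe: the `hours_studied` and `passed` columns are hypothetical stand-ins for a real dataset, the seed 0 is arbitrary, and a decision tree is only one possible algorithm.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Import the data (hypothetical toy rows standing in for a CSV file)
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "passed": ["no", "no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes"],
})

# Clean the data: here this only drops rows with missing values
data = data.dropna()

# Split into a training set and a test set (80:20)
X = data[["hours_studied"]]
y = data["passed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the model and fit it on the training set
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Make predictions for the held-out rows
predictions = model.predict(X_test)

# Evaluate: fraction of test rows predicted correctly
score = accuracy_score(y_test, predictions)
print(score)
```

If the score is poor, the last step loops back: try another algorithm or tune the parameters of the current one.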

Libraries:
1. NumPy : Working with arrays and matrices
2. Pandas : Working with data frames
3. Matplotlib : Two-dimensional graphs and plots
4. Scikit-Learn : Common machine learning algorithms such as decision trees

Dataset:
1. Kaggle : Popular site for datasets

In [4]:
import pandas as pd

In [7]:
# we have downloaded the file vgsales.csv from Kaggle

# vgsales path : ./Python Tutorial Supplementary Materials/vgsales.csv

# create a data frame object using the pandas module

df = pd.read_csv(r'Python Tutorial Supplementary Materials/vgsales.csv')

df # views the dataframe

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


In [8]:
df.shape # returns a tuple (rows, columns) giving the shape of the data frame

(16598, 11)

In [9]:
df.describe() # gives summary statistics (count, mean, std, min, quartiles, max) for each numeric column

Unnamed: 0,Rank,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
count,16598.0,16327.0,16598.0,16598.0,16598.0,16598.0,16598.0
mean,8300.605254,2006.406443,0.264667,0.146652,0.077782,0.048063,0.537441
std,4791.853933,5.828981,0.816683,0.505351,0.309291,0.188588,1.555028
min,1.0,1980.0,0.0,0.0,0.0,0.0,0.01
25%,4151.25,2003.0,0.0,0.0,0.0,0.0,0.06
50%,8300.5,2007.0,0.08,0.02,0.0,0.01,0.17
75%,12449.75,2010.0,0.24,0.11,0.04,0.04,0.47
max,16600.0,2020.0,41.49,29.02,10.22,10.57,82.74
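
In the output above, the count for Year (16327) is lower than the count for the other columns (16598). `describe()` counts only non-missing values, so this points to rows with no year recorded. A minimal sketch of how such gaps could be located and dropped, using three games from the table above; the missing year is made up for illustration and is not taken from vgsales.csv.

```python
import numpy as np
import pandas as pd

# Three rows modelled on the table above; the NaN is an invented gap
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
    "Year": [2006.0, np.nan, 2008.0],
    "Global_Sales": [82.74, 40.24, 35.82],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# One possible cleaning step: drop the rows that have no year
cleaned = sample.dropna(subset=["Year"])
print(cleaned.shape)
```

Dropping rows is only one option; whether to drop or fill missing values depends on the dataset.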


In [10]:
df.values # returns a NumPy array (not a list) holding the data frame's values

array([[1, 'Wii Sports', 'Wii', ..., 3.77, 8.46, 82.74],
       [2, 'Super Mario Bros.', 'NES', ..., 6.81, 0.77, 40.24],
       [3, 'Mario Kart Wii', 'Wii', ..., 3.79, 3.31, 35.82],
       ...,
       [16598, 'SCORE International Baja 1000: The Official Game', 'PS2',
        ..., 0.0, 0.0, 0.01],
       [16599, 'Know How 2', 'DS', ..., 0.0, 0.0, 0.01],
       [16600, 'Spirits & Spells', 'GBA', ..., 0.0, 0.0, 0.01]],
      dtype=object)
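
As the output shows, `df.values` is a NumPy array, and its dtype is `object` because the columns mix text and numbers. A small sketch of this behaviour, using `DataFrame.to_numpy()` (which the pandas documentation recommends over `.values`) on a two-row sample built from the rows shown earlier:

```python
import numpy as np
import pandas as pd

# Two rows taken from the vgsales table shown earlier
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Mario Kart Wii"],
    "Global_Sales": [82.74, 35.82],
})

# Mixed text and numbers: the array falls back to dtype object
mixed = sample.to_numpy()
print(type(mixed), mixed.dtype)

# Numeric columns only: the array keeps a numeric dtype
numeric = sample[["Global_Sales"]].to_numpy()
print(numeric.dtype)
```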

# MUSIC RECOMMENDATION

In [54]:
# dataset : music.csv; path : Python Tutorial Supplementary Materials/music.csv

import pandas as pd

from sklearn.tree import DecisionTreeClassifier # decision tree classifier from scikit-learn

from sklearn.model_selection import train_test_split # for splitting the data into training and test sets

from sklearn.metrics import accuracy_score # for measuring prediction accuracy

import joblib # used for saving and loading trained models

from sklearn import tree # module used to export the decision tree for visualisation

In [27]:
music_data = pd.read_csv(r'Python Tutorial Supplementary Materials/music.csv')

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


Cleaning the Data : Removing Null values, preprocessing

In [28]:
# in Jupyter, pressing Shift+Tab on a method shows its signature and documentation

X = music_data.drop(columns = 'genre') # X is the input data_frame (every column except genre)

Unnamed: 0,age,gender
0,20,1
1,23,1
2,25,1
3,26,1
4,29,1
5,30,1
6,31,1
7,33,1
8,37,1
9,20,0


In [29]:
Y = music_data['genre'] # Y is the output column (a pandas Series)

0        HipHop
1        HipHop
2        HipHop
3          Jazz
4          Jazz
5          Jazz
6     Classical
7     Classical
8     Classical
9         Dance
10        Dance
11        Dance
12     Acoustic
13     Acoustic
14     Acoustic
15    Classical
16    Classical
17    Classical
Name: genre, dtype: object

In [45]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2) # split the data in an 80:20 ratio for training:testing; returns a list that is unpacked into four variables
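
Because `train_test_split` shuffles the rows at random, each run of the cell above produces a different split. A minimal sketch of a repeatable split, assuming only the first ten rows of music.csv shown earlier (not the full file) and an arbitrary seed of 42 passed through the `random_state` parameter:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# First ten rows of music.csv as displayed earlier (not the full file)
music_sample = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_sample.drop(columns="genre")
Y = music_sample["genre"]

# The same random_state gives the same shuffle on every call
first = train_test_split(X, Y, test_size=0.2, random_state=42)
second = train_test_split(X, Y, test_size=0.2, random_state=42)

X_train, X_test, Y_train, Y_test = first
print(len(X_train), len(X_test))  # 8 training rows, 2 test rows

# True when both calls picked the same test rows
same_split = X_test.index.equals(second[1].index)
print(same_split)
```

Fixing the seed makes results comparable between runs; leaving it out is fine when a fresh split is wanted each time.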

Build model with Scikit-Learn

In [46]:
# create a machine learning model based on a decision tree

model = DecisionTreeClassifier() # create a DecisionTreeClassifier object

model.fit(X_train,Y_train) # train the model on the training set

DecisionTreeClassifier()

Measure Accuracy of the model

General rule of thumb:
1. 70% - 80% of the data for training
2. 20% - 30% of the data for testing

It's super important to clean the data

In [52]:
joblib.dump(model, 'music-recommender.joblib') # saves the model

['music-recommender.joblib']

In [47]:
predictions = model.predict(X_test) # predicts a genre for each row of the test set

score = accuracy_score(Y_test,predictions) # fraction of the model's predictions that match Y_test

print(score)

1.0
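
A score of 1.0 here is measured on very few rows (a 20% test split of the 18 rows in Y is about four rows), so one wrong prediction would change it a lot, and the value can differ from one random split to the next. A rough sketch of how to see that spread, assuming only the ten rows of music.csv shown earlier and arbitrary seeds 0 to 4:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# First ten rows of music.csv as displayed earlier (not the full file)
music_sample = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_sample.drop(columns="genre")
Y = music_sample["genre"]

# Train and score the same kind of model on five different random splits
scores = []
for seed in range(5):
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, Y_train)
    scores.append(accuracy_score(Y_test, model.predict(X_test)))

print(scores)                     # one accuracy per split
print(sum(scores) / len(scores))  # average over the five splits
```

An average over several splits is usually a steadier estimate than a single score.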

In [53]:
model = joblib.load('music-recommender.joblib') # loads trained model

predictions1 = model.predict([[21,1]])
predictions1

array(['HipHop'], dtype=object)

In [56]:
tree.export_graphviz(model, out_file = 'music-recommender.dot', feature_names = ['age','gender'], class_names = sorted(Y.unique()), label = 'all', rounded = True, filled = True) # writes the tree to a .dot file (DOT is a graph description language)

# GRAPH REPRESENTATION

In [6]:
import graphviz

with open("music-recommender.dot") as f: # reads the .dot file
    dot_graph = f.read()

graphviz.Source(dot_graph) # renders the graph described by dot_graph

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0xc0c0d02f88>
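
The ExecutableNotFound error above comes from the Python `graphviz` package failing to find the Graphviz `dot` program on the PATH, so the tree is not drawn. Until Graphviz itself is installed, one alternative that needs no external program is `sklearn.tree.export_text`, which prints the tree as indented text. A minimal sketch, assuming the ten rows of music.csv shown earlier rather than the full file:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# First ten rows of music.csv as displayed earlier (not the full file)
music_sample = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_sample.drop(columns="genre")
Y = music_sample["genre"]

model = DecisionTreeClassifier().fit(X, Y)

# Plain-text view of the tree: one line per split or leaf
tree_text = export_text(model, feature_names=["age", "gender"])
print(tree_text)
```

The text view carries the same split rules as the .dot file, only without the colours and boxes.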