# Machine Learning Tutorial

### Steps:

1. Import data - normally from a CSV file
2. Clean the data - remove duplicates and bad data, convert values, etc.; otherwise the model would learn bad patterns
3. Split data into training/test sets - e.g., with 1k+ pics of cats and dogs, 80% can go to training and 20% to testing
4. Create a model - select an algorithm that will analyze the data
5. Train the model - feed the model training data; the model then looks for patterns in the data
6. Make predictions - ask the model whether an image is a cat or a dog, and it makes a prediction
7. Evaluate and improve - assess the accuracy of the predictions; fine-tune the model's parameters
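
The seven steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the small data frame stands in for a real CSV import, its rows are invented, and `random_state=0` is added so the run is repeatable.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: stand-in for pd.read_csv(...); these rows are invented for illustration
raw = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30, 20, None],
    'gender': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    'genre':  ['HipHop'] * 3 + ['Jazz'] * 3 + ['Dance'] * 3 + ['Acoustic'] * 3
              + ['HipHop', 'Dance'],
})

# Step 2: drop the duplicated row and the row with a missing age
clean = raw.drop_duplicates().dropna()

# Step 3: hold out 20% of the rows for testing (random_state fixes the shuffle)
X = clean.drop(columns=['genre'])
y = clean['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 4-5: create the model and train it on the training rows only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 6: predict genres for the held-out inputs
predictions = model.predict(X_test)

# Step 7: fraction of held-out rows predicted correctly (between 0 and 1)
score = accuracy_score(y_test, predictions)
print(len(raw), len(clean), score)
```

With so few rows the score will swing a lot depending on which rows land in the test set, which is the same effect described later in these notes.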


In [40]:
import pandas as pd
import os
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split  # With this function we can easily split our data set into two data sets (training/testing)
from sklearn.metrics import accuracy_score  # This function will help measure accuracy of model

In [10]:
pwd = os.getcwd()
pwd

'/Users/jasonisberto/pyproject/pythonMoshTutorial'

In [21]:
music_data = pd.read_csv('music.csv')

### Keyboard shortcuts


In [57]:
# Escape --> H : Brings up shortcut help menu

# Hit Tab after 'df.' to bring up all attributes and methods for the object
# df.describe()  # With the cursor on the method, hit Shift + Tab to get a description
# Control + Enter : This allows you to run a cell over and over again without a new cell appearing below it


### Prepare the data

In [24]:
# Split data into two sets: input set and output set
# When training a model, the output set contains the known answers that the model learns to predict
# After training a model, we give it a new input set and it predicts the output

In [27]:
X = music_data.drop(columns=['genre'])  # Input set
y = music_data['genre']  # Output set
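train_test_split(X, y)  # Returns the training/testing pieces; they are not stored here (see "Calculating accuracy of model")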
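
To see what `train_test_split` hands back, the pieces can be unpacked and their sizes inspected. A minimal sketch, assuming a made-up 12-row frame in place of `music.csv`; when no `test_size` is given, scikit-learn holds out 25% of the rows. `random_state=0` is an addition for repeatability.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows with the same three columns used in these notes
data = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    'gender': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'genre':  ['HipHop'] * 3 + ['Jazz'] * 3 + ['Dance'] * 3 + ['Acoustic'] * 3,
})

X = data.drop(columns=['genre'])  # Input set: every column except the answer
y = data['genre']                 # Output set: the answer column

# Four pieces come back: training inputs, test inputs, training outputs, test outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # 9 training rows and 3 test rows, 2 columns each
print(y_train.shape, y_test.shape)
```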

### Running an algorithm

In [35]:
model = DecisionTreeClassifier()  # Now we have a model
model.fit(X, y)  # This fit() method takes two data sets - input and output

# Then we ask our model to make a prediction: What kind of music does a 21 yr old male like
# Based on our data frame, we'd expect it to be HipHop

predictions = model.predict([ [21, 1], [22, 0] ])  # This method takes a two-dimensional array [ , ], [ , ]
predictions



array(['HipHop', 'Dance'], dtype=object)
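
A variation on the prediction call: since the model was fitted on a data frame, the new samples can also be passed as a data frame with the same column names, which keeps it explicit which number is the age and which is the gender (recent scikit-learn versions warn when a model fitted with column names is given a plain list). This is a sketch on invented rows, not the real `music.csv`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training rows; the real file may differ
data = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    'gender': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'genre':  ['HipHop'] * 3 + ['Jazz'] * 3 + ['Dance'] * 3 + ['Acoustic'] * 3,
})
X = data.drop(columns=['genre'])
y = data['genre']

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Same two samples as above, but labelled by column name
new_people = pd.DataFrame({'age': [21, 22], 'gender': [1, 0]})
predictions = model.predict(new_people)
print(predictions)  # ['HipHop' 'Dance'] for these invented rows
```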

### Calculating accuracy of model

In [77]:
# Need to split data set into two sets - Training and Testing
# General rule of thumb - 70-80% to training and 20-30% to testing
# So instead of passing only two samples for predictions,
# we can pass the whole test input set and compare the predictions to the expected test values

X = music_data.drop(columns=['genre'])  # Input set
y = music_data['genre']  # Output set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # Allocating 20% of our data set to testing (0.2)
# train_test_split returns four items, which we unpack into their own variables

model = DecisionTreeClassifier()
model.fit(X_train, y_train)  # Pass only the training data sets
predictions = model.predict(X_test)  # Pass the test dataset which is input values that we want to test

score = accuracy_score(y_test, predictions)  # Pass the expected values stored in y_test, and then pass predictions
# This will return an accuracy score from 0-1

score  # The score can change on every run because train_test_split shuffles the rows randomly before splitting
# Initial result was 1.0; when run again it was 0.75

# If you raise the test size much higher, say to around 80%, you'll get a really low accuracy score
# because there is not as much data left to train the model.
# The more data we give our model, and the cleaner that data is, the better the accuracy



1.0
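
The accuracy score is simply the fraction of test rows the model got right. A minimal sketch with invented labels: four expected values and one wrong prediction give 0.75, the same kind of value seen in the rerun above.

```python
from sklearn.metrics import accuracy_score

# Invented example: what the test set says vs. what a model predicted
expected = ['HipHop', 'Dance', 'Jazz', 'Dance']
predicted = ['HipHop', 'Dance', 'Jazz', 'Acoustic']  # last one is wrong

# Count the matches by hand and divide by the number of rows
manual = sum(e == p for e, p in zip(expected, predicted)) / len(expected)

# accuracy_score computes the same fraction
score = accuracy_score(expected, predicted)
print(manual, score)  # 0.75 0.75
```

To get the same score on every run, `train_test_split` accepts a `random_state` argument (any fixed integer) that fixes the shuffle.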

### Persisting Models

Persistence matters when a model is trained on a lot of data.

With, say, millions of rows, training could take minutes or hours.

That's why model persistence is important.

We build and train the model once, and then save it to a file. When we want to make predictions, we load the model from the file and ask it to make predictions.

That model is already trained, so there is no need to retrain it each time.

In [86]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib  # This module has functions for saving and loading models

In [87]:
# music_data = pd.read_csv('music.csv')
# X = music_data.drop(columns=['genre'])
# y = music_data['genre']

# model = DecisionTreeClassifier()
# model.fit(X, y)

joblib.dump(model, 'music-recommender.joblib')  # Storing trained model into a file


['music-recommender.joblib']

In [89]:
model = joblib.load('music-recommender.joblib')  # Now we load the model we stored in the file
predictions = model.predict([[21, 1]])
predictions



array(['HipHop'], dtype=object)
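
One way to check that nothing is lost on the way to disk and back is to compare predictions from the original and the reloaded model. A sketch under stated assumptions: the rows are invented, and the file goes into a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training rows; the real file may differ
data = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    'gender': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'genre':  ['HipHop'] * 3 + ['Jazz'] * 3 + ['Dance'] * 3 + ['Acoustic'] * 3,
})
model = DecisionTreeClassifier(random_state=0)
model.fit(data.drop(columns=['genre']), data['genre'])

# Save the trained model, then load it back from the file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'music-recommender.joblib')
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The reloaded model should answer exactly like the original
sample = pd.DataFrame({'age': [21], 'gender': [1]})
same = list(model.predict(sample)) == list(loaded.predict(sample))
print(same)  # True
```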

### Visualizing a Decision Tree

Export the model in a visual format to see how it makes predictions.

In [93]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [91]:
music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre'])
y = music_data['genre']

model = DecisionTreeClassifier()
model.fit(X, y)

DecisionTreeClassifier()

In [94]:
tree.export_graphviz(model, out_file='music-recommender.dot',
                    feature_names=['age', 'gender'],
                    class_names=sorted(y.unique()),
                    label='all',
                    rounded=True,
                    filled=True)
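
For a quick look at the tree without leaving the notebook, scikit-learn also offers `export_text`, which prints the decision rules as indented text. This is an alternative to the `.dot` export, not a replacement for it; the rows below are invented.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training rows; the real file may differ
data = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    'gender': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'genre':  ['HipHop'] * 3 + ['Jazz'] * 3 + ['Dance'] * 3 + ['Acoustic'] * 3,
})
X = data.drop(columns=['genre'])
y = data['genre']

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Each indented line is one test on a feature; leaves show the predicted class
rules = export_text(model, feature_names=['age', 'gender'])
print(rules)
```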

In [None]:
# Need to download VS code. Continue from here.