<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:30%; left:10%;">
    Machine Learning Fundamentals
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:43%; left:10%;">
    Santiago Basulto
</h3>
</div>

<div style="width: 100%; background-color: #222; text-align: center">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Project
</h1>
    
<h3 style="color: #ef7d22; font-weight: normal;">
    Spot-checking algorithms on Tracks data
</h3>

<br><br> 
</div>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

<img src="img/music.jpg"
    style="width:250px; float: right; margin: 0 40px 40px 40px;"></img>

Your task is to find the best algorithm to classify songs as either 'Hip-Hop' or 'Rock'.

To do that, you will **spot-check** several algorithms to discover which one might work best.

We will use [The Echo Nest song dataset](http://millionsongdataset.com/tasteprofile/), which contains tracks along with their track metrics.


### Hands on! 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Load the `data/tracks_3.csv` file and store it in the `tracks_df` DataFrame.

Invalid observations have already been removed from this file, and it is balanced.

In [None]:
tracks_df = pd.read_csv('data/tracks_3.csv')

tracks_df.head()

### Show the shape of the resulting `tracks_df`.

In [None]:
tracks_df.shape

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Data preparation

Before modeling, prepare the data:

#### Create features $X$ and labels $y$

In [None]:
X = tracks_df.drop(['genre_top', 'genre_top_code'], axis=1)
y = tracks_df['genre_top_code']

#### Standardize the features

Use `StandardScaler` to standardize the features (`X`) before moving on to model creation.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)
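
Scaling the whole of `X` up front, as in the cell above, means the scaler also sees the rows that later serve as validation folds. A common alternative is to put the scaler inside a pipeline so it is re-fitted on the training part of each fold. The following is only a sketch of that idea: it runs on synthetic data from `make_classification` (an assumption standing in for the tracks data), and `X_demo`, `y_demo`, and `pipeline_scores` are illustrative names that are not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and labels (assumption, not the real tracks data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=10)

# The scaler is fitted on the training portion of every fold only,
# then applied to that fold's held-out portion.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print(pipeline_scores)
```

For a quick spot-check the difference is often small, so the simpler approach used in this project may be good enough to rank algorithms.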

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Define an evaluation function

Create a `get_cv_scores` function that receives a scikit-learn model as its `model` parameter and returns the cross-validation (CV) scores of that model.

Use 5-fold cross-validation, so the function should return 5 scores.

In [None]:
from sklearn.model_selection import cross_val_score

def get_cv_scores(model):
    return cross_val_score(model, X, y, cv=5)
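
To make clearer what `cross_val_score` does with `cv=5` for a classifier, the sketch below reproduces it by hand: the data is split with a stratified 5-fold splitter, a fresh copy of the model is trained on each training split, and its accuracy is measured on the held-out split. It uses synthetic data (an assumption), and `X_demo`, `y_demo`, `manual_scores`, and `library_scores` are illustrative names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and labels (assumption, not the real tracks data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=10)
model = KNeighborsClassifier()

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(model)  # untrained copy for this fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # For classifiers, score() returns the accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(manual_scores)
print(library_scores)
```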

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Spot-check algorithms

Create each of the following models and pass it to the `get_cv_scores` function to get its CV scores.

Save the resulting scores as a new column in `results_df` so you can compare them at the end.

In [None]:
results_df = pd.DataFrame()

#### K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

results_df['KNN'] = get_cv_scores(model)

#### Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Fix the seed so the scores are reproducible between runs
model = DecisionTreeClassifier(random_state=10)

results_df['Decision Trees'] = get_cv_scores(model)

#### Support Vector Machines

In [None]:
from sklearn import svm

model = svm.SVC(gamma='auto',
                random_state=10)

results_df['SVM'] = get_cv_scores(model)

#### Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

results_df['Naive Bayes'] = get_cv_scores(model)

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Fix the seed so the scores are reproducible between runs
model = RandomForestClassifier(n_estimators=100,
                               random_state=10)

results_df['Random Forest'] = get_cv_scores(model)

#### Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=10)

results_df['GBC'] = get_cv_scores(model)

#### AdaBoost Classifier (Adaptive Boosting)

In [None]:
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=10)

results_df['AdaBoost'] = get_cv_scores(model)
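
The cells above repeat the same two steps for every algorithm. As a possible shortcut, the models can be kept in a dictionary and scored in a loop. This is a sketch under assumptions: it uses synthetic data and only three of the models to stay small, and `candidate_models` and `demo_results_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and labels (assumption, not the real tracks data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=10)

# Column name -> untrained model to spot-check.
candidate_models = {
    'KNN': KNeighborsClassifier(),
    'Decision Trees': DecisionTreeClassifier(random_state=10),
    'Naive Bayes': GaussianNB(),
}

# One column per model, one row per CV fold.
demo_results_df = pd.DataFrame({
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in candidate_models.items()
})

print(demo_results_df)
```

Adding another algorithm to the comparison then only takes one more entry in the dictionary.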

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

### Present results

Show one boxplot per algorithm using the scores you saved in `results_df`.

Which algorithm performs best, and which performs worst?


In [None]:
results_df.boxplot(figsize=(14,6), grid=False)
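
A numeric summary can complement the boxplot when two boxes look similar. The sketch below computes the mean and standard deviation of the fold scores per model and sorts by mean. The scores are made-up placeholder values for illustration only, not results from the tracks data, and `demo_results_df` and `summary` are illustrative names.

```python
import pandas as pd

# Made-up placeholder scores (assumption): one column per model, one row per fold.
demo_results_df = pd.DataFrame({
    'Model A': [0.80, 0.82, 0.78, 0.81, 0.79],
    'Model B': [0.90, 0.72, 0.88, 0.74, 0.86],
})

# Mean and spread of the fold scores, with the highest mean first.
summary = (
    demo_results_df.agg(['mean', 'std'])
    .T
    .sort_values('mean', ascending=False)
)

print(summary)
```

A model with a slightly higher mean but a much larger spread, like the hypothetical `Model B` here, may be less dependable than its mean suggests.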

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98729912-57be3e80-237a-11eb-80e4-233ac344b391.png"></img>
</div>