# Project Description ⛳
<hr>

**Your final project draws on every topic covered from Week 2 to Week 8: you will use data to solve a real-life problem. Remember, this is a team project.**

You’ve learned a ton about data collection and cleaning, visualization and insight, machine learning, and model evaluation in this course. The final project is your chance to apply these skills to a problem from scratch.


`Use the rubric below as a guideline for your project, as it will be used to grade your submission`.

# Data cleaning & preprocessing

- Demonstrate clear understanding of different data cleaning and preprocessing techniques by applying them to your dataset.
- Clearly document (within the notebook) all cleaning and preprocessing steps.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [16]:
# Read in the data
df = pd.read_csv('music_genre.csv')

# Display first 5 rows
df.head()

| | artist | song | ids | genre | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | speechiness | tempo | mode | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kate Bush | Running Up That Hill (A Deal With God) | 75FEaRjZTKLhTrFGsfMUXR | Rock | 0.72 | 0.629 | 298933 | 0.547 | 0.00314 | 10 | 0.0604 | -13.123 | 0.055 | 108.375 | 0 | 0.197 |
| 1 | The Killers | Mr. Brightside | 003vvx7Niy0yvhvHt4a68B | Rock | 0.00121 | 0.352 | 222973 | 0.911 | 0.0 | 1 | 0.0995 | -5.23 | 0.0747 | 148.033 | 1 | 0.236 |
| 2 | Arctic Monkeys | 505 | 0BxE4FqsDD1Ot4YuBXwAPp | Rock | 0.00287 | 0.526 | 253587 | 0.866 | 7.8e-05 | 0 | 0.0945 | -5.822 | 0.0568 | 140.266 | 1 | 0.248 |
| 3 | Sam Fender | Seventeen Going Under | 5rF6YUIlgiat22OT1lWspJ | Rock | 0.00438 | 0.48 | 297933 | 0.87 | 0.00603 | 1 | 0.0826 | -4.792 | 0.0362 | 161.953 | 1 | 0.584 |
| 4 | George Ezra | Green Green Grass | 3rk4aJ0vAj3cFUIQEeASkT | Rock | 0.0695 | 0.685 | 167614 | 0.738 | 0.0 | 8 | 0.128 | -4.413 | 0.0595 | 112.972 | 1 | 0.8 |


In [17]:
df.isnull().sum()

artist              0
song                0
ids                 0
genre               0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
speechiness         0
tempo               0
mode                0
valence             0
dtype: int64

In [18]:
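# Drop the track ID column, which is only an identifier
df.drop(['ids'], axis=1, inplace=True)
# Keep variants of the frame with one or both text columns removed
df_without_song = df.drop(['song'], axis=1)
df_without_artist = df.drop(['artist'], axis=1)
df_without_artist_song = df.drop(['artist', 'song'], axis=1)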
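
The missing-value check above covers only one kind of data-quality problem. As a minimal sketch of two further checks that could be documented in this section, the hypothetical helper `cleaning_report` below counts duplicate rows and values outside an expected range. The assumption that a feature such as `energy` should lie in [0, 1] is mine, not something stated in the dataset, and the `demo` frame is made up purely to show the call.

```python
import pandas as pd


def cleaning_report(frame: pd.DataFrame, unit_columns: list[str]) -> dict:
    """Summarize a few common data-quality problems in `frame`.

    `unit_columns` lists columns assumed (not verified) to lie in [0, 1].
    """
    out_of_range = {
        col: int((~frame[col].between(0, 1)).sum()) for col in unit_columns
    }
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "out_of_range": out_of_range,
    }


# Made-up rows, only to demonstrate the helper; not the real music_genre.csv.
demo = pd.DataFrame({
    "artist": ["A", "A", "B"],
    "song": ["x", "x", "y"],
    "energy": [0.5, 0.5, 1.4],
})
report = cleaning_report(demo, unit_columns=["energy"])
print(report)
```

On the real data, the same call would take `df` and whichever columns the team decides should be bounded.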




# Exploratory Data Analysis
- Apply both measures of central tendency and dispersion to understand the data.
- Perform correlation analysis of the dependent and independent variables.
- What does the correlation analysis say about the dependent and independent variables?

In [19]:
#apply measures of central tendency and measures of dispersion on a dataset

#numeric features only (genre is the categorical target)
numeric = df_without_artist_song.drop(['genre'],axis=1)

#central tendencies without genre
mean = numeric.mean()
median = numeric.median()
mode = numeric.mode()

#measures of dispersion without genre
std = numeric.std()
var = numeric.var()
#named value_range so the built-in range() is not shadowed
value_range = numeric.max() - numeric.min()

print("Measures of central tendency without genre")
print("Mean:")
print(mean)
print("Median:")
print(median)
print("Mode:")
print(mode)
print("Measures of dispersion without genre")
print("Standard deviation:")
print(std)
print("Variance:")
print(var)
print("Range:")
print(value_range)


Measures of central tendency without genre
Mean:
acousticness             0.339804
danceability             0.578148
duration_ms         230590.324000
energy                   0.572698
instrumentalness         0.187261
key                      5.392000
liveness                 0.146420
loudness               -10.700616
speechiness              0.071584
tempo                  117.420964
mode                     0.640000
valence                  0.476311
dtype: float64
Median:
acousticness             0.177000
danceability             0.598000
duration_ms         214460.000000
energy                   0.660500
instrumentalness         0.000107
key                      6.000000
liveness                 0.111000
loudness                -6.709500
speechiness              0.047250
tempo                  120.045000
mode                     1.000000
valence                  0.469000
dtype: float64
Mode:
    acousticness  danceability  duration_ms  energy  instrumentalness  key  \
0          0.

In [20]:
#correlation analysis among the independent variables (genre, the dependent variable, is dropped here)
corr = df_without_artist_song.drop(['genre'],axis=1).corr()
print("Correlation:")
print(corr)

Correlation:
                  acousticness  danceability  duration_ms    energy  \
acousticness          1.000000     -0.538718     0.184575 -0.858784   
danceability         -0.538718      1.000000    -0.309187  0.480816   
duration_ms           0.184575     -0.309187     1.000000 -0.202881   
energy               -0.858784      0.480816    -0.202881  1.000000   
instrumentalness      0.802812     -0.616671     0.289232 -0.803711   
key                   0.074325      0.110225    -0.092625  0.005436   
liveness             -0.217366      0.070571    -0.148772  0.263407   
loudness             -0.831313      0.564384    -0.204338  0.907639   
speechiness          -0.054851      0.227406    -0.122419  0.089217   
tempo                -0.214211     -0.038844     0.037738  0.245783   
mode                  0.021548     -0.132745     0.036440 -0.036138   
valence              -0.468351      0.487837    -0.343836  0.557997   

                  instrumentalness       key  liveness  loudnes
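
The matrix above relates the features to each other, since `genre` is dropped before calling `corr()`. One possible way to bring the dependent variable into the analysis, sketched here under my own assumptions, is to turn `genre` into one 0/1 indicator column per class and correlate each feature with each indicator. The helper name `feature_genre_correlation` and the small `demo` frame are hypothetical and exist only for illustration.

```python
import pandas as pd


def feature_genre_correlation(frame: pd.DataFrame, target: str = "genre") -> pd.DataFrame:
    """Correlate every numeric feature with a 0/1 indicator for each genre.

    Rows of the result are features, columns are genres.
    """
    features = frame.drop(columns=[target]).select_dtypes("number")
    indicators = pd.get_dummies(frame[target]).astype(float)
    return pd.DataFrame(
        {genre: features.corrwith(indicators[genre]) for genre in indicators.columns}
    )


# Made-up rows, only to demonstrate the helper; not the real dataset.
demo = pd.DataFrame({
    "genre": ["Rock", "Rock", "Classical", "Classical"],
    "energy": [0.9, 0.8, 0.1, 0.2],
    "acousticness": [0.1, 0.2, 0.9, 0.8],
})
table = feature_genre_correlation(demo)
print(table.round(2))
```

A strongly positive or negative cell suggests that the feature helps separate that genre from the rest, which is the kind of statement the third bullet of this section asks for.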

# Data Visualization & Insight
- Use at least 5 different visuals to tell a story about the data
- Clearly document (within the notebook) 5 different insights you gained from the data
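
As a rough sketch of what five different visuals might look like, the hypothetical function `plot_overview` below draws a bar chart, a histogram, a box plot, a scatter plot, and a correlation heatmap on one figure. The choice of columns (`tempo`, `energy`, `loudness`) is an assumption made for illustration, and the `demo` frame is randomly generated stand-in data rather than the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_overview(frame: pd.DataFrame):
    """Draw five different chart types on one figure and return the figure."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    ax = axes.ravel()

    # 1. Bar chart: number of songs per genre.
    frame["genre"].value_counts().plot.bar(ax=ax[0], title="Songs per genre")

    # 2. Histogram: distribution of tempo.
    ax[1].hist(frame["tempo"], bins=10)
    ax[1].set_title("Tempo distribution")

    # 3. Box plot: energy grouped by genre.
    groups = {name: g["energy"].to_numpy() for name, g in frame.groupby("genre")}
    ax[2].boxplot(list(groups.values()))
    ax[2].set_xticklabels(list(groups.keys()))
    ax[2].set_title("Energy by genre")

    # 4. Scatter plot: energy against loudness.
    ax[3].scatter(frame["energy"], frame["loudness"], s=10)
    ax[3].set_xlabel("energy")
    ax[3].set_ylabel("loudness")
    ax[3].set_title("Energy vs. loudness")

    # 5. Heatmap: correlation matrix of the numeric columns.
    corr = frame.select_dtypes("number").corr()
    ticks = np.arange(len(corr))
    ax[4].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax[4].set_xticks(ticks)
    ax[4].set_xticklabels(corr.columns, rotation=90)
    ax[4].set_yticks(ticks)
    ax[4].set_yticklabels(corr.columns)
    ax[4].set_title("Correlation heatmap")

    ax[5].axis("off")  # the sixth panel is unused
    fig.tight_layout()
    return fig


# Randomly generated stand-in data, only to demonstrate the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "genre": rng.choice(["Rock", "Rap", "EDM"], size=60),
    "tempo": rng.normal(120, 20, size=60),
    "energy": rng.uniform(0, 1, size=60),
    "loudness": rng.normal(-8, 3, size=60),
})
fig = plot_overview(demo)
```

In the notebook, each chart would normally get its own cell followed by a short markdown note stating the insight it supports.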

# Feature Engineering
- Convert categorical or non-numeric features into a numerical representation
- Transform necessary features using feature transformation techniques of your choice.

In [24]:
#feature engineering
#convert genre to numerical values
genre_map = {'Classical':0,'Country':1,'EDM':2,'Rap':3,'Rock':4}
df_without_artist['genre'] = df['genre'].map(genre_map)
df_without_song['genre'] = df['genre'].map(genre_map)
df_without_artist_song['genre'] = df['genre'].map(genre_map)

#standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_columns = df_without_artist_song.drop(['genre'],axis=1).columns
df_without_artist_song[numerical_columns] = scaler.fit_transform(df_without_artist_song[numerical_columns])

df_without_artist_song.head()





| | genre | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | speechiness | tempo | mode | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1.041974 | 0.295349 | 1.011195 | -0.091078 | -0.513893 | 1.287656 | -0.81131 | -0.263009 | -0.24024 | -0.353754 | -1.333333 | -1.112988 |
| 1 | 4 | -0.927958 | -1.313484 | -0.112706 | 1.19902 | -0.522657 | -1.227298 | -0.442531 | 0.59397 | 0.045146 | 1.197124 | 0.75 | -0.957582 |
| 2 | 4 | -0.923408 | -0.302881 | 0.340258 | 1.03953 | -0.522439 | -1.506737 | -0.48969 | 0.529694 | -0.214164 | 0.893385 | 0.75 | -0.909765 |
| 3 | 4 | -0.91927 | -0.570052 | 0.996399 | 1.053707 | -0.505827 | -1.227298 | -0.601927 | 0.641526 | -0.512587 | 1.741484 | 0.75 | 0.429116 |
| 4 | 4 | -0.740801 | 0.620601 | -0.931794 | 0.585869 | -0.522657 | 0.728778 | -0.173728 | 0.682676 | -0.17505 | -0.173983 | 0.75 | 1.289826 |
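
The cell above fits `StandardScaler` on the full dataset, which happens before the train/test split, so statistics from the test rows influence the scaling. A possible alternative, sketched here with a hypothetical helper `encode_and_scale`, is to split first and fit the scaler on the training rows only. It also swaps the hand-written mapping for `LabelEncoder`, which assigns integers to the sorted class names. The `demo_X` and `demo_y` objects are randomly generated stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def encode_and_scale(features: pd.DataFrame, labels: pd.Series,
                     test_size: float = 0.2, seed: int = 101):
    """Encode labels, split, then scale using training-set statistics only."""
    encoder = LabelEncoder()
    y = encoder.fit_transform(labels)

    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=test_size, random_state=seed, stratify=y
    )

    # Fit on the training rows; reuse the same mean and std for the test rows.
    scaler = StandardScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train),
                           columns=features.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=features.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test, encoder


# Randomly generated stand-in data, only to demonstrate the call.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "tempo": rng.normal(120, 20, size=20),
    "energy": rng.uniform(0, 1, size=20),
})
demo_y = pd.Series(["Rock", "Classical"] * 10)

X_tr, X_te, y_tr, y_te, enc = encode_and_scale(demo_X, demo_y)
print(list(enc.classes_))
```

Whether columns such as `key` and `mode` should be standardized at all is a separate judgment call, since they are categorical codes rather than continuous measurements.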


# Machine Learning
- Use 2 different ML algorithms to build a model using your preprocessed data.
- Compare the 2 models based on their accuracy.

In [34]:
#split the data into training and testing sets
from sklearn.model_selection import train_test_split
X = df_without_artist_song.drop(['genre'],axis=1)
y = df_without_artist_song['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

#train the model logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=54)
lr.fit(X_train,y_train)

#import random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100,random_state=54)
rfc.fit(X_train,y_train)

#make predictions
predictions_lr = lr.predict(X_test)
predictions_rfc = rfc.predict(X_test)

#import svm
from sklearn.svm import SVC
svc = SVC(random_state=54)
svc.fit(X_train,y_train)
predictions_svc = svc.predict(X_test)


#compare based on accuracy
from sklearn.metrics import accuracy_score
print("Logistic Regression:")
print(accuracy_score(y_test,predictions_lr))
print("Random Forest:")
print(accuracy_score(y_test,predictions_rfc))
print("SVM:")
print(accuracy_score(y_test,predictions_svc))


Logistic Regression:
0.72
Random Forest:
0.82
SVM:
0.64
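
The accuracies above come from a single 80/20 split, so they may shift depending on which rows land in the test set. One way to make the comparison less sensitive to that, sketched here as an assumption rather than a rubric requirement, is k-fold cross-validation with `cross_val_score`. The helper name `compare_by_cv_accuracy` is hypothetical, and the data comes from `make_classification`, not from the music dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_by_cv_accuracy(models: dict, X, y, folds: int = 5) -> dict:
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: float(cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean())
        for name, model in models.items()
    }


# Synthetic stand-in data, only to demonstrate the call.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)
scores = compare_by_cv_accuracy(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=54),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=54),
    },
    X_demo,
    y_demo,
)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```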


# Model Evaluation
- Evaluate the 2 models using a minimum of 4 evaluation metrics
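
A minimal sketch of one way to meet this requirement is shown here. The hypothetical helper `evaluate_classifier` reports accuracy, macro-averaged precision, recall, and F1, and the confusion matrix is printed alongside them. Macro averaging is my assumption for a five-class problem; weighted averaging would be an equally reasonable choice. The label lists are made up to show the call.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_classifier(y_true, y_pred) -> dict:
    """Compute four common classification metrics for one model."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Made-up labels, only to demonstrate the helper.
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 0, 1, 2, 2, 2]

metrics = evaluate_classifier(demo_true, demo_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(confusion_matrix(demo_true, demo_pred))
```

In the notebook, the same helper could be called once per model, for example with `y_test` and each model's predictions, to produce a side-by-side comparison.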

# Deployment
- Deploy the ML model to the cloud.
- Provide a live, working URL to the deployed app.
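
Whatever hosting service is chosen, the deployed app usually needs to load a saved model and turn one row of features into a genre name. The sketch below covers only that step, using `joblib` for persistence. The function `predict_genre`, the file name `genre_model.joblib`, and the randomly generated training data are all assumptions for illustration; the web framework and cloud provider are left open.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same order as the integer mapping used in the notebook (0..4).
GENRES = ["Classical", "Country", "EDM", "Rap", "Rock"]


def predict_genre(model_path, feature_row) -> str:
    """Load a saved model and return the genre name for one row of features."""
    model = joblib.load(model_path)
    label = int(model.predict(np.asarray([feature_row]))[0])
    return GENRES[label]


# Randomly generated stand-in data with 12 features, only to demonstrate the flow.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 12))
y_demo = rng.integers(0, 5, size=50)
model = RandomForestClassifier(n_estimators=10, random_state=54).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "genre_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    predicted = predict_genre(model_path, X_demo[0])

print(predicted)
```

If the features were standardized during training, the fitted scaler would need to be saved and applied inside `predict_genre` as well, so that the app sees inputs on the same scale as the model did.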