<a href="https://colab.research.google.com/github/michaelwnau/ai_academy_notebooks/blob/main/spotify_exploratory_analysis_P1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Template: Phase 1

Below are some concrete steps you can take during your analysis. This guide isn't "one size fits all," so you will probably not do everything listed, but it still serves as a good "pipeline" for doing data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


## Loading Data

1. Load up your data
2. Decide what you want to predict

### Refresher on Data Types

* Scalar (no transformation needed)
    * Numeric
    * Discrete
        * Ordinal
        * Binary
* Text
    * Bag of Words, TF-IDF, Embeddings
* Sets (e.g. tags)
    * A simple bag of words doesn't work, since tags can be multi-word
    * One hot encoding
* Time series
    * Naive approaches
        * Last value
        * Average, Median
        * Max/min
* Numeric Data that isn't directly interpretable (e.g. geospatial data)
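
As a rough illustration of two of the non-scalar types listed above, the sketch below encodes a set-valued column and summarizes a short time series with the naive statistics from the list. The toy data and column names (`tags`, `plays_per_week`) are hypothetical, and `MultiLabelBinarizer` is used here as one possible way to get a one-hot style (multi-hot) encoding for sets; it is an assumption, not a step taken elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each row carries a set of tags and a short time series.
df = pd.DataFrame({
    "tags": [["hip hop", "dark trap"], ["pop"], ["pop", "hip hop"]],
    "plays_per_week": [[3, 5, 8], [10, 10, 10], [1, 4, 2]],
})

# Sets: one binary column per tag, so multi-word tags stay intact.
mlb = MultiLabelBinarizer()
tag_features = pd.DataFrame(mlb.fit_transform(df["tags"]), columns=mlb.classes_)

# Time series: expand each list into columns, then take naive summaries.
series = pd.DataFrame(df["plays_per_week"].tolist())
ts_features = pd.DataFrame({
    "last": series.iloc[:, -1],
    "mean": series.mean(axis=1),
    "median": series.median(axis=1),
    "max": series.max(axis=1),
    "min": series.min(axis=1),
})

# Combine both feature groups into one table (one row per original row).
features = pd.concat([tag_features, ts_features], axis=1)
print(features)
```

Each tag becomes its own column regardless of how many words it contains, which is what a word-level bag of words would get wrong.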

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


In [None]:
# Load data
playlists = pd.read_csv('/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/8- Week 8/WKS8_Student/archive/playlists.csv')
genres = pd.read_csv('/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/8- Week 8/WKS8_Student/archive/genres_v2.csv')

# Preview the first rows of each DataFrame
print(playlists.head())
print(genres.head())


                 Playlist      Genre
0  19WuHd4MxWLzE1fpMmw4S4  Dark Trap
1  6XyR8uzgkSoDzHuOxxRtLH  Dark Trap
2  37Ij3ofyhvEhFEH8YZMZ2X  Dark Trap
3  07zTlfPpsxeoWdumbkNWMI  Dark Trap
4  2dClSRLsnptdkDQnpi5H2f  Dark Trap
   danceability  energy  key  loudness  mode  speechiness  acousticness  \
0         0.831   0.814    2    -7.364     1       0.4200        0.0598   
1         0.719   0.493    8    -7.230     1       0.0794        0.4010   
2         0.850   0.893    5    -4.783     1       0.0623        0.0138   
3         0.476   0.781    0    -4.710     1       0.1030        0.0237   
4         0.798   0.624    2    -7.668     1       0.2930        0.2170   

   instrumentalness  liveness  valence  ...                      id  \
0          0.013400    0.0556   0.3890  ...  2Vc6NJ9PW9gD9q343XFRKx   
1          0.000000    0.1180   0.1240  ...  7pgJBLVz5VmnL7uGHmRj6p   
2          0.000004    0.3720   0.0391  ...  0vSWgAlfpye0WCGeNmuNhy   
3          0.000000    0.1140   0.1750  ...



In [None]:
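# Check for missing values in each column
print(playlists.isnull().sum())

# Fill missing numeric values with the column median
# (numeric_only=True skips text columns such as 'Playlist' and 'Genre')
playlists.fillna(playlists.median(numeric_only=True), inplace=True)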




In [None]:
# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the 'Playlist' data
X = vectorizer.fit_transform(playlists['Playlist'])

# Convert to dataframe
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print(playlists.columns)


Index(['Playlist', 'Genre'], dtype='object')


In [None]:
# Create a OneHotEncoder instance
encoder = OneHotEncoder()

# Fit and transform the Genre data
X = encoder.fit_transform(playlists['Genre'].values.reshape(-1, 1))

# Convert to dataframe
X_df = pd.DataFrame(X.toarray(), columns=encoder.get_feature_names_out())


In [None]:
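# Split data into train and test sets
# Note: X_df is the one-hot encoding of 'Genre', which is also the target here,
# so the features reveal the label and the perfect scores in the report reflect that leakage.
X_train, X_test, y_train, y_test = train_test_split(X_df, playlists['Genre'], test_size=0.2, random_state=42)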

# Create a RandomForestClassifier instance
clf = RandomForestClassifier()

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print a classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

   Dark Trap       1.00      1.00      1.00         1
         Emo       1.00      1.00      1.00         1
      Hiphop       1.00      1.00      1.00         1
         Pop       1.00      1.00      1.00         1
         RnB       1.00      1.00      1.00         3
  Trap Metal       1.00      1.00      1.00         1

    accuracy                           1.00         8
   macro avg       1.00      1.00      1.00         8
weighted avg       1.00      1.00      1.00         8
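
For comparison, here is a minimal sketch of the same train/test workflow with features that do not include the target. It assumes track-level audio features like the `danceability` and `energy` columns seen in the `genres.head()` output; the data itself is synthetic, and the `genre` column name and the rule used to generate labels are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for track-level data; values are made up for illustration.
tracks = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
})
# Hypothetical label that depends on energy plus noise.
noisy_energy = tracks["energy"] + rng.normal(0, 0.2, n)
tracks["genre"] = np.where(noisy_energy > 0.5, "Trap Metal", "RnB")

X = tracks[["danceability", "energy"]]  # features exclude the target column
y = tracks["genre"]

# Stratify so both splits keep roughly the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```

Because the label here is only partly determined by the features, scores below 1.00 are expected, which is the more typical situation when the target is not part of the input.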



## Exploratory Data Analysis (EDA)

1. Decide if feature selection is needed.
    * Do you have highly correlated features?
2. Decide if you have non-scalar attributes.
3. What type of supervised learning is this?
    * Binary Classification
    * Multi-class classification?
    * Ordinal classification [Tricky]
        * Do you want to change this into regression or binarize into binary classification?
    * Regression
4. If doing classification
    1. Decide whether your class variable makes sense.
    2. Figure out what your class balance is
5. Histogram the features
    * Useful for spotting highly skewed distributions
6. Visualize using reduced dimensions
    * PCA, MVD
    * t-SNE
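
Several of the checks above can be sketched in a few lines of pandas and scikit-learn. The example below uses synthetic data with hypothetical column names; the 0.9 correlation threshold is an arbitrary choice for illustration, not a recommended value.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-in data; column names are hypothetical.
df = pd.DataFrame({"energy": rng.uniform(0, 1, n)})
df["loudness"] = 10 * df["energy"] + rng.normal(0, 0.1, n)  # nearly redundant with energy
df["duration"] = rng.exponential(1.0, n)                    # right-skewed
df["label"] = rng.choice(["a", "b", "c"], size=n, p=[0.7, 0.2, 0.1])

numeric = df.drop(columns="label")

# Class balance: share of rows per class.
class_balance = df["label"].value_counts(normalize=True)

# Highly correlated pairs: absolute Pearson correlation above the threshold,
# looking only at the upper triangle so each pair appears once.
corr = numeric.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]

# Skewness per feature; large positive values indicate a long right tail.
skewness = numeric.skew()

# Reduced dimensions: project standardized features onto two principal components.
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(numeric))

print(class_balance)
print(high_pairs)
print(skewness)
```

In a notebook, `numeric.hist()` draws the per-feature histograms, and the two columns of `coords` can be passed to `plt.scatter` to view the projection.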

## Preprocessing

1. Remove meaningless features (e.g. IDs) or unfair features (e.g. percent grade should be removed if predicting final grade)
2. Discretization
3. Transform features into usable formats (standardize dates, vectorize words)
4. Transform data to a wide format (one row per prediction)
5. **Feature Selection**: Remove redundant, noisy features or unhelpful features
6. Feature creation
    * Use an external tool (e.g. analyzing sentiment from text)
7. Revisit EDA using processed features
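
A few of these preprocessing steps can be sketched with pandas alone. The example below assumes a hypothetical long-format table of student scores (all names and values are invented for illustration) and walks through date standardization, discretization, reshaping to wide format, simple feature creation, and dropping the ID.

```python
import pandas as pd

# Hypothetical long-format data: one row per (student, assignment) pair.
long_df = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "assignment": ["hw1", "hw2", "hw1", "hw2"],
    "score": [55.0, 80.0, 92.0, 70.0],
    "submitted": ["2023-01-05", "2023-01-12", "2023-01-06", "2023-01-12"],
})

# Standardize dates into a single datetime dtype.
long_df["submitted"] = pd.to_datetime(long_df["submitted"])

# Discretization: bin numeric scores into ordered categories.
long_df["score_band"] = pd.cut(
    long_df["score"], bins=[0, 60, 85, 100], labels=["low", "mid", "high"]
)

# Wide format: one row per student, i.e. one row per prediction.
wide = long_df.pivot(index="student_id", columns="assignment", values="score")
wide.columns = [f"score_{c}" for c in wide.columns]

# Feature creation: a simple derived feature.
wide["score_change"] = wide["score_hw2"] - wide["score_hw1"]

# Remove meaningless features: drop the ID once rows are aligned.
features = wide.reset_index(drop=True)
print(features)
```

The bin edges and the derived `score_change` feature are arbitrary choices here; in practice they would come from the EDA findings.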