**Dataset**
labeled datasset collected from Spotify (Assignment 1 - Spotify Reviews Rating)

**Objective**
classify Review to a category from 1 to 5. <br>

**Total Estimated Time = 90-120 Mins**

**Evaluation metric**
macro f1 score

### Import used libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
import re
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import make_scorer, f1_score

### Load Dataset

In [2]:
df=pd.read_csv('Assignment 1 - Spotify Reviews Rating.csv')
df.head()

Unnamed: 0,Time_submitted,Review,Rating
0,7/9/2022 15:00,"Great music service, the audio is high quality...",5
1,7/9/2022 14:21,Please ignore previous negative rating. This a...,5
2,7/9/2022 13:27,"This pop-up ""Get the best Spotify experience o...",4
3,7/9/2022 13:26,Really buggy and terrible to use as of recently,1
4,7/9/2022 13:20,Dear Spotify why do I get songs that I didn't ...,1


In [3]:
df.columns

Index(['Time_submitted', 'Review', 'Rating'], dtype='object')

### Data splitting

It is a good practice to split the data before EDA helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures reliable model performance evaluation on unseen data.

In [4]:
df.drop('Time_submitted',axis=1,inplace=True)

In [5]:
x=df['Review']
y=df['Rating']

In [6]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=42)

In [7]:
print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(49275,) (49275,) (12319,) (12319,)


### EDA on training data

- check NaNs

In [8]:
x_train.isna().sum() #no missing values 

0

- check duplicates

In [9]:
x_train.duplicated().sum() #there are duplications that need to be handled 

159

- show a representative sample of data texts to find out required preprocessing steps

In [10]:
sample_texts = x_train.sample(n=10, random_state=3099)  
for text in sample_texts:
    print(text)
    print("*****************")

I love Spotify! I love how much music there is, and the mixes they give you, the podcasts and the end of year roundups
*****************
This new update is stupid. I can only listen like a few seconds of a song make the update back to the way it was
*****************
To many ads for non premium content 😒
*****************
Makes random beeping sounds unassociated with any activity. All notifications, sounds, .. are turned off. Spotify just randomly beeps ...very annoying. Im thinking of dropping it and going with another app.
*****************
I've been using this app for a couple of years. I have premium subscription and right now this app just logged me out without any reason. I can't log in! The official site is not responding. I live in Belarus and I'm really upset because in comparison with Russia we don't have podcasts and our library of music is smaller but app is much more expensive. Spotify canceled the premium subscription in Russia, but the free one still works. My friend has

- check dataset balancing

In [11]:
class_distribution = y_train.value_counts()


In [13]:
print("Class Distribution:")
print(class_distribution) #data is not balanced

Class Distribution:
Rating
5    17629
1    14129
4     6310
2     5733
3     5474
Name: count, dtype: int64


- Cleaning and Preprocessing are:
    - 1 lowercasing
    - 2 removing puctuation and special characters
    - 3 Handling Emojis
    - 4 Removing Stopwords
    - 5 Lemmatization or Stemming
    - 6 Expanding Contractions

### Cleaning and Preprocessing

In [8]:
x_train.drop_duplicates(inplace=True)

In [9]:
x_train.duplicated().sum()

0

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [11]:
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self,  lowercase=True,
                 remove_punctuation=True, remove_stopwords=True, lemmatize=True):
       
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        processed_texts = []
        
        for text in X:
            
            if self.lowercase:
                text = text.lower()
            
            if self.remove_punctuation:
                text = text.translate(str.maketrans('', '', string.punctuation))
            
            if self.remove_stopwords:
                tokens = word_tokenize(text)
                filtered_tokens = [word for word in tokens if word not in self.stop_words]
                text = ' '.join(filtered_tokens)
            
            
            if self.lemmatize:
                tokens = word_tokenize(text)
                lemmatized_tokens = [self.lemmatizer.lemmatize(word) for word in tokens]
                text = ' '.join(lemmatized_tokens)
            
            processed_texts.append(text)
        
        return processed_texts


**You  are doing Great so far!**

### Modelling

##### loading word embeddings

In [12]:
from gensim.models import KeyedVectors


In [13]:
path_to_fasttext_file = 'wiki-news-300d-1M.vec/wiki-news-300d-1M.vec' 

fasttext_model = KeyedVectors.load_word2vec_format(path_to_fasttext_file, binary=False)

testing fast text

In [14]:
word_vector = fasttext_model['king']
print("Word vector for 'king':", word_vector)

Word vector for 'king': [ 1.082e-01  4.450e-02 -3.840e-02  1.100e-03 -8.880e-02  7.130e-02
 -6.960e-02 -4.770e-02  7.100e-03 -4.080e-02 -7.070e-02 -2.660e-02
  5.000e-02 -8.240e-02  8.480e-02 -1.627e-01 -8.510e-02 -2.950e-02
  1.534e-01 -1.828e-01 -2.208e-01  2.430e-02 -9.210e-02 -1.089e-01
 -1.009e-01 -1.190e-02  3.770e-02  2.038e-01  7.200e-02  2.020e-02
  2.798e-01  1.150e-02 -1.510e-02  1.037e-01  4.000e-04 -1.040e-02
  1.960e-02  1.265e-01  8.280e-02 -1.369e-01  1.070e-01  1.270e-01
 -3.490e-02 -6.830e-02 -1.140e-02  3.370e-02  1.260e-02  7.920e-02
  4.400e-02 -2.530e-02  4.890e-02 -7.850e-02 -6.259e-01 -9.720e-02
  1.654e-01 -5.780e-02 -4.370e-02  4.090e-02 -1.820e-02 -1.891e-01
  2.770e-02 -1.460e-02 -5.310e-02  4.260e-02  4.900e-03  4.000e-03
  1.423e-01 -9.750e-02 -3.500e-03  9.630e-02 -1.900e-03 -1.466e-01
 -1.662e-01  6.650e-02 -1.500e-01 -1.267e-01  2.670e-02 -1.560e-01
 -1.442e-01  1.515e-01  2.420e-02 -6.080e-02  9.180e-02 -2.407e-01
 -4.110e-02 -1.420e-02  6.550e-02 -3.5

In [15]:
similar_words = fasttext_model.most_similar('king', topn=5)
print("Most similar words to 'king':", similar_words)

Most similar words to 'king': [('kings', 0.7969563603401184), ('queen', 0.7638539671897888), ('monarch', 0.739997148513794), ('King', 0.7281951904296875), ('prince', 0.7132730484008789)]


In [16]:
def text_to_embeddings(text, embedding_model, embedding_size=300):
    words = text.split()
    embeddings = []
    for word in words:
        if word in embedding_model:
            embeddings.append(embedding_model[word])
        else:
            embeddings.append([0] * embedding_size) 

    return embeddings

X_train_embeddings = [text_to_embeddings(text, fasttext_model) for text in x_train]
X_test_embeddings = [text_to_embeddings(text, fasttext_model) for text in x_test]


In [17]:

X_train_flattened = np.array([np.mean(embeddings, axis=0) for embeddings in X_train_embeddings])
X_test_flattened = np.array([np.mean(embeddings, axis=0) for embeddings in X_test_embeddings])

model = Sequential([
    Dense(128, activation='relu', input_shape=(300,)), 
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid') 
])


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['f1_score'])
model.fit(X_train_flattened, y_train, epochs=10, batch_size=32, validation_split=0.2)


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - f1_score: 1.5170 - loss: -49343.8672 - val_f1_score: 1.5216 - val_loss: -794491.7500
Epoch 2/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 1.5176 - loss: -1565785.0000 - val_f1_score: 1.5216 - val_loss: -5232335.5000
Epoch 3/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 1.5170 - loss: -7098204.0000 - val_f1_score: 1.5216 - val_loss: -14742295.0000
Epoch 4/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 1.5178 - loss: -17889386.0000 - val_f1_score: 1.5216 - val_loss: -30048072.0000
Epoch 5/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 1.5181 - loss: -34525836.0000 - val_f1_score: 1.5216 - val_loss: -51875628.0000
Epoch 6/10
[1m1228/1228[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - f1_score: 1.5189 -

<keras.src.callbacks.history.History at 0x2bc41c9ce90>

#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 score is a useful metric in scenarios where you want to evaluate the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)

In [18]:
# Evaluate the model on test data
loss, accuracy = model.evaluate(X_test_flattened, y_test)
print(f"Test Accuracy: {f1_score}")


[1m385/385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 638us/step - f1_score: 1.5203 - loss: -287219872.0000
Test Accuracy: <function f1_score at 0x000002BBC37991C0>


### Enhancement

### Conclusion and final results


#### Done!