### Text Classification for Spotify Reviews
The goal of this project is to extract key features from positive and negative Spotify reviews. Natural language processing techniques, specifically TF-IDF on bigrams and trigrams, are used to extract important features and classify reviews as positive or negative. The data is downloaded from Kaggle.
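To make the idea concrete, here is a minimal, library-free sketch of TF-IDF scoring over bigrams and trigrams. The toy reviews, the helper names `ngrams` and `tfidf`, and the plain `tf * log(N / df)` weighting are assumptions for illustration only. scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its scores will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=2, n_max=3):
    """Return all n-grams (joined with spaces) for n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs):
    """Score each n-gram per document with tf * log(N / df)."""
    grams_per_doc = [Counter(ngrams(doc.split())) for doc in docs]
    # Document frequency: number of documents that contain each n-gram
    df = Counter(g for grams in grams_per_doc for g in grams)
    n_docs = len(docs)
    scores = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        scores.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in grams.items()}
        )
    return scores

# Hypothetical, already-cleaned reviews
toy_reviews = [
    "great music service",
    "great music app",
    "app keeps crashing",
]
scores = tfidf(toy_reviews)

# Print the n-grams of the first review, highest score first
for gram, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{gram}: {score:.3f}")
```

In this toy corpus, "great music" appears in two of the three reviews, so it scores lower than "music service", which appears in only one. That is the property that makes TF-IDF useful for surfacing phrases that distinguish one group of reviews from another.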

In [10]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import kagglehub

In [2]:
# Download the Spotify review dataset from Kaggle
path = kagglehub.dataset_download("alexandrakim2201/spotify-dataset")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\krist\.cache\kagglehub\datasets\alexandrakim2201\spotify-dataset\versions\2


In [21]:
# Build the file path portably instead of hard-coding a backslash
import os
df = pd.read_csv(os.path.join(path, 'DATASET.csv'))
df.head()

Unnamed: 0,Review,label
0,"Great music service, the audio is high quality...",POSITIVE
1,Please ignore previous negative rating. This a...,POSITIVE
2,"This pop-up ""Get the best Spotify experience o...",NEGATIVE
3,Really buggy and terrible to use as of recently,NEGATIVE
4,Dear Spotify why do I get songs that I didn't ...,NEGATIVE


#### Text Pre-processing
The review text needs to be cleaned before analysis. The cleaning steps include:
- Set text to lowercase
- Remove punctuation
- Remove stop words
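
The steps above can be sketched on a single string before applying them to the whole DataFrame. This is an illustrative sketch only: the function name `clean_review` and the small `TOY_STOP_WORDS` set are hypothetical stand-ins, whereas the notebook itself uses NLTK's English stop word list.

```python
import re

# Hypothetical stop word list; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {"the", "is", "and", "to", "as", "of"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from one review."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = text.split()                # split on any run of whitespace
    return [t for t in tokens if t not in stop_words]

print(clean_review("Great music service, the audio is high quality!"))
# ['great', 'music', 'service', 'audio', 'high', 'quality']
```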

In [22]:
# Convert all reviews to lowercase
df['Review'] = df['Review'].str.lower()
df.head()

Unnamed: 0,Review,label
0,"great music service, the audio is high quality...",POSITIVE
1,please ignore previous negative rating. this a...,POSITIVE
2,"this pop-up ""get the best spotify experience o...",NEGATIVE
3,really buggy and terrible to use as of recently,NEGATIVE
4,dear spotify why do i get songs that i didn't ...,NEGATIVE


In [23]:
# Remove punctuation
df['Review'] = df['Review'].str.replace(r'[^\w\s]', '', regex=True)
df.head()

Unnamed: 0,Review,label
0,great music service the audio is high quality ...,POSITIVE
1,please ignore previous negative rating this ap...,POSITIVE
2,this popup get the best spotify experience on ...,NEGATIVE
3,really buggy and terrible to use as of recently,NEGATIVE
4,dear spotify why do i get songs that i didnt p...,NEGATIVE
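
The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace. Two side effects are visible in the output above: apostrophes are dropped, so contractions fuse ("didn't" becomes "didnt"), and hyphens are dropped, so "pop-up" becomes "popup". Underscores survive because `\w` matches them. A small sketch on hypothetical sample strings shows this behavior:

```python
import re

PUNCT = re.compile(r"[^\w\s]")

# Hypothetical sample strings chosen to show the edge cases
samples = ["didn't", "pop-up", "user_name", "5/5 stars!!"]
cleaned = {s: PUNCT.sub("", s) for s in samples}

for original, result in cleaned.items():
    print(repr(original), "->", repr(result))
```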


In [24]:
# Remove leading and trailing whitespace
df['Review'] = df['Review'].str.strip()

In [25]:
# Load the English stop word list (uncomment the download on first run)
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Tokenize each review by splitting on whitespace
df['Review'] = df['Review'].apply(lambda x: str(x).split())

In [35]:
# Keep only the tokens that are not stop words
df['Review Tokenized'] = df['Review'].apply(lambda x: [w for w in x if w not in stop_words])
df.head()

Unnamed: 0,Review,label,Review Tokenized
0,"[great, music, service, the, audio, is, high, ...",POSITIVE,"[great, music, service, audio, high, quality, ..."
1,"[please, ignore, previous, negative, rating, t...",POSITIVE,"[please, ignore, previous, negative, rating, a..."
2,"[this, popup, get, the, best, spotify, experie...",NEGATIVE,"[popup, get, best, spotify, experience, androi..."
3,"[really, buggy, and, terrible, to, use, as, of...",NEGATIVE,"[really, buggy, terrible, use, recently]"
4,"[dear, spotify, why, do, i, get, songs, that, ...",NEGATIVE,"[dear, spotify, get, songs, didnt, put, playli..."
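
Note that "didnt" survives the filter in row 4. Because punctuation was stripped before this step, a contraction no longer matches a stop word entry spelled with an apostrophe. The sketch below compares the two orderings on a hypothetical stop word set. It assumes the real list stores contractions with their apostrophes, which is worth checking against `stop_words` directly.

```python
import re

# Hypothetical stop word set; assumed to store contractions with apostrophes
toy_stop_words = {"i", "that", "didn't"}

def strip_punct(token):
    """Remove every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", token)

tokens = "songs that i didn't put".split()

# Order used in the notebook: strip punctuation, then filter stop words
strip_then_filter = [t for t in map(strip_punct, tokens) if t not in toy_stop_words]

# Alternative order: filter stop words first, then strip punctuation
filter_then_strip = [strip_punct(t) for t in tokens if t not in toy_stop_words]

print(strip_then_filter)  # ['songs', 'didnt', 'put']
print(filter_then_strip)  # ['songs', 'put']
```

Which ordering is preferable is a design choice. For sentiment classification, negations such as "didnt" may carry useful signal, so keeping them can be reasonable.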


In [None]:
# Set up the WordNet lemmatizer (uncomment the download on first run)
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
lem = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\krist\AppData\Roaming\nltk_data...


In [64]:
# Lemmatize every token in each review
df['Lemmatize Review'] = df['Review Tokenized'].apply(lambda x: [lem.lemmatize(w) for w in x])
df.head()

Unnamed: 0,Review,label,Review Tokenized,Lemmatize Review
0,"[great, music, service, the, audio, is, high, ...",POSITIVE,"[great, music, service, audio, high, quality, ...","[great, music, service, audio, high, quality, ..."
1,"[please, ignore, previous, negative, rating, t...",POSITIVE,"[please, ignore, previous, negative, rating, a...","[please, ignore, previous, negative, rating, a..."
2,"[this, popup, get, the, best, spotify, experie...",NEGATIVE,"[popup, get, best, spotify, experience, androi...","[popup, get, best, spotify, experience, androi..."
3,"[really, buggy, and, terrible, to, use, as, of...",NEGATIVE,"[really, buggy, terrible, use, recently]","[really, buggy, terrible, use, recently]"
4,"[dear, spotify, why, do, i, get, songs, that, ...",NEGATIVE,"[dear, spotify, get, songs, didnt, put, playli...","[dear, spotify, get, song, didnt, put, playlis..."


In [66]:
# Join the cleaned tokens back into a single string per review
df['Clean Review'] = df['Lemmatize Review'].apply(lambda x: ' '.join(x))
df['Clean Review']

0        great music service audio high quality app eas...
1        please ignore previous negative rating app sup...
2        popup get best spotify experience android 12 a...
3                       really buggy terrible use recently
4        dear spotify get song didnt put playlist shuff...
                               ...                        
52697                                             yes best
52698    spotify heart feb 2024 heart music lyric langu...
52699    tried open app wont open restarted phone ill t...
52700                                                 good
52701                 nice app play music affordable price
Name: Clean Review, Length: 52702, dtype: object