### Text Classification for Spotify Reviews
The goal of this project is to extract key features from Spotify reviews, both positive and negative. Natural language processing techniques, specifically TF-IDF on bigrams and trigrams, are used to extract important features and classify reviews as positive or negative. The data is downloaded from Kaggle.
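
To make the "bigrams and trigrams" idea concrete, here is a minimal sketch of how word n-grams can be built from a token list. The function name `word_ngrams` and its parameters are hypothetical and used only for illustration; they are not part of the notebook or of any library.

```python
def word_ngrams(tokens, min_n=2, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "best music app ever".split()
print(word_ngrams(tokens))
# ['best music', 'music app', 'app ever', 'best music app', 'music app ever']
```

Each of these phrases would become one candidate feature, so "music app" can be weighted separately from "music" and "app".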

In [10]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import kagglehub

In [2]:
# Download the Spotify review data from Kaggle
path = kagglehub.dataset_download("alexandrakim2201/spotify-dataset")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\krist\.cache\kagglehub\datasets\alexandrakim2201\spotify-dataset\versions\2


In [83]:
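import os

# Build the file path portably instead of relying on a backslash in a string literal
df = pd.read_csv(os.path.join(path, 'DATASET.csv'))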
df.head()

Unnamed: 0,Review,label
0,"Great music service, the audio is high quality...",POSITIVE
1,Please ignore previous negative rating. This a...,POSITIVE
2,"This pop-up ""Get the best Spotify experience o...",NEGATIVE
3,Really buggy and terrible to use as of recently,NEGATIVE
4,Dear Spotify why do I get songs that I didn't ...,NEGATIVE


#### Text Pre-processing
The review text needs to be cleaned before analysis. The cleaning steps include:
- Set text to lowercase
- Remove punctuation
- Remove stop words
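
As a rough sketch, the three steps above can be applied to a single string as follows. The helper `clean_review` and the tiny `STOP_WORDS` set are hypothetical stand-ins chosen for illustration; the notebook itself uses pandas string methods and NLTK's English stop word list.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list
STOP_WORDS = {"the", "is", "and", "to", "as", "of"}


def clean_review(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words from one review."""
    text = text.lower()                   # 1. set text to lowercase
    text = re.sub(r"[^\w\s]", "", text)   # 2. remove punctuation
    tokens = text.split()                 # split on whitespace
    return [t for t in tokens if t not in stop_words]  # 3. remove stop words


print(clean_review("Really buggy and terrible to use as of recently"))
# ['really', 'buggy', 'terrible', 'use', 'recently']
```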

In [84]:
# Set reviews all to lowercase
df['Review'] = df['Review'].str.lower()
df.head()

Unnamed: 0,Review,label
0,"great music service, the audio is high quality...",POSITIVE
1,please ignore previous negative rating. this a...,POSITIVE
2,"this pop-up ""get the best spotify experience o...",NEGATIVE
3,really buggy and terrible to use as of recently,NEGATIVE
4,dear spotify why do i get songs that i didn't ...,NEGATIVE


In [85]:
# Remove punctuation
df['Review'] = df['Review'].str.replace(r'[^\w\s]', '', regex=True)
df.head()

Unnamed: 0,Review,label
0,great music service the audio is high quality ...,POSITIVE
1,please ignore previous negative rating this ap...,POSITIVE
2,this popup get the best spotify experience on ...,NEGATIVE
3,really buggy and terrible to use as of recently,NEGATIVE
4,dear spotify why do i get songs that i didnt p...,NEGATIVE


In [86]:
# Strip leading and trailing whitespace
df['Review'] = df['Review'].str.strip()

In [87]:
# Load the English stop word list
#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Tokenize each review by splitting on whitespace
df['Review'] = df['Review'].apply(lambda x: str(x).split())

In [88]:
# Remove stop words from each token list
df['Review Tokenized'] = df['Review'].apply(lambda x: [w for w in x if w not in stop_words])
df.head()

Unnamed: 0,Review,label,Review Tokenized
0,"[great, music, service, the, audio, is, high, ...",POSITIVE,"[great, music, service, audio, high, quality, ..."
1,"[please, ignore, previous, negative, rating, t...",POSITIVE,"[please, ignore, previous, negative, rating, a..."
2,"[this, popup, get, the, best, spotify, experie...",NEGATIVE,"[popup, get, best, spotify, experience, androi..."
3,"[really, buggy, and, terrible, to, use, as, of...",NEGATIVE,"[really, buggy, terrible, use, recently]"
4,"[dear, spotify, why, do, i, get, songs, that, ...",NEGATIVE,"[dear, spotify, get, songs, didnt, put, playli..."
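
In the output above, `didnt` survives stop word filtering. A likely cause is that punctuation was stripped from the reviews first, while the stop word list stores contractions with their apostrophes. One possible workaround, sketched here with a small hypothetical stop word set instead of the NLTK list, is to apply the same punctuation stripping to the stop words before filtering.

```python
import re

PUNCT = re.compile(r"[^\w\s]")

# Hypothetical stand-in for a few entries of an English stop word list
raw_stop_words = {"i", "that", "why", "do", "didn't", "don't"}

# Apply the same punctuation stripping used on the reviews, keeping the originals too
normalized_stop_words = raw_stop_words | {PUNCT.sub("", w) for w in raw_stop_words}

tokens = "dear spotify why do i get songs that i didnt put".split()
filtered = [t for t in tokens if t not in normalized_stop_words]
print(filtered)
# ['dear', 'spotify', 'get', 'songs', 'put']
```

Whether negations such as "didnt" should be dropped at all is a design choice; for sentiment classification it may be worth keeping them.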


In [89]:
# Lemmatize words
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
lem = WordNetLemmatizer()

In [90]:
df['Lemmatize Review'] = df['Review Tokenized'].apply(lambda x: [lem.lemmatize(w) for w in x])
df.head()

Unnamed: 0,Review,label,Review Tokenized,Lemmatize Review
0,"[great, music, service, the, audio, is, high, ...",POSITIVE,"[great, music, service, audio, high, quality, ...","[great, music, service, audio, high, quality, ..."
1,"[please, ignore, previous, negative, rating, t...",POSITIVE,"[please, ignore, previous, negative, rating, a...","[please, ignore, previous, negative, rating, a..."
2,"[this, popup, get, the, best, spotify, experie...",NEGATIVE,"[popup, get, best, spotify, experience, androi...","[popup, get, best, spotify, experience, androi..."
3,"[really, buggy, and, terrible, to, use, as, of...",NEGATIVE,"[really, buggy, terrible, use, recently]","[really, buggy, terrible, use, recently]"
4,"[dear, spotify, why, do, i, get, songs, that, ...",NEGATIVE,"[dear, spotify, get, songs, didnt, put, playli...","[dear, spotify, get, song, didnt, put, playlis..."


In [91]:
# Stem words
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [92]:
df['Stem Words'] = df['Lemmatize Review'].apply(lambda x: [stemmer.stem(w) for w in x])
df['Stem Words']

0        [great, music, servic, audio, high, qualiti, a...
1        [pleas, ignor, previou, neg, rate, app, super,...
2        [popup, get, best, spotifi, experi, android, 1...
3                    [realli, buggi, terribl, use, recent]
4        [dear, spotifi, get, song, didnt, put, playlis...
                               ...                        
52697                                           [ye, best]
52698    [spotifi, heart, feb, 2024, heart, music, lyri...
52699    [tri, open, app, wont, open, restart, phone, i...
52700                                               [good]
52701              [nice, app, play, music, afford, price]
Name: Stem Words, Length: 52702, dtype: object

In [93]:
# Combine cleaned words back to sentence
df['Clean Review'] = df['Stem Words'].apply(lambda x: ' '.join(x))
df['Clean Review']

0        great music servic audio high qualiti app easi...
1        pleas ignor previou neg rate app super great g...
2        popup get best spotifi experi android 12 annoy...
3                          realli buggi terribl use recent
4        dear spotifi get song didnt put playlist shuff...
                               ...                        
52697                                              ye best
52698    spotifi heart feb 2024 heart music lyric langu...
52699    tri open app wont open restart phone ill tap i...
52700                                                 good
52701                     nice app play music afford price
Name: Clean Review, Length: 52702, dtype: object

In [94]:
# Collect the cleaned positive reviews into a list
pos_review = df.loc[df['label'] == "POSITIVE", 'Clean Review'].tolist()
pos_review

['great music servic audio high qualiti app easi use also quick friendli support',
 'pleas ignor previou neg rate app super great give five star',
 'love select lyric provid song your listen',
 'great app best mp3 music app ever use one problem cant play song find song despit app wonder recommend best',
 'hav music like superðÿœ',
 'improv ia recommend song find similar song itll best music app youtub better everyth els spotifi king',
 'alway favorit platform listen music dont like subscrib alot thing one must',
 'voic sweet hearabl',
 'amaz music experi',
 'like everyth like listen commerci even better even like song ive want know name brought thank spotifi',
 'jammin hyuh old school hardcor thrash havent heard year return youth ye pleas',
 'everyth perfect add light theme',
 'best music app ever seen',
 'good app listen music',
 'wide rang song collect',
 'spotifi awesom love obscur artist think fingertip well easi access newer thing updat lost star constantli ask review ive alreadi 
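
The next cell weights terms with TF-IDF. As a rough illustration of what that weighting computes, the sketch below reimplements it in plain Python on a toy corpus. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1,
    and L2 normalization of each document vector.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = ["good app", "love music app", "best music app"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    top = max(row, key=row.get)
    print(f"{doc!r}: highest-weighted term = {top!r}")
```

Because "app" appears in every toy document, it receives the lowest weight in each one, while terms unique to a single review rank highest.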

In [98]:
# Initialize the TfidfVectorizer (defaults to unigrams; pass ngram_range=(2, 3) to use bigrams and trigrams)
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(pos_review)

# Get feature names
feature_names = vectorizer.get_feature_names_out()


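# Create a dataframe from the TF-IDF matrix
# Note: toarray() converts the sparse matrix to a dense one (one column per vocabulary term),
# which can use a lot of memory on a corpus of this size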
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(tfidf_df)


        00  000  0000  0003  0005  001  010  0143   03  0350  ...  \
0      0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
1      0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
2      0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
3      0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
4      0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
...    ...  ...   ...   ...   ...  ...  ...   ...  ...   ...  ...   
23274  0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
23275  0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
23276  0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
23277  0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   
23278  0.0  0.0   0.0   0.0   0.0  0.0  0.0   0.0  0.0   0.0  ...   

       ðššðšðššðšðšðšðš  ðšžðšÿðšžðš   ðž  ðžðð  ðžððâœðÿ  ðžðšððððððžð   ðˆ  \
0                   0.0          0.0  0.0   0.0       0.0           0.0  0.0   
1          