# ACTIVITY 2 - PREPROCESSING THE DATA

#### This section preprocesses text documents, such as song lyrics, by normalizing and cleaning them for natural language processing (NLP) tasks. It standardizes the text by converting it to lowercase, removing special characters, punctuation, and numbers, and filtering out both standard English stopwords and custom, domain-specific stopwords (e.g., filler words like "yeah" and "gonna"). The code tokenizes the text into individual words, removes irrelevant tokens, and strips stray apostrophes to produce a clean, consistent version of the input. This prepares the data for more effective analysis in tasks like clustering, classification, or sentiment analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import re
import spacy
from gensim import corpora, models
import gensim
from gensim.matutils import hellinger
from scipy.spatial.distance import cosine
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
df = pd.read_csv('ReducedDataset.csv')

In [None]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,ALink,SName,SLink,Lyric,language
0,139419,/foo-fighters/,"Hey, Johnny Park!",/foo-fighters/hey-johnny-park.html,Come and I'll take you under\nThis beautiful b...,en
1,290738,/mxpx/,Call In Sick,/mxpx/call-in-sick.html,"Oh how I missed you,\nOh how I needed you toda...",en
2,162905,/arch-enemy/,Despicable Heroes,/arch-enemy/despicable-heroes.html,"I spit in your face, preacers and leaders\nSpe...",en
3,281035,/the-maine/,Whoever She Is,/the-maine/whoever-she-is.html,I thought I had my girl but she ran away\nMy c...,en
4,253213,/a-ha/,Days On End,/a-ha/days-on-end.html,Do know why winter's such a cold and lonely pl...,en


In [None]:
df_en = df[df['language']=='en']

In [None]:
df_en.head(5)

Unnamed: 0.1,Unnamed: 0,ALink,SName,SLink,Lyric,language
0,139419,/foo-fighters/,"Hey, Johnny Park!",/foo-fighters/hey-johnny-park.html,Come and I'll take you under\nThis beautiful b...,en
1,290738,/mxpx/,Call In Sick,/mxpx/call-in-sick.html,"Oh how I missed you,\nOh how I needed you toda...",en
2,162905,/arch-enemy/,Despicable Heroes,/arch-enemy/despicable-heroes.html,"I spit in your face, preacers and leaders\nSpe...",en
3,281035,/the-maine/,Whoever She Is,/the-maine/whoever-she-is.html,I thought I had my girl but she ran away\nMy c...,en
4,253213,/a-ha/,Days On End,/a-ha/days-on-end.html,Do know why winter's such a cold and lonely pl...,en


In [None]:
df_en.shape[0]

38363

In [None]:
print(df_en['Lyric'].iloc[20])

I'm comin' home, I've done my time
Now I've got to know what is and isn't mine
If you received my letter telling you I'd soon be free
Then you'll know just what to do
If you still want me
If you still want me
Whoa, tie a yellow ribbon 'round the old oak tree
It's been three long years
Do ya still want me?
If I don't see a ribbon round the old oak tree
I'll stay on the bus
Forget about us
Put the blame on me
If I don't see a yellow ribbon round the old oak tree

Bus driver, please look for me
'cause I couldn't bear to see what I might see
I'm really still in prison
And my love, she holds the key
A simple yellow ribbon's what I need to set me free
I wrote and told her please

Whoa, tie a yellow ribbon round the old oak tree
It's been three long years
Do ya still want me?
If I don't see a ribbon round the old oak tree
I'll stay on the bus
Forget about us
Put the blame on me
If I don't see a yellow ribbon round the old oak tree

Now the whole damned bus is cheerin'
And I can't believe I se

#### The code below preprocesses text documents, such as song lyrics, by normalizing and filtering them so the data is better suited to NLP tasks. It removes stopwords, special characters, and digits, and standardizes the remaining text.

#### The code performs the following steps:

#### *1. Custom Stopwords:*

#### In addition to the default English stopwords from NLTK, a custom_stop_words list has been defined to remove domain-specific or irrelevant words, such as "yeah," "gonna," "like," and other common filler words in song lyrics. These are combined with the default stopwords to create the final stop_words list.

#### *2. Text Normalization:*

#### Converts all text to lowercase for consistency.

#### Removes non-alphabetic characters (excluding spaces and apostrophes) using re.sub.
#### Strips unnecessary whitespace to clean the text.

#### *3. Tokenization and Filtering:*

#### Uses the WordPunctTokenizer to split the text into words.
#### Filters out words that are in the stop_words list or are purely numeric.

#### *4. Punctuation Removal:*

#### Adds a step to remove all punctuation using re.sub and string.punctuation. This ensures that punctuation marks left after tokenization are cleaned up.

#### *5. Apostrophe Cleanup:*

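#### Removes any apostrophes that remain, along with the whitespace that follows them, using re.sub. Because string.punctuation already includes the apostrophe, this step acts as a final safeguard after the punctuation removal.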
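
#### To see why the tokenization step matters for the stopword list, the sketch below mimics WordPunctTokenizer with the regular expression that NLTK documents for it (`\w+|[^\w\s]+`), using only the standard library. The sample line and the small stop-word set are illustrative assumptions, not the notebook's full list. Contractions are split at the apostrophe, so a multi-part entry such as "i'm" can never equal a single token, whereas fragment entries such as "m" or "s" can.

```python
import re

# Pattern that NLTK documents for WordPunctTokenizer: runs of word
# characters, or runs of characters that are neither word nor space.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphabetic and punctuation tokens."""
    return WORD_PUNCT.findall(text)

# Illustrative stop words only (assumption): one contraction, some fragments.
sample_stop_words = {"i'm", "i", "m", "s"}

tokens = word_punct_tokenize("i'm comin' home")
kept = [t for t in tokens if t not in sample_stop_words]

print(tokens)  # the contraction is split into separate tokens
print(kept)    # "i'm" never matched; "i" and "m" were removed individually
```

#### The apostrophe tokens that survive filtering are what the later punctuation step removes.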

In [None]:
# Download the stopwords library
nltk.download('stopwords')

# Establish a word punctuation tokenizer
wpt = nltk.WordPunctTokenizer()

# Establish the English stop words
basic_stop_words = nltk.corpus.stopwords.words('english')

custom_stop_words = ['get', 'yeah', 's', 'ai', 'ca', 'like', 'nt', 'ta', 'oh', 'got', 'gonna','goin','na','I', "i'm", "ain't", 'come', 'make', 'know', 'gotta']

stop_words = basic_stop_words + custom_stop_words

def normalize_document(doc):
    # Remove everything except ASCII letters, whitespace, and apostrophes.
    # Flags are passed by keyword: the fourth positional argument of re.sub is count.
    doc = re.sub(r"[^a-zA-Z\s']", '', doc, flags=re.I | re.A)
    # Lowercase and trim leading/trailing whitespace
    doc = doc.lower()
    doc = doc.strip()

    # Tokenize document
    tokens = wpt.tokenize(doc)
    # Drop stop words and purely numeric tokens
    filtered_tokens = [token for token in tokens if token not in stop_words and not token.isdigit()]
    # Re-create the document from filtered tokens
    doc = ' '.join(filtered_tokens)

    # Remove punctuation
    doc = re.sub(f"[{re.escape(string.punctuation)}]", '', doc)

    # Remove any remaining apostrophes and the whitespace after them
    doc = re.sub(r"'\s*", "", doc)
    return doc

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
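
#### A note on re.sub arguments: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value passed as the fourth positional argument is interpreted as `count` and limits the number of replacements. The sketch below uses a made-up input string to illustrate the difference between the two call styles.

```python
import re

pattern = r"[^a-zA-Z\s']"
text = "1" * 300 + "abc"  # made-up input: 300 digits followed by letters

# Flag value used as count: only the first 258 matches are removed,
# because int(re.I | re.A) == 258.
limited = re.sub(pattern, "", text, count=int(re.I | re.A))

# Flags passed by keyword: every match is removed.
full = re.sub(pattern, "", text, flags=re.I | re.A)

print(len(limited), len(full))
```

#### Passing flags by keyword avoids silently leaving unwanted characters in long documents.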


#### The code below preprocesses the text in the Lyric column of the DataFrame df_en by applying the normalize_document function to each entry. Wrapping the function with np.vectorize lets it be called on the whole column at once; each lyric is cleaned by removing unwanted characters, punctuation, stopwords, and digits and converting it to lowercase. The result is stored in norm_corpus as a NumPy array of cleaned lyrics, ready for further NLP analysis. Note that np.vectorize is a convenience wrapper that still calls the function once per element in a Python loop, so it offers no real speedup; pandas' .apply() method or a list comprehension should perform comparably.

In [None]:
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(df_en['Lyric'])

In [None]:
print(norm_corpus[0])

 take beautiful bruise  colors everything fades time  true wish another stab undercover change mind  impossible  let  never selling sit watch every mood eyes still remind angels hover eyes change blind blue  impossible  let  never selling sit watch every mood  found reward  throw away long  share piece mine  impossible  let  never selling sit watch every mood


In [None]:
doc = nlp(norm_corpus[0].item())

for token in doc:
    print(token.text, token.lemma_)

   
take take
beautiful beautiful
bruise bruise
   
colors color
everything everything
fades fade
time time
   
true true
wish wish
another another
stab stab
undercover undercover
change change
mind mind
   
impossible impossible
   
let let
   
never never
selling sell
sit sit
watch watch
every every
mood mood
eyes eye
still still
remind remind
angels angel
hover hover
eyes eye
change change
blind blind
blue blue
   
impossible impossible
   
let let
   
never never
selling sell
sit sit
watch watch
every every
mood mood
   
found find
reward reward
   
throw throw
away away
long long
   
share share
piece piece
mine mine
   
impossible impossible
   
let let
   
never never
selling sell
sit sit
watch watch
every every
mood mood


In [None]:
lemmatized_corpus = []

for text in norm_corpus:
    doc = nlp(text.item())
    lemmatized_text = " ".join([token.lemma_ for token in doc])
    lemmatized_corpus.append(lemmatized_text)

In [None]:
print(lemmatized_corpus[0])

  take beautiful bruise   color everything fade time   true wish another stab undercover change mind   impossible   let   never sell sit watch every mood eye still remind angel hover eye change blind blue   impossible   let   never sell sit watch every mood   find reward   throw away long   share piece mine   impossible   let   never sell sit watch every mood


In [None]:
df = pd.DataFrame(lemmatized_corpus, columns=['lyrics'])
df.to_csv('output_file.csv', index=False)

In [None]:
tokenized_corpus = [text.split() for text in lemmatized_corpus]

dictionary = corpora.Dictionary(tokenized_corpus)
corpus = [dictionary.doc2bow(text) for text in tokenized_corpus]

In [None]:
# Check the word frequencies in the dictionary that will be used for the LDA model; a word that is too frequent may carry little value.
# Words found to be too frequent were added to the stop words on each iteration.
top_10_most_frequent_words = sorted(dictionary.cfs.items(), key=lambda item: item[1], reverse=True)[:10]
print(top_10_most_frequent_words)
for token_id, frequency in top_10_most_frequent_words:
    word = dictionary[token_id]
    print(f"Word: {word}, Frequency: {frequency}")

[(202, 64544), (246, 54282), (209, 40490), (67, 37575), (32, 34549), (104, 33670), (16, 32458), (30, 31321), (282, 30233), (21, 29150)]
Word: love, Frequency: 64544
Word: go, Frequency: 54282
Word: say, Frequency: 40490
Word: see, Frequency: 37575
Word: time, Frequency: 34549
Word: one, Frequency: 33670
Word: let, Frequency: 32458
Word: take, Frequency: 31321
Word: want, Frequency: 30233
Word: never, Frequency: 29150


In [None]:
lda_model = gensim.models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=35, random_state = 42)

In [None]:
for topic_id in range(10):
    topic = lda_model.print_topic(topic_id)
    print(f"Topic {topic_id}: {topic}")

Topic 0: 0.016*"well" + 0.016*"man" + 0.016*"go" + 0.013*"say" + 0.010*"good" + 0.010*"little" + 0.009*"home" + 0.008*"old" + 0.008*"big" + 0.007*"one"
Topic 1: 0.032*"god" + 0.029*"we" + 0.023*"lord" + 0.021*"sing" + 0.015*"song" + 0.014*"let" + 0.014*"king" + 0.013*"heaven" + 0.013*"world" + 0.013*"child"
Topic 2: 0.023*"life" + 0.014*"live" + 0.013*"people" + 0.011*"world" + 0.010*"man" + 0.010*"one" + 0.010*"lie" + 0.008*"see" + 0.007*"time" + 0.007*"mind"
Topic 3: 0.018*"run" + 0.015*"burn" + 0.015*"fire" + 0.012*"die" + 0.012*"take" + 0.011*"break" + 0.010*"eye" + 0.010*"fall" + 0.010*"heart" + 0.009*"feel"
Topic 4: 0.022*"go" + 0.019*"time" + 0.018*"never" + 0.018*"say" + 0.016*"one" + 0.015*"see" + 0.015*"feel" + 0.015*"want" + 0.014*"love" + 0.013*"way"
Topic 5: 0.137*"not" + 0.083*"la" + 0.078*"I" + 0.075*"do" + 0.075*"m" + 0.045*"s" + 0.039*"da" + 0.037*"can" + 0.034*"you" + 0.024*"be"
Topic 6: 0.258*"love" + 0.147*"baby" + 0.039*"ooh" + 0.028*"ah" + 0.022*"need" + 0.019*"sa

In [None]:
# Verify that no single word appears among the top words of a large majority of the topics; such a word may not hold much value
word_counts = {}
num_topics = lda_model.num_topics

for topic_id in range(num_topics):
    topic_words = lda_model.show_topic(topic_id, topn=10)

    for word, prob in topic_words:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

word_count_topic_list = pd.DataFrame(list(word_counts.items()), columns=['Word', 'Count'])

word_count_topic_list = word_count_topic_list.sort_values(by='Count', ascending=False)

print(word_count_topic_list.head(10))

     Word  Count
9     one      3
2      go      3
3     say      3
24    see      2
25   time      2
18  world      2
36   feel      2
15    let      2
39   love      2
38   want      2
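
#### The same tally can be written more compactly with collections.Counter. The sketch below uses hypothetical top-word lists in place of the lda_model.show_topic output, so the words and counts are illustrative only.

```python
from collections import Counter

# Hypothetical top words per topic (assumption), standing in for the
# (word, probability) pairs returned by lda_model.show_topic.
topics_top_words = [
    ["go", "say", "one", "home"],
    ["life", "world", "one", "see"],
    ["go", "time", "say", "one"],
]

# Count in how many topics each word appears among the top words.
# set(words) guards against counting a word twice within one topic.
word_counts = Counter(word for words in topics_top_words for word in set(words))

for word, count in word_counts.most_common(3):
    print(word, count)
```

#### A word whose count approaches the number of topics is a candidate for the custom stop-word list.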


In [None]:
song_topic_distribution = [lda_model[doc] for doc in corpus]

In [None]:
def calculate_hellinger_distance(song_dist_1, song_dist_2):
    return hellinger(song_dist_1, song_dist_2)

input_song_index = 139

input_song_dist = song_topic_distribution[input_song_index]

hellinger_distances = []

for i, song_dist in enumerate(song_topic_distribution):

    if i != input_song_index:

        distance = calculate_hellinger_distance(input_song_dist, song_dist)
        hellinger_distances.append((distance, i))

hellinger_distances.sort(key=lambda x: x[0])

top_5_similar_songs = hellinger_distances[:5]

# Print the results
print(f"\nTop 5 most similar songs to the input song {df_en['SName'].iloc[input_song_index]} by {df_en['ALink'].iloc[input_song_index]}:")
for distance, song_index in top_5_similar_songs:
    print(f"Song {df_en['SName'].iloc[song_index]} by {df_en['ALink'].iloc[song_index]}")


Top 5 most similar songs to the input song Star by /beck/:
Song This House is Haunted by /alice-cooper/
Song Big Star by /10000-maniacs/
Song She Sets The City On Fire by /gavin-degraw/
Song Goodnite, Dr. Death by /my-chemical-romance/
Song Dirty Boys by /david-bowie/


In [None]:
song_topic_distribution[65]

[(0, 0.16637278),
 (1, 0.011681465),
 (2, 0.010167061),
 (3, 0.086907074),
 (4, 0.2458164),
 (6, 0.022881556),
 (7, 0.45506352)]

In [None]:
print(df_en['SName'].iloc[65])
print(df_en['Lyric'].iloc[65])

So Much Pain
[Master P and (Mo B. Dick)]
Oh like that (Ooh)
Tre-8 they ain't ready for this dog (Ooh)
Smoke One and No Limit (Ooh)
All the way from California to New Orleans
Ask em' about it, so much pain boy (So much pain)

[Master P]
Birds in the kitchen, palms itchin'
And all y'all niggas in the game pay attention
As I teach, ain't got no time to preach
2 for 3, 4 for 5, 16 a fuckin' key
Don't laugh, niggas like to backstab
But where I'm from see yo brother on a body slab
New Orleans, the city of the candy cream
A bunch of projects full of jackers and dope fiends
As I cry, think one day I gotta die
But I don't give a fuck cause ain't no love from the outside
As I walk to the projects
Niggas killin' dope fiends behind fuckin' county checks
And my younger homie smokin' dope
The niggas I used to hang with doin' that boy broke
And they gone off that water, water
Ain't no love from New Orleans all the way back to Florida
It's just a bunch of pain

[Chorus: Mo B. Dick]
So much pain, so mu

In [None]:
song_topic_distribution[2881]

[(0, 0.18533686),
 (1, 0.020666035),
 (2, 0.048350923),
 (3, 0.08434779),
 (4, 0.08937986),
 (6, 0.017986083),
 (7, 0.40650696),
 (8, 0.015483296),
 (9, 0.12802215)]
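
#### For reference, the Hellinger distance between two discrete distributions p and q is H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2); for proper probability distributions it ranges from 0 (identical) to 1 (no overlap). The sketch below is a plain-Python version applied to the two sparse distributions printed above for songs 65 and 2881, with probabilities rounded to four decimals. Treating a topic missing from a list as probability 0 is an assumption. This illustrates the formula and is not a replacement for gensim's hellinger.

```python
import math

def hellinger_sparse(dist_1, dist_2):
    """Hellinger distance between two lists of (topic_id, probability) pairs."""
    p, q = dict(dist_1), dict(dist_2)
    topic_ids = set(p) | set(q)
    # Topics absent from one list are treated as probability 0 (assumption).
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2
        for t in topic_ids
    )
    return math.sqrt(0.5 * squared)

# Values copied from the two outputs above, rounded to four decimals.
song_65 = [(0, 0.1664), (1, 0.0117), (2, 0.0102), (3, 0.0869),
           (4, 0.2458), (6, 0.0229), (7, 0.4551)]
song_2881 = [(0, 0.1853), (1, 0.0207), (2, 0.0484), (3, 0.0843),
             (4, 0.0894), (6, 0.0180), (7, 0.4065), (8, 0.0155), (9, 0.1280)]

distance = hellinger_sparse(song_65, song_2881)
print(round(distance, 3))
```

#### Most of the distance here comes from topics 4, 8, and 9, where the two songs' weights differ the most.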

In [None]:
print(df_en['SName'].iloc[2881])
print(df_en['Lyric'].iloc[2881])

Run This Town
(Rihanna)
Feel it coming in the air
Hear the screams from everywhere
I'm addicted to the the thrill
It's a dangerous love affair
Can't be scaring nickels down
Got a problem, tell me now
Only thing that's on my mind
Is who gon' run this town tonight
Is who gon' run this town tonight
We gon' run this town

(Jay-Z)
We are, yeah, I said it, we are
This is Roc Nation, pledge your allegiance
Get y'all fatigues on, all black everything
Black cards, black cars, all black everything
And our girls are blackbirds, riding with they Dillingers
I get more in-depth if you boys really real enough
This is La Familia, I'll explain later
But for now, let me get back to this paper
I'm a couple bands down and I'm tryna get back
I gave Doug a grip, I lost a flip for five stacks
Yeah, I'm talking five comma six zeroes dot zero, here girl
Back to running circles 'round niggas, now we squared up
Hold up

(Rihanna)
Life's a game but it's not fair
I break the rules so I don't care
So I keep doing m

In [None]:
print(df_en['Lyric'].iloc[3])

I thought I had my girl but she ran away
My car got stolen and I'm gonna be late
For work this week
Make that the fourth day straight
But I'm fine with it
I thought I had it all but I gave it away
I quit that old job now I'm doin' okay
Those material things
They can't get in my way
Cause I'm over it

(PRE-CHORUS)
But whatever she may be

(CHORUS)
She could be money, cars, fear of the dark
Your best friends or just strangers in bars
Whoever she is, whoever she may be
One thing's for sure
You don't have to worry

And this is the part where you find out who you are
And these are the friends
Those who've been there from the start
So to hell with the bad news
Dirt on your new shoes
It rained all of May
Till the month of June

(PRE-CHORUS)

(CHORUS)

And every day in every way oh she will look the same
And every care you used to have just seems to float away
To hell with your new shit
And whether or not you think you fit in

(CHORUS)

She could be rainy days, minimum wage
A book that ends wi

In [None]:
print(df_en['Lyric'].iloc[1415])

I got a sixty-five Cadillac
Sparetyre on the back
I got a charge account at Goldblatt's
But I don't have you

I got women to the left of me
I got women to the right of me
I got chicks all around me
But I don't have you

I got a tavern, a liquor store
I got the numbers four forty-four
I got a Mojo and don't you know
I'm all dressed up with no place to go

I got women to the left of me
I got women to the right of me
I got chicks all around me
But I ain't got you, oh baby

Yeah I got a tavern, a liquor store
I got the numbers four forty-four
I got a Mojo and don't you know
I'm all dressed up with no place to go

I got a sixty-five Cadillac
Sparetyre on the back
Charge account at Goldblatt's
But I ain't got you, come on baby

Yeah I got a tavern, a liquor store
I got the numbers four forty-four
I got a Mojo and don't you know
I'm all dressed up with no place to go

I got a closet full of clothes
And no matter where I go
Got a ring in my nose
But I ain't got you

I said I ain't got you
Baby

#### This code analyzes the sentiment of two song lyrics and the similarity between them. It uses the VADER sentiment analyzer to compute the negative sentiment score (the 'neg' component) of each lyric as a measure of its emotional tone. It then uses the TF-IDF vectorizer to convert the two lyrics into numerical vectors weighted by term importance and calculates their cosine similarity to measure how similar they are in content. The output includes the negative sentiment score for each song and the similarity score between them.

In [None]:
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_score(lyrics):
    sentiment = analyzer.polarity_scores(lyrics)
    return sentiment['neg']

sentiment1 = get_sentiment_score(lemmatized_corpus[496])
sentiment2 = get_sentiment_score(lemmatized_corpus[274])

tfidf_vectorizer = TfidfVectorizer()
lyrics_matrix = tfidf_vectorizer.fit_transform([lemmatized_corpus[496], lemmatized_corpus[274]])
lyrics_similarity = cosine_similarity(lyrics_matrix)

print("Sentiment Score for Song 1:", sentiment1)
print("Sentiment Score for Song 2:", sentiment2)
print("Cosine Similarity between Lyrics:", lyrics_similarity[0][1])

Sentiment Score for Song 1: 0.171
Sentiment Score for Song 2: 0.403
Cosine Similarity between Lyrics: 0.17456553631358962


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
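
#### To make the similarity measure concrete, the sketch below computes cosine similarity, cos(a, b) = (a . b) / (|a| |b|), over raw term counts using only the standard library. This is a simplification: TfidfVectorizer additionally applies IDF weighting and normalization, so the numbers will not match scikit-learn's output. The two short texts are made-up examples.

```python
import math
from collections import Counter

def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-count vectors."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)

lyric_a = "love take heart break heart"  # made-up, already normalized text
lyric_b = "heart burn fire take"

similarity = cosine_similarity_counts(lyric_a, lyric_b)
print(similarity)
```

#### A value near 1 means the two texts use the same terms in similar proportions; a value near 0 means they share almost no vocabulary.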


In [None]:
## Preprocessing
# Tokenize
# Lowercase
# punctuation removal
# number removal
# Stop word removal

## Additional Preprocessing
# Lemmatization and/or stemming
# Analyze word frequency to determine whether high-frequency, low-value words need to be removed
# Stop word removal

## Data Check
# Explore word stats
# Everything is clean

## NLP Processing
# Sentiment Analysis
# LDA
# POS tagging/Similarity

## Front End
# Django or Flask additions
