# **ACTIVITY 1: Lemmatize Dataset for Songs**

#### The purpose of this first activity is to preprocess a collection of song lyrics for natural language processing (NLP) tasks. Preprocessing cleans and transforms the text to prepare it for analysis or modeling. Specifically, the code aims to:

#### **Normalize Text**: Remove unnecessary characters, convert text to lowercase, tokenize words, remove stopwords and digits, and clean stray punctuation. This ensures the text is standardized and free from irrelevant details.

#### **Lemmatize Words**: Reduce words to their base or dictionary form (e.g., "running" to "run"), which helps in reducing linguistic variations and improving the consistency of the text.

#### This preprocessing step is essential for tasks like clustering, topic modeling, or sentiment analysis, where clean and uniform text input is critical for achieving meaningful and accurate results.





In [None]:
# Reducing Dataset to Upload to GitHub
import pandas as pd
import nltk
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np
import spacy

In [None]:
df = pd.read_csv(r'ReducedDataset.csv')

In [None]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,ALink,SName,SLink,Lyric,language
0,139419,/foo-fighters/,"Hey, Johnny Park!",/foo-fighters/hey-johnny-park.html,Come and I'll take you under\nThis beautiful b...,en
1,290738,/mxpx/,Call In Sick,/mxpx/call-in-sick.html,"Oh how I missed you,\nOh how I needed you toda...",en
2,162905,/arch-enemy/,Despicable Heroes,/arch-enemy/despicable-heroes.html,"I spit in your face, preacers and leaders\nSpe...",en
3,281035,/the-maine/,Whoever She Is,/the-maine/whoever-she-is.html,I thought I had my girl but she ran away\nMy c...,en
4,253213,/a-ha/,Days On End,/a-ha/days-on-end.html,Do know why winter's such a cold and lonely pl...,en


#### **TEXT NORMALIZATION**

#### The code below preprocesses the song lyrics in the DataFrame, covering text normalization and lemmatization. It defines a function, normalize_document, that removes every character other than ASCII letters, whitespace, and apostrophes, converts the text to lowercase, tokenizes it, removes stopwords and digit-only tokens, and cleans stray apostrophes. The function is wrapped with np.vectorize and applied to the Lyric column, producing a normalized version of each lyric.

#### After normalization, the code uses spaCy's en_core_web_sm model to lemmatize the text. Each normalized document is processed and the lemmas of all its tokens are joined into a single string, giving a lemmatized version of each lyric. These strings are collected in lemmatized_corpus and stored in the Lemmatized_Lyrics column in the next step, so the text is clean, concise, and ready for downstream NLP tasks like clustering or classification.
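
#### To make the normalization steps concrete, here is a minimal, dependency-free sketch. It is not the notebook's function: the stop word set is a tiny hypothetical stand-in for NLTK's English list, the regex in `re.findall` only approximates a word/punctuation tokenizer, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list (far smaller).
STOP_WORDS = {"it", "s", "am", "and", "i", "m", "through", "the"}

def normalize(text: str) -> str:
    # Keep only ASCII letters, whitespace, and apostrophes.
    text = re.sub(r"[^a-zA-Z\s']", "", text)
    text = text.lower().strip()
    # Split into runs of letters or runs of apostrophes, so that
    # "it's" becomes the three tokens: it, ', s.
    tokens = re.findall(r"[a-z]+|'+", text)
    # Drop stop words; apostrophe tokens survive this step.
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Remove leftover apostrophes together with the whitespace after them.
    return re.sub(r"'\s*", "", " ".join(kept)).strip()

sample = "It's 3 AM, and I'm running through the city's lights!"
print(normalize(sample))  # running city lights
```

#### Because the first substitution already strips digits, a later check for digit-only tokens has nothing left to remove in this sketch.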

In [None]:
nlp = spacy.load('en_core_web_sm')

# Download the NLTK stopwords corpus
nltk.download('stopwords')

# Establish a word punctuation tokenizer
wpt = nltk.WordPunctTokenizer()

# Establish the English stop words
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
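    # Remove everything except ASCII letters, whitespace, and apostrophes.
    # Flags are passed by keyword: the fourth positional argument of re.sub is count.
    doc = re.sub(r"[^a-zA-Z\s']", '', doc, flags=re.I | re.A)
    # Lowercase and trim leading/trailing whitespace
    doc = doc.lower()
    doc = doc.strip()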


    # Tokenize document
    tokens = wpt.tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words and not token.isdigit()]
    # Re-create the document from filtered tokens
    doc = ' '.join(filtered_tokens)

    # Remove stray apostrophes left by the tokenizer, plus any whitespace after them
    doc = re.sub(r"'\s*", "", doc)
    return doc

# Vectorize the function so it can be applied to the whole Lyric column at once
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(df['Lyric'])
lemmatized_corpus = []

# Lemmatize each normalized lyric and join the lemmas back into one string
for text in norm_corpus:
    doc = nlp(str(text))
    lemmatized_text = " ".join(token.lemma_ for token in doc)
    lemmatized_corpus.append(lemmatized_text)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
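
#### A note on calling `re.sub` with flags: its fourth positional parameter is `count`, not `flags`, so a flag passed positionally is silently interpreted as a limit on the number of replacements. The short sketch below, using a made-up string, shows the difference. Recent Python versions also emit a DeprecationWarning for the positional form.

```python
import re

text = "a1b2c3d4"

# The flag lands in the `count` slot. re.I has the integer value 2,
# so only the first two digits are removed.
limited = re.sub(r"\d", "", text, re.I)

# Passing the flag by keyword applies it as a flag, with no replacement limit.
full = re.sub(r"\d", "", text, flags=re.I)

print(limited)  # abc3d4
print(full)     # abcd
```

#### With `re.I | re.A` in the `count` slot the limit would be 258, so on a long lyric any disallowed characters after the first 258 matches would be left in place.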


#### **PRE PREPARATION OF DATASET**:

#### The code below adds the lemmatized lyrics to the DataFrame, drops rows with missing values, and saves the preprocessed dataset for later use in NLP tasks like clustering, classification, or visualization. The final df.head(5) call provides a checkpoint to inspect and validate the output.

In [None]:
df['Lemmatized_Lyrics'] = lemmatized_corpus
df.dropna(inplace = True)
#df.drop(32186, axis = 0, inplace = True)
df.to_csv(r'C:\Users\divyamishra\Desktop\Nimish_Student\INTRO TO NLP FOR DATA SCIENCE\Project Work\LemmatizedDataset.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,ALink,SName,SLink,Lyric,language,Lemmatized_Lyrics
0,139419,/foo-fighters/,"Hey, Johnny Park!",/foo-fighters/hey-johnny-park.html,Come and I'll take you under\nThis beautiful b...,en,come take beautiful bruise color everything fa...
1,290738,/mxpx/,Call In Sick,/mxpx/call-in-sick.html,"Oh how I missed you,\nOh how I needed you toda...",en,oh miss oh need today oh miss oh need today ca...
2,162905,/arch-enemy/,Despicable Heroes,/arch-enemy/despicable-heroes.html,"I spit in your face, preacers and leaders\nSpe...",en,spit face preacer leader spew false dogma beli...
3,281035,/the-maine/,Whoever She Is,/the-maine/whoever-she-is.html,I thought I had my girl but she ran away\nMy c...,en,think girl run away car get steal go to late w...
4,253213,/a-ha/,Days On End,/a-ha/days-on-end.html,Do know why winter's such a cold and lonely pl...,en,know winter cold lonely place breath bleach fa...
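
#### The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are typical of a DataFrame whose row index was written to CSV and read back more than once. The sketch below, using a small made-up DataFrame and an in-memory buffer instead of a file, shows how `to_csv` behaves with and without `index=False`. Whether to drop the index here is a choice for the notebook's author; this only illustrates the mechanism.

```python
import io
import pandas as pd

# Made-up two-row frame, standing in for the lyrics dataset.
df_demo = pd.DataFrame({"SName": ["Song A", "Song B"], "language": ["en", "en"]})

# Default behavior: the row index is written as an unnamed first column.
buf = io.StringIO()
df_demo.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False: only the real columns are written.
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))     # ['Unnamed: 0', 'SName', 'language']
print(list(without_index.columns))  # ['SName', 'language']
```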
