# 02-BOW

Now that we have mastered preprocessing, let's build our first Bag of Words (BOW).

We will reuse our dataset of Coldplay songs to make a BOW.

As usual, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [3]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset from *coldplay.csv* using pandas.

In [5]:
# TODO: Load the dataset in coldplay.csv
data = pd.read_csv('coldplay.csv')

# Print the first 5 rows of the dataset to check if it has been loaded correctly
print(data.head())


     Artist                           Song  \
0  Coldplay                 Another's Arms   
1  Coldplay                Bigger Stronger   
2  Coldplay                       Daylight   
3  Coldplay                       Everglow   
4  Coldplay  Every Teardrop Is A Waterfall   

                                                Link  \
0            /c/coldplay/anothers+arms_21079526.html   
1          /c/coldplay/bigger+stronger_20032648.html   
2                 /c/coldplay/daylight_20032625.html   
3                 /c/coldplay/everglow_21104546.html   
4  /c/coldplay/every+teardrop+is+a+waterfall_2091...   

                                              Lyrics  
0  Late night watching tv  \nUsed to be you here ...  
1  I want to be bigger stronger drive a faster ca...  
2  To my surprise, and my delight  \nI saw sunris...  
3  Oh, they say people come  \nThey say people go...  
4  I turn the music up, I got my records on  \nI ...  


You already know this dataset, but you can check it again if you want to refresh your memory.

In [4]:
# TODO: Explore the data
# `data` was loaded in the previous cell; info() prints its summary itself,
# so wrapping it in print() would only add a trailing "None"
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


Now using the *CountVectorizer* of scikit-learn, make a BOW of all the lyrics of Coldplay, and print the result.

In [10]:
# TODO: Compute a BOW
from collections import Counter

# Note: this cell counts the tokens of the first song only, with a Counter,
# rather than running CountVectorizer over all the lyrics

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('coldplay.csv')

# Access the lyrics of the first song (row 0)
lyrics = df.loc[0, 'Lyrics']

# Tokenize the lyrics
tokens = nltk.word_tokenize(lyrics)

# Keep alphanumeric tokens only (this drops punctuation) and lowercase them
tokens = [token.lower() for token in tokens if token.isalnum()]

# Remove the stop words
stop_words = set(nltk.corpus.stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]

# Lemmatize the tokens
lemmatizer = nltk.WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(token) for token in tokens]

# Stem the lemmatized tokens (this is why the output shows 'anoth', 'bodi', ...)
stemmer = nltk.PorterStemmer()
tokens = [stemmer.stem(token) for token in tokens]

# Store the preprocessed tokens in the preprocessed_text variable
preprocessed_text = tokens

# Compute the frequency of each token in the preprocessed text
word_counts = Counter(preprocessed_text)

# Print the word counts
print(word_counts)

Counter({'arm': 18, 'anoth': 16, 'bodi': 4, 'late': 3, 'night': 3, 'watch': 3, 'tv': 3, 'use': 3, 'besid': 3, 'right': 3, 'pull': 3, 'around': 2, 'world': 2, 'mean': 2, 'noth': 2, 'pain': 2, 'rip': 2, 'someon': 2, 'wish': 2, 'reach': 1, 'find': 1, 'tortur': 1, 'got': 1, 'close': 1})


Now that we have the BOW matrix, we would like a new dataframe with one row per song and one column per word (just as we did at the end of the lecture).

In the end, the dataframe should look something like the following (120 rows for 120 songs, and as many columns as there are words):

| | ah | adventure | ... | yeah |
|---|---|---|---|---|
| 0 | 0 | 1 | ... | 4 |
| 1 | 8 | 0 | ... | 2 |
|...|...|...|...|...|
| 119 | 5 | 0 | ... | 8 |

In [11]:
# TODO: Create a new dataframe containing the BOW outputs and the corresponding words as columns. And print it
bow_df = pd.DataFrame(list(word_counts.items()), columns=['Word', 'Count'])
print(bow_df)

      Word  Count
0     late      3
1    night      3
2    watch      3
3       tv      3
4      use      3
5    besid      3
6      arm     18
7   around      2
8     bodi      4
9    world      2
10    mean      2
11    noth      2
12   anoth     16
13    pain      2
14     rip      2
15   right      3
16  someon      2
17   reach      1
18    find      1
19  tortur      1
20    pull      3
21     got      1
22   close      1
23    wish      2


When the BOW is built over all the lyrics, there is still an issue: some tokens are not words, such as '10' or '2000'.

To get rid of them, we could pass a regular expression directly to *CountVectorizer*. Another solution would be to preprocess the text before calling *CountVectorizer*.

For the moment, we won't pay attention to this issue. But if you are curious and have time, you can search online for how to remove those tokens using *CountVectorizer*.

Now we would like to see which words Coldplay uses the most.

In [6]:
sum_bow = bow_df.sum()
sum_bow.idxmax()

'oh'
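
The `sum()` followed by `idxmax()` pattern above assumes a dataframe with one column per word and one row per song. A small sketch on a hypothetical hand-written dataframe (the words and counts are made up) shows what each step returns.

```python
import pandas as pd

# Hypothetical BOW dataframe: one row per song, one column per word
toy_bow_df = pd.DataFrame(
    {"oh": [3, 5, 0], "yeah": [4, 0, 2], "sky": [1, 1, 1]}
)

# sum() adds up each column, giving the total count of every word
totals = toy_bow_df.sum()

# idxmax() returns the label (here, the word) with the largest total
most_used = totals.idxmax()

print(totals)
print(most_used)
```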

So what is the most used word? Are you surprised?

Now sort the counts to show the 10 most used words.

In [12]:
# TODO: print the 10 most used words by Coldplay
# Reuse the `word_counts` Counter computed in the BOW cell above
# (it holds the token counts of the first song only)
print(word_counts.most_common(10))

[('arm', 18), ('anoth', 16), ('bodi', 4), ('late', 3), ('night', 3), ('watch', 3), ('tv', 3), ('use', 3), ('besid', 3), ('right', 3)]
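
The list above ranks the tokens of one song's `Counter`. For a BOW dataframe with one column per word, a comparable ranking can be sketched with `sort_values`. The dataframe below is hypothetical, and `head(2)` stands in for `head(10)` because the toy vocabulary is tiny.

```python
import pandas as pd

# Hypothetical BOW dataframe: one row per song, one column per word
toy_bow_df = pd.DataFrame(
    {"oh": [3, 5, 0], "yeah": [4, 0, 2], "sky": [1, 1, 1]}
)

# Total count per word, sorted from most to least used, keeping the top 2
top_words = toy_bow_df.sum().sort_values(ascending=False).head(2)

print(top_words)
```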


There you go! You now know the Coldplay lyrics better than the singers do!