# 01-Preprocessing

The first NLP exercise is about preprocessing.

You will practice preprocessing using NLTK on raw data. 
This is the first step in most of the NLP projects, so you have to master it.

We will play with the *coldplay.csv* dataset, containing all the songs and lyrics of Coldplay.

As you know, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [1]:
# Import NLTK and all the needed libraries
import nltk
nltk.download('punkt') #Run this line one time to get the resource
nltk.download('stopwords') #Run this line one time to get the resource
nltk.download('wordnet') #Run this line one time to get the resource
nltk.download('averaged_perceptron_tagger') #Run this line one time to get the resource
import numpy as np
import pandas as pd

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/laravaroni/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/laravaroni/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/laravaroni/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/laravaroni/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Load now the dataset using pandas.

In [2]:
# TODO: Load the dataset in coldplay.csv
df = pd.read_csv("coldplay.csv")
df.head()

Unnamed: 0,Artist,Song,Link,Lyrics
0,Coldplay,Another's Arms,/c/coldplay/anothers+arms_21079526.html,Late night watching tv \nUsed to be you here ...
1,Coldplay,Bigger Stronger,/c/coldplay/bigger+stronger_20032648.html,I want to be bigger stronger drive a faster ca...
2,Coldplay,Daylight,/c/coldplay/daylight_20032625.html,"To my surprise, and my delight \nI saw sunris..."
3,Coldplay,Everglow,/c/coldplay/everglow_21104546.html,"Oh, they say people come \nThey say people go..."
4,Coldplay,Every Teardrop Is A Waterfall,/c/coldplay/every+teardrop+is+a+waterfall_2091...,"I turn the music up, I got my records on \nI ..."


Now, check the dataset, play with it a bit: what are the columns? How many lines? Is there missing data?...

In [3]:
# TODO: Explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


There are 120 rows and no null in the 4 columns, as the attributes have 120 non null values, as object.

Now select the song 'Every Teardrop Is A Waterfall' and save the Lyrics text into a variable. Print the output of this variable.

In [16]:
# TODO: Select the song 'Every Teardrop Is A Waterfall'
song = df[df["Song"]=="Every Teardrop Is A Waterfall"]["Lyrics"].values[0]
song

"I turn the music up, I got my records on  \nI shut the world outside until the lights come on  \nMaybe the streets alight, maybe the trees are gone  \nI feel my heart start beating to my favourite song  \n  \nAnd all the kids they dance, all the kids all night  \nUntil Monday morning feels another life  \nI turn the music up  \nI'm on a roll this time  \nAnd heaven is in sight  \n  \nI turn the music up, I got my records on  \nFrom underneath the rubble sing a rebel song  \nDon't want to see another generation drop  \nI'd rather be a comma than a full stop  \n  \nMaybe I'm in the black, maybe I'm on my knees  \nMaybe I'm in the gap between the two trapezes  \nBut my heart is beating and my pulses start  \nCathedrals in my heart  \n  \nAs we saw oh this light I swear you, emerge blinking into  \nTo tell me it's alright  \nAs we soar walls, every siren is a symphony  \nAnd every tear's a waterfall  \nIs a waterfall  \nOh  \nIs a waterfall  \nOh oh oh  \nIs a is a waterfall  \nEvery tear

As you can see, there is some preprocessing needed here. So let's do it! What is usually the first step?

Tokenization, yes. So do tokenization on the lyrics of Every Teardrop Is A Waterfall.

So you may have to import the needed library from NLTK if you did not yet.

Be careful, the output you have from your pandas dataframe may not have the right type, so manipulate it wisely to get a string.

In [17]:
# TODO: Tokenize the lyrics of the song and save the tokens into a variable and print it
from nltk.tokenize import word_tokenize

tokenized_lyrics = word_tokenize(song)
tokenized_lyrics

['I',
 'turn',
 'the',
 'music',
 'up',
 ',',
 'I',
 'got',
 'my',
 'records',
 'on',
 'I',
 'shut',
 'the',
 'world',
 'outside',
 'until',
 'the',
 'lights',
 'come',
 'on',
 'Maybe',
 'the',
 'streets',
 'alight',
 ',',
 'maybe',
 'the',
 'trees',
 'are',
 'gone',
 'I',
 'feel',
 'my',
 'heart',
 'start',
 'beating',
 'to',
 'my',
 'favourite',
 'song',
 'And',
 'all',
 'the',
 'kids',
 'they',
 'dance',
 ',',
 'all',
 'the',
 'kids',
 'all',
 'night',
 'Until',
 'Monday',
 'morning',
 'feels',
 'another',
 'life',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 "'m",
 'on',
 'a',
 'roll',
 'this',
 'time',
 'And',
 'heaven',
 'is',
 'in',
 'sight',
 'I',
 'turn',
 'the',
 'music',
 'up',
 ',',
 'I',
 'got',
 'my',
 'records',
 'on',
 'From',
 'underneath',
 'the',
 'rubble',
 'sing',
 'a',
 'rebel',
 'song',
 'Do',
 "n't",
 'want',
 'to',
 'see',
 'another',
 'generation',
 'drop',
 'I',
 "'d",
 'rather',
 'be',
 'a',
 'comma',
 'than',
 'a',
 'full',
 'stop',
 'Maybe',
 'I',
 "'m",

It begins to look good. But still, we have the punctuation to remove, so let's do this.

In [18]:
# TODO: Remove the punctuation, then save the result into a variable and print it
tokenized_lyrics_no_punc = [t for t in tokenized_lyrics if t.isalpha()]
tokenized_lyrics_no_punc

['I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 'got',
 'my',
 'records',
 'on',
 'I',
 'shut',
 'the',
 'world',
 'outside',
 'until',
 'the',
 'lights',
 'come',
 'on',
 'Maybe',
 'the',
 'streets',
 'alight',
 'maybe',
 'the',
 'trees',
 'are',
 'gone',
 'I',
 'feel',
 'my',
 'heart',
 'start',
 'beating',
 'to',
 'my',
 'favourite',
 'song',
 'And',
 'all',
 'the',
 'kids',
 'they',
 'dance',
 'all',
 'the',
 'kids',
 'all',
 'night',
 'Until',
 'Monday',
 'morning',
 'feels',
 'another',
 'life',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 'on',
 'a',
 'roll',
 'this',
 'time',
 'And',
 'heaven',
 'is',
 'in',
 'sight',
 'I',
 'turn',
 'the',
 'music',
 'up',
 'I',
 'got',
 'my',
 'records',
 'on',
 'From',
 'underneath',
 'the',
 'rubble',
 'sing',
 'a',
 'rebel',
 'song',
 'Do',
 'want',
 'to',
 'see',
 'another',
 'generation',
 'drop',
 'I',
 'rather',
 'be',
 'a',
 'comma',
 'than',
 'a',
 'full',
 'stop',
 'Maybe',
 'I',
 'in',
 'the',
 'black',
 'maybe',
 'I',
 'on',
 'my

We will now remove the stop words.

In [26]:
# TODO: remove the stop words using NLTK. Then put the result into a variable and print it
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

 
filtered_sentence = [token for token in tokenized_lyrics_no_punc if token.lower() not in stop_words]

print(filtered_sentence)

['turn', 'music', 'got', 'records', 'shut', 'world', 'outside', 'lights', 'come', 'Maybe', 'streets', 'alight', 'maybe', 'trees', 'gone', 'feel', 'heart', 'start', 'beating', 'favourite', 'song', 'kids', 'dance', 'kids', 'night', 'Monday', 'morning', 'feels', 'another', 'life', 'turn', 'music', 'roll', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'records', 'underneath', 'rubble', 'sing', 'rebel', 'song', 'want', 'see', 'another', 'generation', 'drop', 'rather', 'comma', 'full', 'stop', 'Maybe', 'black', 'maybe', 'knees', 'Maybe', 'gap', 'two', 'trapezes', 'heart', 'beating', 'pulses', 'start', 'Cathedrals', 'heart', 'saw', 'oh', 'light', 'swear', 'emerge', 'blinking', 'tell', 'alright', 'soar', 'walls', 'every', 'siren', 'symphony', 'every', 'tear', 'waterfall', 'waterfall', 'Oh', 'waterfall', 'Oh', 'oh', 'oh', 'waterfall', 'Every', 'tear', 'waterfall', 'Oh', 'oh', 'oh', 'hurt', 'hurt', 'bad', 'still', 'raise', 'flag', 'Oh', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'Every

Okay we begin to have much less words in our song, right?

Next step is lemmatization. But we had an issue in the lectures, you remember? Let's learn how to do it properly now.

First let's try to do it naively. Import the WordNetLemmatizer and perform lemmatization with default options.

In [28]:
# TODO: Perform lemmatization using WordNetLemmatizer on our tokens
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
tokenized_lemmatize = [wnl.lemmatize(token) for token in filtered_sentence]
print (tokenized_lemmatize)

['turn', 'music', 'got', 'record', 'shut', 'world', 'outside', 'light', 'come', 'Maybe', 'street', 'alight', 'maybe', 'tree', 'gone', 'feel', 'heart', 'start', 'beating', 'favourite', 'song', 'kid', 'dance', 'kid', 'night', 'Monday', 'morning', 'feel', 'another', 'life', 'turn', 'music', 'roll', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'record', 'underneath', 'rubble', 'sing', 'rebel', 'song', 'want', 'see', 'another', 'generation', 'drop', 'rather', 'comma', 'full', 'stop', 'Maybe', 'black', 'maybe', 'knee', 'Maybe', 'gap', 'two', 'trapeze', 'heart', 'beating', 'pulse', 'start', 'Cathedrals', 'heart', 'saw', 'oh', 'light', 'swear', 'emerge', 'blinking', 'tell', 'alright', 'soar', 'wall', 'every', 'siren', 'symphony', 'every', 'tear', 'waterfall', 'waterfall', 'Oh', 'waterfall', 'Oh', 'oh', 'oh', 'waterfall', 'Every', 'tear', 'waterfall', 'Oh', 'oh', 'oh', 'hurt', 'hurt', 'bad', 'still', 'raise', 'flag', 'Oh', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'Every', 'tear', '

As you can see, it worked well on nouns (plural words are now singular for example).

But verbs are not OK: we would 'is' to become 'be' for example.

To do that, we need to do POS-tagging. So let's do this now.

POS-tagging means Part of speech tagging: basically it will classify words into categories: like verbs, nouns, advers and so on...

In order to do that, we will use NLTK and the function *pos_tag*. You have to do it on the step before lemmatization, so use your variable containing all the tokens without punctuation and without stop words.

Hint: you can check on the internet how the *pos_tag* function works [here](https://www.nltk.org/book/ch05.html)

In [40]:
# TODO: use the function pos_tag of NLTK to perform POS-tagging and print the result

tags = nltk.pos_tag(filtered_sentence)
tags

[('turn', 'NN'),
 ('music', 'NN'),
 ('got', 'VBD'),
 ('records', 'NNS'),
 ('shut', 'VBN'),
 ('world', 'NN'),
 ('outside', 'IN'),
 ('lights', 'NNS'),
 ('come', 'VBP'),
 ('Maybe', 'RB'),
 ('streets', 'NNS'),
 ('alight', 'VBD'),
 ('maybe', 'RB'),
 ('trees', 'NNS'),
 ('gone', 'VBN'),
 ('feel', 'JJ'),
 ('heart', 'NN'),
 ('start', 'NN'),
 ('beating', 'VBG'),
 ('favourite', 'NN'),
 ('song', 'NN'),
 ('kids', 'NNS'),
 ('dance', 'NN'),
 ('kids', 'NNS'),
 ('night', 'NN'),
 ('Monday', 'NNP'),
 ('morning', 'NN'),
 ('feels', 'NNS'),
 ('another', 'DT'),
 ('life', 'NN'),
 ('turn', 'NN'),
 ('music', 'NN'),
 ('roll', 'NN'),
 ('time', 'NN'),
 ('heaven', 'JJ'),
 ('sight', 'VBD'),
 ('turn', 'NN'),
 ('music', 'NN'),
 ('got', 'VBD'),
 ('records', 'NNS'),
 ('underneath', 'IN'),
 ('rubble', 'JJ'),
 ('sing', 'VBG'),
 ('rebel', 'NN'),
 ('song', 'NN'),
 ('want', 'VBP'),
 ('see', 'NN'),
 ('another', 'DT'),
 ('generation', 'NN'),
 ('drop', 'NN'),
 ('rather', 'RB'),
 ('comma', 'JJ'),
 ('full', 'JJ'),
 ('stop', 'NN')

As you can see, it does not return values like 'a', 'n', 'v' or 'r' as the WordNet lemmatizer is expecting...

So we have to convert the values from the NLTK POS-tagging to put them into the WordNet Lemmatizer. This is done in the function *get_wordnet_pos* written below. Try to understand it, and then we will reuse it.

In [35]:
from nltk.corpus import wordnet

def get_wordnet_pos(pos_tag):
    output = np.asarray(pos_tag)
    for i in range(len(pos_tag)):
        if pos_tag[i][1].startswith('J'):
            output[i][1] = wordnet.ADJ
        elif pos_tag[i][1].startswith('V'):
            output[i][1] = wordnet.VERB
        elif pos_tag[i][1].startswith('R'):
            output[i][1] = wordnet.ADV
        else:
            output[i][1] = wordnet.NOUN
    return output

So now you have all we need to perform properly the lemmatization.

So you have to use the following to do so:
* your tags from the POS-tagging performed
* the function *get_wordnet_pos*
* the *WordNetLemmatizer*

In [43]:
tags = get_wordnet_pos(tags)
print(tags)

[['turn' 'n']
 ['music' 'n']
 ['got' 'n']
 ['records' 'n']
 ['shut' 'n']
 ['world' 'n']
 ['outside' 'n']
 ['lights' 'n']
 ['come' 'n']
 ['Maybe' 'n']
 ['streets' 'n']
 ['alight' 'n']
 ['maybe' 'n']
 ['trees' 'n']
 ['gone' 'n']
 ['feel' 'n']
 ['heart' 'n']
 ['start' 'n']
 ['beating' 'n']
 ['favourite' 'n']
 ['song' 'n']
 ['kids' 'n']
 ['dance' 'n']
 ['kids' 'n']
 ['night' 'n']
 ['Monday' 'n']
 ['morning' 'n']
 ['feels' 'n']
 ['another' 'n']
 ['life' 'n']
 ['turn' 'n']
 ['music' 'n']
 ['roll' 'n']
 ['time' 'n']
 ['heaven' 'n']
 ['sight' 'n']
 ['turn' 'n']
 ['music' 'n']
 ['got' 'n']
 ['records' 'n']
 ['underneath' 'n']
 ['rubble' 'n']
 ['sing' 'n']
 ['rebel' 'n']
 ['song' 'n']
 ['want' 'n']
 ['see' 'n']
 ['another' 'n']
 ['generation' 'n']
 ['drop' 'n']
 ['rather' 'n']
 ['comma' 'n']
 ['full' 'n']
 ['stop' 'n']
 ['Maybe' 'n']
 ['black' 'n']
 ['maybe' 'n']
 ['knees' 'n']
 ['Maybe' 'n']
 ['gap' 'n']
 ['two' 'n']
 ['trapezes' 'n']
 ['heart' 'n']
 ['beating' 'n']
 ['pulses' 'n']
 ['start' 'n

In [44]:
# TODO: Perform the lemmatization properly
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
tokenized_lemmatize = [wnl.lemmatize(token, pos) for token, pos in tags]
print(tokenized_lemmatize)

['turn', 'music', 'got', 'record', 'shut', 'world', 'outside', 'light', 'come', 'Maybe', 'street', 'alight', 'maybe', 'tree', 'gone', 'feel', 'heart', 'start', 'beating', 'favourite', 'song', 'kid', 'dance', 'kid', 'night', 'Monday', 'morning', 'feel', 'another', 'life', 'turn', 'music', 'roll', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'record', 'underneath', 'rubble', 'sing', 'rebel', 'song', 'want', 'see', 'another', 'generation', 'drop', 'rather', 'comma', 'full', 'stop', 'Maybe', 'black', 'maybe', 'knee', 'Maybe', 'gap', 'two', 'trapeze', 'heart', 'beating', 'pulse', 'start', 'Cathedrals', 'heart', 'saw', 'oh', 'light', 'swear', 'emerge', 'blinking', 'tell', 'alright', 'soar', 'wall', 'every', 'siren', 'symphony', 'every', 'tear', 'waterfall', 'waterfall', 'Oh', 'waterfall', 'Oh', 'oh', 'oh', 'waterfall', 'Every', 'tear', 'waterfall', 'Oh', 'oh', 'oh', 'hurt', 'hurt', 'bad', 'still', 'raise', 'flag', 'Oh', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'Every', 'tear', '

What do you think?

Still not perfect, but it's the best we can do for now.

Now you can try stemming, with the help of the lecture, and see the differences compared to the lemmatization

In [45]:
# TODO: Perform stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_sentence]
print(stemmed_tokens)

['turn', 'music', 'got', 'record', 'shut', 'world', 'outsid', 'light', 'come', 'mayb', 'street', 'alight', 'mayb', 'tree', 'gone', 'feel', 'heart', 'start', 'beat', 'favourit', 'song', 'kid', 'danc', 'kid', 'night', 'monday', 'morn', 'feel', 'anoth', 'life', 'turn', 'music', 'roll', 'time', 'heaven', 'sight', 'turn', 'music', 'got', 'record', 'underneath', 'rubbl', 'sing', 'rebel', 'song', 'want', 'see', 'anoth', 'gener', 'drop', 'rather', 'comma', 'full', 'stop', 'mayb', 'black', 'mayb', 'knee', 'mayb', 'gap', 'two', 'trapez', 'heart', 'beat', 'puls', 'start', 'cathedr', 'heart', 'saw', 'oh', 'light', 'swear', 'emerg', 'blink', 'tell', 'alright', 'soar', 'wall', 'everi', 'siren', 'symphoni', 'everi', 'tear', 'waterfal', 'waterfal', 'oh', 'waterfal', 'oh', 'oh', 'oh', 'waterfal', 'everi', 'tear', 'waterfal', 'oh', 'oh', 'oh', 'hurt', 'hurt', 'bad', 'still', 'rais', 'flag', 'oh', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'wa', 'everi', 'tear', 'everi', 'tear', 'everi', 'teardrop', 'wate

Do you see the difference? What would you use?