<a href="https://colab.research.google.com/github/lilianabs/live-nlp-sessions/blob/main/Assignment_Day_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

For this example, we'll use the [Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/overview) dataset.

In [3]:
df = pd.read_csv("train.csv")

In [4]:
df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


We'll preprocess the text with the following steps:

1. Tokenization
2. Lowercasing
3. Stopword removal
4. Stemming (optional)
5. Lemmatization
6. Bag of words
7. TF-IDF
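
To make the last two steps concrete, here is a minimal pure-Python sketch of bag of words and TF-IDF on a toy corpus. The three sentences, the whitespace tokenizer, and the helper names (`bow_vector`, `tfidf_vector`) are assumptions for illustration only and are not part of the notebook. The weighting is the plain textbook form (term frequency times `log(N / df)`); libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): stands in for the cleaned tweet text.
docs = [
    "forest fire near la ronge",
    "residents asked shelter place",
    "forest fire evacuation orders",
]

# A whitespace split stands in for a real tokenizer in this sketch.
tokenized = [doc.split() for doc in docs]

# Bag of words: one term-count mapping per document.
bags = [Counter(tokens) for tokens in tokenized]

# Sorted vocabulary so every vector shares the same column order.
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(bag):
    """Return the raw term counts of one document, aligned with vocab."""
    return [bag[word] for word in vocab]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(bag):
    """Return unsmoothed TF-IDF weights of one document, aligned with vocab."""
    total = sum(bag.values())
    return [
        (bag[word] / total) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]

bow_matrix = [bow_vector(bag) for bag in bags]
tfidf_matrix = [tfidf_vector(bag) for bag in bags]
```

In this toy corpus, "near" appears in only one document while "fire" appears in two, so "near" receives the higher TF-IDF weight in the first document even though both words occur once there.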

In [8]:
# Lowercase the text
df['text'] = df['text'].str.lower()

In [24]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [30]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

In [21]:
# Use a set so each membership check is O(1) instead of a list scan
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(sentence):
  # Tokenize, drop stopwords, and rejoin the remaining tokens with spaces
  word_tokens = word_tokenize(sentence)
  filtered_tokens = [word for word in word_tokens if word not in STOP_WORDS]
  return ' '.join(filtered_tokens)

In [22]:
df['text_without_stopwords'] = df['text'].apply(remove_stop_words)

In [23]:
df.head()

Unnamed: 0,id,keyword,location,text,target,text_without_stopwords
0,1,,,our deeds are the reason of this #earthquake m...,1,deeds reason # earthquake may allah forgive us
1,4,,,forest fire near la ronge sask. canada,1,forest fire near la ronge sask . canada
2,5,,,all residents asked to 'shelter in place' are ...,1,residents asked 'shelter place ' notified offi...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive # wildfires evacuation o..."
4,7,,,just got sent this photo from ruby #alaska as ...,1,got sent photo ruby # alaska smoke # wildfires...


In [26]:
# WordNetLemmatizer treats every token as a noun unless a POS tag is passed
lemmatizer = WordNetLemmatizer()

def lemmatize(sentence):
  # Tokenize, lemmatize each token, and rejoin the results with spaces
  word_tokens = word_tokenize(sentence)
  lemmatized_tokens = [lemmatizer.lemmatize(word) for word in word_tokens]
  return ' '.join(lemmatized_tokens)

In [31]:
df['lemmatized_sentence'] = df['text'].apply(lemmatize)

In [32]:
df.head()

Unnamed: 0,id,keyword,location,text,target,text_without_stopwords,lemmatized_sentence
0,1,,,our deeds are the reason of this #earthquake m...,1,deeds reason # earthquake may allah forgive us,our deed are the reason of this # earthquake m...
1,4,,,forest fire near la ronge sask. canada,1,forest fire near la ronge sask . canada,forest fire near la ronge sask . canada
2,5,,,all residents asked to 'shelter in place' are ...,1,residents asked 'shelter place ' notified offi...,all resident asked to 'shelter in place ' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive # wildfires evacuation o...","13,000 people receive # wildfire evacuation or..."
4,7,,,just got sent this photo from ruby #alaska as ...,1,got sent photo ruby # alaska smoke # wildfires...,just got sent this photo from ruby # alaska a ...
