Text cleaning and preparation are crucial steps in NLP. Raw text must be cleaned before any other operation is performed on it, because otherwise the dataset is just a jumble of words, punctuation, and digits that a computer cannot interpret.

The steps that we will go through in this exercise are as follows:

- Removing punctuation and digits
- Tokenization
- Removing stopwords
- Lemmatizing/Stemming

In this exercise we will deal with a dataset that classifies news as fake or real.

Import libraries

In [1]:
import pandas as pd
import string
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


For this exercise, we will take the Fake and True datasets and then merge them into a `news` dataset. The cell below loads only `Fake.csv`.

In [2]:
# news = pd.read_csv("/content/drive/MyDrive/INFOSYS723/Data/Fake.csv")
news = pd.read_csv("Fake.csv")
news.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
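
The merge mentioned earlier is not shown in this notebook. As a minimal sketch, assuming the real-news file has the same columns as `Fake.csv`, two frames can be stacked with `pd.concat`. The tiny in-memory frames and the `label` column below are hypothetical stand-ins, not part of the original data.

```python
import pandas as pd

# Hypothetical stand-ins for the frames loaded from the fake and real news files
fake = pd.DataFrame({"title": ["fake headline one", "fake headline two"]})
real = pd.DataFrame({"title": ["real headline one"]})

# Tag each frame with its class (assumed column name), then stack the rows
merged = pd.concat(
    [fake.assign(label="fake"), real.assign(label="real")],
    ignore_index=True,  # rebuild a clean 0..n-1 index
)
print(merged)
```

A clean index matters here because the later cells add columns with `pd.Series(...)`, and pandas aligns such assignments on the index.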


## Step 1: Remove Punctuation and Digits

The `title` text contains a lot of punctuation. Punctuation is often unnecessary because it adds little value or meaning for an NLP model. The `string` module lists 32 ASCII punctuation characters in `string.punctuation`.

In [3]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Now, to remove punctuation and digits, we iterate through each character of every title in the `title` column and check whether it is a punctuation character or a digit. If it is, we discard it; otherwise we keep it. At the end, we join the remaining characters back into a single string.


In [6]:
# a list to store the cleaned titles
non_punct = []

# loop through each title in the `title` column and check each character;
# keep the character only if it is neither punctuation nor a digit
for title in news['title']:
    letters = [letter for letter in title if letter not in string.punctuation and not letter.isdigit()]
    non_punct.append(''.join(letters))

news['title_non_punct'] = pd.Series(non_punct)
news.head()

Unnamed: 0,title,text,subject,date,title_non_punct
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Donald Trump Sends Out Embarrassing New Year’...
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Drunk Bragging Trump Staffer Started Russian ...
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Sheriff David Clarke Becomes An Internet Joke...
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Trump Is So Obsessed He Even Has Obama’s Name...
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Pope Francis Just Called Out Donald Trump Dur...


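After step 1, the `title_non_punct` column holds the titles with ASCII punctuation and digits removed. Note that `string.punctuation` covers only ASCII characters, so the curly apostrophe (’) in "New Year’..." is still present.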
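
The same filtering can be sketched with `str.translate`, which deletes characters in a single pass. This is an alternative to the loop above, not the notebook's method; `strip_punct_digits` and its `extra` parameter are hypothetical names. One difference to be aware of: it removes only ASCII digits (`string.digits`), while `str.isdigit()` also matches some non-ASCII digit characters.

```python
import string

def strip_punct_digits(text, extra=""):
    """Remove ASCII punctuation, ASCII digits, and any characters in `extra`."""
    # Third argument of str.maketrans lists the characters to delete
    table = str.maketrans("", "", string.punctuation + string.digits + extra)
    return text.translate(table)

print(strip_punct_digits("Top 10 Lies, Debunked!"))
# Pass non-ASCII punctuation explicitly if it should go too
print(strip_punct_digits("New Year’s Eve", extra="’"))
```

Passing the curly apostrophe through `extra` is one way to handle the typographic quotes that `string.punctuation` misses.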

## Step 2: Tokenization

Tokenizing is the process of splitting strings into a list of words. We will use regular expressions (regex), which describe search patterns, to do the splitting.

In [8]:
# a list to store tokens
tokens = []

# loop through each title in the `title_non_punct` column,
# split it on runs of one or more non-word characters,
# and convert each token to lowercase
for title in news['title_non_punct']:
    split_words = re.split(r"\W+", title)
    tokens.append([word.lower() for word in split_words])


news['title_non_punct_split'] = pd.Series(tokens)
news.head()

Unnamed: 0,title,text,subject,date,title_non_punct,title_non_punct_split
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Donald Trump Sends Out Embarrassing New Year’...,"[, donald, trump, sends, out, embarrassing, ne..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Drunk Bragging Trump Staffer Started Russian ...,"[, drunk, bragging, trump, staffer, started, r..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Sheriff David Clarke Becomes An Internet Joke...,"[, sheriff, david, clarke, becomes, an, intern..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Trump Is So Obsessed He Even Has Obama’s Name...,"[, trump, is, so, obsessed, he, even, has, oba..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Pope Francis Just Called Out Donald Trump Dur...,"[, pope, francis, just, called, out, donald, t..."


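After step 2, the `title_non_punct_split` column holds a list of lowercase tokens for each title, produced by splitting on non-word characters. The empty first token (`[, donald, ...`) appears because `re.split` returns an empty string when the text begins with a separator, which suggests these titles start with a space.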
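
The splitting behaviour can be checked on a single string. The sketch below is illustrative only; `tokenize` is a hypothetical helper that additionally drops empty tokens, which the notebook's loop does not do.

```python
import re

def tokenize(text):
    """Lowercase, split on runs of non-word characters, and drop empty tokens."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

raw = " Donald Trump Sends Out"

# A leading separator produces an empty string as the first element
plain_split = re.split(r"\W+", raw)
print(plain_split)

# Filtering on truthiness removes it
print(tokenize(raw))
```

Stripping whitespace from the titles before splitting (for example with `str.strip()`) would be another way to avoid the empty token.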

## Step 3: Remove Stopwords

Now we have a list of words for each title. Let’s go ahead and remove the stop words. Stop words are very common words (such as "the", "is", and "an") that are unlikely to help in identifying a text as real or fake. We will use the English stop-word list from the `nltk` library; some of the stop words in this library are:

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [10]:
stopword = nltk.corpus.stopwords.words('english')
# print the first 11 stop words in the list
print(stopword[:11])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've"]


Let's remove stop words from the tokenized list of words in each title.

In [11]:
# a list to store the titles with stop words removed
non_stopwords = []

# loop through each token list in the `title_non_punct_split` column,
# keep only the words that are not stop words in `non_stop`,
# then append each filtered list to `non_stopwords`
for title in news['title_non_punct_split']:
    non_stop = [word for word in title if word not in stopword]
    non_stopwords.append(non_stop)

news['title_non_punct_split_non_stopwords'] = pd.Series(non_stopwords)
news.head()

Unnamed: 0,title,text,subject,date,title_non_punct,title_non_punct_split,title_non_punct_split_non_stopwords
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Donald Trump Sends Out Embarrassing New Year’...,"[, donald, trump, sends, out, embarrassing, ne...","[, donald, trump, sends, embarrassing, new, ye..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Drunk Bragging Trump Staffer Started Russian ...,"[, drunk, bragging, trump, staffer, started, r...","[, drunk, bragging, trump, staffer, started, r..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Sheriff David Clarke Becomes An Internet Joke...,"[, sheriff, david, clarke, becomes, an, intern...","[, sheriff, david, clarke, becomes, internet, ..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Trump Is So Obsessed He Even Has Obama’s Name...,"[, trump, is, so, obsessed, he, even, has, oba...","[, trump, obsessed, even, obama, name, coded, ..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Pope Francis Just Called Out Donald Trump Dur...,"[, pope, francis, just, called, out, donald, t...","[, pope, francis, called, donald, trump, chris..."


After step 3, the `title_non_punct_split_non_stopwords` column no longer contains stop words such as "out", "an", and "is".
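
Membership tests against a list scan it element by element, so on larger datasets it may help to convert the stop-word list to a `set` first. The sketch below shows the idea with a small stand-in set; `demo_stop_set` and `remove_stopwords` are hypothetical names, and in the notebook the set would be built from the NLTK list (for example `set(stopword)`).

```python
# Stand-in stop-word set for illustration only; not the full NLTK list
demo_stop_set = {"out", "an", "is", "so", "he", "has", "just"}

def remove_stopwords(tokens, stop_set):
    """Return the tokens that are not in `stop_set`, preserving their order."""
    return [tok for tok in tokens if tok not in stop_set]

title_tokens = ["trump", "is", "so", "obsessed", "he", "even", "has"]
filtered = remove_stopwords(title_tokens, demo_stop_set)
print(filtered)
```

The result is the same as with a list; only the lookup cost per word changes.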

## Step 4: Lemmatize/Stem

Stemming and lemmatizing are processes that reduce a word to its root form. The main purpose is to reduce variations of the same word, thereby reducing the vocabulary we include in the model. ***The difference between stemming and lemmatizing is that stemming chops off the end of the word without taking the context of the word into consideration, whereas lemmatizing considers the context of the word and reduces it to its root form based on the dictionary definition.*** Stemming is faster than lemmatizing, so the choice between them is a trade-off between speed and accuracy. This notebook applies stemming with `PorterStemmer`.

In [12]:
ps = PorterStemmer()

In [13]:
# a list to store the stemmed titles
stemmed_list = []

# loop through each token list in the `title_non_punct_split_non_stopwords` column
# and append the stemmed version of each word to `stemmed_list`
for title in news['title_non_punct_split_non_stopwords']:
    stemmed = [ps.stem(word) for word in title]
    stemmed_list.append(stemmed)

news['title_non_punct_split_non_stopwords_stemmed'] = pd.Series(stemmed_list)
news.head()

Unnamed: 0,title,text,subject,date,title_non_punct,title_non_punct_split,title_non_punct_split_non_stopwords,title_non_punct_split_non_stopwords_stemmed
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Donald Trump Sends Out Embarrassing New Year’...,"[, donald, trump, sends, out, embarrassing, ne...","[, donald, trump, sends, embarrassing, new, ye...","[, donald, trump, send, embarrass, new, year, ..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Drunk Bragging Trump Staffer Started Russian ...,"[, drunk, bragging, trump, staffer, started, r...","[, drunk, bragging, trump, staffer, started, r...","[, drunk, brag, trump, staffer, start, russian..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Sheriff David Clarke Becomes An Internet Joke...,"[, sheriff, david, clarke, becomes, an, intern...","[, sheriff, david, clarke, becomes, internet, ...","[, sheriff, david, clark, becom, internet, jok..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Trump Is So Obsessed He Even Has Obama’s Name...,"[, trump, is, so, obsessed, he, even, has, oba...","[, trump, obsessed, even, obama, name, coded, ...","[, trump, obsess, even, obama, name, code, web..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Pope Francis Just Called Out Donald Trump Dur...,"[, pope, francis, just, called, out, donald, t...","[, pope, francis, called, donald, trump, chris...","[, pope, franci, call, donald, trump, christma..."
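
To make the difference between stemming and lemmatizing concrete, here is a deliberately naive sketch. `naive_stem` is a toy suffix stripper, not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical three-entry dictionary standing in for a real lexical resource such as the one behind NLTK's `WordNetLemmatizer` (imported at the top of this notebook but not used).

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix. A toy rule, NOT the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to dictionary forms
TOY_LEMMAS = {"becomes": "become", "called": "call", "francis": "francis"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["becomes", "called", "francis"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
```

The toy stemmer produces non-words such as `becom` and `franci`, the same truncations visible in the stemmed column above, while the dictionary lookup returns real words at the cost of needing a dictionary.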


Reference: https://towardsdatascience.com/nlp-in-python-data-cleaning-6313a404a470