<a href="https://colab.research.google.com/github/joynaomi81/Text-Preprocessing-in-NLP/blob/main/Text_Preprocessing_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Preprocessing in NLP Steps:

* Import necessary library
* Load data.
* Data cleaning.
* Text Preprocessing:
* Lowercase
* Removal of punctuations
* Removal of stopwords
* Removal of frequent words
* Removal of rare words
* Removal of special characters
* Stemming
* Lemmatization
* Removal of URLs
* Removal of HTML Tags
* Spelling Correction


In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [35]:
df = pd.read_csv('/content/drive/MyDrive/text.csv')

In [36]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [37]:
pd.set_option('display.max_colwidth', None)
data = df[['comment_text']]

In [38]:
df.head() # Check for the first 5 rows

Unnamed: 0,comment_text,toxic
0,"This letter perfectly illustrates why any hoped for ""reconciliation"" is sheer stupidity.",1
1,One muslim casualty vs the hundreds and thousands of victims of muslim terrorism this year alone. Darn islamophobia.,1
2,(fuck you Osama bin laden and your afghanistani terrorist cunts),1
3,"As long as Trump keeps Stiggin' It to the libs, Palin-Americans won't care.\n\nHealthcare be damned! Full steam ahead!",1
4,This article is a load of crap.... Another Fake News POLL!,1


In [39]:
df.tail() # Check for the last 5 rows

Unnamed: 0,comment_text,toxic
19995,i like smiley pancakes and crap on stick,0
19996,"""\n\n""""żem"""" is not equal to """"że"""". """"żem"""" is an older langage word and means """"że jestem"""" in English means """"I am"""".\n\n""",0
19997,"""\n\n Headlines \n\nCan you please add this comment to the Headlines on the main page, because I don't have access, since I am not an admin. Thanks!\nThe 79th Academy Awards is held in the Kodak Theatre in Los Angeles, California.\n """,0
19998,"Thank You, sorry.–",0
19999,"Schooling \n\nI attended Harrison Trimble in Moncton, N.B and thats where Sidney earned a diploma from Harrison Trimble High School (Source: http://timestranscript.canadaeast.com/sports/article/698292\n\nI remember meeting him and seeing his photo on the Graduation Wall after.\n\nCan someone edit this to include it?",0


In [40]:
# Check for the columns lables
df.columns

Index(['comment_text', 'toxic'], dtype='object')

In [41]:
df.info() # Information about the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   comment_text  20000 non-null  object
 1   toxic         20000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


In [42]:
df.describe() # Descriptive statistics of the dataset

Unnamed: 0,toxic
count,20000.0
mean,0.5
std,0.500013
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [43]:
df.shape # Check for the shape of the dataset

(20000, 2)

In [44]:
# Check for unique values
df.nunique()

Unnamed: 0,0
comment_text,19960
toxic,2


# Data cleaning

In [45]:
#check for missing values
df.isna().sum()

Unnamed: 0,0
comment_text,0
toxic,0


In [46]:
# Checking for duplicate rows in the DataFrame
df.duplicated().sum()

39

There are 39 rows in the DataFrame that are duplicates of other rows.

In [47]:
# Drop duplicates rows
df = df.drop_duplicates()


In [48]:
# Check for the new data shape
df.shape

(19961, 2)

In [49]:
df.loc[:1] # locate a specific row

Unnamed: 0,comment_text,toxic
0,"This letter perfectly illustrates why any hoped for ""reconciliation"" is sheer stupidity.",1
1,One muslim casualty vs the hundreds and thousands of victims of muslim terrorism this year alone. Darn islamophobia.,1


# Data Pre-processing

## Convert data to Lowercase

In [50]:
df.loc[:,'comment_text'] = df['comment_text'].str.lower()
df.head()

Unnamed: 0,comment_text,toxic
0,"this letter perfectly illustrates why any hoped for ""reconciliation"" is sheer stupidity.",1
1,one muslim casualty vs the hundreds and thousands of victims of muslim terrorism this year alone. darn islamophobia.,1
2,(fuck you osama bin laden and your afghanistani terrorist cunts),1
3,"as long as trump keeps stiggin' it to the libs, palin-americans won't care.\n\nhealthcare be damned! full steam ahead!",1
4,this article is a load of crap.... another fake news poll!,1


## Removal of Punctuations

In [51]:
def remove_punctuations(text):
  punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
  for x in text:
    if x in punctuations:
      text = text.replace(x, "")
  return text

In [52]:
df.loc[:,'comment_text'] = df['comment_text'].apply(remove_punctuations)
df.head()

Unnamed: 0,comment_text,toxic
0,this letter perfectly illustrates why any hoped for reconciliation is sheer stupidity,1
1,one muslim casualty vs the hundreds and thousands of victims of muslim terrorism this year alone darn islamophobia,1
2,fuck you osama bin laden and your afghanistani terrorist cunts,1
3,as long as trump keeps stiggin it to the libs palinamericans wont care\n\nhealthcare be damned full steam ahead,1
4,this article is a load of crap another fake news poll,1


## Removal of Stopwords

In [53]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [54]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [55]:
words = set(stopwords.words('english'))
def remove_stopwords(text):
  return " ".join([word for word in str(text).split() if word not in words])

In [56]:
df.loc[:, 'comment_text'] = df['comment_text'].apply(lambda a: remove_stopwords(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectly illustrates hoped reconciliation sheer stupidity,1
1,one muslim casualty vs hundreds thousands victims muslim terrorism year alone darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunts,1
3,long trump keeps stiggin libs palinamericans wont care healthcare damned full steam ahead,1
4,article load crap another fake news poll,1


##  Removal of Frequent Words

In [58]:
from collections import Counter
word_count = Counter()
for text in df['comment_text'].values:
  for word in text.split():
    word_count[word] += 1

word_count.most_common(10)

[('article', 4083),
 ('like', 3222),
 ('page', 3115),
 ('would', 2904),
 ('one', 2866),
 ('dont', 2775),
 ('people', 2653),
 ('==', 2536),
 ('wikipedia', 2445),
 ('fuck', 2249)]

In [59]:
FREQ_WORDS = set([w for (w, wc) in word_count.most_common(4)])
def remove_freqwords(text):
  return " ".join([word for word in str(text).split() if word not in FREQ_WORDS])

In [60]:
df['comment_text'] = df['comment_text'].apply(lambda a: remove_freqwords(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectly illustrates hoped reconciliation sheer stupidity,1
1,one muslim casualty vs hundreds thousands victims muslim terrorism year alone darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunts,1
3,long trump keeps stiggin libs palinamericans wont care healthcare damned full steam ahead,1
4,load crap another fake news poll,1


In [61]:
RARE_WORDS = set([word for (word, wc) in word_count.most_common()[-20: -1]])
RARE_WORDS

{'110s10818',
 '79th',
 'cavalry',
 'cio',
 'growers',
 'jestem',
 'kodak',
 'langage',
 'moncton',
 'pancakes',
 'parmenion',
 'persepolis',
 'questioner',
 'regrupped',
 'rfx',
 'sidney',
 'sorry–',
 'thessalian',
 'xfd'}

In [62]:
def remove_rarewords(text):
  return " ".join([word for word in str(text).split() if word not in RARE_WORDS])

In [63]:
df['comment_text'] = df['comment_text'].apply(lambda a: remove_rarewords(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectly illustrates hoped reconciliation sheer stupidity,1
1,one muslim casualty vs hundreds thousands victims muslim terrorism year alone darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunts,1
3,long trump keeps stiggin libs palinamericans wont care healthcare damned full steam ahead,1
4,load crap another fake news poll,1


## Removal of Special Characters

In [64]:
import re
def remove_spl_chars(text):
  text = re.sub('[^a-zA-Z0-9]', ' ', text)
  text = re.sub('\s+', ' ', text)
  return text

In [65]:
df['comment_text'] = df['comment_text'].apply(lambda a: remove_spl_chars(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectly illustrates hoped reconciliation sheer stupidity,1
1,one muslim casualty vs hundreds thousands victims muslim terrorism year alone darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunts,1
3,long trump keeps stiggin libs palinamericans wont care healthcare damned full steam ahead,1
4,load crap another fake news poll,1


## Stemming

In [66]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
def stem_words(text):
  return " ".join([stemmer.stem(word) for word in text.split()])

In [67]:
df['comment_text'] = df['comment_text'].apply(lambda a: stem_words(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectli illustr hope reconcili sheer stupid,1
1,one muslim casualti vs hundr thousand victim muslim terror year alon darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunt,1
3,long trump keep stiggin lib palinamerican wont care healthcar damn full steam ahead,1
4,load crap anoth fake news poll,1


## Lemmatization

In [68]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [69]:
import nltk
nltk.download('averaged_perceptron_tagger')
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmetizer = WordNetLemmatizer()
nltk.download('wordnet')
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "AV": wordnet.ADV}

def lemmatize_words(text):
  pos_tagged_text = nltk.pos_tag(text.split())
  return " ".join([lemmetizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [70]:
df['comment_text'] = df['comment_text'].apply(lambda a: lemmatize_words(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectli illustr hope reconcili sheer stupid,1
1,one muslim casualti v hundr thousand victim muslim terror year alon darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunt,1
3,long trump keep stiggin lib palinamerican wont care healthcar damn full steam ahead,1
4,load crap anoth fake news poll,1


## Removal of URLs

In [72]:
import re

def remove_urls(text):
  return re.sub(r'https?://\S+|www\.\S+', '', text) # remove urls

df['comment_text'] = df['comment_text'].apply(lambda a: remove_urls(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectli illustr hope reconcili sheer stupid,1
1,one muslim casualti v hundr thousand victim muslim terror year alon darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunt,1
3,long trump keep stiggin lib palinamerican wont care healthcar damn full steam ahead,1
4,load crap anoth fake news poll,1


## Removal of HTML Tags

In [73]:
def remove_html_tags(text):
  return re.sub(r'<.*?>', '', text)

df['comment_text'] = df['comment_text'].apply(lambda a: remove_html_tags(a))
df.head()

Unnamed: 0,comment_text,toxic
0,letter perfectli illustr hope reconcili sheer stupid,1
1,one muslim casualti v hundr thousand victim muslim terror year alon darn islamophobia,1
2,fuck osama bin laden afghanistani terrorist cunt,1
3,long trump keep stiggin lib palinamerican wont care healthcar damn full steam ahead,1
4,load crap anoth fake news poll,1


## Spelling Correction

In [74]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.8/6.8 MB[0m [31m52.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [75]:
from spellchecker import SpellChecker
spell = SpellChecker()

def correct_spellings(text):
  corrected_text = []
  misspelled_words = spell.unknown(text.split())

In [76]:
df['spell_text'] = df['comment_text'].apply(lambda a: correct_spellings(a))
df.head()

Unnamed: 0,comment_text,toxic,spell_text
0,letter perfectli illustr hope reconcili sheer stupid,1,
1,one muslim casualti v hundr thousand victim muslim terror year alon darn islamophobia,1,
2,fuck osama bin laden afghanistani terrorist cunt,1,
3,long trump keep stiggin lib palinamerican wont care healthcar damn full steam ahead,1,
4,load crap anoth fake news poll,1,
