When working with text in Natural Language Processing (NLP),
text pre-processing is a necessary step. Real-world human-written text data
includes a variety of misspelled words, short words, unique symbols, and emoticons.

In [47]:
import os
import pandas as pd
import string
import re
import nltk
import numpy as np
# Path to the folder containing the text files; in my case, as shown below
basepath = 'C:\\Users\\Muhmad\\Downloads\\aclImdb_v1\\aclImdb'

# Reading the data from folder

The code enters each folder, reads every file, and labels each review with its class, pos or neg.

In [48]:
df = pd.DataFrame(columns=['Review', 'Class'])
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        labels=l
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                 txt = infile.read()
            df.loc[len(df.index)] = [txt, labels]
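Growing a DataFrame one row at a time with `df.loc[len(df.index)]` tends to be slow for tens of thousands of files. The following is a minimal sketch of an alternative, assuming the same `basepath/<split>/<label>/` folder layout as above: collect the rows in a plain list and build the DataFrame once. The function name `read_reviews` is hypothetical and not part of the original notebook.

```python
import os

def read_reviews(basepath, splits=("test", "train"), labels=("pos", "neg")):
    """Return a list of (review_text, label) rows read from basepath/<split>/<label>/."""
    rows = []
    for split in splits:
        for label in labels:
            folder = os.path.join(basepath, split, label)
            # Sort file names so the row order is reproducible across runs
            for name in sorted(os.listdir(folder)):
                with open(os.path.join(folder, name), "r", encoding="utf-8") as infile:
                    rows.append((infile.read(), label))
    return rows

# Build the DataFrame once at the end, for example:
# df = pd.DataFrame(read_reviews(basepath), columns=["Review", "Class"])
```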


## Printing the data that I have stored in DataFrame as df

In [60]:
df

Unnamed: 0,Review,Class,text_wo_stopfreq,text_wo_stopfreqrare
0,I went and saw this movie last night after bei...,pos,went saw last night coaxed friends mine Ill ad...,went saw last night coaxed friends mine Ill ad...
1,Actor turned director Bill Paxton follows up h...,pos,Actor turned director Bill Paxton follows prom...,Actor turned director Bill Paxton follows prom...
2,As a recreational golfer with some knowledge o...,pos,As recreational golfer knowledge sports histor...,As recreational golfer knowledge sports histor...
3,"I saw this film in a sneak preview, and it is ...",pos,saw sneak preview delightful cinematography un...,saw sneak preview delightful cinematography un...
4,Bill Paxton has taken the true story of the 19...,pos,Bill Paxton taken true story 1913 US golf open...,Bill Paxton taken true story 1913 US golf open...
...,...,...,...,...
49996,"Towards the end of the movie, I felt it was to...",neg,Towards end felt technical felt classroom watc...,Towards end felt technical felt classroom watc...
49997,This is the kind of movie that my enemies cont...,neg,kind enemies content watch time bloody true wa...,kind enemies content watch time bloody true wa...
49998,I saw 'Descent' last night at the Stockholm Fi...,neg,saw Descent last night Stockholm Film Festival...,saw Descent last night Stockholm Film Festival...
49999,Some films that you pick up for a pound turn o...,neg,Some films pick pound turn rather 23rd Century...,Some films pick pound turn rather 23rd Century...


## Lower Casing
All data should be converted to lowercase, as this aids preprocessing and
the later parsing stages of the NLP application.

In [49]:
df["text_lower"] = df["Review"].str.lower()
df.head()

Unnamed: 0,Review,Class,text_lower
0,I went and saw this movie last night after bei...,pos,i went and saw this movie last night after bei...
1,Actor turned director Bill Paxton follows up h...,pos,actor turned director bill paxton follows up h...
2,As a recreational golfer with some knowledge o...,pos,as a recreational golfer with some knowledge o...
3,"I saw this film in a sneak preview, and it is ...",pos,"i saw this film in a sneak preview, and it is ..."
4,Bill Paxton has taken the true story of the 19...,pos,bill paxton has taken the true story of the 19...


## Punctuation Removal
Output of string.punctuation:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Removing all of these characters keeps them from adding noise to our data.

In [50]:
df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["Review"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,Review,Class,text_wo_punct
0,I went and saw this movie last night after bei...,pos,I went and saw this movie last night after bei...
1,Actor turned director Bill Paxton follows up h...,pos,Actor turned director Bill Paxton follows up h...
2,As a recreational golfer with some knowledge o...,pos,As a recreational golfer with some knowledge o...
3,"I saw this film in a sneak preview, and it is ...",pos,I saw this film in a sneak preview and it is d...
4,Bill Paxton has taken the true story of the 19...,pos,Bill Paxton has taken the true story of the 19...
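Because `str.translate` deletes punctuation without leaving a gap, words joined by a hyphen or apostrophe get fused together (the later outputs show "I'll" turning into "Ill"). As a hedged alternative, not the approach used in this notebook, punctuation can be mapped to spaces instead. The name `punctuation_to_space` is hypothetical.

```python
import string

# Map every punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

print(punctuation_to_space("A well-known, first-rate film."))
# A well known first rate film
```

Which variant works better depends on the downstream tokenizer, so it is worth checking both on a few reviews.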


## Stopwords Removal

The idea is to exclude words that appear frequently throughout all of the corpus's documents.
Pronouns and articles are typically categorized as stop words.
These terms are not highly discriminative, so they carry little relevance in NLP tasks such as information retrieval and classification.

In [51]:
nltk.download('stopwords')
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Muhmad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Review,Class,text_wo_punct,text_wo_stop
0,I went and saw this movie last night after bei...,pos,I went and saw this movie last night after bei...,I went saw movie last night coaxed friends min...
1,Actor turned director Bill Paxton follows up h...,pos,Actor turned director Bill Paxton follows up h...,Actor turned director Bill Paxton follows prom...
2,As a recreational golfer with some knowledge o...,pos,As a recreational golfer with some knowledge o...,As recreational golfer knowledge sports histor...
3,"I saw this film in a sneak preview, and it is ...",pos,I saw this film in a sneak preview and it is d...,I saw film sneak preview delightful The cinema...
4,Bill Paxton has taken the true story of the 19...,pos,Bill Paxton has taken the true story of the 19...,Bill Paxton taken true story 1913 US golf open...
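The output above still contains capitalized words such as "I" and "The": the NLTK English list is lowercase and the membership test is case-sensitive, while the text was not lowercased before this step. A minimal sketch of a case-insensitive variant follows. `SAMPLE_STOPWORDS` is a tiny illustrative set standing in for `set(stopwords.words('english'))`, and the function name is hypothetical.

```python
# Tiny illustrative stop word set; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
SAMPLE_STOPWORDS = {"i", "the", "this", "a", "and", "is"}

def remove_stopwords_casefold(text, stop_words=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form is a stop word; keep the casing of the rest."""
    return " ".join(w for w in str(text).split() if w.lower() not in stop_words)

print(remove_stopwords_casefold("I saw this film and The cinematography is delightful"))
# saw film cinematography delightful
```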


## Frequent Words Removal
Because the most common words don't provide us with much information,
it is advantageous to exclude them.

When I printed the text from the files, I saw a lot of <br> tags,
which we don't need in our text since they are just noise.

In [52]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('I', 142712),
 ('br', 113790),
 ('The', 88546),
 ('movie', 82288),
 ('film', 73486),
 ('one', 46299),
 ('like', 37489),
 ('This', 29146),
 ('good', 27401),
 ('would', 23754)]
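The `br` token in these counts is what remains of HTML line-break tags once punctuation has been stripped. One option, sketched here under the assumption that the markup looks like `<br>`, `<br/>` or `<br />`, is to remove the tags from the raw review before any other step. `remove_br_tags` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def remove_br_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    return " ".join(BR_TAG.sub(" ", text).split())

print(remove_br_tags("Great film.<br /><br />Loved it."))
# Great film. Loved it.
```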

In [53]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,Review,Class,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,I went and saw this movie last night after bei...,pos,I went and saw this movie last night after bei...,I went saw movie last night coaxed friends min...,went saw last night coaxed friends mine Ill ad...
1,Actor turned director Bill Paxton follows up h...,pos,Actor turned director Bill Paxton follows up h...,Actor turned director Bill Paxton follows prom...,Actor turned director Bill Paxton follows prom...
2,As a recreational golfer with some knowledge o...,pos,As a recreational golfer with some knowledge o...,As recreational golfer knowledge sports histor...,As recreational golfer knowledge sports histor...
3,"I saw this film in a sneak preview, and it is ...",pos,I saw this film in a sneak preview and it is d...,I saw film sneak preview delightful The cinema...,saw sneak preview delightful cinematography un...
4,Bill Paxton has taken the true story of the 19...,pos,Bill Paxton has taken the true story of the 19...,Bill Paxton taken true story 1913 US golf open...,Bill Paxton taken true story 1913 US golf open...


## Rare Words Removal
Very rare words, for example names, are a problem as predictors for text classification because they appear in too few documents to generalize.

In [54]:
df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,Review,Class,text_wo_stopfreq,text_wo_stopfreqrare
0,I went and saw this movie last night after bei...,pos,went saw last night coaxed friends mine Ill ad...,went saw last night coaxed friends mine Ill ad...
1,Actor turned director Bill Paxton follows up h...,pos,Actor turned director Bill Paxton follows prom...,Actor turned director Bill Paxton follows prom...
2,As a recreational golfer with some knowledge o...,pos,As recreational golfer knowledge sports histor...,As recreational golfer knowledge sports histor...
3,"I saw this film in a sneak preview, and it is ...",pos,saw sneak preview delightful cinematography un...,saw sneak preview delightful cinematography un...
4,Bill Paxton has taken the true story of the 19...,pos,Bill Paxton taken true story 1913 US golf open...,Bill Paxton taken true story 1913 US golf open...
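The slice `most_common()[:-n_rare_words-1:-1]` can be hard to read at first glance. The toy example below, with made-up words and counts, shows what it selects.

```python
from collections import Counter

# Made-up counts purely for illustration
toy_counts = Counter({"film": 5, "good": 3, "plot": 2, "zyx": 1, "qwv": 1})
n_rare = 2

# most_common() sorts by descending count, so the rarest words sit at the end.
# The slice [:-n_rare-1:-1] walks backwards from the last item and takes n_rare items.
rarest = [w for (w, wc) in toy_counts.most_common()[:-n_rare - 1:-1]]
print(sorted(rarest))
```

In a real corpus many words share the lowest count, so which of them end up in the slice depends on the order in which they were first counted.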


## Emoticons Removal
Emoticons do not have any significance for NLP classification here, since the same few repeat over and over again.

In [55]:
# src : https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
# Keys are regex fragments, so raw strings are used to avoid
# invalid-escape warnings for sequences such as "\)".
EMOTICONS = {
    r":‑\)":"Happy face or smiley",
    r":\)":"Happy face or smiley",
    r":-\]":"Happy face or smiley",
    r":\]":"Happy face or smiley",
    r":-3":"Happy face smiley",
    r":3":"Happy face smiley",
    r":->":"Happy face smiley",
    r":>":"Happy face smiley",
    r"8-\)":"Happy face smiley",
    r":o\)":"Happy face smiley",
    r":-\}":"Happy face smiley",
    r":\}":"Happy face smiley",
    r":-\)":"Happy face smiley",
    r":c\)":"Happy face smiley",
    r":\^\)":"Happy face smiley",
    r"=\]":"Happy face smiley",
    r"=\)":"Happy face smiley"
}

In [56]:
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

remove_emoticons("Hello :-)")

'Hello '
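The dictionary keys above are hand-escaped regex fragments. If the emoticons are instead kept as plain strings, `re.escape` can do the escaping. This is a sketch under that assumption; `EMOTICON_LITERALS` is a hypothetical, shortened list and not the `emot` data.

```python
import re

# Hypothetical list of plain emoticon strings, with no manual regex escaping
EMOTICON_LITERALS = [":-)", ":)", ":-]", ":]", ":-3", ":3", "=)", "=]"]

# Escape each literal; longer emoticons go first so that a longer one is never
# cut short by a shorter one that happens to be its prefix.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_LITERALS, key=len, reverse=True))
)

def remove_emoticon_literals(text):
    """Delete every emoticon from EMOTICON_LITERALS found in text."""
    return EMOTICON_RE.sub("", text)

print(repr(remove_emoticon_literals("Hello :-)")))
# 'Hello '
```

Compiling the pattern once at module level also avoids rebuilding it on every call.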

## Conversion of Emoticons/Emojis to Words

In [57]:
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

## URLs Removal
URLs do not have any significance for NLP classification, so they are just noise.
Say 10,000 people use the same URL: it won't be useful for us in NLP.

In [58]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [59]:
text = "Check the documentation at https://docs.python.org/3/"
remove_urls(text)

'Check the documentation at '
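Removing a URL from the middle of a sentence leaves a double space behind, and the result above keeps a trailing space. A small sketch that reuses the same pattern and then tidies the whitespace is shown below. The name `remove_urls_and_tidy` is hypothetical.

```python
import re

# Same pattern as in remove_urls above
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_and_tidy(text):
    """Remove URLs, then collapse the whitespace left behind."""
    return " ".join(URL_RE.sub("", text).split())

print(remove_urls_and_tidy("Docs at https://docs.python.org/3/ and www.python.org are useful"))
# Docs at and are useful
```

Note that `\S+` also swallows punctuation attached to the end of a URL, such as a trailing comma.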