Let's apply all the preprocessing methods we have discussed so far to our Zomato dataset and see how they work together.

We import the necessary libraries.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


We load our data, "zomato_reviews", a dataset of restaurant reviews.

In [None]:
df = pd.read_csv("/content/zomato_reviews.csv")
df.head(5)

Unnamed: 0,Review,sentiment
0,Virat Kohli did a great thing to open his rest...,positive
1,This place have some really heathy options to ...,positive
2,Aerocity is the most finest place in Delhi for...,positive
3,"Yesterday evening there was small team lunch ,...",positive
4,I find aerocity to be the best place in delhi ...,positive


In [None]:
corpus = pd.Series(df.Review.tolist()).astype(str)

In [None]:
corpus

0       Virat Kohli did a great thing to open his rest...
1       This place have some really heathy options to ...
2       Aerocity is the most finest place in Delhi for...
3       Yesterday evening there was small team lunch ,...
4       I find aerocity to be the best place in delhi ...
                              ...                        
1591    || DESI LANE || So we were at alipore's most h...
1592    "Desi Lane" is one of the most trending place ...
1593    One of the cool and pocket pinch restaurant at...
1594    "DESI LANE" one of the best places in town and...
1595    Looking for good place for lunch but dont wann...
Length: 1596, dtype: object

We start by lowercasing the data.

In [None]:
def lowercase(corpus):
  corpus = corpus.apply(lambda x: x.lower() if isinstance(x, str) else x)
  return corpus

In [None]:
lowercase(corpus)

0       virat kohli did a great thing to open his rest...
1       this place have some really heathy options to ...
2       aerocity is the most finest place in delhi for...
3       yesterday evening there was small team lunch ,...
4       i find aerocity to be the best place in delhi ...
                              ...                        
1591    || desi lane || so we were at alipore's most h...
1592    "desi lane" is one of the most trending place ...
1593    one of the cool and pocket pinch restaurant at...
1594    "desi lane" one of the best places in town and...
1595    looking for good place for lunch but dont wann...
Length: 1596, dtype: object

Then, we clean the data by removing special characters and punctuation.

In [None]:
def text_clean(corpus, keep_list):
    '''
    Purpose : Function to keep only letters, digits and certain words (punctuation, question marks, tabs etc. removed)

    Input : Takes a text corpus, 'corpus', to be cleaned along with a list of words, 'keep_list', which have to be retained
            even after the cleaning process

    Output : Returns the cleaned text corpus as a pandas Series

    '''
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # Replace every non-alphanumeric character with a space, then lowercase
                p1 = re.sub(pattern='[^a-zA-Z0-9]', repl=' ', string=word)
                p1 = p1.lower()
                qs.append(p1)
            else:
                # Words in keep_list are kept exactly as written (the match is case-sensitive)
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    # Series.append was removed in pandas 2.0, so build the Series once from a list
    return pd.Series(cleaned_rows, dtype=object)
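
To make the role of `keep_list` concrete, here is a minimal sketch that applies the same per-word rule to a single string. The helper name `clean_row` and the sample sentence are made up for illustration, and unlike `text_clean` this sketch also collapses the extra spaces left behind by removed characters. It shows that the `keep_list` match is case-sensitive, so a word such as `Mr.` is only protected if the text has not been lowercased beforehand.

```python
import re

def clean_row(row, keep_list):
    """Clean one string with the same per-word rule that text_clean uses."""
    tokens = []
    for word in row.split():
        if word in keep_list:
            tokens.append(word)  # kept verbatim
        else:
            # Non-alphanumeric characters become spaces, then the word is lowercased
            tokens.append(re.sub(r'[^a-zA-Z0-9]', ' ', word).lower())
    # Extra step (not in text_clean): collapse repeated spaces
    return ' '.join(' '.join(tokens).split())

keep = ['U.S.A', 'Mr.', 'Mrs.', 'D.C.']
sample = "Mr. Sharma's biryani was great!!"  # hypothetical review text

cleaned = clean_row(sample, keep)
cleaned_after_lowercasing = clean_row(sample.lower(), keep)

print(cleaned)                    # Mr. sharma s biryani was great
print(cleaned_after_lowercasing)  # mr sharma s biryani was great
```

Note that the apostrophe in "Sharma's" is replaced by a space, which leaves a stray `s` token. Stopword removal drops it later, because `s` is in NLTK's English stopword list.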

Stopword removal: we remove the stopwords defined in the NLTK library, except for the wh-words and "the", which the function below keeps.


In [None]:
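def stopwords_removal(corpus):
    # Words to keep even though NLTK lists them as stopwords
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom','the']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.discard(word)
    # Split each document on whitespace and drop the remaining stopwords
    corpus = [[token for token in doc.split() if token not in stop] for doc in corpus]
    return corpus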
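
The idea of carving exceptions out of a stop list can be sketched without downloading anything. The stop list below is a tiny hand-written stand-in, not NLTK's list, and the names `demo_stopwords`, `keep_words` and `remove_stopwords_demo` are hypothetical. As in `stopwords_removal`, each document comes back as a list of tokens rather than a string.

```python
# Tiny illustrative stop list; the notebook uses stopwords.words('english') instead
demo_stopwords = {'the', 'a', 'is', 'to', 'of', 'who', 'what', 'why'}

# Question words we want to survive the filtering
keep_words = {'who', 'what', 'why'}
stop = demo_stopwords - keep_words

def remove_stopwords_demo(corpus, stop):
    """Split each document on whitespace and drop tokens found in `stop`."""
    return [[tok for tok in doc.split() if tok not in stop] for doc in corpus]

docs = ["who is the chef", "what a view of the city"]
tokens = remove_stopwords_demo(docs, stop)
print(tokens)  # [['who', 'chef'], ['what', 'view', 'city']]
```

Keeping question words can matter when the text is later used for tasks where "who" or "why" carries meaning, such as question classification.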

#Lemmatization

Here we apply a lemmatization algorithm (WordNetLemmatizer) to our dataset. In this case, we lemmatize every token as a verb (pos = 'v').


In [None]:
def lemmatize(corpus):
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(x, pos = 'v') for x in x] for x in corpus]
    return corpus

#Stemming

We perform stemming on the text corpus using the PorterStemmer, which works by stripping suffixes.

In [None]:
def stem(corpus):
    stemmer = PorterStemmer()
    corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    return corpus

For data cleaning, we define a function that combines all of the preprocessing steps above.







In [None]:
def preprocess(corpus, keep_list, cleaning = True, stemming = False, lemmatization = False, remove_stopwords = True):
    '''
    Purpose : Function to perform all pre-processing tasks (cleaning, stemming, lemmatization, stopwords removal etc.)

    Input :
    'corpus' - Text corpus on which pre-processing tasks will be performed
    'keep_list' - List of words to be retained during cleaning process
    'cleaning', 'stemming', 'lemmatization', 'remove_stopwords' - Boolean variables indicating whether a particular task should
                                                                  be performed or not

    Note : Either stemming or lemmatization should be used. There's no benefit of using both of them together.
           Stemming always uses the Porter stemmer.

    Output : Returns the processed text corpus as a list of strings

    '''
    if cleaning:
        # text_clean lowercases every word except those in keep_list, so the corpus
        # must not be lowercased beforehand or keep_list would never match
        corpus = text_clean(corpus, keep_list)
    else:
        corpus = lowercase(corpus)

    if remove_stopwords:
        corpus = stopwords_removal(corpus)
    else:
        # Tokenize on whitespace so later steps always receive lists of tokens
        corpus = [doc.split() for doc in corpus]

    if lemmatization:
        corpus = lemmatize(corpus)

    if stemming:
        corpus = stem(corpus)

    # Join the tokens of each document back into a single string
    corpus = [' '.join(tokens) for tokens in corpus]

    return corpus

These are the words that are kept intact during preprocessing:


In [None]:
common_dot_words = ['U.S.A', 'Mr.', 'Mrs.', 'D.C.']

# In this example, we apply text cleaning and stop-word removal

In [None]:
corpus_preprocessed=preprocess(corpus, keep_list=common_dot_words, cleaning = True, stemming = False, lemmatization = False, remove_stopwords = True)


We will compare the data before and after preprocessing. First, we will print the data without preprocessing, and then we will print the preprocessed data.

In [None]:
corpus[0]

'Virat Kohli did a great thing to open his restaurant in an exquisite place of Delhi. Wide range of food with lots and lots of options on drinks. Courteous staff with a quick response on anything.'

In [None]:
corpus_preprocessed[0]

'virat kohli great thing open restaurant exquisite place delhi wide range food lots lots options drinks courteous staff quick response anything'
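
As a rough way to quantify the difference, the sketch below counts the tokens in the two strings printed above and lists what was dropped. The variable names are illustrative, and the punctuation stripping here is only a quick normalization for the comparison, not the notebook's `text_clean`.

```python
from collections import Counter
import string

# The first review before and after preprocessing, copied from the outputs above
before = ('Virat Kohli did a great thing to open his restaurant in an exquisite place of Delhi. '
          'Wide range of food with lots and lots of options on drinks. '
          'Courteous staff with a quick response on anything.')
after = ('virat kohli great thing open restaurant exquisite place delhi wide range food '
         'lots lots options drinks courteous staff quick response anything')

# Normalize the raw tokens so they are comparable with the preprocessed ones
before_tokens = [tok.strip(string.punctuation).lower() for tok in before.split()]
after_tokens = after.split()

# Counter subtraction keeps only tokens that appear more often before than after
dropped = Counter(before_tokens) - Counter(after_tokens)

print(len(before_tokens), len(after_tokens))  # 36 21
print(sorted(dropped.items()))
```

Every dropped token here is a stopword ("did", "a", "to", "his", "in", "an", "of", "with", "and", "on"), which is what we expect with cleaning and stopword removal enabled and stemming and lemmatization disabled.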