Building a Text Preprocessing Pipeline
--
So far, we have covered most of the text manipulation and processing
techniques. In this section, let’s combine them into a single reusable pipeline.

Problem
--
You want to build an end-to-end text preprocessing pipeline. Whenever
you want to do preprocessing for any NLP application, you can directly
plug in data to this pipeline function and get the required clean text data as
the output.

Solution
--
The simplest way to do this is to create a custom function that applies all
the techniques learned so far.

In [1]:
# Read/create the text data
# Let’s create a string and assign it to a variable.
# Maybe a tweet sample:
tweet_sample= "How to take control of your #debt https://personal.vanguard.com/us/insights/saving-investing/debt-management. #Best advice for #family #financial #success (@PrepareToWin)"


In [2]:
#import nltk
#nltk.download('wordnet')

# Imports used by the function below
import re
from textblob import Word

# Execute the below function to process the tweet:
def processRow(row):
    tweet = row

    # Lower case (strings are immutable, so the result must be reassigned)
    tweet = tweet.lower()

    # Removes unicode strings like "\u002c"  -> ,(comma)
    tweet = re.sub(r'(\\u[0-9A-Fa-f]+)', r'', tweet)

    # Removes non-ascii characters. note : \x00 to \x7f is 0 to 127
    # non-ascii characters like copyright symbol, trademark symbol
    tweet = re.sub(r'[^\x00-\x7f]', r'', tweet)

    # Convert any url to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)

    # Convert any @Username to "AT_USER"
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)

    # Replace multiple white spaces with one white space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace multiple break-line i.e enter-key, with one white space
    tweet = re.sub(r'[\n]+', ' ', tweet)

    # Removes hashtag in front of a word
    # or simply said Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)

    # Replace non-word chars with a single white space
    tweet = re.sub(r'[^\w]', ' ', tweet)

    # Remove smiley face symbols
    tweet = tweet.replace(':)', '')
    tweet = tweet.replace(':(', '')
    # below we have removed all possible emoticons / smiley faces

    # Removes emoticons from text
    tweet = re.sub(r":\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-\*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:", '', tweet)

    # Remove numbers
    tweet = ''.join([i for i in tweet if not i.isdigit()])

    # Remove multiple exclamation  -> this is optional
    tweet = re.sub(r"(\!)\1+", ' ', tweet)

    # Remove multiple question marks -> this is optional
    tweet = re.sub(r"(\?)\1+", ' ', tweet)

    # Remove multistop -> this is optional
    tweet = re.sub(r"(\.)\1+", ' ', tweet)

    # Trim -> this is optional, as this would have been removed
    # by the above [^\w] step
    tweet = tweet.strip('\'"')

    # Making lemma of the cleaned word
    # (split/join also collapses any repeated spaces left by earlier steps)
    tweet = " ".join([Word(word).lemmatize() for word in tweet.split()])

    row = tweet
    return row

# call the function with your data
processRow(tweet_sample)

'how to take control of your debt URL best advice for family financial success AT_USER'

**Note**: The text preprocessing (*`cleaning`*) function above depends entirely on the type and nature of the data, so you have to write your own function accordingly. The one above is just an example for cleaning tweet data. Depending on the kind of cleaning you need, such a function can range from simple to complex.
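
One way to keep such a function adaptable is to split it into small steps and compose them per dataset. The sketch below is illustrative only: the step functions, `build_pipeline`, and `tweet_pipeline` are hypothetical names that do not appear in the code above, and the regular expressions are simplified versions of the ones in `processRow`.

```python
import re
from typing import Callable, Iterable

# A step is any function that maps a string to a string (hypothetical alias).
Step = Callable[[str], str]

def replace_urls(text: str) -> str:
    """Replace www/http(s) links with the placeholder token URL."""
    return re.sub(r"(www\.\S+)|(https?://\S+)", "URL", text)

def replace_mentions(text: str) -> str:
    """Replace @username with the placeholder token AT_USER."""
    return re.sub(r"@\S+", "AT_USER", text)

def strip_hashtag_marks(text: str) -> str:
    """Turn #word into word."""
    return re.sub(r"#(\S+)", r"\1", text)

def drop_non_word_chars(text: str) -> str:
    """Replace every non-word character with a space."""
    return re.sub(r"[^\w]", " ", text)

def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def build_pipeline(steps: Iterable[Step]) -> Step:
    """Return a function that applies the given steps in order."""
    ordered_steps = list(steps)

    def run(text: str) -> str:
        for step in ordered_steps:
            text = step(text)
        return text

    return run

# Assumed step order for tweets; other data may need a different list.
tweet_pipeline = build_pipeline([
    replace_urls,
    replace_mentions,
    strip_hashtag_marks,
    drop_non_word_chars,
    collapse_whitespace,
])

print(tweet_pipeline("Read #this at https://example.com now (@someone)"))
```

Because each step is an ordinary function, a different dataset can reuse the same `build_pipeline` with steps added, removed, or reordered.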

QnA time ( 20 - 25 mins )
--

With respect to this Kaggle competition:
https://www.kaggle.com/c/quora-question-pairs/overview

<font color='green'> <b>Let's do some simple text cleaning and apply skills learned in this Notebook. </b></font>

In [7]:
# load the modules / libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from string import punctuation

In [8]:
# load the Training Data set
train = pd.read_csv("quora_train_set.csv", index_col='id')[:1000]
train.head()

id,qid1,qid2,question1,question2,is_duplicate
0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [9]:
# Check for any null values
print(train.isnull().sum())

qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64


In [10]:
# Preview some of the pairs of questions
a = 0 
for i in range(a,a+10):
    print(train.question1[i])
    print(train.question2[i])
    print()

What is the step by step guide to invest in share market in india?
What is the step by step guide to invest in share market?

What is the story of Kohinoor (Koh-i-Noor) Diamond?
What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?

How can I increase the speed of my internet connection while using a VPN?
How can Internet speed be increased by hacking through DNS?

Why am I mentally very lonely? How can I solve it?
Find the remainder when [math]23^{24}[/math] is divided by 24,23?

Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?
Which fish would survive in salt water?

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?

Should I buy tiago?
What keeps childern active and far from phone and video games?

How can I be a good geologist?
What should I do to be a great geologist?

When do you use シ instea

In [14]:
# we have defined our own list of Stop words
stop_words = ['the','a','an','and','but','if','or','because','as','what','which','this','that','these','those','then',
              'just','so','than','such','both','through','about','for','is','of','while','during','to','What','Which',
              'Is','If','While','This']
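
The list above contains capitalized duplicates such as 'What' and 'Is' because the membership test used later is case-sensitive. A possible alternative, sketched here under the assumption that case does not matter for stop words, is to store only lowercase entries in a set and lowercase each token before the lookup. `STOP_WORDS_LOWER` and `remove_stop_words` are hypothetical names used only for illustration.

```python
# Lowercase-only stop words, taken from the list above (hypothetical name).
STOP_WORDS_LOWER = frozenset([
    'the', 'a', 'an', 'and', 'but', 'if', 'or', 'because', 'as', 'what',
    'which', 'this', 'that', 'these', 'those', 'then', 'just', 'so', 'than',
    'such', 'both', 'through', 'about', 'for', 'is', 'of', 'while', 'during',
    'to',
])

def remove_stop_words(text, stop_words=STOP_WORDS_LOWER):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("What is the story of Kohinoor Diamond"))
```

A set also gives constant-time membership checks, which may matter when the same test runs for every token of a large corpus.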

In [11]:
def text_to_wordlist(text, remove_stop_words=True, stem_words=False):
    # Clean the text,
    # with the option to remove stop_words
    # and to stem words.

    # Clean the text in your own way.  This gives better results.
    # So instead of using only regex, we have identified the
    # short forms people type & would use in english
    # and replaced them.
    # This list could go even bigger.

    # NB: this first substitution already turns apostrophes and hyphens
    # into spaces, so the apostrophe/hyphen patterns below (e.g. "what's",
    # "e-mail") can no longer match. Contractions are instead caught by
    # the space-delimited patterns such as " m ".
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"what's", "", text)
    text = re.sub(r"What's", "", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "I am", text)
    text = re.sub(r" m ", " am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"60k", " 60000 ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    # NB: in a regex, \0 denotes the NUL character, so this pattern
    # (and "\0rs " further below) never matches ordinary text.
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e-mail", "email", text)
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"quikly", "quickly", text)
    text = re.sub(r" usa ", " America ", text)
    text = re.sub(r" USA ", " America ", text)
    text = re.sub(r" u s ", " America ", text)
    text = re.sub(r" uk ", " England ", text)
    text = re.sub(r" UK ", " England ", text)
    text = re.sub(r"india", "India", text)
    text = re.sub(r"switzerland", "Switzerland", text)
    text = re.sub(r"china", "China", text)
    text = re.sub(r"chinese", "Chinese", text)
    text = re.sub(r"imrovement", "improvement", text)
    text = re.sub(r"intially", "initially", text)
    text = re.sub(r"quora", "Quora", text)
    text = re.sub(r" dms ", "direct messages ", text)
    text = re.sub(r"demonitization", "demonetization", text)
    text = re.sub(r"actived", "active", text)
    text = re.sub(r"kms", " kilometers ", text)
    text = re.sub(r"KMs", " kilometers ", text)
    text = re.sub(r" cs ", " computer science ", text)
    text = re.sub(r" upvotes ", " up votes ", text)
    text = re.sub(r" iPhone ", " phone ", text)
    text = re.sub(r"\0rs ", " rs ", text)
    text = re.sub(r"calender", "calendar", text)
    # \b keeps short tokens from matching inside longer words
    # (e.g. "ios" in "curiosity", "gst" in "angst", "dna" in "kidnap")
    text = re.sub(r"\bios\b", "operating system", text)
    text = re.sub(r"\bgps\b", "GPS", text)
    text = re.sub(r"\bgst\b", "GST", text)
    text = re.sub(r"programing", "programming", text)
    text = re.sub(r"bestfriend", "best friend", text)
    text = re.sub(r"\bdna\b", "DNA", text)
    text = re.sub(r"III", "3", text)
    text = re.sub(r"the US\b", "America", text)
    text = re.sub(r"Astrology", "astrology", text)
    text = re.sub(r"Method", "method", text)
    text = re.sub(r"Find", "find", text)
    text = re.sub(r"banglore", "Banglore", text)
    text = re.sub(r" J K ", " JK ", text)

    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if w not in stop_words]
        text = " ".join(text)

    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    # Please note, we could have also generated lemma.
    # from textblob import Word
    # text = " ".join([Word(word).lemmatize() for word in text.split()])

    # Return the cleaned text as a single space-separated string
    return text
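
A long run of `re.sub` calls can also be written as a table of (pattern, replacement) pairs that is compiled once and applied in a loop. The following is a minimal sketch, not the function used in this notebook: `REPLACEMENTS` and `apply_replacements` are hypothetical names, and only a few rules are shown. It anchors short tokens with `\b` so that a rule such as `ios` does not fire inside longer words.

```python
import re

# (pattern, replacement) pairs; a small illustrative subset of the rules.
REPLACEMENTS = [
    (r"\bquikly\b", "quickly"),
    (r"\bios\b", "operating system"),
    (r"\bgst\b", "GST"),
    (r"\bdna\b", "DNA"),
]

# Compile once so the patterns are not re-parsed for every question.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in REPLACEMENTS]

def apply_replacements(text, rules=COMPILED_RULES):
    """Apply each compiled rule to the text, in table order."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text

# Without word boundaries the rule also rewrites the middle of a word:
print(re.sub(r"ios", "operating system", "curiosity"))

# With word boundaries only the standalone token is replaced:
print(apply_replacements("curiosity about ios"))
```

Keeping the rules in a list makes it easy to add, remove, or reorder them without touching the function body.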

In [12]:
# Helper that applies the above user-defined function to every question
def process_questions(question_list, questions, 
                      question_list_name, 
                      dataframe):
    '''transform questions and display progress'''
    for question in questions:
        question_list.append(text_to_wordlist(question))
        if len(question_list) % 100 == 0:
            progress = len(question_list)/len(dataframe) * 100
            print("{} is {}% complete.".format(question_list_name, round(progress, 1)))

In [15]:
# Create an empty list to hold the cleaned questions
train_question1 = []

# Call the user-defined function above
process_questions(train_question1, train.question1, 
                  'train_question1', train)

train_question1 is 10.0% complete.
train_question1 is 20.0% complete.
train_question1 is 30.0% complete.
train_question1 is 40.0% complete.
train_question1 is 50.0% complete.
train_question1 is 60.0% complete.
train_question1 is 70.0% complete.
train_question1 is 80.0% complete.
train_question1 is 90.0% complete.
train_question1 is 100.0% complete.


In [16]:
# Create an empty list to hold the cleaned questions
train_question2 = []

# Call the user-defined function above
process_questions(train_question2, train.question2, 
                  'train_question2', train)

train_question2 is 10.0% complete.
train_question2 is 20.0% complete.
train_question2 is 30.0% complete.
train_question2 is 40.0% complete.
train_question2 is 50.0% complete.
train_question2 is 60.0% complete.
train_question2 is 70.0% complete.
train_question2 is 80.0% complete.
train_question2 is 90.0% complete.
train_question2 is 100.0% complete.


In [17]:
# Preview some transformed pairs of questions
a = 0 
for i in range(a,a+10):
    print(train_question1[i])
    print(train_question2[i])
    print()

step by step guide invest in share market in India
step by step guide invest in share market

story Kohinoor Koh i Noor Diamond
would happen Indian government stole Kohinoor Koh i Noor diamond back

How can I increase speed my internet connection using VPN
How can Internet speed be increased by hacking DNS

Why am I mentally very lonely How can I solve it
find remainder when math 23 24 math divided by 24 23

one dissolve in water quickly sugar salt methane carbon di oxide
fish would survive in salt water

astrology I am Capricorn Sun Cap moon cap rising does say me
I am triple Capricorn Sun Moon ascendant in Capricorn does say me

Should I buy tiago
keeps childern active far from phone video games

How can I be good geologist
should I do be great geologist

When do you use instead
When do you use instead

Motorola company Can I hack my Charter Motorolla DCX3400
How do I hack Motorola DCX3400 free internet

