## Hello!

Welcome to the first objective of your action learning plan. This notebook covers the steps needed to clean and preprocess user-generated text data. Follow the Programming Guidelines carefully and, in case of any doubt, reach out in the discussion forum.

At the bottom of this notebook you will find links to reference materials, along with links to activities and reading materials for when you need a break from work.


**The time limit to complete this objective is 4 days.**



## What is Text Preprocessing?

Text preprocessing (also known as wrangling or normalization) is a series of steps that clean and standardize text data so that it can be used by NLP, machine learning or deep learning algorithms. Popular text preprocessing techniques include removing accented characters, tokenization, case conversion, removing special characters, removing stopwords and spelling correction. Stemming and lemmatization are related techniques.

The scope of this notebook is to clean the review text data using the text preprocessing steps discussed below. The aim is to prepare a preprocessing pipeline for any form of user-generated text.

##  Basic Python Programming Guidelines:

- All code should be written within a virtual environment.
- Each preprocessing step should be defined as an individual function.
- Every function should contain a docstring (see the example below).
- Comment all the necessary lines of code to keep them review-ready.
- Names of variables and functions should clearly represent the entity they store and the value they calculate, respectively.
- Avoid nested loops whenever possible.
- All functions should be timed when run on the whole dataset. Measuring runtime is crucial for checking the time complexity of the code.
- You are free to use the internet to research related work, but don't copy and paste code from any other source. Learn from other people's work and write the code yourself.
- Code that does not follow the points above will not pass the review phase.
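
The timing guideline above can be met in several ways. The following is a minimal sketch of one option, a decorator that reports the runtime of each call. The names `timed` and `count_characters` are hypothetical and used only for illustration.

```python
import time
from functools import wraps


def timed(function):
    """Wrap a function so that each call also prints its runtime in seconds."""
    @wraps(function)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed_seconds = time.perf_counter() - start_time
        print(f"{function.__name__} took {elapsed_seconds:.4f} s")
        return result
    return wrapper


@timed
def count_characters(input_sentence):
    """Return the number of characters in the input sentence."""
    return len(input_sentence)


# the decorated function returns its normal result and prints the runtime
character_count = count_characters("I have to prepare a text preprocessing pipeline")
```

Keeping the timing logic in one decorator means the preprocessing functions themselves stay free of timing code.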

In [3]:
# an example on how to design your functions

def count_words_in_sentence(input_sentence):
    # this is a docstring
    """
    The function will calculate and return the number of
    words present in a given sentence.

    arguments:
        input_sentence: input sentence of type "string".

    return:
        word_count: variable to store the count of total
                    words in the input sentence

    Example:
    Input: "i am Karan"
    Output: 3
    """

    # strip white spaces at ends and split sentence into list of words
    # pay attention to variable names used
    words_in_sentence = input_sentence.strip().split()

    # count words
    word_count = len(words_in_sentence)

    # alternatively
    # word_count = len(input_sentence.strip().split())

    return word_count

# test the code with a sample example
sentence = "I have to prepare a text preprocessing pipeline"
print(f"The total number of words in input text: {count_words_in_sentence(sentence)}")

The total number of words in input text: 8


## Note:
All preprocessing steps are described using sample text. The descriptions are purely language and mathematics based; the candidate is responsible for implementing them in code.

In [65]:
# all the necessary installs
#!pip install pandas
#!pip install unidecode

In [81]:
# all the necessary imports 
import pandas as pd
import unidecode
import time

In [8]:
# load your data
filepath = "C:/Users/SVPNPA/npl/wet_food.csv" # define your file path here
wet_food_data = pd.read_csv(filepath) # read your data
wet_food_data.head()

Unnamed: 0.1,Unnamed: 0,brand_name,product_id,product_name,review_id,rating,review_title,review_author,review_date,review_text
0,0,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215226291,4,My finicky bunch seem to enjoy this one!,EliG,2019-12-10,My little ones are tough customers but Friskie...
1,1,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215207322,4,Cats Like it,Beth,2019-12-09,Cats like it and good variety for them. Love t...
2,2,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215186121,5,That's what's for dinner,Judy,2019-12-09,My cats absolutely love shreds they start bugg...
3,3,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215141820,5,Cat loves it,Lakeffect,2019-12-08,"Our indoor, lazy 16+ year cat has barfed every..."
4,4,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215136825,5,One of my kitties favorite,Ronnifox,2019-12-08,I have two kitties they are 18 years old my bo...


### Preliminary Analysis of the Data
This section explores the data and its features, such as data types, dimensions and null values.

In [15]:
# Checking for null values and datatypes of the data
wet_food_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133218 entries, 0 to 133217
Data columns (total 10 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Unnamed: 0     133218 non-null  int64 
 1   brand_name     133218 non-null  object
 2   product_id     133218 non-null  int64 
 3   product_name   133218 non-null  object
 4   review_id      133218 non-null  int64 
 5   rating         133218 non-null  int64 
 6   review_title   133218 non-null  object
 7   review_author  133214 non-null  object
 8   review_date    133218 non-null  object
 9   review_text    133218 non-null  object
dtypes: int64(4), object(6)
memory usage: 10.2+ MB


The output shows that every column is fully populated except `review_author`, which has 4 missing values (133214 non-null out of 133218 entries). It also lists the data type of each column.

In [16]:
# The dimensions of the dataset 
wet_food_data.shape

(133218, 10)

The data set has 133218 rows and 10 columns.

In [39]:
# Checking for duplicate rows in the dataset
wet_food_data.duplicated(subset='review_text').sum()

2559

There are 2559 duplicated review texts. By default, `duplicated()` treats the first occurrence of each review text as the original and flags only the later repeats, so this figure counts the repeats alone.

In [41]:
# View the duplicated rows
wet_food_data.loc[wet_food_data.duplicated(subset='review_text'),:]

Unnamed: 0.1,Unnamed: 0,brand_name,product_id,product_name,review_id,rating,review_title,review_author,review_date,review_text
2270,691,Fancy Feast,76047,Fancy Feast Classic Seafood Feast Variety Pack...,171126228,5,Our cats like it.,Christie,2017-02-06,Our cats have been eating this food for a long...
2597,147,Fancy Feast,75978,Fancy Feast Gravy Lovers Poultry & Beef Feast ...,211323965,5,Cats love it,CatFamilyof3,2019-07-18,My cats love this food. It was the first time ...
2691,241,Fancy Feast,75978,Fancy Feast Gravy Lovers Poultry & Beef Feast ...,207724352,5,Happy with product,Milly,2019-03-18,Happy with product but want to put it on 2 mon...
2743,293,Fancy Feast,75978,Fancy Feast Gravy Lovers Poultry & Beef Feast ...,206620713,5,Love this food,ariane,2019-01-31,We adopted our cat and this is what she was be...
2796,346,Fancy Feast,75978,Fancy Feast Gravy Lovers Poultry & Beef Feast ...,205343075,5,Purchased for charity,allaboutcats,2018-12-13,I purchased this product for my favorite rescu...
...,...,...,...,...,...,...,...,...,...,...
132505,32,Friskies,155207,Friskies Meaty Bits Ocean Fish in Sauce Canned...,208999575,5,Da Boyz Love It,cobbweb10,2019-05-04,All 3 of my hairballs (Joey + Chandler + Krame...
132510,37,Friskies,155207,Friskies Meaty Bits Ocean Fish in Sauce Canned...,208616312,5,Bought this for the feral colony,PetOwner,2019-04-22,There's a feral colony in my neighborhood that...
132975,427,Friskies,93793,Friskies Cat Concoctions Variety Pack Canned C...,178388914,5,Meow Wow Wow!!!,CrankyOldFart,2017-04-30,My Cats response was Meow Wow! Izzy is bouncin...
133077,16,Whiskas,89476,"Whiskas Purrfectly Fish Pouches with Tuna, Sar...",204072632,5,Tasty delight for older cats,KZKAT,2018-10-26,Our older cat was not enjoying dry food as muc...


In [45]:
# Dropping the duplicated review text and saving the result in the original dataframe
wet_food_data = wet_food_data.drop_duplicates(subset='review_text')

In [46]:
# The new dimensions of the dataset after dropping the duplicates
wet_food_data.shape

(130659, 10)

The new dataframe consists of 130659 rows and 10 columns after dropping 2559 duplicates. 

### Step1: Remove Accented Characters
Accented characters are letters with diacritical marks, such as *ř* or *é*. They are common in user-generated text, so we need to convert them to standard ASCII form.

**Example:**

*We aře ready to get a new pet --> We are ready to get a new pet*

*My cat lovés this food product --> My cat loves this food product*

**Libraries required**: unicodedata or unidecode
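
As an alternative to `unidecode`, the standard-library `unicodedata` module can be used. The sketch below decomposes each accented letter into a base letter plus combining marks and then drops everything that is not ASCII. The function name `remove_accents_stdlib` is hypothetical. Note the assumption built into this approach: characters with no ASCII decomposition (for example *ß* or emoji) are removed, not transliterated.

```python
import unicodedata


def remove_accents_stdlib(input_text):
    """Return the input text with accented letters reduced to plain ASCII."""
    # NFKD splits an accented letter into its base letter and combining marks
    decomposed_text = unicodedata.normalize("NFKD", input_text)
    # encoding to ASCII with errors ignored drops the combining marks
    return decomposed_text.encode("ascii", "ignore").decode("ascii")


cleaned_example = remove_accents_stdlib("My cat lovés this food product")
```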

In [85]:
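# Removal of accented characters from the dataset
def accented_characters_removal(wet_food_data):
    """
    Convert accented characters in the 'review_text' column to
    their closest ASCII form and time the conversion.

    arguments:
        wet_food_data: DataFrame with a 'review_text' column of strings.

    return:
        a tuple of the cleaned 'review_text' column and the
        execution time in seconds.
    """
    start = time.time()
    # Note: the column is overwritten in place. If the DataFrame came from
    # drop_duplicates(), pandas may raise a SettingWithCopyWarning here;
    # calling .copy() on that DataFrame beforehand avoids it.
    wet_food_data['review_text'] = wet_food_data['review_text'].apply(unidecode.unidecode)
    end = time.time()
    return wet_food_data['review_text'], end - start

# Calling the accented function on the dataset
accented_characters_removal(wet_food_data)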

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


(0         My little ones are tough customers but Friskie...
 1         Cats like it and good variety for them. Love t...
 2         My cats absolutely love shreds they start bugg...
 3         Our indoor, lazy 16+ year cat has barfed every...
 4         I have two kitties they are 18 years old my bo...
                                 ...                        
 133213    My 12 year old does not like this either. I ha...
 133214    Accidentally bought these instead of pate and ...
 133215    My two cats both won't eat this food.\r\nThey ...
 133216    My cats are a little picky.  They eat Nutro mi...
 133217    Gizmo loved Nutro and couldn't get enough of i...
 Name: review_text, Length: 130659, dtype: object,
 0.32889676094055176)

In [86]:
wet_food_data.head(5)

Unnamed: 0.1,Unnamed: 0,brand_name,product_id,product_name,review_id,rating,review_title,review_author,review_date,review_text
0,0,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215226291,4,My finicky bunch seem to enjoy this one!,EliG,2019-12-10,My little ones are tough customers but Friskie...
1,1,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215207322,4,Cats Like it,Beth,2019-12-09,Cats like it and good variety for them. Love t...
2,2,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215186121,5,That's what's for dinner,Judy,2019-12-09,My cats absolutely love shreds they start bugg...
3,3,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215141820,5,Cat loves it,Lakeffect,2019-12-08,"Our indoor, lazy 16+ year cat has barfed every..."
4,4,Friskies,76368,"Friskies Shreds Variety Pack Canned Cat Food, ...",215136825,5,One of my kitties favorite,Ronnifox,2019-12-08,I have two kitties they are 18 years old my bo...


### Step2: Case Conversion

Convert all the text to lower case.

**Example:**

*My Cat is my world --> my cat is my world*
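
A minimal sketch of this step for a single string, using the built-in `str.lower`. The function name `lowercase_text` is hypothetical; applying it to a whole DataFrame column is left to the candidate.

```python
def lowercase_text(input_text):
    """Return the input text converted to lower case."""
    return input_text.lower()


lowercased_example = lowercase_text("My Cat is my world")
```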


In [8]:
# complete the code for text lowercasing
def lower_casing_text():
    # write your code
    pass

### Step3: Reduce repeated characters and punctuations

In user-generated text, some characters are repeated incorrectly, either to express the writer's sentiment or as spelling errors. Removing these repetitions is essential to prepare the data for higher-level NLP tasks.

**Example:**

*I had greaaaattt experience with this product........ plus it is reaaaallllyyyy good for my dog*

Observe how the characters in *great*, *really* and *........* are repeated an incorrect number of times.

Reduce these repetitions to single characters. For words like "really", where one repetition of 'l' is correct, you can fix the result later during spell checking.

**Libraries required**: re or regex
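
One possible way to do this with `re` is sketched below. It relies on two assumptions that are not part of the task description: letters are only reduced when they repeat three or more times (so legitimate doubles such as "good" survive), while punctuation is reduced whenever it repeats at all. The function name is hypothetical.

```python
import re


def reduce_repeated_characters(input_text):
    """Collapse exaggerated runs of letters and punctuation to one character."""
    # a letter repeated 3 or more times is reduced to a single letter
    reduced_text = re.sub(r"([A-Za-z])\1{2,}", r"\1", input_text)
    # a punctuation mark repeated 2 or more times is reduced to a single mark
    reduced_text = re.sub(r"([^\w\s])\1+", r"\1", reduced_text)
    return reduced_text


sample_text = ("I had greaaaattt experience with this product........ "
               "plus it is reaaaallllyyyy good for my dog")
reduced_example = reduce_repeated_characters(sample_text)
```

With this rule "reaaaallllyyyy" becomes "realy", which is exactly the kind of leftover the spell-checking step is expected to fix.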

In [7]:
# complete the code for reducing repeated characters and punctuation
def reducing_incorrect_character_repetition():
    # write your code 
    pass

### Step4: Expand contraction words

In the English language there are shortened versions of words, used in both written and spoken form.

**For Example:** *won't, don't, isn't and more*

These shortened words are called contractions. Expanding contractions helps to standardize the word forms in the data. A dictionary that maps contractions to their expansions is required to implement this correctly. The tutorial folder comes with a dictionary file for contraction mapping, which you are free to use. Below is a small sample from that file:

**CONTRACTION_MAP =** {
"ain't": "is not",

"aren't": "are not",

"can't": "cannot",

"can't've": "cannot have",

"'cause": "because",
...........


**Libraries required:** re or regex, pycontractions, and spacy
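
The dictionary-based approach can be sketched with `re` alone. The map below contains only the five sample entries shown above, not the full tutorial file, and the names `CONTRACTION_SAMPLE` and `expand_contractions_sketch` are hypothetical. Longer contractions are tried first so that "can't've" is not mistaken for "can't".

```python
import re

# only the sample entries shown above; the real file has many more
CONTRACTION_SAMPLE = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
}


def expand_contractions_sketch(input_text, contraction_map=CONTRACTION_SAMPLE):
    """Replace every contraction found in the map with its expanded form."""
    # sort by length so that longer contractions win over their prefixes
    ordered_keys = sorted(contraction_map, key=len, reverse=True)
    alternatives = "|".join(re.escape(key) for key in ordered_keys)
    # the lookarounds stop matches that start or end inside another word
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)",
                         flags=re.IGNORECASE)
    # look up the lower-cased match, so expansions are always lower case
    return pattern.sub(lambda match: contraction_map[match.group(0).lower()],
                       input_text)


expanded_example = expand_contractions_sketch("My cats can't wait and aren't picky")
```

This sketch only recognizes the straight apostrophe `'`; text containing curly apostrophes would need to be normalized first.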

In [21]:
# complete the code for expanding contraction words
def expanding_contractions():
    # write your code
    pass

### Step5: Remove special characters

Characters like *#, %, &, @ and so on* are considered special characters. In general, anything in English text that is not a letter or a digit, including punctuation, can be treated as a special character. These characters do not always play a major role in the data. Sometimes people use them to make emoticons like :) or to decorate the text. That said, it is not wise to always remove every kind of special character from the text. Consider the example below,

*My dog weighs 3.5 pounds, he is not so healthy & is in need of a vet.*

If the aim is to remove all special characters and keep only letters and digits, then the processed sentence will look like this,

*My dog weighs 35 pounds he is not so healthy is in need of a vet*

Several things go wrong here: the value *3.5* is changed to *35*, the symbol *&* is omitted entirely, and removing *,* makes the text grammatically wrong. Therefore, the task here is not only to remove special characters from the data, but also to decide sensibly what to remove and what to keep.

Along with implementing this step, the candidate must provide the list of special characters to be removed, with a reason for each.


**Libraries needed:** re or regex
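
The sketch below shows one possible policy, not the required answer. It assumes that letters, digits, whitespace and the marks `. , ! ? '` are worth keeping, and that `&` should be spelled out as "and". Both the keep-list and the function name are illustrative assumptions, and the candidate still has to justify their own list.

```python
import re


def remove_special_characters_sketch(input_text):
    """Remove special characters while keeping a small set of useful ones."""
    # spell out the ampersand so its meaning is not lost
    cleaned_text = input_text.replace("&", " and ")
    # replace everything outside the keep-list with a space
    cleaned_text = re.sub(r"[^A-Za-z0-9\s.,!?']", " ", cleaned_text)
    # collapse the whitespace runs left behind by the replacements
    return re.sub(r"\s+", " ", cleaned_text).strip()


sample_sentence = "My dog weighs 3.5 pounds, he is not so healthy & is in need of a vet."
cleaned_sentence = remove_special_characters_sketch(sample_sentence)
```

Because the full stop is on the keep-list, the value *3.5* survives intact. The final whitespace collapse also turns line breaks such as `\r\n` into single spaces.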

In [1]:
# complete the code for removing special characters
def removing_special_characters():
    # write your code here
    pass

### Step6: Remove stopwords

Stopwords are the words that occur most frequently when you split a corpus into individual tokens and count them. Words like "a", "the" and "and" are stopwords. There is no exhaustive list of stopwords.

**Libraries Required:** nltk
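
The filtering logic itself is independent of where the stopword list comes from. The sketch below uses a tiny hand-written set purely for illustration; in practice the list would come from a library such as nltk. The names `ILLUSTRATIVE_STOPWORDS` and `remove_stopwords_sketch` are hypothetical.

```python
# a tiny hand-written set for illustration only, not a real stopword list
ILLUSTRATIVE_STOPWORDS = {"a", "an", "the", "and", "is", "in", "of", "to"}


def remove_stopwords_sketch(input_text, stopwords=ILLUSTRATIVE_STOPWORDS):
    """Return the input text without the words found in the stopword set."""
    # compare in lower case so that "The" and "the" are treated alike
    kept_words = [word for word in input_text.split()
                  if word.lower() not in stopwords]
    return " ".join(kept_words)


filtered_example = remove_stopwords_sketch("The cat is in need of a vet")
```

Ready-made lists often contain negations such as "not", so it is worth reviewing a list before using it on review text where negation carries sentiment.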

In [1]:
# complete the code for removing stopwords
def removing_stopwords():
    # write your code here
    pass

### Step7: Lemmatization

Lemmatization is the process of finding the root word (called the lemma) of each word in your data. It reduces plural forms and tense forms to their base form.

**Libraries required**: nltk
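
To illustrate the idea only, the sketch below maps inflected forms to lemmas through a small hand-written lookup table. This is a stand-in for a real lemmatizer such as the one nltk provides, which draws on a full lexical database. Every entry in `ILLUSTRATIVE_LEMMA_LOOKUP`, and the function name, is a hypothetical example.

```python
# hand-written entries for illustration only; a real lemmatizer uses a lexicon
ILLUSTRATIVE_LEMMA_LOOKUP = {
    "cats": "cat",
    "kitties": "kitty",
    "loves": "love",
    "loved": "love",
    "eating": "eat",
    "ate": "eat",
}


def lemmatize_with_lookup(input_text, lemma_lookup=ILLUSTRATIVE_LEMMA_LOOKUP):
    """Replace each known inflected word with its lemma; keep unknown words."""
    lemmas = [lemma_lookup.get(word.lower(), word)
              for word in input_text.split()]
    return " ".join(lemmas)


lemmatized_example = lemmatize_with_lookup("my kitties loved eating")
```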

In [2]:
# complete the code for lemmatization
def lemmatization():
    # write your code here
    pass

### Step8: Correct mis-spelled words in text

User-generated text often contains misspelled words. These can be typing errors or spelling mistakes in general. Correcting misspellings is crucial for text analysis: incorrect words can affect keyword extraction, named entity recognition and, most importantly, accurate information extraction from text.

**Resources provided:** A file containing list of correct words for correcting mis-spelled words.

**Libraries required:** re, nltk and textdistance
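
The core idea is to replace each unknown word with the most similar word from the list of correct words. The sketch below uses the standard-library `difflib` as a stand-in for `textdistance`, and a tiny hand-written vocabulary as a stand-in for the provided word file. The vocabulary, the `cutoff` value of 0.8 and the function names are all illustrative assumptions.

```python
import difflib

# stand-in for the provided file of correct words
ILLUSTRATIVE_VOCABULARY = ["really", "great", "healthy", "product",
                           "food", "cat", "my", "is"]


def correct_word_sketch(word, vocabulary=ILLUSTRATIVE_VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    closest_matches = difflib.get_close_matches(word, vocabulary,
                                                n=1, cutoff=cutoff)
    return closest_matches[0] if closest_matches else word


def correct_spelling_sketch(input_text):
    """Apply the word-level correction to every word in the text."""
    return " ".join(correct_word_sketch(word) for word in input_text.split())


corrected_example = correct_spelling_sketch("my cat is realy healthly")
```

Words with no sufficiently similar vocabulary entry are left unchanged, which avoids replacing names and rare words with unrelated ones.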

In [None]:
# complete the code for spelling corrections
def spelling_correction():
    # write your code here
    pass

### Step9: Putting it all in a single function

Write a main function that applies the above steps to the data in sequence.
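
One way to structure such a function is to treat each step as a callable and apply the callables in order, adding the optional ones only when requested. The sketch below shows that pattern with two always-on steps and one optional step. The helper names `strip_accents` and `apply_steps_in_order` are hypothetical, and the choice of which steps are optional is an assumption.

```python
import unicodedata


def strip_accents(input_text):
    """Reduce accented letters to plain ASCII letters."""
    decomposed_text = unicodedata.normalize("NFKD", input_text)
    return decomposed_text.encode("ascii", "ignore").decode("ascii")


def apply_steps_in_order(input_text, steps):
    """Pass the text through each preprocessing step, in the order given."""
    cleaned_text = input_text
    for step in steps:
        cleaned_text = step(cleaned_text)
    return cleaned_text


# steps that always run, followed by steps switched on by a flag
always_on_steps = [strip_accents, str.lower]
optional_steps = [str.strip]
pipeline_output = apply_steps_in_order("  My Cat LOVÉS this food  ",
                                       always_on_steps + optional_steps)
```

Keeping the steps in a list makes their order explicit and makes it easy to time each step separately.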

In [2]:
def preprocessing_text(input_text, special_characters=False,
                      stopwords=False, lemmatize=False,
                      spell_correct=False):
    """
    This function will preprocess input text and return
    the clean text.
    """

### References:

- https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html
- https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79
- https://www.dataquest.io/blog/pandas-big-data/
- https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
- https://www.typingclub.com/