# NLP Notes:
Natural Language Processing is a branch of Artificial Intelligence that analyzes, processes, and efficiently retrieves information text data. By utilizing the power of NLP one can solve a huge range of real-world problems which include summarizing documents, title generator, caption generator, fraud detection, speech recognition, recommendation system, machine translation, etc.

## Import Common packages

In [1]:
import numpy as np
import pandas as pd
import re
import string
import math

## Import NLP related packages

In [2]:
#pip install contractions
import contractions
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

## Import Data and Drop duplicates

In [3]:
df = pd.read_csv('C:/Users/NLP/data/media_group.csv')
# Drop duplicates
df.drop_duplicates(inplace=True)

In [4]:
df.head(10)

Unnamed: 0,focus_group_subtype,focus_group_subtype_id,doc_no_within_subtype,question_id,question_text,parent_num,parent_answer
0,media_group,3,1,2,how did your child use technology before the p...,5,My son goes to our charter school. Before the ...
1,media_group,3,1,2,how did your child use technology before the p...,1,It was pretty minimal for school. It was mostl...
2,media_group,3,1,2,how did your child use technology before the p...,4,My child also had access to the computer befor...
3,media_group,3,1,2,how did your child use technology before the p...,3,My son before the pandemic was mostly iPad for...
4,media_group,3,1,2,how did your child use technology before the p...,2,"My son's older through all these kids, but he'..."
5,media_group,3,1,2,how did your child use technology before the p...,1,"I don't know. It's been ... I mean, because sh..."
6,media_group,3,1,2,how did your child use technology before the p...,6,Okay. I have an 11-year-old boy who is in sixt...
7,media_group,3,1,3,what do you anticipate as people return to in ...,3,"Okay. Real quick. Both my kids, my son is ADHD..."
8,media_group,3,1,3,what do you anticipate as people return to in ...,1,"The return to school, I mean, right now, she's..."
9,media_group,3,1,3,what do you anticipate as people return to in ...,4,I would say my son's school is not going back ...


# Preprocessing of text Data
1. Expand contraction
2. Case handling
3. Remove punctuations
4. Remove words and digits containing digits
5. Remove stop word
6. Lemmatization
7. Remove Extra Spaces 

#### 1. Expand contraction
Contraction is the shortened form of a word like don’t stands for do not, aren’t stands for are not. Like this, we need to expand this contraction in the text data for better analysis.

In [5]:
def expand_contraction(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text:contractions.fix(text))
        
    return df

#### 2. Case handling
If the text is in the same case, it is easy for a machine to interpret the words because the lower case and upper case are treated differently by the machine. for example, words like Ball and ball are treated differently by machine. So, we need to make the text in the same case and the most preferred case is a lower case to avoid such problems.

In [6]:
def case_handling(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].str.lower() 
        
    return df       

#### 3. Remove punctuations
One of the other text processing techniques is removing punctuations. there are total 32 main punctuations that need to be taken care of. we can directly use the string module with a regular expression to replace any punctuation in text with an empty string

In [7]:
def remove_punctuations(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text: re.sub('[%s]' % re.escape(string.punctuation), '' , text))
        
    return df   

#### 4. Remove words and digits containing digits
Sometimes it happens that words and digits combine are written in the text which creates a problem for machines to understand. hence, We need to remove the words and digits which are combined like game57 or game5ts7. This type of word is difficult to process so better to remove them or replace them with an empty string. we use regular expressions for this. 

In [8]:
def remove_words_dgits(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text: re.sub('W*dw*','',text))
        
    return df

#### 5. Remove stopword
Stopwords are the most commonly occurring words in a text which do not provide any valuable information. stopwords like they, there, this, where, etc are some of the stopwords. NLTK library is a common library that is used to remove stopwords and include approximately 180 stopwords which it removes. If we want to add any new word to a set of words then it is easy using the add method.

In [9]:
def remove_stopwords(df, columns=[]):
    
    stop_words = set(stopwords.words('english'))
    
    def remove_sw(text):
        txt_output = " ".join([word for word in str(text).split() if word not in stop_words])
        return txt_output
    
    for col in columns:
        df[col] = df[col].apply(lambda text: remove_sw(text))
    
    return df

#### 6. Lemmatization
Lemmatization is similar to stemming, used to stem the words into root word but differs in working. Actually, Lemmatization is a systematic way to reduce the words into their lemma by matching them with a language dictionary.

In [10]:
def lemmatize_words(df, columns=[]):
    
    lemmatizer = WordNetLemmatizer()
    
    def lemmatize(text):
        text_output = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
        return text_output
    
    for col in columns:
        df[col] = df[col].apply(lambda text: lemmatize(text))
        
    return df

#### 7. Remove Extra Spaces 
Most of the time text data contain extra spaces or while performing the above preprocessing techniques more than one space is left between the text so we need to control this problem. regular expression library performs well to solve this problem

In [11]:
def remove_extra_spaces(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text: re.sub(' +', ' ', text))
        
    return df 

### Data preprocessing entry point

In [12]:
def data_preprocessing(df, columns=[]):
    
    df = expand_contraction(df,columns)
    df = case_handling(df,columns) 
    df = remove_punctuations(df,columns=['parent_answer', 'question_text'])
    df = remove_words_dgits(df,columns=['parent_answer', 'question_text'])  
    df = remove_stopwords(df,columns=['parent_answer', 'question_text']) 
    df = lemmatize_words(df, columns=['parent_answer', 'question_text'])
    df = remove_extra_spaces(df,columns=['parent_answer', 'question_text']) 
    
    return df

In [13]:
columns=['parent_answer', 'question_text']
output_df =  data_preprocessing(df, columns)

In [14]:
output_df.head(10)

Unnamed: 0,focus_group_subtype,focus_group_subtype_id,doc_no_within_subtype,question_id,question_text,parent_num,parent_answer
0,media_group,3,1,2,chil use technology panemic eucational purpose,5,son go charter school panemic using computer w...
1,media_group,3,1,2,chil use technology panemic eucational purpose,1,pretty minimal school mostly mean use chromebo...
2,media_group,3,1,2,chil use technology panemic eucational purpose,4,chil also ha access computer panemic school al...
3,media_group,3,1,2,chil use technology panemic eucational purpose,3,son panemic mostly ipa ha accommoation woul us...
4,media_group,3,1,2,chil use technology panemic eucational purpose,2,son oler ki still trying get iploma high schoo...
5,media_group,3,1,2,chil use technology panemic eucational purpose,1,know mean remote certain time mean horribleit ...
6,media_group,3,1,2,chil use technology panemic eucational purpose,6,okay 11yearol boy sixth grae sevenyearol girl ...
7,media_group,3,1,3,anticipate people return person concern anythi...,3,okay real quick ki son ah aughter ah learning ...
8,media_group,3,1,3,anticipate people return person concern anythi...,1,return school mean right three ays one week tw...
9,media_group,3,1,3,anticipate people return person concern anythi...,4,woul say son school going back session going r...
