
# Basic NLP Preprocessing

This notebook walks through basic text preprocessing steps commonly used in Natural Language Processing (NLP): removing undesired characters, removing stop words, and stemming.

In [2]:
# Add the '../utilities' folder to the module search path so its modules can be imported
import sys
sys.path.append('../utilities')

In [3]:
import pandas as pd
from nlp import TextPreparation

pd.set_option('display.max_colwidth', 500)

In [6]:
raw_dataset = pd.read_csv("./sample_data/sample_data_2020.csv", encoding="utf8")

# Name of the column that holds the free-text complaint narrative
text_field = "ConsumerComplaintNarrative"

# Add a row index
raw_dataset['Row'] = range(raw_dataset.shape[0])

# Copy the dataset
dataset = raw_dataset[["Row", text_field]].copy()

dataset.head()

Unnamed: 0,Row,ConsumerComplaintNarrative
0,0,I have several items on my credit report that are not mine. I noticed it because I recently applied for a credit card and I wasnt approved so it prompts me to check my credit report. These items listed are not mine and need to be removed. USDDEPTOFEDXXXX XXXX {$1400.00}
1,1,"I have called on several occasions to request my student loans be updated to reflect correct payments and Equifax continues to reflect inaccurate reporting this making them incorrect. The payment history is not reflected corrected and I am not sure why. All on-time payments have been removed, dropping my payment history from 97 % to currently 30 %. When I contacted the original creditor ( XXXX ), I was told that there was only 3 late payments reflected on my account and I haven't been late ..."
2,2,"The following accounts both opened and closed are not mines, my identity has been compromised and it has not only affected me emotionally but my credit file as well. I'm demanding the immediate removal of the following unauthorized and unknown accounts. 1. XXXX XXXX XXXX {$1700.00} 2. XXXX XXXX XXXX XXXX 3. XXXX XXXX XXXX 4 . XXXX XXXX 5. XXXX XXXX XXXX XXXXXXXX XXXX 6. XXXX XXXX XXXX 7. XXXX XXXX XXXX XXXX 8. XXXX XXXX XXXX"
3,3,"This is not a personal problem that affects just me. This is affecting every consumer in our country. You have to fight & make several phone calls to get to your free annual credit report or even if you're denied credit. All 3 of these agencies direct you totally to their "" pay for sites ''. This is a scam & is directed totally to people of low income or seniors. People with high incomes & other means of income do not rely on these so called credit companies. I also realize my complaint wil..."
4,4,"XXXX and TransUnion have been reporting these fraudulent accounts under my name XXXX XXXX is fraudulent on my credit Account # XXXX XXXX XXXX XXXX is fraudulent on my credit Account # XXXX XXXX XXXX XXXX is fraudulent on my credit Account # XXXX XXXX XXXX is fraudulent on my credit Account # XXXX XXXX XXXX XXXX XXXX is fraudulent on my credit Account # XXXX XXXX XXXX XXXX is fraudulent on my credit Account # XXXX I didnt benefit, apply for, and authorize these accounts."


## Preprocessing

### Removing undesired characters

This step standardizes the text for the language models commonly used in NLP: it lowercases the text, expands contractions, and removes special characters and numbers.

In [7]:
dataset[text_field] = TextPreparation.lowercase(dataset[text_field])
dataset[text_field] = TextPreparation.expand_contractions(dataset[text_field])
dataset[text_field] = TextPreparation.remove_special_chars(dataset[text_field])
dataset[text_field] = TextPreparation.remove_numbers(dataset[text_field])

dataset.head()

Unnamed: 0,Row,ConsumerComplaintNarrative
0,0,i have several items on my credit report that are not mine i noticed it because i recently applied for a credit card and i wasnt approved so it prompts me to check my credit report these items listed are not mine and need to be removed usddeptofed
1,1,i have called on several occasions to request my student loans be updated to reflect correct payments and equifax continues to reflect inaccurate reporting this making them incorrect the payment history is not reflected corrected and i am not sure why all ontime payments have been removed dropping my payment history from to currently when i contacted the original creditor i was told that there was only late payments reflected on my account and i have not been late in over a year however equi...
2,2,the following accounts both opened and closed are not mines my identity has been compromised and it has not only affected me emotionally but my credit file as well im demanding the immediate removal of the following unauthorized and unknown accounts
3,3,this is not a personal problem that affects just me this is affecting every consumer in our country you have to fight make several phone calls to get to your free annual credit report or even if you are denied credit all of these agencies direct you totally to their pay for sites this is a scam is directed totally to people of low income or seniors people with high incomes other means of income do not rely on these so called credit companies i also realize my complaint will be tossed never a...
4,4,and transunion have been reporting these fraudulent accounts under my name is fraudulent on my credit account is fraudulent on my credit account is fraudulent on my credit account is fraudulent on my credit account is fraudulent on my credit account is fraudulent on my credit account i didnt benefit apply for and authorize these accounts
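
The `TextPreparation` helpers come from the local `nlp` module, whose source is not shown in this notebook. As a rough illustration of what such a cleaning step might do, here is a minimal sketch that uses only the standard library. The function name `clean_text` and the tiny `CONTRACTIONS` map are assumptions made for this example, not the module's actual implementation.

```python
import re

# Hypothetical, deliberately tiny contraction map (assumption for illustration);
# a real map would cover many more forms.
CONTRACTIONS = {"haven't": "have not", "you're": "you are", "i'm": "i am"}


def clean_text(text: str) -> str:
    """Lowercase, expand contractions, then strip special characters and digits."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop punctuation and symbols
    text = re.sub(r"\d+", "", text)           # drop digits
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


print(clean_text("I haven't been late; 97 % of payments were on-time!"))
# i have not been late of payments were ontime
```

Contractions are expanded before punctuation is removed; otherwise the apostrophe would already be gone and "haven't" would end up as "havent". To apply a function like this to a pandas column, something like `dataset[text_field].apply(clean_text)` should work.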


### Removing stop words

Stop words are words that carry little meaning on their own, such as "the", "on", "a", and "an" [[1]](https://en.wikipedia.org/wiki/Stop_word). Removing these words reduces noise in the text.


In [11]:
dataset[text_field] = TextPreparation.remove_stopwords(dataset[text_field])

dataset.head()

Unnamed: 0,Row,ConsumerComplaintNarrative
0,0,sever item credit report mine notic recent appli credit card wasnt approv prompt check credit report item list mine need remov usddeptof
1,1,call sever occa request student loan updat reflect correct payment equifax continu reflect inaccur report make incorrect payment histori reflect correct sure ontim payment remov drop payment histori current contact origin creditor told late payment reflect account havent late year howev equifax continu reflect inform correct destroy charact limit abil obtain credit
2,2,follow account open close mine ident compromi affect emot credit file well im demand immedi remov follow unauthor unknown account
3,3,person problem affect affect everi consum countri fight make sever phone call get free annual credit report even deni credit agenc direct total pay site scam direct total peopl low incom senior peopl high incom mean incom reli call credit compani also realiz complaint toss never address die
4,4,transunion report fraudul account name fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account didnt benefit appli author account
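
To make the idea concrete, stop-word removal can be sketched as a simple token filter. The `STOP_WORDS` set and the `drop_stopwords` function below are hypothetical and kept small for illustration; the list that `TextPreparation.remove_stopwords` actually uses is not shown in this notebook.

```python
# Hypothetical mini stop-word list (assumption for illustration only).
STOP_WORDS = {"i", "my", "the", "on", "a", "an", "that", "are", "not", "have", "and", "to"}


def drop_stopwords(text: str) -> str:
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(drop_stopwords("i have several items on my credit report that are not mine"))
# several items credit report mine
```

Some stop-word lists include negations such as "not", as this sketch does. Dropping them can change the meaning of a sentence, so it may be worth reviewing the list against the downstream task.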


In [8]:
TextPreparation.get_proper_nouns(dataset[text_field])

['_syncb',
 '|date',
 'x',
 'uber',
 '____________',
 '~',
 'namber',
 'yo',
 'miss',
 'fargo',
 'umber',
 '_bank']

### Stem text
Stemming, also called suffix stripping, is a text normalization technique that reduces words to a common base form called a stem [[2]](https://towardsdatascience.com/stemming-corpus-with-nltk-7a6a6d02d3e5#:~:text=Stemming%2C%20also%20called%20suffix%20stripping,specific%20expressions%20also%20called%20stems.). Because different inflections of a word collapse into the same stem, stemming also reduces the dimensionality of the text.


In [10]:
dataset[text_field] = TextPreparation.stem_text(dataset[text_field])

dataset.head()

Unnamed: 0,Row,ConsumerComplaintNarrative
0,0,sever item credit report mine notic recent appli credit card wasnt approv prompt check credit report item list mine need remov usddeptof
1,1,call sever occa request student loan updat reflect correct payment equifax continu reflect inaccur report make incorrect payment histori reflect correct sure ontim payment remov drop payment histori current contact origin creditor told late payment reflect account havent late year howev equifax continu reflect inform correct destroy charact limit abil obtain credit
2,2,follow account open close mine ident compromi affect emot credit file well im demand immedi remov follow unauthor unknown account
3,3,person problem affect affect everi consum countri fight make sever phone call get free annual credit report even your deni credit agenc direct total pay site scam direct total peopl low incom senior peopl high incom mean incom reli call credit compani also realiz complaint toss never address die
4,4,transunion report fraudul account name fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account didnt benefit appli author account
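
The suffix-stripping idea can be illustrated with a deliberately naive stemmer. This is only a sketch: the `SUFFIXES` tuple and the functions `naive_stem` and `naive_stem_text` are assumptions made for this example, and established algorithms such as the Porter stemmer apply many more rules. Stems like `appli` and `histori` in the output above resemble what a Porter-style stemmer produces, but the implementation behind `TextPreparation.stem_text` is not shown here.

```python
# Hypothetical suffix list, checked in this order (assumption for illustration).
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_stem_text(text: str) -> str:
    """Stem every whitespace-separated token in the text."""
    return " ".join(naive_stem(word) for word in text.split())


print(naive_stem_text("reporting reported reports payments recently"))
# report report report payment recent
```

The three inflections of "report" collapse into a single stem, which is how stemming shrinks the vocabulary a model has to handle.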


### Save output

In [16]:
import os
os.makedirs("sample_output", exist_ok=True)

dataset.to_csv("sample_output/pre_processed_sentences.csv", index=False)