<a href="https://colab.research.google.com/github/gokul8747/NLP/blob/main/DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data cleaning process for raw text data
Standard preprocessing techniques used for cleaning raw text data:
1. Tokenization
2. Removing punctuation and numeric characters
3. Lowercasing
4. Spelling correction
5. Stop word removal
6. Stemming
7. Lemmatization
8. POS tagging


In [26]:
import pandas as pd
import nltk
import re
import string
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
paragraph = '''Bhagat Singh was a young revolutionary freedom fighter who had sacrificed his life for making India a free nation.
This young freedom fighter was born on 28th September in the year 1907 in the Punjab region that lies in Pakistan at present.
Bhagat Singh was initially impressed by the Gandhian principles and thus supported the Swadeshi movement.
Later, he thought of dropping the idea of non-violence because his mind totally changed after the Chauri-Chaura incident during the Non-cooperation movement.
He thought that revolt is necessary and only the way to drive out the Britishers from the nation.
This is the point where the ideology of Mahatma Gandhi and Bhagat Singh changed.
He had been imprisoned several times but never slowed down his struggle for independence.
The movies made on Bhagat Singh reveal the life events of this great hero of the nation.
'''

## Tokenization   
Split the paragraph into sentences with `sent_tokenize`.

In [3]:
sent_tokens = sent_tokenize(paragraph)
sent_tokens

['Bhagat Singh was a young revolutionary freedom fighter who had sacrificed his life for making India a free nation.',
 'This young freedom fighter was born on 28th September in the year 1907 in the Punjab region that lies in Pakistan at present.',
 'Bhagat Singh was initially impressed by the Gandhian principles and thus supported the Swadeshi movement.',
 'Later, he thought of dropping the idea of non-violence because his mind totally changed after the Chauri-Chaura incident during the Non-cooperation movement.',
 'He thought that revolt is necessary and only the way to drive out the Britishers from the nation.',
 'This is the point where the ideology of Mahatma Gandhi and Bhagat Singh changed.',
 'He had been imprisoned several times but never slowed down his struggle for independence.',
 'The movies made on Bhagat Singh reveal the life events of this great hero of the nation.']

Converting the raw text data into a pandas DataFrame

In [4]:
df = pd.DataFrame({'text':sent_tokens})

In [5]:
df

| | text |
|---|---|
| 0 | Bhagat Singh was a young revolutionary freedom... |
| 1 | This young freedom fighter was born on 28th Se... |
| 2 | Bhagat Singh was initially impressed by the Ga... |
| 3 | Later, he thought of dropping the idea of non-... |
| 4 | He thought that revolt is necessary and only t... |
| 5 | This is the point where the ideology of Mahatm... |
| 6 | He had been imprisoned several times but never... |
| 7 | The movies made on Bhagat Singh reveal the lif... |


## Removing punctuation and numeric characters

In [6]:
# Strip ordinal suffixes (st, nd, rd, th) only when they follow a digit, e.g. "28th" -> "28"
df['alpha_only'] = df['text'].apply(lambda x: re.sub(r'(?<=\d)(?:st|nd|rd|th)\b', '', x))
df['alpha_only'].values

array(['Bhagat Singh was a young revolutionary freedom fighter who had sacrificed his life for making India a free nation.',
       'This young freedom fighter was born on 28 September in the year 1907 in the Punjab region that lies in Pakistan at present.',
       'Bhagat Singh was initially impressed by the Gandhian principles and thus supported the Swadeshi movement.',
       'Later, he thought of dropping the idea of non-violence because his mind totally changed after the Chauri-Chaura incident during the Non-cooperation movement.',
       'He thought that revolt is necessary and only the way to drive out the Britishers from the nation.',
       'This is the point where the ideology of Mahatma Gandhi and Bhagat Singh changed.',
       'He had been imprisoned several times but never slowed down his struggle for independence.',
       'The movies made on Bhagat Singh reveal the life events of this great hero of the nation.'],
      dtype=object)

In [7]:
df['alpha_only'] = df['alpha_only'].apply(lambda x: re.sub('[^a-zA-Z]'," ",x))

In [8]:
df[['text','alpha_only']]

| | text | alpha_only |
|---|---|---|
| 0 | Bhagat Singh was a young revolutionary freedom... | Bhagat Singh was a young revolutionary freedom... |
| 1 | This young freedom fighter was born on 28th Se... | This young freedom fighter was born on Sept... |
| 2 | Bhagat Singh was initially impressed by the Ga... | Bhagat Singh was initially impressed by the Ga... |
| 3 | Later, he thought of dropping the idea of non-... | Later he thought of dropping the idea of non ... |
| 4 | He thought that revolt is necessary and only t... | He thought that revolt is necessary and only t... |
| 5 | This is the point where the ideology of Mahatm... | This is the point where the ideology of Mahatm... |
| 6 | He had been imprisoned several times but never... | He had been imprisoned several times but never... |
| 7 | The movies made on Bhagat Singh reveal the lif... | The movies made on Bhagat Singh reveal the lif... |


Description: the `alpha_only` column holds the text with only alphabetic characters kept; every other character is replaced by a space.
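
Replacing every non-letter with a space can leave runs of consecutive spaces where digits and punctuation used to be. The sketch below shows one way to tidy that up in the same pass. The helper name `keep_letters` is hypothetical and not part of the notebook.

```python
import re

def keep_letters(text: str) -> str:
    """Replace non-letters with spaces, then collapse repeated whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Collapse the gaps left behind by removed digits and punctuation.
    return re.sub(r"\s+", " ", letters_only).strip()

sample = "This young freedom fighter was born on 28 September in the year 1907."
print(keep_letters(sample))
```

Applied to the DataFrame, this could be used as `df['text'].apply(keep_letters)` in place of the plain `re.sub` call.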

## Lowercasing

In [9]:
df['lowercase'] = df['alpha_only'].apply(lambda x: x.lower())

In [10]:
df[['alpha_only','lowercase']]

| | alpha_only | lowercase |
|---|---|---|
| 0 | Bhagat Singh was a young revolutionary freedom... | bhagat singh was a young revolutionary freedom... |
| 1 | This young freedom fighter was born on Sept... | this young freedom fighter was born on sept... |
| 2 | Bhagat Singh was initially impressed by the Ga... | bhagat singh was initially impressed by the ga... |
| 3 | Later he thought of dropping the idea of non ... | later he thought of dropping the idea of non ... |
| 4 | He thought that revolt is necessary and only t... | he thought that revolt is necessary and only t... |
| 5 | This is the point where the ideology of Mahatm... | this is the point where the ideology of mahatm... |
| 6 | He had been imprisoned several times but never... | he had been imprisoned several times but never... |
| 7 | The movies made on Bhagat Singh reveal the lif... | the movies made on bhagat singh reveal the lif... |


## Spelling correction

In [11]:
from textblob import TextBlob

In [12]:
df['spellcheck'] = df['lowercase'].apply(lambda x: str(TextBlob(x).correct()))

In [13]:
df[['lowercase','spellcheck']]

| | lowercase | spellcheck |
|---|---|---|
| 0 | bhagat singh was a young revolutionary freedom... | bhagat sigh was a young revolutionary freedom ... |
| 1 | this young freedom fighter was born on sept... | this young freedom fighter was born on sept... |
| 2 | bhagat singh was initially impressed by the ga... | bhagat sigh was initially impressed by the gan... |
| 3 | later he thought of dropping the idea of non ... | later he thought of dropping the idea of non ... |
| 4 | he thought that revolt is necessary and only t... | he thought that revolt is necessary and only t... |
| 5 | this is the point where the ideology of mahatm... | this is the point where the ideology of mahatm... |
| 6 | he had been imprisoned several times but never... | he had been imprisoned several times but never... |
| 7 | the movies made on bhagat singh reveal the lif... | the moves made on bhagat sigh reveal the life ... |
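
In the output above, the corrector rewrites "singh" as "sigh" and "movies" as "moves", so automatic correction can damage proper nouns and valid words. One possible safeguard, sketched below, is to skip correction for words on a protected list. The names `correct_with_protection`, `toy_fixes` and `toy_corrector` are hypothetical, and the toy corrector only stands in for a real one so that the example runs without extra dependencies.

```python
from typing import Callable, Iterable

def correct_with_protection(text: str,
                            corrector: Callable[[str], str],
                            protected: Iterable[str]) -> str:
    """Apply `corrector` to each word unless the word is in `protected`."""
    keep = {w.lower() for w in protected}
    return " ".join(w if w.lower() in keep else corrector(w) for w in text.split())

# Toy stand-in for a real spelling corrector (assumption, for illustration only).
toy_fixes = {"singh": "sigh", "movies": "moves", "freedum": "freedom"}

def toy_corrector(word: str) -> str:
    return toy_fixes.get(word, word)

result = correct_with_protection("bhagat singh freedum movies",
                                 toy_corrector,
                                 protected={"singh", "movies"})
print(result)  # bhagat singh freedom movies
```

With a real corrector, the protected set would typically come from domain knowledge, such as a list of names that appear in the corpus.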


## Stop word removal    
Stop words are words that occur very frequently (for example "the", "is", "was") but carry little meaning on their own, so they are usually removed before further analysis.

In [14]:
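from nltk.corpus import stopwords
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
df['stop_processed'] = df['spellcheck'].apply(lambda x: " ".join(i for i in x.split() if i not in stop_words))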

In [15]:
df[['lowercase','stop_processed']]

| | lowercase | stop_processed |
|---|---|---|
| 0 | bhagat singh was a young revolutionary freedom... | bhagat sigh young revolutionary freedom fighte... |
| 1 | this young freedom fighter was born on sept... | young freedom fighter born september year punj... |
| 2 | bhagat singh was initially impressed by the ga... | bhagat sigh initially impressed gandhian princ... |
| 3 | later he thought of dropping the idea of non ... | later thought dropping idea non violence mind ... |
| 4 | he thought that revolt is necessary and only t... | thought revolt necessary way drive britishers ... |
| 5 | this is the point where the ideology of mahatm... | point ideology mahatma gandhi bhagat sigh changed |
| 6 | he had been imprisoned several times but never... | imprisoned several times never slowed struggle... |
| 7 | the movies made on bhagat singh reveal the lif... | moves made bhagat sigh reveal life events grea... |
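
Generic stop-word lists often include negations such as "not", which can matter for tasks like sentiment analysis. A minimal sketch of customising the list before filtering is shown below. The tiny `base_stop_words` set is a hypothetical stand-in for a real list, and `remove_stop_words` is an illustrative helper, not part of the notebook.

```python
# Hypothetical, tiny stop-word list used only for this illustration.
base_stop_words = {"the", "is", "a", "this", "not", "of"}
keep_words = {"not"}  # negations we want to preserve
custom_stop_words = base_stop_words - keep_words

def remove_stop_words(text: str, stop_words: set) -> str:
    """Drop every whitespace-separated word found in `stop_words`."""
    return " ".join(w for w in text.split() if w not in stop_words)

sentence = "this is not the end of a struggle"
print(remove_stop_words(sentence, base_stop_words))    # end struggle
print(remove_stop_words(sentence, custom_stop_words))  # not end struggle
```

The same set difference could be applied to the set built from `stopwords.words('english')`.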


## Stemming

In [16]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
df['stemming'] = df['stop_processed'].apply(lambda x: " ".join(stemmer.stem(i) for i in x.split()))

In [17]:
df[['stop_processed','stemming']]

| | stop_processed | stemming |
|---|---|---|
| 0 | bhagat sigh young revolutionary freedom fighte... | bhagat sigh young revolutionari freedom fighte... |
| 1 | young freedom fighter born september year punj... | young freedom fighter born septemb year punjab... |
| 2 | bhagat sigh initially impressed gandhian princ... | bhagat sigh initi impress gandhian principl th... |
| 3 | later thought dropping idea non violence mind ... | later thought drop idea non violenc mind total... |
| 4 | thought revolt necessary way drive britishers ... | thought revolt necessari way drive british nation |
| 5 | point ideology mahatma gandhi bhagat sigh changed | point ideolog mahatma gandhi bhagat sigh chang |
| 6 | imprisoned several times never slowed struggle... | imprison sever time never slow struggl independ |
| 7 | moves made bhagat sigh reveal life events grea... | move made bhagat sigh reveal life event great ... |


## Lemmatization

In [23]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['lemmatization'] = df['stop_processed'].apply(lambda x: " ".join(lemmatizer.lemmatize(i) for i in x.split()))

In [24]:
df[['stop_processed','lemmatization']]

| | stop_processed | lemmatization |
|---|---|---|
| 0 | bhagat sigh young revolutionary freedom fighte... | bhagat sigh young revolutionary freedom fighte... |
| 1 | young freedom fighter born september year punj... | young freedom fighter born september year punj... |
| 2 | bhagat sigh initially impressed gandhian princ... | bhagat sigh initially impressed gandhian princ... |
| 3 | later thought dropping idea non violence mind ... | later thought dropping idea non violence mind ... |
| 4 | thought revolt necessary way drive britishers ... | thought revolt necessary way drive britisher n... |
| 5 | point ideology mahatma gandhi bhagat sigh changed | point ideology mahatma gandhi bhagat sigh changed |
| 6 | imprisoned several times never slowed struggle... | imprisoned several time never slowed struggle ... |
| 7 | moves made bhagat sigh reveal life events grea... | move made bhagat sigh reveal life event great ... |
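
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is likely why verb forms such as "dropping" and "impressed" are unchanged in the table above. A sketch of mapping Penn Treebank tags (as produced by `nltk.pos_tag`) to WordNet's single-letter POS codes follows. The helper `penn_to_wordnet` is hypothetical, and the commented line at the end shows assumed usage with the notebook's `lemmatizer`.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default

tagged = [("sacrificed", "VBD"), ("making", "VBG"), ("free", "JJ"), ("nation", "NN")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])

# Assumed usage with NLTK (needs the 'wordnet' data downloaded earlier):
# lemmatizer.lemmatize("making", pos=penn_to_wordnet("VBG"))
```

Using this mapping means tagging has to happen before lemmatization, which reverses the order used in this notebook.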


## POS tagging

In [27]:
df["pos"] = df['lemmatization'].apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))

In [29]:
df['pos'][0]

[('bhagat', 'NN'),
 ('sigh', 'JJ'),
 ('young', 'JJ'),
 ('revolutionary', 'NN'),
 ('freedom', 'NN'),
 ('fighter', 'NN'),
 ('sacrificed', 'VBD'),
 ('life', 'NN'),
 ('making', 'VBG'),
 ('india', 'JJ'),
 ('free', 'JJ'),
 ('nation', 'NN')]
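
Once tokens are tagged, a common next step is to select words by tag. The sketch below copies the tagged output above into a literal and keeps only the nouns. The variable names are illustrative. Note that "india" is tagged `JJ` here, possibly because the text was lowercased before tagging.

```python
# Tagged tokens copied from the output above.
tagged = [('bhagat', 'NN'), ('sigh', 'JJ'), ('young', 'JJ'), ('revolutionary', 'NN'),
          ('freedom', 'NN'), ('fighter', 'NN'), ('sacrificed', 'VBD'), ('life', 'NN'),
          ('making', 'VBG'), ('india', 'JJ'), ('free', 'JJ'), ('nation', 'NN')]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```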