# NLP Exploratory Data Analysis (EDA)

This notebook explores the steps required for a Natural Language Processing (NLP) application. These include:
- Section 1 - Text Cleaning
    - Spelling Correction
- Section 2 - Preprocessing
    - Sentence Segmentation (not applicable in this case) & Tokenization
    - Stop Word Removal
    - Stemming & Lemmatization


**Insights:**
- The dataset is fairly balanced: about 57% of the tweets have target 0 and 43% have target 1.
- The location column requires extensive preprocessing. One approach is to drop values that occur fewer than a chosen number of times, since they can be treated as polluted records.

In [14]:
import os
import sys
import pandas as pd
from nltk.tokenize import TweetTokenizer

config = {}

# 0. Load dataset and basic counts

In [5]:
# Load dataset
input_path = "../input"
raw_tweets = pd.read_csv(os.path.join(input_path, "train.csv")).set_index(['id'])
raw_tweets.head()

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
# Class balance as percentages
raw_tweets.target.value_counts(normalize=True) * 100.0

0    57.034021
1    42.965979
Name: target, dtype: float64

In [12]:
raw_tweets.location.value_counts()

USA                    104
New York                71
United States           50
London                  45
Canada                  29
                      ... 
MontrÌ©al, QuÌ©bec       1
Montreal                 1
ÌÏT: 6.4682,3.18287      1
Live4Heed??              1
Lincoln                  1
Name: location, Length: 3341, dtype: int64
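
One way to act on the long tail above is to keep only locations that appear at least a chosen number of times and blank out the rest. The sketch below uses plain Python on a toy list so it runs without the dataset; the `min_count` threshold, the `filter_rare_locations` helper, and the sample values are illustrative assumptions, not part of the notebook.

```python
from collections import Counter


def filter_rare_locations(locations, min_count=2):
    """Replace locations seen fewer than min_count times with None.

    Hypothetical helper; min_count is an arbitrary threshold to tune.
    """
    # Count only real values; None stands in for a missing location (NaN)
    counts = Counter(loc for loc in locations if loc is not None)
    return [
        loc if loc is not None and counts[loc] >= min_count else None
        for loc in locations
    ]


# Toy sample, not taken from train.csv
sample = ["USA", "New York", "USA", None, "Live4Heed??", "New York", "Lincoln"]
cleaned = filter_rare_locations(sample, min_count=2)
print(cleaned)
```

On the DataFrame itself, the same idea could be expressed by mapping each location to its `value_counts()` entry and masking the rows that fall below the threshold. Note that a count filter does not merge near-duplicates such as `USA` and `United States`; those would need a separate normalization step.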

# 2. Text Preprocessing

Convert the text into a Bag-of-Words representation and preprocess it by removing stop words and applying stemming and lemmatization. Only tokenization is implemented so far; the remaining steps are pending.

In [16]:
# Use Twitter Tokenizer to tokenize tweets
tokenizer = TweetTokenizer()

# Tokenize the text column of each raw tweet
raw_tweets['twitterTokens'] = raw_tweets['text'].apply(tokenizer.tokenize)
raw_tweets.head()

id,keyword,location,text,target,twitterTokens
1,,,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart..."
4,,,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask, ., Canada]"
5,,,All residents asked to 'shelter in place' are ...,1,"[All, residents, asked, to, ', shelter, in, pl..."
6,,,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati..."
7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al..."
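
The output above shows why a tweet-aware tokenizer is used: hashtags such as `#wildfires` stay in one piece, numbers like `13,000` are not split at the comma, and punctuation becomes its own token. The regex below is a rough stand-in that reproduces this behaviour for simple cases. It is not how `TweetTokenizer` is implemented, and `simple_tweet_tokenize` is a hypothetical name.

```python
import re

# A hashtag, mention, or word (optionally followed by digit groups such as
# the ",000" in "13,000"), or else a single punctuation character.
TOKEN_RE = re.compile(r"[#@]?\w+(?:,\d+)*|[^\w\s]")


def simple_tweet_tokenize(text):
    """Split text into word-like tokens and standalone punctuation."""
    return TOKEN_RE.findall(text)


print(simple_tweet_tokenize("Forest fire near La Ronge Sask. Canada"))
print(simple_tweet_tokenize("13,000 people receive #wildfires"))
```

This sketch splits contractions at the apostrophe and has no special handling for URLs or emoticons, which is part of the reason to prefer a purpose-built tokenizer on real tweets.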


In [None]:
# Pending - Remove stop words
# Pending - Perform stemming
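
A minimal sketch of the pending stop word step, assuming the entries in `twitterTokens` are lists of plain strings. The stop word set here is a tiny hand-written sample for illustration only; in the notebook, the NLTK English stop word list would be a more natural choice. `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop word set (assumption); a real list is much longer
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["Our", "Deeds", "are", "the", "Reason", "of", "this", "#earthquake"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Applied to the DataFrame, this could look like `raw_tweets['twitterTokens'].apply(remove_stop_words)`. The comparison uses the lowercase form because tweets mix cases freely; stemming would then run on the filtered tokens.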