Data sourced from Kaggle competition [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started/overview)

In [1]:
# import core libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import joblib

In [3]:
# pre-processing

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

import nltk
from textblob import TextBlob
from langdetect import detect

import contractions

In [4]:
# modelling

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

In [5]:
# metrics/evaluation

import scikitplot as skplt
from matplotlib.colors import ListedColormap
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [6]:
# loading the train and test sets

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [7]:
# combining the train and test sets for the purpose of EDA and Data Cleaning/Feature Engineering

df = pd.concat([train, test], ignore_index=True)

In [8]:
print(f"Training Dataframe Shape: {train.shape}")
print(f"Test Dataframe Shape: {test.shape}")
print(f"Combined Dataframe Shape: {df.shape}")

Training Dataframe Shape: (7613, 5)
Test Dataframe Shape: (3263, 4)
Combined Dataframe Shape: (10876, 5)
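
Because the test rows carry no `target`, the combined frame can be split back into train and test later by checking for missing labels. A minimal sketch on a toy frame (the `toy_*` names and all values are made up for illustration and are not taken from the competition data):

```python
import pandas as pd

# toy stand-ins for train.csv / test.csv (hypothetical values)
toy_train = pd.DataFrame({"id": [1, 4], "text": ["a", "b"], "target": [1, 0]})
toy_test = pd.DataFrame({"id": [0, 2, 3], "text": ["c", "d", "e"]})

toy_df = pd.concat([toy_train, toy_test], ignore_index=True)

# rows that came from the test set have NaN in 'target'
is_train = toy_df["target"].notnull()
train_again = toy_df[is_train].copy()
test_again = toy_df[~is_train].drop(columns="target")

print(train_again.shape, test_again.shape)
```

Note that `target` becomes a float column after the concat, which is why the labels show up as `1.0` and `0.0` in the combined frame.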


In [9]:
# example of tweets

df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1.0
1,4,,,Forest fire near La Ronge Sask. Canada,1.0
2,5,,,All residents asked to 'shelter in place' are ...,1.0
3,6,,,"13,000 people receive #wildfires evacuation or...",1.0
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10876 entries, 0 to 10875
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        10876 non-null  int64  
 1   keyword   10789 non-null  object 
 2   location  7238 non-null   object 
 3   text      10876 non-null  object 
 4   target    7613 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 425.0+ KB


### Dealing with null values

In [11]:
# null values in the combined dataset (the 3263 null targets are the unlabelled test rows)

df.isnull().sum()

id             0
keyword       87
location    3638
text           0
target      3263
dtype: int64

### Keyword

In [12]:
df.keyword.unique().shape

(222,)

In [13]:
# checking whether tweets with a null keyword lean towards one class; only 61 labelled rows are affected, so the nulls are not treated as informative

df[df.keyword.isnull()].target.value_counts()

1.0    42
0.0    19
Name: target, dtype: int64

In [14]:
# creating a new category for the null keyword and location values

df.fillna({'keyword': 'unknown', 'location': 'unknown'},inplace=True)

In [15]:
# cleaning the keyword column

df.replace({'keyword': '%20'}, {'keyword': '_'}, inplace=True, regex=True)

### Location

In [16]:
# given how messy the location column is, it's unlikely that we'll be able to clean it for modelling purposes

df.location.value_counts().head(20)

unknown            3638
USA                 141
New York            109
United States        65
London               58
Canada               42
Nigeria              40
Worldwide            35
India                35
Los Angeles, CA      34
UK                   33
Kenya                32
Washington, DC       31
Mumbai               28
United Kingdom       26
California           25
Australia            25
Los Angeles          24
Chicago, IL          23
San Francisco        23
Name: location, dtype: int64
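
Part of the mess is that the same place appears under several spellings (for example `USA` and `United States`, or `UK` and `United Kingdom` in the counts above). A rough sketch of how a hand-written alias table could merge such entries; the table and the `normalise_location` helper are hypothetical, and a real table would need far more entries:

```python
from collections import Counter

# hypothetical alias table; a real one would need many more entries
LOCATION_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
    "los angeles, ca": "Los Angeles",
    "los angeles": "Los Angeles",
}

def normalise_location(raw):
    """Map known spellings to one canonical name; leave everything else as-is."""
    key = raw.strip().lower()
    return LOCATION_ALIASES.get(key, raw.strip())

sample = ["USA", "United States", "UK", "United Kingdom",
          "Los Angeles, CA", "Los Angeles", "Mumbai"]
location_counts = Counter(normalise_location(loc) for loc in sample)
print(location_counts.most_common())
```

On the real column this could be applied with `df.location.apply(normalise_location)`, although free-text entries such as `Worldwide` would still need separate handling.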

# Text

### Language

In [17]:
# checking that all tweets are in English

# lang_series = df.text.apply(lambda x: detect(x))

In [18]:
# saving lang_series as a joblib file

# joblib.dump(lang_series, 'jlib_files/lang_series.jlib')

In [19]:
# loading lang_series jlib file

lang_series = joblib.load('jlib_files/lang_series.jlib')

In [20]:
df['language'] = lang_series

In [21]:
df[df.language != 'en'].sample(5)

Unnamed: 0,id,keyword,location,text,target,language
4492,6388,hurricane,"#1 Vacation Destination,HAWAII",HURRICANE GUILLERMO LIVE NOAA TRACKING / LOOPI...,1.0,vi
5401,7707,panicking,Oxford / bristol,Okay NOW I AM PANICKING,0.0,tl
5083,7250,nuclear_disaster,unknown,Nuclear deal disaster.\n\n#IranDeal #NoNuclear...,0.0,id
7637,84,ablaze,unknown,SETTING MYSELF ABLAZE http://t.co/6vMe7P5XhC,,de
6593,9441,survivors,unknown,Remembrance http://t.co/ii4EwE1QIr #Hiroshima...,1.0,pt


The language detector misclassifies a number of tweets: the rows sampled above are plainly English, yet they were labelled as Vietnamese, Tagalog, Indonesian, German and Portuguese. The sample suggests that the tweets flagged as non-English are English after all, so we don't need to handle other languages.

In [22]:
# dropping language column from dataset

# passing the axis positionally is no longer accepted by recent pandas versions
df.drop(columns='language', inplace=True)

### Using the tweet-preprocessor package

In [23]:
import preprocessor as p

### Cleaning tweets

In [24]:
# removing the tweet characteristics below from the tweets

p.set_options(p.OPT.URL, p.OPT.EMOJI,p.OPT.SMILEY, p.OPT.MENTION, p.OPT.RESERVED)

In [25]:
p.clean(df.text[31])

'Wholesale Markets ablaze'

In [26]:
df['text_clean'] = df.text.apply(lambda x: p.clean(x))

### Creating meta-data for tweet characteristics

In [27]:
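# assumption: tokenize() only replaces the items enabled via set_options(), so HASHTAG
# (left out of the cleaning options above) is enabled here to get $HASHTAG$ placeholders
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY, p.OPT.MENTION, p.OPT.RESERVED, p.OPT.HASHTAG)
tweet_tokenized = df.text.apply(p.tokenize)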

In [28]:
# adding one column per tweet meta-data feature, counting its placeholder token in each tokenized tweet

for feature in ['url', 'hashtag', 'smiley', 'mention']:
    token = "$" + feature.upper() + "$"
    df["tweet_" + feature] = tweet_tokenized.apply(lambda tweet: tweet.split().count(token))
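
The counting above relies on placeholder tokens such as `$URL$` standing alone after a whitespace split. A small sketch of the same count on one hand-written string (the example string and the `count_placeholders` helper are assumptions, not output of the package), using a regex search so that a placeholder followed by punctuation is still counted:

```python
import re

# hand-written example of what a tokenized tweet might look like (assumption)
tokenized = "Fire downtown $URL$ $HASHTAG$ $HASHTAG$, stay safe $MENTION$"

def count_placeholders(text, features=("url", "hashtag", "smiley", "mention")):
    """Return {feature: number of $FEATURE$ placeholders found in text}."""
    counts = {}
    for feature in features:
        token = "$" + feature.upper() + "$"
        # re.escape is needed because '$' is a regex metacharacter
        counts[feature] = len(re.findall(re.escape(token), text))
    return counts

print(count_placeholders(tokenized))
```

In this string the split-based comparison would miss the second hashtag, because `$HASHTAG$,` is not equal to `$HASHTAG$`; whether that matters depends on how often placeholders end up glued to punctuation in the real data.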

### Text meta-data: length of tweet, number of words and average word length

In [29]:
import string

In [30]:
df['tweet_characters'] = df.text_clean.apply(lambda x: len(x))

In [31]:
def word_counter(tweet):
    # count whitespace-separated words after stripping punctuation
    no_punct = ''.join([x for x in tweet if x not in string.punctuation])
    word_lst = no_punct.split()
    return len(word_lst)

In [32]:
df['tweet_words'] = df.text_clean.apply(word_counter)

In [33]:
# worked example: average word length of one tweet fragment, punctuation stripped first
words = ''.join([x for x in 'Our Deeds are the Reason of this #earthquake ' if x not in string.punctuation]).split()

In [34]:
sum(map(len, words))/len(words)

4.5

In [35]:
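def ave_word_length(tweet):
    no_punct = ''.join([x for x in tweet if x not in string.punctuation])
    word_lst = no_punct.split()
    # a tweet can be empty after cleaning (e.g. it only contained a URL); avoid dividing by zero
    if not word_lst:
        return 0.0
    return sum(map(len, word_lst))/len(word_lst)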

In [36]:
df['tweet_av_word_length'] = df.text_clean.apply(ave_word_length)

### Remove punctuation completely

In [37]:
# removing punctuation from tweets

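for punct in string.punctuation:
    # regex=False: characters such as '.', '(' or '*' must be matched literally, not as regex syntax
    df['text_clean'] = df.text_clean.str.replace(punct, '', regex=False)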
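
An alternative to looping over each punctuation character is a single pass with `str.translate`, which treats every character literally, so characters like `.` or `(` need no special handling. A minimal sketch (the `strip_punctuation` helper and `PUNCT_TABLE` name are made up for illustration):

```python
import string

# translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove all ASCII punctuation characters from text in one pass."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("13,000 people receive #wildfires evacuation orders!"))
```

On the dataframe, the same table could be passed to `df.text_clean.str.translate(PUNCT_TABLE)`. Note that only ASCII punctuation is covered; typographic quotes and similar characters would remain.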

### Removing digits

In [39]:
# each run of digits is replaced with the placeholder token 'xxxxnumber' rather than dropped
df['no_num'] = df.text_clean.replace(r'\d+', 'xxxxnumber', regex=True)
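
The cell above masks numbers instead of deleting them, so the model can still learn that a tweet mentioned a figure. A small sketch of the same substitution on plain strings (the `mask_numbers` helper is hypothetical; the placeholder matches the one used above):

```python
import re

NUMBER_TOKEN = "xxxxnumber"  # same placeholder as in the cell above

def mask_numbers(text):
    """Replace each run of consecutive digits with a single placeholder token."""
    return re.sub(r"\d+", NUMBER_TOKEN, text)

print(mask_numbers("13000 people evacuated in 2015"))
```

Because the pattern matches runs of digits anywhere, digits embedded in a word are masked too, which glues the placeholder onto the surrounding letters.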

### Expanding contractions

In [40]:
df['text_no_contr'] = df.no_num.apply(lambda x: ' '.join([contractions.fix(word) for word in x.split()]))
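
One thing to keep in mind about the order of these steps: the apostrophe is part of `string.punctuation`, so by the time contractions are expanded, "can't" has already become "cant". The sketch below shows the effect with a tiny hand-written table standing in for the `contractions` package (the table and both helpers are hypothetical; whether the real package recognises apostrophe-free forms would need checking):

```python
import string

# tiny illustrative table; NOT the contractions package
TOY_CONTRACTIONS = {"can't": "cannot", "it's": "it is", "don't": "do not"}

def expand(text):
    """Expand contractions found in the toy table, word by word."""
    return " ".join(TOY_CONTRACTIONS.get(word.lower(), word) for word in text.split())

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

tweet = "I can't believe it's burning"

expand_first = strip_punctuation(expand(tweet))
strip_first = expand(strip_punctuation(tweet))

print(expand_first)
print(strip_first)
```

With this toy table, expanding first yields full words, while stripping punctuation first leaves "cant" and "its" untouched.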

### Tokenizing Tweets

In [41]:
# word_tokenize needs the NLTK punkt tokenizer data to be downloaded beforehand
df['tokenized'] = df.text_no_contr.apply(nltk.word_tokenize)

### Change to lower-case

In [45]:
df['lower'] = df.tokenized.apply(lambda x: [word.lower() for word in x])

### Beginning the lemmatization process

In [47]:
df.lower.apply(nltk.tag.pos_tag)

0        [(our, PRP$), (deeds, NNS), (are, VBP), (the, ...
1        [(forest, JJS), (fire, NN), (near, IN), (la, J...
2        [(all, DT), (residents, NNS), (asked, VBD), (t...
3        [(xxxxnumber, JJ), (people, NNS), (receive, VB...
4        [(just, RB), (got, VBN), (sent, VBD), (this, D...
                               ...                        
10871    [(earthquake, NN), (safety, NN), (los, NN), (a...
10872    [(storm, NN), (in, IN), (ri, NN), (worse, JJR)...
10873    [(green, JJ), (line, NN), (derailment, NN), (i...
10874    [(meg, NN), (issues, NNS), (hazardous, JJ), (w...
10875    [(cityofcalgary, NN), (has, VBZ), (activated, ...
Name: lower, Length: 10876, dtype: object
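
The POS tags above are Penn Treebank tags, while WordNet-based lemmatizers expect one of four coarse categories. A possible next step, sketched under the assumption that NLTK's `WordNetLemmatizer` will be used (the notebook does not say which lemmatizer is planned; both helper names are made up):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun

def lemmatize_tagged(tagged_tokens):
    """Lemmatize a list of (word, tag) pairs; needs nltk and its 'wordnet' corpus."""
    # imported lazily so the tag mapping can be used without the corpus installed
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]

print([penn_to_wordnet(t) for t in ["NNS", "VBP", "JJ", "RB", "PRP$"]])
```

The tagged series could then be passed through `lemmatize_tagged` with another `.apply(...)` call, once the WordNet corpus has been downloaded.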

TBC: https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28

### To-do list:

- create broader categories for the keyword and, potentially, location columns
- use more visualizations through the data cleaning process (to start with: countvectorize before any data cleaning has started to show the words that appear the most frequently)
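
For the second item, a rough first look at the most frequent words does not strictly need a vectorizer. The sketch below uses `collections.Counter` as a simple stand-in for `CountVectorizer`; the `top_words` helper and the three example tweets are made up:

```python
from collections import Counter
import re

def top_words(texts, n=10):
    """Count lower-cased word tokens across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# made-up example tweets
sample_tweets = [
    "Forest fire near the lake",
    "The fire is spreading near the town",
    "Stay safe near the fire",
]
print(top_words(sample_tweets, n=3))
```

Even in this toy sample the most frequent token is a stop word, which is the kind of pattern a before/after comparison of the cleaning steps would make visible.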


#### Text Pre-processing

- ~~check the language that the tweet is written in~~
- ~~remove digits~~
- ~~expand contractions~~
- ~~convert to lowercase~~
- ~~remove punctuation~~ (maybe include meta-data for punctuation instead?)
- ~~tokenize words~~
- lemmatize words
- remove stop-words
- ~~hashtag extraction~~

- does the text contain emojis?
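
For the outstanding stop-word step, a minimal sketch on an already lower-cased token list. The stop list here is a small hand-written one for illustration only; `nltk.corpus.stopwords.words('english')` would be the fuller option once the corpus is downloaded:

```python
# small hand-written stop list for illustration; not the NLTK list
TOY_STOP_WORDS = {"the", "a", "an", "of", "are", "is", "this", "in", "to", "and"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Drop tokens that appear in stop_words; expects lower-cased tokens."""
    return [token for token in tokens if token not in stop_words]

tokens = ["our", "deeds", "are", "the", "reason", "of", "this", "earthquake"]
print(remove_stop_words(tokens))
```

Applied to the notebook's data, this would run on the lower-cased token lists, e.g. `df['lower'].apply(remove_stop_words)`.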

#### Feature Engineering

- meta-data
    - ~~how many hash-tags each tweet contains~~
    - ~~no. of emojis~~
    - ~~number of words~~
    - ~~number of characters~~
- sentiment analysis (textblob)
- ~~average word length~~
- use spacy to extract location from location variable

#### EDA

- word clouds for each target variable
- separate the below by each target variable
    - number of characters in each tweet
    - average word length in each sentence
    - most commonly appearing ngrams of various lengths
    - textblob for sentiment analysis
    - use part-of-speech tagging
    - frequency of most common words
    - number of words with a given number of appearances
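
The n-gram item in the list above can be prototyped with the standard library before reaching for a vectorizer. A sketch on made-up token lists (the `ngrams` and `top_ngrams` helpers are hypothetical names):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

def top_ngrams(token_lists, n=2, k=5):
    """Count n-grams over many token lists and return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# made-up token lists
sample = [
    ["forest", "fire", "near", "town"],
    ["forest", "fire", "spreading"],
]
print(top_ngrams(sample, n=2, k=1))
```

To compare the two classes, the same function could be called once per target value, for example on `df[df.target == 1]['lower']` and `df[df.target == 0]['lower']`.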
    
#### Other

- Research the use of LDA and NMF
    
    
Useful articles: 

https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28

https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e

https://medium.com/spatial-data-science/how-to-extract-locations-from-text-with-natural-language-processing-9b77035b3ea4

