# Disaster Tweet Classification wi
## NLP with Disaster Tweets
------------------
### GOAL
- Predicting whether a given tweet is about a real disaster or not
- if so, predict a `1`. if not, predict a `0`


### Reference
- [competition main page](https://www.kaggle.com/c/nlp-getting-started)
- [word2vec code](https://www.kaggle.com/slatawa/simple-implementation-of-word2vec)
- [example code](https://www.kaggle.com/datarohitingole/disaster-tweet-classification-ridgeclassifiercv)
- [comparing the performance of different Machine Learning Algorithm](https://dibyendudeb.com/comparing-machine-learning-algorithms/)

# 0. Importing Libraries

In [1]:
# for loading and preprocessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

# for training the model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree, linear_model, neighbors, naive_bayes, ensemble
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# for evaluating classification model
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# for data cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')

# for word2vec
import gensim
from gensim.models import Word2Vec

# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2. Data Preprocessing
## Contents
#### 1. Clean the data
- dealing with missing values
- replace some commonly occuring shorthands
- remove any characters other then alphabets
- convert all dicitonary to lower case(for consistency)
- lemmatize

#### 2. word toeknization

#### 3. Word2Vector





## Data Description
--------------
### Files
- `train.csv` : the training set
- `test.csv` : the test set
- `sample_submission.csv` : a sample submission file in the correct format

### Columns
- `id` : a unique identifier for each tweet
- `text` : the text of the tweet
- `location` : the location the tweet was sent from 
- `keyword` : a particular keyword from th tweet
- `target` : in train.csv only, this denotes whether a tweet is about a real disaster(1) or not(0)

In [2]:
# loading the data set
data_path = './data/'
train = pd.read_csv(data_path + 'train.csv')
#test = pd.read_csv(data_path + 'test.csv')

#all_data = [train,test]

In [3]:
# split the data <train : test = 8 : 2>
train, test = train_test_split(train, test_size = 0.20, random_state = 0)

In [4]:
print('train_shape:', train.shape)
print('test_shape;', test.shape)

train_shape: (6090, 5)
test_shape; (1523, 5)


In [5]:
train.head()

Unnamed: 0,id,keyword,location,text,target
1386,1999,bush%20fires,,Ted Cruz fires back at Jeb &amp; Bush: ÛÏWe l...,0
4048,5751,forest%20fires,,This is the first year the Forest Service spen...,1
3086,4428,electrocute,,@lightseraphs pissed at you and could have the...,0
272,396,apocalypse,ColoRADo,I'm gonna fight Taylor as soon as I get there.,0
7462,10678,wounds,"Tampa, FL",@NicolaClements4 IÛªm not sure that covering ...,0


In [6]:
test.head()

Unnamed: 0,id,keyword,location,text,target
311,454,armageddon,Wrigley Field,@KatieKatCubs you already know how this shit g...,0
4970,7086,meltdown,Two Up Two Down,@LeMaireLee @danharmon People Near Meltdown Co...,0
527,762,avalanche,Score Team Goals Buying @,1-6 TIX Calgary Flames vs COL Avalanche Presea...,0
6362,9094,suicide%20bomb,Roadside,If you ever think you running out of choices i...,0
800,1160,blight,Laventillemoorings,If you dotish to blight your car go right ahea...,0


## 2-1. Clean the data
- Dealing with Missing Values

In [7]:
all_data = [train,test]
for data in all_data:
    data.drop(["location", "id"], axis = 1, inplace = True)

In [8]:
# data prepocessing with regrex

def remove_URL(text): # remove url pattern in text
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text): # remove html pattern in text
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub(r'', text)
    #return re.sub(html, '', text)

def remove_punct(text): # remove punctuation in text: (;, ', ", :, ., , etc..)
  table = str.maketrans('', '', string.punctuation)
  return text.translate(table)

In [9]:
for data in all_data:
  data['text'] = data['text'].apply(lambda x: remove_URL(x))
  data['text'] = data['text'].apply(lambda x: remove_html(x))
  data['text'] = data['text'].apply(lambda x: remove_punct(x))

- Replace some commonly occuring shorthands

In [10]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"i'll", "i will", text)
    text = re.sub(r"she'll", "she will", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"here's", "here is", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"   ", " ", text) # Remove any extra spaces
    return text

In [11]:
train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)

- remove any characters other then alphabets
- convert all dicitonary to lower case(for consistency)
- lemmatize

In [12]:
def massage_text(text):
    ## remove anything other then characters and put everything in lowercase
    tweet = re.sub("[^a-zA-Z]", ' ', text)
    tweet = tweet.lower()
    tweet = tweet.split()

    lem = WordNetLemmatizer()
    tweet = [lem.lemmatize(word) for word in tweet
             if word not in set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    return tweet
    print('--here goes nothing')
    print(text)
    print(tweet)

train['clean_text'] = train['text'].apply(massage_text)
test['clean_text'] = test['text'].apply(massage_text)

In [13]:
train.iloc[0:10][['text','clean_text']]

Unnamed: 0,text,clean_text
1386,Ted Cruz fires back at Jeb Bush ÛÏWe lose be...,ted cruz fire back jeb bush lose republican li...
4048,This is the first year the Forest Service spen...,first year forest service spent half annual bu...
3086,lightseraphs pissed at you and could have thei...,lightseraphs pissed could pikachu electrocute
272,Im gonna fight Taylor as soon as I get there,im gonna fight taylor soon get
7462,NicolaClements4 IÛªm not sure that covering m...,nicolaclements sure covering head wound scab s...
4778,kabwandi Breaking news Unconfirmed I just hear...,kabwandi breaking news unconfirmed heard loud ...
260,The annihilation of Jeb Christie Kasich is le...,annihilation jeb christie kasich le hour away ...
2921,Wtf this mom just drowned her child,wtf mom drowned child
2162,MayorofLondon pls reduce cyclist deaths with a...,mayoroflondon pls reduce cyclist death compuls...
4818,DoctorDryadma mass murder here we come,doctordryadma mass murder come


## 2-2. Word Tokenization
- tokenize the clean text column

In [14]:
train['tokens']=train['clean_text'].apply(lambda x: word_tokenize(x))
test['tokens'] = test['clean_text'].apply(lambda x: word_tokenize(x))

In [15]:
train.head(5)

Unnamed: 0,keyword,text,target,clean_text,tokens
1386,bush%20fires,Ted Cruz fires back at Jeb Bush ÛÏWe lose be...,0,ted cruz fire back jeb bush lose republican li...,"[ted, cruz, fire, back, jeb, bush, lose, repub..."
4048,forest%20fires,This is the first year the Forest Service spen...,1,first year forest service spent half annual bu...,"[first, year, forest, service, spent, half, an..."
3086,electrocute,lightseraphs pissed at you and could have thei...,0,lightseraphs pissed could pikachu electrocute,"[lightseraphs, pissed, could, pikachu, electro..."
272,apocalypse,Im gonna fight Taylor as soon as I get there,0,im gonna fight taylor soon get,"[im, gon, na, fight, taylor, soon, get]"
7462,wounds,NicolaClements4 IÛªm not sure that covering m...,0,nicolaclements sure covering head wound scab s...,"[nicolaclements, sure, covering, head, wound, ..."


## 2-3. Word2Vec
- convert our data(words) into vectors

In [16]:
#first, create a list corpus which we would be using to train word2vec mappings
def fn_pre_process_data(doc):
    for rec in doc:
        yield gensim.utils.simple_preprocess(rec)

corpus = list(fn_pre_process_data(train['clean_text']))
corpus += list(fn_pre_process_data(test['clean_text']))

In [17]:
#inititate the embedding model, we will come back to the passed arguments later
print('initiated ...')
wv_model = Word2Vec(corpus,vector_size=150,window=3,min_count=2)
#wv_model.build_vocab(corpus)
wv_model.train(corpus,total_examples=len(corpus),epochs=10)
#wv_model.save(data_path + 'word2vec.model')

initiated ...


(581368, 681810)

In [18]:
# convert the train and text tokens
def get_word_embeddings(token_list,vector,k=150):
    if len(token_list) < 1:
        return np.zeros(k)
    else:
        vector = vector.wv['word']
        vectorized = [vector if word in vector else np.random.rand(k) for word in token_list] 
    
    sum = np.sum(vectorized,axis=0)
    ## return the average
    return sum/len(vectorized)       

def get_embeddings(tokens,vector):
        embeddings = tokens.apply(lambda x: get_word_embeddings(x, wv_model))
        return list(embeddings)

In [19]:
train_embeddings = get_embeddings(train['tokens'],wv_model)
test_embeddings = get_embeddings(test['tokens'],wv_model)

  vectorized = [vector if word in vector else np.random.rand(k) for word in token_list]


# 3. Model
## Contents
- train the model
    - RidgeClassifierCV
    - sgd classifier
    - BernoulliNB 
    - RandomForest

## Model Description
--------------
### Ensemble
- Combine the predictions of several base estimators built with a given learning algorithm 
    - in order to improve generalizability / robustness over a single estimator.
- Boosting of Ensemble types

### Performance - f1-score


## 3-2. Train the model

In [20]:
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    
    #GLM
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Navies Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),

    XGBClassifier(),
    CatBoostClassifier()  
    ]

# Comapring all MLA
- precision
- recall
- accuracy
- f1-score

In [21]:
train_embeddings = np.array(train_embeddings)
test_embeddings = np.array(test_embeddings)

In [22]:
# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

row_index = 0
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

for alg in MLA:
    predicted = alg.fit(train_embeddings, train['target']).predict(test_embeddings)
    fp, tp, th = roc_curve(test['target'], predicted)

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(train_embeddings, train['target']), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(test_embeddings, test['target']), 4)
    
    
    recall = recall_score(test['target'], predicted)
    precision = precision_score(test['target'], predicted)
    MLA_compare.loc[row_index, 'Precission'] = precision
    MLA_compare.loc[row_index, 'Recall'] = recall
    MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall),4)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1

  _warn_prf(average, modifier, msg_start, len(result))
  MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall),4)


Learning rate set to 0.022283
0:	learn: 0.6922278	total: 71.6ms	remaining: 1m 11s
1:	learn: 0.6913586	total: 84.8ms	remaining: 42.3s
2:	learn: 0.6905717	total: 98.3ms	remaining: 32.7s
3:	learn: 0.6897308	total: 112ms	remaining: 27.9s
4:	learn: 0.6889525	total: 125ms	remaining: 24.9s
5:	learn: 0.6883684	total: 139ms	remaining: 23s
6:	learn: 0.6875978	total: 153ms	remaining: 21.6s
7:	learn: 0.6869746	total: 166ms	remaining: 20.6s
8:	learn: 0.6863366	total: 180ms	remaining: 19.9s
9:	learn: 0.6857262	total: 194ms	remaining: 19.2s
10:	learn: 0.6851798	total: 208ms	remaining: 18.7s
11:	learn: 0.6845785	total: 225ms	remaining: 18.5s
12:	learn: 0.6839261	total: 240ms	remaining: 18.2s
13:	learn: 0.6832911	total: 254ms	remaining: 17.9s
14:	learn: 0.6826651	total: 268ms	remaining: 17.6s
15:	learn: 0.6821761	total: 282ms	remaining: 17.4s
16:	learn: 0.6815828	total: 296ms	remaining: 17.1s
17:	learn: 0.6808831	total: 310ms	remaining: 16.9s
18:	learn: 0.6803183	total: 324ms	remaining: 16.7s
19:	learn

166:	learn: 0.6261206	total: 2.46s	remaining: 12.3s
167:	learn: 0.6258768	total: 2.48s	remaining: 12.3s
168:	learn: 0.6256494	total: 2.49s	remaining: 12.3s
169:	learn: 0.6253990	total: 2.51s	remaining: 12.3s
170:	learn: 0.6251013	total: 2.53s	remaining: 12.3s
171:	learn: 0.6248667	total: 2.54s	remaining: 12.2s
172:	learn: 0.6245695	total: 2.56s	remaining: 12.2s
173:	learn: 0.6242045	total: 2.58s	remaining: 12.2s
174:	learn: 0.6238761	total: 2.59s	remaining: 12.2s
175:	learn: 0.6236153	total: 2.6s	remaining: 12.2s
176:	learn: 0.6233463	total: 2.62s	remaining: 12.2s
177:	learn: 0.6229693	total: 2.63s	remaining: 12.1s
178:	learn: 0.6227366	total: 2.64s	remaining: 12.1s
179:	learn: 0.6224338	total: 2.65s	remaining: 12.1s
180:	learn: 0.6220884	total: 2.67s	remaining: 12.1s
181:	learn: 0.6218125	total: 2.68s	remaining: 12.1s
182:	learn: 0.6215819	total: 2.7s	remaining: 12s
183:	learn: 0.6212509	total: 2.71s	remaining: 12s
184:	learn: 0.6209194	total: 2.73s	remaining: 12s
185:	learn: 0.620626

335:	learn: 0.5764525	total: 4.94s	remaining: 9.77s
336:	learn: 0.5761701	total: 4.96s	remaining: 9.76s
337:	learn: 0.5758716	total: 4.97s	remaining: 9.74s
338:	learn: 0.5756308	total: 4.99s	remaining: 9.72s
339:	learn: 0.5753483	total: 5s	remaining: 9.71s
340:	learn: 0.5750510	total: 5.01s	remaining: 9.69s
341:	learn: 0.5747971	total: 5.03s	remaining: 9.67s
342:	learn: 0.5745339	total: 5.04s	remaining: 9.66s
343:	learn: 0.5742412	total: 5.05s	remaining: 9.64s
344:	learn: 0.5739380	total: 5.07s	remaining: 9.62s
345:	learn: 0.5737468	total: 5.08s	remaining: 9.6s
346:	learn: 0.5734608	total: 5.09s	remaining: 9.59s
347:	learn: 0.5731469	total: 5.11s	remaining: 9.57s
348:	learn: 0.5728636	total: 5.12s	remaining: 9.55s
349:	learn: 0.5726211	total: 5.14s	remaining: 9.54s
350:	learn: 0.5723215	total: 5.15s	remaining: 9.52s
351:	learn: 0.5720721	total: 5.17s	remaining: 9.51s
352:	learn: 0.5717763	total: 5.18s	remaining: 9.49s
353:	learn: 0.5714858	total: 5.19s	remaining: 9.48s
354:	learn: 0.57

506:	learn: 0.5218389	total: 7.41s	remaining: 7.21s
507:	learn: 0.5214925	total: 7.43s	remaining: 7.19s
508:	learn: 0.5212323	total: 7.44s	remaining: 7.18s
509:	learn: 0.5209608	total: 7.46s	remaining: 7.16s
510:	learn: 0.5205734	total: 7.47s	remaining: 7.15s
511:	learn: 0.5202256	total: 7.49s	remaining: 7.14s
512:	learn: 0.5198006	total: 7.5s	remaining: 7.12s
513:	learn: 0.5195715	total: 7.51s	remaining: 7.11s
514:	learn: 0.5192365	total: 7.53s	remaining: 7.09s
515:	learn: 0.5189424	total: 7.54s	remaining: 7.07s
516:	learn: 0.5184864	total: 7.55s	remaining: 7.06s
517:	learn: 0.5181176	total: 7.57s	remaining: 7.04s
518:	learn: 0.5178382	total: 7.58s	remaining: 7.03s
519:	learn: 0.5174365	total: 7.6s	remaining: 7.01s
520:	learn: 0.5170525	total: 7.61s	remaining: 7s
521:	learn: 0.5168440	total: 7.62s	remaining: 6.98s
522:	learn: 0.5165227	total: 7.64s	remaining: 6.97s
523:	learn: 0.5163448	total: 7.65s	remaining: 6.95s
524:	learn: 0.5160007	total: 7.66s	remaining: 6.93s
525:	learn: 0.515

673:	learn: 0.4688644	total: 9.86s	remaining: 4.77s
674:	learn: 0.4684176	total: 9.87s	remaining: 4.75s
675:	learn: 0.4681203	total: 9.89s	remaining: 4.74s
676:	learn: 0.4677833	total: 9.9s	remaining: 4.72s
677:	learn: 0.4675114	total: 9.92s	remaining: 4.71s
678:	learn: 0.4672753	total: 9.93s	remaining: 4.7s
679:	learn: 0.4669324	total: 9.95s	remaining: 4.68s
680:	learn: 0.4665222	total: 9.96s	remaining: 4.67s
681:	learn: 0.4661768	total: 9.98s	remaining: 4.65s
682:	learn: 0.4657911	total: 9.99s	remaining: 4.64s
683:	learn: 0.4655015	total: 10s	remaining: 4.62s
684:	learn: 0.4652088	total: 10s	remaining: 4.61s
685:	learn: 0.4649126	total: 10s	remaining: 4.6s
686:	learn: 0.4646690	total: 10.1s	remaining: 4.59s
687:	learn: 0.4643467	total: 10.1s	remaining: 4.57s
688:	learn: 0.4640566	total: 10.1s	remaining: 4.56s
689:	learn: 0.4637196	total: 10.1s	remaining: 4.55s
690:	learn: 0.4634269	total: 10.2s	remaining: 4.54s
691:	learn: 0.4632792	total: 10.2s	remaining: 4.53s
692:	learn: 0.4629811

835:	learn: 0.4224019	total: 12.3s	remaining: 2.42s
836:	learn: 0.4220811	total: 12.3s	remaining: 2.4s
837:	learn: 0.4218006	total: 12.4s	remaining: 2.39s
838:	learn: 0.4215540	total: 12.4s	remaining: 2.38s
839:	learn: 0.4213085	total: 12.4s	remaining: 2.36s
840:	learn: 0.4211241	total: 12.4s	remaining: 2.35s
841:	learn: 0.4208252	total: 12.4s	remaining: 2.33s
842:	learn: 0.4205644	total: 12.4s	remaining: 2.32s
843:	learn: 0.4202942	total: 12.5s	remaining: 2.3s
844:	learn: 0.4200728	total: 12.5s	remaining: 2.29s
845:	learn: 0.4197815	total: 12.5s	remaining: 2.27s
846:	learn: 0.4194827	total: 12.5s	remaining: 2.26s
847:	learn: 0.4192429	total: 12.6s	remaining: 2.25s
848:	learn: 0.4189220	total: 12.6s	remaining: 2.24s
849:	learn: 0.4186566	total: 12.6s	remaining: 2.22s
850:	learn: 0.4185183	total: 12.6s	remaining: 2.21s
851:	learn: 0.4182414	total: 12.6s	remaining: 2.2s
852:	learn: 0.4180292	total: 12.7s	remaining: 2.18s
853:	learn: 0.4177615	total: 12.7s	remaining: 2.17s
854:	learn: 0.4

999:	learn: 0.3819480	total: 15s	remaining: 0us


In [23]:
# f1-score 기준 정렬
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)    
MLA_compare

Unnamed: 0,MLA used,Train Accuracy,Test Accuracy,Precission,Recall,F1-score,AUC
5,PassiveAggressiveClassifier,0.4342,0.4183,0.418253,1.0,0.5898,0.5
8,Perceptron,0.434,0.4183,0.418253,1.0,0.5898,0.5
10,GaussianNB,0.5358,0.5108,0.444215,0.675039,0.5358,0.533908
13,ExtraTreeClassifier,1.0,0.5128,0.423133,0.453689,0.4379,0.504497
12,DecisionTreeClassifier,1.0,0.5036,0.413643,0.44741,0.4299,0.495714
0,AdaBoostClassifier,0.6202,0.5167,0.418451,0.398744,0.4084,0.500162
11,KNeighborsClassifier,0.6906,0.5102,0.410214,0.390895,0.4003,0.493416
14,XGBClassifier,1.0,0.524,0.419414,0.359498,0.3872,0.500855
3,GradientBoostingClassifier,0.7685,0.5561,0.448819,0.268446,0.336,0.515713
15,CatBoostClassifier,0.9941,0.5568,0.448925,0.262166,0.331,0.515395
