# Disaster Tweet Classification wi
## NLP with Disaster Tweets
------------------
### GOAL
- Predict whether a given tweet is about a real disaster.
- If it is, predict `1`; if not, predict `0`.


### Reference
- [competition main page](https://www.kaggle.com/c/nlp-getting-started)
- [word2vec code](https://www.kaggle.com/slatawa/simple-implementation-of-word2vec)
- [example code](https://www.kaggle.com/datarohitingole/disaster-tweet-classification-ridgeclassifiercv)
- [comparing the performance of different machine learning algorithms](https://dibyendudeb.com/comparing-machine-learning-algorithms/)

# 0. Importing Libraries

In [1]:
# for loading and preprocessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

# for training the model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree, linear_model, neighbors, naive_bayes, ensemble
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# for evaluating classification model
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# for data cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')  # needed by stopwords.words('english') in the cleaning step

# for word2vec
import gensim
from gensim.models import Word2Vec

# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2. Data Preprocessing
## Contents
#### 1. Clean the data
- deal with missing values
- replace some commonly occurring shorthands
- remove any non-alphabetic characters
- convert all text to lowercase (for consistency)
- lemmatize

#### 2. Word tokenization

#### 3. Word2Vec





## Data Description
--------------
### Files
- `train.csv` : the training set
- `test.csv` : the test set
- `sample_submission.csv` : a sample submission file in the correct format

### Columns
- `id` : a unique identifier for each tweet
- `text` : the text of the tweet
- `location` : the location the tweet was sent from 
- `keyword` : a particular keyword from the tweet
- `target` : in train.csv only, this denotes whether a tweet is about a real disaster(1) or not(0)

In [2]:
# loading the data set
data_path = './data/'
train = pd.read_csv(data_path + 'train.csv')
#test = pd.read_csv(data_path + 'test.csv')

#all_data = [train,test]

In [3]:
# split the data <train : test = 8 : 2>
train, test = train_test_split(train, test_size = 0.20, random_state = 0)

In [4]:
print('train_shape:', train.shape)
print('test_shape:', test.shape)

train_shape: (6090, 5)
test_shape: (1523, 5)


In [5]:
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 1386 | 1999 | bush%20fires | | Ted Cruz fires back at Jeb &amp; Bush: ÛÏWe l... | 0 |
| 4048 | 5751 | forest%20fires | | This is the first year the Forest Service spen... | 1 |
| 3086 | 4428 | electrocute | | @lightseraphs pissed at you and could have the... | 0 |
| 272 | 396 | apocalypse | ColoRADo | I'm gonna fight Taylor as soon as I get there. | 0 |
| 7462 | 10678 | wounds | Tampa, FL | @NicolaClements4 IÛªm not sure that covering ... | 0 |


In [6]:
test.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 311 | 454 | armageddon | Wrigley Field | @KatieKatCubs you already know how this shit g... | 0 |
| 4970 | 7086 | meltdown | Two Up Two Down | @LeMaireLee @danharmon People Near Meltdown Co... | 0 |
| 527 | 762 | avalanche | Score Team Goals Buying @ | 1-6 TIX Calgary Flames vs COL Avalanche Presea... | 0 |
| 6362 | 9094 | suicide%20bomb | Roadside | If you ever think you running out of choices i... | 0 |
| 800 | 1160 | blight | Laventillemoorings | If you dotish to blight your car go right ahea... | 0 |


## 2-1. Clean the data
- Dealing with missing values: drop `location`, which has missing entries, together with `id`

In [7]:
all_data = [train,test]
for data in all_data:
    data.drop(["location", "id"], axis = 1, inplace = True)
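Before dropping a column it can help to count how many of its entries are actually missing. The snippet below is a minimal, dependency-free sketch: it uses only the five rows shown in the `train.head()` output above (with the `text` column left out), so the counts describe that sample and not the full dataset.

```python
import csv
import io

# Five rows copied from the train.head() output (text column omitted).
sample = io.StringIO(
    'id,keyword,location,target\n'
    '1999,bush%20fires,,0\n'
    '5751,forest%20fires,,1\n'
    '4428,electrocute,,0\n'
    '396,apocalypse,ColoRADo,0\n'
    '10678,wounds,"Tampa, FL",0\n'
)

rows = list(csv.DictReader(sample))

# Count empty fields per column.
missing = {col: sum(1 for r in rows if r[col] == '') for col in rows[0]}
print(missing)  # {'id': 0, 'keyword': 0, 'location': 3, 'target': 0}
```

On the full DataFrame the equivalent check would be `train.isnull().sum()`.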

In [8]:
# data preprocessing with regex

def remove_URL(text):
    """Remove URL patterns (http://..., https://..., www....) from text."""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', text)

def remove_html(text):
    """Remove HTML tags and character entities such as &amp; from text."""
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub('', text)

def remove_punct(text):
    """Remove ASCII punctuation (; ' " : . , etc.) from text."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
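To see what the three helpers do together, here is a small self-contained check. The sample string is made up for illustration, and the functions are repeated in compact form so the snippet runs on its own.

```python
import re
import string

def remove_URL(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_html(text):
    return re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', text)

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Made-up tweet containing tags, an entity, a URL, and punctuation.
sample = "Fire near <b>LA</b> &amp; more: http://t.co/abc123 #wildfire!"

cleaned = remove_punct(remove_html(remove_URL(sample)))
# Collapse the double spaces left behind by the removed pieces.
cleaned = " ".join(cleaned.split())
print(cleaned)  # Fire near LA more wildfire
```

Note that `remove_punct` also strips `#` and `@`, so hashtags and mentions survive only as plain words.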

In [9]:
for data in all_data:
    data['text'] = data['text'].apply(remove_URL)
    data['text'] = data['text'].apply(remove_html)
    # string.punctuation includes the apostrophe, so contractions lose it here
    data['text'] = data['text'].apply(remove_punct)

- Replace some commonly occurring shorthands

In [10]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"i'll", "i will", text)
    text = re.sub(r"she'll", "she will", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"here's", "here is", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"\s+", " ", text) # Collapse any run of whitespace into one space
    return text

In [11]:
train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)
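The patterns in `clean_text` all contain an ASCII apostrophe, so they can only match text that still has its punctuation. In this notebook `remove_punct` has already run by the time `clean_text` is applied, which is why `Im gonna` ends up as `im gonna` in the output further down. A minimal sketch of the ordering effect, using a reduced, hypothetical `expand_contractions` helper and a made-up sentence:

```python
import re
import string

def expand_contractions(text):
    # Two of the patterns from clean_text, enough to show the effect.
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"\'ll", " will", text)
    return text

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "I'm sure they'll come"

# Expand contractions while the apostrophes are still present.
expanded_first = remove_punct(expand_contractions(sample))
# Strip punctuation first: the patterns no longer match anything.
stripped_first = expand_contractions(remove_punct(sample))

print(expanded_first)  # i am sure they will come
print(stripped_first)  # im sure theyll come
```

If the expansions are meant to take effect, one option is to apply `clean_text` before `remove_punct`; doing so would change the cleaned text and therefore the scores reported later.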

- remove any non-alphabetic characters
- convert all text to lowercase (for consistency)
- lemmatize

In [12]:
STOP_WORDS = set(stopwords.words('english'))  # build once instead of once per word
lem = WordNetLemmatizer()

def massage_text(text):
    # replace anything other than letters with a space and lowercase everything
    tweet = re.sub("[^a-zA-Z]", ' ', text).lower().split()
    # drop stopwords, then lemmatize the remaining words
    tweet = [lem.lemmatize(word) for word in tweet if word not in STOP_WORDS]
    return ' '.join(tweet)

# continue from the output of clean_text instead of restarting from the raw text
train['clean_text'] = train['clean_text'].apply(massage_text)
test['clean_text'] = test['clean_text'].apply(massage_text)

In [13]:
train.iloc[0:10][['text','clean_text']]

| | text | clean_text |
|---|---|---|
| 1386 | Ted Cruz fires back at Jeb  Bush ÛÏWe lose be... | ted cruz fire back jeb bush lose republican li... |
| 4048 | This is the first year the Forest Service spen... | first year forest service spent half annual bu... |
| 3086 | lightseraphs pissed at you and could have thei... | lightseraphs pissed could pikachu electrocute |
| 272 | Im gonna fight Taylor as soon as I get there | im gonna fight taylor soon get |
| 7462 | NicolaClements4 IÛªm not sure that covering m... | nicolaclements sure covering head wound scab s... |
| 4778 | kabwandi Breaking news Unconfirmed I just hear... | kabwandi breaking news unconfirmed heard loud ... |
| 260 | The annihilation of Jeb Christie  Kasich is le... | annihilation jeb christie kasich le hour away ... |
| 2921 | Wtf this mom just drowned her child | wtf mom drowned child |
| 2162 | MayorofLondon pls reduce cyclist deaths with a... | mayoroflondon pls reduce cyclist death compuls... |
| 4818 | DoctorDryadma mass murder here we come | doctordryadma mass murder come |


## 2-2. Word Tokenization
- tokenize the clean text column

In [14]:
train['tokens']=train['clean_text'].apply(lambda x: word_tokenize(x))
test['tokens'] = test['clean_text'].apply(lambda x: word_tokenize(x))

In [15]:
train.head(5)

| | keyword | text | target | clean_text | tokens |
|---|---|---|---|---|---|
| 1386 | bush%20fires | Ted Cruz fires back at Jeb  Bush ÛÏWe lose be... | 0 | ted cruz fire back jeb bush lose republican li... | [ted, cruz, fire, back, jeb, bush, lose, repub... |
| 4048 | forest%20fires | This is the first year the Forest Service spen... | 1 | first year forest service spent half annual bu... | [first, year, forest, service, spent, half, an... |
| 3086 | electrocute | lightseraphs pissed at you and could have thei... | 0 | lightseraphs pissed could pikachu electrocute | [lightseraphs, pissed, could, pikachu, electro... |
| 272 | apocalypse | Im gonna fight Taylor as soon as I get there | 0 | im gonna fight taylor soon get | [im, gon, na, fight, taylor, soon, get] |
| 7462 | wounds | NicolaClements4 IÛªm not sure that covering m... | 0 | nicolaclements sure covering head wound scab s... | [nicolaclements, sure, covering, head, wound, ... |


## 2-3. Word2Vec
- convert our data(words) into vectors

In [16]:
# first, build the corpus (a list of token lists) used to train the word2vec mappings
def fn_pre_process_data(doc):
    for rec in doc:
        yield gensim.utils.simple_preprocess(rec)

corpus = list(fn_pre_process_data(train['clean_text']))
corpus += list(fn_pre_process_data(test['clean_text']))

In [17]:
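# initiate the embedding model, we will come back to the passed arguments later
# note: passing `corpus` to the constructor already builds the vocabulary and runs an initial training pass
print('initiated ...')
wv_model = Word2Vec(corpus, vector_size=150, window=3, min_count=2)
#wv_model.build_vocab(corpus)
# continue training for 10 more epochs
wv_model.train(corpus, total_examples=len(corpus), epochs=10)
#wv_model.save(data_path + 'word2vec.model')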

initiated ...


(581392, 681810)

In [18]:
# convert the train and test tokens into averaged word vectors
def get_word_embeddings(token_list, vector, k=150):
    # an empty tweet maps to the zero vector
    if len(token_list) < 1:
        return np.zeros(k)
    # look up each token; out-of-vocabulary tokens get a random vector
    vectorized = [vector.wv[word] if word in vector.wv else np.random.rand(k)
                  for word in token_list]
    # return the average over all tokens
    return np.sum(vectorized, axis=0) / len(vectorized)

def get_embeddings(tokens, vector):
    # use the model that was passed in rather than the global wv_model
    embeddings = tokens.apply(lambda x: get_word_embeddings(x, vector))
    return list(embeddings)

In [19]:
train_embeddings = get_embeddings(train['tokens'],wv_model)
test_embeddings = get_embeddings(test['tokens'],wv_model)
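The averaging step can be illustrated without a trained model. The sketch below is dependency-free and rests on two assumptions that differ from the notebook: the `toy_vectors` dictionary is a hypothetical 3-dimensional stand-in for `wv_model.wv`, and out-of-vocabulary tokens are skipped instead of being replaced by a random vector, which keeps the result deterministic.

```python
def average_embedding(tokens, vectors, k):
    """Mean of the vectors of the known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * k
    # zip(*known) groups the values of each dimension together
    return [sum(dim) / len(known) for dim in zip(*known)]

# Hypothetical 3-dimensional toy vectors standing in for wv_model.wv.
toy_vectors = {
    "forest": [0.0, 1.0, 0.0],
    "fire":   [1.0, 0.0, 0.0],
}

tweet_vec = average_embedding(["forest", "fire", "zzz"], toy_vectors, k=3)
print(tweet_vec)                                # [0.5, 0.5, 0.0]
print(average_embedding([], toy_vectors, k=3))  # [0.0, 0.0, 0.0]
```

Every tweet, whatever its length, ends up as one fixed-size vector, which is what the classifiers in the next section expect as input.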

# 3. Model
## Contents
- train the model
    - RidgeClassifierCV
    - SGDClassifier
    - BernoulliNB
    - RandomForest

## Model Description
--------------
### Ensemble
- Combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
- Boosting is one type of ensemble method.
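As a rough illustration of "combining the predictions of several base estimators", the sketch below applies a plain majority vote to made-up 0/1 predictions. This is only the simplest combination rule: boosting methods such as AdaBoost build their estimators sequentially and weight them, which the sketch does not attempt to show. The function name `majority_vote` and the prediction lists are hypothetical.

```python
def majority_vote(predictions):
    """Combine 0/1 predictions from several estimators by simple majority."""
    n_estimators = len(predictions)
    combined = []
    # zip(*predictions) yields, per sample, the votes of all estimators
    for votes in zip(*predictions):
        combined.append(1 if sum(votes) * 2 > n_estimators else 0)
    return combined

# Made-up predictions of three base estimators on four samples.
base_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

print(majority_vote(base_predictions))  # [1, 0, 1, 1]
```

Each estimator is wrong on a different sample here, yet the vote recovers a single consistent answer, which is the intuition behind the robustness claim.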

### Performance - f1-score


## 3-2. Train the model

In [20]:
MLA = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    
    #GLM
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),

    XGBClassifier(),
    CatBoostClassifier()  
    ]

# Comparing all MLA
- precision
- recall
- accuracy
- f1-score
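For reference, the four metrics can be computed by hand from the confusion counts. The following is a minimal sketch on made-up labels; `binary_metrics` is a hypothetical helper, and it assumes at least one predicted positive and one actual positive so that no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp)          # share of predicted disasters that are real
    recall = tp / (tp + fn)             # share of real disasters that were found
    accuracy = (tp + tn) / len(pairs)   # share of all tweets classified correctly
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Made-up labels: 1 = real disaster, 0 = not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here precision is 2/3, recall is 1/2, accuracy is 5/8, and the F1-score is 4/7. F1 sits below accuracy because it ignores true negatives.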

In [21]:
train_embeddings = np.array(train_embeddings)
test_embeddings = np.array(test_embeddings)

In [22]:
# Comparing all machine learning algorithms
rows = []

for alg in MLA:
    alg.fit(train_embeddings, train['target'])
    predicted = alg.predict(test_embeddings)

    # roc_curve is given hard 0/1 predictions here, so the AUC below is computed
    # from a single operating point rather than from predicted scores
    fpr, tpr, _ = roc_curve(test['target'], predicted)
    precision = precision_score(test['target'], predicted)
    recall = recall_score(test['target'], predicted)

    rows.append({
        'MLA used': alg.__class__.__name__,
        'Train Accuracy': round(alg.score(train_embeddings, train['target']), 4),
        'Test Accuracy': round(alg.score(test_embeddings, test['target']), 4),
        'Precision': precision,
        'Recall': recall,
        'F1-score': round(f1_score(test['target'], predicted), 4),
        'AUC': auc(fpr, tpr),
    })

# one row per algorithm, indexed in the order of the MLA list
MLA_compare = pd.DataFrame(rows)



Learning rate set to 0.022283
0:	learn: 0.6880650	total: 72.6ms	remaining: 1m 12s
1:	learn: 0.6829057	total: 90.7ms	remaining: 45.3s
2:	learn: 0.6780950	total: 108ms	remaining: 35.8s
...
998:	learn: 0.3052103	total: 15.9s	remaining: 15.9ms
999:	learn: 0.3049997	total: 15.9s	remaining: 0us


In [23]:
# sort by F1-score
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)    
MLA_compare

| | MLA used | Train Accuracy | Test Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|---|---|
| 8 | Perceptron | 0.6938 | 0.7039 | 0.648089 | 0.638932 | 0.6435 | 0.694748 |
| 15 | CatBoostClassifier | 0.9227 | 0.7347 | 0.754923 | 0.541601 | 0.6307 | 0.707595 |
| 14 | XGBClassifier | 0.989 | 0.7216 | 0.712575 | 0.56044 | 0.6274 | 0.698956 |
| 4 | RandomForestClassifier | 0.989 | 0.7347 | 0.76659 | 0.525903 | 0.6238 | 0.705389 |
| 2 | ExtraTreesClassifier | 0.989 | 0.74 | 0.796069 | 0.508634 | 0.6207 | 0.707477 |
| 3 | GradientBoostingClassifier | 0.791 | 0.7295 | 0.755102 | 0.522763 | 0.6178 | 0.700433 |
| 6 | RidgeClassifierCV | 0.722 | 0.7295 | 0.775061 | 0.497645 | 0.6061 | 0.696904 |
| 7 | SGDClassifier | 0.7197 | 0.7137 | 0.718004 | 0.519623 | 0.6029 | 0.686448 |
| 0 | AdaBoostClassifier | 0.7368 | 0.7019 | 0.692632 | 0.516484 | 0.5917 | 0.675849 |
| 11 | KNeighborsClassifier | 0.7944 | 0.6763 | 0.629032 | 0.55102 | 0.5874 | 0.658693 |
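One pattern worth reading off the table is the gap between train and test accuracy: several tree-based ensembles reach 0.989 on the training split but stay around 0.72 to 0.74 on the test split, while the linear models score about the same on both. The sketch below simply subtracts the two columns; the numbers are copied from the rows shown above, and the `scores` dictionary is just an illustrative container.

```python
# (train accuracy, test accuracy) copied from the rows shown in the table
scores = {
    "Perceptron":                 (0.6938, 0.7039),
    "CatBoostClassifier":         (0.9227, 0.7347),
    "XGBClassifier":              (0.9890, 0.7216),
    "RandomForestClassifier":     (0.9890, 0.7347),
    "ExtraTreesClassifier":       (0.9890, 0.7400),
    "GradientBoostingClassifier": (0.7910, 0.7295),
    "RidgeClassifierCV":          (0.7220, 0.7295),
    "SGDClassifier":              (0.7197, 0.7137),
    "AdaBoostClassifier":         (0.7368, 0.7019),
    "KNeighborsClassifier":       (0.7944, 0.6763),
}

# A large positive gap suggests the model fits the training data
# much better than it generalizes.
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s}{gap:+.4f}")
```

The largest gap in this sample belongs to `XGBClassifier` (about 0.27), and the smallest gaps belong to `Perceptron`, `RidgeClassifierCV`, and `SGDClassifier`.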
