# TEST3: add non-disaster, disaster 400 lines
## performing measurement
### Disaster Tweet Classification with Disaster Tweets

------------------
### GOAL
- Compare the performance of different word representations
    - word2vec vs. word count
- Predict whether a given tweet is about a real disaster or not
- If so, predict `1`; if not, predict `0`


### Reference
- [competition main page](https://www.kaggle.com/c/nlp-getting-started)
- [word2vec code](https://www.kaggle.com/slatawa/simple-implementation-of-word2vec)
- [word2vec-implementation-for-beginner](https://www.kaggle.com/manavkapadnis/nlp-word2vec-implementation-for-beginner)
- [word count](https://www.kaggle.com/datarohitingole/disaster-tweet-classification-ridgeclassifiercv)
- [comparing the performance of different Machine Learning Algorithm](https://dibyendudeb.com/comparing-machine-learning-algorithms/)
- [Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794)

# 0. Importing Libraries

In [1]:
# for loading and preprocessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

# for training the model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree, linear_model, neighbors, naive_bayes, ensemble
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# for evaluating classification model
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# for data cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')


# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM

# for word2vec
import gensim
from gensim.models import Word2Vec

# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2. Data Preprocessing
## Contents
### 1. Clean the data
- dealing with missing values
- replace some commonly occurring shorthands
- remove any characters other than letters
- convert all words to lowercase (for consistency)
- lemmatize

### 2-1. word tokenization and word2vec

### 2-2. Convert text to vectors using CountVectorizer





## Data Description
--------------
### Files
- `train.csv` : the training set
- `test.csv` : the test set
- `sample_submission.csv` : a sample submission file in the correct format

### Columns
- `id` : a unique identifier for each tweet
- `text` : the text of the tweet
- `location` : the location the tweet was sent from 
- `keyword` : a particular keyword from the tweet
- `target` : in train.csv only, this denotes whether a tweet is about a real disaster(1) or not(0)

In [2]:
# loading the data set
data_path = './data/'
add_nonD_D = pd.read_csv(data_path + 'train_add_nonD400_D400.csv')

In [3]:
print('train_shape:', add_nonD_D.shape)
add_nonD_D.head()

train_shape: (8413, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
# rows with target == -1 are foreign-language (non-English) tweets, so drop them
indexNames = add_nonD_D[add_nonD_D['target']==-1].index
add_nonD_D.drop(indexNames , inplace=True)
print('train_shape:', add_nonD_D.shape)
add_nonD_D.head()

train_shape: (8411, 5)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
# split the data <train : test = 8 : 2>
train, test = train_test_split(add_nonD_D, test_size = 0.20, random_state = 0)
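
The train and test shapes printed in the next two cells follow from how a fractional `test_size` is rounded: to my understanding, scikit-learn takes the ceiling of `test_size * n_samples` for the test set and gives the remainder to the training set. A quick check of that arithmetic for the 8,411 rows left after the drop (the helper name `split_sizes` is made up for illustration):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # the test set gets the rounded-up share, the training set the rest
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(8411, 0.20))  # (6728, 1683)
```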

In [6]:
print('train shape:', train.shape)
train.head()

train shape: (6728, 5)


Unnamed: 0,id,keyword,location,text,target
5366,7656,panic,The Internetz,Plane Panic What kind of douchebag. Bubble Gum,0
4695,6675,landslide,Dundee,DUNDEE NEWS: Army veteran fears loose rocks fr...,1
5340,7622,pandemonium,www.facebook.com/stuntfm,Pandemonium In Aba As Woman Delivers Baby With...,0
4763,6776,lightning,,Dry thunderstorms with lightning possible in t...,1
2212,3168,deluge,617-BTOWN-BEATDOWN,Photo: forrestmankins: Colorado camping. http:...,0


In [7]:
print('test shape:', test.shape)
test.head()

test shape: (1683, 5)


Unnamed: 0,id,keyword,location,text,target
7425,10622,wounded,worldwide,Officer Wounded Suspect Killed in Exchange of ...,1
3542,5064,famine,"Kyiv, Ukraine",#Russia 'food crematoria' provoke outrage in c...,1
1467,2115,catastrophe,Florida,@deb117 7/30 that catastrophe man opens school...,0
8136,123,smudging%20smudgewand%20handmade%20covenstuff%...,,Buy my dead flowers _arvest moons are coming s...,0
3873,5506,flames,Houston |??| Corsicana,@CW_Hoops you better make all your shots tomor...,0


## 1. Clean the data
#### (1) Dealing with Missing Values

In [8]:
all_data = [train,test]
for data in all_data:
    data.drop(["location", "id"], axis = 1, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [9]:
# data preprocessing with regex

def remove_URL(text): # remove url pattern in text
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text): # remove html pattern in text
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub(r'', text)
    #return re.sub(html, '', text)

def remove_punct(text): # remove punctuation in text (; ' " : . , etc.)
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
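
To see what the three helpers do in combination, here is a small self-contained sketch that applies the same patterns to a made-up tweet. The function name `strip_noise` and the sample string are hypothetical. Note that `string.punctuation` includes `#` and `@`, so hashtags and mentions end up as plain words.

```python
import re
import string

# same patterns as the helpers above, compiled once
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_noise(text):
    text = URL_RE.sub('', text)          # drop URLs
    text = HTML_RE.sub('', text)         # drop tags and HTML entities
    return text.translate(PUNCT_TABLE)   # drop punctuation, including # and @

sample = "Forest fire near <b>La Ronge</b> &amp; Sask. http://t.co/abc #wildfire"

# removals leave doubled spaces behind, so normalize before printing
print(" ".join(strip_noise(sample).split()))  # Forest fire near La Ronge Sask wildfire
```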

#### (2) Replace some commonly occurring shorthands

In [10]:
for data in all_data:
  data['text'] = data['text'].apply(lambda x: remove_URL(x))
  data['text'] = data['text'].apply(lambda x: remove_html(x))
  data['text'] = data['text'].apply(lambda x: remove_punct(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(lambda x: remove_URL(x))


In [11]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"i'll", "i will", text)
    text = re.sub(r"she'll", "she will", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"here's", "here is", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"\s+", " ", text) # collapse any run of whitespace into one space
    return text
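
One thing to watch with the contraction rules above: they match on the apostrophe, but `remove_punct` has already deleted apostrophes by the time `clean_text` runs, so a contraction such as `i'm` arrives as `im` and no rule fires. A minimal sketch of the effect of ordering, using a reduced rule set (the function `expand_contractions` and the sample tweet are hypothetical):

```python
import re
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct(text):
    return text.translate(PUNCT_TABLE)

def expand_contractions(text):
    # reduced, illustrative subset of the rules in clean_text
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    return text

tweet = "I'm sure they can't evacuate"

punct_first = expand_contractions(remove_punct(tweet))   # order used in the notebook
expand_first = remove_punct(expand_contractions(tweet))  # alternative order

print(punct_first)   # im sure they cant evacuate
print(expand_first)  # i am sure they cannot evacuate
```

The sketch only handles straight apostrophes (`'`); tweets that use typographic apostrophes would need an extra normalization step.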

In [12]:
train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)



In [13]:
train.head()

Unnamed: 0,keyword,text,target,clean_text
5366,panic,Plane Panic What kind of douchebag Bubble Gum,0,plane panic what kind of douchebag bubble gum
4695,landslide,DUNDEE NEWS Army veteran fears loose rocks fro...,1,dundee news army veteran fears loose rocks fro...
5340,pandemonium,Pandemonium In Aba As Woman Delivers Baby With...,0,pandemonium in aba as woman delivers baby with...
4763,lightning,Dry thunderstorms with lightning possible in t...,1,dry thunderstorms with lightning possible in t...
2212,deluge,Photo forrestmankins Colorado camping,0,photo forrestmankins colorado camping


#### (3) Remove any characters other than letters
#### (4) Convert all words to lowercase (for consistency)
#### (5) Lemmatize

In [14]:
# build the stop-word set and the lemmatizer once, not once per word
STOP_WORDS = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def massage_text(text):
    # (3) replace anything other than letters with a space
    tweet = re.sub("[^a-zA-Z]", ' ', text)
    tweet = tweet.lower() # (4)
    tweet = tweet.split()

    # (5) drop stop words, then lemmatize the remaining words
    tweet = [lem.lemmatize(word) for word in tweet
             if word not in STOP_WORDS]
    return ' '.join(tweet)

train['clean_text'] = train['text'].apply(massage_text)
test['clean_text'] = test['text'].apply(massage_text)



In [15]:
train.iloc[0:10][['text','clean_text']]

Unnamed: 0,text,clean_text
5366,Plane Panic What kind of douchebag Bubble Gum,plane panic kind douchebag bubble gum
4695,DUNDEE NEWS Army veteran fears loose rocks fro...,dundee news army veteran fear loose rock dunde...
5340,Pandemonium In Aba As Woman Delivers Baby With...,pandemonium aba woman delivers baby without fa...
4763,Dry thunderstorms with lightning possible in t...,dry thunderstorm lightning possible pinpoint v...
2212,Photo forrestmankins Colorado camping,photo forrestmankins colorado camping
7237,eyecuts Erasuterism I love 96 Gal Deco to deat...,eyecuts erasuterism love gal deco death even b...
4284,The Prophet peace be upon him said Save yourse...,prophet peace upon said save hellfire even giv...
4083,Strong Thunderstorm 4 Miles North of Japton Mo...,strong thunderstorm mile north japton moving s...
4894,I see a massacre,see massacre
272,Im gonna fight Taylor as soon as I get there,im gonna fight taylor soon get


## 2-1. Word Tokenization and word2vec
- tokenize the clean text column

In [16]:
train['tokens'] = train['clean_text'].apply(lambda x: word_tokenize(x))
test['tokens'] = test['clean_text'].apply(lambda x: word_tokenize(x))



In [17]:
train.head()

Unnamed: 0,keyword,text,target,clean_text,tokens
5366,panic,Plane Panic What kind of douchebag Bubble Gum,0,plane panic kind douchebag bubble gum,"[plane, panic, kind, douchebag, bubble, gum]"
4695,landslide,DUNDEE NEWS Army veteran fears loose rocks fro...,1,dundee news army veteran fear loose rock dunde...,"[dundee, news, army, veteran, fear, loose, roc..."
5340,pandemonium,Pandemonium In Aba As Woman Delivers Baby With...,0,pandemonium aba woman delivers baby without fa...,"[pandemonium, aba, woman, delivers, baby, with..."
4763,lightning,Dry thunderstorms with lightning possible in t...,1,dry thunderstorm lightning possible pinpoint v...,"[dry, thunderstorm, lightning, possible, pinpo..."
2212,deluge,Photo forrestmankins Colorado camping,0,photo forrestmankins colorado camping,"[photo, forrestmankins, colorado, camping]"


#### convert our data(words) into vectors

In [18]:
# first, build the corpus (a list of token lists) used to train the word2vec model
def fn_pre_process_data(doc):
    for rec in doc:
        yield gensim.utils.simple_preprocess(rec)

corpus = list(fn_pre_process_data(train['clean_text']))
corpus += list(fn_pre_process_data(test['clean_text']))

In [19]:
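# initialize the embedding model: 150-dimensional vectors, context window of 3,
# words seen fewer than 2 times are ignored
print('initiated ...')
# note: passing the corpus to the constructor already builds the vocabulary
# and runs an initial round of training
wv_model = Word2Vec(corpus,vector_size=150,window=3,min_count=2)
#wv_model.build_vocab(corpus)
# continue training on the same corpus for 10 more epochs
wv_model.train(corpus,total_examples=len(corpus),epochs=10)
#wv_model.save(data_path + 'word2vec.model')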

initiated ...


(687670, 805860)

In [20]:
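# convert the train and test tokens into fixed-length vectors
def get_word_embeddings(token_list, vector, k=150):
    # an empty tweet maps to the zero vector
    if len(token_list) < 1:
        return np.zeros(k)

    # look up each token; out-of-vocabulary tokens get a random vector,
    # so results vary between runs unless the NumPy seed is fixed
    vectorized = [vector.wv[word] if word in vector.wv else np.random.rand(k)
                  for word in token_list]

    # return the average of the token vectors
    return np.mean(vectorized, axis=0)

def get_embeddings(tokens, vector):
    # use the model passed in rather than the global wv_model
    embeddings = tokens.apply(lambda x: get_word_embeddings(x, vector))
    return list(embeddings)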
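
The averaging step can be illustrated without a trained model. The sketch below uses a toy dictionary of 3-dimensional vectors in place of `wv_model.wv`. It also swaps in a different out-of-vocabulary policy from the notebook: unknown tokens are skipped instead of being replaced by random vectors, which keeps the result deterministic. The names `toy_vectors` and `mean_pool` are hypothetical.

```python
import numpy as np

# toy 3-dimensional "embeddings"; the notebook's word2vec vectors have 150 dimensions
toy_vectors = {
    "forest": np.array([1.0, 0.0, 2.0]),
    "fire":   np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, vectors, k=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(k)
    return np.mean(known, axis=0)

print(mean_pool(["forest", "fire", "ronge"], toy_vectors))  # [2. 1. 1.]
print(mean_pool([], toy_vectors))                           # [0. 0. 0.]
```

Whether skipping or randomizing unknown tokens works better for this data set is not something the sketch settles; it only shows the pooling mechanics.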

In [21]:
train_embeddings = get_embeddings(train['tokens'],wv_model)
test_embeddings = get_embeddings(test['tokens'],wv_model)

In [22]:
model_path = './model/'
wv_model.save(model_path + 'word2vec_1.model')

## 2-2. Convert text to vectors using CountVectorizer

### What is the Count Vectorizer?
- convert a collection of text documents to a matrix of token counts

### How to Use
```python
# python example code
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document", "This document is the second document", "And this is the third one"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
```
- vectorizer.get_feature_names_out()
> array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], ...)
- X.toarray()
> [[0 1 1 1 0 0 1 0 1]  
 [0 2 0 1 0 1 1 0 1]  
 [1 0 0 1 1 0 1 1 1]]
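
To make the counting explicit, the same matrix can be rebuilt with the standard library alone. This is only a sketch that approximates the vectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper `tokenize` is hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
]

def tokenize(doc):
    # lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# the vocabulary is the sorted set of all tokens; each row counts one document
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
matrix = [[Counter(tokenize(doc))[term] for term in vocab] for doc in corpus]

print(vocab)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary word, so the `2` in the second row reflects "document" appearing twice in the second sentence.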

In [23]:
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=2000)

X = count_vectorizer.fit_transform(train["clean_text"]).toarray()
test_tmp = count_vectorizer.transform(test["clean_text"]).toarray()
y = train['target']

In [24]:
X_train = X
X_test = test_tmp
y_train = train['target']
y_test = test['target']

# 3. Model
## Contents
- train the model
    - RidgeClassifierCV
    - SGDClassifier
    - BernoulliNB 
    - RandomForest

## Model Description
--------------
### Ensemble
- Combines the predictions of several base estimators built with a given learning algorithm
    - in order to improve generalizability / robustness over a single estimator.
- Boosting is one type of ensemble method; the random forest used below is a bagging-style ensemble

### Performance - f1-score
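
The F1-score is the harmonic mean of precision and recall. A minimal sketch of the calculation from confusion-matrix counts follows; the counts are illustrative and not taken from this notebook's results, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # share of predicted disasters that are real
    recall = tp / (tp + fn)      # share of real disasters that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# illustrative counts only
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```

The sketch assumes at least one true positive; with `tp=0` the divisions would need guarding.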


## 3-2. Train the model

In [25]:
MLA = [
    #Ensemble Methods
    ensemble.RandomForestClassifier(),
    
    #GLM
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB()
    ]

# Comparing all MLA
## word2vec
- precision
- recall
- accuracy
- f1-score

In [26]:
train_embeddings = np.array(train_embeddings)
test_embeddings = np.array(test_embeddings)

In [27]:
# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

row_index = 0
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

for alg in MLA:
    predicted = alg.fit(train_embeddings, y_train).predict(test_embeddings)
    fp, tp, th = roc_curve(y_test, predicted)

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(train_embeddings,y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(test_embeddings, y_test), 4)
    
    
    recall = recall_score(test['target'], predicted)
    precision = precision_score(test['target'], predicted)
    MLA_compare.loc[row_index, 'Precission'] = precision
    MLA_compare.loc[row_index, 'Recall'] = recall
    MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall),4)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1
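
Note that the loop passes hard 0/1 predictions to `roc_curve`, so the curve has a single operating point and the AUC column reduces to the mean of the true-positive and true-negative rates. A ranking-based AUC would need continuous scores, for example from `decision_function` or `predict_proba` where the estimator provides them. A small pure-Python sketch of the difference, using the pairwise definition of AUC and invented scores (the function name `pairwise_auc` is hypothetical):

```python
def pairwise_auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]                # invented continuous scores
labels = [1 if s >= 0.5 else 0 for s in scores]   # hard predictions at a 0.5 threshold

print(round(pairwise_auc(y_true, scores), 4))  # 0.8333
print(round(pairwise_auc(y_true, labels), 4))  # 0.5833
```

With hard labels the second value equals (2/3 + 1/2) / 2, the average of the true-positive and true-negative rates for this toy data.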

In [28]:
# sort by F1-score
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)    
MLA_compare

Unnamed: 0,MLA used,Train Accuracy,Test Accuracy,Precission,Recall,F1-score,AUC
0,RandomForestClassifier,0.9896,0.7611,0.815068,0.526549,0.6398,0.722976
3,BernoulliNB,0.5966,0.5954,0.498389,0.684366,0.5768,0.609845
1,RidgeClassifierCV,0.7308,0.7201,0.771654,0.433628,0.5552,0.673531
2,SGDClassifier,0.728,0.7166,0.791304,0.402655,0.5337,0.665507


# Comparing all MLA
## wordCount
- precision
- recall
- accuracy
- f1-score

In [29]:
# Comparing all machine learning algorithms
from sklearn.metrics import mean_squared_error,confusion_matrix, precision_score, recall_score, auc,roc_curve

row_index = 0
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

for alg in MLA:
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index,'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 4)
    
    
    recall = recall_score(y_test, predicted)
    precision = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Precission'] = precision
    MLA_compare.loc[row_index, 'Recall'] = recall
    MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall),4)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index+=1

In [30]:
# sort by F1-score
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)    
MLA_compare

Unnamed: 0,MLA used,Train Accuracy,Test Accuracy,Precission,Recall,F1-score,AUC
3,BernoulliNB,0.8463,0.8045,0.804538,0.679941,0.737,0.784249
0,RandomForestClassifier,0.9811,0.7879,0.746544,0.716814,0.7314,0.776318
1,RidgeClassifierCV,0.8707,0.8004,0.814338,0.653392,0.725,0.776447
2,SGDClassifier,0.9086,0.7867,0.762768,0.682891,0.7206,0.769804
