# Comparing word2vec and word counting
## Performance measurement
### Disaster tweet classification on the Kaggle Disaster Tweets data

------------------
### Goal
- Compare the performance of different word representations
    - word2vec vs. word count
- Predict whether a given tweet is about a real disaster or not
- If so, predict `1`; if not, predict `0`


### References
- [competition main page](https://www.kaggle.com/c/nlp-getting-started)
- [word2vec code](https://www.kaggle.com/slatawa/simple-implementation-of-word2vec)
- [word2vec-implementation-for-beginner](https://www.kaggle.com/manavkapadnis/nlp-word2vec-implementation-for-beginner)
- [word count](https://www.kaggle.com/datarohitingole/disaster-tweet-classification-ridgeclassifiercv)
- [Comparing the performance of different machine learning algorithms](https://dibyendudeb.com/comparing-machine-learning-algorithms/)
- [Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794)

# 0. Importing Libraries

In [1]:
# for loading and preprocessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

# for training the model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree, linear_model, neighbors, naive_bayes, ensemble
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# for evaluating classification model
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# for data cleaning
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')


# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM

# for word2vec
import gensim
from gensim.models import Word2Vec

# additional metrics for comparing the machine learning algorithms
# (confusion_matrix is already imported above)
from sklearn.metrics import mean_squared_error, precision_score, recall_score, auc, roc_curve

[nltk_data] Downloading package wordnet to /Users/mac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 2. Data Preprocessing
## Contents
### 1. Clean the data
- dealing with missing values
- replace some commonly occurring shorthands
- remove any non-alphabetic characters
- convert all text to lowercase (for consistency)
- lemmatize

### 2-1. word tokenization and word2vec

### 2-2. Convert text to vectors using CountVectorizer





## Data Description
--------------
### Files
- `train.csv` : the training set
- `test.csv` : the test set
- `sample_submission.csv` : a sample submission file in the correct format

### Columns
- `id` : a unique identifier for each tweet
- `text` : the text of the tweet
- `location` : the location the tweet was sent from 
- `keyword` : a particular keyword from the tweet
- `target` : in `train.csv` only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [2]:
# loading the data set
data_path = './data/'
train = pd.read_csv(data_path + 'train.csv')

In [3]:
print('train_shape:', train.shape)
train.head()

train_shape: (7613, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [4]:
# split the data <train : test = 8 : 2>
train, test = train_test_split(train, test_size = 0.20, random_state = 0)

In [5]:
print('train shape:', train.shape)
train.head()

train shape: (6090, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 1386 | 1999 | bush%20fires | | Ted Cruz fires back at Jeb &amp; Bush: ÛÏWe l... | 0 |
| 4048 | 5751 | forest%20fires | | This is the first year the Forest Service spen... | 1 |
| 3086 | 4428 | electrocute | | @lightseraphs pissed at you and could have the... | 0 |
| 272 | 396 | apocalypse | ColoRADo | I'm gonna fight Taylor as soon as I get there. | 0 |
| 7462 | 10678 | wounds | Tampa, FL | @NicolaClements4 IÛªm not sure that covering ... | 0 |


In [6]:
print('test shape:', test.shape)
test.head()

test shape: (1523, 5)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 311 | 454 | armageddon | Wrigley Field | @KatieKatCubs you already know how this shit g... | 0 |
| 4970 | 7086 | meltdown | Two Up Two Down | @LeMaireLee @danharmon People Near Meltdown Co... | 0 |
| 527 | 762 | avalanche | Score Team Goals Buying @ | 1-6 TIX Calgary Flames vs COL Avalanche Presea... | 0 |
| 6362 | 9094 | suicide%20bomb | Roadside | If you ever think you running out of choices i... | 0 |
| 800 | 1160 | blight | Laventillemoorings | If you dotish to blight your car go right ahea... | 0 |


## 1. Clean the data
#### (1) Dealing with Missing Values

In [7]:
all_data = [train,test]
for data in all_data:
    data.drop(["location", "id"], axis = 1, inplace = True)

In [8]:
# data preprocessing with regex

def remove_URL(text): # remove url pattern in text
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text): # remove html pattern in text
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html.sub(r'', text)
    #return re.sub(html, '', text)

def remove_punct(text): # remove punctuation in text: (;, ', ", :, ., , etc..)
  table = str.maketrans('', '', string.punctuation)
  return text.translate(table)
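
To see what the three helpers do when chained, here is a small self-contained check. The sample string is made up for illustration (it is not taken from the dataset), and the helpers are re-declared in compact form so the snippet runs on its own.

```python
import re
import string

def remove_URL(text):
    # drop http(s):// and www. links
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_html(text):
    # drop tags such as <b> and entities such as &amp;
    return re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', text)

def remove_punct(text):
    # drop every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

# hypothetical tweet, for illustration only
sample = "Fire &amp; smoke near <b>downtown</b>! http://t.co/abc123 #wildfire"

step1 = remove_URL(sample)
step2 = remove_html(step1)
step3 = remove_punct(step2)

# the removals leave double spaces behind, so collapse whitespace for display
cleaned = " ".join(step3.split())
print(cleaned)
```

Note that removing punctuation also strips the `#` and `@` markers, so hashtags and mentions survive only as plain words.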

#### (2) Replace some commonly occurring shorthands

In [9]:
for data in all_data:
  data['text'] = data['text'].apply(lambda x: remove_URL(x))
  data['text'] = data['text'].apply(lambda x: remove_html(x))
  data['text'] = data['text'].apply(lambda x: remove_punct(x))

In [10]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"you'll", "you will", text)
    text = re.sub(r"i'll", "i will", text)
    text = re.sub(r"she'll", "she will", text)
    text = re.sub(r"he'll", "he will", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"there's", "there is", text)
    text = re.sub(r"here's", "here is", text)
    text = re.sub(r"who's", "who is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"\s+", " ", text) # Collapse any run of whitespace into one space
    return text
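
The contraction patterns rely on the apostrophe, so the order of the cleaning steps matters. In this notebook the punctuation removal in cell `In [9]` runs before `clean_text`, which is why `I'm` shows up as `im` in the output. A minimal sketch of the effect, using a hypothetical two-rule `expand` helper as a shortened stand-in for `clean_text` and a made-up tweet:

```python
import re
import string

def expand(text):
    # hypothetical, shortened stand-in for clean_text: two rules only
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"n't", " not", text)
    return text

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

tweet = "I'm sure they didn't evacuate"  # made-up example

# expand contractions first, then strip punctuation
expand_first = remove_punct(expand(tweet))
# strip punctuation first: the apostrophes are gone, so no rule matches
strip_first = expand(remove_punct(tweet))

print(expand_first)
print(strip_first)
```

If the expanded forms are wanted, one option is to apply `clean_text` before `remove_punct`; whether that changes the scores reported here has not been measured.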

In [11]:
train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)

In [12]:
train.head()

| | keyword | text | target | clean_text |
|---|---|---|---|---|
| 1386 | bush%20fires | Ted Cruz fires back at Jeb Bush ÛÏWe lose be... | 0 | ted cruz fires back at jeb bush ûïwe lose be... |
| 4048 | forest%20fires | This is the first year the Forest Service spen... | 1 | this is the first year the forest service spen... |
| 3086 | electrocute | lightseraphs pissed at you and could have thei... | 0 | lightseraphs pissed at you and could have thei... |
| 272 | apocalypse | Im gonna fight Taylor as soon as I get there | 0 | im gonna fight taylor as soon as i get there |
| 7462 | wounds | NicolaClements4 IÛªm not sure that covering m... | 0 | nicolaclements4 iûªm not sure that covering m... |


#### (3) Remove any non-alphabetic characters
#### (4) Convert all text to lowercase (for consistency)
#### (5) Lemmatize

In [13]:
# build the stop-word set and the lemmatizer once instead of once per word
STOP_WORDS = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def massage_text(text):
    # (3) replace anything other than letters with a space
    tweet = re.sub("[^a-zA-Z]", ' ', text)
    # (4) lowercase, then split into words
    tweet = tweet.lower().split()
    # (5) drop English stop words and lemmatize the remaining words
    tweet = [lem.lemmatize(word) for word in tweet
             if word not in STOP_WORDS]
    return ' '.join(tweet)

train['clean_text'] = train['text'].apply(massage_text)
test['clean_text'] = test['text'].apply(massage_text)

In [14]:
train.iloc[0:10][['text','clean_text']]

| | text | clean_text |
|---|---|---|
| 1386 | Ted Cruz fires back at Jeb Bush ÛÏWe lose be... | ted cruz fire back jeb bush lose republican li... |
| 4048 | This is the first year the Forest Service spen... | first year forest service spent half annual bu... |
| 3086 | lightseraphs pissed at you and could have thei... | lightseraphs pissed could pikachu electrocute |
| 272 | Im gonna fight Taylor as soon as I get there | im gonna fight taylor soon get |
| 7462 | NicolaClements4 IÛªm not sure that covering m... | nicolaclements sure covering head wound scab s... |
| 4778 | kabwandi Breaking news Unconfirmed I just hear... | kabwandi breaking news unconfirmed heard loud ... |
| 260 | The annihilation of Jeb Christie Kasich is le... | annihilation jeb christie kasich le hour away ... |
| 2921 | Wtf this mom just drowned her child | wtf mom drowned child |
| 2162 | MayorofLondon pls reduce cyclist deaths with a... | mayoroflondon pls reduce cyclist death compuls... |
| 4818 | DoctorDryadma mass murder here we come | doctordryadma mass murder come |


## 2-1. Word Tokenization and word2vec
- tokenize the clean text column

In [15]:
train['tokens'] = train['clean_text'].apply(lambda x: word_tokenize(x))
test['tokens'] = test['clean_text'].apply(lambda x: word_tokenize(x))

In [16]:
train.head()

| | keyword | text | target | clean_text | tokens |
|---|---|---|---|---|---|
| 1386 | bush%20fires | Ted Cruz fires back at Jeb Bush ÛÏWe lose be... | 0 | ted cruz fire back jeb bush lose republican li... | [ted, cruz, fire, back, jeb, bush, lose, repub... |
| 4048 | forest%20fires | This is the first year the Forest Service spen... | 1 | first year forest service spent half annual bu... | [first, year, forest, service, spent, half, an... |
| 3086 | electrocute | lightseraphs pissed at you and could have thei... | 0 | lightseraphs pissed could pikachu electrocute | [lightseraphs, pissed, could, pikachu, electro... |
| 272 | apocalypse | Im gonna fight Taylor as soon as I get there | 0 | im gonna fight taylor soon get | [im, gon, na, fight, taylor, soon, get] |
| 7462 | wounds | NicolaClements4 IÛªm not sure that covering m... | 0 | nicolaclements sure covering head wound scab s... | [nicolaclements, sure, covering, head, wound, ... |


#### Convert the words into vectors

In [17]:
# first, build the corpus (a list of token lists) used to train the word2vec model
def fn_pre_process_data(doc):
    for rec in doc:
        yield gensim.utils.simple_preprocess(rec)

corpus = list(fn_pre_process_data(train['clean_text']))
corpus += list(fn_pre_process_data(test['clean_text']))

In [18]:
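# initialise the embedding model
#   vector_size: dimensionality of the word vectors
#   window: maximum distance between the current and the predicted word
#   min_count: ignore words that occur fewer times than this
print('initiated ...')
# passing `corpus` to the constructor already builds the vocabulary and trains once
wv_model = Word2Vec(corpus,vector_size=150,window=3,min_count=2)
#wv_model.build_vocab(corpus)
# train for 10 more epochs on the same corpus
wv_model.train(corpus,total_examples=len(corpus),epochs=10)
#wv_model.save(data_path + 'word2vec.model')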

initiated ...


(581532, 681810)

In [19]:
# convert the train and test tokens into one vector per tweet
def get_word_embeddings(token_list, vector, k=150):
    if len(token_list) < 1:
        return np.zeros(k)

    # look up each token; out-of-vocabulary tokens get a random vector
    vectorized = [vector.wv[word] if word in vector.wv else np.random.rand(k)
                  for word in token_list]

    # `total` avoids shadowing the built-in sum()
    total = np.sum(vectorized, axis=0)
    ## return the average
    return total / len(vectorized)

def get_embeddings(tokens, vector):
    # use the model passed in as `vector` instead of the global wv_model
    embeddings = tokens.apply(lambda x: get_word_embeddings(x, vector))
    return list(embeddings)
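
The averaging step can be illustrated without gensim. The sketch below uses a made-up 3-dimensional lookup table (`toy_vectors` and `mean_embedding` are hypothetical names). It skips out-of-vocabulary tokens rather than substituting random vectors, which is a swapped-in alternative and not what the notebook's function does.

```python
def mean_embedding(tokens, vectors, k):
    """Average the vectors of the known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * k
    # zip(*known) groups the values dimension by dimension
    return [sum(dim) / len(known) for dim in zip(*known)]

# hypothetical 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "forest": [1.0, 0.0, 2.0],
    "fire": [3.0, 2.0, 0.0],
}

tweet_vec = mean_embedding(["forest", "fire", "zzz"], toy_vectors, k=3)
empty_vec = mean_embedding([], toy_vectors, k=3)

print(tweet_vec)
print(empty_vec)
```

Skipping unknown tokens keeps the tweet vector deterministic across runs, whereas random fill-ins add noise that differs on every call. Which choice scores better on this dataset is not something the results here establish.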

In [20]:
train_embeddings = get_embeddings(train['tokens'],wv_model)
test_embeddings = get_embeddings(test['tokens'],wv_model)

In [21]:
model_path = './model/'
wv_model.save(model_path + 'word2vec_1.model')

## 2-2. Convert text to vectors using CountVectorizer

### What is the Count Vectorizer?
- convert a collection of text documents to a matrix of token counts

### How to Use
```python
# python example code
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document", "This document is the second document", "And this is the third one"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
```
- `vectorizer.get_feature_names_out()`
> array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], ...)
- `X.toarray()`
> [[0 1 1 1 0 0 1 0 1]  
 [0 2 0 1 0 1 1 0 1]  
 [1 0 0 1 1 0 1 1 1]]
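
To make the token-count matrix less of a black box, the following sketch rebuilds the same vocabulary and counts with the standard library only. `count_matrix` is a hypothetical helper that approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters). It is not the library's implementation.

```python
import re
from collections import Counter

def count_matrix(docs):
    # lowercase, then keep tokens made of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    # the vocabulary is the sorted set of all tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
]
vocab, rows = count_matrix(docs)

print(vocab)
for row in rows:
    print(row)
```

Each row is one document and each column one vocabulary term, so `document` appearing twice in the second sentence gives the `2` in the second row.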

In [22]:
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=2000)

X = count_vectorizer.fit_transform(train["clean_text"]).toarray()
test_tmp = count_vectorizer.transform(test["clean_text"]).toarray()
y = train['target']

In [23]:
X_train = X
X_test = test_tmp
y_train = train['target']
y_test = test['target']

# 3. Model
## Contents
- train the model
    - RidgeClassifierCV
    - SGDClassifier
    - BernoulliNB
    - RandomForestClassifier

## Model Description
--------------
### Ensemble
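- An ensemble combines the predictions of several base estimators built with a given learning algorithm
    - in order to improve generalizability and robustness over a single estimator.
- Boosting is one type of ensemble method; the `RandomForestClassifier` used below is an averaging (bagging-style) ensemble instead.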

### Performance - f1-score
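
The F1-score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). A minimal sketch that derives all three from confusion counts. The helper name `precision_recall_f1` and the counts `tp`, `fp`, `fn` are made-up illustration values, not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # guard against division by zero when nothing is predicted correctly
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# hypothetical counts: 8 correct alarms, 2 false alarms, 8 missed disasters
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, which makes it a stricter summary than accuracy when the two classes are not equally easy to detect.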


## 3-2. Train the model

In [24]:
MLA = [
    #Ensemble Methods
    ensemble.RandomForestClassifier(),
    
    #GLM
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    
    #Naive Bayes
    naive_bayes.BernoulliNB()
    ]

# Comparing all machine learning algorithms (MLA)
## word2vec
- precision
- recall
- accuracy
- f1-score

In [25]:
train_embeddings = np.array(train_embeddings)
test_embeddings = np.array(test_embeddings)

In [30]:
# Comparing all machine learning algorithms on the word2vec embeddings
# (the metric functions were already imported in the first cell)

row_index = 0
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

for alg in MLA:
    predicted = alg.fit(train_embeddings, y_train).predict(test_embeddings)
    # roc_curve returns false positive rates, true positive rates and thresholds
    fp, tp, th = roc_curve(y_test, predicted)

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(train_embeddings, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(test_embeddings, y_test), 4)

    recall = recall_score(y_test, predicted)
    precision = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Precision'] = precision
    MLA_compare.loc[row_index, 'Recall'] = recall
    MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall), 4)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index += 1

In [31]:
# sort by F1-score
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)
MLA_compare

| | MLA used | Train Accuracy | Test Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|---|---|
| 2 | SGDClassifier | 0.7358 | 0.7183 | 0.682456 | 0.610675 | 0.6446 | 0.703193 |
| 0 | RandomForestClassifier | 0.989 | 0.7393 | 0.772727 | 0.533752 | 0.6314 | 0.710443 |
| 1 | RidgeClassifierCV | 0.7236 | 0.717 | 0.752451 | 0.481947 | 0.5876 | 0.683976 |
| 3 | BernoulliNB | 0.5924 | 0.6041 | 0.521092 | 0.659341 | 0.5821 | 0.611837 |


# Comparing all machine learning algorithms (MLA)
## Word count
- precision
- recall
- accuracy
- f1-score

In [28]:
# Comparing all machine learning algorithms on the word-count features
# (the metric functions were already imported in the first cell)

row_index = 0
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)

for alg in MLA:
    predicted = alg.fit(X_train, y_train).predict(X_test)
    # roc_curve returns false positive rates, true positive rates and thresholds
    fp, tp, th = roc_curve(y_test, predicted)

    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 4)

    recall = recall_score(y_test, predicted)
    precision = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Precision'] = precision
    MLA_compare.loc[row_index, 'Recall'] = recall
    MLA_compare.loc[row_index, 'F1-score'] = round((2*precision*recall)/(precision+recall), 4)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)

    row_index += 1

In [29]:
# sort by F1-score
MLA_compare.sort_values(by = ['F1-score'], ascending = False, inplace = True)
MLA_compare

| | MLA used | Train Accuracy | Test Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|---|---|
| 1 | RidgeClassifierCV | 0.8677 | 0.8017 | 0.833002 | 0.657771 | 0.7351 | 0.781481 |
| 2 | SGDClassifier | 0.9118 | 0.7833 | 0.759729 | 0.704867 | 0.7313 | 0.772298 |
| 3 | BernoulliNB | 0.8417 | 0.7978 | 0.827038 | 0.653061 | 0.7298 | 0.777434 |
| 0 | RandomForestClassifier | 0.98 | 0.7695 | 0.744027 | 0.684458 | 0.713 | 0.757579 |
