Link : https://www.kaggle.com/competitions/nlp-getting-started/overview/evaluation

## Blue Print

1. Check dataset
2. Cleaning
3. Preprocessing
4. Data Split (using stratified sampling)

## Error Function

$F_1 = 2\frac{precision * recall}{precision + recall}$ (1 is the best, 0 is the worst) where:
 
precision = $\frac{TP}{TP+FP}$, recall = $\frac{TP}{TP+FN}$

In [1]:
# sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', 
# sample_weight=None, zero_division='warn')
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

## 1. Data Investigation

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from nltk.corpus import stopwords

In [3]:
# Load dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
train.head()

(7613, 5) (3263, 4)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
# Types of values in each column
print(train.dtypes)

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object


In [5]:
print(train[train["target"] == 0]["text"].values[10])       # not a disaster tweet
print(train[train["target"] == 1]["text"].values[1])  # disaster tweet

No way...I can't eat that shit
Forest fire near La Ronge Sask. Canada


## 2. Data Cleaning

In [5]:
# Check percentage of missing values
print("Ratio of missing values in training dataset")
print("missing keyword: ", str(round(train["keyword"].isnull().sum()/train.shape[0], 2)))
print("missing location: ", str(round(train["location"].isnull().sum()/train.shape[0], 2)), "\n")

print("Ratio of missing values in testing dataset")
print("missing keyword: ", str(round(test["keyword"].isnull().sum()/test.shape[0], 2)))
print("missing location: ", str(round(test["location"].isnull().sum()/test.shape[0], 2)))

Ratio of missing values in training dataset
missing keyword:  0.01
missing location:  0.33 

Ratio of missing values in testing dataset
missing keyword:  0.01
missing location:  0.34


=> Since the test dataset also contains missing values in 'keyword' and 'location', we will drop these columns and use 'text' column only.

## 3. Preprocessing

In [6]:
# Drop irrelevant columns
train = train.drop(["keyword", "location"], axis=1)
test = test.drop(["keyword", "location"], axis=1)

train.head()

Unnamed: 0,id,text,target
0,1,Our Deeds are the Reason of this #earthquake M...,1
1,4,Forest fire near La Ronge Sask. Canada,1
2,5,All residents asked to 'shelter in place' are ...,1
3,6,"13,000 people receive #wildfires evacuation or...",1
4,7,Just got sent this photo from Ruby #Alaska as ...,1


In [28]:
# Data Split
X_train, X_test, y_train, y_test = train_test_split(train.loc[:, "id":"text"], train["target"], 
                                    test_size=0.3, random_state=0, stratify=train["target"])

print(X_train.shape, X_test.shape)

(5329, 2) (2284, 2)


In [31]:
# Store "id"
id_train = X_train["id"]
id_test = X_test["id"]
id_train.head()

4812    6849
2080    2988
652      944
3528    5043
3007    4321
Name: id, dtype: int64

In [63]:
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec

In [75]:
X_train["text"]

4812    @samanthaturne19 It was... Nagaski another act...
2080            IS ROSS DEAD NOOOOOOOOOOOO @MikeParrActor
652     My hair is poverty at the moment need to get a...
3528    DK Eyewitness Travel Guide: Denmark: travel gu...
3007    Good way to end the day!!! Geyser plus dust st...
                              ...                        
7248    Slightly doesn't help that he has Suh and Wake...
7606    Suicide bomber kills 15 in Saudi security site...
1885            BHAVANA'S MOM HAS CRUSHED EVERYONE'S SOUL
3865    @AWickedAssassin want to burst into flames! *A...
3397    If Ryan doesn't release new music soon I might...
Name: text, Length: 5329, dtype: object

In [113]:
# Get a set of stopwords
stops = set(stopwords.words('english'))

# Iterate through each sentence in the file
words= []

for n in range(X_train["text"].size):
    
    for i in sent_tokenize(X_train["text"].iat[n,]):
        temp = []

    # Tokenize the sentence into words
        for j in word_tokenize(i):
            # Remove stopwords
            if j not in stops:
                temp.append(j.lower())
 
        words.append(temp)
 

In [114]:
# Create CBOW model
model1 = gensim.models.Word2Vec(words, min_count = 1, vector_size = 100, window = 5)
 
# Create Skip Gram model
model2 = gensim.models.Word2Vec(words, min_count = 1, vector_size = 100, window = 5, sg = 1)

In [115]:
# Training models
model1.train(X_train["text"].to_list(), total_examples=X_train["text"].size, epochs=100)
model2.train(X_train["text"].to_list(), total_examples=X_train["text"].size, epochs=100)

(37346701, 53788100)

In [116]:
vector1 = model1.wv['the']
print(vector1.shape)
vector1

(100,)


array([-0.17201994,  0.82599264,  0.58110154,  0.3490145 , -0.28057832,
       -1.1681844 ,  0.44197983,  1.790733  , -1.2729242 , -0.6335378 ,
       -0.13763733, -1.2766062 , -0.16552702,  0.42283386,  0.2751394 ,
       -0.8845238 ,  0.18137072, -0.6120297 , -0.20170568, -2.229894  ,
        0.22716671,  0.18639818,  0.9284154 , -0.50896245,  0.28298327,
        0.10083327, -0.6911999 , -0.11858609, -1.2433386 ,  0.56876373,
        1.140331  , -0.3805148 ,  0.46324775, -0.928187  , -0.2536032 ,
        0.5832042 ,  0.2582204 , -0.13615142,  0.06893081, -1.023462  ,
        0.09030799, -0.65809274, -1.0658691 ,  0.40708056,  0.66919833,
       -0.1484067 , -0.6610274 ,  0.2436525 ,  0.26382402,  0.3381059 ,
        0.3667162 , -0.73021173, -0.4292996 , -0.31083685, -0.5745647 ,
        0.30175442,  0.6702341 , -0.47154623, -1.026466  ,  0.1862987 ,
        0.2066574 , -0.02472978,  0.46969846, -0.23072624, -1.1395023 ,
        1.1983379 ,  0.47830915,  0.31281734, -0.81926775,  0.78

In [122]:
vector2 = model2.wv['a']
print(vector2.shape)
vector2

(100,)


array([-0.0033361 ,  0.4520584 ,  0.2187106 ,  0.31054527, -0.31159282,
       -0.5085897 ,  0.266268  ,  0.79013944, -0.4125841 , -0.15072934,
       -0.04414058, -0.57939327, -0.08941706,  0.21201734,  0.2507787 ,
       -0.4503934 ,  0.20949388, -0.05029261, -0.18786691, -0.99376774,
        0.09160941,  0.0572008 ,  0.5907578 , -0.32889533,  0.13718675,
       -0.09445567, -0.5245663 , -0.08533747, -0.13559638,  0.23782782,
        0.5938129 , -0.24731852,  0.01723937, -0.37664297, -0.13251756,
        0.22409038,  0.08590531,  0.1410172 ,  0.06434929, -0.49022636,
       -0.38383722, -0.49251467, -0.509529  ,  0.31234667,  0.01470367,
       -0.19196014, -0.37103054,  0.00327496, -0.12410864,  0.28737208,
        0.11301386, -0.7354538 , -0.64497966, -0.0422104 , -0.17666326,
        0.20426291,  0.43987328, -0.2825645 , -0.550254  ,  0.10979778,
        0.08163922,  0.10014983,  0.2590056 ,  0.1313418 , -0.4210227 ,
        0.75438267,  0.17566605,  0.09868337, -0.58332705,  0.50

In [136]:
## Validate

# words # list of words (stopwords are removed)
# stops # list of stopwords
# model1.wv['a'] # vetor for the word 'a'

vect_pred1 = np.zeros(X_train.shape[1])
# vect_pred2 = np.zeros(X_train.shape[1])


for n in range(X_train["text"].size):   # each sentence
    vector_for_sentence = np.zeros(100)
    for i in sent_tokenize(X_train["text"].iat[n,]):    # each word
        if i in words:
            vector_for_sentence += model1.wv[i]         # sum of vectors
    vect_pred1[n] = vector_for_sentence

In [None]:
print(vect_pred1.shape)
vect_pred1

In [None]:
# Reshape
Xtr = vect_pred1.reshape(vect_pred1.shape[0], vect_pred1.shape[1]*Xtr.shape[2])
Xtr = np.hstack([np.ones((vect_pred1.shape[0],1)), vect_pred1[:,:] ])   # add intercepts(ones)
Ytr = y_train.reshape(y_train.shape[0], 1) 
print("Xtr: " + str(Xtr.shape), " Ytr: "+ str(Ytr.shape))

In [None]:
# Classification (Logistic Regression)

# Make a prediction with weights
def predict(x, w):
	z = w.dot(x)
	return 1.0 / (1.0 + exp(-z))

# Estimate coefficients using stochastic gradient descent
def train_weights(X, y, l_epoch_span, epoch_size, weights, threshold=0.002):
  n, m    = X.shape   # n= , m=
  batch_size = 25

  for batch in range(epoch_size):  # batch = 0, 1, , ..., 49
    l_rate = l_epoch_span[batch]  # learning rate

    # Randomly select 25 numbers
    arr = np.arange(n)
    indices = np.random.choice(arr, size=batch_size)  

    sum_error = 0   # summed errors from each batch
    for b in range(batch_size):  # b = 0, 1, ..., 24
      ind = indices[b]
      prediction = predict(X[ind,:], weights)
      error = abs(prediction - y[ind])
      sum_error += error
      weights = weights - 1.00 * l_rate * (sum_error / batch_size) * X[ind,:]

    print('sum_error at batch #' + str(batch) + ' is ', str(sum_error))
  
    if sum_error<threshold:
      break

  return weights

In [None]:
epoch_size = 50
n_span = np.arange(epoch_size)
l_epoch_span = 1/((1+(2*n_span))**3)    # list of learning rates
init_weights = np.zeros((1,Xtr.shape[1]))
weights = train_weights(Xtr, Ytr, l_epoch_span, epoch_size, init_weights) 

In [None]:
# Validate
for i in range(Xtr.shape[0]):
    Ypred1 = predict(Xtr[i,:], weights)

print(Ypred1.size, type(Ypred1))

_____

In [37]:
# Convert a collection of text documents to a matrix of token counts
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train["text"].values)
count_test = count_vectorizer.transform(X_test["text"].values)

print(count_train.shape, count_train[10].shape)

(5329, 16619)


In [91]:
count_vectorizer.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [47]:
count_train[5000]

<1x16619 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

## 4. 

text preprocessing
- Word2Vec
- tweettoeknizer 
- tokenize -> lower -> Counter() and .most_common()
- remove stop words stopwords.word('english')? english_stops?