Link : https://www.kaggle.com/competitions/nlp-getting-started/overview/evaluation

## Blue Print

1. Check dataset
2. Cleaning
3. Preprocessing
4. Data Split (using stratified sampling)

## Error Function

$F_1 = 2\frac{precision * recall}{precision + recall}$ (1 is the best, 0 is the worst) where:
 
precision = $\frac{TP}{TP+FP}$, recall = $\frac{TP}{TP+FN}$

In [1]:
# sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', 
# sample_weight=None, zero_division='warn')
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

## 1. Data Investigation

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from nltk.corpus import stopwords

In [2]:
# Load dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
train.head()

(7613, 5) (3263, 4)


Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
# Types of values in each column
print(train.dtypes)

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object


In [4]:
print(train[train["target"] == 0]["text"].values[10])       # not a disaster tweet
print(train[train["target"] == 1]["text"].values[1])  # disaster tweet

No way...I can't eat that shit
Forest fire near La Ronge Sask. Canada


## 2. Data Cleaning

In [5]:
# Check percentage of missing values
print("Ratio of missing values in training dataset")
print("missing keyword: ", str(round(train["keyword"].isnull().sum()/train.shape[0], 2)))
print("missing location: ", str(round(train["location"].isnull().sum()/train.shape[0], 2)), "\n")

print("Ratio of missing values in testing dataset")
print("missing keyword: ", str(round(test["keyword"].isnull().sum()/test.shape[0], 2)))
print("missing location: ", str(round(test["location"].isnull().sum()/test.shape[0], 2)))

Ratio of missing values in training dataset
missing keyword:  0.01
missing location:  0.33 

Ratio of missing values in testing dataset
missing keyword:  0.01
missing location:  0.34


=> Since the test dataset also contains missing values in 'keyword' and 'location', we will drop these columns and use 'text' column only.

## 3. Preprocessing

In [6]:
# Drop irrelevant columns
train = train.drop(["keyword", "location"], axis=1)
test = test.drop(["keyword", "location"], axis=1)

train.head()

Unnamed: 0,id,text,target
0,1,Our Deeds are the Reason of this #earthquake M...,1
1,4,Forest fire near La Ronge Sask. Canada,1
2,5,All residents asked to 'shelter in place' are ...,1
3,6,"13,000 people receive #wildfires evacuation or...",1
4,7,Just got sent this photo from Ruby #Alaska as ...,1


In [172]:
train.loc[:, ["target"]]

Unnamed: 0,target
0,1
1,1
2,1
3,1
4,1
...,...
7608,1
7609,1
7610,1
7611,1


In [173]:
# Data Split
X_train, X_test, y_train, y_test = train_test_split(train.loc[:, "id":"text"], train.loc[:, ["target"]], 
                                    test_size=0.3, random_state=0, stratify=train.loc[:, ["target"]])

print(X_train.shape, X_test.shape)

(5329, 2) (2284, 2)


In [174]:
# Store "id"
id_train = X_train["id"]
id_test = X_test["id"]
id_train

4812     6849
2080     2988
652       944
3528     5043
3007     4321
        ...  
7248    10378
7606    10866
1885     2708
3865     5496
3397     4864
Name: id, Length: 5329, dtype: int64

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
 
warnings.filterwarnings(action = 'ignore')
 
import gensim
from gensim.models import Word2Vec

In [175]:
# Get a set of stopwords
stops = set(stopwords.words('english'))

# Iterate through each sentence in the file
words= []

for n in range(X_train["text"].size):
    
    for i in sent_tokenize(X_train["text"].iat[n,]):
        # temp = []

    # Tokenize the sentence into words
        for j in word_tokenize(i):
            # Remove stopwords
            if j.lower() not in stops:
                # temp.append(j.lower())
                words.append(j.lower())
 
        # words.append(temp)
 

In [12]:
# Create CBOW model
model1 = gensim.models.Word2Vec(words, min_count = 1, vector_size = 100, window = 5)
 
# Create Skip Gram model
model2 = gensim.models.Word2Vec(words, min_count = 1, vector_size = 100, window = 5, sg = 1)

In [13]:
# Training models
model1.train(X_train["text"].to_list(), total_examples=X_train["text"].size, epochs=100)
model2.train(X_train["text"].to_list(), total_examples=X_train["text"].size, epochs=100)

(37343259, 53788100)

In [19]:
vector1 = model1.wv['sky']
print(vector1.shape)
vector1

(100,)


array([-0.02004962,  0.10037511,  0.07350621,  0.04939653, -0.03476002,
       -0.16195391,  0.04609038,  0.21808012, -0.16392478, -0.08426891,
       -0.02270396, -0.15489374, -0.02817422,  0.06093396,  0.03923214,
       -0.10855886,  0.01493815, -0.08206686, -0.02305904, -0.29797217,
        0.02627587,  0.02536908,  0.10857567, -0.05503755,  0.03082208,
        0.01130182, -0.09580877, -0.00886846, -0.14580329,  0.06464776,
        0.14711061, -0.05381641,  0.05042885, -0.11926185, -0.03723684,
        0.07290098,  0.03564359, -0.01150773,  0.01626032, -0.13782905,
        0.01676124, -0.0866912 , -0.12689015,  0.04537553,  0.09070694,
       -0.02262278, -0.10084937,  0.03906291,  0.04117893,  0.04786964,
        0.04090618, -0.09457989, -0.05056809, -0.0350938 , -0.06172173,
        0.02995279,  0.09060061, -0.07169975, -0.13605106,  0.01613214,
        0.01843433, -0.00498997,  0.05055493, -0.02341642, -0.14718884,
        0.15326908,  0.06144311,  0.03757308, -0.10605472,  0.10

In [142]:
## Validate

# words # list of words (stopwords are removed)
# stops # list of stopwords
# model1.wv['a'] # vetor for the word 'a'

vect_pred1 = [0]*X_train.shape[0]
# vect_pred2 = [None]*X_train.shape[0]

for n in range(X_train["text"].size):   # n: 0 ~ 5328
    vector_for_sentence = np.zeros((100,100))
    for s in sent_tokenize(X_train["text"].iat[n,]):    # each sentence
        for w in word_tokenize(s):     # each word
            if w.lower() in words:
                vector_for_sentence += model1.wv[w.lower()]         # sum of word vectors
        vect_pred1[n] = vector_for_sentence     # replace to sentence vector

In [143]:
vect_pred1 = np.array(vect_pred1)
print(vect_pred1.shape, vect_pred1[5328].shape)
vect_pred1[6]


(5329, 100, 100) (100, 100)


array([[ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323],
       [ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323],
       [ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323],
       ...,
       [ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323],
       [ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323],
       [ 1.87824123,  3.64435208,  4.23504359, ..., -2.26995379,
         1.73575195, -1.97714323]])

In [178]:
# Reshape
Xtr1 = vect_pred1.reshape(vect_pred1.shape[0], vect_pred1.shape[1]*vect_pred1.shape[2])
Xtr1 = np.hstack([np.ones((Xtr1.shape[0],1)), Xtr1[:,:] ])   # add intercepts(ones)
Ytr1 = y_train
print("Xtr1: " + str(Xtr1.shape), " Ytr1: "+ str(Ytr1.shape))

Xtr1: (5329, 10001)  Ytr1: (5329, 1)


In [181]:
# Classification (Logistic Regression)

# Make a prediction with weights
def predict(x, w):
	z = w.dot(x)
	return 1.0 / (1.0 + np.exp(-z))

# Estimate coefficients using stochastic gradient descent
def train_weights(X, y, l_epoch_span, epoch_size, weights, threshold=0.002):
  n, m    = X.shape   # n= , m=
  batch_size = 25

  for batch in range(epoch_size):  # batch = 0, 1, , ..., 49
    l_rate = l_epoch_span[batch]  # learning rate

    # Randomly select 25 numbers
    arr = np.arange(n)
    indices = np.random.choice(arr, size=batch_size)  

    sum_error = 0   # summed errors from each batch
    for b in range(batch_size):  # b = 0, 1, ..., 24
      ind = indices[b]
      prediction = predict(X[ind,:], weights)
      error = abs(prediction - y[ind])
      sum_error += error
      weights = weights - 1.00 * l_rate * (sum_error / batch_size) * X[ind,:]

    print('sum_error at batch #' + str(batch) + ' is ', str(sum_error))
  
    if sum_error<threshold:
      break

  return weights

In [182]:
epoch_size = 50
n_span = np.arange(epoch_size)
l_epoch_span = 1/((1+(2*n_span))**3)    # list of learning rates
init_weights = np.zeros((1,Xtr1.shape[1]))
weights = train_weights(Xtr1, Ytr1, l_epoch_span, epoch_size, init_weights) 

KeyError: 138

In [None]:
# Validate
for i in range(Xtr1.shape[0]):
    Ypred1 = predict(Xtr1[i,:], weights)

print(Ypred1.size, type(Ypred1))

_____

In [37]:
# Convert a collection of text documents to a matrix of token counts
count_vectorizer = CountVectorizer(stop_words="english")
count_train = count_vectorizer.fit_transform(X_train["text"].values)
count_test = count_vectorizer.transform(X_test["text"].values)

print(count_train.shape, count_train[10].shape)

(5329, 16619)


In [91]:
count_vectorizer.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [47]:
count_train[5000]

<1x16619 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

## 4. 

text preprocessing
- Word2Vec
- tweettoeknizer 
- tokenize -> lower -> Counter() and .most_common()
- remove stop words stopwords.word('english')? english_stops?