<a href="https://colab.research.google.com/github/ortiz363/csci5541_tutorial1/blob/main/Tutorial_1_Extract_Textual_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset can be downloaded from: https://www.kaggle.com/c/nlp-getting-started/data

Drive link: https://drive.google.com/drive/folders/1vIUiuyf0QmZdwBJG6zlhtRh6j7acxrHQ?usp=share_link

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0. You’ll have access to a dataset of 10,000 tweets that were hand classified. 
Each sample in the train and test set has the following information:

1. A keyword from that tweet (although this may be blank!)
2. The location the tweet was sent from (may also be blank)
3. The text of a tweet

In [1]:
# thanks to : https://www.kaggle.com/code/mariapushkareva/nlp-disaster-tweets-with-glove-and-lstm/notebook
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

import re
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    classification_report,
    accuracy_score
)

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Loading the dataset ###

In [5]:
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")

In [6]:
train_df.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

In [7]:
train_df.shape

(7613, 5)

In [8]:
test_df.columns

Index(['id', 'keyword', 'location', 'text'], dtype='object')

In [9]:
test_df.shape

(3263, 4)

In [10]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [11]:
train_df = train_df.dropna(subset=['text'])  #Drop only if NaN in specific column 'text'
test_df = test_df.dropna(subset=['text']) 

In [12]:
train_df.shape # evidently everything is present in the text column!

(7613, 5)

What does a disaster tweet look like?

In [13]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

What does a non-disaster tweet look like?

In [14]:
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

Is the dataset balanced or imbalanced?

In [15]:
balance_counts = train_df.groupby('target')['target'].agg('count').values
balance_counts # [non-disaster, disaster]: about 57% vs 43%, so mildly imbalanced

array([4342, 3271])

### Text Cleaning ###

Removing tweet-specific special symbols

In [16]:
#Special thanks to https://www.kaggle.com/tanulsingh077 for this function
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # links
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    return text

In [17]:
train_df['text_clean'] = train_df['text'].apply(clean_text)
train_df.head()

Unnamed: 0,id,keyword,location,text,target,text_clean
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,our deeds are the reason of this earthquake ma...
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,all residents asked to shelter in place are be...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,just got sent this photo from ruby alaska as s...


Stopwords are very common English words (such as "the", "is" and "and") that carry little meaning on their own in a sentence, so we remove them before classification.



In [18]:
stop_words = stopwords.words('english')
more_stopwords = ['u', 'im', 'c']
stop_words = stop_words + more_stopwords

def remove_stopwords(text):
    text = ' '.join(word for word in text.split(' ') if word not in stop_words)
    return text
    
train_df['text_clean'] = train_df['text_clean'].apply(remove_stopwords)
train_df.head()

Unnamed: 0,id,keyword,location,text,target,text_clean
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deeds reason earthquake may allah forgive us
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,residents asked shelter place notified officer...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders ca...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby alaska smoke wildfires pou...


Additional cleaning steps such as stemming, lemmatization or demojization (converting emoji to text) are possible, but in this tutorial we will stick to stopword removal.

Target encoding: what if the labels were not 0 and 1 but text, such as "ham" and "spam"? A label encoder maps such labels to integers. Here the encoding changes nothing, because the labels are already 0 and 1.

In [19]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_df['target'])

train_df['target_encoded'] = le.transform(train_df['target'])
train_df.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,target_encoded
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deeds reason earthquake may allah forgive us,1
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1,residents asked shelter place notified officer...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders ca...,1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby alaska smoke wildfires pou...,1
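
To make the "ham"/"spam" case concrete, here is a minimal sketch on a made-up label list (the labels and the `le_demo` name are illustrative and not part of the dataset). `LabelEncoder` sorts the distinct labels and maps them to consecutive integers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical textual labels, not taken from the tweets dataset
labels = ["ham", "spam", "ham", "spam", "spam"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# Classes are stored in sorted order, so 'ham' maps to 0 and 'spam' to 1
print(list(le_demo.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original text labels
decoded = list(le_demo.inverse_transform(encoded))
print(decoded)
```

Keeping the fitted encoder around is what lets you translate model predictions back into readable labels later.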


### Building Vectors ###

Currently, each tweet is a cleaned string of space-separated words, and we need to convert each of those tweets into a vector that scikit-learn's models can work with.

We'll do that using the bag-of-words model. The step used in this tutorial is counting:

Count how many times a word occurs in each tweet (known as term frequency).
Each vector will have as many dimensions as there are unique words in the tweet corpus. We will first use scikit-learn's CountVectorizer, which converts a collection of text documents to a matrix of token counts.

We can imagine this as a 2-dimensional matrix, where one dimension is the entire vocabulary and the other is the documents (here, the tweets). scikit-learn stores it with one row per tweet and one column per vocabulary word, which is why the matrices below have shapes such as 1904x13742.
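
The counting idea can be sketched in plain Python before handing it to scikit-learn. The two documents below are made up for illustration, and the layout (one row per document, one column per vocabulary word) is an assumption chosen to match what CountVectorizer returns.

```python
from collections import Counter

# Toy documents (made up for illustration)
docs = ["forest fire near town", "fire fire everywhere"]

# Vocabulary: every unique word in the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny case, which is why real document-term matrices are stored in a sparse format.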

In [20]:
# how to define X and y for use with COUNTVECTORIZER
x = train_df['text_clean']
y = train_df['target_encoded']

print(len(x), len(y))

7613 7613


In [21]:
# Split into train and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
print(len(x_train), len(y_train))
print(len(x_test), len(y_test))

5709 5709
1904 1904


In [22]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer
vect = CountVectorizer()
vect.fit(x_train)

CountVectorizer()

In [23]:
# Use the fitted vectorizer to create document-term matrices from the train and test sets
x_train_dtm = vect.transform(x_train)
x_test_dtm = vect.transform(x_test)

In [24]:
x_test_dtm

<1904x13742 sparse matrix of type '<class 'numpy.int64'>'
	with 13519 stored elements in Compressed Sparse Row format>

### Tuning the vectorizer ###

CountVectorizer's tunable parameters:

stop_words: Since CountVectorizer just counts the occurrences of each word in its vocabulary, extremely common words like 'the' and 'and' become prominent features even though they add little meaning to the text. Your model can often be improved if you leave those words out. Stop words are simply a list of words you don't want to use as features. Set stop_words='english' to use a built-in list, or pass a custom list. This parameter defaults to None.

ngram_range: An n-gram is a sequence of n consecutive words. For example, the sentence 'I am Groot' contains the 2-grams 'I am' and 'am Groot', and the sentence itself is a 3-gram. Set ngram_range=(a, b), where a is the minimum and b is the maximum size of the n-grams you want to include in your features. The default ngram_range is (1, 1). In a recent project where I modeled job postings online, I found that including 2-grams as features boosted my model's predictive power significantly. This makes intuitive sense; many job titles such as 'data scientist', 'data engineer', and 'data analyst' are 2 words long.

min_df, max_df: These are the minimum and maximum document frequencies a word or n-gram must have to be used as a feature. If either parameter is set to an integer, it is an absolute bound on the number of documents the term must appear in. If either is set to a float, it is interpreted as a proportion of documents instead. min_df defaults to 1 (int) and max_df defaults to 1.0 (float).

max_features: The CountVectorizer keeps only the max_features terms that occur most frequently across the corpus in its vocabulary and drops everything else.

You set these parameters when initializing your CountVectorizer object, as shown below.

In [25]:
vect_tuned = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=0.1, max_df=0.7, max_features=100)
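
To see what `ngram_range` and `min_df` do in practice, here is a small sketch on a made-up three-document corpus (the documents and the names `vect_demo` and `bigram_vect` are illustrative assumptions, not part of the tutorial data).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
docs = ["data scientist job", "data engineer job", "data analyst role"]

# Unigrams and bigrams, keeping only terms that appear in at least 2 documents
vect_demo = CountVectorizer(ngram_range=(1, 2), min_df=2)
vect_demo.fit(docs)
kept_terms = sorted(vect_demo.vocabulary_)
print(kept_terms)

# Bigrams only, with the default min_df=1 so every bigram is kept
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(docs)
bigrams = sorted(bigram_vect.vocabulary_)
print(bigrams)
```

Every bigram here occurs in only one document, so `min_df=2` removes all of them and leaves just the two unigrams that are shared across documents.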

**Building vectors**

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it is about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.

Note: a vector is, in this context, a set of numbers that a machine learning model can work with. 

In [26]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [27]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that:

1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens: all of the non-zero counts above are the tokens that DO exist in the first tweet.

Now let's create vectors for all of our tweets.
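
Before vectorizing everything, it can help to see which tokens the non-zero entries stand for. The sketch below uses two tweets quoted earlier in this notebook and a separate, illustratively named `demo_vectorizer`; `inverse_transform` lists the tokens with non-zero counts in each row.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tweets quoted earlier in this notebook
docs = ["Forest fire near La Ronge Sask. Canada", "I love fruits"]

demo_vectorizer = CountVectorizer()
demo_vectors = demo_vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index
vocab_size = len(demo_vectorizer.vocabulary_)
print(vocab_size)

# inverse_transform returns, for each row, the tokens with non-zero counts
first_tweet_tokens = sorted(str(t) for t in demo_vectorizer.inverse_transform(demo_vectors[0])[0])
print(first_tweet_tokens)
```

Note that the default tokenizer lowercases the text and drops single-character tokens, so "I" never becomes a feature.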

In [28]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
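
The difference between `.fit_transform()` and `.transform()` is easiest to see on a toy example. In this sketch (made-up texts, illustrative names) the test document contains a word that never appears in training.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["forest fire near town", "i love fruits"]  # toy training texts
test_docs = ["fire near volcano"]                         # 'volcano' never appears in training

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and counts tokens
test_matrix = vec.transform(test_docs)        # reuses the learned vocabulary

# Both matrices have the same number of columns
print(train_matrix.shape, test_matrix.shape)

# The unseen word is not a feature, so it is silently ignored
print("volcano" in vec.vocabulary_)

# Only 'fire' and 'near' are counted in the test document
print(test_matrix.sum())
```

Calling `.fit_transform()` on the test texts instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.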

### Simple Linear Model ###

As we mentioned above, we think the words contained in each tweet are a good indicator of whether it is about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a linear connection, so let's build a linear model and see what the baseline result is. We use scikit-learn's RidgeClassifier, which first converts the target values into {-1, 1} and then treats the problem as a regression task (multi-output regression in the multiclass case).



In [29]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
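
The "regression, then sign" behaviour of the ridge classifier can be checked on a tiny made-up count matrix. This is only a sketch: the numbers and the names `ridge` and `manual_pred` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Toy count matrix: 4 documents x 3 words (made-up numbers)
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 2, 0],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])

ridge = RidgeClassifier()  # default regularization strength alpha=1.0
ridge.fit(X, y)

# decision_function returns the regression output; its sign decides the class
scores = ridge.decision_function(X)
manual_pred = ridge.classes_[(scores > 0).astype(int)]

print(scores)
print(manual_pred, ridge.predict(X))
```

A positive score maps to the second class (here 1) and a non-positive score to the first (here 0), which is what `predict` does in the binary case.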

Let's test our model and see how well it does on the training data. For this we'll use cross-validation - where we train on a portion of the known data, then validate it with the rest. If we do this several times (with different portions) we can get a good idea for how a particular model or method performs.

Let us use the F1 metric, the harmonic mean of precision and recall.

In [30]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59453669, 0.56498283, 0.64082434])
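
For reference, F1 can be computed by hand from the true positives, false positives and false negatives. The labels below are made up for illustration and are not model output.

```python
# Toy labels (made up): 1 = disaster, 0 = not a disaster
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn)
print(precision, recall, f1)
```

Because F1 ignores true negatives, it only rewards a model for how well it handles the positive (disaster) class.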

### Multinomial Naive Bayes ###

Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.

Naïve Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.
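
A minimal sketch of multinomial Naive Bayes on word counts, using four made-up training texts (the texts, labels and the `toy_model` name are illustrative assumptions). With the default smoothing (alpha=1.0), the probability of each word given a class is (count of the word in that class + 1) / (total words in that class + vocabulary size).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up): 1 = disaster, 0 = not a disaster
train_texts = ["forest fire spreading", "earthquake hits city",
               "i love fruits", "love this sunny day"]
train_labels = [1, 1, 0, 0]

toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(train_texts, train_labels)

# 'near' was never seen in training, so only 'fire' and 'city' contribute
new_text = ["fire near city"]
pred = toy_model.predict(new_text)[0]
prob_disaster = toy_model.predict_proba(new_text)[0, 1]
print(pred, prob_disaster)
```

Both classes have 6 training words and the vocabulary has 11, so 'fire' and 'city' each have probability 2/17 under the disaster class and 1/17 under the other. With equal priors the posterior odds are 4 to 1, giving a disaster probability of 0.8.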


In [31]:
# Create a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [32]:
scores = model_selection.cross_val_score(nb, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.66763006, 0.6557971 , 0.72431507])

The above scores aren't terrible! The cross-validated F1 is roughly 0.66 to 0.72, so we can expect a leaderboard score in that neighborhood. There are lots of ways to potentially improve on this (TF-IDF, LSTMs / RNNs, etc.), so give any of them a shot!
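
As a starting point for the TF-IDF suggestion, here is a small sketch with scikit-learn's `TfidfVectorizer` on a made-up corpus (the documents and variable names are illustrative). It down-weights words that appear in many documents and, by default, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up); 'fire' appears in every document
docs = ["fire forest", "fire city", "fire love"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

fire_col = tfidf.vocabulary_["fire"]
city_col = tfidf.vocabulary_["city"]

# In the second document both words occur once, but the rarer word gets the larger weight
print(weights[1, fire_col], weights[1, city_col])

# With the default norm='l2', every row has unit length
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)
```

`TfidfVectorizer` follows the same fit/transform pattern as `CountVectorizer`, so it can be tried in the cells above with little other change.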

In the meantime, let's fit the model on our training split and make predictions on the held-out split.

### Predictions and Test Metrics with NB model ###

In [33]:
# Train the model
nb.fit(x_train_dtm, y_train)

MultinomialNB()

In [34]:
# Make class and probability predictions
y_pred_class = nb.predict(x_test_dtm)
y_pred_prob = nb.predict_proba(x_test_dtm)[:, 1]

In [35]:
# calculate accuracy of class predictions
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.7951680672268907


In [36]:
import plotly.figure_factory as ff

x_axes = ['Normal', 'Disaster']
y_axes =  ['Disaster', 'Normal']

def conf_matrix(z, x=x_axes, y=y_axes):
    
    z = np.flip(z, 0)

    # change each element of z to type string for annotations
    # (loop names chosen so they do not shadow the x and y axis-label arguments)
    z_text = [[str(value) for value in row] for row in z]

    # set up figure 
    fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text, colorscale='Viridis')

    # add title
    fig.update_layout(title_text='<b>Confusion matrix</b>',
                      xaxis = dict(title='Predicted value'),
                      yaxis = dict(title='Real value')
                     )

    # add colorbar
    fig['data'][0]['showscale'] = True
    
    return fig

In [37]:
conf_matrix(metrics.confusion_matrix(y_test, y_pred_class))
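
To read the numbers behind the heatmap, note that scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, with labels in sorted order. The sketch below uses made-up labels, not the model's predictions.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy labels (made up): 1 = disaster, 0 = normal
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)

# Accuracy is the diagonal (correct predictions) divided by the total
acc = accuracy_score(y_true, y_pred)
print(acc, (tp + tn) / cm.sum())
```

The off-diagonal cells show which kind of mistake the model makes more often, something a single accuracy number hides.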

In [38]:
# Calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.8542542528999992
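
One way to interpret the AUC value: it is the probability that a randomly chosen disaster tweet receives a higher predicted score than a randomly chosen non-disaster tweet. The sketch below checks this on four made-up scores by comparing a direct pairwise count with `roc_auc_score`.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities for class 1

pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]

# Fraction of positive/negative pairs ranked correctly (ties count as half)
pairs = [(p, n) for p in pos for n in neg]
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```

This is why the AUC is computed from `predict_proba` scores rather than from hard class predictions: it measures ranking quality across all thresholds.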