## Data Cleaning
Tweets usually need to be cleaned before modelling. We will do some basic cleaning: removing URLs, HTML tags, emojis and punctuation, and correcting spelling. Let's start.

In [4]:
import numpy as np
import pandas as pd
import string
from nltk.corpus import stopwords
stop=set(stopwords.words('english'))
import re
from nltk.tokenize import word_tokenize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
from keras.models import Sequential
from keras.layers import Embedding,LSTM,Dense,SpatialDropout1D
from keras.initializers import Constant
from sklearn.model_selection import train_test_split
from keras.optimizers import Adam

#Load the train and test set from the csv files
tweet= pd.read_csv('../dataset/train.csv')
test= pd.read_csv('../dataset/test.csv')

In [5]:
# concat the train and test dataframes
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
df=pd.concat([tweet,test])

#testing
df.shape
df.head(10)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1.0 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1.0 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1.0 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1.0 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1.0 |
| 5 | 8 | | | #RockyFire Update => California Hwy. 20 closed... | 1.0 |
| 6 | 10 | | | #flood #disaster Heavy rain causes flash flood... | 1.0 |
| 7 | 13 | | | I'm on top of the hill and I can see a fire in... | 1.0 |
| 8 | 14 | | | There's an emergency evacuation happening now ... | 1.0 |
| 9 | 15 | | | I'm afraid that the tornado is coming to our a... | 1.0 |


### Removing URLs

In [6]:
def remove_URL(text):
    # Compile a regular expression pattern into a regular expression object, 
    # which can be used for substitution later
    
    # '?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. 
    # ab? will match either ‘a’ or ‘ab’.
    
    # '\S' Matches any character which is not a whitespace character. 
    
    # '+' Causes the resulting RE to match 1 or more repetitions of the preceding RE. 
    # ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match 
    # just ‘a’.

    # A|B, where A and B can be arbitrary REs, creates a regular expression that 
    # will match either A or B
    
    # (Dot.) In the default mode, this matches any character except a newline
    
    # https://docs.python.org/3/library/re.html#re.compile
    url = re.compile(r'https?://\S+|www\.\S+')

    # Return the string obtained by replacing the leftmost non-overlapping occurrences 
    # of pattern in string by the replacement ''
    # https://docs.python.org/3/library/re.html#re.sub
    return url.sub(r'',text)

#checking
remove_URL("New competition launched :https://www.kaggle.com/c/nlp-getting-started")

'New competition launched :'
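The pattern treats everything up to the next whitespace as part of the URL, so punctuation glued to the end of a link is removed along with it, and the spaces on either side are left behind. A small check of that behavior, using made-up strings that are not from the dataset:

```python
import re

url = re.compile(r'https?://\S+|www\.\S+')

# made-up sample strings, not taken from the dataset
samples = [
    "see http://t.co/abc123 now",
    "docs at www.example.com/page, thanks",
    "no link here",
]
cleaned = [url.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

If the leftover double spaces matter, they can be collapsed afterwards; the later tokenization step splits on whitespace anyway.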

In [7]:
# apply the remove_URL function on every text of df
# keywords: apply lambda
df['text']=df['text'].apply(lambda x : remove_URL(x))

### Removing HTML tags

In [14]:
def remove_html(text):
    # Compile a regular expression pattern into a regular expression object, 
    # which can be used for substitution later
    
    
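    # '*' Causes the resulting RE to match 0 or more repetitions of the preceding RE, 
    # as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ 
    # followed by any number of ‘b’s.
    # Adding '?' after '*' makes the match non-greedy, so '<.*?>' stops at the 
    # first '>' and matches one tag at a time.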

    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

#checking
example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""
remove_html(example)

'\nReal or Fake\nKaggle \ngetting started\n'
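To see why the `?` in `<.*?>` matters, here is a small comparison of the non-greedy pattern with its greedy counterpart on a made-up one-line snippet:

```python
import re

snippet = "<p>Kaggle</p> and <b>tweets</b>"

lazy = re.sub(r'<.*?>', '', snippet)    # each match stops at the first '>'
greedy = re.sub(r'<.*>', '', snippet)   # one match runs to the last '>'

print(repr(lazy))
print(repr(greedy))
```

The greedy version swallows the text between the first `<` and the last `>`, which is why the non-greedy form is used here.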

In [15]:
# remove all html tags from all the texts of df
df['text']=df['text'].apply(lambda x : remove_html(x))

### Removing Emojis

In [16]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    # '[]' Used to indicate a set of characters.
    # '\u', '\U', and '\N' escape sequences are only recognized in Unicode patterns
    # Ranges of characters can be indicated by giving two characters and separating them by a '-', 
    # for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers 
    # from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or 
    # if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal '-'.
    
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; note this wide range also covers CJK and other non-emoji scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

#checking
remove_emoji("Omg another Earthquake 😔😔")

'Omg another Earthquake '
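One thing to be aware of: the last range in the pattern, `\U000024C2-\U0001F251`, spans far more than emoji, including CJK characters. The sketch below shows the effect on a made-up string. The `narrow` pattern is only an illustration that drops that last range; it is an assumption about what is wanted and will miss some symbols the original removes.

```python
import re

# the wide range on its own
broad = re.compile("[\U000024C2-\U0001F251]+")

# illustrative alternative: the other five ranges from the pattern above
narrow = re.compile("["
                    "\U0001F600-\U0001F64F"  # emoticons
                    "\U0001F300-\U0001F5FF"  # symbols & pictographs
                    "\U0001F680-\U0001F6FF"  # transport & map symbols
                    "\U0001F1E0-\U0001F1FF"  # flags
                    "\U00002702-\U000027B0"  # dingbats
                    "]+")

text = "地震 earthquake 😔"
broad_result = broad.sub('', text)
narrow_result = narrow.sub('', text)
print(repr(broad_result))
print(repr(narrow_result))
```

For an English-only tweet dataset this rarely matters, but it is worth knowing before reusing the function on multilingual text.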

In [17]:
df['text']=df['text'].apply(lambda x: remove_emoji(x))

### Removing punctuation

In [12]:
# a function to remove all punctuations
def remove_punct(text):    
    # create a translation table that maps every punctuation mark to None (i.e. deletes it);
    # the punctuation marks come from the string module
    # https://docs.python.org/3/library/stdtypes.html#str.maketrans
    table=str.maketrans('','',string.punctuation)
    
    # translate the text using the table
    # https://docs.python.org/3/library/stdtypes.html#str.translate
    return text.translate(table)

#checking
print(remove_punct("I am a #king"))

I am a king
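Note that `string.punctuation` only contains ASCII punctuation, so typographic characters such as curly quotes or an ellipsis pass through untouched. A quick check on a made-up string:

```python
import string

table = str.maketrans('', '', string.punctuation)

print(string.punctuation)
sample = "“Fire!” she said… #help"
stripped = sample.translate(table)
print(stripped)
```

If those characters need to go as well, they could be appended to the third argument of `str.maketrans`.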


In [13]:
# remove punctuation marks from the text column of df
df['text']=df['text'].apply(lambda x : remove_punct(x))

### Spelling Correction


Even if I'm not good at spelling, I can correct it with Python :) I will use `pyspellchecker` to do that.

In [119]:
!pip install pyspellchecker



In [120]:
from spellchecker import SpellChecker

# create a SpellChecker object
# https://pyspellchecker.readthedocs.io/en/latest/code.html#spellchecker
spell = SpellChecker()

def correct_spellings(text):
    
    # initialize a list to store the correct text
    corrected_text = []
    
    # use the SpellChecker object to get the unknown words by splitting the text
    # https://pyspellchecker.readthedocs.io/en/latest/code.html#spellchecker.SpellChecker.unknown
    # keywords: split
    misspelled_words = spell.unknown(text.split())
    
    # check whether each word in the text is a correct word
    # keywords: split
    for word in text.split():
        
        # if the word is a misspelled word then correct the word and append it to corrected_text
        # https://pyspellchecker.readthedocs.io/en/latest/code.html#spellchecker.SpellChecker.correction
        # recent pyspellchecker versions may return None when no candidate is found,
        # so fall back to the original word in that case
        if word in misspelled_words:
            corrected_text.append(spell.correction(word) or word)
        # if the word is a correct word then append it to the corrected_text
        else:
            corrected_text.append(word)
    
    # join the list 'corrected_text' with whitespaces and return
    # https://docs.python.org/3/library/stdtypes.html#str.join
    return " ".join(corrected_text)

#checking
correct_spellings("corect me plese")

'correct me please'
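Correcting every word of every tweet can take a while, and the same misspellings tend to repeat. One option is to cache per-word results. The sketch below shows the idea with `functools.lru_cache`; `TOY_FIXES` is a hypothetical lookup table standing in for the real spell checker, so this is an illustration of the caching pattern rather than of `pyspellchecker` itself.

```python
from functools import lru_cache

# Toy stand-in for a real spell checker: a hypothetical lookup table.
TOY_FIXES = {"corect": "correct", "plese": "please"}
calls = 0  # counts how often the (expensive) lookup actually runs

@lru_cache(maxsize=None)
def cached_correction(word):
    global calls
    calls += 1
    return TOY_FIXES.get(word, word)

def correct_text(text):
    return " ".join(cached_correction(w) for w in text.split())

first = correct_text("corect me plese")
second = correct_text("plese corect me")
print(first)
print(second)
print("lookups performed:", calls)
```

In the notebook, the body of `cached_correction` would call `spell.correction(word)` instead of the toy table.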

In [121]:
# correct the 'text' column of df
#df['text']=df['text'].apply(lambda x : correct_spellings(x))

## GloVe for Vectorization

Here we will use pretrained GloVe word vectors to represent our words. They are available in three varieties: 50-, 100- and 200-dimensional. We will try the 100-dimensional vectors here.
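Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up 3-dimensional rows in that layout (the real file has 100 numbers per line) and compares the two vectors with cosine similarity. The words and numbers are invented for illustration.

```python
import io
import math

# Two made-up 3-dimensional rows in the same layout as the GloVe text files.
fake_glove = io.StringIO("fire 0.5 0.1 -0.2\nflood 0.4 0.2 -0.1\n")

vectors = {}
for line in fake_glove:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(vectors["fire"])
print(cosine(vectors["fire"], vectors["flood"]))
```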

In [122]:
# function to create a corpus from a dataframe
def create_corpus(df):
    
    # initialize corpus as an empty list
    corpus=[]
    
    # Instantly make your loops show a smart progress meter - just wrap any iterable with tqdm(iterable)
    # loop through the text column in df using tqdm
    # https://tqdm.github.io/docs/tqdm/
    for tweet in tqdm(df['text']):
        
        # tokenize each tweet. Keep a word only if all its characters are letters and it is not a stopword,
        # then convert it to lower case. Create a list of such processed words from a tweet.
        # Note: the stopword check runs before lowercasing, so capitalized stopwords such as 'Our' are kept.
        # https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize
        # https://docs.python.org/3/library/stdtypes.html#str.isalpha
        # keywords: for in if and not in
        words=[word.lower() for word in word_tokenize(tweet) if word.isalpha() and word not in stop]
        
        # append the processed words of a tweet to corpus
        corpus.append(words)
        
    return corpus

# create corpus from dataframe df
corpus=create_corpus(df)

#checking
print(len(corpus))
print(corpus[:10])
        

100%|██████████| 10876/10876 [00:01<00:00, 7564.55it/s]

10876
[['our', 'deeds', 'reason', 'earthquake', 'may', 'allah', 'forgive', 'us'], ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada'], ['all', 'residents', 'asked', 'shelter', 'place', 'notified', 'officers', 'no', 'evacuation', 'shelter', 'place', 'orders', 'expected'], ['people', 'receive', 'wildfires', 'evacuation', 'orders', 'california'], ['just', 'got', 'sent', 'photo', 'ruby', 'alaska', 'smoke', 'wildfires', 'pours', 'school'], ['rockyfire', 'update', 'california', 'hwy', 'closed', 'directions', 'due', 'lake', 'county', 'fire', 'cafire', 'wildfires'], ['flood', 'disaster', 'heavy', 'rain', 'causes', 'flash', 'flooding', 'streets', 'manitou', 'colorado', 'springs', 'areas'], ['im', 'top', 'hill', 'i', 'see', 'fire', 'woods'], ['theres', 'emergency', 'evacuation', 'happening', 'building', 'across', 'street'], ['im', 'afraid', 'tornado', 'coming', 'area']]
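The corpus printed above still contains words such as 'our', 'all' and 'just'. That is because the stopword list is lowercase and the membership test runs on the original casing. A small sketch of the difference, using a hypothetical three-word stopword set instead of NLTK's list:

```python
# Hypothetical three-word stopword list standing in for NLTK's English list.
toy_stop = {"our", "all", "the"}
words = "Our deeds are the reason".split()

# same order as in create_corpus: test first, lowercase afterwards
check_then_lower = [w.lower() for w in words if w.isalpha() and w not in toy_stop]

# alternative: lowercase before testing membership
lower_then_check = [w.lower() for w in words if w.isalpha() and w.lower() not in toy_stop]

print(check_then_lower)
print(lower_then_check)
```

Switching to the second form would change the vocabulary size and the numbers shown in the following cells, so it is left as an option here.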





In [123]:
# initialize an empty dictionary so that we can add words as keys and the vector embedding of that word as value 
embedding_dict={}

# Each line of glove.6B.100d.txt contains a word and its GloVe vector embedding. Open the file in read mode
# https://docs.python.org/3/library/functions.html#open
with open('model/glove.6B.100d.txt','r',encoding='utf-8') as f:
    
    #iterate through each line of the text
    for line in f:
        
        #split each line by whitespaces
        values=line.split()
        
        #get the 1st segment from the splitted line as word
        word=values[0]
        
        # get the vector embeddings and turn it into a numpy array of float32
        # https://numpy.org/devdocs/reference/generated/numpy.asarray.html
        vectors=np.asarray(values[1:],'float32')
        
        # set the word and its embedding as key value pair in the dictionary
        embedding_dict[word]=vectors
        
# no explicit f.close() is needed: the with statement closes the file when the block ends
# https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

#checking
print(len(embedding_dict))
list(embedding_dict.items())[:2]

400000


[('the',
  array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
         -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
          0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
         -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
          0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
         -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
          0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
          0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
         -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
         -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
         -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
         -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
         -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
         -1.2526

In [124]:
# define max length of a sequence as 50
MAX_LEN=50

# create a Tokenizer object. This class vectorizes a text corpus by turning each text into 
# a sequence of integers (each integer being the index of a token in a dictionary).
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
tokenizer_obj=Tokenizer()

# Update the internal vocabulary of the tokenizer based on a list of texts (here, lists of tokens).
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts
tokenizer_obj.fit_on_texts(corpus)

# Transform each word in corpus to an integer.
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences
sequences=tokenizer_obj.texts_to_sequences(corpus)

#checking
print("2nd line of corpus as a list of words:",corpus[2])
print("2nd line of corpus as a sequence of integers:",sequences[2])

2nd line of corpus as a list of words: ['all', 'residents', 'asked', 'shelter', 'place', 'notified', 'officers', 'no', 'evacuation', 'shelter', 'place', 'orders', 'expected']
2nd line of corpus as a sequence of integers: [119, 1469, 1386, 2104, 645, 7972, 1667, 77, 204, 2104, 645, 1559, 1143]
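The integers come from a frequency ranking: the most frequent word gets index 1, the next gets 2, and so on, with 0 left unused. The sketch below imitates that idea in plain Python on a made-up corpus. It is not the Keras implementation, only an illustration of the mapping.

```python
from collections import Counter

toy_corpus = [["fire", "near", "school"], ["fire", "evacuation"], ["school", "fire"]]

counts = Counter(word for doc in toy_corpus for word in doc)

# rank 1 goes to the most frequent word; index 0 is left unused
toy_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}
toy_sequences = [[toy_index[w] for w in doc] for doc in toy_corpus]

print(toy_index)
print(toy_sequences)
```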


In [125]:
# Pad sequences to the same length (in our case, MAX_LEN). 
# truncating='post' removes values from the end of sequences longer than maxlen. 
# padding='post' adds the padding after each sequence. 
# https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
tweet_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')

#checking
print("2nd line of corpus as a padded sequence:",tweet_pad[2])

2nd line of corpus as a padded sequence: [ 119 1469 1386 2104  645 7972 1667   77  204 2104  645 1559 1143    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]
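What `padding='post'` and `truncating='post'` do to a single sequence can be written out in a few lines of plain Python. `pad_post` is a hypothetical helper for illustration, not part of Keras:

```python
def pad_post(seq, maxlen, value=0):
    """Cut the sequence at maxlen from the end, then pad at the end."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

short = pad_post([119, 1469, 1386], 5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long_)
```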


In [126]:
# get the dictionary which contains every word's index in the tokenization process
word_index=tokenizer_obj.word_index

#checking
print('Number of unique words:',len(word_index))
print(list(word_index.items())[:10])

Number of unique words: 20342
[('i', 1), ('the', 2), ('like', 3), ('amp', 4), ('im', 5), ('a', 6), ('fire', 7), ('get', 8), ('new', 9), ('via', 10)]


In [127]:
# add 1 to the number of unique words: Tokenizer indices start at 1 and index 0 is reserved for padding. 
# Row 0, and the row of any word missing from GloVe, stays all zeros
num_words=len(word_index)+1

# initialize an embedding matrix which will contain all the vector embeddings for our unique words of corpus.
# https://numpy.org/devdocs/reference/generated/numpy.zeros.html
embedding_matrix=np.zeros((num_words,100))

# get the word and index in each iteration of the dictionary items using tqdm
# keywords: items
for word,i in tqdm(word_index.items()):
    
    # guard against indices outside the matrix (cannot happen here, since the largest index is num_words-1)
    if i >= num_words:
        continue
    
    # get the embedding vector of the word from the embedding dictionary
    # https://docs.python.org/3/library/stdtypes.html#dict.get
    emb_vec=embedding_dict.get(word)
    
    if emb_vec is not None:  # compare with None explicitly: the truth value of a NumPy array is ambiguous
        # Assign the embedding vector of that word into the 2D matrix
        embedding_matrix[i]=emb_vec

#checking
print(embedding_matrix.shape)
print(embedding_matrix[1:3])

100%|██████████| 20342/20342 [00:00<00:00, 571497.20it/s]

(20343, 100)
[[-0.046539    0.61966002  0.56647003 -0.46584001 -1.18900001  0.44599
   0.066035    0.31909999  0.14679    -0.22119001  0.79238999  0.29905
   0.16073     0.025324    0.18678001 -0.31000999 -0.28108001  0.60514998
  -1.0654      0.52476001  0.064152    1.03579998 -0.40779001 -0.38011
   0.30801001  0.59964001 -0.26991001 -0.76034999  0.94221997 -0.46919
  -0.18278     0.90652001  0.79671001  0.24824999  0.25713     0.6232
  -0.44768     0.65357     0.76902002 -0.51229    -0.44332999 -0.21867
   0.38370001 -1.14830005 -0.94397998 -0.15062     0.30012    -0.57805997
   0.20175    -1.65910006 -0.079195    0.026423    0.22051001  0.99713999
  -0.57538998 -2.72659993  0.31448001  0.70521998  1.43809998  0.99125999
   0.13976     1.34739995 -1.1753      0.0039503   1.02980006  0.064637
   0.90886998  0.82871997 -0.47003001 -0.10575     0.5916     -0.42210001
   0.57331002 -0.54114002  0.10768     0.39783999 -0.048744    0.064596
  -0.61436999 -0.28600001  0.50669998 -0.4975799
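Words that are missing from GloVe keep an all-zero row, so it can be useful to know how much of the vocabulary is actually covered. A minimal sketch with toy stand-ins for `word_index` and `embedding_dict` (the words and vectors are made up):

```python
# Toy stand-ins for word_index and embedding_dict from the cells above.
toy_word_index = {"fire": 1, "rockyfire": 2, "flood": 3}
toy_embeddings = {"fire": [0.1, 0.2], "flood": [0.3, 0.4]}

missing = [w for w in toy_word_index if w not in toy_embeddings]
coverage = 1 - len(missing) / len(toy_word_index)

print(missing)
print(f"{coverage:.0%} of the vocabulary has a pretrained vector")
```

Running the same two lines on the real `word_index` and `embedding_dict` would list the hashtags and misspellings that the model sees only as zeros.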




## Baseline Model

In [128]:
# initialize a Sequential object. It groups a linear stack of layers into a tf.keras.Model.
# https://keras.io/guides/sequential_model/
model=Sequential()

# add an embedding layer that Turns positive integers (indexes) into dense vectors of fixed size.
# input dimension size is the number of words,
# output dimension size is 100,
# set the embedding initializer to be an initializer that generates tensors with constant values. 
# Use embedding_matrix to feed the Constant initializer
# https://keras.io/api/layers/core_layers/embedding/
model.add(Embedding(num_words,100,embeddings_initializer=Constant(embedding_matrix),
                   input_length=MAX_LEN,trainable=False))

# The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, 
# which helps prevent overfitting.
# SpatialDropout1D performs the same function as Dropout, 
# however it drops entire 1D feature maps instead of individual elements. Add a SpatialDropout1D layer where 
# Fraction of the input units to drop is 0.2 
# https://www.tensorflow.org/api_docs/python/tf/keras/layers/SpatialDropout1D
model.add(SpatialDropout1D(0.2))

# Add an LSTM layer. dimensionality of the output space is 64, fraction of the units to drop for the linear 
# transformation of the inputs is 0.2 and Fraction of the units to drop for the linear transformation of the 
# recurrent state is 0.2
# https://keras.io/api/layers/recurrent_layers/lstm/
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))

# add a Dense layer that implements the operation: output = activation(dot(input, kernel) + bias) where 
# activation is the element-wise activation function passed as the activation argument, 
# kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer 
# (only applicable if use_bias is True).
# dimensionality of the output space is 1, and the activation function should be sigmoid
# https://keras.io/api/layers/core_layers/dense/
model.add(Dense(1, activation='sigmoid'))

# initialize Optimizer that implements the Adam algorithm. Adam optimization is a stochastic gradient descent 
# method that is based on adaptive estimation of first-order and second-order moments. 
# Learning rate should be 1e-5
# https://keras.io/api/optimizers/adam/
optimizer=Adam(learning_rate=1e-5)

# Configure the model for training: set the optimizer instance, set the list of metrics to be evaluated by the 
# model during training and testing to 'accuracy', and use binary_crossentropy as the loss function
# https://keras.io/api/models/model_training_apis/#compile-method
model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])



In [129]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 100)           2034300   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, 50, 100)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 2,076,605
Trainable params: 42,305
Non-trainable params: 2,034,300
_________________________________________________________________
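The parameter counts in the summary can be reproduced by hand. A quick check, assuming the usual LSTM layout of four gates, each with input weights, recurrent weights and a bias:

```python
vocab_rows, emb_dim, lstm_units = 20343, 100, 64

# one 100-dimensional row per vocabulary index (frozen, hence non-trainable)
embedding_params = vocab_rows * emb_dim

# four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (emb_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# one output unit: 64 weights plus a bias
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

Only the LSTM and Dense weights are trained, which matches the 42,305 trainable parameters reported above.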


In [130]:
# separate the train and test set from tweet_pad
train=tweet_pad[:tweet.shape[0]]
test=tweet_pad[tweet.shape[0]:]

print(train.shape,test.shape)

(7613, 50) (3263, 50)


In [131]:
# split the train set into train and dev set. ratio should be 85:15
X_train,X_test,y_train,y_test=train_test_split(train,tweet['target'].values,test_size=0.15)
print('Shape of train',X_train.shape)
print("Shape of Validation ",X_test.shape)

Shape of train (6471, 50)
Shape of Validation  (1142, 50)
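The two shapes follow from the 7613 training rows and `test_size=0.15`. A small arithmetic check, assuming the fractional validation size is rounded up, plus the resulting number of batches per epoch for a batch size of 4:

```python
import math

n_samples, test_size, batch_size = 7613, 0.15, 4

n_val = math.ceil(test_size * n_samples)   # assumed: fractional sizes are rounded up
n_train = n_samples - n_val
batches_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val)
print(batches_per_epoch)
```

With such a small batch size each epoch runs well over a thousand weight updates, which goes some way to explaining the epoch times in the log below.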


In [132]:
# fit the model with the train set. set batch size to be 4 and set number of epoch to be 15. set the dev set as
# validation data and set verbose to be 2
# https://keras.io/api/models/model_training_apis/#fit-method
history=model.fit(X_train,y_train,batch_size=4,epochs=15,validation_data=(X_test,y_test),verbose=2)

Train on 6471 samples, validate on 1142 samples
Epoch 1/15
 - 49s - loss: 0.6917 - accuracy: 0.5642 - val_loss: 0.6891 - val_accuracy: 0.5884
Epoch 2/15
 - 48s - loss: 0.6850 - accuracy: 0.5681 - val_loss: 0.6625 - val_accuracy: 0.5884
Epoch 3/15
 - 48s - loss: 0.6158 - accuracy: 0.6719 - val_loss: 0.5401 - val_accuracy: 0.7732
Epoch 4/15
 - 48s - loss: 0.5755 - accuracy: 0.7283 - val_loss: 0.5172 - val_accuracy: 0.7706
Epoch 5/15
 - 49s - loss: 0.5690 - accuracy: 0.7367 - val_loss: 0.5063 - val_accuracy: 0.7715
Epoch 6/15
 - 48s - loss: 0.5589 - accuracy: 0.7418 - val_loss: 0.4999 - val_accuracy: 0.7785
Epoch 7/15
 - 49s - loss: 0.5620 - accuracy: 0.7385 - val_loss: 0.4963 - val_accuracy: 0.7785
Epoch 8/15
 - 49s - loss: 0.5528 - accuracy: 0.7427 - val_loss: 0.4909 - val_accuracy: 0.7820
Epoch 9/15
 - 50s - loss: 0.5514 - accuracy: 0.7526 - val_loss: 0.4881 - val_accuracy: 0.7828
Epoch 10/15
 - 50s - loss: 0.5426 - accuracy: 0.7543 - val_loss: 0.4868 - val_accuracy: 0.7846
Epoch 11/15

## Making our submission

In [133]:
sample_sub=pd.read_csv('dataset/sample_submission.csv')

In [134]:
y_pre=model.predict(test)
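# round the sigmoid outputs to 0/1 and flatten to one dimension (-1 avoids hard-coding the number of test rows)
y_pre=np.round(y_pre).astype(int).reshape(-1)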
sub=pd.DataFrame({'id':sample_sub['id'].values.tolist(),'target':y_pre})
sub.to_csv('submission.csv',index=False)
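Rounding turns the sigmoid outputs into class labels. One detail worth knowing is that rounding uses round-half-to-even, so a probability of exactly 0.5 becomes 0, whereas an explicit threshold makes the cut-off visible and adjustable. A plain-Python sketch with made-up probabilities:

```python
toy_probs = [0.12, 0.5, 0.50001, 0.93]

rounded = [round(p) for p in toy_probs]            # round-half-to-even: 0.5 -> 0
thresholded = [int(p >= 0.5) for p in toy_probs]   # explicit cut-off: 0.5 -> 1

print(rounded)
print(thresholded)
```

In practice an output of exactly 0.5 is rare, so both approaches usually agree; the explicit form mainly helps if the threshold is tuned later.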


In [135]:
sub.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |

