<a href="https://colab.research.google.com/github/pranshudiwan/NLP_CS_6200/blob/main/Bi-LSTMs_with_77_recall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bi-LSTMs with Improved accuracy

### Up to the data cleaning part, everything remains the same

In [1]:
import numpy as np
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')
import re
import nltk
nltk.download('stopwords')
from nltk.util import ngrams
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict
from collections import  Counter
from sklearn.model_selection import train_test_split
import keras
from keras.models import Sequential
from keras.initializers import Constant
from keras.layers import (LSTM, 
                          Embedding, 
                          BatchNormalization,
                          Dense, 
                          TimeDistributed, 
                          Dropout, 
                          Bidirectional,
                          Flatten, 
                          GlobalMaxPool1D)
from nltk.tokenize import word_tokenize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.optimizers import Adam
from sklearn.metrics import (
    precision_score, 
    recall_score, 
    f1_score, 
    classification_report,
    accuracy_score
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
# Import train data
url_train = 'https://raw.githubusercontent.com/pranshudiwan/NLP_CS_6200/main/train.csv'
train = pd.read_csv(url_train)

# Import test data
url_test = 'https://raw.githubusercontent.com/pranshudiwan/NLP_CS_6200/main/test.csv'
test = pd.read_csv(url_test)

#### Training Data

In [4]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


#### Test Data

In [6]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


#### Basic Information on Training Data

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


The training set therefore has missing values in two columns: `keyword` is null in 61 of the 7,613 rows and `location` in 2,533.
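The missing-value counts follow directly from the `train.info()` output. A minimal sketch of that arithmetic, where the dictionary simply copies the non-null counts printed above (with the DataFrame loaded, `train.isnull().sum()` should report the same numbers):

```python
# Non-null counts copied from the train.info() output above
total_rows = 7613
non_null = {"id": 7613, "keyword": 7552, "location": 5080, "text": 7613, "target": 7613}

# Missing values per column = total rows minus non-null count
missing = {col: total_rows - count for col, count in non_null.items()}

for col, count in missing.items():
    print(f"{col:<9} {count:>5} missing ({count / total_rows:.1%})")
```

Roughly a third of the rows have no `location`, which is one reason the rest of the notebook works with `text` only.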

## Data Cleaning

### 1. Removing Punctuation

In [8]:
train['text'] = train['text'].map(lambda x: re.sub(r'\W+', ' ', x))
#train['keyword'] = train['keyword'].map(lambda x: re.sub(r'\W+', ' ', x))
#train['location'] = train['location'].map(lambda x: re.sub(r'\W+', ' ', x))
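To see what this substitution does to a single tweet, here is a small sketch on a made-up sentence (the sample text and the helper name `strip_punctuation` are illustrative, not from the dataset):

```python
import re

def strip_punctuation(text):
    # Replace every run of non-word characters (anything other than
    # letters, digits and underscore) with a single space
    return re.sub(r'\W+', ' ', text)

sample = "I couldn't stay! #wildfires http://t.co/abc123"
cleaned = strip_punctuation(sample)
print(cleaned)  # I couldn t stay wildfires http t co abc123
```

Apostrophes and URL separators are replaced too, which is why fragments such as `couldn t` and `https t co` show up in the samples further down.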

In [9]:
train.sample(5)

Unnamed: 0,id,keyword,location,text,target
3010,4324,dust%20storm,,Let it be gone away like a dust in the wind Bi...,1
6435,9208,suicide%20bombing,,JewhadiTM It is almost amazing to think someo...,1
2871,4127,drought,Meereen,Pizza drought is over I just couldn t anymore,0
1559,2251,chemical%20emergency,"Orbost, Victoria, Australia",Lindenow 3 15pm Emergency crews are at a chemi...,1
4237,6020,hazardous,,It s getting to be hazardous getting into this...,1


### 2. Converting to lowercase

In [10]:
#@title
#@title
train = train.apply(lambda x: x.astype(str).str.lower())

In [11]:
#@title
#@title
train.sample(5)

Unnamed: 0,id,keyword,location,text,target
5023,7164,mudslide,iupui '19,someone split a mudslide w me when i get off work,0
3819,5428,first%20responders,new york city,i just added sandy first responders lost their...,1
3038,4359,earthquake,earth,1 9 earthquake occurred 15km e of anchorage al...,1
7485,10707,wreck,"alabama, usa",first wreck today so so glad me and mom are ok...,0
5300,7570,outbreak,,families to sue over legionnaires more than 40...,1


### 3. Removing Emojis

In [12]:
#@title
#@title
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

train['text']=train['text'].apply(lambda x: remove_emoji(x))
train.sample(5)

Unnamed: 0,id,keyword,location,text,target
5809,8291,rioting,cassadaga florida,fa07af174a71408 i have lived amp my family ha...,1
3283,4709,epicentre,,epicentre cydia tweak https t co wkmfdig3nt th...,0
1909,2744,crushed,trinidad & tobago,disillusioned lead character check happy go lu...,0
6124,8740,sinking,hey georgia,each time we try we always end up sinking,0
4403,6259,hijacking,,hot funtenna hijacking computers to send data...,1
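One thing worth checking about the pattern above is how wide its last range is. The sketch below rebuilds the same character class and runs it on two made-up strings; the second string is an assumption chosen to show that CJK characters also fall inside `\U000024C2-\U0001F251`:

```python
import re

emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"  # very wide: also covers CJK
                           "]+", flags=re.UNICODE)

def remove_emoji(text):
    return emoji_pattern.sub('', text)

with_emoji = remove_emoji("forest fire \U0001F525 near la ronge")
with_cjk = remove_emoji("tokyo \u6771\u4eac earthquake")
print(repr(with_emoji))  # 'forest fire  near la ronge'
print(repr(with_cjk))    # 'tokyo  earthquake'
```

For an English-only corpus this side effect is probably harmless, but it would matter if the tweets contained text in other scripts.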


### 4. Correcting Spellings

In [13]:
#@title
#@title
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading https://files.pythonhosted.org/packages/64/c7/435f49c0ac6bec031d1aba4daf94dc21dc08a9db329692cdb77faac51cea/pyspellchecker-0.6.2-py3-none-any.whl (2.7MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.6.2


In [14]:
#@title
#@title
from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

#train['text']=train['text'].apply(lambda x : correct_spellings(x))
#train.sample(5)

#### An example sentence passed to our spell checker function

In [15]:
#@title
#@title

text = 'corrrect me pleas. MY NLP preject is due tomrow'
correct_spellings(text)

'correct me please MY NLP project is due tomorow'

### 5. Replacing common acronyms

Some common acronyms:
1. idk - i don't know
2. ttyl - talk to you later

In [16]:
#@title
#@title
## Replacing common acronyms

def other_clean(text):
        """
            Other manual text cleaning techniques
        """
        # Typos, slang and other
        sample_typos_slang = {
                                "w/e": "whatever",
                                "usagov": "usa government",
                                "recentlu": "recently",
                                "ph0tos": "photos",
                                "amirite": "am i right",
                                "exp0sed": "exposed",
                                "<3": "love",
                                "luv": "love",
                                "amageddon": "armageddon",
                                "trfc": "traffic",
                                "16yr": "16 year"
                                }

        # Acronyms
        sample_acronyms =  { 
                            "mh370": "malaysia airlines flight 370",
                            "okwx": "oklahoma city weather",
                            "arwx": "arkansas weather",    
                            "gawx": "georgia weather",  
                            "scwx": "south carolina weather",  
                            "cawx": "california weather",
                            "tnwx": "tennessee weather",
                            "azwx": "arizona weather",  
                            "alwx": "alabama weather",
                            "usnwsgov": "united states national weather service",
                            "2mw": "tomorrow"
                            }

        
        # Some common abbreviations 
        sample_abbr = {
                        "$" : " dollar ",
                        "€" : " euro ",
                        "4ao" : "for adults only",
                        "a.m" : "before midday",
                        "a3" : "anytime anywhere anyplace",
                        "aamof" : "as a matter of fact",
                        "acct" : "account",
                        "adih" : "another day in hell",
                        "afaic" : "as far as i am concerned",
                        "afaict" : "as far as i can tell",
                        "afaik" : "as far as i know",
                        "afair" : "as far as i remember",
                        "afk" : "away from keyboard",
                        "app" : "application",
                        "approx" : "approximately",
                        "apps" : "applications",
                        "asap" : "as soon as possible",
                        "asl" : "age, sex, location",
                        "atk" : "at the keyboard",
                        "ave." : "avenue",
                        "aymm" : "are you my mother",
                        "ayor" : "at your own risk", 
                        "b&b" : "bed and breakfast",
                        "b+b" : "bed and breakfast",
                        "b.c" : "before christ",
                        "b2b" : "business to business",
                        "b2c" : "business to customer",
                        "b4" : "before",
                        "b4n" : "bye for now",
                        "b@u" : "back at you",
                        "bae" : "before anyone else",
                        "bak" : "back at keyboard",
                        "bbbg" : "bye bye be good",
                        "bbc" : "british broadcasting corporation",
                        "bbias" : "be back in a second",
                        "bbl" : "be back later",
                        "bbs" : "be back soon",
                        "be4" : "before",
                        "bfn" : "bye for now",
                        "blvd" : "boulevard",
                        "bout" : "about",
                        "brb" : "be right back",
                        "bros" : "brothers",
                        "brt" : "be right there",
                        "bsaaw" : "big smile and a wink",
                        "btw" : "by the way",
                        "bwl" : "bursting with laughter",
                        "c/o" : "care of",
                        "cet" : "central european time",
                        "cf" : "compare",
                        "cia" : "central intelligence agency",
                        "csl" : "can not stop laughing",
                        "cu" : "see you",
                        "cul8r" : "see you later",
                        "cv" : "curriculum vitae",
                        "cwot" : "complete waste of time",
                        "cya" : "see you",
                        "cyt" : "see you tomorrow",
                        "dae" : "does anyone else",
                        "dbmib" : "do not bother me i am busy",
                        "diy" : "do it yourself",
                        "dm" : "direct message",
                        "dwh" : "during work hours",
                        "e123" : "easy as one two three",
                        "eet" : "eastern european time",
                        "eg" : "example",
                        "embm" : "early morning business meeting",
                        "encl" : "enclosed",
                        "encl." : "enclosed",
                        "etc" : "and so on",
                        "faq" : "frequently asked questions",
                        "fawc" : "for anyone who cares",
                        "fb" : "facebook",
                        "fc" : "fingers crossed",
                        "fig" : "figure",
                        "fimh" : "forever in my heart", 
                        "ft." : "feet",
                        "ft" : "featuring",
                        "ftl" : "for the loss",
                        "ftw" : "for the win",
                        "fwiw" : "for what it is worth",
                        "fyi" : "for your information",
                        "g9" : "genius",
                        "gahoy" : "get a hold of yourself",
                        "gal" : "get a life",
                        "gcse" : "general certificate of secondary education",
                        "gfn" : "gone for now",
                        "gg" : "good game",
                        "gl" : "good luck",
                        "glhf" : "good luck have fun",
                        "gmt" : "greenwich mean time",
                        "gmta" : "great minds think alike",
                        "gn" : "good night",
                        "g.o.a.t" : "greatest of all time",
                        "goat" : "greatest of all time",
                        "goi" : "get over it",
                        "gps" : "global positioning system",
                        "gr8" : "great",
                        "gratz" : "congratulations",
                        "gyal" : "girl",
                        "h&c" : "hot and cold",
                        "hp" : "horsepower",
                        "hr" : "hour",
                        "hrh" : "his royal highness",
                        "ht" : "height",
                        "ibrb" : "i will be right back",
                        "ic" : "i see",
                        "icq" : "i seek you",
                        "icymi" : "in case you missed it",
                        "idc" : "i do not care",
                        "idgadf" : "i do not give a damn fuck",
                        "idgaf" : "i do not give a fuck",
                        "idk" : "i do not know",
                        "ie" : "that is",
                        "i.e" : "that is",
                        "ifyp" : "i feel your pain",
                        "IG" : "instagram",
                        "iirc" : "if i remember correctly",
                        "ilu" : "i love you",
                        "ily" : "i love you",
                        "imho" : "in my humble opinion",
                        "imo" : "in my opinion",
                        "imu" : "i miss you",
                        "iow" : "in other words",
                        "irl" : "in real life",
                        "j4f" : "just for fun",
                        "jic" : "just in case",
                        "jk" : "just kidding",
                        "jsyk" : "just so you know",
                        "l8r" : "later",
                        "lb" : "pound",
                        "lbs" : "pounds",
                        "ldr" : "long distance relationship",
                        "lmao" : "laugh my ass off",
                        "lmfao" : "laugh my fucking ass off",
                        "lol" : "laughing out loud",
                        "ltd" : "limited",
                        "ltns" : "long time no see",
                        "m8" : "mate",
                        "mf" : "motherfucker",
                        "mfs" : "motherfuckers",
                        "mfw" : "my face when",
                        "mofo" : "motherfucker",
                        "mph" : "miles per hour",
                        "mr" : "mister",
                        "mrw" : "my reaction when",
                        "ms" : "miss",
                        "mte" : "my thoughts exactly",
                        "nagi" : "not a good idea",
                        "nbc" : "national broadcasting company",
                        "nbd" : "not big deal",
                        "nfs" : "not for sale",
                        "ngl" : "not going to lie",
                        "nhs" : "national health service",
                        "nrn" : "no reply necessary",
                        "nsfl" : "not safe for life",
                        "nsfw" : "not safe for work",
                        "nth" : "nice to have",
                        "nvr" : "never",
                        "nyc" : "new york city",
                        "oc" : "original content",
                        "og" : "original",
                        "ohp" : "overhead projector",
                        "oic" : "oh i see",
                        "omdb" : "over my dead body",
                        "omg" : "oh my god",
                        "omw" : "on my way",
                        "p.a" : "per annum",
                        "p.m" : "after midday",
                        "pm" : "prime minister",
                        "poc" : "people of color",
                        "pov" : "point of view",
                        "pp" : "pages",
                        "ppl" : "people",
                        "prw" : "parents are watching",
                        "ps" : "postscript",
                        "pt" : "point",
                        "ptb" : "please text back",
                        "pto" : "please turn over",
                        "qpsa" : "what happens", #"que pasa",
                        "ratchet" : "rude",
                        "rbtl" : "read between the lines",
                        "rlrt" : "real life retweet", 
                        "rofl" : "rolling on the floor laughing",
                        "roflol" : "rolling on the floor laughing out loud",
                        "rotflmao" : "rolling on the floor laughing my ass off",
                        "rt" : "retweet",
                        "ruok" : "are you ok",
                        "sfw" : "safe for work",
                        "sk8" : "skate",
                        "smh" : "shake my head",
                        "sq" : "square",
                        "srsly" : "seriously", 
                        "ssdd" : "same stuff different day",
                        "tbh" : "to be honest",
                        "tbs" : "tablespooful",
                        "tbsp" : "tablespooful",
                        "tfw" : "that feeling when",
                        "thks" : "thank you",
                        "tho" : "though",
                        "thx" : "thank you",
                        "tia" : "thanks in advance",
                        "til" : "today i learned",
                        "tl;dr" : "too long i did not read",
                        "tldr" : "too long i did not read",
                        "tmb" : "tweet me back",
                        "tntl" : "trying not to laugh",
                        "ttyl" : "talk to you later",
                        "u" : "you",
                        "u2" : "you too",
                        "u4e" : "yours for ever",
                        "utc" : "coordinated universal time",
                        "w/" : "with",
                        "w/o" : "without",
                        "w8" : "wait",
                        "wassup" : "what is up",
                        "wb" : "welcome back",
                        "wtf" : "what the fuck",
                        "wtg" : "way to go",
                        "wtpa" : "where the party at",
                        "wuf" : "where are you from",
                        "wuzup" : "what is up",
                        "wywh" : "wish you were here",
                        "yd" : "yard",
                        "ygtr" : "you got that right",
                        "ynk" : "you never know",
                        "zzz" : "sleeping bored and tired"
                        }
            
        sample_typos_slang_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_typos_slang.keys()) + r')(?!\w)')
        sample_acronyms_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_acronyms.keys()) + r')(?!\w)')
        sample_abbr_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_abbr.keys()) + r')(?!\w)')
        
        text = sample_typos_slang_pattern.sub(lambda x: sample_typos_slang[x.group()], text)
        text = sample_acronyms_pattern.sub(lambda x: sample_acronyms[x.group()], text)
        text = sample_abbr_pattern.sub(lambda x: sample_abbr[x.group()], text)
        
        return text

In [17]:
#@title
#@title
## Note: variable names and abbreviations still need to be adjusted as appropriate
train["text"] = train["text"].apply(lambda x: other_clean(x))
train.sample(5)

Unnamed: 0,id,keyword,location,text,target
537,782,avalanche,buy give me my money,great one time deal on all avalanche music and...,0
7259,10393,whirlwind,,sitting in a cafe enjoying a bite and cramming...,0
3645,5194,fatalities,san francisco,motordom lobbied to change our language aroun...,0
2965,4260,drowning,,drowning acrylic 08 05 15 https t co x17fubqbgg,1
1887,2711,crushed,,http t co kg5plkedhr wrapup 2 you s cable tv c...,0
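The lookup-table replacement in `other_clean` can be sketched on a much smaller, hypothetical dictionary. The three entries below are taken from `sample_abbr`; the helper name `expand` and the test sentences are made up for illustration:

```python
import re

# Hypothetical three-entry subset of sample_abbr
small_abbr = {"idk": "i do not know", "u": "you", "w/": "with"}

# (?<!\w) and (?!\w) ensure a key only matches as a standalone token,
# so the "u" inside "out" is left alone
pattern = re.compile(
    r'(?<!\w)(' + '|'.join(re.escape(key) for key in small_abbr) + r')(?!\w)'
)

def expand(text):
    return pattern.sub(lambda match: small_abbr[match.group()], text)

print(expand("idk if u saw the flood"))  # i do not know if you saw the flood
print(expand("out w/ friends"))          # out with friends
print(expand("out w friends"))           # out w friends
```

The last call hints at an ordering issue: since punctuation was already replaced by spaces in step 1, keys that contain punctuation (such as `w/`, `<3` or `tl;dr`) presumably can no longer match by the time `other_clean` runs. Applying this step before punctuation removal would be one way around that.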



### 6. Removing single and unwanted characters

In [19]:
#@title
#@title
## Removing single characters
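# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
train["text"] = train["text"].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)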
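The two substitutions can be tried on plain strings with `re`, which follows the same regular-expression rules as pandas' `str.replace` in regex mode. The helper name and sample sentences are illustrative:

```python
import re

def drop_single_chars(text):
    # Remove any word that is exactly one character long...
    text = re.sub(r'\b\w\b', '', text)
    # ...then collapse the leftover runs of whitespace
    return re.sub(r'\s+', ' ', text)

print(drop_single_chars("it s getting to be hazardous 2 drive"))
# it getting to be hazardous drive
print(repr(drop_single_chars("i saw a fire")))
# ' saw fire'
```

Besides the stray `s` and `t` left over from apostrophes, this also removes legitimate one-letter words such as `i` and `a`, and single digits.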


### 7. Removing common stop-words and tokenizing the text

In [21]:
#@title
#@title
from nltk.corpus import stopwords

stop = stopwords.words('english')
def tokenizer(text):
    tokenized = []
    for string in text:
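        # Keep letters and whitespace only
        string = re.sub(r'[^a-z\sA-Z]', '', string)
        string = re.sub(r'http\S+', '', string)
        # Note: 'co', 'via' and 'amp' are removed wherever they occur, even inside
        # longer words (e.g. "construction" becomes "nstruction")
        string = re.sub('co', '', string)
        string = re.sub('via', '', string)
        string = re.sub('amp', '', string)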
        tokenized.append([w for w in string.split() if w not in stop])
    return tokenized
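Because `re.sub('co', '', string)` matches anywhere in the string, it also eats into ordinary words. A possible variant, shown here only as a sketch, uses word boundaries so that just the standalone URL fragments are dropped. The stop-word set is a tiny stand-in for NLTK's English list, and `tokenize_one` is a hypothetical helper name:

```python
import re

# Tiny stand-in for stopwords.words('english'), for illustration only
stop = {"the", "in", "to", "and", "from", "an"}

# Match the noise tokens only when they stand alone as whole words
noise = re.compile(r'\b(?:https?|co|via|amp)\b')

def tokenize_one(string):
    string = re.sub(r'[^a-zA-Z\s]', '', string)  # keep letters and whitespace
    string = noise.sub('', string)
    return [w for w in string.split() if w not in stop]

tokens = tokenize_one("wow the name comes from an outbreak amp https co")
print(tokens)                      # ['wow', 'name', 'comes', 'outbreak']
print(re.sub('co', '', 'comes'))   # mes  (substring removal, for contrast)
```

This is a different pattern from the one used in the notebook, so the `tokenized` column it produces would differ from the samples shown here.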

In [22]:
#@title
#@title
train['tokenized'] = tokenizer(train["text"])

In [23]:
#@title
#@title
train.sample(5)

Unnamed: 0,id,keyword,location,text,target,tokenized
6491,9280,sunk,,shekhargupta mihirssharma high time tv channe...,0,"[shekhargupta, mihirssharma, high, time, tv, c..."
100,144,accident,uk,norwaymfa bahrain police had previously died ...,1,"[norwaymfa, bahrain, police, previously, died,..."
295,435,apocalypse,,minecraft night lucky block mod bob apocalypse...,0,"[minecraft, night, lucky, block, mod, bob, apo..."
1786,2564,crash,liverpool,party for bestival crash victim michael molloy...,1,"[party, bestival, crash, victim, michael, moll..."
3062,4393,earthquake,london,there was small earthquake in la but don worr...,1,"[small, earthquake, la, worry, emmy, rossum, f..."


### 8. Performing stemming

In [24]:
#@title
#@title
# Performing stemming

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [25]:
#@title
#@title
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True


In [27]:
#@title
#@title
def stem(unstemmed_list):
    """Stem every word of every sentence with the Porter stemmer."""
    ps = PorterStemmer()
    final_stemmed_list = []

    for sentence in unstemmed_list:
        # Split the sentence into words, stem each one, and join them back
        words = word_tokenize(sentence)
        final_stemmed_list.append(' '.join(ps.stem(w) for w in words))

    return final_stemmed_list


In [28]:
#@title
#@title
unstemmed_list = train['text'].tolist()
train['stemmed_text'] = stem(unstemmed_list)

In [29]:
#@title
#@title
#Removing digits
train['stemmed_text'] = train['stemmed_text'].str.replace(r'\d+', '', regex=True)

In [30]:
#@title
#@title
train.sample(5)

Unnamed: 0,id,keyword,location,text,target,tokenized,stemmed_text
5713,8154,rescuers,,have an unexplainable desire to watch the res...,0,"[unexplainable, desire, watch, rescuers, child...",have an unexplain desir to watch the rescuer c...
2494,3583,desolate,"michigan, usa",psalm34 22 the lord redeemeth the soul of his ...,0,"[psalm, lord, redeemeth, soul, servants, none,...",psalm the lord redeemeth the soul of hi serva...
5118,7299,nuclear%20reactor,"washington, d.c.",global nuclear reactor construction market gre...,1,"[global, nuclear, reactor, nstruction, market,...",global nuclear reactor construct market grew b...
5288,7555,outbreak,nj/nyc,wow the name legionnairesdisease comes from an...,1,"[wow, name, legionnairesdisease, mes, outbreak...",wow the name legionnairesdiseas come from an o...
2414,3473,derailed,"dc, frequently nyc/san diego",whoa wmata train derailed at smithsonian,1,"[whoa, wmata, train, derailed, smithsonian]",whoa wmata train derail at smithsonian



This completes the data cleaning. Next, we build models on the cleaned data.

## Bi-LSTMs for better accuracy

In [32]:
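tweet_1 = train.text.values
# Test tweets (the cleaning steps above were applied to train only)
test_1 = test.text.values
# The lowercasing step cast every column to str, so convert the labels back to int
sentiments = train.target.astype(int).values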

In [33]:
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(tweet_1)
vocab_length = len(word_tokenizer.word_index) + 1
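The `+ 1` in `vocab_length` is there because the Keras `Tokenizer` numbers words starting at 1 and leaves index 0 free for padding. The following pure-Python function is a simplified approximation of that indexing (it ignores the tokenizer's punctuation filtering and out-of-vocabulary handling), and `build_word_index` is a hypothetical name:

```python
from collections import Counter

def build_word_index(texts):
    # Count every word across the corpus
    counts = Counter(word for text in texts for word in text.lower().split())
    # The most frequent word gets index 1; index 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["forest fire near la ronge", "fire crews at the forest fire"]
word_index = build_word_index(texts)
vocab_length = len(word_index) + 1  # +1 for the reserved padding index

print(word_index["fire"], word_index["forest"])  # 1 2
print(vocab_length)                              # 9
```

The embedding layer later needs `vocab_length` rows so that every index from 0 up to the largest word index has a vector.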

In [51]:
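def metrics(pred_tag, y_test):
    # scikit-learn's signature is metric(y_true, y_pred). Predictions are passed
    # first here, which leaves F1 and accuracy unchanged but interchanges the
    # values printed as precision and recall.
    print("F1-score: ", f1_score(pred_tag, y_test))
    print("Precision: ", precision_score(pred_tag, y_test))
    print("Recall: ", recall_score(pred_tag, y_test))
    print("Accuracy: ", accuracy_score(pred_tag, y_test))
    #print("-"*50)
    #print(classification_report(pred_tag, y_test))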
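Argument order matters for precision and recall. The sketch below computes both by hand on made-up labels (no scikit-learn needed) to show what happens when the true labels and predictions trade places:

```python
def precision_recall(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return tp / (tp + fp), tp / (tp + fn)

# Made-up labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

p, r = precision_recall(y_true, y_pred)
p_swapped, r_swapped = precision_recall(y_pred, y_true)
print(p, r)                  # 0.6 0.75
print(p_swapped, r_swapped)  # 0.75 0.6
```

Swapping the arguments turns false positives into false negatives and vice versa, so precision and recall exchange values, while F1 and accuracy stay the same.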


In [52]:
def embed(corpus): 
    return word_tokenizer.texts_to_sequences(corpus)

In [36]:
nltk.download('punkt')
longest_train = max(tweet_1, key=lambda sentence: len(word_tokenize(sentence)))
length_long_sentence = len(word_tokenize(longest_train))
padded_sentences = pad_sequences(embed(tweet_1), length_long_sentence, padding='post')
test_sentences = pad_sequences(
    embed(test_1), 
    length_long_sentence,
    padding='post'
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
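For readers unfamiliar with `pad_sequences`, here is a pure-Python sketch of what `padding='post'` does with a fixed length. `pad_post` is a hypothetical helper, and the truncation rule mirrors the Keras default (`truncating='pre'`, which keeps the last `maxlen` items):

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # too long: keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: pad at the end
    return padded

padded_demo = pad_post([[4, 2], [7, 1, 3, 9]], maxlen=3)
print(padded_demo)  # [[4, 2, 0], [1, 3, 9]]
```

Every row ends up with the same length, which is what the embedding layer's `input_length` expects. The padding value 0 matches the index that the tokenizer never assigns to a word.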


In [38]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [39]:
embeddings_dictionary = dict()
embedding_dim = 200
# Each line of the GloVe file holds a word followed by its 200 vector components
with open('/content/drive/My Drive/glove.6B.200d.txt', encoding='utf-8') as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

In [40]:
embedding_matrix = np.zeros((vocab_length, embedding_dim))
for word, index in word_tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
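The same construction on toy data makes the role of the zero rows visible. Everything below (the three-dimensional vectors, the three-word vocabulary, the `coverage` figure) is made up for illustration and is not taken from GloVe:

```python
import numpy as np

embedding_dim = 3  # toy size; the notebook uses 200
toy_embeddings = {
    "fire": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "forest": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"fire": 1, "forest": 2, "wmata": 3}
vocab_length = len(toy_word_index) + 1  # row 0 is reserved for padding

embedding_matrix = np.zeros((vocab_length, embedding_dim))
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:  # words without a pretrained vector keep a zero row
        embedding_matrix[index] = vector

covered = sum(word in toy_embeddings for word in toy_word_index)
coverage = covered / len(toy_word_index)
print(embedding_matrix.shape, f"coverage: {coverage:.0%}")
```

Computing the same coverage figure on the real vocabulary would show how many tweet tokens (hashtags, usernames, misspellings) start from an all-zero vector.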

In [41]:
X_train, X_test, y_train, y_test = train_test_split(
    padded_sentences, 
    sentiments, 
    test_size=0.25
)

In [42]:
def BLSTM():
    model = Sequential()
    model.add(Embedding(input_dim=embedding_matrix.shape[0], 
                        output_dim=embedding_matrix.shape[1], 
                        weights = [embedding_matrix], 
                        input_length=length_long_sentence))
    model.add(Bidirectional(LSTM(length_long_sentence, return_sequences = True, recurrent_dropout=0.2)))
    model.add(GlobalMaxPool1D())
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(length_long_sentence, activation = "relu"))
    model.add(Dropout(0.5))
    model.add(Dense(length_long_sentence, activation = "relu"))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation = 'sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [43]:
model = BLSTM()
checkpoint = ModelCheckpoint(
    'model.h5', 
    monitor = 'val_loss', 
    verbose = 1, 
    save_best_only = True
)
reduce_lr = ReduceLROnPlateau(
    monitor = 'val_loss', 
    factor = 0.2, 
    verbose = 1, 
    patience = 5,                        
    min_lr = 0.001
)
history = model.fit(
    X_train, 
    y_train, 
    epochs = 7,
    batch_size = 32,
    validation_data = (X_test, y_test),
    verbose = 1,
    callbacks = [reduce_lr, checkpoint]
)

Epoch 1/7

Epoch 00001: val_loss improved from inf to 0.00000, saving model to model.h5
Epoch 2/7

Epoch 00002: val_loss did not improve from 0.00000
Epoch 3/7

Epoch 00003: val_loss did not improve from 0.00000
Epoch 4/7

Epoch 00004: val_loss did not improve from 0.00000
Epoch 5/7

Epoch 00005: val_loss did not improve from 0.00000
Epoch 6/7

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.001.

Epoch 00006: val_loss did not improve from 0.00000
Epoch 7/7

Epoch 00007: val_loss did not improve from 0.00000


In [44]:
lstm = BLSTM()
lstm.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 31, 200)           4317000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 31, 62)            57536     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 62)                0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 62)                248       
_________________________________________________________________
dropout_3 (Dropout)          (None, 62)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 31)                1953      
_________________________________________________________________
dropout_4 (Dropout)          (None, 31)               
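The parameter counts in the summary can be reproduced by hand, which is a useful check on how the layers are wired. In this sketch the vocabulary size of 21,585 is inferred from the embedding parameter count (4,317,000 / 200) rather than read from the notebook, and 31 is the sequence length, which the model also uses as the number of LSTM units:

```python
vocab_length = 21585   # inferred: 4317000 embedding parameters / 200 dimensions
embedding_dim = 200
units = 31             # length_long_sentence, reused as the LSTM width

# Embedding: one vector per vocabulary index
embedding_params = vocab_length * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * (units * (embedding_dim + units) + units)
bilstm_params = 2 * lstm_one_direction  # forward and backward copies

# BatchNormalization: gamma, beta, moving mean, moving variance per feature
batchnorm_params = 4 * (2 * units)

# First Dense layer: 62 inputs -> 31 outputs, plus biases
dense_params = (2 * units) * units + units

print(embedding_params, bilstm_params, batchnorm_params, dense_params)
# 4317000 57536 248 1953
```

The embedding layer holds the vast majority of the parameters, and since the model does not pass `trainable=False`, those weights are updated during training as well.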

In [53]:
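# predict_classes is unavailable in newer Keras releases; thresholding the sigmoid output at 0.5 is equivalent
preds = (model.predict(X_test) > 0.5).astype('int32')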
metrics(preds, y_test)

F1-score:  0.7824474660074165
Precision:  0.798234552332913
Recall:  0.7672727272727272
Accuracy:  0.8151260504201681


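### Thus, we got a reported recall of 76.7%, which is a significant improvement over our other methods

Note that `metrics` passes the predictions as the first argument, while scikit-learn expects `(y_true, y_pred)`. F1 (0.78) and accuracy (0.82) are unaffected, but the values printed as precision and recall are interchanged relative to the true labels.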