<a href="https://colab.research.google.com/github/michal-dom/praca_magisterska/blob/master/disaster_tweets/DTC-EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Disaster tweets classification - EDA/Cleaning

https://www.kaggle.com/c/nlp-getting-started

From the competition description:

*Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).*



In [0]:
from google.colab import drive
drive.mount('/content/drive')

from os import listdir
from os.path import isfile, join
files = [f for f in listdir('/content/drive/My Drive/Studia/magisterka/disaster tweets')]
print(files)


Mounted at /content/drive
['test.csv', 'train.csv']


###Import utils

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import string
plt.style.use('ggplot')

###Import NLTK, gensim

In [0]:
from nltk import download
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

download('stopwords')
download('punkt')
stop=set(stopwords.words('english'))
import re

import gensim

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


###Import sklearn, keras

In [0]:
from collections import defaultdict
from collections import  Counter

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
from tqdm import tqdm

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding,LSTM,Dense,SpatialDropout1D
from keras.initializers import Constant
from keras.optimizers import Adam

Using TensorFlow backend.


##Reading data

In [0]:
train= pd.read_csv('/content/drive/My Drive/Studia/magisterka/disaster tweets/train.csv')
test=pd.read_csv('/content/drive/My Drive/Studia/magisterka/disaster tweets/test.csv')

#Cleaning data

Removing URLs e.g.:

From: 'Deputies: Man shot before Brighton home set ablaze http://t.co/gWNRhMSO8k'

To: 'Deputies: Man shot before Brighton home set ablaze'

In [0]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)


train['clean_text']=train['text'].apply(lambda x : remove_URL(x))
test['clean_text']=test['text'].apply(lambda x : remove_URL(x))
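
As a quick check of the pattern above, the sketch below applies the same regular expression to the example tweet. The names URL_PATTERN and strip_urls are introduced here for illustration only. Note that the space in front of the removed link stays in the result.

```python
import re

# Same pattern as in remove_URL above, compiled once at module level.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text):
    """Delete every http(s):// or www. link from the text."""
    return URL_PATTERN.sub('', text)

example = 'Deputies: Man shot before Brighton home set ablaze http://t.co/gWNRhMSO8k'
cleaned = strip_urls(example)
print(repr(cleaned))  # repr makes the trailing space visible
```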

Removing account mentions (handles starting with @), e.g.:

From: "@Navista7 Steve these fires out here are something else! California is a tinderbox - and this clown was setting my 'hood ablaze @News24680"

To: " Steve these fires out here are something else! California is a tinderbox - and this clown was setting my 'hood ablaze "

In [0]:
def remove_ATs(text):
    mention = re.compile(r'\B@\w+')
    return mention.sub(r'',text)

train['clean_text']=train['clean_text'].apply(lambda x : remove_ATs(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_ATs(x))
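
The leading \B in the pattern is what keeps e-mail addresses intact: it only matches where the @ is not glued to a preceding word character. A minimal sketch of that behaviour, with MENTION_PATTERN and strip_mentions as illustrative names and made-up input strings:

```python
import re

# Same pattern as in remove_ATs above.
MENTION_PATTERN = re.compile(r'\B@\w+')

def strip_mentions(text):
    """Remove @handles that start a token; leave e-mail addresses alone."""
    return MENTION_PATTERN.sub('', text)

# A handle at the start of the text and one after a space are both removed.
with_mention = strip_mentions('@Navista7 these fires are something else @News24680')

# In an e-mail address the @ follows a word character, so \B does not match.
with_email = strip_mentions('contact me at user@example.com')

print(repr(with_mention))
print(repr(with_email))
```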

Removing HTML tags:

In [0]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

train['clean_text']=train['clean_text'].apply(lambda x : remove_html(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_html(x))
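
The question mark in <.*?> makes the match non-greedy, so each tag is removed on its own instead of everything between the first < and the last >. The sketch below contrasts the two variants on a made-up string; TAG_PATTERN is an illustrative name.

```python
import re

# Non-greedy pattern, as in remove_html above.
TAG_PATTERN = re.compile(r'<.*?>')

sample = '<b>Breaking</b> news <br/>now'

# Each tag is matched separately, so the text between tags survives.
non_greedy = TAG_PATTERN.sub('', sample)

# The greedy variant swallows everything from the first '<' to the last '>'.
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy))
print(repr(greedy))
```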

Removing emojis:

In [0]:
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters (wide range: also covers CJK and other scripts)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

train['clean_text']=train['clean_text'].apply(lambda x : remove_emoji(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_emoji(x))

Replacing newlines with spaces:

In [0]:
def remove_new_lines(text):
    return re.sub(r'[\n]', ' ', text)


train['clean_text']=train['clean_text'].apply(lambda x : remove_new_lines(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_new_lines(x))

Removing non-alphanumeric characters (spaces and apostrophes are kept):

In [0]:
def remove_punct(text):
    return re.sub(r'[^A-Za-z0-9 \']', '', text)


train['clean_text']=train['clean_text'].apply(lambda x : remove_punct(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_punct(x))

Collapsing multiple spaces into one:

In [0]:
def remove_spaces(text):
    spaces=re.compile(r' +')
    return spaces.sub(r' ',text)

train['clean_text']=train['clean_text'].apply(lambda x : remove_spaces(x))
test['clean_text']=test['clean_text'].apply(lambda x : remove_spaces(x))

Lowercasing letters

In [0]:
def to_lower(text):
    return str(text).lower()

train['clean_text']=train['clean_text'].apply(lambda x : to_lower(x))
test['clean_text']=test['clean_text'].apply(lambda x : to_lower(x))
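
The cleaning steps above can also be chained in a single function, which is one way to avoid repeating the apply call for every step. This is only a sketch: STEPS, clean_tweet and the sample tweet are assumptions made for illustration, and the emoji step is left out because the non-alphanumeric filter already drops those characters.

```python
import re

# (pattern, replacement) pairs in the same order as the cells above.
STEPS = [
    (re.compile(r'https?://\S+|www\.\S+'), ''),  # URLs
    (re.compile(r'\B@\w+'), ''),                 # @mentions
    (re.compile(r'<.*?>'), ''),                  # HTML tags
    (re.compile(r'\n'), ' '),                    # newlines become spaces
    (re.compile(r"[^A-Za-z0-9 ']"), ''),         # keep letters, digits, spaces, apostrophes
    (re.compile(r' +'), ' '),                    # collapse repeated spaces
]

def clean_tweet(text):
    """Apply every cleaning step in order, then lowercase the result."""
    for pattern, replacement in STEPS:
        text = pattern.sub(replacement, text)
    return text.lower()

raw = "@Navista7 13,000 people receive #wildfires evacuation\norders http://t.co/abc123"
print(repr(clean_tweet(raw)))
```

With a DataFrame this could be applied as train['text'].apply(clean_tweet). As in the step-by-step version, leading and trailing spaces are not stripped.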

Removing stop words:

In [0]:
def remove_stop_words(df):
    corpus=[]
    for tweet in tqdm(df['clean_text']):
        # keep purely alphabetic tokens that are not stop words
        words=[word for word in word_tokenize(tweet) if word.isalpha() and word not in stop]
        corpus.append(" ".join(words))
    return corpus


train['clean_text_without_stops']=remove_stop_words(train)
test['clean_text_without_stops']=remove_stop_words(test)

100%|██████████| 7613/7613 [00:01<00:00, 7522.60it/s]
100%|██████████| 3263/3263 [00:00<00:00, 7642.47it/s]
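
The filter used above can be illustrated without downloading any NLTK data. The sketch below is a simplification under stated assumptions: MINI_STOP is a hypothetical five-word stop list (the notebook uses NLTK's full English list) and str.split stands in for word_tokenize. It also shows that the isalpha check drops numeric tokens such as 13000.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English stop words.
MINI_STOP = {'our', 'are', 'the', 'of', 'this'}

def drop_stop_words(text, stop_words=MINI_STOP):
    """Keep purely alphabetic tokens that are not in the stop list."""
    kept = [w for w in text.split() if w.isalpha() and w not in stop_words]
    return ' '.join(kept)

result = drop_stop_words('our deeds are the reason of this earthquake 13000')
print(result)
```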


In [0]:
train.head()

| | id | keyword | location | text | target | clean_text | clean_text_without_stops |
|---|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | our deeds are the reason of this earthquake ma... | deeds reason earthquake may allah forgive us |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | forest fire near la ronge sask canada | forest fire near la ronge sask canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | all residents asked to 'shelter in place' are ... | residents asked place notified officers evacua... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | 13000 people receive wildfires evacuation orde... | people receive wildfires evacuation orders cal... |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | just got sent this photo from ruby alaska as s... | got sent photo ruby alaska smoke wildfires pou... |


In [0]:
train.to_csv('/content/drive/My Drive/Studia/magisterka/disaster tweets/clean_train.csv', sep='\t', encoding='utf-8')
test.to_csv('/content/drive/My Drive/Studia/magisterka/disaster tweets/clean_test.csv', sep='\t', encoding='utf-8')


#Tokenization

Tokenization is the task of chopping a defined document unit into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

An example of tokenization:

Friends, Romans, Countrymen, lend me your ears

['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']

Contractions are a problem here, e.g.:

aren't -> are not

she'd -> she would

The Polish language does not have this problem.
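
One common way to handle contractions is to expand them before tokenizing. The sketch below is not part of the original pipeline: CONTRACTIONS is a hypothetical, deliberately tiny mapping built from the two examples above, and it assumes lowercase input. A real list would be much longer, and forms such as she'd are ambiguous (she would or she had).

```python
import re

# Hypothetical two-entry mapping, taken from the examples above.
CONTRACTIONS = {"aren't": "are not", "she'd": "she would"}

# One alternation of all keys, anchored on word boundaries.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b"
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)

expanded = expand_contractions("they aren't sure she'd come")
tokens = expanded.split()
print(tokens)
```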



In [0]:
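# NOTE: `tweet` is not defined in the cells above; it is assumed to be the
# cleaned training DataFrame (7613 rows) with the cleaned text in 'text'.
def create_corpus(df):
    corpus=[]
    for text in tqdm(df['text']):
        # keep purely alphabetic tokens that are not stop words
        words=[word for word in word_tokenize(text) if word.isalpha() and word not in stop]
        corpus.append(words)
    return corpus


corpus=create_corpus(tweet)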

100%|██████████| 7613/7613 [00:01<00:00, 7233.00it/s]


In [0]:
print(corpus[17])

['summer', 'lovely']


In [0]:
words_dict = {}

# count how often each space-separated token occurs
for text in tweet['text']:
  for w in text.split(' '):
    words_dict[w] = words_dict.get(w, 0) + 1


In [0]:
print(words_dict)



In [0]:
max(words_dict, key=words_dict.get)


''
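
A likely explanation for the empty string being the most frequent key is that str.split(' ') produces an empty token for every extra space, while str.split() with no argument splits on any run of whitespace. A small sketch on a made-up string with double spaces:

```python
from collections import Counter

text = 'forest  fire near  la ronge'  # note the two double spaces

# Splitting on a single space yields '' between consecutive spaces.
counts_single_space = Counter(text.split(' '))

# Splitting on whitespace runs never yields empty tokens.
counts_whitespace = Counter(text.split())

most_frequent = max(counts_single_space, key=counts_single_space.get)
print(repr(most_frequent))
print(sorted(counts_whitespace))
```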

In [0]:
tweet.head()

| | id | keyword | location | text | target | corpus |
|---|---|---|---|---|---|---|
| 0 | 1 | | | our deeds are the reason of this earthquake ma... | 1 | [deeds, reason, earthquake, may, allah, forgiv... |
| 1 | 4 | | | forest fire near la ronge sask canada | 1 | [forest, fire, near, la, ronge, sask, canada] |
| 2 | 5 | | | all residents asked to shelter in place are be... | 1 | [residents, asked, shelter, place, notified, o... |
| 3 | 6 | | | 13000 people receive wildfires evacuation orde... | 1 | [people, receive, wildfires, evacuation, order... |
| 4 | 7 | | | just got sent this photo from ruby alaska as s... | 1 | [got, sent, photo, ruby, alaska, smoke, wildfi... |


#Vectorization
Creating an embedding dictionary for English words from the pretrained GloVe vectors (glove.6B.100d.txt, 100 dimensions per word):



In [0]:
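embedding_dict={}
# each line of the GloVe file is a word followed by its 100 vector components
with open('/content/drive/My Drive/Studia/magisterka/glove/glove.6B.100d.txt','r',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]
        vectors=np.asarray(values[1:],'float32')
        embedding_dict[word]=vectors
# no explicit f.close() is needed: the with block closes the file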
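
To make the file layout concrete, the sketch below parses two made-up lines in the same word-then-numbers format from an in-memory buffer. The words, the three-component vectors and the names fake_glove and toy_embeddings are assumptions for illustration; the real file has 100 components per word.

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout: a word, then its vector.
fake_glove = io.StringIO('fire 0.1 -0.2 0.3\nflood 0.4 0.5 -0.6\n')

toy_embeddings = {}
for line in fake_glove:
    # First field is the word, the remaining fields are the coefficients.
    word, *coefs = line.split()
    toy_embeddings[word] = np.asarray(coefs, dtype='float32')

print(toy_embeddings['fire'])
```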

keras.preprocessing.text.Tokenizer is a class that vectorizes a text corpus by turning each text either into a sequence of integers (each integer being the index of a token in a dictionary) or into a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf. Here it is used in the first way: each tweet becomes a sequence of integer token indices, which is then padded or truncated to MAX_LEN.

In [0]:
MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences=tokenizer_obj.texts_to_sequences(corpus)

tweet_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')

word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))



Number of unique words: 14401
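
What padding='post' and truncating='post' do to the integer sequences can be sketched in plain NumPy. This is an illustrative re-implementation under that assumption, not the Keras code; pad_post and the sample sequences are made up.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """NumPy sketch of post-padding and post-truncation."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]               # post-truncation: keep the first maxlen ids
        out[row, :len(trimmed)] = trimmed    # post-padding: the fill value stays at the end
    return out

# One short sequence (gets padded) and one long sequence (gets truncated).
padded = pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(padded)
```

Index 0 is used as the fill value because the word indices produced by the tokenizer start at 1.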


In [0]:
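num_words=len(word_index)+1  # +1 because index 0 is reserved for padding
embedding_matrix=np.zeros((num_words,100))

for word,i in tqdm(word_index.items()):
    # guard against indices outside the matrix (never triggers here, since num_words covers every index)
    if i >= num_words:
        continue

    emb_vec=embedding_dict.get(word)
    # words missing from GloVe keep an all-zero row
    if emb_vec is not None:
        embedding_matrix[i]=emb_vec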

100%|██████████| 14401/14401 [00:00<00:00, 312201.10it/s]
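
The same lookup on a toy vocabulary shows how out-of-vocabulary words end up as zero rows, and how the share of covered words could be measured. All names and values below (toy_word_index, toy_vectors, the two-dimensional vectors) are invented for illustration.

```python
import numpy as np

# Hypothetical tokenizer output and pretrained vectors.
toy_word_index = {'fire': 1, 'flood': 2, 'zzzunknown': 3}
toy_vectors = {
    'fire': np.array([0.1, 0.2], dtype='float32'),
    'flood': np.array([0.3, 0.4], dtype='float32'),
}

dim = 2
# One extra row because index 0 is reserved for padding.
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))

missing = []
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is None:
        missing.append(word)   # row i stays all zeros
    else:
        toy_matrix[i] = vec

# Fraction of vocabulary words that have a pretrained vector.
coverage = 1 - len(missing) / len(toy_word_index)
print(toy_matrix)
print(missing, coverage)
```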


In [0]:
model=Sequential()

embedding=Embedding(num_words,100,embeddings_initializer=Constant(embedding_matrix),
                   input_length=MAX_LEN,trainable=False)

model.add(embedding)
model.add(SpatialDropout1D(0.2))
model.add(LSTM(32, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))


optimizer=Adam()

model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])

model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 50, 100)           1440200   
_________________________________________________________________
spatial_dropout1d_8 (Spatial (None, 50, 100)           0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 50, 32)            17024     
_________________________________________________________________
lstm_12 (LSTM)               (None, 64)                24832     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 1,482,121
Trainable params: 41,921
Non-trainable params: 1,440,200
_________________________________________________________________
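
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias vector); lstm_params is a helper name introduced here.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

vocab_rows, emb_dim = 14401 + 1, 100         # num_words and the GloVe dimension
embedding_params = vocab_rows * emb_dim      # frozen, hence non-trainable
lstm1_params = lstm_params(emb_dim, 32)      # first LSTM reads 100-d embeddings
lstm2_params = lstm_params(32, 64)           # second LSTM reads the 32-d outputs
dense_params = 64 * 1 + 1                    # weights plus bias of the sigmoid unit

trainable = lstm1_params + lstm2_params + dense_params
total = embedding_params + trainable
print(embedding_params, lstm1_params, lstm2_params, dense_params)
print(trainable, total)
```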


In [0]:
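# tweet_pad only holds the rows of `tweet`, so this slice keeps all of them;
# note that it overwrites the `train` DataFrame loaded earlier
train=tweet_pad[:tweet.shape[0]]
# empty here, because nothing was padded beyond the rows of `tweet`
test=tweet_pad[tweet.shape[0]:]

# X_test/y_test serve as the validation split passed to model.fit
X_train,X_test,y_train,y_test=train_test_split(train,tweet['target'].values,test_size=0.15)
print('Shape of train',X_train.shape)
print("Shape of Validation ",X_test.shape)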

Shape of train (6471, 50)
Shape of Validation  (1142, 50)


In [0]:
history=model.fit(X_train,y_train,batch_size=4,epochs=15,validation_data=(X_test,y_test),verbose=2)

Train on 6471 samples, validate on 1142 samples
Epoch 1/15
 - 127s - loss: 0.6806 - acc: 0.5834 - val_loss: 0.6875 - val_acc: 0.5639
Epoch 2/15
 - 126s - loss: 0.6440 - acc: 0.6498 - val_loss: 0.6027 - val_acc: 0.6918
Epoch 3/15
 - 126s - loss: 0.5927 - acc: 0.7053 - val_loss: 0.5342 - val_acc: 0.7680
Epoch 4/15
 - 125s - loss: 0.5581 - acc: 0.7421 - val_loss: 0.4930 - val_acc: 0.7855
Epoch 5/15
 - 126s - loss: 0.5289 - acc: 0.7574 - val_loss: 0.4531 - val_acc: 0.7977
Epoch 6/15
 - 125s - loss: 0.4985 - acc: 0.7758 - val_loss: 0.4519 - val_acc: 0.7986
Epoch 7/15
 - 125s - loss: 0.4916 - acc: 0.7840 - val_loss: 0.4493 - val_acc: 0.8117
Epoch 8/15
 - 126s - loss: 0.4825 - acc: 0.7892 - val_loss: 0.4423 - val_acc: 0.8056
Epoch 9/15
 - 125s - loss: 0.4710 - acc: 0.7861 - val_loss: 0.4198 - val_acc: 0.8135
Epoch 10/15
 - 126s - loss: 0.4659 - acc: 0.7926 - val_loss: 0.4307 - val_acc: 0.8109
Epoch 11/15
 - 126s - loss: 0.4604 - acc: 0.7929 - val_loss: 0.4364 - val_acc: 0.8135
Epoch 12/15
 - 