<a href="https://colab.research.google.com/github/marcelobenedito/quarantine_covid19_behavior_analysis/blob/master/quarantine_covid19_behavior_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**QUARANTINE COVID-19 BEHAVIOR ANALYSIS**

*It will be collect data tweets about COVID-19, quarantine and related about. This content will analysed to extract sentiment and main user behavior that makes don't stay home.*

**Install libraries**

In [6]:
!pip3 install unidecode
!pip3 install twitterscraper
!pip3 install emoji
!pip3 install joblib



**Required imports**

In [7]:
import datetime as dt
import pandas as pd
import numpy as np
import string
import re
import nltk
from unidecode import unidecode
from twitterscraper import query_tweets
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from emoji import demojize

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **1. Extract tweets**

**Create a funcion to search tweets**

In [None]:
def search_tweets(search_filter, since, until, limit, language):
  return query_tweets(query = search_filter, begindate = since, enddate = until, limit = limit, lang = language)

**Defining filters used in search**

In [None]:
""" não estou saindo "não estou saindo" (quarentena OR covid) (#covid-19 OR #coronavírus OR #coronavirus OR #covid OR #quarentena) lang:pt until:2020-01-31 since:2020-01-01 -filter:replies """

contains_both_words = ''
exact_phrase = ''
contains_any_words = '(quarentena OR covid OR coronavirus OR isolamento OR festa OR role OR evento OR balada OR sair OR saindo)'
contains_any_hashtags = ''
no_retweet = '-filter:replies'
language = 'pt'
since = dt.date(2020,1,1)
until = dt.date(2020,1,2)
limit = 10

search_filter = contains_both_words + ' ' + exact_phrase + ' ' + contains_any_words + ' ' + contains_any_hashtags + ' ' + no_retweet

**Extracting tweets based on search filter**

In [None]:
tweets = search_tweets(search_filter, since, until, limit, language)

**Transform Json to DataFrame and export to CSV file**

In [None]:
df = pd.DataFrame({
    'tweet_id': tweet.tweet_id, 
    'text': unidecode(tweet.text),  
    'tweet_url': tweet.tweet_url,
    'retweets': tweet.retweets,
    'replies': tweet.replies,
    'is_replied': tweet.is_replied,
    'is_reply_to': tweet.is_reply_to,
    'user_id': tweet.user_id, 
    'screenname': tweet.screenname,
    'created_at': tweet.timestamp
} for tweet in tweets)

df.to_csv('tweets.csv', encoding = 'utf-8', index = False)

**Printing found tweets**

In [None]:
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at
0,1212523890540462082,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1190617786294374400,2020-01-01 23:59:57
1,1212523890439786496,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1064533534109589504,2020-01-01 23:59:57
2,1212523879970885633,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1046604175717601280,2020-01-01 23:59:55
3,1212523878477635584,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1032060705221083137,2020-01-01 23:59:54
4,1212523878230171648,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,723473031293743104,2020-01-01 23:59:54


## **2. Preprocessing**

This process is used to preprocess the tweet text:

 - Tokenize words;
 - Remove all stop words; 
 - Punctuaction rules; 
 - Unused characters;
 - Links from tweets.

In [19]:
def preprocessing(pd_serie):

  # Converting to lowercase
  pd_serie = pd_serie.str.lower()

  # Removing punctuation rules
  pd_serie = pd_serie.str.translate(str.maketrans('', '', string.punctuation))

  # Removing unused links
  pd_serie = pd_serie.str.replace(r"(http|@)\S+", "")

  # Transform short negation form
  pd_serie = pd_serie.str.replace(r"(nao| n | ñ )", 'não')

  # Remove special chars
  pd_serie = pd_serie.apply(demojize)
  pd_serie = pd_serie.str.replace(r"::", ": :")
  pd_serie = pd_serie.str.replace(r"’", "'")
  pd_serie = pd_serie.str.replace(r"[^a-z\':_]", " ")

  # Remove repetitions
  pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
  pd_serie = pd_serie.str.replace(pattern, r"\1")

  # Removing stop words
  sw = stopwords.words('portuguese')
  sw.remove('não')

  pd_serie = pd_serie.apply(
      lambda pd_serie: ' '.join([word for word in pd_serie.split() if word not in sw])
  )

  return pd_serie

**Add negation tag to make emphasis in a negative phrase**

In [20]:
def negative_phrase(phrase):
  negative_word = 'não'
  has_nagative_word = False
  new_phrase = []
  for word in phrase.split():
    if has_negative_word == True:
      word = word + '_NÃO'
    if word == negative_word:
      has_negative_word = True
    new_phrase.append(word)
  return (' '.join(new_phrase))

**Open stored tweets from CSV file**

In [21]:
df = pd.read_csv('tweets.csv')
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at,category
0,1.21e+18,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1.19e+18,1/1/2020 23:59,0
1,1.21e+18,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1.06e+18,1/1/2020 23:59,0
2,1.21e+18,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1.05e+18,1/1/2020 23:59,0
3,1.21e+18,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1.03e+18,1/1/2020 23:59,0
4,1.21e+18,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,7.23e+17,1/1/2020 23:59,0


**Count lines in dataset**

In [22]:
df.text.count()

20

**Removing duplicate lines**

In [23]:
df.drop_duplicates(['text'], inplace=True)

**Count lines**

In [24]:
df.text.count()

20

**Preproccessing data**

In [26]:
classes = df.category
tweets = df.text
tweets = preprocessing(tweets)
tweets

0                       vou tchau pra vida festa ta bom
1     quanto coisas demoram p sair cabeca eh inacred...
2     momento pos role sempre baseado pensar histori...
3                  deu janeiro quero sair emprego eai k
4     vc diria cuida pq especial pau cu pede dsclp v...
5           marca role desses chiques melhor vai direto
6     ultimo role casa isa pictwittercom epxqizxqo a...
7     so chama pra sair n quero estralar dedos faco ...
8     falta mes pro aniversario ja estao planejando ...
9     so vou festas faco playlist agr porque festa t...
10    medo ser julgada sair quarto la menina come do...
11             muitas fotos saindo merda vai ser grande
12    encontrei melhor jogo festa todos tempos confi...
13                                       sim chato role
14    c sabe aniversario parceiro deg janeiro doce n...
15    ano chorei onibus chorei sala aula chorei anda...
16                     cabelo ta mto lindo vou ate sair
17                    quero saber vai ser primei

## **3. Training process**

**Required libraries**

In [39]:
import sklearn
import joblib
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer

### **2.0.1 - MLPClassifier**

**Create a Multilayer Perceptron Model**

- Hidden layers = 1
- Neurons = 10
- Learning rate = 0.01
- Max iteration = 500
- Optimizer = Stochastic Gradient Descent with no batch-size

In [31]:
"""mlp_model = MLPClassifier(hidden_layer_sizes=(10), 
                          solver='sgd', 
                          learning_rate_init=0.01,
                          max_iter=500,
                          random_state=113)"""

mlp_model = MLPClassifier(learning_rate_init=0.01,
                          random_state=42)                

**Create pipeline for MLP classifier**

In [32]:
pipe = Pipeline([('vectorizer', CountVectorizer()), ('model', mlp_model)])

pipe_bigrams = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1,2))), ('model', mlp_model)])

pipe_negation = Pipeline([
  ('vectorizer', CountVectorizer(tokenizer=lambda phrase: negative_phrase(phrase))), 
  ('model', mlp_model)
])

**Split x and y (feature and target)**

In [91]:
train_size=0.8
X_train, X_test, y_train, y_test = train_test_split(tweets,
                                                    classes,
                                                    train_size=train_size)

In [114]:
X_train

14    c sabe aniversario parceiro deg janeiro doce n...
11             muitas fotos saindo merda vai ser grande
4     vc diria cuida pq especial pau cu pede dsclp v...
12    encontrei melhor jogo festa todos tempos confi...
17                    quero saber vai ser primeiro role
1     quanto coisas demoram p sair cabeca eh inacred...
5           marca role desses chiques melhor vai direto
16                     cabelo ta mto lindo vou ate sair
15    ano chorei onibus chorei sala aula chorei anda...
19       unica tristeza desse ano novo n sao dias festa
8     falta mes pro aniversario ja estao planejando ...
2     momento pos role sempre baseado pensar histori...
6     ultimo role casa isa pictwittercom epxqizxqo a...
0                       vou tchau pra vida festa ta bom
10    medo ser julgada sair quarto la menina come do...
7     so chama pra sair n quero estralar dedos faco ...
Name: text, dtype: object

In [116]:
pd.DataFrame({'tweet': X_test, 'class': y_test})

Unnamed: 0,tweet,class
3,deu janeiro quero sair emprego eai k,0
9,so vou festas faco playlist agr porque festa t...,1
18,melhor role sempre fica pro final,0
13,sim chato role,0


**Run training MLP model**

In [98]:
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print('\r{} [INFO] Accuracy of MLP model with testing data is {:.1%}\n'
  .format(dt.datetime.now(), (1-train_size), accuracy))

2020-08-16 14:57:19.287339 [INFO] Accuracy of MLP model with testing data is 20.0%



**Cross validation**

In [99]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

In [101]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X_train, y_train, cv=kfold)
print("Average accuracy: %f (%f)" %(results['test_score'].mean(), results['test_score'].std()))

results = cross_val_predict(pipe, X_train, y_train, cv=kfold)
metrics.accuracy_score(y_train, results)

Average accuracy: 0.433333 (0.133333)


0.4375

In [104]:
print(metrics.classification_report(y_train, results, pipe.classes_))

              precision    recall  f1-score   support

           0       0.70      0.54      0.61        13
           1       0.00      0.00      0.00         3

    accuracy                           0.44        16
   macro avg       0.35      0.27      0.30        16
weighted avg       0.57      0.44      0.49        16



**Confusion matrix**

In [107]:
print(pd.crosstab(y_train, results, rownames=['Real'], colnames=['Predicted'], margins=True))

Predicted   0  1  All
Real                 
0           7  6   13
1           3  0    3
All        10  6   16


**Store MPL model in disk**

In [108]:
file_name = 'mlp_model'
joblib.dump(pipe, file_name)
print('\rMPL model was saved sucessfully!')

MPL model was saved sucessfully!


**Load MLP model**

In [109]:
pipe = joblib.load(file_name)

**Predict tweets**

In [121]:
tests = X_test
predict_result = pipe.predict(X_test)

pd.DataFrame(zip(tests, predict_result), columns=['tweet', 'class'])

Unnamed: 0,tweet,class
0,deu janeiro quero sair emprego eai k,0
1,so vou festas faco playlist agr porque festa t...,0
2,melhor role sempre fica pro final,0
3,sim chato role,0


**Prob for each class**

In [None]:
print(pipe.classes_)
pipe.predict_proba(X_test)

### **2.0.2 - Naive Bayes**

### **2.0.3 - Sequential Minimal Optimization**