<a href="https://colab.research.google.com/github/marcelobenedito/quarantine_covid19_behavior_analysis/blob/master/quarantine_covid19_behavior_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Quarantine Covid-19 Behavior Analysis**

*This notebook collects tweets about COVID-19, quarantine and related topics. The tweets are then analyzed to extract sentiment and to identify the main behaviors that lead users not to stay home.*

## **1 - Extracting and preprocessing data**

**Install libraries**

In [40]:
!pip3 install unidecode
!pip3 install twitterscraper
!pip3 install emoji



**Required imports**

In [41]:
import string
import time
import datetime as dt
import numpy as np
import pandas as pd
import re
import nltk
from unidecode import unidecode
from twitterscraper import query_tweets
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from emoji import demojize

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Create a function to search tweets**

In [None]:
def search_tweets(search_filter, since, until, limit, language):
  return query_tweets(query = search_filter, begindate = since, enddate = until, limit = limit, lang = language)

**Data preprocessing**

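This step cleans the tweet text. It applies the following clean-up:

 - Convert the text to lowercase;
 - Remove punctuation;
 - Remove links from tweets;
 - Replace emojis with text codes and drop other unused characters;
 - Collapse repeated characters;
 - Normalize short forms of the negation "não";
 - Remove all stop words (except "não").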

In [29]:
def preprocess_data(df):
  # Converting to lowercase
  tweets = df.text.str.lower()

  # Removing links and mentions first, while "@" and "://" are still present
  tweets = tweets.str.replace(r"(http|@)\S+", "", regex=True)

  # Removing punctuation
  tweets = tweets.str.translate(str.maketrans('', '', string.punctuation))

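  # Remove special chars (emojis become text codes delimited by colons)
  tweets = tweets.apply(demojize)
  tweets = tweets.str.replace("::", ": :", regex=False)
  tweets = tweets.str.replace("’", "'", regex=False)
  tweets = tweets.str.replace(r"[^a-z\':_]", " ", regex=True)

  # Remove repetitions (three or more of the same character collapse into one)
  pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
  tweets = tweets.str.replace(pattern, r"\1", regex=True)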

  # Transform short negation forms into "não" (whole words only, so neighbouring words are not merged)
  tweets = tweets.str.replace(r"\b(?:nao|n|ñ)\b", 'não', regex=True)

  # Spliting text into words
  # tweets = word_tokenize(tweets, 'english')

  # Removing stop words ("não" is kept because it carries the negation)
  sw = set(stopwords.words('portuguese'))
  sw.discard('não')

  tweets = tweets.apply(
      lambda tweet: ' '.join([word for word in tweet.split() if word not in sw])
  )

  return tweets
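
A minimal, standard-library-only sketch of the main regex steps applied to a single string. It assumes links and mentions are removed before punctuation and that the short negation is expanded as a whole word; the emoji and stop-word steps (which need `emoji` and `nltk`) are left out. The name `clean_tweet` and the sample text are hypothetical:

```python
import re
import string


def clean_tweet(text):
    """Apply the regex-based cleaning steps to one tweet (illustrative sketch)."""
    text = text.lower()
    # Drop links and @mentions while '@' and '://' are still present
    text = re.sub(r"(http|@)\S+", "", text)
    # Strip ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only a-z; every other character becomes a space
    text = re.sub(r"[^a-z]", " ", text)
    # Collapse runs of three or more identical characters into one
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Expand the shorthand "n" / "nao" into "não", whole words only
    text = re.sub(r"\b(?:nao|n)\b", "não", text)
    # Normalise whitespace
    return " ".join(text.split())


sample = "@amigo eu n vou sair hojeeee!!! https://t.co/abc123"
print(clean_tweet(sample))  # eu não vou sair hoje
```

The same regular expressions are what `preprocess_data` passes to `Series.str.replace`, so this is a quick way to check a pattern on one string before running it over the whole DataFrame.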

**Defining filters used in search**

In [None]:
""" não estou saindo "não estou saindo" (quarentena OR covid) (#covid-19 OR #coronavírus OR #coronavirus OR #covid OR #quarentena) lang:pt until:2020-01-31 since:2020-01-01 -filter:replies """

contains_both_words = ''
exact_phrase = ''
contains_any_words = '(quarentena OR covid OR coronavirus OR isolamento OR festa OR role OR evento OR balada OR sair OR saindo)'
contains_any_hashtags = ''
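exclude_replies = '-filter:replies'  # this operator drops replies, not retweets
language = 'pt'
since = dt.date(2020,1,1)
until = dt.date(2020,1,2)
limit = 10

# Join only the non-empty parts so the query has no stray spaces
search_filter = ' '.join(part for part in [contains_both_words, exact_phrase, contains_any_words, contains_any_hashtags, exclude_replies] if part)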

**Extracting tweets based on search filter**

In [None]:
tweets = search_tweets(search_filter, since, until, limit, language)

**Convert the tweets to a DataFrame and export to a CSV file**

In [None]:
df = pd.DataFrame({
    'tweet_id': tweet.tweet_id, 
    'text': unidecode(tweet.text),  
    'tweet_url': tweet.tweet_url,
    'retweets': tweet.retweets,
    'replies': tweet.replies,
    'is_replied': tweet.is_replied,
    'is_reply_to': tweet.is_reply_to,
    'user_id': tweet.user_id, 
    'screenname': tweet.screenname,
    'created_at': tweet.timestamp
} for tweet in tweets)

df.to_csv('tweets.csv', encoding = 'utf-8', index = False)

**Printing found tweets**

In [None]:
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at
0,1212523890540462082,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1190617786294374400,2020-01-01 23:59:57
1,1212523890439786496,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1064533534109589504,2020-01-01 23:59:57
2,1212523879970885633,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1046604175717601280,2020-01-01 23:59:55
3,1212523878477635584,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1032060705221083137,2020-01-01 23:59:54
4,1212523878230171648,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,723473031293743104,2020-01-01 23:59:54


**Open stored tweets from CSV file**

In [24]:
df = pd.read_csv('tweets.csv')
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at,class
0,1.21e+18,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1.19e+18,1/1/2020 23:59,0
1,1.21e+18,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1.06e+18,1/1/2020 23:59,0
2,1.21e+18,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1.05e+18,1/1/2020 23:59,0
3,1.21e+18,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1.03e+18,1/1/2020 23:59,0
4,1.21e+18,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,7.23e+17,1/1/2020 23:59,0


**Preprocessing data**

In [30]:
tweets = preprocess_data(df)

**Preprocessed tweets containing a classification for each one**

In [31]:
preprocessed_data = pd.DataFrame({'tweets': tweets, 'class': df['class']})
preprocessed_data.head()

Unnamed: 0,tweets,class
0,vou tchau pra vida festa ta bom,0
1,quanto coisas demoram p sair cabeca eh inacred...,0
2,momento pos role sempre baseado pensar histori...,0
3,deu janeiro quero sair emprego eai k,0
4,vc diria cuida pq especial pau cu pede dsclp v...,0


## **2 - Training process**

**Required imports**

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer

**Transform text to vector**

In [33]:
cv = CountVectorizer()
cv.fit(preprocessed_data['tweets'])

vectorized_tweets = cv.transform(preprocessed_data['tweets'])
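
To make the bag-of-words representation concrete, here is a small sketch on two hypothetical, already-preprocessed tweets (not taken from the dataset). `cv_demo` is an illustrative name chosen so it does not clash with the notebook's `cv`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-preprocessed tweets (illustration only)
docs = ["quero sair festa", "não quero sair"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs)

# vocabulary_ maps each token to its column index; columns are sorted alphabetically
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(columns)           # ['festa', 'não', 'quero', 'sair']
print(counts.toarray())  # one row per tweet, one column per token
```

Note that the default token pattern only keeps tokens of two or more characters, so single-letter leftovers such as `k` or `p` never become features.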

**Split features and target into training and test sets**

In [34]:
X_train, X_test, y_train, y_test = train_test_split(vectorized_tweets,
                                                    preprocessed_data['class'],
                                                    test_size=0.2)
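
Because the vocabulary is fitted on every tweet before the split, words that only occur in test tweets already shape the feature space. A hedged alternative, sketched with hypothetical toy data, splits the raw text first and fits the vectorizer on the training part only. `stratify` and `random_state` are optional additions that keep class proportions and make the split repeatable:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical, already-preprocessed tweets and labels (illustration only)
texts = ["vou ficar casa", "ficar casa hoje", "quarentena casa", "não vou sair",
         "vou festa hoje", "role balada", "quero sair festa", "balada hoje"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, keeping one tweet of each class in the test set
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# Learn the vocabulary from the training tweets only, then reuse it for the test tweets
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Words that appear only in the test tweets are simply ignored by `transform`, which mirrors what happens when the model later sees new tweets.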

### **2.0.1 - MLPClassifier**

**Create a Multilayer Perceptron Model**

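Configuration originally planned (kept in the commented-out block below):

- Hidden layers = 1
- Neurons = 10
- Learning rate = 0.01
- Max iteration = 500
- Optimizer = Stochastic Gradient Descent with the default batch size

The model actually trained keeps scikit-learn's defaults apart from the learning rate: one hidden layer of 100 neurons, the Adam optimizer and at most 200 iterations.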

In [35]:
"""mlp_model = MLPClassifier(hidden_layer_sizes=(10,), 
                          solver='sgd', 
                          learning_rate_init=0.01,
                          max_iter=500,
                          random_state=113)"""

mlp_model = MLPClassifier(learning_rate_init=0.01,
                          random_state=42)                

**Train the MLP model**

In [36]:
mlp_model.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.01, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=42, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

**Score from MLP model**

In [37]:
score = mlp_model.score(X_test, y_test)
score = round((score * 100.0), 2)
print('{} [INFO] The MLP model has {}% of accuracy'.format(dt.datetime.now(), score))

2020-08-12 23:51:20.999001 [INFO] The MLP model has 75.0% of accuracy


**Predict tweets**

In [38]:
print(y_test)

15    0
13    0
0     0
5     0
Name: class, dtype: int64


In [39]:
"""from sklearn.metrics import accuracy_score

data = pd.DataFrame({'text': ['going market now', 
                     'tomorrow cool party', 
                     'want leave home'], 'class': [0, 1, 0]})

new_tweets = preprocess_data(data)
data = pd.DataFrame({'tweets': new_tweets, 'class': data['class']})

tweet_count =  cv.transform(data['tweets'])
tweet_pred = mlp_model.predict(tweet_count)
print(tweet_pred)

score = accuracy_score(data['class'], tweet_pred)
score = round((score * 100.0), 2)
print('{} [INFO] The MLP model has {}% of accuracy'.format(dt.datetime.now(), score))"""

tweet_pred = mlp_model.predict(X_test)
print("Predicted: ", tweet_pred)
print(y_test)

Predicted:  [0 1 0 0]
15    0
13    0
0     0
5     0
Name: class, dtype: int64


**Confusion matrix**
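
One way to fill in this step, sketched with scikit-learn's `confusion_matrix` and the four test labels and predictions printed above (copied here by hand so the snippet runs on its own):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Values copied from the printed y_test and predictions
y_true = [0, 0, 0, 0]
y_pred = [0, 1, 0, 0]

# labels=[0, 1] forces a 2x2 matrix even though class 1 never occurs in y_true
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[3 1]
#  [0 0]]

print(accuracy_score(y_true, y_pred))  # 0.75
```

Rows are true classes and columns are predicted classes, so the single off-diagonal entry is the class 0 tweet predicted as class 1, consistent with the 75% accuracy reported earlier. In the notebook itself the call would be `confusion_matrix(y_test, tweet_pred, labels=[0, 1])`.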

### **2.0.2 - Naive Bayes**
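
A minimal sketch of this step, assuming a multinomial Naive Bayes over the same kind of bag-of-words counts. The toy tweets are hypothetical, and the label meaning (1 = going out, 0 = staying home) is an assumption about the `class` column:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-preprocessed tweets; labels are assumed (1 = going out)
train_texts = ["vou ficar casa", "ficar casa quarentena",
               "vou festa hoje", "role balada hoje"]
train_labels = [0, 0, 1, 1]

nb_vectorizer = CountVectorizer()
nb_model = MultinomialNB()  # Laplace smoothing (alpha=1.0) by default
nb_model.fit(nb_vectorizer.fit_transform(train_texts), train_labels)

nb_pred = nb_model.predict(nb_vectorizer.transform(["festa balada", "ficar casa"]))
print(nb_pred)
```

In the notebook, the same model could be fitted on `X_train` and `y_train` and scored with `nb_model.score(X_test, y_test)`, exactly as done for the MLP.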

### **2.0.3 - Sequential Minimal Optimization**
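
A minimal sketch of this step, assuming a linear support vector machine. scikit-learn's `SVC` is built on libsvm, whose solver is an SMO-type algorithm, so it stands in here for a dedicated SMO implementation. The toy tweets are hypothetical, and the label meaning (1 = going out, 0 = staying home) is an assumption about the `class` column:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical, already-preprocessed tweets; labels are assumed (1 = going out)
train_texts = ["vou ficar casa", "ficar casa quarentena",
               "vou festa hoje", "role balada hoje"]
train_labels = [0, 0, 1, 1]

svm_vectorizer = CountVectorizer()
svm_model = SVC(kernel="linear")  # libsvm-backed; solved with an SMO-type method
svm_model.fit(svm_vectorizer.fit_transform(train_texts), train_labels)

svm_pred = svm_model.predict(svm_vectorizer.transform(["festa balada", "ficar casa"]))
print(svm_pred)
```

As with the other models, fitting on `X_train` and calling `svm_model.score(X_test, y_test)` would make the result directly comparable with the MLP accuracy.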