<a href="https://colab.research.google.com/github/marcelobenedito/quarantine_covid19_behavior_analysis/blob/master/quarantine_covid19_behavior_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Quarantine Covid-19 Behavior Analysis**

*This notebook collects tweets about COVID-19, quarantine and related topics. The content is then analysed to extract sentiment and to identify the main behaviors that lead users not to stay home.*

## **1 - Data extraction and preprocessing**

**Install libraries**

In [None]:
!pip3 install unidecode
!pip3 install googletrans
!pip3 install twitterscraper
!pip3 install emoji
!pip3 install simplejson



**Required imports**

In [None]:
import string
import time
import datetime as dt
import numpy as np
import pandas as pd
import re
import nltk
import simplejson
from unidecode import unidecode
from googletrans import Translator
from twitterscraper import query_tweets
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from emoji import demojize

**Create a function to search tweets**

In [None]:
def search_tweets(search_filter, since, until, limit, language):
  # Thin wrapper around twitterscraper's query_tweets using the notebook's argument names
  return query_tweets(query=search_filter, begindate=since, enddate=until, limit=limit, lang=language)

**Defining filters used in search**

In [None]:
""" não estou saindo "não estou saindo" (quarentena OR covid) (#covid-19 OR #coronavírus OR #coronavirus OR #covid OR #quarentena) lang:pt until:2020-01-31 since:2020-01-01 -filter:replies """

# Query parts; empty strings contribute nothing to the search
contains_both_words = ''
exact_phrase = ''
contains_any_words = '(quarentena OR covid OR coronavirus OR isolamento OR festa OR role OR evento OR balada OR sair OR saindo)'
contains_any_hashtags = ''
no_replies = '-filter:replies'  # excludes replies (not retweets) from the results
language = 'pt'
since = dt.date(2020,1,1)
until = dt.date(2020,1,2)
limit = 10

search_filter = contains_both_words + ' ' + exact_phrase + ' ' + contains_any_words + ' ' + contains_any_hashtags + ' ' + no_replies
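
Because the parts are concatenated with `' '`, every empty part leaves a stray space in the query. A minimal sketch of an alternative is given here, assuming a hypothetical helper `build_search_filter` (not part of the notebook or of twitterscraper) that skips the empty parts:

```python
def build_search_filter(*parts):
    """Join the non-empty query parts with single spaces.

    Hypothetical helper for illustration; it only builds the query string.
    """
    return ' '.join(part.strip() for part in parts if part and part.strip())


contains_any_words = '(quarentena OR covid)'
exclude_replies = '-filter:replies'

# Empty parts ('' here) are skipped, so no leading or doubled spaces appear.
query = build_search_filter('', '', contains_any_words, '', exclude_replies)
print(query)
```

The resulting string could be passed to `search_tweets` in place of `search_filter`.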

**Extracting tweets based on search filter**

In [None]:
tweets = search_tweets(search_filter, since, until, limit, language)

**Convert the tweets to a DataFrame and export them to a CSV file**

In [None]:
df = pd.DataFrame({
    'tweet_id': tweet.tweet_id, 
    'text': unidecode(tweet.text),  
    'tweet_url': tweet.tweet_url,
    'retweets': tweet.retweets,
    'replies': tweet.replies,
    'is_replied': tweet.is_replied,
    'is_reply_to': tweet.is_reply_to,
    'user_id': tweet.user_id, 
    'created_at': tweet.timestamp
} for tweet in tweets)

df.to_csv('tweets.csv', encoding = 'utf-8', index = False)

**Display the tweets found**

In [None]:
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at
0,1212523890540462082,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1190617786294374400,2020-01-01 23:59:57
1,1212523890439786496,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1064533534109589504,2020-01-01 23:59:57
2,1212523879970885633,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1046604175717601280,2020-01-01 23:59:55
3,1212523878477635584,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1032060705221083137,2020-01-01 23:59:54
4,1212523878230171648,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,723473031293743104,2020-01-01 23:59:54


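**Load the stored tweets from the CSV file**

The reloaded file also contains a `class` column with a label for each tweet, which the export step above does not write.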

In [None]:
# file = open('tweets.csv', encoding='utf-8').read()
df = pd.read_csv('tweets.csv')
df.head()

Unnamed: 0,tweet_id,text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at,class
0,1.21e+18,"Esse eu vou da tchau pra vida de festa, ta bom?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1.19e+18,1/1/2020 23:59,0
1,1.21e+18,o quanto as coisas demoram p sair da minha cab...,/amandiiiix/status/1212523890439786496,0,0,False,False,1.06e+18,1/1/2020 23:59,0
2,1.21e+18,meu momento pos role sempre e baseado em pensa...,/inouesz/status/1212523879970885633,0,0,False,False,1.05e+18,1/1/2020 23:59,0
3,1.21e+18,Deu janeiro e eu quero sair do emprego eai kkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1.03e+18,1/1/2020 23:59,0
4,1.21e+18,o que vc diria?\n\n1- cuida dela pq ela e espe...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,7.23e+17,1/1/2020 23:59,0
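
In the reloaded table, `tweet_id` and `user_id` appear as `1.21e+18`-style values, which suggests the identifiers were stored as floating-point numbers somewhere between export and reload. A 64-bit float cannot represent every 19-digit integer, so an ID may change when it is converted. The sketch below uses only the standard library and a small in-memory CSV built from two IDs in the first output table; it illustrates the loss and shows that keeping IDs as strings avoids it. With pandas, passing `dtype={'tweet_id': str, 'user_id': str}` to `pd.read_csv` is the analogous choice.

```python
import csv
import io

# Small in-memory stand-in for tweets.csv (IDs taken from the sample output).
raw = io.StringIO(
    "tweet_id,user_id\n"
    "1212523890540462082,1190617786294374400\n"
    "1212523890439786496,1064533534109589504\n"
)

rows = list(csv.DictReader(raw))  # the csv module keeps every field as a string

original_id = rows[0]["tweet_id"]
as_float = float(original_id)       # what a float column would hold
round_tripped = str(int(as_float))  # convert back to compare digit by digit

print(original_id)
print(round_tripped)
print(original_id == round_tripped)  # False: the float cannot hold every digit
```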


**Translate the tweets from Portuguese to English (probably needed for the English-based preprocessing)**

In [None]:
tweets = df['text']
total = len(tweets)
translated_quantity = 0
english_tweets = []

# Reuse a single Translator instead of creating one per tweet
translator = Translator()

print('{} [INFO] Starting tweet translation...'.format(dt.datetime.now()))
for tweet in tweets:
  # Translate from Portuguese ('pt') to English ('en')
  english_tweets.append(translator.translate(unidecode(tweet), src='pt', dest='en').text)
  translated_quantity += 1
  print('\r{} [INFO] {} translated tweets of {} from Portuguese to English.'.format(dt.datetime.now(), translated_quantity, total), end='')

# Insert the translations right after the original text column
df.insert(2, 'english_text', english_tweets)

print('\n{} [INFO] Tweet translation successfully!'.format(dt.datetime.now()))

df.head()

2020-08-05 00:12:05.349748 [INFO] Starting tweet translation...
2020-08-05 00:12:08.216473 [INFO] 20 translated tweets of 20 from Portuguese to English.
2020-08-05 00:12:08.218474 [INFO] Tweet translation successfully!


Unnamed: 0,tweet_id,text,english_text,tweet_url,retweets,replies,is_replied,is_reply_to,user_id,created_at,class
0,1.21e+18,"Esse eu vou da tchau pra vida de festa, ta bom?","This one I go bye to party life, okay?",/NetoMiguel02/status/1212523890540462082,0,0,False,False,1.19e+18,1/1/2020 23:59,0
1,1.21e+18,o quanto as coisas demoram p sair da minha cab...,how long things take to get out of my head is ...,/amandiiiix/status/1212523890439786496,0,0,False,False,1.06e+18,1/1/2020 23:59,0
2,1.21e+18,meu momento pos role sempre e baseado em pensa...,my moment can always be based on thinking abou...,/inouesz/status/1212523879970885633,0,0,False,False,1.05e+18,1/1/2020 23:59,0
3,1.21e+18,Deu janeiro e eu quero sair do emprego eai kkk...,It was January and I want to quit my job kkkkk...,/Haile_Din/status/1212523878477635584,0,0,False,False,1.03e+18,1/1/2020 23:59,0
4,1.21e+18,o que vc diria?\n\n1- cuida dela pq ela e espe...,what would you say?\n\n1- take care of her bec...,/Laura_Liiotta/status/1212523878230171648,0,0,False,False,7.23e+17,1/1/2020 23:59,0
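
Translating one tweet at a time means a single failed request stops the whole loop, and repeated texts are translated more than once. The following is a minimal sketch of a wrapper that caches results and retries failures. `translate_all` and `fake_translate` are hypothetical names introduced for illustration; in the notebook, the callable would wrap the `Translator` call, which is not exercised here because it needs network access.

```python
import time


def translate_all(texts, translate_fn, retries=2, delay=0.0):
    """Translate each text with translate_fn, caching repeated inputs.

    Returns one result per input, in order. A text that still fails after
    the retries is kept untranslated so the list stays aligned with the input.
    """
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            for attempt in range(retries + 1):
                try:
                    cache[text] = translate_fn(text)
                    break
                except Exception:
                    if attempt == retries:
                        cache[text] = text  # fall back to the original text
                    else:
                        time.sleep(delay)
        results.append(cache[text])
    return results


# Stand-in for a real translator, used only to make the sketch runnable offline.
calls = []


def fake_translate(text):
    calls.append(text)
    if text == "erro":
        raise RuntimeError("simulated request failure")
    return text.upper()


translated = translate_all(["oi", "erro", "oi"], fake_translate)
print(translated)
print(len(calls))
```

Keeping the output list the same length as the input matters because the result is inserted as a new DataFrame column.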


**Data preprocessing**

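The tweet text is cleaned with the following steps:

 - Convert the text to lowercase;
 - Remove links and mentions;
 - Remove punctuation;
 - Convert emojis to text and replace unused characters;
 - Collapse repeated characters and expand short negation forms;
 - Split the text into words and remove stop words (keeping `not`, `nor` and `no`).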

In [None]:
# Convert to lowercase
tweets = df.english_text.str.lower()

# Remove links and mentions first, while the "http" and "@" prefixes are still intact
tweets = tweets.str.replace(r"(http|@)\S+", "", regex=True)

# Expand short negation forms before the apostrophes are stripped
tweets = tweets.str.replace("’", "'", regex=False)
tweets = tweets.str.replace(r"can't|cannot", "can not", regex=True)
tweets = tweets.str.replace(r"n't", " not", regex=True)

# Remove punctuation
tweets = tweets.str.translate(str.maketrans('', '', string.punctuation))

# Convert emojis to text codes and replace the remaining special characters
tweets = tweets.apply(demojize)
tweets = tweets.str.replace("::", ": :", regex=False)
tweets = tweets.str.replace(r"[^a-z':_]", " ", regex=True)

# Collapse characters repeated three or more times into a single one
pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
tweets = tweets.str.replace(pattern, r"\1", regex=True)

# Splitting text into words
# tweets = word_tokenize(tweets, 'english')

# Remove stop words, keeping negations because they carry sentiment.
# A separate name avoids shadowing the imported nltk `stopwords` module.
nltk.download('stopwords')
stop_words = set(stopwords.words('english')) - {'not', 'nor', 'no'}

tweets = tweets.apply(
    lambda tweet: ' '.join([word for word in tweet.split() if word not in stop_words])
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
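
To make the effect of each step easier to inspect, a similar pipeline can be sketched for a single string using only the standard library. The function name `preprocess` and the small `STOP_WORDS` set are assumptions made for illustration; the notebook uses NLTK's English stop-word list and also converts emojis with `demojize`, which is omitted here. In this sketch, links and mentions are removed and negations are expanded before punctuation is stripped, because those steps rely on `http`, `@` and `'` still being present.

```python
import re
import string

# Tiny illustrative stop-word set; negations are deliberately left out of it.
STOP_WORDS = {"i", "the", "to", "a", "is", "it", "and", "of"}


def preprocess(text):
    text = text.lower()
    text = re.sub(r"(http|@)\S+", "", text)          # links and mentions
    text = text.replace("’", "'")                    # normalize curly apostrophes
    text = re.sub(r"can't|cannot", "can not", text)  # expand negations
    text = re.sub(r"n't", " not", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z]", " ", text)              # anything left that is not a letter
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # "sooo" -> "so"
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


print(preprocess("I can't go to the party!!! sooo sad @friend https://t.co/abc"))
# prints: can not go party so sad
```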


**Create the model dataset**

In [None]:
model = pd.DataFrame({'tweet': tweets, 'class': df['class']})
model.head()
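
Stop-word removal can leave some tweets empty, and the `class` labels may be unevenly distributed; both affect training. A minimal standard-library sketch of such a check is given here. The sample rows and the helper name `summarize_dataset` are made up for illustration and are not taken from the notebook's data.

```python
from collections import Counter


def summarize_dataset(rows):
    """Drop rows whose text is empty and count the remaining labels."""
    kept = [(text, label) for text, label in rows if text.strip()]
    return kept, Counter(label for _, label in kept)


# Made-up (tweet, class) pairs for illustration only.
sample = [("bye party life", 0), ("", 0), ("want quit job", 1), ("   ", 1)]

kept, counts = summarize_dataset(sample)
print(len(kept))
print(dict(counts))
```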

## **2 - Training process**

### **2.0.1 - MLPClassifier**

### **2.0.2 - Naive Bayes**

### **2.0.3 - Sequential Minimal Optimization**