<a href="https://colab.research.google.com/github/marcelobenedito/quarantine_covid19_behavior_analysis/blob/master/quarantine_covid19_behavior_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Quarantine Covid-19 Behavior Analysis**

*This notebook collects tweets about COVID-19, quarantine, and related topics. The collected content is then analysed to extract sentiment and the main user behaviors that lead people not to stay home.*

**Install libraries**

In [6]:
!pip3 install unidecode
!pip3 install googletrans
!pip3 install twitterscraper
!pip3 install emoji

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42176 sha256=621a9ea4613933e132f3375e3dc5f460a3936c43e7aba68acd6917eb03f17a0e
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.5.4


**Required imports**

In [7]:
import string
import time
import datetime as dt
import numpy as np
import pandas as pd
import re
from unidecode import unidecode
from googletrans import Translator
from twitterscraper import query_tweets
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from emoji import demojize

**Create a function to search tweets**

In [None]:
def search_tweets(search_filter, since, until, limit, language):
  # Thin wrapper around twitterscraper's query_tweets that maps the notebook's parameter names onto it
  return query_tweets(query=search_filter, begindate=since, enddate=until, limit=limit, lang=language)

**Defining filters used in search**

In [None]:
""" não estou saindo "não estou saindo" (quarentena OR covid) (#covid-19 OR #coronavírus OR #coronavirus OR #covid OR #quarentena) lang:pt until:2020-01-31 since:2020-01-01 -filter:replies """

contains_both_words = ''
exact_phrase = ''
contains_any_words = '(quarentena OR covid OR coronavirus OR isolamento OR social)'
contains_any_hashtags = ''
no_replies = '-filter:replies'  # this operator excludes replies, not retweets
language = 'pt'
since = dt.date(2020,1,1)
until = dt.date(2020,1,31)
limit = 10

search_filter = contains_both_words + ' ' + exact_phrase + ' ' + contains_any_words + ' ' + contains_any_hashtags + ' ' + no_replies
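Because several of the filter parts above are empty strings, plain concatenation leaves stray spaces in the query. A minimal sketch of one way to avoid that is shown below; `build_search_filter` is a hypothetical helper, not part of the notebook or of twitterscraper, and the example values are illustrative only.

```python
def build_search_filter(*parts):
    """Join the non-empty query fragments with single spaces."""
    return ' '.join(part.strip() for part in parts if part and part.strip())

# Hypothetical example values, mirroring the variables defined in the cell above
example_filter = build_search_filter(
    '',                         # contains_both_words (empty)
    '"não estou saindo"',       # exact_phrase
    '(quarentena OR covid)',    # contains_any_words
    '',                         # contains_any_hashtags (empty)
    '-filter:replies',          # excludes replies
)
print(example_filter)
```

Empty fragments are simply skipped, so a filter can be switched off by setting it to `''` without changing the resulting query's spacing.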

**Extracting tweets based on search filter**

In [None]:
tweets = search_tweets(search_filter, since, until, limit, language)

**Convert the tweets to a DataFrame and export it to a CSV file**

In [None]:
df = pd.DataFrame({
    'tweet_id': tweet.tweet_id, 
    'text': tweet.text,  
    'tweet_url': tweet.tweet_url,
    'retweets': tweet.retweets,
    'replies': tweet.replies,
    'is_replied': tweet.is_replied,
    'is_reply_to': tweet.is_reply_to,
    'user_id': tweet.user_id, 
    'created_at': tweet.timestamp
} for tweet in tweets)

df.to_csv('tweets.csv', encoding='utf-8')

**Printing found tweets**

In [None]:
df.head()

| Unnamed: 0 | tweet_id | text | tweet_url | retweets | replies | is_replied | is_reply_to | user_id | created_at |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1213248633954611200 | Danilo Gentili e deputado do Psol discutem em ... | /aratuonline/status/1213248633954611200 | 1 | 1 | True | False | 67661509 | 2020-01-03 23:59:50 |
| 1 | 1213248562420703232 | Jamais rebaixe alguém da família, pq essa pess... | /milamarques/status/1213248562420703232 | 0 | 0 | False | False | 256680355 | 2020-01-03 23:59:33 |
| 2 | 1213248430342115328 | Já planejam diminuir conteúdo dos livros didát... | /rakavazquez/status/1213248430342115328 | 3 | 0 | False | False | 732761681856765952 | 2020-01-03 23:59:01 |
| 3 | 1213248424134500353 | QUESTIONÁRIO PARA 2020\n \nEU VOU?\n1. Tomara ... | /gahsep1914/status/1213248424134500353 | 0 | 0 | False | False | 710262555223134208 | 2020-01-03 23:59:00 |
| 4 | 1213248395512631296 | Só os especialistas em política e economia onl... | /NerdZEEH/status/1213248395512631296 | 0 | 0 | False | False | 281665822 | 2020-01-03 23:58:53 |


**Open stored tweets from CSV file**

In [None]:
# file = open('tweets.csv', encoding='utf-8').read()
df = pd.read_csv('tweets.csv')


**The tweets will probably need to be translated from Portuguese to English**

In [None]:
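tweets = df['text']
translator = Translator()  # reuse a single translator instead of creating one per tweet
# Translate every tweet and keep all results in a list
english_tweets = [translator.translate(unidecode(tweet)).text for tweet in tweets]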
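Online translation calls can fail partway through a batch. The sketch below shows one hedged way to translate a list of texts while tolerating failures. `translate_all`, `pause_seconds`, and `fake_translate` are hypothetical names introduced for illustration, and the stand-in translator only exists so the example runs offline.

```python
import time

def translate_all(texts, translate_fn, pause_seconds=0.0):
    """Translate each text with translate_fn, keeping the original text on failure."""
    translated = []
    for text in texts:
        try:
            translated.append(translate_fn(text))
        except Exception:
            # Assumption: falling back to the untranslated text is acceptable here
            translated.append(text)
        time.sleep(pause_seconds)  # optional pause between requests
    return translated

# Stand-in translator so the sketch runs without network access; in the notebook
# this could be: lambda t: Translator().translate(unidecode(t)).text
fake_translate = lambda text: {'bom dia': 'good morning'}[text]

result = translate_all(['bom dia', 'sem traducao'], fake_translate)
print(result)
```

Passing the translation function as an argument keeps the loop independent of any particular translation library.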

**Data preprocessing**

This step preprocesses the tweet text:

 - Tokenize words;
 - Remove stop words;
 - Remove punctuation;
 - Remove unused characters;
 - Remove links from tweets.

In [None]:
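import nltk

# Download the NLTK resources used below (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

# Converting to lowercase
tweets = tweets.str.lower()

# Removing links and user mentions (before punctuation removal, while '@' and URLs are intact)
tweets = tweets.str.replace(r"(http|@)\S+", "", regex=True)

# Transform short negation form (before punctuation removal, while apostrophes are present)
tweets = tweets.str.replace("’", "'", regex=False)
tweets = tweets.str.replace(r"(can't|cannot)", "can not", regex=True)
tweets = tweets.str.replace(r"n't", " not", regex=True)

# Removing punctuation (a Series needs the .str accessor for translate)
tweets = tweets.str.translate(str.maketrans('', '', string.punctuation))

# Remove special chars: emojis become ':name:' text, anything outside a-z ' : _ becomes a space
tweets = tweets.apply(demojize)
tweets = tweets.str.replace("::", ": :", regex=False)
tweets = tweets.str.replace(r"[^a-z\':_]", " ", regex=True)

# Remove repetitions (3 or more identical characters in a row collapse to one)
pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
tweets = tweets.str.replace(pattern, r"\1", regex=True)

# Splitting each tweet into words
tokenized_words = tweets.apply(word_tokenize)

# Removing stop words, keeping negations because they carry sentiment
stop_words = set(stopwords.words('english')) - {'not', 'nor', 'no'}

words = [word for tokens in tokenized_words for word in tokens if word not in stop_words]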
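To check the effect of each step in isolation, the same pipeline can be sketched on a single string using only the standard library. This is a simplified illustration rather than the notebook's exact method: it skips emoji handling, uses whitespace splitting instead of NLTK's `word_tokenize`, and `EXAMPLE_STOP_WORDS` is a hypothetical, deliberately tiny stop-word list.

```python
import re
import string

# Hypothetical tiny stop-word list; the notebook uses NLTK's English list instead
EXAMPLE_STOP_WORDS = {'i', 'to', 'the', 'at', 'is', 'a'}

def preprocess(text, stop_words=EXAMPLE_STOP_WORDS):
    """Clean one tweet and return its remaining words."""
    text = text.lower()
    text = re.sub(r"(http|@)\S+", "", text)          # drop links and mentions
    text = text.replace("’", "'")                    # normalize curly apostrophes
    text = re.sub(r"can't|cannot", "can not", text)  # expand negations
    text = re.sub(r"n't", " not", text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r"[^a-z]", " ", text)              # keep only lowercase letters
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # collapse runs of 3+ characters
    return [word for word in text.split() if word not in stop_words]

example_words = preprocess("I can't stay at home!!! sooo bored https://t.co/xyz @someone")
print(example_words)
```

Links and negations are handled before punctuation is stripped, since both rely on characters (`@`, `'`) that the punctuation step removes.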