# COVID19 Tweets
Get insights into, and an understanding of, COVID-19 tweet data.

**Data**:
COVID19 Tweets: tweets carrying the #covid19 hashtag, with collection starting on 25/7/2020.
You can find the dataset here - https://drive.google.com/drive/folders/1uww8K6PUaHfF7_ZMgGFHE7J6stptarYU?usp=sharing

**Objective**:
You have to analyse people's tweets about COVID-19.
Derive breakthrough insights, for example by:
- finding what kinds of subjects use this hashtag,
- looking at the geographical distribution (by country),
- clustering tweets and evaluating sentiment,
- looking at trends.

On average, a candidate shares at least 7 substantial insights.

In [1]:
import pickle
import numpy as np
import pandas as pd

In [2]:
with open('../Data/covid19_tweets.csv') as f:
    print(f)

<_io.TextIOWrapper name='../Data/covid19_tweets.csv' mode='r' encoding='UTF-8'>


In [3]:
data = pd.read_csv('../Data/covid19_tweets.csv', encoding='UTF-8')
data.shape

(179108, 13)

In [4]:
data.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False


In [5]:
data[ data['user_name']== 'beIN SPORTS'][['user_created','date']]

Unnamed: 0,user_created,date
2952,2012-05-10 13:45:16,2020-07-25 10:42:42
23784,2009-06-01 02:08:53,2020-07-26 03:01:00
103961,2012-05-10 13:45:16,2020-08-11 09:46:47
108552,2009-06-01 02:08:53,2020-08-11 07:45:00
170920,2009-06-01 02:08:53,2020-08-30 05:03:00


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179108 entries, 0 to 179107
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   user_name         179108 non-null  object
 1   user_location     142337 non-null  object
 2   user_description  168822 non-null  object
 3   user_created      179108 non-null  object
 4   user_followers    179108 non-null  int64 
 5   user_friends      179108 non-null  int64 
 6   user_favourites   179108 non-null  int64 
 7   user_verified     179108 non-null  bool  
 8   date              179108 non-null  object
 9   text              179108 non-null  object
 10  hashtags          127774 non-null  object
 11  source            179031 non-null  object
 12  is_retweet        179108 non-null  bool  
dtypes: bool(2), int64(3), object(8)
memory usage: 15.4+ MB


The data types look good **except** for `user_created` and `date`, which are stored as strings (`object`) rather than datetimes.
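
One way to handle the two date columns at load time is the `parse_dates` argument of `pd.read_csv`. The sketch below is only an illustration: it reads a tiny in-memory sample instead of the real file, and the names `sample_csv` and `sample` are made up for this example.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for covid19_tweets.csv (values copied from the preview above).
sample_csv = io.StringIO(
    "user_name,user_created,date\n"
    "DIPR-J&K,2017-02-12 06:45:15,2020-07-25 12:27:08\n"
    "ethel mertz,2019-03-07 01:45:06,2020-07-25 12:27:10\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["user_created", "date"])
print(sample.dtypes)
```

Converting after loading with `pd.to_datetime` gives the same column types; doing it in `read_csv` simply saves a step.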

### The Dataset looks bad, let's clean this up!

#### Missing Values

In [7]:
data.isna().sum()

user_name               0
user_location       36771
user_description    10286
user_created            0
user_followers          0
user_friends            0
user_favourites         0
user_verified           0
date                    0
text                    0
hashtags            51334
source                 77
is_retweet              0
dtype: int64

Feature: `user_location`     
Tasks:
- fill NaN with 'unknown'

In [8]:
data['user_location'] = data['user_location'].fillna('unknown')   # assign back; inplace on a selected column is unreliable in newer pandas
data['user_location'].isna().sum()

0

Feature: `user_description`       
Tasks:
- fill NaN with 'missing'

In [9]:
data['user_description'] = data['user_description'].fillna('missing')   # assign back instead of inplace=True
data['user_description'].isna().sum()

0
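
The two fills above can also be written as a single call by passing a column-to-value dict to `DataFrame.fillna`. This is a minimal sketch on a made-up three-row frame (`demo` is a hypothetical name), not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the tweet data.
demo = pd.DataFrame({
    "user_location": ["New York, NY", np.nan, "Jammu and Kashmir"],
    "user_description": [np.nan, "Husband, Father", np.nan],
})

# The dict maps each column to its own fill value; the result is assigned
# back instead of relying on inplace=True on a single column.
demo = demo.fillna({"user_location": "unknown", "user_description": "missing"})
print(demo.isna().sum())
```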

Feature: `hashtags` 
Tasks:
- Replace NaN values with the string "[]" (an empty-list literal)
- Parse all list-like strings into actual lists

In [10]:
data['hashtags'].isna().sum()

51334

Let's replace them with the string "[]", which later parses to an empty list

In [11]:
data['hashtags'] = data['hashtags'].fillna("[]")   # assign back instead of inplace=True
data['hashtags'].head()

0                                   []
1                                   []
2                          ['COVID19']
3                          ['COVID19']
4    ['CoronaVirusUpdates', 'COVID19']
Name: hashtags, dtype: object

In [12]:
data['hashtags'].isna().sum()

0

In [13]:
data['hashtags'].dtype

dtype('O')

In [14]:
import ast
data['hashtags'] = data['hashtags'].apply(ast.literal_eval)     # parse each list-like string into a real list
data['hashtags'].head()

0                               []
1                               []
2                        [COVID19]
3                        [COVID19]
4    [CoronaVirusUpdates, COVID19]
Name: hashtags, dtype: object

In [15]:
data['hashtags'][4]

['CoronaVirusUpdates', 'COVID19']
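
The two hashtag tasks (fill missing values, then parse) can also be folded into one defensive helper. The function below is a sketch; `parse_hashtags` is a hypothetical name, and returning an empty list for unparsable input is an assumption about how such rows should be treated.

```python
import ast

def parse_hashtags(raw):
    """Return a list of hashtags from a string such as "['COVID19']".

    Missing or unparsable input yields an empty list.
    """
    if not isinstance(raw, str):          # NaN is a float, so it lands here
        return []
    try:
        parsed = ast.literal_eval(raw)    # safely evaluate the list literal
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

print(parse_hashtags("['CoronaVirusUpdates', 'COVID19']"))
print(parse_hashtags(float("nan")))
```

It could be applied to the raw column with `data['hashtags'].apply(parse_hashtags)`, in place of the separate fill and parse steps.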

Feature: `source`     
Tasks: 
- Remove the 'Twitter for ' prefix (e.g. 'Twitter for Android' becomes 'Android')
- Fill NaN with 'unknown'

In [16]:
data['source'] = data['source'].str.replace('Twitter for ', '', regex=False)   # literal prefix, not a regex
data['source'].value_counts()

Twitter Web App     56891
Android             40179
iPhone              35472
TweetDeck            8543
Hootsuite Inc.       7321
                    ...  
The Shirt List          1
Co-Kinetic              1
Marketing Agency        1
15 Minute Fun           1
BAPSCharities           1
Name: source, Length: 610, dtype: int64

In [17]:
data['source'].isna().sum()

77

In [18]:
data['source'] = data['source'].fillna('unknown')   # assign back instead of inplace=True
data['source'].isna().sum()

0
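
With 610 distinct sources, most of them seen only a handful of times, it may help to group the long tail under one label before plotting. The sketch below shows one possible way; the sample values, the `min_count` cut-off, and the label "Other" are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of cleaned source labels.
source = pd.Series(["Twitter Web App", "Android", "Android", "iPhone",
                    "Twitter Web App", "The Shirt List", "Co-Kinetic"])

min_count = 2                                   # assumed cut-off, tune as needed
counts = source.value_counts()
rare = counts[counts < min_count].index         # labels seen fewer than min_count times
grouped = source.where(~source.isin(rare), "Other")
print(grouped.value_counts())
```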

In [19]:
data.isna().sum()

user_name           0
user_location       0
user_description    0
user_created        0
user_followers      0
user_friends        0
user_favourites     0
user_verified       0
date                0
text                0
hashtags            0
source              0
is_retweet          0
dtype: int64

#### Data Engineering

Dates

In [20]:
data['date'].head()

0    2020-07-25 12:27:21
1    2020-07-25 12:27:17
2    2020-07-25 12:27:14
3    2020-07-25 12:27:10
4    2020-07-25 12:27:08
Name: date, dtype: object

In [21]:
type(data['date'][0])

str

In [22]:
data['date'] = pd.to_datetime(data['date'])
data['user_created'] = pd.to_datetime(data['user_created'])
type(data['date'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [23]:
data['date'][0]

Timestamp('2020-07-25 12:27:21')

In [25]:
date = data['date'][0]
date.year, date.month, date.day, date.hour     # a Timestamp exposes these fields directly

(2020, 7, 25, 12)

In [26]:
# the vectorised .dt accessor avoids a Python-level apply per row
data['post_year'] = data['date'].dt.year
data['post_month'] = data['date'].dt.month
data['post_day'] = data['date'].dt.day
data['post_hour'] = data['date'].dt.hour
data['user_created_month'] = data['user_created'].dt.month

In [27]:
# data.drop(labels=['date'], inplace=True, axis=1)
# data.drop(labels=['user_created'], inplace=True, axis=1)

In [28]:
data.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,post_year,post_month,post_day,post_hour,user_created_month
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,[],iPhone,False,2020,7,25,12,5
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,[],Android,False,2020,7,25,12,4
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,[COVID19],Android,False,2020,7,25,12,2
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,[COVID19],iPhone,False,2020,7,25,12,3
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"[CoronaVirusUpdates, COVID19]",Android,False,2020,7,25,12,2


In [29]:
months_mapping = {1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'}
data['post_month'] = data['post_month'].map(months_mapping)
data['post_month'].head()

0    Jul
1    Jul
2    Jul
3    Jul
4    Jul
Name: post_month, dtype: object
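
The month abbreviations can also be derived directly from the timestamps, without a hand-written mapping. The sketch below assumes English month names (the default of `Series.dt.month_name`) and uses two timestamps copied from the preview; `stamps` and `parts` are illustrative names.

```python
import pandas as pd

# Timestamps copied from the preview above.
stamps = pd.Series(pd.to_datetime(["2020-07-25 12:27:21", "2017-05-26 05:46:42"]))

parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month_name().str[:3],   # "July" -> "Jul"
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
})
print(parts)
```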

Mentions and Hashtags

In [30]:
import re

In [31]:
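# Note: both patterns match ASCII letters and digits only, so a handle such as @some_user is cut at the underscore.
data['mentions'] = data['text'].apply(lambda x: re.findall(r"@([a-zA-Z0-9]+)", x))     # handles without the '@'
data['hashtags'] = data['text'].apply(lambda x: re.findall(r"#([a-zA-Z0-9]+)", x))     # re-derived from the text; overwrites the parsed column
data['text'] = data['text'].apply(lambda x: re.sub(r"@[a-zA-Z0-9]+", "", x))           # removing mentions
data['text'] = data['text'].str.replace("#", "", regex=False)                          # removing hashtag symbol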

In [32]:
data.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,post_year,post_month,post_day,post_hour,user_created_month,mentions
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,[],iPhone,False,2020,Jul,25,12,5,[]
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey and - wouldn't it have made more sense ...,[],Android,False,2020,Jul,25,12,4,"[Yankees, YankeesPR, MLB]"
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,Trump never once claimed COVID19 was a hoax...,[COVID19],Android,False,2020,Jul,25,12,2,"[diane3443, wdunlap, realDonaldTrump]"
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,The one gift COVID19 has give me is an apprec...,[COVID19],iPhone,False,2020,Jul,25,12,3,[brookbanktv]
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel CoronaVirusU...,"[CoronaVirusUpdates, COVID19]",Android,False,2020,Jul,25,12,2,"[kansalrohit69, DrSyedSehrish, airnewsalerts, ..."
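
Twitter handles may contain underscores, which a letters-and-digits pattern splits in two. The sketch below is a variant that adds `_` to the character class and does the extraction and clean-up in one helper. `split_tweet`, `MENTION_RE`, and `HASHTAG_RE` are hypothetical names, the example tweet is made up, and non-ASCII hashtags are still not covered.

```python
import re

# Assumption: handles and hashtags may contain underscores as well as ASCII
# letters and digits, so "_" is added to the character class.
MENTION_RE = re.compile(r"@([A-Za-z0-9_]+)")
HASHTAG_RE = re.compile(r"#([A-Za-z0-9_]+)")

def split_tweet(text):
    """Return (clean_text, mentions, hashtags) for one tweet."""
    mentions = MENTION_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    clean = MENTION_RE.sub("", text).replace("#", "")
    clean = " ".join(clean.split())   # collapse whitespace left by removed mentions
    return clean, mentions, hashtags

# Made-up tweet for illustration.
print(split_tweet("@who_official says wash hands #COVID19 #stay_safe"))
```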


User Location

In [33]:
data['user_location_end'] = data['user_location'].str.split(",").str[-1].str.strip()
data['user_location_end'].head()

0             astroworld
1                     NY
2                     KY
3    Stuck in the Middle
4      Jammu and Kashmir
Name: user_location_end, dtype: object
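
The last token is only a rough proxy for geography: "NY" and "KY" are state codes, while "astroworld" is not a place at all. One possible next step toward a country-level view is a lookup table, sketched below. `REGION_TO_COUNTRY` and `guess_country` are hypothetical, and the table is deliberately tiny; a real one would need far more entries.

```python
# Hypothetical, deliberately tiny lookup; a real table would need every
# US state code and many more place names.
REGION_TO_COUNTRY = {
    "NY": "United States",
    "KY": "United States",
}

def guess_country(location_end):
    """Map the last location token to a country, or 'unknown' if unrecognised."""
    return REGION_TO_COUNTRY.get(location_end.strip(), "unknown")

for token in ["NY", "KY", "astroworld", "Stuck in the Middle"]:
    print(token, "->", guess_country(token))
```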

Text

In [34]:
from textblob import TextBlob

In [35]:
TextBlob('I am good today').sentiment.polarity

0.7

In [36]:
polarity = data['text'].apply(lambda x: TextBlob(x).sentiment.polarity)     # score each tweet once instead of twice
data['text_sentiment'] = polarity.apply(lambda p: 'Positive' if p > 0 else ('Neutral' if p == 0 else 'Negative'))
data['text_sentiment'].head()

0    Negative
1    Positive
2     Neutral
3     Neutral
4     Neutral
Name: text_sentiment, dtype: object
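
The labelling rule above can be isolated in a small function, which makes the thresholds explicit and easy to test. TextBlob's polarity lies in the range [-1, 1]. In this sketch, `label_polarity` and its `neutral_band` parameter are hypothetical: the default of 0.0 reproduces the notebook's rule, while a wider band is an optional assumption, not something the notebook uses.

```python
def label_polarity(score, neutral_band=0.0):
    """Turn a polarity score in [-1, 1] into 'Positive', 'Neutral' or 'Negative'.

    neutral_band=0.0 means only an exact 0 counts as neutral; a wider band
    treats scores close to zero as neutral as well.
    """
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"

print(label_polarity(0.7))           # Positive
print(label_polarity(0.0))           # Neutral
print(label_polarity(-0.02, 0.05))   # Neutral
```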

Hashtags

In [37]:
data['Hashtags Count'] = data['hashtags'].str.len()
data['Mentions Count'] = data['mentions'].str.len()
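
Besides counting tags per tweet, the list column can be flattened to see which hashtags occur most often. This is a minimal sketch on the five hashtag lists from the preview; `demo` and `tag_counts` are illustrative names.

```python
import pandas as pd

# Hashtag lists copied from the preview above.
demo = pd.DataFrame({"hashtags": [[], [], ["COVID19"], ["COVID19"],
                                  ["CoronaVirusUpdates", "COVID19"]]})

demo["Hashtags Count"] = demo["hashtags"].str.len()   # tags per tweet

# explode gives one row per tag; empty lists become NaN and are dropped.
tag_counts = demo["hashtags"].explode().dropna().value_counts()
print(tag_counts)
```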

In [38]:
with open('preprocessed_data.pickle','wb') as f:
    pickle.dump(data, f)
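
Pickling, unlike writing CSV, keeps the list and datetime columns as they are, so no re-parsing is needed on reload. The sketch below round-trips a made-up two-row frame through a temporary directory using pandas' own `to_pickle` and `read_pickle`; the frame and path are assumptions for illustration only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row frame with a list column and a datetime column.
demo = pd.DataFrame({
    "hashtags": [["COVID19"], []],
    "date": pd.to_datetime(["2020-07-25 12:27:21", "2020-07-25 12:27:17"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed_data.pickle")
    demo.to_pickle(path)               # write the frame with its dtypes intact
    restored = pd.read_pickle(path)    # read it back

print(restored.dtypes)
```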