# Large Scale Tweets Collection

**The following code can be run in a VM terminal.**

Install the required libraries first; see https://github.com/cs-ssa-w21/final-project-covid-twitter.

Manually create a directory for this project, e.g. "LSC" on the Desktop.

Enter the project directory:

    cd ~/Desktop/LSC/

Clone the project repository:

    git clone https://github.com/cs-ssa-w21/final-project-covid-twitter.git

Enter the cloned repository:

    cd final-project-covid-twitter/

Manually create a directory for data storage, e.g. "data_new" in the data subdirectory.

**Basic settings for working in IPython3**

Open IPython3 in the Terminal:

    ipython3
    
Preparations:

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline

**Import required libraries**

In [None]:
from covid_data_analysis import *
from scrape_twitter_with_Twint import *

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go

import json
import preprocessor as p
import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

## Helper functions

Read a JSON file (one tweet per line) into a pandas DataFrame.

In [None]:
def json_to_df(username):
    # Each line of the file holds one tweet as a JSON object.
    tweets = []
    with open('data/data_new/{}.json'.format(username), 'r', encoding='utf-8') as f:
        for line in f:
            tweets.append(json.loads(line))
    return pd.DataFrame(tweets)
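
The loop above treats the file as JSON Lines (one JSON object per line). As a minimal sketch under that assumption, pandas can also parse such a file directly with `read_json(..., lines=True)`. The sample records and the temporary file below are made up for illustration and are not part of the project data.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records standing in for scraped tweets.
records = [
    {"id": 1, "date": "2020-03-01", "tweet": "Wash your hands"},
    {"id": 2, "date": "2020-03-02", "tweet": "Stay home if sick"},
]

# Write the records as JSON Lines: one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# lines=True tells pandas to parse every line as a separate record.
sample_df = pd.read_json(path, lines=True)
os.remove(path)

print(sample_df.shape)
```

Note that `read_json` may convert date-like columns such as `date` to datetimes, whereas the line-by-line loop keeps them as strings.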

Clean the tweets dataframe of a single user.

In [None]:
def clean_df(username):
    df = json_to_df(username)
    # Copy the selected columns so later assignments do not act on a view of df.
    tweets_df = df[['id', 'date', 'time', 'username', 'tweet', 'hashtags']].copy()

    # Strip URLs, emojis, smileys, mentions, hashtags and reserved words.
    p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY, p.OPT.MENTION, p.OPT.HASHTAG, p.OPT.RESERVED)
    tweets_df['tweet'] = tweets_df['tweet'].apply(p.clean)

    # Lowercase, replace punctuation with spaces, and collapse repeated whitespace.
    tweets_df['tweet'] = (tweets_df['tweet'].str.lower()
                          .str.replace(r'[^\w\s]', ' ', regex=True)
                          .str.replace(r'\s\s+', ' ', regex=True))
    return tweets_df
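
To see what the final normalisation step does in isolation, here is a small sketch on made-up tweets. It leaves out the `preprocessor` step, and the sample strings and the extra `.str.strip()` call are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical tweets, used only to illustrate the normalisation step.
demo = pd.DataFrame(
    {"tweet": ["Wear a MASK!!  Stay   safe.", "COVID-19: get tested"]}
)

demo["tweet"] = (
    demo["tweet"]
    .str.lower()
    .str.replace(r"[^\w\s]", " ", regex=True)  # punctuation -> space
    .str.replace(r"\s+", " ", regex=True)      # collapse whitespace runs
    .str.strip()                               # drop leading/trailing spaces
)

print(demo["tweet"].tolist())
# ['wear a mask stay safe', 'covid 19 get tested']
```

Passing `regex=True` explicitly matters because newer pandas versions treat the pattern as a literal string by default.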

## Collect, store, read and clean tweets for CDCgov

Collect tweets from a single user, "CDCgov", and store them in a JSON file.

In [None]:
name = 'CDCgov'
get_tweets(username=name, search='COVID', since='2019-07-01', 
           until='2021-05-31', output='data_new//{}_1907_2105.json'.format(name)) 

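Read the data from the JSON file into a pandas dataframe, and clean the "CDCgov" tweets dataframe.

In [None]:
# Assumes get_tweets saved the file as data/data_new/CDCgov_1907_2105.json.
tweets_df = clean_df('{}_1907_2105'.format(name))
tweets_df.loc[:, 'date']   # check the date range and the number of tweets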

Check basic information of the cleaned tweets dataframe of "CDCgov".

In [None]:
tweets_df.shape

In [None]:
tweets_df.loc[0:10, 'tweet']

## Collect, store, read and clean tweets for 52 state governors

In [None]:
governor = pd.read_csv('data//governor-twitter-handle.csv')
governor.head()

Collect tweets from the 52 state governors, and store them in 52 separate JSON files.

In [None]:
k = len(governor)  # lower k to test manually on the first few governors
get_tweets_from_multiple_users(governor[:k], folder='data_new', search='COVID', since=None, until=None)

Read the 52 state tweets JSON files into 52 pandas dataframes, clean each dataframe, and store the cleaned dataframes in a dictionary keyed by state.

In [None]:
tweets_df = {}
for i in range(k):
    username = governor.iloc[i].State
    tweets_df[username] = clean_df(username)

Check basic information of any cleaned state tweets dataframe by looking it up in the dictionary by its key (here `username`, which still holds the last state from the loop above).

In [None]:
tweets_df[username].loc[:, 'date']   # check updates and number of tweets

In [None]:
tweets_df[username].shape

In [None]:
tweets_df[username].loc[0:10, 'tweet']
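
To check every state at once rather than one key at a time, the dictionary can be condensed into a summary table. This is a minimal sketch: `demo_dfs` is a hypothetical stand-in for the dictionary of cleaned dataframes, and its dates are assumed to be ISO-formatted strings so that `min` and `max` order them chronologically.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of cleaned dataframes.
demo_dfs = {
    "Illinois": pd.DataFrame(
        {"date": ["2020-03-01", "2020-04-15"], "tweet": ["a", "b"]}
    ),
    "Ohio": pd.DataFrame({"date": ["2020-05-02"], "tweet": ["c"]}),
}

# One row per state: number of tweets plus first and last tweet date.
summary = pd.DataFrame.from_dict(
    {
        state: {
            "n_tweets": len(df),
            "first": df["date"].min(),
            "last": df["date"].max(),
        }
        for state, df in demo_dfs.items()
    },
    orient="index",
)

print(summary)
```

States with unexpectedly few tweets or a short date range stand out in such a table and can then be inspected individually.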