<a href="https://colab.research.google.com/github/michellekan/smt203/blob/main/Lab2/Lab2_Twitter_API_ver2.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2 - Twitter API (v2)

In this lab, you will learn how to retrieve tweet data from Twitter using an open-source library called [Tweepy](https://docs.tweepy.org/en/latest/). Tweepy gives you a convenient way to access the Twitter API from Python.  

Also, check the official [Twitter API](https://developer.twitter.com/en/docs/twitter-api/getting-started/guide).

This lab is written by Michelle KAN (michellekan@smu.edu.sg) and Jisun AN (jisunan@smu.edu.sg). 

Let's first install the tweepy library:<br>

In [None]:
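## This is OPTIONAL if your environment (e.g., Google Colab) already has a Tweepy 3.x release installed.
## The code in this lab uses the Tweepy 3.x interface (e.g., api.search, tweepy.StreamListener),
## which was renamed or removed in Tweepy 4, so we install a 3.x release here.
!pip install "tweepy<4"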

In [None]:
# Add Google Drive as an accessible path (Optional if you are running from Jupyter Notebook)
from google.colab import drive
drive.mount('/content/drive')

# Change path to the designated Google Drive folder;
# otherwise, data will be saved in the /content folder, which you may have trouble locating
%cd /content/drive/My Drive/Colab Notebooks/

## 1) Authentication

The following code imports the tweepy library and other required libraries. Tweepy uses the [OAuthHandler](https://docs.tweepy.org/en/v3.5.0/auth_tutorial.html) class to authenticate with the Twitter API. 

In [None]:
import tweepy
from tweepy import OAuthHandler

Before using the Twitter API, you need a Twitter account and Twitter API authentication credentials.<br>Set your authentication credentials below. <br>

In [None]:
# Consumer/Access key/secret/token obtained from Twitter
# You should have created a Twitter app and gotten these keys.
# Do NOT share your key/secret/token with other students.
consumer_key    = ''
consumer_secret = ''
access_token    = ''
access_secret   = ''


The following code creates an authorization object from the credentials above and uses it to create the Twitter API client.

In [None]:
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# This line creates the client object used to call Twitter's REST API.
api = tweepy.API(auth)
#api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# The following code verifies whether the authentication is successful
# If all goes well, you should see a message saying Authentication OK.
# Otherwise, check your Consumer/Access key/secret/token
try:
    api.verify_credentials()
    print("Authentication OK")
except Exception as e:
    print("Error during authentication:", e)


## 2) Types of Twitter API & Tweepy Cursor

### 2-1) Twitter REST API

The REST API is used to pull data from Twitter. 

We can retrieve tweets matching a query, or tweets posted by a specific user, using `tweepy.Cursor`. 

`tweepy.Cursor` handles pagination: when a request returns many tweets spread over several pages, it lets you iterate over them as a single sequence.
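
To make the idea of pagination concrete, here is a minimal sketch of what a cursor does conceptually. It does not use Tweepy or the network: `fetch_page` and `iterate_items` are hypothetical names, and the "API" is simulated with a local list, so this is an illustration under those assumptions rather than how Tweepy is implemented.

```python
from itertools import islice

# Simulated data source standing in for a paginated API (hypothetical).
FAKE_TWEETS = [f"tweet {i}" for i in range(1, 12)]
PAGE_SIZE = 4


def fetch_page(page_number):
    """Return one page of results, or an empty list when no data is left."""
    start = page_number * PAGE_SIZE
    return FAKE_TWEETS[start:start + PAGE_SIZE]


def iterate_items(limit):
    """Yield up to `limit` items, requesting new pages only when needed."""
    def all_items():
        page_number = 0
        while True:
            page = fetch_page(page_number)
            if not page:
                return
            yield from page
            page_number += 1
    return islice(all_items(), limit)


first_five = list(iterate_items(5))
print(first_five)
```

The caller only asks for "five items" and never sees page boundaries, which is the convenience `tweepy.Cursor(...).items(n)` offers.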


#### a) Search tweets using Keywords

The code below returns five tweets containing the search words:

```
search_words = 'covid'
max_tweets = 5
tweets = tweepy.Cursor(api.search, q=search_words, tweet_mode='extended').items(max_tweets)
```


#### b) Users tweets

The code below returns five tweets posted by BillGates:

```
username = 'BillGates'
max_tweets = 5
tweets = tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items(max_tweets)
```


### 2-2) Streaming API tweets
The Twitter Streaming API is used to download Twitter messages in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream or user stream. See the [Twitter Streaming API Documentation](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview). In the snippet below, `myStream` is the stream object created in Section 5.

```
keyword = 'covid'
myStream.filter(track=[keyword])
```





## 3) Search Tweets

Now you are ready to search Twitter for recent tweets! 



### a) Search Tweets using Keywords


To create this query, you will define:
- the search term
- the start date of your search (optional)
 
Note: Search API returns tweets with specific search terms, posted in the last 7 days. You need a premium account for going further than 7 days.

(Optional) Uncomment and run the following code snippet if you wish to enable Python logging and see what happens underneath each API call.

In [None]:
# import logging
# logging.basicConfig(level=logging.DEBUG,
#                     format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
#                     datefmt='%m-%d %H:%M:%S')
# logger = logging.getLogger(__name__)

In [None]:
# Define the search term and the date_since date as variables
search_words = 'covid'
date_since = "2021-08-29"  # example date (YYYY-MM-DD); replace it with a date in the last 7 days, e.g. yesterday

max_tweets = 5
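
Since the Search API only covers the last 7 days, a hard-coded date goes stale quickly. The sketch below computes the date string instead; `days_ago` is a hypothetical helper name introduced only for this illustration.

```python
from datetime import date, timedelta


def days_ago(days_back, today=None):
    """Return the date `days_back` days before `today`, formatted as YYYY-MM-DD."""
    today = today or date.today()
    return (today - timedelta(days=days_back)).isoformat()


date_since = days_ago(1)  # yesterday, relative to the day you run the cell
print(date_since)
```

Passing `today` explicitly, for example `days_ago(1, date(2021, 8, 30))`, reproduces the example date used above.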

Below we use `tweepy.Cursor()` to search for tweets containing the specified `search_words` and to handle pagination. Parameters:
- `api.search` – Tweepy API method that returns a collection of relevant tweets matching a specified query
- `q` – the search query string of 500 characters maximum, including operators. Queries may additionally be limited by complexity.
- `lang` – restricts tweets to the given language
- `since` – returns tweets created on or after this date. The date should be formatted as YYYY-MM-DD.

You can restrict the number of tweets returned by passing a number to the `.items()` method. For example, `.items(5)` will return 5 of the most recent tweets.

In [None]:
# Below will return five tweets containing search words 
tweets = tweepy.Cursor(api.search, q=search_words, tweet_mode='extended').items(max_tweets)

In [None]:
# You can add other parameters (like lang, since, etc.)
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since,tweet_mode='extended').items(max_tweets)

`tweepy.Cursor(...).items()` returns an `ItemIterator` object that you can iterate over to access the collected tweet data. Each tweet item in the iterator has various attributes, including:

- the text of the tweet
- the date the tweet was sent
- the creator of the tweet
- the location set in the creator's profile
- and more. 

The code below loops through the object and prints the text associated with each tweet.

In [None]:
tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)

# Iterate tweets
for tweet in tweets:
    # print out user's screen name & tweet text
    print("----------------------------------------------------")
    print ('Tweet ID ' + str(tweet.id))
    print (f'Tweeted by: @{tweet.user.screen_name}, Created at: {str(tweet.created_at)}, Location: {tweet.user.location}' )

    # Extracting tweet text when in Extended Mode
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text
    print('\t Tweet: ' + text)
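
The retweet check in the loop above can be wrapped in a small helper. This is a sketch only: `get_full_text` is a hypothetical name, and the objects below are stand-ins built with `SimpleNamespace` that mimic just the two attributes used, not real Tweepy objects.

```python
from types import SimpleNamespace


def get_full_text(tweet):
    """Return the untruncated text of a tweet fetched with tweet_mode='extended'."""
    original = getattr(tweet, "retweeted_status", None)
    if original is not None:
        # For a retweet, the complete text lives on the original tweet.
        return original.full_text
    return tweet.full_text


# Stand-in objects (made up for illustration).
plain = SimpleNamespace(full_text="Stay safe everyone")
retweet = SimpleNamespace(
    full_text="RT @someone: Stay safe ever...",
    retweeted_status=SimpleNamespace(full_text="Stay safe everyone, and wash your hands"),
)

print(get_full_text(plain))
print(get_full_text(retweet))
```

Using `getattr` with a default avoids the `try`/`except AttributeError` block while keeping the same behaviour.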


<img align="left" src="https://docs.google.com/uc?id=1IegynNxVgb3GxQoXFD_HPJMRJcx8Rlmk" width="50" style="vertical-align:middle;margin:0px 5px"/><br>Note that **user locations** are manually entered into Twitter by the user. Thus, you will see a lot of variation in the format of this value.<br>

You can access a wealth of information associated with each tweet. Try to include other items available by checking out the [twitter developer documentation](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet).


#### **Save Tweets in a JSON format into a File**


The Twitter API limits how many requests we can make in a given time window (the Twitter rate limit). It is therefore good practice to save the collected data to a file, so that you do not need to request it again. 

__What is JSON?__

JavaScript Object Notation (JSON) is a standard text-based format for representing structured data based on JavaScript object notation syntax.
- Data is in key-value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays

Table / Database --> Text format

| id        | name           | tweet  |
| ------------- |:-------------:| -----:|
| 123      | Jisun | Hello |
| 456      | Michelle      |  Welcome |

JSON 
`[{"id":123,"name":"Jisun","tweet":"Hello"},{"id":456,"name":"Michelle","tweet":"Welcome"}]`

Other examples of JSON format:<br>
`{"code":"SMT203","desc":"CSS","num_of_students":46}`<br>
`{"code":"SMT203", "desc":"CSS", "students":{"qty":46, "school":"SCIS"}}`<br><br>

We can use `json.dumps()` to convert a Python object (e.g., a dictionary) into a JSON-formatted string that can be written to a file, while `json.loads()` parses a JSON string (e.g., a line read from a jsons file) and returns the corresponding Python object.

<img align="middle" src="https://docs.google.com/uc?id=1m8uElP-ak8FeUNlhpQ6xBO_Fwx4Zitv4" width="450"/>
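
As a quick illustration of the round trip between Python objects and JSON text, the sketch below reuses one of the example records above; the variable names are chosen only for this example.

```python
import json

record = {"code": "SMT203", "desc": "CSS", "students": {"qty": 46, "school": "SCIS"}}

as_text = json.dumps(record)      # Python dict -> JSON string
restored = json.loads(as_text)    # JSON string -> Python dict

print(type(as_text).__name__, type(restored).__name__)
print(restored["students"]["qty"])
```

Nested objects become nested dictionaries, so values are reached by chaining keys.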



In [None]:
import json

In [None]:
# set location for files to be saved
mypath = "."

tweets = tweepy.Cursor(api.search, q=search_words, lang="en", since=date_since, tweet_mode='extended').items(max_tweets)
# Write data into a file
filename = f"{mypath}/tweets_{search_words}.jsons"

with open(filename, "w") as output:
    for tweet in tweets:
        myjson = tweet._json
        output.write(json.dumps(myjson)+"\n")
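
The file written above holds one JSON object per line. The sketch below shows the same layout with made-up records and a temporary file, so it runs without Twitter access and does not touch your lab files.

```python
import json
import os
import tempfile

records = [
    {"id": 123, "name": "Jisun", "tweet": "Hello"},
    {"id": 456, "name": "Michelle", "tweet": "Welcome"},
]

# Write to a temporary folder (an assumption made for this illustration).
path = os.path.join(tempfile.mkdtemp(), "example.jsons")

with open(path, "w") as output:
    for record in records:
        output.write(json.dumps(record) + "\n")

# Each non-empty line parses on its own, so the file can be read line by line.
with open(path) as fi:
    loaded = [json.loads(line) for line in fi if line.strip()]

print(len(loaded), loaded[0]["name"])
```

Because every line is independent, a large file can be processed without loading it into memory all at once.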


Read tweets from the file.

Let's read the first tweet.

In [None]:
# Read data from a file
filename = f"{mypath}/tweets_{search_words}.jsons"

with open(filename) as fi:
    for line_cnt, line in enumerate(fi):
        tweet_json = json.loads(line.strip())
        break # Break here so that we read the first line of the file
        

In [None]:
# Print JSON-formatted text in a readable (pretty) way
import pprint

pprint.pprint(tweet_json)

In [None]:
# Check keys in json
tweet_json.keys()

In [None]:
# How to access values in json
print(tweet_json['id'])
print(tweet_json['user']['name'])

#### Extract data from json

In [None]:
# Read data from a file
mypath= "."
filename = f"{mypath}/tweets_{search_words}.jsons"

with open(filename) as fi:
    for line_cnt, line in enumerate(fi):
        tweet = json.loads(line.strip())

        tweetid = tweet['id']
        created_at = tweet['created_at']

        # Extract text from tweets in Extended Mode
        if 'retweeted_status' in tweet: # If it's Retweet
            text = tweet['retweeted_status']['full_text']
        else:  # Not a Retweet
            text = tweet['full_text']

        user_screen_name = tweet['user']['screen_name']
        user_location = tweet['user']['location']

        print("--------------------------")
        print (f'Tweet ID: {tweetid}')
        print (f'Tweeted by: @{user_screen_name}, Created at {created_at}, User Location: {user_location}' )
        print(f'\t {text}')

        break #If you want to read other lines, comment this out


### Exercise 1

Using the tweets retrieval code example given above, add on the following details for each tweet retrieved:
- Number of times the Tweet has been retweeted (a retweet is when someone shares someone else’s tweet.)
- Source/application used to post the Tweet.
- User's name and friends count in Twitter

You may refer to the [Twitter developer documentation](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet). 

An example of the expected tweet output is given as follows: the tweet has been retweeted 21 times, the tweet has been posted using 'Twitter for Android' and user 'Cotonete' has 82 friends in Twitter:<br>
<img align="center" src='https://drive.google.com/uc?export=view&id=1WHGR9Q9ou4_w_zMhioEfVyhV1VxYNenk' style="height: 110px;">

As shown above, there are two ways to do this: you can use the Tweepy API directly, or you can read the saved file. 

Try both!


In [None]:
## Enter your code below using Tweepy API


In [None]:
## Enter your code below using the saved json file



#### Removing Retweets

In the above example, some of the tweets retrieved may contain the prefix 'RT', which means they are retweets. A retweet is when someone shares someone else’s tweet. It is similar to sharing on Facebook. Sometimes you may want to remove retweets, as they contain duplicate content that might skew your analysis if you are only looking at word frequency. Other times, you may want to keep retweets.

Below, you ignore all retweets by adding `-filter:retweets` to your query. You may wish to check out the [Tweepy API](https://docs.tweepy.org/en/latest/api.html) documentation for other ways to customize your queries.

In [None]:
new_search = search_words + " -filter:retweets" 
# new_search has the value "covid -filter:retweets"

tweets = tweepy.Cursor(api.search,q=new_search, lang="en",since=date_since).items(8)

for tweet in tweets:
    print("----------------------------------------------------")
    print (f'Tweeted by: @{tweet.user.screen_name} Created at: {str(tweet.created_at)} Location: {tweet.user.location}' )
    print(f'\tText: {tweet.text}')
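
If tweets have already been collected, retweets can also be dropped locally. This is a minimal sketch, assuming retweets are recognisable by the `RT @` prefix in their text; the sample texts and the helper name `is_retweet` are made up for illustration.

```python
# Made-up sample texts; in the lab these would come from the collected tweets.
texts = [
    "RT @who: Wash your hands regularly",
    "Got my second dose today",
    "RT @who: Wash your hands regularly",
    "Art exhibition reopens next week",
]


def is_retweet(text):
    """Treat a text as a retweet if it starts with the 'RT @' prefix."""
    return text.startswith("RT @")


originals = [t for t in texts if not is_retweet(t)]
print(originals)
```

When the full tweet object is available, checking for `retweeted_status` is more reliable than inspecting the text.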
    

### Create a Pandas Dataframe From A List of Tweet Data

Instead of displaying the tweets on screen, you can also populate a pandas dataframe with the retrieved tweet data.

[Pandas](https://pandas.pydata.org/) is a widely used library for handling tabular data. 
It is arguably the de facto standard library for this purpose in Python.
Let's import pandas.

As typing 'pandas' is hard (...), in most cases, pandas is imported as shown below:

In [None]:
import pandas as pd

Then, you can create a pandas dataframe from the collected data.

You first append each row of data to a list, then convert the list to a dataframe.

You can imagine that the dataframe is a table. 
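
The list-to-dataframe step can be tried without Twitter access. In this sketch the rows are made up and follow the same `[user, created_at, location, text]` layout used in this lab; `example_df` is a name chosen only for the illustration.

```python
import pandas as pd

# Made-up rows: each inner list becomes one row of the table.
rows = [
    ["alice", "2021-08-29 10:00:00", "Singapore", "Got my vaccine today"],
    ["bob", "2021-08-29 10:05:00", "", "Staying home this week"],
]

example_df = pd.DataFrame(data=rows, columns=["user", "created_at", "location", "text"])
print(example_df.shape)
```

The `columns` list must have one name per element of each inner list.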

In [None]:
# setting parameters and retrieving tweets
new_search = search_words + " -filter:retweets" 
tweets = tweepy.Cursor(api.search,q=new_search, lang="en",since=date_since,tweet_mode='extended').items(8)

## initialise list to be used to store tweets retrieved
tweets_list = []

## appending tweets retrieved into a list
for tweet in tweets:
    
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text

    tweets_list.append([tweet.user.screen_name, tweet.created_at, tweet.user.location, text])

# populate dataframe with list of tweets
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','created_at','location','text'])
tweet_df

In [None]:
## save the data into a csv file
mypath= "."
tweet_df.to_csv(f'{mypath}/covid_tweet.csv', index=False)

### Pandas basics 

(You can skip this sub-section if you are already familiar with Pandas.) 

Let's read the CSV file using pandas. 

In [None]:
## By default, the read_csv() function assumes that the separator is ','. Thus, we could omit the sep argument here. 

df = pd.read_csv(f'{mypath}/covid_tweet.csv', sep=',')
df.head(n=5)

You can also view the last n rows with `tail()`.

In [None]:
df.tail(n=5)

You can check the number of rows and columns with `.shape`.


In [None]:
df.shape

In [None]:
## below is how you can access the values of df.shape 
print (f'{df.shape[0]} rows and {df.shape[1]} columns')

We will work with pandas more later in this lab.

### b) Search Tweets by Specific User

Besides searching by keyword, we can also retrieve tweets posted by a specific Twitter user. 

Parameters:
-   `api.user_timeline` – tweepy api method that returns the most recent statuses (up to 20) posted from the user specified.
-   `id` – unique user ID or screen name of a user
-   `lang` – restricts tweets to the given language
-   `include_rts` – boolean indicator to specify whether to include retweets
-   `exclude_replies` – boolean indicator to specify whether to exclude tweet replies

Similarly, you can restrict the number of tweets returned by specifying a number in the `.items()` method. `.items(10)` will return 10 of the most recent tweets.

Let's look at the following example that retrieves tweets posted by UK Model World Health Organization. 

In [None]:
import pandas as pd

user_id = "UKModelWHO"

## initialise list to be used to store tweets retrieved
tweets_list = []

## appending tweets retrieved into a list
for tweet in tweepy.Cursor(api.user_timeline, id=user_id ,lang="en", include_rts=False, exclude_replies=True, tweet_mode='extended').items(10):
    try: # If it's Retweet
        text = tweet.retweeted_status.full_text
    except AttributeError:  # Not a Retweet
        text = tweet.full_text
#     print(f'Retweeted: {tweet.retweet_count}times')
    tweets_list.append([tweet.user.screen_name, tweet.id, tweet.created_at, tweet.retweet_count, text])

# populate dataframe with list of tweets specifying required column names
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','tweetid','created_at', 'retweet_count', 'text'])
tweet_df



In [None]:
## save the data into a csv file
tweet_df.to_csv('ukmodelwho_tweet.csv')

You can also save the tweets in their original JSON format:

In [None]:
user_id = "UKModelWHO"

tweets = tweepy.Cursor(api.user_timeline, id=user_id ,lang="en", include_rts=False, exclude_replies=True, tweet_mode='extended').items(10)

filename = f"{mypath}/tweets_{user_id}.jsons"
with open(filename, "w") as output:
    for tweet in tweets:
        myjson = tweet._json
        output.write(json.dumps(myjson)+"\n")


Create a dataframe from the JSON file:

In [None]:
tweets_list = []

filename = f"{mypath}/tweets_{user_id}.jsons"
with open(filename) as fi:
    for line_cnt, line in enumerate(fi):

        tweet = json.loads(line.strip())

        tweetid = tweet['id']
        created_at = tweet['created_at']
        
        retweet_count = tweet['retweet_count']
        # Extract text from tweets in Extended Mode
        if 'retweeted_status' in tweet: # If it's Retweet
            text = tweet['retweeted_status']['full_text']
        else:  # Not a Retweet
            text = tweet['full_text']

        user_screen_name = tweet['user']['screen_name']

        tweets_list.append([user_screen_name, tweetid, created_at, retweet_count, text])

# populate dataframe with list of tweets specifying required column names
tweet_df = pd.DataFrame(data=tweets_list, columns=['user','tweetid', 'created_at', 'retweet_count', 'text'])
tweet_df

    

In [None]:
## save the data into a csv file (comma-separated)
tweet_df.to_csv('ukmodelwho_tweet_simple_2.csv')


#### Find the top three rows with highest retweet_count

In [None]:
df1 = tweet_df.sort_values('retweet_count',ascending = False).head(3)
df1

#### Keep only rows with a retweet_count of at least 2

You can filter a dataframe with `query()`. See the official documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html


In [None]:
# Let's see how many rows the current dataframe has 
tweet_df.shape

In [None]:
rt_tweet_df = tweet_df.query('retweet_count >= 2')
print(rt_tweet_df.shape) # number of tweets with at least 2 retweets (the value will differ in your case)
rt_tweet_df
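
The sorting and filtering steps can be checked on a tiny made-up table. In this sketch, `example_df` and its values are invented for illustration; the boolean-mask line shows an equivalent way to write the same filter.

```python
import pandas as pd

example_df = pd.DataFrame({
    "tweetid": [1, 2, 3, 4],
    "retweet_count": [0, 5, 2, 1],
})

# Two rows with the highest retweet_count
top_two = example_df.sort_values("retweet_count", ascending=False).head(2)

# Rows with at least 2 retweets, written with query() ...
at_least_two = example_df.query("retweet_count >= 2")
# ... and the same filter written as a boolean mask
same_filter = example_df[example_df["retweet_count"] >= 2]

print(top_two["tweetid"].tolist())
print(at_least_two["tweetid"].tolist())
```

Note that `query()` keeps the original row order, while `sort_values()` reorders the rows.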

### Exercise 2

Can you find the tweets with the most favorites (likes)? 

Step 1: From the JSON file, extract the number of favorites of each tweet and add it to your dataframe.

Step 2: Sort the rows by the number of likes and print the top 3. 


In [None]:
# Enter your code below





### Exercise 3 (Optional)

Can you compute the correlation between two variables in the data: favorite_count and retweet_count? 

To compute the correlation between two variables, you can use the ```scipy.stats.pearsonr``` function. 

Find more information about it [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html). 

Don't forget that you will need to import the function first! 
```from scipy.stats import pearsonr```
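
To see what the correlation coefficient measures, the sketch below computes Pearson's r by hand using only the standard library. The function name `pearson_r` and the sample numbers are made up for illustration; for the exercise itself, use `pearsonr` from SciPy, which also returns a p-value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: how x and y vary together, normalised by how much each varies alone."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


likes = [1, 2, 3, 4, 5]
retweets = [2, 4, 6, 8, 10]  # exactly twice the likes, so r should be very close to 1
print(pearson_r(likes, retweets))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.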


In [None]:
# Enter your code below




In [None]:
# compute the correlation between favorite_count and retweet_count
from scipy.stats import pearsonr
pearsonr( ___ , ___ )

## 4) Draw a word cloud

A word cloud, which became popular as the tag cloud in the era of blogs, is often used to show which words appear frequently. A detailed explanation is available at https://en.wikipedia.org/wiki/Tag_cloud

In [None]:
## This is OPTIONAL if you are running the current notebook using Google Colab
!conda install --yes -c conda-forge wordcloud

In [None]:
# Import relevant libraries

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

#### Let's draw WordCloud of the collected tweets

Load the data into a dataframe:

In [None]:
infilename = f"{mypath}/covid_tweet.csv" 
df = pd.read_csv(infilename, sep=",")
print(df.shape)
df.head()


As the WordCloud `generate()` method requires a **string** as its input, we need to concatenate all the rows of the 'text' column in the dataframe into a single string with join().

This can be done in one line of code using a generator expression (which works like a list comprehension): 
```all_tweets = " ".join(one_row for one_row in df['text'])```

Let's break this code down and check how it works.

In [None]:
# access the item at index 2 (the third row) of df['text']
df['text'][2]

In [None]:
# access the first five rows of df['text']
df['text'][:5]

In [None]:
# you can print the first 5 rows of df['text'] 
for one_row in df['text'][:5]:
    print(one_row)

In [None]:
# below will read the first 5 rows of df['text'] and return them as a list 
[one_row for one_row in df['text'][:5]]

In [None]:
# how the join function works: it concatenates the items in the list and returns a single string
sample = ['jisun is cool', 'michelle is cool', 'we all are cool']
" ".join(sample)

In [None]:
" ".join([one_row for one_row in df['text'][:5]])

In [None]:
# this will concatenate all rows in df['text'] and return one single sentence!
all_tweets = " ".join([one_row for one_row in df['text']])

In [None]:
# Draw the WordCloud from the concatenated tweet text
wordcloud = WordCloud(stopwords=STOPWORDS, background_color="white", width=1000, height=500).generate(all_tweets)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
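
A word cloud scales each word roughly by how often it appears. The sketch below shows that underlying count with the standard library only; the sample text and the tiny stopword set are made up (the wordcloud package ships its own, larger `STOPWORDS` set).

```python
from collections import Counter

# Made-up text and a hand-written stopword set, for illustration only.
sample_text = "the vaccine is safe and the vaccine is free"
stopwords = {"the", "is", "and"}

# Lowercase, split on whitespace, and drop stopwords before counting.
words = [w for w in sample_text.lower().split() if w not in stopwords]
counts = Counter(words)
print(counts.most_common(1))
```

Inspecting the counts this way is a quick check on why a particular word dominates the picture.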

Suppose we are curious about the context of 'vaccine'. 
You can search the dataframe as follows:

In [None]:
df[df["text"].str.contains("vaccine", na=False)]

### Exercise 4 - Let's compare two hashtags

People use hashtags to discuss certain topics and to express their opinions. For example, in the discussion of vaccination, pro-vaccine people have used #vaccinessavelives and anti-vaccine people have used #mybodymychoice. By collecting tweets that include those hashtags, you can understand what people on the pro-vaccine and anti-vaccine sides are talking about. In this exercise, you will collect tweets about vaccination and analyze the differences between the two groups. 

Please collect tweets using two different hashtags, **#vaccinessavelives** and **#mybodymychoice**, which *oppose* each other, and then draw word clouds to compare the context around the two hashtags. Please note that you don't need to add # to the keyword when searching.  


1) Create a function named ```searchTweet``` whose inputs are ```search_words```, ```date_since```, and ```max_tweets```, and whose output is a file named ```tweets_{search_words}.jsons``` that stores the tweets in JSON format. 

2) Create a function named ```parseTweet``` whose input is ```search_words``` and whose output is a file named ```simple_stream_tweets_{search_words}.tsv```, a tab-separated file that includes the following tweet information: ```tweetid```, ```user_screen_name```, ```created_at```, and ```text```. 

3) Collect tweets for the two hashtags, #vaccinessavelives and #mybodymychoice, for the last 3-7 days with max_tweets=100. 

4) Draw a word cloud for each hashtag. You can create a function named ```drawWordcloud``` whose input is ```search_words``` and whose output is the plot. 


In [None]:
# # Import relevant libraries
# import pandas as pd
# import json 
# from wordcloud import WordCloud, STOPWORDS
# import matplotlib.pyplot as plt
# import tweepy
# from tweepy import OAuthHandler

# # Consumer/Access key/secret/token obtained from Twitter
# # You should have created a Twitter app and gotten these keys.
# # Do NOT share your key/secret/token with other students.
# consumer_key    = ''
# consumer_secret = ''
# access_token    = ''
# access_secret   = ''

# auth = OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)

# # This line finally calls Twitter's Rest API.
# api = tweepy.API(auth)

# try:
#     api.verify_credentials()
#     print("Authentication OK")
# except Exception as e:
#     print("Error during authentication:", e)


In [None]:
def searchTweet(search_words, date_since, max_tweets):
    # Enter your code for Step 1 here
    pass  # placeholder so that the cell runs; replace it with your code
 
    

In [None]:
def parseTweet(search_words):
    # Enter your code for Step 2 here
    pass  # placeholder so that the cell runs; replace it with your code


In [None]:
date_since = "2021-08-26" 
max_tweets = 100

In [None]:
# Step 3: run the following to collect tweets for #vaccinessavelives
search_words = "vaccinessavelives"
searchTweet(search_words, date_since, max_tweets)
parseTweet(search_words)

In [None]:
# Step 3: run the following to collect tweets for #mybodymychoice
search_words = "mybodymychoice"
searchTweet(search_words, date_since, max_tweets)
parseTweet(search_words)

In [None]:
def drawWordcloud(search_words):
    # Enter your code for Step 4 drawWordcloud here
    pass  # placeholder so that the cell runs; replace it with your code
    


In [None]:
# Step 4: Run the following to draw wordcloud for #vaccinessavelives
search_words = "vaccinessavelives"
drawWordcloud(search_words)


In [None]:
# Step 4: Run the following to draw wordcloud for #mybodymychoice
search_words = "mybodymychoice"
drawWordcloud(search_words)


## 5) Streaming API [Optional]

For details of Streaming API, see [Twitter Streaming API Documentation](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview)

Step 1: Creating a StreamListener

`on_data()` is called when new data comes in


In [None]:
class MyStreamListener(tweepy.StreamListener):

    """ A listener that handles tweets received from the stream.
    This basic listener writes each received tweet to the output file
    (the global `myoutput`, which is opened before the stream starts).

    """
    def on_data(self, data):
        myjson=data[:-1]
        myoutput.write(myjson+"\n")
        return True

    def on_error(self, status):
        print ("Error", status)


Step 2: Creating a Stream

In [None]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener, tweet_mode='extended')


You need to stop the process before it collects too much data!!
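
One way to bound the collection is to stop after a fixed number of tweets. The sketch below shows only the counting logic in a plain Python class, so it runs without Tweepy or a network connection; `CountingListener` and `max_tweets` are hypothetical names. It assumes the Tweepy 3.x behaviour in which the stream stops when `on_data()` returns `False`; to use it for real, the class would need to inherit from `tweepy.StreamListener`.

```python
import io


class CountingListener:
    """Writes incoming raw tweets to a file-like object and stops after a limit."""

    def __init__(self, output, max_tweets):
        self.output = output
        self.max_tweets = max_tweets
        self.count = 0

    def on_data(self, data):
        self.output.write(data.strip() + "\n")
        self.count += 1
        # Returning False signals "stop"; True means "keep going".
        return self.count < self.max_tweets


# Simulate a stream delivering raw JSON strings one at a time.
buffer = io.StringIO()
listener = CountingListener(buffer, max_tweets=2)
for raw in ['{"id": 1}\n', '{"id": 2}\n', '{"id": 3}\n']:
    if listener.on_data(raw) is False:
        break

print(listener.count)
```

Passing the output file to the listener, instead of relying on a global variable, also makes the listener easier to reuse.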

In [None]:
keyword = 'covid'

myfilename = f'{mypath}/stream_tweets_{keyword}.jsons'
myoutput = open(myfilename, 'w')

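try:
    # myStream.filter(track=['coronavirus', 'covid', 'chinese virus', 'wuhan', 'ncov', 'sars-cov-2', 'koronavirus', 'corona', 'cdc', 'N95', 'kungflu', 'epidemic', 'outbreak', 'sinophobia', 'china', 'pandemic', 'covd'])
    # filter() blocks until the stream stops; interrupt the kernel to stop collecting
    myStream.filter(track=[keyword])
except KeyboardInterrupt:
    print("Stream stopped by user")
finally:
    # Disconnect and close the file so that all collected tweets are written to disk
    myStream.disconnect()
    myoutput.close()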


In [None]:
outfilename = f"{mypath}/simple_stream_tweets_{keyword}.tsv" 

with open(myfilename) as fi, open(outfilename, 'w') as output:
    # Write header in the file to load the file into dataframe
    output.write("\t".join(['user_screen_name', 'tweetid', 'created_at', 'text'])+"\n")
    
    for line_cnt, line in enumerate(fi):
        try:
            tweet = json.loads(line.strip())
        except json.JSONDecodeError: # The last JSON record may be incomplete
            continue
        
        if 'limit' in tweet:
            continue
        
        tweetid = tweet['id']
        
        created_at = tweet['created_at']
        user_screen_name = tweet['user']['screen_name']

        # Extract Tweet text from Streaming API when in Extended Mode:
        # long tweets carry their full text under 'extended_tweet'
        text = tweet['text']
        try:
            text = tweet['extended_tweet']['full_text']
        except KeyError:
            pass

        # Below line will remove all tabs and line breaks from text
        text = " ".join(text.split())

        output.write("\t".join([user_screen_name, str(tweetid), created_at, text])+"\n")
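
Joining fields with `"\t"` works here because tabs and line breaks were removed from the text first. As an alternative sketch, Python's built-in `csv` module can write tab-separated rows and handles quoting itself; the sample row is made up and written to an in-memory buffer rather than a file.

```python
import csv
import io

header = ["user_screen_name", "tweetid", "created_at", "text"]
# Made-up row whose text contains both quotes and a tab character.
rows = [["alice", 1, "Sun Aug 29 10:00:00 +0000 2021", 'He said "hi"\tand left']]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(header)
writer.writerows(rows)

# Reading back with the same delimiter recovers the original fields as strings.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
print(parsed[1][3])
```

With this approach the text does not need to be cleaned of tabs before writing, because the writer quotes any field that contains the delimiter.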


In [None]:
infilename = f"{mypath}/simple_stream_tweets_{keyword}.tsv" 
df = pd.read_csv(infilename, sep="\t")
print(df.shape)
df.head()