# Penn State DS 200 Fall 2020
# Lab 5 Tweets Gathering 

## Instructor: Professor John Yen
## TA: Rupesh Prajapati
## LA: Nathan Tack

# Learning Objectives:
- Be able to apply for a Twitter Developer Account and obtain approval.
- Be able to obtain API keys and Access Tokens, which are needed for gathering tweets from Twitter.
- Be able to identify a set of keywords and hashtags for sampling tweets relevant to a topic of interest.
- Be able to install Tweepy, a Python library/module for gathering tweets using the Twitter API.
- Be able to use API keys and Access Tokens to run a Tweets gathering Python code.
- Be able to read Tweets gathered as a Table.

# Exercises: 5
- Exercise 1: 20 points
- Exercise 2: 10 points
- Exercise 3: 10 points
- Exercise 4: 5 points
- Exercise 5: 10 points

# Total Points: 55 points

# Due Date: 5 pm, September 28th, 2020

### Install Tweepy
The first thing we will do is install Tweepy, a Python library/module for gathering tweets using the Twitter API.

In [1]:
!pip install tweepy



In [2]:
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

import sys
import os
import json
import time
import datetime
import re

import pandas as pd

# Mounting Google Drive

Like the previous labs, we need to mount Google Drive so that the collected tweets can be saved there.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Python Code for Gathering Tweets
The following code defines a listener class that responds to tweets (sent by the Twitter API) that match the specified keywords and hashtags. The code also filters out non-English tweets and performs some simple preprocessing (e.g., removing non-ASCII characters from the body of the tweet), so that we do not need to worry about them later.

In [4]:
class MyListener(StreamListener):
    def __init__(self, raw_file, csv_file, text_file, max_num=300):
        super().__init__()
        self.raw_file = raw_file
        self.csv_file = csv_file
        self.text_file = text_file
        self.max_num = max_num
        self.count = 0
        self.start_time = time.time()

    def on_data(self, data):
        # Filter out special cases
        if data.startswith('{"limit":'):
            return

        # Filter out non-English tweets
        tweet = json.loads(data)
        if tweet.get('lang') != 'en':  # .get avoids a KeyError on messages without a 'lang' field
            return
        # if 'retweeted_status' in tweet:
        #     return

        # Extract fields from tweet and write to csv_file
        user_id = tweet['user']['id']
        user_name = tweet['user']['name']
        tweet_time = tweet['created_at']
        location = tweet['user']['location']
        text = tweet['text'].strip().replace('\n', ' ').replace('\t', ' ')

        # Remove non-ASCII characters and commas in user_name and location
        if user_name is not None:
            user_name = ''.join([c if ord(c) < 128 else '' for c in user_name])
            user_name = user_name.replace(',', '')
        if location is not None:
            location = ''.join([c if ord(c) < 128 else '' for c in location])
            location = location.replace(',', '')

        # Remove non-ASCII characters in text
        text = ''.join([c if ord(c) < 128 else '' for c in text])
        # Replace commas with space
        text = text.replace(',', ' ')
        # Replace double quotes with blanks
        text = re.sub(r'\"', '', text)
        # Replace consecutive underscores with space
        text = re.sub(r'[_]{2,}', ' ', text)
        # Remove all consecutive whitespace characters
        text = ' '.join(text.split())

        # Check if csv_file, text_file exist
        # If not, create them and write the heads
        if not os.path.isfile(self.csv_file):
            with open(self.csv_file, 'w') as f:
                f.write(','.join(['user_id', 'user_name', 'tweet_time', 'location', 'text']) + '\n')
        if not os.path.isfile(self.text_file):
            with open(self.text_file, 'w') as f:
                f.write('text\n')

        with open(self.raw_file, 'a') as f_raw, open(self.csv_file, 'a') as f_csv, open(self.text_file, 'a') as f_text:
            # Write to files
            f_raw.write(data.strip() + '\n')
            f_csv.write(','.join(map(str, [user_id, user_name, tweet_time, location, text])) + '\n')
            f_text.write(text + '\n')

            # Increment count
            self.count += 1
            # if self.count % 10 == 0 and self.count > 0:
            sys.stdout.write('\r{}/{} tweets downloaded'.format(self.count, self.max_num))
            sys.stdout.flush()

            # Check if reaches the maximum tweets number limit
            if self.count == self.max_num:
                print('\nMaximum number reached.')
                end_time = time.time()
                elapse = end_time - self.start_time
                print('It took {} seconds to download {} tweets'.format(elapse, self.max_num))
                sys.exit(0)

    def on_error(self, status):
        print(status)
        return True

# Get the str representation of the current date and time    
def current_datetime_str():
    return format(datetime.datetime.now(), "%Y-%m-%d_%H-%M-%S")
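
The text preprocessing performed inside `on_data` can be tried on its own, without a Twitter connection. The sketch below repeats the same cleaning steps in a standalone function; the names `clean_text` and `sample` are hypothetical and are not part of the lab code.

```python
import re

def clean_text(text):
    """Apply the same cleaning steps that MyListener.on_data applies to a tweet body."""
    # Flatten newlines and tabs into spaces
    text = text.strip().replace('\n', ' ').replace('\t', ' ')
    # Drop non-ASCII characters
    text = ''.join(c for c in text if ord(c) < 128)
    # Replace commas with spaces so the text cannot split a CSV row
    text = text.replace(',', ' ')
    # Remove double quotes
    text = text.replace('"', '')
    # Replace runs of two or more underscores with a space
    text = re.sub(r'_{2,}', ' ', text)
    # Collapse consecutive whitespace into single spaces
    return ' '.join(text.split())

# Hypothetical sample text, made up for illustration
sample = 'Caf\u00e9 trip,\n"great"  view___here\t#travel'
print(clean_text(sample))  # prints: Caf trip great view here #travel
```

Because commas are removed from every field, each tweet occupies exactly five comma-separated columns in the CSV file.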

# Establish a Twitter Developer Account

Follow the instructions in the slides for Lab5 to obtain the approval for your Twitter Developer Account.


## Exercise 1 (20 points)
Paste your API Key, API Secret Key, Access Token, and Access Token Secret into the four strings below, which are assigned to the corresponding variables consumer_key, consumer_secret, access_token, and access_secret. The Python code uses them to authenticate with the Twitter API before it can gather real-time tweets that match your keywords and hashtags.

#### Note: Make sure you copy each key exactly as it is. In particular, check the first and last characters to make sure none is missing. Also, double-check that you did not accidentally include a space or a left parenthesis when copying the keys and tokens.
#### Create a keywords.txt file and upload it from your computer to Google Drive under a Tweets folder in DS200Labs

In [5]:
def main():
    # Path for Google Drive for reading keywords and writing Tweets Gathered
    data_directory ='/content/drive/My Drive/DS200Labs/Tweets/'
    # Paste your keys and token below.  
    consumer_key = 'aOqzIZZwckT6T88ryZpULoxTu'
    consumer_secret = 'rDIk0sSLrrMY4gLb0OjFrQ1KYZ8xKQCotEsieTj7G1UZcNHpDB'
    access_token = '1304306874746109952-bfa4p5ULyXwKuVJIPsOXnGhbbwrtvF'
    access_secret = 'G3MVXUPMCMdpbhGz0fCMGYC8NTpvxWqqBysoIK6sh3RDd'


    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)

    # Welcome
    print('===========================================================')
    print('Welcome to the user interface of gathering tweets pipeline!')
    print('You can press "Ctrl+C" at anytime to abort the program.')
    print('===========================================================')
    print()

    # Prompt for input keywords
    methods = ['manual', 'file']
    print('How do you want to specify your key words?')
    while True:
        m = input('Type "manual" or "file" >>> ')
        if m in methods:
            break
        else:
            print('\"{}\" is an invalid input! Please try again.\n'.format(m))

    # Choose keywords:
    if m == 'file':
        print('===========================================================')
        print('Please input the file name that contains your key words.')
        print('Notes:')
        print('    The file should contain key words in one or multiple lines, and multiple key words should be separated by *COMMA*.')
        print('        For example: NBA, basketball, Lebron James')
        print('    If the file is under the current directory, you can directly type the file name, e.g., "keywords.txt".')
        print('    If the file is in another directory, please type the full file name, e.g., "C:\\Downloads\\keywords.txt" (for Windows), or "/Users/xy/Downloads/keywords.txt" (for MacOS/Linux).')

        while True:
            file_name = data_directory + input('Type your file name >>> ')
            # The line above is for reading an input file from data directory specified in the beginning of this function.
            if os.path.isfile(file_name):
                break
            else:
                print('"{}" is not a valid file name! Please check if the file exists.\n'.format(file_name))

        # Check the content of keywords file
        key_words = []
        with open(file_name, 'r') as f:
            lines = f.readlines()
            if len(lines) == 0:
                print('\n{} is an empty file!\nTask aborted!'.format(file_name))
                sys.exit(1)

            for line in lines:
                line = line.strip()
                # Detect non-ASCII characters
                for c in line:
                    if ord(c) >= 128:
                        print('\n{} contains non-ASCII characters: "{}" \nPlease remove them and try again'.format(file_name, c))
                        sys.exit(1)
                # Check delimiters
                if line.count(' ') > 1 and ',' not in line:
                    print('\nMore than 1 <space> symbols exist in the key words file, but none comma exists')
                    print('I\'m confused about your keywords. Please separate your key words by commas.')
                    sys.exit(1)

                words = line.split(',')
                for w in words:
                    if len(w.strip()) > 0:
                        key_words.append(w.strip())

        # Check key_words
        if len(key_words) == 0:
            print('\nZero key words are found in {}! Please check your key words file.'.format(file_name))
            sys.exit(1)

    elif m == 'manual':
        print('===========================================================')
        print('Please input your key words (separated by comma), and hit <ENTER> when done.')

        while True:
            line = input('Type the key words >>> ')
            line = line.strip()

            invalid_flag = False
            # Check empty
            if len(line) == 0:
                print('\nYour input is empty! Please try again.')
                invalid_flag = True
            # Detect non-ASCII characters
            for c in line:
                if ord(c) >= 128:
                    print('\nYour input contains non-ASCII characters: "{}"! Please try again.'.format(c))
                    invalid_flag = True
                    break
            # Check delimiters
            if line.count(' ') > 1 and ',' not in line:
                print('\nMore than 1 <space> symbols exist in your input, but none comma exists')
                print('I\'m confused about your keywords. Please try again')
                invalid_flag = True

            if invalid_flag:
                continue
            else:
                break

        # Process input
        key_words = []
        for w in line.split(','):
            if len(w.strip()) > 0:
                key_words.append(w.strip())

    # Print valid key words
    key_words = list(set(key_words))
    print('\n{} unique key words being used: '.format(len(key_words)), key_words)

    # Prompt for number of tweets to be gathered
    print('===========================================================')
    print('How many tweets do you want to gather? \nInput an integer number, or just hit <ENTER> to use the default number 300.')
    num_tweets = 300
    while True:
        s = input('Input an integer >>> ')
        s = s.strip()
        if len(s) == 0:
            break
        elif s.isdigit():
            num = int(s)
            if num > 0:
                num_tweets = num
                break
            else:
                print('\nPlease input a number that is greater than 0.')
        else:
            print('\nPlease input a valid integer number.')

    print('{} tweets to be gathered.'.format(num_tweets))

    # Streaming
    # TODO: remove '\t', '\n' and ',' in text field, also remove empty text
    print('===========================================================')
    print('Start gathering tweets ...')

    postfix = current_datetime_str()
    raw_file = 'raw_{}.json'.format(postfix)
    csv_file = 'data_{}.csv'.format(postfix)
    text_file = 'text_{}.csv'.format(postfix)
    
    # The data_directory is the directory in Google Drive, specified in the beginning, where the gathered Tweets will be stored
    raw_path = data_directory + raw_file
    csv_path = data_directory + csv_file
    text_path = data_directory + text_file

    twitter_stream = Stream(auth, MyListener(raw_file=raw_path, csv_file=csv_path, text_file=text_path, max_num=num_tweets))
    twitter_stream.filter(track=key_words)


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print('\nTask aborted!')
        


Welcome to the user interface of gathering tweets pipeline!
You can press "Ctrl+C" at anytime to abort the program.

How do you want to specify your key words?
Type "manual" or "file" >>> file
Please input the file name that contains your key words.
Notes:
    The file should contain key words in one or multiple lines, and multiple key words should be separated by *COMMA*.
        For example: NBA, basketball, Lebron James
    If the file is under the current directory, you can directly type the file name, e.g., "keywords.txt".
    If the file is in another directory, please type the full file name, e.g., "C:\Downloads\keywords.txt" (for Windows), or "/Users/xy/Downloads/keywords.txt" (for MacOS/Linux).
Type your file name >>> keyword.txt

4 unique key words being used:  ['#trip', 'Trip', 'Travel', '#travel']
How many tweets do you want to gather? 
Input an integer number, or just hit <ENTER> to use the default number 300.
Input an integer >>> 1000
1000 tweets to be gathered.
Start gat

SystemExit: ignored

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
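
The keyword-file handling inside `main()` can be sketched in isolation. The example below is a simplified version that only splits on commas and drops empty entries; the name `parse_keywords` and the example lines are hypothetical. Unlike `main()`, which uses `list(set(...))`, this sketch sorts the result so that the output order is reproducible.

```python
def parse_keywords(lines):
    """Split comma-separated lines into a sorted list of unique, non-empty keywords."""
    keywords = set()
    for line in lines:
        for word in line.strip().split(','):
            word = word.strip()
            if word:  # skip empty entries such as the gap in "a,,b"
                keywords.add(word)
    return sorted(keywords)

# Hypothetical content of a keywords file, one string per line
example_lines = ['NBA, basketball, Lebron James\n', '#NBA,, basketball\n']
print(parse_keywords(example_lines))
# prints: ['#NBA', 'Lebron James', 'NBA', 'basketball']
```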


In [6]:
!pip install datascience

Collecting folium==0.2.1
  Downloading https://files.pythonhosted.org/packages/72/dd/75ced7437bfa7cb9a88b96ee0177953062803c3b4cde411a97d98c35adaf/folium-0.2.1.tar.gz (69kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.2.1-cp36-none-any.whl size=79980 sha256=86ccce8b8aa35c2e5f0738302d94a48a2bceda2fddcda306aa77864d33f33851
  Stored in directory: /root/.cache/pip/wheels/b8/09/f0/52d2ef419c2aaf4fb149f92a33e0008bdce7ae816f0dd8f0c5
Successfully built folium
Installing collected packages: folium
  Found existing installation: folium 0.8.3
    Uninstalling folium-0.8.3:
      Successfully uninstalled folium-0.8.3
Successfully installed folium-0.2.1


In [7]:
from datascience import *

  matplotlib.use('agg', warn=False)


# Exercise 2 (10 points)
Run the code below to show the tweets file generated.

In [8]:
!ls /content/drive/My\ Drive/DS200Labs/Tweets/

data_2020-09-28_03-43-08.csv  raw_2020-09-28_03-43-08.json
keyword.txt		      text_2020-09-28_03-43-08.csv


# Exercise 3 (10 points)

Find the name of your tweets file that starts with "data". Replace XXXXXX in the path assignment below with the rest of the name of the tweets data file. Read the tweets table, and sort it by "text", which is the body of the tweets.

In [9]:
path = "/content/drive/My Drive/DS200Labs/Tweets/data_2020-09-28_03-43-08.csv"
tweets_table = Table.read_table(path)
sorted_tweets = tweets_table.sort("text")
sorted_tweets.show(10)

user_id,user_name,tweet_time,location,text
5674722,Davanum Srinivas,Mon Sep 28 03:44:04 +0000 2020,Massachusetts USA,#148 you are hilarious! I definitely enjoy your tweets e ...
63791823,eVisitorGuide,Mon Sep 28 03:49:03 +0000 2020,CHI MKE STL & Nashville,#Blues #brewpubs and some of the best #sportsbars in the ...
727580305108938752,WomensPowerCen,Mon Sep 28 03:44:03 +0000 2020,,#MTVStars #coke Through food drinks clothes amenities &a ...
437798374,BAL Immigration,Mon Sep 28 03:45:03 +0000 2020,,#Malaysia has relaxed its entry ban for some travelers. ...
1226094135431585792,,Mon Sep 28 03:43:32 +0000 2020,Kamagut,#UhuruNaKazi The new Naivasha- Njambini road is key to i ...
954473252969205760,FBMyNextCar / WebPass Social Network,Mon Sep 28 03:45:01 +0000 2020,Florida USA,#WebPass #Florida @EdMorseMazdaPortRichey checkout Speci ...
2907966134,shabnam roy,Mon Sep 28 03:47:03 +0000 2020,Mumbai India,#cleartrip I have been hearing this for over 6 months no ...
1130398556836306944,Styles Mix,Mon Sep 28 03:46:27 +0000 2020,,#deluxe #luxury #design Portable Comfortable Reversible ...
617853906,Brett Murphy,Mon Sep 28 03:45:12 +0000 2020,Bay Area CA,#selfemployed #network #makemoney #affiliateprogram Chec ...
1087583509005418496,4 Cycling Store,Mon Sep 28 03:49:11 +0000 2020,,#yogini #travel Universal Breathable Cycling Headbandhtt ...
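
The sorted table above starts with tweets whose text begins with "#". A minimal sketch of why, assuming the table sort orders strings by character code in the same way as Python's built-in `sorted` (the `texts` list is made up for illustration):

```python
# Hypothetical tweet texts, made up for illustration
texts = ["tl;dr they say", "#travel deals", "Travel is back", "2020 trip plans"]

# Strings are compared character by character using their character codes:
# '#' (35) < digits (48-57) < uppercase letters (65-90) < lowercase letters (97-122)
print(sorted(texts))
# prints: ['#travel deals', '2020 trip plans', 'Travel is back', 'tl;dr they say']
```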


# Counting Words or Hashtags in Tweets

Suppose we are interested in having a high-level view about the frequency that a specific word or a hashtag (e.g., COVID, #Covid or #Covid19) occurs in each tweet. We can do this in a few steps:
- First, convert the string of text (in each tweet) into a list of words using the Python function split.
- Then, count, for each tweet, how many times the word or hashtag you are interested in occurs in the list.  
- Third, add the count for each word or hashtag to obtain a total.

The following Python code demonstrates how split converts a string of text into a list of words, which enables further processing of the text.

In [10]:
String = "This is a tweet about tweet, but not a tweet about Covid-19. I don't think it matters, though. #COVID19"
BoW_list = String.split(" ")
print(BoW_list)

['This', 'is', 'a', 'tweet', 'about', 'tweet,', 'but', 'not', 'a', 'tweet', 'about', 'Covid-19.', 'I', "don't", 'think', 'it', 'matters,', 'though.', '#COVID19']


In [13]:
'tweet' in BoW_list

True
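
The three steps listed at the start of this section can be sketched with plain Python lists before applying them to the tweets table. The `tweets` list and the variable names are hypothetical and only serve as an illustration.

```python
# Hypothetical tweets, made up for illustration
tweets = [
    "travel tips for your next trip",
    "I love travel and travel loves me",
    "nothing relevant here",
]
word = "travel"

# Step 1: split each tweet into a list of words
bags = [t.split(" ") for t in tweets]
# Step 2: count how many times the word occurs in each list
counts = [bag.count(word) for bag in bags]
# Step 3: add the per-tweet counts to obtain a total
total = sum(counts)

print(counts, total)  # prints: [1, 2, 0] 3
```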

# Exercise 4 (5 points)
- (a) Complete the following code so that split uses a period "." (not a space " ") as the delimiter for splitting a given string.
- (b) Discuss the difference between using period "." versus " " in splitting a string. 

In [14]:
# Exercise 4 (a)
list2 = String.split(".")
print(list2)

['This is a tweet about tweet, but not a tweet about Covid-19', " I don't think it matters, though", ' #COVID19']


# Answer to Exercise 4 (b):

Splitting on a period breaks the string into sentence-like pieces (the periods are removed and leading spaces are kept), while splitting on a space breaks the string into individual words.

In [16]:
tweets_BoW = sorted_tweets.apply(lambda x: x.split(' '), "text") 

In [17]:
print(tweets_BoW)

[list(['#148', 'you', 'are', 'hilarious!', 'I', 'definitely', 'enjoy', 'your', 'tweets', 'especially', 'about', 'food/travel', 'multi', 'cultural', 'stuff.', 'Just', 'so', 'y', 'https://t.co/IQTvIL6Fve'])
 list(['#Blues', '#brewpubs', 'and', 'some', 'of', 'the', 'best', '#sportsbars', 'in', 'the', 'country!', 'Use', 'our', 'guide', 'to', 'plan', 'your', 'perfect', 'night', 'in', 'https://t.co/PfOL1tVGlP'])
 list(['#MTVStars', '#coke', 'Through', 'food', 'drinks', 'clothes', 'amenities', '&amp;', 'travel', 'we', 'enjoy', 'morethan', 'historys', '#emperors', 'We', 'still', 'grumble', 'https://t.co/J0Sno9FP6y'])
 ...
 list(['thats', 'exactly', 'what', 'it', 'sounds', 'like', 'cus', 'i', 'miss', 'u', 'biTCH'])
 list(['this', 'is', 'me', 'subtweeting', '@engelicaraya', '....', 'the', 'group', 'needs', 'a', 'beach', 'trip', '....'])
 list(['tl;dr', 'they', 'say', 'youll', 'still', 'have', 'to', 'take', 'off', 'your', 'shoes', 'at', 'tsa'])]


In [18]:
TweetsTable_BoW=sorted_tweets.with_column('BoW', tweets_BoW)
TweetsTable_BoW.show(5)

user_id,user_name,tweet_time,location,text,BoW
5674722,Davanum Srinivas,Mon Sep 28 03:44:04 +0000 2020,Massachusetts USA,#148 you are hilarious! I definitely enjoy your tweets e ...,"['#148', 'you', 'are', 'hilarious!', 'I', 'definitely', ..."
63791823,eVisitorGuide,Mon Sep 28 03:49:03 +0000 2020,CHI MKE STL & Nashville,#Blues #brewpubs and some of the best #sportsbars in the ...,"['#Blues', '#brewpubs', 'and', 'some', 'of', 'the', 'bes ..."
727580305108938752,WomensPowerCen,Mon Sep 28 03:44:03 +0000 2020,,#MTVStars #coke Through food drinks clothes amenities &a ...,"['#MTVStars', '#coke', 'Through', 'food', 'drinks', 'clo ..."
437798374,BAL Immigration,Mon Sep 28 03:45:03 +0000 2020,,#Malaysia has relaxed its entry ban for some travelers. ...,"['#Malaysia', 'has', 'relaxed', 'its', 'entry', 'ban', ' ..."
1226094135431585792,,Mon Sep 28 03:43:32 +0000 2020,Kamagut,#UhuruNaKazi The new Naivasha- Njambini road is key to i ...,"['#UhuruNaKazi', 'The', 'new', 'Naivasha-', 'Njambini', ..."


# Using .COUNT of List to Compute Word Frequency

We saw that the output of applying .split(' ') to a string is a list of the words/terms in the string (which we also refer to as "a Bag of Words").

In the following exercise, we are going to use the .count method of a list to count how many times a word occurs in the list. For example,

In [15]:
String = "This is a tweet about tweet, but not a tweet about Covid-19. #COVID19"
BoW_list = String.split(" ")
print(BoW_list)

['This', 'is', 'a', 'tweet', 'about', 'tweet,', 'but', 'not', 'a', 'tweet', 'about', 'Covid-19.', '#COVID19']


In [19]:
# This code returns a number that indicates how many times the list (BoW_list) contains the word "tweet"
BoW_list.count('tweet')

2

In [20]:
BoW_list.count('#Covid19')

0

In [21]:
BoW_list.count("#COVID19")

1
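
The results above show that .count only matches list items exactly: 'tweet,' (with the comma) is not counted as 'tweet', and '#Covid19' does not match '#COVID19'. A minimal sketch of a more forgiving count follows; the helper `normalized_count` is hypothetical, and the choice to lowercase tokens and strip only trailing `.,!?;:` is an assumption rather than part of the lab.

```python
def normalized_count(text, word):
    """Count occurrences of word in text, ignoring case and trailing punctuation."""
    # Strip trailing punctuation only, so a leading '#' in hashtags is preserved
    tokens = [t.rstrip('.,!?;:').lower() for t in text.split()]
    return tokens.count(word.lower())

String = "This is a tweet about tweet, but not a tweet about Covid-19. #COVID19"
print(normalized_count(String, 'tweet'))     # prints: 3
print(normalized_count(String, '#Covid19'))  # prints: 1
```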

# Exercise 5 (10 points)
Select a word or a hashtag you used to sample the tweets, and complete the following code for both (a) and (b):
- (a) Determine, for each tweet you gathered, how many times your chosen word (or hashtag) occurs by applying the .count method to the "BoW" column of the table TweetsTable_BoW.
- (b) Add the count for each tweet to the table TweetsTable_BoW as a new column.  You can choose the name of the column based on the name of the word or hashtag you used.
 

In [34]:
travel_count = TweetsTable_BoW.apply(lambda x: x.count('travel'), "BoW")
print(travel_count)

[0 0 1 ... 0 0 0]


In [35]:
Tweets_Table_BoW_Count = TweetsTable_BoW.with_columns("travel_count", travel_count)
Tweets_Table_BoW_Count.show(10)

user_id,user_name,tweet_time,location,text,BoW,travel_count
5674722,Davanum Srinivas,Mon Sep 28 03:44:04 +0000 2020,Massachusetts USA,#148 you are hilarious! I definitely enjoy your tweets e ...,"['#148', 'you', 'are', 'hilarious!', 'I', 'definitely', ...",0
63791823,eVisitorGuide,Mon Sep 28 03:49:03 +0000 2020,CHI MKE STL & Nashville,#Blues #brewpubs and some of the best #sportsbars in the ...,"['#Blues', '#brewpubs', 'and', 'some', 'of', 'the', 'bes ...",0
727580305108938752,WomensPowerCen,Mon Sep 28 03:44:03 +0000 2020,,#MTVStars #coke Through food drinks clothes amenities &a ...,"['#MTVStars', '#coke', 'Through', 'food', 'drinks', 'clo ...",1
437798374,BAL Immigration,Mon Sep 28 03:45:03 +0000 2020,,#Malaysia has relaxed its entry ban for some travelers. ...,"['#Malaysia', 'has', 'relaxed', 'its', 'entry', 'ban', ' ...",0
1226094135431585792,,Mon Sep 28 03:43:32 +0000 2020,Kamagut,#UhuruNaKazi The new Naivasha- Njambini road is key to i ...,"['#UhuruNaKazi', 'The', 'new', 'Naivasha-', 'Njambini', ...",0
954473252969205760,FBMyNextCar / WebPass Social Network,Mon Sep 28 03:45:01 +0000 2020,Florida USA,#WebPass #Florida @EdMorseMazdaPortRichey checkout Speci ...,"['#WebPass', '#Florida', '@EdMorseMazdaPortRichey', 'che ...",0
2907966134,shabnam roy,Mon Sep 28 03:47:03 +0000 2020,Mumbai India,#cleartrip I have been hearing this for over 6 months no ...,"['#cleartrip', 'I', 'have', 'been', 'hearing', 'this', ' ...",0
1130398556836306944,Styles Mix,Mon Sep 28 03:46:27 +0000 2020,,#deluxe #luxury #design Portable Comfortable Reversible ...,"['#deluxe', '#luxury', '#design', 'Portable', 'Comfortab ...",0
617853906,Brett Murphy,Mon Sep 28 03:45:12 +0000 2020,Bay Area CA,#selfemployed #network #makemoney #affiliateprogram Chec ...,"['#selfemployed', '#network', '#makemoney', '#affiliatep ...",0
1087583509005418496,4 Cycling Store,Mon Sep 28 03:49:11 +0000 2020,,#yogini #travel Universal Breathable Cycling Headbandhtt ...,"['#yogini', '#travel', 'Universal', 'Breathable', 'Cycli ...",0
