# Twitter Scraping

A brief guide to collecting and storing tweets.

*NOTE:* Twitter data is **big data**. Since this guide targets readers on run-of-the-mill laptops, the scope is deliberately narrow. Bear in mind that changing a few parameters (e.g. removing the tweet limit or the start date) can quickly eat up your storage space.

---

**Quick Reference**

Twitter Rate Limits: https://developer.twitter.com/en/docs/basics/rate-limiting

GoT3: https://github.com/Mottl/GetOldTweets3

tweepy: http://docs.tweepy.org/en/latest/

---

**Required Installs:**

```
!pip install GetOldTweets3

!pip install tweepy
```

---

**Overview:**

This notebook walks through two libraries for retrieving tweets, GetOldTweets3 and tweepy, taking you from a project idea to a working starting point.

* GoT3 takes a very minimal amount of time to start using, and offers more historical data than tweepy, but lacks streaming capability.

* Tweepy's main benefit is the ability to gather streaming data; however, it requires Twitter app authorization.

---

# Imports

In [1]:
import pandas as pd

import GetOldTweets3 as got
import tweepy

---

# Part 1: GetOldTweets3

For projects or models requiring a dataset beyond the scope of tweepy, use GetOldTweets3 (imported here as `got`).

Very little of anything is required, even code.

Tweets can be searched by keyword or username. Parameters include minimum date, maximum date, maximum number of tweets to scrape, and emoji format.

In [2]:
# basic query

basic_query = got.manager.TweetCriteria().setQuerySearch('coronavirus')\
                                         .setMaxTweets(1)

tweet = got.manager.TweetManager.getTweets(basic_query)

tweet[0].text

"Exactly. Don't worry about it. Even if the kids want to for a little skate in the park, what's the harm?"

In [3]:
# tweaking time and max tweets

earlier_tweets = got.manager.TweetCriteria().setQuerySearch('coronavirus')\
                                            .setSince('2020-01-15')\
                                            .setMaxTweets(5)

more_tweets = got.manager.TweetManager.getTweets(earlier_tweets)

for tweet in more_tweets:
    print(tweet.text + '\n')

Pior que o covid19 e o bovid17 só a minha vizinha falando que nem a Ivana de Avenida Brasil com o momolado dela, cara eles tem qse 50 anos

El Año del Coronavirus - XXXVII día de Confinamiento - Foto y frase: ©José María Navarro Cayuela #YoMeQuedoEnCasa #QuedateEnCasa #VenceremosAlVirus

Secondo me a fine anno prenderanno pure il premio produttività, che sarà ricalcolato tenendo conto delle difficoltà legata al #coronavirus. Scommettiamo o pensi che vinco facile?

#IranRegimeChange The regime's own officials are awakening Rouhani of the big mistake he's making You can support rights of +80M #Iranians, at risk of #COVID19 death, by appealing to @UNHumanRights for firm observation of decent #coronavirus measures to be taken by Rouhani. 

Americans at World Health Organization transmitted real-time information about coronavirus to Trump administration 



In [4]:
# copy-pastable function for username scrape bounded by time (no inherent max)

def tweet_to_df(user, start, stop):
    '''
    Function to scrape tweets from the GetOldTweets3 API into a dataframe.
    
    Refer to [https://pypi.org/project/GetOldTweets3/] for additional info and tweet fields.
    
    Parameters
    ----------
    user: username of the target account (swap setUsername for setQuerySearch to scrape by keyword)
    start: inclusive date to start scraping, as 'YYYY-MM-DD'
    stop: exclusive date to stop scraping, as 'YYYY-MM-DD'
    
    Example
    -------
    >>> tweet_to_df('@McDonalds', '2020-04-01', '2020-04-03')
    returns a dataframe (one row per tweet, 4 columns) of tweets by @McDonalds from April 1 and 2, 2020
    '''
    tweetCriteria = got.manager.TweetCriteria()\
                                .setUsername(user)\
                                .setSince(start)\
                                .setUntil(stop)
    
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    text_tweets = [[tweet.date, 
                    tweet.text,
                    tweet.favorites, 
                    tweet.retweets]
                   for tweet in tweets]
    
    return pd.DataFrame(text_tweets,
                        columns=['datetime', 'text', 'favorites', 'retweets'])
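
Because `start` is inclusive and `stop` is exclusive, a longer range can be cut into back-to-back windows that never overlap, which keeps each scrape small. Below is a minimal sketch using only the standard library. `date_windows` is a hypothetical helper written for this guide, not part of GetOldTweets3.

```python
from datetime import date, timedelta


def date_windows(start, stop, days=1):
    """Yield (since, until) pairs of 'YYYY-MM-DD' strings covering [start, stop)."""
    current = date.fromisoformat(start)
    end = date.fromisoformat(stop)
    step = timedelta(days=days)
    while current < end:
        upper = min(current + step, end)  # never run past the exclusive stop
        yield current.isoformat(), upper.isoformat()
        current = upper


windows = list(date_windows('2020-04-01', '2020-04-03'))
print(windows)
```

Each pair could then be passed to `tweet_to_df(user, since, until)`, with the resulting dataframes combined using `pd.concat`.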

In [5]:
%%time

# mo-tweets mo-time

mac_tweets = tweet_to_df('@McDonalds', '2020-04-01', '2020-04-03')

mac_tweets.shape  # rows, columns

CPU times: user 4.81 s, sys: 176 ms, total: 4.99 s
Wall time: 2min 30s


(856, 4)
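
Since this guide is also about storing tweets, here is one possible way to get rows onto disk a batch at a time, so a long scrape never has to sit fully in memory. It is a sketch under assumptions: `append_rows` is a hypothetical helper, the column names mirror those used in `tweet_to_df`, and the sample rows are made up.

```python
import csv
import os
import tempfile

COLUMNS = ['datetime', 'text', 'favorites', 'retweets']


def append_rows(path, rows, columns=COLUMNS):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(columns)
        writer.writerows(rows)


# made-up rows standing in for scraped tweets
sample_rows = [
    ['2020-04-01 12:00:00', 'first placeholder tweet', 3, 1],
    ['2020-04-02 08:30:00', 'second placeholder, with a comma', 0, 0],
]

# a temporary directory keeps the demo from touching real files
csv_path = os.path.join(tempfile.mkdtemp(), 'tweets.csv')
append_rows(csv_path, sample_rows[:1])  # first batch creates the file and header
append_rows(csv_path, sample_rows[1:])  # second batch only adds rows

with open(csv_path, newline='', encoding='utf-8') as f:
    stored = list(csv.reader(f))
print(len(stored))
```

For a dataframe that already fits in memory, such as `mac_tweets`, `mac_tweets.to_csv('mac_tweets.csv', index=False)` does the job in one call.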

---

# Part 2: tweepy

Cute name. You can push the skill-curve as high as you want on this one. For now, let's just hit the ground.

To get to this point:

1. [Jump through all the hoops, make a developer account](https://developer.twitter.com/en/apply-for-access)
2. [More hoops, get an app started](https://developer.twitter.com/en/apps)
    * No need to do anything beyond generating your app token and secret!
3. Open your app tab, click on "Keys and Tokens"
4. Create a text or JSON file to store all four values: API key, API secret key, access token, and access token secret

In [6]:
# json works too, this is for anyone who has simply copy-pasted into 4 lines...

# each line should look like...  thing: crazylongpassword
with open('secret.txt') as f:  # the file is closed as soon as it has been read
    info = f.read().splitlines()  # break it up by line

In [7]:
# now we have

for i in info:
    print(f"{i.split(':')[0]}: {'*'*10} \n")

access_token: ********** 

secret: ********** 

api_key: ********** 

api_secret_key: ********** 



In [8]:
# which allows us to access the Twitter API

access_token = info[0].split(': ')[1] # first line of secret.txt
secret = info[1].split(': ')[1] # second line

api_key = info[2].split(': ')[1]
api_secret = info[3].split(': ')[1]

auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, secret)
api = tweepy.API(auth)
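
For anyone who went the JSON route instead, a minimal sketch follows. It assumes the file is a single JSON object whose keys match the names printed above (`access_token`, `secret`, `api_key`, `api_secret_key`). `load_credentials` is a hypothetical helper, and the demo writes placeholder values to a temporary file rather than using real keys.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ('api_key', 'api_secret_key', 'access_token', 'secret')


def load_credentials(path):
    """Read credentials from a JSON file and check that every expected key is present."""
    with open(path, encoding='utf-8') as f:
        creds = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f'missing credential(s): {missing}')
    return creds


# demo with placeholder values in a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), 'secret.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    json.dump({key: 'placeholder' for key in REQUIRED_KEYS}, f)

creds = load_credentials(demo_path)
print(sorted(creds))
```

Looking values up by name, as in `tweepy.OAuthHandler(creds['api_key'], creds['api_secret_key'])`, avoids depending on the order of lines in the file.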

In [9]:
# quick check for sanity

try:
    api.verify_credentials()
    print('Grab your mental pickaxe!')
except Exception:  # a bare except would also swallow KeyboardInterrupt
    print("I can't believe you've done this.")

Grab your mental pickaxe!


In [10]:
# elemental, limited tweet stream

class StreamListener(tweepy.StreamListener):
    
    def __init__(self):
        super().__init__()
        self.counter = 0
        self.limit = 5
    
    def on_status(self, status): 
        if self.counter < self.limit:
            print(status.text + '\n')
            self.counter += 1
        else:
            return False

In [11]:
# setup and look for keywords

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)

stream.filter(track=['covid', 'covid-19', 'coronavirus'])

RT @thiruja: குவைத் நாட்டில் கண்டறியப்பட்ட 1658 கொரொனோ நோயாளிகளில் 924 பேர் இந்தியாவை சேர்ந்தவர்கள். எனினும் இந்தியர்களால் கொரொனோ பரவுகிறது…

RT @CoronaviridaeRu: Неофициально (по данным оппозиции) количество смертей вызванных коронавирусом в Иране более 31,500.
https://t.co/cIQt7…

No es solo nuestro gobierno el que ha podido hacer algo mal con el CoronaVirus, es el mundo entero

COVID-19 Datasets Now Available on Databricks: How the Data Community Can Help 
#bigdata #covid19… https://t.co/ABsQEtl1Xs

RT @DVATW: “Injustice”? Have you SEEN the shocking non conformance to social distancing that some BAME ppl in your city have engaged in? I…
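
To keep tweets rather than just print them, the same counter-and-limit idea can collect text into a list. This sketch is deliberately a plain class, not a tweepy subclass, so it runs without credentials. `LimitedCollector` and the stand-in status objects are hypothetical, and it assumes the same convention as the listener above, where returning `False` from `on_status` ends the stream.

```python
from types import SimpleNamespace


class LimitedCollector:
    """Keep the text of the first `limit` statuses, then signal the caller to stop."""

    def __init__(self, limit=5):
        self.limit = limit
        self.texts = []

    def on_status(self, status):
        if len(self.texts) >= self.limit:
            return False  # same stop signal the listener above uses
        self.texts.append(status.text)
        return True


# stand-in statuses; a real stream would pass tweepy Status objects
collector = LimitedCollector(limit=2)
results = [collector.on_status(SimpleNamespace(text=f'tweet {i}')) for i in range(3)]
print(collector.texts, results)
```

To use this with a real stream, the same two methods would go on a subclass of `tweepy.StreamListener` (with the `super().__init__()` call kept), and `collector.texts` could be written to disk once the stream stops.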

