# Twitter V2 Full Archive Search

This document shows how to use Tweepy to conduct a full archive search using v2 of the Twitter API.

## Prep work

In order to use this code, you will need to have a developer account on Twitter, with access to the Academic Research product track. Information about who is eligible and how to apply is [here](https://developer.twitter.com/en/products/twitter-api/academic-research).

Once you have an account, you will need to create a new app at https://developer.twitter.com/en/portal/dashboard and generate a "bearer token" from the app. Copy the bearer token to your clipboard and paste it into a new file in the same directory as this file, called `twitter_authentication.py`. The entire contents of the file should look like this:

```python
bearer_token = "YOUR BEARER TOKEN HERE"
```

Note that you should **never** share this token with anyone else. If, for example, you are saving your work in a Git repository, make sure that you add the `twitter_authentication.py` file to your `.gitignore`.

If anyone gets this token, they will be able to make API requests as your app and use up your quota, and you will need to revoke the token (from the same interface where you created it).

If you've created the file successfully, then the following two blocks of code should work.

In [1]:
import tweepy
from twitter_authentication import bearer_token
import time
import pandas as pd

In [2]:
client = tweepy.Client(bearer_token, wait_on_rate_limit=True)

## The Search API

Full documentation for searching tweets is at https://docs.tweepy.org/en/latest/client.html#search-tweets. There are a lot of different options, but here is a simple version that searches for "COVID hoax" tweets posted between January 20, 2021 and September 21, 2022. Because of `limit = 2`, it stops after two requests of up to 500 tweets each.

By default the only information returned is the tweet ID and the text. Often, we will want information about authors, too. To get information about the author, you need to add the `user_fields` parameter with the fields you want as well as the `expansions = 'author_id'` parameter. 

To get more information about the tweet, you need the `tweet_fields` parameter. The options are shown at https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all

You will also likely want to build a somewhat advanced query. Instructions are at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For this query, I get English-language tweets that are not retweets.
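As a minimal sketch of how such a query string can be assembled, the hypothetical helper below (`build_query` is not part of Tweepy or the Twitter API) joins search terms with the `-is:retweet` and `lang:` operators used in this document. It only builds a string, so it can be tried without an API token.

```python
def build_query(terms, exclude_retweets=True, lang=None):
    """Join search terms and common operators into one query string."""
    parts = list(terms)
    if exclude_retweets:
        parts.append('-is:retweet')   # drop retweets
    if lang:
        parts.append(f'lang:{lang}')  # restrict to one language
    return ' '.join(parts)


query = build_query(['COVID', 'hoax'], lang='en')
print(query)  # COVID hoax -is:retweet lang:en
```

Building the string in one place makes it easier to reuse the same filters across several searches.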


In [127]:
hoax_tweets = []
for response in tweepy.Paginator(client.search_all_tweets, 
                                 query = 'COVID hoax -is:retweet lang:en',
                                 user_fields = ['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields = ['created_at', 'geo', 'public_metrics', 'text'],
                                 media_fields = 'type',
                                 place_fields = ['place_type', 'name', 'country'],
                                 expansions = ['author_id', 'attachments.media_keys', 'geo.place_id'],
                                 start_time = '2021-01-20T00:00:00Z',
                                 end_time = '2022-09-21T00:00:00Z',
                                 max_results = 500, limit = 2):
    time.sleep(1)
    hoax_tweets.append(response)

In [128]:
len(hoax_tweets)

2
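The `start_time` and `end_time` values in the search above are UTC timestamps written as strings. If your dates start out as Python `datetime` objects, a small helper can produce the same format. This is only a sketch: `to_api_timestamp` is a hypothetical name, and it assumes that datetimes without a timezone are already in UTC.

```python
from datetime import datetime, timezone


def to_api_timestamp(dt):
    """Format a datetime as a UTC timestamp like 2021-01-20T00:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already in UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


start_time = to_api_timestamp(datetime(2021, 1, 20))
end_time = to_api_timestamp(datetime(2022, 9, 21))
print(start_time, end_time)  # 2021-01-20T00:00:00Z 2022-09-21T00:00:00Z
```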

Note that the loop above follows the best practice of saving the raw response that is returned. If this were a real project, I would write out all of the raw responses into a file. For long-running queries (e.g., if you need to get hundreds of thousands of tweets), you will often want to build in some error handling and a way to resume data collection. For example, you might write all of the results to a file and then, when restarting, open the file, retrieve the last tweet, and use the ID of that tweet to tell the script where to resume.
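Here is one possible sketch of that resume idea, using plain dictionaries and a JSON Lines file (one tweet per line). The function names and file layout are assumptions for illustration, not something Tweepy provides; with Tweepy you would pass in each tweet's raw dictionary (its `.data` attribute).

```python
import json
import os


def append_tweets(path, tweets):
    """Append one JSON object per line, so earlier results survive a crash."""
    with open(path, 'a', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + '\n')


def last_tweet_id(path):
    """Return the ID of the last tweet in the file, or None if there is none."""
    if not os.path.exists(path):
        return None
    last_line = None
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                last_line = line
    return json.loads(last_line)['id'] if last_line else None
```

Because results appear to come back newest first, the last line in the file should be the oldest tweet collected so far. Passing that ID as `until_id` to `search_all_tweets` would then continue further back in time, but check the API reference for how `until_id` interacts with `start_time` and `end_time` before relying on it.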

The other problem is that the object that is returned is a bit confusing: it is nested, with the tweet data in `.data` and the extra user, media, and location data in `.includes`.

In [129]:
hoax_tweets[0].data[0]

<Tweet id=1572374312358907906 text=RT@MagaRisingJohn  NO, because it's an even bigger hoax then Covid??? https://t.co/JQBv9PpEFS>

Let's see what's in `includes`

In [130]:
hoax_tweets[0].includes.keys()

dict_keys(['users', 'places', 'media'])

In [131]:
hoax_tweets[0].includes['media'][0].data

{'media_key': '3_1572359846674006017', 'type': 'photo'}

Note that the items in `data` and `includes` are objects. The data that we asked for in `user_fields` and `tweet_fields` above are attributes of the objects. For example, here's the user's description:

In [132]:
hoax_tweets[0].includes['users'][0].description

'I have experienced life in America for 69 years.  Believe in God, Country and a free way of life.'

We will often want to reorganize these into a flat file, which means connecting a tweet to the includes data. I show an example of how to do that here:

In [133]:
result = []
user_dict = {}
places_dict = {}
media_dict = {}
# Loop through each response object
for response in hoax_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username, 
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                             }
    
    # Use .get() in case a response includes no places or media
    for location in response.includes.get('places', []):
        places_dict[location.id] = {'country': location.country,
                                    'name': location.full_name,
                                    'place_type': location.place_type
                                   }
        
    for media in response.includes.get('media', []):
        media_dict[media.media_key] = {'type': media.type}
    
    for tweet in response.data:
        tweet_dict = {}
        has_video = False
        has_photo = False
        # For each tweet, find the extra information
        try:
            author_info = user_dict[tweet.author_id]
            tweet_dict.update({'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location']})
            
        except KeyError:
            pass
        
        try:
            place_info = places_dict[tweet.geo['place_id']]
            tweet_dict.update({'tweet_location': place_info['name'],
                             'tweet_country': place_info['country'],
                             'tweet_location_type': place_info['place_type']})
        except (TypeError, KeyError):
            pass
            

        try:
            for key in tweet.attachments['media_keys']:
                media_type = media_dict[key]['type']
                if media_type == 'photo':
                    has_photo = True
                if media_type == 'video':
                    has_video = True
        except (TypeError, KeyError):
            pass
        
        # Put all of the information we want to keep in a single dictionary for each tweet
        tweet_dict.update({'author_id': tweet.author_id, 
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count'],
                       'has_photo': has_photo,
                       'has_video': has_video
                      })
        result.append(tweet_dict)
# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)

In [134]:
df

Unnamed: 0,username,author_followers,author_tweets,author_description,author_location,author_id,text,created_at,retweets,replies,likes,quote_count,has_photo,has_video,tweet_location,tweet_country,tweet_location_type
0,JuliaJett8,255,19004,I have experienced life in America for 69 year...,,1240743593561616384,"RT@MagaRisingJohn NO, because it's an even bi...",2022-09-20 23:57:04+00:00,0,0,0,0,False,False,,,
1,andycpp6,16,229,"Mom of 1 enjoys crafts with the kids, scrapboo...","Toronto, canada",1605642564,@StevenJS_ @CityNewsTO i lost a family member...,2022-09-20 23:51:05+00:00,0,16,66,0,False,False,,,
2,TheVictoryTour,2628,70951,#PromiseKeepers🇺🇸 #YesWeCan~Victory requires V...,United States,105917268,Floridians should also file a class action sui...,2022-09-20 23:50:44+00:00,0,0,0,0,False,False,,,
3,ReadDavidCase,4585,27555,"Exec Producer #EXPLANT doc, #howsyourheadhun B...",L.A.,115034110,@RepAndyBiggsAZ And once again a Republican ci...,2022-09-20 23:46:14+00:00,0,0,0,0,False,False,,,
4,joesegal,10452,241543,Expand circles of kindness and compassion\nins...,"LA, USA",16548484,@jayrosen_nyu @dnbornstein I think there has t...,2022-09-20 23:37:30+00:00,0,0,0,0,False,False,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
940,KonaBean5,1247,30235,Sick of GOP BS & greed. I loathe TFG and his c...,"Virginia, USA",838115678200741888,@Socott2030 Mask mandates also stopped people ...,2022-09-16 17:09:30+00:00,0,2,0,0,False,False,,,
941,TexasFree4,229,37116,"We love our family, friends & the USA. We're a...",,1518714002951573504,@POTUS You didn't have any problem not letting...,2022-09-16 17:06:27+00:00,0,0,0,0,False,False,,,
942,d_aids,594,3596,Enough about me. Join millions who have flippe...,"Hollywood, California",4710680126,@NickAdamsinUSA The country hasn't been this d...,2022-09-16 17:03:57+00:00,0,0,0,0,False,False,,,
943,OMGno2trump,115670,43489,I'm an educated guy who cares about the world ...,"I live in a small house in Charlotte, NC area",800111181058838528,@Jim_Jordan Because your COVID hoax killed ove...,2022-09-16 16:58:09+00:00,3,0,8,0,False,False,,,


## `requests`-based version

If you want to do things without tweepy, here is some boilerplate code that should work. As you can see, it's much more complicated. Be grateful for the tweepy developers!! :)

In [None]:
import requests
import json
import twitter_authentication as config
import time

# Save your bearer token in a file called twitter_authentication.py in this directory
# Should look like this:
# bearer_token = 'YOUR_BEARER_TOKEN_HERE'

bearer_token = config.bearer_token
query = '(#COVID) OR (#COVID-19)'
out_file = 'raw_tweets.txt'

search_url = "https://api.twitter.com/2/tweets/search/all"

# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': query,
                'start_time': '2010-01-01T12:00:00Z',
                'tweet.fields': 'author_id,public_metrics',
                'user.fields': 'username',
                'expansions': 'author_id',
                'max_results': 500
               }


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


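def connect_to_endpoint(url, headers, params, next_token = None):
    # Copy the params so the caller's dictionary is not modified
    params = dict(params)
    if next_token:
        params['next_token'] = next_token
    # Use the url argument rather than the global search_url
    response = requests.get(url, headers=headers, params=params)
    # Pause between requests to stay under the rate limit
    time.sleep(3.1)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()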


def get_tweets(num_tweets, output_fh):
    next_token = None
    tweets_stored = 0
    while tweets_stored < num_tweets:
        headers = create_headers(bearer_token)
        json_response = connect_to_endpoint(search_url, headers, query_params, next_token)
        if json_response['meta']['result_count'] == 0:
            break
        author_dict = {x['id']: x['username'] for x in json_response['includes']['users']}
        for tweet in json_response['data']:
            try:
                tweet['username'] = author_dict[tweet['author_id']]
            except KeyError:
                print(f"No data for {tweet['author_id']}")
            output_fh.write(json.dumps(tweet) + '\n')
            tweets_stored += 1
        try:
            next_token = json_response['meta']['next_token']
        except KeyError:
            break
    return None



def main():
    with open(out_file, 'w') as f:
        get_tweets(500, f)



main()

In [None]:
tweets = []
with open(out_file, 'r') as f:
    for row in f:
        tweet = json.loads(row)
        tweets.append(tweet)

In [None]:
tweets[0]
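The pagination logic in `get_tweets` can be easier to follow when it is separated from the network code. The sketch below shows the same `next_token` pattern against a stand-in for the API: `fetch_all`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names invented for this illustration, and the fake pages only mimic the `data` and `meta` keys used above.

```python
def fetch_all(fetch_page, max_tweets):
    """Collect tweets page by page until the limit is hit or no next_token is returned."""
    tweets = []
    next_token = None
    while len(tweets) < max_tweets:
        page = fetch_page(next_token)
        tweets.extend(page.get('data', []))
        next_token = page.get('meta', {}).get('next_token')
        if next_token is None:
            break
    return tweets[:max_tweets]


# Hypothetical stand-in for the API: two pages, keyed by the token that requests them
FAKE_PAGES = {
    None: {'data': [{'id': '3'}, {'id': '2'}], 'meta': {'next_token': 'abc'}},
    'abc': {'data': [{'id': '1'}], 'meta': {}},
}


def fake_fetch(next_token):
    return FAKE_PAGES[next_token]


print([t['id'] for t in fetch_all(fake_fetch, max_tweets=10)])  # ['3', '2', '1']
```

Unlike `get_tweets`, this version trims the result to exactly `max_tweets`. Passing the fetch function in as an argument also means the loop can be tested without calling the real endpoint.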