# Accessing the Twitter API with Python and Tweepy

## Welcome

This tutorial will provide a walkthrough of using Python and the Tweepy package to fetch Twitter data via the Twitter API.

While this tutorial does not require very advanced knowledge of Python, it does assume familiarity with the basics such as creating and printing variables. It will also use the Pandas package to help with prettier output of data, but this is only for readability and you can safely follow the tutorial even without knowledge of Pandas. If you need a refresher on Python basics, we invite you to check the CCSS Python Bootcamp notebooks, which can be found at https://github.com/ccss-rs/python-bootcamp.

## Technical requirements

To run this tutorial, you will need two things:

1. The ability to run Jupyter notebooks. This can either be done through the Binder link found on the readme (and if you're viewing this document inside the Binder web app right now, you're already good to go!), or by [installing Jupyter on your own computer](https://jupyter.org/install) and opening this file there.
2. An approved Twitter developer account, which will let you create API keys and a bearer token which are necessary to use the API (see [this tutorial](https://cran.r-project.org/web/packages/academictwitteR/vignettes/academictwitteR-auth.html) if you have not already created a bearer token and need help doing so).

## Package installation and authentication

In data science, we often find ourselves repeating the same common tasks, and rewriting the code for them every time would be tiring and inconvenient. For example, when using the vanilla Twitter API, what could be a few lines of code can grow to dozens of lines that involve importing many libraries, handling authentication, selecting endpoints, and managing less readable output (see for yourself: https://github.com/twitterdev/Twitter-API-v2-sample-code). Imagine having to do all of that every time we wanted to gather some tweets!

This is where *packages* come in. Packages provide pre-written functions for common tasks.

We will be using the Tweepy package to access the Twitter API in Python. Additionally, we will be using the Pandas package to do some basic formatting of the search results. The following code cell will install Tweepy and Pandas via pip.

In [None]:
!pip install tweepy pandas

Now that the package is installed, we can *import* it, allowing us to use it in our code:

In [None]:
import tweepy

The first step in accessing the Twitter API is *authentication*. This is the same kind of process as logging in on a website: it lets the API know who you are, so that it can check whether you have permission to access the data. In Tweepy, authentication is handled by a `tweepy.Client` object. To authenticate, all you need to do is give `tweepy.Client` your *bearer token*, a unique ID tied to your Twitter developer account (see the [official Twitter documentation](https://developer.twitter.com/en/docs/authentication/oauth-2-0/bearer-tokens) for more details). Paste your token between the quotes in the cell below:

In [None]:
client = tweepy.Client(
    bearer_token=""
)

After running the cell above, the client variable now remembers our authentication info, and we can use it to talk to the Twitter API.
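If you plan to share your notebook, you may prefer not to paste the token directly into a cell. The following is a minimal sketch of one common alternative: reading the token from an environment variable. The variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are hypothetical names chosen for this illustration, not part of Tweepy.

```python
import os


def load_bearer_token(env=os.environ, name="TWITTER_BEARER_TOKEN"):
    """Fetch the token from the environment, failing loudly if it is missing."""
    # `name` is a hypothetical variable name; use whatever you set on your system.
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return token


# In the notebook you could then write (not executed here):
# client = tweepy.Client(bearer_token=load_bearer_token())
```

This keeps the secret out of the notebook file itself, so the notebook can be committed or shared without exposing your credentials.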

## Collecting Tweets

The most common reason to use the Twitter API for social science research is searching for tweets. In tweepy, this is done using the `search_all_tweets` method. Here is a list of all the options you have for `search_all_tweets`:

```
Client.search_all_tweets(
    query, 
    end_time, 
    expansions, 
    max_results, 
    media_fields,
    next_token, 
    place_fields, 
    poll_fields, 
    since_id, 
    start_time, 
    tweet_fields, 
    until_id, 
    user_fields
)
```

As you can see, we have the ability to specify our search via a query, set a date beginning or end, the number of tweets we want, and more! Let's try an example:

In [None]:
tweets = client.search_all_tweets(query="Halloween", # keyword search (in this case, we want the text "Halloween")
                                  start_time="2021-10-24T00:00:00Z", # start and end dates we want to search over
                                  end_time="2021-10-31T00:00:00Z",
                                  max_results=50) # how many results do you want?

What we have just done is ask the Twitter API for up to 50 tweets mentioning "Halloween" posted between October 24 and October 31, 2021. We then saved the results to a variable called `tweets`. But what does this variable actually contain? Let's try printing it out:

In [None]:
print(tweets)

#Output
data=
Tweet id=1454598993942650883 text='@nftartcrypto NFT Halloween Sell https://t.co/Bfr11UEkQB'

Tweet id=1454598993506340868 text='RT @intro95s: halloween last year was hilarious cause bts just put on some fun glasses with their achievements on the wall and ppl lost the…'

This is not exactly very human-readable! But if we look carefully there are a few useful things we may notice. The most important thing is that inside the Response we got from the API, there is a list of tweets in a variable called `data`. In each tweet, we can see that there is an ID and the text of the tweet. This seems like the kind of data that would be best represented in a tabular format, if we want to explore it in a human-readable way. We can accomplish this using the Pandas package. Pandas provides tabular data formatting to Python in the form of R-style "Data Frames" (more here: https://pandas.pydata.org/).


In [None]:
import pandas # import the Pandas package so we can use its functions
tweets_df = pandas.DataFrame(tweets.data) # put the data into a tabular format and save it as tweets_df

In [None]:
display(tweets_df) # more readable, right?

#Output
edit_history_tweet_ids	id	text

0	[1454598993942650883]	1454598993942650883	@nftartcrypto NFT Halloween Sell https://t.co/...

1	[1454598993665794053]	1454598993665794053	RT @Rkugakisuru: #ロゼッタ #チコ #らくがき #イラスト\n#Roset...

2	[1454598993648893954]	1454598993648893954	RT @twst_jp: 【TVCM情報3/4】\n期間限定イベント「スケアリー・モンスター...

3	[1454598993544159240]	1454598993544159240	Alguém consegue adivinhar minha fantasia de ha...

4	[1454598993506340868]	1454598993506340868	RT @intro95s: halloween last year was hilariou...

Better, right?
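If you would rather work with plain Python objects than a data frame, the same `data` list can be turned into a list of dictionaries. The sketch below assumes that `data` is `None` when a search matches nothing, which is how Tweepy responses appear to behave; `tweets_to_rows` is a hypothetical helper, and the stand-in objects only mimic the shape of a real response so the example runs without an API call.

```python
from types import SimpleNamespace


def tweets_to_rows(response):
    """Return one dict per tweet, or an empty list when nothing matched."""
    # Assumption: `response.data` is None (not an empty list) when there are no results.
    tweets = response.data or []
    return [{"id": tweet.id, "text": tweet.text} for tweet in tweets]


# Stand-in objects that mimic a response, so the sketch runs offline.
fake_response = SimpleNamespace(data=[SimpleNamespace(id=1, text="boo")])
empty_response = SimpleNamespace(data=None)

print(tweets_to_rows(fake_response))
print(tweets_to_rows(empty_response))
```

With a real search you would call `tweets_to_rows(tweets)` on the response saved earlier.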

Oftentimes we may want more information than just the ID and the text. *Metadata*, the other descriptive properties of a tweet, can often be useful for research as well. We can specify which properties we want by using the `tweet_fields` parameter of the `search_all_tweets` method. A full list of available properties is given on the following documentation page: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet.

As an example, let's expand our last search to give us information about who wrote each tweet, when each tweet was posted, and what conversation each tweet belongs to, in addition to the ID and text:

In [None]:
tweets = client.search_all_tweets(query="Halloween", 
                                  start_time="2021-10-24T00:00:00Z", 
                                  end_time="2021-10-31T00:00:00Z",
                                  tweet_fields=["id", "text", "author_id", "conversation_id", "created_at"], # specify all desired properties in a list (the API returns ID and text by default, so listing them is optional)
                                  max_results=50) 

In [None]:
display(pandas.DataFrame(tweets.data)) # just like last time, use Pandas to format the output

#Output
Columns: author_id	conversation_id	created_at	edit_history_tweet_ids	id	text

Data: 

0	1448233614949421056	1454363540245069827	2021-10-30 23:59:59+00:00	[1454598993942650883]	1454598993942650883	@nftartcrypto NFT Halloween Sell https://t.co/...

1	1272660954648653826	1454598993665794053	2021-10-30 23:59:59+00:00	[1454598993665794053]	1454598993665794053	RT @Rkugakisuru: #ロゼッタ #チコ #らくがき #イラスト\n#Roset...

2	1256812805384069122	1454598993648893954	2021-10-30 23:59:59+00:00	[1454598993648893954]	1454598993648893954	RT @twst_jp: 【TVCM情報3/4】\n期間限定イベント「スケアリー・モンスター...

3	919687810273218561	1454598993544159240	2021-10-30 23:59:59+00:00	[1454598993544159240]	1454598993544159240	Alguém consegue adivinhar minha fantasia de ha...

4	878665056799641600	1454598993506340868	2021-10-30 23:59:59+00:00	[1454598993506340868]	1454598993506340868	RT @intro95s: halloween last year was hilariou...
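A single request returns at most one page of results. The `next_token` parameter in the option list shown earlier is what lets you ask for the following page. Below is a rough sketch of a manual paging loop, assuming the response's `meta` dictionary carries a `next_token` entry whenever more results exist. `collect_pages` and `fake_search` are hypothetical names, and the one-second pause is only a cautious guess at a polite request rate.

```python
import time
from types import SimpleNamespace


def collect_pages(search, max_pages=3, pause=1.0, **params):
    """Call `search` repeatedly, following next_token for up to `max_pages` pages."""
    tweets = []
    token = None
    for _ in range(max_pages):
        if token is not None:
            params["next_token"] = token  # ask for the page after the previous one
        response = search(**params)
        tweets.extend(response.data or [])
        # Assumption: `meta` holds "next_token" only while more pages remain.
        token = (response.meta or {}).get("next_token")
        if token is None:
            break
        time.sleep(pause)  # wait between requests to stay under rate limits
    return tweets


def fake_search(next_token=None, **_):
    """Offline stand-in that serves two small pages."""
    pages = {None: (["a", "b"], "t1"), "t1": (["c"], None)}
    data, token = pages[next_token]
    meta = {"next_token": token} if token else {}
    return SimpleNamespace(data=data, meta=meta)


print(collect_pages(fake_search, pause=0, query="Halloween"))
```

With a real client, the call would look something like `collect_pages(client.search_all_tweets, query="Halloween", max_results=100)`. Tweepy also ships a `tweepy.Paginator` class that wraps this pattern, which you may find more convenient.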

## Count Tweets

What if you don't want to spend your tweet cap collecting tweets and hoping the sample turns out to be what you wanted? Or what if you want an estimate of how many tweets a particular query would return?

This is where tweet counts come in.

Tweet counts don't count toward your tweet cap, but they can reveal really important information.

In [None]:
count = client.get_all_tweets_count(
    query="#vote", # let's try a hashtag analysis
    start_time="2020-10-15T00:00:00Z",
    end_time="2020-11-15T00:00:00Z",
    granularity="day" # tell the API to count number of tweets per day (can also do hour or minute).
)

In [None]:
display(pandas.DataFrame(count.data)) # just like last time, use Pandas to format the output

#Output
	end	start	tweet_count
0	2020-10-16T00:00:00.000Z	2020-10-15T00:00:00.000Z	106924
1	2020-10-17T00:00:00.000Z	2020-10-16T00:00:00.000Z	106282
2	2020-10-18T00:00:00.000Z	2020-10-17T00:00:00.000Z	98745
3	2020-10-19T00:00:00.000Z	2020-10-18T00:00:00.000Z	107730
4	2020-10-20T00:00:00.000Z	2020-10-19T00:00:00.000Z	129902
5	2020-10-21T00:00:00.000Z	2020-10-20T00:00:00.000Z	134939

These results show the number of tweets by the level of granularity you selected. From this information we can now narrow our search to specific days or use this to create a visualization of tweets for a particular event.

From these results, we can see how something as simple as a tweet count can already reveal interesting phenomena. In this case, the number of tweets per day mentioning the hashtag \#vote seems to rise starting around the end of October 2020, peaking between November 3rd and 4th and then dropping sharply after that. Why do you think this is?
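To find a peak like this without scanning the table by eye, you can summarize the count rows directly. The sketch below assumes each row is a dictionary with `start` and `tweet_count` keys, as in the output above; `summarize_counts` is a hypothetical helper, and the sample rows are copied from the first six days shown.

```python
def summarize_counts(buckets):
    """Return (total tweets, start of the busiest bucket, count in that bucket)."""
    total = sum(bucket["tweet_count"] for bucket in buckets)
    peak = max(buckets, key=lambda bucket: bucket["tweet_count"])
    return total, peak["start"], peak["tweet_count"]


# The first six daily rows from the output above.
sample = [
    {"start": "2020-10-15T00:00:00.000Z", "tweet_count": 106924},
    {"start": "2020-10-16T00:00:00.000Z", "tweet_count": 106282},
    {"start": "2020-10-17T00:00:00.000Z", "tweet_count": 98745},
    {"start": "2020-10-18T00:00:00.000Z", "tweet_count": 107730},
    {"start": "2020-10-19T00:00:00.000Z", "tweet_count": 129902},
    {"start": "2020-10-20T00:00:00.000Z", "tweet_count": 134939},
]

print(summarize_counts(sample))
```

With a real response you would pass `count.data` instead of `sample`.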

## User Tweets

Sometimes we may be interested in finding tweets by a specific user. We have to be careful here, however! The Twitter API identifies users by a numerical ID rather than by username. Thankfully, Tweepy gives us tools to convert between the two:

In [None]:
user_info = client.get_user(username="pete_enns") # look up information for the user @pete_enns
print(user_info) # note the "id" variable!

#Output
Response(data=User 
id=3229291352 
name=Peter K. Enns 
username=pete_enns, 
includes={}, errors=[], meta={})

In [None]:
profile_tweets = client.get_users_tweets(
    id=3229291352, # @pete_enns' ID that we retrieved above
    start_time="2020-10-15T00:00:00Z",
    end_time="2020-10-22T00:00:00Z",
    max_results=100
)
display(pandas.DataFrame(profile_tweets.data))

#Output
edit_history_tweet_ids	id	text

0	[1318960355289288705]	1318960355289288705	RT @wesmediaproject: Since Jan 2019, Phoenix, ...

1	[1318937460437647362]	1318937460437647362	Thanks, @soccerquant! https://t.co/Usz12CZWkJ

2	[1318937331383062528]	1318937331383062528	Results of latest @MonmouthPoll in Iowa align ...

3	[1318928730488791051]	1318928730488791051	Also, thrilled to learn that @ava DuVernay is ...

4	[1318927868362186752]	1318927868362186752	So excited for the @CornellCCSS Annual Lecture.
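The two steps above (look up the ID, then fetch the tweets) can be chained so that you never have to copy the ID by hand. This is a sketch only: `tweets_for_username` is a hypothetical helper, it assumes that `get_user` returns a response whose `data` is `None` when the username does not exist, and `FakeClient` is a stand-in so the example runs without credentials.

```python
from types import SimpleNamespace


def tweets_for_username(client, username, **params):
    """Resolve a username to its numerical ID, then fetch that user's tweets."""
    user = client.get_user(username=username).data
    if user is None:  # assumption: no data means the lookup failed
        raise ValueError(f"No user found for @{username}")
    return client.get_users_tweets(id=user.id, **params)


class FakeClient:
    """Offline stand-in exposing the two client methods used above."""

    def get_user(self, username):
        return SimpleNamespace(data=SimpleNamespace(id=3229291352, username=username))

    def get_users_tweets(self, id, **params):
        return SimpleNamespace(data=[f"tweet by {id}"])


print(tweets_for_username(FakeClient(), "pete_enns", max_results=100).data)
```

In the notebook you would pass the real `client` object in place of `FakeClient()`.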

## Conversation Threads

Here we can see how using one query can factor into other searches. You could use a tweet or user search to identify a specific conversation that you want to dig into. The key to this is the `query` parameter in `search_all_tweets`. Previously, we have only provided text queries. But we can also query by other fields, as we will demonstrate here with `conversation_id`:

Feel free to swap the conversation ID with something more interesting!

In [None]:
thread_tweets = client.search_all_tweets(
    # Replace with ID of your choice to get replies (this ID is from the Halloween original set)
    query="conversation_id:1454363540245069827",
    start_time = "2021-10-24T00:00:00Z",
    end_time = "2021-10-31T00:00:00Z",
    max_results=100
)
display(pandas.DataFrame(thread_tweets.data))

#Output
edit_history_tweet_ids	id	text

0	[1454598993942650883]	1454598993942650883	@nftartcrypto NFT Halloween Sell https://t.co/...

1	[1454596160249806858]	1454596160249806858	@nftartcrypto https://t.co/4p3vtWcXAF

2	[1454595540113575939]	1454595540113575939	@nftartcrypto https://t.co/8mRdCRtFfD

3	[1454592631875461124]	1454592631875461124	@nftartcrypto https://t.co/VjpwIppmDH

4	[1454590300521910281]	1454590300521910281	@nftartcrypto https://t.co/sAaoRVvuaU
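If an earlier search requested `conversation_id` in `tweet_fields`, you can build these thread queries from its results rather than typing IDs by hand. The sketch below assumes each tweet object exposes a `conversation_id` attribute; `conversation_queries` is a hypothetical helper, and the sample IDs are taken from the Halloween output earlier in this tutorial.

```python
from types import SimpleNamespace


def conversation_queries(tweets):
    """Build one `conversation_id:` query per distinct conversation."""
    ids = {getattr(tweet, "conversation_id", None) for tweet in tweets}
    ids.discard(None)  # skip tweets fetched without the conversation_id field
    return [f"conversation_id:{cid}" for cid in sorted(ids)]


# Stand-in tweets; two of them belong to the same conversation.
sample_tweets = [
    SimpleNamespace(conversation_id=1454363540245069827),
    SimpleNamespace(conversation_id=1454598993665794053),
    SimpleNamespace(conversation_id=1454363540245069827),
]

for query in conversation_queries(sample_tweets):
    print(query)
```

Each resulting string can be passed as the `query` argument of `search_all_tweets`.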

## Geo-tagged Tweets

Text queries and field-based queries are not mutually exclusive: a single query can contain both plain text to search for and field-based filters. Let's demonstrate using a common use case: location filtering.

Location is really important information. Remember when Facebook added the "checking in" feature during natural disasters? Well, we can search for Tweets that specifically have location information embedded in those tweets.

Specifying `has:geo` in a query filters the search criteria to only the tweets with this information.

In [None]:
vote_geo = client.search_all_tweets(
    # what if we want to find #vote tweets from specific cities/counties? maybe they are geotagged...
    query="#vote has:geo", 
    start_time="2020-11-03T00:00:00Z",
    end_time="2020-11-04T00:00:00Z",
    max_results=20
)
display(pandas.DataFrame(vote_geo.data))

#Output
edit_history_tweet_ids	id	text

0	[1323776966198087680]	1323776966198087680	You have until 8pm tonight here in the Bay Are...

1	[1323776948275945473]	1323776948275945473	Watching this continues to ease my anxiety tod...

2	[1323776835008811010]	1323776835008811010	😈😈\n#ElectionDay\n#BidenHarris\n#BidenHarris20...

3	[1323776829845590018]	1323776829845590018	#VOTE NORTH CAROLINA! https://t.co/m9M7WgrR7o

4	[1323776778213752832]	1323776778213752832	MLine work'n today w/COBB NAACP VOTER ELECTION...

Location information can be either a `geo.place_id` or coordinates. In the former case, one more step is needed to convert the place ID to a location (see https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode for how to do that).
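Another possible route, sketched here under several assumptions, is to ask the search itself for place details using the `expansions` and `place_fields` parameters listed earlier, for example `tweet_fields=["geo"]`, `expansions=["geo.place_id"]`, and `place_fields=["full_name"]`. The code assumes the response then carries the places under `includes["places"]` and that each tweet's `geo` value is a dictionary with a `place_id` key. `place_names` is a hypothetical helper, and the stand-in objects (including the made-up place ID `abc123`) exist only so the example runs offline.

```python
from types import SimpleNamespace


def place_names(response):
    """Pair each tweet ID with the full name of its tagged place, if any."""
    # Assumption: expanded places arrive in response.includes["places"].
    places = {p.id: p.full_name for p in response.includes.get("places", [])}
    rows = []
    for tweet in response.data or []:
        place_id = (tweet.geo or {}).get("place_id")
        rows.append({"id": tweet.id, "place": places.get(place_id)})
    return rows


# Stand-in response: one tweet with a place tag, one without.
fake_response = SimpleNamespace(
    data=[
        SimpleNamespace(id=1, geo={"place_id": "abc123"}),
        SimpleNamespace(id=2, geo=None),
    ],
    includes={"places": [SimpleNamespace(id="abc123", full_name="Ithaca, NY")]},
)

print(place_names(fake_response))
```

Tweets tagged with coordinates only, or with no matching place in the response, simply get `None` in the `place` column.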

## Advanced Parameters

So far, we've augmented our naive text searches with two advanced parameters: `conversation_id` and `has:geo`. But this doesn't even begin to scratch the surface of ways we can customize our queries! There is too much to cover in a single workshop, but we encourage you to read the official documentation to learn more about all the parameters you can use: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query. For now, we will leave you with an example of a specific query we might use in practice:

In [None]:
tweets_advanced = client.search_all_tweets(
    # an advanced query combining several parameters
    query='#vote has:geo place:"ithaca" lang:en -is:retweet -is:verified',
    start_time="2020-11-03T00:00:00Z",
    end_time="2020-11-04T00:00:00Z",
    max_results=20
)
display(pandas.DataFrame(tweets_advanced.data))

#Output
edit_history_tweet_ids	id	text

0	[1323696053221269504]	1323696053221269504	Make your Voice Heard!!! #Vote #VoteResponsibl...
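Long query strings like the one above are easy to mistype. One option is to assemble them from a list of pieces. `build_query` below is a hypothetical convenience function rather than part of Tweepy: it joins the terms with spaces and adds the `-` prefix to anything you want to exclude.

```python
def build_query(*terms, exclude=()):
    """Join search terms with spaces; prefix each excluded operator with '-'."""
    parts = [term.strip() for term in terms if term and term.strip()]
    parts += [f"-{term.strip()}" for term in exclude if term and term.strip()]
    return " ".join(parts)


query = build_query(
    "#vote", "has:geo", 'place:"ithaca"', "lang:en",
    exclude=["is:retweet", "is:verified"],
)
print(query)
```

The resulting string matches the query used in the cell above and can be passed directly as the `query` argument.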

## Final Word

Congratulations! You have overcome perhaps the most daunting hurdle in using APIs: actually using one. It is important to remember that your queries will not always be perfect or capture exactly what you want. That's okay. There are plenty of resources on data management and cleaning that can help you get to the final step, a dataset worthy of analysis.

Well done and I wish you the best with your computational social science endeavors!

## Resources
Official Twitter API V2 developer documentation: https://developer.twitter.com/en/docs/twitter-api

Twitter API V2 Sample Code: https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research 