*Hello again!* 👋

This notebook is the <u>second (and last)</u> part of a **tutorial** on how to **get started with Twitter API v2 using Python** 🤓! Read our Medium blog post [here](https://medium.com/data-analytics-at-nesta).

In this notebook, we make use of the **recent search** endpoint to collect Twitter data on heat pumps and gas boilers from the last 7 days.

**More on the use case**

The [sustainable future mission](https://www.nesta.org.uk/sustainable-future/) at [Nesta](https://www.nesta.org.uk/) is focused on projects to help decarbonise UK homes, with special interest in greener heating systems such as heat pumps. For those who are not familiar with the concept, a heat pump is a low-carbon heating system that captures heat from outside and moves it into your home.

What is the sentiment on heat pumps *versus* gas boilers? Have people's opinions of heat pumps changed over time? Which types of users mention heat pumps on Twitter? These are all questions we can answer once we start analysing Twitter data on these topics.

But first... Let us have a go at collecting tweets mentioning heat pumps or gas boilers in the past 7 days!

### Importing packages and loading credentials
We start by importing the necessary packages to run the code.

In [None]:
import requests
import json
import time
import random
import os
import pandas as pd

We load our *bearer_token*, which we previously defined as an environment variable. This way you do not have to expose your credentials in your code.

In [None]:
bearer_token = os.environ.get("BEARER_TOKEN")
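
Note that `os.environ.get` returns `None` when the variable is not set, which would only surface later as an authentication error. A minimal sketch of a stricter loader, assuming you prefer to fail early (the helper name `load_bearer_token` is hypothetical and not part of the original notebook):

```python
import os


def load_bearer_token(env_var: str = "BEARER_TOKEN") -> str:
    """Return the token stored in `env_var`, failing early if it is unset or empty."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            "Environment variable {} is not set; "
            "export it before running the notebook.".format(env_var)
        )
    return token
```

It could replace the line above with `bearer_token = load_bearer_token()`.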

### Preparing our API request
We will use the recent search endpoint to collect our first set of tweets. To do that we need to define the endpoint URL, the rules clarifying the data we want to collect and other query parameters such as fields to include and maximum number of results.

In [None]:
endpoint_url = "https://api.twitter.com/2/tweets/search/recent"

We define the following two rules:
- tweets matching one of the expressions "heat pump"/"heat pumps", written in English, which are not retweets;
- tweets matching one of the expressions "gas boiler"/"gas boilers", written in English, which are not retweets.

In [None]:
rules = [
    {"value": '("heat pump" OR "heat pumps") -is:retweet lang:en', "tag": "heat_pump"},
    {"value": '("gas boiler" OR "gas boilers") -is:retweet lang:en', "tag": "gas_boiler"},
]
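
If you want to add more rules of the same shape, the query strings can be generated rather than typed by hand. A small sketch, assuming every rule is a set of exact phrases joined with `OR`, excluding retweets and restricted to one language (`build_rule` is a hypothetical helper, not part of the original notebook):

```python
def build_rule(phrases: list, tag: str, lang: str = "en") -> dict:
    """Build a rule dictionary in the same shape as the entries of `rules`."""
    # Wrap each phrase in double quotes so it is matched as an exact expression
    quoted = " OR ".join('"{}"'.format(phrase) for phrase in phrases)
    value = "({}) -is:retweet lang:{}".format(quoted, lang)
    return {"value": value, "tag": tag}


rule = build_rule(["heat pump", "heat pumps"], "heat_pump")
print(rule["value"])
```

For these inputs the helper reproduces the first rule defined above.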

We create a dictionary with query parameters, where we pass the following fields:
- **tweet.fields**: fields in the tweet object for which we want to collect information, in this example: the tweet unique identifier, the tweet text, the identifier of the user posting the tweet and the date/time the tweet was created;
- **user.fields**: fields in the user object for which we want to collect information, in this example: the user unique identifier, name, username, date/time the user created their account, description, user defined location and whether the user is verified or not;
- **expansions**: the expansion query parameter, here requesting information about the author of each tweet. We need to add this in order to receive user data in our response object;
- **max_results**: the maximum number of tweets to be retrieved per request to the API, in this case 100 (which is also the maximum allowed).

Unlike our previous example, here we do not define the query rules straight away.

In [None]:
query_parameters = {
    "tweet.fields": "id,text,author_id,created_at",
    "user.fields": "id,name,username,created_at,description,location,verified",
    "expansions": "author_id",
    "max_results": 100,
}
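
To get a feel for what these parameters turn into, the sketch below builds the request URL by hand with the standard library, once a query has been added. It is only an illustration: `requests` performs its own encoding when given `params`, and the exact escaping may differ slightly from what `urlencode` produces.

```python
from urllib.parse import urlencode

endpoint_url = "https://api.twitter.com/2/tweets/search/recent"
params = {
    "query": '("heat pump" OR "heat pumps") -is:retweet lang:en',
    "tweet.fields": "id,text,author_id,created_at",
    "expansions": "author_id",
    "max_results": 100,
}

# Each key/value pair becomes key=value, with special characters percent-encoded
full_url = endpoint_url + "?" + urlencode(params)
print(full_url)
```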

### Authentication
Authentication is done by bearer token.

In [None]:
def request_headers(bearer_token: str) -> dict:
    """
    Set up the request headers. 
    Returns a dictionary summarising the bearer token authentication details.
    """
    return {"Authorization": "Bearer {}".format(bearer_token)}

In [None]:
headers = request_headers(bearer_token)

### Connecting to endpoint and taking a look at the data
We connect to the endpoint and retrieve our first page of data to see what changed in comparison to the previous notebook.

In [None]:
def connect_to_endpoint(endpoint_url: str, headers: dict, parameters: dict) -> dict:
    """
    Connects to the endpoint and requests data.
    Returns the parsed JSON response (a dictionary) if a 200 status code is yielded.
    Raises an exception on a 4xx status code (a problem with the request)
    and sleeps before retrying on any other non-200 status code
    (a temporary problem accessing the endpoint).
    """
    response = requests.request(
        "GET", url=endpoint_url, headers=headers, params=parameters
    )
    if response.status_code != 200:
        if response.status_code >= 400 and response.status_code < 500:
            raise Exception(
                "Cannot get data, the program will stop!\nHTTP {}: {}".format(
                    response.status_code, response.text
                )
            )
        
        sleep_seconds = random.randint(5, 60)
        print(
            "Cannot get data, your program will sleep for {} seconds...\nHTTP {}: {}".format(
                sleep_seconds, response.status_code, response.text
            )
        )
        time.sleep(sleep_seconds)
        return connect_to_endpoint(endpoint_url, headers, parameters)
    return response.json()
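
One thing to be aware of: HTTP 429 (too many requests) falls in the 4xx range, so the function above stops the program when the rate limit is hit instead of waiting. A possible refinement, sketched under the assumption that the response carries an `x-rate-limit-reset` header holding a Unix timestamp (`seconds_until_reset` is a hypothetical helper, not part of the original code):

```python
import time


def seconds_until_reset(response_headers: dict, now: float = None) -> float:
    """
    Hypothetical helper: how long to wait before retrying after an HTTP 429.
    Assumes an `x-rate-limit-reset` header with a Unix timestamp and
    falls back to 60 seconds if the header is absent.
    """
    now = time.time() if now is None else now
    # Note: a plain dict lookup is case-sensitive, unlike requests' header object
    reset_at = response_headers.get("x-rate-limit-reset")
    if reset_at is None:
        return 60.0
    # Never return a negative wait if the reset time has already passed
    return max(0.0, float(reset_at) - now)
```

Inside `connect_to_endpoint`, one could check for `response.status_code == 429` before the general 4xx branch, sleep for `seconds_until_reset(response.headers)` seconds and then retry.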

Let us retrieve the first page of tweets for our first rule:

In [None]:
query_parameters["query"] = rules[0]["value"]
json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)

Now the json_response dictionary contains 3 keys: *data*, *includes* and *meta*. The only difference from the previous example is the *includes* field.

In [None]:
json_response.keys()

json_response["includes"] is also a dictionary and it contains one key, "users", because we are now also collecting user information. If other information such as places/location information was also being collected, then we would have another key in our json_response["includes"] dictionary.

In [None]:
json_response["includes"].keys()

This is what each user dictionary looks like:

In [None]:
json_response["includes"]["users"][0]
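
Tweets and users arrive in separate parts of the response, linked by the tweet's `author_id` and the user's `id`. The sketch below shows one way to attach the username to each tweet; it uses a made-up response with illustrative values only, so that it runs without calling the API.

```python
# Made-up response in the shape described above (all values are illustrative)
mock_response = {
    "data": [
        {"id": "1", "author_id": "10", "text": "Our heat pump was installed today"},
        {"id": "2", "author_id": "11", "text": "Thinking about replacing the gas boiler"},
    ],
    "includes": {
        "users": [
            {"id": "10", "username": "user_a"},
            {"id": "11", "username": "user_b"},
        ]
    },
}

# Index users by their id so each tweet can look up its author directly
users_by_id = {user["id"]: user for user in mock_response["includes"]["users"]}

tweets_with_author = [
    {**tweet, "username": users_by_id[tweet["author_id"]]["username"]}
    for tweet in mock_response["data"]
]
print(tweets_with_author[0]["username"])
```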

### Collecting tweets from the past 7 days

We define a function to process Twitter data and then start the data collection process.

In [None]:
def process_twitter_data(
    json_response: dict,
    query_tag: str,
    tweets_data: pd.DataFrame,
    users_data: pd.DataFrame,
) -> (pd.DataFrame, pd.DataFrame):
    """
    Adds new tweet/user information to the table of
    tweets/users and saves the dataframes as pickle files,
    if data is available.
    
    Returns the tweets and users updated dataframes.
    """
    if "data" in json_response:
        new_tweets = pd.DataFrame(json_response["data"])
        tweets_data = pd.concat([tweets_data, new_tweets], ignore_index=True)
        tweets_data.to_pickle("tweets_" + query_tag + ".pkl")

        # Use .get() so a response without an "includes" key does not raise a KeyError
        if "users" in json_response.get("includes", {}):
            new_users = pd.DataFrame(json_response["includes"]["users"])
            users_data = pd.concat([users_data, new_users], ignore_index=True)
            # The same user can appear on several pages, so keep one row per id
            users_data = users_data.drop_duplicates("id")
            users_data.to_pickle("users_" + query_tag + ".pkl")

    return tweets_data, users_data

Now that we know what the data looks like, let's start our data collection process!

**The data collection process:**
- We define empty dataframes where we will store information about tweets and users;
- The for loop allows you to go through all your rules;
- We update the query field of the query parameters according to the rule in question;
- We connect to the endpoint as in the previous example and process the data, using the process_twitter_data() function;
- Then the program sleeps for 5 seconds so as not to exceed the rate limit. For this specific endpoint and the Essential access level, the rate limit is 180 requests per 15 minutes per user, which translates into 1 request every 5 seconds (900 seconds / 180 requests), so we need to wait at least 5 seconds before making another request;
- If json_response['meta'] has a next_token field (the pagination token), then we have not reached the final page of tweets, so we add it as a query parameter and collect more tweets;
- We repeat the process until json_response['meta'] no longer contains a next_token field.
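
The pagination steps above can also be sketched as a small generator that is independent of how a page is fetched. This is an illustration under stated assumptions, not the notebook's own code: `iterate_pages` and `fetch_page` are hypothetical names, and the fake pages stand in for real API responses so the example runs offline.

```python
import time
from typing import Callable, Iterator


def iterate_pages(
    fetch_page: Callable[[dict], dict], parameters: dict, pause_seconds: float = 5.0
) -> Iterator[dict]:
    """Yield every page for one query, following next_token until it disappears."""
    params = dict(parameters)  # copy, so the caller's dictionary is not modified
    while True:
        page = fetch_page(params)
        yield page
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
        params["next_token"] = next_token
        time.sleep(pause_seconds)  # 900 s / 180 requests = 5 s between requests


# Fake two-page result, keyed by the token that requests each page
fake_pages = {
    None: {"data": [{"id": "1"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "2"}], "meta": {}},
}
pages = list(
    iterate_pages(lambda p: fake_pages[p.get("next_token")], {"query": "x"}, pause_seconds=0)
)
print(len(pages))
```

With the real API, `fetch_page` could be `lambda p: connect_to_endpoint(endpoint_url, headers, p)`.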

In [None]:
for rule in rules:
    query_parameters["query"] = rule["value"]
    query_tag = rule["tag"]
    # Drop any pagination token left over from the previous rule,
    # so the first request for this rule starts at the first page
    query_parameters.pop("next_token", None)

    # Start from empty dataframes so that each pickle file
    # only holds the data collected for its own rule
    tweets_data = pd.DataFrame()
    users_data = pd.DataFrame()

    json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)
    tweets_data, users_data = process_twitter_data(
        json_response, query_tag, tweets_data, users_data
    )

    time.sleep(5)

    while "next_token" in json_response["meta"]:
        query_parameters["next_token"] = json_response["meta"]["next_token"]

        json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)
        tweets_data, users_data = process_twitter_data(
            json_response, query_tag, tweets_data, users_data
        )

        time.sleep(5)

**We have reached the end of this tutorial on how to collect Twitter data from the past 7 days** 💪🤓

This code was inspired by the official Twitter sample code in this [GitHub repo](https://github.com/twitterdev/Twitter-API-v2-sample-code).