# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 2: COLLECTING DATA FROM THE WEB</font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](https://www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>


### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [What you need to know about how the Internet works to collect data from the web](#wyntk)
2. [Scraping the Web](#scrape)
    * How to scrape text and tables from static websites with BeautifulSoup
    * An overview of working with (1) multiple pages and (2) interactive content 
3. [Collecting data via Application Programming Interfaces](#apis)
    * Understanding APIs 
    * The Guardian API 
    * The Wikipedia API
    * ? The Twitter API ? 

<hr>

# What you need to know about how the Internet works <br>to collect data from the web<a id='wyntk'></a>

Regardless of whether you are going to collect your data by web scraping or by making calls to a well-documented API, you need a basic understanding of how the web works to get the data you want. We have provided a draft chapter of *Doing Computational Social Science* on collecting data from the web that offers a high-level explanation of what happens when your computer makes a request to a remote web server. If you are unfamiliar with Internet protocols, IP addresses, the Domain Name System (DNS), `GET` requests, and so on, we suggest you review that chapter. 

# Scraping the Web <a id='scrape'></a>

> **NOTE**: Some of the explanatory text in this section on web scraping is excerpted and adapted from John McLevey (2020) <font color="crimson">*Doing Computational Social Science*</font>. London: Sage. 

## Essential HTML

When your computer receives an HTML page from a web server, it creates something called a Document Object Model (DOM) and stores it in memory. Your web browser then renders the page on your screen. To scrape the web effectively, you need to understand the DOM and some basics of how documents are structured using HTML and styled using CSS.

As someone who frequently reads and writes "documents" (news stories, blog posts, journal articles, Tweets, etc.), you are already familiar with the basics of structuring and organizing documents using headings, subheadings, and so on. As humans, we parse these organizational features of documents *visually*, using size, color, types of font, bullets, and so on. If you create a document using a WYSIWYG (what you see is what you get) program like a word processor, you apply different styles to parts of the text to indicate whether something is a title, a heading, a paragraph, a list, etc. HTML documents also have these organizational features, but use special 'markup' to tell a computer how to render them to a reader. Best practice is to use HTML to describe the structural features of a document and Cascading Style Sheets (CSS) to describe how things should appear. Content, structure, and style are kept separate. 

HTML markup consists of 'elements' (e.g. paragraphs) with opening and closing tags. You can think of these tags as containers. The tags tell your browser about the text that sits between the opening and closing tags (or "inside the container"). For example, the `paragraph` element opens a paragraph with `<p>` and closes it with `</p>`. The actual text content of the paragraph -- which is what you see and read in your browser -- lives between those tags.

The outermost element in any HTML document is `html`. Your computer knows that anything between `<html>` and `</html>` tags should be processed as HTML markup. Most of the time, the next element in an HTML page will be a `head` element. The text inside the `<head>` and `</head>` tags will not actually be rendered by your browser. Instead, it contains metadata and other text about the page itself. This is where the page title is contained, which is displayed on the tab in your browser.

Inside the HTML tags, you will also find a `body` element. Anything inside the `<body>` and `</body>` tags will be displayed in the main browser window (e.g. the text of a news story). Inside the body tags, you will typically find elements for headings (e.g. `<h1>` and `</h1>`, `<h2>` and `</h2>`, and so on), paragraphs (`<p>` and `</p>`), bold text (`<strong>` and `</strong>` or `<b>` and `</b>`), italicized text (`<i>` and `</i>` or `<em>` and `</em>`), as well as ordered and unordered lists, tables, images, links, and so on.

Sometimes elements include 'attributes,' which provide more information about the content of the text. For example, a paragraph element may specify that the text contained within its tags is American English. This information is contained inside the opening tag: `<p lang="en-us">American English sentence here...</p>`. As you will soon learn, attributes can be *extremely* useful when scraping the web.
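
To make the vocabulary of elements, tags, and attributes concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to walk through a tiny HTML document and record each opening tag (with its attributes) and each piece of text. The HTML snippet and the `TagLogger` class are made up for illustration; they are not part of the course materials, and this is not how we will scrape pages later on.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document (illustration only).
HTML_DOC = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>A heading</h1>
    <p lang="en-us">American English sentence here...</p>
  </body>
</html>
"""


class TagLogger(HTMLParser):
    """Record every opening tag (with attributes) and every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, e.g. [('lang', 'en-us')]
        self.events.append(('open', tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore the whitespace between tags
            self.events.append(('text', text))


parser = TagLogger()
parser.feed(HTML_DOC)
for event in parser.events:
    print(event)
```

Notice that the `lang` attribute travels with the opening `<p>` tag, while the sentence itself shows up separately as text "inside the container."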

Before moving on, it is important to understand one final type of HTML element you will frequently encounter when developing web scrapers: the division tag `div`. This is simply a generic container that splits a website into smaller sections. Developers often use them to apply a particular style (e.g. switch to a `monospaced` font to display code) to some chunk of text in the HTML document using CSS. Splitting webpages into these smaller pieces using `div` tags makes websites easier for developers to maintain and modify. They also make it easier for us web scrapers to drill down and grab the information we need. You will see this in action in the examples to follow.

When scraping the web you will also encounter CSS, which as I previously mentioned is used to *style* websites. To properly understand how CSS works, remember that the vast majority of modern websites are designed to separate content (e.g. actual words that mean things to humans) from structure and style. HTML markup tells your browser what some piece of text is (e.g. a heading, a list item, a row in a table, a paragraph) and CSS tells your browser what it should look like when rendered in your browser (e.g. what font to use for subheadings, how big to make the text, what colour to make the text, and so on). If there is no CSS, then your browser will use an extremely minimal default style to render the text in your browser. In most cases, developing a good web scraper will require a deeper understanding of HTML than CSS, so we will set aside a further discussion of CSS for now. 

A full inventory of HTML and CSS elements is, of course, beyond the scope of this course. The good news is that you do not need exhaustive knowledge of either to write a good web scraper. To put it another way, you need to know much less HTML and CSS to effectively scrape a website than you do to develop that same website. You need to have a basic understanding of the key concepts and you need to know what the most common tags mean, but *more than anything else* you need to be willing to spend time investigating the source code for websites you want to scrape, attempt to solve problems creatively, and work iteratively. We will turn to those tasks next.  

## Inspecting Source

Normally, the process of requesting and rendering a page is handled by our browser (remember to consult the draft chapter if you are uncertain about this process), but as you probably realize, this is not the only way to request HTML documents from a web server. We can also connect to a web server from a Python script using a package like `requests`, and we can load the HTML provided by the web server into our computer's memory. Once we have this HTML in memory (rather than rendered in a browser), we can move onto the next step, which is to start parsing the HTML and extracting the information we want. 

When we load an HTML document in Python, we are looking at the raw markup, not the rendered version we see when we load that file in a browser. If we print the HTML to screen (as we will later), we will see all of the markup the browser hides when it renders a page. If we are lucky, the information we want will be consistently stored in elements that are easy to grab and which do not contain a lot of irrelevant information. In order to get that information, we need to parse the HTML file. We can do this using a Python package called `BeautifulSoup`. To use `BeautifulSoup` effectively, we have to study the HTML code for the website we want to scrape.

The best way to study the source code of a website for the purposes of developing a web scraper is to use the developer tools built into modern web browsers. Let's see what this looks like. 

![boris.png](img/boris.png)

Once the story is loaded, we can right-click and select "Inspect Element" to open a pane of developer tools. (You can also open this pane by selecting "Toggle Tools" from the "Web Developer" section of the "Tools" menu in the toolbar.) We can use these tools to study our target webpage interactively. We can view the rendered content and the raw source code of the webpage simultaneously.

One especially useful strategy is to highlight some information of interest on the webpage and then right-click and select "Inspect Source" (even if the developer tools pane is already open). This will jump to that specific highlighted information in the HTML code, making it much easier to quickly find what tags the information you need is stored in. From here, we can strategize how best to retrieve the data we want. 

This is an *iterative* process. As we develop our web scraper, we progressively narrow down to the information we need, clean it by stripping out unwanted information (e.g. white spaces, new line characters), and then write it to some sort of dataset for later use. 
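
The cleaning step can be sketched with nothing more than a regular expression. The helper name `clean_text` is hypothetical, and the messy whitespace in `raw_title` is invented to imitate what scraped text often looks like.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return re.sub(r'\s+', ' ', raw).strip()


# Invented example: a headline surrounded by stray newlines and spaces.
raw_title = "\n  Charming but dishonest and duplicitous:\n Europe's verdict on Boris Johnson \n"
print(clean_text(raw_title))
```

Replacing each run of whitespace with a single space (rather than deleting newline characters outright) avoids accidentally gluing two words together when a line break was the only thing separating them.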

Scraping text from a static page is *generally* a pretty straightforward task, since you can be more or less confident that most of the content you want will be stored in headers (`<h1>...</h1>`, `<h2>...</h2>`) and paragraph tags (`<p>...</p>`). Other information can, of course, be extracted with a bit of detective work.

Let's work through an example of the article on Boris Johnson mentioned previously. Open the article up in another tab in your browser: [https://www.theguardian.com/politics/2019/aug/02/europes-view-on-boris-johnson](https://www.theguardian.com/politics/2019/aug/02/europes-view-on-boris-johnson). 

We will (1) request the HTML document from *The Guardian's* web server using the `requests` package, (2) feed that HTML data into `BeautifulSoup` to construct a `soup` object that we can parse, and then (3) extract the article title and text and store them in a couple of lists.

In the code block below, we import the two packages we will use, get the HTML, construct the soup object using an `lxml` parser, and then -- *just because we can* -- print the raw HTML DOM to our screen. 

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd 

In [2]:
url = 'https://www.theguardian.com/politics/2019/aug/02/europes-view-on-boris-johnson'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="js-off is-not-modern id--signed-out" data-page-path="/politics/2019/aug/02/europes-view-on-boris-johnson" id="js-context" lang="en">
 <head>
  <!--
     __        __                      _     _      _
     \ \      / /__    __ _ _ __ ___  | |__ (_)_ __(_)_ __   __ _
      \ \ /\ / / _ \  / _` | '__/ _ \ | '_ \| | '__| | '_ \ / _` |
       \ V  V /  __/ | (_| | | |  __/ | | | | | |  | | | | | (_| |
        \_/\_/ \___|  \__,_|_|  \___| |_| |_|_|_|  |_|_| |_|\__, |
                                                            |___/
    Ever thought about joining us?
    https://workforus.theguardian.com/careers/digital-development/
     --->
  <title>
   Charming but dishonest and duplicitous: Europe's verdict on Boris Johnson | Politics | The Guardian
  </title>
  <meta charset="utf-8"/>
  <meta content="As the Brexit deadline looms, Europe remains wary of the poker player behind the clown mask" name="description"/>
  <meta content="IE=Edge" http-equiv="X-UA-

Now we need to get the title. I know that the article title is stored inside an `<h1>` element with a class attribute of `content__headline` because I highlighted the text in Firefox and pulled the title up in the developer tools pane by right-clicking and selecting `inspect element`. I can use the `findAll` method from `BeautifulSoup` to retrieve that part of the text, which `BeautifulSoup` returns in the form of a list. We will clean it up a bit and store only the string. 

In [3]:
title = soup.findAll('h1', {'class': 'content__headline'})[0].text.replace('\n', '')
print(title)

Charming but dishonest and duplicitous: Europe's verdict on Boris Johnson


Getting the body text is even easier, as it is all contained inside `<p>` elements. We can construct a list of paragraphs with the `findAll` method. 

In [4]:
lop = soup.findAll('p')

We now have a list of paragraphs that contain the text of the full article. 

In [6]:
lop[8]

<p>Another lifelong anglophile, André Gattolin, the vice-president of the French senate’s European affairs committee, said the new prime minister had carefully cultivated a “caricatural image – the hair, the gags, the flags, the zip-wire, the provocations”.</p>

In [5]:
lop[8].text

'Another lifelong anglophile, André Gattolin, the vice-president of the French senate’s European affairs committee, said the new prime minister had carefully cultivated a “caricatural image – the hair, the gags, the flags, the zip-wire, the provocations”.'

One nice thing about our rather minimal scraper is that we can use it to grab text from other stories posted by *The Guardian* as well. In other words, simple web scrapers can be used in a broader variety of contexts, because they are not overly tailored to the content of any one specific page. The main takeaway here is that you should keep your web scrapers as simple as possible while still collecting the data you need. Avoid adding complexity unless it is necessary.

Let's wrap these steps up in a simple function, grab some text from a few more news stories, and then construct a `dataframe` with article titles in one column and article text in another.

In [7]:
def scrape_guardian_stories(url):
    """Return [title, article text] for a single Guardian story URL."""
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    title = soup.findAll('h1', {'class': 'content__headline'})[0].text.replace('\n', '')
    # join the paragraphs into one long string. 
    paras = " ".join(para.text.replace('\n', '') for para in soup.findAll('p'))
    return [title, paras]

In [10]:
stories = ['https://www.theguardian.com/politics/2019/aug/02/boris-johnson-warned-he-could-lose-control-of-parliament', 'https://www.theguardian.com/politics/2019/aug/02/europes-view-on-boris-johnson', 'https://www.theguardian.com/us-news/live/2019/aug/02/trump-news-today-live-pelosi-democrats-impeachment-latest-updates', 'https://www.theguardian.com/commentisfree/2019/aug/02/no-deal-brexit-hurt-uk-economy-mark-carney']
scraped = [scrape_guardian_stories(s) for s in stories]

In [12]:
df_scraped = pd.DataFrame(scraped, columns = ['Title', 'Article Text'])
df_scraped

| | Title | Article Text |
|---|---|---|
| 0 | Tory rebels threaten Boris Johnson after major... | Prime minister faces losing control of parliam... |
| 1 | Charming but dishonest and duplicitous: Europe... | As the Brexit deadline looms, Europe remains w... |
| 2 | John Ratcliffe: Trump's pick for intelligence ... | Julia Carrie Wong in San Francisco (now) and J... |
| 3 | Yes, a no-deal Brexit will hurt the economy. B... | Leaving without a deal is not an event, it’s t... |


As you can see, it is possible to collect a large amount of data with very little code. All you need is a list of URLs! But this is generally not the approach that we take when developing web scrapers. Instead, we study and manipulate URL strings to learn rules that we can exploit to collect more information from a webpage, or we "crawl" the website by finding and following hyperlinks from one page to another. Web crawling is beyond the scope of this week-long course, but if you want to learn more, we recommend consulting a specialized source like Ryan Mitchell's (2018) excellent book [*Web Scraping with Python*](https://www.amazon.ca/Web-Scraping-Python-Collecting-Modern/dp/1491985577/ref=sr_1_1?keywords=ryan+mitchell+web+scraping+python&qid=1582848814&s=books&sr=1-1).  
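
As a small taste of what "studying URL strings" can look like, the sketch below uses the standard library's `urllib.parse` to break two of the story URLs from above into their parts. The helper name `split_story_url` and the labels `section`, `middle`, and `slug` are our own shorthand, not official Guardian terminology.

```python
from urllib.parse import urlparse


def split_story_url(url):
    """Split a story URL's path into its first part, last part, and everything between."""
    segments = urlparse(url).path.strip('/').split('/')
    return {
        'section': segments[0],    # e.g. 'politics'
        'middle': segments[1:-1],  # usually the date, sometimes with extra parts
        'slug': segments[-1],      # the human-readable end of the URL
    }


stories = [
    'https://www.theguardian.com/politics/2019/aug/02/europes-view-on-boris-johnson',
    'https://www.theguardian.com/us-news/live/2019/aug/02/trump-news-today-live-pelosi-democrats-impeachment-latest-updates',
]

for story in stories:
    print(split_story_url(story))
```

Even with two URLs we can see a pattern (section, then year, month, and day, then a slug) and an exception to it (the extra `live` segment), which is exactly the kind of thing you need to notice before relying on a URL "rule."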

* **TODO**: studying URLs to automate collection of data from many pages. **Jillian**, I will try to get to this but if I don't, I think it will be fine to leave it out. Alternatively, if you think you can produce a quick explanation and an example, go for it! 

## Javascript and The Interactive Web 

When a web server sends requested content -- e.g. `nytimes.com` -- to your computer, your browser renders the content in a way that you are deeply familiar with as someone who browses and navigates the web daily. As you know, that content is usually in the form of HTML code. Increasingly, the files sent by the web server to your machine also include a lot of JavaScript. This JavaScript is executed *by your computer, not by the remote server.* It changes what is displayed, which poses additional challenges for web scrapers. A discussion of scraping websites with a lot of JavaScript is beyond the scope of this course, but we mention it here because you are likely to encounter at least some JavaScript in any serious web scrape. If you do, our recommendation is that you use [Selenium with Python](https://selenium-python.readthedocs.io/).

# Collecting data via Application Programming Interfaces <a id='apis'></a>

1. [Understanding APIs](#understanding_apis)
2. [The Guardian API](#guardian)   
    a. [Overview](#g_overview)      
    b. [API Keys](#g_key)      
    c. [Making Requests](#g_requests)      
    d. [Filtering](#filtering)    
    e. [Extra Information](#g_info)   
    f. [Requesting More Results](#g_more)  
3. [The Wikipedia API](#wikipedia)   
4. The Twitter API
5. [Key Points](#key_points)


<a id='understanding_apis'></a>
## Understanding APIs

Application Programming Interfaces (APIs) offer an alternative way to access data from online sources. They provide an explicit _interface_ to the data behind the website, defining how you can request data and in what format you will receive it. 

### Key Components of API Requests & Responses
**Endpoints** are the specific web locations where a request for a particular resource can be sent. Usually they have descriptive names like Content, Tweet, User, etc. We communicate with APIs by sending requests to these endpoints, usually in the form of a URL. 

These URLs usually contain optional **queries**, **parameters**, and **filters** that let us specify exactly what we want the API to return. 
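
To see how parameters end up inside a URL, here is a minimal sketch using the standard library's `urllib.parse.urlencode`. It borrows the Guardian content endpoint that is introduced later in this section; the key `YOUR_KEY_HERE` is a placeholder, and the query and date values are just examples.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://content.guardianapis.com/search'

# Placeholder key and example values (not real credentials).
params = {
    'api-key': 'YOUR_KEY_HERE',
    'q': 'bees AND plants',
    'to-date': '2019-12-31',
}

# Everything after the '?' is the encoded set of parameters.
url = API_ENDPOINT + '?' + urlencode(params)
print(url)
```

When we pass a dictionary to `requests.get(..., params=...)`, as we do below, `requests` performs this encoding for us, so we rarely need to assemble the URL by hand.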

Once a request has been made to the API, it will return a **response**. Every response has a response code, which indicates whether the request was successful (200) or encountered an error (400, 401, 500, etc.). When you encounter a problem, it's a good idea to check whether you received a successful response or one of the [many error responses](https://documentation.commvault.com/commvault/v11/article?p=45599.htm). 

As long as a request was successful, it will return a 200 OK response along with all the requested data. We will delve into what this data looks like below. 
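
The sketch below shows one way to turn a numeric response code into something readable, using the standard library's `http.HTTPStatus`. The function name `describe_status` and the three outcome labels are our own choices for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP response code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        outcome = 'success'
    elif 400 <= code < 500:
        outcome = 'client error'  # something was wrong with our request
    elif 500 <= code < 600:
        outcome = 'server error'  # something went wrong on the API's side
    else:
        outcome = 'other'
    return f'{code} {phrase}: {outcome}'


for code in (200, 400, 401, 500):
    print(describe_status(code))
```

With the `requests` package, the code of a response is available as `response.status_code`, and calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses.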

### APIs vs Web Scraping 

Benefits: 
* Structured data (for the most part). 
* Controlled by an organization or company (Guardian, Twitter, etc) 
* Documented (usually)
* Maintained (usually)
* Rules for access are explicitly stated

Drawbacks: 
* Limited to the data made explicitly available
* Relies on the organization to make updates
* Rate limits & other restrictions apply and are usually based on business reasons rather than technical limitations


<a id='guardian'></a>
## The Guardian API

<a id='g_overview'></a>
### Overview
The Guardian's API allows us to query and download data related to their published articles. 

The Guardian API has five **endpoints**: 
* Content (`https://content.guardianapis.com/search`) &mdash; returns content. Developer keys return text only. Allows querying and filtering to reduce what is returned.  
* Tags &mdash; returns all API tags (> 50,000). These tags can be used in other queries. 
* Sections &mdash; logical grouping of content
* Editions &mdash; the content for each of the three regional main pages
* Single Item &mdash; will return all data related to a specific item (content, tag, or section) in the API. 

Today, we will focus on the content endpoint. This will allow us to retrieve the body text and metadata for articles published in The Guardian.

Often, the easiest way to interface with an API is through a client. In Python, these clients are just packages that provide functions to simplify the process of accessing the API. 

Alternatively, we can access APIs directly using the [`requests`](https://requests.readthedocs.io/en/master/) library. By accessing the API directly, we maintain freedom in how we use the API, rather than be restricted to a client. This is the option we will choose for interfacing with The Guardian API. 

<a id='g_key'></a>
### API Key
Hopefully you were all successful in receiving a Guardian API Key. If not, please let one of us know! 

This API key is what gives you access to the Guardian API. It's kind of like a username and password, all wrapped into one. It is how the API monitors who is accessing their site and makes sure they are abiding by the proper terms of service.

We all registered for a developer key. With this key we receive:  
* Up to 12 calls per second
* Up to 5,000 calls per day
* Access to article text (no image, audio, or video)
* Access to a subset of Guardian content (1.9 million pieces)

If we had registered (and paid) for a commercial key, we would have fewer limitations in what we can access from the API. 

As I mentioned earlier, you can think of your API token as your username and password for accessing The Guardian API. Like any other credentials, we want to make sure this is kept secure. Most importantly, **never share API tokens in public locations**, including in git repositories or emails. 

Making an API token public allows others to access the API as if they were you. This puts you at risk if they violate the terms of service you agreed to when you requested an API token. 

For example, if someone were to get ahold of your API token, they could use it to launch a [denial of service attack](https://en.wikipedia.org/wiki/Denial-of-service_attack) on The Guardian's API. In this case, your token may be revoked and you'd be unable to request a new API token in the future without further violating the terms of service. 

To mitigate against this problem I would recommend one of two options: 
* Storing API tokens as environment variables
* Creating a `cred.py` to store credentials such as API tokens

Personally, I use a `cred.py` containing any of the credentials I need to access APIs, databases, etc. I keep this file stored on my computer in a single location which can be accessed by any Python script on my machine (usually a directory on Python's module search path, for example one listed in the `PYTHONPATH` environment variable). This way, the API token is outside of a script I might share and the file is outside of a git repo I might make public one day. 

If for some reason you need to store the `cred.py` file in the same directory as your Python file and this is within your git repo, make sure to add `cred.py` to the `.gitignore` file.
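
The first option, environment variables, can be sketched as follows. The variable name `GUARDIAN_API_KEY` and the helper `load_api_key` are hypothetical; you would set the variable yourself in your shell or system settings before starting Python.

```python
import os


def load_api_key(var_name='GUARDIAN_API_KEY'):
    """Read an API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable first.')
    return key


try:
    api_key = load_api_key()
    print('API key loaded.')  # never print the key itself
except RuntimeError as err:
    print(err)
```

Either way, the point is the same: the token lives outside the notebook or script you might share.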

Let's go ahead and create this `cred.py` file now.   

Back on the Jupyter Home Page, click on the New button on the upper right side & select the text option (see the screenshot below). 

<img src=img/new_file.png></img>

A new file will open. Rename it `cred.py` and add the following line, replacing `<YOUR_TOKEN>` with your own API token. Keep the quotation marks, since the token needs to be stored as a string.  

```python3
guardian_key = '<YOUR_TOKEN>'
```

Save & exit the file.   

Run the cell below. If it runs without throwing any errors, the API token has been successfully saved.

In [None]:
import cred

api_key = cred.guardian_key

<a id='g_requests'></a>
### Making API Requests
Now that we have our API key stored in a safer location, we can begin making requests to The Guardian API. 

To start, we will use the `requests` package to make a generic request to the content endpoint. 

In [None]:
# Importing libraries only needs to be done once
import requests
import pprint as pp

In [None]:
API_ENDPOINT = 'https://content.guardianapis.com/search'

MY_PARAMS = {'api-key': api_key}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

There is quite a bit of information there...

Let's break it down a bit. What are the individual fields contained within the response? 

In [None]:
response_dict.keys()

Each of these is described in the [content endpoint's documentation](https://open-platform.theguardian.com/documentation/search). We can examine each field individually by indexing our response dictionary. 

Let's start by seeing what order was used to sort the results. 

In [None]:
response_dict['orderBy']

In the cell below, find the total number of items that were returned in this call. Refer to the [documentation](https://open-platform.theguardian.com/documentation/search) if you aren't sure which field you are interested in.   

In [None]:
# Your Answer Here

The interesting part of the response is really what is contained within the `results` field. The results contain the individual items provided by the endpoint. In our case, this will be content (mainly news articles). 

In the cell below, examine what is contained within the results field and answer (1) what data structure is being used to store the results (dictionaries, lists, etc.), (2) what data is stored for each result, and (3) how many results were returned. 

In [None]:
# Your Answer Here

<a id='filtering'></a>
### Filtering
Often we are interested in receiving very specific data from an API, rather than receiving all the data and then sifting through it later on.

Luckily, most APIs have built-in ways to make these specifications. In The Guardian's API these are called queries or filters.

**Queries** allow you to request content containing free text. This works much like a search engine. You can use double quotes to query exact phrase matches, and the AND, OR, and NOT operators are supported.   

**Filters** allow you to request content based on specific [metadata](https://dataedo.com/kb/data-glossary/what-is-metadata). Once again, you can check the [documentation](https://open-platform.theguardian.com/documentation/search) to see what metadata is available for filtering. 
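
To illustrate how the operators and quoted phrases described above combine, here is a small sketch that assembles a query string from lists of terms. The helper `build_query` is hypothetical (it is not part of any Guardian client), and the search terms are made up; the resulting string is what you would pass as the `q` parameter.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Combine search terms into one query string using AND, OR, and NOT."""

    def quote(term):
        # Multi-word terms are wrapped in double quotes for exact phrase matching.
        return f'"{term}"' if ' ' in term else term

    parts = [quote(term) for term in all_of]
    if any_of:
        parts.append('(' + ' OR '.join(quote(term) for term in any_of) + ')')
    parts.extend(f'NOT {quote(term)}' for term in none_of)
    return ' AND '.join(parts)


print(build_query(all_of=['plants'], any_of=['bee', 'bees']))
print(build_query(all_of=['climate change'], none_of=['wasps']))
```

Building the string programmatically is optional, but it helps keep parentheses and quotation marks balanced once queries get longer.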

We will start off simple. You might have noticed earlier that our response from the API contained the most recent content available. What if we are actually only interested in retrieving content published prior to Jan 01, 2020?

In [None]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

We can add more parameters to further specify the types of results we want to receive.

In [None]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'from-date': '2015-01-01',
             'lang': 'en', 
             'production-office': 'uk',
             'q': '(bee OR bees) AND plants'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

In the cell below, write an API request to fetch content using a query and at least 2 filters. 

In [None]:
# YOUR ANSWER HERE

<a id='g_info'></a>
### Extra Information
You may have noticed in the previous API requests and responses that while we were receiving article URLs, sections, and publication dates, we were missing some pretty important data. Things like headlines, bylines, and body text are not included in the default API response. This additional information is available, but needs to be specified using the `show-fields` parameter.

In [None]:
API_ENDPOINT = 'https://content.guardianapis.com/search'

MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'from-date': '2015-01-01',
             'lang': 'en', 
             'production-office': 'uk',
             'q': '(bee OR bees) AND plants',
             'show-fields': 'wordcount,body,byline'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']

response_dict

In the cell below, write code to access and print the body text of an article from the `response_dict`. 

In [None]:
# Your Answer Here

<a id='g_more'></a>
### Requesting More Results

In the API response, there are three fields that relate to the number of results obtained from an API request &mdash; `total`, `pages`, and `pageSize`. 

In [None]:
response_dict['total']

In [None]:
response_dict['pages']

In [None]:
response_dict['pageSize']

When looking at them all together, it becomes clearer how they relate. 

* `total` is the number of items available to be returned.  
* `pages` is the number of pages available for return, where each page is a small subset of the total number of items.   
* `pageSize` is how many items are in the current page being returned.   

If it's hard to imagine the differences between these values, think about how Google search results work: a search may match thousands of items in total, but they are spread across many pages, each showing a fixed number of results.   

The key point for us to know is that in a basic API request we are likely only receiving a fraction of the total items available for return. If we want to retrieve all the data, we need to look at (1) increasing the page size and (2) automatically requesting data from the next page. 
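
The relationship between the three values is simple arithmetic: the number of pages is the total number of items divided by the page size, rounded up. The sketch below uses a made-up total of 237 items purely for illustration.

```python
import math


def pages_needed(total, page_size):
    """Number of pages required to hold `total` items at `page_size` items per page."""
    return math.ceil(total / page_size)


total = 237  # hypothetical number of matching items
print(pages_needed(total, 10))  # 24 pages at 10 items per page
print(pages_needed(total, 50))  # 5 pages at 50 items per page
```

Fewer pages means fewer API requests, which matters once we start thinking about rate limits.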

In the cell below, update `MY_PARAMS` to increase the page size from 10 to 50. Use the API [documentation](https://open-platform.theguardian.com/documentation/search) to find the right parameter. 

In [None]:
API_ENDPOINT = 'https://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'from-date': '2015-01-01',
             'lang': 'en', 
             'production-office': 'uk',
             'q': '(bee OR bees) AND plants',
             'show-fields': 'wordcount,body,byline'}
response = requests.get(API_ENDPOINT, params=MY_PARAMS)
response_dict = response.json()['response']

Run the cell below to verify you successfully increased the number of results per page to 50. 

In [None]:
if response_dict['pageSize'] < 50:
    print('The page size is still less than 50. Try again.')
elif response_dict['pageSize'] == 50: 
    print('The page size is now 50. Good job!')
elif response_dict['pageSize'] > 50: 
    print('The page size is now greater than 50. How did you do that?')

Now that each page can display 50 results, nearly 5x fewer pages are needed to contain all of the data we need!

In [None]:
response_dict['pages']

However, we still need to find a way to gather data from all the pages, instead of just the first. 

Luckily, The Guardian API has a built-in `page` parameter that allows us to specify which page we want to get results from. We can combine this type of request with a `while` loop to help automate our API requests.   

#### Rate Limits
Before we look at the code below, we should think about the potential impacts of automating API requests. 

Remember that with a developer key we are limited to 12 calls per second and 5,000 calls per day. While in this case we will be making very few requests, it's important to understand why we need to abide by these limits. 

When you sign up for an API token, you are typically required to agree to the Terms of Service. These terms are usually (but not always) summarized to make sure the most important information is readily available. This information usually includes: 
* Rate limits
* Disallowed uses
* Limitations for sharing data
* Intellectual Property considerations

While most of these are self-explanatory, it's worthwhile taking some time to go over what rate limits are and how they are controlled. 

Rate limits are the upper bound placed on how many API requests a user can make in a given amount of time. These numbers differ between websites and even user types. The idea is to limit the rate of requests and ensure the website isn't overrun with traffic. 

In general, rate limits are controlled in two ways. Some websites will have built-in systems that will detect over-use and throttle or revoke access for a token that is over-requesting. This is the system The Guardian API uses. 

Other websites rely on the honour system, asking you to abide by their guidelines. In these cases the risk of exceeding limits is higher (since there is no throttling), and you run a greater risk of being blacklisted if you exceed the API's rate limits. 
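
One way to stay under a per-second limit on our side is to enforce a minimum pause between requests. The sketch below is only an illustration under that assumption: `Throttle` is a hypothetical helper, not part of `requests` or The Guardian API, and the actual request is left as a comment.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Too soon since the last call: pause for the remaining time
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# 12 calls per second means at least 1/12 of a second between calls
throttle = Throttle(min_interval=1 / 12)

for page in range(1, 4):
    throttle.wait()
    # requests.get(API_ENDPOINT, params=MY_PARAMS) would go here
    print('Safe to request page', page)
```

A per-second throttle like this does nothing about the daily limit, which still has to be tracked separately.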

In [None]:
# Normal Setup
API_ENDPOINT = 'http://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'from-date': '2015-01-01',
             'lang': 'en', 
             'production-office': 'uk',
             'q': '(bees OR bees) AND plants',
             'show-fields': 'wordcount,body,byline',
             'page-size': 50}

# Collect All Results
all_results = []
cur_page = 1
total_pages = 1

while (cur_page <= total_pages) and (cur_page < 10):  # with a fail safe
    # Make an API request
    MY_PARAMS['page'] = cur_page
    response = requests.get(API_ENDPOINT, params=MY_PARAMS)
    response_dict = response.json()['response']

    # Update our master results list
    all_results += (response_dict['results'])
    
    # Update our loop variables
    total_pages = response_dict['pages']
    cur_page += 1

In [None]:
print("Total # of results: {}".format(len(all_results)))

In [None]:
all_results[36]

Now that we have the results, we can continue to access them and work with them, without having to make more API requests.

Whenever possible, **store the results you receive from API requests**. This allows you to access the data without making unnecessary requests to the API. 

You can store the data in either Python variables or in a file. If you are only using the data for a short period of time (e.g. real-time analysis) you can likely get away with using variables within your Python script. 

However, if you want to access the data after you've finished running your script you should save it to a file. This way the data can be used later in new analyses or to reproduce the work you've already done. 

Let's store our results in a file, so we can use them later on. 

In [None]:
import json 
FILE_PATH = 'data/guardian_api_results.json'
with open(FILE_PATH, 'w') as outfile:
    json.dump(all_results, outfile)
    

We can check that the results were written in the correct format by reading them back in. 

In [None]:
with open(FILE_PATH, 'r') as f:
    data = json.load(f)

<a id='wikipedia'></a>
## The Wikipedia API
The [English Wikipedia API](https://en.wikipedia.org/w/api.php) is one endpoint of the larger [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page). Other endpoints include the Meta-Wiki, Wikimedia Commons, and German Wikipedia APIs. 

There is plenty of documentation about how to use these APIs directly, but there is also an easy-to-use Python client we can use. The [`wikipedia`](https://wikipedia.readthedocs.io/en/latest/) Python client developed by Jonathan Goldsmith provides us with functionality for reading and parsing data from Wikipedia. 

While in the backend `wikipedia` is still using the MediaWiki API, the front-end interface (what we will work with) is much simpler than if we were to use the API directly.

### Installing `wikipedia`
Likely, up until this point all of the Python packages we've been using have come standard in the Anaconda installation you all have on your machines. 

However, `wikipedia` is not a default package in either base Python or Anaconda. So, we will need to download it for ourselves. 

Usually, Python packages can be found on [PyPI](https://pypi.org/), the official repository for Python packages. Any package found on PyPI can be installed using [`pip`](https://pypi.org/project/pip/), Python's package installer. 

Run the cell below to use `pip` to search PyPI for the `wikipedia` package. 

> Aside   
The `!` at the beginning of the cell tells Jupyter that we want that cell (and that cell only) to be executed on the command line. 

In [None]:
!pip3 search wikipedia 

Conveniently, the package we are interested in is shown right at the top. We also see that the default version of this package is 1.4.0. To install `wikipedia`, run the cell below. 

You may notice that some extra packages are being installed, or at least looked for. These packages are _requirements_ of the `wikipedia` package and need to be installed for `wikipedia` to work properly.

In [None]:
!pip3 install wikipedia 

Once you see a message to the effect of `Successfully installed wikipedia-1.4.0`, comment out the cell block above to ensure you don't accidentally try to re-install the package. 

If you get an error message, let one of us know so we can help you debug. 

Run the cell below to make sure `wikipedia` was installed successfully. If no errors show up, you are good to go!

In [None]:
import wikipedia

### Using `wikipedia`
Unlike The Guardian or Twitter APIs, Wikipedia's API doesn't require a token. Instead, everything is publicly accessible to anyone. 

Because there is no token with enforced limits attached to it, we need to be more careful to rate limit our own requests. 

#### Searching
Similar to how we search on Wikipedia's website, we can use the API to search for specific content. 

In [None]:
search_term = 'spelunking'

search_results = wikipedia.search(search_term)

search_results

If we are interested in a particular page, we can request it specifically using the `page()` function. 

In [None]:
my_page = wikipedia.page(title=search_results[0])
my_page

At first this result might seem anticlimactic. After all, there really doesn't appear to be any interesting data contained within `my_page`. However, `my_page` actually does contain a lot of information; it's just packaged into a `WikipediaPage` object (an instance of the `WikipediaPage` class). 

This object stores data such as the page's summary, links, and categories, all structured neatly within the object. Check out the [`WikipediaPage` documentation](https://wikipedia.readthedocs.io/en/latest/code.html#wikipedia.WikipediaPage) for a full list. 

In [None]:
my_page.links

In [None]:
my_page.summary

In the cell below, use a for loop to retrieve and store the summaries for each of the 10 pages in `search_results`. 

In [None]:
# Your Answer Here

#### Jumping Between Pages
Links are inherent in Wikipedia. They connect pages to one another and provide a structure for the site. It also means you can almost always get from one page to another through these links. Check out [Six Degrees of Wikipedia](https://www.sixdegreesofwikipedia.com) if you have any doubts. 

We can use these links between pages to move page to page, gathering information as we go. The cell below uses the `random` package to select a link at random and display its summary text. 
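
As a small aside, drawing a random index and then looking up the element can also be done in one step with `random.choice`. The sketch below uses a made-up list of titles standing in for `my_page.links`, so it runs without any API requests.

```python
import random

# Hypothetical link titles standing in for `my_page.links`
example_links = ['Cave', 'Caving equipment', 'Karst', 'Stalactite']

random.seed(42)  # fixed seed only so the example is repeatable

# Pick one element uniformly at random, with no index arithmetic
picked = random.choice(example_links)
print('Randomly selected link:', picked)
```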

In [None]:
import random

# Function for selecting a random linked page
def select_random_link(links):
    total_links = len(links)
    random_num = random.randrange(0, total_links)
    random_page_name = links[random_num]
    random_page = wikipedia.page(random_page_name)
    return random_page


# All links
links = my_page.links

# Select a random linked page
linked_page = select_random_link(links)

# Print Results
print('There is a link from {} --> {}\n'.format(my_page.title, 
                                              linked_page.title))

print("{}'s summary is\n {}".format(linked_page.title,
                                    linked_page.summary))

Above, we took the first step in a ["random walk"](https://en.wikipedia.org/wiki/Random_walk) through Wikipedia. In the cell below use the `select_random_link()` function from above and a loop (`while` or `for`) to perform a random walk with 5 steps.

Feel free to choose any page as a starting point. Print out the title of each page you visit on the random walk. 

In [None]:
# Your Answer Here

It is fairly easy to imagine how a random walk, left to its own devices, could carry on indefinitely through Wikipedia making API request after API request. If enough people write random walk code, or other code making many requests, it's quite possible we could overwhelm the Wikipedia API. 

When this happens, Wikipedia identifies the IP addresses making the most requests and serves them with an HTTP timeout error. Essentially, Wikipedia punishes the heavy users by returning errors and making them wait until the API is no longer overwhelmed. 

To help guard against this, we can make use of the `set_rate_limiting()` function included in the `wikipedia` Python package. 

In [None]:
wikipedia.set_rate_limiting(rate_limit=True)

Now, any requests we make to the Wikipedia API will be separated by 50 ms (default for the function). If at any point we encounter an HTTP timeout error while using rate limiting, we should adjust the limit using `set_rate_limiting()`'s `min_wait` parameter. 
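
Adjusting the wait might look like the sketch below. `min_wait` expects a `datetime.timedelta`; the 200 ms value is an arbitrary choice for illustration, and `enable_rate_limiting` is a hypothetical wrapper that takes the client module as an argument so it can be tried out without making any requests.

```python
import datetime


def enable_rate_limiting(client, milliseconds=200):
    """Turn on rate limiting for a `wikipedia`-style client with a custom minimum wait."""
    wait = datetime.timedelta(milliseconds=milliseconds)
    client.set_rate_limiting(rate_limit=True, min_wait=wait)
    return wait


# In the notebook, with the package installed, this would be called as:
#   import wikipedia
#   enable_rate_limiting(wikipedia, milliseconds=200)
```

A longer wait makes large collections slower, but it lowers the chance of hitting timeout errors.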

## <font color='crimson'> The Twitter API </font>

> **NOTE**: Some of the explanatory text in this section on the Twitter API is excerpted and adapted from John McLevey (2020) <font color="crimson">*Doing Computational Social Science*</font>. London: Sage. 

Twitter has multiple APIs that are available to developers and researchers. The `REST API` can be used to collect information about user accounts and a limited number of historical tweets. For example, you could use the `REST API` to find out information about the account belonging to sociologist Mario Small, such as his Twitter ID number, the description he uses in his bio, the number of followers he has, the number of friends he follows, and who his followers and friends are. In addition to this information, you can collect the text and tweet metadata for his most recent 3,200 tweets. 

Twitter imposes rate limits on access to the REST API, which means that it can be rather slow to collect certain kinds of information, such as information about followers and friends. More specifically, Twitter restricts users to making 15 requests per 15 minutes. The amount of information you can get with each request depends on what exactly you are asking for, which can be a bit confusing when you first start using this API. If you make more requests than permitted within a 15 minute window, Twitter will break your connection to the REST API. However, (1) `Tweepy` simplifies the work needed to stay within the rate limits and avoid disconnection, and (2) you can collect substantially more data per request if you use Twitter's ID numbers rather than screen names. 
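
To get a feel for what this limit means in practice, the arithmetic can be sketched as follows. The per-request sizes used here (5,000 ID numbers versus 200 full user profiles per request) are assumptions for illustration, so check Twitter's documentation for the endpoint you are using. `minutes_of_waiting` is a hypothetical helper, not part of `Tweepy`.

```python
import math


def minutes_of_waiting(n_items, items_per_request,
                       requests_per_window=15, window_minutes=15):
    """Minimum minutes spent waiting on rate limits to collect `n_items`."""
    requests_needed = math.ceil(n_items / items_per_request)
    windows_needed = math.ceil(requests_needed / requests_per_window)
    # The first window's requests can be sent right away;
    # every further window means waiting for the limit to reset.
    return (windows_needed - 1) * window_minutes


followers = 1_000_000  # hypothetical account size

by_id = minutes_of_waiting(followers, items_per_request=5000)      # assumed size for ID requests
by_profile = minutes_of_waiting(followers, items_per_request=200)  # assumed size for profile requests

print('Waiting time with ID numbers (minutes):', by_id)
print('Waiting time with full profiles (minutes):', by_profile)
```

Under these assumptions the difference is a few hours versus several days, which is why requesting ID numbers is attractive for large accounts.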

Twitter also offers a `Streaming API`, which enables you to download tweets in real time. Unlike the `REST API` which *pulls* historical data from Twitter, the Streaming API *pushes* real time data to us. We receive that data using a `StreamListener` class, which we will discuss shortly. This enables us to collect an enormous amount of data in a short amount of time.

The free version of the Streaming API enables you to collect up to 1% of all tweets produced within 10 milliseconds of your request. If you are looking to use this method to collect a dataset of tweets that are representative of the Twitterverse at any given moment, there are plenty of reasons to be skeptical, the main reason being that Twitter does not disclose how it selects the 1% sample. However, if you are collecting tweets within a specific set of search parameters (e.g. filtered to tweets produced by a list of user accounts, within a geographic region, or containing some specific keyword), then it is possible to collect *all* relevant tweets. This only works if the tweets produced within your search parameters are less than 1% of all tweets posted within 10 milliseconds. Given how massive Twitter is and how much data its users produce, chances are you can stay under 1% and collect all tweets that are relevant to whatever your research question is.

## Accessing the Twitter API

To access the Twitter API, you have to have a Twitter account and register an application at [https://developer.twitter.com](https://developer.twitter.com). If you managed to get your credentials before the start of this workshop, we will use them now. 

On the developer page, you will need to "create an app" and record four pieces of information: (1) Consumer Key, (2) Consumer Secret, (3) Access Token, and (4) Access Token Secret. You can find instructions on how to find these keys online. I am not going to provide step-by-step instructions here because the layout of the Twitter developer page changes from time to time. Note that Twitter will only display your `ACCESS_TOKEN` and `ACCESS_TOKEN_SECRET` when you first generate them. If you lose them, you will need to generate a new pair, which invalidates the old pair. 

![](img/twitter.jpg)

As we mentioned earlier, you should treat your API keys like **passwords**. It's best not to copy and paste them directly into your scripts where other people will see (and can compromise) them. Instead, we will store our keys in a separate file and read the information into our Python script. To do so, let's create a file called `config_twitter.py` and store our four keys in it. In this file, `API_KEY` and `API_TOKEN` hold the Consumer Key and Consumer Secret, respectively. The content of your file should look like this: 

    API_KEY = 'YOUR API KEY'  
    API_TOKEN = 'YOUR API TOKEN'  
    ACCESS_TOKEN = 'YOUR ACCESS TOKEN'  
    ACCESS_TOKEN_SECRET = 'YOUR ACCESS TOKEN SECRET'  

In [23]:
import re
import tweepy
import config_twitter

auth = tweepy.OAuthHandler(config_twitter.API_KEY, config_twitter.API_TOKEN)
auth.set_access_token(config_twitter.ACCESS_TOKEN, config_twitter.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

Note that when we use `Tweepy` to authenticate with Twitter, we use the arguments `wait_on_rate_limit=True` and `wait_on_rate_limit_notify=True`. These arguments enable `Tweepy` to do the work of staying within Twitter's rate limits, which simplifies our work considerably. For that reason, I suggest you also use them.

Once those cells have been executed, you are authenticated with Twitter and can start making requests. Let's start with some requests to the REST API to get data on user accounts.

## The `REST API`

First, we will collect some data on user accounts and historical tweets using the REST API. There is a `csv` file in the `data` subdirectory of the course folder. It contains the Twitter Screen Names for the leaders of UK political parties as of August 14, 2019. The leaders are:

* **Boris Johnson**, Leader of the Conservative Party    
* **Jeremy Corbyn**, Leader of the Labour Party
* **Nicola Sturgeon**, Leader of the Scottish National Party  
* **Jo Swinson**, Leader of the Liberal Democrats  
* **Arlene Foster**, Leader of the Democratic Unionist Party 
* **Liz Saville Roberts**, Leader of Plaid Cymru - Party of Wales 
* **Caroline Lucas**, Leader of the Green Party of England and Wales 

We will use the Screen Names from this `csv` file throughout the module. Feel free to edit the file to include a different set of Screen Names. The code in this module will work regardless of which accounts you are collecting data from. 

In [18]:
accounts = pd.read_csv('data/twitter_accounts.csv')
accounts = accounts['Screen Name'].tolist()

### Getting User Account Metadata

In [20]:
ids = [api.get_user(i) for i in accounts]

meta = [[i.name, i.screen_name, i.id, i.description, i.location, i.followers_count, i.friends_count, i.protected] for i in ids]

meta = pd.DataFrame(meta, columns = ['Person', 'Handle', 'Twitter ID Number', 'Description', 'Location', 'Number of Followers', 'Number of Friends', 'Protected'])
meta.to_csv('output/twitter_accounts_uk_leaders.csv', index = False)
meta

| | Person | Handle | Twitter ID Number | Description | Location | Number of Followers | Number of Friends | Protected |
|---|---|---|---|---|---|---|---|---|
| 0 | Boris Johnson | BorisJohnson | 3131144855 | Prime Minister of the United Kingdom and @Cons... | United Kingdom | 1548469 | 453 | False |
| 1 | Jeremy Corbyn | jeremycorbyn | 117777690 | Leader of the Labour Party. | UK | 2372767 | 2758 | False |
| 2 | Nicola Sturgeon | NicolaSturgeon | 160952087 | First Minister of Scotland, @theSNP Leader and... | Glasgow, Scotland | 1057788 | 4723 | False |
| 3 | Jo Swinson | joswinson | 14933304 | Scottish. British. European. Runner. Feminist.... | East Dunbartonshire & London | 173733 | 2899 | False |
| 4 | Arlene Foster | DUPleader | 275799277 | First Minister of Northern Ireland \| Democrati... | County Fermanagh, Northern Ireland | 77759 | 1024 | False |
| 5 | Liz Saville Roberts AS/MP 🏴󠁧󠁢󠁷󠁬󠁳󠁿 | LSRPlaid | 2350624098 | Aelod Seneddol @plaid_cymru Dwyfor Meirionnydd... | Dolgellau | 13384 | 2454 | False |
| 6 | Caroline Lucas | CarolineLucas | 80802900 | Green MP for Brighton Pavilion, former leader ... | Brighton | 451633 | 5991 | False |


### Getting Historical Tweet Metadata

Below I have written a function called `get_tweet_data` that takes a user screen name and requests the tweets from the `user_timeline` via Twitter's REST API. It collects some (not all) of the available metadata for each tweet, stores it in a dictionary, and appends that dictionary to a list called `tweets`. Once all available tweets have been collected, the function returns the list of tweet dicts. 

In [42]:
def get_tweet_data(user, user_meta=False):
    tweets = []
    
    for tw in tweepy.Cursor(api.user_timeline, screen_name=user, exclude_replies=False, count = 200, tweet_mode = 'extended').items():
        tdict = {}
        
        tdict['text'] = tw.full_text.replace('\n', '').strip()    
        tdict['tweet_id'] = tw.id
        tdict['retweet_count'] = tw.retweet_count
        tdict['fav_count'] = tw.favorite_count
        tdict['user_id'] = tw.user.id        
        tdict['user_screen_name'] = tw.user.screen_name
        tdict['time'] = tw.created_at
        tdict['hashtags'] = [hashtag['text'] for hashtag in tw.entities['hashtags']]
        tdict['user_mentions'] = [user['screen_name'] for user in tw.entities['user_mentions']]
        
        # optionally add metadata about the account that posted the tweet
        if user_meta:
            tdict['location'] = tw.user.location
            tdict['user_description'] = tw.user.description
            tdict['user_url'] = tw.user.url 
        
        # find links (regular expression from Stack Overflow);
        # the raw string keeps the backslashes in the pattern intact
        tdict['links_in_tweet'] = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tw.full_text)
        
        tdict['link_to_tweet'] = 'https://twitter.com/{}/status/{}'.format(tw.user.screen_name, tw.id)
        
        tweets.append(tdict)
    
    return tweets

Let's pick one user account to collect historical tweets from. Later, we will apply our solution to the rest of the accounts in our `csv` file. In this example, we will collect user Tweets from [Nicola Sturgeon](https://en.wikipedia.org/wiki/Nicola_Sturgeon), the leader of the [Scottish National Party](https://en.wikipedia.org/wiki/Scottish_National_Party) and the First Minister of Scotland. 

In [24]:
ns_tweets = get_tweet_data('NicolaSturgeon', user_meta = True)

In [25]:
for tweet in ns_tweets:
    print('Tweet: {}'.format(tweet['text']))
    print('Number of Retweets: {}'.format(str(tweet['retweet_count'])))
    print('Number of Favs: {}'.format(str(tweet['fav_count'])))
    print('Users Mentioned: {}'.format(list(tweet['user_mentions'])))
    print('Hashtags: {}'.format(list(tweet['hashtags'])))
    print('\n')

Tweet: RT @ScotGovFM: Tonight the First Minister @nicolasturgeon hosted the annual reception for Consular Corps at Edinburgh Castle, and thanked C…
Number of Retweets: 102
Number of Favs: 0
Users Mentioned: ['ScotGovFM', 'NicolaSturgeon']
Hashtags: []


Tweet: RT @MhairiHunter: Delighted to see that @NewGorbalsHA led application has won nearly a million pounds in @scotgov regeneration funding to r…
Number of Retweets: 20
Number of Favs: 0
Users Mentioned: ['MhairiHunter', 'NewGorbalsHA', 'scotgov']
Hashtags: []


Tweet: RT @StAlbertsG41: Our campaign to reclaim world book day for books and stories. Alisha says ‘books are so important for knowledge and langu…
Number of Retweets: 28
Number of Favs: 0
Users Mentioned: ['StAlbertsG41']
Hashtags: []


Tweet: RT @KateForbesMSP: All parties’ key public budget asks are reflected in today’s deal: 💼 Meet local government’s ask for £95m ✅👧🏻 Expand…
Number of Retweets: 161
Number of Favs: 0
Users Mentioned: ['KateForbesMSP']
Hashtags: []


Tweet: 

Tweet: And @petewishart no longer has a majority of just 21 💪 - congratulations to @theSNP longest serving MP
Number of Retweets: 420
Number of Favs: 3981
Users Mentioned: ['PeteWishart', 'theSNP']
Hashtags: []


Tweet: RT @theSNP: SNP GAIN!@DaveDooganSNP wins Angus for the SNP. #GE2019 #SNPWin https://t.co/yb9pZiByes
Number of Retweets: 394
Number of Favs: 0
Users Mentioned: ['theSNP', 'DaveDooganSNP']
Hashtags: ['GE2019', 'SNPWin']


Tweet: RT @theSNP: SNP WIN! That's 3 out of 3!@AlanBrownSNP wins Kilmarnock and Loudoun for the SNP. #GE2019 #SNPWin https://t.co/B0bUQ8yr6C
Number of Retweets: 269
Number of Favs: 0
Users Mentioned: ['theSNP', 'AlanBrownSNP']
Hashtags: ['GE2019', 'SNPWin']


Tweet: RT @theSNP: SNP WIN! That's 4 out of 4!@MartinJDocherty wins West Dunbartonshire for the SNP. #GE2019 #SNPWin https://t.co/2kaJyU2S5t
Number of Retweets: 229
Number of Favs: 0
Users Mentioned: ['theSNP', 'MartinJDocherty']
Hashtags: ['GE2019', 'SNPWin']


Tweet: RT @theSNP: SNP WIN! That's 5 

Users Mentioned: ['JaneyGodley']
Hashtags: []


Tweet: Simply outrageous and unacceptable to exclude @theSNP - the third largest party in UK. What are the other parties so scared of that they won’t agree to real debate? And why are broadcasters letting down voters, especially in Scotland? https://t.co/V3xokFN1zt
Number of Retweets: 5534
Number of Favs: 21347
Users Mentioned: ['theSNP']
Hashtags: []


Tweet: RT @PA: During the third leg of her day on the campaign trail, Nicola Sturgeon joined in with Scottish country dancing at Lochside Communit…
Number of Retweets: 54
Number of Favs: 0
Users Mentioned: ['PA']
Hashtags: []


Tweet: RT @Daily_Record: Nicola Sturgeon put her best foot forward for a spot of traditional dancing during a visit to Dumfries this morning. ht…
Number of Retweets: 31
Number of Favs: 0
Users Mentioned: ['Daily_Record']
Hashtags: []


Tweet: Pleasure to campaign today with two of our outstanding @theSNP candidates - @MargaretFerrier in Rutherglen &amp; Hamilton Wes

Users Mentioned: ['ChristinaSNP_', 'NicolaSturgeon']
Hashtags: []


Tweet: RT @AllanCasey89: It was a pleasure to welcome First Minister @NicolaSturgeon along to the new H Lane festival In Dennistoun this afternoon…
Number of Retweets: 29
Number of Favs: 0
Users Mentioned: ['AllanCasey89', 'NicolaSturgeon']
Hashtags: []


Tweet: Leopards - and it seems Liberals - don’t change their spots. According to this they’re in election talks with the Tories?! Remember this the next time you hear the Scottish Lib Dems claim that they are anti Tory. https://t.co/5RuuA9Wqku
Number of Retweets: 2187
Number of Favs: 4297
Users Mentioned: []
Hashtags: []


Tweet: RT @sca_net: A brilliant day for our #ArtinAction campaign with #FirstMinister @NicolaSturgeon #MSP visiting constituents @StudioPavilion t…
Number of Retweets: 18
Number of Favs: 0
Users Mentioned: ['sca_net', 'NicolaSturgeon', 'StudioPavilion']
Hashtags: ['ArtinAction', 'FirstMinister', 'MSP']


Tweet: Thanks for a great visit @sca_net @Stu

Users Mentioned: ['LeoVaradkar']
Hashtags: ['BIC']


Tweet: RT @ScotGovFM: Busy morning of engagements at 32nd British Irish Council with FM @NicolaSturgeon holding meetings with Welsh FM @MarkDrakef…
Number of Retweets: 78
Number of Favs: 0
Users Mentioned: ['ScotGovFM', 'NicolaSturgeon']
Hashtags: []


Tweet: RT @ScotGovFM: Ahead of today's 32nd @BICSecretariat the First Ministers of Scotland @NicolaSturgeon &amp; Wales @fmwales have made a joint cal…
Number of Retweets: 153
Number of Favs: 0
Users Mentioned: ['ScotGovFM', 'BICSecretariat', 'NicolaSturgeon', 'fmwales']
Hashtags: []


Tweet: Just arrived in Manchester for tomorrow’s British-Irish Council Summit. Very sorry to have missed @SurvivorsChoir. @BICSecretariat https://t.co/Fk9gEdFxYj
Number of Retweets: 53
Number of Favs: 246
Users Mentioned: ['SurvivorsChoir', 'BICSecretariat']
Hashtags: []


Tweet: RT @BBCNewsnight: TONIGHT: Don’t miss our interview with Scotland's First Minister, Nicola Sturgeon, as we bring you a #Newsni

Users Mentioned: ['Strath_FAI']
Hashtags: []


Tweet: RT @stuartgmcintyre: This Export Action Plan from @scotgov is a really good piece of analysis, and the starting point for a step change in…
Number of Retweets: 44
Number of Favs: 0
Users Mentioned: ['stuartgmcintyre', 'scotgov']
Hashtags: []


Tweet: RT @ScotGovFM: 'We want Scottish companies to go out &amp; compete in international market places, just as we want people from outside Scotland…
Number of Retweets: 204
Number of Favs: 0
Users Mentioned: ['ScotGovFM']
Hashtags: []


Tweet: For the pin head dancers of the world - it’s not true. https://t.co/OPAWhDQyYs
Number of Retweets: 292
Number of Favs: 1468
Users Mentioned: []
Hashtags: []


Tweet: Actually, there was an emphatic denial - which you’d have heard yourself had you been there 😀 https://t.co/tKAYznPxJU
Number of Retweets: 769
Number of Favs: 3468
Users Mentioned: []
Hashtags: []


Tweet: RT @lilja1972: Prime Minister of #Iceland @katrinjak surrounded by crime writers @va

Users Mentioned: ['skillsdevscot', 'NicolaSturgeon']
Hashtags: []


Tweet: RT @FVCollege: We were delighted to welcome First Minister, @NicolaSturgeon to our Falkirk Campus this morning #ScotAppWeek19 #NationalAppr…
Number of Retweets: 81
Number of Favs: 0
Users Mentioned: ['FVCollege', 'NicolaSturgeon']
Hashtags: ['ScotAppWeek19']


Tweet: @Louisemac Welcome back!
Number of Retweets: 0
Number of Favs: 13
Users Mentioned: ['Louisemac']
Hashtags: []


Tweet: @kgjephcott @WomensPrize Congratulations - so well deserved.
Number of Retweets: 0
Number of Favs: 7
Users Mentioned: ['kgjephcott', 'WomensPrize']
Hashtags: []


Tweet: RT @WomensPrize: And without further ado, we're thrilled to reveal the 2019 #WomensPrize longlist 🙌Congratulations to our sixteen brillian…
Number of Retweets: 1136
Number of Favs: 0
Users Mentioned: ['WomensPrize']
Hashtags: ['WomensPrize']


Tweet: RT @Team_Scotland: Double GOLD for @lauramuiruns at @Glasgow2019 Euro Indoor Champs! She unleashes that devastating f

`Pandas` can construct a `dataframe` from a list of dicts, which means we can easily store our tweet data in a `dataframe`. 

In [26]:
ns_df = pd.DataFrame(ns_tweets)
ns_df.head()

| | text | tweet_id | retweet_count | fav_count | user_id | user_screen_name | time | hashtags | user_mentions | location | user_description | user_url | links_in_tweet | link_to_tweet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RT @ScotGovFM: Tonight the First Minister @nic... | 1233133406017511424 | 102 | 0 | 160952087 | NicolaSturgeon | 2020-02-27 20:54:49 | [] | [ScotGovFM, NicolaSturgeon] | Glasgow, Scotland | First Minister of Scotland, @theSNP Leader and... | https://t.co/viEKYxG7er | [] | https://twitter.com/NicolaSturgeon/status/1233... |
| 1 | RT @MhairiHunter: Delighted to see that @NewGo... | 1233066931684548608 | 20 | 0 | 160952087 | NicolaSturgeon | 2020-02-27 16:30:40 | [] | [MhairiHunter, NewGorbalsHA, scotgov] | Glasgow, Scotland | First Minister of Scotland, @theSNP Leader and... | https://t.co/viEKYxG7er | [] | https://twitter.com/NicolaSturgeon/status/1233... |
| 2 | RT @StAlbertsG41: Our campaign to reclaim worl... | 1233041863457878019 | 28 | 0 | 160952087 | NicolaSturgeon | 2020-02-27 14:51:03 | [] | [StAlbertsG41] | Glasgow, Scotland | First Minister of Scotland, @theSNP Leader and... | https://t.co/viEKYxG7er | [] | https://twitter.com/NicolaSturgeon/status/1233... |
| 3 | RT @KateForbesMSP: All parties’ key public bud... | 1232751395524222978 | 161 | 0 | 160952087 | NicolaSturgeon | 2020-02-26 19:36:50 | [] | [KateForbesMSP] | Glasgow, Scotland | First Minister of Scotland, @theSNP Leader and... | https://t.co/viEKYxG7er | [] | https://twitter.com/NicolaSturgeon/status/1232... |
| 4 | This is v funny and worth a read. https://t.co... | 1232751334341910528 | 196 | 599 | 160952087 | NicolaSturgeon | 2020-02-26 19:36:36 | [] | [] | Glasgow, Scotland | First Minister of Scotland, @theSNP Leader and... | https://t.co/viEKYxG7er | [https://t.co/nLF7u8rA01] | https://twitter.com/NicolaSturgeon/status/1232... |


In [27]:
ns_df[['text', 'retweet_count', 'hashtags', 'user_mentions']].to_csv('output/nicola_sturgeon_tweets.csv', index = False)

## Working with Links from Tweets

Many tweets contain links to other tweets, and to content external to Twitter. There is potentially a lot of interesting and useful information that we can gather from these links, but we have to do some extra work to get it. 

First, we need to identify links in the tweets themselves. There are a number of ways to do this, including using a regular expression (which is what the `get_tweet_data` function above does). The resulting links have been shortened by Twitter (e.g. [https://t.co/RmUDsFf3em](https://t.co/RmUDsFf3em)). In order to know what content a Tweet is linking to, we need to tell Python to follow the link. This will trigger a redirect to the linked content. We can then ask Python what the *actual* link is. Finally, we can use a package like `tldextract` to parse the actual link and return its domain name (e.g. `twitter` or `nytimes`), separately from any subdomain and from the suffix (e.g. `com`). 
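
To see what the regular expression does on its own, here is a small sketch that applies the same pattern used in `get_tweet_data` to a made-up string. The sample text and both links in it are invented for illustration.

```python
import re

# Same pattern as in `get_tweet_data`, written as a raw string
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Invented text containing two links separated by ordinary words
sample_text = 'Two made-up links: https://t.co/abc123XYZ and http://example.com/page?id=7 end'

links = URL_PATTERN.findall(sample_text)
print(links)  # ['https://t.co/abc123XYZ', 'http://example.com/page?id=7']
```

Each match stops at the first whitespace character. Punctuation that directly follows a link (such as a full stop) would be included in the match, because characters like `.` are allowed inside urls.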

Let's define a couple of functions to (1) get unique links from a collection of Tweets and then (2) process the urls by following redirections and extracting the domain. 
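
The first step boils down to flattening a list of lists and removing duplicates. The sketch below shows that idea on made-up tweet dicts, using `itertools.chain.from_iterable` from the standard library as one possible way to flatten; the example data is hypothetical.

```python
from itertools import chain

# Hypothetical tweet dicts containing only the key we need here
example_tweets = [
    {'links_in_tweet': ['https://t.co/aaa', 'https://t.co/bbb']},
    {'links_in_tweet': []},
    {'links_in_tweet': ['https://t.co/aaa']},
]

# One list of links per tweet (a list of lists)
nested = [tweet['links_in_tweet'] for tweet in example_tweets]

# Flatten, drop duplicates with a set, and sort for a stable order
unique_links = sorted(set(chain.from_iterable(nested)))
print(unique_links)  # ['https://t.co/aaa', 'https://t.co/bbb']
```

Sets have no guaranteed order, so sorting is only there to make the result predictable.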

In [28]:
from urllib.request import urlopen
import tldextract

def get_unique_urls(tweet_data):
    """
    Retrieves the links from tweet data and flattens the list of lists into a list of unique urls.
    
    We will use the following `flatten` function, which was shared by 
    [Alex Martelli on StackOverflow](https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists). 
    In this function, `l` is the outer list.
    """
    flatten = lambda l: [item for sublist in l for item in sublist]
    urls = list(set(flatten([tweet['links_in_tweet'] for tweet in tweet_data])))
    return urls

In [29]:
ul = get_unique_urls(ns_tweets)

test_ul = ul[:15]
test_ul

['https://t.co/VtZIf5I0F4',
 'https://t.co/ULLvnRHPWH',
 'https://t.co/wVKiNxrgpc',
 'https://t.co/8RWte54XTp',
 'https://t.co/SJPKI23iwd',
 'https://t.co/7r0l',
 'https://t.co/NG7KivJmvr',
 'https://t.co/bJb5lGi5RI',
 'https://t.co/pMrbpWi6lL',
 'https://t.co/Rsub4FbpjC',
 'https://t.co/UdOtjH0H3S',
 'https://t.co/BqEKLukNDL',
 'https://t.co/owbwP260Hq',
 'https://t.co/jrSFSc3Umb',
 'https://t.co/SXJnFulmkf']

The next step requires identifying the *actual* link by following the shortened link, triggering a redirection, and then parsing the url string to identify the domain. We can do this in a single function, `process_urls`. One limitation of this approach is that it can take some time to run. Depending on the speed of your network connection, it can take a little under a second to process each url. If you are feeding a long list of urls into this function, you should expect to wait a little while.
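
To make the terms subdomain, domain, and suffix concrete, here is a deliberately naive sketch that splits a hostname by position using only the standard library. It is not how `tldextract` works: `tldextract` relies on the Public Suffix List, which is why it also handles multi-part suffixes such as `co.uk`. `naive_domain_parts` is a hypothetical helper for illustration only.

```python
from urllib.parse import urlsplit


def naive_domain_parts(url):
    """Split a hostname into subdomain, domain, and suffix by position only."""
    host = urlsplit(url).hostname or ''
    labels = host.split('.')
    if len(labels) < 2:
        return {'subdomain': '', 'domain': host, 'suffix': ''}
    return {
        'subdomain': '.'.join(labels[:-2]),  # everything before the last two labels
        'domain': labels[-2],                # second-to-last label
        'suffix': labels[-1],                # last label, e.g. 'com'
    }


print(naive_domain_parts('https://www.instagram.com/p/B4m03FUF07Q/'))
# {'subdomain': 'www', 'domain': 'instagram', 'suffix': 'com'}
```

For a hostname like `www.bbc.co.uk` this naive version would report `co` as the domain, which is exactly the kind of mistake a suffix-aware package avoids.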

In [39]:
def process_urls(url):
    """
    Accepts a url string that has been shortened by Twitter. 
    Gets the actual link, parses it, returns a dict with 
    the actual link, the domain, the suffix, and any subdomain.
    Of course will work with other types of shortened links as well. 
    
    This function will take a while to run on a large collection of links because each one 
    has to be opened, loaded, and parsed. Opening and loading speed will vary depending on 
    your network connection. 
    """
    ld = {}
    ld['original_short'] = url
    
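    try:
        # Opening the short link follows the redirection to the real url.
        opened = urlopen(url)
        ld['redirected'] = opened.geturl()
        ld['valid_url'] = 'Yes'
    except Exception:
        # Any failure (bad link, network error, timeout) marks the url as invalid.
        ld['valid_url'] = 'No'
    
    # Compare strings with ==, not `is`, which tests object identity.
    if ld['valid_url'] == 'Yes':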
        ext = tldextract.extract(ld['redirected'])
        ld['domain'] = ext.domain
        ld['subdomain'] = ext.subdomain
        ld['suffix'] = ext.suffix
    else:
        ld['redirected'] = 'Not a valid url'
        ld['domain'] = 'Domain missing'
        ld['subdomain'] = 'Subdomain missing'
        ld['suffix'] = 'Suffix missing'
   
    return ld
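
The parsing half of `process_urls` can be explored without a network connection. The sketch below is a deliberately naive approximation that uses only the standard library; `rough_host_parts` is a hypothetical helper, not part of `tldextract`, and it simply splits the hostname on dots.

```python
from urllib.parse import urlsplit

def rough_host_parts(url):
    """Split a url's hostname on dots (a naive stand-in for tldextract)."""
    host = urlsplit(url).hostname or ''
    parts = host.split('.')
    if len(parts) < 2:
        return {'subdomain': '', 'domain': host, 'suffix': ''}
    return {
        'subdomain': '.'.join(parts[:-2]),
        'domain': parts[-2],
        'suffix': parts[-1],
    }

example = rough_host_parts('https://www.youtube.com/watch?v=oUs-5dHFksw&feature=youtu.be')
print(example)

# The naive split breaks on multi-part suffixes such as .co.uk:
print(rough_host_parts('https://www.bbc.co.uk/news'))
```

The second call reports `co` as the domain, which is wrong. Handling suffixes like `.co.uk` correctly requires a list of known public suffixes, and that is the reason for using `tldextract` in `process_urls` rather than splitting on dots.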

In [40]:
for l in test_ul:
    proc = process_urls(l)
    print(proc['original_short'])
    print(proc['redirected'])
    print(proc['domain'])
    print('\n')

unable to cache TLDs in file /usr/local/lib/python3.6/dist-packages/tldextract/.tld_set: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/tldextract/.tld_set'


https://t.co/VtZIf5I0F4
https://twitter.com/alisonthewliss/status/1205081472702517248/photo/1
twitter


https://t.co/ULLvnRHPWH
https://twitter.com/NicolaSturgeon/status/1137798602548551682/photo/1
twitter


https://t.co/wVKiNxrgpc
https://www.instagram.com/p/B4m03FUF07Q/?igshid=4m7y4ci9car6
instagram


https://t.co/8RWte54XTp
https://www.youtube.com/watch?v=oUs-5dHFksw&feature=youtu.be
youtube


https://t.co/SJPKI23iwd
https://twitter.com/alisonthewliss/status/1205081472702517248
twitter


https://t.co/7r0l
Not a valid url
Domain missing


https://t.co/NG7KivJmvr
https://twitter.com/NicolaSturgeon/status/1192075733956538368/photo/1
twitter


https://t.co/bJb5lGi5RI
https://twitter.com/NicolaSturgeon/status/1172933026386513920/photo/1
twitter


https://t.co/pMrbpWi6lL
https://twitter.com/theSNP/status/1205345883828686848/video/1
twitter


https://t.co/Rsub4FbpjC
https://twitter.com/extrateethmag/status/1148511205558038529
twitter


https://t.co/UdOtjH0H3S
https://twitter.com/bbclaurak/

Printing data to screen is occasionally useful, but most of the time we want to get the data into a format that we can easily store or analyze. We can use a list comprehension to process each link in our full `ul` object. The result will be a list of dictionaries, where each dictionary corresponds to a link. We can once again use `Pandas` to get this into a `dataframe` for easy analysis or storage. 

You should expect the code below to take some time to run. Please be patient, and maybe get some tea or coffee while you wait. ☕️ ☕️ ☕️

In [None]:
import time
start_time = time.time()
processed = [process_urls(l) for l in ul]
print("--- %s seconds ---" % (time.time() - start_time))
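
Most of the waiting time is spent idle while each link loads, so one possible speed-up (not used in this notebook) is to process several links at once with threads. The sketch below shows the pattern with `concurrent.futures` from the standard library. `slow_lookup` is a hypothetical stand-in that sleeps instead of opening a url, and the `max_workers=4` setting is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_lookup(url):
    """Hypothetical stand-in for `process_urls`: sleeps instead of opening the url."""
    time.sleep(0.1)
    return {'original_short': url}

toy_links = ['https://t.co/example{}'.format(i) for i in range(8)]

start = time.time()
# executor.map preserves input order, so results line up with toy_links.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_lookup, toy_links))
elapsed = time.time() - start

print('{} links processed in {:.2f} seconds'.format(len(results), elapsed))
```

If you try this with real requests, keep the number of workers small. Sending many requests at once puts more load on the servers you are contacting and may get your connection throttled.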

In [None]:
sturgeon_urls = pd.DataFrame(processed)
sturgeon_urls.sample(20)

In [None]:
sturgeon_urls.to_csv('output/sturgeon_urls.csv', index = False)

We can use `Pandas` (discussed in detail tomorrow) to quickly check to see what the most commonly linked domains are in Sturgeon's most recent 3,200 Tweets (recall this is the limit for historical Tweets accessed via the REST API). 

In [None]:
sturgeon_urls.groupby('domain').size().sort_values(ascending = False)[:10]
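
If the `groupby` chain looks opaque, it may help to see the same count done on the list of dictionaries directly. This is a small sketch with `collections.Counter` on made-up records; `toy_processed` is a hypothetical stand-in for the `processed` list, not real data.

```python
from collections import Counter

# Hypothetical records shaped like the dictionaries returned by `process_urls`.
toy_processed = [
    {'domain': 'twitter'},
    {'domain': 'instagram'},
    {'domain': 'twitter'},
    {'domain': 'youtube'},
    {'domain': 'twitter'},
    {'domain': 'youtube'},
]

# Count how often each domain appears, then keep the two most frequent.
domain_counts = Counter(record['domain'] for record in toy_processed)
print(domain_counts.most_common(2))
```

Grouping by `domain`, taking the size of each group, and sorting in descending order amounts to the same tally: one count per distinct domain, largest first.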

## Getting Friend and Follower Data

So far we have collected general account metadata and historical tweet data and metadata (including hashtags, mentioned users, and links to external content). We *also* want to collect data on friends (who an account follows) and followers (who follows a given account). 

This requires making yet another request to the REST API. When we make the request, we can use Screen Names *or* Twitter IDs, which are unique strings that a user cannot modify. If we want to collect the most data possible within the shortest amount of time, then we will use Twitter IDs. 

In [None]:
def get_friends(user_id):
    """
    Accepts a Twitter user ID number and gets a list of people the account follows ('friends'). 
    Could be screen name instead, but that is much slower and hits rate limiting faster. Count 
    would have to drop down to 200. 
    """
    friends = []
    cursor = tweepy.Cursor(api.friends_ids, id=user_id, count=5000) 
    for page in cursor.pages():
        for friend in page:
            friends.append(friend)
    return friends
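
The `tweepy.Cursor` object hides the paging logic, so here is a rough sketch of what cursor-based pagination looks like underneath. Everything in it is a stand-in: `FAKE_PAGES` and `fake_friends_ids` are hypothetical and make no network requests. The convention that a cursor of `-1` asks for the first page and a `next_cursor` of `0` marks the last page is an assumption based on how Twitter's cursored endpoints are documented.

```python
# Hypothetical stand-in for the friends/ids endpoint: three tiny "pages" of IDs.
FAKE_PAGES = {
    -1: {'ids': [11, 12], 'next_cursor': 101},
    101: {'ids': [13, 14], 'next_cursor': 102},
    102: {'ids': [15], 'next_cursor': 0},
}

def fake_friends_ids(cursor):
    """Return one page of results for the given cursor value."""
    return FAKE_PAGES[cursor]

def collect_all_ids():
    """Keep requesting pages until the API signals that none are left."""
    ids = []
    cursor = -1              # assumed convention: -1 requests the first page
    while cursor != 0:       # assumed convention: 0 means there are no more pages
        page = fake_friends_ids(cursor)
        ids.extend(page['ids'])
        cursor = page['next_cursor']
    return ids

print(collect_all_ids())
```

In `get_friends`, each `page` yielded by `cursor.pages()` plays the role of one `ids` list here, with up to 5,000 IDs per page.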

Let's use our function to get a list of accounts that Nicola Sturgeon's account follows. We know from earlier (see the `meta` object) that her Twitter ID is `160952087`. 

In [None]:
sturgeon_friends = get_friends('160952087')
print('Nicola Sturgeon currently follows {} accounts on Twitter.'.format(len(sturgeon_friends)))

In [None]:
sturgeon_friends

We now have a list of the accounts that Nicola Sturgeon follows. Let's take the first 5 and retrieve metadata about those accounts. We could do it for the full list, of course, but it will take a while because of Twitter's rate limiting. 

In [None]:
sturgeon_friends_test = sturgeon_friends[:5]
sturgeon_friends_test

In [None]:
sturgeon_friends_meta = [api.get_user(i) for i in sturgeon_friends_test]

In [None]:
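# Name the columns so the dataframe is readable without counting positions.
columns = ['name', 'screen_name', 'id', 'description', 'location', 'followers_count', 'friends_count', 'protected']
sf_meta = [[getattr(i, c) for c in columns] for i in sturgeon_friends_meta]
pd.DataFrame(sf_meta, columns=columns)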

We can, of course, do this for all of Nicola Sturgeon's friends. We can also do it for her friends' friends! And we can do it for the other politicians included in this module, and any other public Twitter accounts. All we need is a lot of time and patience. 
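
To get a feel for how much time and patience, here is a back-of-the-envelope sketch. `minutes_to_wait` is a hypothetical helper, and its default values (15 requests per 15-minute window) are assumptions about the rate limit; check Twitter's current documentation for the real numbers.

```python
import math

def minutes_to_wait(n_requests, requests_per_window=15, window_minutes=15):
    """
    Rough lower bound on waiting time under a fixed-window rate limit.
    The default limit values are assumptions, not confirmed figures.
    """
    windows_needed = math.ceil(n_requests / requests_per_window)
    # The first window starts immediately, so only the later ones add waiting time.
    return max(windows_needed - 1, 0) * window_minutes

# Hypothetical case: 1,000 friends, one friends-list request for each of them.
print(minutes_to_wait(1000))
```

Under these assumed limits, collecting the friends of 1,000 accounts would mean at least 990 minutes of waiting, which is more than 16 hours. Accounts that follow more than 5,000 others need extra requests, so the real figure would be higher.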

# The Streaming API <a id='stream'></a>

The other API that researchers routinely use when collecting Twitter data is the Streaming API. As you now know, the Streaming API is rather different from the REST API. Unlike the REST API (which *pulls* data from Twitter), the Streaming API receives data that is *pushed* from Twitter in real time. This requires defining a special listener class. The code block below defines the `MyStreamListener` class discussed in the assigned reading. Once you execute that code cell, you are ready to start receiving real-time streaming data from Twitter. The listener will append new tweet data to a csv file called `streaming_tweet_data.csv`, which is stored in the `output` directory. 

In [None]:
SEP = ';'
csv = open('output/streaming_tweet_data.csv', 'a')
csv.write('Date' + SEP + 'Tweet' + SEP + 'Number of Followers' + SEP + 'Number of Friends' + SEP + 'Handle' + '\n')

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if hasattr(status, 'retweeted_status'):
            try:
                tweet = status.retweeted_status.extended_tweet["full_text"]
            except AttributeError:
                tweet = status.retweeted_status.text
        else:
            try:
                tweet = status.extended_tweet["full_text"]
            except AttributeError:
                tweet = status.text
        
        date = status.created_at.strftime("%Y-%m-%d-%H:%M:%S")
        follower = str(status.user.followers_count)
        friend = str(status.user.friends_count)
        name = status.user.screen_name
        
        csv.write(date + SEP + tweet.strip().replace("\n","").replace('\r','').replace(';',',') + SEP + follower + SEP + friend + SEP + name + '\n')
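
The listener above keeps one tweet per row by stripping line breaks and swapping semicolons for commas, which changes the tweet text. An alternative, sketched below under stated assumptions, is to let Python's built-in `csv` module quote any field that contains the delimiter or a line break. The in-memory buffer and the example row are hypothetical stand-ins for the output file and a real tweet. Note that the cell above binds the name `csv` to the file handle, so that handle would need a different name before importing the module.

```python
import csv
import io

# An in-memory buffer stands in for 'output/streaming_tweet_data.csv'.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';')
writer.writerow(['Date', 'Tweet', 'Number of Followers', 'Number of Friends', 'Handle'])

# Hypothetical tweet text that contains the delimiter and a line break.
writer.writerow(['2020-03-03-12:00:00', 'Brexit; what now?\nThread:', 10, 20, 'example_user'])

# Reading the buffer back recovers the original tweet text unchanged.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=';'))
print(repr(rows[1][1]))
```

The trade-off is that a quoted field can span more than one physical line in the file, so the file should be read back with a csv parser (such as `csv.reader` or `Pandas`) rather than split line by line.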

To start streaming data, we have to initialize the class object and then provide some sort of search filter. In this case, we will stream tweets about Brexit. Remember, these tweets will *not* print to screen. They will be written to the `streaming_tweet_data.csv` file. 

I suggest you run the two cells below and then walk away for 5 minutes or so. Come back, **'interrupt' the Python kernel** (you can do this by pressing the square stop button in the Jupyter toolbar), and check the content of the `csv` file.

In [None]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

In [None]:
myStream.filter(track=['brexit'])

When you are ready to stop streaming data, you will have to 'interrupt' the Python kernel. Otherwise, it will keep collecting data from Twitter until the connection is somehow severed. You can do this by clicking the square black button at the top of your Jupyter Notebook, or by selecting 'Interrupt' from the Kernel menu at the top of the notebook. 

> **Important note!** You may get a `TweepError: Stream object already connected!` error when executing this part of the notebook. This is because Twitter will only allow one connection to the Streaming API at a time. If you get that error, then you should select "Restart & Clear Output" from the Kernel menu at the top of the notebook. You will need to re-import the packages and re-authenticate with Twitter by executing those cells at the top of the notebook. Then you can stream new data. Just don't connect the previous streaming object in the new session! 

There are, of course, more sophisticated ways of doing this. [Ted Chen](https://tedhchen.com/), for example, has developed [a set of Python tools for working with the Streaming API](https://github.com/tedhchen/twitter_streaming_tools) that are very well-suited to large-scale empirical research projects. This level of depth is outside of the scope of this class, but (1) you can learn a lot from studying repos like Ted's, and (2) you can learn more from *Doing Computational Social Science* and other texts that cover APIs for social scientific research. 

<a id='key_points'></a>
## Key Points   
You should now know: 
* The differences between working with an API directly and an API client.
* The risks associated with sharing API tokens & a method for keeping them out of Python scripts. 
* How to save request results to a file & the importance of doing so. 
* Why it's important to abide by rate limits. 
* How to install Python packages using `pip`.
* How to automate API requests using loops.
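
As a reminder of the second point, here is a minimal sketch of one common way to keep tokens out of a script: reading them from environment variables. This is not necessarily the method used earlier in the course; `load_token` and the variable name `EXAMPLE_API_KEY` are hypothetical.

```python
import os

def load_token(name):
    """Read a credential from an environment variable, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError('Environment variable {} is not set.'.format(name))
    return value

# For demonstration only: in real use the variable is set outside the script,
# for example in your shell profile, so the secret never appears in the code.
os.environ['EXAMPLE_API_KEY'] = 'not-a-real-key'

api_key = load_token('EXAMPLE_API_KEY')
print('Loaded a key with {} characters.'.format(len(api_key)))
```

Printing the length rather than the key itself is deliberate, since notebook output is saved with the notebook and is easy to share by accident.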
