# Fundamentals of Data Analysis with Python 

## Day 2: Collecting Data from the Web

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6, 2020

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>


### Overview 

High-level overview coming soon... 

### Plan for the Day

1. [What you need to know about how the Internet works to collect data from the web](#wyntk)
2. [Scraping the Web](#scrape)
    * How to scrape text and tables from static websites with BeautifulSoup
    * An overview of working with (1) multiple pages and (2) interactive content 
3. [Collecting data via Application Programming Interfaces](#apis)
    * Understanding APIs 
    * The Guardian API 
    * The Wikipedia API
    * ? The Twitter API ? 
4. [Simple text processing with web data](#text)

<hr>

# What you need to know about how the Internet works to collect data from the web <a id='wyntk'></a>

# Scraping the Web <a id='scrape'></a>

# Collecting data via Application Programming Interfaces <a id='apis'></a>

1. [Understanding APIs](#understanding_apis)
2. [The Guardian API](#guardian)   
    a. [Overview](#g_overview)      
    b. [API Key](#g_key)      
    c. [Making Requests](#g_requests)      
    d. [Filtering](#filtering)    
    e. [Extra Information](#g_info)   
    f. [Requesting More Results](#g_more)  
3. [The Wikipedia API](#wikipedia)   
4. The Twitter API
5. [Key Points](#key_points)


<a id='understanding_apis'></a>
## Understanding APIs

Application Programming Interfaces (APIs) offer an alternative way to access data from online sources. They provide an explicit _interface_ to the data behind the website, defining how you can request data and what format you will receive it in. 

### Key Components of API Requests & Responses
**Endpoints** are the specific web locations where a request for a particular resource can be sent. Usually they have descriptive names like Content, Tweet, User, etc. We communicate with APIs by sending requests to these endpoints, usually in the form of a URL. 

These URLs usually contain optional **queries**, **parameters**, and **filters** that let us specify exactly what we want the API to return. 
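
As a rough illustration of how parameters end up inside a request URL, the sketch below uses Python's standard `urllib.parse` module. The endpoint `https://api.example.com/search` and the parameter names are made up for illustration and do not belong to any real API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, for illustration only
ENDPOINT = 'https://api.example.com/search'
params = {'q': 'climate change', 'page-size': 20}

# urlencode turns the dictionary into a query string, escaping
# characters (such as spaces) that are not allowed in URLs
request_url = ENDPOINT + '?' + urlencode(params)
print(request_url)

# Going the other way: split the URL and recover the parameters
recovered = parse_qs(urlsplit(request_url).query)
print(recovered)
```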

Once a request has been made to the API, it will return a **response**. Every response has a response code, which indicates whether the request was successful (200) or encountered an error (400, 401, 500, etc.). When you encounter a problem, it's a good idea to confirm you received a successful response instead of one of the [many error responses](https://documentation.commvault.com/commvault/v11/article?p=45599.htm). 

As long as a request was successful, it will return a 200 OK response along with all the requested data. We will delve into what this data looks like below. 
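
To make the response codes a little more concrete, here is a small sketch that uses Python's built-in `http.HTTPStatus` to turn a numeric code into its standard name. The helper names `describe_status` and `is_success` are our own, and treating every code from 200 to 299 as a success is a common convention rather than a rule of any specific API.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable label such as '200 OK' for a status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the standard ones Python knows about
        phrase = 'Unknown status'
    return '{} {}'.format(status_code, phrase)


def is_success(status_code):
    """Codes from 200 to 299 indicate that the request succeeded."""
    return 200 <= status_code < 300


for code in [200, 400, 401, 500]:
    print(describe_status(code), '-> success?', is_success(code))
```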

### APIs vs Web Scraping 

Benefits: 
* Structured data (for the most part). 
* Controlled by an organization or company (Guardian, Twitter, etc.) 
* Documented (usually)
* Maintained (usually)
* Rules for access are explicitly stated

Drawbacks: 
* Limited to the data made explicitly available
* Relies on the organization to make updates
* Rate limits & other restrictions apply and are usually based on business reasons rather than technical limitations


<a id='guardian'></a>
## The Guardian API

<a id='g_overview'></a>
### Overview
The Guardian's API allows us to query and download data related to their published articles. 

The Guardian API has five **endpoints**: 
* Content (`https://content.guardianapis.com/search`) &mdash; returns content. With a developer key, only text is returned. Supports querying and filtering to narrow down what is returned.  
* Tags &mdash; returns all API tags (more than 50,000). These tags can be used in other queries. 
* Sections &mdash; logical grouping of content
* Editions &mdash; the content for each of the three regional main pages
* Single Item &mdash; will return all data related to a specific item (content, tag, or section) in the API. 

Today, we will focus on the content endpoint. This will allow us to retrieve the body text and metadata for articles published in The Guardian.

Often, the easiest way to interface with an API is through a client. In Python, these clients are just packages that provide functions to simplify the process of accessing the API. 

Alternatively, we can access APIs directly using the [`requests`](https://requests.readthedocs.io/en/master/) library. By accessing the API directly, we keep full freedom in how we use it, rather than being restricted to what a client offers. This is the option we will choose for interfacing with The Guardian API. 

<a id='g_key'></a>
### API Key
Hopefully you were all successful in receiving a Guardian API key. If not, please let one of us know! 

This API key is what gives you access to the Guardian API. It's kind of like a username and password, all wrapped into one. It is how the API monitors who is accessing the service and makes sure they are abiding by the terms of service.

We all registered for a developer key. With this key we receive:  
* Up to 12 calls per second
* Up to 5,000 calls per day
* Access to article text (no image, audio, or video)
* Access to a subset of Guardian content (1.9 million pieces)

If we had registered (and paid) for a commercial key, we would have fewer limitations in what we can access from the API. 

As I mentioned earlier, you can think of your API token as your username and password for accessing The Guardian API. Like any other credentials, we want to make sure this is kept secure. Most importantly, **never share API tokens in public locations**, including in git repositories or emails. 

Making an API token public allows others to access the API as if they were you. This puts you at risk if they violate the terms of service you agreed to when you requested an API token. 

For example, if someone were to get ahold of your API token, they could use it to launch a [denial of service attack](https://en.wikipedia.org/wiki/Denial-of-service_attack) on The Guardian's API. In this case, your token may be revoked and you'd be unable to request a new API token in the future without further violating the terms of service. 

To guard against this problem, I would recommend one of two options: 
* Storing API tokens as environment variables
* Creating a `cred.py` to store credentials such as API tokens

Personally, I use a `cred.py` containing any of the credentials I need to access APIs, databases, etc. I keep this file stored on my computer in a single location which can be imported by any Python script on my machine (usually a directory on Python's module search path, for example one listed in `PYTHONPATH`). This way, the API token is outside of a script I might share and the file is outside of a git repo I might make public one day. 

If for some reason you need to store the `cred.py` file in the same directory as your Python file and this is within your git repo, make sure to add `cred.py` to the `.gitignore` file.
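
For the first option, environment variables, a minimal sketch might look like the following. The variable name `GUARDIAN_API_KEY` and the helper `load_api_key` are hypothetical choices for this example; the optional `environ` argument exists only so the function can be tried out with a plain dictionary instead of your real environment.

```python
import os


def load_api_key(var_name='GUARDIAN_API_KEY', environ=None):
    """Read an API key from an environment variable.

    `GUARDIAN_API_KEY` is a hypothetical variable name; `environ`
    can be replaced by a plain dict when testing.
    """
    if environ is None:
        environ = os.environ
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(
            'Environment variable {} is not set'.format(var_name))
    return key


# Demonstration with a stand-in dictionary instead of the real environment
demo_key = load_api_key(environ={'GUARDIAN_API_KEY': 'not-a-real-key'})
print(demo_key)
```

In real use you would set the variable in your shell before starting Jupyter and call `load_api_key()` with no arguments.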

Let's go ahead and create this `cred.py` file now.   

Back on the Jupyter Home Page, click on the New button on the upper right side & select the text option (see the screenshot below). 

<img src=img/new_file.png></img>

A new file will open. Rename it `cred.py` and add the following line, replacing `<YOUR_TOKEN>` with your own API token. Keep the quotation marks, since the token needs to be a Python string.  

```python3
guardian_key = '<YOUR_TOKEN>'
```

Save & exit the file.   

Run the cell below. If it runs without throwing any errors, the API token has been successfully saved.

In [3]:
import cred

api_key = cred.guardian_key

<a id='g_requests'></a>
### Making API Requests
Now that we have our API key stored in a safer location, we can begin making requests to The Guardian API. 

To start, we will use the `requests` package to make a generic request to the content endpoint. 

In [4]:
# Importing libraries only needs to be done once
import requests
import pprint as pp

In [6]:
API_ENDPOINT = 'https://content.guardianapis.com/search'

MY_PARAMS = {'api-key': api_key}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

{'currentPage': 1,
 'orderBy': 'newest',
 'pageSize': 10,
 'pages': 217494,
 'results': [{'apiUrl': 'https://content.guardianapis.com/australia-news/live/2020/feb/25/kristina-keneally-calls-for-bettina-arndt-to-be-stripped-of-australia-day-honour-politics-live',
              'id': 'australia-news/live/2020/feb/25/kristina-keneally-calls-for-bettina-arndt-to-be-stripped-of-australia-day-honour-politics-live',
              'isHosted': False,
              'pillarId': 'pillar/news',
              'pillarName': 'News',
              'sectionId': 'australia-news',
              'sectionName': 'Australia news',
              'type': 'liveblog',
              'webPublicationDate': '2020-02-25T06:51:34Z',
              'webTitle': 'Dutton says he was referring to Islamic terrorists '
                          "when he talked about 'leftwing lunatics' – politics "
                          'live',
              'webUrl': 'https://www.theguardian.com/australia-news/live/2020/feb/25/kristina-ke

There is quite a bit of information there...

Let's break it down a bit. What are the individual fields contained within the response? 

In [7]:
response_dict.keys()

dict_keys(['status', 'userTier', 'total', 'startIndex', 'pageSize', 'currentPage', 'pages', 'orderBy', 'results'])

Each of these is described in the [content endpoint's documentation](https://open-platform.theguardian.com/documentation/search). We can examine each field individually by indexing our response dictionary. 

Let's start by seeing what order was used to sort the results. 

In [8]:
response_dict['orderBy']

'newest'

In the cell below, find the total number of items that were returned in this call. Refer to the [documentation](https://open-platform.theguardian.com/documentation/search) if you aren't sure which field you are interested in.   

In [9]:
# Your Answer Here

The interesting part of the response is really what is contained within the `results` field. The results will contain the individual items provided by the endpoint. This will be content (mainly news articles) in our case. 

In the cell below, examine what is contained within the results field and answer (1) what data structure is being used to store the results (dictionaries, lists, etc.), (2) what data is stored for each result, and (3) how many results were returned. 

In [10]:
# Your Answer Here

<a id='filtering'></a>
### Filtering
Often we are interested in receiving very specific data from an API, rather than receiving all the data and then sifting through it later on.

Luckily, most APIs have built-in ways to make these specifications. In The Guardian's API these are called queries or filters.

**Queries** allow you to request content containing free text. This works much like a search engine. You can use double quotes to match exact phrases, and the AND, OR, and NOT operators are supported.   

**Filters** allow you to request content based on specific [metadata](https://dataedo.com/kb/data-glossary/what-is-metadata). Once again, you can check the [documentation](https://open-platform.theguardian.com/documentation/search) to see what metadata is available for filtering. 

We will start off simple. You might have noticed earlier that our response from the API contained the most recent content available. What if we are only interested in retrieving content published before January 1, 2020?

In [11]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

{'currentPage': 1,
 'orderBy': 'newest',
 'pageSize': 10,
 'pages': 216308,
 'results': [{'apiUrl': 'https://content.guardianapis.com/lifeandstyle/2020/jan/01/how-can-i-celebrate-my-friends-new-success-without-envy',
              'id': 'lifeandstyle/2020/jan/01/how-can-i-celebrate-my-friends-new-success-without-envy',
              'isHosted': False,
              'pillarId': 'pillar/lifestyle',
              'pillarName': 'Lifestyle',
              'sectionId': 'lifeandstyle',
              'sectionName': 'Life and style',
              'type': 'article',
              'webPublicationDate': '2019-12-31T23:58:00Z',
              'webTitle': "How can I celebrate my friend's new success, "
                          'without envy?',
              'webUrl': 'https://www.theguardian.com/lifeandstyle/2020/jan/01/how-can-i-celebrate-my-friends-new-success-without-envy'},
             {'apiUrl': 'https://content.guardianapis.com/commentisfree/2020/jan/01/i-fled-the-bushfires-with-ash-falling-

We can add more parameters to further specify the types of results we want to receive.

In [22]:
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'sport',
             'q': '(Cologne OR Koln) AND Germany'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']
pp.pprint(response_dict)

{'currentPage': 1,
 'orderBy': 'relevance',
 'pageSize': 10,
 'pages': 8,
 'results': [{'apiUrl': 'https://content.guardianapis.com/sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two',
              'id': 'sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two',
              'isHosted': False,
              'pillarId': 'pillar/sport',
              'pillarName': 'Sport',
              'sectionId': 'sport',
              'sectionName': 'Sport',
              'type': 'article',
              'webPublicationDate': '2019-10-03T09:35:33Z',
              'webTitle': 'MLB catch of the season, relay showreels and a '
                          'penalty one-two | Classic YouTube',
              'webUrl': 'https://www.theguardian.com/sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two'},
             {'apiUrl': 'https://content.guardianapis.com/sport/2019/may/12/michael-schumacher-feature-film-rights-up-for-grabs-at-c

In the cell below, write an API request to fetch content using a query and at least 2 filters. 

In [13]:
# YOUR ANSWER HERE

<a id='g_info'></a>
### Extra Information
You may have noticed in the previous API requests and responses that while we were receiving article URLs, sections, and publication dates, we were missing some pretty important data. Things like headlines, bylines, and body text are not included in the default API response. This additional information is available, but needs to be specified using the `show-fields` parameter.

In [36]:
API_ENDPOINT = 'https://content.guardianapis.com/search'

MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'sport',
             'q': '(Cologne OR Koln) AND Germany',
             'show-fields': 'wordcount,body,byline'}

response = requests.get(API_ENDPOINT, params=MY_PARAMS)

response_dict = response.json()['response']

response_dict

{'status': 'ok',
 'userTier': 'developer',
 'total': 72,
 'startIndex': 1,
 'pageSize': 10,
 'currentPage': 1,
 'pages': 8,
 'orderBy': 'relevance',
 'results': [{'id': 'sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two',
   'type': 'article',
   'sectionId': 'sport',
   'sectionName': 'Sport',
   'webPublicationDate': '2019-10-03T09:35:33Z',
   'webTitle': 'MLB catch of the season, relay showreels and a penalty one-two | Classic YouTube',
   'webUrl': 'https://www.theguardian.com/sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two',
   'apiUrl': 'https://content.guardianapis.com/sport/blog/2019/oct/03/mlb-catch-of-the-season-relay-showreels-penalty-one-two',
   'fields': {'byline': 'Guardian sport',
    'body': '<p>1) It’s the Prix de l’arc Triomphe on Sunday so let’s get ourselves in the mood with a highlights reel: Enable will be going for a historic hat-trick at Longchamp so watch her <a href="https://www.youtube.com/watch?v=j
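
Notice that the `body` field arrives as HTML rather than plain text. One possible way to strip the markup, sketched here with Python's standard `html.parser` module, is shown below (BeautifulSoup would work just as well). The class `_TextExtractor`, the function `strip_tags`, and the sample string are our own illustrations, not part of The Guardian API.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with whitespace tidied."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse any runs of whitespace
    return ' '.join(''.join(parser.chunks).split())


# A made-up snippet standing in for an article body
sample_html = '<p>Ambition used to be a <a href="#">dirty little word</a>.</p>'
plain_text = strip_tags(sample_html)
print(plain_text)
```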

In the cell below, write code to access and print the body text of an article from the `response_dict`. 

In [24]:
# Your Answer Here

<a id='g_more'></a>
### Requesting More Results

In the API response, there are three fields that relate to the number of results obtained from an API request &mdash; `total`, `pages`, and `pageSize`. 

In [25]:
response_dict['total']

72

In [26]:
response_dict['pages']

8

In [27]:
response_dict['pageSize']

10

When looking at them all together, it becomes clearer how they relate. 

* `total` is the number of items available to be returned.  
* `pages` is the number of pages available for return, where each page is a small subset of the total number of items.   
* `pageSize` is how many items are in the current page being returned.   

If it's hard to imagine the differences between these values, you can think about how Google search results work.   

The key point for us to know is that in a basic API request we are likely only receiving a fraction of the total items available for return. If we want to retrieve all the data, we need to look at (1) increasing the page size and (2) automatically requesting data from the next page. 
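
The values we saw are consistent with `pages` being `total` divided by `pageSize`, rounded up. This relationship is inferred from the output above rather than taken from the documentation, and `pages_needed` is a helper name made up for this sketch.

```python
import math


def pages_needed(total, page_size):
    """Number of pages required to hold `total` items."""
    return math.ceil(total / page_size)


print(pages_needed(72, 10))  # 8, matching the `pages` value above
print(pages_needed(72, 50))  # what we would expect with larger pages
```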

In the cell below, update `MY_PARAMS` to increase the page size from 10 to 50. Use the API [documentation](https://open-platform.theguardian.com/documentation/search) to find the right parameter. 

In [28]:
API_ENDPOINT = 'https://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'sport',
             'q': '(Cologne OR Koln) AND Germany',
             'show-fields': 'wordcount,body,byline'}
response = requests.get(API_ENDPOINT, params=MY_PARAMS)
response_dict = response.json()['response']

Run the cell below to verify you successfully increased the number of results per page to 50. 

In [29]:
if response_dict['pageSize'] < 50:
    print('The page size is still less than 50. Try again.')
elif response_dict['pageSize'] == 50: 
    print('The page size is now 50. Good job!')
elif response_dict['pageSize'] > 50: 
    print('The page size is now greater than 50. How did you do that?')

The page size is still less than 50. Try again.


Now that each page can display 50 results, nearly 5x fewer pages are needed to contain all of the data we need!

In [30]:
response_dict['pages']

8

However, we still need to find a way to gather data from all the pages, instead of just the first. 

Luckily, The Guardian API has a built-in `page` parameter that allows us to specify which page we want to get results from. We can combine this type of request with a `while` loop to help automate our API requests.   

#### Rate Limits
Before we look at the code below, we should think about the potential impacts of automating API requests. 

Remember that with a developer key we are limited to 12 calls per second and 5,000 calls per day. While in this case we will be making very few requests, it's important to understand why we should abide by these limits. 

When you sign up for an API token, you are typically required to agree to the Terms of Service. These terms are usually (but not always) summarized to make sure the most important information is readily available. This information usually includes: 
* Rate limits
* Disallowed uses
* Limitations for sharing data
* Intellectual Property considerations

While most of these are self-explanatory, it's worthwhile taking some time to go over what rate limits are and how they are controlled. 

Rate limits are the upper bound placed on how many API requests a user can make in a given amount of time. These numbers differ between websites and even user types. The idea is to limit the rate of requests and ensure the website isn't overrun with traffic. 

In general, rate limits are controlled in two ways. Some websites will have built-in systems that will detect over-use and throttle or revoke access for a token that is over-requesting. This is the system The Guardian API uses. 

Other websites rely on the honour system, asking you to abide by their guidelines. In these cases the risk of exceeding limits is higher (since there is no throttling), and you run a greater risk of being blacklisted if you exceed the API's rate limits. 
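
If you want to stay under a limit on your own side, one simple approach is to enforce a minimum gap between calls. The `Throttle` class below is a hypothetical sketch of that idea, not something provided by `requests` or by The Guardian. The `clock` and `sleep` arguments default to the real `time` functions and exist only so the behaviour can be demonstrated without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Pause if the previous call happened too recently."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Demonstration with a fake clock, so nothing really sleeps
fake_time = [0.0]
pauses = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    pauses.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(max_per_second=10, clock=fake_clock, sleep=fake_sleep)
throttle.wait()  # first call goes straight through
throttle.wait()  # second call has to pause
print(pauses)
```

In a real script you would create `Throttle(max_per_second=...)` with the defaults and call `throttle.wait()` right before each API request.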

In [37]:
# Normal Setup
API_ENDPOINT = 'https://content.guardianapis.com/search'
MY_PARAMS = {'api-key': api_key, 
             'to-date': '2019-12-31', 
             'lang': 'en', 
             'section': 'sport',
             'q': '(Cologne OR Koln) AND Germany',
             'show-fields': 'wordcount,body,byline',
             'page-size': 50}

# Collect All Results
all_results = []
cur_page = 1
total_pages = 1

while (cur_page <= total_pages) and (cur_page < 10):  # fail safe: never request more than 9 pages
    # Make an API request for the current page
    MY_PARAMS['page'] = cur_page
    response = requests.get(API_ENDPOINT, params=MY_PARAMS)
    response_dict = response.json()['response']

    # Add this page's results to our master results list
    all_results.extend(response_dict['results'])
    
    # Update our loop variables
    total_pages = response_dict['pages']
    cur_page += 1

In [38]:
print("Total # of results: {}".format(len(all_results)))

Total # of results: 72
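
The loop above assumes every request succeeds. A slightly more defensive variant is sketched below as a reusable function. It is only a sketch: `collect_all_results` and `fake_fetch` are names invented here, and `fetch_page` stands for any function that takes a page number and returns a dictionary shaped like `response_dict` (with `status`, `pages`, and `results` keys).

```python
def collect_all_results(fetch_page, max_pages=10):
    """Gather results from every page, stopping after `max_pages` pages."""
    all_results = []
    page = 1
    total_pages = 1
    while page <= total_pages and page <= max_pages:
        response = fetch_page(page)
        # Stop early rather than silently collecting partial data
        if response.get('status') != 'ok':
            raise RuntimeError('Request for page {} failed'.format(page))
        all_results.extend(response['results'])
        total_pages = response['pages']
        page += 1
    return all_results


# Stand-in for the API: 72 fake items served 50 at a time
def fake_fetch(page, page_size=50):
    items = list(range(72))
    start = (page - 1) * page_size
    return {'status': 'ok',
            'pages': 2,
            'results': items[start:start + page_size]}


collected = collect_all_results(fake_fetch)
print(len(collected))
```

With the real API, `fetch_page` could wrap the `requests.get` call from the cell above and return `response.json()['response']`.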


In [39]:
all_results[36]

{'id': 'sport/blog/2008/nov/11/bundesliga-football-dortmund-bayern-hamburg',
 'type': 'article',
 'sectionId': 'sport',
 'sectionName': 'Sport',
 'webPublicationDate': '2008-11-11T14:07:00Z',
 'webTitle': 'Football: Raphael Honistein on the maverick managers of the Bundesliga',
 'webUrl': 'https://www.theguardian.com/sport/blog/2008/nov/11/bundesliga-football-dortmund-bayern-hamburg',
 'apiUrl': 'https://content.guardianapis.com/sport/blog/2008/nov/11/bundesliga-football-dortmund-bayern-hamburg',
 'fields': {'byline': 'Rafa Honigstein',
  'body': '<p>Ambition used to be a dirty little word (schmutzig, perhaps) in the <a href="http://www.theguardian.com/football/bundesligafootball" title="">Bundesliga</a>. Apart from Bayern, the loudmouths from Bavaria who could always be relied upon to brashly insist on their inalienable right to win the title every bloody season, others were far too happy keeping a low profile and expectations down. Everybody wanted to be the underdog: better to spend

Now that we have the results, we can continue to access them and work with them, without having to make more API requests.

Whenever possible, **store the results you receive from API requests**. This allows you to access the data without making unnecessary requests to the API. 

You can store the data in either Python variables or in a file. If you are only using the data for a short period of time (e.g. real-time analysis) you can likely get away with using variables within your Python script. 

However, if you want to access the data after you've finished running your script you should save it to a file. This way the data can be used later in new analyses or to reproduce the work you've already done. 

Let's store our results in a file, so we can use them later on. 

In [53]:
import json 
FILE_PATH = 'data/guardian_api_results.json'
with open(FILE_PATH, 'w') as outfile:
    json.dump(all_results, outfile)
    

We can check that the results were written in the correct format by reading them back in. 

In [54]:
with open(FILE_PATH, 'r') as f:
    data = json.load(f)
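
Writing to `data/guardian_api_results.json` only works if the `data` folder already exists. The sketch below wraps saving and loading in two small helpers (`save_results` and `load_results`, names of our own choosing) that create the folder when needed, and checks the round trip on made-up sample data in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, file_path):
    """Write results to a JSON file, creating parent folders if needed."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'w') as outfile:
        json.dump(results, outfile)


def load_results(file_path):
    """Read results back from a JSON file."""
    with open(file_path, 'r') as infile:
        return json.load(infile)


# Made-up sample data, written to a temporary folder for the demonstration
sample = [{'id': 'example/1', 'webTitle': 'A made-up headline'}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'data' / 'demo_results.json'
    save_results(sample, demo_path)
    round_trip = load_results(demo_path)

print(round_trip == sample)
```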

<a id='wikipedia'></a>
## The Wikipedia API
The [English Wikipedia API](https://en.wikipedia.org/w/api.php) is one endpoint of the larger [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page). Other endpoints include the Meta-Wiki, Wikimedia Commons, and German Wikipedia APIs. 

There is plenty of documentation about how to use these APIs directly, but there is also an easy-to-use Python client we can use. The [`wikipedia`](https://wikipedia.readthedocs.io/en/latest/) Python client developed by Jonathan Goldsmith provides us with functionality for reading and parsing data from Wikipedia. 

While in the backend `wikipedia` is still using the MediaWiki API, the front-end interface (what we will work with) is much simpler than if we were to use the API directly.

### Installing `wikipedia`
Likely, up until this point all of the Python packages we've been using have come standard in the Anaconda installation you all have on your machines. 

However, `wikipedia` is not a default package in either base Python or Anaconda. So, we will need to download it for ourselves. 

Usually, Python packages can be found on [PyPI](https://pypi.org/), the official repository for Python packages. Any package found on PyPI can be installed using [`pip`](https://pypi.org/project/pip/), Python's package installer. 

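Run the cell below to use `pip` to search PyPI for the `wikipedia` package. (PyPI has since disabled the service behind `pip search`, so this command may now return an error. If it does, search on the [PyPI website](https://pypi.org/) instead.) 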

> Aside   
The `!` at the beginning of the cell tells Jupyter that we want that cell (and that cell only) to be executed on the command line. 

In [None]:
!pip3 search wikipedia 

Conveniently, the package we are interested in is shown right at the top. We also see that the default version of this package is 1.4.0. To install `wikipedia`, run the cell below. 

You may notice that some extra packages are being installed, or at least looked for. These packages are _requirements_ of the `wikipedia` package and need to be installed for `wikipedia` to work properly.

In [None]:
!pip3 install wikipedia 

Once you see a message to the effect of `Successfully installed wikipedia-1.4.0`, comment out the cell block above to ensure you don't accidentally try to re-install the package. 

If you get an error message, let one of us know so we can help you debug. 

Run the cell below to make sure `wikipedia` was installed successfully. If no errors show up, you are good to go!

In [None]:
import wikipedia

### Using `wikipedia`
Unlike The Guardian or Twitter APIs, Wikipedia's API doesn't require a token. Instead, everything is publicly accessible to anyone. 

Because no token ties our requests to a stated quota, we need to be more careful to rate limit our own requests. 

#### Searching
Similar to how we search on Wikipedia's website, we can use the API to search for specific content. 

In [None]:
search_term = 'spelunking'

search_results = wikipedia.search(search_term)

search_results

If we are interested in a particular page, we can request it specifically using the `page()` function. 

In [None]:
my_page = wikipedia.page(title=search_results[0])
my_page

At first this result might seem anticlimactic. After all, there really doesn't appear to be any interesting data contained within `my_page`. However, `my_page` actually does contain a lot of information; it's just packaged into a `WikipediaPage` object (an instance of the `WikipediaPage` class). 

This object stores data such as the page's summary, links, and categories, all structured neatly within the object. Check out the [`WikipediaPage` documentation](https://wikipedia.readthedocs.io/en/latest/code.html#wikipedia.WikipediaPage) for a full list. 

In [None]:
my_page.links

In [None]:
my_page.summary

In the cell below, use a for loop to retrieve and store the summaries for each of the 10 pages in `search_results`. 

In [None]:
# Your Answer Here

#### Jumping Between Pages
Links are inherent in Wikipedia. They connect pages to one another and provide a structure for the site. They also mean you can almost always get from one page to another by following links. Check out [Six Degrees of Wikipedia](https://www.sixdegreesofwikipedia.com) if you have any doubts. 

We can use these links to move from page to page, gathering information as we go. The cell below uses the `random` package to select a link at random and display its summary text. 

In [None]:
import random

# Function for selecting a random linked page
def select_random_link(links):
    total_links = len(links)
    random_num = random.randrange(0, total_links)
    random_page_name = links[random_num]
    random_page = wikipedia.page(random_page_name)
    return random_page


# All links
links = my_page.links

# Select a random linked page
linked_page = select_random_link(links)

# Print Results
print('There is a link from {} --> {}\n'.format(my_page.title, 
                                              linked_page.title))

print("{}'s summary is\n {}".format(linked_page.title,
                                    linked_page.summary))

Above, we took the first step in a ["random walk"](https://en.wikipedia.org/wiki/Random_walk) through Wikipedia. In the cell below use the `select_random_link()` function from above and a loop (`while` or `for`) to perform a random walk with 5 steps.

Feel free to choose any page as a starting point. Print out the title of each page you visit on the random walk. 

In [None]:
# Your Answer Here

It is fairly easy to imagine how a random walk, left to its own devices, could carry on indefinitely through Wikipedia making API request after API request. If enough people write random walk code, or other code making many requests, it's quite possible we could overwhelm the Wikipedia API. 

When this happens, Wikipedia identifies the IP addresses making the most requests and serves them with an HTTP timeout error. Essentially, Wikipedia punishes the heavy users by returning errors and making them wait until the API is no longer overwhelmed. 

To help avoid this, we can make use of the `set_rate_limiting()` function included in the `wikipedia` Python package. 

In [None]:
wikipedia.set_rate_limiting(rate_limit=True)

Now, any requests we make to the Wikipedia API will be separated by at least 50 ms (the default for the function). If at any point we encounter an HTTP timeout error while using rate limiting, we should increase the wait using `set_rate_limiting()`'s `min_wait` parameter, which takes a `datetime.timedelta`. 

## <font color='crimson'> The Twitter API </font>

<a id='key_points'></a>
## Key Points   
You should now know: 
* The differences between working with an API directly and an API client.
* The risks associated with sharing API tokens & a method for keeping them out of Python scripts. 
* How to save request results to a file & the importance of doing so. 
* Why it's important to abide by rate limits. 
* How to install Python packages using `pip`.
* How to automate API requests using loops.


# Simple text processing with web data <a id='text'></a>