# Using APIs to Get Data From the Internet


**API** stands for Application Programming Interface.

An API is a set of instructions that describe how computers can interact with each other to request and receive information.

Some important questions that help us discover and use an API are below.

|Question | In technical terms |
|:---------|:--------------------|
|Where is my data? | What is the domain? |
|How do I learn what data is available?| Where is the documentation? |
|How do I request specific data?| How do I formulate a URL for a specific purpose? |
|How do I interpret the data?| What is the structure and format of the output?|



**Let's walk through an example in the browser**

PlaceKitten!

In a browser, go to http://www.placekitten.com

|In technical terms | PlaceKitten |
|:---------|:--------------------|
|What is the domain? | http://www.placekitten.com |
|Where is the documentation?| The documentation is on the home page. |
|How do I formulate a URL for a specific purpose? | You put the width and height in the URL, like http://www.placekitten.com/width/height |
|What is the structure and format of the output?| It's an image! |
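The URL pattern in the table can be sketched as a small helper. This is only an illustration: the function name `kitten_url` is made up here, and it just builds the URL string without downloading anything.

```python
def kitten_url(width, height):
    """Build a PlaceKitten URL for an image of the given size (in pixels)."""
    # The API encodes the request in the path: /width/height
    return f"http://www.placekitten.com/{width}/{height}"


print(kitten_url(200, 300))
```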

# Accessing PlaceKitten in Python

We're going to use a special library called <code>requests</code>.

In [31]:
from IPython.display import display, Image  # This line lets you display images. We'll use that in a bit.

# This line lets you use python to download data from the web.
import requests
import pandas as pd

In [None]:
# Get a 200 by 300 image from placekitten.
r = requests.get('http://www.placekitten.com/200/300')

In [None]:
# Look at the status code
r.status_code
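The status code is a number that says how the request went: 200 means success, while codes in the 400s and 500s signal a problem. As a rough sketch, Python's built-in `http` module can translate a code into its standard name. The helper name `describe_status` is an assumption made for this example, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

You could pass `r.status_code` to a helper like this to see what the server said about your request.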

In [None]:
# print the content
r.content

In [None]:
# Use the Image function to display the image
display(Image(r.content))

### Exercise 1

Write a function that takes in the width and height and displays an image.

### Exercise 2

Can you write a loop to show several images?


In [None]:
# Write a loop that shows multiple images


# Example 2: Getting World Times

This example introduces a slightly more complicated API. It also introduces **JSON**, a very common data format.
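JSON looks a lot like Python dictionaries and lists written out as text. The sketch below uses Python's built-in `json` module on a made-up string (the sample data is hypothetical, not real output from the API) to show how JSON text becomes ordinary Python objects.

```python
import json

# Hypothetical JSON text, similar in shape to what a web API might send back
sample_text = '{"timezone": "Example/City", "utc_offset": "-05:00", "dst": false}'

parsed = json.loads(sample_text)  # JSON object -> Python dict

print(type(parsed))
print(parsed["timezone"])
print(parsed["dst"])  # JSON false becomes Python False
```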

The API (including some documentation) is at http://worldtimeapi.org/

In [None]:
# Download list of time zones
r = requests.get("http://worldtimeapi.org/api/timezone")
print(r.content)

### Exercise 3

Use the `.json()` method to convert the response to a dictionary or list.

In [None]:
# Use the .json() method to convert the response to a dictionary or list
# What did it return?


### Exercise 4

Get the time for your time zone

In [None]:
# Your code here


### Exercise 5

Get the time for your IP address

In [None]:
# Get the time for your IP address


# Example 3: Getting Wikipedia pages

Wikipedia also has an open API, and I want to use it to show one other tip for using the `requests` library: many APIs take a set of parameters, which you can pass as a dictionary.

The documentation for the very extensive API is [here](https://www.mediawiki.org/wiki/API:Main_page). Many of the operations require you to authenticate (which we will cover next), but some things, like getting the content of a page, do not.

For example, the following code gets the recent changes to Wikipedia.

In [None]:
import requests

endpt = 'https://en.wikipedia.org/w/api.php'


def get_last_pages_changed(n):
    """Return the titles of the n most recently changed articles."""
    params = {'action': 'query',
              'format': 'json',
              'list': 'recentchanges',
              'rcnamespace': '0',  # namespace 0 = regular article pages
              'rclimit': n}
    r = requests.get(endpt, params=params)
    # The changes are nested under 'query' -> 'recentchanges'
    content = r.json()['query']['recentchanges']
    result = []
    for page in content:
        result.append(page['title'])
    return result
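Behind the scenes, the parameter dictionary ends up in the URL as a query string after a `?`. As a rough illustration, the sketch below builds a comparable URL with Python's built-in `urllib.parse.urlencode`. It is a stand-in to show the idea; `requests` does its own encoding when you pass `params`.

```python
from urllib.parse import urlencode

endpt = 'https://en.wikipedia.org/w/api.php'
params = {'action': 'query',
          'format': 'json',
          'list': 'recentchanges',
          'rcnamespace': '0',
          'rclimit': 5}

# Each key/value pair becomes key=value, joined with '&'
full_url = endpt + '?' + urlencode(params)
print(full_url)
```

Pasting a URL like this into a browser is a handy way to check what an API returns before writing more code.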

## Exercise 6

Review the documentation (and Google) to see if you can figure out how to get a list of all of the users who have ever edited the most recently edited Wikipedia page.

In [None]:
## Your code here

def get_editors(title):
    params = {'action': 'query',
              'prop': 'revisions',
              'titles': title,
              'format': 'json',
              'rvlimit': 500,
              'rvprop': 'user|timestamp'}
    r = requests.get(endpt, params=params)
    print(r.json())

get_editors('Purdue University')

# Example 4: Intro to Twitter API

In order to use the Twitter API, you need to do two things:

1. Install tweepy. This is a Python library designed to make it easier to use the API (rather than using `requests` directly). I made [this video](https://www.youtube.com/watch?v=TASX3evcgG4) to walk you through how to install tweepy in Anaconda.

2. To use the Twitter API, you need to be authenticated, and so you need a developer account. [This page](https://wiki.communitydata.science/Intro_to_Programming_and_Data_Science_(Summer_2020)/Twitter_authentication_setup) explains how to get a developer account.

Once you have your keys, you should create a file called `twitter_authentication.py` in the same directory as this file. It should contain the following line (replace the fake string below with the corresponding key from your twitter account):

```
BEARER_TOKEN = 'oxVSzC1OjXOVVYrBvGyy6XKKe772Jdvvw6Opb3bSLdIb'
```

In general, it is a good practice to keep your keys (which should be secret) separate from your code, which you can share. In this case, we put them in a different file and then import them.
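Another common way to keep keys out of your code is to store them in environment variables. The sketch below is one possible approach, not part of this tutorial's setup: the variable name `TWITTER_BEARER_TOKEN` and the helper `load_token` are assumptions chosen for the example.

```python
import os


def load_token(name="TWITTER_BEARER_TOKEN"):
    """Read a secret from an environment variable (hypothetical name)."""
    token = os.environ.get(name)
    if token is None:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Please set the {name} environment variable")
    return token
```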

The following code loads the tweepy library and imports these keys from the `twitter_authentication.py` file, and then prepares to "log in" to your account for the Twitter API.

In [5]:
import tweepy

from twitter_authentication import BEARER_TOKEN

client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)

## Rate Limiting

You will quickly learn that the Twitter API is "rate limited". This means that they will only let each account make a certain number of calls to their API in a given time period. The default rate is quite low: many endpoints only allow 15 calls per 15 minutes.

You may notice above that we had the code:
```
client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)
```
The `wait_on_rate_limit=True` argument tells your code to wait until the limit resets (up to 15 minutes) if it gets back a message that you've exceeded a rate limit. This can get annoying when debugging, so be careful with how often you try things. It often makes sense, for example, to first get a small amount of data that only takes one call and make sure that your code works before trying to get all of the data.
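To make the idea concrete, here is a minimal sketch of "waiting on a rate limit". It is not tweepy's actual implementation: `RateLimitError`, `call_with_wait`, and the `sleep` parameter are hypothetical names invented for this example.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error meaning 'you have made too many calls'."""


def call_with_wait(func, wait_seconds=15 * 60, max_tries=3, sleep=time.sleep):
    """Call func(); if it is rate limited, wait and then try again."""
    for attempt in range(max_tries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_tries - 1:
                raise  # give up after the last allowed try
            sleep(wait_seconds)  # pause until the limit has (hopefully) reset
```

Passing a fake `sleep` function makes a helper like this easy to try out without actually waiting 15 minutes.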

Let's start by getting the last 20 tweets from the @LifeAtPurdue account.

In [18]:
puid = client.get_user(username='LifeAtPurdue').data.id
tweets = client.get_users_tweets(puid, max_results=20)

Tweepy stores the tweets in `.data`, so this loops through all of them and prints the text

In [19]:
for tweet in tweets.data:
    print(tweet.text)

#PurdueUniversity President Mitch Daniels will sit down with serial entrepreneur and philanthropist @JTLonsdale,  general manager of venture capital firm @8vc, to discuss his innovation, investing and more. The Presidential Lecture Series event is Nov. 16. https://t.co/BwuIzTV5TJ
Computer chips are the brains that power all modern electronics but as their demand skyrockets, so does the demand for a trained workforce to design, test and develop them. #Purdue's answer: graduate 1,000 semiconductor engineers annually. https://t.co/3kDoPwxM5q
“Everything Tyler wrote about, I could relate to. We’ve had so many things in common, from types of radiation and chemotherapy to all that we’ve felt throughout these journeys.” — Eric Magallanes, winner of the 2022 Tyler Trent Award. Get to know Eric. ⬇️ https://t.co/B4GtHBmMoe
“Being named a Brand That Matters reflects delivering on that promise, for today and tomorrow.” #FCBrandAwards Learn more: https://t.co/7iY0WDnLdZ
“Purdue has positioned itsel

By default, tweet objects only contain their ID and the text of the tweet, but there's a lot more information you can ask for.

For example, this gets information about what the tweet is replying to and some metrics (retweets, likes, etc.)

In [23]:
tweets = client.get_users_tweets(puid, max_results=20, tweet_fields=['created_at', 'public_metrics', 'referenced_tweets'])

These are also kind of hidden within each tweet object, so this is how we might gather them into a data frame

In [36]:
result = []
for tweet in tweets.data:
    curr_result = {'text': tweet.text,
                   'id': tweet.id,
                   'retweets': tweet.public_metrics['retweet_count'],
                   'replies': tweet.public_metrics['reply_count'],
                   'created_time': tweet.created_at
                  }
    result.append(curr_result)
    
df = pd.DataFrame(result)
                  

In [37]:
df

| | text | id | retweets | replies | created_time |
|:---|:---------|:---------|:---|:---|:--------------------|
| 0 | #PurdueUniversity President Mitch Daniels will... | 1586011926013906948 | 0 | 0 | 2022-10-28 15:08:04+00:00 |
| 1 | Computer chips are the brains that power all m... | 1585772583257083904 | 3 | 0 | 2022-10-27 23:17:00+00:00 |
| 2 | “Everything Tyler wrote about, I could relate ... | 1585738162231459841 | 1 | 0 | 2022-10-27 21:00:14+00:00 |
| 3 | “Being named a Brand That Matters reflects del... | 1585723242203213825 | 2 | 1 | 2022-10-27 20:00:56+00:00 |
| 4 | “Purdue has positioned itself as a brand and u... | 1585723179003432960 | 1 | 1 | 2022-10-27 20:00:41+00:00 |
| 5 | For the 2nd year in a row, #Purdue has been na... | 1585723065169764362 | 4 | 3 | 2022-10-27 20:00:14+00:00 |
| 6 | @resources_david Thank you for reaching out Un... | 1585685689353572358 | 0 | 0 | 2022-10-27 17:31:43+00:00 |
| 7 | @RKRelentless @PurdueLibArts We are so proud o... | 1585681498564444161 | 0 | 0 | 2022-10-27 17:15:04+00:00 |
| 8 | It's the kick-off to #ThisIsPurdue’s ‘Countdow... | 1585677923650093056 | 8 | 0 | 2022-10-27 17:00:52+00:00 |
| 9 | Hats off to @AmericanAir for innovative, inclu... | 1585666412437184513 | 1 | 0 | 2022-10-27 16:15:07+00:00 |


Each call can only return a limited number of tweets. With the older v1.1 `tweepy.API` methods used in the rest of this section, for example, you'll quickly learn that if you raise the `count` argument over 200, you will still only get 200 tweets. If you want to get more than 200 tweets, you may need to use a [cursor](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html).

This is basically tweepy's clever way of breaking what you want to do into multiple calls to the API.

For example, this call will get 350 tweets. The `count` argument (optional) says how many tweets to get per call, and the argument in `.items()` is how many to get in total.

In [None]:
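# Note: `api` must be a v1.1 `tweepy.API` object (not the v2 `client` created above),
# set up with your own credentials before this cell will run.
for tweet in tweepy.Cursor(api.home_timeline, count=175).items(350):
    print(tweet.text)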
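Conceptually, a cursor keeps asking for the next page until it has handed you the total you asked for. The sketch below imitates that with a fake API; `fetch_page` and `cursor_items` are hypothetical stand-ins, not tweepy functions, and no real calls are made.

```python
calls_made = []


def fetch_page(start, count):
    """Hypothetical stand-in for one API call returning `count` fake tweets."""
    calls_made.append((start, count))
    return [f"tweet {i}" for i in range(start, start + count)]


def cursor_items(total, count):
    """Yield `total` items, fetching `count` items per (fake) API call."""
    fetched = 0
    while fetched < total:
        page = fetch_page(fetched, count)
        for item in page:
            if fetched == total:
                return  # stop mid-page once we have enough
            yield item
            fetched += 1


tweets = list(cursor_items(total=350, count=175))
print(len(tweets), "tweets from", len(calls_made), "calls")
```

Every entry in `calls_made` stands for one call that would count against your rate limit, which is why a larger per-call count is usually better.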

## Followers

You can also get information about a user, such as who their followers are.

Here's information about me and some of my followers.

In [None]:
user = api.get_user(screen_name = 'jdfoote')

print(user.screen_name + " has " + str(user.followers_count) + " followers.")

print("They include these 100 people:")

for follower in user.followers(count=100):
    print(follower.screen_name)

Here is what that user object looks like for my user

In [None]:
user._json

And here's the user object for one of my followers, which is nearly identical.

In [None]:
follower._json

Note that 200 is the maximum number of followers that you can get at one time. If you want to get information about all of a user's followers, you will need to use a cursor. If you are getting many followers, you will almost certainly hit rate limits.

In [None]:
f = []
for follower in tweepy.Cursor(api.get_followers, screen_name='jdfoote', count=200).items():
    #print(follower.screen_name)
    f.append(follower.screen_name)

In [None]:
print(f)

## Searching

For most of your research, you may be interested in how people are talking about a given topic. There are two main ways to do this.

The first is the search API ([Official Twitter info on the Search API](https://developer.twitter.com/en/docs/tweets/search/overview)). We only have access to "[Standard Search](https://developer.twitter.com/en/docs/tweets/search/overview/standard)", the most limited of Twitter Search API options, which is limited to the last 7 days.


**Note that if you would like to use Twitter for a project or a paper, you can request access to the Academic Research API, which includes historical search and a much higher limit on the number of tweets you can request**

Unfortunately, the older `tweepy.API` interface used in the examples below doesn't support the v2 API for Twitter, but [here is an example of how to use it](https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/master/Full-Archive-Search/full-archive-search.py) with just requests.


[This page](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) is the documentation for Standard Search and has some helpful intel about modifying the parameters.

Below is a simple example that gets the last 20 tweets about data science.

In [None]:
public_tweets = api.search_tweets('"data science"', count=20)

for tweet in public_tweets:
    print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.text)

Note that many of these results are truncated. If you want the full tweet, you actually have to modify the call a little bit, like so.

In [None]:
public_tweets = api.search_tweets('"data science"', count=20, tweet_mode='extended')

for tweet in public_tweets:
    print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.full_text)

### Additional Search resources

* [Tweepy extended tweets documentation](http://docs.tweepy.org/en/latest/extended_tweets.html)
* [Twitter documentation for crafting queries](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators). This includes things like how to search by geography or remove retweets.

## Streaming

The other option is to "stream" tweets. Instead of looking backward, this keeps you connected to Twitter, and whenever new tweets come in, they are sent to your program. You would typically just keep the program running and keep writing the data that you want to an external file.

As with the search API, there are some caveats. One is that (I believe) there is no guarantee that this is all of the tweets that match. If you try to filter by very popular terms, then Twitter may give you only a sample of them.

In [None]:
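# These four OAuth 1.0a keys are assumed to have been added to `twitter_authentication.py`
from twitter_authentication import (CONSUMER_KEY, CONSUMER_SECRET,
                                    ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

class Streamer(tweepy.Stream):
    def on_status(self, tweet):
        # Called once for every new tweet that matches the filter
        print(tweet.author.screen_name + "\t" + tweet.text)

    def on_request_error(self, status_code):
        # Called when Twitter responds with an HTTP error; stop streaming
        print('Error: ' + repr(status_code))
        self.disconnect()

streamer = Streamer(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

keywords = ['Purdue', '"data science"']
streamer.filter(track=keywords)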

# Exercises


7. Use the streaming API to produce a list of 1000 tweets about a topic.
8. From that list of 1000 tweets, eliminate retweets.
9. For each original tweet, create a dictionary with the number of times you see it retweeted in your dataset.
10. Get a list of the URLs in your dataset.
11. Now, see if you can figure out how to eliminate retweets in the query instead.
12. Get the last 50 tweets from West Lafayette, using the search API. (Hint - look up the geocode information [here](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets)).
13. Alter the streaming algorithm to include a "locations" filter to get tweets from New York City. You need to use the order sw_lng, sw_lat, ne_lng, ne_lat for the four coordinates instead of a radius as in the search API.

### BONUS Questions
1. For each of your followers, get *their* followers (investigate time.sleep to throttle your computation)
2. Identify the follower you have that also follows the most of your followers.
3. How many users follow you but none of your followers?