## Extracting Data from a Public API

**Using authentication tokens, handling pagination, understanding rate limits, and automating data extraction**

### Extracting Data from an API Using the Requests Module

Problem: determine the popularity of different programming languages such as Python, Java, and JavaScript. We could extract data from GitHub on the languages used by popular repositories and determine the prevalence of each language.
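As a rough picture of the end goal, the sketch below tallies the primary language across a list of repository records. It assumes each record is a dictionary with a `language` key, as the items returned by GitHub's repository search are. The sample records and the `count_languages` helper are made up for illustration.

```python
from collections import Counter


def count_languages(repos):
    """Count how often each primary language appears in a list of repo dicts."""
    # Repositories without a detected language report None; skip those.
    return Counter(repo['language'] for repo in repos if repo.get('language'))


# Illustrative records only, not real API output.
sample_repos = [
    {'name': 'repo-a', 'language': 'Python'},
    {'name': 'repo-b', 'language': 'Java'},
    {'name': 'repo-c', 'language': 'Python'},
    {'name': 'repo-d', 'language': None},
]

language_counts = count_languages(sample_repos)
print(language_counts.most_common())
```

In practice the list of records would come from the API responses collected in the steps that follow.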

Representational State Transfer (REST) relies on HTTP as its communication protocol, including the use of status codes to signal whether a call succeeded or failed. It defines data types loosely and uses JSON heavily, though other formats are also supported.

Step 1: The first API call lists the public repositories on GitHub.

In [3]:
import requests
response = requests.get('https://api.github.com/repositories',headers={'Accept': 'application/vnd.github.v3+json'})
print(response.status_code)

# v3 is requested because v4 of the GitHub API is based on GraphQL

200


In [4]:
# Inspect the encoding, content type, and server details returned by the API
print(response.encoding)
print(response.headers['Content-Type'])
print(response.headers['server'])

utf-8
application/json; charset=utf-8
GitHub.com


In [5]:
response = requests.get('https://api.github.com/search/repositories')
print (response.status_code)

422


Status code 422 (Unprocessable Entity) indicates that the request was well formed, but the server was not able to process it. This is because we have not provided the search query parameter that the docs specify as required.
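Since REST APIs report success or failure through the status code, it can help to check the code before touching the response body. The following is a minimal sketch; `classify_status` is a hypothetical helper, not part of `requests`.

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse category."""
    if 200 <= status_code < 300:
        return 'success'
    if 400 <= status_code < 500:
        return 'client error'  # e.g. 404 or the 422 seen above
    if 500 <= status_code < 600:
        return 'server error'
    return 'other'  # 1xx informational and 3xx redirect codes


for code in (200, 422, 503):
    print(code, classify_status(code))
```

With `requests`, calling `response.raise_for_status()` is a shorter alternative: it raises an exception for 4xx and 5xx responses.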

In [6]:
response = requests.get('https://api.github.com/search/repositories',
    params={'q': 'data_science+language:python'},
    headers={'Accept': 'application/vnd.github.v3.text-match+json'})
print(response.status_code)

# The search query looks for data_science, filters the language by Python
# (language:python), and combines the two with +. The constructed query is
# passed as the q argument in params. The Accept header specifies
# text-match+json so that the response includes the text-match metadata
# and is returned in JSON format.

200
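To see how the query travels in the URL, the sketch below percent-encodes the same `q` value with the standard library's `urlencode`. It only illustrates the encoding; `requests` builds the final URL itself when given `params`.

```python
from urllib.parse import urlencode

query = 'data_science+language:python'

# Reserved characters such as '+' and ':' are percent-encoded.
encoded = urlencode({'q': query})
print(encoded)
```

The `+` and `:` characters become `%2B` and `%3A`, which is the form that appears in the page URLs returned by the API.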


In [9]:
from IPython.display import Markdown, display  
def printmd(string): 
    display(Markdown(string))  

# list the first 10 repositories

for item in response.json()['items'][:10]:
    printmd('**' + item['name'] + '**' + ': repository ' +
            item['text_matches'][0]['property'] + ' - \"*' +
            item['text_matches'][0]['fragment'] + '*\" matched with ' + '**' +
            item['text_matches'][0]['matches'][0]['text'] + '**')

**data-science-from-scratch**: repository description - "*code for Data Science From Scratch book*" matched with **Data Science**

**data-science-blogs**: repository description - "*A curated list of data science blogs*" matched with **data science**

**galaxy**: repository description - "*Data intensive science for everyone.*" matched with **Data**

**data-scientist-roadmap**: repository description - "*Toturials coming with the "data science roadmap" picture.*" matched with **data science**

**DataCamp**: repository description - "*DataCamp data-science courses*" matched with **data**

**dsp**: repository description - "*data science preparation*" matched with **data science**

**Kaggler**: repository description - "*Code for Kaggle Data Science Competitions*" matched with **Data Science**

**cookiecutter-data-science**: repository description - "*A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.*" matched with **data science**

**PDA_Book**: repository description - "*Code Examples Data Science using Python*" matched with **Data Science**

**kedro**: repository description - "*A Python framework for creating reproducible, maintainable and modular data science code.*" matched with **data science**
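The loop above indexes straight into `text_matches`, which fails if an item carries no match metadata. A more defensive variant could look like the sketch below; `summarize_match` and the sample item are hypothetical, with the item shaped like one entry of `response.json()['items']`.

```python
def summarize_match(item):
    """Return a one-line summary of the first text match, or None if absent."""
    matches = item.get('text_matches') or []
    if not matches:
        return None
    first = matches[0]
    prop = first.get('property', '')
    fragment = first.get('fragment', '')
    # 'matches' lists the matched terms; fall back to an empty entry.
    terms = first.get('matches') or [{}]
    matched_text = terms[0].get('text', '')
    return f'{item["name"]}: {prop} - "{fragment}" matched with {matched_text}'


# Hypothetical item for illustration.
sample_item = {
    'name': 'data-science-blogs',
    'text_matches': [{
        'property': 'description',
        'fragment': 'A curated list of data science blogs',
        'matches': [{'text': 'data science'}],
    }],
}

print(summarize_match(sample_item))
print(summarize_match({'name': 'no-matches'}))
```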

## Pagination

The GitHub API paginates its results: it returns only one page at a time, and in this case each page contains 30 results. The links attribute of the response object contains the URLs of the next and last pages, from which the number of pages can be read.

In [10]:
response.links


{'next': {'url': 'https://api.github.com/search/repositories?q=data_science%2Blanguage%3Apython&page=2',
  'rel': 'next'},
 'last': {'url': 'https://api.github.com/search/repositories?q=data_science%2Blanguage%3Apython&page=34',
  'rel': 'last'}}

**The next field provides us with a URL to the next page, which would contain the next 30 results, while the last field provides a link to the last page, which provides an indication of how many search results there are in total.**
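A minimal sketch of following these links is shown below. It assumes `session` is any object with a `get` method that returns `requests`-style responses (for example a `requests.Session()`); `last_page_number`, `iter_pages`, and the `max_pages` cap are hypothetical names introduced for this example.

```python
from urllib.parse import urlparse, parse_qs


def last_page_number(links):
    """Read the page number from the 'last' link, or None if there is none."""
    last = links.get('last')
    if not last:
        return None
    query = parse_qs(urlparse(last['url']).query)
    return int(query['page'][0])


def iter_pages(session, url, params=None, headers=None, max_pages=3):
    """Yield each page's parsed JSON, following 'next' links up to max_pages."""
    for _ in range(max_pages):
        response = session.get(url, params=params, headers=headers)
        response.raise_for_status()
        yield response.json()
        next_link = response.links.get('next')
        if not next_link:
            break
        # The 'next' URL already carries the query string.
        url, params = next_link['url'], None


# The links dictionary shown above.
links = {
    'next': {'url': 'https://api.github.com/search/repositories'
                    '?q=data_science%2Blanguage%3Apython&page=2',
             'rel': 'next'},
    'last': {'url': 'https://api.github.com/search/repositories'
                    '?q=data_science%2Blanguage%3Apython&page=34',
             'rel': 'last'},
}
print(last_page_number(links))
```

The cap on pages keeps the loop from exhausting the rate limit by accident. With 30 results per page, a last page of 34 implies at most 34 × 30 = 1,020 results.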

## Rate limiting

To ensure that an API can continue serving all users and to avoid excessive load on their infrastructure, providers often enforce rate limits. A rate limit specifies how many requests can be made to an endpoint within a certain time frame.

In [11]:
response = requests.head(
    'https://api.github.com/repos/pytorch/pytorch/issues/comments')
print('X-Ratelimit-Limit', response.headers['X-Ratelimit-Limit'])
print('X-Ratelimit-Remaining', response.headers['X-Ratelimit-Remaining'])

# Convert the Unix epoch timestamp (seconds since 1970, UTC) to a human-readable local time
import datetime
print(
    'Rate Limits reset at',
    datetime.datetime.fromtimestamp(int(
        response.headers['X-RateLimit-Reset'])).strftime('%c'))

X-Ratelimit-Limit 60
X-Ratelimit-Remaining 58
Rate Limits reset at Wed Apr 20 15:05:48 2022


X-Ratelimit-Limit indicates how many requests can be made per unit of time (one hour in this case), X-Ratelimit-Remaining is the number of requests that can still be made without violating the rate limit, and X-RateLimit-Reset indicates the time at which the limit resets. It’s possible for different API endpoints to have different rate limits.
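These headers can be used to pause before the next call when the quota is used up. The sketch below is one way to compute the waiting time; `seconds_until_reset` is a hypothetical helper and the header values are made up for illustration.

```python
import time
from datetime import datetime, timezone


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the rate limit window resets.

    Returns 0 while requests remain in the current window.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0
    reset_at = int(headers['X-RateLimit-Reset'])
    return max(0, reset_at - int(now))


# Made-up header values; the API returns them as strings.
exhausted = {'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1650466548'}
print(seconds_until_reset(exhausted, now=1650466500))

# A timezone-aware conversion of the reset timestamp.
reset_time = datetime.fromtimestamp(1650466548, tz=timezone.utc)
print(reset_time.isoformat())
```

With a real response, `time.sleep(seconds_until_reset(response.headers))` would wait only when no requests remain. The header mapping in `requests` is case-insensitive, whereas the plain dictionary here needs the exact key spelling.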

### Extracting Twitter Data with Tweepy

General steps:

1. Obtain credentials.
2. Create an app.

In [13]:
# Install tweepy first (for example with pip install tweepy), then import it

import tweepy

# Replace the placeholders with your own app credentials; never publish real keys
app_api_key = 'YOUR_APP_API_KEY'
app_api_secret_key = 'YOUR_APP_API_SECRET_KEY'


auth = tweepy.AppAuthHandler(app_api_key, app_api_secret_key)
api = tweepy.API(auth)

print ('API Host', api.host)



API Host api.twitter.com
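Credentials are better kept out of the notebook itself. One common approach, sketched below, is to read them from environment variables. The variable names `TWITTER_API_KEY` and `TWITTER_API_SECRET` and the `load_credentials` helper are assumptions made for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the app key and secret from environment variables.

    TWITTER_API_KEY and TWITTER_API_SECRET are hypothetical names chosen
    for this sketch; use whatever names suit your setup.
    """
    try:
        return env['TWITTER_API_KEY'], env['TWITTER_API_SECRET']
    except KeyError as missing:
        raise RuntimeError(f'Missing environment variable: {missing}') from None


# Demonstrated with a plain dict standing in for the real environment.
key, secret = load_credentials(
    {'TWITTER_API_KEY': 'abc', 'TWITTER_API_SECRET': 'xyz'})
print(key, secret)
```

The returned pair could then be passed to the authentication handler in place of the hardcoded strings.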


In [16]:
import pandas as pd
search_term = 'cryptocurrency'

tweets = tweepy.Cursor(api.search_tweets,
                       q=search_term,
                       lang="en").items(100)

retrieved_tweets = [tweet._json for tweet in tweets]
df = pd.json_normalize(retrieved_tweets)

df[['text']].sample(3)

|    | text |
|---:|:-----|
| 70 | Stock market these days more volatility than #... |
| 93 | RT @tavalez74: #FireANTS #Metaverse Museum abo... |
| 16 | RT @atnircapital: ATNIR Capital is a decentral... |


**We have successfully completed the API call and can see the text of the retrieved tweets in the previous table, which already shows some interesting aspects. For example, we see the prefix RT, which indicates a retweet (where the user has shared another tweet).**
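The RT prefix can serve as a simple heuristic for separating retweets from original tweets. The sketch below applies it to the three sample texts from the table; `is_retweet` is a hypothetical helper, and the prefix check is only an approximation.

```python
def is_retweet(text):
    """Heuristic: classic retweets start with 'RT @username'."""
    return text.startswith('RT @')


sample_texts = [
    'Stock market these days more volatility than #...',
    'RT @tavalez74: #FireANTS #Metaverse Museum abo...',
    'RT @atnircapital: ATNIR Capital is a decentral...',
]

retweet_flags = [is_retweet(text) for text in sample_texts]
print(retweet_flags)
```

On the DataFrame itself, the same check could be written as `df['text'].str.startswith('RT @')`.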

## Extracting Data from the Streaming API

Some APIs provide near real-time data, also referred to as streaming data. In such a scenario, the API pushes the data to us rather than waiting for a GET request, as in the examples so far.

In [18]:
# Tweepy already provides basic functionality in the StreamListener class that contains the on_data function.
# This function is called each time a new tweet is pushed by the streaming API,
# and we can customize it to implement logic that is specific to certain use cases.
# Note: StreamListener belongs to Tweepy v3; in Tweepy v4 its callbacks were merged into the Stream class.
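The push model can be illustrated without a live connection. The sketch below does not use Tweepy: `CollectingListener` and `run_stream` are hypothetical stand-ins, and only the `on_data` name mirrors the callback mentioned above. A simulated source pushes raw JSON messages to the listener, which stops the stream after a fixed number of tweets.

```python
import json


class CollectingListener:
    """Stand-in listener: stores each pushed message up to a limit."""

    def __init__(self, limit=2):
        self.limit = limit
        self.tweets = []

    def on_data(self, raw_data):
        """Handle one pushed message; return False to ask the stream to stop."""
        self.tweets.append(json.loads(raw_data))
        return len(self.tweets) < self.limit


def run_stream(messages, listener):
    """Simulate a streaming source that pushes messages to the listener."""
    for raw in messages:
        if not listener.on_data(raw):
            break


# Made-up messages standing in for pushed tweets.
fake_stream = ['{"text": "first"}', '{"text": "second"}', '{"text": "third"}']
listener = CollectingListener(limit=2)
run_stream(fake_stream, listener)
print([tweet['text'] for tweet in listener.tweets])
```

The point of the design is that the consumer never asks for data. It only decides what to do with each message and when to stop.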