#### This notebook explains how `api_request_script.py` works

Importing libraries

In [1]:
import sys
import requests
import datetime
import re
import pickle

I collect two command line arguments:
   1. `sys.argv[1]` is the API request (string) I'm making. Let's call it `arg1` in this tutorial.
   2. `sys.argv[2]` is the name (string) of the pickled `.dat` file I'm exporting. Similarly, we'll call it `arg2`.
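
A minimal sketch of how those two arguments might be read and checked. The helper name `read_args` and the usage message are illustrative assumptions, not taken from `api_request_script.py`.

```python
def read_args(argv):
    """Return (api_request, output_filename) from a sys.argv-style list."""
    # argv[0] is the script name, so two arguments means three entries.
    if len(argv) != 3:
        raise SystemExit(
            "usage: python3 api_request_script.py <api_request> <output_file>"
        )
    return argv[1], argv[2]

# Hand-built argument list for illustration; a script would pass sys.argv.
request_url, out_name = read_args(
    ["api_request_script.py", "https://example.org/?q=test", "out.dat"]
)
print(request_url, out_name)
```

In this notebook the two values are simply assigned by hand in the next cell.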

In [2]:
arg1 = 'https://newsapi.org/v2/everything?q=(BTC OR bitcoin)&from=2018-01-01&to=2018-01-10&language=en&sortBy=popularity&apiKey=6d00cdefd3bc4ee38f8a7af69ac5bec4'
arg2 = '2018-01-01_2018-01-10.dat'

These classes help me make the API request, format the data, and collect several pages of a request if possible. The comments give a bit more detail.

In [3]:
"""This class converts a dict to nested objects"""

class Struct(object):
    """
    Attributes will depend on the structure of object. 
    If we keep calling the 'everything' newsapi, then the attributes will be:
    
                    articles: A list of articles, each with their own objects
                    status: Status of request, should be 'ok'
                    totalResults: The total number of results available for the request, will need
                                  to use the &page= parameter to get these as only 20 articles are
                                  returned per request.
                                  
    Resource: https://stackoverflow.com/questions/1305532/convert-python-dict-to-object
    """
    def __init__(self, data):
        for name, value in data.items():
            setattr(self, name, self._wrap(value))

    def _wrap(self, value):
        if isinstance(value, (tuple, list, set, frozenset)): 
            return type(value)([self._wrap(v) for v in value])
        else:
            return Struct(value) if isinstance(value, dict) else value


"""
General class to aggregate all useful objects. 
Could customize, e.g. change structure of get_raw_data to affect data object
"""

class myclass(object):
    """
    Attributes:
                call: The url sent to newsapi
                data: The cleaned response from requesting call, wrapped in a Struct
                      (n_pages is added to it in __init__)
    """
    
    def get_raw_data(self,call):
        r = requests.get(call).json()
        
        for i in r['articles']:
            #don't care about author or image url
            del i['author'] 
            del i['urlToImage']
            #convert publishing date/time to be by day
            #only the leading YYYY-MM-DD is needed, so parse just those 10 characters
            t = datetime.datetime.strptime(i['publishedAt'][:10], "%Y-%m-%d")
            nt = t.replace(hour=0, minute=0, second=0, microsecond=0)
            i['publishedAt'] = str(nt)
            #collect only the name of the source, not id
            i['source'] = i['source']['name']
        
        return r
    
    def __init__(self,call):
        self.call = call
        self.data = Struct(self.get_raw_data(call))
        self.data.n_pages = self.data.totalResults/20
        
    def paginate(self,n):
        """
        Takes call and paginates over a user-specified number of pages (n) to
        provide a single flat list of article objects
        """
        #If page argument already exists in call, remove it
        fp = self.call.find('&page=')
        if fp > 0:
            l = [x for x, v in enumerate(self.call) if v == '&']
            l.append(len(self.call))
            nxt = l[next(x[0] for x in enumerate(l) if x[1] > fp)]
            base_call = self.call[:fp] + self.call[nxt:]
        else:
            base_call = self.call
        
        #loop over pages and add article objects to list
        articles_list = []
        
        for i in range(1,n+1):
            new_call = base_call + "&page=" + str(i)
            d = Struct(self.get_raw_data(new_call))
            articles_list.extend(d.articles)
            
        return articles_list
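
The `&page=` handling in `paginate` can also be expressed with the standard library's URL tools. This is only a sketch of an alternative, not what the script does, and the helper name `with_page` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page):
    """Return url with any existing 'page' parameter replaced by `page`."""
    parts = urlsplit(url)
    # Keep every query parameter except an existing 'page'
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != "page"
    ]
    params.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))

url = "https://newsapi.org/v2/everything?q=bitcoin&page=7&language=en"
print(with_page(url, 2))
```

Note that `urlencode` percent-encodes the values, so spaces and parentheses in a query such as `(BTC OR bitcoin)` would come out encoded rather than verbatim.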


In [4]:
n =  myclass(arg1)

What does the API response look like after it has gone through `myclass()`?

In [5]:
n.data.__dict__

{'articles': [<__main__.Struct at 0x7f3cf80fcd30>,
  <__main__.Struct at 0x7f3cf80fcc50>,
  <__main__.Struct at 0x7f3cf80fc9e8>,
  <__main__.Struct at 0x7f3cf80fcb70>,
  <__main__.Struct at 0x7f3cf80fca20>,
  <__main__.Struct at 0x7f3cff00f438>,
  <__main__.Struct at 0x7f3d0165de80>,
  <__main__.Struct at 0x7f3cf8ce05f8>,
  <__main__.Struct at 0x7f3cf8ce0cc0>,
  <__main__.Struct at 0x7f3cf8ce02b0>,
  <__main__.Struct at 0x7f3cf8ce0d68>,
  <__main__.Struct at 0x7f3cf8ce0d30>,
  <__main__.Struct at 0x7f3cf8ce03c8>,
  <__main__.Struct at 0x7f3cf8ce0518>,
  <__main__.Struct at 0x7f3cf8ce0e10>,
  <__main__.Struct at 0x7f3cf8ce01d0>,
  <__main__.Struct at 0x7f3cf8ce06a0>,
  <__main__.Struct at 0x7f3cf8ce0438>,
  <__main__.Struct at 0x7f3cf8ce04e0>,
  <__main__.Struct at 0x7f3cf811b4a8>],
 'n_pages': 265.15,
 'status': 'ok',
 'totalResults': 5303}
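
The `Struct` entries above print as opaque object addresses. A small sketch of how one might look inside them, using a made-up response of the same shape so that it runs without calling the API; `to_plain` is a hypothetical helper, and `Struct` is repeated in compact form only to keep the block self-contained.

```python
class Struct(object):
    """Compact copy of the Struct class defined above."""
    def __init__(self, data):
        for name, value in data.items():
            setattr(self, name, self._wrap(value))

    def _wrap(self, value):
        if isinstance(value, (tuple, list, set, frozenset)):
            return type(value)([self._wrap(v) for v in value])
        return Struct(value) if isinstance(value, dict) else value

def to_plain(obj):
    """Recursively turn Struct objects back into dicts for printing."""
    if isinstance(obj, Struct):
        return {k: to_plain(v) for k, v in vars(obj).items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

# Made-up data shaped like the cleaned 'everything' response
sample = {
    "status": "ok",
    "totalResults": 1,
    "articles": [{"source": "Example News", "title": "Example title",
                  "publishedAt": "2018-01-01 00:00:00"}],
}
data = Struct(sample)
print(data.articles[0].title)      # attribute access instead of dict keys
print(to_plain(data.articles[0]))  # readable view of one article
```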

We can use `data.n_pages` to decide how many pages to collect (a potential input to `.paginate()`). I provide some logic here so that the user doesn't accidentally exceed their limit of 1000 requests per day :)

In [6]:
if n.data.n_pages > 900:
    p = 900
else:
    p = int(n.data.n_pages)
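
Because `int()` truncates, 265.15 becomes 265 and the final, partly filled page is not counted. If that page matters, one possible variant rounds up before applying the cap. The names `PAGE_SIZE`, `MAX_REQUESTS` and `pages_to_fetch` are assumptions for this sketch.

```python
import math

PAGE_SIZE = 20      # articles per request, as noted in the Struct docstring
MAX_REQUESTS = 900  # same safety cap as the cell above

def pages_to_fetch(total_results, page_size=PAGE_SIZE, cap=MAX_REQUESTS):
    # Round up so a final, partly filled page is still counted
    return min(cap, math.ceil(total_results / page_size))

print(pages_to_fetch(5303))  # 5303 / 20 = 265.15, rounded up to 266
```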

Now we can collect all pages of articles in one list from our API request. For now I'll just collect 10.

In [7]:
l = n.paginate(10)
len(l)  #should be 20 * 10, since 10 pages were requested

200

Lastly, we need to store this list of objects for later use! We'll pickle it.

In [8]:
with open(arg2, "wb") as f:
    pickle.dump(l, f)
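
For completeness, a sketch of the round trip: writing a list of objects and reading it back with `pickle.load`. It uses `types.SimpleNamespace` as a stand-in for the `Struct` objects and a temporary directory instead of `arg2`, both of which are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-ins for the article objects collected by paginate()
articles = [SimpleNamespace(title="first"), SimpleNamespace(title="second")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dat")

    # Same pattern as the cell above
    with open(path, "wb") as f:
        pickle.dump(articles, f)

    # Reading the file back later
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded), loaded[0].title)
```

When loading a file of real `Struct` objects, the `Struct` class has to be defined or importable in the session doing the loading, because pickle stores a reference to the class rather than its code.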

An example command line call:

python3 api_request_script.py 'https://newsapi.org/v2/everything?q=(BTC OR bitcoin)&from=2018-01-01&to=2018-01-10&language=en&sortBy=popularity&apiKey=6d00cdefd3bc4ee38f8a7af69ac5bec4' '2018-01-01_2018-01-10.dat'
