# Python training for data engineers
## 04. Data ingestion

### Goal
Retrieve information via two methods:
* Web crawling using `requests` and `lxml`
* API crawling using `requests` and `json` parsing

### Web crawling
Import `requests` so we can make HTTP requests from Python:

In [1]:
import requests

Define the URL that we will connect to. In this case we will search for packages related to 'machine learning'. The following code defines the base URL and the query parameters that together make up [`https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search`](https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search).


In [2]:
# URL to connect to
url = 'https://pypi.python.org/pypi'
searchterm = 'machine learning'
params = {':action':'search',
          'term':searchterm,
          'submit':'search'}
params

{':action': 'search', 'submit': 'search', 'term': 'machine learning'}

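Send the request to the PyPI URL. `requests` encodes `params` into the query string, and `response.url` shows the final URL after any redirects.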


In [3]:
response = requests.get(url, params=params)
# Print the response URL
response.url

u'https://pypi.org/pypi/?%3Aaction=search&submit=search&term=machine+learning'
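To see how the parameter dictionary ends up in the query string, here is a minimal sketch using the standard library's `urllib.parse.urlencode` (this assumes Python 3). It only illustrates the encoding; the parameter order in the URL that `requests` reports may differ, as in the output above.

```python
from urllib.parse import urlencode

url = 'https://pypi.python.org/pypi'
params = {':action': 'search',
          'term': 'machine learning',
          'submit': 'search'}

# urlencode percent-encodes reserved characters (':' becomes '%3A')
# and replaces spaces with '+'
query = urlencode(params)
full_url = url + '?' + query
print(full_url)
```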

The response body is a complete HTML page, which `response.content` returns as a byte string. Printing the first 200 bytes shows the start of the page:

In [4]:
response.content[0:200]

'\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale='
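The difference between raw bytes and decoded text can be sketched without a network call. The bytes below are a hypothetical stand-in for `response.content`, and the example assumes Python 3; on a real response, `response.text` gives the decoded string directly.

```python
# Hypothetical stand-in for response.content (raw bytes of an HTML page)
raw = b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">'

# Decode the bytes into text; this sample page declares UTF-8 in its meta tag
html = raw.decode('utf-8')

print(type(raw).__name__)   # bytes
print(type(html).__name__)  # str
print(html[:15])            # <!DOCTYPE html>
```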

Because this training is spread over several notebooks, we save the response content to a file on disk so that a later notebook can load it.

In [5]:
import pickle

# Use a context manager so the file is closed once the data is written
with open("xmlcontent_notebook_04.pickle", "wb") as f:
    pickle.dump(response.content, f)

In [6]:
ls -la xmlcontent_notebook_04.pickle

-rw-rw-r-- 1 jitsejan jitsejan 3785 Apr 17 05:23 xmlcontent_notebook_04.pickle
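Loading the data back in a later notebook is the mirror image of the dump. A minimal round-trip sketch, using a temporary file and a placeholder byte string instead of the real response content (both are assumptions for illustration):

```python
import os
import pickle
import tempfile

# Placeholder for response.content
content = b'<!DOCTYPE html>\n<html lang="en">'

# Write to a temporary directory so the example does not touch the real pickle file
path = os.path.join(tempfile.mkdtemp(), 'example_content.pickle')

with open(path, 'wb') as f:
    pickle.dump(content, f)

# In another notebook: open the file in binary read mode and load it
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == content)  # True
```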


### API

#### Retrieving the data
Let's define the URL and retrieve all tags related to `Python`.

In [7]:
API_URL = 'https://api.stackexchange.com/2.2/tags/python/related?pagesize=100&site=stackoverflow'

Retrieve the JSON response from the API.

In [8]:
import requests
response = requests.get(API_URL)
data = response.json()

The response body is a large JSON document. `response.json()` parses it into Python objects, in this case a dictionary.

In [9]:
data

{u'has_more': False,
 u'items': [{u'count': 935905,
   u'has_synonyms': True,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'python'},
  {u'count': 86578,
   u'has_synonyms': False,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'django'},
  {u'count': 62197,
   u'has_synonyms': True,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'python-3.x'},
  {u'count': 58568,
   u'has_synonyms': False,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'pandas'},
  {u'count': 53041,
   u'has_synonyms': False,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'python-2.7'},
  {u'count': 42335,
   u'has_synonyms': False,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'numpy'},
  {u'count': 30351,
   u'has_synonyms': True,
   u'is_moderator_only': False,
   u'is_required': False,
   u'name': u'list'},
  {u'count': 25677,
   u'has_synonyms': True,
   u'is_moderator

Check the type of the data:

In [10]:
type(data)

dict
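What `response.json()` does can be illustrated with the standard `json` module. The string below is a shortened, hand-written sample based on the first item of the output above, not a real API response:

```python
import json

# Shortened sample of the JSON text the API returns (hand-written for illustration)
raw_json = '{"has_more": false, "items": [{"count": 935905, "name": "python"}]}'

# json.loads turns JSON text into Python objects:
# objects become dicts, arrays become lists, false becomes False
parsed = json.loads(raw_json)

print(type(parsed).__name__)       # dict
print(parsed['has_more'])          # False
print(parsed['items'][0]['name'])  # python
```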

And check how many items there are inside the data:

In [11]:
len(data['items'])

50
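Each entry in `data['items']` is itself a dictionary, so the usual list and dict tools apply. A short sketch on a three-item sample copied from the output above (the variable name `sample` is only for illustration, and the items are reduced to two keys):

```python
# Three items taken from the API output shown earlier, reduced to two keys
sample = {'has_more': False,
          'items': [{'count': 935905, 'name': 'python'},
                    {'count': 86578, 'name': 'django'},
                    {'count': 62197, 'name': 'python-3.x'}]}

# Map each tag name to its count
counts = {item['name']: item['count'] for item in sample['items']}

# Tag names sorted by count, largest first
names = sorted(counts, key=counts.get, reverse=True)

print(names)             # ['python', 'django', 'python-3.x']
print(counts['django'])  # 86578
```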

That's it for a simple API crawler. A more advanced API crawler should also handle pagination, requesting page after page until all results have been retrieved, but that is outside the scope of this course.
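As a rough idea of what pagination involves: the response carries a `has_more` flag, as seen in the output above, so a crawler can keep requesting the next page until that flag is false. The sketch below keeps the HTTP call out of the loop by taking a `fetch_page` function as an argument; `fetch_all`, `fetch_page`, `max_pages` and the fake pages are all hypothetical names and data for illustration.

```python
def fetch_all(fetch_page, max_pages=10):
    """Collect items from consecutive pages until has_more is False.

    fetch_page is any callable that takes a 1-based page number and
    returns a dict with 'items' and 'has_more' keys.
    """
    items = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items.extend(data.get('items', []))
        if not data.get('has_more', False):
            break  # no further pages available
    return items


# Fake two-page API used instead of a real HTTP request
fake_pages = {
    1: {'items': [{'name': 'python'}, {'name': 'django'}], 'has_more': True},
    2: {'items': [{'name': 'pandas'}], 'has_more': False},
}

all_items = fetch_all(fake_pages.get)
print([item['name'] for item in all_items])  # ['python', 'django', 'pandas']
```

With a real API, `fetch_page` could wrap a `requests.get` call that passes the page number as a query parameter. The `max_pages` cap is a safety limit so the loop always terminates; a real crawler would also need to respect the API's rate limits.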

Let's save the parsed JSON data to a pickle file so the next notebook can use it:

In [12]:
# Use a context manager so the file is closed once the data is written
with open("jsoncontent_notebook_04.pickle", "wb") as f:
    pickle.dump(data, f)

In [13]:
ll *.pickle

-rw-rw-r-- 1 jitsejan 6486 Apr 17 05:23 jsoncontent_notebook_04.pickle
-rw-rw-r-- 1 jitsejan 3785 Apr 17 05:23 xmlcontent_notebook_04.pickle
