# Python training for data engineers
## 04. Data ingestion

### Goal
This notebook describes how to retrieve information via two methods:
* Web crawling using `requests` and `lxml`
* API crawling using `requests` and `json` parsing

After retrieving the data we do some simple processing and save the data to a different format.

Other commonly used ways of importing data include:
* Read CSV directly
* Read Excel
* Connect to a relational database (PostgreSQL, MySQL)
* Use a Python library to connect to a database (e.g. `pymongo` for MongoDB)
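
As a rough illustration of the first and third options, the sketch below reads CSV text and loads it into a database using only the standard library. The CSV content, the table name `tags` and the column `tag_count` are made up for this example, and an in-memory SQLite database stands in for a real database server.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would come from a file on disk
csv_text = "name,count\npython,940527\ndjango,86864\n"

# Read the CSV rows into a list of dictionaries (all values are strings)
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the rows into an in-memory SQLite database and query them back
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, tag_count INTEGER)")
conn.executemany("INSERT INTO tags (name, tag_count) VALUES (:name, :count)", rows)
top = conn.execute("SELECT name, tag_count FROM tags ORDER BY tag_count DESC").fetchall()
conn.close()
top
```

The same pattern (read rows, insert, query) carries over to other databases, although each driver has its own connection call and placeholder style.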

### Web crawling
Import `requests` so we can make URL requests in Python:

In [1]:
import requests

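Define the URL that we will connect to. In this case we will look for packages related to 'scikit'. `requests` appends the parameters to the URL as a query string, so the request ends up at [`https://pypi.org/search/?q=scikit`](https://pypi.org/search/?q=scikit). Note that the code uses the older `pypi.python.org` host, which redirects to `pypi.org`, as the response URL further down shows.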


In [2]:
# URL to connect to
url = 'https://pypi.python.org/search/'
searchterm = 'scikit'
params = {
    'q': searchterm
}
params

{'q': 'scikit'}

Send the request to the PyPI URL.


In [3]:
response = requests.get(url, params=params)
# Print the response URL
response.url

'https://pypi.org/search/?q=scikit'
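
To see how such a URL is assembled without sending a request, the query string can be built by hand with the standard library. This is only a sketch of the idea and not necessarily how `requests` does it internally; the names `base_url`, `query_string` and `full_url` are chosen for this example.

```python
from urllib.parse import urlencode

base_url = 'https://pypi.org/search/'
params = {'q': 'scikit'}

# urlencode turns the dictionary into a query string and escapes special characters
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"
full_url  # 'https://pypi.org/search/?q=scikit'
```

A search term with a space, such as `'scikit learn'`, would be encoded as `q=scikit+learn`.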

The response contains the whole HTML page as a `bytes` object, as can be seen when printing the content. Only the first 200 bytes are shown to keep this notebook readable.

In [4]:
response.content[0:200]

b'\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale='
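
The `b` prefix marks the value as `bytes` rather than `str`. A minimal sketch of the difference, using a short sample copied from the start of the output above and assuming the page is UTF-8 encoded, as its `meta charset` tag states:

```python
# A short bytes sample copied from the start of the response shown above
raw = b'\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">'

# bytes have to be decoded with the right encoding to become a str
text = raw.decode('utf-8')
type(raw), type(text)
```

`requests` also exposes an already decoded version of the body as `response.text`.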

Because we work with different notebooks, we save the variable to a file on disk so we can use it later on. `pickle` is a Python standard-library module that is often used to serialize Python objects to a file as-is. Pickling works for most Python types and the original type is restored on loading, so the developer is expected to know what was saved and what is being loaded, e.g. code that expects a dictionary should not load a pickle that contains an integer.

In [5]:
import pickle
# The context manager closes the file once the dump is done
with open("xmlcontent_notebook_04.pickle", "wb") as f:
    pickle.dump(response.content, f)

In [6]:
ls -la xmlcontent_notebook_04.pickle

-rw-r--r--  1 jitsejan  staff  325463 Apr 25 01:36 xmlcontent_notebook_04.pickle
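
To check that a pickle restores the same value and type, a round trip can be sketched as follows. The `content` value is a small stand-in for `response.content`, and the file name `roundtrip_example.pickle` is hypothetical; it is written to a temporary directory so the files of this course are left alone.

```python
import os
import pickle
import tempfile

# Stand-in for response.content; the real page is much larger
content = b'<!DOCTYPE html>\n<html lang="en"></html>'

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'roundtrip_example.pickle')

# Write the object to disk
with open(path, 'wb') as f:
    pickle.dump(content, f)

# Read it back; pickle restores the original type (bytes)
with open(path, 'rb') as f:
    restored = pickle.load(f)

restored == content, type(restored)
```

Loading a pickle can execute arbitrary code, so only unpickle files from a source you trust.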


### API

#### Retrieving the data
Let's define the URL and retrieve all tags related to `Python`.

In [7]:
API_URL = 'https://api.stackexchange.com/2.2/tags/python/related?pagesize=100&site=stackoverflow'

Retrieve the JSON response from the API.

In [8]:
import requests
response = requests.get(API_URL)
data = response.json()

`response.json()` parses the JSON body of the response into Python objects. In this case the result is a (large) dictionary.

In [9]:
data

{'items': [{'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 940527,
   'name': 'python'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 86864,
   'name': 'django'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 62823,
   'name': 'python-3.x'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 59204,
   'name': 'pandas'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 53225,
   'name': 'python-2.7'},
  {'has_synonyms': False,
   'is_moderator_only': False,
   'is_required': False,
   'count': 42574,
   'name': 'numpy'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 30503,
   'name': 'list'},
  {'has_synonyms': True,
   'is_moderator_only': False,
   'is_required': False,
   'count': 25829,
   'name': 'matplotlib'},
 

Check the type of the data:

In [10]:
type(data)

dict
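
The same conversion can be done by hand with the `json` module, which may make the step from JSON text to a dictionary more visible. The string below is a shortened, hand-written sample with the same shape as the API response, not the real payload.

```python
import json

# A shortened JSON string with the same shape as the API response above
raw_json = '{"items": [{"has_synonyms": true, "count": 940527, "name": "python"}]}'

# json.loads maps JSON objects to dicts, arrays to lists and true/false to True/False
parsed = json.loads(raw_json)
type(parsed), parsed['items'][0]['has_synonyms']
```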

And check how many items there are inside the data:

In [11]:
len(data['items'])

50
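
Since every item is a plain dictionary, simple processing needs no extra libraries. The sketch below uses a few items copied from the output above (reduced to two keys) and an arbitrary threshold of 50000; the variable names are chosen for this example.

```python
# A few items copied from the output above, reduced to the keys used here
sample = {'items': [
    {'count': 940527, 'name': 'python'},
    {'count': 86864, 'name': 'django'},
    {'count': 59204, 'name': 'pandas'},
    {'count': 42574, 'name': 'numpy'},
]}

# Map each tag name to its count
counts = {item['name']: item['count'] for item in sample['items']}

# Keep only the tags above an arbitrary threshold
popular = [name for name, count in counts.items() if count > 50000]
popular  # ['python', 'django', 'pandas']
```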

That's it for a simple API crawler. A more advanced API crawler should also handle pagination to walk through many pages of results, but that is outside the scope of this course.
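
For reference only, a pagination loop could look roughly like the sketch below. It assumes each response carries an `items` list and a `has_more` flag, which is how the Stack Exchange API wraps its responses as far as I know. The function `fetch_all`, the `max_pages` guard and the fake pages are hypothetical and replace real network calls, so the cell runs offline.

```python
def fetch_all(fetch_page, max_pages=10):
    """Collect items page by page until the API reports no more pages."""
    items = []
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        items.extend(payload.get('items', []))
        # Stop when the (assumed) has_more flag is missing or False
        if not payload.get('has_more'):
            break
        page += 1
    return items

# Fake pages standing in for real API responses
FAKE_PAGES = {
    1: {'items': [{'name': 'python'}, {'name': 'django'}], 'has_more': True},
    2: {'items': [{'name': 'pandas'}], 'has_more': False},
}

def fake_fetch(page):
    return FAKE_PAGES[page]

all_items = fetch_all(fake_fetch)
[item['name'] for item in all_items]  # ['python', 'django', 'pandas']
```

In real use, `fetch_page` could wrap a `requests.get` call that passes the page number as a parameter and returns `response.json()`. The `max_pages` guard keeps a misbehaving API from causing an endless loop.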

Let's save the parsed JSON data to a pickle file so the next notebook can use it:

In [12]:
with open("jsoncontent_notebook_04.pickle", "wb") as f:
    pickle.dump(data, f)

In [13]:
ll *.pickle

-rw-r--r--  1 jitsejan  staff  359970 Apr 25 00:31 json_dataframe_notebook_05.pickle
-rw-r--r--  1 jitsejan  staff   21582 Apr 25 00:54 json_dataframe_notebook_07.pickle
-rw-r--r--  1 jitsejan  staff    1942 Apr 25 01:36 jsoncontent_notebook_04.pickle
-rw-r--r--  1 jitsejan  staff    5534 Apr 25 00:31 xml_dataframe_notebook_05.pickle
-rw-r--r--  1 jitsejan  staff  325463 Apr 25 01:36 xmlcontent_notebook_04.pickle


### Important links

[Using Pickle](https://wiki.python.org/moin/UsingPickle)