# Getting and processing data

This week, we will cover the topic of getting and processing data. Given a research problem, where can you find the relevant data? How do you obtain the data? And how do you actually process the data? This notebook aims to guide you through the process.

**At the end of this week, you will be able to:**
- Get data from the web using an API or the `requests` library.
- Process data from the web using different tools.
- Use generator functions.
- Use `try` and `except` to handle errors.

**This requires that you already have (some) knowledge about:**
- JSON files
- files, loops and functions

**If you want to learn more about these topics, you might find the following links useful:**
- Video: [Loop like a native](http://nedbatchelder.com/text/iter.html)

**Important**: Please install the following modules before class. For this, you'll need to use the command line environment. On a Mac, use the Terminal application. On Windows, use cmd ([see this video](https://www.youtube.com/watch?v=EohzkYPV6nI)).
- GeoPy -- type: `pip install geopy`
- pyspotlight -- type: `pip install pyspotlight`
- SpaCy

To install SpaCy, enter the following commands on the command line.

* `conda config --add channels spacy`
* `conda install spacy` 
* `python -m spacy.en.download` 

If you're on Windows, the last command might give an error. Don't worry, you just need to change the name of the `en-1.1.0.tmp` folder manually. Go to `YOUR_ANACONDA_FOLDER\lib\site-packages\spacy\data\en-1.1.0.tmp` and remove `.tmp` from the folder name.



## Preliminaries: generators and error handling

This notebook has two sides: 

1. a theoretical side in which we'll cover some important programming concepts, and
2. a more practical side in which we'll explore APIs and processing data using NLP tools.

We'll first focus on the theory, and then apply that theory in the second half of this notebook.

### Error handling

By now you've probably seen several different error messages. But to be sure, let's execute some broken code!

In [None]:
capitals_dict = {"The Netherlands": "Amsterdam"}
print(capitals_dict["France"])

.

.

.

.

.

.

.

Oh my! 😱 Python's complaining! There are two ways around this error. Here is the first, familiar way:

In [None]:
country = "France"
# Check if the country is in the dictionary before getting the associated value.
if country in capitals_dict:
    print(capitals_dict[country])
# If that's not the case, do something else:
else:
    print("Country not in the dictionary!")

Here is another way, using `try` and `except`-statements:

In [None]:
# Just try to look up the capital of the country.
try:
    print(capitals_dict[country])
# Except if that fails, then print something.
except KeyError:
    print("Country not in the dictionary!")

The difference between these two is that in the former, you check whether it's OK to execute the first bit of code before actually going ahead and running it. This is called the "look before you leap" coding style ([LBYL](https://docs.python.org/3.6/glossary.html#term-lbyl)). The alternative is to just run the code, and see if it breaks down. If the code breaks down, then you execute some other piece of code. This coding style is associated with the slogan "It's easier to ask for forgiveness than for permission" ([EAFP](https://docs.python.org/3.6/glossary.html#term-eafp)).

So when do you use which style? Basically, it comes down to these two questions:

* How often does the exception happen? If the exception is common, then using the if-statement is better. But if exceptions are rare, then it's better to just run the code and catch the error with the `except`-statement. (Else you'd be performing loads of unnecessary checks.)
* How costly is the operation that might give you an error? If it's a very heavy operation, you might want to make sure whether it's OK to run it in the first place. But if the operation is very light, then that's not a very big issue.

Read more about errors [here](https://docs.python.org/3.6/tutorial/errors.html).
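To make the contrast concrete, here is a minimal sketch of both styles applied to a different operation: turning text into a whole number. The function names `to_int_lbyl` and `to_int_eafp` are made up for this illustration. Note how the LBYL check has to anticipate every valid input in advance, while the EAFP version simply lets `int()` decide.

```python
def to_int_lbyl(text):
    "LBYL: check whether the text looks like a number before converting it."
    # str.isdigit() only accepts digits, so it rejects a minus sign.
    if text.strip().isdigit():
        return int(text)
    return None


def to_int_eafp(text):
    "EAFP: just try the conversion, and handle the error if it fails."
    try:
        return int(text)
    except ValueError:
        return None


for raw in ['42', ' 7 ', '-3', 'seven']:
    print(repr(raw), '| LBYL:', to_int_lbyl(raw), '| EAFP:', to_int_eafp(raw))
```

The two functions agree on most inputs, but the LBYL version rejects `'-3'` because our check was too strict. That is a common risk with "look before you leap": the check and the operation can disagree.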

### Generators

Generators are functions that produce items one at a time, forgetting each item immediately after producing it and moving on to the next one. This is very memory-efficient, because your computer doesn't have to keep a list with all results in memory.

OK, that was an abstract definition. Let's see some examples.

In [None]:
def awesome_counter(n):
    "Generator that produces a message for each whole number from 0 up to (but not including) n."
    # Keep running until the counter has reached n.
    for count in range(n):
        # Perform any operation you want.
        awesome_string = "The number %s is awesome!" % count
        
        # Produce the message for the current value of the counter.
        yield awesome_string

for message in awesome_counter(10):
    print(message)

At each point in time, the loop counter only refers to one number. In each iteration of the for-loop, `awesome_counter` produces the current value of `awesome_string`, but it doesn't remember the value! This is different from a function like this:

In [None]:
def awesome_list_counter(n):
    "Function that produces a list with a message for each whole number from 0 up to (but not including) n."
    # Initialize results list. This is where ALL results will be stored, which will take a lot of memory
    # for large values of n.
    messages = []
    for count in range(n):
        # Perform any operation you want.
        awesome_string = "The number %s is awesome!" % count

        # Append the current message to the list.
        messages.append(awesome_string)

    # Return the full result.
    return messages

# Here, the function first produces a list, which Python keeps in memory for the duration of the loop.
# Afterwards, the list is removed from memory again. But for a short period of time, it's taking up space.
for message in awesome_list_counter(10):
    print(message)
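If you want to see the difference in memory use for yourself, the following sketch compares a list with a generator expression (a compact way of writing a simple generator). It relies on `sys.getsizeof`, which only reports the size of the container object itself and not of the strings inside it, so treat the numbers as a rough indication rather than an exact measurement.

```python
import sys

n = 100000

# A list comprehension builds all the messages right away and keeps them around.
as_list = ["The number %s is awesome!" % i for i in range(n)]

# A generator expression builds nothing yet; it produces messages on demand.
as_generator = ("The number %s is awesome!" % i for i in range(n))

print("Size of the list object:     ", sys.getsizeof(as_list), "bytes")
print("Size of the generator object:", sys.getsizeof(as_generator), "bytes")

# The generator still gives us the same messages, one at a time.
first_message = next(as_generator)
print(first_message)
```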

When you call a generator function, it returns a *generator object*. You can use the built-in function `next()` to keep calling the next-to-be-generated value from the generator object until it has produced everything it should. At that point, calling `next()` will result in a `StopIteration` error. Please run the next bit of code to see it in action.

In [None]:
generator = awesome_counter(2)

i = next(generator)
print("the first value is", i)

i = next(generator)
print("the second value is", i)

i = next(generator)
print("the third value is", i)

So how does the for-loop know when to stop if calling `next()` gives an error at some point? Simple: error handling! Implicitly, the loop looks sort of like this:

In [None]:
generator = awesome_counter(2)
# While loops work like this: the while-statement indicates that you want to keep doing 
# something while the condition following the 'while'-keyword is true. 
#
# While True means that the loop will never finish, because the condition is never False.
while True:
    try:
        # Try to get the next item.
        i = next(generator)
    except StopIteration:
        print("Finishing the loop!")
        # Break out of the loop.
        break
    
    # ...Continue the current iteration.
    # Do whatever you want with the item, in this case print it.
    print(i)

Files also work like generators. You can run through them line by line, so that you never have to keep the entire file in memory: just the current line (and whatever you decide to extract or compute from the text).

In [None]:
f = open('../Data/RedCircle/RedCircle.txt')

line = next(f)
print(line)

# Let's move to the end of the file to show we get the same error if we use next() one more time after that.
for line in f:
    # Do nothing. Just loop until the end of the file.
    pass

# If you just use next(f), you get an error now.
try:
    line = next(f)
    print("Another line!")
except StopIteration:
    print("Reached the end!")

# Close the file now that we're done with it.
f.close()

And if you only want to read the first N lines, you could use a loop like this:

In [None]:
line_number = 0
# Open the file.
with open('../Data/RedCircle/RedCircle.txt') as f:
    # Loop over the file, line by line.
    for line in f:
        # Print the line
        print(line_number, line)
        # Increase the line counter
        line_number += 1
        # And break out of the loop after 10 lines.
        if line_number == 10:
            break

But more Pythonic (prettier code) would be to use `enumerate()`, which also acts like a generator:

In [None]:
with open('../Data/RedCircle/RedCircle.txt') as f:
    # Loop over the file, line by line.
    for line_number, line in enumerate(f):
        # SELF-CHECK QUESTION:
        # Why did I put the if-statement at the beginning now?
        if line_number == 10:
            break
        # Print the line
        print(line_number, line)
        # And break out of the loop after 10 lines.

How does that for-loop work? Remember that we could use *multiple assignment* to assign values to multiple variables at once:

In [None]:
x,y = (1,2)
print("The value of X is:", x)
print("The value of Y is:", y)

We can do the same in a for-loop! Here is an example:

In [None]:
import nltk

words = "I like dogs and cats"
tokens = words.split()
tagged_tokens = nltk.pos_tag(tokens)

print("The tagged tokens are represented as a list of tuples:")
print(tagged_tokens)

print("Now let's print them in a table!")
print("------------")
print("Token \t Tag")
print("------------")
for token, tag in tagged_tokens:
    print(token,'\t',tag)
print("------------")

If we were to re-implement `enumerate()`, it would look like this:

In [None]:
def enumerate_clone(iterable):
    "This is a clone of enumerate(), which yields items and their index, one by one."
    index = 0
    for item in iterable:
        yield (index, item)
        index +=1

Let's see whether it works:

In [None]:
with open('../Data/RedCircle/RedCircle.txt') as f:
    # Loop over the file, line by line.
    for line_number, line in enumerate_clone(f):
        if line_number == 10:
            break
        # Print the line
        print(line_number, line)
        # And break out of the loop after 10 lines.
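
As an aside, the standard library offers yet another way to take the first N items from anything you can loop over: `itertools.islice`. The sketch below uses `io.StringIO` as a stand-in for a real file (so it runs without the course data); the contents of `fake_file` are made up for this illustration.

```python
import io
from itertools import islice

# A stand-in for an opened text file with 100 lines.
fake_file = io.StringIO("".join("line %d\n" % i for i in range(100)))

# islice(iterable, 10) yields at most the first 10 items and then stops.
first_ten = list(islice(fake_file, 10))

for line_number, line in enumerate(first_ten):
    # Use end='' because the lines already end in a newline character.
    print(line_number, line, end='')

# The rest of the "file" has not been read yet, so we can continue where we left off.
remaining = sum(1 for line in fake_file)
print("Lines left unread:", remaining)
```

Just like `enumerate()`, `islice()` is lazy: it only asks for the next line when it needs one.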

## Where to find data

Here's a non-exhaustive list of places where you could get interesting data.

**Curated**

* Corpora (Brown ([NLTK version](http://www.nltk.org/book/ch02.html)), [OANC](http://www.anc.org/data/oanc/download/), [UMBC WebBase](http://ebiquity.umbc.edu/resource/html/id/351))
* Psycholinguistic data (sometimes known as 'norms' in the Psychology literature)
* DBpedia
* Open data (e.g. [Dutch](https://data.overheid.nl/), [American](https://www.data.gov/))
* Web N-gram data (e.g. [here](http://hpsg.fu-berlin.de/cow/ngrams/))

**The web**

* [USENET](http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)
* [Internet Archive](https://archive.org/)
* [Project Gutenberg](https://www.gutenberg.org/)
* Wikipedia ([dumps](https://dumps.wikimedia.org/), [export]())
* [Web data commons](http://webdatacommons.org/)

**Do it yourself**

* [BootCat](http://bootcat.sslmit.unibo.it/)
* Experiments
* Annotating
* Crowdsourcing
* ...

## How to get the data

### Downloading directly

Here are three ways to download data from the web, each with their own use cases.

* Browser (loads of data available online)
* Command line: `wget` ([manual](https://www.gnu.org/software/wget/manual/wget.html))
* Python: `requests`, `urllib`

If you see some dataset online, or you just want to download a webpage, there is no better way than to use your browser and either save the page (from the File menu), or to right-click and press "save as..". But for more complex cases, you'll want to automate the process. 

The command line `wget` tool is like a Swiss Army knife for downloading stuff in bulk. For example, if you have a list of URLs in a text file called `list_of_urls.txt`, you can just use `wget -i list_of_urls.txt` to download all the files. The `wget` tool is also [available on Windows](http://gnuwin32.sourceforge.net/packages/wget.htm), and there is a `wget` module for Python as well. For more complicated procedures, it's easier to just use the `requests` or `urllib` library. We won't explore `wget` in this course.

Here is how we downloaded the Linguist List data for this course (we'll use this data later on):

```python
import os
import urllib.request
import time

base_url = 'http://listserv.linguistlist.org/pipermail/linglite/'
years = [str(year) for year in range(1997,2016)]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

for year in years:
    # OS-independent way of creating the path to the folder.
    path = os.path.join('..', 'linguistlist', year)
    # Make the necessary folder (without complaining if it already exists).
    os.makedirs(path, exist_ok=True)
    
    for month in months:
        # Update variables.
        filename = '{}-{}.txt.gz'.format(year, month)
        path_with_file = os.path.join(path, filename)
        url = base_url + filename
        
        # Write the data to disk.
        with urllib.request.urlopen(url) as response:
            # Use the 'wb' flag because the response contents are bytes.
            with open(path_with_file, 'wb') as outfile:
                data = response.read()
                outfile.write(data)
        
        # Be nice to the server.
        time.sleep(2)
```

How did we do this?

* First, we went to the [Linguist List archive website](http://listserv.linguistlist.org/pipermail/linglite/). The archive looks nice, but it's a lot of work to download all of those files by hand!
* Then, we inspected the **source** of the webpage. In Firefox, you can do this by going to `Tools/Developer/Page Source`. In Chrome: `View/Developer/View Source`. Most other browsers offer this functionality as well.
* We saw that the URLs for the monthly archives are very regular. This is good, it means that we can exploit this regularity.
* Then, we decided on a local structure: we want to have one folder for every year, in which all the archives for that year are stored. This structure determined the structure of our program.
* If you don't download files often, search online for a good way to do this. Many programmers would be lost without Google/StackOverflow! The first thing we found was the `urllib` library. But a solution using the `requests` library would also be OK! That would look like this:

```python
import requests

# Get the data:
r = requests.get('http://listserv.linguistlist.org/pipermail/linglite/2016-September.txt.gz')

# Use the 'wb' flag because the response contents are bytes.
with open('September.txt.gz','wb') as f:
    # Write the data:
    f.write(r.content)
```

* It's good practice to make your computer wait a little between requests. So we used the `sleep` function from the `time` module to wait 2 seconds after each download.
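
The regularity of the URLs also combines nicely with the generators from the first half of this notebook. The sketch below is one possible way to separate "deciding what to download" from "downloading it": the hypothetical helper `archive_urls` only builds the URLs and local paths, without contacting the server, so you can check them before making any request.

```python
import os

def archive_urls(base_url, first_year, last_year, months):
    "Generator that yields a (url, local_path) pair for every monthly archive."
    for year in range(first_year, last_year + 1):
        for month in months:
            filename = '{}-{}.txt.gz'.format(year, month)
            # OS-independent way of creating the path to the local file.
            local_path = os.path.join('..', 'linguistlist', str(year), filename)
            yield base_url + filename, local_path


base_url = 'http://listserv.linguistlist.org/pipermail/linglite/'
# Only two months and two years here, to keep the output short.
pairs = list(archive_urls(base_url, 1997, 1998, ['January', 'February']))

for url, local_path in pairs:
    print(url, '->', local_path)
```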

You might be surprised by the file ending: `.txt.gz`. What kind of filename is that? Basically it's a compressed file, similar to a `.zip` file. You'll often see this extension for large files, because it's a means of compressing data into a smaller format. You can get the un-compressed `.txt` file by unpacking the `.txt.gz` file, but there's also a useful Python module called `gzip` that lets you inspect these files ([documentation here](https://docs.python.org/3.6/library/gzip.html)). 

An unzipped example is in `../Data/linguistlist/example`. Here's how to use the `gzip` module:

In [None]:
import gzip

# Open the linguist list data for the month April in the year 2000.
# Use text-mode so that each line is returned as a unicode string.
with gzip.open('../Data/linguistlist/2000/2000-April.txt.gz','rt') as f:
    # Loop over the lines in the file, using enumerate to get the line numbers.
    # Use start=1 to start counting at 1 rather than 0.
    for line_number, line in enumerate(f, start=1):
        # Print the line. Use end='' because lines already end in a newline character.
        print(line, end='')
        # Stop after 50 lines.
        if line_number == 50:
            break
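
If you don't have the course data at hand, you can still try out the `gzip` module. The following sketch writes a tiny compressed file to a temporary folder and reads it back; the file name `example.txt.gz` and its contents are made up for this illustration.

```python
import gzip
import os
import tempfile

text = "first line\nsecond line\nthird line\n"

# Work in a temporary folder that is removed again afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.txt.gz')

    # 'wt' means: write in text mode. The gzip module compresses the data for us.
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write(text)

    # 'rt' means: read in text mode. We get ordinary strings back, line by line.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        lines = [line for line in f]

print(lines)
```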

#### Things to come
This was a simple example that doesn't require us to do any parsing of the webpage itself. But how would you write a function that takes a URL like [this one](http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html) and returns all job descriptions? What would be your approach (on a high level)? 

We will revisit this problem below.

### Using an API

An API (*application programming interface*) provides a way for programs to interact with applications running independently. Those applications could either be running on your own computer, or they could be running somewhere else. We will be working with online APIs, specifically APIs providing the interface to some database. 

General guidelines for using APIs:

1. Try to minimize the number of requests you make. Can you be selective before putting in your requests? 
2. Try to spread your requests so that you don't overload the server.
3. Try to cache your results so that you don't request the same thing twice. (Think about multiple sessions and testing your code.)

In short: developers providing APIs are doing us a favor. Acting nice to them is the least we can do.
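
As a rough sketch of guidelines 2 and 3, the function below only makes a request when the answer is not in the cache yet, and pauses after every real request. Everything here is hypothetical: `polite_fetch` is a made-up helper, and `fake_api` stands in for a real API so that the example runs without an internet connection.

```python
import time

def polite_fetch(key, cache, fetch, delay=1.0):
    "Return the cached value for key, calling fetch(key) only if it isn't cached yet."
    if key in cache:
        # No request needed: we already know the answer.
        return cache[key]
    value = fetch(key)
    cache[key] = value
    # Spread out the requests: wait a little after each real request.
    time.sleep(delay)
    return value


# Keep track of the requests that are actually made.
calls = []

def fake_api(place):
    "Stand-in for a real API call."
    calls.append(place)
    return place.upper()


cache = {}
for place in ['Amsterdam', 'Utrecht', 'Amsterdam']:
    print(polite_fetch(place, cache, fake_api, delay=0.1))

print("Requests actually made:", calls)
```

The second request for Amsterdam is answered from the cache, so the (fake) server is only contacted twice.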

#### Bare APIs and wrappers

APIs work like this: you send them a request (possibly with some additional information), and they send you the relevant data back. Sometimes you have to send these requests explicitly in your code, but other times there will be a *wrapper* where people have written code to provide a nice interface for you to use.

**Geopy** is a nice example of a wrapper around several geolocation APIs. Read the documentation [here](https://geopy.readthedocs.io/en/1.10.0/). You can install Geopy using `pip install geopy`. 


In [None]:
import json

# Load the Nominatim API.
# Read more about Nominatim here: http://wiki.openstreetmap.org/wiki/Nominatim
from geopy.geocoders import Nominatim

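# Instantiate a geolocator object, using the Nominatim API.
# Recent versions of Geopy require you to identify your application with a user_agent.
# The name used here is just a placeholder, so pick your own.
geolocator = Nominatim(user_agent='getting-and-processing-data-notebook')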

# Try to find out more about a place, such as the street where the VU main building is.
location = geolocator.geocode('de Boelelaan')

# Print the place.
print(location)

**Question** What kind of information can you get from the `Location` object?

**Example**

Here is some code to get you started if you ever want to use this API. Interesting aspects are:

* Caching: this code stores the latitude and longitude for each place in a dictionary called `location_cache`. Keeping track of all responses means we never have to make the same request twice.
* Try & Except: this code makes use of a try-except block. Typically, code following `try` is the default case, and the code following `except` is for handling situations where the code in the try-block cannot be executed.

In [None]:
try:
    with open('location_cache.json') as f:
        location_cache = json.load(f)
except FileNotFoundError:
    location_cache = dict()

def get_lon_lat(place, location_cache):
    """
    Get the longitude and latitude of a place (in that order).
    """
    if place in location_cache:
        # Get the longitude and latitude from the location cache.
        lon, lat = location_cache[place]
    
    # If 'place' is not in the location cache..
    else:
        location = geolocator.geocode(place)
        lon,lat  = location.longitude, location.latitude
        location_cache[place] = [lon, lat]
    # return longitude and latitude.
    return lon, lat

# REST OF YOUR CODE. Example:
lon,lat = get_lon_lat('Amsterdam', location_cache)

# Write out the file.
with open('location_cache.json', 'w') as f:
    json.dump(location_cache, f)

This code is friendly to the server, because it only makes a request if you haven't already asked where Amsterdam is. Otherwise it just returns the values from the cache. But we can make it even more friendly by making the computer wait a little between each request:

In [None]:
import time

for location in ['Amsterdam', 'Utrecht', 'Amersfoort', 'Uitgeest']:
    # Make the request. Add ", the Netherlands" to make the request as specific as possible.
    # This is a means to reduce errors.
    lon, lat = get_lon_lat(location + ', the Netherlands', location_cache)
    
    # Do something with the result, e.g. print it.
    print(location, 'has the following longitude and latitude:', lon, ';', lat)

    # Wait.
    time.sleep(2)

When there is no wrapper, you just treat the API as if you are downloading something from the URL. Let's go through some examples. Both of these provide output in JSON format.

**Recipepuppy** is a website where you can search for recipes you can make with a particular set of ingredients. The description of their API is [here](http://www.recipepuppy.com/about/api/). So how do we make this work?

In [None]:
# This library comes pre-installed with Anaconda. We use it to send requests to the web.
import requests

# Get the ingredients
ingredients = input('Please enter the ingredients as a comma-separated list.\n')

# Remove spaces if there are any. (This makes the script more robust.)
# Strings can't be changed in place, so we assign the result back to the variable.
ingredients = ingredients.replace(' ','')

# Prepare the API request URL
base_url = "http://www.recipepuppy.com/api/?i="
api_request = base_url + ingredients

# Get the response
response = requests.get(api_request)

# And print it
print(response.content)

We know from last week that JSON objects are just like Python dictionaries, and you can load them using the `json` module. Let's try that!

In [None]:
import json

recipe_data = json.loads(response.content)

.

.

.

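Woops! It turns out that data from the internet is in bytes-format. In Python versions before 3.6, `json.loads` really needs a string, so there you get a `TypeError`. (Newer versions of Python also accept bytes, so you may not see an error at all.)
Either way, it's good to know how to do the conversion yourself: we need to use the `decode` method to turn the bytes into unicode. If this sounds like magic to you, don't worry: this is something all programmers have struggled with at some point. 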

For the next class, please watch the video [Pragmatic Unicode, or: How do I stop the pain?](http://nedbatchelder.com/text/unipain.html). And, if you want to learn more about Unicode, read [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html).

Now, let's just convert the bytes and continue working with the recipe data.

In [None]:
# Decode bytes into utf-8 (unicode).
decoded_data = response.content.decode('utf-8')

# Load the data.
recipe_data = json.loads(decoded_data)

# Print the keys.
print(recipe_data.keys())

It worked! Now I'll let you in on a secret: because conversion to text is a very common operation, there's also an attribute `response.text` that we could have used to get the recipe data in text format. But this was a useful exercise to show you how to convert from bytes to text manually.

Here's the shortened version of the previous code snippet.

In [None]:
# Load the data.
recipe_data = json.loads(response.text)

# Print the keys.
print(recipe_data.keys())

You might ask: but why is there a bytes-format in the first place? Well, that's just how computers store things. And if we were to just save the recipepuppy data, there wouldn't be any need to convert it. We could just do something like this (and then you can open this file in any text editor):

In [None]:
# Open the file in write-mode (using bytes)
with open('recipepuppy.json','wb') as f:
    f.write(response.content)
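
To make the difference between text and bytes more tangible without contacting any server, here is a small self-contained sketch. The dictionary `original` is made-up example data, not an actual Recipepuppy response.

```python
import json

original = {"title": "Crème brûlée", "ingredients": "eggs, cream, sugar"}

# Turn the dictionary into JSON text (a string).
# ensure_ascii=False keeps the accented letters as they are.
as_text = json.dumps(original, ensure_ascii=False)

# Encode the text into bytes. This is the form in which data travels over the internet.
as_bytes = as_text.encode('utf-8')

print(type(as_text), len(as_text), "characters")
print(type(as_bytes), len(as_bytes), "bytes")

# Decoding the bytes gives us the text back, which we can load as JSON again.
decoded = as_bytes.decode('utf-8')
round_trip = json.loads(decoded)
print(round_trip == original)
```

The number of bytes is larger than the number of characters, because in UTF-8 each accented letter takes up more than one byte.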

**Pretty printing**

A nice way to inspect JSON response dictionaries is to use the built-in pretty printer from the `pprint` library:

In [None]:
# Import the pretty printer:
from pprint import pprint

# Print the recipe data:
pprint(recipe_data)

So now we understand the basics of how this API works: ingredients are passed to the website as a comma-separated string, and we get a JSON response back that we can load as a dictionary. The dictionary contains a key called 'results', which maps to a list of results (dictionaries as well). 

But there is more to this API. You aren't limited to a single page of results: you can actually request multiple pages. [Here](http://www.recipepuppy.com/api/?i=onions,garlic&q=omelet&p=3) is their example. Some questions:

* How can you get more results?
* How do you know whether you have *all* results for a given query?

.

.

.

.

.

Play with the URL and see what happens! Try stuff like p=500000 (or some other high number).
We can assume that the website will give a similar page when there are no more results.
That's when the algorithm to get all the results needs to stop.
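
When you are playing with URLs like these, it can help to build them in code rather than by hand. The sketch below is one possible way to do that with `urlencode` from the standard library; the helper name `build_request_url` is made up for this illustration, and the parameter names `i`, `q` and `p` are taken from the example URL above. It only builds the URL and doesn't send any request.

```python
from urllib.parse import urlencode

base_url = "http://www.recipepuppy.com/api/?"

def build_request_url(ingredients, query, page):
    "Build the request URL for the given ingredients, query and page number."
    # A list of pairs keeps the parameters in this exact order.
    params = [('i', ingredients), ('q', query), ('p', page)]
    # safe=',' keeps the commas between ingredients readable. Other special
    # characters (such as spaces) are escaped for us.
    return base_url + urlencode(params, safe=',')


for page in [1, 3, 500000]:
    print(build_request_url('onions,garlic', 'omelet', page))
```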

### Exercise: dealing with the Recipepuppy API

We will work with [this URL for omelettes containing potatoes](http://www.recipepuppy.com/api/?i=potato&q=omelette&p=1), for the simple reason that there aren't that many recipes matching this query. It's nice to have examples like these, because you can easily test your code. Trying out all the numbers shows us that there are **three types of responses**:

* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=1 **Returns a JSON file with results.**
* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=2 **Gives a 404: page not found error.** (There's a bug in the API!)
* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=3 Returns a JSON file with results.
* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=4 Returns a JSON file with results.
* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=5 Returns a JSON file with results.
* http://www.recipepuppy.com/api/?i=potato&q=omelette&p=6 **Returns a JSON file with no results.**

We will write some functions to properly deal with this API. Here are all the steps:

1. Write a function that returns either a dictionary with the results if there is a JSON file, or `None` if the website gives an error.
2. Write a generator function to easily loop through the result pages.
3. Write a function to collect a specific amount of results.

**Part 1: write a function to get the results**

Using the code below, write a function that returns either a dictionary with the results if there is a JSON file, or `None` if the website gives an error.

HINT: loading the 404 page as a JSON string will raise an error!

In [None]:
def get_results(query, ingredients, page):
    """
    Query: string indicating the kind of recipe that you're looking for.
    Ingredients: comma-separated string of ingredients.
    Page: results page.
    """
    # YOUR CODE HERE.
    # - use a variable base_url to refer to the part of the URL that never changes.
    # - combine the base_url with the query (e.g. 'omelette'), the ingredients 
    # (e.g. 'potato' or 'potato,onion'), and the page number (e.g. '1').
    # - It's probably nicest if this function allows both integers as well as strings for the page number.
    #   You can do this by converting page to a string (use str()). 
    
    try:
        # YOUR CODE HERE.
        # Get the page, and load the JSON data.
        # results = ...
        #
        # The results from recipepuppy.com don't have any page number.
        # Let's fix that, because it might be useful in the future.
        results['page'] = page
        return results
    except #SOME KIND OF ERROR:
        return None

**Part 2: write a generator function to loop over the results**

We've covered generators at the beginning of this notebook. Let's use one of them in practice! So what would a generator function for search results look like? Basically it should keep calling the API until there are no more relevant results. This is when the JSON file has an empty list of results. (In this case, we need to end the generator ourselves, because we're defining the stopping criteria. Inside a generator function, you do this with a plain `return`; Python then takes care of the `StopIteration` error that tells the for-loop to stop. Don't write `raise StopIteration` inside a generator: since Python 3.7, that results in a `RuntimeError`. We'll give you this part of the code for free.)

Please complete the code below.

In [None]:
import itertools

def results_generator(query, ingredients):
    """
    Generator to yield all the result pages for the given query and ingredients.
    """
    # Write a loop in which you keep calling the results page until there are no more results.
    # Use the 'yield' keyword to produce the results.
    # Be sure to also use the sleep() function to pause between calls.
    
    # Itertools.count() keeps counting forever if we don't do anything. 
    # The return statement is there to stop once there are no more results.
    for page_number in itertools.count(start=1):
        
        # YOUR CODE HERE: 
        # - get the results (call the variable 'result').
        # - make the loop sleep.
        
        
        # Some code to prevent an infinite loop while you're working. 
        # Remove this once you're sure your code works.
        if page_number > 5:  # REMOVE THIS LINE WHEN YOU'RE DONE
            break            # REMOVE THIS LINE WHEN YOU'RE DONE
        
        # If the page gives a 404 error.
        if result is None:
            # Continue means: move to the next iteration of the loop, 
            # without executing the rest of the code.
            continue
        
        # If we got here, then that means we didn't get a None-result.
        none_count = 0
        
        # If there are no more results, return so that Python knows to stop.
        if len(result["results"]) == 0:
            return
        
        # YOUR CODE HERE: yield the result.

In [None]:
# For testing purposes, use this code.
results_list = []
for result in results_generator(query="omelette",ingredients="potato"):
    results_list.append(result)

print(results_list[0])

**Part 2b: make the generator more robust**

A problem with the generator function above is that it might produce an infinite loop if Recipepuppy.com is down. It might be a good idea to add a counter that keeps track of how many times `result` has been equal to `None`, and breaks out of the loop when that number goes over a certain threshold (say, 5 times `None` in a row). How would you do this? Modify the code to implement your solution.

**Part 3: write a function to collect a specific amount of results**

Suppose you wanted to look for pasta recipes. There are hundreds of them! Getting all recipes from the API would take a long time, and you may only want to have a couple. Hence it's a good idea to write another function to get (at most) a specified number of results. Please complete the following function:

In [None]:
def get_n_recipes(query,ingredients,n):
    """
    Function that returns at most N results, where N is equal to the number of recipes.
    """
    # NOTE: there are multiple recipes per results page!
    return list_of_results

### More to explore

**Hackernews** is a website where people can post URLs to interesting stories, submit polls, show the community something, or ask the community a question. The description of their API is [here](https://github.com/HackerNews/API). 

**Question**: what kind of things could you do with this data?

We will use the Hackernews API in the assignment.

Many APIs require you to authenticate yourself to the server, before they actually return any results. This is a means to prevent abuse (e.g. overloading the server). This usually means you have to register for the service in order to get an *API key*. We won't cover these in class (we don't want to force you to register for anything), but know there are many public APIs out there!

## How to process your data

### Processing the data: HTML

Let's take a look at a simple webpage. [Here](http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html) is one with all postings from the Linguist List in September 2016. Our goal will be to get a list with all the Job postings, including the URL. How do we go about this?

Step 1. **Look at the source code first**. We can't do anything without knowing how the page is structured. You can open the page with your browser and inspect the source, or right-click the link and choose "Save as.." to save the file and inspect it with a text editor. What would be a good approach?

.

.

.

.

.

.


**Possible approaches**

1. Use string-methods, look for all the lines with the word 'Jobs' in it, and extract the URL and title from them.
2. Use regular expressions, write a pattern to match all links with 'Jobs' in the text.
3. Use a module to parse the HTML first, then look for all links with the word 'Jobs'.

Let me first emphasize: *There is no wrong way to do this.* If it works, it works. But as the problems you are trying to solve get more and more complex, a high-level approach becomes more and more attractive. (To illustrate: how would you get the full text of [this article](http://www.bbc.com/news/disability-35881779) from the webpage? Parsing HTML is definitely the way to go here.)

Step 2. **Create a working solution for the problem at hand.** Let's try all three approaches. 

In [None]:
# Python only loads a module once; importing it again simply reuses the loaded module.
import requests

# Get the data, and convert to string.
response = requests.get('http://listserv.linguistlist.org/pipermail/linguist/2016-September/date.html')
contents = response.content.decode('utf-8') # We'll use this variable as the starting point for this exercise.

First, try to find all URLs and titles of job-announcements using string-methods.

In [None]:
# Steps:
# 1. Split the contents into lines.
# 2. Create a list with all the lines containing the word 'Jobs'.
# 3. Write a function to extract the URL and title from a line.
# 4. Apply that function to each of the lines, and collect the results in another list.

# YOUR CODE HERE
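
To get started with step 3, here is a minimal sketch of the string methods involved, applied to a single made-up line. The real page may be formatted differently, so treat the example line as an assumption and check the actual source before reusing any of this.

```python
# Made-up example line; check the page source to see what the real lines look like.
line = '<LI><A HREF="001234.html">27.1234, Jobs: Example posting</A>'

# str.find returns the index of the first match (or -1 if there is none).
start = line.find('HREF="') + len('HREF="')
end = line.find('"', start)
url = line[start:end]

# str.split cuts a string at a separator. Here we keep the part between
# the '>' that closes the opening tag and the '<' that starts the closing tag.
title = line[end:].split('>', 1)[1].split('<', 1)[0]

print(url)
print(title)
```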

Now try to find all URLs and titles of job-announcements using regular expressions. Learn about regular expressions [here](https://regexone.com/), and read the documentation for the [re](https://docs.python.org/3/library/re.html) module. Here is a small example of how to use the module.

In [None]:
import re

# Example of how to find all smiley faces in a text.

# re.compile is nice because it allows you to define a pattern wherever you want 
# (put it somewhere prominent & easy to modify) and because your code will be much
# faster if you use the pattern often. (Otherwise Python has to compile the pattern 
# each time you want to use it.)

pattern = re.compile(r':-?[\(\)]') # The 'r' stands for 'raw string'.
results_1 = pattern.findall("""Greetings! :) This is a sentence with smileys! :-) 
                The last one had a nose, probably written by an old person :(""")

# Example of how to use capturing groups.
pattern = re.compile(r'like (\w+)') # Use a raw string here too, so that '\w' reaches the re module unchanged.
results_2 = pattern.findall("I like hamsters, but I don't like cleaning the cage.")

print(results_1)
print(results_2)

In [None]:
# Steps:
# 1. Write a pattern with two capturing groups: one for the URL and one for the text.
# 2. Use pattern.findall(contents) to find all the job listings. You will automatically get tuples with the relevant data.
#
# HINT: you can use the question mark to do non-greedy matching for the asterisk. 
# '.*?\n' will match 'everything until the end of the line'. 
# If '.' is allowed to match line breaks (the re.DOTALL flag), '.*\n' instead means 
# "everything up until the last line break". Without that flag, '.' stops at line breaks anyway.
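
The difference between greedy and non-greedy matching, and the way `findall` behaves with more than one capturing group, is easiest to see on a small example. The strings below are made up for illustration and have nothing to do with the Linguist List page.

```python
import re

text = '<b>bold</b> and <b>brave</b>'

# Greedy: '.*' grabs as much as possible, so the match runs up to the LAST '</b>'.
greedy = re.findall(r'<b>(.*)</b>', text)

# Non-greedy: '.*?' grabs as little as possible, so each match stops at the FIRST '</b>'.
non_greedy = re.findall(r'<b>(.*?)</b>', text)

# Two capturing groups: findall returns a list of (group 1, group 2) tuples.
pairs = re.findall(r'(\w+)=(\d+)', 'width=640 height=480')

print(greedy)
print(non_greedy)
print(pairs)
```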


Finally, let's use the `lxml` module to find all URLs and titles of job announcements. See the code below for instructions.

In [None]:
from lxml import html

root = html.fromstring(contents)

# Modify the XPATH string so that 'links' will contain the right elements.
# Use root.getchildren() to explore what the document tree looks like.
# You can use getchildren() on other elements as well.
links = root.xpath("./path/to/link/tag[contains(.,'Jobs')]")

# 'links' will be a list with html elements.
# Use dir() to see what you can do with them.
# For any link element, you can get the URL like this:
# url = link.attrib['href']
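
To show the general idea of "parse first, then select" without giving away the XPath, here is a sketch that swaps in a different tool: the standard library's `html.parser` instead of `lxml`. The `LinkCollector` class and the miniature page are made up for this illustration; with `lxml`, a single XPath expression does the same job in far less code.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    "Collect (url, text) pairs for all <a> elements that have an href attribute."
    def __init__(self):
        super().__init__()
        self.links = []          # Finished (url, text) pairs.
        self.current_url = None  # URL of the link we are currently inside, if any.
        self.current_text = []   # Pieces of text seen inside that link.

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_url = dict(attrs).get('href')
            self.current_text = []

    def handle_data(self, data):
        if self.current_url is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_url is not None:
            text = ''.join(self.current_text).strip()
            self.links.append((self.current_url, text))
            self.current_url = None

# Made-up miniature page, only for illustration.
toy_page = """<ul>
<li><a href="0001.html">Jobs: Example posting</a></li>
<li><a href="0002.html">Books: Example title</a></li>
<li><a name="anchor">no link here</a></li>
</ul>"""

collector = LinkCollector()
collector.feed(toy_page)
job_links = [(url, text) for url, text in collector.links if 'Jobs' in text]
print(job_links)
```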

Step 3. **How generalizable is your solution?** How many steps does it take to change our solutions to, for example:

* Use a different URL (maybe you want to do this in October as well).
* Search for a different set of announcements, e.g. *Books*, or *Conferences*.

You don't need to implement these changes, though you can if you want to! (Use the code boxes below.) But just read through your solutions to this problem and think about what changes should be made.
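
One way to make a solution easier to adapt is to turn the things that might change into function arguments. The sketch below assumes that other months follow the same URL pattern as the September 2016 page, which you should verify in your browser. The names `ARCHIVE_TEMPLATE`, `archive_url` and `select_postings` are hypothetical, and the postings are made up.

```python
# Assumption: other months use the same URL pattern as the September 2016 page.
ARCHIVE_TEMPLATE = 'http://listserv.linguistlist.org/pipermail/linguist/{year}-{month}/date.html'

def archive_url(year, month):
    "Build the URL of the archive page for a given year and month name."
    return ARCHIVE_TEMPLATE.format(year=year, month=month)

def select_postings(postings, category):
    "Keep the (url, title) pairs whose title mentions the given category."
    return [(url, title) for url, title in postings if category in title]

# Made-up postings, standing in for the output of any of the three approaches.
postings = [('0001.html', 'Jobs: Example posting'),
            ('0002.html', 'Books: Example title'),
            ('0003.html', 'Confs: Example meeting')]

print(archive_url(2016, 'October'))
print(select_postings(postings, 'Books'))
```

With this setup, switching month or category is a matter of changing one argument rather than editing the body of the code.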

### Processing data: NLP tools

The common idea behind all NLP tools is that they try to structure or transform text in some meaningful way. The question of which tool you should use is secondary to the question of what you want to achieve. To give you a sense of the things you can achieve with standard NLP techniques, we will now look at two tools that you can use to analyze text: **SpaCy** and **pyspotlight**. 

#### SpaCy: quickly parsing documents

SpaCy provides a small NLP pipeline: it takes a raw document, tokenizes it, tags all the tokens, and parses each sentence. On top of that, it also recognizes different types of entities, such as numbers, locations, and persons. The advantage of SpaCy is that it is really fast and has good accuracy. The downside is that, at the moment, it only works for English and German. There are other tools available for different languages, but those are a bit more difficult to set up. (We can help you with this; ask us after class.)

**Installing** 

To install SpaCy, enter the following commands on the command line.

* `conda config --add channels spacy`
* `conda install spacy`
* `python -m spacy.en.download` (if this doesn't work, see [here](http://spacy.io/docs/#getting-started) for updated instructions).

**Using SpaCy**

First let's load SpaCy.

In [None]:
# Load the English parser.
# Note for speakers of German: it's also possible to parse German sentences using SpaCy! 
# See the documentation for more info.
from spacy.en import English

# The English parser is a class. 
# If you call it without any arguments, you will get a parser object.
# You can use this object to parse documents.
parser = English()

In [None]:
# Here's how to parse a document.
parsed_document = parser("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

In [None]:
# Now you can loop over the document and print each sentence.
for sentence in parsed_document.sents:
    print(sentence)

In [None]:
# Print some information about the tokens in the second sentence.
sentences = list(parsed_document.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # Position of the token in the document, as a string
                      str(token.idx)]) # Character offset of the token in the text, as a string
    print(data)

**Question**: what is the difference between `token.pos_` and `token.tag_`? Read [the docs](https://spacy.io/docs/) to find out.

**Question:** what do the different tags mean? Read [this page](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to find out.

In [None]:
# Here's a slightly longer text, from the Wikipedia page about Harry Potter.
harry_potter = """Harry Potter is a series of fantasy novels written by British author J. K. Rowling. 
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry.
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."""

sentences = parser(harry_potter)
for e in sentences.ents:
    first_word = list(e)[0]
    etype = first_word.ent_type_
    print(e,'\t',etype)

Pretty cool, but what does NORP mean? According to the [docs](https://spacy.io/docs/#annotation-ner): Nationalities or religious or political groups.

#### pyspotlight: 'interpret' sentences using DBpedia

Pyspotlight provides an easy way to use DBpedia Spotlight, which is a service you can use to find DBpedia entities in a text. DBpedia is, roughly speaking, a machine-readable version of Wikipedia. In short, this tool enables us to figure out which entities a text is about.

**Installing**

To install pyspotlight, enter the following command on the command line.

* `pip install pyspotlight`

**Using pyspotlight**

Pyspotlight has a demo server that we can use for teaching purposes. If you'd like to use Spotlight in the future, it may be wise to set up your own server (you can run it on your laptop) or ask us to set something up for you.

* Please run the code below. Is there anything surprising about the output? 
* If you speak German, Dutch, Hungarian, French, Portuguese, Italian, Russian, Turkish, or Spanish, you could try running Spotlight for any of those languages as well. See the [documentation](https://pypi.python.org/pypi/pyspotlight/0.7.1) for the list of ports in the demo server. Change `2222` below to the relevant port, and you can run Spotlight for your language!

In [None]:
import spotlight

demo_server = 'http://spotlight.sztaki.hu:2222/rest/annotate'

# Annotate the Harry Potter text we've seen earlier.
spotlight.annotate(demo_server, harry_potter)

#### Other tools (not covered in class)

Unfortunately we cannot cover all NLP tools in this course. Below is a short list of tools that might be useful to you in the future. You can either use these tools as standalone programs (and then process their output using Python), or you can choose to use a *wrapper* that allows you to call these tools from inside Python.

* Treetagger is a tool for tokenization and part-of-speech tagging in many languages. [Here](http://treetaggerwrapper.readthedocs.io/en/latest/) is a Python interface for it. 
* Stanford CoreNLP is a suite of NLP tools (constituting a full pipeline). [Here](https://github.com/dasmith/stanford-corenlp-python) is a library to interact with those tools.

## Exercises

Here are some exercises to help you practice your data processing skills! (These are not mandatory, but we do recommend that you try them.)

### NLTK versus SpaCy

There is a difference in quality between SpaCy and NLTK, with the former being superior. But how can you tell? Here's an example of both tools in action. 

* The example text is a case in point. What goes wrong here?
* Try experimenting with the text to see what the differences are.

In [None]:
# Only run this cell if you haven't loaded SpaCy and nltk yet, e.g. if you restarted this notebook.
import nltk
from spacy.en import English

nlp = English()

In [None]:
text = "I like cheese very much"

print("NLTK results:")
nltk_tagged = nltk.pos_tag(text.split())
print(nltk_tagged)

print("SpaCy results")
doc = nlp(text)
spacy_tagged = []
for token in doc:
    tag_data = (token.orth_, token.pos_, token.tag_,)
    spacy_tagged.append(tag_data)
print(spacy_tagged)

### Harry Potter

Use the requests library to get [the Harry Potter article from Wikipedia in JSON format](https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&&titles=Harry%20Potter). Then, answer the following questions:

1. Which characters are mentioned most frequently in the Wikipedia article? (HINT: you might want to use SpaCy)
2. What are the most frequent locations in the Wikipedia article?
3. What are the most cited books on this page? What about websites? (HINT: this is a job for regular expressions!)
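
As a starting point, here is a sketch of how you might dig the article text out of the JSON and count things in it. It assumes the response has the shape that this kind of query returned at the time of writing, with the wikitext stored under the key `'*'`; inspect the real response to confirm this. The miniature JSON string below is hand-made, and its page id and text are invented. With `requests`, calling `response.json()` gives you a dictionary like `data`.

```python
import json
import re
from collections import Counter

# Hand-made miniature of the response structure; the page id and text are invented.
raw = '''{"query": {"pages": {"123": {"title": "Example",
          "revisions": [{"*": "[[Harry]] met [[Ron]]. Later [[Harry]] met [[Hermione]]."}]}}}}'''

data = json.loads(raw)

# The page id is not known in advance, so loop over the values of the 'pages' dictionary.
for page in data['query']['pages'].values():
    wikitext = page['revisions'][0]['*']

# Count how often each [[wiki link]] target occurs in the text.
link_counts = Counter(re.findall(r'\[\[(.*?)\]\]', wikitext))
print(link_counts.most_common(1))
```

The same `Counter` approach works for any list of strings, for example a list of entity names that you collected with SpaCy.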

### More APIs and datasets

What other APIs and datasets are available online? Use Google or some other search engine to find more APIs you could use to get interesting data. Also try to see whether the government supports open data, and what kind of data they're making available. (More and more governments do this!)

Personally, I really like that the Dutch government makes all debates available in XML format. See [here](https://zoek.officielebekendmakingen.nl/zoeken/parlementaire_documenten). If you'd like to explore this data for the final assignment, I wrote some scripts [here](https://github.com/evanmiltenburg/Dutch-corpora) to download all the data.