# NLP4Free 🔠⚡🤖🧠😃 
A Free Natural Language Processing Microcourse

by Myles Harrison 

https://www.mylesharrison.com/nlp4free

---

### License

This work is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-nc/4.0/).


<img src='../assets/Cc_by-nc_icon.png' width="25%"/>

You are free to:
- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material. The licensor cannot revoke these freedoms as long as you follow the license terms.


Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- **NonCommercial** — You may not use the material for commercial purposes.

---

## Part 2 - Data Acquisition and Preprocessing

### Requesting Data From a Web Service with the `requests` library

In this notebook, we will acquire and preprocess text data from online sources. We have already been introduced to the [requests library](https://requests.readthedocs.io/en/latest/), and here we will show how a few simple lines of code can pull data from a web service (REST API).

[The Cocktail DB](https://www.thecocktaildb.com/) is an open source database of cocktails and drinks from around the world, and their ingredients. It also has an API that is free to use for educational purposes. 

Let's use the `requests` library to get some text data: here, a description of gin. The URL pattern for a given web service is up to its designer and should be well documented. The Cocktail DB tells us to use the URL pattern `https://www.thecocktaildb.com/api/json/v1/1/search.php?i=<ingredient name>` to get information back on a drink ingredient.

First we import the requests library, then simply make a request using the `get` method and the URL:

In [41]:
# Import the requests library
import requests

# Make a call to the API
r = requests.get("https://www.thecocktaildb.com/api/json/v1/1/search.php?i=gin")
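
As an aside, the query string (`?i=gin`) does not have to be typed by hand. The sketch below uses only the standard library to show how a dictionary of parameters becomes a query string; the helper name `build_search_url` is made up for illustration. `requests.get` accepts the same kind of dictionary through its `params` argument and takes care of the encoding itself.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URL pattern quoted above
BASE_URL = "https://www.thecocktaildb.com/api/json/v1/1/search.php"

def build_search_url(ingredient: str) -> str:
    """Return the search URL for one ingredient (hypothetical helper)."""
    # urlencode escapes spaces and other special characters for us
    return f"{BASE_URL}?{urlencode({'i': ingredient})}"

print(build_search_url("gin"))
print(build_search_url("dry vermouth"))

# With requests, the equivalent call would be:
# r = requests.get(BASE_URL, params={"i": "gin"})
```

Letting the library build the query string avoids malformed URLs when an ingredient name contains spaces or punctuation.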

Our machine has now made the request and hopefully gotten a response from the server! Let's check the response code.

In [43]:
# Check the response
r

<Response [200]>
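
The code shown in the response can also be read as a plain integer from `r.status_code`. As a small offline sketch, the standard library's `http.HTTPStatus` can translate a code into its standard phrase; `describe_status` and `is_success` are hypothetical helper names, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '200 OK'."""
    return f"{code} {HTTPStatus(code).phrase}"

def is_success(code: int) -> bool:
    """Codes in the 2xx range indicate a successful request."""
    return 200 <= code < 300

print(describe_status(200))
print(describe_status(404))

# With a real response object this might look like:
# if not is_success(r.status_code):
#     print("Request failed:", describe_status(r.status_code))
```

`requests` also provides `r.raise_for_status()`, which raises an exception for 4xx and 5xx codes instead of requiring a manual check.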

We can see we have received a response code of 200, which means "OK": the data was returned successfully. Let's check what was returned from our request. There are two ways to do this. The most straightforward is to use the `.text` attribute, which shows the response contents as an ordinary Python string:

In [44]:
# What are the contents
r.text

'{"ingredients":[{"idIngredient":"2","strIngredient":"Gin","strDescription":"Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\\r\\n\\r\\nFrom its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\\r\\n\\r\\nGin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends

We can see there is some nesting of data here, as the response is actually returned in [JavaScript Object Notation (JSON) format](https://en.wikipedia.org/wiki/JSON). We can also parse this JSON using the `.json()` method, which returns a Python dictionary:

In [46]:
# Parse the JSON response into a Python dict
r.json()

{'ingredients': [{'idIngredient': '2',
   'strIngredient': 'Gin',
   'strDescription': 'Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.\r\n\r\nFrom its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.\r\n\r\nGin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin te
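
To see what this parsing step does, here is a small offline sketch using the standard `json` module. The string below is a shortened stand-in shaped like the response above, with placeholder description text rather than real API output.

```python
import json

# Shortened stand-in for r.text (placeholder description, not real API output)
sample_text = r'{"ingredients":[{"idIngredient":"2","strIngredient":"Gin","strDescription":"First paragraph.\r\n\r\nSecond paragraph."}]}'

# json.loads turns the JSON string into nested Python objects
parsed = json.loads(sample_text)

print(type(parsed))                 # the outer JSON object becomes a dict
print(type(parsed["ingredients"]))  # the JSON array becomes a list
print(parsed["ingredients"][0]["strIngredient"])
```

Note that the escaped `\r\n` sequences in the raw text become real carriage return and newline characters once parsed, which is why the description prints as separate paragraphs.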

Now, to pull out the description, it is a matter of subsetting the returned list (there is only one element, element 0) and then getting the value associated with the `strDescription` key:

In [47]:
description = r.json()['ingredients'][0]['strDescription']
print(description)

Gin is a distilled alcoholic drink that derives its predominant flavour from juniper berries (Juniperus communis). Gin is one of the broadest categories of spirits, all of various origins, styles, and flavour profiles, that revolve around juniper as a common ingredient.

From its earliest origins in the Middle Ages, the drink has evolved from a herbal medicine to an object of commerce in the spirits industry. Gin emerged in England after the introduction of the jenever, a Dutch liquor which originally had been a medicine. Although this development had been taking place since early 17th century, gin became widespread after the William of Orange-led 1688 Glorious Revolution and subsequent import restrictions on French brandy.

Gin today is produced in subtly different ways, from a wide range of herbal ingredients, giving rise to a number of distinct styles and brands. After juniper, gin tends to be flavoured with botanical/herbal, spice, floral or fruit-flavours or often a combinatio
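
The chained indexing used to get the description can be unpacked one step at a time. A minimal sketch, using a hand-written dictionary with the same shape as the parsed response (the description text is a placeholder):

```python
# Hand-written stand-in for r.json(); the description is placeholder text
payload = {
    "ingredients": [
        {"idIngredient": "2", "strIngredient": "Gin", "strDescription": "Placeholder description."}
    ]
}

matches = payload["ingredients"]             # dict lookup by key gives a list
first_match = matches[0]                     # list indexing gives one ingredient's dict
description = first_match["strDescription"]  # dict lookup by key gives a string

print(len(matches))
print(description)
```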

Great! We have successfully retrieved some text from an API using `requests`. We could write more code to retrieve more data programmatically and store it in a data structure such as a list or pandas DataFrame to work with in an NLP task:

In [51]:
ingredients = ['gin', 'vodka', 'rum']

description_list = list()

for ingredient in ingredients:
    
    # Make a call to the API
    r = requests.get(f"https://www.thecocktaildb.com/api/json/v1/1/search.php?i={ingredient}")
    
    # Pull out the description and append to the list
    description = r.json()['ingredients'][0]['strDescription']
    description_list.append({'ingredient':ingredient, 'description':description})
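
The loop above assumes every request succeeds and every ingredient is found. Below is a more defensive sketch under a few assumptions: that an unmatched ingredient comes back with `ingredients` set to `null` (handled here, but not confirmed in this notebook), and that a 10 second timeout is reasonable. The names `fetch_payload`, `collect_descriptions`, and `fake_fetch` are hypothetical; the fetch function is passed in as an argument so the logic can be tried without network access.

```python
API_URL = "https://www.thecocktaildb.com/api/json/v1/1/search.php"

def fetch_payload(ingredient):
    """Fetch the parsed JSON for one ingredient (requires network access)."""
    import requests  # imported here so the offline demo below runs without it

    r = requests.get(API_URL, params={"i": ingredient}, timeout=10)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    return r.json()

def collect_descriptions(ingredients, fetch=fetch_payload):
    """Build a list of {'ingredient', 'description'} dicts, one per ingredient."""
    rows = []
    for ingredient in ingredients:
        payload = fetch(ingredient)
        # Treat a missing or null 'ingredients' entry as "no match"
        matches = payload.get("ingredients") or []
        description = matches[0].get("strDescription") if matches else None
        rows.append({"ingredient": ingredient, "description": description})
    return rows

def fake_fetch(ingredient):
    """Offline stand-in for fetch_payload, returning placeholder data."""
    if ingredient == "gin":
        return {"ingredients": [{"strIngredient": "Gin", "strDescription": "Placeholder text."}]}
    return {"ingredients": None}

rows = collect_descriptions(["gin", "not-a-real-ingredient"], fetch=fake_fetch)
print(rows)
```

Calling `collect_descriptions(['gin', 'vodka', 'rum'])` without the `fetch` argument would go to the real API instead of the stand-in.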

Finally, we can plunk the list into a pandas DataFrame:

In [53]:
# Import pandas and build a DataFrame from the list of dicts
import pandas as pd

desc_df = pd.DataFrame(description_list)

# Check
desc_df.head()

|   | ingredient | description |
|---|---|---|
| 0 | gin | Gin is a distilled alcoholic drink that derive... |
| 1 | vodka | Vodka is a distilled beverage composed primari... |
| 2 | rum | Rum is a distilled alcoholic beverage made fro... |
