*****************************************************************
#  The Social Web 
- Instructors: Davide Ceolin, Dayana Spagnuelo
- Lab Assistants: Michael Accetto, Sarthak Gupta 
- Exercises for Hands-on session 2
- 12 February 2020 11:00 - 12:45                 
- NU-5B-21, NU-6A-04, NU-6B-20, NU-6C-39, NU-6C-40                             
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data. 

Prerequisites:
- Python 3.7
- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib, rdflib_microdata


In [1]:
# If you're using a virtualenv, make sure it's activated before running 
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser 
!pip install rdflib
# !pip install rdflib_microdata



To get the latest version of the rdflib-microdata package, we will install it directly from its GitHub repository. \
This is possible with pip, using the following syntax:  

`pip install git+YOUR_GITHUB_REPOSITORY_URL` 

For instance:

In [2]:
!pip install -e git+https://github.com/edsu/rdflib-microdata.git#egg=rdflib-microdata

Obtaining rdflib-microdata from git+https://github.com/edsu/rdflib-microdata.git#egg=rdflib-microdata
  Updating /home/mknw/.venv/tswvenv37/src/rdflib-microdata clone
Installing collected packages: rdflib-microdata
  Found existing installation: rdflib-microdata 0.2.0
    Uninstalling rdflib-microdata-0.2.0:
      Successfully uninstalled rdflib-microdata-0.2.0
  Running setup.py develop for rdflib-microdata
Successfully installed rdflib-microdata


##  Exercise 1 
Extracting coordinates from a webpage and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). 


In [16]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# Set URL to a page that contains geotags, e.g.
# 'https://en.wikipedia.org/wiki/Amsterdam'
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent' : "Social Web Course Student"})
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)
placemark = soup.findAll("Placemark")
print(placemark)


[]


In [None]:

# Look for any element with the class "geo" (the geo microformat).
geoTag = soup.find(True, 'geo')

if geoTag and len(geoTag) > 1:
    # Nested form: separate child elements hold latitude and longitude.
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Compact form: a single "lat; lon" string.
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')
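
To try the two branches above without depending on a live page, the following minimal sketch runs the same logic on two small inline HTML snippets. The snippets, the coordinates in them, and the helper name `extract_geo` are made up for illustration; they are not taken from Wikipedia or from the book.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets showing the two shapes the geo microformat can take.
HTML_NESTED = ('<span class="geo"><span class="latitude">52.3667</span>, '
               '<span class="longitude">4.8833</span></span>')
HTML_COMPACT = '<span class="geo">52.3667; 4.8833</span>'

def extract_geo(html):
    """Return (lat, lon) as floats, or None if no geo element is found."""
    soup = BeautifulSoup(html, 'html.parser')
    geo = soup.find(class_='geo')
    if geo is None:
        return None
    lat_tag = geo.find(class_='latitude')
    lon_tag = geo.find(class_='longitude')
    if lat_tag is not None and lon_tag is not None:
        # Nested form: latitude and longitude live in child elements.
        return float(lat_tag.get_text()), float(lon_tag.get_text())
    # Compact form: a single "lat; lon" string.
    parts = geo.get_text().split(';')
    if len(parts) == 2:
        return float(parts[0]), float(parts[1])
    return None

print(extract_geo(HTML_NESTED))
print(extract_geo(HTML_COMPACT))
```

Returning floats instead of strings is a design choice of this sketch; it makes the coordinates easier to reuse in later steps.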


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to that file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task.

Is KML a microformat? Why (not)?
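
As a possible starting point for the KML conversion in Task 1, here is a minimal sketch that builds a single KML `Placemark` with Python's standard `xml.etree.ElementTree` module. The function name `make_kml_placemark` and the example coordinates are assumptions for illustration; check the KML documentation linked above for the elements you actually need.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
# Serialize KML elements without a prefix (default namespace).
ET.register_namespace("", KML_NS)

def make_kml_placemark(name, lat, lon):
    """Return a KML document (as a string) with one point placemark."""
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # Note: KML orders coordinates as longitude,latitude (not lat,lon).
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

# Example values only; use the lat/lon you extracted in Exercise 1.
kml_text = make_kml_placemark("Amsterdam", 52.3667, 4.8833)
print(kml_text)
```

The coordinate order is the detail that most often goes wrong: the geo microformat lists latitude first, while KML expects longitude first.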

## Exercise 2 
Microformats are one way to find information on the web. In this example, however, you will not be using hRecipe. Instead, we'll show you how to find arbitrary tags in a webpage.


### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [4]:
import requests
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change it to your liking).
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/"

# requests will return the html found at the given webpage...
page = requests.get(URL)
# ...and a BeautifulSoup object can be created from its content.
soup = BeautifulSoup(page.content, 'html.parser')

listchildren = list(soup.children)

We can find any element in the page through *CSS selectors*.
You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp), but in short these are "." for classes, "#" for ids, and plain text for the element name.


You can also combine them, so that looking for ".class1.class2" would select all elements that have both classes. For a deeper overview, please check the above link (or search for "CSS selectors"). 
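
The selector types described above can be tried on a small inline document first. This is a minimal sketch; the HTML, its class names and its id are invented for illustration and do not come from the recipe website.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, only used to demonstrate the selectors.
HTML = """
<div id="recipe">
  <h2 class="title">Pasta</h2>
  <ul class="ingredients featured">
    <li>spaghetti</li>
    <li>eggs</li>
  </ul>
  <ul class="ingredients">
    <li>salt</li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

by_class = soup.select(".ingredients")           # every element with class "ingredients"
by_id = soup.select_one("#recipe")               # the element with id "recipe"
by_name = soup.select("li")                      # every <li> element
combined = soup.select(".ingredients.featured")  # elements that have both classes

# A space means "descendant of": <li> items inside the featured list only.
featured_items = [li.get_text() for li in soup.select(".ingredients.featured li")]
print(featured_items)
```

`select` always returns a list, while `select_one` returns the first match or `None`, so check for `None` before using its result on a real page.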

In [5]:
print(len(listchildren)) # we can see here how many children the html doc has got.
ingredients_unparsed = soup.select_one(".tasty-recipes-ingredients")
# let's get all the "list item" elements in a list:
ing_unp = ingredients_unparsed.findAll('li')
print(ing_unp)

31
[<li><span data-amount="1">1</span> pound spaghetti noodles</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> smoked mozzarella cheese</li>, <li><span data-amount="0.5" data-unit="cup">½ cup</span> grated Parmesan cheese, plus more for serving</li>, <li><span data-amount="4">4</span> egg yolks</li>, <li><span data-amount="1" data-unit="cup">1 cup</span> frozen Earthbound Farm Organic peas</li>, <li><span data-amount="8" data-unit="cup">8 cup</span>s Earthbound Farm Organic spinach</li>, <li><span data-amount="3" data-unit="tablespoon">3 tablespoon</span>s butter</li>, <li><a data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" target="_blank">Kosher salt</a></li>, <li>Fresh ground black pepper</li>]


Mmmh... not so pretty yet. How about listing the items using the ```.text``` attribute?

In [6]:

ingredients = [t.text for t in ing_unp]
print("Ingredients:\n")
# A plain loop avoids building a throwaway list of None values.
for i in ingredients:
    print(i)

Ingredients:

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper

Good. Now the instructions:

In [7]:
instructions_unparsed = soup.select_one(".tasty-recipes-instructions")
instructions_unparsed = instructions_unparsed.findAll("li")
print(instructions_unparsed)

[<li>In a large pot, combine 6 quarts of water with 2 tablespoons <a data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" target="_blank">kosher salt</a> and bring it to a boil.</li>, <li>Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.</li>, <li>Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.</li>, <li>In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon <a data-tasty-links-no-disclosure="" href="https://www.acouplecooks.com/what-is-kosher-salt/" target="_blank">kosher salt</a>. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).</li>, <li>To serve, top each pasta serving with a whole egg yolk and ad

Let's finish off with the title:

In [8]:
title_unparsed = soup.select_one(".post-header")
categorical_title = title_unparsed.text.split("›")  # website-specific divider.
recipe_title = categorical_title[-1].strip()  # remove the surrounding whitespace.
recipe_title

'Vegetarian Carbonara'

### Task 2.1
Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will use different class names). 

Make sure to:

- Return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.
- Not all items have been cleaned of their HTML markup (compare the variables ```ingredients``` and ```instructions_unparsed```). Make sure to return a list with human-readable content (i.e. by using the ```.text``` attribute).


In [9]:
# -*- coding: utf-8 -*-

import requests
import json
from bs4 import BeautifulSoup

# Pass in a URL containing hRecipe, such as
# https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/

URL = "https://www.acouplecooks.com/"  # append YOUR RECIPE HERE

# Parse out some of the pertinent information for a recipe.
# See http://microformats.org/wiki/hrecipe.

def parse_website(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
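    # Your code here. Define the three names used in the return statement;
    # until you do, running this cell raises a NameError.
    #
    #
    #
    #
    #
    #
    #
    # fn =            (the recipe title)
    # ingredients = 
    # instructions = 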

    return {
            'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print (recipe)

NameError: name 'fn' is not defined

But how can we get information not just from one website, but from all of them? 

The answer: microformats.

But rather than manually extracting information from the schema.org or hRecipe formats, we can use a package, ```scrape-schema-recipe```.

Feel free to experiment with it. 
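
To get a feeling for what such a package has to deal with, here is a minimal sketch that reads schema.org `Recipe` data embedded as JSON-LD, one common way websites publish it. This is not how ```scrape-schema-recipe``` is implemented; the class name `JsonLdCollector` and the sample page are assumptions for illustration, and only the standard library is used.

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))

# Hypothetical page; real pages are larger and may nest the data differently.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Vegetarian Carbonara",
 "recipeIngredient": ["1 pound spaghetti noodles", "4 egg yolks"]}
</script>
</head><body></body></html>
"""

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
recipes = [item for item in collector.items
           if isinstance(item, dict) and item.get("@type") == "Recipe"]
print(recipes[0]["name"], recipes[0]["recipeIngredient"])
```

Because the property names (`name`, `recipeIngredient`) come from the shared schema.org vocabulary, the same code works on any site that publishes its recipes this way, unlike the class-based selectors used earlier.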

### Task 2.2
hRecipe is a microformat specifically created for recipes.
Can you, for example, easily compare the ingredients of different dessert recipes? For inspiration, you can look back at the exercises you did in Hands-on session 1, where you compared different sets of tweets.
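
One simple way to approach such a comparison is to treat each recipe's ingredients as a set and measure the overlap. The sketch below uses Jaccard similarity, which is just one option; the two ingredient lists and the helper names are invented for illustration, and real ingredient strings would first need quantities and units stripped.

```python
def normalise(ingredients):
    """Lower-case and strip each ingredient so equal items compare equal."""
    return {item.strip().lower() for item in ingredients}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    a, b = normalise(a), normalise(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical, already-cleaned ingredient lists.
tiramisu = ["Mascarpone", "eggs", "sugar", "coffee", "ladyfingers", "cocoa"]
panna_cotta = ["cream", "Sugar", "gelatin", "vanilla"]

shared = normalise(tiramisu) & normalise(panna_cotta)
score = jaccard(tiramisu, panna_cotta)
print(shared, round(score, 3))
```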

## Exercise 3

Schema.org is one of the most widely used annotation formats. It is a multipurpose vocabulary launched by Google, Microsoft and Yahoo! (later joined by Yandex). It can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib-microdata package, which you installed in one of the previous steps.



In [None]:
# -*- coding: utf-8 -*-

import rdflib
import rdflib_microdata

# Pass in a URL containing Schema.org microdata
url = "http://www.last.fm/music/Red+Hot+Chili+Peppers?ac=red"

# Parse the page's microdata into an RDF graph and print the serialized triples.
g = rdflib.Graph()
g.parse(url, format="microdata")
print(g.serialize())
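
If the cell above does not work in your environment, it can still help to see what microdata looks like in the HTML itself. The following minimal sketch collects `itemprop` values from a small, flat, made-up fragment using only the standard library. It is not a full microdata parser (nested `itemscope` items are ignored), and the class name, the fragment and the genre value are assumptions for illustration.

```python
from html.parser import HTMLParser

class ItempropCollector(HTMLParser):
    """Collect (property, value) pairs from elements carrying itemprop."""

    def __init__(self):
        super().__init__()
        self._open_prop = None
        self.properties = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:      # e.g. <meta itemprop=... content=...>
            self.properties.append((prop, attrs["content"]))
        elif "href" in attrs:       # e.g. <a itemprop=... href=...>
            self.properties.append((prop, attrs["href"]))
        else:                       # value is the element's text
            self._open_prop = prop

    def handle_data(self, data):
        if self._open_prop and data.strip():
            self.properties.append((self._open_prop, data.strip()))
            self._open_prop = None

# Hypothetical fragment in the style of a band page.
SAMPLE = """
<div itemscope itemtype="http://schema.org/MusicGroup">
  <h1 itemprop="name">Red Hot Chili Peppers</h1>
  <meta itemprop="genre" content="rock">
  <a itemprop="url" href="http://www.last.fm/music/Red+Hot+Chili+Peppers">profile</a>
</div>
"""

collector = ItempropCollector()
collector.feed(SAMPLE)
for prop, value in collector.properties:
    print(prop, "=", value)
```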

### Task 3.1 
Compare the schema.org information about a band on last.fm to the Facebook Open Graph information about the same band. What are the differences? Which format do you think supports better interoperability?
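
For the Open Graph side of the comparison, the data usually sits in `<meta property="og:...">` tags in the page head. Here is a minimal sketch, assuming a made-up page fragment and a hypothetical helper class `OpenGraphCollector`, that gathers those tags into a dictionary using only the standard library.

```python
from html.parser import HTMLParser

class OpenGraphCollector(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.og[prop] = attrs["content"]

# Hypothetical head section; the values are examples, not real page data.
SAMPLE_HEAD = """
<head>
  <meta charset="utf-8">
  <meta property="og:title" content="Red Hot Chili Peppers">
  <meta property="og:type" content="musician">
  <meta property="og:url" content="https://example.org/rhcp">
</head>
"""

og_collector = OpenGraphCollector()
og_collector.feed(SAMPLE_HEAD)
print(og_collector.og)
```

Comparing the keys you find here with the schema.org properties of the same band is one concrete way to answer the interoperability question.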

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.