*****************************************************************
#  The Social Web 
- Instructors: Davide Ceolin, Dayana Spagnuelo
- Lab Assistants: Michael Accetto, Sarthak Gupta 
- Exercises for Hands-on session 3
- 20 February 2020 11:00 - 12:45                 
- NU-5B-21, NU-6A-04, NU-6B-20, NU-6C-39, NU-6C-40                             
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data. 

Prerequisites:
- Python 3.7
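- Python packages: requests, beautifulsoup4 (imported as `bs4`), rdflib, rdflib_microdata. The HTML parser used here, `html.parser`, ships with Python 3 and needs no separate install.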

You have to install the rdflib_microdata package from Git, as it is not available from the standard pip repository. 
You can use:



In [1]:
!pip install -e git+https://github.com/edsu/rdflib-microdata.git#egg=rdflib-microdata

Obtaining rdflib-microdata from git+https://github.com/edsu/rdflib-microdata.git#egg=rdflib-microdata
  Updating /home/mknw/.venv/tswvenv37/src/rdflib-microdata clone
Installing collected packages: rdflib-microdata
  Found existing installation: rdflib-microdata 0.2.0
    Uninstalling rdflib-microdata-0.2.0:
      Successfully uninstalled rdflib-microdata-0.2.0
  Running setup.py develop for rdflib-microdata
Successfully installed rdflib-microdata


In [2]:
# If you're using a virtualenv, make sure it's activated before running 
# this cell!
!pip install requests
!pip install beautifulsoup4
!pip install rdflib
# html.parser is part of the Python 3 standard library, so no separate
# HTMLParser install is needed. rdflib-microdata was already installed
# from Git in the previous cell.


More generally, pip can install a package straight from its GitHub repository by following this syntax:  

`pip install git+YOUR_GITHUB_REPOSITORY_URL` 

The rdflib-microdata command in the first cell is an instance of this.


##  Exercise 1 
Extracting coordinates from a webpage and reformatting them in the geo microformat (based on Example 8-1 in Mining the Social Web). Save and run the following code as a Python script.


In [4]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# The URL of a page with geotags is hard-coded here; replace it with any
# other page that carries geo markup.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent': "Social Web Course Student"})
# Name the parser explicitly to avoid BeautifulSoup's "no parser" warning.
soup = BeautifulSoup(req.text, 'html.parser')

# Find the first element, of any tag, with class "geo".
geoTag = soup.find(True, 'geo')

if geoTag and len(geoTag) > 1:
    # Nested form: separate child elements for latitude and longitude.
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Compact form: a single "lat; lon" string.
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')
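The script prints the coordinates it finds. Writing them back out as geo markup could look like the minimal sketch below. The function name `to_geo_microformat` and the sample coordinates are illustrative assumptions; only the class names `geo`, `latitude` and `longitude` come from the script in this exercise.

```python
from html import escape


def to_geo_microformat(lat, lon):
    """Return an HTML snippet that marks up a coordinate pair as geo."""
    # 'geo', 'latitude' and 'longitude' are the class names that the
    # Exercise 1 script searches for.
    return (
        '<span class="geo">'
        f'<span class="latitude">{escape(str(lat))}</span>; '
        f'<span class="longitude">{escape(str(lon))}</span>'
        '</span>'
    )


# Illustrative values only, not scraped from any page.
print(to_geo_microformat("52.3728", "4.8936"))
```

Feeding the resulting snippet back into the Exercise 1 parsing logic should take the first branch, because the outer `geo` element has more than one child.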


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: http://kml-samples.googlecode.com/svn/trunk/kml/Placemark/placemark.kml
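As a starting point, a single KML Placemark can be built with the standard library, as in the rough sketch below. The function name `placemark_kml` and the sample values are assumptions made for illustration; check the element names against the KML documentation linked above. Note that KML writes coordinates as longitude first, then latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, lat, lon):
    """Return a KML document (as a string) with one Placemark."""
    # Serialise the KML namespace as the default namespace (no prefix).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML order is longitude,latitude (the reverse of the geo microformat).
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Illustrative values only.
print(placemark_kml("Amsterdam", "52.3728", "4.8936"))
```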

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to that file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task.

Is KML a microformat? Why (not)?

## Exercise 2 
There are many specialised microformats, such as hRecipe for marking up recipes. This format helps search engines find recipes and index them properly, so that a search for "recipe biscotti" returns a page that contains ingredients and instructions. 


### Task 2 
Parsing hRecipe data for a Chocolate Biscotti recipe (from Example 2-7 in Mining the Social Web). Save and run the following code as a Python script.

In [11]:
# -*- coding: utf-8 -*-

import requests
import json
from bs4 import BeautifulSoup, Tag, NavigableString

# Pass in a URL containing hRecipe, such as
# http://wholewheatsweets.com/recipe/cookies/whole_wheat_chocolate_hazelnut_biscotti

URL = "https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/"

# Parse out some of the pertinent information for a recipe.
# See http://microformats.org/wiki/hrecipe.

def parse_hrecipe(url):
    # Fetch the page passed in as an argument (not the global URL).
    req = requests.get(url)

    # Name the parser explicitly to avoid BeautifulSoup's "no parser" warning.
    soup = BeautifulSoup(req.text, 'html.parser')

    # Find the first element, of any tag, with class "hrecipe".
    hrecipe = soup.find(True, 'hrecipe')

    if hrecipe and len(hrecipe) > 1:
        fn = hrecipe.find(True, 'fn').string
        yield_ = hrecipe.find(True, 'yield').find(string=True)
        ingredients = [i.string
            for i in hrecipe.find_all(True, 'ingredient')
                if i.string is not None]

        instructions = []
        for i in hrecipe.find(True, 'instructions').find_all(True, 'instruction'):
            # Tag and NavigableString live in the bs4 module, not on the
            # BeautifulSoup class.
            if isinstance(i, Tag):
                s = ''.join(i.find_all(string=True)).strip()
            elif isinstance(i, NavigableString):
                s = i.string.strip()
            else:
                continue

            if s != '':
                instructions += [s]

        return {
            'name': fn,
            'yield': yield_,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    else:
        return {}

recipe = parse_hrecipe(URL)
print(json.dumps(recipe, indent=4))

{}


### Task 2.1
Can you modify the hRecipe script so that it gives a more informative error message, instead of {}, when no recipe information is found?


### Task 2.2 
Does the hRecipe format facilitate easy comparison of different recipes? Can you for example easily compare different dessert recipe ingredients? For inspiration you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets.
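One possible way to approach the comparison is with set operations on the ingredient lists, as sketched below. The two ingredient lists are made-up examples, and the helper names `normalise` and `jaccard` are assumptions for illustration. Real hRecipe ingredients usually include quantities and units (for example "2 cups flour"), so exact string matching like this will often find less overlap than actually exists.

```python
def normalise(ingredients):
    """Lower-case and strip each ingredient and drop empty entries."""
    return {item.strip().lower() for item in ingredients if item and item.strip()}


def jaccard(a, b):
    """Jaccard similarity of two ingredient lists (0.0 if both are empty)."""
    a, b = normalise(a), normalise(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical ingredient lists, as parse_hrecipe() might return them.
biscotti = ["Flour", "sugar", "eggs", "cocoa powder"]
brownies = ["flour", "Sugar", "eggs", "butter", "dark chocolate"]

print(sorted(normalise(biscotti) & normalise(brownies)))  # shared ingredients
print(jaccard(biscotti, brownies))                        # overlap score
```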

## Exercise 3

As you might have noticed in the previous exercise, hRecipe is no longer used on many sites; Schema.org annotations are added instead. Schema.org is a multipurpose format created by a consortium of Yahoo!, Google and Microsoft (later joined by Yandex). It can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.
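To get a feel for what Schema.org microdata looks like inside a page, the sketch below pulls `itemprop` values out of a small HTML fragment using only the standard library. The fragment, the class `ItempropCollector` and the function `extract_itemprops` are made up for illustration. This is a deliberately simplified reading of microdata: it only handles properties whose value is the element's text, and ignores nesting and values carried in attributes such as `href` or `content`.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from microdata-annotated HTML."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._current = None  # itemprop whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        # Take the first non-blank text after an itemprop start tag.
        if self._current and data.strip():
            self.properties.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical fragment, not taken from a real site.
SAMPLE = """
<div itemscope itemtype="http://schema.org/MusicGroup">
  <h1 itemprop="name">Example Band</h1>
  <span itemprop="genre">Rock</span>
</div>
"""

print(extract_itemprops(SAMPLE))
```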

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib-microdata package, which is not in the standard pip repository; install it from Git as shown in the prerequisites at the start of this session.

In [8]:
# -*- coding: utf-8 -*-

import rdflib
import rdflib_microdata

# Pass in a URL containing Schema.org microdata
url = "http://www.last.fm/music/Red+Hot+Chili+Peppers?ac=red"

g = rdflib.Graph()
g.parse(url, format="microdata")
print (g.serialize())

b'<?xml version="1.0" encoding="UTF-8"?>\n<rdf:RDF\n   xmlns:ns1="[http://schema.org/Place]#"\n   xmlns:ns2="[http://schema.org/Organization]#"\n   xmlns:ns3="[http://schema.org/MusicGroup]#"\n   xmlns:ns4="[http://schema.org/MusicAlbum]#"\n   xmlns:ns5="[http://schema.org/MusicRecording]#"\n   xmlns:ns6="[http://schema.org/MusicEvent]#"\n   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n>\n  <rdf:Description rdf:nodeID="Nfd1c0457e5914d5fa0151c25c7230d8c">\n    <ns6:name>Firenze Rocks</ns6:name>\n    <rdf:type rdf:resource="[http://schema.org/MusicEvent]"/>\n    <ns6:startDate>2020-06-10T00:00:00</ns6:startDate>\n    <ns6:url rdf:resource="/festival/4549753+Firenze+Rocks"/>\n    <ns6:location rdf:nodeID="N0d8b7345281f4206afe69c774ecb1aa2"/>\n  </rdf:Description>\n  <rdf:Description rdf:nodeID="N0c6b51685b9848b68f715c448b05d207">\n    <ns3:image>https://lastfm.freetls.fastly.net/i/u/ar0/366962d5733a4aee8bee4a136c239d47.jpg</ns3:image>\n    <rdf:type rdf:resource="[http://schem

### Task 3.1 
Compare the schema.org information about a band on last.fm with the Open Graph information about the same band on Facebook. What are the differences? Which format do you think supports better interoperability?
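Open Graph data is usually embedded as `<meta property="og:..." content="...">` tags in a page's head. The sketch below shows one way to collect those tags with the standard library so they can be put side by side with the schema.org output. The HTML fragment, the class `OpenGraphCollector` and the function `extract_open_graph` are illustrative assumptions, and the values are invented.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect og:* properties from <meta> tags into a dictionary."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.og[prop] = attrs["content"]


def extract_open_graph(html):
    parser = OpenGraphCollector()
    parser.feed(html)
    parser.close()
    return parser.og


# Hypothetical fragment, not taken from a real page.
SAMPLE = """
<head>
  <meta property="og:title" content="Example Band" />
  <meta property="og:type" content="website" />
  <meta name="description" content="ignored" />
</head>
"""

print(extract_open_graph(SAMPLE))
```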

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.