*****************************************************************
#  The Social Web: data representation
- Instructors: Davide Ceolin, Emma Beauxis-Aussalet.
- TAs: Zubaria Inayat, Maxim Sergeev, Zhuofan Mei, Alexander Schmatz, Ling Jin.
- Exercises for Hands-on session 2
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.

Prerequisites:
- Python 3.8
- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib


In [1]:
# If you're using a virtualenv, make sure it's activated before running
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib
!pip install cloudscraper


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip

##  Exercise 1

Even if web pages do not use microformats, interesting data can often be extracted from the HTML. You may use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.
The example below shows how to find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: compare the code below with the HTML source of the Wikipedia page: the image URL is in the "src" attribute of the "img" element inside the "table" element with class="infobox".

In [2]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# URL of the page to scrape. Change it to try another page that contains geotags.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent': "Social Web Course Student"})
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)
# find_all is the current name of the deprecated findAll
image1 = soup.find_all('table', class_='infobox')[0].find('img')
print(image1['src'])


//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png/330px-Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png
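
The printed `src` value is protocol-relative (it starts with `//`), so it cannot be requested as-is. A minimal sketch of how it could be turned into a full URL with the standard library, assuming the page URL used above; `absolute_url` is a hypothetical helper name and the image path is a shortened, made-up example.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Amsterdam"

def absolute_url(page_url, src):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(page_url, src)

# Shortened, made-up image path for illustration only
print(absolute_url(PAGE_URL, "//upload.wikimedia.org/example/image.png"))
```

The scheme (`https`) is taken from the page URL, so the same helper also works for ordinary relative paths such as `/static/logo.png`.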




The next cell extracts coordinates from a web page and reformats them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that Wikipedia pages may encode latitude/longitude information in different ways. One of the ways used by the Amsterdam Wikipedia page is a span element that is not shown to the user, for example:
<span class="geo">52.367; 4.900</span>
This span element has a single child (len(geoTag) == 1) and no further structure, so we have to get the latitude and longitude manually by splitting the string on the semicolon (';').

In [3]:

geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Structured form: separate child elements for latitude and longitude
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Compact form: a single "lat; lon" string
    lat, lon = geoTag.string.split(';')
    lat, lon = lat.strip(), lon.strip()
    print('Location is at', lat, lon)
else:
    print('Location not found')


<span class="geo">52.37278; 4.89361</span>
Location is at 52.37278 4.89361
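
The two encodings described above can be handled by one small function that works on any HTML string. This is a sketch under the assumption that the nested form uses the class names `latitude` and `longitude`; `parse_geo` is a hypothetical helper, and the two HTML snippets are inline test data rather than real page content.

```python
from bs4 import BeautifulSoup

def parse_geo(html):
    """Return (lat, lon) as floats from the first element with class 'geo', or None."""
    soup = BeautifulSoup(html, "html.parser")
    geo = soup.find(class_="geo")
    if geo is None:
        return None
    lat_tag = geo.find(class_="latitude")
    lon_tag = geo.find(class_="longitude")
    if lat_tag is not None and lon_tag is not None:
        # Structured form: nested latitude/longitude elements
        return (float(lat_tag.get_text(strip=True)),
                float(lon_tag.get_text(strip=True)))
    # Compact form: "lat; lon" in a single string
    lat, lon = (part.strip() for part in geo.get_text().split(";"))
    return float(lat), float(lon)

compact = '<span class="geo">52.37278; 4.89361</span>'
nested = ('<span class="geo"><span class="latitude">52.37278</span>, '
          '<span class="longitude">4.89361</span></span>')

print(parse_geo(compact))
print(parse_geo(nested))
```

Returning floats instead of strings makes the result directly usable for the KML conversion in Task 1.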


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to that file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task. If that fails, you can also use a KML viewer website such as https://kmzview.com/.

Is KML a microformat, why (not)?

In [4]:
# This code is from my KML file, which I uploaded to kmzview.com and worked successfully.
# Note: KML lists coordinates as longitude,latitude,altitude, so the longitude (4.89361) comes first.

"""
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>A2: T1</name>
    <description>Task 1</description>
    <Point>
      <coordinates>4.89361,52.37278,0</coordinates>
    </Point>
  </Placemark>
</kml>
"""

'\n<?xml version="1.0" encoding="UTF-8"?>\n<kml xmlns="http://www.opengis.net/kml/2.2">\n  <Placemark>\n    <name>A2: T1</name>\n    <description>Task 1</description>\n    <Point>\n      <coordinates>4.89361,52.37278,0</coordinates>\n    </Point>\n  </Placemark>\n</kml>\n'

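No, KML is not a microformat. Microformats annotate existing HTML elements (typically through class attributes), so the semantics live inside a web page, whereas KML is a standalone XML-based file format.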
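
Instead of writing the KML by hand, the file can be generated from the coordinates found in Exercise 1. The following is a minimal sketch using only the standard library; `point_to_kml` is a hypothetical helper name, and the placemark name is an example value. Keep in mind that KML expects the order longitude,latitude,altitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def point_to_kml(name, lat, lon, description=""):
    """Build a KML document with a single Placemark for the given point."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    if description:
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = description
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML coordinate order: longitude, latitude, altitude
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    body = ET.tostring(kml, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

kml_text = point_to_kml("Amsterdam", 52.37278, 4.89361, "Task 1")
print(kml_text)
```

The returned string can be saved to a `.kml` file and uploaded to one of the services mentioned in the task.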

## Exercise 2 
In order to find information on the web we can use microformats such as [hRecipe](https://microformats.org/wiki/hrecipe) or Schema.org's [Recipe](https://schema.org/Recipe). But first, we'll show you how to find arbitrary tags in a web page.


### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [5]:
import cloudscraper
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change it to your liking).
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

# Create a CloudScraper object
scraper = cloudscraper.create_scraper()

# Use the CloudScraper object to fetch the HTML content
response = scraper.get(URL)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Now you can work with the 'soup' object as you did before
listchildren = list(soup.children)
#print(listchildren)


We can find any element in the page through *CSS selectors*.
You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp), but in short: use "." for classes, "#" for ids and plain text for the element name.


You can also combine them, so that looking for ".class1.class2" selects all elements that have both classes. For a deeper overview, please check the link above (or search for "CSS selectors").
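
To try the selector types without fetching a page, here is a small sketch on an inline HTML snippet. The snippet, its class names and its id are made up for illustration; only `select` and `select_one` are BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

html = """
<div id="recipe">
  <h1 class="title main">Carbonara</h1>
  <p class="note">Serves four</p>
  <p class="note highlight">Use fresh eggs</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

heading = demo_soup.select_one("h1").text                 # by element name
container = demo_soup.select_one("#recipe").name          # by id
notes = [p.text for p in demo_soup.select(".note")]       # by class
both = demo_soup.select_one(".note.highlight").text       # both classes at once

print(heading, container, notes, both, sep="\n")
```

`select` returns a list of all matches, while `select_one` returns only the first match (or `None` if nothing matches).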

In [6]:
print(len(listchildren)) # we can see here how many children the html doc has got.
title_unparsed = soup.select_one("title")
#show the title element
print(title_unparsed)

2
<title>Vegetarian Carbonara – A Couple Cooks</title>


Not so pretty... Use the ```.text``` attribute.

In [7]:
print(title_unparsed.text)

Vegetarian Carbonara – A Couple Cooks


The website has a block of JSON-LD data embedded. Try to see if you can find it in the soup object.
We can load the JSON-LD script with the json module to work with it more easily.
Let's get a list of the ingredients.

In [8]:
# Find the script tag containing the JSON-LD data
json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

# Extract the content of the script tag
script_content = json_ld_script.string

# Load the JSON data from the script content
data = json.loads(script_content)

# Access the "recipeIngredient" list
recipe_ingredients = data["@graph"][7]["recipeIngredient"]

# Print the list of ingredients
for ingredient in recipe_ingredients:
    print(ingredient)

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper
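
The hard-coded index `[7]` above depends on where the recipe node happens to sit in the `@graph` list, which can differ from page to page. A possible alternative, sketched here on a made-up JSON-LD sample, is to look the node up by its `@type`; `find_recipe_node` is a hypothetical helper name.

```python
import json

def find_recipe_node(data):
    """Return the first node in a JSON-LD '@graph' whose '@type' includes 'Recipe'."""
    for node in data.get("@graph", []):
        node_type = node.get("@type", [])
        # '@type' may be a single string or a list of strings
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Recipe" in types:
            return node
    return None

# Made-up sample data, not taken from a real page
sample = json.loads("""
{"@graph": [
  {"@type": "WebPage", "name": "Some page"},
  {"@type": ["Recipe"], "name": "Toy recipe",
   "recipeIngredient": ["1 cup peas", "Kosher salt"]}
]}
""")

recipe_node = find_recipe_node(sample)
print(recipe_node["recipeIngredient"])
```

With the `data` dictionary loaded above, the same lookup would be `find_recipe_node(data)`, assuming the page marks its recipe node with the type `Recipe`.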


Let's also print out the instructions.

In [9]:
recipe_instructions= data["@graph"][7]["recipeInstructions"]
#the instructions list contains dictionaries as elements, take a look at how the list is organized
for step in recipe_instructions:
    print(step["text"])

In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.
Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.
Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.
In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).
To serve, top each pasta serving with a whole egg yolk and additional Parmesan cheese, and stir the yolk into the pasta at the table (if you are uncomfortable serving egg yolks at the table, stir the egg yolks into the pasta in the skillet to heat them through). Serve immediately. (Note that the mozzarella cheese can become gummy th

Websites are going to be structured differently. Look at the following JSON-LD snippet.

In [10]:
json_example = {
    "title": "The anarchist cookbook",
    "recipeInstructions": "<ol class=\"recipeSteps\"><li>Cook the linguine according to the packet instructions. </li><li>Meanwhile, carefully crack the eggs into a small bowl and beat them with a fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest. </li><li>When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan. </li><li>Stir in the egg mixture and spinach with a wooden spoon – they'll cook gently in the residual heat. Add a little pasta water to loosen, if needed. </li><li>Share between bowls and serve with a green salad.</li></ol>",
    "ingredients": ["a lot of effort", "the right mindset"]
}

recipe_instructions = json_example["recipeInstructions"]
example_soup = BeautifulSoup(recipe_instructions, 'html.parser')

In [11]:
#to get a nice and clean list of the instructions, step by step
#we can use the find method to get the first "ol" element with attribute "class.." and then use find_all to get all list elements in there
#then we can strip the list items to obtain the instructions
list_items = example_soup.find('ol', class_='recipeSteps').find_all('li')
instructions = [item.get_text(strip=True) for item in list_items]
print(instructions)

['Cook the linguine according to the packet instructions.', 'Meanwhile, carefully crack the eggs into a small bowl and beat them with a fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest.', 'When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan.', "Stir in the egg mixture and spinach with a wooden spoon – they'll cook gently in the residual heat. Add a little pasta water to loosen, if needed.", 'Share between bowls and serve with a green salad.']
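
We have now seen two shapes for the instructions: a list of dictionaries with a "text" key, and a single HTML string. A sketch of one function that accepts both is shown below; `instructions_to_list` is a hypothetical helper name, and the inputs are shortened, made-up examples.

```python
from bs4 import BeautifulSoup

def instructions_to_list(raw):
    """Normalize recipe instructions to a list of plain-text steps."""
    if isinstance(raw, str):
        # HTML string: take the list items, or fall back to the whole text
        html_soup = BeautifulSoup(raw, "html.parser")
        items = html_soup.find_all("li")
        if items:
            return [li.get_text(strip=True) for li in items]
        text = html_soup.get_text(strip=True)
        return [text] if text else []
    steps = []
    for step in raw:
        if isinstance(step, dict):
            # Dictionary step: the instruction is stored under "text"
            steps.append(step.get("text", "").strip())
        else:
            steps.append(str(step).strip())
    return steps

as_html = "<ol><li>Boil water. </li><li>Add pasta.</li></ol>"
as_dicts = [{"@type": "HowToStep", "text": "Boil water."},
            {"@type": "HowToStep", "text": "Add pasta."}]

print(instructions_to_list(as_html))
print(instructions_to_list(as_dicts))
```

Both calls produce the same list, so the rest of the code does not need to know which shape a website uses.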


### Task 2.1
Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will have different class tags). 

Make sure to:

- return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.
- Not all items have been cleaned of their HTML markup (see variables ```ingredients``` vs. ```instructions_unparsed```). Make sure to return a list with human-readable content (e.g. by using the ```.text``` attribute).


In [12]:
#Here you can see the solution for our example website

URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

def parse_website(url):
    # Create a CloudScraper object
    scraper = cloudscraper.create_scraper()

    # Use the CloudScraper object to fetch the HTML content
    response = scraper.get(url)  # use the function argument, not the global URL

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #Get the title
    title_unparsed = soup.select_one("title")
    fn = title_unparsed.text
    
    json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

    # Extract the content of the script tag
    script_content = json_ld_script.string

    # Load the JSON data from the script content
    data = json.loads(script_content)

    # Access the "recipeIngredient" list
    recipe_ingredients = data["@graph"][7]["recipeIngredient"]
    
    ingredients = [ingredient for ingredient in recipe_ingredients]
    
    #Access the instructions
    recipe_instructions= data["@graph"][7]["recipeInstructions"]
    #the instructions list contains dictionaries as elements, take a look at how the list is organized
    instructions = [step["text"] for step in recipe_instructions]

    return {'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print (recipe)

{'name': 'Vegetarian Carbonara – A Couple Cooks', 'ingredients': ['1 pound spaghetti noodles', '½ cup smoked mozzarella cheese', '½ cup grated Parmesan cheese, plus more for serving', '4 egg yolks', '1 cup frozen Earthbound Farm Organic peas', '8 cups Earthbound Farm Organic spinach', '3 tablespoons butter', 'Kosher salt', 'Fresh ground black pepper'], 'instructions': ['In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.', 'Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.', 'Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.', 'In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the 

In [13]:
# -*- coding: utf-8 -*-

import cloudscraper
import json
from bs4 import BeautifulSoup

URL = "https://www.acouplecooks.com/rhubarb-pie/"

def parse_website(url):
    # Create a CloudScraper object
    scraper = cloudscraper.create_scraper()
    # Use the CloudScraper object to fetch the HTML content
    response = scraper.get(url)  # use the function argument, not the global URL
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    #Get the title
    title_unparsed = soup.select_one("title")
    fn = title_unparsed.text
    json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})
    # Extract the content of the script tag
    script_content = json_ld_script.string
    # Load the JSON data from the script content
    data = json.loads(script_content)
    # Access the "recipeIngredient" list
    recipe_ingredients = data["@graph"][7]["recipeIngredient"]
    ingredients = [ingredient for ingredient in recipe_ingredients]
    #Access the instructions
    recipe_instructions= data["@graph"][7]["recipeInstructions"]
    #the instructions list contains dictionaries as elements, take a look at how the list is organized
    instructions = [step["text"] for step in recipe_instructions]
    return {'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }

recipe = parse_website(URL)
print (recipe)

{'name': 'Classic Rhubarb Pie – A Couple Cooks', 'ingredients': ['1 Homemade Pie Crust (or your favorite recipe)', '6 cups diced rhubarb (about 2 pounds)', '1 1/4 cups granulated sugar', '1/4 cup cornstarch', '2 teaspoons vanilla extract', '1/4 teaspoon orange zest (or 1/2 teaspoon lemon zest)', '1/2 cup all-purpose flour', '1/4 cup light brown sugar, packed', '1/4 teaspoon cinnamon', '1 pinch kosher salt', '4 tablespoons salted butter, melted'], 'instructions': ['In a large skillet, mix the chopped rhubarb with sugar and place it on the counter. Allow it to sit and macerate at room temperature for 30 minutes while making the pie crust (this extracts the juices of the rhubarb; do not skip this step!).\xa0', 'Make the pie crust. Refrigerate while you make the topping.', 'Preheat the oven to 400°F.\xa0', 'Place the skillet on a burner on the stove. Stir in the cornstarch, orange zest, and vanilla extract. Add medium high heat and cook about 5 to 6 minutes, until the sauce becomes very th

But how can we get information not just from one website, but from all of them?

The answer: microformats.

Rather than extracting the information manually from the schema.org or hRecipe markup, we can use a package, ```scrape-schema-recipe```.

Feel free to experiment with it. 

### Task 2.2
hRecipe is a microformat specifically created for recipes.
Can you, for example, easily compare the ingredients of different dessert recipes? For inspiration you can look back at the exercises you did in Hands-on session 1, where you compared different sets of tweets.

In [15]:
import scrape_schema_recipe

# Recipe 1

URL = "https://www.allrecipes.com/recipe/12316/fresh-rhubarb-pie/"

recipe_list = scrape_schema_recipe.scrape_url(URL, python_objects=True)
recipe = recipe_list[0]

ingredients1 = recipe['recipeIngredient']

# Recipe 2 (This is copy and pasted from above)

URL2 = "https://www.acouplecooks.com/rhubarb-pie/"

scraper2 = cloudscraper.create_scraper()
response2 = scraper2.get(URL2)
soup2 = BeautifulSoup(response2.text, 'html.parser')
json_ld_script = soup2.find("script", {"class": "yoast-schema-graph"})
script_content = json_ld_script.string
data = json.loads(script_content)
recipe_ingredients = data["@graph"][7]["recipeIngredient"]

ingredients2 = [ingredient for ingredient in recipe_ingredients]

for ingredient in ingredients1:
    if ingredient not in ingredients2:
        print(ingredient)

print()

for ingredient in ingredients2:
    if ingredient not in ingredients1:
        print(ingredient)

# It makes sense that all ingredients are printed: the comparison is an exact string match,
# and no two ingredient lines are worded identically (amounts and descriptions differ).

1.3333333730698 cups white sugar
6 tablespoons all-purpose flour
1 (14.1 ounce) package double-crust pie pastry, thawed
4 cups chopped rhubarb
1 tablespoon butter

1 Homemade Pie Crust (or your favorite recipe)
6 cups diced rhubarb (about 2 pounds)
1 1/4 cups granulated sugar
1/4 cup cornstarch
2 teaspoons vanilla extract
1/4 teaspoon orange zest (or 1/2 teaspoon lemon zest)
1/2 cup all-purpose flour
1/4 cup light brown sugar, packed
1/4 teaspoon cinnamon
1 pinch kosher salt
4 tablespoons salted butter, melted
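
Exact string matching finds no overlap because every line also carries amounts, units and preparation notes. One possible refinement, sketched below, is to reduce each line to its core words and compare the resulting sets, much like the set comparisons in Hands-on session 1. The `UNITS` list and the `core_words` helper are assumptions made for this illustration, and the two lists are a subset of the ingredient lines printed above.

```python
import re

# Small, hand-picked list of unit words (an assumption; extend as needed)
UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
         "pound", "pounds", "ounce", "ounces", "pinch", "package"}

def core_words(ingredient):
    """Lower-case an ingredient line and drop parentheticals, numbers and unit words."""
    text = re.sub(r"\([^)]*\)", " ", ingredient.lower())  # drop "(about 2 pounds)"
    text = text.split(",")[0]                             # drop ", melted" and similar
    words = re.findall(r"[a-z][a-z-]*", text)             # keep words, skip numbers
    return {word for word in words if word not in UNITS}

ingredients_a = ["4 cups chopped rhubarb", "1 tablespoon butter",
                 "6 tablespoons all-purpose flour"]
ingredients_b = ["6 cups diced rhubarb (about 2 pounds)",
                 "4 tablespoons salted butter, melted", "1/4 cup cornstarch"]

words_a = set().union(*(core_words(line) for line in ingredients_a))
words_b = set().union(*(core_words(line) for line in ingredients_b))

print(sorted(words_a & words_b))  # words that occur in both recipes
```

This is a rough heuristic: adjectives such as "chopped" or "salted" are kept as words, so the intersection shows shared vocabulary rather than a verified list of shared ingredients.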


## Exercise 3

Schema.org is one of the most widely used annotation formats. It is a multipurpose vocabulary created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib package, which you installed in one of the previous steps.



In [16]:
from rdflib import Graph

# Source: https://www.youtube.com/watch?v=sCU214rbRZ0
# Pass in a URL that serves RDF data (here a DBpedia resource).
# Note: "Micheal_Jackson" is a misspelled resource that only redirects to
# "Michael_Jackson", which is why the graph holds just two triples.
URL = "http://dbpedia.org/resource/Micheal_Jackson"

# Initialize a graph
g = Graph()

# Parse the RDF data served by DBpedia into the graph
result = g.parse(location=URL)

# Loop through the first 10 triples in the graph
for index, (sub, pred, obj) in enumerate(g):
    print(sub, pred, obj)
    if index == 9:
        break

http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageWikiLink http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/Michael_Jackson


In [17]:
# Print the size of the Graph
print(f'Graph has {len(g)} facts')

Graph has 2 facts


In [18]:
# Print out the entire Graph in the RDF Turtle format
print(g.serialize(format='ttl'))

@prefix ns1: <http://dbpedia.org/ontology/> .

<http://dbpedia.org/resource/Micheal_Jackson> ns1:wikiPageRedirects <http://dbpedia.org/resource/Michael_Jackson> ;
    ns1:wikiPageWikiLink <http://dbpedia.org/resource/Michael_Jackson> .




### Task 3.1 
Compare the schema.org information about a band on last.fm to the Facebook Open Graph information about the same band from Facebook. What are the differences? Which format do you think supports better interoperability? In particular, refer to the Microformat specifications indicated in the box on the top right corner.

In [20]:
import requests, json, extruct

url_lastfm = "https://www.last.fm/music/The+Beatles"
headers = {"User-Agent": "Mozilla/5.0"}
html_lastfm = requests.get(url_lastfm, headers=headers).text

data_lastfm = extruct.extract(
    html_lastfm,
    base_url=url_lastfm,
    syntaxes=['json-ld', 'microdata', 'opengraph']
)

print(json.dumps(data_lastfm['json-ld'], indent=2))


print("-------------------------------------------------------")

# Facebook Open Graph
fb_url = "https://www.facebook.com/thebeatles"
html_fb = requests.get(fb_url, headers={'User-Agent': 'Mozilla/5.0'}).text
soup_fb = BeautifulSoup(html_fb, "html.parser")

og_tags = soup_fb.find_all("meta", property=lambda p: p and p.startswith("og:"))

og_data = {tag.get("property"): tag.get("content") for tag in og_tags if tag.get("content")}

print(json.dumps(og_data, indent=2))

# From online research, it seems that last.fm has removed semantic markup (rdf). Thus, in some ways, it seems that Facebook's Open Graph
# would be more interoperable simply from an accessibility standpoint. However, I did research what last.fm data would have looked like, if
# I had been able to access it. Where Facebook's data all begins with "og:", making it FB-specific, last.fm is much more generic, with 
# intuitive titles that seem like they could be used across platforms (like name, genre, url). In this sense, last.fm would be much 
# more interoperable.

[]
-------------------------------------------------------
{
  "og:type": "video.other",
  "og:title": "The Beatles",
  "og:description": "The Beatles. 39.252.540 vind-ik-leuks \u00b7 65.337 personen praten hierover. New casting announced for The Beatles - A Four-Film Cinematic Event, directed by Sam Mendes....",
  "og:url": "https://www.facebook.com/thebeatles",
  "og:image:alt": "The Beatles",
  "og:image": "https://scontent-ams2-1.xx.fbcdn.net/v/t1.6435-1/57964042_10157220713559539_1721715379509657600_n.jpg?stp=dst-jpg_tt6&cstp=mx1134x1134&ctp=s720x720&_nc_cat=1&ccb=1-7&_nc_sid=3ab345&_nc_ohc=7UR6Xa9SM_YQ7kNvwHFraK4&_nc_oc=AdkDwGrSyPdiE2NVrWQmFfcryZCB3CNnwN8PsGWpMvSoOkcPlfKRFpAhBs_4riCNAl4&_nc_zt=24&_nc_ht=scontent-ams2-1.xx&_nc_gid=b5WzTG_v1UJTEuHKcrlykQ&oh=00_AfhjM7e7aEqG0R5vHkqLngxMvxGxcf0EPlw8awiGgtFmDA&oe=693D89DD",
  "og:locale": "en_US"
}
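
Since the live pages may not return both kinds of markup, the difference between the two formats can also be explored offline. The sketch below extracts Open Graph meta tags and a schema.org JSON-LD block from one inline HTML document; the band, the property values and the variable names are made up for illustration.

```python
import json
from bs4 import BeautifulSoup

html = """
<html><head>
<meta property="og:type" content="website">
<meta property="og:title" content="Example Band">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "MusicGroup",
 "name": "Example Band", "genre": "Rock"}
</script>
</head><body></body></html>
"""
demo = BeautifulSoup(html, "html.parser")

# Open Graph: flat key/value pairs in <meta property="og:..."> tags
og = {meta["property"]: meta["content"]
      for meta in demo.find_all("meta", property=True)
      if meta["property"].startswith("og:")}

# schema.org as JSON-LD: typed, possibly nested objects inside a <script> tag
ld = [json.loads(script.string)
      for script in demo.find_all("script", attrs={"type": "application/ld+json"})]

print(og)
print(ld[0]["@type"], ld[0]["genre"])
```

Open Graph yields a flat dictionary of strings, while the JSON-LD block carries an explicit type and can nest further objects, which is worth keeping in mind when judging interoperability.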


### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.

In [21]:
from mf2py.parser import Parser
import requests, json

# Microformat Example: h-card
html = """
<div class="h-card">
  <p class="p-name">Example Name</p>
  <a class="u-url" href="https://examplesite.com">Example Website</a>
  <p class="p-locality">Amsterdam</p>
</div>
"""

parsed = Parser(doc=html).to_dict()

# Output from Microformats.org
url = "https://microformats.org/wiki/h-card"
html_from_site = requests.get(url).text

parsed_site = Parser(doc=html_from_site).to_dict()

print("Example Output")
print(json.dumps(parsed, indent=2))
print("Output from Microformats.org")
print(json.dumps(parsed_site, indent=2))

print("------------------------------------------------------------------")

# Microformat Example: h-event
html_event = """
<div class="h-event">
  <h2 class="p-name">Example Event</h2>
  <p>
    <time class="dt-start" datetime="2025-11-12T19:00">Nov 12, 2025, 7PM</time>
    at <span class="p-location">Example Cafe</span>
  </p>
  <p class="p-description">Coffee house hosting a wonderful open evening.</p>
</div>
"""

parsed_event = Parser(doc=html_event).to_dict()

# Output from Microformats.org
url_event = "https://microformats.org/wiki/h-event"
html_from_site_event = requests.get(url_event).text
parsed_site_event = Parser(doc=html_from_site_event).to_dict()

print("Example Output")
print(json.dumps(parsed_event, indent=2))
print("Output from Microformats.org")
print(json.dumps(parsed_site_event, indent=2))

print("------------------------------------------------------------------")

# Microformat Example: h-entry
html_entry = """
<article class="h-entry">
  <h1 class="p-name">Example Class</h1>
  <a class="u-url" href="https://examplesite.com/">Example note</a>
  <p class="p-author h-card">
    <span class="p-name">Example Author</span>
  </p>
  <time class="dt-published" datetime="2025-11-10">November 10, 2025</time>
  <div class="e-content">
    <p>Example paragraph containing article content!</p>
  </div>
</article>
"""

parsed_entry = Parser(doc=html_entry).to_dict()

# Output from Microformats.org
url_entry = "https://microformats.org/wiki/h-entry"
html_from_site_entry = requests.get(url_entry).text
parsed_site_entry = Parser(doc=html_from_site_entry).to_dict()

print("Example Output")
print(json.dumps(parsed_entry, indent=2))
print("Output from Microformats.org")
print(json.dumps(parsed_site_entry, indent=2))

# With visual comparison, the example formats are very similar to the actual format of the output from Microformats.org, 
# but naturally the actual output contains a lot more information, especially as one block can contain many sub-blocks.
# The example formats show the simplest case, so it makes sense that actual implementations would be more complicated.

Example Output
{
  "items": [
    {
      "type": [
        "h-card"
      ],
      "properties": {
        "name": [
          "Example Name"
        ],
        "url": [
          "https://examplesite.com"
        ],
        "locality": [
          "Amsterdam"
        ]
      }
    }
  ],
  "rels": {},
  "rel-urls": {},
  "debug": {
    "description": "mf2py - microformats2 parser for python",
    "source": "https://github.com/microformats/mf2py",
    "version": "2.0.1",
    "markup parser": "html5lib"
  }
}
Output from Microformats.org
{
  "items": [
    {
      "type": [
        "h-card"
      ],
      "properties": {
        "role": [
          "Editor"
        ],
        "name": [
          "Tantek \u00c7elik"
        ]
      },
      "lang": "en"
    }
  ],
  "rels": {
    "stylesheet": [
      "/wiki/load.php?lang=en&modules=ext.pygments%7Cskins.vector.styles.legacy&only=styles&skin=vector"
    ],
    "shortcut": [
      "/favicon.ico"
    ],
    "icon": [
      "/favicon.ico"
 