## Creating a taxonomy from a website

#### Let us create a product taxonomy using Owlready2
We first need to install owlready2 and Beautiful Soup (for parsing website content):

In [1]:
!pip install bs4
!pip install owlready2



### Let us now import the needed libraries

In [2]:
import requests
from bs4 import BeautifulSoup  # HTML parser
from owlready2 import *
import types  # provides types.new_class, used below to create classes dynamically
import time
import json

### Continue with loading the ontology and defining the namespace

In [3]:
onto = get_ontology('http://TypeTaxonomyFromBulbagarden.owl#')

# create the base class that all extracted types will inherit from
with onto:
    class Types(Thing):
        label = ['Types']

### Now let us declare the website we are going to access

In [4]:
base_url = "https://bulbapedia.bulbagarden.net/wiki/Type"

### This is the website we are going to access. You can use a different one if you like

![Bulbagarden.png](img/Bulbagarden.png "Source 1")

## Right-click on the website and inspect the source code. There, you can find the tags that link to the subclasses you want to extract.
### In this case, we are interested in retrieving the classes "Normal", "Fire", "Grass", ...
#### The program shall extract the names of the subclasses, open the link to each subclass page, and extract the names of that page's subclasses in turn. This way, we create a taxonomy.

#### When scrolling down a bit, we find a block of code that consists of span elements (compare the line numbers in the image below).
![Source1.png](img/Source1.png "Source 1")

#### Almost at the very end of each line of code, we find a header that contains the types we are looking for (e.g. "Normal").
![Source3.png](img/Source3.png "Source 3")

#### One challenge when scraping a page is making sure you extract information only from the lines that interest you. In this case, for example, we are interested in only 18 of 3000+ lines. To do so, one has to get creative. Here, we take a span element as the starting point. These span elements sit within the lines shown above and define a unique style for the buttons that correspond to each type on the website. This way, we ensure that we only grab lines containing the wanted information.
![Source2.png](img/Source2.png "Source 2")
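
The style-based selection described above depends on a small predicate: a function that receives an element's `style` value and decides whether the element is kept. The following is a minimal sketch of that predicate on its own, using only the standard library so it can be tried without downloading anything. The helper name `style_starts_with` and the sample style strings are hypothetical. The assumption that elements without a `style` attribute yield `None` is inferred from the `L and ...` guard used in the scraping cell.

```python
def style_starts_with(prefix):
    """Build a predicate that accepts only style strings beginning with prefix."""
    def predicate(style):
        # Assumption: a span without a style attribute is passed in as None,
        # so we must check for None before calling startswith.
        return style is not None and style.startswith(prefix)
    return predicate


# Prefix taken from the scraping cell of this notebook.
is_type_button = style_starts_with('margin: 0 5px 0 10px;')

# Hypothetical style values, as they might appear on different span elements.
sample_styles = [
    'margin: 0 5px 0 10px; border-radius: 10px;',  # looks like a type button
    'color:#FFF',                                   # some other styled span
    None,                                           # span without a style attribute
]

matches = [style for style in sample_styles if is_type_button(style)]
print(matches)
```

Only the first sample value passes the filter. A predicate like this can be handed to `find_all` in place of the inline lambda, which keeps the selection rule in one named, testable place.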

### Now let us use Beautiful Soup to scrape the page

In [5]:
# assign bulbagarden base url for subclasses
pokeurl = "https://bulbagarden.net/home/"

# access the URL
response = requests.get(base_url)

# create a HTML parser
soup = BeautifulSoup(response.text, 'html.parser')

# We look for every element of type "span". Usually one would follow up with something like "style=True" or "href=True", depending on the element type.
# Instead, we use a trick here: by passing a lambda function, we can look specifically for span elements whose style starts with the given specification
for x in soup.find_all('span', {"style" : lambda L: L and L.startswith('margin: 0 5px 0 10px;')}):
    # We repeat this, but this time we only search within the elements we just extracted (so 18 instead of 3000+)
    for header in x.find_all('span', {"style" : lambda L: L and L.startswith('color:#FFF')}):
        with onto:
            # We clean the text so that only the header remains.
            class_label = header.text.replace('\n', '')
            class_name = class_label.replace(' ', '').replace('&', 'And')
            # We create a new class named class_name. It is a subclass of the previously created "Types" class.
            pokemon_type = types.new_class(class_name, (onto.Types,))
            # We append a label to it.
            pokemon_type.label.append(class_label)
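
The cleaning step in the cell above turns a scraped label into a class name by removing spaces and replacing `&`. The sketch below wraps that idea in a helper that can be tested without a network connection. The function name `to_class_name`, the `fallback` parameter, and the sample label `'Fire & Ice'` are hypothetical. Removing every remaining non-alphanumeric character is an extra assumption that goes beyond what the notebook does (the notebook keeps `???` as it is).

```python
import re


def to_class_name(label, fallback='Unknown'):
    """Turn a scraped label into a name that is safe to use as a class name."""
    # Same first steps as in the notebook: drop line breaks, replace '&'.
    cleaned = label.replace('\n', '').strip()
    name = cleaned.replace('&', 'And')
    # Assumption beyond the notebook: drop anything that is not a letter,
    # digit, or underscore (this also removes the spaces).
    name = re.sub(r'[^0-9A-Za-z_]', '', name)
    # A label such as '???' would otherwise end up as an empty name.
    return name or fallback


for label in ['Normal\n', 'Fire & Ice', '???']:
    print(repr(label), '->', to_class_name(label))
```

The original label is still worth keeping as `label` on the class, as the cell above does, so that the human-readable text survives even when the class name had to be changed.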

### Test if the classes were assigned correctly
Fun fact: "???" is an actual type that only existed from Generation II to Generation IV.

In [6]:
for cls in onto.classes():
    print(cls.label)

['Types']
['Normal']
['Fire']
['Fighting']
['Water']
['Flying']
['Grass']
['Poison']
['Electric']
['Ground']
['Psychic']
['Rock']
['Ice']
['Bug']
['Dragon']
['Ghost']
['Dark']
['Steel']
['Fairy']
['Stellar']
['???']


### Save the ontology and download it to check manually

In [7]:
onto.save(file = "TypeTaxonomyFromBulbagarden.owl", format = "rdfxml")
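
Besides opening the saved file in an ontology editor, a quick check can be scripted. The sketch below lists (subclass, superclass) pairs from an RDF/XML document using only the standard library. The `sample` string is a hand-written stand-in, not the real output of `onto.save`, and the helper name `list_subclasses` is hypothetical. The actual file may use absolute IRIs or a slightly different layout.

```python
import xml.etree.ElementTree as ET

# Standard namespace IRIs for OWL, RDF, and RDFS.
OWL = 'http://www.w3.org/2002/07/owl#'
RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RDFS = 'http://www.w3.org/2000/01/rdf-schema#'

# Hand-written sample that imitates the general shape of an RDF/XML ontology.
sample = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="#Types"/>
  <owl:Class rdf:about="#Fire">
    <rdfs:subClassOf rdf:resource="#Types"/>
    <rdfs:label>Fire</rdfs:label>
  </owl:Class>
</rdf:RDF>"""


def list_subclasses(rdf_xml):
    """Return (subclass, superclass) pairs found in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for cls in root.iter(f'{{{OWL}}}Class'):
        child = cls.get(f'{{{RDF}}}about')
        for parent in cls.findall(f'{{{RDFS}}}subClassOf'):
            pairs.append((child, parent.get(f'{{{RDF}}}resource')))
    return pairs


pairs = list_subclasses(sample)
print(pairs)
```

To try this on the saved ontology, read the file into a string (for example with `open("TypeTaxonomyFromBulbagarden.owl").read()`) and pass it to `list_subclasses`. Every scraped type should then appear paired with the `Types` class.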