In this notebook we try to extract all the metadata available on the FIS homologations page, separately for each destination, and cross-reference it with the number of certificates we expect based on the PDFs that were found.

In [7]:
import os
import re

In [57]:
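base = {}
for f in os.scandir("NOR-pdfs/"):
    # The first regex match is used as the destination key (e.g. 'NOR_Tolga').
    # Note: the class [aA-zZ\_] is loose; the range A-z also spans a few punctuation characters.
    name = re.findall(r'[aA-zZ\_]+[^(\_\d)]', f.name)
    # Count one certificate PDF for this destination.
    base[name[0]] = base.get(name[0], 0) + 1

base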

{'NOR_Tolga': 8,
 'NOR_Mo_i_Rana': 6,
 'NOR_Al__Liatoppen': 8,
 'NOR_Skaret__Molde': 7,
 'NOR_Asen_IL__Rotterudmoen__Nannestad': 5,
 'NOR_Bodo': 9,
 'NOR_Korlevoll__Odda': 6,
 'NOR_Stokke__Vestfold': 5,
 'NOR_Henningvola': 6,
 'NOR_Sandnes__Sor_Varanger': 4,
 'NOR_Hoydalsmo': 4,
 'NOR_Rena': 1}

These are the numbers of certificates we can expect for the different destinations. It seems that not all destinations had their PDFs extracted. We will handle this problem at a later time __(Remember!)__
We were able to handle Rena (here 'NOR_Rena') because there was only one certificate to extract the metadata from. The other destinations are a different case, since they have more than one course certificate stored on their HTML page.
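The per-destination count can also be sketched with `collections.Counter`. This is only an illustration: the helper name `count_by_destination`, the example file names, and the assumption that every PDF is named `<destination>_<number>.pdf` are hypothetical and not taken from the actual folder.

```python
import re
from collections import Counter

def count_by_destination(filenames):
    """Count files per destination, assuming names like '<destination>_<number>.pdf'."""
    # Assumption: only the trailing '_<digits>.pdf' differs between certificates
    # that belong to the same destination.
    return Counter(re.sub(r'_\d+\.pdf$', '', name) for name in filenames)

# Hypothetical file names, for illustration only.
example = ["NOR_Tolga_1.pdf", "NOR_Tolga_2.pdf", "NOR_Rena_1.pdf"]
counts = count_by_destination(example)
print(counts)
```

A `Counter` behaves like a dict, so it could be compared against the course files in the same way as `base`.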

In [31]:
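for f in os.scandir("NOR-courses/"):
    for k in base.keys():
        # Naive first attempt: compare the first four characters of the file name
        # with the first four characters of the key after its 'NOR_' prefix.
        if f.name[0:4] == k[4:8]:
            print(f.name)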

Rena.txt
Skaret%2C+Molde.txt
Bodo.txt
Tolga.txt
Henningvola.txt
Asen+IL%2C+Rotterudmoen%2C+Nannestad.txt
Hoydalsmo.txt
Sandnes%2C+Sor+Varanger.txt
Korlevoll%2C+Odda.txt
Stokke%2C+Vestfold.txt


Here we can see that only 10 of the 12 destinations were matched: "Mo_i_Rana" and "Al__Liatoppen" are missing, because the course file names (HTML pages saved as .txt) encode spaces as '+' and commas as '%2C'. We try to find a better way to compare:

In [32]:
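found_files = []
for f in os.scandir("NOR-courses/"):
    # Drop the extension with splitext; str.strip('.txt') removes characters, not a suffix.
    # Then map the URL encoding ('+' for a space, '%2C' for a comma) to underscores.
    stem = os.path.splitext(f.name)[0].replace('+', '_').replace('%2C', '_')
    # The keys in base carry a 'NOR_' prefix, so accept a match with or without it.
    if stem in base or "NOR_" + stem in base:
        found_files.append(f.name)
found_files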

['Rena.txt',
 'Skaret%2C+Molde.txt',
 'Bodo.txt',
 'Tolga.txt',
 'Henningvola.txt',
 'Asen+IL%2C+Rotterudmoen%2C+Nannestad.txt',
 'Hoydalsmo.txt',
 'Mo+i+Rana.txt',
 'Sandnes%2C+Sor+Varanger.txt',
 'Korlevoll%2C+Odda.txt',
 'Stokke%2C+Vestfold.txt',
 'Al%2C+Liatoppen.txt']
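The file names look URL-encoded, so the normalization above could also be sketched with `urllib.parse.unquote_plus` from the standard library. The helper name `normalize_course_name` is hypothetical, and the sketch assumes that every character other than a letter or digit should become a single underscore, which is what the keys in `base` suggest.

```python
import re
from urllib.parse import unquote_plus

def normalize_course_name(filename):
    """Turn a URL-encoded course file name into the underscore style of the PDF keys."""
    # Remove a trailing '.txt' extension if present.
    stem = filename[:-4] if filename.endswith(".txt") else filename
    # Decode the URL encoding: '+' becomes a space and '%2C' becomes a comma.
    decoded = unquote_plus(stem)
    # Assumption: each character that is not a letter or digit maps to one underscore.
    return re.sub(r'[^A-Za-z0-9]', '_', decoded)

normalized = normalize_course_name("Asen+IL%2C+Rotterudmoen%2C+Nannestad.txt")
print(normalized)
```

Decoding first means that other percent-encoded characters would be handled too, without adding one `replace` call per character.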

We can then check whether the number of found files matches the number of destinations in the base dict:

In [34]:
len(base) == len(found_files)

True
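Equal lengths do not by themselves show that every destination was matched, since one list could contain duplicates. A stricter variant might compare the two sets of names directly. The sketch below uses a hypothetical helper `missing_destinations` and made-up data to show the idea.

```python
def missing_destinations(expected_keys, matched_keys):
    """Return the expected destinations that have no matching course file, sorted."""
    return sorted(set(expected_keys) - set(matched_keys))

# Hypothetical data: both lists have length 3, but one destination is missing.
expected = ["NOR_Bodo", "NOR_Rena", "NOR_Tolga"]
matched = ["NOR_Bodo", "NOR_Rena", "NOR_Rena"]
missing = missing_destinations(expected, matched)
print(missing)
```

An empty result would mean that every expected destination has at least one matching file.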

Good, then we can continue the search for the metadata in each of the found files.

In [45]:
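def search_site(lst):
    """Collect short text snippets found between '>' and '<' in each course file."""
    alld = []

    # Greedy match from the first '>' to the last '<' on a line.
    pattern = r'(?<=\>).*(?=\<)'

    for f in lst:
        # Read the saved HTML page; the with-block closes the file again.
        with open("NOR-courses/" + f) as file:
            results = re.findall(pattern, file.read())

        # Keep only non-blank snippets shorter than 50 characters.
        data = [r for r in results if len(r) < 50 and r.strip(" ") != ""]
        alld.append([f, data])

    return alld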
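As an alternative to the regular expression, the text nodes could be collected with `html.parser.HTMLParser` from the standard library. This is a different technique from the one used in `search_site`, shown only as a sketch: the class `TextCollector`, the function `extract_texts`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document."""

    def __init__(self, max_len=50):
        super().__init__()
        self.max_len = max_len
        self.texts = []

    def handle_data(self, data):
        # Called by the parser for the text between tags.
        text = data.strip()
        if text and len(text) < self.max_len:
            self.texts.append(text)

def extract_texts(html, max_len=50):
    """Return the short text snippets of an HTML string, in document order."""
    parser = TextCollector(max_len)
    parser.feed(html)
    parser.close()
    return parser.texts

# Hypothetical markup for illustration only.
sample = '<tr><td><div class="clip">20/51.01/1.2</div></td><td>Level</td><td>COC</td></tr>'
texts = extract_texts(sample)
print(texts)
```

Because the parser reports only text nodes, entries that still contain markup, such as `<div class="clip">…</div>`, would not end up in the result.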


We can see that re.findall ran successfully on the relevant pages by checking that all file names are present in findings:

In [48]:
findings = search_site(found_files)
for e in findings:
    print(e[0])

Rena.txt
Skaret%2C+Molde.txt
Bodo.txt
Tolga.txt
Henningvola.txt
Asen+IL%2C+Rotterudmoen%2C+Nannestad.txt
Hoydalsmo.txt
Mo+i+Rana.txt
Sandnes%2C+Sor+Varanger.txt
Korlevoll%2C+Odda.txt
Stokke%2C+Vestfold.txt
Al%2C+Liatoppen.txt


The next step is to see whether we can extract all the metadata for each of the available certificates. We can try to write a generalized function and call it on the destination with the most listed certificates, in this case __Bodo__:

In [58]:
findings[2] #this is Bodo.

['Bodo.txt',
 ['Homologations',
  'Homologations',
  'Filter',
  'Discipline',
  'All',
  'Cross-Country',
  'Ski Jumping',
  'Alpine Skiing',
  'Snowboard',
  'Speed Skiing',
  'Homologation number',
  'Place',
  'Nation',
  'Level',
  'All',
  'WC',
  'COC',
  'Event',
  'All',
  'Downhill',
  'Freeski Slopestyle',
  'Giant Slalom',
  'Indoor',
  'Parallel',
  'Parallel',
  'Parallel Giant Slalom',
  'Parallel Slalom',
  'Slalom',
  'Snowboard Cross',
  'Speed Skiing',
  'Super G',
  'Category',
  'All',
  'A',
  'B',
  'C',
  'D',
  'E',
  'Gender',
  'All',
  'Women',
  'Men',
  'search',
  'Bodo',
  'Homologation#',
  'Homologation #',
  'Course',
  'Course',
  'Course data',
  'Level',
  'Length',
  'Valid until',
  'Inspector',
  'Certifcate',
  'Cert.',
  '<div class="clip">20/51.01/1.2</div>',
  '<div class="clip">Bestemorenga 1.2km sprint</div>',
  '20/51.01/1.2',
  'Bestemorenga 1.2km sprint',
  'Level',
  'COC',
  'Length',
  '1248',
  'Valid until',
  '30.06.2025',
  'Insp

Let's try to reuse some of the code we already have available: