# Scraping ANS: Data collection 

The objective of this notebook is to scrape the website of the ANS (Algemene Nederlandse Spraakkunst; ANS | e-ANS, z.d.) for minimal pairs, in order to create a challenge set for evaluating the linguistic knowledge of Dutch language models.

The ANS is the standard reference work describing the grammatical rules of Dutch. The descriptions include examples of correct and incorrect usage of those rules. Some example sentences are assigned to one of the following classes: informeel (informal), formeel (formal), uitgesloten (excluded) and twijfelachtig (doubtful).

We employ web scraping (extracting content and data from websites) and crawling (following URLs to retrieve further pages) to save the sentences that belong to one of these classes to a pandas DataFrame for use in future research.

In [1]:
# Import the relevant packages
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
from math import ceil
import argparse
from requests import get, RequestException
from contextlib import closing
import multiprocessing
import time
import os
import numpy as np

### Helper functions

I will be utilising functions created by professors of the minor Artificial Intelligence at the UvA that enable me to scrape and crawl the ANS website effectively.

In [2]:
def simple_get(url, max_retries=3):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    raw content (bytes), otherwise return None.
    Retry up to `max_retries` times on HTTP errors with a 1-second delay.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
        }
        for _ in range(max_retries):
            with closing(get(url, stream=True, headers=headers)) as resp:
                if is_good_response(resp):
                    return resp.content
                else:
                    print(f"Received an HTTP {resp.status_code} ERROR for {url}.")
                    print("Retrying in 1 second...")
                    time.sleep(1)
        print("-------------------------------------")
        print(f"Retrieving {url} FAILED. Visit this URL in your browser to confirm correctness.")
        print("-------------------------------------")
        return None
    except RequestException as e:
        print(f"The following error occurred during HTTP GET request to {url}: {str(e)}")
        return None

def url_to_filename(url, folder):
    """
    Transforms a URL string to a folder/filename string by stripping the
    'https://' prefix and replacing slashes with underscores.
    """
    return os.path.join(folder, f"{url.replace('https://', '').replace('/', '_')}.html")


def save_to_html(url, content, filename):
    """
    Saves the page content (bytes) to the HTML file `filename`.
    """
    with open(filename, 'wb') as f:
        f.write(content)

def retrieve_and_store_page_content(url, folder):
    """
    Retrieves the page content for a single URL and writes it to `folder`.
    """
    filename = url_to_filename(url, folder)

    # Check if the file already exists and has content
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        return

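    content = simple_get(url)

    # simple_get() returns None on failure; skip the write so that no
    # TypeError is raised and no empty file is left behind
    if content is None:
        return

    save_to_html(url, content, filename)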

from functools import partial

def get_page_contents_multiprocess(url_list, processes=20, folder='webpages'):
    """
    Retrieves the page content for a list of URLs using multiprocessing.
    By default writes the content to HTML files in the 'webpages' folder.
    """

    # Check if the folder exists, and create it if needed
    if not os.path.exists(folder):
        os.makedirs(folder)

    with multiprocessing.Pool(processes=processes) as pool:
        pool.map(partial(retrieve_and_store_page_content, folder=folder), url_list)

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() with a default, because a missing Content-Type header
    # would otherwise raise a KeyError
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


def read_page_content_from_disk(url, folder='webpages'):
    """
    Reads the page content from a file on disk.
    """
    filename = url_to_filename(url, folder)

    if not os.path.exists(filename):
        print(f"File not found: {filename}")
        print(f"Have you scraped this URL: {url}?")
        return

    with open(filename, 'r', encoding='utf8') as f:
        content = f.read()

        if len(content) == 0:
            print(f"{filename} is empty. Something went wrong during crawling!")

        return content
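
To illustrate how pages are cached on disk, the sketch below repeats `url_to_filename()` from above and applies it to a made-up URL. The path `/some/page` is hypothetical and only serves to show the mapping; it is not an actual ANS page.

```python
import os

def url_to_filename(url, folder):
    # Same logic as the helper above: drop the scheme, replace slashes
    return os.path.join(folder, f"{url.replace('https://', '').replace('/', '_')}.html")

# Hypothetical URL, used for illustration only
example_url = 'https://e-ans.ivdnt.org/some/page'
filename = url_to_filename(example_url, 'webpages')

# The file 'e-ans.ivdnt.org_some_page.html' ends up inside 'webpages'
print(filename)
```

Because the file name is derived from the URL alone, a second run can detect that a page was already downloaded and skip the request.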

### Splitting the sentences

When extracting the example sentences from the data, we find that the formatting is sometimes unclear. Some extracted strings contain an excessive amount of whitespace or an unnecessary newline character. Below is an example of one such string.

     De uitspraak van de
     
           w
     
We want to clean the text so that each sentence is easy to read and to search for in a data frame. I have defined a function `split_string()` that takes a string, splits it on whitespace and joins the words back together with a single space between them.

Another issue presents itself during crawling: some extracted sentences contain not only excess whitespace and unwanted newline characters, but also have the class label stuck to the end of the sentence, as in the example below.

     Zonder een bril
     
           ziet hij bijna niets.uitgesloten
           
I have defined a function `split_sentence()` that takes a string, normalises the whitespace just like `split_string()` and removes the label at the end of the sentence. This function, however, does not work properly, so it will not be used in this notebook. I have left it in so that it can be rewritten and fixed.

In [3]:
def split_string(string):
    # Split the sentence, get list of all words in string
    new_sentence = string.split()
    
    # Join words from list back together with single whitespaces
    joined_string = ' '.join(new_sentence)
    return joined_string
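
As a quick check, the messy string shown earlier can be passed through `split_string()`. The function is repeated here in condensed form so that the snippet runs on its own; the input string is a reconstruction of the example above, so its exact whitespace is an assumption.

```python
def split_string(string):
    # str.split() without arguments splits on any run of whitespace,
    # including newlines, and drops leading/trailing whitespace
    return ' '.join(string.split())

# Reconstruction of the messy example; the exact whitespace is assumed
messy = "     De uitspraak van de\n     \n           w\n"
clean = split_string(messy)
print(clean)
```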

In [4]:
def split_sentence(sentence):
    """
    Takes a string, removes whitespaces, newlines and label, and returns the newly formatted string.
    :Param
        - sentence: Sentence that needs to be split (string)
    Returns:
        - split_sentence: Sentence without extra whitespaces, newlines and label
    """
    # Split the sentence, get list of all words in string
    new_sentence = sentence.split()
    
    # Join words from list back together with single whitespaces
    joined_string = ' '.join(new_sentence)
    
    last_word = new_sentence[-1]
    
    # Loop over characters in last element of list (the element with the label stuck after the punctuation)
    for char in last_word:
        
        # Check if character is period
        if char == '.':
            
            # Split string at the period and select the 0th element from list
            my_string_new = joined_string.split('.')[0]
            
            # Create split sentence with the punctuation back at the end
            clean_sentence = f"{my_string_new}."
            return clean_sentence
            
        elif char == '?':
            my_string_new = joined_string.split('?')[0]
            clean_sentence = f"{my_string_new}?"
            return clean_sentence
            
        elif char == '!':
            my_string_new = joined_string.split('!')[0]
            clean_sentence = f"{my_string_new}!"
            return clean_sentence
        
    return joined_string
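
One possible reason `split_sentence()` misbehaves is that it cuts the string at the first occurrence of the punctuation mark found in the last word, so a sentence that contains that mark earlier on (for example in an abbreviation) is cut short. A minimal sketch of an alternative is to strip a known label suffix instead, assuming that the label is always one of the four classes named in the introduction and is always glued to the very end of the string. The name `strip_label()` is hypothetical, and this has not been tested against the real ANS pages.

```python
# The four classes named in the introduction; longest first, because
# 'informeel' itself ends in 'formeel'
LABELS = sorted(['informeel', 'formeel', 'uitgesloten', 'twijfelachtig'],
                key=len, reverse=True)

def strip_label(sentence):
    """
    Normalises whitespace and removes a trailing class label, if present.
    Returns a (clean_sentence, label) tuple; label is None if no label is found.
    """
    joined = ' '.join(sentence.split())
    for label in LABELS:
        if joined.endswith(label):
            return joined[:-len(label)].rstrip(), label
    return joined, None

# Reconstruction of the example above; the exact whitespace is assumed
raw = "     Zonder een bril\n     \n           ziet hij bijna niets.uitgesloten"
clean, label = strip_label(raw)
print(clean, '|', label)
```

A sentence that genuinely ends in one of these words would lose it, which is why this is only a sketch and not a drop-in replacement.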
            

### Scraping the ANS

First, we set the URL and use the helper function `simple_get()` to retrieve the page we want to scrape. Next, we define a function `scrape_data()` to extract the information from the menu bar on the ANS website. For this part of the scraping, we are concerned with extracting the following data:

1. Subject (e.g. 'De klank', 'Het woord', etc.)
2. Section (the section number of the sub-subject, e.g. 1, 1.2.1, etc.)
3. Sub-subject (e.g. 'De klankleer van het Nederlands', 'Soorten werkwoorden', etc.)
4. URL (the URL of each sub-subject, which will be crawled to extract further information)


In [5]:
url = 'https://e-ans.ivdnt.org/'
page = simple_get(url)
dom = BeautifulSoup(page, 'html.parser')

In [6]:
def scrape_data(dom):
    '''
    :Param
        dom: parsed HTML of the ANS website (BeautifulSoup object)
    Returns:
        df: a data frame containing information of the subjects discussed on the website
            - Subject
            - Section
            - Sub-subject
            - URL
    '''
    data = []
    subjects = []
    sections = []
    sub_subjects = []
    urls = []

    # Chapter numbers (as strings) that belong to each main subject
    woord_sections = ['2','3','4','5','6','7','8','9','10','11','12']
    constituent_sections = ['13','14','15','16','17','18']
    zin_sections = ['19','20','21','22','23']


    # Find all subjects and save them to list
    for subject in dom.find_all("div", class_="nav-chapters"):
        subject_ = subject.span.text
        subjects.append(subject_)

    # Find sub-subject information
    for item in dom.find_all("div", class_="nav-ch-header"):

        # Save section to list
        section = item.find("span", class_="idx").text
        sections.append(section)

        # Find title for sub-subject and add to list
        sub_subject_ = item.find("span", class_="ttl").text
        sub_subject = split_string(sub_subject_)

        # Find url for sub-subject and add to list
        url_ = item.a['href']
        url = f'https://e-ans.ivdnt.org{url_}'

        # Determine main_subject based on the chapter number, i.e. the part
        # of the section number before the first period
        if section.split('.')[0] == '1':
            main_subject = subjects[0]
        elif section.split('.')[0] in woord_sections:
            main_subject = subjects[1]
        elif section.split('.')[0] in constituent_sections:
            main_subject = subjects[2]
        elif section.split('.')[0] in zin_sections:
            main_subject = subjects[3]
        else:
            main_subject = subjects[4]

        # Append the information to their respective keys in a dictionary and
        # append that to the list 'data'
        data.append(
                {
                    'Subject': main_subject,
                    'Section': section,
                    'Sub-subject': sub_subject,
                    'URL': url,
                }
                    )

    # Create a data frame from the list of dictionaries
    df = pd.DataFrame(data)
    return df
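
The subject lookup inside `scrape_data()` depends only on the chapter number, i.e. the part of the section number before the first period. The sketch below isolates that step, using integer ranges in place of the string lists. `subject_index()` is a hypothetical helper name, and the ranges are taken from the lists in the function above.

```python
def subject_index(section):
    """
    Maps a section number such as '1.2.1' to the index of its main subject
    in the `subjects` list (0 to 4).
    """
    chapter = section.split('.')[0]

    # Anything that is not a plain chapter number falls into the last
    # subject, mirroring the final `else` branch in scrape_data()
    if not chapter.isdigit():
        return 4

    chapter = int(chapter)
    if chapter == 1:
        return 0
    if 2 <= chapter <= 12:
        return 1
    if 13 <= chapter <= 18:
        return 2
    if 19 <= chapter <= 23:
        return 3
    return 4

for section in ['1', '1.2.1', '5.3', '14', '23.1', '30']:
    print(section, '->', subject_index(section))
```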

### Crawling the ANS

Now that we have a data frame containing the preliminary information from the ANS website, we can crawl the URLs and extract the sentences that are classed as informeel, formeel, uitgesloten or twijfelachtig.

In [7]:
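def extract_wrong_sentence(listed_example, ilex):
    """
    Extracts the label (from the span with class `ilex`) and the cleaned
    text of a labelled example sentence.
    """
    # Extract label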
    label = listed_example.find("span", class_=ilex).text

    # Extract the example sentence
    wrong_sentence = listed_example.text
    stripped_wrong_sentence = split_sentence(wrong_sentence)
    labeled_example = True
    
    return label, stripped_wrong_sentence, labeled_example

In [14]:
def crawl_data(df, url_list):
    """
    :Param
        df: pandas data frame containing information on the ANS website
            - Subject
            - Section
            - Sub-subject
            - URL
        url_list: a list with the URLs we want to crawl, in the same order as the rows of df
    Returns:
        df: a new pandas data frame with one row per labelled example sentence
            With columns:
            - Subject
            - Section
            - Sub-subject
            - Correct sentence
            - Wrong sentence
            - Label
            - URL
    """
    data_complete = []
    labels = []
    sentences = []
    row_index = 0
    count_examples = 0
    prev_count_examples = 0
    labeled_example = False
    correct_example = False
    
    for url in url_list:        
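        # Load and read webpage
        html = read_page_content_from_disk(url)

        # Skip pages that were not saved to disk, but keep row_index
        # aligned with url_list
        if html is None:
            row_index += 1
            continue

        ans_item = BeautifulSoup(html, 'html.parser')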
        
        # Find all listed examples in each webpage (example 'a', example 'b', etc.)
        compare_examples = ans_item.find_all("div", class_="relatedList")
        
        # Continue if there are examples in webpage
        if len(compare_examples) > 0:
        
            # Loop over listed examples
            for example in compare_examples:
                count_examples+=1
                
                # Find all individual examples from the ones listed
                listed_examples = example.find_all("span", class_="ilex-wg")
                
                # Loop over individual examples
                for listed_example in listed_examples:

                    # Find if sentence has a label
                    if listed_example.find("span", class_="ilex-judgment") is not None:
                        label, stripped_wrong_sentence, labeled_example = extract_wrong_sentence(listed_example, "ilex-judgment")

                    elif listed_example.find("span", class_="ilex-label") is not None:
                        label, stripped_wrong_sentence, labeled_example = extract_wrong_sentence(listed_example, "ilex-label")

                    # If sentence has no label, save sentence as correct_sentence
                    else:
                        correct_sentence = listed_example.text
                        
                        # Clean example sentences
                        stripped_correct_sentence = split_string(correct_sentence)
                        correct_example = True
                        
                    # If there is no correct sentence, save "NA" to the data frame
                    if (correct_example == False) and (prev_count_examples != count_examples):
                        stripped_correct_sentence = "NA"

                    if labeled_example == True:
                        # Append data to dictionary in list
                        data_complete.append(
                                        {
                                            'Subject': df.loc[row_index, 'Subject'],
                                            'Section': df.loc[row_index, 'Section'],
                                            'Sub-subject':  df.loc[row_index, 'Sub-subject'],
                                            'Correct sentence': stripped_correct_sentence,
                                            'Wrong sentence': stripped_wrong_sentence,
                                            'Label': label,
                                            'URL': url,
                                        }
                                            )
                        
                        labeled_example = False
                        correct_example = False
                        
                    prev_count_examples = count_examples

        row_index+=1
        
    df = pd.DataFrame(data_complete)
    return df
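
The pairing logic in `crawl_data()` is easier to follow in isolation. The sketch below is a simplified stand-in, not the function above: it assumes each group of examples has already been reduced to `(sentence, label)` tuples, with `label` set to `None` for an unlabelled (correct) sentence. It pairs every labelled sentence with the first unlabelled sentence of its group, or with "NA" if there is none. The helper name `pair_examples()` and the sample sentences are made up for illustration.

```python
def pair_examples(groups):
    """
    Turns groups of (sentence, label) tuples into rows that pair each
    labelled sentence with the correct sentence of the same group.
    """
    rows = []
    for group in groups:
        # First unlabelled sentence in the group, or "NA" if there is none
        correct = next((s for s, label in group if label is None), "NA")

        for sentence, label in group:
            if label is not None:
                rows.append({
                    'Correct sentence': correct,
                    'Wrong sentence': sentence,
                    'Label': label,
                })
    return rows

# Made-up sample data, for illustration only
groups = [
    [("Hij ziet bijna niets.", None), ("Hij bijna niets ziet.", "uitgesloten")],
    [("Ziet hij niets bijna.", "twijfelachtig")],
]
rows = pair_examples(groups)
for row in rows:
    print(row)
```

Resolving the correct sentence once per group removes the need for the `labeled_example`, `correct_example` and counter flags that are carried across loop iterations.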

In [9]:
def save_df_as_csv(df, folder, file_name):
    """
    Takes a data frame, converts it to a csv file, names it as specified by user 
    and saves it to the specified folder 
    :Param
        - df: a Pandas dataframe
        - folder: the name of the folder where we want to save the csv file
        - file_name: the file name of the csv file
    """
    # Check if the folder exists, and create it if needed
    os.makedirs(folder, exist_ok=True)

    # Join the folder and file name to get the complete file path
    # (a relative folder is resolved against the current working directory)
    file_path = os.path.join(folder, file_name)

    # Save the DataFrame as a CSV file
    df.to_csv(file_path, index=False)
    

In [10]:
def save_webpages(url_list, folder):
    """
    Takes a list of URLs and saves each page as an HTML file to the specified folder
    :Param
        - url_list: list of URLs
        - folder: name of folder we want to save each individual URL to
    """
    # Check if the folder exists, and create it if needed
    os.makedirs(folder, exist_ok=True)

    for url in url_list:
        # Call the helper function to retrieve and store the content of the URL
        retrieve_and_store_page_content(url, folder)
    

### Main function

In the main function, we call `scrape_data()` and `crawl_data()` and save the resulting data frames using `save_df_as_csv()`.

In [11]:
if __name__ == '__main__':
    # Call on 'scrape_data' function to extract the preliminary information 
    df_menu = scrape_data(dom)

    # Save the data frame with preliminary data as a csv file to the 'data' folder 
    save_df_as_csv(df_menu, 'data', 'preliminary_data.csv')

    # Get URLs from the data frame column ['URL']
    url_list = df_menu['URL'].tolist()

    # Call on 'save_webpages()' to get the content of each url and save it to the 'webpages' folder 
    save_webpages(url_list, 'webpages')
    
    # Call on 'crawl_data()' to get the example sentences and their label
    complete_data = crawl_data(df_menu, url_list)

    # Save the data frame with the complete results as a csv file to the 'data' folder
    save_df_as_csv(complete_data, 'data', 'complete_data.csv')

## Discussion

In this project we show how scraping can be used to extract important information. There are, however, a few shortcomings of this scraping project that should be discussed.

Firstly, in some columns, the content is stored in a way that is not ideal. For instance:

     Zonder een bril
     
           ziet hij bijna niets.uitgesloten

As we can see, there is a newline in the middle of the sentence, there is too much whitespace, and the label is attached to the end of the sentence. To rectify this, I defined a function `split_sentence()`, which should take such a string and return a readable version, like this:

    Zonder een bril ziet hij bijna niets.
    
Unfortunately, this function does not work when called from the `crawl_data()` function. I therefore defined another function, `split_string()`, that only rids the string of unnecessary whitespace and newline characters. It does not, however, strip the label from the end of the sentence. I left `split_sentence()` in this notebook because it would be helpful if it worked, and it would be worthwhile to rewrite it so that it does.

One of the main limitations of this code is that it is very slow, specifically the `crawl_data()` function. This is mainly due to the large number of webpages the function loops over. For each webpage, the function finds all examples given and determines whether they have a label. If so, it saves the label along with the sentence as a new row of the output data frame. This is a tedious process and takes a lot of time. I believe there is likely a way to optimise this code with the aim of reducing its time complexity (Simplilearn, 2023). I do not, however, have the skill set to do this within the time frame of this project.
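
Before optimising, it may help to measure where the time actually goes, for instance reading files from disk versus parsing them. The sketch below is a generic timing helper that uses only the standard library; `timed()` and the stand-in workload are hypothetical and not part of the notebook.

```python
import time

def timed(func, *args, **kwargs):
    """
    Calls func(*args, **kwargs) and returns (result, elapsed_seconds).
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this could be a call such as
# read_page_content_from_disk(url) or BeautifulSoup(html, 'html.parser')
result, elapsed = timed(sum, range(100_000))
print(f"result={result}, took {elapsed:.6f} s")
```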

In conclusion, I believe this project provides a stepping stone towards successfully extracting example sentences from the ANS website. There is much that can still be improved but, as this project shows, it is definitely possible.

### References

ANS | e-ANS. (z.d.). https://e-ans.ivdnt.org/

Simplilearn. (2023, 6 november). Introduction to Big O Notation in Data Structure. Simplilearn.com. https://www.simplilearn.com/big-o-notation-in-data-structure-article#:~:text=Big%20O%20Notation%20in%20Data%20Structure%20is%20used%20to%20express,algorithm%20for%20an%20input%20value.

Minor Artificial Intelligence, Universiteit van Amsterdam. https://www.proglab.nl/programs/minai/