# Website classification demo with *Homepage2Vec*

This notebook demonstrates how to classify websites by their URL using the `homepage2vec` library for `Python`. The `homepage2vec` library is based on a [dataset](https://figshare.com/articles/dataset/Curlie_Dataset_-_Language-agnostic_Website_Embedding_and_Classification/19406693) derived from [curlie.org](https://curlie.org/).

For further details you can consult the information provided in the [`Homepage2Vec` *GitHub* repository](https://github.com/epfl-dlab/homepage2vec).

If you use `Homepage2Vec` for your research, please cite the associated paper:

Lugeon, S., Piccardi, T., & West, R. (2022). Language-Agnostic Website Embedding and Classification. *arXiv preprint [arXiv:2201.03677](https://arxiv.org/pdf/2201.03677.pdf)*.

*Note*: The code below is based on the examples provided in the [`ReadMe` file from the `homepage2vec` *GitHub* repository](https://github.com/epfl-dlab/homepage2vec/blob/master/README.md).

***IMPORTANT***: The demo notebook hosted on *MyBinder.org* is meant for testing purposes only and should be used for no more than a few test URLs. If you want to use the functions/code provided here to classify a large number of URLs for your research, please copy/clone the notebook and run it (or the code it contains) on your local machine or your own server.

## Setup

In addition to `homepage2vec`, we will use [`pandas`](https://pandas.pydata.org/) for data wrangling when we want to classify multiple URLs contained in one or more `.txt` files (see below).

*Note*: If you run the following code on your local machine/own server, you may have to install `homepage2vec` and `pandas` first (typically using [`pip`](https://pip.pypa.io/en/stable/) or [`conda`](https://docs.conda.io/projects/conda/en/latest/commands/install.html)).

In [None]:
import os
from homepage2vec.model import WebsiteClassifier
import pandas as pd

We might want to enable detailed logging for debugging purposes and to make the functions more verbose, so that they report detailed progress information. Please note that this can cause issues in the `Jupyter` Notebook because of size limits for the cell output. You can deal with this by using the cell magic command [`%%capture`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture). Note that a trailing `;` only suppresses the echoed value of a cell's last expression, so it does not silence log messages.

In [None]:
#import logging
#logging.getLogger().setLevel(logging.DEBUG)
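Another way to keep verbose log output out of the notebook is to send it to a file. The following is a minimal sketch that uses only the standard `logging` module; the helper name `log_to_file` and the log file location are illustrative choices and not part of `homepage2vec`.

```python
import logging
import os
import tempfile


def log_to_file(path, level=logging.DEBUG):
    """Attach a file handler to the root logger and return it."""
    root = logging.getLogger()
    root.setLevel(level)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root.addHandler(handler)
    return handler


# Illustrative location; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "homepage2vec_demo.log")
handler = log_to_file(log_path)
logging.debug("logging to file is active")
```

If the root logger already has a handler that writes to the cell output, messages will still appear there as well; in that case `%%capture` remains the simpler option.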

Create an instance of the `WebsiteClassifier` from `homepage2vec`. This is the model we will use for classifying the URLs.

In [None]:
model = WebsiteClassifier()

## Classify a single URL

Define a function for classifying a single URL and printing the result.

In [None]:
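def classify_website(model, url):
    """Fetch the homepage at `url`, classify it, and print the class probabilities."""
    website = model.fetch_website(url)
    # predict() returns the class probabilities and the website embedding
    scores, embeddings = model.predict(website)
    print(url, "class probabilities:", scores)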

Provide the URL that you want to classify:

In [None]:
url = "wikipedia.org"

Use the `classify_website()` function to classify the specified URL.

In [None]:
classify_website(model, url)
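The printed result can be hard to scan because it lists every category. The sketch below ranks the categories by probability. It assumes that `scores` is a dictionary mapping category names to probabilities, which is also what the `DataFrame` step later in this notebook relies on. The helper `top_categories` and the example values are made up for illustration.

```python
def top_categories(scores, k=3):
    """Return the k highest-scoring (category, probability) pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up scores for illustration only; real values come from model.predict().
example_scores = {"Arts": 0.12, "Reference": 0.91, "Science": 0.47, "Sports": 0.03}

for category, probability in top_categories(example_scores, k=2):
    print(f"{category}: {probability:.2f}")
```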

## Classify multiple URLs contained in one or more `.txt` files

We define two functions: one for looping through the `.txt` file(s) to create a list of URLs and another one for classifying the URLs contained in the list.

First, we define the function for reading the URLs from the `.txt` file(s):

In [None]:
def read_urls(directory):
    # create an empty list to store the URLs
    urls = []

    # loop through all the files in the directory (sorted for a reproducible order)
    for file_name in sorted(os.listdir(directory)):
        # check if the file is a .txt file
        if file_name.endswith('.txt'):
            # open the file and read the contents line by line
            with open(os.path.join(directory, file_name)) as file:
                for line in file:
                    # add each non-empty line to the urls list
                    url = line.strip()
                    if url:
                        urls.append(url)
    return urls
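URL lists collected from several files often contain blank lines, comments, or duplicates. The sketch below shows one possible way to clean such a list before classification. The helper `clean_urls` and the convention of treating lines starting with `#` as comments are assumptions made for this example, not something the example files necessarily follow.

```python
def clean_urls(lines):
    """Strip whitespace, skip blanks and '#' comments, and drop exact duplicates.

    The order of first occurrence is preserved.
    """
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        # skip empty lines and (assumed) comment lines
        if not url or url.startswith("#"):
            continue
        # skip URLs that were already collected
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


raw_lines = ["wikipedia.org\n", "\n", "# a comment\n", "  example.org  \n", "wikipedia.org\n"]
print(clean_urls(raw_lines))
```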

Next, we define the function for classifying the URLs from the list we created before:

In [None]:
def classify_websites(urls, model):
    # create an empty list to store the classifications
    classifications = []

    # loop through all the URLs
    for url in urls:
        # fetch the homepage and classify it; predict() returns (scores, embeddings)
        website = model.fetch_website(url)
        scores, _embeddings = model.predict(website)
        # keep only the class probabilities
        classifications.append(scores)

    return classifications
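With a long URL list, a single unreachable site should not abort the whole run. The sketch below records failures instead of stopping. Whether the real `WebsiteClassifier` raises an exception for an unreachable site is not confirmed here, so the error handling is deliberately broad. `classify_safely` and `StubModel` are hypothetical names; the stub only stands in for the real model so that the example runs offline.

```python
def classify_safely(urls, model):
    """Classify each URL and collect failures instead of stopping the loop."""
    results = []
    failed = []
    for url in urls:
        try:
            website = model.fetch_website(url)
            scores, _embeddings = model.predict(website)
        except Exception as error:  # broad on purpose: record the failure and move on
            failed.append((url, str(error)))
            continue
        # store the URL next to its scores so rows stay aligned
        results.append({"website": url, **scores})
    return results, failed


class StubModel:
    """Hypothetical stand-in for WebsiteClassifier (no network access)."""

    def fetch_website(self, url):
        if "broken" in url:
            raise ConnectionError(f"cannot reach {url}")
        return url

    def predict(self, website):
        # made-up scores and embedding
        return {"Arts": 0.1, "Science": 0.9}, [0.0, 0.0]


results, failed = classify_safely(["example.org", "broken.example"], StubModel())
print(results)
print(failed)
```

Storing the URL inside each result row means that skipped URLs cannot shift the scores out of alignment with the URL list.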

To use the functions, we first need to specify the directory containing the `.txt` file(s).

*Note*: The *GitHub* repository that hosts this `Jupyter` Notebook includes a folder named `urls` with two example `.txt` files.

In [None]:
directory = "./urls"

Next, we can create the URL list based on the content of the `.txt` file(s) in the specified directory.

In [None]:
urls = read_urls(directory)

**NB**: Depending on the number of URLs, classifying them may take some time. If you want information about the progress, you can enable logging (see above).

In [None]:
classifications = classify_websites(urls, model)

After classifying the URLs, we can combine the URLs and the classification results for each of the categories into a `pandas` dataframe.

In [None]:
# convert the classifications into a pandas DataFrame
classifications_df = pd.DataFrame(classifications)
# create a new DataFrame with the urls as a column
urls_df = pd.DataFrame({'website': urls})
# concatenate the urls_df DataFrame with classifications_df
classifications_df = pd.concat([urls_df, classifications_df], axis=1)
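For a quick overview, it can help to add a column with the highest-scoring category for each website. The sketch below does this with `DataFrame.idxmax` and `DataFrame.max`. It uses made-up URLs and scores and assumes each classification is a dictionary of category probabilities; the column names `top_category` and `top_score` are illustrative.

```python
import pandas as pd

# Made-up data for illustration; real scores come from model.predict().
example_urls = ["example.org", "example.com"]
example_classifications = [
    {"Arts": 0.10, "Science": 0.80, "Sports": 0.05},
    {"Arts": 0.60, "Science": 0.20, "Sports": 0.30},
]

# one row per website, one column per category
scores_df = pd.DataFrame(example_classifications)
result = pd.concat([pd.DataFrame({"website": example_urls}), scores_df], axis=1)

# label and value of the largest score in each row
result["top_category"] = scores_df.idxmax(axis=1)
result["top_score"] = scores_df.max(axis=1)
print(result)
```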

For further analysis, we can export the dataframe we have created to a `.csv` file.

In [None]:
classifications_df.to_csv("./classifications.csv", index=False)