
KBO URL Linking Model Development

Welcome to the official repository for the master's thesis from KU Leuven in cooperation with Statistiek Vlaanderen. This project focuses on developing a machine learning model to accurately link businesses in the KBO (Crossroads Bank for Enterprises) database to their true URLs. The work presented here is part of an academic thesis and represents a collaborative effort between the university and the official statistics agency of the Flemish Region in Belgium.

Project Overview

This repository contains a series of Jupyter notebooks and a Python script that together form a data processing and machine learning prediction pipeline. The functions used by both the Jupyter notebooks and the Python script are collected in individual Python scripts in the 'Library' folder, so they can be reused across components (see the sketch below). Each component of the pipeline corresponds to a step in the process of collecting, processing, visualizing, and predicting business URLs based on KBO data.
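
As a hedged sketch of how the shared code can be reused (the module and function names here are illustrative, not necessarily the actual ones in 'Library'):

import sys

# Make the shared scripts in the 'Library' folder importable
sys.path.append("Library")

# Hypothetical import; the real module and function names are defined
# by the scripts in the 'Library' folder
# from zip_downloader import download_archives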

Workflow Visualization

A BPMN (Business Process Model and Notation) diagram is included to outline the process flow, ensuring clarity and coherence in the methodology used throughout this project.

BPMN Model Overview

Contents

Jupyter Notebooks

  1. Zip_downloader.ipynb - Handles the downloading of zipped data files.
  2. Zip_processor.ipynb - Manages the extraction and preprocessing of the data.
  3. Data_Preparation_And_Visualization.ipynb - Prepares the data for analysis and performs initial visualizations.
  4. Web_scraper.ipynb - Collects additional data from the web to complement the dataset.
  5. Pipeline_ML.ipynb - Trains and evaluates the machine learning models.

Python Script

  1. Prediction_pipeline.py - Serves as the operational pipeline for predictions using the trained models.

Getting Started

This section guides you through getting set up to use the notebooks and scripts in this repository.

Prerequisites

Before you begin, make sure you have the following installed:

  • Git (to clone the repository)
  • Anaconda or Miniconda (provides the conda and pip commands used below)

Cloning the Repository and Starting Jupyter Notebooks

To get started with the notebooks in this repository, you'll first need to clone it to your local machine. Open a terminal or Anaconda Prompt and follow these steps:

Cloning the Repository

# Clone the repository using git
git clone https://github.com/nathanvdv/URLfinder

# Navigate to the repository directory
cd URLfinder

Environment Setup

Creating a dedicated Conda environment for this project is recommended to avoid dependency conflicts. You can create and activate a new environment using the following commands in your terminal or Anaconda Prompt:

# Create a new conda environment named 'kbo-url-linking'
conda create --name kbo-url-linking python=3.8

# Activate the environment
conda activate kbo-url-linking

# Install the required packages from requirements.txt
pip install -r requirements.txt
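
With the environment active, you can start the Jupyter Notebook server from the repository root (assuming Jupyter is included in the installed requirements):

# Launch the Jupyter Notebook server
jupyter notebook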

Usage Guide for Jupyter Notebooks

This guide provides a step-by-step approach to executing the Jupyter notebooks included in this repository. These notebooks are part of a pipeline developed for linking businesses from the KBO database to their respective URLs.

General Steps to Run a Notebook:

  1. Open the Notebook: In your browser, navigate to the notebook file (.ipynb) within the Jupyter Notebook interface and click to open it.

  2. Install Dependencies: Before running the notebook, make sure that all required libraries are installed; they are listed in the requirements.txt file and can typically be installed via pip or conda:

    # Use pip
    pip install -r requirements.txt
    
    # Or use conda
    conda install --file requirements.txt

Repository Contents

This repository is structured to include a series of Jupyter notebooks and a Python script that are intended to be used sequentially for the development of a model that links businesses from the KBO database to their actual URLs. Below is a brief overview of each file, each followed by a short illustrative sketch of the step.

Jupyter Notebooks

1. Zip_downloader.ipynb

  • Description: Downloads zipped data files from the KBO open data set.
  • Instructions: Ensure you have adequate storage and network bandwidth for the downloads.
  • Output: Zipped files saved in the designated directory.
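
A minimal sketch of this download step, assuming a placeholder archive URL (the actual KBO open data location is configured in the notebook):

import requests

# Placeholder URL: the real KBO open data location is set in the notebook
ARCHIVE_URL = "https://example.com/kbo/KboOpenData_Full.zip"

response = requests.get(ARCHIVE_URL, stream=True, timeout=60)
response.raise_for_status()

# Stream the archive to disk in chunks to keep memory usage low
with open("KboOpenData_Full.zip", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)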

2. Zip_processor.ipynb

  • Description: Processes the downloaded zip files, extracting and organizing the data.
  • Instructions: Set the input directory to where the zipped files are located. Specify the output directory for extracted content.
  • Output: Parquet (.parquet) data files ready for further processing.
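
A minimal sketch of the extraction and conversion, assuming the archive contains CSV files (all file and directory names are illustrative):

import zipfile
from pathlib import Path

import pandas as pd

INPUT_ZIP = Path("KboOpenData_Full.zip")   # illustrative paths
EXTRACT_DIR = Path("extracted")
OUTPUT_DIR = Path("parquet")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Extract the archive, then convert each CSV file to Parquet
with zipfile.ZipFile(INPUT_ZIP) as archive:
    archive.extractall(EXTRACT_DIR)

# Writing Parquet requires pyarrow or fastparquet to be installed
for csv_file in EXTRACT_DIR.glob("*.csv"):
    pd.read_csv(csv_file).to_parquet(OUTPUT_DIR / f"{csv_file.stem}.parquet")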

3. Data_Preparation_And_Visualization.ipynb

  • Description: Prepares data for analysis and visualizes key aspects.
  • Instructions: Load the data files output by Zip_processor.ipynb. Perform the necessary data cleaning and visualization.
  • Output: A cleaned data set, along with visualizations that may be used for exploratory data analysis.
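
A minimal sketch of the cleaning and plotting, with illustrative file and column names (the notebook defines the actual schema):

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative file and column names
enterprises = pd.read_parquet("parquet/enterprise.parquet")

# Basic cleaning: drop duplicates and rows without an enterprise number
enterprises = enterprises.drop_duplicates().dropna(subset=["EnterpriseNumber"])

# Example exploratory plot: the ten most common juridical forms
enterprises["JuridicalForm"].value_counts().head(10).plot(kind="barh")
plt.title("Most common juridical forms")
plt.tight_layout()
plt.show()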

4. Web_scraper.ipynb

  • Description: Scrapes candidate URLs from the web using Google and/or DuckDuckGo, as required for the analysis.
  • Instructions: Set up the scraping parameters and target URLs. Ensure compliance with the terms of service of the websites being scraped. When scraping with Google, use the official API.
  • Output: Data gathered from web sources, saved and integrated with the main data set.
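
A minimal sketch of a Google-based lookup through the Custom Search JSON API (the credentials are placeholders, and the query and result handling are assumptions; the notebook defines the actual search logic):

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"            # placeholder credentials
SEARCH_ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"

def search_candidate_urls(business_name, max_results=5):
    """Return candidate homepage URLs for a business name."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": business_name},
        timeout=30,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    return [item["link"] for item in items[:max_results]]

print(search_candidate_urls("Statistiek Vlaanderen"))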

5. Pipeline_ML.ipynb

  • Description: Preprocesses the prepared data and trains machine learning models on it.
  • Instructions: Load the cleaned and combined dataset. Select models and parameters for training. Perform cross-validation.
  • Output: Trained machine learning models, evaluation metrics, and insights.
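
A minimal sketch of the training step with scikit-learn (the feature schema, label column, and model choice are illustrative; the notebook covers the actual model selection and tuning):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Illustrative schema: one row per (business, candidate URL) pair with
# engineered similarity features and a binary label marking the true URL
data = pd.read_parquet("training_pairs.parquet")
X = data.drop(columns=["is_true_url"])
y = data["is_true_url"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold cross-validation on the training split
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print(f"CV F1: {scores.mean():.3f} +/- {scores.std():.3f}")

model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")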

Python Script

6. Prediction_pipeline.py

  • Description: Serves as a pipeline for making predictions with the trained models.
  • Instructions: Run this script with Python 3.x. Ensure all dependencies are installed. Load the trained model before making predictions.
  • Output: A .parquet file with the predicted URLs and their domains; a single business may be linked to multiple candidate URLs. This script can be integrated into a production environment for real-time predictions.
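
A minimal sketch of the prediction step, assuming the trained model was serialized with joblib (file names and columns are illustrative; the script defines the actual inputs and outputs):

from urllib.parse import urlparse

import joblib
import pandas as pd

# Illustrative file names; the script's actual inputs may differ
model = joblib.load("url_linking_model.joblib")
candidates = pd.read_parquet("candidate_pairs.parquet")

# Score each (business, candidate URL) pair with the trained model
features = candidates.drop(columns=["EnterpriseNumber", "url"])
candidates["predicted_match"] = model.predict(features)

# Keep the predicted matches and derive each URL's domain
matches = candidates[candidates["predicted_match"] == 1].copy()
matches["domain"] = matches["url"].map(lambda u: urlparse(u).netloc)
matches[["EnterpriseNumber", "url", "domain"]].to_parquet("predicted_urls.parquet")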

License

This project is open source and available under the MIT License.
