Data Science project

Group 5's project of the minor Data Science at Rotterdam University of Applied Sciences (RUAS).

Group 5 consists of:

Jamey Schaap (0950044)
Thomas Poelman (1008138)
Jarell Wespel (0999541)
Luc Karlas (1017799)
Maurits Hanhart (1009228)
Dominique Kuijten (1009466)

Grade: 9.5

Documentation, such as the report and the final presentation, can be found in /docs.

The best model

The best model (a shallow feed-forward network) can be found in /out/models/best_model, named RawData.9c_Adam_1024_0_#FactorScheduler-factor_0.995-stop_factor_0.00075-base_lr_0.00075#_200_25_32_0.19385148584842682.shallow_fnn.keras. Model names follow the convention {VERSION}_Adam_{units}_{dropoutRate}_{learningRate}_{epochs}_{patience}_{batchsize}_{least_val_loss}.shallow_fnn.keras. In this case the learning rate field is #FactorScheduler-factor_0.995-stop_factor_0.00075-base_lr_0.00075#, and the remaining fields read: 1024 units, dropout rate 0, 200 epochs, patience 25, batch size 32 and a lowest validation loss of roughly 0.194.
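The naming convention above can be unpacked programmatically. The following is a minimal sketch, not code from the repository: the regular expression, the function name parse_model_name and the field names are assumptions based only on the convention and the example file name. It also assumes that a learning rate containing underscores is wrapped in # characters, as in the best model's name.

```python
import re

# Hypothetical pattern derived from the documented naming convention.
MODEL_NAME_PATTERN = re.compile(
    r"^(?P<version>.+?)_Adam_(?P<units>\d+)_(?P<dropout_rate>[\d.]+)_"
    r"(?P<learning_rate>#.+#|[^_#]+)_(?P<epochs>\d+)_(?P<patience>\d+)_"
    r"(?P<batch_size>\d+)_(?P<least_val_loss>[\d.]+)\.shallow_fnn\.keras$"
)


def parse_model_name(file_name: str) -> dict[str, str]:
    """Split a shallow FNN model file name into its named fields (as strings)."""
    match = MODEL_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {file_name}")
    return match.groupdict()


best_model = (
    "RawData.9c_Adam_1024_0_"
    "#FactorScheduler-factor_0.995-stop_factor_0.00075-base_lr_0.00075#"
    "_200_25_32_0.19385148584842682.shallow_fnn.keras"
)
fields = parse_model_name(best_model)
print(fields["version"], fields["units"], fields["batch_size"])
```

All fields are returned as strings, so numeric values such as least_val_loss still need to be converted with float() before comparing models.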

Datasets

All datasets used and created can be found in /datasets. The following source datasets were used during preprocessing and merging:

Polity5.xls - Political data; Provided by the project course of the Data Science minor at Rotterdam University of Applied Sciences (RUAS)
IMFInvestmentandCapitalStockDataset2021.xlsx - Investment data; Downloaded from International Monetary Fund (IMF); https://www.imf.org/
API_SP.POP.TOTL_DS2_en_excel_v2_5871620.xls - Population per country; Downloaded from the World Bank; https://www.worldbank.org/

Requirements

Python 3.11.x

Setting up

Auto setup (Windows)

To set up the environment, start a PowerShell 7 (Core, PWSH) shell, navigate to the root directory of the project and run the setup script (.\simple_setup.ps1).

Manual steps

Note: A virtual environment can also be set up with other tools such as Anaconda, but be sure to specify Python version 3.11.

Install virtualenv for Python
pip install virtualenv

Create a virtual python environment
python -m venv .\venv

Activate the virtual environment
Unix: source venv/bin/activate
PowerShell (PWSH) (Core): venv\Scripts\Activate.ps1
Command Prompt (CMD): venv\Scripts\activate.bat

Update Pip
python -m pip install --upgrade pip

Install requirements
pip install -r requirements.txt

During development

Depending on which editor/IDE you use, you might have to activate the virtual Python environment manually. This can be done with:
Unix: source venv/bin/activate
PowerShell (PWSH) (Core): venv\Scripts\Activate.ps1
Command Prompt (CMD): venv\Scripts\activate.bat
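To confirm that the environment is active and uses the required Python version, a small check can be run from the activated shell. This is a minimal sketch using only the standard library; the function names are hypothetical and not part of the project.

```python
import sys

# The README requires Python 3.11.x.
REQUIRED_PYTHON = (3, 11)


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version matches the required one."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Python 3.11:", is_supported_python())
print("Virtual environment active:", in_virtual_environment())
```

This check covers environments created with venv or virtualenv; an Anaconda environment is not detected this way.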

Configuration

All configuration is done through the files in src\configs\.

Versioning

Versioning and the number of labels/classes/targets are controlled through the variable __amount_of_classes, which can be found in src\configs\data.py (line 96). This variable determines the number of labels/classes/targets used both during data preprocessing & merging and during model training. It is currently configured for 9 labels with equal partitions.
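As an illustration of what 9 labels with equal partitions could look like, the sketch below maps a normalised risk value to a class index. It rests on two assumptions that the text above does not confirm: that "equal partitions" means equal-width intervals, and that the risk value is normalised to the range 0..1. The function risk_to_class is hypothetical; the project's own logic in src\configs\data.py may differ.

```python
def risk_to_class(norm_risk: float, amount_of_classes: int = 9) -> int:
    """Map a risk value in [0, 1] to one of `amount_of_classes` equal-width classes."""
    if not 0.0 <= norm_risk <= 1.0:
        raise ValueError(f"norm_risk must be within [0, 1], got {norm_risk}")
    # Scale to [0, amount_of_classes] and clamp so that 1.0 lands in the last class.
    return min(int(norm_risk * amount_of_classes), amount_of_classes - 1)


for value in (0.0, 0.2, 0.5, 1.0):
    print(value, "->", risk_to_class(value))
```

Changing amount_of_classes changes the width of every interval at once, which is why a single variable can drive both preprocessing and training.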

How to use the different scripts

To use any of the scripts, navigate to the src/ directory.

Data preprocessing & merger

Run the merge_datasets.py file. This will process the data and create two datasets (xlsx) containing the preprocessed data:

src/datasets/MergedDataset-Dataset-V.RawData.<VERSION>c.xlsx → The dataset for humans, containing the data that is interesting/of use for the project.
src/datasets/MachineLearning-Dataset-V.RawData.<VERSION>c.xlsx → The dataset for machine learning models, containing the features and labels, which are encoded where needed.
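The two output paths can be derived from the version placeholder. The sketch below is a hypothetical helper, not part of the repository; it assumes that <VERSION> is filled in with the configured number of classes, which is inferred from the best model's RawData.9c prefix and is not stated explicitly.

```python
from pathlib import Path


def dataset_paths(version: str, datasets_dir: str = "src/datasets") -> dict[str, Path]:
    """Build the expected output paths of merge_datasets.py for a given version."""
    base = Path(datasets_dir)
    return {
        # Dataset meant for human inspection.
        "merged": base / f"MergedDataset-Dataset-V.RawData.{version}c.xlsx",
        # Dataset with encoded features and labels for the models.
        "machine_learning": base / f"MachineLearning-Dataset-V.RawData.{version}c.xlsx",
    }


paths = dataset_paths("9")
for name, path in paths.items():
    print(name, path.name, path.exists())
```

Printing path.exists() gives a quick check of whether merge_datasets.py has already produced the files for that version.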

Plotting

Edit plotting.py to specify what should be plotted. Examples:

# Example 1 - Simple invoke: specify the dataframe, the columns and the plotting function.
# If a description exists, it will be shown as the axis label.
simple_invoke(df, x=Column.DUR, y=Prefix.NORM + Column.RISK, plot_func=gf.plot_kde)

# Example 2 - Manually configure the plotting functions
gf.plot_hist(df["norm_risk"], x_label="Risk factor (0..1)")

Then run the plotting.py file to plot.

Machine-/Deep learning

For each Jupyter Notebook (.ipynb), first run the cells under the chapters "Load & split the dataset" and "Utility functions definitions".

KNN, Logistic Regression, SVM & Random Forest

Open machine_learning.ipynb, which contains all code for KNN, Logistic Regression, SVM & Random Forest. There the models can be trained, tuned and used for prediction, and their accuracy and feature importance (SHAP) can be viewed.

Shallow FNN

Open shallow_feed_forward_neural_network.ipynb, which contains all code for training the shallow FNN models. It has cells dedicated to loading, parameter tuning and hyperparameter tuning, as well as cells for viewing the difference (observed - predicted) and the feature importance.

Deep FNN

Open deep_feed_forward_neural_network.ipynb, which contains all code for training the deep FNN models. It has cells dedicated to loading and parameter tuning, as well as cells for viewing the difference (observed - predicted) and the feature importance.

General functionality

An Excel file containing the incorrectly predicted rows can be created:

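# Example - Random forest
from sklearn.ensemble import RandomForestClassifier
from machine_learning.utils import output_incorrectly_predicted_xlsx

# x_train, train_labels, test_df and print_results are expected to be defined
# by the notebook cells that were run beforehand.
rf_model = RandomForestClassifier(n_estimators=500, random_state=42)
rf_model.fit(x_train, train_labels)

# The second return value of print_results holds the predictions.
_, y_pred = print_results(rf_model)

# Write the incorrectly predicted rows to an Excel file.
output_incorrectly_predicted_xlsx(test_df, y_pred, "rf")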

A dataframe containing the difference between the observed and predicted labels can be created and plotted with:

distribution = get_distribution(test_df, y_pred)
print(distribution)
plot_distribution(distribution)
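To make the idea of a difference distribution concrete, the sketch below counts how often each difference (observed - predicted) occurs. It is a hypothetical stand-in and not the notebooks' get_distribution: it assumes integer-encoded labels and works on plain lists instead of a dataframe.

```python
from collections import Counter


def difference_distribution(observed, predicted) -> dict[int, int]:
    """Count how often each difference (observed - predicted) occurs."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    counts = Counter(o - p for o, p in zip(observed, predicted))
    # Sort by difference so that 0 (a correct prediction) sits in the middle.
    return dict(sorted(counts.items()))


observed_labels = [0, 1, 2, 3, 3]
predicted_labels = [0, 2, 2, 1, 3]
print(difference_distribution(observed_labels, predicted_labels))
```

A difference of 0 is a correct prediction. Negative values mean the model predicted a higher class than observed, positive values a lower one.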

Feature importance can be plotted with the SHAP library:

## machine_learning.ipynb ##
_, shap_values = calculate_shap_values(model)
shap.summary_plot(shap_values, x_test, feature_names=feature_names,
                  class_names=RISKCLASSIFICATIONS.get_names())

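## deep_feed_forward_neural_network.ipynb    ##
## shallow_feed_forward_neural_network.ipynb ##
explainer = shap.KernelExplainer(model.predict, x_train)

# Explain a sample of 20 test rows to keep the KernelExplainer runtime manageable.
x_sample = shap.sample(x_test, 20)
shap_values = explainer.shap_values(x_sample, nsamples=100, random_state=41)

# All columns except the label column are features.
feature_names = df.columns.tolist()
feature_names.remove(Column.COUNTRY_RISK)

# Plot against the same rows the SHAP values were computed for.
shap.summary_plot(shap_values, x_sample,
                  feature_names=feature_names,
                  class_names=RISKCLASSIFICATIONS.get_names())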
