# Machine learning: Final Project

## Authors: Maximilian Janisch and Marco Bertenghi

### Problem Description

We have been assigned a multi-class classification problem: we want to correctly predict which of __three__ classes (i.e. $\mathcal{Y} = \{0,1,2\}$) a given input in $\mathcal{X} = \mathbb{R}^d$ belongs to. We are instructed to solve this problem by reducing it to a sequence of binary classification problems (i.e. with $\mathcal{Y}'=\{0,1\}$). We are __not__ allowed to use the multi-class implementation of sklearn, and one of the models we implement must be a random forest. The goal of the project is to describe and document in detail how to find and implement the best machine learning algorithm for the dataset assigned to us.
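
One common way to reduce a three-class problem to binary problems is the one-vs-rest scheme: train one binary model per class ("is it class $k$ or not?") and predict the class whose model is most confident. The sketch below only illustrates the label bookkeeping of that scheme. That this project uses one-vs-rest is an assumption, the helper names `binarize_labels` and `combine_binary_scores` are hypothetical, and the scores are made-up stand-ins for the outputs of trained binary models.

```python
import numpy as np


def binarize_labels(y, classes=(0, 1, 2)):
    """Return one binary target vector per class: 1 where y == k, else 0."""
    y = np.asarray(y)
    return {k: (y == k).astype(int) for k in classes}


def combine_binary_scores(scores):
    """For each sample, pick the class whose binary model gives the highest score.

    `scores` maps each class k to an array of per-sample scores for "is class k".
    """
    classes = sorted(scores)
    stacked = np.column_stack([scores[k] for k in classes])
    return np.asarray(classes)[stacked.argmax(axis=1)]


y = np.array([0, 2, 1, 2])
binary_targets = binarize_labels(y)

# Hypothetical scores, standing in for the outputs of three trained binary models
scores = {
    0: np.array([0.90, 0.10, 0.20, 0.30]),
    1: np.array([0.05, 0.20, 0.70, 0.10]),
    2: np.array([0.05, 0.70, 0.10, 0.60]),
}
predictions = combine_binary_scores(scores)
print(predictions)
```

Each entry of `binary_targets` could serve as the training target of one binary classifier, for example a random forest.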


We are given an __E-SCOOTER DATASET__, which contains data about the profitability of e-scooter companies. The goal is to predict this profitability as accurately as possible from the information provided. The dataset comes as a CSV (comma-separated values) file named dataset20.csv and is accompanied by the following description:

__E-SCOOTER DATASET__

----------------------------------------------

__Dataset description__:
The data is compiled from cities, towns, and small villages in Germany, Austria, and Switzerland, classified according to the profitability of e-scooter companies active in that location. There are __thirteen different features__ associated with each location. The goal is to predict the profitability of an e-scooter company (feature "class") from the other features.

__Attention!__
Please notice that the data has been artificially generated. The dataset does not reflect real-world statistical correlations between features and labels.

	Number of samples: 500
	Number of features: 13 (numeric and strings) + one column of class labels (0,1,2)
	Features description:
		pub_trans: public transport index
		price: price to rent per km
		temperature: average temperature during summer
		inhabitants: number of inhabitants
		registered: number of registered users in thousand
		country: country
		id: internal dataset code
		nr_counterparts: nr_counterparts
		cars: number of cars per inhabitant
		labour_cost: average labour cost in thousand per month
		humidity: absolute humidity in g/m3
		windspeed: average wind speed in km/h
		size: size of city center in km2
		class: profitability (0 = loss, 1 = balanced, 2 = profit) <--- LABEL TO PREDICT

----------------------------------------


We notice that:

+ The description lists a total of 500 samples.
+ We have a total of 13 features.
 + Not all features will be relevant; more on that later.
+ The profitability is stored in the column `class`; it is the label we want to predict.
+ The CSV file uses ';' as a delimiter.

# Coding part

## Reading in the data

In [19]:
# Preliminary code, libraries we want to load
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [20]:
# Preliminary code, global variables
DATA_FILE = os.path.join("data", "dataset20.csv")

We will use the library _pandas_ for our data management.

In [21]:
data = pd.read_csv(DATA_FILE, delimiter=";")
x = data.drop(columns="class")
y = data["class"]

As a brief sanity check, we look at the dataset by printing the first five rows of the covariates and the first five labels.

In [22]:
print("First five lines of covariates:", x.head())
print("First five labels:", y.head())

   pub_trans     price  temperature  inhabitants  registered      country  \
0  14.202635  0.019019    25.175131     2.355282  118.021208      Germany   
1  13.139854  0.153448    23.901338     4.150463   92.945396      Germany   
2  13.204899  0.924526    22.453461     5.586596   99.359941  Switzerland   
3  14.537539  2.190265    21.242409     4.836816  106.712314  Switzerland   
4  11.778787  2.784961    17.454760     5.027284  103.028215      Germany   

        id nr_counterparts      cars  labour_cost  humidity  windspeed  \
0  DZ83987             few  0.413614     2.237814  2.665311   2.831770   
1  DZ80864            none  0.213533     2.543435  2.083593   3.369611   
2  DZ05013            none  0.504168     2.446645  2.462123   2.382111   
3  DZ18329            none  0.234218     1.919706  2.694815   4.118423   
4  DZ25149             few  0.474206     2.688411  2.348723   0.590034   

       size  
0  0.748582  
1  1.019265  
2  0.578460  
3  0.802346  
4  0.848055  
0    2
1

Here is the shape of our data:

In [23]:
print("Shape of the covariates:", x.shape)  # Sanity check, to see what our data looks like.
print("Shape of the labels:", y.shape)

Shape of the covariates: (499, 13)
Shape of the labels: (499,)
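
Beyond the shape, a few further checks can be useful at this point, for instance counting missing values, repeated ids and samples per class. The following is a minimal sketch on a toy frame whose values are made up for illustration; on the real data one would apply the same calls to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only
toy = pd.DataFrame({
    "id": ["A1", "A2", "A2", "A4"],
    "price": [0.5, np.nan, 1.2, 0.8],
    "class": [0, 1, 1, 2],
})

missing_per_column = toy.isna().sum()          # number of NaN entries per column
duplicated_ids = toy["id"].duplicated().sum()  # rows whose id repeats an earlier row
class_counts = toy["class"].value_counts().sort_index()  # samples per class

print(missing_per_column, duplicated_ids, class_counts, sep="\n")
```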


# Data preprocessing

## Feature selection

Let us look at the correlation coefficients between the numeric covariates and the labels:

In [24]:
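# Correlations are only defined for numeric columns, so select those explicitly
print(x.select_dtypes(include="number").corrwith(y))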

pub_trans      0.742013
price         -0.011742
temperature    0.208521
inhabitants   -0.006517
registered     0.045788
cars           0.000198
labour_cost   -0.776912
humidity       0.099054
windspeed      0.107189
size          -0.068314
dtype: float64
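
To read the table more easily, the coefficients can be ranked by absolute value. The sketch below re-enters the values printed above by hand (in the notebook one would reuse the result of `corrwith` directly). Note that the Pearson coefficient only measures linear association, and that the string columns `country`, `id` and `nr_counterparts` are not covered by it.

```python
import pandas as pd

# Correlation values copied from the output above
corr = pd.Series({
    "pub_trans": 0.742013,
    "price": -0.011742,
    "temperature": 0.208521,
    "inhabitants": -0.006517,
    "registered": 0.045788,
    "cars": 0.000198,
    "labour_cost": -0.776912,
    "humidity": 0.099054,
    "windspeed": 0.107189,
    "size": -0.068314,
})

# Rank by strength of the linear association, ignoring the sign
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```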


Evidently, not all features are relevant, and we might as well _'forget'_ about the irrelevant ones. We recall our features and mark each as relevant or irrelevant:

		pub_trans: public transport index  (irrelevant)
		price: price to rent per km (relevant)
		temperature: average temperature during summer (relevant)
		inhabitants: number of inhabitants
		registered: number of registered users in thousand (relevant)
		country: country (relevant)
		id: internal dataset code (irrelevant)
		nr_counterparts: nr_counterparts (irrelevant)
		cars: number of cars per inhabitant (relevant)
		labour_cost: average labour cost in thousand per month
		humidity: absolute humidity in g/m3 (relevant)
		windspeed: average wind speed in km/h (irrelevant)
		size: size of city center in km2 (irrelevant)
		class: profitability (0 = loss, 1 = balanced, 2 = profit) <--- LABEL TO PREDICT !!!
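
Once the features are marked, the selection can be applied to the data frame. The sketch below drops a column marked irrelevant and one-hot encodes the string column `country` so that it can be used by a numeric model. It works on a toy frame built from the first three rows shown earlier; the list `COLUMNS_TO_DROP` and the choice of one-hot encoding are assumptions made for illustration.

```python
import pandas as pd

# First three rows of a few columns, as printed by x.head() earlier
toy = pd.DataFrame({
    "id": ["DZ83987", "DZ80864", "DZ05013"],
    "country": ["Germany", "Germany", "Switzerland"],
    "price": [0.019019, 0.153448, 0.924526],
})

# Hypothetical list; extend it with every feature marked irrelevant
COLUMNS_TO_DROP = ["id"]

reduced = toy.drop(columns=COLUMNS_TO_DROP)

# One 0/1 indicator column per country; an assumed encoding, not the only option
encoded = pd.get_dummies(reduced, columns=["country"], dtype=int)
print(encoded)
```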