# Machine learning: Final Project

## Authors: Maximilian Janisch and Marco Bertenghi

### Problem Description

We have been assigned a multi-class classification problem: we want to predict which of __three__ classes (i.e. $\mathcal{Y} = \{0,1,2\}$) a given input in $\mathcal{X} = \mathbb{R}^d$ belongs to. We are instructed to solve this problem by reducing it to a sequence of binary classification problems (i.e. where $\mathcal{Y}'=\{0,1\}$). We are __not__ allowed to use the multi-class implementation of sklearn, and one of the models we implement must be a random forest. The goal of the project is to describe and document in detail how we find and implement the best machine learning algorithm for the dataset assigned to us.
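
The reduction to binary problems can be done in several ways, and the description above does not fix one. As a minimal sketch, assuming a one-vs-rest scheme (one binary problem per class), the following shows how the labels could be split and how per-class scores could be combined again. The helper names `one_vs_rest_labels` and `combine_scores` and the toy labels are hypothetical and serve only as an illustration:

```python
import numpy as np

def one_vs_rest_labels(y, classes=(0, 1, 2)):
    """Map each class c to a binary label vector: 1 where y == c, else 0."""
    y = np.asarray(y)
    return {c: (y == c).astype(int) for c in classes}

def combine_scores(scores):
    """Pick, for each sample, the class whose binary model scored highest.

    `scores` has shape (n_samples, n_classes); column j holds the score of
    the binary model trained for class j.
    """
    return np.argmax(np.asarray(scores), axis=1)

# Toy labels (assumption, not taken from the dataset)
y_toy = np.array([0, 2, 1, 2])
binary_targets = one_vs_rest_labels(y_toy)

# Toy scores from three hypothetical binary models for two samples
toy_scores = [[0.1, 0.7, 0.2],
              [0.5, 0.2, 0.3]]
predicted = combine_scores(toy_scores)
```

Each entry of `binary_targets` can then be used as the target of one binary classifier with $\mathcal{Y}'=\{0,1\}$.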


We are given the __E-SCOOTER DATASET__, which contains data about the profitability of e-scooter companies. The goal is to predict, as well as possible, the profitability of e-scooter companies from the information provided. The dataset comes as a CSV file (comma-separated values file) named dataset20.csv and is accompanied by the following description:

__E-SCOOTER DATASET__

----------------------------------------------

__Dataset description__:
The data is compiled from cities, towns, and small villages in Germany, Austria, and Switzerland, classified according to the profitability of e-scooter companies active in that location. There are __thirteen different features__ associated with each location. The goal is to predict the profitability of an e-scooter company (feature "class") from the other features.

__Attention!__
Please notice that the data has been artificially generated. The dataset does not reflect real-world statistical correlations between features and labels.

	Number of samples: 500
	Number of features: 13 (numeric and strings) + one column of class labels (0,1,2)
	Features description:
		pub_trans: public transport index
		price: price to rent per km
		temperature: average temperature during summer
		inhabitants: number of inhabitants
		registered: number of registered users in thousand
		country: country
		id: internal dataset code
		nr_counterparts: nr_counterparts
		cars: number of cars per inhabitant
		labour_cost: average labour cost in thousand per month
		humidity: absolute humidity in g/m3
		windspeed: average wind speed in km/h
		size: size of city center in km2
		class: profitability (0 = loss, 1 = balanced, 2 = profit) <--- LABEL TO PREDICT

----------------------------------------


We notice that:

+ We have a total of 500 samples.
+ We have a total of 13 features.
 + Not all features will be relevant, more on that later.
+ The profitability is given as a class; it is the label we want to predict.
+ The CSV file uses ';' as a delimiter.

# Coding part

## Reading in the data

In [3]:
# Preliminary code, libraries we want to load
import numpy as np
import matplotlib.pyplot as plt
import csv

We read the file with the csv.reader function from Python's standard csv module, which is a simple and dependable way to parse the data.

In [30]:
with open('dataset20.csv', 'r', newline='') as f:
    data = csv.reader(f, delimiter=';')  # the file uses ';' as the delimiter

    features_names = np.array(next(data))  # the header row holds the column names

    x = []  # rows of the dataset (all 14 columns, read as strings)
    y = []  # class labels

    for row in data:
        x.append(row)
        y.append(row[-1])  # 'class' is the last column

    x = np.array(x)
    y = np.array(y)

print(x.shape)  # Sanity check, to see what our data looks like.
print(y.shape)

(499, 14)
(499,)


As a brief sanity check, we print the header (which contains the column names) and the first data row to get an idea of the dataset:

In [31]:
print(features_names)
print(x[0,:])

['pub_trans' 'price' 'temperature' 'inhabitants' 'registered' 'country'
 'id' 'nr_counterparts' 'cars' 'labour_cost' 'humidity' 'windspeed' 'size'
 'class']
['14.202635088464051' '0.019018646290564982' '25.17513069713383'
 '2.3552816631348117' '118.02120779845077' 'Germany' 'DZ83987' 'few'
 '0.41361381893833177' '2.2378136716970234' '2.6653106348180056'
 '2.831770192097773' '0.7485818800190784' '2']
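
The output above shows that csv.reader returns every entry as a string, including the numeric ones. As a minimal sketch, assuming we want to find out which columns can be parsed as numbers, the following applies a hypothetical helper `is_float` to the first row printed above (copied here so that the example is self-contained):

```python
import numpy as np

features_names = np.array([
    'pub_trans', 'price', 'temperature', 'inhabitants', 'registered',
    'country', 'id', 'nr_counterparts', 'cars', 'labour_cost', 'humidity',
    'windspeed', 'size', 'class'])

# First data row, copied from the output above
first_row = np.array([
    '14.202635088464051', '0.019018646290564982', '25.17513069713383',
    '2.3552816631348117', '118.02120779845077', 'Germany', 'DZ83987', 'few',
    '0.41361381893833177', '2.2378136716970234', '2.6653106348180056',
    '2.831770192097773', '0.7485818800190784', '2'])

def is_float(s):
    """Return True if the string s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

# Boolean mask marking the columns whose entry parses as a number
numeric_mask = np.array([is_float(v) for v in first_row])

string_columns = features_names[~numeric_mask]
numeric_values = first_row[numeric_mask].astype(float)

print(string_columns)
```

Checking a single row is only a heuristic; a column could still contain non-numeric entries further down in the file.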


# Data preprocessing

#### Feature selection

Not all features are relevant, so we can _'forget'_ about the irrelevant ones. We list the features again and mark each one as relevant or irrelevant:

		pub_trans: public transport index  (irrelevant)
		price: price to rent per km (relevant)
		temperature: average temperature during summer (relevant)
		inhabitants: number of inhabitants
		registered: number of registered users in thousand (relevant)
		country: country (relevant)
		id: internal dataset code (irrelevant)
		nr_counterparts: nr_counterparts (irrelevant)
		cars: number of cars per inhabitant (relevant)
		labour_cost: average labour cost in thousand per month
		humidity: absolute humidity in g/m3 (relevant)
		windspeed: average wind speed in km/h (irrelevant)
		size: size of city center in km2 (irrelevant)
		class: profitability (0 = loss, 1 = balanced, 2 = profit) <--- LABEL TO PREDICT !!!
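
The marking above can be turned into a column selection. The following is a minimal sketch that drops the columns marked as irrelevant, together with the label column, using a boolean mask on the column names. It assumes that the two features without a mark (`inhabitants` and `labour_cost`) are kept for now, and it uses a single row copied from the output above in place of the full array `x`:

```python
import numpy as np

features_names = np.array([
    'pub_trans', 'price', 'temperature', 'inhabitants', 'registered',
    'country', 'id', 'nr_counterparts', 'cars', 'labour_cost', 'humidity',
    'windspeed', 'size', 'class'])

# Stand-in for the full data array: only the first row of the dataset
x_sample = np.array([[
    '14.202635088464051', '0.019018646290564982', '25.17513069713383',
    '2.3552816631348117', '118.02120779845077', 'Germany', 'DZ83987', 'few',
    '0.41361381893833177', '2.2378136716970234', '2.6653106348180056',
    '2.831770192097773', '0.7485818800190784', '2']])

# Columns marked as irrelevant in the list above
irrelevant = ['pub_trans', 'id', 'nr_counterparts', 'windspeed', 'size']

# Keep every column that is neither irrelevant nor the label.
# Assumption: unmarked features (inhabitants, labour_cost) are kept.
keep_mask = ~np.isin(features_names, irrelevant + ['class'])

selected_names = features_names[keep_mask]
x_selected = x_sample[:, keep_mask]

print(selected_names)
print(x_selected.shape)
```

Selecting by name rather than by position keeps the code valid if the column order in the file changes.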