<img src="images/bwHPC_Logo_cmyk.svg" width="200" /> <img src="images/HochschuleEsslingen_Logo_RGB_DE.png" width="200" /> <img src="images/Konstanz_Logo.svg" width="200" /> <img src="images/KIT_Logo.png" width="200" />

# Machine Learning

* Machine learning (ML) algorithms use statistics to find patterns in large amounts of data. These algorithms can then make decisions and predictions. The better the data, the more accurate the predictions.
* What problems can be solved with ML?
  Creditworthiness, price predictions, spam filters, ...

## Supervised Learning
 - Past data is known (taxi rides, number of passengers, tip provided, ...)
 - Labeled: The desired output is known (e.g., the total charge of a taxi ride)
 - Regression Task: The label to be predicted is continuous (e.g., prices)
 - Classification Task: Classifying into a specific category (handwriting recognition)

## Unsupervised Learning
 - Labels are not available
 - The algorithm must find the labels (e.g., groups of similar data points) by itself

## Data division
- X: Features (Data used for estimation)
- y: Label (value we want to predict, e.g. price of the taxi ride)

### Training Dataset vs. Test Dataset
- Measuring the reliability of the trained model
- Training with the training dataset and verifying with the test dataset

![title](images/train_test.png)
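
The split shown in the figure can be sketched in plain Python. This is only an illustration of the idea: `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and the data is made up.

```python
import random

def simple_train_test_split(X, y, test_size=0.25, seed=42):
    """Shuffle the row indices and split them into a training and a test part."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = max(1, round(len(X) * test_size))    # number of test samples
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: 6 samples with 2 features each, label = row index
X = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
y = [0, 1, 2, 3, 4, 5]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y, test_size=0.5)
print(len(X_train), len(X_test))  # 3 training samples, 3 test samples
```

Features and labels are split with the same indices, so every row of `X_train` still belongs to the matching entry of `y_train`.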

# Linear Regression

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/1024px-Linear_regression.svg.png" alt="Linear Regression"
	title="Linear Regression" width="500" />

$\hat y = b_{0}x_{0} + \dots + b_{n}x_{n}$

$\hat y$: Predicted output

$x_{i}$: Feature

$b_{i}$: Parameter of the algorithm

Plan: Find the parameters $b_{i}$ for which the line best fits the cloud of data points
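
For a single feature, the best-fitting line can be computed directly with the least-squares formulas. The sketch below assumes $x_{0} = 1$, so that $b_{0}$ acts as the intercept; `fit_line` is a hypothetical helper and the data is made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y_hat = b0 + b1 * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data that lies exactly on the line fare = 3 + 2 * distance
distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_line(distances, fares)
print(b0, b1)                 # 3.0 2.0
prediction = b0 + b1 * 5.0    # predicted fare for a distance of 5
print(prediction)             # 13.0
```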

### Gradient descent
<img src="https://blog.paperspace.com/content/images/2018/05/gd_basic.png" alt="Gradient descent"
	title="Gradient descent" width="500" />

Source: https://blog.paperspace.com/content/images/2018/05/gd_basic.png
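
The idea in the figure can be sketched for a line with one feature: start with arbitrary parameters and repeatedly step against the gradient of the mean squared error. The learning rate, the number of steps and the data are assumptions chosen for this toy example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y_hat = b0 + b1 * x by minimizing the mean squared error."""
    b0, b1 = 0.0, 0.0   # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        # Prediction error of the current parameters for every sample
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b0 and b1
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step downhill, i.e. against the gradient
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Made-up data that lies exactly on the line y = 3 + 2 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]

b0, b1 = gradient_descent(xs, ys)
print(round(b0, 3), round(b1, 3))  # approaches 3.0 and 2.0
```

If the learning rate is too large, the updates overshoot and the parameters diverge; if it is too small, many more steps are needed.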

# Scikit-learn
- Offers a variety of ML algorithms
- Algorithms can be swapped quickly because they share the same interface (`fit`, `predict`)
- Provides tools for model validation and selection
- Is well-documented with a large and active community

- X_train; y_train (Training data)
- X_test; y_test (Test data)

In [None]:
from sklearn.model_selection import train_test_split # Create the train-test split
import numpy as np

X, y = np.arange(12).reshape((6, 2)), range(6)
display(X) # Features: 6 samples with 2 values each
display(y) # Labels to be predicted from X (here simply the row index of each sample)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y) # Default: 25% test data; other ratios (e.g. 70% train, 30% test) can be specified
# After the split:
display(X_train) # 4 samples for training
display(X_test)  # 2 samples for testing
display(y_train) # Labels of the training data (row indices)
display(y_test)  # Labels of the test data (to be predicted)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5) # Here specifying a test share of 50%
# After the split:
display(X_train) # 3 samples for training
display(X_test)  # 3 samples for testing
display(y_train) # Labels of the training data
display(y_test)  # Labels of the test data

In [None]:
# Import ML-Algorithm to be used on the training data
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [None]:
model.fit(X_train, y_train) # Train the model

In [None]:
predictions = model.predict(X_test) # Predict values for X_test

In [None]:
from sklearn.metrics import mean_squared_error
performance = mean_squared_error(y_test, predictions) # Compare between y_test and the predictions of the model with error metric mean_squared_error()
display(y_test)
display(predictions)
display(performance)
# Due to the direct linear relationship between features and label
# (the label is the first feature value divided by 2)
# the performance is rather good even with few data points (MSE close to 0)

In [None]:
from sklearn.model_selection import train_test_split # Create Train-Test Split
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = np.arange(4).reshape((2, 2)), range(2)
display(X)
display(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
performance = mean_squared_error(y_test, predictions)
display(y_test)
display(predictions)
display(performance) # With only one sample for training, the performance degrades (MSE close to 1)

## First small ML example (without parallelization)

Based on the same New York Taxi trip data set: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import sys

In [None]:
#n = 100  # Reduce the CSV dataset, since there is not enough memory on the shared Jupyter Notebook instance.
#df = pd.read_csv('s3://nyc-tlc/trip data/green_tripdata_2019-02.csv', parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'], header=0, skiprows=lambda i: i % n != 0)
## Example: i=5 --> 5 mod 100 != 0 --> lambda returns True --> row is skipped
##
## Access to S3 requires an AWS account
## Please use the provided Parquet file instead (see below)

# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')
display(df)
df = df.sample(1000) # Reduce Data set
df

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df['pickup_hour'] = df['lpep_pickup_datetime'].dt.hour # Add a column with the pickup hour (24 categories for the 24 hours of the day)

In [None]:
df['pickup_hour'].tail(500)


In [None]:
df['ride_duration'] = df['lpep_dropoff_datetime'].sub(df['lpep_pickup_datetime'], axis=0)

In [None]:
df


In [None]:
df['ride_duration_minutes'] = df['ride_duration'].dt.total_seconds().div(60).astype(int)

In [None]:
df

In [None]:
df = df[['passenger_count', 'trip_distance', 'fare_amount', 'total_amount', 'tip_amount','pickup_hour', 'ride_duration_minutes']] # Cleanup for the pairplot

In [None]:
fig = sns.pairplot(df)
fig.savefig("output.png")

In [None]:
# Examples using MatPlotLib
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16,6)) # 1 row with 3 columns
axes[0].plot(df['tip_amount'],df['total_amount'],'o')
axes[0].set_ylabel("Total Amount")
axes[0].set_title("Tip Amount")

axes[1].plot(df['fare_amount'],df['total_amount'],'o')
axes[1].set_ylabel("Total Amount")
axes[1].set_title("Fare Amount")

axes[2].plot(df['trip_distance'],df['total_amount'],'o')
axes[2].set_ylabel("Total Amount")
axes[2].set_title("Trip Distance")

plt.tight_layout()

In [None]:
X = df[['trip_distance', 'pickup_hour', 'ride_duration_minutes']] # Defining the features
X

In [None]:
y = df['tip_amount']
y

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.20, random_state=42) # Defining the test-train percentage

In [None]:
len(X_test)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
prediction = model.predict(X_test)
type(prediction)

In [None]:
np.set_printoptions(threshold=sys.maxsize)
prediction

## Evaluation

Now we have a trained model. But how well does it predict?

<img src="https://i.imgur.com/19LNbyQ.jpg" alt="MAE" title="MAE" width="500" />
    
Source: https://stackoverflow.com/questions/56401346/mean-absolute-error-in-tensorflow-without-built-in-functions/56401550

Problem with the Mean Absolute Error (MAE): If only a few predictions are extremely off, the average barely shows it.

Better: Mean Squared Error (MSE): Because the error is squared, predictions that are extremely off are penalized much more strongly.

<img src="https://cdn-media-1.freecodecamp.org/images/hmZydSW9YegiMVPWq2JBpOpai3CejzQpGkNG" alt="MSE" title="MSE" width="500" />

<img src="https://miro.medium.com/max/483/1*lqDsPkfXPGen32Uem1PTNg.png" alt="RMSE" title="RMSE" width="500" />


Root Mean Squared Error (RMSE): Again, predictions that are far off weigh heavily, and taking the square root returns the error to the same units as the label. This is the preferred error metric.
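
The difference between the three metrics shows up in a small hand-computed example. The helper functions `mae`, `mse` and `rmse` and the numbers are made up for illustration: both sets of predictions have the same MAE, but only the squared metrics reveal the single large error.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, same units as the label."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [2.0, 3.0, 4.0, 5.0]
pred_even = [3.0, 4.0, 5.0, 6.0]      # every prediction is off by 1
pred_outlier = [2.0, 3.0, 4.0, 9.0]   # three exact predictions, one off by 4

print(mae(y_true, pred_even), mse(y_true, pred_even), rmse(y_true, pred_even))           # 1.0 1.0 1.0
print(mae(y_true, pred_outlier), mse(y_true, pred_outlier), rmse(y_true, pred_outlier))  # 1.0 4.0 2.0
```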

## Question: Which RMSE is good, and which is not?
Answer as always: it depends!

An RMSE of 50€ would be good in relation to large costs (the price of a house), but not for a (hopefully cheaper) taxi ride.

In [None]:
from sklearn.metrics import mean_squared_error
display(df['tip_amount'].mean()) # Average tip as a reference value
display(np.sqrt(mean_squared_error(y_test, prediction))) # RMSE of the model on the test data

The RMSE is very large compared to the average tip_amount --> not a good model for prediction.

## Cross Validation, Grid Search
- Many ML algorithms can be customized with parameters. Question: Which parameters produce the best results?

- Idea: Set a parameter, train the model, and evaluate it with the test data.

- It is not good if, for example, the last 20% of the data is always used as test data. By chance, the trained model may perform very well or very poorly on exactly this data.

- Therefore, the complete data set is divided into training, validation, and test data. The test data set is only used once the final parameters have been found.

### k-fold Cross Validation
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1920px-K-fold_cross_validation_EN.svg.png" alt="K-fold Cross Validation" title="K-Fold Cross Validation" width="500" />

1. Parameters of the ML algorithm are defined.
2. k iterations are carried out, each with different test and training data.
3. Error of each iteration is calculated.
4. The average error over all iterations evaluates the current parameter configuration of the ML algorithm.
5. Change the parameters and repeat steps 1 to 4.
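
Steps 2 to 4 can be sketched in plain Python. The helper `k_fold_indices` is hypothetical, and a trivial "model" that always predicts the mean of its training labels stands in for a real ML algorithm; the labels are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the samples as evenly as possible over the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

y = [float(i) for i in range(10)]   # made-up labels
k = 5
folds = list(k_fold_indices(len(y), k))

fold_errors = []
for train_idx, test_idx in folds:
    # "Training": the stand-in model just memorizes the mean of the training labels
    mean_train = sum(y[i] for i in train_idx) / len(train_idx)
    # Error of this iteration: MSE on the held-out fold
    fold_mse = sum((y[i] - mean_train) ** 2 for i in test_idx) / len(test_idx)
    fold_errors.append(fold_mse)

average_error = sum(fold_errors) / k
print(fold_errors)
print(average_error)
```

Every sample is used for testing exactly once and for training in the remaining k-1 iterations.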

In [None]:
# Just an example: SVR stands in for any ML algorithm and can be replaced by another scikit-learn estimator
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
model = SVR(C=1) # C is the parameter under evaluation
scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=5) # One (negated) MSE per fold

# Afterwards calculate the average of the errors (-scores.mean()) and repeat for the next parameter value of the ML algorithm

## Grid Search

Previously, algorithm parameters had to be adjusted manually. GridSearchCV automates this: it takes a dictionary with a list of candidate values for each parameter and evaluates every combination with cross-validation. At the end, the configuration that delivered the best results can be displayed.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')
df = df.sample(1000)
df['pickup_hour'] = df['lpep_pickup_datetime'].dt.hour
df['ride_duration'] = df['lpep_dropoff_datetime'].sub(df['lpep_pickup_datetime'], axis=0)
df['ride_duration_minutes'] = df['ride_duration'].dt.total_seconds().div(60).astype(int)
X = df[['trip_distance', 'pickup_hour', 'ride_duration_minutes']]
y = df['tip_amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.20, random_state=42)

from sklearn.svm import SVR
reg = SVR(C=1)
#param_grid = {'param1':[0.1, 0.2, ...], 'param2':[0.4, 0.5,...] }
#param_grid = {'param1':[0.1], 'param2':[0.4] }
param_grid = {'C': [0.001,0.01,0.1,1,10,100,1000]}

from sklearn.model_selection import GridSearchCV
grid_model = GridSearchCV(reg, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, verbose=2)
grid_model.fit(X_train, y_train)

grid_model.best_estimator_
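
Conceptually, a grid search with cross-validation is a loop over the candidate values that keeps the one with the lowest average error. The sketch below illustrates this idea only; it is not how GridSearchCV is implemented. The toy model `fit_ridge_line` with its parameter `alpha`, the helper `cross_val_mse` and the data are assumptions made for this example.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y_hat = b0 + b1 * x; alpha shrinks the slope b1 toward 0 (toy model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / (s_xx + alpha)
    b0 = mean_y - b1 * mean_x
    return b0, b1

def cross_val_mse(xs, ys, alpha, k=5):
    """Average test MSE over k folds for one parameter value."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th sample forms the test fold
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        b0, b1 = fit_ridge_line(train_x, train_y, alpha)
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Made-up data: a line with a small alternating disturbance
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

param_grid = [0.0, 0.1, 1.0, 10.0, 100.0]   # candidate values for alpha
results = {alpha: cross_val_mse(xs, ys, alpha) for alpha in param_grid}
best_alpha = min(results, key=results.get)  # lowest average error wins

print("CV error per alpha:", results)
print("best alpha:", best_alpha)
```

With several parameters, the loop runs over every combination of candidate values, so the number of model fits grows quickly with the size of the grid.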