# Statistics Meets Logistics
--- 

## Description
This notebook holds the DataFrames and analysis for the project. Requirements for the project environment can be found in https://github.com/luiul/statistics-meets-logistics/blob/main/requirements.txt. **Disclaimer**: the project has not been tested in other environments. 

## Goal 
The goal of this project is to perform a regression analysis given raw download and upload data to estimate the throughput of the system, i.e. the label we're trying to predict. 

## Overview
We were given raw download and upload data collected from ...

## In General
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features').

## Question
What is the predicted throughput?

## Architecture Model Described in the Paper
<img src="./figures/architecture.png" width="600" alt="Architecture model for the client-based data rate prediction." class="center">



## From the Article: Boosting VtC Communication by ML-enabled Context Prediction

The article proposes a client-side opportunistic transmission scheme that applies machine learning-based data rate prediction to schedule the transmission times of sensor data with respect to the expected resource efficiency.

The studies agree that passively measurable network quality indicators such as Reference Signal Received Power (RSRP), Reference Signal Received Quality (RSRQ), Signal-to-interference-plus-noise Ratio (SINR), and Channel Quality Indicator (CQI) provide meaningful information, which can be leveraged to estimate the resulting data rate based on machine learning methods even in challenging environments. In comparison to time series-based active data rate prediction (e.g., based on Kalman filters), passive approaches do not monitor the data rates of ongoing transmissions and can therefore be applied without introducing additional traffic themselves. As resource efficiency is one of the optimization goals of this work, we focus on passive data rate prediction.

In this context, the use of connectivity maps for anticipatory communication makes it possible to exploit a priori information about the channel quality based on previous measurements in the same geographical area. Radio Environment Maps (REMs) implement a similar concept, which enables opportunistic data transfer with Cognitive Radio (CR) methods. However, those purely spectrum-aware approaches do not consider the cross-layer interdependencies within the protocol stack. Moreover, as the resource allocation in LTE is performed by the scheduling mechanisms of the evolved NodeB (eNB), those methods have to be implemented by the mobile network operator. In contrast, the proposed machine learning-based approach can easily be implemented on the client side without requiring modifications to the network infrastructure.

The feature set of the data rate prediction is composed of the network quality indicators, the velocity and the payload size of the data packet. The resulting data rate of the active transmission is used as the label for the prediction process, which is performed with the models Artificial Neural Network (ANN), Linear Regression (LR), Random Forest (RF), M5 Decision Tree (M5T) and Support Vector Machine (SVM). Finally, the prediction performance of the different models is evaluated using 10-fold cross validation. Additionally, the measured channel context parameters and the position information of the vehicle are utilized to create a multi-layer connectivity map that stores the cell-wise average of each indicator from multiple visits of the same geographical area.
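
The cell-wise averaging behind such a connectivity map could look roughly like the sketch below. The column names, the grid resolution `CELL_SIZE_DEG`, and the measurement values are assumptions made for illustration; they are not taken from the article or from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements from two visits of two nearby areas
measurements = pd.DataFrame({
    "lat":  [51.4901, 51.4902, 51.4951, 51.4952],
    "lon":  [7.4101, 7.4102, 7.4151, 7.4152],
    "rsrp": [-95.0, -97.0, -105.0, -103.0],
    "sinr": [12.0, 10.0, 4.0, 6.0],
})

CELL_SIZE_DEG = 0.001  # assumed grid resolution in degrees

# Map each position to an integer grid cell index
measurements["cell_lat"] = np.floor(measurements["lat"] / CELL_SIZE_DEG).astype(int)
measurements["cell_lon"] = np.floor(measurements["lon"] / CELL_SIZE_DEG).astype(int)

# One layer per indicator: the average of all measurements that fall into a cell
connectivity_map = (
    measurements
    .groupby(["cell_lat", "cell_lon"])[["rsrp", "sinr"]]
    .mean()
)
print(connectivity_map)
```

Each row of `connectivity_map` is one grid cell, and each column is one layer of the map.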


# Import Libraries and Set Options
---

## Import Libraries

In [1]:
# Vector (Series) & Matrix (DataFrame) manipulation
import numpy as np
import pandas as pd

In [2]:
# Data Visualization
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

# If JavaScript is configured and enabled:
# static images: 
# %matplotlib inline

# interactive images: 
# %matplotlib notebook

In [3]:
# Interactive Data Visualization
# import plotly.express as px

In [4]:
# Python Utilities
# Generate datetime objects from raw timestamps and vice versa
from datetime import datetime

# OS Interface
# import os

# Regex search patterns 
# import re
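
The conversion between raw timestamps and `datetime` objects mentioned in the cell above can be sketched as follows. The timestamp value is hypothetical, and passing an explicit UTC time zone is an assumption made here so that the result does not depend on the local machine settings.

```python
from datetime import datetime, timezone

raw_timestamp = 1_600_000_000  # hypothetical Unix timestamp in seconds

# Raw timestamp -> timezone-aware datetime object
dt = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)

# datetime object -> raw timestamp (float, in seconds)
roundtrip = dt.timestamp()

print(dt.isoformat(), roundtrip)
```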

## Check Prerequisites

In [5]:
# calling np.version.version should return 1.18.1
# np.version.version

# calling pd.__version__ should return 1.1.2
# pd.__version__

## Set Options

In [6]:
pd.set_option('display.max_columns',None)
# Avoid a truncated view of the DataFrame (scroll to view all columns); set to 0 for pandas to auto-detect the width of the terminal and print a truncated object that fits the screen width

# pd.set_option('float_format', '{:.2f}'.format)
# Prints floats with two decimal places; keep this line commented out in this project, since the features lat and lon have significant digits beyond the second decimal place
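
The effect of the float format option on coordinate columns can be illustrated with a small sketch. The coordinate values are made up, and `pd.option_context` is used here only so that the option is reverted automatically after the block.

```python
import pandas as pd

# Hypothetical positions that differ only after the second decimal place
positions = pd.DataFrame({
    "lat": [51.490123, 51.490987],
    "lon": [7.410456, 7.410789],
})

# Temporarily apply the two-decimal float format
with pd.option_context("display.float_format", "{:.2f}".format):
    rounded_view = positions.to_string()

# Default display settings
full_view = positions.to_string()

print(rounded_view)
print(full_view)
```

With the two-decimal format, both rows render identically, which hides exactly the digits that distinguish the positions.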

In [7]:
# Display all outputs if the cell has multiple commands as its input

# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

## Scikit-learn Libraries

### Train | Test Split & Pre-Processing

In [8]:
# Split Function (see Signature for correct tuple unpacking)
# from sklearn.model_selection import train_test_split

# Default split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [9]:
# When performing a classic Train | Test Split, fit ONLY to X_train to avoid data leakage! (Follow the procedure described in the documentation under Cross Validation and Linear Regression Project)

# Data Scaling (if the features differ in order of magnitude)
# from sklearn.preprocessing import StandardScaler
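
The split and scaling steps from the two cells above can be combined as in the following minimal sketch. The feature matrix and label are synthetic stand-ins, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features on different orders of magnitude
rng = np.random.default_rng(42)
X = rng.normal(loc=[-100.0, 10.0, 5.0], scale=[10.0, 5.0, 2.0], size=(100, 3))
y = rng.normal(size=100)

# Note the order of the returned tuple: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Fit the scaler on the training set only, then transform both sets with it
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Because the scaler never sees `X_test` during fitting, the mean and standard deviation of the test set do not leak into the model.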

In [10]:
# k-fold cross validation scores; estimator = ML model, cv = fold value, scoring = error metric (use the ones provided by sklearn!)

# from sklearn.model_selection import cross_val_score
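
A minimal sketch of 10-fold cross validation with `cross_val_score`, assuming synthetic data and a linear regression model as the estimator. The scorer name `neg_mean_absolute_error` is one of the built-in scikit-learn scorers; its values are negated errors, so the sign is flipped to report the error itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# One score per fold; higher (closer to zero) is better for negated errors
scores = cross_val_score(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_absolute_error"
)

# Flip the sign to obtain the mean absolute error averaged over the folds
mean_mae = -scores.mean()
print(mean_mae)
```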

In [11]:
# Polynomial Regression ( poly_trafo: X->X*...*X )
# from sklearn.preprocessing import PolynomialFeatures
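
The polynomial transformation can be seen on a single sample with two features `a` and `b`. The sketch below assumes degree 2 and drops the constant bias column, so the output columns are `a`, `b`, `a^2`, `a*b`, `b^2`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with features a = 2 and b = 3
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Columns: a, b, a^2, a*b, b^2
print(X_poly)
```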

In [12]:
# Grid search with cross validation
# from sklearn.model_selection import GridSearchCV

### Linear Models

In [13]:
# Linear Regression Model
# from sklearn.linear_model import LinearRegression

In [14]:
# Elastic Net Regularization: start here for regularization in Linear Regression. Make sure to keep an l1_ratio that allows us to go fully to Lasso or fully to Ridge. See Lasso and Ridge explanations below. 
# from sklearn.linear_model import ElasticNetCV

# Use from sklearn.linear_model import ElasticNet in case CV is done manually or with a grid search

# Standard procedure with no grid search: create X and y, split data, scale data (standardize)
# Standard procedure with grid search: create X and y, split data, scale data (standardize), instantiate base model, 
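
One possible reading of the grid search procedure is sketched below. The steps after instantiating the base model (defining a parameter grid, fitting the search, evaluating on the test set) are assumptions, as are the synthetic data and the grid values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Create X and y (synthetic stand-in data)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=150)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Scale data (standardize), fitting on the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate the base model
base_model = ElasticNet(max_iter=10_000)

# Assumed grid: l1_ratio close to 0 behaves like Ridge, l1_ratio = 1 is Lasso
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9, 1.0]}

grid = GridSearchCV(base_model, param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train_scaled, y_train)

best_params = grid.best_params_
test_r2 = grid.best_estimator_.score(X_test_scaled, y_test)  # R^2 on held-out data
print(best_params, test_r2)
```

Wrapping the scaler and the model in a scikit-learn `Pipeline` would additionally keep the scaling inside each cross validation fold; the sketch keeps the steps separate to mirror the procedure listed above.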

In [15]:
# L2: Ridge Regularization: adds a squared-beta shrinkage penalty. Hyper-parameter alpha: alpha=0 -> RSS minimization. RidgeCV takes a tuple of alphas and selects the hyper-parameter that delivers the best performance (based either on the default scorer or on one from the SCORERS dictionary)
# from sklearn.linear_model import RidgeCV

# Use from sklearn.linear_model import Ridge in case CV is done manually
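
A minimal sketch of how `RidgeCV` picks alpha from a tuple of candidates. The data is synthetic, and the candidate values and the scorer are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -0.5, 0.0, 2.0]) + rng.normal(scale=0.2, size=120)

# Candidate alphas; the scorer is optional and falls back to a default if omitted
candidate_alphas = (0.1, 1.0, 10.0)
ridge_cv = RidgeCV(alphas=candidate_alphas, scoring="neg_mean_absolute_error")
ridge_cv.fit(X, y)

# The alpha that performed best during cross validation
print(ridge_cv.alpha_)
```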

In [16]:
# L1: Lasso Regularization: adds an absolute-beta shrinkage penalty. Hyper-parameter alpha: alpha=0 -> RSS minimization. There are two ways to determine the alpha hyper-parameter: (a) provide a list of alphas as an array (b) let the class set alpha automatically based on eps and n_alphas (we use the default values)
# from sklearn.linear_model import LassoCV

# Use from sklearn.linear_model import Lasso in case CV is done manually
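
Option (b) above can be sketched as follows, assuming synthetic data in which two of the five features have no influence on the label. Only `eps` is set explicitly (it is the ratio of the smallest to the largest alpha on the automatically generated path); the number of alphas is left at the library default.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: features 1 and 3 are irrelevant (true coefficient 0)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# The alpha path is derived automatically from eps
lasso_cv = LassoCV(eps=1e-3, cv=5)
lasso_cv.fit(X, y)

print(lasso_cv.alpha_)  # alpha selected by cross validation
print(lasso_cv.coef_)   # irrelevant features are shrunk toward zero
```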

### Support Vector Machines

In [17]:
# from sklearn.svm import SVR

# from sklearn.svm import LinearSVR
# faster than the generic SVR, with the caveat that it only supports a linear kernel (LinearSVC is the classification counterpart)

### Performance Metrics

In [18]:
# Performance Evaluation: common evaluation metrics; they can also be found in the SCORERS dictionary (although transformed such that the higher the score, the better the model performance)
# from sklearn.metrics import mean_absolute_error, mean_squared_error
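
A small worked example of the two metrics on made-up values, together with the root mean squared error derived from the MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted values; absolute errors are 2, 2, 3, 3
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

mae = mean_absolute_error(y_true, y_pred)  # (2 + 2 + 3 + 3) / 4 = 2.5
mse = mean_squared_error(y_true, y_pred)   # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)                        # same unit as the label

print(mae, mse, rmse)
```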

In [19]:
# Dictionary with different scorer objects; by convention, higher return values are better than lower ones, e.g. errors are negated so that maximizing the score minimizes the error
# from sklearn.metrics import SCORERS
# Note: SCORERS was removed in scikit-learn 1.3; newer versions list the scorer names via sklearn.metrics.get_scorer_names()

In [20]:
# Normal Probability Plot
# import scipy as sp
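
A normal probability plot compares ordered values (for example, regression residuals) against the quantiles of a normal distribution. The sketch below assumes synthetic residuals and uses `scipy.stats.probplot` without drawing anything; passing `plot=plt` would additionally render the figure.

```python
import numpy as np
from scipy import stats

# Synthetic residuals drawn from a normal distribution with standard deviation 2
rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=2.0, size=500)

# Theoretical quantiles, ordered sample values, and the least-squares fit line
(theoretical_q, ordered_residuals), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# r close to 1 indicates that the points lie near a straight line
print(slope, intercept, r)
```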

### Model Deployment

In [21]:
# ML Model Deployment 
# from joblib import dump, load
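
Persisting and restoring a fitted model with `joblib` can be sketched as follows. The toy model, the file name `throughput_model.joblib`, and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "throughput_model.joblib")  # hypothetical file name
    dump(model, model_path)     # serialize the fitted model to disk
    restored = load(model_path) # load it back as a new object

# The restored model predicts exactly like the original one
prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```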

# Models