# Module 1: Introduction
## 1.1. Introducción al Machine Learning

Let's suppose you want to sell a car. Depending on your car's make, model, age, mileage, etc., you can sell it for a certain price. An expert can look at the car's features and determine a price based on them. In other words, the expert took data and extracted patterns from that data.

DATA -> EXPERT -> PATTERN ... or, equivalently ... DATA -> MACHINE LEARNING -> PATTERN

Machine Learning is a technique which allows us to build models that extract these patterns from data, just like the expert in our example.

* **Features** are the characteristics of the data we've got (year, make, mileage, etc).
* The **target** is a feature we want to predict.
* A **model** is a "description of statistical patterns" that predicts a target given some input. Models are trained with algorithms that take some input features as well as reference targets for those features. The algorithms then extract patterns that calculate the target given the feature inputs within some error margin, and those patterns are stored in the model.

Once we've trained a model, we can use it to process completely new input and predict the target from that input's features.

DEMONSTRATION

Our data is a collection of traffic traces between different SDN switches. One of the features is the PER (Packet Error Rate), which gives a good overall picture of the traffic between the switches, so it becomes the target. We will train on the data (features + target) to obtain a model through Machine Learning. Once we have the model, we will use it to predict the PER from the remaining features. In the end, it comes down to an equation:

$$features + modelo = PER$$

## 1.2. ML vs Rule-Based Systems

In the traditional programming paradigm, the developer defines how a system will behave by defining specific rules. However, for complex or ever-changing behaviors, this method can become unsustainable or even impossible.

For example: we can try to create a spam filter by using specific rules, such as filtering words, blocking certain senders, etc., but human language is so complex and spam changes so quickly that it's impossible to keep up. Our filter would become obsolete immediately and would never reach acceptable effectiveness.

ML offers a solution to this issue:

* We can gather data (in our example, emails, both regular email and spam) to create a dataset.
* We can define and calculate the features which are relevant to our dataset and the problem we're trying to solve.
* Finally, we can train and use a model which is able to recognize the patterns that distinguish regular email from spam, allowing us to act on it by filtering spam.

ML does not necessarily discard all Rule-Based Systems. We could take (some of) the rules defined in a Rule-Based System and use them as features for our ML model. Following the spam filter example: a feature could be whether the sender is from a specific domain, or whether the subject contains certain words.
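The idea of reusing rules as features can be sketched as follows. This is only an illustration: the domain names, keyword list and the `extract_features` helper are hypothetical and not part of any real spam filter.

```python
# Hypothetical rules, chosen only for illustration.
SUSPICIOUS_DOMAINS = {"promo.example", "deals.example"}
SUSPICIOUS_WORDS = {"free", "winner", "urgent"}


def extract_features(sender: str, subject: str) -> list[int]:
    """Turn hand-written rules into binary features (1 = the rule fired)."""
    domain = sender.rsplit("@", 1)[-1].lower()
    words = set(subject.lower().split())
    return [
        int(domain in SUSPICIOUS_DOMAINS),    # rule 1: sender domain
        int(bool(words & SUSPICIOUS_WORDS)),  # rule 2: keywords in the subject
        int(subject.isupper()),               # rule 3: subject written in all caps
    ]


emails = [
    ("alice@university.example", "Meeting notes for Monday"),
    ("offers@promo.example", "FREE PRIZE INSIDE"),
]

# Each email becomes one row of a feature matrix that a model could be trained on.
feature_matrix = [extract_features(sender, subject) for sender, subject in emails]
print(feature_matrix)  # [[0, 0, 0], [1, 1, 1]]
```

Instead of each rule deciding on its own whether an email is spam, a model trained on such rows would learn how much weight each rule deserves.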

Essentially, ML is a paradigm shift compared to traditional programming. Traditional programming follows this structure:

$$data + code = outcome$$

ML changes this equation to:

$$data + outcome = model$$

And the resulting model allows us to replace code in the original equation:

$$data + model = outcome$$
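A minimal sketch of the three equations, using the car price example. All numbers and function names here are made up for illustration, and the "model" is just a straight line fitted by least squares.

```python
# Traditional programming: data + code = outcome.
def price_by_rule(mileage_km: float) -> float:
    """Hand-written rule; the constants are invented by the developer."""
    return 20000.0 - 0.05 * mileage_km


# Machine learning: data + outcome = model.
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


mileage = [10000.0, 50000.0, 90000.0]  # data (made-up)
price = [19000.0, 15000.0, 11000.0]    # outcome (made-up)
slope, intercept = fit_line(mileage, price)


# The model replaces the hand-written code: data + model = outcome.
def price_by_model(mileage_km: float) -> float:
    return slope * mileage_km + intercept


print(price_by_rule(30000.0))             # value from the developer's rule
print(round(price_by_model(30000.0), 2))  # approximately 17000 for this data
```

The constants in `price_by_rule` come from the developer, while `slope` and `intercept` are extracted from the examples.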

## 1.3. Supervised Machine Learning

In Supervised Machine Learning (SML) there are always labels associated with certain features. The model is trained, and then it can make predictions on new features. In this way, the model is taught by certain features and targets.

* Feature matrix (X): made of observations or objects (rows) and features (columns).
* Target variable (y): a vector with the target information we want to predict. For each row of X there's a value in y.

The model can be represented as a function g that takes the X matrix as a parameter and tries to predict values as close as possible to the y targets. Obtaining the function g is what is called training.

$$g(X) \approx y$$

where X is the feature matrix and g is the model function which, applied to the feature matrix, returns predictions as close as possible to y (the target vector).
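As a rough illustration of X, y and g, the sketch below builds a small made-up feature matrix, "trains" a linear g with NumPy's least-squares solver, and applies it to a new observation. A linear model is only one possible choice for g, and the data is synthetic.

```python
import numpy as np

# Made-up data: 5 observations (rows) x 2 features (columns).
X = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Targets generated as y = 1 + 2*x1 + 3*x2 so that the fit can be checked.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Training: find the weights of a linear g by least squares.
X_with_bias = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)


def g(X_new):
    """The trained model: maps a feature matrix to predicted targets."""
    X_new = np.asarray(X_new, dtype=float)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ weights


print(X.shape, y.shape)               # (5, 2) (5,)
print(np.round(weights, 3))           # close to [1, 2, 3]
print(np.round(g([[6.0, 1.0]]), 3))   # close to [16]
```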

Types of SML problems:

* Regression: the output is a number (car's price).
* Classification: the output is a category (spam example).
    * Binary: there are two categories.
    * Multiclass problems: there are more than two categories.
* Ranking: the output is the top scores associated with corresponding items. It is applied in recommender systems.
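The difference between these problem types shows up in the shape of the model output. The sketch below uses invented scores (no model is actually trained) to show how each kind of output is typically read.

```python
import numpy as np

# Regression: the output is a number, e.g. a predicted car price (made-up value).
predicted_price = 15250.0

# Binary classification: a probability is turned into one of two categories.
spam_probability = np.array([0.1, 0.8, 0.55])
is_spam = spam_probability >= 0.5  # [False, True, True]

# Multiclass classification: one score per category, the highest one wins.
class_names = np.array(["cat", "dog", "bird"])
class_scores = np.array([
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
])
predicted_class = class_names[class_scores.argmax(axis=1)]  # ['dog', 'cat']

# Ranking: items are sorted by score and the top ones are returned.
item_names = np.array(["A", "B", "C", "D"])
item_scores = np.array([0.3, 0.9, 0.1, 0.6])
top_2 = item_names[np.argsort(item_scores)[::-1][:2]]  # ['B', 'D']

print(predicted_price, is_spam, predicted_class, top_2)
```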

In our SDN example, we will try to predict the traffic in a network based on a numeric parameter (PER), so it will be a regression model.

In summary, SML is about teaching the model by showing different examples, and the goal is to come up with a function that takes the feature matrix as a parameter and makes predictions as close as possible to the y targets.

## 1.4. CRISP-DM

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an open standard process model that describes common approaches used by data mining experts. It is the most widely used analytics model. Conceived in 1996, it became a European Union project under the ESPRIT funding initiative in 1997. The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company. The process consists of six phases:

1. Business understanding: an important question is whether we need ML for the project at all. The goal of the project has to be measurable.
2. Data understanding: analyze the available data sources and decide whether more data is required.
3. Data preparation: clean the data, remove noise by applying pipelines, and convert the data to a tabular format so it can be fed into an ML model.
4. Modeling: train different models and choose the best one. Based on the results of this step, decide whether new features need to be added or data issues fixed.
5. Evaluation: measure how well the model performs and whether it solves the business problem.
6. Deployment: roll out to production for all users. Evaluation and deployment often happen together, as online evaluation.

It is important to consider how maintainable the project is. In general, ML projects require many iterations.

Iteration:

1. Start simple.
2. Learn from the feedback.
3. Improve.

In our SDN problem: we will use a model to predict traffic based on a feature (PER, delay, jitter...). In this way, we will evaluate the model.


## 1.5. Model Selection Process

Which model to choose?

* Logistic regression
* Decision tree
* Neural Network
* Or many others

The validation dataset is not used in training. There are feature matrices and y vectors for both training and validation datasets. The model is fitted with training data, and it is used to predict the y values of the validation feature matrix. Then, the predicted y values (probabilities) are compared with the actual y values.

Multiple comparisons problem (MCP): when many models are compared on the same validation dataset, one of them can get lucky and obtain good predictions purely by chance, because all of them are probabilistic.

The test set can help to avoid the MCP. The best model is selected with the training and validation datasets, while the test dataset is used to confirm that the selected model really performs well.
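The effect can be simulated with a small sketch. It assumes 100 hypothetical "models" that only guess at random, a tiny validation set and a larger test set; all sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models, n_val, n_test = 100, 20, 1000
y_val = rng.integers(0, 2, size=n_val)    # true validation labels (0 or 1)
y_test = rng.integers(0, 2, size=n_test)  # true test labels (0 or 1)

# Each "model" guesses at random, so its true accuracy is 50%.
val_predictions = rng.integers(0, 2, size=(n_models, n_val))
test_predictions = rng.integers(0, 2, size=(n_models, n_test))

# Select the model that looks best on the validation set.
val_accuracy = (val_predictions == y_val).mean(axis=1)
best_model = int(val_accuracy.argmax())

# Check the selected model on data that played no role in the selection.
test_accuracy = (test_predictions[best_model] == y_test).mean()

print(f"best validation accuracy: {val_accuracy[best_model]:.2f}")
print(f"test accuracy of that model: {test_accuracy:.2f}")
```

The winner's validation accuracy is inflated because it was picked as the maximum of many noisy scores. Its test accuracy falls back towards 50%, which is what the test set is meant to reveal.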

1. Split the dataset into training, validation, and test sets, e.g. 60%, 20% and 20% respectively.
2. Train the models with the training data.
3. Evaluate the models by comparing their predictions on the validation dataset with the actual validation targets.
4. Select the best model.
5. Apply the best model to the test dataset.
6. Compare the performance metrics of validation and test.

NB: it is possible to reuse the validation data. After selecting the best model (step 4), the validation and training datasets can be combined to form a single training dataset for the chosen model before testing it on the test set.
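The whole selection process can be sketched with NumPy alone. The example below is a simplified illustration: the data is synthetic, the task is regression (so RMSE is compared rather than probabilities), and the two candidate "models" (a mean baseline and a least-squares line) are stand-ins for real candidates such as those listed above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (invented): y depends linearly on x plus noise.
n = 200
X = rng.uniform(0, 10, size=(n, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=n)

# Step 1: shuffle the row indices, then split 60% / 20% / 20%.
idx = rng.permutation(n)
n_train, n_val = n * 60 // 100, n * 20 // 100
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]


def fit_mean(X_fit, y_fit):
    """Baseline: always predict the mean of the training targets."""
    mean = y_fit.mean()
    return lambda X_new: np.full(len(X_new), mean)


def fit_linear(X_fit, y_fit):
    """Linear regression fitted by least squares."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    w, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ w


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


candidates = {"mean baseline": fit_mean, "linear regression": fit_linear}

# Steps 2-4: train on the training set, evaluate on validation, pick the best.
val_rmse = {}
for name, fit in candidates.items():
    model = fit(X[train_idx], y[train_idx])
    val_rmse[name] = rmse(y[val_idx], model(X[val_idx]))
best_name = min(val_rmse, key=val_rmse.get)

# NB: retrain the chosen model on training + validation data.
full_idx = np.concatenate([train_idx, val_idx])
final_model = candidates[best_name](X[full_idx], y[full_idx])

# Steps 5-6: apply it to the test set and compare with the validation metric.
test_rmse = rmse(y[test_idx], final_model(X[test_idx]))
print(best_name, round(val_rmse[best_name], 3), round(test_rmse, 3))
```

If the validation and test metrics are close, the selected model is unlikely to have won by luck.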

## 1.6. Setting up the environment

In [9]:
import pandas as pd
import numpy as np
import sklearn, seaborn

pd.__version__

# csv = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv')

'2.2.3'

## 1.7. Introduction to NumPy

NumPy, short for Numerical Python, is a powerful Python library that enables efficient and convenient array manipulation and mathematical operations. It forms the foundation for many scientific and data-related tasks. In this section, we'll provide a straightforward explanation of NumPy concepts and how to use them.

In [13]:
# Importing numpy
import numpy as np

## ARRAYS
# Creating arrays
zero_array = np.zeros(10) # Array of zeros with length 10
ones_array = np.ones(10) # Array of ones with length 10
constant_array = np.full(10,3) # Array of 3s with length 10

# Convert a list to an array
values = [1,2,3] # Named `values` to avoid shadowing the built-in `list`
list_to_array = np.array(values)

# Array of integers from 0 to 9 (not random; the stop value 10 is excluded)
range_array = np.arange(10)
# 11 evenly spaced values from 0 to 1, both ends included
linspace_array = np.linspace(0,1,11)

## MATRICES
# Multidimensional arrays
zero_matrix = np.zeros((10,2))
ones_matrix = np.ones((10,2))
constant_matrix = np.full((10,2),3)

# Indexing and slicing arrays
arr = np.array([[2, 3, 4], [4, 5, 6]])
first_row = arr[0]
first_column = arr[:,0]

## Random arrays
np.random.seed(2)
array = np.random.rand(5,2) # 5x2 matrix with values between 0 and 1

## Array operations
array1 = array + 1 # Adds 1 to every element of the array
array2 = array * 2 # Multiplies every element of the array by 2

# Operations between arrays of the same size
array3 = array1 + array2
array4 = array1 / array2

# Boolean arrays
arr = np.array([1, 2, 3, 4])
greater_than_2 = arr > 2  # Produces [False, False, True, True]
selected_elements = arr[arr > 1]  # Gets elements greater than 1

# Aggregate operations
min_value = arr.min()    # Minimum value
max_value = arr.max()    # Maximum value
sum_value = arr.sum()    # Sum of all elements
mean_value = arr.mean()  # Mean (average) value
std_deviation = arr.std()  # Standard deviation
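`np.arange`, `np.linspace` and random generation are easy to confuse. The short sketch below contrasts them; the use of `np.random.default_rng` and the seed value are choices made for this example only.

```python
import numpy as np

# arange: integers from 0 up to (but not including) the stop value.
range_array = np.arange(10)
print(range_array)  # [0 1 2 3 4 5 6 7 8 9]

# linspace: a fixed number of evenly spaced values, with both ends included.
linspace_array = np.linspace(0, 1, 11)
print(len(linspace_array), linspace_array[0], linspace_array[-1])  # 11 0.0 1.0

# Random values come from a random generator, not from arange.
rng = np.random.default_rng(2)
random_ints = rng.integers(0, 10, size=5)  # 5 random integers in [0, 10)
print(random_ints)
```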

## 1.8. Linear Algebra Refresher

* Vector operations
* Multiplication
    * Vector-vector multiplication
    * Matrix-vector multiplication
    * Matrix-matrix multiplication
* Identity matrix
* Inverse

In [28]:
## Vector operations: addition, subtraction, multiplication (the same as the element-wise array operations)

## Multiplication

# Vector-vector
def vector_vector_multiplication(u, v):
    assert u.shape[0] == v.shape[0]
    
    n = u.shape[0]
    
    result = 0.0

    for i in range(n):
        result = result + u[i] * v[i]
    
    return result

a = [1,2,3]
b = [1,2,3]
arr_a = np.array(a)
arr_b = np.array(b)
# vector_vector_multiplication(arr_a,arr_b)

# Matrix-vector

def matrix_vector_multiplication(U, v):
    assert U.shape[1] == v.shape[0]
    
    num_rows = U.shape[0]
    
    result = np.zeros(num_rows)
    
    for i in range(num_rows):
        result[i] = vector_vector_multiplication(U[i], v)
    
    return result

# Matrix-matrix

def matrix_matrix_multiplication(U, V):
    assert U.shape[1] == V.shape[0]
    
    num_rows = U.shape[0]
    num_cols = V.shape[1]
    
    result = np.zeros((num_rows, num_cols))
    
    for i in range(num_cols):
        vi = V[:, i]
        Uvi = matrix_vector_multiplication(U, vi)
        result[:, i] = Uvi
    
    return result

U = np.array([
    [1,1,1],
    [1,2,3],
    [3,2,1],
])

V = np.array([
    [1, 1, 2],
    [0, 0.5, 1], 
    [0, 2, 1],
])

# The .dot method performs matrix multiplication
dot = U.dot(V)

## Identity matrix
identidad = np.eye(3)

## Inverse matrix
V = np.array([
    [1, 1, 2],
    [0, 0.5, 1], 
    [0, 2, 1],
])
inv = np.linalg.inv(V)
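The properties of the product, the identity and the inverse can be checked numerically. The sketch below reuses the same `U` and `V` values; `matmul_loops` is a helper written for this example, not a NumPy function.

```python
import numpy as np

U = np.array([[1, 1, 1], [1, 2, 3], [3, 2, 1]])
V = np.array([[1, 1, 2], [0, 0.5, 1], [0, 2, 1]])


def matmul_loops(A, B):
    """Matrix-matrix product written with explicit loops."""
    assert A.shape[1] == B.shape[0]
    result = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            # Entry (i, j) is the dot product of row i of A and column j of B.
            result[i, j] = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))
    return result


# The loop version agrees with NumPy's built-in matrix product.
assert np.allclose(matmul_loops(U, V), U @ V)

# Multiplying by the identity leaves a matrix unchanged.
identity = np.eye(3)
assert np.allclose(U @ identity, U)

# A matrix times its inverse gives the identity (up to rounding error).
V_inv = np.linalg.inv(V)
assert np.allclose(V_inv @ V, identity)
assert np.allclose(V @ V_inv, identity)

print(U @ V)
```

`np.allclose` is used instead of `==` because the inverse is computed in floating point, so the product is only approximately the identity.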

## 1.9. Introduction to Pandas

Pandas is a Python library used mainly for data manipulation and analysis. Below are some examples of the library's most commonly used functions.

In [30]:
import pandas as pd
import numpy as np

# As a list of lists

data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

pd.DataFrame(data, columns=columns)

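|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |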


In [31]:
# As a list of Python dictionaries (similar to JSON)

data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

pd.DataFrame(data)

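|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |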


In [32]:
# To look at the first rows, use the head method
df = pd.DataFrame(data)

df.head(2)

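|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |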


In [36]:
# Columns can be accessed in two ways (if the column name contains a space, the first form does not work
# and the second one must be used):
df.Make
df['Make']

# To access several columns, pass a list of column names
df[['Make','Model']]

|   | Make | Model |
|---|------|-------|
| 0 | Nissan | Stanza |
| 1 | Hyundai | Sonata |
| 2 | Lotus | Elise |
| 3 | GMC | Acadia |
| 4 | Nissan | Frontier |


In [41]:
# New columns can be added like this
df['id'] = [1,2,3,4,5]
# ... and removed like this
del df['id']
df

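|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |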


In [42]:
# Attribute that gives information about the table's index (the leftmost column)
df.index

RangeIndex(start=0, stop=5, step=1)

In [46]:
# To get all the information for a given index label
df.loc[1]
# df.loc[[1,2]] to access more than one
# The value passed to loc[x] must match the index label exactly ... iloc is often the safer choice
# because it selects by position, independent of the type of the index labels
df.iloc[1]
# To reset the index to 0,1,2,3... (the old index is kept as a new 'index' column)
df.reset_index()

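|   | index | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|-------|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |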
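The difference between `loc` and `iloc` only becomes visible when the index labels are not 0, 1, 2, ... The sketch below uses a small DataFrame with hypothetical labels 10, 20, 30 to show it.

```python
import pandas as pd

df = pd.DataFrame(
    {"Make": ["Nissan", "Hyundai", "Lotus"], "Year": [1991, 2017, 2010]},
    index=[10, 20, 30],  # hypothetical non-default index labels
)

by_label = df.loc[20]     # the row whose index label is 20
by_position = df.iloc[1]  # the second row, whatever its label is
print(by_label["Make"], by_position["Make"])  # Hyundai Hyundai

# df.loc[1] would raise a KeyError here, because no row is labelled 1.

# reset_index() keeps the old labels in a new 'index' column;
# drop=True discards them instead.
kept = df.reset_index()
dropped = df.reset_index(drop=True)
print(kept.columns.tolist())     # ['index', 'Make', 'Year']
print(dropped.columns.tolist())  # ['Make', 'Year']
```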


### Operations with DataFrames

In [47]:
# We can multiply, divide, compare, etc. a whole column by a number
df['Year'] * 2
# Boolean comparisons
df['Year'] >= 2015

0    False
1     True
2    False
3     True
4     True
Name: Year, dtype: bool

In [48]:
## Filtering data
df[
    df['Make'] == 'Nissan'
]

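|   | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|------|-------|------|-----------|------------------|-------------------|---------------|------|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |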
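Boolean conditions can also be combined. The sketch below uses a reduced version of the same table and assumes we want Nissans from 2015 onwards; note that each condition needs its own parentheses and that `&` / `|` are used instead of `and` / `or`.

```python
import pandas as pd

df = pd.DataFrame({
    "Make": ["Nissan", "Hyundai", "Lotus", "GMC", "Nissan"],
    "Model": ["Stanza", "Sonata", "Elise", "Acadia", "Frontier"],
    "Year": [1991, 2017, 2010, 2017, 2017],
})

# Both conditions must hold (element-wise AND).
recent_nissans = df[(df["Make"] == "Nissan") & (df["Year"] >= 2015)]
print(recent_nissans["Model"].tolist())  # ['Frontier']

# Match any value from a list with isin.
nissan_or_lotus = df[df["Make"].isin(["Nissan", "Lotus"])]
print(nissan_or_lotus["Model"].tolist())  # ['Stanza', 'Elise', 'Frontier']
```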
