
# Model Training

## Introduction

In supervised learning, machine learning techniques are commonly used to create **prediction models**. These models are the end product of the machine learning application process.

Prediction models are built based on data that, ideally, is related to the problem we want to solve. The use of this data in the construction of models implies a model **training process**, where machine learning algorithms search for patterns that make it possible to make predictions in future situations with similar characteristics.

Let us consider the following **analogy** where the model training phase is the equivalent of learning a subject based on examples. Sometimes, the best way to learn a particular subject (e.g., solving quadratic equations) is to see several solved exercises in order to detect the pattern used in the solution. It is the same for machine learning: algorithms use data to determine existing patterns and build prediction models. 

In this tutorial, we will present the main elements of model training:

*   ***Scikit-learn*** **library**. The *scikit-learn* library has various machine learning functions and algorithms. 
*   **Pre-processing.** Sometimes, before training models, it is necessary to carry out specific data preparation procedures.
*   **Application of algorithms.** Model training is carried out using machine learning algorithms.

This tutorial has several **examples** that illustrate how the code is applied and the effects of applying it. Throughout the tutorial, you will also find challenges that let you check whether you have understood the material. At the end, all of the content presented is summarised.

Note that this is an **introductory-level** tutorial and for this reason, several important aspects are not covered. For more information, we recommend that you consult the [official *scikit-learn* library documentation](https://scikit-learn.org/stable/index.html).

## Scikit-learn Library

***Scikit-learn*** is a Python library that facilitates the creation of machine learning programs. As such, *scikit-learn* has a set of methods and functions that make it easy to apply machine learning algorithms and perform additional related tasks. 

Since the *scikit-learn* library has several modules, we will use a **different** import process from the one we usually use for the *seaborn* library, for example.

Therefore, for the purposes of training models, the typical import from the *scikit-learn* library is as follows:

In [None]:
from sklearn.model_selection import train_test_split

In the previous instruction:

*   The word `from` tells the computer that we want to import a specific module from a library (and not an entire library).
*   In the expression `sklearn.model_selection`,
  *   `sklearn` refers to the library where the modules we want to import are located (in this case, the *scikit-learn* library, which is named `sklearn` for import purposes).
  *   `model_selection` refers to the module that we want to import from the *scikit-learn* library.
*   The word `import` tells the computer that we want to import something from this module.
*   The word `train_test_split` identifies what we want to import.

From the moment the import is carried out, we **are able to use** the imported functions.
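As a minimal sketch of the difference between the two import styles, the example below imports *pandas* as a whole library under an alias and imports a single function from *scikit-learn*. The small `df` DataFrame is made up for illustration.

```python
# Style 1: import a whole library under a short alias (as done with seaborn or pandas).
import pandas as pd

# Style 2: import one specific function from one module of scikit-learn.
from sklearn.model_selection import train_test_split

# With style 1, every call goes through the alias.
df = pd.DataFrame({'x': [1, 2, 3]})

# With style 2, the imported name is used directly, without any prefix.
print(callable(train_test_split))   # True
```
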

This tutorial uses **different modules and functions** from the *scikit-learn* library, and they will be presented as they are used. Although there are many modules and functions, only a few of them are commonly used. With this in mind, the rest of the tutorial will focus on the modules and functions that are used in the following operations:

*   Pre-processing.
*   Model training with machine learning.



## Pre-processing

In order for the algorithms to be applied, it is necessary to guarantee some data formatting assumptions. As such, pre-processing corresponds to **specific data preparation operations** that permit us to apply machine learning algorithms.

There are **various pre-processing operations**. Some of the most frequent are:

*   Variable encoding.
*   Independent and dependent variables.
*   Training and testing datasets.

### Variable encoding

Until now, our focus has been on numerical variables. However, it is common to find datasets that have categorical variables. In these cases, it is necessary to **transform the categorical variables into numerical variables** because the algorithms we want to work with require that numerical variables be used.

**Categorical** variables have a finite number of distinct categories (or groups). This means that the value of the variables is described by a finite set of values. In general, categorical variables can be characterised as:

*   Ordinal
*   Nominal

**Ordinal** categorical variables present a specific order. For example, blouse size is represented by a finite set of categories that have an order: XL > L > M > S. 

On the other hand, **nominal** categorical variables do not imply that there is any particular order. For example, the colour of blouses is represented by distinct categories (e.g., yellow, red and green), but these categories are not orderable (it does not make sense to say yellow > red).

Two common techniques used to convert categorical variables into numerical variables are: 

*   Label encoding.
*   One-hot encoding.

#### Label encoding

Let us create an example to illustrate the type of problem we want to address: 

In [None]:
import pandas as pd

df = pd.DataFrame({'Colour':['yellow', 'red', 'green'], 
                   'Size':['S', 'M', 'L'], 
                   'Price':[12, 15, 22]})
df

Unnamed: 0,Colour,Size,Price
0,yellow,S,12
1,red,M,15
2,green,L,22


In order for the algorithm to be able to read the data correctly, we have to transform categorical variables into numerical variables. Typically, this transformation is done by converting categorical variables into integers.

In the example shown, our categorical variables are 'Colour' and 'Size'. Starting with the variable 'Colour', the first thing that stands out is that this variable is nominal and not ordinal. This means that it does not matter what integer we assign to each of the categories and therefore, we can simply start counting the categories from 0. For that, we will use the `LabelEncoder` class from the *scikit-learn* library: 

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder = encoder.fit(df['Colour'])
encoded = encoder.transform(df['Colour'])

df['Colour'] = encoded

df

Unnamed: 0,Colour,Size,Price
0,2,S,12
1,1,M,15
2,0,L,22


In the previous instruction:

*   `from sklearn.preprocessing import LabelEncoder` performs the typical import from the *scikit-learn* library. 
  *   `sklearn.preprocessing` refers to the module where you can find the `LabelEncoder` class.
  *   `LabelEncoder` refers to the class where the label encoding technique is programmed. 
*   `encoder = LabelEncoder()` creates an object of the `LabelEncoder` class and stores it in the `encoder` variable.
*   `encoder = encoder.fit(df['Colour'])` is the step where we pass the data we want to encode to the `encoder` object, so that it can identify the existing categories. In this case, the data in question corresponds to the variable 'Colour'. 
*   `encoded = encoder.transform(df['Colour'])` encodes the data (via `encoder.transform(df['Colour'])`) and stores the result in a new variable, `encoded`.  
*   `df['Colour'] = encoded` replaces the 'Colour' column with the encoded values. Although our data is initially in a *Series* (`df['Colour']`), `LabelEncoder` returns a *NumPy* array; *pandas* accepts this array directly when assigning to a column. 
*   `df` displays the `df` variable with the 'Colour' variable already encoded.

As we can see, the encoding left us with the following **correspondence** (`LabelEncoder` numbers the categories in alphabetical order):

*   yellow = 2
*   red = 1
*   green = 0
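The correspondence does not have to be read off the table. As a minimal sketch, the fitted encoder's `classes_` attribute lists the categories in the order used for numbering, and `inverse_transform` converts the integers back into labels. The variable names `colours`, `codes`, `mapping` and `restored` are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colours = pd.Series(['yellow', 'red', 'green'])

encoder = LabelEncoder()
codes = encoder.fit_transform(colours)   # fit and transform in a single step

# classes_ lists the categories in sorted order; the position of each
# category in this array is the integer assigned to it.
mapping = {str(category): code for code, category in enumerate(encoder.classes_)}
print(mapping)   # {'green': 0, 'red': 1, 'yellow': 2}

# inverse_transform converts the integers back into the original labels.
restored = [str(label) for label in encoder.inverse_transform(codes)]
print(restored)  # ['yellow', 'red', 'green']
```
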

The 'Size' variable's case is different because this is an ordinal variable. In this case, to ensure that the encoding respects the existing order (`L > M > S`), we have to specify exactly which value we want for each category:





In [None]:
map_size = {'L': 2,
            'M': 1,
            'S': 0}

df['Size'] = df['Size'].map(map_size)
df

Unnamed: 0,Colour,Size,Price
0,2,0,12
1,1,1,15
2,0,2,22


As you can see, the encoding maintained the correspondence we wanted:

*   L = 2
*   M = 1
*   S = 0

This correspondence guarantees that the order `L > M > S` is preserved, since `2 > 1 > 0`. 
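As an alternative sketch (not the approach used in this tutorial), *scikit-learn* also provides the `OrdinalEncoder` class, which accepts an explicit category order through its `categories` parameter. The *scikit-learn* documentation describes `LabelEncoder` as intended for the dependent variable, whereas `OrdinalEncoder` is designed for independent variables.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['S', 'M', 'L']})

# The categories are listed from smallest to largest, so S -> 0, M -> 1, L -> 2.
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])

# OrdinalEncoder expects a two-dimensional input (hence the double brackets)
# and returns a two-dimensional array of floats.
encoded = encoder.fit_transform(df[['Size']])

df['Size'] = encoded[:, 0].astype(int)
print(df['Size'].tolist())   # [0, 1, 2]
```

The dictionary with `map` is shorter for a single column; an encoder object may be more convenient when the same encoding has to be applied again to new data.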

**Challenge:** Encode the categorical variables from the following dataset. 

In [None]:
df = pd.DataFrame({'Car':['Honda', 'Toyota', 'Toyota', 'Peugeot', 'Ford'],
                   'Year':['2000', '2010', '2009', '2008', '2013'], 
                   'Price':[17000, 23000, 22000, 20000, 24000]})
df

Unnamed: 0,Car,Year,Price
0,Honda,2000,17000
1,Toyota,2010,23000
2,Toyota,2009,22000
3,Peugeot,2008,20000
4,Ford,2013,24000


In [None]:
# Solution for the challenge

#### One-hot encoding

In the previous section, **after encoding the categorical variables**, we ended up with the following correspondences:

*   Variable 'Colour'
  *   yellow = 2
  *   red = 1
  *   green = 0
*   Variable 'Size'
  *   L = 2
  *   M = 1
  *   S = 0

As we indicated, there is an important difference between these two categorical variables: the variable 'Colour' is nominal and the variable 'Size' is ordinal. With this being the case, our encoding system has a **problem**. Although the variable 'Colour' is not ordinal, the algorithm will assume that `yellow > red > green`, just as it assumes that `L > M > S` (according to the integers that represent each of these categories). This means that, under certain circumstances, the algorithms may learn from an order that does not exist and therefore not work correctly.

One way around this problem is to use **one-hot encoding**. One-hot encoding involves creating a variable (column) for each existing category, attributing the value 1 or 0 to the observation in that variable, depending on whether or not the observation belongs to the category in question. For example, if a given observation refers to a yellow blouse, one-hot encoding will make it so that this observation will have the properties `yellow = 1`, `red = 0` and `green = 0`.

One-hot encoding is applied through the *scikit-learn* library using the `OneHotEncoder` class, just as described in the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

However, because it is simpler, let us illustrate how to do one-hot encoding via an alternative route, using the `get_dummies` function from the *pandas* library:




In [None]:
df = pd.DataFrame({'Colour':['yellow', 'red', 'green'], 
                   'Price':[12, 15, 22]})
df

Unnamed: 0,Colour,Price
0,yellow,12
1,red,15
2,green,22


In [None]:
pd.get_dummies(df)

Unnamed: 0,Price,Colour_green,Colour_red,Colour_yellow
0,12,0,0,1
1,15,0,1,0
2,22,1,0,0


As you can see, a column was created for each of the categories in the variable 'Colour' (the only categorical variable in the example we have presented). Therefore, each observation now has the value 1 in the column associated with its colour and the value 0 in the columns associated with the other colours.

If we look closely at the dataset, we see that it has **redundant information**: one of the colour columns could be dropped because if there are three colours, two columns are enough to define an observation. For example, if we did not have the variable 'Colour_yellow', all we would need to know is that the observation 0 has a value of 0 in the 'Colour_green' and 'Colour_red' variables in order to deduce that the observation has the colour yellow.

In certain situations, this redundancy can impair the functioning of the algorithms. Therefore, we must avoid this problem. To do this, all we have to do is assign the value `True` to the `drop_first` parameter of the `get_dummies` function: 

In [None]:
pd.get_dummies(df, drop_first=True)

Unnamed: 0,Price,Colour_red,Colour_yellow
0,12,0,1
1,15,1,0
2,22,0,0
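For comparison, the same result can be sketched with the `OneHotEncoder` class mentioned earlier. This is only an illustration under the assumption that the default sparse output is converted with `toarray()`; the variable name `encoded` is chosen here for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Colour': ['yellow', 'red', 'green'],
                   'Price': [12, 15, 22]})

# drop='first' removes the column of the first category (in sorted order),
# which plays the same role as drop_first=True in get_dummies.
encoder = OneHotEncoder(drop='first')

# The result is a sparse matrix by default; toarray() makes it a regular array.
encoded = encoder.fit_transform(df[['Colour']]).toarray()

print(encoder.categories_[0])   # categories found, in sorted order
print(encoded)
# rows: yellow -> [0, 1], red -> [1, 0], green -> [0, 0]
```

Note that recent versions of *pandas* display the `get_dummies` columns as `True`/`False` instead of 1/0; passing `dtype=int` to `get_dummies` gives the integer form shown in this tutorial.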


**Challenge:** Apply one-hot encoding to the following dataset. 

In [None]:
df = pd.DataFrame({'Car':['Honda', 'Toyota', 'Toyota', 'Peugeot'],
                   'Price':[17000, 23000, 22000, 20000]})
df

Unnamed: 0,Car,Price
0,Honda,17000
1,Toyota,23000
2,Toyota,22000
3,Peugeot,20000


In [None]:
# Solution for the challenge

### Independent and dependent variables

All prediction models have **two types of variables**:

1.   Independent variables
2.   Dependent variables

**Independent variables** are the variables we use to predict the dependent variable. They are called independent because, within the model, their values are taken as given rather than predicted from other variables. For example, if we want to predict the species of a plant based on the size of its petals, the petal size is an independent variable. In English machine learning terminology, it is customary to see independent variables referred to as `features`, `inputs` or `independent variables`.

**Dependent variables** are the variables we intend to predict. The value of these variables depends on the value of the independent variables, and this explains why they are called dependent variables. For example, if we want to predict the tip amount based on the cost of a meal, the tip amount is the dependent variable and the cost of the meal is the independent variable. In English machine learning terminology, it is customary to see dependent variables referred to as `target`, `outputs` or `dependent variables`.

Before **running a learning algorithm**, it is essential to define the independent variables and the dependent variable because the algorithm needs this information to work. These variables are usually defined by:

*   Creating an `X` variable that stores, for each observation, the values of the independent variables.
*   Creating a `y` variable that stores, for each observation, the value of the dependent variable.

To make it easier to understand, let us look at an example. In this example, we will use the `tips` dataset (which we have already come across) and define the following:

*   The dependent variable is the variable that stores the tip amounts given ('tip' variable).
*   The independent variables are all the other variables ('total_bill', 'sex', 'smoker', 'day', 'time' and 'size' variables).

So, we would do the following:

In [None]:
import seaborn as sns

df = sns.load_dataset('tips')

X = df.drop('tip', axis=1)
y = df['tip']

In the previous instruction:

*   `import seaborn as sns` imports the *seaborn* library, which contains the dataset we want to use. 
*   `df = sns.load_dataset('tips')` stores the `tips` dataset in the `df` variable. 
*   `X = df.drop('tip', axis=1)` defines the independent variables. Since we want all of the variables to be independent variables, except for the dependent variable, the easiest way to define the `X` variable is to take the entire dataset and drop the column of the dependent variable (`axis=1` indicates that we are dropping a column and not a row).
*   `y = df['tip']` defines the dependent variable which, in this case, is the 'tip' variable.

Defining the independent and dependent variables is essential in solving supervised learning problems. As such, except in very specific situations, this pre-processing operation is always used.
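A quick way to check that the separation worked is to look at the shapes of `X` and `y`. The sketch below uses a few made-up rows with some of the column names of the `tips` dataset, so that it runs without downloading anything.

```python
import pandas as pd

# A few made-up rows standing in for the `tips` dataset.
df = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68],
                   'size': [2, 3, 3, 2],
                   'tip': [1.01, 1.66, 3.50, 3.31]})

X = df.drop('tip', axis=1)   # every column except the dependent variable
y = df['tip']                # only the dependent variable

print(X.shape)               # (4, 2): four observations, two independent variables
print(y.shape)               # (4,): one value per observation
print('tip' in X.columns)    # False
print('tip' in df.columns)   # True: drop returns a new DataFrame, df is unchanged
```
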

**Challenge**: Using the dataset below, set the flower species as the dependent variable ('species' variable) and the remaining variables as independent.

In [None]:
df = sns.load_dataset('iris')
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
# Solution for the challenge

### Training and Test Datasets

When using machine learning to build a prediction model, the aim is for the computer to be able to learn to make predictions based on past observations. This learning process, based on past observations, is commonly called **'model training'**.

It is during the training process that we seek to extract the patterns that allow us to make **generalisations** and apply prediction models to new situations (different from those described in past observations). This point is particularly important because prediction models are only useful if their application in new contexts is possible and successful.

That said, there are two aspects that need to be guaranteed when training models: 

*   **Training.** Ensure that we use past observations to train the prediction model. 
*   **Test.** Ensure that we test the application of the model in new situations.

To guarantee these aspects, the dataset is typically split and we end up working with:

*   **Training dataset.** A subdivision of the original dataset that will be used to train the model. 
*   **Test dataset.** A subdivision of the original dataset that will be used to test the model (evaluate its performance in new situations).

The *scikit-learn* library allows us to do this division using the `train_test_split()` function: 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In the previous instruction:

*   `from sklearn.model_selection import train_test_split` performs the typical import from the *scikit-learn* library.
  *   `sklearn.model_selection` refers to the module where the `train_test_split()` function is located.
  *   `train_test_split` refers to the function of the `sklearn.model_selection` module we want to use.
*   `X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)` creates four variables (`X_train`, `X_test`, `y_train` and `y_test`) from the original dataset (`X` and `y`).
  *   `X_train` and `X_test` contain the independent variables' training and test data, respectively.
  *   `y_train` and `y_test` contain the dependent variable's training and test data, respectively. 
  *   `train_test_split()` is the function that splits the original dataset into test and training datasets. 
  *   `X` and `y` represent the original dataset, previously divided into independent and dependent variables.
  *   `random_state=42` ensures that the division of the dataset, although random, is always done in the same way (different numbers generate different random divisions). The number 42 was used, but any other number could have been used.
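The following sketch, on made-up data, shows how many observations end up in each dataset. By default, `train_test_split` reserves 25% of the observations for the test dataset; the `test_size` parameter changes this proportion. The `X` and `y` used here are invented purely to count rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Twenty made-up observations, just to count how many end up on each side.
X = pd.DataFrame({'feature': range(20)})
y = pd.Series(range(20), name='target')

# Default behaviour: 25% of the observations are reserved for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))   # 15 5

# test_size changes the proportion; here 20% goes to the test dataset.
X_train_b, X_test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train_b), len(X_test_b))   # 16 4

# The same random_state reproduces exactly the same split.
X_train_c, _, _, _ = train_test_split(X, y, random_state=42)
print(X_train.index.equals(X_train_c.index))   # True
```
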

Once we have the original dataset divided into training and test datasets, we can then move on to applying machine learning algorithms. 

**Challenge:** Divide the dataset [found here](https://raw.githubusercontent.com/pmarcelino/datasets/master/titanic.csv) into training and test datasets. Assume that the variable 'Survived' is the dependent variable and that all other variables are independent variables.




In [None]:
# Solution for the challenge

## Model training with machine learning

After performing the pre-processing operations, we are able to **apply the machine learning algorithms** to build our prediction model.

There are **several machine learning algorithms** that can be used. Most of these algorithms are available in the *scikit-learn* library, as can be seen in the [official documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning). Discussion of the different types of algorithms is beyond the scope of this course.

At this point, what is important to remember regarding the application of machine learning algorithms to train models is that: 

*   Regardless of the algorithm used, the **way it is applied is identical** in most cases. 
*   In general, there is a **version** of the algorithm for **classification** problems and another version for **regression** problems. 

### Algorithm application

The application of algorithms is almost always done in the same way, regardless of which algorithm we are using. Let us therefore analyse an example to identify this structure, starting with importing and pre-processing the `iris` dataset:

In [None]:
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')

X = df.drop('species', axis=1)
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In the previous instruction:

*   `df = sns.load_dataset('iris')` stores the `iris` dataset in the `df` variable. This dataset characterises flower species according to their physical characteristics.
*   `X = df.drop('species', axis=1)` stores the independent variables in the `X` variable. In this case, all the variables are considered to be independent, with the exception of the `species` variable.  
*   `y = df['species']` stores the dependent variable in the `y` variable. In this case, the `species` variable is considered to be the dependent variable, which identifies the flower species.
*   `X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)` separates the dataset into training and test datasets. The `X_train`, `X_test`, `y_train` and `y_test` variables store this information. 

Note that since we are already familiar with this dataset, we know that it is not necessary to perform any more pre-processing operations. However, in any other case, it would be necessary to analyse the dataset to determine the need for any more pre-processing operations.

Applying the algorithm now:


In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In the previous instruction:

*   `from sklearn.ensemble import RandomForestClassifier` imports the algorithm we are going to use in this example (*Random Forest* algorithm).
  *   `sklearn.ensemble` is the module where the `RandomForestClassifier` class is located.
  *   `RandomForestClassifier` is the class where the *Random Forest* algorithm is programmed.
*   `model = RandomForestClassifier(random_state=42)` creates the model using the `RandomForestClassifier` class.
  *   `random_state=42` ensures that the algorithm is always applied in the same way and produces the same results (the *Random Forest* algorithm has a random part, so we have to define the `random_state` parameter if we want the results to be reproducible). The number 42 was used, but any other number could have been used. 
*   `model.fit(X_train, y_train)` trains the model using the data from the training set (defined by `X_train` and `y_train`). 

Once the algorithm is imported, applying it to train a model takes two lines of code: `model = RandomForestClassifier(random_state=42)` and `model.fit(X_train, y_train)`. This procedure is valid for most algorithms; the only thing that changes is the algorithm itself. This means that if we want to use another algorithm, we replace `RandomForestClassifier` (which refers to the *Random Forest* algorithm) with the class that corresponds to the algorithm we want to use. 
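To illustrate the substitution, here is a minimal sketch that swaps in a different algorithm, a decision tree, chosen only as an example. It assumes the copy of the `iris` data bundled with *scikit-learn* (`load_iris`) rather than the *seaborn* version, so that it runs without a download.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Only these two lines name the algorithm; the rest of the structure is unchanged.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print(type(model).__name__)   # DecisionTreeClassifier
```
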

To conclude, it is important to highlight one particular aspect when using a dataset such as the `iris` dataset. In this case, when we look at the dependent variable, we see that it is categorical (it can have the value `setosa`, `versicolor`, or `virginica`). As such, the question arises: is it necessary to encode the variable 'species'?

The answer is 'no' because *scikit-learn* allows you to use categorical variables as dependent variables (i.e., there is no need to transform the dependent variable into a numerical one). However, if we were to encode the 'species' variable in pre-processing, it would not be a problem either. 
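A small sketch on made-up data shows this behaviour: the model is trained directly on text labels and its predictions come back as the same labels. The data below is invented for illustration and is not the `iris` dataset.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up data: one numerical feature and a dependent variable stored as text.
X = [[value] for value in list(range(1, 11)) + list(range(21, 31))]
y = ['short'] * 10 + ['long'] * 10

model = RandomForestClassifier(random_state=42)
model.fit(X, y)   # the text labels are accepted without any encoding

# The model remembers the labels it saw and predicts using the same labels.
labels = [str(label) for label in model.classes_]
prediction = str(model.predict([[5]])[0])
print(labels)       # ['long', 'short']
print(prediction)   # short
```
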

**Challenge**: Use the following dataset to train a model using the *Random Forest* algorithm. The dependent variable is the 'kind' variable and all the others are independent variables. During your solution, you should consider the need for a training dataset and a test dataset.

In [None]:
df = sns.load_dataset('geyser')
df

Unnamed: 0,duration,waiting,kind
0,3.600,79,long
1,1.800,54,short
2,3.333,74,long
3,2.283,62,short
4,4.533,85,long
...,...,...,...
267,4.117,81,long
268,2.150,46,short
269,4.417,90,long
270,1.817,46,short


In [None]:
# Solution for the challenge

### Classification and regression problems

Algorithms can be applied to **classification problems** and **regression problems**.

**Classification problems** are problems where the dependent variable is a categorical variable. For example, if we want to predict which species a penguin belongs to, based on its physical characteristics, we are faced with a classification problem.

In turn, **regression problems** are problems where the dependent variable is a numerical variable. For example, if we want to predict the size of a penguin's flippers, based on specific information about the penguin, we are faced with a regression problem. 

In the *scikit-learn* library, the **application of algorithms for classification and regression problems is identical**. The only difference is in the name of the class that will be used. For example, the version of the *Random Forest* algorithm for classification problems is programmed in the `RandomForestClassifier` class of the *scikit-learn* library. In turn, the version of the *Random Forest* algorithm for regression problems is programmed in the `RandomForestRegressor` class of the *scikit-learn* library.

Two examples are presented (one for classification and one for regression) that show how the entire process is identical, with the exception of the classes used:

In [None]:
# Classification problem
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

df = sns.load_dataset('iris')

X = df[['sepal_length', 'sepal_width','petal_length','petal_width']]
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
# Regression problem
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = sns.load_dataset('tips')

X = df[['total_bill', 'size']]
y = df['tip']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

As we can see, the entire structure of the solution is identical, except that in the classification problem we use `RandomForestClassifier` and in the regression problem we use `RandomForestRegressor`. 
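One way to make this point explicit is to write the shared structure once and pass the class in as an argument. The `train` helper below is hypothetical (it is not part of *scikit-learn*), and the sketch assumes the `load_iris` and `load_diabetes` datasets bundled with *scikit-learn* instead of the *seaborn* datasets, so that it runs offline.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split


def train(model_class, X, y):
    """Hypothetical helper: split the data, train a model of the given class."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = model_class(random_state=42)
    model.fit(X_train, y_train)
    return model, X_test


# Classification: the dependent variable is a category (a flower species code).
X_cls, y_cls = load_iris(return_X_y=True)
classifier, X_test_cls = train(RandomForestClassifier, X_cls, y_cls)

# Regression: the dependent variable is a number.
X_reg, y_reg = load_diabetes(return_X_y=True)
regressor, X_test_reg = train(RandomForestRegressor, X_reg, y_reg)

class_predictions = classifier.predict(X_test_cls[:3])   # category codes
value_predictions = regressor.predict(X_test_reg[:3])    # continuous values
print(class_predictions.dtype, value_predictions.dtype)
```

The classifier returns one of the known categories for each observation, while the regressor returns a number on a continuous scale.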

Furthermore, it is also worth noting that we chose to define the independent variables one by one (instead of using the `drop` function). This is an alternative way of defining the `X` variable and it is particularly useful when we want to be specific in defining the independent variables we want to include in our prediction model.
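The two ways of defining `X` can be compared directly. In this sketch, on made-up rows with column names borrowed from the `tips` example, selecting the columns to keep and dropping the column to remove give the same result.

```python
import pandas as pd

# Made-up rows with some of the column names of the `tips` example.
df = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01],
                   'tip': [1.01, 1.66, 3.50],
                   'size': [2, 3, 3]})

X_selected = df[['total_bill', 'size']]   # name the columns to keep
X_dropped = df.drop('tip', axis=1)        # name the column to remove

print(X_selected.equals(X_dropped))   # True
```

The two forms only coincide when the dataset has no other columns; in the full `tips` dataset, `drop` would also keep 'sex', 'smoker', 'day' and 'time'.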

**Challenge**: Use the following dataset to train a model with the *Random Forest* algorithm. The dependent variable is the `price` variable and let us assume that the independent variables are only `depth`, `table`, `x`, `y` and `z`. During your solution, you should consider the need for a training dataset and a test dataset. Finally, be aware that this problem is a regression problem. 

In [None]:
df = sns.load_dataset('diamonds')
df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [None]:
# Solution for the challenge

## Summary

In this tutorial, we saw:

*   ***Scikit-learn* library**. The *scikit-learn* library has a set of features related to the application of machine learning algorithms. 
*   **Pre-processing**. Before using learning algorithms, it is necessary to ensure that the data is prepared for their application. Therefore, it is common to perform pre-processing operations such as encoding categorical variables, defining independent and dependent variables, and separating the dataset into a training dataset and a test dataset. 
*   **Model training with machine learning**. Model training follows a standard structure. The aspects that differ are related to the definition of the algorithm we want to use (in this course, we always use the *Random Forest* algorithm) and whether we intend to use the version for classification problems (in the case of *Random Forest* this would mean using the `RandomForestClassifier` class) or for regression problems (in the case of *Random Forest* this would mean using the `RandomForestRegressor` class).

This tutorial presented various instructions and you are not expected to know them by heart. Above all, the purpose of this tutorial is to illustrate the **potential of the *scikit-learn* library** and to **serve as a document for future reference**. Later, with practice, you will begin to retain the instructions that you use most often and the logic behind each instruction will become more intuitive. 