<a href="https://colab.research.google.com/github/martasaparicio/lematecX/blob/main/3.3-Model_Training_Cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cookbook

## Model training

Model training is the step in which a machine learning algorithm learns a model from data. It is an essential step in building any machine learning model.

In this cookbook, we'll look at a set of common model training techniques. Most of the illustrated techniques use the library [scikit-learn](http://scikit-learn.org/).

## Problem 1

Apply label encoding to a categorical variable.

### Solution

In [None]:
# Import libraries
import pandas as pd

# Create data
df = pd.DataFrame({'Animal': ['Elephant', 'Penguin', 'Cat', 'Elephant'],
                   'Weight': [6000, 23, 4.5, 5700]})
df

Unnamed: 0,Animal,Weight
0,Elephant,6000.0
1,Penguin,23.0
2,Cat,4.5
3,Elephant,5700.0


In [None]:
# Import libraries
from sklearn.preprocessing import LabelEncoder

# Apply label encoding
encoder = LabelEncoder()
encoder.fit(df['Animal'])                  # learn the categories (stored in sorted order)
encoded = encoder.transform(df['Animal'])  # replace each category with its integer code

df['Animal'] = encoded

df

Unnamed: 0,Animal,Weight
0,1,6000.0
1,2,23.0
2,0,4.5
3,1,5700.0
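
To see which integer stands for which category, the fitted encoder can be inspected. The following is a minimal sketch that rebuilds the same `Animal` column; the names `animals`, `mapping` and `decoded` are illustrative and not part of the recipe above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same illustrative data as in the recipe above
animals = pd.Series(['Elephant', 'Penguin', 'Cat', 'Elephant'], name='Animal')

encoder = LabelEncoder()
encoded = encoder.fit_transform(animals)  # fit and transform in a single call

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels (`y`); for input features, `OrdinalEncoder` plays a similar role.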


## Problem 2

Apply one-hot encoding to a categorical variable.

### Solution

In [None]:
# Create data
df = pd.DataFrame({'Month': ['January', 'February', 'March'],
                   'Profit': [1200, 1230, 1500]})
df

Unnamed: 0,Month,Profit
0,January,1200
1,February,1230
2,March,1500


In [None]:
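# Apply one-hot encoding
# drop_first=True drops the first category (alphabetically, 'February'), which the other columns already imply
# dtype=int keeps the output as 0/1 (recent pandas versions return True/False by default)
pd.get_dummies(df, columns=['Month'], drop_first=True, dtype=int)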

Unnamed: 0,Profit,Month_January,Month_March
0,1200,1,0
1,1230,0,0
2,1500,0,1
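
As an alternative, scikit-learn offers `OneHotEncoder`. The sketch below is one possible way to use it on the same data, keeping all three category columns (nothing is dropped here); the names `one_hot`, `columns` and `one_hot_df` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Month': ['January', 'February', 'March'],
                   'Profit': [1200, 1230, 1500]})

# OneHotEncoder expects a 2D input, hence the double brackets
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(df[['Month']]).toarray()  # sparse matrix -> dense array

# categories_ lists the (sorted) categories in the order of the output columns
columns = [f'Month_{c}' for c in encoder.categories_[0]]
one_hot_df = pd.DataFrame(one_hot, columns=columns, index=df.index)
print(one_hot_df)
```

Unlike `pd.get_dummies`, a fitted encoder can be reused on new data with `encoder.transform`, so training and testing datasets end up with the same columns.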


## Problem 3

Define independent and dependent variables.

### Solution

In [None]:
# Import libraries
import seaborn as sns

# Import data
df = sns.load_dataset('penguins')
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


In [None]:
# Define independent variables
X = df[['bill_length_mm', 'bill_depth_mm']]
X

Unnamed: 0,bill_length_mm,bill_depth_mm
0,39.1,18.7
1,39.5,17.4
2,40.3,18.0
3,,
4,36.7,19.3
...,...,...
339,,
340,46.8,14.3
341,50.4,15.7
342,45.2,14.8


* We assume here that only two independent variables are used ('bill_length_mm' and 'bill_depth_mm')
* The variable is named 'X' by convention: this is the usual name for the independent variables (features)

In [None]:
# Define the dependent variable
y = df['species']
y

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object

* We assume here that only one dependent variable is used ('species')
* The variable is named 'y' by convention: this is the usual name for the dependent variable (target)
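
The outputs above show that some rows have missing values. The following sketch, on a hypothetical miniature of the penguins data, shows one way to keep `X` and `y` aligned when such rows are removed. Dropping incomplete rows is an assumption made for illustration, not the only option; the names `features` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the penguins data, including one incomplete row
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'bill_length_mm': [39.1, np.nan, 46.8, 50.4],
    'bill_depth_mm': [18.7, np.nan, 14.3, 15.7],
})

features = ['bill_length_mm', 'bill_depth_mm']

# Drop incomplete rows first, so that X and y are built from the same rows
clean = df.dropna(subset=features)

X = clean[features]   # DataFrame (2D): one column per independent variable
y = clean['species']  # Series (1D): the dependent variable

print(X.shape, y.shape)
```

Selecting both `X` and `y` from the same filtered DataFrame guarantees that they share the same index.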

## Problem 4

Split a dataset into training and testing datasets.

### Solution

In [None]:
# Import data
df = sns.load_dataset('penguins')

# Define independent and dependent variables
X = df.drop('species', axis=1)
y = df['species']

In [None]:
# Split data into training and testing datasets
# (by default, 25% of the rows go to the testing dataset)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
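
The size of the testing dataset and the class balance of the split can be controlled explicitly. The sketch below uses hypothetical toy data (20 rows, two balanced classes) to illustrate the `test_size` and `stratify` parameters of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, two balanced classes
X = pd.DataFrame({'feature': range(20)})
y = pd.Series(['a'] * 10 + ['b'] * 10, name='label')

# test_size=0.2 keeps 20% of the rows for testing;
# stratify=y preserves the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```

Stratifying is mostly useful for classification problems with unbalanced classes, where a purely random split could leave a class under-represented in one of the parts.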

## Problem 5

Apply a machine learning algorithm to train the model.

### Solution

In [None]:
# Import data
df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [None]:
# Define independent and dependent variables
X = df[['total_bill', 'size']]
y = df['tip']

# Split data into training and testing datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
# Import libraries
from sklearn.ensemble import RandomForestRegressor

# Apply a machine learning algorithm to train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)
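
Once a model is trained, a natural next step is to predict on the testing dataset and measure the error. The sketch below illustrates this on synthetic data that stands in for the tips dataset (an assumption made so the example is self-contained); the choice of mean absolute error as the metric is also only illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips data (an assumption, not the real dataset):
# the target grows with the first feature, plus some noise
rng = np.random.default_rng(42)
X = rng.uniform(5, 50, size=(200, 2))
y = 0.15 * X[:, 0] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict on rows the model has not seen during training
y_pred = model.predict(X_test)

# Mean absolute error: average size of the prediction error, in target units
mae = mean_absolute_error(y_test, y_pred)
# For regressors, score() returns the coefficient of determination (R^2)
r2 = model.score(X_test, y_test)
print(f'MAE: {mae:.2f}, R^2: {r2:.2f}')
```

Evaluating on the testing dataset rather than on the training dataset gives a more honest estimate of how the model behaves on new data.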