### **Importing the libraries:**

In [69]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### **Importing the dataset:**

In [70]:
dataset = pd.read_csv('Data.csv')
dataset.head()

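|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |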


In [71]:
X = dataset.iloc[:, :-1].values # .values to convert to numpy array
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [72]:
y = dataset.iloc[:, -1].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

### **Taking care of missing data:**

In [73]:
# Check for missing data
missing_data = dataset.isnull().sum()
missing_data

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [74]:
from sklearn.impute import SimpleImputer
help(SimpleImputer)

Help on class SimpleImputer in module sklearn.impute._base:

class SimpleImputer(_BaseImputer)
 |  SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, copy=True, add_indicator=False, keep_empty_features=False)
 |
 |  Univariate imputer for completing missing values with simple strategies.
 |
 |  Replace missing values using a descriptive statistic (e.g. mean, median, or
 |  most frequent) along each column, or using a constant value.
 |
 |  Read more in the :ref:`User Guide <impute>`.
 |
 |  .. versionadded:: 0.20
 |     `SimpleImputer` replaces the previous `sklearn.preprocessing.Imputer`
 |     estimator which is now removed.
 |
 |  Parameters
 |  ----------
 |  missing_values : int, float, str, np.nan, None or pandas.NA, default=np.nan
 |      The placeholder for the missing values. All occurrences of
 |      `missing_values` will be imputed. For pandas' dataframes with
 |      nullable integer dtypes with missing values, `missing_values`
 |      can be set to eit

In [75]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # `imputer` is a `SimpleImputer` object that replaces missing values with the column mean

`imputer` is just an object; we haven't connected it to our matrix of features yet. The next step is to call the `fit()` method, which computes the mean of each numeric column of the matrix of features (columns 1 and 2, `Age` and `Salary`) and stores it in the imputer:

In [76]:
imputer.fit(X[:, 1:3]) # fit the imputer to the data

In [77]:
X[:, 1:3] = imputer.transform(X[:, 1:3]) # replace the missing values with the mean

In [78]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
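To make the two steps concrete, the following is a minimal sketch that repeats the imputation on the `Age` and `Salary` values shown above. The names `age_salary`, `demo_imputer`, and `filled` are illustrative and not part of the original notebook. After `fit()`, the learned column means are available in the `statistics_` attribute, and `transform()` writes them into the missing cells.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the dataset above (np.nan marks a missing value).
age_salary = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() only learns the per-column means; the data is not modified yet.
demo_imputer.fit(age_salary)
print(demo_imputer.statistics_)

# transform() returns a copy in which every NaN is replaced by its column mean.
filled = demo_imputer.transform(age_salary)
print(filled[4], filled[6])
```

The two printed rows correspond to the rows that had a missing `Salary` and a missing `Age`, respectively.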

### **Encoding categorical data**

This dataset contains a categorical column, `Country`, whose values are France, Spain, or Germany. Most machine learning models operate on numbers and cannot use raw strings as input, so we need to convert these strings into numerical values.

One approach would be to encode France as 0, Spain as 1, and Germany as 2. However, this encoding can mislead the model: it may interpret the numbers as an order or magnitude (for example, Germany > Spain > France), even though no such relationship exists between the countries.

A better approach is one-hot encoding, which creates a separate binary column for each category (here, three columns for the three countries). scikit-learn's `OneHotEncoder` orders the columns by sorted category, so France is represented by the vector `[1 0 0]`, Germany by `[0 1 0]`, and Spain by `[0 0 1]`. Each country gets a unique vector of zeros and ones, and no ordering is implied.

Additionally, the `Purchased` column contains the non-numerical labels `Yes` and `No`. We also need to convert these labels into zeros and ones to make them compatible with the machine learning model.
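As a small illustration of one-hot encoding, the sketch below encodes the three country names on their own. The names `countries`, `encoder`, and `one_hot` are illustrative. `OneHotEncoder` sorts the categories it finds, so the column order follows the sorted category names rather than the order of first appearance.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column holding the three categories, in an arbitrary order.
countries = np.array([['France'], ['Spain'], ['Germany']])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(countries).toarray()

# The learned categories define the meaning of each output column.
print(encoder.categories_[0])

for country, row in zip(countries[:, 0], one_hot):
    print(country, row)
```

Checking `categories_` is a quick way to confirm which column stands for which country before interpreting the encoded matrix.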

+ Encoding the Independent Variable:

In [79]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [80]:
help(ColumnTransformer)

Help on class ColumnTransformer in module sklearn.compose._column_transformer:

class ColumnTransformer(sklearn.base.TransformerMixin, sklearn.utils.metaestimators._BaseComposition)
 |  ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)
 |
 |  Applies transformers to columns of an array or pandas DataFrame.
 |
 |  This estimator allows different columns or column subsets of the input
 |  to be transformed separately and the features generated by each transformer
 |  will be concatenated to form a single feature space.
 |  This is useful for heterogeneous or columnar data, to combine several
 |  feature extraction mechanisms or transformations into a single transformer.
 |
 |  Read more in the :ref:`User Guide <column_transformer>`.
 |
 |  .. versionadded:: 0.20
 |
 |  Parameters
 |  ----------
 |  transformers : list of tuples
 |      List of (name, transformer, columns) tuples

In [81]:
help(OneHotEncoder)

Help on class OneHotEncoder in module sklearn.preprocessing._encoders:

class OneHotEncoder(_BaseEncoder)
 |  OneHotEncoder(*, categories='auto', drop=None, sparse_output=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None, feature_name_combiner='concat')
 |
 |  Encode categorical features as a one-hot numeric array.
 |
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
 |  encoding scheme. This creates a binary column for each category and
 |  returns a sparse matrix or dense array (depending on the ``sparse_output``
 |  parameter).
 |
 |  By default, the encoder derives the categories based on the unique values
 |  in each feature. Alternatively, you can also specify the `categories`
 |  manually.
 |
 |  This encoding is needed for feeding categorical data to many s

In [82]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

In this line of code, we create a `ColumnTransformer` object and pass it a list of transformations; here there is only one. It uses `OneHotEncoder` from the `sklearn.preprocessing` module, which converts a categorical column into binary columns (one-hot encoding), and applies it to the first column of the data (index 0). The string `'encoder'` is just a name that identifies this transformation.

When you create a `ColumnTransformer`, you provide a list of transformations. Each transformation is a tuple that contains three elements:

1. A name (string) for the transformation. This is `'encoder'` in our case.
2. The transformer object. This is `OneHotEncoder()` in our case.
3. The columns this transformation should be applied to. This is `[0]` in our case, meaning the transformation is applied to the first column of the input data.

The `remainder='passthrough'` argument specifies how to handle columns that no transformer is applied to. With `'passthrough'`, those columns (here `Age` and `Salary`) are kept unchanged and appended after the encoded columns; with the default, `'drop'`, they would be discarded.
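The effect of `remainder` can be checked on a few rows. The sketch below is illustrative only: `sample` and the helper `encode_country` are hypothetical names, and `sparse_threshold=0` is set here just to force a dense result so the shapes are easy to inspect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X: Country, Age, Salary.
sample = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
], dtype=object)

def encode_country(remainder):
    """One-hot encode column 0 and handle the other columns per `remainder`."""
    transformer = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder=remainder,
        sparse_threshold=0,  # always return a dense array
    )
    return transformer.fit_transform(sample)

kept = encode_country('passthrough')  # 3 one-hot columns + Age + Salary
dropped = encode_country('drop')      # only the 3 one-hot columns

print(kept.shape)
print(dropped.shape)
```

With `'passthrough'` the result has five columns; with `'drop'` only the three encoded columns remain, and `Age` and `Salary` are lost.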

In [83]:
fitTransformX = ct.fit_transform(X) # fit and transform the data in X
fitTransformX

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [84]:
X = fitTransformX
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

+ Encoding the Dependent Variable:

In [85]:
from sklearn.preprocessing import LabelEncoder

In [86]:
help(LabelEncoder)

Help on class LabelEncoder in module sklearn.preprocessing._label:

class LabelEncoder(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  Encode target labels with value between 0 and n_classes-1.
 |
 |  This transformer should be used to encode target values, *i.e.* `y`, and
 |  not the input `X`.
 |
 |  Read more in the :ref:`User Guide <preprocessing_targets>`.
 |
 |  .. versionadded:: 0.12
 |
 |  Attributes
 |  ----------
 |  classes_ : ndarray of shape (n_classes,)
 |      Holds the label for each class.
 |
 |  See Also
 |  --------
 |  OrdinalEncoder : Encode categorical features using an ordinal encoding
 |      scheme.
 |  OneHotEncoder : Encode categorical features as a one-hot numeric array.
 |
 |  Examples
 |  --------
 |  `LabelEncoder` can be used to normalize labels.
 |
 |  >>> from sklearn.preprocessing import LabelEncoder
 |  >>> le = LabelEncoder()
 |  >>> le.fit([1, 2, 2, 6])
 |  LabelEncoder()
 |  >>> le.classes_
 |  array([1, 2, 6])
 |  >>> le.transform(

In [87]:
fitTransformY = LabelEncoder().fit_transform(y)
fitTransformY

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [88]:
y = fitTransformY
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
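To see which label maps to which integer, it helps to keep the fitted encoder in a variable. The following sketch uses the label values shown above; the names `labels`, `le`, `encoded`, and `decoded` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
                  dtype=object)

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the labels in sorted order; a label's position is its code.
print(le.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```

Keeping `le` around is useful later, for example to turn a model's integer predictions back into `Yes` and `No`.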

### **Splitting the dataset into the Training set and Test set.**

In [89]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

We have seen these lines of code before; this time we explain the parameters in detail:

+ `test_size` determines the size of the test set relative to the full dataset. If `test_size` is a float between 0.0 and 1.0, it is the proportion of samples to put in the test set. If it is an integer, it is the absolute number of test samples. If it is not specified (`None`), it is set to the complement of `train_size`; if `train_size` is also `None`, it defaults to 0.25.

+ `random_state` controls the shuffling applied to the data before the split. Passing an integer makes the split reproducible: every run produces the same training and test sets. If reproducibility does not matter, you can leave `random_state` as `None`.

In this example, `test_size` is set to 0.2, so 20% of the original data (2 of the 10 rows) goes to the test set. `random_state` is set to 1, so the split is the same on every run.
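The behaviour of both parameters can be verified on a toy array. The sketch below uses made-up data (`features` and `targets` are placeholders, not the dataset above) with 10 samples, the same number of rows as in this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Float test_size: a proportion of the samples (20% of 10 rows).
X_tr, X_te, y_tr, y_te = train_test_split(
    features, targets, test_size=0.2, random_state=1)
print(len(X_tr), len(X_te))

# Integer test_size: an absolute number of test samples.
_, X_te_int, _, _ = train_test_split(
    features, targets, test_size=3, random_state=1)
print(len(X_te_int))

# Neither test_size nor train_size given: the test proportion defaults to 0.25.
_, X_te_default, _, _ = train_test_split(features, targets, random_state=1)
print(len(X_te_default))

# The same integer random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    features, targets, test_size=0.2, random_state=1)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```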

In [90]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [91]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [92]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [93]:
y_test

array([0, 1])

### **Feature Scaling.**

Feature scaling puts all features on a comparable scale so that none dominates simply because of its units. It matters for many machine learning models, particularly those based on distances or trained by gradient descent. As we move from simple linear regression to multiple and polynomial regression, the number of features and the spread of their values increase. Without feature scaling, features with larger magnitudes (here `Salary`, in the tens of thousands, versus `Age`, in the tens) can disproportionately influence such models, leading to inaccurate predictions.

+ Simple Linear Regression:

$$y = b_0 + b_1x_1.$$

+ Multiple Linear Regression:

$$y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n.$$

+ Polynomial Linear Regression:

$$y = b_0 + b_1x_1 + b_2x_1^2 + \ldots + b_nx_1^n.$$

In [94]:
from sklearn.preprocessing import StandardScaler

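Again, `StandardScaler` applies standardization: $X' = \displaystyle\frac{X - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of each feature. Most standardized values fall within $[-3, 3]$, although this is not a strict bound.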
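The formula can be reproduced by hand with NumPy and compared with `StandardScaler`. This is a minimal sketch on a handful of made-up ages; the names `ages`, `mu`, `sigma`, `manual`, and `scaled` are illustrative. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single feature column with illustrative values.
ages = np.array([[27.0], [35.0], [38.0], [40.0], [44.0], [48.0], [50.0]])

# Standardization by hand: subtract the mean, divide by the standard deviation.
mu = ages.mean(axis=0)
sigma = ages.std(axis=0)  # population standard deviation (ddof=0)
manual = (ages - mu) / sigma

# The same computation through scikit-learn.
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After standardization the column has mean 0 and standard deviation 1, up to floating-point error.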

In [95]:
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

Note that we assign `StandardScaler()` to a variable (in this case `sc`) rather than calling it inline, because the same fitted object must be reused to transform the test set.

`X_test` uses `sc.transform` rather than `sc.fit_transform`: we fit the scaler on the training data (`X_train`) only, which calculates the mean and standard deviation of each feature. We then use those values to transform the test data (`X_test`) without re-fitting. This applies the same scaling to both sets and keeps information from the test set out of the preprocessing. Only columns 3 onward (`Age` and `Salary`) are scaled; the one-hot columns are left as they are.
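The fit-on-train, transform-on-test pattern can be checked in isolation. The sketch below copies the `Age` and `Salary` values of the training and test rows shown earlier into two arrays (`train_num` and `test_num`, illustrative names) and scales them with a separate `scaler` object.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the 8 training rows (two cells hold the imputed means).
train_num = np.array([
    [349 / 9, 52000.0],
    [40.0, 574000 / 9],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
# Age and Salary of the 2 test rows.
test_num = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()
# fit_transform learns mean_ and scale_ from the training rows, then scales them.
train_scaled = scaler.fit_transform(train_num)
# transform reuses the training statistics; nothing is learned from the test rows.
test_scaled = scaler.transform(test_num)

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Because the statistics come from the training rows only, the scaled test columns are generally not centered at zero, which is expected.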

In [96]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

In [97]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)