## Data Preprocessing Tools

##### Importing libraries

In [41]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

##### Importing the dataset

In [42]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [43]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [44]:
print(X[:, 1:3])

[[44.0 72000.0]
 [27.0 48000.0]
 [30.0 54000.0]
 [38.0 61000.0]
 [40.0 nan]
 [35.0 58000.0]
 [nan 52000.0]
 [48.0 79000.0]
 [50.0 83000.0]
 [37.0 67000.0]]


In [27]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


##### Taking care of missing data

In [29]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

1. `SimpleImputer` is a class in scikit-learn's `sklearn.impute` module for imputing missing values, that is, replacing missing data with substituted values. In the case above:

    * `missing_values`: the placeholder that marks an entry as missing. Every occurrence of this value (here `np.nan`) is imputed according to the chosen strategy.

    * `strategy`: how the missing values are filled. Here, `mean` replaces each missing value with the mean of the non-missing values in the same feature (column).

2. The `fit()` method computes the statistics that the chosen strategy needs. With the `mean` strategy, it computes the mean of each column it is given. Only columns 1 and 2 (`X[:, 1:3]`) are passed in, because a mean cannot be computed for the text values in column 0.

3. The `transform()` method uses the statistics learned by `fit()` to replace the missing values and returns the imputed data. Assigning the result back to `X[:, 1:3]` overwrites the original columns.
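
The split between `fit()` and `transform()` can be seen on a small example. The sketch below uses made-up numbers (the `data` array is an assumption, not part of `Data.csv`) and reads the learned means from the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (hypothetical values) with one missing entry per column.
data = np.array([[20.0, 100.0],
                 [np.nan, 200.0],
                 [40.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() only learns the per-column means; it does not modify the data.
imputer.fit(data)
column_means = imputer.statistics_  # mean of the non-missing values per column

# transform() returns a new array with each NaN replaced by its column mean.
filled = imputer.transform(data)

print(column_means)
print(filled)
```

Because `transform()` returns a copy by default, `data` still contains its `nan` entries afterwards; the notebook keeps the imputed values by assigning the result back into `X`.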



In [30]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


##### Encoding categorical data

Encoding the independent variable

In [45]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

1. `ColumnTransformer` is a class provided by scikit-learn that allows you to apply different transformers to different columns of your dataset. It's particularly useful when you have a dataset with mixed data types (e.g., numerical and categorical features) and you want to apply different preprocessing steps to each type of feature.

    * `transformers`: This parameter specifies the list of transformers to be applied to the columns of the dataset. It should be a list of tuples, where each tuple contains:

        a. A name for the transformer (required; any unique string). For this case, `encoder`

        b. The transformer object itself. For this case, `OneHotEncoder()`
        
        c. The column indices (or column names) to which the transformer should be applied. For this case, it will be the first column of the dataset (`[0]`)

    * `remainder`: This parameter specifies what to do with the columns that are not specified in the transformers list. It can take one of the following values:

        a. `drop` (the default): Drop the remaining columns that are not transformed.

        b. `passthrough`: Leave the remaining columns unchanged and include them in the output.

        c. A transformer object: Apply the specified transformer to the remaining columns.

2. `fit_transform(X)` fits the `ColumnTransformer` object `ct` to `X` and transforms it in a single call: it learns the parameters (here, the set of country categories) and then applies each transformer to its assigned columns. In the output, the encoded columns come first, followed by the passthrough columns. Wrapping the result in `np.array()` ensures that `X` is a NumPy array.
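
To see which output column corresponds to which country, the fitted encoder can be inspected. The following is a minimal sketch on three hypothetical rows laid out like `X` (country, age, salary); the variable names `rows`, `encoded`, and `categories` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows in the same layout as X: country, age, salary.
rows = np.array([['France', 44.0, 72000.0],
                 ['Spain', 27.0, 48000.0],
                 ['Germany', 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')
encoded = np.array(ct.fit_transform(rows))

# The fitted encoder is available under the name given in the tuple.
# Its categories are sorted, which fixes the order of the new columns.
categories = ct.named_transformers_['encoder'].categories_[0]

print(categories)
print(encoded)
```

The first three columns of `encoded` follow the order of `categories`, and the age and salary columns are appended unchanged.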

    

In [46]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


***One-hot Encoding:*** is a technique used in machine learning and data preprocessing to represent categorical variables as binary vectors, e.g., [0,0,1,0] or [1,0,0,0]. Each value of a categorical variable is represented by a binary vector whose length equals the number of unique categories in that variable. Exactly one element of the vector is 1, at the position of the category the sample belongs to, and all other elements are 0. Here the country column has three categories, so it becomes three columns: France is [1,0,0], Germany is [0,1,0], and Spain is [0,0,1].
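
The idea can be reproduced by hand with NumPy alone. This is only an illustrative sketch of the technique, not how `OneHotEncoder` is implemented; the `countries` array is a made-up sample.

```python
import numpy as np

countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# np.unique returns the sorted categories and, for each sample,
# the index of its category in that sorted list.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the binary vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row of `one_hot` contains a single 1, and both occurrences of Spain map to the same vector.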

Encoding the dependent variable

In [33]:
from sklearn.preprocessing import LabelEncoder

lc = LabelEncoder()
y = lc.fit_transform(y)

1. `LabelEncoder` is a utility class in scikit-learn that encodes categorical labels as integers. It assigns a unique integer, from 0 to the number of classes minus 1, to each unique label in the input.

2. The `fit_transform()` method of `LabelEncoder` is called with the target variable `y` as input. It learns the unique labels present in `y` and replaces each label with its integer code. The labels are sorted before codes are assigned, so here `'No'` becomes 0 and `'Yes'` becomes 1.
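
The mapping learned by the encoder is stored in its `classes_` attribute and can be reversed with `inverse_transform()`. A short sketch on a made-up `labels` array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['No', 'Yes', 'No', 'Yes'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the sorted unique labels; each integer code is an index into it.
print(le.classes_)
print(encoded)

# inverse_transform() maps the integer codes back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted encoder around is useful later, for example to turn a model's integer predictions back into readable labels.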

***Difference between `LabelEncoder()` and `ColumnTransformer()`:*** 

Both are components of scikit-learn used for data preprocessing, but they serve different purposes and operate on different types of data.

1. `LabelEncoder()`:
    * Used to encode categorical labels into numerical labels.
    * It operates on a single one-dimensional array of labels.
    * Assigns a unique integer to each unique category in the input data, thereby converting categorical labels into numerical representations.
    * It is intended for encoding the target variable `y` in classification tasks, not the input features `X`.

2. `ColumnTransformer()`:
    * Used for applying different transformations to different columns (features) of a dataset.
    * It allows you to specify different preprocessing steps for different columns of the dataset, including transformations like scaling, encoding, imputation, and more.
    * ColumnTransformer() is more versatile and can handle various preprocessing tasks, including encoding categorical variables using techniques like `OneHotEncoder()` or `OrdinalEncoder()`.
    * It is often used as part of a preprocessing pipeline to apply different transformations to different subsets of the dataset.

In summary, the main difference between `LabelEncoder()` and `ColumnTransformer()` is their scope and purpose: `LabelEncoder()` is designed specifically for encoding the labels of the target variable, while `ColumnTransformer()` is a general-purpose tool for applying different transformations to different columns of the feature matrix.
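
As an illustration of that versatility, a single `ColumnTransformer` can impute the numeric columns and one-hot encode the categorical column in one step. This is an alternative sketch, not the approach used in this notebook; the DataFrame `df` and the transformer names `onehot` and `impute` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one categorical and two numeric columns.
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, np.nan, 30.0],
    'Salary': [72000.0, 48000.0, np.nan],
})

preprocess = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical column.
        ('onehot', OneHotEncoder(), ['Country']),
        # Fill missing numeric values with the column mean.
        ('impute', SimpleImputer(strategy='mean'), ['Age', 'Salary']),
    ],
    sparse_threshold=0)  # always return a dense array

features = preprocess.fit_transform(df)
print(features)
```

The output columns follow the order of the `transformers` list: the three country indicators first, then the imputed age and salary.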

In [34]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


##### Splitting the dataset into the Training set and Test set

In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

1. `train_test_split()` is a function in scikit-learn's `sklearn.model_selection` module. It splits arrays into random training and test subsets for fitting and evaluating machine learning models.

    * The first arguments are the arrays to split, in this case `X` and `y`. They must all have the same number of samples.
    * `test_size`: the size of the test split, given either as a float between 0 and 1 (the proportion of samples) or as an integer (the absolute number of samples). Here, `0.2` places 2 of the 10 samples in the test set.
    * `random_state`: the seed for the random shuffling applied before the split. Setting a specific value makes the split reproducible.
    * The function returns a list containing a train/test pair for each input array, in input order. For `X` and `y`, that is the training data, testing data, training labels, and testing labels.
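
The effect of `test_size` and `random_state` can be checked on synthetic data. In the sketch below, `features` and `labels` are made-up arrays constructed so that each label identifies its row, which makes it easy to confirm that rows and labels stay aligned after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
labels = np.arange(10)                   # one label per sample

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=1)

# 8 training rows and 2 test rows.
print(X_tr.shape, X_te.shape)

# Calling the function again with the same random_state gives the same split.
_, X_te_again, _, _ = train_test_split(
    features, labels, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te_again))
```

Omitting `random_state` would generally give a different split on every run, which makes results harder to compare between runs.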

In [36]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [37]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [38]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [39]:
print(y_test)

[0 1]


##### Feature Scaling