<h2>Importing the Libraries</h2>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<h2>Importing the Dataset</h2>

In [3]:
dataset = pd.read_csv('Data.csv')

In [4]:
dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In a dataset, X (features) and y (target variable) are decided based on what you are trying to predict and the input variables that will help in making that prediction. Here's how they are determined:

1. X (Features/Inputs):
These are the independent variables that are used to predict the outcome.
They provide the necessary information for the model to make predictions.
Features could be numerical or categorical, like age, income, number of bedrooms, etc.
Example: If you are predicting house prices, your features could be:

X = [Square footage, number of bedrooms, location, year built]

2. y (Target/Output):
This is the dependent variable (or the label) that you want to predict.
It is the result that depends on the values of the features.
Example: For the house price prediction task, the target variable could be:

y = [Price of the house]

In [5]:
X = dataset.iloc[:,:-1]
y = dataset.iloc[:,-1]

X = dataset.iloc[:, :-1]:

dataset.iloc[:, :-1] selects all rows (:) and all columns except the last one (:-1).
This means that X will contain all the features (independent variables) of the dataset except the last column.

y = dataset.iloc[:, -1]:

dataset.iloc[:, -1] selects all rows (:) and only the last column (-1).
This means that y will be the last column, which is usually the target variable (dependent variable) you want to predict.
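
The two slices can be tried on a tiny frame to see what each one returns. This is only a sketch: the frame `toy` and the names `features` and `target` are made up for illustration and are not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical three-column frame, used only to illustrate the slicing.
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Purchased": ["No", "Yes"],
})

features = toy.iloc[:, :-1]   # all rows, every column except the last
target = toy.iloc[:, -1]      # all rows, last column only

print(list(features.columns))  # ['Country', 'Age']
print(target.name)             # Purchased
print(type(features).__name__, type(target).__name__)  # DataFrame Series
```

Note that selecting a single column with an integer (`-1`) yields a Series, while selecting with a slice (`:-1`) yields a DataFrame.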

In [6]:
print(X)

   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
5   France  35.0  58000.0
6    Spain   NaN  52000.0
7   France  48.0  79000.0
8  Germany  50.0  83000.0
9   France  37.0  67000.0


In [7]:
print(y)

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object


<h2>Taking Care of Missing Data</h2>

In [8]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X.iloc[:, 1:3])
X.iloc[:,1:3] = imputer.transform(X.iloc[:, 1:3])

SimpleImputer: This class from sklearn handles missing data by replacing missing values. Here it replaces each np.nan value with the mean of the column in which the missing value occurs.

1. missing_values=np.nan: This tells the imputer to treat NaN values as missing.

2. strategy='mean': This specifies that missing values are replaced by the mean of the column.

3. fit(): This method computes the mean of columns 1 and 2 (Age and Salary). Because indexing is zero-based and the end of a slice is excluded, 1:3 refers to the 2nd and 3rd columns. The means are computed from the non-missing values only.

4. transform(): After the imputer has been fitted, transform() replaces the missing values (np.nan) in those columns with the computed means.

5. X.iloc[:, 1:3] =: This assigns the transformed values back to columns 1 and 2 of X, overwriting the originals.
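
To see what fit() learns and what transform() changes, here is a minimal sketch on made-up numbers (the array `data` is an assumption for illustration, not the notebook's dataset). After fitting, the learned column means are available in the imputer's `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (made up for illustration): one gap per column.
data = np.array([
    [20.0, 100.0],
    [np.nan, 200.0],
    [40.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(data)                 # learns one mean per column, ignoring NaN
print(imputer.statistics_)        # [ 30. 150.]

filled = imputer.transform(data)  # NaN entries replaced by those means
print(filled)
```

The first column's mean is (20 + 40) / 2 = 30 and the second's is (100 + 200) / 2 = 150, so those are the values written into the gaps.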

In [9]:
print(X)

   Country        Age        Salary
0   France  44.000000  72000.000000
1    Spain  27.000000  48000.000000
2  Germany  30.000000  54000.000000
3    Spain  38.000000  61000.000000
4  Germany  40.000000  63777.777778
5   France  35.000000  58000.000000
6    Spain  38.777778  52000.000000
7   France  48.000000  79000.000000
8  Germany  50.000000  83000.000000
9   France  37.000000  67000.000000


<h2>Encoding Categorical Data</h2>

Encoding categorical data is the process of converting non-numerical categories (like ‘red’, ‘blue’, or ‘small’, ‘medium’, ‘large’) into a numerical format that machine learning models can understand. Models usually require input data to be numeric, so categorical data (like names of colors, countries, etc.) needs to be transformed into numbers.

<h3>Encoding the Independent Variable</h3>

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers= [('encoder',OneHotEncoder(), [0])],remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]


1.	Importing Necessary Libraries:

	•	ColumnTransformer: This is a utility in scikit-learn that allows you to apply different preprocessing steps to different columns of your dataset.
	•	OneHotEncoder: This transformer converts a categorical feature with k categories into k binary (0/1) features, one per category, so that no artificial ordering is implied among the categories.

2.	Creating the ColumnTransformer:

	•	transformers: This argument specifies how to transform each column. It takes a list of tuples where:
		•	The first element is a name for the transformation (in this case, 'encoder').
		•	The second element is the transformer instance (OneHotEncoder()).
		•	The third element is a list of column indices to which the transformation will be applied. In this case, it’s [0], meaning the first column (index 0) of X, which is Country.
	•	remainder='passthrough': This argument specifies what to do with the columns not listed in transformers. Setting it to 'passthrough' includes those columns in the output without any change.

3.	Transforming the Data:

	•	fit_transform(X): This method fits the ColumnTransformer to the data X and then transforms it. The OneHotEncoder identifies the unique categories in the first column and creates one new binary column for each category.
	•	The output places the new binary columns first, in place of the original Country column, followed by the remaining columns (Age and Salary), which pass through unchanged because of the remainder='passthrough' setting.
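
To check which binary column corresponds to which category, the fitted encoder can be inspected. The following is a small sketch on a made-up frame (`toy` is hypothetical); `sparse_threshold=0` is added here only so the result is always a dense array, and is not part of the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column frame for illustration.
toy = pd.DataFrame({
    "Country": ["Spain", "France", "Germany", "France"],
    "Age": [27.0, 44.0, 30.0, 35.0],
})

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
encoded = np.array(ct.fit_transform(toy))

# Categories are stored in sorted order; column i of the one-hot block
# corresponds to categories[i].
categories = ct.named_transformers_["encoder"].categories_[0]
print(list(categories))  # ['France', 'Germany', 'Spain']
print(encoded[0])        # Spain row: [ 0.  0.  1. 27.]
```

Because the categories are sorted alphabetically, France maps to the first binary column, Germany to the second, and Spain to the third.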


<h3>Encoding the Dependent Variable</h3>

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

print(y)

[0 1 0 0 1 1 0 1 0 1]


1.	Importing LabelEncoder:

	•	The LabelEncoder is a utility that converts categorical labels into numerical values. This is useful for machine learning models that work with numerical inputs.

2.	Creating an Instance of LabelEncoder:

	•	An instance of LabelEncoder is created using le = LabelEncoder(). This instance will be used to perform the encoding.

3.	Fitting and Transforming the Labels:

	•	The fit_transform method is called on the label encoder instance with y as the input. This method does two things:
	•	Fit: It identifies all unique categories in y and assigns each a unique integer.
	•	Transform: It converts each label in y into its corresponding integer value.
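
The mapping that fit_transform learns can be inspected and reversed. A minimal sketch with made-up labels (the list `labels` is an assumption for illustration):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes", "Yes"]  # hypothetical labels

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # ['No' 'Yes'], sorted; index = integer code
print(encoded)                        # [0 1 0 1 1]
print(le.inverse_transform(encoded))  # back to the original strings
```

The classes are sorted before codes are assigned, so 'No' becomes 0 and 'Yes' becomes 1. Keeping the fitted `le` around lets you translate model predictions back into the original labels with inverse_transform.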

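-> Feature scaling should be done after splitting the dataset, to prevent data leakage: the scaler must learn its mean and standard deviation from the training set only, so that no information from the test set influences the preprocessing.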

<h2>Splitting the Dataset into the Training Set and Test Set</h2>

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 1)

The `train_test_split` function from `sklearn.model_selection` is used to split your dataset into two subsets: one for training your model (`X_train`, `y_train`) and one for testing it (`X_test`, `y_test`). Here's what each parameter means:

- `X, y`: These are your features (`X`) and target labels (`y`).
- `test_size = 0.2`: This specifies that 20% of the data will be used for testing, while the remaining 80% will be used for training.
- `random_state = 1`: This ensures reproducibility by setting a seed for random number generation. If you run the code again, the split will be the same as long as `random_state` is set to the same value.

As a result:
- `X_train`, `y_train`: Data used for training the model.
- `X_test`, `y_test`: Data used for testing the model's performance.
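
The effect of `test_size` and `random_state` can be verified on a small made-up array. In this sketch, `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Same seed, same data: both calls produce identical splits.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```

With 10 samples and `test_size=0.2`, 2 rows go to the test set and 8 to the training set. The rows are shuffled before splitting, which is why the training rows do not appear in their original order.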

In [13]:
print(X_train)

[[0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]]


In [14]:
print(X_test)

[[0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
 [1.0e+00 0.0e+00 0.0e+00 3.7e+01 6.7e+04]]


In [15]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [16]:
print(y_test)

[0 1]


<h2>Feature Scaling</h2>

In [21]:
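from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Columns 0-2 are the one-hot Country columns; only Age and Salary (3 onward) are scaled.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # learn mean and std from the training set
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training statistics on the test set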
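
StandardScaler rescales each column to (x - mean) / std. The sketch below, on made-up numbers (`train` and `test` are hypothetical), shows why the test set is passed to transform() rather than fit_transform(): the test values should be expressed relative to the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data for illustration.
train = np.array([[20.0], [30.0], [40.0]])
test = np.array([[30.0], [50.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = sc.transform(test)        # reuses the training mean and std

print(sc.mean_)             # [30.]
print(test_scaled.ravel())  # 30 maps to 0; 50 lies well above the training mean

# Contrast (not recommended): refitting on the test rows discards their
# relation to the training data. Two distinct values always map to -1 and +1.
refit = StandardScaler().fit_transform(test)
print(refit.ravel())        # [-1.  1.]
```

When a scaler is fitted on only two rows, every column collapses to -1 and +1 regardless of the actual values, which is one way to spot a scaler that was accidentally refitted on a small test set.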

In [22]:
print(X_train)

[[ 0.          0.          1.         -0.19159184 -1.07812594]
 [ 0.          1.          0.         -0.01411729 -0.07013168]
 [ 1.          0.          0.          0.56670851  0.63356243]
 [ 0.          0.          1.         -0.30453019 -0.30786617]
 [ 0.          0.          1.         -1.90180114 -1.42046362]
 [ 1.          0.          0.          1.14753431  1.23265336]
 [ 0.          1.          0.          1.43794721  1.57499104]
 [ 1.          0.          0.         -0.74014954 -0.56461943]]


In [23]:
print(X_test)

[[ 0.  1.  0. -1. -1.]
 [ 1.  0.  0.  1.  1.]]
