### Importing the Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

In [2]:
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

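`iloc` is an integer-position indexer rather than a function: it takes the indices of the columns we want to extract from the dataset, and also the indices of the rows. Here `[:, :-1]` selects every row and all columns except the last, and `[:, -1]` selects every row of the last column.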
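
A minimal sketch of how `iloc` slices rows and columns by position. The `toy` DataFrame and its column names are hypothetical stand-ins, not the contents of `data.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for data.csv
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values   # all rows, every column except the last
target = toy.iloc[:, -1].values      # all rows, last column only
first_two_rows = toy.iloc[0:2, :]    # rows 0 and 1, all columns

print(features.shape)  # (3, 2)
print(list(target))
print(first_two_rows)
```

The first index always refers to rows and the second to columns, so the same syntax can narrow either axis.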

### Taking care of missing data

There are two common ways to handle missing data: delete every row that contains a missing value, or replace each missing value with the mean of the remaining values in the same column.

More on sklearn's SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [9]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
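
To compare the two options for missing data side by side, here is a small sketch on hypothetical data (the `toy` frame and its values are invented for illustration). It assumes the missing entries are stored as `np.nan`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per numeric column
toy = pd.DataFrame({
    "Age": [40.0, np.nan, 30.0, 50.0],
    "Salary": [60000.0, 50000.0, np.nan, 70000.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(toy)

print(len(dropped))         # 2 rows survive
print(imputer.statistics_)  # per-column means learned by fit
print(imputed)
```

Dropping rows loses half of this toy dataset, which is why imputation is usually preferred when data is scarce.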


### Encoding the Independent Variable


In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
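
In the output above, the first three columns are the one-hot columns produced from column 0, and the remaining columns are passed through unchanged. The sketch below shows how `OneHotEncoder` orders those columns, using a hypothetical category column whose labels are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category column; the labels are illustrative only
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense array
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(one_hot)                 # one column per category, in that order
```

Each row has exactly one 1.0, in the column of its category, so no artificial ordering is imposed on the categories.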


### Encoding the Dependent Variable

`LabelEncoder` encodes target labels as integers between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

More on sklearn's LabelEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]
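
To see which integer stands for which label, the fitted encoder can be inspected and reversed. This is a small sketch on hypothetical labels; `le_demo` and `labels` are names introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes"]  # hypothetical target values

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(le_demo.classes_)                    # sorted unique labels; position = code
print(encoded)                             # integer codes 0 .. n_classes-1
print(le_demo.inverse_transform(encoded))  # back to the original labels
```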


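### Splitting the dataset into the Training Set and Test Set

More on train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

We apply feature scaling after splitting the dataset into the training set and test set, so that the scaler's mean and standard deviation are computed from the training set only and no information from the test set leaks into training.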

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
# print(X, ' ' ,y, '  ' , X_train)
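
A quick sketch of what `test_size = 0.2` does to the shapes, on hypothetical arrays (`X_demo` and `y_demo` are invented for illustration). With 10 samples, 2 go to the test set and 8 to the training set, and each feature row stays paired with its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_demo = np.arange(10)                 # label i belongs to row i

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(X_te, y_te)              # rows and labels remain aligned
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.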

### Feature Scaling

In [29]:
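from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric columns (indices 3 and 4); the one-hot columns stay as 0/1
# Fit on the training set only, then reuse the same mean and std on the test set
X_train[:, 3:5] = sc.fit_transform(X_train[:, 3:5])
X_test[:, 3:5] = sc.transform(X_test[:, 3:5])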

# print(X_train)
# print(X_test)
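
A minimal sketch of why the scaler should be fitted on the training set only and then applied to the test set with `transform`. The `train` and `test` arrays are hypothetical single-feature data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # hypothetical training feature
test = np.array([[2.0], [4.0]])          # hypothetical test feature

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)          # [2.]
print(train_scaled.ravel())
print(test_scaled.ravel())   # 2.0 maps to 0.0 because it equals the training mean
```

Calling `fit_transform` on the test set instead would rescale it with its own mean and standard deviation, so the same raw value would mean different things in the two sets.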