<h2>Importing the Libraries</h2>

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<h2>Importing the Dataset</h2>

In [15]:
dataset = pd.read_csv('Data.csv')

In [16]:
dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


The choice of X (features) and y (target variable) follows from what you want to predict and which input variables can help predict it:

1. X (Features/Inputs):
These are the independent variables used to predict the outcome.
They carry the information the model needs to make predictions.
Features can be numerical or categorical, such as age, income, or number of bedrooms.
Example: If you are predicting house prices, your features could be:

X = [Square footage, number of bedrooms, location, year built]

2. y (Target/Output):
This is the dependent variable (the label) that you want to predict.
Its value depends on the values of the features.
Example: For the house price prediction task, the target variable would be:

y = [Price of the house]
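
As a minimal sketch of the house-price example, the split into features and target might look like the following. The DataFrame, its column names, and every value in it are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical house data; every value here is made up for illustration.
houses = pd.DataFrame({
    "square_footage": [1400, 2100, 900],
    "bedrooms": [3, 4, 2],
    "location": ["suburb", "city", "rural"],
    "year_built": [1995, 2010, 1978],
    "price": [240000, 410000, 130000],
})

# Features: every column that may help predict the price.
X_houses = houses.drop(columns="price")

# Target: the single column we want to predict.
y_houses = houses["price"]

print(X_houses.columns.tolist())
print(y_houses.name)
```

Selecting the target by name, as done here, does not depend on where the target column sits in the table.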

In [17]:
X = dataset.iloc[:, :-1]  # features: all rows, every column except the last
y = dataset.iloc[:, -1]   # target: all rows, last column only

X = dataset.iloc[:, :-1]:

dataset.iloc[:, :-1] selects all rows (:) and all columns except the last one (:-1).
X is therefore a DataFrame holding the feature (independent variable) columns: Country, Age and Salary.

y = dataset.iloc[:, -1]:

dataset.iloc[:, -1] selects all rows (:) and only the last column (-1).
y is therefore the last column, Purchased, which is the target (dependent variable) here. This slicing works only when the target is stored as the last column of the dataset.
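
The slicing described above can be sketched on a small stand-in frame. The name demo and the three rows are assumptions chosen to mirror the layout of Data.csv; the name-based selection at the end is shown only as an equivalent alternative, not as the method used in this notebook.

```python
import pandas as pd

# A small stand-in frame with the same column layout as Data.csv.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Position-based selection, as in the cell above.
X_demo = demo.iloc[:, :-1]  # DataFrame: all rows, all columns but the last
y_demo = demo.iloc[:, -1]   # Series: all rows, last column only

# Name-based selection picks the same data when the target is the last column.
X_by_name = demo.drop(columns="Purchased")
y_by_name = demo["Purchased"]

print(type(X_demo).__name__, X_demo.shape)
print(type(y_demo).__name__, y_demo.shape)
print(X_demo.equals(X_by_name), y_demo.equals(y_by_name))
```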

In [18]:
print(X)

   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
5   France  35.0  58000.0
6    Spain   NaN  52000.0
7   France  48.0  79000.0
8  Germany  50.0  83000.0
9   France  37.0  67000.0


In [19]:
print(y)

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object


<h2>Taking Care of Missing Data</h2>

In [24]:
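from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the means of the numeric columns (Age and Salary).
imputer.fit(X.iloc[:, 1:3])
# Fill the gaps and write the result back into X.
X.iloc[:, 1:3] = imputer.transform(X.iloc[:, 1:3])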

SimpleImputer: This scikit-learn class handles missing data by replacing missing values. Here it replaces each np.nan with the mean of the column in which it occurs.

1. missing_values=np.nan: This tells the imputer to treat NaN entries as missing.

2. strategy='mean': This tells the imputer to replace each missing value with the mean of its column.

3. fit(): This method computes the mean of columns 1 and 2, Age and Salary (indexing is zero-based and the end of a slice is excluded, so 1:3 selects the 2nd and 3rd columns). Missing values are skipped when the mean is computed. Country is left out because a mean is defined only for numeric data.

4. transform(): Using the means learned by fit(), this method returns a copy of those columns in which every np.nan is replaced by the corresponding column mean.

5. X.iloc[:, 1:3] = ...: This assigns the transformed values back to columns 1 and 2 of the DataFrame X.
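
The fit and transform steps can be checked with a small sketch that uses the Age and Salary values from the dataset shown earlier. The names numeric, imputer_demo and filled are introduced here for illustration; fit_transform() is used as a shorthand for calling fit() and then transform() on the same data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The Age and Salary columns of the dataset, each with one missing value.
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform() is fit() followed by transform() on the same data.
filled = imputer_demo.fit_transform(numeric)

# statistics_ holds the per-column fill values learned during fit().
print(imputer_demo.statistics_)

# Each mean is taken over the nine non-missing values in its column.
print(349 / 9, 574000 / 9)
```

The two printed lines should agree: the known ages sum to 349 and the known salaries to 574000, and each sum is divided by 9 rather than 10 because the missing entry is not counted.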

In [25]:
print(X)

   Country        Age        Salary
0   France  44.000000  72000.000000
1    Spain  27.000000  48000.000000
2  Germany  30.000000  54000.000000
3    Spain  38.000000  61000.000000
4  Germany  40.000000  63777.777778
5   France  35.000000  58000.000000
6    Spain  38.777778  52000.000000
7   France  48.000000  79000.000000
8  Germany  50.000000  83000.000000
9   France  37.000000  67000.000000


<h2>Encoding Categorical Data</h2>