### Day 1 includes
> -  Preprocessing
> -  Simple Linear Regression
> -  Multiple Linear Regression
> -  Logistic Regression 
> -  Implementing Logistic Regression

### 1. Preprocessing

1.1 Import packages

In [1]:
import numpy as np
import pandas as pd
print("Numpy version: ",np.__version__)
print("Pandas version: ",pd.__version__)

Numpy version:  1.19.0
Pandas version:  1.0.5


1.2 Load the dataset

In [2]:
dataset = pd.read_csv('Data.csv')

In [3]:
dataset

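|   | Country | Age | Salary | Purchased |
|---|---------|-----|--------|-----------|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |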


df.loc and df.iloc are used for indexing, i.e., to pull out portions of data. In essence, the difference is that .loc allows label-based indexing, while .iloc allows position-based indexing.
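
The difference is easiest to see on a frame whose index is not the default 0, 1, 2. The following is a minimal sketch on a hypothetical three-row frame (the index labels `a`, `b`, `c` and the variable names are made up for illustration):

```python
import pandas as pd

# Hypothetical three-row frame with a string index instead of 0, 1, 2
df = pd.DataFrame(
    {"Country": ["France", "Spain", "Germany"], "Age": [44.0, 27.0, 30.0]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "Age"]          # row labelled "b", column labelled "Age"
by_position = df.iloc[1, 1]            # second row, second column
label_slice = df.loc["a":"b", "Age"]   # label slices include the end label
pos_slice = df.iloc[0:2, 1]            # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Both lookups return the same cell here, but `.loc` would keep working if the rows were reordered, while `.iloc` would then point at a different row.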

In [4]:
#dataset.select_dtypes(['object']).columns
# Names of the float columns (Age, Salary); used later to select the columns to impute
fs = dataset.select_dtypes(['float']).columns

In [5]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

In [6]:
X[:, 1:3]

array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, nan],
       [35.0, 58000.0],
       [nan, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)

1.3 Handling the missing data

In [7]:
#from sklearn.impute import SimpleImputer
#imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
#imputer = imputer.fit(X[:, 1:3])
#X[:, 1:3] = imputer.transform(X[:, 1:3])

In [8]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
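
As a rough sketch of what mean imputation does to the two numeric columns, the example below copies the Age and Salary values from the table above into a plain float array (the names `numeric` and `filled` are chosen here for illustration) and replaces each `NaN` with the mean of the remaining values in its column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary values copied from the table above, as a float array
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(imputer.statistics_)   # per-column means computed from the non-missing values
print(filled[4], filled[6])  # the two rows that previously contained NaN
```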

1.4 Encoding categorical data

Categorical data are variables that contain label values rather than numerical values, for example hair color, country, or letter grade.
There are two types of categorical data:
1. <b>Categorical and ordinal</b> variables have a logical ordering among their values, e.g., letter grades.
2. <b>Categorical and nominal</b> variables have no logical ordering among their values, e.g., hair color.

What is the problem with categorical data?

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

However, some algorithms, such as decision trees, can work directly on categorical data.


For categorical and nominal variables we use <b>one-hot encoding</b>.

Ordinal encoding is a natural encoding for ordinal variables. For nominal variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems, so a one-hot encoding may be used instead.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
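
To make the contrast concrete, here is a small sketch that encodes a hypothetical letter-grade column with `OrdinalEncoder` and a country column with `OneHotEncoder`. The grade values and their order (C < B < A) are assumptions made for this example only; they are not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordinal feature: letter grades, where the order C < B < A is meaningful
grades = np.array([["B"], ["A"], ["C"], ["A"]])
ordinal = OrdinalEncoder(categories=[["C", "B", "A"]])  # explicit order: C=0, B=1, A=2
grade_codes = ordinal.fit_transform(grades)

# Nominal feature: country, which has no meaningful order
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
onehot = OneHotEncoder()
country_dummies = onehot.fit_transform(countries).toarray()  # dense copy of the sparse result

print(grade_codes.ravel())
print(onehot.categories_[0])  # categories are sorted alphabetically
print(country_dummies)        # one column per category, a single 1 in each row
```

With one-hot encoding no country is "larger" than another, whereas the ordinal codes deliberately preserve the ranking of the grades.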

In [10]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# One-hot encode column 0 (Country), mean-impute the float columns listed in fs (Age, Salary),
# and pass the remaining column (Purchased) through unchanged.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0]),
                       ('imputer', SimpleImputer(missing_values=np.nan, strategy="mean"), fs)], remainder='passthrough')

#### sklearn.compose.ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

In [11]:
encoded_X = ct.fit_transform(dataset)

In [12]:
encoded_X

array([[1.0, 0.0, 0.0, 44.0, 72000.0, 'No'],
       [0.0, 0.0, 1.0, 27.0, 48000.0, 'Yes'],
       [0.0, 1.0, 0.0, 30.0, 54000.0, 'No'],
       [0.0, 0.0, 1.0, 38.0, 61000.0, 'No'],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778, 'Yes'],
       [1.0, 0.0, 0.0, 35.0, 58000.0, 'Yes'],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0, 'No'],
       [1.0, 0.0, 0.0, 48.0, 79000.0, 'Yes'],
       [0.0, 1.0, 0.0, 50.0, 83000.0, 'No'],
       [1.0, 0.0, 0.0, 37.0, 67000.0, 'Yes']], dtype=object)

In [18]:
X = encoded_X[:, :-1]
y = dataset.iloc[:, 3].values

In [19]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [20]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [21]:
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

In [22]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
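
The mapping between strings and integers can be inspected on the fitted encoder. A minimal sketch, assuming the same ten labels as above (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(encoder.classes_)                      # sorted class labels; position = integer code
print(codes)
print(encoder.inverse_transform(codes[:3]))  # map integer codes back to the original strings
```

Because the classes are sorted, `No` becomes 0 and `Yes` becomes 1.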

### 1.5 Splitting the dataset into training and test sets

In [26]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
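
The call above draws a different random split on every run. If a repeatable split is wanted, one option is to pass `random_state`, and optionally `stratify` to keep the class balance. The sketch below uses stand-in arrays with the same shape as `X` and `y`; the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the notebook's X and y (10 rows, 5 features)
X_demo = np.arange(50, dtype=float).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,     # 2 of the 10 rows go to the test set
    random_state=0,    # fixed seed so the split is repeatable (value chosen arbitrarily)
    stratify=y_demo,   # keep the Yes/No ratio the same in both subsets
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())  # number of positive labels in each subset
```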

### 1.6 Feature scaling

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [40]:
from sklearn.preprocessing import StandardScaler

In [41]:
standard = StandardScaler()

In [48]:
scaled_X_train = standard.fit_transform(X_train)
# Reuse the mean and standard deviation learned from the training set; do not refit on the test set
scaled_X_test = standard.transform(X_test)

In [50]:
scaled_X_train

array([[-0.77459667,  1.73205081, -0.77459667,  1.83377277,  1.94587371],
       [ 1.29099445, -0.57735027, -0.77459667, -0.36187534, -0.35693541],
       [-0.77459667, -0.57735027,  1.29099445,  0.07725429, -0.08059832],
       [-0.77459667, -0.57735027,  1.29099445,  0.19110271, -0.9096096 ],
       [-0.77459667, -0.57735027,  1.29099445, -1.53288766, -1.27805906],
       [-0.77459667,  1.73205081, -0.77459667, -1.09375804, -0.72538487],
       [ 1.29099445, -0.57735027, -0.77459667,  0.95551353,  0.9326377 ],
       [ 1.29099445, -0.57735027, -0.77459667, -0.06912226,  0.47207587]])
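
`StandardScaler` standardizes each column as z = (x - mean) / std. The usual practice is to learn the mean and standard deviation from the training rows only and reuse them for the test rows. The sketch below illustrates this on a few hypothetical Age/Salary rows and checks the result against the formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age/Salary rows standing in for a train/test split
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0]])
test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train)  # learns mean_ and scale_ from the training rows
scaled_test = scaler.transform(test)        # reuses the training statistics

# The same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_train.mean(axis=0))  # approximately 0 for every column
print(scaled_test, manual_test)
```

Calling `fit_transform` on the test set instead would scale it with its own statistics, so the same raw value could map to different scaled values in the two sets.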