# Preprocessing

### Missing values on numerical data

In [29]:
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

In [30]:
imp = SimpleImputer()  # default strategy="mean": each NaN is replaced by its column mean

In [31]:
X_train = [[np.nan, 1, 2], [3, np.nan, 4], [5, np.nan, 6]]
X_test = [[np.nan, 10, 10], [120, np.nan, 600], [10, np.nan, 30]]

In [32]:
# Learn the column means from the training data only, then reuse them on the test data
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
print(X_train)
print(X_test)

[[4. 1. 2.]
 [3. 1. 4.]
 [5. 1. 6.]]
[[  4.  10.  10.]
 [120.   1. 600.]
 [ 10.   1.  30.]]
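The missing test values above were filled with 4 and 1, which are the column means of the training data rather than of the test data. As a minimal sketch, the learned fill values can be inspected through the imputer's `statistics_` attribute after fitting; the data below simply repeats the training set used above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = [[np.nan, 1, 2], [3, np.nan, 4], [5, np.nan, 6]]

# Fit on the training data only; NaNs are ignored when computing each column mean
imp = SimpleImputer(strategy="mean")
imp.fit(X_train)

# One fill value per column, reused for every later call to transform()
print(imp.statistics_)
```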


### Encoding on categorical data


In [33]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()

le.fit_transform(["paris", "paris", "tokyo", "amsterdam"])

array([1, 1, 2, 0], dtype=int32)

There is a problem here: a machine learning model will read these integers as ordered, as if tokyo (2) were greater than paris (1), and paris greater than amsterdam (0). That is not the case. City names are unordered categories and cannot be compared this way. Integer codes are only appropriate for ordered categories such as sizes (small, medium, large).

We need to keep the model from treating tokyo as greater than paris or amsterdam. To do this we use dummy variables (one-hot encoding), which can be created in two ways.
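For categories that really are ordered, such as sizes, integer codes are fine as long as they follow the intended order. A minimal sketch, assuming scikit-learn's `OrdinalEncoder` with an explicitly supplied category order (the `sizes` sample data is made up for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

# The order is stated explicitly; by default categories would be sorted alphabetically
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

# Hypothetical sample data: one row per observation, one column for the feature
sizes = [["medium"], ["small"], ["large"], ["small"]]
encoded = size_encoder.fit_transform(sizes)

print(encoded.ravel())  # small -> 0, medium -> 1, large -> 2
```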

#### Using get_dummies from pandas

In [34]:
dummy=pd.get_dummies(["paris", "paris", "tokyo", "amsterdam"])
dummy

   amsterdam  paris  tokyo
0          0      1      0
1          0      1      0
2          0      0      1
3          1      0      0


#### Using OneHotEncoder from sklearn

In [35]:
from sklearn.preprocessing import OneHotEncoder

In [36]:
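ohe=OneHotEncoder()
# OneHotEncoder expects shape (n_samples, n_features): one row per city, one column for the feature
cat=[["paris"], ["paris"], ["tokyo"], ["amsterdam"]]

In [37]:
cate=ohe.fit_transform(cat).toarray()

In [38]:
print(cate)

[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]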


### Splitting the data into train and test

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X, y = np.arange(10).reshape((5, 2)), range(5)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
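As a quick check, the sizes of the resulting pieces can be printed. This sketch repeats the same call on the same toy data; with 5 samples and `test_size=0.33`, the test share is rounded up to a whole number of samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Rows in X and entries in y are split together, so the sizes always match
print(X_train.shape, X_test.shape)
print(len(y_train), len(y_test))
```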

### Feature scaling

#### Standard scaling is important because many ML models rely on Euclidean distance, so a feature with a large numeric range would otherwise dominate the others; scaling also helps many models train faster
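The effect on Euclidean distance can be sketched with a small made-up dataset. The columns (age in years, income in dollars) and all values are hypothetical and chosen only so that the two features have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is income in dollars
X = np.array([[25, 50_000], [60, 51_000], [26, 58_000]], dtype=float)

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

# Rows 0 and 1 differ mainly in age; rows 0 and 2 differ mainly in income
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print("raw distances:   ", raw_01, raw_02)
print("scaled distances:", scaled_01, scaled_02)
```

On the raw data the income column decides almost everything, so the 35-year age gap barely registers. After scaling, both gaps contribute on a comparable footing.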



#### X_train and X_test must be scaled on the same basis, so we call fit_transform on the training set and only transform on the test set
#### The reason is that we want to treat the test data as new, unseen data. We use the test set to get a good estimate of how the model performs on any new data.
#### In a real application, the new, unseen data could be a single data point that we want to classify. (How would we estimate a mean and standard deviation from only one data point?) This is an intuitive case that shows why we keep the training-set parameters and reuse them to scale the test set.
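The single-data-point case can be sketched as follows. The training values and the new point `x_new` are illustrative; `mean_` and `scale_` are the attributes in which a fitted `StandardScaler` stores the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training data (hypothetical values)
X_train = np.array([[0.0, 0.0], [0.0, 0.0], [2.0, 10.0], [91.0, 199.0]])
scaler = StandardScaler().fit(X_train)

# A single new observation: its own mean and std would be meaningless,
# so it is scaled with the statistics stored from the training set
x_new = np.array([[91.0, 19.0]])
x_scaled = scaler.transform(x_new)

# The same computation by hand, using the stored training parameters
x_manual = (x_new - scaler.mean_) / scaler.scale_

print(x_scaled)
print(np.allclose(x_scaled, x_manual))
```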


In [42]:
from sklearn.preprocessing import StandardScaler

In [43]:
X_train = [[0, 0], [0, 0], [2, 10], [91, 199]]
X_test = [[187, 190], [91, 19]]
scaler = StandardScaler()


In [44]:
# Fit on the training data only; reuse the learned mean and standard deviation on the test data
X_t = scaler.fit_transform(X_train)
X_tes = scaler.transform(X_test)
print(X_t)
print(X_tes)

[[-0.59426437 -0.61597805]
 [-0.59426437 -0.61597805]
 [-0.54314485 -0.49808751]
 [ 1.73167358  1.73004362]]
[[ 4.18541032  1.62394213]
 [ 1.73167358 -0.39198603]]
