This is how the process goes:

<font color = 'green'>
    
Step 1: Define Problem.
    
Step 2: Prepare Data.
    
Step 3: Evaluate Models.
    
Step 4: Finalize Model.

---
## Step 2
    
- Data Cleaning: Identifying and correcting mistakes or errors in the data.
- Feature Selection: Identifying those input variables that are most relevant to the task.
- Data Transforms: Changing the scale or distribution of variables.
- Feature Engineering: Deriving new variables from available data.
- Dimensionality Reduction: Creating compact projections of the data.
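As a rough illustration of how several of the techniques listed above can be chained, the sketch below builds a scikit-learn `Pipeline` on a small made-up array. The toy data, the step names, and the choice of `PolynomialFeatures` and `PCA` are assumptions made for this example only; they are not the steps applied to the housing data in this notebook. Data cleaning is left out because it usually happens before a pipeline is built.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Toy data (made up for illustration): 6 rows, 3 columns; the last column is constant.
X_toy = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [3.0, 15.0, 5.0],
    [4.0, 40.0, 5.0],
    [5.0, 35.0, 5.0],
    [6.0, 60.0, 5.0],
])

prep = Pipeline(steps=[
    # Feature selection: drop columns whose variance is zero
    ("select", VarianceThreshold()),
    # Data transform: rescale every column to the range [0, 1]
    ("scale", MinMaxScaler()),
    # Feature engineering: add squared and interaction terms
    ("engineer", PolynomialFeatures(degree=2, include_bias=False)),
    # Dimensionality reduction: project onto 2 principal components
    ("reduce", PCA(n_components=2)),
])

X_prepared = prep.fit_transform(X_toy)
print(X_prepared.shape)
```

Wrapping the steps in a pipeline means each transform is fitted only on the data passed to `fit`, which helps avoid leaking information from held-out rows.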

---



# Title
![boston](housing.jpeg)
## Introduction


In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [3]:
# loading dataset
housing = pd.read_csv('housing.csv', delimiter=',', header=None)
housing.head()

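|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |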


## Data Cleaning

#### Check for Null Values 

In [4]:
housing.isnull().sum().any()

False
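The check above collapses everything into a single boolean. When that boolean is `True`, a per-column count is usually the next thing to look at. The sketch below shows both views on a small made-up frame; the `toy` frame and its column names are hypothetical and not part of the housing data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in column "a"
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer: does any column contain a missing value?
has_nulls = bool(null_counts.any())
print(has_nulls)
```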

In [5]:
nunique_cols = housing.nunique()
print(nunique_cols)

0     504
1      26
2      76
3       2
4      81
5     446
6     356
7     412
8       9
9      66
10     46
11    357
12    455
13    229
dtype: int64


In [6]:
for col in housing.columns:
    # Number of distinct values in this column
    nunique = housing[col].nunique()
    # Distinct values as a percentage of the number of rows
    pc = round(nunique / housing.shape[0] * 100, 2)
    print(col, nunique, str(pc) + '%')

0 504 99.6%
1 26 5.14%
2 76 15.02%
3 2 0.4%
4 81 16.01%
5 446 88.14%
6 356 70.36%
7 412 81.42%
8 9 1.78%
9 66 13.04%
10 46 9.09%
11 357 70.55%
12 455 89.92%
13 229 45.26%
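One way to act on percentages like these is to flag columns that fall below a cutoff and then inspect them by hand. The sketch below does this on a made-up frame; the `toy` frame, its column names, and the 25% `threshold` are all assumptions chosen for illustration. A low percentage does not by itself mean a column should be dropped, since a binary or categorical variable can be perfectly useful.

```python
import pandas as pd

# Made-up frame: an ID-like column, a binary flag, and a constant column
toy = pd.DataFrame({
    "id_like": range(10),
    "binary_flag": [0, 1] * 5,
    "constant": [7] * 10,
})

# Distinct values per column as a percentage of the row count
unique_pct = toy.nunique() / toy.shape[0] * 100

# Hypothetical cutoff: flag columns with fewer than 25% distinct values
threshold = 25.0
low_variety = unique_pct[unique_pct < threshold].index.tolist()
print(low_variety)
```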


In [7]:
# Import the library
from sklearn.feature_selection import VarianceThreshold

In [8]:
# Reloading the data to undo any changes
housing = pd.read_csv('housing.csv', delimiter=',', header=None)

# Splitting the data into input and output 
data = housing.values
X = data[:, :-1]
y = data[:, -1]
print('Before transformation:',X.shape, y.shape)

Before transformation: (506, 13) (506,)


In [9]:
# define the transform
transform = VarianceThreshold()

# transform the input data
X_sel = transform.fit_transform(X)

print('After transformation:', X_sel.shape)

After transformation: (506, 13)
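The shape is unchanged because, with its default threshold of 0.0, `VarianceThreshold` only removes columns that have the same value in every row, and the housing data has none. To see the transform actually drop something, the sketch below runs it on a small made-up array whose first column is constant; `X_toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up array: the first column is constant, the other two vary
X_toy = np.array([
    [0.0, 1.0, 3.0],
    [0.0, 2.0, 3.1],
    [0.0, 3.0, 2.9],
    [0.0, 4.0, 3.0],
])

# threshold=0.0 is the default: remove only zero-variance columns
selector = VarianceThreshold(threshold=0.0)
X_kept = selector.fit_transform(X_toy)

print(X_kept.shape)
# Boolean mask showing which input columns were kept
print(selector.get_support())
```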


#### Identification of the rows with duplicate data

In [10]:
# We can check this using the first dataframe object.
housing.duplicated().sum()

0
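The housing data has no duplicate rows, so there is nothing to remove here. For a case where duplicates do exist, the sketch below counts and drops them on a made-up frame; the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up frame in which the third row repeats the second
toy = pd.DataFrame({
    "x": [1, 2, 2, 3],
    "y": ["a", "b", "b", "c"],
})

# Count rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())
print(n_dupes)

# Keep the first occurrence of each row and drop the repeats
deduped = toy.drop_duplicates()
print(deduped.shape)
```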

#### Identification of outliers

In [11]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

In [12]:
# Reloading the data to undo any changes
housing = pd.read_csv('housing.csv', delimiter=',', header=None)

# Splitting the data into input and output 
data = housing.values
X = data[:, :-1]
y = data[:, -1]
print('Before transformation:',X.shape, y.shape)

# Splitting the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Before transformation: (506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)


In [13]:
# identify outliers in the training dataset
lof = LocalOutlierFactor()
y_pred = lof.fit_predict(X_train)

# select all rows that are not outliers
mask = y_pred != -1
X_train, y_train = X_train[mask, :], y_train[mask]

# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)

(305, 13) (305,)
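The drop from 339 to 305 training rows means `LocalOutlierFactor` labelled 34 rows as outliers. To make the `-1` / `1` labelling concrete, the sketch below applies the same masking pattern to a made-up 2D dataset with one point placed far from the rest. The toy data, the random seed, and `n_neighbors=20` (the library default, written out explicitly) are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Made-up data: a tight cluster around the origin plus one distant point
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
far_point = np.array([[10.0, 10.0]])
X_toy = np.vstack([inliers, far_point])

# fit_predict returns 1 for inliers and -1 for outliers
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_toy)

# Keep only the rows that were not labelled as outliers
mask = labels != -1
X_clean = X_toy[mask, :]

print(labels[-1])
print(X_toy.shape, X_clean.shape)
```

As in the cell above, outliers are identified on the training split only, so the test split stays untouched for evaluation.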
