# Handling numerical data:
A numerical value is a measurement of some feature, such as sales, cost, or price.

In this tutorial, I'll teach you (and learn alongside you) several strategies for transforming raw numerical data into features purpose-built for ML algorithms.

# Rescaling a Feature:

In [2]:
# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

# Show feature
scaled_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

Many ML algorithms assume all features are on the same scale, typically [0, 1] or [-1, 1]. The simplest way to rescale is min-max scaling: new x = [x - min(x)] / [max(x) - min(x)]
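To make the formula concrete, here is a minimal NumPy sketch that applies it by hand to the same feature. The helper name `min_max_rescale` and its `new_min`/`new_max` parameters are my own illustration, not part of scikit-learn, and the sketch assumes every column has at least two distinct values (otherwise the denominator is zero).

```python
import numpy as np

def min_max_rescale(x, new_min=0.0, new_max=1.0):
    """Rescale each column of x to [new_min, new_max] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    # [x - min(x)] / [max(x) - min(x)] maps every column onto [0, 1]
    unit = (x - col_min) / (col_max - col_min)
    # Stretch and shift to the requested range
    return unit * (new_max - new_min) + new_min

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
manual = min_max_rescale(feature)
print(manual.ravel())
```

The smallest value maps to 0 and the largest to 1; everything else lands in between in proportion to its distance from the minimum.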

# Standardize a Feature:

Standardization transforms a feature so that it has mean = 0 and standard deviation = 1.
Standard deviation = sqrt([SUM (x - mean(x))^2]/n)

In [5]:
# new x = [x - mean(x)] / STDDEV

# Standardization is used more often than min-max scaling; MinMaxScaler is mainly used for neural networks

import numpy as np
from sklearn import preprocessing

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
scaler = preprocessing.StandardScaler()
feature = scaler.fit_transform(feature)

print(np.mean(feature))
print(np.std(feature))

0.0
1.0
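As a sanity check, the same transformation can be sketched with plain NumPy. This assumes the population standard deviation (dividing by n, which is NumPy's default `ddof=0`); as far as I know that is also what StandardScaler uses. The variable names are only for illustration.

```python
import numpy as np

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Column-wise mean and population standard deviation (ddof=0)
mean = feature.mean(axis=0)
std = feature.std(axis=0)

# new x = [x - mean(x)] / STDDEV
standardized = (feature - mean) / std

print(standardized.ravel())
```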


One point I'd like to draw your attention to: if the data contains significant outliers, standardization can be distorted, because outliers shift the mean and inflate the standard deviation. In that case it is better to scale using the median and the interquartile range, which RobustScaler implements.

In [6]:
scaler = preprocessing.RobustScaler()
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

feature = scaler.fit_transform(feature)

feature

array([[-2.5],
       [-0.5],
       [ 0. ],
       [ 0.5],
       [ 4.5]])
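For intuition, here is a rough NumPy sketch of what this scaling amounts to with default settings: subtract the median and divide by the interquartile range. It is my own hand-rolled equivalent for this small example, not the library's implementation.

```python
import numpy as np

feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Median and quartiles are barely affected by the two extreme values
median = np.median(feature, axis=0)
q1, q3 = np.percentile(feature, [25, 75], axis=0)

# new x = [x - median(x)] / IQR
robust = (feature - median) / (q3 - q1)

print(robust.ravel())
```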

# Detecting outliers:
Detecting outliers is more of an art than a science. There are many ways to do it, such as the Z-score, the modified Z-score, and the IQR (interquartile range). Here I focus on one widely used method, the IQR.

In [9]:
import numpy as np
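def getIndicesOfOutliers(x):
    # Quartiles and interquartile range of the feature
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Anything more than 1.5 * IQR beyond the quartiles counts as an outlier
    lowerBound = q1 - (iqr * 1.5)
    upperBound = q3 + (iqr * 1.5)

    return np.where((x < lowerBound) | (x > upperBound))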

feature = np.array([-500.5, -100.1, 0, 100.1, 900.9])

getIndicesOfOutliers(feature)

(array([0, 4], dtype=int64),)

The result implies that the values -500.5 and 900.9 (indices 0 and 4) are outliers with respect to the feature array.
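For comparison, here is a minimal sketch of the modified Z-score, one of the other methods mentioned above. The function name `modified_z_score_outliers` is hypothetical, and the constant 0.6745 and the threshold 3.5 are the commonly quoted conventions rather than anything this tutorial prescribes; treat both as assumptions you may want to tune.

```python
import numpy as np

def modified_z_score_outliers(x, threshold=3.5):
    """Return indices whose modified Z-score exceeds the threshold (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    # Median absolute deviation (MAD): a robust stand-in for the standard deviation
    mad = np.median(np.abs(x - median))
    # 0.6745 rescales the MAD so scores are comparable to ordinary Z-scores
    # for roughly normal data; this sketch assumes mad != 0
    scores = 0.6745 * (x - median) / mad
    return np.flatnonzero(np.abs(scores) > threshold)

feature = np.array([-500.5, -100.1, 0, 100.1, 900.9])
outlier_indices = modified_z_score_outliers(feature)
print(outlier_indices)
```

With these settings only 900.9 (index 4) is flagged, while the IQR rule also flags -500.5. Different methods can disagree on borderline points, which is part of why outlier detection is a judgment call.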

# Handling outliers:
Typically we have three strategies for handling outliers. First, we can simply drop them.

In [14]:
# Create dataframe
import pandas as pd
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

print('Before filter:')
print(houses)

# Apply filter to drop
houses = houses[houses['Bathrooms'] < 4] 
print('\n\nAfter filter')
houses

Before filter:
     Price  Bathrooms  Square_Feet
0   534433        2.0         1500
1   392333        3.5         2500
2   293222        2.0         1500
3  4322032      116.0        48000


After filter


    Price  Bathrooms  Square_Feet
0  534433        2.0         1500
1  392333        3.5         2500
2  293222        2.0         1500


Second, we can flag them as outliers and include that flag as a new feature.

In [23]:
import pandas as pd
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

print('Before filter:')
print(houses)

print('\n\nAfter filter:')
houses['filter'] = np.where(houses['Bathrooms']>3, 1, 0)
houses

Before filter:
     Price  Bathrooms  Square_Feet
0   534433        2.0         1500
1   392333        3.5         2500
2   293222        2.0         1500
3  4322032      116.0        48000


After filter:


     Price  Bathrooms  Square_Feet  filter
0   534433        2.0         1500       0
1   392333        3.5         2500       1
2   293222        2.0         1500       0
3  4322032      116.0        48000       1
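Instead of hand-picking a cutoff, the flag can also be derived from the IQR rule used in the outlier-detection section. The sketch below is one way to do that; the column name `Outlier` is my own choice, and with only four rows the quartiles are rough, so read it as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    'Price': [534433, 392333, 293222, 4322032],
    'Bathrooms': [2, 3.5, 2, 116],
    'Square_Feet': [1500, 2500, 1500, 48000],
})

# IQR bounds for the column: 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(houses['Bathrooms'], [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1 marks an outlier, 0 a regular observation
houses['Outlier'] = np.where(houses['Bathrooms'].between(lower, upper), 0, 1)
print(houses)
```

Here only the house with 116 bathrooms is flagged; the 3.5-bathroom house stays a regular observation.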


Finally, we can transform the feature to dampen the effect of the outliers.

In [24]:
# np.log is vectorized, so it can be applied to the whole column at once
houses['LogSquareFeet'] = np.log(houses['Square_Feet'])
houses

     Price  Bathrooms  Square_Feet  filter  LogSquareFeet
0   534433        2.0         1500       0       7.313220
1   392333        3.5         2500       1       7.824046
2   293222        2.0         1500       0       7.313220
3  4322032      116.0        48000       1      10.778956
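To see how much the log transform compresses the range, the sketch below compares the largest-to-smallest ratio before and after. I use `np.log1p` (the log of 1 + x) here instead of `np.log`; that is my substitution, on the assumption that a real feature might contain zeros, where `np.log` would return `-inf`.

```python
import numpy as np

square_feet = np.array([1500, 2500, 1500, 48000], dtype=float)

# Ratio between the largest and smallest raw values
raw_ratio = square_feet.max() / square_feet.min()

# log1p(x) = log(1 + x); unlike log(x), it is defined at x = 0
log_values = np.log1p(square_feet)
log_ratio = log_values.max() / log_values.min()

print("Raw ratio:", raw_ratio)
print("Log ratio:", round(log_ratio, 2))
```

The largest house is 32 times the smallest on the raw scale, but less than twice as large after the transform.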


# Deleting observations with missing values

In [33]:
import numpy as np
import pandas as pd
from sklearn import datasets

df = pd.read_csv('F:\\100DaysOfMLChallenge\\3 Pandas basic tutorial\\train.csv')
df.dropna(inplace=True)

df.describe()

       PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
count        183.0     183.0     183.0      183.0     183.0     183.0      183.0
mean     455.36612  0.672131  1.191257  35.674426  0.464481   0.47541  78.682469
std     247.052476  0.470725  0.515187  15.643866  0.644159  0.754617  76.347843
min            2.0       0.0       1.0       0.92       0.0       0.0        0.0
25%          263.5       0.0       1.0       24.0       0.0       0.0       29.7
50%          457.0       1.0       1.0       36.0       0.0       0.0       57.0
75%          676.0       1.0       1.0       47.5       1.0       1.0       90.0
max          890.0       1.0       3.0       80.0       3.0       4.0   512.3292
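Called without arguments, `dropna()` removes every row that has a missing value in any column, which is why only 183 rows survive above. If only some columns matter, the `subset` parameter limits the check to those columns. The tiny DataFrame below is made-up data for illustration, not rows from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns
df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': ['C85', None, None],
    'Fare': [7.25, 71.28, 8.05],
})

# Drop rows with a missing value in ANY column
complete_rows = df.dropna()

# Drop rows only when 'Age' is missing; gaps in 'Cabin' are tolerated
age_known = df.dropna(subset=['Age'])

print(len(df), len(complete_rows), len(age_known))
```

Restricting the check this way can keep far more observations when one sparsely filled column is responsible for most of the missing values.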


# Imputing missing values:

We can fill missing values with the mean, median, or mode. Apart from that, one can use KNN or a ___ based algorithm, where the feature with missing values is treated as the prediction target (y) and the other features as the input to the algorithm (X).

In [9]:
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000, n_features = 2, random_state = 1)
scaler = StandardScaler()

# Standardize, then hide one known value so the imputation can be checked
standardized_features = scaler.fit_transform(features)
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan

# SimpleImputer replaces the Imputer class that was removed from sklearn.preprocessing;
# it always imputes column by column, so no axis argument is needed
mean_imputer = SimpleImputer(strategy = 'mean')
# Impute the matrix that actually contains the missing value
features_mean_imputed = mean_imputer.fit_transform(standardized_features)

# Compare true and imputed values
print("True Value:", round(true_value, 6))
print("Imputed Value:", round(features_mean_imputed[0,0], 6))

# KNN-based alternative, left disabled as in the original notebook.
# It needs the fancyimpute package, and the complete() call matches older releases.
# from fancyimpute import KNN
#
# # Predict the missing values in the feature matrix
# features_knn_imputed = KNN(k=5, verbose=0).complete(standardized_features)
#
# # Compare true and imputed values
# print("True Value:", true_value)
# print("Imputed Value:", features_knn_imputed[0,0])

True Value: 0.873019
Imputed Value: -0.000874
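The KNN idea can also be tried without fancyimpute. The sketch below uses scikit-learn's `KNNImputer` from `sklearn.impute`, assuming a scikit-learn version recent enough to include it; `n_neighbors=5` simply mirrors the `k=5` in the disabled code and is not a tuned value.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Same simulated, standardized feature matrix as in the mean-imputation example
features, _ = make_blobs(n_samples=1000, n_features=2, random_state=1)
standardized_features = StandardScaler().fit_transform(features)

# Hide one known value so the imputed value can be compared with it
true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan

# Fill the gap from the 5 nearest rows, measured on the features that are present
knn_imputer = KNNImputer(n_neighbors=5)
features_knn_imputed = knn_imputer.fit_transform(standardized_features)

print("True Value:", true_value)
print("Imputed Value:", features_knn_imputed[0, 0])
```

Because the fill value comes from nearby observations rather than a single column-wide statistic, it will often land closer to the true value than the column mean does, at the cost of extra computation on large datasets.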