# Numerical Data Handling

**About Notebook:**
This notebook covers various techniques for handling numerical data. Numerical data handling means taking values stored in some container (an array, list, dictionary, dataset, etc.) and rescaling or transforming them, for example so that their mean becomes 0 and their standard deviation becomes 1.

More formally, numerical data handling is the process of creating new features from existing values and transforming those values into a form suited to a particular purpose.

**Author:** *Pintu Ram*

### List of Operations Performed in this notebook:

1. Rescaling Features 
2. Standardizing Features 
3. Normalizing Observations 
4. Generating Polynomial and Interaction Features 
5. Transforming Features 
6. Detecting Outliers 
7. Handling Outliers 

### 1. Rescaling a Feature 
Rescale the values of a numerical feature so that they lie between two chosen values.

In [1]:
# load libraries
import numpy as np
from sklearn import preprocessing

In [2]:
# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

In [3]:
# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)) # here 0 and 1 are the two values

In [4]:
# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

In [5]:
# Show feature
scaled_feature  # the values will be between 0 & 1

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])
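
To see what the scaler does to each value, the same rescaling can be sketched by hand with NumPy, assuming the usual min-max formula x' = (x - min) / (max - min). The variable names (`feature_min`, `feature_max`, `manual_scaled`) are only illustrative.

```python
import numpy as np

# Same feature values as in the cells above
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Min-max formula: subtract the column minimum, then divide by the column range
feature_min = feature.min(axis=0)
feature_max = feature.max(axis=0)
manual_scaled = (feature - feature_min) / (feature_max - feature_min)

print(manual_scaled.ravel())
```

The smallest value maps to 0 and the largest to 1; every other value keeps its relative position inside that range.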

### 2. Standardizing a Feature
Transform a feature to have a mean of 0 and a standard deviation of 1.

In [6]:
# Load libraries
import numpy as np
from sklearn import preprocessing

In [7]:
# Create feature
feature = np.array([[-500.5],
                    [-100.1],
                    [0],
                    [100.1],
                    [900.9]])

In [8]:
# Create scaler
scaler = preprocessing.StandardScaler()

In [9]:
# Transform the feature
standardized = scaler.fit_transform(feature)

In [10]:
# Show feature
standardized

array([[-1.26687088],
       [-0.39316683],
       [-0.17474081],
       [ 0.0436852 ],
       [ 1.79109332]])

In [11]:
# Print mean and standard deviation
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())

Mean: 0.0
Standard deviation: 1.0
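
As a cross-check, standardization can be sketched directly in NumPy as z = (x - mean) / std. This sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np

# Same feature values as in the cells above
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# z-score: center on the column mean, divide by the population standard deviation
mean = feature.mean(axis=0)
std = feature.std(axis=0, ddof=0)
manual_standardized = (feature - mean) / std

print(manual_standardized.ravel())
print("Mean:", manual_standardized.mean())
print("Standard deviation:", manual_standardized.std())
```

The printed mean may show up as a tiny nonzero number because of floating-point rounding.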


### 3. Normalizing Observations
Rescale the feature values of observations to have unit norm (a total length of 1).

In [12]:
# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer

In [13]:
# Create feature
feature = np.array([[0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]])

In [14]:
# Create normalizer
normalizer = Normalizer(norm="l2") # "l2" is the Euclidean norm

In [15]:
# Transform feature matrix
normalizer.transform(feature)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

In [16]:
# Create normalizer
normalizer = Normalizer(norm="l1") # "l1" is the Manhattan (Taxicab) norm

In [17]:
# Transform feature matrix
normalizer.transform(feature)

array([[0.5       , 0.5       ],
       [0.24444444, 0.75555556],
       [0.06912442, 0.93087558],
       [0.04524008, 0.95475992],
       [0.76760563, 0.23239437]])
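
The difference between the two norms is easier to see when each row is divided by its own norm by hand. The sketch below uses `np.linalg.norm`; the variable names are illustrative.

```python
import numpy as np

# Same feature matrix as in the cells above
feature = np.array([[0.5, 0.5],
                    [1.1, 3.4],
                    [1.5, 20.2],
                    [1.63, 34.4],
                    [10.9, 3.3]])

# L2 (Euclidean) norm of each row: square root of the sum of squares
l2_norms = np.linalg.norm(feature, ord=2, axis=1, keepdims=True)
# L1 (Manhattan) norm of each row: sum of absolute values
l1_norms = np.linalg.norm(feature, ord=1, axis=1, keepdims=True)

l2_normalized = feature / l2_norms
l1_normalized = feature / l1_norms

# With L2, the squared values of each row sum to 1
print((l2_normalized ** 2).sum(axis=1))
# With L1, the absolute values of each row sum to 1
print(np.abs(l1_normalized).sum(axis=1))
```

Note that normalization works on rows (observations), whereas rescaling and standardizing work on columns (features).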

### 4. Generating Polynomial and Interaction Features
Create polynomial and interaction features.

In [18]:
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [19]:
# Create feature matrix
features = np.array([[2, 3],
                    [2, 3],
                    [2, 3]])

In [20]:
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False) # degree=2 produces x1, x2, x1^2, x1*x2, x2^2

In [21]:
# Transform polynomial features
polynomial_interaction.fit_transform(features)

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

In [22]:
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=3, include_bias=False) # degree=3 also adds x1^3, x1^2*x2, x1*x2^2, x2^3

In [23]:
# Transform polynomial features
polynomial_interaction.fit_transform(features)

array([[ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.],
       [ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.],
       [ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])

In [24]:
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures() # defaults: degree=2, include_bias=True (adds a leading column of 1s)

In [25]:
# Transform polynomial features
polynomial_interaction.fit_transform(features)

array([[1., 2., 3., 4., 6., 9.],
       [1., 2., 3., 4., 6., 9.],
       [1., 2., 3., 4., 6., 9.]])

In [26]:
# We can restrict the features created to only interaction features by setting interaction_only to True:
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])
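
To make clear which column is which, the degree-2 output can be rebuilt by hand. This is only a sketch in plain NumPy, assuming the column order seen in the outputs above (x1, x2, x1^2, x1*x2, x2^2); the variable names are illustrative.

```python
import numpy as np

# Same feature matrix as in the cells above
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]], dtype=float)

x1, x2 = features[:, 0], features[:, 1]

# Degree-2 polynomial features without the bias column
degree_2 = np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# Interaction-only features: the original columns plus their product
interaction_only = np.column_stack([x1, x2, x1 * x2])

print(degree_2)
print(interaction_only)
```

The interaction term x1*x2 is the only column shared by both results beyond the original features; the pure powers x1^2 and x2^2 are dropped when only interactions are requested.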

### 5. Transforming Features
Apply a custom transformation to one or more features.

In [27]:
# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer

In [28]:
# Create feature matrix
features = np.array([[2, 3],
                    [2, 3],
                    [2, 3]])

In [29]:
# Define a simple function
def add_ten(x):
    return x + 10

In [30]:
# Create transformer
ten_transformer = FunctionTransformer(add_ten)

In [31]:
# Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

In [32]:
# Alternative: do the same with pandas' apply, which is simpler for this case

In [33]:
# Load library
import pandas as pd

In [34]:
# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])

In [35]:
# Apply function
df.apply(add_ten)

|   | feature_1 | feature_2 |
|---|-----------|-----------|
| 0 | 12 | 13 |
| 1 | 12 | 13 |
| 2 | 12 | 13 |


### 6. Detecting Outliers
Identify extreme observations.

In [36]:
# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

In [37]:
# Create simulated data
features, _ = make_blobs(n_samples = 10,
                        n_features = 2,
                        centers = 1,
                        random_state = 1)

In [38]:
# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000

In [39]:
# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)

In [40]:
# Fit detector
outlier_detector.fit(features)

EllipticEnvelope()

In [41]:
# Predict outliers
outlier_detector.predict(features)

array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

In [42]:
# Identify extreme values in a single feature using the interquartile range (IQR):

In [43]:
# Create one feature
feature = features[:,0]

In [44]:
# Create a function that returns the indices of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))

In [45]:
# Run function
indices_of_outliers(feature)

(array([0], dtype=int64),)
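
The IQR rule is easier to follow on a tiny example where the bounds can be checked by eye. The values below are made up for illustration and are not part of the simulated data above.

```python
import numpy as np

# Hypothetical toy values: four ordinary points and one extreme point
values = np.array([1, 2, 3, 4, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 * IQR outside the quartiles counts as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outlier_mask = (values < lower_bound) | (values > upper_bound)

print("Q1, Q3, IQR:", q1, q3, iqr)          # 2.0 4.0 2.0
print("Bounds:", lower_bound, upper_bound)  # -1.0 7.0
print("Outliers:", values[outlier_mask])    # [100]
```

The factor 1.5 is a common convention rather than a fixed rule; a larger factor flags fewer points.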

### 7. Handling Outliers
Decide how to deal with outliers once you have found them.

In [46]:
# There are three ways to handle outliers
# 1. Drop them
# Load library
import pandas as pd

In [47]:
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

In [48]:
# Filter observations
houses[houses['Bathrooms'] < 20]

|   | Price | Bathrooms | Square_Feet |
|---|-------|-----------|-------------|
| 0 | 534433 | 2.0 | 1500 |
| 1 | 392333 | 3.5 | 2500 |
| 2 | 293222 | 2.0 | 1500 |


In [49]:
# 2. Mark them as outliers and include the flag as a feature
# Load library
import numpy as np

In [50]:
# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)

In [51]:
# Show data
houses

|   | Price | Bathrooms | Square_Feet | Outlier |
|---|-------|-----------|-------------|---------|
| 0 | 534433 | 2.0 | 1500 | 0 |
| 1 | 392333 | 3.5 | 2500 | 0 |
| 2 | 293222 | 2.0 | 1500 | 0 |
| 3 | 4322032 | 116.0 | 48000 | 1 |
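
The threshold of 20 bathrooms is hard-coded. As a possible alternative, the flag could be derived from the data with the IQR rule. The sketch below uses plain NumPy and checks only the upper bound; with just four rows the quartiles are crude, so treat it as an illustration rather than a recommendation.

```python
import numpy as np

# Bathrooms column from the houses example
bathrooms = np.array([2, 3.5, 2, 116])

# Upper bound from the IQR rule: Q3 + 1.5 * IQR
q1, q3 = np.percentile(bathrooms, [25, 75])
upper_bound = q3 + 1.5 * (q3 - q1)

# 1 marks an outlier, 0 marks an ordinary observation
outlier_flag = np.where(bathrooms > upper_bound, 1, 0)

print("Upper bound:", upper_bound)
print("Outlier flag:", outlier_flag)
```

On this data the rule flags the same house as the hard-coded threshold does.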


In [52]:
# 3. Transform the feature to dampen the effect of the outlier
# Log feature (np.log works element-wise on the whole column)
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])
# Show data
houses

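|   | Price | Bathrooms | Square_Feet | Outlier | Log_Of_Square_Feet |
|---|-------|-----------|-------------|---------|--------------------|
| 0 | 534433 | 2.0 | 1500 | 0 | 7.31322 |
| 1 | 392333 | 3.5 | 2500 | 0 | 7.824046 |
| 2 | 293222 | 2.0 | 1500 | 0 | 7.31322 |
| 3 | 4322032 | 116.0 | 48000 | 1 | 10.778956 |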
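
One rough way to see how much the log transform dampens the outlier is to compare the largest-to-smallest ratio before and after. This is only an illustrative check in NumPy, not a formal measure of outlier influence.

```python
import numpy as np

# Square_Feet column from the houses example
square_feet = np.array([1500, 2500, 1500, 48000])
log_square_feet = np.log(square_feet)

# Ratio of the largest to the smallest value, before and after the transform
raw_ratio = square_feet.max() / square_feet.min()
log_ratio = log_square_feet.max() / log_square_feet.min()

print("Raw ratio:", raw_ratio)
print("Log ratio:", round(log_ratio, 2))
```

The extreme house is 32 times larger than the smallest one on the raw scale, but less than 1.5 times larger on the log scale.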
