# **INTRODUCTION**

# ML engineering vs. data science

In industry, there is quite a bit of overlap between machine learning engineering and data science. Both jobs involve working with data, including tasks such as data analysis and data preprocessing.

The main task for machine learning engineers is to first analyze the data for viable trends, then create an efficient input pipeline for training a model. This process involves using libraries like NumPy and pandas for handling data, along with machine learning frameworks like TensorFlow for creating the model and input pipeline.

While the NumPy and pandas libraries are also used in data science, the Data Preprocessing section will cover one of the core libraries that is specific to industry-level data science: scikit-learn. Data scientists tend to work on smaller datasets than machine learning engineers, and their main goal is to analyze the data and quickly extract usable results. Therefore, they focus more on traditional data inference models (found in scikit-learn), rather than deep neural networks.

# **STANDARDIZING DATA**

In [None]:
# A- Standard Data format

Data can contain all sorts of different values. For example, Olympic 100m sprint times will range from 9.5 to 10.5 seconds, while calorie counts in large pepperoni pizzas can range from 1500 to 3000 calories. Even data measuring the exact same quantity can differ in range depending on the units used (e.g. weight in kilograms vs. weight in pounds).

When data can take on any range of values, it is difficult to interpret and compare. Therefore, data scientists convert the data into a standard format to make it easier to understand. The standard format refers to data that has zero mean and unit variance (i.e. standard deviation = 1), and the process of converting data into this format is called data standardization.

Data standardization is a relatively simple process. For each data value, x, we subtract the overall mean of the data, μ, then divide by the overall standard deviation, σ. The new value, z, represents the standardized data value. Thus, the formula for data standardization is:

z = (x - μ) / σ
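
As a minimal sketch of the formula above, the snippet below standardizes a small 1-D array with plain NumPy. The sprint times are made-up illustrative values, and the names `times`, `mu`, `sigma`, and `z` are chosen only for this example.

```python
import numpy as np

# hypothetical 100m sprint times in seconds (illustrative values only)
times = np.array([9.5, 9.8, 10.0, 10.2, 10.5])

mu = times.mean()    # overall mean of the data
sigma = times.std()  # overall (population) standard deviation

# z = (x - mu) / sigma, applied element-wise to every value
z = (times - mu) / sigma

print('{}\n'.format(repr(z)))
# the standardized values should have mean ~0 and standard deviation ~1
print('{}\n'.format(repr(z.mean().round(3))))
print('{}\n'.format(repr(z.std().round(3))))
```

Values below the mean become negative and values above it become positive, so the sign of `z` already says which side of the average a measurement falls on.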

In [1]:
# imports
import numpy as np

In [7]:
# B- NumPy and scikit-learn
# define pizza_data
pizza_data = np.array([[2100,   10,  800],
                       [2500,   11,  850],
                       [1800,   10,  760],
                       [2000,   12,  800],
                       [2300,   11,  810]])

print('{}\n'.format(repr(pizza_data)))
print('{}\n'.format(pizza_data.sum()))

# import scale from sklearn.preprocessing
from sklearn.preprocessing import scale

# standardizing each column of pizza_data
col_standardized = scale(pizza_data)
print('{}\n'.format(repr(col_standardized)))

# column means (rounded to nearest thousandth)
col_means = col_standardized.mean(axis=0).round(decimals=3)
print('{}\n'.format(repr(col_means)))

# column standard deviations
col_stds = col_standardized.std(axis=0)
print('{}\n'.format(repr(col_stds)))

array([[2100,   10,  800],
       [2500,   11,  850],
       [1800,   10,  760],
       [2000,   12,  800],
       [2300,   11,  810]])

14774

array([[-0.16552118, -1.06904497, -0.1393466 ],
       [ 1.4896906 ,  0.26726124,  1.60248593],
       [-1.40693001, -1.06904497, -1.53281263],
       [-0.57932412,  1.60356745, -0.1393466 ],
       [ 0.66208471,  0.26726124,  0.2090199 ]])

array([ 0., -0.,  0.])

array([1., 1., 1.])
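
The output above can be reproduced by hand, which helps show what `scale` does to each column. The sketch below uses NumPy only and assumes the population standard deviation (`ddof=0`), which is the estimator scikit-learn's scaler uses. The names `manual` and `sample_based` are introduced just for this example.

```python
import numpy as np

pizza_data = np.array([[2100,   10,  800],
                       [2500,   11,  850],
                       [1800,   10,  760],
                       [2000,   12,  800],
                       [2300,   11,  810]])

# per-column mean and population standard deviation (ddof=0)
col_mu = pizza_data.mean(axis=0)
col_sigma = pizza_data.std(axis=0, ddof=0)

# broadcasting applies z = (x - mu) / sigma to every column at once
manual = (pizza_data - col_mu) / col_sigma
print('{}\n'.format(repr(manual)))

# for comparison: dividing by the sample standard deviation (ddof=1)
# gives slightly smaller magnitudes, so it would not match the output above
sample_based = (pizza_data - col_mu) / pizza_data.std(axis=0, ddof=1)
print('{}\n'.format(repr(sample_based)))
```

Each column is standardized independently, which is why calories, prices, and weights end up on a comparable scale despite their very different raw ranges.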



# **DATA RANGE**

In [None]:
# A- Range scaling

# Range scaling compresses data into a fixed range.
# One of the most common use cases is compressing data into the range [0, 1],
# which lets us view each value as a proportion (or percentage) of the span
# between the minimum and maximum values in the data.
# Scaling to a range is a two-step process. For a given data value X,
# we first compute its proportion with respect to the min and max of the data:
#   Xprop = (X - Dmin) / (Dmax - Dmin)    (only defined when Dmin != Dmax)
# Then we use that proportion to scale the value to the range [Rmin, Rmax]:
#   Xscale = Xprop * (Rmax - Rmin) + Rmin
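
The two-step process above can be sketched directly in NumPy. `range_scale` is a hypothetical helper written for this example (it is not a scikit-learn function), and raising an error when Dmin equals Dmax is a design choice made here for illustration.

```python
import numpy as np

def range_scale(x, r_min=0.0, r_max=1.0):
    """Scale a 1-D array into [r_min, r_max] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    d_min, d_max = x.min(), x.max()
    if d_min == d_max:
        # the proportion would require dividing by zero
        raise ValueError('Dmin equals Dmax, so the proportion is undefined')
    # step 1: proportion of each value between the data min and max
    x_prop = (x - d_min) / (d_max - d_min)
    # step 2: stretch and shift the proportion into [r_min, r_max]
    return x_prop * (r_max - r_min) + r_min

values = np.array([1.2, -0.3, 6.5, 2.2])

scaled_unit = range_scale(values)           # default range [0, 1]
scaled_custom = range_scale(values, -2, 3)  # custom range [-2, 3]

print('{}\n'.format(repr(scaled_unit)))
print('{}\n'.format(repr(scaled_custom)))
```

The smallest value always maps to Rmin and the largest to Rmax, with everything else placed proportionally in between.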

In [9]:
# B- Range compression in scikit-learn

# define data
data = np.array([[ 1.2,  3.2],
                 [-0.3, -1.2],
                 [ 6.5, 10.1],
                 [ 2.2, -8.4]])

print('{}\n'.format(repr(data)))

# import MinMaxScaler from sklearn.preprocessing
from sklearn.preprocessing import MinMaxScaler

# default scaler range is [0, 1]
default_scaler = MinMaxScaler()
transformed = default_scaler.fit_transform(data)
print('{}\n'.format(repr(transformed)))

# custom Scaler, example range [-2, 3]
custom_scaler = MinMaxScaler(feature_range=(-2, 3))
transformed = custom_scaler.fit_transform(data)
print('{}\n'.format(repr(transformed)))

array([[ 1.2,  3.2],
       [-0.3, -1.2],
       [ 6.5, 10.1],
       [ 2.2, -8.4]])

array([[0.22058824, 0.62702703],
       [0.        , 0.38918919],
       [1.        , 1.        ],
       [0.36764706, 0.        ]])

array([[-0.89705882,  1.13513514],
       [-2.        , -0.05405405],
       [ 3.        ,  3.        ],
       [-0.16176471, -2.        ]])



In [10]:
# define new_data
new_data = np.array([[ 1.2, -0.5],
                     [ 5.3,  2.3],
                     [-3.3,  4.1]])

print('{}\n'.format(repr(new_data)))

# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

default_scaler = MinMaxScaler()
transformed = default_scaler.fit_transform(new_data)
print('{}\n'.format(repr(transformed)))

# new instance of MinMaxScaler
default_scaler = MinMaxScaler()
# fit on the earlier `data` array, then reuse its min/max to transform new_data
default_scaler.fit(data)
transformed = default_scaler.transform(new_data)
print('{}\n'.format(repr(transformed)))


array([[ 1.2, -0.5],
       [ 5.3,  2.3],
       [-3.3,  4.1]])

array([[0.52325581, 0.        ],
       [1.        , 0.60869565],
       [0.        , 1.        ]])

array([[ 0.22058824,  0.42702703],
       [ 0.82352941,  0.57837838],
       [-0.44117647,  0.67567568]])
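
The second output above contains a value outside [0, 1] because the scaler learned its minimum and maximum from `data`, not from `new_data`. The NumPy sketch below mimics that separation between fitting and transforming. It illustrates the idea and is not scikit-learn's implementation; `fit_min` and `fit_max` are names introduced here.

```python
import numpy as np

data = np.array([[ 1.2,  3.2],
                 [-0.3, -1.2],
                 [ 6.5, 10.1],
                 [ 2.2, -8.4]])

new_data = np.array([[ 1.2, -0.5],
                     [ 5.3,  2.3],
                     [-3.3,  4.1]])

# "fit": learn the per-column min and max from data only
fit_min = data.min(axis=0)
fit_max = data.max(axis=0)

# "transform": apply the learned min and max to new_data
transformed = (new_data - fit_min) / (fit_max - fit_min)
print('{}\n'.format(repr(transformed)))

# any new value below the fitted min (or above the fitted max)
# lands outside the target range [0, 1]
print('{}\n'.format(repr(new_data < fit_min)))
```

Here -3.3 is smaller than the fitted column minimum of -0.3, so its scaled value is negative.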



# **ROBUST SCALING**

In [None]:
# A- Data Outliers

