# 💻 UnpackAI DL201 Bootcamp - Week 2 - Concepts: Continuous Variables

### 📕 Learning Objectives
<hr style="border:2px solid gray"> </hr>


* Firmly grasp how continuous variables have a mathematical meaning which allows for certain algorithms to use them as input data  
* Gain an appreciation for how **scaling** can not only decrease training time, but also increase the quality of the input data, for both tabular and image data
* Understand the differences between standardization and normalization, and in which situations to apply them
* Build awareness of the importance of the normal distribution, and how to transform data using log and Box-Cox transforms
* Appreciate how these very same properties can be applied in 2 or more dimensions in image data through broadcasting methods



### 📖 Concepts map
<hr style="border:2px solid gray"> </hr>


* Quantitative vs Qualitative
* Fundamental Theorem of Calculus
* Law of Large Numbers
* Normal Distribution
* Standard Deviation
* Skew
* Tensor Data Types
* Garbage in Garbage out
* Local vs Global Transformation


In [1]:
# Imports 
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import os

In [2]:
is_kaggle = True   # True if you are on Kaggle, False for local Windows, Linux or Mac environments.

In [3]:
# if is_kaggle:
    
#     IMAGE_DIR = Path('/kaggle/working/DL201/img')
#     DATA_DIR = Path('/kaggle/working/DL201/data')


In [4]:
# path preparation
if is_kaggle:
    !git clone https://github.com/unpackAI/DL201.git
    IMAGE_DIR = Path('/kaggle/working/DL201/img')
    DATA_DIR = Path('/kaggle/working/DL201/data')
    OUTPUT_DIR = Path('./output')
else:
    # This section is for local execution, it is assumed that we launch the notebooks from the DL201 repository.
    DATA_DIR = Path('../data')
    IMAGE_DIR = Path('../img')
    OUTPUT_DIR = Path('../output')

# finally, check if we found the right paths
if os.path.isdir(DATA_DIR):
    print(f'DATA_DIR is a directory, its path is {DATA_DIR}')
else:
    print("ERROR : DATA_DIR is not a directory")

if os.path.isdir(IMAGE_DIR):
    print(f'IMAGE_DIR is a directory, its path is {IMAGE_DIR}')
else:
    print("ERROR : IMAGE_DIR is not a directory")

# create an output directory if necessary
if not os.path.isdir(OUTPUT_DIR):
    os.mkdir(OUTPUT_DIR)
if os.path.isdir(OUTPUT_DIR):
    print(f'OUTPUT_DIR is a directory, its path is {OUTPUT_DIR}')
else:
    print("ERROR : OUTPUT_DIR is not a directory")

In [5]:
# Splits columns into categorical and continuous variables
def cont_cat_split(df, max_card=20, dep_var=None):
    "Helper function that returns column names of cont and cat variables from given `df`."
    cont_names, cat_names = [], []
    for label in df:
        #if label in L(dep_var): continue
        if ((pd.api.types.is_integer_dtype(df[label].dtype) and
            df[label].unique().shape[0] > max_card) or
            pd.api.types.is_float_dtype(df[label].dtype)):
            cont_names.append(label)
        else: cat_names.append(label)
    return cont_names, cat_names

## What is a Continuous Variable?
<hr style="border:2px solid gray"> </hr>


There are primarily two kinds of variables in statistics: continuous and categorical. Continuous variables are quantifiable numbers that exist on a spectrum. An ideal continuous variable can take any value, including decimals, between the minimum and maximum. For example, measurements expressed in units such as grams, meters, and liters, or temperature, meet this criterion.

However, it gets a bit fuzzy when you have real-world data. When there's enough data, it's generally accepted to treat anything that can exist on an ordered spectrum, has many possible states, and has mathematical meaning as a continuous variable. For example, financial data such as revenue, profits, number of items sold, or price are treated as continuous variables.

Another question here comes down to where an image fits into this. Because there are many different possible values in a pixel, and many thousands of pixels in each image, it becomes possible to treat an image as a continuous variable. Because they have these same, base, numerical properties, it is possible to extend the same techniques done across columns in statistics and broadcast them in 2 or 3 dimensions. 

### A few examples of continuous variables

* Weight
* Running Distance
* Revenue
* Number of Tickets Sold

* Grayscale Image
* RGB Image

In [6]:
houses_df = pd.read_csv(DATA_DIR/'house-prices'/'train.csv',index_col=0)

In [7]:
cont_vars, cat_vars = cont_cat_split(houses_df)

Here are the continuous variables in this dataset.

In [8]:
print(cont_vars)

If you wish to swap one of the features of your dataset, you can use this function to do so.

In [9]:
def change_feature_type(swapFeature, cat_vars, cont_vars):
    "Move `swapFeature` from one list to the other (both lists are modified in place)."
    if swapFeature in cat_vars:
        print(f'Found: {swapFeature} in categorical variables, swapping to continuous variable')
        cont_vars.append(swapFeature)
        cat_vars.remove(swapFeature)

    elif swapFeature in cont_vars:
        print(f'Found: {swapFeature} in continuous variables, swapping to categorical variable')
        cat_vars.append(swapFeature)
        cont_vars.remove(swapFeature)

    else:
        print(f"Feature: {swapFeature} was not found in either list, please check spelling")
    return (cat_vars, cont_vars)

In [10]:
#swapThisFeature = '' #'Column_Name'

#cat_vars, cont_vars = change_feature_type(swapThisFeature,cat_vars,cont_vars)


## How can Machine Learning and Deep Learning Use Them?
<hr style="border:2px solid gray"> </hr>


Because they are numbers, continuous variables have some amazing properties that form the foundations on which science, and ultimately many Machine Learning and Deep Learning models, are built.

### Mathematical Meaning



For this, at a high level, we are going to talk about two critical properties of continuous numbers that models rely on.

* The first is the slope of a line, which comes from calculus.

* The second is distance, which comes from geometry.

The slope of a line (or the gradient, in two or more dimensions) is what regression-based models, SGD (Stochastic Gradient Descent), etc. use to find their parameters.

Distance allows us to relate different values to each other by measuring how far apart they are (Euclidean distance).

At this point, we're not getting into how the models work, but rather going over just enough to understand why continuous variables need to be preprocessed to get the best results out of a Machine Learning or Deep Learning model.
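
As a small illustration of these two properties, the sketch below computes a Euclidean distance and a fitted slope with NumPy. The two "houses" and the points on the line are made-up values, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Two hypothetical houses described by (area in m^2, number of rooms).
house_a = np.array([120.0, 4.0])
house_b = np.array([150.0, 8.0])

# Distance: square root of the summed squared differences (Euclidean distance).
distance = np.sqrt(np.sum((house_a - house_b) ** 2))

# Slope: how much y changes per unit change in x, fitted here by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0          # points that lie exactly on the line y = 2x + 1
slope, intercept = np.polyfit(x, y, deg=1)

print(f"distance = {distance:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Note that the distance is dominated by the area difference (30) rather than the room difference (4), which hints at why feature scales matter.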



## Why do We Need to Scale Continuous Variables?
<hr style="border:2px solid gray"> </hr>


### Many AI Algorithms Train Faster
<hr style="border:2px solid gray"> </hr>


Real world data can have many different forms. There could be continuous variables that have only small differences between the largest and smallest value, or there can be a range of millions. 

But, because there are numbers involved, it's not so much the actual values that are important, but the relationship between them. Scaling not only makes the models have to work less hard, but extracts the key relationships in the data which actually are important and puts them onto a level playing field.

#### Models that use Distance as a measure

Models such as K-Means or K-Nearest Neighbors use the concept of distance to extract information from the features. 

If the different features have very different ranges of values, then not only does the model have to train longer to get the relationships, but the distances are relative only to themselves rather than across the dataset. 

For example, if we had a distance in meters walked by an ant, and a distance in kilometers walked by an elephant, then the distance the elephant traveled is much greater, even if the ant traveled much further relative to its body size.
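
The effect on a distance-based model can be sketched with three made-up rows and two features of very different ranges. The feature names and values are assumptions for illustration, not taken from the housing dataset.

```python
import numpy as np

# Hypothetical features: column 0 = lot area (large range), column 1 = rooms (small range).
X = np.array([
    [8000.0, 3.0],
    [8100.0, 9.0],   # similar area to row 0, very different number of rooms
    [9500.0, 3.0],   # different area from row 0, same number of rooms
])

def euclidean(a, b):
    """Euclidean distance between two 1D arrays."""
    return np.sqrt(np.sum((a - b) ** 2))

# On raw values, the area column decides almost everything.
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# Min-max scale each column to [0, 1] so both features contribute comparably.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    row0-row1 = {raw_01:.1f}, row0-row2 = {raw_02:.1f}")
print(f"scaled: row0-row1 = {scaled_01:.3f}, row0-row2 = {scaled_02:.3f}")
```

On the raw values, row 1 looks far closer to row 0 than row 2 does, even though it differs by six rooms. After scaling, the two neighbors are roughly equally far away, because a full-range change in rooms now counts as much as a full-range change in area.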

#### Models that calculate gradients/slopes

Other models, such as those trained with SGD (Stochastic Gradient Descent), use the concept of slope to optimize the performance of the model. When the features are scaled to a similar range, training is faster than when some values are huge and others are very small.
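
One way to make this concrete, as a rough sketch on synthetic data, is to look at the condition number of `X.T @ X`, which for a linear model with squared error describes how stretched the loss surface is. A large condition number means gradient descent needs a tiny step size in one direction and crawls in the other. The feature names and ranges below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
area = rng.uniform(500, 5000, size=200)      # hundreds to thousands
rooms = rng.uniform(1, 10, size=200)         # single digits
X_raw = np.column_stack([area, rooms])

# Standardize each column: zero mean, unit standard deviation.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Condition number = largest / smallest singular value of X^T X.
# Values near 1 mean the loss surface is close to round, which suits gradient descent.
cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_std = np.linalg.cond(X_std.T @ X_std)

print(f"condition number raw: {cond_raw:.1f}, standardized: {cond_std:.2f}")
```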

### Allows model to become more complex
<hr style="border:2px solid gray"> </hr>

In Deep Learning there can be thousands upon thousands of parameters in a model. As a result, individual parameters can become quite tiny, because their combined contribution still has to add up to an output of reasonable size.

If some features have much larger values than others, they drown out the importance of the smaller-valued features. By having scaled values, we can increase the complexity of the model.



Scaling continuous variables is critical not only to save computer resources, but also to improve the accuracy of the model. Many AutoML tools do this automatically, but it is important to understand the concepts so that they are applied appropriately.

# Section 1: Scaling Tabular Data 
<hr style="border:4px solid gray"> </hr>


In this notebook, we'll start with Tabular Data because we can treat it as 1D, then build up the knowledge base so that it can be applied to images as well.

### Normalization 
<hr style="border:2px solid gray"> </hr>


#### Min Max Scaling 

One problem with a feature that spans a large range is that its large values and variance can cause the model to treat it as more important than features with a smaller variance.

Normalization puts the values into a range between zero and one.

Here is how this scaling works: subtract the smallest value x_min from each value x, so that the new values start from 0. Then divide those differences by the difference between the largest and smallest values, x_max - x_min.

$$ \tilde{x}_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}. $$
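
The formula translates almost directly into NumPy. The helper name `min_max_scale` and the example years are hypothetical, and the handling of a constant column is an assumption made for this sketch.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1D array to the range [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype="float64")
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Assumption for this sketch: a constant column is mapped to zeros
        # to avoid dividing by zero.
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

years = np.array([1950, 1975, 2000, 2010])   # hypothetical construction years
scaled_years = min_max_scale(years)
print(scaled_years)
```

The smallest year maps to 0, the largest to 1, and everything else keeps its relative position in between.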

A year column would benefit from this kind of transformation. Since we are in the 21st century, every value in this column carries a huge offset of around 2000.

### Question:
Could you give an example where this formula does not give good results?

Below, let's see how it transforms the data.

In [11]:
houses_df['YearBuilt'].mean()

In [12]:
sns.histplot(houses_df['YearBuilt'])

In a large model, this is a problem, because the parameter associated with this feature has to account for values that are very large relative to the other features.

In [13]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()

df = houses_df

column_names_to_normalize = ['YearBuilt','YearRemodAdd','GarageYrBlt']

x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)

df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp

houses_df = df.copy()

In [14]:
houses_df['YearBuilt'].describe()

In [15]:
sns.histplot(houses_df['YearBuilt'])

Although it is no longer meaningful to a person, because we've lost the dates and our associations with them, it is much clearer to a machine because all the values are between zero and one.

This transform does not change the relative distances between values. The two graphs have an identical shape; only the values on the x-axis differ.

Why is the skew unaffected by this transform? What information does this transform preserve?

In [16]:
houses_df['YearBuilt'].skew()

Not sure what the skew of a distribution is? The following page explains it very well: https://www.mathsisfun.com/data/skewness.html
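
A quick way to explore how min-max scaling interacts with skew is to try it on synthetic data. The sketch below uses a made-up exponential sample (not the housing data) and compares the skew before and after scaling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample (exponential), not the housing data.
values = pd.Series(rng.exponential(scale=50_000, size=1_000))

# Min-max scaling only shifts the values and divides them by a positive constant.
scaled = (values - values.min()) / (values.max() - values.min())

skew_before = values.skew()
skew_after = scaled.skew()
print(f"skew before: {skew_before:.4f}, after: {skew_after:.4f}")
```

Skew measures the shape of a distribution relative to its own mean and standard deviation, so shifting and rescaling every value by the same amounts leaves it unchanged.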

Generally speaking, standardization should be used when your model has a regularization term or is otherwise sensitive to the scaling of the input features. Standardization transforms all features onto the same scaling, thereby ensuring that regularization and other scaling-sensitive operations work properly.

### Standardization 
<hr style="border:2px solid gray"> </hr>


A problem with large scales of numbers is that it is not easy to see a pattern when there is a huge range of possible values.

This is where the normal distribution comes in. The normal distribution is an incredibly important property of large samples that allows a class of machine learning algorithms to work.

These algorithms work because a probability can be assigned to a sample. Based on ***how far it is from the mean***, assumptions can be made about how likely this value is to appear.

Once the values are standardized, we are no longer looking at the raw value, but at how many standard deviations it is from the mean.

Again, it is not the individual numbers that are important, but the relationship between them which really matters. Standardization shows this to the model more clearly.
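
Standardization itself is a one-line computation: subtract the mean and divide by the standard deviation. The sketch below uses made-up values; it relies on NumPy's default population standard deviation (`ddof=0`), which is also the convention scikit-learn's `StandardScaler` follows.

```python
import numpy as np

prices = np.array([100.0, 150.0, 200.0, 250.0, 800.0])   # hypothetical values

mean = prices.mean()
std = prices.std()            # population standard deviation (ddof=0)

# Each z-score says how many standard deviations a value lies from the mean.
z_scores = (prices - mean) / std

print("z-scores:", np.round(z_scores, 2))
print("mean after:", round(z_scores.mean(), 6), "std after:", round(z_scores.std(), 6))
```

The largest value ends up almost two standard deviations above the mean, which tells the model it is unusual without exposing the raw magnitude.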

https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html

### Skewed Data

The problem is that data often comes in a skewed form. This can be corrected by various transforms. 

There is no consensus on the threshold for what constitutes skewed data, but for the purposes of this notebook, a feature with an absolute skew value above 1 is considered skewed.

In [17]:
sns.histplot(houses_df['SalePrice'])
print('Skew of Sales Price Distribution: ', houses_df['SalePrice'].skew())

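This looks like a very nice graph, but it has a real problem. The distribution is strongly right-skewed (positive skew): most values are bunched on the left, with an appreciable tail stretching to the right. As the skew value printed by the cell above shows, it is quite large.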

In [18]:
skewed_features = houses_df[cont_vars].skew()
sorted_skewed_features = skewed_features.sort_values(ascending=False) # for information
skewed_features_series = pd.Series(skewed_features)

PositiveSkewedFeatures = skewed_features_series[skewed_features_series > 1]
NegativeSkewedFeatures = skewed_features_series[skewed_features_series < -1]
normalFeatures = skewed_features_series[skewed_features_series.abs() <= 1]

In [19]:
sorted_skewed_features

In [21]:
print('Number of Negative Skewed Features: ',len(NegativeSkewedFeatures))
print('Number of Positive Skewed Features: ',len(PositiveSkewedFeatures))
print('Number of not Skewed Features: ',len(normalFeatures))

These statistics, among many others can rapidly and easily be obtained by pandas methods and a quick internet search.

In this notebook, we don't go into kurtosis but this is also available in the pandas library : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.kurtosis.html.

#### Log Transforms 

Log transforms help to normalize data significantly. They are useful for distributions with a positive skew.

In [22]:
houses_df['SalePrice'] = np.log(houses_df['SalePrice'])


In [23]:
sns.histplot(houses_df['SalePrice'])

As you can see, this graph is now looking much more like a normal distribution. It has a skew much closer to zero, which means that we can use linear models, and the model will train faster.

In [24]:
houses_df['SalePrice'].skew()
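
The effect of the log transform can also be checked on synthetic data. The sketch below uses a made-up log-normal sample (not the housing data) and also shows the inverse, `np.exp`, which would be needed to bring transformed values or predictions back to the original units.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed prices drawn from a log-normal distribution.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=2_000))

log_prices = np.log(prices)          # requires strictly positive values
restored = np.exp(log_prices)        # inverse transform, back to the original units

print(f"skew before: {prices.skew():.2f}, after log: {log_prices.skew():.2f}")
```

If a column can contain zeros, `np.log1p` (with `np.expm1` as its inverse) is a common alternative, since the plain logarithm of zero is undefined.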

#### Boxcox Transform

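The Box-Cox transform, available in the scipy library, is a commonly used tool to normalize skewed data. It fits a power parameter to the data (the log transform is the special case where that parameter is 0), so it can often reduce skew further than a plain log transform. It requires strictly positive input.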
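
Before applying it to the housing columns, here is a minimal sketch of the call on synthetic data. The sample is made up; `scipy.stats.boxcox` returns both the transformed values and the fitted parameter, and `scipy.special.inv_boxcox` maps values back to the original scale.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)

# Hypothetical strictly positive, right-skewed sample.
sample = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)

# With no lambda given, boxcox estimates it by maximum likelihood and returns both.
transformed, fitted_lambda = stats.boxcox(sample)

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)

# The fitted lambda is needed to map transformed values back to the original scale.
restored = inv_boxcox(transformed, fitted_lambda)

print(f"lambda: {fitted_lambda:.3f}, skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Keeping the fitted parameter around matters in practice: the same value has to be reused on new data and when inverting the transform.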

In [25]:
#stats
from scipy import stats

df = houses_df.copy()

for col in PositiveSkewedFeatures.index:
    #print(col)
    original_skew = round(df[col].skew(),2)
    
    # Box-Cox requires strictly positive input, so np.clip raises
    # zeros (and anything else below 0.0001) to a tiny positive number
    
    transformed_col = stats.boxcox(np.clip(df[col],0.0001,None))[0]
    
    tsfm_skew = round(pd.Series(transformed_col).skew(),2)
    
    if abs(tsfm_skew) < abs(original_skew):
        df[col] = transformed_col
        print(f'{col} skew decreased from {original_skew} to {tsfm_skew}')
    else:
        pass


houses_df = df

In [26]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()

df = houses_df

columns_to_standardize = cont_vars

x = df[columns_to_standardize].values

x_scaled = standard_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=columns_to_standardize, index = df.index)
df[columns_to_standardize] = df_temp

houses_df = df

# Section 2: Scaling Image Data
<hr style="border:4px solid gray"> </hr>


Now that we have gone through scaling 1D tabular data, it is only a matter of applying an understanding of broadcasting rules to 2D and 3D data.
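
As a reminder of how broadcasting behaves, the sketch below uses a tiny made-up "image" of shape (2, 2, 3). NumPy aligns shapes from the right, so a scalar applies to every element and an array of length 3 applies one value per channel.

```python
import numpy as np

# Hypothetical tiny RGB image: 2 rows x 2 columns x 3 channels.
image = np.arange(12, dtype="float32").reshape(2, 2, 3)

# A scalar broadcasts to every element (a global operation).
global_scaled = image / 255.0

# An array of shape (3,) lines up with the last axis of (2, 2, 3),
# so each channel gets its own value (a per-channel operation).
channel_means = image.mean(axis=(0, 1))        # shape (3,)
centered = image - channel_means               # shape (2, 2, 3)

print(global_scaled.shape, channel_means.shape, centered.shape)
```

The same two patterns, one statistic for the whole array or one per channel, are all that image scaling needs.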


Because the normal distribution is so pervasive, it is only natural that it appears in images as well: pixel values often follow an approximately normal distribution.


### Why do images need to be scaled?
<hr style="border:2px solid gray"> </hr>


Scaling becomes even more important for images going into Neural Networks. Neural Networks can have millions of parameters, and the scale of the inputs affects the magnitude of the loss and the activation functions. It's easier for networks to learn when the data is scaled to zero mean, or to the range 0-1.

In [27]:
from PIL import Image
from matplotlib import pyplot as plt

imagePath = IMAGE_DIR/'week2'/'goldengatebridge.jpg'

image = Image.open(imagePath)

In [28]:
# It's always a good idea to check the type
plt.imshow(image)
plt.show()
type(image)

This image has many colors in it and is a good candidate to test using what we learned about tabular data and applying to higher dimensional data. Let's first represent it as a 3D array

In [29]:
rgb_array = np.asarray(image)

In [30]:
#always check the shape 
rgb_array.shape

Most images have pixel values between 0 and 255.

Since we are no longer using it to display it for a person to read, we should change its representation to make the mathematical meaning clearer. This not only speeds up the training process, but eliminates possible complications that become difficult to check out once the model is trained because there are many different parameters.

Scaling all values to be between 0 and 1 accomplishes this.

In [31]:
# This allows us to make decimal values
rgb_array = rgb_array.astype('float32')

In [32]:
scaled_array = rgb_array / 255.0

In [33]:
print(f'before scaling : min : {rgb_array.min()}  max : {rgb_array.max()}')

In [34]:
print(f'after scaling : min : {scaled_array.min()}  max : {scaled_array.max()}')

Simple scaling like this is a good default when you are not sure how to preprocess the data. It preserves all the quantifiable information in the dataset, and doesn't require computing any statistics.

### Image Standardization
<hr style="border:2px solid gray"> </hr>


Often, the distribution of pixels in an image will follow a normal distribution (bell curve).

This may be present across the entire dataset, or in batches of images, which allows for the transformation to be done in batches on a GPU very quickly.
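
A batched version could look like the sketch below. It assumes a made-up batch of random images laid out as (batch, height, width, channels) and uses plain NumPy; on a GPU the same reductions would be done with a tensor library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 8 RGB images, 32x32 pixels, values in [0, 255].
batch = rng.integers(0, 256, size=(8, 32, 32, 3)).astype("float32")

# Reduce over batch, height and width; keep one statistic per channel.
channel_means = batch.mean(axis=(0, 1, 2))     # shape (3,)
channel_stds = batch.std(axis=(0, 1, 2))       # shape (3,)

# Broadcasting applies the per-channel statistics to every image at once.
standardized_batch = (batch - channel_means) / channel_stds

print(standardized_batch.shape)
print(standardized_batch.mean(axis=(0, 1, 2)), standardized_batch.std(axis=(0, 1, 2)))
```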

However, in this example, we will just standardize one image so that we can form a base of understanding.

In [35]:
# calculate global mean and standard deviation
global_pixel_mean = rgb_array.mean()
print('Mean of all Pixels: ', global_pixel_mean)

global_pixel_std = rgb_array.std()
print('standard deviation of all pixels: ',global_pixel_std)

# global standardization of pixels
standardized_array = (rgb_array - global_pixel_mean) / global_pixel_std
# values below -1 will be set to -1, values above 1 will be set to 1
standardized_array = np.clip(standardized_array, -1.0, 1.0)

In [36]:
# check the mean and standard deviation after standardization
# (because of the clipping above, they will generally not be exactly 0 and 1)
standardized_pixel_mean = standardized_array.mean()
print('Mean of all Pixels: ', standardized_pixel_mean)

standardized_pixel_std = standardized_array.std()
print('standard deviation of all pixels: ',standardized_pixel_std)

### Local Standardization
<hr style="border:1px solid gray"> </hr>

It is also possible to standardize each channel individually rather than across the whole image.

In [37]:
rgb_means = rgb_array.mean(axis=(0,1), dtype='float64')
rgb_stds = rgb_array.std(axis=(0,1), dtype='float64')

print('Means: %s, Stds: %s' % (rgb_means, rgb_stds))
# per-channel standardization of pixels

standardized_array_3_channel = (rgb_array - rgb_means) / rgb_stds

In [38]:
# check the mean and standard deviation after standardization
rgb_means_after = standardized_array_3_channel.mean(axis=(0,1), dtype='float64')
rgb_stds_after = standardized_array_3_channel.std(axis=(0,1), dtype='float64')

print('Means: %s, Stds: %s' % (rgb_means_after, rgb_stds_after))

Now, the values are much more tightly packed than before, which allows for the model to see patterns more easily because the values are within well defined ranges.

# Wrap up: Discussion
<hr style="border:4px solid gray"> </hr>


### Extension Questions 

* What is a continuous variable? Why are they special?

* What is a normal distribution? What is skew?

* What is min-max scaling? Why is it used with tensors?

* What is standardization? What does it do to data?

* When should one use standardization and when should one use normalization?