### Introduction

* Featurization and feature engineering are among the most important aspects of Machine Learning.
    - Featurization means converting some type of data into a numerical vector, for example **textual data** into a numerical vector.
    - Common featurizations for text are -
        - BoW
        - TFIDF
        - AvgW2V
        - TFIDFW2V
    - **Categorical data**
        - One Hot Encoding
        - Mean Response Rate
        - Response Coding by Probability
    - **Time Series data**
        - The data or information has time as one of the features. Heart rate data can be represented in the form of time-series data.
        - Earthquake Tracking System
        - Stock Market
    - **Image data**
        - Face detection
        - Face recognition
        - Video → (Image data + time-series)
    - **Database tables**
        - Data stored in a relational database across different tables.
    - **Graph data**
        - Graph analytics is another commonly used term, and it refers specifically to the process of analyzing data in a graph format using data points as nodes and relationships as edges.
        - Recommending a friend on Facebook.

* There are many types of data, and featurization is something that researchers have spent decades studying.

### Time Series - Moving Window

The simplest featurization technique for time-series data.

* One snapshot of time is known as a window.
* The right window width depends on the problem statement and is domain specific.
* Some methods are - 
    - Mean, Std-dev
    - Medians, Quantiles
    - Max and Min
    - Max minus Min (max - min)
    - Max divided by Min (max / min)
    - Local minima and Local maxima
    - Mean crossing
    - Zero crossing
    - All of these are computed within one window (snapshot) at a time.
* Domain-specific knowledge is a must to come up with important features in the data.
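
The window statistics listed above can be sketched in plain Python as follows. This is only an illustration: the names `window_features` and `sliding_windows`, the window size, and the sample signal are assumptions made for the example, and the max / min ratio is left out because it is undefined when the minimum is zero.

```python
import statistics


def window_features(window):
    """Summary statistics for a single window of a time series."""
    mean = statistics.fmean(window)
    lo, hi = min(window), max(window)

    def crossings(values, level):
        # Count how often consecutive samples lie on opposite sides of `level`.
        # A sample exactly equal to `level` is treated as being below it.
        above = [v > level for v in values]
        return sum(a != b for a, b in zip(above, above[1:]))

    return {
        "mean": mean,
        "std": statistics.pstdev(window),
        "median": statistics.median(window),
        "min": lo,
        "max": hi,
        "range": hi - lo,
        "mean_crossings": crossings(window, mean),
        "zero_crossings": crossings(window, 0.0),
    }


def sliding_windows(series, size, step):
    """Yield windows of `size` samples, advancing `step` samples each time."""
    for start in range(0, len(series) - size + 1, step):
        yield series[start:start + size]


# Hypothetical signal; in practice this would be sensor or price data.
signal = [1.0, -2.0, 3.0, -1.0, 2.0, 4.0, -3.0, 0.5]
features = [window_features(w) for w in sliding_windows(signal, size=4, step=2)]
```

Each window yields one feature dictionary, so a long series becomes a table with one row per window.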

### Fourier Decomposition / Fourier Transformation

* It is a method to represent time-series data in terms of the frequencies it contains.
* **Frequency** - Frequency is just the inverse of time period. It is often represented as `f`. It is measured in `Hertz`.
* **Amplitude** - The height of the wave is called an amplitude of the wave. It is often represented as `A`.
* **Period** - The time for a wave to complete one oscillation is called a time period. It is often represented as `T`.
* **Fourier Transformation** - Given a composite wave that repeats, the process of decomposing the pattern into a sum of multiple sine waves is called Fourier transformation.

<img src="https://i2.wp.com/blog.fossasia.org/wp-content/uploads/2017/07/image1-1.jpg">
<!-- <img src="https://thepracticaldev.s3.amazonaws.com/i/v1p6fhprekoheceqafw1.png"> -->

**Credits** - Image from Internet

**Steps**

* Decompose the repeating data into multiple sine waves.
* Get the frequency $f_i$ and amplitude $A_i$ of each sine wave.
* Plot $A_i$ against $f_i$.
* Represent the $f_i$ and $A_i$ values in vector form.
* This vector is the Fourier representation of the data.

![fe-1](https://user-images.githubusercontent.com/63333753/126126908-50b8930b-480b-49c3-b17b-9d45e51b419b.png)
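
A minimal sketch of these steps, using a naive $O(n^2)$ discrete Fourier transform written with the standard library only. The function name `dft_amplitudes`, the sample rate, and the test signal are assumptions for illustration; real code would normally use an FFT routine from a numerical library.

```python
import cmath
import math


def dft_amplitudes(samples, sample_rate):
    """Return (frequency_hz, amplitude) pairs from a naive O(n^2) DFT."""
    n = len(samples)
    pairs = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        # Scale so that a pure sine of amplitude A is reported as A.
        # The DC term and (for even n) the Nyquist term are not doubled.
        is_edge = k == 0 or (n % 2 == 0 and k == n // 2)
        scale = 1.0 if is_edge else 2.0
        pairs.append((k * sample_rate / n, scale * abs(coeff) / n))
    return pairs


# Hypothetical signal: a 3 Hz sine (amplitude 2.0) plus a 7 Hz sine
# (amplitude 0.5), sampled at 32 Hz for one second.
sample_rate = 32
signal = [
    2.0 * math.sin(2 * math.pi * 3 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 7 * t / sample_rate)
    for t in range(sample_rate)
]

spectrum = dft_amplitudes(signal, sample_rate)
# Keep the strongest components as the feature vector [(f_i, A_i), ...].
top = sorted(spectrum, key=lambda fa: fa[1], reverse=True)[:2]
```

Here `top` recovers the two sine waves the signal was built from, which is exactly the $(f_i, A_i)$ representation described in the steps.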

### DL - LSTM

* The idea of DL is that, given a **huge amount of data**, it automatically learns a good featurization for that data. These features are also known as **Deep Learnt Features**.

* Works brilliantly for
    - Time series data
    - Text data
    - Image data

* **LSTM** - Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This behavior is required in complex problem domains like machine translation, speech recognition, and more. LSTMs are a complex area of deep learning and are extensively used for time-series data.

### Image Histogram

* The best way to featurize images is to use DL - CNN.
* Two types of image histograms -
    - **Color Histograms**
        - Separate the pixels by color channel and build a histogram of pixel values for each channel.
        - Each histogram bin holds the count of pixels whose value falls into that bin.
        - Merge the histograms of all channels into a single feature vector.
        - This vector can then be used for tasks such as detection based on color.
        - Color histograms cannot detect shapes.
    - **Edge Histograms**
        - Widely used to capture where the edges are in the whole image.
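
The color-histogram idea can be sketched as below, assuming an RGB image stored as rows of `(r, g, b)` tuples with values from 0 to 255. The function name `color_histogram`, the choice of 4 bins per channel, and the tiny 2x2 image are all made up for the example.

```python
def color_histogram(pixels, bins=4):
    """Per-channel histogram of an RGB image given as rows of (r, g, b) tuples."""
    width = 256 // bins
    hist = [[0] * bins for _ in range(3)]  # one histogram each for R, G, B
    for row in pixels:
        for pixel in row:
            for channel, value in enumerate(pixel):
                # Clamp to the last bin in case 256 is not divisible by `bins`.
                hist[channel][min(value // width, bins - 1)] += 1
    # Concatenate the three histograms into a single feature vector.
    return hist[0] + hist[1] + hist[2]


# Hypothetical 2x2 image: two reddish pixels, one green, one blue.
image = [
    [(255, 0, 0), (250, 10, 5)],
    [(0, 255, 0), (0, 0, 255)],
]
features = color_histogram(image)
```

Because the histogram only counts values, shuffling the pixels gives the same vector, which is why color histograms cannot capture shapes.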

### SIFT - Scale Invariant Feature Transforms

* Very popular for detecting objects in an image.
* It detects **keypoints**, which are mostly corners.
* For every **keypoint** it creates a `128`-dimensional vector.

<img src="https://courses.cs.washington.edu/courses/cse576/13sp/images/features.png">

* This technique is widely used for image search.
* In the above example image, the left image is known as **query image** and the right image is known as **db image**.
* Scale Invariance
    - Invariance means that something doesn't change much.
    - The features do not change much even if the sizes of the images are different - **scale invariance**.
    - SIFT features are also robust to rotation of the image - **rotational invariance**.

* SIFT works for both,
    - When the scale of image is different (size).
    - When the image is rotated (slightly).

**Credits** - Image from Internet

### DL - CNN

* Convolutional Neural Networks.
* Best DL featurization technique for image data. It learns the featurization automatically, provided there is a large amount of data.
* It is used for almost every image-related problem.

### Relational Data & Featurization

* Data stored in various tables that are related to each other by a unique id.
    - Oracle
    - MySQL
    - SQLServer
* To obtain good features from data stored in relational tables, domain knowledge is a must along with SQL.

**Example for E-commerce**

If the data is stored in tables, then to predict whether a customer will buy a product within a week, we can think of features like -

* Number of times the customer has viewed the product
* Income details of the customer
* Zipcode for geographical details
* Similar products that the customer has viewed
* Offers and discounts
* Season of the year
* Salary week

This featurization is highly dependent on domain knowledge.
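
As a rough illustration, the first few features could be pulled out of relational tables with a join and an aggregate. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names (`customers`, `product_views`), columns, and rows are hypothetical and not taken from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY, zipcode TEXT, income REAL
    );
    CREATE TABLE product_views (
        customer_id INTEGER, product_id INTEGER, viewed_at TEXT
    );
    INSERT INTO customers VALUES (1, '560001', 50000), (2, '110001', 82000);
    INSERT INTO product_views VALUES
        (1, 10, '2021-07-01'), (1, 10, '2021-07-03'), (1, 11, '2021-07-03'),
        (2, 10, '2021-07-02');
""")

# One feature row per (customer, product): the number of views,
# joined with customer attributes from the other table.
rows = conn.execute("""
    SELECT v.customer_id, v.product_id, COUNT(*) AS n_views,
           c.income, c.zipcode
    FROM product_views AS v
    JOIN customers AS c ON c.customer_id = v.customer_id
    GROUP BY v.customer_id, v.product_id
    ORDER BY v.customer_id, v.product_id
""").fetchall()
conn.close()
```

Each tuple in `rows` is one training example; features such as season or salary week would be derived from `viewed_at` in the same way.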

### Graph Data

Graph analytics is another commonly used term, and it refers specifically to the process of analyzing data in a graph format using data points as nodes and relationships as edges.

![fe-graphs](https://user-images.githubusercontent.com/63333753/126270541-dba5f1fa-e719-4ba3-8550-d00ba6b030cf.png)

The above graph is a social graph (Facebook) where -

* $u_i$ → vertex
* Edge $(u_i, u_j)$ → friendship

Given this data, if our task is to recommend new friends for a user $(u_i)$ - we can do it in the following ways -

* We can look at the number of mutual friends between $u_i$ and a candidate user.
    - The more mutual friends they have, the higher the chance that the candidate is a good recommendation.
* The number of paths between users.
    - The more paths there are, the higher the chance of the users being friends.

The features that we get by applying the concepts of graph theory are known as graph theoretic features.
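
Both features can be computed on a small graph as sketched below. The graph is stored as an adjacency dictionary of sets; the user names, the helper names `mutual_friends` and `count_paths`, and the path-length limit of 3 are assumptions made for the example.

```python
def mutual_friends(graph, u, v):
    """Number of friends shared by users u and v."""
    return len(graph[u] & graph[v])


def count_paths(graph, source, target, max_len):
    """Count simple paths from source to target using at most max_len edges."""
    def walk(node, visited, remaining):
        if node == target:
            return 1
        if remaining == 0:
            return 0
        return sum(
            walk(nxt, visited | {nxt}, remaining - 1)
            for nxt in graph[node]
            if nxt not in visited
        )

    return walk(source, {source}, max_len)


# Hypothetical friendship graph (undirected, so every edge appears twice).
graph = {
    "u1": {"u2", "u3"},
    "u2": {"u1", "u3", "u4"},
    "u3": {"u1", "u2", "u4"},
    "u4": {"u2", "u3"},
}

# Graph theoretic features for the candidate pair (u1, u4).
features = {
    "mutual_friends": mutual_friends(graph, "u1", "u4"),
    "paths_up_to_3": count_paths(graph, "u1", "u4", max_len=3),
}
```

The exhaustive path search is only practical for small graphs or short path lengths; it is meant to show the feature, not an efficient way to compute it.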

### Indicator Variables

* Indicator variables are mostly binary.

* If the data consists of `height` as a feature, then we can convert this feature into an indicator variable as follows -
    - if `height` > 150 → `1`
    - else if `height` <= 150 → `0`
    - deciding the threshold to convert a feature into an indicator variable is again a problem-specific matter.
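
A minimal sketch of the `height` example. The function name and the sample heights are made up; the threshold of 150 comes from the rule above.

```python
def height_indicator(height, threshold=150):
    """Return 1 if height is above the threshold, else 0."""
    return 1 if height > threshold else 0


heights = [142, 150, 163, 178]  # hypothetical values
indicators = [height_indicator(h) for h in heights]
```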

### Feature Binning

* It is a logical extension of indicator variables.
* It is also known as feature bucketing.
* Instead of having the feature in binary values, we will have multiple indicators associated with multiple conditions.
    - Again, finding the right thresholds is problem specific.
    - We can use a DT (decision tree) model to come up with the right thresholds using Gini impurity or information gain.
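
One way to sketch binning is with `bisect` from the standard library. The bin edges below are hand-picked for illustration (they are not learnt from a decision tree), and the helper names are assumptions. `bisect_left` is used so that a value equal to an edge falls into the lower bin, matching the `height` <= 150 → `0` rule from the indicator example.

```python
from bisect import bisect_left


def bin_index(value, edges):
    """Index of the bin containing `value`, given sorted upper bin edges."""
    return bisect_left(edges, value)


def one_hot_bin(value, edges):
    """One indicator per bin, with a 1 in the bin that contains `value`."""
    indicators = [0] * (len(edges) + 1)
    indicators[bin_index(value, edges)] = 1
    return indicators


# Hypothetical edges: bins are (-inf, 150], (150, 165], (165, 180], (180, inf).
edges = [150, 165, 180]
heights = [142, 150, 163, 178, 191]
bins = [bin_index(h, edges) for h in heights]
encoded = one_hot_bin(163, edges)
```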

### Interaction Variables

* These are also known as logical two-way interaction variables, because the new feature captures an interaction between two features.
* We use logical operators to combine more than one condition, and thus create a new feature.
    - if (`height` < 150) and (`weight` < 60) → feature = `1`
    - ...
* Apart from logical operators, we can also use numerical (arithmetic) operators.
    - (`height` * `weight`)
    - ...
* Given a task, how do we come up with the right interaction variables?
    - Decision trees (DT) are very handy here. They give us threshold values with which we can create interaction variables.
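
Both kinds of interaction can be sketched in a few lines. The thresholds 150 and 60 come from the example above; the function and feature names are made up for illustration.

```python
def interaction_features(height, weight):
    """Logical and arithmetic two-way interactions of height and weight."""
    return {
        # Logical interaction: 1 only when both conditions hold.
        "short_and_light": int(height < 150 and weight < 60),
        # Arithmetic interaction: product of the two features.
        "height_x_weight": height * weight,
    }


features = interaction_features(145, 52)  # hypothetical person
```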

### Mathematical Transforms

When we have a single feature, considering the problem statement, we can apply some mathematical transforms like -

* $\log(x), e^x$
* $\sqrt{x}, \sqrt[3]{x}$
* $x^2, x^3, x^w, \dots$ → polynomial
* $\sin(x), \cos(x), \tan(x)$
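
A small sketch of applying such transforms to one feature value, assuming the value is positive so that the logarithm and roots are defined. The function name and the income figures are hypothetical; the log transform is shown on a skewed feature because that is a common use of it.

```python
import math


def math_transforms(x):
    """Candidate transforms of a single positive feature value x."""
    return {
        "log": math.log(x),        # requires x > 0
        "sqrt": math.sqrt(x),      # requires x >= 0
        "cbrt": x ** (1.0 / 3.0),  # used here only for x >= 0
        "square": x ** 2,
        "sin": math.sin(x),
    }


# Hypothetical skewed feature: the log pulls the large value much closer.
incomes = [20_000.0, 45_000.0, 1_000_000.0]
log_incomes = [math_transforms(v)["log"] for v in incomes]
```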

### Model Specific Featurizations

* Suppose we have features `f1`, `f2`, and `f3` and we want to predict a real-valued target `y`. If we know from domain knowledge that `y` can be predicted by some linear combination of `f1`, `f2`, and `f3`, then linear models (linear regression) are a better fit for this kind of problem.
    - In such cases, decision trees may not work very well.

* If we know that `y` can be predicted from interactions between `f1` and `f2`, then it is appropriate to use `RF` (random forest) and `DT` (decision tree) models. This choice also comes from domain knowledge.

### Feature Orthogonality

* The more different (orthogonal) the features are, the better the model tends to perform.
* Features that are highly correlated with each other add little to the model, even if each of them is highly correlated with the target variable.
* Features that are correlated with the target variable but not with each other are good for building the model.
    - such a model tends to achieve higher accuracy.
* Using the errors ($y_i - \hat{y_i}$) that we obtain from the model, we can create a new feature that is correlated with the errors and re-train the model with this feature included, which often gives better results.
    - be careful not to overfit the model.
    - the new feature can be regarded as orthogonal (different) to all other features.
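
One simple way to check how different the features are is to compute pairwise Pearson correlations and flag highly correlated pairs. The sketch below is only an illustration: the feature names, values, and the 0.9 cut-off are assumptions, and correlation captures only linear redundancy.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical features: the two height columns carry the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [41.0, 23.0, 35.0, 29.0],
}

names = list(features)
# Pairs of features whose absolute correlation exceeds the assumed cut-off.
redundant = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > 0.9
]
```

Pairs that end up in `redundant` are candidates for dropping one of the two features.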

### Feature Slicing

* Slicing the data based on features is known as feature slicing.
* After slicing the data, we train a separate model on each slice.

**Steps**

* First train a model on the whole data.
* Separate the errors based on a categorical feature.
* If the error distributions differ across categories, split the data by that categorical feature.
* Build different models for the sliced data.
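
The steps can be illustrated with a deliberately trivial "model" that predicts the mean of the target. The data, the category names, and the use of a mean predictor are assumptions chosen to keep the sketch short; a real workflow would use an actual model and compare full error distributions, not just their means.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical data: (category, target) pairs.
data = [
    ("mobile", 10.0), ("mobile", 12.0), ("mobile", 11.0),
    ("desktop", 30.0), ("desktop", 32.0), ("desktop", 31.0),
]

# Step 1: train one model on the whole data (here: predict the overall mean).
global_prediction = fmean([y for _, y in data])

# Step 2: separate the errors by category.
errors = defaultdict(list)
for category, y in data:
    errors[category].append(y - global_prediction)
mean_error = {category: fmean(errs) for category, errs in errors.items()}

# Steps 3 and 4: the errors differ clearly between categories,
# so fit one model (again a mean predictor) per slice.
slice_models = {
    category: fmean([y for c, y in data if c == category])
    for category in errors
}
```

The global model is off by about 10 in opposite directions for the two categories, while each per-slice model fits its own slice closely.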

**Cases**

* Each category should behave differently from the others.
* There has to be a sufficient number of points in each slice of the data.