<table border="0" style="width:100%">
 <tr>
    <td>
        <img src="https://static-frm.ie.edu/university/wp-content/uploads/sites/6/2022/06/IE-University-logo.png" width=150>
     </td>
    <td><div style="font-family:'Courier New'">
            <div style="font-size:25px">
                <div style="text-align: right"> 
                    <b> MASTER IN BIG DATA</b>
                    <br>
                    Python for Data Analysis II
                    <br><br>
                    <em> Daniel Sierra Ramos </em>
                </div>
            </div>
        </div>
    </td>
 </tr>
</table>

# **S07: FEATURE ENGINEERING**

The previous sections outline the fundamental ideas of machine learning, but all of the examples assume that you have numerical data in a tidy, ``[n_samples, n_features]`` format.
In the real world, data rarely comes in such a form.
With this in mind, one of the more important steps in using machine learning in practice is *feature engineering*: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

In this section, we will cover a few common examples of feature engineering tasks: features for representing *categorical data*, features for representing *text*, and features for representing *images*.
Additionally, we will discuss *derived features* for increasing model complexity and *imputation* of missing data.
Often this process is known as *vectorization*, as it involves converting arbitrary data into well-behaved vectors.

## Data Types

From a Machine Learning perspective, we can find different types of data as input for our problem:

 - **Numerical** - This is the most typical data type we can find in a dataset. It represents quantitative data, that is, a measurement of something (e.g., the price of a house, its number of bedrooms, the age of a person). Numerical data can be continuous (e.g., the price) or discrete (e.g., the number of bedrooms).
 - **Categorical** - This type of data represents an attribute or characteristic of something. For example, the weather (Raining, Sunny, Foggy, etc.) is a categorical variable, and so is the label in a classification problem (e.g., whether a customer will churn or not).
 - **Time Series** - In essence, this is a numerical data type, but it needs special consideration because the samples follow a specific order, for example, the stock price series of a company.
 - **Text** - This is a sequence of words with some structure, for example a document or the content of an email. The field of AI that studies applications with text is called NLP (Natural Language Processing).
 - **Audio** - An audio (or voice) track is a 1D signal. In essence, it's a time series where the y-axis is usually the amplitude of the signal.
 - **Image** - An image is also a signal, but in 2D. It's encoded as a map of pixels in two dimensions. Each pixel is a number, depending on the encoding used.
 - **Video** - A sequence of images (a 3D signal). It's just adding *time* to a collection of images in an ordered way.
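
As a rough illustration of the list above, the following sketch shows how these data types typically look once they are loaded into memory. All values and sizes are invented for illustration; real audio, image, or video data would be read from files with a dedicated library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values and sizes are made up for illustration only
tabular = pd.DataFrame({
    "price": [850000.0, 700000.0],     # numerical, continuous
    "rooms": [4, 3],                   # numerical, discrete
    "weather": ["Sunny", "Foggy"],     # categorical
})

audio = np.zeros(16000)          # 1D signal: amplitude over time
image = np.zeros((28, 28))       # 2D signal: grid of pixel values (grayscale)
video = np.zeros((10, 28, 28))   # 3D signal: 10 frames of 28x28 pixels

print(tabular.dtypes)
print(audio.ndim, image.ndim, video.ndim)
```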

## Models work with numbers...

To train a machine learning model, **we need numeric data**. The reason is that models are just **mathematical representations** of the real world, and the *training* process of a model usually involves some kind of numerical optimization or matrix multiplication under the hood.

**We cannot directly multiply two matrices that contain text!** We must first apply some kind of processing.

The step in which we build our set of features, that is, our feature matrix (X), is known as the **feature engineering process**.
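
A minimal sketch of why this matters, using `numpy` only. The weights and feature values are arbitrary numbers chosen for illustration: a numeric matrix can be multiplied by a weight vector, while a matrix of text cannot even be converted to numbers without some encoding step.

```python
import numpy as np

weights = np.array([0.5, 2.0])   # arbitrary "model" weights, for illustration

# Numeric features: the matrix product is well defined
numeric_X = np.array([[3.0, 1.0],
                      [2.0, 4.0]])
print(numeric_X @ weights)

# Text features: there is no direct way to turn these into floats
text_X = np.array([["Fremont", "3"],
                   ["Queen Anne", "4"]])
conversion_failed = False
try:
    text_X.astype(float)
except ValueError as err:
    conversion_failed = True
    print("Cannot convert text to numbers:", err)
```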

In [1]:
import pandas as pd

## Derived Features

The most basic operation in the **feature engineering** process is the calculation of **derived features**, that is, features that result from combining the original features in the dataset (e.g., ratios, nonlinear transformations, etc.).

For example, imagine we have a dataset like this one, which contains monthly information about the customers of a telco company:

In [8]:
customers_data = {"n_calls": [23, 45, 10, 346], "mb_consumption": [1034, 2783, 22909, 12211]}
customers_data = pd.DataFrame(customers_data)
customers_data

| | n_calls | mb_consumption |
|---|---|---|
| 0 | 23 | 1034 |
| 1 | 45 | 2783 |
| 2 | 10 | 22909 |
| 3 | 346 | 12211 |


We want to calculate two new derived features:
 - `n_calls_per_day`
 - `ratio_calls_mb`

In [9]:
# Approximate daily average, assuming a 30-day month
customers_data["n_calls_per_day"] = customers_data["n_calls"] / 30
# Number of calls per MB consumed
customers_data["ratio_calls_mb"] = customers_data["n_calls"] / customers_data["mb_consumption"]

In [10]:
customers_data

| | n_calls | mb_consumption | n_calls_per_day | ratio_calls_mb |
|---|---|---|---|---|
| 0 | 23 | 1034 | 0.766667 | 0.022244 |
| 1 | 45 | 2783 | 1.5 | 0.01617 |
| 2 | 10 | 22909 | 0.333333 | 0.000437 |
| 3 | 346 | 12211 | 11.533333 | 0.028335 |
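
Derived features are not limited to ratios. As a small sketch of a nonlinear transformation, the cell below adds a log-scaled version of `mb_consumption`, which compresses very large values. The column name `log_mb_consumption` and the choice of `np.log1p` are our own assumptions for illustration; the second part shows one possible way to protect a ratio against division by zero.

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "n_calls": [23, 45, 10, 346],
    "mb_consumption": [1034, 2783, 22909, 12211],
})

# log1p(x) = log(1 + x), which stays defined when a customer consumed 0 MB
customers["log_mb_consumption"] = np.log1p(customers["mb_consumption"])

# Replace a zero denominator with NaN so the ratio becomes NaN instead of inf
safe_mb = customers["mb_consumption"].replace(0, np.nan)
customers["ratio_calls_mb"] = customers["n_calls"] / safe_mb

print(customers)
```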


## Dealing with different data types in `scikit-learn`

### Categorical Features

One common type of non-numerical data is *categorical* data.
For example, imagine you are exploring some data on housing prices, and along with numerical features like "price" and "rooms", you also have "neighborhood" information.
Your data might look something like this:

In [46]:
data = [
    {'price': 850000., 'rooms': 4., 'neighborhood': 'Queen Anne'},
    {'price': 700000., 'rooms': 3., 'neighborhood': 'Fremont'},
    {'price': 650000., 'rooms': 3., 'neighborhood': 'Wallingford'},
    {'price': 600000., 'rooms': 2., 'neighborhood': 'Fremont'}
]

data = pd.DataFrame(data)
data.head()

| | price | rooms | neighborhood |
|---|---|---|---|
| 0 | 850000.0 | 4.0 | Queen Anne |
| 1 | 700000.0 | 3.0 | Fremont |
| 2 | 650000.0 | 3.0 | Wallingford |
| 3 | 600000.0 | 2.0 | Fremont |


#### **One Hot Encoding**

In [47]:
from sklearn.preprocessing import OneHotEncoder

Instantiate the `OneHotEncoder`

In [48]:
ohe = OneHotEncoder(sparse_output=False)

We use the `select_dtypes` method in `pandas` to select only the columns with object dtype (`"O"`), which here are the categorical ones

In [49]:
cat_data = data.select_dtypes("O")

Apply the `OneHotEncoder` directly to the data with `fit_transform` (equivalent to `fit` followed by `transform`)

In [55]:
cat_data_ohe = ohe.fit_transform(cat_data)

Build a `DataFrame` with the result to append the new columns to the original dataset

In [None]:
cat_data_ohe = pd.DataFrame(cat_data_ohe)
cat_data_ohe.columns = [f"Neigh_{category}" for category in ohe.categories_[0]]

In [53]:
data = pd.concat([data, cat_data_ohe], axis=1)

In [54]:
data

| | price | rooms | neighborhood | Neigh_Fremont | Neigh_Queen Anne | Neigh_Wallingford |
|---|---|---|---|---|---|---|
| 0 | 850000.0 | 4.0 | Queen Anne | 0.0 | 1.0 | 0.0 |
| 1 | 700000.0 | 3.0 | Fremont | 1.0 | 0.0 | 0.0 |
| 2 | 650000.0 | 3.0 | Wallingford | 0.0 | 0.0 | 1.0 |
| 3 | 600000.0 | 2.0 | Fremont | 1.0 | 0.0 | 0.0 |


Drop the original `neighborhood` column, which is now redundant, so that only numeric columns remain for training a model

In [57]:
data = data.drop(columns=["neighborhood"])

In [58]:
data

| | price | rooms | Neigh_Fremont | Neigh_Queen Anne | Neigh_Wallingford |
|---|---|---|---|---|---|
| 0 | 850000.0 | 4.0 | 0.0 | 1.0 | 0.0 |
| 1 | 700000.0 | 3.0 | 1.0 | 0.0 | 0.0 |
| 2 | 650000.0 | 3.0 | 0.0 | 0.0 | 1.0 |
| 3 | 600000.0 | 2.0 | 1.0 | 0.0 | 0.0 |


### Text Features

Another common need in feature engineering is to convert text to a set of representative numerical values.
For example, most automatic mining of social media data relies on some form of encoding the text as numbers.
One of the simplest methods of encoding data is by *word counts*: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

For example, consider the following set of three phrases:

In [24]:
sample = ['problem of evil is the evil',
          'evil queen',
          'horizon problem']

#### **Count Vectorizer**

For a vectorization of this data based on word count, we could construct a column representing the word "problem," the word "evil," the word "horizon," and so on.
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)

The result is a sparse matrix recording the number of times each word appears. Displaying it directly only shows a summary of its shape and storage format:

In [26]:
X

<3x7 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

ℹ️ **A note about sparse matrices**: A matrix can be represented (stored in memory) in two ways: the **dense** form, or the **sparse** form.
 - *dense* - It's the typical way to represent a matrix, with its rows and columns. In memory, we store each value in its corresponding position in the matrix.
 - *sparse* - It's a very common way of representing a matrix that **has a lot of zeros**. If a matrix has a lot of zeros and we store it as a *dense* matrix, we're wasting memory. The *sparse* representation works as follows: instead of storing all values in memory (zeros included), we store only the non-zero values and their positions. The remaining positions are assumed to be zero.
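
To make the difference concrete, the sketch below builds the Compressed Sparse Row (CSR) form of a small count matrix with `scipy.sparse` and prints what is actually stored. The matrix values are the same word counts used in this section.

```python
import numpy as np
from scipy import sparse

dense = np.array([[2, 0, 1, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0, 0]])

sp = sparse.csr_matrix(dense)

print(sp.nnz)      # number of stored (non-zero) values
print(sp.data)     # the non-zero values, row by row
print(sp.indices)  # the column index of each stored value
print(sp.indptr)   # where each row starts inside data/indices

# Converting back recovers the original dense matrix
print((sp.toarray() == dense).all())
```

Only 9 of the 21 entries are stored here; the saving becomes much larger for real vocabularies with thousands of columns.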

To convert a `scipy` sparse matrix to its dense form, we can call `toarray()`:

In [27]:
X.toarray()

array([[2, 0, 1, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0]])

The `vec` object contains information about the fitted CountVectorizer

In [30]:
# access the detected vocabulary
vec.vocabulary_

{'problem': 4, 'of': 3, 'evil': 0, 'is': 2, 'the': 6, 'queen': 5, 'horizon': 1}

In [31]:
# access the feature names in the order in which X matrix is arranged
vec.get_feature_names_out()

array(['evil', 'horizon', 'is', 'of', 'problem', 'queen', 'the'],
      dtype=object)

In [32]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

| | evil | horizon | is | of | problem | queen | the |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |


There are some issues with this approach, however: the raw word counts lead to features which put too much weight on words that appear very frequently, and this can be sub-optimal in some classification algorithms.
One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*) which weights the word counts by a measure of how often they appear in the documents.
The syntax for computing these features is similar to the previous example:

#### **Tf-Idf** (Term Frequency - Inverse Document Frequency)

The TF-IDF metric can be formulated as follows:
$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

where
 - TF - Frequency of a specific word in the document
 - IDF - Inverse of the fraction of documents (records) that contain that word, usually log-scaled
 
The **TF-IDF metric weights the words**, giving them more importance if they are scarce. That is, the fewer documents a word appears in, the larger its TF-IDF value is.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

| | evil | horizon | is | of | problem | queen | the |
|---|---|---|---|---|---|---|---|
| 0 | 0.626632 | 0.0 | 0.411973 | 0.411973 | 0.313316 | 0.0 | 0.411973 |
| 1 | 0.605349 | 0.0 | 0.0 | 0.0 | 0.0 | 0.795961 | 0.0 |
| 2 | 0.0 | 0.795961 | 0.0 | 0.0 | 0.605349 | 0.0 | 0.0 |
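
To see where these numbers come from, the sketch below recomputes them by hand. It assumes the default settings of `TfidfVectorizer`: raw word counts as TF, a smoothed IDF of the form $\ln\frac{1 + n}{1 + \text{df}} + 1$ (with $n$ documents and $\text{df}$ documents containing the word), and L2 normalization of each row. Other settings would give different values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sample = ['problem of evil is the evil',
          'evil queen',
          'horizon problem']

counts = CountVectorizer().fit_transform(sample).toarray()  # TF: raw counts

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF

tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

# Compare against the library result
expected = TfidfVectorizer().fit_transform(sample).toarray()
print(np.allclose(tfidf, expected))
```

Note how "evil" appears in two of the three documents and therefore gets a smaller IDF than "queen", which appears in only one.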


ℹ️ **Brief note about NLP processing**: When we work with text, it is essential to transform the text into numbers in order to fit a model, but before that there are other operations we should apply to make sure the model is built correctly:
 - **Transform the text to lower case** - This is essential so that the model makes no distinction between "Evil" and "evil".
 - **Tokenize the text** - In order to deal with separate words, we need to tokenize the text, usually by splitting on whitespace (`\s+`).
 - **Remove stopwords and punctuation signs** - If we want to effectively encode the useful information of a text, we probably want to get rid of words like `is`, `the`, etc. That is, drop pronouns, articles, adverbs... and keep just the verbs (actions) and nouns (subjects).
 - **Build n-grams** - Some groups of words usually appear together, for example "carrot cake". If we treat the words separately, we may lose information about the context of the text.
 - **Word stemming and lemmatization**. (Refer to https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html for more info)
    - **Stemming** (simpler): Usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. *Example: cars -> car*
    - **Lemmatization** (more advanced): Usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. *Example: are -> be, doing -> do*
 - **POS (Part of Speech) tagging** - Consists of identifying the role of every word in the context of the text, that is, VERB, NOUN, ADVERB, etc. This is a more advanced technique.

In [34]:
help(CountVectorizer)

Help on class CountVectorizer in module sklearn.feature_extraction.text:

class CountVectorizer(_VectorizerMixin, sklearn.base.BaseEstimator)
 |  CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
 |  
 |  Convert a collection of text documents to a matrix of token counts.
 |  
 |  This implementation produces a sparse representation of the counts using
 |  scipy.sparse.csr_matrix.
 |  
 |  If you do not provide an a-priori dictionary and you do not use an analyzer
 |  that does some kind of feature selection then the number of features will
 |  be equal to the vocabulary size found by analyzing the data.
 |  
 |  Read more in the :ref:`User Guide <text_feature_extraction>`.
 |  
 |  Parameters
 |  -----

In [40]:
vec = CountVectorizer(
    stop_words="english"
)
X = vec.fit_transform(sample)

In [41]:
vec.vocabulary_

{'problem': 2, 'evil': 0, 'queen': 3, 'horizon': 1}