<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Lab 5: Feature Engineering

Let's get started with the initialization of the notebook by importing the required packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from seaborn import load_dataset
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import warnings
warnings.filterwarnings('ignore')

# 1. Load Uber Movement Speeds Dataset For Berlin

To enable easy visualization of the model fitting process we will use a simple traffic speeds dataset, provided by Uber at https://movement.uber.com/cities/berlin/downloads/speeds?lang=en-US

In [None]:
df = pd.read_csv("data/movement-speeds-hourly-berlin-2020-3-joint-location.csv")
df.head()

Compute an absolute time reference using the day and hour. Times start at 0 (March 1, 2020, 12-1 AM) and increase by 1 per hour up to 191 (March 8, 2020, 11 PM-12 AM).

In [None]:
df['time'] = ...
df = df.sort_values(by='time')
df.head()

Plot the average movement speed over time, aggregating across all locations.

In [None]:
time = df.groupby(by=['time']).agg('mean').reset_index()
plt.plot(time['time'], time['speed_kph_mean'])

# Fitting Linear Models with Scikit-Learn

Notebook by Joseph E. Gonzalez, Alvin Wan

In this lesson, we introduce the normal equations as well as several other algorithms to provide some insight behind how these techniques work and perhaps more importantly how they fail.  However, in practice you will seldom need to implement the core algorithms and will instead use various machine learning software packages.  In this class, we will focus on the widely used scikit-learn package.

Scikit-learn, or as the cool kids call it sklearn (pronounced s-k-learn), is a large package of useful machine learning algorithms. For this lecture, we will use the `LinearRegression` model in the [`linear_model`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) module.  The fact that `linear_model` is an entire module containing many different models might suggest that we still have a lot to cover (we do!).

**What you should know about `sklearn` models:**

1. Models are created by first building an instance of the model:
```python
model = SuperCoolModelType(args)
```
1. You then fit the model by calling the **fit** function passing in data:
```python
model.fit(X, Y)
```
1. You then can make predictions by calling **predict**:
```python
model.predict(X)
```

The neat part about sklearn is that most models behave like this.  So if you want to try a cool new model, you just change the class of model you are using.
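As a minimal sketch of this pattern, the same three steps work unchanged when the model class is swapped. The toy data and the choice of `Ridge` as the second model are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data (assumed for illustration): y = 3x + 2 exactly
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 * X_toy[:, 0] + 2.0

fitted = {}
for model_class in (LinearRegression, Ridge):
    toy_model = model_class()          # 1. build an instance
    toy_model.fit(X_toy, y_toy)        # 2. fit it to data
    preds = toy_model.predict(X_toy)   # 3. make predictions
    fitted[model_class.__name__] = preds
    print(model_class.__name__, preds[:3])
```

Only the class name changes between iterations; the calls to `fit` and `predict` stay the same.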


# 2. Fit OLS Model using Scikit-Learn

In [None]:
def plot_y_vs_yhat(df, y, yhat):
    plt.figure()
    Y, Yhat = df[y], df[yhat]
    plt.scatter(Yhat, Y, label='(yhat, y)')
    cmin, cmax = max(Yhat.min(), Y.min()), min(Yhat.max(), Y.max())
    plt.plot([cmin, cmax], [cmin, cmax], color='red', label='y=yhat')
    plt.legend()

In [None]:
def plot_predictions(df, x, y, yhat):
    plt.figure()
    X, Y, Yhat = df[x], df[y], df[yhat]
    plt.plot(X, Y, label='ground truth')
    plt.plot(X, Yhat, label='prediction')
    plt.legend()

In [None]:
def plot_predictions_over_time(df, x, y, yhat):
    time = df.groupby(by='time').agg('mean').reset_index()
    plot_predictions(time, x, y, yhat)

We import the `LinearRegression` model:

In [None]:
from sklearn.linear_model import LinearRegression

Create an instance of the model. Like before, we will use a model without an intercept term (`fit_intercept=False`):

In [None]:
model = LinearRegression(fit_intercept=False)

## 2.a Train OLS

Fit the model by passing it the $X$ and $Y$ data:

In [None]:
X, Y = df[['time']], df[["speed_kph_mean"]] # extract data, labels

In [None]:
model.fit(X, Y)

## 2.b Predict with OLS

Make some predictions and save them back to the original DataFrame:

In [None]:
df['Yhat_sklearn'] = Yhat = model.predict(X)
df

## 2.c Analyze Fit with OLS

Analyzing the fit again:

In [None]:
plot_y_vs_yhat(df.sample(frac=0.01), y="speed_kph_mean", yhat="Yhat_sklearn")

We can also plot the residual distribution.

In [None]:
df['residuals_sklearn'] = df['speed_kph_mean'] - df['Yhat_sklearn']
_ = plt.hist(df.sample(frac=0.01)['residuals_sklearn'], bins=100)

## 2.d Evaluate OLS using Scikit-Learn

As we tune the features in our model it will be important to define some useful error metrics.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [None]:
print("Mean Squared Error:", mean_squared_error(Y, Yhat))

In [None]:
print("Mean Absolute Error:", mean_absolute_error(Y, Yhat))

In [None]:
print("Root Mean Squared Error:", np.sqrt(mean_squared_error(Y, Yhat)))

In [None]:
print("Standard Deviation of Residuals:", np.std(df['residuals_sklearn']))
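For reference, the metrics above can be written directly in NumPy. This is a sketch on made-up numbers (the arrays `y_true` and `y_pred` are assumptions, not values from the dataset). It also shows why the standard deviation of the residuals matches the RMSE only when the residuals have zero mean: the MSE equals the variance of the residuals plus their squared mean.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 41.0])
resid = y_true - y_pred          # [-2, 2, -3, -1]

mse = np.mean(resid ** 2)        # mean squared error
mae = np.mean(np.abs(resid))     # mean absolute error
rmse = np.sqrt(mse)              # root mean squared error

# Decomposition: MSE = Var(residuals) + mean(residuals)^2
decomposed = np.var(resid) + np.mean(resid) ** 2
print(mse, mae, rmse, decomposed)
```

Here the residuals have mean -1, so `np.std(resid)` is smaller than the RMSE.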

As we play with the model, we might want a standard set of metrics and visualizations:

In [None]:
def evaluate(df, y, yhat):
    """Compute and print error metrics"""
    Y, Yhat = df[y], df[yhat]
    metrics = {
        'MSE': mean_squared_error(Y, Yhat),
        'MAE': mean_absolute_error(Y, Yhat),
        'RMSE': np.sqrt(mean_squared_error(Y, Yhat)),
    }
    for metric, value in metrics.items():
        print(f"{metric}: {value}")
    return metrics

In [None]:
def evaluate_and_plot(df, x, y, yhat):
    """Report error metrics and also visualize"""
    evaluate(df, y, yhat)
    plot_y_vs_yhat(df.sample(frac=0.01), y, yhat)
    plot_predictions_over_time(df, x, y, yhat)

Examining our latest model:

In [None]:
evaluate_and_plot(df, x='time', y='speed_kph_mean', yhat='Yhat_sklearn')

# 4. Fit Biased OLS using Scikit-Learn

Redo the above, this time using a model with the intercept term. This is as simple as passing `fit_intercept=True` to the `LinearRegression` model constructor.

In [None]:
biased = ... # create model
... # train model
df['Yhat'] = ... # predict with model

In [None]:
evaluate_and_plot(df, x='time', y='speed_kph_mean', yhat='Yhat')

Let's amend our table of results with the additional metrics above.

||MSE|MAE|RMSE|
|---|---|---|---|
|**OLS**|643|19.8|25.3|
|**Biased OLS**|213|10.6|14.6|

Examining the above data we see that there is some **periodic** structure as well as some **curvature**. Can we fit this data with a linear model?

Recall that during the lecture we learned that feature engineering helps us achieve three main goals:

1. Express non-linear relationships.

2. Capture domain knowledge.

3. Encode non-numeric features.

# Modeling Non-linear Relationships

Notebook by Joseph E. Gonzalez, Alvin Wan

In this notebook, we will use basic feature transformations (feature engineering) to model non-linear relationships using linear models.

**What does it mean to be a _Linear Model_?**

Linear models are **linear combinations** of features.  These models are therefore linear in the **parameters** but not necessarily the underlying data.  We can encode non-linearity in our data through the use of feature functions:


$$
f_\theta\left( x \right) = \phi(x)^T \theta = \sum_{j=0}^{p} \phi(x)_j \theta_j
$$

where $\phi$ is an *arbitrary function* from $x\in \mathbb{R}^d$ to $\phi(x) \in \mathbb{R}^{p+1}$. We could also denote these as a collection of separate feature functions $\phi_j$ from $x\in \mathbb{R}^d$ to $\phi_j(x) \in \mathbb{R}$:

$$
\phi(x) = \left[\phi_0(x), \phi_1(x), \ldots, \phi_p(x) \right]
$$


We often refer to these $\phi_j$ as **feature functions**, and their design plays a critical role in both how we capture prior knowledge and our ability to fit complicated data.
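A small sketch of this idea, using NumPy only. The feature function `phi_quadratic` and the synthetic data are assumptions chosen for illustration. The target is a non-linear function of $x$, yet the fit is an ordinary least-squares problem because the model is linear in $\theta$.

```python
import numpy as np

def phi_quadratic(x):
    """Map a 1-D array x to the features [1, x, x^2] (an illustrative choice)."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Synthetic data (assumed): y = 1 + 2x - 0.5x^2, a non-linear function of x
x_quad = np.linspace(-3, 3, 25)
y_quad = 1.0 + 2.0 * x_quad - 0.5 * x_quad ** 2

# Least squares is linear in theta even though phi is non-linear in x
theta_quad, *_ = np.linalg.lstsq(phi_quadratic(x_quad), y_quad, rcond=None)
print(theta_quad)
```

The recovered `theta_quad` holds one coefficient per feature function, in the order $\phi_0, \phi_1, \phi_2$.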

# 2. Fit Biased OLS Model

We'll expand the data features we're allowed to use. Instead of just taking in time, our OLS model will now take in time and location.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
X, Y = df[['Latitude', 'Longitude', 'time']], df[["speed_kph_mean"]] # extract data, labels

In [None]:
model = LinearRegression(fit_intercept=True)
model.fit(X, Y)
df['Yhat'] = model.predict(X)

In [None]:
evaluate_and_plot(df, x='time', y='speed_kph_mean', yhat='Yhat')

Let's see our results so far.


||MSE|MAE|RMSE|
|---|---|---|---|
|**OLS**|643|19.8|25.3|
|**Biased OLS**|213|10.6|14.6|
|**Biased OLS + Location**|201|10.7|14.2|

Examining the above data we see that there is some **periodic** structure as well as some **curvature**. Can we fit this data with a linear model?

# 3. Polynomial Features

There is some curvature.  We can introduce polynomial terms to try to improve the fit of our model.

In [None]:
def phi_curved(X):
    return np.hstack([
        X,
        X * X,
        np.expand_dims(np.prod(X, axis=1), 1),
        X ** 3,
    ])

Can you guess the new number of features?

In [None]:
curvedX = phi_curved(X)
curvedX.shape

In [None]:
curved = LinearRegression()
curved.fit(curvedX, Y)
df['Yhat_curved'] = curved.predict(curvedX)

In [None]:
evaluate_and_plot(df, x='time', y='speed_kph_mean', yhat='Yhat_curved')

Looking at our results so far, we see that higher-order polynomial terms improved our best MSE by about 13% (from 201 to 175).

||MSE|MAE|RMSE|
|---|---|---|---|
|**OLS**|643|19.8|25.3|
|**Biased OLS**|213|10.6|14.6|
|**Biased OLS + Location**|201|10.7|14.2|
|**Biased OLS + Location + Poly**|175|10.1|13.2|

# 4. Sinusoidal Features

In the following, we will add a few different sine functions at different frequencies and offsets.

$$
\sin\left(2 \pi \cdot \textbf{frequency} \cdot X + \textbf{phase}\right)
$$

Note that for this to remain a linear model, we cannot make the frequency or phase of the sine function a model parameter.  In fact, these are actually **hyperparameters** of the model that would need to be tuned using either domain knowledge or other search procedures.

In [None]:
def phi_periodic(X):
    return np.hstack([
        X,
        np.sin(X),
        np.sin(0.26*X),
        np.sin(X - 6),
        np.sin(0.26 * X - 6),
    ])
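One way to read the constant `0.26` above: it is close to $2\pi/24 \approx 0.262$, which would correspond to a 24-hour period when `time` is measured in hours. This interpretation is an assumption; the notebook does not state it. The sketch below, on synthetic data, also illustrates a standard identity, $A\sin(\omega t + \varphi) = a\sin(\omega t) + b\cos(\omega t)$ with $a = A\cos\varphi$ and $b = A\sin\varphi$. With the frequency held fixed, including both a sine and a cosine feature lets a linear model recover amplitude and phase.

```python
import numpy as np

omega = 2 * np.pi / 24           # fixed frequency: one cycle per 24 hours
t = np.arange(0, 192, dtype=float)

# Synthetic signal (assumed): offset 30, amplitude 5, phase 1.2 radians
y_wave = 30.0 + 5.0 * np.sin(omega * t + 1.2)

# Features: intercept, sin, cos. All are fixed functions of t.
Phi_wave = np.column_stack([np.ones_like(t), np.sin(omega * t), np.cos(omega * t)])
theta_wave, *_ = np.linalg.lstsq(Phi_wave, y_wave, rcond=None)

amplitude = np.hypot(theta_wave[1], theta_wave[2])
phase = np.arctan2(theta_wave[2], theta_wave[1])
print(round(omega, 3), amplitude, phase)
```

The frequency still has to be chosen by hand, which is why it remains a hyperparameter.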

In [None]:
phi_periodic(X).shape

Let's combine all the features we have so far.

In [None]:
def phi_curved_and_periodic(X):
    return np.hstack([phi_curved(X), phi_periodic(X)])

In [None]:
crazyX = ...
crazyX.shape

Notice that to make predictions I need to actually apply the $\Phi$ feature function to my data.

In [None]:
crazy = LinearRegression()
crazy.fit(crazyX, Y)
df['Yhat_crazy'] = crazy.predict(crazyX)

In [None]:
evaluate_and_plot(df, x='time', y='speed_kph_mean', yhat='Yhat_crazy')

Looking at our final table of results, our sinusoidal features improved our best MSE by about 5% (from 175 to 167). Compared with our original OLS result, we've reduced our MSE by about 74%, from 643 to 167, a reduction of nearly 4x.

||MSE|MAE|RMSE|
|---|---|---|---|
|**OLS**|643|19.8|25.3|
|**Biased OLS**|213|10.6|14.6|
|**Biased OLS + Location**|201|10.7|14.2|
|**Biased OLS + Location + Poly**|175|10.1|13.2|
|**Biased OLS + Location + Poly + Sin**|167|9.9|12.9|

## Success!

Using non-linear feature functions, we're now able to model non-linear relationships.

# 6. Imputing Missing Values with Scikit-Learn


In this notebook, we discuss how to deal with missing values. In the process, we will work through feature engineering to construct a model that predicts vehicle efficiency.


## 6.1 Load `mpg` Dataset

For this notebook, we will use the seaborn `mpg` data set which describes the fuel mileage (measured in miles per gallon or mpg) of various cars along with characteristics of those cars.  Our goal will be to build a model that can predict the fuel mileage of a car based on the characteristics of that car.

In [None]:
data = pd.read_csv("data/mpg.csv")
data

Notice that several columns are not quantitative continuous. Ignore these for now; we will deal with them in the next lesson.

In [None]:
Y = data[["mpg"]]

## 6.2 Keeping Track of Progress

Because we are going to be building multiple models with different feature functions it is important to have a standard way to track each of the models.

The following function takes the name of a model, a DataFrame, and the names of the columns holding the true values and the predictions.  It then evaluates the predictions, draws the $Y$ vs $\hat{Y}$ scatter plot, and returns a table comparing the new model's metrics with those of the previous models.

In addition, it updates the `results` dictionary to include the new model for future comparisons.

In [None]:
results = {}

In [None]:
def evaluate_and_plot_mpg(name, df, y, yhat):
    metrics = evaluate(df, y, yhat)
    plot_y_vs_yhat(df, y, yhat)

    results[name] = metrics
    return pd.DataFrame(results).sort_values(by='MSE', axis=1).T

## 6.3 Imputing Missing Quantitative Continuous Features

This data set has several quantitative continuous features that we can use to build our first model.  However, even for quantitative continuous features, we may want to do some additional feature engineering.  Things to consider are:

1. transforming features with non-linear functions (log, exp, sine, polynomials)
2. constructing products or ratios of features
3. dealing with missing values

### Missing Values

We can use the Pandas `DataFrame.isna` function to find rows with missing values:

In [None]:
# show the rows that contain a NaN value
data[data.isna().any(axis=1)]

There are many ways to deal with missing values.  A common strategy is to substitute the mean.  Because missing values can actually be useful signal, it is often a good idea to include a feature indicating that the value was missing.

In [None]:
def impute_mpg(df):
    Phi = df[["cylinders", "displacement",
              "horsepower", "weight",
              "acceleration",
              "model_year"]].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna()
    Phi = Phi.fillna(Phi.mean())
    return Phi

### Baseline Biased OLS Model

Using our feature function, we can fit our first model to the transformed data:

In [None]:
def train_model_with_phi(df, phi, X, Y):
    model = LinearRegression()
    Phi = phi(X)
    model.fit(Phi, Y)
    yhat = model.predict(Phi)
    return model, yhat

In [None]:
basic, yhat = train_model_with_phi(...
data['Yhat'] = yhat

In [None]:
evaluate_and_plot_mpg('basic', data, y="mpg", yhat="Yhat")

## 6.4 Stable Feature Functions

Unfortunately, the feature function we just implemented applies a different transformation depending on what input we provide. Specifically, if the `horsepower` is missing when we go to make a prediction, we will substitute it with a different mean than was used when we fit our model.  Furthermore, if we only want predictions on a few records and the `horsepower` is missing from those records, then the feature function will be unable to substitute a meaningful value.

For example, if we were to get new records that look like the following:

In [None]:
new_data = data[data['horsepower'].isna()].head(3)
new_data

The feature function is unable to substitute the mean since none of the records have a `horsepower` value.

In [None]:
try:
    basic.predict(impute_mpg(new_data))
except Exception as e:
    print(e)

We can fix this by computing the mean on the original data and using that mean on any new data.

In [None]:
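# The default argument is evaluated once, when the function is defined,
# so the mean of the original data is reused for any new data.
# numeric_only=True skips string columns such as `origin` and `name`.
def impute_mpg(df, data_mean=data.mean(numeric_only=True)):
    feature_cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
    Phi = df[feature_cols].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna().astype(float)
    Phi = Phi.fillna(data_mean)
    return Phi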

In [None]:
impute_mpg(new_data)

In [None]:
basic.predict(impute_mpg(new_data))

## 6.5 Scikit-learn Model Imputer

Because these kinds of transformations are fairly common, scikit-learn has built-in transformations for data imputation.  These transformations share a common pattern of `fit` and `transform`.  You first `fit` the transformation to your data, and then you can `transform` your data and any future data using the same transformation.
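A minimal sketch of the `fit`/`transform` pattern on a toy column (the arrays below are made up for illustration). The mean is learned once from the training data and then reused on new data, even when every new value is missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_col = np.array([[100.0], [150.0], [np.nan], [200.0]])
new_col = np.array([[np.nan], [np.nan]])   # new records with no observed values

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(train_col)                 # learns the mean of the observed values

filled = toy_imputer.transform(new_col)    # reuses the training mean
print(toy_imputer.statistics_, filled.ravel())
```

The learned value is stored on the fitted object (`statistics_`), so it does not depend on what is passed to `transform` later.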

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")

In [None]:
imputer.fit(data[['weight', 'horsepower']])

In [None]:
imputer.transform(data[['weight', 'horsepower']])[32]

In [None]:
imputer.fit(data[['horsepower']])
def impute_mpg_sklearn(df, imputer=imputer):
    feature_cols = ["cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
    Phi = df[feature_cols].copy()
    Phi["horsepower_missing"] = Phi["horsepower"].isna().astype(float)
    Phi["horsepower"] = imputer.transform(Phi[["horsepower"]]).flatten()
    return Phi

In [None]:
basicsk, data['Yhat'] = train_model_with_phi(data, impute_mpg_sklearn, data, Y)
evaluate_and_plot_mpg("basic_sklearn", data, y="mpg", yhat="Yhat")

# 7. Applying Domain Knowledge

Let's try improving the model by applying feature functions from before: polynomial and sinusoidal features.

The displacement of an engine is defined as the product of the volume of each cylinder and number of cylinders.  However, not all cylinders fire at the same time (at least in a functioning engine) so the fuel economy might be more closely related to the volume of any one cylinder.


## 7.1 Displacement Features

We can use this "domain knowledge" to compute a new feature encoding the volume per cylinder by taking the ratio of displacement and cylinders.

In [None]:
def phi_with_displacement(df):
    Phi = impute_mpg_sklearn(df)
    Phi['displacement/cylinder'] = ...
    return Phi

Again fitting and evaluating our model we see a reduction in prediction error (RMSE).

In [None]:
disp, data['Yhat_disp'] = train_model_with_phi(data, phi_with_displacement, data, Y)
evaluate_and_plot_mpg("disp", data, y="mpg", yhat="Yhat_disp")

## 7.2 Polynomial Features

Let's apply the feature functions we explored in the previous lesson. Do they work here?

In [None]:
def phi_crazy(df):
    Phi = impute_mpg_sklearn(df)
    Phi = phi_curved(Phi)
    return Phi

In [None]:
crazy_mpg, data['Yhat_crazy'] = train_model_with_phi(data, phi_crazy, data, Y)
evaluate_and_plot_mpg("crazy", data, y="mpg", yhat="Yhat_crazy")

## 7.3 Sinusoidal Features

Those seemed to work well. Let's try more.

In [None]:
def phi_crazier(df):
    Phi = impute_mpg_sklearn(df)
    Phi = phi_curved_and_periodic(Phi)
    return Phi

In [None]:
crazier_mpg, data['Yhat_crazier'] = train_model_with_phi(data, phi_crazier, data, Y)
evaluate_and_plot_mpg("crazier", data, y="mpg", yhat="Yhat_crazier")

This random hodgepodge of features is what we call "feature soup": senseless feature mashing to get a better result. We'll see specifically why this is bad in future lectures. For now, it looks like feature soup is yielding diminishing returns and has plateaued in performance. Having hit a wall here, we'll turn to an alternative: below, we'll leverage insights about the problem, our domain knowledge, to *further* improve our model performance significantly.

# 8. Encoding Non-Numeric and Categorical Data

## 8.1 Encoding Categorical Data

The `origin` column in this data set is categorical (nominal) data taking on a fixed set of possible values.

In [None]:
data.head()

In [None]:
_ = plt.hist(data['origin'])

To use this kind of data in a model, we need to transform it into a vector encoding that treats each distinct value as a separate dimension.  This is called One-hot Encoding or Dummy Encoding.

### 8.1.1 One-Hot Encoding (Dummy Encoding)


One-hot encoding, sometimes also called **dummy encoding**, is a simple mechanism to encode categorical data as real numbers such that the magnitude of each dimension is meaningful.  Suppose a feature can take on $k$ distinct values (e.g., $k=50$ for 50 states in the United States).  A new feature (dimension) is created for each distinct value.  For each record, all the new features are set to zero except the one corresponding to the value in the original feature.

<img src="images/one_hot_state.png" width="600px">

The term one-hot encoding comes from a digital circuit encoding of a categorical state as a particular "hot" wire:

<img src="images/one_hot_encoding.png" width="400px">
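The encoding described above can be sketched in a few lines of NumPy. The category values here are assumed for illustration.

```python
import numpy as np

values = ["usa", "japan", "europe", "usa"]      # one categorical feature
categories = sorted(set(values))                 # the k distinct values
index = {c: j for j, c in enumerate(categories)}

one_hot = np.zeros((len(values), len(categories)))
for i, v in enumerate(values):
    one_hot[i, index[v]] = 1.0                   # the single "hot" entry

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to that record's category.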

### 8.1.2  Dummy Encoding in Pandas

We can construct a one-hot (dummy) encoding of the origin column using the `pandas.get_dummies` function:

In [None]:
pd.get_dummies(data[['origin']])

Using `pandas.get_dummies`, we can build a new feature function which extends our previous features with the additional dummy encoding columns.

In [None]:
def phi_with_origin(df):
    Phi = phi_with_displacement(df)
    Phi = Phi.join(pd.get_dummies(df[['origin']]))
    return Phi

We fit a new model with the origin feature encoding:

In [None]:
oh, data['Yhat_oh'] = train_model_with_phi(data, phi_with_origin, data, Y)
evaluate_and_plot_mpg("oh", data, y="mpg", yhat="Yhat_oh")

Unfortunately, the above feature function is not stable.  For example, if we are given a single vehicle to make a prediction the model will fail:

In [None]:
try:
    oh.predict(phi_with_origin(data.head(1)))
except Exception as e:
    print(e)

To see why this fails, look at the feature transformation for a single row:

In [None]:
phi_with_origin(data.head(1))

The dummy columns are not created for the other categories.

There are a couple of solutions.  We could maintain a list of dummy columns and always add these columns.  Alternatively, we could use a library function designed to solve this problem.  The second option is much easier.
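For comparison, here is a sketch of the first option. The helper name `stable_dummies` and the toy `origin` values are hypothetical. The dummy columns seen at training time are recorded once and re-applied with `DataFrame.reindex`.

```python
import pandas as pd

train_df = pd.DataFrame({"origin": ["usa", "japan", "europe", "usa"]})

# Record the dummy columns seen at training time
dummy_cols = list(pd.get_dummies(train_df[["origin"]]).columns)

def stable_dummies(df, columns=dummy_cols):
    """Hypothetical helper: always return the same dummy columns."""
    d = pd.get_dummies(df[["origin"]])
    # Add any missing columns (filled with 0) and drop unseen ones
    return d.reindex(columns=columns, fill_value=0).astype(float)

single = stable_dummies(train_df.head(1))
print(single)
```

This works, but the column list has to be stored and passed around by hand, which is the bookkeeping that a fitted encoder handles for us.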

### 8.1.3 Scikit-learn One-hot Encoder

The scikit-learn library has a wide range of feature transformations and a framework for composing them in reusable (stable) pipelines.  Let's first look at a basic [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) transformation.

In [None]:
from sklearn.preprocessing import OneHotEncoder
oh_enc = OneHotEncoder()

We then fit that instance to some data.  This is where we would determine the specific values that a categorical feature can take:

In [None]:
oh_enc.fit(data[['origin']])

Once we fit the transformation, we can then use it to transform new data:

In [None]:
oh_enc.transform(data[['origin']].head())

In [None]:
oh_enc.transform(data[['origin']].head()).todense()

We can also inspect the categories of the one-hot encoder:

In [None]:
oh_enc.get_feature_names_out()

We can update our feature function to use the one-hot encoder instead.

In [None]:
def phi_with_origin(df):
    Phi = phi_with_displacement(df)
    dummies = pd.DataFrame(oh_enc.transform(df[['origin']]).todense(),
                           columns=oh_enc.get_feature_names_out(),
                           index = df.index)
    return Phi.join(dummies)

In [None]:
phi_with_origin(data.head())

In [None]:
# model = LinearRegression()
# model.fit(phi_with_origin(data), data[["mpg"]])
# evaluate_model("cont.+(d/c)+o", model, phi_with_origin, models)
oh, data['Yhat_oh'] = train_model_with_phi(data, phi_with_origin, data, Y)
evaluate_and_plot_mpg("oh_sklearn", data, y="mpg", yhat="Yhat_oh")

## 8.2 Encoding Text using Bag-of-Words

The only remaining feature to encode is the vehicle name.  Is there potentially signal in the vehicle name?


In [None]:
data[['name']].head(10)

Encoding text can be challenging.  Capturing the semantics and grammar of language in mathematical (vector) representations is an active area of research.  State-of-the-art techniques often rely on neural networks trained on large collections of text. In this class, we will focus on basic text encoding techniques that are still widely used.  If you are interested in learning more, check out [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial).



Here we present two widely used representations of text:

* **Bag-of-Words Encoding**: encodes text by the frequency of each word
* **N-Gram Encoding**: encodes text by the frequency of sequences of words of length $N$

Both of these encoding strategies are related to one-hot encoding: a dummy feature is created for every word or sequence of words, and several dummy features can have counts greater than zero in the same record.


### 8.2.1 The Bag-of-Words Encoding


The bag-of-words encoding is widely used and a standard representation for text in many of the popular text clustering algorithms.  The following is a simple illustration of the bag-of-words encoding:

<img src="images/bag_of_words.png" width="600px">

**Notice**
1. **Stop words are often removed.** Stop-words are words like `is` and `about` that in isolation contain very little information about the meaning of the sentence.  Here is a good list of [stop-words in many languages](https://code.google.com/archive/p/stop-words/).
1. **Word order information is lost.**  Nonetheless the vector still suggests that the sentence is about `fun`, `machines`, and `learning`.  Though there are many possible meanings: _learning machines have fun learning_ or _learning about machines is fun learning_ ...
1. **Capitalization and punctuation are typically removed.**  However, emoji symbols may be worth preserving.
1. **Sparse Encoding:** is necessary to represent the bag-of-words efficiently.  There are millions of possible words (including terminology, names, and misspellings) and so instantiating a `0` for every word that is not in each record would be inefficient.
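To make the points above concrete, here is a hand-rolled sketch of the encoding using only the standard library. The tiny stop-word set and the example sentence are assumptions chosen for illustration; real tokenizers handle many more cases.

```python
import re
from collections import Counter

STOP_WORDS = {"is", "about", "the", "a"}   # tiny illustrative list

def bag_of_words(text):
    """Lowercase, strip punctuation, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Sparse representation: only words that occur are stored
bow_counts = bag_of_words("Learning about machine learning is fun!")
print(bow_counts)
```

The `Counter` stores only the words that appear, which is the sparse encoding idea from the last point.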


### 8.2.2 Bag-of-words in Scikit-learn

We can use scikit-learn to construct a bag-of-words representation of text:

In [None]:
frost_text = [x for x in """
Some say the world will end in fire,
Some say in ice.
From what Ive tasted of desire
I hold with those who favor fire.
""".split("\n") if len(x) > 0]

frost_text

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the tokenizer with English stop words
bow = CountVectorizer(stop_words="english")

# fit the model to the passage
bow.fit(frost_text)

In [None]:
# Print the words that are kept
print("Words:", list(enumerate(bow.get_feature_names_out())))

In [None]:
print("Sentence Encoding: \n")
# Print the encoding of each line
for (text, encoding) in zip(frost_text, bow.transform(frost_text)):
    print(text)
    print(encoding.todense())
    print("------------------")

## 8.3 Encoding Text using N-Gram Encoding

The n-gram encoding is a generalization of the bag-of-words encoding designed to capture information about word ordering.  Consider the following passage of text:

> _The book was not well written but I did enjoy it._

If we re-arrange the words we can also write:

> _The book was well written but I did not enjoy it._

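The two sentences contain exactly the same words, so they have the same bag-of-words encoding, yet they express opposite opinions.  Local word order can therefore be important when making decisions about text.  The n-gram encoding captures local word order by defining counts over sliding windows. In the following example a bi-gram ($n=2$) encoding is constructed: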

<img src="images/ngram.png" width="800px">

The above n-gram would be encoded in the sparse vector:

<img src="images/ngram_vector.png" width="300px">

Notice that the n-gram captures key pieces of sentiment information: `"well written"` and `"not enjoy"`.
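The sliding-window idea can be sketched with the standard library alone. The helper `count_ngrams` is hypothetical and uses a deliberately simple whitespace tokenizer. Applied to the two example sentences, it confirms that their single-word counts are identical while their bigrams differ.

```python
from collections import Counter

def count_ngrams(text, n=2):
    """Count word n-grams using a sliding window of length n."""
    words = text.lower().rstrip(".").split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

s1 = "The book was not well written but I did enjoy it."
s2 = "The book was well written but I did not enjoy it."

# Same words, so identical unigram (bag-of-words) counts ...
same_bag = count_ngrams(s1, 1) == count_ngrams(s2, 1)
# ... but some bigrams appear only in the second sentence
only_in_s2 = set(count_ngrams(s2, 2)) - set(count_ngrams(s1, 2))
print(same_bag, sorted(only_in_s2))
```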

N-grams are often used for other types of sequence data beyond text. For example, n-grams can be used to encode genomic data, protein sequences, and click logs.

**N-Gram Issues**
1. Maintaining the dictionary of possible n-grams can be very costly.  There are several approximations leveraging hashing that can be used to closely approximate n-gram encoding without the need to maintain the dictionary of all possible n-grams.
1. As the size $n$ of n-grams increases, the chance of observing more than one instance decreases, limiting their value as a feature.
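The hashing approximation mentioned in the first issue can be sketched as follows. This is a simplified illustration rather than any library's implementation; the bucket count, the helper name `hashed_ngram_vector`, and the use of `zlib.crc32` as the hash function are assumptions. Each n-gram is mapped to one of a fixed number of buckets, so no dictionary is stored, at the cost of occasional collisions.

```python
import zlib

def hashed_ngram_vector(text, n=2, num_buckets=16):
    """Count word n-grams into a fixed-length vector via hashing."""
    words = text.lower().split()
    vec = [0] * num_buckets
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        vec[bucket] += 1          # distinct n-grams may collide here
    return vec

hashed = hashed_ngram_vector("some say the world will end in fire")
print(hashed)
```

The vector length is fixed in advance regardless of how many distinct n-grams occur. Scikit-learn offers a `HashingVectorizer` built on the same general idea.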

In [None]:
# Construct the tokenizer with unigrams and bigrams (no stop-word removal)
bigram = CountVectorizer(ngram_range=(1, 2))
# fit the model to the passage
bigram.fit(frost_text)

In [None]:
# Print the words that are kept
print("\nWords:",
      list(enumerate(bigram.get_feature_names_out())))

In [None]:
print("\nSentence Encoding: \n")
# Print the encoding of each line
for (text, encoding) in zip(frost_text, bigram.transform(frost_text)):
    print(text)
    print(encoding.todense())
    print("------------------")

### 8.3.1 Applying Text Encoding

We can add the text encoding features to our feature function:

In [None]:
bow = CountVectorizer()
bow.fit(data["name"])

def phi_with_name(df):
    Phi = phi_with_origin(df)
    bow_encoding = pd.DataFrame(
        bow.transform(df['name']).todense(),
        columns=bow.get_feature_names_out(),
        index = df.index)
    Phi = Phi.join(bow_encoding)
    return Phi

In [None]:
Phi = phi_with_name(data)
Phi.head()

In [None]:
# model = LinearRegression()
# model.fit(phi_with_name(data), data[["mpg"]])
# evaluate_model("cont.+(d/c)+o+n", model, phi_with_name, models)

name, data['Yhat_name'] = train_model_with_phi(data, phi_with_name, data, Y)
evaluate_and_plot_mpg("name", data, y="mpg", yhat="Yhat_name")

## Success!!!!!