## Raw Data to Features

### What makes a good feature?
1. Be related to the objective
> Different problems in the same domain may need different features.

2. Be known at prediction-time

> Some data is available immediately; other data is not known in real time.

> You cannot train with current data and predict with stale data.

3. Be numeric with meaningful magnitude

4. Have enough examples

5. Bring human insight to the problem

## Representing Features

### Representing Features
Raw data are converted to numeric features in different ways

Numeric values can be used as-is

Categorical variables should be one-hot encoded

Don't know the list of keys? Create a vocabulary
![](images/Features/categorical.png)

Don't mix magic numbers with data. If data is missing, add an extra column that states whether the value was observed.
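
The three ideas above (build a vocabulary, one-hot encode, flag missing values) can be sketched in plain Python. This is only an illustration under assumed data: the employee ids, the ratings, and all function names are hypothetical and do not come from the course material.

```python
def build_vocabulary(values):
    """Collect the distinct non-missing keys, sorted for a stable column order."""
    return sorted({v for v in values if v is not None})


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of `value`.

    Values outside the vocabulary (or missing) map to an all-zero vector.
    """
    return [1 if value == key else 0 for key in vocabulary]


def with_missing_indicator(value, default=0.0):
    """Return [value_or_default, observed_flag] instead of a magic number like -1."""
    if value is None:
        return [default, 0]
    return [float(value), 1]


# Hypothetical records: the id is categorical, the rating may be missing.
employee_ids = ["8345", "72365", "8345", "87654"]
ratings = [4.0, None, 2.5, 5.0]

vocab = build_vocabulary(employee_ids)
encoded_ids = [one_hot(e, vocab) for e in employee_ids]
encoded_ratings = [with_missing_indicator(r) for r in ratings]

print(vocab)
print(encoded_ids)
print(encoded_ratings)
```

The second rating becomes `[0.0, 0]`: the model sees a placeholder value plus a flag saying the value was not observed, so the placeholder is never mistaken for a real rating.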

### ML vs Statistics

ML = lots of data, keep outliers and build models for them

Statistics = "I've got all the data I'll ever get", throw away outliers

Exact floats are not meaningful. Discretize floating point values into bins

Crazy outliers will hurt trainability. Ideally, features should have similar ranges
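
A minimal sketch of both ideas, assuming hand-picked bin boundaries and a hand-picked clipping range. The boundaries, sample values, and function names are hypothetical; in practice they would be chosen from the data.

```python
from bisect import bisect_right


def bucketize(value, boundaries):
    """Map a float to the index of the bin it falls in.

    With boundaries [b0, b1, ...], values below b0 get 0,
    values in [b0, b1) get 1, and so on.
    """
    return bisect_right(boundaries, value)


def clip_and_scale(value, low, high):
    """Clip to [low, high], then scale linearly to [0, 1]."""
    clipped = max(low, min(high, value))
    return (clipped - low) / (high - low)


# Discretize latitudes into 4 bins using 3 assumed boundaries.
latitudes = [32.7, 34.05, 37.77, 41.9]
lat_boundaries = [33.0, 35.0, 38.0]
lat_bins = [bucketize(x, lat_boundaries) for x in latitudes]

# Keep the extreme value, but stop it from dominating the feature's range.
rooms = [1.0, 3.0, 5.0, 250.0]
rooms_scaled = [clip_and_scale(r, 0.0, 10.0) for r in rooms]

print(lat_bins)
print(rooms_scaled)
```

Clipping keeps the outlier row in the training set while bounding its magnitude, which is one way to keep features in a similar range without discarding data.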

### Preprocessing and Feature Creation

![](images/Features/preprocessing.png)
![](images/Features/feature_creation.png)

## Apache Beam and Cloud Dataflow

Beam is a way to write elastic data processing pipelines

To implement a data processing pipeline, you write your code using the Apache Beam APIs, and then deploy the code to Cloud Dataflow.
![](images/Beam_DataFlow/pipeline.png)

The code is the same between real-time and batch
![](images/Beam_DataFlow/dataflow.png)
![](images/Beam_DataFlow/pipeline2.png)
![](images/Beam_DataFlow/PCollection.png)
![](images/Beam_DataFlow/ingest.png)
![](images/Beam_DataFlow/write.png)
![](images/Beam_DataFlow/execute.png)

### Data Pipelines that Scale

The MapReduce approach splits Big Data so that each compute node processes data local to it

Apache Beam ParDo class
![](images/Beam_DataFlow/ParDo.png)

![](images/Beam_DataFlow/map_flatmap.png)
![](images/Beam_DataFlow/groupby.png)
![](images/Beam_DataFlow/Combine_PerKey.png)
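
To make the semantics of these transforms concrete, here is a plain-Python stand-in for what Map, FlatMap, GroupByKey, and a per-key combine do to a collection. It is not the Apache Beam API: the helper names and the sample lines are hypothetical, and everything runs in a single process rather than on distributed workers.

```python
from collections import defaultdict
from itertools import chain


def map_(fn, items):
    """Exactly one output per input element."""
    return [fn(x) for x in items]


def flat_map(fn, items):
    """Zero or more outputs per input element; the results are flattened."""
    return list(chain.from_iterable(fn(x) for x in items))


def group_by_key(pairs):
    """Collect all values that share a key: (k, v) pairs -> {k: [v, ...]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(pairs, combine_fn):
    """Group by key, then reduce each group's values to a single value."""
    return {k: combine_fn(vs) for k, vs in group_by_key(pairs).items()}


# Word count: split lines into words, emit (word, 1), sum per word.
lines = ["beam runs on dataflow", "dataflow scales beam"]
words = flat_map(str.split, lines)
pairs = map_(lambda w: (w, 1), words)
counts = combine_per_key(pairs, sum)

print(counts)
```

In a real pipeline the same steps would be expressed as transforms applied to a PCollection, and the runner decides how to shard the work across machines.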

## Preprocessing with Cloud Dataprep
![](images/CloudDataprep/two_approaches_preprocessing.png)

**First approach**

The default Datalab environment runs on a single virtual server with a limited amount of memory. In some cases, it is impractical or too expensive to plot and analyze the full dataset in a single Datalab environment.
**Best to aggregate in BigQuery and plot in Datalab.
Write DataFlow code to do any transformations.**

**Second approach**

Use Cloud Dataprep to explore and preprocess data
![](images/CloudDataprep/CloudDataprep.png)
![](images/CloudDataprep/CloudDataprep_wrangles.png)
![](images/CloudDataprep/wranglers_transformation.png)
![](images/CloudDataprep/monitor_Dataprep.png)

## Introducing Feature Crosses
![](images/FeatureCrosses/feature_crosses.png)

### Memorization vs Generalization

Feature crosses memorize

Goal of ML is generalization

Memorization works when you have lots of data

Feature crosses are powerful

**Feature Crosses bring a lot of power to linear models**

Feature crosses + massive data is an efficient way for learning highly complex spaces

Feature crosses allow a linear model to memorize large datasets

Optimizing linear models is a convex problem

Before TensorFlow, Google used massive scale learners

Feature crosses, as a preprocessor, make neural networks converge a lot quicker

dataset=xor
![](images/FeatureCrosses/xor.png)
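
A small sketch of why a cross helps on XOR-style data, assuming four hypothetical points where the label is 1 when `x1` and `x2` have the same sign. A linear model over the raw features cannot classify all four points, but a linear model over the single crossed feature `x1 * x2` can. The grid search is only an illustration of the first claim, not a proof.

```python
from itertools import product

# Four XOR-style points: label is 1 when x1 and x2 have the same sign.
data = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), 0), ((-1.0, 1.0), 0)]


def predict_raw(x1, x2, w1, w2, bias):
    """Linear model over the raw features x1 and x2."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def predict_with_cross(x1, x2, w_cross=1.0, bias=0.0):
    """Linear model over the single crossed feature x1 * x2."""
    return 1 if w_cross * (x1 * x2) + bias > 0 else 0


# Try a coarse grid of weights for the raw-feature model.
grid = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
raw_model_can_fit = any(
    all(predict_raw(x1, x2, w1, w2, b) == y for (x1, x2), y in data)
    for w1, w2, b in product(grid, repeat=3)
)

cross_model_fits = all(predict_with_cross(x1, x2) == y for (x1, x2), y in data)

print("raw features fit all points:", raw_model_can_fit)
print("crossed feature fits all points:", cross_model_fits)
```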

dataset=circle
![](images/FeatureCrosses/circle.png)

Which of these is a **good feature cross**?

Different cities in California have markedly different housing prices. Suppose you must create a model to predict housing prices. Which of the following sets of features or feature crosses could learn city-specific relationships between house characteristics and housing price?


Three separate binned features: [binned latitude], [binned longitude], [binned roomsPerPerson]


Two feature crosses: [binned latitude X binned roomsPerPerson] and [binned longitude X binned roomsPerPerson]


One feature cross: [latitude X longitude X roomsPerPerson]


**One feature cross: [binned latitude X binned longitude X binned roomsPerPerson]**

> Correct
Yes. Crossing binned latitude with binned longitude enables the model to learn city-specific effects of roomsPerPerson. Binning prevents a change in latitude producing the same result as a change in longitude. Depending on the granularity of the bins, this feature cross could learn city-specific or neighborhood-specific or even block-specific effects.
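
A minimal sketch of that three-way cross, assuming coarse, hand-picked bin boundaries. The boundaries, coordinates, and function name are hypothetical. Each distinct key would become one column of a one-hot vector, so the model can learn a separate weight per (location cell, roomsPerPerson bin) combination.

```python
from bisect import bisect_right

# Assumed bin boundaries; finer bins give neighborhood- or block-level cells.
LAT_BOUNDS = [34.0, 36.0, 38.0]
LON_BOUNDS = [-122.0, -120.0, -118.0]
ROOMS_BOUNDS = [1.0, 2.0]


def cross_key(latitude, longitude, rooms_per_person):
    """Bin each input, then join the bin indices into one categorical key."""
    lat_bin = bisect_right(LAT_BOUNDS, latitude)
    lon_bin = bisect_right(LON_BOUNDS, longitude)
    rooms_bin = bisect_right(ROOMS_BOUNDS, rooms_per_person)
    return f"lat{lat_bin}_x_lon{lon_bin}_x_rooms{rooms_bin}"


# Same roomsPerPerson, two different (approximate) city locations.
san_francisco = cross_key(37.77, -122.42, 1.5)
los_angeles = cross_key(34.05, -118.24, 1.5)

print(san_francisco)
print(los_angeles)
```

The two houses share a roomsPerPerson bin but land in different crossed keys, so the effect of roomsPerPerson can differ by location.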


### Lab Solution: Too Much of a Good Thing

http://goo.gl/ofiHCT

Feature crosses can cause overfitting


## Implementing Feature Crosses

![](images/FeatureCrosses/hash_bucket.png)
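
A sketch of the hashing idea, assuming a day-of-week by hour-of-day cross. The function name, the key format, and the bucket count are hypothetical, and MD5 is used here only to get a hash that is stable across runs; it is not the hash function TensorFlow uses.

```python
import hashlib


def hash_bucket(cross_value, num_buckets):
    """Map a crossed categorical value to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(cross_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# 7 days x 24 hours = 168 possible crossed values.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
crossed_values = [f"{day}_x_{hour}" for day in days for hour in range(24)]

num_buckets = 50  # deliberately smaller than 168, so collisions must occur
buckets = [hash_bucket(v, num_buckets) for v in crossed_values]

print("distinct crossed values:", len(set(crossed_values)))
print("distinct buckets used:", len(set(buckets)))
```

The bucket count is a trade-off: fewer buckets mean more collisions between unrelated crossed values, while more buckets mean a sparser representation with fewer examples per bucket.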

### Embedding Feature Crosses

The model learns how to embed the feature cross in lower-dimensional space
![](images/FeatureCrosses/embedding.png)
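
As a rough sketch of what embedding means mechanically: each bucket of the crossed feature selects one row of a small weight table, which is the same as multiplying the one-hot vector by that table. The table here is randomly initialized and never trained; in a real model its values are learned. All names and sizes are hypothetical.

```python
import random

NUM_BUCKETS = 8    # hypothetical number of buckets in the crossed feature
EMBEDDING_DIM = 2  # hypothetical size of the dense representation

rng = random.Random(0)
# One row of EMBEDDING_DIM weights per bucket; a model would learn these.
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_BUCKETS)
]


def embed(bucket_id):
    """Look up the dense vector for a bucket."""
    return embedding_table[bucket_id]


def embed_via_one_hot(bucket_id):
    """Same result, written as one-hot vector times the embedding table."""
    hot = [1.0 if i == bucket_id else 0.0 for i in range(NUM_BUCKETS)]
    return [
        sum(hot[i] * embedding_table[i][d] for i in range(NUM_BUCKETS))
        for d in range(EMBEDDING_DIM)
    ]


print(embed(3))
print(embed_via_one_hot(3))
```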

### Where to Do Feature Engineering

![](images/FeatureCrosses/three_places.png)
![](images/FeatureCrosses/preprocessing_feature_column.png)
![](images/FeatureCrosses/preprocessing_tf.png)

### Feature Creation in TensorFlow
![](images/FeatureCrosses/create_feature_tf.png)
![](images/FeatureCrosses/call_all_input.png)

### Feature Creation in DataFlow
![](images/FeatureCrosses/add_feature_dataflow.png)

## TensorFlow Transform
![](images/Transform/three_places_pros_cons.png)
TensorFlow is good for on-demand, on-the-fly processing
![](images/Transform/hybrid.png)
![](images/Transform/two_PTransforms.png)
![](images/Transform/two_phases.png)

### Analyze phase
![](images/Transform/analyze_phase1.png)
![](images/Transform/analyze_phase2.png)
![](images/Transform/analyze_phase3.png)

### Transform phase
![](images/Transform/transform_phase_preprocessing1.png)
![](images/Transform/transform_phase_preprocessing2.png)
Analyze and Transform both happen on the training dataset
![](images/Transform/transform_eval.png)
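
The two phases can be illustrated with a plain-Python stand-in for min-max scaling. This is not the `tensorflow_transform` API: the function names and fare values are hypothetical. The point is only that statistics are computed once over the training data and then reused unchanged for every other dataset.

```python
def analyze(training_values):
    """Analyze phase: a full pass over the training data to collect statistics."""
    return {"min": min(training_values), "max": max(training_values)}


def transform(value, stats):
    """Transform phase: per-example scaling using the stored statistics."""
    span = stats["max"] - stats["min"]
    if span == 0:
        return 0.0
    return (value - stats["min"]) / span


train_fares = [2.5, 10.0, 52.5]
eval_fares = [7.5, 60.0]

# Statistics come from the training data only.
stats = analyze(train_fares)

train_scaled = [transform(v, stats) for v in train_fares]
# The evaluation data reuses the training statistics; nothing is recomputed.
eval_scaled = [transform(v, stats) for v in eval_fares]

print(stats)
print(train_scaled)
print(eval_scaled)
```

Because the evaluation set is scaled with training statistics, a value above the training maximum scales to more than 1, which is the same behavior the model would see at serving time.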

### Supporting serving

For training and evaluation, we created preprocessed features using Beam
![](images/Transform/serving.png)
![](images/Transform/input_function.png)
![](images/Transform/serving_input.png)
The model graph includes the preprocessing code