# Representation: Feature Engineering

## Mapping Raw Data to Features

**Feature engineering** means transforming raw data into feature vectors.

## Mapping numeric values

Integer and floating-point values can be mapped directly to features, since they are already numeric.

## Mapping categorical values

Categorical features like `street_name` can only take values from a discrete set of possible values. We need to map them to numeric values.

### Example

**Street names:**  
{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

We map:

- Charleston Road -> 0
- North Shoreline Boulevard -> 1
- Shorebird Way -> 2
- Rengstorff Avenue -> 3
- Everything else (OOV) -> 4

_OOV: out-of-vocabulary, where the vocabulary is every possible value for the category_
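The mapping above can be sketched in a few lines of Python. This is only an illustration: the helper name `street_to_index` and the test street `"Main Street"` are assumptions, not part of the original notes.

```python
# Vocabulary taken from the street-name example above.
VOCABULARY = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]

# Map each known street name to its position in the vocabulary.
STREET_TO_INDEX = {name: i for i, name in enumerate(VOCABULARY)}

# Every unknown street shares one out-of-vocabulary (OOV) bucket.
OOV_INDEX = len(VOCABULARY)


def street_to_index(street_name: str) -> int:
    """Return the integer index for a street name, or the OOV index."""
    return STREET_TO_INDEX.get(street_name, OOV_INDEX)


print(street_to_index("Shorebird Way"))  # 2
print(street_to_index("Main Street"))    # 4 (hypothetical street, falls into OOV)
```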

There are some problems with this approach:

1. It is unlikely that there is a linear relationship between the integer index of `street_name` and the output
2. Some houses might have multiple street names (for example, houses on a street corner)

To fix this, we can create a binary vector instead (called **one-hot encoding**):

- For values that apply to the example, set corresponding vector elements to `1`
- Set all other elements to `0`

_The length of this vector is the same as the number of elements in the vocabulary_
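A minimal sketch of this encoding, assuming the OOV bucket is treated as one more vocabulary entry (so the vector has five elements). The function name `encode_streets` is hypothetical. Because a house can belong to several streets, the function accepts a list of names.

```python
# Vocabulary from the street-name example; the last slot is reserved for OOV
# (an assumption made for this sketch).
VOCABULARY = [
    "Charleston Road",
    "North Shoreline Boulevard",
    "Shorebird Way",
    "Rengstorff Avenue",
]
OOV_INDEX = len(VOCABULARY)


def encode_streets(street_names):
    """Return a binary vector with a 1 for each street that applies."""
    index_of = {name: i for i, name in enumerate(VOCABULARY)}
    vector = [0] * (len(VOCABULARY) + 1)  # one extra slot for OOV
    for name in street_names:
        vector[index_of.get(name, OOV_INDEX)] = 1
    return vector


# A house on a single street: exactly one element is set.
print(encode_streets(["Shorebird Way"]))  # [0, 0, 1, 0, 0]

# A corner house on two streets: two elements are set.
print(encode_streets(["Charleston Road", "Shorebird Way"]))  # [1, 0, 1, 0, 0]
```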

### Side Note: Sparse Representation

Suppose that you had 1,000,000 different street names in your data set. Creating a vector with 1,000,000 elements for every example would be inefficient, because only one or two of those elements are non-zero. Therefore we can use a **sparse representation**, which only stores the non-zero values.
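As a rough illustration, a sparse vector can be stored as a dictionary from index to value. This is one possible layout chosen for the sketch, not the format any particular library uses, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(dense_vector):
    """Keep only the positions and values of the non-zero elements."""
    return {i: v for i, v in enumerate(dense_vector) if v != 0}


def to_dense(sparse_vector, length):
    """Rebuild the full vector from its sparse form."""
    dense = [0] * length
    for i, v in sparse_vector.items():
        dense[i] = v
    return dense


# A one-million-element vector with only two non-zero entries.
dense = [0] * 1_000_000
dense[42] = 1
dense[123_456] = 1

sparse = to_sparse(dense)
print(sparse)  # {42: 1, 123456: 1}
```

The sparse form holds two entries instead of one million, and `to_dense(sparse, 1_000_000)` recovers the original vector when it is needed.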

# Qualities of Good Features

## Feature values should appear more than 5 times

**Good:**
```
house_type: victorian
```

**Bad:**
```
unique_house_id: 8SK982ZZ1242Z
```
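A quick way to check this rule is to count how often each value occurs. The sketch below uses the threshold of 5 from the heading. The helper name `rare_values` and the sample data are made up for illustration.

```python
from collections import Counter

MIN_COUNT = 5  # values must appear MORE than this many times


def rare_values(values, min_count=MIN_COUNT):
    """Return the sorted values that appear min_count times or fewer."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n <= min_count)


# Hypothetical column: "victorian" appears 6 times, "igloo" only twice.
house_types = ["victorian"] * 6 + ["igloo"] * 2
print(rare_values(house_types))  # ['igloo']
```

A column such as `unique_house_id` would report every value as rare, which is a sign that it makes a poor feature.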

## Clear and obvious meanings for the features

**Good:**
```
house_age_years: 27
```

**Bad:**
```
house_age: 851472000
```

## Don't mix "magic" values with actual data

If you want to use watch time as a feature and a video hasn't been watched yet, don't use `-1` as an indicator. Instead, create a separate boolean feature that indicates whether the watch time is defined.

**Good:**
```
watch_time: 52.23
watch_time: 2.89
```

**Bad:**
```
watch_time: -1
```
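One way to keep the "not watched yet" indicator out of the numeric feature is to emit two features. In this sketch the function name `encode_watch_time` and the second feature `watch_time_is_defined` are hypothetical, and the raw data is assumed to use `None` for an unwatched video.

```python
from typing import Optional, Tuple


def encode_watch_time(raw_seconds: Optional[float]) -> Tuple[float, float]:
    """Return (watch_time, watch_time_is_defined).

    The second element is 1.0 when a real watch time exists and 0.0 otherwise,
    so the model never has to interpret a magic number such as -1.
    """
    if raw_seconds is None:
        # The 0.0 here is only a filler; the flag marks it as undefined.
        return 0.0, 0.0
    return raw_seconds, 1.0


print(encode_watch_time(52.23))  # (52.23, 1.0)
print(encode_watch_time(None))   # (0.0, 0.0)
```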

## Account for upstream instability

The definition of a feature shouldn't change over time. Don't use features whose values depend on an upstream source that might change their meaning.

**Good:**
```
city_id: "br/sao_paulo"
```

_Note: this string still needs to be converted to a one-hot vector_

**Bad:**
```
inferred_city_cluster: "219"
```

# Cleaning Data

## Scaling feature values (normalize)

Convert floating-point feature values from their natural range to a standard range.

_Example:_ rescale values from the range 100-900 to the range 0-1.
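A minimal sketch of this rescaling, using min-max scaling as one common choice (the notes do not name a specific method). The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(value, lowest, highest):
    """Linearly map value from [lowest, highest] to [0, 1]."""
    return (value - lowest) / (highest - lowest)


# Using the 100-900 range from the example.
print(min_max_scale(100, 100, 900))  # 0.0
print(min_max_scale(500, 100, 900))  # 0.5
print(min_max_scale(900, 100, 900))  # 1.0
```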

## Handling extreme outliers

We can "cap" or "clip" the maximum value. Everything beyond the maximum isn't thrown away; instead, it is set to the maximum. (The same applies to the minimum.)
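Clipping can be sketched as follows. The function name `clip` and the bounds used in the example are illustrative assumptions.

```python
def clip(value, minimum, maximum):
    """Pull values outside [minimum, maximum] back to the nearest bound."""
    return max(minimum, min(value, maximum))


# Hypothetical bounds of 0 and 4 for some feature.
print(clip(2.5, 0, 4))   # 2.5 (inside the range, unchanged)
print(clip(50.0, 0, 4))  # 4   (outlier capped at the maximum)
print(clip(-3.0, 0, 4))  # 0   (outlier capped at the minimum)
```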

## Binning

Split data into "bins". This is useful when there is no direct linear relationship between the feature and the label, but the feature is still meaningful.
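A minimal sketch of binning, assuming the bins are defined by a sorted list of boundaries and that each value is then one-hot encoded by its bin. The function names and the latitude boundaries are made up for illustration.

```python
from bisect import bisect_right


def bin_index(value, boundaries):
    """Return the bin a value falls into.

    boundaries must be sorted; n boundaries define n + 1 bins.
    """
    return bisect_right(boundaries, value)


def one_hot_bin(value, boundaries):
    """One-hot encode the bin so each bin can get its own weight."""
    vector = [0] * (len(boundaries) + 1)
    vector[bin_index(value, boundaries)] = 1
    return vector


# Hypothetical latitude boundaries: <33, 33-34, 34-35, >=35.
LATITUDE_BOUNDARIES = [33.0, 34.0, 35.0]

print(bin_index(33.5, LATITUDE_BOUNDARIES))    # 1
print(one_hot_bin(33.5, LATITUDE_BOUNDARIES))  # [0, 1, 0, 0]
```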

## Scrubbing

Data sets often contain "bad" data (duplicates, bad labels, bad feature values). Typically, we "fix" bad examples by removing them from the data set.
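A rough sketch of scrubbing, assuming each example is a dictionary and that a "bad" feature value is represented by `None`. Real data sets need checks specific to their own failure modes. The function name `scrub` and the sample records are hypothetical.

```python
def scrub(examples):
    """Drop exact duplicates and examples with missing (None) values."""
    seen = set()
    cleaned = []
    for example in examples:
        # A hashable fingerprint of the example, independent of key order.
        key = tuple(sorted(example.items()))
        if key in seen:
            continue  # duplicate
        if any(value is None for value in example.values()):
            continue  # bad feature value
        seen.add(key)
        cleaned.append(example)
    return cleaned


# Hypothetical records: one duplicate and one with a missing value.
raw = [
    {"rooms": 3, "price": 250000},
    {"rooms": 3, "price": 250000},
    {"rooms": None, "price": 180000},
    {"rooms": 2, "price": 150000},
]
print(scrub(raw))
# [{'rooms': 3, 'price': 250000}, {'rooms': 2, 'price': 150000}]
```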