# The Input Space $X$  
- Our general learning theory setup: no assumptions about $X$  
- But $X = R^d$ for the specific methods we’ve developed:  
   - Ridge regression
   - Lasso regression
   - Linear SVM  
- However, often the input space is not $R^d$  

# Feature Extraction  
- Definition  
Mapping an input from $X$ to a vector in $R^d$ is called **feature extraction** or **featurization**  
<div align="center"><img src = "./feature extraction.jpg" width = '500' height = '100' align = center /></div>  

## Example: Detecting Email Addresses  
- Task: Predict whether a string is an email address
- Could use domain knowledge and write down:
<div align="center"><img src = "./email.jpg" width = '500' height = '100' align = center /></div>   

- But this was ad-hoc, and maybe we missed something  
- Could we be more systematic?  
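
As a rough illustration of hand-written feature extraction for this task, the sketch below maps a string to a vector in $R^5$. The five features are illustrative assumptions chosen for this example, and the name `extract_features` is hypothetical.

```python
def extract_features(s: str) -> list[float]:
    """Map a raw string to a fixed-length feature vector in R^5."""
    n_alpha = sum(ch.isalpha() for ch in s)
    return [
        1.0 if len(s) > 10 else 0.0,         # length greater than 10
        n_alpha / len(s) if s else 0.0,      # fraction of alphabetic characters
        1.0 if "@" in s else 0.0,            # contains "@"
        1.0 if s.endswith(".com") else 0.0,  # ends with ".com"
        1.0 if s.endswith(".org") else 0.0,  # ends with ".org"
    ]


features = extract_features("abc@gmail.com")
```

Each feature here encodes one piece of domain knowledge, which is exactly why the list feels ad hoc: nothing guarantees it is complete.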

# Feature Templates
- Definition (informal)   
A feature template is a group of features all computed in a similar way  
<div align="center"><img src = "./feature template.jpg" width = '500' height = '100' align = center /></div>   

- Feature Template: Last Three Characters Equal ___
   - Don’t think about which 3-letter suffixes are meaningful... 
   - Just include them all  
<div align="center"><img src = "./end.jpg" width = '500' height = '100' align = center /></div>   

## Feature Template: One-Hot Encoding
A one-hot encoding is a set of features (e.g. a feature template) that always has exactly one non-zero value.
<div align="center"><img src = "./onehot.jpg" width = '500' height = '100' align = center /></div>   
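
A minimal sketch of a one-hot encoding, assuming the categories are known in advance and contain no duplicates. The function name `one_hot` and the category list `SUFFIXES` are hypothetical.

```python
def one_hot(value: str, categories: list[str]) -> list[int]:
    """Return a vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


SUFFIXES = ["com", "org", "edu"]  # hypothetical category list
vec = one_hot("org", SUFFIXES)    # exactly one entry is nonzero
```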

## Feature Vector Representations
<div align="center"><img src = "./onehot2.jpg" width = '500' height = '100' align = center /></div>   

- Array  
    - assumes a fixed ordering of the features 
    - appropriate when a significant number of elements are nonzero (“dense feature vectors”)
    - very efficient in space and speed 
- Map
    - best for sparse feature vectors (i.e. few nonzero features) 
    - features not in the map have default value of zero 
    - Python code for the “last three characters equal ___” template: `{"endsWith_" + x[-3:]: 1}` 
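
The two representations can be related with a short sketch. It assumes a fixed, known list of feature names for the array form. The helper names `to_dense` and `sparse_dot`, the name list, and the weights are hypothetical.

```python
def ends_with_features(x: str) -> dict[str, int]:
    """Sparse map for the "last three characters equal ___" template."""
    return {"endsWith_" + x[-3:]: 1}


def to_dense(sparse: dict[str, int], feature_names: list[str]) -> list[int]:
    """Lay a sparse map out as an array under a fixed feature ordering."""
    return [sparse.get(name, 0) for name in feature_names]


def sparse_dot(weights: dict[str, float], features: dict[str, int]) -> float:
    """Compute w^T phi(x), touching only the nonzero features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


sparse = ends_with_features("abc@gmail.com")
names = ["endsWith_com", "endsWith_org", "endsWith_edu"]  # hypothetical ordering
dense = to_dense(sparse, names)
score = sparse_dot({"endsWith_com": 0.5, "endsWith_org": 0.25}, sparse)
```

With the map form, the cost of a dot product depends on the number of nonzero features rather than on the total number of possible suffixes.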

# Handling Nonlinearity with Linear Methods  
- Example: Predict Health  
    - Height
    - Weight
    - Body Temperature
    - Blood Pressure
- Issues for nonlinear problems  
  - For linear predictors, it’s important how features are added  
  - Three types of nonlinearities can cause problems
      - Non-monotonicity 
      - Saturation
      - Interactions between features

## Non-monotonicity: The Issue
- Feature Map: $\phi(x) = [1, temperature(x)]$  
- Action: predict health score $y \in R$ (positive is good)  
- Hypothesis Space $F = \{\text{affine functions of temperature}\}$ 
- Issue: Health is not an affine function of temperature  
- An affine function can only say either 
     - Very high is bad and very low is good, or 
     - Very low is bad and very high is good
- But here, both extremes are bad.
- Solution 1:  
Transform the input  
$$\phi(x)=\left[1,\{\text { temperature }(x)-37\}^{2}\right]$$  
where 37 is “normal” temperature in Celsius  
    -  but this requires domain knowledge 
- Solution 2:  
Think less, put in more  
$$\phi(x)=\left[1, \text { temperature }(x),\{\text { temperature }(x)\}^{2}\right]$$  
More expressive than Solution 1  

- General Rule   
Features should be simple building blocks that can be pieced together  
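
One way to see why the second feature map is at least as expressive as the first: since $(t-37)^2 = t^2 - 74t + 1369$, any predictor built on the first map can be rewritten with the second. The sketch below checks this numerically. The helper names and the weight vector `w1` are hypothetical.

```python
def phi1(t: float) -> list[float]:
    """Solution 1: squared deviation from the normal temperature of 37."""
    return [1.0, (t - 37.0) ** 2]


def phi2(t: float) -> list[float]:
    """Solution 2: raw temperature and its square."""
    return [1.0, t, t ** 2]


def dot(w: list[float], v: list[float]) -> float:
    return sum(wi * vi for wi, vi in zip(w, v))


def convert_weights(w1: list[float]) -> list[float]:
    """Rewrite Solution 1 weights as Solution 2 weights.

    Uses (t - 37)^2 = t^2 - 74 t + 1369.
    """
    a, b = w1
    return [a + 1369.0 * b, -74.0 * b, b]


w1 = [10.0, -2.0]  # hypothetical weights: score peaks at 37 degrees
w2 = convert_weights(w1)
for t in [35.0, 37.0, 40.5]:
    # Both parameterizations give the same health score.
    assert abs(dot(w1, phi1(t)) - dot(w2, phi2(t))) < 1e-6
```

The converse does not hold: the second map can also place the peak at a temperature other than 37, so the learner does not need to be told the "normal" value.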

## Saturation: The Issue
- Setting: Find products relevant to user’s query  
- Input: Product x 
- Action: Score the relevance of x to user’s query 
- Feature Map  
$$\phi(x) = [1,N(x)]$$  
where $N(x)$ = number of people who bought $x$. 
- We expect a monotonic relationship between $N(x)$ and relevance, but... Is relevance linear in $N(x)$?   
    - Relevance score reflects confidence in relevance prediction
    - Are we 10 times more confident if $N(x) = 1000$ vs. $N(x) = 100$?
    - Bigger is better... but not that much better.

### Saturation: Solve with nonlinear transform
- Smooth nonlinear transformation  
$$\phi(x)=[1, \log \{1+N(x)\}]$$  
$\log(\cdot)$ is good for values with large dynamic ranges   
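
A small sketch of the effect of the log transform on the purchase-count example. The function name `phi_log` is hypothetical. `math.log1p(n)` computes $\log(1+n)$.

```python
import math


def phi_log(n_buyers: int) -> list[float]:
    """Feature map [1, log(1 + N(x))]."""
    return [1.0, math.log1p(n_buyers)]


ratio_raw = 1000 / 100                           # raw counts differ by a factor of 10
ratio_log = phi_log(1000)[1] / phi_log(100)[1]   # log features differ by roughly 1.5
```

After the transform, a tenfold increase in purchases moves the feature by a roughly constant amount instead of multiplying it by ten.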

### Saturation: Solve by discretization
- Discretization (a discontinuous transformation):   
$$\phi(x)=(1(5 \leqslant N(x)<10), 1(10 \leqslant N(x)<100), 1(100 \leqslant N(x)))$$  

- Sometimes we might prefer one-sided buckets   
$$\phi(x)=(1(5 \leqslant N(x)), 1(10 \leqslant N(x)), 1(100 \leqslant N(x)))$$  
    - Why? Hint: What’s the effect of regularization on the parameters for rare buckets?   

- Small buckets allow a quite flexible relationship  
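
The two bucketing schemes can be sketched as follows, using the boundaries 5, 10, and 100 from the formulas above. The function names are hypothetical.

```python
EDGES = [5, 10, 100]  # bucket boundaries taken from the formulas above


def two_sided_buckets(n: int) -> list[int]:
    """Indicators 1(5 <= n < 10), 1(10 <= n < 100), 1(100 <= n)."""
    uppers = EDGES[1:] + [float("inf")]
    return [int(lo <= n < hi) for lo, hi in zip(EDGES, uppers)]


def one_sided_buckets(n: int) -> list[int]:
    """Indicators 1(5 <= n), 1(10 <= n), 1(100 <= n)."""
    return [int(lo <= n) for lo in EDGES]


two_sided = two_sided_buckets(50)  # at most one indicator fires
one_sided = one_sided_buckets(50)  # every threshold at or below n fires
```

One way to read the hint: with one-sided buckets, the score of a product is the sum of the weights for all thresholds it passes, so each weight acts as an increment over the previous level. If regularization shrinks the weight of a rarely observed top bucket toward zero, the prediction falls back to the level of the bucket below rather than to zero.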

## Interactions: The Issue
- Input: Patient information x 
- Action: Health score $y \in R$ (higher is better)  
- Feature Map  
$$\phi(x)=[\text { height }(x), \text { weight }(x)]$$  
- Issue: It’s the weight relative to the height that’s important  
- Impossible to get with these features and a linear predictor  
- Need some interaction between height and weight.

### Interactions: Approach 1
- Google “ideal weight from height”   
- J. D. Robinson’s “ideal weight” formula (for a male):  
$\text{weight (kg)} = 52 + 1.9\,[\text{height (in)} - 60]$  
- Make the score the squared deviation between the ideal weight for height $h$ and the actual weight $w$   
$$f(x)=(52+1.9[h(x)-60]-w(x))^{2}$$  
$$f(x)=3.61 h(x)^{2}-3.8 h(x) w(x)-235.6 h(x)+w(x)^{2}+124 w(x)+3844$$  

### Interactions: Approach 2
Just include all second order features  
$$\phi(x)=[1, h(x), w(x), h(x)^{2}, w(x)^{2}, \underbrace{h(x) w(x)}_{\text {cross term }}]$$  
- General Principle  
Simpler building blocks replace a single “smart” feature  
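
As a sanity check on the expansion in Approach 1, the sketch below verifies numerically that the expanded polynomial is a linear function of the second-order features from Approach 2. The names `robinson_score`, `phi`, `THETA`, and `linear_score` are illustrative and not part of the original notes.

```python
def robinson_score(h: float, w: float) -> float:
    """Squared deviation of weight w (kg) from the ideal weight for height h (in)."""
    return (52 + 1.9 * (h - 60) - w) ** 2


def phi(h: float, w: float) -> list[float]:
    """All second-order features: [1, h, w, h^2, w^2, h*w]."""
    return [1.0, h, w, h ** 2, w ** 2, h * w]


# Coefficients read off the expanded polynomial, in the order used by phi.
THETA = [3844.0, -235.6, 124.0, 3.61, 1.0, -3.8]


def linear_score(h: float, w: float) -> float:
    """Inner product of THETA with the second-order feature vector."""
    return sum(t * f for t, f in zip(THETA, phi(h, w)))


for h, w in [(60.0, 52.0), (70.0, 80.0), (65.0, 55.0)]:
    assert abs(robinson_score(h, w) - linear_score(h, w)) < 1e-6
```

So the "smart" hand-derived score is one particular point in the hypothesis space spanned by the second-order features, and a learner using those features could in principle recover it or find something better.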

# Predicate Features and Interaction Terms
- Definition  
A predicate on the input space $X$ is a function $P : X \to \{\text{True}, \text{False}\}$  
- Many features take this form  
    - $x \mapsto s(x) = 1(\text{subject is sleeping})$  
    - $x \mapsto d(x) = 1(\text{subject is driving})$  
- For predicates, interaction terms correspond to AND conjunctions  
    - $x \mapsto s(x)d(x) = 1(\text{subject is sleeping AND subject is driving})$  
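
A short check that the product of two 0/1 predicate features behaves like a logical AND. The input representation (a dictionary with boolean fields `sleeping` and `driving`) is an assumption made for this sketch.

```python
def s(x: dict[str, bool]) -> int:
    """1(subject is sleeping)."""
    return int(x["sleeping"])


def d(x: dict[str, bool]) -> int:
    """1(subject is driving)."""
    return int(x["driving"])


def interaction(x: dict[str, bool]) -> int:
    """Product s(x) * d(x), which equals 1(sleeping AND driving)."""
    return s(x) * d(x)


# Check all four truth assignments.
for sleeping in (False, True):
    for driving in (False, True):
        x = {"sleeping": sleeping, "driving": driving}
        assert interaction(x) == int(sleeping and driving)
```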

# What is linear?  
- Non-linear feature map $\phi : X \to R^d$  
- Hypothesis space  
$$\mathcal{F}=\left\{f(x)=w^{T} \phi(x) \mid w \in \mathbf{R}^{d}\right\}$$  
- Linear in $w$? Yes  
- Linear in $\phi(x)$? Yes  
- Linear in $x$? No  
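
These three answers can be checked on a toy case. The sketch assumes a scalar input and the feature map $\phi(x) = [1, x, x^2]$, both chosen only for illustration, and tests additivity in $w$ and homogeneity in $x$.

```python
import math


def phi(x: float) -> list[float]:
    """Hypothetical nonlinear feature map for a scalar input."""
    return [1.0, x, x ** 2]


def f(w: list[float], x: float) -> float:
    """Prediction w^T phi(x)."""
    return sum(wi * fi for wi, fi in zip(w, phi(x)))


w_a = [1.0, 2.0, 3.0]
w_b = [0.5, -1.0, 4.0]
w_sum = [a + b for a, b in zip(w_a, w_b)]
x = 2.0

# Linear in w: f(w_a + w_b, x) equals f(w_a, x) + f(w_b, x).
linear_in_w = math.isclose(f(w_sum, x), f(w_a, x) + f(w_b, x))

# Not linear in x: doubling x does not double the prediction.
linear_in_x = math.isclose(f(w_a, 2 * x), 2 * f(w_a, x))
```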

# Geometric Example: Two class problem, nonlinear boundary
<div align="center"><img src = "./nonlinear.jpg" width = '500' height = '100' align = center /></div>  

- With the identity feature map $\phi(x) = (x_1, x_2)$ and linear models, no hope  
- With appropriate nonlinearity $\phi(x)=\left(x_{1}, x_{2}, x_{1}^{2}+x_{2}^{2}\right)$, a piece of cake   
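
A minimal sketch of why the added feature helps, assuming the two classes are separated by a circle centered at the origin. The weight vector `W` and the threshold (squared radius 1) are hypothetical values picked for illustration.

```python
def phi(x1: float, x2: float) -> list[float]:
    """Feature map (x1, x2, x1^2 + x2^2)."""
    return [x1, x2, x1 ** 2 + x2 ** 2]


W = [0.0, 0.0, 1.0]  # hypothetical weights: only the radial feature matters
THRESHOLD = 1.0      # hypothetical squared radius of the class boundary


def predict(x1: float, x2: float) -> int:
    """Return +1 outside the circle and -1 inside it."""
    score = sum(w * f for w, f in zip(W, phi(x1, x2)))
    return 1 if score > THRESHOLD else -1


inside = predict(0.1, 0.2)
outside = predict(2.0, 0.0)
```

The decision rule is a linear threshold in feature space, yet its boundary in the original $(x_1, x_2)$ plane is a circle.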

Consider a linear hypothesis space with a feature map $\phi : X \to R^d$  
$$\mathcal{F}=\left\{f(x)=w^{T} \phi(x)\right\}$$  
<div align="center"><img src = "./last.jpg" width = '500' height = '100' align = center /></div>  

We can grow the linear hypothesis space $\mathcal{F}$ by adding more features. 