In [None]:
%pylab inline

## A feature vector

"A feature vector is a vector in which each dimension $j = 1, \dots , D$ contains a value that describes the example somehow."

* "A feature vector is a vector in which..." 
  * This means that we have an unnamed vector that we're trying to define
* "each dimension $j = 1, \dots , D$"
  * Now we know we have a variable $j$ and that $j$ contains a number of things
  * The thing starts with the 1 and ends with D -- Oh, a second variable!
  * What is D? ... Could be 100; let's say it's 100
  * So D is the number of dimensions, then D must be the length! And if D is 100 then the unnamed feature vector is 100 elements long
* "contains a value that describes the example"
  * Ok, so each thing inside the unnamed vector (the thing with number $1, \dots, D$) helps us describe one piece of example data
  * Ok, that must mean that element $1$ is describing something different than element $2$
  * ... I don't understand much more, let's move on!

"That value is called a *feature* and is denoted as $x^j$"

* Ok, so now we have a name for our list called $x$
* We also heard that each single element (the ones with numbers $1, \dots, D$) is called a **feature**
  * And each element is uniquely identifiable by $j$: $x^j$
  * Hey, I've seen that before! That's an index!
  * Ahhhh, easy, so this just means that I can look at an element in the vector by its index!
  * ... Oh, and that element is then for some reason called a *feature*

In [None]:
# We have a vector x of 1, ..., D elements
D = 100
x = np.arange(1, D + 1)

# Apparently we can look up a certain element in the list given j:
j = 1
x[j]

In [None]:
# Be careful with indexing!
print(x[0])
print(x[1])

A **feature vector** is then simply a list in which we can look up an element by its index.

Math: $x^1$

Python:
```python
x[0]
```

Math: $x^{100}$

Python: 
```python
x[99]
```


## Feature vectors as descriptions of things

Let's say we have a Person class:

```python

class Person:
    
    def __init__(self, age, height):
        self.age = age
        self.height = height
```

A Person is described by two **features**: `age` and `height`

* So a **feature vector** for a person would have 2 dimensions
  * $x^1 = $ age
  * $x^2 = $ height
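
As a rough sketch, this is how such an object could be turned into a feature vector. The `to_feature_vector` helper and the example values are made up for illustration; they are not part of the class above.

```python
import numpy as np

class Person:

    def __init__(self, age, height):
        self.age = age
        self.height = height

def to_feature_vector(person):
    # Hypothetical helper: x^1 = age, x^2 = height
    return np.array([person.age, person.height])

x = to_feature_vector(Person(age=30, height=180))
print(x[0])  # x^1, the age
print(x[1])  # x^2, the height
```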

## Feature vectors in high dimensions

* Imagine an image with the resolution `1024 x 800`
  * How many pixels does that image have?

* That image has `1024 * 800 = 819200` features!
```python
x[0]      # First feature
x[819199] # Last feature
```
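
A minimal sketch of where those features could come from, assuming a greyscale image stored as a 2D NumPy array (the blank `image` below is only a stand-in for a real picture):

```python
import numpy as np

# Assumed stand-in for a real picture: 800 rows by 1024 columns of pixels
image = np.zeros((800, 1024))

# Flatten the 2D grid of pixels into one long feature vector
x = image.ravel()
print(x.shape)
```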

## Motivating the problem

“We need to be able to predict whether a particular
customer will stay with us. Here are the logs of customers’ interactions with our product for
five years.”

What information can we get from a log?

* Timestamp
* Number of connections
* Frequency of connections
* Session duration
* ...?

The more precisely we can describe the data, the better our model gets!

We want to select the information that is important and *remove* what isn't.
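
As a rough illustration, a few of the log fields listed above could be condensed into a feature vector like this. The log format (pairs of session start and end times) and the values are entirely hypothetical.

```python
from datetime import datetime
import numpy as np

# Hypothetical log for one customer: (session start, session end)
log = [
    (datetime(2019, 1, 1, 9, 0), datetime(2019, 1, 1, 9, 30)),
    (datetime(2019, 1, 3, 14, 0), datetime(2019, 1, 3, 14, 10)),
    (datetime(2019, 1, 8, 20, 0), datetime(2019, 1, 8, 20, 20)),
]

n_connections = len(log)
# Session durations in minutes
durations = [(end - start).total_seconds() / 60 for start, end in log]
# Whole days between first and last connection (at least 1 to avoid dividing by zero)
span_days = max((log[-1][0] - log[0][0]).days, 1)

x = np.array([
    n_connections,              # number of connections
    n_connections / span_days,  # connections per day (frequency)
    np.mean(durations),         # average session duration in minutes
])
print(x)
```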

## Feature engineering

The problem of transforming raw data into a dataset is called feature engineering.

Goal: describe data with *informative* features

Today: techniques to encode data in specific ways

* One-hot
* Binning
* Word embeddings
* Normalisation
* Standardisation
* Dealing with missing features

## One-hot encoding

Imagine $n$ features. One-hot encoding means setting **one** feature to 1 and the rest to 0.

Example:
 * Features `[Left, Right]`: `[1, 0]`
 * Features `[M, T, W, T, F, S, S]`: `[0, 0, 1, 0, 0, 0, 0]`
 * Features `[M, F, X]`: `[0, 1, 0]`
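
To make the idea concrete, here is a small hand-written sketch of one-hot encoding. The `one_hot` helper is made up for illustration and is not how `OneHotEncoder` is implemented.

```python
import numpy as np

def one_hot(value, categories):
    # Hypothetical helper: all zeros, with a single 1 at the position of `value`
    vector = np.zeros(len(categories), dtype=int)
    vector[categories.index(value)] = 1
    return vector

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
print(one_hot('Wed', days))
print(one_hot('Left', ['Left', 'Right']))
```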

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
xs = [['Male'], ['Female'], ['Female']]
encoder.fit_transform(xs)

In [None]:
encoder.fit_transform(xs).todense()

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
xs = [['Male', 1], ['Female', 3], ['Female', 2]]
encoder.fit_transform(xs).todense()

## Binning

Imagine a list of numbers, say ages, that you want to cluster into groups.

Binning means taking numerical data and turning it into categories.

Examples:
* Age to groups: 0-10, 11-20, 21-30, ...
* Speed to groups: Slow, Medium, Fast, Ultra mega fast
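
The age groups above can be sketched by hand with `np.digitize`. The ages and the cut-off points below are example values chosen to match the groups 0-10, 11-20 and 21-30.

```python
import numpy as np

ages = np.array([3, 10, 11, 25, 30, 31])
# Upper limits of the groups 0-10, 11-20, 21-30; anything above 30 lands in a fourth group
upper_limits = np.array([10, 20, 30])

# right=True makes each upper limit inclusive, so age 10 belongs to the 0-10 group
groups = np.digitize(ages, upper_limits, right=True)
print(groups)
```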

In [3]:
from sklearn.preprocessing import KBinsDiscretizer
binner = KBinsDiscretizer(n_bins=10)
age = np.random.randint(0, 100, (1000, 1)) # Generate 1000 random ages between 0 and 99
binner.fit_transform(age).todense()

matrix([[0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [4]:
binner.fit_transform(age).sum(axis=0)

matrix([[ 91.,  97., 109., 102.,  95.,  94., 106., 102.,  94., 110.]])

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_discretization_001.png)

[Discretisation strategies](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_strategies.html#sphx-glr-auto-examples-preprocessing-plot-discretization-strategies-py) in KBinsDiscretizer
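
A small sketch of how two of those strategies can differ, using made-up skewed data. `encode='ordinal'` is used here only so that the bin number is easy to read; depending on the scikit-learn version, the `quantile` strategy may print a warning about changing defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up, skewed data: many small values and one very large one
ages = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [90]])

results = {}
for strategy in ['uniform', 'quantile']:
    # encode='ordinal' returns the bin number instead of a one-hot matrix
    binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    results[strategy] = binner.fit_transform(ages).ravel()
    print(strategy, results[strategy])
```

With `uniform`, the bins have equal width, so the outlier ends up alone in the last bin. With `quantile`, the bins are chosen so that each holds roughly the same number of examples.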

## Word embeddings

How do we represent words?

Can we make them into feature vectors?

Yes, yes we can!

![](images/word-embeddings.png)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

In [None]:
print(X.toarray())
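
To see what the counting amounts to, here is a hand-rolled sketch on the first two documents. The regular expression only approximates `CountVectorizer`'s default tokenisation (lowercase words of two or more characters); it is not the library's implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
]

# Lowercase each document and keep words of two or more characters
tokenised = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in corpus]

# The vocabulary is every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in tokenised for word in doc})

# One row per document, one column per vocabulary word
counts = []
for doc in tokenised:
    word_counts = Counter(doc)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
print(counts)
```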

## Normalisation

Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval
$[-1,1]$ or $[0,1]$.

$$\bar{x} = { x - \min(x) \over \max(x) - \min(x) }$$
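
The formula can be tried out directly with NumPy. This is a minimal sketch on four made-up values, mapping them into $[0,1]$:

```python
import numpy as np

x = np.array([20.0, 35.0, 50.0, 80.0])

# Min-max normalisation: shift by the minimum, then divide by the range
x_bar = (x - x.min()) / (x.max() - x.min())
print(x_bar)
```

The smallest value becomes 0, the largest becomes 1, and everything else lands in between.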

In [None]:
data = np.random.random((1, 20)) * 100
data

In [None]:
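# MinMaxScaler implements the min-max formula above; Normalizer instead rescales each row to unit norm
from sklearn.preprocessing import MinMaxScaler
# Scalers work column by column, so transpose to treat the 20 values as one feature
MinMaxScaler().fit_transform(data.T)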

## Standardisation
Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that 
they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ 
is the mean (the average value of the feature, averaged over all examples in the dataset) and σ 
is the standard deviation from the mean.
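
In other words, each value is rescaled as $z = (x - \mu) / \sigma$. A minimal NumPy sketch on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()    # mean of the feature
sigma = x.std()  # population standard deviation, which is also what StandardScaler uses
z = (x - mu) / sigma

# After standardisation the mean is (close to) 0 and the standard deviation is 1
print(z.mean(), z.std())
```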

## Standard deviation ($\sigma$)

Describes the **spread** of the data: how far do values typically lie from the mean? It is the square root of the **variance**.

## Normal distribution


![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/1920px-Standard_deviation_diagram.svg.png)

Standardisation shifts AND scales values so that they have mean 0 and standard deviation 1; it does not change the shape of the distribution

In [None]:
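from sklearn.preprocessing import StandardScaler
# data has shape (1, 20): one example with 20 features, so each column holds a single value and becomes 0
StandardScaler().fit_transform(data)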

In [None]:
# Transposed to shape (20, 1): 20 examples of one feature, which is what we want to standardise
StandardScaler().fit_transform(data.T)

## Dealing with missing values

* Removing them
  * `df.dropna()` (pandas `DataFrame.dropna`)
* Filling them in with estimated values (imputation)
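
A minimal sketch of the first option, assuming the data lives in a pandas `DataFrame` (the table below is made up):

```python
import numpy as np
import pandas as pd

# Made-up table with one missing height
df = pd.DataFrame({'age': [25, 32, 47], 'height': [180.0, np.nan, 165.0]})

# Drop every row that contains at least one missing value
cleaned = df.dropna()
print(cleaned)
```

Note that the whole row disappears, including the perfectly valid age, which is why removal gets costly when many values are missing.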

## Imputation techniques

* Imagine a dataset where you have a lot of missing values for *one* feature


* Imagine a dataset that you cannot hand out (NDA)

## Imputation technique 1: Averaging

Replace the missing value with the average value of the feature:

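$$\hat{x}^j = {1 \over N} \sum_{i=1}^{N} x_i^j$$

where $x_i^j$ is the value of feature $j$ in example $i$, and the sum runs over the $N$ examples in which the feature is present.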
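
Done by hand with NumPy, averaging could look like this (a sketch on one made-up feature with a single gap):

```python
import numpy as np

# One feature with a missing value
x = np.array([2.0, np.nan, 5.0])

# Average over the values that are present, ignoring the missing one
mean = np.nanmean(x)

# Fill the gap with that average
x_filled = np.where(np.isnan(x), mean, x)
print(x_filled)
```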

## Imputation technique 2: Sampling

Imagine that the data is distributed in a certain way (for instance the normal distribution).
You can now sample new "*mock*" data from that distribution.
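
A rough sketch of that idea, under the assumption that the feature really is close to normally distributed. The values are made up, and the seed is fixed only so that the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up feature with two missing values
x = np.array([170.0, 182.0, np.nan, 165.0, 177.0, np.nan])
missing = np.isnan(x)

# Assumption: the feature is roughly normal, so estimate
# its mean and spread from the observed values
mu = x[~missing].mean()
sigma = x[~missing].std()

# Draw one mock value per gap from that distribution
x_filled = x.copy()
x_filled[missing] = rng.normal(mu, sigma, size=missing.sum())
print(x_filled)
```

Unlike averaging, this keeps some of the natural spread in the data instead of piling every filled-in value onto the mean.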

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_missing_values_001.png)

Example: Danske Bank customers

## Imputation in sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit_transform([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])