# Week 1

## Recap of main ML algorithms

### Linear models

* Examples:
  * Logistic regression
  * Support Vector machines

* Useful for sparse, high-dimensional data.

* Not useful when the relationships in the data are non-linear.

### Tree-based

* Examples:
  * Decision Tree
  * Random Forest
  * GBDT
  
* Use a "divide and conquer" approach: recursively split the feature space into subspaces.
* Finds "splits" or decisions that can be used to make inferences on that data.

* For tabular/structured data, competition winners almost always use this approach.

* Can struggle to capture linear dependencies.

* scikit-learn provides random forests (`RandomForestClassifier`, `RandomForestRegressor`).
* XGBoost - a very common GBDT implementation.
* LightGBM - also very common.
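
The idea of finding a "split" can be sketched for a single numeric feature. This is a minimal illustration, assuming Gini impurity as the split criterion and midpoints between distinct values as candidate thresholds; `best_split` is a hypothetical helper, not how any particular library implements it.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels (0.0 means the set is pure).
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    # Try every midpoint between consecutive distinct values of x and keep
    # the threshold with the lowest size-weighted impurity of the two sides.
    values = np.unique(x)
    best_threshold, best_score = None, np.inf
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        score = (left.size * gini(left) + right.size * gini(right)) / y.size
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy data (made up): small x values belong to class 0, large ones to class 1.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, score = best_split(x, y)
```

A full tree would apply the same search recursively to each side of the chosen split, over all features.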

### kNN-based methods

* Closer objects are likely to have the same label.
* Finds the "k" nearest neighbours of a point and predicts from their labels.
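
A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote (both are choices made for this example; other distances and weightings are possible). `knn_predict` is a hypothetical helper and the data is made up.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote; labels are assumed to be non-negative integers.
    return np.array([np.bincount(row).argmax() for row in votes])

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.1], [5.0, 5.1]])

pred = knn_predict(X_train, y_train, X_query, k=3)
```

Because the prediction depends directly on distances, kNN is sensitive to the scale of each feature.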

### Neural networks

* Important for image and natural language processing.
* Sometimes used with structured data.

## Feature preprocessing and generation with respect to models

### Overview

* Careful preprocessing can give you an edge.
* Feature types:
  * Id - unique to each row.
* Feature preprocessing:
  * Model dependent.

### Numeric features

#### Tree-based models

* Generally don't depend on the scale of the data.

#### Non-tree-based models

* Linear models and neural networks generally need scaled features.
* `MinMaxScaler`: brings values into the range 0 to 1.
* `StandardScaler`: rescales to mean=0, std=1.
* Removing outliers is a good idea for linear models.
  * Winsorization: clipping values to a lower and upper percentile, e.g. the 1st and 99th.
  
* Rank transformation:
  * Sets the spaces between sorted values to be equal.
  * Replaces each value with its rank (its position in sorted order):
    * `rank([1000, 1, 10]) = [2, 0, 1]`
    * An alternative to clipping outliers, since ranking pulls them close to the other values.
    
* Log transformation:
  * Pulls large values closer to the rest and can make skewed data look more normally distributed.
  * `np.log(1 + x)`
  * Raising to a power below 1 has a similar effect: `np.sqrt(x + 2.3)`
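
The transformations in this section can be sketched with plain NumPy. The array `x` is made-up data with one outlier, and the formulas mirror what `MinMaxScaler` and `StandardScaler` compute for a single column; in practice you would fit those on the training data only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # toy feature with an outlier

# Min-max scaling: values end up in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: mean 0, standard deviation 1.
x_standard = (x - x.mean()) / x.std()

# Winsorization: clip everything outside the 1st and 99th percentiles.
low, high = np.percentile(x, [1, 99])
x_clipped = np.clip(x, low, high)

# Rank transformation: argsort of argsort gives 0-based ranks
# (ties, if any, are broken by position in this simple version).
x_rank = np.argsort(np.argsort(x))
example_rank = np.argsort(np.argsort(np.array([1000, 1, 10])))

# Log and square-root transformations compress large values.
x_log = np.log1p(x)  # same as np.log(1 + x)
x_sqrt = np.sqrt(x)
```

Note how the outlier dominates min-max scaling (the other values are squeezed near 0), while the rank and log versions keep them distinguishable.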
  
#### Feature generation

* Dig into the data and use the insights to decide which features to generate.
* Examples:
  * Housing prices: area in square metres and price can be combined into price per m^2.
  * Distance: if you have the vertical and horizontal distance to something, you can generate the combined (straight-line) distance.
     * Trees could figure this out, but good feature engineering can result in fewer trees.
       * Tree-based models have trouble with division and multiplicative features.
  * Price: the fractional part of the price can capture its impact on a person's perception.
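
The three examples above can be written in a few lines of NumPy. All values and variable names here are made up for illustration.

```python
import numpy as np

# Housing: price per square metre from price and area.
price = np.array([300000.0, 150000.0])
area_m2 = np.array([100.0, 60.0])
price_per_m2 = price / area_m2

# Distance: combine horizontal and vertical distance into one feature.
dx = np.array([3.0, 6.0])
dy = np.array([4.0, 8.0])
distance = np.hypot(dx, dy)  # sqrt(dx**2 + dy**2)

# Price: fractional part, e.g. to capture the "x.99" pricing effect.
item_price = np.array([2.49, 10.00, 0.99])
fractional = np.round(item_price % 1, 2)  # rounding hides float noise
```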

### Categorical and ordinal features

* Titanic dataset examples of categorical features:
  * Sex, Cabin, Embarked
  * These have no underlying ordering.
* Examples of ordinal features:
  * Pclass - has a natural ordering.
  
* Preparing categorical variables:
  * Label encoding:
    * Convert to numbers.
    * Works okay for tree-based models but not for linear models.
    * `sklearn.preprocessing.LabelEncoder` - orders categories alphabetically (sorted order).
    * `pandas.factorize` - orders categories by first appearance in the data.
  * Frequency encoding: replace each category with the fraction of rows in which it appears:
    `[S, C, Q] -> [0.5, 0.3, 0.2]`
    * Useful even for tree-based models if the frequency of a category is correlated with the target value.
  * One-hot encoding
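    * Creates one binary column per category; each row has a 1 in the column for its category and 0 elsewhere.
    * Example: the `season` values `winter`, `summer`, `summer` become:
      * `winter` -> `season_winter`: 1, `season_summer`: 0
      * `summer` -> `season_winter`: 0, `season_summer`: 1
      * `summer` -> `season_winter`: 0, `season_summer`: 1
    * Can utilise sparse matrices when most values are 0.
      * Useful for categorical or text data.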
  * Feature generation
    * Interaction of categorical features can be useful for linear models and kNN.
      * Example: `Pclass` and `sex` combined to `Pclass_sex`
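
The encodings above can be sketched with pandas and NumPy. The frame below is made-up data with Titanic-style column names. `np.unique` stands in for the sorted ordering that `LabelEncoder` produces, so the example does not need scikit-learn.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "S", "C", "S", "Q", "C", "S"],
    "Pclass": [3, 1, 3, 2, 1, 2, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "male", "female"],
})

# Label encoding in sorted (alphabetical) order: C=0, Q=1, S=2.
classes, codes_sorted = np.unique(df["Embarked"].to_numpy(), return_inverse=True)

# Label encoding in order of first appearance: S=0, C=1, Q=2.
codes_seen, uniques = pd.factorize(df["Embarked"])

# Frequency encoding: map each category to its share of the rows.
freq = df["Embarked"].value_counts(normalize=True)
df["Embarked_freq"] = df["Embarked"].map(freq)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["Embarked"], prefix="Embarked")

# Interaction feature: concatenate two categorical columns.
df["Pclass_Sex"] = df["Pclass"].astype(str) + "_" + df["Sex"]
```

The interaction column can then be one-hot encoded like any other categorical feature.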

### Datetime and coordinates

* Datetime feature generation:
  * Periodicity:
    * Day number in week, month, season, or year; second, minute, hour.
  * Time since:
    * Row-independent: time since 1970.
    * Row-dependent: time since the last public holiday.
  * Difference:
    * Difference between two dates (e.g. time the user subscribed vs. time of purchase).
     
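* Coordinates:
  * If you have the data, add the distance to nearby landmarks: shops, hospitals, schools, etc.
  * Otherwise, extract interesting points on the map from the train / test data.
    * Example: distance to the most expensive place.
  * Aggregated statistics for the area around each point.
  * Slightly rotated coordinates can help tree-based models find better splits.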
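
A small sketch of both kinds of features. The timestamps, the holiday date, and the landmark are all hypothetical, and the coordinates are assumed to be planar (real latitude/longitude would need a great-circle distance instead).

```python
from datetime import datetime
import numpy as np

# --- Datetime features (all dates are made up) ---
purchase = datetime(2017, 3, 15, 14, 30)
subscribed = datetime(2017, 1, 1, 9, 0)
last_holiday = datetime(2017, 1, 1)

features = {
    "day_of_week": purchase.weekday(),  # periodicity, Monday = 0
    "month": purchase.month,
    "hour": purchase.hour,
    "days_since_epoch": (purchase - datetime(1970, 1, 1)).days,  # row-independent
    "days_since_holiday": (purchase - last_holiday).days,        # row-dependent
    "days_since_subscription": (purchase - subscribed).days,     # difference
}

# --- Coordinate features (planar x, y assumed) ---
coords = np.array([[0.0, 0.0], [3.0, 4.0]])
landmark = np.array([0.0, 0.0])  # hypothetical point of interest
dist_to_landmark = np.linalg.norm(coords - landmark, axis=1)

# Rotate coordinates by 45 degrees; a tree can then split along a diagonal.
theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = coords @ rotation.T
```

The rotation angle is a free choice; trying a few angles and keeping the rotated columns as extra features is one option.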
  
### Handling missing values

* Plotting a histogram is useful for finding numbers used as a replacement for missing values.
  * If an otherwise smooth distribution has an isolated peak at one value, that value may be standing in for missing data.
* Missing value imputation:
  1. Some number like -999 or -1
    * Can hurt performance of linear models and neural networks.
  2. Mean or median
    * Good for linear models
    * Can be hard for tree models to find which ones are missing.
      * Solution: add another column that describes if a value is missing or not.
  3. Reconstruct value
    * Approximate with nearby observations.
    * Could train another model to predict the missing values.
  * Think about later feature generation when filling missing values: imputed values can distort features generated from them.
* XGBoost can handle missing values out of the box.

* Handling categories not present in train data:
  * Unsupervised methods like frequency encoding, computed over train and test together, can handle categories that appear only in the test data.
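
Two of the ideas above, sketched with pandas on made-up data: median imputation with an extra "is missing" column, and frequency encoding computed over train and test together so that a category unseen in train still gets a value.

```python
import numpy as np
import pandas as pd

# --- Median imputation plus a missing-value indicator ---
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, np.nan, 58.0]})
df["age_missing"] = df["age"].isna().astype(int)          # 1 where the value was missing
df["age_filled"] = df["age"].fillna(df["age"].median())   # median ignores NaN

# --- Frequency encoding for a category that appears only in test ---
train = pd.Series(["a", "a", "a", "b"])
test = pd.Series(["b", "c", "c", "c"])  # "c" never occurs in train

# Using train frequencies only leaves "c" unmapped (NaN).
train_only = test.map(train.value_counts(normalize=True))

# Using train and test together gives every category a frequency.
freq = pd.concat([train, test]).value_counts(normalize=True)
test_encoded = test.map(freq)
```

The indicator column lets a tree-based model tell imputed rows apart from rows that genuinely have the median value.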

## Feature extraction from text and images

### Bag of words

* Create a new column for each word in the data.
* Fill each column with the count of that word in each document.
* `sklearn.feature_extraction.text.CountVectorizer` takes this approach.
* TF/IDF:
  * Calculate term frequency:
  
     ```
     # x: dense (n_docs, n_terms) array of word counts
     tf = 1 / x.sum(axis=1)[:, None]
     x = x * tf
     ```
     
  * Calculate inverse doc frequency:
  
    ```
    idf = np.log(x.shape[0] / (x > 0).sum(0))
    x = x * idf
    ```
    
  * `sklearn.feature_extraction.text.TfidfVectorizer`
  
* N-grams:
  * Have a column for each sequence of n consecutive words (or characters).
  * Example for `hello world` with unigrams and bigrams: `hello world`: 1, `hello`: 1, `world`: 1
  
* Preprocessing
  * Lowercase: `Hello -> hello`
  * Stemming: `democracy, democratic, democratization -> democr`
    * Less careful than lemmatisation: chops off word endings heuristically, without requiring knowledge of the language.
  * Lemmatisation: `democracy, democratic, democratization -> democracy`
  * Stop words
    * Remove words which are insignificant or too common to be useful.
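
The bag-of-words, TF-IDF, and n-gram steps can be sketched end to end on a toy corpus. This follows the plain formulas from the notes above; `TfidfVectorizer` applies smoothing and normalisation by default, so its numbers will differ. The whitespace tokenizer and the `ngrams` helper are simplifications made for this example.

```python
import numpy as np

docs = ["hello world", "hello hello there"]  # toy corpus

def ngrams(tokens, n):
    # All sequences of n consecutive tokens, joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Preprocessing: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Bag of words: one column per distinct word, filled with counts.
vocab = sorted({token for tokens in tokenized for token in tokens})
index = {term: j for j, term in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, tokens in enumerate(tokenized):
    for token in tokens:
        counts[i, index[token]] += 1

# Term frequency: normalise each row so it sums to 1.
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: down-weight words found in many documents.
idf = np.log(counts.shape[0] / (counts > 0).sum(axis=0))
tfidf = tf * idf

bigrams = ngrams(tokenized[0], 2)
```

Here `hello` appears in every document, so its IDF is `log(1) = 0` and it contributes nothing to the TF-IDF matrix.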
    
### Word2Vec, CNN

* Creating embeddings of words:
  * Word2Vec
  * GloVe
  * FastText
* Embeddings of sentences:
  * Doc2Vec

* BOW and embeddings can give quite different results and can be combined together.

* Pretrained convolutional networks can reuse their earlier layers on different problems.
  * Fine-tuning retrains some layers to solve a similar but different problem.
  
* Images can be augmented to reduce overfitting.
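
A minimal sketch of augmentation, assuming random horizontal flips and random crops (two common choices, picked for this example). The image is random noise standing in for a real picture, and `augment` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real 32x32 RGB image.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

def augment(image, rng, crop=28):
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Take a random crop of size crop x crop.
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop + 1)
    left = rng.integers(0, width - crop + 1)
    return image[top:top + crop, left:left + crop]

# Each call yields a slightly different view of the same image.
augmented = [augment(image, rng) for _ in range(4)]
```

Which augmentations are safe depends on the task: a flip is fine for most photos but would change the label for, say, handwritten digits.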