06/13/2018
# Exploratory Data Analysis (EDA)
* Get comfortable with data
* Find magic features
* Steps:
    1. Getting domain knowledge
        - It helps to understand the problem more deeply
    2. Checking if the data is intuitive
        - And agrees with domain knowledge
    3. Understanding how the data was generated
        - This is crucial for setting up a proper validation
    
## Anonymized Data
* Explore individual features
    * Guess the meaning of the columns
    * Guess the types of the columns
* Explore feature relations
    * Find relations between pairs
    * Find feature groups
    
## Visualization
* Explore individual features
    * Histograms
    * Plots
    * Statistics
* Explore feature relations
    * Pairs
        * Scatter plots
        * Scatter matrix
        * Corrplot
    * Groups
        * Corrplot + clustering
        * Plot (index vs feature statistics)
        
## Clean Data
* Duplicated features
* Duplicated rows
* Constant features
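
The three cleaning checks above can be sketched with pandas. The toy frame and the helper name `clean` are made up for illustration. The order of the steps (rows first, then constant features, then duplicated features) is one reasonable choice, not the only one.

```python
import pandas as pd

# Toy frame (made up for illustration): "b" duplicates "a", "c" is constant,
# and rows 1 and 2 are identical.
df = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [1, 2, 2, 3],
    "c": [7, 7, 7, 7],
    "d": [0.5, 1.5, 1.5, 2.5],
})

def clean(frame):
    """Hypothetical helper: drop duplicated rows, constant and duplicated features."""
    # 1. Duplicated rows: keep the first occurrence only.
    out = frame.drop_duplicates()
    # 2. Constant features: a single distinct value (NaN counted as a value).
    out = out.loc[:, out.nunique(dropna=False) > 1]
    # 3. Duplicated features: columns whose values repeat an earlier column.
    out = out.loc[:, ~out.T.duplicated().to_numpy()]
    return out

cleaned = clean(df)
print(cleaned)
```

Transposing to find duplicated columns is simple but can be slow on wide frames. Hashing each column is a common alternative.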

05/31/2018
# Collaborative Filtering
* Blog post: https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0

## Memory Based
* Based on similarity, e.g., cosine similarity
* Predict ratings by a weighted average, where the weights are given by the similarities
* Two types:
    * Item-Item: Users who liked this item also liked ...
    * User-Item: Users who are similar to you also liked ...
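
A minimal sketch of the item-item variant, assuming a small dense ratings matrix where 0 means "not rated". The matrix and the function names are made up for illustration. Treating missing ratings as zeros inside the cosine similarity is a simplification.

```python
import numpy as np

# Toy ratings (made up): rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def cosine_similarity(M):
    """Cosine similarity between the columns (items) of M."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated items
    return (M.T @ M) / np.outer(norms, norms)

def predict_item_item(R, user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    sim = cosine_similarity(R)[item]
    rated = R[user] > 0
    rated[item] = False  # never use the target item itself
    weights = sim[rated]
    if weights.sum() == 0:
        return 0.0  # no overlap, so no basis for a prediction
    return float(weights @ R[user, rated] / weights.sum())

pred = predict_item_item(R, user=0, item=2)
print(pred)
```

The user-item variant is the same computation on the transposed matrix: similarities between users, then an average over similar users' ratings for the target item.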
    
## Model Based
* Learn parameters
* Three types:
    * Clustering based: Similarity is calculated by unsupervised learning
    * Matrix factorization based: Use SVD to find an efficient low-rank approximation
    * Deep learning: CNN, RNN, etc.
* Deep Learning Based:
    * Nice blog: https://medium.com/@libreai/a-glimpse-into-deep-learning-for-recommender-systems-d66ae0681775
    * Auto Encoder: https://arxiv.org/pdf/1606.07792.pdf
    * CNN: https://arxiv.org/pdf/1510.01784.pdf
    * Survey: https://arxiv.org/pdf/1707.07435.pdf
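
For the matrix factorization type listed above, a minimal sketch of a rank-k approximation via truncated SVD. It assumes a fully observed toy matrix. Real rating matrices are mostly missing, so practical systems usually fit factors on the observed entries only. The helper name `low_rank_approx` is made up.

```python
import numpy as np

# Toy, fully observed ratings matrix (made up for illustration).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

def low_rank_approx(R, k):
    """Best rank-k approximation of R in the least-squares sense."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in (1, 2, 4):
    err = np.linalg.norm(R - low_rank_approx(R, k))
    print(k, round(float(err), 4))
```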

05/20/2018
# EM Algorithm
* Maximizes a likelihood that contains an intractable sum or integral over hidden variables

## Problem
Maximize
$$l(\theta) = \sum_{i=1}^N \log p(x_i | \theta) = \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(x_i, z_i | \theta) \right]$$

## Algorithm
#### E Step
* Define the complete data log likelihood
$$l_c(\theta) = \sum_{i=1}^N \log p(x_i, z_i | \theta)$$
* Use the expected complete data log likelihood because $z_i$ is unknown.
$$Q(\theta, \theta^{t-1}) = E[l_c(\theta) | \mathcal{D}, \theta^{t-1}]$$
* $Q$ is called the auxiliary function

#### M Step
* Maximize auxiliary function
$$\theta^t = \underset{\theta}{\arg\max}\, Q(\theta, \theta^{t-1})$$
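
As a concrete instance of the two steps, here is a sketch of EM for a two-component 1-D Gaussian mixture. The mixture model, the synthetic data, and the initial values are assumptions chosen for illustration, since the notes keep $p(x, z | \theta)$ generic. The log likelihood is recorded at every iteration because EM should never decrease it.

```python
import numpy as np

def component_densities(x, pi, mu, var):
    """pi_k * N(x_i | mu_k, var_k) for every point i and component k."""
    diff = x[:, None] - mu[None, :]
    return pi * np.exp(-0.5 * diff ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_likelihood(x, pi, mu, var):
    """l(theta) = sum_i log sum_k p(x_i, z_i = k | theta)."""
    return float(np.log(component_densities(x, pi, mu, var).sum(axis=1)).sum())

def em_step(x, pi, mu, var):
    # E step: responsibilities r_ik = p(z_i = k | x_i, theta^{t-1}).
    dens = component_densities(x, pi, mu, var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form maximizer of Q(theta, theta^{t-1}).
    nk = r.sum(axis=0)
    new_pi = nk / len(x)
    new_mu = (r * x[:, None]).sum(axis=0) / nk
    new_var = (r * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk
    return new_pi, new_mu, new_var

# Synthetic data: two well-separated clusters (made up for illustration).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
history = [log_likelihood(x, pi, mu, var)]
for _ in range(30):
    pi, mu, var = em_step(x, pi, mu, var)
    history.append(log_likelihood(x, pi, mu, var))

print(pi, mu, var)
```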



# K-means Algorithm
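* Can be derived as a special case of the EM algorithm (hard EM on a Gaussian mixture with fixed, shared spherical covariances and equal mixing weights)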

#### E Step
* $p(z_i = k | x_i, \theta) \approx \mathbb{I}(k = z_i^*)$,
where $z_i^* = \underset{k}{\arg\max}\, p(z_i = k | x_i, \theta)$
* This is called hard EM
* $z_i^* = \underset{k}{\arg\min}\, ||x_i - \mu_k||$

#### M Step
* $\mu_k =  \frac{1}{N_k} \sum_{i:z_i = k} x_i$

#### Vector Quantization
* $\text{encode}(x_i) = \underset{k}{\arg\min}\, ||x_i - \mu_k||$
* $\text{decode}(k) = \mu_k$
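
The E step, M step, and the vector quantization view fit in a few lines of NumPy. This is a sketch with made-up toy data and initial centers. Keeping the old center when a cluster is empty is an assumption, not something the notes specify.

```python
import numpy as np

def assign(X, mu):
    """E step / encode: index of the nearest center for every point."""
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, mu, n_iter=10):
    mu = mu.copy()
    for _ in range(n_iter):
        z = assign(X, mu)                        # E step (hard assignment)
        for k in range(len(mu)):                 # M step (cluster means)
            if np.any(z == k):                   # empty cluster: keep old center
                mu[k] = X[z == k].mean(axis=0)
    return mu

def decode(codes, mu):
    """Vector quantization decode: replace each code by its center."""
    return mu[codes]

# Toy data (made up): two obvious clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
mu = kmeans(X, X[[0, 3]])
codes = assign(X, mu)
reconstruction = decode(codes, mu)
print(codes, mu)
```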

05/13/2018
# Naive Bayes
* Features are conditionally independent given the class label.
$$ p(x | y=c, \theta) = \prod_{j=1}^D p(x_j | y=c, \theta_j)$$
* Complexity is O(CD): the model has O(CD) parameters, where C is the number of classes and D is the number of features.
* Relatively immune to overfitting due to its simplicity.
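
A sketch of the factorized likelihood for binary features, that is, a Bernoulli naive Bayes. The binary feature assumption, the add-one smoothing (`alpha`), and the toy data are illustrative choices that the notes do not make. The parameter table `theta` has shape C x D, which is where the O(CD) comes from.

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes, alpha=1.0):
    """Estimate class priors and per-class, per-feature Bernoulli parameters."""
    priors = np.zeros(n_classes)
    theta = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        # Smoothed estimate of p(x_j = 1 | y = c).
        theta[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2.0 * alpha)
    return priors, theta

def predict(X, priors, theta):
    """argmax_c of log p(y = c) + sum_j log p(x_j | y = c)."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1.0 - theta).T
    return (np.log(priors) + log_lik).argmax(axis=1)

# Toy binary data (made up for illustration).
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

priors, theta = fit_bernoulli_nb(X, y, n_classes=2)
y_pred = predict(X, priors, theta)
print(theta)
print(y_pred)
```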


# Decision Tree
* Greedy algorithm
* Finding the optimal tree is NP-complete, hence the greedy approach
* Algorithm:
    1. Find the best feature and threshold to minimize the cost function: cost_above_threshold + cost_below_threshold
    2. Calculate the gain of splitting
    $$ \Delta = cost(\mathcal{D}) - \left(\frac{|\mathcal{D}_{L}|}{|\mathcal{D}|}cost(\mathcal{D}_{L}) + \frac{|\mathcal{D}_{R}|}{|\mathcal{D}|}cost(\mathcal{D}_{R})\right)$$
    3. If the gain is larger than a gain threshold fixed in advance, split the node
    4. Repeat 1-3 on each child until reaching the max depth or another stopping criterion.
* Pros:
    - Interpretable
    - Relatively robust to outliers
    - Scales well to large datasets
    - Can be modified to handle missing data
* Cons:
    - Not very accurate, due to the greedy algorithm
    - Unstable: small changes in the input data can change the tree a lot => random forest
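
Steps 1 and 2 of the algorithm (search over features and thresholds, then the gain $\Delta$) can be sketched as follows. Gini impurity is used as the cost function, which is an assumption since the notes leave `cost` unspecified. The toy data and function names are made up.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector (assumed cost function)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y):
    """Return (gain, feature index, threshold) of the best single split."""
    parent_cost = gini(y)
    best = (0.0, None, None)
    for j in range(X.shape[1]):
        # Skip the largest value: it would leave the right side empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            w_left = left.mean()
            children_cost = w_left * gini(y[left]) + (1.0 - w_left) * gini(y[~left])
            gain = parent_cost - children_cost
            if gain > best[0]:
                best = (gain, j, float(t))
    return best

# Toy data (made up): feature 0 separates the classes at threshold 2.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])

gain, feature, threshold = best_split(X, y)
print(gain, feature, threshold)
```

A full tree would call `best_split` recursively on the two children and stop when the gain falls below the chosen threshold or the maximum depth is reached.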
    
    
# Kernel
* A kernel $k(x, x')$ is something like a similarity measure between two inputs
* Kernel Machine
    - Feature vector with centroids: $\phi(x) = [k(x, \mu_{1}), \dots, k(x, \mu_{K})]$. Features are defined by the kernel similarity of $x$ to each centroid
    - Feature vector with data points: $\phi(x) = [k(x, x_{1}), \dots, k(x, x_{N})]$. Features are defined by the kernel similarity of $x$ to each training point
    - Logistic regression: $p(y | x, \theta) = \mathrm{Ber}(y | \sigma(w^T \phi(x)))$, where $\sigma$ is the sigmoid
    - L1VM: With L1 regularization
    - L2VM: With L2 regularization
    - RVM: With the regularization coming from an ARD Gaussian prior
    - SVM: Introduces sparsity through the loss function, not the regularization
* The kernel trick replaces every inner product with a kernel evaluation
* Kernelized ridge regression: The dual problem changes the complexity from $O(D^3)$ to $O(N^3)$. Thus, it is effective for high-dimensional data ($D \gg N$).
* Kernel PCA
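
A sketch of the data-point feature map and of kernelized ridge regression in its dual form. The RBF kernel, `gamma`, `lam`, and the toy regression data are assumptions for illustration. The point to notice is that the linear system solved is N x N, whatever the feature dimension.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(K, y, lam):
    """Dual solution alpha = (K + lam * I)^{-1} y, an N x N solve."""
    return np.linalg.solve(K + lam * np.eye(len(K)), y)

# Toy 1-D regression data (made up for illustration).
X = np.linspace(0.0, 3.0, 20)[:, None]
y = np.sin(X[:, 0])
lam = 1e-3

K = rbf_kernel(X, X)            # row i is phi(x_i) = [k(x_i, x_1), ..., k(x_i, x_N)]
alpha = fit_kernel_ridge(K, y, lam)

X_new = np.array([[0.5], [1.5], [2.5]])
phi_new = rbf_kernel(X_new, X)  # features of new points against the training points
y_new = phi_new @ alpha
residual = y - K @ alpha
print(y_new)
```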


# Linear Decision Boundary
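* Define $y(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$
* $\mathbf{w}$ is normal to the hyperplane defined by $y(\mathbf{x}) = 0$
* The component of $\mathbf{x}$ along the $\mathbf{w}$ direction is $\frac{\mathbf{w}}{||\mathbf{w}||} \cdot \mathbf{x}$
* The signed distance from the origin to the hyperplane $y(\mathbf{x}) = 0$ is, for any $\mathbf{x}$ on the hyperplane, $\frac{\mathbf{w}}{||\mathbf{w}||} \cdot \mathbf{x}= -\frac{b}{||\mathbf{w}||}$
* The signed distance of a general point $\mathbf{x}$ from the hyperplane, where $\mathbf{x}_{project}$ is the projection of $\mathbf{x}$ onto it (so $y(\mathbf{x}_{project}) = 0$), is
$$\frac{\mathbf{w}}{||\mathbf{w}||} \cdot \mathbf{x} - \frac{\mathbf{w}}{||\mathbf{w}||} \cdot \mathbf{x}_{project} = \frac{y(\mathbf{x})}{||\mathbf{w}||} - \frac{y(\mathbf{x}_{project})}{||\mathbf{w}||} = \frac{y(\mathbf{x})}{||\mathbf{w}||}$$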
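
A quick numerical check of the distance formula, using made-up values for $\mathbf{w}$, $b$ and $\mathbf{x}$. The function names are illustrative.

```python
import numpy as np

def signed_distance(x, w, b):
    """y(x) / ||w||: positive on the side of the hyperplane that w points to."""
    return (w @ x + b) / np.linalg.norm(w)

def project(x, w, b):
    """Move x along the unit normal until it lies on y(x) = 0."""
    return x - signed_distance(x, w, b) * w / np.linalg.norm(w)

# Made-up example values.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])

d = signed_distance(x, w, b)
x_proj = project(x, w, b)
origin_to_plane = -b / np.linalg.norm(w)
print(d, x_proj, origin_to_plane)
```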

    
# SVM
* On linearly separable data the perceptron has an infinite number of solutions => the best one has to be determined by validation
* SVM chooses the best separator through the concept called margin
* Yields a sparse solution
* Maximum Margin:
$$\underset{\mathbf{w}, b}{\arg\max}\left\{\frac{1}{||\mathbf{w}||} \underset{n}{\min} [t_n (\mathbf{w} \cdot \phi(\mathbf{x}_n) + b)]\right\}$$
* The objective is invariant to rescaling $\mathbf{w}$ and $b$ by a common factor, so we can fix the scale and rewrite the problem as
    * $t_n (\mathbf{w} \cdot \phi(\mathbf{x}_n) + b) = 1$ for the point closest to the boundary
    * $t_n (\mathbf{w} \cdot \phi(\mathbf{x}_n) + b) \geq 1$ for all n
    * $\max \frac{1}{||\mathbf{w}||}$ is equivalent to $\min \frac{1}{2}||\mathbf{w}||^2$
    * Lagrangian with multipliers $a_n \geq 0$: $L = \frac{1}{2}||\mathbf{w}||^2 - \sum_n a_n \{t_n (\mathbf{w} \cdot \phi(\mathbf{x}_n) + b) - 1\}$
* Slack variables introduce regularization by letting points violate the margin, which gives the soft margin
* Sparsity is introduced by changing the loss function
    - Regression: Epsilon-insensitive loss function:
        \begin{equation}
        L_{\epsilon} (y, \hat{y}) =
        \begin{cases}
              0, & \text{if}\ |y - \hat{y}| < \epsilon \\
              |y - \hat{y}| - \epsilon, & \text{otherwise}
        \end{cases}
      \end{equation} 
    - Classification: Hinge loss, with labels $y \in \{-1, +1\}$
    $$L_{hinge} (y, \hat{y}) = (1 - y \hat{y})_{+}$$
* With $C = 1 / \lambda$ and $L$ either of the two losses above, the objective is
$$Loss = C \sum_{i=1}^N L (y_i, \hat{y}_i) + \frac{1}{2} ||w||^2$$
* $\hat{w} = \sum_i \alpha_i x_i$, $\alpha_i \geq 0$. $\alpha$ is sparse, and the $x_i$ with $\alpha_i > 0$ are called support vectors.
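
The two losses and the regularized objective can be written directly in NumPy. This sketch assumes a linear model $\hat{y} = w \cdot x + b$ and labels in $\{-1, +1\}$ for the hinge loss. The sample values and function names are made up.

```python
import numpy as np

def epsilon_insensitive(y, y_hat, eps=0.1):
    """0 inside the epsilon tube, |y - y_hat| - eps outside it."""
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

def hinge(y, y_hat):
    """(1 - y * y_hat)_+ for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * y_hat)

def svm_objective(w, b, X, y, C=1.0):
    """C * sum of hinge losses + 0.5 * ||w||^2 for a linear classifier."""
    y_hat = X @ w + b
    return float(C * hinge(y, y_hat).sum() + 0.5 * w @ w)

# Made-up values for illustration.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
b = 0.0

hinge_values = hinge(y, X @ w + b)
eps_values = epsilon_insensitive(np.array([1.0, 1.0]), np.array([1.05, 1.5]))
objective = svm_objective(w, b, X, y)
print(hinge_values, eps_values, objective)
```

Only the third point lies inside the margin, so it is the only one with a nonzero hinge loss. Points with zero loss do not influence the solution, which is where the sparsity comes from.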

```python
x = set([3, 4, 5])
y = x.copy()  # independent copy of x

x.remove(3)

print(x)  # {4, 5}
print(y)  # {3, 4, 5}
```