# Calculating Subgradient  
1. Let $k$ be an index that attains the maximum at $x$, i.e. $f_k(x) = f(x)$. Since $g \in \partial f_k(x)$, we have $f_k(z) \geq f_k(x) + g^T(z - x)$ for all $z$. Because $f(z)=\max _{i=1, \ldots, m} f_{i}(z) \geq f_k(z)$, it follows that $f(z) \geq f_k(z) \geq f_k(x) + g^T(z - x) = f(x) + g^T(z - x)$ for all $z$, thus $g \in \partial f(x)$.  
Here $x$ should be treated as an arbitrary fixed point.  

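2. Based on (1), let $f_1(w) = 0$ and $f_2(w) = 1 - yw^Tx$, so that $J(w) = \max\{f_1(w), f_2(w)\}$. A subgradient of $f_1$ is $0$ and a subgradient of $f_2$ is its gradient $-yx$, and therefore a subgradient of $J$ at $w$ is  
$$\partial J(w) \ni g =
\begin{cases}
0 & 1-yw^Tx<0\\
-yx & \text{otherwise}
\end{cases}
$$  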
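The case split above can be checked numerically. The following is a minimal sketch, assuming NumPy; the names `hinge_loss` and `hinge_subgradient` are illustrative and not part of the original assignment. It evaluates the subgradient inequality $J(z) \geq J(w) + g^T(z - w)$ at a batch of random points $z$.

```python
import numpy as np

def hinge_loss(w, x, y):
    # J(w) = max{0, 1 - y w^T x}
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    # Flat branch is active when 1 - y w^T x < 0; otherwise use -y x.
    if 1.0 - y * np.dot(w, x) < 0:
        return np.zeros_like(w)
    return -y * x

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=3)
y = 1.0
g = hinge_subgradient(w, x, y)

# Subgradient inequality: J(z) >= J(w) + g^T (z - w) for every z.
inequality_holds = all(
    hinge_loss(z, x, y) >= hinge_loss(w, x, y) + g @ (z - w) - 1e-12
    for z in rng.normal(size=(100, 3))
)
```

At the kink $yw^Tx = 1$ the function returns $-yx$, which is one valid choice; $0$ would be equally valid there.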


# Perceptron  
The perceptron algorithm is often the first classification algorithm taught in machine learning classes. Suppose we have a labeled training set $(x_1,y_1),\ldots,(x_n,y_n) \in R^d \times \{1,-1\}$. In the perceptron algorithm, we are looking for a hyperplane that perfectly separates the classes. That is, we are looking for a vector $w \in R^d$ such that  
$$y_iw^Tx_i \geq 0$$  
for all $i$.  
When such a hyperplane exists, we say that the data are linearly separable. The perceptron algorithm is given in Algorithm 1.  
<div align="center"><img src = "./perceptron.jpg" width = '500' height = '100' align = center /></div> 

Notice:  
if 
$$y_i x_i^T w^{(k)} \leqslant 0$$  
then  
$$w^{(k+1)} = w^{(k)} + y_ix_i$$  
This indicates that only the mispredicted samples contribute to the final result of $w$.
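The update rule can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a transcription of Algorithm 1: it cycles through the data in order, stops after a full pass without updates, and caps the number of passes with a hypothetical `max_epochs` parameter.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) array of points, y: (n,) array of labels in {-1, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Update only when y_i x_i^T w <= 0.
            if y[i] * (X[i] @ w) <= 0:
                w = w + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:
            break  # a full pass without updates: every margin is positive
    return w

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

On this toy set a single update on the first point is enough, so `w` ends up equal to `X[0]`.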

##  Perceptron Loss  
There is also something called the perceptron loss, given by  
$$l(\hat{y},y) = \max \{0,-y\hat{y} \}$$  

1. If $w$ defines a separating hyperplane, then   
$$y_iw^Tx_i \geqslant 0$$  
for all $i$. Since $\hat{y}_i = w^Tx_i$, this gives  
$$-y_i\hat{y}_i \leqslant 0$$  
for all $i$,  
and hence  
$$l(\hat{y}_i,y_i) = \max \{0,-y_i\hat{y}_i \} = 0$$   
for every training example, so the average loss is 0.  


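2. Subgradient of the perceptron loss. For a single example $(x, y)$ with $J(w) = \max\{0, -yw^Tx\}$, a subgradient is  
$$\partial J(w) \ni g =
\begin{cases}
0 & -yw^Tx<0\\
-yx & \text{otherwise}
\end{cases}
$$  
Then, using stochastic subgradient descent (SSGD):  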
<div align="center"><img src = "./SSGD_perceptron.jpg" width = '500' height = '100' align = center /></div> 

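3. Initialize $w^{(0)} = (0,0,\ldots,0)\in R^d$.  
For each $(x_i,y_i)$ such that $y_ix_i^Tw^{(k)} \leqslant 0$, the update is  
$$w^{(k+1)} = w^{(k)} + y_ix_i$$
Unrolling one more step, if the previous update was triggered by $(x_j,y_j)$, then  
$$w^{(k+1)} = w^{(k-1)} + y_jx_j + y_ix_i$$

where $y_i \in \{-1,1\}$. Unrolling all the way back to $w^{(0)} = 0$,  
we can write $w = \sum_{i=1}^{n}\alpha_i x_i$, which is a linear combination of the training points $x_i$; here $\alpha_i$ is $y_i$ times the number of updates made on example $i$. 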
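The linear-combination claim can be checked by recording the coefficients $\alpha_i$ while the algorithm runs. The sketch below is illustrative only: the function name, the stopping rule, and the toy data are assumptions, not part of the original assignment.

```python
import numpy as np

def perceptron_with_coefficients(X, y, max_epochs=100):
    """Run perceptron updates and track alpha so that w = sum_i alpha_i x_i."""
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:
                w += y[i] * X[i]
                alpha[i] += y[i]  # each update on example i adds y_i to alpha_i
                updated = True
        if not updated:
            break
    return w, alpha

# Toy separable data (made up for illustration).
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, alpha = perceptron_with_coefficients(X, y)

# Rebuild w from the coefficients: sum_i alpha_i x_i = X^T alpha.
w_from_alpha = X.T @ alpha
```

Examples that never trigger an update keep $\alpha_i = 0$, so only the mispredicted points appear in the sum.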

# Polarity Data


In [None]:
import os
import numpy as np
import pickle
import random

'''
Note:  This code is just a hint for people who are not familiar with text processing in python. There is no obligation to use this code, though you may if you like. 
'''


def folder_list(path,label):
    '''
    PARAMETER PATH IS THE PATH OF YOUR LOCAL FOLDER
    '''
    filelist = os.listdir(path)
    review = []
    for infile in filelist:
        file = os.path.join(path,infile)
        r = read_data(file)
        r.append(label)
        review.append(r)
    return review

def read_data(file):
    '''
    Read each file into a list of strings. 
    Example:
    ["it's", 'a', 'curious', 'thing', "i've", 'found', 'that', 'when', 'willis', 'is', 'not', 'called', 'on', 
    ...'to', 'carry', 'the', 'whole', 'movie', "he's", 'much', 'better', 'and', 'so', 'is', 'the', 'movie']
    '''
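    # Use a context manager so the file handle is always closed.
    with open(file) as f:
        lines = f.read().split(' ')
    symbols = '${}()[].,:;+-*/&|<>=~" '
    # Strip punctuation symbols and surrounding whitespace from every token.
    table = str.maketrans("", "", symbols)
    words = (token.translate(table).strip() for token in lines)
    # Return a real list rather than a lazy filter object, so that
    # folder_list can append the label to it. Empty tokens are dropped.
    return [word for word in words if word]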

###############################################
######## YOUR CODE STARTS FROM HERE. ##########
###############################################

def shuffle_data():
    '''
    pos_path is where you save positive review data.
    neg_path is where you save negative review data.
    '''
    pos_path = "data/pos"
    neg_path = "data/neg"

    pos_review = folder_list(pos_path,1)
    neg_review = folder_list(neg_path,-1)

    review = pos_review + neg_review
    random.shuffle(review)
    return review

'''
Now you have read all the files into list 'review' and it has been shuffled.
Save your shuffled result by pickle.
*Pickle is a useful module to serialize a python object structure. 
*Check it out. https://wiki.python.org/moin/UsingPickle
'''
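Saving and reloading the shuffled list with pickle could look like the sketch below. The helper names `save_reviews` and `load_reviews`, the file name `reviews.pkl`, and the toy list are assumptions for illustration; in the notebook the list would come from `shuffle_data()`.

```python
import os
import pickle
import tempfile

def save_reviews(reviews, path):
    # Pickle files must be opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(reviews, f)

def load_reviews(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-in for the output of shuffle_data(): word lists ending in a label.
toy_reviews = [["a", "great", "movie", 1], ["dull", "plot", -1]]

path = os.path.join(tempfile.mkdtemp(), "reviews.pkl")
save_reviews(toy_reviews, path)
restored = load_reviews(path)
```

Pickling once and reloading keeps the shuffle fixed across runs, so the train/test split stays reproducible.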




# Support Vector Machine via Pegasos 

SVM objective function:  

$$\min _{w \in \mathbf{R}^{d}} \frac{\lambda}{2}\|w\|^{2}+\frac{1}{m} \sum_{i=1}^{m} \max \left\{0,1-y_{i} w^{T} x_{i}\right\}$$  
- for simplicity, we are leaving off the unregularized bias term $b$.  
- Pegasos is stochastic subgradient descent using a step size rule $\eta_t = \frac{1}{\lambda t}$  
<div align="center"><img src = "./pegasos.jpg" width = '500' height = '100' align = center /></div> 

(1)  
$$J_{i}(w)=\frac{\lambda}{2}\|w\|^{2}+\max \left\{0,1-y_{i} w^{T} x_{i}\right\}$$  
is not differentiable when $y_iw^Tx_i = 1$  
(2)  
A subgradient of $J_i(w)$ is given by  
$$g=\left\{\begin{array}{ll}
\lambda w-y_{i} x_{i} & \text { for } y_{i} w^{T} x_{i}<1 \\
\lambda w & \text { for } y_{i} w^{T} x_{i} \geq 1
\end{array}\right.$$  
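(3)  
Using stochastic subgradient descent with step size $\eta_t = \frac{1}{\lambda t}$, the update $w_{t+1} = w_t - \eta_t g$ becomes  
$$w_{t+1}=\left\{\begin{array}{ll}
\left(1-\eta_{t} \lambda\right) w_{t}+\eta_{t} y_{i} x_{i} & \text { for } y_{i} w_{t}^{T} x_{i}<1 \\
\left(1-\eta_{t} \lambda\right) w_{t} & \text { for } y_{i} w_{t}^{T} x_{i} \geq 1
\end{array}\right.$$  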

(4)  
Implement the Pegasos algorithm to run on a sparse data representation. The output should be a sparse weight vector $w$.   
Also: If you normalize your data in some way, be sure not to destroy the sparsity of your data. Anything that starts as 0 should stay at 0.
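A direct dictionary-based version might look like the following. This is a minimal sketch, assuming each example is a pair `(x, y)` where `x` is a dict mapping feature names to values and `y` is in $\{-1, 1\}$; the function names, the epoch loop, and the toy data are hypothetical and are not taken from the algorithm figure.

```python
import random

def dot(w, x):
    # Sparse dot product: only features present in x contribute.
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def pegasos(data, lam, epochs=10, seed=0):
    rng = random.Random(seed)
    w = {}  # sparse weight vector; missing keys are zeros
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * dot(w, x)  # computed with w_t, before the update
            # Rescale every stored entry by (1 - eta * lam).
            for k in w:
                w[k] *= 1.0 - eta * lam
            # Margin error: also add eta * y * x.
            if margin < 1:
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w

# Toy bag-of-words data (made up for illustration).
data = [
    ({"good": 1.0}, 1),
    ({"bad": 1.0}, -1),
    ({"good": 1.0, "fine": 1.0}, 1),
    ({"bad": 1.0, "awful": 1.0}, -1),
]
w = pegasos(data, lam=0.1, epochs=20)
```

Only features that have appeared in a margin-error example ever get a key in `w`, so the output stays sparse. The rescaling loop touches every stored entry on every step, which is what makes this version slow.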

(5)  
Note that in every step of the Pegasos algorithm, we rescale every entry of $w_t$ by the factor $(1 - \eta_t\lambda)$. Implementing this directly with dictionaries is very slow. We can make things significantly faster by representing $w$ as $w = sW$, where $s \in R$ and $W \in R^d$. You can start with $s = 1$ and $W$ all zeros (i.e. an empty dictionary). Note that both updates (i.e. whether or not we have a margin error) start with rescaling $w_t$, which we can do simply by setting $s_{t+1} = (1 - \eta_t\lambda)s_t$. If the update is $w_{t+1}=\left(1-\eta_{t} \lambda\right) w_{t}+\eta_{t} y_{j} x_{j}$, then verify that the Pegasos update step is equivalent to  
$$\begin{aligned}
s_{t+1} &=\left(1-\eta_{t} \lambda\right) s_{t} \\
W_{t+1} &=W_{t}+\frac{1}{s_{t+1}} \eta_{t} y_{j} x_{j}
\end{aligned}$$  

There is one subtle issue with the approach described above: if we ever have $1 - \eta_t \lambda = 0$, then $s_{t+1} = 0$, and we’ll have a divide by 0 in the calculation for $W_{t+1}$. This only happens when $\eta_t = \frac{1}{\lambda}$. With our step-size rule of $\eta_t = \frac{1}{\lambda t}$, it happens exactly when $t = 1$. So one approach is to just start at $t = 2$. More generally, note that if $s_{t+1} = 0$, then $w_{t+1} = 0$. Thus an equivalent representation is $s_{t+1} = 1$ and $W = 0$. Thus if we ever get $s_{t+1} = 0$, simply set it back to 1 and reset $W_{t+1}$ to zero, which is an empty dictionary in a sparse representation.
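The equivalence follows from $s_{t+1}W_{t+1} = (1-\eta_t\lambda)s_tW_t + \eta_t y_j x_j = (1-\eta_t\lambda)w_t + \eta_t y_j x_j$. A sketch of the scaled representation is given below, under the same assumptions as before: examples are `(x, y)` pairs with `x` a dict, and all names are hypothetical. The tolerance `1e-12` used to detect $s = 0$ is an added guard against floating-point round-off, not something the text prescribes.

```python
import random

def dot(w, x):
    # Sparse dot product: only features present in x contribute.
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def pegasos_scaled(data, lam, epochs=10, seed=0):
    rng = random.Random(seed)
    s, W = 1.0, {}  # represents w = s * W
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * s * dot(W, x)  # y * w_t^T x
            # Rescaling w is now a single scalar multiplication.
            s *= 1.0 - eta * lam
            if abs(s) < 1e-12:
                # w_{t+1} would be 0: restart from s = 1, W = 0.
                s, W = 1.0, {}
            if margin < 1:
                # Touch only the features present in x.
                for k, v in x.items():
                    W[k] = W.get(k, 0.0) + eta * y * v / s
    # Fold the scale back in to return the actual weight vector.
    return {k: s * v for k, v in W.items()}

# Single-example check: w_1 = 1, then w_2 = (1 - 1/2) * w_1.
w_single = pegasos_scaled([({"a": 1.0}, 1)], lam=1.0, epochs=2)
```

Each step now costs time proportional to the number of nonzero features in the current example, rather than to the number of keys stored in the weight dictionary.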

In [None]:
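def percent_error(X, y, w):
    """
    Fraction of misclassified points (a value in [0, 1]).

    X: array_like, (n_samples, n_features)
       training points
    y: array_like, (n_samples,)
       labels in {-1, 1}
    w: array_like, (n_features,)
       weight vector, stored as a dense array
    
    return
    error: float
    """
    # y_i * w^T x_i is negative exactly when point i is misclassified;
    # a margin of exactly 0 is counted as correct.
    margins = np.dot(X, w) * y
    return np.mean(margins < 0)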
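When the examples and weights are kept as dictionaries, the same error rate can be computed without building dense arrays. The following is a sketch under the assumption that each example is an `(x, y)` pair with `x` a dict of feature values; `percent_error_sparse` and the toy data are hypothetical.

```python
def dot(w, x):
    # Sparse dot product: only features present in x contribute.
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def percent_error_sparse(data, w):
    """Fraction of examples with y * w^T x < 0; a margin of 0 counts as correct."""
    errors = sum(1 for x, y in data if y * dot(w, x) < 0)
    return errors / len(data)

# Toy weights and examples (made up for illustration).
w = {"good": 1.0, "bad": -1.0}
data = [
    ({"good": 2.0}, 1),
    ({"bad": 1.0}, -1),
    ({"good": 1.0, "bad": 3.0}, 1),  # margin -2: misclassified
    ({"plot": 1.0}, -1),             # margin 0: counted as correct
]
error = percent_error_sparse(data, w)
```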

(10)
If the data points in question (in terms of $y_iw^Tx_i$) make up a large proportion of the data, there is no doubt that we cannot ignore them. If they make up only a small amount,

# Error Analysis
This method only investigates the relative importance of features. What if all features are terrible, or the differences in performance between features are not significant? This needs further investigation.  