# Regression

+ Applications:
    1. Stock Market Forecast
    2. Self-driving Car
    3. Recommendation

## Example Application
### Model
+ Estimating the Combat Power (CP) of a Pokémon after evolution
$f(x_{pokemon}) = CP\ after\ evolution$


### Linear model
+ $y=b+w \cdot X_{cp}$

    $y$ is the CP after evolution, $X_{cp}$ is the CP before evolution, and $w$ and $b$ are the unknown parameters.

+ Linear model: $y = b+\sum w_ix_i$

    $x_i$ is a feature, $w_i$ is a weight, and $b$ is the bias.
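
As a minimal sketch of the linear model above, the function below evaluates $y = b+\sum w_ix_i$ for one example. The name `predict` and all numbers are illustrative assumptions, not values from the Pokémon data.

```python
def predict(b, w, x):
    """Return b + sum_i w[i] * x[i] for one example with features x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return b + sum(w_i * x_i for w_i, x_i in zip(w, x))


# Single-feature case y = b + w * X_cp (made-up numbers).
y_single = predict(b=10.0, w=[2.0], x=[100.0])

# Two-feature case, e.g. X_cp and X_cp^2 (made-up numbers).
y_two = predict(b=10.0, w=[2.0, 0.5], x=[4.0, 16.0])
```
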
    
### Loss function   

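+ $L(f) = L(w, b) = \sum \limits_{n=1}^{10} \left(\hat y^n - (b + w \cdot X_{cp}^n)\right)^2$

    $\hat y^n$ is the true CP after evolution of the $n$-th training example and $X_{cp}^n$ is its CP before evolution.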

![avatar](attachment/1-1.png)
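
The loss can be evaluated directly for any candidate $(w, b)$. The sketch below assumes a single feature and uses made-up training pairs; `loss` is a hypothetical helper name.

```python
def loss(w, b, xs, ys):
    """Sum of squared errors of y = b + w * x over the training pairs."""
    return sum((y_true - (b + w * x)) ** 2 for x, y_true in zip(xs, ys))


# Made-up (CP before, CP after) pairs that lie exactly on y = 5 + 2x.
xs = [10.0, 20.0, 30.0]
ys = [25.0, 45.0, 65.0]

perfect = loss(w=2.0, b=5.0, xs=xs, ys=ys)   # every residual is zero
worse = loss(w=2.0, b=0.0, xs=xs, ys=ys)     # every residual is 5
```

A smaller loss means the pair $(w, b)$ fits the training data better, which is the quantity gradient descent minimizes.
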

### Gradient Descent
+ https://github.com/irobbwu/Machine-Learning-Study-Note (see 20-23)
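
A minimal gradient-descent sketch for the single-feature model $y=b+w \cdot X_{cp}$. The learning rate, iteration count and data are assumptions chosen for illustration, and the loss is averaged over the examples here so that the step size does not depend on the dataset size.

```python
def gradient_descent(xs, ys, eta=0.1, n_iters=2000):
    """Fit y = b + w * x by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Residuals of the current model on every training pair.
        residuals = [(b + w * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        # Move both parameters against the gradient.
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Made-up data generated from y = 1 + 3x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]
w, b = gradient_descent(xs, ys)
```
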

### How are the results?
![avatar](attachment/1-2.png)

+ Add $x^2$ to the model: $y=b+w_1 \cdot X_{cp}+w_2 \cdot X_{cp}^2$; the same applies to $x^3$ and $x^4$.

![title](attachment/1-3.png)
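
Adding higher powers keeps the model linear in its parameters; only the feature vector changes. A sketch, assuming hypothetical helpers `poly_features` and `predict_poly` and made-up weights:

```python
def poly_features(x_cp, degree):
    """Return [x, x^2, ..., x^degree] for a single CP value."""
    return [x_cp ** k for k in range(1, degree + 1)]


def predict_poly(b, w, x_cp):
    """Evaluate y = b + w_1 * x + w_2 * x^2 + ... for one CP value."""
    feats = poly_features(x_cp, degree=len(w))
    return b + sum(w_k * f_k for w_k, f_k in zip(w, feats))


features = poly_features(3.0, degree=3)          # [3.0, 9.0, 27.0]
y = predict_poly(b=1.0, w=[2.0, 0.5], x_cp=3.0)  # 1 + 2*3 + 0.5*9
```
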



In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('data/train.csv', encoding = 'big5')
data.head(10)

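| | 日期 | 測站 | 測項 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014/1/1 | 豐原 | AMB_TEMP | 14.0 | 14.0 | 14.0 | 13.0 | 12.0 | 12.0 | 12.0 | ... | 22.0 | 22.0 | 21.0 | 19.0 | 17.0 | 16.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| 1 | 2014/1/1 | 豐原 | CH4 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | ... | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 |
| 2 | 2014/1/1 | 豐原 | CO | 0.51 | 0.41 | 0.39 | 0.37 | 0.35 | 0.3 | 0.37 | ... | 0.37 | 0.37 | 0.47 | 0.69 | 0.56 | 0.45 | 0.38 | 0.35 | 0.36 | 0.32 |
| 3 | 2014/1/1 | 豐原 | NMHC | 0.2 | 0.15 | 0.13 | 0.12 | 0.11 | 0.06 | 0.1 | ... | 0.1 | 0.13 | 0.14 | 0.23 | 0.18 | 0.12 | 0.1 | 0.09 | 0.1 | 0.08 |
| 4 | 2014/1/1 | 豐原 | NO | 0.9 | 0.6 | 0.5 | 1.7 | 1.8 | 1.5 | 1.9 | ... | 2.5 | 2.2 | 2.5 | 2.3 | 2.1 | 1.9 | 1.5 | 1.6 | 1.8 | 1.5 |
| 5 | 2014/1/1 | 豐原 | NO2 | 16.0 | 9.2 | 8.2 | 6.9 | 6.8 | 3.8 | 6.9 | ... | 11.0 | 11.0 | 22.0 | 28.0 | 19.0 | 12.0 | 8.1 | 7.0 | 6.9 | 6.0 |
| 6 | 2014/1/1 | 豐原 | NOx | 17.0 | 9.8 | 8.7 | 8.6 | 8.5 | 5.3 | 8.8 | ... | 14.0 | 13.0 | 25.0 | 30.0 | 21.0 | 13.0 | 9.7 | 8.6 | 8.7 | 7.5 |
| 7 | 2014/1/1 | 豐原 | O3 | 16.0 | 30.0 | 27.0 | 23.0 | 24.0 | 28.0 | 24.0 | ... | 65.0 | 64.0 | 51.0 | 34.0 | 33.0 | 34.0 | 37.0 | 38.0 | 38.0 | 36.0 |
| 8 | 2014/1/1 | 豐原 | PM10 | 56.0 | 50.0 | 48.0 | 35.0 | 25.0 | 12.0 | 4.0 | ... | 52.0 | 51.0 | 66.0 | 85.0 | 85.0 | 63.0 | 46.0 | 36.0 | 42.0 | 42.0 |
| 9 | 2014/1/1 | 豐原 | PM2.5 | 26.0 | 39.0 | 36.0 | 35.0 | 31.0 | 28.0 | 25.0 | ... | 36.0 | 45.0 | 42.0 | 49.0 | 45.0 | 44.0 | 41.0 | 30.0 | 24.0 | 13.0 |

Column names in the file: 日期 = date, 測站 = station, 測項 = measured item; 豐原 is the Fengyuan station. The columns 0 to 23 are the hours of the day.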


In [3]:
def to_rawdata(data, n = 3):
    # Drop the first n descriptive columns and keep only the hourly readings.
    # .copy() returns an independent frame, so the assignment below does not
    # trigger pandas' SettingWithCopyWarning.
    raw_data = data.iloc[:,n:].copy()
    # 'NR' is a non-numeric placeholder in the data; replace it with 0
    # so that the frame can be cast to float.
    raw_data[raw_data == 'NR'] = 0
    raw_data = raw_data.astype(float)
    return raw_data

In [4]:
raw_data = to_rawdata(data)
raw_data.head(18)


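| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 14.0 | 14.0 | 13.0 | 12.0 | 12.0 | 12.0 | 12.0 | 15.0 | 17.0 | ... | 22.0 | 22.0 | 21.0 | 19.0 | 17.0 | 16.0 | 15.0 | 15.0 | 15.0 | 15.0 |
| 1 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | ... | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 |
| 2 | 0.51 | 0.41 | 0.39 | 0.37 | 0.35 | 0.3 | 0.37 | 0.47 | 0.78 | 0.74 | ... | 0.37 | 0.37 | 0.47 | 0.69 | 0.56 | 0.45 | 0.38 | 0.35 | 0.36 | 0.32 |
| 3 | 0.2 | 0.15 | 0.13 | 0.12 | 0.11 | 0.06 | 0.1 | 0.13 | 0.26 | 0.23 | ... | 0.1 | 0.13 | 0.14 | 0.23 | 0.18 | 0.12 | 0.1 | 0.09 | 0.1 | 0.08 |
| 4 | 0.9 | 0.6 | 0.5 | 1.7 | 1.8 | 1.5 | 1.9 | 2.2 | 6.6 | 7.9 | ... | 2.5 | 2.2 | 2.5 | 2.3 | 2.1 | 1.9 | 1.5 | 1.6 | 1.8 | 1.5 |
| 5 | 16.0 | 9.2 | 8.2 | 6.9 | 6.8 | 3.8 | 6.9 | 7.8 | 15.0 | 21.0 | ... | 11.0 | 11.0 | 22.0 | 28.0 | 19.0 | 12.0 | 8.1 | 7.0 | 6.9 | 6.0 |
| 6 | 17.0 | 9.8 | 8.7 | 8.6 | 8.5 | 5.3 | 8.8 | 9.9 | 22.0 | 29.0 | ... | 14.0 | 13.0 | 25.0 | 30.0 | 21.0 | 13.0 | 9.7 | 8.6 | 8.7 | 7.5 |
| 7 | 16.0 | 30.0 | 27.0 | 23.0 | 24.0 | 28.0 | 24.0 | 22.0 | 21.0 | 29.0 | ... | 65.0 | 64.0 | 51.0 | 34.0 | 33.0 | 34.0 | 37.0 | 38.0 | 38.0 | 36.0 |
| 8 | 56.0 | 50.0 | 48.0 | 35.0 | 25.0 | 12.0 | 4.0 | 2.0 | 11.0 | 38.0 | ... | 52.0 | 51.0 | 66.0 | 85.0 | 85.0 | 63.0 | 46.0 | 36.0 | 42.0 | 42.0 |
| 9 | 26.0 | 39.0 | 36.0 | 35.0 | 31.0 | 28.0 | 25.0 | 20.0 | 19.0 | 30.0 | ... | 36.0 | 45.0 | 42.0 | 49.0 | 45.0 | 44.0 | 41.0 | 30.0 | 24.0 | 13.0 |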


In [5]:
raw_data = raw_data.values

In [6]:
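def classfi(raw_data):
    # Each day occupies 18 consecutive rows (one per measured item)
    # and 24 columns (one per hour).
    X = []
    y = []
    for i in range(0, raw_data.shape[0], 18):   # one 18-row block per day
        for j in range(0, 15):                  # 15 nine-hour windows per day
            X.append(raw_data[i:i+18,j:j+9])    # features: 18 items x 9 hours
            y.append(raw_data[i+9,j+9])         # target: PM2.5 (row 9) in the next hour
    return X, y

X, y = classfi(raw_data)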

In [7]:
X = np.array(X)

In [8]:
X.shape

(3600, 18, 9)
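
To see how the windows are cut, the sketch below applies the same slicing to a toy day with 2 items and 12 hours instead of 18 items and 24 hours. The helper name `make_windows` and the toy values are assumptions for illustration.

```python
def make_windows(day, target_row, width=9):
    """Slide a `width`-hour window over one day.

    `day` is a list of rows (one per measured item), each a list of hourly
    values. Returns (X, y): X[k] holds every row restricted to hours
    k..k+width-1, and y[k] is the target row's value at hour k+width.
    """
    n_hours = len(day[0])
    X, y = [], []
    for start in range(n_hours - width):
        X.append([row[start:start + width] for row in day])
        y.append(day[target_row][start + width])
    return X, y


# Toy day: 2 items x 12 hours (the real data has 18 items x 24 hours).
toy_day = [
    [float(h) for h in range(12)],         # item 0: 0, 1, ..., 11
    [float(100 + h) for h in range(12)],   # item 1: 100, 101, ..., 111
]
X_toy, y_toy = make_windows(toy_day, target_row=1)
```

With 24 hours and `width=9` the same rule yields 15 windows per day, which matches `range(0, 15)` in `classfi`; 4320 / 18 = 240 days then give the 3600 samples.
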

## First, predict the tenth hour's PM2.5 from the previous nine hours of PM2.5

In [9]:
# Row 9 of each 18 x 9 window is PM2.5; keep only that row as the feature vector.
X_pm25 = []
for i in range(X.shape[0]):
    X_pm25.append(X[i][9])

In [10]:
X_pm25 = np.array(X_pm25)

In [11]:
X_pm25.shape

(3600, 9)

### Adagrad
![title](attachment/1-4.png)

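$$
g \leftarrow \nabla f\\
r \leftarrow r + g \odot g\\
\theta \leftarrow \theta - \frac{\eta}{\sqrt{r} + \epsilon} \odot g
$$

Here $g$ is the gradient, $r$ accumulates the element-wise squared gradients, $\theta$ denotes the parameters, $\eta$ is the learning rate, and $\epsilon$ is a small constant that avoids division by zero.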
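
The accumulation rule can be sketched on a small least-squares problem as follows. This is an illustrative, dependency-free version: the parameter step $\theta \leftarrow \theta - \frac{\eta}{\sqrt{r}+\epsilon} g$ is the standard Adagrad update, while the data, `eta`, `eps` and the iteration count are assumptions.

```python
import math


def adagrad_fit(xs, ys, eta=1.0, eps=1e-8, n_iters=5000):
    """Fit y = b + w * x with Adagrad on the mean squared error."""
    theta = [0.0, 0.0]   # [b, w]
    r = [0.0, 0.0]       # per-parameter sum of squared gradients
    n = len(xs)
    for _ in range(n_iters):
        residuals = [(theta[0] + theta[1] * x) - y for x, y in zip(xs, ys)]
        g = [
            2.0 / n * sum(residuals),                                 # dL/db
            2.0 / n * sum(res * x for res, x in zip(residuals, xs)),  # dL/dw
        ]
        for k in range(2):
            r[k] += g[k] * g[k]                                # r <- r + g * g
            theta[k] -= eta / (math.sqrt(r[k]) + eps) * g[k]   # scaled step
    return theta


# Made-up data generated from y = 1 + 3x.
b, w = adagrad_fit([0.0, 1.0, 2.0, 3.0], [1.0, 4.0, 7.0, 10.0])
```

Each parameter gets its own effective learning rate `eta / sqrt(r[k])`, which shrinks faster for parameters that have seen large gradients.
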

In [12]:
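def x_pre(x):
    # Prepend a column of ones so that beta[0] acts as the bias term.
    b0 = np.ones(x.shape[0]).reshape(-1, 1)
    return np.hstack([b0,x])

def gradient_descent_adagrad(X, y, n_iters = 10e6, eta = 0.1, eps = 1e-8):

    def dJ(beta, X_b, y):
        # Gradient of the mean squared error with respect to beta.
        return X_b.T.dot(X_b.dot(beta) - y) * 2 / len(X_b)

    X = x_pre(X)
    initial_beta = np.random.rand(X.shape[1])

    i_iter = 0
    adagrad = 0    # running sum of squared gradients (r in the formula above)
    beta = initial_beta

    while i_iter < n_iters:
        gradient = dJ(beta, X, y)
        # Accumulate g * g, then scale the step by 1 / sqrt(r).
        # eps (an added safeguard) avoids division by zero.
        adagrad = adagrad + gradient**2
        beta = beta - (eta/(np.sqrt(adagrad) + eps))*gradient
        i_iter += 1

    return beta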

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [14]:
normal = StandardScaler()
normal.fit(X_pm25)
X_pm25 = normal.transform(X_pm25)
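
`StandardScaler` rescales each column to zero mean and unit variance using statistics learned in `fit`. The dependency-free sketch below shows the same computation for one column; `fit_column` and `transform_column` are hypothetical helper names, and the population standard deviation is used, as `StandardScaler` does.

```python
import math


def fit_column(values):
    """Return the mean and population standard deviation of one column."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def transform_column(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train_col = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up readings
mean, std = fit_column(train_col)                       # mean 5.0, std 2.0
scaled_train = transform_column(train_col, mean, std)
# New data must reuse the statistics fitted on the training column.
scaled_new = transform_column([5.0, 9.0], mean, std)    # [0.0, 2.0]
```

This is why `normal.transform` is later applied to the test features instead of fitting a new scaler on them.
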

In [15]:
# Note: test_size = 0.8 holds out 80% of the samples and trains on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(X_pm25, y, test_size = 0.8)

In [16]:
beta = gradient_descent_adagrad(X_train, y_train, n_iters = 100000, eta = len(y_train))
beta

array([ 23.81386436,   0.69461189,  -0.91016254,   3.64400782,
        -4.11578086,   0.44004682,   7.93500593, -10.35936909,
        -0.42691584,  19.08481204])

In [17]:
from sklearn.metrics import mean_squared_error

In [18]:
y_predict = x_pre(X_test).dot(beta)

In [19]:
mean_squared_error(y_test, y_predict)

46.07795392737638
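
`mean_squared_error` averages the squared differences between targets and predictions. A small sketch with made-up values (the helper `mse` is hypothetical):

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


error = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])   # (1 + 0 + 4) / 3
rmse = error ** 0.5                              # same units as the target
```

Taking the square root of the reported 46.08 gives a typical error of roughly 6.8 in PM2.5 units.
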

##### Predict

In [20]:
test_data = pd.read_csv('data/test.csv', encoding = 'big5', header = None )
test_data.head(10)

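| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id_0 | AMB_TEMP | 21.0 | 21.0 | 20.0 | 20.0 | 19.0 | 19.0 | 19.0 | 18.0 | 17.0 |
| 1 | id_0 | CH4 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.8 |
| 2 | id_0 | CO | 0.39 | 0.36 | 0.36 | 0.4 | 0.53 | 0.55 | 0.34 | 0.31 | 0.23 |
| 3 | id_0 | NMHC | 0.16 | 0.24 | 0.22 | 0.27 | 0.27 | 0.26 | 0.27 | 0.29 | 0.1 |
| 4 | id_0 | NO | 1.3 | 1.3 | 1.3 | 1.3 | 1.4 | 1.6 | 1.2 | 1.1 | 0.9 |
| 5 | id_0 | NO2 | 17.0 | 14.0 | 13.0 | 14.0 | 18.0 | 21.0 | 8.9 | 9.4 | 5.0 |
| 6 | id_0 | NOx | 18.0 | 16.0 | 14.0 | 15.0 | 20.0 | 23.0 | 10.0 | 10.0 | 5.8 |
| 7 | id_0 | O3 | 32.0 | 31.0 | 31.0 | 26.0 | 16.0 | 12.0 | 27.0 | 20.0 | 26.0 |
| 8 | id_0 | PM10 | 62.0 | 50.0 | 44.0 | 39.0 | 38.0 | 32.0 | 48.0 | 36.0 | 25.0 |
| 9 | id_0 | PM2.5 | 33.0 | 39.0 | 39.0 | 25.0 | 18.0 | 18.0 | 17.0 | 9.0 | 4.0 |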


In [21]:
raw_test = to_rawdata(test_data, n = 2)


In [22]:
raw_test = raw_test.values

In [23]:
X_test = []
for i in range(0, raw_test.shape[0], 18):
    # Row 9 of each 18-row block is PM2.5; its 9 columns are the 9 input hours.
    X_test.append(raw_test[i+9,:])

X_test = np.array(X_test)

In [24]:
X_test.shape

(240, 9)

In [25]:
X_test_normal = normal.transform(X_test)

In [26]:
y_predict = beta[0] + X_test_normal.dot(beta[1:])

In [27]:
predict_answer = []

for i in range(y_predict.shape[0]):
    predict_answer.append([f'id_{i}', y_predict[i]])  

In [28]:
pd_predict_answer = pd.DataFrame(predict_answer, columns = ['id', 'PM2.5'] )
pd_predict_answer.head()

| | id | PM2.5 |
|---|---|---|
| 0 | id_0 | 6.353491 |
| 1 | id_1 | 16.010039 |
| 2 | id_2 | 24.24489 |
| 3 | id_3 | 11.600009 |
| 4 | id_4 | 27.065755 |


In [29]:
pd_predict_answer.to_csv('predict_answer/adagrad.csv', index = False)