# Machine Learning


## Linear Regression
### It takes continuous data and figures out a best-fit line (y = mx + b)
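
As a rough illustration of what "best-fit line" means, the sketch below computes the least-squares slope m and intercept b by hand for a few toy points. The helper name `fit_line` and the data are made up for this example; they are not part of the notebook's stock data.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of co-deviations of x and y divided by sum of squared deviations of x.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    # Intercept: the fitted line passes through the point of means.
    b = y_bar - m * x_bar
    return m, b

# Toy points that lie exactly on y = 2x + 1 (illustrative data only).
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)  # 2.0 1.0
```

With several features instead of one x, the same idea generalizes to one coefficient per feature plus an intercept.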

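### Install the packages 'quandl', 'scikit-learn' (imported as sklearn), 'numpy', 'pandas' & 'matplotlib' from Anaconda Prompt
### Note: quandl is used to access the dataset (stock data). NumPy is used to work with arrays
### Preprocessing is used to scale the features (preprocessing.scale standardizes each feature to zero mean and unit variance, so values are centered on 0 rather than strictly bounded between -1 and 1). Scaling can also improve accuracy and processing speed
### Cross validation is used to create training and testing samples (sklearn.cross_validation was removed in scikit-learn 0.20; sklearn.model_selection replaces it)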
### SVM can also be used to do regression
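
A minimal sketch of that last point, assuming synthetic data rather than the stock dataset: `svm.SVR` exposes the same `fit`/`score` interface as `LinearRegression`, so one can be swapped for the other. The names `X_demo`, `y_demo` and `scores` are placeholders for this example.

```python
import numpy as np
from sklearn import svm
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 3x + 2 (illustrative only).
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 2

# Both estimators share the same fit/score interface, so they can be swapped.
models = {
    "LinearRegression": LinearRegression(),
    "SVR": svm.SVR(kernel="linear"),
}
scores = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    scores[name] = model.score(X_demo, y_demo)  # R^2 on the training data
    print(name, scores[name])
```

On real data the two can score quite differently, and SVR is usually more sensitive to feature scaling and to its kernel and C settings.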


In [64]:
import pandas as pd
import quandl
import math, datetime
import numpy as np
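from sklearn import preprocessing, svm
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')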



In [8]:
df=quandl.get('WIKI/GOOGL')

In [9]:
print(df.head())

              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.1

### All the above columns are called features
### Note: The Adj. columns are the prices adjusted for corporate actions such as stock splits

### Not all the features in the dataset are relevant to us, so we keep only selected features

In [10]:
df=df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

In [11]:
print(df.head())

            Adj. Open  Adj. High   Adj. Low  Adj. Close  Adj. Volume
Date                                                                
2004-08-19  50.159839  52.191109  48.128568   50.322842   44659000.0
2004-08-20  50.661387  54.708881  50.405597   54.322689   22834300.0
2004-08-23  55.551482  56.915693  54.693835   54.869377   18256100.0
2004-08-24  55.792225  55.972783  51.945350   52.597363   15247300.0
2004-08-25  52.542193  54.167209  52.100830   53.164113    9188600.0


### Now we define relationships among the features

### Percent Volatility = (High - Low) / Low * 100

### Percent Change = (Close - Open) / Open * 100

In [13]:
df['HL_PCT']=(df['Adj. High']-df['Adj. Low'])/df['Adj. Low']*100.0

df['PCT_change']=(df['Adj. Close']-df['Adj. Open'])/df['Adj. Open']*100.0
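
As a quick sanity check, the two formulas can be worked by hand on the first row (2004-08-19) of the table printed earlier. The variable names here are just for this example.

```python
# Values copied from the first row (2004-08-19) of the table printed earlier.
adj_open, adj_high, adj_low, adj_close = 50.159839, 52.191109, 48.128568, 50.322842

# Daily range relative to the low, in percent.
hl_pct = (adj_high - adj_low) / adj_low * 100.0
# Open-to-close move relative to the open, in percent.
pct_change = (adj_close - adj_open) / adj_open * 100.0

print(round(hl_pct, 3), round(pct_change, 3))
```

Both results should agree, to within rounding of the displayed inputs, with the HL_PCT and PCT_change values pandas computes for that date.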

### Now we keep only the features that are relevant to us

In [14]:
df=df[['Adj. Close','HL_PCT','PCT_change','Adj. Volume']]

In [15]:
print(df)

             Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                      
2004-08-19    50.322842  8.441017    0.324968   44659000.0
2004-08-20    54.322689  8.537313    7.227007   22834300.0
2004-08-23    54.869377  4.062357   -1.227880   18256100.0
2004-08-24    52.597363  7.753210   -5.726357   15247300.0
2004-08-25    53.164113  3.966115    1.183658    9188600.0
2004-08-26    54.122070  3.143512    2.820391    7094800.0
2004-08-27    53.239345  2.772258   -1.803885    6211700.0
2004-08-30    51.162935  3.411430   -3.106003    5196700.0
2004-08-31    51.343492  1.517228    0.048866    4917800.0
2004-09-01    50.280210  3.310926   -2.385589    9138200.0
2004-09-02    50.912161  3.466748    2.442224   15118600.0
2004-09-03    50.159839  2.436569   -0.931154    5152400.0
2004-09-07    50.947269  2.399357    0.564301    5847500.0
2004-09-08    51.308384  2.517413    1.548541    4985600.0
2004-09-09    51.313400  1.693069   -0.185366    4061700

### To handle missing values, I usually replace NaN with -99999 rather than dropping the rows. This keeps the rest of the row's data, and the sentinel value will be treated as an outlier

In [20]:
df.fillna(-99999,inplace=True)
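
To make the trade-off concrete, here is a small sketch on a toy frame (made-up data) comparing the sentinel approach with simply dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative data only).
toy = pd.DataFrame({"price": [10.0, np.nan, 12.0],
                    "volume": [100.0, 200.0, 300.0]})

filled = toy.fillna(-99999)  # keep the row, mark the gap with a sentinel
dropped = toy.dropna()       # alternative: discard rows that have gaps

print(filled)
print(len(filled), len(dropped))  # 3 2
```

Filling keeps the volume of 200.0 available to the model; dropping loses that row entirely.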

### Store the name of the column to forecast ('Adj. Close') in a variable so that it can be changed easily

In [28]:
forcast_col='Adj. Close'

### Regression is basically used for forecasting

### Logic: I'm taking 0.0081, or 0.81%, of the number of rows in the DataFrame. Each row in the DataFrame represents one trading day of the stock. So, for example, if the stock has been trading for 365 days, there will be 365 rows in the DataFrame. 0.81% of 365 is about 2.96 days, which is then rounded up by the math.ceil function to 3 days.

In [52]:
forcast_out=math.ceil(0.0081*len(df))
print(forcast_out)

28
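
The same arithmetic as a tiny standalone function. The name `forecast_horizon` is hypothetical, and 3400 is only an example row count that happens to give 28; the notebook does not print the actual length.

```python
import math

def forecast_horizon(n_rows, fraction=0.0081):
    """Number of rows to look ahead: `fraction` of the dataset, rounded up."""
    return math.ceil(fraction * n_rows)

print(forecast_horizon(365))   # 0.0081 * 365 = 2.9565, rounded up to 3
print(forecast_horizon(3400))  # 0.0081 * 3400 = 27.54, rounded up to 28
```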


In [53]:
df['label']=df[forcast_col].shift(-forcast_out)

In [54]:
print(df)

             Adj. Close    HL_PCT  PCT_change  Adj. Volume        label
Date                                                                   
2004-08-19    50.322842  8.441017    0.324968   44659000.0    65.742942
2004-08-20    54.322689  8.537313    7.227007   22834300.0    65.000651
2004-08-23    54.869377  4.062357   -1.227880   18256100.0    66.495265
2004-08-24    52.597363  7.753210   -5.726357   15247300.0    67.739104
2004-08-25    53.164113  3.966115    1.183658    9188600.0    69.399229
2004-08-26    54.122070  3.143512    2.820391    7094800.0    68.752232
2004-08-27    53.239345  2.772258   -1.803885    6211700.0    69.639972
2004-08-30    51.162935  3.411430   -3.106003    5196700.0    69.078238
2004-08-31    51.343492  1.517228    0.048866    4917800.0    67.839414
2004-09-01    50.280210  3.310926   -2.385589    9138200.0    68.912727
2004-09-02    50.912161  3.466748    2.442224   15118600.0    70.668146
2004-09-03    50.159839  2.436569   -0.931154    5152400.0    71

### In this dataset, we want to train the model to predict the price 0.81% into the future, meaning 0.81% of the total number of days in the dataset (28 rows here). To train, we pair each day's features with the price that many days later. We use .shift, a pandas method that moves a column up or down by a given number of rows, to create a new column holding the shifted price. Each row then contains the current price, volatility, change and volume alongside the future price to be trained against.

### As we can see, the 29th row's Adj. Close value is now the 1st row's label value
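
The effect of a negative shift is easier to see on a tiny example. This sketch uses a made-up five-row price column and a horizon of 2 instead of 28.

```python
import pandas as pd

# Five made-up daily prices (illustrative data only).
toy = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 14.0]})
horizon = 2

# shift(-2) pulls each value up two rows, so row 0 is paired with row 2's price.
toy["label"] = toy["price"].shift(-horizon)
print(toy)

# The last `horizon` rows have no future price, so their label is NaN.
trainable = toy.dropna()
print(len(trainable))  # 3
```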

In [55]:
print(df.tail())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume  label
Date                                                            
2018-01-12     1130.65  2.101967    1.851185    1914460.0    NaN
2018-01-16     1130.70  1.972201   -0.842753    1783881.0    NaN
2018-01-17     1139.10  1.409002    0.241121    1353097.0    NaN
2018-01-18     1135.97  1.434466   -0.296660    1333633.0    NaN
2018-01-19     1143.50  0.996026    0.480655    1418376.0    NaN


### Since I shifted the values up by 28 rows, there are going to be null values in the last 28 rows of the label column, so I'll drop those rows now

In [57]:
df.dropna(inplace=True)

In [58]:
print(df.tail())

            Adj. Close    HL_PCT  PCT_change  Adj. Volume    label
Date                                                              
2017-12-01     1025.07  2.000197   -0.518240    1850541.0  1130.65
2017-12-04     1011.87  2.191792   -1.549912    1896325.0  1130.70
2017-12-05     1019.60  3.428047    0.851640    1927802.0  1139.10
2017-12-06     1032.72  2.390403    1.593673    1369276.0  1135.97
2017-12-07     1044.57  1.309689    0.820408    1437448.0  1143.50


In [60]:
# Passing the axis positionally was removed in pandas 2.0; name the columns instead
X=np.array(df.drop(columns=['label']))
y=np.array(df['label'])

### X holds all the features and y holds the labels, so the label column is dropped from the DataFrame to leave only features in X
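
A small sketch of the resulting shapes, using a made-up three-row frame with two feature columns: X ends up two-dimensional (rows by features) and y one-dimensional (one label per row).

```python
import numpy as np
import pandas as pd

# Toy frame: two feature columns plus a label column (illustrative data only).
toy = pd.DataFrame({"Adj. Close": [50.0, 54.0, 55.0],
                    "HL_PCT": [8.4, 8.5, 4.1],
                    "label": [65.0, 66.0, 67.0]})

X_demo = np.array(toy.drop(columns=["label"]))  # features only
y_demo = np.array(toy["label"])                 # labels only

print(X_demo.shape, y_demo.shape)  # (3, 2) (3,)
```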

### Now scaling features

In [61]:
X=preprocessing.scale(X)
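
To show what the scaling step does, this sketch runs `preprocessing.scale` on a small made-up array and compares it with the manual calculation (subtract the column mean, divide by the column standard deviation). The array values are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Two columns on very different scales, like price and volume (illustrative data).
X_demo = np.array([[50.0, 44659000.0],
                   [54.0, 22834300.0],
                   [55.0, 18256100.0],
                   [53.0, 15247300.0]])

X_scaled = preprocessing.scale(X_demo)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column

# Same result by hand: center each column, then divide by its standard deviation.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_scaled, manual))
```

Note that individual scaled values can fall outside the range -1 to 1; only the mean and variance of each column are fixed.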

In [68]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)
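
A minimal sketch of what the split returns, assuming toy arrays and a fixed `random_state` (added here only so the example is repeatable): with test_size=0.2, 20% of the rows go to the test set and features stay aligned with their labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows; the first feature of each row is twice its label (illustrative data).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
# Rows are shuffled, but each feature row still matches its own label.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

Rows are shuffled by default, so for time-ordered data like stock prices some later days end up in the training set. Passing `shuffle=False` keeps chronological order if that matters for the evaluation.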

### Fitting the model (the variable is named clf, although LinearRegression is a regressor, not a classifier)

In [69]:
clf=LinearRegression()
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)

In [70]:
print(accuracy)

0.981788342583


### The above model scores about 0.98 on the test set
### For LinearRegression, score returns the coefficient of determination (R^2), not a percentage accuracy or the squared error
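
For reference, the value returned by `score` for a regressor is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers; the helper name `r_squared` is just for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # error left after the fit
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)              # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up true values and predictions (illustrative only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
print(r_squared(y_true, y_pred))  # 1 - 0.75 / 20 = 0.9625
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the labels.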