### Project 1

Project description: 
- Read the data into a Jupyter notebook; use pandas to import the data into a data frame.
- Preprocess the data: explore it, address missing data and any categorical data, and scale it. Justify the type of scaling used in this project.
- Train on your dataset using all the linear regression models you've learned so far. If a model has scaling parameter(s), use grid search to find the best values. Use plots and graphs to get a better view of the results.
- Then use cross-validation to find the average training and testing scores.
- Your submission should include at least the following regression models: KNN regressor, linear regression, Ridge, Lasso, polynomial regression, and SVM (both simple and with kernels).
- Finally, find the best regressor for this dataset, train it on the entire dataset using the best parameters, and predict the market price for the test_set.
- Submit an IPython notebook. Use markdown to provide an inline report for this project.

##### <font color = 'red'> Important note: All group members should participate in completing this project. This includes coding, preparing the report, and testing the models. </font>

## Pre-processing the data

In [3]:
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

data = pd.read_csv('bitcoin_dataset.csv')
test = pd.read_csv('test_set.csv')

In [4]:
data.head()

Unnamed: 0,Date,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
0,2/17/2010 0:00,0.0,2043200.0,0.0,0.0,0.0,0.000235,0,1.0,0.0,...,31.781022,0.0,241,244,41240,244,244,65173.13,36500.0,0.0
1,2/18/2010 0:00,0.0,2054650.0,0.0,0.0,0.0,0.000241,0,1.0,0.0,...,154.463801,0.0,234,235,41475,235,235,18911.74,7413.0,0.0
2,2/19/2010 0:00,0.0,2063600.0,0.0,0.0,0.0,0.000228,0,1.0,0.0,...,1278.516635,0.0,185,183,41658,183,183,9749.98,700.0,0.0
3,2/20/2010 0:00,0.0,2074700.0,0.0,0.0,0.0,0.000218,0,1.0,0.0,...,22186.68799,0.0,224,224,41882,224,224,11150.03,50.0,0.0
4,2/21/2010 0:00,0.0,2085400.0,0.0,0.0,0.0,0.000234,0,1.0,0.0,...,689.179876,0.0,218,218,42100,218,218,12266.83,1553.0,0.0


In [5]:
data.shape

(2906, 24)

## Checking the names of the columns to get to know the data

In [6]:
list(data.columns)

['Date',
 'btc_market_price',
 'btc_total_bitcoins',
 'btc_market_cap',
 'btc_trade_volume',
 'btc_blocks_size',
 'btc_avg_block_size',
 'btc_n_orphaned_blocks',
 'btc_n_transactions_per_block',
 'btc_median_confirmation_time',
 'btc_hash_rate',
 'btc_difficulty',
 'btc_miners_revenue',
 'btc_transaction_fees',
 'btc_cost_per_transaction_percent',
 'btc_cost_per_transaction',
 'btc_n_unique_addresses',
 'btc_n_transactions',
 'btc_n_transactions_total',
 'btc_n_transactions_excluding_popular',
 'btc_n_transactions_excluding_chains_longer_than_100',
 'btc_output_volume',
 'btc_estimated_transaction_volume',
 'btc_estimated_transaction_volume_usd']

## Data Description to understand the data types and the distributions of the data

In [7]:
data.describe()

Unnamed: 0,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,btc_hash_rate,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
count,2906.0,2879.0,2906.0,2885.0,2877.0,2906.0,2906.0,2906.0,2894.0,2906.0,...,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0,2906.0
mean,839.104218,11511380.0,13442550000.0,73983810.0,35505.502848,0.350366,0.364074,671.673651,7.501113,1244070.0,...,66.747821,14.639125,193786.1,102081.138334,68445580.0,94348.852374,63140.320028,1566216.0,203647.5,202433800.0
std,2304.972497,4200024.0,38661500000.0,292422800.0,43618.633821,0.353168,0.842259,689.561322,4.974549,2924141.0,...,1761.894646,20.536083,208914.6,103896.92935,82853410.0,103966.111763,69687.052174,2278910.0,268278.1,580051300.0
min,0.0,2043200.0,0.0,0.0,0.0,0.000216,0.0,1.0,0.0,2.25e-05,...,0.136531,0.0,110.0,118.0,41240.0,118.0,118.0,6150.0,7.0,0.0
25%,6.653465,8485300.0,53630810.0,291645.6,781.0,0.024177,0.0,54.0,6.066667,11.6088,...,1.181945,4.15647,16754.75,8025.25,2413376.0,6813.5,6765.5,490171.2,96003.25,958168.0
50%,235.13,12431150.0,3346869000.0,10014140.0,15183.0,0.196022,0.0,375.0,7.916667,21761.89,...,2.493564,7.82243,130445.0,62337.0,32552710.0,53483.0,35283.5,1105205.0,178468.5,37425760.0
75%,594.191164,15200510.0,8075525000.0,28340380.0,58293.0,0.676065,0.0,1232.995223,10.208333,1035363.0,...,5.915591,14.800589,360376.5,190471.25,108066300.0,185901.75,113793.25,2031654.0,258804.6,131249900.0
max,19498.68333,16837690.0,326525000000.0,5352016000.0,154444.5903,1.110327,7.0,2722.625,47.733333,21609750.0,...,88571.42857,161.686071,1072861.0,490644.0,296688800.0,470650.0,318896.0,45992220.0,5825066.0,5760245000.0


## Finding out how many missing values each column has. This helps in deciding how to handle the missing data

In [8]:
data.isnull().sum()

Date                                                    0
btc_market_price                                        0
btc_total_bitcoins                                     27
btc_market_cap                                          0
btc_trade_volume                                       21
btc_blocks_size                                        29
btc_avg_block_size                                      0
btc_n_orphaned_blocks                                   0
btc_n_transactions_per_block                            0
btc_median_confirmation_time                           12
btc_hash_rate                                           0
btc_difficulty                                         16
btc_miners_revenue                                      0
btc_transaction_fees                                   10
btc_cost_per_transaction_percent                        0
btc_cost_per_transaction                                0
btc_n_unique_addresses                                  0
btc_n_transact

## Since only a few values are missing, we decide to drop the rows that contain missing values.

In [9]:
data_clean_missing = data.dropna()
data_clean_missing.head()

Unnamed: 0,Date,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
0,2/17/2010 0:00,0.0,2043200.0,0.0,0.0,0.0,0.000235,0,1.0,0.0,...,31.781022,0.0,241,244,41240,244,244,65173.13,36500.0,0.0
1,2/18/2010 0:00,0.0,2054650.0,0.0,0.0,0.0,0.000241,0,1.0,0.0,...,154.463801,0.0,234,235,41475,235,235,18911.74,7413.0,0.0
2,2/19/2010 0:00,0.0,2063600.0,0.0,0.0,0.0,0.000228,0,1.0,0.0,...,1278.516635,0.0,185,183,41658,183,183,9749.98,700.0,0.0
3,2/20/2010 0:00,0.0,2074700.0,0.0,0.0,0.0,0.000218,0,1.0,0.0,...,22186.68799,0.0,224,224,41882,224,224,11150.03,50.0,0.0
4,2/21/2010 0:00,0.0,2085400.0,0.0,0.0,0.0,0.000234,0,1.0,0.0,...,689.179876,0.0,218,218,42100,218,218,12266.83,1553.0,0.0


In [10]:
data_clean_missing.shape

(2791, 24)

## 96% of the data is preserved

In [11]:
2791/2906*100

96.04267033723332
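
The retention figure can also be derived from the row counts rather than typed by hand. The following is a minimal sketch in plain Python that uses only the numbers printed in the outputs above; the variable names are illustrative and do not appear in the notebook.

```python
# Row counts and per-column missing counts copied from the outputs above.
rows_before = 2906
rows_after = 2791
missing_counts = {
    "btc_total_bitcoins": 27,
    "btc_trade_volume": 21,
    "btc_blocks_size": 29,
    "btc_median_confirmation_time": 12,
    "btc_difficulty": 16,
    "btc_transaction_fees": 10,
}

rows_dropped = rows_before - rows_after
retained_pct = rows_after / rows_before * 100
visible_missing = sum(missing_counts.values())

print("Rows dropped: {}".format(rows_dropped))
print("Retained: {:.2f}%".format(retained_pct))
print("Missing values listed above: {}".format(visible_missing))
```

The six counts visible in the missing-value output sum to 115, the same as the number of dropped rows. This is consistent with each dropped row missing a single value, although the output is cut off, so it cannot be confirmed for the remaining columns.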

In [12]:
data_clean_missing.describe()

Unnamed: 0,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,btc_hash_rate,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
count,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,...,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0,2791.0
mean,845.171457,11514060.0,13536810000.0,74852970.0,35015.895488,0.351003,0.369402,672.581246,7.574398,1231861.0,...,68.672234,14.771329,193959.5,102249.258689,68188760.0,94389.780724,63108.620566,1571774.0,205601.7,205004600.0
std,2338.350774,4163732.0,39225780000.0,296887600.0,43193.276241,0.35163,0.84839,686.087983,4.913946,2933729.0,...,1797.796643,20.809404,208338.6,103423.192001,82171370.0,103563.992264,69351.480205,2307159.0,272273.7,589901200.0
min,0.0,2043200.0,0.0,0.0,0.0,0.000216,0.0,1.0,0.0,2.25e-05,...,0.136531,0.0,110.0,118.0,41240.0,118.0,118.0,6150.0,7.0,0.0
25%,6.7675,8499725.0,53636780.0,302961.5,812.5,0.024821,0.0,55.0,6.2,11.69601,...,1.180722,4.172385,17127.0,8231.0,2569088.0,6822.5,6883.0,499217.6,98265.0,1030642.0
50%,235.86,12400650.0,3360446000.0,10210030.0,14901.0,0.19892,0.0,377.0,7.933333,24889.35,...,2.459924,7.80271,131062.0,62872.0,32983300.0,53800.0,35609.0,1103973.0,179648.6,38672030.0
75%,591.545,15156400.0,8025327000.0,28060620.0,57035.5,0.67365,0.0,1220.0,10.2,1019200.0,...,5.779113,14.839347,356965.0,189277.0,107387200.0,184684.0,113415.5,2033451.0,259753.5,130752900.0
max,19498.68333,16837690.0,326525000000.0,5352016000.0,154444.5903,1.110327,7.0,2722.625,47.733333,21609750.0,...,88571.42857,161.686071,1072861.0,490644.0,296688800.0,470650.0,318896.0,45992220.0,5825066.0,5760245000.0


## Taking care of the categorical variable - btc_n_orphaned_blocks

In [13]:
categorical_variables = ['btc_n_orphaned_blocks']

# Work on an explicit copy so pandas does not warn about modifying a slice.
data_OHE = data_clean_missing.copy()
for variable in categorical_variables:
    # dropna() above already removed missing values, so no fill is needed.
    dummies = pd.get_dummies(data_OHE[variable], prefix=variable)
    data_OHE = pd.concat([data_OHE.drop(columns=[variable]), dummies], axis=1)

In [14]:
data_OHE.head()

Unnamed: 0,Date,btc_market_price,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_transactions_per_block,btc_median_confirmation_time,btc_hash_rate,...,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd,btc_n_orphaned_blocks_0,btc_n_orphaned_blocks_1,btc_n_orphaned_blocks_2,btc_n_orphaned_blocks_3,btc_n_orphaned_blocks_4,btc_n_orphaned_blocks_5,btc_n_orphaned_blocks_7
0,2/17/2010 0:00,0.0,2043200.0,0.0,0.0,0.0,0.000235,1.0,0.0,2.9e-05,...,65173.13,36500.0,0.0,1,0,0,0,0,0,0
1,2/18/2010 0:00,0.0,2054650.0,0.0,0.0,0.0,0.000241,1.0,0.0,2.9e-05,...,18911.74,7413.0,0.0,1,0,0,0,0,0,0
2,2/19/2010 0:00,0.0,2063600.0,0.0,0.0,0.0,0.000228,1.0,0.0,2.3e-05,...,9749.98,700.0,0.0,1,0,0,0,0,0,0
3,2/20/2010 0:00,0.0,2074700.0,0.0,0.0,0.0,0.000218,1.0,0.0,2.8e-05,...,11150.03,50.0,0.0,1,0,0,0,0,0,0
4,2/21/2010 0:00,0.0,2085400.0,0.0,0.0,0.0,0.000234,1.0,0.0,2.7e-05,...,12266.83,1553.0,0.0,1,0,0,0,0,0,0


In [15]:
list(data_OHE.columns)

['Date',
 'btc_market_price',
 'btc_total_bitcoins',
 'btc_market_cap',
 'btc_trade_volume',
 'btc_blocks_size',
 'btc_avg_block_size',
 'btc_n_transactions_per_block',
 'btc_median_confirmation_time',
 'btc_hash_rate',
 'btc_difficulty',
 'btc_miners_revenue',
 'btc_transaction_fees',
 'btc_cost_per_transaction_percent',
 'btc_cost_per_transaction',
 'btc_n_unique_addresses',
 'btc_n_transactions',
 'btc_n_transactions_total',
 'btc_n_transactions_excluding_popular',
 'btc_n_transactions_excluding_chains_longer_than_100',
 'btc_output_volume',
 'btc_estimated_transaction_volume',
 'btc_estimated_transaction_volume_usd',
 'btc_n_orphaned_blocks_0',
 'btc_n_orphaned_blocks_1',
 'btc_n_orphaned_blocks_2',
 'btc_n_orphaned_blocks_3',
 'btc_n_orphaned_blocks_4',
 'btc_n_orphaned_blocks_5',
 'btc_n_orphaned_blocks_7']
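
To make the encoding step concrete, here is a small sketch of one-hot encoding in plain Python. It is not the pandas implementation; the helper `one_hot` and the sample values are hypothetical and only illustrate how one indicator column is produced per distinct value.

```python
def one_hot(values, prefix):
    """Return a dict mapping '<prefix>_<category>' to a list of 0/1 indicators."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(prefix, category): [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical orphaned-block counts for six days (not taken from the dataset).
orphaned = [0, 1, 0, 2, 7, 0]
encoded = one_hot(orphaned, "btc_n_orphaned_blocks")

for name, column in encoded.items():
    print(name, column)
```

Only values that actually occur get a column, which matches the column list above: there is no `btc_n_orphaned_blocks_6` because no row has six orphaned blocks. A value that first appears in new data would have no matching column, so the encoded training and test frames may need to be aligned.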

## Dividing the dataset into a training and a test set keeping random_state = 0

In [16]:
#Deciding Dependent and Independent Variables
Ind_Vars = ['btc_total_bitcoins', 'btc_market_cap', 'btc_trade_volume', 'btc_blocks_size', 
            'btc_avg_block_size', 'btc_n_transactions_per_block', 'btc_median_confirmation_time', 'btc_hash_rate', 
            'btc_difficulty', 'btc_miners_revenue', 'btc_transaction_fees', 'btc_cost_per_transaction_percent', 
            'btc_cost_per_transaction', 'btc_n_unique_addresses', 'btc_n_transactions', 'btc_n_transactions_total', 
            'btc_n_transactions_excluding_popular', 'btc_n_transactions_excluding_chains_longer_than_100', 
            'btc_output_volume', 'btc_estimated_transaction_volume', 'btc_estimated_transaction_volume_usd', 
            'btc_n_orphaned_blocks_0', 'btc_n_orphaned_blocks_1', 'btc_n_orphaned_blocks_2', 'btc_n_orphaned_blocks_3', 
            'btc_n_orphaned_blocks_4', 'btc_n_orphaned_blocks_5', 'btc_n_orphaned_blocks_7']
X_Variables = data_OHE[Ind_Vars]
Y_Variables = data_OHE['btc_market_price']

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X_Variables, Y_Variables, random_state=0)


## Scaling the Data

In [168]:
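from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then apply it to the test split
scaler = MinMaxScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
X_Test_Scaled = scaler.transform(X_Test)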
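
With its default range, `MinMaxScaler` rescales each feature as x' = (x - min) / (max - min), where min and max are learned from the training data only. The sketch below reproduces that arithmetic in plain Python for a single feature; the helper names and the sample values are hypothetical.

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the training values only."""
    return min(train_column), max(train_column)

def transform_min_max(column, col_min, col_max):
    """Apply x' = (x - min) / (max - min) with the training min and max.

    Assumes the training column is not constant (max > min)."""
    span = col_max - col_min
    return [(x - col_min) / span for x in column]

# Hypothetical values for one feature (not taken from the dataset).
train = [10.0, 20.0, 30.0, 50.0]
test = [5.0, 40.0, 60.0]

col_min, col_max = fit_min_max(train)
train_scaled = transform_min_max(train, col_min, col_max)
test_scaled = transform_min_max(test, col_min, col_max)

print(train_scaled)
print(test_scaled)
```

The training values land in [0, 1], while test values outside the training range fall below 0 or above 1. That is expected when the scaler is fit on the training split alone.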

## Running Regressions - Linear Regression

In [17]:
from sklearn.linear_model import LinearRegression

#X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
#                                                  random_state = 0)
linreg = LinearRegression().fit(X_Train, Y_Train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_Train, Y_Train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_Test, Y_Test)))

linear model coeff (w): [ -4.31536150e-06   5.74337350e-08   2.46679786e-11  -2.16207207e-02
   3.01885436e+01  -1.11673895e-02   5.01215379e-01  -2.17799666e-05
   7.75784689e-11   6.76663578e-06  -1.75195351e-01   3.06068531e-05
   2.65324226e+00   1.16243159e-04   1.30620636e-05   1.16564395e-05
  -5.97160475e-06   1.31994852e-04   3.83306466e-07   3.28526734e-07
   2.30696706e-08   2.10606774e+00  -3.08481539e-01   4.19860334e-01
  -9.40206679e-01  -2.56158454e+00  -1.37603413e+00   2.66037881e+00]
linear model intercept (b): 5.235
R-squared score (training): 1.000
R-squared score (test): 1.000


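## In the model above, R-squared is 1.000 on both the training and the test set, which means the predictors explain essentially all of the variation in (y).
## A perfect score on held-out data suggests that some predictors are almost perfectly correlated with the target (for example btc_market_cap, which by definition depends on the market price), so these variables need to be treated before we can run any models on the data.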
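
The `score` method used above returns the coefficient of determination, R-squared = 1 - SS_res / SS_tot. As a rough illustration of why a value of exactly 1.000 is suspicious, the sketch below computes it by hand in plain Python; the function name and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical prices and two sets of predictions (not from the dataset).
prices = [100.0, 200.0, 300.0, 400.0]
perfect = [100.0, 200.0, 300.0, 400.0]
noisy = [110.0, 190.0, 320.0, 380.0]

print("Perfect predictions: {:.3f}".format(r_squared(prices, perfect)))
print("Noisy predictions:   {:.3f}".format(r_squared(prices, noisy)))
```

R-squared reaches 1 only when the residuals vanish, which on real price data usually indicates that a predictor already encodes the target.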

## Treating the data - checking for collinearity

In [14]:
print(data_OHE.corr())

                                                    btc_market_price  \
btc_market_price                                            1.000000   
btc_total_bitcoins                                          0.390385   
btc_market_cap                                              0.999786   
btc_trade_volume                                            0.869779   
btc_blocks_size                                             0.676715   
btc_avg_block_size                                          0.544035   
btc_n_transactions_per_block                                0.544690   
btc_median_confirmation_time                                0.275712   
btc_hash_rate                                               0.923315   
btc_difficulty                                              0.918243   
btc_miners_revenue                                          0.987069   
btc_transaction_fees                                        0.810470   
btc_cost_per_transaction_percent                           -0.01

## Removing correlated variables and splitting the data again.

In [18]:
#Deciding Dependent and Independent Variables
Ind_Vars = [ 'btc_total_bitcoins', 'btc_trade_volume', 'btc_avg_block_size', 'btc_n_transactions_per_block', 
            'btc_median_confirmation_time','btc_transaction_fees', 'btc_cost_per_transaction_percent', 
            'btc_cost_per_transaction', 'btc_n_transactions_total', 'btc_n_transactions_excluding_chains_longer_than_100', 
            'btc_output_volume', 'btc_estimated_transaction_volume', 'btc_n_orphaned_blocks_0', 'btc_n_orphaned_blocks_1', 
            'btc_n_orphaned_blocks_2', 'btc_n_orphaned_blocks_3', 'btc_n_orphaned_blocks_4', 'btc_n_orphaned_blocks_5', 
            'btc_n_orphaned_blocks_7']
X_Variables = data_OHE[Ind_Vars]
Y_Variables = data_OHE['btc_market_price']

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X_Variables, Y_Variables, random_state=0)

In [19]:
X_Train.shape

(2093, 19)
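
The notebook does not state the rule used to pick the removed variables. One plausible reading is a cut-off on the correlation with the target, sketched below in plain Python with the values printed in the correlation output above. The threshold of 0.9 is an assumption made for illustration, not something taken from the notebook.

```python
# Correlations with btc_market_price, copied from the data_OHE.corr() output above.
corr_with_price = {
    "btc_total_bitcoins": 0.390385,
    "btc_market_cap": 0.999786,
    "btc_trade_volume": 0.869779,
    "btc_blocks_size": 0.676715,
    "btc_avg_block_size": 0.544035,
    "btc_n_transactions_per_block": 0.544690,
    "btc_median_confirmation_time": 0.275712,
    "btc_hash_rate": 0.923315,
    "btc_difficulty": 0.918243,
    "btc_miners_revenue": 0.987069,
    "btc_transaction_fees": 0.810470,
}

THRESHOLD = 0.9  # assumed cut-off, not stated in the notebook

flagged = [name for name, r in corr_with_price.items() if abs(r) > THRESHOLD]
kept = [name for name in corr_with_price if name not in flagged]

print("Flagged:", flagged)
print("Kept:", kept)
```

All four flagged columns are absent from the reduced `Ind_Vars` list. The notebook also drops columns this rule would keep, such as `btc_blocks_size`, presumably because of correlation among the predictors themselves.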

## Checking that the collinearity has been resolved

In [22]:
print(X_Train.corr())

                                                    btc_total_bitcoins  \
btc_total_bitcoins                                            1.000000   
btc_trade_volume                                              0.289914   
btc_avg_block_size                                            0.849426   
btc_n_transactions_per_block                                  0.829366   
btc_median_confirmation_time                                  0.667400   
btc_transaction_fees                                          0.475768   
btc_cost_per_transaction_percent                             -0.083950   
btc_cost_per_transaction                                      0.372043   
btc_n_transactions_total                                      0.780527   
btc_n_transactions_excluding_chains_longer_than...            0.796657   
btc_output_volume                                             0.320088   
btc_estimated_transaction_volume                              0.190561   
btc_n_orphaned_blocks_0               

## Now, re-running the Regression Models:

## Linear Regression

## Cross-Validation for the Linear Regression Model

In [25]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

linregression = LinearRegression()
scores = cross_val_score(linregression, X_Train, Y_Train, cv = 5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ 0.92164913  0.85896554  0.90461393  0.93623686  0.93492328]


In [134]:
print("Average cross-validation score for Linear Regression: {:.2f}".format(scores.mean()))

Average cross-validation score for Linear Regression: 0.91
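
For a regressor and an integer `cv`, `cross_val_score` splits the rows into consecutive, unshuffled folds and reports one R-squared value per held-out fold. The sketch below, in plain Python, shows the fold sizes for the 2093 training rows and recomputes the average from the scores printed above; the helper `k_fold_sizes` is hypothetical.

```python
def k_fold_sizes(n_samples, n_folds):
    """Fold sizes when the first n_samples % n_folds folds get one extra row."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]

# 2093 training rows (X_Train.shape above) split into cv = 5 folds.
sizes = k_fold_sizes(2093, 5)
print("Validation fold sizes:", sizes)

# Fold scores copied from the cross_val_score output above.
fold_scores = [0.92164913, 0.85896554, 0.90461393, 0.93623686, 0.93492328]
mean_score = sum(fold_scores) / len(fold_scores)
print("Mean R-squared: {:.2f}".format(mean_score))
```

Each model is therefore trained on roughly 1675 rows and validated on about 418, and the spread between the fold scores (0.86 to 0.94) gives a sense of how much the estimate depends on the split.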


In [27]:
from sklearn.linear_model import LinearRegression

#X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
#                                                  random_state = 0)
linreg = LinearRegression().fit(X_Train, Y_Train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_Train, Y_Train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_Test, Y_Test)))

linear model coeff (w): [ -1.34441521e-04   3.25870871e-06   1.23381051e+02   3.20035438e-01
   2.05221853e+01   1.83452088e+00  -3.76670905e-03   4.02171703e+01
   1.54939231e-05  -9.24962597e-03   1.98056314e-05   6.14331087e-05
   7.04126437e-01  -6.35582761e+01   1.73521861e+01   3.12119918e+01
  -4.41498586e+01   1.14191891e+02  -5.57520601e+01]
linear model intercept (b): 512.619
R-squared score (training): 0.926
R-squared score (test): 0.906


## Ridge Regression

## Cross-Validation on ridge regression

In [89]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

ridgeregression = Ridge()
scores = cross_val_score(ridgeregression, X_Train_Scaled, Y_Train, cv = 5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ 0.91328041  0.93386159  0.89879039  0.92187508  0.92612247]


In [135]:
print("Average cross-validation score for Ridge Regression: {:.2f}".format(scores.mean()))

Average cross-validation score for Ridge Regression: 0.91


## Grid search to find the best value of the regularization parameter alpha

In [91]:
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {'alpha': [1, 5, 10, 50, 100, 1000],}

grid_search = GridSearchCV(Ridge(), param_grid, cv=5)

grid_search.fit(X_Train_Scaled, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [1, 5, 10, 50, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [92]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'alpha': 1}
Best cross-validation score: 0.92
Best estimator:
Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)


## Using the best value of alpha in ridge regression

In [93]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

from sklearn.linear_model import Ridge

X_Train_Scaled = scaler.fit_transform(X_Train)
X_Test_Scaled = scaler.transform(X_Test)

#Taking the best alpha from the grid search above: alpha=1
linridge = Ridge(alpha=1).fit(X_Train_Scaled, Y_Train)

print('Bitcoin dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_Train_Scaled, Y_Train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_Test_Scaled, Y_Test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Bitcoin dataset
ridge regression linear model intercept: 237.89606045436403
ridge regression linear model coeff:
[ -2.04506363e+03   1.17205438e+04   1.33662575e+02   6.39322716e+02
   8.67163558e+02   2.77241346e+03  -1.66890968e+02   7.04521683e+03
   4.22366673e+03  -2.08611479e+03   7.02430637e+02   3.51732136e+02
  -1.19584669e+01  -9.81935012e+01  -1.62437141e+00   2.25111305e+01
  -4.86387286e+01   1.21644497e+02   1.62594402e+01]
R-squared score (training): 0.923
R-squared score (test): 0.912
Number of non-zero features: 19


## Ridge with varying alpha (figuring out the trend): R-squared decreases as alpha increases

In [94]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 50, 100, 1000, 2500, 5000]:
    linridge = Ridge(alpha = this_alpha).fit(X_Train_Scaled, Y_Train)
    r2_train_data = linridge.score(X_Train_Scaled, Y_Train)
    r2_test_data = linridge.score(X_Test_Scaled, Y_Test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train_data, r2_test_data))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 19, r-squared training: 0.93, r-squared test: 0.91

Alpha = 50.00
num abs(coeff) > 1.0: 19, r-squared training: 0.73, r-squared test: 0.74

Alpha = 100.00
num abs(coeff) > 1.0: 19, r-squared training: 0.63, r-squared test: 0.63

Alpha = 1000.00
num abs(coeff) > 1.0: 17, r-squared training: 0.29, r-squared test: 0.27

Alpha = 2500.00
num abs(coeff) > 1.0: 17, r-squared training: 0.16, r-squared test: 0.15

Alpha = 5000.00
num abs(coeff) > 1.0: 16, r-squared training: 0.09, r-squared test: 0.09



Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number: 2.0032799151249585e-17
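
The trend above follows from the ridge solution itself. For a single feature without an intercept, the ridge slope has the closed form w = sum(x*y) / (sum(x*x) + alpha), so a larger alpha always pulls the coefficient toward zero. The sketch below shows this in plain Python; the function name and the data are hypothetical.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature and no intercept:
    w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

# Hypothetical scaled feature and target (not from the dataset); y = 3 * x.
x = [0.0, 0.5, 1.0]
y = [0.0, 1.5, 3.0]

for alpha in [0, 1, 10, 100]:
    print("alpha = {:>3}: slope = {:.3f}".format(alpha, ridge_slope(x, y, alpha)))
```

Because min-max scaled features have small sums of squares, even moderate values of alpha dominate the denominator, which may explain why R-squared already drops sharply at alpha = 50. With alpha = 0 the penalty disappears and the fit reduces to ordinary least squares; the seven one-hot columns sum to 1 in every row, which is a likely cause of the ill-conditioned matrix warning above.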


## Lasso Regressor

## Cross-Validation on Lasso Regressor

In [95]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso

lassoregression = Lasso()
scores = cross_val_score(lassoregression, X_Train, Y_Train, cv = 5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ 0.92167233  0.85880313  0.90485301  0.93635199  0.93506822]


In [136]:
print("Average cross-validation score for Lasso Regression: {:.2f}".format(scores.mean()))

Average cross-validation score for Lasso Regression: 0.91


## Grid search to find the best value of the regularization parameter alpha

In [97]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

param_grid = {'alpha': [1, 5, 10, 50, 100, 1000],}

grid_search = GridSearchCV(Lasso(), param_grid, cv=5)

grid_search.fit(X_Train, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [1, 5, 10, 50, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [99]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'alpha': 1}
Best cross-validation score: 0.91
Best estimator:
Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)


## Using the best value of alpha in lasso regression

In [100]:
from sklearn.linear_model import Lasso

# Reuses X_Train_Scaled and X_Test_Scaled from the ridge section above;
# alpha=1 is the best value from the grid search.
linlasso = Lasso(alpha=1, max_iter = 10000).fit(X_Train_Scaled, Y_Train)

print('Bitcoin dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_Train_Scaled, Y_Train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_Test_Scaled, Y_Test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X_Variables), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))

Bitcoin dataset
lasso regression linear model intercept: 201.13178681214754
lasso regression linear model coeff:
[ -1770.65925692  14666.46935287      0.            235.92772516
    794.59757348   2021.88555622     -0.           6415.70888877
   4367.47852075  -1863.44629183    443.74558016      0.              0.
    -69.4116401      -0.              0.             -0.              0.
      0.        ]
Non-zero features: 10
R-squared score (training): 0.926
R-squared score (test): 0.905

Features with non-zero weight (sorted by absolute magnitude):
	btc_trade_volume, 14666.469
	btc_cost_per_transaction, 6415.709
	btc_n_transactions_total, 4367.479
	btc_transaction_fees, 2021.886
	btc_n_transactions_excluding_chains_longer_than_100, -1863.446
	btc_total_bitcoins, -1770.659
	btc_median_confirmation_time, 794.598
	btc_output_volume, 443.746
	btc_n_transactions_per_block, 235.928
	btc_n_orphaned_blocks_1, -69.412


## Lasso with varying alpha (figuring out the trend): R-squared generally decreases and fewer features are kept as alpha increases

In [101]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.5, 1, 5, 10, 20, 50, 100]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_Train_Scaled, Y_Train)
    r2_train_lasso = linlasso.score(X_Train_Scaled, Y_Train)
    r2_test_lasso = linlasso.score(X_Test_Scaled, Y_Test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train_lasso, r2_test_lasso))

Lasso regression: effect of alpha regularization
parameter on number of features kept in final model

Alpha = 0.50
Features kept: 13, r-squared training: 0.93, r-squared test: 0.91

Alpha = 1.00
Features kept: 10, r-squared training: 0.93, r-squared test: 0.91

Alpha = 5.00
Features kept: 7, r-squared training: 0.92, r-squared test: 0.90

Alpha = 10.00
Features kept: 6, r-squared training: 0.92, r-squared test: 0.91

Alpha = 20.00
Features kept: 6, r-squared training: 0.90, r-squared test: 0.90

Alpha = 50.00
Features kept: 4, r-squared training: 0.77, r-squared test: 0.79

Alpha = 100.00
Features kept: 3, r-squared training: 0.66, r-squared test: 0.66
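
Lasso drops features because its L1 penalty sets small coefficients to exactly zero instead of merely shrinking them. For a single feature without an intercept, the minimizer of (1 / (2n)) * sum((y - w*x)^2) + alpha * |w| is a soft-thresholded slope. The sketch below works through that case in plain Python; the function names and the data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_slope(x, y, alpha):
    """Lasso slope for one feature and no intercept, minimizing
    (1 / (2 * n)) * sum((y - w * x) ** 2) + alpha * abs(w)."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    z = sum(a * a for a in x) / n
    return soft_threshold(rho, alpha) / z

# Hypothetical scaled feature and target (not from the dataset); y = 3 * x.
x = [0.0, 0.5, 1.0]
y = [0.0, 1.5, 3.0]

for alpha in [0, 0.5, 1, 2]:
    print("alpha = {}: slope = {:.3f}".format(alpha, lasso_slope(x, y, alpha)))
```

Once alpha exceeds the feature's average covariance with the target (1.25 in this toy case), the slope becomes exactly zero. The same mechanism is at work in the table above, where the number of kept features falls from 13 to 3 as alpha grows.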



## Polynomial Regression

## Cross-Validation for Polynomial Regression

In [145]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_Variables)

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, Y_Variables, random_state=0)

linreg = Ridge().fit(X_train_p, y_train_p)
poly = linreg

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train_p, y_train_p)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_test_p, y_test_p)))

from sklearn.model_selection import cross_val_score

scores=cross_val_score(poly, X_Train, Y_Train, cv=5)
print(scores)

(poly deg 2 + ridge) linear model coeff (w):
[  0.00000000e+00   1.07813099e-02  -5.96907950e-04   7.60948210e+00
   2.94968751e+00  -2.95759924e+01  -6.36845260e-01   2.96832526e+01
   4.09360063e+00  -2.58053855e-03  -1.22090260e-02   5.94577908e-02
   5.87531087e-03  -2.17776092e+01   1.28809409e+01   7.97292981e+00
   3.07974516e+00  -2.13310482e+00  -2.29017201e-02   4.38616742e-10
   6.94763429e-13  -1.88711926e-13   3.26485235e-05  -3.10084896e-07
   3.65173962e-06   1.43223378e-07   4.57259029e-08  -1.62057309e-06
   4.04343375e-13   5.17536620e-10   1.72174036e-12  -4.53161853e-11
  -1.07873358e-02  -1.07966652e-02  -1.08005979e-02  -1.07879117e-02
  -1.08153176e-02  -1.07803585e-02  -4.20450767e-03   4.15376672e-17
   7.64147223e-07   1.93641688e-10   1.36173934e-08  -1.35567707e-09
   6.18667355e-07   2.39354673e-09   5.27138892e-15  -3.38323378e-12
  -9.46970066e-16   2.01284669e-12   5.97003050e-04   5.97398955e-04
   5.96195768e-04   5.96307622e-04   5.96420122e-04   5.95

Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number: 3.59385205124772e-20


In [147]:
print("Average cross-validation score for Polynomial Regression: {:.2f}".format(scores.mean()))

Average cross-validation score for Polynomial Regression: 0.91


In [149]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

linreg = LinearRegression().fit(X_Train, Y_Train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_Train, Y_Train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_Test, Y_Test)))

print('\nNow we transform the original input data to add\n\
polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_Variable_poly = poly.fit_transform(X_Variables)

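# This fit reuses X_Train (the original features), so the "poly deg 2" results
# printed below repeat the plain linear model. The expanded X_Variable_poly is
# only used for the ridge fit further down.
linreg = LinearRegression().fit(X_Train, Y_Train)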

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
     .format(linreg.score(X_Train, Y_Train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
     .format(linreg.score(X_Test, Y_Test)))

print('\nAddition of many polynomial features often leads to\n\
overfitting, so we often use polynomial features in combination\n\
with regression that has a regularization penalty, like ridge\n\
regression.\n')

X_Train_Poly, X_Test_Poly, Y_Train_Poly, Y_Test_Poly = train_test_split(X_Variable_poly, Y_Variables,
                                                   random_state = 0)
linreg = Ridge().fit(X_Train_Poly, Y_Train_Poly)

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_Train_Poly, Y_Train_Poly)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_Test_Poly, Y_Test_Poly)))

linear model coeff (w): [ -1.34441521e-04   3.25870871e-06   1.23381051e+02   3.20035438e-01
   2.05221853e+01   1.83452088e+00  -3.76670905e-03   4.02171703e+01
   1.54939231e-05  -9.24962597e-03   1.98056314e-05   6.14331087e-05
   7.04126437e-01  -6.35582761e+01   1.73521861e+01   3.12119918e+01
  -4.41498586e+01   1.14191891e+02  -5.57520601e+01]
linear model intercept (b): 512.619
R-squared score (training): 0.926
R-squared score (test): 0.906

Now we transform the original input data to add
polynomial features up to degree 2 (quadratic)

(poly deg 2) linear model coeff (w):
[ -1.34441521e-04   3.25870871e-06   1.23381051e+02   3.20035438e-01
   2.05221853e+01   1.83452088e+00  -3.76670905e-03   4.02171703e+01
   1.54939231e-05  -9.24962597e-03   1.98056314e-05   6.14331087e-05
   7.04126437e-01  -6.35582761e+01   1.73521861e+01   3.12119918e+01
  -4.41498586e+01   1.14191891e+02  -5.57520601e+01]
(poly deg 2) linear model intercept (b): 512.619
(poly deg 2) R-squared score (train

Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number: 3.59385205124772e-20
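
The ill-conditioning warning and the overfitting remark above both suggest scaling the features before expanding them, and cross-validating the scaling, the polynomial expansion and the ridge fit as one unit. The following is a minimal sketch of that idea, not the notebook's own method. It assumes synthetic data in place of `X_Variables` and `Y_Variables`; the names `X_demo`, `y_demo`, `poly_ridge` and `poly_scores` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the notebook's features and target (assumption)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1000, size=(200, 3))
y_demo = 0.01 * X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(0, 5, size=200)

# Scale to [0, 1], expand to degree-2 terms, then fit ridge.
# Inside cross_val_score each step is refitted on the training folds only.
poly_ridge = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2),
    Ridge(alpha=1.0),
)

poly_scores = cross_val_score(poly_ridge, X_demo, y_demo, cv=5)
print("Cross-validation scores: {}".format(poly_scores))
print("Average cross-validation score: {:.2f}".format(poly_scores.mean()))
```

Because the pipeline receives the raw features, the cross-validation scores here describe the polynomial model itself rather than a separate estimator fitted on unexpanded data.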


## SVM Linear (Basic)

## Cross-Validation for SVM (linear kernel)

In [109]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

svr = SVR(kernel = 'linear', epsilon = 10)

scores=cross_val_score(svr, X_Train_Scaled, Y_Train, cv=5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ 0.05145768  0.03438842  0.04045833  0.067563    0.04814343]


In [137]:
print("Average cross-validation score for SVM-Linear: {:.2f}".format(scores.mean()))

Average cross-validation score for SVM-Linear: 0.05


## Grid search for the best value of the epsilon parameter

In [111]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'epsilon': [0.001, 0.01, 0.1, 1, 10, 1000]}

grid_search = GridSearchCV(SVR(kernel = 'linear'), param_grid, cv=5)

grid_search.fit(X_Train_Scaled, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'epsilon': [0.001, 0.01, 0.1, 1, 10, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [112]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'epsilon': 10}
Best cross-validation score: 0.05
Best estimator:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=10, gamma='auto',
  kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
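
The best cross-validation score above stays near 0.05 for every `epsilon` in the grid. One possible reason is that only `epsilon` is tuned while `C` stays at its default of 1.0, which can be too small when the target spans thousands of units. The sketch below searches over both parameters inside a pipeline. It runs on synthetic data, so it only illustrates the technique and says nothing about the scores the Bitcoin data would give; `X_demo`, `y_demo`, `svr_pipe`, `svr_grid` and `svr_search` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in with a target in the thousands (assumption)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 100, size=(150, 2))
y_demo = 50.0 * X_demo[:, 0] + 20.0 * X_demo[:, 1] + rng.normal(0, 10, size=150)

# The scaler is part of the pipeline, so it is refitted on each training fold
svr_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("svr", SVR(kernel="linear")),
])

# Search the penalty C together with the tube width epsilon
svr_grid = {
    "svr__C": [1, 100, 1000],
    "svr__epsilon": [0.1, 10, 1000],
}

svr_search = GridSearchCV(svr_pipe, svr_grid, cv=5)
svr_search.fit(X_demo, y_demo)

print("Best parameters: {}".format(svr_search.best_params_))
print("Best cross-validation score: {:.2f}".format(svr_search.best_score_))
```

The same grid can be reused for the `'rbf'` and `'poly'` kernels by changing the `kernel` argument, optionally adding `svr__gamma` to the grid.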


## Using the best value of epsilon in SVR

In [113]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
# we must apply the scaling to the test set that we computed for the training set
X_Test_Scaled = scaler.transform(X_Test)

from sklearn.svm import SVR

svr = SVR(kernel = 'linear', epsilon = 10)
svr.fit(X_Train_Scaled, Y_Train)
svr.score(X_Test_Scaled, Y_Test)

0.056159273085643169

## SVM with kernel 'rbf'

## Cross-Validation for SVM kernel 'rbf'

In [114]:
from sklearn.model_selection import cross_val_score
svr = SVR(kernel = 'rbf', epsilon = 1000)

scores=cross_val_score(svr, X_Train_Scaled, Y_Train, cv=5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [-0.00097977  0.00628698  0.00483068 -0.01796899  0.00063173]


In [138]:
print("Average cross-validation score for SVM-rbf: {:.2f}".format(scores.mean()))

Average cross-validation score for SVM-rbf: -0.00


## Grid search for the best value of the epsilon parameter

In [116]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'epsilon': [0.001, 0.01, 0.1, 1, 10, 1000]}

grid_search = GridSearchCV(SVR(kernel = 'rbf'), param_grid, cv=5)

grid_search.fit(X_Train_Scaled, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'epsilon': [0.001, 0.01, 0.1, 1, 10, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [117]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'epsilon': 1000}
Best cross-validation score: -0.00
Best estimator:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=1000, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


## Using the best value of epsilon in SVR (kernel 'rbf')

In [118]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
# we must apply the scaling to the test set that we computed for the training set
X_Test_Scaled = scaler.transform(X_Test)

from sklearn.svm import SVR

svr = SVR(kernel = 'rbf', epsilon = 1000)
svr.fit(X_Train_Scaled, Y_Train)
svr.score(X_Test_Scaled, Y_Test)

0.0049856210458972816

## SVM with kernel 'poly'

## Cross-Validation for SVM kernel 'poly'

In [119]:
from sklearn.model_selection import cross_val_score
svr = SVR(kernel = 'poly', epsilon = 1000)

scores=cross_val_score(svr, X_Train_Scaled, Y_Train, cv=5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ -6.63119840e-03  -6.71050448e-05  -1.64507487e-03  -2.39818672e-02
  -5.09605768e-03]


In [139]:
print("Average cross-validation score for SVM-poly: {:.2f}".format(scores.mean()))

Average cross-validation score for SVM-poly: -0.01


## Grid search for the best value of the epsilon parameter

In [122]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {'epsilon': [0.001, 0.01, 0.1, 1, 10, 5000]}

grid_search = GridSearchCV(SVR(kernel = 'poly'), param_grid, cv=5)

grid_search.fit(X_Train_Scaled, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'epsilon': [0.001, 0.01, 0.1, 1, 10, 5000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [123]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'epsilon': 10}
Best cross-validation score: -0.07
Best estimator:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=10, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)


## Using the best value of epsilon in SVR (kernel 'poly')

In [124]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
# we must apply the scaling to the test set that we computed for the training set
X_Test_Scaled = scaler.transform(X_Test)

from sklearn.svm import SVR

svr = SVR(kernel = 'poly', epsilon = 10)
svr.fit(X_Train_Scaled, Y_Train)
svr.score(X_Test_Scaled, Y_Test)

-0.061919264072588513

## KNN Regressor

In [125]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors = 50)

In [126]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
# we must apply the scaling to the test set that we computed for the training set
X_Test_Scaled = scaler.transform(X_Test)

In [127]:
knn.fit(X_Train_Scaled, Y_Train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=50, p=2,
          weights='uniform')

In [128]:
knn.score(X_Test_Scaled, Y_Test)

0.95236125101114932

## Alternate KNN regressor

## Cross-Validation for KNN regression model

In [129]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

knnregressor = KNeighborsRegressor()
scores = cross_val_score(knnregressor, X_Train, Y_Train, cv = 5)
print("Cross validation scores: {}".format(scores))

Cross validation scores: [ 0.93017282  0.92081244  0.86021269  0.92075946  0.92968766]


In [140]:
print("Average cross-validation score for knn: {:.2f}".format(scores.mean())) 

Average cross-validation score for knn: 0.91


## Grid search to find the best value of k

In [131]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

param_grid = {'weights': ['uniform','distance'], 'n_neighbors': [1, 5, 10, 50, 100]}

grid_search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)

grid_search.fit(X_Train, Y_Train)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'weights': ['uniform', 'distance'], 'n_neighbors': [1, 5, 10, 50, 100]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [132]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

Best parameters: {'n_neighbors': 10, 'weights': 'distance'}
Best cross-validation score: 0.93
Best estimator:
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='distance')
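
The grid search above runs on the unscaled `X_Train`, while the KNN models elsewhere in this notebook are fitted on `X_Train_Scaled`. Putting the scaler and the regressor into one pipeline is one way to tune `n_neighbors` and `weights` on the same scaled features the final model sees. This is a hedged sketch on synthetic data; `X_demo`, `y_demo`, `knn_pipe`, `knn_grid` and `knn_search` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (assumption, stands in for X_Train)
rng = np.random.RandomState(0)
X_demo = np.column_stack([
    rng.uniform(0, 10, 200),
    rng.uniform(0, 10000, 200),
])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.5, 200)

# Scaling happens inside each cross-validation fold
knn_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsRegressor()),
])

knn_grid = {
    "knn__n_neighbors": [1, 5, 10, 50],
    "knn__weights": ["uniform", "distance"],
}

knn_search = GridSearchCV(knn_pipe, knn_grid, cv=5)
knn_search.fit(X_demo, y_demo)

print("Best parameters: {}".format(knn_search.best_params_))
print("Best cross-validation score: {:.2f}".format(knn_search.best_score_))
```

With `refit=True` (the default), `knn_search.best_estimator_` is the whole pipeline refitted with the best settings, so it can score or predict on raw, unscaled test features directly.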


## Using the best value of 'k' in the regression

In [133]:
knnreg = KNeighborsRegressor(n_neighbors = 10).fit(X_Train_Scaled, Y_Train)

print(knnreg.predict(X_Test_Scaled))
print('R-squared test score: {:.3f}'
     .format(knnreg.score(X_Test_Scaled, Y_Test)))

[  9.32520191e+02   5.27093000e+00   5.38678411e+03   1.83494530e+01
   5.37422900e+00   1.01171000e-01   2.78307731e+03   3.53877000e+02
   4.09990000e-01   5.94017000e+02   9.41400844e+02   3.52689400e-01
   2.21670000e-01   3.91409000e+02   1.24617150e+01   2.25749490e+01
   9.62469535e+02   7.94632000e+02   1.31431000e+02   8.32189428e+02
   1.21709520e+01   2.47173751e+03   0.00000000e+00   1.60398357e+04
   2.47774000e+02   2.60086000e+02   3.66477000e+02   1.74001000e+02
   9.82179000e-01   4.92714500e+00   1.16328555e+02   6.01187000e+02
   4.22670000e+02   8.24413300e-01   2.90612000e+02   5.54235685e+02
   6.32644784e+02   6.66801200e+00   8.38636000e+02   0.00000000e+00
   1.48894048e+04   5.22076000e+02   7.87761000e+02   7.72642598e+03
   4.67067257e+03   2.62247000e+02   3.11283000e+02   8.15837795e+02
   1.19748000e+02   1.24561498e+02   0.00000000e+00   2.64759400e-01
   6.00907156e+02   0.00000000e+00   5.19828900e+00   1.36812729e+03
   8.21449960e+01   6.25769000e-02

## Of all the regression models run above, Ridge regression is the most accurate, so we apply it to the test dataset.

In [151]:
test.head()

Unnamed: 0,Date,btc_total_bitcoins,btc_market_cap,btc_trade_volume,btc_blocks_size,btc_avg_block_size,btc_n_orphaned_blocks,btc_n_transactions_per_block,btc_median_confirmation_time,btc_hash_rate,...,btc_cost_per_transaction_percent,btc_cost_per_transaction,btc_n_unique_addresses,btc_n_transactions,btc_n_transactions_total,btc_n_transactions_excluding_popular,btc_n_transactions_excluding_chains_longer_than_100,btc_output_volume,btc_estimated_transaction_volume,btc_estimated_transaction_volume_usd
0,2/1/2018 0:00,16839687.5,152959000000.0,1509688000.0,154613.2244,1.053963,0,1610.4,12.475,20703947.91,...,0.799509,78.049647,591550,257664,296946448,249466,179686,2190613.0,276923.3207,2515366000.0
1,2/2/2018 0:00,16841787.5,149924000000.0,2213437000.0,154785.0008,1.022479,0,1404.27381,11.225,21739145.31,...,0.717894,89.591902,551198,235918,297182366,229894,155128,1460796.0,330740.2192,2944217000.0
2,2/3/2018 0:00,16843762.5,152885000000.0,952403800.0,154942.4583,0.996567,0,1233.487342,10.475,20445148.56,...,1.290914,98.824757,436196,194891,297377257,184856,131568,910042.5,164374.0244,1491970000.0
3,2/4/2018 0:00,16845987.5,141517000000.0,1080683000.0,155118.7652,0.990488,0,975.769663,9.275,23033142.05,...,1.208997,112.999677,396694,173687,297550944,165753,125143,972248.5,193244.214,1623377000.0
4,2/5/2018 0:00,16848300.0,115222000000.0,1793319000.0,155322.7709,1.102733,0,1169.52973,6.133333,23938939.78,...,0.967008,78.586115,486553,216363,297767307,208757,144850,1848913.0,257109.2993,1758323000.0


In [163]:
# Add the one-hot columns for btc_n_orphaned_blocks that the model expects.
# Every indicator is set to 0 here, including btc_n_orphaned_blocks_0,
# although the rows shown above have btc_n_orphaned_blocks equal to 0.
for k in [0, 1, 2, 3, 4, 5, 7]:
    test['btc_n_orphaned_blocks_{}'.format(k)] = 0

In [164]:
Columns = ['btc_total_bitcoins', 'btc_trade_volume', 'btc_avg_block_size', 'btc_n_transactions_per_block', 
            'btc_median_confirmation_time','btc_transaction_fees', 'btc_cost_per_transaction_percent', 
            'btc_cost_per_transaction', 'btc_n_transactions_total', 'btc_n_transactions_excluding_chains_longer_than_100', 
            'btc_output_volume', 'btc_estimated_transaction_volume', 'btc_n_orphaned_blocks_0', 'btc_n_orphaned_blocks_1',
          'btc_n_orphaned_blocks_2', 'btc_n_orphaned_blocks_3', 'btc_n_orphaned_blocks_4', 
           'btc_n_orphaned_blocks_5', 'btc_n_orphaned_blocks_7']

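test_consider = test[Columns]
# Note: this fits a new MinMaxScaler on the test data. Reusing the scaler
# fitted on X_Train (scaler.transform) would keep both sets on the same scale.
scaler = MinMaxScaler()
test_consider_scaled = scaler.fit_transform(test_consider)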

In [167]:
ridge2 = Ridge(alpha = 1)
ridge2.fit(X_Train_Scaled, Y_Train)             # Fit a ridge regression on the training data
pred = ridge2.predict(test_consider_scaled)            # Use this model to predict the test data
print(pred)                                            # Print the predicted market prices

[  8992.77575209  14827.13448302   9617.02670793  12058.76871242
  10072.83310453  14216.12589706   8374.12373246   9398.11420458
   8790.33557158   7924.58990838  11325.77955973  10342.89715818
   7520.8464571   11844.58433462]
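
When preparing a separate test file like this, two steps are easy to get wrong: giving the test frame the same one-hot columns as the training data, and scaling it with the scaler fitted on the training data. The sketch below shows one way to do both with `pandas.get_dummies`, `DataFrame.reindex` and a single `MinMaxScaler`. The tiny frames and all `*_demo` names are made up for illustration; they are not the Bitcoin data and the printed predictions mean nothing for it.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frames standing in for the training and test data (assumption)
train_demo = pd.DataFrame({
    "volume": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "orphaned": [0, 1, 0, 2, 1, 0],
    "price": [100.0, 210.0, 300.0, 420.0, 510.0, 600.0],
})
test_demo = pd.DataFrame({"volume": [25.0, 45.0], "orphaned": [0, 0]})

# One-hot encode the categorical column in each frame
X_train_demo = pd.get_dummies(train_demo.drop(columns="price"),
                              columns=["orphaned"], dtype=int)
X_test_demo = pd.get_dummies(test_demo, columns=["orphaned"], dtype=int)

# Give the test frame exactly the training columns, in the same order.
# Levels that never occur in the test data are filled with 0.
X_test_demo = X_test_demo.reindex(columns=X_train_demo.columns, fill_value=0)

# Fit the scaler on the training features only, then reuse it for the test set
scaler_demo = MinMaxScaler()
X_train_scaled_demo = scaler_demo.fit_transform(X_train_demo)
X_test_scaled_demo = scaler_demo.transform(X_test_demo)

ridge_demo = Ridge(alpha=1).fit(X_train_scaled_demo, train_demo["price"])
pred_demo = ridge_demo.predict(X_test_scaled_demo)
print(pred_demo)
```

With this approach the indicator for an observed level (here `orphaned_0`) is 1 for the matching rows, and the test features stay on the scale the model was trained on.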
