### Loading the Data

* instant - id of the record (integer)
* dteday - date of the record (year-month-day)
* season - season of the record (integer, 1-4)
    * 1: Spring
    * 2: Summer
    * 3: Fall
    * 4: Winter
* yr - year of the record (integer, 0-1)
    * 0: 2011
    * 1: 2012
* mth - month of the record (integer, 1-12)
* hr - hour of the record (integer, 0-23)
* holiday - whether the day is a holiday or not (integer, 0-1)
* weekday - day of the week (integer, 1-7)
* workingday - whether the day is a working day (neither holiday nor weekend) or not (integer, 0-1)
* weathersit - weather situation (integer, 1-4)
    * 1: clear, few clouds, or partly cloudy
    * 2: mist (no precipitation)
    * 3: light rain or light snow
    * 4: heavy rain, hail, or snow
* temp - temperature in Celsius, normalized by dividing by the highest temperature recorded over these two years (float, [0, 1])
* atemp - apparent temperature in Celsius, normalized by dividing by the highest apparent temperature over these two years (float, [0, 1]). Apparent temperature quantifies the temperature perceived by humans, combining wind chill, humidity, and actual temperature.
* hum - humidity, expressed as a fraction (float, [0, 1])
* windspeed - wind speed, normalized by dividing by the highest speed recorded over these two years (float, [0, 1])
* cnt - number of bikes rented (integer). Present in the training set only; this is the value to predict.
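
The max-normalization described for temp, atemp, and windspeed can be illustrated with a short sketch. The readings and the name `raw_temp` are made up for illustration and are not taken from the data set; the result stays in [0, 1] only because the made-up readings are non-negative.

```python
import numpy as np

# Hypothetical temperature readings in Celsius (not from the data set)
raw_temp = np.array([4.1, 12.3, 20.5, 41.0])

# Divide by the largest reading so that the maximum maps to 1
temp = raw_temp / raw_temp.max()

print(temp)
```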

In [95]:
import numpy as np

In [96]:
# Load the features
X = np.loadtxt('data/train_transformed.csv',  delimiter=',', 
               skiprows=1, usecols=range(0, 15))

print X.shape

(10886, 15)


In [97]:
# Load the target 
y = np.loadtxt('data/train_transformed.csv', delimiter=',', 
               skiprows=1, usecols=[16])

print y.shape

(10886,)
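
To make the `usecols` selection easier to follow, here is a minimal sketch that runs the same two `np.loadtxt` calls on a tiny in-memory CSV. The 17-column layout, the column names `c0`..`c16`, and the values are assumptions made for the demo (written for Python 3); the real file layout is not shown in this notebook. Note that column index 15 is read by neither call.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV: one header row plus 3 records of 17 columns
n_cols = 17
header = ",".join("c%d" % i for i in range(n_cols))
rows = [",".join(str(r * 100 + c) for c in range(n_cols)) for r in range(3)]
csv_text = "\n".join([header] + rows)

# Features: columns 0..14 (range(0, 15) excludes index 15)
X_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                    skiprows=1, usecols=list(range(0, 15)))

# Target: column 16 only, which yields a 1-D array
y_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                    skiprows=1, usecols=[16])

print(X_demo.shape)
print(y_demo.shape)
```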


In [98]:
# Load the test set
T = np.loadtxt('data/test_transformed.csv',  delimiter=',', 
               skiprows=1, usecols=range(0, 15))

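# Note: this prints the shape of the training matrix X, not of T;
# use `print T.shape` to inspect the test set instead
print X.shape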

(10886, 15)


In [119]:
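from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

# Log-transform the target for inspection only; log_y is not used in the fit below
log_y = np.array(np.log(y), dtype=np.float_)
print log_y
print log_y.shape
print y.shape

# GaussianNB is a classifier, so each distinct count in y is treated as a class label
gnb.fit(X, y)

# Predict on the training set itself (in-sample predictions)
pred = gnb.predict(X)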

[ 2.77258872  3.68887945  3.4657359  ...,  5.12396398  4.8598124
  4.47733681]
(10886,)
(10886,)
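
Because `GaussianNB` is a classifier, it can only ever predict labels that it saw during fitting. The sketch below illustrates this on synthetic data; the arrays `X_demo` and `y_demo`, the seed, and the sizes are made up for the demo, and the code is written for Python 3 with a recent scikit-learn.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)          # arbitrary seed for the demo
X_demo = rng.rand(40, 3)                # synthetic features
y_demo = rng.randint(1, 6, size=40)     # synthetic "counts" in 1..5

clf = GaussianNB().fit(X_demo, y_demo)

# Predict on new synthetic points
pred_demo = clf.predict(rng.rand(10, 3))

# The fitted classes are the distinct target values seen in training,
# and every prediction is one of them
print(clf.classes_)
print(np.isin(pred_demo, clf.classes_).all())
```

Applied to bike counts, this means the model chooses among the rental counts present in the training target rather than producing an arbitrary number.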


In [120]:
# Compute the RMSLE of the in-sample predictions

n_samples, n_features = X.shape

# \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i + 1))^2}

def sle(actual, predicted):
    # Squared log error of each sample
    return np.power(np.log(np.array(actual) + 1) - np.log(np.array(predicted) + 1), 2)

def msle(actual, predicted):
    # Mean of the squared log errors
    return np.mean(sle(actual, predicted))

def rmsle(actual, predicted):
    # Root of the mean squared log error
    return np.sqrt(msle(actual, predicted))

rmsle(y, pred)

1.6579950391742981
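
As a sanity check on the metric, here is a small self-contained sketch of the same formula using `np.log1p`, which computes log(1 + x). The function name `rmsle_log1p` and the toy inputs are hypothetical and chosen so that the result is easy to verify by hand.

```python
import numpy as np

def rmsle_log1p(actual, predicted):
    # Root mean squared logarithmic error, using log1p(x) = log(1 + x)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Identical arrays give an error of zero
perfect = rmsle_log1p([3, 5, 10], [3, 5, 10])

# log(1 + 0) = 0 and log(1 + (e - 1)) = 1, so the error is about 1
one_off = rmsle_log1p([0.0], [np.e - 1.0])

print(perfect)
print(one_off)
```

The metric penalizes ratios rather than absolute differences, so an error of 10 bikes matters more when the true count is 5 than when it is 500.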

In [121]:
# Set up a stratified 10-fold cross-validation
from sklearn import cross_validation
folds = cross_validation.StratifiedKFold(y, 10, shuffle=True)

In [122]:
# This is one way to access the training and test points
for ix, (tr, te) in enumerate(folds):
    print "Fold %d" % ix
    print "\t %d training points" % len(tr)
    print "\t %d test points" % len(te)

Fold 0
	 9638 training points
	 1248 test points
Fold 1
	 9658 training points
	 1228 test points
Fold 2
	 9708 training points
	 1178 test points
Fold 3
	 9750 training points
	 1136 test points
Fold 4
	 9796 training points
	 1090 test points
Fold 5
	 9827 training points
	 1059 test points
Fold 6
	 9846 training points
	 1040 test points
Fold 7
	 9896 training points
	 990 test points
Fold 8
	 9911 training points
	 975 test points
Fold 9
	 9944 training points
	 942 test points
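
The test folds above range from 1248 down to 942 points. A likely reason, stated here as an assumption rather than something verified in this notebook, is that stratifying on `cnt` treats every distinct count as its own class, and many counts occur too rarely to be spread evenly over ten folds. As an alternative (not the method used above), a plain shuffled K-fold split can be sketched with NumPy alone; the seed is arbitrary and the code is written for Python 3.

```python
import numpy as np

n_samples = 10886   # number of training rows reported above
n_folds = 10

rng = np.random.RandomState(0)           # arbitrary seed for the demo
shuffled = rng.permutation(n_samples)    # shuffled row indices

# Split the shuffled indices into 10 nearly equal test folds
test_folds = np.array_split(shuffled, n_folds)

fold_sizes = []
for ix, te in enumerate(test_folds):
    # Training indices are all indices that are not in the test fold
    tr = np.setdiff1d(shuffled, te)
    fold_sizes.append(len(te))
    print("Fold %d: %d training points, %d test points" % (ix, len(tr), len(te)))
```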


In [123]:
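# Collect out-of-fold predictions: each sample is predicted by a model
# trained on the folds that do not contain it
y_pred_cv = np.zeros(y.shape)

for ix, (tr, te) in enumerate(folds):
    gnb.fit(X[tr], y[tr])
    y_pred_cv[te] = gnb.predict(X[te])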

In [124]:
# Cross-validated RMSLE
rmsle(y, y_pred_cv)

1.6627251920004475
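
The `sklearn.cross_validation` module used above belongs to older scikit-learn releases; newer ones provide the same tools in `sklearn.model_selection`. Assuming a recent scikit-learn and Python 3, the out-of-fold prediction loop could be sketched as follows. The data here is synthetic, and the sketch uses a plain `KFold` instead of `StratifiedKFold`, which is a substitution made for the demo.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)           # arbitrary seed for the demo
X_demo = rng.rand(200, 4)                # synthetic features
y_demo = rng.randint(1, 4, size=200)     # synthetic labels in 1..3

# Shuffled 10-fold split; each sample lands in exactly one test fold
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# One out-of-fold prediction per sample, from a model that never saw it
y_pred_demo = cross_val_predict(GaussianNB(), X_demo, y_demo, cv=cv)

print(y_pred_demo.shape)
```

`cross_val_predict` fits a fresh clone of the estimator for each fold, so it plays the same role as the explicit loop that fills `y_pred_cv`.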