# Forecast Bike Rental Count - Optimization

In this project:
* Transform Count to _log(Count)_: a technique for when a model needs to predict non-negative integers (counts)
* Use the inverse transform _exp(Count)_ on the predicted value
* The log transform smooths seasonality and trend and brings counts to a similar scale

**Note:** This is an upgraded version of _bikerentalDatPrepv1.ipynb_

## Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import register_matplotlib_converters

In [None]:
register_matplotlib_converters()

## Kaggle Bike Sharing Demand Dataset

The target 'count' is replaced with _'log1p(count)'_ for training.

* A log transform can be used when the target represents a count (that is, non-negative values)
* The model now predicts _log1p(count)_. Later, convert predictions back to actual counts using _expm1(predicted_target)_
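The conversion back to counts can be sketched as follows. This is a minimal illustration, not part of the original notebook: `predicted_log` is a hypothetical array standing in for model output, and the clipping and rounding steps are assumptions about how one might turn the result into whole, non-negative counts.

```python
import numpy as np

# Hypothetical model outputs on the log1p scale (illustrative values only)
predicted_log = np.array([-0.05, 0.0, 2.3, 5.1])

# Invert the log1p transform: expm1(x) = exp(x) - 1
predicted_count = np.expm1(predicted_log)

# Assumption: a slightly negative prediction is treated as zero rentals
predicted_count = np.clip(predicted_count, 0, None)

# Assumption: rentals are reported as whole numbers
predicted_count = np.rint(predicted_count).astype(int)

print(predicted_count)
```

Clipping matters because a model trained on _log1p(count)_ can still produce a small negative value, which _expm1_ maps to a negative count.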

### Let's Look At This Change: log1p(count)

_numpy_ offers another option: _log1p_
* This returns the natural logarithm of one plus the input array, element-wise: ```log(1+x)```

The *inverse* is _expm1_, or ```exp(x) - 1```

More about [numpy.log1p](https://docs.scipy.org/doc/numpy/reference/generated/numpy.log1p.html)
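As a small sketch of the element-wise behavior described above, the following applies _log1p_ and _expm1_ to an array that includes zero. The values in `counts` and the `tiny` test value are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

# Illustrative counts, including zero (not taken from the dataset)
counts = np.array([0, 1, 16, 977])

logged = np.log1p(counts)    # log(1 + x), defined even when x is 0
restored = np.expm1(logged)  # exp(x) - 1 undoes the transform

print(logged)
print(restored)

# log1p is also more accurate than log(1 + x) when x is very small
tiny = 1e-10
naive = np.log(1 + tiny)
stable = np.log1p(tiny)
print(naive, stable)
```

The second part shows why numpy provides a dedicated function: adding 1 to a very small number loses digits before the logarithm is taken, while _log1p_ avoids that intermediate rounding.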

**To download the dataset**, sign in and download from this link:

https://www.kaggle.com/c/bike-sharing-demand/data

**Input Features:** ```['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'dayofweek','hour']```

**Target Feature:** ```log1p('count')```

In [None]:
# Example of log1p usage:
# Convert a value with log1p
# Recover the original value with expm1
print('Test log and exp')
test_count = 1000
print('Starting Value ', test_count)
T = np.log1p(test_count) # log(x + 1)
print('log1p = ', T)
print('expm1 = ', np.expm1(T)) # exp(x) - 1
print('\nThe calculation attempts to maintain all digits of precision.\nReminder: log(0) is undefined. Some counts may be 0, so log1p adds 1 before taking the log.')

## Set Up Columns for Input Features

In [None]:
columns = ['count', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'year', 'month', 'day', 'dayofweek','hour']

In [None]:
df = pd.read_csv('train.csv', parse_dates=['datetime'],index_col=0)
df_test = pd.read_csv('test.csv', parse_dates=['datetime'],index_col=0)

### Convert datetime to numeric for training

In [None]:
# Extract key features into separate numeric columns
def add_features(df):
    """Add numeric date parts from the DatetimeIndex; modifies df in place."""
    df['year'] = df.index.year
    df['month'] = df.index.month
    df['day'] = df.index.day
    df['dayofweek'] = df.index.dayofweek  # Monday=0 ... Sunday=6
    df['hour'] = df.index.hour

In [None]:
add_features(df)
add_features(df_test)
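To see what the extracted columns look like, here is a minimal sketch on a two-row frame. The `sample` frame and its values are made up for illustration; the function body repeats the same date-part extraction so the example runs on its own.

```python
import pandas as pd

# Tiny made-up frame with a DatetimeIndex, standing in for train.csv
sample = pd.DataFrame(
    {'count': [16, 40]},
    index=pd.to_datetime(['2011-01-01 05:00:00', '2012-12-19 23:00:00']),
)
sample.index.name = 'datetime'

def add_features(df):
    # Same extraction as in the notebook: numeric parts of the index
    df['year'] = df.index.year
    df['month'] = df.index.month
    df['day'] = df.index.day
    df['dayofweek'] = df.index.dayofweek  # Monday=0 ... Sunday=6
    df['hour'] = df.index.hour

add_features(sample)
print(sample)
```

Because the function assigns columns directly on the frame it receives, no return value is needed; the caller's frame gains the five new columns.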

## Plot the Current Dataset

In [None]:
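# Select rows by year with .loc; df['2011'] row selection is not supported in recent pandas
plt.plot(df.loc['2011', 'count'],label='2011')
plt.plot(df.loc['2012', 'count'],label='2012')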
plt.xticks(fontsize=16, rotation=45)
plt.xlabel('Date')
plt.ylabel('Rental Count')
plt.title('2011 and 2012 Rentals (Year-to-Year)')
plt.legend()
plt.show()

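The white stripes in the plot are periods where the count is zero or the data is missing. The Kaggle training set covers only the first 19 days of each month, so the remaining days appear as gaps.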

### Next, switch the plot to log1p

In [None]:
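# Select rows by year with .loc; df['2011'] row selection is not supported in recent pandas
plt.plot(df.loc['2011', 'count'].map(np.log1p),label='2011')
plt.plot(df.loc['2012', 'count'].map(np.log1p),label='2012')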
plt.xticks(fontsize=14, rotation=45)
plt.xlabel('Date')
plt.ylabel('log1p(Rental Count)')
plt.title('2011 and 2012 Rentals (Year-to-Year)')
plt.legend()
plt.show()

## Boxplot of Original Dataset

In [None]:
plt.boxplot([df['count']], labels=['count'])
plt.title('Box Plot - Count')
plt.ylabel('Target')
plt.grid(True)

The box covers the bulk of the data (the interquartile range), and many outliers lie above the upper whisker.

## Boxplot: Switch to a _log1p_ format

In [None]:
# Evenly distributed across a log axis
plt.boxplot([df['count'].map(np.log1p)], labels=['log1p(count)'])
plt.title('Box Plot - log1p(Count)')
plt.ylabel('Target')
plt.grid(True)

The log1p transform compresses the large values, so this view gives a more centered, compact picture of the data.
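One way to put a number on what the two box plots show is to count the points above the upper whisker before and after the transform. The sketch below is an assumption-laden illustration: `count_upper_outliers` is a hypothetical helper, the 1.5 × IQR rule mirrors matplotlib's default whisker setting, and `raw` is made-up skewed data rather than the rental counts.

```python
import numpy as np

def count_upper_outliers(values):
    """Count points above Q3 + 1.5 * IQR (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((values > upper_fence).sum())

# Made-up right-skewed counts (not the rental data)
raw = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])

outliers_raw = count_upper_outliers(raw)
outliers_log = count_upper_outliers(np.log1p(raw))

print('upper outliers, raw scale:  ', outliers_raw)
print('upper outliers, log1p scale:', outliers_log)
```

On the notebook's data, the same helper could be applied to `df['count']` before the column is overwritten with its log1p values.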

### Update/Convert the 'Count' with log1p

In [None]:
df['count'] = df['count'].map(np.log1p)

In [None]:
df.head() # data check

In [None]:
df_test.head()

### Review Data Types

In [None]:
df.dtypes

## Save All Data

In [None]:
# Save All Data
df.to_csv('bikeAllv2.csv', index=True, index_label='datetime', columns=columns)

## Training and Validation Set

### Target Variable as first column followed by input features

### Training, Validation files do not have a column header

In [None]:
# Training = 70% of data
# Validation = 30% of data
# Randomize the dataset
np.random.seed(478)
l = list(df.index)
np.random.shuffle(l)
df = df.loc[l]

In [None]:
rows = df.shape[0]
train = int(0.7 * rows)
validation = rows - train  # the remaining 30% go to the validation set

In [None]:
rows, train, validation # data check of shape
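The shuffle-then-slice split can be checked on a small frame. This is a sketch under stated assumptions: `demo` is a made-up ten-row frame, and the names `shuffled_index`, `train_part`, and `validation_part` are illustrative, not from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ten-row frame with a unique DatetimeIndex
demo = pd.DataFrame(
    {'count': np.arange(10)},
    index=pd.date_range('2011-01-01', periods=10, freq='D'),
)

# Shuffle the index labels with a fixed seed, then reorder the rows
np.random.seed(478)
shuffled_index = list(demo.index)
np.random.shuffle(shuffled_index)
demo = demo.loc[shuffled_index]

# First 70% of the shuffled rows for training, the rest for validation
n_train = int(0.7 * len(demo))
train_part = demo.iloc[:n_train]
validation_part = demo.iloc[n_train:]

print(len(train_part), len(validation_part))
```

Reordering with `.loc` relies on the index labels being unique; with duplicate timestamps, each repeated label would pull in every matching row.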

## Write Training Set

In [None]:
df.iloc[:train].to_csv('bikeTrainingv2.csv',
                      index=False, header=False,
                      columns=columns)

## Write Validation Set

In [None]:
df.iloc[train:].to_csv('bikeValidationv2.csv',
                      index=False, header=False,
                      columns=columns)

## Test Data has only input features

In [None]:
df_test.to_csv('bikeTestv2.csv', index=True, index_label='datetime')

In [None]:
print(','.join(columns))

## Write Out the Current Column List

In [None]:
with open('bikeTrain_column_listv2.txt','w') as f:
    f.write(','.join(columns))
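Because the training and validation files are written without a header, the saved column list is what gives the columns their names again. The sketch below shows one way to read such a file back; the two strings are abbreviated, made-up stand-ins for the contents of _bikeTrain_column_listv2.txt_ and _bikeTrainingv2.csv_, so the example runs without the real files.

```python
import io
import pandas as pd

# Stand-in for the column list file contents (abbreviated, illustrative)
column_list_text = 'count,season,hour'

# Stand-in for a headerless training file: target first, then features
headerless_csv = '2.833213,1,0\n3.713572,1,1\n'

names = column_list_text.split(',')
restored = pd.read_csv(io.StringIO(headerless_csv), header=None, names=names)

print(restored)
```

With the real files, `io.StringIO(headerless_csv)` would be replaced by the CSV path and `column_list_text` by the text read from the column list file.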