## Special Topics in Artificial Intelligence
### Assignment I - Regression
#### Students: Lucas Ribeiro Ferreira and Letícia Garcez

==========================================
Bike Sharing Dataset
==========================================

Hadi Fanaee-T

Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto
INESC Porto, Campus da FEUP
Rua Dr. Roberto Frias, 378
4200 - 465 Porto, Portugal


=========================================
Background 
=========================================

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental 
and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at 
another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. 
Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues. 

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by
these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration
of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing
system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most
important events in the city could be detected by monitoring these data.

=========================================
Data Set
=========================================
The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions,
precipitation, day of week, season, hour of the day, etc. can affect rental behavior. The core data set is the two-year 
historical log for the years 2011 and 2012 from the Capital Bikeshare system, Washington D.C., USA, which is 
publicly available at http://capitalbikeshare.com/system-data. We aggregated the data on both an hourly and a daily basis and then 
extracted and added the corresponding weather and seasonal information. Weather information was extracted from http://www.freemeteo.com. 

=========================================
Associated tasks
=========================================

	- Regression: 
		Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
	
	- Event and Anomaly Detection:  
		The count of rented bikes is also correlated with some events in the town, which are easily traceable via search engines.
		For instance, a query like "2012-10-30 washington d.c." in Google returns results related to Hurricane Sandy. Some of the important events are 
		identified in [1]. Therefore, the data can also be used to validate anomaly or event detection algorithms.


=========================================
Files
=========================================

	- Readme.txt
	- hour.csv : bike sharing counts aggregated on an hourly basis. Records: 17379 hours
	- day.csv : bike sharing counts aggregated on a daily basis. Records: 731 days

	
=========================================
Dataset characteristics
=========================================	
Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv
	
	- instant: record index
	- dteday : date
	- season : season (1:spring, 2:summer, 3:fall, 4:winter)
	- yr : year (0: 2011, 1:2012)
	- mnth : month ( 1 to 12)
	- hr : hour (0 to 23)
	- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
	- weekday : day of the week
	- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
	- weathersit : 
		- 1: Clear, Few clouds, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
	- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
	- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
	- hum: Normalized humidity. The values are divided by 100 (max)
	- windspeed: Normalized wind speed. The values are divided by 67 (max)
	- casual: count of casual users
	- registered: count of registered users
	- cnt: count of total rental bikes including both casual and registered

In [12]:
import numpy as np
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

### Helper Functions
1. mape:
function that computes the mean absolute percentage error with respect to the target.

In [13]:
def mape(true_y, pred_y):
    """Mean absolute percentage error, in percent (true_y must contain no zeros)."""
    diffs = np.abs((true_y - pred_y) / true_y)
    return 100 * np.mean(diffs)
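
As a quick sanity check of the metric, the sketch below applies the same formula to three made-up values. The numbers are hypothetical and chosen only so the result can be verified by hand: the relative errors are 10%, 10% and 0%, so the mean is about 6.67%.

```python
import numpy as np

def mape(true_y, pred_y):
    """Mean absolute percentage error, in percent."""
    return 100 * np.mean(np.abs((true_y - pred_y) / true_y))

# Hypothetical toy values, not taken from the dataset.
true_y = np.array([100.0, 200.0, 50.0])
pred_y = np.array([110.0, 180.0, 50.0])

# Relative errors are 10%, 10% and 0%, so the mean is about 6.67%.
score = mape(true_y, pred_y)
print(round(score, 2))
```

Because the error is divided by the true value, the function is undefined whenever a target is exactly zero.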

### Importing the Dataset

Here we load the dataset and select the columns that will be used as features for the prediction and the one that will be used as the target.
We focus on predicting the number of shared bikes based on the date, season, month, working day, weather situation, feeling temperature (atemp), and wind speed.


In [14]:
_features = ["dteday", "season", "mnth", "workingday", "weathersit", "atemp", "windspeed"]
_targets = ["cnt"]
dataset = pd.read_csv('data_bike/hour.csv')
# Copy the selections so that later in-place edits do not touch (or warn about) `dataset`.
X = dataset.loc[:, _features].copy()
Y = dataset.loc[:, _targets].copy()
dataset.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


### Preprocessing

Before going further, we need to preprocess some of the features.

#### Preprocessing the date

Since the date is in year-month-day format, we need to convert it into a value that carries the same meaning but is simpler to work with. We decided to transform it into the "number of days since the start", so the first day has value 0, the second has value 1, and so on.

In this form the date keeps its sequential semantics: day 2 is one day after day 1, so the predictive model can use that ordering to make better predictions.

In [15]:
# ISO dates (YYYY-MM-DD) sort chronologically, so LabelEncoder maps each distinct date to its 0-based rank.
le = preprocessing.LabelEncoder()
days_transformed = le.fit_transform(X["dteday"])
X["dteday"] = days_transformed
X.head()

Unnamed: 0,dteday,season,mnth,workingday,weathersit,atemp,windspeed
0,0,1,1,0,1,0.2879,0.0
1,0,1,1,0,1,0.2727,0.0
2,0,1,1,0,1,0.2727,0.0
3,0,1,1,0,1,0.2879,0.0
4,0,1,1,0,1,0.2879,0.0
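
One caveat of the encoding above is that a label encoder assigns ranks, which equal "days since the start" only when no calendar day is missing from the log. A minimal alternative sketch, assuming the dates parse with `pd.to_datetime`, computes the elapsed days directly. The sample dates are hypothetical and skip one day on purpose to show the difference.

```python
import pandas as pd

# Hypothetical sample with a gap: 2011-01-03 is missing on purpose.
dates = pd.Series(["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-04"])

parsed = pd.to_datetime(dates)
# Subtracting the earliest date yields timedeltas; .dt.days keeps whole days.
days_since_start = (parsed - parsed.min()).dt.days

print(days_since_start.tolist())
```

A rank-based encoding would map the last date to 2, while the elapsed-day version maps it to 3.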


#### Preprocessing weathersit

As explained in the header, weathersit is a column that describes the weather situation on that day, as follows:

- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

Because this attribute is categorical, we decided to binarize it, turning it into 4 new columns of which only one is active at a time, as exemplified below.

In [16]:
# One-hot encode weathersit (values 1..4) into four indicator columns ws_0..ws_3.
result = preprocessing.label_binarize(X["weathersit"], classes=[1, 2, 3, 4])
for i in range(4):
    X[f"ws_{i}"] = result[:, i]
X = X.drop("weathersit", axis=1)  # `axis` must be passed by keyword in recent pandas
X.head()

Unnamed: 0,dteday,season,mnth,workingday,atemp,windspeed,ws_0,ws_1,ws_2,ws_3
0,0,1,1,0,0.2879,0.0,1,0,0,0
1,0,1,1,0,0.2727,0.0,1,0,0,0
2,0,1,1,0,0.2727,0.0,1,0,0,0
3,0,1,1,0,0.2879,0.0,1,0,0,0
4,0,1,1,0,0.2879,0.0,1,0,0,0
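
The same binarization can also be sketched with pandas alone. This is an alternative, not what the notebook does: `pd.get_dummies` names the columns after the category values (`ws_1` to `ws_4`) rather than `ws_0` to `ws_3`, and the toy column below is hypothetical.

```python
import pandas as pd

# Hypothetical toy column; the real one comes from hour.csv.
toy = pd.DataFrame({"weathersit": [1, 2, 3, 1]})

# Declaring all four levels keeps a column for category 4,
# even though it does not occur in this sample.
levels = toy["weathersit"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4]))
one_hot = pd.get_dummies(levels, prefix="ws", dtype=int)

print(one_hot)
```

Declaring the categories up front matters because category 4 is rare; without it, a sample that never contains 4 would yield only three columns.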


#### Normalizing values

We need to normalize the values so that a column with large values does not receive disproportionate importance.

We decided to use Min-Max scaling so that all values lie between 0 and 1.

In [17]:
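scaler = preprocessing.MinMaxScaler()
# Rebuild the frame so every column becomes float and keeps its name and index.
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X.loc[1000:1005, :]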

Unnamed: 0,dteday,season,mnth,workingday,atemp,windspeed,ws_0,ws_1,ws_2,ws_3
1000,0.060274,0.0,0.090909,1.0,0.3939,0.543905,1.0,0.0,0.0,0.0
1001,0.060274,0.0,0.090909,1.0,0.4091,0.456213,1.0,0.0,0.0,0.0
1002,0.060274,0.0,0.090909,1.0,0.4394,0.263195,1.0,0.0,0.0,0.0
1003,0.060274,0.0,0.090909,1.0,0.5,0.298225,1.0,0.0,0.0,0.0
1004,0.060274,0.0,0.090909,1.0,0.5303,0.52639,1.0,0.0,0.0,0.0
1005,0.060274,0.0,0.090909,1.0,0.5455,0.456213,1.0,0.0,0.0,0.0
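
Min-Max scaling maps each value to `(x - min) / (max - min)`. The sketch below reproduces two of the scaled values in row 1000 of the output above by hand. It assumes that the encoded date index of that row is 44 (inferred from the scaled value, not read from the data), that the date index ranges over 0..730, and that the month ranges over 1..12.

```python
def min_max(x, x_min, x_max):
    """Rescale x from [x_min, x_max] to [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Assumed raw values for row 1000: date index 44 (range 0..730)
# and month 2, i.e. February (range 1..12).
dteday_scaled = min_max(44, 0, 730)
mnth_scaled = min_max(2, 1, 12)

print(round(dteday_scaled, 6), round(mnth_scaled, 6))
```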


### Splitting into training and test sets

We use a test set of 20% of the data in order to keep a large number of examples in the training set. We tried a few other split sizes and saw little difference between them.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
y_test = np.ravel(y_test)  # flatten the (n, 1) target frame to shape (n,)
X_train.shape

(13903, 10)
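
The training-set size shown above can be checked with a little arithmetic. The sketch assumes that the test count is the requested fraction rounded up and that the remaining rows go to training, which is consistent with the reported shape.

```python
import math

n_samples = 17379   # rows in hour.csv, per the README
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```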

### Predictive Model

#### Neural Network

We use a neural-network regressor to build the predictive model. We chose this algorithm for its expressive power and training speed.

In [36]:
y_train = np.ravel(y_train)  # scikit-learn expects a 1-D target here
pr = MLPRegressor(alpha=0.0001, max_iter=800)
pr.fit(X=X_train, y=y_train)
y_predic = pr.predict(X_test)


### Results

We believe that metrics such as the mean absolute error (mean_abs) or the mean absolute percentage error (mape) are the most suitable for this problem. It is worth more to land close to the target almost every time than to minimize a few occasional large errors.

In [50]:
mean_abs = metrics.mean_absolute_error(y_pred=y_predic,y_true=y_test)
_mape = mape(y_test, y_predic) 

mean_abs, _mape, np.mean(y_test), np.mean(y_predic), np.std(y_test)

(123.50828619165236,
 485.36099076148309,
 186.13751438434983,
 149.56524311184913,
 180.82294319581413)

Seeing that the mean of the test set is 186.14 and the mean of the predictions is 149.57 gives the impression that the model was successful. However, the mean absolute error of 123.51 shows that there is still a lot of work to be done.

A mean error that is relatively close to the mean of the data indicates that we are off by a lot on each prediction. The mean absolute percentage error of 485 shows that we are doing even worse than the previous figures suggested: on average, our predictor's 'guess' is 485% above or below the target value. That is certainly not an acceptable percentage for practical use.
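
One way to see how the percentage error can be so much larger than the absolute error suggests: the first rows of the dataset include hours with a count as low as 1, and dividing by such small targets inflates the percentage. The sketch below uses hypothetical counts and a constant prediction of 150 (roughly the mean prediction reported above) purely for illustration.

```python
import numpy as np

# Hypothetical hourly counts: two quiet hours and two busy hours.
y_true = np.array([1.0, 2.0, 200.0, 400.0])
# A predictor that always outputs 150.
y_pred = np.full_like(y_true, 150.0)

abs_err = np.abs(y_true - y_pred)
pct_err = 100 * abs_err / y_true

mae = abs_err.mean()
mape_value = pct_err.mean()
# Share of the total percentage error that comes from the two quiet hours.
quiet_share = pct_err[:2].sum() / pct_err.sum()

print(mae, mape_value, quiet_share)
```

In this toy case the two quiet hours account for almost all of the percentage error, even though their absolute errors are not the largest.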

### Conclusion

In the end, we conclude that there is still work to be done to improve the predictor. To begin with, we believe the data preprocessing needs more attention. Even after normalization, features such as *workingday* remain much larger than features such as *dteday* in the first examples, which may be giving *workingday* disproportionate importance.

Another task is tuning the neural network's parameters. Although we spent a good part of our time testing various combinations before settling on the current one, which we judged the best among those tested, we believe a more extensive study with a stronger scientific basis would be effective.
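
A more systematic parameter study could start from a cross-validated grid search. The sketch below is only an illustration under stated assumptions: the data is synthetic (in the notebook it would be `X_train` and `y_train`), and the grid values are hypothetical rather than the combinations we actually tried.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data; in the notebook this would be X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 100 * X_demo[:, 0] + 50 * X_demo[:, 1] ** 2 + rng.normal(0, 5, size=200)

# Hypothetical grid: the values are illustrative only.
param_grid = {
    "hidden_layer_sizes": [(16,), (32, 16)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPRegressor(max_iter=300, random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # negated MAE, so higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MAE, so flip the sign to report it.
print(search.best_params_, -search.best_score_)
```

Scoring by mean absolute error keeps the search aligned with the metric used in the results, and fixing `random_state` makes the comparison between combinations repeatable.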