<a href="https://colab.research.google.com/github/heber-augusto/udacity-azure-ml-foudations/blob/master/lab1_linear_regression_model_nyc_taxi_and_limousine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Jupyter notebook was created to gauge the effort of doing machine learning tasks without Azure Machine Learning Studio's tools.

The goal of this notebook is to train a linear regression model on one of the Azure Open Datasets: NYC Taxi & Limousine Commission - green taxi trip records.

A detailed description of this dataset can be found at the link below.

https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/

The dataset used can be found at:  https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data.csv

This notebook demonstrates:

1. Getting the dataset and creating a dataframe from the csv file;
1. Exploring the dataset to understand the data;
1. Splitting the dataframe for training;
1. Training a linear regression model.

The output feature (the value to predict) in this dataset is the totalAmount column.


**The cell below contains the library imports and some essential definitions for the tasks**

*   pandas: create a dataframe from the csv file;
*   sklearn: split the dataframe into train and test sets, train the linear regression and evaluate it;
*   matplotlib: plot some results.



In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt


url="https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data.csv"



**Step 1: Getting the dataset and creating a dataframe from the csv file**

The dataset can be obtained from the url https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data.csv

The pandas library has a function that does this: read_csv()

So we will use it and pass it the url.

In [3]:
nyc_taxi_dataframe = pd.read_csv(url)
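
As a small offline illustration of the same call, the sketch below passes read_csv() an in-memory text buffer instead of a url. The rows are made up for illustration (only the column names mirror the real file), so this is a sketch under those assumptions rather than a sample of the actual dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the real file.
csv_text = """vendorID,passengerCount,tripDistance,totalAmount
2,1,1.06,8.15
1,2,3.62,17.80
2,1,1.90,11.30
"""

# read_csv accepts a url, a local path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 4): three rows, four columns
print(sample.dtypes)  # column types inferred by pandas
```

Checking the shape and the inferred dtypes right after loading is a cheap way to spot columns that were parsed as text instead of numbers.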


**Exploring the dataset to understand the data**

The following 2 cells present:
 
*  Basic statistics for each column (using the dataframe describe method)
*  The correlation between the totalAmount column and all the others (corr method)

Many other analyses are possible, but only the output of these two methods is listed here.



In [4]:
nyc_taxi_dataframe.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,hour_of_day,day_of_week,day_of_month,month_num,snowDepth,precipTime,precipDepth,temperature,totalAmount
count,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0,11734.0
mean,1.790608,1.34856,2.866139,13.633884,3.223879,15.000256,3.502898,1.609015,12.028379,190.782342,10.314244,14.733528
std,0.406892,1.016123,2.90581,6.67053,1.961855,8.467892,1.707729,7.146771,10.158597,1211.087724,8.5006,10.983099
min,1.0,1.0,0.01,0.0,0.0,1.0,1.0,0.0,1.0,0.0,-13.379464,3.3
25%,2.0,1.0,1.06,9.0,2.0,8.0,2.0,0.0,1.0,0.0,3.566372,8.15
50%,2.0,1.0,1.9,15.0,3.0,15.0,4.0,0.0,6.0,3.0,10.318229,11.3
75%,2.0,1.0,3.62,19.0,5.0,22.0,5.0,0.0,24.0,41.0,17.239744,17.8
max,2.0,6.0,62.55,23.0,6.0,30.0,6.0,67.090909,24.0,9999.0,26.524107,339.38


In [12]:
nyc_taxi_dataframe.corr()['totalAmount'].sort_values(ascending=False)

totalAmount       1.000000
tripDistance      0.913318
temperature       0.024572
passengerCount    0.012265
month_num         0.012183
day_of_month      0.011489
snowDepth         0.011220
day_of_week       0.006714
precipTime        0.003953
isPaidTimeOff    -0.001369
vendorID         -0.005318
precipDepth      -0.009635
hour_of_day      -0.028296
Name: totalAmount, dtype: float64
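
The dataframe also contains a text column (normalizeHolidayName), and depending on the pandas version, calling corr() on a frame that still holds text columns can raise an error instead of silently skipping them. The sketch below, on a small made-up frame (the values are assumptions, only the column names follow the dataset), restricts the frame to numeric and boolean columns before computing the correlations.

```python
import pandas as pd

# Made-up values for illustration only; column names follow the dataset.
sample = pd.DataFrame({
    "tripDistance": [1.0, 2.5, 0.8, 6.2, 3.1, 4.4],
    "passengerCount": [1, 2, 1, 1, 3, 1],
    "isPaidTimeOff": [False, False, True, False, False, True],
    "normalizeHolidayName": [None, None, "Memorial Day", None, None, "Memorial Day"],
    "totalAmount": [6.5, 11.0, 5.8, 24.3, 13.2, 17.9],
})

# Keep only numeric and boolean columns so that corr() never sees text.
numeric = sample.select_dtypes(include=["number", "bool"])

# Correlation of every remaining column with the output column.
correlations = numeric.corr()["totalAmount"].sort_values(ascending=False)
print(correlations)
```

Selecting the columns explicitly keeps the result independent of the default behaviour of corr() in the installed pandas version.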

**Splitting the dataframe for training**

For the splitting task we will use the train_test_split function from scikit-learn.
The dataframe is split so that 70% of the rows are used for training and the rest are held out to evaluate the results.


In [19]:
# y holds the output (target) column
y = nyc_taxi_dataframe[['totalAmount']]
# X holds the input features: totalAmount is dropped because it is the output,
# and normalizeHolidayName is dropped because it holds holiday names, not numbers
X = nyc_taxi_dataframe.drop(
    labels=['totalAmount','normalizeHolidayName'], 
    axis=1)


X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size=0.7, 
    random_state=42)
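
To make the 70/30 split concrete, here is a minimal sketch on a synthetic frame of 100 rows. The data and the `toy` name are assumptions made for illustration; only the train_test_split call matches the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not the real taxi records).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tripDistance": rng.uniform(0.5, 10.0, size=100),
    "passengerCount": rng.integers(1, 7, size=100),
})
toy["totalAmount"] = 3.0 + 2.5 * toy["tripDistance"]

X = toy.drop(columns=["totalAmount"])  # input features
y = toy["totalAmount"]                 # output column

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.7,
    random_state=42)

# With 100 rows, 70 go to training and 30 to testing.
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the training set on every run.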

**Training a linear regression model**

For the training task I used the linear_model.LinearRegression class from scikit-learn.


In [21]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))

# The mean absolute error
print('Mean absolute error: %.2f'
      % mean_absolute_error(y_test, y_pred))

# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))


ValueError: ignored
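
The traceback above is truncated, so its exact cause cannot be read from this output. A common reason for LinearRegression.fit to raise a ValueError is a feature column that contains text or missing values. The sketch below uses synthetic data (all values and the `toy` name are assumptions) and shows one possible way to guard against that: keep only numeric and boolean columns and check for missing values before fitting.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not the real taxi records).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tripDistance": rng.uniform(0.5, 10.0, size=n),
    "passengerCount": rng.integers(1, 7, size=n),
    "normalizeHolidayName": [None] * n,  # non-numeric and entirely missing
})
toy["totalAmount"] = (3.0 + 2.5 * toy["tripDistance"]
                      + rng.normal(0.0, 0.5, size=n))

y = toy["totalAmount"]
features = toy.drop(columns=["totalAmount"])

# Keep only columns the model can consume directly.
features = features.select_dtypes(include=["number", "bool"])

# Fail early, with a clear message, if missing values remain.
assert not features.isna().any().any(), "features still contain NaN"

X_train, X_test, y_train, y_test = train_test_split(
    features, y, train_size=0.7, random_state=42)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean squared error: %.2f' % mse)
print('Mean absolute error: %.2f' % mae)
print('Coefficient of determination: %.2f' % r2)
```

Inspecting `X_train.dtypes` and `X_train.isna().sum()` on the real dataframe is a quick way to tell whether this is what happened in the cell above.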

In [22]:
X_train

Unnamed: 0,vendorID,passengerCount,tripDistance,hour_of_day,day_of_week,day_of_month,month_num,normalizeHolidayName,isPaidTimeOff,snowDepth,precipTime,precipDepth,temperature
102,2,1,7.90,14,0,25,1,,False,54.411765,6.0,0.0,-0.308929
1989,1,1,4.00,1,0,8,2,,False,3.727273,6.0,30.0,1.832692
5120,2,1,1.00,19,5,26,3,,False,0.000000,24.0,23.0,7.971552
4271,2,1,1.62,0,6,20,3,,False,0.000000,6.0,0.0,3.566372
10605,2,1,0.86,8,5,11,6,,False,0.000000,6.0,25.0,22.323894
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,2,5,1.11,14,2,8,6,,False,0.000000,24.0,90.0,16.966923
5191,1,2,4.00,8,3,17,3,,False,0.000000,24.0,10.0,10.828889
5390,2,1,3.96,1,6,6,3,,False,0.000000,1.0,0.0,2.777679
860,2,6,5.21,8,1,5,1,,False,0.000000,6.0,0.0,-7.206250
