# Linear Regression

## Dataset Description 

- `id` - Trip ID
- `vendor_id` - ID of the transportation company
- `pickup_datetime` - Timestamp of the trip start
- `dropoff_datetime` - Timestamp of the trip end
- `passenger_count` - Number of passengers
- `pickup_longitude` - Longitude of the pickup location
- `pickup_latitude` - Latitude of the pickup location
- `dropoff_longitude` - Longitude of the dropoff location
- `dropoff_latitude` - Latitude of the dropoff location
- `store_and_fwd_flag` - Yes/No flag: whether the trip record was stored in the vehicle's memory because the connection to the server was lost

## Tasks

### Task 1

Prepare data for your linear regression model. 

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('taxi_dataset_with_target.csv', index_col=0)

In [3]:
df.head()

| id | vendor_id | pickup_datetime | passenger_count | store_and_fwd_flag | trip_duration | distance_km |
|---|---|---|---|---|---|---|
| id2875421 | 1 | 2016-03-14 17:24:55 | 930.399753 | 0 | 455.0 | 1.500479 |
| id2377394 | 0 | 2016-06-12 00:43:35 | 930.399753 | 0 | 663.0 | 1.807119 |
| id3858529 | 1 | 2016-01-19 11:35:24 | 930.399753 | 0 | 2124.0 | 6.39208 |
| id3504673 | 1 | 2016-04-06 19:32:31 | 930.399753 | 0 | 429.0 | 1.487155 |
| id2181028 | 1 | 2016-03-26 13:30:55 | 930.399753 | 0 | 435.0 | 1.189925 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1458644 entries, id2875421 to id1209952
Data columns (total 6 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   vendor_id           1458644 non-null  int64  
 1   pickup_datetime     1458644 non-null  object 
 2   passenger_count     1458644 non-null  float64
 3   store_and_fwd_flag  1458644 non-null  int64  
 4   trip_duration       1458644 non-null  float64
 5   distance_km         1458644 non-null  float64
dtypes: float64(3), int64(2), object(1)
memory usage: 77.9+ MB


In [5]:
X = df.drop(['trip_duration', 'pickup_datetime'], axis=1)
Y = df[['trip_duration']]

In [6]:
X.head()

| id | vendor_id | passenger_count | store_and_fwd_flag | distance_km |
|---|---|---|---|---|
| id2875421 | 1 | 930.399753 | 0 | 1.500479 |
| id2377394 | 0 | 930.399753 | 0 | 1.807119 |
| id3858529 | 1 | 930.399753 | 0 | 6.39208 |
| id3504673 | 1 | 930.399753 | 0 | 1.487155 |
| id2181028 | 1 | 930.399753 | 0 | 1.189925 |


In [7]:
Y.head()

| id | trip_duration |
|---|---|
| id2875421 | 455.0 |
| id2377394 | 663.0 |
| id3858529 | 2124.0 |
| id3504673 | 429.0 |
| id2181028 | 435.0 |


### Task 2

Create a linear regression model using `sklearn`.

In [8]:
from sklearn.linear_model import LinearRegression

In [9]:
model = LinearRegression()
model.fit(X, Y)

### Task 3

Print the model's coefficients and intercept.

In [10]:
# Pair each feature name with its fitted coefficient
for name, coef in zip(X.columns, model.coef_[0]):
    print(name, '-->', round(coef, 3))
print('intercept', '-->', round(model.intercept_[0], 3))

vendor_id --> 198.463
passenger_count --> 0.296
store_and_fwd_flag --> 56.469
distance_km --> 115.274
intercept --> 171.657


### Task 4

Implement your own linear regression solution in matrix form.
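For reference, the matrix form solves the least-squares problem in closed form. Assuming a design matrix $X$ (with a column of ones appended when an intercept is fitted) and a target vector $y$, both symbols introduced here for illustration, the coefficients are commonly written as:

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
```

This expression requires $X^{\top} X$ to be invertible, which fails when the features are perfectly collinear.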

In [11]:
import numpy as np

In [12]:
def LinearRegressionByMatrix(X: np.ndarray, Y: np.ndarray, fit_intercept: bool = True):
    """
    :param X: matrix with features, shape (n_samples, n_features)
    :param Y: matrix with target
    :param fit_intercept: whether to fit a constant intercept

    :return: numpy array of beta coefficients; when fit_intercept is True,
        the intercept is the last element
    """
    if fit_intercept:
        # Append a column of ones so the last coefficient acts as the intercept.
        # Use the X argument here, not the global X_train.
        constant_value = 1
        num_rows = X.shape[0]
        new_column = np.full((num_rows, 1), constant_value)
        X = np.hstack((X, new_column))

    # Normal equation: beta = (X^T X)^(-1) X^T Y
    xtx = np.dot(X.T, X)
    xtx_inv = np.linalg.inv(xtx)
    xtx_inv_xt = np.dot(xtx_inv, X.T)
    final_betas = np.dot(xtx_inv_xt, Y)

    return final_betas

In [13]:
X_train = df.drop(['trip_duration', 'pickup_datetime'], axis=1).values
Y_train = df['trip_duration'].values

coefficients = LinearRegressionByMatrix(X_train, Y_train)
[round(x, 3) for x in coefficients]

[198.463, 0.296, 56.469, 115.274, 171.657]

As we can see, we obtained the same results! The first four values match the `sklearn` coefficients, and the last value is the intercept.
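Inverting the matrix explicitly works here, but it can become numerically unstable when features are nearly collinear. As a hedged alternative, the sketch below compares the explicit inverse with `np.linalg.lstsq` on synthetic data. The names `X_demo`, `y_demo`, and `true_beta` are made up for illustration and are not part of the taxi dataset.

```python
import numpy as np

# Synthetic data for illustration only (not the taxi dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_beta + 4.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the last coefficient is the intercept
X_aug = np.hstack((X_demo, np.ones((X_demo.shape[0], 1))))

# Normal equation with an explicit inverse
beta_inv = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y_demo

# Least-squares solver; it does not form the inverse explicitly
beta_lstsq, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print(np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data such as this, both approaches should agree to within floating-point tolerance.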

### Task 5

Make predictions using the matrix product of the feature matrix and the coefficients obtained in the previous step.

In [14]:
constant_value = 1
num_rows = X_train.shape[0]
new_column = np.full((num_rows, 1), constant_value)
X = np.hstack((X_train, new_column))


predictions = np.dot(X, coefficients)

In [15]:
predictions

array([ 818.7747282 ,  655.65912268, 1382.6469154 , ..., 1548.74134353,
        573.4306718 ,  578.2338068 ])
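Appending a column of ones is one way to include the intercept in the product. An equivalent option is to split the coefficient vector into weights and intercept. The sketch below checks that the two agree. It uses random stand-in features (`X_demo`, a hypothetical name) together with the rounded coefficients printed in Task 4, so the predictions themselves are not meaningful.

```python
import numpy as np

# Random stand-in features for illustration (not the taxi data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5, 4))

# Rounded coefficients from the Task 4 output; the last value is the intercept
coefs_demo = np.array([198.463, 0.296, 56.469, 115.274, 171.657])

# Option 1: append a ones column and take a single matrix product
X_aug = np.hstack((X_demo, np.ones((X_demo.shape[0], 1))))
preds_aug = X_aug @ coefs_demo

# Option 2: split weights and intercept; no extra column is needed
weights, intercept = coefs_demo[:-1], coefs_demo[-1]
preds_split = X_demo @ weights + intercept

print(np.allclose(preds_aug, preds_split))
```

The second form avoids copying the whole feature matrix, which may matter for a dataset with over a million rows.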

### Task 6

Compare these results with the predictions of the `sklearn` `LinearRegression` model.

In [16]:
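import warnings
# Silence warnings such as sklearn's notice that a NumPy array lacks the feature names seen during fit
warnings.filterwarnings('ignore')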

In [17]:
model.predict(X_train)

array([[ 818.7747282 ],
       [ 655.65912268],
       [1382.6469154 ],
       ...,
       [1548.74134353],
       [ 573.4306718 ],
       [ 578.2338068 ]])

As we can see, we obtained the same results again!
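Comparing the printed arrays by eye covers only the few values shown. A programmatic check could look like the sketch below. It hard-codes the first three predictions displayed above in place of the full arrays, and the names `manual_preds` and `sklearn_preds` are illustrative. Note the shapes: the matrix-form predictions are one-dimensional, while `model.predict` returned a column vector.

```python
import numpy as np

# First three predictions as displayed in the outputs above
manual_preds = np.array([818.7747282, 655.65912268, 1382.6469154])         # shape (3,)
sklearn_preds = np.array([[818.7747282], [655.65912268], [1382.6469154]])  # shape (3, 1)

# Flatten the column vector so both arrays are compared element by element
same = np.allclose(manual_preds, sklearn_preds.ravel())
print(same)
```

Without `.ravel()`, NumPy would broadcast the `(3,)` and `(3, 1)` arrays to a `(3, 3)` grid and compare every pair of values, so the check would fail even though the predictions match.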