### Overview
Making a prediction with a linear regression model is a common use case in ML. In this tutorial, we build a model that predicts whether a driver will complete a trip, based on a number of features ingested into Feast.

The basic local mode lets you try Feast quickly, while the advanced mode shows how you can use Feast in a production setting, in particular on Google Cloud Platform (GCP).

This tutorial uses Feast with scikit-learn to:

* Train a model locally using data from BigQuery
* Test the model for online inference using SQLite (for fast iteration)
* Test the model for online inference using Firestore (to represent production)
 

## Step 1: Install Feast and scikit-learn

Install Feast, the GCP dependencies, and scikit-learn.


In [175]:
!pip install feast scikit-learn parquet-cli



#### Check feast version

In [178]:
!feast version 

Feast SDK Version: "feast 0.12.0"
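
The same check can be done from Python, which is handy when a later cell depends on a specific version. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of Feast.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this tutorial relies on.
for name in ("feast", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `feast` reports a version other than the one shown above, some of the outputs in this notebook may look different.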


## Step 2: Feast Init

* Initialize the Feast repository

In [188]:
%%shell
cd /content
feast init feast_driver_ranking_tutorial



## Step 4: Apply and deploy feature definitions

`feast apply` scans Python files in the current directory for feature definitions and deploys infrastructure according to `feature_store.yaml`.

In [180]:
%%shell
cd /content/feast_driver_ranking_tutorial
feast apply

Registered entity driver_id
Registered feature view driver_hourly_stats
Deploying infrastructure for driver_hourly_stats




### Inspect the files created under your local folder

In [181]:
%%shell
cd /content/feast_driver_ranking_tutorial/data/
ls -l 

total 56
-rw-r--r-- 1 root root 34712 Aug 17 05:30 driver_stats.parquet
-rw-r--r-- 1 root root 16384 Aug 17 07:12 online_store.db
-rw-r--r-- 1 root root   502 Aug 17 08:55 registry.db
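
`online_store.db` is the SQLite file that backs the local online store, so it can be opened with Python's built-in `sqlite3` module. The sketch below only lists table names; `list_tables` is a hypothetical helper, and the table names Feast creates are not assumed here.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    # Each result row is a one-element tuple holding the table name.
    return [name for (name,) in rows]
```

For example, `list_tables("/content/feast_driver_ranking_tutorial/data/online_store.db")` should return the tables created by `feast apply`. Note that `sqlite3.connect` creates an empty file if the path does not exist yet, so double-check the path first.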




In [182]:
%%shell

curl -o /content/feast_driver_ranking_tutorial/driver_orders.csv https://raw.githubusercontent.com/feast-dev/feast-driver-ranking-tutorial/master/driver_orders.csv 
ls /content/feast_driver_ranking_tutorial/

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   371  100   371    0     0   1883      0 --:--:-- --:--:-- --:--:--  1883
data  driver_orders.csv  example.py  feature_store.yaml




## Step 5: Train your model

In [183]:
%%shell
parq ./feast_driver_ranking_tutorial/data/driver_stats.parquet --tail 10 

               event_timestamp  driver_id  conv_rate  acc_rate  \
1797 2021-08-16 22:00:00+00:00       1001   0.283306  0.311559   
1798 2021-08-16 23:00:00+00:00       1001   0.845515  0.066000   
1799 2021-08-17 00:00:00+00:00       1001   0.563196  0.429818   
1800 2021-08-17 01:00:00+00:00       1001   0.570116  0.372267   
1801 2021-08-17 02:00:00+00:00       1001   0.462844  0.833463   
1802 2021-08-17 03:00:00+00:00       1001   0.415600  0.694797   
1803 2021-08-17 04:00:00+00:00       1001   0.488006  0.294181   
1804 2021-04-12 07:00:00+00:00       1001   0.593042  0.283652   
1805 2021-08-09 17:00:00+00:00       1003   0.557066  0.894062   
1806 2021-08-09 17:00:00+00:00       1003   0.557066  0.894062   

      avg_daily_trips                 created  
1797              977 2021-08-17 05:30:48.077  
1798              424 2021-08-17 05:30:48.077  
1799              357 2021-08-17 05:30:48.077  
1800              708 2021-08-17 05:30:48.077  
1801              531 2021-08-17 



In [184]:
orders.sort_values(by=['driver_id'], inplace=True, ignore_index=True)

In [185]:
orders

Unnamed: 0,event_timestamp,driver_id,trip_completed
0,2021-08-02 05:00:00+00:00,1001,1
1,2021-08-02 06:00:00+00:00,1001,1
2,2021-08-02 07:00:00+00:00,1001,1
3,2021-08-02 05:00:00+00:00,1002,0
4,2021-08-02 06:00:00+00:00,1002,0
5,2021-08-02 07:00:00+00:00,1002,0
6,2021-08-02 05:00:00+00:00,1003,0
7,2021-08-02 06:00:00+00:00,1003,0
8,2021-08-02 07:00:00+00:00,1003,0
9,2021-08-02 05:00:00+00:00,1004,1



In [187]:
import feast
from joblib import dump
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the driver stats that back the feature view (the file inspected with `parq` above)
pq_ = pd.read_parquet("/content/feast_driver_ranking_tutorial/data/driver_stats.parquet")

# Load driver order data (read as tab-separated)
orders = pd.read_csv("/content/feast_driver_ranking_tutorial/driver_orders.csv", sep="\t")
orders["event_timestamp"] = pd.to_datetime(orders["event_timestamp"])

orders.sort_values(by=["driver_id"], inplace=True, ignore_index=True)

# Replace the order timestamps with event timestamps taken from the driver stats
# of the same driver. Each tuple is (driver_id, first_row, last_row) in `orders`;
# .loc with .to_numpy() assigns by position and avoids chained assignment.
for driver_id, first, last in [(1001, 0, 2), (1002, 3, 5), (1003, 6, 8), (1004, 9, 9)]:
    stats_ts = pq_.loc[pq_["driver_id"] == driver_id, "event_timestamp"].iloc[: last - first + 1]
    orders.loc[first:last, "event_timestamp"] = stats_ts.to_numpy()

# Connect to your feature store provider
fs = feast.FeatureStore(repo_path="/content/feast_driver_ranking_tutorial")

# Retrieve training data
training_df = fs.get_historical_features(
    entity_df=orders,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

print("-" * 48)

print(training_df)

# Train model
target = "trip_completed"

print("------------------------------------------------")
reg = LinearRegression()
train_X = training_df[training_df.columns.drop(target).drop("event_timestamp")]
train_Y = training_df.loc[:, target]

# Sort the columns so the feature order is the same at training and prediction time
reg.fit(train_X[sorted(train_X)], train_Y)

# Save model
dump(reg, "driver_model.bin")

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   event_timestamp  10 non-null     datetime64[ns, UTC]
 1   driver_id        10 non-null     int64              
 2   trip_completed   10 non-null     int64              
 3   conv_rate        10 non-null     float32            
 4   acc_rate         10 non-null     float32            
 5   avg_daily_trips  10 non-null     int32              
dtypes: datetime64[ns, UTC](1), float32(2), int32(1), int64(2)
memory usage: 440.0 bytes
None

----- Example features -----

            event_timestamp  driver_id  ...  acc_rate  avg_daily_trips
0 2021-08-02 05:00:00+00:00       1001  ...  0.925318              842
1 2021-08-02 05:00:00+00:00       1002  ...  0.272748              699
2 2021-08-02 05:00:00+00:00       1003  ...  0.651410              405



['driver_model.bin']
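
The training cell fits the model on `train_X[sorted(train_X)]`. Iterating over a DataFrame yields its column names, so this selects the columns in alphabetical order, and the prediction step later does the same with `df[sorted(df)]`. The sketch below illustrates why that matters, using plain dictionaries as a stand-in for DataFrame rows; `ordered_values` is a hypothetical helper, and the values are rounded from the online lookup for driver 1001 shown later in this notebook.

```python
from typing import Dict, List


def ordered_values(row: Dict[str, float]) -> List[float]:
    """Return the values of `row` ordered by sorted feature name."""
    return [row[name] for name in sorted(row)]


# The same driver with two different column orders.
training_order = {"driver_id": 1001, "conv_rate": 0.488006, "acc_rate": 0.294181, "avg_daily_trips": 69}
online_order = {"acc_rate": 0.294181, "conv_rate": 0.488006, "driver_id": 1001, "avg_daily_trips": 69}

# Sorting the names gives one canonical order regardless of how the row was built.
print(sorted(training_order))
print(ordered_values(training_order) == ordered_values(online_order))
```

Without the sort, a model fitted on one column order and queried with another would silently pair each coefficient with the wrong feature.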

In [102]:
training_df

Unnamed: 0,event_timestamp,driver_id,trip_completed,conv_rate,acc_rate,avg_daily_trips
0,2021-08-02 05:00:00+00:00,1001,1,0.100542,0.925318,842
1,2021-08-02 05:00:00+00:00,1002,0,0.825244,0.272748,699
2,2021-08-02 05:00:00+00:00,1003,0,0.276608,0.65141,405
3,2021-08-02 05:00:00+00:00,1004,1,0.248437,0.436958,20
4,2021-08-02 06:00:00+00:00,1001,1,0.581363,0.200917,362
5,2021-08-02 06:00:00+00:00,1002,0,0.440703,0.565963,343
6,2021-08-02 06:00:00+00:00,1003,0,0.703562,0.421865,704
7,2021-08-02 07:00:00+00:00,1001,1,0.468566,0.326638,233
8,2021-08-02 07:00:00+00:00,1002,0,0.915024,0.340808,676
9,2021-08-02 07:00:00+00:00,1003,0,0.125337,0.991286,503
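
`get_historical_features` joins each row of the entity dataframe with the feature values that were valid at that row's `event_timestamp`. The following is a conceptual sketch of such a point-in-time lookup in plain Python, not Feast's implementation: it ignores TTL and the `created` column. The three feature rows for driver 1004 are a subset copied from the Parquet listing in this notebook; the helper names and the 06:30 and 04:00 query times are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

# (event_timestamp, conv_rate) per driver, sorted by timestamp.
FeatureRows = Dict[int, List[Tuple[datetime, float]]]


def utc(day: int, hour: int, minute: int = 0) -> datetime:
    """Build an August 2021 timestamp in UTC (helper for brevity)."""
    return datetime(2021, 8, day, hour, minute, tzinfo=timezone.utc)


def point_in_time_lookup(rows: FeatureRows, driver_id: int, as_of: datetime) -> Optional[float]:
    """Return the latest feature value with event_timestamp <= as_of, or None."""
    history = rows.get(driver_id, [])
    timestamps = [ts for ts, _ in history]
    # Index of the first timestamp strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


rows: FeatureRows = {
    1004: [(utc(2, 5), 0.248437), (utc(2, 6), 0.774549), (utc(2, 7), 0.039977)],
}

print(point_in_time_lookup(rows, 1004, utc(2, 5)))      # exact match on a feature row
print(point_in_time_lookup(rows, 1004, utc(2, 6, 30)))  # falls back to the 06:00 row
print(point_in_time_lookup(rows, 1004, utc(2, 4)))      # nothing known yet
```

The first lookup returns 0.248437, which is the `conv_rate` that `training_df` shows for driver 1004 at 05:00.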


## Step 6: Materialize your online store
Change the `provider` field in `driver_ranking/feature_store.yaml` from `local` to `gcp`.

Then apply and materialize data to Firestore.
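
The provider change is a one-line edit. A minimal sketch of doing it programmatically with plain string handling follows; `set_provider` is a hypothetical helper, and the YAML text is an assumed example rather than the file from this repository. A real GCP setup may need further changes (for example the registry location and credentials) that this sketch does not cover.

```python
import re


def set_provider(yaml_text: str, provider: str) -> str:
    """Replace the value of the top-level `provider:` key in a YAML document."""
    new_text, count = re.subn(
        r"^provider:.*$", f"provider: {provider}", yaml_text, flags=re.MULTILINE
    )
    if count != 1:
        raise ValueError(f"expected exactly one top-level 'provider:' line, found {count}")
    return new_text


# Assumed example content; the real file may contain different keys and values.
example = "project: feast_driver_ranking_tutorial\nregistry: data/registry.db\nprovider: local\n"
print(set_provider(example, "gcp"))
```

Raising an error when the key is missing or duplicated avoids silently writing a file that Feast would interpret differently than intended.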

In [170]:
!cd /content/feast_driver_ranking_tutorial/ && feast materialize-incremental 2022-01-01T00:00:00

Materializing 1 feature views to 2022-01-01 00:00:00+00:00 into the sqlite online store.

driver_hourly_stats from 2022-01-01 00:00:00+00:00 to 2022-01-01 00:00:00+00:00:
0it [00:00, ?it/s]


In [171]:
!cd /content/feast_driver_ranking_tutorial/ && feast materialize 2021-01-01T00:00:00 2022-01-01T00:00:00

Materializing 1 feature views from 2021-01-01 00:00:00+00:00 to 2022-01-01 00:00:00+00:00 into the sqlite online store.

driver_hourly_stats:
100%|████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 406.18it/s]
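
Conceptually, materialization copies the most recent feature row per entity within the requested time window into the online store. The sketch below illustrates that idea in plain Python; it is not Feast's implementation, it assumes inclusive window bounds, and it ignores TTL. The rows are copied from the Parquet listing in this notebook, and the helper names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple

Row = Tuple[datetime, int, float]  # (event_timestamp, driver_id, conv_rate)


def utc(year: int, month: int, day: int, hour: int = 0) -> datetime:
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


def latest_per_driver(rows: List[Row], start: datetime, end: datetime) -> Dict[int, float]:
    """Keep the newest conv_rate per driver among rows with start <= ts <= end."""
    latest: Dict[int, Tuple[datetime, float]] = {}
    for ts, driver_id, conv_rate in rows:
        if not (start <= ts <= end):
            continue
        if driver_id not in latest or ts > latest[driver_id][0]:
            latest[driver_id] = (ts, conv_rate)
    return {driver_id: value for driver_id, (_, value) in latest.items()}


rows = [
    (utc(2021, 8, 17, 3), 1001, 0.415600),
    (utc(2021, 8, 17, 4), 1001, 0.488006),
    (utc(2021, 4, 12, 7), 1001, 0.593042),
    (utc(2021, 8, 17, 4), 1002, 0.942232),
]

# Full-year window, as in `feast materialize 2021-01-01T00:00:00 2022-01-01T00:00:00`.
print(latest_per_driver(rows, utc(2021, 1, 1), utc(2022, 1, 1)))
# Window with start == end, similar to the incremental run above that processed no rows.
print(latest_per_driver(rows, utc(2022, 1, 1), utc(2022, 1, 1)))
```

The values kept for drivers 1001 and 1002 are the same `conv_rate` values that the online lookup returns in Step 7.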


## Step 7: Make a prediction

In [161]:
import pandas as pd
import feast
from joblib import load


class DriverRankingModel:
    def __init__(self):
        # Load model
        self.model = load("/content/driver_model.bin")

        # Set up feature store
        self.fs = feast.FeatureStore(repo_path="/content/feast_driver_ranking_tutorial/")

    def predict(self, driver_ids):
        # Read features from Feast (only the features the model was trained on)
        driver_features = self.fs.get_online_features(
            entity_rows=[{"driver_id": driver_id} for driver_id in driver_ids],
            features=[
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:acc_rate",
                "driver_hourly_stats:avg_daily_trips",
            ],
        )

        print(driver_features.to_dict())

        print('  \n  ')

        df = pd.DataFrame.from_dict(driver_features.to_dict())

        print(df)

        # Make prediction; sort the columns to match the order used for training
        df["prediction"] = self.model.predict(df[sorted(df)])

        print('  \n  ')

        print(df)

        print('  \n  ')

        # Print the score of every requested driver
        for i in range(len(df)):
            print(f"driver_id : {df['driver_id'][i]}, pred_ : {df['prediction'][i]}")

        # Choose best driver
        best_driver_id = df["driver_id"].iloc[df["prediction"].argmax()]

        # Return best driver
        return best_driver_id

In [162]:
def make_drivers_prediction():
    drivers = [1001, 1002, 1003, 1004]
    model = DriverRankingModel()
    best_driver = model.predict(drivers)
    print(f"Prediction for best driver id: {best_driver}")
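
The `predict` method picks the driver with the highest predicted score. The sketch below reproduces that selection in plain Python and also returns the full ranking; `rank_drivers` is a hypothetical helper, and the scores are rounded from the prediction output shown later in this notebook. Because a linear regression produces unbounded values (here below 0 and above 1), the scores are best read as ranking scores rather than probabilities.

```python
from typing import Dict, List, Tuple


def rank_drivers(scores: Dict[int, float]) -> List[Tuple[int, float]]:
    """Return (driver_id, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


scores = {1001: 0.818146, 1002: -1.625998, 1003: 0.722190, 1004: 1.926919}
ranking = rank_drivers(scores)
best_driver_id = ranking[0][0]

print(ranking)
print(best_driver_id)
```

Returning the whole ranking instead of only the top entry makes it easy to fall back to the next driver if the best one is unavailable.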

In [172]:
new_pq_ = pq_[pq_['driver_id'].isin([1001, 1002, 1003, 1004])]
new_pq_

Unnamed: 0,event_timestamp,driver_id,conv_rate,acc_rate,avg_daily_trips,created
361,2021-08-02 05:00:00+00:00,1004,0.248437,0.436958,20,2021-08-17 05:30:48.077
362,2021-08-02 06:00:00+00:00,1004,0.774549,0.904169,955,2021-08-17 05:30:48.077
363,2021-08-02 07:00:00+00:00,1004,0.039977,0.038599,324,2021-08-17 05:30:48.077
364,2021-08-02 08:00:00+00:00,1004,0.838780,0.029421,909,2021-08-17 05:30:48.077
365,2021-08-02 09:00:00+00:00,1004,0.902637,0.291054,168,2021-08-17 05:30:48.077
...,...,...,...,...,...,...
1802,2021-08-17 03:00:00+00:00,1001,0.415600,0.694797,950,2021-08-17 05:30:48.077
1803,2021-08-17 04:00:00+00:00,1001,0.488006,0.294181,69,2021-08-17 05:30:48.077
1804,2021-04-12 07:00:00+00:00,1001,0.593042,0.283652,442,2021-08-17 05:30:48.077
1805,2021-08-09 17:00:00+00:00,1003,0.557066,0.894062,713,2021-08-17 05:30:48.077


In [173]:
new_pq_.sort_values(by=['event_timestamp'], ignore_index=True)

Unnamed: 0,event_timestamp,driver_id,conv_rate,acc_rate,avg_daily_trips,created
0,2021-04-12 07:00:00+00:00,1001,0.593042,0.283652,442,2021-08-17 05:30:48.077
1,2021-04-12 07:00:00+00:00,1002,0.451863,0.471993,24,2021-08-17 05:30:48.077
2,2021-04-12 07:00:00+00:00,1003,0.036765,0.824641,192,2021-08-17 05:30:48.077
3,2021-04-12 07:00:00+00:00,1004,0.612772,0.509554,737,2021-08-17 05:30:48.077
4,2021-08-02 05:00:00+00:00,1004,0.248437,0.436958,20,2021-08-17 05:30:48.077
...,...,...,...,...,...,...
1441,2021-08-17 03:00:00+00:00,1002,0.910113,0.911908,928,2021-08-17 05:30:48.077
1442,2021-08-17 04:00:00+00:00,1001,0.488006,0.294181,69,2021-08-17 05:30:48.077
1443,2021-08-17 04:00:00+00:00,1002,0.942232,0.733364,645,2021-08-17 05:30:48.077
1444,2021-08-17 04:00:00+00:00,1004,0.019139,0.452976,645,2021-08-17 05:30:48.077


In [163]:
make_drivers_prediction()

{'acc_rate': [0.2941805422306061, 0.733363926410675, 0.25538429617881775, 0.4529760479927063], 'conv_rate': [0.4880056381225586, 0.9422317147254944, 0.6883655190467834, 0.019138963893055916], 'driver_id': [1001, 1002, 1003, 1004], 'avg_daily_trips': [69, 645, 891, 645]}
  
  
   acc_rate  conv_rate  driver_id  avg_daily_trips
0  0.294181   0.488006       1001               69
1  0.733364   0.942232       1002              645
2  0.255384   0.688366       1003              891
3  0.452976   0.019139       1004              645
  
  
   acc_rate  conv_rate  driver_id  avg_daily_trips  prediction
0  0.294181   0.488006       1001               69    0.818146
1  0.733364   0.942232       1002              645   -1.625998
2  0.255384   0.688366       1003              891    0.722190
3  0.452976   0.019139       1004              645    1.926919
  
  
driver_id : 1001, pred_ : 0.8181456325479246
driver_id : 1002, pred_ : -1.62599844818277
driver_id : 1003, pred_ : 0.7221897609262271
driver_