<a href="https://colab.research.google.com/github/raptor-ml/docs/blob/master/docs/guides/getting-started-with-labsdk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://raptor.ml/img/logo.svg" height="200" />

# 🦖 LabSDK

Using the LabSDK, data scientists can build models (that can run in production) directly from the notebook.

When you're done, you can "export" your work like any other production asset. This way, you can **focus on your model** while Raptor takes care of the production concerns.

## 🧐 Getting started
In this quickstart tutorial, we'll build a model that predicts the probability of closing a deal.

Our CRM allows us to track every email communication and the history of previous deals for each customer. We'll use these data sources to predict whether the customer is ready to close.

To do that, we're going to build a few features from the data:
1. `emails_10h` - the number of emails exchanged over the last 10 hours
1. `deals_10h[sum]` - the sum of the deal amounts over the last 10 hours
1. `emails_deals` - the ratio between the number of emails in the last 10 hours (`emails_10h`) and the average deal amount in the last 10 hours (`deals_10h[avg]`)
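
To make the three definitions concrete, here is a minimal plain-Python sketch of the arithmetic behind them. It does not use the LabSDK; the sample events, the `in_window` helper, and the variable names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=10)


def in_window(event_at, now, window=WINDOW):
    """Return True when the event falls inside the look-back window ending at `now`."""
    return now - window < event_at <= now


# Hypothetical events for a single account (not taken from the tutorial data).
now = datetime(2023, 1, 1, 12, 0)
email_times = [now - timedelta(hours=h) for h in (1, 3, 9, 11)]
deals = [
    (now - timedelta(hours=2), 100.0),
    (now - timedelta(hours=8), 300.0),
    (now - timedelta(hours=15), 50.0),
]

# emails_10h: how many emails arrived inside the window.
emails_10h = sum(1 for t in email_times if in_window(t, now))

# deals_10h[sum] and deals_10h[avg]: aggregations over the deal amounts in the window.
recent_amounts = [amount for t, amount in deals if in_window(t, now)]
deals_10h_sum = sum(recent_amounts)
deals_10h_avg = deals_10h_sum / len(recent_amounts) if recent_amounts else None

# emails_deals: ratio of the email count to the average deal amount.
emails_deals = emails_10h / deals_10h_avg if deals_10h_avg else None

print(emails_10h, deals_10h_sum, emails_deals)
```

With this sample, three emails and two deals fall inside the window, so the ratio is 3 divided by the average amount of 200.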

## ⚡️ Installing the SDK
Yalla, let's go! First, we install the LabSDK and import it.

In [1]:
!pip install raptor-labsdk pyarrow -U --quiet
from raptor import *
import pandas as pd
from datetime import datetime
from typing_extensions import TypedDict

## ✍️ Writing our first features
Our first feature counts how many emails an account received over the last 10 hours.

To do that, we first define our data sources, and then we can start transforming the data.

In [2]:
@data_source(
    training_data=pd.read_parquet('https://gist.github.com/AlmogBaku/8be77c2236836177b8e54fa8217411f2/raw/emails.parquet'),  # This is the data as it looks in production
    keys=['id', 'account_id'],
    timestamp='event_at',
    production_config=StreamingConfig(kind='kafka'),  # This is optional; it creates the production data-source configuration for DevOps
)
class Email(TypedDict('Email', {'from': str})):
    event_at: datetime
    account_id: str
    subject: str
    to: str

In [3]:
@feature(keys='account_id', data_source=Email)
@aggregation(function=AggregationFunction.Count, over='10h', granularity='1h')
def emails_10h(this_row: Email, ctx: Context) -> int:
    """number of emails over 10 hours"""
    return 1  # every email contributes one observation to the Count aggregation
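
One plausible reading of `over='10h'` with `granularity='1h'` is that events are rolled up into one-hour buckets and the aggregation covers the ten most recent buckets. The sketch below illustrates that idea in plain Python; it is an assumption about the semantics, not Raptor's implementation, and `bucket_start` and `windowed_count` are hypothetical helpers.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, granularity):
    """Round a timestamp down to the start of its bucket."""
    return EPOCH + ((ts - EPOCH) // granularity) * granularity


def windowed_count(event_times, now, over, granularity):
    """Count events in the buckets that cover the last `over` period (assumed semantics)."""
    buckets = Counter(bucket_start(t, granularity) for t in event_times)
    newest = bucket_start(now, granularity)
    n_buckets = over // granularity
    return sum(buckets[newest - i * granularity] for i in range(n_buckets))


# Hypothetical email timestamps for one account.
now = datetime(2023, 1, 1, 12, 30)
events = [
    datetime(2023, 1, 1, 12, 10),
    datetime(2023, 1, 1, 11, 59),
    datetime(2023, 1, 1, 3, 5),
    datetime(2023, 1, 1, 2, 59),
]

hourly = windowed_count(events, now, timedelta(hours=10), timedelta(hours=1))
minutely = windowed_count(events, now, timedelta(hours=10), timedelta(minutes=1))
print(hourly, minutely)
```

In this sketch the 02:59 event is less than 10 hours old, yet the hourly buckets miss it because its bucket starts at 02:00; the one-minute buckets include it. Under this reading, a coarser granularity trades some precision at the window edge for fewer buckets to store.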

In [4]:
@feature(keys='account_id', data_source=Email)
@aggregation(function=AggregationFunction.Avg, over='10h', granularity='1h')
def question_marks_10h(this_row: Email, ctx: Context) -> int:
    """question marks over 10 hours"""
    return this_row['subject'].count('?')

---
> ## 😎 *Cool tip* 
>
> You can use the `@runtime` decorator to specify packages you want to install and use.
>
> [Learn more on the docs »](https://raptor.ml/)

Let's create another feature that calculates several aggregations of the deal amount.

In [5]:
@data_source(
    training_data=pd.read_csv('https://gist.githubusercontent.com/AlmogBaku/8be77c2236836177b8e54fa8217411f2/raw/deals.csv'),
    keys=['id', 'account_id'],
    timestamp='event_at',
)
class Deal(TypedDict):
    id: int
    event_at: pd.Timestamp
    account_id: str
    amount: float

In [6]:
@feature(keys='account_id', data_source=Deal)
@aggregation(
    function=[AggregationFunction.Sum, AggregationFunction.Avg, AggregationFunction.Max, AggregationFunction.Min],
    over='10h',
    granularity='1m'
)
def deals_10h(this_row: Deal, ctx: Context) -> float:
    """sum/avg/min/max of deal amount over 10 hours"""
    return this_row['amount']

Now we can create a *derived feature* that defines the ratio between these two features.

**💡 Hint:** When querying a feature that has aggregations, we need to specify which aggregation function we want using the feature selector (for example, `deals_10h+avg`).
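
To illustrate how such selector strings break down, here is a small sketch that parses the two forms used in this notebook: `name+aggregation` (such as `deals_10h+avg`) and `name@-N` (such as `last_amount@-1`). The `Selector` type, the regular expression, and `parse_selector` are hypothetical and cover only these forms; the full selector grammar is defined by the Raptor docs, not by this snippet.

```python
import re
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    name: str
    aggregation: Optional[str]  # e.g. 'avg'; None when no aggregation is requested
    version: int                # 0 for the current value, -1 for the previous one


# Covers only the forms that appear in this notebook (an assumption, not the full grammar).
_SELECTOR = re.compile(r'^(?P<name>\w+)(?:\+(?P<agg>\w+))?(?:@(?P<ver>-\d+))?$')


def parse_selector(selector: str) -> Selector:
    """Split a selector string into feature name, aggregation, and version offset."""
    match = _SELECTOR.match(selector)
    if match is None:
        raise ValueError(f'unrecognised selector: {selector!r}')
    return Selector(match['name'], match['agg'], int(match['ver'] or 0))


print(parse_selector('deals_10h+avg'))
print(parse_selector('last_amount@-1'))
```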

In [7]:
@feature(keys='account_id', sourceless_markers_df=Deal.raptor_spec.local_df)
@freshness(target='-1', invalid_after='-1')
def emails_deals(_, ctx: Context) -> float:
    """emails/deal[avg] rate over 10 hours"""
    e, _ = ctx.get_feature('emails_10h+count')
    d, _ = ctx.get_feature('deals_10h+avg')
    if e is None or d is None:
        return None
    return e / d
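
The feature above returns `None` when either input is missing, but an average of exactly zero would still raise `ZeroDivisionError`. A minimal standalone sketch of a guarded ratio follows; `safe_ratio` is a hypothetical helper and not part of the LabSDK.

```python
from typing import Optional


def safe_ratio(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


print(safe_ratio(3, 200.0))
print(safe_ratio(3, 0))
print(safe_ratio(None, 200.0))
```

Whether a missing ratio should be `None`, `0`, or some sentinel depends on how the downstream model handles missing values, so treat the choice here as one option among several.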

Finally, we'll create `last_amount`, which retains one previous value. We'll use this feature as our label, and to calculate the delta from the previous amount.

In [8]:
@feature(keys='account_id', data_source=Deal)
@freshness(target='1h', invalid_after='2h')
@keep_previous(versions=1, over='1h')
def last_amount(this_row: Deal, ctx: Context) -> float:
    return this_row['amount']

In [9]:
@feature(keys='account_id', sourceless_markers_df=Deal.raptor_spec.local_df)
@freshness(target='1h', invalid_after='2h')
def diff_with_previous_price(this_row: Deal, ctx: Context) -> float:
    lv, ts = ctx.get_feature('last_amount@-1')
    if lv is None:
        return 0
    return this_row['amount'] - lv
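
The "difference from the previous value" logic can be sketched without the SDK. The snippet below walks over hypothetical `(account_id, amount)` pairs in event order and reports each amount's delta from the previous amount seen for the same account, using 0 when there is none. It ignores the freshness and `over='1h'` settings shown above, so it illustrates only the arithmetic; `deltas_from_previous` is a hypothetical helper.

```python
def deltas_from_previous(deals):
    """Return, for each deal in order, its amount minus the account's previous amount."""
    last_seen = {}  # account_id -> most recent amount
    deltas = []
    for account_id, amount in deals:
        previous = last_seen.get(account_id)
        # No previous value yet: report 0, mirroring the `lv is None` branch above.
        deltas.append(0.0 if previous is None else amount - previous)
        last_seen[account_id] = amount
    return deltas


# Hypothetical deals, already sorted by event time.
deals = [('a', 100.0), ('b', 40.0), ('a', 250.0), ('a', 200.0), ('b', 40.0)]
print(deltas_from_previous(deals))
```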

## 🧠 Building our model
Now that we've defined our features and written our feature engineering code, we can start training our model.

In [10]:
@model(
    keys=['account_id'],
    input_features=[
        'emails_10h+count', 'deals_10h+sum', emails_deals, diff_with_previous_price, 'question_marks_10h+avg',
    ],
    input_labels=[last_amount],
    model_framework='sklearn',
    model_server='sagemaker-ack',
)
@freshness(target='1h', invalid_after='100h')
def deal_prediction(ctx: TrainingContext) -> float:
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split

    df = ctx.features_and_labels()
    X = df[ctx.input_features]
    y = df[ctx.input_labels]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Initialize an XGBoost model
    xgb_model = XGBClassifier()

    # Fit the model to the training data
    xgb_model.fit(X_train, y_train.values.ravel())

    return xgb_model

---
> ℹ️ **Looking to train at *scale*?** Try [Raptor Enterprise](mailto:contact@raptor.ml) 🦖


## ☁️ Deployment
That's the fun part! 🤗 Making our model run at scale in production couldn't be easier.

The only thing we need to do is export the model and its dependent assets (models and data sources). After that, DevOps can deploy it through the existing CI/CD pipeline, or manually using `kubectl` or the generated `Makefile` in the `out` directory.

In [11]:
deal_prediction.export()

In [12]:
!make -C out/

make: Entering directory '/content/out'
make: aws: Command not found
make: aws: Command not found

Usage:
  make <target>

General

## 🪄 Ta-dam!
**From now on**, your features and models run in production and record their values for historical purposes (so you'll be able to retrain against the production data).

[**🔗 Learn more about what else you can do with Raptor in the official docs**](https://raptor.ml)
