# PyRasgo Tutorial

This notebook explains how to use `pyrasgo` to explore a dataset, track the impact of feature engineering, and prune features to produce a final dataframe.

Feature impact is measured with SHAP values from the `catboost` package. These values are used both to track the effect of each feature engineering step and to prune features at the end of the tutorial.
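To make the idea of SHAP-based feature impact concrete, here is a minimal sketch of one common way to turn a matrix of SHAP values into a per-feature ranking: the mean absolute SHAP value per column. How `pyrasgo` aggregates SHAP values internally is not shown in this tutorial, so the helper `mean_abs_shap` and the toy numbers below are assumptions for illustration only.

```python
from typing import Sequence

import numpy as np
import pandas as pd


def mean_abs_shap(shap_values: np.ndarray, feature_names: Sequence[str]) -> pd.Series:
    """Rank features by mean absolute SHAP value, largest first (hypothetical helper)."""
    importance = np.abs(shap_values).mean(axis=0)
    return pd.Series(importance, index=list(feature_names)).sort_values(ascending=False)


# Toy SHAP matrix: 3 rows (samples) x 2 columns (features); values are made up.
# With catboost, a matrix like this can be obtained from
# model.get_feature_importance(Pool(X, y), type="ShapValues"), which returns one
# extra trailing column (the expected value) that has to be dropped first.
toy_shap = np.array([[0.5, -0.1], [-0.3, 0.2], [0.4, 0.0]])
ranking = mean_abs_shap(toy_shap, ["gold", "silver"])
print(ranking)
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks high.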

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

Install `pyrasgo` if it is not already available by uncommenting and running the cell below.

In [1]:
#!pip install -U pyrasgo[df]

In [2]:
import statsmodels.api as sm
import pandas as pd
import pyrasgo

## Connect to Rasgo

NB: Registration only needs to be run once, the first time you use `pyrasgo`. Uncomment the line below and enter your email and password to create an account.

In [3]:
#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

In [4]:
rasgo = pyrasgo.login(email='<your email>', password='<your password>')

## Reading the data

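The data is the `GoldSilver` dataset from the R package `AER`, loaded from `rdatasets` using the Python package `statsmodels`. The dates arrive as the dataframe index, so the index is reset and renamed to a `date` column.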

In [5]:
df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9132 entries, 0 to 9131
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    9132 non-null   object 
 1   gold    9132 non-null   float64
 2   silver  9132 non-null   float64
dtypes: float64(2), object(1)
memory usage: 214.2+ KB


### Create target

The target is the gold price one week ahead. **target_df** holds the future gold price: its dates are shifted back by seven days, so that merging it into the original dataframe on `date` attaches to each row the gold price seven days later. The merged result is the initial dataframe to be analyzed. For convenience, **target** is set to **future_gold_price** here.

In [6]:
df['date'] = pd.to_datetime(df.date)
target_df = df[['date', 'gold']].copy()
target_df['date'] = target_df.date - pd.to_timedelta('7 day')
target_df.rename(columns={'gold': 'future_gold_price'}, inplace=True)
target_df

| | date | future_gold_price |
|---|---|---|
| 0 | 1977-12-23 | 100.00 |
| 1 | 1977-12-26 | 100.00 |
| 2 | 1977-12-27 | 100.00 |
| 3 | 1977-12-28 | 100.00 |
| 4 | 1977-12-29 | 100.00 |
| ... | ... | ... |
| 9127 | 2012-12-18 | 906.96 |
| 9128 | 2012-12-19 | 907.61 |
| 9129 | 2012-12-20 | 909.26 |
| 9130 | 2012-12-21 | 905.00 |


In [7]:
target = 'future_gold_price'

In [8]:
training_df = df.merge(target_df, on='date', how='left')
df = training_df[training_df.date < pd.to_datetime('2012-12-25')].ffill()
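The target construction above can be checked on a small example. The sketch below repeats the same steps (shift the dates back seven days, rename, left-merge on `date`) on a toy frame of made-up prices. It uses `dropna` to remove rows without a target, whereas the tutorial filters by a cutoff date; the names `toy`, `future`, `merged`, and `labelled` are illustrative only.

```python
import pandas as pd

# Toy prices on 8 consecutive weekdays starting Wednesday 2020-01-01 (made-up values).
toy = pd.DataFrame({
    "date": pd.bdate_range("2020-01-01", periods=8),
    "gold": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
})

# Shift the dates back by 7 days so each price lines up with the row 7 days earlier.
future = toy[["date", "gold"]].copy()
future["date"] = future["date"] - pd.Timedelta(days=7)
future = future.rename(columns={"gold": "future_gold_price"})

# A left merge keeps every original row; rows with no price 7 days ahead get NaN.
merged = toy.merge(future, on="date", how="left")

# Only rows with a known future price can be used for training.
labelled = merged.dropna(subset=["future_gold_price"])
print(merged)
```

Here the row for 2020-01-01 receives the price from 2020-01-08, and the last five rows have no target because the toy data ends before their seven-day horizon.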

## Feature engineering

### Start experiment

In [9]:
rasgo.activate_experiment('Tutorial Experiment')

Activated existing experiment with name Tutorial Experiment for dataframe: Nxn3BRZA7yjHQC6uYRxmjzbmpqQ5jKvK2IYW6JekPv0


### Profile starting data

#### Generate feature profiles

In [10]:
response = rasgo.evaluate.profile(df)

Column date has an unrecognzied type. Profiling the column as string type.


#### Calculate feature importance

This generates a baseline against which to compare the impact of our feature engineering.

In [11]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

### Start feature engineering

Create the initial lag variables: gold and silver prices lagged by 1 and 7 rows.

In [12]:
df['gold_lag1'] = df['gold'].shift(1)
df['gold_lag7'] = df['gold'].shift(7)

df['silver_lag1'] = df['silver'].shift(1)
df['silver_lag7'] = df['silver'].shift(7)
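One detail worth keeping in mind is that `shift(n)` moves values by `n` rows, not by `n` calendar days. On weekday-only data the two differ across weekends, as this small sketch with made-up prices shows (the column `date_of_lag1` and the variable `gap_days` are added here only for illustration).

```python
import pandas as pd

# 6 consecutive weekdays starting Wednesday 2020-01-01 (made-up prices).
toy = pd.DataFrame({
    "date": pd.bdate_range("2020-01-01", periods=6),
    "gold": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

# shift(1) takes the value from the previous row, whatever its date is.
toy["gold_lag1"] = toy["gold"].shift(1)
toy["date_of_lag1"] = toy["date"].shift(1)

# Calendar distance between each row and the row its lag came from.
gap_days = (toy["date"] - toy["date_of_lag1"]).dt.days
print(toy.assign(gap_days=gap_days))
```

The first row has no predecessor, so its lag is `NaN`. The Monday row takes its lag from the preceding Friday, three calendar days earlier.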

Recalculate feature importance with the new lag features.

In [13]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

#### Add more lag variables

In [14]:
df['gold_lag14'] = df['gold'].shift(14)
df['gold_lag60'] = df['gold'].shift(60)

df['silver_lag14'] = df['silver'].shift(14)
df['silver_lag60'] = df['silver'].shift(60)

Again, check feature importance.

In [15]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

#### Calculate ratios of gold prices

In [16]:
df['gold_to_last1'] = df['gold'] / df['gold_lag1']
df['gold_to_last7'] = df['gold'] / df['gold_lag7']
df['gold_to_last14'] = df['gold'] / df['gold_lag14']
df['gold_to_last60'] = df['gold'] / df['gold_lag60']

df['gold_1_to_last7'] = df['gold_lag1'] / df['gold_lag7']
df['gold_1_to_last14'] = df['gold_lag1'] / df['gold_lag14']
df['gold_7_to_last14'] = df['gold_lag7'] / df['gold_lag14']
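As a sanity check on the ratio features, the ratio of a price to its lag is the same quantity as one plus the fractional change over that lag, which pandas exposes as `pct_change`. The sketch below verifies this on a short made-up series. It is an equivalent formulation, not the method used in the cell above.

```python
import numpy as np
import pandas as pd

gold = pd.Series([100.0, 110.0, 99.0, 108.9])  # made-up prices

# Ratio to the previous row, computed as in the tutorial.
ratio_manual = gold / gold.shift(1)

# The same quantity via pct_change: ratio = 1 + fractional change.
ratio_pct = 1 + gold.pct_change(periods=1, fill_method=None)

# Both have NaN in the first row, where no previous value exists.
print(pd.DataFrame({"manual": ratio_manual, "via_pct_change": ratio_pct}))
```

Ratios are scale-free, so a move from 100 to 110 and a move from 1000 to 1100 produce the same feature value.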

Check feature importance.

In [17]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

#### Calculate difference in prices over time

In [19]:
df['gold_minus_last1'] = df['gold'] - df['gold_lag1']
df['gold_minus_last7'] = df['gold'] - df['gold_lag7']
df['gold_minus_last14'] = df['gold'] - df['gold_lag14']
df['gold_minus_last60'] = df['gold'] - df['gold_lag60']

df['gold_1_minus_last7'] = df['gold_lag1'] - df['gold_lag7']
df['gold_1_minus_last14'] = df['gold_lag1'] - df['gold_lag14']
df['gold_7_minus_last14'] = df['gold_lag7'] - df['gold_lag14']
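The differences between the current price and a lag can also be written with `Series.diff`, which subtracts the value `periods` rows earlier. A minimal sketch on made-up prices, shown only as an equivalent formulation of the `gold_minus_last*` columns:

```python
import pandas as pd

gold = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])  # made-up prices

# Difference to the value two rows earlier, written out as in the tutorial.
manual = gold - gold.shift(2)

# The same result using the built-in diff.
builtin = gold.diff(periods=2)

print(pd.DataFrame({"manual": manual, "diff": builtin}))
```

Unlike ratios, differences keep the price units, so their magnitude depends on the price level.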

Check feature importance.

In [20]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

### Feature selection

Keep the top three-quarters of features, ranked by importance.

In [21]:
df = rasgo.prune.features(df, target_column=target, top_n_pct=.75)

Prune Method: Keeping top 0.75 of features
Column date has an unrecognzied type. Profiling the column as string type.
Dropped features not in top 0.75 pct: ['gold_to_last14', 'gold_to_last1', 'gold_1_to_last7', 'gold_1_minus_last7', 'gold_minus_last14', 'gold_1_to_last14']
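Conceptually, keeping the top fraction of features means ranking them by importance and retaining the first part of that ranking. The sketch below illustrates the idea with made-up importance scores. The helper `keep_top_fraction` and its rounding rule (round up) are assumptions for illustration, and `pyrasgo` may implement the cut differently.

```python
import math

import pandas as pd


def keep_top_fraction(importance: pd.Series, fraction: float) -> list:
    """Return the names of the top `fraction` of features by importance (hypothetical helper)."""
    n_keep = math.ceil(len(importance) * fraction)  # rounding up is an assumption
    ranked = importance.sort_values(ascending=False)
    return list(ranked.index[:n_keep])


# Made-up importance scores for four features.
toy_importance = pd.Series(
    {"gold": 0.9, "gold_lag1": 0.5, "silver": 0.2, "silver_lag60": 0.05}
)

kept = keep_top_fraction(toy_importance, 0.75)
dropped = [name for name in toy_importance.index if name not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Because importance is recomputed on the remaining features after each pruning pass, the ranking can change from one pass to the next.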


Recalculate the feature importance to check the impact of pruning.

In [22]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

Prune again, dropping another quarter of the remaining features.

In [23]:
df = rasgo.prune.features(df, target_column=target, top_n_pct=.75)

Prune Method: Keeping top 0.75 of features
Column date has an unrecognzied type. Profiling the column as string type.
Dropped features not in top 0.75 pct: ['silver_lag1', 'gold_to_last7', 'gold_7_to_last14', 'gold_minus_last1']


In [24]:
response = rasgo.evaluate.feature_importance(df, target_column=target)

Column date has an unrecognzied type. Profiling the column as string type.

### End the experiment

In [26]:
rasgo.end_experiment()

Experiment ended
