# Notebook to generate three datasets

- two development datasets (train and test) stored in S3
- one dataset stored in Elasticsearch to simulate production

### Creation of a pex with Python dependencies

Some basic Python modules (e.g. pandas) are already available in Punch images. If you need additional dependencies, you must generate a pex. The same pex is used in development and in production, which limits version drift between the two environments.

This notebook itself only imports scikit-learn, but the command below also packages mlflow; you can list as many modules as needed (see [punch_pex](https://punch-1.gitbook.io/punch-doc/v/welcome-to-the-punch/applications/jupyter/magic-commands#punchpex)).

In [1]:
%punch_pex -l scikit-learn mlflow --group demo --artifact dependencies -v 1.0.0 -o

  adding: dependencies-1.0.0.pex (deflated 2%)
  adding: metadata.yml (deflated 26%)


++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 upload -f /punch/punch_pex/dependencies-1.0.0.zip -o


Resource uploaded : additional-pex:demo:dependencies:1.0.0


### Adding dependencies to the environment

This notebook only needs the pex created above. Thus we load it via the [punch_dependencies](https://punch-1.gitbook.io/punch-doc/v/welcome-to-the-punch/applications/jupyter/magic-commands#punchdependencies) magic cell.

In [None]:
%%punch_dependencies
additional-pex:demo:dependencies:1.0.0

++ java -Xmx1g -Xms256m -Dlog4j.configurationFile=/punch/conf/log4j2/log4j2-stdout.xml -cp /punch/resourcectl.jar com.github.punchplatform.resourcectl.ResourceCtl -u http://artifacts-server.punch-artifacts:4245 download -r additional-pex:demo:dependencies:1.0.0 -o /usr/share/punch/extlib/python


Resource additional-pex:demo:dependencies:1.0.0 downloaded to /usr/share/punch/extlib/python/dependencies-1.0.0.pex



### Importing modules

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Reading data from s3
Punch provides magic cells to read data from different sources. If your JupyPunch was deployed with preconfigured databases, you do not need to re-enter your login credentials.

Here, the Kaggle file has already been loaded into a MinIO bucket named "demo". We read the file and store its content in a variable called "data" (see [punch_source](https://punch-1.gitbook.io/punch-doc/v/welcome-to-the-punch/applications/jupyter/magic-commands#punchsource-and-punchsink)).

In [2]:
%%punch_source --type s3 --name data -o 
bucket: demo
prefix: card

Data is available in data variable.
Execution time: 0:00:00.685834


In [3]:
data.head(2)

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud,_ppf_path,_ppf_last_modified
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0,card_transdata.csv,1671010000.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0,card_transdata.csv,1671010000.0


In [4]:
print("Number of data", len(data))

Number of data 1000000


### Removing unused columns

In [5]:
data = data.drop(["_ppf_path", "_ppf_last_modified"], axis=1)
data.head(2)

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
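
The `Unnamed: 0` column that remains in the output above is what pandas typically produces when a CSV written with its index is read back without `index_col`. Whether that is the origin here is an assumption, since the file is read through `punch_source`. The sketch below reproduces the effect on a tiny hypothetical frame and shows one way to drop the column if it is not wanted.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the card dataset.
original = pd.DataFrame({"used_chip": [1.0, 0.0], "fraud": [0.0, 0.0]})

# Write the frame to an in-memory CSV, index included (the pandas default).
buffer = io.StringIO()
original.to_csv(buffer)

# Read it back without index_col: the index comes back as "Unnamed: 0".
buffer.seek(0)
reread = pd.read_csv(buffer)
print(list(reread.columns))

# errors="ignore" makes the drop a no-op when the column is absent.
cleaned = reread.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(cleaned.columns))
```

If the column carries no information, dropping it before the split keeps it out of the training features.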


### Splitting dataframe into train and test

In [6]:
# Stratify on the label so train and test keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(["fraud"], axis=1),
    data["fraud"],
    test_size=0.3,
    stratify=data["fraud"],
    random_state=42,
)

In [7]:
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
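
A quick sanity check after a stratified split is to compare the label ratio in each subset with the ratio in the full dataset. The sketch below does this on synthetic data: the `demo` frame, its size, and the roughly 9% positive rate are assumptions made for illustration, not properties of the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the card dataset (assumed size and fraud ratio).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "distance_from_home": rng.exponential(25.0, size=10_000),
    "fraud": (rng.random(10_000) < 0.09).astype(float),
})

# Same split parameters as in the notebook cell above.
X_train, X_test, y_train, y_test = train_test_split(
    demo.drop(columns=["fraud"]),
    demo["fraud"],
    test_size=0.3,
    stratify=demo["fraud"],
    random_state=42,
)

# Re-attach the label so each subset can be exported as a single table.
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

full_rate = demo["fraud"].mean()
train_rate = train["fraud"].mean()
test_rate = test["fraud"].mean()
print(f"fraud rate: full={full_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

With `stratify` set, the three rates should agree to within a fraction of a percent; without it, a rare class can drift noticeably between subsets.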

### Exporting the production dataset and the train and test development datasets

As with sources, Punch provides a magic cell to write data to the preconfigured databases ([punch_sink](https://punch-1.gitbook.io/punch-doc/v/welcome-to-the-punch/applications/jupyter/magic-commands#punchsource-and-punchsink)).

We export the first 100,000 rows of the initial file to Elasticsearch, and the train and test sets to MinIO.

In [9]:
data = data[0:100000]

In [10]:
%%punch_sink --type elasticsearch -df data
index:
    type: constant
    value: credit_card

Data saved.
Execution time: 0:00:03.542044


In [11]:
%%punch_sink --type s3 -df train
bucket: demo
path: train/train.csv

created train/train.csv object; bucket: demo ; etag: "f30aa4218434300e8c32715470d5a4ec"
Data saved.
Execution time: 0:00:03.379350


In [12]:
%%punch_sink --type s3 -df test
bucket: demo
path: test/test.csv

created test/test.csv object; bucket: demo ; etag: "6511e9e964031113cd17726abb40d774"
Data saved.
Execution time: 0:00:01.525942
