# Simility Read Data Example

This notebook shows how the Simility Read Data sub-package can be used to read pipeline output data stored in various formats, while applying the Cassandra datatypes of the fields in a Simility environment.

## Requirements

To run this notebook, you'll need to:

* Install the Simility Read Data package (see the readme for more information).
* Be connected to the PayPal VPN.

---

## Import packages

In [2]:
from read_data.read_data import DataReader
from simility_apis.set_password import set_password

import pandas as pd
import numpy as np
import json

---

## Set your password

Before using the read_data module, provide the password you use to log in to the Simility environment. It is needed so that the Cassandra datatype of each pipeline output field can be fetched:

In [3]:
set_password()

Please provide your password for logging into the Simility platform:  ·········


---

## Read CSV

We can read a CSV file and ensure the datatypes of its fields align with Cassandra by using the *read_csv* method of the *DataReader* class.

First, we instantiate the *DataReader* class with parameters relating to the Simility environment in question:

In [4]:
params = {
    "url": 'http://sim-ds.us-central1.gcp.dev.paypalinc.com',
    "app_prefix": 'james_testing',
    "user": 'james@simility.com',
    "base_entity": 'transaction'
}

In [5]:
dr = DataReader(**params)

Now we can read in the CSV file, supplying any keyword arguments that need to be passed to the Pandas read_csv method (here, index_col):

In [6]:
data = dr.read_csv(filepath='dummy_data/dummy_pipeline_output_data.csv',
                   index_col='eid')
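To make the idea concrete, here is a minimal sketch of how a per-field datatype mapping can be applied when reading a CSV with Pandas alone. It is not the DataReader implementation: the CSV contents, the `field_cass_dtypes` dictionary and the `cass_to_pandas` mapping are all hypothetical stand-ins for what would normally be fetched from the Simility environment.

```python
import io

import pandas as pd

# Hypothetical stand-in for a pipeline output file (not real data).
csv_text = (
    "eid,account_number,order_total,num_orders\n"
    "eid-001,acc-aaa,0.0,1\n"
    "eid-002,acc-bbb,,1\n"
)

# Hypothetical Cassandra datatype of each field; in practice these would
# come from the Simility environment rather than being hard-coded.
field_cass_dtypes = {
    "eid": "TEXT",
    "account_number": "TEXT",
    "order_total": "DOUBLE",
    "num_orders": "INT",
}

# Hypothetical Cassandra-to-Pandas mapping (an assumption for illustration).
cass_to_pandas = {"TEXT": object, "DOUBLE": float, "INT": float}

# Translate each field's Cassandra datatype into the dtype Pandas should use.
column_dtypes = {
    field: cass_to_pandas[cass_dtype]
    for field, cass_dtype in field_cass_dtypes.items()
}

# Extra keyword arguments such as index_col are passed to pd.read_csv as usual.
example = pd.read_csv(io.StringIO(csv_text), dtype=column_dtypes, index_col="eid")
print(example.dtypes)
```

Reading INT fields as float in this sketch means a column with missing values can still be loaded, at the cost of integers being displayed with a decimal point.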

### Outputs

The *.read_csv()* method returns a Pandas dataframe of the CSV file, in which each column uses the Pandas datatype mapped to its Cassandra datatype:

In [7]:
data.head()

| eid | account_number | account_number_avg_order_total_per_account_number_1day | account_number_avg_order_total_per_account_number_30day | account_number_avg_order_total_per_account_number_7day | account_number_avg_order_total_per_account_number_90day | account_number_eid | account_number_num_distinct_transaction_per_account_number_1day | account_number_num_distinct_transaction_per_account_number_30day | account_number_num_distinct_transaction_per_account_number_7day | account_number_num_distinct_transaction_per_account_number_90day | ... | sim_queues | sim_sc | sim_sc_ml | sim_updated_at | sim_updated_customer | sim_updated_internal | sim_updated_user_email | sim_wl | sim_wl2 | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 147-5738036-4442504 | f2a99b2c64eed603165eec3590d8e162 |  | 0.0 | 0.0 | 0.0 | f2a99b2c64eed603165eec3590d8e162 | 1 | 1 | 1 | 1 | ... | [] |  |  | 1574372761000.0 | 1582237359410.0 |  | james_testing_api_admin@james_testing.com | ['Closed'] |  | New |
| 254-3871443-0481877 | 1e83c43e316ecdc539f5611410c366fa | 0.0 | 0.0 | 0.0 | 0.0 | 1e83c43e316ecdc539f5611410c366fa | 1 | 1 | 1 | 1 | ... | [] |  |  | 1574372778000.0 | 1582237359934.0 |  | james_testing_api_admin@james_testing.com | ['Closed'] |  | New |
| 404-7064563-8888834 | d421fb1e54650501ac29f8a139fb7f4d | 0.0 | 0.0 | 0.0 | 0.0 | d421fb1e54650501ac29f8a139fb7f4d | 1 | 1 | 1 | 1 | ... | [] |  |  | 1574372811000.0 | 1582237360396.0 |  | james_testing_api_admin@james_testing.com | ['Closed'] |  | New |
| 775-5355315-3130338 | c9b125ebddcae943eb4145f02e9cf7d4 | 0.0 | 0.0 | 0.0 | 0.0 | c9b125ebddcae943eb4145f02e9cf7d4 | 1 | 1 | 1 | 1 | ... | [] |  |  | 1574373202000.0 | 1582237360917.0 |  | james_testing_api_admin@james_testing.com | ['Closed'] |  | New |
| 899-4723735-1420281 | e6d7963551958b0e6cab027dfa20c318 | 0.0 | 0.0 | 0.0 | 0.0 | e6d7963551958b0e6cab027dfa20c318 | 1 | 1 | 1 | 1 | ... | [] |  |  | 1574373315000.0 | 1582237361352.0 |  | james_testing_api_admin@james_testing.com | ['Closed'] |  | New |


In [8]:
data.dtypes

account_number                                              object
account_number_avg_order_total_per_account_number_1day     float64
account_number_avg_order_total_per_account_number_30day    float64
account_number_avg_order_total_per_account_number_7day     float64
account_number_avg_order_total_per_account_number_90day    float64
                                                            ...   
sim_updated_internal                                        object
sim_updated_user_email                                      object
sim_wl                                                      object
sim_wl2                                                     object
status                                                      object
Length: 64, dtype: object
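The *sim_updated_at* values in the dataframe preview look like Unix epoch timestamps in milliseconds. Assuming that interpretation is correct (the source does not state it), a column like this could be converted to Pandas datetimes after reading. The sketch below uses two values copied from the preview and standard Pandas only; it is not part of the DataReader API.

```python
import pandas as pd

# Two values copied from the sim_updated_at column in the preview.
# Treating them as epoch milliseconds is an assumption.
raw = pd.Series([1574372761000.0, 1574372778000.0], name="sim_updated_at")

# unit="ms" tells Pandas the numbers count milliseconds since 1970-01-01 UTC.
parsed = pd.to_datetime(raw, unit="ms")
print(parsed)
```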

### Specify your own datatype mapping

The DataReader class uses a default Cassandra-to-Pandas datatype mapping when reading a file (see the class docstring for more information). You can instead supply your own mapping through the cass_python_dtype_mapping parameter; just ensure that every Cassandra datatype is covered:

In [9]:
new_mapping = {
    'DOUBLE': float,
    'TEXT': object,
    'INT': float,
    'BOOLEAN': object,
    'TIMESTAMP': object,
    'SET': object,
    'MAP': object,
    'FLOAT': float,
    'BLOB': object
}
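Because every Cassandra datatype must be covered, it can help to check a custom mapping before passing it to DataReader. The sketch below is one way to do that. The helper `missing_cass_dtypes` is hypothetical, and the set of required datatypes is assumed to be the nine keys used in the mapping above; the class docstring remains the authoritative list.

```python
# Assumption: the required Cassandra datatypes are the nine keys used in the
# example mapping; check the DataReader docstring for the authoritative list.
REQUIRED_CASS_DTYPES = {
    "DOUBLE", "TEXT", "INT", "BOOLEAN", "TIMESTAMP",
    "SET", "MAP", "FLOAT", "BLOB",
}


def missing_cass_dtypes(mapping, required=REQUIRED_CASS_DTYPES):
    """Return the required Cassandra datatypes absent from mapping, sorted."""
    return sorted(set(required) - set(mapping))


# A deliberately incomplete mapping, used only to show the check.
incomplete_mapping = {"DOUBLE": float, "TEXT": object, "INT": float}
missing = missing_cass_dtypes(incomplete_mapping)
print(missing)
```

An empty list means the mapping covers every datatype in the assumed set.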

In [10]:
params = {
    "url": 'http://sim-ds.us-central1.gcp.dev.paypalinc.com',
    "app_prefix": 'james_testing',
    "user": 'james@simility.com',
    "base_entity": 'transaction',
    "cass_python_dtype_mapping": new_mapping
}

In [11]:
dr = DataReader(**params)

In [12]:
data = dr.read_csv(filepath='dummy_data/dummy_pipeline_output_data.csv',
                   index_col='eid')

In [13]:
data.dtypes

account_number                                              object
account_number_avg_order_total_per_account_number_1day     float64
account_number_avg_order_total_per_account_number_30day    float64
account_number_avg_order_total_per_account_number_7day     float64
account_number_avg_order_total_per_account_number_90day    float64
                                                            ...   
sim_updated_internal                                        object
sim_updated_user_email                                      object
sim_wl                                                      object
sim_wl2                                                     object
status                                                      object
Length: 64, dtype: object
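With `'SET': object` in the mapping, a column such as *sim_wl* holds values like `['Closed']` or `[]`. If those arrive from the CSV as strings rather than Python lists (an assumption, since the source only shows how they are displayed), they could be parsed after reading. The following sketch uses the standard library's `ast.literal_eval`; the `parse_collection` helper is hypothetical and not part of the read_data package.

```python
import ast

import pandas as pd

# Values modelled on the sim_wl column shown earlier, stored as strings.
raw = pd.Series(["['Closed']", "[]", None], name="sim_wl", dtype=object)


def parse_collection(value):
    """Parse a string such as "['Closed']" into a list; pass other values through."""
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, so it is safer than eval.
        return ast.literal_eval(value)
    return value


parsed = raw.map(parse_collection)
print(parsed.tolist())
```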

---

## The End

That's it folks - if you have any queries or suggestions please put them in the *#sim-datatools-help* Slack channel or email James directly.