<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Building-the-Solution-Design" data-toc-modified-id="Building-the-Solution-Design-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Building the Solution Design</a></span><ul class="toc-item"><li><span><a href="#Cleaning-the-elements" data-toc-modified-id="Cleaning-the-elements-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Cleaning the elements</a></span><ul class="toc-item"><li><span><a href="#Fixing-null-values" data-toc-modified-id="Fixing-null-values-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Fixing null values</a></span></li></ul></li><li><span><a href="#Fixing-metrics-table" data-toc-modified-id="Fixing-metrics-table-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Fixing metrics table</a></span></li><li><span><a href="#Metrics-and-dimensions-SDR" data-toc-modified-id="Metrics-and-dimensions-SDR-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Metrics and dimensions SDR</a></span><ul class="toc-item"><li><span><a href="#Optional-:-Concat-dataframe" data-toc-modified-id="Optional-:-Concat-dataframe-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Optional : Concat dataframe</a></span></li></ul></li></ul></li><li><span><a href="#Connecting-with-AEP-Schema" data-toc-modified-id="Connecting-with-AEP-Schema-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Connecting with AEP Schema</a></span><ul class="toc-item"><li><span><a href="#Schema-Manager" data-toc-modified-id="Schema-Manager-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Schema Manager</a></span></li><li><span><a href="#Merging-SDR-with-Schema-definition" data-toc-modified-id="Merging-SDR-with-Schema-definition-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Merging SDR with Schema definition</a></span></li></ul></li></ul></div>

In a very large organization, many people may be working on Data Views at the same time.\
Your data views are the core representation of your data store.\
They are what your workspaces and reports query.\
Having a clear view of what has been defined in your data view is therefore very important, and the following steps show how a script can extract that information easily. 

To build that view and document it, you can use the cjapy module to create a solution design for your analysts.

We first load cjapy and the configuration used.

In [1]:
import cjapy
cjapy.importConfigFile('config_example.json')

Once that is done, we can instantiate the connection to the CJA API via the `CJA` class.

In [2]:
cja = cjapy.CJA()

# Building the Solution Design

In order to build a solution design, you need a complete view of what has been set up in your data view. 

From the `cja` connection that we have built, we will select the data view that we want to document and extract all of its dimensions and metrics.

In [3]:
dataviews = cja.getDataViews()

Selecting a data view by using its name

In [4]:
dv_id = dataviews.at[dataviews[dataviews['name']=='Datanalyst'].index[0],'id']
dv_id

'dv_62c2c7ccb373f55b9f617157'
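
The same lookup can be written with a boolean mask and `.loc`, which also makes it easy to fail with a clear message when the name does not exist. This is a minimal sketch on a toy dataframe: the ids, the names and the helper `get_dataview_id` are made up for illustration and are not part of cjapy.

```python
import pandas as pd

# Toy stand-in for the dataframe returned by cja.getDataViews();
# the ids and names below are invented for this example.
dataviews_demo = pd.DataFrame({
    "id": ["dv_aaa", "dv_bbb"],
    "name": ["Sandbox view", "Datanalyst"],
})

def get_dataview_id(df, name):
    """Return the id of the first data view whose name matches exactly."""
    matches = df.loc[df["name"] == name, "id"]
    if matches.empty:
        raise ValueError(f"No data view named {name!r}")
    return matches.iloc[0]

demo_id = get_dataview_id(dataviews_demo, "Datanalyst")
print(demo_id)
```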

Now that we have its id, we can use it to retrieve the different components associated with it.

In [5]:
dimensions = cja.getDimensions(dv_id,full=True)
metrics = cja.getMetrics(dv_id,full=True)

The data returned by these methods are pandas `dataframes`, which makes them very easy to manipulate and to save. 

Using the `head` method, you can see the first 5 rows that have been returned.

In [6]:
dimensions.head()

Unnamed: 0,id,name,description,sourceFieldId,sourceFieldName,storageId,dataSetIds,dataSetType,schemaType,sourceFieldType,...,fieldDefinition,noValueOptionsSetting,defaultDimensionSort,persistenceSetting,behaviorSetting,substringSetting,baseTableName,required,derivedFieldCompatible,labels
0,variables/placeContext.geo.countryCode,Country code,The two-character [ISO 3166-1 alpha-2](https:/...,placeContext.geo.countryCode,Country code,explicitp9f128de542059012c91a65ntrycode|hits,[6059fd4fc52f8819484a7c1c],event,string,custom,...,"[{'func': 'raw-field', 'id': 'placeContext.geo...","{'includeNoneByDefault': True, 'noneChangeable...",False,{'enabled': False},{'lowercase': False},{'enabled': False},,,,
1,variables/daterangesecond,Second,,,,,,event,,standard,...,"[{'func': 'raw-field', 'id': 'adobe_datetime',...","{'includeNoneByDefault': False, 'noneChangeabl...",True,,,,hits,True,True,
2,variables/device.typeID.global-classify-string...,Audio Support,,device.typeID.global-classify-string._globallo...,Audio Support,device.typeID.global-classify-string.g7ff63669...,[5f739f3043ffd11914d6ddaa],lookup,string,globalLookup,...,"[{'func': 'raw-field', 'id': 'device.typeID.gl...","{'includeNoneByDefault': True, 'noneChangeable...",False,{'enabled': False},{'lowercase': False},{'enabled': False},,,,
3,variables/device.typeID.global-classify-string...,Manufacturer,,device.typeID.global-classify-string._globallo...,Manufacturer,device.typeID.global-classify-string.g7ff63669...,[5f739f3043ffd11914d6ddaa],lookup,string,globalLookup,...,"[{'func': 'raw-field', 'id': 'device.typeID.gl...","{'includeNoneByDefault': True, 'noneChangeable...",False,{'enabled': False},{'lowercase': False},{'enabled': False},,,,
4,variables/adobe_identitynamespace_personid,Identity Namespace:Person ID,,,,,,event,,standard,...,"[{'func': 'raw-field', 'id': 'adobe_identityna...","{'includeNoneByDefault': False, 'noneChangeabl...",False,,,,,,False,


You can see that you have lots of details about each of these dimensions.\
You can get the number of elements by using the `len()` function.

In [7]:
len(dimensions)

88

## Cleaning the elements

First of all, you may have default dimensions or metrics that are part of that data view but are not that important because they do not contain data.\
Removing them cleans up the table.\
This task, although not glamorous, is crucial to understand and carry out before doing any sort of data analysis.\
We are using the simple task of creating a Solution Design to introduce some concepts such as:
* identifying null values
* cleaning null values

There are some columns that you may also not want to duplicate on your solution design, therefore we will remove them as well.

### Fixing null values

Not a Number (NaN) or not defined (na) values are elements that could break some basic conditional filtering, so we would like to clean these first.

In [8]:
dimensions.isnull().sum()

id                         0
name                       0
description               13
sourceFieldId             30
sourceFieldName           30
storageId                 30
dataSetIds                30
dataSetType                0
schemaType                30
sourceFieldType            0
tableName                 30
type                       0
hideFromReporting         13
schemaPath                30
hasData                    0
segmentable                0
favorite                   0
approved                   0
tags                       0
usageSummary               0
hidden                     0
fromGlobalLookup           0
multiValued               30
includeExcludeSetting     32
fieldDefinition            3
noValueOptionsSetting      0
defaultDimensionSort       0
persistenceSetting        30
behaviorSetting           34
substringSetting          34
baseTableName             63
required                  69
derivedFieldCompatible    58
labels                    87
dtype: int64
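
Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below uses a small made-up dataframe (not the real `dimensions` table), and the 50% threshold is an arbitrary choice for the example.

```python
import pandas as pd

# Toy data with the same kind of gaps as the dimensions table
demo = pd.DataFrame({
    "id": ["variables/a", "variables/b", "variables/c", "variables/d"],
    "description": ["Country", None, None, "Second"],
    "labels": [None, None, None, None],
})

# Share of missing values per column, largest first
null_share = demo.isnull().mean().sort_values(ascending=False)

# Columns that are missing in more than half of the rows
mostly_empty = null_share[null_share > 0.5].index.tolist()
print(null_share)
print(mostly_empty)
```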

In [9]:
dimensions['hasData'] = dimensions['hasData'].fillna(False) ## if no information, just place False as default
dimensions['derivedFieldCompatible'] = dimensions['derivedFieldCompatible'].fillna(False) ## if no information, just place False as default
dimensions['dataSetType'] = dimensions['dataSetType'].fillna("system")  ## if no information, just place "system" as default
dimensions['sourceFieldId'] = dimensions['sourceFieldId'].fillna("cja")  ## if no information, just place "cja" as default

Based on these three attributes, `hasData`, `derivedFieldCompatible` and `dataSetType`, you can already filter your solution design quite effectively. 
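
As an illustration of such a filter, the sketch below keeps only the components that contain data and that come from an actual dataset. It runs on invented rows, and it assumes that `"system"` is the placeholder value written by the `fillna` call above.

```python
import pandas as pd

# Toy rows; the ids are invented for this example
dims_demo = pd.DataFrame({
    "id": ["variables/a", "variables/b", "variables/c"],
    "hasData": [True, False, True],
    "derivedFieldCompatible": [False, False, True],
    "dataSetType": ["event", "lookup", "system"],
})

# Keep rows that contain data and are not the "system" placeholder
mask = dims_demo["hasData"] & (dims_demo["dataSetType"] != "system")
from_datasets = dims_demo[mask]
print(from_datasets["id"].tolist())
```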

## Fixing metrics table

You can also look at random rows of your dataframe by using the `sample()` method; its argument gives the number of rows to return.  

In [10]:
metrics.sample(4)

Unnamed: 0,id,name,description,dataSetType,sourceFieldType,baseTableName,type,hideFromReporting,hasData,segmentable,...,sourceFieldName,storageId,dataSetIds,schemaType,tableName,schemaPath,multiValued,includeExcludeSetting,required,attributionSetting
11,metrics/environment.browserDetails.viewportWidth,Viewport width,The horizontal size in pixels of the window th...,event,custom,,int,False,True,True,...,Viewport width,expliciteda4aab4f5c863da445d069ortwidth|hits,[6059fd4fc52f8819484a7c1c],integer,hits,environment.browserDetails.viewportWidth,False,{'enabled': False},,
12,metrics/device.screenWidth,Screen width,The number of horizontal pixels of the device'...,event,custom,,int,False,True,True,...,Screen width,explicitd6ed994d121bf281df56229eenwidth|hits,[6059fd4fc52f8819484a7c1c],integer,hits,device.screenWidth,False,{'enabled': False},,
14,metrics/web.webInteraction.linkClicks.value,web.webInteraction.linkClicks.value,The quantifiable value of this measure.,event,custom,,decimal,False,True,True,...,web.webInteraction.linkClicks.value,explicitw94a6416e5fd321e097a91avalue|hits,[6059fd4fc52f8819484a7c1c],double,hits,web.webInteraction.linkClicks.value,False,{'enabled': False},,
4,metrics/visits,Sessions,,event,standard,,int,,True,True,...,,,,,,,,,True,


In [11]:
metrics.isnull().sum()

id                         0
name                       0
description                3
dataSetType                0
sourceFieldType            0
baseTableName             12
type                       0
hideFromReporting          3
hasData                    0
segmentable                0
favorite                   0
approved                   0
tags                       0
usageSummary               0
hidden                     0
fromGlobalLookup           0
fieldDefinition            4
derivedFieldCompatible     8
sourceFieldId              8
sourceFieldName            8
storageId                  8
dataSetIds                 8
schemaType                 8
tableName                  8
schemaPath                 8
multiValued                8
includeExcludeSetting      8
required                  10
attributionSetting        13
dtype: int64

In [12]:
metrics['hasData'] = metrics['hasData'].fillna(False) ## if no information, just place False as default
metrics['derivedFieldCompatible'] = metrics['derivedFieldCompatible'].fillna(False) ## if no information, just place False as default
metrics['dataSetType'] = metrics['dataSetType'].fillna("system")  ## if no information, just place "system" as default
metrics['sourceFieldId'] = metrics['sourceFieldId'].fillna("cja")  ## if no information, just place "cja" as default
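
Since the same four defaults are applied to both tables, they could be gathered in one place. The helper below is a hypothetical sketch (`fill_sdr_defaults` and `SDR_DEFAULTS` are names introduced here, not part of cjapy); it relies on `DataFrame.fillna` accepting a dictionary that maps column names to replacement values.

```python
import pandas as pd

# Default value per column, mirroring the cells above
SDR_DEFAULTS = {
    "hasData": False,
    "derivedFieldCompatible": False,
    "dataSetType": "system",
    "sourceFieldId": "cja",
}

def fill_sdr_defaults(df, defaults=SDR_DEFAULTS):
    """Return a new dataframe with missing values replaced column by column."""
    # Only use the defaults for columns that exist in this dataframe
    present = {col: val for col, val in defaults.items() if col in df.columns}
    return df.fillna(value=present)

# Toy rows standing in for the metrics table
demo = pd.DataFrame({
    "id": ["metrics/visits", "metrics/device.screenWidth"],
    "sourceFieldId": [None, "device.screenWidth"],
    "dataSetType": ["event", None],
})
cleaned = fill_sdr_defaults(demo)
print(cleaned)
```

Because `fillna` returns a new dataframe here, the original table is left untouched, which differs from the column-by-column assignment used in the cells above.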

## Metrics and dimensions SDR

The Solution Design Reference (SDR) based on the CJA implementation can be exported once we have reduced it to the columns we want to keep.\
You can select columns by passing their names as a list.\
If you want to have a copy of your dataframe, use the `copy()` method; it avoids modifying your original dataframe.

In our example, we will only select the attributes that matter for this notebook.

In [13]:
dimensions_sdr = dimensions[dimensions['hasData']][['id','name','dataSetType','sourceFieldId']].copy() ## filtering for dimensions that contain data
metrics_sdr = metrics[metrics['hasData']][['id','name','dataSetType','sourceFieldId']].copy()## filtering for metrics that contain data
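
To see what `copy()` protects against, here is a small sketch on made-up rows: the subset is modified afterwards, and the original dataframe keeps its values.

```python
import pandas as pd

# Toy rows; names are illustrative only
original = pd.DataFrame({
    "id": ["variables/a", "variables/b"],
    "name": ["Country code", "Second"],
    "hasData": [True, False],
})

# Filter rows, select columns, then take an independent copy
subset = original[original["hasData"]][["id", "name"]].copy()

# This change affects the copy only
subset["name"] = subset["name"].str.upper()

print(subset["name"].tolist())
print(original["name"].tolist())
```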

### Optional : Concat dataframe

You can combine 2 dataframes via the `concat()` function of the pandas module.

In [14]:
import pandas as pd ## using the pd alias

The `concat` function takes an iterable of dataframes and concatenates them.\
`ignore_index` will reset the index.

In [15]:
df_cja = pd.concat([dimensions_sdr,metrics_sdr],ignore_index=True)
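
Once the two tables are stacked, the only hint of where a row came from is the prefix of its `id`. One option, sketched below on toy data, is to tag each table before concatenating; the `component` column name is an assumption made for this example.

```python
import pandas as pd

# Toy stand-ins for dimensions_sdr and metrics_sdr
dims_demo = pd.DataFrame({"id": ["variables/a", "variables/b"], "name": ["A", "B"]})
mets_demo = pd.DataFrame({"id": ["metrics/x"], "name": ["X"]})

# assign() returns a new dataframe with the extra column, leaving the inputs untouched
combined = pd.concat(
    [dims_demo.assign(component="dimension"), mets_demo.assign(component="metric")],
    ignore_index=True,  # rebuild a clean 0..n-1 index
)
print(combined)
```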

In [16]:
df_cja.sample(5)

Unnamed: 0,id,name,dataSetType,sourceFieldId
32,variables/618aa8dff4919f19484472fb.web.webPage...,field1,lookup,618aa8dff4919f19484472fb.web.webPageDetails.UR...
22,variables/adobe_firstvreturn_sessiontype,Session Type,event,cja
50,variables/web.webInteraction.type,Type (2),event,web.webInteraction.type
34,variables/placeContext.localTime,Local time,event,placeContext.localTime
21,variables/adobe_identitynamespace,Identity Namespace,event,cja


As you can see, the `sourceFieldId` could be cleaned up further, as it should tell us which path was used for the data ingestion.\
In the interest of time, we will not cover that part, but note that lookup and profile datasets ingest the path with a prefix to avoid collisions. 
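
For readers who want to try that clean-up, here is a rough sketch. It assumes the prefix is always a 24-character hexadecimal dataset id followed by a dot, which is what the sample row above suggests but is not guaranteed; the input values are invented.

```python
import re

# Assumption: the prefix is a 24-character hexadecimal id followed by a dot
DATASET_PREFIX = re.compile(r"^[0-9a-f]{24}\.")

def strip_dataset_prefix(source_field_id):
    """Remove a leading dataset-id prefix, if there is one."""
    return DATASET_PREFIX.sub("", source_field_id)

# Hypothetical values for illustration
print(strip_dataset_prefix("0123456789abcdef01234567.some.lookup.field"))
print(strip_dataset_prefix("placeContext.localTime"))
```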

# Connecting with AEP Schema

Customer Journey Analytics loads its data from the datasets that are used in Adobe Experience Platform.\
Getting to know and understand the schema that is used to capture the data is important.\
To do that, you can always log in to Adobe Experience Platform via the UI, but you can also retrieve more useful information by using the `aepp` module.

The `aepp` module is divided into different services that can be used to analyse your Adobe Experience Platform implementation.\
In our scenario, we only need to load the `schema` submodule.

In [17]:
import aepp
from aepp import schema

Adobe Experience Platform is itself divided into different sandboxes.\
While loading the configuration, you can specify which sandbox you want to use.\
It is also recommended to store the configuration in a variable, which we will name `prod` because we are using the prod sandbox.\
We can save the configuration by passing `True` to the `connectInstance` parameter.

In [18]:
prod = aepp.importConfigFile('config_example.json',sandbox='prod',connectInstance=True)

You can then use the configuration to instantiate your connection to your schema for the `prod` sandbox.

In [19]:
schemaProd = schema.Schema(config=prod)

We will retrieve all schemas.

In [20]:
allSchemas = schemaProd.getSchemas()

Retrieving the schemas also populates a store that makes it easy to find a schema ID. It is available through these data attributes:
* schemaProd.data.schemas_altId
* schemaProd.data.schemas_id

Using the name of our schema, we can easily extract its id: 

In [21]:
schemaProd.data.schemas_id['datanalyst']

'https://ns.adobe.com/emeaconsulting/schemas/b6368c561a807b6eb05818eaabc15f1948d3e19b6a767285'

## Schema Manager

We can use a native functionality of aepp to build a dataframe representation of the schema.\
Using the `SchemaManager` class simplifies the extraction of the fields.

In [22]:
datanalyst = schema.SchemaManager(schemaProd.data.schemas_id['datanalyst'])

In [25]:
df_schema = datanalyst.to_dataframe(queryPath=True)

You can see that the paths have been flattened and are provided in 2 columns:
* path : the flattened path, with extra notation for lists `[]` and arrays of objects `[]{}`
* querypath : the same path, but without the notation that indicates its type. 

To map the path to the one displayed in CJA, we will use the query path. 
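
The relation between the two columns can be pictured with a tiny helper. This is only a sketch of the idea, assuming the type markers appear literally as `[]` and `{}`; `to_querypath` is a hypothetical name and not an aepp function.

```python
def to_querypath(path):
    """Drop the list and object markers from a flattened schema path."""
    return path.replace("[]", "").replace("{}", "")

print(to_querypath("channel.metricTypes[]"))
print(to_querypath("channel.mode"))
```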

In [26]:
df_schema

Unnamed: 0,path,type,fieldGroup,title,querypath
0,_experience,object,AEP Web SDK ExperienceEvent,,_experience
1,_experience.decisioning,object,AEP Web SDK ExperienceEvent,,_experience.decisioning
2,_experience.decisioning.propositionAction,object,AEP Web SDK ExperienceEvent,Proposition Action,_experience.decisioning.propositionAction
3,_experience.decisioning.propositionAction.id,string,AEP Web SDK ExperienceEvent,id,_experience.decisioning.propositionAction.id
4,_experience.decisioning.propositionAction.label,string,AEP Web SDK ExperienceEvent,label,_experience.decisioning.propositionAction.label
...,...,...,...,...,...
1418,channel.mediaAction,string,Channel Details,,channel.mediaAction
1419,channel.mediaType,string,Channel Details,,channel.mediaType
1420,channel.metricTypes[],string[],Channel Details,,channel.metricTypes
1421,channel.mode,string,Channel Details,,channel.mode


We will also remove the fields of type `object`, as they only serve as nodes and do not contain any data.

In [28]:
df_schema = df_schema[df_schema['type']!='object'][['querypath','title','type']].copy()

In [29]:
len(df_schema)

1119

## Merging SDR with Schema definition

Once the dataframe from the schema manager has been cleaned up, you can merge it with the solution design.\
We will create a new dataframe in which a path is repeated on several rows when it is used by both a dimension and a metric.

In [30]:
from copy import deepcopy

In [42]:
new_dataframe = []
for index, row in df_schema.iterrows():
    data = {}
    flag_found = False
    for index_cja, row_cja in df_cja.iterrows():
        if row['querypath'] in row_cja['id']:
            data['xdm_path'] = row['querypath']
            data['xdm_title'] = row['title']
            data['xdm_type'] = row['type']
            data['cja_id'] = row_cja['id']
            data['cja_name'] = row_cja['name']
            data['cja_type'] = 'dimension' if row_cja['id'].startswith('variables') else 'metric'
            new_dataframe.append(deepcopy(data))
            flag_found = True
            data = {}
    if not flag_found:  ## schema field that is not used by any CJA component
        data['xdm_path'] = row['querypath']
        data['xdm_title'] = row['title']
        data['xdm_type'] = row['type']
        data['cja_id'] = None
        data['cja_name'] = None
        data['cja_type'] = None
        new_dataframe.append(deepcopy(data))
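
The nested loop compares every schema path with every CJA id using a substring test, so a short path such as `placeContext.localTime` also matches `placeContext.localTimezoneOffset`. A stricter and faster alternative is a left merge on the exact path. The sketch below is a different technique from the loop above, not a drop-in replacement: it assumes the path is everything after the first `/` in the id, so ids that carry a dataset prefix would not match. All rows are toy data.

```python
import pandas as pd

# Toy stand-ins for df_schema and df_cja
schema_demo = pd.DataFrame({
    "querypath": ["placeContext.localTime", "placeContext.localTimezoneOffset", "channel.mode"],
    "title": ["Local time", "Local time zone offset", None],
    "type": ["string", "integer", "string"],
})
cja_demo = pd.DataFrame({
    "id": [
        "variables/placeContext.localTime",
        "metrics/placeContext.localTimezoneOffset",
        "variables/placeContext.localTimezoneOffset",
    ],
    "name": ["Local time", "Local time zone offset", "Local time zone offset"],
})

# Assumption: the schema path is whatever follows the first "/" of the id
cja_demo["cja_path"] = cja_demo["id"].str.split("/", n=1).str[1]
cja_demo["cja_type"] = cja_demo["id"].str.startswith("variables").map(
    {True: "dimension", False: "metric"}
)

# Left merge: every schema field is kept, matched or not,
# and a path used twice in CJA appears on two rows
merged = schema_demo.merge(cja_demo, how="left", left_on="querypath", right_on="cja_path")
print(merged[["querypath", "id", "cja_type"]])
```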

In [43]:
df_new = pd.DataFrame(new_dataframe)

In [46]:
df_new[df_new['cja_id'].astype(bool)].sample(3)

Unnamed: 0,xdm_path,xdm_title,xdm_type,cja_id,cja_name,cja_type
893,placeContext.localTimezoneOffset,Local time zone offset,integer,metrics/placeContext.localTimezoneOffset,Local time zone offset,metric
160,environment.ipV4,IPv4,string,variables/environment.ipV4,IPv4,dimension
351,_emeaconsulting.datanalyst.pageSubCategory,pageSubCategory,string,variables/_emeaconsulting.datanalyst.pageSubCa...,pageSubCategory,dimension
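
The resulting table can then be saved and shared. The sketch below writes a toy dataframe to an in-memory buffer with `to_csv`; in practice you would pass a file name of your choice instead of the buffer.

```python
import io
import pandas as pd

# Toy rows shaped like the merged solution design
sdr_demo = pd.DataFrame({
    "xdm_path": ["environment.ipV4", "channel.mode"],
    "cja_id": ["variables/environment.ipV4", None],
    "cja_type": ["dimension", None],
})

# index=False leaves out the row numbers; missing values become empty cells
buffer = io.StringIO()
sdr_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```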
