# Data Profile
This first examination of the data characterizes its quality in (near) raw form. Here, we establish the scope of the data preprocessing to be considered before advancing to the exploratory analysis. The remainder of this section is organized as follows:

   1. Dataset Overview    
      1.0. Dataset Summary Statistics
      1.1. Dataset Columns and Datatypes    
      1.2. Missing Data Analysis   
      1.3. Cardinality Analysis   

  
   2. Qualitative Variable Analysis   
      2.0. Descriptive Statistics     
      2.1. Frequency Distribution Analysis     
    
   3. Quantitative Variable Analysis       
      3.0. Descriptive Statistics     
      3.1. Distribution Analysis     
  
   4. Summary and Recommendations    

In [2]:
from myst_nb import glue
from cvr.core.workspace import Workspace, WorkspaceManager
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 1000)

First, we obtain the 'criteo' dataset at the 'preprocessed' stage from the 'Vesuvio' workspace created during data acquisition.

In [2]:
wsm = WorkspaceManager()
workspace = wsm.get_workspace('Vesuvio')
dataset = workspace.get_dataset(name='criteo', stage='preprocessed')

'full_month'

## Dataset Overview
### Dataset Summary Statistics

In [8]:
summary = dataset.summary()



                                Dataset Summary                                 
                                staging_vesuvio                                 
                                _______________                                 
                                   Rows : 15,995,634
                                Columns : 23
                          Missing Cells : 167,746,417
                        Missing Cells % : 45.6
                         Duplicate Rows : 20,958
                       Duplicate Rows % : 0.13
                              Size (Mb) : 4,972.71


In [3]:
_ = glue("profile_rows", summary["Rows"], display=False)
_ = glue("profile_columns", summary["Columns"], display=False)
_ = glue("profile_missing", summary["Missing Cells %"], display=False)
_ = glue("profile_size", summary["Size (Mb)"], display=False)
_ = glue("profile_dups", summary["Duplicate Rows"], display=False)
_ = glue("profile_dup_pct", summary["Duplicate Rows %"], display=False)


This dataset contains {glue:}`profile_rows` observations, each with {glue:}`profile_columns` columns, for a total size of {glue:}`profile_size` Mb. About {glue:}`profile_missing`% of the cells are missing, which reflects the sparsity of user behavior logs. Further, there are {glue:}`profile_dups` duplicate rows. One might consider that a large number, although it amounts to just {glue:}`profile_dup_pct`% of the sample.
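
For readers without access to the `cvr` package, the summary figures above can be approximated in plain pandas. The following is a minimal sketch, not the `dataset.summary()` implementation: the function name `summarize_frame` and the toy DataFrame are hypothetical, and the use of `memory_usage(deep=True)` for the size is an assumption about how size is measured.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Hypothetical helper: dataset-level quality figures for a DataFrame."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())   # missing cells across all columns
    dups = int(df.duplicated().sum())      # rows identical to an earlier row
    size_mb = df.memory_usage(deep=True).sum() / 1024**2  # assumed size measure
    return {
        "Rows": n_rows,
        "Columns": n_cols,
        "Missing Cells": missing,
        "Missing Cells %": round(100 * missing / n_cells, 2) if n_cells else 0.0,
        "Duplicate Rows": dups,
        "Duplicate Rows %": round(100 * dups / n_rows, 2) if n_rows else 0.0,
        "Size (Mb)": round(size_mb, 2),
    }


# Toy data only; these values are illustrative and not drawn from the criteo dataset.
toy = pd.DataFrame({
    "sale": [0, 1, 0, 1],
    "sales_amount": [np.nan, 12.5, np.nan, 12.5],
    "device_type": ["a", "b", None, "b"],
})
toy_summary = summarize_frame(toy)
print(toy_summary)
```

Note that the missing percentage is computed over cells (rows times columns), which matches the figures reported above: 167,746,417 missing cells out of 15,995,634 × 23 cells is about 45.6%.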

### Dataset Columns and Datatypes

In [10]:
dataset.info()



                                Dataset vesuvio                                 
                                _______________                                 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15995634 entries, 0 to 15995633
Data columns (total 23 columns):
 #   Column                 Non-Null Count     Dtype   
---  ------                 --------------     -----   
 0   sale                   15995634 non-null  category
 1   sales_amount           1732721 non-null   float64 
 2   conversion_time_delay  1727341 non-null   float64 
 3   click_ts               15995634 non-null  float64 
 4   n_clicks_1week         9251207 non-null   float64 
 5   product_price          15995634 non-null  float64 
 6   product_age_group      4235576 non-null   category
 7   device_type            15992602 non-null  category
 8   audience_id            4493616 non-null   category
 9   product_gender         4341439 non-null   category
 10  product_brand          8754074 non-null   ca

Converting the pandas object columns to the category data type may yield memory and computational savings, which can be material for a dataset of this size. Still, the number that stands out so far is the overall missing rate of roughly 45%.
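
To illustrate the potential savings from the category data type, here is a small sketch on toy data. The helper `to_category` and its `max_unique_ratio` threshold are hypothetical and are not part of the `cvr` package; actual savings depend on the cardinality of each column.

```python
import pandas as pd


def to_category(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: convert low-cardinality object columns to category."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        n_non_null = out[col].count()
        # Only convert when values repeat often enough for category codes to pay off.
        if n_non_null and out[col].nunique() / n_non_null <= max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


# Toy data only; the object dtype is set explicitly so the example behaves the
# same across pandas versions.
toy = pd.DataFrame({
    "device_type": pd.Series(["mobile", "desktop", "mobile", None] * 2500, dtype="object"),
    "product_price": [1.0, 2.0, 3.0, 4.0] * 2500,
})

before = toy.memory_usage(deep=True).sum()
converted = to_category(toy)
after = converted.memory_usage(deep=True).sum()
print(f"{before:,} bytes before, {after:,} bytes after")
```

The threshold guards against converting near-unique columns, where the category lookup table would be almost as large as the original values.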

### Missing Data Analysis

In [12]:
_ = dataset.missing



                             Missing Data Analysis                              
                                staging_vesuvio                                 
                             _____________________                              
                              n   Missing  Missingness
sale                   15995634         0         0.00
sales_amount           15995634  14262913        89.17
conversion_time_delay  15995634  14268293        89.20
click_ts               15995634         0         0.00
n_clicks_1week         15995634   6744427        42.16
product_price          15995634         0         0.00
product_age_group      15995634  11760058        73.52
device_type            15995634      3032         0.02
audience_id            15995634  11502018        71.91
product_gender         15995634  11654195        72.86
product_brand          15995634   7241560        45.27
product_category_1     15995634   6142756        38.40
product_category_2     15995634   615124

Here, we get a better sense of the nature of the data sparsity challenge. Nine columns have missing rates over 50%, and five of them have missing rates of 90% or more. Notably, this sparsity, like the diversity of the data, reflects the nature of buying behavior and is a common challenge in customer segmentation and analytics.
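
A per-column missingness table like the one above can be sketched in plain pandas. This is an approximation under stated assumptions, not the `dataset.missing` implementation: `missing_report` is a hypothetical name, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    n = len(df)
    missing = df.isna().sum()
    return pd.DataFrame({
        "n": n,                                        # total rows, repeated per column
        "Missing": missing,                            # null count per column
        "Missingness": (100 * missing / n).round(2),   # null percentage per column
    })


# Toy data only, reusing a few column names from the dataset for readability.
toy = pd.DataFrame({
    "sale": [0, 1, 0, 0],
    "sales_amount": [np.nan, 9.99, np.nan, np.nan],
    "device_type": ["a", "b", None, "a"],
})
report = missing_report(toy)
print(report)

# Columns whose missing rate exceeds 50%.
sparse_columns = report.index[report["Missingness"] > 50].tolist()
print(sparse_columns)
```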

Still, the sparsity (and masking) of the data leaves us with few meaningful imputation strategies. In fact, one might replace the term 'missing' with 'absent'. 'Missing' implies an error or omission in the data, which may not comport with the underlying patterns. For instance, 91% of the observations have no value for product_category_5. It's possible that these values are missing at random; however, it is also possible that most products simply don't have five product categories.
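
One way to probe the 'missing' versus 'absent' question is to check whether the product category columns behave like a hierarchy, in which a deeper level is filled only when every shallower level is filled. The sketch below assumes, without confirmation from the data, that the numbered category columns are ordered from shallowest to deepest. The helper name and the toy values are hypothetical.

```python
import pandas as pd


def count_nesting_violations(df: pd.DataFrame, levels: list) -> int:
    """Hypothetical helper: rows where a deeper level is filled although a
    shallower level is empty. `levels` must be ordered shallowest first."""
    present = df[levels].notna()
    # Running minimum across levels: turns False at the first empty level
    # and stays False for every deeper level in that row.
    allowed = present.astype(int).cummin(axis=1).astype(bool)
    violations = present & ~allowed
    return int(violations.any(axis=1).sum())


# Toy data only; the values are illustrative.
toy = pd.DataFrame({
    "product_category_1": ["a", "a", "b", None],
    "product_category_2": ["x", None, "y", None],
    "product_category_3": [None, None, "z", "q"],
})
levels = ["product_category_1", "product_category_2", "product_category_3"]
n_violations = count_nesting_violations(toy, levels)
print(n_violations)
```

If violations are rare in the real data, the empty deeper levels are consistent with products that simply have fewer categories, although this check alone cannot rule out values that are missing at random.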

Let's take a closer look at the frequency distributions of the categories.

## Frequency Distribution
Let's first get an overall sense of the cardinality of the data.
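
For readers without the `cvr` package, a cardinality table can be approximated in plain pandas. In this sketch, `cardinality_report` is a hypothetical name and the toy data are illustrative. Following the output reported in this section, 'Total' is taken to be the non-null count, and 'Pct Unique' is the number of unique values as a percentage of that count.

```python
import numpy as np
import pandas as pd


def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: unique and non-null counts per column."""
    unique = df.nunique()   # distinct non-null values per column
    total = df.count()      # non-null values per column
    return pd.DataFrame({
        "Column": df.columns,
        "Unique": unique.to_numpy(),
        "Total": total.to_numpy(),
        "Pct Unique": (100 * unique / total).round(2).to_numpy(),
    })


# Toy data only, reusing a few column names from the dataset for readability.
toy = pd.DataFrame({
    "sale": [0, 1, 0, 0],
    "sales_amount": [np.nan, 9.99, np.nan, 9.99],
    "device_type": ["a", "b", "c", None],
})
card = cardinality_report(toy)
print(card)
```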

In [14]:
_ = dataset.cardinality



                                  Cardinality                                   
                                staging_vesuvio                                 
                                _______________                                 
                   Column    Unique     Total  Pct Unique
0                    sale         2  15995634        0.00
1            sales_amount    513406   1732721       29.63
2   conversion_time_delay    574347   1727341       33.25
3                click_ts   6456934  15995634       40.37
4          n_clicks_1week      4613   9251207        0.05
5           product_price     43143  15995634        0.27
6       product_age_group        11   4235576        0.00
7             device_type         8  15992602        0.00
8             audience_id     18228   4493616        0.41
9          product_gender        17   4341439        0.00
10          product_brand     55983   8754074        0.64
11     product_category_1        21   9852878        0.00
1