# Data Profile
This first examination of the data seeks to characterize data quality in its (near) raw form. Here, we will discover the scope and breadth of data preprocessing that will be considered before advancing to the exploratory analysis effort. The remainder of this section is organized as follows:

   1. Dataset Overview    
      1.0. Dataset Summary Statistics
      1.1. Dataset Columns and Datatypes    
      1.2. Missing Values Analysis   
      1.3. Cardinality Analysis   

  
   2. Qualitative Variable Analysis   
      2.0. Descriptive Statistics     
      2.1. Frequency Distribution Analysis     
    
   3. Quantitative Variable Analysis       
      3.0. Descriptive Statistics     
      3.1. Distribution Analysis     
  
   4. Summary and Recommendations    

In [1]:
# IMPORTS
from myst_nb import glue
from cvr.core.workspace import Workspace, WorkspaceManager
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 1000)

In [2]:
wsm = WorkspaceManager()
wsm.set_current_workspace('Vesuvio')

As a first step, let's get the current workspace which was set during the acquisition section.

In [3]:
wsm = WorkspaceManager()
workspace = wsm.get_current_workspace()
workspace.name

'Vesuvio'

The dataset created during the acquisition step was called 'criteo'. We can obtain its preprocessed version by passing name='criteo' and stage='preprocessed'.

In [4]:
dataset  = workspace.get_dataset(name='criteo', stage='preprocessed')

ModuleNotFoundError: No module named 'cvr.data.profile'

## Dataset Overview
### Dataset Summary Statistics

In [None]:
summary = dataset.profile.summary

In [None]:
# GLUE
_ = glue("profile_rows",summary["Rows"], display=False)
_ = glue("profile_columns", summary["Columns"], display=False)
_ = glue("profile_missing", summary["Missing Cells %"], display=False)
_ = glue("profile_size", summary["Size (Mb)"], display=False)
_ = glue("profile_dups", summary["Duplicate Rows"], display=False)
_ = glue("profile_dup_pct", summary["Duplicate Rows %"], display=False)

This dataset contains {glue:}`profile_rows` observations, each with {glue:}`profile_columns` columns, for a total size of {glue:}`profile_size` Mb. Approximately {glue:}`profile_missing`% of the data are missing, reflecting the sparsity challenge of user behavior logs. Further, about {glue:}`profile_dups` rows are duplicates, just {glue:}`profile_dup_pct`% of the sample; nonetheless, duplicates must be removed prior to modeling.
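The internals of `dataset.profile.summary` are not shown in this section. As a rough sketch only, statistics with the same labels could be derived from a plain pandas DataFrame as follows. The `summarize` helper and the toy data are hypothetical and are not part of the `cvr` package.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return dataset-level summary statistics (hypothetical helper)."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    n_missing = int(df.isna().sum().sum())  # missing cells across all columns
    n_dups = int(df.duplicated().sum())     # rows repeating an earlier row
    return {
        "Rows": n_rows,
        "Columns": n_cols,
        "Missing Cells %": 100 * n_missing / n_cells if n_cells else 0.0,
        "Size (Mb)": df.memory_usage(deep=True).sum() / 1024**2,
        "Duplicate Rows": n_dups,
        "Duplicate Rows %": 100 * n_dups / n_rows if n_rows else 0.0,
    }


# Toy data for illustration only; not the criteo dataset.
toy = pd.DataFrame({
    "sale": [0, 1, 0, 0],
    "product_brand": ["a", None, "a", "a"],
    "device_type": ["x", "y", "x", None],
})
toy_summary = summarize(toy)
```

Here `duplicated()` counts only the second and later occurrences of a row, so the duplicate count equals the number of rows that would be dropped by `drop_duplicates()`.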

### Dataset Columns and Datatypes

In [None]:
dataset.info()

Converting the pandas object variables to the category data type may yield memory and computational savings that are material for a dataset of this size. Still, the number that stands out so far is the 45% missing rate.
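A minimal sketch of the object-to-category conversion mentioned above, assuming the data are available as a plain pandas DataFrame. The `objects_to_category` helper and the toy columns are hypothetical. Actual savings depend on how often values repeat within each column, so very high-cardinality columns may see little benefit.

```python
import pandas as pd


def objects_to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every object column cast to category (hypothetical helper)."""
    obj_cols = df.select_dtypes(include="object").columns
    return df.astype({col: "category" for col in obj_cols})


# Toy data for illustration only; dtype=object is set explicitly so the
# example behaves the same regardless of pandas' default string dtype.
toy = pd.DataFrame({
    "product_brand": pd.Series(["brand_a", "brand_b", "brand_c"] * 1000, dtype=object),
    "sale": [0, 1, 0] * 1000,
})
converted = objects_to_category(toy)

# deep=True includes the memory held by the Python string objects.
before_mb = toy.memory_usage(deep=True).sum() / 1024**2
after_mb = converted.memory_usage(deep=True).sum() / 1024**2
```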
### Missing Values Analysis

In [None]:
_ = dataset.missing

Here, we get a better sense of the nature of the data sparsity challenge. Nine columns have missing rates over 50%, and five of those have missing rates of 90% or more. Notably, the diversity and sparsity of the data reflect the nature of buying behavior and are common challenges in customer segmentation and analytics.
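Per-column missing rates of this kind can be computed directly in pandas. The following is a sketch under the assumption that missing values are encoded as NaN/None; the `missing_rates` helper and the toy data are hypothetical, and `dataset.missing` may work differently.

```python
import numpy as np
import pandas as pd


def missing_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Percent of missing values per column, highest first (hypothetical helper)."""
    rates = df.isna().mean().mul(100).sort_values(ascending=False)
    return rates.rename("Missing %").to_frame()


# Toy data for illustration only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})
rates = missing_rates(toy)

# Count the columns whose missing rate exceeds 50%.
n_over_half = int((rates["Missing %"] > 50).sum())
```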

Still, the sparsity (and masking) of the data leaves us with few meaningful imputation strategies. One might replace the term 'missing' with 'absent': 'missing' implies an error or omission in the data, which may not comport with the underlying pattern.

### Cardinality Analysis
Let's get an overall sense of the cardinality of the data. One would expect high cardinality among users and very low cardinality on items.

In [None]:
_ = dataset.cardinality

As expected, cardinality is high among users and very low among device types, audiences, products, and their categories. Yet, converting some of these product categories into high-dimensional binary vectors for modeling may require some pruning. One strategy is to count the observations in each category, rank the categories by that count, then keep the top-ranked categories that together account for some percentage threshold of the data. All lower-ranked categories would be changed to 'other'.
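The pruning strategy described above can be sketched as follows. This is one possible reading of it, not the project's implementation: the `lump_rare_categories` helper, the `coverage` parameter, and the toy data are all hypothetical. Missing values are left untouched rather than folded into 'other'.

```python
import pandas as pd


def lump_rare_categories(s: pd.Series, coverage: float = 0.9,
                         other: str = "other") -> pd.Series:
    """Keep the top-ranked categories covering `coverage` of the non-missing
    observations; relabel the rest as `other` (hypothetical helper)."""
    counts = s.value_counts()                   # descending by count, NaN excluded
    cum_share = counts.cumsum() / counts.sum()  # cumulative share by rank
    # Keep every category up to and including the first one at which the
    # cumulative share reaches the coverage threshold.
    n_keep = int((cum_share < coverage).sum()) + 1
    keep = counts.index[:n_keep]
    return s.where(s.isin(keep) | s.isna(), other)


# Toy data for illustration only: shares are a=60%, b=30%, c=10%.
toy = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])
lumped = lump_rare_categories(toy, coverage=0.8)
```

With a threshold of 0.8, category 'a' alone covers 60%, so 'b' is also kept and only 'c' is relabeled.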

With this overview, let's dive deeper with a qualitative variable analysis.

## Qualitative Variable Analysis
Our qualitative or categorical variables include:
 - sale     
 - product_age_group    
 - device_type  
 - audience_id    
 - product_gender   
 - product_brand    
 - product_category_1 through 7
 - product_country   
 - product_id
 - partner_id
 - user_id

The product_title variable is treated as an object or string. 

Descriptive statistics for the frequencies provide a sense of the categorical frequency distributions in the dataset. 

In [None]:
freq_dist = dataset.frequency_statistics
freq_dist.style.format(precision=2, thousands=",")

In [None]:
# GLUE
_ = glue("freq_stats",freq_dist)

There are {glue:}`freq_stats`['count'].values[0] categorical columns with an average of {glue:}`freq_stats`['mean'].values[0] categories, emphasizing the need to preprocess the lower-ranked categories.

This section will reveal the most important categories for each categorical variable via two subplots: the Top (up to) 10 Categories and the Cumulative Frequency Distribution subplots.

The Top 10 Categories are plotted by rank. The rank 1 category is the one with more observations than any other category, the rank 2 category is the second most common, and so on. Plotting by rank rather than by value is a convenience: it avoids plotting long strings of values that cannot be interpreted due to hashing.

The subplot on the right, the Cumulative Frequency Distribution plot, shows the percent of the data represented by categories 1 through a given rank. This will help us determine the most important categories.
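The numbers behind both subplots can be derived from a single frequency table. The sketch below assumes a plain pandas Series; the `rank_frequency_table` helper and the toy data are hypothetical, and `dataset.frequency_analysis` may compute its plots differently.

```python
import pandas as pd


def rank_frequency_table(s: pd.Series) -> pd.DataFrame:
    """Counts and cumulative percent of observations by category rank
    (hypothetical helper)."""
    counts = s.value_counts()  # descending by count, so position gives rank
    table = pd.DataFrame({
        "rank": range(1, len(counts) + 1),
        "count": counts.to_numpy(),
    })
    table["cumulative_pct"] = 100 * table["count"].cumsum() / table["count"].sum()
    return table


# Toy data for illustration only: a binary variable with a 17/3 split.
toy = pd.Series([0] * 17 + [1] * 3)
table = rank_frequency_table(toy)
top10 = table.head(10)  # rows for a "Top (up to) 10 Categories" plot
```

The `count` column of `top10` corresponds to the left subplot and `cumulative_pct` to the right one.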

### Sale


In [None]:
dataset.categorical_analysis(column='sale')
dataset.frequency_analysis(column='sale')


The sale variable has two values: 0 (no sale occurred within the designated time) and 1 (a sale occurred). In these visualizations, 0 is the rank 1 value, representing about 85% of the data, which implies a post-click conversion rate of roughly 15%.
### Product Age Group

In [None]:
dataset.frequency_analysis(column='product_age_group')

Here we have the top 10 ranked categories for product_age_group. From the Cumulative Frequency Distribution plot on the right, we get nearly 90% of the data with just 2 categories and over 95% of the data with the top 3 categories.
### Device Type

In [None]:
dataset.frequency_analysis(column='device_type')

Nearly all of the data are represented by the top 3 of the 8 device types.
### Audience Id

In [None]:
dataset.frequency_analysis(column='audience_id')

### Product Gender

In [None]:
dataset.frequency_analysis(column='product_gender')

### Product Brand

In [None]:
dataset.frequency_analysis(column='product_brand')

### Product Category 1

In [None]:
dataset.frequency_analysis(column='product_category_1')

### Product Category 2

In [None]:
dataset.frequency_analysis(column='product_category_2')

### Product Category 3

In [None]:
dataset.frequency_analysis(column='product_category_3')

### Product Category 4

In [None]:
dataset.frequency_analysis(column='product_category_4')

### Product Category 5

In [None]:
dataset.frequency_analysis(column='product_category_5')

### Product Category 6

In [None]:
dataset.frequency_analysis(column='product_category_6')

### Product Category 7

In [None]:
dataset.frequency_analysis(column='product_category_7')

### Product Country

In [None]:
dataset.frequency_analysis(column='product_country')

### Product Id

In [None]:
dataset.frequency_analysis(column='product_id')

### Partner Id

In [None]:
dataset.frequency_analysis(column='partner_id')