# Data Profile
This first examination characterizes the quality of the data in its (near) raw form. Here, we determine the scope and breadth of the data preprocessing to consider before advancing to the exploratory analysis. The remainder of this section is organized as follows:

   1. Descriptive statistics, missing values and cardinality.    
   2. Distribution analysis of continuous variables.    
   3. Frequency analysis of categorical variables.         
   4. Summary and recommendations.

Recall that in the last section we loaded the Dataset object, vesuvio, into the staging area of the workspace of the same name. Let's instantiate the vesuvio workspace (a singleton) and obtain the Dataset.

In [1]:
from myst_nb import glue
from cvr.core.workspace import Workspace
import pandas as pd
#pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 1000)
random_state = 55

In [2]:
workspace = Workspace('vesuvio')
dataset = workspace.get_dataset(stage='staging', name='vesuvio')
summary = dataset.summary



                                Dataset Summary                                 
                                 staging_greco                                  
                                _______________                                 
                                   Rows : 10,000
                                Columns : 23
                          Missing Cells : 113,984
                        Missing Cells % : 49.56
                         Duplicate Rows : 0
                       Duplicate Rows % : 0.0
                              Size (Mb) : 3.94


In [3]:
_ = glue("rows",summary["Rows"])
_ = glue("columns", summary["Columns"])
_ = glue("missing", summary["Missing Cells %"])
_ = glue("size", summary["Size (Mb)"])
_ = glue("dups", summary["Duplicate Rows"])

10000

23

49.56

3.94

0
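To make the summary figures concrete, here is a minimal sketch of how such statistics could be derived with plain pandas. The helper `profile_summary` and the toy frame are hypothetical and only illustrate the arithmetic; this is not necessarily how the `cvr` Dataset object computes its summary.

```python
import numpy as np
import pandas as pd


def profile_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summary statistics in the spirit of the output above."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    n_missing = int(df.isna().sum().sum())  # missing cells across all columns
    n_dups = int(df.duplicated().sum())     # rows that repeat an earlier row
    return {
        "Rows": n_rows,
        "Columns": n_cols,
        "Missing Cells": n_missing,
        "Missing Cells %": round(100 * n_missing / n_cells, 2) if n_cells else 0.0,
        "Duplicate Rows": n_dups,
        "Duplicate Rows %": round(100 * n_dups / n_rows, 2) if n_rows else 0.0,
        "Size (Mb)": round(df.memory_usage(deep=True).sum() / 1024**2, 2),
    }


# Toy data (illustrative values only, not drawn from the dataset)
toy = pd.DataFrame({
    "sale": [0, 1, 0, 0],
    "sales_amount": [np.nan, 38.0, np.nan, np.nan],
    "product_price": [0.0, 44.99, 9.99, 19.99],
})
toy_summary = profile_summary(toy)
print(toy_summary)
```

Applied to the figures reported above, the same arithmetic gives 113,984 missing cells out of 10,000 × 23 = 230,000 cells, or 49.56%.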

This dataset contains {glue:}`rows` observations, each with {glue:}`columns` columns, for a size of {glue:}`size` Mb. Some {glue:}`missing`% of the data are missing, which reflects the sparsity of user behavior logs. Further, the dataset contains {glue:}`dups` duplicate rows. Let's take a look at a few sample observations from the dataset.

In [4]:
dataset.sample(1, random_state=random_state)



                                     greco                                      
                          1 Randomly Selected Samples                           
                          ___________________________                           


                                  Index = 2082                                  
                                  ____________                                  
                                   sale : 0
                           sales_amount : nan
                  conversion_time_delay : nan
                               click_ts : 1,598,934,309.0
                         n_clicks_1week : nan
                          product_price : 0.0
                      product_age_group : nan
                            device_type : 7E56C27BFF0305E788DA55A029EC4988
                            audience_id : 8186B349AF2B77A015F318DC49AC459B
                         product_gender : nan
                          product_brand : nan
                 

The categorical variables have been hashed. This observation reflects no conversion, as indicated by the 0 in the sale column and, consistent with that, the sales amount and conversion time delay are missing. The number of clicks during the prior week is also missing. Without a sense of the distributions, it would be challenging to draw inferences about these variables. Let's alter the random state and select another observation from the sample.

In [5]:
dataset.sample(1, random_state=random_state+1)



                                     greco                                      
                          1 Randomly Selected Samples                           
                          ___________________________                           


                                  Index = 4941                                  
                                  ____________                                  
                             sale : 0
                     sales_amount : nan
            conversion_time_delay : nan
                         click_ts : 1,598,889,781.0
                   n_clicks_1week : 50.0
                    product_price : 0.0
                product_age_group : 921B36149E5B081FD24450BFE2CE4430
                      device_type : 7E56C27BFF0305E788DA55A029EC4988
                      audience_id : nan
                   product_gender : A5D15FC386510762EC0DDFF54ABE6F94
                    product_brand : AC9D783A521B93EE58930C47854D995C
               product_c

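This sample also reflects no conversion, although more of its fields are populated: it includes the number of clicks during the prior week and the hashed product age group, gender, and brand. Let's widen our aperture a bit.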

In [6]:
dataset.info()



                                 Dataset greco                                  
                                 _____________                                  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   sale                   10000 non-null  category
 1   sales_amount           1612 non-null   float64 
 2   conversion_time_delay  1608 non-null   float64 
 3   click_ts               10000 non-null  float64 
 4   n_clicks_1week         4700 non-null   float64 
 5   product_price          10000 non-null  float64 
 6   product_age_group      1815 non-null   category
 7   device_type            9991 non-null   category
 8   audience_id            2809 non-null   category
 9   product_gender         1794 non-null   category
 10  product_brand          2748 non-null   category
 11  product_category_1     4786 non-nu

The dataset contains 10,000 observations in 23 columns. The categorical variables are stored using the pandas category data type rather than object, which may bring computational efficiencies that become material as the dataset grows. Still, the number that stands out so far is the nearly 50% rate of missing values. Let's understand that a bit better.
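To illustrate why the category data type matters for hashed string columns like these, the following sketch compares the memory footprint of the same values stored as object and as category. The series is a toy built from two hash strings that appear in the sample output above; the sizes it prints are not measurements of the actual dataset.

```python
import pandas as pd

# Toy column: two hash-like strings repeated many times (an assumption for illustration)
values = [
    "7E56C27BFF0305E788DA55A029EC4988",
    "8186B349AF2B77A015F318DC49AC459B",
] * 5_000

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the bytes held by the Python string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The category version stores each distinct string once plus a small integer code per row, so the saving grows with the number of rows and shrinks as cardinality rises.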

In [7]:
_ = dataset.missing



                             Missing Data Analysis                              
                                 staging_greco                                  
                             _____________________                              
                           n  Missing  Missingness
sale                   10000        0         0.00
sales_amount           10000     8388        83.88
conversion_time_delay  10000     8392        83.92
click_ts               10000        0         0.00
n_clicks_1week         10000     5300        53.00
product_price          10000        0         0.00
product_age_group      10000     8185        81.85
device_type            10000        9         0.09
audience_id            10000     7191        71.91
product_gender         10000     8206        82.06
product_brand          10000     7252        72.52
product_category_1     10000     5214        52.14
product_category_2     10000     5219        52.19
product_category_3     10000     5710    

The data are highly sparse. Of the columns shown above, at least ten are missing more than half of their values.
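A per-column missingness table like the one above can be approximated with plain pandas. The sketch below uses a hypothetical helper, `missing_report`, and a toy frame; it mirrors the column layout of the output (n, Missing, Missingness) but is not the `cvr` implementation.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    n = len(df)
    missing = df.isna().sum()
    return pd.DataFrame({
        "n": n,
        "Missing": missing,
        "Missingness": (100 * missing / n).round(2),
    })


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "sale": [0, 1, 0, 0],
    "sales_amount": [np.nan, 38.0, np.nan, np.nan],
    "n_clicks_1week": [np.nan, 2.0, 50.0, np.nan],
})

report = missing_report(toy)
print(report)

# Columns missing more than half of their values
mostly_missing = report.index[report["Missingness"] > 50].tolist()
print(mostly_missing)
```

Filtering on a threshold, as in the last step, is one simple way to list the columns that may need imputation or removal during preprocessing.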