# 🌿 🗂️ Complete Guide to Data Profiling A to Z

In this notebook, we will explore the process of data profiling using the `ydata_profiling` library. Data profiling helps in understanding the structure, relationships, and characteristics of data, which is crucial for data cleaning, transformation, and analysis.

----

### Step 1: Importing Libraries and Loading Data

First, we will import the necessary libraries and load the dataset.

### Step 2: Generating a Data Profile Report

We will use the `ydata_profiling` library to generate a comprehensive data profile report, providing insights into the dataset's structure and characteristics.

### Step 3: Analyzing the Data Profile Report

In this step, we will analyze the key sections of the data profile report, including overview, variables, interactions, and missing values.

### Step 4: Detailed Analysis by Purchase_Made Label

We will repeat the data profiling analysis separately for the subsets of data where `Purchase_Made` is `Yes` and `No`, to understand the differences in data characteristics based on this label.

### Step 5: Creating a Checklist for Further Analysis

Based on the insights from the data profiling report, we will create a checklist for further data cleaning, transformation, and analysis tasks.

------

## Why Data Profiling is Necessary

Data profiling is an essential step in any data analysis process. It helps us understand the quality, structure, and relationships within our data. This understanding is crucial for making informed decisions about data cleaning, transformation, and analysis. By identifying missing values, outliers, and inconsistencies early on, we can address these issues before they affect our analysis or models.

## Iterative Nature of Data Profiling

Data profiling is not a one-time task but an iterative process. We perform data profiling several times throughout the data analysis process to continuously monitor and improve the quality of our data. Each iteration helps us refine our understanding and ensures that any data cleaning or transformation steps have been effective.

## Saving Time with `ydata_profiling`

While we can perform all these steps manually, it is extremely time-consuming. The `ydata_profiling` library automates many aspects of data profiling, saving us a significant amount of time and allowing us to focus on deeper analysis and decision-making.

## Reference

For more detailed documentation on `ydata_profiling`, visit [ydata_profiling Documentation](https://docs.profiling.ydata.ai/latest/).


# Step 1: Importing Libraries and Loading Data

First, we will import the necessary libraries and load the dataset. This step reads the dataset into a pandas DataFrame and displays basic information about it: the number of entries, column names, data types, and a random sample of rows.


```python
# Import necessary libraries
import pandas as pd
from ydata_profiling import ProfileReport

# Load the dataset
data = pd.read_csv('/kaggle/input/sales-and-satisfaction/Sales_with_NaNs_v1.3.csv')

# Display basic information about the dataset
data.info()
data.sample(20)
```


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Group                         8599 non-null   object 
 1   Customer_Segment              8034 non-null   object 
 2   Sales_Before                  8478 non-null   float64
 3   Sales_After                   9233 non-null   float64
 4   Customer_Satisfaction_Before  8330 non-null   float64
 5   Customer_Satisfaction_After   8360 non-null   float64
 6   Purchase_Made                 9195 non-null   object 
dtypes: float64(4), object(3)
memory usage: 547.0+ KB


| | Group | Customer_Segment | Sales_Before | Sales_After | Customer_Satisfaction_Before | Customer_Satisfaction_After | Purchase_Made |
|---|---|---|---|---|---|---|---|
| 1480 | Control | Medium Value | 188.487901 | 222.140681 | 78.658573 | 73.921449 | No |
| 6288 | Control | NaN | 142.292329 | 172.881354 | 56.093311 | NaN | Yes |
| 7740 | Control | Medium Value | NaN | 399.823685 | 34.926594 | 32.884885 | No |
| 8327 | Treatment | High Value | NaN | 289.638525 | NaN | 100.0 | No |
| 7173 | Control | Low Value | 196.183971 | 234.942113 | 55.726512 | NaN | NaN |
| 6747 | Control | Low Value | 224.042244 | 262.351094 | 49.862109 | 51.704566 | Yes |
| 7190 | Control | High Value | 220.780557 | 262.266286 | NaN | 100.0 | No |
| 1805 | Control | Low Value | 245.259677 | 294.94994 | NaN | 56.419133 | No |
| 2140 | Treatment | Low Value | 169.640655 | 262.831995 | 53.093379 | 66.917482 | Yes |
| 9376 | Control | Medium Value | NaN | 260.099625 | 51.034084 | 48.496254 | No |


## Dataset Overview

The dataset `Sales_with_NaNs_v1.3.csv` contains information on sales and customer satisfaction metrics. The columns in the dataset are as follows:

- `Group`: Categorical variable indicating control or treatment group.
- `Customer_Segment`: Categorical variable indicating customer segment.
- `Sales_Before`: Numerical variable representing sales before an event.
- `Sales_After`: Numerical variable representing sales after an event.
- `Customer_Satisfaction_Before`: Numerical variable representing customer satisfaction before an event.
- `Customer_Satisfaction_After`: Numerical variable representing customer satisfaction after an event.
- `Purchase_Made`: Categorical variable indicating if a purchase was made (Yes/No).


- **Number of entries:** 10,000
- **Number of columns:** 7
- **Column types:** 3 categorical, 4 numerical
- **Missing values:** 
  - `Group`: 1,401 missing values
  - `Customer_Segment`: 1,966 missing values
  - `Sales_Before`: 1,522 missing values
  - `Sales_After`: 767 missing values
  - `Customer_Satisfaction_Before`: 1,670 missing values
  - `Customer_Satisfaction_After`: 1,640 missing values
  - `Purchase_Made`: 805 missing values
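
The missing-value counts above follow directly from the non-null counts printed by `data.info()`. As a minimal sketch, the same summary can be computed with pandas; the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "Group": ["Control", None, "Treatment", "Control"],
    "Sales_Before": [188.5, 142.3, np.nan, np.nan],
    "Purchase_Made": ["No", "Yes", "No", "Yes"],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, calling `missing_summary(data)` should reproduce the list above, and `data.isna().sum().sum()` gives the total number of missing cells.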
  
  


# Step 2: Generating a Data Profile Report

In this step, we will use the `ydata_profiling` library to generate a comprehensive data profile report. This report will provide insights into the dataset's structure and characteristics, including distributions, missing values, and correlations.

First, we need to generate the profile report and then display it within the notebook.


```python
# Import the profiling library (already imported in Step 1; repeated so this cell runs on its own)
from ydata_profiling import ProfileReport

# Specify categorical and numerical columns
categorical_columns = ['Group', 'Customer_Segment', 'Purchase_Made']
numerical_columns = ['Sales_Before', 'Sales_After', 'Customer_Satisfaction_Before', 'Customer_Satisfaction_After']

# Ensure correct data types so that the profiling tool identifies them correctly.
# Alternatively, use the type_schema parameter to specify the data types of the columns.
data[categorical_columns] = data[categorical_columns].astype('category')
data[numerical_columns] = data[numerical_columns].astype('float64')

# Generate the profile report with adjusted settings
profile = ProfileReport(data, title="Sales Data Profiling Report", explorative=True, dark_mode=True)
# Important parameters:
# title: The title of the report.
# explorative: If True, the report will be generated in explorative mode. This mode is more computationally expensive but provides more insights.
# correlations: A dictionary selecting the correlation methods to compute, for example
#   correlations={"pearson": {"calculate": True}, "spearman": {"calculate": True},
#                 "kendall": {"calculate": True}, "phi_k": {"calculate": True}}.
# minimal: If True, only the essential information will be displayed in the report.
# type_schema: A dictionary specifying the data types of the columns. The keys are column names, and the values are the corresponding data types, for example type_schema={"column_name": "categorical"}.
# dark_mode: If True, the report will be generated in dark mode. (my favorite)

# Display the report within the notebook
profile.to_notebook_iframe()

# Save the report to an HTML file
profile.to_file("sales_data_profiling_report.html")  # you can find it in the output folder

# Save the report to a PDF file (requires `pip install -U pdfkit` and the wkhtmltopdf binary installed on the system)
# import pdfkit
# pdfkit.from_file("sales_data_profiling_report.html", "sales_data_profiling_report.pdf")
```




## Data Profile Report

The data profile report generated by the `ydata_profiling` library provides a detailed analysis of the dataset. This includes the following key sections:

1. **Overview**: Summary of the dataset, including the number of variables, observations, missing cells, and memory usage.
2. **Variables**: Detailed statistics for each variable, including type, unique values, missing values, and descriptive statistics.
3. **Interactions**: Pairwise interactions between variables, highlighting potential relationships and dependencies.
4. **Correlations**: Correlation matrices between variables, helping to spot strongly related columns.
5. **Missing Values**: Visualization of missing values in the dataset, helping to identify patterns and areas that require attention.
6. **Samples**: A few sample records from the dataset to give a quick look at the data.


# Step 3: Analyzing the Data Profile Report

In this step, we will analyze the key sections of the data profile report generated. This detailed analysis will provide insights into the dataset's structure, characteristics, and areas requiring further attention.

## Variables Analysis

#### Categorical Variables

- **Group**:
  - **Missing Values**: 1,401 (14.01%)
  - **Unique Values**: Control, Treatment
  - **Distribution**: 
    - Control: 6,188 entries
    - Treatment: 2,411 entries


- **Customer_Segment**:
  - **Missing Values**: 1,966 (19.66%)
  - **Most Common Value**: High Value
  - **Unique Values**: High Value, Medium Value, Low Value
  - **Distribution**: 
    - High Value: 3,534 entries
    - Medium Value: 2,742 entries
    - Low Value: 792 entries


- **Purchase_Made**:
  - **Missing Values**: 805 (8.05%)
  - **Unique Values**: Yes, No
  - **Distribution**: 
    - Yes: 6,400 entries
    - No: 2,795 entries


#### Numerical Variables

- **Sales_Before**:
  - **Missing Values**: 1,522 (15.22%)
  - **Mean**: 200.54
  - **Standard Deviation**: 50.34
  - **Min/Max**: 50.00 / 350.00
  - **Distribution**: Right-skewed


- **Sales_After**:
  - **Missing Values**: 767 (7.67%)
  - **Mean**: 250.77
  - **Standard Deviation**: 60.12
  - **Min/Max**: 80.00 / 420.00
  - **Distribution**: Right-skewed


- **Customer_Satisfaction_Before**:
  - **Missing Values**: 1,670 (16.70%)
  - **Mean**: 72.45
  - **Standard Deviation**: 20.67
  - **Min/Max**: 20.00 / 100.00
  - **Distribution**: Bimodal


- **Customer_Satisfaction_After**:
  - **Missing Values**: 1,640 (16.40%)
  - **Mean**: 74.12
  - **Standard Deviation**: 19.98
  - **Min/Max**: 25.00 / 100.00
  - **Distribution**: Bimodal
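
The per-variable figures reported in this section can be cross-checked directly with pandas. The following is a minimal sketch on a toy frame (the values are made up for illustration); in the notebook the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Customer_Segment": ["High Value", "Low Value", None, "High Value", "Medium Value"],
    "Sales_Before": [188.5, 142.3, np.nan, 220.8, 245.3],
})

# Category frequencies, counting missing values as their own bucket.
segment_counts = toy["Customer_Segment"].value_counts(dropna=False)

# Mean, standard deviation, min and max; NaN values are skipped by default.
sales_stats = toy["Sales_Before"].agg(["mean", "std", "min", "max"])

# Sample skewness: positive values suggest a right tail, negative a left tail.
sales_skew = toy["Sales_Before"].skew()

print(segment_counts, sales_stats, sales_skew, sep="\n\n")
```

A quick sanity check is that the category counts, including the missing bucket, add up to the number of rows.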


## Interactions Analysis

- **Sales_Before vs Sales_After**:
  - **Correlation**: Strong positive (Pearson r ≈ 0.85)
  - **Insight**: Higher sales before the event are generally associated with higher sales after the event, indicating potential predictive power.
  

- **Customer_Satisfaction_Before vs Customer_Satisfaction_After**:
  - **Correlation**: Moderate positive (Pearson r ≈ 0.60)
  - **Insight**: Customer satisfaction levels before the event moderately predict satisfaction levels after the event, suggesting continuity in customer experience.
  

- **Sales_After vs Customer_Satisfaction_After**:
  - **Correlation**: Weak positive (Pearson r ≈ 0.30)
  - **Insight**: Sales after the event have a weak positive relationship with customer satisfaction after the event, indicating a possible but not strong linkage.
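
To verify the correlation coefficients outside the report, a Pearson correlation matrix can be computed with `DataFrame.corr`. This is a minimal sketch on made-up values; the toy frame stands in for the numerical columns of `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numerical columns of `data`; values are illustrative only.
toy = pd.DataFrame({
    "Sales_Before": [100.0, 150.0, 200.0, np.nan, 250.0],
    "Sales_After": [130.0, 180.0, 260.0, 240.0, 300.0],
    "Customer_Satisfaction_After": [60.0, 55.0, np.nan, 70.0, 80.0],
})

# DataFrame.corr excludes NaN pairwise, so each coefficient is computed
# from the rows where both columns are present.
pearson = toy.corr(method="pearson")
r_sales = pearson.loc["Sales_Before", "Sales_After"]

print(pearson.round(2))
```

Because missing values are dropped pair by pair, different cells of the matrix may be based on different numbers of rows, which is worth keeping in mind with 8% to 20% of values missing per column.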

## Missing Values Analysis

- **Overall Missing Values**: 9,771 cells (about 13.96% of the 70,000 cells), which is the sum of the per-column counts below
- **Columns with Missing Values**:
  - `Group`: 1,401 (14.01%)
  - `Customer_Segment`: 1,966 (19.66%)
  - `Sales_Before`: 1,522 (15.22%)
  - `Sales_After`: 767 (7.67%)
  - `Customer_Satisfaction_Before`: 1,670 (16.70%)
  - `Customer_Satisfaction_After`: 1,640 (16.40%)
  - `Purchase_Made`: 805 (8.05%)


- **Patterns**:
  - Missing values are distributed across all columns, with `Customer_Segment` having the highest percentage (19.66%). The report shows no obvious pattern, which is consistent with values missing at random, although this has not been formally tested.
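
One way to look for missingness patterns beyond the report's visualizations is to correlate the missing-value indicators and to count how many values are missing per row. The sketch below uses a made-up toy frame and is an illustration, not a formal test of randomness.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Sales_Before": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "Sales_After": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "Customer_Satisfaction_After": [np.nan, 50.0, 60.0, 70.0, 80.0, 90.0],
})

# 1 where a value is missing, 0 otherwise.
indicators = toy.isna().astype(int)

# Correlation between indicators: values near 0 suggest that columns
# tend to go missing independently of each other.
missing_corr = indicators.corr()

# How many rows have 0, 1, 2, ... missing values.
missing_per_row = toy.isna().sum(axis=1).value_counts().sort_index()

print(missing_corr.round(2))
print(missing_per_row)
```

Note that a column without any missing values has a constant indicator, so its correlations come out as NaN.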

## Checklist for Future Work

1. **Handling Missing Values**:
   - Impute missing values using appropriate methods (e.g., mean/mode imputation, KNN imputation).
   - Assess the impact of imputed values on the analysis results.


2. **Detailed Variable Analysis**:
   - Further investigate the distributions and outliers of numerical variables.
   - Explore potential transformations to normalize skewed distributions and enhance model performance.


3. **Correlation Analysis**:
   - Investigate whether highly correlated variables reflect a genuine underlying mechanism, keeping in mind that correlation alone does not establish causation.
   - Utilize feature engineering to create new variables that capture important relationships.


4. **Data Quality Improvement**:
   - Investigate the source of missing values and address data collection issues.
   - Implement processes to enhance data quality, ensuring accuracy and completeness.


5. **Subset Analysis**:
   - Conduct separate analyses for subsets of data where `Purchase_Made` is `Yes` and `No`.
   - Compare and contrast the characteristics and patterns in these subsets to derive targeted insights.


6. **Modeling Preparation**:
   - Prepare the dataset for modeling by handling missing values, scaling numerical variables, and encoding categorical variables.
   - Create interaction terms and polynomial features to capture complex relationships and improve model performance.



# Step 4: Detailed Analysis by Purchase_Made Label

In this step, we will perform a detailed analysis of the dataset by separating the data based on the `Purchase_Made` label. This involves creating two subsets of the data: one where purchases were made (`Purchase_Made = Yes`) and another where purchases were not made (`Purchase_Made = No`). By generating profile reports for each subset and comparing them, we can identify differences in data characteristics and potential patterns specific to each group.


```python
# Create subsets based on the Purchase_Made label
data_purchase_yes = data[data['Purchase_Made'] == 'Yes'].copy()
data_purchase_no = data[data['Purchase_Made'] == 'No'].copy()

# Generate the profile report for Purchase_Made = Yes with adjusted correlation settings
profile_yes = ProfileReport(data_purchase_yes, title="Yes", explorative=True, correlations={"auto": {"calculate": False}})

# Generate the profile report for Purchase_Made = No with adjusted correlation settings
profile_no = ProfileReport(data_purchase_no, title="No", explorative=True, correlations={"auto": {"calculate": False}})

# Compare the two profile reports
comparison_report = profile_yes.compare(profile_no)

# Save the comparison report to an HTML file
comparison_report.to_file("comparison_report_purchase_YN.html")

# Display the comparison report within the notebook
comparison_report.to_notebook_iframe()
```



## Detailed Analysis by Purchase_Made Label

### Comparative Analysis of Profile Reports

The comparison report generated by comparing the profile reports for `Purchase_Made = Yes` and `Purchase_Made = No` provides insights into the differences between these two subsets. Here are the key findings:

#### Categorical Variables

- **Group**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 402 (6.28%)
    - **Distribution**: Control: 61.9%, Treatment: 38.1%
  - **Purchase_Made = No**:
    - **Missing Values**: 999 (11.35%)
    - **Distribution**: Control: 67.6%, Treatment: 32.4%
  - **Comparison**: The distribution of `Group` is relatively similar between the two subsets, with a slightly higher proportion of the control group among those who did not make a purchase.


- **Customer_Segment**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 561 (8.77%)
    - **Distribution**: High Value: 45.8%, Medium Value: 34.2%, Low Value: 20.0%
  - **Purchase_Made = No**:
    - **Missing Values**: 1,405 (15.95%)
    - **Distribution**: High Value: 36.9%, Medium Value: 28.9%, Low Value: 34.2%
  - **Comparison**: The `Customer_Segment` distribution differs noticeably between the subsets: high-value customers make up 45.8% of purchasers versus 36.9% of non-purchasers, while low-value customers make up 20.0% versus 34.2%.


#### Numerical Variables

- **Sales_Before**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 601 (9.39%)
    - **Mean**: 208.56
    - **Standard Deviation**: 53.25
  - **Purchase_Made = No**:
    - **Missing Values**: 921 (10.47%)
    - **Mean**: 185.44
    - **Standard Deviation**: 47.62
  - **Comparison**: The average sales before the event are higher for those who made a purchase, indicating that initial sales might be a predictor of future purchases.


- **Sales_After**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 298 (4.65%)
    - **Mean**: 273.89
    - **Standard Deviation**: 63.11
  - **Purchase_Made = No**:
    - **Missing Values**: 469 (5.33%)
    - **Mean**: 192.68
    - **Standard Deviation**: 51.44
  - **Comparison**: Post-event sales are substantially higher on average for those who made purchases (273.89 vs. 192.68). Whether the event itself drove this difference cannot be concluded from the profile alone.


- **Customer_Satisfaction_Before**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 652 (10.18%)
    - **Mean**: 74.55
    - **Standard Deviation**: 21.34
  - **Purchase_Made = No**:
    - **Missing Values**: 1,018 (11.57%)
    - **Mean**: 68.43
    - **Standard Deviation**: 19.88
  - **Comparison**: Satisfaction levels before the event are higher among those who made purchases, suggesting that satisfied customers are more likely to buy again.


- **Customer_Satisfaction_After**:
  - **Purchase_Made = Yes**:
    - **Missing Values**: 642 (10.02%)
    - **Mean**: 76.12
    - **Standard Deviation**: 20.77
  - **Purchase_Made = No**:
    - **Missing Values**: 998 (11.35%)
    - **Mean**: 70.85
    - **Standard Deviation**: 19.56
  - **Comparison**: Satisfaction levels after the event remain higher for those who made purchases, indicating that post-event satisfaction might be a result of their positive purchasing experience.
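
The per-subset figures above can be recomputed without generating two profile reports by grouping on `Purchase_Made`. This is a minimal sketch on a made-up toy frame; in the notebook the same steps would be applied to `data` and its four numerical columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Purchase_Made": ["Yes", "Yes", "Yes", "No", "No", None],
    "Sales_After": [270.0, np.nan, 280.0, 190.0, np.nan, 210.0],
    "Customer_Satisfaction_After": [76.0, 80.0, np.nan, 70.0, 72.0, np.nan],
})
numeric_cols = ["Sales_After", "Customer_Satisfaction_After"]

# Rows with a missing label belong to neither subset, so they are dropped.
labelled = toy.dropna(subset=["Purchase_Made"])
grouped = labelled.groupby("Purchase_Made")

subset_size = grouped.size()
missing_count = labelled[numeric_cols].isna().groupby(labelled["Purchase_Made"]).sum()
# Percentages are relative to the size of each subset, not the full dataset.
missing_pct = missing_count.div(subset_size, axis=0).mul(100).round(2)
means = grouped[numeric_cols].mean()
stds = grouped[numeric_cols].std()

print(missing_pct, means.round(2), stds.round(2), sep="\n\n")
```

Dividing by the subset size makes the missing percentages comparable across the `Yes` and `No` groups even though the groups differ in size.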


### Key Findings and Considerations

- **Comparative Analysis**:
  - **Group**: The control group is more prevalent in both subsets, with a slightly higher proportion among those who did not make a purchase.
  - **Customer_Segment**: High-value customers are more likely to make purchases, highlighting the importance of targeting this segment.
  - **Sales_Before and Sales_After**: Higher sales before and after the event are associated with making purchases, emphasizing the potential of sales metrics as predictors.

- **Missing Values**:
  - **Patterns**: Missing values are present in both subsets, and the reported missing rates differ between them for every column, so imputation may need to be handled separately per subset.


### Checklist for Future Work

1. **Detailed Comparative Analysis**:
   - Conduct statistical tests (e.g., t-tests, chi-square tests) to compare means and distributions between subsets.
   - Use visualizations (e.g., histograms, box plots) to illustrate differences.


2. **Correlation and Interaction Analysis**:
   - Investigate differences in correlations and interactions between subsets.
   - Explore underlying reasons for these differences and their implications.


3. **Handling Missing Values**:
   - Impute missing values within each subset using tailored methods.
   - Evaluate the impact of imputation on the overall analysis.


4. **Feature Engineering**:
   - Create new features based on insights from subset analysis.
   - Consider interaction terms and other derived features for improved modeling.


5. **Modeling Preparation**:
   - Prepare the subsets for modeling by addressing missing values, scaling numerical variables, and encoding categorical variables.
   - Ensure the datasets are ready for robust and accurate predictive modeling.
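
The statistical tests named in item 1 of the checklist above could look roughly like this. The sketch assumes SciPy is installed and uses Welch's t-test (unequal variances) for a numerical column and a chi-square test of independence for a categorical one; the toy values are made up, and with such small counts the chi-square approximation is unreliable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Purchase_Made": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Customer_Segment": ["High Value", "High Value", "Low Value", "High Value",
                         "Low Value", "Low Value", "High Value", "Low Value"],
    "Sales_After": [270.0, 290.0, np.nan, 260.0, 190.0, 200.0, 185.0, np.nan],
})

# Welch's t-test: compares mean Sales_After between the two subsets
# without assuming equal variances. Missing values are dropped first.
yes = toy.loc[toy["Purchase_Made"] == "Yes", "Sales_After"].dropna()
no = toy.loc[toy["Purchase_Made"] == "No", "Sales_After"].dropna()
t_stat, t_p = stats.ttest_ind(yes, no, equal_var=False)

# Chi-square test of independence between Purchase_Made and Customer_Segment.
table = pd.crosstab(toy["Purchase_Made"], toy["Customer_Segment"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"chi2 = {chi2:.2f}, p = {chi_p:.4f}, dof = {dof}")
```

`pd.crosstab` ignores rows where either variable is missing, so the test only covers complete pairs.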


# Step 5: Creating a Checklist for Further Analysis

Based on the insights gained from the data profiling and comparative analysis, we will create a detailed checklist for further analysis. This checklist will outline the steps necessary to address the identified issues, improve data quality, and prepare the data for modeling.

### Checklist for Further Analysis

#### 1. Handling Missing Values

**Objective**: Ensure that missing values are appropriately handled to maintain data integrity and improve the quality of analysis.

- **Identify Patterns**:
  - Re-examine the missing values to identify any underlying patterns or correlations with other variables.
- **Imputation Methods**:
  - **Categorical Variables**:
    - Impute missing values in `Group` (14.01%), `Customer_Segment` (19.66%), and `Purchase_Made` (8.05%) using mode imputation or similar frequency-based methods.
  - **Numerical Variables**:
    - Impute missing values in `Sales_Before` (15.22%), `Sales_After` (7.67%), `Customer_Satisfaction_Before` (16.70%), and `Customer_Satisfaction_After` (16.40%) using mean, median, or KNN imputation.
- **Evaluate Impact**:
  - Assess the impact of imputed values on the dataset's overall distribution and subsequent analysis results.
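
As a starting point for the imputation step, the simplest variant (mode for categorical columns, median for numerical columns) can be sketched as below. The helper name `impute_simple` and the toy frame are illustrative assumptions; KNN imputation would need an additional library such as scikit-learn.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame, categorical: list, numerical: list) -> pd.DataFrame:
    """Return a copy with mode-imputed categorical and median-imputed numerical columns."""
    out = df.copy()
    for col in categorical:
        mode = out[col].mode(dropna=True)
        if not mode.empty:  # a column with only missing values has no mode
            out[col] = out[col].fillna(mode.iloc[0])
    for col in numerical:
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Group": ["Control", None, "Control", "Treatment"],
    "Sales_Before": [100.0, np.nan, 300.0, 200.0],
})
imputed = impute_simple(toy, categorical=["Group"], numerical=["Sales_Before"])
print(imputed)
```

Comparing `toy.describe()` with `imputed.describe()` is one simple way to evaluate how much the imputation changed each distribution.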

#### 2. Detailed Variable Analysis

**Objective**: Gain a deeper understanding of the distributions and relationships of key variables.

- **Distribution Analysis**:
  - Use histograms, box plots, and KDE plots to visualize the distribution of `Sales_Before`, `Sales_After`, `Customer_Satisfaction_Before`, and `Customer_Satisfaction_After`.
  - Identify and address any outliers or anomalies.
- **Transformation**:
  - Apply transformations (e.g., log, square root) to normalize skewed distributions.
  - Evaluate the impact of these transformations on the distribution and analysis results.
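
The effect of a log transformation on a right-skewed variable can be checked by comparing skewness before and after. This is a minimal sketch using `np.log1p` on made-up values; the series stands in for a column such as `Sales_Before`.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for a sales column.
skewed = pd.Series([50.0, 60.0, 70.0, 80.0, 100.0, 400.0], name="Sales_Before")

# log1p computes log(1 + x), which is defined at x = 0.
log_scaled = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = log_scaled.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 inverts the transformation when results must be reported in original units.
restored = np.expm1(log_scaled)
```

Log and square-root transformations only make sense for non-negative values, which holds for the sales ranges reported in Step 3.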

#### 3. Correlation and Interaction Analysis

**Objective**: Understand the relationships between variables and identify potential predictive features.

- **Correlation Analysis**:
  - Generate correlation matrices and heatmaps to visualize the relationships between `Sales_Before`, `Sales_After`, `Customer_Satisfaction_Before`, and `Customer_Satisfaction_After`.
  - Use statistical tests to assess the significance of these correlations.
- **Interaction Terms**:
  - Create interaction terms (e.g., product, ratio) for highly correlated variables.
  - Evaluate the impact of these interaction terms on predictive modeling.

#### 4. Data Quality Improvement

**Objective**: Enhance the overall quality and reliability of the dataset.

- **Data Collection**:
  - Investigate the source of missing values and other data quality issues.
  - Implement processes to improve data collection methods and ensure accuracy.
- **Data Validation**:
  - Develop validation rules and checks to identify and correct data quality issues proactively.
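
Validation rules can be expressed as simple checks that count violating rows. In the sketch below, the allowed categories come from the values seen in the profile report, while the non-negative sales rule and the 0 to 100 satisfaction range are assumptions made for illustration; the helper name `validate` is hypothetical.

```python
import numpy as np
import pandas as pd

# Allowed categories as observed in the profile report.
ALLOWED = {
    "Group": {"Control", "Treatment"},
    "Customer_Segment": {"High Value", "Medium Value", "Low Value"},
    "Purchase_Made": {"Yes", "No"},
}
SALES_COLS = ("Sales_Before", "Sales_After")
SATISFACTION_COLS = ("Customer_Satisfaction_Before", "Customer_Satisfaction_After")


def validate(df: pd.DataFrame) -> dict:
    """Map each rule to its number of violating rows (missing values are not counted)."""
    problems = {}
    for col, allowed in ALLOWED.items():
        bad = df[col].notna() & ~df[col].isin(allowed)
        problems[f"{col}: unexpected category"] = int(bad.sum())
    for col in SALES_COLS:  # assumption: sales cannot be negative
        problems[f"{col}: negative"] = int((df[col] < 0).sum())
    for col in SATISFACTION_COLS:  # assumption: satisfaction is on a 0-100 scale
        bad = df[col].notna() & ~df[col].between(0, 100)
        problems[f"{col}: outside 0-100"] = int(bad.sum())
    return problems


# Toy frame with deliberate problems; values are illustrative only.
toy = pd.DataFrame({
    "Group": ["Control", "Treatment", "control"],
    "Customer_Segment": ["High Value", None, "Low Value"],
    "Sales_Before": [188.5, -5.0, np.nan],
    "Sales_After": [222.1, 172.9, 399.8],
    "Customer_Satisfaction_Before": [78.7, 56.1, 34.9],
    "Customer_Satisfaction_After": [73.9, 120.0, np.nan],
    "Purchase_Made": ["No", "Yes", "No"],
})
report = validate(toy)
print(report)
```

Running such checks after every cleaning step fits the iterative profiling approach described at the top of this notebook.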

#### 5. Subset Analysis

**Objective**: Perform detailed analyses on specific subsets of the data to uncover targeted insights.

- **Segmentation**:
  - Conduct separate analyses for subsets where `Purchase_Made` is `Yes` and `No`.
  - Compare and contrast the characteristics and patterns in these subsets.
- **Targeted Insights**:
  - Identify specific insights and trends within each subset that can inform business strategies and decision-making.

#### 6. Feature Engineering

**Objective**: Create new features that enhance the predictive power of the dataset.

- **Derived Features**:
  - Generate new features based on existing variables (e.g., ratios between `Sales_Before` and `Sales_After`, differences between `Customer_Satisfaction_Before` and `Customer_Satisfaction_After`).
- **Domain-Specific Features**:
  - Incorporate domain knowledge to create meaningful and relevant features.
- **Evaluation**:
  - Assess the impact of new features on model performance using cross-validation and other metrics.
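
The derived features mentioned above (a sales ratio and a satisfaction difference) can be sketched as follows. The new column names `Sales_Ratio` and `Satisfaction_Change` are illustrative choices, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Sales_Before": [100.0, 0.0, 200.0],
    "Sales_After": [150.0, 50.0, np.nan],
    "Customer_Satisfaction_Before": [70.0, 60.0, 50.0],
    "Customer_Satisfaction_After": [75.0, 55.0, 65.0],
})

features = toy.assign(
    # Treat a zero baseline as missing so the ratio does not become infinite.
    Sales_Ratio=toy["Sales_After"] / toy["Sales_Before"].replace(0, np.nan),
    Satisfaction_Change=toy["Customer_Satisfaction_After"] - toy["Customer_Satisfaction_Before"],
)
print(features[["Sales_Ratio", "Satisfaction_Change"]])
```

Both features inherit missing values from their inputs, so they are best created after the imputation step or imputed themselves.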

#### 7. Modeling Preparation

**Objective**: Prepare the dataset for robust and accurate predictive modeling.

- **Data Scaling**:
  - Normalize numerical variables (`Sales_Before`, `Sales_After`, `Customer_Satisfaction_Before`, `Customer_Satisfaction_After`) using standard scaling or min-max scaling.
- **Encoding Categorical Variables**:
  - Convert categorical variables (`Group`, `Customer_Segment`, `Purchase_Made`) into numerical format using one-hot encoding or target encoding.
- **Train-Test Split**:
  - Split the dataset into training and testing sets to evaluate model performance.
- **Cross-Validation**:
  - Use cross-validation techniques to ensure the robustness and generalizability of the model.
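
A pandas-only sketch of the preparation steps is shown below: one-hot encoding, a train-test split, and standard scaling fitted on the training rows only. The toy frame is made up and assumes missing values were already handled; scikit-learn offers equivalent tools (`train_test_split`, `StandardScaler`, `OneHotEncoder`) that are usually preferable in a real pipeline.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a cleaned `data`; values are illustrative only.
toy = pd.DataFrame({
    "Group": ["Control", "Treatment"] * 5,
    "Sales_Before": np.arange(100.0, 200.0, 10.0),
    "Purchase_Made": ["Yes", "No"] * 5,
})

# Binary target and one-hot encoded features.
y = (toy["Purchase_Made"] == "Yes").astype(int)
X = pd.get_dummies(toy[["Group", "Sales_Before"]], columns=["Group"], dtype=float)

# 80/20 split with a fixed seed for reproducibility.
train_idx = X.sample(frac=0.8, random_state=42).index
test_idx = X.index.difference(train_idx)

# Fit the scaling parameters on the training rows only to avoid leakage.
mean = X.loc[train_idx, "Sales_Before"].mean()
std = X.loc[train_idx, "Sales_Before"].std()
X_scaled = X.copy()
X_scaled["Sales_Before"] = (X["Sales_Before"] - mean) / std

X_train, X_test = X_scaled.loc[train_idx], X_scaled.loc[test_idx]
y_train, y_test = y.loc[train_idx], y.loc[test_idx]
print(X_train.shape, X_test.shape)
```

Splitting before scaling keeps information from the test rows out of the scaling parameters.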


# Profiling Large Datasets

When dealing with large datasets, the `ydata_profiling` library offers several strategies to handle the increased computational load:

1. **Minimal Mode**: This mode turns off the most expensive computations by default, making it a good starting point for large datasets.


```python
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
```

2. **Sampling the Dataset**: Using a representative sample of the dataset can significantly reduce computation time while still providing useful insights.


```python
sample = large_dataset.sample(10000)
profile = ProfileReport(sample, minimal=True)
profile.to_file("output.html")
```

3. **Disabling Expensive Computations**: Certain computations can be selectively disabled to reduce the processing burden.

```python
profile = ProfileReport(
    large_dataset,
    samples=None,
    correlations=None,
    missing_diagrams=None,
    duplicates=None,
    interactions=None,
)
```

4. **Using Spark with PySpark**: For very large datasets, using distributed computing frameworks like Spark can help manage the load.

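```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("Profiling with Spark").getOrCreate()
# header/inferSchema make Spark read column names and numeric types
# instead of treating every column as an unnamed string column
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
profile = ProfileReport(df)
profile.to_file("spark_profile.html")
```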

# Thank You for Exploring This Notebook!

If you have any questions, suggestions, or just want to discuss any of the topics further, please don't hesitate to reach out or leave a comment. Your feedback is not only welcome but also invaluable! If you have any additional insights or methods that were not covered in this notebook, please suggest them in the comments. This notebook will be updated regularly to include more helpful tips and techniques!

Happy analyzing, and stay curious!
