- Date: 2020-07-01 00:03:12
- Title: Profiling Data Using ydata-profiling
- Slug: profiling-data-using-ydata-profiling
- Category: Computer Science
- Tags: Computer Science, pandas, Python, data profiling, pandas-profiling
- Modified: 2022-07-11 15:04:12


## Tips and Traps

1. Use multiprocessing 
    (e.g., `pool_size=8`)
    to speed up data profiling.
    Note: in my experience, multiprocessing currently seems to work only
    when `minimal=True`.
    
2. `minimal=True` helps reduce memory consumption.
    ```
    profile = ProfileReport(
        df, title="Data Profiling Report", 
        explorative=True, minimal=True, pool_size=8
    )
    ```

3. `ProfileReport.dump` dumps the report to a pickle file (for caching purposes)
    while `ProfileReport.to_file` writes the report to an HTML file or a JSON file.

## Installation

```
pip3 install --user -U "ydata-profiling[notebook]"
```
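
To confirm that the installation succeeded, the installed version can be queried from Python. The sketch below uses only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "ydata-profiling") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

Calling `installed_version()` returns a version string when the package is installed and `None` otherwise.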

In [4]:
!wget https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv

--2023-06-11 15:15:56--  https://raw.githubusercontent.com/z-o-e/bank_data_analysis/master/bank-full.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4610348 (4.4M) [text/plain]
Saving to: ‘bank-full.csv’


2023-06-11 15:15:56 (36.4 MB/s) - ‘bank-full.csv’ saved [4610348/4610348]



In [5]:
!head bank-full.csv

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
58;"retired";"married";"primary";"no";121;"yes
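
The `head` output shows that the fields are separated by semicolons rather than commas, so the separator has to be passed explicitly when the file is read. If you would rather not hard-code it, the standard library can guess the delimiter from a sample of the file. This is a sketch; the helper name `detect_delimiter` and the list of candidate delimiters are assumptions of this example.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ";,\t|") -> str:
    """Guess the field delimiter of CSV text from a small sample."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# A few lines in the same shape as bank-full.csv.
sample = (
    '"age";"job";"marital"\n'
    '58;"management";"married"\n'
    '44;"technician";"single"\n'
)
print(detect_delimiter(sample))
```

The detected character can then be passed to `pd.read_csv` through its `sep` parameter.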

In [6]:
!pip3 install --user ydata-profiling



In [20]:
import jupyter

jupyter.textOutputLimit = 0

In [7]:
from pathlib import Path
import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file



In [9]:
df = pd.read_csv("bank-full.csv", sep=";")
df

|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45206 | 51 | technician | married | tertiary | no | 825 | no | no | cellular | 17 | nov | 977 | 3 | -1 | 0 | unknown | yes |
| 45207 | 71 | retired | divorced | primary | no | 1729 | no | no | cellular | 17 | nov | 456 | 2 | -1 | 0 | unknown | yes |
| 45208 | 72 | retired | married | secondary | no | 5715 | no | no | cellular | 17 | nov | 1127 | 5 | 184 | 3 | success | yes |
| 45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | telephone | 17 | nov | 508 | 4 | -1 | 0 | unknown | no |


In [10]:
profile = ProfileReport(
    df,
    title="Profile Report of the UCI Bank Marketing Dataset",
    explorative=True,
    minimal=True,
    pool_size=8,
)

In [11]:
profile




## Save the Report

1. `ProfileReport.to_json` returns the report as a JSON string
    and `ProfileReport.to_html` returns the report as an HTML string.
    
2. You can save a report to a file using the method `ProfileReport.to_file`. 
    The output format (HTML or JSON) is determined by the extension of the specified file.
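
The dependence on the file extension can be made explicit before calling `to_file`. The helper below is a hypothetical sketch, not part of ydata-profiling: it maps a path to the format you should expect and rejects anything other than `.html` or `.json`, which may be stricter than what the library itself does with an unknown extension.

```python
from pathlib import Path
from typing import Union


def report_format(path: Union[str, Path]) -> str:
    """Return "html" or "json" depending on the extension of the output path."""
    suffix = Path(path).suffix.lower()
    if suffix == ".html":
        return "html"
    if suffix == ".json":
        return "json"
    raise ValueError(f"Unsupported report extension: {suffix!r}")
```

For example, `report_format("out/report.json")` yields `"json"`, so `profile.to_file("out/report.json")` is expected to write JSON content.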
    

In [None]:
# Lets the `X | Y` type hints below work on Python versions older than 3.10.
from __future__ import annotations

from pathlib import Path
from loguru import logger
import pandas as pd
from ydata_profiling import ProfileReport


def dump_profile(
    df: pd.DataFrame | str | Path, title: str, output_dir: str | Path
) -> None:
    """Run ydata-profiling on a DataFrame and dump the report into files.

    :param df: A pandas DataFrame, or the path to a Parquet/Pickle/CSV file containing one.
    :param title: The title of the generated report.
    :param output_dir: The output directory for reports.
    :raises ValueError: If an input file other than Parquet/Pickle/CSV is provided.
    """
    if isinstance(df, str):
        df = Path(df)
    if isinstance(df, Path):
        logger.info("Reading the DataFrame from {}...", df)
        # Pick the reader based on the file extension.
        ext = df.suffix.lower()
        if ext == ".parquet":
            df = pd.read_parquet(df)
        elif ext == ".pickle":
            df = pd.read_pickle(df)
        elif ext == ".csv":
            df = pd.read_csv(df)
        else:
            raise ValueError("Only Parquet, Pickle and CSV files are supported!")
    logger.info("Shape of the DataFrame: {}", df.shape)
    logger.info("Profiling the DataFrame...")
    report = ProfileReport(df, title=title, minimal=True, explorative=True)
    if isinstance(output_dir, str):
        output_dir = Path(output_dir)
    # Create the output directory (and any missing parents) if needed.
    output_dir.mkdir(parents=True, exist_ok=True)
    # dump report
    logger.info("Dumping the report to HTML...")
    report.to_file(output_dir / "report.html")
    logger.info("Dumping the report to JSON...")
    report.to_file(output_dir / "report.json")
    logger.info("Dumping the report to Pickle...")
    report.dump(output_dir / "report.pickle")

Write the report to an [HTML file](http://www.legendu.net/media/pandas-profiling/uci_bank_marketing_report.html).

In [21]:
profile.to_file("../../../../../home/media/pandas-profiling/uci_bank_marketing_report.html")


Write the report to a [JSON file](http://www.legendu.net/media/pandas-profiling/uci_bank_marketing_report.json).

In [20]:
profile.to_file("../../../../../home/media/pandas-profiling/uci_bank_marketing_report.json")


Get the HTML string representation of the report.

In [22]:
profile.to_html()



## Configuration

https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml

```
# Sort the variables. Possible values: ascending, descending or None (leaves original sorting)
sort: None 
```
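
The allowed values listed in the comment above can be checked in Python before they are handed to the profiler. This is a sketch under assumptions: the helper `validate_sort` is hypothetical, it represents "no sorting" as Python's `None`, and passing the result as a `sort=...` keyword argument to `ProfileReport` is an assumption you should verify against the version you have installed.

```python
from typing import Optional

# Values taken from the comment in the default configuration.
ALLOWED_SORT = ("ascending", "descending", None)


def validate_sort(value: Optional[str]) -> Optional[str]:
    """Return the value unchanged if it is an allowed sort option."""
    if value not in ALLOWED_SORT:
        raise ValueError(f"sort must be one of {ALLOWED_SORT}, got {value!r}")
    return value


# Assumed usage (not verified here):
# profile = ProfileReport(df, sort=validate_sort("ascending"))
```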

## References

https://github.com/pandas-profiling/pandas-profiling

https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/getting_started.html?highlight=to_json#saving-the-report