- Date: 2020-07-01 00:03:12
- Title: Profiling Data Using pandas-profiling
- Slug: profiling-data-using-pandas-profiling
- Category: Computer Science
- Tags: Computer Science, pandas, Python, data profiling, pandas-profiling
- Modified: 2022-07-11 15:04:12


## Tips and Traps

1. Use multiprocessing 
    (e.g., `pool_size=8`)
    to speed up data profiling.
    Note: in my experience, multiprocessing currently works only
    when `minimal=True`.
    
2. `minimal=True` helps reduce memory consumption.
    ```
    profile = ProfileReport(
        df, title="Data Profiling Report", 
        explorative=True, minimal=True, pool_size=8
    )
    ```

3. `ProfileReport.dump` dumps the report to a pickle file (for caching purposes),
    while `ProfileReport.to_file` writes the report to an HTML file or a JSON file.
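
The first two tips can be combined in a small helper that builds the keyword arguments for `ProfileReport`. This is only a sketch: the helper name `profile_kwargs` and the cap of 8 workers are assumptions for illustration, and only `explorative`, `minimal`, and `pool_size` come from the example above.

```python
import os


def profile_kwargs(max_workers=8):
    """Suggest keyword arguments for ProfileReport (hypothetical helper)."""
    # os.cpu_count() may return None, so fall back to a single worker.
    pool_size = min(os.cpu_count() or 1, max_workers)
    # Keep minimal=True, since multiprocessing seems to require it (tip 1).
    return {"explorative": True, "minimal": True, "pool_size": pool_size}


kwargs = profile_kwargs()
print(kwargs)
# Intended use (requires pandas-profiling and a DataFrame `df`):
# profile = ProfileReport(df, title="Data Profiling Report", **kwargs)
```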

## Installation

```
pip3 install --user -U "pandas-profiling[notebook]"
```

In [1]:
!wget https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv

--2020-11-19 10:12:20--  https://storage.googleapis.com/erwinh-public-data/bankingdata/bank-full.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 2607:f8b0:400e:c09::80, 2607:f8b0:400e:c07::80, 2607:f8b0:400e:c08::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|2607:f8b0:400e:c09::80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4610348 (4.4M) [text/csv]
Saving to: ‘bank-full.csv.1’


2020-11-19 10:12:21 (11.1 MB/s) - ‘bank-full.csv.1’ saved [4610348/4610348]



In [2]:
!head bank-full.csv

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"
33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"
47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
33;"unknown";"single";"unknown";"no";1;"no";"no";"unknown";5;"may";198;1;-1;0;"unknown";"no"
35;"management";"married";"tertiary";"no";231;"yes";"no";"unknown";5;"may";139;1;-1;0;"unknown";"no"
28;"management";"single";"tertiary";"no";447;"yes";"yes";"unknown";5;"may";217;1;-1;0;"unknown";"no"
42;"entrepreneur";"divorced";"tertiary";"yes";2;"yes";"no";"unknown";5;"may";380;1;-1;0;"unknown";"no"
58;"retired";"married";"primary";"no";121;"yes

In [2]:
!pip3 install --user pandas-profiling

Collecting pandas-profiling
  Using cached pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
Collecting confuse>=1.0.0
  Using cached confuse-1.3.0-py2.py3-none-any.whl (64 kB)
Collecting missingno>=0.4.2
  Using cached missingno-0.4.2-py3-none-any.whl (9.7 kB)
Collecting seaborn>=0.10.1
  Using cached seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting visions[type_image_path]==0.5.0
  Using cached visions-0.5.0-py3-none-any.whl (64 kB)
Installing collected packages: confuse, seaborn, missingno, visions, pandas-profiling
  NOTE: The current PATH contains path(s) starting with `~`, which may not be expanded by all applications.
Successfully installed confuse-1.3.0 missingno-0.4.2 pandas-profiling-2.9.0 seaborn-0.11.0 visions-0.5.0


In [1]:
from pathlib import Path
import pandas as pd
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

In [2]:
df = pd.read_csv("bank-full.csv", sep=";")
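
The file is semicolon-delimited, which is why `sep=";"` is needed. If the delimiter is not known in advance, it can be detected with the standard-library `csv.Sniffer`. The sketch below runs it on a shortened copy of the `head` output above (first six columns only, an abbreviation made for this example); the detected delimiter is what one would pass as `sep` to `pd.read_csv`.

```python
import csv
import io

# First lines of bank-full.csv, truncated to six columns for brevity.
sample = (
    '"age";"job";"marital";"education";"default";"balance"\n'
    '58;"management";"married";"tertiary";"no";2143\n'
    '44;"technician";"single";"secondary";"no";29\n'
)

# Restrict the candidates to a few common delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print("detected delimiter:", repr(dialect.delimiter))

# Parse the sample with the detected dialect to confirm the columns line up.
rows = list(csv.DictReader(io.StringIO(sample), dialect=dialect))
print(rows[0]["job"], rows[0]["balance"])
```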

In [3]:
profile = ProfileReport(
    df,
    title="Profile Report of the UCI Bank Marketing Dataset",
    explorative=True,
    minimal=True,
    pool_size=8,
)

In [4]:
profile

Summarize dataset: 100%|██████████| 31/31 [00:12<00:00,  2.46it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:05<00:00,  5.05s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.01s/it]




## Save the Report

1. You can save a report to a file using the method `ProfileReport.to_file`. 
    The output format (HTML or JSON) is determined by the file extension.
    
2. `ProfileReport.to_json` returns the report as a JSON-formatted string.
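
To make the extension-based behavior of `to_file` concrete, here is a minimal sketch of such a dispatch. The function `guess_report_format` is hypothetical and only mimics the behavior described above; treating every extension other than `.json` as HTML is an assumption of this sketch, not a statement about the library's internals.

```python
from pathlib import Path


def guess_report_format(output_file):
    """Return "json" for a .json path and "html" otherwise (illustrative only)."""
    return "json" if Path(output_file).suffix == ".json" else "html"


for name in ["uci_bank_marketing_report.html", "uci_bank_marketing_report.json"]:
    print(name, "->", guess_report_format(name))
```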

In [10]:
profile.json



In [11]:
profile.html


Write the report to an [HTML file](http://www.legendu.net/media/pandas-profiling/uci_bank_marketing_report.html).

In [8]:
profile.to_file("../../home/media/pandas-profiling/uci_bank_marketing_report.html")

Export report to file: 100%|██████████| 1/1 [00:00<00:00, 74.82it/s]


Write the report to a [JSON file](http://www.legendu.net/media/pandas-profiling/uci_bank_marketing_report.json).

In [9]:
profile.to_file("../../home/media/pandas-profiling/uci_bank_marketing_report.json")

Render JSON: 100%|██████████| 1/1 [00:02<00:00,  2.04s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 49.15it/s]


You can also convert the report to an HTML-formatted string using the method `ProfileReport.to_html`.

In [15]:
profile.to_html()


Similarly, you can convert the report to a JSON-formatted string using the method `ProfileReport.to_json`.

In [20]:
import jupyter

jupyter.textOutputLimit = 0

In [21]:
print(profile.to_json())

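Because the JSON string embeds plots as SVG markup, printing it in full floods the notebook. One way to inspect it is to parse the string with `json.loads` and truncate long string values before printing. The sketch below uses a made-up stand-in for the output of `profile.to_json()`; the helper `shorten` and the keys `title` and `plot` are hypothetical and do not reflect the real report schema.

```python
import json


def shorten(value, max_len=60):
    """Recursively truncate long strings in a parsed JSON value."""
    if isinstance(value, str):
        return value if len(value) <= max_len else value[:max_len] + "..."
    if isinstance(value, list):
        return [shorten(v, max_len) for v in value]
    if isinstance(value, dict):
        return {k: shorten(v, max_len) for k, v in value.items()}
    return value


# Stand-in for profile.to_json(); the keys are invented for illustration.
text = json.dumps({"title": "Demo", "plot": "<svg>" + "x" * 500 + "</svg>"})

report = shorten(json.loads(text))
print(json.dumps(report, indent=2))
```

With a real report, replace `text` with `profile.to_json()`.
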
## Configuration

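The default configuration is defined in
[config_default.yaml](https://github.com/pandas-profiling/pandas-profiling/blob/master/src/pandas_profiling/config_default.yaml).
For example, the `sort` option controls the order of variables in the report.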

```
# Sort the variables. Possible values: ascending, descending or None (leaves original sorting)
sort: None 
```

## References

https://github.com/pandas-profiling/pandas-profiling

https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/getting_started.html?highlight=to_json#saving-the-report