# Pandas Profiling Report

Creates an HTML report with graphs, statistics, and correlations for a given dataset. See a [sample report](https://pandas-profiling.github.io/pandas-profiling/examples/master/titanic/titanic_report.html) and the pandas-profiling [GitHub page](https://github.com/pandas-profiling/pandas-profiling).


Usage example:

```python
import mlrun
from mlrun import NewTask

mlrun.mlconf.dbpath = 'http://mlrun-api:8080'

# Load the pandas_profiling_report function from GitHub (function hub)
func = mlrun.import_function("hub://pandas_profiling_report").apply(mlrun.mount_v3io())

# Build the MLRun image (only needs to be run once)
func.deploy()

# Create the task; DATA_URL is the dataset passed as the "data" input
DATA_URL = 'https://iguazio-sample-data.s3.amazonaws.com/datasets/iris_dataset.csv'

task = NewTask(name="pandas-profiling-report",
               inputs={"data": DATA_URL})

# Run the task on the cluster
run = func.run(task, artifact_path='/User/artifacts')
```


## mlconfig

In [1]:
from mlrun import mlconf
import os

mlconf.dbpath = "http://mlrun-api:8080"
mlconf.artifact_path = mlconf.artifact_path or f'{os.environ["HOME"]}/artifacts'
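
The `or` fallback in the cell above keeps an artifact path that is already configured and only falls back to a directory under `$HOME` when none is set. A minimal standalone sketch of that pattern, using a hypothetical helper `resolve_artifact_path` (not part of MLRun) and plain strings in place of `mlconf`:

```python
import os


def resolve_artifact_path(configured, home=None):
    """Return the configured path, or '<home>/artifacts' when it is empty.

    Hypothetical helper for illustration; it is not part of MLRun.
    """
    # Fall back to the HOME environment variable, as the notebook cell does.
    home = home if home is not None else os.environ["HOME"]
    # An empty string or None is falsy, so `or` selects the fallback.
    return configured or f"{home}/artifacts"


print(resolve_artifact_path("", home="/home/demo"))
print(resolve_artifact_path("/User/artifacts", home="/home/demo"))
```

The first call prints the fallback path under the given home directory; the second returns the configured path unchanged.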

## Save

In [2]:
import yaml

with open("item.yaml") as item_file:
    items = yaml.load(item_file, Loader=yaml.FullLoader)
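
The following cell reads several fields from `items`, so a key missing from `item.yaml` would surface as a `KeyError` while the `code_to_function` arguments are being built. A minimal sketch of an up-front check, assuming only the keys that cell accesses; the helper `missing_item_keys` and the sample values are hypothetical and do not reflect the real `item.yaml` contents:

```python
# Keys the notebook reads from item.yaml: top-level and under "spec".
TOP_LEVEL_KEYS = ("name", "description", "categories", "labels", "spec")
SPEC_KEYS = ("kind", "handler", "filename", "image", "requirements")


def missing_item_keys(items):
    """Return the dotted names of expected keys that are absent from `items`."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in items]
    spec = items.get("spec") or {}
    missing += [f"spec.{key}" for key in SPEC_KEYS if key not in spec]
    return missing


# Hypothetical sample; the values are placeholders, not the real item.yaml.
sample = {
    "name": "pandas-profiling-report",
    "description": "placeholder description",
    "categories": [],
    "labels": {},
    "spec": {"kind": "job", "handler": "pandas_profiling_report"},
}

print(missing_item_keys(sample))
```

An empty list means every field the notebook needs is present; here the sample deliberately omits three `spec` fields so the check has something to report.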

In [3]:
from mlrun import code_to_function

# create job function object from notebook code
fn = code_to_function(
    name=items["name"],
    kind=items["spec"]["kind"],
    handler=items["spec"]["handler"],
    filename=items["spec"]["filename"],
    image=items["spec"]["image"],
    description=items["description"],
    categories=items["categories"],
    labels=items["labels"],
    requirements=items["spec"]["requirements"],
)


fn.export("pandas_profiling_report.yaml")

> 2021-02-17 10:11:23,400 [info] function spec saved to path: pandas_profiling_report.yaml


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7ff1910bf090>

## Examples

In [4]:
from mlrun.platforms import auto_mount

fn.apply(auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7ff1910bf090>

In [5]:
from mlrun import NewTask, run_local
from pandas_profiling_report import pandas_profiling_report

DATA_URL = "https://iguazio-sample-data.s3.amazonaws.com/datasets/iris_dataset.csv"

In [6]:
task = NewTask(
    name="pandas-profiling-report",
    handler=pandas_profiling_report,
    inputs={"data": DATA_URL},
)

## Run locally

In [7]:
run = run_local(task)

> 2021-02-17 10:11:25,644 [info] starting run pandas-profiling-report uid=363cfff0b65340e9a6fe1b8b97ef21bc DB=http://mlrun-api:8080


Summarize dataset:   0%|          | 0/18 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...97ef21bc | 0 | Feb 17 10:11:25 | completed | pandas-profiling-report | v3io_user=admin, kind=handler, owner=admin, host=jupyter-7b854d9bd6-mkmbn | data | | | Pandas Profiling Report |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 363cfff0b65340e9a6fe1b8b97ef21bc --project default , !mlrun logs 363cfff0b65340e9a6fe1b8b97ef21bc --project default
> 2021-02-17 10:11:36,706 [info] run executed, status=completed
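
The run output above prints two CLI commands for following up on a run. A small sketch that builds the same command strings from a run uid and project name; the helper `tracking_commands` is hypothetical, and the command format is copied from the log output rather than from the MLRun CLI reference:

```python
import shlex


def tracking_commands(uid, project="default"):
    """Build the `mlrun get run` and `mlrun logs` commands shown in the run output."""
    # shlex.quote guards against shell metacharacters in either value.
    uid, project = shlex.quote(uid), shlex.quote(project)
    return [
        f"mlrun get run {uid} --project {project}",
        f"mlrun logs {uid} --project {project}",
    ]


for command in tracking_commands("363cfff0b65340e9a6fe1b8b97ef21bc"):
    print(command)
```

In a notebook cell, prefix each command with `!`, as the run output does.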


## Run remotely

In [8]:
# Create MLRun image (only needs to be run once)
fn.deploy()

> 2021-02-17 10:11:36,712 [info] starting remote build, image: .mlrun/func-default-pandas-profiling-report-latest
E0217 10:12:19.811258       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0040] Retrieving image manifest mlrun/mlrun:0.6.0-rc13
INFO[0042] Retrieving image manifest mlrun/mlrun:0.6.0-rc13
INFO[0045] Built cross stage deps: map[]
INFO[0045] Retrieving image manifest mlrun/mlrun:0.6.0-rc13
INFO[0047] Retrieving image manifest mlrun/mlrun:0.6.0-rc13
INFO[0049] Executing 0 build triggers
INFO[0049] Unpacking rootfs as cmd RUN python -m pip install pandas_profiling requires it.
INFO[0073] RUN python -m pip install pandas_profiling
INFO[0073] Taking snapshot of full filesystem...
INFO[0080] cmd: /

True

In [9]:
fn.run(task, inputs={"data": DATA_URL})

> 2021-02-17 10:13:41,948 [info] starting run pandas-profiling-report uid=70f0d678148e411bbac30de11543f5f5 DB=http://mlrun-api:8080
> 2021-02-17 10:13:42,146 [info] Job is running in the background, pod: pandas-profiling-report-62lvb
Summarize dataset: 100%|██████████| 18/18 [00:04<00:00,  4.12it/s, Completed]                         
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.11s/it]
> 2021-02-17 10:14:04,883 [info] run executed, status=completed
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]
final state: completed


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| default | ...1543f5f5 | 0 | Feb 17 10:13:55 | completed | pandas-profiling-report | v3io_user=admin, kind=job, owner=admin, host=pandas-profiling-report-62lvb | data | | | Pandas Profiling Report |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 70f0d678148e411bbac30de11543f5f5 --project default , !mlrun logs 70f0d678148e411bbac30de11543f5f5 --project default
> 2021-02-17 10:14:07,614 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7ff17a08f410>