Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload extension reloads imported modules automatically before each cell is executed, which is helpful when pandas-profiling is updated below.

In [4]:

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
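
Outside IPython, a similar effect can be had by hand with `importlib.reload`. The following is a minimal sketch rather than a description of how autoreload is implemented; the module name `demo_settings` and its contents are hypothetical and created only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical throwaway module, written to a temporary directory
    module_path = Path(tmp) / "demo_settings.py"
    module_path.write_text("VALUE = 1\n")
    sys.path.insert(0, tmp)
    try:
        import demo_settings

        first = demo_settings.VALUE

        # Edit the source on disk. The new text has a different length so that
        # any cached bytecode is treated as stale even if the edit happens
        # within the same second as the first import.
        module_path.write_text("VALUE = 200\n")
        importlib.invalidate_caches()
        second = importlib.reload(demo_settings).VALUE
    finally:
        # Leave the interpreter state as we found it
        sys.path.remove(tmp)
        sys.modules.pop("demo_settings", None)

print(first, second)
```

`%autoreload 2` saves doing this manually for every imported module.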


Make sure that we have the latest version of pandas-profiling.

In [2]:
import sys

!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Using cached pandas_profiling-2.9.0-py2.py3-none-any.whl (258 kB)
Collecting phik>=0.9.10
  Using cached phik-0.10.0-py3-none-any.whl (599 kB)
Collecting tqdm>=4.43.0
  Using cached tqdm-4.54.1-py2.py3-none-any.whl (69 kB)
Collecting seaborn>=0.10.1
  Using cached seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting requests>=2.23.0
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting matplotlib>=3.2.0
  Using cached matplotlib-3.3.3-cp37-cp37m-win_amd64.whl (8.5 MB)
Collecting missingno>=0.4.2
  Using cached missingno-0.4.2-py3-none-any.whl (9.7 kB)
Collecting jupyter-client>=6.0.0; extra == "notebook"
  Using cached jupyter_client-6.1.7-py3-none-any.whl (108 kB)
Collecting jupyter-core>=4.6.3; extra == "notebook"
  Using cached jupyter_core-4.7.0-py3-none-any.whl (82 kB)
Installing collected packages: matplotlib, phik, tqdm, seaborn, requests, missingno, jupyter-core, jupyter-client, pandas-profiling


ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\\Users\\61435\\anaconda3\\Lib\\site-packages\\matplotlib\\ft2font.cp37-win_amd64.pyd'
Consider using the `--user` option or check the permissions.

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: ok


You might want to restart the kernel now.
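
The pip output above ends with an access error, so it is worth confirming that the package can actually be found before importing it. A minimal sketch using only the standard library; the helper name `is_installed` is an assumption made for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pandas_profiling", "json"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `pandas_profiling` is reported as missing, rerunning the install with the `--user` option suggested in the error message, or from a prompt with sufficient permissions, is a reasonable next step.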

Import libraries

In [3]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests

# Importing pandas_profiling registers the DataFrame.profile_report() method used below
import pandas_profiling
from pandas_profiling.utils.cache import cache_file

ModuleNotFoundError: No module named 'pandas_profiling'

Load and prepare example dataset
We add some synthetic variables to illustrate the capabilities of pandas-profiling.

In [None]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

# Note: with nanosecond resolution, pandas timestamps cannot represent dates before
# 1677-09-21, so errors="coerce" turns such years into NaT and they are ignored in this analysis
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
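# Copy the first 10 rows so that editing them does not touch the original frame
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
df = pd.concat([df, duplicates_to_add], ignore_index=True)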
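
The limit on representable dates mentioned in the cell above follows from storing a timestamp as a signed 64-bit count of nanoseconds since the Unix epoch, which has long been the default resolution in pandas. The arithmetic can be sketched with the standard library alone; the variable names are chosen for this example, and the result is truncated to microseconds because `datetime` has no nanosecond field.

```python
from datetime import datetime, timedelta

epoch = datetime(1970, 1, 1)

# Largest signed 64-bit integer, read as nanoseconds and truncated to microseconds
max_offset = timedelta(microseconds=(2**63 - 1) // 1000)

lower = epoch - max_offset  # earliest representable instant, in 1677
upper = epoch + max_offset  # latest representable instant, in 2262

print(lower)
print(upper)
```

Values outside this window cannot be stored at nanosecond resolution, which is why `errors="coerce"` is used when parsing the `year` column.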

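
One subtlety of the "mixed" example: NumPy converts a plain Python list such as `[1, "A"]` to a single string dtype before sampling from it, so the resulting column may hold the strings `"1"` and `"A"` rather than an integer and a string. The sketch below checks this and shows one possible way to keep the original Python types, by sampling from an object array. It uses `np.random.default_rng` with a fixed seed for reproducibility, which is an assumption of this example and not what the cell above does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling from a plain list: NumPy first builds a string array from it
as_list = rng.choice([1, "A"], size=1000)

# Sampling from an object array: each element keeps its Python type
as_object = rng.choice(np.array([1, "A"], dtype=object), size=1000)

list_types = {type(value).__name__ for value in as_list.tolist()}
object_types = {type(value).__name__ for value in as_object.tolist()}

print(as_list.dtype, sorted(list_types))
print(as_object.dtype, sorted(object_types))
```

Whether the string-only or the truly mixed column is the better illustration depends on which type detection behaviour one wants the report to show.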

Inline report without saving object

In [None]:
report = df.profile_report(
    sort="None", html={"style": {"full_width": True}}, progress_bar=False
)
report

Save report to file

In [None]:
profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")
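
The hard-coded `/tmp` path assumes a Unix-like system, while the pip output earlier in this notebook suggests a Windows machine. A minimal sketch of building the output path in a platform-independent way; the helper name `report_path` is hypothetical.

```python
import tempfile
from pathlib import Path


def report_path(file_name: str = "example.html") -> Path:
    """Build an output path inside the platform's temporary directory."""
    return Path(tempfile.gettempdir()) / file_name


output_file = report_path()
print(output_file)

# With a ProfileReport object named profile_report, as in the cell above,
# the path could then be passed on (shown as a comment, since it needs the report):
# profile_report.to_file(output_file)
```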

More analysis (Unicode) and printing an existing ProfileReport object inline

In [None]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report


Notebook Widgets

In [None]:
profile_report.to_widgets()