## Pandas Profiling: NASA Meteorites example
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The `autoreload` extension reloads modules automatically before each cell is executed, which helps the upgraded package below get picked up.

In [1]:
%load_ext autoreload
%autoreload 2

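Make sure that we have the latest version of pandas-profiling. As the install log shows, this release pulls in ydata-profiling, the package that the imports further down use.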

In [2]:
import sys

!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)
Collecting ydata-profiling
  Downloading ydata_profiling-4.6.1-py2.py3-none-any.whl (357 kB)

Collecting multimethod<2,>=1.4
  Downloading multimethod-1.10-py3-none-any.whl (9.9 kB)
Collecting imagehash==4.3.1
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
Collecting pydantic>=2
  Downloading pydantic-2.4.2-py3-none-any.whl (395 kB)
Collecting htmlmin==0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting wordcloud>=1.9.1
  Downloading wordcloud-1.9.2-cp39-cp39-macosx_10_9_x86_64.whl (161 kB)
Collecting dacite>=1.8
  Downloading dacite-1.8.1-py3-none-any.whl (14 kB)
Collecting numba<0.59.0,>=0.56.0
  Downloading numba-0.58.1-cp39-cp39-macosx_10_9_x86_64.whl (2.6 MB)
Collecting pandas!=1.4.0,<2.1,>1.1
  Downloading pandas-2.0.3-cp39-cp39-macosx_10_9_x86_64.whl (11

Collecting annotated-types>=0.4.0
  Downloading annotated_types-0.6.0-py3-none-any.whl (12 kB)
Collecting typing-extensions>=4.6.1
  Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Collecting pydantic-core==2.10.1
  Downloading pydantic_core-2.10.1-cp39-cp39-macosx_10_7_x86_64.whl (1.9 MB)
Building wheels for collected packages: htmlmin
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-py3-none-any.whl size=27096 sha256=92b009aa8487d31acffca41be8a8a38a4f53b103bf8e1b46ee22e0734f498402
  Stored in directory: /Users/mks9338/Library/Caches/pip/wheels/1d/05/04/c6d7d3b66539d9e659ac6dfe81e2d0fd4c1a8316cc5a403300
Successfully built htmlmin
Installing collected packages: typing-extensions, tangled-up-in-unicode, pandas, multimethod, visions, pydantic-core, llvmlite, imagehash, annotated-types, wordcloud, typeguard, pydantic, phik, numba, htmlmin, d

You might want to restart the kernel now.
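
After the restart, it can be useful to confirm which distributions are actually installed. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of pandas-profiling or ydata-profiling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Both names are checked because the install log above mentions both.
for name in ("pandas-profiling", "ydata-profiling"):
    print(name, installed_version(name))
```

A result of `None` means pip did not install that distribution into the environment the current kernel is running in.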

### Import libraries

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import requests

import ydata_profiling  # importing the package adds profile_report() to DataFrame
from ydata_profiling.utils.cache import cache_file

### Load and prepare example dataset
We add some synthetic variables to illustrate the capabilities of pandas-profiling.

In [None]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

# Note: pandas' default nanosecond-resolution timestamps only cover roughly 1677 to 2262;
# values that cannot be converted are coerced to NaT and thus ignored in this analysis
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
# (NumPy coerces the list [1, "A"] to a string array, so the sampled values are "1" and "A")
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
# .copy() avoids modifying a view of df (and the SettingWithCopyWarning that goes with it)
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)
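
Two of the preparation steps above have behavior that is easy to miss: `errors="coerce"` silently replaces unparseable dates with `NaT`, and NumPy converts a list that mixes `int` and `str` into a string array. The sketch below illustrates both on toy values invented for this example (they are not taken from the meteorite data); the `dtype=object` variant is shown only as one possible way to keep truly mixed types, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "year" column (values invented for illustration).
raw_years = pd.Series(["01/01/1880 12:00:00 AM", "not a date", None])

# errors="coerce" turns anything that cannot be converted into NaT instead of raising.
parsed_years = pd.to_datetime(raw_years, errors="coerce")
n_missing = int(parsed_years.isna().sum())

# A list mixing int and str becomes a fixed-width string array ...
coerced = np.array([1, "A"])

# ... whereas dtype=object keeps the original Python types.
preserved = np.array([1, "A"], dtype=object)
rng = np.random.default_rng(seed=0)
mixed_sample = rng.choice(preserved, size=6)

print(parsed_years)
print("missing after coercion:", n_missing)
print(coerced.dtype, [type(v).__name__ for v in mixed_sample])
```

Counting the `NaT` values after conversion is a cheap way to see how many rows the date parsing dropped before the report is generated.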

### Inline report without saving object

In [None]:
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report

### Save report to file

In [None]:
profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")
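
The hard-coded `/tmp` path only exists on Unix-like systems. A minimal sketch of a more portable target, assuming the report should still land in a temporary directory; the variable name `output_path` is chosen here for illustration.

```python
import tempfile
from pathlib import Path

# gettempdir() returns the platform's temporary directory
# (for example /tmp on many Linux systems).
output_path = Path(tempfile.gettempdir()) / "example.html"
print(output_path)

# With the report object from the cell above, the call would then be:
# profile_report.to_file(output_path)
```

This also puts the `Path` import from the import cell to use.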

### More analysis (Unicode) and printing an existing ProfileReport object inline

In [None]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report

### Notebook Widgets

In [None]:
profile_report.to_widgets()