**Note**: This exercise is adapted from the original [here](https://github.com/pandas-profiling/pandas-profiling/blob/master/examples/meteorites/meteorites.ipynb). As of September 2020, installing [pandas_profiling from conda](https://anaconda.org/conda-forge/pandas-profiling) might give you an old version (1.41), because some conda channels lag behind the latest release on [PyPI](https://pypi.org/project/pandas-profiling/) (2.9.0 as of September 2020). To be clear about what was used, the exact environment and library versions for this exercise are listed in the Pipfile (see [pipenv](https://pipenv-fork.readthedocs.io/en/latest/) for more details) of this example [here](https://github.com/andrewm4894/pandas-profiling/blob/master/Pipfile).


## Pandas Profiling: NASA Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh


The autoreload extension reloads modules automatically before each cell runs, which is helpful after the package update below.


In [1]:
# %load_ext autoreload
# %autoreload 2


Make sure that we have the expected version of pandas-profiling (the command below pins 2.9.0).


In [2]:
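# uncomment and run below if you need to pip install the pandas-profiling library
# note: the imports further down use ydata_profiling, the name pandas-profiling was later renamed to;
# install ydata-profiling instead if that import fails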
# import sys
#!{sys.executable} -m pip install -U pandas-profiling==2.9.0
#!jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.
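
After restarting, it can help to confirm which version is actually installed. The sketch below is a convenience that is not part of the original notebook: it relies on the standard library's `importlib.metadata` (Python 3.8+) and on a hypothetical helper, `installed_version`, which tries both distribution names because the package was published as `pandas-profiling` and later as `ydata-profiling`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(*distribution_names):
    """Return (name, version) for the first installed distribution, else None."""
    for name in distribution_names:
        try:
            return name, version(name)
        except PackageNotFoundError:
            continue
    return None


# Hypothetical check: try the newer distribution name first, then the older one.
print(installed_version("ydata-profiling", "pandas-profiling"))
```

If this prints `None`, neither distribution is visible to the running kernel, which usually means the install went into a different environment.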


### Import libraries


In [1]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import ydata_profiling
from ydata_profiling.utils.cache import cache_file


### Load and prepare example dataset

We add some fake variables to illustrate pandas-profiling's capabilities.


In [2]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

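# original frame snapshot
display(df.sample(5))
display(df.info())  # info() prints its summary and returns None, so display() also shows None

# Note: pandas nanosecond timestamps cannot represent dates before 1677-09-21,
# so we skip the datetime conversion for this analysis
# df['year'] = pd.to_datetime(df['year'], errors='coerce')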


# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

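# Example: Duplicate observations
# take an explicit copy so the assignment below does not touch a view of df
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add["name"] = (
    duplicates_to_add["name"] + " copy"
)  # no u prefix needed: Python 3 strings are Unicode by default

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
# df = df.append(duplicates_to_add, ignore_index=True)
df = pd.concat([df, duplicates_to_add])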

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
38807,Yamato 74380,24758,Valid,H5,2.96,Found,1974.0,0.0,35.66667,"(0.0, 35.66667)"
37356,Sayh al Uhaymir 533,55470,Valid,L6,146.35,Found,2010.0,20.26653,56.51653,"(20.26653, 56.51653)"
19573,Larkman Nunatak 06739,47573,Valid,H6,11.9,Found,2006.0,,,
39826,Yamato 790568,25917,Valid,LL,20.11,Found,1979.0,-71.5,35.66667,"(-71.5, 35.66667)"
29633,Northwest Africa 5959,50845,Valid,Howardite,1750.0,Found,2009.0,0.0,0.0,"(0.0, 0.0)"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  float64
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(4), int64(1), object(5)
memory usage: 3.5+ MB


None
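
The `year` column arrives as `float64` because it contains missing values. As a sketch of an alternative to the skipped datetime conversion, and assuming only the year number matters, the values can be held in pandas' nullable `Int64` dtype. The sample values and the names `years` and `years_int` are made up for illustration.

```python
import pandas as pd

# Made-up sample: float years with one missing value, one of them very old.
years = pd.Series([1974.0, None, 860.0, 2010.0], name="year")

# Nullable integers keep the missing entry as <NA> and impose no date range.
years_int = years.astype("Int64")
print(years_int)

# For reference, the earliest value a nanosecond Timestamp can hold.
print(pd.Timestamp.min)
```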

In [3]:
df.sample(5)


Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation,source,boolean,mixed,reclat_city
18314,LaPaz Icefield 03761,36167,Valid,L5,26.1,Found,2003.0,,,,NASA,True,A,
13678,Grove Mountains 020281,47764,Valid,Mesosiderite,1.79,Found,2003.0,-72.978889,75.260833,"(-72.978889, 75.260833)",NASA,True,A,-80.163394
40707,Yamato 791498,26847,Valid,CR2,3.11,Found,1979.0,-71.5,35.66667,"(-71.5, 35.66667)",NASA,False,1,-73.568851
4215,Cumulus Hills 04063,32519,Valid,Pallasite,6188.3,Found,2003.0,,,,NASA,False,1,
40232,Yamato 791004,26353,Valid,H6,225.09,Found,1979.0,-71.5,35.66667,"(-71.5, 35.66667)",NASA,False,A,-65.468618
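
One caveat worth checking in the `mixed` column: NumPy converts the list `[1, "A"]` to a single string dtype before sampling, so every sampled value, including `1`, comes back as a string. The sketch below shows this on a small sample and, as one possible alternative (not what the notebook does), samples from an `object` array so the integer survives. The seed, the sample size, and the names `as_strings`, `pool`, and `truly_mixed` are arbitrary.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the sketch repeatable

# Same call shape as in the notebook: the list is coerced to a string array first.
as_strings = np.random.choice([1, "A"], 20)
print(as_strings.dtype.kind)  # "U" means a NumPy unicode string dtype

# Possible alternative: an object array keeps the int and the str as they are.
pool = np.array([1, "A"], dtype=object)
truly_mixed = np.random.choice(pool, 20)
print({type(value).__name__ for value in truly_mixed})
```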


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45726 entries, 0 to 9
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45726 non-null  object 
 1   id           45726 non-null  int64  
 2   nametype     45726 non-null  object 
 3   recclass     45726 non-null  object 
 4   mass (g)     45595 non-null  float64
 5   fall         45726 non-null  object 
 6   year         45435 non-null  float64
 7   reclat       38411 non-null  float64
 8   reclong      38411 non-null  float64
 9   GeoLocation  38411 non-null  object 
 10  source       45726 non-null  object 
 11  boolean      45726 non-null  bool   
 12  mixed        45726 non-null  object 
 13  reclat_city  38411 non-null  float64
dtypes: bool(1), float64(5), int64(1), object(7)
memory usage: 4.9+ MB
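
The index line above (`45726 entries, 0 to 9`) reflects that `pd.concat` keeps each frame's original row labels, so the ten appended rows reuse labels 0 to 9. A minimal sketch on two small stand-in frames shows the difference `ignore_index=True` would make; the frame contents and variable names are illustrative only, and the notebook itself keeps the original labels.

```python
import pandas as pd

# Small stand-in frames for the dataset and its appended copies.
base = pd.DataFrame({"name": ["Aachen", "Aarhus", "Abee"]})
copies = base.iloc[0:2].copy()
copies["name"] = copies["name"] + " copy"

kept = pd.concat([base, copies])                           # labels repeat
renumbered = pd.concat([base, copies], ignore_index=True)  # fresh 0..n-1 labels

print(kept.index.tolist())
print(renumbered.index.tolist())
print(kept.index.is_unique, renumbered.index.is_unique)
```

With repeated labels, a lookup such as `kept.loc[0]` returns two rows rather than one, which is worth keeping in mind when inspecting the combined frame by label.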


### Inline report without saving object


In [5]:
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report



### Save report to file


In [6]:
profile_report = df.profile_report(
    html={"style": {"full_width": True}}, correlations={"auto": {"calculate": False}}
)

my_path = Path("../tmp/example.html")
# create the report's parent directory if it does not exist yet
my_path.parent.mkdir(parents=True, exist_ok=True)
profile_report.to_file(my_path)


### More analysis (Unicode) and Print existing ProfileReport object inline


In [7]:
profile_report = df.profile_report(
    explorative=True, html={"style": {"full_width": True}}
)
profile_report




### Notebook Widgets


In [10]:
# profile_report.to_widgets()
