In [1]:
from intedact import univariate_eda_interact
import pandas as pd
import seaborn as sns
import ipywidgets as widgets

In [2]:
# These are needed for text summaries
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/mboggess/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mboggess/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Example 1: Diamonds Dataset

The first example uses the classic diamonds dataset, which ships with both ggplot2 and seaborn. It is a good introduction to the basic discrete and continuous summaries.

Recommended Explorations:
- Try changing the number of bins for carat.
- Try removing outliers for the x, y, and z variables.

In [3]:
data = sns.load_dataset("diamonds")
# Ordinal variables must be declared explicitly as ordered categoricals (levels listed from worst to best)
data["cut"] = pd.Categorical(data["cut"], categories=["Fair", "Good", "Very Good", "Premium", "Ideal"], ordered=True)
data["color"] = pd.Categorical(data["color"], categories=["D", "E", "F", "G", "H", "I", "J"], ordered=True)
data["clarity"] = pd.Categorical(data["clarity"], categories=["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"], ordered=True)
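
To see what declaring the order buys you, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the diamonds data; only the `cut` levels match the cell above.

```python
import pandas as pd

# Toy column with made-up values; the notebook itself uses the diamonds "cut" column
cut = pd.Series(["Ideal", "Fair", "Premium", "Good"])

cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered_cut = pd.Categorical(cut, categories=cut_levels, ordered=True)

# Sorting follows the declared quality order rather than alphabetical order
sorted_cut = list(pd.Series(ordered_cut).sort_values())

# Comparisons against a level are only defined for ordered categoricals
at_least_premium = list(ordered_cut >= "Premium")

print(sorted_cut)
print(at_least_premium)
```

Without `ordered=True`, the comparison raises a `TypeError`, and plots that rely on the level order fall back to whatever order the categories happen to have.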

In [4]:
univariate_eda_interact(data, notes_file="diamonds.json", figure_dir=".")

interactive(children=(Dropdown(description='Column: Column to be plotted', options=('carat', 'cut', 'color', '…

# Example 2: Tidy Tuesday GDPR Violations

Recommended Explorations:
- Try using a log transform on the price column.
- Check out the date column for an example of a datetime summary. Try setting the lower quantile option to 0.06 so you can see the main time series.
- Check out the summary column for an example of a text summary. The top ngrams are not computed by default, so check the 'Plot most common ngrams' option to plot the top unigrams, bigrams, and trigrams. Because tokenizing text can be slow, auto-updating is turned off for this summary: press the 'Run Interact' button to refresh it after changing any control option.
- Check out the article_violated column for an example of a collections summary.
- Check out the source column for an example of a url summary.
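
The log transform and quantile options suggested above can be understood with a small pandas sketch. This only illustrates the general idea and is not how intedact implements these controls; `trim_quantiles` is a hypothetical helper and the amounts are made up.

```python
import numpy as np
import pandas as pd

# Made-up amounts with one extreme value, standing in for a skewed numeric column
price = pd.Series([500, 1_000, 2_500, 5_000, 10_000, 50_000_000], name="price")

# A log transform compresses the long right tail so the bulk of the data is visible
log_price = np.log10(price)


def trim_quantiles(values, lower=0.0, upper=1.0):
    """Hypothetical helper: keep values between the given lower and upper quantiles."""
    lo, hi = values.quantile(lower), values.quantile(upper)
    return values[(values >= lo) & (values <= hi)]


# Dropping everything above the 90th percentile removes the single outlier
trimmed = trim_quantiles(price, upper=0.9)
print(trimmed.tolist())
```

Raising the lower quantile works the same way from the other end, which is why it helps with a date column that has a few stray early values.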

In [5]:
data = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv", sep="\t")
data["date"] = pd.to_datetime(data["date"])
# Collection columns must be encoded as a Python iterable
data["article_violated"] = data["article_violated"].apply(lambda x: x.split("|"))
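
As a rough illustration of the encoding step above, the sketch below splits a few made-up pipe-delimited strings into lists and counts the individual items. The sample strings are assumptions for illustration, not rows from the actual file.

```python
import pandas as pd

# Made-up pipe-delimited values in the same shape as the article_violated column
raw = pd.Series(["Art. 5 GDPR|Art. 6 GDPR", "Art. 32 GDPR", "Art. 5 GDPR"])

# Each cell becomes a Python list, which is what a collection column needs
as_lists = raw.apply(lambda x: x.split("|"))

# Exploding the lists gives one row per item, so items can be counted individually
counts = as_lists.explode().value_counts()
print(counts)
```

Note that `x.split("|")` fails on missing values, so a column containing NaN would need them filled or dropped first.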

In [6]:
univariate_eda_interact(data, notes_file="gdpr_violations.json", figure_dir=".")

interactive(children=(Dropdown(description='Column: Column to be plotted', options=('id', 'picture', 'name', '…
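
For intuition about the 'Plot most common ngrams' option in the text summary, here is a simplified sketch of ngram counting in plain Python. It assumes whitespace tokenization and made-up sentences; intedact's own text summary relies on nltk and may tokenize and filter stopwords differently. `top_ngrams` is a hypothetical helper.

```python
from collections import Counter


def top_ngrams(texts, n, k=3):
    """Hypothetical helper: return the k most common n-grams across texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenizer
        # Zipping n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Made-up example sentences
texts = [
    "data protection failure",
    "data protection officer missing",
    "insufficient data protection",
]

for n in (1, 2, 3):
    print(n, top_ngrams(texts, n))
```

Even in this toy version the work grows with the total number of tokens, which is why turning off auto-updating for text columns is a reasonable default.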

# Example 3: Tidy Tuesday Tweets

This is a larger social media dataset with many columns. Use it to get a feel for exploring a wide dataset one column at a time.

In [10]:
data = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/tidytuesday_tweets/data.csv")
data["created_at"] = pd.to_datetime(data["created_at"])

In [11]:
univariate_eda_interact(data, notes_file="tidy_tuesday_tweets.json", figure_dir=".")

interactive(children=(Dropdown(description='Column: Column to be plotted', options=('week', 'user_id', 'status…