In [1]:
import pandas as pd
import sklearn
from sklearn import datasets

In [18]:
dataset = sklearn.datasets.fetch_california_housing(as_frame=True)

In [19]:
df = dataset['data'].copy()  # X (copy so the original Bunch is not modified)
df['MedHouseVal'] = dataset['target']  # y

In [None]:
!pip install lux-api
!jupyter nbextension install --py luxwidget
!jupyter nbextension enable --py luxwidget

In [None]:
!pip install dtale

# D-Tale

According to the developers: "D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view & analyze Pandas data structures. It integrates seamlessly with ipython notebooks & python/ipython terminals. Currently this tool supports such Pandas objects as DataFrame, Series, MultiIndex, DatetimeIndex & RangeIndex."

See the GitHub repo for up-to-date details: https://github.com/man-group/dtale

In general, this is very useful for EDA! However, for reproducibility, any final analysis should still be scripted. D-Tale is a powerful tool for quickly viewing and summarizing data, identifying possible issues, and generating hypotheses - exactly what EDA is supposed to do.

In [4]:
import dtale 

In [5]:
df # In Pandas

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [6]:
dtale.show(df) # In D-Tale!



To look at the dataframe in a new browser tab, try the code below. Note that this does not seem to work when logged into a remote Jupyter server. However, there are lots of configuration options, and even a way to use this on Colab.
Refer to the documentation for the most up-to-date examples: https://github.com/man-group/dtale

In [None]:
# dtale.show(df).open_browser()
# Alternatively, click the '>' icon in the D-Tale grid and select 'Open in New Tab'

Some things to try:
    
1. Click on the header to get a description of the rows, types, skew, outliers, etc.
2. Select 'Filter outliers' at the bottom of the pop-up to remove rows that are considered outliers.
3. Click 'Describe' to see descriptive statistics - also check out the histograms and Q-Q plot!
4. View Duplicates.
5. Convert the column type.
6. Select 'Heat Map' to highlight rows based on values.
7. Use 'Replacements' to replace NaN, etc. using various tools including sklearn imputers.
8. Use filters at the bottom to filter rows based on criteria; click on the icon to get a pop-up that lets you build more complex criteria!
9. Double-click on a cell to change its value!
10. Under the '>' icon in the top left, select 'Clean Columns' to explore options to clean data.
11. Under the '>' icon in the top left, select 'Feature Analysis by Correlation' to see how columns are correlated with one another.
12. Under the '>' icon in the top left, select 'Charts' to make various plots of features against each other.
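
Once the GUI has pointed you at something interesting, the same checks can be scripted for reproducibility. The block below is a minimal sketch in plain pandas, not D-Tale's own code: the `demo` frame is a small made-up stand-in for the housing data, and the 1.5 * IQR rule is an assumed outlier definition that may differ from what D-Tale applies.

```python
import pandas as pd

# Small synthetic frame standing in for the housing data (values are made up).
demo = pd.DataFrame({
    "MedInc": [8.3, 8.3, 7.2, 5.6, 3.8, 50.0],
    "HouseAge": [41.0, 41.0, 52.0, 52.0, 52.0, 30.0],
})

# Descriptive statistics, comparable to the 'Describe' pop-up.
summary = demo.describe()

# Rows that exactly repeat an earlier row, comparable to 'View Duplicates'.
duplicates = demo[demo.duplicated()]

# Convert a column type (float -> int here).
demo["HouseAge"] = demo["HouseAge"].astype("int64")

# Keep only rows inside the 1.5 * IQR fences (an assumed outlier rule).
q1, q3 = demo["MedInc"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = demo[demo["MedInc"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(summary)
print(duplicates)
print(inliers)
```

To run this on the real data, replace `demo` with `df` and pick the column you explored in D-Tale.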

D-Tale is also "smart" about inferring things from your column labels. For example, if you look at the Describe options for Latitude or Longitude, it will detect the other and give a "Geolocation" option.  Try it out!

Check out Animations under different types of Charts!

# Lux

According to the developers: "Lux is a Python library that makes data science easier by automating certain aspects of the data exploration process. Lux is designed to facilitate faster experimentation with data, even when the user does not have a clear idea of what they are looking for."

See the documentation for up-to-date information: https://lux-api.readthedocs.io/en/latest/index.html

See the installation instructions here: https://lux-api.readthedocs.io/en/latest/source/getting_started/installation.html

If using anaconda, install using: 

```code
$ conda activate myenv
$ conda install -c conda-forge lux-api
$ sudo mkdir /usr/local/share/jupyter; sudo chown -R $USER /usr/local/share/jupyter
$ jupyter nbextension install --py luxwidget
$ jupyter nbextension enable --py luxwidget
```

Note: You do need to **create the dataframe AFTER the import** statement (at least at the time of writing); trying to display previously created dataframes does not work.

In [13]:
import lux

In [21]:
df = dataset['data'].copy()  # X (re-created after `import lux`)
df['MedHouseVal'] = dataset['target']  # y

In [20]:
df

Button(description='Toggle Pandas/Lux', layout=Layout(top='5px', width='140px'), style=ButtonStyle())

Output()

By default, up to 3 types of ["analytical actions"](https://lux-api.readthedocs.io/en/latest/source/reference/lux.action.html) are used to determine what is "interesting".

* The Correlation Tab shows pairwise relationships between quantitative attributes from most linearly correlated (Pearson's correlation score) to least correlated.
* The Distribution Tab shows univariate distributions ordered from most skewed to least skewed. 
* The Occurrence Tab shows up for categorical features and displays a series of bar charts. (This dataset has no categorical features, so there are no examples here.)
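
To make the first two orderings concrete, here is a rough sketch of how similar rankings could be computed by hand in pandas. This is only an illustration of the idea, not Lux's actual scoring code; the `demo` frame and its column names are hypothetical, and ranking by absolute value is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns, fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
demo = pd.DataFrame({
    "a": x,
    "b": 2.0 * x + rng.normal(scale=0.1, size=n),  # strongly correlated with "a"
    "c": rng.normal(size=n),                       # independent noise
    "d": rng.exponential(size=n),                  # right-skewed
})

# Correlation-style ranking: absolute Pearson correlation for each column pair.
corr = demo.corr(method="pearson").abs()
# Keep the upper triangle only so each pair appears once (and no self-pairs).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

# Distribution-style ranking: columns ordered by absolute sample skewness.
skew_rank = demo.skew().abs().sort_values(ascending=False)

print(pairs)
print(skew_rank)
```

The same two rankings on `df` give a quick, scriptable cross-check of what the Correlation and Distribution tabs show.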

You can "steer" these defaults by suggesting an "intent" described in more detail [here](https://lux-api.readthedocs.io/en/latest/source/getting_started/overview.html#steering-recommendations-via-user-intent).

In [44]:
df.intent = ['MedInc','MedHouseVal']

In [28]:
df # Click "Toggle" to switch back and forth

Button(description='Toggle Pandas/Lux', layout=Layout(top='5px', width='140px'), style=ButtonStyle())

Output()

In [None]:
df.default_display = "lux" # Set Lux as default display

In [29]:
# You can select a window and click the export button (top right)
df.exported

LuxWidget(recommendations=[{'action': 'Vis List', 'description': 'Shows a vis list defined by the intent', 'vs…

In [30]:
# You can see recommendations of "interesting things" from Lux 
df.recommendation

{'Enhance': [<Vis  (x: MedInc, y: MedHouseVal, color: HouseAge) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: Latitude) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: AveRooms) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: AveBedrms) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: Longitude) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: AveOccup) mark: heatmap, score: 0.10 >,
  <Vis  (x: MedInc, y: MedHouseVal, color: Population) mark: heatmap, score: 0.10 >],
 'Generalize': [<Vis  (x: BIN(MedHouseVal), y: COUNT(Record)) mark: histogram, score: 1.00 >,
  <Vis  (x: BIN(MedInc)     , y: COUNT(Record)) mark: histogram, score: 1.00 >]}

In [31]:
# You can also access these plots
df.recommendation['Enhance'][0]

LuxWidget(current_vis={'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}, 'axis': {'labelCo…

In [36]:
# The great thing about Lux is that you can get raw code to reproduce these plots so you can further modify them as needed
code = df.recommendation['Enhance'][0].to_code('matplotlib')

In [38]:
print(code)

import matplotlib.pyplot as plt
plt.rcParams.update(
            {
                "axes.titlesize": 20,
                "axes.titleweight": "bold",
                "axes.labelweight": "bold",
                "axes.labelsize": 16,
                "legend.fontsize": 14,
                "legend.title_fontsize": 15,
                "xtick.labelsize": 13,
                "ytick.labelsize": 13,
            }
        )
import numpy as np
from math import nan
df = pd.DataFrame({'count': {0: 3, 2: 4, 3: 6, 4: 6, 5: 12, 6: 6, 7: 4, 8: 4, 9: 1, 10: 2, 11: 2, 12: 8, 13: 2, 14: 3, 15: 1, 21: 3, 25: 1, 27: 4, 29: 1, 39: 4, 40: 1, 41: 5, 42: 21, 43: 34, 44: 36, 45: 35, 46: 38, 47: 19, 48: 18, 49: 10, 50: 15, 51: 11, 52: 11, 53: 4, 54: 6, 55: 1, 56: 2, 57: 3, 58: 2, 59: 2, 60: 1, 61: 2, 63: 2, 67: 3, 71: 1, 75: 2, 79: 3, 81: 1, 82: 65, 83: 123, 84: 81, 85: 61, 86: 74, 87: 50, 88: 31, 89: 19, 90: 26, 91: 13, 92: 14, 93: 10, 94: 9, 95: 8, 96: 13, 97: 10, 98: 3, 99: 3, 100: 2, 101: 3, 104: 4, 105: 3, 10