Configuration for topk and sort order (#206)

* bugfix for describe and convert_dtypes
* added back metadata series test
* black
* default to pandas display when df.dtypes printed
* various fixes to support int columns
* skip series vis for df.iterrows series element
* config setting for modifying top K and sorting
* note about regenerated config
lux-org · Jan 9, 2021 · 623fb51
1 parent 9dc0958
commit 623fb51

Showing 19 changed files with 188 additions and 22 deletions.
diff --git a/doc/source/reference/config.rst b/doc/source/reference/config.rst
@@ -2,7 +2,28 @@
 Configuration Settings 
 ***********************
 
-In Lux, users can customize various global settings to configure the behavior of Lux through :py:class:`lux.config.Config`. This page documents some of the configurations that you can apply in Lux.
+In Lux, users can customize various global settings to configure the behavior of Lux through :py:class:`lux.config.Config`. These configurations are applied across all dataframes in the session. This page documents some of the configurations that you can apply in Lux.
+
+.. note::
+
+    Lux caches previously generated recommendations, so if you have already printed the dataframe, the recommendations are not regenerated when you change a config property. For the new setting to take effect, explicitly expire the recommendations as follows:
+
+        .. code-block:: python
+
+            df = pd.read_csv("..")
+            df # recommendations already generated here
+
+            df.expire_recs()
+            lux.config.SOME_SETTING = "..."
+            df # recommendation will be generated again here
+
+    Alternatively, set the config properties before you print the dataframe for the first time:
+
+        .. code-block:: python
+
+            df = pd.read_csv("..")
+            lux.config.SOME_SETTING = "..."
+            df # recommendations generated for the first time with config
 
 
 Change the default display of Lux
@@ -108,3 +129,35 @@ The above results in the following changes:
 
 See `this page <https://lux-api.readthedocs.io/en/latest/source/guide/style.html>`__ for more details.
 
+Modify Sorting and Ranking in Recommendations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In Lux, we select a small subset of visualizations to display in each action tab to avoid displaying too many charts at once. 
+Certain recommendation categories rank the visualizations and select the top K most interesting ones to display.
+You can modify the sorting order and selection cutoff via :code:`lux.config`.
+By default, the recommendations are sorted in :code:`"descending"` order based on their interestingness score. You can reverse the ordering by setting the sort order as:
+
+.. code-block:: python 
+
+    lux.config.sort = "ascending"
+
+To turn off score-based sorting completely, so that the visualizations show up in the same order across all dataframes, set the sort order to :code:`"none"`:
+
+.. code-block:: python 
+
+    lux.config.sort = "none"
+
+For recommendation actions that generate a lot of visualizations, the default cutoff is the top 15 visualizations. If you would like to see only the top 6 visualizations, you can set:
+
+.. code-block:: python 
+
+    lux.config.topk = 6
+
+To turn off the top K selection completely and display everything, set:
+
+.. code-block:: python 
+
+    lux.config.topk = False
+
+Beware that this may generate a large number of visualizations. For example, with 10 quantitative variables, the Correlation action will generate 45 scatterplots (one per pair of variables).
+
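
Note on the figure of 45 scatterplots in the documentation above: it matches the number of unordered pairs among 10 variables. A quick check, assuming the Correlation action produces one scatterplot per pair of quantitative columns (the column names below are placeholders, not taken from any dataset):

```python
from itertools import combinations
from math import comb

# Hypothetical column names standing in for 10 quantitative variables.
columns = [f"q{i}" for i in range(10)]

# One scatterplot per unordered pair of columns
# (an assumption about how the Correlation action enumerates charts).
pairs = list(combinations(columns, 2))

print(len(pairs))   # 45
print(comb(10, 2))  # 45
```

This agrees with the `== 45` assertion in `test_topk` further down in this commit.
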
diff --git a/doc/source/reference/gen/lux._config.config.Config.rst b/doc/source/reference/gen/lux._config.config.Config.rst
@@ -14,6 +14,8 @@ lux.\_config.config.Config
    .. autosummary::
 
       ~Config.__init__
+      ~Config.register_action
+      ~Config.remove_action
       ~Config.set_SQL_connection
       ~Config.set_executor_type
 
@@ -30,5 +32,7 @@ lux.\_config.config.Config
       ~Config.sampling
       ~Config.sampling_cap
       ~Config.sampling_start
+      ~Config.sort
+      ~Config.topk
 
 
diff --git a/doc/source/reference/gen/lux.core.series.LuxSeries.rst b/doc/source/reference/gen/lux.core.series.LuxSeries.rst
@@ -53,7 +53,6 @@ lux.core.series.LuxSeries
       ~LuxSeries.cumsum
       ~LuxSeries.describe
       ~LuxSeries.diff
-      ~LuxSeries.display_pandas
       ~LuxSeries.div
       ~LuxSeries.divide
       ~LuxSeries.divmod

diff --git a/doc/source/reference/gen/lux.vis.VisList.VisList.rst b/doc/source/reference/gen/lux.vis.VisList.VisList.rst
@@ -14,7 +14,6 @@ lux.vis.VisList.VisList
    .. autosummary::
 
       ~VisList.__init__
-      ~VisList.bottomK
       ~VisList.get
       ~VisList.map
       ~VisList.normalize_score
@@ -23,8 +22,8 @@ lux.vis.VisList.VisList
       ~VisList.remove_index
       ~VisList.set
       ~VisList.set_intent
+      ~VisList.showK
       ~VisList.sort
-      ~VisList.topK
 
 
 

diff --git a/doc/source/reference/gen/lux.vislib.altair.AltairChart.AltairChart.rst b/doc/source/reference/gen/lux.vislib.altair.AltairChart.AltairChart.rst
@@ -19,6 +19,7 @@ lux.vislib.altair.AltairChart.AltairChart
       ~AltairChart.apply_default_config
       ~AltairChart.encode_color
       ~AltairChart.initialize_chart
+      ~AltairChart.sanitize_dataframe
 
 
 

diff --git a/doc/source/reference/gen/lux.vislib.altair.BarChart.BarChart.rst b/doc/source/reference/gen/lux.vislib.altair.BarChart.BarChart.rst
@@ -20,6 +20,7 @@ lux.vislib.altair.BarChart.BarChart
       ~BarChart.apply_default_config
       ~BarChart.encode_color
       ~BarChart.initialize_chart
+      ~BarChart.sanitize_dataframe
 
 
 

diff --git a/doc/source/reference/gen/lux.vislib.altair.Histogram.Histogram.rst b/doc/source/reference/gen/lux.vislib.altair.Histogram.Histogram.rst
@@ -19,6 +19,7 @@ lux.vislib.altair.Histogram.Histogram
       ~Histogram.apply_default_config
       ~Histogram.encode_color
       ~Histogram.initialize_chart
+      ~Histogram.sanitize_dataframe
 
 
 

diff --git a/doc/source/reference/gen/lux.vislib.altair.LineChart.LineChart.rst b/doc/source/reference/gen/lux.vislib.altair.LineChart.LineChart.rst
@@ -19,6 +19,7 @@ lux.vislib.altair.LineChart.LineChart
       ~LineChart.apply_default_config
       ~LineChart.encode_color
       ~LineChart.initialize_chart
+      ~LineChart.sanitize_dataframe
 
 
 

diff --git a/doc/source/reference/gen/lux.vislib.altair.ScatterChart.ScatterChart.rst b/doc/source/reference/gen/lux.vislib.altair.ScatterChart.ScatterChart.rst
@@ -19,6 +19,7 @@ lux.vislib.altair.ScatterChart.ScatterChart
       ~ScatterChart.apply_default_config
       ~ScatterChart.encode_color
       ~ScatterChart.initialize_chart
+      ~ScatterChart.sanitize_dataframe
 
 
 

diff --git a/lux/_config/config.py b/lux/_config/config.py
@@ -3,9 +3,9 @@
 For more resources, see https://github.com/pandas-dev/pandas/blob/master/pandas/_config
 """
 from collections import namedtuple
-from typing import Any, Callable, Dict, Iterable, List, Optional
-import warnings
+from typing import Any, Callable, Dict, Iterable, List, Optional, Union
 import lux
+import warnings
 
 RegisteredOption = namedtuple("RegisteredOption", "name action display_condition args")
 
@@ -30,6 +30,55 @@ def __init__(self):
         self._sampling_cap = 30000
         self._sampling_flag = True
         self._heatmap_flag = True
+        self._topk = 15
+        self._sort = "descending"
+
+    @property
+    def topk(self):
+        return self._topk
+
+    @topk.setter
+    def topk(self, k: Union[int, bool]):
+        """
+        Setting parameter to display top k visualizations in each action
+
+        Parameters
+        ----------
+        k : Union[int,bool]
+            False: display all visualizations (no top-k)
+            k: number of visualizations to display
+        """
+        if isinstance(k, (int, bool)):
+            self._topk = k
+        else:
+            warnings.warn(
+                "Parameter to lux.config.topk must be an integer or a boolean.",
+                stacklevel=2,
+            )
+
+    @property
+    def sort(self):
+        return self._sort
+
+    @sort.setter
+    def sort(self, flag: str):
+        """
+        Setting parameter to determine sort order of each action
+
+        Parameters
+        ----------
+        flag : str
+            "none", "ascending","descending"
+            No sorting, sort by ascending order, sort by descending order
+        """
+        # Check the type before lowercasing so that non-string input warns instead of raising
+        if isinstance(flag, str) and flag.lower() in ["none", "ascending", "descending"]:
+            self._sort = flag.lower()
+        else:
+            warnings.warn(
+                "Parameter to lux.config.sort must be one of the following: 'none', 'ascending', or 'descending'.",
+                stacklevel=2,
+            )
 
     @property
     def sampling_cap(self):
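
Note on the two setters above: the pattern is "validate, then store, otherwise warn and keep the old value". A minimal standalone sketch of that pattern follows. `ConfigSketch` and `VALID_SORT_ORDERS` are illustrative names that do not exist in Lux, and the warning texts are shortened. The sketch checks the type before calling `.lower()`, so non-string input produces a warning.

```python
import warnings

VALID_SORT_ORDERS = ("none", "ascending", "descending")


class ConfigSketch:
    """Stand-in for a settings object; not the real lux Config class."""

    def __init__(self):
        self._topk = 15
        self._sort = "descending"

    @property
    def topk(self):
        return self._topk

    @topk.setter
    def topk(self, k):
        # bool is a subclass of int, so this accepts both False and integers.
        if isinstance(k, int):
            self._topk = k
        else:
            warnings.warn("topk must be an integer or a boolean.", stacklevel=2)

    @property
    def sort(self):
        return self._sort

    @sort.setter
    def sort(self, flag):
        # Check the type first so that non-string input warns instead of raising.
        if isinstance(flag, str) and flag.lower() in VALID_SORT_ORDERS:
            self._sort = flag.lower()
        else:
            warnings.warn("sort must be 'none', 'ascending', or 'descending'.", stacklevel=2)


config = ConfigSketch()
config.sort = "Ascending"  # accepted, stored in lowercase
config.topk = False        # accepted, disables the cutoff

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    config.sort = 3        # rejected with a warning, previous value kept
    config.topk = "six"    # rejected with a warning, previous value kept

print(config.sort, config.topk, len(caught))  # ascending False 2
```
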

diff --git a/lux/action/correlation.py b/lux/action/correlation.py
@@ -77,7 +77,8 @@ def correlation(ldf: LuxDataFrame, ignore_transpose: bool = True):
     if ignore_rec_flag:
         recommendation["collection"] = []
         return recommendation
-    vlist = vlist.topK(15)
+    vlist.sort()
+    vlist = vlist.showK()
     recommendation["collection"] = vlist
     return recommendation
 

diff --git a/lux/action/enhance.py b/lux/action/enhance.py
@@ -66,6 +66,7 @@ def enhance(ldf):
     for vis in vlist:
         vis.score = interestingness(vis, ldf)
 
-    vlist = vlist.topK(15)
+    vlist.sort()
+    vlist = vlist.showK()
     recommendation["collection"] = vlist
     return recommendation
diff --git a/lux/action/filter.py b/lux/action/filter.py
@@ -132,7 +132,8 @@ def get_complementary_ops(fltr_op):
     vlist_copy = lux.vis.VisList.VisList(output, ldf)
     for i in range(len(vlist_copy)):
         vlist[i].score = interestingness(vlist_copy[i], ldf)
-    vlist = vlist.topK(15)
+    vlist.sort()
+    vlist = vlist.showK()
     if recommendation["action"] == "Similarity":
         recommendation["collection"] = vlist[1:]
     else:

diff --git a/lux/action/generalize.py b/lux/action/generalize.py
@@ -93,5 +93,6 @@ def generalize(ldf):
 
     vlist.remove_duplicates()
     vlist.sort(remove_invalid=True)
+    vlist._collection = list(filter(lambda x: x.score != -1, vlist._collection))
     recommendation["collection"] = vlist
     return recommendation
diff --git a/lux/action/univariate.py b/lux/action/univariate.py
@@ -82,7 +82,6 @@ def univariate(ldf, *args):
     vlist = VisList(intent, ldf)
     for vis in vlist:
         vis.score = interestingness(vis, ldf)
-    # vlist = vlist.topK(15) # Basic visualizations should not be capped
     vlist.sort()
     recommendation["collection"] = vlist
     return recommendation
diff --git a/lux/core/series.py b/lux/core/series.py
@@ -84,8 +84,12 @@ def __repr__(self):
         ldf = LuxDataFrame(self)
 
         try:
+            # Ignore recommendations when the Series is a result of:
+            # 1) A series whose values are dtype objects (df.dtypes)
             is_dtype_series = all(isinstance(val, np.dtype) for val in self.values)
-            if ldf._pandas_only or is_dtype_series:
+            # 2) Mixed type, often a result of a "row" acting as a series (df.iterrows, df.iloc[0])
+            mixed_dtype = len({type(val) for val in self.values}) > 1
+            if ldf._pandas_only or is_dtype_series or mixed_dtype:
                 print(series_repr)
                 ldf._pandas_only = False
             else:
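
Note on the mixed-type check above: it flags a series whose values are of more than one type. A small sketch of the same test on plain Python lists follows. The real code runs on `self.values` of a pandas Series, where the element types may be NumPy scalar types; the helper name and sample values here are hypothetical.

```python
def has_mixed_types(values):
    """Return True when the values are of more than one Python type."""
    return len({type(v) for v in values}) > 1


# Hypothetical values: a "row" mixes strings and numbers, a column does not.
row = ["Some University", 4206, 0.8989]
column = [4206, 1033, 5305]

print(has_mixed_types(row))     # True
print(has_mixed_types(column))  # False
```
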

diff --git a/lux/vis/VisList.py b/lux/vis/VisList.py
@@ -233,18 +233,22 @@ def sort(self, remove_invalid=True, descending=True):
         # remove the items that have invalid (-1) score
         if remove_invalid:
             self._collection = list(filter(lambda x: x.score != -1, self._collection))
+        if lux.config.sort == "none":
+            return
+        elif lux.config.sort == "ascending":
+            descending = False
+        elif lux.config.sort == "descending":
+            descending = True
         # sort in-place by “score” by default if available, otherwise user-specified field to sort by
         self._collection.sort(key=lambda x: x.score, reverse=descending)
 
-    def topK(self, k):
-        # sort and truncate list to first K items
-        self.sort(remove_invalid=True)
-        return VisList(self._collection[:k])
-
-    def bottomK(self, k):
-        # sort and truncate list to first K items
-        self.sort(descending=False, remove_invalid=True)
-        return VisList(self._collection[:k])
+    def showK(self):
+        k = lux.config.topk
+        if k == False:
+            return self
+        elif isinstance(k, int):
+            k = abs(k)
+            return VisList(self._collection[:k])
 
     def normalize_score(self, invert_order=False):
         max_score = max(list(self.get("score")))
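
Note on how `sort()` and `showK()` combine in the actions: invalid scores are dropped, the list is ordered according to the sort setting, and the result is truncated to the first K items. The sketch below reproduces that flow on plain floats instead of `Vis` objects. `sort_and_show` is a hypothetical helper, and the settings are passed as arguments instead of being read from `lux.config`.

```python
def sort_and_show(scores, sort="descending", topk=15):
    """Return the scores that would be displayed under the given settings."""
    # Drop invalid entries, which the diff marks with a score of -1.
    kept = [s for s in scores if s != -1]
    if sort == "ascending":
        kept.sort()
    elif sort == "descending":
        kept.sort(reverse=True)
    # sort == "none" leaves the original order untouched.
    if topk is False:
        return kept
    return kept[:abs(topk)]


scores = [0.4, -1, 0.9, 0.1, 0.7]
print(sort_and_show(scores, topk=2))                    # [0.9, 0.7]
print(sort_and_show(scores, sort="ascending", topk=2))  # [0.1, 0.4]
print(sort_and_show(scores, sort="none", topk=False))   # [0.4, 0.9, 0.1, 0.7]
```

One difference from the diff: the sketch tests `topk is False`, so `topk=0` yields an empty list. With `k == False`, as written in `showK`, a value of `0` also compares equal to `False` and returns the full list.
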

diff --git a/tests/test_config.py b/tests/test_config.py
@@ -28,7 +28,8 @@ def random_categorical(ldf):
         vlist = VisList(intent, ldf)
         for vis in vlist:
             vis.score = 10
-        vlist = vlist.topK(15)
+        vlist.sort()
+        vlist = vlist.showK()
         return {
             "action": "bars",
             "description": "Random list of Bar charts",
@@ -105,7 +106,8 @@ def random_categorical(ldf):
         vlist = VisList(intent, ldf)
         for vis in vlist:
             vis.score = 10
-        vlist = vlist.topK(15)
+        vlist.sort()
+        vlist = vlist.showK()
         return {
             "action": "bars",
             "description": "Random list of Bar charts",
@@ -235,6 +237,41 @@ def test_heatmap_flag_config():
     lux.config.heatmap = True
 
 
+def test_topk(global_var):
+    df = pd.read_csv("lux/data/college.csv")
+    lux.config.topk = False
+    df._repr_html_()
+    assert len(df.recommendation["Correlation"]) == 45, "Turn off top K"
+    lux.config.topk = 20
+    df = pd.read_csv("lux/data/college.csv")
+    df._repr_html_()
+    assert len(df.recommendation["Correlation"]) == 20, "Show top 20"
+    for vis in df.recommendation["Correlation"]:
+        assert vis.score > 0.2
+
+
+def test_sort(global_var):
+    df = pd.read_csv("lux/data/college.csv")
+    lux.config.topk = 15
+    df._repr_html_()
+    assert len(df.recommendation["Correlation"]) == 15, "Show top 15"
+    for vis in df.recommendation["Correlation"]:
+        assert vis.score > 0.2
+    df = pd.read_csv("lux/data/college.csv")
+    lux.config.sort = "ascending"
+    df._repr_html_()
+    assert len(df.recommendation["Correlation"]) == 15, "Show bottom 15"
+    for vis in df.recommendation["Correlation"]:
+        assert vis.score < 0.2
+
+    lux.config.sort = "none"
+    df = pd.read_csv("lux/data/college.csv")
+    df._repr_html_()
+    scorelst = [x.score for x in df.recommendation["Distribution"]]
+    assert sorted(scorelst) != scorelst, "unsorted setting"
+    lux.config.sort = "descending"
+
+
 # TODO: This test does not pass in pytest but is working in Jupyter notebook.
 # def test_plot_setting(global_var):
 # 	df = pytest.car_df

diff --git a/tests/test_series.py b/tests/test_series.py
@@ -51,3 +51,12 @@ def test_print_dtypes(global_var):
     with warnings.catch_warnings(record=True) as w:
         print(df.dtypes)
         assert len(w) == 0, "Warning displayed when printing dtypes"
+
+
+def test_print_iterrow(global_var):
+    df = pytest.college_df
+    with warnings.catch_warnings(record=True) as w:
+        for index, row in df.iterrows():
+            print(row)
+            break
+        assert len(w) == 0, "Warning displayed when printing iterrow"