Pandas extension dtypes cause failure when generating profile report #251

Closed
bribass opened this issue Sep 6, 2019 · 5 comments
Labels
bug 🐛 Something isn't working

Comments

bribass commented Sep 6, 2019

When attempting to profile a data frame that uses an extension dtype (such as Int64 in order to be able to represent missing values), a ValueError is raised.

To Reproduce

The following is a self-contained example that demonstrates the problem:

"""
Test for issue XXX:
https://github.com/pandas-profiling/pandas-profiling/issues/XXX
"""
import pandas as pd
import pandas_profiling


def test_issueXXX():
    # pandas is imported as pd above, so the constructor must be reached via pd
    table = pd.DataFrame([1, 2, 3, 4, 5, 6], columns=['a'])
    table.profile_report()
    table2 = table.astype('Int64')
    table2.profile_report()
    # Traceback (most recent call last):
    #   File "/home/bbassett/scratch.py", line 6, in <module>
    #     table2.profile_report()
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/controller/pandas_decorator.py", line 16, in profile_report
    #     p = ProfileReport(df, **kwargs)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/__init__.py", line 81, in __init__
    #     self.html = to_html(sample, description_set)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/view/report.py", line 521, in to_html
    #     "content": render_variables_section(stats_object),
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/view/report.py", line 448, in render_variables_section
    #     ).render(values=formatted_values)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/jinja2/asyncsupport.py", line 76, in render
    #     return original_render(self, *args, **kwargs)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/jinja2/environment.py", line 1008, in render
    #     return self.environment.handle_exception(exc_info, True)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/jinja2/environment.py", line 780, in handle_exception
    #     reraise(exc_type, exc_value, tb)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/jinja2/_compat.py", line 37, in reraise
    #     raise value.with_traceback(tb)
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/view/templates/variables/row_num.html", line 14, in top-level template code
    #     <td>{{ values['p_unique'] | fmt_percent }}</td>
    #   File "/home/bbassett/venv37/lib/python3.7/site-packages/pandas_profiling/view/formatters.py", line 62, in fmt_percent
    #     raise ValueError("Value '{}' should be a ratio between 1 and 0.".format(value))
    # ValueError: Value '1.1666666666666667' should be a ratio between 1 and 0.
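
The check that raises here can be exercised on its own. The sketch below is a simplified stand-in for the range check in fmt_percent; the name fmt_percent_sketch and the row and distinct counts are assumptions made for illustration, and only the condition and the error message come from the tracebacks in this thread. The reported value equals 7 / 6, which would be consistent with 7 distinct values being counted against 6 rows, although nothing in the issue confirms that this is where the number comes from.

```python
def fmt_percent_sketch(value: float) -> str:
    """Simplified stand-in for the range check seen in the traceback."""
    if not (1.0 >= value >= 0.0):
        raise ValueError(f"Value '{value}' should be a ratio between 1 and 0.")
    return f"{value * 100:.1f}%"


# Hypothetical counts, chosen only because 7 / 6 matches the reported value.
n_rows = 6
n_distinct_reported = 7
ratio = n_distinct_reported / n_rows

try:
    fmt_percent_sketch(ratio)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Any statistic that is meant to be a proportion but is computed from mismatched numerator and denominator counts would trip the same check, which may be why the failure only surfaces at render time.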

Version information:

  • Python version: 3.7.3
  • Environment: all environments
  • pip: output of pip freeze in the affected environment:

astroid==2.2.5
astropy==3.2.1
atomicwrites==1.3.0
attrs==19.1.0
bitstring==3.1.6
bleach==3.1.0
Click==7.0
click-plugins==1.1.1
cligj==0.5.0
confuse==1.0.0
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
entrypoints==0.3
Fiona==1.8.6
geopandas==0.5.1
htmlmin==0.1.12
importlib-metadata==0.20
ipython-genutils==0.2.0
isort==4.3.21
Jinja2==2.10.1
jsonschema==3.0.2
jupyter-client==5.3.1
jupyter-core==4.5.0
kiwisolver==1.1.0
lazy-object-proxy==1.4.2
llvmlite==0.29.0
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
missingno==0.4.2
mistune==0.8.4
more-itertools==7.2.0
munch==2.3.2
nbconvert==5.6.0
nbformat==4.4.0
numba==0.45.1
numpy==1.17.1
packaging==19.1
pandas==0.25.1
pandas-profiling==2.3.0
pandocfilters==1.4.2
phik==0.9.8
Pillow==6.1.0
pluggy==0.12.0
py==1.8.0
Pygments==2.4.2
pylint==2.3.1
pyparsing==2.4.2
pyproj==2.3.1
pyrsistent==0.15.4
pytest==5.1.2
pytest-pylint==0.14.1
python-dateutil==2.8.0
pytz==2019.2
PyYAML==5.1.2
pyzmq==18.1.0
scipy==1.3.1
seaborn==0.9.0
Shapely==1.6.4.post2
six==1.12.0
testpath==0.4.2
tornado==6.0.3
traitlets==4.3.2
typed-ast==1.4.0
unittest-xml-reporting==2.5.1
wcwidth==0.1.7
webencodings==0.5.1
wrapt==1.11.2
zipp==0.6.0

@bribass bribass added the bug 🐛 Something isn't working label Sep 6, 2019
@sbrugman sbrugman added this to the v2.4.0 milestone Sep 8, 2019
@DerpMind

Just encountered this bug in a current project. It was solved when I removed unused categories in all my categorical columns (using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalIndex.remove_unused_categories.html).
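
A minimal sketch of that workaround, applied to every categorical column at once. The helper name drop_unused_categories and the toy data are assumptions for illustration; the sketch uses the Series accessor `.cat.remove_unused_categories()`, the Series counterpart of the CategoricalIndex method linked above.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from each categorical column."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].cat.remove_unused_categories()
    return out


# Toy data: "colour" keeps the category "blue" even after the filter drops its rows.
df = pd.DataFrame({"colour": pd.Categorical(["red", "blue", "red"]), "n": [1, 2, 3]})
filtered = df[df["colour"] == "red"]
cleaned = drop_unused_categories(filtered)
```

The cleaned frame can then be passed to the profiler in place of the filtered one; non-categorical columns are left untouched.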

@sbrugman sbrugman removed this from the v2.4.0 milestone Jan 19, 2020
@github-actions

Stale issue

@sbrugman
Collaborator

Could not reproduce this issue with the latest version, closing for now.

SpyderRivera commented May 19, 2020

I am having this issue. However, it happens inconsistently. The issue is not present until I subset my data to conversion=1.

This works fine:

profile = ProfileReport(
    df, title="Profile Report of the January Conversion Dataset"
)
profile.to_file(Path("../../../products/jan_cvr_report.html"))

profile0 = ProfileReport(
    df[df['conversion']==0], title="Profile Report of the January Conversion==0 Dataset"
)
profile0.to_file(Path("../../../products/jan_cvr0_report.html"))

This is when it breaks:

profile1 = ProfileReport(
    df[df['conversion']==1], title="Profile Report of the January Conversion==1 Dataset"
)
profile1.to_file(Path("../../../products/jan_cvr1_report.html"))

The only difference is what subset of the data it is.

This is the error stack trace I get:

Summarize dataset: 100%
32/32 [00:31<00:00, 1.03it/s, Completed]

Generate report structure: 100%
1/1 [00:04<00:00, 4.85s/it]

Render HTML: 0%
0/1 [00:00<?, ?it/s]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-d419b0248170> in <module>
      2     df[df['conversion']==1], title="Profile Report of the January Conversion==1 Dataset"
      3 )
----> 4 profile1.to_file(Path("../../../products/jan_cvr1_report.html"))

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/profile_report.py in to_file(self, output_file, silent)
    243                 create_html_assets(output_file)
    244 
--> 245             data = self.to_html()
    246 
    247             if output_file.suffix != ".html":

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/profile_report.py in to_html(self)
    346 
    347         """
--> 348         return self.html
    349 
    350     def to_json(self) -> str:

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/profile_report.py in html(self)
    166     def html(self):
    167         if self._html is None:
--> 168             self._html = self._render_html()
    169         return self._html
    170 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/profile_report.py in _render_html(self)
    287                 title=self.description_set["analysis"]["title"],
    288                 date=self.description_set["analysis"]["date_start"],
--> 289                 version=self.description_set["package"]["pandas_profiling_version"],
    290             )
    291 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/root.py in render(self, **kwargs)
     11 
     12         return templates.template("report.html").render(
---> 13             **self.content, nav_items=nav_items, **kwargs
     14         )

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/report.html in top-level template code()
     20     {% endif %}
     21     <div class="content">
---> 22     {{ body.render() }}
     23     </div>
     24     {% include 'wrapper/footer.html' %}

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/container.py in render(self)
     29             return templates.template("sequence/sections.html").render(
     30                 sections=self.content["items"],
---> 31                 full_width=config["html"]["style"]["full_width"].get(bool),
     32             )
     33         elif self.sequence_type == "grid":

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/sequence/sections.html in top-level template code()
      1 <div class="{% if full_width %}container-fluid{% else %}container{% endif %}">
      2     {% for section in sections %}
----> 3         {% set html = section.render() %}
      4         {% if html | length > 0 %}
      5             <div class="row header">

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/container.py in render(self)
      8         if self.sequence_type in ["list", "accordion"]:
      9             return templates.template("sequence/list.html").render(
---> 10                 anchor_id=self.content["anchor_id"], items=self.content["items"]
     11             )
     12         elif self.sequence_type == "named_list":

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/sequence/list.html in top-level template code()
      2     {% for item in items %}
      3         <div class="row spacing">
----> 4             {{ item.render() }}
      5         </div>
      6     {% endfor %}

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/variable.py in render(self)
      5 class HTMLVariable(Variable):
      6     def render(self):
----> 7         return templates.template("variable.html").render(**self.content)

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/variable.html in top-level template code()
      1 <a class="anchor-pos anchor-pos-variable" id="pp_var_{{ anchor_id }}"></a>
      2 <div class="variable{% if ignore %} ignore{% endif %}">
----> 3     {{ top.render() }}
      4 
      5     {% if bottom is not none %}

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/container.py in render(self)
     33         elif self.sequence_type == "grid":
     34             return templates.template("sequence/grid.html").render(
---> 35                 items=self.content["items"]
     36             )
     37 

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/sequence/grid.html in top-level template code()
      1 {% for item in items %}
      2     <div class="col-sm-{% if (loop.last and loop.length == 3) or loop.length == 2 %}6{% else %}3{% endif %}{% if item.content['classes'] %} {{ item.content['classes'] }}{% endif %}">
----> 3         {{ item.render() }}
      4     </div>
      5 {% endfor %}

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/table.py in render(self)
      5 class HTMLTable(Table):
      6     def render(self):
----> 7         return templates.template("table.html").render(**self.content)

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in render(self, *args, **kwargs)
   1088             return concat(self.root_render_func(self.new_context(vars)))
   1089         except Exception:
-> 1090             self.environment.handle_exception()
   1091 
   1092     def render_async(self, *args, **kwargs):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/environment.py in handle_exception(self, source)
    830         from .debug import rewrite_traceback_stack
    831 
--> 832         reraise(*rewrite_traceback_stack(source=source))
    833 
    834     def join_path(self, template, parent):

~/opt/anaconda3/lib/python3.7/site-packages/jinja2/_compat.py in reraise(tp, value, tb)
     26     def reraise(tp, value, tb=None):
     27         if value.__traceback__ is not tb:
---> 28             raise value.with_traceback(tb)
     29         raise value
     30 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates/table.html in top-level template code()
      7         <tr{% if 'alert' in row and row['alert'] %} class="alert"{% endif %}>
      8             <th>{{ row['name'] }}</th>
----> 9             <td>{{ row['value'] | dynamic_filter(row['fmt']) }}</td>
     10         </tr>
     11         {% endfor %}

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/presentation/flavours/html/templates.py in <lambda>(x, v)
     24     r"\((\d+)\)", r'<span class="badge">\1</span>', x
     25 )
---> 26 jinja2_env.filters["dynamic_filter"] = lambda x, v: fmt_mapping[v](x)
     27 
     28 

~/opt/anaconda3/lib/python3.7/site-packages/pandas_profiling/report/formatters.py in fmt_percent(value, edge_cases)
     60     """
     61     if not (1.0 >= value >= 0.0):
---> 62         raise ValueError(f"Value '{value}' should be a ratio between 1 and 0.")
     63     if edge_cases and round(value, 3) == 0 and value > 0:
     64         return "< 0.1%"

ValueError: Value '6.180529706513958' should be a ratio between 1 and 0.

I tried removing unused categories and dropping the only constant column, but no luck.

df1 = df[df['conversion'] == 1].copy(deep=True)
# note: I only have 2 categorical vars
# Assigning the result back works whether or not the installed pandas accepts inplace=True.
df1['source'] = df1['source'].cat.remove_unused_categories()
profile1 = ProfileReport(
    df1.drop('conversion', axis=1), title="Profile Report of the January Conversion==1 Dataset"
)
profile1.to_file(Path("../../../products/jan_cvr1_report.html"))

UPDATE: Solution Found

It works with df1.drop('user_id', axis=1), so I tried
df1['user_id'] = df1['user_id'].cat.remove_unused_categories()

and it works! I didn't realize my user_id column was being treated as a category.
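
The situation described above can be sketched on toy data: a categorical column keeps its full category list after row filtering, and checking dtypes shows which columns are affected. The column names mirror the ones mentioned in this comment, but the data is made up, and whether the stale categories are what inflates the ratio inside pandas-profiling is consistent with this thread rather than confirmed by it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": pd.Categorical(["u1", "u2", "u3", "u4"]),
        "conversion": [0, 0, 1, 1],
    }
)
df1 = df[df["conversion"] == 1].copy()

# select_dtypes reveals which columns pandas treats as categorical.
categorical_columns = list(df1.select_dtypes(include="category").columns)

# The subset has 2 rows, but the column still carries all 4 categories.
categories_before = len(df1["user_id"].cat.categories)
values_present = df1["user_id"].nunique()

# Assign the result back instead of relying on inplace=True.
df1["user_id"] = df1["user_id"].cat.remove_unused_categories()
categories_after = len(df1["user_id"].cat.categories)
```

Comparing the category count with the number of values actually present is a quick way to spot columns that need this cleanup before profiling a subset.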

I also posted this on Stack Overflow in case anyone else runs into it.

flacle commented Dec 9, 2021

In my case, it also worked by setting duplicates=None for categorical data as explained here: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html#configuration-shorthands
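
A minimal sketch of that shorthand, assuming the pandas_profiling version in use accepts `duplicates=None` as a keyword argument as the linked page describes. The helper name profile_without_duplicates and the default title are hypothetical.

```python
def profile_without_duplicates(df, title="Profile Report"):
    """Build a report with the duplicate-rows analysis switched off."""
    # Imported lazily so that defining this helper does not require the package.
    from pandas_profiling import ProfileReport

    # duplicates=None is the configuration shorthand mentioned in the comment above.
    return ProfileReport(df, title=title, duplicates=None)
```

This works around the error rather than addressing stale categories, so removing unused categories may still be worthwhile if the duplicate-rows section is needed.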
