# Histogram with Plotly.Express: Complete guide
Source: https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Histogram/Histograms.ipynb
Notebook to complement the article: https://towardsdatascience.com/histograms-with-plotly-express-complete-guide-d483656c5ad7

In [1]:
import plotly.express as px
import pandas as pd
import numpy as np

# Load preprocessed data
We will work with a long data frame: there is one row per combination of `Country Name`, `Country Code`, `years` and `Region`, and each row contains 2 values:

* `visitors` - the number of tourists who visited the country in that year
* `receipts` - how much these tourists spent in the country in that year

We have values for 215 countries for the years 1995-2018.

Data were preprocessed using - https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Preprocess/Preprocessing.ipynb
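
The pickle itself is not part of this text. For quick experiments, a small synthetic stand-in with the same column names can be built as sketched below. The three countries and all numbers are made up for illustration and do not come from the real data set; `demo_long_df` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for long.plk: same column names, made-up numbers.
countries = {
    "Spain": ("ESP", "Europe & Central Asia"),
    "France": ("FRA", "Europe & Central Asia"),
    "Aruba": ("ABW", "Latin America & Caribbean"),
}
years = [str(y) for y in range(1995, 2019)]  # years are stored as strings

rng = np.random.default_rng(seed=0)
rows = [
    {
        "Country Name": name,
        "Country Code": code,
        "Region": region,
        "years": year,
        "visitors": float(rng.integers(1_000_000, 90_000_000)),
        "receipts": float(rng.integers(1_000_000_000, 80_000_000_000)),
    }
    for year in years
    for name, (code, region) in countries.items()
]
demo_long_df = pd.DataFrame(rows)

# One row per country and year: 3 countries x 24 years = 72 rows
print(demo_long_df.shape)  # (72, 6)
```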

In [71]:
# First run the preprocessing notebook
# https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Preprocess/Preprocessing.ipynb
# then load the preprocessed pickles
long_df = pd.read_pickle("../Preprocess/long.plk")
yr2018 = long_df[long_df["years"]=="2018"]
spfrit = long_df[long_df["Country Name"].isin(["Spain","France","Italy"])]
sptuar = long_df[long_df["Country Name"].isin(["Spain","Turkey","Aruba"])]

In [3]:
print(yr2018.shape,long_df.shape)

(215, 6) (5160, 6)


In [4]:
pd.concat([spfrit.head(3), spfrit.tail(3),long_df.sample(3)])

Unnamed: 0,Country Name,Country Code,Region,years,visitors,receipts
58,Spain,ESP,Europe & Central Asia,1995,32971000.0,27369000000.0
63,France,FRA,Europe & Central Asia,1995,60033000.0,31295000000.0
94,Italy,ITA,Europe & Central Asia,1995,31052000.0,30411000000.0
5003,Spain,ESP,Europe & Central Asia,2018,82773000.0,81250000000.0
5008,France,FRA,Europe & Central Asia,2018,89322000.0,73125000000.0
5039,Italy,ITA,Europe & Central Asia,2018,61567200.0,51602000000.0
764,Morocco,MAR,Middle East & North Africa,1998,3095000.0,1934000000.0
3986,Latvia,LVA,Europe & Central Asia,2013,1536000.0,865000000.0
2418,Dominican Republic,DOM,Latin America & Caribbean,2006,3965000.0,3917000000.0


# Range histograms
A range histogram is the most typical use case. We split the values into several bins (of the same size) and count the number of occurrences in each bin. For our dataset, if we look at the year 2018 and split the data into 10 bins, we see that:

* 133 countries were visited by 0-9.9M tourists
* 23 countries were visited by 10-19.9M tourists
* 5 countries by 20-29.9M tourists
* etc.
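
The binning described above can be sketched with `np.histogram` on a handful of made-up values. The numbers are illustrative only and the 0-100M range is an assumption chosen to give round bin edges.

```python
import numpy as np

# Made-up visitor counts (in millions) for eight hypothetical countries
visitors = np.array([0.5, 2.0, 9.9, 12.0, 18.5, 25.0, 61.0, 89.0])

# Ten equal-width bins covering 0-100M: edges 0, 10, 20, ..., 100
counts, edges = np.histogram(visitors, bins=10, range=(0, 100))

# Print each bin with the number of values that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>5.0f}-{right:<5.0f} {count}")
```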

In [5]:
yr2018.min(), yr2018["visitors"].max()

(Country Name            Afghanistan
 Country Code                    ABW
 Region          East Asia & Pacific
 years                          2018
 visitors                          0
 receipts                          0
 dtype: object,
 89322000.0)

In [None]:
# Here we use a numeric column, so Plotly bins the values into ranges
fig = px.histogram(yr2018, x="visitors")
#fig.update_layout(yaxis_title="Number of Countries")
fig.show()

# Categorical histograms
You can also bin the data based on a categorical value. You specify the category in parameter `x`. 

In [None]:
# Here we use a column with categorical data
fig = px.histogram(yr2018, x="Region", title="Regional Histogram")
fig.update_traces(text="x")
fig.show()

We can try to draw a histogram over the years, but it will show a count of 215 for every year, because we have one data point for each of the 215 countries in every year.

In [None]:
fig = px.histogram(long_df, x="years", title="Yearly Histogram")
fig.show()

Plotly histograms use `count` as the default function, but if you specify `y` as a numerical column (e.g. number of visitors), Plotly automatically changes the `histfunc` to `sum`.
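
To see what the two aggregations mean, here is a rough pandas equivalent on made-up numbers. It only illustrates the arithmetic and is not how Plotly computes the bars internally.

```python
import pandas as pd

# Made-up numbers: two hypothetical countries over two years
demo = pd.DataFrame({
    "years": ["2017", "2017", "2018", "2018"],
    "visitors": [10.0, 20.0, 15.0, 25.0],
})

# Without y: one count per row, so every year shows the number of countries
row_counts = demo.groupby("years").size()

# With y="visitors": the bar height is the sum of visitors in each year
visitor_sums = demo.groupby("years")["visitors"].sum()

print(row_counts)
print(visitor_sums)
```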

In [None]:
fig = px.histogram(long_df, 
                   x="years", 
                   y="visitors",
                   title="Yearly Histogram - years as integers",
                   )
fig.show()

In [None]:
long_df_2 = long_df.copy()
long_df_2["years"] = pd.to_datetime(long_df_2["years"], format="%Y")
fig = px.histogram(long_df_2, 
                   x="years", 
                   y="visitors",
                   title="Yearly Histogram - years as dates", 
                   nbins=5)
fig.show()

# Parameters
The histogram function has many parameters; let's review them one by one.

## Color
If you specify a categorical column as the `color` parameter, each bar will be split into several colors reflecting this category. Plotly will automatically add an interactive legend. You can click the legend to toggle that category on or off in the chart.

In [None]:
# the color parameter splits each histogram bar into categories differentiated by color
fig = px.histogram(yr2018, x="visitors", nbins=10, color="Region", title="Visitors per region")
fig.show()

Because a histogram is actually a bar chart, you can use the `barmode` options too.

* `group` - shows the histogram as a grouped bar chart
* `overlay` - displays semi-transparent bars on top of each other
* `stack` - stacks the bars on top of each other

In [None]:
for barmode in ["stack", "overlay", "group"]:
    
    fig = px.histogram(sptuar, 
                       x="visitors", 
                       nbins=20, 
                       color="Country Name", 
                       barmode=barmode,
                        title=f"Visitors in Aruba, Spain, Turkey - {barmode}")
    fig.show()

Plotly considers `years` to be an integer value, so it bins it into groups, e.g. 1995-1999. You can direct the engine to treat the years as categories with `fig.update_xaxes(type='category')`, and each year will become a separate bin.

In [None]:
for barmode in ["stack", "overlay", "group"]:
    
    fig = px.histogram(sptuar, 
                       x="years", 
                       y="visitors", 
                       nbins=3, 
                       color="Country Name", 
                       barmode=barmode,
                        title=f"Visitors in Aruba, Spain, Turkey - {barmode}")
    fig.update_layout(yaxis_title="Number of Visitors")
    fig.update_xaxes(type='category')
    fig.show()

In [None]:
fig = px.histogram(sptuar, 
            x="years", 
            y="visitors", 
            color="Country Name", 
            barmode=barmode,
            title=f"Visitors in Aruba, Spain, Turkey - {barmode}")
fig.update_layout(yaxis_title="Number of Visitors")
fig.update_xaxes(type='category')
fig.show()

## Parameter nbins
This parameter sets the number of bins on the chart. 

In [None]:
for nbins in [3,5,10,20]:
    fig = px.histogram(yr2018, x="visitors", nbins=nbins, title=f"{nbins} bins")
    fig.update_layout(yaxis_title="Number of Countries")
    fig.show()

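The parameter is a bit unreliable, though. Plotly treats `nbins` as the maximum number of bins and then picks a round bin size, so the chart can end up with fewer bins than requested. In our case: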

* nbins = 3 - 2 bins
* nbins = 5 - 5 bins
* nbins = 10 - 9 bins
* nbins = 20 - 18 bins

`nbins` cannot be used on categorical columns. However, it can be used on categories which can be interpreted as integers, e.g. years.

In [None]:
fig = px.histogram(yr2018, x="Region", title="Regional Histogram", nbins=20)
fig.update_layout(yaxis_title="Number of Countries")
fig.show()

In [None]:
for nbins in [3,5,10,20]:
    
    fig = px.histogram(spfrit, 
                       x="years", 
                       y="visitors", 
                       nbins=nbins, 
                       color="Country Name", 
                       barmode="stack",
                       title=f"Visitors in Spain, France, Italy - {nbins} bins")
    fig.update_layout(yaxis_title="Number of Visitors")
    fig.show()

Again, the bins are calculated in a somewhat strange way. It is related to the number of categories. We have 24 years starting in 1995, but Plotly moves the start to 1990, 1994 or 1995 so that every bin covers the same number of years.

* nbins = 3 - 3 bins (1990-1999, 2000-2009, 2010-2019)
* nbins = 5 - 5 bins (95-99, 00-04, 05-09, 10-14, 15-19) 
* nbins = 10 - 5 bins (95-99, 00-04, 05-09, 10-14, 15-19)
* nbins = 20 - 13 bins (94-95, 96-97, 98-99, 00-01 ...)
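
The bin counts in the list above can be reproduced with simple integer arithmetic, taking the bin starts and widths as listed. `count_bins` is a hypothetical helper written for this illustration, not a Plotly function.

```python
years = list(range(1995, 2019))  # the 24 years in the data set


def count_bins(start, width):
    """Number of distinct bins of `width` years, counted from `start`."""
    return len({(year - start) // width for year in years})


# (start, width) pairs read off the list above
print(count_bins(1990, 10))  # 3
print(count_bins(1995, 5))   # 5
print(count_bins(1994, 2))   # 13
```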

## Parameter histfunc
Determines how the height of each histogram bar is calculated. The options are `count`, `sum`, `avg`, `min` and `max`.
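
As a rough illustration of what each option computes per bin, the same five aggregations can be run with pandas on made-up numbers. The region names and values are hypothetical, and pandas calls the average `mean` rather than `avg`.

```python
import pandas as pd

# Made-up visitor numbers (in millions) for two hypothetical regions
demo = pd.DataFrame({
    "Region": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "visitors": [80.0, 60.0, 10.0, 30.0, 50.0],
})

# One column per histfunc option
summary = demo.groupby("Region")["visitors"].agg(
    ["count", "sum", "mean", "min", "max"]
)
print(summary)
```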

In [None]:
for histfunc in ["count","sum","avg","min","max"]:
    fig = px.histogram(yr2018, 
                   x="visitors", 
                   y="visitors",
                   histfunc=histfunc,
                   title=f"histfunc = {histfunc}")
    #fig.update_layout(yaxis_title="Number of Countries")
    fig.show()

In [None]:
histfunc = "min"
fig = px.histogram(yr2018, 
                   x="Region", 
                   y="visitors",
                   histfunc=histfunc,
                   title=f"histfunc = {histfunc}")
#fig.update_layout(yaxis_title="Number of Countries")
fig.show()

In [None]:
for histfunc in ["count","sum","avg","min","max"]:
    fig = px.histogram(yr2018, 
                       x="Region",
                       y="visitors",
                       histfunc=histfunc,
                       title=f"histfunc = {histfunc}")

    fig.show()

## Parameter Cumulative

In [None]:
for cumulative in [False, True]:
    fig = px.histogram(yr2018, x="visitors", cumulative=cumulative, title=f"Cumulative {cumulative} histogram"
                      ,color_discrete_sequence=["#ff5757"])
    fig.show()

## Parameter Barnorm
`barnorm` makes sense only if you have multiple categories differentiated by the `color` parameter. You can display the actual value of each group (e.g. the number of visitors in Spain, France and Italy) or their share, i.e. what portion of the total each color group represents. `fraction` displays the values in the range 0-1, while `percent` uses the range 0-100.
In a stacked chart, the fraction version always has 0-1 on the y-axis. In a grouped chart, the height of the bars remains the same, but the labels and grid lines on the y-axis are renamed.
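
The normalization itself is simple arithmetic. The pandas sketch below uses made-up numbers for a single year and is meant only to show the calculation, not Plotly's internal code.

```python
import pandas as pd

# Made-up visitor numbers (in millions) for one year
demo = pd.DataFrame({
    "years": ["2018", "2018", "2018"],
    "Country Name": ["Spain", "France", "Italy"],
    "visitors": [80.0, 90.0, 30.0],
})

# Total of all color groups sharing the same x position
totals = demo.groupby("years")["visitors"].transform("sum")

demo["fraction"] = demo["visitors"] / totals       # values in 0-1
demo["percent"] = 100 * demo["visitors"] / totals  # values in 0-100

print(demo[["Country Name", "fraction", "percent"]])
```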

In [None]:
for barnorm in [None, 'fraction', 'percent']:
    fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="stack",
                       barnorm=barnorm, 
                       title=f"Barnorm {barnorm} histogram")
    if barnorm=="percent":
        fig.update_layout(yaxis={"tickfont":{"size":18}, "ticksuffix":"%"})
    else:
        fig.update_layout(yaxis={"tickfont":{"size":18}})
    fig.show()

In [None]:
for barnorm in [None, 'fraction', 'percent']:
    fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       barnorm=barnorm, 
                       title=f"Barnorm {barnorm} histogram")
    fig.update_layout(yaxis={"tickfont":{"size":18}})
    fig.show()

## Parameter histnorm

In [None]:
for histnorm in [None, 'percent', 'probability', 'density', 'probability density']:
    fig = px.histogram(yr2018, 
                       x="visitors", 
                       histnorm=histnorm, 
                       title=f"Histnorm {histnorm} histogram",
                      cumulative=True)
    fig.show()

In [None]:
for histnorm in [None, 'percent', 'probability', 'density', 'probability density']:
    fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       histnorm=histnorm, 
                       title=f"Histnorm {histnorm} histogram",
                      cumulative=False)
    fig.show()

In [None]:
histnorm = None
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       histnorm=histnorm, 
                       title=f"Histnorm {histnorm} histogram",
                      cumulative=False)
fig.update_layout(yaxis={"tickfont":{"size":18}})
fig.show()

## Parameter category_orders

In [None]:
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       title=f"Ordered Italy, Spain, France",
                       category_orders={"Country Name":["Italy","Spain","France"]}
                      )
fig.show()

In [None]:
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       title=f"Ordered France, Spain, Italy",
                       category_orders={"Country Name":["France","Spain","Italy"]}
                      )
fig.show()

## Parameter range_x and range_y
Zooms the chart into a particular range. Ranges must always be specified as two values, the minimum and the maximum, so they cannot be used for categorical bins.

In [None]:
fig = px.histogram(long_df, 
                   x="years", 
                   y="receipts", 
                   range_x=["2009","2018"],
                   title="Yearly Histogram",
                  )
#fig.update_xaxes(type='category')
fig.show()

In [None]:
fig = px.histogram(long_df, 
                   x="years", 
                   y="receipts", 
                   range_x=["2016","2018"],
                   title="Yearly Histogram",
                  )
#fig.update_xaxes(type='category')
fig.show()

In [None]:
fig = px.histogram(long_df, 
                   x="years", 
                   y="receipts", 
                   range_x=["2016","2018"],
                   title="Yearly Histogram",
                   color_discrete_sequence=["#ff913d"]
                  )
#fig.update_xaxes(type='category',tickfont={"size":20})

fig.show()

In [None]:
fig = px.histogram(yr2018, 
                   x="visitors", 
                   range_x=[0,30_000_000],
                   range_y=[0,50],
                   title="Yearly Histogram",
                  )
fig.show()

## Parameter color_discrete_sequence
`color_discrete_sequence` lets you set a color for each distinct category (specified by the `color` parameter). If `color` is not set, all the bars use the first color in the list.

In [None]:
fig = px.histogram(yr2018, 
                   x="visitors", 
                   title="2018 Visitors",
                   # we have only one category so everything is lightgreen
                   color_discrete_sequence=["lightgreen", "black"]
                  )
fig.show()

In [None]:
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       title=f"Color_discrete_sequence for multiple bars",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                       # if you provide only one color, each bar will use it
                       # color_discrete_sequence=["yellow" ]
                       color_discrete_sequence=["yellow", "black", "lightgreen" ]
                      )
fig.show()

In [None]:
# using plotly predefined sequence
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       title=f"Using built-in colors",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                       color_discrete_sequence=px.colors.qualitative.Pastel2
                      )
fig.show()

## Parameter color_discrete_map

In [None]:
## Parameter color_discrete_map
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       barmode="group",
                       title=f"Color_discrete_map",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                       color_discrete_map={"Spain":"lightgreen", 
                                            "France":"rgba(0,0,0,100)", 
                                            "Italy":"#FFFF00"}
                      )
fig.show()

If the values of the `color` argument are real colors, you can pass a string `"identity"` to the `color_discrete_map`, however in this case no legend is displayed.

In [None]:
# create a color column specifying color for each of the countries
spfrit["color"] = spfrit["Country Name"].map(
    {"Spain":"red", 
                             "France":"black", 
                             "Italy":"orange"}
)

## Parameter color_discrete_map using identity in case color contains real color names/hex codes
fig = px.histogram(spfrit, 
 x="years",
 y="visitors",
 color="color",
 barmode="group",
 title=f"Color discrete map using identity",
 category_orders={"Country Name":["Italy","Spain","France"]},
 color_discrete_map="identity"
)
fig.show()

In [39]:
spfrit

Unnamed: 0,Country Name,Country Code,Region,years,visitors,receipts,color
58,Spain,ESP,Europe & Central Asia,1995,32971000.0,2.736900e+10,red
63,France,FRA,Europe & Central Asia,1995,60033000.0,3.129500e+10,black
94,Italy,ITA,Europe & Central Asia,1995,31052000.0,3.041100e+10,orange
273,Spain,ESP,Europe & Central Asia,1996,34027000.0,2.975100e+10,red
278,France,FRA,Europe & Central Asia,1996,62406000.0,3.208800e+10,black
...,...,...,...,...,...,...,...
4793,France,FRA,Europe & Central Asia,2017,86758000.0,6.793600e+10,black
4824,Italy,ITA,Europe & Central Asia,2017,58253000.0,4.671900e+10,orange
5003,Spain,ESP,Europe & Central Asia,2018,82773000.0,8.125000e+10,red
5008,France,FRA,Europe & Central Asia,2018,89322000.0,7.312500e+10,black


## Parameter facet_col and facet_row
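`facet_col` and `facet_row` split the chart into several subplots, one for each value of the given column, arranged in columns or rows.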

In [None]:
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       color="Country Name",
                       facet_col="Country Name",
                       barmode="group",
                       title=f"Using facet_col parameter together with color",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                      )
fig.update_xaxes(type='category')
fig.show()

In [None]:
# rank the countries based on overall visits since 1995
s = long_df.groupby("Country Name")["visitors"].sum().rank(ascending=False)

# filter top 9 countries by visits
df = long_df[long_df["Country Name"].isin(s[s < 10].index)]

fig = px.histogram(df, 
        x="years",
        y="visitors",
        color="Country Name",
        facet_col="Country Name",
        facet_col_wrap=3,
        barmode="group",
        title=f"Top 9 most visited countries",
        # sort from the most visited     
        category_orders={"Country Name":s[s < 10].sort_values().index.to_list()},
                      )
fig.update_xaxes(type='category')
fig.show()

In [None]:
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       facet_row="Country Name",
                       barmode="group",
                       title=f"Using facet_row parameter",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                      )
fig.update_xaxes(type='category')
fig.show()

In [None]:
# with facet_row_spacing
fig = px.histogram(spfrit, 
                       x="years",
                       y="visitors",
                       facet_row="Country Name",
                       facet_row_spacing=0.2,
                       barmode="group",
                       title=f"facet_row_spacing = 0.2",
                       category_orders={"Country Name":["Italy","Spain","France"]},
                      )
fig.update_xaxes(type='category')
fig.show()

## Parameter hover_name and hover_data
`hover_name` specifies a value which is highlighted at the top of the tooltip. `hover_data` lets you switch any column on or off with `True`/`False`. In the case of histograms (compared to other Plotly charts), however, it doesn't seem to work very well. You can remove the column used in the `color` parameter with `{"Country Name":False}`, but you cannot add the region with `{"Region":True}`.

In [None]:
# customize the tooltip with hover_name and hover_data
fig = px.histogram(spfrit,x="visitors",
                   color="Country Name",
                   barmode="group", 
                   hover_name="Country Name",
                  hover_data={"visitors":False, "Region":True},
                   
                  title="Updated tooltip")
fig.show()

In [None]:
# hover settings combined with custom colors and a marginal box plot
fig = px.histogram(spfrit,y="visitors",
                   color="Country Name",
                   barmode="group", 
                   hover_name="Country Name",
                  hover_data={"visitors":False, "Region":True},
                   color_discrete_sequence=["yellow","black","lightgreen"],
                   marginal="box",
                   histfunc="avg",
                  title="Custom colors with marginal boxes")
fig.show()

## Parameter Orientation
Specifies whether the bars are oriented horizontally or vertically. In practice, though, the orientation is determined by which of `x` and `y` you set. Switching `x` and `y` changes the orientation.

In [None]:
fig = px.histogram(yr2018, x="visitors", color_discrete_sequence=["#ff5757"])
fig.show()

In [None]:
# switching x to y will turn the vertical bars to horizontals
fig = px.histogram(yr2018, y="visitors", color_discrete_sequence=["#ff5757"])
fig.update_yaxes(autorange="reversed")
fig.show()

In [None]:
px.histogram(sptuar, x="years", y="receipts", nbins = 25, color="Country Name")

In [None]:
px.histogram(sptuar, y="years", x="receipts", nbins = 25, color="Country Name")

## Parameter Marginal
Lets you add an additional plot, which can be one of four types:

* `histogram` - basically the same as the histogram below it
* `rug` - shows the exact position of each data value
* `violin` - a violin plot, estimating the probability density of the variable
* `box` - a box plot highlighting the median and the first and third quartiles

In [None]:
marginal="box"
px.histogram(sptuar, x="visitors", marginal=marginal, title=f"Marginal {marginal}", )
#color_discrete_sequence=["#ff913d"]


If the histogram has more than one category, each category gets its own marginal plot.

In [None]:
marginal="histogram"
# for histograms with a marginal plot the barmode parameter is ignored
px.histogram(sptuar, x="visitors", marginal=marginal, title=f"Marginal {marginal}", color="Country Name", barmode="group")
#color_discrete_sequence=["#ff913d"]


## Parameter Animation_Frame
Setting this parameter to a column of the data frame creates an animation with one frame per value of this column. The axis ranges are set based on the initial frame, so you must specify the `range_x` parameter (`range_y` in case of vertical bars) to cover the values from every animation frame.
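
The reason for fixing the range can be seen on a small made-up example: the largest value in the first frame is well below the largest value overall, so a range based on the first frame alone would cut off later frames. The numbers below are illustrative only.

```python
import pandas as pd

# Made-up data: the second frame holds larger values than the first
demo = pd.DataFrame({
    "years": ["1995", "1995", "2018", "2018"],
    "Country Name": ["Spain", "France", "Spain", "France"],
    "visitors": [30.0, 60.0, 80.0, 90.0],
})

# Largest value in the first frame only; too small for later frames
first_frame_max = demo.loc[demo["years"] == "1995", "visitors"].max()

# Largest value across all frames, plus 10% headroom
upper = demo["visitors"].max() * 1.1
range_x = [0, upper]

print(first_frame_max, range_x)
```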

In [None]:
px.histogram(spfrit, 
    y="Country Name", 
    x="visitors",
    color="Country Name", 
    barmode="group", 
    # add the animation
    animation_frame="years",
    # anchor the ranges so that the chart doesn't change frame to frame         
    range_x=[0,spfrit["visitors"].max()*1.1])

In [None]:
px.histogram(long_df, 
    x="years",
    y="visitors",
    color="Region",
    animation_frame="Region",
    color_discrete_sequence=px.colors.qualitative.Safe,
    range_y=[0,long_df.groupby(["Region","years"])["visitors"].sum().max()*1.1]
            )

# Histogram using bar chart

In [None]:
px.histogram(yr2018, x="visitors")

In [55]:
# create 19 bin edges from 0 up to 90M, i.e. 18 bins of 5M each
bins = np.linspace(0, 90_000_000, 19)

# use pd.cut to create the bins. In order to include zero, `include_lowest` is set to True
yr2018["hist"] = pd.cut(yr2018["visitors"], bins, include_lowest=True)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [56]:
# pd.cut creates an interval category which is sorted from lowest bin to the greatest bin
yr2018["hist"].cat.categories

IntervalIndex([(-0.001, 5000000.0], (5000000.0, 10000000.0], (10000000.0, 15000000.0], (15000000.0, 20000000.0], (20000000.0, 25000000.0] ... (65000000.0, 70000000.0], (70000000.0, 75000000.0], (75000000.0, 80000000.0], (80000000.0, 85000000.0], (85000000.0, 90000000.0]],
              closed='right',
              dtype='interval[float64]')

In [57]:
# count the values in each bin. The result is sorted by occurrence (from the most populated bin to the least)
agg = yr2018["hist"].value_counts()

# sort the values according to the bins (`sort_index`), turn into data frame (`to_frame`) and reset index
agg = agg.sort_index().to_frame().reset_index()

# rename index (containing the bin range e.g. "(5000000.0, 10000000.0]" to bins)
agg.rename(columns={"index":"bins"}, inplace=True)

# Plotly cannot work with categories index, so we need to turn it into string
agg["bins"] = agg["bins"].astype("str")

agg.head(3)

Unnamed: 0,bins,hist
0,"(-0.001, 5000000.0]",157
1,"(5000000.0, 10000000.0]",17
2,"(10000000.0, 15000000.0]",12


In [None]:
# now we can use the aggregated values in the plotly bar chart
fig = px.bar(agg, x="bins", y="hist", text="hist",
       title="Histogram using pd.cut and px.bar", 
       labels={"hist":"count"})
fig.show()

If you want to display just the bin-border numbers and not the bin ranges, put the border values into a separate column using `pd.cut(df, bins, labels=bins[1:])`. If the `bins` variable is [0,1,2], then `bins[1:]` is [1,2]. This way the counts are assigned to the upper boundary of each bin, but the bar chart displays this number in the middle of the bar, which is exactly how `px.histogram()` does it.
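
The `[0,1,2]` example from the paragraph above, worked through on a few made-up values:

```python
import numpy as np
import pandas as pd

bins = np.array([0, 1, 2])  # two bins: [0, 1] and (1, 2]
values = pd.Series([0.0, 0.5, 1.0, 1.5, 2.0])

# Label every bin with its upper border: bins[1:] is [1, 2]
borders = pd.cut(values, bins=bins, labels=bins[1:], include_lowest=True)

# Three values fall under border 1, two under border 2
border_counts = borders.value_counts().sort_index()
print(border_counts)
```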

In [64]:
# bin under the bins higher boundary using labels argument
yr2018["hist_border"] = pd.cut(yr2018["visitors"], bins=bins, labels=bins[1:], include_lowest=True)
# bins containing both lower and higher boundary
yr2018["bins"] = pd.cut(yr2018["visitors"], bins=bins, include_lowest=True)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [62]:
# aggregate through both the bins and the higher boundary
agg = yr2018.groupby(["bins","hist_border"]).count()["Country Name"]
agg = agg.sort_index().to_frame().reset_index()
agg["bins"] = agg["bins"].astype("str")
agg.rename(columns={"Country Name":"values"}, inplace=True)
agg.head(3)

Unnamed: 0,bins,hist_border,values
0,"(-0.001, 5000000.0]",5000000.0,157.0
1,"(-0.001, 5000000.0]",10000000.0,
2,"(-0.001, 5000000.0]",15000000.0,


In [None]:
# plot the bar chart as usual via px.bar
fig = px.bar(agg, x="hist_border", y="values", text="values",
       title="Histogram using pd.cut with labels and px.bar", 
        hover_data={"bins":True})

# remove the gaps between the bars
fig.update_layout(bargap=0)

# show the image
fig.show()

## Split the bars based on a category (e.g. Region)
Achieve this by grouping the data by both the bins and the categorical value `Region`.

In [65]:
agg = yr2018.groupby(["hist","hist_border","Region"]).count()["visitors"].to_frame().reset_index()
agg.rename(columns={"hist":"bins"}, inplace=True)
agg["bins"] = agg["bins"].astype("str")
agg["visitors"] = agg["visitors"].fillna(0)
agg

Unnamed: 0,bins,hist_border,Region,visitors
0,"(-0.001, 5000000.0]",5000000.0,East Asia & Pacific,24.0
1,"(-0.001, 5000000.0]",5000000.0,Europe & Central Asia,28.0
2,"(-0.001, 5000000.0]",5000000.0,Latin America & Caribbean,37.0
3,"(-0.001, 5000000.0]",5000000.0,Middle East & North Africa,13.0
4,"(-0.001, 5000000.0]",5000000.0,North America,1.0
...,...,...,...,...
2263,"(85000000.0, 90000000.0]",90000000.0,Latin America & Caribbean,0.0
2264,"(85000000.0, 90000000.0]",90000000.0,Middle East & North Africa,0.0
2265,"(85000000.0, 90000000.0]",90000000.0,North America,0.0
2266,"(85000000.0, 90000000.0]",90000000.0,South Asia,0.0
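
Most rows in the table above are bin and border combinations that can never occur together. In the pandas version that produced this output, grouping by categorical columns expands to every combination of categories. A possible way to keep only the combinations present in the data is the `observed=True` argument of `groupby`, sketched here on made-up values with hypothetical region names:

```python
import numpy as np
import pandas as pd

# Made-up data: three hypothetical countries in two regions
demo = pd.DataFrame({
    "Region": ["Europe", "Europe", "Asia"],
    "visitors": [1.0, 7.0, 8.0],
})
bins = np.array([0, 5, 10])
demo["bins"] = pd.cut(demo["visitors"], bins=bins, include_lowest=True)
demo["hist_border"] = pd.cut(
    demo["visitors"], bins=bins, labels=bins[1:], include_lowest=True
)

# observed=False: every bins x hist_border pair appears, even impossible ones
full = demo.groupby(["bins", "hist_border"], observed=False)["visitors"].count()

# observed=True: only combinations that actually occur in the data
seen = demo.groupby(["bins", "hist_border"], observed=True)["visitors"].count()

print(len(full), len(seen))  # 4 2
```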


In [None]:
fig = px.bar(agg, x="bins", y="visitors", color="Region", text="visitors", 
       title="Histogram using pd.cut and px.bar", )
fig.show()

In [None]:
fig = px.bar(agg, x="hist_border", y="visitors", color="Region", text="visitors", 
       title="Histogram using pd.cut and px.bar", )
fig.update_layout(bargap=0)
fig.show()

## Simulate the histogram with numpy.histogram
This is possibly the most straightforward method, because `np.histogram` counts the values in each bin while keeping the order of the bins. Once the results are turned into a data frame, you can easily feed them to Plotly.

In [68]:
counts, bins = np.histogram(yr2018["visitors"], bins=bins)

In [69]:
# exclude the first value from the bins (it's the starting point)
df = pd.DataFrame({"bins":bins[1:], "counts":counts})
pd.concat([df.head(3),df.tail(3)])

Unnamed: 0,bins,counts
0,5000000.0,157
1,10000000.0,17
2,15000000.0,12
15,80000000.0,1
16,85000000.0,1
17,90000000.0,1


In [None]:
fig = px.bar(df, x="bins", y="counts", text="counts", title="Histogram simulation via px.bar")
fig.update_layout(bargap=0)
fig.show()
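
One caveat when comparing this approach with `pd.cut`: the two treat values that sit exactly on a bin edge differently. `np.histogram` uses half-open bins that include the left edge (only the last bin is closed on both sides), while `pd.cut` by default includes the right edge. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

edges = np.array([0, 5, 10])
values = np.array([5, 10])  # both values sit exactly on a bin edge

# np.histogram: bins are [0, 5) and [5, 10], so 5 lands in the second bin
np_counts, _ = np.histogram(values, bins=edges)

# pd.cut (default right=True): bins are [0, 5] and (5, 10], so 5 stays in the first
cut = pd.cut(pd.Series(values), bins=edges, include_lowest=True)
pd_counts = cut.value_counts().sort_index().to_numpy()

print(np_counts)  # [0 2]
print(pd_counts)  # [1 1]
```

For data without values exactly on the edges, both approaches give the same counts.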

# Conclusion
Histograms offer a quick way to explore the distribution of the data. `px.histogram()`, though, lacks some features that other charts in the Plotly family have. You can't annotate the bars and it's difficult to influence the size of the bins. To overcome this limitation, you can calculate the bins yourself and draw the chart using `px.bar()`.