# Plotly Express - The Comprehensive Guide
In this notebook, which complements an article on towardsdatascience.com, you will learn all you wanted to know about Plotly Express, a higher-level API of Plotly designed to work with data frames. The charts in this notebook are intentionally not rendered, because GitHub cannot show them. Please run all the cells to see the plots.

The notebook complements this article: https://towardsdatascience.com/visualization-with-plotly-express-comprehensive-guide-eb5ee4b50b57

## Installation
You can learn all about the installation in the article. You also have to install pandas for Plotly Express to work.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

# The data
The data were preprocessed using this notebook: 
https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Preprocess/Preprocessing.ipynb

In [2]:
# the original dataframe has years as columns
year_columns = pd.read_pickle("../Preprocess/arr.plk")

# transpose to get the country names as columns
country_columns = year_columns.set_index("Country Name").T

# for some plots we will look at the expenditures made in each country
rec = pd.read_pickle("../Preprocess/rec.plk")

In [3]:
country_columns.head()

Country Name,Aruba,Afghanistan,Angola,Albania,Andorra,United Arab Emirates,Argentina,Armenia,American Samoa,Antigua and Barbuda,...,"Venezuela, RB",British Virgin Islands,Virgin Islands (U.S.),Vietnam,Vanuatu,Samoa,"Yemen, Rep.",South Africa,Zambia,Zimbabwe
Country Code,ABW,AFG,AGO,ALB,AND,ARE,ARG,ARM,ASM,ATG,...,VEN,VGB,VIR,VNM,VUT,WSM,YEM,ZAF,ZMB,ZWE
1995,619000,,9000,,,2.315e+06,2.289e+06,12000,34000,220000,...,700000,219000,454000,1.351e+06,44000,68000,61000,4.488e+06,163000,1.416e+06
1996,641000,,21000,,,2.572e+06,2.614e+06,13000,35000,228000,...,759000,244000,373000,1.607e+06,46000,73000,74000,4.915e+06,264000,1.597e+06
1997,650000,,45000,,,2.476e+06,2.764e+06,23000,26000,240000,...,814000,244000,393000,1.716e+06,50000,68000,80000,4.976e+06,341000,1.336e+06
1998,647000,,52000,,,2.991e+06,3.012e+06,32000,36000,234000,...,685000,279000,422000,1.52e+06,52000,78000,88000,5.732e+06,362000,2.09e+06


# Line Chart
Let's start exploring the dataset with a line chart. The general syntax for any Plotly Express chart is `px.chart_type(df, parameters)`, so for the line chart it's `px.line(df, parameters)`.

There are three ways to pass the data to the plot. Try them yourself, but I think the first one makes the most sense.

* using dataframe and referencing the columns, e.g. `y="Spain"` because "Spain" is one of the columns in `country_columns` dataset
* using `pandas.Series`
* using a list of values

In [None]:
# Our dataset is wide, with years as columns. To turn the country names into columns,
# we set them as the index and transpose the frame.
country_columns = year_columns.set_index("Country Name").T
# 1. reference a column of the transposed dataframe by its name
px.line(country_columns
        ,y="Spain"
        ,title="Visitors to Spain")


In [None]:
# 2. you can specify the value as pandas.Series as well
px.line(country_columns, 
        y=country_columns["Spain"],
       title="Visitors to Spain")

In [None]:
# 3. or any array. In this case you must specify y-label. 
px.line(country_columns, 
        y=country_columns["Spain"].to_list(), 
        labels={"y":"Spain"},
       title="Visitors to Spain")

Behind the scenes, each figure is a dictionary. You can store the chart in a variable, commonly `fig`, and inspect it using `fig.to_dict()`. Use `fig["data"]` or `fig.data` to see the trace data, and `fig["layout"]` or `fig.layout` to review the design of the plot.

In [7]:
fig = px.line(country_columns, 
        y=country_columns["Spain"],
       title="Visitors to Spain")
fig.data, fig.layout

((Scatter({
      'hovertemplate': 'index=%{x}<br>Spain=%{y}<extra></extra>',
      'legendgroup': '',
      'line': {'color': '#636efa', 'dash': 'solid'},
      'mode': 'lines',
      'name': '',
      'orientation': 'v',
      'showlegend': False,
      'x': array(['Country Code', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
                  '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
                  '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018',
                  'Region'], dtype=object),
      'xaxis': 'x',
      'y': array(['ESP', 32971000.0, 34027000.0, 39553000.0, 41892000.0, 45440000.0,
                  46403000.0, 48565000.0, 50331000.0, 50854000.0, 52430000.0, 55914000.0,
                  58004000.0, 58666000.0, 57192000.0, 52178000.0, 52677000.0, 56177000.0,
                  57464000.0, 60675000.0, 64939000.0, 68175000.0, 75315000.0, 81869000.0,
                  82773000.0, 'Europe & Central Asia'], dtype=object)

## Layout Basics
In the examples above, we plotted the charts immediately using `px.line(df, params)`, but in order to influence other parts of the plot, you can assign the figure to a variable. By convention, `fig = px.chart_type()` is used. In that case, you render the plot using `fig.show()`.

In [None]:
fig = px.line(country_columns, 
        y=country_columns["Spain"].to_list(), 
        labels={"y":"Spain"},
       title="Visitors to Spain")
fig.update_layout(template="plotly_dark")
fig.show()

Only rarely can you pass multiple columns to a single Plotly Express parameter. With the `country_columns` data frame, which has a column for each country, passing `y=["Spain","Italy","France"]` draws three lines.

In [None]:
px.line(country_columns, y=["Spain","Italy","France"], title="International Visitors")

But trying to use multiple columns for any other parameter leads to an error.

In [10]:
try: 
    px.line(country_columns, y=["Spain","Italy","France"], 
            text=["Spain","Italy","France"])
except Exception as e:
    print(e)

All arguments should have the same length. The length of argument `wide_cross` is 26, whereas the length of  previously-processed arguments ['text'] is 3


You can try some workarounds, but they won't create a chart with three lines that all carry data value labels.

In [None]:
fig = px.line(country_columns, y=["Spain","Italy","France"], 
                text="Spain",
             title="Impossible with wide data")

fig.update_traces(texttemplate='%{text:.2s}', textposition="top center")

To get the most from Plotly Express, it's preferable to use a long dataframe. You can get one from the wide dataframe using the `df.melt()` function.

In [12]:
melted_df = year_columns.melt(id_vars=["Country Name","Country Code","Region"], 
                              var_name="years",
                             value_name="visitors")
melted_df

Unnamed: 0,Country Name,Country Code,Region,years,visitors
0,Aruba,ABW,Latin America & Caribbean,1995,619000.0
1,Afghanistan,AFG,South Asia,1995,
2,Angola,AGO,Sub-Saharan Africa,1995,9000.0
3,Albania,ALB,Europe & Central Asia,1995,
4,Andorra,AND,Europe & Central Asia,1995,
...,...,...,...,...,...
5155,Samoa,WSM,East Asia & Pacific,2018,164000.0
5156,"Yemen, Rep.",YEM,Middle East & North Africa,2018,
5157,South Africa,ZAF,Sub-Saharan Africa,2018,10472000.0
5158,Zambia,ZMB,Sub-Saharan Africa,2018,1072000.0
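
To see what `melt` does on a small scale, here is a minimal sketch on a made-up two-country frame. The `wide` and `long` names and the numbers are assumptions for illustration only; they are not part of the tourism dataset.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout: one column per year
wide = pd.DataFrame({
    "Country Name": ["Spain", "Italy"],
    "1995": [10, 20],
    "1996": [11, 21],
})

# id_vars are kept as they are; every other column is unpivoted
# into a (years, visitors) pair, one row per country and year
long = wide.melt(id_vars=["Country Name"], var_name="years", value_name="visitors")
print(long)
```

Each of the two year columns contributes one row per country, so the 2 x 2 block of values becomes four rows with a single `visitors` column.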


Now you can assign one column to each parameter. To limit the scope to just 3 of the 215 countries, let's filter them in advance.

In [13]:
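# .copy() makes the subset independent, so later column assignments don't trigger SettingWithCopyWarning
spfrit = melted_df[melted_df["Country Name"].isin(["Spain","Italy","France"])].copy()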

In [None]:
# by assigning single column to each parameter you can achieve almost every chart design. 
fig = px.line(spfrit, 
        x="years", 
        y="visitors", 
        color="Country Name", 
        text="visitors",
       title="International Visitors")

fig.update_traces(texttemplate='%{text:.2s}', textposition="top center")

### Tooltips
* `hover_name` - highlights the value of this column at the top of the tooltip
* `hover_data` - lets you add or remove columns in the tooltip by setting them to True/False
* `labels` - lets you rename the columns inside the tooltip

You can also use the icons in the Plotly interactive menu to switch between a single tooltip and a tooltip for every line at the `x` coordinate you hover over: the `show closest data on hover` and `compare data on hover` icons.

In [None]:
fig = px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        color="Country Name", 
        hover_name="Country Name",
        text="visitors",
        hover_data = {"Country Name": False,
                     "Country Code": True},
        labels={"Country Code": "Code"},
        title="International Visitors")
fig.update_traces(texttemplate='%{text:.2s}', textposition="top center")

## Parameter - facet_row and facet_col
Splits the chart into rows or columns of subplots.

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        facet_col ="Country Name", 
        color="Country Name",
       title="International Visitors")

## Parameter - color_discrete_sequence
Sets the exact color of each line using a list of colors.

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        facet_row ="Country Name", 
        color="Country Name",
       title="International Visitors",
       color_discrete_sequence  = ["red","yellow"])

## Parameter - color_discrete_map
Sets the exact color of each line using a dictionary that maps column values to colors.

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        facet_row ="Country Name", 
        color="Country Name",
       title="International Visitors",
       color_discrete_map  = {"Spain":"Black"})


## Parameter - line_group
Separates the lines based on this column. It basically works the same as `color`, but all lines will have the same color.

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        line_group="Country Name",
       title="International Visitors",
        # you can use color_discrete_sequence to change the color, but only the first item in the list is considered
       color_discrete_sequence  = ["red", "yellow"])

## Parameters - range_x, range_y
Lets you zoom into the plot. You can zoom out again using the `autoscale` icon in the Plotly interactive menu, which appears in the top right corner when you hover your mouse over the chart.

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        color="Country Name",
        title="International Visitors",
        range_x=[2014,2018],
        range_y=[50000000,90000000])

## Parameters - log_x, log_y
Change the axes to log scale

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        color="Country Name",
        title="International Visitors",
        log_x=True,
        log_y=True)

## Parameters - line_dash
Changes the dash pattern of the lines

In [None]:
px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        line_dash ="Country Name",
        title="International Visitors",)

## Parameters - animation_frame
In order to use animation with a line chart, I have to tweak the data frame a bit. I'll add a new column `year_upto`, and for every year I'll keep the data for that year and all previous years.

`year_upto   year country visitors`<br>
`1995        1995 ESP     1000`<br>
`1996        1995 ESP     1000`<br>
`1996        1996 ESP     1099`

In [23]:
data = []
for y in melted_df["years"].unique():
    # .assign returns a new frame, which avoids pandas' SettingWithCopyWarning
    df = spfrit[spfrit["years"] <= y].assign(year_upto=y)
    data.append(df)
spfrit_animation = pd.concat(data)

In [None]:
px.line(spfrit_animation, 
        x="years", 
        y="visitors", 
        color="Country Name",
        title="International Visitors",
        range_x=[1995, 2018],
        range_y=[25000000,90000000],
        animation_frame="year_upto")

# The Layout and styling
Because every Plotly chart is backed by a dictionary, you can update it in three ways.

In [25]:
# Every graph is a dictionary
fig = px.line(country_columns
        ,y="Spain"
        ,title="Visitors to Spain")

# display the data
fig.to_dict()

{'data': [{'hovertemplate': 'index=%{x}<br>Spain=%{y}<extra></extra>',
   'legendgroup': '',
   'line': {'color': '#636efa', 'dash': 'solid'},
   'mode': 'lines',
   'name': '',
   'orientation': 'v',
   'showlegend': False,
   'x': array(['Country Code', '1995', '1996', '1997', '1998', '1999', '2000',
          '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
          '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
          '2017', '2018', 'Region'], dtype=object),
   'xaxis': 'x',
   'y': array(['ESP', 32971000.0, 34027000.0, 39553000.0, 41892000.0, 45440000.0,
          46403000.0, 48565000.0, 50331000.0, 50854000.0, 52430000.0,
          55914000.0, 58004000.0, 58666000.0, 57192000.0, 52178000.0,
          52677000.0, 56177000.0, 57464000.0, 60675000.0, 64939000.0,
          68175000.0, 75315000.0, 81869000.0, 82773000.0,
          'Europe & Central Asia'], dtype=object),
   'yaxis': 'y',
   'type': 'scatter'}],
 'layout': {'xaxis': {'anchor': 'y',


In [None]:
# 1. using fig.update_layout
fig.update_layout(title="Visitors to Spain",
                  yaxis_tickformat = '.2f',
                  xaxis_title="X-axis title ",
                  yaxis = {"title": "Y-axis with updated numeric format",
                          "range": [10000000,120000000],
                          "scaleanchor": "x19"}
                  ,xaxis={})
fig.show()

In [None]:
# 2. update axes using fig.update_yaxes
# parameters of the yaxis can be found in the doc - 
# https://plotly.com/python-api-reference/generated/plotly.graph_objects.layout.yaxis.html?highlight=yaxis#module-plotly.graph_objects.layout.yaxis
fig.update_yaxes(title_text="different axis title", 
                 type="linear", 
                 ticks="inside", 
                 color="red")

In [None]:
# 3. update the layout by modifying the figure's dictionary directly
# standoff is the distance between the tick labels and the axis title
fig["layout"]["yaxis"]["title"]["text"] = "Y-Axis"
fig["layout"]["yaxis"]["title"]["standoff"] = 150
fig["layout"]["yaxis"]["title"]["font"]["size"] = 22
fig.show()

## Templates

In [29]:
def fig_reset():
    """Initialize basic chart"""
    return px.line(spfrit, x="years", y="visitors",
                   color="Country Name",
                   title="International Visitors")


In [None]:
fig = fig_reset()
fig.update_layout(template="plotly_dark")

## X and Y-axis
Axes control a wide range of parameters, such as grid lines, ticks, tick labels, axis titles, and spikes.

In [None]:
df = pd.DataFrame({"values": [-2,0,1,2,3,4,4,5,6,7,8]})
fig = px.line(df, y="values", color_discrete_sequence=["#f66e20"])
fig.update_layout(
                template="plotly_white",
                title={"text":"Axis Elements", "font": {"size": 25}, "x": 0.5, "y": 0.95},
                yaxis={
                        "zerolinecolor": "black",
                        "gridcolor": "#0071b3",
                        "linecolor": "#fab22e",
                        "spikecolor": "#cb001c",
                        "spikesnap": "hovered data",
                        "ticks": "inside",
                        "tickangle": 30,
                        "ticklen": 10,
                        "tickwidth": 2,
                        "tickcolor": "#35b729",
                        "showticklabels": True,},
                 annotations=[
                             {"x":9, "y":0, "ay": -40, 
                               "text": "zeroline",
                               "arrowhead": 3, "showarrow":True,
                                "font": {"size": 15}},
                             {"x":7, "y":2, "ay": 40, 
                               "text": "gridline",
                               "arrowhead": 3, "showarrow":True,
                               "font":{"color":"#0071b3", "size": 15}},
                             {"x":3, "y":3, "ay": -60,  
                               "text": "spike",
                               "arrowhead": 3, "showarrow":True,
                               "font":{"color":"#cb001c", "size": 15}},
                             {"x":0.0, "y": 7, "ax": 60, "ay": 0,
                               "text": "line",
                               "arrowhead": 3, "showarrow":True,
                               "bgcolor":"#fab22e",
                               "font": {"size": 15}},
                             {"x":0.05, "y":4.10, "ax": 60, "ay": -35, 
                               "text": "ticks",
                               "arrowhead": 3, "showarrow":True,
                              "font":{"color":"#35b729", "size": 15}},
                             {"x":0.05, "y":5.90, "ax": 45, "ay": 25, 
                               "arrowhead": 3, "showarrow":True,
                              "font":{"color":"#35b729", "size": 15}}],
                    xaxis={"range":[0,10]})

### Grid Ranges

In [None]:
fig = fig_reset()
# place the y-ticks at exact values
fig.update_layout(yaxis={"tickvals":[40000000,70000000], "title": "visitors"})

In [None]:
# ask for at most 5 y-ticks
fig = fig_reset()
fig.update_layout(yaxis={"tickmode": "auto",
                         #"tick0": 11000000, 
                         #"dtick":6000000,
                         "nticks": 5,
                         "title": "visitors"})

In [None]:
# ticks start at 11M and appear every 6M
fig = fig_reset()
fig.update_layout(yaxis={"tickmode": "linear",
                         "tick0": 11000000, 
                         "dtick":6000000,
                         "title": "visitors"
                         })
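
To make the `tick0`/`dtick` arithmetic concrete, here is a minimal sketch in plain Python. `linear_ticks` is a hypothetical helper, not a Plotly function. It assumes that ticks fall on `tick0 + n * dtick` for integer `n` and keeps those that lie inside a given axis range; the range used here is made up for illustration.

```python
import math

def linear_ticks(tick0, dtick, lo, hi):
    """Return the positions tick0 + n * dtick that lie within [lo, hi]."""
    first = math.ceil((lo - tick0) / dtick)   # smallest n with a tick >= lo
    last = math.floor((hi - tick0) / dtick)   # largest n with a tick <= hi
    return [tick0 + n * dtick for n in range(first, last + 1)]

# Assumed visible range of 25M to 90M visitors
ticks = linear_ticks(11_000_000, 6_000_000, 25_000_000, 90_000_000)
print(ticks)
```

With these assumptions the first visible tick is 29M rather than 11M, because `tick0` only anchors the sequence and does not have to lie inside the visible range.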

### Range slider

In [None]:
fig = px.line(melted_df[melted_df["Country Name"].isin(["France","Italy","Spain"])], 
        x="years", 
        y="visitors", 
        line_dash ="Country Name",
        title="Range Slider")
fig.update_xaxes(rangeslider_visible=True)

### Axis types
Our x-axis values are of the `string` (object) type, but Plotly treats them as numbers. If you look into the figure dictionary, the values still look like strings, but when you zoom into the chart, Plotly automatically adds extra grid lines with decimal values.

In [36]:
fig["data"][0]["x"]

array(['1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002',
       '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'],
      dtype=object)

In [None]:
fig = px.line(spfrit, 
        x="years", 
        y="visitors", 
        line_dash ="Country Name",
        range_x=[2004,2006])
fig.show()

In [None]:
# because our values are really dates, let's convert them to datetime
spfrit["years"] = pd.to_datetime(spfrit["years"], format="%Y")
fig = px.line(spfrit, 
        x="years", 
        y="visitors", 
        line_dash ="Country Name",
        # since range_x accepts a list with two values - start and end, we can use pandas daterange with 2 values
        range_x=pd.date_range(start='2004', periods=2, freq='2Y').to_list())
fig.show()

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
spfrit = melted_df[melted_df["Country Name"].isin(["Spain","Italy","France"])].copy()
# converting the series used as the x-axis to string or category doesn't prevent plotly from treating it as float
spfrit["years"] = spfrit["years"].astype("string")
# spfrit["years"] = spfrit["years"].astype("category")
fig = px.line(spfrit, 
        x="years", 
        y="visitors", 
        line_dash ="Country Name",
        range_x=["2004","2006"])
# although the years are strings, plotly introduces grid lines `2,004.5` and `2,005.5`, which don't exist in the dataframe
fig.show()

In [None]:
# the fix is however simple, you need to turn the axis into a category.
fig.update_xaxes(type='category')
fig.show()

# Annotations
4 main use cases for annotation:

* To highlight some point(s)
* To describe/highlight an area
* To label desired point
* Instead of a legend

I have said that the long data structure is ideal for Plotly Express. Well, sometimes you can struggle with it too, as in this case: the order of the lines depends on where they appear in the dataframe, and, surprisingly, you cannot reliably assign the colors using `color_discrete_sequence`. So I must build a dictionary from my dataframe with a color assigned to each country. Most countries will be `lightgrey`, and the two I want to highlight will be `blue` and `red`.

In [41]:
melted_df_jptr = melted_df.copy()

# create a dict with colors:
colors = pd.DataFrame(melted_df_jptr["Country Name"].unique(), columns=["Country Name"])
colors["color"] = colors["Country Name"].map({"Japan": "blue", "Turkey": "red"}).fillna("lightgrey")

# color map is a dict with colors, lightgrey for most, {"Aruba": "lightgrey", ... "Japan": "blue", ...}
color_map = {v["Country Name"]: v["color"] for k,v in colors.iterrows()}

# show sample from the dictionary
{k:color_map[k] for k in color_map if k in ["Aruba","Japan","Turkey","Zimbabwe"]}

{'Aruba': 'lightgrey',
 'Japan': 'blue',
 'Turkey': 'red',
 'Zimbabwe': 'lightgrey'}
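
The same mapping can also be built without an intermediate dataframe. The following is only a sketch of an alternative, assuming the list of names would come from `melted_df["Country Name"].unique()`; the short `countries` list and the `highlight` name below are stand-ins for illustration.

```python
# Colors for the highlighted countries; everything else falls back to lightgrey
highlight = {"Japan": "blue", "Turkey": "red"}

# Stand-in for the unique country names taken from the dataframe
countries = ["Aruba", "Japan", "Turkey", "Zimbabwe"]

# dict.get returns the fallback color when a country is not highlighted
color_map = {name: highlight.get(name, "lightgrey") for name in countries}
print(color_map)
```

A dictionary comprehension keeps the default color in one place, which makes it easy to highlight a different set of countries later.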

In [42]:
# sort the dataframe

melted_df_jptr["order"] = melted_df_jptr["Country Name"].map({"Japan": 1, "Turkey": 2}).fillna(3)
melted_df_jptr.sort_values(by=["order","years"], ascending=True, inplace=True)
melted_df_jptr.head(3)

Unnamed: 0,Country Name,Country Code,Region,years,visitors,order
97,Japan,JPN,East Asia & Pacific,1995,3345000.0,1.0
312,Japan,JPN,East Asia & Pacific,1996,3837000.0,1.0
527,Japan,JPN,East Asia & Pacific,1997,4218000.0,1.0


Create the plot with the annotations.
You can use two reference systems:

* `"xref": "paper"`, where 0 is the left side of the plot and 1 is the right side
* `"xref": "x"` (the default), which allows you to annotate inside the plotting area using chart coordinates

In [None]:

# but still my lines are somewhere in the middle
fig = px.line(melted_df_jptr.sort_values(by=["order","years"], ascending=False),
              x="years",
              y="visitors", 
              color="Country Name", 
              line_group="Country Name",
              color_discrete_map=color_map)

fig.update_layout(title="Tourism Growth in Turkey and Japan",
                # remove the legend
                showlegend=False,
                  
                # make y-axis invisible
                yaxis={"visible":False},
                
                xaxis={"type": "linear"},
                
                # create the annotations
                # point annotation
                annotations=[
                        {"x":2011, "y":6220000, "ay": -40, 
                        "text": "<b>Tourism Boom<br> in Japan 2011</b>",
                        "arrowhead": 3, "showarrow":True,
                        "font": {"size": 15}},
                    # area annotation
                    {"x":2007, "y":40000000, 
                        "text": "<b>Number of tourist is growing</b>",
                         "textangle": -25,
                        "showarrow":False,
                         "bgcolor":"lightblue",
                        "font": {"size": 15}},
                    # start of the line annotation   
                    # use "xanchor": "right" so that the label's right edge touches the left side of the plot area
                    {"xref":"paper", "yref":"paper", "x":0, "y":0.15,
                              "xanchor":'right', "yanchor":"top",
                              "text":'7M',
                              "font":dict(family='Arial',
                                        size=12,
                                        color="red"),
                              "showarrow":False},
                    {"xref":"paper", "yref":"paper", "x":0, "y":0.1,
                              "xanchor":'right', "yanchor":'top',
                              "text":'3.3M',
                              "font":dict(family='Arial',
                                        size=12,
                                        color="blue"),
                              "showarrow":False},
                    # end of the line legend
                    # use "xanchor": "left" so that the label's left edge touches the right side of the plot area
                    {"xref":"paper", "yref":"paper", "x":1, "y":0.53,
                              "xanchor":"left", "yanchor":"top",
                              "text":'Turkey (45M)',
                              "font":dict(family='Arial',
                                        size=12,
                                        color="red"),
                              "showarrow":False},
                    {"xref":"paper", "yref":"paper", "x":1, "y":0.39,
                              "xanchor":'left', "yanchor":'top',
                              "text":'Japan (31M)',
                              "font":dict(family='Arial',
                                        size=12,
                                        color="blue"),
                              "showarrow":False}
                    
                ])
fig.show()

You might also notice that with the increased number of lines, Plotly automatically switched to WebGL rendering, which improves the performance of JavaScript plots with many data points.

In [44]:
type(fig.data[0])

plotly.graph_objs._scattergl.Scattergl

How to create this chart is described in the article - https://towardsdatascience.com/highlighted-line-chart-with-plotly-express-e69e2a27fea8 and notebook: https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Highlighted_Line_Chart_on_Grey_Lines_Background/Highlight_Lines_on_Grey_Background.ipynb

# Bar Chart

In [None]:
fig = px.bar(spfrit, 
             y="visitors", 
             x="years", 
             color="Country Name",
            title="Visitors to Europe - Bar Mode Group",
            barmode='group')
fig.update_layout(
        xaxis={"tick0":1995, "dtick":1, "tickangle": 30}
)

Let's plot just the year 2018 to see which country was visited most often.

In [None]:
arr_2018 = melted_df[melted_df["years"]=="2018"]
fig = px.bar(arr_2018.sort_values(by="visitors", ascending=False), 
             y="visitors", 
             x="Country Name", 
            title="World Tourism 2018",
            barmode='group')
fig.show()

A chart like this provides some information, but it's not very readable. Let's keep the names of the top 25 countries and label the rest as "Other".

In [47]:
# work on an explicit copy to avoid pandas' SettingWithCopyWarning
arr_2018 = arr_2018.copy()
# for the 25 most visited countries in 2018 keep the name, for the others use the label "Other"
arr_2018["2018_name"] = arr_2018["Country Name"].where(arr_2018["visitors"].rank(ascending=False) <= 25, "Other")

In [None]:
fig = px.bar(arr_2018.sort_values(by="visitors", ascending=False), 
             y="visitors", 
             x="2018_name", 
            title="World Tourism 2018",
            barmode='group',
             hover_data = {"2018_name": False, "Country Name": True}
            )
fig.show()
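
The labelling step is easier to follow on a tiny example. This is a minimal sketch on made-up numbers: `toy`, `top_n` and `grouped` are assumptions for illustration. It also adds a `groupby` step, which the cells above do not perform, to collapse all "Other" rows into a single total.

```python
import pandas as pd

# Hypothetical visitor counts for four countries
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "visitors": [40.0, 30.0, 20.0, 5.0],
})

top_n = 2
# rank(ascending=False) gives 1 to the largest value; keep names for the top_n only
toy["label"] = toy["Country Name"].where(
    toy["visitors"].rank(ascending=False) <= top_n, "Other"
)

# Optional extra step: sum all "Other" rows into one row per label
grouped = (
    toy.groupby("label", as_index=False)["visitors"]
    .sum()
    .sort_values("visitors", ascending=False)
)
print(grouped)
```

Without the aggregation, each "Other" country remains a separate row, so a bar chart stacks them in one column and the hover still shows the individual country names.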

Top 20 countries, with text labels added through the `text` parameter and formatted using `update_traces`, and the bar color changed via the `color_discrete_sequence` parameter.

In [None]:
fig = px.bar(arr_2018.sort_values(by="visitors", ascending=False)[:20], 
             y="visitors", 
             x="2018_name", 
            title="World Tourism 2018",
            barmode='group',
            hover_data = {"2018_name": False, "Country Name": True},
             text="visitors",
             #color="2018_name"  # for some reason it makes the bars very thin
             color_discrete_sequence =["#008037"]
            )
fig.update_traces(texttemplate='%{text:.2s}', 
                  textposition="outside",
                 )
fig.update_layout(xaxis={"title":"countries"})
fig.show()

In [50]:
arr_2018

Unnamed: 0,Country Name,Country Code,Region,years,visitors,2018_name
4945,Aruba,ABW,Latin America & Caribbean,2018,1082000.0,Other
4946,Afghanistan,AFG,South Asia,2018,,Other
4947,Angola,AGO,Sub-Saharan Africa,2018,218000.0,Other
4948,Albania,ALB,Europe & Central Asia,2018,5340000.0,Other
4949,Andorra,AND,Europe & Central Asia,2018,3042000.0,Other
...,...,...,...,...,...,...
5155,Samoa,WSM,East Asia & Pacific,2018,164000.0,Other
5156,"Yemen, Rep.",YEM,Middle East & North Africa,2018,,Other
5157,South Africa,ZAF,Sub-Saharan Africa,2018,10472000.0,Other
5158,Zambia,ZMB,Sub-Saharan Africa,2018,1072000.0,Other


The "Other" column is still too high; you can play with it and split it by continent. But let's explore another feature of Plotly, the continuous color scale. If we set `color` to a column with a numerical range rather than a categorical one, we automatically get a continuous color scale. You can apply various predefined color scales:
https://plotly.com/python/builtin-colorscales/

In [None]:
fig = px.bar(arr_2018.fillna(0).sort_values(by="visitors", ascending=False), 
             y="visitors", 
             x="2018_name", 
            title="World Tourism 2018",
            barmode='group',
             hover_data = {"2018_name": False, "Country Name": True},
             color = "visitors",
             color_continuous_scale=px.colors.sequential.Viridis
            )
fig.update_layout(xaxis={"title":"countries"})
fig.show()

## Combine bar and line chart

In [52]:
# Create simple dataset
df = pd.DataFrame({"revenue":[100,200,300], "cost":[-150,-150,-200], "year":[2018,2019,2020]})
df["profit"] = df["revenue"]+df["cost"]

In [None]:
# figures have no add_line method, so adding a line to a bar chart this way doesn't work

fig = px.bar(df, x="year", y=["revenue","cost"], title="Bar chart without any line")
try:
    fig.add_line(x=df["year"],y=df["profit"])
except Exception as e:
    print(e)
fig.show()

In [None]:
# adding bars to line chart is possible, but not very flexible. 
fig = px.line(df, x="year", y="profit", title="Profit")
fig.add_bar(x=df["year"],y=df["revenue"], name="revenue")
fig.add_bar(x=df["year"],y=df["cost"], name="cost")
fig.show()

In [55]:
pd.concat(data)

Unnamed: 0,Country Name,Country Code,Region,years,visitors,year_upto
58,Spain,ESP,Europe & Central Asia,1995,32971000.0,1995
63,France,FRA,Europe & Central Asia,1995,60033000.0,1995
94,Italy,ITA,Europe & Central Asia,1995,31052000.0,1995
58,Spain,ESP,Europe & Central Asia,1995,32971000.0,1996
63,France,FRA,Europe & Central Asia,1995,60033000.0,1996
...,...,...,...,...,...,...
4793,France,FRA,Europe & Central Asia,2017,86758000.0,2018
4824,Italy,ITA,Europe & Central Asia,2017,58253000.0,2018
5003,Spain,ESP,Europe & Central Asia,2018,82773000.0,2018
5008,France,FRA,Europe & Central Asia,2018,89322000.0,2018


## Animated Bar Chart

In [None]:
data = []
for y in melted_df["years"].unique():
    sub_df = melted_df[melted_df["years"]==y].sort_values(by="visitors", ascending=False)[:10]
    data.append(sub_df)

fig = px.bar(pd.concat(data).sort_values(by="visitors", ascending=True), 
             y="Country Name", x="visitors", 
             orientation="h", animation_frame="years",
            title="Evolution of Tourism",
            text="visitors")
fig.update_layout(xaxis={"range":[0,100000000]},
                 yaxis={"title":{"standoff":150}, "tickwidth": 200, "automargin": True})
                  #yaxis={"title":{"standoff":150}, "ticklen": 200, "automargin": False})
fig.update_traces(texttemplate='%{text:.2s}', 
                  textposition="outside"
                 )
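The loop that collects the top 10 countries for each year can also be written with `groupby` and `head`. The sketch below uses a small made-up frame (`toy`) with the same column names as `melted_df`, so it runs on its own; `top_n` is set to 2 only to keep the output short.

```python
import pandas as pd

# made-up stand-in for melted_df
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C", "A", "B", "C"],
    "years": ["2017", "2017", "2017", "2018", "2018", "2018"],
    "visitors": [30, 10, 20, 5, 25, 15],
})

top_n = 2
# sort by year and by visitors (largest first), then keep the first top_n rows of every year
top_per_year = (
    toy.sort_values(by=["years", "visitors"], ascending=[True, False])
       .groupby("years")
       .head(top_n)
)
print(top_per_year)
```

The resulting frame can be passed to `px.bar` with `animation_frame="years"` in the same way as `pd.concat(data)`.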

# Histograms
There's a separate article and notebook that go into the details of Plotly Express histograms. 

* Article: https://towardsdatascience.com/histograms-with-plotly-express-complete-guide-d483656c5ad7
* Notebook: https://github.com/vaclavdekanovsky/data-analysis-in-examples/blob/master/Vizualizations/Plotly/Histogram/Histograms.ipynb

# Pie Chart

In [None]:
fig = px.pie(arr_2018.sort_values(by="visitors", ascending=False), 
             values="visitors", 
             names="2018_name",
             # rename the label on the hover
             labels={'2018_name':'country'},
            title="2018 visitors")
fig.update_traces(textposition='inside', textinfo='percent+value+label')
fig.update_traces(pull=[.3,.2,.1])

## Donut chart - pie with a hole

In [None]:
fig = px.pie(arr_2018.sort_values(by="visitors", ascending=False)[:10], 
             values="visitors", 
             names="2018_name",
             # rename the label on the hover
             labels={'2018_name':'country'},
            title="2018 visitors",
            hole=.3)
fig.update_traces(textposition='inside', textinfo='percent+label')

# Sunburst plot

In [None]:
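# copy the slice so that the column assignments below don't trigger SettingWithCopyWarning
chart_df = arr_2018.sort_values(by="visitors", ascending=False)[:10].copy()
# express the visitors in tens of millions, rounded to whole numbers
chart_df["visitors"] = round(chart_df["visitors"]/10000000)
chart_df["visitors"] = chart_df["visitors"].astype(int)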
fig = px.sunburst(
    chart_df,
    path=['Region', '2018_name'],
    values='visitors',
)
fig.show()

# Treemap

In [None]:
fig = px.treemap(arr_2018.sort_values(by="visitors", ascending=False)[:50], 
             values="visitors", 
             path=['Region', 'Country Name'],
             color="visitors",
            title="2018 visitors (top 50 countries)",
            color_continuous_scale='RdBu',
            #color_continuous_midpoint=20000000
           )
fig.show()

In [None]:
fig = px.treemap(arr_2018.sort_values(by="visitors", ascending=False)[:50], 
            values="visitors", 
            path=['Region', 'Country Name'],
            color="Region",
            title="2018 visitors (top 50 countries)",
            hover_data={"Region":False},
            labels={'labels':'Country'},
           )
fig.show()

# Choropleth
Working with geospatial data, especially on the country level, is easy with Plotly. 

In [None]:
fig = px.choropleth(
    arr_2018, 
    locations="Country Code",                    
    color="visitors",
    hover_name="Country Name", # column to add to hover information
    color_continuous_scale=px.colors.sequential.matter)
fig.show()

Zoom in with `scope`:

In [None]:
fig = px.choropleth(arr_2018, locations="Country Code",
                    color="visitors", # column that determines the color
                    hover_name="Country Name", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.matter,
                   scope="south america"
                   )          
fig.show()

# Scatterplot
## World map with scatter_geo

In [None]:
fig = px.scatter_geo(
    melted_df.fillna(0), 
    locations ="Country Code", 
    color="visitors",
    size="visitors",
    # maximum size of the biggest scatter point
    size_max = 30,
    projection="natural earth",
    # fixed color range, important to keep the same scale in every animation frame
    range_color=(0, 100000000),
    # column displayed in bold at the top of the hover popup
    hover_name = "Country Name",
    # hide these columns in the hover popup
    hover_data = {"Country Name":False, "Country Code": False},
    title="International Tourism",
    animation_frame="years"
                     )
fig.update_geos(showcountries = True)
fig.show()

To see some relations in the data, let's combine the number of visitors with the income these visitors have brought to the countries. In this case, I will join the melted receipts to the melted visitors to get two columns with data. These can be assigned to `x` and `y` of the plot.

In [65]:
year_receipts = pd.read_pickle("../Preprocess/rec.plk")
melted_receipts = year_receipts.melt(id_vars=["Country Name","Country Code"], 
                              var_name="years",
                             value_name="receipts")
scatter_df = melted_df.merge(melted_receipts, on=["Country Name","Country Code","years"]).fillna(0)
scatter_df.sample(3)

Unnamed: 0,Country Name,Country Code,Region,years,visitors,receipts
3900,Botswana,BWA,Sub-Saharan Africa,2013,1544000.0,484200000.0
22,Belarus,BLR,Europe & Central Asia,1995,160600.0,28000000.0
600,Solomon Islands,SLB,East Asia & Pacific,1997,13800.0,9700000.0
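The reshaping and joining can be illustrated on a tiny made-up dataset. The frames `wide_visitors` and `wide_receipts` below are assumptions that only mimic the structure of the real pickles (one column per year). Note that after `melt` the years are strings, because they come from the column names; that is why the filters below compare against `"2018"` and not `2018`.

```python
import pandas as pd

# made-up wide frames: one row per country, one column per year
wide_visitors = pd.DataFrame({"Country Name": ["A", "B"],
                              "Country Code": ["AAA", "BBB"],
                              "2017": [10, 20], "2018": [11, 21]})
wide_receipts = pd.DataFrame({"Country Name": ["A", "B"],
                              "Country Code": ["AAA", "BBB"],
                              "2017": [100.0, 200.0], "2018": [110.0, None]})

keys = ["Country Name", "Country Code"]
# wide -> long: every (country, year) pair becomes one row
long_visitors = wide_visitors.melt(id_vars=keys, var_name="years", value_name="visitors")
long_receipts = wide_receipts.melt(id_vars=keys, var_name="years", value_name="receipts")

# join on country and year, then replace missing values with 0
joined = long_visitors.merge(long_receipts, on=keys + ["years"]).fillna(0)
print(joined)
```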


In [None]:
px.scatter(scatter_df[scatter_df["years"]=="2018"],
          x="visitors",
          y="receipts",
          color="Region",
          hover_name="Country Name",
          size="receipts")

In [None]:
px.scatter(scatter_df[scatter_df["years"]=="2018"],
          x="visitors",
          y="receipts",
          #color="Region",
          hover_name="Country Name",
          size="receipts",
          marginal_x="rug",
          marginal_y="violin",
          trendline="lowess")

In [None]:
# copy the filtered rows so that adding a column doesn't trigger SettingWithCopyWarning
two018 = scatter_df[scatter_df["years"]=="2018"].copy()
two018["type"] = np.where(two018["Region"]=="Europe & Central Asia", "European", "Outside")
fig = px.scatter(two018,
          x="visitors",
          y="receipts",
          color="type",
          hover_name="Country Name",
          size="receipts",
          marginal_x="histogram",
          marginal_y="box",
          trendline="ols",
          title="Scatter plot with histogram and box marginal plot and two trendlines")
fig.show()

Print the regression parameters of the trendline. First get the fit results with `px.get_trendline_results`, then display them with the statsmodels `.summary()` method.

In [69]:
res = px.get_trendline_results(fig)
european_trendline = res[res["type"]=="European"]["px_fit_results"].iloc[0]
print(type(european_trendline))

<class 'statsmodels.regression.linear_model.RegressionResultsWrapper'>


In [70]:
european_trendline.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.903
Model:,OLS,Adj. R-squared:,0.901
Method:,Least Squares,F-statistic:,500.8
Date:,"Wed, 21 Oct 2020",Prob (F-statistic):,5.46e-29
Time:,21:24:27,Log-Likelihood:,-1337.9
No. Observations:,56,AIC:,2680.0
Df Residuals:,54,BIC:,2684.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.084e+07,9.39e+08,-0.022,0.982,-1.9e+09,1.86e+09
x1,921.7867,41.190,22.379,0.000,839.207,1004.367

0,1,2,3
Omnibus:,31.241,Durbin-Watson:,1.968
Prob(Omnibus):,0.0,Jarque-Bera (JB):,78.847
Skew:,1.61,Prob(JB):,7.56e-18
Kurtosis:,7.84,Cond. No.,27400000.0


## Histogram

In [None]:
px.histogram(arr_2018, x="visitors")

In [None]:
# cumulative histogram adds on the top of previous bins
px.histogram(arr_2018, x="visitors", cumulative=True)

In [None]:
# nbins parameter influences the number of bins
px.histogram(arr_2018, x="visitors", nbins=5)

### Histnorm
The `histnorm` parameter changes what the y-axis shows (the x-axis for a horizontal histogram). It can be:

* `None` - the histogram shows the aggregated values (most often the count)
* `percent` - the axis shows values 0-100, the percentage of the values falling into the bin
* `probability` - the axis shows values 0-1, the probability that a value falls into the bin
* `density` - the output of histfunc for a given bin is divided by the size of the bin
* `probability density` - the output of histfunc for a given bin is normalized such that it corresponds to the probability that a random event whose distribution is described by the output of histfunc will fall into that bin
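The normalizations listed above can be reproduced by hand with NumPy. This is only a sketch on made-up values with two bins of width 5; it assumes the default `histfunc="count"`, so the output of histfunc is simply the number of values in each bin.

```python
import numpy as np

# made-up values, two bins of width 5: [0, 5) and [5, 10]
values = np.array([1, 2, 2, 3, 3, 3, 7, 8])
counts, edges = np.histogram(values, bins=[0, 5, 10])
widths = np.diff(edges)   # size of each bin
n = counts.sum()          # total number of values

percent = 100 * counts / n                   # histnorm="percent"
probability = counts / n                     # histnorm="probability"
density = counts / widths                    # histnorm="density"
probability_density = counts / (n * widths)  # histnorm="probability density"

print(counts, percent, probability, density, probability_density)
```

With `probability density` the areas of the bars (height times bin width) add up to 1, while with `density` they add up to the number of values.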


In [None]:
# if histnorm percent is used, then the y-axis shows percent not total count
px.histogram(arr_2018, x="visitors", nbins=5, histnorm="percent")

In [None]:
px.histogram(arr_2018, x="visitors", nbins=5, histnorm="probability density")

In [76]:
# you can plot several histograms together, e.g. for several years. Let's first set up a data frame containing 3 years.
arr_2016_18 = melted_df[melted_df["years"].isin(["2016","2017","2018"])]

In [None]:
# with an empty barnorm the grouped bars show the plain counts in each bin
px.histogram(arr_2016_18, x="visitors", color="years", barmode="group", barnorm="", cumulative=False)

In [None]:
# barnorm is useful when several histograms share one chart. It shows what percentage of each bin falls into each category.
# In our case the 0-5M visitors bin is split to 30% in 2018, 35% in 2017 and 35.5% in 2016.
# Since one country had 85M-90M visitors in both '17 and '18, each of these two years occupies 50% of that bin.
px.histogram(arr_2016_18, x="visitors", color="years", barmode="group", barnorm="percent", cumulative=False)
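What `barnorm="percent"` computes can be sketched with plain pandas: count the values per bin and category, then divide every bin by its total. The data and bin edges below are made up for illustration.

```python
import pandas as pd

# made-up visitors for two years
toy = pd.DataFrame({
    "years": ["2017", "2017", "2017", "2018", "2018"],
    "visitors": [1, 2, 8, 3, 9],
})

# assign each value to a bin: (0, 5] or (5, 10]
bins = pd.cut(toy["visitors"], bins=[0, 5, 10])

# rows = bins, columns = years, cells = number of values
counts = pd.crosstab(bins, toy["years"])

# share of each year within a bin, in percent; every row sums to 100
share = counts.div(counts.sum(axis=1), axis=0) * 100
print(share)
```

In the second bin each year contributes one value, so both years occupy 50% of it, the same effect as described for the 85M-90M bin above.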

### Marginal
As with the scatter plot, histograms also allow quick creation of a marginal chart. All four types are available: `rug`, `box`, `violin`, or `histogram`.

In [None]:
# marginal="violin" adds a violin plot of each year above the stacked histogram
px.histogram(arr_2016_18, x="visitors", color="years", barmode="stack", marginal="violin")

# Buttons

In [None]:
fig = px.bar(arr_2018.sort_values(by="visitors", ascending=False)[:10], 
             y="Country Name", x="visitors", 
            title="Evolution of Tourism",
             color_discrete_sequence= ["red"],
            text="visitors")
# Add dropdown
fig.update_layout(
    updatemenus=[
        # a dropdown `direction="down"`
        # changing the color of the bars
        dict(
            buttons=list([
                dict(
                    args=["marker.color",["red"]],
                    label="Red",
                    method="restyle"
                ),
                dict(
                    args=["marker.color",["green"]],
                    label="Green",
                    method="restyle"
                ),
                dict(
                    args=[{"marker.color":["blue"]}],
                    label="Blue",
                    method="restyle"
                )
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.3,
            xanchor="left",
            y=1.25,
            yanchor="top"
        ),
        # second set of buttons updating the position of the labels
        dict(
            buttons=list([
                dict(
                    args=[{"textposition":"inside"}],
                    label="Inside",
                    method="restyle")
                ,
                dict(
                    args=[{"textposition":"outside"}],
                    label="Outside",
                    method="restyle"
                )
            ]),
            type = "buttons",
            direction="left",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.5,
            xanchor="left",
            y=1.25,
            yanchor="top"
        ),
        # third set of buttons updating the text format of the labels
        dict(
            buttons=list([
                {
                    "args":[{'texttemplate': '%{text:.2s}'}],
                    "label":"million",
                    "method":"restyle"
                },
                {
                "args":[{'texttemplate': '%{text:0.}'}],
                    "label":"full",
                    "method":"restyle"
                }
            ]),
            type = "buttons",
            direction="left",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.5,
            xanchor="left",
            y=1.15,
            yanchor="top"
        )])

fig.show()


### Positions and anchors of the buttons

In [None]:
df = arr_2018.sort_values(by="visitors", ascending=False)

fig = px.pie(df[:10], 
             values="visitors", 
             names="2018_name",
             # rename the label on the hover
             labels={'2018_name':'country'},
             title="Buttons' positions and anchors",
             hole=.3)

# click on x: left 0, y: top 0 to change to empty chart
fig.update_layout(
    updatemenus=[
        # each entry below is a button group whose label states its position (x, y)
        # and its anchors (xanchor, yanchor)
        dict(
            buttons=list([
                dict(
                    args=[{"textinfo":"percent+value+label"}],
                    label="x:left 0, y:top 1",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=False,
            x=0,
            xanchor="left",
            y=1,
            yanchor="top",
            bgcolor="#39e"
        ),
        dict(
            buttons=list([
                dict(
                    args=[{"textinfo":"percent+value"}],
                    label="x:left 0, y:bottom 1",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=False,
            x=0,
            xanchor="left",
            y=1,
            yanchor="bottom",
            bgcolor="#e93"
        ),
        dict(
            buttons=list([
                dict(
                    args=[{}],
                    label="x:center 0, y:middle 0",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=True,
            x=0,
            xanchor="center",
            y=0,
            yanchor="middle"
        ),
        dict(
            buttons=list([
                dict(
                    args=[{"type":"bar", 
                           "text": np.array(df["visitors"]), 
                           "textposition": "auto"}],
                    label="x:right 0, y:bottom 0",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=True,
            x=0,
            xanchor="right",
            y=0,
            yanchor="bottom"
        ),
        dict(
            buttons=list([
                dict(
                    args=[{"type":"bar", 
                           "marker.color":"violet",
                          "x": df["Country Name"][:15].values,
                          "y": df["visitors"][:15].values,
                          "xaxis":"x",
                          "yaxis":"y"}],
                    label="x:left 0, y:top 0",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=True,
            x=0,
            xanchor="left",
            y=0,
            yanchor="top"
        ),
     dict(
            buttons=list([
                dict(
                    args=[{}],
                    label="x: center 1, y:middle 1",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=True,
            x=1,
            xanchor="center",
            y=1,
            yanchor="middle"
        ),
    dict(
            buttons=list([
                dict(
                    args=[{}],
                    label="x:center 0.5, y:middle 1.1",
                    method="restyle"
                )]),
            
            
            type="buttons",
            showactive=True,
            x=0.5,
            xanchor="center",
            y=1.1,
            yanchor="middle"
        ),
    dict(
            buttons=list([
                dict(
                    args=[{"xaxis.range":[-2,10],
                          "yaxis.range":[0,5.6]}],
                    label="x:right 1, y:bottom 0 padded {'r': 10, 't': 10}",
                    method="relayout",
                    
                )]),
            
            
            type="buttons",
            showactive=True,
            x=1,
            xanchor="right",
            y=0,
            yanchor="bottom",
        pad={"r": 10, "t": 10},
        ),
        # set of multiple buttons
        dict(
            buttons=list([
                dict(
                    args=[{}],
                    label="x:right 1, y:bottom .5 button 1",
                    method="relayout",
                    
                ),
            dict(
                    args=[{}],
                    label="button 2",
                    method="relayout",
                    
                )]),
            
            direction="left",
            type="buttons",
            showactive=True,
            x=1,
            xanchor="right",
            y=0.5,
            yanchor="bottom",
        
        )
    ])

fig.show()
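How the position and the anchor work together can be described with a simplified model: the anchor says which edge (or the center) of the button group is pinned to the given coordinate. The function below is a hypothetical illustration and not Plotly code. It assumes the size of the group is known in the same coordinates as `x` and `y` and it ignores `pad`; in a real chart the size depends on the label text.

```python
# fraction of the size that lies before the anchored point
ANCHOR_OFFSET = {
    "left": 0.0, "center": 0.5, "right": 1.0,   # xanchor values
    "bottom": 0.0, "middle": 0.5, "top": 1.0,   # yanchor values
}

def occupied_span(position, anchor, size):
    """Return (start, end) of the button group along one axis (simplified model)."""
    start = position - ANCHOR_OFFSET[anchor] * size
    return (start, start + size)

# x=0 with xanchor="right": the group sits to the left of the plotting area
print(occupied_span(0, "right", 0.2))
# y=1 with yanchor="top": the group hangs below the top edge of the plotting area
print(occupied_span(1, "top", 0.1))
# x=0.5 with xanchor="center": the group is centered horizontally
print(occupied_span(0.5, "center", 0.2))
```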

# Common issues
We have already seen some of the issues. Plotly can filter our dataset to display only the relevant columns, e.g. only the year columns when the data frame contains years and descriptive columns. That is perfect. On the other hand, it is sometimes annoying when you want the axis to show only your own date or integer values, and Plotly instead builds a continuous axis, introduces extra positions and sorts the values in numerical order.

In [None]:
# if you want to display which values are the most frequent and these values are integers
df = pd.DataFrame({"x": [3]*10+[6]*5+[2]*1})
df["x"].value_counts().plot(kind="bar")

In [None]:
# it's rather difficult with plotly, which always sets up a range containing all the numerical values
fig = px.bar(df["x"].value_counts())
fig.show()

In [None]:
fig.update_xaxes(type='category')
fig.show()

In [None]:
# The date labels on the x-axis can also annoy you when you try to display end-of-year or end-of-quarter dates. 
# Plotly will always label them as January of the next year or the beginning of the following quarter
df = pd.DataFrame({"x":["2019-12-31","2019-03-31","2018-12-31","2017-12-31"],
                   "y":[10,12, 15, 8]})
fig = px.bar(df, x="x", y="y")
fig.show()

In [None]:
# again the solution is to turn the axis into a category
fig.update_xaxes(type='category')
fig.show()

In [None]:
# Plotly also ignores invalid dates once it decides the axis contains dates.
df = pd.DataFrame({"x":["2019-12-31","2018-12-31","Other"],
                   "y":[10,12, 8]})
fig = px.bar(df, x="x", y="y", color_discrete_sequence =["#ff914d"])
fig.update_xaxes(tickfont={"size":18})
fig.show()

In [None]:
fig.update_xaxes(type='category')
fig.show()

# Plotly as default pandas backend

In [89]:
pd.options.plotting.backend = "plotly"

In [None]:
fig = df.plot(kind="bar", x="x", y="y", text="y")
fig.show()

In [91]:
# return to the default matplotlib backend
pd.options.plotting.backend = "matplotlib"