# About
* Goal: Get a good grip on the fundamentals of Plotly
* Source: https://towardsdatascience.com/python-for-data-science-a-guide-to-data-visualization-with-plotly-969a59997d0c

# Import

In [6]:
# Imports / Configs / Global vars

# Import of native python tools
import os
import json
from functools import reduce

# Import of base ML stack libs
import numpy as np
import sklearn as sc

# Multiprocessing for Mac / Linux
import platform
platform.system()
if platform.system() == 'Darwin':
    from multiprocess import Pool
else:
    from multiprocessing import Pool

# Visualization libraries
import plotly.express as px

# Logging configuration
import logging
logging.basicConfig(format='[ %(asctime)s ][ %(levelname)s ]: %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Ipython configs
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
display(HTML("<style>.container { width:100% !important; }</style>"))
InteractiveShell.ast_node_interactivity = 'all'

# Pandas configs
import pandas as pd
import geopandas as gpd
pd.options.display.max_rows = 350
pd.options.display.max_columns = 250

# Jupyter configs
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

# GLOBAL VARS
from pathlib import Path
import glob

# Load data
* Use the Titanic dataset

In [7]:
df = pd.read_csv('../Pandas and Numpy/titanic dataset/train.csv')

In [8]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Background Primer

## What is plotly and how is it different from matplotlib/seaborn?
* Plotly is a Python library for interactive data visualization. Its graphs are interactive by default (hover, zoom, pan), which Matplotlib and Seaborn do not offer out of the box.

## What types of graphs are supported by plotly?
* Most chart types available in Matplotlib and Seaborn
* Statistical charts, including but not limited to parallel categories and probability tree plots
* Scientific charts, ranging from network graphs to radar charts
* Financial charts such as candlestick, funnel and bullet charts
* Geographical maps and 3-dimensional plots that you can interact with

## Why is it popular?
* Interactive plots (hover, zoom and pan)
* Arguably prettier defaults than Matplotlib/Seaborn
* More detailed visualisation, since hovering exposes the underlying values
* Extensive customisation

In [10]:
# import
from plotly.offline import init_notebook_mode,iplot
import plotly.graph_objects as go
import cufflinks as cf
init_notebook_mode(connected=True)

## Notes
* Due to how Plotly operates, it saves your plot into a separate HTML file and opens it in a different window.
* This happens when you run the code from the console/terminal.
* Hence, we use `plotly.offline`, `iplot` and `init_notebook_mode` to render the graphs inside the Jupyter Notebook itself.

# Defining what we want to plot

* Questions to consider:
    * What information am I trying to convey?
    * Are we plotting numerical or categorical values?
    * How many variables are you trying to plot?

In [11]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [22]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Our goal with this dataset

> The main objective with this dataset is to study which factors affect the survivability of a person on board the Titanic.

* To start, let's display how many passengers survived the sinking of the Titanic

In [12]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

# Data, Layout and Figure

* In Plotly, we define graph objects to be plotted.
* The 3 main building blocks of a plot are:
    * Data
    * Layout
    * Figure
* We need to define them clearly and concisely so that someone else can understand what we are trying to plot.
* In this case, we want to see the proportion of survivors, so we can use a pie chart

## Pie chart

In [13]:
#labels
lab = df["Survived"].value_counts().keys().tolist()

#values
val = df["Survived"].value_counts().values.tolist()
trace = go.Pie(labels=lab, 
                values=val, 
                marker=dict(colors=['red']), 
                # Show only the value when hovering over a slice
                hoverinfo="value"
              )
data = [trace]

* Plotly's pie chart takes two main parameters: `labels` and `values`.
* We define the labels as the unique values of the `Survived` column, which are [0, 1]
* The values we want to display are the counts of these labels
* Now we define our data as a list containing the pie chart object we just defined
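
The labels and values can also be derived from a single `value_counts()` call, which guarantees that both lists are in the same order. The sketch below uses a made-up five-element series as a stand-in for `df["Survived"]`; the `names` mapping is an assumption added for readability.

```python
import pandas as pd

# Made-up stand-in for df["Survived"]; not real Titanic rows.
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Count once, then read labels and values from the same result.
counts = survived.value_counts()
labels = counts.index.tolist()   # the distinct values, most frequent first
values = counts.tolist()         # how often each value occurs

# Optional: map the 0/1 codes to readable legend entries.
names = {0: "Did not survive", 1: "Survived"}
readable_labels = [names[k] for k in labels]
```

Passing `readable_labels` instead of `labels` to `go.Pie` would make the legend self-explanatory.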

### Layout
* layout of your plot
* e.g.
    * plot titles
    * x,y axis titles
    * legends
    * active filters/sliders

In [14]:
layout = go.Layout(title="Survived Distribution")

### Figure
* The figure is what you're trying to plot; it takes the `data` and `layout` parameters

In [15]:
fig = go.Figure(data=data, layout=layout)

In [16]:
# display
iplot(fig)

### Repeat for Sex

In [19]:
#labels
lab_sex = df["Sex"].value_counts().keys().tolist()

#values
val_sex = df["Sex"].value_counts().values.tolist()
trace_sex = go.Pie(labels=lab_sex, 
                values=val_sex, 
                marker=dict(colors=['green']), 
                # Show only the value when hovering over a slice
                hoverinfo="value"
              )
data_sex = [trace_sex]

layout_sex = go.Layout(title="Sex Distribution")
fig_sex = go.Figure(data=data_sex, layout=layout_sex)
iplot(fig_sex)

In [23]:
# Repeat for Pclass

#labels
lab_class = df["Pclass"].value_counts().keys().tolist()

#values
val_class = df["Pclass"].value_counts().values.tolist()
trace_class = go.Pie(labels=lab_class, 
                values=val_class, 
                marker=dict(colors=['green','red','blue']), 
                # Show only the value when hovering over a slice
                hoverinfo="value"
              )
data_class = [trace_class]

layout_class = go.Layout(title="Class Distribution")
fig_class = go.Figure(data=data_class, layout=layout_class)
iplot(fig_class)

# Histogram
* When plotting numerical columns alone, we'd want to use a distribution plot such as a histogram
* We will explore the columns : Age and Fare

## Age

In [24]:
# defining data
trace = go.Histogram(x=df['Age'],nbinsx=40,histnorm='percent')
data = [trace]

# defining layout
layout = go.Layout(title="Age Distribution")
# defining figure and plotting
fig = go.Figure(data = data,layout = layout)
iplot(fig)

## Note that we can tweak 2 useful parameters for Histograms which are:
* `histnorm`
    * How the bar heights are normalized. It was set to 'percent' so that each bar shows the percentage of all samples that fall into its bin.
    * If left unset, each bar shows the count of its bin by default.
* `nbinsx`
    * Maximum number of bins for the values to be distributed into; Plotly chooses the actual bin size so that this number is not exceeded.
    * A higher number of bins tends to give you a more detailed distribution.

# Plotting two variables 
* Exploring the relationship between Age and Fare
    * Answering questions like : do older people tend to buy more expensive fare tickets?

## Scatter Plot

In [26]:
# define data
trace = go.Scatter(x=df['Age'],y=df['Fare'],text=df['Survived'], mode = 'markers')
data = [trace]

# define layout
layout = go.Layout(title='Fare VS Age Scatter Plot',
                  xaxis = dict(title='Age'),
                  yaxis = dict(title='Fare'),
                  hovermode='closest')

# defining figure and plotting
figure = go.Figure(data=data, layout=layout)
iplot(figure)

* Note that x and y axis titles are added here, as well as the value displayed when hovering over a point
* You can customize the displayed value by changing the `text` parameter
* From this plot, there doesn't seem to be much of a linear relationship
* Fares tend to hit a price ceiling around 300, but older people also buy cheap fares
* To investigate further, we can look at the `Pclass` column (the ticket class of each passenger)
    * By plotting the average age and fare for each Pclass, we can see whether there is a connection

# Bar chart
* For each Pclass, we want to display the average age and fare in that Pclass. 
* For simplicity’s sake, 
    * plot the Pclass on the x-axis, 
    * average age on the y-axis 
    * average fare as a color scale in our bar chart. 
* We will need to compute the average age and fare for each Pclass first.

In [27]:
y = []
fare = []
# Note: a groupby on 'Pclass' could replace this loop and the repeated appends
for i in list(df['Pclass'].unique()):
    result = df[df['Pclass']==i]['Age'].mean()     # average age in this class
    fares = df[df['Pclass']==i]['Fare'].mean()     # average fare in this class
    y.append(result)
    fare.append(fares)
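
The loop above can be written as a single `groupby`, as its comment suggests. The sketch below uses the five rows from `df.head()` so that it runs on its own; on the full dataset, `toy` would simply be `df`.

```python
import pandas as pd

# Pclass, Age and Fare for the five rows shown in df.head().
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One pass: average Age and Fare per class. NaN values are skipped by mean().
means = toy.groupby("Pclass")[["Age", "Fare"]].mean()

x = means.index.tolist()       # class labels, sorted ascending
y = means["Age"].tolist()      # average age per class
fare = means["Fare"].tolist()  # average fare per class
```

One difference from the loop: `groupby` sorts the classes, whereas `unique()` returns them in order of first appearance, so `x` should be taken from `means.index` rather than from `df['Pclass'].unique()`.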

In [28]:
# define data
trace = go.Bar( x = list(df['Pclass'].unique()),
               y=y,
               marker=dict(color=fare,colorscale='Viridis',showscale=True),
               text = fare)
data = [trace]

#define layout
layout = go.Layout(title='Age/Fare vs Pclass Bar Chart',
                   xaxis=dict(title='Pclass'),
                   yaxis=dict(title='Age'),
                   hovermode='closest')

#defining figure and plotting
figure = go.Figure(data=data,layout=layout)
iplot(figure)


* Two new marker parameters were added:
    * `color`: the values that the color scale maps to (here, the average fare)
    * `colorscale`: the color scale used to display the magnitude of those values
* By hovering over the bars, you can see the average age and fare for each class
* We can deduce that average age and fare decrease as Pclass increases
* We can confirm this by plotting the distribution of age and fare for each Pclass, which gives a clearer picture than the averages alone

# Distribution plots
* A bit like histograms, but they include a rug plot at the bottom (one tick per observation) to better display the distribution
* To do this, we need to import an extra Plotly module

In [29]:
import plotly.figure_factory as ff

* Plot 2 graphs: the distributions of fare and of age, each split by Pclass

## Fares

In [31]:
#defining data
a = df[df['Pclass']==1]['Fare']
b = df[df['Pclass']==2]['Fare']
c = df[df['Pclass']==3]['Fare']
hist_data=[a,b,c]
group_labels=['1','2','3']

#defining fig and plotting
fig = ff.create_distplot(hist_data,group_labels,bin_size=[1,1,1],show_curve=False)
fig.update_layout(title_text='Distribution for Fares')
iplot(fig)

## Age

In [34]:
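# Age contains missing values, so drop them before building the distributions
a = df[df['Pclass']==1]['Age'].dropna()
b = df[df['Pclass']==2]['Age'].dropna()
c = df[df['Pclass']==3]['Age'].dropna()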
hist_data=[a,b,c]
group_labels=['1','2','3']
fig = ff.create_distplot(hist_data,group_labels,bin_size=[1,1,1],show_curve=False)
fig.update_layout(title_text='Distribution for Age')
iplot(fig)
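
The per-class inputs for the distribution plot can also be built programmatically, which avoids one filter line per class and handles missing ages in one place. The sketch below uses the nine rows visible in the `df` output above (including the passenger with a missing `Age`); it stops before the `ff.create_distplot` call so that it has no extra dependencies.

```python
import numpy as np
import pandas as pd

# Pclass and Age for the nine rows shown in the df output above.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, np.nan, 26.0],
})

hist_data = []
group_labels = []
# Iterating over a grouped column yields (class, ages-in-that-class) pairs,
# in ascending class order.
for pclass, ages in toy.groupby("Pclass")["Age"]:
    hist_data.append(ages.dropna().tolist())  # drop passengers without an age
    group_labels.append(str(pclass))

# hist_data and group_labels can then be passed to ff.create_distplot(...)
```

This scales to any number of groups, so the same loop works for `Embarked` or `Sex` without changes other than the column name.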

* Both distribution plots show that a lower Pclass number goes with a higher age and a higher fare.
    * i.e. first class tickets cost more, and older people tend to purchase first class tickets.
* Having concluded that, we want to know how survivability relates to these variables.
* We can plot survivability against them with a bubble plot. Bubble plots can visualize up to 4 variables, which helps us communicate our point here.

# Bubble plot
* Bubble plots are similar to scatter plots, but they have an additional size parameter that sets the size of each marker (`marker.size`, interpreted as a diameter in pixels by default).

In [35]:
#defining data
data=[
    go.Scatter(x = df['Age'],
               y=df['Fare'],
               text=df['Pclass'],
                mode='markers',
               marker=dict(size=df['Pclass']*15, color=df['Survived'],showscale=True),
              )]

#defining layout
layout = go.Layout(title='Fare vs Age with Survivability and Pclass',xaxis=dict(title='Age'),yaxis=dict(title='Fare'),hovermode='closest')

#defining figure and plotting
figure = go.Figure(data=data,layout=layout)
iplot(figure)

From the bubble plot, we can see that:
* Higher age does not imply higher fares
* Most fares above 50 belong to 1st class tickets
* 1st class passengers appear more likely to survive than passengers in the other classes
* Older passengers appear somewhat less likely to survive
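
Visual impressions like these can be checked with a few aggregations. The sketch below shows the queries on the nine rows visible in the `df` output above; nine rows are far too few to support any conclusion, so the numbers only demonstrate the technique and should be recomputed on the full `df`.

```python
import pandas as pd

# Survived, Pclass and Fare for the nine rows shown in the df output above.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3, 1],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 13.0, 30.0, 23.45, 30.0],
})

# Share of survivors per class: the mean of a 0/1 column is a rate.
survival_rate = toy.groupby("Pclass")["Survived"].mean()

# Which classes do the tickets above a fare of 50 belong to?
high_fare_classes = toy.loc[toy["Fare"] > 50, "Pclass"].value_counts()
```

Replacing `toy` with `df` gives the survival rate per class and the class breakdown of expensive tickets for the whole training set.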