# Visualization

In this notebook, we will work with the following:

- Generating standard plotly express visualizations.
- Adding customization to standard visualizations.
- Generating geospatial visualizations.

In [None]:
import pandas as pd
import geopandas
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [None]:
pd.set_option("mode.copy_on_write", True)

# Standard visualizations

As we briefly covered before, `plotly` is a package that provides a wide array of powerful graphing capabilities.

One component, `plotly.express`, gives us a straightforward interface to create [high quality visualizations](https://plotly.com/python/plotly-express/) with relatively modest code.
Also, as we will see below, we can use these standard visualizations and then customize as needed.

While visualization is a deep topic—perhaps big enough for its own course—we can capture a substantial amount of that capability with the easy-to-use `plotly.express` interface.

In [None]:
# Visibility data
vis = pd.read_csv("../data/vis.csv", index_col=False)
vis.head()

## Histogram

Histograms are a good tool for inspecting distributions, and the plotly express histogram function has several parameters that allow us to display histograms in different ways.

In [None]:
# Standard histogram.
fig_01 = px.histogram(vis, x="vis")
fig_01.show()

In [None]:
# Using color allows us to see which groups are in which bins.
# We can also use the height and width parameters to specify a size,
# which makes the figure easier to read.
# The colors also reveal natural groupings in the data.
fig_02 = px.histogram(
    vis,
    x="vis",
    color="ticker",
    height=600,
    width=800,
)
fig_02.show()

## Line

We know this is time series data, so it may be interesting to plot the values over time and by firm using a line chart.

In addition, plotly supports themes (and has several built in), so we can use one that better fits the JupyterLab dark theme that I am using.

In [None]:
# A more informative graph: one line per firm over time.
fig_03 = px.line(
    vis,
    x="year",
    y="vis",
    title="WSJ Coverage by Firm-year",
    color="ticker",
    template="plotly_dark",
    width=800,
    height=600,
)

fig_03.show()

## Scatter matrix

A scatter matrix is a plot that combines scatter plots of multiple variable pairs. 
It's a good way to visually evaluate associations—and the linearity of potential associations—between pairs of variables.

The `scatter_matrix()` function will attempt to use all numeric columns in a given dataframe, so we may need to filter down to only the variables of interest.
You may also note that, in the example below, the scatter matrix needs multiple columns of values, but our data is in a long, record-like shape.
To reshape that, I use the `pivot` method on the `vis` dataframe to separate each firm's values into a separate column.
You can read more in the [pandas reshaping documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html).

In [None]:
# Pivot to wide format (one column per ticker) before plotting.
fig_04 = px.scatter_matrix(
    vis.pivot(index="year", columns="ticker", values="vis"),
    height=600,
    width=800,
    template="plotly_dark",
)
fig_04.show()
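
The reshaping step on its own can be sketched with a tiny example. The data below is made up and only mimics the `year`, `ticker`, and `vis` column names. The point is to show what `pivot` does to a long dataframe.

```python
import pandas as pd

# Made-up long-format data: one row per firm-year.
long_df = pd.DataFrame(
    {
        "year": [2007, 2007, 2008, 2008, 2009, 2009],
        "ticker": ["aapl", "msft"] * 3,
        "vis": [310, 420, 350, 390, 400, 380],
    }
)

# Wide format: one row per year, one column per ticker.
wide_df = long_df.pivot(index="year", columns="ticker", values="vis")
print(wide_df)
```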

## Animation 

Another way of dealing with data features like a time series is with [animation](https://plotly.com/python/animations/).
plotly is significantly more capable than the visualization features we are used to with stats software, and animation is one of the clearest examples.


In [None]:
# Use the time dimension to show changes from year to year.
# range_y pins the y axis so it starts at 0 in every frame.
# Hovering over the graph shows values and a toolbar with a download button.
fig_05 = px.bar(
    vis,
    y="vis",
    x="ticker",
    animation_frame="year",
    color="ticker",
    range_y=[0, 850],
    height=600,
    width=800,
    template="plotly_dark",
)

fig_05.show()

# Customizing visualizations

## Subplots

Here, we are using subplots to make a single figure that has multiple nested parts.
To do so, we use a bit more syntax, detailed below.

1. First, we use `make_subplots()` to create a figure with multiple "cells."
2. Next, we add subplots and specify their positions. The general way to add a trace (a plotted data series) to a figure is the figure's `add_trace()` method, which takes a graph object such as `go.Histogram`. However, because adding traces in this way is a common operation, plotly includes convenience methods for each trace type (e.g., the `add_histogram` method below). Note that the syntax used in the first and second subplots is functionally identical, but the second is more readable and requires less typing.
3. We use the `update_layout` method (which also works with any graph) to set layout parameters for the main figure.
4. Finally, we show the figure.

In [None]:
# A trace is the general term for a data series drawn in a figure.
fig_10 = make_subplots(rows=1, cols=2)

fig_10.add_trace(
    go.Histogram(x=vis[vis["ticker"] == "msft"]["vis"], name="msft"),
    row=1,
    col=1,
)
# df[row mask][column]: filter the rows, then select the column.
fig_10.add_histogram(
    x=vis[vis["ticker"] == "aapl"]["vis"],
    row=1,
    col=2,
    name="aapl",
)

fig_10.update_layout(
    height=600,
    width=800,
    title_text="Side By Side Subplots",
    template="plotly_dark",
)
fig_10.show()

In [None]:
# The convenience methods give nicer syntax here.
# A single figure can also combine different trace types.

fig_11 = make_subplots(rows=2, cols=1)

fig_11.add_scatter(
    x=vis[vis["ticker"] == "msft"]["year"],
    y=vis[vis["ticker"] == "msft"]["vis"],
    mode="lines",
    row=1,
    col=1,
    name="msft",
)

fig_11.add_histogram(
    x=vis[vis["ticker"] == "aapl"]["vis"],
    row=2,
    col=1,
    name="aapl",
)

fig_11.update_layout(
    height=800,
    width=800,
    title_text="Top and Bottom Subplots with Different Types",
    template="plotly_dark",
)
fig_11.show()

As you see above, we can mix and match subplot types and use a different shape for our subplot layout.

## Adding items

We can also add items more generally.
In the example below, we add horizontal lines displaying the average for each firm to our line graph from before.
We also add a vertical rectangle shading the financial crisis time period.

The plotly [line and rectangle documentation](https://plotly.com/python/horizontal-vertical-shapes/) has more details.

In [None]:
# Customize a graph.

fig_12 = px.line(
    vis,
    x="year",
    y="vis",
    title="WSJ Coverage by Firm-year",
    color="ticker",
    template="plotly_dark",
    width=800,
    height=600,
)

# Mean over time for each firm.
# The line argument takes a dict of line properties; here we set "color".
fig_12.add_hline(
    vis[vis["ticker"] == "aapl"]["vis"].mean(),
    annotation_text="aapl (mean)",
    line_dash="dot",
    line={"color": "blue"},
)

fig_12.add_hline(
    vis[vis["ticker"] == "msft"]["vis"].mean(),
    annotation_text="msft (mean)",
    line_dash="dot",
    line={"color": "red"},
)

fig_12.add_vrect(
    x0=2007.5,
    x1=2009.5,
    annotation_text="Financial<br>Crisis",
    annotation_position="top left",
    fillcolor="gray",
    opacity=0.25,
    line_width=0,
)


fig_12.show()

# Geospatial visualizations

One visualization type that we do not often see in our field is the geospatial visualization.
Traditionally, these have been relatively complicated and confined to fields that have historically relied on mapping data.
However, over time, geospatial visualizations have spread as a result of more accessible tools to convert location data and produce the visualizations themselves.

We will only scratch the surface here, though it is easy to see how location and region influence individual behaviors that aggregate up to the ones our field tends to study.

For our example, we have a four-step process.

1. Obtain data that can be converted to a location (i.e., latitude and longitude). Here, we build a small dataframe from a list of dictionaries.
1. Convert the location using an API.
1. Extract the coordinates from the point object.
1. Plot the figure.

In [None]:
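# Start with data that identifies a place: an address, ZIP code, city name,
# or latitude/longitude.
# Read the geocoding API documentation to learn how it resolves a place
# (e.g., whether a city maps to its center or its outer limits).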
hq_list = [
    {
        "firm": "Apple",
        "city": "Cupertino, CA",
    },
    {
        "firm": "Microsoft",
        "city": "Redmond, WA",
    },
    {
        "firm": "Tesla",
        "city": "Palo Alto, CA",
    },
    {
        "firm": "Netflix",
        "city": "Los Gatos, CA",
    },
    {
        "firm": "Twitter",
        "city": "San Francisco, CA",
    },
    {
        "firm": "Amazon",
        "city": "Seattle, WA",
    },
]

hq = pd.DataFrame(hq_list)
hq.head()

In [None]:
# We're using a free geocoding API with fairly strict rate limits.
# You may see errors if you request the same city again
# within a few minutes of a prior (successful) request.
# Also note that this call fails often, so we may have to retry.

# Geocode each city with the Photon provider and keep the resulting
# geometry (point) column on hq. Inputs here follow a "City, ST" pattern.
hq["geometry"] = geopandas.tools.geocode(hq["city"], provider="photon")["geometry"]
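
Because the geocoding call fails intermittently, a small retry wrapper can save some manual re-running. The helper below is a hypothetical sketch, not part of geopandas. `with_retries` and `flaky_lookup` are names invented for illustration, and `flaky_lookup` only simulates a service that fails twice before succeeding.

```python
import time


def with_retries(func, *args, attempts=3, wait_seconds=0.0, **kwargs):
    """Call func until it succeeds or the attempts run out (hypothetical helper)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if attempt < attempts:
                time.sleep(wait_seconds * attempt)  # wait a bit longer each time
    raise last_error


# Simulated flaky service: fails on the first two calls, then succeeds.
calls = {"count": 0}


def flaky_lookup(city):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"geocoded {city}"


result = with_retries(flaky_lookup, "Redmond, WA", attempts=5)
print(result, calls["count"])
```

In the notebook, the same wrapper could be applied as `with_retries(geopandas.tools.geocode, hq["city"], provider="photon", wait_seconds=5)`, with the wait chosen to respect the provider's rate limits.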

In [None]:
hq.head()

In [None]:
def get_lat(value):
    return value.y


def get_lon(value):
    return value.x


hq["lat"] = hq["geometry"].apply(get_lat)
hq["lon"] = hq["geometry"].apply(get_lon)
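
A common source of confusion in this step is that latitude is the point's `y` and longitude is its `x`. The sketch below shows the same extraction with lambdas. `FakePoint` is a stand-in class invented here so the example runs without geopandas, and the coordinates are made-up round numbers, not real headquarters locations.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class FakePoint:
    """Stand-in for a geometry point; only provides .x and .y."""

    x: float  # longitude
    y: float  # latitude


demo_hq = pd.DataFrame(
    {
        "firm": ["Example A", "Example B"],
        "geometry": [FakePoint(x=-122.0, y=37.0), FakePoint(x=-120.5, y=47.5)],
    }
)

# Latitude comes from y, longitude from x.
demo_hq["lat"] = demo_hq["geometry"].apply(lambda point: point.y)
demo_hq["lon"] = demo_hq["geometry"].apply(lambda point: point.x)
print(demo_hq[["firm", "lat", "lon"]])
```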

In [None]:
hq.head()

In [None]:
# Having lat/lon as plain columns makes the frame easier to use with plotly express.
fig_20 = px.scatter_mapbox(
    hq,
    lat="lat",
    lon="lon",
    color="firm",
    zoom=3.75,
    mapbox_style="carto-positron",
    width=800,
    height=600,
    template="plotly_dark",
    title="Firm Headquarters Locations",
)
fig_20.show()

# Breakout Exercises

Let's do an exercise to reinforce the concepts we learned above.


## EX1: plot employee data

Let's make a chart using the firm year data from the prior segment.

1. Read your dataset into a pandas dataframe with the name `firmyear`. To find the proper function, you may want to look at the [pandas IO reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) or the prior segment materials. Also, display the first five rows. (Don't forget to fix the type on the `count_of_employees` variable.)
1. Create a bar chart (named `fig_50`) with `count_of_employees` on the y axis, `year` on the x axis, and a bar for each firm in each year.

Note: time permitting, try out different template themes that are included in plotly ([see here](https://plotly.com/python/templates/)).

Hint: if you have issues with "a bar for each firm in each year," see [this documentation](https://plotly.com/python/bar-charts/#grouped-bar-chart).

In [None]:
# 1-1 code
firmyear = pd.read_stata("../data/firmyear.dta")
firmyear.head()

In [None]:
# Inspect the column types, then cast the employee count to an integer.
print(firmyear.dtypes)
firmyear["count_of_employees"] = firmyear["count_of_employees"].astype("int")
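
If the column contains missing or non-numeric entries, `astype("int")` raises an error. One possible alternative, sketched here on made-up values rather than the actual `firmyear` data, is to coerce to numeric first and then use pandas' nullable `Int64` type.

```python
import pandas as pd

# Made-up values: two valid counts, one missing, one non-numeric.
raw = pd.Series(["1200", "850", None, "n/a"], name="count_of_employees")

# Unparseable entries become missing, then the nullable integer type keeps them.
clean = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(clean)
```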

In [None]:
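# 1-2 code
# Scratch attempt with subplots: one histogram of employee counts per firm.
# Assumes "Microsoft" and "Google" appear in firmyear["name"].
fig_50 = make_subplots(rows=1, cols=2)
fig_50.add_trace(
    go.Histogram(
        x=firmyear[firmyear["name"] == "Microsoft"]["count_of_employees"],
        name="Microsoft",
    ),
    row=1,
    col=1,
)
# df[row mask][column]: filter the rows, then select the column.
fig_50.add_histogram(
    x=firmyear[firmyear["name"] == "Google"]["count_of_employees"],
    row=1,
    col=2,
    name="Google",
)
fig_50.show()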

In [None]:
# color does not need to be an actual color; passing a column name
# splits the bars by that column.
# Checking the documentation first saves time.
fig_50 = px.bar(
    firmyear, y="count_of_employees", x="year", color="name", barmode="group"
)
fig_50.show()

In [None]:
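# Pitfall: chaining .show() onto the constructor displays the chart, but
# .show() returns None, so fig_50 does not hold a figure to display later.
fig_50 = px.bar(
    firmyear, y="count_of_employees", x="year", color="name", barmode="group"
).show()
# fig_50.show()  # would raise AttributeError because fig_50 is None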

In [None]:
fig_50 = px.bar(firmyear, y="count_of_employees", x="year", color="name")
fig_50.show()
