<a name="cid1"></a>
Prev: [Tabular Data](../08_tabular_data/tabular_data.ipynb) | [Table of Contents](../toc.ipynb) | Next: [Managing Big Arrays with Numpy](../10_numpy/numpy.ipynb)

<a name="cid2"></a>
# Data Visualization

## I. Introduction
Data visualization refers to presenting information by physically arranging textual and graphic elements. Charts, graphs, plots, tables, and maps are all examples of data visualization. Visualization is an integral part of analytics and data science because it allows people to interpret large amounts of data more easily and efficiently. It's an important skill that is used widely in the professional world. In FRC, we analyze team performance by graphing scouting data, which helps us evaluate the strengths and weaknesses of all robots at the competition!

There are many styles of data visualizations, both static and interactive. Here are some fun examples on the web:
* [Every Satellite Orbiting Earth](https://qz.com/296941/interactive-graphic-every-active-satellite-orbiting-earth/)
* [Music Timeline](https://thenextweb.com/wp-content/blogs.dir/1/files/2014/01/music_timeline_genres.png)
* [NYC Trees](https://www.cloudred.com/labprojects/nyctrees/#about)
* [Matplotlib Examples](https://matplotlib.org/gallery/index.html#showcase)

Python has numerous packages for creating visualizations. We will start by creating visualizations with the [Pandas](https://pandas.pydata.org/pandas-docs/stable/) package. That will be followed by a short review of the [Matplotlib](https://matplotlib.org/stable/users/index.html) package. Finally, we'll explore the [Plotly](https://plotly.com/python/) package.

<a name="cid3"></a>
## II. Notebook Setup
### A. If Running this Notebook on Google Colab
Run the next cell to download data files into the local folder on Google Colab.

In [None]:
# !wget -nv https://raw.githubusercontent.com/irs1318dev/python2023/main/output/09_visualization/get_files.sh
# !bash get_files.sh

<a name="cid4"></a>
### B. If Running Notebook Locally in Jupyter
This notebook uses several external packages that must be installed before the code in this notebook will run.

If you are using the Anaconda or Miniconda flavors of Python, run the following command in Mac Terminal or Windows Powershell:
```bash
conda install pandas plotly dash jupyter-dash
```
If you are using pip, run the following:
```bash
pip install pandas plotly dash jupyter-dash
```
You don't need to run these commands if you are using Google Colab because the packages are already installed.

<a name="cid5"></a>
### C. Warnings
The Plotly package issues future warnings when datetimes are used. The next cell turns off future warnings.

In [None]:
import warnings

warnings.filterwarnings("ignore", category=FutureWarning) 
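
To see what that filter actually does, here is a minimal sketch that uses only Python's standard library. The `noisy()` function is a hypothetical stand-in for a library call that emits a `FutureWarning`; it is not part of Plotly or Pandas.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this behavior will change", FutureWarning)
    return 42


# Record warnings inside this block so we can inspect them afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start by letting every warning through
    warnings.filterwarnings("ignore", category=FutureWarning)
    result = noisy()  # the FutureWarning is filtered out
    warnings.warn("still visible", UserWarning)

print(result)
print([w.category.__name__ for w in caught])
```

The function still runs and returns its value. Only the `FutureWarning` message is hidden, and other warning categories are unaffected.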

<a name="cid6"></a>
## III. Using Pandas to Create Charts
We learned to use the *Pandas* package to analyze tabular data in the previous notebook. Pandas has methods for generating charts.

In [None]:
import pandas as pd

<a name="cid7"></a>
### A. Scatter and Bar Charts
Our first few charts will use the *recent-grads.csv* file, which contains employment, demographic, and salary information for 173 different majors.

In [None]:
grads = pd.read_csv("recent-grads.csv")
grads.head()

<a name="cid8"></a>
For our first graph, let's plot the median salary vs. the fraction of women graduates for each major.

In [None]:
grads.plot.scatter(x="ShareWomen", y="Median")

<a name="cid9"></a>
Unfortunately, the plot shows that median salary tends to be lower for majors with higher percentages of female graduates. That's definitely worth discussing, but for now we're going to focus on visualization details.

Creating charts in Pandas is easy -- we just call one of Pandas's charting methods and pass the names of the columns that contain the data to be plotted. In this chart, we plotted the fraction of women on the x axis and the median income on the y axis. 

The chart looks OK, but it's missing a title and the axis labels are confusing. Let's fix that.

In [None]:
grads.plot.scatter(
    x="ShareWomen", y="Median", grid=True,
    title="Median Income vs. Fraction Women Graduates",
    xlabel="Fraction Female Graduates", ylabel="US $")

<a name="cid10"></a>
Pandas's plotting methods accept numerous arguments for customizing plots. See the documentation for [Pandas `plot` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to see more.

Next, let's make a bar chart showing the median incomes for the top 10 highest paying majors.

In [None]:
top10 = (
    grads
    .sort_values("Median", ascending=False)
    .iloc[:10, :]
)
top10

In [None]:
top10.plot(kind="barh", x="Major", y="Median")

<a name="cid11"></a>
Hmmm, computer engineering salaries look lower than what I expected. But honestly, I don't know how old this dataset is.

Nevertheless, did you notice that it took more lines of code to prepare the data for plotting than it took to do the actual plotting? That's typical.
* We used `DataFrame.sort_values()` to sort the data by income.
* We used `DataFrame.iloc[]` to select the top 10 majors.
* Finally, we sent our data to the plotting function (`DataFrame.plot()`).

For the bar chart, we called the `DataFrame.plot()` method and used the `kind` argument to specify the type of chart. We could also have used the `DataFrame.plot.barh()` method.
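
The sketch below compares the two spellings on a tiny made-up dataframe (the `demo` data is invented for illustration). It assumes Matplotlib is installed. The `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook, so leave it out in Jupyter.

```python
import matplotlib

# Assumption: use a non-interactive backend so this runs outside a notebook.
matplotlib.use("Agg")
import pandas as pd

# Made-up data for illustration only.
demo = pd.DataFrame({"Major": ["A", "B"], "Median": [3, 5]})

ax_kind = demo.plot(kind="barh", x="Major", y="Median")
ax_method = demo.plot.barh(x="Major", y="Median")

# For a horizontal bar chart, each bar's width is the plotted value.
widths_kind = [bar.get_width() for bar in ax_kind.patches]
widths_method = [bar.get_width() for bar in ax_method.patches]
print(widths_kind == widths_method)
```

Both calls should draw the same bars, so which one to use is a matter of taste.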

<a name="cid12"></a>
### B. Method Chaining
The code statements to make the bar chart were arranged using a technique called [method chaining](https://medium.com/@ulriktpedersen/modern-pandas-streamlining-your-workflow-with-method-chaining-f65e75deb193), which improves the readability of the code.

Here is a more traditional way to write the code:
```python
# Using intermediate variables to save results of DataFrame operations
sorted_grads = grads.sort_values("Median", ascending=False)
filtered_grads = sorted_grads.iloc[:10, :]
filtered_grads.plot(kind="barh", x="Major", y="Median")
```
We assigned the results of each operation to a new variable. It works fine, but it takes effort to think of descriptive variable names and the extra variables clutter up the code. Because `.sort_values()` and `.iloc[]` both return Pandas `DataFrame` objects, we can eliminate the `sorted_grads` and `filtered_grads` variables.

```python
# Chaining DataFrame operations together.
grads.sort_values("Median", ascending=False).iloc[:10, :].plot(kind="barh", x="Major", y="Median")
```
Chaining the dataframe operations together creates a long, complicated line of code that is hard to read. Fortunately we can use parentheses to split the operations into several lines.
```python
(
    grads
    .sort_values("Median", ascending=False)
    .iloc[:10, :]
    .plot(kind="barh", x="Major", y="Median")
)
```

Voila! The dataframe operations are easy to understand because each operation is on its own line and there are no intermediate variables cluttering up the code.

Any method that returns a `DataFrame` can be used in a method-chaining context. These Pandas methods are especially useful for method chaining.

* [assign](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html#pandas.DataFrame.assign): Creates new dataframe columns.
* [set_axis](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_axis.html#pandas.DataFrame.set_axis): Assign new index or column labels to the dataframe.
* [pipe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe): Run an arbitrary function in a method-chaining context.

[Read this article](https://medium.com/@ulriktpedersen/modern-pandas-streamlining-your-workflow-with-method-chaining-f65e75deb193) to learn more about method chaining.
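
Here is a small sketch that uses `assign`, `set_axis`, and `pipe` in a single chain. The `raw` data and the `top_n()` helper are made up for this illustration and are not part of Pandas.

```python
import pandas as pd

# Made-up data for illustration only.
raw = pd.DataFrame({"Major": ["A", "B", "C"],
                    "Median": [40000, 60000, 50000]})


def top_n(df, n):
    """Hypothetical helper: keep the n rows with the highest salary."""
    return df.sort_values("salary_k", ascending=False).head(n)


result = (
    raw
    # assign: add a column computed from existing columns.
    .assign(salary_k=lambda df: df["Median"] / 1000)
    # set_axis: replace all column labels at once.
    .set_axis(["major", "median", "salary_k"], axis=1)
    # pipe: call our own function as one step in the chain.
    .pipe(top_n, n=2)
)
print(result)
```

Notice that `pipe` passes the dataframe as the first argument to `top_n`, and any extra arguments (here `n=2`) are forwarded along.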

<a name="cid13"></a>
### C. Time Series Plot
Next we'll make a time series plot. We'll use a dataset that contains atmospheric carbon dioxide (CO2) concentrations that were recorded at the Mauna Loa Observatory in Hawaii. [Click here to learn more about the dataset.](https://www.kaggle.com/datasets/ucsandiego/carbon-dioxide/)

In [None]:
co2_raw = pd.read_csv("co2.csv")
co2_raw

<a name="cid14"></a>
The dataset contains monthly measurements from 1958 to 2017. Each measurement was taken at midnight on the 15th day of the month. The units are parts per million by volume (ppmv). A ppmv of 300 indicates that 300 of every 1,000,000 molecules in the measurement sample were CO2. If the sample were entirely carbon dioxide, the ppmv would be 1,000,000.
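
If ppmv feels abstract, it can help to convert it to a percentage. The `ppmv_to_percent()` function below is a hypothetical helper written just for this illustration.

```python
def ppmv_to_percent(ppmv):
    """Convert parts per million by volume to a percentage of the sample."""
    return ppmv * 100 / 1_000_000


print(ppmv_to_percent(300))        # a sample with 300 ppmv of CO2
print(ppmv_to_percent(1_000_000))  # a sample that is entirely CO2
```

A concentration of 300 ppmv is only three hundredths of one percent of the sample, which is why ppmv is a more convenient unit than percent for trace gases.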

The dataset needs some work before we can plot it. It's using non-standard date formats and the column labels are awkward to type because they contain spaces, mixed case, and parentheses.

First, let's create a date column that uses a date format that Pandas recognizes. We'll use the [pandas.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas-to-datetime) function, which accepts a dataframe with year, month, and day columns.

In [None]:
co2_dates = (
    # Extract the year and month columns.
    co2_raw[["Year", "Month"]]
    # Add a day column, with every measurement taken on the 15th.
    .assign(Day=15)
    # Create a date column.
    .assign(date=lambda df: pd.to_datetime(df))
    # Discard all columns except for date.
    .loc[:, "date"]
)
co2_dates

<a name="cid15"></a>
Next, let's set the index to the date column, discard the nonstandard date columns, and rename the remaining columns.

In [None]:
co2 = (
    co2_raw
    .set_index(co2_dates)
    .iloc[:, 3:]
    .set_axis(["co2", "seasonally_adjusted_co2",
               "fitted_co2", "fitted_seasonally_adjusted_co2"],
              axis=1)
)
co2

In [None]:
co2.plot(y='co2')

<a name="cid16"></a>
Wow, that is quite a trend! CO2 rises and falls annually because there is more landmass in the northern hemisphere. Plants absorb enough CO2 during the northern hemisphere's summer to cause CO2 concentrations to decrease, but then CO2 increases during the winter months. There is an unmistakable increasing trend over the long term.

Did you notice that we did not provide an `x` parameter to the `plot()` method? Pandas assumes that we want to plot the index on the x-axis if we omit the `x` parameter.
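
Here is a small sketch of that behavior using made-up numbers and a plain integer index. It assumes Matplotlib is installed. The `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook, so leave it out in Jupyter.

```python
import matplotlib

# Assumption: use a non-interactive backend so this runs outside a notebook.
matplotlib.use("Agg")
import pandas as pd

# Made-up values for illustration; the index holds the years.
demo = pd.DataFrame({"co2": [315.0, 316.5, 317.2]},
                    index=[1958, 1959, 1960])

ax = demo.plot(y="co2")  # no x argument, so the index is used
line = ax.get_lines()[0]
print(list(line.get_xdata()))
print(list(line.get_ydata()))
```

The x values of the plotted line come straight from the dataframe's index, which is why setting a meaningful index (like our date column) pays off when plotting.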

<a name="cid17"></a>
### D. Pandas Plotting Exercises

#### 1. Using Pandas Documentation
Review the [visualization section of the Pandas user's guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) and the API documentation for the [DataFrame.plot()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot) method.

In your own words, explain what the `xlim`, `ylim`, and `figsize` parameters do.

<a name="cid18"></a>
#### 2. Experiment with Plotting Parameters
Change the size of the CO2 plot and modify the x-axis limits.

HINT: Specify the dates as string years, e.g., "1960".

In [None]:
# Plotting Parameters Exercise



<a name="cid19"></a>
#### 3. Titles and Labels
Add a title and suitable x and y axis labels to the CO2 plot.

In [None]:
# Titles and Labels Exercise




<a name="cid20"></a>
## IV. Matplotlib
Matplotlib is one of the oldest Python packages for visualization. Pandas uses the [Matplotlib package](https://matplotlib.org/stable/users/index.html) to draw plots. This means we can use Matplotlib to customize the charts that we create with Pandas plotting functions.

Pandas is not the only package that uses Matplotlib. [Seaborn](https://seaborn.pydata.org/index.html), [Geoplotlib](https://github.com/andrea-cuttone/geoplotlib), and probably a few other packages use Matplotlib to draw some or all of their plots. It's good to be familiar with Matplotlib if you plan to use Python for data analysis.



### A. Understanding Matplotlib and Pyplot
Matplotlib has a submodule called *pyplot* that emulates a software package called [Matlab](https://www.mathworks.com/products/matlab.html). Matlab is a commercial (and pricey) software package for scientific analysis and visualization. Users of Matplotlib typically import both Matplotlib and its pyplot module, as demonstrated in the next cell.

In [None]:
import matplotlib
import matplotlib.pyplot as plt

<a name="cid21"></a>
Read through [Matplotlib's Pyplot tutorial](https://matplotlib.org/stable/tutorials/pyplot.html#sphx-glr-tutorials-pyplot-py). Refer to the tutorial while completing the Matplotlib exercises.

### B. Matplotlib Exercises
#### 1. Income versus Gender Plot
Recreate the income vs. gender plot using only Matplotlib Pyplot functions. In other words, don't use any Pandas plotting functions. Set the title and axis labels to appropriate values.

HINT: What is the `data` parameter to the `pyplot.plot()` function used for?

In [None]:
# Recreate Income vs Fraction Female Plot




<a name="cid22"></a>
#### 2. Top 10 Majors by Income
Recreate the bar chart showing the top 10 majors by income using Matplotlib Pyplot functions.

HINT: There is a `pyplot.barh()` function.

In [None]:
# Plot top 10 majors by income




<a name="cid23"></a>
## V. Plotly
### A. Getting Started
The Issaquah Robotics Society is increasingly using a package called [Plotly](https://plotly.com/python/) for making visualizations. Plotly has features that make it suitable for building interactive web applications. Consider the income chart in Plotly:

In [None]:
# Income by Major and Gender in Plotly
import plotly.express as px

grads_fig = px.scatter(
    grads,
    x="ShareWomen", y="Median",
    color="Major_category", symbol="Major_category",
    title="Median Income by Major",
    labels={
        "Median": "Median Income in US $",
        "ShareWomen": "Fraction Female Graduates",
        "Major_category": "Category"},
    hover_name="Major",
    hover_data={"ShareWomen": ":.2f"},
    width=900, height=750
)
grads_fig.show()

<a name="cid24"></a>
Now the chart is color-coded by category and you can see the major when you hover over the data points. We have a legend, and there is a cool toolbar in the upper right corner with panning, zoom, and other features.

To be fair, we could have constructed some of these features with Matplotlib, but it would have been more work.

Plotly allowed us to construct most of the plot with a single call to the `plotly.express.scatter()` function, which returned a `Figure` object. We then displayed the figure with the `show()` method.

The comments in the code snippet below explain the `.scatter()` function's parameters.

```python
import plotly.express as px

grads_fig = px.scatter(
    #  Dataframe with the data we want to plot.
    grads,
    
    # Names of columns to be plotted on x and y axes.
    x="ShareWomen", y="Median",
    
    # Assign colors and symbols based on value in "Major_category"
    #   column of dataframe.
    color="Major_category", symbol="Major_category",
    
    # The plot title
    title="Median Income by Major",
    
    # Used to set text in legend and axis labels.
    # The key is the dataframe column name and the value is the
    #   text that should replace the column name.
    labels={
        "Median": "Median Income in US $",
        "ShareWomen": "Fraction Female Graduates",
        "Major_category": "Category"},
    
    # Customizes the hover tooltip, which is the box that appears
    #   when a viewer hovers over a mark on the plot.
    # Bold title text of tooltip comes from "Major" column.
    hover_name="Major",
    # Displays ShareWomen value with two decimal places.
    hover_data={"ShareWomen": ":.2f"},
    
    # Sets width and height of plot.
    width=900, height=750
)
grads_fig.show()

```

<a name="cid25"></a>
### B. Plotly Express Versus Graph Objects
There are two ways to make charts with the Plotly package. The income chart used
`plotly.express`, which is how Plotly recommends making charts. The `plotly.express`
package, which is typically imported as `px`, provides easy-to-use functions
for constructing many different kinds of charts.

The other technique for making charts in Plotly uses `plotly.graph_objects`,
typically imported as `go`. The `graph_objects` module provides finer
control over the plot, and is the only technique that works for some of
the more complex charts.

The next cell re-creates the CO2 plot using Plotly's `graph_objects` syntax.

In [None]:
# CO2 Plot using Plotly's graph objects.
import plotly.graph_objects as go

co2_fig = go.Figure(
    data=go.Scatter(x=co2.index, y=co2.co2)
)
co2_fig.update_layout(
    title_text="CO2 Concentration by Month and Year",
    width=750, height=400
)
co2_fig.update_yaxes(title_text="PPM by Volume")
co2_fig.show()

<a name="cid26"></a>
We'll use the `plotly.express` technique in this lesson. I mentioned the
graph objects technique because you will see it mentioned in
[Plotly's documentation](https://plotly.com/python/).

<a name="cid27"></a>
### C. The Plotly-Dash Framework
The Python Plotly package is just one part of a larger system for building
web-based analytic applications. It is designed to work with a
[web framework called Dash](https://dash.plotly.com). You will see numerous
references to Dash within Plotly's
documentation. You can ignore those references for now.

<a name="cid28"></a>
### D. How Plotly Works
When you use Python to create a chart with Plotly, Plotly first builds a `Figure` object that contains all the information needed to construct the chart and is easily converted to JSON. Plotly then passes the data as JSON to Javascript functions that are running in the web browser. The Javascript functions draw the chart in the browser and manage all user interactions.

Plotly's `show()` method can be used to display the JSON object from which the chart is constructed. Here is the JSON object for the CO2 chart. Click on the triangles to expand each section. You can learn more about the [structure of the `Figure` object in Plotly's documentation](https://plotly.com/python/figure-structure/), but that knowledge is not required to complete this notebook.

In [None]:
co2_fig.show("json")

<a name="cid29"></a>
Plotly provides visualization packages for several other programming languages, including [R](https://www.r-project.org/about.html), [Julia](https://julialang.org/), [Javascript](https://developer.mozilla.org/en-US/docs/Web/JavaScript), and [F#](https://learn.microsoft.com/en-us/dotnet/fsharp/what-is-fsharp). All of these packages work the same way in that they create a `Figure` object that is drawn by Javascript functions.

<a name="cid30"></a>
### E. The Supermarket Dataset
Let's add a new dataset. The supermarket_sales.csv file contains information on 1,000 transactions at supermarkets in Myanmar.

In [None]:
# Loading supermarket
sales = pd.read_csv("supermarket_sales.csv")
sales.head()

<a name="cid31"></a>
Let's look at average sale amount by payment type. First, we need to prepare the data. We'll use the aggregation techniques we learned when we were studying tabular data.

In [None]:
by_payment = (
    sales
    .groupby("Payment")
    .agg(
        average_sales=("cogs", "mean"),
        count_of_sales=("cogs", "count"))
    .reset_index()
)
by_payment.head()

In [None]:
payment_fig = px.bar(
    by_payment,
    x="average_sales",
    y="Payment",
    orientation="h",
    title="Transaction Amount by Payment Type",
    hover_name="Payment",
    hover_data={
        "average_sales": ":.2f",
        "Payment": False},
    labels={"average_sales": "Average Transaction Amount"}
)
payment_fig.show()


<a name="cid32"></a>
Maps in Plotly are tricky and we won't be using them much. [The Plotly documentation has several articles on maps](https://plotly.com/python/maps/). The article on bubble maps is a good place to start.

<a name="cid33"></a>
#### 1. Plotly Barchart Exercise
Create a *vertical* bar chart showing the average transaction amount, grouped by product line.

In [None]:
# Average Transaction Amount by Product Line


<a name="cid34"></a>
### F. Plotting Dates and Times
Plotting data with dates and times can be tricky. Before we can plot the data, we need to determine how the dates and times are represented in the dataframe. The `.dtypes` property can help with that.

In [None]:
sales.dtypes

<a name="cid35"></a>
It would be nice to have a column with a datetime object type. We'll use the Pandas `to_datetime` function again to convert the columns. Review these articles if you have not already done so.
* [Pandas `to_datetime` Method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas-to-datetime)
* [Format Codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)

<a name="cid36"></a>
Every column that's listed as an *object* contains string values. We need to convert the date and time columns to a date-time data type before we can use the column in a plot.

> The term *object* comes from [Numpy](https://numpy.org/doc/stable/index.html), which is a Python package that is used to efficiently work with arrays. We will cover Numpy in the next notebook.

Writing code that accurately reflects dates and times is surprisingly difficult and a frequent target of programmer complaints. Fortunately, [Pandas `to_datetime` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas-to-datetime) will work in this case.

The first thing we need to do is combine the separate date and time columns into a single column:

In [None]:
# Combining Date and Time columns, using a space as a separator
datetime_string_column = sales.Date + " " + sales.Time
datetime_string_column

<a name="cid37"></a>
Next we'll use `to_datetime` to convert the datetime column from strings to a datetime object. Because there are so many ways to represent dates and times in strings, we have to tell `to_datetime` how to do the conversion. Here are some examples.
* Date of first Moon landing, North American style: 7/20/1969
* European style: 20/7/1969
* Before we started worrying about Y2K: 7/20/69
* With leading zeros: 07/20/69
* Spelled out: July 20th, 1969
* Really spelled out: Sunday, July 20th, 1969
* Time of Moon landing: 4:17 PM EDT (Eastern Daylight Time)
* Military style: 1617
* Coordinated Universal Time: 20:17 UTC
* [ISO 8601 style](https://en.wikipedia.org/wiki/ISO_8601) in UTC: 1969-07-20T20:17:39Z
* ISO 8601 style in local time with timezone offset: 1969-07-20T16:17:39-04:00

We can use a `format` parameter to tell `to_datetime` how the strings are formatted. The format string will contain codes that represent the different parts of a date and time. Each code starts with a percent sign, "%".
* `%m`: Month as a zero-padded decimal number.
* `%d`: Day of the month as a zero-padded decimal number.
* `%Y`: Year with century as a decimal number.
* `%H`: Hour (24-hour clock) as a zero-padded decimal number.
* `%M`: Minute as a zero-padded decimal number.

Characters that are not part of a format code are interpreted literally. There are several more format codes. [The full list is here.](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) The `to_datetime` function uses the same format codes as the [datetime module from Python's Standard Library.](https://docs.python.org/3/library/datetime.html)
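
Because the format codes are shared with the standard library, we can try them out with `datetime.strptime()` before using them in Pandas. This is a small sketch; the date strings are made up for illustration.

```python
from datetime import datetime

# The same moment written two ways; each needs its own format string.
us_style = datetime.strptime("7/20/1969 16:17", "%m/%d/%Y %H:%M")
eu_style = datetime.strptime("20/7/1969 16:17", "%d/%m/%Y %H:%M")
print(us_style == eu_style)

# An ambiguous string: the format decides which number is the month.
as_month_first = datetime.strptime("07/04/1969", "%m/%d/%Y")
as_day_first = datetime.strptime("07/04/1969", "%d/%m/%Y")
print(as_month_first.month, as_day_first.month)
```

The second pair shows why the format matters: the same text can be read as July 4th or April 7th, and neither reading raises an error.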

In [None]:
datetime_column = pd.to_datetime(
    datetime_string_column,
    format="%m/%d/%Y %H:%M")
datetime_column

<a name="cid38"></a>
The cell output indicates that the column has been converted from object to datetime64[ns].

Finally, we'll add the datetime column to the sales dataframe and drop the individual *Date* and *Time* columns. There are a lot of columns, so you might have to scroll right to see the datetime column.

In [None]:
sales = (
    sales
    .assign(datetime=datetime_column)
    .drop(["Date", "Time"], axis=1)
)
sales.head()   

<a name="cid39"></a>
### G. Plotly Documentation
This notebook covers only a small fraction of the charts that can be built with Plotly. It would take a thick book to cover them all. Becoming a Plotly expert requires becoming familiar with [Plotly's Official Documentation](https://plotly.com/python/). Take a look at the main documentation page to see what topics are available. Then go to [Plotly's Basic Charts page](https://plotly.com/python/basic-charts/) and find at least one code example for a scatter plot, a line plot, and a bar chart.

<a name="cid40"></a>
#### 1. Scatter Plot Exercise
Create a scatter plot with *datetime* on the x axis and *Total* on the y axis. Color-code the plot by *City*.

HINT: [Plotly's article on scatter plots](https://plotly.com/python/line-and-scatter/) has examples that show how to color-code a plot.

The code for your plot might generate a warning. Don't worry about that.

In [None]:
# Total by Date Exercise



<a name="cid41"></a>
#### 2. Line Plot Exercise
Construct a line chart of total transaction amount (y) by datetime (x) for the city of Yangon. Add a title and a nicely formatted x-axis label.

HINT: Sort the dataframe before plotting it.

In [None]:
# Line Plot Exercise




<a name="cid42"></a>
### H. Colors
Color choices have a big impact on the effectiveness of a data visualization. Plotly provides several methods for fine tuning color choices. Consider this bar chart of supermarket sales data.

In [None]:
# Sum sales totals by payment type and product line.
summed_sales = (
    sales
    .groupby(["Payment", "Product line"])
    .agg({"Total": "sum"})
    .reset_index()
)
summed_sales.head()

In [None]:
# Create a stacked bar chart
gfig = px.bar(
    summed_sales,
    x="Product line", y="Total",
    color="Payment",
    title="Total Sales",
    labels={"Product line": "Product Line", "Total": "Myanmar Kyat"})
gfig.show()

<a name="cid43"></a>
The `px.bar` function, like many of the Plotly Express functions, has a `color` parameter. But instead of passing a color like `Red` or `LightSeaGreen`, we passed in the name of a column. The `color` parameter tells Plotly to assign different colors to the sales totals for each payment type. That's very helpful, but what if we wanted to change the colors?

To change colors we can use the `color_discrete_sequence` parameter, which accepts a list of colors, such as one of Plotly's built-in color sequences. You can see the built-in color sequences in [Plotly's documentation](https://plotly.com/python/discrete-color/#color-sequences-in-plotly-express), or you can run the following cell.

In [None]:
# Plotly color sequences
color_fig = px.colors.qualitative.swatches()
color_fig.show()

<a name="cid44"></a>
Plotly uses the *Plotly* color sequence by default. Let's try the *Antique* color sequence.

In [None]:
# Create a stacked bar chart
gfig = px.bar(
    summed_sales,
    x="Product line", y="Total",
    color="Payment",
    color_discrete_sequence=px.colors.qualitative.Antique,
    title="Total Sales",
    labels={"Product line": "Product Line", "Total": "Myanmar Kyat"})
gfig.show()

<a name="cid45"></a>
#### 1. Custom Color Sequence
Plotly's built-in color sequences will work for most charts. But what if we wanted to define our own color sequence? See the next cell.

In [None]:
# Create a stacked bar chart
gfig = px.bar(
    summed_sales,
    x="Product line", y="Total",
    color="Payment",
    color_discrete_sequence=["Indigo", "GoldenRod", "Grey"],
    title="Total Sales",
    labels={"Product line": "Product Line", "Total": "Myanmar Kyat"})
gfig.show()

<a name="cid46"></a>
That seems to work fine, but *Goldenrod*? Really? Can we use any color name we can think of? The answer to that question is obviously no. All modern browsers support [this set of 140 named colors](https://www.w3schools.com/colors/colors_names.asp), and most plotting packages will accept these color names as well.



One hundred and forty colors is plenty for most situations. But suppose you wanted to humor a mentor who went to college in a town called Pullman and decided you just had to use [Washington State University's official colors](https://brand.wsu.edu/colors/)? See the next cell.

In [None]:
# Go Cougs Chart
gfig = px.bar(
    summed_sales,
    x="Product line", y="Total",
    color="Payment",
    color_discrete_sequence=["#888888", "#A60F2D", "#4D4D4D"],
    title="Total Sales",
    labels={"Product line": "Product Line", "Total": "Myanmar Kyat"})
gfig.show()

<a name="cid47"></a>
The color codes that start with a hash character (#) are [CSS Hex Colors](https://www.w3schools.com/css/css_colors_hex.asp). They use the format `#rrggbb` where *rr*, *gg*, and *bb* are two-digit [hexadecimal numbers](https://en.wikipedia.org/wiki/Hexadecimal) that specify the intensity of the color's red, green, and blue components. Hexadecimal numbers use base 16, so each digit ranges from 0 to F, which is 15 in base 10. Each color component can be assigned a number ranging from 00 to FF (255). The color `#000000` is black and `#FFFFFF` is white. `#FF0000` would be bright red.
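
The sketch below decodes a hex color into its red, green, and blue components using only built-in Python. The `hex_to_rgb()` function is a hypothetical helper written for this illustration and is not part of Plotly.

```python
def hex_to_rgb(code):
    """Split a '#rrggbb' string into red, green, and blue integers (0-255)."""
    digits = code.lstrip("#")
    # int(text, 16) interprets the text as a base-16 number.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#FF0000"))  # bright red
print(hex_to_rgb("#A60F2D"))  # the crimson from the chart above
```

For the crimson color, `A6` works out to 10 * 16 + 6 = 166, `0F` is 15, and `2D` is 2 * 16 + 13 = 45.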

In [None]:
co2_fig = px.line(
    co2, x=co2.index, y="co2",
    color_discrete_sequence=["GoldenRod"],
    title="Atmospheric Carbon Dioxide Concentration",
    labels={"co2": "PPM by Volume", "date": ""}
)
co2_fig.show()

<a name="cid48"></a>
#### 2. Color Code Exercise
Go to https://www.w3schools.com/colors/colors_picker.asp and play with the color code selector. Choose at least three colors and make a color sequence. Add a [named HTML color](https://www.w3schools.com/colors/colors_names.asp) and run the plotting cell to display the colors.

In [None]:
# Add your colors to the list as strings, e.g., "#FFFFFF".
my_color_sequence = []

In [None]:
# Plotting Cell. Run this cell to display your color sequence.
def plot_color_sequence(colors):
    cfig = px.pie(
        values=[1 for _ in colors],
        names=colors,
        color_discrete_sequence=colors,
        title="My Color Sequence",
        height=500)
    cfig.update_traces(textinfo="label")
    cfig.update_layout(showlegend=False)
    cfig.show()
    
plot_color_sequence(my_color_sequence)

<a name="cid49"></a>
### I. Continuous Colors
So far, we've been using colors on categorical columns in our Pandas dataframes. Categorical columns have a finite number of non-numeric values, like "Cash", "Credit card" and "Ewallet" for the *Payment* column. Plotly calls this type of data *qualitative*, which is why we used color sequences from the `plotly.express.colors.qualitative` module.

Colors can also be used to highlight continuous numeric data. Plotly provides sequential color sequences for this purpose. Run the next cell to see the sequential color sequences.

In [None]:
# Plotly color sequences
color_fig = px.colors.sequential.swatches()
color_fig.show()

<a name="cid50"></a>
#### 1. Continuous Color Exercise
Create a scatter plot of Median salary vs Unemployment rate for graduation data. Color-code the chart by the *Total* column. Choose and assign a sequential color sequence to the plot.

HINT: Review [Plotly's Continuous Color Scales Article](https://plotly.com/python/colorscales/)

In [None]:
# Continuous Color Exercise



<a name="cid51"></a>
## VI. Plotting Scouting Data
For the final set of exercises, we'll plot actual IRS scouting data. Our dataset consists of scouting data from the first 85 matches of the 2023 district championships at Eastern Washington University.

It will help if you are familiar with the 2023 game rules. [Watch this short video](https://youtu.be/0zpflsYc4PA?si=OPl6Cpo_MipOmsn-) for an overview if you are unfamiliar with the game.

The next cell loads scouting data from a JSON file and constructs a dataframe.

In [None]:
import json

with open("pncmp2023.json") as jfile:
    pncmp = json.load(jfile)
measures = pd.DataFrame.from_dict(pncmp["measures"])

measures

<a name="cid52"></a>
In data visualization, the term *measure* is often used to describe the quantities that will be visualized, which is why the main data table in the IRS scouting system is called the *measures* table. Our measures table has over 10,000 rows of data!

Let's take a look at all the measures for team 3218 in match 38.

In [None]:
(
    measures
    .query("team == 3218 and match == 38")
    .sort_values(["phase", "task"])
    .loc[:, ["match", "team", "phase",
             "task", "measure_type", "hit", "cat", "miss"]]
)

<a name="cid53"></a>
There are 21 different measures for one robot in one match. There are six robots in each match and there are 85 matches in the dataset. So we would expect about 21 * 6 * 85 = 10,710 rows in the measures table. We actually have 10,584, which is close enough.

Let's take a closer look at the data for match 38. The task column lists all of the actions a robot can complete in a match. Some tasks can be completed both during autonomous and teleop. Look at the *phase* column to tell which is which.

There are three columns that contain the measured data: *hit*, *miss*, and *cat*. Their interpretation depends on the value in the *measure_type* column.
* For rows with a *measure_type* of *count*, the *hit* column contains the number of times that the robot successfully completed that action. If used, the *miss* column contains the number of unsuccessful attempts. If not used, *miss* contains -1. The *cat* column is not used.
* For boolean rows, *hit* contains 1 if the task was completed and 0 if not. The *miss* and *cat* columns are not used.
* For categorical rows, the *cat* column describes how the robot completed the task. The *hit* and *miss* columns are not used.
* Finally, for rating rows, the *hit* column contains the number of stars awarded and the *miss* column contains the maximum number of stars. The *cat* column is not used.
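
The rules above can be sketched as a small helper function. This is only an illustration: the function name is hypothetical, and the strings `"boolean"`, `"categorical"`, and `"rating"` are assumed labels that may not match the exact values stored in the *measure_type* column.

```python
def describe_measure(measure_type, hit, miss, cat):
    """Return a short description of one row of the measures table."""
    if measure_type == "count":
        # A miss value of -1 means unsuccessful attempts were not recorded.
        if miss == -1:
            return f"{hit} successful"
        return f"{hit} successful, {miss} unsuccessful"
    if measure_type == "boolean":  # assumed label
        return "completed" if hit == 1 else "not completed"
    if measure_type == "categorical":  # assumed label
        return f"category: {cat}"
    if measure_type == "rating":  # assumed label
        # For ratings, miss holds the maximum number of stars.
        return f"{hit} out of {miss} stars"
    raise ValueError(f"Unknown measure type: {measure_type}")


print(describe_measure("count", 3, -1, None))
print(describe_measure("rating", 4, 5, None))
```

Check the distinct values in the *measure_type* column of the real dataframe before relying on the labels used here.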

If we wanted to know how many cones team 2046 placed in the upper level during the autonomous phase across all matches, we would filter the measures table to that team, phase, and task, and then add up the *hit* column.
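
Here is a minimal sketch of that kind of calculation. It uses a tiny, made-up dataframe so it runs on its own. The phase label `'auto'` and the task name `cube_up` are assumptions, so check the values in the real *phase* and *task* columns before reusing the filter.

```python
import pandas as pd

# Hypothetical sample rows; the real measures table has more columns and rows.
sample = pd.DataFrame({
    "match": [1, 1, 2, 2, 3],
    "team": [2046, 2046, 2046, 1318, 2046],
    "phase": ["auto", "tele", "auto", "auto", "auto"],  # 'auto' is assumed
    "task": ["cone_up", "cone_up", "cone_up", "cone_up", "cube_up"],
    "hit": [1, 3, 2, 1, 1],
})

# Keep only the rows of interest, then add up the successful placements.
auto_upper_cones = (
    sample
    .query("team == 2046 and phase == 'auto' and task == 'cone_up'")
    .loc[:, "hit"]
    .sum()
)
print(auto_upper_cones)
```

The same chain should work on the full *measures* dataframe once the phase and task labels are confirmed.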

<a name="cid54"></a>
### A. Cone Bar Chart Exercise
Create a bar chart showing the number of cones placed in the upper level by team 1318 during the teleop phase of each match.

**HINTS:**
1. First, filter the measures dataframe to only team 1318, the *tele* phase, and the *cone_up* task.
2. Sort the dataframe by match.
3. Force the x axis to be categorical with `fig.update_xaxes(type="category")`. Replace *fig* with whatever you named your figure object.

In [None]:
# Cone Bar Chart




<a name="cid55"></a>
### B. Average Cubes Bar Chart Exercise
Create a horizontal bar chart showing the total number of cubes and cones placed on the grid across all matches for all teams, in both autonomous and teleop.

**HINTS:**
1. Filter the *measures* dataframe to contain only the *cone_low* task.
2. Group the *measures* dataframe by team and aggregate it with a mean function. See the grouping and aggregation section from the notebook on tabular data if you need a refresher.
3. Don't forget to make the team axis a categorical axis.
4. Play with the height and width parameters to make all the team numbers visible.
5. Experiment with the nbins parameter.

In [None]:
# Average Cubes Exercise




<a name="cid56"></a>
## VII. Visualization Theory
Finally, let's review some visualization theory.

1. Read [this article on types of data](https://www.mygreatlearning.com/blog/types-of-data/). After reading the article you should be able to define the terms *qualitative*, *quantitative*, *nominal*, *ordinal*, *continuous*, and *discrete*. Note that in the IRS, we often use the term *categorical* to mean *qualitative*. The terms *objective* and *subjective* are also useful when thinking about data. If something is *subjective*, two different people might disagree on the value. *Objective* data is less susceptible to disagreement.
2. Scan [this article on use of color in visualizations](https://www.y42.com/blog/color-rules-data-visualization).
3. If visualization theory is interesting to you, [check out this article on the Gestalt principles of visual perception](https://bootcamp.uxdesign.cc/to-understand-what-makes-good-design-work-you-need-to-understand-the-psychology-of-human-bdff3cf20425). This article is optional.

<a name="cid57"></a>
## VIII. More Types of Plots
### A. Boxplots
Plotting averages for each team is useful. But it's also useful to understand how consistent each team was at placing game pieces. Boxplots are a great tool for this. Here is a boxplot of total game pieces placed per match for each team. You might need to increase the size of your browser window to see all team labels in the boxplot.

In [None]:
# Constructing Box Plots
tasks = [task for task in pd.unique(measures.task)
         if task.startswith("cone") or task.startswith("cube")]
pieces = (
    measures
    .query("task in @tasks")
    .groupby(["match", "team"])
    .agg({"hit": "sum"})
    .reset_index()
    .sort_values("team")
)
pieces.head()

In [None]:
boxfig = px.box(
    pieces, x="team", y="hit",
    title="Game Pieces Placed per Match",
    labels={"hit": "Cones + Cubes", "team": "Team"}
)
boxfig.update_xaxes(type="category")
boxfig.show()

<a name="cid58"></a>
### B. Interpreting Boxplots
Each team has a glyph that summarizes the total number of cones and cubes it placed on the grid in each of its matches. Each glyph has three parts: the thick "box" part, the vertical line, and the dots.
* Every box has a horizontal line in the middle of the box. This line is drawn at the median number of game pieces placed. For example, team 360's median is at 7 pieces, which means they placed 7 or more pieces in about half of their matches and 7 or fewer in the other half.
* The top of the box is drawn at the upper quartile, which means the team placed more pieces than this value in about 25% of their matches.
* The bottom of the box is drawn at the lower quartile, which means the team placed fewer pieces than this value in about 25% of their matches.
* The vertical lines extend from the box toward the minimum and maximum values. They usually show the total range of game pieces placed during matches for each team. There is an exception to this rule, which is explained in the next bullet.
* For most boxplots, the vertical lines are not allowed to extend more than 1.5 times the interquartile range beyond either end of the box. The interquartile range is the distance from the lower quartile to the upper quartile (the length of the box). Values outside this range are considered outliers and are plotted as dots.
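
To make these rules concrete, here is a small sketch that computes the parts of one box by hand using Python's built-in `statistics` module. The list of values is made up, and the `"inclusive"` quartile method is only one of several common conventions, so the numbers may differ slightly from what Plotly draws.

```python
import statistics


def box_stats(values):
    """Compute the quartiles, line ends, and outliers for one boxplot glyph."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the length of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # The vertical lines stop at the most extreme values inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "line_low": min(inside),
        "line_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical game pieces placed by one team in eight matches.
pieces_per_match = [3, 5, 6, 7, 7, 8, 9, 18]
stats = box_stats(pieces_per_match)
print(stats)
```

In this example the match with 18 pieces falls more than 1.5 times the interquartile range above the box, so it would be drawn as a dot rather than included in the vertical line.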

<a name="cid59"></a>
### C. Heat Maps
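Heat maps are another great way to visualize a dataset. The next two cells calculate the average number of game pieces each team placed per match during teleop for each placement task, then draw those averages as a grid of colored cells.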

In [None]:
tasks = [task for task in pd.unique(measures.task)
         if task.startswith("cone") or task.startswith("cube")]
grid = (
    measures
    .query("task in @tasks and phase == 'tele'")
    .groupby(["team", "task"])
    .agg({"hit": "mean"})
    .unstack(level=1)
    .T
    .droplevel(0)
    .loc[::-1, :]
)
grid.head()
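
The reshaping in the cell above turns one row per team and task into a grid with tasks as rows and teams as columns. As a sketch of the same idea, `pivot_table` should produce an equivalent layout in a single step. The sample dataframe below is made up so the example runs on its own.

```python
import pandas as pd

# Hypothetical sample rows, one row per team, task, and match.
sample = pd.DataFrame({
    "team": [1318, 1318, 1318, 2046, 2046, 2046],
    "task": ["cone_up", "cone_up", "cone_low", "cone_up", "cone_low", "cone_low"],
    "hit": [2, 4, 1, 5, 0, 2],
})

# Rows become tasks, columns become teams, and each cell holds the mean of hit.
grid_sketch = (
    sample
    .pivot_table(index="task", columns="team", values="hit", aggfunc="mean")
    .loc[::-1, :]  # reverse the row order, as the cell above does
)
print(grid_sketch)
```

Either approach yields a dataframe that `px.imshow` can draw directly, with the index on the y axis and the columns on the x axis.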

In [None]:
gridfig = px.imshow(
    grid,
    title="Average Pieces Placed by Type and Level",
    aspect="auto",
    text_auto=".1f",
    labels={"x": "Team", "y": ""})
gridfig.update_xaxes(type="category")
gridfig.show()

<a name="cid60"></a>
## IX. Quiz

<a name="cid61"></a>
#### 1. Question #1
What's a good type of chart for plotting two numeric variables? (See the articles in the *Visualization Theory* section.)

In [None]:
#1


<a name="cid62"></a>
#### 2. Question #2
What's a good type of chart for comparing numeric and categorical (i.e., qualitative) variables?

In [None]:
#2


<a name="cid63"></a>
#### 3. Question #3
How would you visualize two nominal or ordinal variables? For example, robotics team vs. match starting position? Feel free to [scan Plotly's documentation](https://plotly.com/python/) for ideas.

In [None]:
#3


<a name="cid64"></a>
#### 4. Question #4
What is the difference between a diverging and sequential color palette?

In [None]:
#4


<a name="cid65"></a>
#### 5. Question #5
Review the boxplot. Which teams are the most consistent? Which teams have a large amount of variation in the number of game pieces placed?

In [None]:
#5


<a name="cid66"></a>
## X. Save Your Work
Once you have completed the exercises, save a copy of the notebook outside of the git repository (outside of the *pyclass_frc* folder). Include your name in the file name. Follow instructions from your instructor to get feedback on your work.

<a name="cid67"></a>
## XI. Concept and Terminology Review
You should be able to define the following terms or describe the concept.
* Scatter plots
* Bar charts
* Line plots
* Heat maps
* Box plots
* Method chaining
* `pandas.plot`
* Matplotlib
* Plotly
* `pandas.to_datetime`
* Plotly color sequences
* HTML colors
* CSS hex color codes
* Numeric data
* Categorical data

<a name="cid68"></a>
[Table of Contents](../../index.ipynb)
