# C6-Introduction to Data Visualization with Matplotlib

1. Introduction to Matplotlib
    - Introduction to data visualization with Matplotlib
    - Customizing your plots
    - Small multiples
2. Plotting time-series
    - Plotting time-series data
    - Plotting time-series with different variables
    - Annotating time-series data
3. Quantitative comparisons and statistical visualizations
    - Quantitative comparisons: bar-charts
    - Quantitative comparisons: histograms
    - Statistical plotting
    - Quantitative comparisons: scatter plots
4. Sharing visualizations with others
    - Preparing your figures to share with others
    - Saving your visualizations
    - Automating figures from data
    - Where to go next

In [1]:
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course datasets 
climate_change = pd.read_csv('datasets/climate_change.csv', parse_dates=["date"], index_col="date")
medals = pd.read_csv('datasets/medals_by_country_2016.csv', index_col=0)
summer_2016 = pd.read_csv('datasets/summer2016.csv')
austin_weather = pd.read_csv("datasets/austin_weather.csv", index_col="DATE")
weather = pd.read_csv("datasets/seattle_weather.csv", index_col="DATE")

# Some pre-processing on the weather datasets, including adding a month column
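# .copy() gives an independent DataFrame, so adding the MONTH column below
# does not trigger pandas' SettingWithCopyWarning
seattle_weather = weather[weather["STATION"] == "USW00094290"].copy()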
month = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"] 
seattle_weather["MONTH"] = month 
austin_weather["MONTH"] = month

# 1. Introduction to Matplotlib

## Introduction to data visualization with Matplotlib
- `fig, ax = plt.subplots()`
- `ax.plot()`

## Customizing your plots
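Customizing appearance:
- `ax.plot(..., ..., marker='o')`: add a marker at each data point (for example `'o'` for circles, `'v'` for triangles).
- `ax.plot(..., ..., marker=..., linestyle='--', color='r')`: set the line style (`'--'` for dashed, the string `'None'` for no line) and the color.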

Customizing the axes labels:
- `ax.set_xlabel('....')`
- `ax.set_ylabel('....')`
- `ax.set_title('....')`
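
The customization calls listed above can be combined in one short sketch. The month names and values here are hypothetical, invented only for illustration, and the `Agg` backend line is an assumption that lets the code run without a display (it can be dropped in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly values, invented purely for illustration
months = ["Jan", "Feb", "Mar", "Apr"]
city_a = [5.1, 3.8, 3.5, 2.7]
city_b = [2.2, 2.0, 2.8, 2.1]

fig, ax = plt.subplots()

# Dashed blue line with circle markers
line_a, = ax.plot(months, city_a, color="b", marker="o", linestyle="--")

# Markers only: the string "None" suppresses the connecting line
line_b, = ax.plot(months, city_b, color="r", marker="v", linestyle="None")

# Axis labels and title
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation (inches)")
ax.set_title("Two hypothetical cities")
```

Note that `linestyle="None"` is a string; passing the Python value `None` leaves the default solid line in place.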

## Small multiples
- Small multiples are several small plots that show similar data across different conditions.
- `fig, ax = plt.subplots(3, 2)` creates 3 rows and 2 columns of subplots.
- `ax` is then an array of Axes objects, with `ax.shape == (3, 2)`.
- Index into the array to pick one subplot: `ax[0, 0].plot(....)`
- Special case: `fig, ax = plt.subplots(2, 1)`
    - With a single column (or a single row) the array is one-dimensional, so one index is enough:
    - `ax[0].plot(...)`
    - `ax[1].plot(...)`
- `plt.subplots(2, 1, sharex=True, sharey=True)` makes the subplots share their axis ranges.
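
A minimal sketch of how the shape of the returned Axes array depends on the grid, and of what `sharey=True` does. The data values and the variable names `grid` and `column` are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

values = [1, 4, 9, 16]  # hypothetical data, invented for illustration

# 3 rows x 2 columns: a two-dimensional array of Axes
fig_grid, grid = plt.subplots(3, 2)
grid[0, 0].plot(values)        # top-left subplot
grid[2, 1].plot(values[::-1])  # bottom-right subplot

# 2 rows x 1 column: a one-dimensional array, here with a shared y-axis
fig_col, column = plt.subplots(2, 1, sharey=True)
column[0].plot(values)
column[1].plot([v * 10 for v in values])

print(grid.shape, column.shape)
print(column[0].get_ylim() == column[1].get_ylim())
```

With a shared y-axis both subplots use the same limits, so the smaller series in the top panel looks flat next to the larger one. That is the point of sharing: differences in magnitude stay visible.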

In [None]:
#------------------------------------------------------
# Adding data to an Axes object

import matplotlib.pyplot as plt

# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()

# Plot MLY-PRCP-NORMAL from seattle_weather against the MONTH
ax.plot(seattle_weather["MONTH"],seattle_weather['MLY-PRCP-NORMAL'])

# Plot MLY-PRCP-NORMAL from austin_weather against MONTH
ax.plot(austin_weather['MONTH'], austin_weather['MLY-PRCP-NORMAL'])

plt.show()

#------------------------------------------------------
# Customizing data appearance

# Start a new Figure: the previous one was already drawn by plt.show()
fig, ax = plt.subplots()

# Plot Seattle data, setting data appearance
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color='b', marker='o', linestyle='--')

# Plot Austin data, setting data appearance
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], color='r', marker='v', linestyle='--')

plt.show()

#------------------------------------------------------
# Customizing axis labels and adding titles

# Start a new Figure for this example
fig, ax = plt.subplots()

ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])

# Customize the labels and the title
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation (inches)")
ax.set_title("Weather patterns in Austin and Seattle")
plt.show()

#------------------------------------------------------
# Creating small multiples with plt.subplots

# Given DataFrames: seattle_weather and austin_weather
# Plot month and precipitation, and month and temperatures

# Create a Figure and an array of subplots with 2 rows and 2 columns
fig, ax = plt.subplots(2, 2)

# Plot for Seattle
ax[0, 0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])

# Plot for Austin
ax[1,0].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])
ax[1,1].plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()

#------------------------------------------------------
# Small multiples with shared y axis

# Create a figure and an array of axes: 2 rows, 1 column with shared y axis
fig, ax = plt.subplots(2, 1, sharey=True)

# Plot Seattle precipitation data in the top axes
ax[0].plot(seattle_weather['MONTH'], seattle_weather["MLY-PRCP-NORMAL"], color = 'b')
ax[0].plot(seattle_weather['MONTH'], seattle_weather["MLY-PRCP-25PCTL"], color = 'b', linestyle='--')
ax[0].plot(seattle_weather['MONTH'], seattle_weather["MLY-PRCP-75PCTL"], color = 'b', linestyle='--')

# Plot Austin precipitation data in the bottom axes
ax[1].plot(austin_weather['MONTH'], austin_weather["MLY-PRCP-NORMAL"], color = 'r')
ax[1].plot(austin_weather['MONTH'], austin_weather["MLY-PRCP-25PCTL"], color = 'r', linestyle='--')
ax[1].plot(austin_weather['MONTH'], austin_weather["MLY-PRCP-75PCTL"], color = 'r', linestyle='--')

plt.show()

# 2. Plotting time-series

## Plotting time-series data
- Example: the climate change time-series, which measures CO2 in the atmosphere.
- Use the date column as the index, available as `climate_change.index`.
- `pd.read_csv('csv_file_path', parse_dates=['date'], index_col='date')`
    - `parse_dates` converts the listed columns to datetime values, which allows date-based operations such as slicing by date.
    - It accepts a list of column names (as here) or `True` to parse the index. An explicit date format is not given through this parameter; it goes in the separate `date_format` argument (pandas 2.0 and later).
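
To see what `parse_dates` and `index_col` do without needing the course file, the sketch below reads a tiny in-memory CSV. The three rows are hypothetical values invented for illustration; only the column names mirror the `climate_change` dataset.

```python
import io
import pandas as pd

# Hypothetical rows, invented for illustration (not the real dataset)
csv_text = """date,co2,relative_temp
1979-12-06,337.0,0.20
1980-01-06,338.0,0.30
1980-02-06,338.6,0.36
"""

# Parse the "date" column as datetimes and use it as the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")
print(type(df.index).__name__)

# With a DatetimeIndex, rows can be selected by date strings
eighty = df.loc["1980-01-01":"1980-12-31"]
print(eighty)
```

Slicing with date strings only works because the index holds real datetime values. Without `parse_dates` the index would be plain strings and the slice would compare text, not dates.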

## Plotting time-series with different variables
- When two variables have very different value ranges, plot them on twin axes:
    - `fig, ax = plt.subplots()` \
      `ax.plot(....)` \
      `ax.set_xlabel()` \
      `ax.set_ylabel()` \
      `ax2 = ax.twinx()` \
      `ax2.plot(....)` \
      `ax2.set_ylabel()`
- `twinx()` returns a second Axes that shares the same x-axis but has its own y-axis.
- Pass the same `color` argument to both `plot` and `set_ylabel` so each line can be matched to its axis.
- Add `ax.tick_params('y', colors='red')` to color the y-axis ticks as well.
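
The twin-axes recipe can be tried on synthetic data. In this sketch the two series are hypothetical and deliberately chosen to have very different ranges; the variable names are assumptions, not part of the course data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical series with very different scales
years = np.arange(2000, 2010)
co2 = 370 + 2.0 * (years - 2000)          # values in the hundreds
rel_temp = 0.1 * (years - 2000) - 0.2     # values around zero

fig, ax = plt.subplots()

# First variable on the left y-axis, in blue
ax.plot(years, co2, color="blue")
ax.set_xlabel("Time (years)")
ax.set_ylabel("CO2 (ppm)", color="blue")
ax.tick_params("y", colors="blue")

# Second variable on a twin Axes that shares the x-axis, in red
ax2 = ax.twinx()
ax2.plot(years, rel_temp, color="red")
ax2.set_ylabel("Relative temperature (Celsius)", color="red")
ax2.tick_params("y", colors="red")
```

Plotting both series on a single y-axis would squash the temperature line into a flat line near zero. The twin Axes gives each series its own vertical scale while the x-axis stays common.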

## Annotating time-series data
- Annotations:
    - are usually small pieces of text that refer to a particular part of the visualization,
    - draw attention to a feature of the data and explain it.
- Adding an annotation:
    - `ax2.annotate('>1 degree', xy=(pd.Timestamp('2015-10-06'), 1))`
    - `'>1 degree'`: the text to display
    - `xy`: the data coordinates of the point being annotated
- Positioning the text:
    - `ax2.annotate(..., ..., xytext=(pd.Timestamp('2008-10-06'), -0.2))`
    - `xytext` is an optional argument that sets the position of the text; without it the text is placed at `xy`.
- Adding an arrow to the annotation:
    - `ax2.annotate(....., arrowprops={})` draws an arrow from the text to `xy` with default properties.
    - To customize the arrow: `arrowprops={'arrowstyle':'->', 'color':'gray'}`
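
A small sketch of an annotation with separate text position and an arrow. The dates and temperatures are hypothetical, constructed so that the last point lies just above 1 degree.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical yearly series rising from -0.1 to 1.1
dates = pd.to_datetime([f"{year}-01-01" for year in range(2005, 2017)])
temps = np.linspace(-0.1, 1.1, len(dates))

fig, ax = plt.subplots()
ax.plot(dates, temps)

# xy is the annotated point, xytext is where the label is written,
# and arrowprops draws an arrow from the label to the point
note = ax.annotate(
    ">1 degree",
    xy=(pd.Timestamp("2016-01-01"), 1.1),
    xytext=(pd.Timestamp("2008-01-01"), 0.9),
    arrowprops={"arrowstyle": "->", "color": "gray"},
)
```

Moving the text away from the data point with `xytext` keeps the label from covering the line it describes.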

In [None]:
### Plotting time-series data ###

#----------------------------------------------
# Read data with a time index

# Use the parse_dates argument to parse the "date" column as dates.
# Use the index_col argument to set the "date" column as the index.

import pandas as pd

# Read the data from file using read_csv
climate_change = pd.read_csv('climate_change.csv', parse_dates=['date'], index_col='date')
print(climate_change)

#----------------------------------------------
# Plot time-series data

import matplotlib.pyplot as plt
fig, ax = plt.subplots()

# Add the time-series for "relative_temp" to the plot
ax.plot(climate_change.index, climate_change['relative_temp'])

# Set the labels
ax.set_xlabel('Time')
ax.set_ylabel('Relative temperature (Celsius)')
plt.show()

#----------------------------------------------
# Using a time index to zoom in 

fig,ax = plt.subplots()

# Create variable seventies with data from "1970-01-01" to "1979-12-31"
# Note: as date is an index, use slicing.
seventies = climate_change["1970-01-01":"1979-12-31"]
print(seventies)

# Add the time-series for "co2" data from seventies to the plot
ax.plot(seventies.index, seventies["co2"])
plt.show()

In [None]:
### Plotting time-series with different variables ###

#----------------------------------------------
# Plotting two variables

# If the variables have very different scales, you'll want
# to make sure that you plot them in different twin Axes objects. 

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Plot the CO2 variable in blue
ax.plot(climate_change.index, climate_change['co2'], color='blue')

# Create a twin Axes that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature in red
ax2.plot(climate_change.index, climate_change['relative_temp'], color='red')
plt.show()

#----------------------------------------------
# Defining a function that plots time-series data

# Define a function called plot_timeseries
def plot_timeseries(axes, x, y, color, xlabel, ylabel):
  # Plot the inputs x,y in the provided color
  axes.plot(x, y, color= color)
  # Set the axes label
  axes.set_xlabel(xlabel)
  axes.set_ylabel(ylabel, color=color)
  # Set the colors tick params for y-axis
  axes.tick_params('y', colors=color)

#----------------------------------------------
# Using a plotting function

fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], "blue", "Time (years)", "CO2 levels")

# Create a twin Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], "red", "Time (years)", "Relative temperature (Celsius)")

plt.show()

In [None]:
###  Annotating time-series data ###

#----------------------------------------------
# Annotating a plot of time-series data

fig, ax = plt.subplots()

# Plot the relative temperature data
ax.plot(climate_change.index, climate_change['relative_temp'])

# Annotate the date at which temperatures exceeded 1 degree
ax.annotate('>1 degree', xy= (pd.Timestamp('2015-10-06'), 1))

plt.show()

#----------------------------------------------
# Plotting time-series: putting it all together

fig, ax = plt.subplots()

# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change['co2'], 'blue', "Time (years)", "CO2 levels")

# Create an Axes object that shares the x-axis
ax2 = ax.twinx()

# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], 'red', "Time (years)", "Relative temp (Celsius)")

# Annotate point with relative temperature >1 degree
ax2.annotate(">1 degree", xy= (pd.Timestamp('2015-10-06'),1) , xytext=(pd.Timestamp('2008-10-06'),-0.2), arrowprops={'arrowstyle':'->', 'color':'gray'})
plt.show()

# 3. Quantitative comparisons and statistical visualizations

## Quantitative comparisons: bar-charts
- Previous chapter: turning data into visual descriptions.\
  This chapter: quantitative comparisons between parts of the data.
- Visualizing more data at the same time:
    - The example data contains the Olympic medals won by each country.
    - `ax.bar(medals.index, medals['Gold'], label='Gold')`\
      `ax.bar(medals.index, medals['Silver'], bottom=medals['Gold'], label='Silver')`\
      `ax.bar(medals.index, medals['Bronze'], bottom=medals['Gold']+medals['Silver'], label='Bronze')`\
      `ax.legend()` is needed to display the labels.
    - Each new series is stacked on top of the previous ones through the `bottom` argument.
- Rotate tick labels with `ax.set_xticklabels(medals.index, rotation=90)` when the labels are long and overlap. Recent Matplotlib versions may warn unless the tick positions are fixed first with `ax.set_xticks`; `ax.tick_params(axis='x', labelrotation=90)` is an alternative that leaves the ticks untouched.
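
The stacking pattern can be written as a loop that keeps a running `bottom`. This is a sketch on a hypothetical three-country table (the counts and the name `medal_counts` are invented), not the course's `medals` data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical medal table, invented for illustration
medal_counts = pd.DataFrame(
    {"Gold": [3, 2, 1], "Silver": [1, 2, 2], "Bronze": [2, 0, 4]},
    index=["Country A", "Country B", "Country C"],
)

fig, ax = plt.subplots()

# Running total of everything already drawn, per country
bottom = pd.Series(0, index=medal_counts.index)
for column in ["Gold", "Silver", "Bronze"]:
    ax.bar(medal_counts.index, medal_counts[column], bottom=bottom, label=column)
    bottom = bottom + medal_counts[column]

# Rotate the country names without replacing the tick labels
ax.tick_params(axis="x", labelrotation=90)
ax.set_ylabel("Number of medals")
ax.legend()
```

Keeping the running total in one variable avoids writing out sums such as `medals['Gold'] + medals['Silver']` by hand, which becomes error-prone with more than three series.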

## Quantitative comparisons: histograms
- A histogram shows the entire distribution of values within a variable.
    - The x-axis shows the values within the variable.
    - The height of each bar represents the number of observations within a particular bin of values.
- Adding labels: \
    `ax.hist(..., label='...')`\
    `ax.legend()`
- Customizing bins: `bins=5` sets the number of bins; when a sequence of values is provided instead, those numbers become the boundaries between the bins.
- Transparency:
    - `histtype="bar"` is the default.
    - `histtype="step"` displays the histogram as thin lines instead of solid bars.
    - This is useful when several histograms are plotted on the same Axes, because solid bars hide each other.
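
The sketch below shows explicit bin boundaries together with `histtype="step"`. The two samples are randomly generated stand-ins, not the Olympic data, and the bin edges are an arbitrary choice for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples standing in for two groups of athlete weights
rng = np.random.default_rng(0)
group_a = rng.normal(80, 8, 200)
group_b = rng.normal(63, 6, 200)

# A sequence passed as bins is used as the bin boundaries
edges = [40, 50, 60, 70, 80, 90, 100, 110]

fig, ax = plt.subplots()
counts_a, bins_a, _ = ax.hist(group_a, bins=edges, histtype="step", label="Group A")
counts_b, bins_b, _ = ax.hist(group_b, bins=edges, histtype="step", label="Group B")

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
ax.legend()
```

Using the same `edges` for both calls makes the two histograms directly comparable, since every bin covers the same range in both groups. Values that fall outside the outermost edges are not counted.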

## Statistical plotting
- Statistical plotting is a set of methods for using visualization to make comparisons.
- These techniques add quantitative information for comparisons to the visualization.
- Two of these techniques:
- **Adding error bars to bar charts:**
    - Error bars are additional markers on a plot or bar chart that summarize the distribution of a variable.
    - A histogram shows the entire distribution; an error bar summarizes it in one number, such as the standard deviation.
    - `ax.bar("Rowing", mens_rowing["Weight"].mean())`\
        `bar` takes x and y arguments; here y is the mean weight.
    - `ax.bar("Rowing", mens_rowing["Weight"].mean(), yerr=mens_rowing["Weight"].std())`\
        `yerr` is an additional argument, here the standard deviation of the same column, displayed as a vertical marker.
- Error bars on line plots: `ax.errorbar(x_values, y_values, yerr=y_values_summary)`
- **Adding box plots:**
    - `boxplot` is implemented as a method of the Axes object and takes a sequence of sequences, one per box.
    - `ax.boxplot([data_col1, data_col2])`\
      `ax.set_xticklabels(['Data1', 'Data2'])`
    - Interpretation of box plots:
        - the line inside the box (orange in the default style): the median of the data
        - the edges of the box: the inter-quartile range (IQR), between the 25th and 75th percentiles
        - the whiskers at the ends of the thin bars: the most extreme data points within 1.5 times the IQR beyond the 25th and 75th percentiles
        - This range should encompass roughly 99 percent of the distribution if the data is Gaussian (normal).
        - Points that appear beyond the whiskers are outliers.
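
The whisker rule can be checked by hand. This sketch computes the 1.5 x IQR limits with NumPy for a hypothetical sample of ten heights and compares them with the outliers that `boxplot` reports; it also draws the mean with a standard-deviation error bar on a second Axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical heights in cm, with one unusually large value
heights = np.array([150, 160, 165, 170, 172, 175, 178, 180, 185, 215])

# Quartiles and the 1.5 x IQR limits used for the whiskers
q1, median, q3 = np.percentile(heights, [25, 50, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

fig, (ax_bar, ax_box) = plt.subplots(1, 2)

# Bar with an error bar: mean plus/minus one standard deviation
# (ddof=1 matches the default of pandas' .std())
ax_bar.bar("Sample", heights.mean(), yerr=heights.std(ddof=1))

# Box plot of the same data; the result holds the drawn artists
result = ax_box.boxplot([heights])
outliers = result["fliers"][0].get_ydata()

print(lower_limit, upper_limit, outliers)
```

The bar summarizes the sample in two numbers, while the box plot also exposes the single point beyond the upper limit, which the mean and standard deviation silently absorb.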

## Quantitative comparisons: scatter plots
- Bar charts show the values of one variable across different conditions, such as different countries.
- Scatter plots display the values of different variables across observations. It is sometimes called a bi-variate comparison, because it involves the values of two different variables.
- In the `climate_change` DataFrame example, the data is sliced into different time ranges, which are plotted together for comparison.
- Encoding a third variable by color: \
    `ax.scatter(..., ...., c=climate_change.index)`\
    the index is encoded as color through the `c` argument, which is not the same as the `color` argument.
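
A minimal sketch of encoding a third variable with `c`. The data is synthetic, and the colorbar is an addition beyond the notes, included so the color encoding can be read off the figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic observations: x and y values plus their order in time
rng = np.random.default_rng(1)
order = np.arange(30)
co2 = 340 + 2.0 * order
rel_temp = 0.02 * order + rng.normal(0, 0.05, order.size)

fig, ax = plt.subplots()

# c takes one value per point and maps it through a colormap;
# the color argument would instead set a single color for all points
points = ax.scatter(co2, rel_temp, c=order)
fig.colorbar(points, ax=ax, label="Observation order")

ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (C)")
```

Because `c` receives an array with one entry per point, early and late observations get different colors, which makes the direction of change over time visible in a plot that has no time axis.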
 

In [None]:
### Quantitative comparisons: bar-charts ###

#----------------------------------------------
# Bar charts

# visualize the number of gold medals won by each country in the provided medals DataFrame. 

fig, ax = plt.subplots()

# Plot a bar-chart of gold medals as a function of country
ax.bar(medals.index, medals['Gold'])

# Set the x-axis tick labels to the country names
ax.set_xticklabels(medals.index, rotation=90)

# Set the y-axis label
ax.set_ylabel('Number of medals')
plt.show()

#----------------------------------------------
# Stacked bar chart

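# In a stacked bar chart each bar is built from segments placed on top of
# one another, so the total height is the sum of the values.

# Start a new Figure: the previous one was already drawn by plt.show()
fig, ax = plt.subplots()

# Add bars for "Gold" with the label "Gold"
ax.bar(medals.index, medals['Gold'], label='Gold')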

# Stack bars for "Silver" on top with label "Silver"
ax.bar(medals.index, medals["Silver"], bottom=medals['Gold'], label="Silver")

# Stack bars for "Bronze" on top of that with label "Bronze"
ax.bar(medals.index, medals["Bronze"], bottom=medals["Gold"]+medals["Silver"], label="Bronze")

# Display the legend
ax.legend()
plt.show()


In [None]:
### Quantitative comparisons: histograms ###

#----------------------------------------------
# Creating histograms

# Histograms show the full distribution of a variable.
# Display the distribution of weights of medalists in gymnastics and
# in rowing in the 2016 Olympic games for a comparison between them.

# The data is stored in a pandas DataFrame object called summer_2016_medals
# that has a column "Weight". The per-sport subsets mens_rowing and
# mens_gymnastics used below are not defined in this notebook; they are
# assumed to be provided by the course environment.

print(mens_rowing)
print(mens_gymnastics)

# Plot the mean of weight in bar plot first

fig, ax = plt.subplots()
ax.bar("Rowing", mens_rowing["Weight"].mean())
ax.bar("Gymnastics", mens_gymnastics["Weight"].mean())
plt.show()

plt.clf()

fig, ax = plt.subplots()
# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"])

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"])

# Set the x-axis label to "Weight (kg)"
ax.set_xlabel("Weight (kg)")

# Set the y-axis label to "# of observations"
ax.set_ylabel("# of observations")

plt.show()

#----------------------------------------------
# "Step" histogram

fig, ax = plt.subplots()

# use the histtype argument to visualize the data using the 'step' type 
# and set the number of bins to use to 5.

# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"], label="Rowing", histtype='step', bins=5)

# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"], label="Gymnastics", histtype='step', bins=5)

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")

# Add the legend and show the Figure
ax.legend()
plt.show()


In [None]:
###  Statistical plotting ###

#----------------------------------------------
# Adding error-bars to a bar chart

# Add error bars that quantify not only the difference in the means of 
# the height of medalists in the 2016 Olympic Games, but also 
# the standard deviation of each of these groups, as a way to assess 
# whether the difference is substantial relative to the variability within each group.

fig, ax = plt.subplots()

# Add a bar for the rowing "Height" column mean/std
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())

# Add a bar for the gymnastics "Height" column mean/std
ax.bar("Gymnastics", mens_gymnastics["Height"].mean(), yerr=mens_gymnastics["Height"].std())

# Label the y-axis
ax.set_ylabel("Height (cm)")
plt.show()

#----------------------------------------------
# Adding error-bars to a plot

# DataFrames: seattle_weather has data about the weather in Seattle 
# and austin_weather has data about the weather in Austin. 

fig, ax = plt.subplots()

# Add Seattle temperature data in each month with error bars
ax.errorbar(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"], seattle_weather["MLY-TAVG-STDDEV"])

# Add Austin temperature data in each month with error bars
ax.errorbar(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"], austin_weather["MLY-TAVG-STDDEV"])

# Set the y-axis label
ax.set_ylabel("Temperature (Fahrenheit)")
plt.show()

#----------------------------------------------
# Creating boxplots

# Boxplots tell us what the median of the distribution is, what the 
# inter-quartile range is and also what the expected range of 
# approximately 99% of the data should be. 
# Outliers beyond this range are particularly highlighted.

fig, ax = plt.subplots()

# Add a boxplot for the "Height" column in the DataFrames
ax.boxplot([mens_rowing["Height"], mens_gymnastics["Height"]])

# Add x-axis tick labels:
ax.set_xticklabels(["Rowing", "Gymnastics"])

# Add a y-axis label
ax.set_ylabel("Height (cm)")
plt.show()


In [None]:
### Quantitative comparisons: scatter plots ###

#----------------------------------------------
# Simple scatter plot

# Scatter plots are a bi-variate visualization technique.

fig, ax = plt.subplots()

# Add data: "co2" on x-axis, "relative_temp" on y-axis
ax.scatter(climate_change["co2"], climate_change["relative_temp"])

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")

plt.show()

#----------------------------------------------
# Encoding time by color

fig, ax = plt.subplots()

# Add data: "co2", "relative_temp" as x-y, index as color
ax.scatter(climate_change["co2"], climate_change["relative_temp"], c=climate_change.index)

# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")

# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")
plt.show()


# 4. Sharing visualizations with others

This chapter focuses on creating visualizations to share with others and to incorporate into automated data analysis pipelines.

## Preparing your figures to share with others
- Begin by customizing the figure style.
- Choosing a plot style:
    - `plt.style.use("ggplot")`
    - `plt.style.use("default")` to go back to the default style
    - some other styles: "bmh", "seaborn-colorblind" (named "seaborn-v0_8-colorblind" since Matplotlib 3.6)
- Guidelines for choosing a plotting style:
    - Dark backgrounds are generally harder to read, so avoid them unless there is a good reason.
    - If colors are important, consider using a colorblind-friendly style, such as
        - "seaborn-colorblind" or "tableau-colorblind10".
    - If someone is going to print your figures, prefer a style that uses less ink.
    - If the printer is likely to be black-and-white, consider the "grayscale" style.
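
Style names differ between Matplotlib versions, so it can help to check what is installed before choosing one. The sketch below lists the colorblind-friendly names found in `plt.style.available` and applies a style temporarily with `plt.style.context`, which is an alternative to `plt.style.use` that is not covered in the notes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Names of the styles shipped with this Matplotlib installation
colorblind_styles = [name for name in plt.style.available if "colorblind" in name]
print(colorblind_styles)

# A figure created with whatever style is currently active
fig_default, ax_default = plt.subplots()

# The style applies only to figures created inside the with-block
with plt.style.context("ggplot"):
    fig_ggplot, ax_ggplot = plt.subplots()
    ax_ggplot.plot([1, 2, 3], [2, 4, 3])

print(ax_default.get_facecolor(), ax_ggplot.get_facecolor())
```

Unlike `plt.style.use`, the context manager restores the previous settings on exit, so there is no need to switch back to `"default"` afterwards.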

## Saving your visualizations
- `fig.savefig("....")` saves the created figure to a file; the format is inferred from the file extension.
- PNG uses lossless compression: \
   high image quality, but relatively large amounts of disk space or bandwidth.
- JPG uses lossy compression: less disk space or bandwidth. \
  `fig.savefig("fig_name.jpg", quality=50)` \
  `quality` takes values between 1 and 100. In Matplotlib 3.5 and later the `quality` argument no longer exists; pass `pil_kwargs={"quality": 50}` instead.
- The SVG file format produces a vector graphics file.
    - SVG files can be edited with vector graphics software, such as Inkscape or Adobe Illustrator.
    - If you need to edit the figure after producing it, this might be a good choice.
- Resolution: setting the quality of images:
    - `fig.savefig("fig_name.png", dpi=300)`
    - dpi = dots per inch: the higher the dpi, the denser the image (and the larger the file)
    - `dpi=300` already gives a high-quality image.
- Size: controlling the size of the figure.
    - `fig.set_size_inches([fig_width, fig_height])`
    - This also sets the aspect ratio of the figure.
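
Figure size and dpi together determine the pixel dimensions of a saved raster image. The sketch below saves a 5 x 3 inch figure at 300 dpi into a temporary directory and reads the PNG back to check its size; the file names and the temporary directory are arbitrary choices for this example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Width and height in inches; this also fixes the aspect ratio
fig.set_size_inches([5, 3])

# Write into a temporary directory so no existing file is overwritten
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "figure.png")
svg_path = os.path.join(out_dir, "figure.svg")

fig.savefig(png_path, dpi=300)  # raster image: 5 in x 300 dpi wide
fig.savefig(svg_path)           # vector image: dpi does not affect the shapes

# Read the PNG back and report rows (height) and columns (width) in pixels
pixel_rows, pixel_cols = plt.imread(png_path).shape[:2]
print(pixel_rows, pixel_cols)
```

At 300 dpi the 5 x 3 inch figure becomes 1500 x 900 pixels. Doubling the dpi doubles both dimensions, so the pixel count, and roughly the file size, grows fourfold.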

## Automating figures from data
- Getting the unique values of a column: `df_column.unique()`
- One of the main strengths of Matplotlib is that it can be automated to adapt to the data that it receives as input. For example, if you receive data that has an unknown number of categories, you can still create a bar plot that has bars for each category.
- See the exercises for this section in the code cell at the end of the chapter.
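
The idea of adapting to an unknown number of categories can be sketched with a loop over `unique()`. The small DataFrame here is hypothetical and invented for illustration; it only mirrors the "Sport" and "Weight" column names used in the course.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical athletes, invented for illustration
athletes = pd.DataFrame({
    "Sport": ["Rowing", "Rowing", "Gymnastics", "Gymnastics", "Judo", "Judo"],
    "Weight": [90.0, 86.0, 62.0, 58.0, 75.0, 81.0],
})

fig, ax = plt.subplots()

# One bar per category, however many categories the data contains
for sport in athletes["Sport"].unique():
    sport_df = athletes[athletes["Sport"] == sport]
    ax.bar(sport, sport_df["Weight"].mean(), yerr=sport_df["Weight"].std())

ax.set_ylabel("Weight (kg)")
ax.tick_params(axis="x", labelrotation=90)
```

Nothing in the loop depends on the number or names of the sports, so the same code produces a correct figure when new categories appear in the data.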

## Where to go next
-


In [None]:
### Preparing your figures to share with others ###

#--------------------------------------------------------
# Switching between styles

# Use the "Solarize_Light2" style and create new Figure/Axes
plt.style.use('Solarize_Light2')
fig, ax = plt.subplots()
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()



In [None]:
### Automating figures from data ###
#--------------------------------------------------------
# Unique values of a column

# In this exercise and the next, visualize the weight of athletes in the 2016 summer Olympic Games
# again, from a dataset that contains an unknown number of sports.
# The pandas DataFrame is called summer_2016_medals and has a "Sport" column.
# There is also a "Weight" column that gives the weight of each athlete.
# Note: summer_2016_medals is not defined in this notebook; the first cell
# loads datasets/summer2016.csv under the name summer_2016.

# Extract the "Sport" column
sports_column = summer_2016_medals["Sport"]
print( sports_column)

# Find the unique values of the "Sport" column
sports = sports_column.unique()

# Print out the unique sports values
print(sports)

#--------------------------------------------------------
# Automate your visualization

fig, ax = plt.subplots()
print(sports)
# Loop over the different sports branches
for sport in sports:
  print(sport)
  # Extract the rows only for this sport
  sport_df = summer_2016_medals.query('Sport == @sport')
  # Add a bar for the "Weight" mean with std y error bar
  ax.bar(sport, sport_df["Weight"].mean(), yerr=sport_df["Weight"].std())

ax.set_ylabel("Weight")
ax.set_xticklabels(sports, rotation=90)

# Save the figure to file
fig.savefig("sports_weights.png")
plt.show()
