
# CS 432/532 InfoVis with Python Tutorial

We're going to use [Seaborn](https://seaborn.pydata.org/), which is a high-level interface to the popular [Matplotlib](https://matplotlib.org/) plotting library.  We'll also be using [Pandas](https://pandas.pydata.org/) to read data from a CSV file into a dataframe (aka table) and manipulate the data.

The goal is to produce plots similar to those from the InfoVis in R tutorial. So first, upload the `midwest.csv` and `economics.csv` datasets that were created in the R tutorial to your notebook.


In [None]:
import matplotlib.pyplot as plt  # will need some Matplotlib functions
import seaborn as sns
import pandas as pd              # will use Pandas for data manipulation
sns.set_style("whitegrid");   # use white grid as default

In [None]:
fCount_csv = pd.read_csv('acnwala-friendscount.csv')

In [None]:
fCount_csv.head()

Unnamed: 0,USER,"""FRIENDCOUNT"""
0,Uloma Faith Nwala,482
1,Chima Emmanuel Nwala,357
2,Tibidabo A. Peters,2143
3,Hany SalahEldeen,1250
4,Deborah Edds,907


In [None]:
user = fCount_csv['USER']
print(user)

0        Uloma Faith Nwala
1     Chima Emmanuel Nwala
2       Tibidabo A. Peters
3         Hany SalahEldeen
4             Deborah Edds
              ...         
93        Chukwuemeka Udeh
94             Wobo Vivian
95     Fortune Tall Essien
96          Mirian Webilor
97           Nwala Johnson
Name: USER, Length: 98, dtype: object


In [None]:
fCount = fCount_csv[' "FRIENDCOUNT"']
print(fCount);

0      482
1      357
2     2143
3     1250
4      907
      ... 
93      40
94     393
95     210
96     341
97     916
Name:  "FRIENDCOUNT", Length: 98, dtype: int64
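
The column name above carries a leading space and literal quote characters, which makes it awkward to type. One way to tidy this up is to rename the columns after reading. This is a minimal sketch on a made-up two-row stand-in for the real file; the names `raw_df` and `clean_df` are only for illustration.

```python
import pandas as pd

# Made-up stand-in with the same awkward header as the real file
raw_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "USER": ["Uloma Faith Nwala", "Chima Emmanuel Nwala"],
    ' "FRIENDCOUNT"': [482, 357],
})

# Strip surrounding whitespace first, then any surrounding double quotes
clean_df = raw_df.rename(columns=lambda name: name.strip().strip('"'))

print(list(clean_df.columns))
print(clean_df["FRIENDCOUNT"])
```

After the rename, the column can be selected as `clean_df['FRIENDCOUNT']` without the extra space and quotes.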


In [None]:
midwest = pd.read_csv('midwest.csv')
midwest.head()


Read in the "date" column as a date.

ref: https://www.earthdatascience.org/courses/use-data-open-source-python/use-time-series-data-in-python/date-time-types-in-pandas-python/

In [None]:
economics = pd.read_csv('economics.csv', parse_dates = ['date'])
economics.head()
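
To see what `parse_dates` changes, here's a small sketch that reads the same text twice, once without and once with the option, and compares the column types. The two rows are made up and only mimic the layout of `economics.csv`.

```python
import io
import pandas as pd

# Made-up rows that only mimic the layout of economics.csv
raw = "date,unemploy\n2000-01-01,100\n2000-02-01,110\n"

as_text = pd.read_csv(io.StringIO(raw))                          # 'date' stays as text
as_dates = pd.read_csv(io.StringIO(raw), parse_dates=["date"])   # 'date' becomes a datetime

print(as_text["date"].dtype)
print(as_dates["date"].dtype)
print(as_dates["date"].dt.year.tolist())  # the .dt accessor only works on datetime columns
```

Having a real datetime column is what lets the line chart later place tick marks by year.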

## Things to Note

* If you're running these in a script, you'll need to include `plt.show()` at the end to actually draw the plot.  We don't need this line in the notebook.

* In Seaborn, it matters whether a chart function returns a `FacetGrid` object or an `Axes` object, because the two offer different methods for setting labels and titles.  I've tried to use the variable `g` for `FacetGrid` and `ax` for `Axes`.
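
Here's a small sketch of both notes, written the way it might look in a standalone script. The data is made up, and the `Agg` backend is an assumption so that the script runs without a display. With an interactive backend you would end the script with `plt.show()`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

toy = pd.DataFrame({"size": [1, 2, 3], "value": [10, 30, 20]})  # made-up data

# scatterplot() returns a Matplotlib Axes, so we call ax.set_*() methods on it
ax = sns.scatterplot(x="size", y="value", data=toy)
ax.set_title("Axes example")

# catplot() returns a Seaborn FacetGrid, which has its own helper methods
g = sns.catplot(x="size", y="value", kind="bar", data=toy)
g.set_axis_labels("Size", "Value")

print(type(ax).__name__, type(g).__name__)
# With an interactive backend, plt.show() would go here to draw the plots.
```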

## Scatterplot

Here's a basic scatterplot, showing the percentage of college educated (mapped to the y-axis) vs. the total population (mapped to the x-axis) in each county in the midwest states.

In [None]:
ax = sns.scatterplot(x="poptotal", y="percollege", data=midwest)

Now we're going to subset this and show just the counties in Ohio (`state == "OH"`). `midwest['state']` refers to the `state` column in the midwest dataset.
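
The subsetting works in two steps: comparing a column to a value gives a True/False series, and indexing the dataframe with that series keeps only the True rows. Here's a sketch on made-up rows (the `county` values and populations are invented for illustration).

```python
import pandas as pd

# Made-up rows that only mimic the midwest columns
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "state": ["OH", "IL", "OH", "MI"],
    "poptotal": [1000, 2000, 3000, 4000],
})

mask = toy["state"] == "OH"   # boolean Series: True where the row is in Ohio
ohio = toy[mask]              # keep only the rows where the mask is True

print(mask.tolist())
print(ohio)
```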

We also have some adjustments to the x-axis labels and tick marks so that the population is printed with commas and everything fits.

In [None]:
ax = sns.scatterplot(x="poptotal", y="percollege", data=midwest[midwest['state']=="OH"])

ax.set_xlabel ('Population')
ax.set_ylabel ('% College Educated')
ax.set_title('Ohio counties (source: midwest)')

# set x-axis parameters (to look nice)
ax.set_xlim(left=0)    # set lowest xtick at 0
ax.set_xticks(ax.get_xticks()[::2]) # use every other tick mark
ticks = ax.get_xticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
ax.set_xticklabels(labels);  # using the semicolon at the end won't print output
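
As an alternative to building the label strings by hand, Matplotlib can format the ticks itself with a formatter object. This is a sketch on made-up points, not the approach used in this tutorial; `StrMethodFormatter` comes from `matplotlib.ticker`, and the `Agg` backend is assumed only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots()
ax.scatter([0, 250000, 1250000], [10, 20, 30])  # made-up points

# '{x:,.0f}' means: the tick value x, with thousands separators and no decimals
comma_fmt = StrMethodFormatter("{x:,.0f}")
ax.xaxis.set_major_formatter(comma_fmt)

fig.canvas.draw()  # force the tick labels to be computed
tick_text = [t.get_text() for t in ax.get_xticklabels()]
print(tick_text)
```

Because the formatter is attached to the axis, the labels stay correct even if the tick positions change later.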

## Bar Chart

We need to sum the population in each state.  We can use Pandas functions for this.  First, we group the population by (`groupby`) state and `sum()` the values that we're grouping.  

In [None]:
by_state = midwest.groupby('state').sum(numeric_only=True)  # sum only the numeric columns
by_state

Then we have to `reset_index()` to move 'state' back to a column instead of an index and then choose just 'state' and 'poptotal' columns.

In [None]:
state_pop = by_state.reset_index()[['state', 'poptotal']]
state_pop

Then we can sort in descending order of population.

In [None]:
state_pop = state_pop.sort_values(by=['poptotal'], ascending=False)  # reassign rather than sort in place
state_pop
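
The group, select, and sort steps can also be chained into one expression. Here's a sketch on made-up rows; `toy` and `toy_state_pop` are illustration names, and `as_index=False` is what keeps `state` as a regular column so no `reset_index()` is needed.

```python
import pandas as pd

# Made-up rows that only mimic the midwest columns
toy = pd.DataFrame({
    "state": ["OH", "IL", "OH", "IL", "MI"],
    "county": ["A", "B", "C", "D", "E"],
    "poptotal": [100, 500, 200, 700, 50],
})

toy_state_pop = (
    toy.groupby("state", as_index=False)["poptotal"]  # group rows, keep 'state' as a column
       .sum()                                          # total population per state
       .sort_values("poptotal", ascending=False)       # largest state first
)
print(toy_state_pop)
```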

Now we can plot with `catplot()` and `kind="bar"`

In [None]:
g = sns.catplot(x="state", y="poptotal", kind="bar", data=state_pop, color="steelblue")

g = (g.set_axis_labels ('State', 'Total Population'))
plt.title('(source: midwest)')

# format commas in ticklabels
ticks = g.axes[0][0].get_yticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
g.set_yticklabels(labels);

Let's turn it sideways.  Just switch the x and y axes and make the chart wider to accommodate the labels.

In [None]:
g = sns.catplot(y="state", x="poptotal", kind="bar", data=state_pop, color="steelblue", 
                height=5, # make the plot 5 units high
                aspect=1.5) # width should be 1.5 times height

g = (g.set_axis_labels ('Total Population', 'State' ))
plt.title('(source: midwest)')

# format commas in ticklabels
ticks = g.axes[0][0].get_xticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
g.set_xticklabels(labels);

## Line Chart

In [None]:
economics.head()

In [None]:
ax = sns.lineplot(x="date", y="unemploy", data=economics)

ax.set_xlabel ('Date')
ax.set_ylabel ('Number Unemployed (thousands)')
ax.set_title('(source: economics)')

# set ticks every 5 years, show just the year
import matplotlib.dates as mdates
ax.xaxis.set_major_locator(mdates.YearLocator(5))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

# set y-axis parameters (to look nice)
ticks = ax.get_yticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
ax.set_yticklabels(labels);  

## Scatterplot Matrix

We can use the `pairplot()` function to plot a scatterplot matrix.  On the diagonal, where an attribute would be plotted against itself, it shows a histogram of that attribute instead.

The example here goes back to the midwest dataset, selects only columns for 'area', 'poptotal', and 'popdensity'.

In [None]:
g = sns.pairplot(midwest, vars=['area', 'poptotal', 'popdensity'])

## Histogram

For the histogram, we show the distribution of population per county. Note that we're limiting this to counties that have fewer than 1M people. A few very large counties (in particular Cook County, IL, which includes Chicago and has > 5M people) would otherwise skew the histogram.

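To create the histogram, we use the `distplot()` function. We have to pass just a simple array, so we've taken the subset with population < 1M and then returned just the `poptotal` column. (Note that `distplot()` has been deprecated since Seaborn 0.11 in favor of `histplot()`, so on a recent Seaborn this code will print a deprecation warning.)

By default `distplot()` also draws a kernel density estimate (KDE) curve, so we specify `kde=False` to turn that off.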

In [None]:
ax = sns.distplot(midwest[midwest['poptotal']<1000000]['poptotal'], kde=False) 

ax.set_xlabel ('Population')
ax.set_ylabel ('Number of Counties')
ax.set_title('(source: midwest)')

# format commas in ticklabels
ticks = ax.get_xticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
ax.set_xticklabels(labels);

We can use the `bins` option to change the number of bins in the histogram.  Note that this is the number of bins, not the bin width (`binwidth`, as we had in the R example).

In [None]:
ax = sns.distplot(midwest[midwest['poptotal']<1000000]['poptotal'], kde=False, bins=100) 

ax.set_xlabel ('Population')
ax.set_ylabel ('Number of Counties')
ax.set_title('(source: midwest)')

# format commas in ticklabels
ticks = ax.get_xticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
ax.set_xticklabels(labels);
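
If you'd rather think in terms of bin width, as in the R example, one option is to compute the bin edges yourself. This is a sketch on randomly generated stand-in populations (not the midwest data). The 50,000 width is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.integers(1_000, 1_000_000, size=200)  # made-up county populations

binwidth = 50_000
# Edges at 0, 50k, 100k, ... up to the first edge at or past the largest value
edges = np.arange(0, pop.max() + binwidth, binwidth)

counts, _ = np.histogram(pop, bins=edges)  # how many values land in each bin
print(len(edges) - 1, "bins of width", binwidth)
print(counts)
```

Matplotlib-based histograms generally accept a sequence of edges for `bins`, so `bins=edges` should work in place of a bin count. The newer `histplot()` function also has a `binwidth` parameter that does this directly.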

## Boxplot

We're again looking at the total population by county in the midwest (and again, only for counties with < 1M people). This time, we'll use boxplots and create a separate boxplot for each state. 

The code is very similar to the bar chart.  We use `catplot()`, specify `kind="box"`, and use the midwest dataset (filtered to counties with < 1M people) instead of the state_pop dataset that we'd created for the bar chart.

In [None]:
g = sns.catplot(x="state", y="poptotal", kind="box", data=midwest[midwest['poptotal']<1000000]) 

g = (g.set_axis_labels ('State', 'Total Population'))
plt.title('(source: midwest)')

# format commas in ticklabels
ticks = g.axes[0][0].get_yticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
g.set_yticklabels(labels);
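
Each box spans the first to third quartile of the data, with a line at the median. To check the numbers behind the boxes, we can compute the quartiles per state with Pandas. This is a sketch on made-up values chosen so that the quartiles are easy to verify by eye.

```python
import pandas as pd

# Made-up populations: five counties in each of two states
toy = pd.DataFrame({
    "state": ["OH"] * 5 + ["IL"] * 5,
    "poptotal": [10, 20, 30, 40, 50, 100, 200, 300, 400, 500],
})

quartiles = (
    toy.groupby("state")["poptotal"]
       .quantile([0.25, 0.5, 0.75])   # first quartile, median, third quartile
       .unstack()                     # one row per state, one column per quantile
)
print(quartiles)
```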

## Empirical CDF (ECDF)

When this tutorial was first written, Seaborn had no ECDF plotting function. Seaborn 0.11 and later include one, `ecdfplot()`. (See https://github.com/mwaskom/seaborn/pull/2141.)

It's still worth seeing how to compute an ECDF yourself and plot it with `scatter()`.  See https://cmdlinetips.com/2019/05/empirical-cumulative-distribution-function-ecdf-in-python/

*Preparing the data*

In [None]:
import numpy as np  # use for sort() and arange() to build the ECDF
pop_data = midwest[midwest['poptotal']<1000000]['poptotal']
x = np.sort(pop_data)
n = x.size
y = np.arange(1, n+1) / n
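
To check that this computation does what we expect, here's the same idea wrapped in a small helper and run on four made-up values. The function name `ecdf` and the variables `x_toy` and `y_toy` are only for illustration.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and, for each, the fraction of values <= it."""
    x_sorted = np.sort(np.asarray(values))
    n = x_sorted.size
    fractions = np.arange(1, n + 1) / n   # 1/n, 2/n, ..., 1
    return x_sorted, fractions

x_toy, y_toy = ecdf([30, 10, 20, 40])
print(x_toy)
print(y_toy)
```

With four values, each sorted value adds 1/4 to the cumulative fraction, so the last point always sits at 1.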

Drawing the chart with `scatter()`

In [None]:
plt.scatter(x=x, y=y);
plt.xlabel('Population')
plt.ylabel('Proportion of Counties');  # y runs from 0 to 1

Here's an example of how to do the same thing with Seaborn's `lineplot()` to include the line and points.

In [None]:
ax = sns.lineplot(x=x, y=y, marker="o")

ax.set_xlabel ('Population')
ax.set_ylabel ('Proportion of Counties')  # y runs from 0 to 1
ax.set_title('(source: midwest)')

# set x-axis parameters (to look nice)
ax.set_xlim(left=0, right=1000000)    # set lowest xtick at 0, max at 1M
ticks = ax.get_xticks()
labels = ['{:,.0f}'.format(x) for x in ticks]
ax.set_xticklabels(labels);  # using the semicolon at the end won't print output