## Friday 8

files needed = broadband_size.xlsx, auto_data.dta

This week we are working on

1. Scatter plots
2. Seaborn's facet plots

In [None]:
import pandas as pd     
import matplotlib.pyplot as plt   
from pandas_datareader import data, wb    
import datetime as dt
import seaborn as sns

%matplotlib inline      

## Pandas multi-index slicing


In [None]:
d = {'num_legs': [4, 4, 2, 2,2],
     'num_wings': [0, 0, 2, 2,2],
     'class': ['mammal', 'mammal', 'mammal', 'bird','bird'],
     'animal': ['cat', 'dog', 'bat', 'penguin','ostrich'],
     'locomotion': ['walks', 'walks', 'flies', 'walks','walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
df

In [None]:
# .xs() is for a cross-section of data
# note that it keeps my innermost index level
df.xs(('mammal', 'dog'))

In [None]:
# In .loc, a tuple () addresses index levels: each element of the tuple applies to one level.
# A list [] inside the tuple selects several labels within that level.
df.loc[(['mammal','bird'],['cat','bat']),:] # this takes a slice across multiple levels

In [None]:
df.xs('bat', level = 1)

In [None]:
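# Use .iloc[0] to take the first element by position; plain [0] on a label-indexed Series is deprecated.
df.xs(('bird','ostrich'))['num_legs'].iloc[0]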

Other assistance:

* [Other methods that may work](https://stackoverflow.com/questions/33194016/python-pandas-slice-multiindex-by-second-level-index-or-any-other-level)

* [Conditionals with multi-index](https://stackoverflow.com/questions/50608749/slicing-a-multiindex-dataframe-with-a-condition-based-on-the-index)
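Here is a minimal sketch of two more ways to slice, assuming the same animal data as above (rebuilt under the name `animals` so the example stands alone). It refers to a level by name instead of by position, and it builds a condition from one level of the index.

```python
import pandas as pd

# Rebuild the small animal DataFrame from above so the example stands alone.
d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'mammal', 'bird', 'bird'],
     'animal': ['cat', 'dog', 'bat', 'penguin', 'ostrich'],
     'locomotion': ['walks', 'walks', 'flies', 'walks', 'walks']}
animals = pd.DataFrame(data=d).set_index(['class', 'animal', 'locomotion'])

# Refer to a level by name rather than by position.
bat = animals.xs('bat', level='animal')

# Build a boolean mask from one index level, then filter with it.
walkers = animals[animals.index.get_level_values('locomotion') == 'walks']

# Pull out a single value with a full (class, animal, locomotion) key.
ostrich_legs = animals.loc[('bird', 'ostrich', 'walks'), 'num_legs']
```

Using level names makes the code easier to read later, since you do not have to remember which position each level occupies.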

## Scatter plots

Scatter plots are used to compare two variables. This is a good approach for visualizing the correlation of two variables. Let's get some data from FRED.

In [None]:
# These codes are for  real gdp and the unemployment rate in the United States. 
codes = ['GDPC1', 'UNRATE']  

start = dt.datetime(1970, 1, 1)
fred = data.DataReader(codes, 'fred', start)

fred.head()

Gremlins! The gdp data is quarterly, but the unemployment rate is monthly. Let's fix this by downsampling to quarterly frequency. The FRED datareader is really good: the index is already a datetime object. (How would you check?)

In [None]:
# Create an average quarterly unemployment rate.
fred_q=fred.resample('q').mean()                
fred_q.head()
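To see what the resampling step does without downloading anything, here is a small sketch on made-up monthly numbers. The DataFrame `monthly` and its values are hypothetical. I use the `'QS'` alias, which labels each quarter by its first day; that is an assumption on my part and differs from the `'q'` used above.

```python
import pandas as pd

# Six months of made-up monthly values (1 through 6), dated at the start of each month.
monthly = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
                       index=pd.date_range('2020-01-01', periods=6, freq='MS'))

# One way to check that the index holds dates rather than strings.
is_dates = isinstance(monthly.index, pd.DatetimeIndex)

# Average within each quarter. 'QS' labels each quarter by its first day.
quarterly = monthly.resample('QS').mean()
```

The first quarter averages 1, 2, and 3, and the second averages 4, 5, and 6. The same `isinstance` check can be run on `fred.index`.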

Let's plot the growth rate of GDP against the change in the unemployment rate. The relationship between these two variables is known as [Okun's Law](https://en.wikipedia.org/wiki/Okun%27s_law). 

Since the unemployment rates are already rates, it makes more sense to just difference them than to compute growth rates.

Example: $u_t$=1.25\% and $u_{t-1}$=0.75\%. 
It is clearer to say that the unemployment rate rose by 0.5 percentage points than to say that it rose by 67\%. 
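The arithmetic in this example can be checked with the two methods we are about to use. This is just a sketch on the two numbers above; the Series `u` is made up for the illustration.

```python
import pandas as pd

# The two unemployment rates from the example, in percent.
u = pd.Series([0.75, 1.25])

change_pp = u.diff().iloc[1]                 # change in percentage points
change_pct = u.pct_change().iloc[1] * 100    # change in percent
```

Here `change_pp` is 0.5 and `change_pct` is about 67, matching the two ways of describing the change.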


In [None]:
# Compute the growth rate of gdp. 
fred_q['gdp_gr'] = fred_q['GDPC1'].pct_change()*100        

# .diff() takes the first difference: u(t)-u(t-1).   
fred_q['unemp_dif'] = fred_q['UNRATE'].diff()              
fred_q.head()

We are ready to plot. The `ax.scatter()` method ([docs](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.scatter.html)) takes two pieces of data: the x variable and the y variable. 

```python
ax.scatter(x, y)
```

In our example, the x variable is the gdp growth rate and the y variable is the change in the unemployment rate. 

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
                       
ax.scatter(fred_q['gdp_gr'], fred_q['unemp_dif'])

ax.set_title('Okun\'s Law in the United States' )
ax.set_ylabel('change in unemployment rate')
ax.set_xlabel('gdp growth rate')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)


Clearly, there is a negative correlation here: Low gdp growth rates are associated with positive changes in the unemployment rate. When gdp is growing slowly (or falling) unemployment is rising. 
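If you want a number to go with the picture, the `.corr()` method of a Series computes the correlation coefficient. The sketch below uses made-up numbers, not the FRED data, only to show the call.

```python
import pandas as pd

# Made-up numbers for illustration only; these are not the FRED data.
toy = pd.DataFrame({'gdp_gr': [1.0, 0.5, -0.5, 2.0, -1.0],
                    'unemp_dif': [-0.2, 0.0, 0.4, -0.5, 0.6]})

# Pearson correlation between the two columns; a negative value means
# the variables tend to move in opposite directions.
r = toy['gdp_gr'].corr(toy['unemp_dif'])
```

With the data from this notebook, the analogous call would be `fred_q['gdp_gr'].corr(fred_q['unemp_dif'])`.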

## Practice: Scatters

Let's explore some of scatter plot's options. Use the [documentation](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.scatter.html) for help.  

Modify the Okun's Law figure we just made. 

1. Change the color of the dots to red.
2. Change the edgecolor of the markers to black. (markers have edges and faces)
3. Use the `s` option to set the size of the markers (in points squared) to 70.


In [None]:
fig, ax = plt.subplots(figsize=(10,5))

# your code here

Check out the documentation for [marker styles](https://matplotlib.org/api/markers_api.html). These styles can be used with the `.plot()` command, too. 

4. Change the marker to a triangle. 
5. Use `text()` or `annotate()` to label the point corresponding to third quarter 2009: '2009Q3'

In [None]:
fig, ax = plt.subplots(figsize=(10,5))

# your code here

Scatter plots are very useful and we can do a lot more with them. Here are some places to go from here.

1. Add a line of best fit. This is a bit clunky in matplotlib (use numpy's `polyfit` command), but not too bad. The seaborn package has a `regplot` command that makes this dead simple. 
2. Make data markers different colors or sizes depending on the value of a third variable. For example, you could get some more data and color the markers for years with a Republican president red and the markers for years with a Democratic president blue. 
3. Other ideas?
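Here is a rough sketch of the first two ideas in plain matplotlib. All of the data are made up: `x` and `y` lie exactly on a line, and `group` is a hypothetical third variable that picks the marker color.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data that lie exactly on the line y = 1 - 0.5x.
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 - 0.5 * x

# A made-up third variable that decides each marker's color.
group = [0, 1, 0, 1, 1]
colors = ['red' if g == 1 else 'blue' for g in group]

# Fit a degree-1 polynomial; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(x, y, c=colors)                          # one color per marker
ax.plot(x, intercept + slope * x, color='black')    # line of best fit
```

Because the toy data sit exactly on a line, the fitted slope and intercept recover -0.5 and 1. With real data the fit would only be approximate.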


The OECD has a project studying [broadband internet coverage](http://www.oecd.org/sti/broadband/broadband-statistics/) across countries. It tracks data on numbers of subscribers, speed, and prices. 

1. Load 'broadband_size.xlsx'. It contains data on broadband accounts per 100 people, GDP per capita, and population (in thousands) for several countries. Are all your variables okay? 
2. Give the columns some reasonable names. 


In [None]:
broad = pd.read_excel('broadband_size.xlsx', thousands=',')
broad.columns = ['cty', 'broad_pen', 'gdp_cap', 'pop']
broad.head()


3. Create a `.regplot()` with broadband penetration on the y axis and GDP  per capita on the x axis. Add the 95 percent confidence interval. Make it look nice. 

In [None]:
fig, ax = plt.subplots(figsize=(12,7)) 

sns.regplot(x='gdp_cap', y='broad_pen', data=broad,               # the data
            ax = ax,                                              # an axis object
            color = 'blue',                                       # make it blue
            ci = 95)                                              # confidence interval: pass it the percent

sns.despine(ax = ax) 

ax.set_title('Broadband penetration and income')
ax.set_ylabel('broadband subscribers per 100 people')
ax.set_xlabel('GDP per capita')

plt.show()

4. The relationship doesn't look very linear to me. Replot your solution from 3, but try adding the `logx=True` option to regplot to regress y on log(x). As always, consult the [documentation](https://seaborn.pydata.org/generated/seaborn.regplot.html) if you need help.

In [None]:
fig, ax = plt.subplots(figsize=(12,7)) 

# your code here

### Bubble plot (and passing keywords)
A bubble plot is a scatter plot in which the size of the data markers (usually a circle) varies with a third variable. 

We can actually make these plots in matplotlib. The syntax is 
```python
ax.scatter(x, y, s) 
```
where `s` is the variable corresponding to marker size. Since seaborn is built on top of matplotlib, we can pass *scatter keyword arguments* to `.regplot( )` and these get passed through to the underlying scatter. 

If we pass a single number to `s` it changes the size of all the bubbles. If we pass it a Series of data, then each bubble gets scaled according to its value in the series. 

The syntax for the option is `scatter_kws={'s': data_var}`. This sets the `s` argument of scatter to `data_var`. 
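For comparison, here is a minimal bubble plot made directly with matplotlib's `ax.scatter()`, which is the function that receives the `s` keyword from regplot. The three data points and the variable `size_var` are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: x, y, and a third variable that sets the marker areas.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 3.0])
size_var = np.array([20.0, 80.0, 320.0])

fig, ax = plt.subplots(figsize=(10, 5))
bubbles = ax.scatter(x, y, s=size_var, alpha=0.5)   # s is in points squared
```

Passing `s=80` instead of an array would draw all three markers at the same size.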

In [None]:
fig, ax = plt.subplots(figsize=(10,5)) 

sns.regplot(x='gdp_cap', y='broad_pen', data=broad,    # the data
            ax = ax,                                   # an axis object
            scatter_kws={'s': broad['pop']/1000},      # make the marker proportional to population            
            #scatter_kws={'s': 25},
            color = 'blue',                            # make it blue
            ci = 95,                                   # confidence interval: pass it the percent
            logx = True)                               # semi-log regression
                      
# We need to let the reader know what the bubble sizes represent.
ax.text(50000, 20, 'Marker size proportional to population size')

sns.despine(ax = ax)  
                                   

ax.set_title('Broadband penetration and income')
ax.set_ylabel('broadband subscribers per 100 people')
ax.set_xlabel('GDP per capita')
plt.show()

Notice that I have divided population by 1000. The issue is that `s` is interpreted as the marker's area in points^2 (points squared) [[docs](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html)], so the area of the marker is proportional to the square of its width. There is a good discussion of it at [stack overflow](https://stackoverflow.com/questions/14827650/pyplot-scatter-plot-marker-size).

If you try to use `s` and your whole figure turns the color of your marker, you probably need to scale your measure for `s`. 
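One way to pick the scaling, sketched below on made-up numbers, is to rescale the variable so the largest marker has a chosen area. The Series `pop` and the target `max_area` are assumptions for the illustration, not values from the broadband data.

```python
import pandas as pd

# Made-up populations in thousands, roughly the scale of a 'pop' column.
pop = pd.Series([5_000, 60_000, 320_000])

# Rescale so the largest marker has an area of max_area points squared.
max_area = 400   # an assumed target; adjust to taste
sizes = pop / pop.max() * max_area
```

The result could then be passed as `scatter_kws={'s': sizes}`. Since `s` is an area, a marker with four times the `s` value is only twice as wide.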

Another example of `scatter_kws` usage is to customize the scatter colors and alpha.

In [None]:
fig, ax = plt.subplots(figsize=(10,5)) 

# To keep the call to regplot from getting out of control, I define the scatter keywords dict here.
my_kws={'s': broad['pop']/1000, 'alpha':0.25, 'color':'black'}

sns.regplot(x='gdp_cap', y='broad_pen', data=broad,    # the data
            ax = ax,                                   # an axis object
            scatter_kws = my_kws,                      # pass parameters to scatter
            color = 'blue',                            # make it blue
            ci = 95,                                   # confidence interval: pass it the percent
            logx = True)                               # semi-log regression

# We need to let the reader know what the bubble sizes represent.
ax.text(50000, 20, 'Marker size proportional to population size')                                                         

sns.despine(ax = ax)
                                   

ax.set_title('Broadband penetration and income')
ax.set_ylabel('broadband subscribers per 100 people')
ax.set_xlabel('GDP per capita')
plt.show()

## Facet plots

Facet plots are grids of plots with the same x- and y-axes. Each plot in the grid is a different subset of the sample. Seaborn gives us a simple way to make these plots. 

We often use facet plots in initial exploratory analysis. If we do not know what we are looking for, a facet plot is a good way to start "eye-balling" relationships. Once we have some ideas, we can narrow down our focus and use more precise tools. In general, **we do not include large grids of figures in our finished analysis.** They contain too much unnecessary information.

Load the file 'auto_data.dta' which contains data on automobile characteristics in the European market. These data are from [Miravete, Moral, and Thurk](https://www3.nd.edu/~jthurk/Papers/MMT_RAND.pdf).

In [None]:
df = pd.read_stata('auto_data.dta')
df.head(3)

In [None]:
# Recode the FUEL variable so I can easily understand it.
df['FUEL'] = df['FUEL'].replace({0:'gasoline', 1:'diesel'})
df.sample(5)

Looking at the data, we see that a unit of observation is a model at a point in time. We see prices and quantities sold and characteristics about the model. Let's cut the data down to VW and try some plots. 

In [None]:
vw = df[df['FIRM']=='Volkswagen']
vw.sample(8)

**Q: How are vehicle weight and fuel efficiency related? Does it vary by fuel type? Does it vary by brand?**

* Volkswagen has four brands during this period: Audi, Seat, Skoda, and Volkswagen. 
* There are two fuel types: gasoline and diesel. 

Let's make a grid of plots where the rows are the brands and columns are the fuel types. This is a 4x2 grid. 

In each plot, we will scatter weight vs. mpg.

In [None]:
g = sns.FacetGrid(vw, row='BRAND', col='FUEL')
g.map(plt.scatter, "WEIGHT", "MPG", color='blue')

plt.show()

What is a point in a plot? It is a model-year. 

What do we see? 

* Diesel vehicles tend to get higher mpg
* Within a fuel type, the ranges of mpg are similar
* Weight and mpg are usually negatively correlated (Skoda diesel?)
* There is heterogeneity in the number of models per brand

### Facet plot syntax
We first create the grid using `FacetGrid()`. We specify which DataFrame we are plotting and which variables we want for the rows and columns. These variables should be *categorical* and should have relatively few potential values. Otherwise, the grid would get very large and it would be hard to interpret. 

```python
g = sns.FacetGrid(vw, row='BRAND', col='FUEL')
```

Next, we map a plot type to the grid using `map()`. We can make many types of plots. In this case we have used the `scatter()` plot from matplotlib. Notice the `plt` that precedes the `scatter`. We can also pass any keyword arguments that `plt.scatter` accepts.    

```python
g.map(plt.scatter, "WEIGHT", "MPG", color='blue')
```

Now let's try a different plot type, `regplot()` from Seaborn. It's easier to see the relationship between weight and mpg. 


In [None]:
g = sns.FacetGrid(vw, row='BRAND', col='FUEL', height=4)
g.map(sns.regplot, "WEIGHT", "MPG", color='red', ci=90)

plt.show()

**Q: Are more powerful cars more expensive? Does it depend on fuel type? Brand?**

We are not limited to just one type of data in each plot. We can use color to differentiate further. In the next figure we add Ford, PSA, and Fiat to the firms in our DataFrame. Each firm has several brands and each brand has several models.

* Columns are still fuel type
* Rows are now firms (VW, Ford, PSA, Fiat)
* Hue (color) is brand (Ford's Volvo; Fiat's Alfa Romeo, etc.)
```python
g = sns.FacetGrid(to_plot, hue='BRAND', col='FUEL', row='FIRM')
```

In each plot, we scatter price (x axis) against horsepower (y axis).

```python
g.map(plt.scatter, "PRICE", "HP")
```

In [None]:
# Create a dataframe with just these firms
firms = ['Ford', 'PSA', 'Volkswagen', 'Fiat']
to_plot = df[df['FIRM'].isin(firms)]

g = sns.FacetGrid(to_plot, hue='BRAND', col='FUEL', row='FIRM')
g.map(plt.scatter, "PRICE", "HP")
g.add_legend()
plt.show()

## Practice: Facet Plots



1. **Q: How is size related to price? Does it differ by firm? By brand? By fuel type?**

Use a facet plot to explore these questions. Restrict the DataFrame to include only Ford, PSA, Volkswagen, and Fiat. 

In [None]:
# The .isin() saves us from syntax like 
# to_plot = df[(df['FIRM']=='Ford') | (df['FIRM'] == 'PSA') | (df['FIRM']=='Volkswagen') | (df['FIRM']=='Fiat')]

firms = ['Ford', 'PSA', 'Volkswagen', 'Fiat']
to_plot = df[df['FIRM'].isin(firms)]
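To convince yourself that the two approaches select the same rows, here is a small check on a made-up `FIRM` column. The DataFrame `toy` is hypothetical.

```python
import pandas as pd

# A made-up FIRM column; 'Toyota' is here only to give the filter something to drop.
toy = pd.DataFrame({'FIRM': ['Ford', 'Toyota', 'PSA', 'Fiat', 'Volkswagen', 'Toyota']})

firms = ['Ford', 'PSA', 'Volkswagen', 'Fiat']
with_isin = toy[toy['FIRM'].isin(firms)]

# The same filter written as a chain of comparisons joined by | (or).
with_or = toy[(toy['FIRM'] == 'Ford') | (toy['FIRM'] == 'PSA') |
              (toy['FIRM'] == 'Volkswagen') | (toy['FIRM'] == 'Fiat')]

same_rows = with_isin.equals(with_or)
```

Both filters keep the four non-Toyota rows, but the `.isin()` version stays short no matter how long the list of firms gets.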

In [None]:
# Create your plot here

2. Let's explore a related concept, the `pairplot` [[docs](https://seaborn.pydata.org/generated/seaborn.pairplot.html)]. Try

```python
g=sns.pairplot(df, vars=['PRICE', 'HP', 'WEIGHT'])
```

What does `pairplot` do? Why do we only need to look at the upper or lower triangle of the figure? What is on the diagonal?

In [None]:
g=sns.pairplot(df, vars=['PRICE', 'HP', 'WEIGHT', 'MPG'])

# pairplot plots scatter plots for each possible pair of variables.
# The figure is symmetric, so any plot in the top triangle is also in 
# the bottom triangle, but with the axes reversed. 
# The diagonal is the histogram of the variable. 

3. How do these relationships differ by fuel type? Use the 'hue' option (and the documentation).

In [None]:
g=sns.pairplot(df, vars=['PRICE', 'HP', 'WEIGHT',  'MPG'], hue='FUEL')

# It is clear that gasoline has a right-shifted HP distribution and a left-shifted MPG distribution. 
# For diesel, HP doesn't look correlated with weight or mpg. For gas HP and MPG look negatively correlated. 
# The weight-mpg correlation looks to have a similar slope and a different intercept by fuel type.