# Plotting Exercise

### Exercise 1

Create a pandas dataframe from the "Datasaurus.txt" file using the code: 

```python
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/Datasaurus.txt', delimiter='\t')
```

Note that the file being downloaded is *not* actually a CSV file. It is tab-delimited, meaning that within each row, columns are separated by tabs rather than commas. We communicate this to pandas with the `delimiter="\t"` option (`"\t"` is how we write a tab, as we will discuss in future lessons). 
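
To see the effect of the `delimiter` option in isolation, here is a minimal sketch that parses a tiny tab-delimited string instead of a downloaded file. The text in `raw_text` is made up for illustration, and the `toy` name is just a placeholder.

```python
import io

import pandas as pd

# Made-up tab-delimited text standing in for a file on disk or at a URL.
raw_text = "example1_x\texample1_y\n1.0\t2.0\n3.0\t4.0\n"

# delimiter="\t" tells pandas to split each row on tabs rather than commas.
toy = pd.read_csv(io.StringIO(raw_text), delimiter="\t")

print(toy.shape)
print(list(toy.columns))
```

If you left out the `delimiter` argument here, pandas would look for commas, find none, and read each row as a single column.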

### Exercise 2

This dataset actually contains 13 separate example datasets, each with two variables named `example[number]_x` and `example[number]_y`. 

To get a better sense of what these datasets look like, write a loop that iterates over each example dataset (numbered 1 to 13) and prints out the mean and standard deviation of `example[number]_x` and `example[number]_y` for each dataset. 

For example, the first iteration of this loop might return something like:

```
Example Dataset 1: Mean x: 54.26609978429576, Mean y: 47.83472062494366, Std Dev x: 16.769824954043756, Std Dev y: 26.939743419267103
```
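
If you are unsure how to build column names inside a loop, the sketch below shows one possible structure on a made-up stand-in dataframe with only two example datasets. The values in `toy` and the `summaries` dictionary are assumptions for illustration; only the `example[number]_x` naming pattern comes from the real data.

```python
import pandas as pd

# Made-up stand-in with only two example datasets; the real file has 13.
toy = pd.DataFrame({
    "example1_x": [1.0, 2.0, 3.0],
    "example1_y": [2.0, 4.0, 6.0],
    "example2_x": [10.0, 20.0, 30.0],
    "example2_y": [5.0, 5.0, 8.0],
})

summaries = {}
for i in range(1, 3):  # for the real data this would be range(1, 14)
    x = toy[f"example{i}_x"]
    y = toy[f"example{i}_y"]
    # pandas .std() computes the sample standard deviation (ddof=1) by default.
    summaries[i] = (x.mean(), y.mean(), x.std(), y.std())
    print(
        f"Example Dataset {i}: Mean x: {x.mean()}, Mean y: {y.mean()}, "
        f"Std Dev x: {x.std()}, Std Dev y: {y.std()}"
    )
```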

### Exercise 3

Based only on these results, discuss with your partner what you might conclude about these example datasets. Write down your thoughts.


### Exercise 4

Write a loop that iterates over these example datasets, and using the `plotnine` library, plot a simple scatter plot of each dataset with the `x` variable on the x-axis and the `y` variable on the y-axis. Save these plots as PDFs somewhere you can find them. 

Hint: When writing this type of code, it is often best to start by writing code to do what you want for the first iteration of the loop. Once you have code that works for the first example dataset, then write the full loop around it. 

### Exercise 5

Review your plots. How does your impression of these datasets differ from what you wrote down in Exercise 3?

## Wealth and Democracy

Let's now pivot from working with example data to real data. Load the World Development Indicator data you worked with over the summer. This is country-level data that includes information on both countries' GDP per capita (a measure of wealth) and the Polity IV scores (a measure of how democratic a country is -- countries with higher scores are liberal democracies, countries with low scores are autocratic.). Use the code below to download the data. 

```python
wdi = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/world-small.csv')
```

Your data should look like this: 

```python
wdi.head()
```

| Unnamed: 0 | country | region | gdppcap08 | polityIV |
|---|---|---|---|---|
| 0 | Albania | C&E Europe | 7715 | 17.8 |
| 1 | Algeria | Africa | 8033 | 10.0 |
| 2 | Angola | Africa | 5899 | 8.0 |
| 3 | Argentina | S. America | 14333 | 18.0 |
| 4 | Armenia | C&E Europe | 6070 | 15.0 |


### Exercise 6

Let's begin analyzing this data by estimating a simple linear model ("ordinary least squares") of the relationship between GDP per capita (`gdppcap08`) and democracy scores (`polityIV`). We will do so using the `statsmodels` package, which we'll discuss in detail later in this course. For the moment, just use this code:

```python
import statsmodels.formula.api as smf
results = smf.ols('polityIV ~ gdppcap08',
                   data=wdi).fit()
print(results.summary())
```


                            OLS Regression Results                            
Dep. Variable:               polityIV   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.040
Method:                 Least Squares   F-statistic:                     6.981
Date:                Wed, 17 Jul 2019   Prob (F-statistic):            0.00915
Time:                        10:54:50   Log-Likelihood:                -475.14
No. Observations:                 145   AIC:                             954.3
Df Residuals:                     143   BIC:                             960.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     12.1354      0.721     16.841      0.0

### Exercise 7

Based on the results of this analysis, what would you conclude about the relationship between `gdppcap08` and `polityIV`? 

(If you aren't familiar with linear models and aren't sure how to interpret this, you can also just look at the simple correlation between these two variables using `wdi[['polityIV', 'gdppcap08']].corr()`.)
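
The two approaches are closely linked: for a linear model with a single predictor, the R-squared equals the square of the correlation between the two variables. An R-squared of 0.047 therefore corresponds to a correlation of roughly 0.22 in absolute value. The sketch below checks this identity on made-up numbers (not the WDI data), using `numpy.polyfit` as a stand-in for the least-squares fit.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration; these are not the WDI data.
toy = pd.DataFrame({
    "gdppcap08": [1000, 5000, 12000, 30000, 45000],
    "polityIV": [8.0, 14.0, 12.0, 19.0, 17.0],
})

# .corr() returns a matrix; pick out the off-diagonal entry.
corr_matrix = toy[["polityIV", "gdppcap08"]].corr()
r = corr_matrix.loc["polityIV", "gdppcap08"]

# R-squared from a one-predictor least-squares line, for comparison.
slope, intercept = np.polyfit(toy["gdppcap08"], toy["polityIV"], 1)
fitted = intercept + slope * toy["gdppcap08"]
ss_res = ((toy["polityIV"] - fitted) ** 2).sum()
ss_tot = ((toy["polityIV"] - toy["polityIV"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(r ** 2, r_squared)  # these two numbers should agree
```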

Write down your conclusions. 

### Exercise 8

Now let's plot the relationship you just estimated statistically. First, use `plotnine` to create a scatter plot of `polityIV` and `gdppcap08`. 

### Exercise 9

Now overlay the linear model you just estimated. You can do this by adding a `geom_smooth()` layer, where the `method` argument is set to `'lm'` (for linear model). 

### Exercise 10

Does it seem like the linear model you estimated fits the data well?

### Exercise 11

Linear models impose a very strict *functional form* on the model they use: they try to draw a straight line through the data, no matter what. Let's consider a more flexible functional form. Change the `method` in your `geom_smooth` to `"lowess"`. This is a form of local polynomial regression that is designed to be flexible in how it fits the data. 

### Exercise 12

This does seem to fit the data better, but there's clearly this HUGE outlier in the bottom right. Who is that? Using `geom_text()`, label the points on your graph with country names. 

### Exercise 13

Interesting. It seems that there are a lot of rich, undemocratic countries that all have something in common: they're oil-rich, small, Middle Eastern countries.

Let's see what happens if we exclude the ten countries with the highest per-capita oil production from our data: Qatar, Kuwait, Equatorial Guinea, United Arab Emirates (UAE), Norway, Brunei, Saudi Arabia, Libya, Oman, and Gabon. (Note this was in 2007!)

To do this, I would recommend creating a new variable called `bigproducer` that is `True` if `country` matches a name in that list, and `False` otherwise. You may find [the `isin` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html) useful. 
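
Here is a minimal sketch of how `isin` behaves, using a made-up four-row dataframe and a shortened list of names. Note that `isin` only matches exact strings, so a name written differently in the data than in your list (an abbreviation, for instance) will not be flagged.

```python
import pandas as pd

# Made-up stand-in; the real data has many more countries.
toy = pd.DataFrame({
    "country": ["Norway", "Albania", "Qatar", "Argentina"],
})

producers = ["Qatar", "Norway"]  # shortened list, for illustration only
toy["bigproducer"] = toy["country"].isin(producers)

# True counts as 1 when summed, so this counts the flagged rows.
print(toy["bigproducer"].sum())

# Exact matching: a differently written name is not a match.
abbreviated = pd.Series(["UAE"]).isin(["United Arab Emirates"])
print(abbreviated.tolist())
```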

Let's make sure that you accurately identified all 10 of the oil producers. Is the value of `wdi['bigproducer'].sum()` 10? If not, can you figure out what you did wrong?

### Exercise 14

How does the relationship between GDP per capita and Polity look without those oil producers? Does it look the same as it did with the oil producers included?
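
A boolean column can be used directly to keep or drop rows. The sketch below uses a made-up dataframe with a `bigproducer` column already filled in; `~` flips `True` and `False`, so the first subset drops the flagged rows and the second keeps only them.

```python
import pandas as pd

# Made-up stand-in with the flag already in place.
toy = pd.DataFrame({
    "country": ["Norway", "Albania", "Qatar", "Argentina"],
    "bigproducer": [True, False, True, False],
})

non_producers = toy[~toy["bigproducer"]]   # rows where the flag is False
producers_only = toy[toy["bigproducer"]]   # rows where the flag is True

print(len(non_producers), len(producers_only))
```

Either subset can then be passed to `ggplot()` in place of the full dataframe.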

### Exercise 15

Now that we've gotten a good sense of the relationship between wealth and democracy for non-oil producers, can we draw any conclusions about the relationship between polity scores and wealth for the oil producers? Plot the polity / GDP per capita relationship *just* for the oil producers.

### Exercise 16

Look back to your answer for Exercise 7. Do you still believe the result of your linear model? What did you learn from plotting? Write down your answers with your partner. 

## Take-aways

One of our main jobs as data scientists is to *summarize* data. In fact, it's such an obvious part of our jobs that we often don't think about it very much. In reality, however, this is one of the most difficult things we do. 

Summarization means taking rich, complex data and trying to tell readers about what is going on in that data using simple statistics. In the process of summarization, therefore, we must necessarily throw away much of the richness of the original data. When done well, this simplification makes data easier to understand, but only if we throw away the *right* data. You can *always* calculate the average value of a variable, or fit a linear model, but whether doing so generates a summary statistic that properly represents the essence of the data being studied depends on the data itself. 

Plotting is one of the best tools we have as data scientists for evaluating whether we are throwing away the *right* data. As we learned from Part 1 of this exercise, just looking at means and standard deviations can mask tremendous variation. Each of our example datasets looked the same when we examined our summary statistics, but they were all radically different when plotted. 

Similarly, a simple linear model would "tell" us that if GDP per capita increases by \$10,000, we would expect Polity scores to increase by about 1 (i.e., the coefficient in the linear model was 9.602e-05). But when we plot the data, not only can we see that the data is definitely *not* linear (and so that slope doesn't really mean anything), but we can also see that oil-producing countries seem to defy the overall trend, and so should maybe be studied separately. 
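
The "about 1" figure is just the coefficient multiplied by the change in GDP per capita, which you can verify in a couple of lines. The variable names below are placeholders.

```python
coef = 9.602e-05       # slope on gdppcap08 reported above
gdp_change = 10_000    # a $10,000 increase in GDP per capita

predicted_change = coef * gdp_change
print(round(predicted_change, 2))
```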

Moreover, we can see that if we just look at oil producers, there is no clear story: some are rich and democratic, while others are rich and autocratic (indeed, [this observation is the foundation of some great research on the political consequences of resource wealth](https://www.jstor.org/stable/41480824)!)

So remember this: tools for summarizing data will always give you an answer, but it's up to you as a data scientist to make sure that the summaries you pass on to other people properly represent the data you're using. And there is perhaps no better way to do this than with plotting!

