In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline

# Class 5: Managing Data with Pandas 

Pandas is a Python library for managing datasets. Documentation and examples are available on the website for Pandas: http://pandas.pydata.org/. 

In this Notebook, we'll make use of a dataset containing long-run averages of inflation, money growth, and real GDP. The dataset is available here: https://raw.githubusercontent.com/letsgoexploring/economic-data/master/quantity-theory/csv/quantity_theory_data.csv (Python code to generate the dataset: https://github.com/letsgoexploring/economic-data). Recall that the quantity theory of money implies the following linear relationship between the long-run rate of money growth, the long-run rate of inflation, and the long-run rate of real GDP growth in a country:

\begin{align}
\text{inflation} & = \text{money growth} - \text{real GDP growth}.
\end{align}

Generally, we treat real GDP growth and money supply growth as exogenous, so this is a theory about the determination of inflation.

### Import Pandas

In [None]:
# Import the Pandas module as pd


### Import data from a csv file

Pandas has a function called `read_csv()` for reading data from a csv file into a Pandas `DataFrame` object.
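As a minimal sketch of how `read_csv()` behaves, the block below reads csv text from an in-memory buffer instead of a file or URL. The csv text, its values, and the variable names `csv_text` and `toy` are made up for illustration and are not rows from the quantity theory dataset.

```python
import io
import pandas as pd

# Hypothetical csv text with made-up values; column names are assumptions
csv_text = """country,inflation,money growth,gdp growth
Country A,0.02,0.05,0.03
Country B,0.10,0.14,0.04
Country C,0.05,0.07,0.02
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))     # the result is a pandas DataFrame
print(toy.head(2))   # first two rows
```

Passing a URL string or a local file name in place of the buffer works the same way.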

In [None]:
# Import quantity theory data into a Pandas DataFrame called 'df'.

# Directly from internet


# From current working directory
# df = pd.read_csv('quantity_theory_data.csv')

In [None]:
# Print the first 5 rows


In [None]:
# Print the last 10 rows


In [None]:
# Print the type of variable 'df'


### Properties of `DataFrame` objects

Like entries in a spreadsheet file, elements in a `DataFrame` object have row (or *index*) and column coordinates. Column names are usually strings. Index elements can be integers, strings, or dates.

In [None]:
# Print the columns of df


In [None]:
# Create a new variable called 'money' equal to the 'money growth' column and print



In [None]:
# Print the type of the variable money


A Pandas `Series` stores one column of data. Like a `DataFrame`, a `Series` object has an index. Note that `money` has the same index as `df`. Instead of column names, a `Series` has a single `name` attribute.
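The relationship between a `DataFrame` and a `Series` can be sketched with a small made-up table. The values and the names `toy` and `money_toy` are assumptions for illustration only.

```python
import pandas as pd

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]}
)

# Selecting one column with square brackets returns a Series
money_toy = toy["money growth"]

print(type(money_toy))                    # a pandas Series
print(money_toy.name)                     # the name is the original column label
print(money_toy.index.equals(toy.index))  # the Series shares the DataFrame's index
```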

In [None]:
# Print the name of the 'money' variable


Select multiple columns of a `DataFrame` by putting the desired column names in a set of square brackets (i.e., in a `list`).
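A minimal sketch of list-based column selection, again on a made-up table (the values and the names `toy` and `subset` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {
        "inflation": [0.02, 0.10, 0.05],
        "money growth": [0.05, 0.14, 0.07],
        "gdp growth": [0.03, 0.04, 0.02],
    }
)

# The outer brackets index the DataFrame; the inner brackets build the list of names
subset = toy[["inflation", "money growth"]]

print(type(subset))   # still a DataFrame, not a Series
print(subset.head())
```

Note the double brackets: a single name in single brackets returns a `Series`, while a list of names returns a `DataFrame`.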

In [None]:
# Print the first 5 rows of just the inflation, money growth, and gdp growth columns


As mentioned, the set of row coordinates is the index. Unless specified otherwise, Pandas automatically assigns an integer index starting at 0 to rows of the `DataFrame`.

In [None]:
# Print the index of 'df'


Note that the index of `df` is the integers 0 through 177. We could have specified a different index when we imported the data using `read_csv()`. For example, suppose we want the country names to be the index of `df`. Since country names are in the first column of the data file, we can pass the argument `index_col=0` to `read_csv()`.
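The effect of `index_col=0` can be sketched on in-memory csv text. The text, its values, and the names `csv_text`, `default_index`, and `country_index` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical csv text with made-up values; column names are assumptions
csv_text = """country,inflation,money growth
Country A,0.02,0.05
Country B,0.10,0.14
Country C,0.05,0.07
"""

# Without index_col, pandas assigns an integer index starting at 0
default_index = pd.read_csv(io.StringIO(csv_text))
print(default_index.index)

# With index_col=0, the first column of the file becomes the index
country_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(country_index.index)
```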

In [None]:
# Import quantity theory data into a Pandas DataFrame called 'df' with country names as the index.


# Print first 5 rows of df


Use the `loc` attribute to select rows of the `DataFrame` by index *values*.
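As a small sketch of label-based row selection, assuming a made-up table indexed by hypothetical country labels (the names `toy` and `row_b` are illustrative):

```python
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]},
    index=["Country A", "Country B", "Country C"],
)

# loc looks up the row whose index value equals the given label
row_b = toy.loc["Country B"]

print(row_b)        # a Series whose index is the column names
print(row_b.name)   # the Series name is the row label
```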

In [None]:
# Create a new variable called 'usa_row' equal to the 'United States' row and print



Use the `iloc` attribute to select rows based on integer location (starting from 0).
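A minimal sketch of position-based selection on a made-up table (values, labels, and the names `toy`, `third`, and `first_two` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]},
    index=["Country A", "Country B", "Country C"],
)

# Positions start at 0, so the third row is at position 2
third = toy.iloc[2]
print(third)

# Slices work as they do for Python lists: positions 0 and 1 here
first_two = toy.iloc[0:2]
print(first_two)
```

Because `iloc` ignores index labels, it works the same way whether the index holds integers or country names.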

In [None]:
# Create a new variable called 'third_row' equal to the third row in the DataFrame and print



There are several ways to return a single element of a Pandas `DataFrame`. For example, here are three ways to return the value of inflation for the United States from the DataFrame `df`:

1. `df.loc['United States','inflation']`
2. `df.loc['United States']['inflation']`
3. `df['inflation']['United States']`

The first method points directly to the element in `df`, while the second and third methods use chained indexing, which may return a *copy* of the data rather than the element itself. That means that you can modify the value of inflation for the United States by running:

    df.loc['United States','inflation'] = new_value
    
But running either:

    df.loc['United States']['inflation'] = new_value
    
or:

    df['inflation']['United States'] = new_value

may trigger a warning from Pandas and is not guaranteed to modify `df`.
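The three access patterns and the recommended assignment can be sketched on a made-up table. The values, labels, and variable names below are assumptions for illustration; `0.12` is an arbitrary replacement value.

```python
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]},
    index=["Country A", "Country B", "Country C"],
)

# Three ways to read the same element
by_loc = toy.loc["Country B", "inflation"]
by_row_then_col = toy.loc["Country B"]["inflation"]
by_col_then_row = toy["inflation"]["Country B"]
print(by_loc, by_row_then_col, by_col_then_row)

# Assign through a single .loc call so the change lands in toy itself
toy.loc["Country B", "inflation"] = 0.12
print(toy.loc["Country B", "inflation"])
```

All three forms are fine for reading a value; the distinction matters only when assigning.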

In [None]:
# Print the inflation rate of the United States  (By index and column together)


In [None]:
# Print the inflation rate of the United States (first by index, then by column)


In [None]:
# Print the inflation rate of the United States  (first by column, then by index)


New columns are easily created as functions of existing columns.
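A minimal sketch of adding, summarizing, and then removing a column, using a made-up table (the values and the names `toy` and `avg_difference` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]}
)

# Arithmetic on columns is element-wise and aligned on the index
toy["difference"] = toy["money growth"] - toy["inflation"]
print(toy)

# Series methods such as mean() summarize a single column
avg_difference = toy["difference"].mean()
print(avg_difference)

# drop() returns a new DataFrame without the listed columns
toy = toy.drop(columns=["difference"])
print(toy.columns)
```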

In [None]:
# Create a new column called 'difference' equal to the money growth column minus 
# the inflation column and print the modified DataFrame



In [None]:
# Print the average difference between money growth and inflation


In [None]:
# Remove the following columns from the DataFrame: 'iso code','observations','difference'


# Print the modified DataFrame


### Methods

A Pandas `DataFrame` has many useful methods defined for it. `describe()` returns some summary statistics.
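As a quick sketch of what `describe()` produces, here it is applied to a made-up table (the values and the names `toy` and `summary` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]}
)

# describe() returns a DataFrame with one column per numeric column of toy
summary = toy.describe()
print(summary)

# Rows are labeled by statistic, so individual values can be read with loc
print(summary.loc["mean", "inflation"])
```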

In [None]:
# Print the summary statistics for 'df'


The `corr()` method returns a `DataFrame` containing the pairwise correlation coefficients between the columns of the `DataFrame`.
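The shape of the result can be sketched on a made-up table (the values and the names `toy` and `toy_corr` are illustrative assumptions, so the coefficients themselves carry no economic meaning):

```python
import pandas as pd

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {
        "inflation": [0.02, 0.10, 0.05],
        "money growth": [0.05, 0.14, 0.07],
        "gdp growth": [0.03, 0.04, 0.02],
    }
)

# corr() returns a square DataFrame: rows and columns are both the column names
toy_corr = toy.corr()
print(toy_corr)

# A single coefficient is read with loc using the two column names
print(toy_corr.loc["inflation", "money growth"])
```

The matrix is symmetric with ones on the diagonal, so each pair of columns only needs to be looked up once.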

In [None]:
# Create a variable called 'correlations' containing the correlation coefficients for columns in 'df'


# Print the correlation coefficients


In [None]:
# Print the correlation coefficient for inflation and money growth


# Print the correlation coefficient for inflation and real GDP growth


# Print the correlation coefficient for money growth and real GDP growth


`sort_values()` returns a copy of the original `DataFrame` sorted along the given column. The optional argument `ascending` is set to `True` by default, which puts the lowest values first, but can be changed to `False` if you want the highest values first.
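A minimal sketch of sorting in both directions on a made-up table (values, labels, and the names `toy`, `lowest`, and `highest` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]},
    index=["Country A", "Country B", "Country C"],
)

# Default ascending=True: lowest inflation first
lowest = toy.sort_values("inflation").head(2)
print(lowest)

# ascending=False: highest inflation first
highest = toy.sort_values("inflation", ascending=False).head(2)
print(highest)

# toy itself keeps its original row order
print(toy.index)
```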

In [None]:
# Print rows for the countries with the 10 lowest inflation rates


In [None]:
# Print rows for the countries with the 10 highest inflation rates


Note that `sort_values()` and `sort_index()` return *copies* of the original `DataFrame`. If we wanted to actually modify `df`, for example by sorting its index, we would need to explicitly overwrite it:

    df = df.sort_index(ascending=False)
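The difference between sorting a copy and overwriting the original can be sketched on a made-up table (labels, values, and the names `toy`, `sorted_copy`, and `order_before` are illustrative assumptions):

```python
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05]},
    index=["Country A", "Country B", "Country C"],
)

# Sorting returns a new object and leaves toy untouched
sorted_copy = toy.sort_index(ascending=False)
order_before = list(toy.index)
print(order_before)

# Reassigning the name is what actually changes toy
toy = toy.sort_index(ascending=False)
print(list(toy.index))
```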

In [None]:
# Print first 10 rows with the index sorted in descending alphabetical order


### Quick plotting example

Construct a graph that visually confirms the quantity theory of money by making a scatter plot with average money growth on the horizontal axis and average inflation on the vertical axis. Set the marker size `s` to 50 and opacity (`alpha`) to 0.25. Add a 45 degree line, axis labels, and a title. Lower and upper limits for the horizontal and vertical axes should be -0.2 and 1.2.
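One possible way to assemble such a figure is sketched below on a made-up table, so the plotted points are not the actual data. The values, the names `toy`, `line`, `fig`, and `ax`, and the label and title text are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up illustration values, not rows from the actual dataset
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]}
)

# Two endpoints are enough to draw a straight 45 degree line
line = np.array([-0.2, 1.2])

fig, ax = plt.subplots()
ax.plot(line, line, label="45 degree line")

# Money growth on the horizontal axis, inflation on the vertical axis
ax.scatter(toy["money growth"], toy["inflation"], s=50, alpha=0.25)

ax.legend(loc="lower right")
ax.set_xlim([-0.2, 1.2])
ax.set_ylim([-0.2, 1.2])
ax.set_xlabel("Money growth")
ax.set_ylabel("Inflation")
ax.set_title("Inflation and money growth (toy data)")
```

Points close to the 45 degree line are those where inflation roughly equals money growth.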

In [None]:
# Create data for 45 degree line


# Create figure and axis


# Plot 45 degree line and create legend in lower right corner


# Scatter plot of inflation against money growth




### Exporting a `DataFrame` to csv

Use the `DataFrame` method `to_csv()` to export a `DataFrame` to a csv file.
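A round trip through a csv file can be sketched as follows. The table values, the file name `toy_data.csv`, and the use of a temporary directory are assumptions chosen so the example leaves no file behind.

```python
import os
import tempfile
import pandas as pd

# Made-up illustration values; labels and column names are assumptions
toy = pd.DataFrame(
    {"inflation": [0.02, 0.10, 0.05], "money growth": [0.05, 0.14, 0.07]},
    index=pd.Index(["Country A", "Country B", "Country C"], name="country"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data.csv")  # hypothetical file name

    # to_csv() writes the index as the first column by default
    toy.to_csv(path)

    # Reading it back with index_col=0 restores the labels as the index
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded)
```

In the notebook itself, passing a plain file name such as `'modified_data.csv'` writes the file to the current working directory.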

In [None]:
# Export the DataFrame 'df' to a csv file called 'modified_data.csv'.
