# Introduction

files needed = `industrial_production.xlsx`

This week we are working on

1. Working with DataFrames
2. Practice working with messy data
3. Looking through documentation


In [None]:
import pandas as pd

# Checking Documentation

- This section is about how to find solutions to your problems when writing code.

- The first place to look is the documentation for the function or package itself. Most well-developed packages explain each function or object and show the proper way to use it.

- Check the [pandas documentation here](https://pandas.pydata.org/docs/). 

- Try searching for something like "dataframe" or ["melt"](https://pandas.pydata.org/docs/reference/api/pandas.melt.html?highlight=melt#pandas.melt).

- Python has [its own documentation](https://docs.python.org/3.7/library/).


- You can also check the documentation in Jupyter notebooks by using the `?` operator.

- For example, if you wanted to check the documentation for the `melt` function, you could type `?pd.melt` in a code cell and run it.

In [None]:
?pd.melt
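
The `?` syntax only works in IPython and Jupyter. In a plain Python script, the same information is available through the built-in `help()` function (for example, `help(pd.melt)`) or the standard-library `inspect` module. A minimal sketch of the latter:

```python
import inspect
import pandas as pd

# The docstring that `?pd.melt` displays is stored on the function itself.
doc = inspect.getdoc(pd.melt)
print(doc.splitlines()[0])      # first line: a one-sentence summary

# The signature lists the arguments and their defaults.
sig = inspect.signature(pd.melt)
print(sig)
```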

What if we get stuck? Here's where some "Google skills" come in handy. If you're not familiar with Google's hidden search operators, [check them out here](https://support.google.com/websearch/answer/2466433?hl=en), and some others [here](http://www.googleguide.com/advanced_operators_reference.html).

Try searching for terms related to the concepts you're struggling with (pandas functions, if/else statements, etc.) rather than something specific to the problem you're trying to solve. A more general search is more likely to turn up what you want. StackExchange has well-organized answers to problems encountered by everyone from beginners to experts.

When you find your answer, be sure to cite it, preferably with a link in your code, or alternatively, in a text file for your safekeeping.

## 1. Inspecting DataFrames

Describing and learning about a DataFrame is typically the first thing you do after defining it. We might want to know:

1. How big is this DataFrame?
2. What are the column or row names?
3. What are the data types? Are they the types we expected? 
4. How do we peek at small parts of the DataFrame?

Once we think our DataFrame is set up correctly, we can move on to the analysis.

### Loading and setting up your DataFrame

1. Load the file `industrial_production.xlsx` into a DataFrame. We want the sheet 'Quarterly'. 
2. Print out the first 4 rows and the last 4 rows of the DataFrame.
3. Set the index to 'DATE'.
4. Those variable names are terrible. Check out the 'README' tab in the Excel workbook for the definitions. Rename the columns with sensible names.  

In [None]:
# 1. Load the 'Quarterly' sheet into a DataFrame.
ind = pd.read_excel('industrial_production.xlsx', sheet_name='Quarterly')

# 2. First four rows and last four rows.
print(ind.head(4))
print(ind.tail(4))

In [None]:
# 3. Set the index to 'DATE'.

ind = ind.set_index('DATE')

ind.head(6)
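
Note that `set_index` returns a new DataFrame rather than changing the original, which is why the cell above assigns the result back to `ind`. A small sketch with a made-up two-row DataFrame (the values are illustrative, not taken from the workbook):

```python
import pandas as pd

# A tiny stand-in for the real data; the numbers are made up.
toy = pd.DataFrame({'DATE': pd.to_datetime(['2020-01-01', '2020-04-01']),
                    'IPGMFSQ': [100.0, 90.0]})

indexed = toy.set_index('DATE')   # returns a new DataFrame

print(toy.columns.to_list())      # 'DATE' is still a column in the original
print(indexed.index.name)         # 'DATE' is the index of the new DataFrame
```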

### Changing column names

We need to change nine column names. This is a bit tedious, but at least we only have to do this once. Once written, we can reuse the code anytime we deal with this file. Below are three different ways to do it. 

In [None]:
# 4. Change the column names (method 1). 

# This is the slick way. 
# Grab the column names from the index. 
# Make a list of the new column names.
# Zip them together (check the documentation for zip) and then create a dict. 
# The columns have to be in the correct order with respect to new_names.

old_names = ind.columns.to_list()
new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
names = dict(zip(old_names, new_names))
ind = ind.rename(columns=names)

ind.head(2)
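
To see what `zip` and `dict` are doing in method 1, here is a minimal sketch using three of the column codes. The variable names `codes`, `labels`, and `mapping` are just for illustration, and the length check is an added safeguard that is not part of the code above.

```python
codes = ['IPB51000SQ', 'IPG21SQ', 'IPGMFSQ']
labels = ['consumer', 'mining', 'manuf']

# zip stops silently at the shorter list, so check the lengths first.
assert len(codes) == len(labels)

# zip pairs the items by position; dict turns the pairs into a mapping.
mapping = dict(zip(codes, labels))
print(mapping)
```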


Change the column names (method 2). 

If you know for sure that the columns are in the correct order, you can also do this. In method 1, you can look at `names` and make sure the new and old names line up before changing the column names. 

```python
new_names = ['consumer', 'consumer durables', 'crude oil', 'mining', 'elec and gas', 'cars', 'manuf', 'ice cream', 'steel']
ind.columns = new_names
```

Change the column names (method 3).

The most tedious, but robust way. No matter what order the columns are in, this will replace the names properly. 
```python
names = {'IPB51000SQ':'consumer', 'IPB51100SQ':'consumer durables', 'IPG211111CSQ':'crude oil',
         'IPG21SQ':'mining', 'IPG2211A2SQ':'elec and gas', 'IPG3361T3SQ':'cars',
         'IPGMFSQ':'manuf', 'IPN31152NQ':'ice cream', 'IPN3311A2RNQ':'steel'}
ind = ind.rename(columns=names)
```
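
One caveat with `rename`: by default it silently ignores dictionary keys that do not match any column, so a typo in a code leaves that column unrenamed. Passing `errors='raise'` makes pandas raise a `KeyError` instead. A sketch with a made-up one-row DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'IPG21SQ': [1.0], 'IPGMFSQ': [2.0]})   # made-up values

# 'IPGMFSX' is a deliberate typo. The default (errors='ignore') skips it.
names = {'IPG21SQ': 'mining', 'IPGMFSX': 'manuf'}
renamed = toy.rename(columns=names)
print(renamed.columns.to_list())

# With errors='raise', the unmatched key is reported.
try:
    toy.rename(columns=names, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```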

### DataFrame attributes
Every object, including a DataFrame, has attributes. We access attributes using the `.` operator. 


5. Try `dtypes`. What does it tell you?
6. Try `shape`. What is the return type? What does it tell you?
7. Try `columns`. What is the return type? What does it tell you?
8. Try `index`. What is the return type? What does it tell you?


In [None]:
# 5. dtypes

# Each column is made up of floats. (float64 is the same as float)
ind.dtypes

In [None]:
# 6. shape

# This gives us the number of rows and columns in a tuple.
ind.shape

In [None]:
# 7. (column) index 

# An index object. The column names. 
ind.columns

In [None]:
# 8. (row) index 

# Another index object. A different type (a DatetimeIndex, to hold dates). 
ind.index
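
Because `shape` is a plain tuple and the indexes are objects with their own attributes, you can use them directly in later code. A sketch on a made-up DataFrame with a date index:

```python
import pandas as pd

# Made-up values; only the structure matters here.
toy = pd.DataFrame({'mining': [1.0, 2.0, 3.0], 'steel': [4.0, 5.0, 6.0]},
                   index=pd.to_datetime(['2019-10-01', '2020-01-01', '2020-04-01']))

n_rows, n_cols = toy.shape          # unpack the tuple
print(n_rows, n_cols)

print(toy.columns.to_list())        # column index as a regular list
print(toy.index.year.to_list())     # a DatetimeIndex knows its years
```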

### DataFrame methods
Objects also have methods. The following are methods of DataFrame. We also access them with the `.` operator, but they behave like functions, so we call them with parentheses. 

9. Try `sample(5)`. What does it tell you?
10. Try `describe()`. What does it tell you?

In [None]:
# 9. sample()

# A random sample of rows.
ind.sample(5)
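
`sample` returns different rows each time you run it. If you need the same rows every time (for example, so a classmate can reproduce your output), pass `random_state`. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})

# The same seed gives the same rows on every call.
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)
print(first.equals(second))
```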

In [None]:
# 10. describe()

# Summary statistics. If a column is not made up of numbers, it will not appear here. 
ind.describe()
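
By default, `describe` on a DataFrame with mixed column types reports only the numeric columns. To include the others, pass `include='all'`. A sketch with a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'score': [1.0, 2.0, 3.0]})

print(toy.describe())                # only 'score'
print(toy.describe(include='all'))   # 'country' too, with count/unique/top/freq
```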

## 2. Working with messy data: PISA scores

The [PISA](https://www.oecd.org/pisa/) test is given to 15-year-olds around the world. It evaluates reading, math, and science skills. 

1. In a web browser, go to [dx.doi.org/10.1787/888932937035](http://dx.doi.org/10.1787/888932937035). This should initiate a download of an Excel file with PISA scores across countries. Open the workbook and take a look. This is a bit of a mess.

The issue here is that the workbook was formatted for humans to read. Since it is not a neat rectangular block of data, we will need to 'wrangle' it into shape. This is a common task in the real world, so let's practice some more.  

In [None]:
# 2. Reading from the internet.

url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,             # skip the first 18 rows
                     skipfooter=7,            # skip the last 7
                     usecols=[0,1,9,13],      # select columns of interest
                     )

# Rather than use the 'usecols' argument, you could have loaded all the columns and dropped the ones you do not want.
# Notice that the first row is a bunch of text. That will cause some problems later...

pisa.head()
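
The comment above mentions an alternative to `usecols`: load everything, then keep the columns you want by position with `iloc`. A sketch on a made-up 14-column DataFrame standing in for the raw sheet:

```python
import pandas as pd

# Stand-in for the raw sheet: 2 rows, 14 columns labeled 0..13 (made-up values).
raw = pd.DataFrame([[f'r{r}c{c}' for c in range(14)] for r in range(2)])

# Keep the same positions that usecols=[0, 1, 9, 13] would have selected.
subset = raw.iloc[:, [0, 1, 9, 13]]
print(subset)
```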

3. Look up `dropna()` in the pandas documentation. Clean up your DataFrame by dropping any rows that have *at least one* `NaN`. Save the result into a new DataFrame named `pisa2`.

   How many rows are in `pisa2`? 
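
As a hint for reading the documentation, the `how` argument of `dropna` controls which rows are dropped. A sketch on made-up data (not the PISA workbook):

```python
import pandas as pd

nan = float('nan')
toy = pd.DataFrame({'a': [1.0, nan, nan],
                    'b': [2.0, 3.0, nan]})

any_dropped = toy.dropna()            # default how='any': drop rows with at least one NaN
all_dropped = toy.dropna(how='all')   # drop only rows where every value is NaN

print(len(any_dropped), len(all_dropped))
```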

In [None]:
# 3. Drop NaNs.


4. Using `pisa2`, make the country names the index.


In [None]:
# 4. Set the index.


5. Using `pisa2`, print out the ratios of the United States PISA scores (math, reading, science) relative to the OECD average.
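
One way to approach this is to select two rows with `.loc` and divide them; pandas aligns the values by column name. A sketch with made-up scores and hypothetical row labels (check the exact labels in your own index):

```python
import pandas as pd

# Made-up scores; 'Average' and 'Country X' are hypothetical labels.
toy = pd.DataFrame({'math': [500.0, 480.0], 'read': [500.0, 505.0]},
                   index=['Average', 'Country X'])

ratios = toy.loc['Country X'] / toy.loc['Average']
print(ratios)
```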

In [None]:
# 5. US relative to the average. 
# The US is pretty average...

6. **Challenging.** Use `pisa2`. How correlated are the math, reading, and science scores with each other?  Write the correlation matrix to a file called 'pisa_corrs.xlsx'.

    This is a challenging question because, depending on how you read in the data, your columns are probably of type `object` (strings, basically) and `.corr()` won't work. Take a look at the first row of `pisa` to see why the data are stored as strings. Search around and see if you can convert the three columns to numbers. Then find the correlations. 
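
One option for the conversion is `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be parsed into `NaN`. A sketch on made-up string data (other approaches, such as `astype`, would also work on clean data):

```python
import pandas as pd

# Made-up string data with one unparseable entry.
toy = pd.DataFrame({'math': ['1', '2', '3', '4'],
                    'read': ['2', '4', '6', 'n/a']}, dtype=object)

# Convert each column; unparseable entries such as 'n/a' become NaN.
numeric = toy.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)

# .corr() skips NaN pairs and returns a DataFrame of pairwise correlations.
corrs = numeric.corr()
print(corrs)
```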

In [None]:
# 6. Convert types/compute correlations
# Strings! 


In [None]:
# New column names
pisa2.columns = ['math', 'read', 'sci']  

# There are several ways to convert strings to numeric values. This is one of them.


In [None]:
# Now we are in good shape. What does .corr() do?
# What is the return type of .corr()?

pisa_corrs = pisa2.corr()
print(pisa_corrs)
print(type(pisa_corrs))

# Now save the DataFrame of results to a file. 
pisa_corrs.to_excel('pisa_corrs.xlsx')
