# Importing (and Exporting) Data
* Contact: Lachlan Deer, [econgit] @ldeer, [github/twitter]: @lachlandeer

In our early adventures we have focussed on attributes of pandas objects, and created small data sets on the fly. This is rarely how empirical work happens - usually we have some data stored on our computer / in the cloud that we want to load in, work with and then potentially save the results.

Pandas has functionality to read and write from/to many different formats. A complete list is available here: http://pandas.pydata.org/pandas-docs/version/0.20/io.html. For economists, we are typically interested in working with:

* CSV files: `pd.read_csv()`, `DataFrame.to_csv()`
* Stata files: `pd.read_stata()`, `DataFrame.to_stata()`
* Excel files: `pd.read_excel()`, `DataFrame.to_excel()`

and possibly working with `sql` or `sas` data too.

Hopefully this course has emphasized that flat files, which are not tied to a particular language, are the preferred go-to, so we focus on the csv file functions. Exceptions to this rule are OK if your data is 'big', when you might have to look at sql, HDF5 or Google BigQuery structures, but we are getting way ahead of ourselves.

Most functions work similarly enough that focussing on csv-based read/write operations should be almost without loss of generality.



In [None]:
import pandas as pd
import glob

## Importing data with `read_csv`

Our github repository comes with a lot of data in it that we will want to use throughout the session. Let's look at one subdirectory that stores a bunch of data from the BLS:

In [None]:
print(glob.glob("data/bls-employment-state/*.csv")[:10])

and what does a particular file look like?

In [None]:
!head data/bls-employment-state/LAUST010000000000005.csv

Let's read in that file to our session using `read_csv`...

### The Simplest Import

In [None]:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv')
df

It worked. What is the type of object we just imported?

In [None]:
type(df)

For the future, we may not want to look at such a big excerpt of data - useful functions are then:

In [None]:
df.head()

In [None]:
df.tail()
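
Both functions accept the number of rows to show, and `shape` and `dtypes` give a quick summary without printing any data. The sketch below uses a small made-up table so that it runs without the repository data; the column names mirror the BLS file, but the values are invented for illustration.

```python
import io
import pandas as pd

# Made-up data for illustration only; the real BLS files are much longer.
csv_text = """state,year,period,value
Alabama,2016,M03,100
Alabama,2016,M02,110
Alabama,2016,M01,120
Alabama,2015,M12,130
Alabama,2015,M11,140
Alabama,2015,M10,150
"""

toy = pd.read_csv(io.StringIO(csv_text))

first_two = toy.head(2)    # first 2 rows instead of the default 5
last_three = toy.tail(3)   # last 3 rows

print(first_two)
print(last_three)
print(toy.shape)           # (number of rows, number of columns)
print(toy.dtypes)          # the type pandas inferred for each column
```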

### Selectively importing columns

Suppose we didn't want to import the `footnotes` column. Pandas allows us to specify the columns we want imported:

In [None]:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'])
df.head()
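
If it is easier to say which columns to drop than which to keep, `usecols` also accepts a function that is called with each column name and returns `True` for the columns to import. A minimal sketch on made-up data (the layout is assumed to resemble the BLS file, the values are invented):

```python
import io
import pandas as pd

# Made-up data for illustration only.
csv_text = """state,year,period,value,footnotes
Alabama,2016,M02,110,
Alabama,2016,M01,120,
"""

# Keep every column except 'footnotes'.
no_notes = pd.read_csv(io.StringIO(csv_text),
                       usecols=lambda name: name != "footnotes")

print(no_notes.columns.tolist())
```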

### Selectively importing rows

The default behaviour is to import any row from the file that contains text. This is not always what we want.

For example:
* sometimes a file contains a header or footer full of unnecessary text
* the whole data set may not fit into the computer's memory
* we only need certain data for our analysis

We can use some of `read_csv`'s functionality to help out here.

We can read in the first 20 rows:

In [None]:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'],
                nrows=20)
df

We can skip the earliest year, which sits in the last 12 rows of the file:

In [None]:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols=['state', 'year', 'period', 'value'],
                skipfooter=12,
                engine='python')  # skipfooter is not supported by the default C engine
df.tail()

We can skip the most recent year. Since `skiprows=13` also skips the header line, we pass the column names ourselves with `names`:

In [None]:
df = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                names=['state', 'year', 'period', 'value'],
                index_col=False,
                skiprows=13)
df.head()
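
An alternative to supplying `names` by hand is to give `skiprows` a list of line numbers, which leaves the header line (line 0) in place. The sketch below tries this on a small made-up file with three "months" per year, sorted most recent first like the BLS data; the values are invented, and the row counts are scaled down accordingly.

```python
import io
import pandas as pd

# Made-up data for illustration only: 3 rows per year, most recent first.
csv_text = """state,year,period,value
Alabama,2016,M03,100
Alabama,2016,M02,110
Alabama,2016,M01,120
Alabama,2015,M03,130
Alabama,2015,M02,140
Alabama,2015,M01,150
Alabama,2014,M03,160
Alabama,2014,M02,170
Alabama,2014,M01,180
"""

# Skip lines 1-3 (the most recent year) but keep line 0, the header.
skip_recent = pd.read_csv(io.StringIO(csv_text), skiprows=list(range(1, 4)))

# Drop the last 3 lines (the earliest year); skipfooter needs the python engine.
skip_oldest = pd.read_csv(io.StringIO(csv_text), skipfooter=3, engine="python")

# Combine skiprows and nrows to pick out a block in the middle (here: 2015).
only_2015 = pd.read_csv(io.StringIO(csv_text),
                        skiprows=list(range(1, 4)),
                        nrows=3)

print(skip_recent["year"].unique())
print(skip_oldest["year"].unique())
print(only_2015)
```

Counting rows like this only works when you know how the file is ordered, so it is worth checking the result with `head()` and `tail()`.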

#### Challenge

1. Read in the file `data/bls-employment-state/LAUST010000000000006.csv`, as `df2`
2. Read in the file `data/bls-employment-state/LAUST010000000000003.csv`, as `df3`, but only the years around the financial crisis, 2007-2010.

## Setting the Index

Notice how in `df`, pandas created a new index for us? We may want to set an index ourselves.

Let's do this, first by importing only the 2016 data for Alabama, and using the month as our index:

In [None]:
df_2016 = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols =['period', 'value'],
                index_col = ['period'],
                nrows=12)

In [None]:
df_2016

We can also have a 'multi-index' structure, where multiple columns define the index. For example, we can import the data for the years 2015 and 2016 as

In [None]:
df_recent = pd.read_csv('data/bls-employment-state/LAUST010000000000005.csv',
                usecols =['year', 'period', 'value'],
                index_col = ['period', 'year'],
                nrows=24)
df_recent
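
Once a multi-index is in place, rows can be selected by one or several index levels. The following is a minimal sketch on made-up data (two "months" for two years, invented values); the selection calls shown are ordinary pandas methods, but this is only one of several ways to slice a multi-index.

```python
import io
import pandas as pd

# Made-up data for illustration only.
csv_text = """state,year,period,value
Alabama,2016,M02,110
Alabama,2016,M01,120
Alabama,2015,M02,140
Alabama,2015,M01,150
"""

toy = pd.read_csv(io.StringIO(csv_text),
                  usecols=["year", "period", "value"],
                  index_col=["period", "year"])

print(list(toy.index.names))

# One observation: give a value for every index level as a tuple.
jan_2016 = toy.loc[("M01", 2016), "value"]

# All observations for one year: take a cross-section on the 'year' level.
all_2015 = toy.xs(2015, level="year")

# Turn the index levels back into ordinary columns.
flat = toy.reset_index()

print(jan_2016)
print(all_2015)
print(flat.columns.tolist())
```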

#### Challenge

Read in the file `data/bls-employment-state/LAUST010000000000003.csv`, as `df3`, but only the years around the financial crisis, 2007-2010. Set the index to consist of the state-month-year.

We have now covered the basics for importing data. If you want to know (much!) more, consult the documentation at http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_csv.html#pandas.read_csv. Warning: it's hard to get the hang of reading the pandas documentation, but over time you become accustomed to it.

## Exporting Data

Now let's turn to exporting data. Earlier we created the DataFrame `df_2016`:

In [None]:
df_2016

Suppose we want to export that data and save it to our project. The function `to_csv()` will do this for us:

In [None]:
df_2016.to_csv('out_data/alabama_2016.csv')

Let's verify that pandas did as expected:

In [None]:
!head out_data/alabama_2016.csv
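
By default `to_csv()` writes the index as the first column of the file, so reading the file back with a plain `read_csv()` turns the old index into an ordinary column. A minimal sketch of the round trip, using a made-up three-row frame and an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"value": [120, 110, 100]},
                   index=pd.Index(["M01", "M02", "M03"], name="period"))

# Write to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
toy.to_csv(buffer)
written = buffer.getvalue()
print(written)

# Read it back, telling pandas which column holds the index.
restored = pd.read_csv(io.StringIO(written), index_col="period")
print(restored)

# With index=False the index is left out of the output entirely.
no_index = toy.to_csv(index=False)
print(no_index)
```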

There are many options for the `to_csv()` function - many of which I don't find that useful. One I sometimes use is to change the column separator. Let's change to '&' separators when we write `df_recent` to file:

In [None]:
df_recent.to_csv('out_data/alabama_recent.csv', sep='&')
!head out_data/alabama_recent.csv

#### Challenge

1. Take your financial crisis data, and write it to file. Use tab separators (`'\t'`).
2. Reload your financial crisis data under the name 'df_financialCrisis' using `read_csv()`. What did you have to change to make the data import successful?  