## Reading and writing data with Pandas

files needed = (gdp_components.csv, debt.xlsx)

We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was simple, but it is almost never how we create DataFrames in the wild. 

Most data live in files, often as comma-separated values or as MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will review ways to get data into (and out of) Pandas. 

## Reading from your computer

Let's start by getting files from our own computers. We start by loading Pandas. We are also loading the os package. `os` means 'operating system' and it contains functions that help us navigate the file structure of our computers.   

In [1]:
import pandas as pd     # load the pandas package and call it pd
import os               # The package name is already short enough. No need to rename it. 

If you have not already, move the `gdp_components.csv` file to your U:\ drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Let's see.  

In [2]:
gdp = pd.read_csv('gdp_components.csv')       # read_csv is a part of Pandas, so we need the pd. 
print(type(gdp))                              # What have we got here?

<class 'pandas.core.frame.DataFrame'>


This looks successful. `read_csv()` takes a string with the file name and creates a DataFrame. Let's take a look at the data. 

In [3]:
print(gdp)

          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
0   1929-01-01    104.556    17.170     9.622     5.939     5.556
1   1930-01-01     92.160    11.428    10.273     4.444     4.121
2   1931-01-01     77.391     6.549    10.169     2.906     2.905
3   1932-01-01     59.522     1.819     8.946     1.975     1.932
4   1933-01-01     57.154     2.276     8.875     1.987     1.929
5   1934-01-01     66.800     4.296    10.721     2.561     2.239
6   1935-01-01     74.241     7.370    11.151     2.769     2.982
7   1936-01-01     84.830     9.391    13.398     3.007     3.154
8   1937-01-01     93.003    12.967    13.119     4.039     3.961
9   1938-01-01     87.352     7.944    14.170     3.811     2.845
10  1939-01-01     93.437    10.229    15.165     3.969     3.136
11  1940-01-01    102.899    14.579    15.562     4.897     3.426
12  1941-01-01    129.309    19.369    27.836     5.482     4.449
13  1942-01-01    165.952    11.762    65.440     4.375     4.627
14  1943-0

Even though Jupyter Notebook hid rows 30-58, this is still a bit obnoxious. We can use the `head()` and `tail()` methods of DataFrame to peek at just the first or last few rows. 

In [4]:
print( gdp.head(4) )            # Show the first 4 rows.

         DATE     GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
0  1929-01-01  104.556  17.170   9.622   5.939   5.556
1  1930-01-01   92.160  11.428  10.273   4.444   4.121
2  1931-01-01   77.391   6.549  10.169   2.906   2.905
3  1932-01-01   59.522   1.819   8.946   1.975   1.932


If you do not pass `head()` or `tail()` an argument, it defaults to 5 rows. 

In [5]:
print( gdp.tail() )

          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
84  2013-01-01  16784.851  2826.013  3132.409  2273.428  2764.210
85  2014-01-01  17521.747  3038.931  3167.041  2371.027  2879.284
86  2015-01-01  18219.297  3211.971  3234.210  2265.047  2786.461
87  2016-01-01  18707.189  3169.887  3290.979  2217.576  2738.146
88  2017-01-01  19485.394  3367.965  3374.444  2350.175  2928.596


The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?

In [6]:
gdp_new_index = gdp.set_index('DATE')   # We could use 'inplace = True' if we didn't need a copy.

print(gdp_new_index.head())

               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929


We can also set the index as we read in the file. Let's take a look at the read_csv() function.

In [7]:
pd.read_csv?

I'm seeing a lot of good stuff here. `index_col`, `usecols`, `header`, `sep`,...some stuff I don't know about, too. When reading in messy files, these extra arguments may come in handy. 

Let's give `index_col` a try. 

In [8]:
gdp_2 = pd.read_csv('gdp_components.csv', index_col = 0)    # Treat the CSV like a DataFrame. Count cols starting with 0

In [9]:
gdp_2.head()

               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
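
`index_col` also accepts a column name in place of a position. Here is a minimal sketch that compares the three routes we now have for making DATE the index. It reads a few rows copied from the output above out of a string (via `io.StringIO`), so it runs without the file on disk; the variable names are just for illustration.

```python
import io
import pandas as pd

# A few rows copied from the output above, held in a string so the sketch
# runs without gdp_components.csv on disk.
csv_text = """DATE,GDPA,GPDIA,GCEA,EXPGSA,IMPGSA
1929-01-01,104.556,17.170,9.622,5.939,5.556
1930-01-01,92.160,11.428,10.273,4.444,4.121
1931-01-01,77.391,6.549,10.169,2.906,2.905
"""

by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)       # first column, by position
by_name = pd.read_csv(io.StringIO(csv_text), index_col='DATE')      # same column, by label
after_read = pd.read_csv(io.StringIO(csv_text)).set_index('DATE')   # read first, set the index second

# The three routes should give the same DataFrame.
print(by_position.equals(by_name))
print(by_position.equals(after_read))
```

Using the column name is a bit more robust if the column order in the file ever changes.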


### Navigating your file structure
We dumped our file into our **current working directory** so we could just ask for the file name `gdp_components.csv` in `read_csv()`. What is our current working directory (cwd)?

In [10]:
path_to_cwd = os.getcwd()           # getcwd() is part of the os package we imported earlier
print(path_to_cwd)

U:\2019F_Econ_690\2_Pandas


When we gave `read_csv()` `gdp_components.csv`, it looked in our cwd for the file. Let's try something more complicated. In your cwd, create a new folder called Data_Files. Make a copy of the gdp_components file and paste it into the Data_Files folder. Rename the file `gdp_components_moved.csv`.

In [11]:
gdp_moved = pd.read_csv('gdp_components_moved.csv')

FileNotFoundError: [Errno 2] File b'gdp_components_moved.csv' does not exist: b'gdp_components_moved.csv'

Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass `read_csv()` the *path* to the file. The path is the hierarchy of folders that contains the file. In my case, the path is 

U:\2019F_Econ_690\2_Pandas\Data_Files

Note that there is a  `\` each time we list a new folder. 

When we specify a file path, we need to [escape](https://en.wikipedia.org/wiki/Escape_character) the `\` by using a second backslash in front of it.  If you are using a Mac, you need to use the forward slash `/`.

In [12]:
gdp_moved = pd.read_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_components_moved.csv')
gdp_moved.head()

         DATE     GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
0  1929-01-01  104.556  17.170   9.622   5.939   5.556
1  1930-01-01   92.160  11.428  10.273   4.444   4.121
2  1931-01-01   77.391   6.549  10.169   2.906   2.905
3  1932-01-01   59.522   1.819   8.946   1.975   1.932
4  1933-01-01   57.154   2.276   8.875   1.987   1.929
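
If the doubled backslashes get tiresome, a raw string is another option: putting an `r` in front of the opening quote tells Python not to treat `\` as an escape character. A small sketch, using the folder path from above:

```python
# Two spellings of the same Windows path.
escaped = 'U:\\2019F_Econ_690\\2_Pandas\\Data_Files'   # each backslash doubled
raw = r'U:\2019F_Econ_690\2_Pandas\Data_Files'         # raw string: backslashes taken literally

print(escaped == raw)   # do both spell the same path?
print(raw)
```

Either spelling can be handed to `read_csv()`. One catch with raw strings: they cannot end in a single backslash.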


We could have manipulated some strings to get to this, too. This approach might be useful if you needed to read in many files from the same place. (Maybe using a for loop and a list of file names?) 

In [13]:
path_to_cwd = os.getcwd()
file_name = 'gdp_components_moved.csv'
path_to_data_file = path_to_cwd + '\\Data_Files\\' +  file_name  #Note the double \ characters
print(path_to_data_file)

U:\2019F_Econ_690\2_Pandas\Data_Files\gdp_components_moved.csv


In [14]:
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()

               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
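
Here is a rough sketch of the "for loop and a list of file names" idea. It swaps the string concatenation for `os.path.join()`, which inserts the separator that suits the operating system, so the same code works on Windows and on a Mac. The folder is a temporary stand-in for Data_Files, and the file names `file_a.csv` and `file_b.csv` are made up so the example can run anywhere.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the Data_Files folder: a temporary directory holding two small,
# hypothetical CSV files.
data_dir = tempfile.mkdtemp()
file_names = ['file_a.csv', 'file_b.csv']   # hypothetical file names
for name in file_names:
    sample = pd.DataFrame({'DATE': ['1929-01-01', '1930-01-01'],
                           'GDPA': [104.556, 92.160]})
    sample.to_csv(os.path.join(data_dir, name), index=False)

# Read every file in the list into a dict keyed by file name.
frames = {}
for name in file_names:
    path_to_file = os.path.join(data_dir, name)   # no backslashes to escape
    frames[name] = pd.read_csv(path_to_file, index_col=0)

print(frames['file_a.csv'])
```

A dict is only one way to hold the results; a list of DataFrames would work just as well if the file names don't matter later.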


### Practice: Reading CSVs
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

1. Try out the `to_csv()` method of DataFrame. Save `gdp_moved` as 'gdp_moved_2.csv' in your cwd. \[You can use `?` if you need help.\]


In [15]:
gdp_moved.to_csv('gdp_moved_2.csv')

2. Use `to_csv()` again to save `gdp_moved` to the Data_Files folder. Name it 'gdp_moved_3.csv'

In [18]:
gdp_moved.to_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_moved_3.csv')

Are your files in the correct places? 
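
We can also check from Python instead of opening the file explorer. This is a small sketch that uses `os.path.exists()` and `os.listdir()`; the temporary folder stands in for Data_Files and the two-row DataFrame stands in for `gdp_moved`.

```python
import os
import tempfile
import pandas as pd

# A two-row stand-in for gdp_moved, with the dates as the index.
df = pd.DataFrame({'GDPA': [104.556, 92.160]},
                  index=pd.Index(['1929-01-01', '1930-01-01'], name='DATE'))

out_dir = tempfile.mkdtemp()                          # stand-in for the Data_Files folder
out_path = os.path.join(out_dir, 'gdp_moved_3.csv')
df.to_csv(out_path)                                   # the index is written as the first column

print(os.path.exists(out_path))                       # is the file where we expect it?
print(os.listdir(out_dir))                            # what is in that folder?
```

Because `to_csv()` writes the index as the first column by default, reading the file back with `index_col=0` recovers the dates as the index.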

Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that. 

3. Use gdp_moved_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.

In [19]:
gdp_growth = pd.read_csv('U:\\2019F_Econ_690\\2_Pandas\\Data_Files\\gdp_moved_3.csv', index_col=0)
print( gdp_growth.head(10) )


               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
1934-01-01   66.800   4.296  10.721   2.561   2.239
1935-01-01   74.241   7.370  11.151   2.769   2.982
1936-01-01   84.830   9.391  13.398   3.007   3.154
1937-01-01   93.003  12.967  13.119   4.039   3.961
1938-01-01   87.352   7.944  14.170   3.811   2.845


4. Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'

In [20]:
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())

                gdp   GPDIA     gov  EXPGSA  IMPGSA
DATE                                               
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929


## Reading Excel spreadsheets
Reading spreadsheets isn't much different than reading csv files. But, since workbooks are more complicated than csv files, we have a few more options to consider. 

If you haven't already, copy over 'debt.xlsx' to your cwd. Let's open it in Excel and have a look at it...

There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.

In [21]:
debt = pd.read_excel('debt.xlsx')
debt

Unnamed: 0,FRED Graph Observations,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,Federal Reserve Economic Data,,,
1,Link: https://fred.stlouisfed.org,,,
2,Help: https://fred.stlouisfed.org/help-faq,,,
3,Economic Research Division,,,
4,Federal Reserve Bank of St. Louis,,,
5,,,,
6,GDPA,"Gross Domestic Product, Billions of Dollars, A...",,
7,GFDEBTN,"Federal Debt: Total Public Debt, Millions of D...",,
8,DGS10,"10-Year Treasury Constant Maturity Rate, Perce...",,
9,,,,


In [22]:
# Use the header to specify the row to use as the column names. (zero based, as usual)

debt = pd.read_excel('debt.xlsx', header = 12)

print(debt.head())
print('\n')
print(debt.tail())

  observation_date     GDPA  GFDEBTN  DGS10
0       1929-01-01  104.556      NaN    NaN
1       1930-01-01   92.160      NaN    NaN
2       1931-01-01   77.391      NaN    NaN
3       1932-01-01   59.522      NaN    NaN
4       1933-01-01   57.154      NaN    NaN


   observation_date       GDPA      GFDEBTN     DGS10
85       2014-01-01  17521.747  17799837.00  2.539560
86       2015-01-01  18219.297  18344212.75  2.138287
87       2016-01-01  18707.189  19549200.50  1.837440
88       2017-01-01  19485.394  20107155.25  2.329480
89       2018-01-01        NaN          NaN       NaN


That's looking good. Notice that Pandas added NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of `read_csv()`, too.
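
To see `header` at work in `read_csv()`, here is a minimal sketch with a made-up text file that imitates the layout of the workbook: two junk lines, then the column names, then data containing an empty cell and an `#N/A`. The text is invented for illustration (only the numbers are taken from the output above), and it is read from a string so no file is needed.

```python
import io
import pandas as pd

# Two junk lines, then the real header on line 2 (counting from zero).
raw_text = """FRED Graph Observations
Federal Reserve Economic Data
observation_date,GDPA,GFDEBTN
1929-01-01,104.556,#N/A
1930-01-01,92.160,
2017-01-01,19485.394,20107155.25
"""

sample = pd.read_csv(io.StringIO(raw_text), header=2)

print(sample)
print(sample.isna().sum())   # number of missing values in each column
```

Both the empty cell and the `#N/A` string should come back as NaN, since `#N/A` is on the list of strings that `read_csv()` treats as missing by default.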

We didn't specify which sheet in the workbook to load, so Pandas took the first one. We can ask for sheets by name. 

In [23]:
debt_q = pd.read_excel('debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q.head())
print('\n')
print(debt_q.tail())

  observation_date  GFDEBTN  DGS10      GDP
0       1947-01-01      NaN    NaN  243.164
1       1947-04-01      NaN    NaN  245.968
2       1947-07-01      NaN    NaN  249.585
3       1947-10-01      NaN    NaN  259.745
4       1948-01-01      NaN    NaN  265.742


    observation_date     GFDEBTN     DGS10        GDP
281       2017-04-01  19844554.0  2.260952  19359.123
282       2017-07-01  20244900.0  2.241429  19588.074
283       2017-10-01  20492747.0  2.371452  19831.829
284       2018-01-01  21089643.0  2.758525  20041.047
285       2018-04-01  21195070.0  2.920625  20411.924


We can ask for just a subset of the columns when reading in a file (csv or xlsx). Use the `usecols` argument. This takes a list of integer positions or column names; `read_excel()` also accepts a string of Excel column letters such as `'A,C'`. 

In [24]:
# Take the first and third columns of sheet 'quarterly'

interest_rates = pd.read_excel('debt.xlsx', header=12,  sheet_name='quarterly', usecols=[0,2])  
interest_rates.head()

  observation_date  DGS10
0       1947-01-01    NaN
1       1947-04-01    NaN
2       1947-07-01    NaN
3       1947-10-01    NaN
4       1948-01-01    NaN
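
The same argument works in `read_csv()`. A minimal sketch, using the last three quarterly rows shown earlier and reading them from a string, that picks the columns first by position and then by name:

```python
import io
import pandas as pd

# Three quarterly rows copied from the tail() output above.
csv_text = """observation_date,GFDEBTN,DGS10,GDP
2017-10-01,20492747.0,2.371452,19831.829
2018-01-01,21089643.0,2.758525,20041.047
2018-04-01,21195070.0,2.920625,20411.924
"""

by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
by_name = pd.read_csv(io.StringIO(csv_text), usecols=['observation_date', 'DGS10'])

print(by_position)
print(by_position.equals(by_name))   # do the two selections match?
```

Selecting by name makes the code easier to read later, at the cost of breaking if someone renames a column in the file.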


### Practice: Reading Excel
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too.

1. Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, gdp, and GFDEBTN. Name your new DataFrame `fed_debt`.

In [26]:
fed_debt = pd.read_excel('debt.xlsx', header=12,  sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head()

                  GFDEBTN      GDP
observation_date
1947-01-01            NaN  243.164
1947-04-01            NaN  245.968
1947-07-01            NaN  249.585
1947-10-01            NaN  259.745
1948-01-01            NaN  265.742


2. Oops, I wanted to set the observation_date to the index. Go back and add that to your solution to 1. 
3. What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'

In [27]:
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()

                  DEBT      GDP
observation_date
1947-01-01         NaN  243.164
1947-04-01         NaN  245.968
1947-07-01         NaN  249.585
1947-10-01         NaN  259.745
1948-01-01         NaN  265.742


4. Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly.

In [29]:
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']
print(fed_debt.head())

                  DEBT      GDP  debt_ratio
observation_date                           
1947-01-01         NaN  243.164         NaN
1947-04-01         NaN  245.968         NaN
1947-07-01         NaN  249.585         NaN
1947-10-01         NaN  259.745         NaN
1948-01-01         NaN  265.742         NaN


There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data. 

5. Summarize the debt_ratio variable. What is its max level? Its min?

In [30]:
print(fed_debt['debt_ratio'].describe())

count    210.000000
mean       0.564994
std        0.227520
min        0.306033
25%        0.355102
50%        0.555767
75%        0.641648
max        1.052562
Name: debt_ratio, dtype: float64
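
Notice that `count` is 210 even though the quarterly sheet has 286 rows. Arithmetic with a NaN gives a NaN, and summary methods such as `count()`, `max()`, and `describe()` skip NaNs by default. Here is a two-row sketch of that behavior, using the 1929 and 2017 values from the annual sheet shown earlier; the DataFrame name `ratios` is just for illustration.

```python
import numpy as np
import pandas as pd

# Two years from the annual sheet: debt is missing in 1929.
ratios = pd.DataFrame({'DEBT': [np.nan, 20107155.25],     # millions of dollars
                       'GDP': [104.556, 19485.394]},       # billions of dollars
                      index=['1929-01-01', '2017-01-01'])

# Convert debt to billions before dividing. The NaN carries through the arithmetic.
ratios['debt_ratio'] = (ratios['DEBT'] / 1000) / ratios['GDP']

print(ratios)
print(ratios['debt_ratio'].count())   # counts only the non-missing values
print(ratios['debt_ratio'].max())     # NaN is skipped, not treated as a number
```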


## Reading from the internet
We can pass URLs to the `read` functions, too. 

In [31]:
# Read in the penn world table data
url = "http://www.rug.nl/ggdc/docs/pwt81.xlsx"
pwt = pd.read_excel(url, sheet_name= "Data")
pwt.head()

Unnamed: 0,countrycode,country,currency_unit,year,rgdpe,rgdpo,pop,emp,avh,hc,...,csh_g,csh_x,csh_m,csh_r,pl_c,pl_i,pl_g,pl_x,pl_m,pl_k
0,AGO,Angola,Kwanza,1950,,,,,,,...,,,,,,,,,,
1,AGO,Angola,Kwanza,1951,,,,,,,...,,,,,,,,,,
2,AGO,Angola,Kwanza,1952,,,,,,,...,,,,,,,,,,
3,AGO,Angola,Kwanza,1953,,,,,,,...,,,,,,,,,,
4,AGO,Angola,Kwanza,1954,,,,,,,...,,,,,,,,,,


That took a few seconds --- this is a pretty big file. 


In [33]:
# Data from McKinney's book. Each file contains baby name counts for a year. 
baby_url = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames'

# What was trendy in 1880?
old_names = pd.read_csv(baby_url + '//yob1880.txt')
old_names.head()
    

        Mary  F  7065
0       Anna  F  2604
1       Emma  F  2003
2  Elizabeth  F  1939
3     Minnie  F  1746
4   Margaret  F  1578


We've lost Mary, which looks pretty popular. What happened? 

We can specify no header with the `None` keyword

In [38]:
old_names = pd.read_csv(baby_url + '//yob1880.txt', header=None)
old_names.head()

           0  1     2
0       Mary  F  7065
1       Anna  F  2604
2       Emma  F  2003
3  Elizabeth  F  1939
4     Minnie  F  1746
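
Column names `0`, `1`, and `2` are not very informative. One option is to supply names while reading, with the `names` argument of `read_csv()`. A minimal sketch, using the first three rows above read from a string; the column names are my own choice:

```python
import io
import pandas as pd

# The first three lines of the 1880 file, as shown above. There is no header row.
text = """Mary,F,7065
Anna,F,2604
Emma,F,2003
"""

names_1880 = pd.read_csv(io.StringIO(text), header=None,
                         names=['name', 'sex', 'count'])   # names chosen for illustration
print(names_1880)
```

Renaming afterward with `rename()` gets to the same place; passing `names` just saves a step.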


### Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. The TA and I are here, too.

### Baby names

1. Get the baby name data for 2009. Call the DataFrame `new_names`. Give the columns some reasonable names.



In [39]:

new_names = pd.read_csv(baby_url + '//yob2009.txt', header=None)
print(new_names.head())

new_names.rename(columns={0:'name', 1:'sex', 2:'count'}, inplace=True)
print('\n', new_names.head())

          0  1      2
0  Isabella  F  22222
1      Emma  F  17830
2    Olivia  F  17374
3    Sophia  F  16869
4       Ava  F  15826

        name sex  count
0  Isabella   F  22222
1      Emma   F  17830
2    Olivia   F  17374
3    Sophia   F  16869
4       Ava   F  15826


2. What are the two most popular female names in 2009? You might try the `sort_values()` method of DataFrame.

In [41]:

print('Shape of new_names:', new_names.shape)

new_names_female=new_names[new_names['sex']=='F']
print('Shape of new_names_female:', new_names_female.shape)

new_names_female_sorted = new_names_female.sort_values(by=['count'], ascending=False)
print(new_names_female_sorted.head())

Shape of new_names: (34602, 3)
Shape of new_names_female: (20123, 3)
       name sex  count
0  Isabella   F  22222
1      Emma   F  17830
2    Olivia   F  17374
3    Sophia   F  16869
4       Ava   F  15826


3. What are the two least popular female names in 2009?

In [43]:
print(new_names_female_sorted.tail())

           name sex  count
18190   Giannie   F      5
18191  Giavonni   F      5
18192    Gibson   F      5
18193     Gilda   F      5
20122    Zyriel   F      5
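
Every name in that tail has a count of 5, so "the two least popular" is not really well defined: many names tie at the bottom. Here is a small sketch, on six rows taken from the outputs above, that pulls the top two with `nlargest()` (an alternative to sorting and calling `head()`) and counts how many names tie for last.

```python
import pandas as pd

# Six rows taken from the head() and tail() outputs above.
female = pd.DataFrame({'name': ['Isabella', 'Emma', 'Olivia', 'Giannie', 'Gilda', 'Zyriel'],
                       'sex': ['F'] * 6,
                       'count': [22222, 17830, 17374, 5, 5, 5]})

top_two = female.nlargest(2, 'count')    # the two rows with the largest counts
print(top_two)

# How many names share the smallest count?
tied_at_min = female[female['count'] == female['count'].min()]
print(len(tied_at_min))
```

Note that we write `female['count']` with brackets. `female.count` would give us the DataFrame's `count()` method instead of the column.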


### PISA Scores

1. In a web browser, go to [dx.doi.org/10.1787/888932937035](http://dx.doi.org/10.1787/888932937035) This should initiate a download of an Excel file with PISA scores across countries. This is a bit of a mess.

2. Use the `read_excel()` function to create a DataFrame with mean scores in math, reading, and science. Do not set an index yet. There is some junk at the bottom of the spreadsheet. Try the `skipfooter` argument.  


In [44]:

url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
                     skiprows=18,             # skip the first 18 rows
                     skipfooter=7,            # skip the last 7
                     usecols=[0,1,9,13],   # select columns of interest
                     #index_col=0,             # set the index as the first column
                     #header=[0,1]             # set the variable names
                     )

pisa


        Unnamed: 0              Mathematics                  Reading                  Science
0              NaN  Mean score in PISA 2012  Mean score in PISA 2012  Mean score in PISA 2012
1              NaN                      NaN                      NaN                      NaN
2     OECD average                  494.046                  496.463                   501.16
3              NaN                      NaN                      NaN                      NaN
4   Shanghai-China                  612.676                  569.588                  580.118
5        Singapore                  573.468                  542.216                  551.493
6  Hong Kong-China                  561.241                    544.6                  554.937
7   Chinese Taipei                  559.825                  523.119                  523.315
8            Korea                  553.767                   535.79                  537.788
9      Macao-China                  538.134                  508.949                  520.571


3. Clean up your DataFrame. Drop rows that have NaNs. 

In [45]:
pisa2 = pisa.dropna()
pisa2

         Unnamed: 0  Mathematics  Reading  Science
2      OECD average      494.046  496.463   501.16
4    Shanghai-China      612.676  569.588  580.118
5         Singapore      573.468  542.216  551.493
6   Hong Kong-China      561.241    544.6  554.937
7    Chinese Taipei      559.825  523.119  523.315
8             Korea      553.767   535.79  537.788
9       Macao-China      538.134  508.949  520.571
10            Japan      536.407  538.051  546.736
11    Liechtenstein      534.965  515.522  524.695
12      Switzerland      530.931   509.04  515.298


4. Make the country names the index.

In [46]:
pisa2.set_index('Unnamed: 0', inplace=True)
pisa2

                 Mathematics  Reading  Science
Unnamed: 0
OECD average         494.046  496.463   501.16
Shanghai-China       612.676  569.588  580.118
Singapore            573.468  542.216  551.493
Hong Kong-China      561.241    544.6  554.937
Chinese Taipei       559.825  523.119  523.315
Korea                553.767   535.79  537.788
Macao-China          538.134  508.949  520.571
Japan                536.407  538.051  546.736
Liechtenstein        534.965  515.522  524.695
Switzerland          530.931   509.04  515.298


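5. How do the United States' scores compare with the OECD average? Compute the ratio for each subject.

In [47]: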
print(pisa2.loc['United States']/pisa2.loc['OECD average'])

Mathematics       0.974335
Reading            1.00225
Science           0.992517
dtype: object


6. Challenging. How correlated are PISA math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corrs.xlsx'

This is a challenging question because, depending on how you read in the data, your columns are probably of type 'object' and corr() won't work. Google around and see if you can convert the three columns to numbers. Then find the correlations. 

In [49]:
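pisa2.columns = ['math', 'read', 'sci']  # rename the columns to something short
pisa2 = pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)  # convert columns to numeric
corrs = pisa2.corr()                 # correlate
corrs.to_excel('pisa_corrs.xlsx')    # write the correlation matrix to a file, as the question asks
corrs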


          math      read       sci
math  1.000000  0.959806  0.972131
read  0.959806  1.000000  0.978559
sci   0.972131  0.978559  1.000000
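
The conversion step can be tried on its own. This sketch builds a tiny DataFrame by hand, with the scores stored as text and a leftover junk row, then converts it with `pd.to_numeric`. The `errors='coerce'` option turns anything that is not a number into NaN, so the junk row can be dropped afterward. The row label `'junk'` and the all-text columns are assumptions made for the example; the scores are the first three rows from the output above.

```python
import pandas as pd

# Scores stored as strings, plus a leftover text row like the one in the raw sheet.
scores = pd.DataFrame({'math': ['Mean score in PISA 2012', '494.046', '612.676', '573.468'],
                       'read': ['Mean score in PISA 2012', '496.463', '569.588', '542.216'],
                       'sci':  ['Mean score in PISA 2012', '501.16', '580.118', '551.493']},
                      index=['junk', 'OECD average', 'Shanghai-China', 'Singapore'])
print(scores.dtypes)    # not numeric yet

# Convert column by column. Text that is not a number becomes NaN, then gets dropped.
numeric = scores.apply(pd.to_numeric, errors='coerce').dropna()
print(numeric.dtypes)   # now floats
print(numeric.corr())
```

In the solution above we had already dropped the junk rows, so plain `pd.to_numeric` was enough. `errors='coerce'` is handy when they are still around.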
