# 1. Lecture Overview



- Downloading data using packages
    - yfinance
    - datareader
    
    
- Saving and loading data to/from various file formats
    - delimited (txt, csv)
        - as applications, we open the "compa" and "crspm" files that we will be using extensively over the entire course
    - MS Excel (xlsx)
        - additional package required: openpyxl (or xlsxwriter)
    - proprietary: Python (pkl), SAS (sas7bdat), Stata (dta), Matlab (mat), etc
        - we only showcase Python and Stata 
        
        
- Big data solutions
    - reading (and processing) datasets in chunks
    - using HDF5 files for better performance


# 2. Downloading data using Python packages

## 2.1. Preliminaries

We first need to install the "pandas_datareader" package by typing "pip install pandas_datareader" in the Anaconda Prompt.

In [None]:
# Import libraries
import pandas as pd
import yfinance as yf
import pandas_datareader as pdr

## 2.2. The DataReader package

The **pandas-datareader** package can be used to obtain data from many different sources. See the link in the Resources section below for documentation of all its capabilities.

In this lecture we will only use it to get data on the CPI from the St. Louis FRED database, and data on the 1-month T-bill from Ken French's database.

### 2.2.1. Downloading macro data from St. Louis FRED

- From the St. Louis FRED, download monthly data on the CPI for all urban consumers (CPIAUCSL). 
    - Plot the CPI
    - Calculate the percentage change in the CPI each month, to obtain the inflation rate
    - Plot the inflation rate

In [None]:
# Get CPI data from FRED
cpi = pdr.DataReader('CPIAUCSL','fred','1970-01-01','2020-12-31')
print(cpi)
cpi.plot();

In [None]:
# Calculate inflation rate
infl = cpi.pct_change()
print(infl)
infl.plot();
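
To make the inflation calculation concrete, the sketch below applies the same `pct_change()` step to a small made-up series (the numbers are hypothetical, not FRED data) and checks it against the formula written out by hand.

```python
import pandas as pd

# Hypothetical CPI levels (made-up numbers, not downloaded from FRED)
cpi_toy = pd.DataFrame(
    {"CPIAUCSL": [100.0, 101.0, 102.01, 102.01]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
)

# pct_change() computes x_t / x_{t-1} - 1 for every row
infl_toy = cpi_toy.pct_change()

# The same calculation written out explicitly with shift()
infl_manual = cpi_toy / cpi_toy.shift(1) - 1

print(infl_toy)
```

The first row is missing because there is no previous month to compare against, so the inflation series is always one observation shorter than the CPI series.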

### 2.2.2. Downloading data from the Ken French database

- From the Ken French database, download monthly data on the risk-free rate (1-month T-bill rate).
    - Use the "pandas-datareader" package
    - This is the "RF" column in the "F-F_Research_Data_Factors" database
    - The rate is expressed in percentage points so you will have to divide it by 100
    - Plot the resulting risk-free rate


In [None]:
# Load the names of all the available datasets from Ken French database
pdr.famafrench.get_available_datasets()[0:10] #print just the top 10

In [None]:
# Download the monthly Fama French factors (first item in the list)
ff3f = pdr.DataReader('F-F_Research_Data_Factors', 'famafrench','1970-01-01')
ff3f
print(ff3f['DESCR'])

In [None]:
# Extract only the monthly table
monthly_dat = ff3f[0]
monthly_dat

In [None]:
# Extract only the monthly risk-free rate
rfdat = monthly_dat['RF']
rfdat = rfdat/100
print(rfdat)
rfdat.plot();
print(rfdat.index)
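
The last line above prints the index so we can see what kind of dates we are dealing with. The sketch below assumes the index is a monthly `PeriodIndex` (check the printed output to confirm this for your data) and shows one way to convert it to end-of-month timestamps, which is convenient when combining it with other date-indexed data. The series `rf_toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical risk-free rates in percentage points, indexed by monthly periods
rf_toy = pd.Series(
    [0.25, 0.30, 0.28],
    index=pd.period_range("2020-01", periods=3, freq="M"),
    name="RF",
)

# Convert from percentage points to decimals, as in the cell above
rf_toy = rf_toy / 100

# Convert the monthly periods to end-of-month dates (midnight of the last day)
rf_ts = rf_toy.copy()
rf_ts.index = rf_toy.index.to_timestamp(how="end").normalize()

print(rf_ts)
```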

## 2.3. The yfinance package

We already used this package in the previous lecture, and, going forward, we will not use much more of its functionality other than downloading monthly data on a few tickers (as below). Please see the link in the Resources section below to explore more of its capabilities.

One important thing to keep in mind about the Yahoo Finance data is that, for individual stocks, you should use "Adj Close" prices to calculate returns (as in the previous lecture), but for indexes you should use the "Close" values (as below).

**Application: Performance of main asset classes**

- From Yahoo Finance, download monthly data on the SPDR S&P 500 ETF, the SPDR Gold Shares ETF, and BlackRock's long-term (20+ years) treasury ETF (tickers: SPY, GLD, TLT respectively). 
    - Download data from 2004 to 2020
    - Convert these data to monthly returns
    - Calculate the returns on an equal-weighted portfolio of the 3 asset classes
    - Plot all the monthly returns on the same graph
    - Calculate rolling compounded returns 
    - Plot the rolling compounded returns on the same graph

In [None]:
# Download Yahoo Finance data
yfdat = yf.download(tickers = ['SPY', 'GLD', 'TLT'], 
                    start = '2002-01-01', end = '2020-12-31',
                    interval = '1mo')

# Always inspect the structure of your data when you are not familiar with the dataset
print(yfdat.index,'\n\n')    # These are like the line numbers in an excel sheet but much more general
print(yfdat.columns, '\n\n') # These are column names, but they can have multiple parts
print(yfdat)

In [None]:
# Keep only the 'Close' values and drop missing values
yfdat = yfdat['Close'].dropna()
yfdat

In [None]:
# Plot evolution of asset classes during the sample
yfdat.plot();

In [None]:
# Calculate monthly returns
yfret = yfdat.pct_change()

# Add column with returns of equal-weighted portfolio
yfret['EW_Portf'] = yfret.mean(axis = 1)

# Plot rolling compounded returns
(1+yfret).cumprod().plot(); 
    #note that Python will evaluate the above expression from left to right
    #so we can chain instructions on the same line of code in the order that we want them executed
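
To see what the chained expression does, here is a minimal sketch on made-up prices for two hypothetical assets, "A" and "B". It repeats the steps above (returns, equal-weighted column, compounding) so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical month-end prices for two made-up assets
prices = pd.DataFrame(
    {"A": [100.0, 110.0, 121.0], "B": [50.0, 50.0, 55.0]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
)

# Monthly returns (first row is missing because there is no prior price)
rets = prices.pct_change()

# Equal-weighted portfolio: the cross-sectional average return each month
rets["EW_Portf"] = rets.mean(axis=1)

# Compounded growth of $1 invested at the start of the sample
growth = (1 + rets).cumprod()

print(growth)
```

For a single asset, the compounded value is simply the last price divided by the first price (1.21 for asset "A"). Taking the mean across columns every month means the portfolio holds equal weights at the start of each month, so it is implicitly rebalanced monthly.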

# 3. Saving and loading data from various file formats

## 3.1. Reading and writing .csv and .txt files with "read_csv" and "to_csv"

The WRDS data files under the Datasets tab in D2L ("crspm" and "compa") were saved as tab-delimited txt files. The "read_csv" pandas function can read basically every type of delimited file, as long as you specify what the delimiter is (comma, space, tab), as below.

See the "read_csv" link in the "Resources" section below for a description of the full functionality of "read_csv".

**Application: Saving and loading Pandas dataframes using .csv files**

In [None]:
# Save the "yfdat" dataset to a csv file
yfdat.to_csv("./L05_yfdat.csv")

In [None]:
# Load the data from the csv file we created above 
yfdat_load1 = pd.read_csv("./L05_yfdat.csv")
yfdat_load1

In [None]:
# Load the saved file, this time specifying header and index
yfdat_load2 = pd.read_csv("./L05_yfdat.csv",
                         index_col = [0],
                         header = [0])
yfdat_load2
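
The difference between the two loads above is easier to see on a tiny example. The sketch below writes a made-up dataframe to an in-memory text buffer (used here instead of a file on disk) and reads it back twice. The `parse_dates` argument is an addition not used above; it asks pandas to turn the index back into dates rather than leaving it as text.

```python
import io
import pandas as pd

# Made-up data with a date index, similar in shape to yfdat
df = pd.DataFrame(
    {"GLD": [1.0, 2.0], "SPY": [3.0, 4.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01"]),
)
df.index.name = "Date"

# Write the csv to an in-memory buffer instead of a file
buf = io.StringIO()
df.to_csv(buf)

# Naive load: the dates come back as an ordinary column
buf.seek(0)
naive = pd.read_csv(buf)

# Load again, telling pandas which column is the index and to parse it as dates
buf.seek(0)
restored = pd.read_csv(buf, index_col=0, parse_dates=True)

print(naive, "\n\n")
print(restored)
```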

**Application: Loading the CRSPM (.txt) file**

In [None]:
# Load CRSP data from WRDS (crspm file)
crsp = pd.read_csv("./crspm.zip",   # the "./" means the crspm file is in the same folder as these lecture notes
                   sep = '\t',      # specifies that the crspm file is tab delimited
                   usecols = ['PERMNO', 'date', 'RET'],  #allows us to select only a subset of all the columns
                   low_memory = False) # specifies that we want to read the whole file in one chunk

In [None]:
# Examine the dataset
print(crsp, '\n\n')
print(crsp.dtypes)

**Application: Loading the COMPA (.zip) file**

In [None]:
# Load the Compustat annual file (compa), which is stored as a zip archive
comp = pd.read_csv("./compa.zip",   # pandas infers the zip compression from the ".zip" extension
                   sep = '\t',         
                   usecols = ['LPERMNO', 'datadate', 'at'],  
                   low_memory = False) 

In [None]:
# Examine the dataset
print(comp, '\n\n')
print(comp.dtypes)
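
If you do not have the WRDS files at hand, the same options can be tried on a small stand-in. The sketch below writes a made-up, tab-delimited, zipped file to a temporary folder and reads back a subset of its columns. The file names and values are hypothetical, and the date conversion at the end assumes dates are stored as YYYYMMDD integers, which you should verify with `dtypes` on the real files.

```python
import os
import tempfile
import pandas as pd

# Made-up data standing in for a WRDS extract (not real CRSP values)
toy = pd.DataFrame({
    "PERMNO": [10001, 10001, 10002],
    "date": [20200131, 20200228, 20200131],
    "RET": [0.01, -0.02, 0.03],
    "PRC": [10.0, 9.8, 25.0],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_crsp.zip")  # hypothetical file name
    # Write a tab-delimited text file inside a zip archive
    toy.to_csv(path, sep="\t", index=False,
               compression={"method": "zip", "archive_name": "toy_crsp.txt"})
    # Read it back: tab delimiter, only three of the four columns
    loaded = pd.read_csv(path, sep="\t", usecols=["PERMNO", "date", "RET"])

# Dates stored as integers are read as integers; convert them explicitly
loaded["date"] = pd.to_datetime(loaded["date"].astype(str), format="%Y%m%d")

print(loaded)
print(loaded.dtypes)
```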

## 3.2. Reading and writing Excel files

To read and write Excel files, we need additional packages. In recent versions of pandas, openpyxl handles both reading and writing .xlsx files, while xlrd is only needed to read legacy .xls files.

To install these packages, open the Anaconda Prompt (or a terminal) and type:

conda install -y openpyxl xlrd

In [None]:
# Write the yfdat data to excel
yfdat.to_excel("./L05_yfdat.xlsx",
              sheet_name = "Asset_Classes",
              startrow = 2,
              startcol = 5)

In [None]:
# Read data from excel file we created above (to showcase what may go wrong)
yfdat_excel1 = pd.read_excel("./L05_yfdat.xlsx") #, index_col = [0])
yfdat_excel1

In [None]:
# Read it again, this time specifying where the index and header are (and more functionality)
yfdat_excel2 = pd.read_excel("./L05_yfdat.xlsx",
                            sheet_name = "Asset_Classes",
                            usecols = "F:I", 
                            skiprows = [0,1],
                            nrows = 4, 
                            index_col = [0],
                            header = [0])
yfdat_excel2

## 3.3. Reading and writing from/to common proprietary file formats

Pandas has several functions that allow us to read and write data from many different types of files. Several of these are files that are created using expensive software like SAS, Stata, and Matlab.

Below, we only showcase reading a Stata file (the "compa.dta" file found in the Datasets tab in D2L). This is done with the "read_stata" function. Writing Stata data files can be done with the "to_stata" function. We do not show this here because it's likely none of you have Stata. 

The point of this section is to make you aware that, if, for some reason you come across datasets that are in proprietary file formats, Python (and Pandas in particular) will likely have a way to allow you to read that dataset, even if you don't have the software that was used to create it in the first place. 

Please visit the "I/O tools" link in the Resources section below for additional information on how to read or write these additional file formats. 

In [None]:
# Read the Compustat annual file saved in Stata format (compa.dta)
#comp_stata = pd.read_stata("./compa.dta",
#                          columns = ['LPERMNO', 'datadate', 'at'])

In [None]:
#Examine the dataset
#print(comp_stata.head(5), '\n\n')
#print(comp_stata.dtypes)

More importantly for this course, Python has its own data format called "pickle". Saving and loading data from pickle (.pkl extension) files is significantly faster than with csv files, so we will be using it quite a bit later on in the course. 

In [None]:
# Save a dataframe to a pkl file
comp.to_pickle('./comp.pkl')

# Read a dataframe from a pkl file
comp_pkl = pd.read_pickle('./comp.pkl')
comp_pkl
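
Besides speed, pickle files also keep the column types exactly as they were in memory. The sketch below illustrates this on a made-up dataframe with a date column, saved to a temporary folder in both formats (file names are hypothetical): the csv round trip returns the dates as text, while the pickle round trip returns them as dates.

```python
import os
import tempfile
import pandas as pd

# Made-up data with a proper datetime column
toy = pd.DataFrame({
    "datadate": pd.to_datetime(["2018-12-31", "2019-12-31"]),
    "at": [100.5, 120.0],
})

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "toy.csv")
    pkl_path = os.path.join(tmpdir, "toy.pkl")
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes, "\n\n")  # datadate comes back as text
print(from_pkl.dtypes)          # datadate is still a datetime column
```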

# 4. "Big Data" solutions: working with very large datasets

## 4.1. Reading and processing large files in "chunks"

Below, we use a for loop to (1) read a file in chunks and (2) apply some simple processing to each chunk.

In [None]:
# Read the COMPA file 10,000 rows at a time and keep track of the firm with the most total assets (at)
oldmax = 0
for chunk in pd.read_csv('./compa.zip', sep='\t', 
                         chunksize=10000, 
                         usecols=['LPERMNO','datadate','at', 'tic']):
    newmax = chunk['at'].max()
    if newmax > oldmax:
        comp_info = chunk.loc[chunk['at']==newmax,:].copy()
        oldmax = newmax
        print(comp_info, end='\n\n')  
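
The same pattern can be tested without the COMPA file. The sketch below reads a tiny made-up tab-delimited dataset from an in-memory buffer, two rows at a time, while keeping a running row count and the row with the largest "at" value. The data and the variable names `best` and `n_rows` are hypothetical.

```python
import io
import pandas as pd

# Made-up tab-delimited data standing in for a large file
raw = "tic\tat\nAAA\t10\nBBB\t50\nCCC\t30\nDDD\t70\nEEE\t20\n"

best = None   # row with the largest total assets seen so far
n_rows = 0    # running count of rows processed

# With chunksize set, read_csv returns an iterator of dataframes
for chunk in pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2):
    n_rows += len(chunk)
    top = chunk.loc[chunk["at"].idxmax()]   # largest row within this chunk
    if best is None or top["at"] > best["at"]:
        best = top

print(n_rows)
print(best)
```

Only one chunk is held in memory at a time, so this approach works even when the full file would not fit in memory. Starting from `best = None` rather than 0 also avoids assuming that the variable of interest is positive.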

## 4.2. Using HDF5 files for better performance

To use some of the functionality of HDF files, we need the "tables" package. Install it by typing the following in the Anaconda Prompt (or a terminal):

pip install tables

In [None]:
# Read COMPA file and time it
import time
tic = time.perf_counter()
comp = pd.read_csv("./compa.zip",  sep = '\t', 
                   usecols=['LPERMNO','datadate','at'])
toc = time.perf_counter()
print(toc-tic)

In [None]:
# Save to HDF format
comp.to_hdf('./comp.hdf', key='comp',
            data_columns = ['LPERMNO','datadate','at'])

In [None]:
#Read HDF file and time it
tic = time.perf_counter()
comp_hdf = pd.read_hdf('./comp.hdf', key='comp')
toc = time.perf_counter()
print(toc-tic)
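
Since we time several operations in this section, the tic/toc pattern can be wrapped in a small reusable helper. The sketch below is one possible way to do this with a context manager; the `timer` function and the `timings` dictionary are hypothetical names introduced here, and the timed operation is a placeholder for a file read.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results):
    """Store the elapsed time (in seconds) of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Placeholder workload; in practice this would be a pd.read_csv or pd.read_hdf call
with timer("sum", timings):
    total = sum(range(1_000_000))

print(timings)
```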

**Indexing HDF files**

In [None]:
#Create an HDF file with the "table" format so we can index it later on
comp.to_hdf('./comp_at.h5', key='comp_at', 
                  format='table',
                  data_columns = ['LPERMNO','datadate','at'])

In [None]:
#Retrieve only the data that satisfies some condition (keep only data from December 31, 2018)
comp_large = pd.read_hdf('./comp_at.h5', key='comp_at', 
                         where = ["datadate == 20181231"])
comp_large

# 5. Resources

- I/O tools for Pandas data:
    - https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html