# 1. Lecture Overview

The data we need for our projects is rarely all in one place (dataset) or organized the way we need it. This means that we very often have to combine two or more datasets into a single dataset or change the organization of the dataset to better suit our needs.

To keep things easy to visualize, we will use small, fictitious datasets to cover the following topics:

- Merging datasets
    - By columns (using "merge")
    - By index (using "merge" or "join")
      
- Appending datasets

- Concatenating datasets
        
- Reshaping datasets
    - From long to wide
    - From wide to long
       
- Sorting datasets
    - By index
    - By values
    
    
We finish with an application where we use the above techniques to merge the CRSP and Compustat datasets.

# 2. Preliminaries

In [None]:
# Import libraries
import pandas as pd
import datetime as dt

# Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 3. Merging datasets

When we say we want to "merge" two datasets, we generally mean that we want the columns of the two datasets to appear side by side in one final dataset. The important question is: How should the ROWS of the two datasets be matched? To perform this match, we need one or more columns that contain the same information in each of the two datasets. These common columns are usually referred to as the "keys" on which the rows are matched.

The second thing we have to decide is what to do with the rows that do NOT match after the merge. This is where we have to decide if we want an "inner", "outer", "left", or "right" join (aka merge), as specified below.

Finally, and this is specific to Pandas: if the keys are indexes in the dataframes we want to merge, then we need to be careful to specify that when we use the "merge" function. Alternatively, we can use the "join" function. This case is covered in section 3.2. below.

In [None]:
# Example datasets
df1 = pd.DataFrame({'year': [2001, 2002, 2003], 
                    'tic': ['MSFT','TSLA','AAPL'], 
                    'fy':[2002,2003,2004]})
df1
df2 = pd.DataFrame({'years': [2001, 2002, 2004], 
                    'ticker': ['MSFT','NFLX','AAPL'], 
                    'fy':[12,12,12]})
df2

## 3.1. Merging by columns

In this section, we assume the keys on which you want to merge are columns (not indexes) in the two dataframes. The keys are specified with the "left_on" and "right_on" arguments to the merge function.

### 3.1.1. Inner join

The inner join combines the datasets based on the INTERSECTION of the keys.

In [None]:
# Merge on year data
inner1 = df1.merge(df2, how='inner', 
                   left_on='year', right_on='years') 
inner1
    # Specify how to change names of common variables (fy)
inner2 = df1.merge(df2, how='inner', 
                   left_on='year', right_on='years', 
                   suffixes=('_df1','_df2'))
inner2

# Merge on ticker data
inner3 = df1.merge(df2, how='inner', 
                   left_on='tic', right_on='ticker')
inner3

# Merge on year and ticker
inner4 = df1.merge(df2, how = "inner", 
                   left_on = ['year','tic'], 
                   right_on = ['years','ticker'], 
                   suffixes=('_df1','_df2'))
inner4
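
To make the "intersection of the keys" idea concrete, the sketch below computes the common tickers by hand and compares them with the tickers that survive an inner merge. The names `left_df`, `right_df`, `common_tickers`, and `inner_tic` are introduced here for illustration only; the data repeat the fictitious df1 and df2 from above.

```python
import pandas as pd

# Same fictitious data as df1 and df2 (names here are illustrative)
left_df = pd.DataFrame({'year': [2001, 2002, 2003],
                        'tic': ['MSFT', 'TSLA', 'AAPL'],
                        'fy': [2002, 2003, 2004]})
right_df = pd.DataFrame({'years': [2001, 2002, 2004],
                         'ticker': ['MSFT', 'NFLX', 'AAPL'],
                         'fy': [12, 12, 12]})

# Tickers that appear in BOTH datasets (the intersection of the keys)
common_tickers = set(left_df['tic']) & set(right_df['ticker'])

# Inner merge on the ticker columns
inner_tic = left_df.merge(right_df, how='inner',
                          left_on='tic', right_on='ticker')

# The merged rows should cover exactly the common tickers
print(sorted(common_tickers))
print(sorted(inner_tic['tic']))
```

Because "fy" exists in both datasets and no suffixes are given, pandas falls back on its default suffixes and names the two columns "fy_x" and "fy_y".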

### 3.1.2. Outer join

The outer join combines the datasets based on the UNION of the keys.

In [None]:
# Merge on ticker data
outer = df1.merge(df2, how='outer', 
                  left_on='tic', right_on='ticker')
outer

#Note how year data has been converted to float because of the introduction of NaN values (which are float)
outer.dtypes
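
After an outer join it is often useful to know where each row came from. One way to see this, sketched below, is the `indicator=True` argument of merge, which adds a `_merge` column with the values "both", "left_only", or "right_only". The dataframe names are illustrative and the data repeat the fictitious df1 and df2.

```python
import pandas as pd

# Same fictitious data as df1 and df2 (names here are illustrative)
left_df = pd.DataFrame({'year': [2001, 2002, 2003],
                        'tic': ['MSFT', 'TSLA', 'AAPL'],
                        'fy': [2002, 2003, 2004]})
right_df = pd.DataFrame({'years': [2001, 2002, 2004],
                         'ticker': ['MSFT', 'NFLX', 'AAPL'],
                         'fy': [12, 12, 12]})

# indicator=True adds a "_merge" column describing the origin of each row
outer_flagged = left_df.merge(right_df, how='outer',
                              left_on='tic', right_on='ticker',
                              indicator=True)
print(outer_flagged[['tic', 'ticker', '_merge']])

# Count how many rows matched and how many came from one side only
n_both = (outer_flagged['_merge'] == 'both').sum()
n_left_only = (outer_flagged['_merge'] == 'left_only').sum()
n_right_only = (outer_flagged['_merge'] == 'right_only').sum()
print(n_both, n_left_only, n_right_only)
```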

### 3.1.3. Left join

In a left join, the unmatched keys from the left dataset are kept, but the unmatched keys from the right dataset are discarded.

In [None]:
# Merge on ticker data
left = df1.merge(df2, how='left', 
                 left_on='tic', right_on='ticker')
left

### 3.1.4. Right join

In a right join, the unmatched keys from the right dataset are kept, but the unmatched keys from the left dataset are discarded.

In [None]:
# Merge on ticker data
right = df1.merge(df2, how='right', 
                  left_on='tic', right_on='ticker')
right
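
A quick way to compare the four join types is to count the rows each one produces on the same pair of datasets. The loop below is a minimal sketch of that comparison; `row_counts` and the dataframe names are introduced here for illustration, and the data repeat the fictitious df1 and df2.

```python
import pandas as pd

# Same fictitious data as df1 and df2 (names here are illustrative)
left_df = pd.DataFrame({'year': [2001, 2002, 2003],
                        'tic': ['MSFT', 'TSLA', 'AAPL'],
                        'fy': [2002, 2003, 2004]})
right_df = pd.DataFrame({'years': [2001, 2002, 2004],
                         'ticker': ['MSFT', 'NFLX', 'AAPL'],
                         'fy': [12, 12, 12]})

# Merge on ticker with each join type and record the number of rows
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = left_df.merge(right_df, how=how,
                           left_on='tic', right_on='ticker')
    row_counts[how] = len(merged)

print(row_counts)
```

Two tickers (MSFT and AAPL) appear in both datasets, and each dataset has one ticker the other lacks, so the inner join is the smallest result and the outer join the largest.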

## 3.2. Merging on index

As mentioned above, this covers the situation when the keys on which we want to perform the merge are indexes in the dataframes we want to merge. In this case, we can either use the "merge" function and specify "left_index=True, right_index=True", or we can use the "join" function.

In [None]:
# Add indices to the example dataframes
df3 = df1.set_index(['year','tic'])
df3
df4 = df2.set_index(['years','ticker'])
df4

In [None]:
# Using merge function (perform outer merge)
#outerm = df3.merge(df4, how='outer', left_index=True, right_index=True) #gives error

#Make index names match
df4.index.names = ['year','tic']
df3
df4

# Try the merge again
outerm = df3.merge(df4, how='outer', 
                   left_index=True, right_index=True, 
                   suffixes=('_df3','_df4')) 
outerm

In [None]:
# Using join function
#outerj = df3.join(df4, how = 'outer') #gives error
outerj = df3.join(df4, how = 'outer', 
                  lsuffix='_df3', rsuffix='_df4')
outerj
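
When both dataframes are indexed by the keys, "merge" (with `left_index=True, right_index=True`) and "join" should give the same result. The sketch below checks this on data like df3 and df4, with the index names already matching. The names `left_idx`, `right_idx`, `via_merge`, and `via_join` are illustrative.

```python
import pandas as pd

# Fictitious data indexed by (year, tic), with matching index names
left_idx = pd.DataFrame({'year': [2001, 2002, 2003],
                         'tic': ['MSFT', 'TSLA', 'AAPL'],
                         'fy': [2002, 2003, 2004]}).set_index(['year', 'tic'])
right_idx = pd.DataFrame({'year': [2001, 2002, 2004],
                          'tic': ['MSFT', 'NFLX', 'AAPL'],
                          'fy': [12, 12, 12]}).set_index(['year', 'tic'])

# Outer merge on the index using merge
via_merge = left_idx.merge(right_idx, how='outer',
                           left_index=True, right_index=True,
                           suffixes=('_l', '_r'))

# The same outer merge using join
via_join = left_idx.join(right_idx, how='outer',
                         lsuffix='_l', rsuffix='_r')

# Compare the two results
print(via_merge.equals(via_join))
```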

# 5. Appending datasets

When we say that we want to append a dataset to another, we generally mean that we want the rows of the two datasets to be stacked on top of each other (as opposed to the columns being placed side by side, as in a merge). This usually happens when we obtain more data on a given set of variables, and we just want to add it at the bottom of a dataset containing those variables (columns).

In [None]:
# Append df2 to df1
# Note: DataFrame.append was removed in pandas 2.0; pd.concat stacks rows the same way
df1
df2
pd.concat([df1, df2])
df1 
#this shows that df1 has not actually been overwritten 
#(for that we would have to say df1 = pd.concat([df1, df2]) above)

# Rename columns so pandas knows which columns contain the same type of information in the two datasets
# "ignore_index" makes sure the row numbering (the index) does not start from 0 again
df5 = pd.concat([df1, df2.rename(columns={'years':'year',
                                          'ticker':'tic', 
                                          'fy':'fym'})], 
                ignore_index=True) 
df5
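
Once rows from two datasets are stacked, it is easy to lose track of which dataset each row came from. A minimal sketch of one way to keep that information is the `keys` argument of `pd.concat`, which adds an outer index level labelling the source. The dataframes `first` and `second` and the level names "source" and "row" are made up for this example.

```python
import pandas as pd

# Two small fictitious datasets with the same columns
first = pd.DataFrame({'year': [2001, 2002], 'tic': ['MSFT', 'TSLA']})
second = pd.DataFrame({'year': [2003], 'tic': ['AAPL']})

# Stack the rows and label each block with the dataset it came from
stacked = pd.concat([first, second],
                    keys=['first', 'second'],
                    names=['source', 'row'])
print(stacked)

# Select only the rows that came from the second dataset
print(stacked.loc['second'])
```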

# 6. Concatenating datasets

Usually when we say we want to concatenate datasets, we mean that we want to combine them side by side (just like merge), but "as they are", without matching rows on keys (unlike merge). The "concat" function does this for us (using axis = 1); note that it still lines rows up by their index labels. It can also append datasets (using axis = 0, or just leaving out axis altogether).

In [None]:
# Concatenate df2 to df1
df1
df2
df6 = pd.concat([df1, df2], axis=1) #without axis=1, this is just "append"
df6
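
One detail worth checking when concatenating side by side: `pd.concat` with `axis=1` lines rows up by index label, not by position. The sketch below uses two made-up dataframes, `a` and `b`, whose indexes only partly overlap, and then shows how resetting the index gives a purely positional combination.

```python
import pandas as pd

# Two fictitious dataframes whose row labels only partly overlap
a = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'y': [10, 20, 30]}, index=[1, 2, 3])

# Rows are aligned on the index labels; labels missing on one side give NaN
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)

# To combine purely by position, reset both indexes first
positional = pd.concat([a.reset_index(drop=True),
                        b.reset_index(drop=True)], axis=1)
print(positional)
```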

# 7. Reshaping datasets

By reshaping a dataset we generally mean that we want to change the structure of the dataset so that either

1. Some data stored in one column is converted to multiple columns (but the same row)
    - In pandas, this is called "unstacking"
    - Informally, we say that we are converting the dataset from long to wide
    

1. Some data stored in multiple columns (but the same row) is converted to a single column
    - In pandas, this is called "stacking"
    - Informally, we say that we are converting the dataset from wide to long
      

## 7.1. From long to wide (unstacking)

In [None]:
# Create example dataset
long1 = pd.DataFrame({'p':[1,1,2,2], 
                      'time': ['2005','2006','2005','2006'], 
                      'ret': [0.1,0.15,0.05,0.01],
                      'n':[5,3,4,10]})
long1

In [None]:
# Convert to wide, where each "p" gets its own column and each "time" gets its own row

    # First we have to set the index
long2 = long1.set_index(['time','p'])
long2

    # Now reshape to wide
wide = long2.unstack(level='p')
wide
wide.columns
wide.index
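
For a single variable, the "set the index, then unstack" pattern can also be written in one step with the `pivot` function. The sketch below is an alternative to the approach above, not a replacement for it; it rebuilds the example data under the illustrative name `long_df` and reshapes only the "ret" column.

```python
import pandas as pd

# Same fictitious long dataset as long1 (name here is illustrative)
long_df = pd.DataFrame({'p': [1, 1, 2, 2],
                        'time': ['2005', '2006', '2005', '2006'],
                        'ret': [0.1, 0.15, 0.05, 0.01],
                        'n': [5, 3, 4, 10]})

# Long to wide for "ret" only: set the index, then unstack "p"
wide_unstack = long_df.set_index(['time', 'p'])['ret'].unstack(level='p')

# The same reshape in one step using pivot
wide_pivot = long_df.pivot(index='time', columns='p', values='ret')
print(wide_pivot)

# Both approaches should produce the same table
print(wide_unstack.equals(wide_pivot))
```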

## 7.2. From wide to long (stacking)

In [None]:
# Stack "wide" back to "long"
long3 = wide.stack(level='p')
long3
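
Stacking should undo unstacking. The sketch below checks this round trip on the example data, under illustrative names. Row and column order can change along the way, so both tables are sorted before they are compared.

```python
import pandas as pd

# Same fictitious long dataset as long1 (name here is illustrative)
long_df = pd.DataFrame({'p': [1, 1, 2, 2],
                        'time': ['2005', '2006', '2005', '2006'],
                        'ret': [0.1, 0.15, 0.05, 0.01],
                        'n': [5, 3, 4, 10]})

# Long -> wide -> long again
long_idx = long_df.set_index(['time', 'p'])
wide_df = long_idx.unstack(level='p')
back_to_long = wide_df.stack(level='p')

# Sort rows and columns on both sides so that ordering does not matter
original_sorted = long_idx.sort_index().sort_index(axis=1)
round_trip_sorted = back_to_long.sort_index().sort_index(axis=1)

same = round_trip_sorted.equals(original_sorted)
print(same)
```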

# 8. Sorting datasets

Sorting is a straightforward concept. In Pandas, we use the "sort_index" function to sort the dataframe by the index (assumes an index is set), and "sort_values" to sort the dataframe by a variable (or a set of variables). 

In [None]:
# Sort by the index
long3.sort_index(level='p')
long3.sort_index() #this sorts by all levels of the multi-level index in order, starting with the first (time in our case)

In [None]:
# Sort by values (columns)
long3.sort_values('n')
long3.sort_values('n', ascending=False)
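
To sort by a set of variables, "sort_values" accepts a list of column names, and `ascending` can be a list with one entry per column. A minimal sketch, using a made-up dataframe called `scores`:

```python
import pandas as pd

# Small fictitious dataset
scores = pd.DataFrame({'p': [1, 1, 2, 2],
                       'time': ['2005', '2006', '2005', '2006'],
                       'n': [5, 3, 4, 10]})

# Sort by "p" ascending, then by "n" descending within each "p"
ordered = scores.sort_values(by=['p', 'n'], ascending=[True, False])
print(ordered)
```

As with the other examples, this returns a new dataframe and leaves `scores` unchanged.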

# 9. Application

Perform an inner join on the CRSP and Compustat datasets based on date and permno.

You will have to:
- Create a monthly frequency date variable (mdate) in each dataset
- Set the index in each dataset (permno, mdate)
- Make sure the index names match in both datasets
- Merge the datasets by index
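
Before working with the real files, the four steps above can be sketched on two tiny made-up tables. Everything in this block is fictitious: `returns` and `assets` only imitate the layout of the CRSP and Compustat extracts, and the numbers are invented.

```python
import pandas as pd

# Fictitious stand-ins for the CRSP and Compustat extracts (made-up values)
returns = pd.DataFrame({'PERMNO': [10001, 10001, 10002],
                        'date': [20200131, 20200228, 20200131],
                        'RET': [0.01, -0.02, 0.03]})
assets = pd.DataFrame({'LPERMNO': [10001, 10003],
                       'datadate': [20200131, 20200131],
                       'at': [100.0, 250.0]})

# Step 1: create a monthly date variable in each dataset
# (dates are converted to strings first so the format is applied unambiguously)
returns['mdate'] = pd.to_datetime(returns['date'].astype(str),
                                  format='%Y%m%d').dt.to_period('M')
assets['mdate'] = pd.to_datetime(assets['datadate'].astype(str),
                                 format='%Y%m%d').dt.to_period('M')

# Steps 2 and 3: use the same key names in both datasets, then set the index
returns = returns.rename(columns={'PERMNO': 'permno'}).set_index(['permno', 'mdate'])
assets = assets.rename(columns={'LPERMNO': 'permno'}).set_index(['permno', 'mdate'])

# Step 4: inner join on the index keeps only (permno, mdate) pairs found in both
matched = assets.join(returns, how='inner')
print(matched)
```

Only permno 10001 in January 2020 appears in both tables, so the inner join keeps a single row.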

In [None]:
# Load crspm (just date, permno and ret)
crsp = pd.read_csv('./crspm.zip', sep = '\t', 
                   usecols = ['date', 'PERMNO', 'RET'], 
                   low_memory = False)

# Convert column names to lowercase
crsp.columns = crsp.columns.str.lower()

# Create dtdate (datetime variable) and mdate (monthly period containing that date)
crsp['dtdate'] = pd.to_datetime(crsp['date'], 
                                format='%Y%m%d') 
crsp['mdate'] = crsp['dtdate'].dt.to_period('M')

# Set index
crsp.set_index(['permno', 'mdate'], inplace = True)
crsp

In [None]:
# Load compa dataset (just lpermno, datadate, at)
comp = pd.read_csv('./compa.zip', sep ='\t', 
                   usecols = ['LPERMNO', 'datadate','at'], low_memory = True)

# Rename LPERMNO to permno
comp.rename(columns = {'LPERMNO':'permno'}, inplace = True)

# Create dtdate (datetime variable) and mdate (monthly period containing that date)
comp['dtdate'] = pd.to_datetime(comp['datadate'], 
                                format='%Y%m%d') 
comp['mdate']  = comp['dtdate'].dt.to_period('M')

# Set index
comp.set_index(['permno', 'mdate'], 
               inplace = True, drop = False)
comp

In [None]:
# Merge crsp on comp on index (inner)
inner = comp.join(crsp, how='inner', rsuffix='_crsp')
inner

In [None]:
# Check if there is a difference between the daily dates in comp and crsp
inner['dif'] = inner['dtdate'] - inner['dtdate_crsp']
inner['dif'].describe()

# 10. Resources

- general discussion on combining datasets
    - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html


- merge function
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html


- join function
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
    
    
- append function
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html


- concat function
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat


- general discussion on reshaping and pivot tables
    - https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html


- reshaping functions
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html
    - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack
