# Planning

In this notebook, we will work with the following skills that assist in planning:

1. Identifier transformation 
1. Identifier coding
1. Data mockups

In [None]:
import pandas as pd

In [None]:
pd.set_option("mode.copy_on_write", True)

# Identifier transformation

As we discussed before, a key planning issue is making sure that the identifiers in our component datasets support the merges and other transformations that we need.
Most of our archival datasets come with one or more identifiers, and we often need to transform or cross-reference them.
We can think of this as the easiest case: identifiers are built in, even if they have some nuances for us to navigate.

**Note:** We're using macro research identifiers as examples here, but identifier planning matters generally across levels and research designs.

## CUSIP

CUSIP is an identifier used in the securities industry, and it's used in multiple databases, including [Compustat](http://office.banker.thomsonib.com/ta/help/glossaryhelp/CompustatGlossaryC.htm#CUSIP). It has multiple forms and components, detailed below:

component | format | notes
-----|-----------|---------------
Issuer code|six character string|Issuer (firm) level identifier.
Issue number|two character string|Issue (e.g., stock class, bond issue) level identifier.
Check digit|one character string|Basically a checksum digit, an old data integrity technique.

In archival databases, a six-digit CUSIP is the issuer code, an eight-digit CUSIP contains the issuer code and issue number, and the nine-digit CUSIP includes all three components.
Because these individual parts are in the same positions, we can convert any CUSIP variant down to a shorter one.
In addition, because the common stock of a firm typically gets the issue number `10`, we can upconvert six-digit CUSIPs to eight-digit CUSIPs, if we are willing to accept some missing identifiers (an important qualification).
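
As an aside, the check digit can be recomputed from the first eight characters, which gives a quick integrity test on nine-digit CUSIPs. The sketch below follows the standard "modulus 10 double-add-double" scheme; the function name `cusip_check_digit` is my own, not part of any library.

```python
def cusip_check_digit(cusip8: str) -> str:
    """Return the check digit for an eight-character CUSIP."""
    total = 0
    for position, char in enumerate(cusip8.upper(), start=1):
        if char.isdigit():
            value = int(char)
        elif char.isalpha():
            # Letters map to 10-35 (A=10, B=11, ...).
            value = ord(char) - ord("A") + 10
        else:
            # Special characters allowed in CUSIPs.
            value = {"*": 36, "@": 37, "#": 38}[char]
        # Double the value in every second position.
        if position % 2 == 0:
            value *= 2
        # Add the digits of the (possibly two-digit) value.
        total += value // 10 + value % 10
    return str((10 - total % 10) % 10)


# Compare the computed digit with the ninth character of each CUSIP.
for cusip9 in ["037833100", "594918104", "084670702"]:
    print(cusip9, cusip_check_digit(cusip9[:8]) == cusip9[8])
```

A mismatch usually points to a typo or a truncated identifier rather than a different security.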

In [None]:
# CUSIP Examples
cusips = pd.DataFrame(
    [
        {"name": "Apple", "cusip": "037833100"},
        {"name": "Microsoft", "cusip": "594918104"},
        {"name": "Berkshire Hathaway", "cusip": "084670702"},
    ]
)

cusips.head()

In [None]:
for v in [6, 8, 9]:
    cusips["id_cusip" + str(v)] = cusips["cusip"].str.slice(0, v)

In [None]:
cusips.head()

Notice that Berkshire Hathaway has an issue number of `70`, not `10`.
This is a case where a firm has multiple share classes, and one other than `10` is the most numerous (the Compustat criterion for choosing an issue to use as a firm identifier).
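
To see why upconversion costs us some identifiers, here is a minimal sketch that appends `10` to each issuer code and checks the result against the eight-digit CUSIPs from the examples above. The column names `cusip6`, `cusip8_guess`, and `matches` are made up for illustration.

```python
import pandas as pd

# Issuer codes taken from the three example CUSIPs.
issuers = pd.DataFrame(
    [
        {"name": "Apple", "cusip6": "037833"},
        {"name": "Microsoft", "cusip6": "594918"},
        {"name": "Berkshire Hathaway", "cusip6": "084670"},
    ]
)

# Assume every firm's common stock uses issue number "10".
issuers["cusip8_guess"] = issuers["cusip6"] + "10"

# First eight characters of the nine-digit example CUSIPs.
actual_cusip8 = {"03783310", "59491810", "08467070"}
issuers["matches"] = issuers["cusip8_guess"].isin(actual_cusip8)

print(issuers)
```

Apple and Microsoft match, while the guess for Berkshire Hathaway does not, so that firm would drop out of any merge on the upconverted identifier.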

## GVKEY

GVKEY is Compustat's primary identifier.
It's a six-character identifier that is essentially an integer, but some datasets store it with leading zeros and others without.
Filling with leading zeros is a transformation we sometimes need to do so that the values match when we merge later.

In [None]:
gvkeys = pd.DataFrame(
    [
        {"name": "Apple", "gvkey": "1690"},
        {"name": "Microsoft", "gvkey": "12141"},
        {"name": "Berkshire Hathaway", "gvkey": "2176"},
    ]
)

gvkeys.head()

In [None]:
gvkeys["gvkey"] = gvkeys["gvkey"].str.zfill(6)
gvkeys.head()
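
A common variation, sketched here with a made-up dataframe `gvkeys_int`, is a GVKEY column that was read in as integers. In that case we need to convert to strings before filling.

```python
import pandas as pd

# Hypothetical case: GVKEYs stored as integers, so leading zeros are gone.
gvkeys_int = pd.DataFrame(
    [
        {"name": "Apple", "gvkey": 1690},
        {"name": "Microsoft", "gvkey": 12141},
    ]
)

# Convert to string first; the .str accessor does not work on integer columns.
gvkeys_int["gvkey"] = gvkeys_int["gvkey"].astype(str).str.zfill(6)

print(gvkeys_int)
```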

## Converting identifiers

For these identifiers and others (e.g., PERMNO, PERMCO, CIK), WRDS has [linked data](https://wrds-www.wharton.upenn.edu/pages/support/data-overview/wrds-overview-crspcompustat-merged-ccm/) for converting identifiers.
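
Once we have a link table, the conversion itself is a merge. The sketch below uses a small hand-made link table; its column names and layout are assumptions for illustration and do not reflect the actual WRDS schema.

```python
import pandas as pd

# Hypothetical link table: column names and layout are illustrative,
# not the actual WRDS schema. Berkshire Hathaway is left out on purpose.
link = pd.DataFrame(
    [
        {"gvkey": "001690", "id_cusip8": "03783310"},
        {"gvkey": "012141", "id_cusip8": "59491810"},
    ]
)

securities = pd.DataFrame(
    [
        {"name": "Apple", "id_cusip8": "03783310"},
        {"name": "Microsoft", "id_cusip8": "59491810"},
        {"name": "Berkshire Hathaway", "id_cusip8": "08467070"},
    ]
)

# A left merge keeps every row of our data. indicator=True adds a _merge
# column that shows which rows found a match in the link table, and
# validate raises an error if the link table has duplicate keys.
linked = securities.merge(
    link, on="id_cusip8", how="left", validate="many_to_one", indicator=True
)

print(linked)
```

Rows marked `left_only` in `_merge` are the ones that still need an identifier, which is useful to know before the analysis stage.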

# Identifier coding

At times, we encounter datasets or entire projects with no coordinated identifiers across the data.
With no identifier throughout the project, the solution is relatively simple: take your base-level data (i.e. the data at the intended final data level), assign identifiers to each row (I suggest integers starting with `1`), and use those identifiers throughout.
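
A minimal sketch of that assignment, using a made-up dataframe `base` and a made-up column name `id`:

```python
import pandas as pd

# Hypothetical base-level data with no identifier.
base = pd.DataFrame({"name": ["Firm A", "Firm B", "Firm C"]})

# Assign one integer identifier per row, starting at 1.
base["id"] = range(1, len(base) + 1)

print(base)
```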

At other times, we need to code identifiers in order to match up with our other data.
This is the case we focus on below.

In the example data below, we would like to code a firm-level identifier.

In [None]:
coding = pd.DataFrame(
    [
        {"name": "Apple", "year": "2018"},
        {"name": "Apple", "year": "2019"},
        {"name": "Apple", "year": "2020"},
        {"name": "Microsoft", "year": "2018"},
        {"name": "Microsoft", "year": "2019"},
        {"name": "Microsoft", "year": "2020"},
        {"name": "Berkshire Hathaway", "year": "2018"},
        {"name": "Berkshire Hathaway", "year": "2019"},
        {"name": "Berkshire Hathaway", "year": "2020"},
    ]
)

coding.head(10)

The first thing to notice about this data is that each firm appears multiple times.
In practical data parlance, this column has relatively low **cardinality** (i.e., the number of unique values in the column relative to its length).
When coding something durable like identifiers, we like to see this, because we will only need to look up each value once.
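
One rough way to gauge cardinality before committing to manual coding is to compare the count of unique values with the column length. This sketch rebuilds the name column from the example data as a standalone series.

```python
import pandas as pd

# Same names as the example data: three firms, three years each.
names = pd.Series(
    ["Apple"] * 3 + ["Microsoft"] * 3 + ["Berkshire Hathaway"] * 3
)

n_unique = names.nunique()
share_unique = n_unique / len(names)

print(n_unique, len(names), round(share_unique, 2))
```

Here we would look up three names rather than nine rows.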

In [None]:
# We can get unique values of a column with the unique method.
coding["name"].unique()

In [None]:
# We can make a new dataframe with those unique values.
code_table = pd.DataFrame(coding["name"].unique())
code_table = code_table.rename(columns={0: "name"})
code_table.head()

In [None]:
# Then, we can populate new columns for use in coding.
for new_col in ["gvkey", "source", "coder", "flag", "notes"]:
    code_table[new_col] = ""
code_table.head()

We can then write this as a CSV, code it, read it back in, and use it to add identifiers to this dataframe.
We'll talk more about the mechanics of this kind of coding in the human coding data retrieval segment.
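
As a preview, here is a sketch of the read-back and merge step. The coded file is simulated with an in-memory string rather than a file on disk, and only the `gvkey` column is filled in; the names `coded_csv`, `coded`, and `coding_with_ids` are my own.

```python
import io

import pandas as pd

# A shortened version of the example data.
coding = pd.DataFrame(
    [
        {"name": "Apple", "year": "2018"},
        {"name": "Apple", "year": "2019"},
        {"name": "Microsoft", "year": "2018"},
    ]
)

# Stand-in for the coded CSV; in practice this would be a file on disk.
coded_csv = io.StringIO(
    "name,gvkey,source,coder,flag,notes\n"
    "Apple,001690,,,,\n"
    "Microsoft,012141,,,,\n"
)

# dtype=str keeps the leading zeros in gvkey.
coded = pd.read_csv(coded_csv, dtype=str)

# Each name appears once in the code table, so the merge is many-to-one.
coding_with_ids = coding.merge(
    coded[["name", "gvkey"]], on="name", how="left", validate="many_to_one"
)

print(coding_with_ids)
```

Reading with `dtype=str` matters here: without it, pandas would likely parse `gvkey` as an integer and drop the leading zeros.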

# Data mockups

At the planning and pilot study stage, we may have a complex and labor-intensive data collection yet to do.
As a result, we will not have some of the data that we need in order to make sure that we can fit everything together.

A data mockup is a form of data that we create—often manually—to simulate the form of the data that we will retrieve in a subsequent collection.
This is common for data obtained by web scraping, human coding, or other time-intensive processes.
Before starting such a collection, we need to know that it will produce the data that we need.
If we are designing the collection ourselves, it may serve as a target for the form of data produced.

My favorite tool for producing data mockups is a manually-created CSV file.
Unlike Excel spreadsheets (with a lot of internal complexity and sometimes well-intended but harmful automatic behavior), a CSV file is what the name describes: comma-separated values.
To make one manually, we simply type (or, more likely, copy and paste) into a file in a text editor.

## CSV example

The contents of a CSV file look like this:

```csv
price,tic,yr
86.13,msft,2018
62.79,msft,2017
54.32,msft,2016

```
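
To check that a mockup fits with the rest of our data, we can read it the same way we would read the real collection. The sketch below loads the CSV contents shown above from an in-memory string; with a real file, we would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The mockup contents, held in memory instead of a file.
mockup_csv = io.StringIO(
    "price,tic,yr\n"
    "86.13,msft,2018\n"
    "62.79,msft,2017\n"
    "54.32,msft,2016\n"
)

mockup = pd.read_csv(mockup_csv)

# Inspect the inferred column types and the data itself.
print(mockup.dtypes)
print(mockup)
```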

Also, though it's beyond the scope of the course, using regular expressions in VS Code find/replace can often format raw data quickly into a CSV format. 

# Q&A: Planning

At the end, we will chat as one big group about issues that you foresee in your own planning.