# Human coding

In this notebook, we will work with the following aspects of human coding:

1. Revisiting our identifier example 
1. Human coding design
1. Heuristics
1. Cartesian joins

In [None]:
import pandas as pd

In [None]:
pd.set_option("mode.copy_on_write", True)

# Identifier example

Returning to our identifier example from the planning segment, we can look through some elements of the design of a coding spreadsheet.

Note: I removed one year from each firm for brevity.

In [None]:
coding = pd.DataFrame(
    [
        {"name": "Apple", "year": "2018"},
        {"name": "Apple", "year": "2019"},
        {"name": "Microsoft", "year": "2018"},
        {"name": "Microsoft", "year": "2019"},
        {"name": "Berkshire Hathaway", "year": "2018"},
        {"name": "Berkshire Hathaway", "year": "2019"},
    ]
)

coding.head(10)

In [None]:
# We can get unique values of a column with the unique method.
coding["name"].unique()

In [None]:
# We can make a new dataframe with those unique values.
code_table = pd.DataFrame({"name": coding["name"].unique()})
code_table.head()

In [None]:
# Then, we can populate new columns for use in coding.
for new_col in ["gvkey", "source", "coder", "flag", "notes"]:
    code_table[new_col] = ""
code_table.head()

# Human coding design

When we design coding processes for humans, there are a few important things that we need to manage and balance.

1. Cost/time spent
1. Future use and versatility
1. Consistency across time and coders
1. Cleanness of data
1. Identification of errors


There are a few practices that I suggest, which we can see embedded in the example above.

1. **Minimize what you present to coders.** This table has a lot less than a typical dataset, because we only want to provide what is needed to do the work. We don't have to be extremely strict about this (e.g., a column that gives meaningful context but isn't used directly is sometimes helpful), but everything you add has a cost in time.
1. **Communicate a benchmark for completing work.** Coders are not always diligent (and it's fair to say that this is often tedious work). I have found that communicating a specific benchmark, even a generous one (I tend to multiply my own average time per observation, measured over a couple dozen observations, by 2.0), improves output. You can use things like file creation times or Dropbox revisions (depending on the design) to compute the time spent as a way of measuring compliance.
1. **Capture information that helps improve the process.** It may not be strictly necessary to capture the source here, but it may be useful to have. For example, if we had a cascading design, we might find that some sources are not helpful enough to include going forward.
1. **Provide explicit instructions.** Instructions help the same person perform similarly over time, and they help us keep multiple coders in sync. They can also be helpful to show to reviewers to give face validity to the rigor of the management of the coding protocol.
1. **Give coders a mechanism to raise issues and capture the unexpected.** Here, I use a flag column and a notes column. The idea is that a flag tells me to look at it, and notes capture something unexpected. I advise coders that using this is rare, but it's there to surface things that they notice. Without this, you sometimes get responses like (for example, for gvkey) "001234 or 056789." This will read in as the wrong type, and it won't merge properly (even if one of them is correct). By giving them a mechanism (and instructions) to communicate, they can choose the better of the two, flag it, and make a note.
1. **Establish a review process to detect errors.** I typically have two forms of review. First, when onboarding a new coder, we have them do a short list of coding tasks (from the real data) that we use for every new coder. We use this to detect errors in process or conscientiousness early and correct them, and it's common to have issues (around 50 percent of new coders have at least one issue, in my experience). Using the same list means that we already know which issues tend to come up on those particular tasks. Second, for more nuanced/complex projects, we write instructions for reviewing the work that is submitted.
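
As one way to act on the points about types and flags, here is a minimal sketch of reading a returned coding sheet and surfacing the rows that need attention. The sheet contents, the six-digit `gvkey` pattern, and the helper name `find_problem_rows` are assumptions for illustration; the identifier values are placeholders, not real gvkeys.

```python
import io

import pandas as pd

# Hypothetical returned sheet; the gvkey values are placeholders.
returned_csv = io.StringIO(
    "name,gvkey,source,coder,flag,notes\n"
    "Firm A,001234,web,c1,,\n"
    "Firm B,001234 or 056789,web,c1,,\n"
    "Firm C,056789,web,c1,1,two candidates; chose the larger firm\n"
    "Firm D,,web,c1,,\n"
)

# Reading every column as text keeps leading zeros in identifiers, and
# keep_default_na=False leaves empty cells as "" rather than NaN.
sheet = pd.read_csv(returned_csv, dtype=str, keep_default_na=False)


def find_problem_rows(df, column="gvkey", pattern=r"\d{6}"):
    """Return rows whose value is blank or does not fully match the pattern."""
    is_valid = df[column].str.fullmatch(pattern)
    return df[~is_valid]


# Rows to send back to the coder, and rows the coder asked us to look at.
problems = find_problem_rows(sheet)
flagged = sheet[sheet["flag"] != ""]
print(problems[["name", "gvkey"]])
print(flagged[["name", "notes"]])
```

A check like this catches the "001234 or 056789" style of entry before it reaches a merge, and the flagged rows give a short list to review by hand.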

# Heuristics

Sometimes, we have an imperfect way of coding a variable that works a reasonable amount of the time.
The idea here is that we can pre-populate those heuristic values, and we ask coders to check them, leave them if they are correct, and correct them if not.

It's worth noting that this necessarily increases the nuance and complexity of a coding task.
On the other hand, if the coders are not unduly anchored by the heuristic, it can create a substantial time savings by reducing data entry.
These issues make me more likely to use this technique with coders who we know to do good work, PhD students, and other co-authors.

In [None]:
msft_nyt = pd.read_csv("../data/msft_nyt.csv")
msft_nyt.head()

In [None]:
msft_coding = msft_nyt[["_id", "headline.main"]].copy()
msft_coding.head()

In [None]:
for new_col in ["is_significant", "coder", "flag", "notes"]:
    msft_coding[new_col] = ""
msft_coding.head()

In [None]:
# Perhaps we think the word billion in the headline makes something
# likely to be significant.
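def code_significant(text):
    """Return "1" if the headline mentions "billion", otherwise ""."""
    # A missing headline is read in as NaN (a float), so check the type first.
    if isinstance(text, str) and "billion" in text.lower():
        return "1"
    return ""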

In [None]:
msft_coding["is_significant"] = msft_coding["headline.main"].apply(code_significant)

In [None]:
msft_coding.head()

In this example, I write a simple function that is designed to operate on one string, and I use the `apply` method on the `headline.main` column to apply that function to each row.
Another common pattern is to merge in the values that may work, and use the coding process to check them.
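
The merge pattern might look something like the sketch below. The `candidates` table, its placeholder `gvkey` values, and the `prefilled` marker column are assumptions for illustration; in practice the candidates would come from whatever imperfect source you have.

```python
import pandas as pd

# Sheet of firms to code, mirroring the identifier example.
to_code = pd.DataFrame({"name": ["Apple", "Microsoft", "Berkshire Hathaway"]})

# Hypothetical candidate matches from some imperfect source.
# The gvkey values are placeholders, not real identifiers.
candidates = pd.DataFrame(
    {
        "name": ["Apple", "Microsoft"],
        "gvkey": ["000001", "000002"],
    }
)

# A left merge keeps every firm; firms without a candidate get a blank
# for the coder to fill in.
prefilled = to_code.merge(candidates, on="name", how="left", validate="one_to_one")
prefilled["gvkey"] = prefilled["gvkey"].fillna("")

# Record which values came from the heuristic so they can be reviewed later.
prefilled["prefilled"] = (prefilled["gvkey"] != "").map({True: "1", False: ""})
for new_col in ["coder", "flag", "notes"]:
    prefilled[new_col] = ""
print(prefilled)
```

Keeping a marker for pre-populated values is a design choice rather than a requirement; it makes it possible to see later how often coders changed the heuristic value.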

# Cartesian joins

A [Cartesian product](https://en.wikipedia.org/wiki/Join_\(SQL\)#Cross_join) (also called a "cross join") is a combination of each row in one set of data with each row in another set of data.
As you can imagine, the length of the new data is $n \times m$ (where $n$ and $m$ are the numbers of rows in the two inputs), which gets very large as $n$ and $m$ increase.
However, when we can use small groups, these products become manageable.
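
To make the size difference concrete, here is a small sketch comparing a full cross join with a product taken only within groups. The `group` key and the item labels are hypothetical; merging on a shared key is one way to get the within-group product, assuming such a key exists in both tables.

```python
import pandas as pd

# Hypothetical data: two groups, each with its own short lists.
left = pd.DataFrame(
    {
        "group": ["g1", "g1", "g2", "g2", "g2"],
        "left_item": ["L1", "L2", "L3", "L4", "L5"],
    }
)
right = pd.DataFrame(
    {
        "group": ["g1", "g1", "g1", "g2", "g2"],
        "right_item": ["R1", "R2", "R3", "R4", "R5"],
    }
)

# A full cross join pairs every row with every row: 5 * 5 = 25.
full = left.merge(right, how="cross")

# Merging on the group key pairs rows only within the same group:
# g1 gives 2 * 3 = 6 rows and g2 gives 3 * 2 = 6 rows, 12 in total.
grouped = left.merge(right, on="group")

print(len(full), len(grouped))
```

With only two groups the savings are modest, but the gap widens quickly as the number of groups grows while each group stays small.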

## Conference call example

Imagine that we have conference call data consisting of a list of participants and then a transcript where we can isolate the names, but they don't quite match the participant list.
These are reasonably short-length lists, so we can use a Cartesian product for coding.
We expect each name to have a proper match on the other side (not strictly required, but it helps), so we can produce all of the combinations and have a coder mark the proper ones.

In [None]:
participants = pd.DataFrame(
    [
        {"name": "Abbie Executive (CEO)"},
        {"name": "Bruce Executive (CFO)"},
        {"name": "Charles Analyst (Firm1)"},
        {"name": "Bella Analyst (Firm2)"},
    ]
)
participants.head()

In [None]:
speaker = pd.DataFrame(
    [
        {"name": "Operator"},
        {"name": "Abbie E."},
        {"name": "Bruce E."},
        {"name": "Charles A."},
        {"name": "Bella A."},
    ]
)
speaker.head()

In [None]:
call_coding = participants.merge(speaker, how="cross")
call_coding["correct"] = ""
call_coding

As you can see, we now have 20 rows from lists of four and five (i.e. `4 * 5`).
That is a big expansion in length, but these are fast to code.
If you look through the rows, you'll notice that you can ascertain matches very quickly, and most rows do not match.
Using this technique, we've seen coders reliably exceed 40 rows per minute.
It also has the benefit of requiring almost no data entry, other than the `1`s entered for matches.

In [None]:
mockup = [1, 7, 13, 19]
call_coding.loc[mockup, "correct"] = 1
call_coding

In [None]:
call_coding[call_coding["correct"] == 1]

This last form with the correct rows filtered becomes our lookup table to match these two forms of names.
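
As a sketch of how that lookup table might be used, the block below rebuilds the coded sheet, tidies the filtered rows, and merges them onto transcript lines. The `transcript` table, the renamed columns, and storing the marks as the text `"1"` (as they would be if the sheet were read back as strings) are assumptions for illustration.

```python
import pandas as pd

participants = pd.DataFrame(
    {
        "name": [
            "Abbie Executive (CEO)",
            "Bruce Executive (CFO)",
            "Charles Analyst (Firm1)",
            "Bella Analyst (Firm2)",
        ]
    }
)
speaker = pd.DataFrame(
    {"name": ["Operator", "Abbie E.", "Bruce E.", "Charles A.", "Bella A."]}
)

# Both inputs have a "name" column, so the cross merge suffixes them
# as name_x (participants) and name_y (speakers).
call_coding = participants.merge(speaker, how="cross")
call_coding["correct"] = ""
call_coding.loc[[1, 7, 13, 19], "correct"] = "1"

# Keep only the marked rows and give the columns descriptive names.
lookup = (
    call_coding[call_coding["correct"] == "1"]
    .rename(columns={"name_x": "participant", "name_y": "speaker"})
    .drop(columns="correct")
)

# Each participant and each speaker should be matched at most once.
assert not lookup["participant"].duplicated().any()
assert not lookup["speaker"].duplicated().any()

# Hypothetical transcript lines keyed by the speaker name as transcribed.
transcript = pd.DataFrame(
    {
        "speaker": ["Operator", "Abbie E.", "Charles A."],
        "text": ["Welcome.", "Thank you.", "One question."],
    }
)

# A left merge keeps unmatched speakers (such as the operator),
# leaving their participant value missing.
labeled = transcript.merge(lookup, on="speaker", how="left")
print(labeled)
```

The duplicate checks are a cheap way to catch a coder marking two rows for the same name before the lookup table is used in a merge.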

# Q&A: Human coding

At the end, we will chat as one big group about human coding experiences and issues.