# Data Biography
## *History Colorado's KKK Ledgers*

In [170]:
    #This analyzes a History Colorado dataset that needs to be imported for analysis
import pandas as pd
kkk_df = pd.read_csv('kkk-ledgers-index.csv', delimiter=",", low_memory=False)

### Introduction
In 2021, History Colorado (formerly the Colorado Historical Society) digitized and advertised what it described as “membership books” for the Ku Klux Klan (KKK). The KKK is a white supremacist group in the United States that gained particular popularity in Colorado in the 1920s and was even represented by people in prominent elected government positions. While there is no confirmed date connected to the ledgers, they are estimated to have been in use between 1920 and 1930, and more specifically between 1924 and 1926. They seem to focus mainly on the Denver area, though no geographic scope is defined either.

### Creating the Dataset
The collection consists of two ledgers that, as we will see repeatedly, were created for vastly different purposes. The first ledger documents membership applications, including information such as names, dues payments, and when people became naturalized as Klan members. **Therefore, from now on, this ledger will be referred to as the “application ledger.”**

<div align="center">
  <img
    src="../assets/img/application_ledger.png"
    width="40%"
    style="border: 2px solid"
    alt="application ledger page"
  />
  <figcaption style="font-style: italic">
    Page from the "Application Ledger"
  </figcaption>
</div>


The other ledger focuses more on documenting basic information about members like their names and addresses. **Therefore, from now on, this ledger will be referred to as the “directory ledger.”**

<div align="center">
  <img
    src="../assets/img/directory_ledger.png"
    width="40%"
    style="border: 2px solid"
    alt="directory ledger page"
  />
  <figcaption style="font-style: italic">
    Page from the "Directory Ledger"
  </figcaption>
</div>


Both ledgers include various handwritten annotations and symbols, many of which have undefined meanings. Because of these annotations, the data recorded in the ledgers varies widely despite the attempt at standardization through the ledger book's premade columns. Also, not all columns were always filled in, creating gaps in the data.

These ledgers were donated to History Colorado in Denver in 1946 by an anonymous donor from the Rocky Mountain News, though to protect privacy, they were restricted from public access until 1990. History Colorado then digitized the ledgers in 2021 to make them easily accessible to the public. Now, through [History Colorado’s website](https://www.historycolorado.org/kkkledgers), one can access this data in many forms, including PDF images of the original ledgers, an interactive map using the address data from the ledgers, and a spreadsheet index that compiles information from the ledgers. In creating the spreadsheet index, History Colorado used Optical Character Recognition (OCR) technology, a process they acknowledge has some imperfections despite their extensive checks on the results. The range of formats not only helps people approach the data through different lenses but also gives those analyzing the data several ways to cross-check it before coming to conclusions.


It is also important to note the context in which History Colorado digitized and advertised this information. Coinciding with the racial reckoning in the United States throughout 2020, History Colorado’s digitization of this data was part of an effort to “actively name and confront systems of inequality.” The emphasis on names and addresses throughout the dataset and the website’s tools and instructions reflects these goals: rather than erasing Colorado’s racist history, they show the prominence of the KKK in the Denver area. This is also a way to make confronting Colorado’s racist past a personal experience for people, who may use the tools to search for relatives or see the past prominence of the KKK in their area.

### What does the data show?
That said, as with all datasets, this one reduces its subject, here KKK membership, to names and addresses without the accompanying stories. What was it like for people to be a member of the KKK at this time? How did this large membership impact marginalized groups? As all knowledge is situated, the answers provide essential context for truly analyzing the ledgers, yet these questions cannot be answered through the dataset itself.

In [180]:
    #Instead, this is the data that we see
kkk_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29635 entries, 0 to 29634
Data columns (total 34 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   itemID                  29635 non-null  object 
 1   Number                  29633 non-null  float64
 2   fullName                29635 non-null  object 
 3   Prefix                  174 non-null    object 
 4   Last Name               29634 non-null  object 
 5   First Name              29631 non-null  object 
 6   Middle Name             26267 non-null  object 
 7   Suffix                  357 non-null    object 
 8   residenceAddress1       14921 non-null  object 
 9   residenceAddress2       433 non-null    object 
 10  Residence Address       15334 non-null  object 
 11  Residence Other         868 non-null    object 
 12  residenceCity           15649 non-null  object 
 13  residenceState          15649 non-null  object 
 14  Residence City & State  15651 non-null


Note that the columns listed here suggest that the directory ledger was the template for creating the dataset, as much of the data shown, like addresses, was not collected in the application ledger. Also, categories like "sNumber" refer specifically to the directory ledger. That said, it is valuable that the dataset includes some of the written annotations, like symbols, as this lets them be analyzed. However, this also means that potentially valuable information from the application ledger, like dues payments, cannot be analyzed through the dataset.

<div align="center">
  <img
    src="../assets/img/snumber.png"
    width="10%"
    style="border: 2px solid"
    alt="The 'S' annotation in the number column representing the 'sNumber' category"
  />
  <figcaption style="font-style: italic">
    The "S" annotation in the number column representing the "sNumber" category
  </figcaption>
</div>


This reveals a key concern about the dataset: it was created by combining two sets of data that were recorded for very different reasons. History Colorado lumps the two ledgers into the category of “membership books” to justify their combination into one dataset, but assuming sameness is a problematic way to treat different datasets. ***Combining the datasets aligned with their goals to represent the prominence of the KKK in Denver, but it leaves opportunities for misrepresentation of the data.***

### Double Counting
The main issue that this causes is double counting: a person’s KKK membership can be counted twice when both their application data and their directory data are included.

In [185]:
    #How many duplicate names exist in the dataset?
kkk_df['fullName'].duplicated(keep=False).value_counts()

fullName
False    19715
True      9920
Name: count, dtype: int64

In [186]:
    #What is the percentage of duplicated names in the dataset?
kkk_df['fullName'].duplicated(keep=False).value_counts(normalize=True)

fullName
False    0.665261
True     0.334739
Name: proportion, dtype: float64

This shows that of the 29,635 entries, about 33% share a full name with at least one other entry. Of course, some people could have had the same name, but the repetition is more likely due to people being in both the application ledger *and* the directory ledger.
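
One caveat when reading these counts: with `keep=False`, pandas flags every row in a group of repeated names, so the `True` count is a number of rows rather than a number of people. A minimal sketch on made-up names (the toy data is hypothetical and not drawn from the ledgers):

```python
import pandas as pd

# Hypothetical toy data: "A Smith" appears twice, "B Jones" three times.
toy = pd.DataFrame({"fullName": ["A Smith", "A Smith", "B Jones",
                                 "B Jones", "B Jones", "C Brown"]})

# keep=False marks every occurrence of a repeated name, not just the extras.
rows_flagged = toy["fullName"].duplicated(keep=False).sum()

# Count the distinct names that occur more than once.
name_counts = toy["fullName"].value_counts()
repeated_names = (name_counts > 1).sum()

print(rows_flagged)    # rows that carry a repeated name
print(repeated_names)  # distinct names that are repeated
```

Here five rows are flagged even though only two names repeat, which is why the flagged-row count overstates the number of people involved.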

In [188]:
    #However, I will also check individually within the ledgers how many duplicate entries there were.
application_filter = kkk_df['Ledger Link'] == "MSS.366.4"
directory_filter = kkk_df['Ledger Link'] == "MSS.366.5"
application_df = kkk_df[application_filter]
directory_df = kkk_df[directory_filter]

In [189]:
    #How many duplicated in the application dataset?
application_df['fullName'].duplicated(keep=False).value_counts()

fullName
False    12636
True       262
Name: count, dtype: int64

In [190]:
    #What is this percentage?
application_df['fullName'].duplicated(keep=False).value_counts(normalize=True)

fullName
False    0.979687
True     0.020313
Name: proportion, dtype: float64

In [191]:
    #How many duplicated in the directory dataset?
directory_df['fullName'].duplicated(keep=False).value_counts()

fullName
False    16591
True       146
Name: count, dtype: int64

In [192]:
    #What is this percentage?
directory_df['fullName'].duplicated(keep=False).value_counts(normalize=True)

fullName
False    0.991277
True     0.008723
Name: proportion, dtype: float64

Having only 2% and 0.9% of entries duplicated within the individual ledgers suggests that most of the repeated names come from people who appear in the application ledger and then again in the directory ledger. The dataset would therefore inflate membership numbers if it were used simply to count KKK members at this time.
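
The overlap between the two ledgers can also be measured directly instead of being inferred from the combined count. The following is a minimal sketch on hypothetical rows. It assumes, as in the filters above, that `Ledger Link` distinguishes the ledgers, and that an exact `fullName` match identifies the same person, which will miss spelling variants and conflate distinct people who share a name.

```python
import pandas as pd

# Hypothetical toy rows standing in for kkk_df.
toy = pd.DataFrame({
    "fullName": ["A Smith", "B Jones", "C Brown",
                 "A Smith", "B Jones", "D White"],
    "Ledger Link": ["MSS.366.4", "MSS.366.4", "MSS.366.4",
                    "MSS.366.5", "MSS.366.5", "MSS.366.5"],
})

# Collect the distinct names recorded in each ledger.
application_names = set(toy.loc[toy["Ledger Link"] == "MSS.366.4", "fullName"])
directory_names = set(toy.loc[toy["Ledger Link"] == "MSS.366.5", "fullName"])

# Names present in both ledgers, and names found only among the applications.
in_both = application_names & directory_names
application_only = application_names - directory_names

print(len(in_both))
print(sorted(application_only))
```

Working with sets of names keeps the comparison in units of names rather than rows, which avoids counting each matched person twice.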

This also raises the question of why not all of the names in the application ledger appear in the directory. Because `keep=False` flags every row that shares a name, the roughly 10,000 flagged rows correspond to only about 5,000 names that appear twice. If everybody had been transferred from the application ledger to the directory, nearly all of the roughly 13,000 application entries would have a match, so about 8,000 applicants seem to be missing from the directory, at least under exact name matching.

<div align="center">
  <img
    src="../assets/img/rejected.png"
    width="40%"
    style="border: 2px solid"
    alt="annotations noting a rejected applicant"
  />
  <figcaption style="font-style: italic">
    Annotations noting rejected applicants
  </figcaption>
</div>

In [196]:
    #When looking at the PDF of the application ledger, there are instances where it is noted that an applicant was rejected. 
    #check to see how many of those listed in application ledger were rejected
application_df['Notes & Remarks'].str.contains('Reject').value_counts()

Notes & Remarks
False    423
True     388
Name: count, dtype: int64

Since only 388 entries were noted as rejected, rejections do not account for the thousands of applicants who do not seem to appear in both the application ledger and the directory ledger. Regardless, this emphasizes the difficulties of treating two different datasets as equal.

### Member vs. Nonmember Binary
The concept of rejection also points to another problematic assumption about the dataset: the binary of somebody being either a “member” or “not a member” of the KKK based on their existence in the dataset. This binary does not take into account the nuances of membership, which include not only rejections but also resignations and statuses like “banished” that appear in the ledger annotations.

<div align="center">
  <img
    src="../assets/img/banished.png"
    width="40%"
    style="border: 2px solid"
    alt="banished annotation"
  />
</div>


<div align="center">
  <img
    src="../assets/img/suspended.png"
    width="50%"
    style="border: 2px solid"
    alt="suspended annotation"
  />
</div>

<div align="center">
  <img
    src="../assets/img/resigned.png"
    width="50%"
    style="border: 2px solid"
    alt="resigned annotation"
  />
  <figcaption style="font-style: italic">
    Examples of "Banished," "Suspended," and "Resigned" annotations
  </figcaption>
</div>

In [202]:
    #check to see how many people were noted as resigned
directory_df['Notes & Remarks'].str.contains('Resigned').value_counts()

Notes & Remarks
False    577
True      30
Name: count, dtype: int64

In [203]:
    #However, I found 30 resigned by page 65 out of 741. A brief scroll showed there were definitely more.
    #Look up Carl S Homsher to see his notes and remarks to double check the accuracy
directory_df[directory_df['fullName'] == 'Carl S Homsher']

Unnamed: 0,itemID,Number,fullName,Prefix,Last Name,First Name,Middle Name,Suffix,residenceAddress1,residenceAddress2,...,ledger,link,Ledger Link,Page,pdfFileName,symbolExists,sNumber,sErased,Notes & Remarks,Column 29
1482,K01483,2481.0,Carl S Homsher,,Homsher,Carl,S,,1540 Monroe,,...,MSS.366.5,h-co.org/ledger2,MSS.366.5,70,mss366-5_FC-p74_QC,True,False,False,"Name struck; ""RESIGNED-OCT-1925""",


In [204]:
    #Realized that str.contains is case sensitive, which explains why the first search found so few
directory_df['Notes & Remarks'].str.contains('RESIGNED').value_counts()

Notes & Remarks
False    423
True     184
Name: count, dtype: int64
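
Because `str.contains` is case sensitive by default, both spellings can be counted in one pass by passing `case=False`. A minimal sketch on hypothetical notes: `na=False` treats blank notes as non-matches, and apart from the first string, which is the note from the lookup above, the sample values are made up.

```python
import pandas as pd

# Sample notes: the first is quoted from the lookup above, the rest are hypothetical.
notes = pd.Series([
    'Name struck; "RESIGNED-OCT-1925"',
    "Resigned",
    "Rejected",
    None,  # blank note
])

# case=False matches "Resigned" and "RESIGNED" alike;
# na=False counts blank notes as False instead of leaving them missing.
resigned = notes.str.contains("resigned", case=False, na=False)

print(resigned.sum())      # entries mentioning a resignation in any casing
print(notes.isna().sum())  # entries with no note at all
```

Reporting the number of blank notes alongside the matches makes it clear how many entries the search could never have matched.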

In [205]:
    #I also wanted to check the notation of somebody who had a strike through their name but no annotations
directory_df[directory_df['fullName'] == 'L M Jones']

Unnamed: 0,itemID,Number,fullName,Prefix,Last Name,First Name,Middle Name,Suffix,residenceAddress1,residenceAddress2,...,ledger,link,Ledger Link,Page,pdfFileName,symbolExists,sNumber,sErased,Notes & Remarks,Column 29
1551,K01552,2550.0,L M Jones,,Jones,L,M,,3456 West 32nd Ave,,...,MSS.366.5,h-co.org/ledger2,MSS.366.5,73,mss366-5_FC-p74_QC,False,True,False,,


While around 200 resignations is not a significant portion of the membership, they show that there are nuances in membership. Not only that, but the L M Jones lookup above also reveals that the dataset has no note for people whose names were struck out without explanation, another nuance that cannot be analyzed through the dataset alone.

In general, a person’s presence in the data assigns a certain equal meaning to being a “member” of the KKK. This data therefore misses the nuances and the stories associated with Colorado’s KKK membership in the 1920s. This is exacerbated by the fact that there is very little year data associated with the dataset. While we can see some members resigning, there is also no way to tell for how long people were members of the organization or track patterns in membership.
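
Some annotations do embed a date, as in the `RESIGNED-OCT-1925` note returned by the Carl S Homsher lookup. As a rough sketch, assuming such notes contain a four-digit year beginning with 19, the year could be pulled out with a regular expression. The sample strings other than that one note are hypothetical, and this would recover dates only for entries that happen to carry a dated annotation.

```python
import pandas as pd

notes = pd.Series([
    'Name struck; "RESIGNED-OCT-1925"',  # note quoted from the lookup above
    "Name struck",                       # hypothetical: no date given
    None,                                # hypothetical: blank note
])

# Capture a standalone four-digit year starting with 19.
# expand=False returns a Series; rows without a match stay missing.
years = notes.str.extract(r"\b(19\d{2})\b", expand=False)

print(years.iloc[0])
print(years.notna().sum())  # how many notes yielded a year
```

Even where this works, a single resignation year says nothing about when the person joined, so it would not by itself establish how long they were a member.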

The lack of demographic data, including occupation, class, or age, eliminates more of the nuance of this story. (Of course, it is already notable that the ledgers only include men, implying a male-only organization despite the involvement of women, which is another instance of data erasing stories.) While demographic data comes with its own complications and “cooking” of data, its absence leaves a hole in the stories that can be told from the data.

### Geographic Limitations
Another area for misinterpretation of this data is the geographic scope, particularly with respect to how History Colorado presented the data in their map. Annotations indicating a person had been “demitted” to places like Pueblo and Greeley imply that the ledgers were centered on the Denver area. However, History Colorado’s map has data in all parts of the state, prompting users to “explore how they [the KKK] were spread across the state.” This could lead people to draw conclusions about the particular prominence of the KKK in Denver, when in actuality the concentration may simply reflect that the ledgers themselves were focused there.

<div align="center">
  <img
    src="../assets/img/map.png"
    width="40%"
    style="border: 2px solid"
    alt="history colorado's map"
  />
  <figcaption style="font-style: italic">
    History Colorado's map showing statewide membership locations
  </figcaption>
</div>

Additionally, History Colorado does not note that many of the ledger entries did not include address information.

In [212]:
    #How many entries did not include addresses?
kkk_df['residenceAddress1'].isna().value_counts()

residenceAddress1
False    14921
True     14714
Name: count, dtype: int64

In [213]:
    #Was there any address info included in the application ledger?
application_df['residenceAddress1'].isna().value_counts()

residenceAddress1
True    12898
Name: count, dtype: int64

The answer? No.

In [215]:
    #How much address info was missing in the directory?
directory_df['residenceAddress1'].isna().value_counts()

residenceAddress1
False    14921
True      1816
Name: count, dtype: int64

In [216]:
    #What was the percentage of missing addresses in the directory?
directory_df['residenceAddress1'].isna().value_counts(normalize=True)

residenceAddress1
False    0.891498
True     0.108502
Name: proportion, dtype: float64

Even in the directory ledger, about 11% of the entries have no address data, so using the map to show distribution, even within Denver itself, is a problematic exercise.
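
The per-ledger checks above can be condensed into a single grouped calculation. A minimal sketch on hypothetical rows, assuming the same `Ledger Link` and `residenceAddress1` columns used in the cells above:

```python
import pandas as pd

# Hypothetical toy rows: two application entries, four directory entries.
toy = pd.DataFrame({
    "Ledger Link": ["MSS.366.4", "MSS.366.4",
                    "MSS.366.5", "MSS.366.5", "MSS.366.5", "MSS.366.5"],
    "residenceAddress1": [None, None,
                          "1 Main St", None, "2 Oak Ave", "3 Elm St"],
})

# isna() gives True for missing addresses; the mean of a boolean
# column within each ledger is the share of entries lacking an address.
missing_share = (
    toy["residenceAddress1"].isna()
    .groupby(toy["Ledger Link"])
    .mean()
)

print(missing_share)
```

Grouping by ledger keeps the two sources side by side, so a gap that belongs entirely to one ledger is not mistaken for a property of the whole dataset.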

### Conclusion
While it is valuable to confront Colorado’s racist past and history with the KKK, one should approach the data included in History Colorado’s digitized KKK ledgers with caution. While the KKK was certainly a prominent organization in Colorado in the 1920s, the ledgers do not tell the story of this experience, only showing membership without context as to what this membership meant. Not only that, but combining the two ledgers into a single dataset misrepresents the data, inflating numbers and assuming commonalities between data collected for vastly different purposes.