The dataset can be downloaded [here](https://github.com/zygmuntz/goodbooks-10k). 


In [212]:
import pandas as pd
book_tags = pd.read_csv('book_tags.csv')
books = pd.read_csv('books.csv')
tags = pd.read_csv('tags.csv')

We first want to identify authors of color based on book tags. The authors column in the books dataframe needs cleaning first, because some entries (audiobooks, for example) list multiple authors. We only need the main author, who is listed first.

In [213]:
books['authors'] = books.authors.apply(lambda x: x.split(',')[0])
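
The same cleanup can also be written with pandas' vectorized string methods. The sketch below is only an illustration: `toy_books` and its rows are made up, and the extra `str.strip()` call is an assumption about stray whitespace rather than something the real data is known to need.

```python
import pandas as pd

# Hypothetical stand-in for the real books dataframe
toy_books = pd.DataFrame({
    'authors': ['Author X, Narrator Q', 'Author Y', ' Author Z , Illustrator R']
})

# Keep only the first listed author and trim surrounding whitespace
toy_books['main_author'] = (
    toy_books['authors'].str.split(',').str[0].str.strip()
)
print(toy_books['main_author'].tolist())
```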

Now we search for tag names that indicate an author of color.

In [214]:
key_words = ['poc', 'black', 'african', 'asian', 'color', 'latino', 'trans', 'hispanic']
poc_tags = []
# look for tags containing any of the key words above together with 'author(s)' or 'writer(s)'
for i in range(tags.shape[0]):
    tag = tags.tag_name.loc[i].split('-')
    for item in key_words:
        if item in tag and ('author' in tag or 'authors' in tag or 'writer' in tag or 'writers' in tag):
            poc_tags.append([tags.tag_id.loc[i], tags.tag_name.loc[i]])
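
The matching rule can be isolated in a small helper, which makes it easy to try on a few tag names. This is a sketch, not the notebook's code: `is_poc_author_tag` and the example tags are hypothetical. It matches whole hyphen-separated tokens only, and it reports each tag once even when several key words match, whereas the loop above appends a tag once per matching key word.

```python
KEY_WORDS = {'poc', 'black', 'african', 'asian', 'color', 'latino', 'trans', 'hispanic'}
ROLE_WORDS = {'author', 'authors', 'writer', 'writers'}

def is_poc_author_tag(tag_name):
    """Return True if the hyphen-separated tag has a key word and a role word."""
    parts = set(tag_name.split('-'))
    return bool(parts & KEY_WORDS) and bool(parts & ROLE_WORDS)

# Made-up tag names, for illustration only
examples = ['black-authors', 'asian-american-writers', 'favorite-authors',
            'blackmail', 'authors-of-color']
matches = [t for t in examples if is_poc_author_tag(t)]
print(matches)
```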
            

In [215]:
poc_tag_ids = [tag[0] for tag in poc_tags]
poc_tag_ids[:5]

[1754, 1767, 1777, 3220, 3223]

We want to use these book tags to label the authors as POCs. We will merge the book_tags dataframe with the books dataframe to match them up.

In [216]:
author_tags = book_tags.merge(books[['goodreads_book_id', 'book_id', 'authors']], on='goodreads_book_id')
author_tags.head(3)

Unnamed: 0,goodreads_book_id,tag_id,count,book_id,authors
0,1,30574,167697,27,J.K. Rowling
1,1,11305,37174,27,J.K. Rowling
2,1,11557,34173,27,J.K. Rowling


Let's make a new column called 'poc' to label authors that are POC.

In [217]:
import numpy as np
author_tags['poc'] = ""
author_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count,book_id,authors,poc
0,1,30574,167697,27,J.K. Rowling,
1,1,11305,37174,27,J.K. Rowling,
2,1,11557,34173,27,J.K. Rowling,
3,1,8717,12986,27,J.K. Rowling,
4,1,33114,12716,27,J.K. Rowling,


In [218]:
# label every row whose tag is one of the POC author tags;
# .loc[mask, column] writes to the dataframe itself instead of a temporary copy
author_tags.loc[author_tags.tag_id.isin(poc_tag_ids), 'poc'] = 1

In [219]:
print('There are ', author_tags[author_tags.poc == 1].groupby('authors').poc.first().sum(), ' authors tagged as POC.')

There are  88  authors tagged as POC.


Let's try to identify more authors of color by scraping a list of POC authors from Goodreads.com. We'll need to clean it too.

In [220]:
urls = [f'https://www.goodreads.com/list/show/96119._ReadPOC_List_of_Books_by_Authors_of_Color?page={i}' for i in range(1,12)]

In [221]:
from bs4 import BeautifulSoup
import requests

In [222]:
goodreads_poc = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    for i in soup.find_all('span', itemprop = 'name'):
        goodreads_poc.append(i.text)

In [223]:
# the scraped names alternate between book title and author, so take every second entry
goodreads_poc_authors = goodreads_poc[1::2]
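
A small made-up example shows why taking every second entry by position is safer than looking positions up with `list.index`, which always returns the first occurrence of a value. The list below is hypothetical and assumes the same title, author, title, author ordering as the scraped data.

```python
# Made-up scrape result: a book whose title is also a later author's name
names = ['Author Y', 'Author X', 'Book Two', 'Author Y']

# Position lookup: 'Author Y' is first seen at index 0, so it is dropped
by_index = [item for item in names if names.index(item) % 2 == 1]

# Slicing keeps every entry at an odd position, duplicates included
by_slice = names[1::2]

print(by_index)
print(by_slice)
```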

In [224]:
# label every row whose author appears in the scraped Goodreads list
author_tags.loc[author_tags.authors.isin(goodreads_poc_authors), 'poc'] = 1

In [225]:
print('There are ', author_tags[author_tags.poc == 1].groupby('authors', as_index = False).poc.first().shape[0], ' authors tagged as POC.')

There are  186  authors tagged as POC.


In [226]:
authors_df = pd.DataFrame(author_tags[['authors', 'poc']])
authors_df = authors_df.drop_duplicates(subset = ['authors', 'poc'])
authors_df = authors_df.set_index('authors')
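
One thing to watch for: dropping duplicates on both columns keeps an author twice if some of their rows are labelled and others are not. The sketch below shows one possible way to collapse such rows to a single entry per author. The frame `toy` and the rule "an author is labelled if any of their rows is" are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Made-up rows shaped like author_tags[['authors', 'poc']]
toy = pd.DataFrame({
    'authors': ['Author X', 'Author X', 'Author Y', 'Author Y'],
    'poc': [1, "", "", ""],
})

# Authors with at least one labelled row
flagged = toy.loc[toy['poc'] == 1, 'authors'].unique()

# One row per author, unlabelled by default (object dtype holds both "" and 1)
authors = pd.Index(toy['authors'].unique(), name='authors')
toy_authors = pd.DataFrame({'poc': pd.Series("", index=authors, dtype=object)})
toy_authors.loc[toy_authors.index.isin(flagged), 'poc'] = 1

print(toy_authors)
```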

Finally, let's write a function that lets us hand-label authors as POC, in case we want to do it manually. The function pickles the dataframe every time a label is entered, so progress is saved along the way.

We first need the following script to be able to print out a link. (I did not write this script.)

In [232]:
"""URL Wrapper."""

from dataclasses import dataclass


@dataclass(frozen=True)
class Url:
    """Wrapper around a URL string to provide nice display in IPython environments."""

    __url: str

    def _repr_html_(self):
        """HTML link to this URL."""
        return f'<a href="{self.__url}">{self.__url}</a>'

    def __str__(self):
        """Return the underlying string."""
        return self.__url

The following function can be used to hand-label authors. It also prints a link to the author's Wikipedia page. It works on any dataframe whose index holds the author names and which has a column labelled 'poc' with entries "", 1 or 0.

In [233]:
import pickle

In [234]:
def hand_label(authors_dataframe):
    '''Hand-label authors in a dataframe indexed by author name with a 'poc' column.
    Enter 'poc' if the author is a POC, 'w' if not, and 'exit' to end the program.
    Any other input skips the author without labelling.'''
    for author in authors_dataframe.index:
        if authors_dataframe.loc[author, 'poc'] != "":
            continue  # already labelled
        print(Url('https://en.wikipedia.org/wiki/' + author.replace(' ', '_')))
        author_label = input(author + ' ')
        if author_label == 'exit':
            break
        if author_label in ('poc', 'w'):
            # .loc[row, column] writes to the dataframe itself rather than a copy
            authors_dataframe.loc[author, 'poc'] = 1 if author_label == 'poc' else 0
            # save progress after every label
            with open('authors_df.pkl', 'wb') as picklefile:
                pickle.dump(authors_dataframe, picklefile)
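
Because progress is pickled after each label, a later session can pick up where an earlier one stopped. The following is a minimal sketch of that idea, assuming the file name `authors_df.pkl` used in the function; the frame `toy_authors` is made up, and the file is written to a temporary directory so that nothing real is overwritten.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for authors_df
toy_authors = pd.DataFrame({
    'poc': pd.Series([0, 1, ""],
                     index=['Author X', 'Author Y', 'Author Z'],
                     dtype=object)
})
toy_authors.index.name = 'authors'

# Save to a temporary location instead of the working directory
path = os.path.join(tempfile.mkdtemp(), 'authors_df.pkl')
toy_authors.to_pickle(path)

# Later: reload the frame and list the authors that still need a label
restored = pd.read_pickle(path)
remaining = restored.index[restored['poc'] == ""].tolist()
print(remaining)
```

Passing the reloaded frame back to `hand_label` would then prompt only for the authors that are still unlabelled.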


Try labelling some! Make sure to do your research.

In [235]:
hand_label(authors_df)

https://en.wikipedia.org/wiki/J.K._Rowling
J.K. Rowling w
https://en.wikipedia.org/wiki/Douglas_Adams
Douglas Adams w
https://en.wikipedia.org/wiki/Bill_Bryson
Bill Bryson w
https://en.wikipedia.org/wiki/J.R.R._Tolkien
J.R.R. Tolkien w
https://en.wikipedia.org/wiki/Chris___Smith
Chris   Smith exit


In [236]:
authors_df.poc.value_counts()

     3736
1     186
0       4
Name: poc, dtype: int64