## Data acquisition file for [Peña and Choi (2021, *Economics Letters*)](https://doi.org/10.1016/j.econlet.2021.109968)
- Corresponding author for this file: Jun Ho Choi (junhoc@uchicago.edu)

This `.ipynb` file contains the Python code used to acquire the data for the article **"Female representation among notable people born in 1700-2000" (Peña and Choi, 2021, *Economics Letters*)**. Note that, because the Wikipedia and Wikidata databases change over time, rerunning the process below is unlikely to reproduce exactly the data that we collected in November 2020 and used for our article.

## 0. Importing necessary modules and packages

We assume that all of the Python modules and packages imported below are already installed before running the code.

In [39]:
import pandas as pd
import numpy as np
import dask.distributed as dd
import requests as req
import time
import warnings

from joblib import Parallel, delayed
from json.decoder import JSONDecodeError
from requests.exceptions import ChunkedEncodingError
from qwikidata.sparql import return_sparql_query_results as sparql_res
from bs4 import BeautifulSoup as bsoup
from dask.diagnostics import ProgressBar as daskPBar
from tqdm.auto import tqdm

We will also set up the `dask` cluster here for parallelization. Note that `N_WORKERS` can be set to any other number depending on the machine's capabilities.

In [None]:
N_WORKERS = 4
cluster = dd.LocalCluster(
    n_workers=N_WORKERS, threads_per_worker=1, memory_limit=(2.5 * 1024 ** 3)
)
client = dd.Client(cluster)
client

## 1. Query block (in SPARQL) and the initial fetch of relevant Wikidata information

The code below provides the components needed for the initial fetch of relevant Wikidata information. In the last sub-section (1.5), we demonstrate how these components come together to acquire that information.

### 1.1. Query block

The query block below is written in SPARQL, the standard query language for fetching data from Wikidata. It will be used in conjunction with the `qwikidata` module to fetch query results into the Python environment. For a detailed introduction to using SPARQL to fetch relevant information from Wikidata, please refer to [this video by the Wikimedia Foundation](https://www.youtube.com/watch?v=kJph4q0Im98).

In [3]:
## structure of the query
query_block = """
SELECT ?person ?personLabel ?occupation ?occupationLabel ?birthplace ?birthplaceLabel ?dob ?dobLabel ?region ?regionLabel ?gregion ?gregionLabel ?ggregion ?ggregionLabel ?sex ?sexLabel
WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P19 ?birthplace .
  OPTIONAL {
    ?birthplace wdt:P131 ?region.
    OPTIONAL {
      ?region wdt:P131 ?gregion.
      OPTIONAL {
        ?gregion wdt:P131 ?ggregion.
      }
    }
  }
  OPTIONAL {
    ?person wdt:P106 ?occupation 
  }
  ?birthplace wdt:P17 wd:%s .
  
  ?person wdt:P21 ?sex .
  
  ?person wdt:P569 ?dob .
  FILTER (YEAR(?dob) = %s).

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
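
The two `%s` placeholders in `query_block` (the country code and the birth year) are filled in with Python's `%` string formatting before the query is sent. The following is a minimal sketch of that substitution; `mini_block` is a shortened, hypothetical stand-in for the full query and is not meant to be run against Wikidata.

```python
# Shortened stand-in for `query_block`; only the two placeholder lines are kept.
mini_block = """
  ?birthplace wdt:P17 wd:%s .
  FILTER (YEAR(?dob) = %s).
"""

# "Q142" is the Wikidata code for France (see the dictionary in sub-section 1.2).
filled = mini_block % ("Q142", "1900")
print(filled)
```

The order of the tuple matters: the first element replaces the country placeholder and the second replaces the year placeholder.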

### 1.2. Dictionary (or `pandas.Series`) of country name to country code

For each country, there exists a corresponding Wikidata-specific country code (which can be found by searching on Wikidata). We need to map each country name to its country code in order to retrieve country-specific information from Wikidata. A `pandas.Series` can be used in place of the dictionary. Note that this list does not exhaust the countries in Wikidata; it contains only the 30 countries we use in our study (roughly accounting for 75 percent of the world population).

In [4]:
## note that there are multiple names for some of the countries
wiki_dict = {
    "United States of America": "Q30",
    "United States": "Q30",
    "USA": "Q30",
    "Germany": "Q183",
    "France": "Q142",
    "United Kingdom": "Q145",
    "UK": "Q145",
    "Italy": "Q38",
    "Russia": "Q159",
    "Russian Federation": "Q159",
    "Japan": "Q17",
    "Spain": "Q29",
    "Brazil": "Q155",
    "India": "Q668",
    "Turkey": "Q43",
    "China": "Q148",
    "Mexico": "Q96",
    "South Korea": "Q884",
    "Iran": "Q794",
    "Indonesia": "Q252",
    "South Africa": "Q258",
    "Colombia": "Q739",
    "Philippines": "Q928",
    "Pakistan": "Q843",
    "Nigeria": "Q1033",
    "Egypt": "Q79",
    "Thailand": "Q869",
    "Vietnam": "Q881",
    "Viet Nam": "Q881",
    "Bangladesh": "Q902",
    "Kenya": "Q114",
    "Democratic Republic of the Congo": "Q974",
    "DR Congo": "Q974",
    "Myanmar": "Q836",
    "Ethiopia": "Q115",
    "Tanzania": "Q924",
}

wiki_name_series = pd.Series(wiki_dict)
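
Because several names can point to the same country, the number of dictionary keys is larger than the number of countries. As a small illustration (using only a subset of `wiki_dict`, copied here so that the snippet runs on its own), the distinct countries can be recovered from the set of codes:

```python
# Subset of `wiki_dict`, copied from the cell above for illustration.
subset = {
    "United States of America": "Q30",
    "United States": "Q30",
    "USA": "Q30",
    "France": "Q142",
    "Vietnam": "Q881",
    "Viet Nam": "Q881",
}

# Aliases of the same country resolve to the same Wikidata code.
same_code = subset["USA"] == subset["United States"]

# Distinct codes correspond to distinct countries, however many aliases exist.
distinct_codes = sorted(set(subset.values()))
print(same_code, distinct_codes)
```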

### 1.3. Functions for fetching relevant data from Wikidata

#### 1.3.1. Data retrieval for a single country-year

In [5]:
def querydata_organizer(qdat, val="value"):
    """
    Organizing the raw JSON or dictionary version of query data, then
    putting it into a list which can be used to create dataframes

    Inputs:
    - qdat (dict): returned query data (after making the SPARQL query)
    - val (str): the string designation for finding the necessary value

    Output:
    - result_lst (list of lists): information from query, in an organized
        fashion (easily transform-able to pandas dataframe or csv)
    """
    res = qdat["results"]["bindings"]

    result_lst = []
    for i in res:
        ## name and wikidata URL
        name = i["personLabel"][val]
        wikidata_url = i["person"][val]

        ## birthplace and where the said location(s) belong to
        bplace = i["birthplace"][val]
        bplaceLabel = i["birthplaceLabel"][val]
        if i.get("regionLabel") is None:
            region, regionLabel = np.nan, np.nan
        else:
            regionLabel = i["regionLabel"][val]
            region = i["region"][val]

        if i.get("gregion") is None:
            gregion, gregionLabel = np.nan, np.nan
        else:
            gregionLabel = i["gregionLabel"][val]
            gregion = i["gregion"][val]

        if i.get("ggregion") is None:
            ggregion, ggregionLabel = np.nan, np.nan
        else:
            ggregionLabel = i["ggregionLabel"][val]
            ggregion = i["ggregion"][val]

        ## cleaning occupation
        if i.get("occupation") is None:
            occupation, occupationLabel = np.nan, np.nan
        else:
            occupationLabel = i["occupationLabel"][val]
            occupation = i["occupation"][val]

        dob = i["dobLabel"][val]
        yr = dob[0:4]
        month = dob[5:7]
        day = dob[8:10]
        sex = i["sexLabel"][val]
        result_lst.append(
            [
                name,
                wikidata_url,
                bplaceLabel,
                bplace,
                regionLabel,
                region,
                gregionLabel,
                gregion,
                ggregionLabel,
                ggregion,
                occupationLabel,
                occupation,
                yr,
                month,
                day,
                sex,
            ]
        )

    return result_lst
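
To make the expected input concrete, the sketch below builds one row in the shape found under `results` → `bindings` and extracts values the same way `querydata_organizer` does. The row contents (including the `Q0000` identifier) and the helper `get_value` are made up for illustration; they are not part of the original code.

```python
import math

# Hypothetical single row in the shape found under results -> bindings;
# the values (including the Q0000 identifier) are made up for illustration.
binding = {
    "person": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000"},
    "personLabel": {"type": "literal", "value": "Example Person"},
    "dobLabel": {"type": "literal", "value": "1900-05-17T00:00:00Z"},
}


def get_value(row, key, val="value"):
    """Return row[key][val], or NaN when an OPTIONAL variable is unbound."""
    cell = row.get(key)
    return float("nan") if cell is None else cell[val]


name = get_value(binding, "personLabel")
occupation = get_value(binding, "occupationLabel")  # absent in this row -> NaN
dob = get_value(binding, "dobLabel")

# Same string slicing as in `querydata_organizer`.
year, month, day = dob[0:4], dob[5:7], dob[8:10]
print(name, year, month, day, occupation)
```

Variables inside an `OPTIONAL` clause are simply missing from a row when unbound, which is why the lookup goes through `.get` rather than direct indexing.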

In [40]:
def query_organize_pipeline(
    year, country, qblock=query_block, wiki_name_to_code=wiki_name_series, val="value"
):
    """
    Specify a specific year (of birth), make the query, and organize the returned
    information into a list of lists

    inputs:
    - year (int / str): year of birth for the people we are interested in
    - country (str): country that we want to search
    - qblock (str): query block to be used in fetching the data; use default
        unless there is a need to fetch other data
    - wiki_name_to_code (dict or pandas.Series): dictionary or pandas.Series that
        maps country name to relevant country code used in Wikidata
    - val (str): name of the value that we want to acquire; default "value"

    outputs:
    - if successful, then organized values from the query in list of lists
        will be returned; if not, the unsuccessful year will be returned

    """

    yr = str(year)
    cntry = str(wiki_name_to_code[country])
    query_itself = qblock % (cntry, yr)

    try:
        res = sparql_res(query_itself)
        res_organized = querydata_organizer(res, val)
    except (JSONDecodeError, ChunkedEncodingError) as error_tuple:
        try:
            res = sparql_res(query_itself)
            res_organized = querydata_organizer(res, val)
        except (JSONDecodeError, ChunkedEncodingError) as error_tuple:
            res_organized = year

    return res_organized

#### 1.3.2. Parallelization (for multiple years) and making sure that data have been retrieved properly

Due to issues such as unstable connectivity to Wikidata query services or general Internet connectivity, there may be cases in which information is not retrieved properly across all years. The following process is to ensure that all desired data is acquired while parallelizing the data retrieval process across all years.

In [67]:
def multiyear_querydata(
    years,
    country,
    qblock=query_block,
    wiki_name_to_code=wiki_name_series,
    val="value",
    cl=client,
):
    """
    Specify multiple years for retrieving multiple-birth-year data from Wikidata.

    Inputs:
    - years (array-like of int): containing the relevant birth years that we want
        to acquire information about, pertaining to the people of a specified country
    - country (str): name of the country from which to acquire information
    - qblock (str): query block to be used in fetching the data; use default
        unless there is a need to fetch other data
    - wiki_name_to_code (dict or pandas.Series): dictionary or pandas.Series that
        maps country name to relevant country code used in Wikidata
    - val (str): name of the value that we want to acquire; default "value"
    - cl (dask.distributed.client.Client): dask client for parallelization

    Outputs:
    - gather_all (list of lists): containing all retrieved information, one row per line
        from the returned Wikidata query
    """

    gather_fn = lambda x: query_organize_pipeline(
        x, country, qblock, wiki_name_to_code, val
    )

    target_years = list(years)
    incomplete = True
    gather_completed = []
    while incomplete:
        cl.restart()
        gather = cl.map(gather_fn, target_years)
        gather = cl.gather(gather)

        check_incomplete = [x for x in gather if type(x) == int]
        gather_completed += [x for x in gather if type(x) != int]

        if len(check_incomplete) == 0:
            incomplete = False
        else:
            target_years = check_incomplete
            print(
                "re-collecting data, has {} incomplete years.".format(len(target_years))
            )

    gather_all = []
    for i in gather_completed:
        gather_all += i

    return gather_all
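
The retry logic in `multiyear_querydata` can be tried out without any network access. The sketch below replaces the Wikidata call with a hypothetical `flaky_fetch` function that fails at random, and runs sequentially rather than through `dask`; it only illustrates the "retry the failed years until none remain" loop.

```python
import random


def flaky_fetch(year, rng):
    """Hypothetical stand-in for `query_organize_pipeline`: returns a list of
    rows on success, or the bare year (an int) to signal a failed fetch."""
    if rng.random() < 0.3:
        return year
    return [["person born in {}".format(year), year]]


rng = random.Random(0)  # seeded so that the run is repeatable
pending = list(range(1900, 1906))
completed = []

while pending:
    results = [flaky_fetch(y, rng) for y in pending]
    # Successful fetches are lists; failed fetches come back as the year itself.
    completed += [r for r in results if not isinstance(r, int)]
    pending = [r for r in results if isinstance(r, int)]

# Flatten the per-year lists into one list of rows.
rows = [row for batch in completed for row in batch]
print(len(rows))
```

Each year is appended to `completed` exactly once, on the pass where its fetch succeeds, so no year is duplicated or lost.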

### 1.4. Cleaning data, while separating out the Jan. 1st-birthdays and non-Jan. 1st-birthdays

One particular issue that we detected while using the information retrieved from Wikidata is that some people have unclear birth years. For instance, see the case of [Ira Washington Rubel](https://www.wikidata.org/wiki/Q1650128), whose date of birth is noted as `19. century` rather than as a specific birth year. This would not pose a great problem if the Wikidata query simply returned the birth year as "unknown" or "not available"; however, in these cases, the query still returns numbers for the unavailable birth year (and for the birth month and day as well). In the example of Ira Washington Rubel, the birth year, month, and day were returned as `1900`, `01`, and `01`. Similarly, those whose date of birth is recorded as `20. century` on the Wikidata website would have their birth year, month, and day returned as `2000`, `01`, and `01`.

This issue can be problematic because a researcher who uses the birth year data without precaution may end up with many people whose birth years are misspecified. It also becomes difficult to distinguish, say, those who were actually born in the year 1900 from those whose birth year is returned as `1900` only because of how the Wikidata query handles imprecise dates. One pattern that we observe is that, whenever the exact birth year is unknown, the Wikidata webpage also lacks birth month and day information, and the Wikidata query returns **January 1st** (that is, `01` and `01` as birth month and day, respectively).

Acknowledging this rather minor but potentially important issue, we follow the below procedure:
1. Knowing that those with incomplete birth date information would have January 1st (`01` and `01`) as their birth month and day information, separate out those having birth month-day as January 1st versus those not having the said birth month-day. Retain the information of the latter group of people.
2. Refer back to the Wikidata website (as opposed to the query) for the people with January 1st as their query-specified birth month-days to see whether they were indeed born on that day. If so, retain those whose birth month-days are verified to be January 1st.
3. For those whose Wikidata webpages indicate their birth month-days are not January 1st, retain if their birth **years** are well-specified (e.g., 1900 instead of 19. century). If not, drop their information.
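
Step 1 of the procedure above can be sketched in plain Python as follows. The date strings are hypothetical examples written in the format that the query returns, and `is_jan1` is an illustrative helper rather than a function used elsewhere in this file.

```python
# Hypothetical date strings in the format returned by the query.
dobs = {
    "precise birthday": "1900-05-17T00:00:00Z",
    "century-only entry": "1900-01-01T00:00:00Z",
    "born on New Year's Day": "1923-01-01T00:00:00Z",
}


def is_jan1(dob):
    """True when the month and day parts of the date string are both 01."""
    return dob[5:7] == "01" and dob[8:10] == "01"


# January 1st cases need to be checked against the Wikidata webpage;
# all other cases are retained as they are.
needs_check = [name for name, dob in dobs.items() if is_jan1(dob)]
keep_as_is = [name for name, dob in dobs.items() if not is_jan1(dob)]
print(needs_check, keep_as_is)
```

Note that the query output alone cannot separate the second and third entries; that is what steps 2 and 3 are for.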

For this sub-section, we introduce a function that prepares for the above process by (i) separating those born on January 1st from those who were not and (ii) organizing the information in a `pandas.DataFrame` format.

In [64]:
def organize_j1_vs_nj1(df_of_lst):
    """
    Organizing the data retrieved from Wikidata queries (which is in list format)
    into pandas.DataFrame, and separating January 1st birthdays and non-January 1st
    birthdays.

    Inputs:
    - df_of_lst (array-like of array-likes): resulting information from running
        the function `multiyear_querydata`, which has retrieved information from
        relevant Wikidata queries

    Outputs:
    - j1, nj1 (pandas DataFrames): organized Wikidata query information, where the
        former contains those people's information with birth month-days as
        January 1st, and the latter contains those with birth month-days not as
        January 1st
    """

    df_col = [
        "person",
        "personCode",
        "birthplace",
        "birthplaceCode",
        "birthregion",
        "birthregionCode",
        "birthgregion",
        "birthgregionCode",
        "birthggregion",
        "birthggregionCode",
        "occupation",
        "occupationCode",
        "dob_year",
        "dob_month",
        "dob_day",
        "sex",
    ]

    stacked = np.vstack(df_of_lst)
    df = pd.DataFrame(stacked, columns=df_col).astype(
        {"dob_year": "int64", "dob_month": "int64", "dob_day": "int64"}
    )

    j1 = df.loc[(df["dob_month"] == 1) & (df["dob_day"] == 1), :].copy()
    nj1 = df.loc[(df["dob_month"] != 1) | (df["dob_day"] != 1), :].copy()

    return j1, nj1

### 1.5. Demonstration for this section's codes

Here, we demonstrate extracting the information for those who were born in France between 1900 and 1950 (inclusive).

In [68]:
## fetching data, then dividing between January 1st cases
## and non-January 1st cases
EXAMPLE_J1, EXAMPLE_NJ1 = organize_j1_vs_nj1(
    multiyear_querydata(list(range(1900, 1951)), "France")
)

re-collecting data, has 4 incomplete years.




## 2. Checking for those with birth month-days as January 1st

For the reasons elaborated in sub-section 1.4 above, we need to make sure that those with birth month-days as January 1st have correct information (regarding their birth years). This section checks these cases by going back to the original Wikidata *webpages* (as opposed to the queries).

We also provide brief descriptions of the variables that are included in the final output:
- `person`: Name of the person of interest
- `personCode`: Wikidata URL of the said person
- `birthplace`: Birth place of the said person, usually at the town or city level
- `birthplaceCode`: Wikidata URL of `birthplace`
- `birthregion`: Region to which `birthplace` belongs
- `birthregionCode`: Wikidata URL of `birthregion`
- `birthgregion`: Region to which `birthregion` belongs (i.e., "greater" region)
- `birthgregionCode`: Wikidata URL of `birthgregion`
- `birthggregion`: Region to which `birthgregion` belongs (i.e., "greater-greater" region)
- `birthggregionCode`: Wikidata URL of `birthggregion`
- `occupation`: The said person's occupation
- `occupationCode`: Wikidata URL of `occupation`
- `dob_year`: Birth year of the person, returned by the Wikidata query
- `dob_year_actual`: Birth year of the person, cross-checked with the Wikidata webpage
- `dob_month`: Birth month of the person, returned by the Wikidata query
- `dob_day`: Birth day of the person, returned by the Wikidata query
- `sex`: Sex of the person

Note that if no greater region exists for `birthplace`, `birthregion`, or `birthgregion` (e.g., because it is already at the country level), the corresponding greater-region variables (e.g., `birthgregion` and `birthggregion` for `birthregion`) will be recorded as not available.


### 2.1. Functions for checking the January 1st-birthdays and finalizing the cleanup

In [117]:
def jan1st_checker(wikidata_url, yr, verified):
    """
    Checking, for a Wikidata entry, with the said entry's Wikidata website
    to see if the year of birth is actually the specified year (i.e., `yr`)
    from the Wikidata query

    Inputs:
    - wikidata_url (str): URL for the Wikidata entry
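    - yr (int): year of birth retrieved from the Wikidata query
    - verified (bool or str): result of a previous check; if it is already a
        bool, it is returned as-is without fetching the webpage again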

    Outputs:
    - Either True (actual birth year is the year specified) or False (or not) in
        non-error cases, or the str "Error" if there is some error while retrieving
        information

    """
    if type(verified) == bool:
        return verified

    ## following are the potential cases in which date of birth may appear
    ## with VALID year of birth
    monthday_filled = "1 January {}".format(yr)
    monthday_filled2 = monthday_filled + " Gregorian"
    monthday_filled3 = monthday_filled + "Gregorian"
    simply_year = str(yr)
    valid_cases = [monthday_filled, monthday_filled2, monthday_filled3, simply_year]

    ## fetching the URL request and html information
    url = wikidata_url.replace("/entity/", "/wiki/")
    url_req = req.get(url)
    soup_data = bsoup(url_req.text, "html.parser")
    first_dict = {"data-property-id": "P569"}  ## pertains to the date of birth info
    second_dict = {
        "class": "wikibase-snakview-value wikibase-snakview-variation-valuesnak"
    }
    try:
        check_to = soup_data.find(attrs=first_dict).find_all(attrs=second_dict)
        check_to_lst = [i.text for i in check_to]
        intersection_check = np.intersect1d(valid_cases, check_to_lst)
        if len(intersection_check) > 0:
            return True
        else:
            return False

    except (AttributeError, ChunkedEncodingError) as error_tuple:
        return "Error"
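
The comparison at the heart of `jan1st_checker` can be examined offline. The sketch below reproduces only the matching step; `matches_reported_year` is an illustrative helper, and the input strings are assumed examples of what a date-of-birth field might display rather than values fetched from Wikidata.

```python
def matches_reported_year(displayed_values, yr):
    """Offline version of the comparison in `jan1st_checker`: True if any
    displayed date-of-birth string exactly matches one of the valid forms."""
    base = "1 January {}".format(yr)
    valid_cases = {base, base + " Gregorian", base + "Gregorian", str(yr)}
    return any(value in valid_cases for value in displayed_values)


# Assumed examples of displayed date-of-birth strings.
confirmed_day = matches_reported_year(["1 January 1923"], 1923)
confirmed_year = matches_reported_year(["1900"], 1900)
century_only = matches_reported_year(["19. century"], 1900)
print(confirmed_day, confirmed_year, century_only)
```

The match is exact, so a string such as `19. century` is rejected even though the query reports the year `1900` for it.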

In [164]:
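def clean_and_merge_all_data(j1, nj1, cl=client):
    """
    Verifying the January 1st cases against their Wikidata webpages, then
    merging them back with the non-January 1st cases.

    Inputs:
    - j1, nj1 (pandas DataFrames): outputs of `organize_j1_vs_nj1`
    - cl (dask.distributed.client.Client): dask client for parallelization

    Outputs:
    - total_cases (pandas DataFrame): all cases, with the added column
        `dob_year_actual` (NaN when the birth year could not be verified)
    """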

    j1_cases = j1[["personCode", "dob_year"]].copy()
    j1_cases["truth"] = "Error"
    j1_cases.drop_duplicates(inplace=True)
    j1_cases.reset_index(inplace=True, drop=True)

    first = True
    while j1_cases.dtypes["truth"] == "O":
        cl.restart()
        if first:
            first = False
        else:
            error_n = j1_cases.loc[j1_cases.truth == "Error", :].shape[0]
            print("Further detected {} unresolved cases, retrying..".format(error_n))

        wikidata_urls = list(j1_cases["personCode"])
        reported_years = list(j1_cases["dob_year"])
        retrieved_info = list(j1_cases["truth"])
        all_cases_j1 = tuple(zip(wikidata_urls, reported_years, retrieved_info))

        check_fn = lambda x: jan1st_checker(x[0], x[1], x[2])
        j1_mapped = cl.map(check_fn, all_cases_j1)
        j1_cases["truth"] = cl.gather(j1_mapped)

    j1_cases["dob_year_actual"] = j1_cases["dob_year"].values
    j1_cases.loc[~j1_cases.truth, "dob_year_actual"] = np.nan
    j1_cases = j1_cases.drop(["truth"], axis=1).set_index(["personCode", "dob_year"])
    j1_merged = j1.set_index(["personCode", "dob_year"])
    j1_merged = j1_merged.merge(
        j1_cases, left_index=True, right_index=True, how="left"
    ).reset_index()

    total_cases = nj1.copy()
    total_cases["dob_year_actual"] = total_cases["dob_year"].values
    total_cases = pd.concat([total_cases, j1_merged], axis=0)
    total_cases_columns = [
        "person",
        "personCode",
        "birthplace",
        "birthplaceCode",
        "birthregion",
        "birthregionCode",
        "birthgregion",
        "birthgregionCode",
        "birthggregion",
        "birthggregionCode",
        "occupation",
        "occupationCode",
        "dob_year",
        "dob_year_actual",
        "dob_month",
        "dob_day",
        "sex",
    ]
    total_cases = (
        total_cases[total_cases_columns]
        .sort_values(["dob_year", "dob_month", "dob_day", "person"])
        .reset_index(drop=True)
    )

    return total_cases

### 2.2. Demonstration for this section's codes

We use `EXAMPLE_J1` and `EXAMPLE_NJ1` acquired from sub-section 1.5.

In [165]:
EXAMPLE_MERGED_CASE = clean_and_merge_all_data(EXAMPLE_J1, EXAMPLE_NJ1)



To export (in `.csv` format, say), one can simply follow the below code. Note that `DESIRED_LOC` and `FILE_NAME` need to be changed accordingly.

In [None]:
DESIRED_LOC = "example_directory"
FILE_NAME = "example_file_name.csv"
EXAMPLE_MERGED_CASE.to_csv("/".join([DESIRED_LOC, FILE_NAME]), index=False)

## 3. Final remarks for this file

In our application for **Peña and Choi (2021, *Economics Letters*)**, we ran similar code for each of the 30 countries specified in `wiki_dict` above (also listed in the paper) and for the birth years 1700 to 2000 (inclusive).

Also, note that the output from this file may contain multiple rows for the same person, for reasons such as the person's birthplace belonging to multiple districts or jurisdictions. These redundancies are resolved in our Stata `.do` files.