# Merge NSF Survey of Earned Doctorates Spreadsheets
J. Nathan Matias

Last updated Feb 5, 2022

* "Asian" includes Native Hawaiians or Other Pacific Islanders through 2000, but excludes them since 2001.
* Before 2001, "Other race or race not reported" included respondents who selected more than one race. Since 2001, this category has included Native Hawaiians or Other Pacific Islanders, who previously had been included in the category Asian.
* "Life sciences" includes agricultural sciences and natural resources; biological and biomedical sciences; and health sciences.
* The field "Other" includes other non-science and engineering fields not shown separately.
* All groups other than "Hispanic or Latino" are classified as "Not Hispanic or Latino".
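
The two race-category notes above imply a break in comparability at 2001. The following is a minimal sketch of how that break might be encoded for downstream checks. The function name `category_for_nhpi` is hypothetical and not part of this notebook, and the returned strings follow the wording of the notes, which may differ from the exact row labels in the spreadsheets.

```python
def category_for_nhpi(year):
    """Return the reporting category that contains Native Hawaiians or
    Other Pacific Islanders in a given year, per the table notes."""
    if year <= 2000:
        return "Asian"
    return "Other race or race not reported"

for year in (1999, 2000, 2001, 2020):
    print(year, category_for_nhpi(year))
```

A check like this could flag series that cross the 2001 boundary before counts for "Asian" are compared over time.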

In [120]:
from openpyxl import load_workbook
import pandas as pd
from collections import defaultdict, Counter
import datetime

In [121]:
data_path = "/Users/nathan/Box/Projects/"
nsed_path = "2021-NSF-Survey-Earned-Doctorates/"

year_filenames = {
    "2016": "tab23-2016.xlsx",
    "2017": "sed17-sr-tab023-2017.xlsx",
    "2018": "nsf20301-tab023-2018.xlsx",
    "2019": "nsf21308-tab023-2019.xlsx",
    "2020": "nsf22300-tab023-2020.xlsx"  
}

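# Row labels as they appear in the first column of the spreadsheet.
# The trailing "c" and "d" in "Life sciencesc" and "Otherd" appear to be
# footnote markers fused onto the label, so they are kept for matching.
fields = ["All fields",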
          "Life sciencesc", 
          "Physical sciences and earth sciences", 
          "Mathematics and computer sciences",
          "Psychology and social sciences",
          "Engineering",
          "Education",
          "Humanities and arts",
          "Otherd"]
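
The labels `Life sciencesc` and `Otherd` look like spreadsheet row labels with footnote letters fused onto them. This is an assumption, based on the notes about life sciences and other fields at the top of this notebook. If cleaner names are wanted in the output, a small mapping could be applied after parsing. `CLEAN_FIELD_NAMES` and `clean_field` are hypothetical names used only for this sketch.

```python
# Hypothetical mapping from raw spreadsheet labels to readable names.
# Assumes the trailing "c" and "d" are footnote markers, not part of the name.
CLEAN_FIELD_NAMES = {
    "Life sciencesc": "Life sciences",
    "Otherd": "Other",
}

def clean_field(label):
    """Return a readable field name, leaving unmapped labels unchanged."""
    return CLEAN_FIELD_NAMES.get(label, label)

print(clean_field("Life sciencesc"))
print(clean_field("Engineering"))
```

Keeping the raw labels in `fields` and cleaning them afterwards means the matching against the spreadsheet stays exact.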

### Utility function for generating year rows from an NSF Survey of Earned Doctorates workbook

In [123]:
def generate_year_rows(wb):

    counter = 0

    # dict with the index as key
    # and the column name as the value
    colnames = {}

    ## data structure for storing records:
    ## one row per field + group + year
    ## field, group, year, count
    records = []

    field = None
    group = None


    for row in wb.worksheets[0].values:
        # the first three rows are labels
        if counter < 3:
            counter += 1
            continue

        ## the end of the sequence
        if row[0] is None:
            break

        ## header row
        if counter == 3:
            i = 0
            for colname in row:
                if colname is None:
                    continue
                colnames[i] = colname
                i += 1
            counter += 1
            continue
            
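        ## strip the label once so matching and storage use the same text
        label = row[0].strip()

        ## skip rows labeled "Not Hispanic or Latino"
        if label == "Not Hispanic or Latino":
            counter += 1
            continue

        ## a field label starts a new block; that row holds the field total
        if label in fields:
            group = "Total"
            field = label
        else:
            group = label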

        ## add record
        for i in range(1, len(colnames)):
            year = colnames[i]
            records.append({
                "field": field,
                "group": group,
                "year" : int(year),
                "count": row[i]
            })

        counter += 1
    return records
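
To make the parsing logic easier to follow without the spreadsheets, here is a simplified stand-in that works on plain tuples instead of an openpyxl workbook. `parse_rows` is a hypothetical helper that mirrors the steps of `generate_year_rows`; the row layout is assumed from the code above, and the counts are made up, not real NSF figures.

```python
FIELDS = ["All fields", "Engineering"]  # shortened list for the example

def parse_rows(rows, fields=FIELDS, label_rows=3):
    """Simplified stand-in for generate_year_rows that takes plain tuples."""
    records = []
    colnames = None
    field = None
    for n, row in enumerate(rows):
        if n < label_rows:      # title and label rows at the top
            continue
        if row[0] is None:      # an empty first cell marks the end of the table
            break
        if colnames is None:    # the first row after the labels is the header
            colnames = [c for c in row if c is not None]
            continue
        label = row[0].strip()
        if label == "Not Hispanic or Latino":
            continue
        if label in fields:     # a field row carries the field total
            field, group = label, "Total"
        else:
            group = label
        for i in range(1, len(colnames)):
            records.append({"field": field, "group": group,
                            "year": int(colnames[i]), "count": row[i]})
    return records

# Made-up rows in the assumed layout.
rows = [
    ("Table title", None, None),
    ("(Number)", None, None),
    (None, None, None),
    ("Field and race", "2019", "2020"),
    ("All fields", 100, 110),
    ("Hispanic or Latino", 10, 12),
    ("Not Hispanic or Latino", None, None),
    ("Asian", 20, 25),
    ("Engineering", 40, 45),
    ("Asian", 8, 9),
    (None, None, None),
]

records = parse_rows(rows)
for r in records[:4]:
    print(r)
```

Each field row produces "Total" records, and every row beneath it inherits that field until the next field label appears.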

In [124]:
all_records = []

for year_filename in year_filenames.values():
    print(year_filename)
    wb = load_workbook(data_path + nsed_path + year_filename)
    all_records += generate_year_rows(wb)

tab23-2016.xlsx
sed17-sr-tab023-2017.xlsx
nsf20301-tab023-2018.xlsx
nsf21308-tab023-2019.xlsx
  warn("Workbook contains no default style, apply openpyxl's default")
nsf22300-tab023-2020.xlsx
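
The parser emits one record per year column, so if a file carries more than one year column, consecutive files may repeat the same field, group, and year. This is an assumption; the overlap is not inspected in this notebook. The following is a minimal sketch of keeping one record per key, with later records overriding earlier ones. `dedupe_records` is a hypothetical helper, and the sample counts are made up.

```python
def dedupe_records(records):
    """Keep the last record seen for each (field, group, year) key."""
    latest = {}
    for rec in records:
        key = (rec["field"], rec["group"], rec["year"])
        latest[key] = rec  # a later record replaces an earlier one
    return list(latest.values())

# Made-up records: the second one revises the first.
sample = [
    {"field": "Engineering", "group": "Total", "year": 2016, "count": 40},
    {"field": "Engineering", "group": "Total", "year": 2016, "count": 41},
    {"field": "Engineering", "group": "Total", "year": 2017, "count": 45},
]
print(dedupe_records(sample))
```

Because `year_filenames` is iterated in insertion order, applying this to `all_records` would let the most recent file's value win for any repeated year.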


### Write output to file

In [125]:
min_year = str(min([x['year'] for x in all_records]))
max_year = str(max([x['year'] for x in all_records]))
timestamp_str = datetime.datetime.now().strftime("%Y%m%d")
pd.DataFrame(all_records).to_csv(data_path + nsed_path + "nsf_earned_doctorate_race_ethnicity_years_" +
                                 min_year + "-" + max_year + "_" + timestamp_str + ".csv")
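
The output name combines the year range found in the records with the date of the run. Here is a sketch of the same naming pattern using `pathlib` and an f-string. `output_path` is a hypothetical helper, and `"data"` is a placeholder directory rather than the path used above.

```python
import datetime
from pathlib import Path

def output_path(base_dir, records, today=None):
    """Build the CSV path from the year range in records and a date stamp."""
    years = [rec["year"] for rec in records]
    today = today or datetime.date.today()
    name = ("nsf_earned_doctorate_race_ethnicity_years_"
            f"{min(years)}-{max(years)}_{today:%Y%m%d}.csv")
    return Path(base_dir) / name

sample = [{"year": 2011}, {"year": 2020}]
print(output_path("data", sample, today=datetime.date(2022, 2, 5)).name)
```

Passing the date explicitly makes the name reproducible, which can help when the notebook is rerun later.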