### Jupyter notebook to distill GRFP Awardee results

I made this notebook because I was curious about how GRFP awards are distributed by institution and discipline, and because I'm currently learning Python and git. If you have suggestions or changes, please feel free to provide feedback.

All of the output is shown below. To rerun it yourself (perhaps with a different year), make your changes and press the "Run" button above to step through each cell, or press the "fast forward" button to rerun the whole notebook from start to finish.

The fields in the data file should be "Name", "Baccalaureate Institution", "Field of Study", and "Current Institution". 

Note that NSF used commas both as the separator and inside the fields, which makes importing the data awkward. As a result, I've labeled only the first three columns here ("Last", "First", and "BaccalaureateInstitution"). The remaining columns are labeled "Extra", "Extra2", and so on, and are used when we search for keywords like "engineering" or "ocean".

I noticed that commas inside the fields are always followed by a space, while the commas used as separators are not. A regular expression should be able to exploit this difference.
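
As a rough sketch of that regular-expression idea, the snippet below splits a line only on commas that are not followed by a space. The sample lines and the `split_record` helper are made up for illustration; they assume the raw file really follows the pattern described above, which has not been verified against the actual NSF files.

```python
import re

# A comma counts as a separator only when it is NOT followed by a space
# (negative lookahead). This encodes the observation above as an assumption.
SEPARATOR = re.compile(r',(?! )')


def split_record(line):
    """Split one raw line into fields, leaving 'comma + space' inside a field."""
    return SEPARATOR.split(line.rstrip('\n'))


# Hypothetical raw lines, written to match the pattern described above
sample_lines = [
    "Doe, Jane Q,UNIVERSITY OF CALIFORNIA, BERKELEY,"
    "Geosciences - Physical Oceanography,Example University",
    "Roe, Sam,Example College,"
    "Chemistry - Chemical Theory, Models and Computational Methods,",
]

records = [split_record(line) for line in sample_lines]
for record in records:
    print(record)
```

With this split, each sample line yields exactly four fields (name, baccalaureate institution, field of study, current institution), and an empty current institution shows up as an empty string. pandas documents that a multi-character `sep` is treated as a regular expression by the Python engine, so passing `sep=r',(?! )'` together with `engine='python'` to `read_csv` may achieve the same thing, but that is untested here.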

In [1]:
import pandas as pd
import csv

In [2]:
year = 2020 #Change the year here to 2021, 2020, or 2019 and rerun the code 
GRFP = pd.read_csv(str(year) + 'GRFPAwardeeList.csv', sep=',',
                   names=['Last', 'First', 'BaccalaureateInstitution', 
                          'Extra', 'Extra2', 'Extra3', 'Extra4']
                  )
GRFP.head() # preview the head of the GRFP dataframe just created

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
0,Abbott,Kathleen Emma,UNIVERSITY OF CALIFORNIA,BERKELEY,Geosciences - Physical Oceanography,,
1,Abele,Taylor Jane,University of North Carolina at Chapel Hill,Life Sciences - Cell Biology,University of North Carolina at Chapel Hill,,
2,Abrams,Samantha Rose,Skidmore College,Psychology - Social Psychology,,,
3,Abramson,Haley Gilbert,Massachusetts Institute of Technology,Engineering - Biomedical Engineering,JOHNS HOPKINS UNIVERSITY,,
4,Abreha,Biruk Gezahegn,Northeastern University,Chemistry - Chemical Theory,Models and Computational Methods,Northeastern University,


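### Which baccalaureate institutions had the most awardees?
As I said above, NSF used commas both as the separator and inside the fields. This creates issues for everything downstream of the baccalaureate institution. For example, "UNIVERSITY OF CALIFORNIA, BERKELEY" is split so that the campus lands in the next column, which is why "UNIVERSITY OF CALIFORNIA" shows up below as a single institution. Creative solutions welcome...

This count covers all disciplines, including engineering, which seems to swamp the others.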

In [3]:
GRFP_byBaccInst = GRFP.groupby('BaccalaureateInstitution').size().sort_values(ascending=False)
GRFP_byBaccInst.head(20)

BaccalaureateInstitution
Massachusetts Institute of Technology          68
UNIVERSITY OF CALIFORNIA                       65
University of Michigan Ann Arbor               34
Stanford University                            33
Columbia University                            32
University of Texas at Austin                  32
University of Chicago                          28
Brown University                               26
Princeton University                           26
Cornell University                             26
Georgia Institute of Technology                25
Northeastern University                        24
University of California-Los Angeles           23
Harvard University                             22
William Marsh Rice University                  22
Yale University                                22
University of Illinois at Urbana-Champaign     21
University of Wisconsin-Madison                21
University of North Carolina at Chapel Hill    20
Texas A&M University Main

### Did anyone who got a Bachelor's at USF get an award?

In [4]:
GRFP[GRFP['BaccalaureateInstitution'].str.contains("University of South Florida", case=False)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
159,Blackwell,Keller Lloyd,University of South Florida,Comp/IS/Eng - Algorithms and Theoretical Found...,University of South Florida,,
1157,McClinton,Willie B,University of South Florida,Comp/IS/Eng - Machine Learning,University of South Florida,,


### Did anyone who is currently at USF get an award?
(This also includes those who got their bachelor's degree from USF.)

In [5]:
GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('University of South Florida', case=False).any(), axis=1)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
159,Blackwell,Keller Lloyd,University of South Florida,Comp/IS/Eng - Algorithms and Theoretical Found...,University of South Florida,,
1157,McClinton,Willie B,University of South Florida,Comp/IS/Eng - Machine Learning,University of South Florida,,
1275,Navarro-Estrada,Delfina Paola,Humboldt State University Foundation,Geosciences - Chemical Oceanography,University of South Florida,,
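
The row-wise search in cell 5 is repeated in most of the cells that follow, so it could be wrapped in a small helper. The sketch below is one possible version; `rows_containing` and the `demo` frame are hypothetical names and made-up data, not part of the original notebook. It searches column by column instead of row by row, which should select the same rows for a plain keyword.

```python
import pandas as pd


def rows_containing(df, keyword):
    """Return the rows of df in which any column contains keyword (case-insensitive)."""
    # Convert every cell to text so that non-string values can be searched too
    as_text = df.astype(str)
    # regex=False treats the keyword literally; na=False counts missing values as no match
    mask = as_text.apply(
        lambda col: col.str.contains(keyword, case=False, regex=False, na=False)
    ).any(axis=1)
    return df[mask]


# Tiny made-up frame that reuses some of the notebook's column names
demo = pd.DataFrame({
    'Last': ['Doe', 'Roe', 'Poe'],
    'BaccalaureateInstitution': ['University of South Florida',
                                 'Example College',
                                 'Sample University'],
    'Extra': ['Geosciences - Geochemistry',
              'Engineering - Bioengineering',
              'Geosciences - Marine Biology'],
    'Extra2': ['Sample University', 'University of South Florida', None],
})

usf = rows_containing(demo, 'university of south florida')
print(usf)
```

One caveat that also applies to the original cells: converting cells to text turns missing values into strings such as "nan", so a very short keyword could match them by accident.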


### Let's look at just Florida institutions

In [6]:
GRFP_byBaccFLInst = GRFP[GRFP['BaccalaureateInstitution'].str.contains("Florida")].groupby('BaccalaureateInstitution').size().sort_values(ascending=False)
GRFP_byBaccFLInst

BaccalaureateInstitution
University of Florida                             19
University of Central Florida                     14
Florida State University                           7
New College of Florida                             4
Florida International University                   3
University of South Florida                        2
Florida Agricultural and Mechanical University     1
Florida Gulf Coast University                      1
Florida Memorial University                        1
University of North Florida                        1
University of West Florida                         1
dtype: int64

### Which programs at UF are producing lots of grads who receive awards?

In [7]:
GRFP[GRFP['BaccalaureateInstitution'].str.contains("University of Florida", case=False)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
181,BOTELLO,JORDY FELIPE,University of Florida,Life Sciences - Cell Biology,University of Florida,,
229,Burke,Kristen Lagasse,University of Florida,Social Sciences - Sociology,University of Texas at Austin,,
387,Davis,Zo� Indiana,University of Florida,Engineering - Bioengineering,University of Florida,,
478,El Basha,Mohammad Daniel,University of Florida,Engineering - Biomedical Engineering,MD Anderson Cancer Center,,
504,Fares,Wisam,University of Florida,Engineering - Biomedical Engineering,UNIVERSITY OF VIRGINIA,,
734,Hester,Holley Grace,University of Florida,Chemistry - Macromolecular,Supramolecular,and Nanochemistry,Cornell University
889,Kempfert,Katherine Candice,University of Florida,Mathematical Sciences - Statistics,UNIVERSITY OF CALIFORNIA,BERKELEY,
1071,Loring,Kaden Jay,University of Florida,Physics and Astronomy - Astronomy and Astrophy...,University of Florida,,
1160,McCourt,Kelli Marie,University of Florida,Engineering - Environmental Engineering,University of Florida,,
1221,Molinaro,Dean Devine,University of Florida,Engineering - Mechanical Engineering,Georgia Institute of Technology,,


## Remember, the 2021 competition (applications due in October 2020) had three priority areas: "Artificial Intelligence, Quantum Information Science, and Computationally Intensive Research"
https://www.nsf.gov/pubs/2020/nsf20587/nsf20587.htm

https://www.nature.com/articles/d41586-020-02272-x

We might expect this to lead to more awardees in certain types of departments. How many awardees have "engineering" in their discipline, baccalaureate institution, or current institution name?

You can rerun this notebook with a different year to check, but here are the results for all three years:

For 2021, it's 658 out of 2074 awardees.

For 2020, it's 655 out of 2076 awardees.

For 2019, it's 605 out of 2052 awardees.
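
Because the total number of awardees differs slightly from year to year, it can help to compare shares instead of raw counts. The snippet below is a small sketch that only reuses the three pairs of numbers listed above; the variable names are my own.

```python
# (keyword matches, total awardees) copied from the three lines above
engineering_counts = {
    2021: (658, 2074),
    2020: (655, 2076),
    2019: (605, 2052),
}

# Percentage of awardees whose row mentions "engineering", rounded to one decimal
shares = {
    year: round(100 * matches / total, 1)
    for year, (matches, total) in engineering_counts.items()
}

for year, pct in shares.items():
    print(f"{year}: {pct}% of awardees match 'engineering'")
```

Keep in mind that these are keyword matches across all columns, so an awardee at an institution with "Engineering" in its name is counted even if their field of study is something else.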

In [8]:
GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Engineering', case=False).any(), axis=1)].count()

Last                        655
First                       655
BaccalaureateInstitution    655
Extra                       655
Extra2                      627
Extra3                       43
Extra4                       17
dtype: int64

In [9]:
GRFP.count()  # total awardees

Last                        2076
First                       2076
BaccalaureateInstitution    2076
Extra                       2076
Extra2                      1872
Extra3                       263
Extra4                       122
dtype: int64

##### What about "Computer"?
You can rerun this notebook with a different year to check, but here are the results for all three years:

For 2021, it's 66 out of 2074 awardees.

For 2020, it's 67 out of 2076 awardees.

For 2019, it's 65 out of 2052 awardees.

In [10]:
GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Computer', case=False).any(), axis=1)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
21,Aikat,Vikram,University of North Carolina at Chapel Hill,Comp/IS/Eng - Computer Networks,University of North Carolina at Chapel Hill,,
25,Al-Shaer,Rawan,University of North Carolina at Charlotte,Comp/IS/Eng - Computer Security and Privacy,University of North Carolina at Charlotte,,
38,Alkayyali,Amani Abdulkader,Wayne State University,Comp/IS/Eng - Human Computer Interaction,University of Michigan Ann Arbor,,
82,Austin,Jacob,Columbia University,Comp/IS/Eng - Robotics and Computer Vision,Columbia University,,
85,Aziz,Samantha Dale,Texas State University - San Marcos,Comp/IS/Eng - Computer Security and Privacy,Texas State University - San Marcos,,
...,...,...,...,...,...,...,...
2023,Yin,Jessica,Carnegie-Mellon University,Comp/IS/Eng - Robotics and Computer Vision,Carnegie-Mellon University,,
2046,Zhang,Jason,University of California-Berkeley,Comp/IS/Eng - Robotics and Computer Vision,Carnegie-Mellon University,,
2047,Zhang,Connie Mabel,Columbia University,Comp/IS/Eng - Robotics and Computer Vision,University of Southern California,,
2058,Zhou,Allan Yang,UNIVERSITY OF CALIFORNIA,BERKELEY,Comp/IS/Eng - Robotics and Computer Vision,Stanford University,


In [11]:
GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Computer', case=False).any(), axis=1)].count()

Last                        67
First                       67
BaccalaureateInstitution    67
Extra                       67
Extra2                      63
Extra3                       7
Extra4                       5
dtype: int64

### I see no significant difference in the number of awardees in engineering and computer science in 2021 relative to 2020 and 2019.
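
One rough way to check this impression is a two-proportion z-test on the keyword counts reported above. This is an added sketch and not something the original analysis ran; it treats each year's awardee list as an independent sample, which is a simplifying assumption, and `two_proportion_z` is a hypothetical helper name.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the difference between x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis that both years share one rate
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Counts copied from the text above: (matches, total awardees) for each year
comparisons = {
    "engineering 2021 vs 2020": (658, 2074, 655, 2076),
    "engineering 2021 vs 2019": (658, 2074, 605, 2052),
    "computer 2021 vs 2020": (66, 2074, 67, 2076),
    "computer 2021 vs 2019": (66, 2074, 65, 2052),
}

results = {label: two_proportion_z(*counts) for label, counts in comparisons.items()}
for label, (z, p_value) in results.items():
    print(f"{label}: z = {z:.2f}, p = {p_value:.2f}")
```

None of these comparisons crosses the conventional |z| > 1.96 threshold, which is consistent with the statement in the heading, although a keyword search is only a crude proxy for discipline.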

### Now let's look only at geosciences:
There were 99 awards total in Geosciences in 2021, and 98 in 2020 (shown below).

In [12]:
GRFP_Geo = GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Geosciences -', case=False).any(), axis=1)]
GRFP_Geo.count()

Last                        98
First                       98
BaccalaureateInstitution    98
Extra                       98
Extra2                      82
Extra3                       2
Extra4                       0
dtype: int64

In [13]:
GRFP_Geo_byBaccInst = GRFP_Geo.groupby('BaccalaureateInstitution').size().sort_values(ascending=False)
GRFP_Geo_byBaccInst.head(20)

BaccalaureateInstitution
University of California-Los Angeles        3
University of California-Santa Cruz         3
California Institute of Technology          3
University of South Carolina at Columbia    3
Georgia State University                    2
UNIVERSITY OF CALIFORNIA                    2
University of Colorado at Boulder           2
University of Oklahoma Norman Campus        2
University of Puerto Rico Mayaguez          2
Oregon State University                     2
Northwestern University                     2
Northeastern University                     2
Massachusetts Institute of Technology       2
Indiana University                          2
Barnard College                             2
University of Texas at Austin               2
Utah State University                       2
Brown University                            2
Colorado College                            2
Michigan State University                   2
dtype: int64

### How many awards in Oceanography?
In 2021, there were 15, and only 2 in chemical oceanography.

In 2020, there were 17, and 6 in chemical oceanography.

In 2019, there were 13, and 4 in chemical oceanography.

In [14]:
GRFP_Geo[GRFP_Geo.apply(lambda row: row.astype(str).str.contains('Oceanography', case=False).any(), axis=1)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
0,Abbott,Kathleen Emma,UNIVERSITY OF CALIFORNIA,BERKELEY,Geosciences - Physical Oceanography,,
197,Brayton,Casey Escher,University of South Carolina at Columbia,Geosciences - Physical Oceanography,University of South Carolina at Columbia,,
206,Brokaw,Richard James,University of South Carolina at Columbia,Geosciences - Physical Oceanography,University of South Carolina at Columbia,,
383,Davis,David M,Georgia State University,Geosciences - Chemical Oceanography,Georgia State University,,
497,Fachon,Evangeline,Northeastern University,Geosciences - Biological Oceanography,,,
706,Hauer,Michelle,DePaul University,Geosciences - Biological Oceanography,University of Rhode Island,,
729,Herrera,Erica Lauren,University of Texas at El Paso,Geosciences - Chemical Oceanography,University of Texas at El Paso,,
810,Jahns,Max Andrew,Vassar College,Geosciences - Chemical Oceanography,Woods Hole Oceanographic Institution,,
872,Kaplan,Rachel Linnea,Brown University,Geosciences - Biological Oceanography,,,
1172,McKie,Taylor Sherisse,Georgia Institute of Technology,Geosciences - Physical Oceanography,University of California-San Diego Scripps Ins...,,


In [15]:
GRFP_Geo[GRFP_Geo.apply(lambda row: row.astype(str).str.contains('Oceanography', case=False).any(), axis=1)].count()

Last                        17
First                       17
BaccalaureateInstitution    17
Extra                       17
Extra2                      14
Extra3                       0
Extra4                       0
dtype: int64

## How many in Marine Biology?
In 2021, there were 12.

In 2020, there were 10.

In 2019, there were 12.

In [16]:
GRFP_Geo[GRFP_Geo.apply(lambda row: row.astype(str).str.contains('Marine Biology', case=False).any(), axis=1)]

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
14,Adler,Alyssa,Oregon State University,Geosciences - Marine Biology,,,
75,Ashey,Jill,College of William and Mary,Geosciences - Marine Biology,,,
894,Khalsa,Noah Sat Sarbat Singh,University of Alaska Fairbanks Campus,Geosciences - Marine Biology,University of Alaska Fairbanks Campus,,
1049,Linsky,Jacob Morrie Jack,University of California-Santa Cruz,Geosciences - Marine Biology,,,
1263,Myers,Hannah,Middlebury College,Geosciences - Marine Biology,University of Alaska Fairbanks Campus,,
1336,Osborne,Chen Cornelia Elspeth,Pennsylvania State Univ University Park,Geosciences - Marine Biology,Pennsylvania State Univ University Park,,
1619,Searles,Adam Ross,University of Central Florida,Geosciences - Marine Biology,,,
1772,Tatom-Naecker,Theresa-Anne,University of Chicago,Geosciences - Marine Biology,University of California-Santa Cruz,,
1823,Tran,Leon Luat,University of South Carolina at Columbia,Geosciences - Marine Biology,University of Hawaii,,
2000,Wuest,Anna Renee,Florida State University,Geosciences - Marine Biology,Florida State University,,


In [17]:
GRFP_Geo[GRFP_Geo.apply(lambda row: row.astype(str).str.contains('Marine Biology', case=False).any(), axis=1)].count()

Last                        10
First                       10
BaccalaureateInstitution    10
Extra                       10
Extra2                       6
Extra3                       0
Extra4                       0
dtype: int64

### Maybe people in chemical oceanography applied under other disciplines? What if we widen the search back out to the entire awardee list and search for just "chem"?
315 in 2021

316 in 2020

312 in 2019


In [18]:
GRFP_chem = GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Chem', case=False).any(), axis=1)]
GRFP_chem

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
4,Abreha,Biruk Gezahegn,Northeastern University,Chemistry - Chemical Theory,Models and Computational Methods,Northeastern University,
5,Abularrage,Nile,Tufts University,Chemistry - Chemical Structure,Dynamics,and Mechanism,Massachusetts Institute of Technology
7,Ackenhusen,Sarah Elizabeth,UNIVERSITY OF ILLINOIS,Chemistry - Chemistry of Life Processes,University of Michigan Ann Arbor,,
12,Adams,Jaquesta Alexia Maxine,Howard University,Chemistry - Chemical Theory,Models and Computational Methods,Howard University,
16,Agnello,Emily Rose,UNIVERSITY OF MASSACHUSETTS AMHERST,Life Sciences - Biochemistry,University of Massachusetts Medical School,,
...,...,...,...,...,...,...,...
2024,Yokokura,Takashi,Purdue University,Engineering - Chemical Engineering,Purdue University,,
2032,Younkin,Gordon,Northwestern University,Life Sciences - Biochemistry,Cornell University,,
2068,Zito,Alessandra Mary,Johns Hopkins University Krieger School of Art...,Chemistry - Sustainable Chemistry,University of California-Irvine,,
2071,Zoll,Adam Jacob,Tufts University,Chemistry - Chemical Synthesis,Yale University,,


###### What about "Geochemistry"? (The leading space in the search string excludes "Biogeochemistry".)
7 in 2021

6 in 2020

14 in 2019


In [19]:
GRFP_chem = GRFP[GRFP.apply(lambda row: row.astype(str).str.contains(' Geochemistry', case=False).any(), axis=1)]
GRFP_chem

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
47,Alvarez Caraveo,Blanca Jessica,Barnard College,Geosciences - Geochemistry,University of California-Los Angeles,,
1159,McComb,Samantha,SUNY College at Potsdam,Geosciences - Geochemistry,SUNY College at Potsdam,,
1497,Riche,Alexis,George Washington University,Geosciences - Geochemistry,Northern Arizona University,,
1804,Todes,Jordan Philip,Northwestern University,Geosciences - Geochemistry,,,
1809,Tompkins,Hannah Gail Dooley,University of Rochester,Geosciences - Geochemistry,University of Rochester,,
1961,Wiita,Elizabeth Grace,Barnard College,Geosciences - Geochemistry,Barnard College,,


In [20]:
GRFP_chem.count()

Last                        6
First                       6
BaccalaureateInstitution    6
Extra                       6
Extra2                      5
Extra3                      0
Extra4                      0
dtype: int64

### What about Paleoclimate (excluding paleontology and paleobiology)?
8 in 2021

4 in 2020

6 in 2019

In [21]:
GRFP_paleo = GRFP[GRFP.apply(lambda row: row.astype(str).str.contains('Paleoclimate', case=False).any(), axis=1)]
GRFP_paleo

Unnamed: 0,Last,First,BaccalaureateInstitution,Extra,Extra2,Extra3,Extra4
1011,Lehrmann,Asmara Anne,Trinity University,Geosciences - Paleoclimate,University of Alabama Tuscaloosa,,
1195,Mentzer,Carlie Maize,University of Mount Union,Geosciences - Paleoclimate,University of Rochester,,
1717,Standring,Patricia,University of Texas at Austin,Geosciences - Paleoclimate,University of Texas at Austin,,
1792,Thomas,Trent Brian,University of California-Los Angeles,Geosciences - Paleoclimate,University of California-Los Angeles,,


In [22]:
GRFP_paleo.count()

Last                        4
First                       4
BaccalaureateInstitution    4
Extra                       4
Extra2                      4
Extra3                      0
Extra4                      0
dtype: int64