# Evolution of Scientific Grants

In this project I will extract data from various grant foundations and explore the following questions:

1. Which institutions get the grants, and how much do they receive?
2. What are the research topics over time, and do they relate to research developments?
3. How do the National Science Foundation grants compare to grants from private foundations?

### 1. Scraping Moore Foundation Grant Information

In [52]:
# import necessary packages
import lxml.html as lx
import pandas as pd
import requests
import requests_cache

requests_cache.install_cache("cache")

# download web page
url = 'https://www.moore.org/grants/?showAll=true'
response = requests.get(url)
response.raise_for_status()
doc = response.text

# parse web page
html = lx.fromstring(doc, base_url = url)
html.make_links_absolute()

titles = html.xpath("//p[@class = 'first-paragraph']")
titles = [x.text_content() for x in titles]

organizations = html.xpath("//a[@class = 'tile-close']/p[@class='']")
organizations = [x.text_content() for x in organizations]

dates = html.xpath("//div[@class = 'date']/p")
dates = [x.text_content() for x in dates]

terms = html.xpath("//a[@class = 'tile-close']//ul//li[last()]/p")
terms = [x.text_content() for x in terms]

amounts = html.xpath("//a[@class = 'tile-close']//ul/li[last()-1]/p")
amounts = [x.text_content() for x in amounts]

# the @alt XPath already returns strings, so no text_content() call is needed
types = html.xpath("//a[@class = 'tile-close']/img[last()]/@alt")
types = [str(x) for x in types]

# make data frame
moore_df = pd.DataFrame({'titles': titles, 'organizations': organizations, 'dates': dates, 'terms': terms, 
                         'amounts': amounts, 'types': types})

moore_df.head()

Unnamed: 0,titles,organizations,dates,terms,amounts,types
0,Exploring High-temperature Topological Superco...,"University of British Columbia, Department of ...",May 2022,48 months,"$1,479,160",Science
1,Youth Community Science Module Development,Smithsonian Institution (Office of the Comptro...,May 2022,30 months,"$1,261,732",Science
2,SETI Research Experience for Undergraduates,SETI Institute,May 2022,24 months,"$202,835",Science
3,Activating Science Invention and Entrepreneurs...,Activation Energy,Apr 2022,24 months,"$1,915,724",Science
4,Novel Double Angle-Resolved Photoemission Spec...,"University of Illinois at Urbana-Champaign, De...",Apr 2022,48 months,"$1,599,461",Science


In [54]:
# write to csv file
moore_df.to_csv('moore_grants.csv')
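
The `amounts`, `terms`, and `dates` columns above are still strings (for example "$1,479,160", "48 months", and "May 2022"), so they need to be converted before any totals or time trends can be computed. Below is a minimal sketch of that conversion using only the standard library. The helper names `parse_amount`, `parse_term_months`, and `parse_month_year` are hypothetical, and the formats are assumed to match the five rows shown above (whole-dollar amounts, English month abbreviations).

```python
import re
from datetime import datetime

def parse_amount(text):
    """Convert a string like '$1,479,160' to an integer number of dollars.

    Assumes whole-dollar amounts; anything else raises ValueError.
    """
    cleaned = re.sub(r'[$,\s]', '', text)
    return int(cleaned)

def parse_term_months(text):
    """Convert a string like '48 months' to the integer 48 (None if no number)."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

def parse_month_year(text):
    """Convert a string like 'May 2022' to a datetime for the first of that month.

    '%b' depends on the locale; English month abbreviations are assumed.
    """
    return datetime.strptime(text.strip(), '%b %Y')

print(parse_amount('$1,479,160'))      # 1479160
print(parse_term_months('48 months'))  # 48
print(parse_month_year('May 2022'))    # 2022-05-01 00:00:00
```

Applied to the data frame, this could look like `moore_df['amounts'].map(parse_amount)`, and likewise for the other two columns.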

### 2. Scraping Sloan Foundation Grant Information

In [115]:
requests_cache.install_cache('cache')

sloan_url = 'https://sloan.org/grants-database?page='

organizations = []
years = []
cities = []
amounts = []
descriptions = []

# iterate through every page of the grants database website to extract data for all grants
for i in range(1, 260):
    url = sloan_url + str(i)
    
    # download web page
    response = requests.get(url)
    response.raise_for_status()
    doc = response.text

    # parse web page
    html = lx.fromstring(doc, base_url = url)
    html.make_links_absolute()
    
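    descriptions_i = html.xpath("//div[@class = 'brief-description']/p")
    # extract the text so the column holds strings rather than lxml elements
    descriptions = descriptions + [x.text_content() for x in descriptions_i]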
    
    organizations_i = html.xpath("//div[@class = 'grantee']/span[1]/following-sibling::text()[1]")
    organizations = organizations + organizations_i
    
    years_i = html.xpath("//div[@class = 'year']/span[1]/following-sibling::text()[1]")
    years = years + years_i
    
    amounts_i = html.xpath("//div[@class = 'amount']/span[1]/following-sibling::text()[1]")
    amounts = amounts + amounts_i
   
    cities_i = html.xpath("//div[@class = 'city']/span[1]/following-sibling::text()[1]")
    cities = cities + cities_i
    
# make data frame
sloan_df = pd.DataFrame({'descriptions': descriptions, 'organizations': organizations, 'years': years, 'amounts': amounts, 
                         'cities': cities})
sloan_df.head()    

Unnamed: 0,descriptions,organizations,years,amounts,cities
0,To support graduate student presentations at t...,"Industrial Organizational Society, Inc.\n\t",2022\n\t,"$33,000\n\t","Boston, MA\n\t"
1,To explore environmental topics including wate...,Geochemical Society\n\t,2022\n\t,"$10,000\n\t","Washington, DC\n\t"
2,To gather a broad cross-section of the geoscie...,American Geophysical Union\n\t,2022\n\t,"$50,000\n\t","Washington, DC\n\t"
3,"To develop, test, and help implement ways of m...",U.S. Chamber of Commerce Foundation\n\t,2022\n\t,"$250,000\n\t","Washington, DC\n\t"
4,To support the research and writing of “Collid...,Richard Rhodes\n\t,2022\n\t,"$117,500\n\t","Seattle, WA\n\t"
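
The loop above hard-codes the number of pages (`range(1, 260)`), so it would miss grants once the database grows past that point. One possible alternative is to keep requesting pages until one comes back empty. The sketch below assumes that a page past the end of the database yields no grant entries, which I have not checked against sloan.org. The names `collect_pages`, `fetch_page`, and `fake_site` are hypothetical, and `fake_site` is a stand-in for the real scraper so that the example runs without network access.

```python
def collect_pages(fetch_page, first_page=1, max_pages=1000):
    """Call fetch_page(n) for n = first_page, first_page + 1, ... and
    concatenate the results, stopping at the first page that yields nothing.

    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    rows = []
    for page in range(first_page, first_page + max_pages):
        page_rows = fetch_page(page)
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows

# Stand-in for a real scraper: three pages of data, then empty pages.
fake_site = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}
print(collect_pages(lambda n: fake_site.get(n, [])))  # ['a', 'b', 'c', 'd', 'e']
```

In the scraper, `fetch_page` would download and parse one page and return, for instance, the list of organization strings found on it.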


In [118]:
# replace runs of '\n' and '\t' with a single space, then trim leading/trailing whitespace
sloan_df = sloan_df.replace(r'[\n\t]+', ' ', regex=True)
sloan_df = sloan_df.replace(r'^\s+|\s+$', '', regex=True)

# write to csv file
sloan_df.to_csv('sloan_grants.csv')

sloan_df.head(15)

Unnamed: 0,descriptions,organizations,years,amounts,cities
0,To support graduate student presentations at t...,"Industrial Organizational Society, Inc.",2022,"$33,000","Boston, MA"
1,To explore environmental topics including wate...,Geochemical Society,2022,"$10,000","Washington, DC"
2,To gather a broad cross-section of the geoscie...,American Geophysical Union,2022,"$50,000","Washington, DC"
3,"To develop, test, and help implement ways of m...",U.S. Chamber of Commerce Foundation,2022,"$250,000","Washington, DC"
4,To support the research and writing of “Collid...,Richard Rhodes,2022,"$117,500","Seattle, WA"
5,To support the research and writing of “The Va...,Marina Gerner,2022,"$49,906","London, United Kingdom"
6,To produce and release a 90-minute documentary...,"Metropole Film Board, Inc.",2022,"$250,000","New York, NY"
7,"To create the Eckerd Science, Technology, Engi...",Eckerd College,2022,"$249,723","Saint Petersburg, FL"
8,To organize opportunities to run experiments a...,National Academy of Sciences,2022,"$235,000","Washington, DC"
9,To study historical and contemporary decision-...,Wichita State University Foundation,2022,"$45,907","Wichita, KS"


### 3. Extracting NSF Grant Data

The data for every National Science Foundation grant is stored in XML files.

https://nsf.gov/awardsearch/advancedSearchResult?ProgEleCode=7217,9150,9168&BooleanElement=Any&BooleanRef=Any&ActiveAwards=true&#results

In [None]:
import glob

# collect the paths of all NSF award XML files
path = "C:\\Users\\Natalie\\NSF_grants\\"
filenames = glob.glob(path + "*.xml")

In [137]:
cols = ['AwardTitle', 'AwardEffectiveDate', 'AwardExpirationDate', 'AwardAmount', 'Institution']
df = pd.read_xml(r'C:\Users\Natalie\NSF_grants\0457453.xml')[cols]
df

Unnamed: 0,AwardTitle,AwardEffectiveDate,AwardExpirationDate,AwardAmount,Institution
0,Liquid-Core Capsules via Interfacial Free Radi...,04/01/2005,03/31/2009,245043,
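
The `Institution` column is empty in the output above. A plausible cause is that `Institution` is a nested element, and `pd.read_xml` only reads the direct children of each row element. Below is a minimal sketch of pulling the nested name out with the standard library's `xml.etree.ElementTree`. The layout of `SAMPLE_XML` (a `rootTag` wrapper, an `Award` element, and a `Name` child under `Institution`) is an assumption about the NSF files, and `award_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed layout of one NSF award file; the values are made up for illustration.
SAMPLE_XML = """
<rootTag>
  <Award>
    <AwardTitle>Example Award</AwardTitle>
    <AwardEffectiveDate>04/01/2005</AwardEffectiveDate>
    <AwardExpirationDate>03/31/2009</AwardExpirationDate>
    <AwardAmount>245043</AwardAmount>
    <Institution>
      <Name>Example University</Name>
    </Institution>
  </Award>
</rootTag>
"""

def award_record(xml_text):
    """Return one flat dict per award, reading Institution/Name explicitly."""
    root = ET.fromstring(xml_text)
    award = root if root.tag == 'Award' else root.find('.//Award')
    if award is None:
        raise ValueError('no <Award> element found')
    amount = award.findtext('AwardAmount')
    return {
        'AwardTitle': award.findtext('AwardTitle'),
        'AwardEffectiveDate': award.findtext('AwardEffectiveDate'),
        'AwardExpirationDate': award.findtext('AwardExpirationDate'),
        'AwardAmount': int(amount) if amount else None,
        # nested path: the <Name> child of <Institution>
        'Institution': award.findtext('Institution/Name'),
    }

print(award_record(SAMPLE_XML)['Institution'])  # Example University
```

If the real files follow this layout, the records for every path in `filenames` could be collected into a list and passed to `pd.DataFrame` to build a single table of NSF grants.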
