# 1.0 Data Cleaning

## 1.1 Importing Stuff

Import date of raw data file `CA.US-Social.csv`:  2024-11-17

Source:  https://webapps.cihr-irsc.gc.ca/decisions/p/main.html?lang=en#fq={!tag=theme2}theme2%3A%22Social%20%2F%20Cultural%20%2F%20Environmental%20%2F%20Population%20Health%22&fq={!tag=country}country%3ACanada%20%20%20OR%20%20%20country%3A%22United%20States%20of%20America%22&sort=namesort%20asc&start=0&rows=20

In [1]:
# imports
## section 1.1
import pandas as pd

## section 1.2
from bs4 import BeautifulSoup
import csv
import numpy as np
import random
import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time


In [33]:
data = pd.read_csv("raw data/CA.US-Social.csv", encoding='latin1')
grant_count = data.shape[0] # will use this in section 1.2

In [27]:
# clean numeric columns: strip dollar signs and thousands separators
numeric_cols = ['CIHR_Contribution', 'CIHR_Equipment']
for column in numeric_cols:
    data[column] = data[column].replace({r'\$': '', ',': ''}, regex=True)

# convert contributions to numbers, then drop grants with a contribution of 0
data['CIHR_Contribution'] = pd.to_numeric(data['CIHR_Contribution'], errors='coerce')
data = data.query("CIHR_Contribution != 0")

## 1.2 Scraping Data

The search results I imported the data from contain a link to a detail page for each grant.  Each detail page includes the abstract, so once those pages are scraped we have everything we need.

Issues:
- the search results are split into 352 pages, with 20 results per page
- a grant's detail link cannot be derived from its position in the search results, so the IDs in those links have to be scraped from the result pages first

Link structure of search results:

`https://webapps.cihr-irsc.gc.ca/decisions/p/main.html?lang=en#fq={!tag=theme2}theme2%3A%22Social%20%2F%20Cultural%20%2F%20Environmental%20%2F%20Population%20Health%22&fq={!tag=country}country%3ACanada%20%20%20OR%20%20%20country%3A%22United%20States%20of%20America%22&sort=namesort%20asc&start=`**0**`&rows=20` where the `start` value is a multiple of 20, i.e. `start` = 20n for page n (counting from 0).  The largest n is the greatest integer that satisfies 20n < [count of grants].
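
As a rough sketch of the offset arithmetic above, the `start` values can be generated from the grant count.  The helper name `page_offsets`, the shortened `base_url` placeholder, and the grant count of 45 are all made up for illustration; the real search URL is the long one shown above.

```python
import math

def page_offsets(grant_count, rows_per_page=20):
    """Return the `start` offset of every results page: 0, 20, 40, ..."""
    page_count = math.ceil(grant_count / rows_per_page)
    return [page * rows_per_page for page in range(page_count)]

# Hypothetical placeholder, not the real search URL.
base_url = "https://example.org/search#sort=namesort%20asc"

offsets = page_offsets(45)  # made-up grant count
urls = [f"{base_url}&start={start}&rows=20" for start in offsets]
print(offsets)  # [0, 20, 40]
```

With exactly 40 grants the last offset is 20, which matches the rule that the largest n must satisfy 20n < [count of grants].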



Link structure of grant detail pages:

`https://webapps.cihr-irsc.gc.ca/decisions/p/project_details.html?applId=`**462354**`&lang=en` where the `applId` is unique to each grant.
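
A minimal sketch of going back and forth between an `applId` and its detail link.  The helper names `detail_url` and `extract_appl_id` are hypothetical and only used for this illustration.

```python
import re

DETAIL_URL = ("https://webapps.cihr-irsc.gc.ca/decisions/p/"
              "project_details.html?applId={appl_id}&lang=en")

def detail_url(appl_id):
    """Build the detail-page URL for one grant."""
    return DETAIL_URL.format(appl_id=appl_id)

def extract_appl_id(href):
    """Pull the numeric applId out of a link, or return None if it has none."""
    match = re.search(r"applId=(\d+)", href)
    return match.group(1) if match else None

url = detail_url(462354)
print(url)
print(extract_appl_id(url))  # 462354
```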

Steps:
1. download chromedriver from https://storage.googleapis.com/chrome-for-testing-public/131.0.6778.69/win64/chromedriver-win64.zip and extract it.  This build is for 64-bit Windows; the download and the extract location will differ on other operating systems (e.g. Mac)
2. initialize the driver for selenium.  This step is required since the site is loaded dynamically
3. load each of the n search result pages and scrape the `applId`s
    
    a. save to `data/applids.csv`
4. use each `applId` to obtain the soup for that grant
5. obtain the abstract, project name, researcher name(s), and contribution amount from the soup

### 1.2.1 Scraping applIDs

This took 27 minutes to run.  Don't run this section again unless necessary.

In [None]:
# Calculate number of pages
if grant_count % 20 == 0:
    page_count = grant_count // 20  # No extra page needed if divisible by 20
else:
    page_count = grant_count // 20 + 1  # Add one more page if not divisible by 20

print(f"Total pages to scrape: {page_count}")

In [None]:
# Set up Selenium WebDriver with the correct path
options = Options()
options.add_argument('--headless')
service = Service("C:/chromedriver-win64/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)

In [None]:
applids = []

for page_num in range(page_count):
    try:
        start = str(page_num * 20)
        link = 'https://webapps.cihr-irsc.gc.ca/decisions/p/main.html?lang=en#fq={!tag=theme2}theme2%3A%22Social%20%2F%20Cultural%20%2F%20Environmental%20%2F%20Population%20Health%22&fq={!tag=country}country%3ACanada%20%20%20OR%20%20%20country%3A%22United%20States%20of%20America%22&sort=namesort%20asc&start=' + start + '&rows=20'
        driver.get(link)
        driver.implicitly_wait(15)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a', class_='pull-left')
        for anchor in links:  # named `anchor` so it does not shadow the page URL in `link`
            href = anchor.get('href', '')
            match = re.search(r'applId=(\d+)', href)
            if match:
                applids.append(match.group(1))
        # Add a delay between requests to avoid overloading the server and getting banned
        delay = random.uniform(3, 6)
        time.sleep(delay)
    except Exception as e:
        print(f"Error occurred on page {page_num}. Error details: {str(e)}")
        break

In [None]:
driver.quit()

with open('data/applids.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['applId'])  # Write header
    for applid in applids:
        writer.writerow([applid])
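
Because the search results are paged by offset, the same `applId` could in principle be collected twice if the listing shifts between requests.  That is an assumption, not something observed in this run.  A minimal sketch of dropping repeats while keeping first-seen order before writing the file, using an in-memory buffer in place of `data/applids.csv` and two IDs from this notebook with one artificially repeated:

```python
import csv
import io

def dedupe_keep_order(ids):
    """Remove repeated IDs while preserving the order of first appearance."""
    return list(dict.fromkeys(ids))

sample_ids = ["462354", "476159", "462354"]  # one ID repeated on purpose
unique_ids = dedupe_keep_order(sample_ids)

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerow(["applId"])  # header
writer.writerows([appl_id] for appl_id in unique_ids)
print(buffer.getvalue())
```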

### 1.2.2 Scraping everything else

In [2]:
# initialize and prepare stuff
grants_info = pd.read_csv("data/applids.csv")
grants_applIDs = grants_info['applId'].to_list()

# Set up Selenium WebDriver with the correct path
options = Options()
options.add_argument('--headless')
service = Service("C:/chromedriver-win64/chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)

In [3]:
# define functions
def extract_info(dl):
    data = {}
    items = dl.find_all(["dt", "dd"])
    for i in range(0, len(items), 2):  # Iterate in pairs of dt and dd
        label = items[i].get_text(strip=True).replace(":", "")
        value = items[i + 1].get_text(strip=True)
        data[label] = value
    return data

def clean_contribution(data):
    # Replace the raw 'CIHR contribution' text with the dollar amount as a float
    # (None if no amount is found); non-string values are left unchanged
    contribution_str = data['CIHR contribution']
    if isinstance(contribution_str, str):
        match = re.search(r'Amount:\$(\d{1,3}(?:,\d{3})*)', contribution_str)
        if match:
            data['CIHR contribution'] = float(match.group(1).replace(',', ''))
        else:
            data['CIHR contribution'] = None

def convert_to_list_of_strings(data, key):
    # Check if the value for the given key is a string
    if isinstance(data.get(key), str):
        # Split the string by semicolon and remove extra spaces
        data[key] = [name.strip() for name in data[key].split(';')]
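
To make the contribution regex concrete, here is a small standalone sketch run on the raw field value `Contributors:Amount:$382,499Equipment:$0`, which is what one of the detail pages in this notebook returns.  The function name `parse_contribution` is hypothetical; it mirrors the pattern used in `clean_contribution` but returns the value instead of modifying a dictionary.

```python
import re

def parse_contribution(text):
    """Return the dollar amount after 'Amount:' as a float, or None if absent."""
    match = re.search(r"Amount:\$(\d{1,3}(?:,\d{3})*)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

raw = "Contributors:Amount:$382,499Equipment:$0"
print(parse_contribution(raw))        # 382499.0
print(parse_contribution("no data"))  # None
```

Note that the pattern assumes amounts are written with thousands separators; a value such as `$1234567` without commas would only match its first three digits.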

In [4]:
# define the remaining setup objects
grants_scraped = pd.DataFrame(columns=["Project title", "Principal investigator(s)", "Co-investigator(s)", "Keywords", "CIHR contribution", "ApplID"])

keys_to_remove = ['Supervisors', 'Institution paid', 'Research institution', 'Department', 'External applicant partner(s)', 'Program', 'Competition (year/month)', 
                  'Assigned peer review committee', 'Primary institute', 'Primary theme', 'Term (yrs/mths)', 'Contributors', 'Amount', 'Equipment', 'External funding partner(s)', 
                  'Partner Name', 'External applicant partner(s)', 'External in-kind partner(s)']

# Split the applIds into chunks of 662 so the scrape can be run one chunk at a time
chunk_size = 662
chunks = [grants_applIDs[i:i + chunk_size] for i in range(0, len(grants_applIDs), chunk_size)]

# Now, `chunks` is a list of lists, where each sublist contains 662 entries (or fewer for the last one)
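
A small sketch of the same slicing idea on ten made-up IDs, to show how the last chunk ends up shorter.  The helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(items, chunk_size):
    """Split a list into consecutive sublists of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

ids = list(range(1, 11))  # ten made-up IDs
chunks = split_into_chunks(ids, 4)
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```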

In [None]:
rows = []    # one dict per successfully scraped grant
failed = []  # applIds whose detail page came back empty

driver.implicitly_wait(30)  # applies to element lookups; setting it once is enough

# round 1
for appl_id in chunks[6]:  # `appl_id` avoids shadowing the built-in `id`
    try:
        link = 'https://webapps.cihr-irsc.gc.ca/decisions/p/project_details.html?applId=' + str(appl_id) + '&lang=en'
        driver.get(link)
        html = driver.page_source
        if not html:
            failed.append(appl_id)  # page source is missing
            continue
        soup = BeautifulSoup(html, 'html.parser')
        dl = soup.find("dl", id="results")
        if not dl:
            failed.append(appl_id)  # results list not found
            continue
        record = extract_info(dl)  # `record` avoids overwriting the `data` DataFrame from section 1.1
        if not record:
            failed.append(appl_id)  # results list is present but empty
            continue
        for key in keys_to_remove:
            record.pop(key, None)

        # Clean and convert as needed
        if 'CIHR contribution' in record:
            clean_contribution(record)
        for key in ('Principal investigator(s)', 'Co-investigator(s)', 'Keywords'):
            convert_to_list_of_strings(record, key)  # no-op if the key is absent

        record['ApplID'] = appl_id
        rows.append(record)

        # Add delay between requests
        time.sleep(random.uniform(3, 6))

    except Exception as e:
        print(f"Error on applId {appl_id}: {e}")
        break  # Stop at the first unexpected error

# Build the DataFrame once instead of concatenating inside the loop
grants_scraped = pd.DataFrame(rows)


                                         Project title  \
0    Long-term mental health outcomes of childhood ...   
1    Facilitating knowledge translation and exchang...   
2    Disseminating and contextualizing the results ...   
3    Mobilizing knowledge-to-action for responsive ...   
4    Transitions to a new normal: The health of you...   
..                                                 ...   
604  Planification d'une recherche évaluative parti...   
605  Consolider le partenariat communauté-universit...   
606  Recherche-action évaluative du processus d'app...   
607  Analyse de l'implantation d'une intervention c...   
608  Évaluation de l'acceptabilité et de la faisabi...   

                        Principal investigator(s)  \
0                                [Mcintyre, Lynn]   
1                         [McIsaac, Jessie-Lee D]   
2           [McIsaac, Jessie-Lee D, Kirk, Sara F]   
3    [McIsaac, Jessie-Lee D, Rossiter, Melissa D]   
4         [McIsaac, Jessie-Lee D, Turn

In [14]:
grants_scraped.to_csv('data/grants_scraped7.csv', index=False)
failed_df = pd.DataFrame(failed, columns=["Failed_Ids"])
failed_df.to_csv('data/failed_ids_chunk7.csv', index=False)
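
The IDs saved in the failed list could be given another pass later.  The sketch below shows one possible shape for that, under the assumption that a failed page may succeed on a later attempt.  `retry_failed` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for the real driver-based scraping of one detail page.

```python
import time

def retry_failed(failed_ids, fetch, max_attempts=3, delay=0.0):
    """Re-run fetch(appl_id) for each failed ID, up to max_attempts times.

    fetch should return a non-empty dict on success and an empty dict otherwise.
    Returns (recovered, still_failed).
    """
    recovered, still_failed = {}, []
    for appl_id in failed_ids:
        for _ in range(max_attempts):
            record = fetch(appl_id)
            if record:
                recovered[appl_id] = record
                break
            time.sleep(delay)  # pause before the next attempt
        else:
            still_failed.append(appl_id)  # every attempt came back empty
    return recovered, still_failed

# Stand-in scraper: ID 1 works at once, ID 2 on its second attempt, ID 3 never.
calls = {}
def fake_fetch(appl_id):
    calls[appl_id] = calls.get(appl_id, 0) + 1
    if appl_id == 1 or (appl_id == 2 and calls[appl_id] >= 2):
        return {"Project title": f"grant {appl_id}"}
    return {}

recovered, still_failed = retry_failed([1, 2, 3], fake_fetch)
print(sorted(recovered), still_failed)  # [1, 2] [3]
```

In a real run the `delay` would be set to a few seconds, in line with the pauses used elsewhere in this notebook.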

In [None]:
driver.quit()

### 1.2.3 Testing a single grant page

In [98]:
test = str(476159)
link = 'https://webapps.cihr-irsc.gc.ca/decisions/p/project_details.html?applId=' + test + '&lang=en'

In [123]:
driver.get(link)
driver.implicitly_wait(60)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
soup

<html class="js backgroundsize borderimage csstransitions fontface svg details progressbar meter mathml cors smallview" dir="ltr" lang="en"><!--<![endif]--><head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<!-- Web Experience Toolkit (WET) / Boîte à outils de l'expérience Web (BOEW) wet-boew.github.com/wet-boew/License-eng.txt / wet-boew.github.com/wet-boew/Licence-fra.txt -->
<title>Funding Decisions Database - CIHR</title>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Meta data -->
<meta content="Resource providing access to different units and initiatives of the Canadian Institutes of Health Research." name="description"/>
<meta content="Research institutes ; Governance ; Medical research ; Ethics ; Initiatives ; Partnerships ; Assessment ; Public involvement" name="keywords"/>
<meta content="Funding Decisions Database" name="dcterms.title"/>
<meta content="Government of Canada, Canadian Institutes of Health Research, 

In [124]:
dl = soup.find("dl", id="results")
dl

<dl class="col-md-12" id="results"></dl>

In [125]:
data = extract_info(dl)
data

{}
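
The empty `<dl id="results">` above suggests the page source was captured before the page's JavaScript had filled in the list.  `implicitly_wait` only affects element lookups such as `find_element`, not `page_source`, so it does not delay that read.  One possible workaround, sketched here with a hypothetical `wait_until` helper and a stand-in for the browser, is to poll until the list has content.  Selenium's `WebDriverWait` offers the same idea for a real driver.

```python
import time

def wait_until(predicate, timeout=10.0, poll=0.1):
    """Call predicate() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None  # gave up: the condition never became true
        time.sleep(poll)

# Stand-in for driver.page_source: the list is still empty on the first two reads.
reads = iter(['<dl id="results"></dl>',
              '<dl id="results"></dl>',
              '<dl id="results"><dt>Project title:</dt><dd>Example</dd></dl>'])

def populated_source():
    html = next(reads)
    return html if "<dt>" in html else None

html = wait_until(populated_source, timeout=2.0, poll=0.01)
print(html is not None)  # True
```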

In [53]:
keys_to_remove = ['Supervisors', 'Institution paid', 'Research institution', 'Department', 'External applicant partner(s)', 'Program', 'Competition (year/month)', 'Assigned peer review committee', 'Primary institute', 'Primary theme', 'Term (yrs/mths)', 'Contributors', 'Amount', 'Equipment', 'External funding partner(s)', 'Partner Name', 'External applicant partner(s)', 'External in-kind partner(s)']
for key in keys_to_remove:
    data.pop(key, None)

data

{'Project title': 'Embarking on a journey to explore the realities of pediatric solid organ transplantation for Indigenous patients, families and communities across Canada',
 'Principal investigator(s)': 'Anthony, Samantha J; Beaucage, Mary B',
 'Co-investigator(s)': 'Dart, Allison B; Goldberg, Aviva M; Matsuda-Abedini, Mina; Urschel, Simon; Weiss, Matthew',
 'CIHR contribution': 'Contributors:Amount:$382,499Equipment:$0',
 'Keywords': 'Health Equity; Indigenous Health; Patient Engagement; Patient-Oriented Research; Pediatrics; Solid Organ Transplantation',
 'Abstract/Summary': "In Canada, Indigenous peoples (including First Nations, Métis and Inuit) experience persistent health and social disparities and report higher rates of end-stage organ failure. The reasons for these disparities are complex and have been linked to Canada's history of colonialism and racism. Our ultimate objective is to improve access rates and health outcomes of solid organ transplantation among Indigenous child

In [None]:


clean_contribution(data)

In [60]:
def convert_to_list_of_strings(data, key):
    # Check if the value for the given key is a string
    if isinstance(data.get(key), str):
        # Split the string by semicolon and remove extra spaces
        data[key] = [name.strip() for name in data[key].split(';')]

convert_to_list_of_strings(data, 'Principal investigator(s)')
convert_to_list_of_strings(data, 'Co-investigator(s)')
convert_to_list_of_strings(data, 'Keywords')
data['ApplID'] = int(test)
data

{'Project title': 'Embarking on a journey to explore the realities of pediatric solid organ transplantation for Indigenous patients, families and communities across Canada',
 'Principal investigator(s)': ['Anthony, Samantha J', 'Beaucage, Mary B'],
 'Co-investigator(s)': ['Dart, Allison B',
  'Goldberg, Aviva M',
  'Matsuda-Abedini, Mina',
  'Urschel, Simon',
  'Weiss, Matthew'],
 'CIHR contribution': 382499.0,
 'Keywords': ['Health Equity',
  'Indigenous Health',
  'Patient Engagement',
  'Patient-Oriented Research',
  'Pediatrics',
  'Solid Organ Transplantation'],
 'Abstract/Summary': "In Canada, Indigenous peoples (including First Nations, Métis and Inuit) experience persistent health and social disparities and report higher rates of end-stage organ failure. The reasons for these disparities are complex and have been linked to Canada's history of colonialism and racism. Our ultimate objective is to improve access rates and health outcomes of solid organ transplantation among Indige

In [61]:
data_blank = {'ApplId': '', 
    'Project title': '',
    'Principal investigator(s)': '',
    'CIHR contribution': '',
    'Keywords': '',
    'Abstract/Summary': ''
}

grants_scraped = pd.DataFrame(columns=data.keys())

In [62]:
grants_scraped = pd.concat([grants_scraped, pd.DataFrame([data])], ignore_index=True)

