# Scraping Wellesley College Honors Theses

### Notebook (1 of 2) created by Marisa Papagelis as part of the Wellesley Data Collective January 2021 Project

In this notebook, we will scrape the [Wellesley College Honors Theses Repository](https://repository.wellesley.edu/collections/thesiscollection?display=grid) for the titles, years, and departments of Senior Theses using Selenium, BeautifulSoup, and Pandas.

The product of this notebook is a JSON file of our data, which we hope is useful in Wellesley data-focused projects. We also hope this code can be reused in the future to scrape the most recent Senior Honors Theses.

### Install Packages

First, we need to install Selenium and import the packages we will use in this notebook. We will use Selenium's webdriver to navigate our Senior Theses pages, BeautifulSoup to scrape, pandas to create our dataset, and json to save and export our dataset.

In [1]:
!pip install selenium



In [123]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import json
import numpy as np

### Navigate to Honors Theses Repository

We use ChromeDriver to open a Chrome browser and navigate to the first theses page we want to scrape.

In [56]:
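# `executable_path` is the Selenium 3 style; Selenium 4 deprecated it in favor of passing a Service object
driver = webdriver.Chrome(executable_path='Downloads/chromedriver3')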

In [124]:
driver.get("https://repository.wellesley.edu/collections/thesiscollection?display=grid")

### Create Helper Functions

Next, we use BeautifulSoup along with some helper functions to parse the theses webpage. We inspected the page in our Chrome browser to find the class names for the year, department, and title of each thesis on the page. We then used BeautifulSoup's .find method to pull the appropriate content from each thesis. Each helper returns nothing (None) when a thesis is missing that field.

In [125]:
soup = BeautifulSoup(driver.page_source, "html.parser")

In [128]:
def getThesisYear(content):
    """A helper function to determine the year of the thesis"""
    post_el = content.find("div",
                           class_="d-inline-block solr-value mods-origininfo-type-displaydate-dateother-ms")
    # Return the element's text, or None when this thesis has no year listed
    return post_el.text if post_el else None

In [129]:
def getThesisDepartment(content):
    """A helper function to determine the department of the thesis"""
    post_el = content.find("div",
                           class_="d-inline-block solr-value mods-name-corporate-department-namepart-ms")
    # Return the element's text, or None when this thesis has no department listed
    return post_el.text if post_el else None

In [130]:
def getThesisTitle(content):
    """A helper function to determine the title of the thesis"""
    post_el = content.find("div",
                           class_="d-inline-block solr-value fgs-label-s")
    # Return the element's text, or None when this thesis has no title listed
    return post_el.text if post_el else None
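
The three helpers differ only in the class string they look up, so they could be folded into a single function. The following is a minimal sketch of that idea, not the code this notebook actually ran: `get_field`, `FIELD_CLASSES`, and the sample HTML are hypothetical, and the HTML only imitates the repository's markup by reusing the class names from the helpers above.

```python
from bs4 import BeautifulSoup

# Class strings copied from the three helper functions above.
FIELD_CLASSES = {
    "year": "d-inline-block solr-value mods-origininfo-type-displaydate-dateother-ms",
    "department": "d-inline-block solr-value mods-name-corporate-department-namepart-ms",
    "title": "d-inline-block solr-value fgs-label-s",
}

def get_field(content, field):
    """Return the stripped text of one metadata field, or None if it is absent."""
    element = content.find("div", class_=FIELD_CLASSES[field])
    return element.get_text(strip=True) if element else None

# Hypothetical HTML imitating one thesis entry; the real page may differ.
sample_html = """
<div class="solr-fields islandora-inline-metadata col-xs-12 col-sm-8 col-md-9">
  <div class="d-inline-block solr-value fgs-label-s"> A Sample Thesis Title </div>
  <div class="d-inline-block solr-value mods-origininfo-type-displaydate-dateother-ms">2020</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
entry = sample_soup.find(
    "div", class_="solr-fields islandora-inline-metadata col-xs-12 col-sm-8 col-md-9"
)
# The sample entry has no department, so that field comes back as None.
record = {field: get_field(entry, field) for field in FIELD_CLASSES}
print(record)
```

Unlike `.text`, `get_text(strip=True)` trims surrounding whitespace, which may or may not be wanted depending on how the titles should be stored.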

### Scrape the First Page of Theses!

Now we loop through each thesis on the given page and collect the year, department, and title. We do this by finding the class containing all of the theses and looping through it. We append these data to a list which we will refer back to after all of the pages are scraped. 

In [131]:
thesis_data = [] 
for thesis in soup.find_all("div", class_="solr-fields islandora-inline-metadata col-xs-12 col-sm-8 col-md-9"):
    try:
        thesis_year = getThesisYear(thesis)
        thesis_department = getThesisDepartment(thesis)
        thesis_title = getThesisTitle(thesis)
        thesis_data.append((thesis_year, thesis_department, thesis_title))
    except AttributeError: 
        pass

### Scraping Additional Pages

In order to scrape the rest of the pages, we need to use our driver to navigate to each page and use our helper functions to scrape the thesis year, department, and title. We append these data to our list to use at the very end of the tutorial. 

First, we create a list to hold all of our future URLs. Since there are currently 34 pages of theses, and page 1 has a URL without a number, we will have 33 URLs to navigate. We create all 33 new URLs using a loop, and we append these URLs to a list for our scraping.

In [132]:
URLs = []
for page in range(1, 34):  # page numbers 1 through 33
    URL = "https://repository.wellesley.edu/collections/thesiscollection?page=" + str(page) + "&display=grid"
    URLs.append(URL)
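
Because the number of pages will grow as new theses are added, it may help to build the URLs from a page count rather than a hard-coded loop. This is only a sketch under that assumption: `build_page_urls` and `BASE_URL` are hypothetical names, and the query string is assembled with the standard library's `urlencode` instead of string concatenation.

```python
from urllib.parse import urlencode

BASE_URL = "https://repository.wellesley.edu/collections/thesiscollection"

def build_page_urls(last_page):
    """Build the grid-view URLs for page numbers 1 through last_page."""
    return [
        BASE_URL + "?" + urlencode({"page": page, "display": "grid"})
        for page in range(1, last_page + 1)
    ]

# 33 numbered pages, matching the count described above.
page_urls = build_page_urls(33)
print(len(page_urls))
print(page_urls[0])
```

When the repository gains more pages, only the argument to `build_page_urls` would need to change.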

Next, we loop through our URLs and scrape the appropriate information from each Honors Thesis, printing each URL as we go to track progress.

In [133]:
for URL in URLs: 
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(URL)
    for thesis in soup.find_all("div", class_="solr-fields islandora-inline-metadata col-xs-12 col-sm-8 col-md-9"):
        try:
            thesis_year = getThesisYear(thesis)
            thesis_department = getThesisDepartment(thesis)
            thesis_title = getThesisTitle(thesis)
            thesis_data.append((thesis_year, thesis_department, thesis_title))
        except AttributeError: 
            pass

https://repository.wellesley.edu/collections/thesiscollection?page=1&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=2&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=3&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=4&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=5&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=6&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=7&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=8&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=9&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=10&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=11&display=grid
https://repository.wellesley.edu/collections/thesiscollection?page=12&display=grid
https://repos
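
Since the helper functions return None whenever a field is absent, some collected records may be incomplete. A quick check along the following lines could show how many. This is a sketch only: `sample_data` is hypothetical data in the same (year, department, title) shape as `thesis_data`, not actual scraped output.

```python
# Hypothetical records in the same (year, department, title) shape as thesis_data.
sample_data = [
    ("2020", "Chemistry", "Sample title A"),
    ("2019", None, "Sample title B"),
    (None, None, None),
]

# Records where at least one field is missing.
incomplete = [record for record in sample_data if None in record]
# Records where every field is missing carry no information at all.
empty = [record for record in sample_data if all(field is None for field in record)]

print(len(incomplete), "incomplete;", len(empty), "entirely empty")
```

Running the same two list comprehensions on `thesis_data` would indicate whether any class names failed to match on some pages.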

### Create a Data Frame for Results

Finally, we create a data frame using the pandas package and save our data to it. 

In [134]:
theses_df = pd.DataFrame(thesis_data, columns=["Year", "Department", "Title"])

In [136]:
theses_df  # view data frame

| | Year | Department | Title |
|---|---|---|---|
| 0 | 2015 | Psychology | Unilateral Friendship Outcomes and Preschool F... |
| 1 | 2015 | Biological Sciences | Beyond Prosthetics: the First Steps Towards Id... |
| 2 | 2015 | Political Science | Filling Political Spaces: Iraqi, Humanitarian-... |
| 3 | 2015 | English | True Fiction: Three Writers' Approaches to Fac... |
| 4 | 2015 | East Asian Languages and Literatures | The Making of a Mountain: Mount Fuji, Miniatur... |
| ... | ... | ... | ... |
| 661 | 2017 | Music | Piratical Debauchery, Homesick Sailors, and Na... |
| 662 | 2020 | Anthropology | Experimental investigation of phytoliths and c... |
| 663 | 2020 | Chemistry | Synthesis of Canavanine Diamide as a Potential... |
| 664 | 2020 | Classical Studies | The Erotics of Imperialism: 5th Century Litera... |
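
With the data in a data frame, simple summaries become one-liners. As an illustration only, the sketch below counts theses per year and per department on a small hypothetical sample (`sample_rows` is made up for this example and is not scraped data).

```python
import pandas as pd

# Hypothetical rows in the same (Year, Department, Title) shape as thesis_data.
sample_rows = [
    ("2020", "Chemistry", "Sample title A"),
    ("2020", "Anthropology", "Sample title B"),
    ("2017", "Chemistry", "Sample title C"),
]
sample_df = pd.DataFrame(sample_rows, columns=["Year", "Department", "Title"])

# Count theses per year, ordered by year rather than by frequency.
theses_per_year = sample_df["Year"].value_counts().sort_index()
# Count theses per department, most common first.
theses_per_department = sample_df["Department"].value_counts()

print(theses_per_year)
print(theses_per_department)
```

The same two calls applied to `theses_df` would summarize the full scraped collection.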


### Save to JSON file

Now that we have our data frame, we export the underlying list of thesis records to a JSON file so it can be used in further exploration.

In [137]:
# Use a context manager so the file is closed once the data is written
with open('WellesleyHonorsTheses.json', 'w') as f:
    json.dump(thesis_data, f)
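
Note that `json.dump` writes each tuple as a plain JSON array, so the file holds unlabeled lists in year, department, title order. If labeled records are preferred, one option is to convert each tuple to a dictionary before saving. The sketch below shows that alternative on hypothetical data; `sample_data`, `records`, and the output file name are assumptions made for illustration, and it writes to a temporary directory rather than to the notebook's own file.

```python
import json
import os
import tempfile

# Hypothetical record in the same (year, department, title) shape as thesis_data.
sample_data = [("2020", "Chemistry", "Sample title A")]

# Attach a key to each position so the JSON is self-describing.
keys = ("Year", "Department", "Title")
records = [dict(zip(keys, row)) for row in sample_data]

# Write to a temporary location so this sketch does not overwrite real output.
out_path = os.path.join(tempfile.mkdtemp(), "sample_theses.json")
with open(out_path, "w") as f:
    json.dump(records, f, indent=2)

# Read the file back to confirm the round trip preserves the records.
with open(out_path) as f:
    loaded = json.load(f)

print(loaded[0]["Department"])
```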

We are done scraping the Wellesley College Honors Theses! 