We sorted the df_diverse DataFrame from our College Data Collection by SDI in ascending order, then scraped the websites of the 20 least diverse four-year colleges using BeautifulSoup and, for JavaScript-rendered elements, Selenium.

*We omitted colleges whose websites could not be scraped easily with BeautifulSoup and Selenium alone.*

Because HTML and element structures vary from site to site, each college website needed its own scraping method.
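
One way to keep per-site differences manageable is a lookup table that maps each college name to its own extraction function. The sketch below is an illustration only, not the method used in the cells that follow: `ClassTextCollector`, `paragraphs_with_class`, `SCRAPERS`, and `sample_html` are hypothetical names, and the standard-library `HTMLParser` stands in for BeautifulSoup so the example runs without extra packages. The CSS class names are the ones used in the Fairfield and Keene State cells.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <p> element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._buffer = None  # holds text chunks while inside a matching <p>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and self.css_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def paragraphs_with_class(html, css_class):
    """Return the stripped text of all <p class="... css_class ..."> elements."""
    collector = ClassTextCollector(css_class)
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical registry: one extraction function per college.
SCRAPERS = {
    "Fairfield University": lambda html: paragraphs_with_class(html, "courseblockdesc"),
    "Keene State College": lambda html: paragraphs_with_class(html, "courseblockextra"),
}

# Made-up markup that imitates a catalog course block.
sample_html = """
<div class="courseblock">
  <p class="courseblocktitle">WS 1101 Gender Studies</p>
  <p class="courseblockdesc noindent">An introduction to gender.</p>
</div>
"""

print(SCRAPERS["Fairfield University"](sample_html))
```

With a registry like this, adding a college means adding one entry, and the code that stores results in the DataFrame can stay the same for every site.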

In [1]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

df_diverse = pd.read_csv('/content/drive/My Drive/df_diverse.csv')

df_least = df_diverse.sort_values(by='SDI', ascending=True)

least_diverse_colleges = ['Benedict College', 'The University of Texas at El Paso', 'Central College', 'University of Wisconsin-La Crosse', 'Fairmont State University',
                          'University of New Hampshire-Main Campus', 'Keene State College', 'Slippery Rock University of Pennsylvania', 'University of Wisconsin-Stout', 'Fairfield University',
                          'Marietta College', 'North Dakota State University-Main Campus', 'Saint Norbert College', 'St Lawrence University', 'University of Maine',
                          'Montana State University', 'Miami University-Oxford', 'Marshall University', 'The University of the South', 'Auburn University']


# .copy() makes filtered_df an independent DataFrame, which avoids pandas' SettingWithCopyWarning
filtered_df = df_least[df_least['Name'].isin(least_diverse_colleges)].copy()
filtered_df['Course Descriptions'] = [[] for _ in range(len(filtered_df))]

Mounted at /content/drive


In [None]:
!pip install selenium



In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By

In [None]:
# The Ubuntu package is named chromium-chromedriver; 'chromium_driver' does not exist.
# Selenium 4.6+ can also locate a driver on its own through Selenium Manager.
!apt-get update
!apt-get install -y chromium-chromedriver


In [None]:
def web_driver():
  options = webdriver.ChromeOptions()
  options.add_argument('--verbose')
  options.add_argument('--no-sandbox')
  options.add_argument('--headless')
  options.add_argument('--disable-gpu')
  options.add_argument('--window-size=1920,1200')
  options.add_argument('--disable-dev-shm-usage')
  driver = webdriver.Chrome(options=options)
  return driver

In [None]:
driver = web_driver()

In [None]:
import requests
from bs4 import BeautifulSoup
import time

In [None]:
# Benedict College

url = 'http://catalog.benedict.edu/content.php?filter%5B27%5D=-1&filter%5B29%5D=&filter%5Bkeyword%5D=gender&filter%5B32%5D=1&filter%5Bcpage%5D=1&cur_cat_oid=3&expand=&navoid=111&search_database=Filter#acalog_template_course_filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

all_tables = soup.find_all('table', class_='table_default')
table = all_tables[-1]
courses = table.find_all('td', class_='width')

descriptions = []

domain = 'http://catalog.benedict.edu/'

for course in courses:
  link = course.find('a').get('href')
  course_url = domain + link
  response = requests.get(course_url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('td', class_="block_content")
  des = container.find('hr').find_next_sibling(string=True)
  descriptions.append(des.replace('(DESIGNATED SERVICE-LEARNING COURSE) ', ''))
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Benedict College'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions
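
The loop above builds each course URL with `domain + link`, which only works when every `href` is relative and has no leading slash. As a minimal sketch of a more forgiving alternative, `urllib.parse.urljoin` from the standard library resolves relative, root-relative, and already-absolute links against the same base. The `href` values below are made up for illustration and are not taken from the Benedict catalog.

```python
from urllib.parse import urljoin

domain = 'http://catalog.benedict.edu/'

# Hypothetical href values, chosen only to show three common shapes.
hrefs = [
    'course.php?id=1',                              # relative, no leading slash
    '/course.php?id=2',                             # root-relative
    'http://catalog.benedict.edu/course.php?id=3',  # already absolute
]

course_urls = [urljoin(domain, href) for href in hrefs]

for course_url in course_urls:
    print(course_url)
```

Plain concatenation would produce a double slash for the second link and a malformed URL for the third, while `urljoin` returns a usable address in all three cases.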

In [None]:
# The University of Texas at El Paso

url = 'https://www.utep.edu/liberalarts/women-studies/academic-programs/classes-offered-ws.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

descriptions = []

container = soup.find_all('h4')
for cont in container:
  description = cont.find_next_sibling('div')
  descriptions.append(description.get_text())
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'The University of Texas at El Paso'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.text, parser="html.parser")


In [None]:
# Central College
urls = ['https://catalog.central.edu/social-justice-studies-2/',
        'https://catalog.central.edu/soc-sociology/']

all_descriptions = []

for url in urls:
  response = requests.get(url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('div', class_="nine columns")

  # Skip paragraphs that contain a <strong> tag; keep the rest as description text.
  for line in container.find_all('p'):
    if not line.find('strong'):
      all_descriptions.append(line.text)

  time.sleep(0.1)  # short pause between page requests

# Keep only the descriptions that mention gender.
descriptions = [des for des in all_descriptions if 'gender' in des.lower()]

index = filtered_df[filtered_df['Name'] == 'Central College'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# University of Wisconsin-La Crosse

url = 'https://www.uwlax.edu/academics/department/race-gender-and-sexuality-studies/courses/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

all_descriptions = []

courses = soup.find_all('p', class_='courseblocktitle')

for course in courses:
  text = course.text
  strong_tags = course.find_all('strong')

  for strong_tag in strong_tags:
    text = text.replace(strong_tag.text, '')

  all_descriptions.append(text)
  time.sleep(0.1)

descriptions = []

for des in all_descriptions:
  if 'gender' in des.lower():
    descriptions.append(des)

index = filtered_df[filtered_df['Name'] == 'University of Wisconsin-La Crosse'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Fairmont State University

url = 'https://catalog.fairmontstate.edu/content.php?filter%5B27%5D=-1&filter%5B29%5D=&filter%5Bkeyword%5D=gender&filter%5B32%5D=1&filter%5Bcpage%5D=1&cur_cat_oid=26&expand=&navoid=4599&search_database=Filter#acalog_template_course_filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('td', class_="block_content_outer")
courses = container.find_all('td', class_='width')

descriptions = []

domain = 'https://catalog.fairmontstate.edu/'

for course in courses:
  link = course.find('a').get('href')
  course_url = domain + link
  response = requests.get(course_url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('td', class_="block_content")
  des = container.find('em').next_sibling
  descriptions.append(des.text)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Fairmont State University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# University of New Hampshire

url = 'https://catalog.unh.edu/undergraduate/course-descriptions/ws/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

descriptions = []

courses = soup.find_all('div', class_='courseblock')

for course in courses:
  element = course.find('p', class_='courseblockdesc noindent')
  des_text = element.find_next_sibling(string=True)
  descriptions.append(des_text.strip())
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'University of New Hampshire-Main Campus'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Keene State College

url = 'https://catalog.keene.edu/course-descriptions/wgs/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

descriptions = []

courses = soup.find_all('div', class_='courseblock')

for course in courses:
  descriptions.append(course.find('p', class_="courseblockextra noindent").text)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Keene State College'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Slippery Rock University of Pennsylvania

url = 'https://catalog.sru.edu/undergraduate/course-descriptions/gndr/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  descriptions.append(course.find('p', class_='courseblockextra').text)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Slippery Rock University of Pennsylvania'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Fairfield University

url = 'https://catalog.fairfield.edu/courses/ws/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  descriptions.append(course.find('p', class_="courseblockdesc noindent").text)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Fairfield University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Marietta College

url = 'https://marietta.smartcatalogiq.com/2024-2025/2024-2025-undergraduate-catalog-and-student-handbook/undergraduate-course-descriptions/gend-gender-studies/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('div', id='rightpanel')
courses = container.find_all('li')

domain = 'https://marietta.smartcatalogiq.com'

descriptions = []

for course in courses:
  link = course.find('a').get('href')
  course_url = domain + link
  response = requests.get(course_url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('div', id='rightpanel')
  this_course = container.find('div', id='main')

  descriptions.append(this_course.find('div', class_='desc').text.strip())
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Marietta College'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# North Dakota State University

url = 'https://catalog.ndsu.edu/course-catalog/descriptions/wgs/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('div', id="coursestextcontainer")
courses = container.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  des_text = course.find('p', class_='courseblockdesc').text.strip()
  if des_text != '':
    descriptions.append(des_text)

index = filtered_df[filtered_df['Name'] == 'North Dakota State University-Main Campus'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Saint Norbert College

url = 'https://www.snc.edu/academics/humanities/women-gender-studies/course-offerings.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('div', id='courses')
courses = container.find_all('div', class_="panel panel-default")

descriptions = []

for course in courses:
  inner = course.find('div', class_="panel-collapse collapse")
  link = inner.find('a').get('href')
  response = requests.get(link)
  soup = BeautifulSoup(response.text, 'html.parser')
  des = soup.find('p', class_="course-information__description")
  descriptions.append(des.text)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Saint Norbert College'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# University of Wisconsin-Stout

url = 'https://bulletin.uwstout.edu/content.php?filter%5B27%5D=WGS&filter%5B29%5D=&filter%5Bkeyword%5D=&filter%5B32%5D=1&filter%5Bcpage%5D=1&cur_cat_oid=29&expand=&navoid=774&search_database=Filter#acalog_template_course_filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('td', class_="block_content_outer")
courses = container.find_all('td', class_='width')

descriptions = []

domain = 'https://bulletin.uwstout.edu/'

for course in courses:
  link = course.find('a').get('href')
  course_url = domain + link
  response = requests.get(course_url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('td', class_='block_content')
  first_em = container.find('em')
  des = first_em.find_next_siblings(string=True)

  # Guard against pages with fewer than two text siblings after the first <em>.
  if len(des) > 1 and des[1] != 'Department Consent':
    descriptions.append(des[1])

  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'University of Wisconsin-Stout'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# St Lawrence University

url = 'https://www.stlawu.edu/offices/gender-and-sexuality-studies/gender-and-sexuality-studies-course-descriptions'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('div', class_="initial-12 medium-9 clearfix cell")
courses = container.find_all('p')
courses = courses[1:]

descriptions = []

# Keep every second paragraph (indices 1, 3, 5, ...) of the remaining list.
for course in courses[1::2]:
  descriptions.append(course.text.strip())

index = filtered_df[filtered_df['Name'] == 'St Lawrence University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# University of Maine

url = 'https://catalog.umaine.edu/content.php?filter%5B27%5D=WGS&filter%5B29%5D=&filter%5Bkeyword%5D=&filter%5B32%5D=1&filter%5Bcpage%5D=1&cur_cat_oid=93&expand=&navoid=4317&search_database=Filter#acalog_template_course_filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.find('td', class_="block_content_outer")
courses = container.find_all('td', class_='width')

descriptions = []

domain = 'https://catalog.umaine.edu/'

for course in courses:
  link = course.find('a').get('href')
  course_url = domain + link
  response = requests.get(course_url)
  soup = BeautifulSoup(response.text, 'html.parser')
  container = soup.find('td', class_="block_content")
  des = container.find('hr').find_next_sibling(string=True)
  descriptions.append(des)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'University of Maine'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Montana State University

url = 'https://catalog.montana.edu/coursedescriptions/wgss/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')
courses = courses[:-2]

descriptions = []

for course in courses:
  des = course.find('p', class_='courseblockdesc').text.strip().replace('PREREQUISITE:', '')
  des = des.replace('PREREQUISITES:', '')
  descriptions.append(des.replace(' or consent of instructor.', ''))
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Montana State University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Miami University-Oxford

url = 'https://bulletin.miamioh.edu/courses-instruction/wgs/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  des = course.find('p', class_='courseblockdesc').text.strip()
  if des != '':
    descriptions.append(des)
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Miami University-Oxford'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Marshall University

url = 'https://catalog.marshall.edu/undergraduate/courses-az/ws/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  des = course.find('p', class_="courseblockextra noindent")
  if des is not None:
    descriptions.append(des.text.strip())
  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Marshall University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# The University of the South

url = 'https://e-catalog.sewanee.edu/arts-sciences-courses/wmst/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')

descriptions = []

for course in courses:
  des = course.find('p', class_="courseblockdesc")

  for child in des.children:
    if child.name != 'em':
      descriptions.append(child.text.strip())

  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'The University of the South'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
# Auburn University

url = 'https://bulletin.auburn.edu/coursesofinstruction/wmst/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

courses = soup.find_all('div', class_='courseblock')
courses = courses[:-1]

descriptions = []

for course in courses:
  des = course.find('p')

  for child in des.children:
    if child.name != 'strong':
      text = child.text.strip()

      if text.startswith("LEC. 3. "):
        text = text[8:]
      elif text.startswith("LEC. "):
        text = text[5:]

      descriptions.append(text)

  time.sleep(0.1)

index = filtered_df[filtered_df['Name'] == 'Auburn University'].index[0]
filtered_df.at[index, 'Course Descriptions'] = descriptions

In [None]:
filtered_df

Unnamed: 0,UnitID,Name,State,Affiliation,Urbanization,Total,Men total,Women total,American Indian or Alaska Native total,Asian total,...,Women %,American Indian or Alaska Native %,Asian %,Black or African American %,Hispanic %,Native Hawaiian or Other Pacific Islander %,White %,Two or more races %,SDI,Course Descriptions
605,217721,Benedict College,SC,Private not-for-profit (religious affiliation),12,1694.0,798.0,896.0,18.0,9.0,...,52.9,1.1,0.5,74.7,2.8,0.1,0.8,0.0,0.126556,[This course is designed to explore women’s in...
603,228796,The University of Texas at El Paso,TX,Public,11,20609.0,9379.0,11230.0,33.0,143.0,...,54.5,0.2,0.7,1.8,87.7,0.1,3.8,0.7,0.145706,[Learn about the intersection of gender in rel...
594,153108,Central College,IA,Private not-for-profit (religious affiliation),32,1095.0,583.0,512.0,2.0,9.0,...,46.8,0.2,0.8,1.7,4.2,0.0,87.5,3.0,0.18977,[Gender is a primary lens through which societ...
593,240329,University of Wisconsin-La Crosse,WI,Public,13,9378.0,4036.0,5342.0,13.0,198.0,...,57.0,0.1,2.1,0.7,4.0,0.1,88.6,3.1,0.191055,[This course provides an introduction to how r...
592,237367,Fairmont State University,WV,Public,32,3060.0,1316.0,1744.0,12.0,13.0,...,57.0,0.4,0.4,4.6,0.7,0.0,88.1,4.1,0.192748,[This course introduces students to the biolog...
590,183044,University of New Hampshire-Main Campus,NH,Public,31,11376.0,4984.0,6392.0,5.0,296.0,...,56.2,0.0,2.6,0.8,4.1,0.0,84.9,2.5,0.196262,[Interdisciplinary survey of the major areas o...
586,183062,Keene State College,NH,Public,32,2718.0,1280.0,1438.0,9.0,36.0,...,52.9,0.3,1.3,1.8,4.9,0.0,83.9,3.1,0.22068,[This course is designed to introduce students...
584,239716,Saint Norbert College,WI,Private not-for-profit (religious affiliation),22,2089.0,925.0,1164.0,7.0,35.0,...,55.7,0.3,1.7,1.3,5.4,0.0,77.7,2.1,0.224293,[This introductory course focuses on one centr...
581,216038,Slippery Rock University of Pennsylvania,PA,Public,32,6803.0,3051.0,3752.0,12.0,69.0,...,55.2,0.2,1.0,4.2,3.4,0.1,83.5,3.4,0.235745,[Introduction to Gender Studies is an interdis...
580,240417,University of Wisconsin-Stout,WI,Public,32,6093.0,3468.0,2625.0,23.0,191.0,...,43.1,0.4,3.1,1.2,4.2,0.1,82.8,3.2,0.236018,[A multidisciplinary introduction to LGBTQ+ st...


In [None]:
filtered_df.to_csv('least_diverse_colleges.csv', index=False)
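
Note that a CSV file stores each list of course descriptions as text, so the column comes back as strings when the file is reloaded. The sketch below illustrates one way to restore the lists with `ast.literal_eval`. It assumes the lists are written in their Python string form, and it uses the standard-library `csv` module and an in-memory buffer as a stand-in for the pandas round trip; the college name and descriptions are made up.

```python
import ast
import csv
import io

# Made-up descriptions; the second one contains a comma and an apostrophe.
descriptions = ["Intro to gender studies.", "Women's history, 1800 to present."]

# Write one row, storing the list as its Python string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Course Descriptions"])
writer.writerow(["Example College", str(descriptions)])

# Read the row back: the list column is now plain text.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Course Descriptions"]

# literal_eval safely parses the text back into a Python list.
restored = ast.literal_eval(raw)

print(type(raw).__name__, type(restored).__name__)
```

An alternative that avoids the parsing step is to save the column as JSON, or to write the DataFrame in a format that preserves Python objects.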