# WEB SCRAPER FOR DYNAMIC WEBSITES
Written by Murshid Saqlain  
Last updated: July 13, 2023

## Table of Contents
1. [Introduction](#intro)
2. [Prerequisites](#prereq)
3. [Pre-scraping Tasks](#prescrape)
4. [BigFuture (BF) Website](#bf)
    * [Scrape Single BF URL](#bf-single)
    * [Scrape Multiple BF URL's](#bf-multi)
    * [Create BF Dataframe](#bf-pandas)
5. [CollegeData (CD) Website](#cd)
    * [Scrape Single CD URL](#cd-single)
    * [Scrape Multiple CD URL's](#cd-multi)
    * [Create CD Dataframe](#cd-pandas)
6. [Merge Dataframes](#merge)
7. [Conclusion](#end)

## Introduction<a id='intro'></a>

### **What is Web Scraping?**  
Web scraping is an efficient way to collect data from websites when an API (Application Programming Interface) is not readily available. Want to collect data on stock prices? Football match results? Job posting information? Web scraping can help streamline the process and let you avoid having to go through each url separately and manually.

### **What are Dynamic Websites?**  
While static websites can be easily parsed with `Beautiful Soup` and `Requests` in Python, dynamic websites pose a bit of trouble. Dynamic websites are a bit trickier to scrape because all the information isn't sent client side directly in its static HTML. It displays content based on user's choices usually (such as drop down menus, filters, etc.)  

One quick way to identify whether a website is dynamic or not is to *disable JavaScript*. If JS is disabled, the dynamic content from the site will disappear!

### **How to disable JavaScript?**  
1. Load your desired webpage.
2. Open Developer Tools in your browser by right clicking anywhere and selecting Inspect.
3. Open comand palette by pressing CTRL + Shift + P
4. Type JavaScript in the search bar that shows up
5. Click on Disable JavaScript [Debugger]
6. Refresh the webapge.

If some of the content has disappeared from the webpage, it means the website is dynamic!

### **What are we scraping today?**  
We're going to be gathering data on certain universities from two different websites:
1. **BigFuture.org**
2. **CollegeData.com**  

Some of the information is not present readily in BigFuture.org, so we're also going to be parsing data from CollegeData.com. 


## Prerequisites<a id='prereq'></a>

### **How do we scrape Dynamic Wesbites?**  
We'll use the powerful Python package **Selenium**, **Google Chome**, and **ChromeDriver** to scrape dynamic websites. Of course, you can use other browsers as well, such as FireFox and you'd need the apppropriate driver, but for my example, I'm sticking to Chrome.  

**Selenium** is a python package that executes the JS so that we can scrape information from the website as if we are viewing the website from our browser. According to the official **ChromeDriver** website, "WebDriver is an open source tool for automated testing of webapps across many browsers. It provides capabilities for navigating to web pages, user input, JavaScript execution, and more." Using these two together, we are able to navigate through the dynamic websites and execute the JS in a way that we are able to parse that information easily from our end.

### **How to install Selenium?**  
Open your terminal (I'm using miniconda), load up your virtual environment (I prefer making different environments for different projects), and type `pip install selenium`. Before we start coding, we will import this package.

### **How to install ChromeDriver?**  
You can find the binaries here: https://chromedriver.chromium.org/downloads. Ensure that the version of ChromeDriver matches the version of your Google Chrome. You can check the version of your Google Chrome by clicking on the three dots icon on the top right of your browser, on Help, and on About Google Chrome. This will open a new window that will display the version of your Google Chrome. From the chromedriver website, get the same version. 

You will need to extract the zip file (assuming you are on windows like me) and then move the contents to the location where your notebook is located (if you're following my example).

### Any other requirements?
We'll use `regex` for extracting specific information from strings, `pandas` for working with dataframes, and `time` to add small delays within our scraping scripts.

## Pre-Scraping Tasks <a id='prescrape'></a>

### **Web Page URLS**  <a id='webpages'></a>
To scrape information from a website, we need an URL of that webpage. However, this notebook **does not** fetch an exhaustive list of all the universities available in these websites. That is beyond the scope of this particular notebook (maybe I'll make a separate notebook just for that!). So, we're going to be providing a list of URLS for specific universities. 

On a sidenote, the URL creation is a fairly easy process and can be done using python as long as we have a list of the university names. This is because we notice that the links for specific universities have a common pattern (Thanks to Aurpa for pointing this out). 

For BigFuture.org, they all have a prefix of https://bigfuture.collegeboard.org/colleges/ followed by the college name with spaces replaced with hyphens. Some examples:
* Allegheny College: https://bigfuture.collegeboard.org/colleges/allegheny-college
* Beloit College: https://bigfuture.collegeboard.org/colleges/beloit-college

Similarly, for CollegeData.com, all the url's have a prefix of https://waf.collegedata.com/college-search/. Some examples:
* Allegheny College: https://waf.collegedata.com/college-search/allegheny-college
* Beloit College: https://waf.collegedata.com/college-search/Beloit-College

With this information we can quickly populate our list of url's as long as we have a list of university names. We use `.lower()` to force lowercase on the names, `.strip()` to remove any leading and trailing spaces, and `.replace(' ', '-')` to replace spaces with hyphens.

First we create an empty list called url_bf for BigFuture URLS (and url_cd for CollegeData). We use loop to iterate over our university names and for each name, we'll add the appropriate prefix to the lowercased, stripped, and hyphenated version of the name. This url is then appended to the url_bf list as we loop through the entire list.

Note: Some url's had to be manually changed because they didn't follow the above pattern, so they had to be fixed manually. 

**Example Code** is shown below.

In [292]:
# Create a list of URLS from university names

# Univeristy names:
uni_names = ['University of Minnesota', 'New York University', 'Amherst College']
             
# Prefix for BigFuture.org and CollegeData.com links     
prefix_bf = "https://bigfuture.collegeboard.org/colleges/"
prefix_cd = "https://waf.collegedata.com/college-search/"

# Craete empty lists for our URLs:
url_bf = [] #BigFuture
url_cd = [] #CollegeData

# URLS for the site
for name in uni_names:
    url_bf.append(prefix_bf + name.lower().strip().replace(' ', '-') + '/') #BigFuture
    url_cd.append(prefix_cd + name.lower().strip().replace(' ', '-') + '/') #CollegeData

# Print our URL lists:    
print(url_bf)
print(url_cd)

['https://bigfuture.collegeboard.org/colleges/university-of-minnesota/', 'https://bigfuture.collegeboard.org/colleges/new-york-university/', 'https://bigfuture.collegeboard.org/colleges/amherst-college/']
['https://waf.collegedata.com/college-search/university-of-minnesota/', 'https://waf.collegedata.com/college-search/new-york-university/', 'https://waf.collegedata.com/college-search/amherst-college/']


## EXPLORE BIGFUTURE.ORG:<a id='bf'></a>

Before we can do any kind of scraping, it's good practice to get a good idea about the website, the layout, where information is located, which information we're insterested in, etc. Let's take a look at BigFuture site as an example. I'll load up an example link: https://bigfuture.collegeboard.org/colleges/allegheny-college with Selenium.

In [394]:
# Import webdriver
from selenium import webdriver

#The chromedriver_win32 folder is in the same directory where this notebook is located
DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 

# Initiate the webdriver
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
# This will open a new chrome window
driver.get("https://bigfuture.collegeboard.org/colleges/allegheny-college/")

# Uncomment following to end session and close window
#driver.quit()

Notice that at the top, you'll see a message: `Chrome is being controlled by automated test software`, this is because we are using Selenium and WebDriver to open this window. I should point out here that any time you are done exploring or scraping a page, you want to execute `driver.quit()` to close the window and end the session. 

Another thing you'll notice is the "Accept All Cookies" popup. This is important, because we're going to have to deal with this if we are to scrape multiple urls in one go. Each time we open a new page, we're going to have to click on that X button or Accept all Cookies to get rid of that popup, otherwise, it'll cause issues when collecting our data later.

**Headless Mode**  
It's great that we can open a new chromw window to explore the website and all the elements inside it, but when we are scraping many, many url's, we don't really want to see a new window open every time. There's a solution for that as well. If we set Options.headless = True, and then we use  these options when initiating our webdriver, we'll essentially be performing all our scraping tasks without opening the window!

`from selenium.webdriver.chrome.options import Options`  
`options = Options()`  
`options.headless = True`  
`driver = webdriver.Chrome(options = options, executable_path=DRIVER_PATH)`

You can also set the window size in options (check Selenium documentation), but I do not use it here. For now, let's keep exploring the site without using headless = True.

In [295]:
# # Example: Uncomment this section to execute our webdriver without opening a new window!
# # Imports
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options

# DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 
# options = Options()
# options.headless = True
# driver = webdriver.Chrome(options = options, executable_path=DRIVER_PATH)
# driver.get("https://bigfuture.collegeboard.org/colleges/allegheny-college/")

# # Uncomment following to end session and close window
# #driver.quit()

**Screenshot of Allegheny College info on BigFuture.com**


<img src="./images/bigfuture_allegheny_home.jpg"/>


The base link loads the `About` page by default. We can click on `Admissions`, `Academics`, `Costs`, and `Campus Life` tabs to get more specific information in those categories.   

Before we start gathering some data, let's deal with that annoying popup! Right click on the button 'Accept All Cookies' and click **Inspect**. This will open the developer tool. If it doesn't automatically take you to the element where this accept cookies button is located, right click on that button one more time and click inspect again. This should take you to the appropriate button.

<img src ="./images/cookies_button.jpg">

Luckily, this button has a specific button id called **onetrust-accept-btn-handler**, which we can use to our advantage. We also have to consider that this popup may not always show up, so we need a way to handle exceptions if that's the case. Which is why we load `NoSuchElementException`.  

`NoSuchElementException` is used more frequently later in the notebook as well, because we'll notce that some universities may not have all the information. If certain classes, id, etc. are missing, the script will crash! So we need a way to handle these exceptions.

Also, notice that we use the `By` function here, but we could have also written it with `find_element_by_id` to same effect.

Load up the webpage and run the following script. You'll notice that Selenium will automatically click on Accept Cookies for us!

In [383]:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Code to click on Accept All Cookies, or do nothing if it doesn't exist!
try:
    cookies_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
    driver.execute_script("arguments[0].click();", cookies_button)
except NoSuchElementException:
    pass  # Skip clicking if the element is not found

### **Data to Collect**

Okay, now we're ready to find some data! You can play around the websites and explore some more, but after going through the different tabs, I decided on collecting these information:

**About Page**:
* University Name
* City
* State
* Number of Years (4 / 2 / etc.)
* Control (Private / Public)
* Size (Small / Medium / Grande? / Venti?)
* Setting (Urban / Suburban / Rural)
* SAT Range (denoted by two numbers such as 1200-1400)
* Whether it's a liberal arts college or not

**Admissions Tab**:  
* Acceptance Rate
* Selectivity / Difficulty
* Application Deadline for Regular Decision
* Application Deadline for Early Action
* Application Deadline for Early Decision

**Costs Tab**:  
* Net Cost
* Tuition
* Housing
* Books
* Personal Expenses
* Avg Percent of Need Met


## Let's Start Scraping!

### Inspecting Elements

**University Name**  
Now comes the most fun (and sometimes most frustrating) part of webscraping. We'll find and inspect the elements we are interested in and hope that there is a specific id/class name/selector for that element so that we can scrape that info consistently for all our urls. A  good practice is to load up multiple url's and make sure that the id/class/selector is the same in each url for the same element.

For example, we are interested in the University Name that shows up. In this case, it's Allegheny College. We do the same as before: highlight `Allegheny College`, right click, and Inspect. This opens up Dev Tools. Repeat the Inspect again to highlight the location where the element is located in the JS. 

<img src = './images/allegheny_college.jpg'>

If you right click on that element in the Dev Tools and copy element, you will get the following:

In [301]:
# <h1 id="csp-banner-section-school-name-label" class="_3Bdh-pnoD5EZYZQbT89tl">Allegheny College</h1>

At this point you should perform some checks to make sure that this element will always return the university name, regardless of the URL. 

1. Copy the id part `csp-banner-section-school-name-label`. 
2. Hit CTRL + F to open search filter in dev tools. 
3. Paste the id name `csp-banner-section-school-name-label` in the search bar.

You should see that there's only 1 match, which means that this id is unique. Now repeat the above for a different url and make sure that id shows you the university name. This is good practice to do before scraping multiple web pages at once. You want to make sure the website is consistent across different url's.

<img src = './images/uni_name.jpg'>

Now that we've confirmed the specific id for university name, we're ready to fetch that information!

In [331]:
#UNI NAME
uni_name = driver.find_element_by_id("csp-banner-section-school-name-label").text
print(f'University Name: {uni_name}')

University Name: Allegheny College


Done! We have scraped our first piece of information and stored it in a variable called uni_name. Notice that we used `find_element_by_id()` function. We will use different functions later depending on what we need! Oh, and don't forget to use the `.text` at the end to only get the text inside that wrapper.

**City and State**  

Next, we want the city and state information for the university. It's written right below the University Name which we fetched in the cell above. We repeat the same process as above by inspecting the element and checking what it shows for the selector/locator now. 

<img src = "./images/city_state.jpg">

The id appears to be `csp-banner-section-school-location-label`. Again, I check the other url's to make sure this is always the case. Now we're ready to get this information as well.

In [306]:
# STATE AND CITY
location = driver.find_element_by_id("csp-banner-section-school-location-label").text
print(location)

Meadville, PA


Great! We have city and state, but there's a little problem. This is one string; ideally we want two different variables - one for city and one for state. So we do a little string manipulation to get what we need.  

`.split(',')` will split the string into two, separating it at the comma as we specified. We'll make use of `.strip()` again to get rid of leading and trailing spaces for the state.

In [308]:
# Split location string into two parts
parts = location.split(',')
# Removing any leading or trailing whitespace
city_name = parts[0].strip()
print(f'City: {city_name}')
# Extracting the state, removing any leading or trailing whitespace
state_code = parts[1].strip()
print(f'State: {state_code}')

City: Meadville
State: PA


Done! Now we've scraped the City and State information as well. We're making progress.

**Number of Years**

The next information we want is how many years of study is required for that university. Notice that other critical information we want are in the same region (control, size, setting)!

<img src = "./images/overview.jpg">

We repeat the same process as above, and come up with the following:
<img src = "./images/years.jpg">

You should spot the difference from above. There isn't a specific `id = ""` like before. Instead it's wrapped inside the li header with `data-testid="csp-overview-school-type-by-years"`. You can get this by Right Clicking on the element and Copying Element.

We can't exactly use `find_element_by_id()` here. Instead, we'll use `find_element_by_css_selector()`. As always make sure that it is unique by copying that selector and making sure there's only one matching element.

In [312]:
# NO. OF YEARS
no_of_years = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-years"]').text
print(no_of_years)

4-year


Great! We have the info, but what if I don't want the "-year" part of that string? Well let's use some regex to get only the number. (If we know  that it's ALWAYS going to be in that format, we could have also easily just used `no_of_years[0]`)

In [315]:
import re

no_of_years = re.search(r'\d+', no_of_years).group() # the d+ finds numbers
print(f'No of years required: {no_of_years}')

4


**Control, Size, Setting**  
Similarly, we'll get the other information from that line - Control, Size, Setting:

In [319]:
# UNI CONTROL
uni_ctrl = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-designation"]').text
print(f'Uni Control: {uni_ctrl}')

# UNI SIZE 
uni_size = driver.find_element_by_css_selector('[data-testid="csp-overview-school-size"]').text
print(f'Uni Size: {uni_size}')

# UNI SETTING
uni_setting = driver.find_element_by_css_selector('[data-testid="csp-overview-school-setting"]').text
print(f'Uni Setting: {uni_setting}')

Uni Control: Private
Uni Size: Small
Uni Setting: Suburban


Done! Next we move on to getting the SAT score range:

**SAT SCORE**

We repeat the same process as above. Highlight the SAT score ranges, and find the respective element. We can find_by_css_selector yet again. Copying element gives us the following:


`<div class="csp-overview-info-text" data-testid="csp-overview-sat-range"><strong class="cb-roboto-medium">1110–1370</strong></div>`

In [395]:
# SAT SCORES
sat_range = driver.find_element_by_css_selector('.csp-overview-info-text[data-testid="csp-overview-sat-range"]').text
print(sat_range)

1110–1370


Okay, good, but I want to separate those two numbers in different columns, so one will be SAT Min and the other will be SAT Max. This will help us with sorting later on, if we want. Also, note that sometimes the score is nore reported! So, there will be no numbers or anything. That's why we need to set a ValueError case as well.

In [396]:
# Split the text before and after the hyphen
sat_range = sat_range.split('–')
# Store the numbers in separate variables
try:
    sat_min = int(sat_range[0])
except ValueError:
    sat_min = "N/A"
try:
    sat_max = int(sat_range[1])
except (ValueError, IndexError):
    sat_max = "N/A"
# Print the sat ranges
print(f'SAT lower range: {sat_min}')
print(f'SAT upper range: {sat_max}')

SAT lower range: 1110
SAT upper range: 1370


Much better. We could always display it like before as a range with a hyphen, but this will allow us to find descriptive statistics easier later on.

**Classify as Liberal Arts College (LAC)**

Next, we want to classify whether the institution is a liberal arts college or not. This is shown in the text field here:

<img src = "./images/lac.jpg">

We highlight that text field, find the css_selector and get the relevant text.

In [322]:
# UNI TYPE
uni_type_text = driver.find_element_by_css_selector('.csp-shared-section-descriptions[data-testid="csp-more-about-description"]').text
print(uni_type_text)

Allegheny College is a small, 4-year, private liberal arts college. This coed college is located in a large town in a suburban setting and is primarily a residential campus. It offers bachelor's degrees.


Okay, we got all the text in that field, but we're only interested in whether this is a liberal arts college (LAC), so we can do a quick check to make that classification: 

In [324]:
# Create a variable based on conditions
uni_type = ""
# Classify:
if "private university" in uni_type_text.lower():
    uni_type = "Private Uni"
elif "liberal arts college" in uni_type_text.lower():
    uni_type = "LAC" 
print(f'Uni type: {uni_type}')

Uni type: LAC


### Switch to Admissions Tab
We're done with the About page. We got all the infor we want from there for now. Let's switch to Admissions tab next. But first, we have to find a way to click on Admissions to switch tab. This can be done easily by first locating the Admissions button and then using `.click()` to click on it!

As this is a link that has a specific text on it, we can actually find this using `find_element_by_link_text()` as follows. If you run thw following script, you'll notice that your selenium chrome window has switched to the Admissions tab by itself!

In [325]:
# SWITCH TO ADMISSIONS TAB
admission_link = driver.find_element_by_link_text("Admissions")
# Click on the element
admission_link.click()

**Acceptance Rate, Selectivity, Dates**

For this section, I won't go into details, as the method is largely  the same, and we're using more or less the same codes as above. Except for Acceptance Rate and Selectivity. Sometimes this is not reported, so we'll either get a value error or an index error, in those cases, we will store the value of "N/A"

In [326]:
# ACCEPTANCE RATE AND SELECTIVITY
acceptance_text = driver.find_element_by_css_selector('.cs-label-value-pair-value[data-testid="acceptance-rate-value"]').text
# Splitting the text based on the opening parenthesis
parts = acceptance_text.split(" (")
# Storing the values in separate variables
try:
    acceptance_rate = parts[0]
except IndexError:
    acceptance_rate = "N/A"
try:
    selectivity = parts[1][:-1]  # Removing the closing parenthesis
except (IndexError, ValueError):
    selectivity = "N/A"
print("Acceptance Rate:", acceptance_rate)
print("Selectivity:", selectivity)

# DATES
date_ra = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="regular-decision-value"]').text
date_ea = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-action-value"]').text
date_ed = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-decision-value"]').text
print(f'Regular Action Deadline: {date_ra}')
print(f'Early Action Deadline: {date_ea}')
print(f'Early Decision Deadline: {date_ea}')


Acceptance Rate: 62%
Selectivity: Less Selective
 Regular Action Deadline: Feb 15
 Early Action Deadline: Dec 1
 Early Decision Deadline: Dec 1


### Switch to Costs tab

Just like before, we can inspect the element of `Costs` and find out more information about it and use the similar code as above:

In [384]:
## SWITCH TO COSTS TAB
costs_link = driver.find_element_by_link_text("Costs")
# Click on the element
costs_link.click()

**Net Cost, Tuition (Out of State), Housing, Books, Personal Expenses**

From this tab, we're interested in the average net price, which is the cost of tuition and fees minus scholarships and grants. We're also interested in costs such as for housing, books and supplies, and personal expenses. And, finally, we're interested in tuition, with a catch. For this particular example, we're only intersted in Out-of-state tuition. You'll notice that some of the url's will only list Tuition with id `csp-list-item-private-tuition-value`, but not in-state and out-of-state separate. While other url's will separate those; id = `csp-list-item-out-of-state-tuition-value`. We have to consider that! 

As with the previous section, won't go into details on some of these. Notice that we use filters to only keep the number values! But sometimes, the uni's don't list any numbers at all, it may be "Not Available", so we need create a function to deal with such cases.

The `filter()` function is used to iterate over each character in the books string and apply the `str.isdigit()` function to filter out only the characters that are digits (0-9). It returns an iterable containing only the digit characters from the books string.

The join() method concatenates the filtered digit characters from the iterable into a single string. The '' empty string is used as the separator, meaning the digit characters will be joined without any separators between them.

We then check whether digit_string exists; if it does, we store that value. Otherwise, we store empty value.

In [385]:
# Create a function for the costs section:
def get_numeric_val(var):
    filtered_digits = filter(str.isdigit, var)
    digit_string = ''.join(filtered_digits)
    if digit_string:
        var = int(digit_string)
    else:
        var = ''  
    return(var)

In [346]:
# COSTS
net_cost = driver.find_element_by_class_name("csp-tuition-label-value-pair-value").text
net_cost = get_numeric_val(net_cost)

housing = driver.find_element_by_id("csp-list-item-average-housing-value").text
housing = get_numeric_val(housing)

books = driver.find_element_by_id("csp-list-item-books-and-supplies-value").text
books = get_numeric_val(books)

personal_exp = driver.find_element_by_id("csp-list-item-estimatedPersonalExpenses-value").text
personal_exp = get_numeric_val(personal_exp)

print(f'Net Cost: {net_cost}')
print(f'Housing Cost: {housing}')
print(f'Books and Supplies Cost: {books}')
print(f'Personal Expenses: {personal_exp}')

# AVG FINANCIAL NEED MET
need_met = driver.find_element_by_id("csp-list-item-need-met-value").text
print(f'Avg Need met: {need_met}')

Net Cost: 24227
Housing Cost: 14868
Books and Supplies Cost: 400
Personal Expenses: 1100
Avg Need met: 90%


**Tuition**
As mentioned above, tuition is a bit trickier, private unis will have only tuition listed, whereas public unis may have out-of-state listed separately. So we have to consider both cases!

In [388]:
try:
    tuition = driver.find_element_by_id("csp-list-item-private-tuition-value").text
    tuition = get_numeric_val(tuition)
except NoSuchElementException:
    try:
        tuition = driver.find_element_by_id("csp-list-item-out-of-state-tuition-value").text
        tuition = get_numeric_val(tuition)
    except NoSuchElementException:
        tuition = "N/A"
print(f'Tuition: {tuition}')

Tuition: 54300


And we're done with the BigFuture site! That's all the information we want to scrape for now. So let's see if we can run all that in one go!

## SCRAPE SINGLE URL - BIGFUTURE<a id='bf-single'></a>


In [333]:
# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import re
import time
import pandas as pd

In [347]:
# Create a function for the costs section:
def get_numeric_val(var):
    filtered_digits = filter(str.isdigit, var)
    digit_string = ''.join(filtered_digits)
    if digit_string:
        var = int(digit_string)
    else:
        var = ''  
    return(var)

In [397]:
# TEST CODE FOR SINGLE SITE. UNCOMMENT BEFORE RUNNING!

DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://bigfuture.collegeboard.org/colleges/harvard-college")
# Use line 7 to test whether it works fine for out-of-state tuition
# driver.get("https://bigfuture.collegeboard.org/colleges/minnesota-state-university-moorhead/")

time.sleep(2) # Add a small delay before clicking on button
try:
    cookies_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
    driver.execute_script("arguments[0].click();", cookies_button)
except NoSuchElementException:
    pass  # Skip clicking if the element is not found

time.sleep(1) # Add a small delay after clicking on button

#UNI NAME
uni_name = driver.find_element_by_id("csp-banner-section-school-name-label").text
print(f'University Name: {uni_name}')

# STATE AND CITY
location = driver.find_element_by_id("csp-banner-section-school-location-label").text
# Splitting the text based on the comma
parts = location.split(',')
# Removing any leading or trailing whitespace
city_name = parts[0].strip()
print(f'City: {city_name}')
# Extracting the last two letters
state_code = parts[1].strip()
print(f'State: {state_code}')

# NO. OF YEARS
no_of_years = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-years"]').text
no_of_years = re.search(r'\d+', no_of_years).group()
print(f'No of years required: {no_of_years}')

# UNI CONTROL
uni_ctrl = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-designation"]').text
print(f'Uni Control: {uni_ctrl}')

# UNI SIZE AND SETTING
uni_size = driver.find_element_by_css_selector('[data-testid="csp-overview-school-size"]').text
print(f'Uni Size: {uni_size}')
uni_setting = driver.find_element_by_css_selector('[data-testid="csp-overview-school-setting"]').text
print(f'Uni Setting: {uni_setting}')

# SAT SCORES
sat_range = driver.find_element_by_css_selector('.csp-overview-info-text[data-testid="csp-overview-sat-range"]').text
#Get the text of the element
sat_range = sat_range.split('–')
# Storing the numbers in separate variables
try:
    sat_min = int(sat_range[0])
except ValueError:
    sat_min = "N/A"
try:
    sat_max = int(sat_range[1])
except (ValueError, IndexError):
    sat_max = "N/A"
# Print the sat ranges
print(f'SAT Lower Range: {sat_min}')
print(f'SAT Upper Range: {sat_max}')

# UNI TYPE
uni_type_text = driver.find_element_by_css_selector('.csp-shared-section-descriptions[data-testid="csp-more-about-description"]').text
# Create a variable based on conditions
uni_type = ""
if "private university" in uni_type_text:
    uni_type = "Private Uni"
elif "liberal arts college" in uni_type_text:
    uni_type = "LAC" 
print(f'Uni Type: {uni_type}')


## SWITCH TO ADMISSIONS TAB
print('Switching to Admissions tab.')
admission_link = driver.find_element_by_link_text("Admissions")
# Click on the element
admission_link.click()

# ACCEPTANCE RATE AND SELECTIVITY
acceptance_text = driver.find_element_by_css_selector('.cs-label-value-pair-value[data-testid="acceptance-rate-value"]').text
# Splitting the text based on the opening parenthesis
parts = acceptance_text.split(" (")
# Storing the values in separate variables
try:
    acceptance_rate = parts[0]
except IndexError:
    acceptance_rate = "N/A"
try:
    selectivity = parts[1][:-1]  # Removing the closing parenthesis
except (IndexError, ValueError):
    selectivity = "N/A"
print("Acceptance Rate:", acceptance_rate)
print("Selectivity:", selectivity)

# DATES
date_ra = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="regular-decision-value"]').text
date_ea = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-action-value"]').text
date_ed = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-decision-value"]').text
print(f'Regular Action Deadline: {date_ra}')
print(f'Early Action Deadline: {date_ea}')
print(f'Early Decision Deadline: {date_ed}')

## SWITCH TO COSTS TAB
print('Switching to Costs tab')
costs_link = driver.find_element_by_link_text("Costs")
# Click on the element
costs_link.click()

# COSTS
net_cost = driver.find_element_by_class_name("csp-tuition-label-value-pair-value").text
net_cost = get_numeric_val(net_cost)
housing = driver.find_element_by_id("csp-list-item-average-housing-value").text
housing = get_numeric_val(housing)
books = driver.find_element_by_id("csp-list-item-books-and-supplies-value").text
books = get_numeric_val(books)
personal_exp = driver.find_element_by_id("csp-list-item-estimatedPersonalExpenses-value").text
personal_exp = get_numeric_val(personal_exp)
try:
    tuition = driver.find_element_by_id("csp-list-item-private-tuition-value").text
    tuition = get_numeric_val(tuition)
except NoSuchElementException:
    try:
        tuition = driver.find_element_by_id("csp-list-item-out-of-state-tuition-value").text
        tuition = get_numeric_val(tuition)
    except NoSuchElementException:
        tuition = None  # or any default value you prefer
print(f'Net Cost: {net_cost}')
print(f'Tuition: {tuition}')
print(f'Books: {books}')
print(f'Personal Expenses: {personal_exp}')

# AVG FINANCIAL NEED MET
need_met = driver.find_element_by_id("csp-list-item-need-met-value").text
print(f'Average Financial Need Met: {need_met}')

# End session
driver.quit()

University Name: Harvard College
City: Cambridge
State: MA
No of years required: 4
Uni Control: Private
Uni Size: Medium
Uni Setting: Urban
SAT Lower Range: N/A
SAT Upper Range: N/A
Uni Type: Private Uni
Switching to Admissions tab.
Acceptance Rate: 4%
Selectivity: Extremely Selective
Regular Action Deadline: Jan 1
Early Action Deadline: Nov 1
Early Decision Deadline: Not available
Switching to Costs tab
Net Cost: 
Tuition: 54269
Books: 1000
Personal Expenses: 2500
Average Financial Need Met: 100%


The above code works just fine for a single url, and we were able to get the information we need! Now it's time to test it on multiple webpages.

## SCRAPE MULTIPLE URLS - BIGFUTURE<a id='bf-multi'></a>


First, we need to create a list of the urls. I have a .txt file with all the urls which I'll parse through and create a list.

In [130]:
# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import re
import time
import pandas as pd

In [403]:
# Empty list
site_lists = []

# Open the text file and read the links
with open('uni_list_2.txt', 'r') as file:
    # Read each line and append the link to the list
    for line in file:
        link = line.strip()  # Remove leading/trailing whitespaces and newline characters
        site_lists.append(link)

# Print the list of links
# print(site_lists)

We define `get_numeric_val` function here again, in case we don't want to re-run previous cells in the notebook.

In [131]:
# Create a function for the costs section:
def get_numeric_val(var):
    filtered_digits = filter(str.isdigit, var)
    digit_string = ''.join(filtered_digits)
    if digit_string:
        var = int(digit_string)
    else:
        var = ''  
    return(var)

We wrap our code from the single url case inside a for loop where we iterate through each of the url's in the list. Another notable thing is that for each snippet of code, we wrapped it in a **try** and **except** block; this is so that in case we don't find that particular locator, we'll note "N/A" for that value and move on. Also, for each snippet of code, we **append** to our ever growing lists, which we will use later to create a pandas dataframe and export as a spreadsheet!

There's a lot of commented out print commands, which are crucial for debugging in case we run into errors; they were used heavily during my initial testing phases.

In [132]:
# Placeholder cell for testing a small sample of urls or problem urls.
#site_lists = ['https://bigfuture.collegeboard.org/colleges/columbia-university']

In [133]:
# Create empty lists to store values in
uni_name_list = []
city_name_list = []
state_code_list = []
no_of_years_list = []
uni_ctrl_list = []
uni_size_list = []
uni_setting_list = []
sat_min_list = []
sat_max_list = []
uni_type_list = []
acceptance_list = []
selectivity_list = []
date_ra_list = []
date_ea_list = []
date_ed_list = []
net_cost_list = []
tuition_list = []
housing_list = []
books_list = []
personal_exp_list = []
need_met_list = []

# Create a loop to go through each url in the list and scrape the sites
for link in site_lists:

    print(f'Accessing URL: {link}')
    DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options = options, executable_path=DRIVER_PATH)
    driver.get(link)
    
    time.sleep(2)  # Add a delay of 2 seconds (adjust the value as needed)
    
    # CLICK ACCEPT COOKIES
    print('Trying to click on cookies button')
    try:
        cookies_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
        driver.execute_script("arguments[0].click();", cookies_button)
    except NoSuchElementException:
        pass  # Skip clicking if the element is not found
    
    time.sleep(1) # Add a delay of 1 second after clicking on button
    
    print("Getting Data.")
    
    # UNI NAME
    try:
        uni_name = driver.find_element_by_id("csp-banner-section-school-name-label").text
        #print(f'Uni Name: {uni_name}')
    except NoSuchElementException:
        uni_name = "N/A"
    uni_name_list.append(uni_name)
    
    # STATE AND CITY
    try:
        location = driver.find_element_by_id("csp-banner-section-school-location-label").text
        # Splitting the text based on the comma
        parts = location.split(',')
        # Removing any leading or trailing whitespace
        city_name = parts[0].strip()
        # Extracting the last two letters
        state_code = parts[1].strip()
    except NoSuchElementException:
        city_name = "N/A"
        state_code = "N/A"
    city_name_list.append(city_name)
    #print(f'City: {city_name}')
    state_code_list.append(state_code)
    #print(f'State: {state_code}')    

    # NO. OF YEARS
    try:
        no_of_years = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-years"]').text
        no_of_years = re.search(r'\d+', no_of_years).group()
    except NoSuchElementException:
        no_of_years = "N/A"
    no_of_years_list.append(no_of_years)
    # print(f'No of years required: {no_of_years}')

    # UNI CONTROL
    try:
        uni_ctrl = driver.find_element_by_css_selector('[data-testid="csp-overview-school-type-by-designation"]').text
    except NoSuchElementException:
        uni_ctrl = "N/A"
    uni_ctrl_list.append(uni_ctrl)
    # print(f'Uni Control: {uni_ctrl}')

    # UNI SIZE 
    try:
        uni_size = driver.find_element_by_css_selector('[data-testid="csp-overview-school-size"]').text
    except NoSuchElementException:
        uni_size = "N/A"
    uni_size_list.append(uni_size)
    # print(f'Uni Size: {uni_size}')
    
    # UNI SETTING
    try:
        uni_setting = driver.find_element_by_css_selector('[data-testid="csp-overview-school-setting"]').text
    except NoSuchElementException:
        uni_setting = "N/A"
    uni_setting_list.append(uni_setting)
    # print(f'Uni Setting: {uni_setting}')

    # SAT SCORES    
    try:
        sat_range = driver.find_element_by_css_selector('.csp-overview-info-text[data-testid="csp-overview-sat-range"]').text
        #Get the text of the element
        sat_range = sat_range.split('–')
        # Storing the numbers in separate variables
        try:
            sat_min = int(sat_range[0])
        except ValueError:
            sat_min = "N/A"
        try:
            sat_max = int(sat_range[1])
        except (ValueError, IndexError):
            sat_max = "N/A"
    except NoSuchElementException:
        sat_min = "N/A"
        sat_max = "N/A"
    sat_min_list.append(sat_min)
    sat_max_list.append(sat_max)
    # Print the sat ranges
    # print(f'SAT Lower Range: {sat_min}')
    # print(f'SAT Upper Range: {sat_max}')

    # UNI TYPE
    try:
        uni_type_text = driver.find_element_by_css_selector('.csp-shared-section-descriptions[data-testid="csp-more-about-description"]').text
        # Create a variable based on conditions
        uni_type = ""
        if "private university" in uni_type_text:
            uni_type = "Private Uni"
        elif "liberal arts college" in uni_type_text:
            uni_type = "LAC" 
    except NoSuchElementException:
        uni_type = ""
    #print(uni_type)
    uni_type_list.append(uni_type)


    ## SWITCH TO ADMISSIONS TAB
    admission_link = driver.find_element_by_link_text("Admissions")
    # Click on the element
    admission_link.click()
    time.sleep(1) # small delay

    # ACCEPTANCE RATE AND SELECTIVITY
    try:
        acceptance_text = driver.find_element_by_css_selector('.cs-label-value-pair-value[data-testid="acceptance-rate-value"]').text
        # Splitting the text based on the opening parenthesis
        parts = acceptance_text.split(" (")
        # Storing the values in separate variables
        try:
            acceptance_rate = parts[0]
        except IndexError:
            acceptance_rate = "N/A"
        try:
            selectivity = parts[1][:-1]  # Removing the closing parenthesis
        except (IndexError, ValueError):
            selectivity = "N/A"
    except NoSuchElementException:
        acceptance_rate = "N/A"
        selectivity = "N/A"
    acceptance_list.append(acceptance_rate)
    selectivity_list.append(selectivity)
    # print("Acceptance Rate:", acceptance_rate)
    # print("Selectivity:", selectivity)

    # DATES
    try:
        date_ra = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="regular-decision-value"]').text
    except NoSuchElementException:
        date_ra = "Not available"
    date_ra_list.append(date_ra)
    # print(f'Regular Action Deadline: {date_ra}')
    
    try:
        date_ea = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-action-value"]').text
    except NoSuchElementException:
        date_ea = "Not available"
    date_ea_list.append(date_ea)
    # print(f'Early Action Deadline: {date_ea}')
    
    try:
        date_ed = driver.find_element_by_css_selector('.csp-application-deadline-item-value[data-testid="early-decision-value"]').text
    except NoSuchElementException:
        date_ed = "Not available"
    date_ed_list.append(date_ed)
    # print(f'Early Decision Deadline: {date_ed}')


    ## SWITCH TO COSTS TAB
    costs_link = driver.find_element_by_link_text("Costs")
    # Click on the element
    costs_link.click()
    time.sleep(1) # small delay

    # COSTS
    try:
        net_cost = driver.find_element_by_class_name("csp-tuition-label-value-pair-value").text
        net_cost = get_numeric_val(net_cost)
    except NoSuchElementException:
        net_cost = "N/A"
    net_cost_list.append(net_cost)
    #print(f'Net Cost: {net_cost}')   
    
    # TUITION
    try:
        tuition = driver.find_element_by_id("csp-list-item-private-tuition-value").text
        tuition = get_numeric_val(tuition)
    except NoSuchElementException:
        try:
            tuition = driver.find_element_by_id("csp-list-item-out-of-state-tuition-value").text
            tuition = get_numeric_val(tuition)
        except NoSuchElementException:
            tuition = "N/A" 
    tuition_list.append(tuition)
    #print(f'Tuition: {tuition}')
    
    try:
        housing = driver.find_element_by_id("csp-list-item-average-housing-value").text
        housing = get_numeric_val(housing)
    except NoSuchElementException:
        housing = "N/A"
    housing_list.append(housing)
    #print(f'Housing: {housing}')
    
    try:
        books = driver.find_element_by_id("csp-list-item-books-and-supplies-value").text
        books = get_numeric_val(books)
    except NoSuchElementException:
        books = "N/A"
    books_list.append(books)
    #print(f'Books: {books}')
    
    try:
        personal_exp = driver.find_element_by_id("csp-list-item-estimatedPersonalExpenses-value").text
        personal_exp = get_numeric_val(personal_exp)
    except NoSuchElementException:
        personal_exp = "N/A"
    personal_exp_list.append(personal_exp)
    #print(f'Personal Expenses: {personal_exp}')

    # AVG FINANCIAL NEED MET
    try:
        need_met = driver.find_element_by_id("csp-list-item-need-met-value").text
        need_met_list.append(need_met)
    except NoSuchElementException:
        need_met = "N/A"
    #print(f'Average Financial Need Met: {need_met}')
    
    # END SESSION
    driver.quit()
    
    print(f"Data Collected for {uni_name}")
    
    # Small delay before getting new data
    time.sleep(3)

Current URL: https://bigfuture.collegeboard.org/colleges/columbia-university
Trying to click on cookies button
Getting Data
Data Collected for Columbia University


**OH NO! The script crashed, what do I do?**

In the version it is in right now, the script should run fine, but in case you still do run into an error because some information was missing from a site or any other reason, and your loop crashes somewhere in the middle, fear not! You don't have to rerun everything from scratch. (In fact, this has happened to me a few times already)

What you would be tempted to do is go to the next section and create a dataframe with the data that you have already collected so far. And that is fine, but do make sure that the lists are all the same length. You can check that by running `len(list_name)`.

In [316]:
len(personal_exp_list), len(uni_name_list) # and so on

(1, 35)

What will happen most likely is some of the first few lists will most likely have a an extra value (usually a N/A), so their length will be 1 more than the remainder of the lists. This means that these first few lists have the values from the url where it crashed. We simply have to delete an element from those lists.

In [317]:
# Deleting last element from a list
a = ['1', '2', '3', 'n/a']
print(a)
del a[-1]
print(a)

['1', '2', '3', 'n/a']
['1', '2', '3']


Once you've ensured that all the lists are of the same length, you can proceed to the next section and create a datarame and export to csv. When we rerun the above loop to scrape data, you'll need to modify the url list. You don't want to rerun the url's you already have data for, so make sure those are deleted (or index properly when you run your loop).

This way, you'll probably end up with multiple csv's (assuming 1 or more times it crashed). All you have to do is concat them together at the end!

In [None]:
# # Sample Code to concat dataframes
# import pandas as pd
# # assume your dataframes are called df1 and df2
# df_master = pd.concat([df1, df2], axis = 0)

## CREATE DATAFRAME FOR BIGFUTURE<a id='bf-pandas'></a>


Here we use `pandas` to create a dataframe using the information stored in all our lists.

In [148]:
import pandas as pd

data = {
    "University Name": uni_name_list,
    "City": city_name_list,
    "State": state_code_list,
    "Years": no_of_years_list,
    "Control": uni_ctrl_list,
    "Type": uni_type_list,
    "Size": uni_size_list,
    "Setting": uni_setting_list,
    "SAT Min": sat_min_list,
    "SAT Max": sat_max_list,
    "Acceptance Rate": acceptance_list,
    "Difficulty Level": selectivity_list,
    "RA": date_ra_list,
    "EA": date_ea_list,
    "ED": date_ed_list,
    "Net Cost": net_cost_list,
    "Tuition": tuition_list,
    "Housing": housing_list,
    "Books": books_list,
    "Personal Expenses": personal_exp_list,
    "Avg Need Met": need_met_list,
    "URL" : site_lists
}

# Create the DataFrame
bigfuture_info = pd.DataFrame(data)

bigfuture_info.sample(5) # display 5 random rows from the dataframe

Unnamed: 0,University Name,City,State,Years,Control,Type,Size,Setting,SAT Min,SAT Max,...,RA,EA,ED,Net Cost,Tuition,Housing,Books,Personal Expenses,Avg Need Met,URL
0,Columbia University,New York,NY,4,Private,Private Uni,Medium,Urban,1500,1560,...,Jan 1,Not available,Nov 1,21828,65340,16800,1392,2350,99%,https://bigfuture.collegeboard.org/colleges/co...


**SAVE BIGFUTURE DATAFRAME TO CSV**

If we are interested only in the BigFuture data, we can now save a csv file.

In [142]:
# Save as csv
bigfuture_info.to_csv('bigfuture_uni_info_1.csv', index = False)

And we're done!! But, are we? We still have to scrape the CollegeData website, so that's what we do in the next section.

## EXPLORE COLLEGEDATA.COM<a id='cd'></a>


Just like with the BigFuture website, it's important to explore CollegeData.com and see where the information is laid out, how it's presented, etc. 

In [318]:
# Import webdriver
from selenium import webdriver

#The chromedriver_win32 folder is in the same directory where this notebook is located
DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 

# Initiate the webdriver
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
# This will open a new chrome window
driver.get("https://waf.collegedata.com/college-search/allegheny-college/")

# Uncomment following to end session and close window
#driver.quit()

We notice that there are 6 tabs. The default link opens the Overview tab. There are also Admissions, Financials, Academic, Campus Life, and Students tabs. 

<img src = "./images/cd_allegheny.jpg">

After some playing around, we decide to collect the following information.

**Overview Tab**:  
* University Name
* Women %

**Admissions Tab**:  
* Regular Action Deadline
* Early Action Deadline
* Early Decision Deadline
* Waiting List (accepted or not)
* Letters of Recommendation required

**Campus Life Tab**:  
* Campus Size (in acres)

**Students Tab**:
* International Students % and how many countries are represented.

One thing you'll notice is the annoying popup to subscribe to their newsletter every time you switch to a new tab! We'll need to deal with this first.

Another thing to note is that since we're not specifying the window size, Selenium loads the chrome page in default size. This actually makes it so that some of the sections are not automatically expanded. We'll need to script clicking on the section headers to expandion the sections!

### **FUNCTION TO CLOSE NEWSLETTER**

This section writes a function to deal with the pesky newsletter subscription popup that shows up every time we switch tabs. First, we inspect the close button (X on the top right) and find the class name `NewsletterModal_close__2xauP`. This is the same regardless of which tab the popup shows up on, so we can write a single function to close this. Also, note that the popup takes a few seconds to load, so we add a small time delay before we attempt to find and close this popup window. We end the function with a small time delay as well, so that we have a little leeway before we move to next tasks.

<img src = "./images/news_button.jpg">

We will call on the `close_newsletter()` function every time we switch tabs.

In [319]:
# Function to Close newsletter
import time

def close_newsletter():
    time.sleep(5) # small delay as the hidden page takes a few seconds to load
    print('Closing Newsletter.')
    try:
        hidden_button = driver.find_element_by_class_name("NewsletterModal_close__2xauP")
        driver.execute_script("arguments[0].click();", hidden_button)
    except NoSuchElementException:
        pass  # Skip clicking if the element is not found
    time.sleep(1)

In [320]:
# Run close_newsletter() to close the popup
close_newsletter()

Closing Newsletter.


### Let's start scraping!

**University Name**  
First, let's get the University name from the Overview page. As before, we highlight the text and Inspect element. There are several ways to go about it. We can use the class `ActionHeader_titles__1bP13`, or we can also just use the xpath. Be careful about using the xpath though, as it does not always work in every url (the path to the information may change). However, I've checked several different url's from CollegeData and the xpath seems to work every time for the University Name, so we'll go with this for now.

Copying the element gives this:
`<div class="ActionHeader_titles__1bP13"><h1 class="cd-web-h1 ActionHeader_title__104Gs">Allegheny College<span class="ActionHeader_subtitle__1i5Tt">Facts &amp; Information</span></h1></div>`

<img src = "./images/allegheny_class.jpg">

Notice in the code below that we use the `find_element_by_xpath()` function. We get the xpath by highlighting the element and then copying the full xpath and pasting in this cell. And as before, we have to make sure to use the `.text` at the end.

A small problem here is that the information we get by using this xpath is the University Name in the first line, followed by "Facts and Information" in the second line. We don't care about the second line. So we use `.splitlines()` to split the lines and then only take the first line. Also, in case this xpath doesn't exist, we make sure to handle exceptions `NoSuchElementException` as well.

In [321]:
# UNI NAME
try:
    uni_name = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[1]/div/div/div/div/div[1]/h1').text
    uni_name = uni_name.splitlines()[0]
except NoSuchElementException:
    uni_name = "N/A"
print(f'University Name: {uni_name}')

University Name: Allegheny College


**Women %**  
Next, we want the Women % data as shown in the image below. We inspect the element as usual.
<img src = "./images/women_pct.jpg">

Some problems here! Namely, notice that the class name is not unique. You can do a quick check by copying the class id and searching for it by pressing CTRL + F and pasting the idea; you'll see that there are multiple matches. Now if we were absolutely sure that every single url will have the exact same number of classes and the index of Women % would always be the same, we could just use indexing instead. However, as I found out with some trial and error, that is not always the case for this website. So, we go for xpath again, and I did test this for several sites, and it worked.

The result we get is in the form, e.g., `Women - 54.9%%`, we don't want the `Women - ` part of the string, so we'll need to use some manipulation to get rid of that. `.split()` should work fine here, assuming this information is presented the same way for every url. And then we take the second element after the split to get the % we need. We could have also done this with regex, of course, but this seemed simpler. This string manipulation could lead to IndexError or AttributeError in case we don't have the data in the right format or it's not listed, so we handle those exceptions too.

In [322]:
# WOMEN %
try:
    women_pct = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[3]/div[2]/div/div/div[5]/div[2]").text
    women_pct = women_pct.split(' - ')[1]
except NoSuchElementException:
    women_pct = "N/A"
except (IndexError, AttributeError):
    women_pct = "N/A"
print(f'Women%: {women_pct}')

Women%: 53.9%


**Expanding Students Section**  

Because of the default window size, the Students section at the bottom is not automatically expanded. We'll need to click on it to expand the section.

<img src = "./images/students_button.jpg">

Again, xpath works for several different url, so this should be fine to do. We use the `.click()` function to click on the button and then we run the `close_newsletter()` function to close the newsletter popup.

In [323]:
# CLICK ON STUDENTS
students_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[5]/div[2]/div[2]")
students_button.click()

**International Students and Countries**  
The information we want from this section are the % of international students and how many countries are represented in this instituion.

<img src = "./images/intl_students.jpg">
Again, there is a problem here. The classes are not unique, and using xpath doesn't always work either, because some of the url's could have different xpaths. We have to use some tricks here!

First, we find all elements under the class `CollegeProfileContent_content__1hJCl`. This has two subclasses: `TitleValue_title__2-afK` and `TitleValue_value__1JT0d`. Now, if ALL the parent class had both these subclasses, we could use indexing tricks to solve the problem. However, as I noticed with some trial and error, it does not. 

So, what we'll do instead is iterate through all the title and value elements and find where we have the text "International Students" in the title; we'll then get the corresponding value.

There are a few issues here:  
* Sometimes the data is `Not reported`, in which case we have to assign "N/A" to our data.
* Sometimes only the number of countries is reported, e.g. "50 countries."
* Sometimes only the % is reported, e.g. "4.2%."

When we do match the title element to "International Students", the value we get is usually in the form, e.g., `4.1% from 56 countries`. We want to separate this data so that we capture only the % value and the raw number for the countries. Regex comes in handy here! Side note: you can always test your regex here: https://regex101.com/

For more information on how these regex codes work, please check the **regex demos**:  
* Getting %: https://regex101.com/r/RIuS6K/1
* Getting the number only while ignoring %: https://regex101.com/r/NAmr0p/1

As before, in case `NoSuchElementException` occurs, i.e. we can't find information from this class, we do nothing.

In [324]:
# Quick example of regex:

import re

# Test string
test_string_1 = "4.2% from 50 countries" # Case 1
test_string_2 = "Not reported" # Case 2
test_string_3 = "50 countries" # Case 3
test_string_4 = "4.1%" # Case 4

# Find the numbers, capture dots and numbers that follow and end with %.
# If we have Index Error, it means we didn't have the % value.
# Case 1
try:
    test_rate = re.findall(r"(\d+(\.\d+)?%)", test_string_1)[0][0]
except IndexError:
    test_rate = "N/A"
print(f'Case 1: {test_rate}')

# Case 2
try:
    test_rate = re.findall(r"(\d+(\.\d+)?%)", test_string_2)[0][0]
except IndexError:
    test_rate = "N/A"
print(f'Case 2: {test_rate}')

# Case 3
try:
    test_rate = re.findall(r"(\d+(\.\d+)?%)", test_string_3)[0][0]
except IndexError:
    test_rate = "N/A"
print(f'Case 3: {test_rate}')

# Case 4
try:
    test_rate = re.findall(r"(\d+(\.\d+)?%)", test_string_4)[0][0]
except IndexError:
    test_rate = "N/A"
print(f'Case 4: {test_rate}')


# Getting countries represented
# Case 1
try:
    test_countries = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', test_string_1)
    if test_countries:
        test_countries = test_countries[0]
    else:
        test_countries = "N/A"
except IndexError:
    test_countries = "N/A"
print(f'Case 1: {test_countries}')

# Case 2
try:
    test_countries = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', test_string_2)
    if test_countries:
        test_countries = test_countries[0]
    else:
        test_countries = "N/A"
except IndexError:
    test_countries = "N/A"
print(f'Case 2: {test_countries}')

# Case 3
try:
    test_countries = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', test_string_3)
    if test_countries:
        test_countries = test_countries[0]
    else:
        test_countries = "N/A"
except IndexError:
    test_countries = "N/A"
print(f'Case 3: {test_countries}')

# Case 4
try:
    test_countries = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', test_string_4)
    if test_countries:
        test_countries = test_countries[0]
    else:
        test_countries = "N/A"
except IndexError:
    test_countries = "N/A"
print(f'Case 4: {test_countries}')


Case 1: 4.2%
Case 2: N/A
Case 3: N/A
Case 4: 4.1%
Case 1: 50
Case 2: N/A
Case 3: 50
Case 4: N/A


Our reg ex works, and we use the method described above fetch information on international students % and number of countries represented.

In [325]:
from selenium.common.exceptions import NoSuchElementException

# INTERNATIONAL STUDENTS % AND NUMBER OF COUNTRIES
print('Getting International students % and number of countries represented.')
# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        if title_element.text == "International Students":
            intl = value_element.text
            
            try:
                intl_rate = re.findall(r"(\d+(\.\d+)?%)", intl)[0][0]
            except IndexError:
                intl_rate = "N/A"
            print(f'International students %: {intl_rate}')
            
            try:
                intl_ctrs = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', intl)
                if intl_ctrs:
                    intl_ctrs = intl_ctrs[0]
                else:
                    intl_ctrs = "N/A"
            except IndexError:
                intl_ctrs = "N/A"
            print(f'Number of countries: {intl_ctrs}')

            break
            
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not found

Getting International students % and number of countries represented.
International students %: 4.1%
Number of countries: 56


It works - at least for the link we provided. But I've tested it on other urls' where the information on International students was either 'not reported', only countries, or only %.

### **Swith to Admissions Tab**

Now we switch to the Admissions tab: First, we right click and inspect.

<img src = "./images/adm_button.jpg">

Again, we can use the xpath for the different tabs, as they seem to be fixed, but we could have easily just used selector as well. Remember to run `close_newsletter()` after to get rid of that pesky popup. As a reminder, Right click on element, and Copy > full Xpath to get the xpath details.

In [326]:
# CLICK ON ADMISSIONS
admissions_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[2]/div/img")
admissions_button.click()
time.sleep(1)
close_newsletter()

Closing Newsletter.


**Click on Applying for Admissions**  
The information we need is under the section Applying for Admissions, which we need to click on to expand the section.

<img src = "./images/apply_adm.jpg">

In [327]:
# CLICK ON APPLYING FOR ADMISSIONS
applying_adms_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div[2]/div")
applying_adms_button.click()
time.sleep(1) 

**DEADLINES, LETTERS REQUIRED, WAIT LIST**

From the applying for admission section, we're going to fetch multiple information. 
* Letters of Recommendation
* Waiting List Used
* Regular Action Deadline
* Early Action Deadline
* Early Decision Deadline

We inspect each element and notice that they are all under the class `CollegeProfileContent_content__1hJCl`. The subclass `TitleValue_title__2-afK` has the titles such as 'Letters of Recommendation', while the subclass `TitleValue_value__1JT0d` has the values, such as '2 required for all freshmen'. Unlike in the BigFuture website, we can't just refer to a specific class id, so we're going to have to use the same method we used earlier for getting the % of international students.

First, we'll fetch all the elements under the class `CollegeProfileContent_content__1hJCl`, then we'll loop through each and check the ``TitleValue_title__2-afK`` to see if it matches with any of our required information. If it does, then we'll store the value in the corresponding variable.

On a sidenote, I did notice that that regular admission deadline sometimes had two dates separated by a comma. e.g. It could be written as 'February 15, December 1'. After checking through them it seemed like the second date was usually the early action date, so I decided to remove the part after the comma and keep only the first value. So instead of 'February 15, December 1', I stored only 'February 15`.

In [329]:
# DEADLINES, LETTERS REQUIRED, WAIT LIST
print('Getting letters required, wait list, and deadlines.')

# Initialize variables
lor = "N/A" # letter of recommendation
wait = "N/A" # wait list used
ra = "N/A" # regular action deadline
ea = "N/A" # early action deadline
ed = "N/A" # early decision deadline

# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")

# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        # Extract information based on the title element's text
        title_text = title_element.text.lower()

        if "letters of recommendation" in title_text:
            lor = value_element.text.split()[0]
        elif "waiting list used" in title_text:
            wait = value_element.text
        elif "regular admission deadline" in title_text:
            ra = value_element.text
            # As mentioned earlier, some dates have two components, we only want the component before the comma
            ra_parts = ra.split(",")  # Split the string at the comma
            # Check if the date did indeed have two components or not:
            if len(ra_parts) > 1:
                ra = ra_parts[0].strip()  # Extract the text before the comma and remove leading/trailing spaces
            else:
                ra = ra.strip()  # No comma present, use the whole string and remove leading/trailing spaces
        elif "early action deadline" in title_text:
            ea = value_element.text
        elif "early decision deadline" in title_text:
            ed = value_element.text
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not found

# Print the extracted information using f-strings
print(f"Letters of Recommendation: {lor}")
print(f"Waiting List Used: {wait}")
print(f"Regular Admission Deadline: {ra}")
print(f"Early Action Deadline: {ea}")
print(f"Early Decision Deadline: {ed}")

Getting letters required, wait list, and deadlines.
Letters of Recommendation: 2
Waiting List Used: Yes
Regular Admission Deadline: February 15
Early Action Deadline: December 1
Early Decision Deadline: November 15


### **Switch to Campus Tab**

Next, we'll switch to Campus tab. We'll use the xpath method to click on this tab, and as always, we need to remember to run the `close_newsletter()` function afterwards.

<img src = "./images/campus_button.jpg">

In [330]:
# CLICK ON CAMPUS LIFE
campus_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[5]')
campus_button.click()
time.sleep(1)
close_newsletter()     

Closing Newsletter.


**University Area/Size**

We want the university size in acres, and this information is stored un the Location and Setting section which needs to be expanded, so we inspect the element, copy the xpath, and click on it as below.

In [488]:
# CLICK ON LOCATION AND SETTING
loc_set_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div[2]/div')
loc_set_button.click()
time.sleep(1)

Again, we inspect the Campus Size element and write a code very similar to the one we used for getting information on the international students %.

<img src = "./images/campus_size.jpg">

In [331]:
# UNI AREA SIZE           
print('Getting Uni Area Size.')
# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
uni_area = 'N/A'
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        if title_element.text.lower() == "campus size":
            uni_area = value_element.text
            break
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not foundprint(f'Uni Area Size: {uni_area}')
print(f'Uni Area: {uni_area}')

Getting Uni Area Size.
Uni Area: 566 acres


And we're  done! That's all the information we want from the CollegeData website. 

In [332]:
# End session
driver.quit()

## SCRAPE SINGLE URL - COLLEGEDATA<a id='cd-single'></a>


It's time to put everything together and see if we can get all that information in one go.

In [54]:
# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import re
import time
import pandas as pd

In [333]:
# TEST CODE FOR SINGLE SITE

DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get("https://www.collegedata.com/college-search/california-institute-of-technology")
close_newsletter()

# UNI NAME
try:
    uni_name = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[1]/div/div/div/div/div[1]/h1').text
    uni_name = uni_name.splitlines()[0]
except NoSuchElementException:
    uni_name = "N/A"
print(f'University Name: {uni_name}')

# WOMEN %
try:
    women_pct = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[3]/div[2]/div/div/div[5]/div[2]").text
    women_pct = women_pct.split(' - ')[1]
except NoSuchElementException:
    women_pct = "N/A"
except (IndexError, AttributeError):
    women_pct = "N/A"
print(f'Women %: {women_pct}')

# CLICK ON STUDENTS
students_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[5]/div[2]/div[2]")
students_button.click()
time.sleep(1)

# INTERNATIONAL STUDENTS % AND NUMBER OF COUNTRIES
print('Getting International students % and number of countries represented.')
# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        if title_element.text == "International Students":
            intl = value_element.text
            
            try:
                intl_rate = re.findall(r"(\d+(\.\d+)?%)", intl)[0][0]
            except IndexError:
                intl_rate = "N/A"
            print(f'International students %: {intl_rate}')
            
            try:
                intl_ctrs = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', intl)
                if intl_ctrs:
                    intl_ctrs = intl_ctrs[0]
                else:
                    intl_ctrs = "N/A"
            except IndexError:
                intl_ctrs = "N/A"
            print(f'Number of countries: {intl_ctrs}')
            
            break
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not found

# CLICK ON ADMISSIONS
print('Switching to Admissions tab.')
admissions_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[2]/div/img")
admissions_button.click()
time.sleep(1)
close_newsletter()

# CLICK ON APPLYING FOR ADMISSIONS
applying_adms_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div[2]/div")
applying_adms_button.click()
time.sleep(1)

# GET LETTERS REQUIURED, WAIT LIST, AND DEADLINES
print('Getting Letters required, wait list, and deadlines.')
# Initialize variables
lor = "N/A"
wait = "N/A"
ra = "N/A"
ea = "N/A"
ed = "N/A"
# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        # Extract information based on the title element's text
        title_text = title_element.text.lower()

        if "letters of recommendation" in title_text:
            lor = value_element.text.split()[0]
        elif "waiting list used" in title_text:
            wait = value_element.text
        elif "regular admission deadline" in title_text:
            ra = value_element.text
            ra_parts = ra.split(",")  # Split the string at the comma
            if len(ra_parts) > 1:
                ra = ra_parts[0].strip()  # Extract the text before the comma and remove leading/trailing spaces
            else:
                ra = ra.strip()  # No comma present, use the whole string and remove leading/trailing spaces
        elif "early action deadline" in title_text:
            ea = value_element.text
        elif "early decision deadline" in title_text:
            ed = value_element.text
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not found
# Print the extracted information using f-strings
print(f"Letters Required: {lor}")
print(f"Waiting List Used: {wait}")
print(f"Regular Admission Deadline: {ra}")
print(f"Early Action Deadline: {ea}")
print(f"Early Decision Deadline: {ed}")

# CLICK ON CAMPUS LIFE
print('Switching to Campus tab.')
campus_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[5]')
campus_button.click()
time.sleep(1)
close_newsletter()     
           
# CLICK ON LOCATION AND SETTING
loc_set_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div[2]/div')
loc_set_button.click()
time.sleep(1)
           
# UNI AREA SIZE           
print('Getting Uni Area Size.')
# Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
uni_area = 'N/A'
parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
# Iterate over the parent elements and extract the desired information
for parent_element in parent_elements:
    try:
        title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
        value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

        if title_element.text.lower() == "campus size":
            uni_area = value_element.text
            break
    except NoSuchElementException:
        continue  # Skip this iteration if one of the elements is not foundprint(f'Uni Area Size: {uni_area}')
print(f'Uni Area: {uni_area}')

# END SESSION
driver.quit()
print(f'Completed collecting data for {uni_name}.')

Closing Newsletter.
University Name: California Institute of Technology
Women %: 44.7%
Getting International students % and number of countries represented.
International students %: 7.9%
Number of countries: N/A
Switching to Admissions tab.
Closing Newsletter.
Getting Letters required, wait list, and deadlines.
Letters Required: 2
Waiting List Used: Yes
Regular Admission Deadline: January 3
Early Action Deadline: November 1
Early Decision Deadline: N/A
Switching to Campus tab.
Closing Newsletter.
Getting Uni Area Size.
Uni Area: 124 acres
Completed collecting data for California Institute of Technology.


## SCRAPE MULTIPLE URLS - COLLEGEDATA<a id='cd-multi'></a>


We're now ready to wrap our code into a loop where we iterate through a list of url's and scrape each url for the information we need. The information is appended to lists for each of the variable, which we use later for dataframe creation.

The text files containing the lists are in the same directory as the notebook, so make sure to write the correct path for your files.

Just like with the BigFuture webt scraping script, we wrap our code in a loop, we initialize empty lists which we append to with each iteration, and make sure to handle exceptions. Most of the print commands have been commented out, but they can be used for debugging purposes.

In [256]:
# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import re
import time
import pandas as pd

In [257]:
# Empty list
site_lists = []

# Open the text file and read the links
with open('cd_list_2.txt', 'r') as file:
    # Read each line and append the link to the list
    for line in file:
        link = line.strip()  # Remove leading/trailing whitespaces and newline characters
        site_lists.append(link)

# Print the list of links
# print(site_lists)

In [334]:
# Function to Close newsletter
def close_newsletter():
    time.sleep(5) # small delay as the hidden page takes a few seconds to load
    #print('Closing Newsletter.')
    try:
        hidden_button = driver.find_element_by_class_name("NewsletterModal_close__2xauP")
        driver.execute_script("arguments[0].click();", hidden_button)
    except NoSuchElementException:
        pass  # Skip clicking if the element is not found
    time.sleep(1)

In [335]:
# # List of URLS to fetch data from
# site_lists = ["https://waf.collegedata.com/college-search/allegheny-college/",
#              "https://waf.collegedata.com/college-search/beloit-college"]

In [337]:
# Create Empty Lists to store values in:
uni_name_list = [] # university name list
women_pct_list = [] # women % list
intl_rate_list = [] # international students rate list
intl_ctrs_list = [] # countries represented list
ra_list = [] # regular action deadline list
ea_list = [] # early action deadline list
ed_list = [] # early decision deadline list
wait_list = [] # wait list accepted list
lor_list = [] # letters required list
uni_area_list = [] # university campus size list

for link in site_lists:

    print(f'Accessing URL: {link}')
    DRIVER_PATH = './chromedriver_win32/chromedriver.exe' 
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options = options, executable_path=DRIVER_PATH)
    driver.get(link)

    close_newsletter()
    
    # UNI NAME
    print('Getting university name.')
    try:
        uni_name = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[1]/div/div/div/div/div[1]/h1').text
        uni_name = uni_name.splitlines()[0]
        
    except NoSuchElementException:
        uni_name = "N/A"
    #print(f'University Name: {uni_name}')
    uni_name_list.append(uni_name)
    
    # WOMEN %
    print('Getting % of women.')
    try:
        women_pct = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[3]/div[2]/div/div/div[5]/div[2]").text
        women_pct = women_pct.split(' - ')[1]
    except NoSuchElementException:
        women_pct = "N/A"
    except (IndexError, AttributeError):
        women_pct = "N/A"
    women_pct_list.append(women_pct)
#     print(f'Women%: {women_pct}')

    # CLICK ON STUDENTS
    #print('Clicking on Students button.')
    students_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[5]/div[2]/div[2]")
    students_button.click()
    time.sleep(1)

    # INTERNATIONAL STUDENTS % AND NUMBER OF COUNTRIES
    print('Getting International students % and number of countries represented.')
    # Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
    parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
    # Iterate over the parent elements and extract the desired information
    for parent_element in parent_elements:
        try:
            title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
            value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")

            if title_element.text == "International Students":
                intl = value_element.text
                
                try:
                    intl_rate = re.findall(r"(\d+(\.\d+)?%)", intl)[0][0]
                except IndexError:
                    intl_rate = "N/A"
                intl_rate_list.append(intl_rate)
#                 print(f'International students %: {intl_rate}')

                try:
                    intl_ctrs = re.findall(r'[-+]?\b(?!\d+(?:[,.]\d+)?%)\d+(?:[.,]\d+)?', intl)
                    if intl_ctrs:
                        intl_ctrs = intl_ctrs[0]
                    else:
                        intl_ctrs = "N/A"
                except IndexError:
                    intl_ctrs = "N/A"
                intl_ctrs_list.append(intl_ctrs)
#                 print(f'Number of countries: {intl_ctrs}')                
                
                break
        except NoSuchElementException:
            continue  # Skip this iteration if one of the elements is not found
      
    # CLICK ON ADMISSIONS
    print('Clicking on Admissions.')
    admissions_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[2]/div/img")
    admissions_button.click()
    time.sleep(1)
    close_newsletter()

    # CLICK ON APPLYING FOR ADMISSIONS
    print('Clicking on Applying for Admissions.')
    applying_adms_button = driver.find_element_by_xpath("/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div[2]/div")
    applying_adms_button.click()
    time.sleep(1)
    
    # GET LETTERS REQUIURED, WAIT LIST, AND DEADLINES
    print('Getting Letters Required, Wait List, and Deadlines.')
    # Initialize variables
    lor = "N/A"
    wait = "N/A"
    ra = "N/A"
    ea = "N/A"
    ed = "N/A"
    # Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
    parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
    # Iterate over the parent elements and extract the desired information
    for parent_element in parent_elements:
        try:
            title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
            value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")
            # Extract information based on the title element's text
            title_text = title_element.text.lower()
            if "letters of recommendation" in title_text:
                lor = value_element.text.split()[0]
            elif "waiting list used" in title_text:
                wait = value_element.text
            elif "regular admission deadline" in title_text:
                ra = value_element.text
                ra_parts = ra.split(",")  # Split the string at the comma
                if len(ra_parts) > 1:
                    ra = ra_parts[0].strip()  # Extract the text before the comma and remove leading/trailing spaces
                else:
                    ra = ra.strip()  # No comma present, use the whole string and remove leading/trailing spaces
            elif "early action deadline" in title_text:
                ea = value_element.text
            elif "early decision deadline" in title_text:
                ed = value_element.text
        except NoSuchElementException:
            continue  # Skip this iteration if one of the elements is not found
    # Print the extracted information using f-strings
    lor_list.append(lor)
    wait_list.append(wait)
    ra_list.append(ra)
    ea_list.append(ea)
    ed_list.append(ed)
#     print(f"Letters Required: {lor}")
#     print(f"Waiting List Used: {wait}")
#     print(f"Regular Admission Deadline: {ra}")
#     print(f"Early Action Deadline: {ea}")
#     print(f"Early Decision Deadline: {ed}")

    # CLICK ON CAMPUS LIFE
    print('Clicking on Campus Button.')
    campus_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[1]/div[2]/div/div/div/div/a[5]')
    campus_button.click()
    time.sleep(1)
    close_newsletter()     

    # CLICK ON LOCATION AND SETTING
    print('Clicking on Location and Setting button.')
    loc_set_button = driver.find_element_by_xpath('/html/body/div[4]/div/div[1]/div/div[2]/div[1]/div/div[1]/div/div[2]/div')
    loc_set_button.click()
    time.sleep(1)

    # UNI AREA SIZE           
    print('Getting Uni Area Size.')
    # Find all parent elements with the class name "CollegeProfileContent_content__1hJCl"
    uni_area = 'N/A' #Initialize variable
    parent_elements = driver.find_elements_by_class_name("CollegeProfileContent_content__1hJCl")
    # Iterate over the parent elements and extract the desired information
    for parent_element in parent_elements:
        try:
            title_element = parent_element.find_element_by_class_name("TitleValue_title__2-afK")
            value_element = parent_element.find_element_by_class_name("TitleValue_value__1JT0d")
            if title_element.text.lower() == "campus size":
                uni_area = value_element.text
                break
        except NoSuchElementException:
            continue  # Skip this iteration if one of the elements is not foundprint(f'Uni Area Size: {uni_area}')
#     print(f'Uni Area: {uni_area}')
    uni_area_list.append(uni_area)

    driver.quit()
    
    if len(uni_name_list) == len(site_lists):
        print(f'Completed getting info for all the universities.')
    else:
        print(f'Completed getting info for {uni_name}. Waiting for next link.')
    
    time.sleep(2)

Accessing URL: https://waf.collegedata.com/college-search/allegheny-college/
Getting university name.
Getting % of women.
Getting International students % and number of countries represented.
Clicking on Admissions.
Clicking on Applying for Admissions.
Getting Letters Required, Wait List, and Deadlines.
Clicking on Campus Button.
Clicking on Location and Setting button.
Getting Uni Area Size.
Completed getting info for Allegheny College. Waiting for next link.
Accessing URL: https://waf.collegedata.com/college-search/beloit-college
Getting university name.
Getting % of women.
Getting International students % and number of countries represented.
Clicking on Admissions.
Clicking on Applying for Admissions.
Getting Letters Required, Wait List, and Deadlines.
Clicking on Campus Button.
Clicking on Location and Setting button.
Getting Uni Area Size.
Completed getting info for all the universities.


And we're done!!! That's it. Now we're ready to gather these lists into a dataframe and combine all the information we collected from both websites.

### CREATE DATAFRAME FROM COLLEGEDATA DATA<a id='cd-pandas'></a>


In [338]:
data = {
    "University Name": uni_name_list,
    "Women%": women_pct_list,
    "Intl Rate": intl_rate_list,
    "Countries": intl_ctrs_list,
    "RA CD" : ra_list,
    "EA CD" : ea_list,
    "ED CD" : ed_list,
    "Wait List": wait_list,
    "LOR": lor_list,
    "Campus Area (acres)": uni_area_list
    #"URL": site_lists
}

# Create the DataFrame
collegedata_info = pd.DataFrame(data)

collegedata_info

Unnamed: 0,University Name,Women%,Intl Rate,Countries,RA CD,EA CD,ED CD,Wait List,LOR,Campus Area (acres)
0,Allegheny College,53.9%,4.1%,56.0,February 15,December 1,November 15,Yes,2,566 acres
1,Beloit College,52.5%,,,January 15,Not reported,November 1,No,1,84 acres


**Function to fix dates**

As mentioned before, some dates have two components, separated by a comma. We want to keep only the first component of these dates, and we want to apply this function only if a comma is present.

In [339]:
import re

def extract_single_date(date):
    if ',' in date:
        # Split the string at the comma and keep only the first part
        date = date.split(',')[0]
    
    # Remove any leading/trailing spaces
    date = date.strip()
    
    return date

In [341]:
# Test the extract_single_date function on different date formats to make sure it works.

# Sample date values
date1 = "January 30"
date2 = "January 30, December 15"
date3 = "N/A"

# Extract single dates
date1 = extract_single_date(date1)
date2 = extract_single_date(date2)
date3 = extract_single_date(date3)

print(date1)  # Output: January 30
print(date2)  # Output: January 30
print(date3)  # Output: N/A

January 30
January 30
N/A


We apply the above function on our 'ED CD' column (early decision column). 

In [342]:
collegedata_info['ED CD'] = collegedata_info['ED CD'].apply(extract_single_date)

In [343]:
collegedata_info

Unnamed: 0,University Name,Women%,Intl Rate,Countries,RA CD,EA CD,ED CD,Wait List,LOR,Campus Area (acres)
0,Allegheny College,53.9%,4.1%,56.0,February 15,December 1,November 15,Yes,2,566 acres
1,Beloit College,52.5%,,,January 15,Not reported,November 1,No,1,84 acres


**SAVE COLLEGEDATA DATA TO CSV**

If we are interested only in the CollegeData.com data, we can now save a csv file.

In [265]:
# Save file
collegedata_info.to_csv('collegedata_uni_info_2.csv', index = False)

## MERGE BIGFUTURE AND COLLEGEDATA DF'S<a id='merge'></a>


The hard part is done! Now we use `pandas` to merge our dataframes and export a final master spreadsheet that combines information from both the websites.

First we make sure to import `pandas`, and load our datasets.

In [344]:
import pandas as pd

# List 1 Unis
# cd_1 = pd.read_csv('./collegedata_uni_info_1_1.csv')
# cd_2 = pd.read_csv('./collegedata_uni_info_1_2.csv')
# cd_df = pd.concat([cd_1, cd_2], axis = 0)
# bf_df = pd.read_csv('./bigfuture_uni_info_1.csv')


# List 2 Unis
bf_df = pd.read_csv('./bigfuture_uni_info_2.csv') # csv we had saved earlier after scraping bigfuture site
cd_df = pd.read_csv('./collegedata_uni_info_2.csv') # csv we had saved earlier scraping collegedata site

**DOUBLE CHECK UNIVERSITY NAMES / KEY INDEX**

We're going to be merging the dataframes based on 'University Name' column as our key index, so we need to be absolutely sure that the University names all match up. It is very likely (and we notice in our example) that some universities may be listed with slight variations in the name, e.g. '&' instead of 'and', 'St' instead of 'Saint', and so on.

In [345]:
# Check which rows in the collegedata df is missing in bigfuture df.
missing_rows = cd_df[~cd_df['University Name'].isin(bf_df['University Name'])]
missing_rows 

Unnamed: 0,University Name,Women%,Intl Rate,Countries,RA CD,EA CD,ED CD,Wait List,LOR,Campus Area (acres)
2,Brigham Young University - Idaho,55.3%,17.4%,116.0,February 15,,,No,,255 acres
7,College of Saint Benedict,,2.7%,5.0,Rolling,December 15,,No,,300 acres
18,Lewis & Clark College,63.6%,4.4%,54.0,January 15,November 1,November 1,Yes,,137 acres


We notice that **3 university names** are missing. These are:
* Brigham Young University - Idaho
* College of Saint Benedict
* Lewis & Clark College

In [346]:
# Check which rows in the bigfuture df is missing in collegedata df.
missing_rows = bf_df[~bf_df['University Name'].isin(cd_df['University Name'])]
missing_rows

Unnamed: 0,University Name,City,State,Years,Control,Type,Size,Setting,SAT Min,SAT Max,Acceptance Rate,Difficulty Level,RA,EA,ED,Net Cost,Tuition,Housing,Books,Personal Expenses,Avg Need Met,URL
2,Brigham Young University-Idaho,Rexburg,ID,4,Private,,,Rural,990.0,1200.0,96%,Not Selective,Sep 10,Not available,Not available,7038,,,,,Not available,https://bigfuture.collegeboard.org/colleges/br...
7,College of St. Benedict,Saint Joseph,MN,4,Private,LAC,Small,Suburban,1045.0,1235.0,80%,Less Selective,Jan 15,Dec 15,Not available,28177,52700.0,12160.0,1000.0,1500.0,91%,https://bigfuture.collegeboard.org/colleges/co...
18,Lewis and Clark Community College,Godfrey,IL,2,Public,,,Suburban,,,Not available,,Not available,Not available,Not available,3939,7500.0,,,,Not available,https://bigfuture.collegeboard.org/colleges/le...


We notice that **3 university names** are missing. These are:
* Brigham Young University-Idaho
* College of St. Benedict
* Lewis and Clark Community College

Well, in this case it's obvious that they are referring to the same universities, they're just written slightly differently. One approach to fix these issues is to replace the names in one of the dataframes with the names from the other.

In [293]:
# Replace the university names in the cd_df dataframe
cd_df['University Name'].replace({
    'Brigham Young University - Idaho': 'Brigham Young University-Idaho',
    'College of Saint Benedict': 'College of St. Benedict',
    'Lewis & Clark College': 'Lewis and Clark Community College'
}, inplace=True)

If you now rerun the earlier cells where we checked which uni names are missing in either df, you should get back empty df's, which means that there are no cases where the names are mismatching anymore!

**MERGE THE DATAFRAMES**

In [347]:
# Merge the df's based on the keyindex = 'University Name'
merged_df = pd.merge(bf_df, cd_df, on = 'University Name')

**ADDING A TOTAL COST COLUMN**  
Notice that we have separate costs such as Tuition, Housing, Books, and Personal Expenses. What if we want to have a column that shows the total cost as well? Well we can do that too. But also realize that some of the values for those fields are missing, which means that if a value in any of the costs is missing, a sum would return NaN.

So, this is what we'll do. First we'll sum up Housing costs, Books and Supplies, and Personal Expenses cost, and in case of N/A, we'll simply skip those.

Next we'll add at that partial sum to Tuition to get Total Cost. **However**, if Tution value is missing (N/A), our Total Cost will also return N/A, so in this case, we do not skip na.(This made sense because if Tuition (which is the majority portion of the costs) info is missing, we don't want to get a misleading value for Total Costs.

In [348]:
# First, calculate the partial sum with skipna=True
partial_sum = merged_df[['Housing', 'Books', 'Personal Expenses']].sum(axis = 1, skipna = True)

# Next, Calculate the final sum (if Tuition is na, our result will also be na)
merged_df['Total Cost'] = partial_sum + merged_df['Tuition']

**FILLING DEADLINE DATES**

Remember that we collected deadlines from BOTH the websites? This was because BigFuture didn't report deadlines for some of the universities, but that information was available in CollegeData. So what we're going to do here is if a deadline ('RA', 'EA', 'ED' columns) has 'Not available' in its field, we'll look at the corresponding deadline columns ('RA CD', 'EA CD', 'ED CD') and get the value from there instead.

In [349]:
# Create a boolean mask for the 'RA' column where the value is 'Not available'
mask = merged_df['RA'] == 'Not available'

# Use boolean indexing to assign the corresponding values from the 'RA CD' column to 'RA'
merged_df.loc[mask, 'RA'] = merged_df.loc[mask, 'RA CD'].fillna(merged_df['RA'])

In [350]:
# Create a boolean mask for the 'EA' column where the value is 'Not available'
mask = merged_df['EA'] == 'Not available'

# Use boolean indexing to assign the corresponding values from the 'EA CD' column to 'EA'
merged_df.loc[mask, 'EA'] = merged_df.loc[mask, 'EA CD'].fillna(merged_df['EA'])

In [351]:
# Create a boolean mask for the 'ED' column where the value is 'Not available'
mask = merged_df['ED'] == 'Not available'

# Use boolean indexing to assign the corresponding values from the 'ED CD' column to 'ED'
merged_df.loc[mask, 'ED'] = merged_df.loc[mask, 'ED CD'].fillna(merged_df['ED'])

**REARRANGE AND SELECT COLUMNS**  
In this section we'll rearrange and keep specific columns. For instance, we don't need to keep the 'RA CD', 'EA CD', and 'ED CD' columns anymore.

In [352]:
# Check column names
merged_df.columns

Index(['University Name', 'City', 'State', 'Years', 'Control', 'Type', 'Size',
       'Setting', 'SAT Min', 'SAT Max', 'Acceptance Rate', 'Difficulty Level',
       'RA', 'EA', 'ED', 'Net Cost', 'Tuition', 'Housing', 'Books',
       'Personal Expenses', 'Avg Need Met', 'URL', 'Women%', 'Intl Rate',
       'Countries', 'RA CD', 'EA CD', 'ED CD', 'Wait List', 'LOR',
       'Campus Area (acres)', 'Total Cost'],
      dtype='object')

In [353]:
# Rearrange and keep specific columns
merged_df = merged_df[['University Name', 'Difficulty Level', 'Acceptance Rate',
                      'Intl Rate', 'Countries', 'Women%', 'SAT Min', 'SAT Max', 'Avg Need Met',
                      'City', 'State', 'Campus Area (acres)', 'Control', 'Type', 'Setting',
                       'Total Cost', 'Net Cost', 'Tuition', 'Housing', 'Books', 'Personal Expenses',
                       'LOR', 'Wait List', 'Years', 'RA', 'EA', 'ED', 'URL'
                      ]]

**FORMAT CURRENCY COLUMNS**

The costs column, as it is have int values. We want to format them as currency values with $ sign and a thousands separator. So let's apply that to the different costs columns, while making sure to leave na values as is.

In [354]:
# Define a lambda function to format currency values
format_currency = lambda x: "${:,.0f}".format(x) if pd.notna(x) else ""

# Apply the formatting function to the 'Amount' column
merged_df['Total Cost'] = merged_df['Total Cost'].apply(format_currency)
merged_df['Net Cost'] = merged_df['Net Cost'].apply(format_currency)
merged_df['Tuition'] = merged_df['Tuition'].apply(format_currency)
merged_df['Housing'] = merged_df['Housing'].apply(format_currency)
merged_df['Books'] = merged_df['Books'].apply(format_currency)
merged_df['Personal Expenses'] = merged_df['Personal Expenses'].apply(format_currency)

**FORMAT CAMPUS SIZE COLUMN**  
The Campus Size column shows string values with acres included in the string, e.g. '1,000 acres'. Let's remove the acres part and keep only the numeric value instead!

regex demo: https://regex101.com/r/SnLxzj/1

In [359]:
merged_df['Campus Area (acres)'] = merged_df['Campus Area (acres)'].str.extract(r'(\d{1,3}(?:,\d{3})*|\d+)')

**SORT BY UNIVERSITY NAME**  
Last step, let's sort by University Name before saving our dataframe.

In [360]:
# Sort by 'University Name'
merged_df.sort_values(by = 'University Name', inplace = True)

In [363]:
# Check first 5 rows
merged_df.head()

Unnamed: 0,University Name,Difficulty Level,Acceptance Rate,Intl Rate,Countries,Women%,SAT Min,SAT Max,Avg Need Met,City,State,Campus Area (acres),Control,Type,Setting,Total Cost,Net Cost,Tuition,Housing,Books,Personal Expenses,LOR,Wait List,Years,RA,EA,ED,URL
0,Allegheny College,Less Selective,62%,4.1%,56.0,53.9%,1110.0,1370.0,90%,Meadville,PA,566,Private,LAC,Suburban,"$70,668","$24,227","$54,300","$14,868",$400,"$1,100",2,Yes,4,Feb 15,Dec 1,Nov 15,https://bigfuture.collegeboard.org/colleges/al...
1,Beloit College,Less Selective,62%,,,52.5%,1230.0,1360.0,96%,Beloit,WI,84,Private,LAC,Urban,"$71,446","$24,493","$58,042","$10,740","$1,133","$1,531",1,No,4,January 15,Nov 1,Nov 1,https://bigfuture.collegeboard.org/colleges/be...
2,Bryn Mawr College,Selective,33%,15.4%,37.0,,1300.0,1470.0,Not available,Bryn Mawr,PA,135,Private,LAC,Suburban,,"$33,716",,"$18,690","$1,000","$1,000",3,Yes,4,Jan 15,Not available,Nov 15,https://bigfuture.collegeboard.org/colleges/br...
3,Bucknell University,Selective,34%,4.9%,50.0,52.9%,1180.0,1390.0,0%,Lewisburg,PA,446,Private,Private Uni,Rural,"$81,436","$39,493","$64,418","$16,118",$900,$0,1,Yes,4,Jan 15,Not available,Nov 15,https://bigfuture.collegeboard.org/colleges/bu...
4,Centre College,Less Selective,76%,,,51.9%,1150.0,1380.0,92%,Danville,KY,160,Private,,Suburban,"$65,836","$24,226","$50,250","$13,040","$1,200","$1,346",1,Yes,4,Jan 15,December 1,November 15,https://bigfuture.collegeboard.org/colleges/ce...


**DROP DUPLICATE ROWS**  
After saving the csv, i noticed that one of the universities had been duplicated multiple times! This is becuase the url for that university was in our list multiple times. So let's drop duplicate rows.

In [364]:
merged_df.drop_duplicates(subset=['University Name', 'Acceptance Rate', 'City'], inplace=True)

**SAVE OUR MASTER SPREADSHEET**  
And we're at the end of the road! What a journey. Time to save this csv and peruse it for our benefit. 

In [311]:
# Save File
merged_df.to_csv('uni_list_sheet2.csv', index = False)

## CONCLUSION<a id='end'></a>


That's it, we made it. We've **successfully scraped** specific information from two dynamic websites - **BigFuture.org** and **CollegeData.com**. 

We've used the combined power of `Selenium` and `WebDriver` in `Python` to achieve these results. We've made use of different `find_element` methods depending on the website we were scraping. We've handled exceptions, dealt with pesky popups, and dissected texts/strings to extract only useful information. We've gathered similar data from both sites to ensure that we had the as much of the complete picture as possible. We've used `pandas` to create our dataframes, create and format columns, and finally save our dataframe to a csv.

Are we done? For now, yes, at least as far as this notebook is concerned. We had set out to scrape dynamic websites, and we did. But now that we have the data, what kind of information can we glean from it? 

That is a tale for another day.