## Introduction

During this class section we will make a dataset (surprise, surprise) by following this notebook. We will scrape data from [https://money.com/best-colleges/](https://money.com/best-colleges/). Money is an independent, advertiser-supported website whose editors "research hundreds of sources and contact hundreds of the most respected experts in each industry to get the most relevant information to help others make the right purchasing decision." The data consists of various useful metrics for the best colleges in America, ranked by value (as determined by the website).

We will follow the steps of the web scraping "pipeline" in this notebook.

Features of the dataset - Demo

First, let's explore the website.

# Import Packages

In [17]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

Make a GET request to get all the HTML from this webpage, then print out the status code and the server that is giving you the HTML.

In [24]:
url = "https://money.com/best-colleges/"
req = requests.get(url)
print(req.status_code)

200


In [25]:
req.headers['Server']

'cloudflare'
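
To make the printed status code easier to interpret, here is a small sketch that turns a numeric code into its standard phrase using only the standard library. The helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

A 200 means the server returned the page successfully. If you would rather have the notebook stop on a failed request, `req.raise_for_status()` raises an exception for 4xx and 5xx responses.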

Pass the HTML document into a BeautifulSoup object.

In [26]:
soup = bs(req.content, "html.parser")
# req.content

Run this code cell to define this simple function. What is this function doing? What are we going to use it for? What type of object does `element` have to be?

In [27]:
def func(element):
  return element.string
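
One way to explore those questions is to try `.string` on a tiny, made-up HTML snippet. The markup below is invented for this illustration (it is not copied from the Money page), and `demo_soup` is a throwaway name.

```python
from bs4 import BeautifulSoup

# Invented snippet: one tag with plain text, one tag with a nested child.
html = "<h2>Adelphi University</h2><h3>Garden City, <b>NY</b></h3>"
demo_soup = BeautifulSoup(html, "html.parser")

simple = demo_soup.select_one("h2")  # contains a single piece of text
nested = demo_soup.select_one("h3")  # contains text plus a <b> child

print(simple.string)      # Adelphi University
print(nested.string)      # None
print(nested.get_text())  # Garden City, NY
```

Note that `.string` is only defined when a tag has exactly one child; otherwise it is `None`. When a selector matches tags with nested markup, `get_text()` may be the safer choice.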

Use the above function and find the proper CSS selectors to get the respective data. I will demo the first one. Note that finding CSS selectors can be tricky. I usually like to use a combination of the SelectorGadget extension and the Inspect Element tool to determine the proper selectors.

In [6]:
college_names = list(map(func, soup.select("hgroup h2")))[0:-1]

In [7]:
college_location = list(map(func, soup.select('hgroup h3')))[:-1]
acceptance_rate = list(map(func, soup.select('.item-row .col-3')))
Est_full_price_23_24 = list(map(func, soup.select('.item-row .col-4')))
Est_price_with_avg_grant = list(map(func, soup.select('.item-row .col-5')))
graduation_rate = list(map(func, soup.select('.item-row .col-6')))
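
The table cells tend to come back padded with newlines and spaces (for example `"\n 77%\n"`, as the DataFrame preview later in this notebook shows). A minimal sketch of one way to tidy them, assuming a hypothetical helper named `clean_cell`:

```python
def clean_cell(text):
    """Strip surrounding whitespace; pass None through unchanged."""
    return text.strip() if text is not None else None

# Sample values shaped like the scraped cells, plus a missing one.
raw_cells = ["\n 77%\n", "\n $67,400\n", None]
print([clean_cell(cell) for cell in raw_cells])  # ['77%', '$67,400', None]
```

The `None` check matters because `func` can return `None` for some elements, and calling `.strip()` on `None` would raise an error.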

Now that we have all the data from the table stored in their respective lists, we can move on to getting data specific to each college by "clicking" on the link in the table. How do we "click" links on a webpage with `requests` and `BeautifulSoup`?

In [8]:
college_links = soup.select('.mask-link')
new_url = "https://money.com" + college_links[0]['href']  # href already starts with "/"
adelphi_university = requests.get(new_url)
print(new_url)

https://money.com/best-colleges/profile/adelphi-university/
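
Gluing a base URL and an `href` together by hand is easy to get subtly wrong (a doubled or a missing slash). As an alternative sketch, the standard library's `urljoin` resolves the link against the base for you. The variable names `base_url` and `href` are just for this illustration.

```python
from urllib.parse import urljoin

base_url = "https://money.com/"
href = "/best-colleges/profile/adelphi-university/"

# urljoin resolves the href relative to the base and avoids a doubled slash.
print(urljoin(base_url, href))
# https://money.com/best-colleges/profile/adelphi-university/
```

This is also how a browser resolves a relative link, which is the closest `requests` equivalent of "clicking".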


In [9]:
adelphi_soup = bs(adelphi_university.content, "html.parser")

Below I collect every metric on the college's profile page into one list and then pick out the ones we want by position.

In [10]:
metrics = list(map(func, adelphi_soup.select("dd")))
metrics

['$67,400',
 '90%',
 '$30,300',
 '$24,800',
 '77%',
 '3.69',
 '1200/26',
 'No',
 '5,170',
 '98%',
 '62%',
 '22%',
 '$20,530',
 '74%',
 '4.2 years',
 '$25,000',
 '$70,100',
 '80%']

In [11]:
avg_price_for_low_income_students = metrics[3]
median_sat_act_score = metrics[6]
sat_act_required = metrics[7]
undergrad_enrollment = metrics[8]
percent_of_students_with_need_who_get_grants = metrics[9]
percent_of_need_met = metrics[10]
percent_of_students_who_get_merit_grants = metrics[11]
avg_merit_grant = metrics[12]
avg_time_to_a_degree = metrics[14]
median_student_debt = metrics[15]
percent_earning_more_than_a_high_school_grad = metrics[17]
early_career_earnings = metrics[16]
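
Picking values by position works, but it breaks silently if the page adds or reorders a metric. As a sketch of an alternative, if each `dd` value on the page is paired with a `dt` label (an assumption; check with Inspect Element), you could look values up by label instead. The HTML and label text below are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real profile page may be structured differently.
html = """
<dl>
  <dt>Acceptance rate</dt><dd>77%</dd>
  <dt>Median student debt</dt><dd>$25,000</dd>
</dl>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Pair each label with the value that follows it.
labels = [dt.get_text(strip=True) for dt in demo_soup.select("dt")]
values = [dd.get_text(strip=True) for dd in demo_soup.select("dd")]
metrics_by_label = dict(zip(labels, values))

print(metrics_by_label["Median student debt"])  # $25,000
```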

"Click" on each link and extract the proper data (make sure to find and use the right CSS Selectors) and then add them to the list.

In [12]:
avg_price_for_low_income_students = []
median_sat_act_score = []
sat_act_required = []
undergrad_enrollment = []
percent_of_students_with_need_who_get_grants = []
percent_of_need_met = []
percent_of_students_who_get_merit_grants = []
avg_merit_grant = []
avg_time_to_a_degree = []
median_student_debt = []
percent_earning_more_than_a_high_school_grad = []
early_career_earnings = []

for link in college_links:
  new_url = "https://money.com" + link['href']  # href already starts with "/"
  university = requests.get(new_url)
  university_soup = bs(university.content, "html.parser")
  metrics = list(map(func, university_soup.select("dd")))
  avg_price_for_low_income_students.append(metrics[3])
  median_sat_act_score.append(metrics[6])
  sat_act_required.append(metrics[7])
  undergrad_enrollment.append(metrics[8])
  percent_of_students_with_need_who_get_grants.append(metrics[9])
  percent_of_need_met.append(metrics[10])
  percent_of_students_who_get_merit_grants.append(metrics[11])
  avg_merit_grant.append(metrics[12])
  avg_time_to_a_degree.append(metrics[14])
  median_student_debt.append(metrics[15])
  percent_earning_more_than_a_high_school_grad.append(metrics[17])
  early_career_earnings.append(metrics[16])
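
If one profile page lists fewer metrics than the others, `metrics[17]` raises an `IndexError` and the whole loop stops. A minimal sketch of a guard, using a hypothetical helper named `pick` and made-up sample lists:

```python
def pick(metrics, index, default=None):
    """Return metrics[index], or default when the page had fewer values."""
    return metrics[index] if 0 <= index < len(metrics) else default

full_page = ["$67,400", "90%", "$30,300", "$24,800"]
short_page = ["$67,400", "90%"]

print(pick(full_page, 3))   # $24,800
print(pick(short_page, 3))  # None
```

Inside the loop you could then write, for example, `avg_price_for_low_income_students.append(pick(metrics, 3))`, so every list still grows by one entry per college. Since the loop makes one request per college, it may also be polite to pause briefly between requests with `time.sleep`.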

The focus of the next few weeks' material is the `numpy` and `pandas` packages, which are what we will use to store the data we extract from a webpage. I have written the code to package all the data into a dataset and save it as a `.csv` file. You will understand how I am doing this after going through this week's material.

In [13]:
data = {
    "college_names": college_names,
    "college_location": college_location,
    "acceptance_rate" : acceptance_rate,
    "Est_full_price_22_23": Est_full_price_23_24,
    "Est_price_with_avg_grant": Est_price_with_avg_grant,
    "graduation_rate": graduation_rate,
    "early_career_earnings": early_career_earnings,
    "avg_price_for_low_income_students": avg_price_for_low_income_students,
    "median_sat_act_score" : median_sat_act_score,
    "sat_act_required" : sat_act_required,
    "undergrad_enrollment": undergrad_enrollment,
    "percent_of_students_with_need_who_get_grants" : percent_of_students_with_need_who_get_grants,
    "percent_of_need_met" : percent_of_need_met,
    "percent_of_students_who_get_merit_grants" : percent_of_students_who_get_merit_grants,
    "avg_merit_grant" : avg_merit_grant,
    "avg_time_to_a_degree" : avg_time_to_a_degree,
    "median_student_debt" : median_student_debt,
    "percent_earning_more_than_a_high_school_grad" : percent_earning_more_than_a_high_school_grad

}

In [14]:
college_df = pd.DataFrame(data=data)
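
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one selector matches an extra element (note that some lists above are sliced with `[:-1]` and others are not). A small sketch for checking lengths first, using a hypothetical helper `column_lengths` and made-up data:

```python
def column_lengths(columns):
    """Map each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}

# Made-up example where one column is one entry short.
demo = {
    "college_names": ["A", "B", "C"],
    "acceptance_rate": ["77%", "70%"],
}

lengths = column_lengths(demo)
print(lengths)                           # {'college_names': 3, 'acceptance_rate': 2}
print(len(set(lengths.values())) == 1)   # False
```

Running `column_lengths(data)` on the real dictionary shows at a glance which column is off.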

In [15]:
college_df.head()

Unnamed: 0,college_names,college_location,acceptance_rate,Est_full_price_22_23,Est_price_with_avg_grant,graduation_rate,early_career_earnings,avg_price_for_low_income_students,median_sat_act_score,sat_act_required,undergrad_enrollment,percent_of_students_with_need_who_get_grants,percent_of_need_met,percent_of_students_who_get_merit_grants,avg_merit_grant,avg_time_to_a_degree,median_student_debt,percent_earning_more_than_a_high_school_grad
0,Adelphi University,"Garden City, NY",\n 77%\n,"\n $67,400\n","\n $30,300\n",\n 74%\n,"$70,100","$24,800",1200/26,No,5170,98%,62%,22%,"$20,530",4.2 years,"$25,000",80%
1,Agnes Scott College,"Decatur, GA",\n 70%\n,"\n $63,500\n","\n $14,400\n",\n 73%\n,"$49,300","$10,400",,No,1010,100%,85%,25%,"$31,390",4.1 years,"$26,750",60%
2,Albertus Magnus College,"New Haven, CT",\n 82%\n,"\n $62,100\n","\n $30,800\n",\n 65%\n,"$52,300","$26,300",,No,1020,69%,46%,4%,"$21,060",4.3 years,"$30,960",59%
3,Albion College,"Albion, MI",\n 69%\n,"\n $70,200\n","\n $21,200\n",\n 70%\n,"$57,500","$15,300",1110/24,No,1510,100%,91%,16%,"$37,120",4.2 years,"$27,000",74%
4,Alcorn State University,"Alcorn State, MS",\n 39%\n,"\n $25,000\n","\n $15,600\n",\n 42%\n,"$34,500","$12,600",960/22,Yes,2480,92%,20%,33%,"$8,210",4.3 years,"$27,000",40%


In [16]:
college_df.to_csv("total_college_data.csv")
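
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below uses a tiny made-up frame and in-memory buffers instead of a real file to show the difference that `index=False` makes.

```python
import io
import pandas as pd

demo_df = pd.DataFrame(
    {"college_names": ["Adelphi University"], "acceptance_rate": ["77%"]}
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)
# ['Unnamed: 0', 'college_names', 'acceptance_rate']

# index=False leaves the index out of the file.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
print(cols_without_index)
# ['college_names', 'acceptance_rate']
```

Whether to keep the index is a judgment call; for a default 0, 1, 2, ... index it usually carries no information.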