# US NEWS Ranking Web scraping using Python

## Introduction
Your investment in a college education could profoundly affect your career opportunities, how much money you earn (i.e., your earning power) and your quality of life. Whatever degree you decide to pursue, choosing the right college is a huge first step. To find the right college, you need a source of comprehensive data – information that lets you compare one school with another and find the differences that matter to you. That's what U.S. News' Best College rankings are for. National Universities offer a range of undergraduate majors, plus master's and doctoral programs, and emphasize faculty research or award professional practice doctorates.

This paper presents the U.S. News rankings of National Universities in the US, together with some basic information about each school scraped from the U.S. News website. It also walks through the scraping code, which uses the Selenium package to collect the data.

### I. Web scraping <a class="anchor" id="sub_section_1_1_1"></a>
Import the necessary libraries

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import numpy as np

Let's make a request and get the code of the whole page.

In [7]:
driver = webdriver.Chrome()
driver.get('https://www.usnews.com/best-colleges/rankings/national-universities?_mode=table')

In [8]:
driver.title

'2024 Best National Universities | US News Rankings'

Let's create a list to hold the data we scrape for each university:

In [18]:
universitylist = []

In [10]:
lists = driver.find_element(By.TAG_NAME, 'tbody')
unilist = lists.find_elements(By.CSS_SELECTOR, '[class="search-table__TableRow-sc-8xxgib-5 jSNVDf"]')

In [11]:
len(unilist)    

400

For each table row, collect the fields we need from the website: basic information, rank, and the link to the school's detail page.

In [19]:
for row in unilist:
    info = {
        'University': row.find_element(By.TAG_NAME, 'h3').text,
        'Details': row.find_element(By.TAG_NAME, 'a').get_attribute('href'),
        'Location': row.find_element(By.TAG_NAME, 'p').text,
        # Keep the first line of the cell text and drop the leading '#'.
        'Rank': row.find_element(By.CSS_SELECTOR, '[class="RankList__ListItem-sc-2xewen-1 dofuo rank-list-item"]').text.split('\n')[0][1:],
        # Keep the first line of the cell text and drop the leading '$'.
        'Tuition & Fee($)': row.find_element(By.CSS_SELECTOR, '[class="ResultsTableAtlas__TableCell-sc-1wtgwb-6 ResultsTableAtlas__DataCell-sc-1wtgwb-8 erKNyE eceoqp"]').text.split('\n')[0][1:],
        # The second stat value in the row is the undergraduate enrollment.
        'Undergraduate enrollment (fall 2022)': row.find_elements(By.CSS_SELECTOR, '[class="Span-sc-19wk4id-0 ResultsTableAtlas__StatValue-sc-1wtgwb-13 dDrKMw hNWEBi"]')[1].text.split('\n')[0],
    }
    universitylist.append(info)
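
The `.text.split('\n')[0][1:]` chains in the loop keep the first line of a cell's text and drop its leading symbol (`#` for ranks, `$` for tuition). Below is a minimal sketch of the same idea as a standalone helper that can be checked without a browser. The function name `first_line_without_prefix` and the sample strings are hypothetical and are not taken from the live page.

```python
def first_line_without_prefix(cell_text: str, prefix: str) -> str:
    """Return the first line of cell_text with one leading prefix removed."""
    first_line = cell_text.strip().split('\n')[0].strip()
    # Remove the prefix only when it is present, so unexpected text
    # (for example 'N/A') is not truncated by a blind [1:] slice.
    if first_line.startswith(prefix):
        first_line = first_line[len(prefix):]
    return first_line


# Hypothetical cell texts shaped like the ones the loop reads.
print(first_line_without_prefix('#394-435\nin National Universities', '#'))
print(first_line_without_prefix('$59,710\nTuition and Fees', '$'))
```

Unlike a bare `[1:]` slice, this version leaves a value such as `N/A` intact, which may matter if a cell ever lacks the expected symbol.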


In [20]:
len(universitylist)

400

Now let's preview the data we have scraped.

In [21]:
df = pd.DataFrame(universitylist)
df

| | University | Details | Location | Rank | Tuition & Fee($) | Undergraduate enrollment (fall 2022) |
| :---: | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | Princeton University | https://www.usnews.com/best-colleges/princeton... | Princeton, NJ | 1 | 59710 | 5604 |
| 1 | Massachusetts Institute of Technology | https://www.usnews.com/best-colleges/massachus... | Cambridge, MA | 2 | 60156 | 4657 |
| 2 | Harvard University | https://www.usnews.com/best-colleges/harvard-u... | Cambridge, MA | 3 | 59076 | 7240 |
| 3 | Stanford University | https://www.usnews.com/best-colleges/stanford-... | Stanford, CA | 3 | 62484 | 8049 |
| 4 | Yale University | https://www.usnews.com/best-colleges/yale-univ... | New Haven, CT | 5 | 64700 | 6645 |
| ... | ... | ... | ... | ... | ... | ... |
| 395 | Barry University | https://www.usnews.com/best-colleges/barry-uni... | Miami Shores, FL | 394-435 | 32500 | 3122 |
| 396 | Belhaven University | https://www.usnews.com/best-colleges/belhaven-... | Jackson, MS | 394-435 | 29195 | 1501 |
| 397 | Brenau University | https://www.usnews.com/best-colleges/brenau-un... | Gainesville, GA | 394-435 | 33275 | 1367 |
| 398 | Briar Cliff University | https://www.usnews.com/best-colleges/briar-cli... | Sioux City, IA | 394-435 | 34498 | 709 |


### II. Data Cleaning 
The dataframe may not yet be in the format we want. To clean it up, we check the table shape, the null values and the column data types.

In [22]:
df.shape

(400, 6)

In [23]:
df.replace('N/A', np.nan, inplace=True)
df.isnull().sum()

University                               0
Details                                  0
Location                                 0
Rank                                     0
Tuition & Fee($)                         0
Undergraduate enrollment (fall 2022)    14
dtype: int64

As we can see, the Undergraduate enrollment (fall 2022) column has 14 null values; the other columns have none.
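
To see which schools are affected, the null count can be followed by a boolean mask. This is a minimal sketch on a toy frame: the university names and values are made up for illustration, and only the column name and the `'N/A'` replacement come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped data; values are illustrative only.
sample = pd.DataFrame({
    'University': ['A University', 'B University', 'C University'],
    'Undergraduate enrollment (fall 2022)': ['5,604', 'N/A', '4,657'],
})

# Same normalisation as in the notebook: the literal 'N/A' becomes a real NaN.
sample = sample.replace('N/A', np.nan)

# Boolean mask that is True for rows whose enrollment is missing.
missing = sample['Undergraduate enrollment (fall 2022)'].isna()
print(sample.loc[missing, 'University'].tolist())
```

On the real data, the same mask applied to `df` would list the 14 universities without an enrollment figure.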

In [24]:
df.dtypes

University                              object
Details                                 object
Location                                object
Rank                                    object
Tuition & Fee($)                        object
Undergraduate enrollment (fall 2022)    object
dtype: object

In [25]:
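# Remove thousands separators. Assign the result back instead of calling
# replace(..., inplace=True) on a column selection, which may not update df.
df['Tuition & Fee($)'] = df['Tuition & Fee($)'].str.replace(',', '', regex=False)
df['Undergraduate enrollment (fall 2022)'] = df['Undergraduate enrollment (fall 2022)'].str.replace(',', '', regex=False)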

In [26]:
df.head()

| | University | Details | Location | Rank | Tuition & Fee($) | Undergraduate enrollment (fall 2022) |
| :---: | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | Princeton University | https://www.usnews.com/best-colleges/princeton... | Princeton, NJ | 1 | 59710 | 5604 |
| 1 | Massachusetts Institute of Technology | https://www.usnews.com/best-colleges/massachus... | Cambridge, MA | 2 | 60156 | 4657 |
| 2 | Harvard University | https://www.usnews.com/best-colleges/harvard-u... | Cambridge, MA | 3 | 59076 | 7240 |
| 3 | Stanford University | https://www.usnews.com/best-colleges/stanford-... | Stanford, CA | 3 | 62484 | 8049 |
| 4 | Yale University | https://www.usnews.com/best-colleges/yale-univ... | New Haven, CT | 5 | 64700 | 6645 |


In [27]:
numeric_cols = ['Tuition & Fee($)', 'Undergraduate enrollment (fall 2022)']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

In [28]:
df.dtypes

University                               object
Details                                  object
Location                                 object
Rank                                     object
Tuition & Fee($)                          int64
Undergraduate enrollment (fall 2022)    float64
dtype: object
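
The enrollment column comes out as `float64` rather than `int64` because it contains missing values, and NaN cannot be stored in a NumPy integer column. If whole-number values are preferred, one option is pandas' nullable `Int64` dtype. The sketch below uses made-up values and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only: two counts stored as text and one missing entry.
enrollment = pd.Series(['5604', np.nan, '4657'], dtype=object)

as_float = pd.to_numeric(enrollment)         # NaN forces a float64 result
as_nullable_int = as_float.astype('Int64')   # nullable integers keep the gap as <NA>

print(as_float.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.isna().sum())
```

Keeping `float64` is also reasonable for summary statistics; the nullable dtype mainly helps when the counts should display and export as integers.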

In [29]:
df.describe()

| | Tuition & Fee($) | Undergraduate enrollment (fall 2022) |
| :---: | :--- | :--- |
| count | 400.0 | 386.0 |
| mean | 36412.1775 | 12661.11399 |
| std | 14891.532946 | 11130.433472 |
| min | 2168.0 | 699.0 |
| 25% | 24794.0 | 3921.75 |
| 50% | 34546.5 | 8748.0 |
| 75% | 46352.0 | 18357.5 |
| max | 68237.0 | 65492.0 |
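
`df.describe()` summarises only the two numeric columns. `Rank` is left out because it is stored as text, and some of its values are ranges such as `394-435`. If a sortable numeric rank is needed, one possible approach is to parse each value into a lower and an upper bound. The helper `rank_bounds` below is a hypothetical sketch, not part of the original notebook.

```python
import re
from typing import Optional, Tuple


def rank_bounds(rank: str) -> Optional[Tuple[int, int]]:
    """Parse '5' as (5, 5) and '394-435' as (394, 435); None if unparseable."""
    match = re.fullmatch(r'\s*(\d+)(?:\s*-\s*(\d+))?\s*', rank)
    if match is None:
        return None
    low = int(match.group(1))
    # A single rank has no upper part, so both bounds are the same number.
    high = int(match.group(2)) if match.group(2) else low
    return low, high


print(rank_bounds('3'))
print(rank_bounds('394-435'))
```

Applied to the dataframe, something like `df['Rank'].map(rank_bounds)` would give pairs whose first element can be used for sorting.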


### III. Export and Data summary <a class="anchor" id="sub_section_1_1_1"></a>

Summary of the dataset:
- Dataset Structure: 400 observations (rows), 6 features (variables)
- Missing Data: 14 missing values in total, all in the Undergraduate enrollment (fall 2022) column
- Data Type: three data types in this dataset: object (text), int64 and float64

| Column | Description | 
| :---: | :--- |
| University | Name of the national university in the US. |
| Details | Link to the U.S. News page with the college's ratings, rank and reviews. |
| Location | Location of the college. |
| Rank | The rank based on varying outcome measures related to schools' success at enrolling, retaining and graduating students from different backgrounds with manageable debt and post-graduate success. |
| Tuition & Fee($) | The cost of instruction plus the fees charged to every student on campus, in US dollars; both vary from college to college. |
| Undergraduate enrollment (fall 2022) | The number of undergraduate students enrolled at the institution in fall 2022. |

In [30]:
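# index=False keeps the row index out of the file, so no extra unnamed column appears on reload.
df.to_csv('2024 Top 400 National Universities | US News Rankings.csv', index=False)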
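
By default, `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. Below is a small sketch, using a toy frame and temporary file names chosen for illustration, of the difference that `index=False` makes.

```python
import os
import tempfile

import pandas as pd

# Toy frame; the values are illustrative only.
frame = pd.DataFrame({'University': ['A University'], 'Rank': ['1']})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    frame.to_csv(with_index)                  # default: index written as a nameless first column
    frame.to_csv(without_index, index=False)  # index omitted

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'University', 'Rank']
print(cols_without_index)  # ['University', 'Rank']
```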