## Gathering graduate school rankings from USNews

The purpose of this exercise is to compile lists of top graduate programs from USNews for personal use. 
I got an access-denied error when I tried to scrape the website directly, so I saved the source of each results page to a text file and then used Python to extract the information I needed.  

In [1]:
import bs4
import re
import codecs
import pandas as pd
from functools import reduce

In [2]:
# Build one DataFrame per program from the saved page sources.
def scrap_text_from_txt(list_of_files):
    ranks = []
    names = []
    locations = []

    for file in list_of_files:
        # Read the saved page source; the with-block closes the file afterwards.
        with codecs.open(file, "r", "utf-8") as html:
            soup = bs4.BeautifulSoup(html.read(), "lxml")
        # Keep the first number found in each rank label.
        for rank in soup.find_all('span', attrs={'class': 'rankscore-bronze'}):
            ranks.append(int(re.findall(r'\d+', rank.text)[0]))
        for college in soup.find_all('a', attrs={'class': 'school-name'}):
            names.append(college.text)
        for location in soup.find_all('p', attrs={'class': 'location'}):
            locations.append(location.text)

    # The three lists are filled independently, so pd.DataFrame raises
    # a ValueError here if their lengths differ.
    data = {'Rank': ranks, 'College Name': names, 'Location': locations}
    df = pd.DataFrame(data=data)
    return df
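
To see what the three selectors pick up, here is a minimal sketch that runs the same extraction on an inline snippet instead of a saved page. The markup is hypothetical: only the class names (`rankscore-bronze`, `school-name`, `location`) come from the function above, while the surrounding tags, the label format, and the sample schools are assumptions. It uses Python's built-in `html.parser` so the sketch does not depend on `lxml`.

```python
import re
import bs4

# Hypothetical markup: only the three class names are taken from the function above.
SAMPLE_HTML = """
<div>
  <span class="rankscore-bronze">#1</span>
  <a class="school-name" href="#">Example University</a>
  <p class="location">Springfield, ST</p>
</div>
<div>
  <span class="rankscore-bronze">#2 Tie</span>
  <a class="school-name" href="#">Sample Institute</a>
  <p class="location">Riverton, ST</p>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same extraction logic as scrap_text_from_txt, applied to the snippet.
sample_ranks = [int(re.findall(r"\d+", tag.text)[0])
                for tag in soup.find_all("span", attrs={"class": "rankscore-bronze"})]
sample_names = [tag.text for tag in soup.find_all("a", attrs={"class": "school-name"})]
sample_locations = [tag.text for tag in soup.find_all("p", attrs={"class": "location"})]

print(sample_ranks, sample_names, sample_locations)
```

Because each list is collected on its own, a listing that lacks one of the three elements would shift the remaining values out of alignment, so it is worth checking that the three lists have equal length on real pages.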

In [3]:
files = ['usnews_page1.txt','usnews_page2.txt','usnews_page3.txt']
stats_df = scrap_text_from_txt(files)

math_files = ['math_page1.txt','math_page2.txt','math_page3.txt','math_page4.txt','math_page5.txt','math_page6.txt']
math_df = scrap_text_from_txt(math_files)

cs_files = ['cs_page1.txt','cs_page2.txt','cs_page3.txt','cs_page4.txt','cs_page5.txt']
cs_df = scrap_text_from_txt(cs_files)

bus_files = ['bus_page1.txt','bus_page2.txt','bus_page3.txt']
bus_df = scrap_text_from_txt(bus_files)
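
The file lists above all follow the same `<prefix>_page<N>.txt` pattern, so they could also be generated rather than typed out. A small sketch, where `page_files` is a hypothetical helper that is not part of the original notebook:

```python
def page_files(prefix, n_pages):
    """Return ['<prefix>_page1.txt', ..., '<prefix>_page<n_pages>.txt']."""
    return [f"{prefix}_page{i}.txt" for i in range(1, n_pages + 1)]

# Prefixes and page counts as used in the lists above.
math_files = page_files("math", 6)
cs_files = page_files("cs", 5)
print(math_files)
```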

In [4]:
# Rename each Rank column so the ranks stay distinct after merging
math_df = math_df.rename(columns={'Rank': 'Math Rank'})
stats_df = stats_df.rename(columns={'Rank': 'Statistics Rank'})
cs_df = cs_df.rename(columns={'Rank': 'Computer Science Rank'})
bus_df = bus_df.rename(columns={'Rank': 'Business Rank'})

In [5]:
# Outer-merge all DataFrames into one, matching on college name and location:
dfs = [stats_df, math_df, cs_df, bus_df]

all_programs = reduce(lambda left,right: pd.merge(left,right,how = 'outer',
                                                  on=['College Name','Location']), dfs)
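
To illustrate what the `reduce` over an outer merge does, here is a minimal sketch on toy data. The frames `stats_toy` and `math_toy` and their values are made up for illustration and are not real rankings.

```python
from functools import reduce
import pandas as pd

# Toy data, not real rankings: school "A" is in both tables, "B" and "C" in one each.
stats_toy = pd.DataFrame({'College Name': ['A', 'B'],
                          'Location': ['X', 'Y'],
                          'Statistics Rank': [1, 2]})
math_toy = pd.DataFrame({'College Name': ['A', 'C'],
                         'Location': ['X', 'Z'],
                         'Math Rank': [3, 1]})

toy = reduce(lambda left, right: pd.merge(left, right, how='outer',
                                          on=['College Name', 'Location']),
             [stats_toy, math_toy])

# Every school is kept; a rank missing from one table becomes NaN,
# which is why the rank columns end up as floats.
print(toy)
```

Note that rows only line up when both the name and the location match exactly, so the same school listed under two different locations stays on separate rows.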

In [6]:
all_programs.sort_values(by = ['Statistics Rank']).head(10)

| | College Name | Location | Statistics Rank | Math Rank | Computer Science Rank | Business Rank |
|---|---|---|---|---|---|---|
| 0 | Stanford University | Stanford, CA | 1.0 | 5.0 | 1.0 | 4.0 |
| 1 | University of California—Berkeley | Berkeley, CA | 2.0 | 3.0 | 1.0 | NaN |
| 3 | Harvard University | Boston, MA | 3.0 | NaN | NaN | 1.0 |
| 4 | University of Washington | Seattle, WA | 3.0 | 25.0 | 6.0 | NaN |
| 6 | Johns Hopkins University | Baltimore, MD | 5.0 | 25.0 | 28.0 | NaN |
| 7 | University of Chicago | Chicago, IL | 5.0 | 5.0 | 34.0 | NaN |
| 5 | University of Washington | Seattle, WA | 7.0 | 25.0 | 6.0 | NaN |
| 8 | Harvard University | Cambridge, MA | 7.0 | 3.0 | 18.0 | NaN |
| 9 | Carnegie Mellon University | Pittsburgh, PA | 9.0 | 34.0 | 1.0 | NaN |
| 10 | Duke University | Durham, NC | 10.0 | 17.0 | 25.0 | NaN |
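
The ranks are shown as floats because the outer merge leaves missing values in the rank columns. A minimal sketch of one way to display them as integers again, assuming a pandas version that supports the nullable `Int64` dtype. The toy frame below stands in for `all_programs`, and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for all_programs; the values are illustrative only.
toy = pd.DataFrame({'College Name': ['A', 'B'],
                    'Statistics Rank': [1.0, 2.0],
                    'Business Rank': [4.0, float('nan')]})

rank_cols = [col for col in toy.columns if col.endswith('Rank')]
# 'Int64' (capital I) is pandas' nullable integer dtype; missing ranks become <NA>.
toy = toy.astype({col: 'Int64' for col in rank_cols})
print(toy)
```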
