## Crawling topuniversities.com

Running this code takes time, as roughly 1,000 web pages need to be requested and parsed. The code was therefore run only once, and the resulting DataFrame was saved with pickle. Some postprocessing (renaming columns, etc.) was done afterwards on the saved DataFrame rather than by re-running the crawl, to minimize internet traffic.

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import numpy as np
import pandas as pd
from functools import reduce


In [2]:
## Later we will have a DataFrame in which each cell holds a snippet of HTML from which
## we have to extract the values we need. This helper lets us run BeautifulSoup once per
## column instead of once per cell: used with reduce, it concatenates the string cells of
## a column into a single string and skips non-string cells (e.g. NaN).
def addS(x, y):
    if isinstance(y, str):
        return x + y + "\n"
    else:
        return x
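As a quick illustration of how the helper is meant to be used with `reduce`, here is a minimal sketch. The cell contents are made up for the example, and passing `""` as the initial value is an addition that is not in the notebook: without it, `reduce` starts from the first cell, which fails if that cell happens to be missing (a float `NaN` cannot be concatenated with a string).

```python
from functools import reduce


def addS(x, y):
    # Same logic as the helper above: append string cells, skip everything else.
    if isinstance(y, str):
        return x + y + "\n"
    return x


# Hypothetical column values; the real cells come from the downloaded JSON.
cells = ["<a href='/u/a'>A</a>", float("nan"), "<a href='/u/b'>B</a>"]

# Starting from an empty string keeps the fold safe even if the first cell is NaN.
joined = reduce(addS, cells, "")

# An equivalent formulation without reduce, shown only for comparison.
joined_alt = "".join(c + "\n" for c in cells if isinstance(c, str))

print(repr(joined))
```

Both variants produce one string per column, which can then be handed to the HTML parser in a single call.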

We begin by requesting the data behind the "Ranking indicators" table on https://www.topuniversities.com/university-rankings/world-university-rankings/2018. The table is actually populated from a JSON object whose cells contain the corresponding HTML code.


In [3]:
URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt'
r = requests.get(URL)
page_body = r.text
jdata=json.loads(page_body)

Now we import the JSON object into pandas and extract the necessary information from each column using BeautifulSoup. For better performance we don't call BeautifulSoup on each cell; instead we join each column into one string and call BeautifulSoup on that. For most cells the information is in the div with class=td-wrap-in; the university name and link are the exception and had to be handled separately.
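The column-at-once idea can be tried offline on a toy frame. The sketch below is not the notebook's code: the rows, the `score` column name, and the `join_cells` helper are hypothetical stand-ins, and the real markup may differ in detail. It also writes the parsed values to a new column instead of overwriting the HTML, which sidesteps dtype issues.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical rows imitating the JSON payload; the real markup may differ in detail.
rows = [
    {"uni": '<a href="/universities/alpha">Alpha University</a>',
     "score": '<div class="td-wrap-in">99.5</div>'},
    {"uni": '<a href="/universities/beta">Beta Institute</a>',
     "score": None},  # missing cell
    {"uni": '<a href="/universities/gamma">Gamma College</a>',
     "score": '<div class="td-wrap-in">87.0</div>'},
]
df = pd.DataFrame(rows)


def join_cells(column):
    """Concatenate the string cells of a column, skipping missing values."""
    return "\n".join(cell for cell in column if isinstance(cell, str))


# One parser call for the whole "uni" column.
anchors = BeautifulSoup(join_cells(df["uni"]), "html.parser").find_all("a")
df["name"] = [a.text for a in anchors]
df["link"] = [a["href"] for a in anchors]

# One parser call for the score column; values map back onto the non-null rows.
divs = BeautifulSoup(join_cells(df["score"]), "html.parser").find_all(
    "div", class_="td-wrap-in"
)
values = [float(d.text) for d in divs]
present = df["score"].notna().to_numpy()
assert len(values) == present.sum()  # guard against misaligned rows
df["score_value"] = pd.Series(values, index=df.index[present])

print(df[["name", "link", "score_value"]])
```

The length check matters because the parsed values are matched to rows purely by position: if a cell contained no matching div, or more than one, every later value would silently shift.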

In [20]:
PDdata=pd.DataFrame(jdata["data"])
PDdata=PDdata.loc[0:199].copy() # we only need the top 200 unis (.loc slicing is inclusive)

# university name and link: both come from the <a> tag in the "uni" column
soup=BeautifulSoup(reduce(addS, PDdata["uni"].values), 'html.parser')
wrappers= soup.find_all('a')
PDdata["uni"]=[p.text for p in wrappers]
PDdata["link"]=[p['href'] for p in wrappers]

codes=["2971069","2971070","2971071","2971072","2971073","2971074"] # rest of the data
for code in codes:
    soup=BeautifulSoup(reduce(addS, PDdata[code].values), 'html.parser')
    wrappers= soup.find_all('div',class_="td-wrap-in")
    # write the parsed scores back into the non-null cells only; .loc avoids chained assignment
    PDdata.loc[PDdata[code].notnull(), code]=[float(p.text) for p in wrappers]


## dropping unnecessary stuff, renaming columns 
drop_codes=["","2971069_rank_d","2971070_rank_d","2971071_rank_d","2971072_rank_d","2971073_rank_d","2971074_rank_d","_rank_d","overall","overall_rank_dis"]
PDdata=PDdata.drop(drop_codes,axis=1)
PDdata.columns=["academic reputation score",
                "citations per faculty score",
                "employer reputation score",
                "faculty-student score",
                "international faculty score",
                "international students score",
                "country",
                "overall rank",
                "region",
                "stars",
                "name",
                "link"]


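## The score columns were created holding HTML strings, so pandas keeps them as dtype object
## even after the floats are written in; the types are therefore set explicitly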
PDdata["academic reputation score"]=PDdata["academic reputation score"].astype(float)
PDdata["citations per faculty score"]=PDdata["citations per faculty score"].astype(float)
PDdata["employer reputation score"]=PDdata["employer reputation score"].astype(float)
PDdata["faculty-student score"]=PDdata["faculty-student score"].astype(float)
PDdata["international students score"]=PDdata["international students score"].astype(float)
PDdata["international faculty score"]=PDdata["international faculty score"].astype(float)
PDdata["overall rank"]=PDdata["overall rank"].astype(int)


PDdata.dtypes

academic reputation score       float64
citations per faculty score     float64
employer reputation score       float64
faculty-student score           float64
international faculty score     float64
international students score    float64
country                          object
overall rank                      int64
region                           object
stars                            object
name                             object
link                             object
dtype: object
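The per-column casts can also be expressed as a single `DataFrame.astype` call with a column-to-type mapping. The following is a minimal sketch on a hypothetical two-row frame; the object dtype is set by hand here to mimic columns that started out holding HTML strings.

```python
import pandas as pd

# Hypothetical frame: object dtype mimics columns that once held HTML strings.
df = pd.DataFrame({
    "academic reputation score": pd.Series([100.0, 99.5], dtype=object),
    "employer reputation score": pd.Series([100.0, 85.4], dtype=object),
    "overall rank": pd.Series([1, 4], dtype=object),
    "country": ["United States", "United States"],
})

# Build the mapping from the column names instead of listing each cast by hand.
dtypes = {col: float for col in df.columns if col.endswith("score")}
dtypes["overall rank"] = int

df = df.astype(dtypes)
print(df.dtypes)
```

Columns not named in the mapping, such as `country`, are left untouched.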

Since the info on the (international) students/faculty is not available here, we need to visit each university's dedicated page. This is the part of the code that takes a considerable amount of time. Note that the details page of NYU (index 51) doesn't have the necessary info.

In [22]:
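url_data=np.zeros((PDdata.shape[0],4))
for i in range(PDdata.shape[0]):
    # the links already start with "/", so the base URL has no trailing slash
    URL = 'https://www.topuniversities.com'+ PDdata["link"][i]
    r = requests.get(URL)
    page_body = r.text

    soup = BeautifulSoup(page_body, 'html.parser')
    number_divs = soup.find_all('div', class_='number')
    # drop the first and last character around each number and the thousands separators;
    # the first four numbers are total faculty, international faculty, total students, international students
    temp_data=[int(p.text[1:-1].replace(",","")) for p in number_divs][0:4]
    if len(temp_data)==4:
        url_data[i,:]=temp_data
    else:
        print(i) # page without the four numbers; its row stays at zero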

51
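Because the array is initialised with zeros, a page without the four numbers (such as index 51 above) ends up looking like a university with zero staff and students. One possible alternative, sketched below under stated assumptions, is to start from `NaN` and keep a list of the failed indices. The `parse_counts` helper and the sample markup are hypothetical; in real use the HTML would come from something like `requests.get(url, timeout=10).text`. The sketch also uses `strip()` rather than slicing off exactly one character on each side, which is a deliberate change from the loop above.

```python
import numpy as np
from bs4 import BeautifulSoup


def parse_counts(html, expected=4):
    """Return the first `expected` integers found in <div class="number">
    elements, or None if the page has fewer of them."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", class_="number")[:expected]
    numbers = [int(d.text.strip().replace(",", "")) for d in divs]
    return numbers if len(numbers) == expected else None


# Hypothetical markup; the real pages may wrap the numbers differently.
complete = "".join(
    f'<div class="number">\n{n}\n</div>'
    for n in ["2,982", "1,679", "11,067", "3,717"]
)
incomplete = '<div class="number">\n2,982\n</div>'
pages = [complete, incomplete]

url_data = np.full((len(pages), 4), np.nan)  # NaN marks "not available"
missing = []
for i, html in enumerate(pages):
    counts = parse_counts(html)
    if counts is None:
        missing.append(i)
    else:
        url_data[i, :] = counts

print(url_data)
print("pages without data:", missing)
```

With `NaN` in place of zero, later aggregations such as means or ratios skip the missing rows by default in pandas instead of being skewed by them.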


In [23]:
PDdata["total faculty"]=url_data[:,0]
PDdata["international faculty"]=url_data[:,1]
PDdata["total students"]=url_data[:,2]
PDdata["international studets"]=url_data[:,3]

Now we save the dataframe so we don't have to rerun the code.

In [24]:
PDdata.to_pickle("topuniversities.p")
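The save-once, reload-later workflow can be wrapped in a small helper so the crawl only runs when no cached file exists. This is a sketch rather than the notebook's code: `load_or_build` and `fake_crawl` are hypothetical names, and the stand-in crawl returns a one-row frame instead of touching the network.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Load a pickled DataFrame from `path`, or build and cache it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build()
    df.to_pickle(path)
    return df


calls = []


def fake_crawl():
    # Stand-in for the slow crawl; records how often it is actually run.
    calls.append(1)
    return pd.DataFrame({"name": ["Example University"], "overall rank": [1]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "topuniversities.p")
    first = load_or_build(cache, fake_crawl)   # runs the crawl and writes the cache
    second = load_or_build(cache, fake_crawl)  # served from the pickle file

print("crawl ran", len(calls), "time(s)")
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.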

In [14]:
PDdata=pd.read_pickle("topuniversities.p")

Unnamed: 0,academic reputation score,citations per faculty score,employer reputation score,faculty-student score,international faculty score,international students score,country,overall rank,region,stars,name,link,total faculty,international faculty,total students,international studets
0,100.0,99.9,100.0,100.0,100.0,96.1,United States,1,North America,6,Massachusetts Institute of Technology (MIT),/universities/massachusetts-institute-technolo...,2982.0,1679.0,11067.0,3717.0
1,100.0,99.4,100.0,100.0,99.6,72.7,United States,2,North America,5,Stanford University,/universities/stanford-university,4285.0,2042.0,15878.0,3611.0
2,100.0,99.9,100.0,98.3,96.5,75.2,United States,3,North America,5,Harvard University,/universities/harvard-university,4350.0,1311.0,22429.0,5266.0
3,99.5,100.0,85.4,100.0,93.4,89.2,United States,4,North America,5,California Institute of Technology (Caltech),/universities/california-institute-technology-...,953.0,350.0,2255.0,647.0
4,100.0,78.3,100.0,100.0,97.4,97.7,United Kingdom,5,Europe,5,University of Cambridge,/universities/university-cambridge,5490.0,2278.0,18770.0,6699.0
5,100.0,76.3,100.0,100.0,98.6,98.5,United Kingdom,6,Europe,5,University of Oxford,/universities/university-oxford,6750.0,2964.0,19720.0,7353.0
6,99.7,74.7,99.5,99.1,96.6,100.0,United Kingdom,7,Europe,,UCL (University College London),/universities/ucl-university-college-london,6345.0,2554.0,31080.0,14854.0
7,99.4,68.7,100.0,100.0,100.0,100.0,United Kingdom,8,Europe,,Imperial College London,/universities/imperial-college-london,3930.0,2071.0,16090.0,8746.0
8,99.9,85.9,92.9,96.5,71.9,79.8,United States,9,North America,5,University of Chicago,/universities/university-chicago,2449.0,635.0,13557.0,3379.0
9,99.6,98.7,99.4,68.2,100.0,98.8,Switzerland,10,Europe,,ETH Zurich - Swiss Federal Institute of Techno...,/universities/eth-zurich-swiss-federal-institu...,2477.0,1886.0,19815.0,7563.0
