# 02 - Data from the Web


In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that each maintain a global ranking of universities. These rankings are not offered as downloadable datasets, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.

In [1]:
# urllib2 exists only in Python 2 and is not used below, so it is not imported
from bs4 import BeautifulSoup
import requests
import json
import string
import pandas as pd
import re
import math

## Data from QS ranking

We tried several strategies to locate the data behind the QS ranking table, but only one of them led to the expected result. We opened the Web Inspector on the website. Among the resources loaded by the page, a text file named 'XHR/357051.txt' contained the ranking along with the information we wanted. URL_QS points to this file.

In [2]:
URL_QS = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508497791260'
r = requests.get(URL_QS)
page_body = r.json()
qs_df = pd.DataFrame(page_body['data'])
qs_df = qs_df.drop(['core_id','logo','guide','cc', 'nid', 'stars','score'], axis=1).head(200)
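
To make the trimming step above easier to follow without touching the network, here is a minimal plain-Python sketch of the same idea: keep the first entries under the 'data' key and discard the unwanted keys. The helper `trim_rows` and the sample payload are hypothetical and only mimic the shape the cell above relies on; the real file has more fields.

```python
# Keys discarded in the cell above; every other key is kept.
DROP_KEYS = {'core_id', 'logo', 'guide', 'cc', 'nid', 'stars', 'score'}

def trim_rows(payload, limit=200, drop=DROP_KEYS):
    """Keep the first `limit` entries of payload['data'], minus the keys in `drop`."""
    rows = payload['data'][:limit]
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Made-up payload (assumption): it only imitates the {'data': [...]} layout.
sample_payload = {'data': [
    {'title': 'University A', 'rank_display': '1', 'country': 'X', 'logo': '<img>', 'score': '100'},
    {'title': 'University B', 'rank_display': '2', 'country': 'Y', 'logo': '<img>', 'score': '99'},
    {'title': 'University C', 'rank_display': '3', 'country': 'Z', 'logo': '<img>', 'score': '98'},
]}

trimmed = trim_rows(sample_payload, limit=2)
print(trimmed)
```

The list of dictionaries returned here is the kind of input that pd.DataFrame accepts directly, which is why the notebook can build the table in a single call.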

Additional information had to be retrieved from each university's own page. Using the 'url' column, we visited each university page, extracted the figures we wanted, and stored them in additional columns. Some pages contain no such figures; in that case the columns are filled with NaN.
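
The digit-extraction step can be sketched in isolation. The helper `parse_count` below is hypothetical (it is not part of the notebook) and assumes it receives the text of one number element, for example a figure written with a thousands separator. It strips every non-digit character and falls back to a float NaN when nothing is left.

```python
import re

def parse_count(fragment):
    """Return the integer formed by the digits in `fragment`, or NaN if it has none."""
    digits = re.sub(r'\D', '', fragment)
    return int(digits) if digits else float('nan')

print(parse_count('2,982'))         # 2982
print(parse_count('\n 11,067 \n'))  # 11067
print(parse_count('n/a'))           # nan
```

One caveat with this approach: it keeps every digit it sees, so it should be given the element's text rather than raw HTML whose attributes might themselves contain digits.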

In [4]:
# Visit each university page, one row of the table at a time

for index_ in range(qs_df.shape[0]):

    # get_value/set_value were removed from pandas; .loc reads and writes single cells
    URL_ADD = 'https://www.topuniversities.com' + qs_df.loc[index_, 'url']
    r = requests.get(URL_ADD)
    page_body = r.text
    soup = BeautifulSoup(page_body, 'html.parser')
    # Collect every number displayed on the page; their order is known
    all_numbers = soup.find_all('div', {'class': 'number'})

    # If the page does not show all four numbers, store NaN
    # (a float NaN rather than the string 'NaN', so the columns stay numeric)
    if len(all_numbers) < 4:
        total_faculty_number = float('nan')
        interna_faculty_number = float('nan')
        total_students_number = float('nan')
        interna_students_number = float('nan')

    else:
        # Total Faculty Members
        total_faculty_number = int(re.sub(r"\D", '', all_numbers[0].get_text()))

        # International Faculty Members
        interna_faculty_number = int(re.sub(r"\D", '', all_numbers[1].get_text()))

        # Total Students
        total_students_number = int(re.sub(r"\D", '', all_numbers[2].get_text()))

        # International Students
        interna_students_number = int(re.sub(r"\D", '', all_numbers[3].get_text()))

    # Store the extracted values in the new columns
    qs_df.loc[index_, 'Faculty Members - Total'] = total_faculty_number
    qs_df.loc[index_, 'Faculty Members - International'] = interna_faculty_number
    qs_df.loc[index_, 'No Students - Total'] = total_students_number
    qs_df.loc[index_, 'No Students - International'] = interna_students_number

qs_df.head(10)

| | country | rank_display | region | title | url | Faculty Members - Total | Faculty Members - International | No Students - Total | No Students - International |
|---|---|---|---|---|---|---|---|---|---|
| 0 | United States | 1 | North America | Massachusetts Institute of Technology (MIT) | /universities/massachusetts-institute-technolo... | 2982.0 | 1679.0 | 11067.0 | 3717.0 |
| 1 | United States | 2 | North America | Stanford University | /universities/stanford-university | 4285.0 | 2042.0 | 15878.0 | 3611.0 |
| 2 | United States | 3 | North America | Harvard University | /universities/harvard-university | 4350.0 | 1311.0 | 22429.0 | 5266.0 |
| 3 | United States | 4 | North America | California Institute of Technology (Caltech) | /universities/california-institute-technology-... | 953.0 | 350.0 | 2255.0 | 647.0 |
| 4 | United Kingdom | 5 | Europe | University of Cambridge | /universities/university-cambridge | 5490.0 | 2278.0 | 18770.0 | 6699.0 |
| 5 | United Kingdom | 6 | Europe | University of Oxford | /universities/university-oxford | 6750.0 | 2964.0 | 19720.0 | 7353.0 |
| 6 | United Kingdom | 7 | Europe | UCL (University College London) | /universities/ucl-university-college-london | 6345.0 | 2554.0 | 31080.0 | 14854.0 |
| 7 | United Kingdom | 8 | Europe | Imperial College London | /universities/imperial-college-london | 3930.0 | 2071.0 | 16090.0 | 8746.0 |
| 8 | United States | 9 | North America | University of Chicago | /universities/university-chicago | 2449.0 | 635.0 | 13557.0 | 3379.0 |
| 9 | Switzerland | 10 | Europe | ETH Zurich - Swiss Federal Institute of Techno... | /universities/eth-zurich-swiss-federal-institu... | 2477.0 | 1886.0 | 19815.0 | 7563.0 |
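
When a university page shows no figures, the loop stores a missing-value marker. The short sketch below illustrates, under stated assumptions, why a float NaN is easier to work with afterwards than the string 'NaN'. The two real values are the MIT and Stanford faculty totals from the table above; the missing entry between them is hypothetical, as is the helper `mean_ignoring_nan`.

```python
import math

counts_float = [2982, float('nan'), 4285]  # missing value stored as a float NaN
counts_text = [2982, 'NaN', 4285]          # missing value stored as a string

def mean_ignoring_nan(values):
    """Average the entries that are not NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)

print(mean_ignoring_nan(counts_float))  # 3633.5

try:
    mean_ignoring_nan(counts_text)
except TypeError as exc:
    # math.isnan rejects strings, so the text marker breaks numeric code
    print('string marker breaks numeric code:', exc)
```

The same reasoning carries over to the DataFrame: a float NaN keeps the column numeric, whereas a string turns it into a column of mixed objects.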


## Data from Times Higher Education ranking

In [None]:
URL_THE = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
r = requests.get(URL_THE)
page_body = r.json()
the_df = pd.DataFrame(page_body['data'])

In [None]:
the_df