# 02 - Data from the Web


In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that each maintain a global ranking of universities. These rankings are not offered as downloadable datasets, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.

In [1]:
from bs4 import BeautifulSoup
import urllib2
import requests
import json
import string
import pandas as pd
import re
import math
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output

## Data from QS ranking

We tested several strategies to find the data behind the QS Rankings table, but only one led to the expected result. We opened the Web Inspector on the website: among the resources loaded by the page, a text file listed as 'XHR/357051.txt' contained the expected ranking with the desired information. URL_QS points to this file.

In [2]:
URL_QS = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508497791260'
r = requests.get(URL_QS)
page_body = r.json()
qs_df = pd.DataFrame(page_body['data'])
qs_df = qs_df.drop(['core_id','logo','guide','cc', 'nid', 'stars','score'], axis=1).head(200)

Additional information had to be fetched from each university's page. For each row, we followed the 'url' column to the university page, extracted the numeric values we wanted and stored them in additional columns. When a page contains no such information, the columns are filled with NaN.
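
The digit-stripping step used for each figure can be tried in isolation. The sketch below is a minimal, standard-library-only illustration: `parse_count` is a hypothetical helper name that does not appear in the notebook, and it assumes the figures are shown as text such as `2,982`.

```python
import math
import re


def parse_count(text):
    """Return the integer formed by the digits in `text`, or NaN if there are none."""
    digits = re.sub(r"\D", "", text)  # drop thousands separators, spaces, letters
    return int(digits) if digits else float("nan")


print(parse_count("2,982"))             # 2982
print(parse_count(" 11,067 "))          # 11067
print(math.isnan(parse_count("n/a")))   # True
```

Returning a float NaN rather than the string 'NaN' keeps the resulting column numeric, so checks such as `np.isfinite` work on it directly.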

In [3]:
# Visit each university's page and collect the figures shown there

def to_int(tag):
    # Keep only the digits of the tag's text, e.g. '2,982' -> 2982
    return int(re.sub(r'\D', '', tag.get_text()))

for index_ in range(0, qs_df.shape[0]):

    URL_ADD = 'https://www.topuniversities.com' + qs_df.loc[index_, 'url']
    r = requests.get(URL_ADD)
    page_body = r.text
    soup = BeautifulSoup(page_body, 'html.parser')
    # Collect every number displayed on the page; their order is known:
    # total faculty, international faculty, total students, international students
    all_numbers = soup.findAll('div', {'class' : 'number'})

    if len(all_numbers) < 4:
        # Figures are missing on this page: store NaN
        total_faculty_number = np.nan
        interna_faculty_number = np.nan
        total_students_number = np.nan
        interna_students_number = np.nan
    else:
        total_faculty_number = to_int(all_numbers[0])     # Total Faculty Members
        interna_faculty_number = to_int(all_numbers[1])   # International Faculty Members
        total_students_number = to_int(all_numbers[2])    # Total Students
        interna_students_number = to_int(all_numbers[3])  # International Students

    # Store the values in the new columns
    qs_df.loc[index_, 'Faculty Members - Total'] = total_faculty_number
    qs_df.loc[index_, 'Faculty Members - International'] = interna_faculty_number
    qs_df.loc[index_, 'No Students - Total'] = total_students_number
    qs_df.loc[index_, 'No Students - International'] = interna_students_number
    
    clear_output()
    print('Progress', index_/2, '%')
qs_df.head(10)

('Progress', 99, '%')


Unnamed: 0,country,rank_display,region,title,url,Faculty Members - Total,Faculty Members - International,No Students - Total,No Students - International
0,United States,1,North America,Massachusetts Institute of Technology (MIT),/universities/massachusetts-institute-technolo...,2982.0,1679.0,11067.0,3717.0
1,United States,2,North America,Stanford University,/universities/stanford-university,4285.0,2042.0,15878.0,3611.0
2,United States,3,North America,Harvard University,/universities/harvard-university,4350.0,1311.0,22429.0,5266.0
3,United States,4,North America,California Institute of Technology (Caltech),/universities/california-institute-technology-...,953.0,350.0,2255.0,647.0
4,United Kingdom,5,Europe,University of Cambridge,/universities/university-cambridge,5490.0,2278.0,18770.0,6699.0
5,United Kingdom,6,Europe,University of Oxford,/universities/university-oxford,6750.0,2964.0,19720.0,7353.0
6,United Kingdom,7,Europe,UCL (University College London),/universities/ucl-university-college-london,6345.0,2554.0,31080.0,14854.0
7,United Kingdom,8,Europe,Imperial College London,/universities/imperial-college-london,3930.0,2071.0,16090.0,8746.0
8,United States,9,North America,University of Chicago,/universities/university-chicago,2449.0,635.0,13557.0,3379.0
9,Switzerland,10,Europe,ETH Zurich - Swiss Federal Institute of Techno...,/universities/eth-zurich-swiss-federal-institu...,2477.0,1886.0,19815.0,7563.0


## Best QS university in terms of ratio between faculty members and students

In [8]:
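def ratio(a, b):
    """Return b as a percentage of a, i.e. 100 * b / a.

    Called below with a = total students, so the 'Students/Staff' column
    actually holds the number of faculty members per 100 students.
    """
    # Missing value stored as the string 'NaN': return a unchanged
    if b == 'NaN':
        return a

    a = float(a)
    b = float(b)
    # A zero count is mapped to 1
    if b == 0:
        return 1
    pct = 100*b/a
    # Percentages above 100 are replaced by 1
    if pct > 100:
        pct = 1
    return pct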
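
To make the arithmetic concrete, here is a small sketch that applies the same percentage formula to the MIT figures from the QS table above. `percentage` is a hypothetical stand-in for the core formula of `ratio`, without its special cases.

```python
def percentage(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# MIT figures taken from the QS table above
students_total = 11067
students_international = 3717
faculty_total = 2982

# Share of international students, in percent
print(round(percentage(students_international, students_total), 1))  # 33.6
# Faculty members per 100 students
print(round(percentage(faculty_total, students_total), 1))           # 26.9
```

Read this way, a higher value in either column means more international students or more faculty members relative to the student body.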

In [9]:
# Create the ratios for every university

for index, row in qs_df.iterrows():
    has_students = (np.isfinite(row['No Students - Total'])
                    and np.isfinite(row['No Students - International']))
    has_faculty = (np.isfinite(row['No Students - Total'])
                   and np.isfinite(row['Faculty Members - Total']))
    if has_students or has_faculty:

        qs_df.loc[index, 'QS Ratio Int. Students'] = ratio(row['No Students - Total'],
                                                           row['No Students - International'])
        qs_df.loc[index, 'QS Ratio Students/Staff'] = ratio(row['No Students - Total'],
                                                            row['Faculty Members - Total'])
        

# Sort the universities by each of the two requested ratios, best first

qs_df_IntStud = qs_df.sort_values(['QS Ratio Int. Students'],ascending=[False])
qs_df_StaffStud = qs_df.sort_values(['QS Ratio Students/Staff'],ascending=[False])
qs_df_IntStud = qs_df_IntStud.reset_index(drop=True)
qs_df_StaffStud = qs_df_StaffStud.reset_index(drop=True)

# Award points by position in each sorted list: N points for the best ratio, down to 1
points = qs_df.shape[0]
for index_ in range(0, qs_df.shape[0]):
    qs_df_IntStud.loc[index_, 'Points'] = points
    qs_df_StaffStud.loc[index_, 'Points'] = points
    points -= 1
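
The point scheme can be illustrated without pandas. The sketch below is a simplified, hypothetical version (`assign_points` is not a function from the notebook) that assumes every ratio is present, since plain Python sorting does not handle NaN the way `sort_values` does. The sample ratios are the international-student percentages of three universities in the ranking.

```python
def assign_points(ratios):
    """Map each name to N, N-1, ..., 1 in order of descending ratio (N = number of entries)."""
    ordered = sorted(ratios, key=ratios.get, reverse=True)
    n = len(ordered)
    return {name: n - position for position, name in enumerate(ordered)}


sample = {
    "Imperial College London": 54.356743,
    "London School of Economics": 69.139344,
    "EPFL": 57.004738,
}
points = assign_points(sample)
print(points["London School of Economics"])  # 3
print(points["EPFL"])                        # 2
print(points["Imperial College London"])     # 1
```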

In [10]:
qs_df_IntStud

Unnamed: 0,country,rank_display,region,title,url,Faculty Members - Total,Faculty Members - International,No Students - Total,No Students - International,QS Ratio Int. Students,QS Ratio Students/Staff,Points
0,United Kingdom,35,Europe,London School of Economics and Political Scien...,/universities/london-school-economics-politica...,1088.0,687.0,9760.0,6748.0,69.139344,11.147541,200.0
1,Switzerland,12,Europe,Ecole Polytechnique Fédérale de Lausanne (EPFL),/universities/ecole-polytechnique-f%C3%A9d%C3%...,1695.0,1300.0,10343.0,5896.0,57.004738,16.387895,199.0
2,United Kingdom,8,Europe,Imperial College London,/universities/imperial-college-london,3930.0,2071.0,16090.0,8746.0,54.356743,24.425109,198.0
3,Netherlands,200,Europe,Maastricht University,/universities/maastricht-university,1277.0,502.0,16385.0,8234.0,50.253280,7.793714,197.0
4,United States,=47,North America,Carnegie Mellon University,/universities/carnegie-mellon-university,1342.0,425.0,13356.0,6385.0,47.806229,10.047919,196.0
5,United Kingdom,7,Europe,UCL (University College London),/universities/ucl-university-college-london,6345.0,2554.0,31080.0,14854.0,47.792793,20.415058,195.0
6,United Kingdom,92,Europe,University of St Andrews,/universities/university-st-andrews,1140.0,485.0,8800.0,4030.0,45.795455,12.954545,194.0
7,Australia,=41,Oceania,The University of Melbourne,/universities/university-melbourne,3311.0,1477.0,42182.0,18030.0,42.743350,7.849320,193.0
8,United Kingdom,127,Europe,Queen Mary University of London,/universities/queen-mary-university-london,1885.0,801.0,16135.0,6806.0,42.181593,11.682677,192.0
9,Hong Kong,26,Asia,The University of Hong Kong,/universities/university-hong-kong,3012.0,2085.0,20214.0,8230.0,40.714356,14.900564,191.0


## Data from Times Higher Education ranking

In [28]:
URL_THE = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
r = requests.get(URL_THE)
page_body = r.json()
the_df = pd.DataFrame(page_body['data'])

In [29]:
# Drop the columns we do not need and keep the top 200 universities
the_df = the_df.drop(['member_level', 'rank_order', 'record_type', 'scores_citations', 'scores_citations_rank',
                      'scores_overall', 'scores_overall_rank', 'scores_research', 'scores_research_rank',
                      'scores_teaching', 'scores_teaching_rank', 'subjects_offered', 'scores_industry_income',
                      'scores_industry_income_rank', 'scores_international_outlook',
                      'scores_international_outlook_rank', 'nid', 'stats_female_male_ratio'], axis=1).head(200)

In [30]:
the_df

Unnamed: 0,aliases,location,name,rank,stats_number_students,stats_pc_intl_students,stats_student_staff_ratio,url
0,University of Oxford,United Kingdom,University of Oxford,1,20409,38%,11.2,/world-university-rankings/university-oxford
1,University of Cambridge,United Kingdom,University of Cambridge,2,18389,35%,10.9,/world-university-rankings/university-cambridge
2,California Institute of Technology caltech,United States,California Institute of Technology,=3,2209,27%,6.5,/world-university-rankings/california-institut...
3,Stanford University,United States,Stanford University,=3,15845,22%,7.5,/world-university-rankings/stanford-university
4,Massachusetts Institute of Technology,United States,Massachusetts Institute of Technology,5,11177,34%,8.7,/world-university-rankings/massachusetts-insti...
5,Harvard University,United States,Harvard University,6,20326,26%,8.9,/world-university-rankings/harvard-university
6,Princeton University,United States,Princeton University,7,7955,24%,8.3,/world-university-rankings/princeton-university
7,Imperial College London,United Kingdom,Imperial College London,8,15857,55%,11.4,/world-university-rankings/imperial-college-lo...
8,University of Chicago,United States,University of Chicago,9,13525,25%,6.2,/world-university-rankings/university-chicago
9,ETH Zurich – Swiss Federal Institute of Techno...,Switzerland,ETH Zurich – Swiss Federal Institute of Techno...,=10,19233,38%,14.6,/world-university-rankings/eth-zurich-swiss-fe...
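
The THE columns are not directly comparable with the QS ratios computed earlier: `stats_pc_intl_students` is displayed as text such as `38%`, and `stats_student_staff_ratio` counts students per staff member, whereas the QS column holds faculty members per 100 students. The sketch below shows one possible conversion. Both helper names are hypothetical, and it assumes the values look exactly as displayed in the table above.

```python
def parse_percent(text):
    """Convert a percentage string such as '38%' to a float (38.0)."""
    return float(text.strip().rstrip("%"))


def staff_per_100_students(students_per_staff):
    """Convert a students-per-staff ratio into staff members per 100 students."""
    return 100.0 / students_per_staff


# University of Oxford values from the THE table above
print(parse_percent("38%"))                    # 38.0
print(round(staff_per_100_students(11.2), 2))  # 8.93
```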
