## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and must be retrieved from the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking into consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of a tie, use the order defined in the webpage.


In [110]:
# Imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import requests
import unicodedata
import re
from bs4 import BeautifulSoup

# www.topuniversities.com

## Methodology

After examining the webpage, we identify the URL of a JSON file that contains the ranking. Working with JSON is much simpler than working with HTML, so we perform an HTTP GET on that URL, providing a suitable timestamp value (`?_=...`). We then parse the JSON and extract the information we are interested in.

As indicated in the problem statement, some stats are not available in the main list. We therefore visit each university's page and scrape the missing stats from its HTML using BeautifulSoup.

Finally, we put all this data in a dataframe.
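
As a minimal sketch of two small details of this approach (the timestamp query parameter and the cleaning of tied ranks, which are marked with `=`), the snippet below builds a request URL and parses a few rank strings. The base URL, the path `/ranking.json` and the sample entries are hypothetical placeholders, not the site's actual data, and `build_ranking_url` and `parse_rank` are illustrative helper names.

```python
import time

# Hypothetical base URL and path, used only for illustration.
BASE_URL = 'https://example.com'
RANKING_PATH = '/ranking.json?_=%d'


def build_ranking_url(now=None):
    """Return the ranking URL with a millisecond timestamp appended."""
    if now is None:
        now = time.time()
    return (BASE_URL + RANKING_PATH) % int(now * 1000)


def parse_rank(rank_display):
    """Convert a displayed rank such as '=2' to the integer 2."""
    return int(rank_display.replace('=', ''))


# Made-up entries, listed in the order they would appear on the page.
sample = [
    {'title': 'University A', 'rank_display': '1'},
    {'title': 'University B', 'rank_display': '=2'},
    {'title': 'University C', 'rank_display': '=2'},
]

# sorted() is stable, so tied entries keep the order they had on the page.
ordered = sorted(sample, key=lambda e: parse_rank(e['rank_display']))
url = build_ranking_url(now=1500000000.0)
print(url)
print([(parse_rank(e['rank_display']), e['title']) for e in ordered])
```

Because Python's sort is stable, sorting by the cleaned integer rank never reorders tied universities, which is consistent with the hint about using the webpage order for ties.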

In [111]:
BASE_URL_TOP = 'https://www.topuniversities.com'
RANKING_URL_TOP = '/sites/default/files/qs-rankings-data/357051.txt?_=%d'

In [112]:
# Helper function to extract a metric using a CSS selector
def extract_metric(soup, selector):
    try:
        return int(soup.select(selector)[0].get_text().replace(',', ''))
    except (IndexError, ValueError):  # selector matched nothing, or the text is not a number
        return None

# Function to obtain and format the top n universities in the ranking
def get_list_top(n=200):
    # Request the JSON data
    timestamp = int(time.time() * 1000)
    r = requests.get((BASE_URL_TOP + RANKING_URL_TOP) % timestamp)
    raw_ranking = r.json()['data'][0:n]
    
    # Process and clean each entry in the ranking, fetching the detailed page and extracting the necessary information
    ranking = []
    for entry in raw_ranking:
        rank = int(entry['rank_display'].replace('=', '')) # drop the = for tied ranks
        name = entry['title']
        country = entry['country']
        region = entry['region']
        r = requests.get(BASE_URL_TOP + entry['url'])
        soup = BeautifulSoup(r.text, 'lxml')
        total_students = extract_metric(soup, 'div.total.student .number')
        total_int_students = extract_metric(soup, 'div.total.inter .number')
        total_faculty = extract_metric(soup, 'div.total.faculty .number')
        total_int_faculty = extract_metric(soup, 'div.inter.faculty .number')
        ranking.append({'rank': rank, 'name': name, 'country': country, 'region': region, 'total_students': total_students, 'total_int_students': total_int_students, 'total_faculty': total_faculty, 'total_int_faculty': total_int_faculty})        
    
    # Create the dataframe and index it
    df = pd.DataFrame(ranking)
    df = df.set_index('rank')
    # reindex_axis was removed from pandas; reindex(columns=...) fixes the column order
    df = df.reindex(columns=['name', 'country', 'region', 'total_students', 'total_int_students', 'total_faculty', 'total_int_faculty'])
    return df
    
# Let's see what it looks like...
top_df = get_list_top()
top_df

rank,name,country,region,total_students,total_int_students,total_faculty,total_int_faculty
1,Massachusetts Institute of Technology (MIT),United States,North America,11067.0,3717.0,2982.0,1679.0
2,Stanford University,United States,North America,15878.0,3611.0,4285.0,2042.0
3,Harvard University,United States,North America,22429.0,5266.0,4350.0,1311.0
4,California Institute of Technology (Caltech),United States,North America,2255.0,647.0,953.0,350.0
5,University of Cambridge,United Kingdom,Europe,18770.0,6699.0,5490.0,2278.0
6,University of Oxford,United Kingdom,Europe,19720.0,7353.0,6750.0,2964.0
7,UCL (University College London),United Kingdom,Europe,31080.0,14854.0,6345.0,2554.0
8,Imperial College London,United Kingdom,Europe,16090.0,8746.0,3930.0,2071.0
9,University of Chicago,United States,North America,13557.0,3379.0,2449.0,635.0
10,ETH Zurich - Swiss Federal Institute of Techno...,Switzerland,Europe,19815.0,7563.0,2477.0,1886.0


# www.timeshighereducation.com

## Methodology

We also identify the URL of a JSON file that contains this ranking and proceed in a similar way as for the previous one. The JSON file contains everything we are interested in except the number of international faculty members and the region. Since the detail pages of the universities do not contain this information either, there is no point in using BeautifulSoup to extract data from them. The numbers of international students and of faculty members are not given directly; we derive them from the percentage of international students and the student-to-staff ratio.
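
This ranking reports the share of international students and the student-to-staff ratio rather than absolute counts, so the absolute numbers have to be computed. Here is a minimal sketch of that arithmetic, assuming the fields arrive as strings formatted like the ones below. The sample values are illustrative, and `derive_counts` is a hypothetical helper rather than part of the notebook.

```python
def derive_counts(students, pct_intl, student_staff_ratio):
    """Derive absolute counts from the string fields of one ranking entry.

    students: e.g. '20,409'; pct_intl: e.g. '38%'; student_staff_ratio: e.g. '11.2'.
    """
    total_students = int(students.replace(',', ''))
    # international students = share (in percent) / 100 * total students
    total_int_students = int(float(pct_intl.replace('%', '')) / 100.0 * total_students)
    # faculty = students / (students per staff member)
    total_faculty = int(total_students / float(student_staff_ratio))
    return total_students, total_int_students, total_faculty


print(derive_counts('20,409', '38%', '11.2'))  # (20409, 7755, 1822)
```

Note that `int()` truncates toward zero, so the derived counts are rounded down. Since the percentage and ratio are themselves rounded, the results are estimates rather than exact figures.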

In [113]:
RANKING_URL_TIMES = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

In [114]:
# Function to obtain and format the top n universities in the ranking
def get_list_times(n=200):
    # HTTP GET to fetch the data
    r = requests.get(RANKING_URL_TIMES)
    raw_ranking = r.json()['data']
    
    # Process and clean each entry in the ranking
    ranking = []
    for entry in raw_ranking[0:n]:
        rank = int(entry['rank'].replace('=', '')) # drop the = for tied ranks
        name = entry['name']
        country = entry['location']
        total_students = int(entry['stats_number_students'].replace(',', ''))
        total_int_students = int(float(entry['stats_pc_intl_students'].replace('%', '')) / 100.0 * total_students)
        total_faculty = int(total_students / float(entry['stats_student_staff_ratio']))
        ranking.append({'rank': rank, 'name': name, 'country': country, 'total_students': total_students, 'total_int_students': total_int_students, 'total_faculty': total_faculty})   
    
    # Create a dataframe and index it
    df = pd.DataFrame(ranking)
    df = df.set_index('rank')
    # reindex_axis was removed from pandas; reindex(columns=...) fixes the column order
    df = df.reindex(columns=['name', 'country', 'total_students', 'total_int_students', 'total_faculty'])
    return df

# Let's see what this one looks like
times_df = get_list_times()
times_df

rank,name,country,total_students,total_int_students,total_faculty
1,University of Oxford,United Kingdom,20409,7755,1822
2,University of Cambridge,United Kingdom,18389,6436,1687
3,California Institute of Technology,United States,2209,596,339
3,Stanford University,United States,15845,3485,2112
5,Massachusetts Institute of Technology,United States,11177,3800,1284
6,Harvard University,United States,20326,5284,2283
7,Princeton University,United States,7955,1909,958
8,Imperial College London,United Kingdom,15857,8721,1390
9,University of Chicago,United States,13525,3381,2181
10,ETH Zurich – Swiss Federal Institute of Techno...,Switzerland,19233,7308,1317
