 <span>

## Deadline
Wednesday October 25, 2017 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated (you don't want your colleagues to generate unnecessary Web traffic during the peer review).
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code.

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to quickly understand how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find it in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking into consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of a tie, use the order defined in the webpage.

</span>


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

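- Interceptor: the ranking is loaded from a `.txt` file that contains JSON
- We keep every row up to rank 200, not the first 200 rows
- `find` vs `find_all`
- NYU issue: no faculty or student numbers are found on its details page
- Indian Institute of Science (IISc) Bangalore issue: the number of international faculty members is missing
- `json['data'][0].keys()` lists the fields available for each university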
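
The note about keeping everything up to rank 200 matters because tied ranks are displayed with a leading `=`, so the number of rows with rank 200 or better need not be exactly 200. A minimal sketch of that cutoff on a hand-made list follows; the sample entries and the helper name `rank_to_int` are illustrative assumptions, not data from the site.

```python
def rank_to_int(rank_display):
    """Convert a displayed rank such as '=199' to the integer 199."""
    return int(rank_display.replace('=', ''))


# Hypothetical excerpt of a ranking sorted by rank; tied entries share a rank.
sample = [
    {'title': 'A', 'rank_display': '198'},
    {'title': 'B', 'rank_display': '=199'},
    {'title': 'C', 'rank_display': '=199'},
    {'title': 'D', 'rank_display': '200'},
    {'title': 'E', 'rank_display': '201'},
]

kept = []
for uni in sample:
    if rank_to_int(uni['rank_display']) > 200:
        break  # the list is sorted, so every later entry is past rank 200 too
    kept.append(uni['title'])

print(kept)
```

Stopping at the first rank above 200 relies on the list being sorted by rank, which is how the entries appear in the downloaded data.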

In [5]:
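def find_div_and_number(soup, class_name):
    """Return the number shown in the div with the given class, or NaN if the div is missing."""
    div = soup.find('div', class_=class_name)
    if div is not None:
        return div.find('div', class_='number').text.replace('\n', '').replace(',', '')
    # np.NaN was removed in NumPy 2.0; np.nan works in every version
    return np.nan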
        

def parse_uni_details(url):
    # Use the url argument (a relative path such as uni['url']) instead of the global loop variable
    page_body = requests.get('https://topuniversities.com' + url).text

    soup = BeautifulSoup(page_body, "html.parser")
    
    faculty_total = find_div_and_number(soup, 'total faculty')
    faculty_intl  = find_div_and_number(soup, 'inter faculty')
    student_total  = find_div_and_number(soup, 'total student')
    student_intl  = find_div_and_number(soup, 'total inter')
    
    return faculty_total, faculty_intl, student_total, student_intl
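
`find` returns `None` when nothing matches, while `find_all` returns an empty list, which is why `find_div_and_number` checks for `None` before reading the number. A small offline sketch on a hand-written snippet follows; the markup is a simplified assumption about the details page, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the 'inter faculty' block is deliberately missing.
html = """
<div class="total faculty"><div class="number">
2,982 </div></div>
<div class="total student"><div class="number">
11,067 </div></div>
"""
soup = BeautifulSoup(html, 'html.parser')

present = soup.find('div', class_='total faculty')      # a Tag
missing = soup.find('div', class_='inter faculty')      # None, nothing matches
all_numbers = soup.find_all('div', class_='number')     # list with two Tags
no_match = soup.find_all('div', class_='inter faculty')  # empty list

# Clean the text the same way as the helper, then convert it to an integer.
faculty_total = int(present.find('div', class_='number').text.strip().replace(',', ''))
print(faculty_total, missing, len(all_numbers), len(no_match))
```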

In [6]:
#ex1.1 
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508153509501')
json = r.json()

entries = []

for uni in json['data']:
    
    if int(uni['rank_display'].replace('=','')) > 200:
        break
        
    faculty_total, faculty_intl, student_total, student_intl = parse_uni_details(uni['url'])
        
    entries.append({'name': uni['title'], 'rank': uni['rank_display'].replace('=', ''), 'country': uni['country'],
                    'region': uni['region'], 'faculty_intl': faculty_intl, 'faculty_total': faculty_total,
                    'student_intl': student_intl, 'student_total': student_total})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
topuniversities = pd.DataFrame(entries)

# Rearrange columns order
topuniversities = topuniversities[['name', 'rank', 'country', 'region', 'faculty_intl', 'faculty_total', 'student_intl', 'student_total']]

# Change rank, faculty and student cols to numeric value
topuniversities[['rank','faculty_intl', 'faculty_total', 'student_intl', 'student_total']]\
    = topuniversities[['rank','faculty_intl', 'faculty_total', 'student_intl', 'student_total']].apply(pd.to_numeric)
    
# Serialization as a pickle file
topuniversities.to_pickle('pickled/topuniversities.pkl')
    
topuniversities.head()

Unnamed: 0,name,rank,country,region,faculty_intl,faculty_total,student_intl,student_total
0,Massachusetts Institute of Technology (MIT),1,United States,North America,1679.0,2982.0,3717.0,11067.0
1,Stanford University,2,United States,North America,2042.0,4285.0,3611.0,15878.0
2,Harvard University,3,United States,North America,1311.0,4350.0,5266.0,22429.0
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953.0,647.0,2255.0
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490.0,6699.0,18770.0


In [2]:
topuniversities = pd.read_pickle('pickled/topuniversities.pkl')
topuniversities.head()

Unnamed: 0,name,rank,country,region,faculty_intl,faculty_total,student_intl,student_total
0,Massachusetts Institute of Technology (MIT),1,United States,North America,1679.0,2982.0,3717.0,11067.0
1,Stanford University,2,United States,North America,2042.0,4285.0,3611.0,15878.0
2,Harvard University,3,United States,North America,1311.0,4350.0,5266.0,22429.0
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953.0,647.0,2255.0
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490.0,6699.0,18770.0


Only NYU and IISc Bangalore have NaN values:

In [9]:
topuniversities[topuniversities.isnull().any(axis=1)]

Unnamed: 0,name,rank,country,region,faculty_intl,faculty_total,student_intl,student_total
51,New York University (NYU),52,United States,North America,,,,
189,Indian Institute of Science (IISc) Bangalore,190,India,Asia,,423.0,47.0,4071.0
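
Because the missing counts stay `NaN` after `pd.to_numeric`, any ratio computed from them is `NaN` as well, and `sort_values` places such rows last by default. A short sketch using three rows from the tables above: NYU drops out of both ratios, while IISc Bangalore still gets a faculty/student ratio because only its international faculty count is missing.

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame({
    'name': ['Massachusetts Institute of Technology (MIT)',
             'New York University (NYU)',
             'Indian Institute of Science (IISc) Bangalore'],
    'faculty_intl': [1679.0, np.nan, np.nan],
    'faculty_total': [2982.0, np.nan, 423.0],
    'student_total': [11067.0, np.nan, 4071.0],
})

# Division propagates NaN: NYU gets NaN, the other two get a valid ratio.
rows['ratio'] = rows['faculty_total'] / rows['student_total']

# NaN ratios are sorted to the end, so they never appear in a top-10 slice.
ranked = rows.sort_values('ratio', ascending=False)

# Rows with at least one missing value.
incomplete = rows[rows.isnull().any(axis=1)]
print(ranked[['name', 'ratio']])
```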


- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?

In [7]:
#1.2 a
# Select columns with a list and copy, so adding the ratio column does not touch topuniversities
df_1a = topuniversities[['name', 'rank', 'country', 'region', 'faculty_total', 'student_total']].copy()
df_1a['ratio faculty/students'] = df_1a.faculty_total / df_1a.student_total
df_1a = df_1a.sort_values('ratio faculty/students', ascending=False)
df_1a.head(10)

Unnamed: 0,name,rank,country,region,faculty_total,student_total,ratio faculty/students
3,California Institute of Technology (Caltech),4,United States,North America,953.0,2255.0,0.422616
15,Yale University,16,United States,North America,4940.0,12402.0,0.398323
5,University of Oxford,6,United Kingdom,Europe,6750.0,19720.0,0.342292
4,University of Cambridge,5,United Kingdom,Europe,5490.0,18770.0,0.292488
16,Johns Hopkins University,17,United States,North America,4462.0,16146.0,0.276353
1,Stanford University,2,United States,North America,4285.0,15878.0,0.26987
0,Massachusetts Institute of Technology (MIT),1,United States,North America,2982.0,11067.0,0.26945
185,University of Rochester,186,United States,North America,2569.0,9636.0,0.266604
18,University of Pennsylvania,19,United States,North America,5499.0,20639.0,0.266437
17,Columbia University,18,United States,North America,6189.0,25045.0,0.247115
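
The assignment asks for bar charts of these results. A minimal matplotlib sketch for the first five rows of the table above follows; the shortened labels and the chart title are assumptions made for readability, and the `Agg` backend line is only there so the sketch runs without a display (drop it inside a notebook).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Top five faculty/student ratios, copied from the table above (labels shortened).
names = ['Caltech', 'Yale', 'Oxford', 'Cambridge', 'Johns Hopkins']
ratios = [0.422616, 0.398323, 0.342292, 0.292488, 0.276353]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(names, ratios)
ax.set_ylabel('faculty / students')
ax.set_title('QS 2018: highest faculty-to-student ratios')
fig.tight_layout()
```

With the full DataFrame, `df_1a.head(10).plot.bar(x='name', y='ratio faculty/students')` should give an equivalent chart.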


In [8]:
#1.2 b
# Select columns with a list and copy, so adding the ratio column does not touch topuniversities
df_1b = topuniversities[['name', 'rank', 'country', 'region', 'student_intl', 'student_total']].copy()
df_1b['ratio students_intl/students_total'] = df_1b.student_intl / df_1b.student_total
df_1b = df_1b.sort_values('ratio students_intl/students_total', ascending=False)
df_1b.head(10)


Unnamed: 0,name,rank,country,region,student_intl,student_total,ratio students_intl/students_total
34,London School of Economics and Political Scien...,35,United Kingdom,Europe,6748.0,9760.0,0.691393
11,Ecole Polytechnique Fédérale de Lausanne (EPFL),12,Switzerland,Europe,5896.0,10343.0,0.570047
7,Imperial College London,8,United Kingdom,Europe,8746.0,16090.0,0.543567
198,Maastricht University,200,Netherlands,Europe,8234.0,16385.0,0.502533
47,Carnegie Mellon University,47,United States,North America,6385.0,13356.0,0.478062
6,UCL (University College London),7,United Kingdom,Europe,14854.0,31080.0,0.477928
91,University of St Andrews,92,United Kingdom,Europe,4030.0,8800.0,0.457955
41,The University of Melbourne,41,Australia,Oceania,18030.0,42182.0,0.427434
126,Queen Mary University of London,127,United Kingdom,Europe,6806.0,16135.0,0.421816
25,The University of Hong Kong,26,Hong Kong,Asia,8230.0,20214.0,0.407144


- Answer the previous question aggregating the data by (c) country and (d) region.

In [12]:
def getRatioSFBy(source, by):
    # Sum the counts per group in the DataFrame passed in, then take the ratio of the sums
    res = source.groupby(by, as_index=False)[['faculty_total', 'student_total']].sum()
    res['ratio faculty/student'] = res['faculty_total'] / res['student_total']
    return res[[by, 'ratio faculty/student']].sort_values('ratio faculty/student', ascending=False).dropna()

In [13]:
def getRatioInterBy(source, by):
    # Sum the counts per group in the DataFrame passed in, then take the ratio of the sums
    res = source.groupby(by, as_index=False)[['student_intl', 'student_total']].sum()
    res['ratio international student'] = res['student_intl'] / res['student_total']
    return res[[by, 'ratio international student']].sort_values('ratio international student', ascending=False).dropna()
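
These helpers sum the counts per group before dividing, so each group's ratio is weighted by university size. Averaging the per-university ratios instead would weight every university equally and generally gives a different number. A sketch of the difference on three rows from the tables above:

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Massachusetts Institute of Technology (MIT)',
             'Stanford University',
             'University of Cambridge'],
    'country': ['United States', 'United States', 'United Kingdom'],
    'faculty_total': [2982.0, 4285.0, 5490.0],
    'student_total': [11067.0, 15878.0, 18770.0],
})

# Ratio of sums: what getRatioSFBy computes.
sums = sample.groupby('country')[['faculty_total', 'student_total']].sum()
ratio_of_sums = sums['faculty_total'] / sums['student_total']

# Mean of per-university ratios: an alternative that ignores university size.
sample['ratio'] = sample['faculty_total'] / sample['student_total']
mean_of_ratios = sample.groupby('country')['ratio'].mean()

print(pd.DataFrame({'ratio_of_sums': ratio_of_sums, 'mean_of_ratios': mean_of_ratios}))
```

For a group with a single university the two values coincide; they diverge as soon as universities of different sizes share a group.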

In [14]:
#2.2 c ratio between faculty members and students
dataex2cSF = getRatioSFBy(topuniversities,'country')
dataex2cSF

Unnamed: 0,country,ratio faculty/student
23,Russia,0.22191
8,Denmark,0.177261
24,Saudi Arabia,0.175828
25,Singapore,0.16153
18,Malaysia,0.153893
17,Japan,0.152479
27,South Korea,0.141721
30,Switzerland,0.140434
32,United Kingdom,0.136962
15,Israel,0.136047


In [15]:
#2.2 c ratio of international students
dataex2cInter = getRatioInterBy(topuniversities,'country')
dataex2cInter

Unnamed: 0,country,ratio international student
1,Australia,0.352189
32,United Kingdom,0.341705
12,Hong Kong,0.310751
2,Austria,0.30998
30,Switzerland,0.302396
25,Singapore,0.276537
5,Canada,0.260161
21,New Zealand,0.258215
14,Ireland,0.235299
20,Netherlands,0.23298


In [16]:
#2.2 d ratio between faculty members and students
dataex2dSF = getRatioSFBy(topuniversities,'region')
dataex2dSF

Unnamed: 0,region,ratio faculty/student
1,Asia,0.13226
4,North America,0.117776
2,Europe,0.111564
3,Latin America,0.108657
0,Africa,0.08845
5,Oceania,0.072385


In [17]:
#2.2 d ratio of international students
dataex2dInter = getRatioInterBy(topuniversities,'region')
dataex2dInter

Unnamed: 0,region,ratio international student
5,Oceania,0.339261
2,Europe,0.229589
4,North America,0.188906
0,Africa,0.169703
1,Asia,0.136431
3,Latin America,0.08752


In [6]:
#ex2.2
r = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
json = r.json()
json['data'][0]

{'aliases': 'University of Oxford',
 'location': 'United Kingdom',
 'member_level': '0',
 'name': 'University of Oxford',
 'nid': 468,
 'rank': '1',
 'rank_order': '10',
 'record_type': 'master_account',
 'scores_citations': '99.1',
 'scores_citations_rank': '15',
 'scores_industry_income': '63.7',
 'scores_industry_income_rank': '169',
 'scores_international_outlook': '95.0',
 'scores_international_outlook_rank': '24',
 'scores_overall': '94.3',
 'scores_overall_rank': '10',
 'scores_research': '99.5',
 'scores_research_rank': '1',
 'scores_teaching': '86.7',
 'scores_teaching_rank': '5',
 'stats_female_male_ratio': '46 : 54',
 'stats_number_students': '20,409',
 'stats_pc_intl_students': '38%',
 'stats_student_staff_ratio': '11.2',
 'subjects_offered': 'Archaeology,Art, Performing Arts & Design,Biological Sciences,Business & Management,Chemical Engineering,Chemistry,Civil Engineering,Computer Science,Economics & Econometrics,Electrical & Electronic Engineering,General Engineering,Geo

In [31]:
entries = []

for uni in json['data']:
    
    if '201' in uni['rank']:
        break
                
    entries.append({'name': uni['name'], 'rank': uni['rank'].replace('=', ''), 'country': uni['location'],
                    'stats_student_staff_ratio': uni['stats_student_staff_ratio'],
                    'stats_pc_intl_student': uni['stats_pc_intl_students'].replace('%', ''),
                    'student_total': uni['stats_number_students'].replace(',', ''),
                    'scores_international_outlook': uni['scores_international_outlook']})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
the = pd.DataFrame(entries)

# Rearrange columns order
the = the[['name', 'rank', 'country', 'stats_student_staff_ratio', 'stats_pc_intl_student', 'student_total', 'scores_international_outlook']]

# Change rank, faculty and student cols to numeric value
the[['rank','stats_student_staff_ratio', 'stats_pc_intl_student', 'student_total', 'scores_international_outlook']]\
    = the[['rank','stats_student_staff_ratio', 'stats_pc_intl_student', 'student_total', 'scores_international_outlook']].apply(pd.to_numeric)
    
# Serialization as a pickle file
the.to_pickle('pickled/the.pkl')
    
the.head()

Unnamed: 0,name,rank,country,stats_student_staff_ratio,stats_pc_intl_student,student_total,scores_international_outlook
0,University of Oxford,1,United Kingdom,11.2,38,20409,95.0
1,University of Cambridge,2,United Kingdom,10.9,35,18389,93.0
2,California Institute of Technology,3,United States,6.5,27,2209,59.7
3,Stanford University,3,United States,7.5,22,15845,77.6
4,Massachusetts Institute of Technology,5,United States,8.7,34,11177,87.6
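
The THE data provides a student/staff ratio and a percentage of international students rather than head counts. To repeat the QS analysis, the faculty/student ratio can be taken as the inverse of `stats_student_staff_ratio`, and approximate counts can be derived from the totals. A sketch on the first two rows above follows; the derived column names are hypothetical, the results are estimates because the published ratio and percentage are rounded, and "staff" on THE may not be defined exactly like "faculty" on QS.

```python
import pandas as pd

the_sample = pd.DataFrame({
    'name': ['University of Oxford', 'University of Cambridge'],
    'stats_student_staff_ratio': [11.2, 10.9],
    'stats_pc_intl_student': [38, 35],
    'student_total': [20409, 18389],
})

# Faculty per student is the inverse of students per staff member.
the_sample['ratio_faculty_students'] = 1 / the_sample['stats_student_staff_ratio']

# Estimated head counts (hypothetical column names, approximate values).
the_sample['faculty_total_est'] = the_sample['student_total'] / the_sample['stats_student_staff_ratio']
the_sample['student_intl_est'] = the_sample['student_total'] * the_sample['stats_pc_intl_student'] / 100

print(the_sample)
```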


In [32]:
the = pd.read_pickle('pickled/the.pkl')
the.head()

Unnamed: 0,name,rank,country,stats_student_staff_ratio,stats_pc_intl_student,student_total,scores_international_outlook
0,University of Oxford,1,United Kingdom,11.2,38,20409,95.0
1,University of Cambridge,2,United Kingdom,10.9,35,18389,93.0
2,California Institute of Technology,3,United States,6.5,27,2209,59.7
3,Stanford University,3,United States,7.5,22,15845,77.6
4,Massachusetts Institute of Technology,5,United States,8.7,34,11177,87.6


No entry has NaN values:

In [33]:
the.isnull().any(axis=1).sum()

0
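
For the merge in question 3, the two sites spell some names differently, for example `Massachusetts Institute of Technology (MIT)` in the QS table versus `Massachusetts Institute of Technology` in the THE table. One possible strategy, sketched here as an assumption rather than as the approach taken in this notebook, is to merge on a normalized key. The helper `normalize_name` and the sample frames are hypothetical; the names and ranks are copied from the tables above.

```python
import re
import unicodedata

import pandas as pd


def normalize_name(name):
    """Build a comparison key: drop parenthesized abbreviations, accents, case and a leading 'the'."""
    name = re.sub(r'\s*\([^)]*\)', '', name)  # "(MIT)", "(Caltech)", ...
    name = unicodedata.normalize('NFKD', name)
    name = ''.join(ch for ch in name if not unicodedata.combining(ch))
    name = name.lower().strip()
    return re.sub(r'^the\s+', '', name)


qs_sample = pd.DataFrame({
    'name': ['Massachusetts Institute of Technology (MIT)',
             'California Institute of Technology (Caltech)',
             'Stanford University'],
    'rank': [1, 4, 2],
})
the_sample = pd.DataFrame({
    'name': ['Massachusetts Institute of Technology',
             'California Institute of Technology',
             'Stanford University'],
    'rank': [5, 3, 3],
})

qs_sample['key'] = qs_sample['name'].map(normalize_name)
the_sample['key'] = the_sample['name'].map(normalize_name)

# The suffixes keep both original names and both ranks; indicator flags unmatched rows.
merged = qs_sample.merge(the_sample, on='key', how='outer',
                         suffixes=('_qs', '_the'), indicator=True)
print(merged[['key', 'rank_qs', 'rank_the', '_merge']])
```

Rows whose `_merge` value is not `both` are the ones that still need manual matching or a fuzzier comparison.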