# 02 - Data from the Web

## Deadline
Wednesday October 25, 2017 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated (you don't want your colleagues to generate unnecessary Web traffic during the peer review)
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code.

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to manually download the entire ranking -- instead, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to quickly understand how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find it on the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
    - Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
    - Answer the previous question aggregating the data by (c) country and (d) region.
    
 Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking into consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of a tie, use the order defined on the webpage.
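
Tied ranks show up in the scraped data as strings such as `=47` or `=3`. One possible way to follow the tie hint is sketched below: strip the leading `=` to get a numeric rank, then rely on a stable sort so that tied entries keep the order in which they appeared on the page. The helper name `parse_rank` and the sample rows are illustrative and not part of the assignment.

```python
def parse_rank(rank_display):
    """Convert a rank string such as '5' or '=47' to an int (hypothetical helper)."""
    return int(str(rank_display).lstrip("="))

# Illustrative rows, listed in the order they would appear on the page.
rows = [
    {"name": "A", "rank_display": "=3"},
    {"name": "B", "rank_display": "=3"},
    {"name": "C", "rank_display": "1"},
]

# sorted() is stable, so A stays ahead of B even though both are ranked 3.
ordered = sorted(rows, key=lambda row: parse_rank(row["rank_display"]))
ordered_names = [row["name"] for row in ordered]
```

Keeping the original string next to the numeric rank preserves the information that a position was shared.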

##  Textual description of your thought process, the assumptions you made, and the solution you plan to implement!

### Thought process:

www.topuniversities.com
1. GET https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508228964836
    a. The response is a JSON file.
2. Extract the name, rank, country and region of each university from the JSON.
3. GET the details page of each university, e.g. https://www.topuniversities.com/universities/ecole-polytechnique-f%C3%A9d%C3%A9rale-de-lausanne-epfl
    a. The response is an HTML page.
4. Parse the number of faculty members (international and total) and the number of students (international and total) from the HTML.

### Analysis:
1. Faculty : student ratio
2. International student : total student ratio
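
As a small worked example of these two ratios, the sketch below applies them to two rows copied from the QS table shown later in this notebook. The frame name `sample` is illustrative; the column names follow the ones used in the implementation.

```python
import pandas as pd

# Two rows copied from the QS results table in this notebook.
sample = pd.DataFrame({
    "Name": ["Massachusetts Institute of Technology (MIT)", "Stanford University"],
    "No. of Total Faculty": [2982.0, 4285.0],
    "No. of International Students": [3717.0, 3611.0],
    "No. of Total Students": [11067.0, 15878.0],
})

# Ratio 1: faculty members per student.
sample["Faculty : Student"] = (
    sample["No. of Total Faculty"] / sample["No. of Total Students"]
)
# Ratio 2: share of international students among all students.
sample["International Student : Total Student"] = (
    sample["No. of International Students"] / sample["No. of Total Students"]
)
```

A higher value is better for both ratios: more faculty per student, and a larger share of international students.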



### Our assumptions:


### Implementation solution:

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
# Fetch the JSON file that backs the QS ranking table
req1 = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508228964836')
body = req1.json()
# The list of universities is stored under the 'data' key
body_data = body['data']

# Question 1:

In [37]:
rank_list = []

# Helper that reads one number from a university's details page.
# Returns NaN and reports the school when the field cannot be parsed.
def get_numbers(soup, class_name):
    try:
        # Drop the first and last character around the number and the thousands separators
        num = float(soup.find(class_=class_name).find(class_='number').string[1:-1].replace(",", ""))
    except (AttributeError, TypeError, ValueError):
        num = float("NaN")
        school_name = soup.find(class_ = "qs-profile-2 content panel-panel").find_next('h1').string
        print("Error in " + school_name + ": Can't find " + class_name)
    return num

dict_fields = {'Name':    'title',
               'Rank':    'rank_display',
               'Country': 'country',
               'Region':  'region'}
dict_class_names = {'No. of International Faculty':  'inter faculty',
                    'No. of Total Faculty':          'total faculty',
                    'No. of International Students': 'total inter',
                    'No. of Total Students':         'total student',}

for i in range(200):
    details = requests.get('https://www.topuniversities.com' + body_data[i]['url'])
    html_body = details.text
    soup = BeautifulSoup(html_body, 'html.parser')
    
    dict_school = {}
    for field, field_name in dict_fields.items():
        dict_school[field] = body_data[i][field_name]
    for field, class_name in dict_class_names.items():
        dict_school[field] = get_numbers(soup, class_name)
    
    rank_list.append(dict_school)
    
top_ranking_universities = pd.DataFrame.from_dict(rank_list)
top_ranking_universities.head()

Error in inter faculty New York University (NYU)
Error in total faculty New York University (NYU)
Error in total inter New York University (NYU)
Error in total student New York University (NYU)
Error in inter faculty Indian Institute of Science (IISc) Bangalore


Unnamed: 0,Country,Name,No. of International Faculty,No. of International Students,No. of Total Faculty,No. of Total Students,Rank,Region
0,United States,Massachusetts Institute of Technology (MIT),1679.0,3717.0,2982.0,11067.0,1,North America
1,United States,Stanford University,2042.0,3611.0,4285.0,15878.0,2,North America
2,United States,Harvard University,1311.0,5266.0,4350.0,22429.0,3,North America
3,United States,California Institute of Technology (Caltech),350.0,647.0,953.0,2255.0,4,North America
4,United Kingdom,University of Cambridge,2278.0,6699.0,5490.0,18770.0,5,Europe


In [34]:
top_ranking_universities = top_ranking_universities[['Name', 'Rank', 'Country', 'Region', 'No. of International Faculty', 'No. of Total Faculty', 'No. of International Students', 'No. of Total Students']]

# Question 2:

In [35]:
top_ranking_universities['Faculty : Student'] = top_ranking_universities['No. of Total Faculty']/top_ranking_universities['No. of Total Students']
top_ranking_universities['International Student : Total Student'] = top_ranking_universities['No. of International Students']/top_ranking_universities['No. of Total Students']


In [36]:
top_ranking_universities

Unnamed: 0,Name,Rank,Country,Region,No. of International Faculty,No. of Total Faculty,No. of International Students,No. of Total Students,Faculty : Student,International Student : Total Student
0,Massachusetts Institute of Technology (MIT),1,United States,North America,1679.0,2982.0,3717.0,11067.0,0.26945,0.335863
1,Stanford University,2,United States,North America,2042.0,4285.0,3611.0,15878.0,0.26987,0.227422
2,Harvard University,3,United States,North America,1311.0,4350.0,5266.0,22429.0,0.193945,0.234785
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953.0,647.0,2255.0,0.422616,0.286918
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490.0,6699.0,18770.0,0.292488,0.356899
5,University of Oxford,6,United Kingdom,Europe,2964.0,6750.0,7353.0,19720.0,0.342292,0.37287
6,UCL (University College London),7,United Kingdom,Europe,2554.0,6345.0,14854.0,31080.0,0.204151,0.477928
7,Imperial College London,8,United Kingdom,Europe,2071.0,3930.0,8746.0,16090.0,0.244251,0.543567
8,University of Chicago,9,United States,North America,635.0,2449.0,3379.0,13557.0,0.180645,0.249244
9,ETH Zurich - Swiss Federal Institute of Techno...,10,Switzerland,Europe,1886.0,2477.0,7563.0,19815.0,0.125006,0.381681


In [37]:
top_uni_by_facstud = top_ranking_universities.sort_values(by = 'Faculty : Student', axis = 0, ascending = False)
top_uni_by_interstud = top_ranking_universities.sort_values(by = 'International Student : Total Student', axis = 0, ascending = False)

# Question 2a:

In [38]:
top_uni_by_facstud.head()

Unnamed: 0,Name,Rank,Country,Region,No. of International Faculty,No. of Total Faculty,No. of International Students,No. of Total Students,Faculty : Student,International Student : Total Student
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953.0,647.0,2255.0,0.422616,0.286918
15,Yale University,16,United States,North America,1708.0,4940.0,2469.0,12402.0,0.398323,0.199081
5,University of Oxford,6,United Kingdom,Europe,2964.0,6750.0,7353.0,19720.0,0.342292,0.37287
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490.0,6699.0,18770.0,0.292488,0.356899
16,Johns Hopkins University,17,United States,North America,1061.0,4462.0,4105.0,16146.0,0.276353,0.254243


# Question 2b:

In [39]:
top_uni_by_interstud.head()

Unnamed: 0,Name,Rank,Country,Region,No. of International Faculty,No. of Total Faculty,No. of International Students,No. of Total Students,Faculty : Student,International Student : Total Student
34,London School of Economics and Political Scien...,35,United Kingdom,Europe,687.0,1088.0,6748.0,9760.0,0.111475,0.691393
11,Ecole Polytechnique Fédérale de Lausanne (EPFL),12,Switzerland,Europe,1300.0,1695.0,5896.0,10343.0,0.163879,0.570047
7,Imperial College London,8,United Kingdom,Europe,2071.0,3930.0,8746.0,16090.0,0.244251,0.543567
198,Maastricht University,200,Netherlands,Europe,502.0,1277.0,8234.0,16385.0,0.077937,0.502533
46,Carnegie Mellon University,=47,United States,North America,425.0,1342.0,6385.0,13356.0,0.100479,0.478062


Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
Answer the previous question aggregating the data by (c) country and (d) region.

In [44]:
country_facstud = top_ranking_universities.groupby('Country').apply(lambda x: x.sort_values(by = 'Faculty : Student', axis = 0, ascending = False))
country_interstud = top_ranking_universities.groupby('Country').apply(lambda x: x.sort_values(by = 'International Student : Total Student', axis = 0, ascending = False))

In [45]:
region_facstud = top_ranking_universities.groupby('Region').apply(lambda x: x.sort_values(by = 'Faculty : Student', axis = 0, ascending = False))
region_interstud = top_ranking_universities.groupby('Region').apply(lambda x: x.sort_values(by = 'International Student : Total Student', axis = 0, ascending = False))
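
The cells above sort the universities within each country or region but do not reduce each group to a single value. One possible reading of (c) and (d) is to sum the counts per group and recompute both ratios from the totals. The sketch below illustrates that reading on three rows copied from the QS table; `group_ratios` is a hypothetical helper, and averaging the per-university ratios would be an equally defensible alternative.

```python
import pandas as pd

def group_ratios(frame, key):
    """Sum counts per group, then recompute both ratios (hypothetical helper)."""
    totals = frame.groupby(key)[[
        "No. of Total Faculty",
        "No. of International Students",
        "No. of Total Students",
    ]].sum()
    totals["Faculty : Student"] = (
        totals["No. of Total Faculty"] / totals["No. of Total Students"]
    )
    totals["International Student : Total Student"] = (
        totals["No. of International Students"] / totals["No. of Total Students"]
    )
    return totals.sort_values(by="Faculty : Student", ascending=False)

# Three rows copied from the QS results table in this notebook.
sample = pd.DataFrame({
    "Country": ["United States", "United States", "United Kingdom"],
    "No. of Total Faculty": [2982.0, 4285.0, 5490.0],
    "No. of International Students": [3717.0, 3611.0, 6699.0],
    "No. of Total Students": [11067.0, 15878.0, 18770.0],
})
by_country = group_ratios(sample, "Country")
```

The same helper can be called with `"Region"` as the key. Note that `sum()` skips NaN values, so universities with missing counts (such as the ones reported in the error output) would need to be dropped or handled before aggregating.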

# Times Higher Education

1. GET https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json

Fields to extract for each university:
- rank
- name
- aliases
- location
- stats_number_students
- stats_student_staff_ratio
- stats_pc_intl_students
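
Unlike the QS data, these fields give a student-to-staff ratio and a percentage of international students rather than raw counts. To repeat the QS analysis on comparable columns, the ratio can be inverted and the counts estimated, as in the sketch below. The values are taken from the first two rows of the Times table as retrieved in this notebook; the `(est.)` column names are illustrative, and the estimated counts are approximations because the published ratio and percentage are rounded.

```python
import pandas as pd

# First two rows of the Times ranking as retrieved in this notebook.
times_sample = pd.DataFrame({
    "Name": ["University of Oxford", "University of Cambridge"],
    "No. of Total Students": [20409.0, 18389.0],
    "Student : Faculty": [11.2, 10.9],
    "Ratio International Students": [0.38, 0.35],
})

# Invert the published ratio to match the QS "Faculty : Student" column.
times_sample["Faculty : Student"] = 1.0 / times_sample["Student : Faculty"]

# Estimated counts (approximate, derived from rounded inputs).
times_sample["No. of Total Faculty (est.)"] = (
    times_sample["No. of Total Students"] / times_sample["Student : Faculty"]
)
times_sample["No. of International Students (est.)"] = (
    times_sample["No. of Total Students"] * times_sample["Ratio International Students"]
)
```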

In [50]:
req_times = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
body_times = req_times.json()
body_times_data = body_times['data']

In [61]:
rank_list_times = []

for i in range(200):
    entry = body_times_data[i]
    # Numeric fields are strings: strip thousands separators and the percent sign
    num_students = float(entry['stats_number_students'].replace(",", ""))
    student_staff = float(entry['stats_student_staff_ratio'].replace(",", ""))
    perc_international = float(entry['stats_pc_intl_students'].strip("%").replace(",", "")) / 100.0

    rank_list_times.append({'Name': entry['name'],
                            'Aliases': entry['aliases'],
                            'Rank': entry['rank'],
                            'Location': entry['location'],
                            'No. of Total Students': num_students,
                            'Student : Faculty': student_staff,
                            'Ratio International Students': perc_international})
    
times_universities = pd.DataFrame.from_dict(rank_list_times)
times_universities.head()

Unnamed: 0,Aliases,Location,Name,No. of Total Students,Rank,Ratio International Students,Student : Faculty
0,University of Oxford,United Kingdom,University of Oxford,20409.0,1,0.38,11.2
1,University of Cambridge,United Kingdom,University of Cambridge,18389.0,2,0.35,10.9
2,California Institute of Technology caltech,United States,California Institute of Technology,2209.0,=3,0.27,6.5
3,Stanford University,United States,Stanford University,15845.0,=3,0.22,7.5
4,Massachusetts Institute of Technology,United States,Massachusetts Institute of Technology,11177.0,5,0.34,8.7
