# Homework 2
### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import json

### Question 1

Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some of this information is not available in the main list and has to be retrieved from the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.


In [None]:
URL1 = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
URL2 = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'

In [None]:
# Do the request
r = requests.get(URL1)

Inspecting the DOM of the website, we find that the ranking data is stored in a text file at https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt, as we can see here:

In [None]:
# 357051.txt
r.text[28109:28208]

In [None]:
# 357051.txt, located with find() instead of a fixed offset
id1 = r.text.find("357051.txt")
r.text[id1-88:id1+11]

In [None]:
# 357051_indicators.txt
id2 = r.text.find("357051_indicators.txt")
r.text[id2-99:id2+22]
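
The slices above rely on character offsets in the page source, which shift whenever the page changes. As a minimal sketch of a more robust lookup, a regular expression can pull out every `qs-rankings-data` file path. The `sample_html` fragment and the `find_data_files` helper are hypothetical stand-ins for `r.text` and are not part of the original notebook.

```python
import re

def find_data_files(html):
    """Return every qs-rankings-data .txt path mentioned in the page source."""
    pattern = r"[\w:/.\-]*qs-rankings-data/\w+\.txt"
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(re.findall(pattern, html)))

# Hypothetical fragment standing in for r.text
sample_html = (
    '<script>var rank_url = "/sites/default/files/qs-rankings-data/357051.txt";'
    ' var ind_url = "/sites/default/files/qs-rankings-data/357051_indicators.txt";'
    '</script>'
)

data_files = find_data_files(sample_html)
print(data_files)
```

On the real page, `find_data_files(r.text)` would be the equivalent call, assuming the file paths still follow this naming pattern.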

We therefore request this URL directly to get the data.

In [None]:
data_QS_URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt'

In [None]:
data_QS = requests.get(data_QS_URL)

In [None]:
data_QS = data_QS.json()['data']

In [None]:
type(data_QS)

In [None]:
df_QS = pd.DataFrame(data_QS)

In [None]:
df_QS

In [None]:
df_ranking = pd.DataFrame({"University" : df_QS.title, "Rank" : df_QS.rank_display, "Score" : df_QS.score, "Country": df_QS.country, "Region": df_QS.region, "URL" : df_QS.url},columns = ['Rank', 'University', 'Score', 'Country', 'Region', 'URL'])

In [None]:
df_ranking

In [None]:
# find the index of the university ranked 200th
df_ranking[df_ranking['Rank'] == "200"].index.tolist()

In [None]:
# keep the top 200 universities (rows 0 to 199 inclusive)
df_ranking = df_ranking[:200].copy()

In [None]:
df_ranking

In [None]:
# fetch one figure (e.g. the total faculty count) from a university's detail page
def get_number_of(kind, url):
    # get the data from the university page
    r = requests.get('https://www.topuniversities.com' + url)
    soup = BeautifulSoup(r.text, "html.parser")
    try:
        # find the corresponding tag
        res_string = soup.find('div', class_=kind).find('div', class_="number")
        # convert the result into an integer
        res = int(str(res_string.string).replace('\n', "").replace(",", ""))
    except (AttributeError, ValueError):
        # tag missing (AttributeError) or text not numeric (ValueError)
        res = np.nan
    return res
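
The string-to-integer step of `get_number_of` can be checked without network access. The sketch below isolates it in a hypothetical helper, `parse_count`, which is not part of the original notebook; the sample strings are made up to resemble the scraped text.

```python
import math

def parse_count(raw):
    """Turn a scraped count string into an int, or NaN when it is unusable."""
    if raw is None:
        return math.nan
    # drop surrounding whitespace/newlines and the thousands separator
    cleaned = raw.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return math.nan

print(parse_count("\n11,067\n"))
print(parse_count("N/A"))
```

Returning NaN rather than raising keeps the later `map` calls running when a university page lacks one of the figures.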

In [None]:
df_ranking['Total Faculties'] = df_ranking['URL'].map(lambda x : get_number_of("total faculty", x))
df_ranking['International Faculties'] = df_ranking['URL'].map(lambda x : get_number_of("inter faculty", x))
df_ranking['Total Students'] = df_ranking['URL'].map(lambda x : get_number_of("total student", x))
df_ranking['International Students'] = df_ranking['URL'].map(lambda x : get_number_of("total inter", x))

In [None]:
df_ranking = df_ranking.drop('URL', axis=1)
df_ranking

### a) Which are the best universities in terms of ratio between faculty members and students?

In [None]:
df_ranking['% Faculties/Students'] = df_ranking['Total Faculties']/df_ranking['Total Students']

In [None]:
df_ranking.sort_values('% Faculties/Students', ascending=False)

### b) Which are the best universities in terms of ratio of international students?

In [None]:
df_ranking['% International Students'] = df_ranking['International Students']/df_ranking['Total Students']

In [None]:
df_ranking.sort_values('% International Students', ascending=False)

### c) Aggregating the data by country

In [None]:
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

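def bars(t, data, xlab, ylab):
    # bar height is the mean of ylab within each xlab group (seaborn's default estimator)
    sns.set_style('darkgrid')
    fig, ax = plt.subplots(figsize = (15,8))
    ax.set_title(t, fontsize=15, fontweight='bold')
    # draw explicitly on the axes created above
    sns.barplot(x=xlab, y=ylab, data=data, saturation=0.7, errcolor='.7', ax=ax)
    plt.xticks(rotation=90)
    plt.show()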

In [None]:
bars("Ratio Faculty Members and Students per Country", df_ranking, 'Country', '% Faculties/Students')

In [None]:
bars("International Students per Country", df_ranking, 'Country', '% International Students')

### d) Aggregating the data by region

In [None]:
bars("Ratio Faculty Members and Students per Region", df_ranking, 'Region', '% Faculties/Students')

In [None]:
bars("International Students per Region", df_ranking, 'Region', '% International Students')
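
The bar charts aggregate implicitly: seaborn's `barplot` shows the mean of the per-university values in each group. As a sketch of the same aggregation done explicitly, and of how it differs from dividing the group totals, consider the toy data below. The values are made up; only the column names follow `df_ranking`.

```python
import pandas as pd

# Toy data with the same column names as df_ranking; the values are made up
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Total Faculties": [100, 900, 50],
    "Total Students": [1000, 3000, 1000],
})
toy["% Faculties/Students"] = toy["Total Faculties"] / toy["Total Students"]

# (1) Mean of the per-university ratios: what the bar height shows
mean_of_ratios = toy.groupby("Country")["% Faculties/Students"].mean()

# (2) Ratio of the country totals: weights each university by its size
totals = toy.groupby("Country")[["Total Faculties", "Total Students"]].sum()
ratio_of_totals = totals["Total Faculties"] / totals["Total Students"]

print(mean_of_ratios.sort_values(ascending=False))
print(ratio_of_totals.sort_values(ascending=False))
```

For country A the two definitions give 0.2 and 0.25 respectively, so which one is meant by "ratio per country" should be stated alongside the plot.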

### Question 2

Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

Using Postman, we captured the API request behind the ranking page; it returns a JSON file containing all the data.

In [None]:
times_r=requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
data=times_r.text


We decode the JSON file and keep only the first 200 data items (each item corresponds to a university, and the items are sorted by rank). Furthermore, we extract only the useful columns (name, location, rank, total number of students, percentage of international students and student/staff ratio).

In [None]:
df_times = pd.DataFrame(json.loads(data)['data'][:200],columns=['name','location','rank','stats_number_students','stats_pc_intl_students','stats_student_staff_ratio'])
df_times['stats_pc_intl_students'] = df_times['stats_pc_intl_students'].astype(str).str.replace('%','').astype(int)
df_times['stats_number_students'] = df_times['stats_number_students'].astype(str).str.replace(',','').astype(float)
df_times['stats_student_staff_ratio'] = 1/(df_times['stats_student_staff_ratio'].astype(float))
df_times.columns = ['University','Country','Rank','Total Students','% International Students','Ratio Faculties/Students']
df_times


The second website does not provide region information, so we reuse the country-to-region mapping collected from the first one.

In [None]:
conv_to_cont=df_ranking[['Country','Region']].drop_duplicates().set_index('Country')
conv_to_cont

In [None]:
df_times_final = pd.merge(df_times, conv_to_cont, how='left', left_on="Country", right_index=True)
# international students = total students * percentage / 100
df_times_final['International Students'] = (df_times_final["Total Students"] * 0.01 * df_times_final["% International Students"]).round()
# the ratio is faculty per student, so the faculty count is students * ratio
df_times_final['Total Faculties'] = (df_times_final["Total Students"] * df_times_final["Ratio Faculties/Students"]).round()
df_times_final
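
Because `Ratio Faculties/Students` is stored as the inverse of the published student/staff ratio, head counts follow from the total number of students by multiplication. The sketch below works through the arithmetic for a single made-up university; `derive_counts` is a hypothetical helper, not part of the notebook.

```python
def derive_counts(total_students, pct_international, student_staff_ratio):
    """Derive head counts from the three figures published per university."""
    # students per staff member -> staff members per student
    faculty_per_student = 1 / student_staff_ratio
    international_students = round(total_students * pct_international / 100)
    total_faculty = round(total_students * faculty_per_student)
    return international_students, total_faculty

# Made-up figures: 20,000 students, 25% international, 12.5 students per staff member
intl, faculty = derive_counts(20000, 25, 12.5)
print(intl, faculty)
```

With these inputs the result is 5000 international students and 1600 faculty members.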
                                       

In [None]:
bars("Ratio Faculty Members and Students per Country", df_times_final.sort_values('Country'), 'Country', 'Ratio Faculties/Students')

In [None]:
bars("International Students per Country", df_times_final.sort_values('Country'), 'Country', '% International Students')

In [None]:
bars("Ratio Faculty Members and Students per Region", df_times_final.sort_values('Region'), 'Region', 'Ratio Faculties/Students')

In [None]:
bars("International Students per Region", df_times_final.sort_values('Region'), 'Region', '% International Students')

### Question 3


Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.
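
One possible matching strategy, sketched here under assumptions and not taken from the notebook, is to normalize both sets of names (lowercase, no accents, no parenthesised acronyms, no punctuation) and then pair each name with its closest counterpart using `difflib` from the standard library. The helpers `normalize` and `match_names`, the `0.85` cutoff and the sample names are illustrative choices.

```python
import re
import unicodedata
from difflib import get_close_matches

def normalize(name):
    """Lowercase, strip accents, drop parenthesised acronyms and punctuation."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    name = re.sub(r"\(.*?\)", " ", name)             # e.g. "(MIT)"
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())  # punctuation -> space
    return " ".join(name.split())

def match_names(left, right, cutoff=0.85):
    """Map each name in `left` to its closest name in `right`, or None."""
    lookup = {normalize(n): n for n in right}
    result = {}
    for name in left:
        hits = get_close_matches(normalize(name), list(lookup), n=1, cutoff=cutoff)
        result[name] = lookup[hits[0]] if hits else None
    return result

# Illustrative names only; the real lists come from the two DataFrames
qs_names = ["Massachusetts Institute of Technology (MIT)",
            "Ecole Polytechnique Fédérale de Lausanne (EPFL)",
            "University of Nowhere"]
the_names = ["Massachusetts Institute of Technology",
             "École Polytechnique Fédérale de Lausanne"]

matches = match_names(qs_names, the_names)
print(matches)
```

Names mapped to `None` would need manual review, and the original rank columns of both DataFrames should be kept under distinct names when the matched pairs are merged.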

### Question 4
Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?
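
A starting point for this exploration could be a pairwise correlation matrix. The sketch below uses made-up values, and the `% International Faculties` column is an assumed name for a ratio that would first have to be computed from the merged data.

```python
import pandas as pd

# Made-up values; "% International Faculties" is an assumed column name
toy = pd.DataFrame({
    "% International Students": [0.10, 0.20, 0.30, 0.40],
    "% International Faculties": [0.15, 0.25, 0.35, 0.45],
    "Total Students": [30000, 10000, 40000, 20000],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr(method="pearson")

intl_corr = corr.loc["% International Students", "% International Faculties"]
size_corr = corr.loc["% International Students", "Total Students"]
print(corr.round(2))
```

In this toy table the two international ratios are perfectly correlated while the student count is uncorrelated with them; whether the real data behaves similarly has to be checked on the merged DataFrame.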

### Question 5

Can you find the best university taking into consideration both rankings? Explain your approach.
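
One simple approach, sketched here as an assumption and not as the notebook's answer, is to average the two ranks and pick the lowest mean. The `merged` table, its column names and its values are made up, and the helper `rank_to_int` assumes that tied ranks are written with a leading `=`.

```python
import pandas as pd

def rank_to_int(rank):
    """Convert a rank string such as '=2' or '17' into an integer."""
    return int(str(rank).lstrip("="))

# Made-up merged table keeping the original position in both rankings
merged = pd.DataFrame({
    "University": ["U1", "U2", "U3"],
    "Rank QS": ["1", "=2", "=2"],
    "Rank THE": ["3", "1", "=2"],
})

# Average the two numeric ranks; a lower mean is better
merged["Mean Rank"] = (
    merged["Rank QS"].map(rank_to_int) + merged["Rank THE"].map(rank_to_int)
) / 2

best = merged.sort_values("Mean Rank").iloc[0]["University"]
print(best)
```

Averaging ranks treats both rankings as equally trustworthy; a weighted mean or an average of the underlying scores would be alternative choices.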