# Homework 2
### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import json

### Question 1

Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in term of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.


In [None]:
URL1 = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
URL2 = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'

In [None]:
# Do the request
r = requests.get(URL1)

Inspecting the DOM of the website, we find that the ranking data is stored in a text file at https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt, as we can see here:

In [None]:
# 357051.txt
id1 = r.text.find("357051.txt")
r.text[id1-88:id1+11]
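
The slice above relies on fixed character offsets around the file name. As a minimal sketch, the same path could be located with a regular expression instead. The `sample_html` string below is a hypothetical stand-in for `r.text`, and the real page markup may differ.

```python
import re

# Hypothetical HTML fragment standing in for r.text; the real markup may differ.
sample_html = '<script>var url = "/sites/default/files/qs-rankings-data/357051.txt";</script>'

# Look for a path ending in a numeric .txt file under qs-rankings-data.
pattern = re.compile(r"/sites/default/files/qs-rankings-data/\d+\.txt")
match = pattern.search(sample_html)
data_path = match.group(0) if match else None
print(data_path)
```

This avoids hard-coding both the offsets and the file number, which may change between ranking editions.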

We thus request the data from this URL, keeping only the first 200 elements.

In [None]:
data_QS_URL = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt'
data_QS = requests.get(data_QS_URL)
#create the dataframe
df_QS = pd.DataFrame(data_QS.json()['data'][:200])

We now keep the relevant columns.

In [None]:
# select only the useful columns
df_ranking = pd.DataFrame({"University": df_QS.title,
                           "Rank": df_QS.rank_display,
                           "Score": df_QS.score,
                           "Country": df_QS.country,
                           "Region": df_QS.region,
                           "URL": df_QS.url},
                          columns=['Rank', 'University', 'Score', 'Country', 'Region', 'URL'])

In [None]:
df_ranking

We observe that the 200th university has rank 201. This is because rank 198 is missing: the last university tied at rank 195 is effectively the 197th, yet the next rank displayed is 199.

We now need the number of faculty members and students for each university. To get them, we go to each university's page on the topuniversities website and collect this information. Inspecting these pages, we find each count in a `div` of class `number` nested inside a `div` whose class identifies the statistic: `total faculty` for all faculty members, `inter faculty` for international faculty members, `total student` for all students, and `total inter` for international students.

In [None]:
# fetch the university information from its web page
def get_numbers_of(url):
    # get the data from the university page
    r = requests.get('https://www.topuniversities.com' + url)
    soup = BeautifulSoup(r.text, "html.parser")

    def read_number(css_class):
        # the count sits in <div class="number"> inside <div class="css_class">
        try:
            text = soup.find('div', class_=css_class).find('div', class_="number").string
            return int(str(text).replace('\n', "").replace(",", ""))
        except (AttributeError, ValueError):
            # tag missing or content not numeric
            return np.nan

    staff = read_number("total faculty")
    inter_staff = read_number("inter faculty")
    students = read_number("total student")
    inter_students = read_number("total inter")

    return staff, inter_staff, students, inter_students

We store the results in new columns of our dataframe.

In [None]:
df_ranking['Total Faculty Members'], \
df_ranking['International Faculty Members'], \
df_ranking['Total Students'], \
df_ranking['International Students'] = zip(*df_ranking['URL'].map(get_numbers_of))
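
The assignment above works because `zip(*...)` transposes a sequence of tuples. A small illustration with made-up tuples (the numbers are hypothetical, not scraped values):

```python
# Hypothetical per-university tuples, in the order
# (staff, inter_staff, students, inter_students).
rows = [(100, 10, 1000, 200), (50, 5, 400, 40)]

# zip(*rows) regroups the values by position:
# one tuple per field instead of one tuple per university.
staff, inter_staff, students, inter_students = zip(*rows)
print(staff)     # (100, 50)
print(students)  # (1000, 400)
```

Each resulting tuple has one entry per university, which is why it can be assigned directly to a dataframe column.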

We can now remove the URL column, as it is no longer useful.

In [None]:
df_ranking = df_ranking.drop('URL', axis=1)
df_ranking

#### a) Which are the best universities in terms of ratio between faculty members and students?

In [None]:
df_ranking['% Fac Members'] = df_ranking['Total Faculty Members']/(df_ranking['Total Faculty Members']+df_ranking['Total Students'])*100
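
Note that `% Fac Members` is the share of faculty among faculty plus students, F/(F+S), rather than the plain ratio F/S. Writing r = F/S, the share equals r/(1+r), which increases with r, so both quantities should order the universities identically. A quick check on hypothetical numbers (the function names are ours, for illustration only):

```python
def faculty_share(faculty, students):
    """Percentage of faculty among faculty + students (the '% Fac Members' formula)."""
    return faculty / (faculty + students) * 100

def faculty_ratio(faculty, students):
    """Plain faculty-to-student ratio."""
    return faculty / students

# Hypothetical (faculty, students) pairs, not taken from the ranking.
unis = {"A": (100, 1000), "B": (300, 1500), "C": (50, 1000)}

by_share = sorted(unis, key=lambda u: faculty_share(*unis[u]), reverse=True)
by_ratio = sorted(unis, key=lambda u: faculty_ratio(*unis[u]), reverse=True)
print(by_share, by_ratio)
```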

In [None]:
df_ranking.sort_values('% Fac Members', ascending=False)

In [None]:
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

def bars(t, data, xlab, ylab):
    sns.set_style('darkgrid')
    fig, ax = plt.subplots(figsize = (15,8))
    ax.set_title(t, fontsize=15, fontweight='bold')
    sns.barplot(x=xlab, y=ylab, data=data, saturation=0.7, errcolor='.7')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
bars("Faculty Members per University", df_ranking.sort_values('% Fac Members', ascending=False)[:20], 'University', '% Fac Members')

We observe that this ranking differs from the original one, but almost all of the first 10 universities are in the top 20 of the original ranking.

#### b) Which are the best universities in terms of ratio of international students?

In [None]:
df_ranking['% International Students'] = df_ranking['International Students']/df_ranking['Total Students']*100

In [None]:
df_ranking.sort_values('% International Students', ascending=False)

In [None]:
bars("International Students per University", df_ranking.sort_values('% International Students', ascending=False)[:20], 'University', '% International Students')

Here, the results are completely different from the original ranking.

#### c) Aggregating the data by country

In [None]:
bars("Faculty Members per Country", df_ranking.sort_values('Country'), 'Country', '% Fac Members')

In [None]:
bars("International Students per Country", df_ranking.sort_values('Country'), 'Country', '% International Students')

#### d) Aggregating the data by region

In [None]:
bars("Faculty Members per Region", df_ranking.sort_values('Region'), 'Region', '% Fac Members')

In [None]:
bars("International Students per Region", df_ranking.sort_values('Region'), 'Region', '% International Students')
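
The charts in (c) and (d) pass one row per university to `sns.barplot`, whose default estimator is the mean, so each bar shows the average of the per-university percentages in that group. Another possible aggregation is to sum the counts per group first and only then compute the percentage. The sketch below contrasts the two on made-up rows (all values are hypothetical), using only the standard library:

```python
from collections import defaultdict

# Hypothetical rows: (country, international students, total students).
rows = [
    ("X", 500, 1000),    # 50 %
    ("X", 1000, 10000),  # 10 %
    ("Y", 300, 1500),    # 20 %
]

percentages = defaultdict(list)
totals = defaultdict(lambda: [0, 0])
for country, inter, total in rows:
    percentages[country].append(inter / total * 100)
    totals[country][0] += inter
    totals[country][1] += total

# Mean of the per-university percentages (what a mean-based bar chart shows).
mean_of_pct = {c: sum(p) / len(p) for c, p in percentages.items()}
# Percentage computed from the summed counts (large universities weigh more).
pct_of_sums = {c: i / t * 100 for c, (i, t) in totals.items()}

print(mean_of_pct)
print(pct_of_sums)
```

The two aggregations can differ noticeably when universities of very different sizes share a country or region, so it is worth stating which one a chart uses.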

### Question 2

Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

Using Postman, we captured the API request made by the ranking page; it returns a JSON file with all the data.

In [None]:
times_r=requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
data=times_r.text

We decode the JSON file, keeping only the first 200 data items (each data item corresponds to a university, and the items are sorted by rank). Furthermore, we extract only the useful columns (name, location, rank, total number of students, percentage of international students and student/staff ratio).

As we are asked to sort the universities according to their ratio between faculty members and students, we invert the student/staff ratio here to obtain a staff/student ratio (kept as a plain ratio, not a percentage).

In [None]:
df_times = pd.DataFrame(json.loads(data)['data'][:200],columns=['name','location','rank','stats_number_students','stats_pc_intl_students','stats_student_staff_ratio'])
df_times['stats_pc_intl_students'] = df_times['stats_pc_intl_students'].astype(str).str.replace('%','').astype(int)
df_times['stats_number_students'] = df_times['stats_number_students'].astype(str).str.replace(',','').astype(float)
df_times['stats_student_staff_ratio'] = 1/(df_times['stats_student_staff_ratio'].astype(float))
df_times.columns = ['University','Country','Rank','Total Students','% International Students','Fac Members/Students ratio']
df_times


The second website does not provide the region, so we reuse the country-to-region information collected from the first one.

In [None]:
conv_to_cont = pd.DataFrame(data_QS.json()['data'],columns=['country','region']).drop_duplicates()
conv_to_cont.columns = ['Country', 'Region']
conv_to_cont.head()

Then, we merge the two dataframes on the country attribute.

In [None]:
df_times_final=pd.merge(df_times, conv_to_cont,how='left',left_on="Country",right_on="Country")


In [None]:
df_times_final[df_times_final.Region.isnull()]

As we can see above, two universities have no defined Region because their countries do not appear in the first ranking. Since only two universities are concerned, we fill in the missing information manually; otherwise they would not appear on the bar plot when we aggregate the ranking by region.

In [None]:
df_times_final.at[178,'Region']='Europe'
df_times_final.at[193,'Region']='Europe'

Some information is not explicitly present in our data, such as the number of international students and the total number of faculty members, but we can infer it from the other columns with the following formulas:

Number of international students: *0.01 x Percentage of international students x Total number of students*

Total number of faculty members: *Total number of students x Ratio between faculty members and students*

Nevertheless, we cannot infer the number of international faculty members, but it is not essential for the results.

In [None]:
df_times_final['International Students'] = (df_times_final["Total Students"] * 0.01 * df_times_final["% International Students"]).astype(int)
df_times_final['Total Faculty Members'] = (df_times_final["Total Students"] * df_times_final["Fac Members/Students ratio"]).astype(int)
df_times_final['% Fac Members'] = df_times_final["Total Faculty Members"] / (df_times_final["Total Faculty Members"] + df_times_final["Total Students"]) * 100
df_times_final
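
As a sanity check of these derived columns, here is the same arithmetic on a single hypothetical university (values chosen only for illustration). This sketch uses `round`, whereas the cell above uses `astype(int)`, which truncates toward zero; the two can differ by one when a product is not a whole number.

```python
# Hypothetical university, values chosen only for illustration.
total_students = 10000
pct_international = 25      # percentage of international students
student_staff_ratio = 8.0   # students per staff member, as published

staff_student_ratio = 1 / student_staff_ratio  # 0.125 staff per student
international_students = round(0.01 * pct_international * total_students)
total_faculty = round(total_students * staff_student_ratio)
pct_fac_members = total_faculty / (total_faculty + total_students) * 100

print(international_students, total_faculty, round(pct_fac_members, 2))
```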

#### a) Which are the best universities in terms of ratio between faculty members and students?

In [None]:
df_times_final.sort_values('% Fac Members', ascending=False)

In [None]:
bars("Faculty Members per University", df_times_final.sort_values('% Fac Members', ascending=False)[:20], 'University', '% Fac Members')

#### b) Which are the best universities in terms of ratio of international students?

In [None]:
df_times_final.sort_values('% International Students', ascending=False)

In [None]:
bars("International Students per University", df_times_final.sort_values('% International Students', ascending=False)[:20], 'University', '% International Students')

#### c) Aggregating the data by country

In [None]:
bars("Faculty Members and Students per Country", df_times_final.sort_values('Country'), 'Country', '% Fac Members')

In [None]:
bars("International Students per Country", df_times_final.sort_values('Country'), 'Country', '% International Students')

#### d) Aggregating the data by region

In [None]:
bars("Faculty Members per Region", df_times_final.sort_values('Region'), 'Region', '% Fac Members')


In [None]:
bars("International Students per Region", df_times_final.sort_values('Region'), 'Region', '% International Students')

We save our current results in a pickle file.

In [None]:
import pickle
with open("qs.p", "wb") as f:
    pickle.dump(df_ranking, f)
with open("times.p", "wb") as f:
    pickle.dump(df_times_final, f)

# Lou and Anaïs
You can simply run the notebook from this cell on: it loads the results of question 2, so there is no need to run all. You can also run it at any point while you work to refresh the original DataFrames. :)

In [None]:
with open("qs.p", "rb") as f:
    df_ranking = pickle.load(f)
with open("times.p", "rb") as f:
    df_times_final = pickle.load(f)

### Question 3


Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

First, we try a naive merge on the "University" column (an inner join by default). We obtain 105 rows, meaning only about half of the university names match.

In [None]:
df_ranking.merge(df_times_final,on="University")

So we take a closer look at the list of names for each ranking:

In [None]:
pd.DataFrame({"qs":df_ranking["University"],"times":df_times_final["University"]})

We notice that many universities in the QS ranking have their initials in parentheses after their name, so we start by removing those parentheses.

In [None]:
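# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df_ranking["University"] = df_ranking["University"].str.replace(r"\(.*\)", "", regex=True).str.strip()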
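
To illustrate what this pattern does, here is the same substitution with the standard `re` module on two example names. Note that `.*` is greedy, so everything from the first `(` to the last `)` is removed.

```python
import re

# Example names; the second has no initials to strip.
names = ["Massachusetts Institute of Technology (MIT)", "Stanford University"]

# Remove everything from the first "(" to the last ")" and trim leftover spaces.
cleaned = [re.sub(r"\(.*\)", "", name).strip() for name in names]
print(cleaned)
```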

In [None]:
df_ranking.merge(df_times_final,on="University").shape[0]

We only gain 10 more matches.
This is clearly not enough, and even after further study of the two sets of names, we see no other simple rule that would raise this number.

So we try a new approach: the *difflib* module of the Python standard library offers an effective way to measure string similarity with its *SequenceMatcher* class.
We compare the two lists of names pairwise, and if the best match for a name is above a certain similarity threshold, we map the two names together.
The threshold was calibrated after some trials: high enough that we don't wrongly match too many names, but low enough to still match the universities that appear in both rankings under slightly different names.

In [None]:
from difflib import SequenceMatcher

df_ranking = pickle.load(open( "qs.p", "rb" )) #just to reset our previous modification of the names

name_map = {}
threshold = 0.85

for name1 in df_ranking["University"]:
    best_ratio = 0
    match = ""
    for name2 in df_times_final["University"]:
        ratio = SequenceMatcher(None,name1,name2).ratio()
        if ratio > best_ratio:       # we find the maximum similarity
            best_ratio = ratio
            match = name2
    if (best_ratio > threshold):     # if it's superior to our threshold we add that couple to the mapping
        name_map[name1] = match

print("Nb of matches :",len(name_map))
name_map
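
For comparison, `difflib` also provides `get_close_matches`, which wraps the same best-match-above-a-cutoff idea. The sketch below is an alternative formulation, not the code used above. The name list and the `best_match` helper are hypothetical, and `get_close_matches` keeps scores greater than or equal to the cutoff, whereas the loop above uses a strict comparison.

```python
from difflib import get_close_matches

# Hypothetical name list; the typo in the first query is deliberate.
times_names = ["University of Oxford", "University of Cambridge"]
threshold = 0.85

def best_match(name, candidates, cutoff):
    """Return the most similar candidate scoring at least cutoff, or None."""
    found = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return found[0] if found else None

print(best_match("Universty of Oxford", times_names, threshold))
print(best_match("Zzz", times_names, threshold))
```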

The result is much better: we obtain 149 matches. But going through the mapping, we notice that some names appear several times on the value side, which means we wrongly mapped some universities. Let's see which ones:

In [None]:
l = list(name_map.values())
duplicates = [x for x in l if l.count(x) > 1]
duplicates
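
Calling `l.count(x)` for every element rescans the list each time and lists every duplicated name once per occurrence. A possible alternative, sketched here on a hypothetical mapping of the same shape as `name_map`, is `collections.Counter`:

```python
from collections import Counter

# Hypothetical mapping in the same shape as name_map (QS name -> Times name).
mapping = {"A Univ": "A University", "B Univ": "B University", "A Uni": "A University"}

# Count how many QS names point at each Times name.
counts = Counter(mapping.values())
duplicated = [name for name, n in counts.items() if n > 1]
print(duplicated)  # ['A University']
```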

Only 5 universities appear twice, which seems reasonable to correct "by hand".

In [None]:
inv_names = {}
for k, v in name_map.items():
    inv_names[v] = inv_names.get(v, [])
    inv_names[v].append(k)

inv_duplicates = {d:inv_names[d] for d in duplicates}
inv_duplicates

With this reversed mapping of names, we can see that, for example 'Eindhoven University of Technology' has been mapped to both 'Eindhoven University of Technology' and 'Vienna University of Technology', but the Vienna University of Technology does not appear in the Times ranking. A quick check shows that it is the same for the 4 other cases, so we simply correct these 5 entries in our mapping by deleting them.

In [None]:
for k,v in inv_duplicates.items():
    for uni in v:
        if uni != k:
            del name_map[uni]

print("Nb of matches :",len(name_map))
name_map

Finally, we can do an inner merge between the two rankings on the university name column. For clarity, we only show the basic information here, mainly the respective ranks.

In [None]:
df_ranking["University"] = df_ranking["University"].map(lambda uni: name_map.get(uni, uni))

df_merged = df_ranking.merge(df_times_final.drop(["Country","Region"],axis=1),on="University",suffixes=("_qs","_times"))
df_merged[["University","Rank_qs","Rank_times","Country","Region"]]

### Question 4
Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

In [None]:
df_merged
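
One possible starting point for this question, given as a sketch rather than a result, is the Pearson correlation between two columns. The lists below are hypothetical values, not data from `df_merged`; on the real dataframe, pandas' `DataFrame.corr` computes the same coefficient for all numeric column pairs.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical percentages of international students and faculty.
pct_inter_students = [10.0, 20.0, 30.0, 40.0]
pct_inter_faculty = [15.0, 25.0, 35.0, 45.0]

print(pearson(pct_inter_students, pct_inter_faculty))
```

A coefficient close to 1 or -1 on the real columns would indicate a strong linear relationship, while values near 0 would not.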

### Question 5

Can you find the best university taking in consideration both rankings? Explain your approach.
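
One simple approach, offered here only as a sketch, is to average each university's rank across the two rankings and pick the lowest mean. The rank strings and university names below are hypothetical, and the `parse_rank` helper assumes ties are displayed with a leading `=` (for example `=3`), which should be checked against the real columns.

```python
def parse_rank(rank):
    """Turn a displayed rank such as '=3' or '12' into an int (assumed format)."""
    return int(str(rank).lstrip("="))

# Hypothetical (QS rank, Times rank) pairs.
ranks = {"Uni A": ("1", "6"), "Uni B": ("=3", "2"), "Uni C": ("2", "4")}

# Average the two ranks; a lower mean rank is better.
mean_rank = {u: (parse_rank(q) + parse_rank(t)) / 2 for u, (q, t) in ranks.items()}
best = min(mean_rank, key=mean_rank.get)
print(best, mean_rank[best])
```

Averaging ranks treats both rankings as equally reliable; weighting them differently or averaging the scores instead are other options.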