# Background

In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need! You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to understand quickly how to use it.

In [75]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re
import types
import matplotlib.pyplot as plt
%matplotlib inline

In [76]:
# number of universities to fetch
n = 200
tu_link = "https://www.topuniversities.com"

# 1. Obtain the 200 top-ranking universities in www.topuniversities.com

When loading the page https://www.topuniversities.com/university-rankings/world-university-rankings/2018, one can notice that several additional requests are sent to the server. <br>
One of these files is a json file, which contains the complete ranking:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508954434901. <br>

In [77]:
req_tu_ranking = requests.get(tu_link+'/sites/default/files/qs-rankings-data/357051.txt?_=1508168782318')
tu_ranking = req_tu_ranking.json()["data"]
# the tu_ranking JSON is too large to print here

In [78]:
len(tu_ranking)

959

In [79]:
# check entries and equal ranks
print (tu_ranking[20])
print (tu_ranking[21])

{'region': 'North America', 'country': 'United States', 'rank_display': '=21', 'core_id': '403', 'score': '87', 'url': '/universities/university-michigan', 'title': 'University of Michigan', 'guide': '<a href="/where-to-study/north-america/united-states/guide" class="guide-link" target="_blank">United States</a>', 'cc': 'US', 'stars': '5', 'nid': '294857', 'logo': '<img src="https://www.topuniversities.com/sites/default/files/university-of-michigan_403_small_0.jpg" alt="University of Michigan Logo">'}
{'region': 'North America', 'country': 'United States', 'rank_display': '=21', 'core_id': '168', 'score': '87', 'url': '/universities/duke-university', 'title': 'Duke University', 'guide': '<a href="/where-to-study/north-america/united-states/guide" class="guide-link" target="_blank">United States</a>', 'cc': 'US', 'stars': '5', 'nid': '294490', 'logo': '<img src="https://www.topuniversities.com/sites/default/files/duke-university_168_small_0.jpg" alt="Duke University Logo">'}


Filter necessary information

In [80]:
tu_ranking_200 = tu_ranking[:n]

In [None]:
tu_relevant_values = ["title", "country", "region", "rank_display"]

tu_ranking_filtered = []
for i in range(n):
    tmp_dict = {}
    for s in tu_relevant_values:
        tmp_dict[s] = tu_ranking_200[i][s]
    tu_ranking_filtered.append(tmp_dict)
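
The filtering loop above can also be written as a comprehension. The following is a minimal sketch on made-up sample entries (`sample_ranking`, `relevant_keys` and `filtered` are illustrative names, not part of the fetched data):

```python
# Made-up sample entries that mimic the structure of the fetched ranking.
sample_ranking = [
    {"title": "University A", "country": "Switzerland", "region": "Europe",
     "rank_display": "1", "score": "100"},
    {"title": "University B", "country": "United States", "region": "North America",
     "rank_display": "=2", "score": "99"},
]
relevant_keys = ["title", "country", "region", "rank_display"]

# Keep only the relevant keys of each entry.
filtered = [{key: entry[key] for key in relevant_keys} for entry in sample_ranking]
print(filtered[1])
```

Each resulting dictionary contains exactly the keys listed in `relevant_keys`, so it can be passed to `pd.DataFrame.from_dict` in the same way as `tu_ranking_filtered`.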

Unfortunately, the JSON file does not contain all the information we need, such as the number of students. Therefore, the detail pages must be fetched. The link to each page is contained in the fetched ranking (e.g. https://www.topuniversities.com/universities/ecole-polytechnique-f%C3%A9d%C3%A9rale-de-lausanne-epfl). <br>
While analysing the webpage, no dynamic queries that fetch the detailed information were found. Therefore, the HTML page needs to be searched for those values. <br>
One can notice that all required numbers are stored in a div with the class _view-academic-data-profile_. This div contains several divs for the details, in the format `<div class="variable"><div class="number"></div></div>`. <br>
We use this structure to fetch the data from each detail page.
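
To illustrate the nested structure, here is a small sketch that uses only the standard library `html.parser` module instead of BeautifulSoup, so that it runs without extra packages. The HTML snippet and its numbers are made up, and `NumberExtractor` is a hypothetical helper, not part of the notebook:

```python
from html.parser import HTMLParser

class NumberExtractor(HTMLParser):
    """Collect the text of each <div class="number"> keyed by its parent div's class."""

    def __init__(self):
        super().__init__()
        self.stack = []   # class attribute of every currently open div
        self.values = {}  # e.g. {"total faculty": "1,767"}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append(dict(attrs).get("class") or "")

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text directly inside a "number" div belongs to the enclosing detail div.
        if len(self.stack) >= 2 and self.stack[-1] == "number" and data.strip():
            self.values[self.stack[-2]] = data.strip()

# Made-up snippet mimicking the structure described above.
sample_html = """
<div class="view-academic-data-profile">
  <div class="total faculty"><div class="number"> 1,767 </div></div>
  <div class="total student"><div class="number"> 10,343 </div></div>
</div>
"""

parser = NumberExtractor()
parser.feed(sample_html)
print(parser.values)
```

The extracted values are still strings with thousands separators, so they need to be converted to numbers before any arithmetic.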

In [None]:
def tuDetailsGetNumber(div):
    return div.find('div', class_="number").text.strip()

for i in range(n):
    req_tu_details = requests.get(tu_link+tu_ranking_200[i]["url"])    
    soup = BeautifulSoup(req_tu_details.text, 'html.parser')
    tu_detail_classes = soup.find_all('div', class_='view-academic-data-profile')

    try:
        tu_total_fac_div = tu_detail_classes[0].select(".total.faculty")[0]
        tu_ranking_filtered[i]["total_faculties"] = tuDetailsGetNumber(tu_total_fac_div)
    except IndexError:
        print("Could not fetch total faculty field from uni {}".format(i+1))
    
    try:
        tu_inter_fac_div = tu_detail_classes[0].select(".inter.faculty")[0]
        tu_ranking_filtered[i]["inter_faculties"] = tuDetailsGetNumber(tu_inter_fac_div)
    except IndexError:
        print("Could not fetch inter faculty field from uni {}".format(i+1))
        
    try:
        tu_total_stu_div = tu_detail_classes[0].select(".total.student")[0]
        tu_ranking_filtered[i]["total_students"] = tuDetailsGetNumber(tu_total_stu_div)
    except IndexError:
        print("Could not fetch total students field from uni {}".format(i+1))
    
    try:
        tu_inter_stu_div =  tu_detail_classes[0].select(".total.inter")[0]
        tu_ranking_filtered[i]["inter_students"] = tuDetailsGetNumber(tu_inter_stu_div)    
    except IndexError:
        print("Could not fetch inter students field from uni {}".format(i+1))

In [None]:
tu_df = pd.DataFrame.from_dict(tu_ranking_filtered)
tu_df.head()

## Statistics on first ranking

First, we convert all object columns to floats.

In [None]:
def ToNumeric(x):
    if(type(x) is not float):
        x = x.replace(',','')
    if type(x) is str:
        if "%" in x:
            x = float(x.strip('%'))
            x /= 100 
    return x
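
A slightly more defensive variant of this conversion could look like the sketch below. `to_number` is a hypothetical helper; unlike `ToNumeric`, it also accepts integers and always returns a float:

```python
def to_number(value):
    """Convert '1,767' -> 1767.0 and '34%' -> 0.34; numbers are returned as floats."""
    if isinstance(value, (int, float)):
        return float(value)
    text = value.replace(",", "").strip()
    if text.endswith("%"):
        # Percentages are mapped to fractions between 0 and 1.
        return float(text[:-1]) / 100
    return float(text)

print(to_number("1,767"), to_number("34%"), to_number(12))
```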

In [None]:
tu_df.inter_faculties = tu_df.inter_faculties.apply(lambda x: ToNumeric(x))
tu_df.inter_faculties = tu_df.inter_faculties.apply(pd.to_numeric)

tu_df.inter_students = tu_df.inter_students.apply(lambda x: ToNumeric(x))
tu_df.inter_students = tu_df.inter_students.apply(pd.to_numeric)

tu_df.total_faculties = tu_df.total_faculties.apply(lambda x: ToNumeric(x))
tu_df.total_faculties = tu_df.total_faculties.apply(pd.to_numeric)

tu_df.total_students = tu_df.total_students.apply(lambda x: ToNumeric(x))
tu_df.total_students = tu_df.total_students.apply(pd.to_numeric)


tu_df.head()

We use the total number of students and faculty members to calculate their ratio and add it as a new column. The same is done for the ratio of international students to total students.

In [None]:
# Create a DataFrame for each ratio; it is later used to find the university with the best ratio
tot_stud = pd.DataFrame(tu_df.total_students)
tot_fac = pd.DataFrame(tu_df.total_faculties)
ratio = (tot_stud.total_students/tot_fac.total_faculties)
df_ratio_tot = pd.DataFrame(ratio,columns=["ratio_students_fac"])


inter_stud = pd.DataFrame(tu_df.inter_students)
total_stud = pd.DataFrame(tu_df.total_students)
inter_ratio = (inter_stud.inter_students/total_stud.total_students)
df_ratio_inter = pd.DataFrame(inter_ratio,columns=["ratio_inter"])

tu_df = tu_df.join(df_ratio_tot)
tu_df = tu_df.join(df_ratio_inter)
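
The same two columns can also be created by direct column assignment, without intermediate DataFrames. A minimal sketch on a made-up toy frame (`toy` and its numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up numbers, using the same column names as tu_df.
toy = pd.DataFrame({
    "total_students": [1000.0, 3000.0],
    "total_faculties": [100.0, 150.0],
    "inter_students": [250.0, 300.0],
})

# Element-wise division of two columns of the same frame keeps rows aligned.
toy["ratio_students_fac"] = toy.total_students / toy.total_faculties
toy["ratio_inter"] = toy.inter_students / toy.total_students
print(toy)
```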

- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?

In [None]:
def plotBarChart(x, y, title, yText):

    top = 5

    fig, ax = plt.subplots()
    rec = ax.bar(np.arange(top), y, 0.8)

    # add some text for labels, title and axes ticks
    ax.set_ylabel(yText)
    ax.set_title(title)
    ax.set_xticks(np.arange(top))
    ax.set_xticklabels(x, rotation='vertical')

    plt.show()

In [None]:
def plotBarChartGroup(df, groupCol, valueCol, title, ylabel):
    
    # TODO bugfix: Some entries are missing
    
    top = 3
    categories = df[groupCol].unique()

    fig, ax = plt.subplots()
    fig.set_size_inches(10,10)
    for i in range(top):    
        topx = df.sort_values([groupCol,valueCol],ascending=False).groupby(groupCol)[valueCol].nth(i)
        nthcat = 0
        fixed_topx = [0]*len(categories)
        for j in range(len(categories)):
            try:
                if topx.index.get_loc(categories[j]) > -1:
                    fixed_topx[j] = topx[topx.index.get_loc(categories[j])]
            except KeyError:
                # no i-th entry for this category: keep the default value 0
                pass
        ax.bar(np.arange(len(categories))+0.25*i, fixed_topx, 0.25)


    # add some text for labels, title and axes ticks
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_xticks(np.arange(len(categories)) + 0.25 / 3)
    ax.set_xticklabels(df[groupCol].unique(), rotation="vertical")

    plt.show()

In [None]:
# Sort by the ratio to get the best universities by ratio
tu_df.sort_values(by="ratio_students_fac",ascending=False).head()

In [None]:
plotBarChart(tu_df.sort_values(by="ratio_students_fac",ascending=False).head()["title"],\
             tu_df.sort_values(by="ratio_students_fac",ascending=False).head()["ratio_students_fac"],\
             'Top Student faculty ratio', 'Ratio student to faculty members')

The `Vienna University of Technology` has the highest student to faculty members ratio. Most of the top 5 universities are in Europe.

In [None]:
tu_df.sort_values(by="ratio_inter",ascending=False).head()

In [None]:
plotBarChart(tu_df.sort_values(by="ratio_inter",ascending=False).head()["title"],\
             tu_df.sort_values(by="ratio_inter",ascending=False).head()["ratio_inter"],\
             'Top universities with international students', 'Ratio of international students')

The `London School of Economics and Political Science` has the highest international student ratio. Most of the top 5 universities are in Europe.

- Answer the previous question aggregating the data by (c) country and (d) region.

In [None]:
tu_df.sort_values(["country","ratio_students_fac"],ascending=False).groupby("country").head(1)

In [None]:
plotBarChartGroup(tu_df, "country", "ratio_students_fac", 'Top Student faculty ratio', 'Ratio student to faculty members')

The US, Germany, Austria and Italy have the universities with the highest student to faculty ratio.

In [None]:
tu_df.sort_values(["region","ratio_students_fac"],ascending=False).groupby("region").head(2)

In [None]:
plotBarChartGroup(tu_df, "region", "ratio_students_fac", 'Top Student faculty ratio', 'Ratio student to faculty members')

North America and Europe have the universities with the highest student to faculty ratio.

In [None]:
tu_df.sort_values(["country","ratio_inter"],ascending=False).groupby("country").head(1)

In [None]:
plotBarChartGroup(tu_df, "country", "ratio_inter", 'Top universities with international students', 'Ratio of international students')

The US, UK, Switzerland and Netherlands have the universities with the highest international student ratio.

In [None]:
tu_df.sort_values(["region","ratio_inter"],ascending=False).groupby("region").head(2)

In [None]:
plotBarChartGroup(tu_df, "region", "ratio_inter", 'Top universities with international students', 'Ratio of international students')

Europe has the universities with the highest international student ratio.

# Obtain the 200 top-ranking universities in www.timeshighereducation.com (ranking 2018)

When loading the page http://timeshighereducation.com/world-university-rankings/2018/world-ranking, one can notice that several additional requests are sent to the server. <br>
One of these files is a json file, which contains the complete ranking:
https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json. <br>
Luckily, this ranking already contains all necessary information.

In [None]:
th_link = "https://www.timeshighereducation.com" 

In [None]:
req_th_ranking = requests.get(th_link+'/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
th_ranking = req_th_ranking.json()["data"]
#th_ranking

In [None]:
len(th_ranking)

In [None]:
# check format and equal ranks
print (th_ranking[2])
print (th_ranking[3])

# equal scores

In [None]:
th_ranking_200 = th_ranking[:n]

In [None]:
th_relevant_values = ["name", "aliases", "location", "rank", "stats_number_students", "stats_student_staff_ratio", "stats_pc_intl_students"]

th_ranking_filtered = []
for i in range(n):
    tmp_dict = {}
    for s in th_relevant_values:
        tmp_dict[s] = th_ranking_200[i][s]
    th_ranking_filtered.append(tmp_dict)

In [None]:
th_df = pd.DataFrame.from_dict(th_ranking_filtered)
th_df.head()

## Statistics on second ranking

In [None]:
# transform to numerical values
th_df.stats_number_students = th_df.stats_number_students.apply(lambda x: ToNumeric(x))
th_df.stats_number_students = th_df.stats_number_students.apply(pd.to_numeric)

th_df.stats_pc_intl_students = th_df.stats_pc_intl_students.apply(lambda x: ToNumeric(x))
th_df.stats_pc_intl_students = th_df.stats_pc_intl_students.apply(pd.to_numeric)

th_df.stats_student_staff_ratio = th_df.stats_student_staff_ratio.apply(lambda x: ToNumeric(x))
th_df.stats_student_staff_ratio = th_df.stats_student_staff_ratio.apply(pd.to_numeric)

Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?

In [None]:
th_df.sort_values(by="stats_student_staff_ratio",ascending=False).head()

In [None]:
plotBarChart(th_df.sort_values(by="stats_student_staff_ratio",ascending=False).head()["name"],\
             th_df.sort_values(by="stats_student_staff_ratio",ascending=False).head()["stats_student_staff_ratio"],\
             'Top Student faculty ratio', 'Ratio student to faculty members')

The `University of Bonn` has the highest student to faculty members ratio. All top 5 universities are in Germany.

In [None]:
th_df.sort_values(by="stats_pc_intl_students",ascending=False).head()

In [None]:
plotBarChart(th_df.sort_values(by="stats_pc_intl_students",ascending=False).head()["name"],\
             th_df.sort_values(by="stats_pc_intl_students",ascending=False).head()["stats_pc_intl_students"],\
             'Top universities in international students', 'Ratio international students [%]')

The `London School of Economics and Political Science` has the highest international student ratio. All of the top 5 universities are in Europe.

Answer the previous question aggregating the data by (c) country and (d) region.

In [None]:
th_df.sort_values(["location","stats_student_staff_ratio"],ascending=False).groupby("location").head(1)

In [None]:
plotBarChartGroup(th_df, "location", "stats_student_staff_ratio", 'Top Student faculty ratio', 'Ratio student to faculty members')

Germany has by far the highest student to faculty members ratio.

In [None]:
th_df.sort_values(["location","stats_pc_intl_students"],ascending=False).groupby("location").head(1)

In [None]:
plotBarChartGroup(th_df, "location", "stats_pc_intl_students", 'Top universities in international students', 'Ratio international students [%]')

The UK, Switzerland, Netherlands and Luxembourg have the highest international student ratio.

Unfortunately, the second ranking does not contain the region column. We can obtain it from the first ranking.

In [None]:
region_map = {}
for country in th_df["location"].values:
    if country in tu_df["country"].values:
        region_map[country] = tu_df[tu_df.country == country]["region"].iloc[0]
    else:        
        if country == 'Russian Federation':
            region_map[country] = tu_df[tu_df.country == "Russia"]["region"].iloc[0]
        elif country == 'Luxembourg':
            region_map[country] = "Europe"
        else:
            print (country)
        
th_df['region'] = th_df.apply(lambda x: region_map[x['location']], axis=1)
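
The lookup logic can be sketched with plain dictionaries. The pairs below are a made-up sample, and `name_aliases`, `manual_regions` and `region_of` are hypothetical names introduced for illustration:

```python
# Made-up sample of (country, region) pairs as they appear in the first ranking.
tu_pairs = [("Switzerland", "Europe"), ("Russia", "Europe"),
            ("Japan", "Asia"), ("Switzerland", "Europe")]
country_to_region = dict(tu_pairs)

# Countries that are spelled differently in the second ranking.
name_aliases = {"Russian Federation": "Russia"}
# Countries that do not occur in the first ranking at all.
manual_regions = {"Luxembourg": "Europe"}

def region_of(location):
    """Return the region for a location of the second ranking, or None if unknown."""
    country = name_aliases.get(location, location)
    if country in country_to_region:
        return country_to_region[country]
    return manual_regions.get(country)

print(region_of("Russian Federation"), region_of("Luxembourg"))
```

With such a function, the region column could be filled with `th_df["location"].map(region_of)`; locations that remain unknown would then show up as missing values instead of raising a `KeyError`.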

In [None]:
th_df.sort_values(["region","stats_student_staff_ratio"],ascending=False).groupby("region").head(2)

In [None]:
plotBarChartGroup(th_df, "region", "stats_student_staff_ratio", 'Top Student faculty ratio', 'Ratio student to faculty members')


Europe has by far the highest student to staff ratio.

In [None]:
th_df.sort_values(["region","stats_pc_intl_students"],ascending=False).groupby("region").head(2)

In [None]:
plotBarChartGroup(th_df, "region", "stats_pc_intl_students", 'Top universities in international students', 'Ratio international students [%]')


Europe has by far the most international students.

# Merge Rankings

The idea is to modify the name of each university so that the names from both rankings match as closely as possible:
 - Lowercase all letters
 - Translate important words (university, school, technical..)
 - Get rid of special characters
 - Get rid of prepositions
 - Get rid of parentheses and their content. Example: (UCB)
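
One possible way to sketch these steps with the standard library only is shown below. It is a variant, not the function used in this notebook: it works on whole words, so removing "of" or "y" cannot alter other words, and it strips accents generically (so 'ä' becomes 'a' rather than 'ae'). `TRANSLATIONS`, `STOPWORDS` and `normalize_name` are hypothetical names, and the word lists are deliberately short:

```python
import re
import unicodedata

# Short, illustrative word lists (assumptions, not exhaustive).
TRANSLATIONS = {
    "universite": "university", "universitat": "university",
    "universitaet": "university", "universidad": "university",
    "scuola": "school", "technische": "technical",
}
STOPWORDS = {"of", "the", "at", "de", "di", "y"}

def normalize_name(name):
    # Lowercase and drop parentheses together with their content.
    name = re.sub(r"\(.*?\)", " ", name.lower())
    # Decompose accented characters and remove the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Split into alphanumeric words, which also removes special characters.
    tokens = re.findall(r"[a-z0-9]+", name)
    tokens = [TRANSLATIONS.get(t, t) for t in tokens if t not in STOPWORDS]
    return "".join(tokens)

print(normalize_name("University of California, Berkeley (UCB)"))
print(normalize_name("Université de Montréal"))
```

Both spellings of a university then map to the same key as long as they differ only in case, accents, listed translations, prepositions or parenthesised abbreviations.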


In [None]:
df_top_tomerge = tu_df.copy()
df_times_tomerge = th_df.copy()

In [None]:
def modify_tomerge(string):
    
    #lower case
    string = string.str.lower()
    
    # University
    string = string.str.replace('universite','university')
    string = string.str.replace('universitat','university')
    string = string.str.replace('universitaet','university')
    string = string.str.replace('universidad','university')
    string = string.str.replace('universiteit','university')
    string = string.str.replace('universidade','university')
    string = string.str.replace('universitari','university')


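    # school
    # Series.str.replace substitutes substrings (Series.replace only matches whole values)
    string = string.str.replace('scuola','school')
    
    #technical
    string = string.str.replace('technische','technical')
    
    #studies ('estudios' first, then the standalone word 'studi')
    string = string.str.replace('estudios','studies')
    string = string.str.replace(r'\bstudi\b','studies', regex=True)


    #some translations
    string = string.str.replace('freie','free')
    string = string.str.replace('tecnológico','technological')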



    #special characters
    string = string.str.replace('é','e')
    string = string.str.replace('-','')
    string = string.str.replace('ä','ae')
    string = string.str.replace('ã','a')
    string = string.str.replace('ó','o')
    string = string.str.replace('ö','o')
    string = string.str.replace('&','')
    string = string.str.replace('/','')


    #prepositions
    string = string.str.replace('of','')
    string = string.str.replace('the','')
    string = string.str.replace('at','')
    string = string.str.replace('de','')
    string = string.str.replace('y','')
    string = string.str.replace('di ','')
    string = string.str.replace(',','')


    #parenthesis and their content
    string = string.str.replace(r"\(.*\)","", regex=True)
    
    #space
    string = string.str.replace(' ','')
    

    return string
   

In [None]:
df_top_tomerge = df_top_tomerge.rename(columns={'title': 'name', 'rank_display':'rank_top'})
df_times_tomerge = df_times_tomerge.rename(columns={'rank': 'rank_times'})

df_times_tomerge['name'] = modify_tomerge(df_times_tomerge['name'])
df_top_tomerge['name'] = modify_tomerge(df_top_tomerge['name'])

df = df_top_tomerge.merge(df_times_tomerge, how='inner')

df.head()

In [None]:
def RankToNumeric(x):
    if(type(x) is not float and type(x) is not int):
        x = x.replace('=','')
    return x

We convert all ranks to integers (mainly so that tied ranks, which were written as "=rank", become plain integers).

In [None]:
df.rank_times = df.rank_times.apply(lambda x: RankToNumeric(x))
df.rank_times = df.rank_times.apply(pd.to_numeric)
df.rank_top = df.rank_top.apply(lambda x: RankToNumeric(x))
df.rank_top = df.rank_top.apply(pd.to_numeric)

# Exploratory Data Analysis

In [None]:
df

In [None]:
# There is one row with a NaN value, which we drop
df = df.dropna(axis=0)

In [None]:
# set the columns country, location and region to categorical
df["country"] = df["country"].astype('category')
df["location"] = df["location"].astype('category')
df["region"] = df["region"].astype('category')

Here we can use the describe function to get some statistics such as the mean, standard deviation, quartiles and other information about all features.

In [None]:
df.describe()

In [None]:
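# correlate the numeric columns only (string and category columns are skipped)
df_corr = df.select_dtypes(include="number").corr()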
df_corr

In [None]:
df_corr[["rank_top","rank_times"]]

The ranking of topuniversities.com mostly favors a high number of international faculty members.
The ranking of timeshighereducation.com mostly favors a combination of a high number of international faculty members and students.

In [None]:
for i in range(df.columns.shape[0]):
    for j in range(df.columns.shape[0]):
        if i<j and df.dtypes.iloc[i] == "float" and df.dtypes.iloc[j] == "float":
            if np.abs(df_corr.loc[df.columns[i],df.columns[j]]) > 0.5:
                print ("Columns {} and {} have a high corr coeff of {}".format(df.columns[i], df.columns[j], df_corr.loc[df.columns[i],df.columns[j]]))

It stands out that only the ratio of international students correlates strongly between the two rankings.
Therefore, let's have a look at the differences.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize= (16,8))
df.boxplot(column="rank_top",
                 by= "country",
                 rot=90,
                 ax=axes[0])
df.boxplot(column="rank_times",
                 by= "country",
                 rot=90,
                 ax=axes[1])

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize= (16,8))
df.boxplot(column="rank_top",
                 by= "region",
                 rot=90,
                 ax=axes[0])
df.boxplot(column="rank_times",
                 by= "region",
                 rot=90,
                 ax=axes[1])

European universities get a similar rating on both sites.
Topuniversities favors Asia over North America; timeshighereducation does the opposite.

In [None]:
# Find mismatching student numbers
tu_stud = pd.DataFrame(tu_df.total_students)
th_stud = pd.DataFrame(th_df.stats_number_students).astype("float")
ratio = (tu_stud.total_students/th_stud.stats_number_students)
stud_deviation = pd.DataFrame(ratio,columns=["students_deviation"])
df_stud_deviation = df.join(stud_deviation)
df_stud_deviation.sort_values(["students_deviation"], ascending=False).head()

In [None]:
df_stud_deviation.sort_values(["students_deviation"], ascending=True).head()

There are several universities for which the number of students differs by a factor of ~10.
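
One caveat, as far as we can tell from the code above: dividing a column of `tu_df` by a column of `th_df` pairs rows by index label, which here is the position in each site's own ranking, so the two values may belong to different universities. A minimal sketch on made-up toy data (`tu_toy`, `th_toy` and the numbers are assumptions) of a per-university comparison after merging on the name:

```python
import pandas as pd

# Made-up toy data: each frame is ordered by its own ranking.
tu_toy = pd.DataFrame({"name": ["a", "b", "c"],
                       "total_students": [100.0, 200.0, 300.0]})
th_toy = pd.DataFrame({"name": ["b", "c", "a"],
                       "stats_number_students": [210.0, 290.0, 100.0]})

# Division across two frames aligns on the index (0 with 0, 1 with 1, ...),
# i.e. by position in each ranking, not by university.
by_position = tu_toy.total_students / th_toy.stats_number_students

# After merging on the name, both values sit in the same row.
merged = tu_toy.merge(th_toy, on="name")
merged["students_deviation"] = merged.total_students / merged.stats_number_students
print(merged)
```

In the toy data, university "a" has identical student numbers in both frames, so its per-university deviation is 1, whereas the position-based division compares it with university "b".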

In [None]:
# Find mismatching ranking positions
tu_ranks = pd.DataFrame(df.rank_top)
th_ranks = pd.DataFrame(df.rank_times)
ratio = np.abs(tu_ranks.rank_top-th_ranks.rank_times)
rank_deviation = pd.DataFrame(ratio,columns=["rank_deviation"])
df_rank_deviation = df.join(rank_deviation)
df_rank_deviation.sort_values(["rank_deviation"], ascending=False).head()

In [None]:
df_rank_deviation.sort_values(["rank_deviation"], ascending=True).head()

There are several universities for which the rank differs by ~100 spots between the two rankings.

In [None]:
# Find mismatching faculty numbers
tu_facs = pd.DataFrame(tu_df.total_faculties)
th_students_facs = pd.DataFrame(th_df.stats_student_staff_ratio)
ratio = (tu_facs.total_faculties/(th_stud.stats_number_students/th_students_facs.stats_student_staff_ratio))
fac_deviation = pd.DataFrame(ratio,columns=["fac_deviation"])
df_fac_deviation = df.join(fac_deviation)
df_fac_deviation.sort_values(["fac_deviation"], ascending=False).head()

In [None]:
df_fac_deviation.sort_values(["fac_deviation"], ascending=True).head()


There are several universities for which the number of faculty members differs by a factor of ~10.

In [None]:
# Find mismatching international ratios
tu_ratio1 = pd.DataFrame(tu_df.ratio_inter)
th_ratio2 = pd.DataFrame(th_df.stats_pc_intl_students)
ratio = (tu_ratio1.ratio_inter/th_ratio2.stats_pc_intl_students)
inter_deviation = pd.DataFrame(ratio,columns=["inter_deviation"])
df_inter_deviation = df.join(inter_deviation)
df_inter_deviation.sort_values(["inter_deviation"], ascending=False).head()


In [None]:
df_inter_deviation.sort_values(["inter_deviation"], ascending=True).head()


There are several universities for which the international student ratio differs by a factor of ~5.

In [None]:
# Find mismatching student to faculty ratios
tu_ratio1 = pd.DataFrame(tu_df.ratio_students_fac)
th_ratio2 = pd.DataFrame(th_df.stats_student_staff_ratio)
ratio = (tu_ratio1.ratio_students_fac/th_ratio2.stats_student_staff_ratio)
fac_r_deviation = pd.DataFrame(ratio,columns=["fac_r_deviation"])
df_fac_r_deviation = df.join(fac_r_deviation)
df_fac_r_deviation.sort_values(["fac_r_deviation"], ascending=False).head()


In [None]:
df_fac_r_deviation.sort_values(["fac_r_deviation"], ascending=True).head()


There are several universities for which the number of students per faculty member differs by a factor of ~10.

All in all, there are large differences between the two rankings. <br>
Are those differences the reason for the rank deviation?

In [None]:
df_deviation = df.join(rank_deviation)
df_deviation = df_deviation.join(stud_deviation)
df_deviation = df_deviation.join(fac_deviation)
df_deviation = df_deviation.join(inter_deviation)
df_deviation = df_deviation.join(fac_r_deviation)
df_deviation[["rank_deviation","students_deviation","fac_deviation", "inter_deviation", "fac_r_deviation"]].corr()

The rank deviation is only partially influenced by the deviating columns between the rankings.

# Find combined ranking

The idea for combining the rankings is the following:
- First, if the two rankings are the same on both websites, we use this rank.
- Otherwise, we take the feature most correlated with each ranking (inter_faculties for rank_top and total_faculties for rank_times) and the feature that is most correlated with both (ratio_inter).
- Then we normalize these features between 0 and 1, to use them as multiplicative factors.
- Inverse normalization is applied to both rankings (the best rank gives 1, the worst gives 0).
- We then sum the normalized features, each multiplied by the absolute value of its correlation with the ranking, and add 4 times each normalized ranking (the rankings are more important than the 3 other features; since the sum of the 3 others is at most 3, a weight of 4 lets the rankings dominate).
- The resulting sum is sorted in descending order, and we use the position of each university in this sorted list as its rank (the first sum of the list is ranked 1, the second 2, etc.).
- We then merge the total_ranking DataFrame with our DataFrame to get the final DataFrame.
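
The scoring steps listed above can be sketched on a made-up toy frame as follows. The weights are the ones used in the code cell of this section; `min_max`, `toy` and the toy numbers are assumptions, and the first step (keeping a rank that is identical on both sites) is left out of the sketch:

```python
import pandas as pd

def min_max(series):
    """Scale a numeric Series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

# Made-up toy data with the column names of the merged DataFrame.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "rank_top": [1, 2, 3, 4],
    "rank_times": [2, 1, 4, 3],
    "inter_faculties": [500.0, 800.0, 200.0, 100.0],
    "total_faculties": [3000.0, 2000.0, 1000.0, 1500.0],
    "ratio_inter": [0.3, 0.2, 0.4, 0.1],
})

# Weighted sum: features weighted by |correlation|, inverted ranks weighted by 4.
score = (0.477066 * min_max(toy.inter_faculties)
         + 0.350235 * min_max(toy.total_faculties)
         + 0.2 * min_max(toy.ratio_inter)
         + 4 * (1 - min_max(toy.rank_top))
         + 4 * (1 - min_max(toy.rank_times)))

# The highest score gets rank 1, the second highest rank 2, and so on.
toy["total_rank"] = score.rank(ascending=False, method="first").astype(int)
print(toy[["name", "rank_top", "rank_times", "total_rank"]])
```

Using `Series.rank` assigns the position in the sorted list directly to each row, so no separate re-indexing step is needed.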

In [None]:
def applyRanking(row):
    
    return row.rank_times
    

In [None]:
#corr interfac and rank top = -0.486923
maxR = df.inter_faculties.max()
minR = df.inter_faculties.min()

column = (df.inter_faculties-minR)/(maxR-minR)

#corr rank times total fac -0.348804
maxS = df.total_faculties.max()
minS = df.total_faculties.min()

column2 = (df.total_faculties-minS)/(maxS-minS)

#correlation is about -0.2 for both
maxB = df.ratio_inter.max()
minB = df.ratio_inter.min()

column3 = (df.ratio_inter-minB)/(maxB-minB)


#INVERSE NORMALIZE THE RANKING, meaning the highest ranking are 1 and lowest are close to 0 

maxRtop= df.rank_top.max()
minRtop= df.rank_top.min()

columnTop = (1-(df.rank_top-minRtop)/(maxRtop-minRtop))

maxRtimes = df.rank_times.max()
minRtimes = df.rank_times.min()

columnTimes = (1-(df.rank_times-minRtimes)/(maxRtimes-minRtimes))


test = (column*0.477066+0.350235*column2+0.2*column3+4*columnTimes+4*columnTop).sort_values(ascending=False)
#total_rank = df.loc[test.index].apply(applyRanking, axis=1)
total_ranking = pd.DataFrame(df.loc[test.index].reset_index().apply(applyRanking, axis=1).values, index=df.index , columns=['total_rank'])

final_df = df.join(total_ranking)
final_df.sort_values(["total_rank"])