# Modeling residuals on university rankings

We'll try to apply the principles we saw in the diamonds example to [this](https://www.kaggle.com/datasets/alitaqi000/world-university-rankings-2023) dataset. Not to make it a competition or anything, but let's see which has the most influence on the ranking of a university: the percentage of international students or the teaching score!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df=pd.read_csv("files/World University Rankings 2023.csv")
df.head()

For the purpose of this exercise, we'll only look at countries with a lot of universities. Count the universities per country, sort the counts in descending order, and keep only the rows from the top 10 countries. (Hint: the 11th country, Pakistan, has 55 universities.)

In [None]:
# DELETE
sub_df = df[df.groupby('Location').Location.transform('count')>55].copy()
sub_df.head()

sub_df.Location.value_counts()
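
If you don't have the hint about the 11th country, the same selection can be made without a hard-coded threshold. The sketch below is only an illustration on a toy frame: the values are made up and `toy`, `top_locations` and `toy_top` are hypothetical names. It picks the top 2 countries; on the real data you would use 10.

```python
import pandas as pd

# Toy stand-in for the rankings data (made-up values, for illustration only).
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "C"],
    "Name of University": ["a1", "a2", "a3", "b1", "b2", "c1"],
})

# Count universities per country and keep the labels of the n largest counts.
top_locations = toy["Location"].value_counts().nlargest(2).index

# Keep only the rows whose country is among those labels.
toy_top = toy[toy["Location"].isin(top_locations)].copy()
print(toy_top["Location"].value_counts())
```

Note that `nlargest` keeps exactly n countries, so ties at the cut-off are broken arbitrarily, whereas a threshold such as `> 55` keeps every country above it.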

Now do a boxplot for "OverAll Score", our goal metric, split by country.

In [None]:
sns.boxplot(data=sub_df, x='Location', y='OverAll Score')

Doesn't look like much, does it? A number of our metrics are numbers that are stored as objects. Find out which ones by comparing the data with the datatypes.

In [None]:
# DELETE

print(sub_df.head())
print(sub_df.dtypes)

An overview:
- Name of University             , object: Ok!
- Location                       , object: Ok!
- No of student                  , object: Wrong
- No of student per staff       , float64: Ok!
- International Student          , object: Wrong
- Female:Male Ratio              , object: Wrong
- OverAll Score                  , object: Wrong
- Teaching Score                , float64: Ok!
- Research Score                , float64: Ok!
- Citations Score               , float64: Ok!
- Industry Income Score         , float64: Ok!
- International Outlook Score   , float64: Ok!

Fix the wrong ones. Do you also feel the [lambda](https://www.geeksforgeeks.org/applying-lambda-functions-to-pandas-dataframe/)-vibe in this one?

(Challenges: NaN values; OverAll Score sometimes contains a range such as "10.5–18.3", and the "–" isn't a normal "-"; the ratio has to be split and divided as female/male, but there is an all-female college.)
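
To see why the dash matters, here is a minimal sketch. It assumes the range separator is the Unicode en dash (U+2013), which looks like a hyphen but is a different character. The helper `midpoint` is a hypothetical name used only for this illustration.

```python
import unicodedata

en_dash = "\u2013"  # assumed range separator in the score column
hyphen = "-"        # what a keyboard normally produces

# The two characters look alike but have different Unicode names.
print(unicodedata.name(en_dash))
print(unicodedata.name(hyphen))

def midpoint(score_range: str) -> float:
    """Split a 'low–high' range on the en dash and return the middle."""
    low, high = (float(part) for part in score_range.split(en_dash))
    return (low + high) / 2

value = midpoint("10.5\u201318.3")
print(value)
```

Splitting on the keyboard hyphen would leave the string in one piece, and `float()` would then fail on it.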

In [None]:
# DELETE

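def ratio_to_nr(ratio):
    # Missing values arrive as float NaN; a string without ":" is not a ratio.
    if isinstance(ratio, float) or ratio.find(":") == -1:
        return 0
    # Split on ":" and strip spaces, so both "42 : 58" and "42:58" work.
    female, male = (part.strip() for part in ratio.split(":"))
    # All-female college (male share 0): return 1 to avoid dividing by zero.
    if male == "0":
        return 1
    return int(female)/int(male)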

def overall_score_fix(score):
    if isinstance(score, float):
        return score
    
    if score.find("–") == -1:
        return float(score)
    
    data = score.split("–")
    data = [ float(x) for x in data]
    return (data[0] + data[1])/2

sub_df["No of student"] = sub_df["No of student"].str.replace(",","").astype(int)
sub_df["International Student"] = sub_df["International Student"].apply(lambda x: 0 if x == "%" else float(x.replace("%","")))
sub_df["Female:Male Ratio"] = sub_df["Female:Male Ratio"].apply(ratio_to_nr)
sub_df["OverAll Score"] = sub_df["OverAll Score"].apply(overall_score_fix)

The above code block can only be run once, since it overwrites the original columns with the cleaned values.
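
One way around the run-once limitation is to leave the raw frame untouched and return a cleaned copy from a function. The sketch below is a suggestion, not part of the exercise solution: it uses a toy frame with made-up values, and `raw` and `clean` are hypothetical names.

```python
import pandas as pd

# Toy raw data in the same string format as the real columns (made-up values).
raw = pd.DataFrame({
    "No of student": ["1,234", "20,000"],
    "International Student": ["12%", "%"],
})

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = frame.copy()
    out["No of student"] = (
        out["No of student"].str.replace(",", "", regex=False).astype(int)
    )
    out["International Student"] = out["International Student"].apply(
        lambda x: 0.0 if x == "%" else float(x.replace("%", ""))
    )
    return out

first = clean(raw)
second = clean(raw)  # safe to re-run: raw still holds the original strings
print(first.dtypes)
```

Because `clean` always starts from the raw strings, re-running the cell gives the same result every time.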

But we were doing a boxplot!

In [None]:
sns.boxplot(data=sub_df, x='Location', y='OverAll Score')

There seems to be some influence. What other metrics are important? Try a correlation matrix.

In [None]:
# DELETE

# plot a correlation matrix of sub_df
corr = sub_df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)

We are trying to maximize the OverAll Score, so if we look at that row and do some creative copy-pasting in Paint, we get the following image.

![](files/2023-09-14-15-24-10.png)


Now analyze this data. What is the influence of the different columns?

(The following is a code-block, but we expect only text, the sort of which you'll also be expected to be able to produce on the exam.)

In [None]:
# DELETE

# An overview:
# - No of student: 0.069, small positive correlation, can be neglected
# - No of student per staff: -0.24, reasonable negative correlation. Negative, so more staff per student means a better university
# - International Student: 0.59, good positive correlation
# - Female:Male Ratio: 0.096, small positive correlation, can be neglected. Note: higher number means more female students, so more female students mean a better university, although not statistically significant
# - OverAll Score: 1, because the same value
# - Teaching Score: 0.85, high positive correlation
# - Research Score: 0.88, high positive correlation
# - Citations Score: 0.85, high positive correlation
# - Industry Income Score: 0.37, reasonable positive correlation
# - International Outlook Score: 0.65, good positive correlation

# Concluding: the influence of teaching score, research score and citations score is very big. This should be modelled first.

Also, copy-pasting in Paint? Really? Show all values from the correlation matrix for OverAll Score, ordered from high to low.

In [None]:
# DELETE

corr["OverAll Score"].sort_values(ascending=False)


Let's look at the teaching score vs the overall score in a scatter plot. Add the percentage of international students as a color.

In [None]:
# DELETE

# Draw a scatter plot of overall score vs teaching score. Add international student as color.

# plt.scatter(sub_df["OverAll Score"], sub_df["Teaching Score"])
sns.scatterplot(data=sub_df, x="OverAll Score", y="Teaching Score", hue="International Student")


The dark dots (high percentage of international students) occur more on the right side of the plot, but some also appear on the left. We know there is a positive effect, but how big is it? To understand that, we have to remove the influence of the teaching score.

We'll assume the influence of teaching score on overall score is linear. No need to do a log this time.

Also, to make the linear regression model work, we'll need to delete all NaN values. If only we had done that before converting all the datatypes; it would have made our lives a lot easier.

In [None]:
# print(sub_df.count())
sub_df.dropna(inplace=True)
# print(sub_df.count())

from sklearn import linear_model

x = sub_df["Teaching Score"].values.reshape(-1, 1)
y = sub_df["OverAll Score"].values.reshape(-1, 1)

regr = linear_model.LinearRegression()
model = regr.fit(x, y)

fig, ax = plt.subplots(figsize=(12,6))

# Draw the scatter plot and the fitted line on the same axes (the figure size is already set above).
sub_df.plot(kind='scatter', x="Teaching Score", y="OverAll Score", grid=True, fontsize=10, ax=ax, alpha=0.1)
ax.plot(x, regr.predict(x), color='red')

Maybe print the parameters for the regression-line?

In [None]:
print(f"a= {model.coef_[0][0]}, b= {model.intercept_[0]}")

And there is our linear model. Now we use it to predict the overall score for every teaching score. Once we have this predicted overall score, we use it to calculate the residuals (or the error) for every actual value.

In [None]:
sub_df['overall_score_predicted'] = model.predict(sub_df["Teaching Score"].values.reshape(-1, 1))
sub_df['overall_score_residuals'] = sub_df["OverAll Score"] - sub_df['overall_score_predicted']

# df.head()
sub_df.plot(kind='scatter', x="Teaching Score", y="overall_score_residuals", grid=True,fontsize=10, figsize=(12, 6), alpha=0.5)

What we have here is a plot of the teaching score vs the residuals of the overall score. Because the linear effect of the teaching score has been subtracted, the teaching score should have no predictive value for the residuals anymore, and that is indeed the case: the data looks almost random.
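
That the predictor loses its predictive value is a general property of least-squares residuals, not something specific to this dataset. The sketch below illustrates it on synthetic data. The slope, intercept and noise level are arbitrary assumptions, and `np.polyfit` is used here in place of scikit-learn only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two columns (arbitrary, made-up relationship).
teaching = rng.uniform(10, 90, size=500)
overall = 0.8 * teaching + 5 + rng.normal(0, 4, size=500)

# Fit a straight line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(teaching, overall, deg=1)
residuals = overall - (slope * teaching + intercept)

corr_before = np.corrcoef(teaching, overall)[0, 1]
corr_after = np.corrcoef(teaching, residuals)[0, 1]
print(f"correlation with overall score: {corr_before:.3f}")
print(f"correlation with residuals:     {corr_after:.3f}")
```

The correlation between the predictor and the residuals is zero up to rounding error, so any remaining pattern in the residuals has to come from other variables or from non-linear effects.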

Well, not quite. There are still some lines left because the research score and the citations score are still in there. But if we plot the residuals of the overall score vs the international students, what do we get?

In [None]:
sub_df.plot(kind='scatter', y="International Student", x="overall_score_residuals", grid=True,fontsize=10, figsize=(12, 6), alpha=0.5)

We see that:
- Most schools have a low percentage of international students (< 20%). They have both positive and negative residuals, so both bad and good scores.
- Above 20% international students, we hardly see any residuals below 0 anymore. This means that there is indeed a trend that more international students mean better schools.

But was all this necessary? Couldn't we have just made the same plot with the original data?

In [None]:
# DELETE

sub_df.plot(kind='scatter', y="International Student", x="OverAll Score", grid=True,fontsize=10, figsize=(12, 6), alpha=0.5)

The trend is indeed there, but less explicit than after applying the model. This should become even clearer when you remove the research and the citations score from the residuals.
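
Removing several scores at once can be sketched as follows. This is an illustration on synthetic data only: the weights and noise levels are made up and are not the real ranking formula, and `residuals_after` is a hypothetical helper that fits an ordinary least-squares model with NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic columns with made-up relationships (not the real dataset).
teaching = rng.normal(50, 15, n)
research = 0.7 * teaching + rng.normal(0, 10, n)
citations = rng.normal(60, 20, n)
international = rng.uniform(0, 40, n)
overall = (0.3 * teaching + 0.3 * research + 0.3 * citations
           + 0.1 * international + rng.normal(0, 2, n))

def residuals_after(target, *predictors):
    """Residuals of a least-squares fit of target on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(target)), *predictors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

res_one = residuals_after(overall, teaching)
res_three = residuals_after(overall, teaching, research, citations)

corr_raw = np.corrcoef(international, overall)[0, 1]
corr_one = np.corrcoef(international, res_one)[0, 1]
corr_three = np.corrcoef(international, res_three)[0, 1]
print(f"raw: {corr_raw:.2f}, teaching removed: {corr_one:.2f}, three removed: {corr_three:.2f}")
```

In this synthetic setup, the more of the other scores are removed, the less unrelated variation is left in the residuals, so the effect of the remaining variable stands out more clearly.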