### Technical Report
#### GLOBAL 2000 LIST BY THE CENTER FOR WORLD UNIVERSITY RANKINGS
Author: Jonathan Kateega\
Access No.: B34976

In [8]:
# Packages
import pandas as pd
import sqlite3

#### Data Profile

##### Raw Data

In [33]:
df_raw = pd.read_csv("data/cwur_2025.csv")
df_raw.head(5)

| | World Rank | Institution | Location | National Rank | Education Rank | Employability Rank | Faculty Rank | Research Rank | Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 Top 0.1% | Harvard University | USA | 1 | 1 | 1 | 1 | 1 | 100.0 |
| 1 | 2 Top 0.1% | Massachusetts Institute of Technology | USA | 2 | 4 | 12 | 2 | 11 | 96.8 |
| 2 | 3 Top 0.1% | Stanford University | USA | 3 | 10 | 4 | 3 | 4 | 95.2 |
| 3 | 4 Top 0.1% | University of Cambridge | United Kingdom | 1 | 2 | 26 | 4 | 14 | 94.1 |
| 4 | 5 Top 0.1% | University of Oxford | United Kingdom | 2 | 7 | 28 | 9 | 6 | 93.3 |


In [34]:
# Viewing the datatypes
print(df_raw.dtypes)

rows, cols = df_raw.shape
print(f"The dataset has {rows} samples and {cols} features.")
print(f"Features: {', '.join(df_raw.columns)}")

World Rank             object
Institution            object
Location               object
National Rank           int64
Education Rank         object
Employability Rank     object
Faculty Rank           object
Research Rank          object
Score                 float64
dtype: object
The dataset has 2000 samples and 9 features.
Features: World Rank, Institution, Location, National Rank, Education Rank, Employability Rank, Faculty Rank, Research Rank, Score


In [35]:
# Checking for duplicates in the institution column
dup_raw = (df_raw.duplicated(subset=["Institution"]).sum())
print(f"Number of duplicates in the raw data: {dup_raw}")

Number of duplicates in the raw data: 0


The raw dataset extracted from the CWUR website (World University Rankings 2025, https://cwur.org) contains 2000 samples and 9 features. For each sample (institution), there are eight observations: (1) location, (2) national rank, (3) education rank, (4) employability rank, (5) faculty rank, (6) research rank, (7) score, and (8) world rank.

Of the 9 features, only National Rank and Score are numerical (integer and float, respectively). The remaining rank columns (World Rank, Education Rank, Employability Rank, Faculty Rank, and Research Rank) hold values that appear numerical but are stored with the object datatype. In this form, analysis and comparison are impeded, and the columns must be converted to numerical types before insights can be extracted. Datatype conversion comes later in the report.
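
A minimal sketch of what such a conversion could look like, assuming that the "-" placeholder should become NaN and that World Rank always starts with the integer rank followed by a "Top x%" label, as in the preview. The toy frame `df_demo` and the name `rank_cols` are illustrative and not part of the original pipeline.

```python
import pandas as pd

# Toy rows mimicking the raw layout (values taken from rows shown in this report)
df_demo = pd.DataFrame({
    "World Rank": ["1 Top 0.1%", "2 Top 0.1%", "46 Top 0.3%"],
    "Education Rank": ["1", "4", "-"],
    "Faculty Rank": ["1", "2", "-"],
})

rank_cols = ["Education Rank", "Faculty Rank"]

# errors="coerce" turns anything non-numeric (here "-") into NaN
df_demo[rank_cols] = df_demo[rank_cols].apply(pd.to_numeric, errors="coerce")

# Keep only the leading integer of "World Rank" (assumed format: "<rank> Top <x>%")
df_demo["Rank Number"] = (
    df_demo["World Rank"].str.extract(r"^(\d+)", expand=False).astype(int)
)

print(df_demo.dtypes)
```

Because NaN is a float, the coerced rank columns end up as float64 rather than int64.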

In [36]:
# Handling missing data
# Checking for null values
null_counts = df_raw.isnull().sum()
print(null_counts)

World Rank             0
Institution            0
Location               0
National Rank          0
Education Rank         0
Employability Rank     0
Faculty Rank          29
Research Rank          0
Score                  0
dtype: int64


In [37]:
# Expressing missing data as a percentage
null_percentage = df_raw.isnull().mean() * 100
print(null_percentage)

World Rank            0.00
Institution           0.00
Location              0.00
National Rank         0.00
Education Rank        0.00
Employability Rank    0.00
Faculty Rank          1.45
Research Rank         0.00
Score                 0.00
dtype: float64


Null values\
Before any adjustments to the raw data, there are 29 empty cells in the "Faculty Rank" column, representing 1.45% of the samples. Since the count of missing values is low relative to the 2000 samples, I judge that these rows can safely be dropped from the dataframe.
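
A hedged sketch of the drop on a toy frame, assuming only "Faculty Rank" should trigger row removal. Passing `subset` to `dropna` makes that intent explicit, whereas a bare `dropna()` removes rows with a null in any column. The frame `df_demo` and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: one true null and one "-" placeholder
df_demo = pd.DataFrame({
    "Institution": ["A", "B", "C"],
    "Faculty Rank": ["1", np.nan, "-"],
    "Score": [100.0, 90.0, 80.0],
})

# Drop only rows whose "Faculty Rank" is null; "-" is a string, so it is kept
df_demo_no_null = df_demo.dropna(subset=["Faculty Rank"])

print(f"Rows before: {len(df_demo)}, after: {len(df_demo_no_null)}")
```

Note that `dropna` only recognises true nulls, so any textual placeholder survives this step.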

In [40]:
df_no_null = df_raw.dropna()
print(df_no_null.isnull().sum())

cols_to_convert = ["Education Rank", "Employability Rank", "Faculty Rank", "Research Rank"]

for col in cols_to_convert:
    print(f"\n--- {col} ---")
    counts = df_no_null[col].value_counts(dropna=False)
    percentages = df_no_null[col].value_counts(normalize=True, dropna=False) * 100
    summary = pd.DataFrame({"Count": counts, "Percentage": percentages.round(2)})
    print(summary)

World Rank            0
Institution           0
Location              0
National Rank         0
Education Rank        0
Employability Rank    0
Faculty Rank          0
Research Rank         0
Score                 0
dtype: int64

--- Education Rank ---
                Count  Percentage
Education Rank                   
-                1526       77.42
5                   1        0.05
16                  1        0.05
12                  1        0.05
6                   1        0.05
...               ...         ...
74                  1        0.05
27                  1        0.05
66                  1        0.05
250                 1        0.05
2                   1        0.05

[446 rows x 2 columns]

--- Employability Rank ---
                    Count  Percentage
Employability Rank                   
-                     950       48.20
1672                    2        0.10
1715                    2        0.10
26                      1        0.05
28                      1

After confirming that no nulls remain in the "Faculty Rank" column, I find that the columns "Education Rank" (1,526 or 77.42%), "Employability Rank" (950 or 48.20%), "Faculty Rank" (1,693 or 85.90%), and "Research Rank" (68 or 3.45%) contain the value "-", which I interpret as missing data. This placeholder is likely the reason these columns are stored with the object datatype rather than a numerical one.
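
The placeholder counts can also be obtained directly, without scanning the full `value_counts` output. A minimal sketch on a toy frame, assuming "-" is the only placeholder in use; `df_demo` and `dash_summary` are illustrative names.

```python
import pandas as pd

# Illustrative string-typed rank columns containing the "-" placeholder
df_demo = pd.DataFrame({
    "Education Rank": ["1", "-", "-", "7"],
    "Research Rank": ["1", "11", "-", "6"],
})

# Compare every cell with the placeholder, then aggregate per column
is_dash = df_demo.eq("-")
dash_summary = pd.DataFrame({
    "Count": is_dash.sum(),
    "Percentage": (is_dash.mean() * 100).round(2),
})

print(dash_summary)
```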

##### Clean Data

In [None]:
# variables for reading data from the sqlite db file
db_file = "data/rankings.db"
table_name = "university_rankings"

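# Read from SQLite. A single connection is opened and closed explicitly,
# because sqlite3's own context manager only manages transactions.
conn = sqlite3.connect(db_file)
try:
    df_cleaned = pd.read_sql(f"SELECT * FROM {table_name}", conn)
finally:
    conn.close()
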
df_cleaned.head(5)

| | Rank Number | Institution | Country | National Rank | Education Rank | Employability Rank | Faculty Rank | Research Rank | Score | Global_Region | Overall_Score_Normalized |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harvard University | USA | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 100.0 | America | 1.0 |
| 1 | 2 | Massachusetts Institute of Technology | USA | 2 | 4.0 | 12.0 | 2.0 | 11.0 | 96.8 | America | 0.905325 |
| 2 | 3 | Stanford University | USA | 3 | 10.0 | 4.0 | 3.0 | 4.0 | 95.2 | America | 0.857988 |
| 3 | 4 | University of Cambridge | United Kingdom | 1 | 2.0 | 26.0 | 4.0 | 14.0 | 94.1 | Europe | 0.825444 |
| 4 | 5 | University of Oxford | United Kingdom | 2 | 7.0 | 28.0 | 9.0 | 6.0 | 93.3 | Europe | 0.801775 |
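
The `Overall_Score_Normalized` values in this preview appear consistent with min-max scaling of `Score`. The sketch below checks that reading under two assumptions: the minimum score in the cleaned data is 66.2 (the lowest score visible in this report) and the maximum is 100.0. The cleaning code is not shown here, so this is an inference rather than the confirmed method.

```python
import pandas as pd

# Scores of the five institutions in the preview
scores = pd.Series([100.0, 96.8, 95.2, 94.1, 93.3], name="Score")

# Assumed bounds: not confirmed by the cleaning code
score_min, score_max = 66.2, 100.0

# Min-max scaling maps score_min to 0 and score_max to 1
normalized = (scores - score_min) / (score_max - score_min)

print(normalized.round(6).tolist())
```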


In [45]:
# Viewing the datatypes
df_cleaned.info()

rows, cols = df_cleaned.shape
print(f"The dataset has {rows} samples and {cols} features.")
print(f"Features: {', '.join(df_cleaned.columns)}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1971 entries, 0 to 1970
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Rank Number               1971 non-null   int64  
 1   Institution               1971 non-null   object 
 2   Country                   1971 non-null   object 
 3   National Rank             1971 non-null   int64  
 4   Education Rank            445 non-null    float64
 5   Employability Rank        1021 non-null   float64
 6   Faculty Rank              278 non-null    float64
 7   Research Rank             1903 non-null   float64
 8   Score                     1971 non-null   float64
 9   Global_Region             1971 non-null   object 
 10  Overall_Score_Normalized  1971 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 169.5+ KB
The dataset has 1971 samples and 11 features.
Features: Rank Number, Institution, Country, National Rank, Educatio

The cleaned dataset contains 1971 samples (29 fewer than the raw data, after removing rows where "Faculty Rank" is null) and 11 features (increased from 9 in the raw data). Each member of the population is identified by the combination of Institution, Country, and Global_Region, which is always unique. The population is compared across the observations (1) Rank Number, (2) National Rank, (3) Education Rank, (4) Employability Rank, (5) Faculty Rank, (6) Research Rank, (7) Score, and (8) Overall_Score_Normalized.
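
The uniqueness of the composite key can be verified with `duplicated`. A minimal sketch on a toy frame that reuses the column names of the cleaned data; `df_demo` and `key_cols` are illustrative names.

```python
import pandas as pd

# Three rows taken from the preview of the cleaned data
df_demo = pd.DataFrame({
    "Institution": ["Harvard University", "University of Cambridge", "University of Oxford"],
    "Country": ["USA", "United Kingdom", "United Kingdom"],
    "Global_Region": ["America", "Europe", "Europe"],
})

key_cols = ["Institution", "Country", "Global_Region"]

# keep=False flags every member of a duplicated group, not just the repeats
dup_keys = df_demo.duplicated(subset=key_cols, keep=False).sum()

print(f"Rows sharing a composite key: {dup_keys}")
```

A result of 0 means every (Institution, Country, Global_Region) combination occurs exactly once.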

In [None]:
# Checking the cleaned dataframe for Null values
null_counts_clean = df_cleaned.isnull().sum()
print(null_counts_clean)

Rank Number                    0
Institution                    0
Country                        0
National Rank                  0
Education Rank              1526
Employability Rank           950
Faculty Rank                1693
Research Rank                 68
Score                          0
Global_Region                  0
Overall_Score_Normalized       0
dtype: int64


In [None]:
# Expressing missing data as a percentage
null_percentage_clean = df_cleaned.isnull().mean() * 100
print(null_percentage_clean)

Rank Number                  0.000000
Institution                  0.000000
Country                      0.000000
National Rank                0.000000
Education Rank              77.422628
Employability Rank          48.198884
Faculty Rank                85.895485
Research Rank                3.450025
Score                        0.000000
Global_Region                0.000000
Overall_Score_Normalized     0.000000
dtype: float64


After converting the categorical columns "Education Rank", "Employability Rank", "Faculty Rank", and "Research Rank" to numerical types, I discovered that these too have missing data:
- Education Rank is missing 77.4% of the time
- Employability Rank is missing 48.2% of the time
- Faculty Rank is missing 85.9% of the time
- Research Rank is missing 3.45% of the time.

Was it the right thing to do to delete the samples where the Faculty Rank was null? I will revisit this in the wrangling notebook.

In [49]:
# Was the Faculty Rank really "-" on so many records before, even in the raw data? How did I miss this?
df_raw[df_raw["Faculty Rank"] == "-"]

| | World Rank | Institution | Location | National Rank | Education Rank | Employability Rank | Faculty Rank | Research Rank | Score |
|---|---|---|---|---|---|---|---|---|---|
| 36 | 37 Top 0.2% | Tsinghua University | China | 1 | 369 | 50 | - | 13 | 86.0 |
| 43 | 44 Top 0.3% | Peking University | China | 2 | 379 | 52 | - | 19 | 85.3 |
| 45 | 46 Top 0.3% | University of Chinese Academy of Sciences | China | 3 | - | 1216 | - | 2 | 85.1 |
| 60 | 61 Top 0.3% | Shanghai Jiao Tong University | China | 4 | - | 118 | - | 21 | 84.0 |
| 72 | 73 Top 0.4% | Fudan University | China | 6 | 363 | 84 | - | 42 | 83.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1996 Top 9.4% | Hunan University of Technology | China | 344 | - | - | - | 1920 | 66.2 |
| 1996 | 1997 Top 9.4% | Guizhou Normal University | China | 345 | - | - | - | 1922 | 66.2 |
| 1997 | 1998 Top 9.4% | Bengbu Medical University | China | 346 | - | - | - | 1924 | 66.2 |
| 1998 | 1999 Top 9.4% | Federal University of Amazonas | Brazil | 53 | - | - | - | 1925 | 66.2 |


In [50]:
df_raw["Faculty Rank"].value_counts(dropna=False)

Faculty Rank
-      1693
NaN      29
109       3
282       2
280       2
       ... 
40        1
24        1
48        1
49        1
3         1
Name: count, Length: 269, dtype: int64

In the previous steps, deleting the 29 rows where Faculty Rank is NaN was not the correct thing to do. There are far more rows (1,693) where the Faculty Rank is "-". Given the nature of this dataset, I am interpreting both these values as:
- Ranking not assigned, or
- Score unavailable or institution not evaluated.
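
One way to apply this interpretation at load time is to declare "-" as a missing-value marker when reading the CSV, so that both "-" and empty cells become NaN and the rank columns parse as numbers. A minimal sketch that uses an in-memory CSV in place of the real file; `na_values` is a standard `pd.read_csv` parameter, while the sample rows (including the hypothetical "Example University") are illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for the raw CSV; "Example University" is a made-up row
csv_text = """Institution,Faculty Rank,Research Rank
Harvard University,1,1
Tsinghua University,-,13
Example University,,1900
"""

# Treat "-" as missing in addition to the default markers (such as empty cells)
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

print(df_demo.dtypes)
print(df_demo["Faculty Rank"].isna().sum())
```

With this approach no rows need to be dropped, and the two kinds of missing value are handled uniformly.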

In [None]:
df_cleaned[["Education Rank", "Employability Rank", "Faculty Rank", "Research Rank"]].describe()

| | Education Rank | Employability Rank | Faculty Rank | Research Rank |
|---|---|---|---|---|
| count | 445.0 | 1021.0 | 278.0 | 1903.0 |
| mean | 275.523596 | 805.795299 | 142.017986 | 964.773516 |
| std | 170.208358 | 523.577292 | 83.550303 | 557.917757 |
| min | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 120.0 | 349.0 | 70.25 | 481.5 |
| 50% | 268.0 | 786.0 | 140.5 | 964.0 |
| 75% | 430.0 | 1251.0 | 213.75 | 1447.5 |
| max | 566.0 | 1753.0 | 290.0 | 1996.0 |


Hypothesis Testing
- Null Hypothesis (H0) - missingness is independent of the Region
- Alternative Hypothesis (Ha) - missingness is statistically related to the Region

I choose to use the Chi-Square test of independence, first introducing a binary column (1, 0) to indicate missing or not missing, and then testing that column against Global_Region.

If p < 0.05, reject H0: missingness depends on region (not random).\
If p ≥ 0.05, fail to reject H0: no significant association found (possibly random).
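
For reference, the Chi-Square statistic compares the observed counts $O_{ij}$ in the region-by-missingness contingency table with the counts $E_{ij}$ expected if region and missingness were independent:

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{n}
$$

Here $R_i$ is the total of row $i$ (one region), $C_j$ is the total of column $j$ (missing or not missing), and $n$ is the number of samples. The statistic has $(r - 1)(c - 1)$ degrees of freedom, which reduces to $r - 1$ for the two-column tables used here.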

In [None]:
from scipy.stats import chi2_contingency

# Creating a boolean flag for missingness (1 = missing, 0 = present)
df_cleaned["FacultyRank_missing"] = df_cleaned["Faculty Rank"].isna().astype(int)

contingency = pd.crosstab(df_cleaned["Global_Region"], df_cleaned["FacultyRank_missing"])
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-square test p-value: {p:.5f}")
if p < 0.05:
    print("Reject H0 (independence): Missing 'Faculty Rank' is statistically related to the Region (not random).")
else:
    print("Fail to reject H0 (independence): Missing 'Faculty Rank' has no statistically significant relationship with the Region (possibly random).")

Chi-square test p-value: 0.00000
Reject H0 (independence): Missing 'Faculty Rank' is statistically related to the Region (not random).


In [None]:
# Creating a boolean flag for missingness (1 = missing, 0 = present)
df_cleaned["ResearchRank_missing"] = df_cleaned["Research Rank"].isna().astype(int)

contingency = pd.crosstab(df_cleaned["Global_Region"], df_cleaned["ResearchRank_missing"])
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-square test p-value: {p:.5f}")
if p < 0.05:
    print("Reject H0 (independence): Missing 'Research Rank' is statistically related to the Region (not random).")
else:
    print("Fail to reject H0 (independence): Missing 'Research Rank' has no statistically significant relationship with the Region (possibly random).")

Chi-square test p-value: 0.00001
Reject H0 (independence): Missing 'Research Rank' is statistically related to the Region (not random).


In [None]:
# Creating a boolean flag for missingness (1 = missing, 0 = present)
df_cleaned["EmployabilityRank_missing"] = df_cleaned["Employability Rank"].isna().astype(int)

contingency = pd.crosstab(df_cleaned["Global_Region"], df_cleaned["EmployabilityRank_missing"])
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-square test p-value: {p:.5f}")
if p < 0.05:
    print("Reject H0 (independence): Missing 'Employability Rank' is statistically related to the Region (not random).")
else:
    print("Fail to reject H0 (independence): Missing 'Employability Rank' has no statistically significant relationship with the Region (possibly random).")

Chi-square test p-value: 0.00000
Reject H0 (independence): Missing 'Employability Rank' is statistically related to the Region (not random).


In [None]:
# Creating a boolean flag for missingness (1 = missing, 0 = present)
df_cleaned["EducationRank_missing"] = df_cleaned["Education Rank"].isna().astype(int)

contingency = pd.crosstab(df_cleaned["Global_Region"], df_cleaned["EducationRank_missing"])
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-square test p-value: {p:.5f}")
if p < 0.05:
    print("Reject H0 (independence): Missing 'Education Rank' is statistically related to the Region (not random).")
else:
    print("Fail to reject H0 (independence): Missing 'Education Rank' has no statistically significant relationship with the Region (possibly random).")

Chi-square test p-value: 0.00000
Reject H0 (independence): Missing 'Education Rank' is statistically related to the Region (not random).
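
The four cells above repeat the same steps, so they could be folded into one helper. A minimal sketch, assuming the null hypothesis is independence between missingness and the grouping column; the function name `missingness_vs_group` and the synthetic frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def missingness_vs_group(df, value_col, group_col, alpha=0.05):
    """Chi-square test of independence between missingness of value_col and group_col."""
    is_missing = df[value_col].isna().astype(int)  # 1 = missing, 0 = present
    contingency = pd.crosstab(df[group_col], is_missing)
    chi2, p, dof, _ = chi2_contingency(contingency)
    return {
        "column": value_col,
        "chi2": chi2,
        "dof": dof,
        "p_value": p,
        "reject_H0": bool(p < alpha),  # True means missingness depends on the group
    }


# Synthetic data: every value is missing in region "A", none in region "B"
df_demo = pd.DataFrame({
    "Global_Region": ["A"] * 50 + ["B"] * 50,
    "Faculty Rank": [np.nan] * 50 + list(range(1, 51)),
})

result = missingness_vs_group(df_demo, "Faculty Rank", "Global_Region")
print(result)
```

Applied to the cleaned data, a call such as `pd.DataFrame([missingness_vs_group(df_cleaned, c, "Global_Region") for c in cols_to_convert])` would collect all four tests in one table, and it avoids adding the temporary `*_missing` columns to `df_cleaned`.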
