# Analysis of World University Rankings 2020

### Content
+ Introduction: The World University Rankings
+ Data description and objectives
+ Formulation of research question
+ Data preparation: cleaning and shaping

## 1. Introduction: The World University Rankings

The World University Rankings is an annual publication of university rankings by Times Higher Education (THE) magazine; this analysis uses the 2020 edition. The 2020 ranking features almost 1,400 universities across 92 countries, making it the largest and most diverse university ranking to date. The rankings aim to help prospective students identify the leading schools worldwide.

Times Higher Education has been providing performance data on universities for students since 2004. The table is based on 13 carefully calibrated performance indicators. Each university is judged in five areas that cover the core missions of world-class, global universities: Teaching, Research, Citations, Industry Income and International Outlook. The five scores in these categories for each ranked university are available under “performance breakdown”. In addition to browsing universities by overall rank or searching for a specific institution, we can therefore sort the table by any of the five metrics to make a decision based on particular priorities.

The ranking is trusted worldwide by students, teachers, governments and industry experts as it provides great insight into the shifting balance of power in global higher education. Moreover, all the data is independently audited by professional services firm PricewaterhouseCoopers (PwC), making the Times Higher Education World University Rankings the only global university rankings to be subjected to full, independent scrutiny of this nature. 

* Source: https://www.timeshighereducation.com/world-university-rankings
* Source: https://www.kaggle.com/joeshamen/world-university-rankings-2020
* Source: https://www.timeshighereducation.com/student/advice/world-university-rankings-explained

## 2. Data description and objectives

Given the description of the World University Rankings above, some universities are very likely to appear in the list, for instance the University of Oxford, the University of Cambridge and Tsinghua University. These universities are among the most prestigious higher education institutions and are well known for environments that foster competitiveness, innovation and creativity. They also perform strongly on most of the core criteria. Therefore, in this analysis project I am keen to compare the features that distinguish the best universities.

The analysis is based on data for the year 2020, the most recent edition from Times Higher Education (THE) at the time of writing. The dataset contains the following columns:

+ Rank_Char: ranking according to Times Higher Education (varchar)
+ Score_Rank: ranking according to the column "Score_Result" (number)
+ University: name of the university
+ Country: country of the university
+ Number_students: number of students
+ Numb_students_per_Staff: ratio of students to staff
+ International_Students: percentage of international students
+ Percentage_Female: percentage of women
+ Percentage_Male: percentage of men
+ Teaching: score in teaching
+ Research: score in research
+ Citations: score in citations
+ Industry_Income: score in industry income
+ International_Outlook: score in international outlook
+ Score_Result: resulting score

Note: The score result has been calculated according to the weights described below:

| Variable | Description |Weight |
| -----------|------------|----------------|
| Teaching | The learning environment  | 30% |
| Research | Volume, income and reputation | 30% |
| Citations | Research influence | 30% |
| International outlook | International staff and students | 7.5% |
| Industry income | Income innovation | 2.5% |
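
As a quick check of the weights in the table, the following minimal sketch recomputes the resulting score for the top-ranked row of the dataset (University of Oxford). The helper name `overall_score` and the `WEIGHTS` dictionary are illustrative assumptions, not part of the dataset or of any THE tooling; the five input scores are the values reported for that row.

```python
# Weights from the table above, expressed as fractions of the resulting score.
WEIGHTS = {
    "Teaching": 0.30,
    "Research": 0.30,
    "Citations": 0.30,
    "International_Outlook": 0.075,
    "Industry_Income": 0.025,
}


def overall_score(scores):
    """Return the weighted sum of the five scores, rounded to one decimal place."""
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 1)


# Scores of the top-ranked row in the dataset (University of Oxford).
oxford = {
    "Teaching": 90.5,
    "Research": 99.6,
    "Citations": 98.4,
    "International_Outlook": 96.4,
    "Industry_Income": 65.5,
}

print(overall_score(oxford))  # 95.4, which matches Score_Result for this row
```

The weighted sum is 27.15 + 29.88 + 29.52 + 7.23 + 1.6375 = 95.4175, which rounds to the reported 95.4.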

## 3. Formulation of research question

For this project, the research questions are as follows:
1. Analyze the top universities by countries and continents
2. Analyze the teaching (academic environment) performance by institutions
3. Analyze the international outlook: ratio of international to domestic staff/students
4. Analyze the female to male ratio by the total number of students
5. Analyze the relationship between variables in the dataset

## 4. Data preparation: cleaning and shaping

### 4.1. Dataset information
The "World University Rankings 2020" dataset was downloaded from the Kaggle platform. The file is in .csv (comma-separated) format and contains 1396 rows, covering 1395 unique university names, and 16 columns.

In [61]:
# import all modules that will be used
import numpy as np
import pandas as pd

In [69]:
# read csv file
uni_rank = pd.read_csv('Word_University_Rank_2020.csv')

# drop the redundant 'Overall_Ranking' column, since it has the same values as the 'Score_Result' column
uni_rank = uni_rank.drop(columns=['Overall_Ranking'])

In [70]:
# information including the existing variables and their data types
# uni_rank.dtypes
uni_rank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1396 entries, 0 to 1395
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Rank_Char                1396 non-null   object 
 1   Score_Rank               1396 non-null   int64  
 2   University               1396 non-null   object 
 3   Country                  1396 non-null   object 
 4   Number_students          1396 non-null   object 
 5   Numb_students_per_Staff  1396 non-null   float64
 6   International_Students   1396 non-null   object 
 7   Percentage_Female        1396 non-null   object 
 8   Percentage_Male          1396 non-null   object 
 9   Teaching                 1396 non-null   float64
 10  Research                 1396 non-null   float64
 11  Citations                1396 non-null   float64
 12  Industry_Income          1396 non-null   float64
 13  International_Outlook    1396 non-null   float64
 14  Score_Result            

In [71]:
# the shape: (rows, columns)
uni_rank.shape

(1396, 15)

In [72]:
# the length
len(uni_rank)

1396

In [73]:
# basic descriptive statistics for all numeric columns
uni_rank.describe()

| | Score_Rank | Numb_students_per_Staff | Teaching | Research | Citations | Industry_Income | International_Outlook | Score_Result |
|---|---|---|---|---|---|---|---|---|
| count | 1396.0 | 1396.0 | 1396.0 | 1396.0 | 1396.0 | 1396.0 | 1396.0 | 1396.0 |
| mean | 315.304441 | 18.966905 | 28.229083 | 23.98116 | 48.113109 | 46.477292 | 47.114542 | 34.794054 |
| std | 140.946223 | 16.835492 | 14.14955 | 17.537044 | 27.735626 | 16.273498 | 23.288723 | 16.946075 |
| min | 1.0 | 0.9 | 11.2 | 6.8 | 1.7 | 34.4 | 13.1 | 10.7 |
| 25% | 212.0 | 12.375 | 18.3 | 11.6 | 23.375 | 35.775 | 27.475 | 21.0 |
| 50% | 336.0 | 16.35 | 23.8 | 18.0 | 45.65 | 39.4 | 43.1 | 31.6 |
| 75% | 437.0 | 21.9 | 33.6 | 30.1 | 71.95 | 49.825 | 62.8 | 44.5 |
| max | 535.0 | 493.5 | 92.8 | 99.6 | 100.0 | 100.0 | 99.7 | 95.4 |


From the output above, several conclusions can be made:
1. Every numeric column contains 1396 records
2. The mean resulting score is 34.8 out of a possible 100
3. The minimum resulting score is 10.7 and the maximum is 95.4 out of a possible 100

In [89]:
# describe() summarizes only numeric columns by default; the include parameter selects other data types
uni_rank.describe(include='object')

| | Rank_Char | University | Country | Number_students | International_Students | Percentage_Female | Percentage_Male |
|---|---|---|---|---|---|---|---|
| count | 1396 | 1396 | 1396 | 1396 | 1396 | 1396 | 1396 |
| unique | 145 | 1395 | 92 | 1377 | 63 | 74 | 74 |
| top | 1001+ | Northeastern University | United States | 46683 | 1% | 55% | 45% |
| freq | 395 | 2 | 172 | 2 | 150 | 77 | 77 |
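
The object columns summarized above store numbers as text (for example a student count of "20,664" or a share of "41%"), so they need converting before any numeric analysis of student numbers or female-to-male ratios. The sketch below shows one possible conversion on a small toy frame. The toy values are illustrative, and it assumes every cell follows these two patterns; real data may contain other formats, in which case `pd.to_numeric(..., errors="coerce")` is one way to turn unparseable cells into NaN.

```python
import pandas as pd

# Toy rows shaped like the object columns of the dataset (illustrative values).
sample = pd.DataFrame({
    "Number_students": ["20,664", "2,240"],
    "International_Students": ["41%", "30%"],
    "Percentage_Female": ["46%", "34%"],
})

# Remove the thousands separator, then convert the text to integers.
sample["Number_students"] = (
    sample["Number_students"].str.replace(",", "", regex=False).astype(int)
)

# Strip the trailing percent sign and convert the text to numbers.
for col in ["International_Students", "Percentage_Female"]:
    sample[col] = pd.to_numeric(sample[col].str.rstrip("%"))

print(sample.dtypes)
```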


In [90]:
# let's take a look at the top institution's data
print("\nThe top 1 institution's data:\n", uni_rank.iloc[0])


The top 1 institution's data:
 Rank_Char                                     1
Score_Rank                                    1
University                 University of Oxford
Country                          United Kingdom
Number_students                          20,664
Numb_students_per_Staff                    11.2
International_Students                      41%
Percentage_Female                           46%
Percentage_Male                             54%
Teaching                                   90.5
Research                                   99.6
Citations                                  98.4
Industry_Income                            65.5
International_Outlook                      96.4
Score_Result                               95.4
Name: 0, dtype: object


### 4.2. Data manipulation: cleaning and shaping
First, we create a working data frame from the university data loaded in the previous step.

Even though the dataset is already in usable shape, it is better to do the following:
+ round decimal numbers to integers (if necessary);
+ check for missing values and replace them with NaN if there are any;
+ check for duplicates.

In [76]:
# create a working copy of the loaded data so that later changes do not affect uni_rank
df = uni_rank.copy()
df.head()

| | Rank_Char | Score_Rank | University | Country | Number_students | Numb_students_per_Staff | International_Students | Percentage_Female | Percentage_Male | Teaching | Research | Citations | Industry_Income | International_Outlook | Score_Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | University of Oxford | United Kingdom | 20664 | 11.2 | 41% | 46% | 54% | 90.5 | 99.6 | 98.4 | 65.5 | 96.4 | 95.4 |
| 1 | 2 | 2 | California Institute of Technology | United States | 2240 | 6.4 | 30% | 34% | 66% | 92.1 | 97.2 | 97.9 | 88.0 | 82.5 | 94.5 |
| 2 | 3 | 3 | University of Cambridge | United Kingdom | 18978 | 10.9 | 37% | 47% | 53% | 91.4 | 98.7 | 95.8 | 59.3 | 95.0 | 94.4 |
| 3 | 4 | 4 | Stanford University | United States | 16135 | 7.3 | 23% | 43% | 57% | 92.8 | 96.4 | 99.9 | 66.2 | 79.5 | 94.3 |
| 4 | 5 | 5 | Massachusetts Institute of Technology | United States | 11247 | 8.6 | 34% | 39% | 61% | 90.5 | 92.4 | 99.5 | 86.9 | 89.0 | 93.6 |


In [83]:
# convert the students-per-staff ratio to the integer data type
# note: astype("int") truncates the fractional part (11.2 becomes 11) rather than rounding to the nearest integer
df = df.astype({"Numb_students_per_Staff": "int"})
df.head(1)

| | Rank_Char | Score_Rank | University | Country | Number_students | Numb_students_per_Staff | International_Students | Percentage_Female | Percentage_Male | Teaching | Research | Citations | Industry_Income | International_Outlook | Score_Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | University of Oxford | United Kingdom | 20664 | 11 | 41% | 46% | 54% | 90.5 | 99.6 | 98.4 | 65.5 | 96.4 | 95.4 |
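
Casting a float column with `astype("int")` drops the fractional part, which is not the same as rounding to the nearest integer. The short sketch below contrasts the two on three ratios taken from the descriptive statistics earlier in this section (11.2, 16.35 and 21.9); the variable names are illustrative.

```python
import pandas as pd

# Three students-per-staff ratios that appear in the outputs of this section.
ratios = pd.Series([11.2, 16.35, 21.9])

# astype(int) discards everything after the decimal point.
truncated = ratios.astype(int)

# round() first moves each value to the nearest whole number, then the cast is exact.
rounded = ratios.round().astype(int)

print(truncated.tolist())  # [11, 16, 21]
print(rounded.tolist())    # [11, 16, 22]
```

If the goal is the nearest whole number, `round()` before the cast is the safer choice; for Oxford's value of 11.2 both approaches give 11.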


In [84]:
# check if dataset is complete 
df_valid = pd.DataFrame()
df_valid["Variables"] = list(df.columns)
df_valid["No"] = list(df.count())
df_valid.head(16) 

| | Variables | No |
|---|---|---|
| 0 | Rank_Char | 1396 |
| 1 | Score_Rank | 1396 |
| 2 | University | 1396 |
| 3 | Country | 1396 |
| 4 | Number_students | 1396 |
| 5 | Numb_students_per_Staff | 1396 |
| 6 | International_Students | 1396 |
| 7 | Percentage_Female | 1396 |
| 8 | Percentage_Male | 1396 |
| 9 | Teaching | 1396 |


In [86]:
# check for missing values with isnull(), which marks each missing cell as True
# (df is not re-created from uni_rank here, so the integer conversion above is kept)
df.isnull()

| | Rank_Char | Score_Rank | University | Country | Number_students | Numb_students_per_Staff | International_Students | Percentage_Female | Percentage_Male | Teaching | Research | Citations | Industry_Income | International_Outlook | Score_Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1391 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1392 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1393 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1394 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


In [87]:
# to be exactly sure, let's find the sum of the missing values if any
df.isnull().sum()

Rank_Char                  0
Score_Rank                 0
University                 0
Country                    0
Number_students            0
Numb_students_per_Staff    0
International_Students     0
Percentage_Female          0
Percentage_Male            0
Teaching                   0
Research                   0
Citations                  0
Industry_Income            0
International_Outlook      0
Score_Result               0
dtype: int64

The isnull() method is used in the code above to check for missing values in a pandas DataFrame. It returns a DataFrame of Boolean values that are True wherever a value is missing (NaN or None). According to the output above, no column contains missing values, so there is no need to drop any rows or columns with the dropna() method.
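
To show what these checks look like when a value is actually missing, here is a minimal sketch on a toy frame with one deliberately empty cell. The frame and its values are hypothetical and are not taken from the rankings file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Teaching score.
toy = pd.DataFrame({
    "University": ["A", "B", "C"],
    "Teaching": [90.5, np.nan, 70.1],
})

# Number of missing cells in each column.
missing_per_column = toy.isnull().sum()

# Single True/False answer: is any cell in the whole frame missing?
has_missing = bool(toy.isnull().values.any())

# dropna() returns a new frame without the rows that contain missing values.
cleaned = toy.dropna()

print(missing_per_column)
print(has_missing, len(cleaned))
```

For the rankings data, every per-column count is 0, so the equivalent of `has_missing` would be False and `dropna()` would leave the frame unchanged.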

In [88]:
# check for duplicates
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1391    False
1392    False
1393    False
1394    False
1395    False
Length: 1396, dtype: bool

In [24]:
df.duplicated().sum()

0

The duplicated() method flags duplicate rows. It returns a Series of Boolean values that are True for every row that repeats an earlier row in full. According to the output above, there are no duplicate rows.
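
Note that duplicated() compares whole rows by default. The descriptive statistics in section 4.1 report 1395 unique university names across 1396 rows, with "Northeastern University" appearing twice, so a name can repeat even when no full row does. The sketch below shows how the `subset` and `keep` parameters surface such cases; the toy rows and their Country values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: the same university name with different details.
toy = pd.DataFrame({
    "University": [
        "Northeastern University",
        "Northeastern University",
        "University of Oxford",
    ],
    "Country": ["United States", "China", "United Kingdom"],
})

# Whole-row comparison: no row repeats another row in full.
full_row_duplicates = int(toy.duplicated().sum())

# Comparison on the name only; keep=False marks every occurrence, not just the repeats.
same_name = toy[toy.duplicated(subset=["University"], keep=False)]

print(full_row_duplicates)
print(same_name)
```

Running the same `subset` check on `df` would list the two rows that share a name, which can then be inspected to decide whether they are distinct institutions.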