# AIDM7330 Group Project

Group Name: Patrol Paw


# Introduction
This project uses basic data analysis methods to show how universities in Hong Kong and mainland China perform in the QS world rankings. It covers how the higher-ranked universities score on various academic indicators and how they are distributed geographically.

## Preparation

In [None]:
import pandas as pd
import folium
import json
from IPython.display import IFrame
import warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning)

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

## Connect the data

In [None]:
from google.colab import drive
drivePath = '/content/drive' #please do not change
drive.mount(drivePath)

Mounted at /content/drive


In [None]:
# Install the library on your environment
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=6e8fbed25b821b0a8fccafb341aae2e2203790afdf5a3cd84e7674847c4cc923
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget


In [None]:
# Import the library
import wget

# Setup URL and path variables
baseURL = 'https://kweakkk.github.io/'
doc = 'HK-2021-2023.csv'
fullURL = baseURL + doc

dataPath = drivePath + '/MyDrive/Colab Notebooks/data'

# Download the file
fileName1 = wget.download(fullURL, out=dataPath)

# Print the file name including the local path
print(fileName1)

In [None]:
# Setup URL and path variables
baseURL = 'https://kweakkk.github.io/'
doc = 'China_mainland_-2021-2023_1_.csv'
fullURL = baseURL + doc

dataPath = drivePath + '/MyDrive/Colab Notebooks/data'

# Download the file
fileName2 = wget.download(fullURL, out=dataPath)

# Print the file name including the local path
print(fileName2)

## Check the Data

In [None]:
#Hong Kong data
hk=pd.read_csv(fileName1)

In [None]:
#mainland data
mainland=pd.read_csv(fileName2)

In [None]:
hk.tail(5)

In [None]:
mainland.tail(5)

## Explore the data structure

In [None]:
hk.columns

In [None]:
mainland.columns

In [None]:
hk.info()

## Data exploration and visualization

#### Q1: What is the change in the rankings of QS 500 mainland Chinese universities in the world?

In [None]:
data = pd.read_csv(fileName2)

rank_change_data = data.pivot(index='University_name', columns='Year', values='World_rank')

# Calculate the rank change between 2021 and 2023
rank_change_data['Rank_Change_2021_to_2023'] = rank_change_data[2021] - rank_change_data[2023]

# Sort the universities based on rank change
rank_change_data_sorted = rank_change_data.sort_values(by='Rank_Change_2021_to_2023', ascending=False).reset_index()

rank_change_data_sorted.head()


In [None]:
sns.set(style="whitegrid")

plt.figure(figsize=(10, 7))
sns.barplot(data=rank_change_data_sorted, x='Rank_Change_2021_to_2023', y='University_name')
plt.title('Rank Change of Chinese Mainland Universities (2021-2023)')
plt.xlabel('Rank Change')
plt.ylabel('University')
plt.show()


Finding: Several universities improved their rankings markedly over the three-year period. Southern University of Science and Technology and Huazhong University of Science and Technology improved the most, rising by 97 and 90 places respectively. Shanghai University dropped by about 40 places, while the rankings of Nankai University and Fudan University remained relatively stable.
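
As a minimal sketch of the calculation above, the following cell uses a small made-up table to show how `pivot` turns long-format rows into one column per year, and how to read the sign of the difference. The university names and rank values are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Hypothetical long-format data: one row per university per year
toy = pd.DataFrame({
    "University_name": ["Uni A", "Uni A", "Uni B", "Uni B"],
    "Year": [2021, 2023, 2021, 2023],
    "World_rank": [300, 210, 150, 190],
})

# One row per university, one column per year
wide = toy.pivot(index="University_name", columns="Year", values="World_rank")

# Positive = improved (rank number got smaller), negative = dropped
wide["Rank_Change_2021_to_2023"] = wide[2021] - wide[2023]
print(wide.sort_values("Rank_Change_2021_to_2023", ascending=False))
```

Subtracting the 2023 rank from the 2021 rank keeps "moved up" positive, which matches the direction of the bars in the chart.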

#### Q2: What is the comprehensive score of universities in different cities from 2021 to 2023?

In [None]:
data = pd.read_csv(fileName2)
plt.figure(figsize=(14, 7))
sns.boxplot(x='Region', y='Overall_score', data=data)

plt.title('Overall score performance of universities in different cities (2021-2023)')
plt.xlabel('Region')
plt.ylabel('Overall_scores')

plt.grid(True)
plt.tight_layout()
plt.show()


Finding: The median overall scores of universities in Beijing and Shanghai are higher than those of the other cities, indicating that universities in these two cities perform better overall. The median score of universities in Hangzhou is around 70, which is similar to Shanghai, but the interquartile range (IQR) is narrower, indicating that the scores are more concentrated. For Jinan, the box is very short, indicating that the scores of universities in this city are both tightly concentrated and low.
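
The median and IQR read off the box plot can also be computed directly. The sketch below does this with `groupby` on a made-up table; the city names and scores are hypothetical, and in the notebook the same idea would be applied to the mainland data instead.

```python
import pandas as pd

# Hypothetical scores standing in for the real Region / Overall_score columns
toy = pd.DataFrame({
    "Region": ["City X"] * 4 + ["City Y"] * 4,
    "Overall_score": [60.0, 70.0, 80.0, 90.0, 68.0, 70.0, 72.0, 74.0],
})

grouped = toy.groupby("Region")["Overall_score"]

# Median = line inside the box, IQR = height of the box
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})
print(summary)
```

A small IQR next to a similar median is what "more concentrated" means in the finding above.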

#### Q3: What is the relationship between academic reputation and world rank?

In [None]:
mainland.plot(kind = 'scatter', x = 'World_rank', y = 'Academic_reputation', title = 'Rank vs. Academic_reputation (mainland)')
plt.show()

In [None]:
hk.plot(kind = 'scatter', x = 'World_rank', y = 'Academic_reputation', title = 'Rank vs. Academic_reputation (HK)')
plt.show()

Finding: On the whole, the scatter plots of the two regions show a negative correlation: the smaller the rank number (that is, the better the ranking), the higher the academic reputation score. This trend suggests that academic reputation plays an important part in university rankings.
However, there are outliers in both charts, so some universities have a high academic reputation but are not ranked as well as others. For example, on the HK scatter plot, universities ranked in the 50-100 range have an academic reputation of nearly 70 points, whereas universities in the top 50 have an academic reputation of about 60 points. This could have various causes, such as the other indicators that feed into the overall ranking or particular features of the ranking method.
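
To put a number on the trend seen in the scatter plots, one option is a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranked values; the five data points are hypothetical, and using Spearman here is a suggestion rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical values; the real columns live in the `hk` and `mainland` DataFrames
toy = pd.DataFrame({
    "World_rank": [15, 40, 80, 150, 300],
    "Academic_reputation": [98.0, 90.0, 92.0, 60.0, 35.0],
})

# Spearman correlation = Pearson correlation of the ranked values
rho = toy["World_rank"].rank().corr(toy["Academic_reputation"].rank())
print(f"Spearman rho: {rho:.2f}")
```

A value close to -1 means that a smaller rank number goes together with a higher reputation score, with the outliers pulling the value away from -1.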

#### Q4: What is the geographical distribution of university rankings in Hong Kong and the mainland?

In [None]:
hk.World_rank.value_counts()

In [None]:
mainland.World_rank.value_counts()

In [None]:
# Stacked histogram of HK world ranks, coloured by region

plt.figure(figsize=(10, 6))
sns.histplot(data=hk, x='World_rank', hue='Region', multiple='stack', bins=20, alpha=0.5, legend='brief')
plt.title('HK World_rank by region')
plt.show()



In [None]:
# Stacked histogram of mainland world ranks, coloured by region

plt.figure(figsize=(20, 11))
sns.histplot(data=mainland, x='World_rank', hue='Region', multiple='stack', bins=20, alpha=0.5, legend='brief')
plt.title('mainland World_rank by region')
plt.show()




Finding: Universities across Hong Kong's districts sit at higher ranking positions, which points to their strong standing globally. Mainland regions show a more dispersed distribution, reflecting wider variation in university rankings.
In Hong Kong, the top-ranked universities are concentrated in Pok Fu Lam and on Hong Kong Island, which suggests moderately centralized educational resources. On the mainland, rankings are spread more evenly across regions, with Beijing hosting the largest share of top-ranked universities: nine institutions in the top 100.
These differences may arise from regional variations in education policy, investment, and academic research levels.
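
A count such as "institutions in the top 100 per region" can be checked with a filter and a `groupby`. The sketch below uses made-up rows; the names, regions and ranks are hypothetical, and restricting the count to the latest year is an assumption about how the figure should be read.

```python
import pandas as pd

# Hypothetical rows with the same column names as the ranking tables
toy = pd.DataFrame({
    "University_name": ["Uni A", "Uni B", "Uni C", "Uni D"],
    "Region": ["City X", "City X", "City Y", "City Y"],
    "Year": [2023, 2023, 2023, 2023],
    "World_rank": [20, 95, 60, 250],
})

# Keep only the most recent year so each university is counted once
latest = toy[toy["Year"] == toy["Year"].max()]

top100_by_region = (
    latest[latest["World_rank"] <= 100]
    .groupby("Region")["University_name"]
    .nunique()
    .sort_values(ascending=False)
)
print(top100_by_region)
```

Counting unique names avoids counting the same university three times when all three years are present, which the stacked histograms above do not distinguish.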

#### Q5: What are the trends in the QS rankings of the top 500 mainland universities over the past three years?

In [None]:
mainland_universities = mainland.groupby('University_name').size()
print(mainland_universities.index)
mainlandNumUniversities = len(mainland_universities)
print('num of mainland universities:',mainlandNumUniversities)

In [None]:
n_cols = 4
n_rows = -(-mainlandNumUniversities // n_cols)  # ceiling division: just enough rows for all universities
fig = plt.figure(figsize=(20, 4 * n_rows))
fig.subplots_adjust(hspace=0.5, wspace=0.4)
for n, university in enumerate(mainland_universities.index, start=1):
    ax = fig.add_subplot(n_rows, n_cols, n)
    # Select this university's rows once and order them by year
    subset = mainland[mainland['University_name'] == university].sort_values('Year')
    ax.plot(subset['Year'], subset['World_rank'])
    ax.set_xlabel('Year', fontsize=12)
    ax.set_ylabel('World_rank', fontsize=12)
    ax.set_title(university)  # use the university name as the subplot title

plt.show()

Finding: A considerable number of universities changed direction in 2022, switching from an upward to a downward trend or vice versa, such as Nankai University and Shanghai Jiao Tong University. Others kept the same direction but changed in magnitude, such as Zhejiang University and Wuhan University. In addition, Beijing Normal University and Southern University of Science and Technology kept a consistent downward slope over the three years; because a smaller World_rank value means a better position, a downward slope here indicates an improving rank. These plots show the specific trend of each mainland university over the three years.
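
The "change of direction in 2022" described above can also be detected numerically instead of by eye. The following sketch compares the two year-on-year changes for each university; the names and ranks are hypothetical, and the column name `reversed_in_2022` is introduced only for this illustration.

```python
import pandas as pd

# Hypothetical three-year rank series for two universities
toy = pd.DataFrame({
    "University_name": ["Uni A"] * 3 + ["Uni B"] * 3,
    "Year": [2021, 2022, 2023] * 2,
    "World_rank": [300, 280, 295, 150, 140, 120],
})

wide = toy.pivot(index="University_name", columns="Year", values="World_rank")
first_step = wide[2022] - wide[2021]
second_step = wide[2023] - wide[2022]

# A reversal means the two year-on-year changes have opposite signs
wide["reversed_in_2022"] = (first_step * second_step) < 0
print(wide)
```

Universities whose two changes share a sign but differ in size correspond to the "same direction, different magnitude" group in the finding.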

## Word Cloud Diagram




In [None]:
!pip install wordcloud


In [None]:
from wordcloud import WordCloud

university_names = ["QS World Rank","Tsinghua University", "Peking University", "Fudan University",
    "Shanghai Jiao Tong University", "Zhejiang University",
    "University of Science and Technology of China", "Nanjing University",
    "Wuhan University", "Tongji University", "Harbin Institute of Technology",
    "Sun Yat-sen University", "Beijing Normal University",
    "Xi'an Jiaotong University", "Southern University of Science and Technology",
    "Nankai University", "Shanghai University", "Tianjin University",
    "Beijing Institute of Technology", "Huazhong University of Science and Technology",]

text = " ".join(university_names * 10)
wordcloud = WordCloud(
    width=700, height=700,
    background_color='white',
    min_font_size=10,
    max_words=200,
    scale=0.5).generate(text)

plt.figure(figsize=(7, 5), facecolor=None)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0.1)
plt.show()

## Map Visualization

Map1: The distribution of QS500 universities in mainland China and Hong Kong

Get the data and the geographical file.

In [None]:
#import china geojson
baseURL = 'https://kweakkk.github.io/'
doc = 'china (1).json'
fullURL = baseURL + doc

dataPath = drivePath + '/MyDrive/Colab Notebooks/data'

# Download the file
mapLayer = wget.download(fullURL, out=dataPath)

# Print the file name including the local path
print(mapLayer)
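
The GeoJSON file name above contains a space and parentheses. Whether `wget.download` needs them escaped is not confirmed here, so the following is only a precaution: a minimal sketch that percent-encodes the file name with the standard library before building the URL. The variable names `base_url` and `safe_url` are introduced for this illustration.

```python
from urllib.parse import quote

base_url = "https://kweakkk.github.io/"  # same base URL as in the notebook
doc = "china (1).json"

# Percent-encode characters that are not valid in a URL path
safe_url = base_url + quote(doc)
print(safe_url)
```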

In [None]:
mapLayer

In [None]:
# Download the distribution CSV (university counts and proportions per region)
baseURL = 'https://kweakkk.github.io/'
doc = 'distribution.csv'
fullURL = baseURL + doc

dataPath = drivePath + '/MyDrive/Colab Notebooks/data'

# Download the file
data_distribution = wget.download(fullURL, out=dataPath)

# Print the file name including the local path
print(data_distribution)

In [None]:
mainlandHK_geo = r'/content/drive/MyDrive/Colab Notebooks/data/china (1).json'

In [None]:
mainlandHK_proportion=pd.read_csv(data_distribution)

In [None]:
mainlandHK_proportion.head(10)

Create the map labels

In [None]:
import json

with open(mainlandHK_geo, encoding="utf8") as f:
    map_data = json.load(f)


In [None]:
[key for key in map_data]

In [None]:
[key for key in map_data['features'][0]]

In [None]:
[key for key in map_data['features'][0]['geometry']]

In [None]:
[key for key in map_data['features'][0]['properties']]

In [None]:
print(map_data['features'][0]['properties'])

In [None]:
json_map_file = []
for i in range(len(map_data['features'])):
    json_map_file.append(map_data['features'][i]['properties']['name'])
json_map_file = pd.DataFrame({'Sort_Index': range(len(map_data['features'])), 'Eng_name': json_map_file})
mainlandHK_proportion = mainlandHK_proportion.merge(json_map_file, on='Eng_name')
#mainlandHK_proportion = mainlandHK_proportion.drop(columns=['Sort_Index_x', 'Sort_Index_y'])#if feedback said duplicate Sort_Index can execute this line
mainlandHK_proportion = mainlandHK_proportion.sort_values(by=['Sort_Index']).reset_index(drop=True)
mainlandHK_proportion
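
The merge above silently drops any region whose name is spelled differently in the GeoJSON and in the CSV. One way to check for this, sketched below on made-up names, is an outer merge with `indicator=True`. The region names and values are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Hypothetical region names standing in for the GeoJSON and CSV contents
geo_names = pd.DataFrame({"Sort_Index": [0, 1, 2],
                          "Eng_name": ["Region A", "Region B", "Region C"]})
stats = pd.DataFrame({"Eng_name": ["Region A", "Region B", "Region D"],
                      "proportion": [0.5, 0.3, 0.2]})

# The _merge column reports where each row was found
check = geo_names.merge(stats, on="Eng_name", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Eng_name", "_merge"]])
```

An empty `unmatched` table would mean that every map feature has a matching data row, which the tooltip step relies on.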

In [None]:
# Build one tooltip string per region: "<region name> <percentage>%"
tooltip_text = [
    name + ' ' + str(int(round(proportion * 100))) + '%'
    for name, proportion in zip(mainlandHK_proportion['Eng_name'], mainlandHK_proportion['proportion'])
]
tooltip_text

In [None]:
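# Attach each tooltip to its own GeoJSON feature via Sort_Index,
# so the labels stay aligned even if the merge dropped a region
for sort_index, text in zip(mainlandHK_proportion['Sort_Index'], tooltip_text):
    map_data['features'][sort_index]['properties']['tooltip1'] = text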

In [None]:
print(map_data['features'][32]['properties'])

In [None]:
#save the tooltip
with open('china (1).json', 'w') as output:
    json.dump(map_data, output)

Create the distribution map

#### Q6: How are the QS500 universities of mainland China and Hong Kong distributed on the map?

In [None]:
mainland_HK_geo = map_data
map1 = folium.Map([39.9, 116.3], tiles='cartodbpositron', zoom_start=4)
# Extra base layers; 'cartodbpositron' is already the default layer above.
# Note: the Stamen tile names may not be built in on newer folium releases.
tiles = ['stamenwatercolor', 'openstreetmap', 'stamenterrain']
folium.GeoJson(mapLayer).add_to(map1)
for tile in tiles:
    folium.TileLayer(tile).add_to(map1)


choropleth = folium.Choropleth(
    geo_data = mainland_HK_geo,
    name = 'choropleth',
    data = mainlandHK_proportion,
    columns = ['Eng_name', 'university_number'],  # [key column, value column]
    key_on = 'feature.properties.name',
    fill_color = 'YlGn',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    legend_name = '2023 China mainland and HK QS500 Universities Distribution',
    highlight = True
).add_to(map1)

folium.LayerControl().add_to(map1)
# Display Region Label
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['tooltip1'], labels=False)
)

In [None]:
map1

In [None]:
#save the map
map1.save('map1.html')

Finding: The map shows that most of the QS500 universities in China are concentrated in the eastern and central regions. Beijing, Shanghai, and Hong Kong have the most, accounting for 18%, 12%, and 18% respectively. Hubei Province, Guangdong Province, Jiangsu Province, and Tianjin each account for 6%, which suggests that these places are richer in educational resources than other regions. In contrast, many western provinces have 0%, although this may partly reflect the fact that some Chinese universities do not participate in the QS ranking.
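
The notebook does not show how the `proportion` column in the distribution file was produced. Assuming it is simply each region's count divided by the total, a sketch of that calculation looks like this; the region names and counts are hypothetical, and the `label` column is added only for illustration.

```python
import pandas as pd

# Hypothetical counts per region
counts = pd.DataFrame({
    "Eng_name": ["Region A", "Region B", "Region C"],
    "university_number": [3, 1, 0],
})

# Share of all ranked universities that falls in each region
counts["proportion"] = counts["university_number"] / counts["university_number"].sum()

# Whole-number percentage label, as used in the map tooltips
counts["label"] = (counts["proportion"] * 100).round().astype(int).astype(str) + "%"
print(counts)
```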

Map2: Locations of the ranked universities in Hong Kong

In [None]:
import pandas as pd
import folium
import json
from IPython.display import IFrame
import warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning)
from folium.plugins import MarkerCluster

In [None]:
baseURL = 'https://kweakkk.github.io/'
doc = 'data.xlsx'
fullURL = baseURL + doc

dataPath = drivePath + '/MyDrive/Colab Notebooks/data'

# Download the file
filename3 = wget.download(fullURL, out=dataPath)

# Print the file name including the local path
print(filename3)

In [None]:
data = pd.read_excel(filename3)  # use the path returned by wget.download above
data = data[data.Year==2023].copy()
data.head()

In [None]:
# Approximate centre of Hong Kong (latitude, longitude)
hk_coords = [22.38, 114.15]

# Create the map
my_map = folium.Map(location = hk_coords, zoom_start = 10)

# Add one marker per university, with name, rank and overall score in the popup
for index, row in data.iterrows():
    popup_text = ("Name: " + str(row["University_name"]) + "<br>"
                  + "Rank: " + str(row["World_rank"]) + "<br>"
                  + "Overall score: " + str(row["Overall_score"]))
    folium.Marker([row["lat"], row["long"]], popup = popup_text).add_to(my_map)

# Display the map
my_map
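
The popup text in the cell above is assembled by string concatenation. As an optional alternative, the sketch below wraps that step in a small helper that also escapes HTML special characters in the values. The function name `build_popup` and the sample values are hypothetical and are not part of the original notebook.

```python
from html import escape

def build_popup(name, rank, score):
    """Return popup HTML with one labelled line per field."""
    lines = [
        f"Name: {escape(str(name))}",
        f"Rank: {escape(str(rank))}",
        f"Overall score: {escape(str(score))}",
    ]
    return "<br>".join(lines)

# Hypothetical values, used only to show the output format
example = build_popup("Uni A & B", 42, 80.5)
print(example)
```

Inside the marker loop, the returned string could be passed as the `popup` argument of `folium.Marker`.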

# Result
The three-year analysis of university rankings in Hong Kong and mainland China reveals several trends. Hong Kong universities consistently secure higher rankings, whereas mainland universities show a more varied distribution that differs by region. The negative correlation between rank number and academic reputation underscores the importance of academic standing, yet the outliers point to additional influencing factors. The map shows that ranked universities are concentrated in eastern and central China, particularly in Beijing, Shanghai, and Hong Kong, which suggests a concentration of educational resources in these regions.

Looking at individual universities, such as Nankai University, Shanghai Jiao Tong University, Zhejiang University, and Wuhan University, shows how rankings shifted from year to year. Taken together, the analysis sheds light on the factors that affect university rankings and on the regional differences behind them.