
**Data description:**
-------------------------

**Core ranking metrics:**
*   **Rank**: position of the university in the Times Higher Education (THE) World University Rankings 2026.
*   **Name**: official name of the university.
*   **Overall (Overall_num)**: overall score of the university (a composite indicator from 0 to 100) on which the ranking is built. `Overall_num` is the numeric form of `Overall`; when the score is given as a range, it holds the lower bound.
*   **Teaching**: teaching quality score (learning environment, reputation).
*   **Research Environment**: research environment score (research volume, income and reputation).
*   **Research Quality**: research quality score (citations, publication influence, research strength).
*   **Industry**: industry income and patents (knowledge transfer to the real economy).
*   **International Outlook**: international outlook (share of international students and staff, international collaboration).

**Geography:**
*   **country**: country where the university is located.
*   **town**: city where the university is located.

**Student and staff statistics:**
*   **No. of FTE students**: total number of students in full-time equivalents (FTE).
*   **No. of students per staff**: number of students per staff member (a proxy for teaching load).
*   **International students**: share of international students at the university (stored as a fraction from 0 to 1).
*   **Female_share**: share of female students, in percent.
*   **Male_share**: share of male students, in percent.

**Subject rankings (position by field):**
*   **Arts and Humanities 2025**: position in the "Arts and Humanities" subject ranking.
*   **Engineering 2025**: position in the "Engineering" subject ranking.
*   **Computer Science 2025**: position in the "Computer Science" subject ranking.
*   **Life Sciences 2025**: position in the "Life Sciences" subject ranking (biology and related fields).
*   **Medical and Health 2025**: position in the "Medical and Health" subject ranking.
*   **Physical Sciences 2025**: position in the "Physical Sciences" subject ranking.
*   **Social Sciences 2025**: position in the "Social Sciences" subject ranking.
*   **Business and Economics 2025**: position in the "Business and Economics" subject ranking.
*   **Law 2025**: position in the "Law" subject ranking.
*   **Education Studies 2025**: position in the "Education Studies" subject ranking.



**Subject-area details (Disciplines Taught):**
These columns contain a text list of the specific disciplines in which the university is active.

*   **Arts and Humanities**: humanities disciplines (includes *Archaeology*; *Art, Performing Arts & Design*; *History, Philosophy & Theology*; *Languages, Literature & Linguistics*; *Architecture*).
*   **Business and Economics**: business disciplines (includes *Accounting & Finance*; *Business & Management*; *Economics & Econometrics*).
*   **Computer Science**: computer science (informatics).
*   **Education Studies**: education (pedagogy, teacher training).
*   **Engineering**: engineering fields (includes *Chemical Engineering*; *Civil Engineering*; *Electrical & Electronic Engineering*; *Mechanical & Aerospace Engineering*; *General Engineering*).
*   **Law**: law and jurisprudence.
*   **Life Sciences**: life sciences (includes *Agriculture & Forestry*; *Biological Sciences*; *Veterinary Science*; *Sport Science*).
*   **Medical and Health**: medicine (includes *Medicine & Dentistry*; *Other Health*).
*   **Physical Sciences**: physical and mathematical sciences (includes *Mathematics & Statistics*; *Physics & Astronomy*; *Chemistry*; *Geology, Environmental, Earth & Marine Sciences*).
*   **Psychology**: psychology (general, clinical, etc.).
*   **Social Sciences**: social sciences (includes *Communication & Media Studies*; *Politics & International Studies*; *Sociology*; *Geography*).
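
Some discipline names contain commas themselves (for example `Art, Performing Art and Design`, as spelled in the scraped values), so splitting a cell on `,` would cut such names apart. A minimal sketch of a membership check that avoids splitting, assuming the cells look like the scraped strings shown in this notebook; the helper `teaches` and the toy frame are hypothetical:

```python
import pandas as pd

# Toy frame mimicking a "Disciplines Taught" column (values are illustrative).
toy = pd.DataFrame({
    "Name": ["Uni A", "Uni B", "Uni C"],
    "Arts and Humanities": [
        "Archaeology, Art, Performing Art and Design",
        "Architecture",
        None,  # nothing listed for this area
    ],
})

def teaches(df: pd.DataFrame, column: str, discipline: str) -> pd.Series:
    """Hypothetical helper: True where `discipline` occurs in `column`.

    Uses a literal substring match (regex=False); missing cells count as False.
    """
    return df[column].str.contains(discipline, regex=False, na=False)

toy["has_art_design"] = teaches(toy, "Arts and Humanities", "Art, Performing Art and Design")
print(toy[["Name", "has_art_design"]])
```

Because this is a substring match, a short discipline name that happens to be part of a longer one would also match, so the full name should be passed.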


***

*Data source: [Times Higher Education World University Rankings 2026](https://www.timeshighereducation.com/world-university-rankings/latest/world-ranking)*


Loading and cleaning the data
--------------------------

In [82]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import functools as ft

In [83]:
bd_main = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/scrapping_dynamics/rankings.csv")
bd_main["Name"] = bd_main["Name"].apply(lambda x: x.split("\n")[0].strip())
# keep only the first line of the scraped name; the final strip() removes extra whitespace
bd_main.head()

Unnamed: 0,Rank,Name,Overall,Teaching,Research Environment,Research Quality,Industry,International Outlook,URL
0,1,University of Oxford,98.2,97.2,100.0,97.7,99.9,96.4,https://www.timeshighereducation.com/world-uni...
1,2,Massachusetts Institute of Technology,97.7,99.2,95.3,99.6,100.0,91.9,https://www.timeshighereducation.com/world-uni...
2,=3,Princeton University,97.2,98.2,97.3,99.0,98.0,85.4,https://www.timeshighereducation.com/world-uni...
3,=3,University of Cambridge,97.2,96.2,99.9,97.1,87.6,96.3,https://www.timeshighereducation.com/world-uni...
4,=5,Harvard University,97.1,95.9,100.0,98.9,86.7,88.3,https://www.timeshighereducation.com/world-uni...
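
The same name clean-up can be written with the vectorized `.str` accessor instead of `apply` with a lambda. This is only a sketch: the raw `Name` values are not shown in this notebook, so the toy strings below (a name followed by extra text after a newline) are an assumption.

```python
import pandas as pd

# Illustrative raw values: the university name, then extra text on a new line.
raw = pd.Series([
    "University of Oxford\nUnited Kingdom",
    "  Harvard University \nUnited States",
])

# Split on the newline, keep the first piece, trim surrounding whitespace.
clean = raw.str.split("\n").str[0].str.strip()
print(clean.tolist())
```

Unlike the lambda version, the `.str` chain leaves missing values as `NaN` instead of raising an `AttributeError`.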


In [84]:
bd_1 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings0_500.csv")
bd_2 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings500_1000.csv")
bd_3 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings1000_1500.csv")
bd_4 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings1500_2000.csv")
bd_5 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings2000_2500.csv")
bd_6 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/rank/parsed_rankings2500_2810.csv")
bd_1.head()

Unnamed: 0,Name,Geo,World University Rankings 2026,Arts and Humanities 2025,Business and Economics 2025,Medical and Health 2025,Computer Science 2025,Education Studies 2025,Engineering 2025,Law 2025,Life Sciences 2025,Physical Sciences 2025,Social Sciences 2025
0,University of Oxford,"Oxford, United Kingdom",1,3.0,2.0,1.0,1.0,3.0,4.0,7.0,4.0,8.0,2.0
1,Massachusetts Institute of Technology,"Cambridge, United States",2,1.0,1.0,3.0,3.0,3.0,4.0,1.0,,,
2,Princeton University,"Princeton, United States",3,6.0,6.0,8.0,7.0,5.0,3.0,5.0,,,
3,University of Cambridge,"Cambridge, United Kingdom",3,2.0,5.0,3.0,2.0,4.0,6.0,4.0,2.0,7.0,2.0
4,Harvard University,"Cambridge, United States",5,5.0,7.0,2.0,10.0,5.0,1.0,2.0,1.0,2.0,4.0
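
The six chunk files share one layout, so they can also be read in a loop and stacked in one step. A minimal sketch, assuming every chunk has the same columns; the helper `load_chunks` and the temporary file names are hypothetical and stand in for the real paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_chunks(paths):
    """Read several CSV chunks and stack them into one frame with a fresh 0..n-1 index."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)

# Demo on two tiny temporary files instead of the real scraped chunks.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame({"Name": ["A", "B"], "Rank": [1, 2]}).to_csv(tmp / "part0_2.csv", index=False)
    pd.DataFrame({"Name": ["C"], "Rank": [3]}).to_csv(tmp / "part2_3.csv", index=False)
    combined = load_chunks(sorted(tmp.glob("part*.csv")))

print(combined)
```

With the real file names, `sorted` orders paths lexicographically (`1000_1500` before `500_1000`), so an explicit list of paths is safer if the row order matters.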


In [85]:
bd_st = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/scrapping_dynamics/statistics.csv")

bd_st["Name"] = bd_st["Name"].apply(lambda x: x.split("\n")[0].strip())
# clean the names of bd_st itself (not bd_main); the final strip() removes extra whitespace

bd_st[["Female_share", "Male_share"]] = bd_st["Female:Male ratio"].apply(lambda x: pd.Series(str(x).split(":"))).astype(float)
# each cell is split on ":", and the resulting list is turned into a pd.Series (one column per part)

bd_st["International students"] = bd_st["International students"].str.rstrip("%").astype(float) / 100
# .str applies rstrip (trimming from the right) to every cell

bd_st["No. of FTE students"] = bd_st["No. of FTE students"].str.replace(",", "").astype(float)
# .str applies replace to every cell

bd_st.head()

Unnamed: 0,Rank,Name,No. of FTE students,No. of students per staff,International students,Female:Male ratio,URL,Female_share,Male_share
0,1,University of Oxford,22005.0,10.4,0.43,52 : 48,https://www.timeshighereducation.com/world-uni...,52.0,48.0
1,2,Massachusetts Institute of Technology,11703.0,7.7,0.33,43 : 57,https://www.timeshighereducation.com/world-uni...,43.0,57.0
2,=3,Princeton University,8739.0,8.2,0.23,47 : 53,https://www.timeshighereducation.com/world-uni...,47.0,53.0
3,=3,University of Cambridge,21045.0,11.3,0.38,50 : 50,https://www.timeshighereducation.com/world-uni...,50.0,50.0
4,=5,Harvard University,22680.0,10.1,0.27,53 : 47,https://www.timeshighereducation.com/world-uni...,53.0,47.0
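
The three conversions above can also be done with vectorized string methods only. A minimal sketch on a toy frame, assuming the raw cells look like the values in the output above (`52 : 48`, a percentage with a `%` sign, and a thousands separator); the toy rows are illustrative:

```python
import pandas as pd

stats = pd.DataFrame({
    "Female:Male ratio": ["52 : 48", "43 : 57", None],
    "International students": ["43%", "33%", None],
    "No. of FTE students": ["22,005", "11,703", None],
})

# "52 : 48" -> two numeric columns; the regex captures both numbers and ignores the spaces.
shares = stats["Female:Male ratio"].str.extract(r"(\d+)\s*:\s*(\d+)").astype(float)
stats["Female_share"] = shares[0]
stats["Male_share"] = shares[1]

# "43%" -> 0.43; pd.to_numeric keeps missing cells as NaN.
stats["International students"] = (
    pd.to_numeric(stats["International students"].str.rstrip("%")) / 100
)

# "22,005" -> 22005.0; regex=False treats the comma as a literal character.
stats["No. of FTE students"] = pd.to_numeric(
    stats["No. of FTE students"].str.replace(",", "", regex=False)
)

print(stats)
```

Missing inputs stay `NaN` in every derived column, which matches how the original cell treats them.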


In [86]:
bd_11 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects0_500.csv")
bd_22 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects500_1000.csv")
bd_33 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects1000_1500.csv")
bd_44 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects1500_2000.csv")
bd_55 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects2000_2500.csv")
bd_66 = pd.read_csv("/Users/anastasiahimic/Desktop/hse_python_project_3/subjects/universities_subjects2500_2811.csv")
bd_11.head()

Unnamed: 0,Name,Arts and Humanities,Business and Economics,Computer Science,Education Studies,Engineering,Law,Life Sciences,Medical and Health,Physical Sciences,Psychology,Social Sciences
0,University of Oxford,"Archaeology, Art, Performing Art and Design, H...","Accounting and Finance, Business and Managemen...",Computer Science,Education,"Chemical Engineering, Civil Engineering, Elect...",Law,Biological Sciences,"Medicine and Dentistry, Other Health","Chemistry, Geology, Environmental, Earth and M...",Psychology,"Communication and Media Studies, Geography, Po..."
1,Massachusetts Institute of Technology,"Archaeology, Architecture, Art, Performing Art...","Business and Management, Economics and Econome...",Computer Science,,"Chemical Engineering, Civil Engineering, Elect...",,Biological Sciences,Other Health,"Chemistry, Geology, Environmental, Earth and M...",Psychology,"Communication and Media Studies, Politics and ..."
2,Princeton University,"Architecture, Art, Performing Art and Design, ...",Economics and Econometrics,Computer Science,,"Chemical Engineering, Civil Engineering, Elect...",,"Agriculture and Forestry, Biological Sciences",Other Health,"Chemistry, Geology, Environmental, Earth and M...",Psychology,"Politics and International Studies, Sociology"
3,University of Cambridge,"Archaeology, Architecture, Art, Performing Art...","Business and Management, Economics and Econome...",Computer Science,Education,"Chemical Engineering, Civil Engineering, Elect...",Law,"Biological Sciences, Veterinary Science",Medicine and Dentistry,"Chemistry, Geology, Environmental, Earth and M...",Psychology,"Geography, Politics and International Studies,..."
4,Harvard University,"Archaeology, Architecture, Art, Performing Art...","Accounting and Finance, Business and Managemen...",Computer Science,Education,"Civil Engineering, Electrical and Electronic E...",Law,"Agriculture and Forestry, Biological Sciences","Medicine and Dentistry, Other Health","Chemistry, Geology, Environmental, Earth and M...",Psychology,"Communication and Media Studies, Politics and ..."


In [87]:
dn = pd.concat([bd_1, bd_2, bd_3, bd_4, bd_5, bd_6])
subj_dfs = [bd_11, bd_22, bd_33, bd_44, bd_55, bd_66]

merge_bd = bd_main.set_index("Name").combine_first(dn.set_index("Name")).reset_index()
# set_index("Name") makes Name the index of the table; reset_index() is the inverse operation and turns the index back into a regular column
# combine_first fills the gaps in the left table with values from the right one, aligned on the index (Name), and adds rows and columns that exist only in the right table
merge_bd = bd_main[["Name"]].merge(merge_bd, on="Name", how="left")

all_subjects = pd.concat(subj_dfs).drop_duplicates("Name")
new_cols = all_subjects.columns.difference(merge_bd.columns).tolist()
# .columns holds all column names; .difference() is a set subtraction: it takes the columns of all_subjects and removes those already present in merge_bd.columns
# .tolist() converts the result (technically an Index object) into a plain Python list
merge_bd = merge_bd.merge(all_subjects[["Name"] + new_cols], on="Name", how="left")

cols = ["Name", "Female_share", "Male_share", "International students", "No. of FTE students", "No. of students per staff"]
merge_bd = merge_bd.merge(bd_st[cols], on="Name", how="left")

merge_bd["country"] = merge_bd["Geo"].apply(lambda x: str(x).split(",")[-1].strip())
merge_bd["town"] = merge_bd["Geo"].apply(lambda x: str(x).split(",")[0].strip())
# the final strip() removes extra whitespace

merge_bd["Overall_num"] = merge_bd["Overall"].str.split(r'[-–]').str[0].astype(float)
# .str gives access to string methods; split on the pattern r'[-–]' (hyphen or en dash), keep the first part, and astype(float) converts it to a number


merge_bd

Unnamed: 0,Name,Arts and Humanities 2025,Business and Economics 2025,Computer Science 2025,Education Studies 2025,Engineering 2025,Geo,Industry,International Outlook,Law 2025,...,Psychology,Social Sciences,Female_share,Male_share,International students,No. of FTE students,No. of students per staff,country,town,Overall_num
0,University of Oxford,3.0,2.0,1.0,3.0,4.0,"Oxford, United Kingdom",99.9,96.4,7.0,...,Psychology,"Communication and Media Studies, Geography, Po...",52.0,48.0,0.43,22005.0,10.4,United Kingdom,Oxford,98.2
1,Massachusetts Institute of Technology,1.0,1.0,3.0,3.0,4.0,"Cambridge, United States",100.0,91.9,1.0,...,Psychology,"Communication and Media Studies, Politics and ...",43.0,57.0,0.33,11703.0,7.7,United States,Cambridge,97.7
2,Princeton University,6.0,6.0,7.0,5.0,3.0,"Princeton, United States",98.0,85.4,5.0,...,Psychology,"Politics and International Studies, Sociology",47.0,53.0,0.23,8739.0,8.2,United States,Princeton,97.2
3,University of Cambridge,2.0,5.0,2.0,4.0,6.0,"Cambridge, United Kingdom",87.6,96.3,4.0,...,Psychology,"Geography, Politics and International Studies,...",50.0,50.0,0.38,21045.0,11.3,United Kingdom,Cambridge,97.2
4,Harvard University,5.0,7.0,10.0,5.0,1.0,"Cambridge, United States",86.7,88.3,2.0,...,Psychology,"Communication and Media Studies, Politics and ...",53.0,47.0,0.27,22680.0,10.1,United States,Cambridge,97.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2805,Turkmen State Architecture and Construction In...,,,,,,"Ashgabat, Turkmenistan",,,,...,,,,,,,,Turkmenistan,Ashgabat,
2806,Turkmen State Institute of Economics and Manag...,,,,,,"Ashgabat City, Turkmenistan",,,,...,,,,,,,,Turkmenistan,Ashgabat City,
2807,Turkmen State Institute of Finance,,,,,,"Ashgabat, Turkmenistan",,,,...,Psychology,"Geography, Politics and International Studies",,,,,,Turkmenistan,Ashgabat,
2808,UIT University,,,,,,"Karachi, Pakistan",,,,...,,,,,,,,Pakistan,Karachi,
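
The `combine_first` step followed by a left merge is easier to follow on a toy example. A minimal sketch with made-up frames `left` and `right` (hypothetical stand-ins for `bd_main` and `dn`), assuming names are unique in both tables:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["B", "A"], "Overall": [90.0, None]})
right = pd.DataFrame({"Name": ["A", "C"], "Overall": [80.0, 70.0], "Geo": ["X", "Y"]})

# combine_first aligns on the index (Name): gaps in `left` are filled from `right`,
# and rows/columns present only in `right` are added to the result.
combined = left.set_index("Name").combine_first(right.set_index("Name")).reset_index()

# The combined frame covers the union of names, so a left merge on the original
# names restores the row order of `left` and drops names that exist only in `right`.
result = left[["Name"]].merge(combined, on="Name", how="left")
print(result)
```

Here university `A` receives its missing `Overall` from `right`, `B` keeps its own value and gets no `Geo`, and `C` is dropped by the final merge.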


In [88]:
merge_bd.columns = merge_bd.columns.str.strip()

In [89]:
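def clean_rank(x):
    # Missing ranks stay missing.
    if pd.isna(x):
        return np.nan
    val = float(x)
    # Values above 2500 are treated as a rank band whose two bounds were apparently
    # glued together during scraping (e.g. 601800 for 601-800); keep the first half
    # of the digits, i.e. the lower bound.
    if val > 2500:
        s = str(int(val))
        return float(s[:len(s)//2])
    return val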

ranking_cols = ['World University Rankings 2026', 'Arts and Humanities 2025', 'Business and Economics 2025',
    'Medical and Health 2025', 'Computer Science 2025', 'Education Studies 2025',
    'Engineering 2025', 'Law 2025', 'Life Sciences 2025', 'Physical Sciences 2025',
    'Social Sciences 2025']

merge_bd[ranking_cols] = merge_bd[ranking_cols].map(clean_rank)
# DataFrame.map (pandas >= 2.1, replaces the deprecated applymap) applies clean_rank to every cell in turn
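
To see what `clean_rank` does, it can be run on a few toy values. This is a sketch: the glued values below (`601800`, `8011000`, `10011250`) are assumptions about what a scraped rank band might look like, not values taken from the data.

```python
import numpy as np
import pandas as pd

def clean_rank(x):
    if pd.isna(x):
        return np.nan
    val = float(x)
    if val > 2500:
        # Keep the first half of the digits (integer division rounds the length down).
        s = str(int(val))
        return float(s[:len(s) // 2])
    return val

toy = pd.DataFrame({
    "Rank A": [1.0, 601800.0, np.nan],
    "Rank B": [8011000.0, 10011250.0, 1251.0],
})

# Element-wise application that works on both newer pandas (DataFrame.map)
# and older versions (DataFrame.applymap).
apply_elementwise = toy.map if hasattr(toy, "map") else toy.applymap
cleaned = apply_elementwise(clean_rank)
print(cleaned)
```

For a 7-digit value such as `8011000` the first three digits are kept, so the lower bound `801` is recovered even though the two halves have different lengths.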



In [90]:
merge_bd.to_csv("MAIN.csv")

In [91]:
merge_bd = merge_bd.dropna(subset=["Overall"])
merge_bd = merge_bd.dropna(subset=["Geo"])

merge_bd.isna().sum()

Name                                 0
Arts and Humanities 2025           166
Business and Economics 2025        389
Computer Science 2025              852
Education Studies 2025            1042
Engineering 2025                  1189
Geo                                  0
Industry                             0
International Outlook                0
Law 2025                          1307
Life Sciences 2025                1437
Medical and Health 2025            628
Overall                              0
Physical Sciences 2025            1570
Rank                                 0
Research Environment                 0
Research Quality                     0
Social Sciences 2025              1729
Teaching                             0
URL                                  0
World University Rankings 2026       2
Arts and Humanities                270
Business and Economics             207
Computer Science                   197
Education Studies                  559
Engineering              

In [92]:
merge_bd.dtypes

Name                               object
Arts and Humanities 2025          float64
Business and Economics 2025       float64
Computer Science 2025             float64
Education Studies 2025            float64
Engineering 2025                  float64
Geo                                object
Industry                          float64
International Outlook             float64
Law 2025                          float64
Life Sciences 2025                float64
Medical and Health 2025           float64
Overall                            object
Physical Sciences 2025            float64
Rank                               object
Research Environment              float64
Research Quality                  float64
Social Sciences 2025              float64
Teaching                          float64
URL                                object
World University Rankings 2026    float64
Arts and Humanities                object
Business and Economics             object
Computer Science                  

In [93]:
merge_bd.shape

(2082, 40)

In [94]:
merge_bd.describe()

Unnamed: 0,Arts and Humanities 2025,Business and Economics 2025,Computer Science 2025,Education Studies 2025,Engineering 2025,Industry,International Outlook,Law 2025,Life Sciences 2025,Medical and Health 2025,...,Research Quality,Social Sciences 2025,Teaching,World University Rankings 2026,Female_share,Male_share,International students,No. of FTE students,No. of students per staff,Overall_num
count,1916.0,1693.0,1230.0,1040.0,893.0,2082.0,2082.0,775.0,645.0,1454.0,...,2082.0,353.0,2082.0,2080.0,1998.0,1998.0,2080.0,2082.0,2082.0,2082.0
mean,567.063674,572.33668,499.003252,428.154808,441.152296,48.470989,49.853266,380.011613,372.446512,534.320495,...,53.409078,269.74221,29.02171,916.225,51.564064,48.435936,0.111226,21885.595581,18.234966,32.676849
std,300.785263,316.975003,303.717791,299.285566,307.518531,26.18595,21.235824,281.188197,255.886546,321.572421,...,23.659878,201.061046,13.605316,497.87532,12.310686,12.310686,0.137135,24753.915835,10.851656,18.84183
min,1.0,1.0,1.0,1.0,1.0,16.0,17.1,1.0,1.0,1.0,...,4.6,1.0,8.6,1.0,2.0,0.0,0.0,448.0,0.7,10.3
25%,301.0,301.0,251.0,176.0,201.0,23.9,32.525,151.0,176.0,263.5,...,33.625,101.0,19.5,501.0,45.0,41.0,0.02,8886.5,12.4,10.3
50%,601.0,601.0,501.0,401.0,401.0,43.1,45.6,301.0,301.0,501.0,...,54.2,251.0,26.0,1001.0,54.0,46.0,0.06,16434.5,16.3,32.1
75%,801.0,801.0,801.0,601.0,601.0,69.175,63.875,601.0,601.0,801.0,...,72.975,401.0,34.775,1501.0,59.0,55.0,0.15,28351.5,21.8,43.6
max,1251.0,1251.0,1251.0,1251.0,1251.0,100.0,99.5,1001.0,1001.0,1251.0,...,99.6,1001.0,99.2,1501.0,100.0,98.0,0.96,499671.0,226.1,98.2
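
In the summary above, the minimum and the first quartile of `Overall_num` coincide at 10.3, which is what one would expect if many universities carry a score band whose lower bound is kept. A minimal sketch of extracting both bounds, assuming `Overall` is either a single number or two numbers separated by a hyphen or en dash; the toy values and the `midpoint` column are illustrative, not part of the original pipeline:

```python
import pandas as pd

overall = pd.Series(["98.2", "43.6–46.9", "10.3-25.0", None])

# A multi-character pattern is interpreted as a regular expression by str.split.
parts = overall.str.split(r"[-–]")

lower = pd.to_numeric(parts.str[0])   # first piece: the lower bound (or the only value)
upper = pd.to_numeric(parts.str[-1])  # last piece: the upper bound (or the only value)
midpoint = (lower + upper) / 2

print(pd.DataFrame({"lower": lower, "upper": upper, "midpoint": midpoint}))
```

Using the lower bound, as the notebook does, places every banded university at the bottom of its band; a midpoint is one possible alternative when the band width matters.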
