# World University Rankings Data

This notebook walks you through a simple script that extracts university ranking data from https://www.timeshighereducation.com/
________________________________________________________________________________________________________________

## Aim
This code extracts information from a university ranking website. Collecting data from websites is called web scraping, and it is mainly used for sites that do not offer an API for retrieving their data directly. 
Many tutorials already explain the basics of web scraping, so this notebook is more of a case study. <br />
Let's start.

## 1. Prerequisites
This code assumes that Python is installed on your machine and that you have basic knowledge of the language. Here is the full list of prerequisites: 
* Python 3.6 or above
* Jupyter Notebook, or any other environment that can run Python
* The following Python libraries: BeautifulSoup, Selenium, urllib (part of the standard library), objectpath and pandas
* A web browser. I am using Chrome 77 here, but other browsers work too
* A web driver for the browser you are using. For Chrome and Chromium-based browsers you can download it from https://chromedriver.chromium.org/downloads
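
Before going further, it can help to confirm that the third-party packages are actually installed. The sketch below uses only the standard library (`importlib.metadata`, which needs Python 3.8 or newer). The distribution names in `REQUIRED` are my assumption of how these libraries are published on PyPI, and `installed_versions` is a hypothetical helper, not part of any of them.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the third-party libraries this notebook imports
REQUIRED = ["beautifulsoup4", "selenium", "objectpath", "pandas", "webdriver-manager"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Anything reported as missing can usually be installed with `pip install <name>`.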

## 2. What data are you trying to get? 
This is the first question to ask yourself before touching a single key. In our case, we started with the idea of collecting the list of universities together with their rankings. To work out how, you need to visit the website and get familiar with its pages. <br/>

The page we are trying to scrape looked something like this:

![title](img/basic_page_01.PNG)
<br/> <br/> <br/> 


The page clearly contains some sort of table that holds the information we want. However, how we collect it depends on the HTML behind what the browser window displays. In Chrome, press F12 to display the HTML. The page should look something like this:

![title](img/page_code_02.PNG)
<br/> <br/> <br/> 


Using the small inspection cursor, you can point at elements of the page and find out which part of the HTML represents them. This is important because we will use only the HTML to collect the data, not the page as displayed in the browser. Once you have identified the part of the HTML that corresponds to the information we need, we can start scraping.
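
To make the idea concrete, here is a minimal sketch of pulling table cells out of raw HTML. The HTML snippet and the university names are made up for illustration. The parser uses Python's built-in `html.parser` so that it runs without extra installs; with BeautifulSoup the same extraction would be shorter. `TableRowParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Made-up HTML that mimics a ranking table: one <tr> per university
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td><a href="/uni/example-a">Example University A</a></td></tr>
    <tr><td>2</td><td><a href="/uni/example-b">Example University B</a></td></tr>
  </tbody>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text nested inside the cell (for example within <a>) is appended too
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

The point is that the scraper only ever sees tags such as `tr`, `td` and `a`, which is why inspecting the HTML first matters.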

## 3. Let's write some Python 

A standard way to request web pages from Python is the requests library. In this particular case that approach does not work, because the website uses AJAX to modify the HTML of the page: the HTML returned by requests contains only an empty template of the table, not the information we are trying to collect. To give the JavaScript a chance to run and populate the table, we use Selenium. Selenium drives a real browser to request the page, so we can collect the HTML after the page has fully loaded. 
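
Waiting for a page to finish loading usually comes down to polling: check a condition repeatedly until it holds or a timeout expires. The browser-free sketch below illustrates that idea only. `wait_until`, `table_rows` and `page_state` are hypothetical names, and the 0.2 second delay stands in for JavaScript filling the table. It is not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose table is filled in by JavaScript after a short delay
page_state = {"rows": []}
ready_at = time.monotonic() + 0.2


def table_rows():
    """Return the rows currently 'rendered'; empty until the simulated delay has passed."""
    if time.monotonic() >= ready_at:
        page_state["rows"] = ["row 1", "row 2"]
    return page_state["rows"]


rows = wait_until(table_rows, timeout=2.0, poll_interval=0.05)
print(rows)
```

A single request at time zero would have seen the empty list, which mirrors what happens when the page is fetched without running its JavaScript.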

In [21]:
# import standard libraries
import json
import time

# import third party libraries
import objectpath
import pandas as pd

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen 
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

In [6]:
url = 'https://www.timeshighereducation.com/world-university-rankings/latest/world-ranking'

In [24]:
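# Launch Chrome through Selenium and load the rankings page
driver = webdriver.Chrome()
driver.get(url)
# Locate the table body; the generated class name "css-79elbk" may change when the site is rebuilt
container = driver.find_element(By.CSS_SELECTOR, "div.css-79elbk tbody")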
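
The next cell collects rows while scrolling, which suggests that only part of the table is present in the HTML at any moment. It therefore keeps a set of links it has already seen and stops once several passes in a row add nothing new. Here is a browser-free sketch of that pattern. `collect_unique`, `patience` and the snapshot data are hypothetical; the real cell uses a threshold of 25 passes and reads rows from the live page.

```python
def collect_unique(snapshots, patience=3):
    """Accumulate rows keyed by their link until `patience` snapshots in a row add nothing new."""
    seen = set()
    collected = []
    stale_passes = 0
    for visible_rows in snapshots:
        added = 0
        for name, href in visible_rows:
            # Skip rows without a link and rows whose link was already collected
            if href and href not in seen:
                seen.add(href)
                collected.append((name, href))
                added += 1
        stale_passes = 0 if added else stale_passes + 1
        if stale_passes >= patience:
            break
    return collected


# Hypothetical snapshots of the rows visible after each scroll step
snapshots = [
    [("Uni A", "/a"), ("Uni B", "/b")],
    [("Uni B", "/b"), ("Uni C", "/c")],
    [("Uni C", "/c")],
    [("Uni C", "/c")],
    [("Uni D", "/d")],  # never reached when patience=2
]
result = collect_unique(snapshots, patience=2)
print(result)
```

A larger `patience` value is more tolerant of slow loading, but it takes longer to finish once the end of the table has been reached.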

In [50]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

url = 'https://www.timeshighereducation.com/world-university-rankings/latest/world-ranking'
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

# Close the cookie banner
try:
    accept_btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Allow all')]"))
    )
    driver.execute_script("arguments[0].click();", accept_btn)
    time.sleep(2)
except Exception:
    # No banner appeared, or it could not be clicked; carry on
    pass

# Wait for the table
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
)

# Collect ALL rows (not just unique ones)
all_data = []
seen_hrefs = set()  # track rows by link to avoid duplicates

last_new_count = 0
no_change_count = 0
scroll_index = 0

for attempt in range(2000):
    rows = driver.find_elements(By.CSS_SELECTOR, "table tbody tr")
    
    for row in rows:
        cells = row.find_elements(By.TAG_NAME, "td")
        if len(cells) == 0:
            continue
        
        # Extract the text from the cells
        row_data = [cell.text.strip() for cell in cells]
        
        if not any(row_data):
            continue
        
        # Extract the link to the university page (usually in the 2nd column, the name)
        try:
            link_element = cells[1].find_element(By.TAG_NAME, "a")
            uni_url = link_element.get_attribute("href")
        except Exception:
            uni_url = ""
        
        # Add the row only if its link has not been seen before
        if uni_url and uni_url not in seen_hrefs:
            seen_hrefs.add(uni_url)
            row_data.append(uni_url)  # append the link as the last column
            all_data.append(row_data)
    
    current_total = len(all_data)
    
    if current_total > last_new_count:
        print(f"Attempt {attempt + 1}: collected {current_total} universities")
        last_new_count = current_total
        no_change_count = 0
    else:
        no_change_count += 1
        if no_change_count >= 25:
            print("No new rows are appearing, stopping")
            break
    
    # Scroll gradually
    if len(rows) > 10:
        target_index = min(len(rows) - 5, scroll_index)
        if target_index < len(rows):
            driver.execute_script("""
                arguments[0].scrollIntoView({block: 'center', inline: 'nearest'});
            """, rows[target_index])
        scroll_index += 3
    
    time.sleep(1)

print(f"Total universities: {len(all_data)}")

# Save to CSV
import pandas as pd
df = pd.DataFrame(all_data)
df.to_csv("the_rankings_2026_full.csv", index=False, encoding='utf-8')
print("Data saved to the_rankings_2026_full.csv")

driver.quit()


Attempt 1: collected 29 universities
Attempt 3: collected 30 universities
Attempt 3: collected 30 universities
Attempt 4: collected 33 universities
Attempt 4: collected 33 universities
Attempt 5: collected 36 universities
Attempt 5: collected 36 universities
Attempt 6: collected 39 universities
Attempt 6: collected 39 universities
Attempt 7: collected 41 universities
Attempt 7: collected 41 universities
Attempt 8: collected 45 universities
Attempt 8: collected 45 universities
Attempt 9: collected 48 universities
Attempt 9: collected 48 universities
Attempt 10: collected 50 universities
Attempt 10: collected 50 universities
Attempt 11: collected 53 universities
Attempt 11: collected 53 universities
Attempt 12: collected 56 universities
Attempt 12: collected 56 universities
Attempt 13: collected 63 universities
Attempt 13: collected 63 universities
Attempt 14: collected 71 universities
Attempt 14: collected 71 universities
Attempt 15: collected 83 universities
Attempt 15: collected 83 universities
Попытка 16: собр