In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

<p>ChromeDriver is a separate executable that Selenium WebDriver uses to control Chrome. It is maintained by the Chromium team with help from WebDriver contributors. If you are unfamiliar with Selenium WebDriver, you should check out the <a href="https://www.selenium.dev/">Selenium site</a>.</p>

ChromeDriver builds that match your Chrome version can be downloaded from Chrome for Testing: https://googlechromelabs.github.io/chrome-for-testing/

## 1- Create the df dataframe with university code, type (state or private) and name

In [1059]:
service = Service("chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get("https://yokatlas.yok.gov.tr/lisans-anasayfa.php")
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 s for elements that are not found immediately

# Keep the first four <optgroup> elements and use the first word of each label as the university type
optgroups = driver.find_elements(by=By.XPATH, value="//optgroup[@label]")[:4]
labels = [optgroup.get_attribute("label").split()[0] for optgroup in optgroups]

# Nested dictionary: {uni_type: {uni_code: uni_name, ...}, ...}
universities = {}
for label, optgroup in zip(labels, optgroups):
    options = optgroup.find_elements(by=By.TAG_NAME, value="option")
    universities[label] = {int(option.get_attribute("value")): option.get_attribute("innerText").strip() for option in options}

# Reshape to one row per (uni_code, uni_type) pair
df = pd.DataFrame.from_dict(universities).stack().to_frame().reset_index()
df.columns = ["uni_code", "uni_type", "uni_name"]
print("Shape of the df:", df.shape)
df.head(3)

Shape of the df: (224, 3)


| | uni_code | uni_type | uni_name |
|---|---|---|---|
| 0 | 1065 | Devlet | ABDULLAH GÜL ÜNİVERSİTESİ |
| 1 | 1104 | Devlet | ADANA ALPARSLAN TÜRKEŞ BİLİM VE TEKNOLOJİ ÜNİV... |
| 2 | 1002 | Devlet | ADIYAMAN ÜNİVERSİTESİ |
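The reshaping step above (`from_dict(...).stack()`) can be hard to picture, so here is a minimal sketch on a small hand-written dictionary with the same nesting as `universities`. The sample entries are only illustrative. The explicit `.dropna()` is an addition of this sketch: depending on the pandas version, `stack()` may or may not discard the empty (code, type) combinations by itself.

```python
import pandas as pd

# Hand-written sample in the same shape as `universities`:
# {uni_type: {uni_code: uni_name}}
sample = {
    "Devlet": {1065: "ABDULLAH GÜL ÜNİVERSİTESİ", 1002: "ADIYAMAN ÜNİVERSİTESİ"},
    "Vakıf": {2001: "ACIBADEM MEHMET ALİ AYDINLAR ÜNİVERSİTESİ"},
}

# Wide form: uni_type in the columns, uni_code in the index,
# NaN wherever a code does not belong to that type
wide = pd.DataFrame.from_dict(sample)

# Long form: one row per (uni_code, uni_type) pair that actually exists
long_df = wide.stack().dropna().to_frame().reset_index()
long_df.columns = ["uni_code", "uni_type", "uni_name"]
print(long_df)
```

Each university ends up on its own row, which is the layout the later merge on `uni_name` expects.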


## 2- Create df_city with the columns "city name" and "university name"

In this step we use a second YÖK Atlas page that lists each university together with its city.

In [1060]:
driver.get("https://yokatlas.yok.gov.tr/universite.php")
driver.implicitly_wait(10)
city_web_elements = driver.find_elements(by=By.CLASS_NAME, value="sehir")
cities = [web_element.get_attribute("innerText").strip() for web_element in city_web_elements]

name_web_elements = driver.find_elements(by=By.CLASS_NAME, value="baslik")
names =[web_element.get_attribute("innerText").strip() for web_element in name_web_elements]

df_city = pd.DataFrame({"city":cities,"uni_name":names})
print("Shape of the dataframe df_city:",df_city.shape)
print("Number of unique cities",df_city["city"].nunique())
df_city.head(3)

Shape of the dataframe df_city: (205, 2)
Number of unique cities 81


| | city | uni_name |
|---|---|---|
| 0 | Kayseri | ABDULLAH GÜL ÜNİVERSİTESİ |
| 1 | İstanbul | ACIBADEM MEHMET ALİ AYDINLAR ÜNİVERSİTESİ |
| 2 | Adana | ADANA ALPARSLAN TÜRKEŞ BİLİM VE TEKNOLOJİ ÜNİV... |
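Since `uni_name` is used as the join key in the next section, it may be worth confirming that no name appears twice in the scraped list, because a repeated key would multiply rows in a merge. A minimal sketch follows; `find_duplicate_names` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd

def find_duplicate_names(frame, column="uni_name"):
    """Hypothetical helper: return every row whose name occurs more than once."""
    return frame[frame.duplicated(subset=column, keep=False)]

# Made-up rows; "UNIVERSITY A" is listed in two cities
demo_city = pd.DataFrame({
    "city": ["Kayseri", "İstanbul", "Adana"],
    "uni_name": ["UNIVERSITY A", "UNIVERSITY B", "UNIVERSITY A"],
})

duplicates = find_duplicate_names(demo_city)
print(duplicates)
```

An empty result means the column is safe to use as a one-to-one key.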


## 3-Merge the two dataframes: df and df_city as df

In [1061]:
df = df.merge(df_city, on="uni_name", how='right')
print("Shape of df after the right merge (universities not in df_city, i.e. those abroad, are dropped):",df.shape)
df.head(3)

Shape of df after the right merge (universities not in df_city, i.e. those abroad, are dropped): (205, 4)


| | uni_code | uni_type | uni_name | city |
|---|---|---|---|---|
| 0 | 1065.0 | Devlet | ABDULLAH GÜL ÜNİVERSİTESİ | Kayseri |
| 1 | 2001.0 | Vakıf | ACIBADEM MEHMET ALİ AYDINLAR ÜNİVERSİTESİ | İstanbul |
| 2 | 1104.0 | Devlet | ADANA ALPARSLAN TÜRKEŞ BİLİM VE TEKNOLOJİ ÜNİV... | Adana |


In [1062]:
print("Number of state universities:",len(df.loc[df['uni_type']=='Devlet']))

Number of state universities: 126
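To see exactly which universities fail to match in a merge like the one above, one option is an outer merge with `indicator=True`, which labels every row as `both`, `left_only` or `right_only`. This is a sketch on made-up frames, not the notebook's data; "EXAMPLE UNIVERSITY ABROAD" and the code 9001 are placeholders.

```python
import pandas as pd

# Made-up stand-ins for df (codes and types) and df_city (cities)
codes = pd.DataFrame({
    "uni_code": [1065, 9001],
    "uni_type": ["Devlet", "Devlet"],
    "uni_name": ["ABDULLAH GÜL ÜNİVERSİTESİ", "EXAMPLE UNIVERSITY ABROAD"],
})
cities = pd.DataFrame({
    "city": ["Kayseri", "İstanbul"],
    "uni_name": ["ABDULLAH GÜL ÜNİVERSİTESİ", "ATAŞEHİR ADIGÜZEL MESLEK YÜKSEKOKULU"],
})

# The outer merge keeps unmatched rows from both sides; _merge records their origin
audit = codes.merge(cities, on="uni_name", how="outer", indicator=True)

only_in_codes = audit.loc[audit["_merge"] == "left_only", "uni_name"].tolist()
only_in_cities = audit.loc[audit["_merge"] == "right_only", "uni_name"].tolist()
print("No city found for:", only_in_codes)
print("No code found for:", only_in_cities)
```

Names in `only_in_codes` would disappear in a right merge, while names in `only_in_cities` would survive with missing `uni_code` and `uni_type`.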


## 4-Data Pre-Processing

**The number of universities dropped from 224 to 205 because the right merge keeps only the universities listed in df_city, all of which are in Turkey. The 19 universities abroad were dropped.<br>
 The number of state universities should be 129, but only 126 were found. The 3 missing universities are**
* TÜRKİYE ULUSLARARASI İSLAM, BİLİM VE TEKNOLOJİ ÜNİVERSİTESİ
* TÜRK-JAPON BİLİM VE TEKNOLOJİ ÜNİVERSİTESİ
* ANKARA MÜZİK VE GÜZEL SANATLAR ÜNİVERSİTESİ	<br>
The first two universities do not accept students yet, so they are not listed on the YÖK Atlas website. <br>
The third university is listed in df_city, but its uni_type and uni_code values are missing after the merge. We assign it a placeholder uni_code of 1 and set uni_type to "Devlet" (state).

 #### 4-1. Set missing uni_type and uni_code of *ANKARA MÜZİK VE GÜZEL SANATLAR ÜNİVERSİTESİ*

In [1063]:
df.loc[df["uni_name"]=="ANKARA MÜZİK VE GÜZEL SANATLAR ÜNİVERSİTESİ","uni_code"]= 1
df.loc[df["uni_name"]=="ANKARA MÜZİK VE GÜZEL SANATLAR ÜNİVERSİTESİ","uni_type"]= "Devlet"

 #### 4-2. Drop rows with missing values (vocational schools)

Some institutions are vocational schools that offer only 2-year associate degree programs and no bachelor's degree programs.<br>
Their uni_type and uni_code values are NaN, so we drop them from this analysis.

In [1064]:
df [df["uni_code"].isna() ]

| | uni_code | uni_type | uni_name | city |
|---|---|---|---|---|
| 25 | NaN | NaN | ATAŞEHİR ADIGÜZEL MESLEK YÜKSEKOKULU | İstanbul |
| 106 | NaN | NaN | İSTANBUL SAĞLIK VE SOSYAL BİLİMLER MESLEK YÜKS... | İstanbul |
| 108 | NaN | NaN | İSTANBUL ŞİŞLİ MESLEK YÜKSEKOKULU | İstanbul |
| 120 | NaN | NaN | İZMİR KAVRAM MESLEK YÜKSEKOKULU | İzmir |


We drop the vocational schools.

In [1072]:
df.dropna(axis=0,inplace=True)
print("Any missing values:",df.isna().any(axis=None))

Any missing values: False


Now there are no missing values.
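A plain `dropna()` removes a row if any column is missing. When the intent is only to drop rows without a code or type, passing `subset` makes that explicit. The sketch below uses made-up rows, including an artificial missing city, to show the difference.

```python
import pandas as pd

demo = pd.DataFrame({
    "uni_code": [1065.0, None, 2001.0],
    "uni_type": ["Devlet", None, "Vakıf"],
    "uni_name": [
        "ABDULLAH GÜL ÜNİVERSİTESİ",
        "ATAŞEHİR ADIGÜZEL MESLEK YÜKSEKOKULU",
        "ACIBADEM MEHMET ALİ AYDINLAR ÜNİVERSİTESİ",
    ],
    "city": ["Kayseri", "İstanbul", None],  # artificial gap, for illustration only
})

# Drop only the rows that lack a code or a type
by_key = demo.dropna(subset=["uni_code", "uni_type"])

# Drop every row that has a gap in any column
any_column = demo.dropna()

print(len(by_key), len(any_column))
```

With `subset`, the row with the artificial city gap is kept, so an unrelated missing value cannot silently remove a university.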

In [1073]:
print("Shape of the df after dropping vocational schools",df.shape)

Shape of the df after dropping vocational schools (201, 4)


 #### 4-3. Add the campus cities of SAĞLIK BİLİMLERİ ÜNİVERSİTESİ

The medical faculty of SAĞLIK BİLİMLERİ ÜNİVERSİTESİ has campuses in several cities. We add one row per city.

In [1077]:
# Append one row per campus city with pd.concat. Assigning to df.loc[len(df)] is unsafe here:
# dropna() left gaps in the index, so the label len(df) already exists and would be overwritten.
sbu_cities = ["Ankara", "Adana", "Erzurum", "Bursa", "Trabzon", "İzmir", "Kayseri"]
sbu_rows = pd.DataFrame({
    "uni_code": 1110,
    "uni_type": "Devlet",
    "uni_name": "SAĞLIK BİLİMLERİ ÜNİVERSİTESİ",
    "city": sbu_cities,
})
df = pd.concat([df, sbu_rows], ignore_index=True)

In [1078]:
print("Shape of the df after adding cities for SAĞLIK BİLİMLERİ ÜNİVERSİTESİ",df.shape)

Shape of the df after adding cities for SAĞLIK BİLİMLERİ ÜNİVERSİTESİ (208, 4)
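Appending rows by label is sensitive to the gaps that dropping rows leaves in the index. The sketch below, on a made-up four-row frame, contrasts assignment to `loc[len(frame)]` with `pd.concat(..., ignore_index=True)`.

```python
import pandas as pd

demo = pd.DataFrame({"uni_name": ["A", "B", "C", "D"], "city": ["w", "x", "y", "z"]})
demo = demo.drop(index=1)  # index labels are now 0, 2, 3 while len(demo) == 3

# Label 3 already exists, so this replaces row "D" instead of adding a row
overwritten = demo.copy()
overwritten.loc[len(overwritten)] = ["E", "v"]

# concat with ignore_index=True always appends and renumbers the index
new_row = pd.DataFrame({"uni_name": ["E"], "city": ["v"]})
appended = pd.concat([demo, new_row], ignore_index=True)

print(len(overwritten), len(appended))
```

Resetting the index after dropping rows (`reset_index(drop=True)`) is another way to make label-based appends safe again.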


 #### 4-4. Converting data type and arranging columns

We convert the uni_code column to int type, since the merge left it as float.

In [1079]:
df["uni_code"] = df["uni_code"].astype("int")
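The conversion above works only because no missing codes remain. If some rows still had a missing `uni_code`, `astype("int")` would raise an error; pandas' nullable `"Int64"` dtype is one alternative that keeps integers and marks gaps as `<NA>`. A small sketch on a made-up series:

```python
import pandas as pd

# Made-up codes with one gap, stored as float the way a merge leaves them
codes = pd.Series([1065.0, None, 2001.0], name="uni_code")

# Nullable integer dtype: whole numbers stay integers, the gap becomes <NA>
nullable_codes = codes.astype("Int64")
print(nullable_codes)
```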

We rearrange the columns so that uni_type and city come first.

In [1081]:
df = df[["uni_type","city","uni_name","uni_code"]]
df.head(3)

| | uni_type | city | uni_name | uni_code |
|---|---|---|---|---|
| 0 | Devlet | Kayseri | ABDULLAH GÜL ÜNİVERSİTESİ | 1065 |
| 1 | Vakıf | İstanbul | ACIBADEM MEHMET ALİ AYDINLAR ÜNİVERSİTESİ | 2001 |
| 2 | Devlet | Adana | ADANA ALPARSLAN TÜRKEŞ BİLİM VE TEKNOLOJİ ÜNİV... | 1104 |


In [736]:
df.to_csv("df_template.csv" ,index=False)
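Passing `index=False` matters when the file is read back: with the default `index=True`, the row labels are written as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below demonstrates this with an in-memory buffer instead of a file; the one-row frame is only illustrative.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "uni_type": ["Devlet"],
    "city": ["Kayseri"],
    "uni_name": ["ABDULLAH GÜL ÜNİVERSİTESİ"],
    "uni_code": [1065],
})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```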