1. Choose a Website:
The website I intend to scrape, "https://vpri.ku.edu.tr/basarilar/tubitak-bilim-odulleri/", is part of the site of Koç University's Vice President for Research and Innovation (VPRI) and showcases achievements and awards received by professors associated with the university. Specifically, I aim to scrape the data on the "Tübitak Bilim Ödülleri" (Tübitak Science Awards) page.

The data I intend to scrape includes:

- Names of individuals who have received Tübitak Science Awards
- Departments or faculties associated with the individuals
- Years in which the awards were received
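
The three fields listed above could be modeled as one record per award winner. The following is a minimal sketch under that assumption; the `AwardRecord` class and the `build_records` helper are hypothetical names introduced only for illustration, and the sample values are taken from the scraped output shown later in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AwardRecord:
    """One award winner (hypothetical structure, not part of the notebook)."""
    name: str
    faculty: str
    year: int


def build_records(names, faculties, years):
    """Pair three parallel lists into records, refusing misaligned input."""
    if not (len(names) == len(faculties) == len(years)):
        raise ValueError("names, faculties and years must have the same length")
    return [
        AwardRecord(name, faculty, int(year))
        for name, faculty, year in zip(names, faculties, years)
    ]


records = build_records(
    ["Burak Erman", "Attila Aşkar"],
    ["Mühendislik", "Matematik, Fen"],
    ["1991", "1993"],
)
print(records[0])
```

Checking the list lengths up front matters because the three fields are scraped separately: if one list comes back shorter, the rows would otherwise shift silently.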

In [63]:
pip install scrapy

Note: you may need to restart the kernel to use updated packages.


In [65]:
# 2. Set Up Your Environment:

# Import a scrapy Selector
from scrapy import Selector

# Import requests
import requests

url = 'https://vpri.ku.edu.tr/basarilar/tubitak-bilim-odulleri/'


In [67]:
# 3. Data Extraction:
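# Download the page; html holds the raw HTML source as bytes
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the server returned an HTTP error
html = response.content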

# Create the Selector object sel from html
sel = Selector(text = html)

xpath_for_titles = '//*[@id="main"]/div/div[2]/div/div/div/div/div/div/div/div/div[2]/span[1]/text()'
titles = sel.xpath(xpath_for_titles).extract()
print(titles)

xpath_for_faculties = '//*[@id="main"]/div/div[2]/div/div/div/div/div/div/div/div/div[2]/span[3]/text()'
faculties = sel.xpath(xpath_for_faculties).extract()
print(faculties)

xpath_for_year = '//*[@id="main"]/div/div[2]/div/div/div/div/div/div/div/div/div[2]/span[2]/text()'
year = sel.xpath(xpath_for_year).extract()
print(year)

['Burak Erman', 'Attila Aşkar', 'Ali Ülger', 'Tekin Dereli', 'İskender Yılgör', 'Yaman Arkun', 'Murat Tekalp', 'Ali Mostafazadeh', 'M. İrşadi Aksun', 'Çiğdem Kağıtçıbaşı', 'Özlem Keskin Özkaya', 'Ziya Öniş', 'Alphan Sennaroğlu', 'Zeynep Aycan', 'Sumru Altuğ', 'Özgür Barış Akan']
['Mühendislik Fakültesi', 'Matematik, Fen Fakültesi', 'Matematik, Fen Fakültesi', 'Matematik, Fen Fakültesi', 'Fizik, Fen Fakültesi', 'Fizik, Fen Fakültesi', 'Mühendislik Fakültesi', 'Fizik, Fen Fakültesi', 'Elektrik Elektronik Mühendisliği, Mühendislik Fakültesi', 'Sosyal Bilimler', 'Mühendislik', 'Sosyal Bilimler', 'Mühendislik', 'Sosyal Bilimler', 'Ekonomi', 'Mühendislik']
['Tübitak Bilim Ödülü, 1991', 'Tübitak Bilim Ödülü, 1993', 'Tübitak Bilim Ödülü, 1995', 'Tübitak Bilim Ödülü, 1996', 'Tübitak Bilim Ödülü, 2003', 'Tübitak Bilim Ödülü, 2003', 'Tübitak Bilim Ödülü, 2004', 'Tübitak Bilim Ödülü, 2007', 'Tübitak Bilim Ödülü, 2007', 'Tübitak Bilim Ödülü, 2011', 'Tübitak Bilim Ödülü, 2012', 'Tübitak Bilim Ödülü,
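
The absolute XPath expressions above return three independent lists, so a missing `span` on one card would misalign every row after it. One possible alternative is to select each card first and then read its fields with relative paths. The sketch below illustrates that idea with the standard library's `xml.etree.ElementTree` instead of Scrapy's `Selector`, so that it runs without network access. The markup and the class name `award-card` are hypothetical and hand-written; the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written markup (assumption): only the span order
# (name, award, faculty) mirrors the XPath expressions used in this notebook.
SAMPLE = """
<div id="main">
  <div class="award-card">
    <span>Burak Erman</span>
    <span>Tübitak Bilim Ödülü, 1991</span>
    <span>Mühendislik Fakültesi</span>
  </div>
  <div class="award-card">
    <span>Attila Aşkar</span>
    <span>Tübitak Bilim Ödülü, 1993</span>
    <span>Matematik, Fen Fakültesi</span>
  </div>
</div>
"""


def extract_rows(markup):
    """Return one (name, award, faculty) tuple per card."""
    root = ET.fromstring(markup.strip())
    rows = []
    for card in root.findall(".//div[@class='award-card']"):
        # Each path is relative to the current card, so the three
        # fields of one row always come from the same element.
        name = card.findtext("span[1]", default="").strip()
        award = card.findtext("span[2]", default="").strip()
        faculty = card.findtext("span[3]", default="").strip()
        rows.append((name, award, faculty))
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

With Scrapy's `Selector`, the same pattern would be a loop over `sel.xpath(...)` for the card elements, calling `card.xpath('./span[1]/text()').get()` on each card; the card-level XPath would have to be worked out from the live page.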

In [68]:
# 4. Data Cleaning:

import pandas as pd

# Define the scraped data
data = {
    'Name': titles,
    'Faculty': faculties,
    'Year': year
}

# Create a pandas DataFrame from the scraped data
df = pd.DataFrame(data)

# Remove the literal prefix "Tübitak Bilim Ödülü, " from the 'Year' column
df['Year'] = df['Year'].str.replace('Tübitak Bilim Ödülü, ', '', regex=False)

# Remove the word "Fakültesi" (with its leading space) from the 'Faculty' column
df['Faculty'] = df['Faculty'].str.replace(' Fakültesi', '', regex=False)

# Remove the word "Mühendisliği" (with its leading space) from the 'Faculty' column
df['Faculty'] = df['Faculty'].str.replace(' Mühendisliği', '', regex=False)

# Add a main title

# Calculate the required spacing to center the main title
total_width = len(df.to_string().split('\n')[0])  # Total width of the DataFrame output
title_width = len("TÜBİTAK BİLİM ÖDÜLLERİ")  # Width of the main title
left_padding = (total_width - title_width) // 2  # Calculate the left padding
right_padding = total_width - title_width - left_padding  # Calculate the right padding

# Add the main title as bold text and in the center
main_title = "\033[1m" + " " * left_padding + "TÜBİTAK BİLİM ÖDÜLLERİ" + " " * right_padding + "\033[0m"
print(main_title)

# Print the DataFrame
display(df)

                    TÜBİTAK BİLİM ÖDÜLLERİ


    Name                Faculty                           Year
0   Burak Erman         Mühendislik                       1991
1   Attila Aşkar        Matematik, Fen                    1993
2   Ali Ülger           Matematik, Fen                    1995
3   Tekin Dereli        Matematik, Fen                    1996
4   İskender Yılgör     Fizik, Fen                        2003
5   Yaman Arkun         Fizik, Fen                        2003
6   Murat Tekalp        Mühendislik                       2004
7   Ali Mostafazadeh    Fizik, Fen                        2007
8   M. İrşadi Aksun     Elektrik Elektronik, Mühendislik  2007
9   Çiğdem Kağıtçıbaşı  Sosyal Bilimler                   2011
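
After the prefix is stripped, the 'Year' column still holds strings. A possible extra cleaning step is to pull out the four-digit year and convert it to an integer, which also copes with entries whose prefix is spelled slightly differently. The sketch below uses only the standard library; `parse_year` is a hypothetical helper name, not part of the notebook.

```python
import re

# Four consecutive digits standing on their own, e.g. "1991".
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")


def parse_year(text):
    """Return the first four-digit year in text as an int, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None


print(parse_year("Tübitak Bilim Ödülü, 1991"))
print(parse_year("Tübitak Bilim Ödülü"))
```

Applied to the DataFrame, this could look like `df['Year'] = df['Year'].map(parse_year)`; rows where no year is found would then hold a missing value instead of leftover text.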


In [69]:
# 5. Data Storage:
df.to_csv('extracted_data.csv', index=False)
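
Because the names and faculties contain Turkish characters and embedded commas, it can be worth confirming that the stored file reads back unchanged. The sketch below does such a round trip with the standard library's `csv` module on two rows copied from the output above; the helper `write_and_read_back` and the file name `extracted_data_check.csv` are hypothetical, and `utf-8-sig` is an assumed encoding choice, not what the notebook uses.

```python
import csv
import os
import tempfile

# Two rows copied from the cleaned output above.
rows = [
    {"Name": "Çiğdem Kağıtçıbaşı", "Faculty": "Sosyal Bilimler", "Year": "2011"},
    {"Name": "İskender Yılgör", "Faculty": "Fizik, Fen", "Year": "2003"},
]


def write_and_read_back(rows, path):
    """Write rows to a CSV file, then load the file again."""
    fieldnames = ["Name", "Faculty", "Year"]
    # utf-8-sig writes a byte order mark at the start of the file.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "extracted_data_check.csv")
    restored = write_and_read_back(rows, path)

print(restored == rows)
```

The pandas counterpart would be `df.to_csv('extracted_data.csv', index=False, encoding='utf-8-sig')`; the byte order mark helps spreadsheet programs such as Excel detect UTF-8.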