In [1]:
import requests, re, folium
import pandas as pd
from bs4 import BeautifulSoup
from folium import plugins
from folium.plugins import MarkerCluster, FloatImage  # folium's MarkerCluster is the one used below

## Web Scraping

The data comes from Wikipedia, which maintains category lists of Roman and Greek sites in Turkey. [Here](https://en.wikipedia.org/wiki/Category:Roman_sites_in_Turkey) is the link for the Roman sites. The Greek sites are split across 2 pages, so there are 3 list pages to scrape.

Our code follows the URL of each ancient site in these lists and scrapes the necessary data from each page, one by one.

This means that we need loops and functions to automate the process.

### Creating Functions

The functions below try to extract the latitude and longitude from the given Wikipedia link. If a page has no coordinate element with the expected class, they return None.

In [2]:
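def getlat(links): 
    
    # Download the site page and parse it
    r = requests.get(links)
    soup = BeautifulSoup(r.text, "html.parser")
    
    try:

        # The latitude is looked up in a <span class="latitude"> element
        lat = soup.find("span", {"class" : "latitude"}).text    
        return lat
        
    except AttributeError:
        
        # find() returned None: the page has no such element
        return None
    
def getlong(links): 
    
    # Download the site page and parse it
    r = requests.get(links)
    soup = BeautifulSoup(r.text, "html.parser")
    
    try:

        # The longitude is looked up in a <span class="longitude"> element
        long = soup.find("span", {"class" : "longitude"}).text
        return long
        
    except AttributeError:
        
        # find() returned None: the page has no such element
        return None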

The function below converts coordinates from degree-minute-second (DMS) format to decimal degrees, which is the format the mapping library needs. Not every latitude and longitude comes with minute or second precision, so nested try and except blocks fall back to zero for the missing parts.

In [3]:
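def dms2dec(s):
    
    try:
    
        deg, a = re.split("°", s) 
        try: 

            minu, b = re.split("′", a)
        except ValueError:
            
            # No minutes: what is left is only the direction letter
            minu, b = 0, a
        try: 
            
            sec, direction = re.split("″", b)
        except ValueError:
            
            # No seconds: keep the direction letter instead of discarding it
            sec, direction = 0, b
            
        decimal = float(deg) + float(minu)/60 + float(sec)/3600
    
        # Southern and western coordinates are negative
        if direction in ('S','W'):
    
            decimal *= -1
    
        return decimal
    
    except (TypeError, ValueError):
        
        # Missing values (NaN) and malformed strings end up here
        return None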
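
As a cross-check on the conversion, the same parsing can be sketched with a single regular expression instead of nested try and except blocks. This is only a minimal sketch, not the notebook's method: the name `dms_to_decimal` and the pattern are assumptions, and it expects the °, ′ and ″ symbols followed by an N, S, E or W letter, as in the scraped values.

```python
import re

# Hypothetical alternative parser: degrees are required, minutes and seconds are optional.
DMS_PATTERN = re.compile(
    r"^\s*(?P<deg>\d+(?:\.\d+)?)°"
    r"(?:(?P<min>\d+(?:\.\d+)?)′)?"
    r"(?:(?P<sec>\d+(?:\.\d+)?)″)?"
    r"\s*(?P<dir>[NSEW])\s*$"
)


def dms_to_decimal(text):
    """Convert a string such as '41°58′26″N' to decimal degrees, or None if it does not match."""
    if not isinstance(text, str):
        return None  # e.g. NaN for a missing coordinate
    match = DMS_PATTERN.match(text)
    if match is None:
        return None
    decimal = (
        float(match.group("deg"))
        + float(match.group("min") or 0) / 60
        + float(match.group("sec") or 0) / 3600
    )
    # Southern and western hemispheres are negative by convention.
    return -decimal if match.group("dir") in ("S", "W") else decimal


print(dms_to_decimal("37°54′N"))
print(dms_to_decimal("26°9′9″E"))
```

Because the direction letter is a named group of its own, it is read the same way whether or not minutes and seconds are present.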

The function below tries to get an image URL from the given Wikipedia link. If it cannot find an image with these parameters, it returns None.

In [4]:
def getimg(links):
    
    r = requests.get(links)

    soup = BeautifulSoup(r.text, "html.parser")
    
    try:

        newurl = "https://en.wikipedia.org/" + soup.find("a", {"class" : "image"})["href"]

        r2 = requests.get(newurl)

        soup2 = BeautifulSoup(r2.text, "html.parser")

        image = "https:" + soup2.find("a", {"class" : "mw-thumbnail-link"})["href"]

        return image

    except:
        
        return None

The function below returns the text of the first paragraph (the first "p" element) of the given Wikipedia page, which serves as a short description of the site.

In [5]:
def getdesc(links):
    
    r = requests.get(links)

    soup = BeautifulSoup(r.text, "html.parser")

    desc = soup.find("p").text

    return desc
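
Each helper above downloads the page on its own, so building one row requests the same URL four times. One way to avoid this, sketched under assumptions, is to put a cache in front of the download so that the helpers share a single request per URL. The names `get_html` and `fake_fetch` are hypothetical and not part of the notebook; `fake_fetch` only stands in for a real download (something like `requests.get(url).text`) so that the example runs offline.

```python
from functools import lru_cache

# Stand-in for a real download, used only to keep the sketch offline.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html><p>Example page for " + url + "</p></html>"


@lru_cache(maxsize=None)
def get_html(url):
    """Return the page source for url, fetching it at most once per URL."""
    return fake_fetch(url)


# Four lookups of the same page, as the four helpers would do for one site.
for _ in range(4):
    html = get_html("https://en.wikipedia.org/wiki/Abonoteichos")

print(len(calls))  # 1: only the first lookup reached fake_fetch
```

With a cache like this, each helper would parse the shared page source instead of issuing its own request.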

### BeautifulSoup

Checking the HTTP status of the first category page.

In [6]:
url0 = "https://en.wikipedia.org/w/index.php?title=Category:Ancient_Greek_archaeological_sites_in_Turkey&pageuntil=Laertes+%28Cilicia%29#mw-pages"

r0 = requests.get(url0)

print(r0.status_code)

200


In [7]:
soup0 = BeautifulSoup(r0.text, "html.parser")

print(soup0.title.text)

Category:Ancient Greek archaeological sites in Turkey - Wikipedia


First, we create an empty list to hold our rows. Then we locate the site lists on each page and collect the necessary data.

"links" holds the URL of each ancient site, built from the "href" attribute of its anchor, so that we can request the page.

"rows" is a dictionary that holds one row of our dataset.

"Name" is the first column, taken from the "title" attribute of the anchor in the html file.

"Type" is filled in manually, since we know which list the page belongs to.

"Link" is the same as "links": the "href" with the domain prepended, so that it is a complete URL.

Location data, image URLs and descriptions are collected with the functions defined earlier.

In [8]:
frame =[]

In [9]:
groups0 = soup0.find_all("div", {"class" : "mw-category-group"})[-12:]

for group in groups0:
    
    sites = group.find_all("li")
    
    for site in sites:
        
        links = "https://en.wikipedia.org/" + site.find("a")["href"]
        
        rows = {"Name" : site.find("a")["title"],
                
        "Type" : "Greek",
        
        "Link" : "https://en.wikipedia.org/" + site.find("a")["href"],
        
        "Latitude" : getlat(links),
        
        "Longitude" : getlong(links),
                
        "ImageUrl" : getimg(links),
                
        "Description" : getdesc(links)}
        
        frame.append(rows)
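
The links above are built by concatenating a domain that ends in `/` with an `href` that already starts with `/`, which is why the stored links contain `//wiki/`. The scraping evidently works with such URLs, but a cleaner join can be sketched with the standard library's `urljoin`. The `BASE` constant and the `absolute_link` helper are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/"  # assumed base, the same domain as in the notebook


def absolute_link(href):
    """Resolve a site-relative href such as '/wiki/Abonoteichos' against the base URL."""
    return urljoin(BASE, href)


print(absolute_link("/wiki/Abonoteichos"))
```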

The same process is repeated for the second Greek page.

In [10]:
soup1 = BeautifulSoup(requests.get("https://en.wikipedia.org/w/index.php?title=Category:Ancient_Greek_archaeological_sites_in_Turkey&pagefrom=Laertes+%28Cilicia%29#mw-pages").text,
                      "html.parser")

groups1 = soup1.find_all("div", {"class" : "mw-category-group"})[-12:]

for group in groups1:
    
    sites = group.find_all("li")
    
    for site in sites:
        
        links = "https://en.wikipedia.org/" + site.find("a")["href"]
        
        rows = {"Name" : site.find("a")["title"],
                
        "Type" : "Greek",
        
        "Link" : "https://en.wikipedia.org/" + site.find("a")["href"],
        
        "Latitude" : getlat(links),
        
        "Longitude" : getlong(links),
                
        "ImageUrl" : getimg(links),
                
        "Description" : getdesc(links)}
        
        frame.append(rows)

The same process is repeated for the last (Roman) page.

In [11]:
soup2 = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Category:Roman_sites_in_Turkey").text,
                      "html.parser")

groups2 = soup2.find_all("div", {"class" : "mw-category-group"})

for group in groups2:
    
    sites = group.find_all("li")
    
    for site in sites:
        
        links = "https://en.wikipedia.org/" + site.find("a")["href"]
        
        rows = {"Name" : site.find("a")["title"],
                
        "Type" : "Roman",
        
        "Link" : "https://en.wikipedia.org/" + site.find("a")["href"],
        
        "Latitude" : getlat(links),
        
        "Longitude" : getlong(links),
                
        "ImageUrl" : getimg(links),
               
        "Description" : getdesc(links)}
        
        frame.append(rows)

## Data Wrangling

Scraping takes a while, so we can export our dataset once we have it and use it as a local file in the future.

In [12]:
df = pd.DataFrame(frame)

df.to_csv("AncientSites.csv")

df = pd.read_csv("AncientSites.csv", index_col = 0)

df.head()

Unnamed: 0,Name,Type,Link,Latitude,Longitude,ImageUrl,Description
0,Abonoteichos,Greek,https://en.wikipedia.org//wiki/Abonoteichos,41°58′26″N,33°45′58″E,https://upload.wikimedia.org/wikipedia/commons...,Abonoteichos (Greek: Ἀβώνου τεῖχος Avónou tích...
1,Abydos (Hellespont),Greek,https://en.wikipedia.org//wiki/Abydos_(Hellesp...,40°11′43″N,26°24′18″E,https://upload.wikimedia.org/wikipedia/commons...,"Abydos (Ancient Greek: Ἄβυδος, Latin: Abydus) ..."
2,Acharaca,Greek,https://en.wikipedia.org//wiki/Acharaca,37°54′N,28°06′E,https://upload.wikimedia.org/wikipedia/commons...,Acharaca (Ancient Greek: Ἀχάρακα) was a villag...
3,Achilleion (Troad),Greek,https://en.wikipedia.org//wiki/Achilleion_(Troad),39°54′54″N,26°9′9″E,https://upload.wikimedia.org/wikipedia/commons...,"Achilleion (Ancient Greek: Ἀχίλλειον, romanize..."
4,Acrassus,Greek,https://en.wikipedia.org//wiki/Acrassus,,,,Acrassus or Akrassos (Ancient Greek: Ἄκρασος) ...


Converting the spatial data from DMS strings to decimal degrees.

In [13]:
df["LatDec"] = df["Latitude"].apply(dms2dec).astype(float)

df["LongDec"] = df["Longitude"].apply(dms2dec).astype(float)

df.head()

Unnamed: 0,Name,Type,Link,Latitude,Longitude,ImageUrl,Description,LatDec,LongDec
0,Abonoteichos,Greek,https://en.wikipedia.org//wiki/Abonoteichos,41°58′26″N,33°45′58″E,https://upload.wikimedia.org/wikipedia/commons...,Abonoteichos (Greek: Ἀβώνου τεῖχος Avónou tích...,41.973889,33.766111
1,Abydos (Hellespont),Greek,https://en.wikipedia.org//wiki/Abydos_(Hellesp...,40°11′43″N,26°24′18″E,https://upload.wikimedia.org/wikipedia/commons...,"Abydos (Ancient Greek: Ἄβυδος, Latin: Abydus) ...",40.195278,26.405
2,Acharaca,Greek,https://en.wikipedia.org//wiki/Acharaca,37°54′N,28°06′E,https://upload.wikimedia.org/wikipedia/commons...,Acharaca (Ancient Greek: Ἀχάρακα) was a villag...,37.9,28.1
3,Achilleion (Troad),Greek,https://en.wikipedia.org//wiki/Achilleion_(Troad),39°54′54″N,26°9′9″E,https://upload.wikimedia.org/wikipedia/commons...,"Achilleion (Ancient Greek: Ἀχίλλειον, romanize...",39.915,26.1525
4,Acrassus,Greek,https://en.wikipedia.org//wiki/Acrassus,,,,Acrassus or Akrassos (Ancient Greek: Ἄκρασος) ...,,


Checking for missing values.

In [14]:
df.isna().sum()

Name            0
Type            0
Link            0
Latitude       49
Longitude      49
ImageUrl       79
Description     0
LatDec         49
LongDec        49
dtype: int64

In [15]:
print(((df.isna().sum()["Latitude"]*100)/len(df)).round(decimals = 2) , "percent of the dataset is empty for spatial data")

11.45 percent of the dataset is empty for spatial data


##### Note that the missing-value counts in the two formats of the spatial data can also be used to check our conversion function.
Both formats have 49 missing values, which means that the function converted every coordinate that was present.

We won't be able to work with 49 sites, but we can still store them in a different dataset.

In [16]:
df2 = df.copy()

df = df[df["Latitude"].notna()]

In [17]:
df2.shape[0] - df.shape[0]

49

Some sites appear in both the Greek and the Roman lists on Wikipedia. We first change their type to Greek/Roman, which makes the duplicated rows identical, and then drop the duplicates.

In [18]:
df.loc[df["Name"].duplicated(keep = False) == True, "Type"] = "Greek/Roman"
df.drop_duplicates(inplace = True, ignore_index = True)

df2.loc[df2["Name"].duplicated(keep = False) == True, "Type"] = "Greek/Roman"
df2.drop_duplicates(inplace = True, ignore_index = True)

Number of sites of each type, counted on df2, which still includes the sites without coordinates.

In [19]:
df2["Type"].value_counts()

Greek          276
Greek/Roman     55
Roman           42
Name: Type, dtype: int64

Custom icon URLs for the sites in the dataset.

In [20]:
df.loc[df["Type"] == "Roman", "IconUrl"] = "https://github.com/ocaktans/Mapping-of-the-Ancient-Greek-and-Roman-Sites-in-Turkey/blob/main/images/roman.png?raw=true"
df.loc[df["Type"] == "Greek", "IconUrl"] = "https://github.com/ocaktans/Mapping-of-the-Ancient-Greek-and-Roman-Sites-in-Turkey/blob/main/images/greek.png?raw=true"
df.loc[df["Type"] == "Greek/Roman", "IconUrl"] = "https://github.com/ocaktans/Mapping-of-the-Ancient-Greek-and-Roman-Sites-in-Turkey/blob/main/images/combined.png?raw=true"

Some site pages without an image of their own return a location map of Turkey, which has the same URL on every page. We will set the image URL to None in the rows that contain this specific URL.

In [21]:
df.loc[df["ImageUrl"] == "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Turkey_adm_location_map.svg/800px-Turkey_adm_location_map.svg.png",
      
      "ImageUrl"] = None 

In [22]:
df.sort_values(by = "Name", inplace = True)
df.head(5)

Unnamed: 0,Name,Type,Link,Latitude,Longitude,ImageUrl,Description,LatDec,LongDec,IconUrl
0,Abonoteichos,Greek,https://en.wikipedia.org//wiki/Abonoteichos,41°58′26″N,33°45′58″E,https://upload.wikimedia.org/wikipedia/commons...,Abonoteichos (Greek: Ἀβώνου τεῖχος Avónou tích...,41.973889,33.766111,https://github.com/ocaktans/Mapping-of-the-Anc...
1,Abydos (Hellespont),Greek,https://en.wikipedia.org//wiki/Abydos_(Hellesp...,40°11′43″N,26°24′18″E,https://upload.wikimedia.org/wikipedia/commons...,"Abydos (Ancient Greek: Ἄβυδος, Latin: Abydus) ...",40.195278,26.405,https://github.com/ocaktans/Mapping-of-the-Anc...
2,Acharaca,Greek,https://en.wikipedia.org//wiki/Acharaca,37°54′N,28°06′E,,Acharaca (Ancient Greek: Ἀχάρακα) was a villag...,37.9,28.1,https://github.com/ocaktans/Mapping-of-the-Anc...
3,Achilleion (Troad),Greek,https://en.wikipedia.org//wiki/Achilleion_(Troad),39°54′54″N,26°9′9″E,,"Achilleion (Ancient Greek: Ἀχίλλειον, romanize...",39.915,26.1525,https://github.com/ocaktans/Mapping-of-the-Anc...
4,Adada (Pisidia),Greek,https://en.wikipedia.org//wiki/Adada_(Pisidia),37°34′31″N,30°58′59″E,,Adada is an ancient city and archaeological si...,37.575278,30.983056,https://github.com/ocaktans/Mapping-of-the-Anc...


The Description column contains some undesired text, such as names in the Greek alphabet and the citation markers of the Wikipedia page.

The Description column is there to give users an idea of the site, so we want to make the most of it.

The Greek versions of the names take up a lot of space in the small description boxes and look a bit messy. They are written in parentheses, so we can drop them.

The citation markers are written in square brackets. They are dropped with the same method.

Sources, Greek versions of the names and further information remain available to users through the Wikipedia hyperlinks in the popups.

In [23]:
df["Description"] = df["Description"].str.replace(r"(\s*\[.*?\]\s*)", "", regex = True).str.strip()
df["Description"] = df["Description"].str.replace(r"(\s*\(.*?\)\s*)", "", regex = True).str.strip()
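
Both patterns above also consume the whitespace on each side of the removed text, so the words around a dropped parenthesis can end up joined (the final table shows "Abydoswas"). A minimal sketch of a variant that removes only the whitespace before a parenthesis is given here on plain strings with the standard `re` module. The function name `clean_description` and the sample sentence are made up for illustration, and nested parentheses are not handled.

```python
import re


def clean_description(text):
    """Drop [citation] markers and (parenthetical) asides without gluing words together."""
    text = re.sub(r"\[[^\]]*\]", "", text)    # citation markers such as [1]
    text = re.sub(r"\s*\([^)]*\)", "", text)  # parentheses plus the whitespace before them
    return text.strip()


# Illustrative sentence, not an actual scraped description.
sample = "Abydos (Ancient Greek: Ἄβυδος, Latin: Abydus) was an ancient city.[1] It lay in Mysia."
print(clean_description(sample))
```

The same two patterns could presumably be passed to `str.replace` with `regex = True` on the Description column.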

## Creating the Map

The map is centered on Turkey.

In [24]:
m = folium.Map(location=[39,35.5], tiles="CartoDB Dark_Matter", control_scale = True, zoom_start=6)

Markers are placed.

The icons are custom icons that use the "IconUrl" column of the dataset.

"popup" contains the first 200 characters of the "Description" column and the Wikipedia link for the site. The "Link" column is used to create a hyperlink that opens in a new tab. The width parameters are fixed so that every popup has the same appearance. The images of the sites could also be added via the "ImageUrl" column, but loading that many remote images decreases performance, so images are not shown in the popups.

"tooltip" shows the name of a site when the cursor is over its marker.

Many of the sites overlap, so marker clusters are used for a better appearance. It would be possible to split these clusters by the "Type" column: we would only need to create 3 marker clusters and filter the dataset by type, so that each cluster is added to the map separately. In this map, however, the sites are not split by type. This is a design choice without a single correct answer, and it can be changed.

In [25]:
markercluster = MarkerCluster().add_to(m)

df.apply(lambda row: folium.Marker(location=[row["LatDec"], row["LongDec"]],
                                  icon = folium.features.CustomIcon(row["IconUrl"], icon_size=(35, 35)), 
                                  popup = folium.Popup((row["Description"][:200] + "..." + '<a href=' + row["Link"] +' target="_blank"> See more</a>'), min_width = 250, max_width = 250),
                                  tooltip = row["Name"]).add_to(markercluster), axis = 1)

0      <folium.map.Marker object at 0x0000020C7656C160>
1      <folium.map.Marker object at 0x0000020C763B03D0>
2      <folium.map.Marker object at 0x0000020C763B0460>
3      <folium.map.Marker object at 0x0000020C763B0130>
4      <folium.map.Marker object at 0x0000020C763B02B0>
                             ...                       
71     <folium.map.Marker object at 0x0000020C76410070>
223    <folium.map.Marker object at 0x0000020C76410100>
156    <folium.map.Marker object at 0x0000020C76410280>
314    <folium.map.Marker object at 0x0000020C76410400>
160    <folium.map.Marker object at 0x0000020C76410580>
Length: 330, dtype: object
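
The popup string above is assembled by concatenation, with an unquoted `href` and no escaping of the description text. A hedged sketch of a small helper that quotes the link and escapes the text with the standard `html` module is given here. `popup_html` and its `limit` parameter are hypothetical names, and the result would be passed to `folium.Popup` in place of the concatenated string.

```python
from html import escape


def popup_html(description, link, limit=200):
    """Build popup markup: a truncated, escaped description plus a link that opens in a new tab."""
    snippet = escape(description[:limit])
    href = escape(link, quote=True)
    return f'{snippet}... <a href="{href}" target="_blank">See more</a>'


print(popup_html("Adada is an ancient city & archaeological site",
                 "https://en.wikipedia.org/wiki/Adada_(Pisidia)"))
```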

The legend is inserted as an image, since we use custom markers. The image was prepared in advance for this map.

In [26]:
FloatImage("https://github.com/ocaktans/Mapping-of-the-Ancient-Greek-and-Roman-Sites-in-Turkey/blob/main/images/thelegend.png?raw=true", bottom=7, left=1).add_to(m)

<folium.plugins.float_image.FloatImage at 0x20c76401cd0>

A minimap is added to the map.

In [27]:
minim = plugins.MiniMap(tile_layer = "CartoDB Dark_Matter", toggle_display = True, height = 115) 

m.add_child(minim)

In [28]:
m.save("themap.html")

In [29]:
df2.to_csv("AncientSitesinTurkey.csv")

In [30]:
df

Unnamed: 0,Name,Type,Link,Latitude,Longitude,ImageUrl,Description,LatDec,LongDec,IconUrl
0,Abonoteichos,Greek,https://en.wikipedia.org//wiki/Abonoteichos,41°58′26″N,33°45′58″E,https://upload.wikimedia.org/wikipedia/commons...,"Abonoteichos, later Ionopolis, was an ancient ...",41.973889,33.766111,https://github.com/ocaktans/Mapping-of-the-Anc...
1,Abydos (Hellespont),Greek,https://en.wikipedia.org//wiki/Abydos_(Hellesp...,40°11′43″N,26°24′18″E,https://upload.wikimedia.org/wikipedia/commons...,Abydoswas an ancient city and bishopric in Mys...,40.195278,26.405000,https://github.com/ocaktans/Mapping-of-the-Anc...
2,Acharaca,Greek,https://en.wikipedia.org//wiki/Acharaca,37°54′N,28°06′E,,"Acharacawas a village of ancient Lydia, Anatol...",37.900000,28.100000,https://github.com/ocaktans/Mapping-of-the-Anc...
3,Achilleion (Troad),Greek,https://en.wikipedia.org//wiki/Achilleion_(Troad),39°54′54″N,26°9′9″E,,Achilleionwas an ancient Greek city in the sou...,39.915000,26.152500,https://github.com/ocaktans/Mapping-of-the-Anc...
4,Adada (Pisidia),Greek,https://en.wikipedia.org//wiki/Adada_(Pisidia),37°34′31″N,30°58′59″E,,Adada is an ancient city and archaeological si...,37.575278,30.983056,https://github.com/ocaktans/Mapping-of-the-Anc...
...,...,...,...,...,...,...,...,...,...,...
71,Çankırı,Greek,https://en.wikipedia.org//wiki/%C3%87ank%C4%B1...,40°36′00″N,33°37′00″E,https://upload.wikimedia.org/wikipedia/commons...,Çankırı is the capital city of Çankırı Provinc...,40.600000,33.616667,https://github.com/ocaktans/Mapping-of-the-Anc...
223,Öküzlü,Greek,https://en.wikipedia.org//wiki/%C3%96k%C3%BCzl...,36°34′N,34°10′E,https://upload.wikimedia.org/wikipedia/commons...,Öküzlü is an archaeological site in Mersin Pro...,36.566667,34.166667,https://github.com/ocaktans/Mapping-of-the-Anc...
156,İskenderun,Greek,https://en.wikipedia.org//wiki/%C4%B0skenderun,36°34′54″N,36°09′54″E,https://upload.wikimedia.org/wikipedia/commons...,,36.581667,36.165000,https://github.com/ocaktans/Mapping-of-the-Anc...
314,İzmir,Roman,https://en.wikipedia.org//wiki/%C4%B0zmir,38°25′N,27°08′E,https://upload.wikimedia.org/wikipedia/commons...,"İzmir, often spelled Izmir in English, is a me...",38.416667,27.133333,https://github.com/ocaktans/Mapping-of-the-Anc...
