# Data Preparation 
## 1. Reading Archive Data Group

This data group consists of timestamped observations of seismic activity together with educational and socioeconomic data.

In [23]:
import pandas as pd
import numpy as np
import geopandas as gpd


This dataset comes from https://www.kaggle.com/datasets/ardaorcun/turkey-6-february-disaster-and-related-datas and combines several datasets from the sources below.

For earthquakes: https://deprem.afad.gov.tr/event-catalog

For population and economic data: https://www.tuik.gov.tr/

For education data: https://istatistik.meb.gov.tr/

The Earthquakes dataset records the seismic activity of the region; every event is listed with a timestamp.

ilcelervekoordinatlar.csv has the population data from https://www.tuik.gov.tr/

The other CSV files hold the education data. The file-name prefix gives the school level:

    datasets starting with e =>  elementary school data
    datasets starting with p => preschool data
    datasets starting with h => high school data
    datasets starting with m => middle school data

The file-name suffix gives the measure:

    Class => number of classes available
    Number => number of institutions available
    Students => number of students
    Teachers => number of teachers
    Type => for high schools, the type of high school
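
The naming convention above can be decoded mechanically. The sketch below is only an illustration: `LEVELS` and `describe_file` are hypothetical names that do not appear in the notebook, and the helper assumes an education file name is one lowercase prefix letter followed by a capitalised measure.

```python
# Hypothetical helper (not part of the notebook): split an education file
# name such as "hTeachers" into its school level and its measure.
LEVELS = {
    "e": "elementary school",
    "p": "preschool",
    "h": "high school",
    "m": "middle school",
}

def describe_file(file_name: str) -> tuple[str, str]:
    """Return (school level, measure) for an education file name."""
    prefix, measure = file_name[0], file_name[1:]
    # Reject names that do not follow the "<prefix><Measure>" pattern,
    # e.g. "Earthquakes" or "ilcelervekoordinatlar".
    if prefix not in LEVELS or not measure[:1].isupper():
        raise ValueError(f"not an education file name: {file_name!r}")
    return LEVELS[prefix], measure

print(describe_file("hTeachers"))
print(describe_file("pClass"))
```

Files that do not follow the pattern, such as the earthquake catalogue, raise a `ValueError` instead of being silently mislabelled.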

In [24]:
# Path to the archive folder that holds all the CSV files.
path = '../data/archive/'

# Names of the files to read, so they can be loaded in a loop instead of one by one.
lista = ['Earthquakes','Earthquakes2year', 'eClass','eNumber',
     'eStudent', 'eTeachers','hClass','hNumber', 'hStudents',
    'hTeachers','hTypes','ilcelervekoordinatlar', 'mClass',
    'mNumber', 'mStudents', 'mTeachers', 'pClass',
    'pNumber', 'pStudents', 'pTeachers',]
# Dictionary of dataframes keyed by file name.
data = {}
# Read each file and store it in the dictionary.
for file_name in lista:
    df = pd.read_csv(f"{path}{file_name}.csv")
    data[file_name] = df

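The school datasets share the same column names, so we rename the columns to keep them from being mixed up after merging. "Resmi" means public and "Özel" means private; "Erkek" and "Kadın" mean male and female, and "Toplam" means total. In the student and teacher files the prefixes "R" and "Ö" stand for Resmi and Özel.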
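
Because the student and teacher tables all share the same seven Turkish headers, the rename mapping can be generated from a short prefix instead of being typed out for each table. This is a minimal sketch, not the notebook's own approach: `TURKISH_TO_ENGLISH` and `rename_map` are hypothetical names introduced here.

```python
# Hypothetical helper (not in the original notebook): build the rename
# mapping for a student or teacher table from a prefix such as "e" or "mt".
TURKISH_TO_ENGLISH = {
    'R Erkek': 'male_public',
    'R Kadın': 'female_public',
    'R Toplam': 'total_public',
    'Ö Erkek': 'male_private',
    'Ö Kadın': 'female_private',
    'Ö Toplam': 'total_private',
    'Toplam': 'total',
}

def rename_map(prefix: str) -> dict[str, str]:
    """Map each Turkish header to '<prefix>_<english name>'."""
    return {tr: f"{prefix}_{en}" for tr, en in TURKISH_TO_ENGLISH.items()}

print(rename_map('et')['Ö Kadın'])
```

A call such as `data['eTeachers'].rename(columns=rename_map('et'))` would then produce the same kind of names as the hand-written mappings, and generating them makes copy-paste slips (two headers mapped to one name, or a stray Turkish suffix) less likely.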

In [25]:
# DataFrames whose names start with e hold elementary school data; the columns are renamed to avoid confusion.
# number of classes
eClass = data['eClass'].rename(columns={'Resmi':'eclass_public', 'Özel':'eclass_private', 'Toplam':'eclass_total'})
# school numbers
eNumber = data['eNumber'].rename(columns={'Resmi':'enumber_public', 'Özel':'enumber_private', 'Toplam':'enumber_total'})
#number of students in elementary schools.
eStudent = data['eStudent'].rename(columns={'R Erkek':'e_male_public', 'R Kadın':'e_female_public', 'R Toplam':'e_total_public',
                                                    'Ö Erkek':'e_male_private', 'Ö Kadın':'e_female_private', 'Ö Toplam':'e_total_private', 
                                                    'Toplam':'e_total'})
# number of teachers in elementary schools.
eTeachers = data['eTeachers'].rename(columns={'R Erkek':'et_male_public', 'R Kadın':'et_female_public', 'R Toplam':'et_total_public', 
                                                      'Ö Erkek' : 'et_male_private', 'Ö Kadın':'et_female_private', 'Ö Toplam':'et_total_private', 'Toplam':'et_total'})

# DataFrames whose names start with h hold high school data; their columns are renamed as well.
# number of classes
hClass = data['hClass'].rename(columns={'Resmi':'hclass_public', 'Özel':'hclass_private', 'Toplam':'hclass_total'})
# school numbers
hNumber = data['hNumber'].rename(columns={'Resmi':'hnumber_public', 'Özel':'hnumber_private', 'Toplam':'hnumber_total'})
# number of students in high schools.
hStudents = data['hStudents'].rename(columns={'R Erkek':'h_male_public', 'R Kadın':'h_female_public', 'R Toplam':'h_total_public',
                                                      'Ö Erkek': 'h_male_private', 'Ö Kadın':'h_female_private', 'Ö Toplam':'h_total_private', 'Toplam':'h_total'})
# number of teachers in high schools.
hTeachers = data['hTeachers'].rename(columns={'R Erkek':'ht_male_public', 'R Kadın':'ht_female_public', 'R Toplam':'ht_total_public',
                                                      'Ö Erkek' : 'ht_male_private', 'Ö Kadın':'ht_female_private', 'Ö Toplam':'ht_total_private', 'Toplam':'ht_total'})
# type of high schools
# religious high schools
hType_din = data['hTypes'][data['hTypes']['Okul Türü'] == 'Din Öğretimi']
hType_din = hType_din.rename(columns={'R Erkek':'hs_religious_male_public', 'R Kadın':'hs_religious_female_public', 'R Toplam':'hs_religious_total_public', 
                                                      'Ö Erkek': 'hs_religious_male_private', 'Ö Kadın':'hs_religious_female_private', 
                                                      'Ö Toplam':'hs_religious_total_private', 'Toplam':'hs_religious_total'})


In [26]:
# vocational and technical (occupational) high schools
hTypes_meslek = data['hTypes'][data['hTypes']['Okul Türü'] == 'Mesleki ve Teknik Ortaöğretim']

hTypes_meslek = hTypes_meslek.rename(columns={'R Erkek':'h_male_occupational_public', 'R Kadın':'h_female_occupational_public', 'R Toplam':'h_total_occupational_public',
                                                              'Ö Erkek': 'h_male_occupational_private', 'Ö Kadın':'h_female_occupational_private', 
                                                              'Ö Toplam':'h_total_occupational_private', 'Toplam':'h_occupational_total'})
# General high schools
hTypes_genel = data['hTypes'][data['hTypes']['Okul Türü'] == 'Genel Ortaöğretim']    

hTypes_genel = hTypes_genel.rename(columns = {'R Erkek': 'h_male_normal_public', 'R Kadın': 'h_female_normal_public', 'R Toplam': 'h_total_normal_public',
                                                              'Ö Erkek': 'h_male_normal_private', 'Ö Kadın': 'h_female_normal_private', 
                                                              'Ö Toplam': 'h_total_normal_private', 'Toplam': 'h_normal_total'})

In [27]:

# middle schools
mClass = data['mClass'].rename(columns={'Resmi':'mclass_public', 'Özel':'mclass_private', 'Toplam':'mclass_total'})
mNumber = data['mNumber'].rename(columns={'Resmi':'mnumber_public', 'Özel':'mnumber_private', 'Toplam':'mnumber_total'})
mStudents = data['mStudents'].rename(columns={'R Erkek':'m_male_public', 'R Kadın':'m_female_public', 'R Toplam':'m_total_public',
                                                      'Ö Erkek': 'm_male_private', 'Ö Kadın':'m_female_private', 'Ö Toplam':'m_total_private', 'Toplam':'m_total'})
mTeachers = data['mTeachers'].rename(columns={'R Erkek':'mt_male_public', 'R Kadın':'mt_female_public', 'R Toplam':'mt_total_public',
                                                      'Ö Erkek' : 'mt_male_private', 'Ö Kadın':'mt_female_private', 'Ö Toplam':'mt_total_private', 'Toplam':'mt_total'})
#preschools
pClass = data['pClass'].rename(columns={'Resmi':'pclass_public', 'Özel':'pclass_private', 'Toplam':'pclass_total'})
pNumber = data['pNumber'].rename(columns={'Resmi':'pnumber_public', 'Özel':'pnumber_private', 'Toplam':'pnumber_total'})
pStudents = data['pStudents'].rename(columns={'R Erkek':'p_male_public', 'R Kadın':'p_female_public', 'R Toplam':'p_total_public', 
                                                    'Ö Erkek':'p_male_private', 'Ö Kadın':'p_female_private', 'Ö Toplam':'p_total_private', 
                                                    'Toplam':'p_total'})
pTeachers = data['pTeachers'].rename(columns={'R Erkek':'pt_male_public', 'R Kadın':'pt_female_public', 'R Toplam':'pt_total_public', 
                                                      'Ö Erkek' : 'pt_male_private', 'Ö Kadın':'pt_female_private', 'Ö Toplam':'pt_total_private', 'Toplam':'pt_total'})


In [28]:

# Merge the dataframes on the province column ('Şehir')
# List of dataframes to be merged

df_names = [eClass, eNumber, eStudent, eTeachers, hClass, hNumber,
            hStudents,hType_din, hTypes_meslek, hTypes_genel,
            hTeachers,mClass, mNumber, mStudents,
            mTeachers, pClass, pNumber, pStudents, pTeachers]

# Initialize the merged dataframe with the first dataframe in the list
merged = eClass

# Loop over the remaining dataframes and merge them into the merged dataframe
for df_name in df_names[1:]:
    
    merged = pd.merge(merged, df_name, on='Şehir')

# Rename the key column to English and view the resulting columns
merged = merged.rename(columns={'Şehir':'Province'})
merged.columns

merged.to_csv('../data/archive/merged.csv', index=False)

merged.columns

Index(['Province', 'eclass_public', 'eclass_private', 'eclass_total',
       'enumber_public', 'enumber_private', 'enumber_total', 'e_male_public',
       'e_female_public', 'e_total_public',
       ...
       'p_female_private', 'p_total_private', 'p_total', 'pt_male_public',
       'pt_female_public', 'pt_total_public', 'pt_male_private',
       'pt_female_private', 'pt_total_private', 'pt_total'],
      dtype='object', length=105)
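
`pd.merge` performs an inner join by default, so a province missing from any one table drops out of the result without a warning, and a province listed twice multiplies rows. The sketch below shows one way to make the chained merge stricter. It runs on made-up toy frames, and `merge_on_key` and `merged_toy` are hypothetical names.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the renamed school tables (values are made up).
frames = [
    pd.DataFrame({'Şehir': ['ADANA', 'MALATYA'], 'eclass_total': [10, 20]}),
    pd.DataFrame({'Şehir': ['ADANA', 'MALATYA'], 'enumber_total': [3, 4]}),
    pd.DataFrame({'Şehir': ['MALATYA', 'ADANA'], 'e_total': [500, 400]}),
]

def merge_on_key(left, right):
    # validate='one_to_one' raises pandas.errors.MergeError if a province
    # appears more than once on either side of the merge.
    return pd.merge(left, right, on='Şehir', how='inner', validate='one_to_one')

# Fold the list into a single frame, then rename the key column.
merged_toy = reduce(merge_on_key, frames).rename(columns={'Şehir': 'Province'})
print(merged_toy)
```

Comparing `len(merged_toy)` with the length of each input is a cheap extra check that no province was lost to the inner join.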

In [29]:
# Save the dataframe as a pickle file so it can be reloaded later in the notebook without repeating these steps.
merged.to_pickle('../data/archive/merged.pkl')

The next dataset has the locations of the towns and provinces, along with their population.

In [32]:
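# `city` is not defined in the cells shown here; it is assumed to hold the location and population table (ilcelervekoordinatlar).
merged_data = city.merge(merged, on='Province')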

merged_data.to_pickle('../data/manipulated/merged_data.pkl')

In [33]:
from shapely.geometry import Point

# Build a shapely Point (x = longitude, y = latitude) for every row
merged_data['Geometry'] = merged_data.apply(lambda row: Point(row['Longitude'], row['Latitude']), axis=1)

merged_data.to_pickle('../data/manipulated/merged_data.pkl')

The Earthquakes dataset holds the earthquakes measured from 5 February (the day before the disaster) to 10 March, taken from https://deprem.afad.gov.tr/event-catalog

The dataset consists of the variables below:

    Date => timestamp of the seismic activity
    Longitude => longitude
    Latitude => latitude
    Depth => depth
    Rms => root mean square
    Magnitude => measured magnitude
    Location => town and province name
    Type => magnitude type: ML for local magnitude, MW for moment magnitude
    EventID => event identifier in the database



In [34]:
#extracting earthquake data from data dictionary
earthquake = data['Earthquakes']
# changing Location to uppercase to match with the merged_data
earthquake.Location = earthquake.Location.str.upper()
# changing the date column to datetime format
earthquake.Date = pd.to_datetime(earthquake.Date)
# sorting the values by date
earthquake.sort_values(by='Date', inplace=True)
earthquake

| | Unnamed: 0 | Date | Longitude | Latitude | Depth | Rms | Type | Magnitude | Location | EventID |
|---|---|---|---|---|---|---|---|---|---|---|
| 15816 | 15816 | 2023-02-05 00:00:55 | 38.828 | 38.255 | 6.73 | 0.16 | ML | 0.8 | PÜTÜRGE (MALATYA) | 543347 |
| 14453 | 14453 | 2023-02-05 00:54:55 | 43.954 | 41.209 | 11.26 | 0.29 | ML | 1.3 | NINOTSMINDA, SAMTSKHE-JAVAKHETI (GÜRCISTAN) - ... | 543353 |
| 12787 | 12787 | 2023-02-05 02:31:06 | 44.942 | 38.648 | 11.86 | 0.58 | ML | 1.6 | KHOY, WEST AZARBAIJAN (İRAN) - [55.57 KM] BAŞK... | 543359 |
| 6374 | 6374 | 2023-02-05 02:37:25 | 42.641 | 38.389 | 7.12 | 0.28 | ML | 2.3 | TATVAN (BITLIS) | 543358 |
| 9837 | 9837 | 2023-02-05 03:18:37 | 39.186 | 38.469 | 7.00 | 0.28 | ML | 1.9 | SIVRICE (ELAZIĞ) | 543363 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 574 | 574 | 2023-03-10 07:58:34 | 37.493 | 37.985 | 7.01 | 0.41 | MW | 3.9 | NURHAK (KAHRAMANMARAŞ) | 560818 |
| 6268 | 6268 | 2023-03-10 08:02:35 | 38.192 | 38.054 | 7.00 | 0.66 | ML | 2.3 | ÇELIKHAN (ADIYAMAN) | 560841 |
| 12575 | 12575 | 2023-03-10 08:04:56 | 37.387 | 37.951 | 7.00 | 0.91 | ML | 1.6 | NURHAK (KAHRAMANMARAŞ) | 560838 |
| 11718 | 11718 | 2023-03-10 08:08:34 | 36.199 | 37.805 | 7.00 | 0.67 | ML | 1.7 | SAIMBEYLI (ADANA) | 560839 |


In [35]:
# Descriptive statistics of the earthquake data
earthquake.describe()

| | Unnamed: 0 | Longitude | Latitude | Depth | Rms | Magnitude | EventID |
|---|---|---|---|---|---|---|---|
| count | 15901.0 | 15901.0 | 15901.0 | 15901.0 | 15901.0 | 15901.0 | 15901.0 |
| mean | 7950.0 | 37.473933 | 37.810726 | 7.684039 | 0.443839 | 2.265606 | 552010.119364 |
| std | 4590.36765 | 1.216845 | 0.726524 | 2.40463 | 0.256377 | 0.782173 | 5029.779034 |
| min | 0.0 | 35.556 | 33.835 | 0.0 | 0.01 | 0.3 | 543347.0 |
| 25% | 3975.0 | 36.621 | 37.42 | 7.0 | 0.27 | 1.7 | 547646.0 |
| 50% | 7950.0 | 37.179 | 37.955 | 7.0 | 0.42 | 2.1 | 551940.0 |
| 75% | 11925.0 | 38.103 | 38.137 | 7.12 | 0.59 | 2.7 | 556350.0 |
| max | 15900.0 | 45.419 | 42.497 | 43.16 | 7.54 | 7.7 | 560849.0 |


In [36]:
import folium

# Create a folium map centered at a specific location
m = folium.Map(location=[37, 37], zoom_start=10)

# Iterate over each row in the merged_data dataframe
for index, row in merged_data.iterrows():
    # Extract the geometry point from the 'Geometry' column
    geometry = row['Geometry']
    
    # Extract the latitude and longitude from the geometry point
    lat = geometry.y
    lon = geometry.x
    
    # Create a marker at the latitude and longitude coordinates
    folium.Marker([lat, lon]).add_to(m)

# Display the map
m


In [43]:
# manipulating the earthquake data
# dropping the unnamed column
#earthquake.drop(columns= ['Unnamed: 0'], inplace=True)
# extracting the province (text inside the parentheses) and the municipality (text before them) from the Location column
earthquake['Province'] = earthquake['Location'].str.extract(r'\((.*?)\)', expand=False)
earthquake['Municipio'] = earthquake['Location'].str.extract(r'^(.*?)\s*\(', expand=False)

# combined "PROVINCE-MUNICIPALITY" label
earthquake['Location_name'] = earthquake['Province']+ "-"+ earthquake['Municipio']

earthquake['Location_name'] 

15816                              MALATYA-PÜTÜRGE
14453    GÜRCISTAN-NINOTSMINDA, SAMTSKHE-JAVAKHETI
12787                   İRAN-KHOY, WEST AZARBAIJAN
6374                                 BITLIS-TATVAN
9837                                ELAZIĞ-SIVRICE
                           ...                    
574                           KAHRAMANMARAŞ-NURHAK
6268                             ADIYAMAN-ÇELIKHAN
12575                         KAHRAMANMARAŞ-NURHAK
11718                              ADANA-SAIMBEYLI
15015                            GAZIANTEP-NURDAĞI
Name: Location_name, Length: 15901, dtype: object
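
To see what the two patterns capture, the sketch below applies them with the standard `re` module to single strings. `Series.str.extract` likewise returns the first match, so the behaviour should carry over. `split_location` is a hypothetical helper, and the last input is a made-up string without parentheses.

```python
import re

PROVINCE = re.compile(r'\((.*?)\)')        # text inside the first pair of parentheses
MUNICIPALITY = re.compile(r'^(.*?)\s*\(')  # text before the first "("

def split_location(location):
    """Return (province, municipality); either is None when nothing matches."""
    province = PROVINCE.search(location)
    municipality = MUNICIPALITY.search(location)
    return (
        province.group(1) if province else None,
        municipality.group(1) if municipality else None,
    )

print(split_location('PÜTÜRGE (MALATYA)'))
print(split_location('KHOY, WEST AZARBAIJAN (İRAN) - [55.57 KM] ...'))
print(split_location('NO PARENTHESES HERE'))  # made-up input
```

For events outside Turkey the text in parentheses is a country rather than a province, as the GÜRCISTAN and İRAN rows in the output show. A location without parentheses yields no match; in pandas that becomes `NaN`, which propagates into `Location_name`, and `groupby` drops such rows by default.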

In [44]:
# grouping earthquake data by location (province and municipality) to see how many earthquakes happened in each one
earthquake.groupby('Location_name').count()


| Location_name | Unnamed: 0 | Date | Longitude | Latitude | Depth | Rms | Type | Magnitude | Location | EventID | Province | Municipio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADANA-AKDENIZ - İSKENDERUN KÖRFEZI - [09.27 KM] YUMURTALIK | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ADANA-ALADAĞ | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| ADANA-CEYHAN | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| ADANA-FEKE | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| ADANA-KOZAN | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 | 26 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ŞANLIURFA-SIVEREK | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| ŞANLIURFA-SURUÇ | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| ŞIRNAK-MERKEZ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ŞIRNAK-SILOPI | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


From the earthquake data we will need the earthquake count per location and the distance to the most important earthquakes.
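
The cells shown here do not say how the distance will be measured. One common choice for latitude/longitude pairs is the haversine great-circle distance, sketched below under that assumption. `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the coordinates in the usage line are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Made-up example: a town at (38.0 N, 37.0 E) and an epicentre at (37.0 N, 37.0 E).
print(round(haversine_km(38.0, 37.0, 37.0, 37.0), 1))
```

The function takes latitude first and longitude second, whereas a shapely `Point` stores longitude as `x` and latitude as `y`, so the argument order needs care when passing geometry values.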

In [39]:
# counting earthquakes per location in a separate frame and saving it as a pickle file
city_earthquaques = earthquake.groupby('Location_name').count().reset_index()
# keeping only the location and count columns
city_earthquaques = city_earthquaques[['Location_name', 'EventID']]
# renaming the count column (reassigning instead of inplace=True avoids a possible SettingWithCopyWarning)
city_earthquaques = city_earthquaques.rename(columns={'EventID':'Count_lastmonth'})
# saving the dataframe as a pickle file
city_earthquaques.to_pickle('../data/manipulated/city_earthquaques.pkl')
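
`count()` reports the non-null values of every column, so taking `EventID` from it gives the number of events only while `EventID` has no missing values. As an alternative sketch, `size()` counts rows directly and returns just the one column that is needed. The toy frame below uses made-up values, and `events` and `counts` are hypothetical names.

```python
import pandas as pd

# Toy events (made-up values); the last row has no usable location label.
events = pd.DataFrame({
    'Location_name': ['ADANA-KOZAN', 'ADANA-KOZAN', 'MALATYA-PÜTÜRGE', None],
    'EventID': [1, 2, 3, 4],
})

counts = (
    events.groupby('Location_name')   # rows with a missing key are dropped by default
          .size()                     # number of rows in each group
          .reset_index(name='Count_lastmonth')
)
print(counts)
```

The row with the missing label does not appear in the result, which mirrors what happens to earthquakes whose `Location` could not be split into province and municipality.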

In [40]:

# filtering the strongest events (magnitude above 6.5) into another dataframe, to be used later for distance checks.
day_eq= earthquake[earthquake['Magnitude'] > 6.5]
#saving the dataframe as pickle file
day_eq.to_pickle('../data/manipulated/day_eq.pkl')
