<a name="top"> <h1>01. Exploratory Data Analysis and Variable Selection</h1> </a>

<p>Geospatial Analysis of the 2023 Earthquakes in Turkey<br />
<strong>Master Thesis</strong><br />
<strong>Master of Data Science</strong></p>


<p style="text-align:right">Gozde Yazganoglu (<em>gozde.yazganoglu@cunef.edu</em>)</p>

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Introduction](#1)
1. [Importing libraries](#2)
1. [Reading the data and Profiling](#3)
1. [Variables](#4)
1. [Visualization of the result of geostatistical data interpolation](#5)
1. [Saving data for other notebooks](#6)

## 1. Introduction <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In the previous notebook we collected the data and created some geographic variables. In this notebook we take a first look at the data, as we would in any data project.

Pandas-profiling is a useful library for getting a broad overview of a data frame and some initial insights. It reports a good deal of information, including distinct values, variable types, missing values and duplicates. Afterwards we examine the important columns in more detail.

## 2. Importing libraries <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

This notebook uses the environment defined in new_geo_env.yaml. To run the notebook locally, this environment must be installed first.

We also use some functions from aux_func, which is included in this repository. The necessary ones are imported below.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import pickle   
import numpy as np
import seaborn as sns
from pandas_profiling import ProfileReport
import contextily as ctx
import fiona
# sanitize_string_values is imported as well because it is called later in this notebook
from aux_func import (data_summary, save_data, plot_continuous_for_damage_level,
                      remove_duplicates, plot_for_damage_level, sanitize_string_values)


## 3. Reading the Data and Profiling <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In the previous notebook we gathered the data from several sources. Since the dataset is large and contains geometry columns, reading it may take some time. In the previous notebook we also saved the data in pickle format, which can be used as an alternative and should work in the same way. However, pickle is a Python-specific format, so that file cannot be read from other languages or used in software such as ArcGIS or QGIS.

In [None]:
#reading geojson data from newly created processed folder
data = gpd.read_file('../data/processed/dataset.geojson')

data.head()

The code below creates an HTML report that can be found in '../data/processed/primary_analisis.html'. It is commented out to avoid running it every time.

In [None]:
#creating a pandas profile in order to have a broad look on the dataset.
#geometric data does not appear in pandas profile, so we will have to look at it separately.
#df = pd.DataFrame(data)
#profile = ProfileReport(df, title="Pandas Profiling Report")
#profile


In [None]:
#saving profile report to html file

#profile.to_file('../data/processed/primary_analisis.html')

As can be seen in the Pandas profiling report, some features of the data frame are not very useful, and we have not processed them so far. These variables are listed below.

    * det_method
    * notation
    * cd_value
    * real
    * index_right
    * emsr_id
    * glide_no
    * map_type

Before advancing with further analysis, we can eliminate these variables. This is because they either do not provide meaningful information, contain identical information across all data entries, or hold non-informative values like 'none' or 'not applicable'.
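
Columns of this kind can also be spotted programmatically rather than by reading the report. The sketch below is only an illustration on a made-up toy frame: the helper `low_information_columns` and the list of placeholder strings are assumptions, not part of aux_func or of the original analysis.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, values are invented for illustration.
toy = pd.DataFrame({
    "obj_type": ["Residential", "Road", "Residential", "Industrial"],
    "map_type": ["Grading", "Grading", "Grading", "Grading"],  # identical in every row
    "glide_no": ["None", "None", "None", "None"],              # placeholder text only
    "elev": [512.0, 498.5, 530.2, 501.1],
})


def low_information_columns(df, placeholders=("None", "none", "Not Applicable")):
    """Hypothetical helper: flag columns with a single distinct value
    or with nothing but placeholder strings."""
    flagged = []
    for col in df.columns:
        values = df[col].dropna()
        if values.nunique() <= 1 or values.isin(placeholders).all():
            flagged.append(col)
    return flagged


candidates = low_information_columns(toy)
print(candidates)
```

The flagged columns are only candidates; whether a column is truly uninformative still has to be judged against its meaning in the source data.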






In [None]:
#Dropping columns that are not needed for the analysis
data.drop(columns=['det_method', 'notation','cd_value', 'real', 'index_right','emsr_id', 'glide_no', 'map_type'], inplace=True)



Upon further examination of the report, it becomes apparent that several variables are highly correlated with one another. This is most likely because the dataset includes many socioeconomic variables that are specific to each municipality, so records from the same municipality share the same values.
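
To make the correlation remark concrete, the following sketch lists pairs of numeric columns whose absolute Pearson correlation exceeds a threshold. It runs on invented toy data in which every record of a locality carries the same socioeconomic values; the helper `correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: both rows of each locality share that locality's socioeconomic values.
toy = pd.DataFrame({
    "locality": ["A", "A", "B", "B", "C", "C"],
    "population": [100, 100, 250, 250, 400, 400],
    "total_sales": [10, 10, 26, 26, 39, 39],
    "elev": [500, 620, 480, 700, 510, 450],
})


def correlated_pairs(df, threshold=0.9):
    """Hypothetical helper: return column pairs with |Pearson r| above the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


high = correlated_pairs(toy)
print(high)
```

On the real dataset the same idea could be applied to the socioeconomic columns to decide which of them to keep for modelling.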

In [None]:
%matplotlib inline
data['obj_type'].value_counts().sort_values(ascending=False).plot(kind='bar')
plt.show()

In [None]:
#showing properties of the data using the data_summary function from aux_func.py
data_info = data_summary(data)
data_info

Since we built the dataset by merging different sources and have removed the columns that carry little information, there are no missing values. However, as the summary suggests, a good number of entries are duplicated. We choose to remove the duplicated entries, using another function from aux_func.

In [None]:

#removing duplicates with the function remove_duplicates

data = remove_duplicates(data)
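
The implementation of `remove_duplicates` lives in aux_func and is not shown here. As a rough idea of what such a step involves, the sketch below counts and drops fully identical rows with plain pandas on a made-up toy frame; it is an assumption about the general approach, not a copy of the helper.

```python
import pandas as pd

# Toy frame with one exact duplicate (rows 0 and 1); values are invented.
toy = pd.DataFrame({
    "obj_type": ["Residential", "Residential", "Road", "Road"],
    "locality": ["A", "A", "B", "C"],
    "damage_gra": ["Damaged", "Damaged", "Destroyed", "Destroyed"],
})

# Rows that repeat an earlier row in every column.
n_duplicated = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicated, len(deduplicated))
```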

## 4. Variables <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)




### 1. obj_type, name, info:

The variables presented categorize the buildings, providing various levels of detail. For instance, the 'name' attribute tends to contain more nuanced information (as is the case with identifying a building as a school), whereas the 'obj_type' attribute generally offers less granularity in its classification. The anticipation is that these variables may have a high degree of correlation. Incorporating all of them in our model could potentially lead to multicollinearity or overfitting issues. It's crucial to approach this with care to ensure the integrity and reliability of our model's results.

The info category seems to contain values that were created inconsistently. To keep this variable, we will manipulate the data so that each value aligns with the building code and its description.

### 2. damage_gra : 

This is the target variable. Using geographic and socioeconomic variables, we can analyze how buildings behaved after the earthquake. According to the pandas profiling report, this variable is imbalanced: most of the buildings, facilities and roads are not damaged.

To use this variable in any algorithm, we have to convert it to numerical form. The level of damage can be classified as an ordinal categorical variable.
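
As an illustration of both points, the sketch below measures the class shares and derives ordinal codes with an ordered pandas categorical. The label order follows the damage levels used in this notebook, while the toy label counts are invented and do not reflect the real distribution.

```python
import pandas as pd

# Damage levels from least to most severe, with "Not Analysed" as the lowest code.
damage_order = ["Not Analysed", "No visible damage", "Possibly damaged",
                "Damaged", "Destroyed"]

# Toy target column; the counts are made up for illustration.
toy = pd.Series(
    ["No visible damage"] * 6 + ["Possibly damaged", "Damaged", "Damaged", "Destroyed"],
    name="damage_gra",
)

# Share of each class, to quantify the imbalance.
shares = toy.value_counts(normalize=True)

# An ordered categorical keeps the ranking; its integer codes follow damage_order.
ordered = pd.Categorical(toy, categories=damage_order, ordered=True)
codes = pd.Series(ordered.codes, name="damage_code")

print(shares)
print(codes.tolist())
```

An ordered categorical keeps the labels readable while still exposing integer codes, which can be convenient when the same column is used for both plots and models.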

### 3. dmg_src_id, or_src_id :

These variables are not explained in detail and carry no meaning for the analysis. We infer that they are identifiers, so we do not need them in machine learning models.

### 4. locality, area_id :

These variables describe the locality being observed. Both have 19 distinct values, one for each area we examine. Both should be treated as categorical variables; area_id is an identifier.


### 5. Variables that were received from Turkstat:

More detailed information can be found in the previous notebook, where these data were added manually. Some of the values are province means, so more than one locality may share the same value.

    population: population in 2022
    income: income in USD for 2022
    total_sales: total house sales in the last 5 years
    second_sales: second-hand house sales in the last 5 years (this might be interesting because these buildings are expected to be riskier than new ones)
    water_access: percentage of access to city water
    elec_cons: electricity consumption per capita
    building_perm: number of buildings permitted
    land_permited: land permitted in the province in m2
    labour_fource: labour force participation rate in the locality, 2022
    unemployment: unemployment rate, 2022
    agricultural: agricultural area in decares
    life_time: life expectancy
    hb_per100000: hospital beds in the province per 100000
    fertility: fertility rate
    hh_size: average household size in the area


### 6. Geospatial Variables:

Variables with geographical significance include latitude and longitude, which are derived either from points or from the center of a polygon. The BallTree method is employed to compute the Euclidean distance for the distance-related variables. In geographically based models like ours, whenever we suspect that proximity to some feature might influence the outcome, we can compute a corresponding distance variable in the same way.

We use longitude and latitude as location markers for our dataset, and from them we determine the nearest distance to various points. It is essential to understand that a Euclidean distance computed on longitude and latitude is not an actual ground distance; it only gives us an indication of relative proximity.
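
The difference between a Euclidean distance on coordinates and a ground distance can be seen with a small, self-contained comparison. The two functions and the sample points below are illustrative assumptions; the great-circle formula uses a mean Earth radius of 6371 km.

```python
import math


def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Straight-line distance in degree units, treating lat/lon as a flat plane."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Illustrative points: one degree north and one degree east of (37N, 37E).
origin = (37.0, 37.0)
north = (38.0, 37.0)
east = (37.0, 38.0)

d_north_deg = euclidean_degrees(*origin, *north)
d_east_deg = euclidean_degrees(*origin, *east)
d_north_km = haversine_km(*origin, *north)
d_east_km = haversine_km(*origin, *east)

print(d_north_deg, d_east_deg)                # identical in degree units
print(round(d_north_km), round(d_east_km))    # different on the ground
```

Both offsets are exactly one degree, yet one degree of latitude is about 111 km while one degree of longitude at 37°N is about 89 km. If true distances are needed, scikit-learn's BallTree also accepts `metric="haversine"` with coordinates given in radians.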



### Manipulations to data:

From the insights gathered during profiling, the "info" variable sometimes contains only the building code and sometimes the code together with the building type. It is important for us to understand which building types were damaged more than others. Therefore we want to keep this variable and map its values to a consistent form.

The target variable damage_gra is ordinal. It may be transformed further in later preprocessing; for now we need to encode it in order to create some visualizations.


In [None]:
# Manipulations to the data:

#to info variable:

map_info = {'122':'122-Office buildings',
       '123':'123-Wholesale and retail trade buildings',
       '1251':'1251-Industrial buildings',
       '1263':'1263-School, university and research buildings',
       '1272':'1272-Hospital buildings',
       '1279':'1279-Military',
       '1280':'1280-Cemetery',
       'None' :'997-Not Applicable' }

# assign the result back to the column; replace(..., inplace=True) returns None
data['info'] = data['info'].replace(map_info)

# to the target variable:
map_damage_gra = {'Damaged': 3,
                  'Destroyed': 4,
                  'No visible damage' : 1,
                  'Not Analysed': 0,
                  'Possibly damaged':2 }

data['damage_gra'] = data['damage_gra'].replace(map_damage_gra)


data.drop(columns = ['name', 'area_id', 'or_src_id', 'dmg_src_id'], inplace = True)





columns_to_sanitize= ['obj_type', 'info','locality']

data = sanitize_string_values(data, columns_to_sanitize)
                      


Next, we plot the damage level according to building type and the other categorical variables.

In [None]:
#making a list of categorical variables of interest
variables_of_interest = ['obj_type', 'info', 'locality']

In [None]:
#using the plot_for_damage_level function we plot the categorical variables
for variable in variables_of_interest:
    plot_for_damage_level(data, variable, 'damage_gra')
    

As seen in the graphs, roads are the object type that suffered the most damage.
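
A statement like this can be backed by a table of damage shares per category. The sketch below builds such a table with `pd.crosstab` on a made-up toy frame; it is not the implementation of `plot_for_damage_level`, only one way the underlying numbers could be obtained.

```python
import pandas as pd

# Toy data with encoded damage levels (1 = no visible damage, 3 = damaged, 4 = destroyed).
toy = pd.DataFrame({
    "obj_type": ["Road", "Road", "Residential", "Residential", "Residential", "Road"],
    "damage_gra": [4, 3, 1, 1, 3, 1],
})

# Each row sums to 1: the share of every damage level within one object type.
shares = pd.crosstab(toy["obj_type"], toy["damage_gra"], normalize="index")

print(shares)
```

Normalizing by row matters here because the object types have very different counts, so raw counts alone would mostly reflect how common each type is.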

In [None]:



continuous_columns = ['total_sales', 'second_sales', 'water_access', 'elec_cons',
       'building_perm', 'land_permited', 'labour_fource', 'unemployment',
       'agricultural', 'life_time', 'hb_per100000', 'fertility', 'hh_size',
       'longitude', 'latitude', 'nearest_water_source_distance',
       'nearest_camping_distance', 'nearest_earthquake_distance',
       'nearest_fault_distance', 'elev',]

for column in continuous_columns:
    plot_continuous_for_damage_level(data, column, 'damage_gra')
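
Alongside the plots, a per-level summary of a continuous variable gives a quick numerical check. The following sketch computes the median of one column for each damage level on invented toy values; the numbers carry no meaning beyond illustrating the `groupby` pattern.

```python
import pandas as pd

# Toy data: encoded damage level and a distance-type variable with made-up values.
toy = pd.DataFrame({
    "damage_gra": [1, 1, 3, 3, 4, 4],
    "nearest_fault_distance": [0.30, 0.50, 0.20, 0.10, 0.05, 0.15],
})

# Median of the continuous variable within each damage level.
medians = toy.groupby("damage_gra")["nearest_fault_distance"].median()

print(medians)
```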

## 6. Saving data for other notebooks <a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

After these manipulations, we review the descriptive statistics and save the data so that it can be used for the geospatial analysis in the other notebooks.


In [None]:
# Descriptive Statistics
data.describe()

In [None]:
save_data(data, "dataset2")