# Introduction

The dataset used gathers information of wildfires in Spain from 2001 to 2015. The data comes from the Estadística General de Incendios Forestales (EGIF) made by the Centro de Coordinación de la Información Nacional de Incendios Forestales (CCINIF), and this information is provided by each Autonomous Community yearly. 

The dataset has the following information:

* id |  Identifier
* superficie |	Forest area burned, in hectares
* fecha	| Fire detection date (yyyy-mm-dd format)
* lat |	Geographical latitude  of the wildfire origin
* lng	| Geographical longitude  of the wildfire origin
* latlng_explicit |	Informs whether the geographical coordinates were available (1) or the coordinates of the origin municipality were used instead (0)
* comunidad	| Autonomus community identifier
* provincia	| Province identifier
* municipio	| Name of the municipality
* causa	| Wildfire cause
* causa_supuesta	| 1 if the the cause is assumed, otherwise is blank
* causa_desc	| Wildfire description identifier
* muertos	| Number of deceased
* heridos	| Number of injured
* time_ctrl	| Time lapsed until the fire is controled (in minutes)
* time_ext	| Time lapsed until the fire extinction(in minutes)
* personal	| Number of people that participated in the fire extinction (includes technicians, forestry agents, brigades, firefighters, volunteers, civil guards and military)
* medios	| Number of ground and aerial means involved in extinguishing the fire (including fire engines, bulldozers, tractors, airplanes and others)
* gastos	| Extinguishing costs associated with the fire as reported in EGIF
* perdidas	| Economic losses associated with the fire as reported in EGIF	 

We will now proceed to explore and clean the dataset.

In [24]:
# needed libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pygeohash as pgh

In [46]:
df = pd.read_csv("fires-all.csv", parse_dates = ["fecha"], index_col="fecha")

In [47]:
df.head().transpose()

fecha,2001-03-18,2001-03-24,2001-04-16,2001-05-25,2001-07-20
id,2001010001,2001010004,2001010005,2001010008,2001010017
superficie,3.7,1.5,1.5,7.5,1.04
lat,42.9547,42.5522,48.3025,42.9465,43.0917
lng,-2.32572,-2.64067,-3.3978,-2.48516,-3.02457
latlng_explicit,1,1,1,1,1
idcomunidad,1,1,1,1,1
idprovincia,1,1,1,1,1
idmunicipio,13,41,33,13,10
municipio,BARRUNDIA,NAVARIDAS,LAPUEBLA DE LABARCA,BARRUNDIA,AIARA/AYALA
causa,4,2,2,4,5


In [48]:
print("Dataset size:", df.shape)

Dataset size: (82640, 20)


In [49]:
df.isna().sum()

id                     0
superficie             0
lat                   24
lng                   24
latlng_explicit        0
idcomunidad            0
idprovincia            0
idmunicipio            0
municipio              0
causa                  0
causa_supuesta     36175
causa_desc             0
muertos            79916
heridos            79569
time_ctrl              0
time_ext               0
personal               0
medios                 0
gastos             71016
perdidas           48291
dtype: int64

As shown above, this dataset has missing values that need to be fixed in order to use it for our analysis and model. Firstly, as stated above in the variable descriptions, causa_supuesta has a value of either 1 or blank, hence we can fill the missing values with a 0. 

In [50]:
print("Unique values in causa_supuesta:", df.causa_supuesta.unique())

Unique values in causa_supuesta: [nan  1.]


In [51]:
df["causa_supuesta"] = df.causa_supuesta.fillna(0)

Although they could be very insightful, the variables muertos, heridos, gastos and perdidas show too many missing values and have to be eliminated. The deceased and injured values per year do not match the values given on other official government reports, so it is not that they weren't filled because there were no injured or deceased in those cases. As for the other two values, something similar happens. Since the data is comprised of data gathered by different autonomous communities, not all of them collected data on these. 

In [52]:
#drop gastos; although this variable would be very insightful, it has too many NaN values that cannot be eliminated or filled
df = df.drop(["gastos"], axis = 1)

#drop perdidas; same as above
df = df.drop(["perdidas"], axis = 1)

#drop muertos; same as above
df = df.drop(["muertos"], axis = 1)

#drop heridos; same as above
df = df.drop(["heridos"], axis = 1)

Now, we will remove variables that have no use for our analysis and model. We will remove id, as we have stated date as our index; latlng_explicit because it doesn't add additional information; idmunicipio, which only gives a number id to the municipalities where we already have their names; and causa_desc because unfortunately it is wrongly labeled.

In [57]:
#drop id; adds no info and we have another index now which is the date
df = df.drop(["id"], axis = 1)

#drop latlng_explicit; only says if the coordinates were taken from the town or not
df = df.drop(["latlng_explicit"], axis = 1)

#drop idmunicipio; we already have the name of the municipality
df = df.drop(["idmunicipio"], axis = 1)

#drop causa_desc; it is wrongly labeled
df = df.drop(["causa_desc"], axis = 1)

As for lat and lng variables, the values that are missing only represent less than 0.03% opf the dataset. They could be estimated with the municipality name, stored in the variable municipio; however, this is unlikely to change how the model performs.

In [60]:
#remove NaN rows in lat and lng because they are only 24 rows 

df.drop(df[(df["lng"].isna())].index, inplace = True)

df.drop(df[(df["lat"].isna())].index, inplace = True)

In [63]:
print("Cleaned dataset size:", df.shape)

Cleaned dataset size: (81433, 12)


In [62]:
#save it into a new csv
df.to_csv("clean_fires.csv")

superficie        0
lat               0
lng               0
idcomunidad       0
idprovincia       0
municipio         0
causa             0
causa_supuesta    0
time_ctrl         0
time_ext          0
personal          0
medios            0
dtype: int64