Summary of the data set

The data used in this project describe the burnt areas of forest fires in the northeast region of Portugal and were compiled by Cortez and Morais (2007).

The forest fire records are historical events that occurred in the Montesinho natural park.

The park covers about 748 km², or 74,229 ha, in a mountainous region, with altitude ranging from 438 m in the lower valley to 1481 m at the mountain top.

There are 517 observations (rows) and 13 columns in the data, and there are no missing values. Each row represents one fire monitoring instance. The column area is our target (the burned area), and the 12 other measurements and indices serve as features (including month, day, RH, rain, DC and ISI).
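
The shape and missing-value claims can be checked programmatically. The following is a minimal sketch, assuming pandas is available; `summarize_frame` is a hypothetical helper name, and the sample is only the first five rows shown in this notebook's df.head() output, not the full file:

```python
import io
import pandas as pd

# First five rows of the dataset, copied from the df.head() output in this notebook.
SAMPLE_CSV = """X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0
"""


def summarize_frame(frame):
    """Return the row count, column count and total number of missing cells."""
    n_rows, n_cols = frame.shape
    n_missing = int(frame.isna().sum().sum())
    return {"rows": n_rows, "columns": n_cols, "missing": n_missing}


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize_frame(sample)
print(summary)
```

Applied to the full forestfires.csv, the same helper should report 517 rows, 13 columns and 0 missing cells, in line with the df.shape and df.info() outputs in this notebook.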

In [1]:
# Import the libraries
import pandas as pd
import numpy as np  # for the logarithmic transformation
import altair as alt  # graphs and charts
alt.renderers.enable("default")

RendererRegistry.enable('default')

In [2]:
df = pd.read_csv("forestfires.csv")
df.head()

   X  Y  month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5    mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4    oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4    oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6    mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6    mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0


In [3]:
# Number of columns and rows 
df.shape

(517, 13)

In [4]:
# Information for each column
df.info()
# As the column information shows, there are no null values.
# X and Y are location specifiers, so we might need to encode them.
# month and day can also be encoded.
# Let's also convert "RH" (relative humidity) to a float column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       517 non-null    int64  
 1   Y       517 non-null    int64  
 2   month   517 non-null    object 
 3   day     517 non-null    object 
 4   FFMC    517 non-null    float64
 5   DMC     517 non-null    float64
 6   DC      517 non-null    float64
 7   ISI     517 non-null    float64
 8   temp    517 non-null    float64
 9   RH      517 non-null    int64  
 10  wind    517 non-null    float64
 11  rain    517 non-null    float64
 12  area    517 non-null    float64
dtypes: float64(8), int64(3), object(2)
memory usage: 52.6+ KB


In [5]:
df["RH"] = df["RH"].astype(float)
print(df["RH"].dtype)

float64


In [6]:
#We start by counting the unique values in categorical columns.
df["X"].value_counts()
#We can see that the x-axis spatial coordinate within the Montesinho park (X) ranges from 1 to 9

4    91
6    86
2    73
8    61
7    60
3    55
1    48
5    30
9    13
Name: X, dtype: int64

In [7]:
df["Y"].value_counts() 
#The y-axis spatial coordinate within the Montesinho park (Y) ranges from 2 to 9; no observations have a y-coordinate of 1 or 7.

4    203
5    125
6     74
3     64
2     44
9      6
8      1
Name: Y, dtype: int64

In [8]:
#Next, we count the unique values in month and day. 
df["month"].value_counts() 
#We can see that the majority of the observations (356 of 517, about 69%) are in August and September.

aug    184
sep    172
mar     54
jul     32
feb     20
jun     17
oct     15
apr      9
dec      9
jan      2
may      2
nov      1
Name: month, dtype: int64

In [9]:
#The data has more observations on weekend days (and on Fridays) than on the other weekdays.
df["day"].value_counts()

sun    95
fri    85
sat    84
mon    74
tue    64
thu    61
wed    54
Name: day, dtype: int64
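
The weekend claim can be made concrete by comparing the average number of observations per weekend day with the average per weekday. This is a small sketch using only the counts printed above; treating Saturday and Sunday as the weekend (and Friday as a weekday) is an assumption made here, not something stated in the data:

```python
# Observation counts per day, copied from the value_counts() output above.
day_counts = {"sun": 95, "fri": 85, "sat": 84, "mon": 74,
              "tue": 64, "thu": 61, "wed": 54}

WEEKEND = {"sat", "sun"}  # assumption: Friday is treated as a weekday

weekend = [n for day, n in day_counts.items() if day in WEEKEND]
weekday = [n for day, n in day_counts.items() if day not in WEEKEND]

# Average number of recorded fires per day within each group
weekend_mean = sum(weekend) / len(weekend)
weekday_mean = sum(weekday) / len(weekday)

print(f"mean observations per weekend day: {weekend_mean:.1f}")
print(f"mean observations per weekday:     {weekday_mean:.1f}")
```

With these counts the weekend average is 89.5 observations per day against 67.6 per weekday, although Friday (85) is slightly ahead of Saturday (84).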

In [10]:
df.describe()
#The summary statistics below suggest that many columns may contain outliers.

              X         Y       FFMC        DMC          DC       ISI       temp         RH      wind      rain       area
count     517.0     517.0      517.0      517.0       517.0     517.0      517.0      517.0     517.0     517.0      517.0
mean   4.669246  4.299807  90.644681  110.87234  547.940039  9.021663  18.889168  44.288201  4.017602  0.021663  12.847292
std    2.313778    1.2299   5.520111  64.046482  248.066192  4.559477   5.806625  16.317469  1.791653  0.295959  63.655818
min         1.0       2.0       18.7        1.1         7.9       0.0        2.2       15.0       0.4       0.0        0.0
25%         3.0       4.0       90.2       68.6       437.7       6.5       15.5       33.0       2.7       0.0        0.0
50%         4.0       4.0       91.6      108.3       664.2       8.4       19.3       42.0       4.0       0.0       0.52
75%         7.0       5.0       92.9      142.4       713.9      10.8       22.8       53.0       4.9       0.0       6.57
max         9.0       9.0       96.2      291.3       860.6      56.1       33.3      100.0       9.4       6.4    1090.84
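
One way to make "may contain outliers" more concrete is to compare each column's minimum and maximum with Tukey's 1.5 x IQR fences. The sketch below uses only the quartiles printed above; the choice of the 1.5 x IQR rule and the helper name `iqr_fences` are assumptions for illustration, not part of the original analysis:

```python
# (min, Q1, Q3, max) per column, copied from the df.describe() output above.
summary = {
    "FFMC": (18.7, 90.2, 92.9, 96.2),
    "DMC": (1.1, 68.6, 142.4, 291.3),
    "DC": (7.9, 437.7, 713.9, 860.6),
    "ISI": (0.0, 6.5, 10.8, 56.1),
    "temp": (2.2, 15.5, 22.8, 33.3),
    "RH": (15.0, 33.0, 53.0, 100.0),
    "wind": (0.4, 2.7, 4.9, 9.4),
    "rain": (0.0, 0.0, 0.0, 6.4),
    "area": (0.0, 0.0, 6.57, 1090.84),
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# For each column, record whether the min or max falls outside the fences.
flagged = {}
for name, (col_min, q1, q3, col_max) in summary.items():
    low_fence, high_fence = iqr_fences(q1, q3)
    flagged[name] = {"low": col_min < low_fence, "high": col_max > high_fence}
    print(f"{name:>5}: below lower fence={flagged[name]['low']}, "
          f"above upper fence={flagged[name]['high']}")
```

Because only the extremes are checked, a flag means that at least one value lies outside the fences; it does not say how many. Counting them would require the full column.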


The attributes FFMC, DMC, DC and ISI are components of the Canadian Forest Fire Weather Index (FWI) system, which is used to rate forest fire danger. The FFMC (Fine Fuel Moisture Code) reflects the moisture of surface litter, which influences the ignition and spread of fire. The DMC (Duff Moisture Code) and DC (Drought Code) relate to fire intensity, while the ISI (Initial Spread Index) correlates with the velocity of fire spread.
The other four attributes (temp, RH, wind, rain) are meteorological data that can also affect fire spread. The target of our modeling is the last attribute, area.

In [11]:
#Fig.1
#From figure 1, we see that the target is highly skewed, with many observations equal to 0.
alt.Chart(df).mark_bar().encode(
    x = alt.X("area", bin = alt.Bin(maxbins = 30)),
    y = "count()")

In [12]:
#Count the observations whose target value is 0
int((df["area"] == 0).sum())
#These zeros reflect the data collection threshold: a burned area smaller than 100 m^2 is recorded as 0.

247

In [13]:
#Fig.2
#Because the target is skewed and contains many zeros, we rescale the burned area with log10(area + 1);
#the +1 keeps the zero values defined.
df["log_area"] = np.log10(df["area"] + 1)

alt.Chart(df).mark_bar().encode(
    x = alt.X("log_area", bin = alt.Bin(maxbins = 20)),
    y = alt.Y("count()"))
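
The log10(area + 1) rescaling can be inverted, which matters if predictions made on the log scale need to be reported in hectares again. A minimal sketch, assuming NumPy; the four sample values are the min, median, 75th percentile and max of area from the df.describe() output, and the names `log_area` and `recovered` are only illustrative:

```python
import numpy as np

# Burned-area values in ha, taken from the df.describe() output:
# min, median, 75th percentile and max.
area = np.array([0.0, 0.52, 6.57, 1090.84])

log_area = np.log10(area + 1)    # forward transform used in the notebook
recovered = 10 ** log_area - 1   # inverse transform, back to hectares

for a, l, r in zip(area, log_area, recovered):
    print(f"area={a:9.2f}  log_area={l:.3f}  recovered={r:9.2f}")
```

A zero area maps to exactly 0 on the log scale, and the ordering of the observations is preserved because the transform is monotonic.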

Predictors

In [14]:
#Fig.3 Boxplots of burnt areas of the forest (sqrt transformed) per day of the week
#Many observations have an area of 0, so for visualization purposes we apply a square root scale to the target value.
alt.Chart(df).mark_boxplot(size=15).encode(
    x = alt.X("area",
              scale = alt.Scale(type = "sqrt"),
              title = "Area Burnt"),
    y = alt.Y("day",sort = "x",
              title = "Day of Week"),
    color = alt.Color("day",legend = None)
).properties(
  height = 250,
  width = 400)

#Figure 3 shows that there is no clear relationship between the burnt area of the forest and the days of the week.

In [15]:
#Fig 4. Boxplots of burnt areas of the forest (sqrt transformed) per month
alt.Chart(df).mark_boxplot(size=15).encode(
   x = alt.X("area",
             scale = alt.Scale(type ="sqrt"),
             title = "Area Burnt (Square Root Transformation)"),
   y = alt.Y("month",
             sort = "x",
             title = "Month"),
   color = alt.Color("month",
                     legend = None)
).properties(
  height = 250,
  width = 450)

#Figure 4 shows that some months such as January, May and November do not have many observations.
#Since the month variable is unbalanced, we create a season variable to reduce the risk of overfitting.

In [16]:
season_mapping = {
    "dec":"winter",
    "jan":"winter",
    "feb":"winter",
    "mar":"spring",
    "apr":"spring",
    "may":"spring",
    "jun":"summer",
    "jul":"summer",
    "aug":"summer",
    "sep":"fall",
    "oct":"fall",
    "nov":"fall"
}
df["season"] = df["month"].map(season_mapping)
#Fig 5. Boxplots of burnt areas of the forest (sqrt transformed) per season
alt.Chart(df).mark_boxplot(size = 20).encode(
    x = alt.X("area",
              scale = alt.Scale(type = "sqrt"),
              title = "Area Burnt (Square Root Transformation)"),
    y = alt.Y("season",
              sort = "x",
              title = "Season"),
    color = alt.Color("season",legend = None)
).properties(
  height = 200,
  width = 450)
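
To see how much the season variable reduces the imbalance, the month counts printed earlier can be aggregated with the same mapping. This is a small sketch that uses only numbers already shown in this notebook; `season_counts` is an illustrative name:

```python
from collections import Counter

# Month counts copied from the df["month"].value_counts() output.
month_counts = {"aug": 184, "sep": 172, "mar": 54, "jul": 32, "feb": 20,
                "jun": 17, "oct": 15, "apr": 9, "dec": 9, "jan": 2,
                "may": 2, "nov": 1}

# Same month-to-season mapping as in the cell above.
season_mapping = {
    "dec": "winter", "jan": "winter", "feb": "winter",
    "mar": "spring", "apr": "spring", "may": "spring",
    "jun": "summer", "jul": "summer", "aug": "summer",
    "sep": "fall", "oct": "fall", "nov": "fall",
}

# Sum the month counts within each season.
season_counts = Counter()
for month, n in month_counts.items():
    season_counts[season_mapping[month]] += n

for season, n in season_counts.most_common():
    print(f"{season:>6}: {n}")
```

The seasons are still unbalanced (233 summer, 188 fall, 65 spring, 31 winter), but the smallest group now has 31 observations instead of a single one.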

In [17]:
#Fig 6. Location of the burnt areas of the forest
alt.Chart(df).mark_circle().encode(
    x = alt.X("X",
              title = "X-axis Coordinate"),
    y = alt.Y("Y",
              title = "Y-axis Coordinate"),
    size = alt.Size("area",
                    scale = alt.Scale(range = (20,1500)),
                    title = "Burnt Area")
).configure_mark(
  color = "orange",
  opacity = 0.7
)
#Different locations of the park experienced different extents of areas burnt.
#We can see that (6,5) and (8,6) stand out.

In [18]:
alt.Chart(df).mark_circle().encode(
    x = alt.X(alt.repeat("row"), type = "quantitative"),
    y = alt.Y(alt.repeat("column"), type = "quantitative"),
    color = "season"
).properties(
    width = 110,
    height = 110
).repeat(
    column = ["FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain"],
    row = ["FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind", "rain"]
).configure_mark(
    opacity = 0.4
).interactive()
#Figure 7 plots the pairwise relationships between the numerical variables of the dataset.
#This plot shows the patterns between the numerical variables and reveals outliers in the data.
#For example, variables such as FFMC, DMC, DC, ISI and rain contain outliers.

In [19]:
#Fig 8. Correlation heatmap for numerical variables
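# Drop the coordinates and all text columns; pandas 2.x raises an error
# when corr() meets a non-numeric column such as "season".
df_numeric = df.drop(["X", "Y", "month", "day", "season"], axis=1)

corr_df = df_numeric.corr("spearman").stack().reset_index(name="corr")
# Zero out the diagonal (each variable with itself) so it does not dominate the plot
corr_df.loc[corr_df["level_0"] == corr_df["level_1"], "corr"] = 0
corr_df["abs"] = corr_df["corr"].abs()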

(
    alt.Chart(corr_df)
    .mark_circle()
    .encode(x = alt.X("level_0", title = "Variables"),
            y = alt.Y("level_1", title = "Variables"),
            size = "abs",
            color = alt.Color('corr',
                               scale = alt.Scale(scheme = 'blueorange',
                                                 domain = (-1, 1)),
                               title = "Correlation"))
).properties(
    width = 300,
    height = 300
)
#Figure 8 shows the correlations between the numerical variables of our data. Some variables appear to be correlated with each other.
#For example, the correlations between ISI and FFMC and between DMC and DC seem to be fairly high.
#We need to keep these correlated features in mind when building our model.
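
Reading correlated pairs off a heatmap is approximate, so it can help to list them explicitly. The following is a sketch only: `strong_pairs` and the 0.7 default threshold are hypothetical choices, and the demo frame contains toy values rather than the forest fire data:

```python
import numpy as np
import pandas as pd


def strong_pairs(frame, threshold=0.7):
    """List variable pairs whose absolute Spearman correlation is >= threshold."""
    corr = frame.select_dtypes("number").corr(method="spearman")
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (each variable with itself) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["var_1", "var_2", "corr"]
    strong = pairs[pairs["corr"].abs() >= threshold]
    # Order the pairs from strongest to weakest absolute correlation.
    order = strong["corr"].abs().sort_values(ascending=False).index
    return strong.loc[order].reset_index(drop=True)


# Toy data for illustration: b is a monotonic function of a, c is unrelated.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 10],
                     "c": [3, 1, 4, 1, 5]})
result = strong_pairs(demo, threshold=0.8)
print(result)
```

On the notebook's data, calling strong_pairs(df_numeric) would give a table of candidate collinear features to review before modeling; which threshold is appropriate depends on the model being used.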