In [None]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
import nbstripout
np.random.seed(15)

In [None]:
# Loading data
data = pd.read_csv("forest_fires_dataset.csv")
attributes = pd.read_csv("attributes_forest_fires.csv")

In [None]:
attributes

Since the dataset has few features, I decided to clarify the ambiguous ones before starting the analysis.


<b>FFMC</b>, The Fine Fuel Moisture Code represents the fuel moisture of forest litter fuels under the shade of a forest canopy, the equivalent of 16-hour timelag fuels. It ranges from 0 to 101.

<b>DMC</b>, The Duff Moisture Code represents the fuel moisture of decomposed organic material underneath the litter. System designers suggest that it represents moisture conditions for the equivalent of 15-day (or 360-hour) timelag fuels. It is unitless and open-ended. It may provide insight into live fuel moisture stress.

<b>DC</b>, The Drought Code, much like the Keetch-Byram Drought Index, represents drying deep into the soil. It approximates moisture conditions for the equivalent of 53-day (1272-hour) timelag fuels. It is unitless, with a maximum value of 1000. Extreme drought conditions have produced DC values near 800.

<b>ISI</b>, The Initial Spread Index is analogous to the NFDRS Spread Component (SC). It combines the fuel moisture of fine dead fuels with surface wind speed to estimate spread potential. ISI is a key input for fire behavior predictions in the FBP system. It is unitless and open-ended.

<b>Higher values of these indices mean that the forest is drier.</b>

Let's take a first look at what we have in our dataset.

In [None]:
data.shape

In [None]:
data.info()
data.columns

In [None]:
data.describe()

In [None]:
data["area"].value_counts()

In [None]:
data["rain"].value_counts()

<b>Observations: </b>

We don't have any missing values, so we don't have to bother with any missing value treatment :)

The month and day columns are stored as strings. They should be converted to numerical values.

99 percent of the 'rain' values are 0. It is hard to imagine a fire when it rains.

Half of the observations describe situations where there was no fire in the forest. (I don't know whether I should drop them or handle them some other way.)
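
The string-to-number conversion mentioned above could look roughly like this. It is a minimal sketch on made-up rows, assuming the columns hold lowercase three-letter abbreviations; `toy`, `month_num` and `day_num` are illustrative names, not columns of the real dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "month": ["aug", "sep", "mar", "dec"],
    "day": ["sun", "mon", "fri", "sat"],
})

months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Map each abbreviation to its 1-based position in calendar order.
month_to_num = {name: i for i, name in enumerate(months, start=1)}
day_to_num = {name: i for i, name in enumerate(days, start=1)}

toy["month_num"] = toy["month"].map(month_to_num)
toy["day_num"] = toy["day"].map(day_to_num)
print(toy)
```

Any value that is not in the mapping becomes NaN, which makes unexpected spellings easy to spot.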

Let's analyse only those situations where there was a fire (area > 0).


In [None]:
fires = data[data.area > 0]

In [None]:
sns.boxplot(data = fires, x = "area")

Let's get rid of outliers (fires with a burnt area of 600 or more).

In [None]:
fires_without_outliers = fires[fires.area < 600]
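
The cut-off of 600 is a fixed number. As an alternative sketch, a threshold can be derived from a quantile of the data itself. The 0.8 quantile and the toy values below are assumptions chosen only to keep the example small; they are not taken from the dataset.

```python
import pandas as pd

# Made-up burnt areas with two extreme values, for illustration only.
toy_fires = pd.DataFrame(
    {"area": [0.4, 1.1, 2.3, 5.0, 9.7, 20.0, 45.0, 120.0, 750.0, 1090.0]}
)

# Use a quantile of the observed areas as the cut-off instead of a fixed number.
cutoff = toy_fires["area"].quantile(0.8)

# Keep only the rows at or below the cut-off.
kept = toy_fires[toy_fires["area"] <= cutoff]
print(cutoff, len(kept))
```

On the real data a much higher quantile (for example 0.99) would probably be a closer match to the fixed threshold, but that is a choice to check against the boxplot.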

In [None]:
fires_without_outliers.hist(bins = 40,figsize=(20,15))

In the northern part of the park there are no fires.
Higher index values coincide with a larger number of fires.
We can see that most of the forest fires were really small.
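
Because most fires are small and a few are very large, the area histogram is strongly right-skewed. One common way to make such a variable easier to plot is a log(1 + x) transform. The sketch below uses made-up values and is only an illustration of the idea, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up burnt areas with a long right tail, for illustration only.
toy_area = pd.Series([0.5, 1.2, 2.0, 3.5, 8.0, 50.0, 700.0], name="area")

# log1p(x) = log(1 + x) stays defined at 0 and compresses large values.
log_area = np.log1p(toy_area)

# Skewness drops after the transform; expm1 recovers the original values.
print(toy_area.skew(), log_area.skew())
restored = np.expm1(log_area)
```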

In [None]:
months = ["jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"]
sns.countplot(data = fires, x = "month", order = months )


We can see that almost all fires happened in August or September.

In [None]:
# Total burnt area per month (only the "area" column is summed)
by_months = data.groupby("month", as_index=False)["area"].sum()
sns.barplot(data=by_months, x = "month", y = "area", order = months)

The area burnt by fires is also the largest in those months.

In [None]:
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
sns.countplot(data = data, x = "day", order = days )


The day of the week does not seem to be relevant, although Sunday had the highest number of forest fires.

In [None]:
# Total burnt area per day of the week
by_days = data.groupby("day", as_index=False)["area"].sum()
sns.barplot(data=by_days, x = "day", y = "area", order = days)

But we can see that the biggest area was burnt on Saturday! This might be due to our outliers, though. How does it look without them?

In [None]:
# Total burnt area per day of the week, with the outliers removed
by_days_without_outliers = fires_without_outliers.groupby("day", as_index=False)["area"].sum()
sns.barplot(data=by_days_without_outliers, x = "day", y = "area", order = days)

Now it looks a bit different. The biggest burnt area was still on Saturday, but the difference is not as large.
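
A per-day sum mixes two things: how many fires there were and how large each one was. A small sketch on made-up rows shows how count, sum and median can be put side by side so that a single large fire is easier to spot. The values and the `summary` name are illustrative assumptions.

```python
import pandas as pd

# Made-up fires: Saturday has one very large fire, Sunday has more small ones.
toy = pd.DataFrame({
    "day": ["sat", "sat", "sun", "sun", "sun", "mon"],
    "area": [1.0, 200.0, 2.0, 3.0, 4.0, 5.0],
})

# Count, total and median burnt area per day of the week.
summary = toy.groupby("day")["area"].agg(["count", "sum", "median"])
print(summary)
```

Here Saturday has the largest sum because of one fire, while its count is lower than Sunday's.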

In [None]:
sns.pairplot(fires_without_outliers, y_vars = "area", x_vars = data.columns.values[:5])
sns.pairplot(fires_without_outliers, y_vars = "area", x_vars = data.columns.values[5:10])
sns.pairplot(fires_without_outliers, y_vars = "area", x_vars = data.columns.values[10:11])

In [None]:
# Correlate numeric columns only; "month" and "day" are strings
sns.heatmap(data.select_dtypes(include="number").corr())

The heatmap doesn't give us much information, as we cannot see any correlation between the area variable and the other variables.
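
Colours in a heatmap can hide small differences, so it may help to read the correlations with area as numbers. The sketch below also swaps in Spearman rank correlation, which is less affected by a few extreme areas than the default Pearson coefficient. That swap is my assumption, not something done elsewhere in this notebook, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset's naming.
toy = pd.DataFrame({
    "temp": [10, 15, 20, 25, 30],
    "RH": [80, 60, 70, 40, 30],
    "area": [0.1, 0.5, 0.4, 3.0, 90.0],
})

# Rank-based correlation of every column with "area", sorted ascending.
area_corr = toy.corr(method="spearman")["area"].drop("area").sort_values()
print(area_corr)
```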

In [None]:
pandas_profiling.ProfileReport(data)


The report doesn't let us correlate categorical variables with continuous ones.
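
One way to relate a categorical variable to a continuous one is the correlation ratio (eta squared): the share of the continuous variable's variance that is explained by the group means. The helper `correlation_ratio` below is a hypothetical function written for this sketch, not part of pandas or pandas_profiling, and the rows are made up.

```python
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Return eta squared: between-group variance divided by total variance."""
    overall_mean = values.mean()
    total_ss = ((values - overall_mean) ** 2).sum()
    if total_ss == 0:
        return 0.0  # a constant variable has no variance to explain
    # Replace every value by the mean of its category.
    group_means = values.groupby(categories).transform("mean")
    between_ss = ((group_means - overall_mean) ** 2).sum()
    return float(between_ss / total_ss)

# Made-up rows: temperature differs strongly between the two months.
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "sep"],
    "temp": [20.0, 22.0, 10.0, 12.0],
})
eta_sq = correlation_ratio(toy["month"], toy["temp"])
print(eta_sq)
```

A value near 0 means the categories tell us almost nothing about the continuous variable, and a value near 1 means the group means explain nearly all of its variance.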