# Analysing wildfire data

In this notebook, we will analyse a dataset of past wildfires using the popular library **[Pandas](https://pandas.pydata.org/)**.

<img src="https://cdn-images-1.medium.com/max/800/1*vjm1w-uem8LErnbsffAmQg.jpeg" width="200">

Pandas is an open-source Python library maintained by the PyData community. It is mostly used for data analysis and processing.

We will work with a dataset of wildfire data loaded from a CSV file. CSV is a plain-text format for tabular data: each row (an object) is one line of the file, and the values within a row are separated by commas, hence the name Comma-Separated Values. Each column holds one variable, and the first line is reserved for the header, which names the columns.
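
As a minimal sketch of the format described above, the snippet below parses a tiny CSV held in a string. The column names echo the dataset, but the values are hypothetical and `io.StringIO` merely stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text: one header line followed by two data rows.
csv_text = "temp_i,rain_i,slope\n7.5,897.7,19.6\n7.6,897.1,21.8\n"

# read_csv accepts a path or any file-like object, so StringIO works here.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.columns.tolist())  # column names come from the header line
print(toy.shape)             # (number of rows, number of columns)
```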


## Intro to Pandas: Loading and exploring the data
Let's first import the needed libraries.

In [2]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
# magic command that renders interactive plots inside the notebook
%matplotlib notebook

We can now load the dataset from a CSV file in the `datasets` directory.
The data is read from disk and assigned to the `df` variable, which is a **pandas DataFrame**.

A pandas DataFrame is an abstraction of tabular data, where each column is a pandas Series.
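
To make the DataFrame/Series relationship concrete, here is a small sketch that builds a DataFrame by hand. The values are hypothetical and only loosely modelled on the dataset's columns.

```python
import pandas as pd

# Hypothetical values; each dictionary key becomes one column.
toy = pd.DataFrame({"temp_i": [7.5, 7.6, 8.1], "slope": [19.6, 21.8, 20.1]})

print(type(toy))            # the table as a whole is a DataFrame
print(type(toy["temp_i"]))  # a single column is a Series
print(toy.dtypes)           # every column carries its own dtype
```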

We import the CSV file as a DataFrame with the following line:


In [4]:
df = pd.read_csv('datasets/Grid100m_first100.csv')

Calling `df.head(k)` shows the first `k` rows of the DataFrame, which Jupyter renders as a table. This is an easy way to get a sense of the data (and your main debugging tool when you start processing it).

In [12]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,FID,grid_code,WF_i,rain_i,temp_i,WF_e,temp_e,rains_e,slope,...,veg21,veg22,veg23,veg32,veg333,veg34,veg35,veg37,POINT_X,POINT_Y
0,0,0,585.564026,0,897.73999,7.51484,0,17.467199,647.487976,19.583799,...,0,0,0,0,0,34,0,0,1494606.0,4946982.0
1,1,1,586.700989,0,897.093018,7.50925,0,17.461201,647.455017,21.7612,...,0,0,0,0,0,34,0,0,1494706.0,4946982.0
2,2,2,584.84198,0,896.432007,7.51839,0,17.471001,647.401001,20.1194,...,0,0,0,0,0,34,0,0,1494806.0,4946982.0
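
Alongside `head`, a few other inspection calls are often useful when first looking at a table. The sketch below uses a small hypothetical DataFrame (values made up, not taken from the dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame(
    {"temp_i": [7.5, 7.6, 8.1, 8.4], "slope": [19.6, 21.8, 20.1, 13.6]}
)

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # column labels
print(toy.tail(2))  # last two rows, the counterpart of head
toy.info()          # dtype and non-null count for every column
```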


We can use the `describe` method to get summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column of the dataset.

In [13]:
df.describe()

Unnamed: 0.1,Unnamed: 0,FID,grid_code,WF_i,rain_i,temp_i,WF_e,temp_e,rains_e,slope,...,veg21,veg22,veg23,veg32,veg333,veg34,veg35,veg37,POINT_X,POINT_Y
count,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0,...,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0,499.0
mean,249.0,249.0,486.135323,0.0,888.98446,8.003791,0.0,17.990866,649.224334,24.257863,...,0.378758,0.220441,0.0,0.0,0.667335,7.835671,22.094188,1.779559,1496005.0,4946203.0
std,144.193157,144.193157,126.866224,0.0,15.437113,0.623878,0.0,0.668197,1.985348,9.18302,...,2.797523,2.193341,0.0,0.0,14.907127,14.332716,16.903136,7.924816,1633.196,396.9129
min,0.0,0.0,251.130997,0.0,842.802978,6.48059,0.0,16.359501,645.072022,0.867539,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1493906.0,4945382.0
25%,124.5,124.5,392.513489,0.0,882.990509,7.57289,0.0,17.52935,647.893494,17.983049,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1494806.0,4945882.0
50%,249.0,249.0,489.339996,0.0,892.192017,7.98803,0.0,17.974001,649.142029,24.802799,...,0.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,1495706.0,4946182.0
75%,373.5,373.5,573.759491,0.0,899.687988,8.464185,0.0,18.483951,650.565521,30.95455,...,0.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,1496706.0,4946532.0
max,498.0,498.0,795.880005,0.0,910.304016,9.15945,0.0,19.228599,654.362,49.895699,...,21.0,22.0,0.0,0.0,333.0,34.0,35.0,37.0,1501206.0,4946982.0
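
On a wide table the full summary is hard to read, so it can help to describe only a few columns and pick out individual statistics. A possible sketch, again on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame(
    {"temp_i": [7.5, 7.6, 8.1, 8.4], "slope": [19.6, 21.8, 20.1, 13.6]}
)

# describe returns a DataFrame: statistics as rows, columns as columns.
summary = toy[["temp_i", "slope"]].describe()

# Individual statistics can be read back with .loc[row_label, column_label].
print(summary.loc["mean", "temp_i"])
print(summary.loc["max", "slope"])
```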


To look at a single Series, index the DataFrame with square brackets and the name of the column we want to extract.

In [23]:
df['temp_i'].head()

0    7.51484
1    7.50925
2    7.51839
3    7.56431
4    7.57553
Name: temp_i, dtype: float64
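
The same square-bracket syntax also accepts a list of names, in which case the result is a DataFrame rather than a Series. A short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame(
    {
        "temp_i": [7.5, 7.6, 8.1],
        "temp_e": [17.4, 17.5, 18.0],
        "slope": [19.6, 21.8, 20.1],
    }
)

one_column = toy["temp_i"]               # a single name gives a Series
two_columns = toy[["temp_i", "temp_e"]]  # a list of names gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__, two_columns.shape)
```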

You can call several aggregation methods on any Series, for example `mean`, `min`, `count` and `sum`.

In [22]:
print("mean: ", df['temp_i'].mean())
print("min: ", df['temp_i'].min())
print("count: ", df['temp_i'].count())
print("sum: ", df['temp_i'].sum())
#print("value_counts: ", df['temp_i'].value_counts())

mean:  8.003790984168337
min:  6.480589900000001
count:  499
sum:  3993.8917011000003
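
Several aggregations can also be requested in one call with `agg`, and `value_counts` (commented out in the cell above) tallies how often each distinct value occurs. The sketch below assumes a small hypothetical table, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame(
    {"temp_i": [7.5, 7.6, 8.1, 8.4], "veg34": [34, 34, 0, 34]}
)

# agg with a list of method names returns a Series indexed by those names.
stats = toy["temp_i"].agg(["mean", "min", "count", "sum", "median"])
print(stats)

# value_counts is most informative on columns with few distinct values.
counts = toy["veg34"].value_counts()
print(counts)
```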


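To apply your own function to the data, define it and pass it to the `apply` method. Here we call `apply` on the DataFrame with `axis=1`, so the function receives one row at a time and returns twice that row's `temp_i` value. The result is a new Series; `df` itself is not modified.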

In [21]:
df.apply(lambda r: r['temp_i'] * 2, axis=1).head()

0    15.02968
1    15.01850
2    15.03678
3    15.12862
4    15.15106
dtype: float64
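
For a simple operation like this one, the same numbers can be obtained in a few ways. The following sketch compares them on hypothetical data; on large tables the vectorised form is usually the fastest.

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame({"temp_i": [7.5, 7.6, 8.1], "slope": [19.6, 21.8, 20.1]})

# 1. Row-wise apply on the DataFrame: the lambda receives each row as a Series.
row_wise = toy.apply(lambda r: r["temp_i"] * 2, axis=1)

# 2. Element-wise apply on one column: the lambda receives each value.
element_wise = toy["temp_i"].apply(lambda t: t * 2)

# 3. Vectorised arithmetic on the whole column, with no Python-level loop.
vectorised = toy["temp_i"] * 2

print(vectorised)
```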

To filter the DataFrame and keep only the rows that satisfy a condition, index it with a boolean expression:

In [20]:
df[df["temp_e"]>=10].head()

Unnamed: 0.1,Unnamed: 0,FID,grid_code,WF_i,rain_i,temp_i,WF_e,temp_e,rains_e,slope,...,veg21,veg22,veg23,veg32,veg333,veg34,veg35,veg37,POINT_X,POINT_Y
0,0,0,585.564026,0,897.73999,7.51484,0,17.467199,647.487976,19.583799,...,0,0,0,0,0,34,0,0,1494606.0,4946982.0
1,1,1,586.700989,0,897.093018,7.50925,0,17.461201,647.455017,21.7612,...,0,0,0,0,0,34,0,0,1494706.0,4946982.0
2,2,2,584.84198,0,896.432007,7.51839,0,17.471001,647.401001,20.1194,...,0,0,0,0,0,34,0,0,1494806.0,4946982.0
3,3,3,575.504028,0,895.758972,7.56431,0,17.520201,647.364014,18.117399,...,0,0,0,0,0,34,0,0,1494906.0,4946982.0
4,4,4,573.221985,0,895.073975,7.57553,0,17.5322,647.294983,13.6347,...,0,0,0,0,0,34,0,0,1495006.0,4946982.0
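
Conditions can be combined with `&` (and), `|` (or) and `~` (not), as long as each comparison is wrapped in parentheses. The sketch below uses hypothetical values and also shows `query` as an alternative spelling of the same filter.

```python
import pandas as pd

# Hypothetical stand-in for the wildfire table.
toy = pd.DataFrame(
    {
        "temp_e": [17.47, 17.46, 17.52, 17.53],
        "slope": [19.6, 21.8, 18.1, 13.6],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (toy["temp_e"] >= 17.5) & (toy["slope"] < 20)
filtered = toy[mask]

# query expresses the same condition as a string.
queried = toy.query("temp_e >= 17.5 and slope < 20")

print(filtered)
```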
