# Analysing wildfire data

In this notebook, we are going to analyse a dataset of past wildfires using the popular library **[Pandas](https://pandas.pydata.org/)**!

<img src="https://cdn-images-1.medium.com/max/800/1*vjm1w-uem8LErnbsffAmQg.jpeg" width="200">

Pandas is an open-source Python library maintained by the PyData community and used mostly for data analysis and processing. A pandas object can be thought of as a NumPy array with labels for its rows and columns and better support for heterogeneous data types, but it is also much, much more than that.

It is particularly powerful for working with missing data and time series, for reading and writing data, and for reshaping, grouping and merging it.
Its documentation is available at http://pandas.pydata.org/pandas-docs/stable/

### When do you need pandas?

- When working with tabular or structured data (such as an R data frame, an SQL table or an Excel spreadsheet), to:
  - Import data
  - Clean up messy data
  - Explore data, gain insight into data
  - Process and prepare your data for analysis
  - Analyse your data (together with scikit-learn, statsmodels, ...)

In this notebook, we will work on a dataset of wildfires. The dataset will be loaded from a CSV file, a plain-text format that stores a table: each line of the file is a row (one object), and within a line the values of the different columns (which pandas exposes as Series) are separated by commas, hence the name Comma-Separated Values. The first line is reserved for the header, which holds the name of each column.
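As a minimal sketch of the format just described, the snippet below parses a small CSV held in a string instead of a file. The three rows are copied from the first rows of the wildfire dataset displayed further down, restricted to four columns; the names `csv_text` and `sample` are chosen here purely for illustration.

```python
import io

import pandas as pd

# A tiny CSV: the header line comes first, then one line per row,
# with the values of each row separated by commas.
csv_text = """data,area_ha,provincia,comune
1997-01-03,2.4664,IMPERIA,COSIO DI ARROSCIA
1997-01-09,3.3408,IMPERIA,VENTIMIGLIA
1997-01-12,3.14875,IMPERIA,DOLCEDO
"""

# read_csv accepts a path or any file-like object,
# so StringIO stands in for a file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names taken from the header line
print(sample.shape)             # (number of rows, number of columns)
```

The header line supplies the column names, and every following line becomes one row of the resulting DataFrame.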


## Intro to Pandas: Loading and exploring the data
Let's first import the needed libraries.

In [18]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
plt.ion()

We can now load the dataset, which is stored as a CSV file in the `datasets` directory.
It will be read from disk and assigned to the `df` variable, which is a **pandas DataFrame**.


A DataFrame is a tabular data structure (a two-dimensional object that holds labeled data) made up of rows and columns, akin to a spreadsheet or a database table. You can think of it as multiple Series objects that share the same index.
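To make the "multiple Series sharing one index" picture concrete, here is a small sketch that builds a DataFrame out of two Series. The values come from the first rows of the wildfire dataset; the variable names (`area`, `comune`, `toy`) are illustrative only.

```python
import pandas as pd

# Two Series with the same default index (0, 1, 2).
area = pd.Series([2.4664, 3.3408, 3.14875], name="area_ha")
comune = pd.Series(["COSIO DI ARROSCIA", "VENTIMIGLIA", "DOLCEDO"], name="comune")

# Concatenating along the columns aligns the Series on their common index;
# the Series names become the column names.
toy = pd.concat([area, comune], axis=1)

# Each column of the DataFrame is itself a Series carrying that same index.
print(type(toy["area_ha"]))
print(toy.index.equals(toy["comune"].index))
```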

### Loading the dataframe from csv
We first import the CSV file as a DataFrame with the following line:

In [3]:
df = pd.read_csv('datasets/incendi_liguria.csv')

A DataFrame has an `index` attribute and a `columns` attribute:

In [24]:
df.index

RangeIndex(start=0, stop=5964, step=1)

In [26]:
df.columns

Index(['giorno', 'mese', 'anno', 'stagione', 'data', 'area_ha', 'non_veg',
       'coltivi', 'prati', 'oliveti', 'pinete', 'latifoglie', 'castagneti',
       'arbusti', 'altro', 'provincia', 'comune'],
      dtype='object')

You can get an overview of the DataFrame using the `info` method:

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5964 entries, 0 to 5963
Data columns (total 17 columns):
giorno        5964 non-null int64
mese          5964 non-null int64
anno          5964 non-null int64
stagione      5964 non-null int64
data          5964 non-null object
area_ha       5964 non-null float64
non_veg       5964 non-null float64
coltivi       5964 non-null float64
prati         5964 non-null float64
oliveti       5964 non-null float64
pinete        5964 non-null float64
latifoglie    5964 non-null float64
castagneti    5964 non-null float64
arbusti       5964 non-null float64
altro         5964 non-null float64
provincia     5964 non-null object
comune        5964 non-null object
dtypes: float64(10), int64(4), object(3)
memory usage: 792.2+ KB
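
The `info` output above lists the `data` column with dtype `object`, which suggests the dates are stored as plain strings. A minimal sketch of converting such a column to real timestamps follows. It uses a toy DataFrame with three date strings taken from the first rows of the dataset, and it assumes the `YYYY-MM-DD` layout seen there holds for every row.

```python
import pandas as pd

# Toy stand-in for the 'data' column: dates stored as text.
toy = pd.DataFrame({"data": ["1997-01-03", "1997-01-09", "1997-01-12"]})

# Parse the strings into timestamps; the explicit format is an assumption
# based on the rows shown in this notebook.
toy["data"] = pd.to_datetime(toy["data"], format="%Y-%m-%d")

# Once parsed, date components are available through the .dt accessor.
print(toy["data"].dt.year.tolist())
print(toy["data"].dt.day.tolist())
```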


Note that a DataFrame can also be constructed in other ways, for example from a dictionary of arrays:

In [29]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
df_countries = pd.DataFrame(data)
df_countries

|   | country | population | area | capital |
|---|---|---|---|---|
| 0 | Belgium | 11.3 | 30510 | Brussels |
| 1 | France | 64.3 | 671308 | Paris |
| 2 | Germany | 81.3 | 357050 | Berlin |
| 3 | Netherlands | 16.9 | 41526 | Amsterdam |
| 4 | United Kingdom | 64.9 | 244820 | London |
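
A dictionary of arrays is only one of the accepted inputs. As a hedged sketch of another option, the snippet below builds a DataFrame from a list of per-row dictionaries and then promotes one column to the index. The names `records`, `df_records` and `df_by_country` are illustrative, and only two of the countries above are reused to keep it short.

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
records = [
    {"country": "Belgium", "population": 11.3, "capital": "Brussels"},
    {"country": "France", "population": 64.3, "capital": "Paris"},
]
df_records = pd.DataFrame(records)

# Use the 'country' column as the index instead of the default 0, 1, ...
df_by_country = df_records.set_index("country")

# Rows can now be looked up by label.
print(df_by_country.loc["France", "capital"])
```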


### Exploring the data

Calling `df.head(k)` shows the first k rows of the DataFrame, rendered as a table thanks to Jupyter’s magic. This is an easy way to get a sense of the data (and your main debugging tool when you start processing it).

In [4]:
df.head(3)

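|   | giorno | mese | anno | stagione | data | area_ha | non_veg | coltivi | prati | oliveti | pinete | latifoglie | castagneti | arbusti | altro | provincia | comune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 1997 | 1 | 1997-01-03 | 2.4664 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.6 | 0.96 | 0.0 | IMPERIA | COSIO DI ARROSCIA |
| 1 | 9 | 1 | 1997 | 1 | 1997-01-09 | 3.3408 | 0.0 | 0.0 | 0.0 | 0.0 | 0.12 | 0.0 | 0.0 | 3.08 | 0.0 | IMPERIA | VENTIMIGLIA |
| 2 | 12 | 1 | 1997 | 1 | 1997-01-12 | 3.14875 | 0.0 | 0.2 | 0.0 | 1.68 | 0.0 | 0.0 | 0.0 | 1.2 | 0.0 | IMPERIA | DOLCEDO |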


We can use the `describe` method to get summary statistics for every column of the dataset:

In [20]:
df.describe(include='all')

|   | giorno | mese | anno | stagione | data | area_ha | non_veg | coltivi | prati | oliveti | pinete | latifoglie | castagneti | arbusti | altro | provincia | comune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964.0 | 5964 | 5964 |
| unique |  |  |  |  | 2337 |  |  |  |  |  |  |  |  |  |  | 4 | 231 |
| top |  |  |  |  | 2002-03-23 |  |  |  |  |  |  |  |  |  |  | IMPERIA | GENOVA |
| freq |  |  |  |  | 33 |  |  |  |  |  |  |  |  |  |  | 2237 | 523 |
| mean | 15.673877 | 5.620221 | 2003.33283 | 1.457243 |  | 9.71345 | 0.049054 | 0.101341 | 0.764621 | 0.203065 | 0.740979 | 1.48165 | 0.654105 | 3.332871 | 0.195808 |  |  |
| std | 8.665188 | 3.246608 | 4.944066 | 0.49821 |  | 52.827459 | 0.60152 | 0.902872 | 9.711419 | 1.372017 | 8.330526 | 8.389115 | 5.389534 | 25.381324 | 2.145707 |  |  |
| min | 1.0 | 1.0 | 1997.0 | 1.0 |  | 0.100034 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |  |
| 25% | 8.0 | 3.0 | 1999.0 | 1.0 |  | 0.347652 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |  |
| 50% | 16.0 | 6.0 | 2003.0 | 1.0 |  | 1.009925 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |  |
| 75% | 23.0 | 8.0 | 2007.0 | 2.0 |  | 3.872253 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32 | 0.0 | 0.56 | 0.0 |  |  |


To look at a single Series, index the DataFrame with the square-brackets operator and the name of the column you want to extract:

In [12]:
df['area_ha'].head()

0    2.46640
1    3.34080
2    3.14875
3    4.57335
4    1.20470
Name: area_ha, dtype: float64
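
The same square-brackets operator also accepts a list of column names. The sketch below, on a toy DataFrame filled with values from the first rows of the dataset, shows the difference in what comes back; `toy`, `one_column` and `two_columns` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "area_ha": [2.4664, 3.3408, 3.14875],
    "provincia": ["IMPERIA", "IMPERIA", "IMPERIA"],
    "comune": ["COSIO DI ARROSCIA", "VENTIMIGLIA", "DOLCEDO"],
})

# A single column name returns a Series.
one_column = toy["area_ha"]

# A list of column names returns a DataFrame with just those columns.
two_columns = toy[["area_ha", "comune"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
```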

You can call several aggregation methods on any Series.

Example: mean, min, count and sum.

In [13]:
print("mean: ", df['area_ha'].mean())
print("min: ", df['area_ha'].min())
print("count: ", df['area_ha'].count())
print("sum: ", df['area_ha'].sum())
# print("value_counts: ", df['provincia'].value_counts())

mean:  9.713449881120052
min:  0.100034
count:  5964
sum:  57931.015090999994
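
Several aggregations can also be requested in one call. The following is a small sketch using `Series.agg` on a toy Series holding the first three `area_ha` values shown earlier; the names `areas` and `summary` are illustrative.

```python
import pandas as pd

# First three burned areas (in hectares) from the dataset preview.
areas = pd.Series([2.4664, 3.3408, 3.14875], name="area_ha")

# agg with a list of function names returns a Series
# indexed by those names, one entry per aggregation.
summary = areas.agg(["mean", "min", "count", "sum"])

print(summary)
```

Individual results can then be read by label, for example `summary["mean"]`.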


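If you want to apply your own function, define it and pass it to the `apply` method. Here the function is applied to the whole DataFrame with `axis=1`, so it receives one row at a time and converts the burned area from hectares to square kilometres (1 km² = 100 ha).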

In [21]:
def area_to_km2(r):
    # r is one row of the DataFrame; 1 km2 = 100 ha
    return r['area_ha'] / 100

# axis=1 calls the function once per row
df.apply(area_to_km2, axis=1).head()

0    0.024664
1    0.033408
2    0.031488
3    0.045733
4    0.012047
dtype: float64
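
The call above goes through the DataFrame row by row. When only one column is involved, two lighter alternatives are worth knowing. This is a sketch on a toy DataFrame with the first three `area_ha` values, and `ha_to_km2`, `via_apply` and `vectorised` are names made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({"area_ha": [2.4664, 3.3408, 3.14875]})


def ha_to_km2(area_ha):
    """Convert a single area from hectares to square kilometres."""
    return area_ha / 100


# Option 1: apply the function to the Series, one value at a time.
via_apply = toy["area_ha"].apply(ha_to_km2)

# Option 2: plain arithmetic on the Series, applied to all values at once.
vectorised = toy["area_ha"] / 100

# Largest absolute difference between the two results.
print((via_apply - vectorised).abs().max())
```

For simple arithmetic like this, the vectorised form is usually the shorter and faster choice; `apply` is most useful when the logic cannot be expressed with column operations.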

You can filter a DataFrame and keep only the rows that satisfy a given condition:

In [16]:
df[df["area_ha"]>=1000]

|   | giorno | mese | anno | stagione | data | area_ha | non_veg | coltivi | prati | oliveti | pinete | latifoglie | castagneti | arbusti | altro | provincia | comune |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 6 | 2 | 1997 | 1 | 1997-02-06 | 1339.01 | 3.96 | 12.36 | 447.88 | 0.0 | 115.28 | 23.92 | 100.88 | 634.44 | 0.0 | GENOVA | GENOVA |
| 610 | 25 | 9 | 1997 | 2 | 1997-09-25 | 1236.79 | 7.8 | 15.52 | 115.76 | 36.4 | 2.4 | 161.88 | 0.0 | 891.04 | 0.0 | SAVONA | ALBENGA |
| 3986 | 15 | 2 | 2005 | 1 | 2005-02-15 | 1800.5 | 2.48 | 40.48 | 495.28 | 6.96 | 467.24 | 215.32 | 59.48 | 477.44 | 0.0 | GENOVA | GENOVA |
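
The expression inside the brackets is a boolean Series, one True/False value per row, and several such conditions can be combined. Below is a minimal sketch on a toy DataFrame built from three columns of the rows shown above; the names `toy`, `mask` and `large_genova` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "anno": [1997, 1997, 2005],
    "area_ha": [1339.01, 1236.79, 1800.5],
    "provincia": ["GENOVA", "SAVONA", "GENOVA"],
})

# Each comparison yields a boolean Series. Combine them with & (and) or | (or),
# wrapping every condition in parentheses because of operator precedence.
mask = (toy["area_ha"] >= 1000) & (toy["provincia"] == "GENOVA")

# Indexing with the mask keeps only the rows where it is True.
large_genova = toy[mask]

print(mask.tolist())
print(large_genova)
```

Note that the Python keywords `and` and `or` do not work element-wise on Series, which is why the `&` and `|` operators are used here.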
