<a href="https://colab.research.google.com/github/mrsferret/Machine-Learning-ITNPBD6-/blob/main/Forest_Fires_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Check the Data

Before we go anywhere near any data mining tools, let’s take a look at our data in its raw form. Browse to the local folder where you have saved these files and double-click on forestfires_classification-1.csv.

Scrolling through the data, I can see that there are 13 columns; the first row identifies what each column contains. There are 517 rows of data.

What types of data are there?
A mix: text fields, integers and decimals.



In [None]:
import pandas as pd
import numpy as np

# Remember to change this path if you've saved the data somewhere else
df = pd.read_csv("forestfires_classification-1.csv")

Let's take a look at the data before doing anything else...

* How many rows (instances) are there in the data?
* How many columns (variables) are there?

We can find these out like this:

In [None]:
df.shape

(517, 13)

The CSV contains 517 rows and 13 columns.

In [None]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,F
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,F
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,F
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,F
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,F


Next, get summary statistics for every column:

In [None]:
df.describe(include="all")

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
count,517.0,517.0,517,517,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517
unique,,,12,7,,,,,,,,,2
top,,,aug,sun,,,,,,,,,T
freq,,,184,95,,,,,,,,,292
mean,4.669246,4.299807,,,90.644681,110.87234,547.940039,9.021663,18.889168,44.288201,4.017602,0.021663,
std,2.313778,1.2299,,,5.520111,64.046482,248.066192,4.559477,5.806625,16.317469,1.791653,0.295959,
min,1.0,2.0,,,18.7,1.1,7.9,0.0,2.2,15.0,0.4,0.0,
25%,3.0,4.0,,,90.2,68.6,437.7,6.5,15.5,33.0,2.7,0.0,
50%,4.0,4.0,,,91.6,108.3,664.2,8.4,19.3,42.0,4.0,0.0,
75%,7.0,5.0,,,92.9,142.4,713.9,10.8,22.8,53.0,4.9,0.0,
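
The `area` column in the summary above reports 2 unique values, with `T` as the most frequent at 292 of 517 rows. A minimal sketch of the class balance this implies, using only those two figures (the variable names are illustrative):

```python
# Figures read from the describe() output: count and freq for the area column.
total_rows = 517
majority_count = 292  # rows labelled "T", the most frequent class

# With only two classes, every remaining row must be "F".
minority_count = total_rows - majority_count
majority_share = majority_count / total_rows
minority_share = minority_count / total_rows

print(f"T: {majority_count} rows ({majority_share:.1%})")
print(f"F: {minority_count} rows ({minority_share:.1%})")

# On the full DataFrame the same proportions should come from:
#   df["area"].value_counts(normalize=True)
```

The two classes are reasonably balanced, which is worth knowing before choosing an evaluation metric for a classifier.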


At first glance, the data looks fairly clean. There appear to be no missing values, but let's just double-check:

In [6]:
# Count the missing values in each column
for col in df.columns:
    num_na = df[col].isna().sum()
    print(f"The {col} column has {num_na} missing values.")

The X column has 0 missing values.
The Y column has 0 missing values.
The month column has 0 missing values.
The day column has 0 missing values.
The FFMC column has 0 missing values.
The DMC column has 0 missing values.
The DC column has 0 missing values.
The ISI column has 0 missing values.
The temp column has 0 missing values.
The RH column has 0 missing values.
The wind column has 0 missing values.
The rain column has 0 missing values.
The area column has 0 missing values.


As initially thought, there are no missing values.
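
The same check can be done without an explicit loop. The sketch below is one possible alternative, run on a small stand-in frame built from a few columns of the five `df.head()` rows shown earlier (the name `sample` is illustrative and is not the full dataset):

```python
import pandas as pd

# Stand-in data: selected columns of the five rows shown by df.head().
sample = pd.DataFrame({
    "X": [7, 7, 7, 8, 8],
    "Y": [5, 4, 4, 6, 6],
    "month": ["mar", "oct", "oct", "mar", "mar"],
    "day": ["fri", "tue", "sat", "fri", "sun"],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "area": ["F", "F", "F", "F", "F"],
})

# isna() returns a boolean frame; sum() counts the True values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# A single total is handy as a quick yes/no check.
total_missing = int(missing_per_column.sum())
print(f"Total missing values: {total_missing}")
```

Replacing `sample` with `df` should give the per-column counts for the whole file in one call.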

Now let's list the data type of each column:

In [None]:
df.dtypes

Unnamed: 0,0
X,int64
Y,int64
month,object
day,object
FFMC,float64
DMC,float64
DC,float64
ISI,float64
temp,float64
RH,int64
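
Since the columns are a mix of numeric and `object` types, it can help to split them into two lists before plotting or modelling. A minimal sketch using `select_dtypes`, again on a stand-in frame built from the `df.head()` rows (the names `sample`, `numeric_cols` and `categorical_cols` are illustrative):

```python
import pandas as pd

# Stand-in data: selected columns of the five rows shown by df.head().
sample = pd.DataFrame({
    "X": [7, 7, 7, 8, 8],
    "Y": [5, 4, 4, 6, 6],
    "month": ["mar", "oct", "oct", "mar", "mar"],
    "day": ["fri", "tue", "sat", "fri", "sun"],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "RH": [51, 33, 33, 97, 99],
    "area": ["F", "F", "F", "F", "F"],
})

# Integer and float columns count as "number"; everything else is treated
# as categorical here.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```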


# Data Table

| Variable | Type | Description |
|---|---|---|
| X | Integer | x-axis spatial coordinate within the Montesinho park map: 1 to 9 |
| Y | Integer | y-axis spatial coordinate within the Montesinho park map: 2 to 9 |
| month | Categorical | month of the year: 'jan' to 'dec' |
| day | Categorical | day of the week: 'mon' to 'sun' |
| FFMC | Continuous | Fine Fuel Moisture Code (FFMC) from the Canadian Forest Fire Weather Index (FWI) system. It represents the moisture content of litter and other cured fine fuels: 18.7 to 96.2 |
| DMC | Continuous | Duff Moisture Code from the FWI system. It indicates the moisture content of loosely compacted organic layers of moderate depth: 1.1 to 291.3 |
| DC | Continuous | Drought Code from the FWI system. It is related to the moisture content of deep, compact organic layers: 7.9 to 860.6 |
| ISI | Continuous | Initial Spread Index from the FWI system. It combines the effects of wind and the FFMC to estimate the rate of fire spread: 0.0 to 56.1 |
| temp | Continuous | temperature (degrees Celsius) recorded at noon (standard time): 2.2 to 33.3 |
| RH | Integer | relative humidity (%) recorded at noon (standard time): 15 to 100 |
| wind | Continuous | wind speed in km/h recorded at noon (standard time): 0.4 to 9.4 |
| rain | Continuous | outside rain in mm/m² recorded at noon (standard time): 0.0 to 6.4 |
| area | Categorical | whether the burned area of the forest exceeds 4%: True (T) if it does, otherwise False (F) |
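
The ranges in the table can be checked against the data rather than taken on trust. The sketch below is one way to do that: `documented_ranges` copies the bounds from the table, `out_of_range_columns` is a hypothetical helper written for this illustration, and `sample` holds only the five `df.head()` rows, so passing `df` instead is what would validate the full file.

```python
import pandas as pd

# Bounds copied from the data table above: column -> (minimum, maximum).
documented_ranges = {
    "X": (1, 9),
    "Y": (2, 9),
    "FFMC": (18.7, 96.2),
    "DMC": (1.1, 291.3),
    "DC": (7.9, 860.6),
    "ISI": (0.0, 56.1),
    "temp": (2.2, 33.3),
    "RH": (15, 100),
    "wind": (0.4, 9.4),
    "rain": (0.0, 6.4),
}

# Stand-in data: the numeric columns of the five rows shown by df.head().
sample = pd.DataFrame({
    "X": [7, 7, 7, 8, 8],
    "Y": [5, 4, 4, 6, 6],
    "FFMC": [86.2, 90.6, 90.6, 91.7, 89.3],
    "DMC": [26.2, 35.4, 43.7, 33.3, 51.3],
    "DC": [94.3, 669.1, 686.9, 77.5, 102.2],
    "ISI": [5.1, 6.7, 6.7, 9.0, 9.6],
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4],
    "RH": [51, 33, 33, 97, 99],
    "wind": [6.7, 0.9, 1.3, 4.0, 1.8],
    "rain": [0.0, 0.0, 0.0, 0.2, 0.0],
})


def out_of_range_columns(frame, ranges):
    """Return the columns whose observed min or max falls outside its documented range."""
    offenders = []
    for col, (low, high) in ranges.items():
        if frame[col].min() < low or frame[col].max() > high:
            offenders.append(col)
    return offenders


print(out_of_range_columns(sample, documented_ranges))
```

An empty list means every checked column stays within its documented bounds.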

Plot the distribution of each numeric column:

In [None]:
df.hist(figsize=(15, 20))
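
`df.hist` only draws the numeric columns, so `month`, `day` and `area` are left out. A rough sketch of how the categorical distributions could be tabulated instead, shown for `month` on the five `df.head()` rows (the names `sample`, `month_order` and `month_counts` are illustrative):

```python
import pandas as pd

# Stand-in data: the month column of the five rows shown by df.head().
sample = pd.DataFrame({"month": ["mar", "oct", "oct", "mar", "mar"]})

# Calendar order, so the counts are not sorted by frequency or alphabetically.
month_order = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]

# Count each month, then reindex so absent months appear with a count of 0.
month_counts = sample["month"].value_counts().reindex(month_order, fill_value=0)
print(month_counts)
```

Calling `month_counts.plot(kind="bar")` would turn these counts into a bar chart, and the same pattern should apply to `day` and `area` on the full DataFrame.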