# Overall Observation

The exploratory data analysis shows that all 14 columns are read as categorical (object dtype) and that 13 of them contain at least one missing value. In the Feature Engineering portion, I will convert the categorical columns to numerical data and handle missing values, outliers, etc.

# 1. Importing Libraries

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### Dataset Location:
https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++#

#### Data Set Information:

The dataset includes `244` instances that combine data from two regions of Algeria, namely the Bejaia
region located in the northeast of Algeria and the Sidi Bel-abbes region located in the northwest of Algeria.

`122` instances for each region.

The period covered is `June 2012 to September 2012`.
The dataset includes `11 attributes and 1 output attribute (class)`.
The 244 instances have been classified into 'fire' (138 instances) and 'not fire' (106 instances) classes.


#### Attribute Information:

1. Date : (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)
* `Weather data observations`
2. Temp : temperature noon (temperature max) in Celsius degrees: 22 to 42
3. RH : Relative Humidity in %: 21 to 90
4. Ws : Wind speed in km/h: 6 to 29
5. Rain: total day in mm: 0 to 16.8
* `FWI Components`
6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
8. Drought Code (DC) index from the FWI system: 7 to 220.4
9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
10. Buildup Index (BUI) index from the FWI system: 1.1 to 68
11. Fire Weather Index (FWI) Index: 0 to 31.1
12. Classes: two classes, namely "Fire" and "not Fire"

In [2]:
ls

Algerian_forest_fires_dataset_UPDATE.csv
EDA.csv
EDA.ipynb


# 2. Importing Dataset

In [3]:
data=pd.read_csv("Algerian_forest_fires_dataset_UPDATE.csv")
df=data.copy()
df.head(5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Bejaia Region Dataset
day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
01,06,2012,29,57,18,0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1,3.9,0.4,not fire
03,06,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
04,06,2012,25,89,13,2.5,28.6,1.3,6.9,0,1.7,0,not fire


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 247 entries, ('day', 'month', 'year', 'Temperature', ' RH', ' Ws', 'Rain ', 'FFMC', 'DMC', 'DC', 'ISI', 'BUI', 'FWI') to ('30', '09', '2012', '24', '64', '15', '0.2', '67.3', '3.8', '16.5', '1.2', '4.8', '0.5')
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Bejaia Region Dataset   245 non-null    object
dtypes: object(1)
memory usage: 49.3+ KB


`Observation:` The DataFrame has a MultiIndex. Next, I am going to convert the MultiIndex to a single index using the pandas library.
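
The MultiIndex most likely arises because the first line of the file holds only the title `Bejaia Region Dataset`, so pandas treats that single field as the only column name and pushes the 13 leading fields of every later row into the index. As a minimal sketch of one alternative, assuming the real header sits on the second line, `header=1` can be passed to `pd.read_csv`. The inline `raw` string below is a made-up stand-in for the first lines of the file, not the file itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for the top of the CSV: a one-field title line,
# then the real header, then two data rows.
raw = (
    "Bejaia Region Dataset \n"
    "day,month,year,Temperature, RH, Ws,Rain ,FFMC,DMC,DC,ISI,BUI,FWI,Classes  \n"
    "01,06,2012,29,57,18,0,65.7,3.4,7.6,1.3,3.4,0.5,not fire   \n"
    "02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1,3.9,0.4,not fire   \n"
)

# header=1 takes the second line as column names and discards the title line.
sample = pd.read_csv(io.StringIO(raw), header=1)

print(type(sample.index).__name__)
print(sample.columns.str.strip().tolist())
```

This only addresses the title line at the top; any title or header lines repeated further down the file would still need to be removed afterwards.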

# 3. Converting to a RangeIndex

In [5]:
#convert the MultiIndex into a default RangeIndex
df=df.reset_index()
df.head(5)

Unnamed: 0,level_0,level_1,level_2,level_3,level_4,level_5,level_6,level_7,level_8,level_9,level_10,level_11,level_12,Bejaia Region Dataset
0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
1,01,06,2012,29,57,18,0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
2,02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1,3.9,0.4,not fire
3,03,06,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
4,04,06,2012,25,89,13,2.5,28.6,1.3,6.9,0,1.7,0,not fire


In [6]:
new_header=df.iloc[0]
df=df[1:]
df.columns=new_header
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
1,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
2,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
3,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
4,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
5,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire


## 3.1 Fixing the spacing issue in column names

In [7]:
#some column names contain stray leading or trailing spaces; inspect them before fixing
df.columns

Index(['day', 'month', 'year', 'Temperature', ' RH', ' Ws', 'Rain ', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes  '],
      dtype='object', name=0)

In [8]:
df.columns=df.columns.str.strip()
df.columns

Index(['day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes'],
      dtype='object', name=0)

## 3.2 Fixing the spacing issue at the row level

In [9]:
df['Classes'].unique()

array(['not fire   ', 'fire   ', 'fire', 'fire ', 'not fire', 'not fire ',
       nan, 'Classes  ', 'not fire     ', 'not fire    '], dtype=object)

In [10]:
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [11]:
df['Classes'].unique()

array(['not fire', 'fire', nan, 'Classes'], dtype=object)
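
The remaining `'Classes'` value suggests that a copy of the header row is still embedded among the data rows. A minimal sketch of filtering it out, shown on a small hypothetical frame (`demo`) rather than on the real `df`:

```python
import pandas as pd

# Hypothetical miniature of the situation: the middle row repeats the header.
demo = pd.DataFrame(
    {
        "day": ["1", "day", "2"],
        "Classes": ["not fire", "Classes", "fire"],
    },
    dtype=object,
)

# Flag rows whose Classes value is the literal column name, then drop them.
is_header_row = demo["Classes"] == "Classes"
no_header = demo[~is_header_row].reset_index(drop=True)

print(no_header)
```

Rows where `Classes` is `nan` are not touched by this filter, because a comparison with `nan` evaluates to `False`; they would need separate handling.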

In [12]:
df.head(5)

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
1,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
2,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
3,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
4,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
5,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire


# 4. Exploratory Data Analysis

## 4.1 Numerical Features

In [13]:
#show the dtype and non-null count of every column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246 entries, 1 to 246
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          246 non-null    object
 1   month        245 non-null    object
 2   year         245 non-null    object
 3   Temperature  245 non-null    object
 4   RH           245 non-null    object
 5   Ws           245 non-null    object
 6   Rain         245 non-null    object
 7   FFMC         245 non-null    object
 8   DMC          245 non-null    object
 9   DC           245 non-null    object
 10  ISI          245 non-null    object
 11  BUI          245 non-null    object
 12  FWI          245 non-null    object
 13  Classes      244 non-null    object
dtypes: object(14)
memory usage: 27.0+ KB


In [14]:
df['Classes'].dtype!='O'

False

## list of the numerical variables
[feature for feature in df.columns if df[feature].dtype!='O']

In [15]:
numerical_features=[feature for feature in df.columns if df[feature].dtype!='O']#show only numerical features
print(len(numerical_features))
#df[numerical_features].head()

0


`Observations:` This clearly shows that there are no numerical features: every column has dtype `object`.
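
Since the measurement columns hold digits stored as strings, one possible way to turn them into numbers later is `pd.to_numeric` with `errors="coerce"`, which replaces anything unparseable with `NaN`. The sketch below is only an illustration on a hypothetical three-row frame; the choice of columns in `numeric_cols` is an assumption.

```python
import pandas as pd

# Hypothetical sample: the last row mimics a stray header row inside the data.
demo = pd.DataFrame(
    {
        "Temperature": ["29", "26", "Temperature"],
        "Rain": ["0", "13.1", "Rain"],
        "Classes": ["not fire", "fire", "Classes"],
    },
    dtype=object,
)

numeric_cols = ["Temperature", "Rain"]  # assumed list of measurement columns

converted = demo.copy()
for col in numeric_cols:
    # Unparseable strings (such as the stray header text) become NaN.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted)
```

After such a conversion, the `dtype != 'O'` check used above would pick up the converted columns as numerical features.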

## 4.2 Categorical Features

In [16]:
categorical_features=[feature for feature in df.columns if df[feature].dtype=='O']
print(len(categorical_features))
df[categorical_features].head()

14


Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
1,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
2,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
3,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
4,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
5,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire


`Observations`: This clearly shows that all 14 features in this dataset are categorical.

In [17]:
for feature in categorical_features:
    print("The feature name is {} and the number of categories are {}".format(feature,len(df[feature].unique())))
    

The feature name is day and the number of categories are 33
The feature name is month and the number of categories are 6
The feature name is year and the number of categories are 3
The feature name is Temperature and the number of categories are 21
The feature name is RH and the number of categories are 64
The feature name is Ws and the number of categories are 20
The feature name is Rain and the number of categories are 41
The feature name is FFMC and the number of categories are 175
The feature name is DMC and the number of categories are 168
The feature name is DC and the number of categories are 200
The feature name is ISI and the number of categories are 108
The feature name is BUI and the number of categories are 176
The feature name is FWI and the number of categories are 128
The feature name is Classes and the number of categories are 4
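
Note that `len(df[feature].unique())` counts `nan` as a category, which is why `Classes` reports 4 here (matching the array `['not fire', 'fire', nan, 'Classes']` printed earlier). A minimal sketch of the difference from `nunique()`, on a hypothetical Series with the same four values:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with the same distinct values as df['Classes'].
col = pd.Series(["not fire", "fire", np.nan, "Classes"], dtype=object)

count_with_nan = len(col.unique())   # unique() keeps NaN as its own entry
count_without_nan = col.nunique()    # nunique() drops NaN by default

print(count_with_nan, count_without_nan)
```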


## 4.3 Missing Values

In [18]:
df.isnull().sum()

0
day            0
month          1
year           1
Temperature    1
RH             1
Ws             1
Rain           1
FFMC           1
DMC            1
DC             1
ISI            1
BUI            1
FWI            1
Classes        2
dtype: int64

In [19]:
features_nan=[features for features in df.columns if df[features].isnull().sum()>=1]
print(len(features_nan))

13


`Observations`: 13 of the 14 features in this dataset contain missing values.
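
The per-column counts do not say which rows are affected. A minimal sketch for listing those rows, using a small hypothetical frame (the `<region title>` string is a placeholder, not a value taken from the file):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle row has a value only in the first column.
demo = pd.DataFrame(
    {
        "day": ["1", "<region title>", "2"],
        "month": ["6", np.nan, "6"],
        "Classes": ["not fire", np.nan, "fire"],
    },
    dtype=object,
)

# Boolean mask that is True for rows with at least one missing value.
has_missing = demo.isnull().any(axis=1)
rows_with_missing = demo[has_missing]

print(rows_with_missing)
print(demo.isnull().sum())
```

Applied to the real `df`, the same mask would show whether the missing values are concentrated in one or two rows, which is what the near-uniform count of 1 per column hints at.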

## 4.4 Outliers

## 4.5 Cleaning Data

In [20]:
df.to_csv("EDA.csv")