# Algerian Forest Fire Project

This project involves extensive **EDA**, **feature engineering**, and training of multiple regression models:

1. **Simple Linear Regression**
2. **Multiple Linear Regression**
3. **Ridge, Lasso, and Elastic Net Regression**

We will use the **Algerian Forest Fires dataset** consisting of data from two regions: **Bejaia** and **Sidi Bel Abbes**.

---

## 1. Dataset Overview

* **Instances**: 244
* **Regions**: Bejaia (NE Algeria), Sidi Bel Abbes (NW Algeria)
* **Period**: June 2012 – September 2012
* **Attributes**: 11 features + 1 class output
* **Targets**:

  * **Classification**: `class` (Fire / Non-Fire)
  * **Regression**: `FWI` as dependent variable

**Columns Include**:

| Column      | Description                         |
| ----------- | ----------------------------------- |
| Date        | Observation date (day, month, year) |
| Temperature | Temperature (°C)                    |
| RH          | Relative humidity (%)               |
| WS          | Wind speed (km/h)                   |
| Rain        | Total daily rainfall (mm)           |
| FFMC        | Fine Fuel Moisture Code             |
| DMC         | Duff Moisture Code                  |
| DC          | Drought Code                        |
| ISI         | Initial Spread Index                |
| BUI         | Buildup Index                       |
| FWI         | Fire Weather Index                  |
| Class       | Fire / Non-Fire                     |

---

## 2. Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

---

## 3. Read the Dataset

```python
df = pd.read_csv('Algerian_forest_fire_data_update.csv', header=1)
df.head()
```

* **Header handling**: `header=1` tells pandas to take the column names from the second line of the file (zero-based index 1) and to discard the line above it.
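
As a rough illustration of the header handling, the sketch below reads a made-up, in-memory miniature of the file. The title line, column names, and values are assumptions for the example, not the real file's contents.

```python
import io

import pandas as pd

# Hypothetical miniature of a raw file: one title line, then the real header row.
raw_text = (
    "Bejaia Region Dataset\n"
    "day,month,year,Temperature\n"
    "1,6,2012,29\n"
    "2,6,2012,29\n"
)

# header=1 skips line 0 (the title) and uses line 1 as the column names.
demo = pd.read_csv(io.StringIO(raw_text), header=1)
print(demo.columns.tolist())
print(len(demo))
```

If a file has no title line above its header, the same argument would promote the first data row to column names instead.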

---

## 4. Initial Observations

```python
df.info()
```

* Check **data types** (objects vs numeric)
* Identify **missing values**

---

## 5. Missing Values Handling

```python
# Check missing values
df.isnull().sum()
df[df.isnull().any(axis=1)]
```

* Drop rows with null values:

```python
df = df.dropna().reset_index(drop=True)
```
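
A minimal sketch of the same inspect-then-drop pattern on a toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (made-up values).
toy = pd.DataFrame({"Temperature": [29, np.nan, 31], "RH": [57, 61, np.nan]})

# Rows containing at least one null, for inspection before dropping.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Drop them and renumber the index so it runs 0..n-1 again.
cleaned = toy.dropna().reset_index(drop=True)
print(rows_with_nulls)
print(cleaned)
```

Resetting the index matters here because the next step assigns the region label by index position.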

---

## 6. Region Column Creation

* **Bejaia**: index 0–121 → label `0`
* **Sidi Bel Abbes**: index 122–243 → label `1`

```python
df.loc[:121, 'Region'] = 0
df.loc[122:, 'Region'] = 1
df['Region'] = df['Region'].astype(int)
```
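
Note that `.loc` slices by label and includes both endpoints, which is why `:121` covers 122 rows. A small sketch on a toy frame, with a hypothetical boundary of 3 standing in for 122:

```python
import pandas as pd

toy = pd.DataFrame({"Temperature": [29, 30, 31, 32, 33, 34]})
split = 3  # hypothetical boundary; the notebook uses 122

# .loc slicing is label-based and inclusive, so this covers labels 0, 1 and 2.
toy.loc[: split - 1, "Region"] = 0
toy.loc[split:, "Region"] = 1

# The column is created as float (it briefly holds NaN), so cast back to int.
toy["Region"] = toy["Region"].astype(int)
print(toy["Region"].tolist())
```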

---

## 7. Drop Duplicate Header Rows

```python
df = df.drop([0, 128]).reset_index(drop=True)
```
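
Hard-coded indices only work if the repeated header rows sit exactly at those positions. As a hedged alternative, the sketch below finds such rows by checking whether a cell repeats its own column name. The toy frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy frame in which row 2 is a repeated header line (made-up values).
toy = pd.DataFrame(
    {
        "day": ["1", "2", "day", "1"],
        "Temperature": ["29", "30", "Temperature", "32"],
    }
)

# A row is treated as a repeated header when its 'day' cell equals the column name.
is_header_row = toy["day"].astype(str).str.strip() == "day"
header_positions = toy.index[is_header_row].tolist()

toy_clean = toy[~is_header_row].reset_index(drop=True)
print(header_positions)
print(toy_clean)
```

Printing `header_positions` first makes it easy to confirm which rows will be removed before dropping anything.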

---

## 8. Fix Column Names

```python
df.columns = df.columns.str.strip()
```

* Removes extra spaces in column names
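
A quick way to see the effect, using hypothetical column names with stray spaces:

```python
import pandas as pd

# Hypothetical column names with leading/trailing spaces.
toy = pd.DataFrame(columns=[" RH", "Rain ", "Classes  "])
print(toy.columns.tolist())  # spaces are visible in the list representation

toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```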

---

## 9. Convert Data Types

### Integer Columns

```python
int_cols = ['Month', 'Day', 'Year', 'Temperature', 'RH', 'WS']
df[int_cols] = df[int_cols].astype(int)
```

### Float Columns

```python
# Convert all object-type columns except 'Class' to float
for col in [c for c in df.columns if df[c].dtype == 'O' and c != 'Class']:
    df[col] = df[col].astype(float)
```
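
`astype(float)` raises on the first value it cannot parse. One way to find such values beforehand is `pd.to_numeric` with `errors="coerce"`, sketched here on made-up strings (the column names are borrowed from the dataset, the values are not):

```python
import pandas as pd

# Toy object-typed columns containing one unparseable entry each (made-up values).
toy = pd.DataFrame(
    {
        "DC": ["7.6", "14.6 9", "22.2"],
        "FWI": ["0.5", "fire   ", "0.1"],
    }
)

bad_values = {}
for col in toy.columns:
    converted = pd.to_numeric(toy[col], errors="coerce")  # unparseable -> NaN
    bad = toy.loc[converted.isna() & toy[col].notna(), col]
    if not bad.empty:
        bad_values[col] = bad.tolist()

print(bad_values)
```

An empty dictionary would suggest the object columns are safe to cast.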

---

## 10. Verify Data Types and Cleaned Data

```python
df.info()
df.head()
df.describe()
```

* **Region**: int
* **Numerical features**: float/int
* **Class**: categorical object

---

## ✅ Summary of Cleaning & EDA Steps

1. Combined datasets from **two regions**
2. Handled **missing values**
3. Created a **Region label column**
4. Removed **duplicate header rows**
5. Fixed **column names**
6. Converted **columns to appropriate data types**
7. Dataset is now ready for **advanced EDA and feature engineering**

---

### Next Steps

1. **Advanced EDA**:

   * Correlation analysis
   * Distribution plots
   * Outlier detection
2. **Feature Engineering**:

   * One-hot encoding for categorical variables
   * Scaling and normalization
3. **Model Training**:

   * Simple / Multiple Linear Regression
   * Ridge, Lasso, Elastic Net Regression
4. **Hyperparameter Tuning** and **Validation**

---

**Note:**
For regression, use `FWI` as the dependent variable. For classification, use `Class`.

---

You can now visualize the dataset using Seaborn plots:

```python
sns.pairplot(df, hue='Class')
plt.show()
```

---

This cleaned and structured dataset is ready for **model training and feature engineering**.


```python
# ==============================
# Algerian Forest Fire Dataset EDA & Cleaning
# ==============================

# 1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Read Dataset
# Note: header=1 takes the column names from the file's second line and discards the first.
# That suits a raw file with a title line above the header; for an already-cleaned CSV
# the default header=0 is probably what is wanted.
df = pd.read_csv("C:\\Users\\dell\\Downloads\\Ridge+Lassso+Elastic+Regression+Practicals\\Ridge Lassso Elastic Regression Practicals\\Algerian_forest_fires_cleaned_dataset.csv", header=1)

# 3. Initial Dataset Info
print(df.info())
print(df.head())

# 4. Check Missing Values
print(df.isnull().sum())
print(df[df.isnull().any(axis=1)])

# 5. Drop rows with missing values and reset index
df = df.dropna().reset_index(drop=True)

# 6. Add Region Column
# Bejaia: 0–121 -> 0, Sidi Bel Abbes: 122–end -> 1
df.loc[:121, 'Region'] = 0
df.loc[122:, 'Region'] = 1
df['Region'] = df['Region'].astype(int)

# 7. Drop duplicate header rows if any
# (Example indices from explanation: 0, 128)
df = df.drop([0, 128], errors='ignore').reset_index(drop=True)

# 8. Fix Column Names (remove extra spaces)
df.columns = df.columns.str.strip()

# 9. Convert Data Types
# Integer columns
int_cols = ['Month', 'Day', 'Year', 'Temperature', 'RH', 'WS']
for col in int_cols:
    if col in df.columns:
        df[col] = df[col].astype(int)

# Float columns: all object types except 'Class'
for col in [c for c in df.columns if df[c].dtype == 'O' and c != 'Class']:
    df[col] = df[col].astype(float)

# 10. Verify Data Types and Dataset
print(df.info())
print(df.describe())
print(df.head())

# 11. Basic EDA Plots (Optional)
# Pairplot by Class
sns.pairplot(df, hue='Class')
plt.show()

# Correlation Heatmap
# numeric_only=True skips the text 'Class' column, which pandas >= 2.0 would otherwise reject
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```


```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   1            242 non-null    int64  
 1   6            242 non-null    int64  
 2   2012         242 non-null    int64  
 3   29           242 non-null    int64  
 4   57           242 non-null    int64  
 5   18           242 non-null    int64  
 6   0.0          242 non-null    float64
 7   65.7         242 non-null    float64
 8   3.4          242 non-null    float64
 9   7.6          242 non-null    float64
 10  1.3          242 non-null    float64
 11  3.4.1        242 non-null    float64
 12  0.5          242 non-null    float64
 13  not fire     242 non-null    object 
 14  0            242 non-null    int64  
dtypes: float64(7), int64(7), object(1)
memory usage: 28.5+ KB
None
   1  6  2012  29  57  18   0.0  65.7  3.4   7.6  1.3  3.4.1  0.5  \
0  2  6  2012  29  61  13   1.3  64.4  4.1  

ValueError: could not convert string to float: 'not fire   '
```
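
One plausible reading of this output: the column names reported by `df.info()` (`1`, `6`, `2012`, ..., `not fire`) look like data values, which suggests that `header=1` promoted a data row of the already-cleaned CSV to the header. The class column is then no longer called `Class`, so the float-conversion loop does not skip it and fails on the text `'not fire   '`. The sketch below reproduces the effect on a made-up miniature of a cleaned file. Its column names and values are assumptions, not the real file's contents.

```python
import io

import pandas as pd

# Hypothetical miniature of an already-cleaned CSV: the header is on the first line.
cleaned_text = (
    "day,month,year,Temperature,Classes\n"
    "1,6,2012,29,not fire   \n"
    "2,6,2012,29,not fire   \n"
    "3,6,2012,26,fire   \n"
)

# header=1 discards the real header and promotes the first data row instead.
wrong = pd.read_csv(io.StringIO(cleaned_text), header=1)

# The default (header=0) keeps the real column names.
right = pd.read_csv(io.StringIO(cleaned_text))

print(wrong.columns.tolist())
print(right.columns.tolist())
```

Under this reading, reading the cleaned file with the default header would likely avoid the error, though the class column name would still need to match the one excluded in the conversion loop.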

# Algerian Forest Fire Dataset Analysis

## 1. Initial Data Cleaning

We performed the following initial cleaning steps:

1. Merged datasets.
2. Checked and renamed columns.
3. Changed data types where necessary.

### Save the Cleaned Dataset

```python
# Save cleaned dataset as CSV
df.to_csv("Algerian_Cleaned_Dataset.csv", index=False)
```

* `index=False` ensures that the index column is not saved in the CSV.
* This cleaned CSV can be used for further analysis.
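
To see what `index=False` changes, the following sketch writes a toy frame to a string instead of a file (calling `to_csv` without a path returns the CSV text). The values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Temperature": [29, 30], "FWI": [0.5, 0.4]})

with_index = toy.to_csv()               # first column holds the row index
without_index = toy.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Without `index=False`, reading the file back would produce an extra unnamed column.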

## 2. Preparing the Dataset for EDA

### Create a Copy

```python
# Create a copy of the dataframe
df_copy = df.copy()
```

### Drop Unnecessary Columns

We remove `day`, `month`, and `year` since they are not required for predicting FWI (Fire Weather Index).

```python
df_copy = df_copy.drop(columns=['day', 'month', 'year'])
```

## 3. Encoding Categorical Feature

The `classes` column contains the categories `fire` and `not fire`. Some entries have extra spaces.

### Convert to Numeric (0/1)

```python
import numpy as np

# Encode 'classes': 'not fire' -> 0, 'fire' -> 1
df_copy['classes'] = np.where(df_copy['classes'].str.contains("not fire"), 0, 1)

# Check the encoding
df_copy['classes'].value_counts()
```

* Any value containing "not fire", with or without extra spaces, is mapped to 0; every other value is mapped to 1.
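
A small sketch of how this encoding treats whitespace variants, on made-up labels. It also shows a stricter alternative (strip, then map), which is an addition for comparison rather than part of the original notebook. Unlike `np.where`, it leaves unexpected labels as `NaN` instead of silently mapping them to 1.

```python
import numpy as np
import pandas as pd

# Made-up labels with the kind of stray spaces described above.
labels = pd.Series(["not fire   ", "fire   ", "fire", "not fire", "fire "])

# Approach used in the text: substring match, everything else becomes 1.
encoded = np.where(labels.str.contains("not fire"), 0, 1)

# Stricter alternative: normalise whitespace, then map known labels only.
strict = labels.str.strip().map({"not fire": 0, "fire": 1})

print(encoded.tolist())
print(strict.tolist())
```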

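## 4. Exploratory Data Analysis (EDA)

### 4.1 Histograms of All Features

```python
import matplotlib.pyplot as plt
import seaborn as sns

# plt.style.use('seaborn') was deprecated in Matplotlib 3.6 and later removed;
# sns.set_theme() applies seaborn's default styling instead
sns.set_theme()
df_copy.hist(bins=50, figsize=(20, 15))
plt.show()
```

* Shows the distribution of each numerical feature.
* Reveals skewness: whether a feature's distribution is left-skewed or right-skewed.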
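
Skewness can also be quantified rather than read off the histograms. A minimal sketch on made-up values (not taken from the dataset): `DataFrame.skew()` returns a positive number for a right-skewed column and a negative one for a left-skewed column.

```python
import pandas as pd

# Made-up values: 'Rain' has a long right tail, 'RH' a long left tail.
toy = pd.DataFrame(
    {
        "Rain": [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 16.8],
        "RH": [90, 88, 87, 85, 86, 89, 40],
    }
)

skewness = toy.skew()
print(skewness)
```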

### 4.2 Pie Chart for Class Distribution

```python
# sort_index() orders the shares as class 0, then class 1, so they line up with the labels
class_counts = df_copy['classes'].value_counts(normalize=True).sort_index() * 100
labels = ['Not Fire', 'Fire']

plt.figure(figsize=(8, 8))
plt.pie(class_counts, labels=labels, autopct='%.1f%%')
plt.title("Class Distribution (Fire vs Not Fire)")
plt.show()
```

* Fire: ~56.4%
* Not Fire: ~43.6%

### 4.3 Correlation Analysis

#### Compute Correlation Matrix

```python
correlation_matrix = df_copy.corr()
correlation_matrix
```

* Diagonal elements = 1 (self-correlation)
* Positive correlation: values close to +1
* Negative correlation: values close to -1

#### Heatmap Visualization

```python
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()
```

* Helps identify multicollinearity between features.
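
Beyond reading the heatmap by eye, strongly correlated feature pairs can be listed programmatically. The sketch below uses made-up values and a hypothetical threshold of 0.85. Both are assumptions for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values: 'DMC' and 'BUI' move together, 'RH' does not.
toy = pd.DataFrame(
    {
        "DMC": [1.0, 2.0, 3.0, 4.0, 5.0],
        "BUI": [1.1, 2.0, 3.2, 3.9, 5.1],
        "RH": [60.0, 45.0, 70.0, 50.0, 65.0],
    }
)

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.85  # hypothetical cut-off
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

Pairs reported this way are candidates for dropping one of the two features before fitting a linear model.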

### 4.4 Box Plot for Dependent Feature (FWI)

```python
plt.figure(figsize=(10, 6))
sns.boxplot(y=df_copy['FWI'], color='green')
plt.title("Boxplot for FWI")
plt.show()
```

* Shows median, quartiles, and outliers.
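
The points a box plot draws as outliers follow, by default, the 1.5 × IQR rule. A minimal sketch of that rule on made-up FWI-like values (not taken from the dataset):

```python
import pandas as pd

# Made-up values with one unusually large entry.
fwi = pd.Series([0.5, 0.4, 0.1, 2.5, 7.2, 7.1, 0.3, 0.9, 5.6, 31.1])

q1, q3 = fwi.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fwi[(fwi < lower) | (fwi > upper)]
print(outliers.tolist())
```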

### 4.5 Monthly Fire Analysis by Region

#### Bejaia Region

```python
# 'month' was dropped from df_copy in section 2, so plot from the original df.
# Region code 0 = Bejaia, following the labelling used when the region column was created.
bejaia = df[df['region'] == 0]
plt.figure(figsize=(13, 6))
sns.countplot(x='month', hue='classes', data=bejaia)
plt.xlabel("Month", fontweight='bold')
plt.ylabel("Number of Fires", fontweight='bold')
plt.title("Monthly Fire Analysis - Bejaia Region", fontweight='bold')
plt.show()
```

* Maximum fires occurred in August.
* Fires mostly occurred in June, July, and August.
* September had fewer fires.

#### Sidi Bel Abbes Region

```python
# Region code 1 = Sidi Bel Abbes
sidi_bel_abbes = df[df['region'] == 1]
plt.figure(figsize=(13, 6))
sns.countplot(x='month', hue='classes', data=sidi_bel_abbes)
plt.xlabel("Month", fontweight='bold')
plt.ylabel("Number of Fires", fontweight='bold')
plt.title("Monthly Fire Analysis - Sidi Bel Abbes Region", fontweight='bold')
plt.show()
```

* Maximum fires also occurred in August.

**Observation:** Across regions, most fires happen in the summer months (June to August), peaking in August.
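
The same monthly comparison can be tabulated instead of plotted. The sketch below counts fire days per region and month on a toy frame. The rows are invented, and the lowercase column names and the 0/1 encoding of `classes` are assumed to match the earlier steps.

```python
import pandas as pd

# Made-up observations: classes 1 = fire, 0 = not fire.
toy = pd.DataFrame(
    {
        "month": [6, 6, 7, 7, 8, 8, 8, 9],
        "region": [0, 1, 0, 1, 0, 0, 1, 1],
        "classes": [0, 1, 1, 1, 1, 1, 1, 0],
    }
)

# Keep fire days only, then count them per (region, month).
fires_per_month = (
    toy[toy["classes"] == 1]
    .groupby(["region", "month"])
    .size()
    .unstack(fill_value=0)
)
print(fires_per_month)
```

Each row of the resulting table is a region and each column a month, so the peak month per region can be read directly.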

## 5. Observations from EDA

* Most forest fires occurred during June, July, and August, peaking in August.
* The dependent feature FWI is highly correlated with several numerical features, such as ISI and BUI.
* Class distribution shows slightly more fires than non-fires.
* Outliers are present but not excessive.
* The correlation heatmap helps identify features with high multicollinearity.
* Monthly fire analysis shows seasonal fire trends across both regions.