## Summative Lab: Forest Fires Prevention

### Step 1: Load the Dataset

*   Install and import the ucimlrepo library.
*   Load the Forest Fires dataset:
    *   Predictors: Features from `forest_fires.data.features`.
    *   Target: `forest_fires.data.targets`.

In [None]:
# Run pip install if necessary to access the UCI ML Repository (uncomment the next line)
# ! pip install ucimlrepo

In [7]:
# Imports and data loading
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

forest_fires = fetch_ucirepo(id=162)
X = forest_fires.data.features
y = forest_fires.data.targets


# Display dataset structure
print(X.info())
print(X.describe())
print(X.tail())
print(y.tail())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X       517 non-null    int64  
 1   Y       517 non-null    int64  
 2   month   517 non-null    object 
 3   day     517 non-null    object 
 4   FFMC    517 non-null    float64
 5   DMC     517 non-null    float64
 6   DC      517 non-null    float64
 7   ISI     517 non-null    float64
 8   temp    517 non-null    float64
 9   RH      517 non-null    int64  
 10  wind    517 non-null    float64
 11  rain    517 non-null    float64
dtypes: float64(7), int64(3), object(2)
memory usage: 48.6+ KB
None
                X           Y        FFMC         DMC          DC         ISI  \
count  517.000000  517.000000  517.000000  517.000000  517.000000  517.000000   
mean     4.669246    4.299807   90.644681  110.872340  547.940039    9.021663   
std      2.313778    1.229900    5.520111   64.046482  248.066192 

#### Explanation of Variables

Dataset information comes from the UCI website; the index definitions come from Wikipedia.

- FFMC: Fine Fuel Moisture Code. Measures the moisture level in the smaller, surface materials that are often the first fuels to burn in a fire.
- DMC: Duff Moisture Code. Measures moisture levels in the organic soil just below the surface, as well as moisture in medium-sized woody material, such as small logs.
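- DC: Drought Code. Measures moisture in the deep, compact organic soil layers and reflects longer-term drought effects.
- ISI: Initial Spread Index. Estimates the expected rate of fire spread by combining wind speed with FFMC.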
- temp: temperature in Celsius
- RH: relative humidity in %
- wind: wind speed in km/h
- rain: outside rain in mm/m<sup>2</sup>
- area: burned area of the forest in hectares. The distribution is heavily skewed towards 0.0, so it may make sense to model it with a logarithmic transform.
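Because the target contains values of exactly 0.0, a plain logarithm is undefined for those rows; `log(1 + area)` is one common workaround. The sketch below uses made-up area values (an assumption for illustration, not rows from the dataset) to show the difference.

```python
import numpy as np

# Hypothetical burned areas in hectares, including zeros as in the real target
area = np.array([0.0, 0.0, 0.5, 6.4, 1090.8])

# Plain log: -inf wherever area == 0 (warning suppressed for the demo)
with np.errstate(divide="ignore"):
    plain_log = np.log(area)

log1p_area = np.log1p(area)       # log(1 + area), finite for every area >= 0
recovered = np.expm1(log1p_area)  # inverse transform, back to hectares

print(plain_log)
print(log1p_area)
```

Predictions made on the `log1p` scale can be converted back to hectares with `np.expm1`.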

### Step 2: EDA

* Examine the dataset structure and summary statistics.
* Analyze correlations between predictors and the target variable.
* Plot scatterplots for key predictors vs. the target.
* Generate a residual plot to check for randomness in residuals.
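One possible way to approach the correlation bullet is `DataFrame.corrwith`, comparing each predictor against both the raw and the log-transformed target. The data here is synthetic (the `X_demo` and `area_demo` names and the generated values are assumptions, only the column names mirror the dataset), so the printed numbers say nothing about the real fires.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a few predictors (column names mirror the dataset)
X_demo = pd.DataFrame({
    "temp": rng.normal(19, 6, n),
    "RH": rng.normal(44, 16, n),
    "wind": rng.normal(4, 1.8, n),
})

# Synthetic right-skewed target with many zeros, loosely tied to temp
latent = 0.05 * X_demo["temp"] + rng.normal(0, 1, n)
area_demo = pd.Series(np.expm1(np.clip(latent, 0, None)), name="area")
log_area_demo = np.log1p(area_demo)

# Pearson correlation of each predictor with the raw and transformed target
corr_table = pd.DataFrame({
    "corr_with_area": X_demo.corrwith(area_demo),
    "corr_with_log_area": X_demo.corrwith(log_area_demo),
})
print(corr_table.round(3))
```

With the real data, the same call would take `X.select_dtypes("number")` and the target column in place of the synthetic frames.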

In [None]:
# Convert the month and day strings in X to integer codes
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
day_map = {'sun': 1, 'mon': 2, 'tue': 3, 'wed': 4, 'thu': 5, 'fri': 6, 'sat': 7}
num_months = X['month'].map(month_map)
num_days = X['day'].map(day_map)
num_data = pd.concat([num_months, num_days], axis=1)
X_transition = X.drop(['month', 'day'], axis=1)
X_num = pd.concat([X_transition, num_data], axis=1)
X_num.head()

# Initialize and fit a multivariate regression model
multi_reg = LinearRegression().fit(X_num, y)

# Show the intercept and coefficients
print("Intercepts:", multi_reg.intercept_)
print("Coefficients:", multi_reg.coef_)


Intercepts: [-16.1538261]
Coefficients: [[ 1.90018693  0.32407576 -0.11271241  0.09664007 -0.03148937 -0.73053094
   0.95456009 -0.1757589   1.23212806 -3.19577372  1.45017849  0.66347348
   1.45017849  0.66347348]]


In [28]:
Y_pred = multi_reg.predict(X_num)

# R-squared of the multivariate model on the data it was fitted to
r2 = r2_score(y, Y_pred)
print(r2)

0.025350671349257503


# log(1 + area): the target contains zeros, so a plain log would give -inf
log_area = np.log1p(y)

# Fit one simple OLS model per predictor and report the slope's p-value
predictors = ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']
univariate_models = {}
for name in predictors:
    model = sm.OLS(y, sm.add_constant(X[name])).fit()
    univariate_models[name] = model
    print(f"P-value when fitting {name} is {model.pvalues.loc[name]:.4f}")


In [None]:
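# Log-transformed target for plotting; log1p handles the zero-area rows.
# Kept as a separate Series so the target is not added to the predictors in X.
log_y = np.log1p(y).squeeze()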

# scatterplot: FFMC vs. area burned / log area burned
fig, axes = plt.subplots(1, 2, figsize=(15,4))  # separate y-axes: log and raw area are on different scales
axes[0].scatter(X['FFMC'], log_y)
axes[0].set_title('Scatterplot: Transformed Fire Area vs Surface Moisture')
axes[0].set_xlabel('Surface Moisture Level (FFMC)')
axes[0].set_ylabel('Log Forest Fire Area (hectares)')
axes[0].grid(True)

axes[1].scatter(X['FFMC'], y)
axes[1].set_title('Scatterplot: Fire Area vs Surface Moisture')
axes[1].set_xlabel('Surface Moisture Level (FFMC)')
axes[1].set_ylabel('Forest Fire Area (hectares)')
axes[1].grid(True)
plt.show()

# scatterplot: DMC vs. area burned / log area burned
fig, axes = plt.subplots(1, 2, figsize=(15,4))  # separate y-axes: log and raw area are on different scales
axes[0].scatter(X['DMC'], log_y)
axes[0].set_title('Scatterplot: Transformed Fire Area vs Soil Moisture')
axes[0].set_xlabel('Soil Moisture Level (DMC)')
axes[0].set_ylabel('Log Forest Fire Area (hectares)')
axes[0].grid(True)

axes[1].scatter(X['DMC'], y)
axes[1].set_title('Scatterplot: Fire Area vs Soil Moisture')
axes[1].set_xlabel('Soil Moisture Level (DMC)')
axes[1].set_ylabel('Forest Fire Area (hectares)')
axes[1].grid(True)
plt.show()

# scatterplot: DC vs. area burned / log area burned
fig, axes = plt.subplots(1, 2, figsize=(15,4))  # separate y-axes: log and raw area are on different scales
axes[0].scatter(X['DC'], log_y)
axes[0].set_title('Scatterplot: Transformed Fire Area vs Drought Level')
axes[0].set_xlabel('Drought Level (DC)')
axes[0].set_ylabel('Log Forest Fire Area (hectares)')
axes[0].grid(True)

axes[1].scatter(X['DC'], y)
axes[1].set_title('Scatterplot: Fire Area vs Drought Level')
axes[1].set_xlabel('Drought Level (DC)')
axes[1].set_ylabel('Forest Fire Area (hectares)')
axes[1].grid(True)
plt.show()

# scatterplot (seaborn): DC vs. log area burned
fig, ax = plt.subplots(figsize=(6,4))
sns.scatterplot(x=X['DC'], y=log_y, ax=ax)
plt.title('Scatterplot: Transformed Fire Area vs Drought Level')
plt.xlabel('Drought Level (DC)')
plt.ylabel('Log Forest Fire Area (hectares)')
plt.grid(True)
plt.show()

# scatterplot: ISI vs. log area burned
fig, ax = plt.subplots(figsize=(6,4))
sns.scatterplot(x=X['ISI'], y=log_y, ax=ax)
plt.title('Scatterplot: Transformed Fire Area vs Initial Spread')
plt.xlabel('Initial Spread (ISI)')
plt.ylabel('Log Forest Fire Area (hectares)')
plt.grid(True)
plt.show()

# scatterplot: temp vs. log area burned
fig, ax = plt.subplots(figsize=(6,4))
sns.scatterplot(x=X['temp'], y=log_y, ax=ax)
plt.title('Scatterplot: Transformed Fire Area vs Temperature')
plt.xlabel('Temperature (C)')
plt.ylabel('Log Forest Fire Area (hectares)')
plt.grid(True)
plt.show()

### Step 3: Fit the regression models

* Fit a baseline multiple linear regression model with key predictors.
* Include nonlinear terms (e.g., quadratic transformations for significant predictors).
* Add interaction terms (e.g., between predictors with strong correlations).
* Incorporate indicator variables if categorical variables are present.
* Apply transformations (e.g., logarithmic transformations for skewed predictors).
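The bullets above can be sketched with the statsmodels formula interface, which expresses quadratic terms, interactions, and indicator variables directly in the model string. This is a minimal illustration on synthetic data: the `demo` frame, the choice of `temp` for the quadratic term, and the `temp:RH` interaction are assumptions for the example, not findings about which terms matter in the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in data; column names mirror the dataset
demo = pd.DataFrame({
    "temp": rng.normal(19, 6, n),
    "RH": rng.uniform(15, 100, n),
    "wind": rng.uniform(0.5, 9, n),
    "month": rng.choice(["jul", "aug", "sep"], n),
})
demo["log_area"] = np.log1p(rng.exponential(5, n))  # synthetic log-transformed target

# Baseline: main effects only
baseline = smf.ols("log_area ~ temp + RH + wind", data=demo).fit()

# Extended: quadratic term, interaction term, and month as indicator variables
extended = smf.ols(
    "log_area ~ temp + I(temp**2) + RH + wind + temp:RH + C(month)",
    data=demo,
).fit()

print(baseline.params.index.tolist())
print(extended.params.index.tolist())
```

`C(month)` expands the categorical column into indicator columns with one level dropped as the reference, which avoids imposing an artificial numeric order on months.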

### Step 4: Evaluate model diagnostics

* Compare models using metrics like $R^2$, adjusted $R^2$, AIC, and BIC.
* Plot residuals and create Q-Q plots to assess normality.
* Identify influential observations using Cook's Distance.

### Step 5: Apply regularization

* Use Ridge (L2) and Lasso (L1) regression from sklearn to handle multicollinearity.
* Extract coefficients and calculate Mean Squared Error (MSE).
* Compare the performance of Ridge and Lasso models.
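A minimal sketch of the Ridge and Lasso comparison follows, using synthetic data with two nearly collinear columns. The `alpha` values and the 75/25 split are arbitrary assumptions; in practice they would be tuned, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
base = rng.normal(size=n)

# Two nearly collinear columns plus one independent column (synthetic)
X_demo = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])
y_demo = 2.0 * base + rng.normal(scale=1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

results = {}
for name, est in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=0.1))]:
    # Scale inside the pipeline so the penalty treats all features equally
    pipe = make_pipeline(StandardScaler(), est).fit(X_train, y_train)
    mse = mean_squared_error(y_test, pipe.predict(X_test))
    results[name] = (pipe[-1].coef_, mse)
    print(name, np.round(pipe[-1].coef_, 3), round(mse, 3))
```

Ridge tends to spread weight across correlated columns, while Lasso can shrink some coefficients to exactly zero, which is why the extracted coefficients are worth comparing alongside the MSE.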

### Step 6: Prepare the data for binary classification

* Create a binary target variable based on a threshold in y (e.g., median or other percentile).
* Select relevant predictors and scale them using StandardScaler.
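One way to set up the classification data is sketched below, using the median of the target as the threshold. The frames are synthetic stand-ins, and `large_fire` is a hypothetical name for the new binary target.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 200

# Synthetic predictors and a zero-heavy, right-skewed area column
X_demo = pd.DataFrame({
    "temp": rng.normal(19, 6, n),
    "RH": rng.uniform(15, 100, n),
    "wind": rng.uniform(0.5, 9, n),
})
area_demo = pd.Series(
    np.where(rng.random(n) < 0.45, 0.0, rng.exponential(10, n)), name="area"
)

# Binary target: 1 if the burned area exceeds the median, else 0
threshold = area_demo.median()
large_fire = (area_demo > threshold).astype(int)

# Standardize predictors to zero mean and unit variance
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_demo), columns=X_demo.columns, index=X_demo.index
)

print("threshold:", round(threshold, 3), "share of positives:", large_fire.mean())
```

If the data is later split into training and test sets, fitting the scaler on the training portion only avoids leaking test-set statistics into the model.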

### Step 7: Train and evaluate a logistic regression model

Train a logistic regression model using the scaled predictors.

* Display coefficients and the intercept.
* Predict probabilities and binary outcomes.
* Evaluate performance using accuracy, confusion matrix, precision, recall, and F1-score.
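The training and evaluation steps above might look like the following sketch. It runs on synthetic data, and the default `LogisticRegression` settings and the 0.5 probability cutoff are assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 400

# Synthetic predictors and a binary outcome driven by the first two columns
X_demo = rng.normal(size=(n, 3))
logits = 1.2 * X_demo[:, 0] - 0.8 * X_demo[:, 1]
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Fit the scaler on the training split only, then train the classifier
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)

# Predicted probabilities for class 1 and the resulting binary outcomes
proba = clf.predict_proba(scaler.transform(X_test))[:, 1]
pred = (proba >= 0.5).astype(int)

cm = confusion_matrix(y_test, pred)
print(cm)
print("accuracy:", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```

Because the predictors are standardized, the coefficient magnitudes are roughly comparable across features.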

### Step 8: Check assumptions

* Use Variance Inflation Factor (VIF) to assess multicollinearity among predictors.

### Step 9: Summative Findings

* Compare regression models and classification results.
* Highlight trade-offs between model simplicity, performance, and interpretability.
* Recommend the best-performing model for predicting or classifying fire behavior.

[Type your findings here.]