Here we can see the head of our joined dataset, showing a glimpse of how the data from all three classes is combined into a unified dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.utils import resample
import seaborn as sns
import matplotlib.pyplot as plt


df1 = pd.read_csv('Data_Class_1.csv')
df2 = pd.read_csv('Data_Class_2.csv')
df3 = pd.read_csv('Data_Class_3.csv')

df = pd.concat([df1, df2, df3], axis=0)

df.reset_index(drop=True, inplace=True)

print(df.head())


Here we can see how many rows and columns it has.

In [None]:
df.shape

Here we can see descriptive statistics for the dataset.

In [None]:
print(df.describe())


Here we can see the data types, the number of missing values, and the number of unique values in each column.

In [None]:
data_info = pd.DataFrame({
    'Data Type': df.dtypes,
    'Missing Values': df.isnull().sum(),
    'Unique Values': df.nunique()
})

data_info

Plot histograms for all numerical columns, to better understand their distributions.

Here we can see the distribution of the numerical variables using histograms. Each plot shows how the data is spread for features such as Altitude, Slope Orientation, Slope, and more. The density curves help indicate the shape of these distributions. For instance, Altitude and Slope Orientation appear approximately normal, while variables like Vertical Distance to Water are skewed.



In [None]:

sns.set(style="whitegrid")

# List of numerical columns to plot
numerical_columns = [
    'Altitude', 'Slope_Orientation', 'Slope', 
    'Horizontal_Distance_To_Water', 'Vertical_Distance_To_Water', 
    'Horizontal_Distance_To_Roadways', 'Shadow_Index_9h', 
    'Shadow_Index_12h', 'Shadow_Index_15h', 
    'Horizontal_Distance_To_Fire_Points', 'Canopy_Density', 
    'Rainfall_Summer', 'Rainfall_Winter', 'Wind_Exposure_Level'
]

# Set up the subplots, adjusting number of rows and columns to fit all features
num_plots = len(numerical_columns)
cols = 3
rows = num_plots // cols + (num_plots % cols > 0)

fig, axes = plt.subplots(rows, cols, figsize=(18, 12))
fig.suptitle('Distribution of Numerical Variables', fontsize=16)

# Plot histograms for each numerical feature
for i, col in enumerate(numerical_columns):
    row = i // cols
    col_idx = i % cols
    sns.histplot(df[col], kde=True, bins=20, ax=axes[row, col_idx])
    axes[row, col_idx].set_title(f'{col} Distribution')

# Hide any unused subplots
for i in range(num_plots, rows * cols):
    fig.delaxes(axes.flat[i])

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


Graphs for bivariate analysis: scatter plots between the numerical variables and the target variable, to observe any trends or patterns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for the visualizations
sns.set(style="whitegrid")

# Plot bar plots for categorical variables
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution of Categorical Variables')

# Plot for Soil_Type
sns.countplot(data=df, x='Soil_Type', hue='Soil_Type', ax=axes[0], palette='viridis', legend=False)
axes[0].set_title('Soil Type Distribution')
axes[0].tick_params(axis='x', rotation=90)  

# Plot for Wilderness_Area
sns.countplot(data=df, x='Wilderness_Area', hue='Wilderness_Area', ax=axes[1], palette='coolwarm', legend=False)
axes[1].set_title('Wilderness Area Distribution')

# Plot for Vegetation_Type
sns.countplot(data=df, x='Vegetation_Type', hue='Vegetation_Type', ax=axes[2], palette='Set2', legend=False)
axes[2].set_title('Vegetation Type Distribution')

# Adjust layout to prevent overlap
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


# General Observations:
Altitude: The vegetation types appear to be well-separated by altitude. Type_1 is at a higher altitude, while Type_3 is at a lower altitude, with Type_2 in between. This suggests altitude may be a strong feature for distinguishing between the types.

Slope_Orientation: All three types show overlap in terms of slope orientation, so it doesn't seem to differentiate vegetation types strongly.

Slope: There is no significant distinction between the vegetation types based on slope alone, as all seem to occupy similar ranges.

Horizontal and Vertical Distance to Water: These variables show some degree of separation, especially for Type_3, which tends to have smaller horizontal distances to water. Type_1 and Type_2 overlap more but still show some separation.

Shadow Index (9h, 12h, 15h): There’s a fair amount of overlap in the shadow indices among the vegetation types, meaning these variables may not be significant in distinguishing between them.

Horizontal Distance to Roadways: This feature appears to be quite distinct, especially for Type_1, which has a wider range and larger distances from roadways compared to Type_2 and Type_3.

Horizontal Distance to Fire Points: This variable has some separation between vegetation types, with Type_1 having much higher distances to fire points than Type_2 and Type_3, which cluster lower on this axis.

Canopy Density: All vegetation types appear to have similar canopy densities, making it difficult to differentiate between them based on this feature.

Rainfall (Summer and Winter): Rainfall in both seasons is very similar across vegetation types, with almost complete overlap and little to no variation between them.

Wind Exposure Level: There is minimal distinction among the vegetation types based on wind exposure, as all appear to have similar values.

## Key Insights:
Altitude and horizontal distance to roadways/fire points appear to be strong variables for separating vegetation types, particularly Type_1.
Some features, like slope orientation, shadow indices, and canopy density, show a lot of overlap, suggesting they might not be as important in classification.
Horizontal distance to water also provides some separability for Type_3, which might help in classification.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for the visualizations
sns.set(style="whitegrid")

# List of numerical variables to plot against Vegetation_Type
numerical_columns = [
    'Altitude', 'Slope_Orientation', 'Slope', 
    'Horizontal_Distance_To_Water', 'Vertical_Distance_To_Water', 
    'Horizontal_Distance_To_Roadways', 'Shadow_Index_9h', 
    'Shadow_Index_12h', 'Shadow_Index_15h', 
    'Horizontal_Distance_To_Fire_Points', 'Canopy_Density', 
    'Rainfall_Summer', 'Rainfall_Winter', 'Wind_Exposure_Level'
]

# Set up the subplots grid
num_plots = len(numerical_columns)
cols = 3  # Number of columns
rows = num_plots // cols + (num_plots % cols > 0)  # Number of rows

fig, axes = plt.subplots(rows, cols, figsize=(18, 4 * rows))
fig.suptitle('Scatter Plots of Numerical Variables vs Vegetation Type', fontsize=16)

# Plot each numerical variable vs Vegetation_Type
for i, col in enumerate(numerical_columns):
    row = i // cols
    col_idx = i % cols
    # Use stripplot for jitter effect or scatterplot directly
    sns.stripplot(x='Vegetation_Type', y=col, data=df, ax=axes[row, col_idx], jitter=True, palette='Set2', hue='Vegetation_Type', alpha=0.6, legend=False)
    axes[row, col_idx].set_title(f'{col} vs Vegetation Type')
    axes[row, col_idx].tick_params(axis='x', rotation=45)  # Rotate x labels for readability

# Hide any empty subplots (if any)
for i in range(num_plots, rows * cols):
    fig.delaxes(axes.flat[i])

# Adjust layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
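The visual separation described in the observations above can also be summarized numerically. The following is a minimal sketch on synthetic data (the `toy` frame and its values are made up for illustration, not taken from the real dataset). It compares the spread of the per-class medians with the overall standard deviation of each feature; a larger ratio suggests a feature separates the classes better.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are invented for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Vegetation_Type": np.repeat(["Type_1", "Type_2", "Type_3"], 100),
    "Altitude": np.concatenate([
        rng.normal(3200, 100, 100),
        rng.normal(2900, 100, 100),
        rng.normal(2400, 100, 100),
    ]),
    "Canopy_Density": rng.normal(50, 10, 300),
})
features = ["Altitude", "Canopy_Density"]

# Per-class median of each feature.
medians = toy.groupby("Vegetation_Type")[features].median()
print(medians.round(1))

# Range of the class medians relative to the overall standard deviation.
separation = (medians.max() - medians.min()) / toy[features].std()
print(separation.round(2))
```

On the real data, `features` could be replaced with the `numerical_columns` list used for the plots.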


# Key Observations from the Correlation Heatmap:
Altitude: Strong negative correlation with Vegetation_Type_Encoded (-0.85), showing altitude plays a major role in distinguishing vegetation types.

Horizontal_Distance_To_Roadways: Moderate negative correlation (-0.44), indicating distance from roadways helps differentiate vegetation types.

Slope: Positive correlation (0.37), suggesting steeper slopes are more common in certain vegetation types.

Horizontal_Distance_To_Fire_Points: Moderate negative correlation (-0.35), showing vegetation type is influenced by distance from fire points.

Shadow Indices: Little correlation with Vegetation_Type_Encoded, implying minimal impact on vegetation classification.

## Summary:
Altitude, slope, and distances to roadways/fire points are key variables for distinguishing vegetation types.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Convert Vegetation_Type to numerical labels using LabelEncoder
label_encoder = LabelEncoder()
df['Vegetation_Type_Encoded'] = label_encoder.fit_transform(df['Vegetation_Type'])

# List of numerical columns to include in the correlation heatmap
numerical_columns = [
    'Altitude', 'Slope_Orientation', 'Slope', 
    'Horizontal_Distance_To_Water', 'Vertical_Distance_To_Water', 
    'Horizontal_Distance_To_Roadways', 'Shadow_Index_9h', 
    'Shadow_Index_12h', 'Shadow_Index_15h', 
    'Horizontal_Distance_To_Fire_Points', 'Canopy_Density', 
    'Rainfall_Summer', 'Rainfall_Winter', 'Wind_Exposure_Level',
    'Vegetation_Type_Encoded'  # Include encoded Vegetation_Type
]

# Compute the correlation matrix
corr_matrix = df[numerical_columns].corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()
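One point to keep in mind when reading correlations with `Vegetation_Type_Encoded`: `LabelEncoder` assigns integers in sorted label order, and Pearson correlation only captures a monotonic trend along that order. The sketch below uses a hypothetical feature (synthetic, not from the dataset) whose mean is highest for the middle class. Its correlation with the encoded label is near zero even though the feature separates the classes well, while a one-way ANOVA F-statistic (`f_classif`, an alternative not used in the original notebook) does not depend on how the classes are numbered.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
labels = np.repeat(["Type_1", "Type_2", "Type_3"], 200)

# Hypothetical feature: the middle class has a much higher mean than the others.
feature = np.concatenate([
    rng.normal(0, 1, 200),
    rng.normal(5, 1, 200),
    rng.normal(0, 1, 200),
])

# Same integer codes LabelEncoder would assign (sorted label order).
encoded = pd.Series(labels).map({"Type_1": 0, "Type_2": 1, "Type_3": 2}).to_numpy()
pearson_r = np.corrcoef(feature, encoded)[0, 1]

# ANOVA F-statistic between the feature and the class labels.
f_stat, p_value = f_classif(feature.reshape(-1, 1), labels)

print(f"Pearson r with encoded label: {pearson_r:.2f}")
print(f"ANOVA F-statistic: {f_stat[0]:.1f}")
```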


Box plots for comparing categorical variables.

In the graphs, we observe the relationship between Soil Type and Vegetation Type. The first two vegetation types (Type 1 and Type 2) show similar distributions for the various soil types. However, Vegetation Type 3 exhibits a distinct soil type distribution, indicating that it occurs in areas with different soil characteristics.

For the Wilderness Area vs Vegetation Type plot, we see that Vegetation Types 1 and 2 share similar wilderness areas (Areas 1, 2, and 3). In contrast, Vegetation Type 3 appears to be predominantly associated with Area 4, suggesting a different wilderness distribution compared to the other vegetation types.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for the visualizations
sns.set(style="whitegrid")

# Plot box plots for categorical variables vs Vegetation_Type
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
fig.suptitle('Box Plots of Categorical Variables vs Vegetation Type', fontsize=16)

# Box plot for Soil_Type vs Vegetation_Type
sns.boxplot(x='Vegetation_Type', y='Soil_Type', data=df, ax=axes[0], palette='Set2', hue='Vegetation_Type', legend=False)
axes[0].set_title('Soil Type vs Vegetation Type')
axes[0].set_xlabel('Vegetation Type')
axes[0].set_ylabel('Soil Type')

# Box plot for Wilderness_Area vs Vegetation_Type
sns.boxplot(x='Vegetation_Type', y='Wilderness_Area', data=df, ax=axes[1], palette='Set2', hue='Vegetation_Type', legend=False)
axes[1].set_title('Wilderness Area vs Vegetation Type')
axes[1].set_xlabel('Vegetation Type')
axes[1].set_ylabel('Wilderness Area')

# No extra legend calls are needed: legend=False in the boxplot calls above
# already suppresses the automatic legends

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


# Tryout LR all variables
The model demonstrates good predictive ability with an accuracy of up to 84.6%. However, performance is weaker for Type_2 vegetation, which suggests room for improvement in capturing this class. Further efforts can be made to balance the model's performance across all vegetation types.

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load your dataset (adjust the file path as needed)
# Assuming df is already loaded with the Vegetation_Type and Vegetation_Type_Encoded

# Step 3: Prepare the data
# Define numerical features
numerical_features = [ 
    'Altitude', 'Slope_Orientation', 'Slope', 
    'Horizontal_Distance_To_Water', 'Vertical_Distance_To_Water', 
    'Horizontal_Distance_To_Roadways', 'Shadow_Index_9h', 
    'Shadow_Index_12h', 'Shadow_Index_15h', 
    'Horizontal_Distance_To_Fire_Points', 'Canopy_Density'
    ,'Rainfall_Summer', 'Rainfall_Winter', 'Wind_Exposure_Level'
]

# Define categorical features
categorical_features = ['Soil_Type', 'Wilderness_Area']  # Add your categorical columns here

# Separate target variable (Vegetation_Type) and features
X = df[numerical_features + categorical_features]  # Include both numerical and categorical
y = df['Vegetation_Type']

# Convert categorical variables into dummy variables
X = pd.get_dummies(X, columns=categorical_features, drop_first=True)

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Feature Scaling (min-max scaling to the [0, 1] range)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 6: Apply Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Step 7: Make predictions on the test data
y_pred = logreg.predict(X_test)

# Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Step 9: Visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


The model is robust overall, achieving an accuracy of 84.6%. The main area for improvement lies in distinguishing Type_2, which shows lower recall compared to the other classes. I kept only the features that gave me the best results.

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load your dataset (adjust the file path as needed)
# Assuming df is already loaded with the Vegetation_Type and Vegetation_Type_Encoded

# Step 3: Prepare the data
# Separate target variable (Vegetation_Type) and features
y = df['Vegetation_Type']

# Define numerical and categorical features
numerical_features = ['Altitude', 'Slope_Orientation', 'Horizontal_Distance_To_Roadways',
                      'Shadow_Index_9h', 'Shadow_Index_12h']

categorical_features = ['Soil_Type', 'Wilderness_Area']  # Include your categorical columns

# Convert categorical variables into dummy variables
X_categorical = pd.get_dummies(df[categorical_features], drop_first=True)

# Keep numerical variables as is
X_numerical = df[numerical_features]

# Combine numerical and dummy categorical variables
X = pd.concat([X_numerical, X_categorical], axis=1)

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Step 5: Feature Scaling (min-max scaling to the [0, 1] range)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 6: Apply Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Step 7: Make predictions on the test data
y_pred = logreg.predict(X_test)

# Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Step 9: Visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()
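The feature subset above was chosen by trying combinations and keeping the best one. A minimal sketch of how such a comparison could be made more systematic is shown below, using cross-validated accuracy instead of a single split. The data comes from `make_classification` and the column names `f0` to `f5` are placeholders, so the numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data; with shuffle=False the informative columns come first.
X_arr, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Candidate feature subsets to compare (placeholders for real column lists).
candidate_subsets = {
    "all features": list(X_demo.columns),
    "first three": ["f0", "f1", "f2"],
    "last three": ["f3", "f4", "f5"],
}

subset_scores = {}
for name, cols in candidate_subsets.items():
    # Scaling sits inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X_demo[cols], y_demo, cv=5)
    subset_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```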


# LDA Tryout

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

y = df['Vegetation_Type']

# Define numerical and categorical features
numerical_features = ['Altitude', 'Slope_Orientation', 'Horizontal_Distance_To_Roadways',
                      'Shadow_Index_9h', 'Shadow_Index_12h']

categorical_features = ['Soil_Type', 'Wilderness_Area']  # Include your categorical columns

# Step 3: Convert categorical variables into dummy variables
# It's important to include only the selected features after converting to dummy
X_categorical = pd.get_dummies(df[categorical_features], drop_first=True)

# Keep numerical variables as is
X_numerical = df[numerical_features]

# Combine numerical and dummy categorical variables
X = pd.concat([X_numerical, X_categorical], axis=1)

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Feature Scaling (min-max scaling to the [0, 1] range)
scaler = MinMaxScaler()  # Create a MinMaxScaler object
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)  # Only transform the test data

# Step 6: Apply Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Step 7: Make predictions on the test data
y_pred = lda.predict(X_test)

# Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Step 9: Visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


# QDA Tryout (fitting raises a warning)

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis  # Import QDA instead of LDA
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df is already defined and loaded with your dataset
y = df['Vegetation_Type']

# Define numerical and categorical features
numerical_features = ['Altitude', 'Slope_Orientation', 'Horizontal_Distance_To_Roadways',
                      'Shadow_Index_9h', 'Shadow_Index_12h']

categorical_features = ['Soil_Type', 'Wilderness_Area']  # Include your categorical columns

# Step 3: Convert categorical variables into dummy variables
# It's important to include only the selected features after converting to dummy
X_categorical = pd.get_dummies(df[categorical_features], drop_first=True)

# Keep numerical variables as is
X_numerical = df[numerical_features]

# Combine numerical and dummy categorical variables
X = pd.concat([X_numerical, X_categorical], axis=1)

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Feature Scaling (min-max scaling to the [0, 1] range)
scaler = MinMaxScaler()  # Create a MinMaxScaler object
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)  # Only transform the test data

# Step 6: Apply Quadratic Discriminant Analysis
qda = QuadraticDiscriminantAnalysis()  # Create a QDA object
qda.fit(X_train, y_train)  # Fit the QDA model to the training data

# Step 7: Make predictions on the test data
y_pred = qda.predict(X_test)

# Step 8: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Step 9: Visualize the confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.show()


## Preparing data for models

In [42]:
def preprocess_data(df, target, numerical_features, categorical_features):
    # Combine numerical and categorical features
    selected_features = numerical_features + categorical_features
    
    # Separate features and target variable
    X = df[selected_features]
    y = df[target]
    
    # Convert categorical variables into dummy variables
    X = pd.get_dummies(X, drop_first=True)
    
    # Scale numerical features
    scaler = MinMaxScaler()
    X[numerical_features] = scaler.fit_transform(X[numerical_features])

    return X, y
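Note that `preprocess_data` fits the scaler on the full dataset before any split, so the held-out rows influence the scaling used for training. A possible alternative, sketched here on synthetic data, is to put the preprocessing inside a scikit-learn `Pipeline` so that it is refit on the training portion of every split. The `toy` frame, its values, and the rule generating its target are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the real dataset.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Altitude": rng.normal(2800, 300, n),
    "Soil_Type": rng.choice(["A", "B", "C"], n),
})
toy["Vegetation_Type"] = np.where(toy["Altitude"] > 2800, "Type_1", "Type_3")

# Scale numerical columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["Altitude"]),
    ("cat", OneHotEncoder(drop="first"), ["Soil_Type"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline on each training fold.
scores = cross_val_score(
    pipe, toy[["Altitude", "Soil_Type"]], toy["Vegetation_Type"], cv=5
)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```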

## Holdout validation

In [43]:
def holdout_validation(model, X, y, test_size=0.2):
    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    
    # Fit the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"Holdout Validation Accuracy: {accuracy}")
    print("Confusion Matrix:")
    print(conf_matrix)
    print("Classification Report:")
    print(class_report)
    
    # Plot confusion matrix
    # sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
    # plt.title('Confusion Matrix (Holdout Validation)')
    # plt.ylabel('Actual Values')
    # plt.xlabel('Predicted Values')
    # plt.show()

## Cross validation

In [44]:
def k_fold_cross_validation(model, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf)
    print(f"K-Fold Cross Validation (k={k}) Accuracy Scores: {scores}")
    print(f"Mean Accuracy: {np.mean(scores)}")
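Plain `KFold` does not look at the labels when splitting, so class proportions can differ from fold to fold. If the classes turn out to be imbalanced, `StratifiedKFold` is a common alternative that keeps the proportions roughly equal across folds. The sketch below uses made-up labels only to show the effect on the split itself.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # Feature values do not affect the split.

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count minority-class samples in each test fold.
minority_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)
]
print(minority_per_fold)
```

An instance like `skf` can be passed as the `cv` argument of `cross_val_score` in place of the `KFold` object.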

## LOOCV Cross validation

In [45]:
def loocv_cross_validation(model, X, y, max_samples=None):
    loo = LeaveOneOut()
    
    # Use max_samples to limit the number of LOOCV evaluations;
    # a seeded generator keeps the random subsample reproducible
    if max_samples is not None and max_samples < len(X):
        indices = np.random.default_rng(42).choice(len(X), max_samples, replace=False)
        X = X.iloc[indices]
        y = y.iloc[indices]

    accuracies = []
    
    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)

    print(f"LOOCV Accuracy: {np.mean(accuracies):.4f}")

## Bootstrap evaluation

In [46]:
def bootstrap_evaluation(model, X, y, n_iterations=1000):
    n_size = int(len(X) * 0.9)  # Bootstrap sample size: 90% of the row count, drawn with replacement
    scores = []

    for i in range(n_iterations):
        # Generate a bootstrap sample
        X_train, y_train = resample(X, y, n_samples=n_size, random_state=i)
        
        # Fit the model on the bootstrap sample
        model.fit(X_train, y_train)
        
        # Test on the out-of-bag rows, i.e. those never drawn into the bootstrap
        # sample (roughly 40% of the data for a 90%-sized sample, not 10%)
        X_test, y_test = X.drop(X_train.index), y.drop(y_train.index)
        y_pred = model.predict(X_test)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        scores.append(accuracy)

    print(f"Bootstrap Mean Accuracy: {np.mean(scores)}")
    print(f"Bootstrap Standard Deviation: {np.std(scores)}")
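Because `resample` draws with replacement, the out-of-bag set is larger than the 10% one might expect from a 90%-sized sample. The probability that a given row is never drawn in `n_size` draws from `n` rows is `(1 - 1/n) ** n_size`, which approaches `exp(-0.9)`, about 0.41. The short check below confirms this on a plain index array; `n_rows` is an arbitrary size chosen for illustration.

```python
import numpy as np
from sklearn.utils import resample

n_rows = 10_000                # Arbitrary dataset size for the demonstration.
indices = np.arange(n_rows)
n_size = int(n_rows * 0.9)     # Same 90% sample size as bootstrap_evaluation.

# Draw the bootstrap sample (resample uses replacement by default).
sampled = resample(indices, n_samples=n_size, random_state=0)

# Out-of-bag rows are those that were never drawn.
oob = np.setdiff1d(indices, sampled)
oob_fraction = len(oob) / n_rows

# Probability that a given row is missed in every one of the n_size draws.
expected = (1 - 1 / n_rows) ** n_size

print(f"Observed OOB fraction: {oob_fraction:.3f}")
print(f"Expected OOB fraction: {expected:.3f}")
```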

## Pipeline for running resampling methods on the models

In [47]:
def run_cross_validation_pipeline(df, model, target='Vegetation_Type', use_loocv=False, use_bootstrap=False):
    # Define numerical and categorical features
    numerical_features = ['Altitude', 'Slope_Orientation', 'Horizontal_Distance_To_Roadways', 
                          'Shadow_Index_9h', 'Shadow_Index_12h']
    categorical_features = ['Soil_Type', 'Wilderness_Area']
    
    # Step 1: Preprocess the data
    X, y = preprocess_data(df, target, numerical_features, categorical_features)

    
    # Step 2: Perform Holdout Validation
    print("Holdout Validation Results:")
    holdout_validation(model, X, y, test_size=0.2)
    
    # Step 3: Perform K-Fold Cross Validation
    print("\nK-Fold Cross Validation Results (k=5):")
    k_fold_cross_validation(model, X, y, k=5)
    
    print("\nK-Fold Cross Validation Results (k=10):")
    k_fold_cross_validation(model, X, y, k=10)
    
    # Step 4: Optionally Perform Leave-One-Out Cross Validation (LOOCV)
    if use_loocv:
        print("\nLOOCV Results:")
        loocv_cross_validation(model, X, y, max_samples=1000)
    
    # Step 5: Optionally Perform Bootstrap Evaluation
    if use_bootstrap:
        print("\nBootstrap Evaluation Results:")
        bootstrap_evaluation(model, X, y, n_iterations=1000)  # Limit Bootstrap iterations


### Trying the models

In [None]:
logreg_model = LogisticRegression()
lda_model = LinearDiscriminantAnalysis()
qda_model = QuadraticDiscriminantAnalysis()
# Example usage without LOOCV and Bootstrap (default)
#run_cross_validation_pipeline(df, logreg_model)

# Example usage with LOOCV and Bootstrap (optional)
run_cross_validation_pipeline(df, logreg_model, use_loocv=True, use_bootstrap=True)

run_cross_validation_pipeline(df, lda_model, use_loocv=True, use_bootstrap=True)

run_cross_validation_pipeline(df, qda_model, use_loocv=True, use_bootstrap=True)




# Conclusion:
Logistic Regression is the best-performing model based on both accuracy and F1 score, making it the preferred choice for this dataset.


QDA underperforms significantly, indicating that its assumptions about the feature distribution may not align well with the dataset. This may also be related to the warning raised when fitting QDA.

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compare_models(models, X_train, X_test, y_train, y_test):
    comparison_results = []
    
    for name, model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        comparison_results.append({'Model': name, 'Accuracy': accuracy, 'F1 Score': f1})

    return pd.DataFrame(comparison_results)

# Example usage
models = [('Logistic Regression', logreg_model), ('LDA', lda_model), ('QDA', qda_model)]
compare_models(models, X_train, X_test, y_train, y_test)
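If the QDA warning concerns collinear variables or a rank-deficient class covariance (an assumption, since the warning text is not shown here), one thing worth trying is the `reg_param` argument of `QuadraticDiscriminantAnalysis`, which regularizes the per-class covariance estimates. The sketch below builds synthetic data in which a dummy column is constant within one class, making that class's covariance singular, and fits a regularized QDA. The value `0.1` is an arbitrary choice for illustration, not a tuned setting.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y_demo = np.repeat(["Type_1", "Type_3"], n // 2)

# One continuous feature with class-dependent mean.
x_cont = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(3, 1, n // 2)])
# One dummy feature that is always 0 for Type_1 (zero variance within that class).
dummy = np.concatenate([np.zeros(n // 2), rng.integers(0, 2, n // 2)])
X_demo = np.column_stack([x_cont, dummy])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Regularized QDA: reg_param blends each class covariance toward the identity.
qda_reg = QuadraticDiscriminantAnalysis(reg_param=0.1)
qda_reg.fit(X_tr, y_tr)
reg_accuracy = qda_reg.score(X_te, y_te)
print(f"Regularized QDA accuracy: {reg_accuracy:.3f}")
```

On the real data, `reg_param` could be selected by cross-validation before drawing conclusions about QDA relative to the other models.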
