## Logistic Regression Analysis on Termite Dataset

This notebook uses Logistic Regression to predict termite discovery from the wood type, treatment, and initial and final weights of wood blocks. The goal is to show how Logistic Regression can be applied to an ecological dataset.


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset
df = pd.read_csv(r'C:\Users\isabe\INDE 577\INDE-577\global_termite_microbe_wd.csv')

# Filter the data for a specific country, e.g., "Australia"
country_data = df[df['country'] == 'Australia']

## Data Exploration

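Before modeling, we need to know the dataset's structure, features, and missing values, since these drive the preprocessing choices. The next cell keeps the relevant columns, drops rows with missing values, encodes the categorical features, and standardizes the weights.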
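
A quick inspection of shape, column types, missing values, and class balance might look like the sketch below. The small `sample` frame is a hypothetical stand-in that only borrows the notebook's column names; its values are invented for illustration and are not taken from the termite CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the termite data (values are made up).
sample = pd.DataFrame({
    "wood_used": ["wood_a", "wood_a", "wood_b", None, "wood_a"],
    "treatment": ["t1", "t2", "t1", "t2", "t1"],
    "initial_wt": [10.2, 9.8, np.nan, 11.0, 10.5],
    "final_wt": [8.1, 9.5, 7.7, 10.9, np.nan],
    "termite_discovery": [1, 0, 1, 0, 0],
})

print(sample.shape)    # number of rows and columns
print(sample.dtypes)   # which columns are numeric and which are categorical

# Missing values per column, to judge how many rows dropna() would discard
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Class balance of the target, which matters when reading accuracy later
class_counts = sample["termite_discovery"].value_counts()
print(class_counts)
```

Running the same four checks on `country_data` before calling `dropna()` would show how much data the preprocessing step removes.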


In [5]:
# Selecting relevant columns and handling missing values
# Assume 'wood_used', 'treatment', 'initial_wt', 'final_wt' are relevant features
relevant_columns = ['wood_used', 'treatment', 'initial_wt', 'final_wt', 'termite_discovery']
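# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned below
country_data = country_data[relevant_columns].dropna().copy()

# Encoding categorical data (one encoder per column so each mapping stays available)
wood_encoder = LabelEncoder()
treatment_encoder = LabelEncoder()
country_data['wood_used'] = wood_encoder.fit_transform(country_data['wood_used'])
country_data['treatment'] = treatment_encoder.fit_transform(country_data['treatment'])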

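# Standardizing numerical features (zero mean, unit variance).
# Note: the scaler is fit on all rows before the train/test split,
# so test-set statistics influence the scaling.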
scaler = StandardScaler()
country_data[['initial_wt', 'final_wt']] = scaler.fit_transform(country_data[['initial_wt', 'final_wt']])
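
One way to keep test-set statistics out of the preprocessing is to split first and let a scikit-learn `Pipeline` fit the encoder and scaler on the training rows only. The sketch below is an alternative to the cell above, not the notebook's method: it swaps `LabelEncoder` for `OneHotEncoder` (so the categories are not treated as ordered numbers) and runs on synthetic data, since the real CSV is not reproduced here. The `toy` frame, its category names, and its labelling rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with the notebook's column names (values are invented).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "wood_used": rng.choice(["wood_a", "wood_b", "wood_c"], size=n),
    "treatment": rng.choice(["t1", "t2"], size=n),
    "initial_wt": rng.normal(10.0, 1.0, size=n),
    "final_wt": rng.normal(8.0, 1.5, size=n),
})
# Arbitrary rule used only to give the toy data a learnable target.
toy["termite_discovery"] = (toy["final_wt"] < 8.0).astype(int)

X_toy = toy.drop(columns="termite_discovery")
y_toy = toy["termite_discovery"]

# Split BEFORE fitting any preprocessing; stratify keeps class proportions similar.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["wood_used", "treatment"]),
    ("numeric", StandardScaler(), ["initial_wt", "final_wt"]),
])
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])

# fit() learns encoder categories and scaler statistics from the training rows only.
clf.fit(X_tr, y_tr)
toy_accuracy = clf.score(X_te, y_te)
print("Toy accuracy:", toy_accuracy)
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation utilities without refitting the scaler by hand.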


## Implementing Logistic Regression

Logistic Regression models the probability of a binary outcome as a sigmoid function of a linear combination of the features. Here it predicts whether termites have discovered a wood block from its wood type, treatment, and standardized initial and final weights.
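
To make the prediction rule concrete, here is a minimal NumPy sketch of how a fitted logistic regression turns features into a class label. The weights, bias, and inputs are hypothetical numbers chosen for illustration; they are not the coefficients learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ weights + bias)

# Hypothetical parameters and inputs (two features per sample).
X_demo = np.array([[0.0, 0.0],
                   [1.0, -2.0]])
weights = np.array([0.5, -1.0])
bias = 0.0

probs = predict_proba(X_demo, weights, bias)

# Default decision rule: predict the positive class when the probability is at least 0.5.
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```

The set of points where the linear score `X @ weights + bias` equals zero is the decision boundary, which is why it appears as a straight line in a two-feature plot.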


In [6]:
# Splitting the data into features and labels
X = country_data.drop('termite_discovery', axis=1)
y = country_data['termite_discovery']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_mat = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_mat)


Accuracy: 0.9097472924187726
Confusion Matrix:
 [[234   3]
 [ 22  18]]


In [8]:
# Function to plot decision boundaries
def plot_decision_boundary(X, y, model, features, resolution=0.02):
    # Setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = plt.cm.RdBu
    
    # Define bounds of the plot
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    
    # Create a mesh grid
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    
    # Predict class for each point in the mesh grid
    Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    
    # Plot the decision boundary
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    
    # Plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=f'Class {cl}',
                    edgecolor='black')

# Plotting library used by plot_decision_boundary
import matplotlib.pyplot as plt

# Extract features for visualization (e.g., initial_wt and final_wt)
X_vis = X_train[['initial_wt', 'final_wt']].values  # Ensure this uses the normalized data
y_vis = y_train.values

# The full model was trained on four features, so it cannot score a
# two-column grid. Fit a separate two-feature model for visualization only.
vis_model = LogisticRegression()
vis_model.fit(X_vis, y_vis)

# Call the function
plot_decision_boundary(X_vis, y_vis, vis_model, ['initial_wt', 'final_wt'])

plt.xlabel('Initial Weight (normalized)')
plt.ylabel('Final Weight (normalized)')
plt.legend(loc='upper left')
plt.title('Logistic Regression Decision Boundary (two-feature model)')
plt.show()

## Model Interpretation

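After training, the accuracy and confusion matrix show how well the Logistic Regression model predicts termite discovery. The test accuracy of about 0.91 looks strong, but the test set is imbalanced: 237 of the 277 test samples belong to the first class, so always predicting that class would already score about 0.86. Of the 40 samples in the second class, the model correctly identifies only 18, so accuracy alone overstates how well it detects the minority class.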
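
Per-class metrics can be read directly off the reported confusion matrix. The sketch below recomputes them with NumPy. It relies on scikit-learn's convention that rows are true classes and columns are predicted classes, and it assumes the second class is the positive "termites discovered" label, which the notebook does not state explicitly.

```python
import numpy as np

# Confusion matrix printed by the evaluation cell
# (rows: true class, columns: predicted class).
conf_mat = np.array([[234, 3],
                     [22, 18]])

# Assumption: the second class is the positive class.
tn, fp, fn, tp = conf_mat.ravel()
total = conf_mat.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted positives, how many were correct
recall = tp / (tp + fn)      # of actual positives, how many were found

# Accuracy of a trivial model that always predicts the most common true class.
majority_baseline = conf_mat.sum(axis=1).max() / total

print(f"accuracy={accuracy:.3f}")
print(f"precision={precision:.3f}")
print(f"recall={recall:.3f}")
print(f"majority_baseline={majority_baseline:.3f}")
```

Under that assumption, recall on the positive class is well below the headline accuracy, which is the gap that accuracy alone hides on imbalanced data.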
