In [11]:
import importlib
import mysklearn
importlib.reload(mysklearn)

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyNaiveBayesClassifier
from mysklearn.myclassifiers import MyDecisionTreeClassifier
from mysklearn.myclassifiers import MyRandomForestClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

# Introduction

For this project, we used a fully synthetic dataset from Kaggle with 10,000 instances and 15 attributes, most of them continuous. Because the dataset has no existing attribute suitable as a classification target, we discretized crop yield (Crop_Yield_MT_per_HA) into Low, Medium, and High labels and trained classifiers to predict them.

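Our main finding is that none of the three classifiers we built reached 50% accuracy on this three-class problem, where random guessing yields about 33.3%.
The best performing classifier was Naive Bayes at 47.27% accuracy, followed by k-NN (37.48%) and Random Forest (32.82%).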





# Data Analysis

Our dataset is mostly continuous, with 10,000 instances and 15 attributes. The "Year" attribute is an integer giving the year in which the instance was recorded. The attributes Country, Region, Crop_Type, and Adaptation_Strategy are categorical strings. The attributes Average_Temperature_C, Total_Precipitation_mm, CO2_Emissions_MT, Crop_Yield_MT_per_HA, Extreme_Weather_Events, Irrigation_Access_%, Pesticide_Use_KG_per_HA, Fertilizer_Use_KG_per_HA, Soil_Health_Index, and Economic_Impact_Million_USD are float values.

## Statistics Summary

The dataset's summary statistics reveal several important characteristics of the numeric features. Temperature ranges from -4.99°C to 35°C with a mean of 15.24°C and standard deviation of 11.47°C, indicating substantial variation across different climate zones. Precipitation shows a wide range from 200mm to nearly 3000mm annually (mean: 1612mm, std: 805mm), representing diverse conditions from semi-arid to high-rainfall regions. CO2 emissions range from 0.5 to 30 MT with a mean of 15.25 MT, while extreme weather events range from 0 to 10 occurrences with a relatively uniform distribution (mean: 5.0, median: 5.0). The target variable, crop yield, ranges from 0.45 to 5.00 MT/HA with a mean of 2.24 MT/HA and standard deviation of 1.00, showing moderate variability in agricultural productivity. Among agricultural practice variables, irrigation access shows high variability (mean: 55%, std: 26%), ranging from 10% to nearly 100%, while pesticide use (mean: 25 KG/HA, range: 0-50) and fertilizer use (mean: 50 KG/HA, range: 0-100) demonstrate diverse farming intensities. Soil health index ranges from 30 to 100 with a mean of 64.90 and standard deviation of 20.19, indicating considerable variation in soil quality across farms. The interquartile ranges (IQR) for most variables show relatively symmetric distributions around their medians, with Q1 and Q3 values roughly equidistant from Q2, suggesting that the data was generated to follow approximately uniform or normal distributions. These statistics confirm that the dataset provides comprehensive coverage across all feature ranges, which is beneficial for training robust classification models that can generalize across diverse agricultural conditions.








In [None]:
#Summary Statistics 

import csv

def load_data(filename):
    table = []
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        header = next(reader)
        for row in reader:
            table.append(row)
    return header, table

def compute_summary_stats(values):
    """Calculate summary statistics for a list of numeric values."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    
    
    min_val = sorted_vals[0]
    max_val = sorted_vals[-1]
    range_val = max_val - min_val
    mean = sum(sorted_vals) / n
    
    
    if n % 2 == 0:
        median = (sorted_vals[n//2 - 1] + sorted_vals[n//2]) / 2
    else:
        median = sorted_vals[n//2]
    
    
    q1_idx = n // 4
    q3_idx = (3 * n) // 4
    q1 = sorted_vals[q1_idx]
    q3 = sorted_vals[q3_idx]
    iqr = q3 - q1
    
    
    variance = sum((x - mean) ** 2 for x in sorted_vals) / n
    std_dev = variance ** 0.5
    
    return {
        'count': n,
        'min': min_val,
        'max': max_val,
        'range': range_val,
        'mean': mean,
        'median': median,
        'std': std_dev,
        'Q1': q1,
        'Q2': median,
        'Q3': q3,
        'IQR': iqr
    }


filename = 'climate_change_impact_on_agriculture_2024.csv'
header, table = load_data(filename)


numeric_features = [
    'Average_Temperature_C', 'Total_Precipitation_mm', 
    'CO2_Emissions_MT', 'Crop_Yield_MT_per_HA',
    'Extreme_Weather_Events', 'Irrigation_Access_%', 
    'Pesticide_Use_KG_per_HA', 'Fertilizer_Use_KG_per_HA', 
    'Soil_Health_Index'
]

print("=" * 70)
print("SUMMARY STATISTICS FOR NUMERIC FEATURES")
print("=" * 70)

for feat_name in numeric_features:
    feat_idx = header.index(feat_name)
    values = []
    for row in table:
        try:
            values.append(float(row[feat_idx]))
        except ValueError:
            # Skip cells that are empty or not numeric
            continue
    
    stats = compute_summary_stats(values)
    
    print(f"\n{feat_name}:")
    print(f"  Count:  {stats['count']}")
    print(f"  Mean:   {stats['mean']:.2f}")
    print(f"  Median: {stats['median']:.2f}")
    print(f"  Std:    {stats['std']:.2f}")
    print(f"  Min:    {stats['min']:.2f}")
    print(f"  Q1:     {stats['Q1']:.2f}")
    print(f"  Q2:     {stats['Q2']:.2f}")
    print(f"  Q3:     {stats['Q3']:.2f}")
    print(f"  Max:    {stats['max']:.2f}")
    print(f"  Range:  {stats['range']:.2f}")
    print(f"  IQR:    {stats['IQR']:.2f}")
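The cell above reads quartiles directly at index `n // 4` and `3 * n // 4` and divides the variance by `n` (the population form). As a sanity check, here is a minimal sketch that cross-checks those numbers with Python's standard `statistics` module. The helper name `cross_check_stats` is hypothetical and not part of the notebook, and the interpolating `inclusive` quantile method is an assumption about which definition is wanted.

```python
import statistics


def cross_check_stats(values):
    """Return interpolated quartiles and both standard deviation variants."""
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring sorted values.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "IQR": q3 - q1,
        "pstdev": statistics.pstdev(values),  # divides by n, like the cell above
        "stdev": statistics.stdev(values),    # divides by n - 1
    }


demo = cross_check_stats([1, 2, 3, 4, 5, 6, 7, 8])
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```

On this eight-value example the index-based approach would report Q1 = 3 and Q3 = 7, while interpolation gives 2.75 and 6.25. With 10,000 rows per feature the two definitions should differ only slightly.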


In [None]:
from IPython.display import Image, display

print("Figure 1: Target Variable Distribution")
display(Image('figures/figure1_target.png'))

print("\nFigure 2: Climate Variables Distribution")
display(Image('figures/figure2_climate.png'))

print("\nFigure 3: Agricultural Variables Distribution")
display(Image('figures/figure3_agriculture.png'))

print("\nFigure 4: Categorical Variables Distribution")
display(Image('figures/figure4_categorical.png'))

print("\nFigure 5: Box and Whisker Plots")
display(Image('figures/figure5_boxplots.png'))

print("\nFigure 6: Scatter Plots - Variable Relationships")
display(Image('figures/figure6_scatter.png'))

Figure 1: Target Variable Distributions
Figure 1 shows the distribution of our target variable, Crop Yield. The left histogram displays the continuous yield values ranging from approximately 0.86 to 5 MT/HA, with vertical lines marking the 33rd percentile (1.68 MT/HA) and 67th percentile (2.66 MT/HA) that we used as category boundaries. The right bar chart confirms our categorization resulted in a balanced three-class problem with Low (33.1%), Medium (33.9%), and High (33.1%) yields approximately equally represented. This balanced distribution is crucial for training unbiased classification models, as it ensures that no single class dominates the dataset and that our classifiers will have equal opportunity to learn patterns for all three yield categories.
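The percentile cut points and class shares described above can be checked with a short sketch. It mirrors the index-based thresholding used in `prepare_data` later in this notebook, but the function name `label_by_percentile` and the toy yield values are illustrative assumptions, not data from the project.

```python
from collections import Counter


def label_by_percentile(yields, low_frac=0.33, high_frac=0.67):
    """Label each yield Low/Medium/High using two percentile cut points."""
    ordered = sorted(yields)
    low_cut = ordered[int(len(ordered) * low_frac)]
    high_cut = ordered[int(len(ordered) * high_frac)]
    labels = []
    for value in yields:
        if value < low_cut:
            labels.append("Low")
        elif value < high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels, (low_cut, high_cut)


# Toy values for illustration only
toy_yields = [0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0, 3.6, 4.5]
toy_labels, cuts = label_by_percentile(toy_yields)
counts = Counter(toy_labels)
shares = {name: counts[name] / len(toy_labels) for name in ("Low", "Medium", "High")}
print("Cut points:", cuts)
print("Class shares:", shares)
```

Because the cut points are taken from the data itself, the three classes come out close to one third each, which is the balance Figure 1 reports.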

Figure 2: Climate Variables Distribution
Figure 2 presents the distributions of four key climate variables in our dataset. The temperature distribution shows values ranging from approximately -5°C to 35°C with a relatively uniform spread, though there is a noticeable gap in data around 10-20°C which may indicate either missing data or regional climate patterns where certain temperature ranges are less common in agricultural areas. Precipitation displays an even distribution from 0 to 3000mm annually, suggesting the dataset captures diverse climatic conditions from arid to high-rainfall regions. CO2 emissions show a uniform distribution between 0 and 30 MT, indicating consistent representation across different emission levels. Extreme weather events are also uniformly distributed from 0 to 6 events, with all frequency levels equally represented. The uniformity of these distributions suggests the dataset was carefully balanced or synthetically generated to ensure equal representation across all climate conditions, which is beneficial for training classifiers that can generalize across diverse environmental scenarios.

Figure 3: Agricultural Variables Distribution
Figure 3 displays the distributions of four agricultural practice variables: irrigation access, pesticide use, fertilizer use, and soil health index. Irrigation access shows a uniform distribution across the full range from 0% to 100%, indicating the dataset includes farms with no irrigation infrastructure as well as those with complete irrigation coverage. Pesticide use is evenly distributed from 0 to 50 KG/HA, representing diverse farming approaches from organic or low-input systems to conventional high-input agriculture. Fertilizer use similarly shows uniform distribution from 0 to 100 KG/HA, capturing the full spectrum of nutrient management practices. The soil health index ranges from approximately 30 to 100 with consistent frequency across all values, representing soil conditions from poor to excellent. The remarkably uniform distributions across all four variables suggest this dataset was synthetically generated or carefully balanced to ensure equal representation of different agricultural practices. This uniformity is advantageous for machine learning as it provides our classifiers with sufficient examples across the entire range of each variable, enabling them to learn patterns without bias toward any particular farming practice level.

Figure 4: Categorical Variables Distribution
Figure 4 examines the distribution of four categorical variables in the dataset: country, region, crop type, and adaptation strategies. The country distribution shows that the dataset includes ten major agricultural nations, with USA, Australia, and China being the most represented (approximately 1000 instances each), followed by Nigeria, India, Canada, Argentina, France, Russia, and Brazil with roughly equal representation. This geographic diversity ensures the dataset captures agricultural practices across different continents and climate zones. The regional distribution reveals that South and Northeast regions dominate with approximately 750 instances each, followed by North, Central, Punjab, Victoria, New South Wales, East, South West, and Ontario regions. The crop type distribution demonstrates remarkable diversity, with ten major crops represented almost equally: wheat, cotton, vegetables, corn, rice, sugarcane, fruits, soybeans, barley, and coffee, each appearing approximately 1000 times. This balanced representation across diverse crops from grains to cash crops to vegetables ensures our classifiers can learn patterns applicable to multiple agricultural contexts. Finally, the adaptation strategies variable shows that water management is by far the most common strategy with approximately 2000 instances, followed by no adaptation, drought-resistant crops, organic farming, and crop rotation with roughly equal representation around 1500 instances each. The prevalence of water management strategies reflects the critical

Figure 6: Scatter Plots - Variable Relationships
Figure 6 presents scatter plots showing the relationships between four key variables and crop yield, with points color-coded by yield category (red for Low, yellow for Medium, green for High). The temperature versus yield plot reveals a notable pattern with a conspicuous gap in the 10-20°C range where no data points appear, suggesting either missing data or that crops in this dataset are not grown in that temperature range. Higher temperatures (20-35°C) are associated with higher yields, as evidenced by the concentration of green points in this region. The precipitation versus yield plot shows a more uniform distribution across all precipitation levels (0-3000mm), with all three yield categories well-mixed throughout the range, suggesting that precipitation alone is not a strong predictor of yield category. The soil health versus yield plot demonstrates the clearest separation among all four plots, with low yields (red) predominantly appearing in the 30-60 soil health range, medium yields (yellow) in the 50-80 range, and high yields (green) concentrated in the 60-100 range. This strong visual separation indicates that soil health is likely to be one of the most important features for our classification models. The irrigation access versus yield plot shows moderate separation, with low yields more common at lower irrigation levels and high yields appearing across all irrigation levels but with slightly higher concentration at higher access percentages. These scatter plots provide crucial insights into which features will be most predictive for classifying crop yields, with soil health emerging as the strongest individual predictor, followed by temperature and irrigation access, while precipitation shows weaker discriminative power.


# Classification Results

In [12]:
import csv
from mysklearn.myclassifiers import MyRandomForestClassifier, MyKNeighborsClassifier
from mysklearn.myevaluation import train_test_split, confusion_matrix, accuracy_score



In [13]:
def load_data(filename):
    table = []
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        header = next(reader)
        for row in reader:
            table.append(row)
    return header, table

def prepare_data(header, table):
    numeric_features = ['Average_Temperature_C', 'Total_Precipitation_mm', 
                       'CO2_Emissions_MT', 'Extreme_Weather_Events',
                       'Irrigation_Access_%', 'Pesticide_Use_KG_per_HA', 
                       'Fertilizer_Use_KG_per_HA', 'Soil_Health_Index']
    
    feature_indices = [header.index(feat) for feat in numeric_features]
    yield_index = header.index('Crop_Yield_MT_per_HA')
    
    X = []
    y_continuous = []
    for row in table:
        try:
            features = [float(row[i]) for i in feature_indices]
            yield_val = float(row[yield_index])
            X.append(features)
            y_continuous.append(yield_val)
        except ValueError:
            # Skip rows with an empty or non-numeric feature or yield value
            continue
    
    sorted_yields = sorted(y_continuous)
    p33_index = int(len(sorted_yields) * 0.33)
    p67_index = int(len(sorted_yields) * 0.67)
    p33 = sorted_yields[p33_index]
    p67 = sorted_yields[p67_index]
    
    y = []
    for yield_val in y_continuous:
        if yield_val < p33:
            y.append('Low')
        elif yield_val < p67:
            y.append('Medium')
        else:
            y.append('High')
    
    return X, y

def print_confusion_matrix(matrix, labels):
    print("\nConfusion Matrix:")
    print("=" * 50)
    print(f"{'':12}", end="")
    for label in labels:
        print(f"{label:>10}", end="")
    print()
    print("-" * 50)
    for i, label in enumerate(labels):
        print(f"{label:12}", end="")
        for j in range(len(labels)):
            print(f"{matrix[i][j]:>10}", end="")
        print()

def discretize_features(X):
    """Convert continuous features to categorical bins for Naive Bayes."""
    X_discretized = []
    
    # First, find min/max for each feature to create bins
    n_features = len(X[0])
    feature_mins = [min(instance[i] for instance in X) for i in range(n_features)]
    feature_maxs = [max(instance[i] for instance in X) for i in range(n_features)]
    
    for instance in X:
        discretized_instance = []
        for i, value in enumerate(instance):
            # Create 3 equal-width bins: Low, Medium, High
            range_size = (feature_maxs[i] - feature_mins[i]) / 3
            if value < feature_mins[i] + range_size:
                discretized_instance.append('Low')
            elif value < feature_mins[i] + 2 * range_size:
                discretized_instance.append('Medium')
            else:
                discretized_instance.append('High')
        X_discretized.append(discretized_instance)
    
    return X_discretized
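`discretize_features` derives its bin edges from whatever rows it is given, so calling it separately on the training and test sets can produce slightly different edges for the same feature. One alternative is to learn the edges from the training rows and reuse them for the test rows. The sketch below shows this under stated assumptions: `fit_equal_width_bins` and `apply_bins` are hypothetical helpers, not part of `mysklearn`, and the toy rows are made up.

```python
def fit_equal_width_bins(X_train, n_bins=3):
    """Learn per-feature equal-width cut points from the training rows only."""
    n_features = len(X_train[0])
    cut_points = []
    for i in range(n_features):
        column = [row[i] for row in X_train]
        low, high = min(column), max(column)
        width = (high - low) / n_bins
        cut_points.append([low + width * k for k in range(1, n_bins)])
    return cut_points


def apply_bins(X, cut_points, names=("Low", "Medium", "High")):
    """Map each value to a bin name using previously learned cut points.

    Assumes len(names) equals the number of bins used when fitting.
    """
    binned = []
    for row in X:
        new_row = []
        for value, cuts in zip(row, cut_points):
            # Count how many cut points the value reaches or exceeds
            index = sum(value >= cut for cut in cuts)
            new_row.append(names[index])
        binned.append(new_row)
    return binned


# Toy rows for illustration only
train_rows = [[0.0, 10.0], [3.0, 40.0], [6.0, 70.0], [9.0, 100.0]]
test_rows = [[1.0, 95.0], [8.0, 20.0], [12.0, 50.0]]
cut_points = fit_equal_width_bins(train_rows)
train_binned = apply_bins(train_rows, cut_points)
test_binned = apply_bins(test_rows, cut_points)
print(cut_points)
print(test_binned)
```

Test values outside the training range (such as 12.0 in the first feature) simply fall into the lowest or highest bin, so every test row still gets a label the classifier saw during training.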

In [14]:
filename = 'climate_change_impact_on_agriculture_2024.csv'
header, table = load_data(filename)
X, y = prepare_data(header, table)

print(f"Dataset: {len(X)} instances, {len(X[0])} features")
print(f"Classes: Low, Medium, High")

# Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(f"Training: {len(X_train)} instances")
print(f"Test: {len(X_test)} instances")

Dataset: 10000 instances, 8 features
Classes: Low, Medium, High
Training: 6700 instances
Test: 3300 instances


In [15]:
print("=" * 60)
print("RANDOM FOREST CLASSIFIER")
print("=" * 60)

rf = MyRandomForestClassifier(n_trees=10, max_depth=5)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

print(f"Accuracy: {rf_acc:.4f} ({rf_acc*100:.2f}%)")

labels = ['Low', 'Medium', 'High']
rf_matrix = confusion_matrix(y_test, rf_pred, labels)
print_confusion_matrix(rf_matrix, labels)

RANDOM FOREST CLASSIFIER
Accuracy: 0.3282 (32.82%)

Confusion Matrix:
                   Low    Medium      High
--------------------------------------------------
Low                662       207       196
Medium             698       235       180
High               709       227       186


In [16]:
print("=" * 60)
print("K-NEAREST NEIGHBORS")
print("=" * 60)

knn = MyKNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_acc = accuracy_score(y_test, knn_pred)

print(f"Accuracy: {knn_acc:.4f} ({knn_acc*100:.2f}%)")

knn_matrix = confusion_matrix(y_test, knn_pred, labels)
print_confusion_matrix(knn_matrix, labels)


print("\n" + "=" * 60)
print("NAIVE BAYES CLASSIFIER")
print("=" * 60)
print("Discretizing continuous features into Low/Medium/High bins...")

# Discretize features for Naive Bayes
X_train_disc = discretize_features(X_train)
X_test_disc = discretize_features(X_test)

# Train Naive Bayes
nb = MyNaiveBayesClassifier()
nb.fit(X_train_disc, y_train)
nb_pred = nb.predict(X_test_disc)
nb_acc = accuracy_score(y_test, nb_pred)

print(f"\n✓ Naive Bayes Accuracy: {nb_acc:.4f} ({nb_acc*100:.2f}%)")

nb_matrix = confusion_matrix(y_test, nb_pred, labels)
print_confusion_matrix(nb_matrix, labels)

# Calculate per-class accuracy
print("\nPer-Class Recognition Rates:")
for i, label in enumerate(labels):
    total = sum(nb_matrix[i])
    correct = nb_matrix[i][i]
    rate = (correct / total * 100) if total > 0 else 0
    print(f"  {label}: {correct}/{total} = {rate:.1f}%")

K-NEAREST NEIGHBORS
Accuracy: 0.3748 (37.48%)

Confusion Matrix:
                   Low    Medium      High
--------------------------------------------------
Low                349       377       339
Medium             331       411       371
High               277       368       477

NAIVE BAYES CLASSIFIER
Discretizing continuous features into Low/Medium/High bins...

✓ Naive Bayes Accuracy: 0.4727 (47.27%)

Confusion Matrix:
                   Low    Medium      High
--------------------------------------------------
Low                550        37       478
Medium             442        33       638
High               113        32       977

Per-Class Recognition Rates:
  Low: 550/1065 = 51.6%
  Medium: 33/1113 = 3.0%
  High: 977/1122 = 87.1%


In [17]:
print("\n" + "=" * 70)
print("FINAL COMPARISON")
print("=" * 70)

print(f"\nOverall Accuracy:")
print(f"  Random Forest:  {rf_acc:.4f} ({rf_acc*100:.2f}%)")
print(f"  k-NN (k=5):     {knn_acc:.4f} ({knn_acc*100:.2f}%)")
print(f"  Naive Bayes:    {nb_acc:.4f} ({nb_acc*100:.2f}%)")

# Find winner
accuracies = [('Random Forest', rf_acc), ('k-NN', knn_acc), ('Naive Bayes', nb_acc)]
accuracies_sorted = sorted(accuracies, key=lambda x: x[1], reverse=True)

print(f"\n🏆 Rankings:")
for i, (name, acc) in enumerate(accuracies_sorted, 1):
    print(f"  {i}. {name}: {acc:.4f} ({acc*100:.2f}%)")

winner = accuracies_sorted[0]
print(f"\n🏆 Winner: {winner[0]} with {winner[1]*100:.2f}% accuracy!")

# Show differences
print(f"\nPerformance Gaps:")
print(f"  1st vs 2nd: {(accuracies_sorted[0][1] - accuracies_sorted[1][1])*100:.2f} percentage points")
print(f"  1st vs 3rd: {(accuracies_sorted[0][1] - accuracies_sorted[2][1])*100:.2f} percentage points")

print("\n" + "=" * 70)
print("ANALYSIS COMPLETE")
print("=" * 70)


FINAL COMPARISON

Overall Accuracy:
  Random Forest:  0.3282 (32.82%)
  k-NN (k=5):     0.3748 (37.48%)
  Naive Bayes:    0.4727 (47.27%)

🏆 Rankings:
  1. Naive Bayes: 0.4727 (47.27%)
  2. k-NN: 0.3748 (37.48%)
  3. Random Forest: 0.3282 (32.82%)

🏆 Winner: Naive Bayes with 47.27% accuracy!

Performance Gaps:
  1st vs 2nd: 9.79 percentage points
  1st vs 3rd: 14.45 percentage points

ANALYSIS COMPLETE


## Classification Approach and Methodology

Our classification approach involved implementing three distinct machine learning algorithms from scratch to predict crop yield categories (Low, Medium, High) based on eight numeric features: temperature, precipitation, CO2 emissions, extreme weather events, irrigation access, pesticide use, fertilizer use, and soil health index. We split our dataset of 10,000 instances into training and test sets using a 67-33 split with random shuffling, giving 6,700 training and 3,300 test instances. For Random Forest, we implemented an ensemble method that creates 10 decision trees, each trained on a bootstrap sample (random sampling with replacement) of the training data, with a maximum depth of 5 levels and minimum samples per split of 2. Each tree considers a random subset of features at each split point (specifically, the square root of the total number of features), which reduces correlation between trees and improves generalization. For k-Nearest Neighbors (k-NN), we selected k=5 neighbors and used Euclidean distance to identify the closest training instances to each test point, predicting the majority class among those neighbors. We chose k-NN specifically to evaluate how this instance-based learning algorithm would handle our relatively large dataset of 10,000 instances, as k-NN must compute distances to all training examples for each prediction. For Naive Bayes, we discretized our continuous features into Low, Medium, and High bins based on equal-width ranges within each feature, because our Naive Bayes implementation works on categorical inputs and calculates conditional probabilities for each feature value given each class. We selected Naive Bayes because we hypothesized it would perform best on this dataset due to its ability to handle multiple features efficiently and make predictions based on probability distributions, which seemed well-suited to agricultural data where multiple factors contribute to yield outcomes.
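The two sources of randomness in the Random Forest described above, bootstrap sampling of rows and a random feature subset per split, can be sketched as follows. This is an illustration only, not the code in `MyRandomForestClassifier`: the helper names are hypothetical, and rounding the square root down is an assumption, since the text does not say how the subset size is rounded.

```python
import math
import random


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement; also return the out-of-bag indices."""
    n = len(X)
    picked = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(picked))
    X_sample = [X[i] for i in picked]
    y_sample = [y[i] for i in picked]
    return X_sample, y_sample, out_of_bag


def random_feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices, at least one."""
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))


rng = random.Random(0)
# Toy data for illustration only
X_demo = [[float(i), float(i * 2)] for i in range(10)]
y_demo = ["Low"] * 5 + ["High"] * 5
X_boot, y_boot, oob = bootstrap_sample(X_demo, y_demo, rng)
features = random_feature_subset(8, rng)
print(f"Bootstrap rows: {len(X_boot)}, out-of-bag rows: {len(oob)}")
print(f"Candidate features at this split: {features}")
```

With eight features, as in this project, the subset contains two candidate features per split under the round-down assumption.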

## Performance Evaluation and Results

We evaluated classifier performance using overall accuracy (the percentage of correct predictions) and confusion matrices, which show how predictions are distributed across the three classes and therefore expose per-class performance as well. Random Forest achieved an accuracy of 32.82% (1083 correct predictions out of 3300 test instances), which is no better than the 33.3% chance level. It correctly identified 662 out of 1065 Low yield instances (62.2%), 235 out of 1113 Medium yield instances (21.1%), and 186 out of 1122 High yield instances (16.6%). It predicted Low for 2069 of the 3300 test instances, a strong bias toward that category. The k-NN classifier performed better with 37.48% accuracy (1237 correct predictions), though it still confused the categories heavily. Its per-class recognition rates were more balanced: 349 out of 1065 Low instances (32.8%), 411 out of 1113 Medium instances (36.9%), and 477 out of 1122 High instances (42.5%). Unlike Random Forest, k-NN performed best on the High yield category. Naive Bayes outperformed both other classifiers with 47.27% accuracy (1560 correct predictions), consistent with our hypothesis that it would handle this dataset most effectively. However, its performance is very uneven across classes: Naive Bayes recognized 550 out of 1065 Low instances (51.6%) and 977 out of 1122 High instances (87.1%), but only 33 out of 1113 Medium instances (3.0%). It predicted Medium for just 102 test instances in total, so most Medium instances were misclassified as Low (442) or High (638). This suggests that after discretization the Medium yield category shares feature values with both Low and High, leaving the probability-based Naive Bayes little basis for distinguishing it.
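The per-class recognition rates quoted above are recalls (correct predictions divided by the row total). Precision (correct predictions divided by the column total) gives a complementary view. A minimal sketch using the Naive Bayes and Random Forest matrices printed earlier in this notebook follows; `per_class_metrics` is a hypothetical helper, not part of `mysklearn.myevaluation`, and it assumes rows are true classes and columns are predicted classes, which matches the row totals in the printed output.

```python
def per_class_metrics(matrix, labels):
    """Compute recall and precision per class from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])                  # instances truly in this class
        col_total = sum(row[i] for row in matrix)   # instances predicted as this class
        recall = matrix[i][i] / row_total if row_total else 0.0
        precision = matrix[i][i] / col_total if col_total else 0.0
        metrics[label] = {
            "recall": recall,
            "precision": precision,
            "predicted": col_total,
        }
    return metrics


class_labels = ["Low", "Medium", "High"]
nb_reported = [[550, 37, 478], [442, 33, 638], [113, 32, 977]]
rf_reported = [[662, 207, 196], [698, 235, 180], [709, 227, 186]]
nb_metrics = per_class_metrics(nb_reported, class_labels)
rf_metrics = per_class_metrics(rf_reported, class_labels)

for name, metrics in (("Naive Bayes", nb_metrics), ("Random Forest", rf_metrics)):
    print(name)
    for label in class_labels:
        m = metrics[label]
        print(f"  {label:<7} recall={m['recall']:.3f} "
              f"precision={m['precision']:.3f} predicted={m['predicted']}")
```

For Naive Bayes this shows that High is predicted for 2093 of the 3300 test instances, so its 87.1% recall on High comes with a precision of roughly 46.7%.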

## Comparative Analysis and Best Classifier

Our comparative analysis shows that Naive Bayes has the highest accuracy at 47.27%, outperforming k-NN by 9.79 percentage points and Random Forest by 14.45 percentage points. This result is consistent with our initial hypothesis that Naive Bayes would handle the dataset most effectively, possibly because its probabilistic approach can efficiently process multiple features and identify patterns in the discretized data. The gaps are large for a test set of 3300 instances, although we did not run a formal significance test. However, all three classifiers performed below 50% accuracy: Naive Bayes and k-NN are only modestly better than random guessing for a three-class problem (33.3% baseline), and Random Forest (32.82%) is no better than chance. This relatively poor overall performance suggests several possibilities: the features we selected may not be sufficiently predictive of yield categories, the synthetic nature of the dataset may lack realistic patterns that would exist in real agricultural data, or the three-way classification task may be inherently difficult with these particular feature combinations. The k-NN classifier's moderate performance (37.48%) demonstrates that instance-based learning can handle our 10,000-instance dataset, though the computational cost of distance calculations to all training examples limits scalability. Random Forest's poor performance was surprising given that ensemble methods typically perform well on complex datasets, but the maximum depth restriction of 5 levels and use of only 10 trees may have limited its ability to capture complex interactions between features.
Despite these limitations, the Naive Bayes results, particularly its recognition rates of 87.1% on High yields and 51.6% on Low yields, suggest that probabilistic classification can pick out extreme yield outcomes more readily than intermediate ones. This has practical value for agricultural prediction, where identifying high-performing and low-performing farms is often more important than precise categorization of average performers. The High recognition rate should be read with some caution, since Naive Bayes predicted High for 2093 of the 3300 test instances.
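To give a rough sense of how much of the accuracy gap could be sampling noise, the sketch below computes a normal-approximation 95% interval around each reported accuracy, using the correct-prediction counts from the results above. This is a back-of-the-envelope check under simplifying assumptions, not a formal test: `accuracy_interval` is a hypothetical helper, and because all three classifiers share one test set, a paired test such as McNemar's would be the more appropriate comparison.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = correct / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return p, p - z * standard_error, p + z * standard_error


n_test = 3300
correct_counts = {"Random Forest": 1083, "k-NN": 1237, "Naive Bayes": 1560}
intervals = {name: accuracy_interval(count, n_test)
             for name, count in correct_counts.items()}

chance = 1 / 3
for name, (acc, low, high) in intervals.items():
    covers_chance = low <= chance <= high
    print(f"{name:<14} acc={acc:.4f}  95% CI=({low:.4f}, {high:.4f})  "
          f"includes chance: {covers_chance}")
```

Under these assumptions the Naive Bayes and k-NN intervals do not overlap, while the Random Forest interval includes the 33.3% chance level.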

# Conclusion

This project investigated the classification of crop yields into Low, Medium, and High categories using climate and agricultural data from a synthetic dataset of 10,000 agricultural instances spanning ten countries and ten major crop types. The dataset included 15 attributes covering climate variables (temperature, precipitation, CO2 emissions, extreme weather events), agricultural practices (irrigation access, pesticide use, fertilizer use, soil health index), and categorical features (country, region, crop type, adaptation strategies). Our exploratory data analysis revealed uniformly distributed features across all ranges, which is characteristic of synthetically generated data designed for educational purposes, and identified soil health and temperature as the features showing the strongest visual correlation with crop yield through scatter plot and box plot analysis.

The classification task presented several inherent challenges that significantly impacted model performance. First, the synthetic nature of the dataset, while providing balanced class distribution and complete coverage of feature ranges, may lack the complex, non-linear relationships and interactions that characterize real-world agricultural systems where multiple factors interact in ways that cannot be captured by simple uniform distributions. Second, the choice to convert continuous yield values into three discrete categories necessarily loses information, as the boundaries between Low, Medium, and High yields are somewhat arbitrary despite being based on percentiles. Third, the Medium yield category proved particularly difficult for Naive Bayes (3.0% recognized) and was also poorly recognized by Random Forest (21.1%), likely because instances near the 33rd and 67th percentile boundaries share characteristics with both adjacent categories, creating inherent ambiguity. Finally, our use of only eight numeric features, while computationally manageable, may have excluded important predictors of crop yield such as crop-specific characteristics, detailed soil chemistry, pest pressure, or historical yield trends that would be present in real agricultural datasets.

We developed and implemented three classification algorithms from scratch following CPSC 322 requirements, using only Python's csv module and matplotlib for data handling and visualization. Our Random Forest classifier employed an ensemble of 10 decision trees with bootstrap sampling and random feature selection (square root of total features) at each split, limited to a maximum depth of 5 levels. The k-Nearest Neighbors algorithm used k=5 neighbors with Euclidean distance, selected specifically to evaluate how instance-based learning would scale to our 10,000-instance dataset. Naive Bayes required discretizing continuous features into Low, Medium, and High bins, which we accomplished using equal-width binning based on the range of each feature, and we hypothesized this approach would perform best due to its efficient probabilistic framework for handling multiple features. All classifiers were evaluated using a 67-33 train-test split with accuracy as the primary metric and confusion matrices to assess per-class performance.

The performance results were consistent with our hypothesis that Naive Bayes would be the strongest performer: it achieved 47.27% accuracy, ahead of k-NN (37.48%) and Random Forest (32.82%). However, all three classifiers achieved accuracy rates below 50%. Naive Bayes and k-NN are only moderately better than random guessing for a three-class problem (33.3% baseline), and Random Forest is no better than chance. Naive Bayes performed well on the High category (87.1%) and moderately on Low (51.6%) but recognized almost none of the Medium category (3.0%), suggesting that after discretization, the middle category shares probabilistic features with both extremes. The k-NN classifier showed more balanced per-class performance but was computationally expensive due to distance calculations across the entire training set. Random Forest's poor performance was unexpected for an ensemble method and may indicate that our maximum depth restriction was too conservative or that the synthetic data lacks the complex feature interactions that typically benefit ensemble approaches.

Several strategies could potentially improve classification performance in future work. First, implementing feature engineering to create interaction terms (such as temperature × precipitation or soil health × fertilizer use) might capture the synergistic effects that characterize real agricultural systems. Second, using stratified sampling rather than random sampling for the train-test split would ensure that each class is proportionally represented in both sets, which could improve model training, particularly for the problematic Medium category. Third, conducting feature selection or dimensionality reduction could identify the most predictive variables and reduce noise from less informative features. Fourth, for Naive Bayes specifically, experimenting with different discretization strategies such as quantile-based binning (using percentiles rather than equal-width bins) might create more meaningful categorical distinctions that better separate the classes. Fifth, significantly increasing the number of trees in Random Forest (from 10 to 50 or 100) and relaxing the maximum depth constraint could allow the ensemble to capture more complex patterns, though this would increase computational cost. Sixth, implementing cross-validation rather than a single train-test split would provide more robust estimates of model performance and reduce the impact of random variation in data splitting. Finally, if this were a real-world application, collecting additional features such as crop-specific growth characteristics, detailed soil composition data, historical weather patterns, or economic factors like market prices and input costs could provide the additional information necessary to achieve the higher accuracy rates required for practical agricultural decision-making.
Despite the modest accuracy achieved, this project successfully demonstrated the implementation and comparison of three distinct machine learning paradigms (ensemble learning, instance-based learning, and probabilistic learning) and provided valuable insights into the challenges of multi-class classification on balanced synthetic datasets.
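As one concrete illustration of the second suggestion above, a stratified split can be sketched in plain Python by splitting each class separately and then recombining. This is a sketch under stated assumptions: `stratified_split` is a hypothetical helper, not the `train_test_split` in `mysklearn.myevaluation`, and the toy data is made up.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_size=0.33, seed=42):
    """Split so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(y):
        indices_by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so rows are not grouped by class
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data for illustration only
X_demo = [[i] for i in range(30)]
y_demo = ["Low"] * 10 + ["Medium"] * 10 + ["High"] * 10
X_tr, X_te, y_tr, y_te = stratified_split(X_demo, y_demo, test_size=0.3)
print({label: y_te.count(label) for label in ("Low", "Medium", "High")})
```

Since the three classes here are already close to balanced, the effect on this dataset would likely be small, but the split no longer depends on chance to keep class proportions equal in both sets.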

# Acknowledgements

Claude AI was used in this project to help us understand and develop Random Forest and its unit tests, to correct bugs in our EDA code, and to assist with our report notebook. Our Naive Bayes and kNN classifier code was taken from previous PAs.