# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field](farmer_in_a_field.jpg)

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH value is an important part of assessing soil condition. However, taking these measurements can be expensive and time-consuming, so farmers often have to prioritize which metrics to measure based on their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for their field. They have provided you with a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorus content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: pH value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

In this project, you will build multi-class classification models to predict the type of `"crop"` and identify the single most important feature for predictive performance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
data = pd.read_csv('soil_measures.csv')

# Encode the target variable (crop)
label_encoder = LabelEncoder()
data['crop_encoded'] = label_encoder.fit_transform(data['crop'])

# Features to evaluate
features = ['N', 'P', 'K', 'ph']
X = data[features]
y = data['crop_encoded']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=features)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Dictionary to store accuracy scores
accuracy_scores = {}

# Train and evaluate a logistic regression model for each feature
for feature in features:
    # Train model using a single feature
    model = LogisticRegression(multi_class='multinomial', random_state=42, max_iter=1000)
    model.fit(X_train[[feature]], y_train)

    # Predict on test set
    y_pred = model.predict(X_test[[feature]])

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores[feature] = accuracy

    # Print classification report for additional metrics
    print(f"\nClassification Report for {feature}:")
    print(classification_report(y_test, y_pred, target_names=label_encoder.classes_, zero_division=0))

# Identify the best feature
best_feature = max(accuracy_scores, key=accuracy_scores.get)
best_score = accuracy_scores[best_feature]

# Create the output dictionary
best_predictive_feature = {best_feature: best_score}

# Print results
print("\nAccuracy Scores for Each Feature:")
for feature, score in accuracy_scores.items():
    print(f"{feature}: {score:.4f}")
print("\nBest Predictive Feature:")
print(best_predictive_feature)
```


```
Classification Report for N:
              precision    recall  f1-score   support

       apple       0.00      0.00      0.00        23
      banana       0.00      0.00      0.00        21
   blackgram       0.08      0.25      0.12        20
    chickpea       0.00      0.00      0.00        26
     coconut       0.00      0.00      0.00        27
      coffee       0.07      0.12      0.09        17
      cotton       0.32      0.71      0.44        17
      grapes       0.06      0.29      0.11        14
        jute       0.00      0.00      0.00        23
 kidneybeans       0.00      0.00      0.00        20
      lentil       0.06      0.64      0.11        11
       maize       0.31      0.24      0.27        21
       mango       0.00      0.00      0.00        19
   mothbeans       0.00      0.00      0.00        24
    mungbean       0.00      0.00      0.00        19
   muskmelon       0.19      0.18      0.18        17
      orange       0.00      0.00      0.00        
```

## Explanation of the Code

- Data Loading: The dataset is loaded with `pd.read_csv`. It contains 2200 rows with the columns `N`, `P`, `K`, `ph`, and `crop` (22 unique crop types).
- Label Encoding: The `crop` column is encoded into numerical labels (0–21) using `LabelEncoder`.
- Feature Scaling: Features are standardized with `StandardScaler`, since regularized logistic regression and its solvers are sensitive to feature scale. Note that the scaler is fitted on the full dataset before the split, so test-set statistics influence the scaling.
- Train-Test Split: The data is split into 80% training (1760 rows) and 20% testing (440 rows) sets with a fixed `random_state=42` for reproducibility.
- Model Training: A separate logistic regression model is trained on each feature individually, using the multinomial setting for multi-class classification and `max_iter=1000` to give the solver enough iterations to converge.
- Evaluation:
  - Accuracy: Computed with `accuracy_score` to measure the proportion of correctly predicted crop types.
  - Classification Report: Reports per-class precision, recall, and F1-score, along with macro and weighted averages, for each feature's model. Setting `zero_division=0` reports 0 rather than raising a warning for classes that are never predicted.
- Best Feature Selection: The feature with the highest test accuracy is selected, and its name and score are stored in `best_predictive_feature`.
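The code above fits the scaler before splitting the data. One possible variation, sketched below, wraps the scaler and the model in a scikit-learn `Pipeline` so that scaling is learned from the training rows only. This is a minimal sketch, not the notebook's method: the synthetic data stands in for `soil_measures.csv`, and the helper names `make_toy_soil_data` and `single_feature_accuracy`, along with the three made-up crops, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_soil_data(n_per_crop=50, seed=0):
    """Hypothetical stand-in for soil_measures.csv (three made-up crops)."""
    rng = np.random.default_rng(seed)
    # (crop name, mean K, mean ph): illustrative values only, not real data
    crops = [("crop_a", 20.0, 6.5), ("crop_b", 80.0, 6.5), ("crop_c", 200.0, 6.5)]
    frames = []
    for name, k_mean, ph_mean in crops:
        frames.append(pd.DataFrame({
            "K": rng.normal(k_mean, 5.0, n_per_crop),
            "ph": rng.normal(ph_mean, 0.5, n_per_crop),
            "crop": name,
        }))
    return pd.concat(frames, ignore_index=True)


def single_feature_accuracy(data, feature, target="crop"):
    """Test accuracy of a scaler + logistic regression pipeline on one feature."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[[feature]], data[target], test_size=0.2, random_state=42
    )
    # The scaler is fitted inside the pipeline, on the training rows only
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


toy = make_toy_soil_data()
scores = {f: single_feature_accuracy(toy, f) for f in ["K", "ph"]}
print(scores)
```

In this toy setup the crops differ only in `K`, so the `K` model should score far higher than the `ph` model. The same helper could be applied to the real dataset by passing `data` and each entry of `features`.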
## Expected Results

Based on the dataset's characteristics:

- Potassium (K): Likely the best predictor due to its wide range (5–205) and distinct crop-specific values (e.g., apples: 195–205, oranges: 5–15, chickpeas: 75–85). Expected accuracy: ~0.3–0.4, since single-feature models for 22 classes typically perform modestly because of class overlap.
- Phosphorus (P): Strong but slightly less discriminative (e.g., chickpeas: 55–85, rice: 35–60). Expected accuracy: slightly lower than K.
- Nitrogen (N): Moderate performance due to overlapping ranges (e.g., rice: 60–99, maize: 60–100). Expected accuracy: ~0.2–0.3.
- pH: Likely the weakest predictor due to similar pH tolerances across crops (5.5–7.5). Expected accuracy: ~0.1–0.2.
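To put these accuracy estimates in context, it can help to compare them with a chance-level baseline. The sketch below assumes the 22 crops are equally represented (100 rows each, which the text does not state explicitly) and uses scikit-learn's `DummyClassifier`, which ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Assumption: 22 balanced classes with 100 rows each (2200 rows total)
n_classes, n_per_class = 22, 100
y = np.repeat(np.arange(n_classes), n_per_class)
X = np.zeros((y.size, 1))  # placeholder features; the dummy model ignores them

# Always predict a single class, which is right for 1 in 22 rows
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
chance = accuracy_score(y, baseline.predict(X))
print(f"chance-level accuracy: {chance:.4f}")  # chance-level accuracy: 0.0455
```

Under that assumption, a single-feature accuracy of 0.3 would be several times better than guessing, even though it looks low in absolute terms.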
The classification report will show:

- Precision: Proportion of predictions for a class that are correct.
- Recall: Proportion of a class's actual instances that are correctly identified.
- F1-score: Harmonic mean of precision and recall for each class. Only the "weighted avg" row weights the per-class scores by class support.

For crops with distinct feature values (e.g., apples for K), metrics will be higher; for overlapping ranges (e.g., rice and maize for N), metrics will be lower.
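As a small worked example of these definitions, the sketch below computes precision, recall, and F1 for one class by hand and compares them with scikit-learn's `precision_recall_fscore_support`. The six labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels for three crops, for illustration only
y_true = np.array(["rice", "rice", "rice", "maize", "maize", "apple"])
y_pred = np.array(["rice", "rice", "maize", "maize", "rice", "apple"])

labels = ["apple", "maize", "rice"]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

# Manual calculation for the "rice" class
tp = np.sum((y_pred == "rice") & (y_true == "rice"))  # correct "rice" predictions
rice_precision = tp / np.sum(y_pred == "rice")        # share of "rice" predictions that are right
rice_recall = tp / np.sum(y_true == "rice")           # share of actual "rice" rows that are found
rice_f1 = 2 * rice_precision * rice_recall / (rice_precision + rice_recall)

# Support-weighted mean of the per-class F1 scores ("weighted avg" row)
weighted_f1 = np.average(f1, weights=support)

print(dict(zip(labels, f1.round(3))))
print(f"rice F1 by hand: {rice_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

The per-class F1 for "rice" matches the manual calculation, while the weighted average is a separate summary that depends on how many rows each class has.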
## Hypothetical Output

Assuming K performs best due to its discriminative power: