# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field](farmer_in_a_field.jpg)

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH is an important part of assessing soil condition. However, it can be expensive and time-consuming, so farmers may have to prioritize which metrics to measure based on their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you as a machine learning expert for help selecting the best crop for their field. They've provided you with a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorus content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: pH value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

In this project, you will build multi-class classification models to predict the type of `"crop"` and identify the single most important feature for predictive performance.

### Loading and Preparing the Data

Before diving into the analysis, we will first import the necessary libraries for data processing, visualisation, and modelling. We will then load the dataset, which contains soil measurements and the corresponding ideal crop for each field.


In [1]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

## Exploratory Data Analysis (EDA)

To understand the patterns in the soil data and identify any useful trends, we begin with a visual and statistical exploration. This helps inform how we approach the predictive modelling task ahead.


### First Look at the Data

Before we begin building any models, it's helpful to take a quick look at the dataset. Here, we display the first few rows to get familiar with the structure and confirm the presence of the expected features—soil nutrients (`N`, `P`, `K`), pH level, and the target variable `crop`, which identifies the optimal crop for each soil composition.



In [2]:
crops.head()

|   | N | P | K | ph | crop |
|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 6.502985 | rice |
| 1 | 85 | 58 | 41 | 7.038096 | rice |
| 2 | 60 | 55 | 44 | 7.840207 | rice |
| 3 | 74 | 35 | 40 | 6.980401 | rice |
| 4 | 78 | 42 | 42 | 7.628473 | rice |


From this preview, the dataset appears clearly structured, with consistent formatting across columns. Each row corresponds to one set of soil measurements and the crop best suited to it. Note that the pH column is stored under the lowercase name `ph`.
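Because the pH column is named `ph` in the file, a quick check of the column names can catch a mismatch early. The following is a minimal sketch, assuming the column names shown in the preview; the helper `check_columns` and the two-row stand-in frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Column names as they appear in the preview above
EXPECTED_COLUMNS = ["N", "P", "K", "ph", "crop"]

def check_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are missing from df."""
    return [col for col in expected if col not in df.columns]

# Tiny stand-in frame built from the first two rows of the preview
sample = pd.DataFrame(
    {"N": [90, 85], "P": [42, 58], "K": [43, 41],
     "ph": [6.502985, 7.038096], "crop": ["rice", "rice"]}
)

missing = check_columns(sample)
print(missing)
```

In the notebook itself, the same helper could be called as `check_columns(crops)` right after loading the file.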

Next, we examine how frequently each crop type appears in the dataset. This helps us confirm whether the classes are balanced or if any particular crops are over- or under-represented.


In [3]:
crops['crop'].value_counts().sort_index()

crop
apple          100
banana         100
blackgram      100
chickpea       100
coconut        100
coffee         100
cotton         100
grapes         100
jute           100
kidneybeans    100
lentil         100
maize          100
mango          100
mothbeans      100
mungbean       100
muskmelon      100
orange         100
papaya         100
pigeonpeas     100
pomegranate    100
rice           100
watermelon     100
Name: count, dtype: int64
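The counts above can also be checked programmatically. This is a small sketch on made-up stand-in labels (three classes instead of 22); in the notebook the series would be `crops['crop']`. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in labels; in the notebook this would be crops["crop"]
labels = pd.Series(["rice"] * 3 + ["maize"] * 3 + ["jute"] * 3, name="crop")

counts = labels.value_counts()
is_balanced = counts.nunique() == 1            # True when every class has the same count
majority_share = counts.max() / counts.sum()   # share of the largest class

print(is_balanced, round(majority_share, 3))
```

For the real dataset, with 22 crops of 100 rows each, the largest class share would be 100 / 2200, or roughly 4.5%, which is also the accuracy expected from guessing uniformly at random.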

Each of the 22 crops appears exactly 100 times, so the classes are perfectly balanced. We can now inspect the structure of the dataset in more detail using `.info()`. This will show us the number of entries, column types, and whether any data is missing.

In [4]:
crops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   N       2200 non-null   int64  
 1   P       2200 non-null   int64  
 2   K       2200 non-null   int64  
 3   ph      2200 non-null   float64
 4   crop    2200 non-null   object 
dtypes: float64(1), int64(3), object(1)
memory usage: 86.1+ KB


All 2,200 entries are complete, with **no missing values**, and each column has a single, consistent data type. This means no imputation is needed before modelling.


In [5]:
crops.describe()

|       | N | P | K | ph |
|---|---|---|---|---|
| count | 2200.0 | 2200.0 | 2200.0 | 2200.0 |
| mean | 50.551818 | 53.362727 | 48.149091 | 6.46948 |
| std | 36.917334 | 32.985883 | 50.647931 | 0.773938 |
| min | 0.0 | 5.0 | 5.0 | 3.504752 |
| 25% | 21.0 | 28.0 | 20.0 | 5.971693 |
| 50% | 37.0 | 51.0 | 32.0 | 6.425045 |
| 75% | 84.25 | 68.0 | 49.0 | 6.923643 |
| max | 140.0 | 145.0 | 205.0 | 9.935091 |


The descriptive statistics reveal quite a bit about the soil conditions in the dataset:

- Nutrient levels (N, P, K) vary widely, particularly potassium, which ranges from 5 to 205, indicating diverse soil fertility across samples.
- The pH values span from strongly acidic (about 3.5) to strongly alkaline (almost 10). The mean is about 6.47, which is slightly acidic and close to neutral, a range that suits many crops.
- The standard deviations are large relative to the means, especially for potassium and nitrogen, which suggests meaningful variation in soil profiles that may help distinguish between crop preferences.
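One way to make "relatively high standard deviation" concrete is the coefficient of variation, the standard deviation divided by the mean. The sketch below uses only the mean and standard deviation values reported by `describe()` above; the frame name `stats` is illustrative.

```python
import pandas as pd

# Mean and standard deviation copied from the describe() output above
stats = pd.DataFrame(
    {"mean": [50.551818, 53.362727, 48.149091, 6.46948],
     "std": [36.917334, 32.985883, 50.647931, 0.773938]},
    index=["N", "P", "K", "ph"],
)

# Coefficient of variation: spread expressed as a fraction of the mean
stats["cv"] = stats["std"] / stats["mean"]
print(stats["cv"].round(2).sort_values(ascending=False))
```

Potassium stands out because its standard deviation exceeds its mean, while pH varies far less relative to its mean than any of the three nutrients.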


## Preparing the Data for Modeling

Before training the model, the dataset must be prepared for machine learning. The preprocessing steps are:

- **Reducing memory usage** by converting the `crop` column to the `category` data type.
- **Separating features and target** into arrays. scikit-learn classifiers accept string class labels directly, so the target needs no separate encoding step.
- **Splitting the data** into training and test sets.
- **Scaling numerical features** so that all of them are on a comparable scale.

These steps help improve both model performance and computational efficiency.


### Optimising memory usage

Converting the `crop` column from `object` to `category` type substantially reduces memory usage, from 189.5 kB to 72.8 kB in the output below. This is an effective preprocessing step when working with repeated string values, especially in larger datasets where efficiency matters.


In [7]:
memory_used_obj = crops.memory_usage(deep=True).sum() / 1024  # in kB
crops['crop'] = crops['crop'].astype('category')
memory_used_cat = crops.memory_usage(deep=True).sum() / 1024  # in kB
print(f'memory usage with "object" type: {memory_used_obj.round(1)} kB')
print(f'memory usage with "category" type: {memory_used_cat.round(1)} kB')


memory usage with "object" type: 189.5 kB
memory usage with "category" type: 72.8 kB
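The saving comes from how the `category` type stores data: each unique label is kept once, and every row holds only a small integer code pointing to it. The sketch below illustrates this on a synthetic column of three repeated crop names; the series names are illustrative and the byte counts will differ from the notebook's.

```python
import pandas as pd

# Synthetic column: three crop names repeated many times (illustrative only)
s_obj = pd.Series(["rice", "maize", "jute"] * 1000, dtype="object")
s_cat = s_obj.astype("category")

bytes_obj = s_obj.memory_usage(deep=True)
bytes_cat = s_cat.memory_usage(deep=True)

print(s_cat.cat.categories.tolist())  # the unique labels, stored once
print(s_cat.cat.codes.dtype)          # small integer codes, one per row
print(bytes_obj, bytes_cat)
```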


### Splitting Features and Target

In this step, we separate the dataset into input features and the target variable. The soil measurements (`N`, `P`, `K`, and `ph`) go into a NumPy feature matrix, `X`, and the corresponding crop labels go into a target vector, `y`. This separation is a standard preprocessing step that matches the input format scikit-learn models expect and prepares the data for the transformations and model training that follow.


In [8]:
X = crops.drop('crop', axis=1).values  # feature matrix: N, P, K, ph
y = np.array(crops['crop'].values)  # np.array turns the categorical labels into a plain NumPy array
print(type(X), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [9]:
print(X[:3])
print(type(X))

[[90.         42.         43.          6.50298529]
 [85.         58.         41.          7.03809636]
 [60.         55.         44.          7.84020714]]
<class 'numpy.ndarray'>


In [10]:
print(y[:3])
print(type(y))

['rice' 'rice' 'rice']
<class 'numpy.ndarray'>


### Splitting the data for modeling

To estimate how well the model generalises to unseen data, we split the dataset into training and testing subsets. Here, we use the `train_test_split` function to divide the data, reserving 20% for testing. A random seed is set so that the results are reproducible.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=7)
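The split above is purely random. One optional variation, which the notebook does not use, is to pass `stratify` so that each class keeps the same proportion in both subsets. The sketch below shows this on synthetic stand-in data; the `_demo` names and the three classes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 classes with 50 samples each and 4 numeric features
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat(["a", "b", "c"], 50)

# stratify=y_demo is an addition here; the notebook's split does not use it
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# Count how many samples of each class ended up in the test set
classes, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes, test_counts)))
```

With perfectly balanced classes, as in this dataset, a random split is usually close to balanced anyway, so stratification mainly removes small random imbalances in the test set.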

### Feature Scaling: Enhancing Model Performance

To improve the performance and stability of the machine learning models, we scale the features to standardise their range. Using `StandardScaler`, the data is transformed to have a mean of zero and a standard deviation of one. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into preprocessing. This step is particularly important for models sensitive to the magnitude of the data, such as gradient-based models.


In [12]:
scaler=StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
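To see what the scaler does, the sketch below compares `StandardScaler` with the manual computation z = (x − mean) / std, where the mean and standard deviation come from the training data. The data is synthetic, loosely mimicking features on different scales, and all variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative values only)
train = rng.normal(loc=[50, 50, 48, 6.5], scale=[37, 33, 51, 0.8], size=(200, 4))
test = rng.normal(loc=[50, 50, 48, 6.5], scale=[37, 33, 51, 0.8], size=(50, 4))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# Equivalent manual computation using the training mean and (population) std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(np.allclose(test_scaled, manual))
```

The scaled test set will generally not have a mean of exactly zero, because it is standardised with the training statistics rather than its own.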

## Evaluating the Contribution of Each Feature

Next, we evaluate the performance of a **logistic regression** model using each individual feature. By training the model on each feature independently, we can assess the contribution of each one to the overall prediction accuracy.



In [13]:
features = ["N", "P", "K", "ph"]
features_n = range(0,4)

# Store the weighted F1 score for each feature
feature_scores = {}

The following code trains a **logistic regression** model on each individual feature and calculates the **F1 score** for each. The results will provide insight into which features are more predictive of the target variable.


In [14]:
# Train logistic regression on each individual feature
for i in features_n:
    logreg = LogisticRegression()
    # Fit on a single scaled column, reshaped to the 2-D shape scikit-learn expects
    logreg.fit(X_train_scaled[:, i].reshape(-1, 1), y_train)
    # Predict crops for the test set from the same single feature
    y_pred = logreg.predict(X_test_scaled[:, i].reshape(-1, 1))
    # Weighted F1 compares the predictions with the true test labels
    score = metrics.f1_score(y_test, y_pred, average='weighted')
    feature_scores[features[i]] = score  # store the score under the feature name
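The loop above scores each feature on a single train/test split. The notebook also imports `cross_val_score` and `KFold`, so one possible alternative, not what the notebook does, is to average the score over several folds. The sketch below assumes synthetic stand-in data with one informative and one uninformative feature; `make_pipeline` is an extra import, and all `_demo` names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: "signal" separates the classes, "noise" does not
rng = np.random.default_rng(7)
y_demo = np.repeat(["a", "b", "c"], 60)
signal = np.repeat([0.0, 3.0, 6.0], 60) + rng.normal(size=180)
noise = rng.normal(size=180)
X_demo = np.column_stack([signal, noise])

kf = KFold(n_splits=5, shuffle=True, random_state=7)
cv_scores = {}
for i, name in enumerate(["signal", "noise"]):
    # The pipeline refits the scaler inside each fold, which avoids leakage
    model = make_pipeline(StandardScaler(), LogisticRegression())
    fold_scores = cross_val_score(model, X_demo[:, [i]], y_demo,
                                  cv=kf, scoring="f1_weighted")
    cv_scores[name] = fold_scores.mean()

print(cv_scores)
```

Averaging over folds makes the comparison between features less dependent on one particular random split.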

### F1 Scores for Each Feature

The output below shows the weighted F1 score obtained with each feature on its own. A higher score indicates that the feature alone leads to more accurate predictions.


In [15]:
# Display the scores
for feature, score in feature_scores.items():
    print(f'F1 score for "{feature}": {score:.4f}')
    

F1 score for "N": 0.0712
F1 score for "P": 0.1366
F1 score for "K": 0.1381
F1 score for "ph": 0.0279
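For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its number of true samples. The sketch below checks this against scikit-learn on a handful of made-up labels; the arrays are purely illustrative and unrelated to the crop data.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for three classes
y_true = np.array(["a", "a", "a", "a", "b", "b", "c", "c", "c", "c"])
y_hat = np.array(["a", "a", "b", "b", "b", "b", "c", "c", "c", "a"])

labels, support = np.unique(y_true, return_counts=True)
per_class = metrics.f1_score(y_true, y_hat, average=None, labels=labels)

# Weighted F1: per-class F1 averaged with weights proportional to class support
manual_weighted = np.sum(per_class * support / support.sum())
library_weighted = metrics.f1_score(y_true, y_hat, average="weighted")

print(np.isclose(manual_weighted, library_weighted))
```

Because every crop in this dataset has the same number of rows overall, the class weights in the test set are close to equal, so the weighted score behaves much like a plain average over classes.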


### Identifying the Best Predictive Feature

The following code selects the feature with the highest F1 score from `feature_scores` and stores it, together with its score, in a dictionary.



In [16]:
best_predictive_feature = {
    max(feature_scores, key=feature_scores.get): max(feature_scores.values())
}

The output shows which feature has the highest F1 score and is therefore the most predictive when used on its own. Note that all four scores are low in absolute terms, so no single soil measurement predicts the crop well by itself.

In [17]:
print(f"The best predictive feature is {list(best_predictive_feature.keys())[0]} with a value of {list(best_predictive_feature.values())[0]:.4f}")

The best predictive feature is K with a value of 0.1381
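It can also be useful to look at the full ranking rather than only the winner. The sketch below uses the four F1 scores reported above, copied by hand into a dictionary; the names `reported_scores`, `ranking`, and `margin` are illustrative.

```python
# F1 scores copied from the output above
reported_scores = {"N": 0.0712, "P": 0.1366, "K": 0.1381, "ph": 0.0279}

# Sort features from highest to lowest score
ranking = sorted(reported_scores.items(), key=lambda kv: kv[1], reverse=True)
best_name, best_score = ranking[0]
margin = best_score - ranking[1][1]  # gap between the top two features

for name, score in ranking:
    print(f"{name}: {score:.4f}")
print(f"margin between top two: {margin:.4f}")
```

The gap between potassium and phosphorus is very small, so the order of the top two features may well change with a different random split.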
