# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field](farmer_in_a_field.jpg)

Measuring essential soil metrics such as nitrogen, phosphorus, and potassium levels and pH is an important part of assessing soil condition. However, it can be expensive and time-consuming, so farmers often have to prioritize which metrics to measure within their budget.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you, as a machine learning expert, for help selecting the best crop for their field. They've provided you with a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorus content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: pH value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

In this project, you will build multi-class classification models to predict the type of `"crop"` and identify the single most important feature for predictive performance.

In [1]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn import metrics

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

## Exploring the Dataset

Before diving into model building, it's important to take a first look at the dataset. The code below displays the first few rows of the `soil_measures.csv` file to understand the structure of the data.

This initial exploration confirms the presence of the expected features: nitrogen (`N`), phosphorus (`P`), potassium (`K`), pH (`ph`), and the target variable `crop`, which indicates the most suitable crop for a given soil composition.


In [2]:
crops.head()

|   | N | P | K | ph | crop |
|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 6.502985 | rice |
| 1 | 85 | 58 | 41 | 7.038096 | rice |
| 2 | 60 | 55 | 44 | 7.840207 | rice |
| 3 | 74 | 35 | 40 | 6.980401 | rice |
| 4 | 78 | 42 | 42 | 7.628473 | rice |


The number of unique values in the `crop` column will first be counted using the `nunique()` method. If the result is reasonable, the frequency of each value will be calculated. This will provide both the list of crop types and how often each one appears in the dataset.

In [3]:
print(crops['crop'].nunique())

22


In [4]:
crops['crop'].value_counts().sort_index()

crop
apple          100
banana         100
blackgram      100
chickpea       100
coconut        100
coffee         100
cotton         100
grapes         100
jute           100
kidneybeans    100
lentil         100
maize          100
mango          100
mothbeans      100
mungbean       100
muskmelon      100
orange         100
papaya         100
pigeonpeas     100
pomegranate    100
rice           100
watermelon     100
Name: count, dtype: int64

The dataset contains 2,200 rows covering **22 unique** crops, and each crop appears exactly 100 times. With perfectly balanced classes, no crop dominates training, and plain accuracy is not distorted by class frequency.

The list also shows **no duplicated crop labels**: each name appears once, with no misspelled or differently cased variants.
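
The label counts above do not reveal repeated rows, and near-identical labels are easy to miss by eye. The following is a minimal sketch of how both could be checked, run on a small hypothetical frame (`toy`, with made-up values) rather than on `soil_measures.csv`:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the dataset (values are made up).
toy = pd.DataFrame({
    "N": [90, 85, 90, 20],
    "P": [42, 58, 42, 60],
    "K": [43, 41, 43, 80],
    "ph": [6.5, 7.0, 6.5, 5.9],
    "crop": ["rice", "rice", "rice", "Rice "],
})

# Rows that are identical, in every column, to an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Labels that collapse onto another label after trimming and lower-casing.
raw_labels = toy["crop"].nunique()
clean_labels = toy["crop"].str.strip().str.lower().nunique()

print(f"duplicate rows: {n_duplicate_rows}")
print(f"labels before/after normalising: {raw_labels}/{clean_labels}")
```

Applied to the real data, the same two checks would be `crops.duplicated().sum()` and a comparison of `nunique()` before and after normalising the `crop` strings.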

In [5]:
crops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   N       2200 non-null   int64  
 1   P       2200 non-null   int64  
 2   K       2200 non-null   int64  
 3   ph      2200 non-null   float64
 4   crop    2200 non-null   object 
dtypes: float64(1), int64(3), object(1)
memory usage: 86.1+ KB


All 2,200 entries are complete with **no missing values** across any of the five columns. This confirms the dataset is clean and ready for analysis, with consistent data types and no need for imputation or structural adjustments.


In [6]:
crops.describe()

|   | N | P | K | ph |
|---|---|---|---|---|
| count | 2200.0 | 2200.0 | 2200.0 | 2200.0 |
| mean | 50.551818 | 53.362727 | 48.149091 | 6.46948 |
| std | 36.917334 | 32.985883 | 50.647931 | 0.773938 |
| min | 0.0 | 5.0 | 5.0 | 3.504752 |
| 25% | 21.0 | 28.0 | 20.0 | 5.971693 |
| 50% | 37.0 | 51.0 | 32.0 | 6.425045 |
| 75% | 84.25 | 68.0 | 49.0 | 6.923643 |
| max | 140.0 | 145.0 | 205.0 | 9.935091 |


### Summary Statistics of Soil Features

The descriptive statistics give an overview of the soil characteristics across the dataset:

- **Nutrient levels (N, P, K)** show wide variability, especially potassium (K), which ranges from 5 to 205. This suggests diverse soil fertility across samples.
- **pH values** range from 3.5 to nearly 10, with a mean of about 6.47 (slightly acidic), which suits a wide range of crops.
- The relatively large standard deviations, especially for potassium and nitrogen, indicate significant variation, which is useful for training a model to distinguish between different crop preferences.
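
Overall variability only helps a classifier if it lines up with the crop labels. One rough way to look at this is to compare per-crop means against the overall spread of each feature. The sketch below uses a hypothetical toy frame with made-up values. The ratio it computes is an informal heuristic chosen for illustration, not a standard statistic.

```python
import pandas as pd

# Hypothetical toy data (values are made up): K differs by crop, ph does not.
toy = pd.DataFrame({
    "K": [40, 44, 200, 204, 20, 22],
    "ph": [6.0, 7.0, 6.1, 6.9, 6.2, 7.0],
    "crop": ["rice", "rice", "grapes", "grapes", "maize", "maize"],
})

# Mean of each feature within each crop.
per_crop = toy.groupby("crop")[["K", "ph"]].mean()

# Informal heuristic: spread of the per-crop means relative to the overall spread.
# A larger ratio suggests the feature varies mostly *between* crops.
separation = per_crop.std() / toy[["K", "ph"]].std()

print(per_crop)
print(separation.round(2))
```

On the real data, `crops.groupby("crop")[["N", "P", "K", "ph"]].mean()` would give the corresponding per-crop table.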


## Preparing the Data for Modeling

Before training the model, the dataset must be prepared for machine learning. Key preprocessing steps include:

- **Reducing memory usage** by converting the categorical `crop` column to a more efficient data type.
- **Encoding the target variable** so it can be used in classification models.
- **Scaling numerical features** to ensure consistency across different ranges of values.

These steps help improve both model performance and computational efficiency.
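
As an illustration of the scaling step, the sketch below wraps `StandardScaler` and `LogisticRegression` in a scikit-learn pipeline, so the scaler is fitted only on the data passed to `fit`. It runs on synthetic stand-in data: `X_demo` and `y_demo` are hypothetical names and the values are randomly generated. It shows one common pattern, not necessarily the approach this notebook ends up using.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-in for two crops with different K and ph (not real measurements).
X_demo = np.vstack([
    np.column_stack([rng.normal(40, 5, 50), rng.normal(6.5, 0.3, 50)]),
    np.column_stack([rng.normal(200, 5, 50), rng.normal(6.0, 0.3, 50)]),
])
y_demo = np.array(["rice"] * 50 + ["grapes"] * 50)

# The scaler standardises each column to zero mean and unit variance
# before the classifier sees it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

scaled = model.named_steps["standardscaler"].transform(X_demo)
train_accuracy = model.score(X_demo, y_demo)

print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
print(train_accuracy)
```

Putting the features on a comparable scale often helps gradient-based solvers converge in fewer iterations, although the effect on this particular dataset would need to be checked.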


### Optimising memory usage

Converting the `crop` column from `object` to `category` type substantially reduces memory usage: in the output below, the total drops from 189.5 kB to 72.8 kB. This is an effective preprocessing step when working with repeated string values, especially in larger datasets where efficiency matters.


In [7]:
memory_used_obj = crops.memory_usage(deep=True).sum() / 1024  # in kB
crops['crop'] = crops['crop'].astype('category')
memory_used_cat = crops.memory_usage(deep=True).sum() / 1024  # in kB
print(f'memory usage with "object" type: {memory_used_obj.round(1)} kB')
print(f'memory usage with "category" type: {memory_used_cat.round(1)} kB')


memory usage with "object" type: 189.5 kB
memory usage with "category" type: 72.8 kB


### Creating Dummy Variables for Crops

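The crop labels are converted into dummy variables, giving one binary column per crop type. Because `drop_first=True` is set, the first category (`apple`) gets no column of its own and is represented by a row of zeros, so the result has 21 columns for the 22 crops. Note that scikit-learn classifiers accept the string labels directly, and the models below are trained on the original `crop` column.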


In [8]:
crops_dummies = pd.get_dummies(crops['crop'], drop_first=True, dtype='uint8')
print(crops_dummies.sample(5))

      banana  blackgram  chickpea  coconut  coffee  cotton  grapes  jute  \
1949       0          0         0        0       0       1       0     0   
2119       0          0         0        0       1       0       0     0   
886        0          0         0        0       0       0       0     0   
1619       0          0         0        0       0       0       0     0   
1649       0          0         0        0       0       0       0     0   

      kidneybeans  lentil  ...  mango  mothbeans  mungbean  muskmelon  orange  \
1949            0       0  ...      0          0         0          0       0   
2119            0       0  ...      0          0         0          0       0   
886             0       1  ...      0          0         0          0       0   
1619            0       0  ...      0          0         0          0       1   
1649            0       0  ...      0          0         0          0       1   

      papaya  pigeonpeas  pomegranate  rice  watermelon 
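
For a classification target, integer label encoding is a common alternative to dummy variables. The sketch below contrasts `pd.get_dummies(..., drop_first=True)` with scikit-learn's `LabelEncoder` on a hypothetical four-row label series. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column with three crops.
labels = pd.Series(["rice", "apple", "maize", "apple"], dtype="category")

# drop_first=True removes the first category ("apple"),
# so "apple" rows are encoded as all zeros.
dummies = pd.get_dummies(labels, drop_first=True, dtype="uint8")

# LabelEncoder maps each class to one integer (classes are sorted alphabetically).
encoder = LabelEncoder()
codes = encoder.fit_transform(labels.astype(str).to_numpy())
restored = encoder.inverse_transform(codes)

print(list(dummies.columns))
print(list(codes))
print(list(restored))
```

`inverse_transform` recovers the original crop names, which is convenient when reporting predictions.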

## Separating Features and Target

The dataset will now be split into input features and the target variable. The feature matrix will contain the soil measurements, while the target array will store the corresponding crop labels.


In [26]:
X = crops.drop('crop', axis=1)
# `.values` on a categorical column returns a pandas Categorical, not a plain NumPy array.
y = crops['crop'].values

In [27]:
X.head()

|   | N | P | K | ph |
|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 6.502985 |
| 1 | 85 | 58 | 41 | 7.038096 |
| 2 | 60 | 55 | 44 | 7.840207 |
| 3 | 74 | 35 | 40 | 6.980401 |
| 4 | 78 | 42 | 42 | 7.628473 |


In [28]:
y[:3]

['rice', 'rice', 'rice']
Categories (22, object): ['apple', 'banana', 'blackgram', 'chickpea', ..., 'pigeonpeas', 'pomegranate', 'rice', 'watermelon']

### Setting Up Cross-Validation

A 5-fold cross-validation strategy will be used to evaluate model performance. The data will be shuffled before splitting to ensure each fold is representative of the overall dataset. A fixed random seed is set for reproducibility.


In [29]:
kf = KFold(n_splits = 5,
           shuffle = True,
           random_state = 7)
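
A `KFold` object like this one is typically passed to `cross_val_score` through its `cv` argument. The sketch below scores one feature at a time with a weighted F1 metric. It uses synthetic stand-in data (`demo` and `y_demo` are hypothetical, randomly generated), so the numbers it prints say nothing about the soil dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
n = 60

# Synthetic stand-in: K separates the two classes, ph is pure noise.
demo = pd.DataFrame({
    "K": np.concatenate([rng.normal(40, 5, n), rng.normal(200, 5, n)]),
    "ph": rng.normal(6.5, 0.5, 2 * n),
})
y_demo = np.array(["rice"] * n + ["grapes"] * n)

kf = KFold(n_splits=5, shuffle=True, random_state=7)

# Mean weighted F1 across the five folds, one single-feature model at a time.
cv_scores = {}
for feature in demo.columns:
    fold_scores = cross_val_score(
        LogisticRegression(max_iter=1000),
        demo[[feature]],
        y_demo,
        cv=kf,
        scoring="f1_weighted",
    )
    cv_scores[feature] = fold_scores.mean()

for feature, score in cv_scores.items():
    print(f"{feature}: {score:.4f}")
```

Averaging over folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting each model five times.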

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=7)
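
With 22 balanced classes, a purely random split can leave some crops slightly over- or under-represented in the test set. `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below demonstrates it on hypothetical toy arrays (`X_demo`, `y_demo`). It is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 classes with 25 samples each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat(["apple", "banana", "coffee", "rice"], 25)

# Plain random split: class counts in the test set may be uneven.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

# Stratified split: each class keeps its 25% share in the test set.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

classes, strat_counts = np.unique(y_test_strat, return_counts=True)
print(dict(zip(classes, strat_counts)))
```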

In [31]:
logreg = LogisticRegression()


In [40]:
features = ["N", "P", "K", "ph"]

# Store the weighted F1 score for each feature
feature_scores = {}

# Train a logistic regression on each individual feature
for feature in features:
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train[[feature]].values, y_train)
    y_pred = logreg.predict(X_test[[feature]].values)
    # `average` must be passed by keyword; passing it positionally raises a TypeError
    score = metrics.f1_score(y_test, y_pred, average="weighted")
    # confusion_matrix is a function in sklearn.metrics, not a method of the estimator
    print(metrics.confusion_matrix(y_test, y_pred))
    feature_scores[feature] = score

# Display the scores
for feature, score in feature_scores.items():
    print(f"{feature}: {score:.4f}")


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
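
Once per-feature scores are collected in a dictionary such as `feature_scores`, the single most predictive feature can be read off with `max`. The sketch below uses placeholder keys and made-up scores purely to show the mechanics. They are not results from this dataset, and `best_predictive_feature` is a hypothetical variable name.

```python
# Placeholder scores for illustration only (NOT results from soil_measures.csv).
feature_scores = {"feature_a": 0.10, "feature_b": 0.25, "feature_c": 0.05}

# Key with the highest score.
best_feature = max(feature_scores, key=feature_scores.get)

# Keep the winner and its score together as a one-item dictionary.
best_predictive_feature = {best_feature: feature_scores[best_feature]}

# Full ranking, highest score first, for a quick comparison.
ranking = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)

print(best_predictive_feature)
print(ranking)
```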