# Crop Recommendation System — Exploratory Data Analysis

**Dataset:** Crop Recommendation Dataset  
**Source:** [Kaggle — Crop Recommendation Dataset](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset)  
**Notebook:** 01 — Exploratory Data Analysis  
**Authors:** Group E  

---

## Business Context

Precision agriculture addresses a dual challenge: meeting rising global food demand while minimising the environmental footprint of farming. Data-driven methods make it possible to replace uniform, resource-intensive practices with targeted, site-specific interventions that optimise inputs such as fertilisers and water while sustaining or improving crop yields.

A central challenge in this domain is **crop selection**: recommending the most suitable crop for a given plot of land based on its soil composition and prevailing climatic conditions. Poor crop-soil-climate alignment leads to suboptimal yields, excessive use of fertilisers (particularly nitrogen, phosphorus, and potassium), and avoidable economic losses for farmers. Machine learning models trained on soil and climate data can offer a scalable, low-cost way to support agronomic decision-making.

The dataset used in this project was compiled to support the development of such recommendation systems. It captures seven agronomic and environmental variables — soil macronutrient levels (N, P, K), temperature, humidity, soil pH, and rainfall — alongside the crop label that represents the optimal crop for those conditions. The dataset covers **22 distinct crop types** and comprises **2,200 observations**, with a perfectly balanced distribution of 100 samples per crop.
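
The balance described above can be checked directly once the data is loaded. The following is a minimal sketch, assuming the target column is a pandas Series of crop names; the helper names `class_balance` and `is_balanced` are hypothetical, and the toy labels are for illustration only.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, sorted by class name."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "share": counts / counts.sum(),
    })


def is_balanced(labels: pd.Series) -> bool:
    """True when every class has exactly the same number of samples."""
    return labels.value_counts().nunique() == 1


# Toy labels for illustration only (not rows from the real dataset).
toy = pd.Series(["rice", "maize", "rice", "maize", "mango", "mango"])
print(class_balance(toy))
print("Balanced:", is_balanced(toy))
```

On the real data, `is_balanced(df['label'])` would be expected to return `True` if the 100-samples-per-crop description holds.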

## Objective of This Notebook

This notebook constitutes the **Exploratory Data Analysis (EDA)** phase of the project. Its primary objectives are to:

1. Assess the structural integrity of the dataset (completeness, types, duplicates);
2. Characterise the statistical distribution of each feature;
3. Identify relationships between features and the target variable;
4. Detect multicollinearity and redundant features;
5. Generate data-driven hypotheses and insights to inform subsequent feature engineering and modelling decisions.

All findings are summarised at the end of this notebook as a structured set of insights that directly feed into `02_feature_engineering.ipynb` and `03_modeling.ipynb`.
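
Objective 4 lends itself to a short illustration. The sketch below shows one possible way to flag strongly correlated feature pairs; it is not necessarily the approach taken later in this notebook. The helper name `high_correlation_pairs`, the 0.8 threshold, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List numeric feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    # Masked (NaN) entries, if any survive stacking, fail this comparison and drop out.
    return (pairs[pairs["abs_corr"] >= threshold]
            .sort_values("abs_corr", ascending=False)
            .reset_index(drop=True))


# Synthetic data for illustration only: "b" is built to be nearly collinear with "a".
rng = np.random.default_rng(42)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),  # independent noise
})
print(high_correlation_pairs(toy, threshold=0.8))
```

The threshold is a judgment call; a lower value surfaces more candidate pairs at the cost of more false alarms.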

## 0) Notebook Setup

---
### Data Dictionary

The table below describes all variables present in the dataset. Understanding the agronomic meaning of each feature is essential for contextualising the statistical patterns identified during analysis.

| Variable | Type | Unit | Description |
|---|---|---|---|
| `N` | Numerical (integer) | kg/ha | Ratio of Nitrogen content in the soil. Nitrogen is a primary macronutrient essential for leaf and stem growth. |
| `P` | Numerical (integer) | kg/ha | Ratio of Phosphorus content in the soil. Phosphorus supports root development and energy transfer in plants. |
| `K` | Numerical (integer) | kg/ha | Ratio of Potassium content in the soil. Potassium regulates water uptake and improves disease resistance. |
| `temperature` | Numerical (continuous) | °C | Average ambient temperature of the growing environment. |
| `humidity` | Numerical (continuous) | % | Relative humidity of the surrounding air. |
| `ph` | Numerical (continuous) | — | pH value of the soil (scale 0–14). Most crops thrive in a slightly acidic to neutral range (6.0–7.5). |
| `rainfall` | Numerical (continuous) | mm | Average annual rainfall in the crop's growing region. |
| `label` | Categorical | — | **Target variable.** The recommended crop type for the given soil and climate conditions. Contains 22 distinct crop classes. |

> **Note:** N, P, and K are collectively referred to as *NPK* — the three primary soil macronutrients universally used in fertiliser characterisation.
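
The units and scales in the data dictionary imply simple plausibility bounds that can be tested before any analysis. The sketch below is one way to do so; the `BOUNDS` mapping and the helper `out_of_range_counts` are hypothetical names. Only the pH (0–14) and humidity (0–100 %) limits come from the table above; the non-negativity bounds for N, P, K, and rainfall are assumptions.

```python
import math

import pandas as pd

# Inclusive [low, high] bounds per column. pH and humidity follow the data
# dictionary; the non-negativity bounds are assumptions.
BOUNDS = {
    "N": (0, math.inf),
    "P": (0, math.inf),
    "K": (0, math.inf),
    "humidity": (0, 100),
    "ph": (0, 14),
    "rainfall": (0, math.inf),
}


def out_of_range_counts(df: pd.DataFrame, bounds: dict = BOUNDS) -> pd.Series:
    """Count, per column, the values outside their bounds (missing values count as out of range)."""
    counts = {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
        if col in df.columns
    }
    return pd.Series(counts, name="out_of_range")


# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "N": [90, -5],
    "humidity": [82.0, 101.5],
    "ph": [6.5, 7.0],
})
print(out_of_range_counts(toy))
```

A non-zero count does not prove an error on its own, but it marks rows worth inspecting before modelling.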

In [1]:
# =============================================================================
# SECTION 0 — IMPORTS & CONFIGURATION
# =============================================================================

# --- Standard Library ---
import warnings
warnings.filterwarnings('ignore')

# --- Data Manipulation ---
import numpy as np
import pandas as pd

# --- Data Visualisation ---
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns

# --- Display Settings ---
pd.set_option('display.max_columns', None)       # Show all columns in DataFrames
pd.set_option('display.float_format', '{:.3f}'.format)  # Limit float decimals for readability

In [2]:
# =============================================================================
# CONFIGURATION — Reproducibility & Visual Style
# =============================================================================

# Random seed for reproducibility across all stochastic operations
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# --- Plot Style ---
sns.set_theme(style='whitegrid', font_scale=1.1)
plt.rcParams.update({
    'figure.dpi': 120,
    'figure.facecolor': 'white',
    'axes.titlesize': 13,
    'axes.labelsize': 11,
    'axes.titleweight': 'bold',
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
})

# Consistent colour palette used throughout this notebook
PALETTE = 'Set2'
ACCENT_COLOR = '#2ecc71'  # Primary accent for single-colour plots

print("Configuration loaded successfully.")
print(f"  Random seed : {RANDOM_SEED}")
print("  Plot style  : whitegrid | DPI 120")
print(f"  Palette     : {PALETTE}")

Configuration loaded successfully.
  Random seed : 42
  Plot style  : whitegrid | DPI 120
  Palette     : Set2


In [5]:
# =============================================================================
# DATA LOADING
# =============================================================================


DATA_PATH = '../data/raw/Crop_recommendation.csv'
df = pd.read_csv(DATA_PATH)

print("Dataset loaded successfully.")
print(f"  Shape   : {df.shape[0]} rows × {df.shape[1]} columns")
print(f"  Columns : {df.columns.tolist()}")
print()
df.head(10)

Dataset loaded successfully.
  Shape   : 2200 rows × 8 columns
  Columns : ['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall', 'label']



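| | N | P | K | temperature | humidity | ph | rainfall | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 20.88 | 82.003 | 6.503 | 202.936 | rice |
| 1 | 85 | 58 | 41 | 21.77 | 80.32 | 7.038 | 226.656 | rice |
| 2 | 60 | 55 | 44 | 23.004 | 82.321 | 7.84 | 263.964 | rice |
| 3 | 74 | 35 | 40 | 26.491 | 80.158 | 6.98 | 242.864 | rice |
| 4 | 78 | 42 | 42 | 20.13 | 81.605 | 7.628 | 262.717 | rice |
| 5 | 69 | 37 | 42 | 23.058 | 83.37 | 7.073 | 251.055 | rice |
| 6 | 69 | 55 | 38 | 22.709 | 82.639 | 5.701 | 271.325 | rice |
| 7 | 94 | 53 | 40 | 20.278 | 82.894 | 5.719 | 241.974 | rice |
| 8 | 89 | 54 | 38 | 24.516 | 83.535 | 6.685 | 230.446 | rice |
| 9 | 68 | 58 | 38 | 23.224 | 83.033 | 6.336 | 221.209 | rice |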


---
## 1) Data Loading & Initial Overview

Before any analytical work, it is essential to verify the structural integrity of the dataset. This section examines data types, completeness, and the presence of duplicate records. These checks establish whether the data is fit for analysis and flag any issues that would require remediation prior to modelling.

In [6]:
# =============================================================================
# 1.1 — SCHEMA & DATA TYPES
# =============================================================================

print("=" * 55)
print(" DATASET SCHEMA")
print("=" * 55)
df.info()
print()
print(f"Total observations : {df.shape[0]}")
print(f"Total features     : {df.shape[1] - 1}  (excluding target)")
print(f"Target variable    : 'label' — {df['label'].nunique()} unique classes")

 DATASET SCHEMA
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   N            2200 non-null   int64  
 1   P            2200 non-null   int64  
 2   K            2200 non-null   int64  
 3   temperature  2200 non-null   float64
 4   humidity     2200 non-null   float64
 5   ph           2200 non-null   float64
 6   rainfall     2200 non-null   float64
 7   label        2200 non-null   object 
dtypes: float64(4), int64(3), object(1)
memory usage: 137.6+ KB

Total observations : 2200
Total features     : 7  (excluding target)
Target variable    : 'label' — 22 unique classes


In [7]:
# =============================================================================
# 1.2 — MISSING VALUES
# =============================================================================

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_report = pd.DataFrame({
    'Missing Count': missing,
    'Missing (%)': missing_pct
})

print("=" * 55)
print(" MISSING VALUES REPORT")
print("=" * 55)
print(missing_report)
print()

if missing.sum() == 0:
    print(">> No missing values detected. Dataset is complete.")
else:
    print(f">> WARNING: {missing.sum()} missing value(s) found. Remediation required.")

 MISSING VALUES REPORT
             Missing Count  Missing (%)
N                        0        0.000
P                        0        0.000
K                        0        0.000
temperature              0        0.000
humidity                 0        0.000
ph                       0        0.000
rainfall                 0        0.000
label                    0        0.000

>> No missing values detected. Dataset is complete.


In [None]:
# =============================================================================
# 1.3 — DUPLICATE RECORDS
# =============================================================================

n_duplicates = df.duplicated().sum()

print("=" * 55)
print(" DUPLICATE RECORDS REPORT")
print("=" * 55)
print(f"Duplicate rows found : {n_duplicates}")
print()

if n_duplicates > 0:
    print(">> Duplicate rows detected:")
    print(df[df.duplicated(keep=False)].sort_values(by=df.columns.tolist()).head(10))
    print()
    print(">> ACTION: Duplicates will be removed before modelling.")
else:
    print(">> No duplicate rows detected.")
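
The cell above only reports duplicates; the removal it announces could look like the following sketch. The helper name `drop_exact_duplicates` is hypothetical, and keeping the first occurrence of each duplicated row is an assumption rather than a decision stated in this notebook.

```python
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without fully duplicated rows, keeping the first occurrence."""
    before = len(df)
    deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
    print(f"Removed {before - len(deduped)} duplicate row(s); {len(deduped)} remain.")
    return deduped


# Toy frame for illustration only: the first and third rows are identical.
toy = pd.DataFrame({
    "N": [90, 85, 90],
    "ph": [6.5, 7.0, 6.5],
    "label": ["rice", "rice", "rice"],
})
clean = drop_exact_duplicates(toy)
```

If duplicates are removed from the real data, the per-class counts should be rechecked afterwards, since removal can disturb the class balance.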

## 2) Target Variable Analysis

## 3) Univariate Analysis — Feature Distributions

## 4) Bivariate Analysis — Features vs. Target

## 5) Correlation & Multivariate Analysis

## 6) Key Insights & Implications for Modelling