
Crop Yield Prediction with ML Model

In [None]:
# Title: Crop Yield Prediction with ML Models
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib


1. Synthetic Data Generation

What it does: Creates a dataset with the specified features.
How:
Generates n_samples rows (1,000 here) of synthetic data with NumPy.
Randomly assigns categorical values for Crop_Type, Irrigation_Type, Soil_Type, and Season.
Draws continuous values from uniform distributions for Farm_Area(acres), Fertilizer_Used(tons), Pesticide_Used(kg), Water_Usage(cubic meters), and the target variable Yield(tons).
Purpose: Provides placeholder data for building and testing the pipeline when real data isn't available. Because Yield(tons) is drawn independently of every feature, the data contains no real signal for a model to learn.

In [None]:
# Step 1: Generate Synthetic Data
np.random.seed(42)

n_samples = 1000

data = pd.DataFrame({
    'Farm_ID': [f'F{str(i).zfill(4)}' for i in range(1, n_samples + 1)],
    'Crop_Type': np.random.choice(['Cotton', 'Carrot', 'Wheat'], size=n_samples),
    'Irrigation_Type': np.random.choice(['Drip', 'Manual', 'Flood'], size=n_samples),
    'Soil_Type': np.random.choice(['Loamy', 'Sandy', 'Silty'], size=n_samples),
    'Season': np.random.choice(['Kharif', 'Rabi', 'Zaid'], size=n_samples),
    'Farm_Area(acres)': np.random.uniform(10, 500, size=n_samples),
    'Fertilizer_Used(tons)': np.random.uniform(1, 10, size=n_samples),
    'Pesticide_Used(kg)': np.random.uniform(0.5, 5, size=n_samples),
    'Water_Usage(cubic meters)': np.random.uniform(5000, 100000, size=n_samples),
    'Yield(tons)': np.random.uniform(5, 50, size=n_samples)
})
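
As a quick sanity check on the generated data, the sketch below measures how strongly each numerical feature correlates with the target. It rebuilds a reduced stand-in for the dataset (two features, a separate seed, and the hypothetical names `check` and `corr_with_yield`), so the exact numbers will differ from the notebook's own `data`. Since every column is sampled independently, the correlations should sit close to zero.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's data: same distributions, separate seed.
rng = np.random.default_rng(0)
n_samples = 1000
check = pd.DataFrame({
    'Farm_Area(acres)': rng.uniform(10, 500, size=n_samples),
    'Fertilizer_Used(tons)': rng.uniform(1, 10, size=n_samples),
    'Yield(tons)': rng.uniform(5, 50, size=n_samples),
})

# Pearson correlation of each feature with the target column.
corr_with_yield = check.corr()['Yield(tons)'].drop('Yield(tons)')
print(corr_with_yield.round(3))
```

Running the same check on real farm data would be a reasonable first step before training, since near-zero correlations across the board suggest the features carry little linear information about yield.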


2. Feature Preprocessing

What it does: Prepares data for machine learning by transforming it into a suitable format.
How:
Numerical Features: Standardizes features like Farm_Area(acres) and Fertilizer_Used(tons) using StandardScaler, which scales the data to have a mean of 0 and a standard deviation of 1.
Categorical Features: Encodes variables like Crop_Type using OneHotEncoder, creating one binary column per category. Categories are ordered alphabetically (Carrot, Cotton, Wheat), so Cotton → [0, 1, 0].
ColumnTransformer: Combines these transformations for numerical and categorical features into a single preprocessing step.
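Purpose: Converts every feature into a numeric form the model can consume. Random forests are insensitive to feature scaling, so the StandardScaler step matters mainly if the model is later swapped for a scale-sensitive one.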

In [None]:
# Step 2: Prepare Features and Target
X = data.drop(columns=['Farm_ID', 'Yield(tons)'])
y = data['Yield(tons)']

In [None]:
# Step 3: Preprocessing
categorical_features = ['Crop_Type', 'Irrigation_Type', 'Soil_Type', 'Season']
numerical_features = ['Farm_Area(acres)', 'Fertilizer_Used(tons)', 'Pesticide_Used(kg)', 'Water_Usage(cubic meters)']

numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
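
To make the two transformations concrete, here is a minimal sketch on a four-row toy frame (the frame `toy` and its values are made up for illustration). It shows the category order OneHotEncoder picks and confirms that StandardScaler output has mean 0 and standard deviation 1.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Crop_Type': ['Cotton', 'Carrot', 'Wheat', 'Cotton'],
    'Farm_Area(acres)': [10.0, 20.0, 30.0, 40.0],
})

# One-hot encode; the default output is sparse, so convert it to a dense array.
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['Crop_Type']]).toarray()
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded[0])              # encoding of the first row ('Cotton')

# Standardize; the scaler uses the population standard deviation (ddof=0).
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[['Farm_Area(acres)']])
print(scaled.mean(), scaled.std())
```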

3. Model Training

What it does: Fits a Random Forest Regressor to the data.
How:
Splits the data into training (80%) and testing (20%) sets using train_test_split.
Builds a pipeline that preprocesses the data and trains the model in a single fit call, so the scaler and encoder are fitted on the training data only.
Uses the Random Forest Regressor, an ensemble learning method, which combines multiple decision trees for robust predictions.
Purpose: Trains the model to predict Yield(tons) based on the input features.

In [None]:
# Step 4: Model Pipeline
model = RandomForestRegressor(random_state=42)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

In [None]:
# Step 5: Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
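
The effect of test_size=0.2 and random_state=42 can be checked on placeholder arrays. The sketch below uses made-up data (`X_demo`, `y_demo`) with the same row count as the notebook and verifies both the 800/200 partition and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 2 dummy features.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed yields the same partition.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```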

In [None]:
# Step 6: Train the Model
pipeline.fit(X_train, y_train)

4. Evaluation

What it does: Assesses the model’s performance using the Mean Squared Error (MSE).
How:
Predicts Yield(tons) for the testing data (X_test).
Calculates the MSE, which measures the average squared difference between predicted and actual values.
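Purpose: Evaluates how well the model generalizes to unseen data. Lower MSE values indicate better performance. MSE is expressed in squared units (tons²); its square root (RMSE) is in tons and is easier to compare against typical yields.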

In [None]:
# Step 7: Evaluate the Model
y_pred = pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 196.7127834054803
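
To put the reported value in context, the sketch below works through MSE by hand on three made-up predictions and computes a reference point: the variance of a Uniform(5, 50) target, which is the expected MSE of always predicting the mean yield. The arrays `y_true` and `y_hat` are illustrative and are not taken from the notebook's test set.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Squared errors are 4, 4 and 9, so MSE = 17 / 3.
manual_mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(manual_mse)  # back in the target's units (tons)
sklearn_mse = mean_squared_error(y_true, y_hat)

# Variance of Uniform(a, b) is (b - a)^2 / 12: the expected MSE
# of a constant prediction equal to the mean of the target.
baseline_mse = (50 - 5) ** 2 / 12

print(manual_mse, rmse, sklearn_mse, baseline_mse)
```

The MSE of about 196.7 reported above corresponds to an RMSE of roughly 14.0 tons and is higher than the 168.75 baseline. That is consistent with the target having been generated independently of the features: the forest fits noise in the training set and does no better than a constant guess on the test set.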


In [None]:
# Step 8: Save the Model
model_path = 'farm_yield_model.pkl'
joblib.dump(pipeline, model_path)
print(f"Model saved to {model_path}")

Model saved to farm_yield_model.pkl
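
A saved pipeline can later be restored with joblib.load and used directly on raw feature rows, because the preprocessing travels with the model. The sketch below illustrates the round trip on a reduced, hypothetical pipeline (`small_pipeline`, two features, a temporary file named demo_model.pkl) so that it runs on its own. With the notebook's farm_yield_model.pkl, the input frame would need all eight feature columns.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical reduced dataset and pipeline, for illustration only.
rng = np.random.default_rng(0)
n = 200
X_small = pd.DataFrame({
    'Crop_Type': rng.choice(['Cotton', 'Carrot', 'Wheat'], size=n),
    'Farm_Area(acres)': rng.uniform(10, 500, size=n),
})
y_small = rng.uniform(5, 50, size=n)

small_pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), ['Farm_Area(acres)']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Crop_Type']),
    ])),
    ('model', RandomForestRegressor(n_estimators=10, random_state=42)),
])
small_pipeline.fit(X_small, y_small)

# Save to a temporary location, then load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(small_pipeline, path)
    loaded = joblib.load(path)

# Predict for one new, made-up farm using raw (unprocessed) features.
new_farm = pd.DataFrame({'Crop_Type': ['Wheat'], 'Farm_Area(acres)': [120.0]})
original_pred = small_pipeline.predict(new_farm)
loaded_pred = loaded.predict(new_farm)
print(loaded_pred)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, and only from trusted sources, so recording the scikit-learn version alongside the file is a sensible habit.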
