**MODEL TO PREDICT COFFEE DISEASE RISK FOR PROACTIVE FARM MANAGEMENT**

**OVERVIEW**

This project builds a supervised machine learning model to predict the risk level of coffee disease outbreaks. The objective is to classify upcoming risk as Low, Medium, or High based on environmental and historical data, so that farmers can apply fungicides or pesticides proactively and only when necessary. The model uses weather patterns, historical pest incidence, and crop growth stages to forecast disease probability for coffee, with Coffee Leaf Rust as the main case.
Data Source: The model will be trained on:
- Weather Data: Historical and forecast meteorological data (temperature, humidity, rainfall) from the NASA POWER API, specifically tailored for agromodeling.
  - Link: https://power.larc.nasa.gov/
The model will utilize features that represent the growing environment and historical context. The project will employ predictive classification models, including Logistic Regression (base model), Decision Tree/Random Forest, and Gradient Boosting (XGBoost), to assign a risk class. This model will be used to generate actionable alerts for farmers, helping to reduce unnecessary chemical input costs, minimize environmental impact, and protect crop yields.
Model evaluation will be based on precision, recall, F1-score (critical due to potential class imbalance), and overall multi-class accuracy, with strong emphasis on the business interpretability of the risk categories.


**Data Science Team (Group 8)**
- June Masolo
- Catherine Kaino
- Joram Lemaiyan
- Kennedy Omoro
- Kigen Tuwei
- Hellen Khisa
- Alvin Ngeno

**1. BUSINESS PROBLEM**

Coffee production in Kenya is increasingly threatened by unpredictable disease outbreaks, such as Coffee Leaf Rust, which can reduce smallholder yields by up to 70% and lead to significant financial instability. Current management practices are largely reactive—farmers either wait for visible symptoms (when it's often too late) or apply expensive fungicides indiscriminately, leading to wasted capital and environmental degradation. There is a critical lack of an early-warning system that leverages environmental data to provide actionable, proactive insights.

**2. DATA UNDERSTANDING**

The project will follow the **CRISP-DM** framework.
- Extracting data from the NASA POWER API: the model uses the NASA POWER dataset from 01-01-2010 to 31-12-2020 for a coffee-growing location in Nyeri, Kenya (a major coffee zone).

In [2]:
# Importing libraries

import requests
import pandas as pd
import json

# Defining function to fetch data from NASA POWER API Weather Data

def fetch_nasa_power_data(lat, lon, start_date, end_date):
    """
    Fetches Agroclimatology data from NASA POWER API.
    Dates must be in YYYYMMDD format.
    """
    # API Endpoint for Daily Agroclimatology
    base_url = "https://power.larc.nasa.gov/api/temporal/daily/point"

    # Parameters for Coffee Disease Modeling:
    # T2M: Temp at 2m, RH2M: Humidity, PRECTOTCORR: Rainfall, WS2M: Wind Speed
    params = {
        "start": start_date,
        "end": end_date,
        "latitude": lat,
        "longitude": lon,
        "community": "ag",
        "parameters": "T2M,RH2M,PRECTOTCORR,WS2M",
        "format": "json",
        "header": "true"
    }

    print(f"Fetching data for Lat: {lat}, Lon: {lon}...")
    # A timeout keeps the call from hanging indefinitely if the API does not respond
    response = requests.get(base_url, params=params, timeout=30)

    if response.status_code == 200:
        data = response.json()

        # Extract features and dates
        features = data['properties']['parameter']
        df = pd.DataFrame(features)

        # Convert index (dates) to proper datetime format
        # The API returns dates as YYYYMMDD strings, so parse them explicitly
        df.index = pd.to_datetime(df.index, format="%Y%m%d")
        df.index.name = "Date"

        return df
    else:
        print(f"Error: {response.status_code}")
        return None

# --- EXECUTION ---
# Coordinates for Nyeri, Kenya (Major Coffee Zone)
LATITUDE = -0.4213
LONGITUDE = 36.9511
START = "20100101"
END = "20201231"

weather_data = fetch_nasa_power_data(LATITUDE, LONGITUDE, START, END)

if weather_data is not None:
    # Saving to CSV for the project
    weather_data.to_csv("kenya_coffee_weather_2010_2020.csv")
    print("Success! Data saved to 'kenya_coffee_weather_2010_2020.csv'")


Fetching data for Lat: -0.4213, Lon: 36.9511...
Success! Data saved to 'kenya_coffee_weather_2010_2020.csv'


In [3]:
# Preview the first 5 rows
print(weather_data.head())

              T2M   RH2M  PRECTOTCORR  WS2M
Date                                       
2010-01-01  16.82  86.57        10.87  1.95
2010-01-02  15.08  89.25         9.91  0.79
2010-01-03  16.30  85.61         2.40  1.84
2010-01-04  15.51  85.22         9.23  1.70
2010-01-05  16.33  78.20         3.48  1.49



The code output is a CSV file containing:

 - T2M: Daily average temperature (used to see if the fungus can grow).

 - RH2M: Relative humidity (critical for spore germination).

 - PRECTOTCORR: Daily rainfall (washes spores onto other leaves).

 - WS2M: Wind speed (disperses the disease across the farm).


In [4]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4018 entries, 2010-01-01 to 2020-12-31
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   T2M          4018 non-null   float64
 1   RH2M         4018 non-null   float64
 2   PRECTOTCORR  4018 non-null   float64
 3   WS2M         4018 non-null   float64
dtypes: float64(4)
memory usage: 157.0 KB


In [5]:
weather_data.shape

(4018, 4)

- The weather data contains 4,018 rows and 4 columns, with no null values.
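
A zero null count does not by itself rule out missing observations, because NASA POWER reports missing values with a numeric fill value (-999) rather than an empty field. The snippet below is a minimal sketch of how such sentinels could be converted to `NaN` before counting. The helper name `mask_fill_values`, the constant `FILL_VALUE`, and the toy frame are illustrative assumptions, and the fill value is worth confirming against the header of the actual API response.

```python
import numpy as np
import pandas as pd

# Assumption: -999 is the missing-data marker; confirm it in the API response header.
FILL_VALUE = -999.0


def mask_fill_values(frame: pd.DataFrame, fill_value: float = FILL_VALUE) -> pd.DataFrame:
    """Return a copy of the frame with sentinel fill values replaced by NaN."""
    return frame.replace(fill_value, np.nan)


# Toy frame standing in for weather_data (hypothetical values)
sample = pd.DataFrame(
    {"T2M": [16.82, -999.0, 16.30], "RH2M": [86.57, 89.25, -999.0]},
    index=pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03"]),
)

cleaned = mask_fill_values(sample)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Applied to the real data, `mask_fill_values(weather_data).isna().sum()` would show whether any of the 4,018 rows carry a sentinel instead of a measurement.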

**Feature engineering**

We will use a combination of column renaming, date-based mapping for the Crop Stage, and a rule-based logic for the Risk Label.

In Kenya, coffee has a specific seasonality: the "Flowering" stage typically occurs after the rains start (around March/April and again in October/November). Disease risk isn't just about weather; a plant is often more vulnerable during the Flowering and Early Cherry stages than it is during pruning.

To create a predictive model, we need a "Target" column (the Risk Level). For this project we use Agronomic Rules, the scientific logic used by plant pathologists, to label the data. For Coffee Leaf Rust (CLR), research shows that the fungus Hemileia vastatrix thrives when:
- Temperature is between 15°C and 30°C (optimal: 21°C to 25°C).
- Relative humidity is very high (>90%) for at least 24 to 48 hours.
- Rainfall is present (to splash spores) but not so heavy that it washes them away completely.
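
The humidity condition above has a duration component (24 to 48 hours) that a single-day threshold cannot express. With daily data, one rough proxy is to require the daily mean humidity to stay at or above the threshold for two consecutive days. The sketch below illustrates that idea with a rolling minimum; the function name `sustained_high_humidity`, its defaults, and the toy values are assumptions for illustration, not part of the project's labeling rules.

```python
import pandas as pd


def sustained_high_humidity(
    humidity: pd.Series, threshold: float = 90.0, days: int = 2
) -> pd.Series:
    """Flag days where humidity stayed >= threshold for `days` consecutive days.

    The window ends on the current day, so the first `days - 1` entries
    are always False (not enough history).
    """
    return humidity.rolling(window=days).min() >= threshold


# Toy daily mean humidity values (hypothetical)
humidity = pd.Series(
    [92.0, 91.0, 85.0, 93.0, 95.0],
    index=pd.date_range("2010-01-01", periods=5, freq="D"),
)

flags = sustained_high_humidity(humidity)
print(flags.tolist())
```

A daily mean is only an approximation of hourly leaf wetness, so this flag is best read as a coarse indicator rather than a measured 48-hour exposure.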

In [6]:
# 1. Rename the existing columns
# .rename() takes a dictionary that maps the technical NASA codes to readable titles.

column_mapping = {
    'T2M': 'Temp (Avg)',
    'RH2M': 'Humidity (%)',
    'PRECTOTCORR': 'Rainfall (mm)',
    'WS2M': 'Wind Speed (m/s)'
}
df = weather_data.rename(columns=column_mapping)
df.head()

            Temp (Avg)  Humidity (%)  Rainfall (mm)  Wind Speed (m/s)
Date                                                                 
2010-01-01       16.82         86.57          10.87              1.95
2010-01-02       15.08         89.25           9.91              0.79
2010-01-03       16.30         85.61           2.40              1.84
2010-01-04       15.51         85.22           9.23              1.70
2010-01-05       16.33         78.20           3.48              1.49


In [8]:
# 2. Function to define Crop Stage based on Kenya's Coffee Calendar
# Crop stages: based on research from the Coffee Research Institute (CRI) in Kenya,
# coffee follows a bimodal cycle. Flowering is triggered by the "Long Rains" and the "Short Rains".

def get_crop_stage(date):
    month = date.month
    if month in [3, 4, 10, 11]:
        return "Flowering"
    elif month in [5, 6, 12, 1]:
        return "Cherry Development"
    elif month in [7, 8, 9, 2]:
        return "Harvesting/Pruning"
    # Fallback only: the branches above already cover all 12 months
    return "Vegetative"

# 3. Apply the functions to create new columns
df['Crop Stage'] = df.index.map(get_crop_stage)
df.head()

            Temp (Avg)  Humidity (%)  Rainfall (mm)  Wind Speed (m/s)          Crop Stage
Date                                                                                     
2010-01-01       16.82         86.57          10.87              1.95  Cherry Development
2010-01-02       15.08         89.25           9.91              0.79  Cherry Development
2010-01-03       16.30         85.61           2.40              1.84  Cherry Development
2010-01-04       15.51         85.22           9.23              1.70  Cherry Development
2010-01-05       16.33         78.20           3.48              1.49  Cherry Development


In [9]:
# 4. Disease Risk Labeling Logic
# Target creation: this creates the ground-truth label. During training, the model
# will try to learn why a specific day was labeled "High" versus "Low".

def label_disease_risk(row):
    temp = row['Temp (Avg)']
    hum = row['Humidity (%)']
    rain = row['Rainfall (mm)']

    # Logic: High Humidity + Optimal Temp + Rainfall = High Risk
    if (21 <= temp <= 25) and (hum >= 90) and (rain > 0.1):
        return "High"
    elif (15 <= temp <= 30) and (hum >= 75):
        return "Medium"
    else:
        return "Low"

df['Risk Label (Target)'] = df.apply(label_disease_risk, axis=1)

# Display the formatted output
print(df[['Temp (Avg)', 'Humidity (%)', 'Rainfall (mm)', 'Crop Stage', 'Risk Label (Target)']].head())

            Temp (Avg)  Humidity (%)  Rainfall (mm)          Crop Stage  \
Date                                                                      
2010-01-01       16.82         86.57          10.87  Cherry Development   
2010-01-02       15.08         89.25           9.91  Cherry Development   
2010-01-03       16.30         85.61           2.40  Cherry Development   
2010-01-04       15.51         85.22           9.23  Cherry Development   
2010-01-05       16.33         78.20           3.48  Cherry Development   

           Risk Label (Target)  
Date                            
2010-01-01              Medium  
2010-01-02              Medium  
2010-01-03              Medium  
2010-01-04              Medium  
2010-01-05              Medium  
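
The first five rows all fall in the Medium class, and the overview flags class imbalance as a concern for evaluation, so it is worth checking how the labels are distributed across the full dataset. The sketch below shows one way to summarize counts and shares per class. The helper `class_balance` and the toy labels are hypothetical and say nothing about the real distribution.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a label column."""
    counts = labels.value_counts()
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})


# Toy labels standing in for df['Risk Label (Target)'] (hypothetical values)
toy_labels = pd.Series(["Medium", "Medium", "Low", "Medium", "High", "Low"])

balance = class_balance(toy_labels)
print(balance)
```

On the project data the call would be `class_balance(df['Risk Label (Target)'])`. A very small High share would support leaning on per-class recall and F1 rather than overall accuracy.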


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4018 entries, 2010-01-01 to 2020-12-31
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Temp (Avg)           4018 non-null   float64
 1   Humidity (%)         4018 non-null   float64
 2   Rainfall (mm)        4018 non-null   float64
 3   Wind Speed (m/s)     4018 non-null   float64
 4   Crop Stage           4018 non-null   object 
 5   Risk Label (Target)  4018 non-null   object 
dtypes: float64(4), object(2)
memory usage: 219.7+ KB


In [11]:
# Moving the Date from the Index to a regular Column
df = df.reset_index()

# check the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date                 4018 non-null   datetime64[ns]
 1   Temp (Avg)           4018 non-null   float64       
 2   Humidity (%)         4018 non-null   float64       
 3   Rainfall (mm)        4018 non-null   float64       
 4   Wind Speed (m/s)     4018 non-null   float64       
 5   Crop Stage           4018 non-null   object        
 6   Risk Label (Target)  4018 non-null   object        
dtypes: datetime64[ns](1), float64(4), object(2)
memory usage: 219.9+ KB


In [12]:
df.head()

         Date  Temp (Avg)  Humidity (%)  Rainfall (mm)  Wind Speed (m/s)          Crop Stage  Risk Label (Target)
0  2010-01-01       16.82         86.57          10.87              1.95  Cherry Development               Medium
1  2010-01-02       15.08         89.25           9.91              0.79  Cherry Development               Medium
2  2010-01-03       16.30         85.61           2.40              1.84  Cherry Development               Medium
3  2010-01-04       15.51         85.22           9.23              1.70  Cherry Development               Medium
4  2010-01-05       16.33         78.20           3.48              1.49  Cherry Development               Medium


**Lagging**

To build a model that actually helps a farmer, we must account for the incubation period, the "time lag" between the weather event and the visible outbreak.

In Kenya, specifically for Coffee Leaf Rust (CLR), research shows that the lag between high humidity/rainfall and a measurable peak in disease can range from 15 to 30 days, though a shorter lag of 8 to 15 days is often used for early warning systems.
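
One common way to encode such a lag is to shift each weather column back by a fixed number of days, so that every row carries the conditions observed that many days earlier. The sketch below assumes one row per consecutive calendar day, sorted by date; the helper `add_lag_features` and the short lags used on the toy data are illustrative choices, not the project's final feature set.

```python
import pandas as pd


def add_lag_features(frame: pd.DataFrame, columns: list, lags: list) -> pd.DataFrame:
    """Add one `<column>_lag<k>` column per (column, lag) pair.

    Assumes rows are consecutive daily observations in date order,
    so shifting by k rows equals looking back k days.
    """
    out = frame.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out


# Toy data reusing the first five humidity readings from the preview above
toy = pd.DataFrame(
    {"Humidity (%)": [86.57, 89.25, 85.61, 85.22, 78.20]},
    index=pd.date_range("2010-01-01", periods=5, freq="D"),
)

lagged = add_lag_features(toy, ["Humidity (%)"], lags=[1, 2])
print(lagged)
```

For the early-warning window mentioned above, the same helper could be called with `lags=[8, 15]` on the humidity and rainfall columns. The first rows of each lagged column are `NaN` because no earlier observation exists, so they need to be dropped or imputed before training.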