[//]: # ( Horticultural Data Analysis Template )
[//]: # ( License: MIT License )
[//]: # ( Repository: https://github.com/outobecca/botanical-colabs )

# 🔬 Horticultural Data Analysis Template
**Template Version 1.0** | Created: 2025-11-04

## 📋 Overview

**Purpose:** Template for analyzing environmental sensor data, soil tests, and plant measurements.

**Use this template for:** Loading and cleaning horticultural datasets, exploratory data analysis, anomaly detection, statistical summaries

### 🎯 Template Structure
This specialized template includes:
- Pre-configured imports and dependencies
- Standard helper functions for this workflow type
- Sample data generation functions
- Visualization templates
- Export and citation sections

### 📝 How to Use This Template
1. Copy this notebook to create your analysis
2. Update the header with your specific research question
3. Modify sample data generators or add data loading
4. Customize analysis and visualization sections
5. Update citations with your data sources

### ⚠️ Template Notes
- Replace [brackets] with your specific content
- Modify sample data to match your research
- Add or remove sections as needed
- Follow the established code style


## 📚 Background & Methodology

### Scientific Context
Modern horticulture uses data from:
- Environmental sensors (IoT devices, weather stations)
- Soil laboratory analyses
- Plant measurements (growth, yield)

Systematic analysis helps optimize conditions and improve outcomes.

### Methodology
1. **Data Loading** - Import from files or generate samples
2. **Data Cleaning** - Handle missing values and outliers
3. **Exploratory Analysis** - Statistics and distributions
4. **Visualization** - Charts and plots
5. **Export** - Save results
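
The five steps above can be sketched end to end on a tiny table. This is a minimal illustration rather than the template's own cells: the readings are made up, and the export goes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Data loading: hypothetical readings, invented for illustration only.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=6, freq="D"),
    "temperature_c": [21.5, 22.0, np.nan, 22.4, 21.8, 22.1],
    "humidity_percent": [58.0, 61.0, 60.0, np.nan, 59.0, 62.0],
})

# Data cleaning: fill gaps in numeric columns with the column median.
numeric_cols = raw.select_dtypes(include=[np.number]).columns
clean = raw.copy()
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# Exploratory analysis: summary statistics for the numeric columns.
summary = clean[numeric_cols].describe()
print(summary.loc[["mean", "min", "max"]])

# Export: write to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
```

The median fill shown here is the same strategy the template's cleaning helper applies to numeric columns.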

### Expected Outputs
- Summary statistics
- Time series plots
- Distribution histograms
- Correlation heatmaps
- Cleaned datasets


## ⚙️ Step 1: Installation and Configuration

Run the cells below to install libraries and configure your analysis.


In [ ]:
# ============================================================================
# Library Installation and Import
# ============================================================================
"""
Installs required Python libraries.
Run this cell first.
"""

# Installation
!pip install -q pandas numpy matplotlib seaborn scipy ipywidgets openpyxl plotly scikit-learn

# Core imports
from typing import Dict, Optional, List, Any, Tuple
from IPython.display import display, Markdown, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries installed successfully")


In [ ]:
# ============================================================================
# Interactive Configuration
# ============================================================================

# Data source selection (FORM)
print("📋 SELECT DATA SOURCE:")
data_source_options = {
    '1': 'Sample Environmental Data (30 days of sensor readings)',
    '2': 'Sample Soil Analysis Data (50 samples)',
    '3': 'Sample Plant Growth Data (100 plants)',
    '4': 'Upload My File (CSV, Excel, or JSON)'
}

for key, desc in data_source_options.items():
    print(f"  [{key}] {desc}")

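DATA_SOURCE_CHOICE = input("Enter choice (1-4): ").strip() or '1'
if DATA_SOURCE_CHOICE not in data_source_options:
    print(f"⚠️ Unknown choice '{DATA_SOURCE_CHOICE}'; falling back to option 1")
    DATA_SOURCE_CHOICE = '1'

# Upload file if needed
UPLOADED_DATA = None
if DATA_SOURCE_CHOICE == '4':
    print("📤 Upload your file using the Google Colab file picker below")
    from google.colab import files  # Only available inside Google Colab
    uploaded = files.upload()
    if uploaded:
        filename = list(uploaded.keys())[0]
        print(f"✅ Uploaded: {filename}")
        # Load the file based on its extension (case-insensitive)
        lower_name = filename.lower()
        if lower_name.endswith('.csv'):
            UPLOADED_DATA = pd.read_csv(filename)
        elif lower_name.endswith(('.xlsx', '.xls')):
            UPLOADED_DATA = pd.read_excel(filename)
        elif lower_name.endswith('.json'):
            UPLOADED_DATA = pd.read_json(filename)
        if UPLOADED_DATA is not None:
            print(f"📊 Loaded {len(UPLOADED_DATA)} rows")
        else:
            print("⚠️ Unsupported file type; sample environmental data will be used instead")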

# Outlier handling (FORM)
print("🎯 OUTLIER DETECTION:")
print("  [1] Remove outliers (Z-score method)")
print("  [2] Remove outliers (IQR method)")
print("  [3] Keep all data")

OUTLIER_CHOICE = input("Enter choice (1-3): ").strip() or '1'
REMOVE_OUTLIERS = OUTLIER_CHOICE in ['1', '2']
OUTLIER_METHOD = 'zscore' if OUTLIER_CHOICE == '1' else 'iqr'

# The threshold only applies to the Z-score method
if REMOVE_OUTLIERS and OUTLIER_METHOD == 'zscore':
    Z_THRESHOLD = float(input("Z-score threshold (default 3.0): ").strip() or '3.0')
else:
    Z_THRESHOLD = 3.0

print("✅ Configuration complete!")
print(f"   Data source: {data_source_options.get(DATA_SOURCE_CHOICE, data_source_options['1'])}")
print(f"   Outlier handling: {OUTLIER_METHOD if REMOVE_OUTLIERS else 'None'}")


## 🔧 Step 2: Helper Functions

Data processing utilities.
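
The two outlier rules offered in the configuration cell behave differently on small samples. The sketch below applies both to six made-up readings with one obvious spike. The largest z-score any value can reach in a sample of n points is (n - 1) / sqrt(n), which is about 2.04 for n = 6, so a threshold of 3.0 cannot flag anything here, while the IQR rule does.

```python
import pandas as pd

# Hypothetical readings with one obvious spike (values invented for illustration).
readings = pd.Series([10, 11, 12, 11, 10, 50], dtype=float)

# Z-score rule: distance from the mean in units of the sample standard deviation.
z = (readings - readings.mean()).abs() / readings.std()
z_flags = z > 3.0

# IQR rule: anything more than 1.5 * IQR outside the quartiles.
q1, q3 = readings.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flags = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)

print("z-score flags:", int(z_flags.sum()))  # → 0
print("IQR flags:", int(iqr_flags.sum()))    # → 1
```

For short datasets, such as a few dozen soil samples, the IQR option or a lower z-score threshold may therefore be the more useful choice.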


In [ ]:
# ============================================================================
# Helper Functions
# ============================================================================

def generate_environmental_data(days=30):
    """Generate sample environmental sensor data."""
    np.random.seed(42)
    dates = pd.date_range(end=datetime.now(), periods=days*24, freq='h')  # 'H' is deprecated in pandas >= 2.2
    hours = np.array([d.hour for d in dates])
    
    # Realistic patterns
    temp = 22 + 5 * np.sin((hours - 6) * np.pi / 12) + np.random.normal(0, 1, len(dates))
    humidity = 60 - 15 * np.sin((hours - 6) * np.pi / 12) + np.random.normal(0, 3, len(dates))
    light = np.maximum(0, 400 * np.sin((hours - 6) * np.pi / 12) + np.random.normal(0, 30, len(dates)))
    
    df = pd.DataFrame({
        'timestamp': dates,
        'temperature_c': temp,
        'humidity_percent': humidity,
        'light_ppfd': light,
        'soil_moisture': 65 + np.random.normal(0, 5, len(dates))
    })
    
    # Add anomalies
    anomalies = np.random.choice(len(df), 5, replace=False)
    df.loc[anomalies, 'temperature_c'] += np.random.choice([-10, 10], 5)
    
    return df

def generate_soil_data(n=50):
    """Generate sample soil analysis data."""
    np.random.seed(42)
    return pd.DataFrame({
        'sample_id': [f'SOIL_{i:03d}' for i in range(1, n+1)],
        'location': np.random.choice(['Field A', 'Field B', 'Field C'], n),
        'ph': np.clip(np.random.normal(6.5, 0.5, n), 5.0, 8.0),
        'nitrogen_ppm': np.clip(np.random.normal(45, 10, n), 0, 100),
        'phosphorus_ppm': np.clip(np.random.normal(30, 8, n), 0, 80),
        'potassium_ppm': np.clip(np.random.normal(180, 30, n), 0, 300),
        'organic_matter_%': np.clip(np.random.normal(4.5, 1.2, n), 1, 10),
        'date': pd.date_range(end=datetime.now(), periods=n)
    })

def generate_plant_data(n=100):
    """Generate sample plant growth data."""
    np.random.seed(42)
    treatments = ['Control', 'Treatment A', 'Treatment B']
    df = pd.DataFrame({
        'plant_id': [f'P{i:04d}' for i in range(1, n+1)],
        'variety': np.random.choice(['Var1', 'Var2', 'Var3'], n),
        'treatment': np.random.choice(treatments, n),
        'height_cm': np.random.normal(45, 12, n),
        'leaf_count': np.random.poisson(25, n),
        'yield_g': np.random.normal(125, 30, n),
        'date': pd.date_range(end=datetime.now(), periods=n)[::-1]
    })
    
    # Treatment effects
    for treat, factor in {'Control': 1.0, 'Treatment A': 1.15, 'Treatment B': 1.25}.items():
        mask = df['treatment'] == treat
        df.loc[mask, ['height_cm', 'yield_g']] *= factor
    
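    # Clip only numeric columns; comparing the string and date columns with 0 raises a TypeError
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].clip(lower=0)
    return df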

def detect_outliers(df, column, method='zscore', threshold=3.0):
    """Detect outliers in a column."""
    if method == 'zscore':
        z = np.abs((df[column] - df[column].mean()) / df[column].std())
        return z > threshold
    else:  # IQR
        Q1, Q3 = df[column].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        return (df[column] < Q1 - 1.5*IQR) | (df[column] > Q3 + 1.5*IQR)

def clean_data(df, remove_outliers=True, method='zscore', threshold=3.0):
    """Clean dataframe."""
    df_clean = df.copy()
    
    # Fill missing values
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())
    
    # Remove outliers
    if remove_outliers:
        outlier_mask = pd.Series(False, index=df_clean.index)
        for col in numeric_cols:
            outlier_mask |= detect_outliers(df_clean, col, method, threshold)
        df_clean = df_clean[~outlier_mask]
        print(f"🗑️ Removed {outlier_mask.sum()} rows containing outliers")
    
    return df_clean

print("✅ Helper functions loaded")


## 📡 Step 3: Load Data

Load the selected dataset.
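
Outside Colab, where the upload widget is not available, a file can be read directly from disk. The sketch below assumes a hypothetical helper named `load_table` (not part of this template) that picks a pandas reader from the file extension and raises an error for anything else.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical helper: read a table based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path)  # .xlsx files need openpyxl installed
    if suffix == ".json":
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix or 'no extension'}")


# Usage with a throwaway CSV file (contents invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "readings.csv"
    csv_path.write_text("temperature_c,humidity_percent\n21.5,58\n22.0,61\n")
    data = load_table(csv_path)

print(data.shape)
```

Raising on unknown extensions, rather than silently returning nothing, makes a wrong file type visible before the cleaning step runs.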


In [ ]:
# ============================================================================
# Load Data
# ============================================================================

if UPLOADED_DATA is not None:
    data = UPLOADED_DATA
    print("✅ Using uploaded data")
elif DATA_SOURCE_CHOICE == '1':
    data = generate_environmental_data(30)
    print("✅ Generated environmental data (720 readings)")
elif DATA_SOURCE_CHOICE == '2':
    data = generate_soil_data(50)
    print("✅ Generated soil data (50 samples)")
elif DATA_SOURCE_CHOICE == '3':
    data = generate_plant_data(100)
    print("✅ Generated plant growth data (100 plants)")
else:
    data = generate_environmental_data(30)
    print("✅ Using default environmental data")

print(f"📊 Dataset shape: {data.shape}")
print(f"📋 Columns: {list(data.columns)}")

# Preview
display(Markdown("### 🔍 Data Preview"))
display(data.head(10))


## 🚀 Step 4: Clean and Analyze Data

Clean the data and compute statistics.


In [ ]:
# ============================================================================
# Data Cleaning and Analysis
# ============================================================================

print("🔄 Cleaning data...")
data_clean = clean_data(data, REMOVE_OUTLIERS, OUTLIER_METHOD, Z_THRESHOLD)

print(f"📊 Original: {len(data)} rows")
print(f"📊 Cleaned: {len(data_clean)} rows")

# Summary statistics
display(Markdown("### 📈 Summary Statistics"))
display(data_clean.describe())

# Missing values (reported for the original data, before median filling)
display(Markdown("### 🔍 Missing Values"))
missing = data.isnull().sum()
if missing.sum() > 0:
    display(missing[missing > 0])
else:
    print("✅ No missing values")

# Data types
display(Markdown("### 📋 Data Types"))
display(pd.DataFrame({'Type': data_clean.dtypes, 'Count': data_clean.count()}))


## 📊 Step 5: Visualizations

Create exploratory visualizations.
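
The time-series plot in this step picks the first column whose name contains "time" or "date". For uploaded CSV files those columns usually arrive as plain strings, so converting them first gives a proper time axis. A minimal sketch, assuming made-up data with one unparseable entry:

```python
import pandas as pd

# Hypothetical uploaded data: timestamps arrive as plain strings.
df = pd.DataFrame({
    "timestamp": ["2025-01-01 00:00", "2025-01-01 01:00", "not recorded"],
    "temperature_c": [21.5, 22.0, 21.8],
})

# Same name-based detection the plotting cell uses.
time_cols = [c for c in df.columns if "time" in c.lower() or "date" in c.lower()]

for col in time_cols:
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Drop rows without a usable timestamp and keep chronological order.
df = df.dropna(subset=time_cols).sort_values(time_cols[0])
print(df.dtypes)
```

Whether to drop or repair rows with bad timestamps depends on the dataset; dropping is simply the shortest option to show here.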


In [ ]:
# ============================================================================
# Data Visualization
# ============================================================================

numeric_cols = data_clean.select_dtypes(include=[np.number]).columns

# Distribution plots
if len(numeric_cols) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    for i, col in enumerate(numeric_cols[:4]):
        axes[i].hist(data_clean[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
        axes[i].set_title(f'Distribution: {col}', fontweight='bold')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Frequency')
        axes[i].grid(True, alpha=0.3)
    
    # Hide extra subplots
    for i in range(len(numeric_cols), 4):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()

# Correlation heatmap
if len(numeric_cols) > 1:
    display(Markdown("### 🔥 Correlation Heatmap"))
    plt.figure(figsize=(10, 8))
    corr = data_clean[numeric_cols].corr()
    sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, square=True, fmt='.2f')
    plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Time series (if timestamp column exists)
time_col = [c for c in data_clean.columns if 'time' in c.lower() or 'date' in c.lower()]
if time_col and len(numeric_cols) > 0:
    display(Markdown("### 📅 Time Series"))
    fig, ax = plt.subplots(figsize=(14, 6))
    for col in numeric_cols[:3]:  # Plot first 3 numeric columns
        ax.plot(data_clean[time_col[0]], data_clean[col], label=col, marker='o', markersize=2)
    ax.set_xlabel('Time', fontsize=12)
    ax.set_ylabel('Value', fontsize=12)
    ax.set_title('Time Series Plot', fontsize=14, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

print("✅ Visualizations complete")


## 📚 Step 6: Export and Citations

Export cleaned data and document sources.
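
A quick way to check an export is to read the file back. CSV stores timestamps as text, so the sketch below passes `parse_dates` when reloading. The small frame stands in for `data_clean`, and the temporary directory is only there to keep the example self-contained.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data standing in for `data_clean`.
cleaned = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-01", "2025-01-02"]),
    "temperature_c": [21.5, 22.0],
})

with tempfile.TemporaryDirectory() as tmp:
    # Timestamped filename, following the pattern used in the export cell.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_path = Path(tmp) / f"cleaned_data_{stamp}.csv"
    cleaned.to_csv(out_path, index=False)

    # Read the file back; parse_dates restores the datetime column.
    restored = pd.read_csv(out_path, parse_dates=["timestamp"])

print(restored.dtypes)
```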


In [ ]:
# ============================================================================
# Export Results
# ============================================================================

# Export cleaned data
export_filename = f"cleaned_data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
data_clean.to_csv(export_filename, index=False)
print(f"✅ Exported: {export_filename}")

# Summary report
display(Markdown(f"""
### 📋 Analysis Summary

**Date:** {datetime.now().strftime('%Y-%m-%d %H:%M')}  
**Dataset:** {data_source_options.get(DATA_SOURCE_CHOICE, DATA_SOURCE_CHOICE)}  
**Original rows:** {len(data)}  
**Cleaned rows:** {len(data_clean)}  
**Columns:** {len(data_clean.columns)}  
**Outlier method:** {OUTLIER_METHOD if REMOVE_OUTLIERS else 'None'}

### 📚 Data Sources
- Sample data generated using NumPy (BSD License)
- Analysis performed using pandas and NumPy; plots created with Matplotlib and seaborn

### 📖 Citation
If using this notebook, please cite:
> Botanical Colabs (2025). Horticultural Data Analysis & Exploration. 
> https://github.com/outobecca/botanical-colabs

### 📝 Notes
- Always verify results with domain experts
- Sample data is for demonstration only
- Cleaned uploaded data may have different characteristics than the sample data
"""))

print("✅ Analysis complete!")
