## 📚 Data Collection Methods

Any data science project starts with collecting data. There are **three main ways** to collect data:

### 1. Buy data from third-party vendors
- Purchase pre-collected and labeled datasets
- Good for specific use cases when data is available commercially

### 2. Collect and annotate data on your own
- Manually collect images or other data from the field
- Work with domain experts (e.g., farmers, doctors) to label the data
- Time-consuming, but gives you a custom dataset

### 3. Use publicly available datasets
- Download free datasets from platforms like Kaggle, Google, GitHub
- Suitable for learning and research projects
- **This project uses the PlantVillage dataset** (publicly available)

---

In [None]:
"""
Potato Disease Classification - Data Collection & Preprocessing
Following the YouTube tutorial by CodeBasics

This notebook covers:
1. Import libraries
2. Download and load PlantVillage dataset
3. Load images into TensorFlow dataset
4. Visualize images
5. Train/validation/test split (80/10/10)
6. Apply cache, shuffle, prefetch optimizations
7. Create preprocessing layers (resize, rescale)
8. Create data augmentation layers
"""

# Step 1: Import necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import numpy as np

print(f"TensorFlow version: {tf.__version__}")

TensorFlow version: 2.19.0


In [None]:
# Step 2: Download and load PlantVillage dataset
import os

# Define constants
IMAGE_SIZE = 256
BATCH_SIZE = 32
CHANNELS = 3
DATASET_DIR = "/content/PlantVillage"

# Download dataset using gdown (Google Drive link)
print("Downloading PlantVillage dataset from Google Drive...")
!gdown --folder https://drive.google.com/drive/folders/1-B_VLj1BxNfqNp0oNOQgGlVOlPEQJvz7 -O /content/PlantVillage --quiet --no-check-certificate

print("\nDataset download attempt complete.")
# Check if the directory exists
if os.path.exists(DATASET_DIR):
    print(f"Dataset location: {DATASET_DIR}")
    print("Dataset directory created successfully.")
else:
    print(f"Dataset directory not found at {DATASET_DIR}.")
    print("Please check the download link and try again.")

Downloading PlantVillage dataset from Google Drive...
Failed to retrieve folder contents

Dataset download attempt complete.
Dataset directory not found at /content/PlantVillage.
Please check the download link and try again.


In [None]:
# Step 3: Alternative download method using wget and a working dataset URL
import zipfile
import os

print("Downloading PlantVillage dataset using wget...")
# Using a reliable GitHub repository with the PlantVillage dataset
!wget -q https://github.com/spMohanty/PlantVillage-Dataset/archive/master.zip -O /content/plantvillage.zip

print("Extracting dataset...")
with zipfile.ZipFile('/content/plantvillage.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/')

print("Dataset extracted!")

# Find the potato disease folder
for root, dirs, files in os.walk('/content/'):
    if 'Potato___Early_blight' in dirs or any('potato' in d.lower() for d in dirs):
        print(f"Found dataset at: {root}")
        # Update the DATASET_DIR if we find the correct location
        if 'Potato___Early_blight' in dirs:
            DATASET_DIR = root
            break

print(f"\nFinal dataset directory: {DATASET_DIR}")
print("\nChecking directory contents...")
if os.path.exists(DATASET_DIR):
    contents = os.listdir(DATASET_DIR)
    print(f"Contents: {contents}")

Downloading PlantVillage dataset using wget...
Extracting dataset...
Dataset extracted!
Found dataset at: /content/
Found dataset at: /content/potato_dataset

Final dataset directory: /content/potato_dataset

Checking directory contents...
Contents: ['Potato___Late_blight', 'Potato___healthy', 'Potato___Early_blight']


In [None]:
# Step 4: Filter and organize potato disease dataset
import shutil

# Source directory with all plant diseases
source_dir = "/content/PlantVillage-Dataset-master/raw/color"

# Create a new directory for only potato diseases
potato_dataset_dir = "/content/potato_dataset"
os.makedirs(potato_dataset_dir, exist_ok=True)

# Copy only potato-related folders
print("Filtering potato disease images...")
for folder in os.listdir(source_dir):
    if folder.startswith('Potato'):
        src_path = os.path.join(source_dir, folder)
        dst_path = os.path.join(potato_dataset_dir, folder)
        if os.path.isdir(src_path):
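            # dirs_exist_ok=True (Python 3.8+) lets this cell be re-run; without it,
            # copytree raises FileExistsError once the destination folder exists
            shutil.copytree(src_path, dst_path, dirs_exist_ok=True)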
            print(f"Copied {folder}: {len(os.listdir(dst_path))} images")

# Update DATASET_DIR to point to potato-only dataset
DATASET_DIR = potato_dataset_dir

print(f"\nPotato dataset ready at: {DATASET_DIR}")
print(f"\nClasses found:")
for folder in sorted(os.listdir(DATASET_DIR)):
    folder_path = os.path.join(DATASET_DIR, folder)
    if os.path.isdir(folder_path):
        print(f"  - {folder}: {len(os.listdir(folder_path))} images")

Filtering potato disease images...


FileExistsError: [Errno 17] File exists: '/content/potato_dataset/Potato___Late_blight'
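
The per-class listing in the filtering cell can also be written as a small reusable helper, which is handy for checking class balance before loading the images. The following is a minimal sketch that assumes one sub-folder per class; the function name `count_images_per_class` and the extension set are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_images_per_class(root):
    """Return {class_folder_name: number_of_image_files} for each sub-folder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts
```

In this notebook it could be called as `count_images_per_class(DATASET_DIR)` once the potato folders are in place.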

In [None]:
# Step 5: Load images into TensorFlow dataset
print("Loading dataset into TensorFlow...")

# Load the dataset
dataset = tf.keras.utils.image_dataset_from_directory(
    DATASET_DIR,
    seed=123,  # For reproducibility
    image_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE
)

# Get class names
class_names = dataset.class_names
print(f"\nClass names: {class_names}")
print(f"Number of classes: {len(class_names)}")

In [None]:
# Step 6: Visualize sample images
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
for images, labels in dataset.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

plt.tight_layout()
plt.show()

In [None]:
# Step 7: Split dataset into train, validation, and test (80/10/10)
print("Splitting dataset into train, validation, and test sets...")

# Get the total number of batches
total_batches = len(dataset)
print(f"Total batches: {total_batches}")

# Calculate split sizes (80% train, 10% validation, 10% test)
train_size = int(0.8 * total_batches)
val_size = int(0.1 * total_batches)
test_size = total_batches - train_size - val_size

print(f"Train batches: {train_size}")
print(f"Validation batches: {val_size}")
print(f"Test batches: {test_size}")

# Split the dataset
train_ds = dataset.take(train_size)
remaining = dataset.skip(train_size)
val_ds = remaining.take(val_size)
test_ds = remaining.skip(val_size)

print("\nDataset split complete!")
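
Because the split is done on batches rather than on individual images, the resulting proportions are only approximately 80/10/10. The sketch below reproduces the same arithmetic in plain Python, using the 2152-image total and batch size of 32 reported in this notebook; the helper name `split_batch_counts` is illustrative.

```python
import math

def split_batch_counts(n_images, batch_size, train_frac=0.8, val_frac=0.1):
    """Mirror the take/skip split: return batch and image counts per subset."""
    total_batches = math.ceil(n_images / batch_size)   # the last batch may be partial
    train_b = int(train_frac * total_batches)
    val_b = int(val_frac * total_batches)
    test_b = total_batches - train_b - val_b           # the remainder goes to test
    # Every batch is full except possibly the final one, which lands in the
    # test set (assuming the test set is non-empty).
    train_n = train_b * batch_size
    val_n = val_b * batch_size
    test_n = n_images - train_n - val_n
    return {"batches": (train_b, val_b, test_b), "images": (train_n, val_n, test_n)}

print(split_batch_counts(2152, 32))
# {'batches': (54, 6, 8), 'images': (1728, 192, 232)}
```

With these numbers the effective split is roughly 80/9/11, since `int()` truncates the validation share and the test set absorbs the remainder.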

In [None]:
# Step 8: Apply cache, shuffle, and prefetch optimizations
print("Applying performance optimizations...")

# AUTOTUNE allows TensorFlow to automatically determine optimal buffer sizes
AUTOTUNE = tf.data.AUTOTUNE

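# Apply optimizations to training dataset
# Note: train_ds is already batched, so shuffle() reorders whole batches each epoch;
# a buffer of 1000 exceeds the number of training batches, so the order is fully shuffled
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)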

# Apply cache and prefetch to validation and test datasets
# Note: No shuffle for val and test sets to maintain consistency
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

print("Optimizations applied!")
print("\nExplanation:")
print("- cache(): Keeps images in memory after first epoch (faster training)")
print("- shuffle(): Randomizes order of training data (better generalization)")
print("- prefetch(): Prepares next batch while GPU trains current batch (reduces idle time)")
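
To see why the shuffle buffer size matters, it can help to simulate a fixed-size shuffle buffer in plain Python. This is a simplified model of the idea, not TensorFlow's implementation; the name `buffered_shuffle` is hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in the order a fixed-size shuffle buffer would emit them.

    Assumes buffer_size >= 1.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)            # still filling the buffer
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]              # emit a random buffered element...
            buffer[idx] = item             # ...and put the new item in its slot
    rng.shuffle(buffer)                    # input exhausted: drain in random order
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=1)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With `buffer_size=1` the order is unchanged, a small buffer only mixes neighbouring elements, and a buffer at least as large as the dataset gives a full shuffle.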

In [None]:
# Step 9: Create preprocessing layers (resize and rescale)
print("Creating preprocessing layers...")

# Resize layer - ensures all images are same size (256x256)
resize_and_rescale = tf.keras.Sequential([
    layers.Resizing(IMAGE_SIZE, IMAGE_SIZE),
    layers.Rescaling(1./255)  # Normalize pixel values from [0,255] to [0,1]
])

print("Preprocessing layers created!")
print(f"\nResize: All images will be resized to {IMAGE_SIZE}x{IMAGE_SIZE}")
print("Rescale: Pixel values normalized from [0,255] to [0,1]")

In [None]:
# Step 10: Create data augmentation layers
print("Creating data augmentation layers...")

# Data augmentation helps prevent overfitting by creating variations of training images
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),  # Rotate by up to ±20% of a full turn (0.2 * 2π radians = ±72°)
])

print("Data augmentation layers created!")
print("\nAugmentation techniques:")
print("- RandomFlip: Randomly flips images horizontally and vertically")
print("- RandomRotation: Randomly rotates images by up to ±20% of a full turn (±72°)")
print("\nThese augmentations help the model generalize better to new images!")
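
The `RandomRotation` factor is a fraction of a full turn, which is easy to misread as degrees or as a percentage of 90°. The small helper below converts a single float factor into the corresponding angle range; `rotation_range_degrees` is an illustrative name, and the sketch covers only the single-float form of the argument.

```python
import math

def rotation_range_degrees(factor):
    """Convert a rotation factor (fraction of a full turn) to a (min, max) range in degrees."""
    max_angle = factor * 360.0
    return (-max_angle, max_angle)

def rotation_range_radians(factor):
    """Same range expressed in radians: factor * 2*pi in each direction."""
    max_angle = factor * 2.0 * math.pi
    return (-max_angle, max_angle)

low, high = rotation_range_degrees(0.2)
print(f"factor 0.2 -> {low:.0f} to {high:.0f} degrees")
```

For the factor of 0.2 used here, each image is rotated by a random angle between -72° and +72°.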

In [None]:
# Step 11: Visualize data augmentation effects
print("Visualizing data augmentation...")

plt.figure(figsize=(12, 12))
for images, labels in train_ds.take(1):
    first_image = images[0]
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        augmented_image = data_augmentation(tf.expand_dims(first_image, 0))
        plt.imshow(augmented_image[0].numpy().astype("uint8"))
        plt.title(f"Augmented {i+1}")
        plt.axis("off")

plt.suptitle("Same Image with Different Augmentations", fontsize=16)
plt.tight_layout()
plt.show()

print("\nNotice how the same image appears different each time due to random augmentations!")

In [None]:
# Summary: Data Collection and Preprocessing Complete!

print("="*80)
print("POTATO DISEASE CLASSIFICATION - DATA PREPROCESSING SUMMARY")
print("="*80)

print("\n✓ Dataset Information:")
print(f"  - Total images: 2152")
print(f"  - Classes: {len(class_names)}")
print(f"  - Class names: {class_names}")
print(f"  - Image size: {IMAGE_SIZE}x{IMAGE_SIZE}x{CHANNELS}")
print(f"  - Batch size: {BATCH_SIZE}")

print("\n✓ Data Split (80/10/10):")
print("  - Training batches: 54 (1728 images)")
print("  - Validation batches: 6 (192 images)")
print("  - Test batches: 8 (232 images; the final batch holds only 8)")

print("\n✓ Applied Optimizations:")
print("  - cache(): Caches images in memory for faster access")
print("  - shuffle(): Randomizes training data order")
print("  - prefetch(): Prepares next batch while training current batch")

print("\n✓ Preprocessing Layers:")
print("  - Resizing: All images resized to 256x256")
print("  - Rescaling: Pixel values normalized from [0,255] to [0,1]")

print("\n✓ Data Augmentation Layers:")
print("  - RandomFlip: Horizontal and vertical flips")
print("  - RandomRotation: Up to ±20% of a full turn (±72°)")

print("\n" + "="*80)
print("NEXT STEPS: Ready for model building and training!")
print("="*80)

print("\n📝 The preprocessing pipeline is ready to be used in model training.")
print("   These layers can be incorporated directly into the model architecture.")

---

# 🧠 Model Building and Training

Now we'll build a Convolutional Neural Network (CNN) to classify potato diseases.

In [None]:
# Build the CNN model
from tensorflow.keras import models, layers

model = models.Sequential([
    resize_and_rescale,
    data_augmentation,
    layers.Conv2D(32, (3, 3), activation='relu'),  # input shape is set by model.build() below
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(len(class_names), activation='softmax')
])

model.build(input_shape=(None, IMAGE_SIZE, IMAGE_SIZE, CHANNELS))
model.summary()
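
The parameter count that `model.summary()` reports can be cross-checked by hand. The sketch below assumes the architecture exactly as written above (3×3 convolutions without padding, 2×2 max pooling, 256×256×3 input, three output classes); the helper names are illustrative.

```python
def conv2d_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter for a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit for a Dense layer."""
    return in_features * units + units

def count_cnn_params(image_size=256, channels=3, n_classes=3):
    size, in_ch, total = image_size, channels, 0
    for filters in (32, 64, 64):          # the three Conv2D blocks
        total += conv2d_params(in_ch, filters)
        size = (size - 2) // 2            # unpadded 3x3 conv shrinks by 2, 2x2 pooling halves
        in_ch = filters
    flat = size * size * in_ch            # length of the Flatten output
    total += dense_params(flat, 64) + dense_params(64, n_classes)
    return flat, total

print(count_cnn_params())
# (57600, 3742979)
```

Almost all of the roughly 3.7 million parameters sit in the first `Dense` layer, because `Flatten` produces 57,600 features. The resizing, rescaling, and augmentation layers contribute no trainable parameters.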

In [None]:
# Compile the model
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=['accuracy']
)

print("Model compiled successfully!")
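
`SparseCategoricalCrossentropy` takes integer class labels rather than one-hot vectors and, with `from_logits=False`, expects probabilities such as the softmax output of the final layer. For a single sample, the loss is the negative log of the probability assigned to the true class. A minimal pure-Python sketch of that formula follows; the probabilities are made up for illustration, and Keras additionally clips probabilities to avoid `log(0)`.

```python
import math

def sparse_categorical_crossentropy(probs, true_index):
    """Loss for one sample: negative log of the probability assigned to the true class."""
    return -math.log(probs[true_index])

# Hypothetical softmax output over the three classes, with class 1 as the true label.
probs = [0.1, 0.7, 0.2]
print(round(sparse_categorical_crossentropy(probs, 1), 4))
# 0.3567
```

A confident correct prediction gives a loss near 0, while a confident wrong one is penalized heavily.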

In [None]:
# Train the model
EPOCHS = 50

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    verbose=1
)

In [None]:
# Evaluate the model on test dataset
test_loss, test_accuracy = model.evaluate(test_ds)
print(f"\nTest Accuracy: {test_accuracy:.4f}")
print(f"Test Loss: {test_loss:.4f}")
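
A single accuracy figure can hide weak performance on one class, especially if the classes are not equally represented. A hedged sketch of a per-class breakdown is shown below; it works on plain lists of integer labels, and both the helper name `per_class_accuracy` and the toy labels are illustrative rather than results from this notebook.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of samples with that true label that were predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Toy labels for illustration only (0, 1, 2 stand in for the three classes).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(per_class_accuracy(y_true, y_pred))
```

In the toy data the overall accuracy is 4 out of 6, yet class 2 is never predicted correctly, which the per-class view makes visible.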

In [None]:
# Plot training history
import matplotlib.pyplot as plt

# Get history data
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(1, len(acc) + 1)  # one point per recorded epoch, numbered from 1

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot accuracy
ax1.plot(epochs_range, acc, label='Training Accuracy', linewidth=2)
ax1.plot(epochs_range, val_acc, label='Validation Accuracy', linewidth=2)
ax1.set_xlabel('Epochs', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Training and Validation Accuracy', fontsize=14, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Plot loss
ax2.plot(epochs_range, loss, label='Training Loss', linewidth=2)
ax2.plot(epochs_range, val_loss, label='Validation Loss', linewidth=2)
ax2.set_xlabel('Epochs', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Make predictions on sample images
import numpy as np

# Function to predict on a single image
def predict_image(model, img):
    img_array = tf.expand_dims(img, 0)  # Create batch
    predictions = model.predict(img_array)
    predicted_class = class_names[np.argmax(predictions[0])]
    confidence = round(100 * np.max(predictions[0]), 2)
    return predicted_class, confidence

# Get a batch from test dataset
for images, labels in test_ds.take(1):
    # Display 9 sample predictions
    plt.figure(figsize=(15, 10))
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))

        predicted_class, confidence = predict_image(model, images[i].numpy())
        actual_class = class_names[labels[i]]

        color = 'green' if predicted_class == actual_class else 'red'
        plt.title(f"Actual: {actual_class}\nPredicted: {predicted_class}\nConfidence: {confidence}%",
                  color=color, fontsize=10, fontweight='bold')
        plt.axis("off")

    plt.tight_layout()
    plt.show()

In [None]:
# Save the model with versioning
import os

# Create models directory if it doesn't exist
model_dir = 'saved_models'
os.makedirs(model_dir, exist_ok=True)

# Find the next version number
existing_versions = []
for f in os.listdir(model_dir):
    if f.startswith('potato_model_v') and f.endswith('.keras'):
        try:
            v = int(f.split('_v')[1].split('.')[0])
            existing_versions.append(v)
        except ValueError:
            pass  # skip files whose version suffix is not an integer

version = max(existing_versions) + 1 if existing_versions else 1

# Save the model with .keras extension
model_path = os.path.join(model_dir, f'potato_model_v{version}.keras')
model.save(model_path)

print(f"\n✅ Model saved successfully!")
print(f"📁 Location: {model_path}")
print(f"📊 Version: {version}")
print(f"🎯 Test Accuracy: {test_accuracy:.4f}")
print(f"\nModel can be loaded using: tf.keras.models.load_model('{model_path}')")
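
The version lookup can also be expressed with a regular expression, which avoids the chained `split` calls and the `try`/`except`. This is one possible sketch, assuming the same `potato_model_v<N>.keras` naming scheme; `next_model_version` is an illustrative helper, not part of the original notebook.

```python
import re

VERSION_PATTERN = re.compile(r"^potato_model_v(\d+)\.keras$")

def next_model_version(filenames):
    """Return 1 + the highest version found among names like potato_model_v3.keras."""
    versions = [int(m.group(1)) for name in filenames
                if (m := VERSION_PATTERN.match(name))]
    return max(versions, default=0) + 1

print(next_model_version(["potato_model_v1.keras", "potato_model_v2.keras", "notes.txt"]))
# 3
```

In the notebook it could be called as `next_model_version(os.listdir(model_dir))`; files that do not match the pattern are ignored.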

---

## 🎉 Project Complete!

### Summary

We have successfully built and trained a Convolutional Neural Network (CNN) for **Potato Disease Classification**.

#### 📊 Model Performance:
- **Training Accuracy**: 92.51%
- **Validation Accuracy**: ~96%
- **Test Accuracy**: 89.66%

#### 🏗️ Model Architecture:
- **Input Layer**: Resizing (256×256) + Rescaling (0-1)
- **Data Augmentation**: Random Flip + Random Rotation
- **3 Convolutional Blocks**:
  - Conv2D (32 filters) → MaxPooling
  - Conv2D (64 filters) → MaxPooling
  - Conv2D (64 filters) → MaxPooling
- **Dense Layers**: 64 neurons (ReLU) + 3 neurons (Softmax)
- **Total Parameters**: 3.7M

#### 🎯 Classes:
1. Potato Early Blight
2. Potato Late Blight
3. Potato Healthy

#### 💾 Model Saved:
- Location: `saved_models/potato_model_v1.keras`
- Can be loaded and used for predictions on new potato leaf images

#### ✅ Completed Steps:
1. ✓ Data Collection & Preprocessing
2. ✓ Data Visualization & Exploration
3. ✓ Train/Val/Test Split (80/10/10)
4. ✓ Data Augmentation
5. ✓ Model Building (CNN)
6. ✓ Model Training (50 epochs)
7. ✓ Model Evaluation
8. ✓ Training History Visualization
9. ✓ Predictions on Test Images
10. ✓ Model Saving with Versioning

In [None]:
# 1. Upload image
from google.colab import files
from PIL import Image
import io
import numpy as np

uploaded = files.upload()
img_path = next(iter(uploaded))
img = Image.open(io.BytesIO(uploaded[img_path])).convert('RGB')
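img = img.resize((IMAGE_SIZE, IMAGE_SIZE))   # Match the training image size

# 2. Prepare image
# Keep pixel values in [0, 255]: the model's own Rescaling layer already divides by 255
img_array = np.array(img, dtype="float32")
img_array = np.expand_dims(img_array, 0)  # Add batch dimension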

# 3. Predict
pred = model.predict(img_array)
predicted_class = class_names[np.argmax(pred)]
confidence = round(100 * np.max(pred), 2)

print(f"Predicted: {predicted_class}")
print(f"Confidence: {confidence}%")


In [None]:
from google.colab import files
# Download the file written by the save cell (model_path is defined there)
files.download(model_path)
