# Exploratory Data Analysis: Monet Painting Dataset

In this notebook, we'll perform exploratory data analysis of the Monet Painting Dataset. We'll examine the characteristics of both the Monet paintings and the photographs, analyze color distributions, identify key style elements, and prepare the data for our GAN model.

## 1. Load Libraries

In [None]:
# Load essential libraries
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import os
import random
import cv2
import tensorflow as tf
from sklearn.cluster import KMeans
from collections import Counter
from skimage import color
import math

# Set plot style - using a style compatible with newer matplotlib versions
plt.style.use('default')

# For reproducibility
np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)

## 2. Load Data

First, we'll load the dataset and examine its structure. We'll set up paths for both the Kaggle environment and local development.

In [None]:
# Define paths
# Check if we're in Kaggle environment
IN_KAGGLE = os.path.exists('/kaggle/input')

if IN_KAGGLE:
    # Kaggle paths
    MONET_JPG_DIR = "/kaggle/input/gan-getting-started/monet_jpg"
    PHOTO_JPG_DIR = "/kaggle/input/gan-getting-started/photo_jpg"
    MONET_TFREC_DIR = "/kaggle/input/gan-getting-started/monet_tfrec"
    PHOTO_TFREC_DIR = "/kaggle/input/gan-getting-started/photo_tfrec"
else:
    # Local paths - adjust these based on your data location
    BASE_DIR = '../data'
    MONET_JPG_DIR = os.path.join(BASE_DIR, 'monet_jpg')
    PHOTO_JPG_DIR = os.path.join(BASE_DIR, 'photo_jpg')
    MONET_TFREC_DIR = os.path.join(BASE_DIR, 'monet_tfrec')
    PHOTO_TFREC_DIR = os.path.join(BASE_DIR, 'photo_tfrec')

# Check if the paths exist
print(f"Monet JPG directory exists: {os.path.exists(MONET_JPG_DIR)}")
print(f"Photo JPG directory exists: {os.path.exists(PHOTO_JPG_DIR)}")
print(f"Monet TFRecord directory exists: {os.path.exists(MONET_TFREC_DIR)}")
print(f"Photo TFRecord directory exists: {os.path.exists(PHOTO_TFREC_DIR)}")

In [None]:
# Count the number of images in each directory
try:
    monet_jpg_count = len([f for f in os.listdir(MONET_JPG_DIR) if f.endswith('.jpg')])
    photo_jpg_count = len([f for f in os.listdir(PHOTO_JPG_DIR) if f.endswith('.jpg')])
    
    print(f"Number of Monet paintings: {monet_jpg_count}")
    print(f"Number of photographs: {photo_jpg_count}")
except Exception as e:
    print(f"Error counting images: {e}")
    print("Please ensure the dataset is downloaded and the paths are correctly set.")
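The count above only matches a lowercase `.jpg` suffix and relies on a broad `try`/`except`. As a minimal sketch, the same count can be written with `pathlib` so that it ignores case and treats a missing directory as empty. The function name `count_images` and the `extensions` default are illustrative assumptions, not part of the dataset tooling.

```python
from pathlib import Path


def count_images(directory, extensions=(".jpg", ".jpeg")):
    """Count files in `directory` whose suffix matches `extensions`, ignoring case."""
    root = Path(directory)
    if not root.is_dir():
        return 0  # Treat a missing directory as empty rather than raising
    return sum(
        1
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Calling `count_images(MONET_JPG_DIR)` should agree with the count printed above whenever every file uses the lowercase `.jpg` suffix.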

## 3. Helper Functions

Let's define some helper functions to load and process images.

In [None]:
def load_image(image_path):
    """Load an image from a file path."""
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.cast(img, tf.float32) / 255.0  # Normalize to [0,1]
    return img.numpy()

def load_random_images(directory, n=5):
    """Load n random images from a directory."""
    image_files = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith('.jpg')]
    selected_files = random.sample(image_files, min(n, len(image_files)))
    return [load_image(f) for f in selected_files], selected_files

def display_images(images, titles=None, cols=5, figsize=(15, 10)):
    """Display a list of images in a grid."""
    n_images = len(images)
    rows = math.ceil(n_images / cols)
    
    fig = plt.figure(figsize=figsize)
    for i, image in enumerate(images):
        ax = fig.add_subplot(rows, cols, i+1)
        if titles is not None:
            ax.set_title(titles[i])
        # Draw on the subplot's own axes instead of relying on the implicit current axes
        ax.imshow(image)
        ax.axis('off')
    plt.tight_layout()
    plt.show()
    
def get_image_dimensions(directory):
    """Get dimensions of all images in a directory."""
    dimensions = []
    for filename in os.listdir(directory):
        if filename.endswith('.jpg'):
            img_path = os.path.join(directory, filename)
            with Image.open(img_path) as img:
                dimensions.append(img.size)
    return dimensions
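One caveat with `load_random_images`: `os.listdir` returns files in an arbitrary, platform-dependent order, so seeding `random` alone may not select the same files on every machine. A minimal sketch of a reproducible selection, assuming a hypothetical helper named `sample_image_paths`, sorts the listing first and uses a local generator:

```python
import os
import random


def sample_image_paths(directory, n=5, seed=42, extension=".jpg"):
    """Pick up to n image paths reproducibly, independent of listing order."""
    files = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(extension)
    )
    # A local generator leaves the global random state untouched
    rng = random.Random(seed)
    return rng.sample(files, min(n, len(files)))
```

The returned paths can be passed to `load_image` one at a time, just as `load_random_images` does internally.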

## 4. Visual Exploration

Let's visualize some sample images from both the Monet paintings and photographs datasets.

In [None]:
# Load random Monet paintings
monet_images, monet_files = load_random_images(MONET_JPG_DIR, n=5)
monet_titles = [os.path.basename(f) for f in monet_files]

# Display Monet paintings
print("Sample Monet Paintings:")
display_images(monet_images, monet_titles)

In [None]:
# Load random photographs
photo_images, photo_files = load_random_images(PHOTO_JPG_DIR, n=5)
photo_titles = [os.path.basename(f) for f in photo_files]

# Display photographs
print("Sample Photographs:")
display_images(photo_images, photo_titles)

## 5. Image Dimensions Analysis

Let's analyze the dimensions of the images in both datasets to ensure they are consistent.

In [None]:
# Get dimensions of Monet paintings
try:
    monet_dimensions = get_image_dimensions(MONET_JPG_DIR)
    monet_dim_counter = Counter(monet_dimensions)
    
    print("Monet Paintings Dimensions:")
    for dim, count in monet_dim_counter.items():
        print(f"{dim}: {count} images")
except Exception as e:
    print(f"Error analyzing Monet dimensions: {e}")

In [None]:
# Get dimensions of photographs
try:
    photo_dimensions = get_image_dimensions(PHOTO_JPG_DIR)
    photo_dim_counter = Counter(photo_dimensions)
    
    print("Photographs Dimensions:")
    for dim, count in photo_dim_counter.items():
        print(f"{dim}: {count} images")
except Exception as e:
    print(f"Error analyzing Photo dimensions: {e}")
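Reading the printed counters by eye works for a few sizes, but a programmatic check is easier to reuse. The sketch below condenses a list of `(width, height)` tuples, such as the output of `get_image_dimensions`, into the most common size, its share of the images, and a flag for whether every image has that size. The name `summarize_dimensions` is an assumption made for illustration.

```python
from collections import Counter


def summarize_dimensions(dimensions):
    """Return (most_common_size, share_of_images, is_uniform) for (width, height) tuples."""
    if not dimensions:
        raise ValueError("no dimensions supplied")
    counts = Counter(dimensions)
    size, n = counts.most_common(1)[0]
    # is_uniform is True only when a single distinct size occurs
    return size, n / len(dimensions), len(counts) == 1
```

For example, `summarize_dimensions(monet_dimensions)` returns a `True` flag only if every Monet image has the same size.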

## 6. Color Analysis

Now, let's analyze the color distributions in both datasets to understand Monet's distinctive style.

In [None]:
def extract_color_palette(image, n_colors=5):
    """Extract the dominant colors from an image using K-means clustering."""
    # Reshape the image to be a list of pixels
    pixels = image.reshape(-1, 3)
    
    # Cluster the pixel colors; n_init is set explicitly because its default
    # changed across scikit-learn versions
    clt = KMeans(n_clusters=n_colors, n_init=10, random_state=42)
    clt.fit(pixels)
    
    # Count the number of pixels in each cluster
    hist = Counter(clt.labels_)
    # Sort by frequency
    hist = sorted(hist.items(), key=lambda x: x[1], reverse=True)
    
    # Get the colors
    colors = clt.cluster_centers_
    
    # Return colors sorted by frequency
    return [colors[label] for label, _ in hist]

def plot_color_palette(colors, title="Color Palette"):
    """Plot a color palette."""
    plt.figure(figsize=(10, 2))
    plt.title(title)
    # Loop variable renamed so it does not shadow the imported skimage `color` module
    for i, swatch in enumerate(colors):
        plt.fill_betweenx(y=[0, 1], x1=i, x2=i+1, color=swatch)
    plt.xlim(0, len(colors))
    plt.yticks([])
    plt.xticks([])
    plt.show()

In [None]:
# Analyze color palettes of Monet paintings
print("Color Palettes of Monet Paintings:")
for i, image in enumerate(monet_images[:3]):  # Analyze first 3 images
    colors = extract_color_palette(image)
    plot_color_palette(colors, title=f"Monet Painting {i+1} - {monet_titles[i]}")

In [None]:
# Analyze color palettes of photographs
print("Color Palettes of Photographs:")
for i, image in enumerate(photo_images[:3]):  # Analyze first 3 images
    colors = extract_color_palette(image)
    plot_color_palette(colors, title=f"Photograph {i+1} - {photo_titles[i]}")
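Clustering every pixel is fine for three images but gets slow if we extend the palette analysis to many more. One option, sketched here under the assumption that a random subset of pixels represents the palette well enough, is to subsample before clustering. The helper name `sample_pixels` and the `max_pixels` default are illustrative choices.

```python
import numpy as np


def sample_pixels(image, max_pixels=10_000, seed=42):
    """Return at most `max_pixels` RGB rows drawn without replacement from an (H, W, 3) image."""
    pixels = np.asarray(image).reshape(-1, 3)
    if len(pixels) <= max_pixels:
        return pixels  # Small images are returned whole
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pixels), size=max_pixels, replace=False)
    return pixels[idx]
```

The result has shape `(n, 3)`, so it can be passed directly to `KMeans.fit` in place of the full pixel array.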

## 7. RGB and HSV Color Distribution Analysis

Let's analyze the RGB and HSV color distributions to better understand the differences between Monet paintings and photographs.

In [None]:
def plot_rgb_histograms(image, title="RGB Histograms"):
    """Plot RGB histograms for an image."""
    plt.figure(figsize=(15, 5))
    plt.suptitle(title)
    
    # Loop variable named `channel` so it does not shadow the imported skimage `color` module
    channels = ['red', 'green', 'blue']
    for i, channel in enumerate(channels):
        plt.subplot(1, 3, i+1)
        plt.title(f"{channel.capitalize()} Channel")
        plt.hist(image[:, :, i].flatten(), bins=50, color=channel, alpha=0.7)
        plt.xlim(0, 1)
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

def plot_hsv_histograms(image, title="HSV Histograms"):
    """Plot HSV histograms for an image."""
    # Convert RGB to HSV
    hsv_image = color.rgb2hsv(image)
    
    plt.figure(figsize=(15, 5))
    plt.suptitle(title)
    
    channels = ['Hue', 'Saturation', 'Value']
    colors = ['purple', 'magenta', 'black']
    
    for i, (channel, plot_color) in enumerate(zip(channels, colors)):
        plt.subplot(1, 3, i+1)
        plt.title(channel)
        plt.hist(hsv_image[:, :, i].flatten(), bins=50, color=plot_color, alpha=0.7)
        plt.xlim(0, 1)
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Analyze RGB and HSV distributions for a Monet painting
monet_sample = monet_images[0]
plt.figure(figsize=(5, 5))
plt.imshow(monet_sample)
plt.title("Sample Monet Painting")
plt.axis('off')
plt.show()

plot_rgb_histograms(monet_sample, title="RGB Histograms - Monet Painting")
plot_hsv_histograms(monet_sample, title="HSV Histograms - Monet Painting")

In [None]:
# Analyze RGB and HSV distributions for a photograph
photo_sample = photo_images[0]
plt.figure(figsize=(5, 5))
plt.imshow(photo_sample)
plt.title("Sample Photograph")
plt.axis('off')
plt.show()

plot_rgb_histograms(photo_sample, title="RGB Histograms - Photograph")
plot_hsv_histograms(photo_sample, title="HSV Histograms - Photograph")
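Histograms are easy to read for one image but hard to compare across many. As a rough numeric companion, the sketch below computes the mean saturation and mean value of an RGB image in `[0, 1]`, using the standard HSV definitions (value is the channel maximum; saturation is the max-min spread divided by the value, and zero for black pixels). The function name `saturation_value_stats` is an assumption for illustration.

```python
import numpy as np


def saturation_value_stats(image):
    """Return mean HSV saturation and mean HSV value for an RGB image scaled to [0, 1]."""
    rgb = np.asarray(image, dtype=np.float64)
    value = rgb.max(axis=-1)
    delta = value - rgb.min(axis=-1)
    # Saturation is delta / value, defined as 0 where value == 0 (black pixels)
    saturation = np.divide(delta, value, out=np.zeros_like(delta), where=value > 0)
    return {
        "mean_saturation": float(saturation.mean()),
        "mean_value": float(value.mean()),
    }
```

Averaging these numbers over a larger sample of each dataset would show whether the saturation difference seen in single histograms holds in general.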

## 8. Texture Analysis

Let's analyze the texture characteristics of Monet paintings compared to photographs using edge detection.

In [None]:
def compute_edge_density(image):
    """Compute the edge density of an image using Canny edge detection."""
    # Convert to grayscale
    gray = cv2.cvtColor((image * 255).astype(np.uint8), cv2.COLOR_RGB2GRAY)
    
    # Apply Canny edge detection
    edges = cv2.Canny(gray, 100, 200)
    
    # Compute edge density
    edge_density = np.sum(edges > 0) / (edges.shape[0] * edges.shape[1])
    
    return edge_density, edges

def plot_edge_detection(image, title="Edge Detection"):
    """Plot original image and its edge detection result."""
    edge_density, edges = compute_edge_density(image)
    
    plt.figure(figsize=(12, 5))
    plt.suptitle(f"{title} - Edge Density: {edge_density:.4f}")
    
    plt.subplot(1, 2, 1)
    plt.title("Original Image")
    plt.imshow(image)
    plt.axis('off')
    
    plt.subplot(1, 2, 2)
    plt.title("Edge Detection")
    plt.imshow(edges, cmap='gray')
    plt.axis('off')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Analyze texture of Monet paintings
print("Texture Analysis of Monet Paintings:")
for i, image in enumerate(monet_images[:2]):  # Analyze first 2 images
    plot_edge_detection(image, title=f"Monet Painting {i+1}")

In [None]:
# Analyze texture of photographs
print("Texture Analysis of Photographs:")
for i, image in enumerate(photo_images[:2]):  # Analyze first 2 images
    plot_edge_detection(image, title=f"Photograph {i+1}")
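Two images per dataset are too few to support a general claim about edge density. A minimal sketch of how a per-image metric could be summarized over a larger sample follows. The helper `summarize_metric` is hypothetical and accepts any function that maps an image to a number.

```python
import numpy as np


def summarize_metric(images, metric_fn):
    """Apply `metric_fn` to each image and return count, mean, median, and std."""
    values = np.array([float(metric_fn(img)) for img in images], dtype=np.float64)
    if values.size == 0:
        raise ValueError("no images supplied")
    return {
        "n": int(values.size),
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "std": float(values.std()),  # Population standard deviation
    }
```

With the functions defined in this section, `summarize_metric(monet_images, lambda img: compute_edge_density(img)[0])` gives the edge-density summary for the Monet sample, and the same call on `photo_images` gives the photograph side of the comparison.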

## 9. Key Findings and Insights

Based on our exploratory data analysis, we can draw several tentative insights about the differences between Monet paintings and photographs. The palette, histogram, and edge analyses each looked at only a handful of randomly sampled images, so these observations should be confirmed on a larger sample:

1. **Color Palette**: In the sampled images, Monet paintings tend to feature a more vibrant and varied color palette, with an emphasis on blues, greens, and warm tones. The photographs tend to have more natural and muted colors.

2. **Texture**: The sampled Monet paintings show a distinctive brushstroke texture that produces a higher edge density than the photographs. This impressionistic style is characterized by visible brushstrokes rather than smooth transitions.

3. **Color Distribution**: The RGB and HSV histograms of the sample images suggest that Monet paintings have a broader distribution of colors, particularly in the blue channel, and higher saturation values than photographs.

4. **Image Dimensions**: Both datasets contain images of consistent dimensions, so they can be batched for GAN training without additional resizing.

These insights will guide our approach to developing a GAN model that can effectively transform photographs into Monet-style paintings. The model will need to learn to:

- Adjust the color palette to match Monet's style
- Add appropriate texture and brushstroke effects
- Enhance certain color channels and saturation levels
- Preserve the overall composition and subject matter of the original photographs
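As a small preparation step toward that model, images loaded in `[0, 1]` are often rescaled to `[-1, 1]` for GAN training. The sketch below assumes a generator whose output layer uses `tanh`, which is a common choice but is not specified in this notebook. The helper names `to_gan_range` and `from_gan_range` are illustrative.

```python
import numpy as np


def to_gan_range(image):
    """Map an image from [0, 1] to [-1, 1]."""
    return np.asarray(image, dtype=np.float32) * 2.0 - 1.0


def from_gan_range(image):
    """Map an image from [-1, 1] back to [0, 1], clipping any overshoot."""
    return np.clip((np.asarray(image, dtype=np.float32) + 1.0) / 2.0, 0.0, 1.0)
```

`from_gan_range` is the inverse needed before displaying generated images with `plt.imshow`, which expects floats in `[0, 1]`.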

In the next notebook, we'll design and implement our GAN architecture based on these findings.