# Build Your First Machine Learning Project - Part 1 (Google Colab Edition)
# Data Wrangling with Gemini API

This notebook covers the first part of a machine learning project: wrangling the bear dataset in Google Colab and using the Gemini API to extract features from bear images.

## What We'll Cover:

1. **Data Loading** - Load the bear dataset from GitHub using pandas
2. **Image Analysis with Gemini** - Extract features from bear images using Gemini's vision capabilities
3. **Feature Extraction** - Analyze fur color, facial profile, and paw pad texture
4. **Data Preparation** - Combine all features into a final dataset
5. **Data Export** - Save the processed data to CSV files

## Prerequisites:
- Google Colab account
- Gemini API key (free from [Google AI Studio](https://makersuite.google.com/app/apikey))
- Store your API key in Colab Secrets as `BEAR_ML_KEY`

## 1. Setup and Installation

First, we'll install the necessary packages and set up our environment.

In [None]:
# Install required packages
!pip install -q google-generativeai pandas pillow requests

## 2. Configure API Key from Colab Secrets

⚠️ **IMPORTANT**: Before running this cell:
1. Click the 🔑 key icon in the left sidebar
2. Click "Add new secret"
3. Name: `BEAR_ML_KEY`
4. Value: Your Gemini API key
5. Toggle "Notebook access" ON

In [None]:
import google.generativeai as genai
from google.colab import userdata

# Get API key from Colab secrets
try:
    api_key = userdata.get('BEAR_ML_KEY')
    genai.configure(api_key=api_key)
    print("✅ API key loaded successfully from Colab secrets!")
except Exception as e:
    print("❌ Error loading API key. Make sure you've added BEAR_ML_KEY to Colab secrets.")
    print(f"Error: {e}")
    raise

## 3. Import Libraries and Setup

In [None]:
import pandas as pd
import time
import requests
from PIL import Image
from io import BytesIO
from datetime import datetime

print("✅ All libraries imported successfully!")

## 4. Load Bear Dataset

Load the raw bear data from GitHub containing physical measurements.

In [None]:
# Load bear dataset from GitHub
base_url = "https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/"
df = pd.read_csv(base_url + "bear_raw_data.csv")

print("📊 Bear Dataset Loaded:")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

## 5. Setup Rate Limiter

We'll create a rate limiter that spaces requests at least 6 seconds apart, so we never exceed 10 requests per minute (the free-tier limit this notebook assumes).

In [None]:
class RateLimiter:
    """
    Rate limiter to ensure we don't exceed Gemini free tier limits.
    Free tier: 10 requests per minute (RPM)
    """
    def __init__(self, max_requests_per_minute=10):
        self.max_rpm = max_requests_per_minute
        self.min_interval = 60.0 / max_requests_per_minute  # seconds between requests
        self.last_request_time = 0
        self.request_count = 0
        
    def wait_if_needed(self):
        """Wait if necessary to respect rate limits"""
        current_time = time.time()
        time_since_last = current_time - self.last_request_time
        
        if time_since_last < self.min_interval:
            wait_time = self.min_interval - time_since_last
            print(f"⏳ Rate limiting: waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
        
        self.last_request_time = time.time()
        self.request_count += 1
    
    def get_stats(self):
        """Get statistics about API usage"""
        return {
            'total_requests': self.request_count,
            'max_rpm': self.max_rpm
        }

# Initialize rate limiter
rate_limiter = RateLimiter(max_requests_per_minute=10)
print("✅ Rate limiter initialized (10 requests/minute max)")
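
To see the spacing logic without waiting in real time, here is a minimal sketch of the same fixed-interval idea with the clock and sleep functions passed in as arguments. `IntervalLimiter`, `FakeClock`, and the `clock`/`sleep` parameters are illustrative names, not part of the notebook's `RateLimiter`.

```python
import time


class IntervalLimiter:
    """Fixed-interval limiter; clock and sleep are injectable for testing."""

    def __init__(self, max_requests_per_minute=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_requests_per_minute  # seconds between requests
        self.clock = clock
        self.sleep = sleep
        self.last_request_time = None  # None means no request has been made yet

    def wait_if_needed(self):
        """Sleep just long enough to keep requests min_interval apart; return the wait."""
        waited = 0.0
        if self.last_request_time is not None:
            elapsed = self.clock() - self.last_request_time
            if elapsed < self.min_interval:
                waited = self.min_interval - elapsed
                self.sleep(waited)
        self.last_request_time = self.clock()
        return waited


class FakeClock:
    """Simulated clock: sleeping advances time instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
limiter = IntervalLimiter(max_requests_per_minute=10, clock=fake.time, sleep=fake.sleep)
waits = [limiter.wait_if_needed() for _ in range(3)]
print(waits)  # [0.0, 6.0, 6.0]
```

The first request goes through immediately; each later one waits out the remainder of the 6-second interval.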

## 6. Setup Gemini Model and Helper Functions

In [None]:
# Initialize the Gemini model (Flash: the faster model, used here on the free tier)
model = genai.GenerativeModel('gemini-2.5-flash')

def load_image_from_url(url):
    """Load an image from a URL"""
    try:
        # A timeout keeps one stalled download from hanging the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return Image.open(BytesIO(response.content))
    except Exception as e:
        print(f"❌ Error loading image from {url}: {e}")
        return None

def analyze_bear_image(image_url, prompt, bear_id):
    """
    Analyze a bear image using Gemini API with rate limiting.
    
    Args:
        image_url: URL of the bear image
        prompt: Analysis prompt for Gemini
        bear_id: ID of the bear being analyzed
    
    Returns:
        Analysis result as string
    """
    # Apply rate limiting
    rate_limiter.wait_if_needed()
    
    # Load image
    image = load_image_from_url(image_url)
    if image is None:
        return "Error: Could not load image"
    
    try:
        # Generate content with Gemini
        response = model.generate_content([prompt, image])
        result = response.text.strip()
        print(f"✓ Analyzed {bear_id}: {result}")
        return result
    except Exception as e:
        print(f"❌ Error analyzing {bear_id}: {e}")
        return f"Error: {str(e)}"

print("✅ Gemini model and helper functions ready!")
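
`analyze_bear_image` gives up after a single failed call. One possible extension is to retry transient failures with exponential backoff. The sketch below is an assumption about how that could look, not part of the notebook; `call_with_retries` and its parameters are hypothetical names, and the demo uses a stand-in function instead of the real API.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call func(); on an exception, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo with a stand-in function that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Brown"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)  # Brown [2.0, 4.0]
```

In the notebook, the wrapped call could be `lambda: model.generate_content([prompt, image])`. Each retry is still a request, so it would also need to pass through the rate limiter.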

## 7. Display Sample Bear Images

Let's look at examples of each bear species.

In [None]:
from IPython.display import Image as IPImage, display, HTML

# Bear species information
bears = [
    ("ABB", "American Black Bear"),
    ("EUR", "Eurasian Brown Bear"), 
    ("GRZ", "Grizzly Bear"),
    ("KDK", "Kodiak Bear")
]

image_base_url = "https://github.com/dataprofessor/bear-dataset/blob/master/images/"

print("🐻 Sample Bear Species:\n")

html_content = '<div style="display: flex; justify-content: space-around;">'
for species, name in bears:
    image_url = f"{image_base_url}{species}_01.png?raw=true"
    html_content += f'''
    <div style="text-align: center; margin: 10px;">
        <img src="{image_url}" width="200"/>
        <p><strong>{name}</strong><br/>({species}_01.png)</p>
    </div>
    '''
html_content += '</div>'

display(HTML(html_content))

## 8. Feature Extraction: Fur Color

Now we'll use Gemini to analyze the fur color of each bear.

⚠️ **Note**: This will take approximately 20 minutes to process all 200 images (10 requests/minute).
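
The estimate follows directly from the request spacing. A small sketch of the arithmetic, where `estimate_minutes` is an illustrative helper rather than something the notebook defines:

```python
def estimate_minutes(n_images, requests_per_minute=10, n_features=1):
    """Approximate minutes when requests are spaced evenly at the rate limit."""
    seconds_per_request = 60.0 / requests_per_minute
    return n_images * n_features * seconds_per_request / 60.0


print(estimate_minutes(200))                # 20.0
print(estimate_minutes(200, n_features=3))  # 60.0
```

This only counts the enforced spacing between requests. Any request that takes longer than the 6-second interval adds to the total, and running all three feature passes in this notebook takes roughly an hour.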

In [None]:
# Fur color analysis prompt
fur_color_prompt = """
Analyze the provided image of a bear. Describe only the fur color of the bear
by choosing the most appropriate term from the following list. The response
should be a single value with no explanation.

Choose from:
- Light Brown
- Medium Brown
- Blond
- Dark Brown
- Grizzled
- Reddish Brown
- Blackish Brown
- Black
- Brown
- Cinnamon

Respond with ONLY the color name.
"""

# Test with first image
print("🧪 Testing with first bear image...\n")
test_id = df['id'].iloc[0]
test_url = f"https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/images/{test_id}.png"
test_result = analyze_bear_image(test_url, fur_color_prompt, test_id)
print(f"\n✅ Test successful! Result: {test_result}")
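
The prompt asks for a single value from a fixed list, but a model may still return different casing, trailing punctuation, or a full sentence. A minimal sketch of checking a response against the list from the prompt follows; `normalize_label` and `ALLOWED_FUR_COLORS` are hypothetical names introduced for illustration.

```python
ALLOWED_FUR_COLORS = [
    "Light Brown", "Medium Brown", "Blond", "Dark Brown", "Grizzled",
    "Reddish Brown", "Blackish Brown", "Black", "Brown", "Cinnamon",
]


def normalize_label(raw, allowed):
    """Return the matching allowed label (case-insensitive), or None if nothing matches exactly."""
    # Trim whitespace, then surrounding periods and quotes, then compare in lowercase
    cleaned = raw.strip().strip(".\"'").strip().lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return None


print(normalize_label("  dark brown.\n", ALLOWED_FUR_COLORS))    # Dark Brown
print(normalize_label("The bear is brown", ALLOWED_FUR_COLORS))  # None
```

A `None` result flags a response that needs a retry or manual review before it ends up as a category in the dataset.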

In [None]:
# Process all images for fur color
# WARNING: This will take ~20 minutes for 200 images

print("\n" + "="*60)
print("🎨 ANALYZING FUR COLOR FOR ALL BEARS")
print("="*60)
print(f"Total images to process: {len(df)}")
print(f"Estimated time: ~{len(df) * 6 / 60:.0f} minutes")
print(f"Rate limit: 10 requests/minute\n")

start_time = datetime.now()
fur_color_results = []

for idx, row in df.iterrows():
    bear_id = row['id']
    image_url = f"https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/images/{bear_id}.png"
    
    # Analyze fur color
    fur_color = analyze_bear_image(image_url, fur_color_prompt, bear_id)
    
    # Store result
    fur_color_results.append({
        'id': bear_id,
        'fur_color': fur_color
    })
    
    # Progress update every 10 images
    if (idx + 1) % 10 == 0:
        elapsed = (datetime.now() - start_time).total_seconds() / 60
        remaining = ((len(df) - idx - 1) * 6) / 60
        print(f"\n📊 Progress: {idx + 1}/{len(df)} ({(idx + 1)/len(df)*100:.1f}%)")
        print(f"   Elapsed: {elapsed:.1f}m | Remaining: {remaining:.1f}m\n")

# Create DataFrame
df_fur_color = pd.DataFrame(fur_color_results)

end_time = datetime.now()
total_time = (end_time - start_time).total_seconds() / 60

print("\n" + "="*60)
print("✅ FUR COLOR ANALYSIS COMPLETE!")
print("="*60)
print(f"Total time: {total_time:.1f} minutes")
print(f"Total API calls: {rate_limiter.get_stats()['total_requests']}")
print(f"\nResults preview:")
display(df_fur_color.head(10))
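
Because `analyze_bear_image` returns a string starting with `Error:` when a download or API call fails, failed rows end up in the results next to real labels. A small sketch for separating them, assuming the list-of-dicts shape of `fur_color_results`; `split_failures` is a hypothetical helper and the sample rows are made up for the demo.

```python
def split_failures(results, column):
    """Separate rows whose value starts with 'Error:' from successful rows."""
    ok, failed = [], []
    for row in results:
        target = failed if str(row[column]).startswith("Error:") else ok
        target.append(row)
    return ok, failed


# Made-up sample rows in the same shape as fur_color_results
sample = [
    {"id": "ABB_01", "fur_color": "Black"},
    {"id": "ABB_02", "fur_color": "Error: Could not load image"},
    {"id": "GRZ_01", "fur_color": "Grizzled"},
]
ok, failed = split_failures(sample, "fur_color")
print(len(ok), [row["id"] for row in failed])  # 2 ['ABB_02']
```

The failed IDs can then be re-run on their own instead of repeating the full pass.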

## 9. Feature Extraction: Facial Profile

Next, we'll analyze the facial profile of each bear.

In [None]:
# Facial profile analysis prompt
facial_profile_prompt = """
Analyze the provided image of a bear. Describe only the facial profile of the bear.
The response must be one of the following two values as a single word with no explanation:

- Dished (Concave profile, where the bridge of the nose dips)
- Straight (Flat profile, with no dip from the forehead to the nose)

Respond with ONLY one word: either "Dished" or "Straight".
"""

print("\n" + "="*60)
print("👃 ANALYZING FACIAL PROFILE FOR ALL BEARS")
print("="*60)
print(f"Total images to process: {len(df)}")
print(f"Estimated time: ~{len(df) * 6 / 60:.0f} minutes\n")

start_time = datetime.now()
facial_profile_results = []

for idx, row in df.iterrows():
    bear_id = row['id']
    image_url = f"https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/images/{bear_id}.png"
    
    # Analyze facial profile
    facial_profile = analyze_bear_image(image_url, facial_profile_prompt, bear_id)
    
    # Store result
    facial_profile_results.append({
        'id': bear_id,
        'facial_profile': facial_profile
    })
    
    # Progress update every 10 images
    if (idx + 1) % 10 == 0:
        elapsed = (datetime.now() - start_time).total_seconds() / 60
        remaining = ((len(df) - idx - 1) * 6) / 60
        print(f"\n📊 Progress: {idx + 1}/{len(df)} ({(idx + 1)/len(df)*100:.1f}%)")
        print(f"   Elapsed: {elapsed:.1f}m | Remaining: {remaining:.1f}m\n")

# Create DataFrame
df_facial_profile = pd.DataFrame(facial_profile_results)

end_time = datetime.now()
total_time = (end_time - start_time).total_seconds() / 60

print("\n" + "="*60)
print("✅ FACIAL PROFILE ANALYSIS COMPLETE!")
print("="*60)
print(f"Total time: {total_time:.1f} minutes")
print(f"\nResults preview:")
display(df_facial_profile.head(10))

## 10. Feature Extraction: Paw Pad Texture

Finally, we'll analyze the paw pad texture.

In [None]:
# Paw pad texture analysis prompt
paw_pad_prompt = """
Analyze the provided image of a bear. Describe only the paw pad texture of the bear.
The response must be one of the following two values as a single word with no explanation:

- Smooth (Less textured and relatively flat, for walking)
- Rough (More textured and grooved, for gripping and climbing)

Respond with ONLY one word: either "Smooth" or "Rough".
"""

print("\n" + "="*60)
print("🐾 ANALYZING PAW PAD TEXTURE FOR ALL BEARS")
print("="*60)
print(f"Total images to process: {len(df)}")
print(f"Estimated time: ~{len(df) * 6 / 60:.0f} minutes\n")

start_time = datetime.now()
paw_pad_results = []

for idx, row in df.iterrows():
    bear_id = row['id']
    image_url = f"https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/images/{bear_id}.png"
    
    # Analyze paw pad texture
    paw_pad = analyze_bear_image(image_url, paw_pad_prompt, bear_id)
    
    # Store result
    paw_pad_results.append({
        'id': bear_id,
        'paw_pad_texture': paw_pad
    })
    
    # Progress update every 10 images
    if (idx + 1) % 10 == 0:
        elapsed = (datetime.now() - start_time).total_seconds() / 60
        remaining = ((len(df) - idx - 1) * 6) / 60
        print(f"\n📊 Progress: {idx + 1}/{len(df)} ({(idx + 1)/len(df)*100:.1f}%)")
        print(f"   Elapsed: {elapsed:.1f}m | Remaining: {remaining:.1f}m\n")

# Create DataFrame
df_paw_pad = pd.DataFrame(paw_pad_results)

end_time = datetime.now()
total_time = (end_time - start_time).total_seconds() / 60

print("\n" + "="*60)
print("✅ PAW PAD TEXTURE ANALYSIS COMPLETE!")
print("="*60)
print(f"Total time: {total_time:.1f} minutes")
print(f"\nResults preview:")
display(df_paw_pad.head(10))
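
The three extraction loops differ only in the prompt and the output column name. As a sketch of how they might be factored into one function (an illustration, not the notebook's own code), the version below takes the analysis function as a parameter so it can be exercised with a stand-in. `extract_feature`, `fake_analyze`, and `image_url_for` are hypothetical names, and the stand-in's return values are arbitrary.

```python
def extract_feature(bear_ids, image_url_for, analyze, prompt, column):
    """Run analyze(url, prompt, bear_id) for each ID and collect rows for a DataFrame."""
    rows = []
    for position, bear_id in enumerate(bear_ids, start=1):
        value = analyze(image_url_for(bear_id), prompt, bear_id)
        rows.append({"id": bear_id, column: value})
        if position % 10 == 0:
            print(f"Progress: {position}/{len(bear_ids)}")
    return rows


# Stand-ins so the sketch runs without the API; the labels are arbitrary.
def fake_analyze(url, prompt, bear_id):
    return "Rough" if bear_id.startswith("ABB") else "Smooth"


def image_url_for(bear_id):
    return f"https://raw.githubusercontent.com/dataprofessor/bear-dataset/master/images/{bear_id}.png"


rows = extract_feature(["ABB_01", "GRZ_01"], image_url_for, fake_analyze,
                       "prompt text", "paw_pad_texture")
print(rows)
```

Counting with `enumerate` keeps the progress display independent of the DataFrame index, whereas `idx + 1` in the loops above is only a correct count when `df` has its default 0-based index.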

## 11. Combine All Features

Now we'll merge all the extracted features with the original dataset.

In [None]:
# Standardize IDs (ensure uppercase)
df['id'] = df['id'].str.upper()
df_fur_color['id'] = df_fur_color['id'].str.upper()
df_facial_profile['id'] = df_facial_profile['id'].str.upper()
df_paw_pad['id'] = df_paw_pad['id'].str.upper()

# Merge all features
df_combined = df.merge(df_fur_color, on='id', how='inner')
df_combined = df_combined.merge(df_facial_profile, on='id', how='inner')
df_combined = df_combined.merge(df_paw_pad, on='id', how='inner')

print("✅ All features combined successfully!")
print(f"\nFinal dataset shape: {df_combined.shape}")
print(f"Columns: {df_combined.columns.tolist()}")
print("\nFirst 10 rows:")
display(df_combined.head(10))

# Check for any missing values
print("\n📊 Missing values check:")
print(df_combined.isnull().sum())
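
An inner merge silently drops any bear whose ID is missing from one of the feature tables, so the missing-values check above will not reveal it. A minimal sketch of an ID coverage check in plain Python follows; `ids_dropped_by_inner_join` is a hypothetical helper and the IDs are made up for the demo.

```python
def ids_dropped_by_inner_join(base_ids, *feature_id_lists):
    """Return base IDs missing from at least one feature table, in original order."""
    feature_sets = [set(ids) for ids in feature_id_lists]
    return [i for i in base_ids if not all(i in s for s in feature_sets)]


base = ["ABB_01", "ABB_02", "GRZ_01"]
fur = ["ABB_01", "ABB_02", "GRZ_01"]
profile = ["ABB_01", "GRZ_01"]  # ABB_02 is missing here
print(ids_dropped_by_inner_join(base, fur, profile))  # ['ABB_02']
```

In the notebook, the arguments could be `df['id'].tolist()`, `df_fur_color['id'].tolist()`, and so on. An empty result means the merged table should have the same number of rows as `df`, provided the IDs are unique.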

## 12. Summary Statistics

Let's examine the distribution of extracted features.

In [None]:
print("📊 Feature Distribution Summary\n")
print("=" * 60)

# Fur color distribution
print("\n🎨 Fur Color Distribution:")
print(df_combined['fur_color'].value_counts())

# Facial profile distribution
print("\n👃 Facial Profile Distribution:")
print(df_combined['facial_profile'].value_counts())

# Paw pad texture distribution
print("\n🐾 Paw Pad Texture Distribution:")
print(df_combined['paw_pad_texture'].value_counts())

# Species distribution
print("\n🐻 Species Distribution:")
print(df_combined['species'].value_counts())

## 13. Export Data

Save the processed data to CSV files for use in subsequent analysis.

In [None]:
# Save combined dataset
df_combined.to_csv('bear_data_complete.csv', index=False)
print("✅ Saved complete dataset to: bear_data_complete.csv")

# Save individual feature files
df_fur_color.to_csv('fur_color_extracted.csv', index=False)
df_facial_profile.to_csv('facial_profile_extracted.csv', index=False)
df_paw_pad.to_csv('paw_pad_texture_extracted.csv', index=False)

print("✅ Saved individual feature files:")
print("   - fur_color_extracted.csv")
print("   - facial_profile_extracted.csv")
print("   - paw_pad_texture_extracted.csv")

# Download files to local machine
print("\n💾 To download files to your computer:")
print("   1. Click the folder icon in the left sidebar")
print("   2. Right-click on the CSV files")
print("   3. Select 'Download'")
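
As an alternative to the manual steps, Colab's `google.colab.files.download` can trigger a browser download from code. The sketch below wraps it so that it does nothing outside Colab; `download_if_in_colab` is a hypothetical helper name.

```python
import os


def download_if_in_colab(paths):
    """Trigger a browser download for each existing file; return the paths downloaded."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return []
    downloaded = []
    for path in paths:
        if os.path.exists(path):
            files.download(path)
            downloaded.append(path)
    return downloaded


# Files that do not exist are skipped, so this call is safe anywhere.
print(download_if_in_colab(["no_such_file_for_demo.csv"]))  # []
```

In the notebook, passing `['bear_data_complete.csv']` after the export cell would download the combined dataset.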

## 14. API Usage Summary

In [None]:
stats = rate_limiter.get_stats()

print("\n" + "="*60)
print("📊 GEMINI API USAGE SUMMARY")
print("="*60)
print(f"\nTotal API Requests: {stats['total_requests']}")
print(f"Rate Limit: {stats['max_rpm']} requests/minute")
print(f"\nFeatures Extracted:")
print("  • Fur Color: ✅")
print("  • Facial Profile: ✅")
print("  • Paw Pad Texture: ✅")
print(f"\nTotal Bears Analyzed: {len(df_combined)}")
print("\n✅ All analysis complete! Ready for machine learning.")

## 15. Next Steps

Now that you have the complete dataset with extracted features, you can:

1. **Exploratory Data Analysis (EDA)**: Visualize distributions and correlations
2. **Feature Engineering**: Create additional features or transformations
3. **Machine Learning**: Train classification models (Random Forest, SVM, Logistic Regression)
4. **Model Evaluation**: Compare model performance
5. **Deployment**: Create a Streamlit app for predictions

## Resources

- [Gemini API Documentation](https://ai.google.dev/docs)
- [Bear Dataset GitHub](https://github.com/dataprofessor/bear-dataset)
- [scikit-learn Documentation](https://scikit-learn.org/)
- [Pandas Documentation](https://pandas.pydata.org/)