# 🌲 Project Brief (Role-Playing Style)

🔍 Scenario: You’re a Data Scientist for EcoWatch


You're working with EcoWatch, an environmental NGO monitoring ecosystems in protected forests. A fungal disease has recently been reported in one section of a national park, affecting a vulnerable species of tree.

Your team has collected GPS coordinates of all observed trees of that species, and now the park rangers need your help.
<hr>

💬 The Mission Briefing from the Chief Ecologist

“We think the disease may be spreading within groves—but we’re not sure how many groves there are or where the isolated trees are.

Can you:

1. Identify natural clusters of the trees (groves),

2. Spot isolated trees that may either be early signs of spread or resilient outliers,

3. And help us visualize the spatial patterns of the trees?”

<hr>

### 📁 The Data You’re Given

Each row in the dataset is an observation of a tree:

- `TreeID`: Unique identifier  
- `Latitude`: GPS coordinate  
- `Longitude`: GPS coordinate  
- `Infected`: Yes/No (for later use)

<hr>

### 🧠 Your Mission

Use **DBSCAN** to:

- Detect **natural groves** of trees based on their locations  
- Identify **isolated individuals** that might:
  - Be infected and at risk of spreading the disease
  - Or be distant and unaffected (potential for study)

You will then **visualize the clusters and outliers**, helping the ecological team prioritize:

- Disease containment zones
- Surveillance for isolated trees
- Future planting or protection strategies

---

### 💼 Deliverables

- A **written or slide summary** answering:
  - How many groves exist?
  - Where are isolated trees located?
  - What strategies would you recommend for disease control?
- (Bonus) A version of the visualization with `Infected` trees color-coded
- (Bonus++) A test of different `eps` values, with a description of their impact
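
For the Bonus++ item, one way to see the effect of `eps` is to rerun DBSCAN over a few values and count the groves and noise points each time. The sketch below is illustrative only. It builds a synthetic stand-in for the tree coordinates (same centers and spread as the data-generation cell further down, with a `random_state` added as an assumption for repeatability). The helper `summarize` and the grid of `eps` values are hypothetical choices, not part of the brief. Distances are Euclidean in raw degrees, which only roughly approximates ground distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the tree coordinates: three tight groves plus five
# far-away trees. random_state is an assumption added for repeatability.
centers = [(48.8566, 2.3522), (48.8600, 2.3580), (48.8530, 2.3470)]
groves, _ = make_blobs(n_samples=90, centers=centers,
                       cluster_std=0.0005, random_state=0)
isolated = np.array([
    [48.8700, 2.3700],
    [48.8400, 2.3300],
    [48.8500, 2.3650],
    [48.8720, 2.3400],
    [48.8450, 2.3600],
])
coords = np.vstack([groves, isolated])


def summarize(eps, min_samples=4):
    """Hypothetical helper: return (number of groves, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    n_groves = len(set(labels) - {-1})      # label -1 marks noise, not a grove
    n_noise = int(np.sum(labels == -1))
    return n_groves, n_noise


# eps is expressed in degrees here; the grid is an illustrative guess.
sweep = {eps: summarize(eps) for eps in (0.00001, 0.0003, 0.001, 0.003, 0.05)}
for eps, (n_groves, n_noise) in sweep.items():
    print(f"eps={eps:<8} groves={n_groves:<3} isolated={n_noise}")
```

Very small `eps` values tend to label every tree as noise, while very large values merge everything into a single grove. Useful settings lie in between, where the count of groves stays stable across neighboring `eps` values.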

---


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import random

# Generate new data with three distinct groves
new_centers = [(48.8566, 2.3522), (48.8600, 2.3580), (48.8530, 2.3470)]
new_cluster_std = [0.0005, 0.0005, 0.0005]

# Generate clustered data for three clusters
X_new, _ = make_blobs(n_samples=90, centers=new_centers, cluster_std=new_cluster_std)

# Add 5 outliers
outliers_new = np.array([
    [48.8700, 2.3700],
    [48.8400, 2.3300],
    [48.8500, 2.3650],
    [48.8720, 2.3400],
    [48.8450, 2.3600]
])

# Combine data
tree_locations_new = np.vstack([X_new, outliers_new])

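# Assign infection status to exactly 10 randomly chosen trees
# (sample once; sampling inside the comprehension would redraw for every row)
infected_ids = set(random.sample(range(95), 10))
infection_status_new = ['Yes' if i in infected_ids else 'No' for i in range(95)]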

# Create DataFrame
tree_data_new = pd.DataFrame(tree_locations_new, columns=['Latitude', 'Longitude'])
tree_data_new['TreeID'] = range(1, 96)
tree_data_new['Infected'] = infection_status_new

print(tree_data_new.head())

    Latitude  Longitude  TreeID Infected
0  48.859545   2.358319       1       No
1  48.852222   2.345608       2       No
2  48.853853   2.346984       3       No
3  48.853329   2.346545       4      Yes
4  48.852763   2.347476       5       No


In [None]:
# Answer:









Note: A common rule of thumb is to set `min_samples = 2 * num_features`, where `num_features` is the number of dimensions in your dataset (here the data is 2D, so try 4).

A larger `min_samples` requires more neighbors within `eps` before a point counts as a core point, so clusters must be denser and more points end up labeled as noise.
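
As a starting point for the answer cell, the sketch below applies the rule of thumb above (`min_samples=4`) to a synthetic stand-in for `tree_data_new`, then counts groves, lists isolated trees, and draws a scatter plot. It is a minimal sketch under stated assumptions, not a reference solution: `eps=0.001` degrees is an illustrative guess (twice the synthetic `cluster_std`), the `random_state` and the `Grove` column name are additions for this example, and Euclidean distance on raw latitude/longitude is only a rough proxy for ground distance.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Rebuild a stand-in for tree_data_new (random_state is an added assumption).
centers = [(48.8566, 2.3522), (48.8600, 2.3580), (48.8530, 2.3470)]
groves, _ = make_blobs(n_samples=90, centers=centers,
                       cluster_std=0.0005, random_state=42)
isolated = np.array([
    [48.8700, 2.3700],
    [48.8400, 2.3300],
    [48.8500, 2.3650],
    [48.8720, 2.3400],
    [48.8450, 2.3600],
])
trees = pd.DataFrame(np.vstack([groves, isolated]),
                     columns=["Latitude", "Longitude"])
trees["TreeID"] = range(1, len(trees) + 1)

# min_samples = 2 * num_features = 4; eps (in degrees) is an illustrative guess.
model = DBSCAN(eps=0.001, min_samples=4)
trees["Grove"] = model.fit_predict(trees[["Latitude", "Longitude"]])

# DBSCAN labels noise points as -1; every other label is one grove.
in_grove = trees[trees["Grove"] != -1]
isolated_trees = trees[trees["Grove"] == -1]
n_groves = in_grove["Grove"].nunique()
print(f"Groves found: {n_groves}")
print(isolated_trees[["TreeID", "Latitude", "Longitude"]])

# Scatter plot: grove members colored by label, isolated trees as black crosses.
fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(in_grove["Longitude"], in_grove["Latitude"],
           c=in_grove["Grove"], cmap="tab10", s=25, label="Grove member")
ax.scatter(isolated_trees["Longitude"], isolated_trees["Latitude"],
           c="black", marker="x", s=60, label="Isolated tree")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("DBSCAN groves and isolated trees")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)` afterwards. Trees on the sparse edge of a grove can also be labeled `-1`, so the isolated list may contain more than the five planted outliers.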



<hr>

# 🎯 Project Brief (Role-Playing Style)


👩‍💼 Scenario: You're Working at “SmartStyle Retail”.

You’ve just joined SmartStyle, a fast-growing retail company with thousands of customers. Your team has been handed a dataset of customer profiles. You’ve been asked to segment customers to help the Marketing Team build personalized offers and improve retention.

The marketing director tells you:

“We’ve tried basic clustering methods such as K-Means in the past, but they didn’t work well. Some customers just didn’t fit the patterns, and that threw everything off. We don’t want to force everyone into a group. Can you try something smarter?”

Double-click **here** for the additional information.

<!-- Your answer is below:

These customer outliers might be:

1. Excluded from group targeting

2. Flagged for special VIP treatment

3. Or simply analyzed separately

-->

📁 The data you’re given contains:

- `CustomerID`: A unique identifier
- `Gender`
- `Age`
- `Annual Income (k$)`
- `Spending Score (1–100)`: a custom score reflecting how frequently and how much the customer spends

### 🧠 Your Mission

Use clustering to group customers based on their behavior, BUT make sure that:

- You don’t assume the number of groups in advance
- You can identify and ignore outliers who don’t fit anywhere
- You explain why your clustering method is practical for the business

### 💼 Deliverables

You will present:

1. What you discovered

2. Which method you chose and why

3. What actions the company should take

4. Plots and charts that clearly show clusters and outliers

Bonus: Compare with K-Means and explain the difference
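
To make the K-Means comparison concrete, the sketch below runs both algorithms on a small synthetic customer table. Everything about the data is hypothetical: the two segments, the three atypical customers, and the choice to cluster only on income and spending score are assumptions for illustration, as are `eps=0.5`, `min_samples=4`, and `n_clusters=2`. Features are standardized first because DBSCAN's `eps` is a distance and the two columns are on different scales.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the customer file: two ordinary segments
# plus three atypical customers appended as the last three rows.
budget = rng.normal(loc=[30, 30], scale=4, size=(60, 2))
premium = rng.normal(loc=[80, 75], scale=4, size=(60, 2))
atypical = np.array([[140.0, 10.0], [140.0, 95.0], [20.0, 98.0]])
customers = pd.DataFrame(
    np.vstack([budget, premium, atypical]),
    columns=["Annual Income (k$)", "Spending Score (1-100)"],
)

# Standardize so that one eps value is meaningful across both features.
X = StandardScaler().fit_transform(customers)

# DBSCAN: no cluster count needed; customers that fit nowhere get label -1.
customers["DBSCAN"] = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

# K-Means: the cluster count must be chosen up front, and every customer
# is assigned to some cluster, including the atypical ones.
customers["KMeans"] = KMeans(n_clusters=2, n_init=10,
                             random_state=0).fit_predict(X)

print(customers.groupby("DBSCAN").size())
print(customers.tail(3))
```

In this synthetic setup the last three rows receive the DBSCAN label `-1`, whereas K-Means places them in one of its two clusters, which pulls the cluster centers toward them. On the real dataset, `eps` and `min_samples` would need tuning before drawing business conclusions.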

In [None]:
# Answer




