In [None]:
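# Clear the namespace (IPython asks for confirmation)
%reset
# Render matplotlib figures inside the notebook
%matplotlib inline
# Has no effect on its own; pass it as pd.read_csv(..., low_memory=low_memory) if needed
low_memory = False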
# Import required libraries
# Your code here

## 8.1 Introduction & Motivation

Just like in the previous chapter, we're analyzing our data to answer a fundamental question: within the given data, are there any **groups** or **clusters** of datapoints that naturally belong together?

However, this time we'll explore **hierarchical clustering**, which takes a different approach:

**How it works:**
1. Start with each individual datapoint as its own cluster
2. Progressively merge the closest clusters together
3. Continue until only one cluster remains
4. Choose where to "cut" the hierarchy to get the optimal number of clusters

**Key advantage:** We can visualize the entire clustering process and decide on the optimal number of clusters after seeing the structure, rather than guessing beforehand.
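As a minimal sketch of the four steps above, the snippet below runs agglomerative clustering on six made-up 2-D points with SciPy. The toy points and the choice of Ward linkage are assumptions for illustration only; the chapter does not prescribe a linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up toy data: two well-separated groups of 2-D points
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# Steps 1-3: start from singletons and merge the closest clusters
# until a single cluster remains (Ward linkage is an assumption here)
Z = linkage(points, method="ward")

# Each row of Z records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
print(Z)

# Step 4: cut the hierarchy so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With `n` datapoints the linkage matrix always has `n - 1` rows, one per merge, which is why the full merge history can be inspected before choosing the number of clusters.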

## 8.2 Problem Setting

**Business Scenario:**

Consider an automobile manufacturer that has developed prototypes for a new vehicle. Before launching this new model, the manufacturer needs to understand the competitive landscape:

**Key Questions:**
* Which existing vehicles on the market most closely resemble our prototypes?
* What categories of vehicles currently exist in the market?
* Which category is our new model most similar to?
* Who will be our direct competitors?

**Our Approach:**

We'll use clustering techniques to identify distinct groups of vehicles based on their characteristics. This will:
1. Provide a clear overview of the current market structure
2. Help identify which segment our new model fits into
3. Reveal our direct competition
4. Inform strategic decisions about positioning and marketing

**Dataset:** We'll work with data on various cars, including specifications like engine size, fuel efficiency, dimensions, and pricing.

## 8.3 Model

First, let's have a look at the data.

In [None]:
# Load the Cars.csv dataset
# Your code here

In [None]:
# Display dataset information
# Your code here

##### Question 1: Create a correlation heatmap to explore the relationships between variables in the dataset.

**Instructions:**
* Plot a heatmap showing correlations between all numerical columns
* You'll encounter two columns that cannot be plotted (non-numerical)
* **Important:** Drop these two columns from the heatmap visualization only
* Do NOT drop them from your main dataframe - we'll need them later for labeling
* **Hint:** You can either use `.drop()` temporarily in the plotting command or store them in a separate dataframe
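The "drop temporarily" idea from the hint could look roughly like the sketch below. It uses a small made-up dataframe instead of Cars.csv, and it draws the heatmap with plain matplotlib; both choices are assumptions, and any heatmap function you already know works just as well.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up toy rows; the real notebook would use the loaded Cars.csv dataframe
toy = pd.DataFrame({
    "manufact": ["A", "A", "B", "B", "C"],
    "model": ["m1", "m2", "m3", "m4", "m5"],
    "price": [20.0, 35.0, 25.0, 50.0, 15.0],
    "horsepow": [120, 210, 150, 300, 95],
    "mpg": [30, 22, 27, 18, 34],
})

# .drop() returns a new dataframe, so `toy` keeps its text columns
corr = toy.drop(columns=["manufact", "model"]).corr()

# Draw the correlation matrix as a colour-coded grid
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
```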

In [None]:
# Create correlation heatmap
# Your code here

##### Question 2: Investigate an unusual column in the correlation heatmap.

**Observation:** You'll notice one column that behaves differently than expected. 
* It shows no correlation with any other variables
* This is unusual - most car specifications are related to each other

**Your Tasks:**
1. Identify which column behaves unusually
2. Investigate why this is happening (examine the actual values)
3. Determine if this column provides any useful information
4. Decide whether to keep or drop it, and explain your reasoning

**Hint:** Use `.value_counts()` to see the distribution of values in suspicious columns.
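The sketch below shows what `.value_counts()` reveals for one possible cause of such behaviour: a column that holds a single repeated value. The `flag` column and all numbers are hypothetical, and whether the real dataset behaves this way is for you to check. A constant column has zero variance, so its Pearson correlation with anything else is undefined and pandas reports `NaN`.

```python
import pandas as pd

# Made-up toy data; `flag` is a hypothetical column with a single value
toy = pd.DataFrame({
    "flag": [0, 0, 0, 0, 0],
    "price": [20.0, 35.0, 25.0, 50.0, 15.0],
    "mpg": [30, 22, 27, 18, 34],
})

# How many distinct values does the suspicious column have?
counts = toy["flag"].value_counts()
print(counts)

# Correlations involving a zero-variance column come out as NaN
corr = toy.corr()
print(corr["flag"])
```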

**Your analysis here:**

In [None]:
# Investigate the unusual column
# Your code here

**Your conclusion and action:**

In [None]:
# Drop the column if needed
# Your code here

## 8.4 Model Evaluation

##### Question 3: Create a dendrogram to visualize the hierarchical clustering structure.

**Instructions:**
1. Create a dendrogram of the cars dataset
2. **Remember:** Exclude the two non-numerical columns ('manufact' and 'model') from the clustering
3. **However:** Use these columns to create meaningful labels for your dendrogram
4. **Hint:** Research the `leaf_label_func` parameter in the dendrogram function
5. Consider using horizontal orientation for better readability with many labels

**Goal:** Each car should be labeled with its manufacturer and model name (e.g., "[Toyota Camry]") so we can see which cars cluster together.
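A rough sketch of how `leaf_label_func` can produce such labels is shown below. SciPy calls the function with the integer id of each leaf, which for original observations is the row position in the data passed to `linkage`. The toy dataframe, the helper name `car_label`, and Ward linkage are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up toy rows standing in for the cars dataset
toy = pd.DataFrame({
    "manufact": ["A", "A", "B", "B"],
    "model": ["m1", "m2", "m3", "m4"],
    "price": [20.0, 22.0, 50.0, 52.0],
    "mpg": [30.0, 29.0, 18.0, 17.0],
})

# Cluster on the numerical columns only
features = toy.drop(columns=["manufact", "model"])
Z = linkage(features.to_numpy(), method="ward")  # Ward is an assumption

def car_label(leaf_id):
    """Hypothetical helper: build '[manufacturer model]' for one leaf."""
    row = toy.iloc[leaf_id]
    return f"[{row['manufact']} {row['model']}]"

fig, ax = plt.subplots(figsize=(6, 3))
result = dendrogram(Z, orientation="right", leaf_label_func=car_label, ax=ax)
ax.set_xlabel("Merge distance")
```

The labels rely on row positions, so this only works if the rows used for clustering are in the same order as the rows of the dataframe that holds the text columns.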

In [None]:
# Create dendrogram with custom labels
# Your code here

##### Question 4: Determine the optimal number of clusters by analyzing the dendrogram.

**Think about:**
* Hierarchical clustering merges clusters based on their distance/similarity
* We want to maximize separation between clusters (large distances)
* While minimizing the number of clusters (for interpretability)

**What to look for in the dendrogram:**
* Find a long stretch along the distance axis in which no merges occur
* This indicates a natural "gap" between cluster levels
* A cutoff placed inside this gap gives a good candidate for the number of clusters

**Your Task:**
1. Examine the dendrogram carefully
2. Identify where you would draw the cutoff line
3. Count how many clusters this would create
4. Explain your reasoning: Why is this number optimal?

**Your analysis here:**

##### Question 5: Use the elbow method to mathematically determine the optimal number of clusters.

**Instructions:**
1. Create an elbow plot using the linkage criterion
2. Identify where the "elbow" (bend) occurs in the graph
3. Compare this result with your visual estimate from the dendrogram

**Questions to answer:**
* What number of clusters does the elbow method suggest?
* Does this match your estimate from Question 4?
* If there's a discrepancy, which method do you trust more and why?
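One possible reading of "elbow plot using the linkage criterion" is to plot the merge distances stored in the linkage matrix against the number of clusters that remain after each merge. The sketch below does this on synthetic blobs; the data, Ward linkage, and the "largest drop" rule for picking the elbow are all assumptions, not the chapter's prescribed method.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage

# Synthetic data: four well-separated blobs of 15 points each
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
points = np.vstack([c + rng.normal(scale=0.4, size=(15, 2)) for c in centers])

Z = linkage(points, method="ward")

# merge_dist[k - 1] is the distance of the merge that leaves k clusters
max_k = 10
ks = np.arange(1, max_k + 1)
merge_dist = Z[-max_k:, 2][::-1]

fig, ax = plt.subplots()
ax.plot(ks, merge_dist, marker="o")
ax.set_xlabel("Number of clusters after the merge")
ax.set_ylabel("Merge distance")

# Simple heuristic: the elbow is the first k after the largest drop
drops = merge_dist[:-1] - merge_dist[1:]
k_elbow = int(np.argmax(drops)) + 2
print("Suggested number of clusters:", k_elbow)
```

A large merge distance at `k` means that reducing the data to `k` clusters forced two dissimilar groups together, so the curve flattens once the natural groups have been split apart.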

In [None]:
# Create elbow plot
# Your code here

**Your analysis here:**

##### Question 6: Visualize the clusters by drawing a cutoff line on the dendrogram.

**Tasks:**
1. Recreate the dendrogram from Question 3
2. Add a cutoff line showing where we divide into 6 clusters (a vertical line if the dendrogram is drawn with orientation 'right')
3. Analyze the resulting groups

**Analysis Questions:**
* Do the cars in each cluster make intuitive sense?
* Are the clusters balanced (similar sizes) or unbalanced?
* If unbalanced, what might explain this?
* Can you identify the "type" of each cluster (e.g., sports cars, family cars, trucks)?

**Hint:** If your dendrogram uses `orientation='right'`, the distance axis is the x-axis, so use `plt.axvline()` to draw a vertical line at the appropriate distance.
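The sketch below shows one way to place such a line: take the midpoint between the last merge that is still allowed and the first merge that would reduce the count below `k`. It uses synthetic blobs and `k = 3` purely for illustration; the data and Ward linkage are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic data: three well-separated blobs of 10 points each
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

Z = linkage(points, method="ward")

k = 3  # desired number of clusters (requires k >= 2)
# Z[-k, 2] is the merge that leaves k clusters,
# Z[-(k - 1), 2] is the merge that would leave k - 1 clusters
cutoff = (Z[-k, 2] + Z[-(k - 1), 2]) / 2

fig, ax = plt.subplots(figsize=(6, 5))
dendrogram(Z, orientation="right", color_threshold=cutoff, ax=ax)
# With orientation="right" the distance axis is horizontal,
# so the cutoff is a vertical line
ax.axvline(x=cutoff, color="black", linestyle="--")
ax.set_xlabel("Merge distance")

# Flat cluster labels that correspond to the drawn cutoff
labels = fcluster(Z, t=cutoff, criterion="distance")
print(np.bincount(labels)[1:])  # cluster sizes
```

Passing the same value as `color_threshold` colours each subtree below the line separately, which makes the resulting groups easier to read off.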

In [None]:
# Create dendrogram with cutoff line
# Your code here

**Your analysis here:**

1. **Cluster Quality:**
   * Describe which types of cars are grouped together
   * Do the groupings make sense?

2. **Cluster Balance:**
   * Are clusters similar in size or very different?
   * What does this tell us about the data?

3. **Business Interpretation:**
   * What vehicle categories have you identified?
   * How could a manufacturer use this information?

## 8.5 Exercises

##### Question 7: Consider the cars dataset. Which clustering method would you prefer? Why?

**Your comparative analysis here:**

Compare hierarchical clustering vs. k-Means for this dataset:

**Hierarchical Clustering Advantages:**
* List advantages specific to this dataset

**k-Means Advantages:**
* List advantages

**Your Recommendation:**
* Which method do you prefer and why?
* Consider: data characteristics, interpretability, business needs

##### Question 8: For comparison, determine the optimal k for k-Means clustering.

**Task:** Create an elbow plot for k-Means to find the optimal number of clusters.

**Purpose:** This allows us to compare whether k-Means and hierarchical clustering suggest the same number of clusters for this dataset.

**Question:** Do the two methods agree on the optimal k? If they differ, why might that be?
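For reference, a k-Means elbow plot is commonly built from the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The sketch below assumes scikit-learn is available and uses synthetic blobs rather than the cars data; the range of k values is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data: four well-separated blobs of 15 points each
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]])
points = np.vstack([c + rng.normal(scale=0.4, size=(15, 2)) for c in centers])

# Fit one model per candidate k and record its inertia
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

fig, ax = plt.subplots()
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Inertia (within-cluster sum of squares)")
```

Inertia keeps shrinking as k grows, so the quantity of interest is where the curve stops dropping sharply, not where it is smallest.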

In [None]:
# Create k-Means elbow plot
# Your code here

**Your comparison analysis here:**

##### Question 9: Analyze bivariate relationships between MPG and other car features using k-Means.

**Context:** Now we'll examine how fuel efficiency (MPG) relates to individual car characteristics.

**Task:** Treat 'mpg' as the dependent variable and pair it with each of the following independent variables in turn:
* sales
* resale
* price
* engine_s (engine size)
* horsepow (horsepower)
* wheelbas (wheelbase)
* width
* length
* curb_wgt (curb weight)
* fuel_cap (fuel capacity)
* lnsales (log of sales)

**Instructions:**
1. Create 11 separate bivariate datasets (each feature paired with MPG)
2. For each dataset, use the elbow method to find optimal k
3. Fit k-Means models with optimal k for each
4. Create scatter plots showing the clusters

**Goal:** Understand which car features have the strongest relationship with fuel efficiency and how they cluster.
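The workflow in the instructions above could be organised as in the sketch below, shown for two of the eleven features. All numbers are synthetic and have no connection to Cars.csv. The value `k = 3` is a placeholder for whatever the elbow method suggests, and standardising each pair before clustering is an assumption added here because the features are on different scales.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cars dataframe (random values only)
rng = np.random.default_rng(3)
n = 60
toy = pd.DataFrame({
    "horsepow": rng.uniform(80, 300, size=n),
    "curb_wgt": rng.uniform(2.0, 5.0, size=n),
    "mpg": rng.uniform(15, 40, size=n),
})

# Step 1: one bivariate dataset per feature, each paired with mpg
features = ["horsepow", "curb_wgt"]  # the exercise uses all 11 features
pairs = {name: toy[[name, "mpg"]].dropna() for name in features}

# Steps 3-4: fit k-Means per pair and colour the scatter plot by cluster
k = 3  # placeholder; choose it per dataset with the elbow method
fitted = {}
fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4))
for ax, (name, pair) in zip(axes, pairs.items()):
    scaled = StandardScaler().fit_transform(pair)  # assumption: scale first
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    fitted[name] = model.labels_
    ax.scatter(pair[name], pair["mpg"], c=model.labels_, cmap="viridis")
    ax.set_xlabel(name)
    ax.set_ylabel("mpg")
```

Keeping the pairs in a dictionary keyed by feature name lets the same loop serve both the elbow search and the final fits.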

In [None]:
# Create bivariate datasets
# Your code here

In [None]:
# Find optimal k for each dataset using elbow method
# Your code here

In [None]:
# Fit k-Means models and create visualizations
# Your code here

**Your analysis here:**

* Which features show the clearest clustering patterns with MPG?
* Which features seem less related to MPG?
* What insights can you draw about fuel efficiency?

##### Question 10: Repeat the bivariate analysis using hierarchical clustering.

**Task:** Perform the same analysis as Question 9, but using hierarchical clustering instead of k-Means.

**Instructions:**
1. Reuse the 11 bivariate datasets from Question 9 (each feature paired with MPG)
2. For each dataset, use the linkage-based elbow method from Question 5 to find the optimal k
3. Fit a hierarchical clustering model with that k
4. Create scatter plots showing the clusters

**Comparison Goal:** 
* Do hierarchical and k-Means identify similar patterns?
* Which method produces more interpretable results for these bivariate relationships?
* Are the cluster boundaries similar or different?
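One way to put a number on "similar patterns" is the adjusted Rand index, which compares two labelings of the same points and equals 1 when they describe the same partition. The sketch below applies it to one synthetic bivariate dataset; the data, Ward linkage, and `k = 3` are assumptions, and the metric itself is a suggestion rather than part of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic bivariate data: three well-separated blobs of 20 points each
rng = np.random.default_rng(4)
centers = np.array([[0, 0], [10, 0], [0, 10]])
pair = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

k = 3  # placeholder for the k chosen per dataset

# Hierarchical clustering: build the tree, then cut it into k flat clusters
hier_labels = fcluster(linkage(pair, method="ward"), t=k, criterion="maxclust")

# k-Means on the same data
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pair)

# Agreement between the two partitions (label numbering does not matter)
ari = adjusted_rand_score(hier_labels, kmeans_labels)
print(f"Adjusted Rand index: {ari:.2f}")

# Side-by-side scatter plots coloured by cluster
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pair[:, 0], pair[:, 1], c=hier_labels, cmap="viridis")
axes[0].set_title("Hierarchical")
axes[1].scatter(pair[:, 0], pair[:, 1], c=kmeans_labels, cmap="viridis")
axes[1].set_title("k-Means")
```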

In [None]:
# Find optimal k using linkage criterion
# Your code here

In [None]:
# Fit hierarchical clustering models and create visualizations
# Your code here

**Your comparative analysis here:**

* How do the results compare between k-Means and hierarchical clustering?
* Which method provided more interpretable results?
* What are the practical implications of any differences you observed?