## CSCE 676 :: Data Mining and Analysis :: Texas A&M University :: Spring 2026


# Weekly Homework 5: Decision Trees


***Goals of this homework:***
Work with decision trees.


***Submission instructions:***

You should post your notebook to Canvas (look for the assignment there). Please name your submission **your-uin_hw5.ipynb**, so for example, my submission would be something like **555001234_hw5.ipynb**. Your notebook should be fully executed when you submit: run all the cells so we can see the output, then submit that.

***Grading philosophy:***

We are grading reasoning, judgment, and clarity, not just correctness. Show us that you understand the data, the constraints, and the limits of your conclusions.

***For each question, you need to respond with 2 cells:***
1. **[A Code Cell] Your Code:**
  - If code is not applicable for the question, you can skip this cell.
  - For tests: tests can be simple assertions or checks (e.g., using `assert` or `print` or small functions or visual inspection); formal testing frameworks are not required.
2. **[A Markdown Cell] Your Answer:** Write up your answers and explain them in complete sentences. Include any videos in this section as well; for videos, upload them to your TAMU Google Drive, and ensure they are set to be visible by the instruction team (set to: **anyone with a TAMU email can view**), then share the link to the video in the cell.

***At the end of each Section (A/B/C/...) include a cell for your resources:***

**[A Markdown Cell] Your Resources:** You need to cite 3 types of resources and note how they helped you: (1) Collaborators, (2) Web Sources (e.g. StackOverflow), and (3) AI Tools (you must also describe how you prompted, but we do not require any links to any specific chats). Specifically, use the following format as a template:
```
On my honor, I declare the following resources:
1. Collaborators:
- Reveille A.: Helped me understand that a df in pandas is a data structure kinda like a CSV.
- Sully A.: Helped me fix a bug with the vector addition of 2 columns.
- ...

2. Web Sources:
- https://stackoverflow.com/questions/46562479/python-pandas-data-frame-creation: how to create a pd df
- ...

3. AI Tools:
- ChatGPT: I gave it the homework .ipynb file and the ufo.csv, and told it to generate the code for the first question, but it did it with csv.reader(), so I re-prompted it to use pandas and that one was correct
- ...
```
***Why do we require this cell?*** This cell is important for two reasons:

1. For academic integrity, you must give credit where credit is due.

2. We want you to pay attention to how you can successfully get help to move through problems! Is there someone you work with or an AI tool that helps you learn the material better? That's great! The point of engineering is to use your tools to solve hard problems, and part of graduate school is learning about how *you* learn and solve problems best.

***A reminder: you get out of it what you put into it.***
Do your best on these homeworks, show us your creativity, and ask for help when you need it -- good luck!

# A [72pts]. Decision Trees

**Rubric**

[24 pts] Strong/Professional: Correct and complete implementation of the task; Reasonable assumptions, stated or implied, and justified; Thoughtful handling of real-world data issues (missingness, noise, scale, duplicates, edge cases); Clear, concise explanations of what was done and why; Code is clean, readable, and well-structured, uses pandas appropriately, and would plausibly pass a professional code review; Tests meaningfully validate non-trivial behavior (not just "the code runs so it must be right").

[12 pts] Partial/Developing: Core task mostly completed but with gaps, weak assumptions, or minor mistakes; Reasoning is shallow or mostly descriptive; Code works but is messy, repetitive, or fragile; Tests are superficial, incomplete, or poorly motivated.

[0 pts] Minimal/Incorrect: Task is largely incorrect, missing, or misunderstands the goal; Little to no reasoning or justification; Code does not run or ignores constraints; No meaningful tests.


## Environment Setup & Sampling (Optional)

- You may use the full datasets. Sampling is optional (for speed).
- If you sample, briefly report what you did (n/frac, whether you stratified, any seed).


In [1]:
#!/bin/bash
! curl -L -o mushroom-classification.zip https://www.kaggle.com/api/v1/datasets/download/uciml/mushroom-classification
! unzip -o mushroom-classification.zip

Archive:  mushroom-classification.zip
  inflating: mushrooms.csv           


In [2]:
##### sampling code (optional)
from pathlib import Path
import pandas as pd

# Edit paths if needed
MUSHROOM_PATH = Path("./mushrooms.csv")

def load_csv(path, **kwargs):
    if path.exists():
        return pd.read_csv(path, **kwargs)
    print(f"Warning: {path} not found.")
    return None

mushroom = load_csv(MUSHROOM_PATH)

# ====== (Optional) Sampling ======
# Leave all values as None to use the full dataset.
SAMPLE = {
    "mushroom": {"n": None, "frac": None, "random_state": None, "stratify_col": None},  # e.g., {"frac": 1.0}
}

def maybe_sample(df, cfg):
    """Return sampled df if n/frac set; otherwise return df. Optional stratify by a column name."""
    if df is None:
        return None
    n, frac, rs, strat = cfg.get("n"), cfg.get("frac"), cfg.get("random_state"), cfg.get("stratify_col")
    if strat and strat in df.columns and (n or frac):
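        # Stratified sampling: draw the same fraction from every class.
        # GroupBy.sample (pandas >= 1.1) samples within each group and keeps the
        # stratify column; apply() over grouping columns is deprecated in recent pandas.
        if frac:
            return (df.groupby(strat, group_keys=False)
                      .sample(frac=frac, random_state=rs)
                      .reset_index(drop=True))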
        # proportional n by class frequency (rounded)
        counts = df[strat].value_counts(normalize=True) * n
        parts = []
        for k, need in counts.round().astype(int).items():
            part = df[df[strat]==k].sample(n=min(need, len(df[df[strat]==k])), random_state=rs)
            parts.append(part)
        out = pd.concat(parts).reset_index(drop=True)
        return out.sample(frac=1.0, random_state=rs).reset_index(drop=True)
    # simple sampling
    if frac: return df.sample(frac=frac, random_state=rs).reset_index(drop=True)
    if n:    return df.sample(n=min(n, len(df)), random_state=rs).reset_index(drop=True)
    return df.reset_index(drop=True)

mushroom_sample = maybe_sample(mushroom, SAMPLE["mushroom"])

print("Mushroom:", None if mushroom is None else mushroom.shape,
      "-> sample:", None if mushroom_sample is None else mushroom_sample.shape)


Mushroom: (8124, 23) -> sample: (8124, 23)
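
If you do sample, one way to report it is to compare class proportions before and after sampling. The following is a minimal sketch on a hypothetical toy frame, not the real data; it assumes the label column is named `class` with values `e` and `p`, as in the Kaggle CSV, and the `odor` values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy stand-in for the mushroom table (60 edible, 40 poisonous).
toy = pd.DataFrame({
    "class": ["e"] * 60 + ["p"] * 40,
    "odor": ["n"] * 50 + ["a"] * 10 + ["f"] * 40,
})

# Stratified 50% sample: draw the same fraction from every class.
sampled = (toy.groupby("class", group_keys=False)
              .sample(frac=0.5, random_state=0)
              .reset_index(drop=True))

# Report class proportions before and after sampling, side by side.
report = pd.DataFrame({
    "full": toy["class"].value_counts(normalize=True),
    "sample": sampled["class"].value_counts(normalize=True),
})
print(report)
```

With a stratified draw the two columns should match closely; a plain random sample can drift, which is worth mentioning in your write-up if you use one.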


This dataset contains descriptions of mushrooms from the `Agaricus` and `Lepiota` families.

Each sample is labeled as `edible` or `poisonous`, based on observable traits such as `cap shape`, `color`, `odor`, and `habitat`.

All features are categorical, making it a good exercise for preprocessing, encoding, and classification.

> You will `predict` whether a mushroom is edible or poisonous using `classification` methods.
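
Because every feature is categorical, the columns need to be encoded as numbers before a scikit-learn classifier can use them. One-hot encoding is one common option (not the only one). The sketch below uses a hypothetical four-row frame; the column names `class`, `odor`, and `cap-shape` are assumed to match the CSV, and the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the real table has 22 feature columns.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["n", "f", "a", "f"],
    "cap-shape": ["x", "x", "b", "f"],
})

# Binary target: 1 = poisonous, 0 = edible.
y = (toy["class"] == "p").astype(int)

# One-hot encode every remaining (categorical) column.
X = pd.get_dummies(toy.drop(columns="class"))

print(X.columns.tolist())
print(y.tolist())
```

Each category value becomes its own indicator column, so a tree split on such a column reads as "is the odor `f` or not".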

# 1. Decision Tree
- Train a Decision Tree classifier to predict whether each mushroom is edible or poisonous.
- Compare results using both criterion=`gini` and criterion=`entropy`, and sweep over different `max_depth` values.
- Plot test `accuracy` vs. `tree depth` and briefly discuss the effect of overfitting. What do you find out?
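
As a starting point, the sweep could be organized roughly as below. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the encoded mushroom features, and the depth grid and split ratio are arbitrary choices rather than required values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with the encoded mushroom features).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

depths = [1, 2, 3, 5, 8, None]   # None lets the tree grow until leaves are pure
results = {}                     # (criterion, depth) -> (train accuracy, test accuracy)

for criterion in ("gini", "entropy"):
    for depth in depths:
        tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth,
                                      random_state=0)
        tree.fit(X_train, y_train)
        results[(criterion, depth)] = (tree.score(X_train, y_train),
                                       tree.score(X_test, y_test))

for (criterion, depth), (train_acc, test_acc) in results.items():
    print(f"{criterion:8s} depth={depth!s:5s} train={train_acc:.3f} test={test_acc:.3f}")
```

Recording training accuracy alongside test accuracy makes any gap between the two visible, which is what the overfitting discussion asks about. The test accuracies in `results` can then be plotted against depth with one line per criterion.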

# 2. Random Forest

- Train Random Forest classifiers with different numbers of trees — e.g., `n_estimators ∈ {50, 100, 200}`.
- Compare their accuracy to your best single Decision Tree.
- Then, plot the top-10 most important features and discuss which mushroom traits seem most influential.
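
A possible skeleton for the forest comparison and the importance ranking is sketched below. It again assumes synthetic stand-in data and hypothetical feature names (`feature_0`, `feature_1`, ...); with the real data the names would be the encoded mushroom columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

scores = {}                      # n_estimators -> test accuracy
for n in (50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    scores[n] = forest.score(X_test, y_test)
print(scores)

# Impurity-based importances of the last fitted forest (200 trees), largest first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

If the features are one-hot encoded, each importance refers to a single indicator column. Summing the indicators that belong to the same original trait is one way to talk about traits rather than individual category values.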

# 3. Interpretability

- Train a small Decision Tree with `max_depth=3` for easy visualization.
- Display the tree structure and manually trace 1–2 samples through the decision path.
- Explain in words why the model makes those predictions.
- Write 2-3 tests for your model, pretending that this is a model we are going to put in production. Why did you choose those tests? How confident are you that this model is going to succeed in production?
- Pretend that we are going to add a new feature: a textual description of the mushroom, written by an expert. What should we do with this feature? How can we use it to improve the decision tree?
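
For the display and tracing steps, scikit-learn's `export_text` and `decision_path` are one option. The sketch below assumes synthetic stand-in data and hypothetical feature names, and the two assertions are only examples of the kind of lightweight check the tests bullet has in mind, not a recommended test suite.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=names))

# Trace the first sample from the root down to its leaf.
sample = X[:1]                                 # shape (1, n_features)
path = tree.decision_path(sample).indices      # node ids visited, root first
t = tree.tree_
for node in path:
    if t.children_left[node] == -1:            # -1 marks a leaf node
        leaf_class = tree.classes_[t.value[node][0].argmax()]
        print(f"node {node}: leaf, predict class {leaf_class}")
    else:
        f, thr = t.feature[node], t.threshold[node]
        op = "<=" if sample[0, f] <= thr else ">"
        print(f"node {node}: {names[f]} = {sample[0, f]:.3f} {op} {thr:.3f}")

# Example sanity checks: depth limit respected, predictions are known labels.
assert tree.get_depth() <= 3
assert set(tree.predict(X)) <= set(tree.classes_)
```

Each printed line corresponds to one comparison on the path, so the trace can be turned directly into a sentence-by-sentence explanation of the prediction.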

# B [24pts]. Interview Questions

We now pretend this is a real job interview. Here's some guidance on how to answer these questions:

1. Briefly restate the question and state any assumptions you are making.

2. Explain your reasoning out loud, focusing on tradeoffs, limitations, and constraints.

3. As a principle, keep your answers as short and clear as they can be (while still answering the question).

4. Write/speak in a conversational but professional tone (avoid being overly formal). For speaking: speak at a reasonable pace and volume, speak clearly, pause when you need to, and practice making "eye contact" with the camera. Keep a confident, positive, and professional tone. *For additional coaching and practice, the University Writing Center provides individual appointments: https://writingcenter.tamu.edu/make-an-appointment.*

There may not be a single correct answer. We are grading whether your reasoning is reasonable and aware of limitations.


**Rubric**

[8pt] Clear understanding of the question; reasonable assumptions; thoughtful reasoning that acknowledges tradeoffs and limitations; clear, concise communication in a conversational but professional tone (for speaking: clear pace, volume, and articulation).

[4pt] Basic understanding but shallow reasoning or unclear assumptions; communication is somewhat unclear, overly verbose, or overly informal/formal.

[0pt] Minimal, unclear, or incorrect response; poor communication or unprofessional tone.

# 1.
If a model performs well on training data but poorly on test data, how would you explain the issue and fix it?


# 2.
Why do deep decision trees often achieve near-perfect training accuracy?

# 3.
What do you do when the model fits in memory but the data does not? What if neither fits?

# 4.
If entropy and Gini give different trees, which one is correct?
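
For reference when reasoning about this question, both criteria are impurity measures computed from the class proportions in a node. The sketch below shows the two standard formulas in plain Python; the helper names `gini` and `entropy` and the example label lists are hypothetical.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

balanced = ["e"] * 5 + ["p"] * 5   # evenly mixed node
skewed = ["e"] * 9 + ["p"] * 1     # mostly one class
pure = ["e"] * 10                  # single class

for name, labels in (("balanced", balanced), ("skewed", skewed), ("pure", pure)):
    print(f"{name:9s} gini={gini(labels):.3f} entropy={abs(entropy(labels)):.3f}")
```

Both measures are zero for a pure node and largest for an evenly mixed one, but they are on different scales, so two candidate splits can be ranked differently by each.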

# 5.
If your model fails in production, but passed all offline tests, what do you suspect first?

# 6.
Why does a shallow tree sometimes outperform a deep one?

# 7.
What's worse: missing values or incorrect values in your training dataset?

# 8.
(Video; 1 minute max) Why are decision trees still widely used despite deep learning?

# C [4pts]. What new questions do you have?
We want you to think bigger! Tell us what questions and curiosity this homework brings up for you.

**Rubric**

[4pt] Complete, thoughtful response.

[2pt] Partial response.

[0pt] Minimal response.

# 1.
What new questions do you have after this homework? Or, what topics are you curious about now? List at least 3.