# **Refinement**

**Project Milestone 4**

**Team:** B1 Team 13

**Team Member:** Mohamad Gong

# Introduction

Milestone 4 constitutes a serious analytical checkpoint in this project. This submission is intended to demonstrate a mature understanding of the dataset and the fair use domain, along with significant progress toward well-supported high-level findings and insightful interpretation.

In earlier milestones, I established the core framing for Q2: venue-level differences can be misleading if courts and circuits see different mixes of dispute types. This notebook focuses on making those comparisons more credible by grouping cases into comparable case types and then evaluating whether venue effects persist after controlling for that case mix.

The main methodological shift in M4 is how case types are constructed. Prior work relied on TF-IDF and NMF, which represent documents through word and phrase frequency patterns and then summarize them into topic mixtures. That approach is interpretable and effective for identifying recurring terms, but it is still largely driven by lexical overlap. In this notebook, I move to text embeddings and similarity-based clustering, which represent each case as a dense semantic vector and group cases by semantic similarity rather than only shared vocabulary.

The benefit of this upgrade is that it better handles paraphrasing, synonyms, and stylistic differences across court descriptions, which are common in legal summaries. It also reduces the risk that boilerplate language dominates the grouping, and it supports more coherent and stable case types for downstream analysis. As a result, the venue comparisons are more aligned with the goal of assessing differences that remain after accounting for dispute type.

The notebook is organized to remain rigorous and interpretable. It includes clear preprocessing choices, transparent cluster interpretation through keywords and representative examples, and robustness checks such as parameter sensitivity and stability across runs. The main outputs are a case type dictionary and a venue comparison table reporting adjusted outcomes, supported by focused visualizations that make the results easy to evaluate.

# Setting Up Environment

In [14]:
import numpy as np
import pandas as pd
import re
from sentence_transformers import SentenceTransformer

# Data Importing, Inspection and Preparation

## Data Importing

Because merging `fair_use_cases` and `fair_use_findings` is unreliable and can introduce case loss or mismatches, this notebook uses `fair_use_findings` as the sole analysis table. This keeps the pipeline reproducible and preserves maximum coverage for the text-based modeling steps.
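The case loss mentioned above can be quantified before deciding to drop a table. The following is a minimal sketch on made-up data, assuming a hypothetical join key `title` and a hypothetical helper `merge_coverage`; the real tables may need a different key.

```python
import pandas as pd

def merge_coverage(left, right, key):
    """Count rows that match in both tables or appear in only one of them."""
    merged = left.merge(right, on=key, how="outer", indicator=True)
    # The `_merge` column is added by pandas when indicator=True
    return merged["_merge"].astype(str).value_counts().to_dict()

# Made-up titles: one entry differs slightly between the two tables
cases_demo = pd.DataFrame({"title": ["A v. B", "C v. D", "E v. F"]})
findings_demo = pd.DataFrame({"title": ["A v. B", "C v. D (2019)", "E v. F"]})

coverage = merge_coverage(cases_demo, findings_demo, key="title")
print(coverage)
```

Any nonzero `left_only` or `right_only` count marks rows that an inner join would silently drop.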

In [2]:
fair_use_findings = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-08-29/fair_use_findings.csv')

## Data Inspection

### Fair Use Findings Table

The `fair_use_findings` table contains complementary case-level text, including summaries of key facts, legal issues, holdings, and descriptive tags. Inspection centers on text completeness and variability, as these fields support later analysis of language patterns, similarity, and thematic structure across cases.

| variable    | class     | description                                                                            |
| ----------- | --------- | -------------------------------------------------------------------------------------- |
| title       | character | The title of the case.                                                                 |
| case_number | character | The case number or numbers of the case.                                                |
| year        | character | The year in which the finding was made (or findings were made).                        |
| court       | character | The court or courts involved.                                                          |
| key_facts   | character | The key facts of the case.                                                             |
| issue       | character | A brief description of the fair use issue.                                             |
| holding     | character | The decision of the court in paragraph form.                                           |
| tags        | character | Comma- or semicolon-separated tags for this case.                                      |
| outcome     | character | A brief description of the outcome of the case. These fields have not been normalized. |

In [3]:
print("Dataset Info:")
print(fair_use_findings.info())

print("\nFirst 5 rows:")
print(fair_use_findings.head())

print("\nMissing Values:")
print(fair_use_findings.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         251 non-null    object
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: object(9)
memory usage: 17.8+ KB
None

First 5 rows:
                                               title  \
0                              De Fontbrune v. Wofsy   
1                          Sedlik v. Von Drachenberg   
2  Sketchworks Indus. Strength Comedy, Inc. v. Ja...   
3  Am. Soc'y for Testing & Materials v. Public.Re...   
4                           Yang v. Mic Network Inc.   

                                 

## Preparing Data

### Outcome Flag Construction

The outcome column is converted into a simple label for analysis. The text is cleaned and then grouped into three outcomes: fair use found, fair use not found, and indeterminate (preliminary, mixed, remand, or unclear). An `is_determinate` flag marks the two final outcomes, so indeterminate cases can be left out of binary rate calculations.

In [4]:
# Normalize the outcome text (lowercase, trimmed) before counting
fair_use_findings["outcome"] = fair_use_findings["outcome"].astype(str).str.lower().str.strip()

# Count each distinct outcome string
outcome_counts = fair_use_findings["outcome"].value_counts().reset_index()
outcome_counts.columns = ["outcome", "count"]

# Display the counts
print(outcome_counts)

                                              outcome  count
0                                  fair use not found    100
1                                      fair use found     98
2         preliminary ruling, mixed result, or remand     28
3             preliminary finding; fair use not found      4
4                                        mixed result      3
5              preliminary ruling, fair use not found      3
6              fair use not found, preliminary ruling      3
7              preliminary ruling; fair use not found      2
8              fair use not found; preliminary ruling      2
9                          preliminary ruling, remand      1
10                                fair use not found.      1
11                                    fair use found.      1
12  preliminary ruling, fair use not found, mixed ...      1
13                 preliminary ruling, fair use found      1
14  fair use found; second circuit affirmed on app...      1
15                      

Based on the outcome counts, the raw strings fall into three categories. Entries that read “fair use found” (including a punctuation variant and one entry with an appeal note) are treated as fair use found, and entries that read “fair use not found” (including a punctuation variant) are treated as fair use not found. All remaining outcomes, such as preliminary rulings, mixed results, remands, and one entry whose outcome field holds case-summary text instead of an outcome, are treated as indeterminate. An `is_determinate` flag is then set to true only for the two final outcomes, so that indeterminate cases can be excluded from binary rate calculations.

In [5]:
outcome_map = {
    # FINAL: fair use found
    "fair use found": "FAIR_USE_FOUND",
    "fair use found.": "FAIR_USE_FOUND",
    "fair use found; second circuit affirmed on appeal.": "FAIR_USE_FOUND",

    # FINAL: fair use not found
    "fair use not found": "FAIR_USE_NOT_FOUND",
    "fair use not found.": "FAIR_USE_NOT_FOUND",

    # INDETERMINATE
    "preliminary ruling, mixed result, or remand": "INDETERMINATE",
    "preliminary finding; fair use not found": "INDETERMINATE",
    "mixed result": "INDETERMINATE",
    "preliminary ruling, fair use not found": "INDETERMINATE",
    "fair use not found, preliminary ruling": "INDETERMINATE",
    "preliminary ruling; fair use not found": "INDETERMINATE",
    "fair use not found; preliminary ruling": "INDETERMINATE",
    "preliminary ruling, remand": "INDETERMINATE",
    "preliminary ruling, fair use not found, mixed result": "INDETERMINATE",
    "preliminary ruling, fair use found": "INDETERMINATE",
    "fair use found; mixed result": "INDETERMINATE",
    "plaintiff patrick cariou published yes rasta, a book of portraits and landscape photographs taken in jamaica. defendant richard prince was an appropriation artist who altered and incorporated several of plaintiff’s photographs into a series of paintings and collages called canal zone that was exhibited at a gallery and in the gallery’s exhibition catalog. plaintiff filed an infringement claim, and the district court ruled in his favor, stating that to qualify as fair use, a secondary work must “comment on, relate to the historical context of, or critically refer back to the original works.” defendant appealed.": "INDETERMINATE",
}

In [6]:
# Create outcome_std using the mapping from outcome_map
fair_use_findings["outcome_std"] = fair_use_findings["outcome"].replace(outcome_map)

# Create a boolean flag for determinate cases (True if Fair Use Found or Not Found)
fair_use_findings["is_determinate"] = fair_use_findings["outcome_std"].isin(["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND"])

# Display counts for verification
print("Outcome Standardized Counts:")
print(fair_use_findings["outcome_std"].value_counts())
print("\nIs Determinate Counts:")
print(fair_use_findings["is_determinate"].value_counts())

Outcome Standardized Counts:
outcome_std
FAIR_USE_NOT_FOUND    101
FAIR_USE_FOUND        100
INDETERMINATE          50
Name: count, dtype: int64

Is Determinate Counts:
is_determinate
True     201
False     50
Name: count, dtype: int64


### Year Type Conversion

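The `year` column is converted to a numeric integer format so it can be used reliably in grouping, filtering, and any downstream modeling steps. Values that cannot be parsed as a number are coerced to missing (`<NA>`) instead of raising an error, and the nullable `Int64` dtype keeps the column integer-typed despite the gap. After conversion, one of the 251 rows has a missing year.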

In [7]:
# Turn the year column to integer
fair_use_findings["year"] = pd.to_numeric(fair_use_findings["year"], errors="coerce").astype("Int64")

In [8]:
fair_use_findings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           251 non-null    object
 1   case_number     251 non-null    object
 2   year            250 non-null    Int64 
 3   court           251 non-null    object
 4   key_facts       251 non-null    object
 5   issue           251 non-null    object
 6   holding         251 non-null    object
 7   tags            251 non-null    object
 8   outcome         251 non-null    object
 9   outcome_std     251 non-null    object
 10  is_determinate  251 non-null    bool  
dtypes: Int64(1), bool(1), object(9)
memory usage: 20.2+ KB
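Because one year is lost to coercion, it helps to see which raw strings fail to parse. The following is a small sketch on made-up inputs; the helper name `coerce_year` and the sample strings are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

def coerce_year(raw):
    """Parse year strings to nullable integers; unparseable values become <NA>."""
    return pd.to_numeric(raw, errors="coerce").astype("Int64")

# Made-up inputs: two clean years and one string that is not a single number
raw_years = pd.Series(["2021", "1994", "2017, 2019"])
years = coerce_year(raw_years)

# Original text of the rows that could not be parsed, kept for manual review
failed = raw_years[years.isna()]
print(years.tolist())
print(failed.tolist())
```

Applying the same mask to the real column before overwriting it would show whether the missing year can be recovered by hand.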


# Preprocessing

## Embedding-Ready Text Construction

Text fields are prepared specifically for embedding. Case-type representation is built from `issue` and `key_facts` because they describe dispute substance. `holding` is excluded from embedding inputs to reduce decision-language dominance.

In [13]:
df = fair_use_findings.copy()

def clean_text(s):
    s = "" if pd.isna(s) else str(s)
    s = s.replace("\u00a0", " ")
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["issue_clean"] = df["issue"].map(clean_text)
df["facts_clean"] = df["key_facts"].map(clean_text)
df["court_clean"] = df["court"].map(clean_text)

# Put the short issue statement first so it is always encoded, even if the model
# truncates a long facts field; no explicit weighting is applied
df["text_for_embedding"] = df["issue_clean"] + " " + df["facts_clean"]

df[["issue_clean","facts_clean","text_for_embedding"]].head()

Unnamed: 0,issue_clean,facts_clean,text_for_embedding
0,Whether reproduction of photographs documentin...,Plaintiffs own the rights to a catalogue compr...,Whether reproduction of photographs documentin...
1,Whether use of a photograph as the reference i...,Plaintiff Jeffrey Sedlik is a photographer who...,Whether use of a photograph as the reference i...
2,"Whether the use of protected elements, includi...",Plaintiff Sketchworks Industrial Strength Come...,"Whether the use of protected elements, includi..."
3,Whether it is fair use to make available onlin...,"Defendant Public.Resource.Org, Inc., a non-pro...",Whether it is fair use to make available onlin...
4,"Whether using a screenshot from an article, in...",Plaintiff Stephen Yang (“Yang”) licensed a pho...,"Whether using a screenshot from an article, in..."


## Embedding Generation

A pretrained Sentence Transformer model converts each case text into a semantic vector using a single `encode()` call, matching the simple pattern used in course notes.

In [15]:
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = df["text_for_embedding"].tolist()
X_embed = model.encode(texts, show_progress_bar=True)

print("Embeddings Shape:", X_embed.shape)

Embeddings Shape: (251, 384)
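Grouping by semantic similarity usually means comparing these vectors by cosine similarity. The following is a minimal sketch, assuming random vectors as a stand-in for `X_embed` (so it runs without downloading the model); the helper names `l2_normalize` and `top_neighbors` are hypothetical.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)

def top_neighbors(X, query_idx, n=3):
    """Return indices of the n rows most similar to row query_idx."""
    Xn = l2_normalize(X)
    sims = Xn @ Xn[query_idx]
    sims[query_idx] = -np.inf  # exclude the query itself
    return np.argsort(sims)[::-1][:n]

# Stand-in for X_embed: 6 random vectors with 384 dimensions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 384))
# Make row 1 a near-duplicate of row 0 so it should rank first
X_demo[1] = X_demo[0] + 0.01 * rng.normal(size=384)

neighbors = top_neighbors(X_demo, query_idx=0)
print("Nearest rows to row 0:", neighbors)
```

On the real embeddings, reading the `issue` text of a few cases next to their nearest neighbors is a quick sanity check that similarity tracks dispute type and not boilerplate.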


# Topic Modeling

First unsupervised or unstructured analysis method

The text input for topic modeling is constructed by combining key_facts and issue. These fields describe the dispute context and legal question, and they are used to define “case types” without leaking decision language from holding. Basic cleaning (lowercasing and whitespace normalization) is applied to reduce superficial variation before vectorization.

In [None]:
# Combine key facts + issue into one text field for topic modeling
fair_use_findings["text"] = (
    fair_use_findings["key_facts"].fillna("").astype(str) + " " +
    fair_use_findings["issue"].fillna("").astype(str)
)

# Basic cleaning: lowercase, collapse whitespace, trim
fair_use_findings["text"] = (
    fair_use_findings["text"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

Topic modeling is used to extract dispute themes from the combined key_facts + issue text. The primary approach uses NMF on a TF-IDF–weighted document–term matrix, which downweights common boilerplate terms and typically produces clearer and more distinct topic vocabularies, especially for short case summaries. NMF is fit for ten values of K (2 through 20 in steps of 2) to assess sensitivity to the number of topics, and the resulting topic-word summaries and per-case topic mixtures are compared for interpretability and stability to select a final configuration for later steps.

For completeness, LDA was also tested using a count-based representation, but the topics were largely driven by generic litigation language and showed high overlap, so it was not used in the main workflow. LDA may be revisited later with additional preprocessing or longer text inputs.

## Non-negative Matrix Factorization (NMF)

TF-IDF is used to downweight very common legal terms and emphasize terms that distinguish cases by fact patterns. NMF is then fit to the TF-IDF matrix to extract non-negative topic components, producing interpretable themes that can be treated as text-derived “case types.” These topic mixtures provide a compact numeric representation of each case summary for later analysis.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Define the number of topics (K) to investigate for NMF.
K_try = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# Initialize TF-IDF vectorizer for text feature extraction.
tfidf_vec = TfidfVectorizer(
    stop_words="english",
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2)
)

# Transform text data into TF-IDF features.
X_tfidf = tfidf_vec.fit_transform(fair_use_findings["text"])
# Store the vocabulary terms for interpretation.
terms_tfidf = tfidf_vec.get_feature_names_out()

# Prepare to store NMF results for each K value.
nmf_results = {}

# Loop through each specified number of topics (K) to run NMF.
for K in K_try:
    # Set up the NMF model for the current K.
    nmf = NMF(n_components=K, random_state=42, init="nndsvda", max_iter=400)
    # Apply NMF to extract topic distributions per document.
    nmf_doc_topic = nmf.fit_transform(X_tfidf)

    # Record the model's output and key metrics for the current K.
    nmf_results[K] = {
        "model": nmf,
        "doc_topic": nmf_doc_topic,                 # (n_cases, K) topic weights per case
        "dominant_topic": nmf_doc_topic.argmax(1),  # length n_cases
        "reconstruction_err": nmf.reconstruction_err_,
        "terms": terms_tfidf
    }

    # Assign the dominant topic for the current K to the DataFrame.
    fair_use_findings[f"nmf_topic_k{K}"] = nmf_results[K]["dominant_topic"]

    # Report the reconstruction error for the NMF model at current K.
    print(f"NMF K={K} | recon_err={nmf_results[K]['reconstruction_err']:.4f}")

In [None]:
import matplotlib.pyplot as plt

# Extract the sorted K values and corresponding reconstruction errors
Ks = np.array(sorted(nmf_results.keys()), dtype=float)
errs = np.array([nmf_results[int(k)]["reconstruction_err"] for k in Ks], dtype=float)

# Define the start and end points of the curve for the chord line
x1, y1 = Ks[0], errs[0]
x2, y2 = Ks[-1], errs[-1]

# Calculate the perpendicular distance of each point from the chord line
numer = np.abs((y2 - y1) * Ks - (x2 - x1) * errs + x2 * y1 - y2 * x1)
denom = np.sqrt((y2 - y1) ** 2 + (x2 - x1) ** 2)
dists = numer / denom

# Set distances at endpoints to -inf so they aren't selected as the knee
dists[0] = -np.inf
dists[-1] = -np.inf

# Identify the index of the maximum distance (the knee point)
knee_idx = int(np.argmax(dists))
knee_k = int(Ks[knee_idx])
knee_err = errs[knee_idx]

print(f"Knee K* = {knee_k} (recon_err = {knee_err:.4f})")

# Initialize the plot for visualization
plt.figure(figsize=(8, 5))

# Plot the reconstruction error curve
plt.plot(Ks, errs, marker="o", linewidth=2, label="Reconstruction error")

# Plot the chord line connecting the endpoints
plt.plot([x1, x2], [y1, y2], linestyle="--", linewidth=1.5, label="Endpoint chord")

# Highlight the identified knee point
plt.scatter([knee_k], [knee_err], s=120, zorder=3, label=f"Knee K*={knee_k}")
plt.axvline(knee_k, linestyle="--", linewidth=1)

# Add titles, labels, and grid to the plot
plt.title("NMF: Reconstruction Error vs K (Knee Point)")
plt.xlabel("Number of topics (K)")
plt.ylabel("Reconstruction error")
plt.xticks(Ks.astype(int))
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

The reconstruction error steadily decreases as the number of topics increases, which is expected because more topics give the model more flexibility to fit the data. However, the rate of improvement slows down after a point. The knee-point check highlights K = 10 as the point where the curve starts to flatten, meaning that moving beyond 10 topics provides only small reductions in error relative to the extra complexity. Choosing K = 10 is therefore a practical tradeoff: it captures meaningful variation in dispute themes while keeping the topic space small enough to interpret and use as stable “case-type” features in the later steps.
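The knee rule used above picks the interior point farthest from the straight line joining the first and last points of the curve. The following sketch wraps that calculation in a reusable function and runs it on a made-up error curve; the function name `knee_by_chord_distance` and the sample values are illustrative assumptions.

```python
import numpy as np

def knee_by_chord_distance(ks, errs):
    """Return the K whose point lies farthest from the endpoint chord."""
    ks = np.asarray(ks, dtype=float)
    errs = np.asarray(errs, dtype=float)
    x1, y1, x2, y2 = ks[0], errs[0], ks[-1], errs[-1]

    # Perpendicular distance from each (k, err) point to the chord
    numer = np.abs((y2 - y1) * ks - (x2 - x1) * errs + x2 * y1 - y2 * x1)
    denom = np.hypot(y2 - y1, x2 - x1)
    dists = numer / denom

    # Endpoints lie on the chord and are never eligible
    dists[[0, -1]] = -np.inf
    return int(ks[np.argmax(dists)]), dists

# Made-up curve that drops quickly and then flattens
ks_demo = [2, 4, 6, 8, 10]
errs_demo = [10.0, 6.0, 3.0, 2.5, 2.0]
knee_k_demo, dists_demo = knee_by_chord_distance(ks_demo, errs_demo)
print("Knee at K =", knee_k_demo)
```

One caveat: the perpendicular distance depends on the relative scale of the two axes, so rescaling K and the error to a common range before applying the rule can shift the selected knee.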

To compare interpretability across K values, the top words for each topic are printed. Topics that are easy to label and show minimal repetition across topics are preferred, since they provide clearer “case type” themes for later steps.

In [None]:
# Define the number of top words to display for each topic.
top_n = 10

# Iterate through each NMF model trained with different numbers of topics (K).
for K in K_try:
    # Retrieve the NMF model and the corresponding vocabulary terms for the current K.
    nmf = nmf_results[K]["model"]
    terms = nmf_results[K]["terms"]

    # Print a header for the current K to delineate the output.
    print(f"\nTop words per topic (NMF, K={K})")
    # For each topic generated by the NMF model, extract and display its most representative words.
    for topic_id, topic_weights in enumerate(nmf.components_):
        # Get the indices of the top 'top_n' words based on their weights in the current topic.
        top_idx = topic_weights.argsort()[::-1][:top_n]
        # Print the topic ID and the comma-separated list of its top words.
        print(f"Topic {topic_id}: " + ", ".join(terms[top_idx]))

A second check is whether topic assignments produce sensible group sizes. Very small topics may be unstable and hard to interpret, while extremely large topics may be too broad. The dominant-topic counts are shown for each K to assess balance.

In [None]:
# Iterate through each NMF model (for different values of K)
for K in K_try:
    # Calculate the number of cases assigned to each dominant topic for the current K
    counts = fair_use_findings[f"nmf_topic_k{K}"].value_counts().sort_index()

    # Print the dominant topic counts and their minimum/maximum sizes
    print(f"\nDominant topic counts (NMF, K={K})")
    print(counts)
    print("Min topic size:", counts.min(), "| Max topic size:", counts.max())

Across the full range of tested values (K = 2 to 20), the dominant-topic counts show a clear pattern: smaller K values create a few very broad buckets that mix multiple dispute themes, while larger K values increasingly split those buckets into narrower themes. This is useful for checking whether the model is producing usable “case types,” because extremely large topics usually signal over-broad themes, and extremely small topics can be unstable or too specific to interpret reliably.

The topic-size summaries support selecting K = 10 as a practical middle ground. At K = 10, topics are still distinct enough to label and interpret, and the distribution avoids both the heavy collapse seen at lower K and the very small niche topics that start to appear at higher K. Together with the knee-point result from the reconstruction-error curve, this makes K = 10 a reasonable choice for downstream case-type analysis.

### Viewing The Results

In [None]:
K = 10
W = nmf_results[K]["doc_topic"]  # shape: (n_cases, K)

topic_cols = [f"topic_{i}" for i in range(K)]

nmf_topic_weights_k10 = pd.DataFrame(W, columns=topic_cols)
nmf_topic_weights_k10.insert(0, "case_number", fair_use_findings["case_number"].values)

# Useful summary columns (kept inside this table, not added back to fair_use_findings)
nmf_topic_weights_k10["dominant_topic"] = W.argmax(axis=1)
nmf_topic_weights_k10["dominant_weight"] = W.max(axis=1)

nmf_topic_weights_k10.head(10)

In [None]:
terms = nmf_results[10]["terms"]
H = nmf_results[10]["model"].components_  # shape: (K, n_terms)

top_n = 20
topic_labels_k10 = {}
for t in range(H.shape[0]):  # one row of H per topic; avoids relying on K from another cell
    top_idx = np.argsort(H[t])[-top_n:][::-1]
    topic_labels_k10[t] = ", ".join(terms[top_idx])

pd.DataFrame({
    "topic": list(topic_labels_k10.keys()),
    "top_words": list(topic_labels_k10.values())
})

# Clustering

Second unsupervised or unstructured analysis method

Clustering is used to group cases into comparable “case types” based on dispute content rather than where the case was decided or how it turned out. Each case is represented only by its NMF topic-mixture weights (K=10) learned from the combined key_facts + issue text. Venue (court) and decision outcome are intentionally excluded to avoid leakage. The resulting clusters provide text-driven case-type groupings that can later be used to compare fair-use outcomes across courts while holding dispute type approximately constant.

For clustering, the feature matrix contains only the 10 topic-weight columns for each case. These features are already numeric and comparable across cases, but they are still standardized so that no single topic dimension dominates distance calculations due to scale differences. This produces a clean input for Hierarchical Clustering and K-means, and keeps the clustering interpretation tightly tied to dispute themes.
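The effect of standardization can be seen on a toy matrix where one topic column has a much smaller range than the other. This is a sketch on made-up weights, not the notebook's actual topic matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up topic weights: column 0 varies far less than column 1
W_toy = np.array([
    [0.00, 0.9],
    [0.01, 0.1],
    [0.02, 0.5],
    [0.03, 0.3],
])

X_toy = StandardScaler().fit_transform(W_toy)

# After scaling, every column has mean 0 and unit (population) standard deviation
print("Column means:", X_toy.mean(axis=0).round(6))
print("Column stds: ", X_toy.std(axis=0).round(6))
```

The tradeoff is that a rarely used topic gets the same influence on distances as a dominant one, which can amplify noise in low-weight dimensions.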

In [None]:
from sklearn.preprocessing import StandardScaler

# Retrieve the document-topic matrix from the NMF results for K=10
W = nmf_results[10]["doc_topic"]

# Apply standard scaling to the topic weights matrix
X_cluster = StandardScaler().fit_transform(W)

# Print the shape of the resulting feature matrix
print("X_cluster shape:", X_cluster.shape)

# Create a list of column names for the topics
cluster_cols = [f"topic_w{i}" for i in range(10)]

# Construct a DataFrame with the topic weights for inspection
cluster_df = pd.DataFrame(W, columns=cluster_cols)

# Display the first 5 rows of the DataFrame
cluster_df.head()

## K-Means Clustering

K-means clustering is used to group cases into content-based “case types” using the standardized NMF (K=10) topic-weight features only. The first step is choosing the number of clusters, k. An elbow plot is used to see where adding more clusters yields diminishing improvements in within-cluster fit (inertia). This helps select a cluster count that balances simplicity and separation.

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Define range of k to test for the elbow plot
k_values = range(2, 16)
inertias = []

# Fit K-Means for each k and record inertia
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=20)
    km.fit(X_cluster)
    inertias.append(km.inertia_)

# Plot inertia to identify the elbow
plt.figure(figsize=(7, 4))
plt.plot(list(k_values), inertias, marker="o")
plt.title("K-means elbow plot (inertia)")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.tight_layout()
plt.show()

Based on the elbow plot, the rate of improvement slows noticeably around k = 10, so 10 clusters are used as a practical balance between coarse grouping and over-fragmentation. The K-means model is then fit on the standardized topic-weight features, and the resulting cluster label is saved as a case-type indicator. Cluster sizes are checked to confirm that groups are not implausibly tiny.

In [None]:
from sklearn.cluster import KMeans

# Choose k based on the elbow plot
k_final = 10

# Fit K-Means on standardized topic-weight features only
kmeans = KMeans(n_clusters=k_final, random_state=42, n_init=50)
fair_use_findings["case_type_cluster"] = kmeans.fit_predict(X_cluster)

# Check cluster sizes
print("Cluster sizes (k=10):")
print(fair_use_findings["case_type_cluster"].value_counts().sort_index())

To evaluate how well-separated the clusters are, a silhouette plot is produced. Values closer to 1 indicate well-separated clusters, values near 0 indicate overlap, and negative values suggest cases may fit better in a different cluster.
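For a single case, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the members of the nearest other cluster. The following toy example on four made-up points checks a hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight, well-separated pairs of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

# Hand calculation for point 0
a = np.linalg.norm(X_toy[0] - X_toy[1])                             # own cluster
b = np.mean([np.linalg.norm(X_toy[0] - X_toy[j]) for j in (2, 3)])  # other cluster
s_manual = (b - a) / max(a, b)

# Same quantity from scikit-learn
s_sklearn = silhouette_samples(X_toy, labels_toy)[0]

print(f"manual={s_manual:.4f} sklearn={s_sklearn:.4f}")
```

Because the pairs are far apart relative to their internal spread, the value lands close to 1; overlapping clusters pull `a` and `b` together and push the value toward 0.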

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

k_final = 10

# Silhouette scores for the final clustering (topic weights only)
labels = fair_use_findings["case_type_cluster"].to_numpy()
sil_vals = silhouette_samples(X_cluster, labels)
sil_avg = silhouette_score(X_cluster, labels)

print(f"Silhouette mean (k={k_final}): {sil_avg:.3f}")

# Silhouette plot
plt.figure(figsize=(7, 5))
y_lower = 10

for c in sorted(np.unique(labels)):
    c_sil = np.sort(sil_vals[labels == c])
    size_c = len(c_sil)
    y_upper = y_lower + size_c

    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, c_sil, alpha=0.9)
    plt.text(-0.05, y_lower + 0.5 * size_c, str(c))

    y_lower = y_upper + 10

plt.axvline(sil_avg, color="red", linestyle="--", linewidth=1.5)
plt.title(f"Silhouette plot (k={k_final}), mean={sil_avg:.3f}")
plt.xlabel("Silhouette coefficient")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()

For k = 10, the silhouette plot shows mostly positive scores with an average of about 0.37, which suggests the clusters are reasonably separated but not perfectly distinct. This is expected for short legal summaries where dispute themes can overlap, and it supports using k = 10 as a workable set of “case types” for downstream venue comparisons.

## Hierarchical Clustering

Hierarchical clustering is used as a complementary, structure-first check on the case-type groupings. Unlike K-means, it does not require choosing the number of clusters upfront. Instead, it builds a tree of merges based on distances between cases in the standardized topic-weight space, which helps visualize whether the data naturally forms a few large groups or many smaller ones.

In [None]:
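from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Ward linkage merges the pair of clusters that least increases within-cluster variance
Z = linkage(X_cluster, method="ward")

# Draw the merge tree; the returned plot metadata is not needed
_ = dendrogram(Z)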

The dendrogram shows many merges happening at low distances, meaning lots of cases are fairly similar in topic space. Only a few large jumps appear near the top, where broader groups are forced together.

This suggests dispute themes overlap rather than forming a few perfectly separate groups. A mid-sized cut (not 2–3 clusters) makes more sense, which is consistent with using k = 10 as a practical number of case-type groups.
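Cutting the tree means choosing how many groups to keep and reading off the labels at that level. The following is a small sketch on made-up one-dimensional points, using SciPy's `fcluster` with `criterion="maxclust"`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three obvious pairs on a line
X_toy = np.array([[0.0], [0.1], [5.0], [5.1], [10.0], [10.1]])
Z_toy = linkage(X_toy, method="ward")

# Ask for at most 3 groups, then at most 2, from the same tree
labels_3 = fcluster(Z_toy, t=3, criterion="maxclust")
labels_2 = fcluster(Z_toy, t=2, criterion="maxclust")

print("3 groups:", labels_3)
print("2 groups:", labels_2)
```

The tree is built once and can be cut at several levels without refitting, which makes it cheap to compare a coarse grouping with a finer one.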

In [None]:
n_clusters = 10

# Generate cluster labels using hierarchical clustering
# We assign them to the main DataFrame because X_cluster is a numpy array and doesn't support named columns
fair_use_findings['hierarchical_cluster'] = fcluster(Z, n_clusters, criterion='maxclust')

# Display the size of each hierarchical cluster
print("Hierarchical Cluster Sizes:")
print(fair_use_findings['hierarchical_cluster'].value_counts().sort_index())

The same silhouette diagnostic is applied to the hierarchical labels so the two methods can be compared on equal terms: values closer to 1 indicate well-separated clusters, values near 0 indicate overlap, and negative values suggest a case may fit better in a different cluster.

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

k_final = 10

# Silhouette scores for the hierarchical clustering (topic weights only),
# using the hierarchical_cluster labels instead of the K-means labels
labels = fair_use_findings["hierarchical_cluster"].to_numpy()
sil_vals = silhouette_samples(X_cluster, labels)
sil_avg = silhouette_score(X_cluster, labels)

print(f"Silhouette mean (k={k_final}): {sil_avg:.3f}")

# Silhouette plot
plt.figure(figsize=(7, 5))
y_lower = 10

for c in sorted(np.unique(labels)):
    c_sil = np.sort(sil_vals[labels == c])
    size_c = len(c_sil)
    y_upper = y_lower + size_c

    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, c_sil, alpha=0.9)
    plt.text(-0.05, y_lower + 0.5 * size_c, str(c))

    y_lower = y_upper + 10

plt.axvline(sil_avg, color="red", linestyle="--", linewidth=1.5)
plt.title(f"Silhouette plot (Hierarchical, k={k_final}), mean={sil_avg:.3f}")
plt.xlabel("Silhouette coefficient")
plt.ylabel("Cluster")
plt.tight_layout()
plt.show()

For k = 10, the silhouette plot shows mostly positive scores with an average of about 0.35, which suggests the clusters are reasonably separated but not perfectly distinct. This is expected for short legal summaries where dispute themes can overlap, and it supports using k = 10 as a workable set of “case types” for downstream venue comparisons.
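
A per-cluster summary can complement the plot by showing which clusters carry the low or negative scores. The sketch below is a minimal illustration on synthetic data; `X_toy` and `labels_toy` are hypothetical stand-ins for `X_cluster` and the cluster labels, and the numbers it prints say nothing about the fair use dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the feature matrix and cluster labels (illustrative only)
X_toy, labels_toy = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=0)

# One silhouette coefficient per observation
sil_toy = silhouette_samples(X_toy, labels_toy)

# Size, mean silhouette, and share of negative coefficients for each cluster
summary = (
    pd.DataFrame({"cluster": labels_toy, "silhouette": sil_toy})
    .groupby("cluster")["silhouette"]
    .agg(n="size", mean="mean", share_negative=lambda s: (s < 0).mean())
)
print(summary.round(3))
```

A cluster with a high share of negative coefficients is a candidate for merging or relabeling, even when the overall mean looks acceptable.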

## Comparing Methods

In [None]:
from sklearn.metrics import silhouette_score

# Compare clustering quality to choose the best "case type" definition for the venue analysis
# We evaluate both K-means and Hierarchical clustering on the same feature space (X_cluster)
X  = X_cluster
km = fair_use_findings["case_type_cluster"].to_numpy()
hc = fair_use_findings["hierarchical_cluster"].to_numpy()

# Calculate silhouette scores (higher means better separation between case types)
s_km = silhouette_score(X, km)
s_hc = silhouette_score(X, hc)

print(f"Silhouette  KMeans: {s_km:.3f} | Hier(Ward): {s_hc:.3f}")

Both methods produce reasonably separated case-type groupings from the same topic-weight features, but K-means scores slightly higher on the silhouette metric: 0.372 versus 0.352 for hierarchical (Ward). This indicates that the K-means clusters are, on average, somewhat more internally coherent and more distinct from neighboring clusters. The gap is small (0.02), so it is best read as a mild preference rather than strong evidence. K-means is selected as the final clustering approach because it scores at least as well and assigns each case directly to one of a fixed number of groups for the downstream venue comparison.
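
Silhouette scores compare separation, but they do not show whether the two methods actually place the same cases together, or whether K-means gives the same answer under a different random seed. One possible check is the adjusted Rand index (ARI), where 1 means identical groupings and values near 0 mean chance-level agreement. The sketch below is an assumption-laden illustration on synthetic data (`X_toy` is hypothetical), not a result from this notebook.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for X_cluster (illustrative only)
X_toy, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)
k = 5

# K-means under several random seeds, plus one Ward clustering (deterministic)
km_runs = [
    KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_toy)
    for seed in range(5)
]
ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_toy)

# Agreement of each later K-means run with the first run
ari_between_seeds = [adjusted_rand_score(km_runs[0], run) for run in km_runs[1:]]

# Agreement between K-means and Ward on the same features
ari_km_vs_ward = adjusted_rand_score(km_runs[0], ward_labels)

print("ARI across K-means seeds:", np.round(ari_between_seeds, 3))
print("ARI K-means vs Ward:", round(ari_km_vs_ward, 3))
```

ARI ignores how cluster IDs are numbered, so it can be applied directly to `case_type_cluster` and `hierarchical_cluster` without matching labels first.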

# Analyzing Results

This section interprets the case-type clusters and connects them to outcomes and venue effects. The main goal is to check whether the clusters reflect meaningful dispute themes and then test whether some venues still look more creator-friendly or rightsholder-friendly after controlling for case type.

The first step is to understand what the clusters mean. Each cluster is summarized by its average topic weights, and then matched to the most representative topic words. This produces human-readable “case type” labels (software, photos/social media, books, news footage, education, film clips, music, etc.).

In [None]:
K = 10  # number of NMF topics; here it equals the number of clusters (k = 10)
cluster_labels = fair_use_findings["case_type_cluster"]

# Recover the original NMF topic weights for the interpretation phase
# This allows us to see which topics drive each cluster, even though clustering used standardized features
topic_cols = [f"topic_{i}" for i in range(K)]
cluster_interpretation = pd.DataFrame(nmf_results[K]["doc_topic"], columns=topic_cols)
cluster_interpretation["cluster"] = cluster_labels.values

# Compute the average topic profile for each cluster to identify its dominant themes
cluster_profiles = cluster_interpretation.groupby("cluster").mean()

# Visualize the cluster-topic relationship with a heatmap
# High values indicate that a specific topic is central to that cluster's definition
plt.figure(figsize=(10, 6))
sns.heatmap(cluster_profiles, cmap="Reds", annot=True, fmt=".2f")
plt.title("Mean Topic Weights by Cluster (K-means, k=10)")
plt.xlabel("Topic")
plt.ylabel("Cluster")
plt.show()

print("\nCluster Characterization (based on dominant topics):")
terms = np.asarray(nmf_results[K]["terms"])  # array form so it can be indexed with an index array below

# Iterate through each cluster to automatically generate a descriptive label
# based on its most prominent topic and the top words associated with that topic
for cluster_id, row in cluster_profiles.iterrows():
    top_topic_idx = row.argmax()
    top_words_idx = nmf_results[K]["model"].components_[top_topic_idx].argsort()[::-1][:10]
    top_words = ", ".join(terms[top_words_idx])
    print(f"Cluster {cluster_id}: Dominant topic_{top_topic_idx} -> {top_words}")

The cluster labels show that the clustering is separating cases into recognizable dispute themes. Several clusters align with clear content domains (software/video games, music, photographs/social media, books/publishing, university course materials, film clips). This supports using clusters as “case types,” since they reflect dispute content rather than venue or outcome.

After the case types are defined, fair-use rates are computed by cluster using determinate cases only. This checks whether some dispute types tend to be more creator-friendly than others, independent of venue.

In [None]:
print("\n" + "="*40)
print("Fair Use Outcome Rates by Case Type (Cluster)")
print("="*40)

# Cross-tabulate case types against outcomes to see the distribution of results per cluster
cluster_outcomes = pd.crosstab(
    fair_use_findings["case_type_cluster"],
    fair_use_findings["outcome"]
)

# Calculate win rates based only on determinate cases (Found/Not Found), excluding ambiguous results
determinate_cases = cluster_outcomes.get("FAIR_USE_FOUND", 0) + cluster_outcomes.get("FAIR_USE_NOT_FOUND", 0)
cluster_outcomes["fair_use_rate"] = cluster_outcomes.get("FAIR_USE_FOUND", 0) / determinate_cases
cluster_outcomes["total_determinate"] = determinate_cases

# Sort clusters by success rate to highlight which case types are most/least creator-friendly
cluster_outcomes_sorted = cluster_outcomes.sort_values("fair_use_rate", ascending=False)

print(cluster_outcomes_sorted[["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND", "INDETERMINATE", "fair_use_rate"]])

# Visualize the variation in fair use success rates across different dispute types
plt.figure(figsize=(10, 6))
sns.barplot(
    x=cluster_outcomes_sorted.index,
    y=cluster_outcomes_sorted["fair_use_rate"],
    hue=cluster_outcomes_sorted.index,
    legend=False
)
plt.title("Fair Use Found Rate by Case Type (Determinate Cases Only)")
plt.xlabel("Cluster (Case Type)")
plt.ylabel("Fair Use Rate")
plt.axhline(0.5, color="gray", linestyle="--", alpha=0.7)
plt.ylim(0, 1)
plt.show()

Fair-use rates differ substantially by case type. Some clusters show relatively high fair-use success (for example, film clips and social-media photo disputes), while others show much lower success (for example, university course-material disputes). This confirms that case mix matters: venues that see more of the “low win-rate” case types can appear less creator-friendly even if they behave similarly to the national average.
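
Because some clusters contain few determinate cases, a rate computed from them can move a lot with one or two decisions. One way to make that visible is to attach an interval to each rate. The sketch below uses the Wilson score interval with hypothetical counts; the helper `wilson_interval` and the numbers in `toy_counts` are illustrative assumptions and are not taken from the dataset.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Hypothetical (fair use found, determinate total) counts per case type
toy_counts = {"cluster A": (12, 20), "cluster B": (3, 8), "cluster C": (40, 90)}

for name, (found, total) in toy_counts.items():
    lo, hi = wilson_interval(found, total)
    print(f"{name}: rate={found / total:.2f}, interval=({lo:.2f}, {hi:.2f}), n={total}")
```

Wide, overlapping intervals would suggest that an apparent gap between two case types may reflect small samples rather than a real difference.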

Before testing venue fairness, case mix is examined. A venue’s distribution across clusters shows whether it disproportionately receives certain dispute types, which can distort raw fair-use rates.

In [None]:
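# Keep determinate outcomes only (found / not found), matching the filter used for the outcome rates
determinate = fair_use_findings[
    fair_use_findings["outcome"].isin(["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND"])
].copy()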

# Focus on the most active venues to ensure case-mix patterns are not driven by small-sample noise
top_venues = determinate["court"].value_counts().head(10).index
d = determinate[determinate["court"].isin(top_venues)].copy()

# Calculate the distribution of case types within each venue (row percentages)
# This reveals whether a venue disproportionately receives specific types of disputes (e.g., music vs. software)
mix_counts = pd.crosstab(d["court"], d["case_type_cluster"])
mix_pct = mix_counts.div(mix_counts.sum(axis=1), axis=0)

# Visualize the case-mix differences to check if venue outcomes might be confounded by case type
plt.figure(figsize=(10, 5))
plt.imshow(mix_pct, aspect="auto")
plt.colorbar(label="Share of cases")
plt.yticks(range(len(mix_pct.index)), mix_pct.index)
plt.xticks(range(len(mix_pct.columns)), mix_pct.columns)
plt.title("Case-type mix by venue (row percentages)")
plt.xlabel("Case type (cluster)")
plt.ylabel("Venue")
plt.tight_layout()
plt.show()

display(mix_pct.round(2))

The case-type mix varies by venue. Some courts have a heavier share of book/publishing or education-related clusters, while others see more media, photo, or software disputes. This supports the need to adjust for case mix when comparing venues, because differences in what types of cases arrive at each venue can drive differences in observed win rates.
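
To put a single number on how far each venue's mix departs from the overall mix, one option is the total variation distance: half the sum of absolute differences between the venue's case-type shares and the pooled shares. It ranges from 0 (identical mix) to 1 (no overlap). The sketch below is a minimal example on a hypothetical ten-row table, not the notebook's data.

```python
import pandas as pd

# Hypothetical venue / case-type assignments (illustrative only)
toy = pd.DataFrame({
    "court": ["A"] * 6 + ["B"] * 4,
    "case_type_cluster": [0, 0, 0, 1, 1, 2, 2, 2, 2, 1],
})

# Row percentages per venue, as in the case-mix heatmap
mix_counts_toy = pd.crosstab(toy["court"], toy["case_type_cluster"])
mix_pct_toy = mix_counts_toy.div(mix_counts_toy.sum(axis=1), axis=0)

# Pooled case-type shares across all venues
overall_toy = mix_counts_toy.sum(axis=0) / mix_counts_toy.to_numpy().sum()

# Total variation distance between each venue's mix and the pooled mix
tvd = 0.5 * mix_pct_toy.sub(overall_toy, axis=1).abs().sum(axis=1)
print(tvd.round(3))
```

Venues with a larger distance are the ones where raw fair-use rates are most likely to be shaped by case mix.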

Finally, venue fairness is evaluated by comparing each venue’s actual fair-use rate to an expected fair-use rate based on the venue’s case mix. The difference (actual − expected) is interpreted as an adjusted venue effect.

In [None]:
df_det = fair_use_findings[
    fair_use_findings["outcome"].isin(["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND"])
].copy()

df_det["is_fair_use"] = (df_det["outcome"] == "FAIR_USE_FOUND").astype(int)

# Calculate the global win rate for each case type to serve as a neutral baseline
cluster_baselines = df_det.groupby("case_type_cluster")["is_fair_use"].mean()

# Assign an "expected probability" of fair use to each case based solely on its case type
# This controls for the difficulty of the dispute (e.g., software cases might be harder to win than film clips)
df_det["expected_prob"] = df_det["case_type_cluster"].map(cluster_baselines)

# Compare actual wins per venue against the expected wins predicted by their case mix
venue_stats = df_det.groupby("court").agg(
    n_cases=("is_fair_use", "count"),
    actual_wins=("is_fair_use", "sum"),
    expected_wins=("expected_prob", "sum")
)

# The "adjusted difference" represents the venue's creator-friendliness relative to the national average for those specific disputes
venue_stats["actual_rate"] = venue_stats["actual_wins"] / venue_stats["n_cases"]
venue_stats["expected_rate"] = venue_stats["expected_wins"] / venue_stats["n_cases"]
venue_stats["adjusted_diff"] = venue_stats["actual_rate"] - venue_stats["expected_rate"]

min_cases = 10
venue_analysis = venue_stats[venue_stats["n_cases"] >= min_cases].sort_values("adjusted_diff", ascending=False)

print(f"Venue Fairness Analysis (controlling for case type, n >= {min_cases}):")
display(venue_analysis[["n_cases", "actual_rate", "expected_rate", "adjusted_diff"]].round(3))

plt.figure(figsize=(10, 6))
sns.barplot(
    x=venue_analysis["adjusted_diff"],
    y=venue_analysis.index,
    hue=venue_analysis.index,
    legend=False
)
plt.axvline(0, color="black", linestyle="-", linewidth=0.8)
plt.title(f"Creator-Friendliness by Venue (Actual - Expected Win Rate)\nControlling for Case Mix (n >= {min_cases})")
plt.xlabel("Actual minus expected fair-use rate")
plt.ylabel("Court")
plt.grid(axis="x", linestyle="--", alpha=0.5)
plt.show()

After controlling for case type, the adjusted differences suggest only modest venue-level effects among the venues with at least 10 determinate cases. In this subset, CDCA and SDNY are slightly above expected (more creator-friendly than their case mix predicts), while the Second and Ninth Circuits are below expected (more rightsholder-friendly than their case mix predicts). The magnitudes are small, so they are best read as a sign that venue effects may persist beyond case mix rather than as firm evidence of them.
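
One way to judge whether adjusted differences of this size exceed what chance alone would produce is a permutation test that shuffles venue labels within each case type. This keeps every venue's case mix fixed while breaking any link between venue and outcome. The sketch below is a hypothetical illustration, not part of the analysis above: the data are simulated with no true venue effect, and `adjusted_diff` is a helper introduced here that reproduces the actual minus expected rate per venue.

```python
import numpy as np
import pandas as pd

def adjusted_diff(df):
    """Actual minus expected fair-use rate per court, with expected taken from case-type means."""
    baseline = df.groupby("case_type_cluster")["is_fair_use"].transform("mean")
    resid = df["is_fair_use"] - baseline
    return resid.groupby(df["court"]).mean()

# Simulated data: outcomes depend on case type only, so there is no true venue effect
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "court": rng.choice(["A", "B", "C", "D"], size=n),
    "case_type_cluster": rng.integers(0, 5, size=n),
})
p_by_type = np.array([0.2, 0.35, 0.5, 0.65, 0.8])
toy["is_fair_use"] = (rng.random(n) < p_by_type[toy["case_type_cluster"].to_numpy()]).astype(int)

observed = adjusted_diff(toy)

# Null distribution: shuffle court labels within each case type
n_perm = 500
null = np.empty((n_perm, len(observed)))
for i in range(n_perm):
    shuffled = toy.copy()
    shuffled["court"] = toy.groupby("case_type_cluster")["court"].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    null[i] = adjusted_diff(shuffled).reindex(observed.index).to_numpy()

# Two-sided p-value per court: share of shuffles at least as extreme as the observed value
p_values = (np.abs(null) >= np.abs(observed.to_numpy())).mean(axis=0)
print(pd.DataFrame({"adjusted_diff": observed.round(3), "p_value": p_values}))
```

With several venues tested at once, a few small p-values can appear by chance, so any such check would need to be read alongside the number of venues compared.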