# **Integrated Analysis & Synergy For Fair Use Cases Data**

**Project Milestone 4**

**Team:** B1 Team 13

**Team Members:** Rita Feng

# Introduction

Fair use decisions are highly fact specific, so outcomes can vary across cases and courts. In this project, we apply computational text analysis to fair use opinions to uncover common patterns, compare outcomes across venues, support precedent retrieval, and flag unusual boundary cases.

We combine **`fair_use_findings`** (detailed narrative text and outcomes) with **`fair_use_cases`** (metadata such as year, venue, and tags). Because case names are inconsistent, matches between the two tables are incomplete, so we use **`fair_use_findings`** as the main text source and add **`fair_use_cases`** variables only when linkage succeeds.

For Question 1, we will also explore a new unsupervised machine learning method for text.


# Executive Summary

- This notebook analyzes the `fair_use_findings` dataset and uses both narrative text (`key_facts`, `issue`, `holding`) and metadata (`tags`, `court`, `year`).
- We build one shared text representation from `key_facts + issue` using TF-IDF and an NMF topic model with **K = 10**. These topic weights are reused across Q4, Q1, and Q2.
- We run **Q4 first** because outliers can distort topic structure, clustering, and venue comparisons. We flag red-flag cases with an Isolation Forest (5% contamination) using topic weights plus holding length, tag count, and year.
- **Q1** creates case types with K-Means (**k = 10**) on the NMF topic weights, then retrieves similar precedents using cosine similarity, with an option to restrict matches to the same case type.
- **Q2** estimates venue differences after controlling for case mix by comparing each court’s actual fair use rate to an expected rate implied by its case type mix, using determinate outcomes only and reporting courts with **n ≥ 10** cases.
- **Q3** mines tag co-occurrence patterns with Apriori (min_support = 0.005, then filtering by confidence and lift) and also clusters `full_text` (facts + issue + holding) with TF-IDF and K-Means (**k = 5**) to summarize recurring dispute archetypes.
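
The Q4 flagging step summarized above can be sketched as follows. This is a minimal illustration on randomly generated stand-in features, not the notebook's actual Q4 code: the array names (`topic_w`, `holding_len`, `tag_count`, `year_col`) and the toy sample size are assumptions, and only the 5% contamination setting and the feature groups come from the summary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy stand-ins for the real features (assumed shapes, random values).
rng = np.random.default_rng(0)
n_cases = 200
topic_w = rng.random((n_cases, 10))                 # NMF topic weights (K = 10)
holding_len = rng.integers(50, 400, size=n_cases)   # words in the holding
tag_count = rng.integers(1, 6, size=n_cases)        # number of tags
year_col = rng.integers(1980, 2023, size=n_cases)   # decision year

# One row per case: topic weights plus the three numeric descriptors.
X_features = np.column_stack([topic_w, holding_len, tag_count, year_col])

# contamination=0.05 asks the model to flag roughly 5% of cases.
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X_features)   # -1 = flagged as anomalous, 1 = normal
flagged = np.flatnonzero(labels == -1)

print("flagged cases:", len(flagged), "of", n_cases)
```

Because Isolation Forest splits on one feature at a time, it does not need the features to share a scale, which is convenient when mixing topic weights with counts and years.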

# Setting Up Environment

In [56]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Core
import os
import re
import math
import json
import time
import string
import random
from pathlib import Path
from collections import Counter, defaultdict

# Data
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# NLP / Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, PCA, TruncatedSVD
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Modeling / Anomaly detection
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Association rules
from mlxtend.frequent_patterns import apriori, association_rules

# Dimensionality reduction for viz
from sklearn.manifold import TSNE

from nltk.tokenize import word_tokenize

# Data Importing, Inspection and Preparation

## Choosing the Working Analysis Table

In Milestone 2, the team used **both approaches**: some analyses worked entirely within a single table (most often `fair_use_findings`), while others attempted to merge `fair_use_cases` and `fair_use_findings` to enrich metadata. We explicitly tested whether the two tables could be integrated into a single “master” dataset. A straightforward join exposed structural mismatch in identifiers, producing only a **12% match rate**. Follow-up preprocessing improved linkage by normalizing formatting differences (e.g., standardizing “vs” to “v” and stripping punctuation), recovering **70.5% of records (177 cases)** in one improved approach. Other join strategies achieved results in the **low 30% range**, but still left a large portion unmatched. Even when match rates increase, the remaining unmatched share is large enough that (1) we risk losing too many cases if we require a strict join, and (2) we risk introducing false matches if we loosen the rules.
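
The kind of name normalization described above (standardizing "vs" to "v" and stripping punctuation) can be sketched as below. This is an illustrative sketch, not the Milestone 2 code: the function name `normalize_case_name` and the two small name lists are hypothetical, and the resulting rates apply only to this toy example.

```python
import re

def normalize_case_name(name: str) -> str:
    """Lowercase, map 'v.', 'vs', 'vs.' to 'v', strip punctuation, collapse spaces."""
    s = name.lower()
    s = re.sub(r"\bvs?\b\.?", "v", s)   # standalone 'v', 'v.', 'vs', 'vs.' -> 'v'
    s = re.sub(r"[^\w\s]", " ", s)      # remaining punctuation -> space
    return re.sub(r"\s+", " ", s).strip()

# Hypothetical name strings standing in for the two tables.
cases_names = ["Sedlik vs. Von Drachenberg", "Yang v. Mic Network, Inc."]
findings_titles = [
    "Sedlik v. Von Drachenberg",
    "Yang v. Mic Network Inc.",
    "De Fontbrune v. Wofsy",
]

def match_rate(left, right, key=lambda s: s):
    """Share of `left` entries whose key also appears among the `right` keys."""
    right_keys = {key(r) for r in right}
    return sum(key(l) in right_keys for l in left) / len(left)

raw_rate = match_rate(findings_titles, cases_names)
norm_rate = match_rate(findings_titles, cases_names, key=normalize_case_name)
print(f"exact match: {raw_rate:.0%}, after normalization: {norm_rate:.0%}")
```

Looser keys raise the match rate but also raise the chance that two different cases collapse to the same string, which is the false-match risk noted above.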

For Milestone 3, we prioritize a reproducible pipeline with minimal merge risk. As an integration choice, we therefore use **`fair_use_findings` as the single working analysis table** and treat `fair_use_cases` as optional metadata rather than a required dependency. This aligns with the milestone’s core modeling needs because our unsupervised methods depend most heavily on the narrative fields in `fair_use_findings`, while the structured fields in that table are sufficient for the planned comparisons.

**Key overlap and differences (without requiring a merge):**

* **Identifiers:** `fair_use_findings` includes `title` and `case_number`, which together serve the same reference role as the single `case` string in `fair_use_cases`. These fields are mainly used for labeling and lookup rather than as modeling features.
* **Year:** `year` is present in both tables and can be used directly without merging.
* **Court and venue information:** `fair_use_cases` includes `court` and `jurisdiction`, while `fair_use_findings` includes `court`. These are overlapping venue signals, and a single court field is sufficient for clustering and venue-aware comparisons.
* **Outcomes:** both tables include an `outcome` field describing the fair use result at a high level, supporting standardized labeling.
* **Text fields unique to `fair_use_findings`:** `key_facts`, `issue`, and `holding` provide the narrative summaries required for text-based unsupervised methods. In Milestone 2, the primary modeling text was often constructed from **`key_facts + issue`** to reduce decision-language leakage from holdings; this combined text averaged **about 145 words**, which is well suited to TF–IDF, topic modeling, and clustering workflows.

**Outcome labeling approach (used throughout M3):**

* Because `fair_use_findings` does not provide a boolean `fair_use_found`, we derive labels from `outcome` text and standardize into three groups: **fair use found**, **fair use not found**, and **indeterminate**.
* For venue comparisons, we compute outcome-related metrics using **determinate outcomes only**, avoiding mixed, preliminary, or remand-like resolutions.

Overall, while merging can add optional metadata, Milestone 2 showed that reliable integration requires record linkage plus validation and can still induce substantial case loss or matching risk. Since `fair_use_findings` already contains the critical unstructured fields used across all M3 modeling tracks, we proceed with a **one-table analysis** to maximize coverage, reduce merge-induced bias, and keep the pipeline reproducible.

## Data Importing

In [57]:
fair_use_findings = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-08-29/fair_use_findings.csv')

## Data Inspection

The `fair_use_findings` table contains complementary case-level text, including summaries of key facts, legal issues, holdings, and descriptive tags. Inspection centers on text completeness and variability, as these fields support later analysis of language patterns, similarity, and thematic structure across cases.

| variable    | class     | description                                                                            |
| ----------- | --------- | -------------------------------------------------------------------------------------- |
| title       | character | The title of the case.                                                                 |
| case_number | character | The case number or numbers of the case.                                                |
| year        | character | The year in which the finding was made (or findings were made).                        |
| court       | character | The court or courts involved.                                                          |
| key_facts   | character | The key facts of the case.                                                             |
| issue       | character | A brief description of the fair use issue.                                             |
| holding     | character | The decision of the court in paragraph form.                                           |
| tags        | character | Comma- or semicolon-separated tags for this case.                                      |
| outcome     | character | A brief description of the outcome of the case. These fields have not been normalized. |

In [58]:
print("Dataset Info:")
print(fair_use_findings.info())

print("\nFirst 5 rows:")
print(fair_use_findings.head())

print("\nMissing Values:")
print(fair_use_findings.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         251 non-null    object
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: object(9)
memory usage: 17.8+ KB
None

First 5 rows:
                                               title  \
0                              De Fontbrune v. Wofsy   
1                          Sedlik v. Von Drachenberg   
2  Sketchworks Indus. Strength Comedy, Inc. v. Ja...   
3  Am. Soc'y for Testing & Materials v. Public.Re...   
4                           Yang v. Mic Network Inc.   

                                 

## Preparing Data

### Outcome Flag Construction

The outcome column is converted into a simple label for analysis. The text is cleaned and then grouped into three outcomes: fair use found, fair use not found, and indeterminate (preliminary, mixed, remand, or unclear). A binary `fair_use_found` flag is created only for the final outcomes, and indeterminate cases are left out of binary rate calculations.

In [59]:
# Count outcome column from fair_use_findings and reset index
outcome_counts = fair_use_findings["outcome"].astype(str).str.lower().str.strip().value_counts().reset_index()
fair_use_findings["outcome"] = fair_use_findings["outcome"].astype(str).str.lower().str.strip()
outcome_counts.columns = ["outcome", "count"]

# Display the counts
print(outcome_counts)

                                              outcome  count
0                                  fair use not found    100
1                                      fair use found     98
2         preliminary ruling, mixed result, or remand     28
3             preliminary finding; fair use not found      4
4                                        mixed result      3
5              preliminary ruling, fair use not found      3
6              fair use not found, preliminary ruling      3
7              preliminary ruling; fair use not found      2
8              fair use not found; preliminary ruling      2
9                          preliminary ruling, remand      1
10                                fair use not found.      1
11                                    fair use found.      1
12  preliminary ruling, fair use not found, mixed ...      1
13                 preliminary ruling, fair use found      1
14  fair use found; second circuit affirmed on app...      1
15                      

Based on the grouped outcome counts, outcomes fall into three clear categories. Entries labeled “Fair use found” (including minor punctuation or appeal notes) are treated as fair use found, and entries labeled “Fair use not found” (including punctuation variants) are treated as fair use not found. All remaining outcomes, such as preliminary rulings, mixed results, remands, and irregular text entries, are treated as indeterminate. A binary fair_use_found flag is then defined only for the final outcomes, while indeterminate cases are excluded from binary rate calculations.
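
The grouping logic in the paragraph above can also be written as a small set of rules, shown here as a hedged sketch. The notebook itself uses an explicit dictionary; the function `group_outcome` is a hypothetical alternative that should yield the same three groups for the outcome strings listed above, but it has not been validated beyond them.

```python
def group_outcome(raw: str) -> str:
    """Map a raw outcome string to one of three standardized groups."""
    s = str(raw).lower().strip().rstrip(".")
    # Any preliminary, mixed, or remand language makes the outcome indeterminate.
    if any(keyword in s for keyword in ("preliminary", "mixed", "remand")):
        return "INDETERMINATE"
    # Check "not found" first, although neither prefix contains the other.
    if s.startswith("fair use not found"):
        return "FAIR_USE_NOT_FOUND"
    if s.startswith("fair use found"):
        return "FAIR_USE_FOUND"
    # Irregular entries (e.g. narrative text in the outcome field) fall through.
    return "INDETERMINATE"

examples = [
    "Fair use found.",
    "fair use found; second circuit affirmed on appeal.",
    "fair use not found, preliminary ruling",
    "mixed result",
]
for text in examples:
    print(f"{text!r:55} -> {group_outcome(text)}")
```

The explicit dictionary is easier to audit because every raw string is visible, while the rule-based version would also cover new spelling variants at the cost of possible misclassification.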

In [60]:
outcome_map = {
    # FINAL: fair use found
    "fair use found": "FAIR_USE_FOUND",
    "fair use found.": "FAIR_USE_FOUND",
    "fair use found; second circuit affirmed on appeal.": "FAIR_USE_FOUND",

    # FINAL: fair use not found
    "fair use not found": "FAIR_USE_NOT_FOUND",
    "fair use not found.": "FAIR_USE_NOT_FOUND",

    # INDETERMINATE
    "preliminary ruling, mixed result, or remand": "INDETERMINATE",
    "preliminary finding; fair use not found": "INDETERMINATE",
    "mixed result": "INDETERMINATE",
    "preliminary ruling, fair use not found": "INDETERMINATE",
    "fair use not found, preliminary ruling": "INDETERMINATE",
    "preliminary ruling; fair use not found": "INDETERMINATE",
    "fair use not found; preliminary ruling": "INDETERMINATE",
    "preliminary ruling, remand": "INDETERMINATE",
    "preliminary ruling, fair use not found, mixed result": "INDETERMINATE",
    "preliminary ruling, fair use found": "INDETERMINATE",
    "fair use found; mixed result": "INDETERMINATE",
    "plaintiff patrick cariou published yes rasta, a book of portraits and landscape photographs taken in jamaica. defendant richard prince was an appropriation artist who altered and incorporated several of plaintiff’s photographs into a series of paintings and collages called canal zone that was exhibited at a gallery and in the gallery’s exhibition catalog. plaintiff filed an infringement claim, and the district court ruled in his favor, stating that to qualify as fair use, a secondary work must “comment on, relate to the historical context of, or critically refer back to the original works.” defendant appealed.": "INDETERMINATE",
}

In [61]:
# Replace outcome column values with the mapping in outcome_map
fair_use_findings["outcome"] = fair_use_findings["outcome"].replace(outcome_map)
fair_use_findings["outcome"].value_counts().reset_index()

|     | outcome            | count |
| --- | ------------------ | ----- |
| 0   | FAIR_USE_NOT_FOUND | 101   |
| 1   | FAIR_USE_FOUND     | 100   |
| 2   | INDETERMINATE      | 50    |
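
The binary `fair_use_found` flag described earlier is not shown in the cells above. A minimal sketch of one way to build it follows, using a toy Series in place of the real `outcome` column; the use of pandas' nullable `boolean` dtype is an assumption, chosen so that indeterminate cases stay missing instead of being counted as `False`.

```python
import pandas as pd

# Toy stand-in for fair_use_findings["outcome"] after standardization.
outcome = pd.Series(
    ["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND", "INDETERMINATE", "FAIR_USE_FOUND"]
)

# Unmapped values (INDETERMINATE) become missing, not False.
fair_use_found = outcome.map(
    {"FAIR_USE_FOUND": True, "FAIR_USE_NOT_FOUND": False}
).astype("boolean")

# mean() skips missing values, so the rate uses determinate outcomes only.
determinate_n = int(fair_use_found.notna().sum())
fair_use_rate = float(fair_use_found.mean())
print(f"determinate cases: {determinate_n}, fair use rate: {fair_use_rate:.3f}")
```

With the counts reported above (100 found, 101 not found), the determinate-only rate for the full table would be 100 / 201, or about 49.8%.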


### Column Cleaning Steps

The `year` column is converted to a nullable integer type (`Int64`) so it can be used reliably in grouping, filtering, and downstream modeling. Values that cannot be parsed as numbers are coerced to missing (`<NA>`) rather than raising an error; after conversion, 250 of the 251 rows have a valid year.

In [62]:
# Turn the year column to integer
fair_use_findings["year"] = pd.to_numeric(fair_use_findings["year"], errors="coerce").astype("Int64")

In [63]:
fair_use_findings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         250 non-null    Int64 
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: Int64(1), object(8)
memory usage: 18.0+ KB
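
To illustrate what `errors="coerce"` does with an entry that is not a single number, here is a minimal sketch on made-up values. The string `"1994, 1996"` is hypothetical and is not taken from the dataset; the sketch only shows how such rows could be located for manual review.

```python
import pandas as pd

# Made-up raw values; the third entry is a hypothetical multi-year string.
raw_year = pd.Series(["2022", "2021", "1994, 1996", "2005"])

# Unparsable strings become NaN, then <NA> in the nullable Int64 column.
year = pd.to_numeric(raw_year, errors="coerce").astype("Int64")

# Keep the original strings for the rows that failed to parse.
failed = raw_year[year.isna()]
print(year.tolist())
print("failed to parse:", failed.tolist())
```

Running the same `isna()` filter against a copy of the raw column, taken before conversion, would show which original string produced the single missing year.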


In [64]:
fair_use_findings.head()

Unnamed: 0,title,case_number,year,court,key_facts,issue,holding,tags,outcome
0,De Fontbrune v. Wofsy,39 F.4th 1214 (9th Cir. 2022),2022,United States Court of Appeals for the Ninth C...,Plaintiffs own the rights to a catalogue compr...,Whether reproduction of photographs documentin...,"The panel held that the first factor, the purp...",Education/Scholarship/Research; Photograph,FAIR_USE_NOT_FOUND
1,Sedlik v. Von Drachenberg,"No. CV 21-1102, 2022 WL 2784818 (C.D. Cal. May...",2022,United States District Court for the Southern ...,Plaintiff Jeffrey Sedlik is a photographer who...,Whether use of a photograph as the reference i...,"Considering the first fair use factor, the pur...",Painting/Drawing/Graphic; Photograph,INDETERMINATE
2,"Sketchworks Indus. Strength Comedy, Inc. v. Ja...","No. 19-CV-7470-LTS-VF, 2022 U.S. Dist. LEXIS 8...",2022,United States District Court for the Southern ...,Plaintiff Sketchworks Industrial Strength Come...,"Whether the use of protected elements, includi...","The court found that the first factor, the pur...",Film/Audiovisual; Music; Parody/Satire; Review...,FAIR_USE_FOUND
3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,"Defendant Public.Resource.Org, Inc., a non-pro...",Whether it is fair use to make available onlin...,"As directed by the court of appeals, the distr...",Education/Scholarship/Research; Textual Work; ...,INDETERMINATE
4,Yang v. Mic Network Inc.,"Nos. 20-4097-cv(L), 20-4201-cv (XAP), 2022 U.S...",2022,United States Court of Appeals for the Second ...,Plaintiff Stephen Yang (“Yang”) licensed a pho...,"Whether using a screenshot from an article, in...","On appeal, the court decided that the first fa...",News Reporting; Photography,FAIR_USE_FOUND


# Integration Strategy and Synergy Effort

1. **Create a shared case-type backbone (topic-based case types).**

   * Build a consistent text representation for each case using **`key_facts + issue`**.
   * Fit a topic model and translate each case into **case types** (topic mixtures and/or clusters).
   * **Why this matters:** this backbone provides a reusable semantic structure that makes cases comparable across analyses and reduces duplicated work.

2. **Run Q4 first with full-case coverage.**

   * Use the full **`fair_use_findings`** table (about 250 cases).
   * **Why this matters:** Q4 focuses on boundary conditions and edge cases. Keeping all cases increases the likelihood of capturing true outliers rather than filtering them out prematurely.

3. **Execute Q1 to Q3 with minimal changes to the proven M2 approach.**

   * **Q1 (precedent finder):** reuse the case-type backbone as the semantic engine for similarity search and retrieval.
   * **Q2 (venue fairness):** preserve the existing method and use case types to control for case mix, then compare **actual vs. expected** outcomes by venue.
   * **Q3 (scenario bundles):** keep this workstream largely self-contained and focus on tag and category co-occurrence patterns.

The main integration benefit is that the topic-model case types are created once and reused across multiple questions. They strengthen **Q1** by improving retrieval and support **Q2** by enabling case-mix adjustments. **Q4** benefits most from full coverage because it is designed to surface boundary cases. **Q3** remains mostly independent, with overlap primarily in shared data preparation and labeling.
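
The actual-versus-expected comparison for Q2 can be sketched on a toy table as follows. The column names `court`, `case_type`, and `fair_use_found`, the two fictional courts, and the lowered threshold `MIN_N = 4` are assumptions made for illustration; the notebook's stated reporting threshold is n ≥ 10.

```python
import pandas as pd

# Toy determinate-outcome cases for two fictional courts.
toy = pd.DataFrame({
    "court": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "case_type": [0, 0, 1, 1, 0, 1, 1, 1],
    "fair_use_found": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Overall fair use rate for each case type, pooled across all courts.
type_rate = toy.groupby("case_type")["fair_use_found"].mean()

# Each case's expected outcome is the pooled rate of its case type.
toy["expected"] = toy["case_type"].map(type_rate)

by_court = toy.groupby("court").agg(
    n=("fair_use_found", "size"),
    actual=("fair_use_found", "mean"),
    expected=("expected", "mean"),
)
by_court["gap"] = by_court["actual"] - by_court["expected"]

MIN_N = 4  # toy threshold; the notebook reports courts with n >= 10
print(by_court[by_court["n"] >= MIN_N])
```

A positive gap means a court finds fair use more often than its mix of case types would suggest, and a negative gap means less often. In this toy table court A sits 0.15 above expectation and court B 0.15 below.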

# Topic Modeling

The text input for topic modeling is constructed by combining `key_facts` and `issue`. These fields describe the dispute context and legal question, and they are used to define “case types” without leaking decision language from `holding`. Basic cleaning (lowercasing and whitespace normalization) is applied to reduce superficial variation before vectorization.

In [65]:
# Combine key facts + issue into one text field for topic modeling
fair_use_findings["text"] = (
    fair_use_findings["key_facts"].fillna("").astype(str) + " " +
    fair_use_findings["issue"].fillna("").astype(str)
)

# Basic cleaning: lowercase, collapse whitespace, trim
fair_use_findings["text"] = (
    fair_use_findings["text"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

Topic modeling is used to extract dispute themes from the combined `key_facts + issue` text. The primary approach uses **NMF on a TF–IDF–weighted document–term matrix**, which downweights common boilerplate terms and typically produces clearer and more distinct topic vocabularies, especially for short case summaries. In M3, we fit NMF with **K = 10** topics and use the resulting **topic-word summaries** and **per-case topic mixtures** as a stable, interpretable “case-type” representation for later steps in the notebook.

## Non-negative Matrix Factorization (NMF)

TF-IDF is used to downweight very common legal terms and emphasize terms that distinguish cases by fact patterns. NMF is then fit to the TF-IDF matrix to extract non-negative topic components, producing interpretable themes that can be treated as text-derived “case types.” These topic mixtures provide a compact numeric representation of each case summary for later analysis.

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Define the number of topics (K) to investigate for NMF.
# We use K=10 based on the selected configuration.
K_try = [10]

# Initialize TF-IDF vectorizer for text feature extraction.
tfidf_vec = TfidfVectorizer(
    stop_words="english",
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2)
)

# Transform text data into TF-IDF features.
X_tfidf = tfidf_vec.fit_transform(fair_use_findings["text"])
# Store the vocabulary terms for interpretation.
terms_tfidf = tfidf_vec.get_feature_names_out()

# Prepare to store NMF results for each K value.
nmf_results = {}

# Loop through each specified number of topics (K) to run NMF.
for K in K_try:
    # Set up the NMF model for the current K.
    nmf = NMF(n_components=K, random_state=42, init="nndsvda", max_iter=400)
    # Apply NMF to extract topic distributions per document.
    nmf_doc_topic = nmf.fit_transform(X_tfidf)

    # Record the model's output and key metrics for the current K.
    nmf_results[K] = {
        "model": nmf,
        "doc_topic": nmf_doc_topic,                 # (n_cases, K) topic weights per case
        "dominant_topic": nmf_doc_topic.argmax(1),  # length n_cases
        "reconstruction_err": nmf.reconstruction_err_,
        "terms": terms_tfidf
    }

    # Assign the dominant topic for the current K to the DataFrame.
    fair_use_findings[f"nmf_topic_k{K}"] = nmf_results[K]["dominant_topic"]

    # Report the reconstruction error for the NMF model at current K.
    print(f"NMF K={K} | recon_err={nmf_results[K]['reconstruction_err']:.4f}")

NMF K=10 | recon_err=14.7173


In [67]:
# Define the number of top words to display for each topic.
top_n = 10

# Iterate through each NMF model trained with different numbers of topics (K).
for K in K_try:
    # Retrieve the NMF model and the corresponding vocabulary terms for the current K.
    nmf = nmf_results[K]["model"]
    terms = nmf_results[K]["terms"]

    # Print a header for the current K to delineate the output.
    print(f"\nTop words per topic (NMF, K={K})")
    # For each topic generated by the NMF model, extract and display its most representative words.
    for topic_id, topic_weights in enumerate(nmf.components_):
        # Get the indices of the top 'top_n' words based on their weights in the current topic.
        top_idx = topic_weights.argsort()[::-1][:top_n]
        # Print the topic ID and the comma-separated list of its top words.
        print(f"Topic {topic_id}: " + ", ".join(terms[top_idx]))


Top words per topic (NMF, K=10)
Topic 0: defendants, plaintiff, advertisement, parody, campaign, sculpture, hustler, defendants unauthorized, video, unauthorized
Topic 1: photograph, post, article, instagram, copyright, use photograph, new, posted, blog, photographer
Topic 2: defendant, plaintiff, book, court, published, ruling, novel, district, appealed district, court ruling
Topic 3: news, footage, lans, angeles, los, los angeles, service, video, news service, defendant
Topic 4: plaintiffs, works, university, factor, factors, court, students, district, district court, publishers
Topic 5: photographs, images, photos, gossip, celebrity, website, defendant, use photographs, celebrity gossip, plaintiff photographs
Topic 6: film, documentary, films, clips, star, scenes, rights, jewish, film clips, defendant
Topic 7: game, video, code, software, computer, program, sony, games, video game, fox
Topic 8: series, television series, television, character, superman, book, superhero, characters,

Across the values evaluated in Milestone 2, the dominant-topic counts showed a consistent pattern: smaller K values produced a few very broad buckets that mixed multiple dispute themes, while larger K values progressively split those buckets into narrower themes. This comparison is useful for validating whether the model is producing usable “case types,” since overly broad topics reduce interpretability, while very small topics can become unstable or too specific to interpret reliably.

Based on those Milestone 2 checks, **K = 10** was selected as a practical middle ground. At K = 10, topics remain distinct enough to label and interpret, and the topic-size distribution avoids both the heavy collapse seen at lower K and the proliferation of tiny niche topics that emerges at higher K. Combined with the reconstruction-error knee-point diagnostics conducted in M2, this supports using **K = 10** as a stable configuration for downstream case-type analysis.
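
The two diagnostics mentioned above (reconstruction error and dominant-topic sizes across K) can be collected in a single loop. The sketch below is not the Milestone 2 code: it runs on a tiny made-up corpus with small K values, so the numbers it prints say nothing about the real data, and the `diagnostics` dictionary is a hypothetical structure.

```python
from collections import Counter

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the key_facts + issue text.
docs = [
    "photographer sued website for posting photograph without license",
    "news site used photograph from instagram post in article",
    "film documentary used clips from earlier film",
    "documentary included scenes and clips from a television series",
    "video game company copied software code from console",
    "computer program reverse engineered to make compatible game",
    "song lyrics parody used in music album",
    "musical parody of a popular song on album",
]
X_toy = TfidfVectorizer(stop_words="english").fit_transform(docs)

diagnostics = {}
for k in (2, 3, 4):
    model = NMF(n_components=k, init="nndsvda", random_state=42, max_iter=400)
    W_k = model.fit_transform(X_toy)
    # Count how many documents have each topic as their dominant topic.
    sizes = Counter(W_k.argmax(axis=1).tolist())
    diagnostics[k] = {
        "recon_err": float(model.reconstruction_err_),
        "topic_sizes": dict(sizes),
    }

for k, d in diagnostics.items():
    print(f"K={k} | recon_err={d['recon_err']:.4f} | sizes={d['topic_sizes']}")
```

Reading the two columns together is what supports a choice like K = 10: the error should flatten out while the topic sizes stay reasonably balanced.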

### Viewing The Results

In [68]:
K = 10
W = nmf_results[K]["doc_topic"]  # shape: (n_cases, K)

topic_cols = [f"topic_{i}" for i in range(K)]

nmf_topic_weights_k10 = pd.DataFrame(W, columns=topic_cols)
nmf_topic_weights_k10.insert(0, "case_number", fair_use_findings["case_number"].values)

# Useful summary columns (kept inside this table, not added back to fair_use_findings)
nmf_topic_weights_k10["dominant_topic"] = W.argmax(axis=1)
nmf_topic_weights_k10["dominant_weight"] = W.max(axis=1)

nmf_topic_weights_k10.head(10)

Unnamed: 0,case_number,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,dominant_topic,dominant_weight
0,39 F.4th 1214 (9th Cir. 2022),0.041529,0.0,0.0,0.0,0.225198,0.178476,0.0,0.0,0.028705,0.043319,4,0.225198
1,"No. CV 21-1102, 2022 WL 2784818 (C.D. Cal. May...",0.021433,0.135854,0.0,0.0,0.0,0.000851,0.0,0.065274,0.0,0.013951,1,0.135854
2,"No. 19-CV-7470-LTS-VF, 2022 U.S. Dist. LEXIS 8...",0.0623,0.0,0.0,0.0,0.0,0.0,0.167525,0.0,0.025972,0.167756,9,0.167756
3,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",0.0,0.012402,0.023487,0.0,0.277498,0.026679,0.0,0.039535,0.011493,0.0113,4,0.277498
4,"Nos. 20-4097-cv(L), 20-4201-cv (XAP), 2022 U.S...",0.0,0.4358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014149,1,0.4358
5,"Civ. Action No H-21-2612, 2022 U.S. Dist. LEXI...",0.0,0.009973,0.0,0.004021,0.041133,0.004524,0.021205,0.005756,0.270322,0.0,8,0.270322
6,"19 Civ. 9617 (KPF), 2022 U.S. Dist. LEXIS 5023...",0.0,0.299204,0.0,0.0,0.0,0.118378,0.0,0.0,0.0,0.0,1,0.299204
7,28 F.4th 314 (1st Cir. 2022),0.0,0.183423,0.020887,0.0,0.048423,0.0,0.0,0.051223,0.0,0.007696,1,0.183423
8,27 F.4th 313 (5th Cir. 2022),0.0,0.059604,0.085241,0.0,0.0,0.009576,2.2e-05,0.017786,0.016768,0.022853,2,0.085241
9,"No. 19 CIV. 9769 (AT), 2021 WL 4443259 (S.D.N....",0.0,0.422517,0.0,0.01196,0.0,0.0,0.0,0.0,0.0,0.0,1,0.422517
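
Topic weights like those above are the input for the retrieval approach summarized in the Executive Summary (K-Means case types plus cosine similarity, optionally restricted to the same case type). The sketch below shows the idea on random stand-in weights. The helper `similar_cases`, the toy matrix `W_toy`, and the use of 3 clusters instead of 10 are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-in for the (n_cases, K) topic-weight matrix.
rng = np.random.default_rng(0)
W_toy = rng.random((30, 10))

# Toy case types; the notebook describes k = 10 on the real weights.
case_type = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(W_toy)

def similar_cases(idx, W, labels, top_k=5, same_type_only=True):
    """Return indices and similarities of the top_k most similar cases."""
    sims = cosine_similarity(W[idx:idx + 1], W).ravel()
    sims[idx] = -np.inf                              # never return the query
    if same_type_only:
        sims[labels != labels[idx]] = -np.inf        # drop other case types
    order = np.argsort(sims)[::-1]
    order = order[np.isfinite(sims[order])][:top_k]  # keep valid matches only
    return order, sims[order]

top_idx, top_sims = similar_cases(0, W_toy, case_type)
print("query type:", case_type[0])
print("matches:", top_idx.tolist())
print("similarities:", np.round(top_sims, 3).tolist())
```

Restricting to the same case type can return fewer than `top_k` results when a cluster is small, which is why the restriction is kept optional.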


In [69]:
terms = nmf_results[10]["terms"]
H = nmf_results[10]["model"].components_  # shape: (K, n_terms)

top_n = 20
topic_labels_k10 = {}
for t in range(K):
    top_idx = np.argsort(H[t])[-top_n:][::-1]
    topic_labels_k10[t] = ", ".join(terms[top_idx])

pd.DataFrame({
    "topic": list(topic_labels_k10.keys()),
    "top_words": list(topic_labels_k10.values())
})

|     | topic | top_words                                          |
| --- | ----- | -------------------------------------------------- |
| 0   | 0     | defendants, plaintiff, advertisement, parody, ...  |
| 1   | 1     | photograph, post, article, instagram, copyrigh...  |
| 2   | 2     | defendant, plaintiff, book, court, published, ...  |
| 3   | 3     | news, footage, lans, angeles, los, los angeles...  |
| 4   | 4     | plaintiffs, works, university, factor, factors...  |
| 5   | 5     | photographs, images, photos, gossip, celebrity...  |
| 6   | 6     | film, documentary, films, clips, star, scenes,...  |
| 7   | 7     | game, video, code, software, computer, program...  |
| 8   | 8     | series, television series, television, charact...  |
| 9   | 9     | song, music, musical, lyrics, defendants, albu...  |


# Q1: Fact-Based Precedent Finder

## Updated Method

Our group initially used a TF-IDF approach for precedent retrieval. However, a key limitation of TF-IDF is that it relies heavily on exact word overlap, so it can miss semantically similar cases when the same idea is expressed with different wording.

Word-embedding models such as Word2Vec, which we learned in class, can help address this issue by capturing semantic similarity. Here we use pretrained GloVe vectors rather than training Word2Vec ourselves. That said, because our dataset is relatively small, relying on averaged word embeddings alone can make many documents appear overly similar and reduce discrimination between cases. To balance these trade-offs, we use a hybrid approach: TF-IDF retrieves a shortlist of candidate cases and filters out clearly unrelated documents, and the shortlist is then reranked with a weighted blend of the TF-IDF score and the embedding-based semantic similarity.


## Preparation

Before training and modeling, we use NLTK to tokenize the text. We first combine the key narrative fields (`key_facts`, `issue`, and `holding`) into a single text column, then generate a tokens column. These tokens will be used to build document embeddings by averaging pretrained GloVe word vectors loaded via `gensim.downloader.load`.

In [70]:
import nltk
from nltk.tokenize import word_tokenize

# Tokenizer resources required by word_tokenize
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [71]:
text_cols = ["key_facts", "issue", "holding"]
out_text = "text_all"

fair_use_findings[out_text] = (
    fair_use_findings[text_cols].fillna("").astype(str).agg(" ".join, axis=1)
)

out_tokens = "tokens"
fair_use_findings[out_tokens] = fair_use_findings[out_text].apply(lambda x: word_tokenize(str(x).lower()))

fair_use_findings["n_tokens"] = fair_use_findings[out_tokens].map(len)
print("Docs:", len(fair_use_findings))
print("Avg tokens:", fair_use_findings["n_tokens"].mean())
print("Docs with <5 tokens:", (fair_use_findings["n_tokens"] < 5).sum())
print("Sample tokens:", fair_use_findings.loc[fair_use_findings.index[0], out_tokens][:30])


Docs: 251
Avg tokens: 403.8964143426295
Docs with <5 tokens: 0
Sample tokens: ['plaintiffs', 'own', 'the', 'rights', 'to', 'a', 'catalogue', 'comprised', 'of', '16,000', 'photographs', 'of', 'pablo', 'picasso', '’', 's', 'work', ',', 'which', 'was', 'originally', 'compiled', 'by', 'picasso', '’', 's', 'friend', 'in', '1932', '(']
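
The sample tokens above include punctuation, numbers, and function words (`'the'`, `','`, `'’'`), all of which would enter the averaged embedding. One optional refinement, which is not part of the original notebook, is to keep only alphabetic content words before averaging. The helper `content_tokens` below is hypothetical, and the choice of scikit-learn's built-in English stop-word list is an assumption.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def content_tokens(tokens):
    """Keep alphabetic tokens longer than one character that are not stop words."""
    return [
        t for t in tokens
        if t.isalpha() and len(t) > 1 and t not in ENGLISH_STOP_WORDS
    ]

# First tokens of the first case, copied from the output above.
sample = ['plaintiffs', 'own', 'the', 'rights', 'to', 'a', 'catalogue',
          'comprised', 'of', '16,000', 'photographs', 'of', 'pablo',
          'picasso', '’', 's', 'work', ',']

kept = content_tokens(sample)
print(f"kept {len(kept)} of {len(sample)} tokens:", kept)
```

Whether this filtering actually makes the document vectors more distinct would need to be checked on the real data by comparing similarities with and without it.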


This time, we do not use TF-IDF as the final ranker. Instead, we use TF-IDF for retrieval: we represent each case as a TF-IDF vector and compute cosine similarity to retrieve the top-N most similar cases as our candidate set.

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(stop_words="english", max_features=20000)
X_tfidf = tfidf.fit_transform(fair_use_findings[out_text])

def tfidf_candidates_by_index(case_idx: int, top_n: int = 50):
    sims = cosine_similarity(X_tfidf[case_idx], X_tfidf).ravel()
    sims[case_idx] = -1  # exclude itself
    cand_idx = np.argsort(sims)[-top_n:][::-1]
    return cand_idx, sims[cand_idx]

def tfidf_candidates_by_text(query_text: str, top_n: int = 50):
    q_vec = tfidf.transform([str(query_text)])
    sims = cosine_similarity(q_vec, X_tfidf).ravel()
    cand_idx = np.argsort(sims)[-top_n:][::-1]
    return cand_idx, sims[cand_idx]

We then use Gensim to load pretrained GloVe vectors via `gensim.downloader.load` and compute each document embedding as the mean of its word vectors.

In [74]:
!pip install -q gensim
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # 50-dim GloVe
EMB_DIM = 50

def mean_vector(tokens, wv, dim=50):
    vecs = [wv[w] for w in tokens if w in wv]
    if not vecs:
        return np.zeros(dim, dtype="float32")
    return np.mean(vecs, axis=0)

doc_vecs = np.vstack([mean_vector(toks, wv, dim=EMB_DIM) for toks in fair_use_findings[out_tokens]])
print("doc_vecs shape:", doc_vecs.shape)

doc_vecs shape: (251, 50)


Next, we use TF-IDF as the retriever to build the candidate set, then rerank the candidates by blending the TF-IDF and GloVe similarities to obtain the final top-k results.

In [76]:
def hybrid_by_index(case_idx: int, cand_top_n: int = 50, final_top_k: int = 10, alpha: float = 0.6):
    """Rerank TF-IDF candidates with a blend of TF-IDF and GloVe cosine similarity.

    alpha closer to 1 => rely more on TF-IDF;
    alpha closer to 0 => rely more on GloVe semantic similarity.
    """

    cand_idx, tf_sims = tfidf_candidates_by_index(case_idx, top_n=cand_top_n)

    q_vec = doc_vecs[case_idx].reshape(1, -1)
    glove_sims = cosine_similarity(q_vec, doc_vecs[cand_idx]).ravel()

    score = alpha * tf_sims + (1 - alpha) * glove_sims
    order = np.argsort(score)[-final_top_k:][::-1]
    top_idx = cand_idx[order]

    out = fair_use_findings.iloc[top_idx][["title"]].copy()
    out["tfidf_sim"] = tf_sims[order]
    out["glove_sim"] = glove_sims[order]
    out["hybrid_score"] = score[order]
    return out


def hybrid_by_text(query_text: str, cand_top_n: int = 50, final_top_k: int = 10, alpha: float = 0.6):
    cand_idx, tf_sims = tfidf_candidates_by_text(query_text, top_n=cand_top_n)

    q_tokens = word_tokenize(str(query_text).lower())
    q_vec = mean_vector(q_tokens, wv, dim=EMB_DIM).reshape(1, -1)

    glove_sims = cosine_similarity(q_vec, doc_vecs[cand_idx]).ravel()
    score = alpha * tf_sims + (1 - alpha) * glove_sims

    order = np.argsort(score)[-final_top_k:][::-1]
    top_idx = cand_idx[order]

    out = fair_use_findings.iloc[top_idx][["title"]].copy()
    out["tfidf_sim"] = tf_sims[order]
    out["glove_sim"] = glove_sims[order]
    out["hybrid_score"] = score[order]
    return out

As a quick sanity check, we use the first case as the query. We first retrieve the top 50 candidates using TF-IDF cosine similarity, then rerank these candidates with the hybrid score (`alpha = 0.6`), which blends the TF-IDF and GloVe similarities, and finally return the top 5 most similar cases.

In [77]:
from IPython.display import display
display(hybrid_by_index(0, cand_top_n=50, final_top_k=5, alpha=0.6))

|     | title                                              | tfidf_sim | glove_sim | hybrid_score |
| --- | -------------------------------------------------- | --------- | --------- | ------------ |
| 64  | Estate of Anthony Barré and Angel Barré v. Car...  | 0.178372  | 0.994988  | 0.505018     |
| 60  | Penguin Random House LLC, et al. v. Frederik C...  | 0.173438  | 0.994237  | 0.501757     |
| 86  | Cambridge Univ. Press v. Patton,                   | 0.174908  | 0.990392  | 0.501102     |
| 49  | Ferdman v. CBS Interactive Inc.                    | 0.171026  | 0.995483  | 0.500809     |
| 74  | Cambridge University Press v. Mark P. Becker       | 0.165812  | 0.991958  | 0.49627      |


In the output, TF-IDF similarity scores are low (about 0.17), while GloVe similarities are uniformly very high (about 0.99). This is because averaging word vectors over long, domain-consistent legal texts tends to make different documents appear broadly similar in embedding space. As a result, with `alpha = 0.6` the GloVe term adds a nearly constant offset of about 0.40 to every hybrid score, so most of the ranking differences are driven by the TF-IDF signal, while GloVe acts mainly as a tie-breaker within the candidate set.
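
The arithmetic behind this observation can be checked directly from the five rows reported above. The sketch below recomputes the hybrid scores with `alpha = 0.6` and compares how much each component varies across the rows. It covers only the top 5 results, so the spread over all 50 candidates may differ.

```python
import numpy as np

alpha = 0.6

# Values copied from the five rows of the output above.
tfidf_sim = np.array([0.178372, 0.173438, 0.174908, 0.171026, 0.165812])
glove_sim = np.array([0.994988, 0.994237, 0.990392, 0.995483, 0.991958])
reported = np.array([0.505018, 0.501757, 0.501102, 0.500809, 0.496270])

# Same blend as in hybrid_by_index: alpha * TF-IDF + (1 - alpha) * GloVe.
hybrid = alpha * tfidf_sim + (1 - alpha) * glove_sim

# Weighted range (max - min) of each component across the five rows.
tfidf_spread = alpha * (tfidf_sim.max() - tfidf_sim.min())
glove_spread = (1 - alpha) * (glove_sim.max() - glove_sim.min())

print("recomputed hybrid:", np.round(hybrid, 6).tolist())
print(f"weighted TF-IDF spread: {tfidf_spread:.4f}")
print(f"weighted GloVe spread:  {glove_spread:.4f}")
```

Across these rows the weighted TF-IDF term varies by about 0.0075 and the weighted GloVe term by about 0.0020, so TF-IDF accounts for most of the differences in the final score.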