# **Integrated Analysis & Synergy For Fair Use Cases Data**

**Project Milestone 4**

**Team:** B1 Team 13

**Team Members:** Rita Feng

# Introduction

Fair use decisions are highly fact-specific, so outcomes can vary across cases and courts. In this project, we apply computational text analysis to fair use opinions to uncover common patterns, compare outcomes across venues, support precedent retrieval, and flag unusual boundary cases.

Previously, instead of merging the fair_use_findings and fair_use_cases datasets, our team used only fair_use_findings for downstream processing. We will continue using fair_use_findings as the primary dataset for model training.

For Question 1 (Precedent Finder), we plan to move beyond a TF-IDF–only approach by using a hybrid retrieval method that combines TF-IDF with gensim-based embeddings.

# Executive Summary

Q1: We use a hybrid precedent-retrieval approach that combines TF-IDF for exact term overlap with gensim Word2Vec embeddings trained on our own case corpus for semantic similarity. This pairing is designed to improve matching by capturing both shared legal language and underlying meaning.


# Setting Up Environment

In [61]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Core
import os
import re
import math
import json
import time
import string
import random
from pathlib import Path
from collections import Counter, defaultdict

# Data
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# NLP / Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, PCA, TruncatedSVD
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Modeling / Anomaly detection
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Association rules
from mlxtend.frequent_patterns import apriori, association_rules

# Dimensionality reduction for viz
from sklearn.manifold import TSNE

# Tokenization
from nltk.tokenize import word_tokenize

# Data Importing, Inspection and Preparation

## Choosing the Working Analysis Table

In Milestone 2, the team used **both approaches**: some analyses worked entirely within a single table (most often `fair_use_findings`), while others attempted to merge `fair_use_cases` and `fair_use_findings` to enrich metadata. We explicitly tested whether the two tables could be integrated into a single “master” dataset. A straightforward join exposed a structural mismatch in identifiers, producing only a **12% match rate**. Follow-up preprocessing improved linkage by normalizing formatting differences (e.g., standardizing “vs” to “v” and stripping punctuation), recovering **70.5% of records (177 cases)** in one improved approach. Other join strategies reached match rates only in the **low 30% range**. Even at the higher match rate, the unmatched share is large enough that (1) we risk losing too many cases if we require a strict join, and (2) we risk introducing false matches if we loosen the rules.
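
The normalization step described above can be sketched roughly as follows. This is an illustration under assumptions, not the Milestone 2 code: the function name `normalize_case_name` and the exact rules (lowercasing, mapping "vs"/"versus"/"v." to "v", dropping punctuation, collapsing whitespace) are hypothetical.

```python
import re

def normalize_case_name(name: str) -> str:
    """Hypothetical join-key normalizer for case titles (illustrative only)."""
    s = name.lower()
    # Standardize "vs", "vs.", "versus", and "v." to a bare "v"
    s = re.sub(r"\b(?:versus|vs|v)\b\.?", "v", s)
    # Drop punctuation; this is ASCII-only, so accented letters would need extra handling
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    # Collapse repeated whitespace
    return re.sub(r"\s+", " ", s).strip()

print(normalize_case_name("Cariou vs. Prince,"))
print(normalize_case_name("Cariou v. Prince"))
```

Both calls produce the same key, which is what lets differently formatted titles line up in a join. Looser rules raise the match rate but also the chance of false matches, which is the trade-off noted above.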

For Milestone 3, we prioritize a reproducible pipeline with minimal merge risk. As an integration choice, we therefore use **`fair_use_findings` as the single working analysis table** and treat `fair_use_cases` as optional metadata rather than a required dependency. This aligns with the milestone’s core modeling needs because our unsupervised methods depend most heavily on the narrative fields in `fair_use_findings`, while the structured fields in that table are sufficient for the planned comparisons.

**Key overlap and differences (without requiring a merge):**

* **Identifiers:** `fair_use_findings` includes `title` and `case_number`, which together serve the same reference role as the single `case` string in `fair_use_cases`. These fields are mainly used for labeling and lookup rather than as modeling features.
* **Year:** `year` is present in both tables and can be used directly without merging.
* **Court and venue information:** `fair_use_cases` includes `court` and `jurisdiction`, while `fair_use_findings` includes `court`. These are overlapping venue signals, and a single court field is sufficient for clustering and venue-aware comparisons.
* **Outcomes:** both tables include an `outcome` field describing the fair use result at a high level, supporting standardized labeling.
* **Text fields unique to `fair_use_findings`:** `key_facts`, `issue`, and `holding` provide the narrative summaries required for text-based unsupervised methods. In Milestone 2, the primary modeling text was often constructed from **`key_facts + issue`** to reduce decision-language leakage from holdings; this combined text averaged **about 145 words**, which is well suited to TF–IDF, topic modeling, and clustering workflows.

**Outcome labeling approach (used throughout M3):**

* Because `fair_use_findings` does not provide a boolean `fair_use_found`, we derive labels from `outcome` text and standardize into three groups: **fair use found**, **fair use not found**, and **indeterminate**.
* For venue comparisons, we compute outcome-related metrics using **determinate outcomes only**, avoiding mixed, preliminary, or remand-like resolutions.
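
The three-way grouping above could also be expressed as a small rule-based function. The notebook itself uses an explicit mapping dictionary, so this is only an alternative sketch; the function name `label_outcome` and the keyword list are assumptions made for illustration.

```python
def label_outcome(raw: str) -> str:
    """Map a raw outcome string to one of three groups (illustrative rules)."""
    s = str(raw).lower().strip().rstrip(".")
    # Anything preliminary, mixed, or remanded is treated as indeterminate
    if any(keyword in s for keyword in ("preliminary", "mixed", "remand")):
        return "INDETERMINATE"
    if s.startswith("fair use not found"):
        return "FAIR_USE_NOT_FOUND"
    if s.startswith("fair use found"):
        return "FAIR_USE_FOUND"
    # Irregular or unrecognized text falls back to indeterminate
    return "INDETERMINATE"

for text in ["Fair use found.", "Preliminary ruling, fair use not found", "Mixed result"]:
    print(text, "->", label_outcome(text))
```

A rule-based labeler generalizes to unseen spellings, while an explicit dictionary makes every decision auditable; with only a handful of distinct outcome strings, either is workable.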

Overall, while merging can add optional metadata, Milestone 2 showed that reliable integration requires record linkage plus validation and can still induce substantial case loss or matching risk. Since `fair_use_findings` already contains the critical unstructured fields used across all M3 modeling tracks, we proceed with a **one-table analysis** to maximize coverage, reduce merge-induced bias, and keep the pipeline reproducible.

## Data Importing

In [62]:
fair_use_findings = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-08-29/fair_use_findings.csv')

## Data Inspection

The `fair_use_findings` table contains complementary case-level text, including summaries of key facts, legal issues, holdings, and descriptive tags. Inspection centers on text completeness and variability, as these fields support later analysis of language patterns, similarity, and thematic structure across cases.

| variable    | class     | description                                                                            |
| ----------- | --------- | -------------------------------------------------------------------------------------- |
| title       | character | The title of the case.                                                                 |
| case_number | character | The case number or numbers of the case.                                                |
| year        | character | The year in which the finding was made (or findings were made).                        |
| court       | character | The court or courts involved.                                                          |
| key_facts   | character | The key facts of the case.                                                             |
| issue       | character | A brief description of the fair use issue.                                             |
| holding     | character | The decision of the court in paragraph form.                                           |
| tags        | character | Comma- or semicolon-separated tags for this case.                                      |
| outcome     | character | A brief description of the outcome of the case. These fields have not been normalized. |

In [63]:
print("Dataset Info:")
print(fair_use_findings.info())

print("\nFirst 5 rows:")
print(fair_use_findings.head())

print("\nMissing Values:")
print(fair_use_findings.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         251 non-null    object
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: object(9)
memory usage: 17.8+ KB
None

First 5 rows:
                                               title  \
0                              De Fontbrune v. Wofsy   
1                          Sedlik v. Von Drachenberg   
2  Sketchworks Indus. Strength Comedy, Inc. v. Ja...   
3  Am. Soc'y for Testing & Materials v. Public.Re...   
4                           Yang v. Mic Network Inc.   

                                 

## Preparing Data

### Outcome Flag Construction

The outcome column is converted into a simple label for analysis. The text is cleaned and then grouped into three outcomes: fair use found, fair use not found, and indeterminate (preliminary, mixed, remand, or unclear). A binary `fair_use_found` flag is created only for the final outcomes, and indeterminate cases are left out of binary rate calculations.

In [64]:
# Normalize the outcome text (lowercase, trimmed), then count each distinct value
fair_use_findings["outcome"] = fair_use_findings["outcome"].astype(str).str.lower().str.strip()
outcome_counts = fair_use_findings["outcome"].value_counts().reset_index()
outcome_counts.columns = ["outcome", "count"]

# Display the counts
print(outcome_counts)

                                              outcome  count
0                                  fair use not found    100
1                                      fair use found     98
2         preliminary ruling, mixed result, or remand     28
3             preliminary finding; fair use not found      4
4                                        mixed result      3
5              preliminary ruling, fair use not found      3
6              fair use not found, preliminary ruling      3
7              preliminary ruling; fair use not found      2
8              fair use not found; preliminary ruling      2
9                          preliminary ruling, remand      1
10                                fair use not found.      1
11                                    fair use found.      1
12  preliminary ruling, fair use not found, mixed ...      1
13                 preliminary ruling, fair use found      1
14  fair use found; second circuit affirmed on app...      1
15                      

Based on the grouped outcome counts, outcomes fall into three clear categories. Entries labeled “Fair use found” (including minor punctuation or appeal notes) are treated as fair use found, and entries labeled “Fair use not found” (including punctuation variants) are treated as fair use not found. All remaining outcomes, such as preliminary rulings, mixed results, remands, and irregular text entries, are treated as indeterminate. A binary fair_use_found flag is then defined only for the final outcomes, while indeterminate cases are excluded from binary rate calculations.

In [65]:
outcome_map = {
    # FINAL: fair use found
    "fair use found": "FAIR_USE_FOUND",
    "fair use found.": "FAIR_USE_FOUND",
    "fair use found; second circuit affirmed on appeal.": "FAIR_USE_FOUND",

    # FINAL: fair use not found
    "fair use not found": "FAIR_USE_NOT_FOUND",
    "fair use not found.": "FAIR_USE_NOT_FOUND",

    # INDETERMINATE
    "preliminary ruling, mixed result, or remand": "INDETERMINATE",
    "preliminary finding; fair use not found": "INDETERMINATE",
    "mixed result": "INDETERMINATE",
    "preliminary ruling, fair use not found": "INDETERMINATE",
    "fair use not found, preliminary ruling": "INDETERMINATE",
    "preliminary ruling; fair use not found": "INDETERMINATE",
    "fair use not found; preliminary ruling": "INDETERMINATE",
    "preliminary ruling, remand": "INDETERMINATE",
    "preliminary ruling, fair use not found, mixed result": "INDETERMINATE",
    "preliminary ruling, fair use found": "INDETERMINATE",
    "fair use found; mixed result": "INDETERMINATE",
    "plaintiff patrick cariou published yes rasta, a book of portraits and landscape photographs taken in jamaica. defendant richard prince was an appropriation artist who altered and incorporated several of plaintiff’s photographs into a series of paintings and collages called canal zone that was exhibited at a gallery and in the gallery’s exhibition catalog. plaintiff filed an infringement claim, and the district court ruled in his favor, stating that to qualify as fair use, a secondary work must “comment on, relate to the historical context of, or critically refer back to the original works.” defendant appealed.": "INDETERMINATE",
}

In [66]:
# Replace outcome column values with the mapping in outcome_map
fair_use_findings["outcome"] = fair_use_findings["outcome"].replace(outcome_map)
fair_use_findings["outcome"].value_counts().reset_index()

Unnamed: 0,outcome,count
0,FAIR_USE_NOT_FOUND,101
1,FAIR_USE_FOUND,100
2,INDETERMINATE,50


### Column Cleaning Steps

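The `year` column is converted to a nullable integer type (`Int64`) so it can be used reliably in grouping, filtering, and downstream modeling. With `errors="coerce"`, any value that cannot be parsed as a number becomes missing (`<NA>`) instead of raising an error; the summary below shows 250 non-null years out of 251 rows, so one entry was coerced.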

In [67]:
# Turn the year column to integer
fair_use_findings["year"] = pd.to_numeric(fair_use_findings["year"], errors="coerce").astype("Int64")

In [68]:
fair_use_findings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         250 non-null    Int64 
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: Int64(1), object(8)
memory usage: 18.0+ KB


In [69]:
fair_use_findings.head()

Unnamed: 0,title,case_number,year,court,key_facts,issue,holding,tags,outcome
0,De Fontbrune v. Wofsy,39 F.4th 1214 (9th Cir. 2022),2022,United States Court of Appeals for the Ninth C...,Plaintiffs own the rights to a catalogue compr...,Whether reproduction of photographs documentin...,"The panel held that the first factor, the purp...",Education/Scholarship/Research; Photograph,FAIR_USE_NOT_FOUND
1,Sedlik v. Von Drachenberg,"No. CV 21-1102, 2022 WL 2784818 (C.D. Cal. May...",2022,United States District Court for the Southern ...,Plaintiff Jeffrey Sedlik is a photographer who...,Whether use of a photograph as the reference i...,"Considering the first fair use factor, the pur...",Painting/Drawing/Graphic; Photograph,INDETERMINATE
2,"Sketchworks Indus. Strength Comedy, Inc. v. Ja...","No. 19-CV-7470-LTS-VF, 2022 U.S. Dist. LEXIS 8...",2022,United States District Court for the Southern ...,Plaintiff Sketchworks Industrial Strength Come...,"Whether the use of protected elements, includi...","The court found that the first factor, the pur...",Film/Audiovisual; Music; Parody/Satire; Review...,FAIR_USE_FOUND
3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,"Defendant Public.Resource.Org, Inc., a non-pro...",Whether it is fair use to make available onlin...,"As directed by the court of appeals, the distr...",Education/Scholarship/Research; Textual Work; ...,INDETERMINATE
4,Yang v. Mic Network Inc.,"Nos. 20-4097-cv(L), 20-4201-cv (XAP), 2022 U.S...",2022,United States Court of Appeals for the Second ...,Plaintiff Stephen Yang (“Yang”) licensed a pho...,"Whether using a screenshot from an article, in...","On appeal, the court decided that the first fa...",News Reporting; Photography,FAIR_USE_FOUND


## Fact-Based Precedent Finder

### Past model

Before training, we create a new text_old column by combining key_facts and issue into a single paragraph for downstream modeling.

In [70]:
fair_use_findings["text_old"] = (
    fair_use_findings["key_facts"].fillna("").astype(str) + " " +
    fair_use_findings["issue"].fillna("").astype(str)
)

fair_use_findings["text_old"] = (
    fair_use_findings["text_old"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

We use TF-IDF to convert the text into the X_tfidf_old feature matrix.

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec_old = TfidfVectorizer(
    stop_words="english",
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2)
)

X_tfidf_old = tfidf_vec_old.fit_transform(fair_use_findings["text_old"])
terms_tfidf_old = tfidf_vec_old.get_feature_names_out()

print("X_tfidf_old shape:", X_tfidf_old.shape)

X_tfidf_old shape: (251, 3411)


We apply NMF to the TF-IDF matrix to perform topic modeling.

In [72]:
from sklearn.decomposition import NMF

K_old = 10
nmf_old = NMF(n_components=K_old, random_state=42, init="nndsvda", max_iter=400)
W_old = nmf_old.fit_transform(X_tfidf_old)   # doc-topic matrix

fair_use_findings["nmf_topic_old"] = W_old.argmax(1)
print("NMF recon_err:", nmf_old.reconstruction_err_)

NMF recon_err: 14.717318170618299


We use the topic-weight matrix W_old as input to K-Means and assign each case to a case-type cluster.

In [73]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_cluster_old = StandardScaler().fit_transform(W_old)

k_final_old = 10
kmeans_old = KMeans(n_clusters=k_final_old, random_state=42, n_init=50)
fair_use_findings["case_type_cluster_old"] = kmeans_old.fit_predict(X_cluster_old)

print("Cluster sizes:")
print(fair_use_findings["case_type_cluster_old"].value_counts().sort_index())

Cluster sizes:
case_type_cluster_old
0    12
1    23
2    68
3    63
4    21
5     5
6    21
7    11
8     7
9    20
Name: count, dtype: int64


In [74]:
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_precedents_old(query_text, top_n=10, restrict_to_same_case_type=True, exclude_idx=None):
    q = str(query_text).lower().strip()
    q_vec = tfidf_vec_old.transform([q])
    sims = cosine_similarity(q_vec, X_tfidf_old).ravel()

    if exclude_idx is not None:
        sims[exclude_idx] = -1

    if restrict_to_same_case_type:
        qW = nmf_old.transform(q_vec)  # (1, K_old)

        centroids = pd.DataFrame(W_old).groupby(fair_use_findings["case_type_cluster_old"]).mean().values

        q_type = int(np.argmax(cosine_similarity(qW, centroids).ravel()))
        mask = (fair_use_findings["case_type_cluster_old"] == q_type).to_numpy()
        sims = np.where(mask, sims, -1)

    top_idx = np.argsort(sims)[::-1][:top_n]

    out = fair_use_findings.iloc[top_idx][["title","case_number","year","court","outcome","tags"]].copy()
    out.insert(0, "case_idx", top_idx)
    out["similarity_old"] = sims[top_idx]
    out["case_type_cluster_old"] = fair_use_findings.iloc[top_idx]["case_type_cluster_old"].to_numpy()
    return out

In [75]:
from IPython.display import display

case_idx = 0
query_text_old = fair_use_findings.loc[fair_use_findings.index[case_idx], "text_old"]

print("=== OLD MODEL DEMO (TF-IDF cosine, no restriction) ===")
display(retrieve_precedents_old(
    query_text_old,
    top_n=10,
    restrict_to_same_case_type=False,
    exclude_idx=case_idx
))

=== OLD MODEL DEMO (TF-IDF cosine, no restriction) ===


Unnamed: 0,case_idx,title,case_number,year,court,outcome,tags,similarity_old,case_type_cluster_old
240,240,"Meeropol v. Nizer,",560 F.2d 1061 (2d Cir. 1977),1977,United States Court of Appeals for the Second ...,INDETERMINATE,Second Circuit; Education/Scholarship/Research...,0.187706,6
57,57,"Barcroft Media, Ltd. V. Coed Media Group, LLC","No. 16-CV-7634 (JMF) (S.D.N.Y. Nov. 2, 2017)",2017,United States District Court for the Southern ...,FAIR_USE_NOT_FOUND,"Second Circuit, Internet/Digitization, Photogr...",0.165811,4
147,147,"Elvis Presley Enters., Inc. v. Passport Video,","349 F.3d 622 (9th Cir. 2003), overruled on oth...",2003,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Education/Scholarship/Research;...,0.149477,3
245,245,"Berlin v. E.C. Publ’ns, Inc.,",329 F.2d 541 (2d Cir. 1964),1964,United States Court of Appeals for the Second ...,FAIR_USE_FOUND,Second Circuit; Music; Parody/Satire,0.148766,9
44,44,"VHT, Inc. v. Zillow Group","Nos. 17-35587, 17-35588 (9th Cir. Mar. 15, 2019)",2019,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Photograph,0.147875,4
200,200,"Cable/Home Commc’n Corp. v. Network Prods., Inc.,",902 F.2d 829 (11th Cir. 1990),1990,United States Court of Appeals for the Elevent...,FAIR_USE_NOT_FOUND,Eleventh Circuit; Computer program,0.147402,3
3,3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,INDETERMINATE,Education/Scholarship/Research; Textual Work; ...,0.147122,6
86,86,"Cambridge Univ. Press v. Patton,",769 F.3d 1232 (11th Cir. 2014),2014,United States Court of Appeals for the Elevent...,INDETERMINATE,Eleventh Circuit; Education/Scholarship/Resear...,0.143386,6
228,228,"Jartech, Inc. v. Clancy,",666 F.2d 403 (9th Cir. 1982),1982,United States Court of Appeals for the Ninth C...,FAIR_USE_FOUND,Ninth Circuit; Film/Audiovisual; Used in gover...,0.138605,3
97,97,"Cariou v. Prince,",714 F.3d 694 (2d Cir. 2013) cert. denied 134 S...,2013,United States Court of Appeals for the Second ...,INDETERMINATE,"Preliminary ruling, mixed result, or remand",0.138164,4


We use `case_idx = 0` as the query and apply TF-IDF cosine similarity to retrieve the 10 most similar cases. As shown in the table, the top result is Meeropol v. Nizer with `similarity_old = 0.1877`, followed by Barcroft Media, Ltd. v. Coed Media Group, LLC with `similarity_old = 0.1658`. All ten scores fall below 0.19, so the lexical overlap is modest even for the best matches. The retrieved cases also have mixed outcomes; this is expected, because the objective is to retrieve factually comparable precedents rather than to predict case outcomes.


### Introduction and Updated Method

Our group initially used a TF-IDF approach for precedent retrieval. However, a key limitation of TF-IDF is that it relies heavily on exact word overlap, so it can miss semantically similar cases when the same idea is expressed with different wording.
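
The exact-overlap limitation can be seen in a toy example. The sketch below uses raw term counts rather than TF-IDF weights (zero shared terms gives zero cosine under either weighting), and the sentences, stop list, and function name `bow_cosine` are made up for illustration.

```python
import math
from collections import Counter

STOP = {"the", "a", "an", "of", "to"}  # tiny illustrative stop list

def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(w for w in text_a.lower().split() if w not in STOP)
    b = Counter(w for w in text_b.lower().split() if w not in STOP)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

same_words = bow_cosine("the artist copied the photograph", "the artist copied a photograph")
paraphrase = bow_cosine("the artist copied the photograph", "a painter reproduced an image")
print(same_words, paraphrase)
```

The paraphrase describes the same situation but shares no content words with the query, so a purely lexical score treats it as unrelated.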

To address this, we use Word2Vec embeddings trained on our case corpus, which can capture semantic relationships beyond exact keyword matches. That said, because our dataset is relatively small and we represent each document by averaging word vectors, the embedding-based similarities can become uniformly high and less discriminative. To balance these trade-offs, we use a hybrid approach: TF-IDF is used to retrieve a shortlist of candidate cases and filter out clearly unrelated documents, and then Word2Vec-based semantic similarity is used to rerank those candidates.
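
One plausible mechanism behind the uniformly high similarities can be illustrated with synthetic vectors. This is a toy simulation, not the notebook's model: it assumes every word vector is a shared offset plus word-specific noise, and all sizes and names (`shared`, `word_vecs`, `doc_len`) are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab, doc_len = 100, 2000, 200

# Assumption: every word vector = shared offset + word-specific noise
shared = np.full(dim, 0.5)
word_vecs = shared + rng.normal(size=(vocab, dim))

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Average cosine between random pairs of individual words
pairs = rng.integers(0, vocab, size=(500, 2))
word_cos = float(np.mean([cosine(word_vecs[i], word_vecs[j]) for i, j in pairs if i != j]))

# Two "documents" built by averaging 200 randomly chosen word vectors each
doc_a = word_vecs[rng.integers(0, vocab, size=doc_len)].mean(axis=0)
doc_b = word_vecs[rng.integers(0, vocab, size=doc_len)].mean(axis=0)
doc_cos = cosine(doc_a, doc_b)
print(round(word_cos, 3), round(doc_cos, 3))
```

Averaging cancels most of the word-specific noise and leaves the shared component, so two unrelated documents end up pointing in nearly the same direction. That is why the embedding score is better suited to refining a shortlist than to ranking the whole corpus on its own.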

### Prepare

Before training and modeling, we use NLTK to tokenize the text. We first combine the key narrative fields—key_facts, issue, and holding—into a single text column, then generate a tokens column. These tokens are then used to train a Word2Vec model on our corpus and to build document embeddings by averaging the learned word vectors for each case.

In [76]:
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [77]:
text_cols = ["key_facts", "issue", "holding"]
out_text = "text_all"
out_tokens = "tokens"

fair_use_findings[out_text] = (
    fair_use_findings[text_cols]
    .fillna("").astype(str)
    .agg(" ".join, axis=1)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

stop = set(stopwords.words("english"))
fair_use_findings[out_tokens] = fair_use_findings[out_text].fillna("").astype(str).apply(
    lambda x: [t for t in word_tokenize(x.lower()) if t.isalpha() and t not in stop and len(t) > 2]
)

fair_use_findings["n_tokens"] = fair_use_findings[out_tokens].map(len)
print("Docs:", len(fair_use_findings))
print("Avg tokens:", fair_use_findings["n_tokens"].mean())
print("Docs with <5 tokens:", (fair_use_findings["n_tokens"] < 5).sum())
print("Sample tokens:", fair_use_findings.loc[fair_use_findings.index[0], out_tokens][:30])

Docs: 251
Avg tokens: 197.4581673306773
Docs with <5 tokens: 0
Sample tokens: ['plaintiffs', 'rights', 'catalogue', 'comprised', 'photographs', 'pablo', 'picasso', 'work', 'originally', 'compiled', 'picasso', 'friend', 'zervos', 'catalogue', 'obtaining', 'permission', 'picasso', 'estate', 'publish', 'work', 'illustrating', 'describing', 'works', 'picasso', 'defendants', 'alan', 'wofsy', 'company', 'alan', 'wofsy']


### TF-IDF

This time, we do not use TF-IDF as the final ranker. Instead, we use TF-IDF for retrieval: we represent each case as a TF-IDF vector and compute cosine similarity to retrieve the top-N most similar cases as our candidate set.

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(
    stop_words="english",
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2)
)
X_tfidf = tfidf.fit_transform(fair_use_findings[out_text])

def tfidf_candidates_by_index(case_idx: int, top_n: int = 50):
    sims = cosine_similarity(X_tfidf[case_idx], X_tfidf).ravel()
    sims[case_idx] = -1
    cand_idx = np.argsort(sims)[-top_n:][::-1]
    return cand_idx, sims[cand_idx]
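
The selection idiom used in `tfidf_candidates_by_index` (mask the query, sort ascending, keep the tail, reverse) can be checked on a tiny array. The helper name `top_n_excluding_self` and the similarity values are made up for illustration.

```python
import numpy as np

def top_n_excluding_self(sims, self_idx, top_n):
    """Return indices and scores of the top_n largest similarities, best first."""
    sims = np.asarray(sims, dtype=float).copy()
    sims[self_idx] = -1.0                      # the query can never be its own match
    order = np.argsort(sims)[-top_n:][::-1]    # ascending sort, keep tail, reverse
    return order, sims[order]

idx, scores = top_n_excluding_self([1.0, 0.30, 0.05, 0.22], self_idx=0, top_n=2)
print(idx.tolist(), scores.tolist())
```

Setting the query's own score to -1 works because cosine similarity on non-negative TF-IDF vectors is never below 0; tied scores may come back in either order.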

### Word2Vec

We then use Gensim to generate embeddings by training a Word2Vec model on our tokenized corpus and computing document embeddings by averaging the learned word vectors for each case.


In [79]:
!pip -q install gensim

import numpy as np
from gensim.models import Word2Vec

out_tokens = "tokens"
sentences = fair_use_findings[out_tokens].tolist()


w2v = Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=2,
    workers=1,
    sg=1,
    epochs=20
)

EMB_DIM = w2v.wv.vector_size

def mean_vector(tokens):
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    if not vecs:
        return np.zeros(EMB_DIM, dtype="float32")
    return np.mean(np.asarray(vecs), axis=0).astype("float32")

doc_vecs = np.vstack([mean_vector(toks) for toks in sentences])
print("doc_vecs shape:", doc_vecs.shape, "| vocab size:", len(w2v.wv))

doc_vecs shape: (251, 100) | vocab size: 3568


### Hybrid

We use TF-IDF to retrieve a top-N candidate set, then rerank those candidates with a weighted score, `alpha * tfidf_sim + (1 - alpha) * w2v_sim`, where `w2v_sim` is the Word2Vec-based cosine similarity, and finally return the top-K cases by that score.


In [80]:
def hybrid_by_text(query_text: str, cand_top_n: int = 50, final_top_k: int = 10, alpha: float = 0.6):
    # TF-IDF candidates from raw text
    q_vec_tfidf = tfidf.transform([str(query_text)])
    sims = cosine_similarity(q_vec_tfidf, X_tfidf).ravel()
    cand_idx = np.argsort(sims)[-cand_top_n:][::-1]
    tf_sims = sims[cand_idx]

    # Word2Vec vector for query text (same token filter as corpus)
    q_tokens = [t for t in word_tokenize(str(query_text).lower())
                if t.isalpha() and t not in stop and len(t) > 2]
    q_vec = mean_vector(q_tokens).reshape(1, -1)

    # Rerank within candidates using Word2Vec cosine
    w2v_sims = cosine_similarity(q_vec, doc_vecs[cand_idx]).ravel()
    score = alpha * tf_sims + (1 - alpha) * w2v_sims

    order = np.argsort(score)[-final_top_k:][::-1]
    top_idx = cand_idx[order]

    out = fair_use_findings.iloc[top_idx][["title"]].copy()
    out.insert(0, "case_idx", top_idx)
    out["tfidf_sim"] = tf_sims[order]
    out["w2v_sim"] = w2v_sims[order]
    out["hybrid_score"] = score[order]
    return out
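
The test cells that follow call `hybrid_by_index`, whose definition does not appear in the cells shown here. The sketch below is a guess at an index-based variant, assuming it mirrors `hybrid_by_text` but takes a row already in the corpus as the query and excludes that row from the results. It is written against plain dense NumPy arrays so it runs on its own; the names `hybrid_by_index_sketch`, `cosine_rows`, `toy_tfidf`, and `toy_emb` are hypothetical, and the notebook's sparse `X_tfidf` would need `cosine_similarity` from scikit-learn instead of the helper used here.

```python
import numpy as np

def cosine_rows(q, M):
    """Cosine similarity between vector q and every row of matrix M."""
    q = np.asarray(q, dtype=float)
    M = np.asarray(M, dtype=float)
    denom = np.linalg.norm(M, axis=1) * np.linalg.norm(q)
    return (M @ q) / np.where(denom == 0, 1.0, denom)  # guard against zero vectors

def hybrid_by_index_sketch(case_idx, tfidf_dense, doc_vecs,
                           cand_top_n=50, final_top_k=10, alpha=0.6):
    """Hypothetical index-based variant: the query is a row already in the corpus."""
    # Stage 1: TF-IDF cosine against all cases, excluding the query itself
    tf_all = cosine_rows(tfidf_dense[case_idx], tfidf_dense)
    tf_all[case_idx] = -1.0
    cand_idx = np.argsort(tf_all)[-cand_top_n:][::-1]
    tf_sims = tf_all[cand_idx]
    # Stage 2: embedding cosine within the candidates, blended with TF-IDF
    emb_sims = cosine_rows(doc_vecs[case_idx], doc_vecs[cand_idx])
    score = alpha * tf_sims + (1 - alpha) * emb_sims
    order = np.argsort(score)[-final_top_k:][::-1]
    return [(int(cand_idx[i]), float(tf_sims[i]), float(emb_sims[i]), float(score[i]))
            for i in order]

# Toy data: 4 "cases" with 3 TF-IDF features and 2-dimensional embeddings
toy_tfidf = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0.1]])
toy_emb = np.array([[1, 0], [1, 0.1], [0, 1], [0.9, 0.1]])
for row in hybrid_by_index_sketch(0, toy_tfidf, toy_emb, cand_top_n=2, final_top_k=2):
    print(row)
```

Each returned tuple holds the case index, the TF-IDF similarity, the embedding similarity, and the blended score, matching the columns shown in the result tables.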

### Test

To evaluate our model’s performance, we use the first case as the query. We first retrieve the top 50 candidates using TF-IDF cosine similarity, then rerank these candidates with Word2Vec-based semantic similarity, and finally return the top 5 most similar cases.

We are also curious about how `alpha` and the candidate pool size (`cand_top_n`) affect the results; both are examined in the Parameter section.

In [81]:
from IPython.display import display
display(hybrid_by_index(0, cand_top_n=50, final_top_k=5, alpha=0.6))

Unnamed: 0,case_idx,title,tfidf_sim,w2v_sim,hybrid_score
60,60,"Penguin Random House LLC, et al. v. Frederik C...",0.224331,0.980322,0.526727
22,22,"Dr. Seuss Enters., L.P. v. ComicMix LLC",0.219006,0.980442,0.52358
64,64,Estate of Anthony Barré and Angel Barré v. Car...,0.215036,0.956246,0.51152
51,51,"Lucasfilm Ltd. LLC, et al. v. Ren Ventures Ltd...",0.187668,0.979085,0.504235
68,68,Reiner v. Nishimori,0.184045,0.96964,0.498283


We use `case_idx = 0` as the query and run our hybrid retrieval pipeline (TF-IDF candidate retrieval with `cand_top_n = 50`, followed by Word2Vec reranking with `alpha = 0.6`) to return the top 5 most similar precedents. The results show that **Penguin Random House LLC v. Frederik C…** ranks first with `tfidf_sim = 0.2243`, `w2v_sim = 0.9803`, and `hybrid_score = 0.5267`, followed by **Dr. Seuss Enters., L.P. v. ComicMix LLC** (`hybrid_score = 0.5236`) and **Estate of Anthony Barré and Angel Barré v. …** (`hybrid_score = 0.5115`). Overall, the TF-IDF similarities are moderate (0.18 to 0.22) while the Word2Vec similarities are consistently high (0.96 to 0.98), so the final ranking is driven mainly by small differences in TF-IDF, with Word2Vec providing semantic refinement within the candidate set.


### Parameter

#### Cand_top

In [82]:
from IPython.display import display
display(hybrid_by_index(0, cand_top_n=10, final_top_k=5, alpha=0.6))

Unnamed: 0,case_idx,title,tfidf_sim,w2v_sim,hybrid_score
60,60,"Penguin Random House LLC, et al. v. Frederik C...",0.224331,0.980322,0.526727
22,22,"Dr. Seuss Enters., L.P. v. ComicMix LLC",0.219006,0.980442,0.52358
64,64,Estate of Anthony Barré and Angel Barré v. Car...,0.215036,0.956246,0.51152
51,51,"Lucasfilm Ltd. LLC, et al. v. Ren Ventures Ltd...",0.187668,0.979085,0.504235
68,68,Reiner v. Nishimori,0.184045,0.96964,0.498283


When we decrease `cand_top_n` from 50 to 10, the top-5 results stay exactly the same for this query. All five top-ranked cases already fall within the 10 best TF-IDF candidates, so with `alpha = 0.6` the ranking is not sensitive to the candidate pool size here.

#### Alpha

In [83]:
from IPython.display import display
display(hybrid_by_index(0, cand_top_n=50, final_top_k=5, alpha=0.3))

Unnamed: 0,case_idx,title,tfidf_sim,w2v_sim,hybrid_score
60,60,"Penguin Random House LLC, et al. v. Frederik C...",0.224331,0.980322,0.753525
22,22,"Dr. Seuss Enters., L.P. v. ComicMix LLC",0.219006,0.980442,0.752011
51,51,"Lucasfilm Ltd. LLC, et al. v. Ren Ventures Ltd...",0.187668,0.979085,0.74166
68,68,Reiner v. Nishimori,0.184045,0.96964,0.733961
64,64,Estate of Anthony Barré and Angel Barré v. Car...,0.215036,0.956246,0.733883


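With `alpha = 0.3`, Word2Vec contributes 70% of the score. Since the `w2v_sim` values are very high (about 0.96 to 0.98), the weighted scores become much larger (about 0.73 to 0.75). With `alpha = 0.6`, Word2Vec contributes only 40%, so the scores fall to about 0.50 to 0.53. The ordering also shifts: Estate of Anthony Barré and Angel Barré v. Car... has the lowest `w2v_sim` (0.9562) of the five and drops from third to fifth place, while the top two cases are unchanged.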
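
The effect of `alpha` can be verified by recomputing the weighted score from the `tfidf_sim` and `w2v_sim` values reported in the tables above. The sketch assumes only the blending formula used in `hybrid_by_text`; the helper name `rank` is illustrative.

```python
# (case_idx, tfidf_sim, w2v_sim) copied from the result tables above
rows = [
    (60, 0.224331, 0.980322),
    (22, 0.219006, 0.980442),
    (64, 0.215036, 0.956246),
    (51, 0.187668, 0.979085),
    (68, 0.184045, 0.969640),
]

def rank(alpha):
    """Score each case as alpha * tfidf + (1 - alpha) * w2v and sort best first."""
    scored = [(idx, alpha * tf + (1 - alpha) * w2v) for idx, tf, w2v in rows]
    return sorted(scored, key=lambda r: r[1], reverse=True)

for alpha in (0.6, 0.3):
    print(alpha, [(idx, round(score, 6)) for idx, score in rank(alpha)])
```

Because the `w2v_sim` values span a much narrower range than the `tfidf_sim` values, lowering `alpha` mostly shifts all scores upward and only reorders cases whose embedding similarity is noticeably lower than the rest.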

In [84]:
from IPython.display import display

case_idx = 0


query_text_old = fair_use_findings.loc[fair_use_findings.index[case_idx], "text_old"]

print("=== OLD MODEL (TF-IDF cosine, no case-type restriction) ===")
display(
    retrieve_precedents_old(
        query_text_old,
        top_n=10,
        restrict_to_same_case_type=False,
        exclude_idx=case_idx
    )
)

print("=== NEW MODEL (TF-IDF retrieve + Word2Vec rerank) ===")
display(
    hybrid_by_index(case_idx, cand_top_n=50, final_top_k=10, alpha=0.6)
)

=== OLD MODEL (TF-IDF cosine, no case-type restriction) ===


Unnamed: 0,case_idx,title,case_number,year,court,outcome,tags,similarity_old,case_type_cluster_old
240,240,"Meeropol v. Nizer,",560 F.2d 1061 (2d Cir. 1977),1977,United States Court of Appeals for the Second ...,INDETERMINATE,Second Circuit; Education/Scholarship/Research...,0.187706,6
57,57,"Barcroft Media, Ltd. V. Coed Media Group, LLC","No. 16-CV-7634 (JMF) (S.D.N.Y. Nov. 2, 2017)",2017,United States District Court for the Southern ...,FAIR_USE_NOT_FOUND,"Second Circuit, Internet/Digitization, Photogr...",0.165811,4
147,147,"Elvis Presley Enters., Inc. v. Passport Video,","349 F.3d 622 (9th Cir. 2003), overruled on oth...",2003,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Education/Scholarship/Research;...,0.149477,3
245,245,"Berlin v. E.C. Publ’ns, Inc.,",329 F.2d 541 (2d Cir. 1964),1964,United States Court of Appeals for the Second ...,FAIR_USE_FOUND,Second Circuit; Music; Parody/Satire,0.148766,9
44,44,"VHT, Inc. v. Zillow Group","Nos. 17-35587, 17-35588 (9th Cir. Mar. 15, 2019)",2019,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Photograph,0.147875,4
200,200,"Cable/Home Commc’n Corp. v. Network Prods., Inc.,",902 F.2d 829 (11th Cir. 1990),1990,United States Court of Appeals for the Elevent...,FAIR_USE_NOT_FOUND,Eleventh Circuit; Computer program,0.147402,3
3,3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,INDETERMINATE,Education/Scholarship/Research; Textual Work; ...,0.147122,6
86,86,"Cambridge Univ. Press v. Patton,",769 F.3d 1232 (11th Cir. 2014),2014,United States Court of Appeals for the Elevent...,INDETERMINATE,Eleventh Circuit; Education/Scholarship/Resear...,0.143386,6
228,228,"Jartech, Inc. v. Clancy,",666 F.2d 403 (9th Cir. 1982),1982,United States Court of Appeals for the Ninth C...,FAIR_USE_FOUND,Ninth Circuit; Film/Audiovisual; Used in gover...,0.138605,3
97,97,"Cariou v. Prince,",714 F.3d 694 (2d Cir. 2013) cert. denied 134 S...,2013,United States Court of Appeals for the Second ...,INDETERMINATE,"Preliminary ruling, mixed result, or remand",0.138164,4


=== NEW MODEL (TF-IDF retrieve + Word2Vec rerank) ===


Unnamed: 0,case_idx,title,tfidf_sim,w2v_sim,hybrid_score
60,60,"Penguin Random House LLC, et al. v. Frederik C...",0.224331,0.980322,0.526727
22,22,"Dr. Seuss Enters., L.P. v. ComicMix LLC",0.219006,0.980442,0.52358
64,64,Estate of Anthony Barré and Angel Barré v. Car...,0.215036,0.956246,0.51152
51,51,"Lucasfilm Ltd. LLC, et al. v. Ren Ventures Ltd...",0.187668,0.979085,0.504235
68,68,Reiner v. Nishimori,0.184045,0.96964,0.498283
57,57,"Barcroft Media, Ltd. V. Coed Media Group, LLC",0.198083,0.948455,0.498231
49,49,Ferdman v. CBS Interactive Inc.,0.197718,0.945457,0.496814
44,44,"VHT, Inc. v. Zillow Group",0.181025,0.962474,0.493605
95,95,"Neri v. Monroe,",0.172941,0.956914,0.48653
56,56,Philpot v. Media Research Center Inc.,0.178065,0.94622,0.485327


### Evaluation with Tag Overlap

Because a single demo example is not enough to compare the TF-IDF baseline and our hybrid method, we introduce a pseudo ground truth to evaluate both models more systematically. We use case tags to define relevance: for each query case, we compare its tags with the tags of each retrieved result, and if there is any overlap, we mark that retrieved case as relevant.

We then report Precision@5, Precision@10, and Recall@10_hit. Precision@5 is the percentage of relevant cases among the top 5 retrieved results, and Precision@10 is the percentage of relevant cases among the top 10 results. Recall@10_hit is a simple hit-rate metric: it equals 1 if at least one relevant case appears in the top 10, and 0 if none of the top 10 results are relevant.


In [85]:
TAG_COL = "tags"
SEP = ";"

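def parse_tags(s):
    # Split the tag string on SEP and return the non-empty, stripped tags as a set.
    if pd.isna(s):
        return set()
    return {t.strip() for t in str(s).split(SEP) if t.strip()}

def relevance_by_tags(q_idx, r_idx):
    # A retrieved case counts as relevant if it shares at least one tag with the query.
    qt = parse_tags(fair_use_findings.loc[fair_use_findings.index[q_idx], TAG_COL])
    rt = parse_tags(fair_use_findings.loc[fair_use_findings.index[r_idx], TAG_COL])
    return len(qt & rt) > 0

def precision_at_k(rels, k):
    # Fraction of the top-k results marked relevant (divides by k even if fewer were returned).
    return sum(rels[:k]) / k

def recall10_hit(rels):
    # Hit rate: 1.0 if any of the top 10 results is relevant, else 0.0.
    return 1.0 if any(rels[:10]) else 0.0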

def eval_new_one(case_idx, cand_top_n=50, final_top_k=10, alpha=0.6):
    res = hybrid_by_index(case_idx, cand_top_n=cand_top_n, final_top_k=final_top_k, alpha=alpha)
    rels = [relevance_by_tags(case_idx, ridx) for ridx in res["case_idx"].tolist()]
    return {
        "P@5": precision_at_k(rels, 5),
        "P@10": precision_at_k(rels, 10),
        "R@10_hit": recall10_hit(rels)
    }, res

def eval_old_one(case_idx, top_n=10, restrict=True):
    q_text = fair_use_findings.loc[fair_use_findings.index[case_idx], "text_old"]
    res = retrieve_precedents_old(
        q_text,
        top_n=top_n,
        restrict_to_same_case_type=restrict,
        exclude_idx=case_idx
    )
    rels = [relevance_by_tags(case_idx, ridx) for ridx in res["case_idx"].tolist()]
    return {
        "P@5": precision_at_k(rels, 5),
        "P@10": precision_at_k(rels, 10),
        "R@10_hit": recall10_hit(rels)
    }, res
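
To make the tag-overlap rule concrete, here is a small standalone sketch that applies the same splitting logic to tag strings. The first two strings are complete tag values from the displayed output. The comma-delimited variant is hypothetical and is included only because some rows in the output appear to list tags with commas rather than semicolons. The helper `share_a_tag` is illustrative and not part of the notebook.

```python
SEP = ";"

def parse_tags(s):
    # Same splitting rule as the notebook cell above, minus the pandas NaN check.
    if s is None:
        return set()
    return {t.strip() for t in str(s).split(SEP) if t.strip()}

def share_a_tag(query_tags, result_tags):
    # Relevant means the two tag sets have a non-empty intersection.
    return len(parse_tags(query_tags) & parse_tags(result_tags)) > 0

# Two complete tag strings from the displayed output.
berlin = "Second Circuit; Music; Parody/Satire"
vht = "Ninth Circuit; Photograph"
# Hypothetical comma-delimited variant, used only to show the splitting behavior.
vht_commas = "Ninth Circuit, Photograph"

print(share_a_tag(berlin, vht))      # no tag in common
print(share_a_tag(vht, vht_commas))  # same labels, different delimiter
print(parse_tags(vht_commas))        # the whole string is kept as a single tag
```

Because the split uses only `;`, a comma-delimited tag string is treated as one long tag and will rarely overlap with anything. If such rows exist in the full data, they would lower the measured precision for both models.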

In [86]:
from IPython.display import display

case_idx = 0

new_m, new_res = eval_new_one(case_idx, cand_top_n=50, final_top_k=10, alpha=0.6)
old_m, old_res = eval_old_one(case_idx, top_n=10, restrict=False)

print("NEW metrics:", new_m)
display(new_res)

print("OLD metrics:", old_m)
display(old_res)

NEW metrics: {'P@5': 0.4, 'P@10': 0.5, 'R@10_hit': 1.0}


Unnamed: 0,case_idx,title,tfidf_sim,w2v_sim,hybrid_score
60,60,"Penguin Random House LLC, et al. v. Frederik C...",0.224331,0.980322,0.526727
22,22,"Dr. Seuss Enters., L.P. v. ComicMix LLC",0.219006,0.980442,0.52358
64,64,Estate of Anthony Barré and Angel Barré v. Car...,0.215036,0.956246,0.51152
51,51,"Lucasfilm Ltd. LLC, et al. v. Ren Ventures Ltd...",0.187668,0.979085,0.504235
68,68,Reiner v. Nishimori,0.184045,0.96964,0.498283
57,57,"Barcroft Media, Ltd. V. Coed Media Group, LLC",0.198083,0.948455,0.498231
49,49,Ferdman v. CBS Interactive Inc.,0.197718,0.945457,0.496814
44,44,"VHT, Inc. v. Zillow Group",0.181025,0.962474,0.493605
95,95,"Neri v. Monroe,",0.172941,0.956914,0.48653
56,56,Philpot v. Media Research Center Inc.,0.178065,0.94622,0.485327


OLD metrics: {'P@5': 0.6, 'P@10': 0.5, 'R@10_hit': 1.0}


Unnamed: 0,case_idx,title,case_number,year,court,outcome,tags,similarity_old,case_type_cluster_old
240,240,"Meeropol v. Nizer,",560 F.2d 1061 (2d Cir. 1977),1977,United States Court of Appeals for the Second ...,INDETERMINATE,Second Circuit; Education/Scholarship/Research...,0.187706,6
57,57,"Barcroft Media, Ltd. V. Coed Media Group, LLC","No. 16-CV-7634 (JMF) (S.D.N.Y. Nov. 2, 2017)",2017,United States District Court for the Southern ...,FAIR_USE_NOT_FOUND,"Second Circuit, Internet/Digitization, Photogr...",0.165811,4
147,147,"Elvis Presley Enters., Inc. v. Passport Video,","349 F.3d 622 (9th Cir. 2003), overruled on oth...",2003,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Education/Scholarship/Research;...,0.149477,3
245,245,"Berlin v. E.C. Publ’ns, Inc.,",329 F.2d 541 (2d Cir. 1964),1964,United States Court of Appeals for the Second ...,FAIR_USE_FOUND,Second Circuit; Music; Parody/Satire,0.148766,9
44,44,"VHT, Inc. v. Zillow Group","Nos. 17-35587, 17-35588 (9th Cir. Mar. 15, 2019)",2019,United States Court of Appeals for the Ninth C...,FAIR_USE_NOT_FOUND,Ninth Circuit; Photograph,0.147875,4
200,200,"Cable/Home Commc’n Corp. v. Network Prods., Inc.,",902 F.2d 829 (11th Cir. 1990),1990,United States Court of Appeals for the Elevent...,FAIR_USE_NOT_FOUND,Eleventh Circuit; Computer program,0.147402,3
3,3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,INDETERMINATE,Education/Scholarship/Research; Textual Work; ...,0.147122,6
86,86,"Cambridge Univ. Press v. Patton,",769 F.3d 1232 (11th Cir. 2014),2014,United States Court of Appeals for the Elevent...,INDETERMINATE,Eleventh Circuit; Education/Scholarship/Resear...,0.143386,6
228,228,"Jartech, Inc. v. Clancy,",666 F.2d 403 (9th Cir. 1982),1982,United States Court of Appeals for the Ninth C...,FAIR_USE_FOUND,Ninth Circuit; Film/Audiovisual; Used in gover...,0.138605,3
97,97,"Cariou v. Prince,",714 F.3d 694 (2d Cir. 2013) cert. denied 134 S...,2013,United States Court of Appeals for the Second ...,INDETERMINATE,"Preliminary ruling, mixed result, or remand",0.138164,4


With `restrict=False`, the old model is not limited to the query's case type, so both models draw from the full corpus and the comparison is more direct. In this setting, the **new hybrid model** achieves **P@5 = 0.4** and **P@10 = 0.5**, meaning **2/5** of the top 5 and **5/10** of the top 10 retrieved cases share at least one tag with the query (our pseudo relevance definition). The **old TF-IDF model** achieves **P@5 = 0.6** and **P@10 = 0.5**, meaning it returns **3/5** relevant cases in the top 5 and **5/10** relevant cases in the top 10. Both models have **R@10_hit = 1.0**, so each method retrieves at least one relevant case within the top 10.

This pattern suggests that, for this query, the baseline TF-IDF model places tag-matching cases slightly higher in the top ranks (higher P@5), while both methods reach the same level of tag overlap by rank 10 (same P@10). A likely explanation is that TF-IDF directly prioritizes exact term overlap, which aligns well with the tag taxonomy in this dataset, whereas the hybrid model incorporates Word2Vec-based semantic similarity and may promote cases that are semantically similar but not necessarily labeled with the same tags. As a result, the hybrid method can diversify the top-ranked results beyond strict tag alignment, which may reduce P@5 under a tag-based proxy metric even when the retrieved cases are conceptually related. Because this is a single query, the gap in P@5 amounts to one retrieved case (3/5 versus 2/5) and should not be read as a general ranking of the two methods.
