<a href="https://colab.research.google.com/github/mqanaq/BA820-B1-Team13/blob/main/Danish_Azmi/M4_Danish_Kamal_Azmi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Integrated Analysis & Synergy For Fair Use Cases Data**

**Project Milestone 4**

**Team:** B1 Team 13

**Team Members:** Danish Azmi

# **Introduction**

# **Executive Summary**

# Setting Up Environment

In [4]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Core
import os
import re
import math
import json
import time
import string
import random
from pathlib import Path
from collections import Counter, defaultdict

# Data
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# NLP / Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, PCA, TruncatedSVD
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Modeling / Anomaly detection
from sklearn.ensemble import IsolationForest

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Association rules
from mlxtend.frequent_patterns import apriori, association_rules

# Dimensionality reduction for viz
from sklearn.manifold import TSNE

## Data Importing

In [5]:
fair_use_findings = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2023/2023-08-29/fair_use_findings.csv')

## Data Inspection

In [6]:
print("Dataset Info:")
print(fair_use_findings.info())

print("\nFirst 5 rows:")
print(fair_use_findings.head())

print("\nMissing Values:")
print(fair_use_findings.isnull().sum())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        251 non-null    object
 1   case_number  251 non-null    object
 2   year         251 non-null    object
 3   court        251 non-null    object
 4   key_facts    251 non-null    object
 5   issue        251 non-null    object
 6   holding      251 non-null    object
 7   tags         251 non-null    object
 8   outcome      251 non-null    object
dtypes: object(9)
memory usage: 17.8+ KB
None

First 5 rows:
                                               title  \
0                              De Fontbrune v. Wofsy   
1                          Sedlik v. Von Drachenberg   
2  Sketchworks Indus. Strength Comedy, Inc. v. Ja...   
3  Am. Soc'y for Testing & Materials v. Public.Re...   
4                           Yang v. Mic Network Inc.   

                                 

## Preparing Data

### Outcome Construction

The outcome column is converted into a simple label for analysis. The text is cleaned and then grouped into three outcomes: fair use found, fair use not found, and indeterminate (preliminary, mixed, remand, or unclear). A binary fair_use_found flag is created only for the final outcomes, and indeterminate cases are left out of binary rate calculations.

In [7]:
# Normalize the outcome text (lowercase, trimmed) before counting
fair_use_findings["outcome"] = fair_use_findings["outcome"].astype(str).str.lower().str.strip()

# Count each distinct outcome label
outcome_counts = fair_use_findings["outcome"].value_counts().reset_index()
outcome_counts.columns = ["outcome", "count"]

# Display the counts
print(outcome_counts)

                                              outcome  count
0                                  fair use not found    100
1                                      fair use found     98
2         preliminary ruling, mixed result, or remand     28
3             preliminary finding; fair use not found      4
4                                        mixed result      3
5              preliminary ruling, fair use not found      3
6              fair use not found, preliminary ruling      3
7              preliminary ruling; fair use not found      2
8              fair use not found; preliminary ruling      2
9                          preliminary ruling, remand      1
10                                fair use not found.      1
11                                    fair use found.      1
12  preliminary ruling, fair use not found, mixed ...      1
13                 preliminary ruling, fair use found      1
14  fair use found; second circuit affirmed on app...      1
15                      

Based on the outcome counts above, the labels fall into three categories. Entries labeled "fair use found" (including a punctuation variant and an appeal note) are treated as fair use found, and entries labeled "fair use not found" (including a punctuation variant) are treated as fair use not found. All remaining entries are treated as indeterminate: preliminary rulings, mixed results, remands, and one record whose outcome field holds case-summary text instead of a label. The mapping below applies this grouping.

In [8]:
outcome_map = {
    # FINAL: fair use found
    "fair use found": "FAIR_USE_FOUND",
    "fair use found.": "FAIR_USE_FOUND",
    "fair use found; second circuit affirmed on appeal.": "FAIR_USE_FOUND",

    # FINAL: fair use not found
    "fair use not found": "FAIR_USE_NOT_FOUND",
    "fair use not found.": "FAIR_USE_NOT_FOUND",

    # INDETERMINATE
    "preliminary ruling, mixed result, or remand": "INDETERMINATE",
    "preliminary finding; fair use not found": "INDETERMINATE",
    "mixed result": "INDETERMINATE",
    "preliminary ruling, fair use not found": "INDETERMINATE",
    "fair use not found, preliminary ruling": "INDETERMINATE",
    "preliminary ruling; fair use not found": "INDETERMINATE",
    "fair use not found; preliminary ruling": "INDETERMINATE",
    "preliminary ruling, remand": "INDETERMINATE",
    "preliminary ruling, fair use not found, mixed result": "INDETERMINATE",
    "preliminary ruling, fair use found": "INDETERMINATE",
    "fair use found; mixed result": "INDETERMINATE",
    "plaintiff patrick cariou published yes rasta, a book of portraits and landscape photographs taken in jamaica. defendant richard prince was an appropriation artist who altered and incorporated several of plaintiff’s photographs into a series of paintings and collages called canal zone that was exhibited at a gallery and in the gallery’s exhibition catalog. plaintiff filed an infringement claim, and the district court ruled in his favor, stating that to qualify as fair use, a secondary work must “comment on, relate to the historical context of, or critically refer back to the original works.” defendant appealed.": "INDETERMINATE",
}

In [9]:
# Replace outcome column values with the mapping in outcome_map
fair_use_findings["outcome"] = fair_use_findings["outcome"].replace(outcome_map)
fair_use_findings["outcome"].value_counts().reset_index()

Unnamed: 0,outcome,count
0,FAIR_USE_NOT_FOUND,101
1,FAIR_USE_FOUND,100
2,INDETERMINATE,50
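The text above describes a binary `fair_use_found` flag, but the cells shown stop at the three-way label. The following is a minimal sketch of one way to derive it, assuming the three label values produced by `outcome_map`. The helper name `add_fair_use_flag` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def add_fair_use_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a nullable boolean `fair_use_found` column."""
    flag_map = {"FAIR_USE_FOUND": True, "FAIR_USE_NOT_FOUND": False}
    out = df.copy()
    # Labels missing from flag_map (e.g. INDETERMINATE) become <NA>
    out["fair_use_found"] = out["outcome"].map(flag_map).astype("boolean")
    return out


# Hypothetical toy data standing in for fair_use_findings
toy = pd.DataFrame({
    "outcome": ["FAIR_USE_FOUND", "FAIR_USE_NOT_FOUND",
                "INDETERMINATE", "FAIR_USE_FOUND"]
})
flagged = add_fair_use_flag(toy)

# mean() skips <NA>, so indeterminate cases drop out of the rate
fair_use_rate = flagged["fair_use_found"].mean()
print(flagged)
print(fair_use_rate)
```

Using the nullable `boolean` dtype keeps indeterminate cases in the frame while leaving them out of any rate computed with `mean()`.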


### Column Cleaning

The year column is stored as text, so it is converted to a nullable integer type for reliable grouping, filtering, and downstream modeling. Values that cannot be parsed as numbers are coerced to missing (`<NA>`) rather than raising an error.

In [10]:
# Turn the year column to integer
fair_use_findings["year"] = pd.to_numeric(fair_use_findings["year"], errors="coerce").astype("Int64")
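Because `errors="coerce"` silently turns unparseable entries into missing values, it can help to list those entries before overwriting the column. The sketch below shows one way to do that; the helper `unparsed_years` and the sample strings are hypothetical, not values taken from the dataset.

```python
import pandas as pd


def unparsed_years(raw_year: pd.Series) -> pd.Series:
    """Return the raw, non-missing year entries that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(raw_year, errors="coerce")
    return raw_year[parsed.isna() & raw_year.notna()]


# Hypothetical raw values; run this on the original text column before conversion
toy_years = pd.Series(["2022", "1994", "2017, affirmed 2018", None])
bad = unparsed_years(toy_years)
print(bad.tolist())
```

If the list is non-empty, those rows will carry `<NA>` in `year` after conversion and will drop out of any year-based grouping.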

In [11]:
fair_use_findings.head()

Unnamed: 0,title,case_number,year,court,key_facts,issue,holding,tags,outcome
0,De Fontbrune v. Wofsy,39 F.4th 1214 (9th Cir. 2022),2022,United States Court of Appeals for the Ninth C...,Plaintiffs own the rights to a catalogue compr...,Whether reproduction of photographs documentin...,"The panel held that the first factor, the purp...",Education/Scholarship/Research; Photograph,FAIR_USE_NOT_FOUND
1,Sedlik v. Von Drachenberg,"No. CV 21-1102, 2022 WL 2784818 (C.D. Cal. May...",2022,United States District Court for the Southern ...,Plaintiff Jeffrey Sedlik is a photographer who...,Whether use of a photograph as the reference i...,"Considering the first fair use factor, the pur...",Painting/Drawing/Graphic; Photograph,INDETERMINATE
2,"Sketchworks Indus. Strength Comedy, Inc. v. Ja...","No. 19-CV-7470-LTS-VF, 2022 U.S. Dist. LEXIS 8...",2022,United States District Court for the Southern ...,Plaintiff Sketchworks Industrial Strength Come...,"Whether the use of protected elements, includi...","The court found that the first factor, the pur...",Film/Audiovisual; Music; Parody/Satire; Review...,FAIR_USE_FOUND
3,Am. Soc'y for Testing & Materials v. Public.Re...,"No. 13-cv-1215 (TSC), 2022 U.S. Dist. LEXIS 60...",2022,United States District Court for the District ...,"Defendant Public.Resource.Org, Inc., a non-pro...",Whether it is fair use to make available onlin...,"As directed by the court of appeals, the distr...",Education/Scholarship/Research; Textual Work; ...,INDETERMINATE
4,Yang v. Mic Network Inc.,"Nos. 20-4097-cv(L), 20-4201-cv (XAP), 2022 U.S...",2022,United States Court of Appeals for the Second ...,Plaintiff Stephen Yang (“Yang”) licensed a pho...,"Whether using a screenshot from an article, in...","On appeal, the court decided that the first fa...",News Reporting; Photography,FAIR_USE_FOUND


## Question 3: Categories Association Rules and Clustering

**Q3: Scenario bundles that define most disputes.** What tag/category combinations repeatedly appear as common "scenario bundles," and do these bundles correspond to distinct dispute themes in the summaries (e.g., licensing breakdowns, ownership ambiguity, takedowns), or do they mainly reflect metadata such as venue?

This section largely stands on its own. It reuses the TF-IDF approach from the other questions to cluster the cases, but otherwise there is little synergy with the methods and outcomes of those questions.

In [12]:
fair_use_findings['tags_list'] = fair_use_findings['tags'].str.split(';')
df_exploded = fair_use_findings.explode('tags_list')

df_exploded['tags_list'] = df_exploded['tags_list'].str.strip()

df_unique = df_exploded.drop_duplicates(subset=['case_number', 'tags_list'])
df_matrix = pd.crosstab(df_unique['case_number'], df_unique['tags_list'])
print("\nFinal Matrix Shape:", df_matrix.shape)
print(df_matrix.head())


Final Matrix Shape: (251, 59)
tags_list                                           Computer Program  \
case_number                                                            
108 F.3d 1119 (9th Cir. 1997), cert. denied 522...                 0   
109 F.3d 1394 (9th Cir. 1997)                                      0   
11 F. Supp. 2d 1179 (C.D. Cal. 1998)                               0   
126 F.3d 70 (2d Cir. 1997)                                         0   
132 F. Supp. 2d 229 (S.D.N.Y. 2001)                                0   

tags_list                                           Computer program  \
case_number                                                            
108 F.3d 1119 (9th Cir. 1997), cert. denied 522...                 0   
109 F.3d 1394 (9th Cir. 1997)                                      0   
11 F. Supp. 2d 1179 (C.D. Cal. 1998)                               0   
126 F.3d 70 (2d Cir. 1997)                                         0   
132 F. Supp. 2d 229 (S.D.N.Y. 20
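The matrix output above lists `Computer Program` and `Computer program` as separate columns, so case variants of the same tag are being counted as different items. The sketch below shows one possible normalization, assuming that title-casing is an acceptable canonical form. The function `build_tag_matrix` and the toy rows are illustrative. Near-synonyms such as `Photograph` and `Photography` would still need a manual mapping.

```python
import pandas as pd


def build_tag_matrix(df: pd.DataFrame,
                     id_col: str = "case_number",
                     tag_col: str = "tags") -> pd.DataFrame:
    """One-hot encode ';'-separated tags after normalizing whitespace and case."""
    tags = (
        df[[id_col]]
        .assign(tag=df[tag_col].str.split(";"))
        .explode("tag", ignore_index=True)
    )
    # Trim whitespace and unify case so variants collapse into one column
    tags["tag"] = tags["tag"].str.strip().str.title()
    # Drop empty fragments left by trailing separators
    tags = tags[tags["tag"].notna() & (tags["tag"] != "")]
    tags = tags.drop_duplicates(subset=[id_col, "tag"])
    return pd.crosstab(tags[id_col], tags["tag"]).astype(bool)


# Hypothetical rows that mimic the case variants seen in the output
toy = pd.DataFrame({
    "case_number": ["A", "B", "C"],
    "tags": ["Computer Program; Textual work",
             "Computer program;",
             "Textual Work; Music"],
})
matrix = build_tag_matrix(toy)
print(matrix.columns.tolist())
```

Returning a boolean matrix also matches the input type that `apriori` expects.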

### Association Rules

The tag matrix is already one-hot encoded, so Apriori can be applied to it directly. Rules are then filtered by confidence, because support alone was found to be less helpful in M2.

In [13]:
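# apriori expects a boolean one-hot matrix; keep itemsets found in at least 0.5% of cases
frequent_itemsets = apriori(df_matrix.astype(bool), min_support=0.005, use_colnames=True)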
frequent_itemsets.sort_values(by="support")

Unnamed: 0,support,itemsets
19,0.007968,"(Ninth Circuit, Photograph, Review/Commentary)"
28,0.007968,(Tenth Circuit)
272,0.007968,"(Second Circuit, Painting/Drawing/Graphic, Tex..."
275,0.007968,"(Second Circuit, Review/Commentary, Parody/Sat..."
277,0.007968,"(Photograph, Sculpture, Review/Commentary)"
...,...,...
22,0.223108,(Photograph)
3,0.227092,(Education/Scholarship/Research)
9,0.239044,(Film/Audiovisual)
25,0.286853,(Second Circuit)


In [14]:
# num_itemsets is the number of transactions (cases), not the number of frequent itemsets
rules = association_rules(frequent_itemsets,
                          num_itemsets=len(df_matrix),
                          metric="confidence", min_threshold=0.4)
rules.sort_values(by=["support", "confidence"])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
42,"(Education/Scholarship/Research, Eleventh Circ...",(Internet/Digitization),0.019920,0.159363,0.007968,0.400000,2.510000,1.0,0.004794,1.401062,0.613821,0.046512,0.286256,0.225000
44,"(Education/Scholarship/Research, Eleventh Circ...",(Review/Commentary),0.019920,0.155378,0.007968,0.400000,2.574359,1.0,0.004873,1.407703,0.623984,0.047619,0.289623,0.225641
62,"(Music, Ninth Circuit)",(Education/Scholarship/Research),0.019920,0.227092,0.007968,0.400000,1.761404,1.0,0.003444,1.288181,0.441057,0.033333,0.223711,0.217544
99,"(Eleventh Circuit, Textual work)",(Internet/Digitization),0.019920,0.159363,0.007968,0.400000,2.510000,1.0,0.004794,1.401062,0.613821,0.046512,0.286256,0.225000
101,"(Eleventh Circuit, Textual work)",(Parody/Satire),0.019920,0.135458,0.007968,0.400000,2.952941,1.0,0.005270,1.440903,0.674797,0.054054,0.305991,0.229412
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26,(Internet/Digitization),(Photograph),0.159363,0.223108,0.063745,0.400000,1.792857,1.0,0.028190,1.294821,0.526066,0.200000,0.227692,0.342857
30,(Review/Commentary),(Second Circuit),0.155378,0.286853,0.075697,0.487179,1.698362,1.0,0.031126,1.390637,0.486842,0.206522,0.280905,0.375534
32,(Second Circuit),(Textual work),0.286853,0.330677,0.115538,0.402778,1.218039,1.0,0.020682,1.120726,0.251011,0.230159,0.107722,0.376088
8,(Textual work),(Education/Scholarship/Research),0.330677,0.227092,0.143426,0.433735,1.909956,1.0,0.068332,1.364923,0.711806,0.346154,0.267358,0.532657
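To make the columns in the table above easier to read, the sketch below recomputes support, confidence, and lift for the rule (Textual work) → (Education/Scholarship/Research) from raw counts. The counts 83, 57, and 36 are inferred by multiplying the reported supports by the 251 cases, and the helper names are illustrative.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_antecedent            # P(B | A)
    lift = confidence / (n_consequent / n_total)  # P(B | A) / P(B)
    return support, confidence, lift


def max_possible_lift(n_total, n_antecedent, n_consequent):
    """Upper bound on lift: 1 / max(P(A), P(B))."""
    return n_total / max(n_antecedent, n_consequent)


# Counts inferred from the table: 0.330677 * 251 ≈ 83, 0.227092 * 251 ≈ 57,
# 0.143426 * 251 ≈ 36
support, confidence, lift = rule_metrics(251, 83, 57, 36)
bound = max_possible_lift(251, 83, 57)
print(round(support, 6), round(confidence, 6), round(lift, 6), round(bound, 3))
```

The bound shows why very large lift values only occur for rare itemsets: a lift of 100 would require both sides of the rule to appear in under 1% of cases.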


In [None]:
rules_filtered = rules[(rules['confidence'] > 0.5) & (rules['lift'] >= 1.5)]
rules_filtered.sort_values(by=["confidence", "lift"], ascending=False)

Association rules are a natural fit here because the question concerns which categories co-occur within a case. The dataset contains rich but fragmented categorical data (tags, jurisdictions, outcome labels). Standard correlation analysis is poorly suited to non-numeric data, which makes Apriori a good choice for detecting "if-then" patterns. It can help uncover hidden relationships between categorical tags, courts, and legal outcomes.

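Association rule mining (Apriori) suggests that venue is associated with specific dispute types. We found a very strong association (lift > 100) between the Ninth Circuit (which includes California) and "Format Shifting" disputes, which points to the West Coast as the primary battleground for technical cases. Lift is a ratio rather than a correlation coefficient, and it can never exceed 1 / max(antecedent support, consequent support). A lift above 100 is therefore only possible when both sides of the rule appear in fewer than 1% of the 251 cases, so this pattern rests on very few cases and should be read as tentative. The Second Circuit (which includes New York) showed a statistical dependency (lift > 2.2) for Parody and Film disputes. Taken together, the rules suggest that media companies face a "New York risk" (artistic interpretation) that differs from the "California risk" (technical utility).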

### K-Means Clustering

In [None]:
fair_use_findings['full_text'] = (fair_use_findings['key_facts'].fillna('') + " " +
                                  fair_use_findings['issue'].fillna('') + " " +
                                  fair_use_findings['holding'].fillna(''))

# Domain-specific stop words: legal terms that appear in nearly every summary
my_stop_words = [
    'court', 'case', 'copyright', 'fair', 'use', 'plaintiff', 'defendant',
    'judge', 'district', 'appeal', 'circuit', 'infringement', 'holding',
    'fact', 'issue', 'summary', 'judgment', 'claimed', 'argued'
]

# Combine scikit-learn's built-in English stop words with the custom legal list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
all_stop_words = list(ENGLISH_STOP_WORDS) + my_stop_words

# Vectorize unigrams and bigrams, keeping the 3,000 most frequent terms
tfidf = TfidfVectorizer(max_features=3000,
                        stop_words=all_stop_words,
                        ngram_range=(1,2))

text_matrix = tfidf.fit_transform(fair_use_findings['full_text'])

n_clusters_kmeans = 5

kmeans = KMeans(n_clusters=n_clusters_kmeans, random_state=42, n_init=10)
labels = kmeans.fit_predict(text_matrix)

fair_use_findings['cluster_labels'] = labels.astype(str)

print(f"--- Cases per Cluster (K={n_clusters_kmeans}) ---")
print(fair_use_findings['cluster_labels'].value_counts())

print("\n--- Top Terms per Cluster ---")
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = tfidf.get_feature_names_out()

for i in range(n_clusters_kmeans):
    top_words = [terms[ind] for ind in order_centroids[i, :15]]
    print(f"Cluster {i}: {', '.join(top_words)}")
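The cell above fixes `n_clusters_kmeans = 5` without showing how that value was chosen. One common check is to compare silhouette scores across several values of k. The sketch below illustrates the idea on a small hypothetical corpus. The helper `silhouette_by_k`, the toy documents, and the range of k are assumptions, not the notebook's actual selection procedure.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, k_values, random_state=42):
    """Fit K-means for each k and return {k: mean cosine silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels, metric="cosine")
    return scores


# Hypothetical mini-corpus standing in for fair_use_findings['full_text']
toy_docs = [
    "software code reverse engineering video game console",
    "software interface code copied for compatible program",
    "database software indexing search engine thumbnails",
    "photograph posted on social media news article",
    "news website published photograph without license",
    "parody song lyrics mocking original music recording",
    "parody novel retelling of a classic book",
    "biography quoting unpublished letters of an author",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

scores = silhouette_by_k(toy_matrix, k_values=range(2, 5))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data, the same function could be called with `text_matrix` and a wider range such as `range(2, 11)`. Silhouette scores on sparse TF-IDF text are often low, so they are better used to compare candidate values of k than to judge cluster quality in absolute terms.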

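The main finding from our text clustering is that, with k = 5, the disputes group into five clusters that we interpret as distinct business archetypes. K-means assigns each case to exactly one cluster by construction, so the archetypes below are an interpretation of each cluster's top terms rather than proof of clean separation. Even so, the grouping suggests that a "one-size-fits-all" copyright policy is likely to be a poor fit.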

*   The "Commercial Advertising" Bundle (Cluster 0): Disputes here focus on marketing materials (posters, ad campaigns). These cases tend to be factually simple, often revolving around a single visual comparison.
*   The "High-Tech" Infrastructure Bundle (Cluster 1): This cluster isolates disputes involving software code, reverse engineering, and database management. It represents the "industrial" side of copyright, separate from the creative arts.
*   The "Content Sharing" Bundles (Clusters 2 & 3): We identified a split between Social Media/News (viral sharing risks) and Search/Indexing (infrastructure risks).
*   The "Legacy" Bundle (Cluster 4): The traditional domain of books, biographies, and parodies remains a distinct category.
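Q3 asks whether the text-based bundles line up with tag combinations or mainly reflect venue. One way to examine this is to compute, for each cluster, the share of its cases that carry each tag. The sketch below assumes the `cluster_labels` and `tags` columns used earlier. The helper `tag_share_by_cluster` and the toy rows are hypothetical.

```python
import pandas as pd


def tag_share_by_cluster(df: pd.DataFrame,
                         cluster_col: str = "cluster_labels",
                         tag_col: str = "tags") -> pd.DataFrame:
    """Share of cases in each cluster that carry each tag."""
    long = df[[cluster_col]].copy()
    long["tag"] = df[tag_col].str.split(";")
    # Keep the original row position as a case identifier while exploding
    long = long.explode("tag").rename_axis("row_id").reset_index()
    long["tag"] = long["tag"].str.strip()
    long = long[long["tag"].notna() & (long["tag"] != "")]
    long = long.drop_duplicates(subset=["row_id", "tag"])
    counts = pd.crosstab(long[cluster_col], long["tag"])
    cluster_sizes = df[cluster_col].value_counts()
    # Divide each cluster's tag counts by the number of cases in that cluster
    return counts.div(cluster_sizes, axis=0)


# Hypothetical toy data standing in for fair_use_findings
toy = pd.DataFrame({
    "cluster_labels": ["0", "0", "1", "1"],
    "tags": ["Photograph; News Reporting", "Photograph",
             "Computer program; Ninth Circuit", "Computer program"],
})
shares = tag_share_by_cluster(toy)
print(shares)
```

If circuit tags dominate the rows while subject-matter tags are spread evenly, the clusters are more likely tracking venue than dispute theme.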

