In [1]:
import os
import pandas as pd
import numpy as np
from IPython.display import Markdown, display
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

import eyecite

In [2]:
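# Move up one directory so the `src` imports and `./data` paths below resolve.
# Note: re-running this cell moves up another level each time.
os.chdir(os.path.dirname(os.getcwd()))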

In [3]:
from src.utils.gen_utils import count_tokens, hash_id
from src.utils.pydantic_utils import dataframe_to_documents
from src.parsing.parser import ParsingConfig, Parser
from src.citations import (
    get_citation_context_sents, 
    get_citation_context_words, 
    extract_citations_with_context_from_df,
)
from src.embedding_models.models import ColbertReranker

In [4]:
df = pd.read_parquet('./data/forward_citations.parquet')

In [5]:
df['tokens'] = df['Complete Text'].apply(count_tokens)
df['id'] = df['Complete Text'].apply(hash_id)

In [6]:
df.head(10)

Unnamed: 0,Cited Case ID,Cited Case Title,Citing Case ID,Citing Case Title,Relevant Excerpt,Excerpt Contains Negative Sentiment,Complete Text,tokens,id
0,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",3720681,"Recording Industry Ass'n of America, Inc. v. V...",available at *88 See Indirect Liability for ...,NO,\n 351 F.3d 1229 \n RECORDING INDUSTRY ASS...,22871,aa155094-1ce9-abce-a85d-f865944d493e
1,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9054205,"Recording Industry Ass'n of America, Inc. v. V...",available at *1232 See Indirect Liability fo...,The given excerpt does not provide enough info...,\n RECORDING INDUSTRY ASSOCIATION OF AMERICA...,22906,2dc86e39-82d9-6a1f-24ff-8cf9c743e8fd
2,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9291681,"UMG Recordings, Inc. v. Sinnott, 300 F. Supp. ...",239 F.3d at 1020 Cable/Home Communication Corp...,No,"\n UMG RECORDINGS, INC.; Arista Records, Inc...",25853,160761cf-e82c-beb6-04c4-dbc48770d9df
3,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9291681,"UMG Recordings, Inc. v. Sinnott, 300 F. Supp. ...",259 F.Supp.2d 1029 Grok-ster Grokster In Gr...,No,"\n UMG RECORDINGS, INC.; Arista Records, Inc...",25853,160761cf-e82c-beb6-04c4-dbc48770d9df
4,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","From the advent of the player piano, every new...",No,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b
5,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","Thus, in order to analyze the required element...",No,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b
6,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","See Napster I, In this case, the district co...",No,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b
7,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","Napster II, Sony-Betamax Having determined t...",NO,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b
8,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","*1163 In the context of this case, the softwar...",NO,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b
9,9117679,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster,...",9214758,"Metro-Goldwyn-Mayer Studios, Inc. v. Grokster ...","907 F.Supp. 1361 Fonovisa, Inc. v. Cherry Auct...",NO,"\n METRO-GOLDWYN-MAYER STUDIOS, INC.; Columb...",29264,1ad2f161-d162-fd1a-6dfc-244676f7a67b


In [7]:
df.shape

(29, 9)

In [8]:
Markdown(df.head(1)['Complete Text'].tolist()[0][:2000])


   351 F.3d 1229 
   RECORDING INDUSTRY ASSOCIATION OF AMERICA, INC., Appellee, v. VERIZON INTERNET SERVICES, INC., Appellant. 
   Nos. 03-7015 & 03-7053. 
   United States Court of Appeals, District of Columbia Circuit. 
   Argued Sept. 16, 2003. 
   Decided Dec. 19, 2003. 
   
     *86 Andrew G. McBride argued the cause for appellant. With him on the briefs were John Thorne, Bruce G. Joseph, Dineen P. Wasylik, and Kathryn L. Comer-ford. 
   Megan E. Gray, Lawrence S. Robbins, Alan Untereiner, Christopher A. Hansen, Arthur B. Spitzer, and Cindy Cohn were on the brief for amici curiae Alliance for Public Technology, et al., in support of appellant. 
   Donald B. Verrilli, Jr. argued the cause for appellee Recording Industry Association of America, Inc. With him on the brief were Thomas J. Perrelli and Matthew J. Oppenheim. Deanne E. Maynard entered an appearance. 
   Scott R. McIntosh, Attorney, U.S. Department of Justice, argued the cause for intervenor-appellee United States. With him on the brief were Roscoe C. Howard, Jr., U.S. Attorney, and Douglas N. Letter, Attorney, U.S. Department of Justice. 
   Paul B. Gaffney, Thomas G. Hentoff, Eric H. Smith, Patricia Polach, Ann Chai *87 tovitz, Allan R. Adler, Joseph J. DiMona, Robert S. Giolito, and Chun T. Wright were on the brief for amici curiae Motion Picture Association of America, et al., in support of appellee Recording Industry Association of America. David E. Kendall entered an appearance. 
   Paul Alan Levy, Alan B. Morrison, and Allison M. Zieve were on the brief for amicus curiae Public Citizen. 
   Before: GINSBURG, Chief Judge,and ROBERTS, Circuit Judge, and WILLIAMS, Senior Circuit Judge. 
   
     Opinion for the Court filed by Chief Judge GINSBURG. 
     GINSBURG, Chief Judge: 
     This case concerns the Recording Industry Association of America’s use of the subpoena provision of the Digital Millennium Copyright Act,  17 U.S.C. § 512 (h), to identify internet users the RIAA believes are infringing 

In [9]:
text = df.head(1)['Complete Text'].tolist()[0]

In [10]:
citations = eyecite.get_citations(text, remove_ambiguous=True)
len(citations)

490

In [11]:
context_words = get_citation_context_words(text, '239 F.3d 1004', 400, 200)
highlighted_contexts = context_words.replace("239 F.3d 1004", "<mark>239 F.3d 1004</mark>")
Markdown(highlighted_contexts)

amici curiae Motion Picture Association of America, et al., in support of appellee Recording Industry Association of America. David E. Kendall entered an appearance. Paul Alan Levy, Alan B. Morrison, and Allison M. Zieve were on the brief for amicus curiae Public Citizen. Before: GINSBURG, Chief Judge,and ROBERTS, Circuit Judge, and WILLIAMS, Senior Circuit Judge. Opinion for the Court filed by Chief Judge GINSBURG. GINSBURG, Chief Judge: This case concerns the Recording Industry Association of America’s use of the subpoena provision of the Digital Millennium Copyright Act, 17 U.S.C. § 512 (h), to identify internet users the RIAA believes are infringing the copyrights of its members. The RIAA served two subpoenas upon Verizon Internet Services in order to discover the names of two Verizon subscribers who appeared to be trading large numbers of .mp3 files of copyrighted music via “peer-to-peer” (P2P) file sharing programs, such as KaZaA. Verizon refused to comply with the subpoenas on various legal grounds. The district court rejected Verizon’s statutory and constitutional challenges to § 512(h) and ordered the internet service provider (ISP) to disclose to the RIAA the names of the two subscribers. On appeal Verizon presents three alternative arguments for reversing the orders of the district court: (1) § 512(h) does not authorize the issuance of a subpoena to an ISP acting solely as a conduit for communications the content of which is determined by others; if the statute does authorize such a subpoena, then the statute is unconstitutional because (2) the district court lacked Article III jurisdiction to issue a subpoena with no underlying “case or controversy” pending before the court; and (3) § 512(h) violates the First Amendment because it lacks sufficient safeguards to protect an internet user’s ability to speak and to associate anonymously. 
Because we agree with Verizon’s interpretation of the statute, we reverse the orders of the district court enforcing the subpoenas and do not reach either of Verizon’s constitutional arguments. * I. Background Individuals with a personal computer and access to the internet began to offer digital copies of recordings for download by other users, an activity known as file sharing, in the late 1990’s using a program called Napster. Although recording companies and music publishers successfully obtained an injunction against Napster’s facilitating the sharing of files containing copyrighted recordings, see A&M Records, Inc. v. Napster, Inc., 284 F.3d 1091 (9th Cir.2002); A&M Records, Inc. v. Napster, Inc., <mark>239 F.3d 1004</mark> (9th Cir.2001), millions of people in the United States and around the world continue to share digital .mp3 files of copyrighted recordings using P2P computer programs such as KaZaA, Morpheus, Grokster, and eDonkey. See John Borland, File Swapping Shifts Up a Gear (May 27, 2003), available at http:// news.com.com/2100-1026-1009742.html, *88 (last visited December 2, 2003). Unlike Napster, which relied upon a centralized communication architecture to identify the .mp3 files available for download, the current generation of P2P file sharing programs allow an internet user to search directly the .mp3 file libraries of other users; no web site is involved. See Douglas Lichtman & William Landes, Indirect Liability for Copyright Infringement: An Economic Perspective, 16 Harv. J. Law & Tech. 395, 403, 408-09 (2003). To date, owners of copyrights have not been able to stop the use of these decentralized programs. See Metro-Goldwyn-Mayer Studios, Inc. v. Grokster, Ltd., 259 F.Supp.2d 1029 (C.D.Cal.2003) (holding Grokster not contributorily liable for copyright infringement by users of its P2P file sharing program). The RIAA now has begun to direct its anti-infringement efforts against individual users of P2P file sharing programs. 
In order to pursue apparent infringers the RIAA needs to be able to identify the
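
The `get_citation_context_words` helper lives in `src.citations` and is not shown in this notebook. As a rough illustration of the idea only, a word-window extractor could look like the sketch below. The function name `context_words` and its behavior are assumptions for illustration, not the project's implementation.

```python
def context_words(text: str, citation: str, words_before: int, words_after: int) -> str:
    """Return a whitespace-normalized window of words around the first match.

    Hypothetical stand-in for the project's helper: keeps up to `words_before`
    words preceding the citation and up to `words_after` words following it.
    """
    pos = text.find(citation)
    if pos == -1:
        # Citation string not present in the text
        return ""

    before = text[:pos].split()
    after = text[pos + len(citation):].split()

    # Guard against before[-0:], which would return the whole list
    window = before[-words_before:] if words_before > 0 else []
    window.append(citation)
    window.extend(after[:words_after])

    # Joining on single spaces collapses newlines and indentation; it also
    # inserts a space if the citation touched punctuation (e.g. a parenthesis)
    return " ".join(window)
```

Called as `context_words(text, "239 F.3d 1004", 400, 200)`, this sketch would return up to 400 words before and 200 words after the first occurrence only; later occurrences of the same citation are ignored.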

In [12]:
context_sents = get_citation_context_sents(text, "239 F.3d 1004", 6, 2)
highlighted_contexts = context_sents.replace("239 F.3d 1004", "**239 F.3d 1004**")
Markdown(highlighted_contexts)

This case concerns the Recording Industry Association of America’s use of the subpoena provision of the Digital Millennium Copyright Act,  17 U.S.C. § 512 (h), to identify internet users the RIAA believes are infringing the copyrights of its members. The RIAA served two subpoenas upon Verizon Internet Services in order to discover the names of two Verizon subscribers who appeared to be trading large numbers of .mp3 files of copyrighted music via “peer-to-peer” (P2P) file sharing programs, such as KaZaA. Verizon refused to comply with the subpoenas on various legal grounds. 
      The district court rejected Verizon’s statutory and constitutional challenges to § 512(h) and ordered the internet service provider (ISP) to disclose to the RIAA the names of the two subscribers. On appeal Verizon presents three alternative arguments for reversing the orders of the district court: (1) § 512(h) does not authorize the issuance of a subpoena to an ISP acting solely as a conduit for communications the content of which is determined by others; if the statute does authorize such a subpoena, then the statute is unconstitutional because (2) the district court lacked Article III jurisdiction to issue a subpoena with no underlying “case or controversy” pending before the court; and (3) § 512(h) violates the First Amendment because it lacks sufficient safeguards to protect an internet user’s ability to speak and to associate anonymously. Because we agree with Verizon’s interpretation of the statute, we reverse the orders of the district court enforcing the subpoenas and do not reach either of Verizon’s constitutional arguments. * 
     
     I. Background 
     Individuals with a personal computer and access to the internet began to offer digital copies of recordings for download by other users, an activity known as file sharing, in the late 1990’s using a program called Napster. Although recording companies and music publishers successfully obtained an injunction against Napster’s facilitating the sharing of files containing copyrighted recordings,  see A&M Records, Inc. v. Napster, Inc.,  284 F.3d 1091  (9th Cir.2002);  A&M Records, Inc. v. Napster, Inc.,  **239 F.3d 1004**  (9th Cir.2001), millions of people in the United States and around the world continue to share digital .mp3 files of copyrighted recordings using P2P computer programs such as KaZaA, Morpheus, Grokster, and eDonkey.   See  John Borland,  File Swapping Shifts Up a Gear  (May 27, 2003),  available at  http:// news.com.com/2100-1026-1009742.html,  *88 (last visited December 2, 2003). Unlike Napster, which relied upon a centralized communication architecture to identify the .mp3 files available for download, the current generation of P2P file sharing programs allow an internet user to search directly the .mp3 file libraries of other users; no web site is involved.  

In [13]:
citations_df = extract_citations_with_context_from_df(df, "Complete Text", "Citing Case ID", 4, 2)
print(f"df shape: {citations_df.shape}")
print(f"distinct citations: {citations_df.citation.nunique()}")
citations_df.head()

df shape: (8846, 5)
distinct citations: 575


Unnamed: 0,id,citation,context,start_char,end_char
0,3720681,351 F.3d 1229,\n 351 F.3d 1229 \n RECORDING INDUSTRY ASS...,0,160
1,3720681,17 U.S.C. § 512,"Paul Alan Levy, Alan B. Morrison, and Allison ...",1477,2366
2,3720681,284 F.3d 1091,The district court rejected Verizon’s statutor...,2366,4354
3,3720681,239 F.3d 1004,The district court rejected Verizon’s statutor...,2366,4354
4,3720681,259 F. Supp. 2d 1029,"Unlike Napster, which relied upon a centralize...",4355,5233


In [14]:
df2 = citations_df.drop_duplicates()
print(f"df shape: {df2.shape}")
print(f"distinct citations: {df2.citation.nunique()}")
df2.head()

df shape: (4104, 5)
distinct citations: 575


Unnamed: 0,id,citation,context,start_char,end_char
0,3720681,351 F.3d 1229,\n 351 F.3d 1229 \n RECORDING INDUSTRY ASS...,0,160
1,3720681,17 U.S.C. § 512,"Paul Alan Levy, Alan B. Morrison, and Allison ...",1477,2366
2,3720681,284 F.3d 1091,The district court rejected Verizon’s statutor...,2366,4354
3,3720681,239 F.3d 1004,The district court rejected Verizon’s statutor...,2366,4354
4,3720681,259 F. Supp. 2d 1029,"Unlike Napster, which relied upon a centralize...",4355,5233


In [15]:
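# Merge overlapping or adjacent contexts for each citation.
# Caveats: grouping is by citation only, so rows from different citing cases
# are compared against each other, and the "id" stored below is the DataFrame
# row label (idx) of the row being iterated, not the citing case ID.
grouped = df2.groupby('citation')
merged_data = []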

for citation, group in grouped:
    # Sort contexts by their starting position
    sorted_group = group.sort_values(by=['start_char'])
    
    merged_contexts = []
    current_start, current_end, current_context = None, None, ""
    
    for idx, row in sorted_group.iterrows():
        if current_start is None:  # For the first iteration
            current_start, current_end, current_context = row['start_char'], row['end_char'], row['context']
        elif row['start_char'] <= current_end:  # Overlapping or adjacent contexts
            # Merge contexts by extending the end if necessary and combining the text
            current_end = max(current_end, row['end_char'])
            # Prevent duplicating text if the contexts are exactly the same
            if row['context'] not in current_context:
                current_context += " " + row['context']
        else:
            # No overlap; add the current context to merged_contexts and start a new one
            merged_contexts.append({"id": idx, "citation": citation, "context": current_context, "start_char": current_start, "end_char": current_end})
            current_start, current_end, current_context = row['start_char'], row['end_char'], row['context']
    
    # Don't forget to add the last context
    merged_contexts.append({"id": idx, "citation": citation, "context": current_context, "start_char": current_start, "end_char": current_end})
    
    merged_data.extend(merged_contexts)

# Convert merged data back into a DataFrame
merged_df = pd.DataFrame(merged_data)
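
If `start_char` and `end_char` are offsets within each citing case's text (which the `citations_df` output suggests, but which is an assumption here), overlaps are only meaningful between rows that share both `id` and `citation`. A minimal sketch of that variant follows, written against plain dicts so it runs without pandas. The function name `merge_contexts` is hypothetical.

```python
from collections import defaultdict


def merge_contexts(rows):
    """Merge overlapping spans within each (id, citation) pair.

    `rows` is an iterable of dicts with the keys
    id, citation, context, start_char, end_char.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["id"], row["citation"])].append(row)

    merged = []
    for (doc_id, citation), group in groups.items():
        # Sort by position so overlapping spans become neighbors
        group.sort(key=lambda r: (r["start_char"], r["end_char"]))
        current = None
        for row in group:
            if current is not None and row["start_char"] <= current["end_char"]:
                # Overlapping or adjacent: extend the span, append unseen text
                current["end_char"] = max(current["end_char"], row["end_char"])
                if row["context"] not in current["context"]:
                    current["context"] += " " + row["context"]
            else:
                if current is not None:
                    merged.append(current)
                current = {
                    "id": doc_id,
                    "citation": citation,
                    "context": row["context"],
                    "start_char": row["start_char"],
                    "end_char": row["end_char"],
                }
        if current is not None:
            merged.append(current)
    return merged
```

With pandas available this could be applied as `pd.DataFrame(merge_contexts(df2.to_dict("records")))`. The row counts would likely differ from the per-citation merge in this notebook, since spans from different cases are no longer merged together.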

In [16]:
print(f"df shape: {merged_df.shape}")
print(f"distinct citations: {merged_df.citation.nunique()}")
merged_df.head()

df shape: (2923, 5)
distinct citations: 575


Unnamed: 0,id,citation,context,start_char,end_char
0,2926,1 L. Ed. 2d 465,"In re Napster, Inc. Copyright Litigation, 191...",54510,55406
1,3098,1 L. Ed. 2d 465,"In re Napster, Inc. Copyright Litigation, 191...",118732,119628
2,3101,1 L. Ed. 2d 465,The issue focuses on when plaintiffs can bring...,186464,187384
3,3101,1 L. Ed. 2d 465,"402 , 86 L.Ed. 363 (1942) (“Equity may right...",187603,188343
4,2868,10 F.R.D. 534,See Oneida Indian Nation of New York v. City o...,18785,20019


In [17]:
merged_df["tokens"] = merged_df["context"].apply(count_tokens)
merged_df["tokens"].describe()

count    2923.000000
mean      326.897708
std       273.851818
min        64.000000
25%       190.000000
50%       244.000000
75%       365.000000
max      4699.000000
Name: tokens, dtype: float64

In [18]:
Markdown(merged_df.head(2)['context'].tolist()[1])

In re Napster, Inc. Copyright Litigation,  191 F.Supp.2d 1087 , 1108 (N.D.Cal.2002)(“The doctrine does not prevent plaintiffs from ultimately recovering for acts of infringement that occur during the period of misuse. The issue focuses on when plaintiffs can bring or pursue an action for infringement, not for which acts of infringement they can recover.”). 
      Moreover, Defendants base their contention that “[w]hether a plaintiff may retroactively sue for infringement of its copyright during a period of misuse of the copyright is an unsettled issue” is based on dicta from a patent misuse case,  U.S. Gypsum Co. v. National Gypsum Co.,  352 U.S. 457 ,  77 S.Ct. 490 ,  1 L.Ed.2d 465  (1957), in which the Supreme Court reversed the lower court’s unsupported finding that the misuse remained “unpurged” as of the time of suit, precluding enforcement of the patent.   Id.  at 463 ,  77 S.Ct.

In [19]:
docs = dataframe_to_documents(df, "Complete Text", metadata=["Citing Case ID", "Citing Case Title"])

In [20]:
len(docs)

29

In [21]:
parser = Parser(
    ParsingConfig(
        chunk_size=500,
        n_neighbor_ids=2
    )
)

In [22]:
chunks = parser.split_para_sentence(docs)
len(chunks)

2995

In [23]:
Markdown(chunks[1000].content)

And of the Courts of Appeals that have considered the matter, only one has proposed interpreting  Sony  more strictly than I would do — in a case where the product might have failed under  any  standard. In re Aimster Copyright Litigation,  334 F. 3d 643 , 653 (CA7 2003) (defendant “failed to show that its service is  ever  used for any purpose other than to infringe” copyrights (emphasis added)); see  Matthew Bender & Co. v. West Pub. Co.,  158  *956 F. 3d 693 , 706-707 (CA2 1998) (court did not  require  that noninfringing uses be “predominant,” it merely found that they  were  predominant, and therefore provided no analysis of  Sony’s  boundaries); but see  ante,  at 944, n. 1 (Ginsburg, J., concurring); see also  A&M Records, Inc. v. Napster, Inc.,  239 F. 3d 1004 , 1020 (CA9 2001) (discussing  Sony); Cable/Home Communication Corp. v. Network Productions, Inc.,  902 F. 2d 829 , 842-847 (CA11 1990) (same);  Vault Corp. v. Quaid Software, Ltd.,  847 F. 2d 255 , 262 (CA5 1988) (same); cf. Dynacore Holdings Corp. v. U S. Philips Corp.,  363 F. 3d 1263 , 1275 (CA Fed. 2004) (same); see also  Doe  v. GTE Corp.,  347 F. 3d 655 , 661 (CA7 2003) (“A person may be liable as a contributory infringer if the product or service it sells has no (or only slight) legal use”). Instead, the real question is whether we should modify the  Sony  standard, as MGM requests, or interpret  Sony  more strictly, as I believe Justice Ginsburg’s approach would do in practice.

In [24]:
chunks[3].metadata

DynamicMetaData(source='context', is_chunk=True, id='3e03d7ea-fe4d-c746-a611-6879bb8d5fc7', window_ids=['2ad8e729-a1be-6cbb-bb90-d4dd84fd94b9', '8605964f-ba70-81df-a37f-3c67bca750dc', '3e03d7ea-fe4d-c746-a611-6879bb8d5fc7', '728d7a22-51ac-e8b6-9f46-40d03ee67c8e', '4b3cbeef-348a-596e-b497-e95f33b988cc'], Citing Case ID=3720681, Citing Case Title="Recording Industry Ass'n of America, Inc. v. Verizon Internet Services, Inc., 359 U.S. App. D.C. 85 (D.C. Cir. 2003)")

In [6]:
from src.db.lancedb import LanceDBConfig, LanceDB
from src.embedding_models.models import OpenAIEmbeddingsConfig

In [7]:
reddit = pd.read_parquet("./data/reddit_legal_cluster_test_results.parquet")

In [8]:
reddit.rename(columns={"embeddings": "vector"}, inplace=True)
reddit.head(2)

Unnamed: 0,index,created_utc,full_link,id,body,title,text_label,flair_label,vector,token_count,llm_title,State,kmeans_label,topic_title
1078,1078,1575952538,https://www.reddit.com/r/legaladvice/comments/...,e8lsen,I applied for a job and after two interviews I...,"Failed a drug test due to amphetamines, I have...",employment,5,"[9.475638042064453e-05, 0.0005111666301983955,...",493,"""Validity of Schedule II Drug Prescription in ...",PR,8,Employment Legal Concerns and Issues
2098,2098,1577442453,https://www.reddit.com/r/legaladvice/comments/...,eg9ll2,"Hi everyone, thanks in advance for any guidanc...","Speeding ticket in Tennessee, Georgia Driver's...",driving,4,"[-0.006706413111028856, 0.020911016696181495, ...",252,"""Speeding ticket consequences for out-of-state...",KY,10,Legal Topics in Traffic Violations


In [10]:
db_dir = ".lancedb/data"
ldb_cfg = LanceDBConfig(
    collection_name="reddit-legal",
    replace_collection=False,
    storage_path=db_dir,
    flatten=False,
    embedding=OpenAIEmbeddingsConfig()
)

In [11]:
db = LanceDB(
    ldb_cfg
)

In [13]:
db.create_collection("reddit-legal")

In [17]:
db.list_collections()

['lance-citations', 'reddit-legal']

In [16]:
db.add_dataframe(reddit, content="body", metadata=['text_label', 'token_count', 'State'])

In [19]:
res = db.similar_texts_with_scores(text="I got fired for smoking medical marijuana during lunch. Can they even do that?", k=10)

In [22]:
table = db.client.open_table("reddit-legal")

In [26]:
query="I got fired for smoking medical marijuana during lunch. Can they even do that?"
query_vector = db.embedding_fn([query])[0]

In [40]:
table.count_rows()

5000

In [74]:
# Metadata-only filter (no vector or full-text query), so name it accordingly
filtered_results = table.search().where('token_count > 300', prefilter=True).limit(10).with_row_id(True).to_pandas()
filtered_results['token_count'].describe()

count     10.000000
mean     503.300000
std      198.024157
min      308.000000
25%      369.250000
50%      422.500000
75%      643.000000
max      911.000000
Name: token_count, dtype: float64

In [27]:
query_vector_results = table.search(
    query_vector, query_type="vector").limit(20).with_row_id(True).to_arrow()

query_fts_results = table.search(
    query, query_type="fts").limit(20).with_row_id(True).to_arrow()

In [28]:
reranker = ColbertReranker()

In [29]:
ranked_res = reranker.rerank_hybrid(query, query_vector_results, query_fts_results)
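
`ColbertReranker.rerank_hybrid` is defined in `src` and not shown here. As a cheap baseline to compare its ordering against (not a description of what it does), the two candidate lists can be combined with reciprocal rank fusion over their `_rowid` values. The function name and the constant `k=60` are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of row ids into one ordering.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher scores come first; ties break by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking, start=1):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda row_id: (-scores[row_id], row_id))
```

With the Arrow tables above, the inputs could be built as `query_vector_results["_rowid"].to_pylist()` and `query_fts_results["_rowid"].to_pylist()`. Rows that appear in both lists rise to the top, which gives a quick sanity check on the reranked output.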

In [30]:
query

'I got fired for smoking medical marijuana during lunch. Can they even do that?'

In [31]:
Markdown(ranked_res.to_pandas().head(1)['content'].tolist()[0])

I currently work for a staffing agency, when I picked up my check today I noticed I had less hours than usual. When I asked my employer why my check was off they stated that the business I contracted with was now requiring them to deduct lunch breaks. I am not able to take a 30 min lunch break off the floor/out of building due to being the only employee in building certified to hold the medication cart keys. Is this legal? Everything I've found online says they cannot deduct a lunch break unless I am free to leave premesis, which I am not. Any information on this issue and advice on course of action would be appreciated

In [32]:
Markdown(query_fts_results.to_pandas().head(1)['content'].tolist()[0])

I have my medical marijuana card because mj lets me eat, not feel like my body is coming apart at it's joints, feel like life is not so hopeless, and sleep soundly, but I've been looking for new jobs and I'm unsure about drug testing. 

I've looked it up and know its a state-to-state thing, but I'm still a little iffy about Pennsylvania and how weirdly 50/50 it seems to be with medical marijuana. I don't smoke before or during work and never intend to do so. However I'm unsure about the legality of workers compensation and drug testing during a hire situation if I happen to fail the test. 

Will I get fired or fined from work and would the employer looking to hire me have grounds to dismiss me?

Also, this is my first time on this sub so I hope this isn't an annoying question or something.

In [33]:
Markdown(query_vector_results.to_pandas().head(1)['content'].tolist()[0])

I'm a New York State medical marijuana patient. I also work in healthcare. I applied to a new job at a new hospital, and they are discriminating against me for being a medical marijuana patient. I was offered the job and accepted, but when I went to get my pre-employment physical conducted, I gave them my medical marijuana card and informed them that I am a patient. They are now refusing to hire me. Is this legal? I already contacted the division of human rights at the labor department and they said I may or may not have a case.

In [34]:
query_fts_results.to_pandas().head()

Unnamed: 0,index,created_utc,full_link,id,content,title,text_label,flair_label,vector,token_count,llm_title,State,kmeans_label,topic_title,score,_rowid
0,7381,1590042919,https://www.reddit.com/r/legaladvice/comments/...,gnrqby,I have my medical marijuana card because mj le...,If I have a medical marijuana card can I be fi...,employment,5,"[-0.0059105414, 8.900394e-05, 0.014213444, -0....",180,"""Medical Marijuana and Employment: Can failing...",DE,8,Employment Legal Concerns and Issues,21.860365,3185
1,1410,1588738046,https://www.reddit.com/r/legaladvice/comments/...,geczaw,"i was driving without a license, I had a marij...",Got a license block as a minor,driving,4,"[0.0020290152, -0.0046247216, 0.0039017021, -0...",243,"""Challenges with probation, drug tests, and li...",SD,10,Legal Topics in Traffic Violations,20.85096,3577
2,7331,1576018396,https://www.reddit.com/r/legaladvice/comments/...,e8y7gr,"Hello,\n\nI was considering getting a medical ...",Considering getting a medical marijuana card- ...,employment,5,"[-0.018918596, -0.009167221, 0.00857643, -0.05...",276,"""Legal implications of company drug policies a...",PA,8,Employment Legal Concerns and Issues,20.696045,1764
3,3114,1476760869,https://www.reddit.com/r/legaladvice/comments/...,581p64,Asking this for a friend who lives in a mobile...,Medical Marijuana: own the Mobile home but its...,housing,7,"[0.014702618, 0.0030635837, 0.03439224, -0.035...",91,"""Legal rights of mobile home park residents in...",AZ,6,Compilation of Legal Topics,17.979671,1658
4,8075,1590524169,https://www.reddit.com/r/legaladvice/comments/...,gr4q8s,I'm a New York State medical marijuana patient...,Is it legal in New York to discriminate agains...,employment,5,"[-0.01495087, -0.0094863335, 0.009920432, -0.0...",114,"""Can a hospital refuse to hire a medical marij...",MI,8,Employment Legal Concerns and Issues,17.294579,2172


In [19]:
from typing import List, Literal
from pydantic import BaseModel, Field

class LegalCitation(BaseModel):
    """Information about a legal rule and its application."""
    
    citation: str = Field(
        default="None",
        description="The Citation specified by the user.",
    )
    sentiment: Literal['Agree', 'Neutral', 'Disagree'] = Field(
        default="Neutral",
        description="The Context's sentiment towards the Citation."
    )
    summary: str = Field(
        ...,
        description="A concise background summary of the Context in relation to the Citation.",
    )
    hypothetical_applications: List[str] = Field(
        default_factory=list,
        description="Hypothetical questions or arguments that the Citation would apply to.",
    )
    keywords: List[str] = Field(
        default_factory=list, description="Keywords about the Citation from the Context."
    )

In [20]:
from tenacity import AsyncRetrying, stop_after_attempt, wait_fixed

In [21]:
import time

class Timer:
    def __init__(self, name):
        self.name = name
        self.start = None
        self.end = None

    async def __aenter__(self):
        self.start = time.time()

    async def __aexit__(self, *args, **kwargs):
        self.end = time.time()
        print(f"{self.name} took {(self.end - self.start):.2f} seconds")

In [22]:
import openai
import instructor

client = instructor.patch(openai.AsyncOpenAI())

In [23]:
async def analyze_citations(citation, context):
    """Ask the model for a structured LegalCitation describing `citation` within `context`."""
    return await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=LegalCitation,
        # Stop after 2 attempts, waiting 1 second between them
        max_retries=AsyncRetrying(
            stop=stop_after_attempt(2),
            wait=wait_fixed(1),
        ),
        messages=[
            {
                "role": "system",
                "content": "Your role is to extract information about a legal citation using the context provided and without prior knowledge.",
            },
            {"role": "user", "content": f"Your task focuses on citation: {citation}"},
            {"role": "user", "content": f"Here is the context: {context}"},
        ],
    )

In [28]:
sample_df = merged_df.sample(10, random_state=42)
sample_df.reset_index(inplace=True, drop=True)

In [25]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

In [29]:
async with Timer("asyncio.gather"):
    tasks_ = [analyze_citations(row['citation'], row['context']) for _, row in sample_df.iterrows()]
    all_data = await asyncio.gather(*tasks_)

asyncio.gather took 13.66 seconds
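
Launching every request at once is fine for a 10-row sample. For the full `merged_df` (2,923 rows) it may be worth capping the number of in-flight requests. A minimal sketch using `asyncio.Semaphore` follows. The helper name `gather_bounded` and the default `limit=5` are assumptions, not values taken from this notebook.

```python
import asyncio


async def gather_bounded(coro_factories, limit=5):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results are returned in the same order as `coro_factories`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        # The coroutine is only created once a slot is free
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run(factory) for factory in coro_factories))


# Possible usage with the function defined earlier in this notebook:
# factories = [
#     (lambda r=row: analyze_citations(r["citation"], r["context"]))
#     for _, row in sample_df.iterrows()
# ]
# all_data = await gather_bounded(factories, limit=5)
```

Passing factories rather than already-created coroutines keeps each request from being constructed until the semaphore admits it.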


In [30]:
data_dicts = [model.model_dump() for model in all_data]
df_res = pd.DataFrame(data_dicts)
df_res['keywords'] = df_res['keywords'].apply(lambda x: ', '.join(x))
df_res['hypothetical_applications'] = df_res['hypothetical_applications'].apply(lambda x: ', '.join(x))
df_res

Unnamed: 0,citation,sentiment,summary,hypothetical_applications,keywords
0,259 F. Supp. 2d at 1036,Neutral,"In the context provided, the citation concerns...",Would the court's decision apply in a case whe...,"Grokster I, peer-to-peer file-sharing, copyrig..."
1,676 F.3d at 31,Neutral,The context discusses the interpretation of sp...,Would an online service provider be excluded f...,"Section 512(c), safe harbor, service provider,..."
2,284 F.3d at 1098,Neutral,The context involves legal principles surround...,Would a court be able to modify an injunction ...,"injunction, modify, preliminary injunction, in..."
3,373 F.3d at 555,Neutral,"In the context provided, the DMCA's relevance ...",Would the DMCA's safe harbor provisions apply ...,"DMCA, safe harbor, copyright infringement, CoS..."
4,2007 WL 1246448,Neutral,The context revolves around the uncertainty an...,How would the post-eBay legal landscape affect...,"post-eBay, preliminary injunction, Ninth Circu..."
5,464 U.S. 417,Neutral,"In Sony Corp. v. Universal City Studios, Inc. ...",Would Sony’s Betamax case apply to modern stre...,"Sony, Betamax, videocassette recorder, copyrig..."
6,648 F.Supp. 1127,Neutral,The context discusses the case of Broderbund S...,Would a claim of damages based on increased op...,"copyright infringement, damages, speculative c..."
7,676 F.3d 19,Neutral,This case involves the application of establis...,Could the precedent set by this case be used t...,"intellectual property, new technologies, film ..."
8,373 F.3d at 555,Neutral,The context discusses a legal contention invol...,How would the DMCA apply in a case where a ser...,"DMCA, copyright infringement, safe harbor, pre..."
9,711 A.2d 951,Neutral,The citation 711 A.2d 951 refers to the case B...,Would the citation apply if a blog post contai...,"defamation claim, false and defamatory stateme..."


In [31]:
# DataFrame.to_markdown relies on the optional `tabulate` package.
Markdown(df_res.to_markdown(index=False))

| citation                | sentiment   | summary                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | hypothetical_applications                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | keywords                                                                                                                                                                                  |
|:------------------------|:------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 259 F. Supp. 2d at 1036 | Neutral     | In the context provided, the citation concerns a judicial stance on peer-to-peer file-sharing technology in the case, Grokster I. The court's decision indicated that the activities involved were too incidental to direct copyright infringement to constitute a material contribution. It was emphasized that the defendants did not host any infringing files or lists of infringing files, nor did they regulate or provide access to such files. The technology in question, despite attempts to portray it as a method to circumvent previous copyright infringement rulings (Napster I and II), was recognized for its varied applications beyond facilitating copyright infringement. The court notably highlighted the technology's role in significantly reducing distribution costs for public domain and permissively shared content, and in lessening centralized control over distribution. In light of these considerations, the court declined to expand contributory copyright liability as the copyright owners requested. | Would the court's decision apply in a case where a company's technology is primarily used for copyright infringement but has potential legitimate uses?, Can a platform be held liable for contributory copyright infringement if it does not host infringing content but indirectly benefits from the traffic generated by such content?, Does the decision imply that technologies with both infringing and non-infringing uses can always avoid contributory copyright liability?                                                                            | Grokster I, peer-to-peer file-sharing, copyright infringement, material contribution, public domain, permissively shared content, contributory copyright liability, Napster I, Napster II |
| 676 F.3d at 31          | Neutral     | The context discusses the interpretation of specific provisions within Section 512(c) of a legal statute, focusing on the definition of 'service provider' and the scope of liability under the statute's safe harbor provisions for online service providers or network access operators. Specifically, it addresses the uncertainty in case law regarding whether exclusion from the § 512(c) safe harbor due to actual or 'red flag' knowledge of infringing activity limits liability merely to that infringing activity or more broadly. The citation, 676 F.3d at 31, is referenced in relation to this uncertainty, noting the limited case law interpreting the knowledge provisions of the § 512(c) safe harbor.                                                                                                                                                                                                                                                                                                                     | Would an online service provider be excluded from the § 512(c) safe harbor if it had 'red flag' knowledge of specific infringing content on its platform?, Does the liability limitation under the § 512(c) safe harbor apply only to the specific infringing activity known by the service provider, or does it extend to other activities as well?, How does the interpretation of 'financial benefit/right to control' provisions affect a service provider's eligibility for the § 512(c) safe harbor?                                                      | Section 512(c), safe harbor, service provider, online services, network access, case law, liability, infringing activity, red flag knowledge                                              |
| 284 F.3d at 1098        | Neutral     | The context involves legal principles surrounding the adaptability of court orders, specifically injunctions, in response to new circumstances or facts. The citation comes from a case mentioned in support of the idea that a court has inherent authority to modify a preliminary injunction due to new facts. This concept is underscored by referencing several cases that together establish the principle that court orders, including injunctions, are not static. They can and should be adapted as necessary to address                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Would a court be able to modify an injunction if significant new evidence comes to light after the injunction is issued?, In a situation where there's a substantial change in law affecting the basis of a previously issued injunction, does the court have the authority to revisit and possibly modify the injunction?, Can a plaintiff request an amendment to an injunction based on unexpected developments that significantly affect the effectiveness of that injunction?                                                                              | injunction, modify, preliminary injunction, inherent authority, new facts                                                                                                                 |
| 373 F.3d at 555         | Neutral     | In the context provided, the DMCA's relevance in determining what constitutes a prima facie case of copyright infringement is being discussed. Cox contends that the court's instruction on the DMCA was highly prejudicial. The cited case, CoStar Group, Inc., establishes that everyone agrees the DMCA is irrelevant in determining the criteria for a prima facie case of copyright infringement. The citation emphasizes that the effect of DMCA’s safe harbor provisions is to limit the remedies available against parties otherwise found liable for copyright infringement.                                                                                                                                                                                                                                                                                                                                                                                                                                                         | Would the DMCA's safe harbor provisions apply to a social media platform accused of copyright infringement by not taking down copyrighted content upon notification?, Is the DMCA relevant in a lawsuit where a content creator sues a website for copyright infringement for not implementing effective notice-and-takedown procedures?                                                                                                                                                                                                                        | DMCA, safe harbor, copyright infringement, CoStar Group, prima facie case                                                                                                                 |
| 2007 WL 1246448         | Neutral     | The context revolves around the uncertainty and evolving nature of the law concerning the issuance of preliminary injunctions in the Ninth Circuit following the eBay Inc. v. MercExchange, L.L.C. decision. The reference to 2007 WL 1246448, along with other case citations, highlights the ongoing questions about the effects of the eBay decision on the standards and presumptions applicable to requests for such injunctions. Specifically, there's an implication that post-eBay, there may no longer be a presumption of irreparable harm in cases seeking permanent or preliminary injunctions, a notion that is considered in light of Amoco Production Co. v. Village of Gambell. This creates a legal environment where the applicability and strength of previous Ninth Circuit decisions on preliminary injunctions are in question.                                                                                                                                                                                         | How would the post-eBay legal landscape affect a startup's request for a preliminary injunction against a competitor accused of intellectual property infringement?, In the context of a patent dispute, would the absence of a presumption of irreparable harm influence the court's decision to grant a preliminary injunction?, Does the eBay decision impact the strategy of companies seeking to enforce non-compete agreements through preliminary injunctions in the Ninth Circuit?                                                                      | post-eBay, preliminary injunction, Ninth Circuit, irreparable harm, Amoco, patent dispute, intellectual property, non-compete agreements                                                  |
| 464 U.S. 417            | Neutral     | In Sony Corp. v. Universal City Studios, Inc. (Sony), the Supreme Court considered Sony’s liability for selling the Betamax videocassette recorder. The ruling centered around a full trial record, scrutinizing whether Sony could be held liable for potential copyright infringement by users of the Betamax. The context highlights how the decision in Sony along with copyright law principles have informed distinctions in patent law, particularly between active inducement liability and contributory liability for distributing products not suitable for substantial noninfringing use. The discussion also touches upon overlapping yet distinct categories of culpable behavior in copyright and patent law.                                                                                                                                                                                                                                                                                                                   | Would Sony’s Betamax case apply to modern streaming devices in terms of liability for copyright infringement?, How does the distinction between active inducement liability and contributory liability impact tech companies' distribution of multi-purpose devices?, In what ways might a company mitigate risk when a product can be used for both infringing and non-infringing purposes?                                                                                                                                                                    | Sony, Betamax, videocassette recorder, copyright infringement, liability, patent law, active inducement, contributory liability                                                           |
| 648 F.Supp. 1127        | Neutral     | The context discusses the case of Broderbund Software, Inc. v. Unison World, Inc., where the court found that the defendant may have profited from alleged copyright infringement. StreamCast made claims that damages were incurred due to Plaintiffs' refusal to assist, which made their filtering system more burdensome to update. However, such claims were deemed speculative and insufficient to prove discernible damages. The court suggested that StreamCast may have benefitted from the plaintiffs' refusal to cooperate, likely resulting in increased advertising revenue due to an ineffective filter facilitating direct infringement. The precedent set by this case was used to argue that StreamCast's business model profited from mass copyright infringement and that their claims of personal injury were unsubstantial. Furthermore, StreamCast's argument that the public would suffer harm if it were shut down was also mentioned.                                                                                | Would a claim of damages based on increased operational difficulties due to a third party's refusal to cooperate be considered sufficient for legal relief?, Can a defendant be considered to have profited from copyright infringement if there is no effective system to prevent such infringement?, Is a speculative claim of personal injury adequate to establish discernible damages in a copyright infringement case?                                                                                                                                    | copyright infringement, damages, speculative claims, benefited from infringement, advertising revenue, business model, mass infringement, personal injury                                 |
| 676 F.3d 19             | Neutral     | This case involves the application of established intellectual property concepts to new technologies. It is referenced in the context of a broader discussion about intellectual property law and its intersection with new media and technologies. Specifically, the case of 676 F.3d 19 is mentioned alongside other notable cases that have addressed similar issues within the realm of intellectual property law. In this context, various film studios alleged that the services and websites maintained by Gary Fung and his company, isoHunt Web Technologies, Inc., induced third parties to download infringing copies of the studios' copyrighted works.                                                                                                                                                                                                                                                                                                                                                                           | Could the precedent set by this case be used to argue that a new file-sharing platform is liable for copyright infringement?, How might this case be cited in arguing for or against the applicability of copyright law to emerging technologies such as blockchain or non-fungible tokens (NFTs)?, In what way does this case influence the legal strategy of companies owning intellectual property rights when dealing with platforms that potentially facilitate copyright infringement?                                                                    | intellectual property, new technologies, film studios, copyright infringement, Gary Fung, isoHunt Web Technologies                                                                        |
| 373 F.3d at 555         | Neutral     | The context discusses a legal contention involving the DMCA (Digital Millennium Copyright Act) and a court's instruction related to it. Cox argued that the instruction on the DMCA was highly prejudicial. The precedent established in CoStar Grp., Inc. v. LoopNet, Inc., as cited (373 F.3d at 555), holds that the DMCA is "irrelevant to determining what constitutes a prima facie case of copyright infringement." This establishes that the DMCA's significance lies in its ability to limit remedies available against a party found liable for copyright infringement, rather than affecting the establishment of copyright infringement itself.                                                                                                                                                                                                                                                                                                                                                                                   | How would the DMCA apply in a case where a service provider is accused of copyright infringement but claims safe harbor protection?, In what circumstances can a party argue that the DMCA's safe harbor provisions do not affect the determination of a prima facie case of copyright infringement?                                                                                                                                                                                                                                                            | DMCA, copyright infringement, safe harbor, prejudicial, CoStar Grp., Inc., prima facie case                                                                                               |
| 711 A.2d 951            | Neutral     | The citation 711 A.2d 951 refers to the case Beck v. Tribert, which discusses the requirements for stating a defamation claim under New Jersey law. According to this case, to establish a defamation claim, the Defendants must prove that the Plaintiffs made a false and defamatory statement of fact about them, that the Plaintiffs knew or should have known was false, and that this statement was communicated to third parties, causing damages.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | Would the citation apply if a blog post contained false statements about a person that damaged their reputation, and the author should have known these statements were false?, Could this citation be relevant in a scenario where a public figure is accused of making untrue statements about a competitor on social media, causing them harm?, Does this citation provide a basis for action if a newsletter publishes false information about an individual's criminal record, knowing it could be false, and it leads to the individual losing their job? | defamation claim, false and defamatory statement, knew or should have known, communicated to third parties, causing damages                                                               |
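
One caveat with joining on `', '`: some items contain commas themselves (for example the keyword `CoStar Grp., Inc.` in the table above, and most of the hypothetical questions), so the joined string cannot be reliably split back into its original items. Below is a minimal sketch of an alternative, assuming a semicolon separator is acceptable for display. The helper `join_items` is hypothetical and not part of this repository; it also passes non-list values through unchanged, so re-running the cell does not mangle strings that were already joined.

```python
def join_items(items, sep="; "):
    """Join a list of strings with `sep`; return non-list values unchanged.

    Hypothetical helper for illustration only.
    """
    if not isinstance(items, (list, tuple)):
        # Already a string (e.g. the cell was run twice) or a missing value.
        return items
    return sep.join(str(item) for item in items)


# Keywords taken from the table above; one of them contains ", ".
keywords = ["DMCA", "safe harbor", "CoStar Grp., Inc.", "prima facie case"]

joined = join_items(keywords)

# Splitting on the same separator recovers the original items ...
assert joined.split("; ") == keywords
# ... whereas a ', ' join would split "CoStar Grp., Inc." in two.
assert ", ".join(keywords).split(", ") != keywords
```

In the notebook this could be applied as `df_res['keywords'].apply(join_items)`, and likewise for `hypothetical_applications`. This only works as long as no item contains `"; "` itself.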

In [32]:
from lancedb.pydantic import pydantic_to_schema