The original dataset was extracted from https://huggingface.co/datasets/lexlms/lex_files/tree/main and corresponds to the paper "Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. 2022. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada."

The name of the file is eurlex.zip.

The aim of this repository is to provide a simple way to download the dataset, process it, and use it in your own projects.
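The download step can be sketched as follows. This is a minimal sketch, not part of the original notebook: it assumes the `huggingface_hub` package is installed and that `eurlex.zip` sits at the root of the dataset repository linked above. The function names `download_archive` and `extract_archive` are hypothetical helpers introduced for illustration.

```python
import zipfile
from pathlib import Path

REPO_ID = "lexlms/lex_files"  # taken from the dataset URL above
ARCHIVE_NAME = "eurlex.zip"   # file name stated above; its location at the repo root is an assumption


def download_archive(target_dir):
    """Download the archive from the Hugging Face Hub (hypothetical helper)."""
    # Imported lazily so the extraction helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=ARCHIVE_NAME,
            repo_type="dataset",
            local_dir=target_dir,
        )
    )


def extract_archive(zip_path, out_dir):
    """Extract a zip archive and return the sorted list of member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())


# Example usage (requires network access):
# zip_path = download_archive("data")
# print(extract_archive(zip_path, "data/eurlex"))
```

The names of the files inside the archive are not guaranteed by this sketch; check the returned list before pointing the loading code at them.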

In [1]:
import pandas as pd



In [2]:
# Read the three splits of the data.
# Replace the paths below with your own.

df_1 = pd.read_json(r"D:\Corporas\eurlex\train.jsonl", lines=True)

df_2 = pd.read_json(r"D:\Corporas\eurlex\test.jsonl", lines=True)

df_3 = pd.read_json(r"D:\Corporas\eurlex\validation.jsonl", lines=True)
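If the three files live in one folder, the hard-coded paths can be replaced by a single base directory. The following is a minimal sketch under that assumption; `load_splits` and `SPLITS` are hypothetical names, and the file names are the ones used in the cell above.

```python
from pathlib import Path

import pandas as pd

# Split names taken from the file names used in the notebook.
SPLITS = ("train", "test", "validation")


def load_splits(data_dir):
    """Load <split>.jsonl for every split and return a dict of DataFrames."""
    data_dir = Path(data_dir)
    frames = {}
    for split in SPLITS:
        path = data_dir / f"{split}.jsonl"
        if not path.exists():
            raise FileNotFoundError(f"Missing split file: {path}")
        # Each line of the file is one JSON record.
        frames[split] = pd.read_json(path, lines=True)
    return frames


# Example usage:
# frames = load_splits(r"D:\Corporas\eurlex")
# df_1, df_2, df_3 = frames["train"], frames["test"], frames["validation"]
```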

In [3]:
# We will merge the three of them, since we won't use them for any machine learning algorithm yet

merged_data = pd.concat([df_1, df_2, df_3], ignore_index=True)
merged_data


Unnamed: 0,id,sector,descriptor,year,text
0,32014D0265,3,D,2014,12.5.2014\nEN\nOfficial Journal of the Europea...
1,32014L0038,3,L,2014,11.3.2014\nEN\nOfficial Journal of the Europea...
2,62014CJ0195,6,CJ,2014,JUDGMENT OF THE COURT (Ninth Chamber)\n4 June ...
3,32014D0933,3,D,2014,19.12.2014\nEN\nOfficial Journal of the Europe...
4,32014R0749,3,R,2014,11.7.2014\nEN\nOfficial Journal of the Europea...
...,...,...,...,...,...
123492,32014D0219,3,D,2014,16.4.2014\nEN\nOfficial Journal of the Europea...
123493,32014R0201,3,R,2014,4.3.2014\nEN\nOfficial Journal of the European...
123494,32014R1016,3,R,2014,27.9.2014\nEN\nOfficial Journal of the Europea...
123495,62014CC0377,6,CC,2014,OPINION OF ADVOCATE GENERAL\nSHARPSTON\ndelive...
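Merging the splits discards the information about which split each row came from. If that may be needed later, one option is to tag each frame before concatenating. This is a sketch only; the `split` column and the `concat_with_split` helper are not part of the original notebook.

```python
import pandas as pd


def concat_with_split(frames):
    """Concatenate DataFrames, recording the source split in a 'split' column.

    `frames` maps a split name (e.g. "train") to its DataFrame.
    """
    tagged = [df.assign(split=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


# Example usage with the frames loaded above:
# merged_data = concat_with_split({"train": df_1, "test": df_2, "validation": df_3})
```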


In [4]:
# We will keep only the rows where 'descriptor' is 'CJ' or 'TJ', since our intention is to build a corpus only from European Courts decisions.
# For the meaning of the descriptors, see: https://eur-lex.europa.eu/content/tools/TableOfSectors/types_of_documents_in_eurlex.html

# Take an explicit copy so that the column assignments below modify an independent
# DataFrame rather than a slice of merged_data (this avoids pandas' SettingWithCopyWarning).
judgments = merged_data[merged_data['descriptor'].isin(['CJ', 'TJ'])].copy()

# Data cleaning: replace newlines with spaces and strip surrounding whitespace

judgments['text'] = judgments['text'].str.replace('\n', ' ').str.strip()
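In the rows shown above, the `id` column appears to encode the other metadata columns: `62014CJ0195` reads as sector `6`, year `2014`, descriptor `CJ`, and a document number. The sketch below parses identifiers of that shape. The regular expression is a simplified assumption based only on the ids visible in this notebook and will not cover every identifier variant; `parse_celex` is a hypothetical helper.

```python
import re

# Simplified pattern: one sector character, a four-digit year,
# a one- or two-letter descriptor, then the document number.
CELEX_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<descriptor>[A-Z]{1,2})(?P<number>\d+)$"
)


def parse_celex(celex_id):
    """Split an identifier into its parts, or return None if it does not match."""
    match = CELEX_PATTERN.match(celex_id)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    return parts


# Example usage:
# parse_celex("62014CJ0195")
# gives {'sector': '6', 'year': 2014, 'descriptor': 'CJ', 'number': '0195'}
```

This can serve as a consistency check, for example by comparing the parsed descriptor with the `descriptor` column before filtering.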


In [7]:
# Filter the judgments using keywords. The keywords were selected manually by the author.
# Let's see how many competition law (CL) decisions we have

keywords = ['Article 101', 'Article 102', 'Article 81', 'Article 82', 'anticompetitive', 
            'Antitrust', 'competition law', 'concerted practices', 
            'agreements between undertakings', 'restriction by object', 'restriction by effects', 
            'abuse of dominance', 'abuse of a dominant', 'cartels', 'price fixing', 'market allocation']

# Convert the keywords to lowercase for case-insensitive matching
keywords_lower = [keyword.lower() for keyword in keywords]

# Create a single boolean column indicating whether any of the keywords is present
judgments['keyword_present'] = judgments['text'].str.lower().str.contains('|'.join(keywords_lower))

judgments['keyword_present'].value_counts()



keyword_present
False    14715
True      1949
Name: count, dtype: int64
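The filter above matches plain substrings, so a keyword such as 'Article 81' also matches text like 'Article 810'. A stricter variant, sketched below, escapes each keyword and requires word boundaries on both sides. This is an alternative, not the method used in this notebook, and it would likely change the counts shown above. `build_keyword_pattern` is a hypothetical helper.

```python
import re


def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole phrase."""
    # re.escape guards against keywords that contain regex metacharacters.
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)


def contains_keyword(text, pattern):
    """Return True if the text contains at least one keyword."""
    return pattern.search(text) is not None


# Example usage with the keyword list defined above:
# pattern = build_keyword_pattern(keywords)
# judgments['keyword_present'] = judgments['text'].map(lambda t: contains_keyword(t, pattern))
```

Note that word boundaries keep the match exact: 'cartels' still does not match the singular 'cartel', as in the original filter.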

In [10]:
# The European competition law dataset is complete: keep only the matching rows

cl_judgements = judgments[judgments['keyword_present']].reset_index(drop=True).drop(columns=['keyword_present'])
cl_judgements

Unnamed: 0,id,sector,descriptor,year,text
0,62014CJ0151,6,CJ,2014,JUDGMENT OF THE COURT (Seventh Chamber) 10 Sep...
1,62014TJ0076,6,TJ,2014,JUDGMENT OF THE GENERAL COURT (Eighth Chamber)...
2,62014CJ0023,6,CJ,2014,JUDGMENT OF THE COURT (Second Chamber) 6 Octob...
3,62014TJ0079,6,TJ,2014,JUDGMENT OF THE GENERAL COURT (First Chamber) ...
4,62014CJ0033,6,CJ,2014,JUDGMENT OF THE COURT (Third Chamber) 17 Septe...
...,...,...,...,...,...
1944,62014CJ0154,6,CJ,2014,JUDGMENT OF THE COURT (Fifth Chamber) 16 June ...
1945,62014TJ0671,6,TJ,2014,JUDGMENT OF THE GENERAL COURT (Fifth Chamber) ...
1946,62014CJ0542,6,CJ,2014,JUDGMENT OF THE COURT (Fourth Chamber) 21 July...
1947,62014CJ0187,6,CJ,2014,JUDGMENT OF THE COURT (Fifth Chamber) 25 June ...
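To reuse the filtered corpus in other projects, it can be written back to disk in the same JSON Lines layout as the input files. This is a minimal sketch; the output file name and the `save_jsonl` helper are hypothetical.

```python
import pandas as pd


def save_jsonl(df, path):
    """Write a DataFrame as JSON Lines, one record per line."""
    # force_ascii=False keeps non-ASCII characters readable in the output file.
    df.to_json(path, orient="records", lines=True, force_ascii=False)


# Example usage:
# save_jsonl(cl_judgements, "cl_judgements.jsonl")
# reloaded = pd.read_json("cl_judgements.jsonl", lines=True)
```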
