<a href="https://colab.research.google.com/github/masha-medvedeva/AI_and_Digital_Skills/blob/main/predicting_court_decisions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Predicting court decisions of the European Court of Human Rights with machine learning and Google Colab



##Load libraries
First we need to load all the necessary libraries.
We can add more imports here as we need more functions along the way.

In [1]:
import pandas as pd #pandas library, a way to work with tables of data
from sklearn.feature_extraction.text import TfidfVectorizer #vectorizer converting documents into tf-idf representation
from sklearn.dummy import DummyClassifier #dummy classifier, a baseline that makes predictions without looking at the input
from sklearn.naive_bayes import MultinomialNB #Naive Bayes classifier
from sklearn.metrics import accuracy_score #accuracy score

##Using Google Colab

If you are using Colab, you will need to mount your Google Drive to be able to use the data stored in a folder on Google Drive.



In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


##Dataset

###Load data

Now let's open the file with the data. The file that we are going to be using is in .csv format. This is a table format. You can save and open .csv files with Excel. 

More about .csv on Wikipedia: https://en.wikipedia.org/wiki/Comma-separated_values
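
To make the format concrete, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The column names and values are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# every following line is one row, and commas separate the cells.
csv_text = """docname,respondent,violation
CASE A v. X,AAA,1
CASE B v. Y,BBB,0
"""

toy_csv = pd.read_csv(io.StringIO(csv_text))
print(toy_csv.shape)          # (number of rows, number of columns)
print(list(toy_csv.columns))  # column names taken from the first line
```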

The data is scraped from https://hudoc.echr.coe.int/.

**AT HOME:** I recommend looking at the actual files on this ECtHR website to see what kind of data is available and in what format.

In [3]:
#import csv with data
dataset = pd.read_csv('/content/gdrive/MyDrive/AI and Digital Skills/data/Judgements_v_1.0.csv')

###Inspect data

First let's see what we have available in our dataset. To do so, let's print the first five rows using `.head()`.

To see 10 rows, use `.head(10)`, and so on.

Note that the first row is numbered 0, not 1. This is the standard when programming in Python. You will get used to it.


In [4]:
dataset.head(5)

Unnamed: 0,itemid,docname,appno,conclusion,importance,originatingbody,kpdate,extractedappno,doctypebranch,respondent,...,url,text,procedure,procedure_facts,facts,relevant_law,law,verdict,violation,is_communicated
0,001-57516,CASE OF LAWLESS v. IRELAND (No. 1),332/57,Preliminary objection rejected (incompatibilit...,2,9,1960-11-14,332/57,CHAMBER,IRL,...,https://hudoc.echr.coe.int/app/conversion/docx...,COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF LAW...,On 13th April 1960 the Secretary of the Europe...,,The purpose of the Commission's Request - to w...,,"Whereas the Irish Government, in reply to the ...",Takes note of the withdrawal by the Irish Gove...,0,0
1,001-57517,CASE OF LAWLESS v. IRELAND (No. 2),332/57,Questions of procedure partially accepted,2,9,1961-04-07,332/57,CHAMBER,IRL,...,https://hudoc.echr.coe.int/app/conversion/docx...,COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF LAW...,,,,,,,0,0
2,001-57518,CASE OF LAWLESS v. IRELAND (No. 3),332/57,Questions of procedure rejected;No violation o...,2,9,1961-07-01,332/57;250/57,CHAMBER,IRL,...,https://hudoc.echr.coe.int/app/conversion/docx...,COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF LAW...,1. The present case was referred to the Court ...,,I\n1. The purpose of the Commission's request ...,,1. Whereas it has been established that G.R. L...,"Unanimously,\n(i) Dismisses the plea in bar de...",0,0
3,001-57433,CASE OF DE BECKER v. BELGIUM,214/56,Struck out of the list,2,9,1962-03-27,,CHAMBER,BEL,...,https://hudoc.echr.coe.int/app/conversion/docx...,COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF DE ...,"1. On 29th April 1960, the European Commission...",,,,,By 6 votes to 1\n \nDecides to strike the case...,0,0
4,001-57524,"CASE ""RELATING TO CERTAIN ASPECTS OF THE LAWS ...",1474/62;1677/62;1691/62;1769/63;1994/63;2126/64,Preliminary objection rejected (incompatibility),2,15,1967-02-09,1474/62;1677/62;1691/62;1769/63;1994/63;2126/64,CHAMBER,BEL,...,https://hudoc.echr.coe.int/app/conversion/docx...,"COURT (PLENARY)\n \n \n \n \n \n \nCASE ""RELAT...","1. By a request dated 25th June 1965, the Eur...",,1. The object of the Commission’s request is ...,,5. The Commission having decided to join the ...,,0,0


The rows are too long, so we can't see the names of all the columns. Instead, let's print all the information available for one of the cases. Let's use the case in row number 2 for that (item ID 001-57518, CASE OF LAWLESS v. IRELAND (No. 3)).

In [5]:
dataset.loc[2,]

itemid                                                     001-57518
docname                           CASE OF LAWLESS v. IRELAND (No. 3)
appno                                                         332/57
conclusion         Questions of procedure rejected;No violation o...
importance                                                         2
originatingbody                                                    9
kpdate                                                    1961-07-01
extractedappno                                         332/57;250/57
doctypebranch                                                CHAMBER
respondent                                                       IRL
article                              17;5;5-1-b;5-1-c;5-3;6;7;7-1;15
url                https://hudoc.echr.coe.int/app/conversion/docx...
text               COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF LAW...
procedure          1. The present case was referred to the Court ...
procedure_facts                   

What types of variables do you see?

**AT HOME:** Do they all make sense to you?



Now we see all the column names, but not the full text within each cell. To inspect the text, we can specify the column we want to look at:

In [6]:
dataset.loc[2,'facts']

'I\n1. The purpose of the Commission\'s request - to which is appended the Report drawn up by the Commission in accordance with the provisions of Article 31 (art. 31) of the Convention - is to submit the case of G.R. Lawless to the Court so that it may decide whether or not the facts of the case disclose that the Irish Government has failed in its obligations under the Convention.\nAs appears from the Commission\'s request and from its Memorial, G.R. Lawless alleges in his Application that, in his case, the Convention has been violated by the authorities of the Republic of Ireland, inasmuch as, in pursuance of an Order made by the Minister of Justice under section 4 of Act No. 2 of 1940 amending the Offences against the State Act, 1939, he was detained without trial, between 13th July and 11th December 1957, in a military detention camp situated in the territory of the Republic of Ireland.\n2. The facts of the case, as they appear from the Report of the Commission, the memorials, evide

To look at the full url you can use: `dataset.loc[2,'url']`

For verdict: `dataset.loc[2,'verdict']` , etc.

###Select data

We do not need all the data in the table for predicting court decisions. To make it easier to navigate, let's make a new table with only the facts of the cases, which we are going to use as input. Let's also keep the date of the judgement and the `violation` column, which will be the label.

**AT HOME**: Would you want to choose other columns? Which ones, and why?

Not all rows in the table have all the information (i.e., facts, date, violation label). There are many ways to deal with this. Today, we are just going to drop the rows that are missing any of these variables.

In [7]:
facts_dataset = (pd.DataFrame()
                .assign(Date=dataset['kpdate'],           #keep the date column, so we can sort the data based on when the judgement was made
                        Facts=dataset['facts'],           #keep the 'facts' column - this will be our main input
                        Violation=dataset['violation'])   #keep the 'violation' column - this will be our label
                .dropna()                                 #drop every row where at least one element is missing
                .reset_index(drop=True))                  #reset the index of the table after removing the rows

Note that we renamed the columns: they are now called `'Date'`, `'Facts'` and `'Violation'`.
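
As a small illustration of what dropping incomplete rows does, here is a sketch on a made-up three-row table. The values are invented, and only the `dropna()` and `reset_index(drop=True)` calls mirror the notebook.

```python
import pandas as pd

# Made-up rows; None marks a missing value.
toy_table = pd.DataFrame({
    'Date': ['2001-01-01', '2002-02-02', None],
    'Facts': ['some facts', None, 'other facts'],
    'Violation': [1, 0, 1],
})

# Keep only rows where every column has a value, then renumber the rows from 0.
toy_complete = toy_table.dropna().reset_index(drop=True)
print(len(toy_table), 'rows before,', len(toy_complete), 'row after')
```

Only the first row survives, because each of the other two is missing one value.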

Let's see what the dataset with just the selected columns looks like. How many cases are in it now?

In [8]:
facts_dataset

Unnamed: 0,Date,Facts,Violation
0,1960-11-14,The purpose of the Commission's Request - to w...,0
1,1961-07-01,I\n1. The purpose of the Commission's request ...,0
2,1967-02-09,1. The object of the Commission’s request is ...,0
3,1968-06-27,1. The object of the Commission’s request is t...,0
4,1968-06-27,1. The object of the request of the Commissio...,1
...,...,...,...
14532,2020-11-26,4. The list of the applications with the rele...,1
14533,2020-11-26,3. The applicant’s details and information re...,1
14534,2020-11-26,3. The list of the applications with the rele...,1
14535,2020-11-26,3. The list of applicants and the relevant de...,1



Let's look at our data more carefully. 

**AT HOME**: Experiment with printing the facts of different cases and their verdicts. Could you guess the court's decision based on the facts?

In [9]:
#print some data
row_number = 33 #change this number to try other rows
print('Facts of the case:\n', facts_dataset.loc[row_number, 'Facts'])
print('\n\nVerdict:', 'violation' if facts_dataset.loc[row_number, 'Violation'] == 1 else 'no violation')

Facts of the case:
 1. Particular facts of the case
(a) The applicant’s convictions and appeals
8. Mr. Ettore Artico, an Italian citizen born in 1917, is an accountant by profession.
On 27 January 1965, Mr. Artico was sentenced by the Verona District Judge (pretore) to eighteen months’ imprisonment and a fine for simple fraud (truffa semplice). A further sentence of eleven months’ imprisonment and a fine for repeated fraud (truffa con recidiva), impersonation (sostituzione di persona) and uttering worthless cheques was imposed on him by that Judge on 6 October 1970. These various offences had been committed in May/June 1964. Appeals lodged by the applicant on 28 January 1965 and 11 December 1970 were rejected on 16 December 1969 and 17 April 1971, respectively, by the Verona Criminal Court; it dealt with both cases in Mr. Artico’s absence.
On 11 October and 13 November 1971, the pretore issued committal warrants to enforce the two prison sentences. The warrants were served on the appli

###Data info

In [10]:
#print some numbers
print('Number of cases with violation:',
      facts_dataset['Violation'].value_counts()[1])

print('Number of cases without violation:',
      facts_dataset['Violation'].value_counts()[0])

Number of cases with violation: 12450
Number of cases without violation: 2087
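
The two counts can be turned into a share, which shows how unbalanced the dataset is. The sketch below only does arithmetic on the numbers printed above.

```python
# Counts copied from the output above.
n_violation = 12450
n_no_violation = 2087

total = n_violation + n_no_violation
share_violation = n_violation / total

# A model that always answers "violation" would be right this often on the full dataset.
print(f'Total cases: {total}')
print(f'Share of violation cases: {share_violation:.3f}')
```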


---

###Split the dataset into training and test data

We only want to predict future cases, to imitate how we would predict future court decisions. In addition, we don't want to use newer cases to predict older cases: they may be related, and that would be cheating.


In [11]:
#split the data in train and test
train = facts_dataset[(facts_dataset['Date'] < '2017-01-01')]
test = facts_dataset[(facts_dataset['Date'] >= '2017-01-01')]
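
The split compares dates as text. This works under the assumption that every date is stored as a `YYYY-MM-DD` string, as in the output shown earlier, because that format sorts alphabetically in the same order as chronologically. The sketch below uses made-up dates to check that a text-based split and a split on real dates select the same rows.

```python
import pandas as pd

# Made-up dates in the same YYYY-MM-DD text format as the 'Date' column.
toy_dates = pd.DataFrame({'Date': ['2016-12-31', '2017-01-01', '2015-06-30', '2019-03-15']})

# Split on the text, as in the notebook.
train_text = toy_dates[toy_dates['Date'] < '2017-01-01']
test_text = toy_dates[toy_dates['Date'] >= '2017-01-01']

# Split on real dates, which does not depend on the text format.
parsed = pd.to_datetime(toy_dates['Date'])
cutoff = pd.Timestamp('2017-01-01')
train_parsed = toy_dates[parsed < cutoff]
test_parsed = toy_dates[parsed >= cutoff]

# Both approaches select the same rows for this format.
print(list(train_text.index) == list(train_parsed.index))
print(list(test_text.index) == list(test_parsed.index))
```

If the dates were stored in another format, such as `DD/MM/YYYY`, the text comparison would no longer be reliable and converting with `pd.to_datetime` first would be the safer choice.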

Print the number of cases in the training set that resulted in no violation of human rights:

In [12]:
train['Violation'].value_counts()[0]

1795

**AT HOME:** Try printing the number of cases that resulted in a violation.

In [13]:
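#undersampling: keep all no-violation cases and an equally large random sample of violation cases
train_v = train[train['Violation'] == 1]   #all training cases with a violation
train_nv = train[train['Violation'] == 0]  #all training cases without a violation
train_v = train_v.sample(n=len(train_nv), random_state=101)  #random sample, fixed seed so the result is reproducible
train = pd.concat([train_nv, train_v])     #combine both groups into a balanced training set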
train

Unnamed: 0,Date,Facts,Violation
0,1960-11-14,The purpose of the Commission's Request - to w...,0
1,1961-07-01,I\n1. The purpose of the Commission's request ...,0
2,1967-02-09,1. The object of the Commission’s request is ...,0
3,1968-06-27,1. The object of the Commission’s request is t...,0
6,1969-11-10,1. The Commission and the Government have refe...,0
...,...,...,...
3339,2006-06-08,A. Background to the case\n8. The applicant ...,1
5346,2008-09-18,4. The applicant was born in 1939 and lives i...,1
5196,2008-06-19,4. The applicant was born in 1958 and lives i...,1
4978,2008-03-18,"4. The applicants were born in 1955, 1935 and...",1
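
The same undersampling steps can be wrapped in a small function so that they can be reused for any table. This is only a sketch: `undersample` is a hypothetical helper name and the toy data is made up.

```python
import pandas as pd

def undersample(df, label_col, random_state=101):
    """Return rows of df such that every class has as many rows as the rarest class."""
    smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Made-up data: six violation cases and two no-violation cases.
toy_cases = pd.DataFrame({'Facts': list('abcdefgh'),
                          'Violation': [1, 1, 1, 1, 1, 1, 0, 0]})
toy_balanced = undersample(toy_cases, 'Violation')
print(toy_balanced['Violation'].value_counts().to_dict())
```

After undersampling, both classes have two rows each, so a model cannot score well simply by always predicting the more common class.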


In [14]:
#undersample the test set in the same way as the training set
test_v = test[test['Violation'] == 1]
test_nv = test[test['Violation'] == 0]
test_v = test_v.sample(n=len(test_nv), random_state=101)
test = pd.concat([test_nv, test_v])
test

Unnamed: 0,Date,Facts,Violation
11656,2017-01-10,5. The applicant was born in 1971 and lives i...,0
11657,2017-01-10,A. The background to the case\n6. The applic...,0
11658,2017-01-10,6. The applicant was born in 1980 in Tetovo a...,0
11668,2017-01-12,A. Background\n5. The applicants are the wif...,0
11669,2017-01-12,"6. The applicant was born in 1970 in Rafah, G...",0
...,...,...,...
12834,2018-05-15,5. The applicant was born in 1984 and lives i...,1
13416,2019-02-05,6. The applicants are five Russian nationals....,1
12826,2018-05-15,5. The applicant was born in 1985 and lives i...,1
13675,2019-06-27,3. The list of applicants and the relevant de...,1


###Input and labels

Now let's create a list of all inputs (facts) for the training set, `Xtrain_words`, and a list of labels, `Ytrain`.

Let's do the same for the test set: input `Xtest` and labels `Ytest`.



In [21]:
Xtrain_words = train['Facts']
Ytrain = train['Violation']
Xtest = test['Facts']
Ytest = test['Violation']

Xtest_text = list(test['Facts']) #AT HOME: what is this for?

In [22]:
print('Number of cases in the training set:', len(Xtrain_words))
print('Number of cases in the test set:', len(Xtest))

Number of cases in the training set: 3590
Number of cases in the test set: 584


###Create feature vectors

**AT HOME:** Try adding more parameters to `TfidfVectorizer` to see if it changes the results. You can find the parameters that you can change here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

To change the parameters, rewrite the line that contains `TfidfVectorizer()`, for instance:

`vectorizer = TfidfVectorizer(ngram_range=(1,2))`

**Report what your best result is in class.**


In [23]:
#encode features into vectors: tf-idf vectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,4))
Xtrain = vectorizer.fit_transform(Xtrain_words)
print('Number of features:', len(vectorizer.get_feature_names_out()))
vectorizer.get_feature_names_out()

Number of features: 6913791


array(['00', '00 0010', '00 0010 03', ..., 'საფასური which',
       'საფასური which appeared', 'საფასური which appeared in'],
      dtype=object)

In [24]:
#transform test set into vectors
Xtest = vectorizer.transform(Xtest)
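
The test set is passed to `transform` rather than `fit_transform`, so it is encoded with the vocabulary learned from the training set. A minimal sketch on made-up sentences (all `toy_` names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ['the applicant was detained', 'the court found a violation']
toy_test = ['the applicant was extradited']  # 'extradited' never occurs in toy_train

toy_vectorizer = TfidfVectorizer()
toy_Xtrain = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary and encodes the texts
toy_Xtest = toy_vectorizer.transform(toy_test)        # only encodes, the vocabulary stays unchanged

# Both matrices have one column per feature seen during fitting.
print(toy_Xtrain.shape, toy_Xtest.shape)
print('extradited' in toy_vectorizer.vocabulary_)
```

Words that appear only in the test set are ignored, which keeps the number of columns identical for training and test data.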

## Machine learning

**You can read about all the algorithms below in the Machine Learning for Legal Text Classification reading. Read everything up to the evaluation section.**

###Dummy classifier

In [None]:
#Dummy classifier
dummy_clf = DummyClassifier(strategy="uniform")
dummy_clf.fit(Xtrain, Ytrain)

#### Making a prediction

In [None]:
#make a prediction
Ypredict = dummy_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)
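
A dummy classifier never looks at the input features, which makes it a useful point of comparison for the real models. The sketch below uses made-up, balanced labels and features that are all zeros. The `most_frequent` strategy is shown only for comparison and is not used in the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up, balanced labels; the features are all zeros and carry no information.
toy_X = np.zeros((100, 3))
toy_y = np.array([0, 1] * 50)

dummy_scores = {}
for strategy in ['most_frequent', 'uniform']:
    toy_dummy = DummyClassifier(strategy=strategy, random_state=0)
    toy_dummy.fit(toy_X, toy_y)
    dummy_scores[strategy] = accuracy_score(toy_y, toy_dummy.predict(toy_X))
    print(strategy, dummy_scores[strategy])
```

On balanced data, always predicting one class is right exactly half of the time, and random guessing with `uniform` lands close to that.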

In [None]:
print('Prediction:', Ypredict[22], '\nGold label:', list(Ytest)[22], '\n\nFacts of the case:\n', (Xtest_text[22]))


Prediction: 0 
Gold label: 0 

Facts of the case:
 6.  The applicant is a lawyer who also writes articles for various Russian law journals and online legal information databases and networks.
7.  According to the applicant, his work usually requires extensive scientific research, including in the field of law enforcement in the Khabarovsk Region. He supported his assertion with copies of contracts with well-known Russian publishing houses and owners of a number of legal magazines, including one supervised by the Secretariat of the President of the Russian Federation. Under the contracts he undertook the task of writing articles on specific topics of legal and social interest.
8.  Having received an assignment to write an article on prostitution and the fight against it in the Khabarovsk Region, on 12 May 2009 the applicant wrote to the head of the Khabarovsk Region police department by registered letter, asking for statistical data for his research. The relevant parts read:
“[I am] int

### Naive Bayes

In [None]:
#Naive Bayes classifier
nb_clf = MultinomialNB()
nb_clf.fit(Xtrain, Ytrain)
Ypredict = nb_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

0.7294520547945206

**AT HOME**: Why does this Naive Bayes classifier produce such a high score? You don't need to know how the model works to answer this question.



___

**AT HOME:** Run the code below to see which machine learning algorithm works best for your data. Try changing the parameters of the algorithms to get higher scores.
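
One possible way to keep track of the scores is to train several classifiers in a loop and collect their accuracies in a dictionary. The sketch below runs on a made-up mini dataset so that it works on its own. All `toy_` names, the example texts and the `results` dictionary are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up mini dataset, only to make the sketch runnable on its own.
toy_train_texts = ['detained without trial', 'held without trial',
                   'fair hearing given', 'hearing was fair']
toy_train_labels = [1, 1, 0, 0]
toy_test_texts = ['detained without a hearing', 'a fair trial']
toy_test_labels = [1, 0]

toy_vectorizer = TfidfVectorizer()
toy_Xtrain = toy_vectorizer.fit_transform(toy_train_texts)
toy_Xtest = toy_vectorizer.transform(toy_test_texts)

# Train every candidate on the same vectors and collect the accuracies.
candidates = {'Naive Bayes': MultinomialNB(), 'LinearSVC': LinearSVC()}
results = {}
for name, clf in candidates.items():
    clf.fit(toy_Xtrain, toy_train_labels)
    results[name] = accuracy_score(toy_test_labels, clf.predict(toy_Xtest))

# Print the classifiers from best to worst.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

In the notebook you would pass `Xtrain`, `Ytrain`, `Xtest` and `Ytest` instead of the toy variables, and add any other classifier you want to compare to `candidates`.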

### LinearSVC

If you want to change the parameters of the classifier, you can find the ones for LinearSVC here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

To change parameters, add them inside the brackets in `LinearSVC()`.

For instance, `LinearSVC(C=10)`.

In [25]:
#Fast version of Linear SVC
from sklearn.svm import LinearSVC
svm_clf = LinearSVC()
svm_clf.fit(Xtrain, Ytrain)
Ypredict = svm_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

0.7328767123287672

### KNN

You can find the parameters for the KNN classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.kneighbors

**AT HOME:** Try changing the number of neighbors in `KNeighborsClassifier()`.

For instance, `KNeighborsClassifier(n_neighbors=10)`

In [None]:
#KNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=10)
knn_clf.fit(Xtrain, Ytrain)
Ypredict = knn_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

In [None]:
#Logistic regression classifier
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
lr_clf.fit(Xtrain, Ytrain)
Ypredict = lr_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

We haven't discussed the algorithm below, but feel free to explore it yourself:

In [None]:
#Support vector machines classifier (with multiple kernels)
from sklearn.svm import SVC
svm_clf = SVC(kernel='linear')
svm_clf.fit(Xtrain, Ytrain)
Ypredict = svm_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

0.7243150684931506

### Neural Networks - Multilayer Perceptron

Check the other parameters you can try with the neural network (multilayer perceptron): https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [None]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='sgd', max_iter=1000, verbose=False)
mlp.fit(Xtrain, Ytrain) #this will run for a looong time (reduce max_iter to make it run faster, but it will probably impact the performance)
Ypredict = mlp.predict(Xtest)
accuracy_score(Ytest, Ypredict)