<a href="https://colab.research.google.com/github/masha-medvedeva/AI_and_Digital_Skills/blob/main/predicting_court_decisions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


#Predicting court decisions of the European Court of Human Rights with machine learning and Google Colab



##Load libraries
First we need to load all the necessary libraries.
We can add more imports here as we need more functions along the way.

In [3]:
import pandas as pd #pandas library, a way to work with tables full of data
from sklearn.feature_extraction.text import TfidfVectorizer #vectorizer converting documents into tf-idf representation
from sklearn.dummy import DummyClassifier #Dummy Classifier, predicts something?
from sklearn.naive_bayes import MultinomialNB #Naive Bayes classifier
from sklearn.metrics import accuracy_score #accuracy score

##Using Google Colab

If you are using Colab, you will need to mount your Google Drive so that the notebook can read the data stored in a Google Drive folder.



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

##Dataset

###Load data

Now let's open the file with the data. The file that we are going to be using is in .csv format. This is a table format. You can save and open .csv files with Excel. 

More about .csv on Wikipedia: https://en.wikipedia.org/wiki/Comma-separated_values

The data is scraped from https://hudoc.echr.coe.int/.

**AT HOME:** I recommend looking at the actual files on this ECtHR website to see what kind of data is available and in what format.

In [4]:
#import csv with data
dataset = pd.read_csv('/content/gdrive/MyDrive/AI and Digital Skills/data/Judgements_v_1.0.csv')

###Inspect data

First, let's see what is available in our dataset. To do so, let's print the first five rows using `.head()`.

To see 10 rows, use `.head(10)`, and so on.

Note that the first row is numbered 0, not 1. This is the standard when programming in Python. You will get used to it.


In [None]:
dataset.head()

The rows are too long, so we can't see the names of all the columns. Instead, let's print all the information available for one of the cases. Let's use the case in row number 2 for that (item ID 001-57518, CASE OF LAWLESS v. IRELAND).

In [6]:
dataset.loc[2,]

itemid                                                     001-57518
docname                           CASE OF LAWLESS v. IRELAND (No. 3)
appno                                                         332/57
conclusion         Questions of procedure rejected;No violation o...
importance                                                         2
originatingbody                                                    9
kpdate                                                    1961-07-01
extractedappno                                         332/57;250/57
doctypebranch                                                CHAMBER
respondent                                                       IRL
article                              17;5;5-1-b;5-1-c;5-3;6;7;7-1;15
url                https://hudoc.echr.coe.int/app/conversion/docx...
text               COURT (CHAMBER)\n \n \n \n \n \n \nCASE OF LAW...
procedure          1. The present case was referred to the Court ...
procedure_facts                   

What types of variables do you see?

**AT HOME:** Do they all make sense to you?



Now we can see all the column names, but not the full text within each cell. To inspect the full text, we can specify the column we want to look at:

In [None]:
dataset.loc[2,'facts']

To look at the full url you can use: `dataset.loc[2,'url']`

For the verdict: `dataset.loc[2,'violation']`, etc.

###Select data

We do not need all the data in the table to predict court decisions. To make the table easier to navigate, let's create a new one with only the facts of the cases, which we are going to use as input, the date of the judgement, and the verdict, which will be the label.

**AT HOME**: Would you want to choose other columns? Which ones, and why?

Not all rows in the table have all the information (i.e., facts, date, verdict). There are many ways to deal with this. Today, we are just going to drop the rows that don't have all of the variables.

In [None]:
facts_dataset = (pd.DataFrame()
                .assign(Date=dataset['kpdate'],           #keep the date column, so we can sort the data based on when the judgement was made
                        Facts=dataset['facts'],           #keep the 'facts' column - this will be our main input
                        Violation=dataset['violation'])   #keep the 'violation' column - this will be our label
                .dropna()                                 #drop every row where at least one element is missing
                .reset_index(drop=True))                  #reset the index of the table after removing the rows

Note that we renamed the columns. They are now called `'Date'`, `'Facts'` and `'Violation'`.
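
To see what `.dropna()` and `.reset_index(drop=True)` do, here is a minimal sketch on a tiny made-up table. The table `toy` and its values are invented for illustration and are not taken from the ECtHR data.

```python
import pandas as pd

# Toy table with made-up values (not from the ECtHR dataset)
toy = pd.DataFrame({
    'Date': [None, '2001-05-03', '2010-11-20', '2018-02-14'],
    'Facts': ['facts A', 'facts B', None, 'facts D'],
    'Violation': [1, 0, 1, 1],
})

# dropna() removes every row that has at least one missing value
toy_dropped = toy.dropna()
print('Index after dropna:', toy_dropped.index.tolist())

# reset_index(drop=True) renumbers the remaining rows starting from 0
toy_clean = toy_dropped.reset_index(drop=True)
print('Index after reset_index:', toy_clean.index.tolist())
print('Rows before:', len(toy), 'rows after:', len(toy_clean))
```

Without the reset, the remaining rows would keep their old row numbers, which makes it harder to select a case by its position later.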

Let's see what the dataset with just the selected columns looks like. How many cases are in it now?

In [None]:
facts_dataset


Let's look at our data more carefully. 

**AT HOME**: Experiment with printing the facts of different cases and their verdicts. Could you guess the court's decision based on the facts?

In [None]:
#print some data
row_number = 33 #change this number to try other rows
print('Facts of the case:\n', facts_dataset.loc[row_number, 'Facts'])
print('\n\nVerdict:', 'violation' if facts_dataset.loc[row_number, 'Violation'] == 1 else 'no violation')

###Data info

In [None]:
#print some numbers
print('Number of cases with violation:',
      facts_dataset['Violation'].value_counts()[1])

print('Number of cases without violation:',
      facts_dataset['Violation'].value_counts()[0])
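
`value_counts()` returns a table indexed by the label values themselves, so `[1]` looks up the label 1 rather than the second position. The sketch below shows this on a made-up list of labels (the name `toy_labels` and its values are assumptions for illustration), and also shows `normalize=True`, which gives shares instead of counts.

```python
import pandas as pd

# Made-up labels for illustration: 1 = violation, 0 = no violation
toy_labels = pd.Series([1, 1, 0, 1, 1, 0, 1, 1])

# Counts per label; .loc makes it explicit that we look up by label value
counts = toy_labels.value_counts()
print('Label 1:', counts.loc[1])
print('Label 0:', counts.loc[0])

# normalize=True returns the share of each label instead of the count
shares = toy_labels.value_counts(normalize=True)
print('Share of label 1:', shares.loc[1])
```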

---

###Split the dataset into training and test data

We want to predict only future cases, to imitate how we would predict future court decisions. We also don't want to use newer cases to predict older ones: the cases may be related, and that would be cheating. So we train on cases decided before 2017 and test on cases decided from 2017 onwards.


In [None]:
#split the data in train and test
train = facts_dataset[(facts_dataset['Date'] < '2017-01-01')]
test = facts_dataset[(facts_dataset['Date'] >= '2017-01-01')]
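
Comparing dates as text works here because the dates are written as year-month-day. As a hedged alternative, the sketch below converts a made-up `Date` column to real dates with `pd.to_datetime` before splitting, and then checks that no training case is newer than a test case. The table `toy` and the names `toy_train` and `toy_test` are invented for illustration.

```python
import pandas as pd

# Made-up cases for illustration (not from the ECtHR dataset)
toy = pd.DataFrame({
    'Date': ['2015-06-30', '2016-12-31', '2017-01-01', '2019-03-12'],
    'Violation': [1, 0, 1, 1],
})

# Convert text to real dates so the comparison does not depend on the text format
toy['Date'] = pd.to_datetime(toy['Date'])
cutoff = pd.Timestamp('2017-01-01')

toy_train = toy[toy['Date'] < cutoff]    # cases decided before the cutoff
toy_test = toy[toy['Date'] >= cutoff]    # cases decided on or after the cutoff

# Sanity check: the newest training case is older than the oldest test case
print(toy_train['Date'].max() < toy_test['Date'].min())
```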

Print the number of cases in the training set that resulted in a violation of human rights:

In [None]:
train['Violation'].value_counts()[1]

**AT HOME:** Try printing the number of cases that resulted in no violation.

###Input and labels

Now let's create a list of all inputs (facts) for the training set `Xtrain`, and a list of labels `Ytrain`.

Let's do the same for the test set: input `Xtest` and labels `Ytest`.



In [None]:
Xtrain = train['Facts']
Ytrain = train['Violation']
Xtest = test['Facts']
Ytest = test['Violation']

Xtest_text = list(test['Facts']) #AT HOME: what is this for?

In [None]:
print('Number of cases in the training set:', len(Xtrain))
print('Number of cases in the test set:', len(Xtest))

###Create feature vectors

In [None]:
#encode features into vectors: tf-idf vectorizer
vectorizer = TfidfVectorizer()
Xtrain = vectorizer.fit_transform(Xtrain)
print('Number of features:', len(vectorizer.get_feature_names_out()))
vectorizer.get_feature_names_out()

In [None]:
#transform test set into vectors
Xtest = vectorizer.transform(Xtest)

## Machine learning

###Dummy classifier

In [None]:
#Dummy classifier
dummy_clf = DummyClassifier(strategy="uniform")
dummy_clf.fit(Xtrain, Ytrain)

#### Making a prediction

In [None]:
#make a prediction
Ypredict = dummy_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

0.47901491501907734
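
With `strategy="uniform"`, the dummy classifier ignores the input and guesses each class with equal probability, so with two classes its accuracy should land close to 0.5. The sketch below reproduces this on made-up labels. The names `toy_X`, `toy_y` and `toy_dummy` are invented for illustration, and `random_state=0` is only there to make the guesses repeatable.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 1000 cases, half of them with label 1
toy_y = np.array([1] * 500 + [0] * 500)
# The dummy classifier ignores the input, so placeholder features are enough
toy_X = np.zeros((len(toy_y), 1))

toy_dummy = DummyClassifier(strategy='uniform', random_state=0)
toy_dummy.fit(toy_X, toy_y)
toy_pred = toy_dummy.predict(toy_X)

# Random guessing between two classes is right about half of the time
toy_accuracy = accuracy_score(toy_y, toy_pred)
print(toy_accuracy)
```

A score like this gives us a baseline: a real model is only useful if it does clearly better than guessing.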

In [None]:
print('Prediction:', Ypredict[22], '\nGold label:', list(Ytest)[22], '\n\nFacts of the case:\n', (Xtest_text[22]))


Prediction: 1 
Gold label: 1 

Facts of the case:
 3.  The list of applicants and the relevant details of the applications are set out in the appended table.
4.  The applicants complained of the excessive length of criminal proceedings and of the lack of any effective remedy in domestic law. Some applicants also raised other complaints under the provisions of the Convention.


### Naive Bayes

In [None]:
#Naive Bayes classifier
nb_clf = MultinomialNB()
nb_clf.fit(Xtrain, Ytrain)
Ypredict = nb_clf.predict(Xtest)
accuracy_score(Ytest, Ypredict)

**AT HOME**: Why does this Naive Bayes classifier produce such a high score? You don't need to know how the model works to answer this question.



___