# Assignment 2: Advanced Legal Analytics (LAW 3027)


### Note: This is an individual assignment and you should provide only your own answer to the questions.

####  Assignment Instructions:

##### 1. Write the Python code to complete the following tasks. (Use markdown cells to answer the questions that require text or an interpretation of results.)

##### 2. You should import the relevant Python libraries needed for the various tasks.

##### 3. All the Python libraries you may need have been covered during the course.

##### 4. Comment your code as much as possible. Mention the question number in the markdown cell before you write the code in the code cell. 

##### 5. Feel free to use the same notebook and mention your answers/code below each question. 

##### 6. NOTE ABOUT FIGURES: For all the images to be displayed in your local Jupyter notebook, make sure that you also download the folder called `figs` and save it in the same folder as this notebook. The `figs` folder can be downloaded from here: https://github.com/maastrichtlawtech/law3027-advanced-legal-analytics/tree/main/assignments 

##### 7. You are not allowed to distribute or share the assignment with anyone. The assignment is available on the GitHub page of the course only to ensure easy readability of the images in the last 2 questions. 


## Q1.  Classification of Semantic Norms from GDPR (3.5 points)
Robaldo et al. [1] developed a knowledge base that formalizes semantic norms such as permissions, obligations and constitutive rules from the General Data Protection Regulation (GDPR) in LegalRuleML, an XML formalism designed to represent the logical content of legal documents. We have programmatically extracted a subset of this knowledge base for the purpose of this exercise. 

We will just focus on two categories of legal norms: obligations and permissions. 

- Obligations: An obligation indicates what someone is obliged to do. For example, the following fragment of the Italian privacy law taken from [2]: "A controller intending to process personal data falling within the scope of application of this Act shall have to notify the "Garante" ... "

- Permissions: A permission indicates that someone is not prohibited from doing something. For example, "Member States may adopt or maintain additional pre-contractual information requirements for contracts to which this Article applies."

The extracted dataset is available here: https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/gdpr_provision_classification_assignment.csv 


The dataset contains some articles and their corresponding paragraphs from the GDPR. The `text` column contains the text of that particular paragraph. The `labels` refer to the type of the semantic norm present in the given `text`.


- **0 means that the text is a permission**
- **1 means that the text is an obligation**


In this exercise, you will train a machine learning classifier to automatically classify/annotate some of the GDPR provisions. Semantic annotation is the process of augmenting a text with labels expressing its semantic content (semantic norm in this case). Enriching legal texts with semantic tags can help in legal information retrieval.


[1] Robaldo, L., Bartolini, C., & Lenzini, G. (2020, May). The DAPRECO knowledge base: representing the GDPR in LegalRuleML. In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 5688-5697).

[2] Biagioli, C., Francesconi, E., Passerini, A., Montemagni, S., & Soria, C. (2005, June). Automatic semantics extraction in law documents. In Proceedings of the 10th international conference on Artificial intelligence and law (pp. 133-140).


Specific tasks you need to perform to complete this question:

- Read the dataset with pandas into a DataFrame called `df`. 


- Plot the number of class labels in the dataset: The count of the column `labels` to illustrate how many permissions and obligations are in the dataset.


- Clean the `text` column of the DataFrame: Remove the punctuation, lowercase the text, remove the newlines (if any) and remove all numbers. After these operations, print the `text` column as a list to check whether the data cleaning has been performed correctly. Use `df['text'].values.tolist()` for this.


- Select the variables of interest (predictor and target variables) into a new DataFrame called `df_selection`. 

    - Predictor Variable (or Feature) - One or more variables that are used to predict the 'Target Variable'.

    - Target Variable - The variable that needs to be predicted.
    - The feature in this case is `text`. The target variable is `labels`.


- Convert the `text` to TF-IDF. Use NLTK's default stopwords for English as an input argument for the `TfidfVectorizer`.


- Split the data into train and test. Set the percentage of training set as 90% and testing set as 10%.


- Train a K-nearest neighbours (KNN) classifier with K=3 and then evaluate it on the testing set. Print the evaluation metrics (confusion matrix, precision, recall, accuracy, F1-score and classification report). 


- Record the accuracy of the KNN classifier for different values of K (ranging from 1 to 10). Make a line plot (value of K on the X-axis and the accuracy on the Y-axis). Which value of K gives the best accuracy?
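
The cleaning, TF-IDF, split and KNN steps listed above can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions, not a model answer: the sentences in `toy_df` are invented, the helper `clean_text` is a hypothetical name, `stand_in_stopwords` is a placeholder for NLTK's English stop-word list (which needs a one-time `nltk.download('stopwords')`), and `random_state=0` is an arbitrary choice made for reproducibility.

```python
import re
import string
from itertools import product

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus standing in for the assignment CSV (not GDPR text).
actors = ["controller", "processor", "authority", "Member State", "data subject"]
actions = ["notify the board", "keep 2 records", "publish a report", "request access"]
rows = []
for actor, action in product(actors, actions):
    rows.append({"text": f"The {actor} may {action}.", "labels": 0})      # permission
    rows.append({"text": f"The {actor} shall {action}.\n", "labels": 1})  # obligation
toy_df = pd.DataFrame(rows)


def clean_text(text: str) -> str:
    """Lowercase, then drop newlines, digits and punctuation."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


toy_df["text"] = toy_df["text"].apply(clean_text)

# Stand-in stop-word list; the assignment asks for NLTK's English list instead.
stand_in_stopwords = ["the", "an", "of", "to"]
vectorizer = TfidfVectorizer(stop_words=stand_in_stopwords)
# Fitted on the full corpus to mirror the task order; fitting on the
# training split only would avoid leaking test vocabulary into the features.
X = vectorizer.fit_transform(toy_df["text"])
y = toy_df["labels"]

# 90% train / 10% test; random_state is fixed only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Fit one KNN model per K and record its test accuracy.
accuracies = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))

best_k = max(accuracies, key=accuracies.get)  # first K with the highest accuracy
print(accuracies, best_k)
```

The `accuracies` dictionary maps each K to its test accuracy, so its keys and values can be passed directly to a line-plot function. With only a handful of test rows, as here, the accuracy moves in coarse steps, so several values of K may tie.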


## Q2. Correlation & Regression Analysis on a Crime Dataset (3.5 points)

#### Dataset: 

We have collected a subset of the crime dataset from UCI Machine Learning Repository. For detailed information about the variables in the dataset you can refer to the link: http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized 

The dataset is available here: https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/crime_dataset_assignment.csv

#### Target Variable

The target variable of interest is `ViolentCrimesPerPop`. It refers to the total number of violent crimes per 100K population (numeric - decimal).

#### Sub-Tasks:

Specific tasks you need to perform to complete this question:

- Load the crime dataset into a pandas DataFrame called `df_crime`


- Explore the `df_crime` dataset and find out which `state` has the maximum and which has the minimum `ViolentCrimesPerPop`. (Hint: you can use either a numerical or a visual data representation to find the answer.)


- Compute a correlation matrix and a heatmap for the `df_crime` dataframe.


- Programmatically identify the top 5 variables (features) most correlated with `ViolentCrimesPerPop`. The code should print the correlation values between `ViolentCrimesPerPop` and these 5 features. Further, compute the correlation matrix of `ViolentCrimesPerPop` with the 5 most correlated features. You can always refer here to see the meaning of each variable: http://archive.ics.uci.edu/ml/datasets/communities+and+crime+unnormalized  


- Consider the most correlated variable with `ViolentCrimesPerPop` as the independent variable and `ViolentCrimesPerPop` as the dependent/target variable. Perform a linear regression analysis to predict the `ViolentCrimesPerPop`  using the most correlated variable. 
   
   - Split the dataset as 90% training and 10% test set
   - Compute the Mean squared error and Coefficient of Determination on the test set
   - Compute the Slope & Intercept
   - Plot the predicted values for the test set using the Linear Regression Model. Also plot the actual test data.
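
The correlation ranking and single-feature regression described above can be sketched on synthetic data. This is a hedged illustration, not a model answer: `toy_crime` and the column names `feat_0` to `feat_6` are invented stand-ins for `df_crime`, features are ranked by absolute correlation (an assumption, since "most correlated" could also be read as largest signed value), and `random_state=0` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

target = "ViolentCrimesPerPop"

# Synthetic stand-in for df_crime; feat_0..feat_6 are hypothetical column names.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
data = {"feat_0": base}
for i in range(1, 7):
    data[f"feat_{i}"] = base + rng.normal(scale=i, size=n)  # noisier as i grows
data[target] = 2.0 * base + 1.0 + rng.normal(scale=0.3, size=n)
toy_crime = pd.DataFrame(data)

# Correlation matrix over numeric columns only (a text column such as
# `state` would otherwise have to be excluded by hand).
corr = toy_crime.select_dtypes("number").corr()

# Rank features by absolute correlation with the target, excluding the target itself.
ranked = corr[target].drop(target).abs().sort_values(ascending=False)
top5 = ranked.head(5).index.tolist()
print(corr.loc[top5, target])                      # signed correlation values
print(corr.loc[top5 + [target], top5 + [target]])  # 6 x 6 sub-matrix

# Simple linear regression on the single most correlated feature.
best = top5[0]
X = toy_crime[[best]]  # 2-D input, as scikit-learn expects
y = toy_crime[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # mean squared error on the test set
r2 = r2_score(y_test, y_pred)             # coefficient of determination
slope, intercept = model.coef_[0], model.intercept_
print(best, mse, r2, slope, intercept)
```

For the plot, `X_test[best]` against `y_test` gives the actual test points and `X_test[best]` against `y_pred` gives the fitted line.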





## Q3. Suitability of data for Correlation or Regression Analysis (1 point)

#### For each of the four scatter plots shown below, discuss whether the data in the figure is suitable for a correlation analysis or linear regression. (No more than 50 words for each option, A, B, C & D.) The answer covering all 4 options should therefore contain a maximum of 200 words.

- # A![A](figs/fig_r1.png)
- # B![B](figs/fig_r2.png)
- # C![C](figs/fig_r3.png)
- # D![D](figs/fig_r4.png)



## Q4. BLIND JUSTICE (2 points)


"In jurisdictions across the United States, prosecutors make highly consequential charging decisions using police incident reports or narratives that contain information about the race of the suspect. Recent studies have shown that there is reason for concern that the judgments made by the prosecutor may suffer from explicit or implicit racial bias." quoted and taken from [3]

Recent work [3][4] uses algorithms to automatically redact explicit mentions of race and other race-related information in police incident reports or narratives. The two images below present fictional examples from [3][4] of original narratives and their algorithmically redacted versions. You can read the papers for more details; however, the questions below do not necessarily require reading them:


- Q4.1: Let's say you are in charge of a BLIND JUSTICE project (in a hypothetical jurisdiction, assuming processing of personal data is permitted). The goal of this project is to develop an algorithm to mask the mentions of race and race-related information in police incident reports or narratives. You have recently finished studying the Advanced Legal Analytics course and you have some ideas on how to implement the system. Based on the various technologies you have learned during the course, which technology would you use to implement such a system? What are the advantages and disadvantages of using regular expressions over named entity recognition systems to accomplish the above masking task? (Word limit: 150 words)




- Q4.2: Compare the outputs A & B in the images below. Which output do you think does a better job of implementing fairness, and why? (Word limit: 75 words)


[3] https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report038.pdf

[4] https://5harad.com/papers/blind-charging.pdf

- # A![A](figs/blind_justice_1.png)
- # B![B](figs/blind_justice_2.png)
