# Legal Analytics (LAW3025) - Tutorial 2: 'Data-driven research design'

*Version*: 2023/2024

### 2.1. Understanding data-driven research design

Find an empirical/data-driven legal research article and review the research design by:

1. summarizing the theory underlying the research,
2. describing the main research question(s) and hypothesis(-es),
3. explaining the main finding(s) of the research,
4. evaluating the reliability and validity of their measure(s) and measurement procedure(s), and
5. determining whether the authors disclose any issues of privacy or ethics for their research, and, if not, evaluating that choice.

Form groups of 3 and prepare a presentation of at most 10 minutes covering the five points above.

A good place to start is to check these publications:

* The International Conference on Artificial Intelligence and Law (ICAIL)
* The International Conference on Legal Knowledge and Information Systems (JURIX)
* The International Workshop on Juris-informatics (JURISIN)
* Journal of Empirical Legal Studies

### 2.2. Python Fundamentals Jupyter Notebook

The following exercises will check your understanding of:

* What variables are in Python.
* How to perform basic calculations.
* How to write functions in Python.
* How to read data into a pandas dataframe.
* How to select columns and rows (subsets) in a pandas dataframe.


**Note that programming is learned differently from many other subjects. You do not learn it by reading books and expecting to know everything, but by putting it into practice. It is a highly iterative process. Even the most experienced programmers get stuck all the time and then consult the documentation of the libraries they use (e.g., `pandas`), turn to online communities like `StackOverflow`, or ask `ChatGPT`. When you use such resources, MAKE SURE THAT YOU UNDERSTAND WHY YOU DO WHAT YOU DO. For example, when using ChatGPT, ask it to explain its output to you. If you don't, you will learn very little from this course.**

## Variables and Strings

**1. What is the final value of `position` in the program below?** (Try to predict the value without running the Python code, then check your prediction.)

```python
initial = 'Legal'
position = initial
initial = 'Analytics'
```

'Legal'. The statement `position = initial` binds `position` to the string object that `initial` refers to at that moment. Reassigning `initial` afterwards binds that name to a new string and leaves `position` untouched.
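
A minimal sketch of this behaviour, using only built-in Python. The list part is an added contrast (the names `original` and `alias` are hypothetical, not from the exercise): rebinding a name does not affect other names, whereas changing a shared object in place is visible through every name that refers to it.

```python
initial = 'Legal'
position = initial        # both names now refer to the same string object
initial = 'Analytics'     # rebinds `initial` only; `position` is unchanged
print(position)           # Legal

# Hypothetical contrast: mutating a shared list is visible through both names.
original = ['Legal']
alias = original          # no copy is made
original.append('Analytics')
print(alias)              # ['Legal', 'Analytics']
```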

**2. If you create the variable `IDH` and assign it the value `0.0005`, what happens if you try to get its second character (i.e. `.`) via the index `IDH[1]`?**

I get an error: `IDH` is a float, not a string, so it cannot be indexed. If the value is stored as a string instead, e.g. `"= 0.0005"`, indexing works and `IDH[1]` returns the second character, a space.

In [11]:
IDH = 0.0005
IDH[1]

TypeError: 'float' object is not subscriptable

In [12]:
IDH = "= 0.0005"
IDH[1]

' '
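
If the goal is to reach the `.` character, one option is to convert the float to text first. This is a small sketch; the names `idh_text` and `second_char` are introduced here for illustration.

```python
IDH = 0.0005

# Converting the float to text makes it indexable.
idh_text = str(IDH)        # '0.0005'
second_char = idh_text[1]
print(second_char)         # .
```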

**3. What does the following code print?** Think about it before you run the code in a new Python cell. Then add a short explanation (< 50 words) in a text cell.

```python
city_name = 'Arequipa'
print('city_name[1:3] is', city_name[1:3])
```

It prints `city_name[1:3] is re`. The first argument is a string literal, so it is printed as written; the second is the slice itself. Indices start at 0 and the stop index is excluded, so `[1:3]` returns the second and third characters.

In [13]:
city_name = 'Arequipa'
print('city_name[1:3] is', city_name[1:3])

city_name[1:3] is re
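
A few more slices of the same string, as a sketch of how the start and stop indices behave. The extra slices are additions for illustration and not part of the exercise.

```python
city_name = 'Arequipa'

print(city_name[1:3])       # re    (indices 1 and 2; index 3 is excluded)
print(city_name[:3])        # Are   (a missing start means "from the beginning")
print(city_name[3:])        # quipa (a missing stop means "to the end")
print(city_name[-3:])       # ipa   (negative indices count from the end)
print(len(city_name[1:3]))  # 2, i.e. stop - start
```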


**4. Can you concatenate strings in Python?** Can you demonstrate how with an example in a Python cell?

Yes, by joining them with the `+` operator.

In [14]:
first_name = 'Lucas'
middle_name = 'Giovanni'
first_last_name = 'Uberti-Bona'
second_last_name = 'Marin'
first_name + ' ' + middle_name + ' ' + first_last_name + ' ' + second_last_name

'Lucas Giovanni Uberti-Bona Marin'
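
The `+` operator is not the only option. As a sketch, `str.join` and f-strings build the same result; the variable names `with_plus`, `with_join` and `with_fstring` are introduced here for illustration.

```python
first_name = 'Lucas'
middle_name = 'Giovanni'
first_last_name = 'Uberti-Bona'
second_last_name = 'Marin'

# Three ways to build the same full name.
with_plus = first_name + ' ' + middle_name + ' ' + first_last_name + ' ' + second_last_name
with_join = ' '.join([first_name, middle_name, first_last_name, second_last_name])
with_fstring = f'{first_name} {middle_name} {first_last_name} {second_last_name}'

print(with_plus == with_join == with_fstring)  # True
```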

## Functions

A function definition always follows this pattern:

In [15]:
def f(x):
    print(x)

+ **def**   tells Python that you are writing a function _definition_. The line ends with a colon.

+ **f**     is the name of the function, which is used to call it later on.

+ **x**     inside the parentheses is a parameter: a value passed into the function for it to use.

In [16]:
## Suppose we have the following string
x = "I am X"
## We can call our function f on the variable x
f(x)

I am X


- Begin the definition of a new function with def.
- Followed by the name of the function.
    - Must obey the same rules as variable names.
- Then parameters in parentheses.
    - Empty parentheses if the function doesn’t take any inputs.
- Then a colon.
- Then an indented block of code.

In [17]:
def print_greeting():
    print('Hello!')

- Defining a function does not run it; it is like assigning a value to a variable. You must call the function to execute the code it contains.

In [18]:
# call the previous function print_greeting
print_greeting()

Hello!


A function can have a return value.
The return value is whatever the function retrieves or calculates and hands back to the caller. For instance:

In [19]:
## Suppose we have a function that multiplies a given integer by 10
def multiplyten(x):
    x = x*10 # In this line, we are assigning a new value to x
    return x # Return is what the function will give back

In [20]:
# now let's call the function multiplyten
multiplyten(5)

50
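
The difference between printing and returning can be sketched as follows. `print_times_ten` is a hypothetical helper added for contrast; a function without a `return` statement hands back `None`, so its result cannot be used in further calculations.

```python
def print_times_ten(x):
    """Hypothetical helper: prints the result but returns nothing."""
    print(x * 10)

def multiplyten(x):
    return x * 10

printed = print_times_ten(5)   # prints 50, but the call evaluates to None
returned = multiplyten(5)      # nothing is printed; the value is handed back

print(printed)        # None
print(returned + 1)   # 51
```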

### EXERCISE

*Law firms in the Netherlands charge an hourly rate plus **x**% office costs and **y**% VAT. Write a function to compute the total cost of hiring a law firm for **n** hours. The hourly rate of a law firm in The Hague is € 200. On top of this, it charges 5% office costs and 21% VAT. Use your function to compute the total cost of hiring this law firm for 10 hours.*

In [31]:
def hourly_cost(n:int) -> float:
    cost = n*200*1.05*1.21
    return cost

In [32]:
hourly_cost(10)

2541.0
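
The solution above hard-codes the rate and both percentages. A more general sketch takes them as parameters, matching the **x**, **y** and **n** in the exercise. The function name `total_cost` and its parameter names are assumptions introduced here, and the sketch assumes, like the solution above, that VAT is charged on the amount including office costs.

```python
def total_cost(hours: float, hourly_rate: float, office_pct: float, vat_pct: float) -> float:
    """Total fee: hourly charge, plus office costs, plus VAT on that subtotal."""
    base = hours * hourly_rate
    with_office = base * (1 + office_pct / 100)
    return with_office * (1 + vat_pct / 100)

# The law firm in The Hague from the exercise.
cost = total_cost(hours=10, hourly_rate=200, office_pct=5, vat_pct=21)
print(round(cost, 2))  # 2541.0
```

Rounding before printing avoids showing tiny floating-point differences that can appear when percentages are converted to fractions.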

## Data Handling with Pandas

In [33]:
# Read the data into a dataframe using the url of the csv file
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/maastrichtlawtech/law3025-legal-analytics/main/data/state_crime.csv')
df

Unnamed: 0,State,Year,Data.Population,Data.Rates.Property.All,Data.Rates.Property.Burglary,Data.Rates.Property.Larceny,Data.Rates.Property.Motor,Data.Rates.Violent.All,Data.Rates.Violent.Assault,Data.Rates.Violent.Murder,...,Data.Rates.Violent.Robbery,Data.Totals.Property.All,Data.Totals.Property.Burglary,Data.Totals.Property.Larceny,Data.Totals.Property.Motor,Data.Totals.Violent.All,Data.Totals.Violent.Assault,Data.Totals.Violent.Murder,Data.Totals.Violent.Rape,Data.Totals.Violent.Robbery
0,Alabama,1960,3266740,1035.4,355.9,592.1,87.3,186.6,138.1,12.4,...,27.5,33823,11626,19344,2853,6097,4512,406,281,898
1,Alabama,1961,3302000,985.5,339.3,569.4,76.8,168.5,128.9,12.9,...,19.1,32541,11205,18801,2535,5564,4255,427,252,630
2,Alabama,1962,3358000,1067.0,349.1,634.5,83.4,157.3,119.0,9.4,...,22.5,35829,11722,21306,2801,5283,3995,316,218,754
3,Alabama,1963,3347000,1150.9,376.9,683.4,90.6,182.7,142.1,10.2,...,24.7,38521,12614,22874,3033,6115,4755,340,192,828
4,Alabama,1964,3407000,1358.7,466.6,784.1,108.0,213.1,163.0,9.3,...,29.1,46290,15898,26713,3679,7260,5555,316,397,992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3110,Wyoming,2015,586107,1902.6,300.6,1500.9,101.0,222.1,179.8,2.7,...,10.1,11151,1762,8797,592,1302,1054,16,173,59
3111,Wyoming,2016,585501,1957.3,302.5,1518.2,136.6,244.2,195.7,3.4,...,10.1,11460,1771,8889,800,1430,1146,20,205,59
3112,Wyoming,2017,579315,1830.4,275.0,1421.0,134.5,237.5,176.4,2.6,...,13.1,10604,1593,8232,779,1376,1022,15,263,76
3113,Wyoming,2018,577737,1785.1,264.0,1375.9,145.2,212.2,150.6,2.3,...,17.3,10313,1525,7949,839,1226,870,13,243,100


We can select a slice of the data with the `df.query()` method, for example all records for a specific year, by naming the column to filter on (`Year`) and the desired value (`2007`).

In [34]:
df_2007 = df.query('Year == 2007')
df_2007

Unnamed: 0,State,Year,Data.Population,Data.Rates.Property.All,Data.Rates.Property.Burglary,Data.Rates.Property.Larceny,Data.Rates.Property.Motor,Data.Rates.Violent.All,Data.Rates.Violent.Assault,Data.Rates.Violent.Murder,...,Data.Rates.Violent.Robbery,Data.Totals.Property.All,Data.Totals.Property.Burglary,Data.Totals.Property.Larceny,Data.Totals.Property.Motor,Data.Totals.Violent.All,Data.Totals.Violent.Assault,Data.Totals.Violent.Murder,Data.Totals.Violent.Rape,Data.Totals.Violent.Robbery
47,Alabama,2007,4627851,3977.7,980.6,2689.5,307.7,448.9,246.7,8.9,...,159.9,184082,45379,124465,14238,20775,11417,412,1548,7398
107,Alaska,2007,683478,3379.2,546.3,2476.9,356.0,661.3,490.3,6.3,...,85.0,23096,3734,16929,2433,4520,3351,43,545,581
167,Arizona,2007,6338755,4532.6,946.4,2793.5,792.6,518.0,318.2,8.6,...,154.0,287308,59988,177076,50244,32835,20170,548,2353,9764
227,Arkansas,2007,2834797,3955.5,1130.1,2578.5,246.9,537.1,374.9,7.0,...,109.5,112130,32035,73096,6999,15226,10629,198,1294,3105
287,California,2007,36553215,3043.5,650.7,1790.6,602.2,524.1,299.7,6.2,...,193.4,1112510,237850,654526,220134,191561,109547,2262,9046,70706
347,Colorado,2007,4861515,2999.2,589.0,2067.6,342.6,351.8,234.7,3.2,...,71.2,145808,28633,100519,16656,17101,11410,155,2075,3461
407,Connecticut,2007,3502309,2470.6,446.6,1739.9,284.1,301.1,155.4,3.2,...,122.9,86528,15640,60937,9951,10547,5441,113,690,4303
467,Delaware,2007,864764,3378.5,742.4,2365.3,270.8,705.4,457.6,4.5,...,203.9,29216,6420,20454,2342,6100,3957,39,341,1763
527,District of Columbia,2007,588292,4916.3,667.4,2956.0,1292.9,1415.1,626.7,30.8,...,725.0,28922,3926,17390,7606,8325,3687,181,192,4265
587,Florida,2007,18251243,4088.8,996.3,2689.0,403.4,722.6,473.2,6.6,...,209.1,746249,181836,490783,73630,131878,86372,1202,6149,38155


Similarly, we can find various crime records for the state of Wisconsin for the year 2007 by doing the following operation.

In [35]:
df_2007_wisconsin = df_2007.query('State == "Wisconsin" ')
df_2007_wisconsin

Unnamed: 0,State,Year,Data.Population,Data.Rates.Property.All,Data.Rates.Property.Burglary,Data.Rates.Property.Larceny,Data.Rates.Property.Motor,Data.Rates.Violent.All,Data.Rates.Violent.Assault,Data.Rates.Violent.Murder,...,Data.Rates.Violent.Robbery,Data.Totals.Property.All,Data.Totals.Property.Burglary,Data.Totals.Property.Larceny,Data.Totals.Property.Motor,Data.Totals.Violent.All,Data.Totals.Violent.Assault,Data.Totals.Violent.Murder,Data.Totals.Violent.Rape,Data.Totals.Violent.Robbery
3042,Wisconsin,2007,5601640,2843.9,497.9,2105.8,240.3,291.5,168.5,3.3,...,97.8,159305,27890,117957,13458,16330,9438,185,1227,5480
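
The two chained queries can also be written as a single selection. The sketch below uses a small frame named `sample` (a name introduced here), with a few values copied by hand from the outputs above, so that it runs without downloading the CSV. It shows one combined `query`, the equivalent boolean mask, and `.loc` for picking rows and a column together.

```python
import pandas as pd

# Small excerpt copied by hand from the outputs above (three columns only).
sample = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Wisconsin', 'Wyoming'],
    'Year': [2007, 2007, 2007, 2018],
    'Data.Totals.Violent.Robbery': [7398, 581, 5480, 100],
})

# One query with both conditions, instead of two chained queries.
via_query = sample.query('Year == 2007 and State == "Wisconsin"')

# The same selection with a boolean mask.
mask = (sample['Year'] == 2007) & (sample['State'] == 'Wisconsin')
via_mask = sample[mask]

print(via_query.equals(via_mask))  # True

# .loc selects rows and a column in one step.
print(sample.loc[mask, 'Data.Totals.Violent.Robbery'].tolist())  # [5480]
```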


Find all the unique values in the `State` column of the dataframe. Do you notice something strange? Select only the entries from the dataframe which are not states.

Hint: United States and District of Columbia are not states.

In [36]:
df['State'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'United States', 'Utah', 'Vermont', 'Virginia',
       'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype=object)

How would you count the number of unique values in the `'State'` column?

In [16]:
df['State'].nunique()

52

Let's select the 50 states using the `query()` method and store them in a new dataframe, `df_50_States`.
Using the `&` (and) operator, we keep only the rows whose `State` is neither 'United States' nor 'District of Columbia'.

In [17]:
df_50_States = df.query("State!='United States' & State!='District of Columbia'")
df_50_States

Unnamed: 0,State,Year,Data.Population,Data.Rates.Property.All,Data.Rates.Property.Burglary,Data.Rates.Property.Larceny,Data.Rates.Property.Motor,Data.Rates.Violent.All,Data.Rates.Violent.Assault,Data.Rates.Violent.Murder,Data.Rates.Violent.Rape,Data.Rates.Violent.Robbery,Data.Totals.Property.All,Data.Totals.Property.Burglary,Data.Totals.Property.Larceny,Data.Totals.Property.Motor,Data.Totals.Violent.All,Data.Totals.Violent.Assault,Data.Totals.Violent.Murder,Data.Totals.Violent.Rape,Data.Totals.Violent.Robbery
0,Alabama,1960,3266740,1035.4,355.9,592.1,87.3,186.6,138.1,12.4,8.6,27.5,33823,11626,19344,2853,6097,4512,406,281,898
1,Alabama,1961,3302000,985.5,339.3,569.4,76.8,168.5,128.9,12.9,7.6,19.1,32541,11205,18801,2535,5564,4255,427,252,630
2,Alabama,1962,3358000,1067.0,349.1,634.5,83.4,157.3,119.0,9.4,6.5,22.5,35829,11722,21306,2801,5283,3995,316,218,754
3,Alabama,1963,3347000,1150.9,376.9,683.4,90.6,182.7,142.1,10.2,5.7,24.7,38521,12614,22874,3033,6115,4755,340,192,828
4,Alabama,1964,3407000,1358.7,466.6,784.1,108.0,213.1,163.0,9.3,11.7,29.1,46290,15898,26713,3679,7260,5555,316,397,992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3110,Wyoming,2015,586107,1902.6,300.6,1500.9,101.0,222.1,179.8,2.7,29.5,10.1,11151,1762,8797,592,1302,1054,16,173,59
3111,Wyoming,2016,585501,1957.3,302.5,1518.2,136.6,244.2,195.7,3.4,35.0,10.1,11460,1771,8889,800,1430,1146,20,205,59
3112,Wyoming,2017,579315,1830.4,275.0,1421.0,134.5,237.5,176.4,2.6,45.4,13.1,10604,1593,8232,779,1376,1022,15,263,76
3113,Wyoming,2018,577737,1785.1,264.0,1375.9,145.2,212.2,150.6,2.3,42.1,17.3,10313,1525,7949,839,1226,870,13,243,100


Now let's check how many unique States we have in the `'State'` column.

In [18]:
df_50_States['State'].nunique()

50
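
An alternative to chaining `!=` comparisons is `Series.isin`, negated with `~`. The sketch below uses a toy frame (`sample`, `non_states`, `only_states` and `only_non_states` are names introduced here) so that it runs on its own. It also covers the earlier task of selecting only the entries that are not states.

```python
import pandas as pd

# Toy frame: only the 'State' column matters here.
sample = pd.DataFrame({
    'State': ['Ohio', 'United States', 'District of Columbia', 'Texas', 'Ohio'],
})

non_states = ['United States', 'District of Columbia']

# Rows whose State is NOT in the list (the real states).
only_states = sample[~sample['State'].isin(non_states)]

# Rows whose State IS in the list (the entries that are not states).
only_non_states = sample[sample['State'].isin(non_states)]

print(only_states['State'].nunique())     # 2
print(only_non_states['State'].tolist())  # ['United States', 'District of Columbia']
```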

### EXERCISE

*Find the total number of robberies (hint: look at the `Data.Totals.Violent.Robbery` column) which took place in Rhode Island, Ohio and South Dakota in the year 2015.*

In [44]:
df.query("(State == 'Rhode Island' | State == 'Ohio' | State == 'South Dakota') & Year == 2015")['Data.Totals.Violent.Robbery'].sum()

np.int64(13326)
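
To see how each state contributes to such a total, the selection can be combined with `groupby`. This is a sketch on a toy frame: the numbers below are made up for illustration and are NOT taken from `state_crime.csv`, and the names `toy`, `states`, `selected` and `per_state` are introduced here.

```python
import pandas as pd

# Made-up numbers for illustration only; they are NOT taken from state_crime.csv.
toy = pd.DataFrame({
    'State': ['Ohio', 'Ohio', 'Rhode Island', 'South Dakota', 'Texas'],
    'Year': [2015, 2014, 2015, 2015, 2015],
    'Data.Totals.Violent.Robbery': [100, 90, 20, 10, 500],
})

states = ['Rhode Island', 'Ohio', 'South Dakota']

# Keep only the three states of interest in 2015.
selected = toy[toy['State'].isin(states) & (toy['Year'] == 2015)]

total = selected['Data.Totals.Violent.Robbery'].sum()
per_state = selected.groupby('State')['Data.Totals.Violent.Robbery'].sum()

print(int(total))  # 130
print(per_state)
```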