# Advanced Legal Analytics
# LAW 3027 Tutorial 2: Introduction to Linear Regression


#### Intended Learning Outcomes:
This notebook provides a gentle introduction to Linear Regression.
By the end of this notebook you will know how to:
- Perform regression analysis to model the linear relationship between an independent and a dependent variable
- Interpret the regression model and its parameters

#### Libraries to be used:
You can activate your previously used environment, although most of its packages are not needed here. This tutorial uses only common Python libraries such as `pandas`, `numpy`, `matplotlib`, `scipy`, and `seaborn`.

We will also use scikit-learn, Python's machine learning library. You can install it with `pip`. See the instructions here: https://scikit-learn.org/stable/install.html

### Linear Regression

#### 1.1
The model for a linear regression is described by the following formula:

$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i $


which describes the relationship between two variables $x$ and $y$ (the subscript $i$ is a variable that represents the index of a given data point). Can you see from the formula why this model is a _linear_ regression?

In linear regression, one variable is called the independent variable and the other the dependent variable.

Just by looking at how the formula is written, which variable do you think is the dependent variable and which is the independent variable?

What are the other terms $\hat{\beta_0}$ and $\hat{\beta_1}$ ? 

##### 1.2 In the following picture, which regression line do you think best fits the data? How might you check?

In [None]:
from IPython import display
display.Image("figs/regression_line.png")

##### 1.3 Watch the video below and discuss the answer to 1.2 after getting insights from the video.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("jEEJNz0RK4Q", width=600)
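
One way to make the comparison in 1.2 concrete is to score each candidate line by its sum of squared residuals: the smaller the score, the closer the line is to the data. The sketch below uses made-up data and two hypothetical candidate lines (not the points in the figure), and uses `np.polyfit` to obtain the least-squares line for comparison.

```python
import numpy as np

# Made-up toy data, for illustration only (not the points in the figure).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_squared_residuals(intercept, slope, x, y):
    """Sum of squared vertical distances between the data and a candidate line."""
    y_hat = intercept + slope * x
    return float(np.sum((y - y_hat) ** 2))


# Two hypothetical candidate lines, given as (intercept, slope).
ssr_a = sum_squared_residuals(0.0, 2.0, x, y)
ssr_b = sum_squared_residuals(1.0, 1.5, x, y)

# The least-squares line; np.polyfit returns the highest-degree coefficient first.
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)
ssr_ls = sum_squared_residuals(intercept_ls, slope_ls, x, y)

print("Candidate A:", ssr_a)
print("Candidate B:", ssr_b)
print("Least squares:", ssr_ls)
```

By construction, no other straight line can have a smaller sum of squared residuals than the least-squares line.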

#### 2 The Swedish Auto Insurance Problem

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims, in thousands of Swedish Kronor, given the total number of claims. It is a regression problem. The dataset comprises 63 observations with one input variable and one output variable. The variable names are as follows:

- X = Number of claims.
- Y = Total payment for all claims in thousands of Swedish Kronor.

#### 2.1 Load the dataset
The dataset is available at the following url : https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/insurance.csv 

Load the dataset with pandas into a DataFrame called `df_insurance`.
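
A minimal sketch of the loading step, assuming the CSV is comma-separated with a header row. The helper `load_insurance` is a hypothetical name introduced here; the offline sample reuses the column names that appear in section 2.5, with made-up values.

```python
import io

import pandas as pd

INSURANCE_URL = (
    "https://raw.githubusercontent.com/maastrichtlawtech/"
    "law3027-advanced-legal-analytics/main/data/insurance.csv"
)


def load_insurance(source=INSURANCE_URL):
    """Read the insurance CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# In the notebook (needs internet access):
# df_insurance = load_insurance()
# df_insurance.head()

# Offline check with two made-up rows (assumed layout, not real data).
sample_csv = io.StringIO(
    "Number_of_claims,Payment_for_all_claims_in_Swedish_Kronor\n"
    "10,50.0\n"
    "20,90.5\n"
)
df_sample = load_insurance(sample_csv)
print(df_sample.shape)
print(list(df_sample.columns))
```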

#### 2.2 Inspect the data, do an exploratory data analysis, and remove any anomalies (missing values, invalid values, etc.)
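
The sketch below shows one possible cleaning routine on a made-up frame. It assumes that non-numeric entries, missing values, and negative values all count as anomalies; whether those rules suit the real data is for you to judge after inspecting it. The helper name `clean_numeric_frame` is hypothetical.

```python
import pandas as pd


def clean_numeric_frame(df):
    """Keep only rows whose values are all numeric, present, and non-negative."""
    numeric = df.apply(pd.to_numeric, errors="coerce")  # non-numeric entries become NaN
    numeric = numeric.dropna()                          # drop rows with missing values
    numeric = numeric[(numeric >= 0).all(axis=1)]       # assumption: negatives are invalid
    return numeric.reset_index(drop=True)


# Made-up frame with one clean row and three problematic rows.
toy = pd.DataFrame(
    {
        "Number_of_claims": [10, "x", 5, -3],
        "Payment_for_all_claims_in_Swedish_Kronor": [50.0, 20.0, None, 15.0],
    }
)

toy.info()               # column types and non-null counts
print(toy.isna().sum())  # missing values per column
cleaned = clean_numeric_frame(toy)
print(cleaned)
```

On the real data, `df_insurance.describe()` is also a quick way to spot implausible minimum or maximum values.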

#### 2.3 Perform a Visual Exploratory Data Analysis

Conduct a univariate visual exploratory data analysis on the dataset. You can plot histograms, cumulative or probability distribution plots, etc. Feel free to revisit the tutorial on Visual Exploratory Analysis from the Legal Analytics course.

What are your observations? Did you find any outliers?
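
Alongside the plots, a numeric rule can help flag candidate outliers. The sketch below applies the common 1.5 × IQR convention to a made-up series. The tutorial does not prescribe this rule, so treat it as one assumption among several reasonable choices; `iqr_outliers` is a hypothetical helper.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Made-up values with one obviously extreme point.
toy_values = pd.Series([1, 2, 3, 4, 5, 100])
flagged = iqr_outliers(toy_values)
print(flagged.tolist())
```

In the notebook, the same function could be applied to each column, for example `iqr_outliers(df_insurance['Number_of_claims'])`.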

#### 2.4 Perform a Correlation Analysis between the two variables: Number of claims and Payment for all claims in Swedish Kronor

- Make a scatter plot of the two variables
- Compute the Pearson correlation coefficient between the two variables

- What are your observations? Is there a linear relationship between the two variables?
- What is the magnitude of the correlation coefficient?
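
As a reminder of what the Pearson coefficient measures, the sketch below computes it from its definition on made-up data and checks the result against `np.corrcoef`. `scipy.stats.pearsonr` is another option that also returns a p-value.

```python
import numpy as np

# Made-up data, for illustration only.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Definition: covariance of x and y divided by the product of their spreads.
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# NumPy returns the full 2x2 correlation matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]

print(r_manual, r_numpy)
```

The coefficient always lies between -1 and 1; its sign gives the direction of the relationship and its absolute value the strength.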

#### 2.5 Preparing the data for Linear Regression
The variable names are as follows:

- X (Independent Variable) = Number of claims. 
- Y (Dependent Variable) = Total payment for all claims in thousands of Swedish Kronor.

In [None]:
import numpy as np
X = df_insurance['Number_of_claims'].values
y = df_insurance['Payment_for_all_claims_in_Swedish_Kronor'].values
print(X, X.shape, X.ndim)
# Notice that X is a 1-dimensional array with a shape of (63,)

In [None]:
# scikit-learn expects the features as a 2-D array of shape (n_samples, n_features).
# We use array.reshape(shape) to reshape the array. See here: https://www.geeksforgeeks.org/reshape-numpy-array/
# It takes a tuple as an argument; the tuple is the new shape.
X = X.reshape((len(X), 1))  # the new array has 63 rows, each with 1 element
print(X, X.shape, X.ndim)
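
A common shorthand for the same reshape is `reshape(-1, 1)`, where `-1` asks NumPy to infer the number of rows. The sketch below compares the two forms on a made-up array.

```python
import numpy as np

x_flat = np.arange(5)                          # made-up 1-D array, shape (5,)
x_explicit = x_flat.reshape((len(x_flat), 1))  # explicit number of rows
x_inferred = x_flat.reshape(-1, 1)             # -1 lets NumPy infer the number of rows

print(x_flat.shape, x_explicit.shape, x_inferred.shape)
print(np.array_equal(x_explicit, x_inferred))
```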

#### 2.6 Do the train-test split. Set the train_size = 0.8 (80% training and 20% testing set)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
# see here for more details: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Note that we have split `X` into 2 parts: `X_train` and `X_test`. 
Similarly we have also split `y` into 2 parts: `y_train` and `y_test`
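
To see what the split returns, the sketch below repeats it on made-up arrays of the same length as the insurance data (63 rows). Since 80% of 63 is not a whole number, the printed shapes show how scikit-learn rounds; the last check confirms that each `y` value stays paired with its `X` row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with 63 rows, where y is exactly twice x.
X_demo = np.arange(63).reshape(-1, 1)
y_demo = 2.0 * np.arange(63)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Rows are shuffled, but X and y are shuffled together.
print(np.allclose(y_tr, 2.0 * X_tr[:, 0]), np.allclose(y_te, 2.0 * X_te[:, 0]))
```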

#### 2.7 Train the Linear Regression Model

In [None]:
reg = LinearRegression().fit(X_train, y_train)

#### 2.8 Compute the Slope and the Intercept

In [None]:
print("Slope:{}".format(reg.coef_))  # a trailing underscore marks attributes estimated from the training data
print("Intercept:{}".format(reg.intercept_))

Interpretation of the slope: for each additional claim, the predicted total payment increases by about 3.34 thousand Swedish Kronor.
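
For a single predictor, the least-squares slope and intercept have a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept is the mean of `y` minus the slope times the mean of `x`. The sketch below checks this against `LinearRegression` on made-up data, and shows that moving one unit along `x` changes the prediction by exactly the slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data, for illustration only.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Closed-form least-squares estimates for one predictor.
x_mean, y_mean = x_demo.mean(), y_demo.mean()
slope = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The same fit with scikit-learn.
reg_demo = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

# Difference between predictions one unit apart along x.
step = reg_demo.predict([[11]])[0] - reg_demo.predict([[10]])[0]

print(slope, intercept)
print(reg_demo.coef_[0], reg_demo.intercept_)
print(step)
```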

#### 2.9 Make the predictions on the test set

We will use the trained linear regression model, `reg`, to make predictions for the `X` values in the test set, which are given by `X_test`. The predicted `y` values for `X_test` are stored in `y_pred`, as shown below.

In [None]:
y_pred = reg.predict(X_test) #the predictions on the test set are stored in y_pred

#### 2.10 Predict the value of payment for an example value of the number of claims
What would the predicted payment be for, say, 56 claims?

In [None]:
reg.predict([[56]])

#### 2.11 Plot the predicted values for the test set using the Linear Regression Model (You need to plot the least squares regression line). Also plot the actual test data.
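
One possible way to draw this plot with `matplotlib` is sketched below. The helper `plot_test_fit` is a hypothetical name, the axis labels assume the insurance variables, and the demo arrays are made up so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_test_fit(X_test, y_test, y_pred):
    """Scatter the actual test points and draw the predicted regression line."""
    order = np.argsort(X_test[:, 0])  # sort by x so the line is drawn left to right
    fig, ax = plt.subplots()
    ax.scatter(X_test[:, 0], y_test, label="Actual test data")
    ax.plot(X_test[order, 0], y_pred[order], color="red", label="Regression line")
    ax.set_xlabel("Number of claims")
    ax.set_ylabel("Total payment (thousands of Swedish Kronor)")
    ax.legend()
    return ax


# Made-up test set and predictions from a hypothetical line y = 2x + 1.
X_plot_demo = np.array([[5.0], [1.0], [3.0]])
y_plot_demo = np.array([11.0, 2.5, 7.5])
y_hat_plot_demo = 2.0 * X_plot_demo[:, 0] + 1.0

ax = plot_test_fit(X_plot_demo, y_plot_demo, y_hat_plot_demo)
```

In the notebook, the equivalent call would be `plot_test_fit(X_test, y_test, y_pred)`.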

#### 2.12 Compute the mean squared error and the R2 score

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
print('Mean squared error: {}'.format(mean_squared_error(y_test, y_pred)))
print('Coefficient of determination or R2 Score: {}'.format(r2_score(y_test, y_pred)))
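
To make the two metrics less of a black box, the sketch below computes them from their definitions with NumPy on made-up values. The mean squared error is the average squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values.

```python
import numpy as np


def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, the share of variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up actual and predicted values.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 8.0]

mse_demo = mse(y_true_demo, y_pred_demo)
r2_demo = r2(y_true_demo, y_pred_demo)
print(mse_demo, r2_demo)
```

Note that the mean squared error is expressed in squared units of `y`, while R2 is unitless.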

## 3. Correlation & Regression Analysis on a Crime Dataset

#### Dataset: 

We have collected a subset of the crime dataset from the UCI Machine Learning Repository. For detailed information about the variables in the dataset, refer to: https://archive.ics.uci.edu/dataset/183/communities+and+crime

The dataset is available here: https://raw.githubusercontent.com/maastrichtlawtech/law3027-advanced-legal-analytics/main/data/crime_dataset_assignment.csv

#### Target Variable

The target variable of interest is `ViolentCrimesPerPop`: the total number of violent crimes per 100K population (numeric, decimal).

#### Sub-Tasks:

Specific tasks you need to perform to complete this question:

- Load the crime dataset into a pandas DataFrame called `df_crime`


- Compute a correlation matrix for the `df_crime` dataframe.


- Programmatically identify the top 5 variables (features) most correlated with `ViolentCrimesPerPop`. The code should print the correlation between `ViolentCrimesPerPop` and each of these 5 features. Then compute the correlation matrix of `ViolentCrimesPerPop` together with the 5 most correlated features. You can always refer here for the meaning of each variable: https://archive.ics.uci.edu/dataset/183/communities+and+crime


- Consider the most correlated variable with `ViolentCrimesPerPop` as the independent variable and `ViolentCrimesPerPop` as the dependent/target variable. Perform a linear regression analysis to predict the `ViolentCrimesPerPop`  using the most correlated variable. 
   
   - Split the dataset as 90% training and 10% test set
   - Compute the Mean squared error and Coefficient of Determination on the test set
   - Compute the Slope & Intercept
   - Plot the predicted values for the test set using the Linear Regression Model. Also plot the actual test data.
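
The sub-tasks above can be sketched as two small helpers. Both names (`top_correlated`, `fit_single_feature`) are hypothetical, the toy frame and its column names are made up, and ranking by the absolute value of the correlation is an assumption about what "most correlated" means here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def top_correlated(df, target, k=5):
    """Signed correlations of the k features with the largest |r| against target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    ranked = corr.abs().sort_values(ascending=False).index
    return corr.reindex(ranked).head(k)


def fit_single_feature(df, feature, target, train_size=0.9, random_state=0):
    """Fit y ~ x for one feature and report test-set metrics and coefficients."""
    X = df[[feature]].values  # 2-D array, as scikit-learn expects
    y = df[target].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )
    reg = LinearRegression().fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    return {
        "slope": float(reg.coef_[0]),
        "intercept": float(reg.intercept_),
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }


# Made-up frame: "a" determines the target exactly, "c" is strongly negative, "b" is weak.
a = np.arange(20.0)
toy = pd.DataFrame(
    {
        "a": a,
        "b": np.tile([1.0, -1.0], 10),
        "c": -a + np.tile([0.0, 3.0], 10),
        "target": 2.0 * a + 1.0,
    }
)

top2 = top_correlated(toy, "target", k=2)
sub_matrix = toy[["target", *top2.index]].corr()  # target plus its top features
result = fit_single_feature(toy, top2.index[0], "target")
print(top2)
print(sub_matrix)
print(result)
```

In the notebook, the corresponding calls would be `top_correlated(df_crime, "ViolentCrimesPerPop", k=5)` followed by `fit_single_feature(df_crime, <best feature>, "ViolentCrimesPerPop")`, with the plot from section 2.11 adapted to the returned model's predictions.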
