In [3]:
import pandas as pd
import numpy as np
import os
import re
%matplotlib inline
import matplotlib.pyplot as plt


# Final Exam
**PSTAT 134/234 (Spring 2024)**



## Data description: Fraudulent transactions

A credit card company wants to know whether a set of variables $x_1, \ldots, x_p$ affects the probability that a given transaction is fraudulent. To understand the relationship between these predictor variables and that probability, the company can fit a logistic regression in which the response is defined as:

$$ 
y = \begin{cases}
    1, & \text{if transaction is fraudulent.}  \\
    0, & \text{otherwise. } 
\end{cases}
$$

### Question 1: Read Data into Python
<!--
BEGIN QUESTION
name: q1a
manual: true
points: 10
-->

1. Unzip folder `DataFinalExam` in your working directory.
2. Use a regular expression to select the data files that meet all of the following conditions:
    * File name starts with 'Data'
    * Year is 2020, 2021 or 2022
    * Month is January, February, March or April
3. Concatenate all selected files into a single data frame.
4. Set Y to be column 'Y' and X as the remaining columns of the merged data frame.
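
The filtering in step 2 can be illustrated on made-up file names. This is only a sketch: the naming scheme `Data_<Month>_<Year>.csv`, the list `file_names` and the name `example_pattern` are assumptions for illustration, and the actual files in `DataFinalExam` may be named differently.

```python
import re

# Hypothetical file names; the real names in DataFinalExam may follow another scheme.
file_names = [
    "Data_January_2020.csv",
    "Data_April_2022.csv",
    "Data_May_2021.csv",       # wrong month
    "Data_March_2019.csv",     # wrong year
    "Info_February_2021.csv",  # wrong prefix
]

# Assumed scheme: Data_<Month>_<Year>.csv
example_pattern = re.compile(
    r"Data_(January|February|March|April)_(2020|2021|2022)\.csv$"
)

# re.match anchors at the start of the string, so the 'Data' prefix is enforced.
matched = [name for name in file_names if example_pattern.match(name)]
print(matched)
```

Because `re.match` only anchors the start of the string, the trailing `$` is what prevents names with extra suffixes from slipping through.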

In [None]:
folder_path = '...'  # Replace with your folder path

# Define the regular expression pattern for the desired filenames
pattern = r'...'  # Starts with 'Data'; months January-April only; years 2020, 2021 and 2022 only


# Initialize an empty list to store the data frames
dfs = []

# Iterate over the files in the folder
for file_name in os.listdir(folder_path):
    # Check if the file matches the desired pattern
    if re.match(pattern, file_name):
        # Read the CSV file and append the data frame to the list
        file_path = os.path.join(folder_path, file_name)
        df = pd.read_csv(file_path)
        dfs.append(df)
        print(file_name)

# Concatenate the data frames into a single data frame
merged_df = ...

X = ...
Y = ...

### Question 2

In this problem, we use convex optimization to train a logistic regression model with regularization. We are given data $\left(x_i, y_i\right), i=1, \ldots, n$. The $x_i \in \mathbf{R}^p$ are feature vectors, while the $y_i \in\{0,1\}$ are associated boolean classes.

The goal is to construct a linear classifier $\hat{y}=\mathbb{1}\left[x^T \beta>0\right]$, which is 1 when $x^T \beta$ is positive and 0 otherwise. We model the log-odds of the two classes as a linear function of $x$:
$$
\log \frac{\operatorname{Pr}(Y=1 \mid X=x)}{\operatorname{Pr}(Y=0 \mid X=x)}=x^T \beta
$$
This implies that
$$
\operatorname{Pr}(Y=1 \mid X=x)=\frac{\exp \left(x^T \beta\right)}{1+\exp \left(x^T \beta\right)}, \quad \operatorname{Pr}(Y=0 \mid X=x)=\frac{1}{1+\exp \left(x^T \beta\right)} .
$$
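
As a quick numerical check of the two formulas above, the sketch below evaluates both probabilities for one illustrative score and confirms that they sum to one and that their log-odds recover $x^T \beta$. The function name `class_probabilities` and the value `0.8` are assumptions chosen for illustration.

```python
import math

def class_probabilities(score):
    """Return (Pr(Y=1 | x), Pr(Y=0 | x)) for a linear score x^T beta."""
    p1 = math.exp(score) / (1.0 + math.exp(score))
    p0 = 1.0 / (1.0 + math.exp(score))
    return p1, p0

score = 0.8  # illustrative value of x^T beta
p1, p0 = class_probabilities(score)

# The log-odds of the two probabilities should give back the score.
log_odds = math.log(p1 / p0)
print(p1 + p0, log_odds)
```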
We fit $\beta$ by maximizing the log-likelihood of the data:
$$
\ell(\beta)=\sum_{i=1}^n \left[ y_i x_i^T \beta-\log \left(1+\exp \left(x_i^T \beta\right)\right) \right]
$$

Because $\ell$ is a concave function of $\beta$, maximizing it is a convex optimization problem.
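
The log-likelihood can be evaluated directly with NumPy. The sketch below is one possible way to do it, on simulated data. The names `log_likelihood`, `gradient`, `X_demo`, `y_demo` and `beta_demo` are assumptions, and the gradient $X^T(y - p)$, with $p_i = \operatorname{Pr}(Y=1 \mid X=x_i)$, follows from differentiating $\ell$ term by term.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """ell(beta) = sum_i [ y_i x_i^T beta - log(1 + exp(x_i^T beta)) ]."""
    scores = X @ beta
    # np.logaddexp(0, s) computes log(1 + exp(s)) without overflow.
    return np.sum(y * scores - np.logaddexp(0.0, scores))

def gradient(beta, X, y):
    """Gradient of ell with respect to beta: X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

# Simulated data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.integers(0, 2, size=50).astype(float)
beta_demo = np.zeros(3)

# At beta = 0 every term equals -log(2), so ell(0) = -n log(2).
ll_at_zero = log_likelihood(beta_demo, X_demo, y_demo)
print(ll_at_zero, gradient(beta_demo, X_demo, y_demo))
```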



#### Question 2a:
<!--
BEGIN QUESTION
name: q2a
manual: true
points: 50
-->
1. Use gradient descent to write a function `Update_Beta` that takes the data as input and returns the optimal value of $\beta$.
     - At each iteration, the function should keep track of the maximum update, $\text{Max}_{\text{update}} = \|\beta_{\text{new}} - \beta\|_{\infty}$, and the mean absolute error, defined as $\text{error} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$.
2. Use your function to estimate $\beta$.

*Hint*: Use the fact that maximizing the concave function $f(x)$ is equivalent to minimizing the convex function $-f(x)$.
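
To make the hint concrete, here is a minimal sketch of gradient descent on the negative log-likelihood, run on simulated data. It is not the required `Update_Beta` solution. The helper name `gradient_descent_sketch`, the division of the gradient by $n$, the zero starting point, and the simulated `X_sim`, `y_sim` and `true_beta` are all assumptions made for illustration. The predicted labels use the classifier $\hat{y}=\mathbb{1}\left[x^T \beta>0\right]$ defined earlier.

```python
import numpy as np

def gradient_descent_sketch(X, y, alpha=0.01, max_iteration=100):
    """Minimize -ell(beta)/n by gradient descent (hypothetical helper).

    Returns beta and the per-iteration histories of the maximum update
    and the mean absolute error.
    """
    n, p = X.shape
    beta = np.zeros(p)
    max_updates, errors = [], []
    for _ in range(max_iteration):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (prob - y) / n  # gradient of -ell(beta) / n
        beta_new = beta - alpha * grad
        # Infinity norm of the change in beta.
        max_updates.append(np.max(np.abs(beta_new - beta)))
        # Mean absolute error of the predicted labels.
        y_hat = (X @ beta_new > 0).astype(float)
        errors.append(np.mean(np.abs(y - y_hat)))
        beta = beta_new
    return beta, np.array(max_updates), np.array(errors)

# Simulated data for illustration only.
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(200, 2))
true_beta = np.array([2.0, -1.0])
prob_sim = 1.0 / (1.0 + np.exp(-(X_sim @ true_beta)))
y_sim = (rng.random(200) < prob_sim).astype(float)

beta_hat, max_updates, errors = gradient_descent_sketch(
    X_sim, y_sim, alpha=0.5, max_iteration=500
)
print(beta_hat, max_updates[-1], errors[-1])
```

Scaling the gradient by $1/n$ does not change the minimizer. It only makes a given step size `alpha` behave similarly across data sets of different sizes.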
   


In [1]:
# Fill-in ...
def Update_Beta(X, y, alpha=0.01, max_iteration=100):
   
    ... 
    return ...

#### Question 2b:
<!--
BEGIN QUESTION
name: q2b
manual: true
points: 20
-->

Use diagnostic plots to assess the convergence of your algorithm.
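
One possible layout for such diagnostics is a pair of panels: the maximum update on a log scale and the mean absolute error, both against the iteration number. The sketch below assumes the two histories are available as arrays. The function name `plot_convergence` is hypothetical, and the placeholder histories are synthetic sequences, not results.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_convergence(max_updates, errors):
    """Plot the max update (log scale) and the mean absolute error per iteration."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    iterations = np.arange(1, len(max_updates) + 1)

    axes[0].semilogy(iterations, max_updates)
    axes[0].set_xlabel("Iteration")
    axes[0].set_ylabel("Max update (infinity norm)")
    axes[0].set_title("Change in beta")

    axes[1].plot(iterations, errors)
    axes[1].set_xlabel("Iteration")
    axes[1].set_ylabel("Mean absolute error")
    axes[1].set_title("Classification error")

    fig.tight_layout()
    return fig, axes

# Placeholder histories for illustration only; in practice, pass the values
# recorded by the gradient descent routine.
demo_updates = 0.1 * 0.95 ** np.arange(100)
demo_errors = 0.2 + 0.3 * 0.9 ** np.arange(100)
fig, axes = plot_convergence(demo_updates, demo_errors)
```

A maximum update that decays steadily toward zero and an error curve that flattens out are both consistent with convergence. Oscillation or growth in the first panel usually suggests that the step size is too large.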

#### Question 2c (PSTAT 234)
<!--
BEGIN QUESTION
name: q2c
manual: true
points: 30
-->

Suppose we incorporate a regularization term $\lambda\|\beta\|_1$ with $\lambda>0$, so that the objective function to be maximized is:
$$
\ell(\beta)=\sum_{i=1}^n \left[ y_i x_i^T \beta-\log \left(1+\exp \left(x_i^T \beta\right)\right) \right]-\lambda\|\beta\|_1
$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$.

1. Using the `cvxpy` library, write a function `Update_Beta_reg` that takes the data and the value of $\lambda$ as inputs and returns the optimal value of $\beta$.
2. Run a loop over different values of $\lambda$ (between 0.01 and 1) that calls `Update_Beta_reg` to obtain a solution for $\beta$ at each value.
3. What value of $\lambda$ would you choose based on the average absolute error?
4. How do these results compare to part 2a?

In [1]:
# Fill-in ...
def Update_Beta_reg(X, y, lambda_val):
    
   
    ... 
    return ...


## Submission Checklist
1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Select *File > Save and Export Notebook As > HTML*.
5. Open in Google Chrome and print to PDF.
6. Submit to Gradescope

