# LOGISTIC REGRESSION

**Pre-processing Steps:**
-  **Feature Engineering**: Extract and create meaningful features from existing data.
-  **Feature Scaling**: Standardise or normalise numerical features so they share a comparable scale, which matters once a regularisation penalty is applied to the coefficients.
-  **Transform Categorical Variables**: Use one-hot encoding to convert categorical data into numerical format.
-  **Fix Class Imbalances**: Apply techniques like oversampling, undersampling, or use class weights.
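
The pre-processing steps above can be sketched with scikit-learn as follows. This is a minimal illustration on toy data, not the pipeline used in this notebook: the column names, the log transform, and the choice of class weights over resampling are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy data: all column names and values here are hypothetical.
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame(
    {
        "read_time": rng.exponential(scale=30.0, size=n_rows),
        "scroll_percentage": rng.uniform(0.0, 100.0, size=n_rows),
        "device_type": np.resize([1, 2, 3], n_rows),
    }
)
toy_y = np.array([1] * 20 + [0] * 180)  # 10% positives: an imbalanced target

# Feature engineering: log-transform a right-skewed duration.
toy_df["log_read_time"] = np.log1p(toy_df["read_time"])

# Feature scaling for numerical columns, one-hot encoding for categorical ones.
# Columns not listed (here: the raw "read_time") are dropped by default.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["log_read_time", "scroll_percentage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["device_type"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_toy = preprocessor.fit_transform(toy_df)

# Class imbalance: "balanced" weights are n_samples / (n_classes * class_count).
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_y
)

print(X_toy.shape)
print(dict(zip([0, 1], class_weights)))
```

Class weights leave the data untouched and only reweight the loss, which makes them a convenient first option; oversampling or undersampling would instead change the training rows themselves.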

**Model Building Steps:**
-  **Regularisation**: Use L1 or L2 regularisation to reduce overfitting.
-  **Build, Train, and Test Model**: Split the data into training and testing sets, fit the logistic regression model on the training set, and test it on the held-out set.
-  **Evaluate Model Performance**: Use metrics like accuracy, precision, recall, F1-score, and AUC-ROC.
-  **Cross-Validation**: Perform cross-validation to estimate how well the model generalises to unseen data.
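
As a rough sketch of the model building steps listed above, the following example fits a regularised logistic regression on synthetic data with scikit-learn. The data set, the value of `C`, the split ratio, and the number of folds are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary classification data (roughly 90% / 10%).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=0
)

# Stratified split keeps the class ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# LogisticRegression applies L2 regularisation by default;
# C is the inverse regularisation strength (smaller C = stronger penalty).
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "auc_roc": roc_auc_score(y_test, y_prob),
}

# 5-fold stratified cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"CV AUC-ROC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Keeping the scaler inside the pipeline means it is refitted on each training fold during cross-validation, so no information from the validation fold leaks into the scaling.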

## DOCUMENT PREAMBLE

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import itertools
from tqdm import tqdm
import os

# Configure tqdm and matplotlib
tqdm.pandas()
plt.style.use("classic")
#plt.rcParams["figure.dpi"] = 200
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.family"] = "serif"

In [2]:
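# Load data from parquet files in the 'data_processed' folder
def load_data(version, data_type):
    """Load the behaviors, history, articles, and users DataFrames for one split.

    Files are read from ./data_processed/{version}_{data_type}_*.parquet.
    """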
    if data_type not in ["train", "validation"]:
        raise ValueError("data_type must be either 'train' or 'validation'")

    # Define file paths with the new naming convention
    base_path = f"./data_processed/{version}_{data_type}_"
    behaviors_file = f"{base_path}behaviors_df_expanded.parquet"
    history_file = f"{base_path}history_df_expanded.parquet"
    articles_file = f"{base_path}articles_df_expanded.parquet"
    users_file = f"{base_path}users_df_expanded.parquet"

    # Read parquet files into DataFrames
    behaviors_df = pd.read_parquet(behaviors_file)
    history_df = pd.read_parquet(history_file)
    articles_df = pd.read_parquet(articles_file)
    users_df = pd.read_parquet(users_file)

    # Print DataFrame info
    for name, df in zip(
        ["behaviors_df", "history_df", "articles_df", "users_df"],
        [behaviors_df, history_df, articles_df, users_df],
    ):
        print(f"--- '{name}' ---\n")
        df.info()  # info() prints its summary directly and returns None
        print()

    return behaviors_df, history_df, articles_df, users_df

# Set parameters and load data
version = "small"
data_type = "train"
behaviors_df, history_df, articles_df, users_df = load_data(version, data_type)

--- 'behaviors_df' ---

<class 'pandas.core.frame.DataFrame'>
Index: 2254478 entries, 0 to 205360
Data columns (total 24 columns):
 #   Column                             Dtype         
---  ------                             -----         
 0   impression_id                      uint32        
 1   impression_article_id              float64       
 2   impression_time                    datetime64[us]
 3   impression_read_time               float32       
 4   impression_scroll_percentage       float32       
 5   impression_device_type             int8          
 6   impression_article_id_inview       int64         
 7   impression_article_id_clicked      int64         
 8   user_id                            uint32        
 9   user_is_sso                        bool          
 10  user_is_subscriber                 bool          
 11  impression_session_id              uint32        
 12  impression_next_read_time          float32       
 13  impression_next_scroll_percentage  floa