# Linear Classification with Gradient Descent

In [1]:
import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo
import data_handler as dh

## Fetching and Pre-Processing the Data

In [2]:
bank_marketing = fetch_ucirepo(id=222)
occupancy_detection = fetch_ucirepo(id=357)

X_bank_marketing, y_bank_marketing = bank_marketing.data.features, bank_marketing.data.targets
X_occupancy, y_occupancy = occupancy_detection.data.features, occupancy_detection.data.targets

Pre-process the bank marketing data. We drop the features that have no value for prediction: `month` and `day_of_week`. Unlike last week, we cannot simply drop rows that contain NaN values, because that would drastically reduce the number of instances, so we replace NaN with 0 instead. Non-numeric (categorical) features are one-hot encoded. Finally, we apply min-max normalization so that every feature lies in [0, 1], which avoids huge residuals.

In [3]:
X_bank_marketing = X_bank_marketing.drop(['month', 'day_of_week'], axis=1)  # drop features without value for prediction

X_bank_marketing = X_bank_marketing.fillna(0)

X_bank_marketing = pd.get_dummies(X_bank_marketing).astype(np.float64)

X_bank_marketing = (X_bank_marketing - X_bank_marketing.min()) / (
        X_bank_marketing.max() - X_bank_marketing.min())  # normalize the data
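The three steps above (fill NaN, one-hot encode, min-max normalize) can be tried on a small toy frame. This is only a sketch: the toy data and the helper `min_max_normalize` are hypothetical and not part of the notebook. Unlike the inline formula, the helper also guards against constant columns, where `max - min` is 0 and the division would otherwise produce NaN.

```python
import numpy as np
import pandas as pd


def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1] (hypothetical helper).

    Constant columns are mapped to 0 instead of NaN.
    """
    col_min = df.min()
    col_range = df.max() - col_min
    col_range = col_range.replace(0, 1)  # avoid 0/0 for constant columns
    return (df - col_min) / col_range


# Toy data (made up for illustration, not taken from the bank marketing dataset)
toy = pd.DataFrame({
    "age": [20.0, 40.0, 60.0],
    "balance": [100.0, np.nan, 300.0],
    "job": ["admin", "student", "admin"],
})

toy = toy.fillna(0)                           # the missing balance becomes 0
toy = pd.get_dummies(toy).astype(np.float64)  # "job" expands into job_admin and job_student
toy = min_max_normalize(toy)                  # every column now lies in [0, 1]
print(toy)
```

Whether constant columns actually occur in the bank marketing data is not checked here; the guard is a precaution.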

Pre-process the occupancy data. We drop the `date` feature. Since all features are of type `object`, we convert them to numeric values and remove the rows whose entries cannot be converted. We again use min-max normalization to avoid huge residuals.

In [4]:
X_occupancy = X_occupancy.drop(['date'], axis=1)  # drop features without value for prediction

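# Convert all columns to numeric, coerce unparseable entries to NaN, and drop rows with NaN values
X_occupancy = X_occupancy.apply(pd.to_numeric, errors='coerce').dropna().astype(np.float64)
y_occupancy = y_occupancy.loc[X_occupancy.index]  # keep the targets aligned with the remaining rows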

X_occupancy = (X_occupancy - X_occupancy.min()) / (X_occupancy.max() - X_occupancy.min())  # normalize the data
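To illustrate the conversion step, here is a minimal sketch on hypothetical toy data in which one row holds text instead of numbers. It shows how `errors='coerce'` turns unparseable entries into NaN, and how the targets can be restricted to the surviving rows by index so that features and labels keep the same length. The frames `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): every column has dtype object because one row holds text
X_toy = pd.DataFrame({
    "Temperature": ["21.5", "Temperature", "23.0"],
    "Light": ["400", "Light", "0"],
})
y_toy = pd.DataFrame({"Occupancy": ["1", "Occupancy", "0"]})

# Unparseable entries become NaN, and rows containing NaN are dropped
X_clean = X_toy.apply(pd.to_numeric, errors="coerce").dropna().astype(np.float64)

# Keep only the targets whose feature row survived
y_clean = y_toy.loc[X_clean.index]

print(X_clean)
print(y_clean)
```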

Split the datasets into training data and test data. We use the `split_data` method from our `data_handler` module.

In [5]:
X_occupancy_train, X_occupancy_test, y_occupancy_train, y_occupancy_test = dh.split_data(X_occupancy,
                                                                                         y_occupancy, 0.8)
X_bank_marketing_train, X_bank_marketing_test, y_bank_marketing_train, y_bank_marketing_test = dh.split_data(
    X_bank_marketing, y_bank_marketing, 0.8)
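The implementation of `dh.split_data` is not shown in this notebook. The sketch below is one possible shape for such a function. It assumes that the third argument is the fraction of rows used for training and that the rows are shuffled before the cut. The name `split_data_sketch` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def split_data_sketch(X, y, train_fraction, seed=0):
    """Hypothetical stand-in for dh.split_data: shuffle the rows, then cut at train_fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))     # random order of row positions
    cut = int(train_fraction * len(X))  # number of training rows
    train_idx, test_idx = order[:cut], order[cut:]
    # The same positions are used for X and y, so features and targets stay paired
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


# Usage on toy data (made up for illustration)
X_demo = pd.DataFrame({"feature": np.arange(10, dtype=np.float64)})
y_demo = pd.DataFrame({"target": np.arange(10) % 2})
X_tr, X_te, y_tr, y_te = split_data_sketch(X_demo, y_demo, 0.8)
print(len(X_tr), len(X_te))
```

Because the split works on row positions, `X` and `y` must have the same number of rows in the same order.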

## Implementation of Linear Classification

In [6]:
class LinearClassifier:
    def __init__(self):
        raise NotImplementedError
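
The class above is still a stub. As a rough idea of what a gradient-descent linear classifier could look like, here is a minimal sketch. It assumes binary targets encoded as 0 and 1, a mean-squared-error loss on the linear scores, and a decision threshold of 0.5. The class name `LinearClassifierSketch` and its parameters are hypothetical and not the notebook's final implementation.

```python
import numpy as np


class LinearClassifierSketch:
    """Hypothetical sketch: least-squares linear classifier trained with batch gradient descent."""

    def __init__(self, learning_rate=0.1, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None

    @staticmethod
    def _add_bias(X):
        # Prepend a column of ones so the first weight acts as the intercept
        X = np.asarray(X, dtype=np.float64)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xb = self._add_bias(X)
        y = np.asarray(y, dtype=np.float64).ravel()  # assumes targets are 0/1
        self.weights = np.zeros(Xb.shape[1])
        for _ in range(self.n_iterations):
            residuals = Xb @ self.weights - y
            gradient = 2.0 * Xb.T @ residuals / len(y)  # gradient of the mean squared error
            self.weights -= self.learning_rate * gradient
        return self

    def predict(self, X):
        scores = self._add_bias(X) @ self.weights
        return (scores >= 0.5).astype(int)  # threshold halfway between the two labels


# Usage on toy data (made up for illustration)
X_demo = np.array([[0.0], [0.1], [0.9], [1.0]])
y_demo = np.array([0, 0, 1, 1])
model = LinearClassifierSketch().fit(X_demo, y_demo)
print(model.predict(X_demo))
```

With features normalized to [0, 1], the residuals stay small, so a fixed learning rate such as 0.1 remains stable on this toy example. Targets stored as strings would have to be mapped to 0 and 1 first.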
    