# Ch 3a: Design Patterns 9 and 10

Design Pattern 9: Neutral Class

Design Pattern 10: Rebalancing

# Design Pattern 9: Neutral Class

Demonstrates on a synthetic dataset that creating a separate Neutral class can be helpful, then applies the idea to a real-world scenario.

## On Synthetic data

Patients with a history of jaundice will be assumed to be at risk of liver damage and prescribed ibuprofen while patients with a history of stomach ulcers will be prescribed acetaminophen. The remaining patients will be arbitrarily assigned to either category.

In [6]:
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from pathlib import Path
import os

load_dotenv(dotenv_path=Path("../.env"))

True

In [8]:
%load_ext google.cloud.bigquery
from google.cloud import bigquery
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = f"../{os.environ.get('GCP_KEY_FILE')}"
bq = bigquery.Client()


In [None]:
# dataset = bq.create_dataset("mlpatterns")

In [4]:
def create_synthetic_dataset(N, shuffle):
  # start with a random 50/50 assignment of the two drugs
  prescription = np.full(N, fill_value='acetominophen', dtype='U20')
  prescription[:N//2] = 'ibuprofen'
  np.random.shuffle(prescription)
  
  # neutral class
  p_neutral = np.full(N, fill_value='Neutral', dtype='U20')

  # 10% is patients with history of liver disease
  jaundice = np.zeros(N, dtype=bool)
  jaundice[0:N//10] = True
  prescription[0:N//10] = 'ibuprofen'
  p_neutral[0:N//10] = 'ibuprofen'

  # 10% is patients with history of stomach ulcers
  ulcers = np.zeros(N, dtype=bool)
  ulcers[(9*N)//10:] = True
  prescription[(9*N)//10:] = 'acetominophen'
  p_neutral[(9*N)//10:] = 'acetominophen'

  df = pd.DataFrame.from_dict({
    'jaundice': jaundice,
    'ulcers': ulcers,
    'prescription': prescription,
    'prescription_with_neutral': p_neutral
  })

  if shuffle:
    return df.sample(frac=1).reset_index(drop=True)
  else:
    return df

In [5]:
df = create_synthetic_dataset(1000, shuffle=True)
df.head()

| | jaundice | ulcers | prescription | prescription_with_neutral |
|---|---|---|---|---|
| 0 | True | False | ibuprofen | ibuprofen |
| 1 | False | False | acetominophen | Neutral |
| 2 | False | False | ibuprofen | Neutral |
| 3 | False | False | acetominophen | Neutral |
| 4 | False | False | acetominophen | Neutral |


In [6]:
from sklearn import linear_model
for label in ['prescription', 'prescription_with_neutral']:
  ntrain = 8*len(df)//10 # 80% used as training data
  lm = linear_model.LogisticRegression()
  lm = lm.fit(df.loc[:ntrain-1, ['jaundice', 'ulcers']], df[label][:ntrain])
  acc = lm.score(df.loc[ntrain:, ['jaundice', 'ulcers']], df[label][ntrain:])
  print(f'label={label} | accuracy={acc}')

label=prescription | accuracy=0.63
label=prescription_with_neutral | accuracy=1.0
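The gap between the two accuracies follows from how the labels were built. Only 20% of patients have a feature that determines the label; for the other 80% the two-class label is a coin flip, so no model using only `jaundice` and `ulcers` can be expected to do much better than 0.1 + 0.1 + 0.8 × 0.5 = 0.6. The sketch below checks this arithmetic on a deterministic stand-in for the synthetic data (alternating labels instead of a random shuffle). The helper `best_achievable_accuracy` is a hypothetical name introduced here for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def best_achievable_accuracy(df, feature_cols, label_col):
    """Accuracy of a classifier that predicts the most frequent label
    for every distinct combination of feature values."""
    # Count rows per (features, label) combination
    counts = df.groupby(feature_cols + [label_col]).size()
    # Within each feature combination, keep the count of the majority label
    best = counts.groupby(level=feature_cols).max()
    return best.sum() / len(df)

# Deterministic stand-in for the synthetic dataset (assumption: labels in the
# undetermined 80% alternate instead of being randomly shuffled)
N = 1000
jaundice = np.zeros(N, dtype=bool)
jaundice[:N // 10] = True
ulcers = np.zeros(N, dtype=bool)
ulcers[(9 * N) // 10:] = True

prescription = np.where(np.arange(N) % 2 == 0, 'ibuprofen', 'acetominophen').astype('U20')
prescription[:N // 10] = 'ibuprofen'
prescription[(9 * N) // 10:] = 'acetominophen'

with_neutral = np.full(N, 'Neutral', dtype='U20')
with_neutral[:N // 10] = 'ibuprofen'
with_neutral[(9 * N) // 10:] = 'acetominophen'

toy = pd.DataFrame({
    'jaundice': jaundice,
    'ulcers': ulcers,
    'prescription': prescription,
    'prescription_with_neutral': with_neutral,
})

acc_plain = best_achievable_accuracy(toy, ['jaundice', 'ulcers'], 'prescription')
acc_neutral = best_achievable_accuracy(toy, ['jaundice', 'ulcers'], 'prescription_with_neutral')
print(f'prescription: {acc_plain}')
print(f'prescription_with_neutral: {acc_neutral}')
```

On a randomly shuffled sample, the measured accuracy on a held-out split can land a little above or below 0.6, which is consistent with the 0.63 reported above.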


## On Natality data

A baby with an Apgar score of 10 is healthy, and one with an Apgar score of 7 or below requires some medical attention. What about babies with scores of 8-9? They are neither perfectly healthy nor in need of serious medical intervention. Let's compare a 2-class model with a 3-class model that includes a Neutral class.
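As a rough illustration, the two labeling schemes used by the queries in this section can be written as plain Python functions. The function names are hypothetical; the thresholds and label strings are copied from the SQL `IF` expressions.

```python
def label_two_class(apgar_1min):
    # Mirrors: IF(apgar_1min >= 9, 'Healthy', 'NeedsAttention')
    return 'Healthy' if apgar_1min >= 9 else 'NeedsAttention'

def label_three_class(apgar_1min):
    # Mirrors: IF(apgar_1min = 10, 'Healthy',
    #            IF(apgar_1min >= 8, 'Neutral', 'Needs Attention'))
    if apgar_1min == 10:
        return 'Healthy'
    if apgar_1min >= 8:
        return 'Neutral'
    return 'Needs Attention'

# Compare the two schemes around the marginal scores
for score in range(6, 11):
    print(score, label_two_class(score), label_three_class(score))
```

Note that the 2-class scheme has to put the marginal scores somewhere: 9 lands in `Healthy` and 8 in `NeedsAttention`, while the 3-class scheme groups both under `Neutral`.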

Without the Neutral class

In [19]:
%%bigquery
CREATE OR REPLACE MODEL `mlpatterns.neutral_2classes`
OPTIONS(model_type='logistic_reg', input_label_cols=['health']) AS
SELECT
	IF(apgar_1min >= 9, 'Healthy', 'NeedsAttention') AS health,
	plurality,
	mother_age,
	gestation_weeks,
	ever_born
FROM `bigquery-public-data.samples.natality`
WHERE apgar_1min <= 10

In [20]:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL mlpatterns.neutral_2classes)

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.565628 | 0.997893 | 0.565213 | 0.722007 | 0.690348 | 0.52722 |
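These metrics are worth reading together. A recall close to 1.0 combined with precision and accuracy that are nearly equal suggests the model assigns almost every example to the positive class, and the `roc_auc` near 0.5 points the same way. As a quick sanity check, the reported `f1_score` can be reproduced from precision and recall using the standard harmonic-mean formula; the numbers below are copied from the evaluation output above.

```python
# Values copied from the ML.EVALUATE output
precision = 0.565628
recall = 0.997893
accuracy = 0.565213

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f'f1 = {f1:.6f}')

# If a model predicted the positive class for every example, precision and
# accuracy would both equal the positive-class share of the data.
print(f'precision - accuracy = {precision - accuracy:.6f}')
```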


Create a Neutral class to hold the "marginal" scores (8-9). These represent babies that are neither perfectly healthy nor in need of serious medical intervention.

In [9]:
%%bigquery
CREATE OR REPLACE MODEL mlpatterns.neutral_3classes
OPTIONS(model_type='logistic_reg', input_label_cols=['health'])
AS

SELECT
  IF(apgar_1min = 10, 'Healthy',
    IF(apgar_1min >= 8, 'Neutral', 'Needs Attention')
  ) AS health,
  plurality,
  mother_age,
  gestation_weeks,
  ever_born
FROM `bigquery-public-data.samples.natality`
WHERE apgar_1min <= 10

The second model, which includes the Neutral class, has an accuracy of 0.79, versus about 0.56 for the original 2-class model.

# Design Pattern 10: Rebalancing

Handle datasets that are inherently imbalanced.

Dataset: https://www.kaggle.com/datasets/ealaxi/paysim1?resource=download
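One common rebalancing approach is to downsample the majority class so the model sees the minority class more often during training. The sketch below runs on a toy DataFrame rather than the linked dataset. The helper `downsample_majority`, the `isFraud` column name, and the 1:1 ratio are assumptions made for illustration, and the helper assumes a binary label.

```python
import numpy as np
import pandas as pd

def downsample_majority(df, label_col, ratio=1.0, seed=0):
    """Keep every minority-class row and a random sample of majority rows.

    ratio is the desired (majority rows) / (minority rows) after sampling.
    Assumes a binary label column.
    """
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]
    # Never ask for more majority rows than exist
    n_keep = min(len(majority), int(ratio * len(minority)))
    majority_sample = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are interleaved
    combined = pd.concat([minority, majority_sample])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy imbalanced data: 1% positive examples (hypothetical column names)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'amount': rng.uniform(1, 1000, size=1000),
    'isFraud': np.r_[np.ones(10, dtype=int), np.zeros(990, dtype=int)],
})

balanced = downsample_majority(toy, 'isFraud', ratio=1.0)
print(balanced['isFraud'].value_counts())
```

Downsampling discards majority-class rows, so evaluation is usually done on data that keeps the original class distribution, and metrics such as precision and recall tend to be more informative than accuracy in this setting.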