# Weights & Biases have-a-go

We're going to learn how to use Weights & Biases (W&B) with the aid of everyone's favourite, the [Kaggle Titanic dataset](https://www.kaggle.com/competitions/titanic/data).

In [None]:
from dotenv import load_dotenv
import logging
import pandas as pd
import os
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import wandb

import utils
from utils import PROJECT_DIR, load_data, SEED

load_dotenv()

os.chdir(PROJECT_DIR)

WANDB_PROJ = "Titanic"  # Change this to your project name!
WANDB_USER = os.environ.get("wandb_username")  # Read from your .env file: set wandb_username there
JOB = "predict survival"

## 1. Log a run in Weights and Biases

First things first, let's load the data:

In [None]:
X_train, X_test, y_train, y_test = load_data()

We are going to train a classifier to predict whether a passenger survived the Titanic disaster. We'll start simple with a logistic regression model.

The first step in using Weights & Biases is to initialise a **run**. Just as it sounds, a run is a record of one execution of your model. You choose which metrics, visuals and other outputs to record in each run.

The way you start a run is like this:

In [None]:
run = wandb.init(
    project=WANDB_PROJ,
    job_type=JOB,
    save_code=True,
)

Note:
- I log both `project` and `job_type` because within one project, you might have multiple ML jobs that you want to keep track of. For example, at Nesta, a real use case might be an imputation model to predict and fill in missing values, then another downstream model to make predictions based on the imputed data.
- `save_code`: definitely do this! It saves a snapshot of your code at the time of the run. This is mega useful because you might not want to make a Git commit every single time you run your code - you'll do lots of minor changes and tweaking.

The run now exists as an object called `run`.

If you go to your Weights & Biases workspace, find the Titanic project and navigate to **Runs**, you should now see that you have a run in progress! Exciting!

We will now define and fit a logistic regression model. First of all, we'll define a dict of hyperparameters. The wandb guidance notes that you should keep track of ALL hyperparameters, even if you're just using the defaults. I've been a bit lazy and haven't exactly done that here, but you should!

The hyperparameters are stored in a dict because that will make it more convenient when we move on to running sweeps.

In [None]:
log_reg_config = {
    "penalty": "l2",
    "C": 1.0,
    "random_state": SEED,
    "solver": "lbfgs",
    "max_iter": 100,
}

# Every key in the dict is a LogisticRegression argument, so unpack it directly
model = LogisticRegression(**log_reg_config)

model.fit(X_train, y_train)
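
One way to follow the advice about tracking every hyperparameter is to ask the estimator for its full parameter set. The sketch below relies on scikit-learn's `get_params()`, which returns the defaults as well as the values set explicitly. The seed `42` is a stand-in for `SEED`, and the variable names are illustrative.

```python
from sklearn.linear_model import LogisticRegression

# Explicitly chosen hyperparameters (42 is a stand-in for SEED from utils).
explicit = {"C": 1.0, "solver": "lbfgs", "max_iter": 100, "random_state": 42}

model = LogisticRegression(**explicit)

# get_params() returns every constructor argument, including untouched defaults.
all_params = model.get_params()

# Parameters left at their defaults, which would otherwise go unrecorded.
defaults_only = {k: v for k, v in all_params.items() if k not in explicit}

print(f"{len(explicit)} explicit, {len(defaults_only)} left at defaults")
print(sorted(defaults_only))
```

With an active run, something like `run.config.update(all_params)` should store the full set alongside the run.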

Let's find out how the model did:

In [None]:
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
logging.info(f"Accuracy: {accuracy}")

It got 81% accuracy, which is not bad for a first go. We will definitely want to store this metric with the run so that we can compare accuracy across different runs. As a bonus, we'll also create and log a confusion matrix.

The way you log things to the run is extremely simple for single metrics, and not much more complicated for artefacts like tables and graphs.

In [None]:
wandb.run.summary["accuracy"] = accuracy

cm = confusion_matrix(y_test, preds)
cm = pd.DataFrame(cm)
logging.info(f"Confusion matrix:\n{cm}")

# Log confusion matrix
wb_confusion_matrix = wandb.Table(data=cm, columns=["0", "1"])
run.log({"confusion_matrix": wb_confusion_matrix})
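
It can help to see how the accuracy figure and the confusion matrix relate. The sketch below uses made-up labels (not the Titanic data). In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so accuracy is the diagonal sum divided by the total. The `columns` and `rows` layout is an illustrative choice, not part of the original notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = survived, 0 = did not survive).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = (int(v) for v in cm.ravel())

# Accuracy is the share of predictions that land on the diagonal.
accuracy_from_cm = (tn + tp) / (tn + fp + fn + tp)
assert accuracy_from_cm == accuracy_score(y_true, y_pred)

# Labelled rows make the table easier to read than bare 0/1 headers.
columns = ["actual", "predicted 0", "predicted 1"]
rows = [["true 0", tn, fp], ["true 1", fn, tp]]

print(columns)
for row in rows:
    print(row)
```

These `columns` and `rows` could be passed to `wandb.Table(columns=columns, data=rows)` if you want the logged table to carry its own labels.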

Now we're done, so we'll finish the run! You can go to the **Runs** page and inspect :)

In [None]:
wandb.finish()

## 2. Sweep over hyperparameters

OK, we have a model. It got only 81% accuracy, so we might want to try some different hyperparameters to see if we can improve it. Weights & Biases sweeps are a really convenient way to do this.

**Bonus:** a sweep isn't limited to hyperparameters. You can also try out different models, data preprocessing steps, embedding models etc. You can see an example of how to do this in the script `sweep_different_classifiers.py`. There is guidance on how to nest parameters in your sweep config [here](https://docs.wandb.ai/guides/sweeps/define-sweep-configuration).
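
As a rough idea of what sweeping over different models might look like, the sketch below defines a sweep configuration in which the classifier itself is a parameter. It is not the contents of `sweep_different_classifiers.py`: the parameter names `classifier` and `C`, the value lists, the seed `42` and the `build_model` helper are all assumptions made for illustration. The top-level keys `method`, `metric` and `parameters` follow the W&B sweep configuration format.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical sweep configuration: grid search over model type and C.
sweep_config = {
    "method": "grid",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "classifier": {"values": ["log_reg", "svc", "random_forest"]},
        "C": {"values": [0.1, 1.0, 10.0]},
    },
}


def build_model(name, C, seed=42):
    """Map a swept classifier name to an estimator (illustrative helper)."""
    if name == "log_reg":
        return LogisticRegression(C=C, random_state=seed)
    if name == "svc":
        return SVC(C=C, random_state=seed)
    if name == "random_forest":
        # C does not apply to random forests, so it is ignored here.
        return RandomForestClassifier(random_state=seed)
    raise ValueError(f"Unknown classifier: {name}")


model = build_model("svc", C=1.0)
print(type(model).__name__)
```

Inside a sweep, the training function would read `run.config["classifier"]` and `run.config["C"]` and hand them to a helper like this one. Because `C` is ignored for the random forest, a flat grid like this trains that model three times with identical settings, which is one reason to nest model-specific parameters instead.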