# Categorical variables

<br><br><br>

## Recap of yesterday

I'll just leave this here so we can refer back to it.

* Quantitative data analysts distinguish between **measurements**, which are direct observations or outcomes of experiments, and **models**, which are mathematical machines that describe, predict, or explain the measurements in a quantitative way.
* Measurements can be expressed as points in an **N-dimensional space**. Since the number of measurements is finite, they can't completely fill the space.
  * Measurements can be represented in a 2-D data frame or 2-D array, in which the rows are repeated observations or experiments and the columns are observed attributes, one column/dimension per attribute.
  * Measurements can be visualized as a `scatter` plot.
  * Measurements say what _is_ true.
* Models, when questioned, provide a response for any point in the **N-dimensional space**, so a model completely fills the space.
  * Models can be represented in an N-dimensional array, as a value for each point in space, or as a function that returns a response for N arguments.
  * Models can be visualized by coloring a space with `imshow` or `contourf`, or with contour lines (like mountains on an elevation map).
  * The model-function's response may be
    * the probability that that combination of attributes exists, or
    * a prediction of some other attribute (or its probability), or
    * a category that we use to organize the data but isn't directly measurable, such as species (or its probability).
  * Models say what _would be_ true, under the given conditions, assuming that the model is accurate, etc.
* Models are algorithms involving numerical and categorical values: changing these values changes the model.
  * **Parameters** are values that we tune in an automated **fitting** procedure to find the best model for some measurements.
  * **Hyperparameters** are not part of the fitting procedure, but also impact the quality of the fitted model.
  * Models that don't accurately resemble their training data are **underfitted**.
  * Models that are too similar to their training data (take the individual points too literally—don't generalize well) are **overfitted**.
  * Both underfitting and overfitting are problematic.
* **Machine learning** is a fitting procedure, usually with very large datasets and very large numbers of parameters.
* A **neural network** is currently one of the most successful kinds of machine learning model.
  * A neural network consists of layers of linear functions with many parameters sandwiched between non-linear functions.
  * Optimizing a neural network involves tuning the parameters of the linear functions so that the whole model fits the training data.
  * **Deep learning** is machine learning with neural networks that have many layers (which became feasible about 10 years ago).
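
As a reminder of how parameters, hyperparameters, and fitting relate, here is a minimal sketch on made-up measurements (a noisy straight line, not data from this course). The polynomial degree plays the role of a hyperparameter and the polynomial coefficients are the parameters; `np.polyfit` is just one convenient fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical measurements: a straight line plus noise
x = np.linspace(-1, 1, 20)
y = 2 * x + 1 + rng.normal(0, 0.3, size=x.shape)

train_error = {}
for degree in [0, 1, 9]:
    # degree is a hyperparameter: we pick it by hand
    coefficients = np.polyfit(x, y, degree)
    # the coefficients are parameters: the fit tuned them
    model = np.poly1d(coefficients)
    # the model answers for any x, but here we only ask at the measured points
    train_error[degree] = np.mean((model(x) - y) ** 2)

print(train_error)
```

Degree 0 is too rigid to follow the line (underfitting). Degree 9 has enough freedom to chase the noise (a setup for overfitting), yet its error on the training points is no larger than that of degree 1, which is why training error alone can't be used to choose a hyperparameter.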

<br><br><br>

## What we'll do today

Short discussion of categorical variables, using the penguins dataset.

A more detailed look at text-based data using the complete works of Shakespeare.

Build an autocomplete engine, learning a little about SQL and databases along the way.

Talk about the similarities and differences between our autocomplete engine and large language models like ChatGPT.

<br><br><br>

## Categorical variables among the penguins

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
penguins = pd.read_csv("data/penguins.csv")
penguins

<br><br><br>

In [None]:
penguins[["species", "island", "sex"]]

In [None]:
penguins["species"].unique()

In [None]:
penguins["island"].unique()

In [None]:
penguins["sex"].unique()

<br><br><br>

Many (not all!) machine learning models require inputs and outputs to be numerical. How can we turn categories into numbers?

<br><br><br>

### Method 1

Associate a number with each category. We've already done this.

In [None]:
pd.Categorical(penguins["species"]).codes

In [None]:
pd.Categorical(penguins["island"]).codes

In [None]:
pd.Categorical(penguins["sex"]).codes
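
To see which number goes with which category, here is a small sketch on a made-up list (the values are illustrative stand-ins, not the real dataset). `pd.Categorical` keeps the sorted unique labels in `.categories`, and each code is an index into that list. Missing values get the code `-1`.

```python
import pandas as pd

# hypothetical toy data, standing in for penguins["species"]
toy_species = pd.Series(["Adelie", "Gentoo", "Chinstrap", None, "Adelie"])

cat = pd.Categorical(toy_species)

# the unique labels, sorted alphabetically
print(list(cat.categories))

# one integer per row: an index into cat.categories, or -1 for a missing value
print(list(cat.codes))

# the mapping can be inverted to recover labels from codes
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
```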

<br><br><br>

Notice that the bar plot below has to put Adelie, Gentoo, and Chinstrap in some order along the horizontal axis (Adelie first, then Gentoo, then Chinstrap). Here the order comes from the counts, largest first, not from anything intrinsic to the species.

In [None]:
penguins["species"].value_counts().plot(kind="bar")

In [None]:
pd.crosstab(penguins["species"], penguins["island"])

In [None]:
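fig, ax = plt.subplots()

# compute the table once, so the tick labels come from the same object as the values
counts = pd.crosstab(penguins["species"], penguins["island"])

matrix = ax.matshow(counts.values)
fig.colorbar(matrix, label="number of penguins")

# take the labels from the table itself instead of hard-coding the category order
ax.set_xticks(range(len(counts.columns)), counts.columns)
ax.set_yticks(range(len(counts.index)), counts.index)

None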

<br><br><br>

The disadvantage of this method is that the order is not meaningful (it's something we made up), and a machine learning model might treat it as a real pattern and optimize for it.

It's an invitation to overfitting (which can be controlled, but still).
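
One way to see that the order is made up: the same labels get different codes if we simply list the categories in a different order. In this sketch (toy values, and both orderings are arbitrary choices for illustration), a model that treats the code as a number would see Chinstrap as lying "between" the other two species under one encoding and Gentoo under the other.

```python
import pandas as pd

labels = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]

# default: categories sorted alphabetically
alphabetical = pd.Categorical(labels)

# an equally valid, equally arbitrary ordering
reordered = pd.Categorical(labels, categories=["Adelie", "Gentoo", "Chinstrap"])

print(list(alphabetical.codes))
print(list(reordered.codes))

# the category whose code is halfway between the lowest (0) and highest (2)
middle_alpha = alphabetical.categories[(0 + 2) // 2]
middle_reord = reordered.categories[(0 + 2) // 2]
print(middle_alpha, middle_reord)
```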

<br><br><br>

### Method 2

Create a dimension for each value of a categorical variable:

In [None]:
expanded_penguins = pd.get_dummies(penguins.dropna(), columns=["species", "island", "sex"])
expanded_penguins

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

sex2D = expanded_penguins[["sex_female", "sex_male"]].values

# scatter a little, so we can see overlapping points
sex2D = sex2D.astype(np.float64) + np.random.normal(0, 0.05, (len(expanded_penguins), 2))

ax.scatter(sex2D[:, 0], sex2D[:, 1], marker=".")

ax.set_xlim(-0.3, 1.3)
ax.set_ylim(-0.3, 1.3)
ax.set_xlabel("sex_female")
ax.set_ylabel("sex_male")
ax.axhline(0, color="gray", ls=":")
ax.axvline(0, color="gray", ls=":")

None

In [None]:
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection="3d")

island3D = expanded_penguins[["island_Biscoe", "island_Dream", "island_Torgersen"]].values

# scatter a little, so we can see overlapping points
island3D = island3D.astype(np.float64) + np.random.normal(0, 0.05, (len(expanded_penguins), 3))

ax.scatter(island3D[:, 0], island3D[:, 1], island3D[:, 2], marker=".")

ax.set_xlabel("Biscoe")
ax.set_ylabel("Dream")
ax.set_zlabel("Torgersen")

None

<br><br><br>

The disadvantages of this method are that

* we quickly end up with a lot of dimensions, which uses more memory and computation time, and
* all the values between and beyond 0 and 1 are meaningless.

But if you can afford it, it's a robust way to make models!
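
Both disadvantages can be checked on a small made-up table (the rows below are illustrative, not taken from the dataset). This sketch counts how many columns `pd.get_dummies` produces and confirms that, within each original variable, every row has exactly one 1. Points anywhere else in the space don't correspond to any category.

```python
import pandas as pd

# hypothetical toy table
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "sex": ["male", "female", "female", "male"],
})

expanded = pd.get_dummies(toy, columns=["species", "island", "sex"])

# one new column per distinct value: 3 species + 3 islands + 2 sexes
n_columns = expanded.shape[1]
expected_columns = sum(toy[name].nunique() for name in toy.columns)
print(n_columns, expected_columns)

# within each group of dummy columns, each row contains exactly one 1
row_sums = {}
for name in toy.columns:
    group = expanded.filter(like=name + "_").astype(int)
    row_sums[name] = group.sum(axis=1).tolist()
print(row_sums)
```

Three variables with only a few values each already turn into eight dimensions. A variable with thousands of distinct values (words in a text, for instance) would need thousands.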