# canica tutorial
We start by loading a dataset into a pandas DataFrame, the format canica expects. The dataset contains the basic information canica needs:
- A text column
- An embedding for every piece of text. In this case, the embeddings were obtained using OpenAI's embeddings endpoint.

This dataset also includes extra information (the `language` and `stars` columns) that canica can use to color the data points.

In [None]:
import pandas as pd
from pathlib import Path

dataset_path = Path("data") / "amazon_reviews_multi_val_de_en_100.parquet"
df = pd.read_parquet(dataset_path)
df.head()
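
Before plotting, it can be useful to confirm that a DataFrame has the layout described above. The following is a minimal sketch and not part of canica: the helper `check_canica_frame` and the tiny stand-in DataFrame are hypothetical, and the column names `text` and `embedding` are simply the ones this tutorial uses. With the real data, you would pass `df` instead of `toy_df`.

```python
import pandas as pd


def check_canica_frame(frame, text_col="text", embedding_col="embedding"):
    """Hypothetical helper: return the embedding dimensionality or raise ValueError."""
    # Both columns must be present.
    missing = [col for col in (text_col, embedding_col) if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # All embeddings must share one length.
    lengths = frame[embedding_col].map(len).unique()
    if len(lengths) != 1:
        raise ValueError(f"embeddings have inconsistent lengths: {sorted(lengths)}")
    return int(lengths[0])


# Tiny stand-in for the real dataset (made-up values, 3-dimensional embeddings).
toy_df = pd.DataFrame(
    {
        "text": ["great product", "schlechte Qualität"],
        "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        "language": ["en", "de"],
        "stars": [5, 1],
    }
)

embedding_dim = check_canica_frame(toy_df)
print(embedding_dim)
```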

We plot the data using [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding), with the language of the review as the coloring variable. Observe that the languages are colored using a discrete, categorical color scale. This happens when the coloring variable is a string.

Additionally, observe that we are setting the option `"use_PCA": False`. This is an optimization tradeoff. Because the original embeddings have very high dimensionality, the t-SNE algorithm benefits from first reducing the dimensionality of the data with PCA. However, with so few data points the performance gain is not noticeable, so we skip that step here.


In [None]:
from canica.widget import CanicaTSNE

CanicaTSNE(
    df,
    embedding_col="embedding",
    text_col="text",
    hue_col="language",
    params={"use_PCA": False},
)
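
To make the PCA tradeoff concrete, the sketch below shows the general PCA-then-t-SNE pattern using scikit-learn. It is not canica's internal implementation, which this tutorial does not show. The random stand-in embeddings, their 1536-dimensional size, and the choice of 50 PCA components are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Random stand-in for real embeddings: 100 points, 1536 dimensions (assumed size).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1536))

# Step 1: PCA compresses each point to 50 dimensions (an illustrative choice).
reduced = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Step 2: t-SNE maps the reduced points to 2D coordinates for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(reduced)

print(reduced.shape, coords.shape)
```

With only 100 points, either path finishes quickly, which is consistent with skipping PCA in the cell above.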

Now we do the same using the [UMAP](https://umap-learn.readthedocs.io/en/latest/) algorithm, coloring by stars. Because the stars are read as numbers, the data is plotted using a continuous color scale.

In [None]:
from canica.widget import CanicaUMAP

CanicaUMAP(df, embedding_col="embedding", text_col="text", hue_col="stars")
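
Since the color scale depends on whether the coloring variable is a string or a number, the column's dtype is the thing to check or change before plotting. The sketch below uses only pandas on a small stand-in DataFrame with made-up values. The column name `stars_label` is hypothetical, and whether canica then renders a discrete scale for it is an assumption based on the behavior described in this tutorial, not something verified here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small stand-in DataFrame with made-up values.
toy_df = pd.DataFrame({"language": ["en", "de", "en"], "stars": [5, 1, 3]})

# Strings are not numeric; integer star ratings are.
language_is_numeric = is_numeric_dtype(toy_df["language"])
stars_is_numeric = is_numeric_dtype(toy_df["stars"])

# Casting to string gives a non-numeric copy of the ratings.
toy_df["stars_label"] = toy_df["stars"].astype(str)
label_is_numeric = is_numeric_dtype(toy_df["stars_label"])

print(language_is_numeric, stars_is_numeric, label_is_numeric)
```

With the real data, passing `hue_col="stars_label"` after the same cast would be the natural thing to try if a discrete scale for the star ratings is preferred.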