# 🎨 Visualizing Multi-Year Vector Embeddings (2020-2024)

In this notebook, we visualize how the semantic space changed over five years of Warren Buffett's letters. We will:
1. **Extract** vectors and metadata (`year`) from LanceDB.
2. **Reduce** dimensions using **PCA**.
3. **Visualize** the clusters color-coded by year.

---

### 1. Install Dependencies
Install the visualization stack with `pip3`.

In [None]:
!pip3 install lancedb pandas scikit-learn plotly

### 2. Load Multi-Year Data
Connect to the local LanceDB database, open the `buffett_letters_multi` table, and load it into a pandas DataFrame.

In [None]:
import lancedb
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import plotly.express as px

db = lancedb.connect("./lancedb_data")
table = db.open_table("buffett_letters_multi")

df = table.to_pandas()
print(f"Loaded {len(df)} records across years: {df['year'].unique()}")

### 3. Dimensionality Reduction
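PCA projects the 384-dimensional embedding vectors onto the two directions of greatest variance, giving each record an `x` and `y` coordinate for plotting.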

In [None]:
vectors = np.stack(df['vector'].values)
pca = PCA(n_components=2)
reduced = pca.fit_transform(vectors)

df['x'] = reduced[:, 0]
df['y'] = reduced[:, 1]
print("PCA reduction complete.")
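
Before reading too much into the 2-D plot, it helps to know how much of the original variance the two components keep. The sketch below is a NumPy-only illustration of what PCA computes, run on random stand-in data (`demo_vectors` is hypothetical, not the Buffett embeddings). In the notebook itself, the fitted `pca` object reports the same quantity as `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical stand-in data: 200 random 384-dimensional vectors.
# In the notebook, the `vectors` array built above plays this role.
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(200, 384))

# PCA centers the data, then projects it onto the top singular directions.
centered = demo_vectors - demo_vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance along each component is proportional to its squared singular value.
variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Coordinates on the first two components (the 2-D points a plot would show).
demo_reduced = centered @ components[:2].T

print(f"PC 1: {variance_ratio[0]:.1%}, PC 2: {variance_ratio[1]:.1%}")
print(f"Variance kept in 2-D: {variance_ratio[:2].sum():.1%}")
```

If the two components keep only a small share of the variance, points that overlap in the plot may still be well separated in the full 384-dimensional space. The coordinates from this sketch can differ from scikit-learn's by a sign flip per axis, which does not affect the shape of the plot.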

### 4. Interactive Visualization with Color Coding
We use the `year` column to color-code the points, allowing students to see if certain years have unique semantic clusters.

In [None]:
# Convert year to string so it's treated as a discrete category by Plotly
df['year_str'] = df['year'].astype(str)

fig = px.scatter(
    df,
    x='x',
    y='y',
    color='year_str',  # Color-code by year
    hover_data=['year', 'text'],
    title="Evolution of Warren Buffett's Letters (2020-2024)",
    labels={'x': 'PC 1', 'y': 'PC 2', 'year_str': 'Year'}
)

fig.update_traces(marker=dict(size=7, opacity=0.6))
fig.update_layout(template="plotly_white")
fig.show()

### 5. Lab Observations
- **Overlapping Clusters**: Most themes (insurance, general management) recur every year, so their points overlap regardless of color.
- **Year-Specific Clusters**: Can you find clusters dominated by a single color (e.g., 2020)? These might be unique topics like pandemic response or specific acquisitions.
- **Semantic Drift**: Does the overall "shape" of the data change over time? In this small dataset, we look for thematic consistency versus change.
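
One way to put a number on the drift question above is to compare each year's average embedding with the next year's. The following is a minimal sketch under stated assumptions: `demo_years` and `demo_vectors` are random stand-ins, and `year_centroids` and `cosine_similarity` are hypothetical helpers written for this illustration, not part of LanceDB or scikit-learn. In the notebook, you would pass `df['year'].to_numpy()` and the `vectors` array instead.

```python
import numpy as np

# Hypothetical stand-in data: 50 random 384-dimensional vectors per year.
rng = np.random.default_rng(0)
demo_years = np.repeat([2020, 2021, 2022], 50)
demo_vectors = rng.normal(size=(150, 384))


def year_centroids(vectors, years):
    """Return {year: mean embedding} for each distinct year."""
    return {int(y): vectors[years == y].mean(axis=0) for y in np.unique(years)}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


centroids = year_centroids(demo_vectors, demo_years)
ordered = sorted(centroids)

# Compare each year's centroid with the following year's.
for prev, curr in zip(ordered, ordered[1:]):
    sim = cosine_similarity(centroids[prev], centroids[curr])
    print(f"{prev} -> {curr}: centroid cosine similarity = {sim:.3f}")
```

Working in the original embedding space rather than on the 2-D PCA coordinates avoids losing the variance that the projection discards. A similarity close to 1 between consecutive years would suggest stable themes, while a noticeably lower value would be worth investigating by reading the chunks from that year.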