# Fractal Embeddings Demo Notebook

This notebook walks through some of the functionality in this repo, with example results.

#### Main Point of This File
- To describe code functionality 
- To steal code snippets and shell commands
- A playground for development

# Creating Embeddings from Text Directories

The `embed_text_cohere.py` script takes a directory containing text files and generates embeddings using Cohere's API. The embeddings are saved as a DataFrame in a `.csv` file. This script supports recursive search for `.md` files in the input directory and can handle API rate limits.
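
The recursive `.md` search can be sketched with `pathlib`. This is a minimal illustration, not the script's actual code; the function name `find_markdown_files` is hypothetical.

```python
from pathlib import Path


def find_markdown_files(input_dir):
    """Return every .md file under input_dir (searched recursively), sorted by path.

    Hypothetical helper for illustration; the real script may collect files differently.
    """
    return sorted(Path(input_dir).rglob("*.md"))
```

Calling `find_markdown_files("./data/memory")` would return a list of `Path` objects, one per Markdown file found at any depth below that directory.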

Here's an example of how to run the script from the command line:

```bash
python src/embed_text_cohere.py -i ./data -e output_embeddings.csv -c config.ini
```

Review the script for the full list of command-line arguments. Briefly: `-i` is the input directory, `-e` is the output embedding `.csv` file, `-n` additionally saves an `.npz` file, and `-c` points to the `config.ini` that holds the API key.
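
As a rough picture of how these flags fit together, here is an `argparse` sketch. Only the short flags (`-i`, `-e`, `-n`, `-c`) come from the commands in this notebook; the destination names, help strings, and defaults are assumptions, so check the script for the real definitions.

```python
import argparse


def build_parser():
    """Illustrative parser; dest names and defaults are assumptions, not the script's."""
    parser = argparse.ArgumentParser(description="Embed a directory of text files (sketch).")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="directory containing the text files")
    parser.add_argument("-e", dest="embedding_file", required=True,
                        help="path of the output .csv file")
    parser.add_argument("-n", dest="save_npz", action="store_true",
                        help="also save the embeddings as an .npz file")
    parser.add_argument("-c", dest="config", default="config.ini",
                        help="config file that holds the API key")
    return parser


# Parse the same arguments used for the memory directory in this notebook.
args = build_parser().parse_args(
    ["-i", "./data/memory", "-e", "./demo_data/memory_embeddings.csv", "-n", "-c", "config.ini"]
)
```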

Now let's run the script using the provided data directories.

In [8]:
!python src/embed_text_cohere.py -i ./data/memory -e ./demo_data/memory_embeddings.csv -n -c config.ini
!python src/embed_text_cohere.py -i ./data/my-second-brain -e ./demo_data/my_second_brain_embeddings.csv -n -c config.ini

Embedding text: 100%|█████████████████████████████| 6/6 [00:11<00:00,  1.88s/it]
Embedding text: 100%|█████████████████████████████| 7/7 [00:14<00:00,  2.08s/it]


After running the above commands, you should have two `.csv` files containing embeddings for your data: `memory_embeddings.csv` and `my_second_brain_embeddings.csv`. Because `-n` was passed, matching `.npz` files are written as well; these are used in the dimensionality reduction step. You can load the `.csv` files using pandas and visualize the embeddings or perform further analysis.

In [2]:
import pandas as pd

memory_embeddings_df = pd.read_csv('./demo_data/memory_embeddings.csv')
my_second_brain_embeddings_df = pd.read_csv('./demo_data/my_second_brain_embeddings.csv')

memory_embeddings_df.head()


Unnamed: 0,filename,index,chunk_text,embedding,links
0,Heretics of Dune,0,# Heretics of Dune ![rw-book-cover](https://im...,"[1.4345703, -0.37695312, 0.18273926, -0.636718...","['Frank Herbert', 'Books', 'Psychedelics', 'Sc..."
1,CSC 581,1,"--- **Status::** #🗺️ **Tags::** [[MOC]], [[Win...","[0.92285156, -0.5053711, 0.7392578, 0.30615234...","['MOC', 'Winter 2023']"
2,Tensorflow Mac M1,2,## AAAh [Good post on SO](https://stackoverflo...,"[1.8417969, 0.6347656, 0.027954102, 1.9462891,...",['Programming Notes']
3,Tensorflow Mac M1,2,optimizer=tf.keras.optimizers.legacy.Adam(lear...,"[3.5527344, 0.11090088, 0.33618164, 1.2988281,...",['Programming Notes']
4,Snowflake,3,## Snowflake Here's to you - You doubters of m...,"[0.80810547, 1.3378906, -0.18164062, 0.0483093...","['Poetry', 'My Writings']"


In [3]:
my_second_brain_embeddings_df.head()

Unnamed: 0,filename,index,chunk_text,embedding,links
0,Welcome in my mind 🧠,0,"## Who I am? I'm **Anthony**, a `Date.today.ye...","[0.8691406, -0.91748047, 1.21875, -1.6054688, ...",[]
1,Welcome in my mind 🧠,0,want. Here's a hint: you can just watch the su...,"[0.75634766, -0.33984375, 0.22790527, -1.01953...",[]
2,README,1,*You'll have a better browsing experience of t...,"[1.0322266, -1.8320312, 0.36694336, -0.9697265...",[]
3,README,1,deep into a wide variety of fields and engage ...,"[1.2236328, -0.5444336, -0.43579102, 0.4089355...",[]
4,Contact me 💌,2,## Want to get in touch? 😊 I'd love to hear fr...,"[-0.6777344, 0.38598633, 1.8916016, -2.1796875...",[]
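
When a DataFrame with list-valued columns is written to `.csv`, pandas stores each list as its string representation, so after `pd.read_csv` the `embedding` and `links` columns are typically strings rather than lists. Below is a minimal sketch of converting them back, assuming that is how the script wrote them. The sample rows are made up and only mimic the column layout shown above.

```python
import ast
import io

import numpy as np
import pandas as pd

# Made-up stand-in for a CSV produced by the script (same column names, fake values).
csv_text = '''filename,index,chunk_text,embedding,links
note_a,0,first chunk,"[0.1, 0.2, 0.3]","['Books']"
note_b,1,second chunk,"[0.4, 0.5, 0.6]",[]
'''
df = pd.read_csv(io.StringIO(csv_text))

# Turn the string representations back into Python lists.
df["embedding"] = df["embedding"].apply(ast.literal_eval)
df["links"] = df["links"].apply(ast.literal_eval)

# Stack the per-chunk vectors into one (n_chunks, n_dims) array.
embedding_matrix = np.array(df["embedding"].tolist())
```

The same two `apply` calls should work on `memory_embeddings_df` and `my_second_brain_embeddings_df` if their columns are stored this way.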


# Dimensionality Reduction

We'll perform dimensionality reduction on the embeddings to visualize them in a lower-dimensional space. The `dimensionality_reduction.py` script supports PCA, t-SNE, and UMAP.
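
To make the idea concrete, here is a minimal sketch of PCA computed with a plain numpy SVD on random stand-in data. It is not the script's implementation (which may use a library such as scikit-learn); the function name `pca_reduce` and the array sizes are assumptions for illustration.

```python
import numpy as np


def pca_reduce(embeddings, n_components):
    """Project the rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for real embeddings: 20 chunks, 16 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 16))
reduced = pca_reduce(fake_embeddings, n_components=5)  # shape (20, 5)
```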

### Combining Embeddings

Before performing dimensionality reduction, let's combine the two sets of embeddings into a single file. We'll use numpy to load, concatenate, and save the embeddings.

In [None]:
import numpy as np

memory_embeddings = np.load('./demo_data/memory_embeddings.npz', allow_pickle=True)
my_second_brain_embeddings = np.load('./demo_data/my_second_brain_embeddings.npz', allow_pickle=True)

combined_embeddings = np.concatenate((memory_embeddings['embeddings'], my_second_brain_embeddings['embeddings']), axis=0)
combined_filenames = np.concatenate((memory_embeddings['filenames'], my_second_brain_embeddings['filenames']), axis=0)
np.savez('./demo_data/combined_embeddings.npz', filenames=combined_filenames, embeddings=combined_embeddings)
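
A quick sanity check after concatenating is to confirm that there is one filename per embedding row and that the data survives a save and reload. The sketch below uses small made-up arrays in place of the real files; the key names `filenames` and `embeddings` match the cell above.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the two embedding sets: 3 and 2 chunks, 4 dimensions each.
emb_a, names_a = np.random.rand(3, 4), np.array(["a1", "a2", "a3"])
emb_b, names_b = np.random.rand(2, 4), np.array(["b1", "b2"])

combined_embeddings = np.concatenate((emb_a, emb_b), axis=0)
combined_filenames = np.concatenate((names_a, names_b), axis=0)

# One filename per embedding row; the embedding width is unchanged.
rows_match = len(combined_filenames) == combined_embeddings.shape[0]

# Round-trip through an .npz file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "combined.npz")
    np.savez(path, filenames=combined_filenames, embeddings=combined_embeddings)
    with np.load(path) as loaded:
        roundtrip_ok = np.array_equal(loaded["embeddings"], combined_embeddings)
```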

### Running Dimensionality Reduction

Now we'll run the `dimensionality_reduction.py` script on the combined embeddings. You can choose which of the PCA, t-SNE, and UMAP methods to run with the `-r` argument; by default, all of them run.

A rundown of the following command: `-e` is the input embeddings `.npz` file, `-o` is the output `.npz` file for the reduced embeddings, and `-p` plots the results.

In [30]:
!python src/dimensionality_reduction.py -e demo_data/combined_embeddings.npz -o demo_data/combined_reduced_embeddings.npz -p

Running PCA 5
Running t-SNE 2
Running UMAP 5
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Running UMAP 2
Figure(640x480)
Figure(640x480)
Figure(640x480)
Figure(640x480)


After running the above command, you should have an NPZ file `demo_data/combined_reduced_embeddings.npz` containing reduced embeddings for the combined dataset. You can load this file using numpy and visualize the embeddings or perform further analysis.

In [None]:
reduced_embeddings = np.load('demo_data/combined_reduced_embeddings.npz')

# Example of accessing the reduced embeddings
pca5_embeddings = reduced_embeddings['pca5']
tsne2_embeddings = reduced_embeddings['tsne2']
umap5_embeddings = reduced_embeddings['umap5']
umap2_embeddings = reduced_embeddings['umap2']
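
To see which reductions a file contains without knowing the keys in advance, you can list the archive's entries through its `files` attribute. The sketch below builds an in-memory stand-in with the key names used above; the array contents and row count are made up.

```python
import io

import numpy as np

# In-memory stand-in for combined_reduced_embeddings.npz (4 fake chunks).
buffer = io.BytesIO()
np.savez(
    buffer,
    pca5=np.zeros((4, 5)),
    tsne2=np.zeros((4, 2)),
    umap5=np.zeros((4, 5)),
    umap2=np.zeros((4, 2)),
)
buffer.seek(0)

reduced = np.load(buffer)
# Map each stored array name to its shape, e.g. to confirm the target dimensions.
shapes = {key: reduced[key].shape for key in reduced.files}
```

With the real file, replace `buffer` with the path `'demo_data/combined_reduced_embeddings.npz'`.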

# Clustering (WIP)

In [None]:
import pickle
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

# TreeNode class definition (pickle must be able to find this class when loading the tree)
class TreeNode:
    def __init__(self, left=None, right=None, filenames=None, data=None):
        self.left = left
        self.right = right
        self.filenames = filenames
        self.data = data

# Load the tree
def load_tree(input_file):
    with open(input_file, 'rb') as f:
        return pickle.load(f)

tree = load_tree('./demo_data/combined_dendrogram.pkl')

# Load the reduced embeddings (same path as in the previous section)
embedding_npz = np.load('./demo_data/combined_reduced_embeddings.npz')
