In [1]:
# Project helpers: get_api_key and generate_labels (used below) come from this module
from label_generation import *
import openai
import pickle
import json

### Get the API Key

In [2]:
config_path = "../config.ini"
openai.api_key = get_api_key(config_path)

### Generate or Load in Summaries

Label generation requires a dictionary mapping each file name to a summary of that file's contents. Summaries are costly to generate; the script in 'summarize.py' produces them. Alternatively, the whole text of each file can be passed in place of a summary. <br>

{file name : summary}
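
As a minimal sketch of the expected shape, the example below builds a small `{file name : summary}` dictionary and round-trips it through pickle, the format used by the reading cell that follows. The file names, summary strings, and the temporary path are hypothetical and chosen only for illustration; they are not taken from the project data.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical file names and summaries, for illustration only.
summaries_example = {
    "book_a.txt": "A treatise on alchemy and natural philosophy.",
    "book_b.txt": "A collection of plays with commentary.",
}

# Write the dictionary with pickle and read it back, using a temporary
# directory so that no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summaries_example"
    with open(path, "wb") as fp:
        pickle.dump(summaries_example, fp)
    with open(path, "rb") as fp:
        loaded_summaries = pickle.load(fp)

# Each key is a file name and each value is a plain string.
for file_name, summary in loaded_summaries.items():
    print(f"{file_name}: {summary}")
```

If the whole text is passed instead of a summary, the same dictionary shape would presumably apply, with the full file contents as the value.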

#### Reading in summaries

In [3]:
with open("../Data/openai_summaries_2", "rb") as fp:
    summaries = pickle.load(fp)

### Generate or Load in Embeddings

As with the summaries, there must be a dictionary mapping each file name to its embedding vector. <br>

{file name : embedding}
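
The sketch below illustrates one plausible shape for the `{file name : embedding}` dictionary when it is stored as JSON, where each vector becomes a list of floats. The file names and three-dimensional vectors are hypothetical (real embedding vectors are much longer), and the final key check assumes that the summaries and embeddings should cover the same files, which this notebook does not state explicitly.

```python
import json

# Hypothetical file names and tiny vectors, for illustration only.
embeddings_example = {
    "book_a.txt": [0.12, -0.03, 0.57],
    "book_b.txt": [0.08, 0.44, -0.21],
}

# JSON stores each vector as a list of numbers; round-trip through a string.
serialized = json.dumps(embeddings_example)
loaded_embeddings = json.loads(serialized)

# Sanity check: every vector should have the same dimensionality.
dimensions = {len(vector) for vector in loaded_embeddings.values()}
same_dimension = len(dimensions) == 1

# Assumed requirement: both dictionaries are keyed by the same file names.
summary_keys = {"book_a.txt", "book_b.txt"}  # hypothetical summary keys
missing = summary_keys - set(loaded_embeddings)

print("same dimension:", same_dimension)
print("files without an embedding:", sorted(missing))
```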

#### Reading in Embeddings

In [4]:
with open("../Helper Stuff/embeddings.json") as f:
    embeddings = json.load(f)

### Generating Labels

With these inputs, a linkage matrix containing labels will be generated. The contents of each row in the linkage matrix are summarized below: <br> <br>
Index 0: Contains the index of cluster 1 used to create the new cluster <br>
Index 1: Contains the index of cluster 2 used to create the new cluster <br>
Index 2: Contains the distance between cluster 1 and cluster 2 <br>
Index 3: Contains the number of leaves contained within this cluster <br>
Index 4: Contains the label for this cluster <br>
Index 5: Contains the labels of this cluster's children, which were used to generate the label for this cluster <br>
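
To make the row layout above easier to work with, the sketch below unpacks one row into named fields. The `LinkRow` class and its field names are hypothetical conveniences and not part of `label_generation`; the sample row is copied from the output shown in this notebook.

```python
from typing import List, NamedTuple


class LinkRow(NamedTuple):
    """Hypothetical named view over one row of the labeled linkage matrix."""
    cluster_1: float         # Index 0: first cluster merged
    cluster_2: float         # Index 1: second cluster merged
    distance: float          # Index 2: distance between the two clusters
    n_leaves: float          # Index 3: number of leaves in the new cluster
    label: str               # Index 4: label for the new cluster
    child_labels: List[str]  # Index 5: labels of the two children


# One row as it appears in the notebook output.
row = [
    399.0,
    594.0,
    0.42786448395325166,
    3.0,
    "Esoteric knowledge.",
    ["Book - Philosophy - Alchemy", "Occult literature"],
]

link = LinkRow(*row)
print(f"{link.label} <- {link.child_labels}")
print(f"distance={link.distance:.3f}, leaves={int(link.n_leaves)}")
```

Applied to the full result, `[LinkRow(*r) for r in links]` would give the same named access for every row, assuming every row has exactly these six entries.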

In [5]:
links = generate_labels(embeddings, summaries)

100%|██████████| 593/593 [07:51<00:00,  1.26it/s]


### Output

In [6]:
links

[[399.0,
  594.0,
  0.42786448395325166,
  3.0,
  'Esoteric knowledge.',
  ['Book - Philosophy - Alchemy', 'Occult literature']],
 [204.0,
  595.0,
  0.15685175833221093,
  3.0,
  'Creative expression.',
  ['Category: Theater.', 'Literature/renowned/authors.']],
 [303.0,
  596.0,
  0.1652177944881287,
  4.0,
  'Literary interpretation.',
  ['Shakespearean play analysis.', 'Creative expression.']],
 [303.0,
  596.0,
  0.1652177944881287,
  4.0,
  'Literary interpretation.',
  ['Shakespearean play analysis.', 'Creative expression.']],
 [277.0,
  379.0,
  0.1695001531610153,
  2.0,
  'Russian literature.',
  ['Literature, Russian, Free.', 'Book, Fiction, Russian-Language.']],
 [288.0,
  599.0,
  0.4305627017428426,
  3.0,
  'Innovation.',
  ['Engineering book.', 'Technology.']],
 [216.0,
  407.0,
  0.19415253415551106,
  2.0,
  'Predictions.',
  ['Prophecy book.', 'Prophecies.']],
 [243.0,
  601.0,
  0.4949923818620001,
  3.0,
  'Knowledge.',
  ['Book - Bernard Werber - Encyclopedia', 'Sc

### Saving Output

In [9]:
with open("../Data/linkage.pkl", "wb") as f:
    pickle.dump(links, f)

### Loading Output

In [10]:
with open("../Data/linkage.pkl", "rb") as f:
    links_read = pickle.load(f)
links_read

[[399.0,
  594.0,
  0.42786448395325166,
  3.0,
  'Esoteric knowledge.',
  ['Book - Philosophy - Alchemy', 'Occult literature']],
 [204.0,
  595.0,
  0.15685175833221093,
  3.0,
  'Creative expression.',
  ['Category: Theater.', 'Literature/renowned/authors.']],
 [303.0,
  596.0,
  0.1652177944881287,
  4.0,
  'Literary interpretation.',
  ['Shakespearean play analysis.', 'Creative expression.']],
 [303.0,
  596.0,
  0.1652177944881287,
  4.0,
  'Literary interpretation.',
  ['Shakespearean play analysis.', 'Creative expression.']],
 [277.0,
  379.0,
  0.1695001531610153,
  2.0,
  'Russian literature.',
  ['Literature, Russian, Free.', 'Book, Fiction, Russian-Language.']],
 [288.0,
  599.0,
  0.4305627017428426,
  3.0,
  'Innovation.',
  ['Engineering book.', 'Technology.']],
 [216.0,
  407.0,
  0.19415253415551106,
  2.0,
  'Predictions.',
  ['Prophecy book.', 'Prophecies.']],
 [243.0,
  601.0,
  0.4949923818620001,
  3.0,
  'Knowledge.',
  ['Book - Bernard Werber - Encyclopedia', 'Sc