# paragraphle / chunkle
- maybe there could be an optional hint where you see one of the most related paragraphs across all articles
- connections-style UI?
- genius.com?

- Maybe clustering will help?
- Maybe show users a list of words that are common between all the matching passages!
- Show cosine dist as a percentile of all the passages in the dataset?
- Or randomly play out a series of guesses and look at the distribution of scores?
- Have both modes. Match start or substring.
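
The percentile idea above could be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: `cosine_distances`, `distance_percentile`, and the random toy vectors are hypothetical stand-ins for the real `embeddings` table.

```python
import numpy as np

def cosine_distances(vectors, target):
    """Cosine distance from each row of `vectors` to `target`."""
    vectors = np.asarray(vectors, dtype=float)
    target = np.asarray(target, dtype=float)
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))
    return 1.0 - sims

def distance_percentile(score, all_scores):
    """Percentage of passages that are farther from the target than `score`."""
    all_scores = np.asarray(all_scores, dtype=float)
    return 100.0 * np.mean(all_scores > score)

# Toy data standing in for every passage embedding in the dataset (assumption).
rng = np.random.default_rng(0)
all_vectors = rng.normal(size=(1000, 8))
target_vector = rng.normal(size=8)

all_scores = cosine_distances(all_vectors, target_vector)
best_percentile = distance_percentile(all_scores.min(), all_scores)
print(f"Closest passage beats {best_percentile:.1f}% of the dataset")
```

A percentile is easier for a player to read than a raw distance such as 0.57, because it says how the guess ranks against everything else in the dataset.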


In [1]:
!pip install --upgrade openai



In [2]:
import sqlite3
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
import nltk
from nltk.corpus import stopwords
from openai import OpenAI
from dotenv import load_dotenv
import os 
import copy

load_dotenv()

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jackv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
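def execute_sql(query, params=None):
    # Open a fresh connection per query and close it even if the query fails.
    conn = sqlite3.connect('data/data.db')
    try:
        return pd.read_sql(query, conn, params=params).to_dict(orient='records')
    finally:
        conn.close()

def get_random_target():
    # Pick one random article and return one row per chunk with its embedding.
    # Joining chunks on chunk_id as well pairs each embedding only with its own chunk.
    return execute_sql("""
        select article_id, vector, title, url, chunk
        from embeddings
        join (
            select article_id, title, url
            from articles
            order by random()
            limit 1
        ) as a
            using (article_id)
        join chunks
            using (article_id, chunk_id)
    """)

def get_ai_target(article_id):
    # Bind article_id as a query parameter instead of formatting it into the SQL.
    # int() converts NumPy integers, which sqlite3 does not bind as integers.
    return execute_sql("""
        select article_id, vector, title, url, chunk
        from embeddings e
        join articles a
            using (article_id)
        join chunks c
            using (article_id, chunk_id)
        where article_id = ?
    """, params=(int(article_id),))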

In [5]:
guesses = []

potential_targets = []
with open('data/ai_targets.txt', 'r') as file:
    for line in file:
        potential_targets.append(int(line.strip()))

target_article_id = np.random.choice(potential_targets)
target = get_ai_target(target_article_id)
target_embeddings = np.array([np.frombuffer(x['vector']) for x in target])
mean_target_embedding = target_embeddings.mean(axis=0)
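
One caveat about the decoding step above: `np.frombuffer` interprets bytes as `float64` unless told otherwise, so it only recovers the vectors correctly if they were written as `float64` bytes. How the `vector` column was actually populated is not shown in this notebook, so the round trip below is a sketch under that assumption; `encode_vector` and `decode_vector` are hypothetical helpers.

```python
import numpy as np

def encode_vector(vector, dtype=np.float64):
    """Serialize a vector to raw bytes, e.g. for a SQLite BLOB column."""
    return np.asarray(vector, dtype=dtype).tobytes()

def decode_vector(blob, dtype=np.float64):
    """Deserialize raw bytes back into a 1-D array of the given dtype."""
    return np.frombuffer(blob, dtype=dtype)

original = np.array([0.25, -1.5, 3.0])
blob = encode_vector(original)

restored = decode_vector(blob)
# Decoding with the wrong dtype does not raise; it silently yields different values.
wrong = decode_vector(blob, dtype=np.float32)

print(restored.shape, wrong.shape)  # (3,) (6,)
```

Passing `dtype` explicitly on both sides makes a mismatch easier to spot, since a wrong dtype changes the array length rather than raising an error.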

In [6]:
from scipy.spatial import distance

def get_matching_titles(substring):
    # Escape single quotes so a search such as "rock 'n' roll" does not break the query.
    substring = substring.lower().strip().replace("'", "''")
    return execute_sql(f"""
        select distinct article_id, title
        from articles
        where clean_title like '%{substring}%'
    """)

def guess(substring):
    # Return up to five matching articles, sampled at random when there are five or more.
    matches = get_matching_titles(substring)
    if len(matches) >= 5:
        return list(np.random.choice(matches, 5, replace=False))
    return matches

def top_chunks(guess_id, mean_target_embedding, target_article_id):
    if guess_id == target_article_id:
        print('you got it!')
    
    guess_data = execute_sql(f"""
        select c.chunk_id, c.article_id, e.vector, c.chunk, title
        from (select * from embeddings where article_id == {guess_id}) as e
        join (select * from chunks where article_id == {guess_id}) c
            using (chunk_id)
        join (select * from articles where article_id == {guess_id}) a
            using (article_id)
    """)

    # Decode each stored vector and score it by cosine distance to the target (lower is closer).
    for row in guess_data:
        row['vector'] = np.frombuffer(row['vector'])
        row['score'] = distance.cosine(row['vector'], mean_target_embedding)

    # Keep the two closest chunks.
    return sorted(guess_data, key=lambda x: x['score'])[:2]

def print_text(text, line_size=100):
    start = 0
    end = line_size
    while end < len(text):
        print(text[start : end])
        start += line_size
        end += line_size
    print(text[start :])

def submit_guess(guess_id):
    top = top_chunks(guess_id, mean_target_embedding, target_article_id)  
    for row in top:
        guesses.append(row)
        print(f"Score: {row['score']}\n")
        print_text(row['chunk'])
        print()
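
The title lookup in this cell builds its `LIKE` pattern with string formatting. An alternative is to bind the pattern as a query parameter, so quotes in the search text need no special handling. The sketch below uses an in-memory database with a made-up `articles` table that mirrors the column names used above; `find_titles` and the sample rows are hypothetical.

```python
import sqlite3

def find_titles(conn, substring):
    """Return articles whose clean_title contains `substring` (case-normalized)."""
    pattern = f"%{substring.lower().strip()}%"
    rows = conn.execute(
        "select distinct article_id, title from articles where clean_title like ?",
        (pattern,),  # the pattern is bound as data, never spliced into the SQL text
    ).fetchall()
    return [{"article_id": r[0], "title": r[1]} for r in rows]

# Toy table standing in for data/data.db (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table articles (article_id integer, title text, clean_title text)")
conn.executemany(
    "insert into articles values (?, ?, ?)",
    [
        (1, "Oceanic climate", "oceanic climate"),
        (2, "Microclimate", "microclimate"),
        (3, "Rock 'n' roll", "rock 'n' roll"),
    ],
)

climate_hits = find_titles(conn, "climate")
quote_hits = find_titles(conn, "'n'")
conn.close()

print(climate_hits)
print(quote_hits)
```

Note that `%` and `_` in the search text still act as `LIKE` wildcards here; escaping them would need an `ESCAPE` clause.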


In [121]:
g = guess('climate')
for i, x in enumerate(g):
    print(f"{i}) {x['title']}")

0) Oceanic climate
1) Department of Climate Change, Energy, the Environment and Water
2) Microclimate
3) Climate of Egypt
4) Temperate climate


In [122]:
i = 1
submit_guess(g[i]['article_id'])

you got it!
Score: 0.0

The Department of Climate Change, Energy, the Environment and Water (DCCEEW) is a department of the 
Australian Government. The department was established on 1 July 2022, superseding the water and envi
ronment functions from the Department of Agriculture, Water and the Environment and energy functions
 from the Department of Industry, Science, Energy and Resources. The current and inaugural head of t
he department is the Secretary, David Fredericks. References 2022 establishments in Australia Austra
lia, Climate Change, Energy,



In [116]:
# Deduplicate the guessed chunks and keep the ten closest to the target.
top_guesses = sorted(set((x['score'], x['chunk'], x['title']) for x in guesses), key=lambda x: x[0])[:10]
for score, text, title in top_guesses:
    print(f'\n{title} | Score: {score}\n')
    print_text(text)


University of Melbourne | Score: 0.5746451492093343

of Land and Environment was disestablished on 1January 2015. Its agriculture and food systems depart
ment moved alongside veterinary science to form the Faculty of Veterinary and Agricultural Sciences,
 while other areas of study, including horticulture, forestry, geography and resource management, mo
ved to the Faculty of Science in two new departments. In 2019, allegations of a toxic workplace cult
ure within the Faculty of Arts were aired, with a number of senior staff leaving their positions. At

University of Melbourne | Score: 0.5990288854272678

which led to the sacking of 500 administrative staff and some administrative responsibilities being 
transferred to academic staff. At the same time in the ten years to 2018 the university embarked on 
a large capital works program, spending $2 billion on new buildings across the university's campuses
. The Melbourne School of Land and Environment was disestablished on 1January 2015. 

In [99]:
# Tokens are split on whitespace only, so trailing punctuation stays attached (e.g. "government.").
guess_words = [x.strip().lower() for x in ' '.join([x[1] for x in top_guesses]).split()]
# Lowercase before filtering so capitalized stop words such as "The" are removed too.
guess_words = [x for x in guess_words if x not in stop_words]
guess_map = {token: count / len(guess_words) for token, count in Counter(guess_words).items()}

target_words = [x.strip().lower() for x in ' '.join([x['chunk'] for x in target]).split()]
target_words = [x for x in target_words if x not in stop_words]
target_map = {token: count / len(target_words) for token, count in Counter(target_words).items()}

# Score each guess token by the product of its relative frequencies in the guesses and the target.
matches = {token: score * target_map.get(token, 0) for token, score in guess_map.items()}

for i, (token, _) in enumerate(sorted(matches.items(), key=lambda x: x[1], reverse=True)[:10]):
    print(f"{i + 1}) {token}")

1) department
2) environment
3) established
4) australian
5) functions
6) government.
7) industry,
8) australia,
9) land
10) disestablished
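
The list above contains tokens such as `government.` and `industry,` because the text is split on whitespace only. One possible refinement, sketched here with hypothetical helpers (`tokenize`, `relative_frequencies`, `shared_word_scores`) and a tiny made-up stop-word set in place of the NLTK list, is to extract word tokens with a regular expression before counting.

```python
import re
from collections import Counter

def tokenize(text, stop_words=frozenset()):
    """Lowercase, keep alphanumeric words (with optional apostrophe suffix), drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in stop_words]

def relative_frequencies(tokens):
    """Map each token to its share of all tokens."""
    total = len(tokens)
    return {tok: n / total for tok, n in Counter(tokens).items()}

def shared_word_scores(guess_text, target_text, stop_words=frozenset()):
    """Rank tokens present in both texts by the product of their relative frequencies."""
    g = relative_frequencies(tokenize(guess_text, stop_words))
    t = relative_frequencies(tokenize(target_text, stop_words))
    shared = ((tok, g[tok] * t[tok]) for tok in g if tok in t)
    return sorted(shared, key=lambda kv: kv[1], reverse=True)

# Made-up inputs for illustration only.
demo_stop_words = {"the", "of", "a", "is", "and"}
scores = shared_word_scores(
    "The Australian Government. Industry, land and water",
    "a department of the Australian government, industry",
    demo_stop_words,
)
for rank, (token, _) in enumerate(scores, start=1):
    print(f"{rank}) {token}")
```

With punctuation stripped, `government.` and `government` count as the same word, so their frequencies are combined instead of split across two entries.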


In [11]:
# t = target[0]
# print(t['title'])
# print(t['url'])