# Project: ADHD Strategy Finder
## Introduction
This project is a prototype for the "Semantic Detective" track of the BigQuery AI Hackathon. Its mission is to solve a real-world problem by helping the neurodivergent community find relevant, crowd-sourced strategies for managing ADHD.

To demonstrate the architecture and value of a BigQuery-powered semantic search application, this notebook builds a fully functional prototype using a public Kaggle dataset. The workflow directly simulates BigQuery AI's capabilities:

- The `sentence-transformers` library is used to generate text embeddings, serving as a functional equivalent to BigQuery's `ML.GENERATE_EMBEDDING`.
- The `scikit-learn` similarity search is a direct stand-in for BigQuery's native `VECTOR_SEARCH` function.
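
To make the two substitutions above concrete, here is a minimal, dependency-free sketch of what a vector search does once embeddings exist: it ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors, the document IDs, and the `top_k` helper are illustrative assumptions, not output from any embedding model or from BigQuery.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (hypothetical values chosen for illustration only).
docs = {
    "timer_tip": [0.9, 0.1, 0.0],
    "diet_tip": [0.1, 0.9, 0.1],
    "sleep_tip": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, docs, k=2)
print(results)
```

Real embeddings have hundreds of dimensions, but the ranking step is the same idea.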

This prototype demonstrates that the approach works on a 20,000-document sample. The next step would be to migrate the same logic to BigQuery's serverless infrastructure so that it can search millions of documents in real time.

## Step 1: Loading and Unifying


In [1]:
# Imports
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import os

# --- Load All Data ---
print("Loading all four data files...")
file_paths = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename.endswith('.csv'):
            key = os.path.splitext(filename)[0].lower()
            file_paths[key] = os.path.join(dirname, filename)

posts_adhd = pd.read_csv(file_paths['adhd'])
comments_adhd = pd.read_csv(file_paths['adhd-comment'])
posts_women = pd.read_csv(file_paths['adhdwomen'])
comments_women = pd.read_csv(file_paths['adhdwomen-comment'])


# --- Create a Unified Searchable Corpus ---
print("\nCombining posts and comments into a single corpus...")
all_posts_df = pd.concat([posts_adhd, posts_women], ignore_index=True)
all_posts_df['text'] = all_posts_df['title'] + '. ' + all_posts_df['selftext'].fillna('')
posts_corpus = all_posts_df[['text']].copy()
posts_corpus['source_type'] = 'Post'

all_comments_df = pd.concat([comments_adhd, comments_women], ignore_index=True)
all_comments_df.rename(columns={'body': 'text'}, inplace=True)
comments_corpus = all_comments_df[['text']].copy()
comments_corpus['source_type'] = 'Comment'

corpus_df = pd.concat([posts_corpus, comments_corpus], ignore_index=True)



Loading all four data files...



Combining posts and comments into a single corpus...
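
The loader above indexes each CSV by its lowercased file stem and then looks up four fixed keys, so a missing or renamed file surfaces only as a bare `KeyError`. A small guard can report which datasets are absent. This is a sketch: `discover_csvs` and `require_keys` are hypothetical helper names, and the demo uses a temporary directory with made-up file names rather than `/kaggle/input`.

```python
import os
import tempfile

def discover_csvs(root):
    """Map lowercased file stem -> full path for every .csv under root."""
    paths = {}
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".csv"):
                key = os.path.splitext(filename)[0].lower()
                paths[key] = os.path.join(dirname, filename)
    return paths

def require_keys(paths, expected):
    """Raise a descriptive error if any expected dataset is missing."""
    missing = sorted(set(expected) - set(paths))
    if missing:
        raise FileNotFoundError(f"Missing CSV files for: {missing}")
    return paths

# Demo against a throwaway directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as root:
    for name in ["ADHD.csv", "ADHD-comment.csv", "notes.txt"]:
        open(os.path.join(root, name), "w").close()
    found = discover_csvs(root)
    found_keys = sorted(found)
    try:
        require_keys(found, ["adhd", "adhd-comment", "adhdwomen"])
        error_message = None
    except FileNotFoundError as exc:
        error_message = str(exc)

print(found_keys)
print(error_message)
```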


## Step 2: Cleaning the Unified Corpus

In [2]:
# --- Clean the Unified Corpus ---
print("\nCleaning the unified corpus...")
df_cleaned = corpus_df.dropna(subset=['text']).copy()
df_cleaned = df_cleaned[df_cleaned['text'].apply(type) == str]

# Robust filter that checks for substrings
# Raw string so that \[ and \] reach the regex engine without invalid-escape warnings
df_cleaned = df_cleaned[~df_cleaned['text'].str.contains(r'\[deleted\]|\[removed\]', case=False, regex=True, na=False)]

df_cleaned = df_cleaned[df_cleaned['text'].str.len() > 50]
print(f"Total searchable documents after improved cleaning: {len(df_cleaned)}")


Cleaning the unified corpus...
Total searchable documents after improved cleaning: 3171924
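
The three cleaning rules can also be expressed as a single predicate, which makes them easy to unit-test before running them on millions of rows. The sketch below is pure Python; `is_searchable`, `PLACEHOLDER`, and `MIN_LENGTH` are names introduced here for illustration, with the threshold taken from the `> 50` filter above, and the sample strings are invented.

```python
import re

# Matches the placeholder markers anywhere in the text, case-insensitively.
PLACEHOLDER = re.compile(r"\[deleted\]|\[removed\]", re.IGNORECASE)
MIN_LENGTH = 50  # documents must be strictly longer than this

def is_searchable(text):
    """Return True if text passes the same filters as the cleaning cell."""
    if not isinstance(text, str):
        return False
    if PLACEHOLDER.search(text):
        return False
    return len(text) > MIN_LENGTH

samples = [
    None,
    "[deleted]",
    "Too short to be useful.",
    "Body doubling works for me: I join a video call and we both work silently.",
    "[Removed] " + "x" * 60,
]
kept = [s for s in samples if is_searchable(s)]
print(len(kept))
```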


In [3]:
# --- Export data for BigQuery AI ---
print("\nExporting corpus...")
df_cleaned.to_csv('/kaggle/working/cleaned_adhd_corpus.csv', index=False)


Exporting corpus...


## Step 3: Creating Embeddings for a Sample

In [4]:
# --- Create Embeddings for a Sample ---
print("\nPreparing to generate embeddings...")
sample_size = 20000
df_sample = df_cleaned.sample(n=sample_size, random_state=42).copy()

print(f"Working with a sample of {len(df_sample)} documents.")
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df_sample['text'].tolist(), show_progress_bar=True)
print(f"Embeddings created with shape: {embeddings.shape}")


Preparing to generate embeddings...
Working with a sample of 20000 documents.



Embeddings created with shape: (20000, 384)
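
Because cosine similarity is a dot product between unit-length vectors, one optional refinement is to L2-normalize the embedding matrix once, so that each later query needs only a matrix-vector product. The sketch below uses random stand-in data with the same width (384) as the embeddings above; `l2_normalize` is a hypothetical helper, not part of `sentence-transformers`.

```python
import numpy as np

def l2_normalize(matrix, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 384))   # stand-in for document embeddings
query_vec = rng.normal(size=(1, 384))  # stand-in for a query embedding

unit_docs = l2_normalize(doc_vecs)
unit_query = l2_normalize(query_vec)

# Dot product of unit vectors equals cosine similarity.
fast_scores = unit_docs @ unit_query[0]

# Reference computation straight from the cosine definition.
reference = (doc_vecs @ query_vec[0]) / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec[0])
)
print(np.allclose(fast_scores, reference))
```

As far as I know, `SentenceTransformer.encode` also accepts `normalize_embeddings=True`, which would make the manual step unnecessary; treat that as something to confirm against the installed version.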


## Step 4: Building the Search Function and Testing

In [5]:
# --- Final Search Function ---
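def find_similar_content(query, data_df, embeddings, top_n=5):
    """Print the top_n rows of data_df most similar to the query.

    Relies on the module-level `model` loaded in Step 3. `embeddings` must be
    row-aligned with `data_df` (row i of one corresponds to row i of the other).
    """
    print(f"\n--- Searching for top {top_n} results for query: '{query}' ---")
    # Embed the query with the same model used for the corpus.
    query_embedding = model.encode([query])
    # One similarity score per document embedding.
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    # Positions of the top_n highest scores, best first.
    top_indices = np.argsort(similarities)[-top_n:][::-1]

    for i, pos_index in enumerate(top_indices):
        # Positional lookup, because the sampled frame keeps its original index labels.
        row = data_df.iloc[pos_index]
        score = similarities[pos_index]

        print(f"\n--- Result {i+1} | Source: {row['source_type']} ---")
        print(f"Text: {row['text'][:800]}...")
        print(f"(Similarity Score: {score:.4f})")
    print("\n--- End of results ---")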


# --- Test the Final Search Function ---
# Search for 'procrastination and how to stop'
find_similar_content("procrastination and how to stop",
                     data_df=df_sample,
                     embeddings=embeddings)
# Search for 'adhd strategies'
find_similar_content("adhd strategies",
                     data_df=df_sample,
                     embeddings=embeddings)
# Search for 'help with executive function difficulties'
find_similar_content("help with executive function difficulties",
                     data_df=df_sample,
                     embeddings=embeddings)
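
The search function above fully sorts every similarity score just to pick the top few. For larger samples, a partial selection with `np.argpartition` avoids the full sort. The following is a sketch under that assumption; `top_k_indices` is a hypothetical helper, and the scores are made-up values rather than model output.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return indices of the k largest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)
    if k <= 0:
        return np.array([], dtype=int)
    # Partition so the k largest values occupy the last k slots (unordered).
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score.
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.12, 0.78, 0.33, 0.77, 0.05, 0.74])
best = top_k_indices(scores, 3)
print(best.tolist())
```

With only 20,000 documents the difference is negligible, so this matters mainly if the in-memory approach is pushed to much larger samples.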


--- Searching for top 5 results for query: 'procrastination and how to stop' ---




--- Result 1 | Source: Comment ---
Text: I'm realizing today, for the first time in my life, that I probably have ADHD (which is how I ended up on this subreddit), and this describes my experience so perfectly. I keep trying to describe my "procrastination" to people, and no one seems to get it. Anyway, I can't answer your question or give you any tips, but I hope someone else can, because I need it too!...
(Similarity Score: 0.7785)

--- Result 2 | Source: Comment ---
Text: I literally procrastinate procrastinating by thinking how much I procrastinated....
(Similarity Score: 0.7731)

--- Result 3 | Source: Comment ---
Text: One of my favourite tricks of all time is Productive Procrastination.

Last semester we had something called Social Responsibility Project where we had to work at an NGO for 30 hours. Everything went fine but I had to submit a report, it was almost done except I had to type just one more page.

My ADHD for some reason didn't allow me to do it even though a 12 yo k



--- Result 1 | Source: Comment ---
Text: Jessica McCabe has done some good videos on the subject. Check out her TED talk too. Look up “How to ADHD” on YouTube....
(Similarity Score: 0.7791)

--- Result 2 | Source: Comment ---
Text: Random question but how do you manage ADHD and a physical illness?...
(Similarity Score: 0.7659)

--- Result 3 | Source: Comment ---
Text: Take meds as prescribed. Take them daily for a reason, if you have true adhd you can’t just stop taking meds ...
(Similarity Score: 0.7427)

--- Result 4 | Source: Comment ---
Text: I speak for only me, but a part of my ADHD management includes changing my diet. That being said, diet ALONG with medications, exercise, meditation, and a host of other strategies is what I use to manage my ADHD....
(Similarity Score: 0.7260)

--- Result 5 | Source: Comment ---
Text: What really helps here is that adhd lets you see things, situations and people from many different angles. We get bored with the obvious, so we tend to look for 



--- Result 1 | Source: Comment ---
Text: Do you know anyways to help with executive dysfunction? Person experience?...
(Similarity Score: 0.7283)

--- Result 2 | Source: Comment ---
Text: Thank you for your insight; did you also struggle with executive dysfunction when trying to start tasks? Or was your struggle only finishing them? I struggle also to start tasks and if you have any advice for how to start a task I would love to hear your thoughts on that as well....
(Similarity Score: 0.7142)

--- Result 3 | Source: Comment ---
Text: Has your psychologist suggested anything that might help you other than medication ? Executive function disorder looks similar to adhd, i think you should try to pinpoint the different things you struggle with and look for solutions this way rather than stick a label onto it, especially if you can't get medication anyways....
(Similarity Score: 0.7093)

--- Result 4 | Source: Comment ---
Text: Making my executive dysfunction into executive function feels