# DREAMLENS AI: EDA & Prototype Notebook
This notebook walks through loading the dataset, quick EDA, simple preprocessing, and an initial prototype function to map dream texts to dataset interpretations and labels. It follows the development outline described in the project plan.

# Setup and load dataset

This section loads the `project/cleaned_dream_interpretations.csv` dataset and prints a few quick summaries (shape, head, most frequent `Word` values, interpretation lengths).
</VSCode.Cell>

<VSCode.Cell id="#VSC-2" language="python">
import pandas as pd
from pathlib import Path

DATA_PATH = Path('project') / 'cleaned_dream_interpretations.csv'

print('Loading dataset from', DATA_PATH)

df = pd.read_csv(DATA_PATH)

df.head()
</VSCode.Cell>

<VSCode.Cell id="#VSC-3" language="python">
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
print('\nTop words:')
print(df['Word'].value_counts().head(20))

# Distribution of interpretation lengths
df['interp_len'] = df['Interpretation'].astype(str).apply(len)
print('\nInterpretation length stats:')
print(df['interp_len'].describe())
</VSCode.Cell>

<VSCode.Cell id="#VSC-4" language="markdown">
# Simple preprocessing

- Lowercase and strip whitespace from `Word` (stored as `Word_clean`)
- Drop exact duplicates of the (`Word_clean`, `Interpretation`) pair
- Basic whitespace tokenization for quick checks
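
A minimal sketch of the three steps on a few made-up rows, including the whitespace tokenization. The `demo` frame and the `tokens` / `n_tokens` column names are assumptions for illustration; the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook applies the same steps to `df`.
demo = pd.DataFrame({
    'Word': ['Flying ', 'flying', 'Black Cat'],
    'Interpretation': ['text a', 'text a', 'text b'],
})

# Lowercase and strip, then drop rows that repeat a (word, interpretation) pair.
demo['Word_clean'] = demo['Word'].astype(str).str.lower().str.strip()
demo = demo.drop_duplicates(subset=['Word_clean', 'Interpretation']).reset_index(drop=True)

# Whitespace tokenization: str.split() with no argument splits on runs of whitespace.
demo['tokens'] = demo['Word_clean'].str.split()
demo['n_tokens'] = demo['tokens'].str.len()

print(demo[['Word_clean', 'tokens', 'n_tokens']])
```

Multi-token entries (here `black cat`) are worth a quick look, since they may behave differently from single words in the matching step.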
</VSCode.Cell>

<VSCode.Cell id="#VSC-5" language="python">
# Preprocessing: normalize `Word`, then drop duplicate (word, interpretation) pairs

df['Word_clean'] = df['Word'].astype(str).str.lower().str.strip()
df = df.drop_duplicates(subset=['Word_clean','Interpretation']).reset_index(drop=True)

print('After cleaning shape:', df.shape)

# Most frequent cleaned words (top 30)
sample = df['Word_clean'].value_counts().head(30)

sample
</VSCode.Cell>

<VSCode.Cell id="#VSC-6" language="markdown">
# Prototype: Find matches using TF-IDF

Create a small function that fits TF-IDF on the cleaned `Word` column and returns the dataset rows (word, interpretation, similarity score) closest to a free-form dream text.
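
One thing to watch with this approach: when the dream text shares no vocabulary with any `Word`, every cosine similarity is zero and the top-k rows are arbitrary. The sketch below shows one possible guard that keeps only rows with a positive score. The `toy` frame and the `find_matches_nonzero` name are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy rows for illustration only; the notebook uses the loaded dataset instead.
toy = pd.DataFrame({
    'Word': ['Flying', 'City', 'Snake'],
    'Interpretation': ['placeholder text A', 'placeholder text B', 'placeholder text C'],
})
toy['Word_clean'] = toy['Word'].str.lower().str.strip()

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy['Word_clean'])


def find_matches_nonzero(text, top_k=3):
    """Return up to top_k rows with a strictly positive cosine similarity."""
    v = toy_vectorizer.transform([text.lower()])
    sims = cosine_similarity(v, toy_X).flatten()
    order = sims.argsort()[::-1][:top_k]
    # Drop rows that share no vocabulary with the query.
    order = order[sims[order] > 0]
    return toy.iloc[order][['Word', 'Interpretation']].assign(score=sims[order])


matches = find_matches_nonzero('flying over the city and feeling free')
empty = find_matches_nonzero('an unrelated sentence')
print(matches)
print('Rows for unrelated text:', len(empty))
```

Returning an empty frame makes the "no match" case explicit, so downstream code can fall back to another strategy.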
</VSCode.Cell>

<VSCode.Cell id="#VSC-7" language="python">
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

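# Fit TF-IDF on the cleaned words; each row of X lines up with a row of df
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Word_clean'])


def find_best_match(text, top_k=3):
    """Return the top_k rows whose Word_clean is most similar to `text`."""
    v = vectorizer.transform([text.lower()])
    sims = cosine_similarity(v, X).flatten()
    # Indices of the top_k highest cosine similarities, best first
    top_idx = sims.argsort()[::-1][:top_k]
    return df.iloc[top_idx][['Word', 'Interpretation']].assign(score=sims[top_idx])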

# Try an example
find_best_match('flying over the city and feeling free')
</VSCode.Cell>

<VSCode.Cell id="#VSC-8" language="markdown">
# Prototype: Zero-shot labeling + prompt for generation

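Use the Hugging Face zero-shot classification pipeline (`facebook/bart-large-mnli`) to suggest short theme labels, then feed a prompt built from those labels into Flan-T5 to generate interpretation text. The cell below runs only the zero-shot step, and it is skipped if the models are not available locally.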
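
The hand-off from labels to generation could look roughly like the sketch below. The helper names, the 0.5 score threshold, the prompt wording, and the `google/flan-t5-base` checkpoint are all assumptions rather than settled project choices. The `example_result` dictionary is a hand-written stand-in shaped like the zero-shot pipeline's output, not real model scores.

```python
def select_labels(zsc_result, threshold=0.5):
    """Keep labels whose score is at least `threshold` (an assumed cutoff)."""
    pairs = zip(zsc_result['labels'], zsc_result['scores'])
    return [label for label, score in pairs if score >= threshold]


def build_prompt(dream_text, labels):
    """Assemble an instruction-style prompt; the wording is illustrative."""
    themes = ', '.join(labels) if labels else 'none detected'
    return (
        'Interpret the following dream in two or three sentences.\n'
        f'Dream: {dream_text}\n'
        f'Themes: {themes}\n'
        'Interpretation:'
    )


def generate_interpretation(prompt, model_name='google/flan-t5-base', max_new_tokens=80):
    """Run a seq2seq model on the prompt; needs transformers and torch installed."""
    # Imported lazily so the helpers above work without the model stack.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors='pt')
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Hand-written stand-in for a zero-shot result (not real model output).
example_result = {
    'sequence': 'I was being chased',
    'labels': ['fear', 'stress', 'freedom'],
    'scores': [0.9, 0.6, 0.1],
}
prompt = build_prompt(example_result['sequence'], select_labels(example_result))
print(prompt)
```

`generate_interpretation(prompt)` is deliberately not called here, since it downloads model weights on first use.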
</VSCode.Cell>

<VSCode.Cell id="#VSC-9" language="python">
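try:
    # AutoTokenizer / AutoModelForSeq2SeqLM are imported for the Flan-T5 step described above
    from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

    zsc = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
    # multi_label=True scores each label independently instead of normalizing across labels
    result = zsc('I was being chased', ['fear', 'stress', 'freedom'], multi_label=True)
    # Print explicitly: a bare expression inside a try block is not displayed by the notebook
    print(result)
except Exception as e:
    print('Models not available in this env:', e)

# Run the model steps in Colab or on a machine with sufficient RAM/GPU.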
</VSCode.Cell>

<VSCode.Cell id="#VSC-10" language="markdown">
# Visualizations

Plot the top words and the distribution of interpretation lengths (histogram). The cell below draws the length histogram with matplotlib; plotly is an option if interactive plots are needed.
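
For the top-words plot, a horizontal bar chart of value counts is one simple option. The sketch below assumes a hypothetical helper named `plot_top_words` and uses made-up values; in the notebook it would be called with `df['Word_clean']`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_words(words, n=20):
    """Draw a horizontal bar chart of the n most frequent values in `words`."""
    counts = words.value_counts().head(n)
    fig, ax = plt.subplots(figsize=(10, 5))
    # Reverse so the most frequent word is drawn at the top of the chart.
    counts.iloc[::-1].plot.barh(ax=ax)
    ax.set_title(f'Top {len(counts)} words')
    ax.set_xlabel('Count')
    return ax


# Hypothetical values for illustration; in the notebook pass df['Word_clean'].
demo_words = pd.Series(['water', 'water', 'snake', 'flying', 'water', 'snake'])
ax = plot_top_words(demo_words, n=2)
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` afterwards.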
</VSCode.Cell>

<VSCode.Cell id="#VSC-11" language="python">
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
df['interp_len'].hist(bins=40)
plt.title('Interpretation length distribution')
plt.xlabel('Interpretation length (characters)')
plt.ylabel('Number of entries')
plt.show()
</VSCode.Cell>

<VSCode.Cell id="#VSC-12" language="markdown">
# Next steps

- Add more data cleaning and canonicalization of words
- Create a small classification dataset by annotating 500 to 1,000 dreams with themes
- Prototype a lightweight inference API (already implemented in `DREAMLENS AI/app.py`)
- Add unit tests for the matching and label-generation steps
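
As a starting point for the unit tests, the matching step could be wrapped in a factory so that tests run against a small in-memory frame instead of the notebook's global `df`. The sketch below is written in pytest style. `build_matcher`, `make_fixture`, and the placeholder rows are hypothetical and not part of the current code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_matcher(frame):
    """Hypothetical factory: fit TF-IDF on frame['Word'] and return a lookup function."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(frame['Word'].astype(str).str.lower())

    def match(text, top_k=3):
        sims = cosine_similarity(vectorizer.transform([text.lower()]), matrix).flatten()
        top_idx = sims.argsort()[::-1][:top_k]
        return frame.iloc[top_idx][['Word', 'Interpretation']].assign(score=sims[top_idx])

    return match


def make_fixture():
    # Placeholder rows; a real test would use a small, curated sample of the dataset.
    return pd.DataFrame({
        'Word': ['Water', 'Snake', 'Flying'],
        'Interpretation': ['interp 1', 'interp 2', 'interp 3'],
    })


def test_exact_word_ranks_first():
    match = build_matcher(make_fixture())
    result = match('a snake in the garden', top_k=1)
    assert result['Word'].tolist() == ['Snake']


def test_scores_are_sorted_and_bounded():
    match = build_matcher(make_fixture())
    scores = match('flying over water')['score'].tolist()
    assert scores == sorted(scores, reverse=True)
    assert all(0.0 <= s <= 1.0 + 1e-9 for s in scores)
```

Saved to a file such as `test_matching.py` (an assumed name), these functions would be collected by `pytest` automatically.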
