# Sentence Transformers Tutorial

This notebook shows how to use sentence transformers through practical examples built on the text files in our project.

## Table of Contents
1. [Setup and Installation](#setup)
2. [Loading Text Files](#loading)
3. [Basic Sentence Embeddings](#basic)
4. [Semantic Search](#search)


## 1. Setup and Installation {#setup}

First, let's install the required packages if needed and import all necessary libraries:


In [None]:
# Uncomment and run this cell if you need to install packages

# Install uv first (if needed)
# !curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies with uv
# !uv sync


In [None]:
# Standard library
import os
import warnings

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud

# Embeddings, similarity, clustering
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")


## 2. Loading Text Files {#loading}

Let's load our sample text files and prepare them for analysis:


In [None]:
def load_text_files(directory='sample_texts'):
    """
    Load all .txt files from a directory.
    Returns a dictionary with the filename (without extension) as key
    and the file content as value.
    """
    texts = {}

    if not os.path.isdir(directory):
        print(f"Warning: directory '{directory}' not found; no documents loaded.")
        return texts

    # Sort so documents load in a stable order across runs
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.txt'):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as file:
                # Drop only the trailing extension from the key
                key = os.path.splitext(filename)[0]
                texts[key] = file.read()

    return texts

# Load our text files
documents = load_text_files()

print(f"Loaded {len(documents)} documents:")
for key in documents.keys():
    print(f"- {key}")
    print(f"  Preview: {documents[key][:100]}...\n")


In [None]:
import re

def split_into_sentences(text):
    """
    Simple sentence splitter.
    Splits on runs of '.', '!' and '?', so the punctuation is dropped and
    decimals or abbreviations (e.g. "3.14") are split as well.
    """
    # Split on periods, exclamation marks, and question marks
    sentences = re.split(r'[.!?]+', text)
    # Clean up and filter out empty sentences
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Create sentence-level data
sentence_data = []
for topic, content in documents.items():
    sentences = split_into_sentences(content)
    for sentence in sentences:
        sentence_data.append({
            'topic': topic,
            'sentence': sentence
        })

# Convert to DataFrame for easier handling
df_sentences = pd.DataFrame(sentence_data)
print(f"Total sentences: {len(df_sentences)}")
print("\nSample sentences by topic:")
for topic in df_sentences['topic'].unique():
    sample = df_sentences[df_sentences['topic'] == topic].iloc[0]['sentence']
    print(f"{topic}: {sample}")
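
The simple splitter above breaks on every period, so decimals and abbreviations get cut apart. Here is a minimal sketch of one alternative, assuming that a sentence ends with punctuation followed by whitespace and an uppercase letter. `split_on_boundaries` is a hypothetical name, not part of the notebook, and the heuristic still fails on cases such as "Dr. Smith":

```python
import re

def split_simple(text):
    # Same rule as the notebook's splitter: break on any run of . ! ?
    return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

def split_on_boundaries(text):
    # Hypothetical alternative: break only where sentence-ending punctuation
    # is followed by whitespace and an uppercase letter; punctuation is kept.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]

sample = "Pi is roughly 3.14 in value. Is that close enough? It usually is!"
simple_result = split_simple(sample)
boundary_result = split_on_boundaries(sample)

print(simple_result)    # the decimal 3.14 is split into two fragments
print(boundary_result)  # three sentences, punctuation preserved
```

For production text, a dedicated sentence tokenizer is usually a safer choice than either regular expression.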


## 3. Basic Sentence Embeddings {#basic}

Let's start with the basics by loading a sentence transformer model and creating embeddings:


In [None]:
# Load a pre-trained sentence transformer model
# We'll use 'all-MiniLM-L6-v2': a small, fast general-purpose model
# that produces 384-dimensional embeddings
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}")


In [None]:
# Create embeddings for our documents
print("Creating document embeddings...")
doc_texts = list(documents.values())
doc_names = list(documents.keys())

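# Generate embeddings (one vector per document).
# Note: the model truncates inputs longer than its maximum sequence length
# (see model.max_seq_length), so long documents are only partially encoded.
doc_embeddings = model.encode(doc_texts)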

print(f"Created embeddings for {len(doc_embeddings)} documents")
print(f"Each embedding has {doc_embeddings[0].shape[0]} dimensions")
print(f"Embedding shape: {doc_embeddings.shape}")

# Create sentence embeddings
print("\nCreating sentence embeddings...")
sentences = df_sentences['sentence'].tolist()
sentence_embeddings = model.encode(sentences)

print(f"Created embeddings for {len(sentence_embeddings)} sentences")

# Add embeddings to our DataFrame
df_sentences['embedding'] = list(sentence_embeddings)
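
Transformer models such as `all-MiniLM-L6-v2` truncate input beyond their maximum sequence length, so encoding a long document in one call only captures its beginning. One common workaround is to split each document into overlapping word windows and embed the windows instead. This is a minimal sketch: `chunk_words`, `window`, and `overlap` are hypothetical names and values chosen for illustration, and whitespace-separated words are only a rough stand-in for the model's tokens.

```python
def chunk_words(text, window=100, overlap=20):
    """Split text into overlapping chunks of at most `window` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once this window has reached the end of the text
        if start + window >= len(words):
            break
    return chunks

# Toy document of 250 numbered words (placeholder data, not a project file)
toy_document = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(toy_document, window=100, overlap=20)

print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to `model.encode` in place of the full document. How to combine the chunk vectors afterwards (for example, by averaging them) is a separate design choice.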


## 4. Semantic Search {#search}

Let's implement a semantic search function that finds the most relevant sentences:


In [None]:
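def semantic_search(query, df_sentences, model, top_k=5):
    """
    Perform semantic search on sentences.
    Returns the top_k rows of df_sentences ranked by cosine similarity
    to the query, with columns 'topic', 'sentence' and 'similarity'.
    """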
    # Encode the query
    query_embedding = model.encode([query])
    
    # Calculate similarities
    sentence_embeddings = np.vstack(df_sentences['embedding'].values)
    similarities = cosine_similarity(query_embedding, sentence_embeddings)[0]
    
    # Add similarities to dataframe
    df_results = df_sentences.copy()
    df_results['similarity'] = similarities
    
    # Sort by similarity and return top results
    top_results = df_results.nlargest(top_k, 'similarity')
    
    return top_results[['topic', 'sentence', 'similarity']]

# Test semantic search
search_queries = [
    "artificial intelligence and neural networks",
    "pasta and italian recipes", 
    "solo travel adventures",
    "startup and entrepreneurship",
    "climate change research"
]

for query in search_queries:
    print(f"\n🔍 Searching for: '{query}'")
    print("=" * 60)
    
    results = semantic_search(query, df_sentences, model, top_k=3)
    
    for _, row in results.iterrows():
        print(f"📄 [{row['topic']}] (Score: {row['similarity']:.3f})")
        print(f"   {row['sentence']}")
        print()
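
The ranking inside `semantic_search` comes down to cosine similarity (the dot product of two vectors divided by the product of their lengths) followed by a sort. Here is a minimal pure-Python sketch of that step. The 3-dimensional vectors are made-up toy values, not real model output, and `cosine` and `top_k_indices` are hypothetical helper names:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, vectors, k):
    """Return (indices of the k most similar vectors, all scores)."""
    scores = [cosine(query_vec, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy "embeddings" (made-up values for illustration)
toy_vectors = [
    [1.0, 0.0, 0.0],   # index 0: partly aligned with the query
    [0.0, 0.0, 1.0],   # index 1: orthogonal to the query
    [2.0, 2.0, 0.0],   # index 2: same direction as the query, twice as long
]
toy_query = [1.0, 1.0, 0.0]

best, scores = top_k_indices(toy_query, toy_vectors, k=2)
print(best, [round(s, 3) for s in scores])
```

Index 2 ranks first even though it is twice as long as the query, because cosine similarity compares direction only and ignores vector length.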


## Summary

In this tutorial, you've learned:

1. **Basic Usage**: How to load a model and create document and sentence embeddings
2. **Text Loading**: How to read multiple text files and split them into sentences
3. **Semantic Search**: How to rank sentences by cosine similarity to a query
4. **Practical Applications**: How to run search queries across documents on different topics

### Next Steps:
- Experiment with different models for your specific use case
- Try fine-tuning models on your domain-specific data
- Explore clustering and visualization techniques
- Build more sophisticated applications using these fundamentals

### Key Takeaways:
- Sentence transformers convert text to meaningful vector representations
- Cosine similarity measures semantic closeness between texts
- Different models have different strengths and computational requirements
- Embeddings enable powerful applications like search, clustering, and QA systems

To run this notebook:
1. Install dependencies: `uv sync`
2. Start Jupyter: `jupyter notebook` or `jupyter lab`
3. Open `sentence_transformers_tutorial.ipynb`
4. Run cells sequentially from top to bottom
