# Testing Text Preprocessing and Embedding Generation

## Overview
This notebook tests our text preprocessing pipeline on the Olist product and review data, in preparation for generating embeddings with the DeepSeek model. We verify that preprocessing handles both Portuguese and English text correctly before moving on to embeddings.

## Understanding the Imports

### System and Path Management
```python
import sys
from pathlib import Path
```
- `sys`: Gives access to interpreter state; here we use `sys.path` to make the project root importable
- `pathlib.Path`: Modern way to handle file paths in Python, making it easier to work with directories and files

### Project-Specific Imports
```python
from vectorshop.data.preprocessing import TextPreprocessor
from vectorshop.config import DATA_FILES, RAW_DATA_DIR
```
These imports come from our own project structure:
- `TextPreprocessor`: Our custom class for cleaning and standardizing text
- `DATA_FILES` and `RAW_DATA_DIR`: Configuration settings for data file locations

### Data Analysis Libraries
```python
import pandas as pd
```
- `pandas`: Library for data manipulation and analysis

## Module Organization
Our project is organized like a Python package:
```
vectorshop/
├── notebooks/
│   └── notebooks/            # This notebook runs from here
└── vectorshop/
    ├── data/
    │   ├── load.py           # Contains OlistDataLoader
    │   └── preprocessing.py  # Contains TextPreprocessor
    └── config.py             # Contains settings and paths
```

## Preprocessing Steps
1. Text Cleaning
   - Convert to lowercase
   - Remove accents (optional via `remove_accents`; matters for Portuguese words such as "relógio")
   - Remove special characters
   - Remove extra whitespace

2. Batch Processing
   - Process multiple texts efficiently
   - Maintain consistent cleaning across all products

3. Combined Text Creation
   - Merge product name, description, and category
   - Create standardized format for embedding
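
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `TextPreprocessor` code: the function names `clean_text`, `clean_batch`, and `combine_fields`, and the exact regular expressions, are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, List, Optional


def strip_accents(text: str) -> str:
    """Drop combining marks after Unicode decomposition (e.g. 'ó' -> 'o')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text(text: Optional[str], remove_accents: bool = True) -> str:
    """Hypothetical cleaner: lowercase, optional accent removal, strip symbols."""
    if not isinstance(text, str):
        return ""  # Treat missing values (None, NaN) as empty text
    text = unicodedata.normalize("NFC", text).lower()
    if remove_accents:
        text = strip_accents(text)
    text = re.sub(r"[^\w\s]", " ", text)      # Special characters become spaces
    return re.sub(r"\s+", " ", text).strip()  # Collapse extra whitespace


def clean_batch(texts: Iterable[Optional[str]], remove_accents: bool = True) -> List[str]:
    """Apply the same cleaning to every text so results stay consistent."""
    return [clean_text(t, remove_accents) for t in texts]


def combine_fields(*fields: Optional[str]) -> str:
    """Clean each field and join the non-empty ones into one string."""
    cleaned = (clean_text(f) for f in fields)
    return " ".join(part for part in cleaned if part)


print(clean_text("Relógio  ÓTIMO!!"))                        # relogio otimo
print(clean_text("Relógio  ÓTIMO!!", remove_accents=False))  # relógio ótimo
print(combine_fields("Relógio de pulso", None, "watches_gifts"))
```

Using `\w` rather than `a-z` in the character class keeps accented letters intact when `remove_accents=False`.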

## Next Steps
After verifying preprocessing works correctly, we'll:
1. Set up the DeepSeek model
2. Generate embeddings for sample products
3. Test embedding quality
4. Implement vector storage

In [1]:
# First cell - Path setup and imports
import sys
from pathlib import Path

# Resolve the project root: this notebook runs from notebooks/notebooks,
# so the root is two levels above the current working directory
project_root = Path.cwd().parent.parent
print(f"Project root: {project_root}")

# Add to Python path if not already there
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print(f"Added {project_root} to Python path")

# Let's add some debugging information
print(f"Current working directory: {Path.cwd()}")
print(f"Python path after update: {sys.path[0]}")

# Try importing our modules with better error handling
try:
    from vectorshop.data.preprocessing import TextPreprocessor
    print("Successfully imported TextPreprocessor")
except ImportError as e:
    print(f"Error importing TextPreprocessor: {e}")

try:
    from vectorshop.config import DATA_FILES, RAW_DATA_DIR
    print("Successfully imported config variables")
except ImportError as e:
    print(f"Error importing config: {e}")

import pandas as pd
print("All imports completed")

# Additional verification
print("\nVerifying directory structure:")
print(f"Does vectorshop directory exist? {(project_root / 'vectorshop').exists()}")
print(f"Does preprocessing.py exist? {(project_root / 'vectorshop' / 'data' / 'preprocessing.py').exists()}")
print(f"Does config.py exist? {(project_root / 'vectorshop' / 'config.py').exists()}")

Project root: c:\Users\User\Desktop\vectorshop
Added c:\Users\User\Desktop\vectorshop to Python path
Current working directory: c:\Users\User\Desktop\vectorshop\notebooks\notebooks
Python path after update: c:\Users\User\Desktop\vectorshop
Successfully imported TextPreprocessor
Successfully imported config variables
All imports completed

Verifying directory structure:
Does vectorshop directory exist? True
Does preprocessing.py exist? True
Does config.py exist? True
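
The cell above hard-codes "two levels up", which breaks if the notebook is moved. As a sketch of an alternative, the helper below walks up the directory tree until it finds a marker file. The name `find_project_root` and the choice of `vectorshop/config.py` as the marker are assumptions made for this example.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "vectorshop/config.py") -> Path:
    """Return the first directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory above {start} contains {marker}")


def add_to_sys_path(root: Path) -> None:
    """Put the project root at the front of sys.path, once."""
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
```

In the notebook this would be used as `add_to_sys_path(find_project_root(Path.cwd()))`.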


In [2]:
# Second cell - Test data loading
try:
    # Import the loader
    from vectorshop.data.load import OlistDataLoader
    
    # Create loader instance
    loader = OlistDataLoader(RAW_DATA_DIR)
    
    # Get a sample of product data
    sample_products = loader.get_sample_product_data(n_samples=5)
    
    print("\nSample Product Data:")
    print("="*50)
    print(f"\nShape of sample data: {sample_products.shape}")
    print("\nColumns in dataset:")
    for col in sample_products.columns:
        print(f"- {col}")
    
    print("\nFirst product details:")
    print("-"*30)
    first_product = sample_products.iloc[0]
    for col, value in first_product.items():
        print(f"{col}: {value}")
        
except Exception as e:
    print(f"An error occurred: {str(e)}")
    import traceback
    traceback.print_exc()

Successfully loaded olist_products_dataset.csv with 32951 rows
Successfully loaded product_category_name_translation.csv with 71 rows

Sample Product Data:
==================================================

Shape of sample data: (5, 10)

Columns in dataset:
- product_id
- product_category_name
- product_name_lenght
- product_description_lenght
- product_photos_qty
- product_weight_g
- product_length_cm
- product_height_cm
- product_width_cm
- product_category_name_english

First product details:
------------------------------
product_id: f819f0c84a64f02d3a5606ca95edd272
product_category_name: relogios_presentes
product_name_lenght: 59.0
product_description_lenght: 452.0
product_photos_qty: 1.0
product_weight_g: 710.0
product_length_cm: 19.0
product_height_cm: 13.0
product_width_cm: 14.0
product_category_name_english: watches_gifts
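
The output shows the products table with an extra `product_category_name_english` column. One plausible way to get that shape is a left join against the translation table. The sketch below is an assumption about what `OlistDataLoader` does, shown on toy data with made-up product IDs and one made-up untranslated category.

```python
import pandas as pd

# Toy stand-ins for the two CSV files loaded above
products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["relogios_presentes", "categoria_sem_traducao"],
})
translation = pd.DataFrame({
    "product_category_name": ["relogios_presentes"],
    "product_category_name_english": ["watches_gifts"],
})

# A left join keeps every product, even when no translation exists
merged = products.merge(translation, on="product_category_name", how="left")
print(merged)
```

With `how="left"`, products whose category has no translation keep their row and get a missing value in the English column.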


In [3]:
# Third cell - Test full product data loading
try:
    # Get a sample of complete product data
    full_sample = loader.get_full_product_data(n_samples=3)
    
    print("\nComplete Product Data Sample:")
    print("="*50)
    print(f"\nShape of full data: {full_sample.shape}")
    print("\nColumns in complete dataset:")
    for col in full_sample.columns:
        print(f"- {col}")
    
    print("\nDetailed view of first product:")
    print("-"*30)
    first_product = full_sample.iloc[0]
    
    # Print basic info
    print("\nBasic Information:")
    print(f"Product ID: {first_product['product_id']}")
    print(f"Category (PT): {first_product['product_category_name']}")
    print(f"Category (EN): {first_product['product_category_name_english']}")
    
    # Print review information if available
    if pd.notna(first_product.get('review_comment_message')):
        print("\nCustomer Reviews:")
        print(f"Average Score: {first_product['review_score']:.2f}")
        print("Review Comments:")
        print(first_product['review_comment_message'])
    
except Exception as e:
    print(f"An error occurred: {str(e)}")
    import traceback
    traceback.print_exc()

Successfully loaded olist_products_dataset.csv with 32951 rows
Successfully loaded product_category_name_translation.csv with 71 rows
Successfully loaded olist_order_items_dataset.csv with 112650 rows
Successfully loaded olist_order_reviews_dataset.csv with 99224 rows

Complete Product Data Sample:
==================================================

Shape of full data: (3, 13)

Columns in complete dataset:
- product_id
- product_category_name
- product_name_lenght
- product_description_lenght
- product_photos_qty
- product_weight_g
- product_length_cm
- product_height_cm
- product_width_cm
- product_category_name_english
- review_comment_title
- review_comment_message
- review_score

Detailed view of first product:
------------------------------

Basic Information:
Product ID: f819f0c84a64f02d3a5606ca95edd272
Category (PT): relogios_presentes
Category (EN): watches_gifts

Customer Reviews:
Average Score: 4.09
Review Comments:
correta  | não usei o produto não mas parece bom | Eu comprei um relógio e me mandaram outro  | Melhor do Brasi
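
The review text above is several comments joined with ` | `, and the score is an average, which suggests reviews are aggregated per product. The following is a sketch of such an aggregation on toy data. It is an assumption about the loader's behavior, and `join_comments` is a hypothetical helper.

```python
import pandas as pd

# Toy review rows: two reviews for p1 (one without a comment), one for p2
reviews = pd.DataFrame({
    "product_id": ["p1", "p1", "p2"],
    "review_comment_message": ["Muito bom", None, "Voltarei a comprar!"],
    "review_score": [5, 4, 3],
})


def join_comments(messages: pd.Series) -> str:
    """Join the non-missing comments of one product with ' | '."""
    return " | ".join(messages.dropna().astype(str))


per_product = (
    reviews.groupby("product_id")
    .agg(
        review_comment_message=("review_comment_message", join_comments),
        review_score=("review_score", "mean"),
    )
    .reset_index()
)
print(per_product)
```

A product whose reviews all lack comments would end up with an empty string here, which is one way a review length of 0 can arise.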

## Analyze text content distribution

In [4]:
# Fourth cell - Analyze text content distribution
try:
    # Get a larger sample for analysis
    analysis_sample = loader.get_full_product_data(n_samples=1000)
    
    print("Text Content Analysis:")
    print("="*50)
    
    # Analyze review presence
    reviews_present = analysis_sample['review_comment_message'].notna().sum()
    print(f"\nProducts with reviews: {reviews_present} out of {len(analysis_sample)} ({reviews_present/len(analysis_sample)*100:.1f}%)")
    
    # Count products per category (English category names), top 5
    print("\nCategory Language Distribution:")
    category_counts = analysis_sample.groupby('product_category_name_english').size().sort_values(ascending=False).head()
    for category, count in category_counts.items():
        print(f"- {category}: {count} products")
    
    # Look at review lengths
    analysis_sample['review_length'] = analysis_sample['review_comment_message'].str.len()
    print("\nReview Length Statistics:")
    print(f"Average review length: {analysis_sample['review_length'].mean():.0f} characters")
    print(f"Max review length: {analysis_sample['review_length'].max():.0f} characters")
    print(f"Min review length: {analysis_sample['review_length'].min():.0f} characters")
    
except Exception as e:
    print(f"An error occurred: {str(e)}")

Successfully loaded olist_products_dataset.csv with 32951 rows
Successfully loaded product_category_name_translation.csv with 71 rows
Successfully loaded olist_order_items_dataset.csv with 112650 rows
Successfully loaded olist_order_reviews_dataset.csv with 99224 rows
Text Content Analysis:
==================================================

Products with reviews: 992 out of 1000 (99.2%)

Category Language Distribution:
- bed_bath_table: 118 products
- furniture_decor: 108 products
- sports_leisure: 81 products
- housewares: 79 products
- auto: 57 products

Review Length Statistics:
Average review length: 116 characters
Max review length: 6445 characters
Min review length: 0 characters


## Examine review content quality

In [5]:
# Fifth cell - Examine review content quality
try:
    # Get another sample for detailed text analysis
    text_sample = loader.get_full_product_data(n_samples=1000)
    
    # Add analysis of review text characteristics
    text_sample['review_word_count'] = text_sample['review_comment_message'].str.split().str.len()
    # na=False marks missing reviews as False instead of NaN
    text_sample['has_portuguese_chars'] = text_sample['review_comment_message'].str.contains(r'[áéíóúãõâêîôûç]', regex=True, na=False)
    
    print("Detailed Text Analysis:")
    print("="*50)
    
    # Word count distribution
    print("\nWord Count Statistics:")
    print(f"Average words per review: {text_sample['review_word_count'].mean():.1f}")
    print(f"Median words per review: {text_sample['review_word_count'].median():.1f}")
    
    # Rough language signal: reviews containing accented Portuguese characters
    portuguese_reviews = text_sample['has_portuguese_chars'].sum()
    print(f"\nReviews with Portuguese characters: {portuguese_reviews} ({portuguese_reviews/len(text_sample)*100:.1f}%)")
    
    # Look at some examples of different lengths
    print("\nSample Reviews by Length:")
    print("\nShort Review Example:")
    short_review = text_sample[text_sample['review_word_count'] < 5].iloc[0]['review_comment_message']
    print(short_review)
    
    print("\nMedium Review Example:")
    medium_review = text_sample[
        (text_sample['review_word_count'] > 10) & 
        (text_sample['review_word_count'] < 20)
    ].iloc[0]['review_comment_message']
    print(medium_review)
    
    print("\nLong Review Example:")
    long_review = text_sample[text_sample['review_word_count'] > 50].iloc[0]['review_comment_message']
    print(long_review[:200] + "..." if len(long_review) > 200 else long_review)
    
except Exception as e:
    print(f"An error occurred: {str(e)}")

Successfully loaded olist_products_dataset.csv with 32951 rows
Successfully loaded product_category_name_translation.csv with 71 rows
Successfully loaded olist_order_items_dataset.csv with 112650 rows
Successfully loaded olist_order_reviews_dataset.csv with 99224 rows
Detailed Text Analysis:
==================================================

Word Count Statistics:
Average words per review: 20.2
Median words per review: 4.5

Reviews with Portuguese characters: 362 (36.2%)

Sample Reviews by Length:

Short Review Example:
Voltarei a comprar!

Medium Review Example:
Adorei os livros e os cartazes, são excelentes para usarmos com nossos alunos

Long Review Example:
correta  | não usei o produto não mas parece bom | Eu comprei um relógio e me mandaram outro  | Melhor do Brasil | vou comprar de novo , gostei e é excelente os produtos obrigado! | na verdade o produ...
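
The accent check in this cell is case-sensitive and omits some characters (for example `à`), and the short example "Voltarei a comprar!" shows that Portuguese text can contain no accents at all. The sketch below is a slightly more tolerant variant. The name `has_accented_chars` and the character list are assumptions, and it remains a heuristic rather than language detection.

```python
import re

# Case-insensitive, with a few extra accented characters added
ACCENT_PATTERN = re.compile(r"[áàâãéêíóôõúüç]", re.IGNORECASE)


def has_accented_chars(text) -> bool:
    """True if `text` is a string containing an accented character."""
    return isinstance(text, str) and bool(ACCENT_PATTERN.search(text))


print(has_accented_chars("ÓTIMO"))                # True
print(has_accented_chars("Voltarei a comprar!"))  # False, yet Portuguese
```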


## Language handling analysis

In [6]:
# Sixth cell - Language handling analysis
import re

class TextAnalyzer:
    """Analyzes and processes bilingual review text."""
    
    def __init__(self):
        # Common Portuguese words that indicate the text is in Portuguese
        self.portuguese_indicators = {
            'produto', 'muito', 'bom', 'ótimo', 'excelente',
            'recomendo', 'chegou', 'recebi', 'não', 'para'
        }
    
    def analyze_review(self, text: str) -> dict:
        """
        Analyze a single review for language characteristics.
        """
        if not isinstance(text, str):
            return {
                'length': 0,
                'has_accents': False,
                'likely_portuguese': False,
                'word_count': 0
            }
            
        # Basic text cleanup
        text = text.lower().strip()
        
        # Analyze characteristics
        has_accents = bool(re.search('[áéíóúãõâêîôûç]', text))
        words = text.split()
        portuguese_words = sum(1 for word in words 
                             if word in self.portuguese_indicators)
        
        return {
            'length': len(text),
            'has_accents': has_accents,
            'likely_portuguese': portuguese_words > 0,
            'word_count': len(words)
        }

try:
    # Create analyzer
    analyzer = TextAnalyzer()
    
    # Get fresh sample
    text_sample = loader.get_full_product_data(n_samples=1000)
    
    # Analyze reviews
    analyses = []
    for review in text_sample['review_comment_message'].dropna():
        # Split combined reviews
        individual_reviews = review.split(' | ')
        for individual_review in individual_reviews:
            analyses.append(analyzer.analyze_review(individual_review))
    
    # Compile statistics
    total_reviews = len(analyses)
    accented = sum(1 for a in analyses if a['has_accents'])
    likely_portuguese = sum(1 for a in analyses if a['likely_portuguese'])
    
    print("Detailed Language Analysis:")
    print("="*50)
    print(f"\nTotal individual reviews analyzed: {total_reviews}")
    print(f"Reviews with accents: {accented} ({accented/total_reviews*100:.1f}%)")
    print(f"Reviews with Portuguese indicators: {likely_portuguese} ({likely_portuguese/total_reviews*100:.1f}%)")
    
    # Length distribution
    lengths = [a['length'] for a in analyses]
    print("\nLength Distribution:")
    print(f"Average length: {sum(lengths)/len(lengths):.1f} characters")
    print(f"Shortest review: {min(lengths)} characters")
    print(f"Longest review: {max(lengths)} characters")
    
except Exception as e:
    print(f"An error occurred: {str(e)}")

Successfully loaded olist_products_dataset.csv with 32951 rows
Successfully loaded product_category_name_translation.csv with 71 rows
Successfully loaded olist_order_items_dataset.csv with 112650 rows
Successfully loaded olist_order_reviews_dataset.csv with 99224 rows
Detailed Language Analysis:
==================================================

Total individual reviews analyzed: 1928
Reviews with accents: 842 (43.7%)
Reviews with Portuguese indicators: 1128 (58.5%)

Length Distribution:
Average length: 57.8 characters
Shortest review: 0 characters
Longest review: 203 characters
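
Two details of the cell above affect these numbers. Splitting on whitespace leaves punctuation attached to words, so `bom!` does not match the indicator `bom`, and splitting on ` | ` can yield empty segments, which explains the 0-character "review". A sketch of a more tolerant approach follows. The helper names are hypothetical, and applying them would change the statistics reported above.

```python
import re

PORTUGUESE_INDICATORS = {
    'produto', 'muito', 'bom', 'ótimo', 'excelente',
    'recomendo', 'chegou', 'recebi', 'não', 'para',
}


def count_indicators_split(text: str) -> int:
    """Whitespace split, as in the cell above: punctuation stays attached."""
    return sum(word in PORTUGUESE_INDICATORS for word in text.lower().split())


def count_indicators_tokens(text: str) -> int:
    """Regex tokens: punctuation is dropped before matching."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(token in PORTUGUESE_INDICATORS for token in tokens)


def split_reviews(combined: str) -> list:
    """Split a combined review on '|' and discard empty segments."""
    return [part.strip() for part in combined.split("|") if part.strip()]


print(count_indicators_split("Produto muito bom!"))   # 2
print(count_indicators_tokens("Produto muito bom!"))  # 3
print(split_reviews("correta  | não usei |  "))
```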


In [7]:
# 7th cell - Test the enhanced preprocessing
from vectorshop.data.preprocessing import TextPreprocessor, MultilingualTextPreprocessor, ProcessedText

# Compare the basic and multilingual preprocessors on the same review
try:
    # Create both preprocessors
    basic_processor = TextPreprocessor(remove_accents=True)
    multilingual_processor = MultilingualTextPreprocessor(remove_accents=False)
    
    # Get a sample product with review
    sample = text_sample.iloc[0]
    
    # Process with both processors
    print("Basic Processing Result:")
    print("=" * 50)
    basic_result = basic_processor.clean_text(sample['review_comment_message'])
    print(f"Cleaned text: {basic_result[:100]}...")
    
    print("\nMultilingual Processing Result:")
    print("=" * 50)
    ml_result = multilingual_processor.process_with_metadata(sample['review_comment_message'])
    print(f"Original text: {ml_result.original[:100]}...")
    print(f"Cleaned text: {ml_result.cleaned[:100]}...")
    print(f"Detected language: {ml_result.language}")
    print(f"Has accents: {ml_result.has_accents}")
    print(f"Word count: {ml_result.word_count}")
    
except Exception as e:
    print(f"An error occurred: {str(e)}")
    import traceback
    traceback.print_exc()

Basic Processing Result:
==================================================
Cleaned text: correta nao usei produto nao mas parece bom comprei relogio mandaram outro melhor brasil vou comprar...

Multilingual Processing Result:
==================================================
Original text: correta  | não usei o produto não mas parece bom | Eu comprei um relógio e me mandaram outro  | Melh...
Cleaned text: correta usei produto mas parece bom comprei rel gio mandaram outro melhor brasil vou comprar novo go...
Detected language: pt
Has accents: True
Word count: 143
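
Note that the multilingual result turns "relógio" into "rel gio" even though `remove_accents=False`. One possible cause, which is a guess and has not been checked against `preprocessing.py`, is a cleanup pattern restricted to ASCII letters: when accents are kept, such a pattern treats every accented letter as a special character. The sketch reproduces the symptom and shows a Unicode-aware alternative.

```python
import re

word = "relógio"

# An ASCII-only character class treats 'ó' as a special character
ascii_only = re.sub(r"[^a-z\s]", " ", word)

# \w matches Unicode letters in Python 3, so the accent survives
unicode_aware = re.sub(r"[^\w\s]", " ", word)

print(ascii_only)     # rel gio
print(unicode_aware)  # relógio
```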


## Analyze a batch of reviews with our multilingual processor

In [8]:
# 8th cell - Analyze a batch of reviews with our multilingual processor
try:
    multilingual_processor = MultilingualTextPreprocessor(remove_accents=False)
    
    # Get a sample of reviews
    reviews_sample = text_sample['review_comment_message'].head(50)
    
    # Process each review and collect statistics
    stats = {
        'pt_count': 0,
        'en_count': 0,
        'accented_count': 0,
        'avg_word_count': 0,
        'samples': {
            'pt_sample': None,
            'en_sample': None
        }
    }
    
    total_words = 0
    for review in reviews_sample:
        if isinstance(review, str):
            result = multilingual_processor.process_with_metadata(review)
            
            # Update statistics
            if result.language == 'pt':
                stats['pt_count'] += 1
                if not stats['samples']['pt_sample']:
                    stats['samples']['pt_sample'] = result
            else:
                # Note: every result not labeled 'pt' is counted as English here
                stats['en_count'] += 1
                if not stats['samples']['en_sample']:
                    stats['samples']['en_sample'] = result
                    
            if result.has_accents:
                stats['accented_count'] += 1
                
            total_words += result.word_count
    
    stats['avg_word_count'] = total_words / len(reviews_sample) if len(reviews_sample) > 0 else 0
    
    # Print analysis results
    print("Review Language Analysis:")
    print("=" * 50)
    print(f"Total reviews analyzed: {len(reviews_sample)}")
    print(f"Portuguese reviews: {stats['pt_count']} ({stats['pt_count']/len(reviews_sample)*100:.1f}%)")
    print(f"English reviews: {stats['en_count']} ({stats['en_count']/len(reviews_sample)*100:.1f}%)")
    print(f"Reviews with accents: {stats['accented_count']} ({stats['accented_count']/len(reviews_sample)*100:.1f}%)")
    print(f"Average words per review: {stats['avg_word_count']:.1f}")
    
    # Show examples
    if stats['samples']['pt_sample']:
        print("\nPortuguese Review Example:")
        print("-" * 30)
        print(f"Original: {stats['samples']['pt_sample'].original[:100]}...")
        print(f"Cleaned: {stats['samples']['pt_sample'].cleaned[:100]}...")
    
    if stats['samples']['en_sample']:
        print("\nEnglish Review Example:")
        print("-" * 30)
        print(f"Original: {stats['samples']['en_sample'].original[:100]}...")
        print(f"Cleaned: {stats['samples']['en_sample'].cleaned[:100]}...")
    
except Exception as e:
    print(f"An error occurred: {str(e)}")
    import traceback
    traceback.print_exc()

Review Language Analysis:
==================================================
Total reviews analyzed: 50
Portuguese reviews: 20 (40.0%)
English reviews: 30 (60.0%)
Reviews with accents: 20 (40.0%)
Average words per review: 33.4

Portuguese Review Example:
------------------------------
Original: correta  | não usei o produto não mas parece bom | Eu comprei um relógio e me mandaram outro  | Melh...
Cleaned: correta usei produto mas parece bom comprei rel gio mandaram outro melhor brasil vou comprar novo go...

English Review Example:
------------------------------
Original: Voltarei a comprar!...
Cleaned: voltarei comprar...
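
The "English" example above, "Voltarei a comprar!" ("I will buy again!"), is Portuguese. Because the cell counts every result not labeled `pt` as English, the 60% English share is probably overstated. The sketch below shows a three-way heuristic that reports `unknown` instead of defaulting to English. The marker word lists and the name `guess_language` are assumptions for illustration and do not reflect how `MultilingualTextPreprocessor` works.

```python
import re

# Small illustrative marker sets; a real list would be much longer
PT_MARKERS = {"não", "muito", "produto", "comprar", "recomendo",
              "chegou", "bom", "ótimo", "para", "que"}
EN_MARKERS = {"the", "and", "is", "very", "good",
              "product", "not", "this", "was", "it"}


def guess_language(text: str) -> str:
    """Return 'pt', 'en', or 'unknown' based on marker word counts."""
    tokens = re.findall(r"\w+", text.lower())
    pt_hits = sum(token in PT_MARKERS for token in tokens)
    en_hits = sum(token in EN_MARKERS for token in tokens)
    if pt_hits > en_hits:
        return "pt"
    if en_hits > pt_hits:
        return "en"
    return "unknown"  # No evidence either way, so do not default to English


print(guess_language("Voltarei a comprar!"))  # pt
print(guess_language("Very good product"))    # en
print(guess_language("ok"))                   # unknown
```

Counting `unknown` separately keeps short or ambiguous reviews from inflating either language's share.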
