# 🚀 Advanced Semantic Scoring - State-of-the-Art Models

## Upgraded with Best Multilingual Models

This notebook scores company descriptions with **multilingual sentence-embedding models**:

### 🎯 Model Options:
1. **BGE-M3** - State-of-the-art multilingual embeddings (best quality)
2. **E5-Large** - Excellent multilingual embeddings
3. **MPNet-Base** - High quality, fast
4. **MiniLM** - Very fast, suited to large datasets
5. **Ensemble** - Averages the scores of several models (BGE-M3 + E5-Large by default)

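### 📊 Scores:
- **Innovation Score**: How closely the description matches reference phrases about novel, cutting-edge technology
- **Confidence Score**: How closely it matches reference phrases about established traction and a proven track record
- **Market Clarity Score**: How closely it matches reference phrases about a clearly defined problem and target market
- **Overall Quality Score**: Weighted combination (35% innovation, 35% confidence, 30% market clarity)

Each score is scaled to 0-100 relative to the other descriptions in the same run.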
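
As a rough illustration of how the combined metric is formed, the sketch below applies the weights used by the scorer class later in this notebook (0.35, 0.35, 0.30) to three scores that are already on the 0-100 scale. The function name `overall_quality` is hypothetical and exists only for this example.

```python
def overall_quality(innovation: float, confidence: float, market_clarity: float) -> float:
    """Weighted mean of three 0-100 scores, using the notebook's 0.35 / 0.35 / 0.30 weights."""
    return 0.35 * innovation + 0.35 * confidence + 0.30 * market_clarity


# Made-up input scores, for illustration only
example = overall_quality(80.0, 60.0, 50.0)
print(round(example, 1))
```

Because the weights sum to 1, the overall score stays within 0-100 whenever the three inputs do.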

**Optimized for 100+ GB RAM Colab**

---

## 📦 Step 1: Install Dependencies

In [None]:
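# Install packages. %%capture was dropped because it would also hide the
# confirmation print below; -q already keeps pip output short.
# seaborn is needed for the correlation heatmap in Step 7.
!pip install -q sentence-transformers pandas numpy torch tqdm matplotlib seaborn

print("✅ Installation complete!")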

In [None]:
# Check GPU
import torch

print("="*60)
print("GPU STATUS")
print("="*60)
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print("✅ Ready for high-speed processing!")
else:
    print("⚠️ No GPU detected - will use CPU (slower)")
print("="*60)

## 📁 Step 2: Upload Your Dataset

In [None]:
# Option A: Upload file
from google.colab import files

print("üì§ Upload your CSV file...")
uploaded = files.upload()

input_file = list(uploaded.keys())[0]
print(f"\n‚úÖ Uploaded: {input_file}")

In [None]:
# Option B: Mount Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# input_file = '/content/drive/MyDrive/your_file.csv'

## ⚙️ Step 3: Configuration

### Choose Your Model:

| Model | Quality | Speed | Best For |
|-------|---------|-------|----------|
| `bge-m3` | ⭐⭐⭐⭐⭐ | Medium | **Maximum accuracy** |
| `e5-large` | ⭐⭐⭐⭐⭐ | Medium | **High quality multilingual** |
| `mpnet-base` | ⭐⭐⭐⭐ | Fast | **Balanced** |
| `minilm` | ⭐⭐⭐ | Very Fast | **Large datasets** |
| `ensemble` | ⭐⭐⭐⭐⭐ | Slow | **Best possible accuracy** |

In [None]:
# ========== CONFIGURATION ==========

# MODEL SELECTION (choose one)
MODEL = 'bge-m3'  # Options: 'bge-m3', 'e5-large', 'mpnet-base', 'minilm', 'ensemble'

# DATASET SETTINGS
DESCRIPTION_COLUMN = 'company_description'  # Name of your description column

# PERFORMANCE SETTINGS
BATCH_SIZE = 512  # 512-1024 on a large GPU (e.g. A100 40/80 GB), 256 for smaller GPUs

# OUTPUT
OUTPUT_FILE = 'scored_startups_advanced.csv'

print("⚙️ Configuration:")
print(f"   Model: {MODEL}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Output: {OUTPUT_FILE}")

## 🔧 Step 4: Load Advanced Scoring Models

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Optional
import torch
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Model configurations
MODELS = {
    'bge-m3': {
        'name': 'BAAI/bge-m3',
        'quality': '⭐⭐⭐⭐⭐ BEST',
        'description': 'State-of-the-art multilingual (100+ languages)'
    },
    'e5-large': {
        'name': 'intfloat/multilingual-e5-large',
        'quality': '⭐⭐⭐⭐⭐ EXCELLENT',
        'description': 'E5 embeddings - very high quality'
    },
    'mpnet-base': {
        'name': 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2',
        'quality': '⭐⭐⭐⭐ HIGH',
        'description': 'MPNet-based, fast and accurate'
    },
    'minilm': {
        # Full Hub id, consistent with the other entries
        'name': 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
        'quality': '⭐⭐⭐ GOOD',
        'description': 'Very fast, good for large datasets'
    }
}

print("✅ Model configurations loaded")

In [None]:
class AdvancedSemanticScorer:
    """Advanced semantic scorer with state-of-the-art models."""

    def __init__(self, model_key='bge-m3', batch_size=512):
        if model_key not in MODELS:
            raise ValueError(f"Unknown model: {model_key}")

        model_config = MODELS[model_key]
        model_name = model_config['name']

        print("="*70)
        print("üöÄ ADVANCED SEMANTIC SCORER")
        print("="*70)
        print(f"Model: {model_name}")
        print(f"Quality: {model_config['quality']}")
        print(f"Description: {model_config['description']}")
        print("="*70 + "\n")

        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Loading model on {self.device}...")
        self.model = SentenceTransformer(model_name, device=self.device)
        self.batch_size = batch_size

        # Enhanced reference texts for better accuracy
        self.reference_texts = {
            'innovation': [
                "Revolutionary breakthrough technology disrupting traditional industries with unprecedented innovation",
                "Cutting-edge artificial intelligence and machine learning achieving state-of-the-art results",
                "Novel solution using advanced patented technology never seen before in the market",
                "Pioneering new approach with proprietary technology and unique methodology",
                "Next-generation platform leveraging emerging technologies like quantum computing",
                "Groundbreaking research-based innovation transforming the industry fundamentally",
                "First-of-its-kind solution introducing completely new paradigm",
                "Disruptive technology challenging established market leaders",
                "Advanced R&D resulting in breakthrough capabilities",
                "Revolutionary approach reimagining traditional processes"
            ],
            'confidence': [
                "Proven track record with established customer base generating consistent revenue",
                "Successfully deployed solution serving thousands of paying customers globally",
                "Market leader with strong partnerships and extensively validated product",
                "Demonstrated traction with measurable growth and strong profitability",
                "Established company with successful case studies and testimonials",
                "Operating at scale with proven repeatable business model",
                "Trusted by Fortune 500 companies and industry leaders",
                "99.9% uptime with enterprise-grade reliability",
                "Growing customer base with high retention rates",
                "Award-winning solution recognized by industry authorities"
            ],
            'market_clarity': [
                "Clear value proposition solving specific problem for well-defined target market",
                "Precisely addressing enterprise healthcare compliance with quantifiable ROI",
                "Serving small retail businesses with streamlined inventory management",
                "Focused solution for financial services regulatory compliance",
                "Targeted platform connecting buyers and sellers in construction industry",
                "Well-defined offering for specific customer segment with measurable benefits",
                "Solving specific problem for specific market with specific solution",
                "Reducing costs by X percent for Y industry through Z approach",
                "Helping target customers achieve specific outcomes in measurable timeframe",
                "Clear go-to-market strategy targeting specific segment"
            ]
        }

        print("Computing reference embeddings...")
        self.reference_embeddings = {}
        for score_type, texts in self.reference_texts.items():
            emb = self.model.encode(
                texts,
                convert_to_tensor=True,
                show_progress_bar=False,
                device=self.device,
                normalize_embeddings=True
            )
            self.reference_embeddings[score_type] = emb

        print("✅ Ready!\n")

    def score_descriptions(self, descriptions: List[str]) -> pd.DataFrame:
        """Score descriptions with advanced model."""
        descriptions = [str(d) if pd.notna(d) and str(d).strip()
                       else "No description" for d in descriptions]

        n = len(descriptions)
        print(f"Scoring {n:,} descriptions with batch size {self.batch_size}\n")

        # Encode all descriptions
        print("Encoding descriptions...")
        all_embeddings = self.model.encode(
            descriptions,
            batch_size=self.batch_size,
            convert_to_tensor=True,
            show_progress_bar=True,
            device=self.device,
            normalize_embeddings=True
        )

        # Compute scores
        scores = {}
        print("\nComputing semantic scores...")

        for score_type in ['innovation', 'confidence', 'market_clarity']:
            # Compute similarities
            similarity = torch.mm(all_embeddings, self.reference_embeddings[score_type].T)
            max_similarity = torch.max(similarity, dim=1)[0]
            scores[score_type] = max_similarity.cpu().numpy()
            print(f"  ✓ {score_type}")

        # Create DataFrame
        results_df = pd.DataFrame({
            'innovation_score': scores['innovation'],
            'confidence_score': scores['confidence'],
            'market_clarity_score': scores['market_clarity']
        })

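        # Scale to 0-100 (min-max, relative to this batch of descriptions)
        for col in results_df.columns:
            raw = results_df[col]
            span = raw.max() - raw.min()
            # Guard against 0/0 when all raw scores are identical (e.g. a single row)
            results_df[col] = ((raw - raw.min()) / span * 100).clip(0, 100) if span > 0 else 50.0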

        # Add overall quality score
        results_df['overall_quality_score'] = (
            results_df['innovation_score'] * 0.35 +
            results_df['confidence_score'] * 0.35 +
            results_df['market_clarity_score'] * 0.30
        )

        print("\n✅ Scoring complete!")
        return results_df

    def score_dataframe(self, df, description_column, output_path=None):
        """Score DataFrame."""
        if description_column not in df.columns:
            raise ValueError(f"Column '{description_column}' not found!")

        scores_df = self.score_descriptions(df[description_column].tolist())
        result_df = pd.concat([df.reset_index(drop=True), scores_df], axis=1)

        if output_path:
            print(f"\nüíæ Saving to {output_path}...")
            result_df.to_csv(output_path, index=False)
            print("‚úÖ Saved!")

        return result_df

print("✅ AdvancedSemanticScorer loaded!")

In [None]:
class EnsembleScorer:
    """Ensemble scorer combining multiple models for maximum accuracy."""

    def __init__(self, models=('bge-m3', 'e5-large'), batch_size=512):
        # Tuple default avoids sharing one mutable list across instances
        print("="*70)
        print("🎯 ENSEMBLE SCORER (Maximum Accuracy)")
        print("="*70)
        print(f"Combining {len(models)} models:\n")

        for model in models:
            print(f"  • {model}: {MODELS[model]['description']}")

        print("="*70 + "\n")

        self.scorers = []
        for model in models:
            print(f"Loading {model}...\n")
            scorer = AdvancedSemanticScorer(model, batch_size)
            self.scorers.append(scorer)

        print("\n✅ Ensemble ready!\n")

    def score_descriptions(self, descriptions: List[str]) -> pd.DataFrame:
        """Score using ensemble."""
        all_scores = []

        for i, scorer in enumerate(self.scorers):
            print(f"\n{'='*70}")
            print(f"Model {i+1}/{len(self.scorers)}")
            print('='*70)
            scores = scorer.score_descriptions(descriptions)
            all_scores.append(scores)

        print("\nüîÑ Combining scores...")
        ensemble_scores = sum(all_scores) / len(all_scores)

        print("‚úÖ Ensemble complete!")
        return ensemble_scores

    def score_dataframe(self, df, description_column, output_path=None):
        """Score DataFrame with ensemble."""
        if description_column not in df.columns:
            raise ValueError(f"Column '{description_column}' not found!")

        scores_df = self.score_descriptions(df[description_column].tolist())
        result_df = pd.concat([df.reset_index(drop=True), scores_df], axis=1)

        if output_path:
            print(f"\nüíæ Saving to {output_path}...")
            result_df.to_csv(output_path, index=False)
            print("‚úÖ Saved!")

        return result_df

print("✅ EnsembleScorer loaded!")

## 🧪 Step 5: Quick Test (Optional)

Test the model on a few sample descriptions before processing the full dataset.
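
To make the scoring mechanism concrete before running the real model, here is a minimal NumPy sketch of the two steps the scorer performs: take each description's highest cosine similarity to any reference phrase, then min-max scale the results to 0-100. The 2-D vectors are made up for illustration, and the helper names `max_cosine_scores` and `min_max_0_100` are hypothetical. Returning 50 when all scores are equal is an assumption of this sketch.

```python
import numpy as np


def max_cosine_scores(desc_vecs: np.ndarray, ref_vecs: np.ndarray) -> np.ndarray:
    """For each description, the highest cosine similarity to any reference vector."""
    # L2-normalise rows so that a dot product equals cosine similarity
    desc = desc_vecs / np.linalg.norm(desc_vecs, axis=1, keepdims=True)
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    return (desc @ ref.T).max(axis=1)


def min_max_0_100(raw: np.ndarray) -> np.ndarray:
    """Rescale so the lowest raw score maps to 0 and the highest to 100."""
    span = raw.max() - raw.min()
    if span == 0:
        # All scores identical: return the midpoint (assumption of this sketch)
        return np.full_like(raw, 50.0, dtype=float)
    return (raw - raw.min()) / span * 100.0


# Toy 2-D "embeddings" (not real model output)
descriptions = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
references = np.array([[1.0, 0.0], [2.0, 0.0]])

raw = max_cosine_scores(descriptions, references)
scaled = min_max_0_100(raw)
print(raw.round(3), scaled.round(1))
```

Because the scaling is relative to the batch, a small test set always contains one description at 0 and one at 100 for each score; the numbers are only comparable within a single run.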

In [None]:
# Quick test
test_descriptions = [
    "Revolutionary AI-powered healthcare diagnostics with proven clinical results and FDA approval",
    "Our platform serves 50,000+ enterprise customers with 99.99% uptime and strong revenue",
    "Helping small retailers reduce inventory costs by 40% with clear ROI in 6 months",
    "Startup exploring opportunities in tech",
    "Quantum computing breakthrough with proprietary algorithms transforming drug discovery"
]

print("üß™ Testing with sample descriptions...\n")

test_scorer = AdvancedSemanticScorer(model_key=MODEL, batch_size=32)
test_scores = test_scorer.score_descriptions(test_descriptions)

test_results = pd.DataFrame({
    'Description': [d[:50] + '...' for d in test_descriptions],
    'Innovation': test_scores['innovation_score'].round(1),
    'Confidence': test_scores['confidence_score'].round(1),
    'Clarity': test_scores['market_clarity_score'].round(1),
    'Overall': test_scores['overall_quality_score'].round(1)
})

print("\nüìä Test Results:")
print("="*100)
print(test_results.to_string(index=False))
print("="*100)

## 🚀 Step 6: Process Your Full Dataset

In [None]:
# Load dataset
print(f"Loading data from: {input_file}\n")
df = pd.read_csv(input_file)

print("="*60)
print("DATASET INFO")
print("="*60)
print(f"Rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(f"\nAvailable columns:")
for col in df.columns:
    print(f"  • {col}")
print("="*60)

# Check description column
if DESCRIPTION_COLUMN not in df.columns:
    print(f"\n⚠️ Column '{DESCRIPTION_COLUMN}' not found!")
    print("Please update DESCRIPTION_COLUMN in Step 3")
else:
    print(f"\n✅ Using column: '{DESCRIPTION_COLUMN}'")
    sample_desc = str(df[DESCRIPTION_COLUMN].iloc[0])[:150]
    print(f"Sample: {sample_desc}...")

In [None]:
# Initialize scorer based on MODEL setting
print("\n" + "="*70)
print("STARTING SEMANTIC SCORING")
print("="*70 + "\n")

if MODEL == 'ensemble':
    # Use ensemble of best models
    scorer = EnsembleScorer(models=['bge-m3', 'e5-large'], batch_size=BATCH_SIZE)
else:
    # Use single model
    scorer = AdvancedSemanticScorer(model_key=MODEL, batch_size=BATCH_SIZE)

# Process all descriptions
result_df = scorer.score_dataframe(
    df,
    description_column=DESCRIPTION_COLUMN,
    output_path=OUTPUT_FILE
)

print("\n" + "="*70)
print("✅ SCORING COMPLETE!")
print("="*70)

## 📊 Step 7: Analyze Results

In [None]:
# Statistics
score_cols = ['innovation_score', 'confidence_score', 'market_clarity_score', 'overall_quality_score']

print("\n" + "="*70)
print("SCORE STATISTICS")
print("="*70 + "\n")
print(result_df[score_cols].describe().round(1))

In [None]:
# Top scorers
print("\n" + "="*70)
print("TOP 5 COMPANIES BY CATEGORY")
print("="*70 + "\n")

# Show 'name' only if the dataset has that column (avoids a KeyError otherwise)
id_cols = [c for c in ['name'] if c in result_df.columns]

print("🚀 Most Innovative:")
print(result_df.nlargest(5, 'innovation_score')[id_cols + ['innovation_score', DESCRIPTION_COLUMN]].to_string(index=False))

print("\n💪 Most Confident:")
print(result_df.nlargest(5, 'confidence_score')[id_cols + ['confidence_score', DESCRIPTION_COLUMN]].to_string(index=False))

print("\n🎯 Best Market Clarity:")
print(result_df.nlargest(5, 'market_clarity_score')[id_cols + ['market_clarity_score', DESCRIPTION_COLUMN]].to_string(index=False))

print("\n⭐ Highest Overall Quality:")
print(result_df.nlargest(5, 'overall_quality_score')[id_cols + ['overall_quality_score', DESCRIPTION_COLUMN]].to_string(index=False))

In [None]:
# Visualizations
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Semantic Score Distributions', fontsize=16, fontweight='bold')

# Innovation
result_df['innovation_score'].hist(bins=50, ax=axes[0,0], color='#FF6B6B', alpha=0.7, edgecolor='black')
axes[0,0].set_title('Innovation Score', fontsize=12, fontweight='bold')
axes[0,0].set_xlabel('Score')
axes[0,0].set_ylabel('Frequency')
axes[0,0].axvline(result_df['innovation_score'].mean(), color='red', linestyle='--', label=f"Mean: {result_df['innovation_score'].mean():.1f}")
axes[0,0].legend()

# Confidence
result_df['confidence_score'].hist(bins=50, ax=axes[0,1], color='#4ECDC4', alpha=0.7, edgecolor='black')
axes[0,1].set_title('Confidence Score', fontsize=12, fontweight='bold')
axes[0,1].set_xlabel('Score')
axes[0,1].set_ylabel('Frequency')
axes[0,1].axvline(result_df['confidence_score'].mean(), color='darkgreen', linestyle='--', label=f"Mean: {result_df['confidence_score'].mean():.1f}")
axes[0,1].legend()

# Market Clarity
result_df['market_clarity_score'].hist(bins=50, ax=axes[1,0], color='#FFD93D', alpha=0.7, edgecolor='black')
axes[1,0].set_title('Market Clarity Score', fontsize=12, fontweight='bold')
axes[1,0].set_xlabel('Score')
axes[1,0].set_ylabel('Frequency')
axes[1,0].axvline(result_df['market_clarity_score'].mean(), color='orange', linestyle='--', label=f"Mean: {result_df['market_clarity_score'].mean():.1f}")
axes[1,0].legend()

# Overall Quality
result_df['overall_quality_score'].hist(bins=50, ax=axes[1,1], color='#A8E6CF', alpha=0.7, edgecolor='black')
axes[1,1].set_title('Overall Quality Score', fontsize=12, fontweight='bold')
axes[1,1].set_xlabel('Score')
axes[1,1].set_ylabel('Frequency')
axes[1,1].axvline(result_df['overall_quality_score'].mean(), color='darkblue', linestyle='--', label=f"Mean: {result_df['overall_quality_score'].mean():.1f}")
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("üìä Distributions plotted!")

In [None]:
# Score correlations
import seaborn as sns

plt.figure(figsize=(8, 6))
correlation = result_df[score_cols].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, fmt='.2f')
plt.title('Score Correlations', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("üîó Correlation matrix plotted!")

## 📥 Step 8: Download Results

In [None]:
# Download results
from google.colab import files

print(f"üì• Downloading {OUTPUT_FILE}...")
files.download(OUTPUT_FILE)
print("‚úÖ Download complete!")

## 📝 Model Performance Notes

### Expected Processing Time (20k rows):

These are rough estimates. Note that the A100 ships with 40 GB or 80 GB of memory, not 100 GB.

**Single Model:**
- **BGE-M3** (A100): ~2-3 minutes with batch_size=1024
- **E5-Large** (A100): ~2-3 minutes with batch_size=1024
- **MPNet-Base** (V100 16GB): ~3-4 minutes with batch_size=512
- **MiniLM** (T4 16GB): ~5-7 minutes with batch_size=512

**Ensemble (BGE-M3 + E5-Large):**
- A100: ~5-6 minutes (best accuracy)
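
The ensemble takes roughly as long as the two single-model runs combined, since each model encodes every description before the per-model score tables are averaged. The sketch below shows that averaging step on made-up score tables (the values and variable names are for illustration only).

```python
import pandas as pd

# Made-up per-model scores for two descriptions
scores_model_a = pd.DataFrame({'innovation_score': [80.0, 20.0],
                               'confidence_score': [50.0, 90.0]})
scores_model_b = pd.DataFrame({'innovation_score': [60.0, 40.0],
                               'confidence_score': [70.0, 70.0]})

all_scores = [scores_model_a, scores_model_b]

# Element-wise mean across models; the frames must share the same rows and columns
ensemble = sum(all_scores) / len(all_scores)
print(ensemble)
```

A plain mean gives every model equal weight; whether that improves accuracy on a given dataset would need to be checked against labeled examples.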

### Quality Comparison:

| Model | Accuracy | Languages | Speed |
|-------|----------|-----------|-------|
| BGE-M3 | ★★★★★ | 100+ | Medium |
| E5-Large | ★★★★★ | 100+ | Medium |
| Ensemble | ★★★★★+ | 100+ | Slow |
| MPNet | ★★★★ | 50+ | Fast |
| MiniLM | ★★★ | 50+ | Very Fast |

### Tips:
- Use **BGE-M3** or **E5-Large** for best quality
- Use **Ensemble** for maximum accuracy (research/critical applications)
- Use **MPNet** for balanced quality/speed
- Increase `batch_size` to 1024-2048 on a high-memory GPU (e.g. an 80GB A100)
- All models handle multilingual descriptions automatically
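
One caveat for `e5-large`: the `intfloat/multilingual-e5-large` model card recommends starting every input with `query: ` or `passage: `, and suggests `query: ` for symmetric tasks such as semantic similarity. The scorer class in this notebook does not add a prefix. The sketch below shows one possible way to add it before encoding. The helper name `with_e5_prefix` and the substring check on the model id are assumptions made for this example.

```python
from typing import List


def with_e5_prefix(texts: List[str], model_name: str, prefix: str = "query: ") -> List[str]:
    """Prepend the E5 prefix when the model id looks like an E5 model (simple heuristic)."""
    if "e5" not in model_name.lower():
        # Other models are left untouched
        return list(texts)
    # Skip texts that already carry the prefix
    return [t if t.startswith(prefix) else prefix + t for t in texts]


print(with_e5_prefix(["Clear value proposition"], "intfloat/multilingual-e5-large"))
print(with_e5_prefix(["Clear value proposition"], "BAAI/bge-m3"))
```

If used, the same prefix would need to be applied to both the reference phrases and the descriptions so that the two sets of embeddings stay comparable.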