{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "89aad950-d833-496f-8d65-acf997376724",
   "metadata": {},
   "source": [
    "# Header\n",
    "\n",
    "**Student's full name**: [YOUR NAME HERE]\n",
    "\n",
    "**Group**: [YOUR GROUP]  \n",
    "\n",
    "**Degree program**: Artificial Intelligence Engineering  \n",
    "\n",
    "**Last modified**: 04/05/2025\n",
    "\n",
    "---\n",
    "\n",
    "## Detailed program description\n",
    "\n",
    "This program implements several algorithms for key-phrase extraction and automatic text summarization, applied to the first three letters of Mary Shelley's \"Frankenstein\". The algorithms used are:\n",
    "\n",
    "1. TF-IDF (Term Frequency-Inverse Document Frequency)\n",
    "2. Normalized word frequency\n",
    "3. RAKE (Rapid Automatic Keyword Extraction)\n",
    "4. TextRank\n",
    "5. BERT (Bidirectional Encoder Representations from Transformers), approximated in this notebook with SpaCy sentence vectors\n",
    "6. LSA (Latent Semantic Analysis)\n",
    "\n",
    "### Input data:\n",
    "- **Input text**: The first three letters of \"Frankenstein\", downloaded from Project Gutenberg.\n",
    "- **Processing algorithms**: Modules from NLTK, SpaCy, Transformers, and other specialized libraries.\n",
    "\n",
    "### Expected outputs:\n",
    "- Extractive summaries of each letter using each algorithm\n",
    "- Execution-time metrics\n",
    "- Comparative analysis of the results\n",
    "\n",
    "---\n",
    "\n",
    "## LAB ASSIGNMENT 3: KEY-PHRASE IDENTIFICATION AND AUTOMATIC TEXT SUMMARIZATION\n",
    "\n",
    "### 1. Importing libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1eac2da-4402-4f69-bf7d-c530b66e56c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Core libraries\n",
    "import requests\n",
    "import re\n",
    "import time\n",
    "import nltk\n",
    "from nltk.tokenize import sent_tokenize, word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import PorterStemmer, WordNetLemmatizer\n",
    "import pandas as pd\n",
    "\n",
    "# Algorithm-specific libraries\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "import spacy\n",
    "from rake_nltk import Rake\n",
    "import networkx as nx\n",
    "from transformers import pipeline\n",
    "from sklearn.decomposition import TruncatedSVD\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Download the required NLTK resources\n",
    "nltk.download('punkt')\n",
    "nltk.download('punkt_tab')  # recent NLTK releases look for this resource when tokenizing\n",
    "nltk.download('stopwords')\n",
    "nltk.download('wordnet')\n",
    "nltk.download('averaged_perceptron_tagger')\n",
    "\n",
    "# Load the SpaCy English model\n",
    "nlp = spacy.load('en_core_web_sm')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d9ea830-5df5-4b78-9342-58aed1857961",
   "metadata": {},
   "source": [
    "### 2. Building the corpus: downloading and extracting the first three letters of Frankenstein"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51708a64-201e-440a-ad52-7bad5941e5ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the Frankenstein text from Project Gutenberg\n",
    "url = \"https://www.gutenberg.org/files/84/84-0.txt\"\n",
    "response = requests.get(url)\n",
    "response.raise_for_status()  # stop early if the download failed\n",
    "text = response.text\n",
    "\n",
    "# Locate the start and end of the letters section\n",
    "carta_start = text.find(\"LETTER I\")\n",
    "carta_end = text.find(\"CHAPTER 1\")\n",
    "\n",
    "# str.find returns -1 when a marker is missing, which would silently produce a wrong slice\n",
    "if carta_start == -1 or carta_end == -1:\n",
    "    raise ValueError(\"Letter/chapter markers not found; check the heading format of the downloaded edition\")\n",
    "\n",
    "# Extract the letters section\n",
    "cartas_text = text[carta_start:carta_end]\n",
    "\n",
    "# Split out the first three letters\n",
    "cartas = []\n",
    "carta_indices = [\"LETTER I\", \"LETTER II\", \"LETTER III\"]\n",
    "\n",
    "for i in range(len(carta_indices)):\n",
    "    start_idx = cartas_text.find(carta_indices[i])\n",
    "    if i + 1 < len(carta_indices):\n",
    "        end_idx = cartas_text.find(carta_indices[i + 1])\n",
    "    else:\n",
    "        end_idx = cartas_text.find(\"LETTER IV\", start_idx)\n",
    "    \n",
    "    carta = cartas_text[start_idx:end_idx]\n",
    "    cartas.append(carta.strip())\n",
    "\n",
    "print(f\"Letters extracted: {len(cartas)}\")\n",
    "print(f\"Length of letter 1: {len(cartas[0])} characters\")\n",
    "print(f\"Length of letter 2: {len(cartas[1])} characters\")\n",
    "print(f\"Length of letter 3: {len(cartas[2])} characters\")\n",
    "\n",
    "# Show the beginning of each letter as a sanity check\n",
    "for i, carta in enumerate(cartas):\n",
    "    print(f\"\\n=== LETTER {i+1} - First 200 characters ===\\n\")\n",
    "    print(carta[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53a821e2-e2ec-4491-bc29-4f9492a351ac",
   "metadata": {},
   "source": [
    "### 3. Text normalization\n",
    "\n",
    "For each document, we apply the following normalization steps:\n",
    "1. Removal of Project Gutenberg headers and formatting\n",
    "2. Cleanup of special characters\n",
    "3. Tokenization\n",
    "4. Stop-word removal\n",
    "5. Lemmatization (with WordNet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0bf73af-3a78-4790-b378-675ca408e298",
   "metadata": {},
   "outputs": [],
   "source": [
    "def normalize_text(text):\n",
    "    \"\"\"Normalize the text by applying several preprocessing steps\"\"\"\n",
    "    \n",
    "    # 1. Remove the letter heading and Project Gutenberg formatting\n",
    "    text = re.sub(r'^.*?LETTER [I]+', '', text)\n",
    "    text = re.sub(r'_.*?_', ' ', text)  # drop text enclosed in underscores\n",
    "    \n",
    "    # 2. Clean up special characters and repeated whitespace\n",
    "    text = re.sub(r'[^a-zA-Z0-9\\s.,!?]', ' ', text)\n",
    "    text = re.sub(r'\\s+', ' ', text)\n",
    "    \n",
    "    # 3. Convert to lowercase\n",
    "    text = text.lower()\n",
    "    \n",
    "    # 4. Sentence tokenization\n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # 5. Word tokenization, stop-word removal, and lemmatization\n",
    "    stop_words = set(stopwords.words('english'))\n",
    "    lemmatizer = WordNetLemmatizer()\n",
    "    \n",
    "    processed_sentences = []\n",
    "    for sentence in sentences:\n",
    "        words = word_tokenize(sentence)\n",
    "        words = [lemmatizer.lemmatize(word) for word in words \n",
    "                if word.isalpha() and word not in stop_words]\n",
    "        processed_sentences.append(' '.join(words))\n",
    "    \n",
    "    return ' '.join(processed_sentences)\n",
    "\n",
    "# Normalize the letters\n",
    "cartas_normalizadas = []\n",
    "cartas_originals = []  # Keep the original wording for the summaries\n",
    "\n",
    "for i, carta in enumerate(cartas):\n",
    "    # Lightly cleaned original version, used to build the summaries\n",
    "    carta_clean = re.sub(r'^.*?LETTER [I]+', '', carta)\n",
    "    carta_clean = re.sub(r'_.*?_', ' ', carta_clean)\n",
    "    cartas_originals.append(carta_clean.strip())\n",
    "    \n",
    "    # Fully normalized version, used for processing\n",
    "    normalized = normalize_text(carta)\n",
    "    cartas_normalizadas.append(normalized)\n",
    "    \n",
    "    print(f\"\\n=== LETTER {i+1} - Normalized text (first 200 characters) ===\")\n",
    "    print(normalized[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d45f066a-d960-479a-949b-a348e63920e7",
   "metadata": {},
   "source": [
    "### 4. Automatic extractive text summarization\n",
    "\n",
    "#### 4.1 TF-IDF with NLTK and scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "575be191-1338-4fad-b196-2118606fbc08",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tfidf_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary using TF-IDF\"\"\"\n",
    "    \n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # Vectorize with TF-IDF, treating each sentence as a document\n",
    "    tfidf_vectorizer = TfidfVectorizer()\n",
    "    tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)\n",
    "    \n",
    "    # Sum the TF-IDF weights of each sentence\n",
    "    sentence_scores = tfidf_matrix.sum(axis=1).A1\n",
    "    \n",
    "    # Indices of the highest-scoring sentences\n",
    "    top_indices = sentence_scores.argsort()[-num_sentences:][::-1]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "tfidf_times = []\n",
    "tfidf_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING TF-IDF ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = tfidf_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    tfidf_times.append(execution_time)\n",
    "    tfidf_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  },
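  {
   "cell_type": "markdown",
   "id": "note-tfidf-sentence-score",
   "metadata": {},
   "source": [
    "As a sketch of what `tfidf_summarize` computes, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, L2 normalization) and treating each sentence as a document, the score of sentence $s_i$ is the sum of its TF-IDF weights. The symbols $V$ (vocabulary), $n$ (number of sentences) and $\\operatorname{df}(t)$ (number of sentences containing term $t$) are introduced here only for illustration:\n",
    "\n",
    "$$\\operatorname{score}(s_i) = \\sum_{t \\in V} \\operatorname{tfidf}(t, s_i), \\qquad \\operatorname{idf}(t) = \\ln\\frac{1 + n}{1 + \\operatorname{df}(t)} + 1$$\n",
    "\n",
    "$$\\operatorname{tfidf}(t, s_i) = \\frac{\\operatorname{tf}(t, s_i)\\,\\operatorname{idf}(t)}{\\sqrt{\\sum_{u \\in V} \\left(\\operatorname{tf}(u, s_i)\\,\\operatorname{idf}(u)\\right)^2}}$$\n",
    "\n",
    "Because the weights are summed rather than averaged, sentences with many distinct terms tend to score higher."
   ]
  },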
  {
   "cell_type": "markdown",
   "id": "93526356-aef6-4353-866e-28656afbac85",
   "metadata": {},
   "source": [
    "#### 4.2 Normalized word frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5c42cab-b5d0-40b1-ab2e-629ef13a53eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "def word_frequency_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary using normalized word frequency\"\"\"\n",
    "    \n",
    "    # Tokenization\n",
    "    sentences = sent_tokenize(text)\n",
    "    words = word_tokenize(text.lower())\n",
    "    \n",
    "    # Remove stop words and non-alphabetic tokens\n",
    "    stop_words = set(stopwords.words('english'))\n",
    "    words = [word for word in words if word.isalpha() and word not in stop_words]\n",
    "    \n",
    "    # Count word frequencies\n",
    "    word_freq = {}\n",
    "    for word in words:\n",
    "        if word in word_freq:\n",
    "            word_freq[word] += 1\n",
    "        else:\n",
    "            word_freq[word] = 1\n",
    "    \n",
    "    # Normalize by the frequency of the most common word\n",
    "    max_freq = max(word_freq.values())\n",
    "    for word in word_freq:\n",
    "        word_freq[word] = word_freq[word] / max_freq\n",
    "    \n",
    "    # Score each sentence as the mean weight of its scored words\n",
    "    sentence_scores = []\n",
    "    for sentence in sentences:\n",
    "        sentence_words = word_tokenize(sentence.lower())\n",
    "        score = 0\n",
    "        word_count = 0\n",
    "        \n",
    "        for word in sentence_words:\n",
    "            if word in word_freq:\n",
    "                score += word_freq[word]\n",
    "                word_count += 1\n",
    "        \n",
    "        if word_count > 0:\n",
    "            sentence_scores.append(score / word_count)\n",
    "        else:\n",
    "            sentence_scores.append(0)\n",
    "    \n",
    "    # Indices of the highest-scoring sentences\n",
    "    top_indices = sorted(range(len(sentence_scores)), \n",
    "                        key=lambda i: sentence_scores[i], \n",
    "                        reverse=True)[:num_sentences]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "freq_times = []\n",
    "freq_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING NORMALIZED WORD FREQUENCY ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = word_frequency_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    freq_times.append(execution_time)\n",
    "    freq_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  },
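  {
   "cell_type": "markdown",
   "id": "note-word-frequency-score",
   "metadata": {},
   "source": [
    "The scoring in `word_frequency_summarize` can be written compactly as follows. The notation is introduced here for illustration: $f(t)$ is the number of times the non-stop word $t$ occurs in the letter, and $W_s$ is the list of tokens of sentence $s$ that received a weight (repetitions included):\n",
    "\n",
    "$$w(t) = \\frac{f(t)}{\\max_{u} f(u)}, \\qquad \\operatorname{score}(s) = \\frac{1}{|W_s|} \\sum_{t \\in W_s} w(t)$$\n",
    "\n",
    "A sentence with no weighted tokens gets a score of 0. Since the score is an average, a short sentence built from a few frequent words can outrank a long, informative one."
   ]
  },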
  {
   "cell_type": "markdown",
   "id": "80d8a7f4-f551-4e73-9033-e473b1881341",
   "metadata": {},
   "source": [
    "#### 4.3 RAKE (Rapid Automatic Keyword Extraction)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dcb4dc2-7d51-4cf8-8b26-ebc62c32c316",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rake_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary using RAKE key phrases\"\"\"\n",
    "    \n",
    "    # Initialize RAKE\n",
    "    r = Rake()\n",
    "    r.extract_keywords_from_text(text)\n",
    "    \n",
    "    # Key phrases with their scores\n",
    "    phrases = r.get_ranked_phrases_with_scores()\n",
    "    \n",
    "    # Sentences of the text\n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # Score each sentence by the key phrases it contains\n",
    "    sentence_scores = []\n",
    "    for sentence in sentences:\n",
    "        # Collapse line breaks so phrases can match across the source's hard-wrapped lines\n",
    "        sentence_lower = ' '.join(sentence.lower().split())\n",
    "        score = 0\n",
    "        for phrase_score, phrase in phrases:\n",
    "            if phrase.lower() in sentence_lower:\n",
    "                score += phrase_score\n",
    "        sentence_scores.append(score)\n",
    "    \n",
    "    # Indices of the highest-scoring sentences\n",
    "    top_indices = sorted(range(len(sentence_scores)), \n",
    "                        key=lambda i: sentence_scores[i], \n",
    "                        reverse=True)[:num_sentences]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "rake_times = []\n",
    "rake_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING RAKE ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = rake_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    rake_times.append(execution_time)\n",
    "    rake_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  },
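  {
   "cell_type": "markdown",
   "id": "note-rake-sentence-score",
   "metadata": {},
   "source": [
    "A rough sketch of the scores behind `rake_summarize`, assuming `rake_nltk` is left on its default ranking metric (degree-to-frequency ratio). Here $\\deg(w)$ is the degree of word $w$ in the co-occurrence graph built from the candidate phrases, $\\operatorname{freq}(w)$ is its frequency, and $P$ is the set of ranked phrases; the symbols are illustrative and not part of the library's API:\n",
    "\n",
    "$$\\operatorname{score}(w) = \\frac{\\deg(w)}{\\operatorname{freq}(w)}, \\qquad \\operatorname{score}(p) = \\sum_{w \\in p} \\operatorname{score}(w), \\qquad \\operatorname{score}(s) = \\sum_{p \\in P,\\; p \\subseteq s} \\operatorname{score}(p)$$\n",
    "\n",
    "The last sum is the step added by this notebook: a sentence accumulates the score of every key phrase that occurs in it as a substring."
   ]
  },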
  {
   "cell_type": "markdown",
   "id": "c5dfb249-78f3-4e5c-aae1-4f416dc5c71e",
   "metadata": {},
   "source": [
    "#### 4.4 TextRank"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46c25d35-c7bc-4a83-9008-8539ca48d4d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def textrank_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary using TextRank\"\"\"\n",
    "    \n",
    "    # Split into sentences\n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # Build TF-IDF sentence vectors\n",
    "    vectorizer = TfidfVectorizer()\n",
    "    tfidf_matrix = vectorizer.fit_transform(sentences)\n",
    "    \n",
    "    # Rows are L2-normalized by default, so this product gives pairwise cosine similarities\n",
    "    similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()\n",
    "    \n",
    "    # Build the sentence graph with NetworkX\n",
    "    graph = nx.from_numpy_array(similarity_matrix)\n",
    "    \n",
    "    # Apply PageRank (TextRank)\n",
    "    scores = nx.pagerank(graph)\n",
    "    \n",
    "    # Indices of the highest-scoring sentences\n",
    "    top_indices = sorted(scores, key=scores.get, reverse=True)[:num_sentences]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "textrank_times = []\n",
    "textrank_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING TEXTRANK ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = textrank_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    textrank_times.append(execution_time)\n",
    "    textrank_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  },
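  {
   "cell_type": "markdown",
   "id": "note-textrank-pagerank",
   "metadata": {},
   "source": [
    "For reference, a sketch of the weighted PageRank recurrence that underlies TextRank, under the assumption that `nx.pagerank` is called with its default damping factor $d = 0.85$. Here $N$ is the number of sentences and $w_{ij}$ is the TF-IDF cosine similarity between sentences $s_i$ and $s_j$; the notation is introduced for illustration and the special handling of isolated nodes is omitted:\n",
    "\n",
    "$$PR(s_i) = \\frac{1 - d}{N} + d \\sum_{j} \\frac{w_{ji}}{\\sum_{k} w_{jk}}\\, PR(s_j)$$\n",
    "\n",
    "A sentence ranks highly when it is similar to many sentences that are themselves highly ranked, which is why TextRank favors sentences central to the letter's main topic."
   ]
  },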
  {
   "cell_type": "markdown",
   "id": "08dd7914-4f33-4e38-b41e-10b2653e249c",
   "metadata": {},
   "source": [
    "#### 4.5 BERT with Transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aed33907-4d58-4bbd-96aa-dac964670ad7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def bert_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary from sentence embeddings (SpaCy stand-in for BERT)\"\"\"\n",
    "    \n",
    "    # Split into sentences\n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # Note: this function does not run a BERT model. A real setup would use a\n",
    "    # dedicated sentence-embedding model; here SpaCy's doc.vector is a simpler stand-in.\n",
    "    \n",
    "    # Embed each sentence with SpaCy (as an approximation to BERT)\n",
    "    embeddings = []\n",
    "    for sentence in sentences:\n",
    "        doc = nlp(sentence)\n",
    "        embeddings.append(doc.vector)\n",
    "    \n",
    "    # Embed the whole document for comparison\n",
    "    doc_vector = nlp(text).vector\n",
    "    doc_norm = sum(doc_vector * doc_vector) ** 0.5\n",
    "    \n",
    "    similarities = []\n",
    "    for embedding in embeddings:\n",
    "        # Cosine similarity; an all-zero vector gets similarity 0 instead of NaN\n",
    "        norm = sum(embedding * embedding) ** 0.5 * doc_norm\n",
    "        similarity = sum(embedding * doc_vector) / norm if norm else 0.0\n",
    "        similarities.append(similarity)\n",
    "    \n",
    "    # Indices of the sentences most similar to the document\n",
    "    top_indices = sorted(range(len(similarities)), \n",
    "                        key=lambda i: similarities[i], \n",
    "                        reverse=True)[:num_sentences]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "bert_times = []\n",
    "bert_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING BERT (SpaCy approximation) ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = bert_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    bert_times.append(execution_time)\n",
    "    bert_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  },
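  {
   "cell_type": "markdown",
   "id": "note-cosine-similarity",
   "metadata": {},
   "source": [
    "The ranking in `bert_summarize` relies on cosine similarity between each sentence vector and the vector of the whole letter. With $e_i$ denoting the vector of sentence $s_i$ and $d$ the document vector (names chosen here for illustration), the quantity computed is:\n",
    "\n",
    "$$\\cos(e_i, d) = \\frac{e_i \\cdot d}{\\lVert e_i \\rVert_2\\, \\lVert d \\rVert_2}$$\n",
    "\n",
    "Values close to 1 indicate that a sentence points in nearly the same direction as the document as a whole. Swapping in embeddings from an actual BERT-based model would leave this formula unchanged; only the vectors would differ."
   ]
  },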
  {
   "cell_type": "markdown",
   "id": "e8f2eb19-8326-49fd-8d2f-636b139e56be",
   "metadata": {},
   "source": [
    "#### 4.6 LSA (Latent Semantic Analysis)"
   ]
  },
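  {
   "cell_type": "markdown",
   "id": "note-lsa-sentence-score",
   "metadata": {},
   "source": [
    "As a sketch of the LSA scoring used in this notebook: the sentence-term count matrix $X$ (one row per sentence) is approximated by a truncated SVD with $k = 4$ components, and each sentence is scored by the total magnitude of its coordinates in the reduced space. The symbols are introduced here for illustration; $Z = U_k \\Sigma_k$ is what `TruncatedSVD.fit_transform` returns:\n",
    "\n",
    "$$X \\approx U_k \\Sigma_k V_k^{\\top}, \\qquad Z = U_k \\Sigma_k, \\qquad \\operatorname{score}(s_i) = \\sum_{j=1}^{k} \\lvert Z_{ij} \\rvert$$\n",
    "\n",
    "Sentences that load strongly on one or more of the $k$ latent topics receive the highest scores."
   ]
  },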
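  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28a5d9eb-4ff8-438e-9b49-87fd2e1f4a91",
   "metadata": {},
   "outputs": [],
   "source": [
    "def lsa_summarize(text, num_sentences=4):\n",
    "    \"\"\"Generate an extractive summary using LSA\"\"\"\n",
    "    \n",
    "    # Split into sentences\n",
    "    sentences = sent_tokenize(text)\n",
    "    \n",
    "    # Build the sentence-term matrix\n",
    "    vectorizer = CountVectorizer(stop_words='english')\n",
    "    document_matrix = vectorizer.fit_transform(sentences)\n",
    "    \n",
    "    # Apply SVD (Singular Value Decomposition)\n",
    "    svd = TruncatedSVD(n_components=4)  # Reduce to 4 dimensions\n",
    "    document_matrix_reduced = svd.fit_transform(document_matrix)\n",
    "    \n",
    "    # Score each sentence from its coordinates on the principal components\n",
    "    sentence_scores = []\n",
    "    for i in range(len(sentences)):\n",
    "        # Sum the absolute value of each component\n",
    "        score = sum(abs(document_matrix_reduced[i]))\n",
    "        sentence_scores.append(score)\n",
    "    \n",
    "    # Indices of the highest-scoring sentences\n",
    "    top_indices = sorted(range(len(sentence_scores)), \n",
    "                        key=lambda i: sentence_scores[i], \n",
    "                        reverse=True)[:num_sentences]\n",
    "    \n",
    "    # Restore the original sentence order\n",
    "    top_indices = sorted(top_indices)\n",
    "    \n",
    "    summary = [sentences[i] for i in top_indices]\n",
    "    return summary\n",
    "\n",
    "# Measure execution time and generate the summaries\n",
    "lsa_times = []\n",
    "lsa_summaries = []\n",
    "\n",
    "print(\"=== SUMMARY USING LSA ===\")\n",
    "for i, carta in enumerate(cartas_originals):\n",
    "    start_time = time.time()\n",
    "    summary = lsa_summarize(carta)\n",
    "    execution_time = time.time() - start_time\n",
    "    \n",
    "    lsa_times.append(execution_time)\n",
    "    lsa_summaries.append(summary)\n",
    "    \n",
    "    print(f\"\\nLetter {i+1} - Execution time: {execution_time:.3f} seconds\")\n",
    "    print(\"Summary:\")\n",
    "    for j, sentence in enumerate(summary):\n",
    "        print(f\"{j+1}. {sentence[:100]}...\")"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}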