# Exercise 5: Probabilistic Model

## Objective
- Apply preprocessing techniques step by step, evaluating the impact of each stage on the number of tokens and the size of the final vocabulary.
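
To quantify the impact of each stage, it helps to have one small function that reports the total number of tokens and the vocabulary size of a tokenized corpus. The sketch below is not part of the original notebook; `resumen_corpus` and the toy corpus are hypothetical names and data used only for illustration.

```python
def resumen_corpus(corpus):
    """Return (total tokens, vocabulary size) for a list of token lists."""
    total_tokens = sum(len(doc) for doc in corpus)
    vocabulario = {token for doc in corpus for token in doc}
    return total_tokens, len(vocabulario)

# Toy corpus (hypothetical data) standing in for the real tokenized documents
corpus_demo = [["I", "am", "sure", "."], ["i", "am", "a", "bit", "puzzled", "."]]
print(resumen_corpus(corpus_demo))
```

Calling it after each part on `corpus_tokenizado`, `corpus_normalizado`, `corpus_sin_stopwords`, `corpus_stemmed`, and `corpus_lematizado` would give the figures the objective asks for.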

## Part 0: Loading the Corpus


In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

## Part 1: Tokenization

### Activity
1. Tokenize the documents.

In [5]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.4.0


In [21]:
# Tokenize: split the text into units in order to analyze its structure and meaning.
import nltk
from nltk.tokenize import word_tokenize  # word-level tokenization

nltk.download('punkt')
nltk.download('punkt_tab')
corpus_tokenizado = [word_tokenize(doc) for doc in newsgroupsdocs]

corpus_tokenizado

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


[['I',
  'am',
  'sure',
  'some',
  'bashers',
  'of',
  'Pens',
  'fans',
  'are',
  'pretty',
  'confused',
  'about',
  'the',
  'lack',
  'of',
  'any',
  'kind',
  'of',
  'posts',
  'about',
  'the',
  'recent',
  'Pens',
  'massacre',
  'of',
  'the',
  'Devils',
  '.',
  'Actually',
  ',',
  'I',
  'am',
  'bit',
  'puzzled',
  'too',
  'and',
  'a',
  'bit',
  'relieved',
  '.',
  'However',
  ',',
  'I',
  'am',
  'going',
  'to',
  'put',
  'an',
  'end',
  'to',
  'non-PIttsburghers',
  "'",
  'relief',
  'with',
  'a',
  'bit',
  'of',
  'praise',
  'for',
  'the',
  'Pens',
  '.',
  'Man',
  ',',
  'they',
  'are',
  'killing',
  'those',
  'Devils',
  'worse',
  'than',
  'I',
  'thought',
  '.',
  'Jagr',
  'just',
  'showed',
  'you',
  'why',
  'he',
  'is',
  'much',
  'better',
  'than',
  'his',
  'regular',
  'season',
  'stats',
  '.',
  'He',
  'is',
  'also',
  'a',
  'lot',
  'fo',
  'fun',
  'to',
  'watch',
  'in',
  'the',
  'playoffs',
  '.',
  'Bowman',


In [12]:
# Tokenization with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

corpus_tokenizado_spacy = [nlp(doc) for doc in newsgroupsdocs]

In [20]:
corpus_tokenizado_spacy[9]

 if a christian means someone who believes in the divinity of jesus it is safe to say that jesus was a christian  on the first day after christmas my truelove served to me  leftover turkey on the second day after christmas my truelove served to me  turkey casserole     that she made from leftover turkey days  deleted   flaming turkey wings      pizza hut commercial and mtluagic bait

## Part 2: Normalization

### Activity
1. Convert all tokens to lowercase.
2. Remove punctuation and non-alphabetic symbols.
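
The two steps above can be sketched on a handful of tokens. This is a minimal illustration using only the standard library, not the notebook's implementation; `normalizar` and the sample tokens are hypothetical. Note that `re.fullmatch` has to match the entire token, so the character class needs a `+` quantifier to also discard multi-character tokens such as `...` or `''`.

```python
import re

# Matches tokens made up entirely of characters other than ASCII lowercase letters
NO_ALFABETICO = re.compile(r"[^a-z]+")

def normalizar(tokens):
    """Lowercase every token and drop those that contain no ASCII letter."""
    resultado = []
    for token in tokens:
        token = token.lower()
        if token and not NO_ALFABETICO.fullmatch(token):
            resultado.append(token)
    return resultado

muestra = ["Pens", ".", "...", "''", "1993", "non-PIttsburghers", "JAgr"]
print(normalizar(muestra))
# ['pens', 'non-pittsburghers', 'jagr']
```

Tokens that mix letters and symbols, such as `non-pittsburghers`, survive this filter; stripping the hyphen would need an extra substitution step.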

In [22]:
corpus_tokenizado[0]

['I',
 'am',
 'sure',
 'some',
 'bashers',
 'of',
 'Pens',
 'fans',
 'are',
 'pretty',
 'confused',
 'about',
 'the',
 'lack',
 'of',
 'any',
 'kind',
 'of',
 'posts',
 'about',
 'the',
 'recent',
 'Pens',
 'massacre',
 'of',
 'the',
 'Devils',
 '.',
 'Actually',
 ',',
 'I',
 'am',
 'bit',
 'puzzled',
 'too',
 'and',
 'a',
 'bit',
 'relieved',
 '.',
 'However',
 ',',
 'I',
 'am',
 'going',
 'to',
 'put',
 'an',
 'end',
 'to',
 'non-PIttsburghers',
 "'",
 'relief',
 'with',
 'a',
 'bit',
 'of',
 'praise',
 'for',
 'the',
 'Pens',
 '.',
 'Man',
 ',',
 'they',
 'are',
 'killing',
 'those',
 'Devils',
 'worse',
 'than',
 'I',
 'thought',
 '.',
 'Jagr',
 'just',
 'showed',
 'you',
 'why',
 'he',
 'is',
 'much',
 'better',
 'than',
 'his',
 'regular',
 'season',
 'stats',
 '.',
 'He',
 'is',
 'also',
 'a',
 'lot',
 'fo',
 'fun',
 'to',
 'watch',
 'in',
 'the',
 'playoffs',
 '.',
 'Bowman',
 'should',
 'let',
 'JAgr',
 'have',
 'a',
 'lot',
 'of',
 'fun',
 'in',
 'the',
 'next',
 'couple',
 '

In [28]:
import re
from unidecode import unidecode

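def procesar_token(token):
  token = token.lower()  # lowercase
  token = unidecode(token)  # transliterate accented characters to ASCII

  # Drop tokens made up entirely of punctuation, digits, or other non-alphabetic
  # characters; the `+` also catches multi-character tokens such as '...' or "''"
  if not re.fullmatch(r'[^a-z\s]+', token):
    return token
  else:
    return None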

def limpiar_tokens(corpus_tokenizado):
  res = []
  for doc in corpus_tokenizado:
    tokens = list(filter(None, map(procesar_token, doc)))
    res.append(tokens)
  return res

limpiar_tokens(corpus_tokenizado[0:1])

[['i',
  'am',
  'sure',
  'some',
  'bashers',
  'of',
  'pens',
  'fans',
  'are',
  'pretty',
  'confused',
  'about',
  'the',
  'lack',
  'of',
  'any',
  'kind',
  'of',
  'posts',
  'about',
  'the',
  'recent',
  'pens',
  'massacre',
  'of',
  'the',
  'devils',
  'actually',
  'i',
  'am',
  'bit',
  'puzzled',
  'too',
  'and',
  'a',
  'bit',
  'relieved',
  'however',
  'i',
  'am',
  'going',
  'to',
  'put',
  'an',
  'end',
  'to',
  'non-pittsburghers',
  'relief',
  'with',
  'a',
  'bit',
  'of',
  'praise',
  'for',
  'the',
  'pens',
  'man',
  'they',
  'are',
  'killing',
  'those',
  'devils',
  'worse',
  'than',
  'i',
  'thought',
  'jagr',
  'just',
  'showed',
  'you',
  'why',
  'he',
  'is',
  'much',
  'better',
  'than',
  'his',
  'regular',
  'season',
  'stats',
  'he',
  'is',
  'also',
  'a',
  'lot',
  'fo',
  'fun',
  'to',
  'watch',
  'in',
  'the',
  'playoffs',
  'bowman',
  'should',
  'let',
  'jagr',
  'have',
  'a',
  'lot',
  'of',
  'fu

In [29]:
# Apply the cleaning step to the whole corpus
corpus_normalizado = limpiar_tokens(corpus_tokenizado)

## Part 3: Stopword Removal

### Activity
1. Remove stopwords using a standard list.

In [41]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words= set(stopwords.words('english'))


def borrar_stopwords(doc):
  return [t for t in doc if t not in stop_words]

corpus_sin_stopwords = list(map(borrar_stopwords, corpus_normalizado))  # list() forces eager evaluation of the lazy map


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
corpus_sin_stopwords[0]

['sure',
 'bashers',
 'pens',
 'fans',
 'pretty',
 'confused',
 'lack',
 'kind',
 'posts',
 'recent',
 'pens',
 'massacre',
 'devils',
 'actually',
 'bit',
 'puzzled',
 'bit',
 'relieved',
 'however',
 'going',
 'put',
 'end',
 'non-pittsburghers',
 'relief',
 'bit',
 'praise',
 'pens',
 'man',
 'killing',
 'devils',
 'worse',
 'thought',
 'jagr',
 'showed',
 'much',
 'better',
 'regular',
 'season',
 'stats',
 'also',
 'lot',
 'fo',
 'fun',
 'watch',
 'playoffs',
 'bowman',
 'let',
 'jagr',
 'lot',
 'fun',
 'next',
 'couple',
 'games',
 'since',
 'pens',
 'going',
 'beat',
 'pulp',
 'jersey',
 'anyway',
 'disappointed',
 'see',
 'islanders',
 'lose',
 'final',
 'regular',
 'season',
 'game',
 'pens',
 'rule']

## Part 4: Stemming or Lemmatization

### Activity
1. Apply stemming.
2. Apply lemmatization.
3. Compare both techniques.

In [43]:
#1.
# Stemming
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # create the stemmer once and reuse it for every document

def steamming_doc(doc):
  return [stemmer.stem(token) for token in doc]

corpus_stemmed = list(map(steamming_doc, corpus_sin_stopwords))
corpus_stemmed[0]

['sure',
 'basher',
 'pen',
 'fan',
 'pretti',
 'confus',
 'lack',
 'kind',
 'post',
 'recent',
 'pen',
 'massacr',
 'devil',
 'actual',
 'bit',
 'puzzl',
 'bit',
 'reliev',
 'howev',
 'go',
 'put',
 'end',
 'non-pittsburgh',
 'relief',
 'bit',
 'prais',
 'pen',
 'man',
 'kill',
 'devil',
 'wors',
 'thought',
 'jagr',
 'show',
 'much',
 'better',
 'regular',
 'season',
 'stat',
 'also',
 'lot',
 'fo',
 'fun',
 'watch',
 'playoff',
 'bowman',
 'let',
 'jagr',
 'lot',
 'fun',
 'next',
 'coupl',
 'game',
 'sinc',
 'pen',
 'go',
 'beat',
 'pulp',
 'jersey',
 'anyway',
 'disappoint',
 'see',
 'island',
 'lose',
 'final',
 'regular',
 'season',
 'game',
 'pen',
 'rule']

In [48]:
#2.
# Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tag import pos_tag

nltk.download('wordnet')
nltk.download('punkt')
nltk.download("averaged_perceptron_tagger")
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()

# Map NLTK (Penn Treebank) POS tags to WordNet POS constants
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default

def lematizar_doc (doc):
  tag = pos_tag(doc)
  return [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in tag]

corpus_lematizado = list(map(lematizar_doc, corpus_sin_stopwords))

corpus_lematizado[0]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


['sure',
 'bashers',
 'pen',
 'fan',
 'pretty',
 'confused',
 'lack',
 'kind',
 'post',
 'recent',
 'pen',
 'massacre',
 'devil',
 'actually',
 'bit',
 'puzzled',
 'bit',
 'relieve',
 'however',
 'go',
 'put',
 'end',
 'non-pittsburghers',
 'relief',
 'bit',
 'praise',
 'pen',
 'man',
 'kill',
 'devil',
 'bad',
 'think',
 'jagr',
 'show',
 'much',
 'good',
 'regular',
 'season',
 'stats',
 'also',
 'lot',
 'fo',
 'fun',
 'watch',
 'playoff',
 'bowman',
 'let',
 'jagr',
 'lot',
 'fun',
 'next',
 'couple',
 'game',
 'since',
 'pen',
 'go',
 'beat',
 'pulp',
 'jersey',
 'anyway',
 'disappointed',
 'see',
 'islander',
 'lose',
 'final',
 'regular',
 'season',
 'game',
 'pen',
 'rule']

In [53]:
#3.
# Comparison
from tabulate import tabulate
doc = [
    [newsgroupsdocs[0]]
]


data = [
  corpus_sin_stopwords[0],
   corpus_stemmed[0],
   corpus_lematizado[0]
]
data_T = list(zip(*data))

print(tabulate(doc, headers=["Documento"], tablefmt='grid'))
print(tabulate(data_T, headers=["Normal", "Stemming", "Lemmatization"], tablefmt="grid"))


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Documento                                                                                                                                                       |
| I am sure some bashers of Pens fans are pretty confused about the lack                                                                                          |
| of any kind of posts about the recent Pens massacre of the Devils. Actually,                                                                                    |
| I am  bit puzzled too and a bit relieved. However, I am going to put an end                                                                                     |
| to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they                                                                                       |
| are killing th

<ul>
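<li>Stemming: Trims each word to a root form by stripping its endings with fixed rules, so the result is not always a real word (for example, 'pretty' becomes 'pretti').</li>
<li>Lemmatization: Reduces each word to its base form (lemma), taking the grammatical context into account, so the result is a dictionary word (for example, 'better' becomes 'good').</li>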
</ul>
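
The difference is easy to see by lining up a few tokens from document 0 with their stemmed and lemmatized forms. The triples below are copied from the outputs shown earlier; the helper name `comparar` is hypothetical, and the snippet is only a sketch for inspecting where the two techniques disagree.

```python
# (token, stem, lemma) triples taken from the outputs for document 0
ejemplos = [
    ("pretty", "pretti", "pretty"),
    ("massacre", "massacr", "massacre"),
    ("worse", "wors", "bad"),
    ("thought", "thought", "think"),
    ("better", "better", "good"),
    ("islanders", "island", "islander"),
    ("games", "game", "game"),
]

def comparar(triples):
    """Return the triples where stemming and lemmatization give different results."""
    return [t for t in triples if t[1] != t[2]]

for token, stem, lemma in comparar(ejemplos):
    print(f"{token:<10} stem={stem:<8} lemma={lemma}")
```

Only `games` is left out of the printout, since both techniques reduce it to `game`; in every other case the two results differ, either because the stem is a truncated non-word or because the lemma maps to a different dictionary form.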