# Facebook Comment Cleaning Procedures

## Loading the file
We extracted 1,569 comments from September and October 2020 that mention CitiBanamex.

We install the mlxtend library, which lets us implement association rules.

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import nltk
import re
import string

In [None]:
# conda install -c conda-forge spacy
# python -m spacy download es_core_news_sm

In [None]:
df = pd.read_csv('../data/BanamexFace.csv')
df.head()

## Text Cleaning
#### Removing punctuation marks

In [None]:
def remove_punct(text):
    # Keep only characters that are not ASCII punctuation
    # (string.punctuation does not include Spanish marks such as ¿ or ¡).
    return "".join(char for char in text if char not in string.punctuation)

df['com_p'] = df['com'].apply(remove_punct)
df.head()
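
Because `string.punctuation` only covers ASCII symbols, opening marks such as `¿` and `¡` survive the step above. The following is a minimal sketch of one way to extend the filter. The name `remove_punct_es` and the exact set of extra characters are assumptions for illustration, not part of the original notebook.

```python
import string

# string.punctuation is ASCII-only; the extra marks below are an assumption
# about what Spanish-language comments typically contain.
SPANISH_PUNCT = string.punctuation + "¿¡«»“”…"

def remove_punct_es(text):
    """Drop ASCII punctuation plus the extra Spanish marks listed above."""
    return "".join(ch for ch in text if ch not in SPANISH_PUNCT)

print(remove_punct_es("¡Hola! ¿Cómo están?"))
```

If this variant were adopted, it could be passed to `apply` in place of `remove_punct`.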

#### Creating a column with the tokenized comments

In [None]:
def tokenize(text):
    # Raw string avoids the invalid-escape warning that '\W+' triggers
    tokens = re.split(r'\W+', text)
    return tokens

df['com_t'] = df['com_p'].apply(lambda x: tokenize(x.lower()))
df.head()
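
One detail worth knowing about splitting on `\W+`: when a comment starts or ends with a non-word character (a space, an emoji), the result contains empty strings. The sketch below contrasts that with `re.findall(r"\w+", ...)`, which returns only the word tokens. The helper name `tokenize_words` and the sample sentence are hypothetical.

```python
import re

def tokenize_words(text):
    """Return lowercase word tokens; never yields empty strings."""
    return re.findall(r"\w+", text.lower())

sample = " Pésima atención en sucursal "

# Splitting keeps an empty token at each edge of the string
print(re.split(r"\W+", sample.lower()))

# Matching words directly returns only the four tokens
print(tokenize_words(sample))
```

In Python 3, `\w` is Unicode-aware for `str` patterns, so accented words such as "atención" stay in one piece under either approach.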

## POS Tagging
#### Calling the Stanford POS Tagger (https://nlp.stanford.edu/software/spanish-faq.shtml#tagset)

In [None]:
from nltk.internals import find_jars_within_path
from nltk.tag import StanfordPOSTagger
%env JAVAHOME=C:\Program Files\Java\jre1.8.0_321
tagger="C:\\Users\\lc250058\\notebooks\\Texto\\src\\models\\spanish-ud.tagger"
jar="C:\\Users\\lc250058\\notebooks\\Texto\\src\\stanford-postagger.jar"
etiquetador=StanfordPOSTagger(tagger,jar)

In [None]:
etiquetas=etiquetador.tag(df.iloc[0,9])
for etiqueta in etiquetas:
    print(etiqueta)

In [None]:
# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for i in range(20):  # only the first 20 comments are tagged
    # column 9 is expected to hold the token list (com_t)
    etiquetas = etiquetador.tag(df.iloc[i, 9])
    for term, tag in etiquetas:
        rows.append({'comm_id': i, 'term': term, 'tag': tag})
df_pos = pd.DataFrame(rows, columns=['comm_id', 'term', 'tag'])
df_pos.head(10)

In [None]:
df_pos.shape

## Stemming
#### Stemming in Spanish with Snowball

In [None]:
from nltk.stem import SnowballStemmer
stm = SnowballStemmer('spanish')  # the language must be specified
[stm.stem(word) for word in df.iloc[0,9]]

In [None]:
# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for i in range(len(df)):
    # column 9 is expected to hold the token list (com_t)
    for word in df.iloc[i, 9]:
        rows.append({'comm_id': i, 'stem': stm.stem(word)})
df_stem = pd.DataFrame(rows, columns=['comm_id', 'stem'])
df_stem.head(10)
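
The same long-format table (one row per comment and stem) can also be built without an explicit loop by using `Series.explode`. This is only a sketch under stated assumptions: `toy` is a hypothetical two-comment frame standing in for `df`, and `toy_stem` is a stand-in for `stm.stem` so the example runs without NLTK.

```python
import pandas as pd

# Hypothetical frame standing in for df; 'com_t' holds one token list per comment.
toy = pd.DataFrame(
    {"com_t": [["buen", "servicio"], ["mala", "atencion", "telefonica"]]}
)

def toy_stem(word):
    # Stand-in for stm.stem: strips trailing vowels. Not a real stemmer.
    return word.rstrip("aeiou")

stems_long = (
    toy["com_t"]
    .explode()               # one row per token, original index repeated
    .map(toy_stem)           # apply the (stand-in) stemmer to each token
    .rename("stem")
    .rename_axis("comm_id")  # the repeated index becomes the comment id
    .reset_index()
)
print(stems_long)
```

With the real data, replacing `toy` with `df` and `toy_stem` with `stm.stem` should give the same two columns, `comm_id` and `stem`.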

In [None]:
df_stem.shape

## Lemmatization
#### Lemmatization with WordNet (English only)

In [None]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
[lemm.lemmatize(word) for word in ['cars','dogs','families','parties','elements']]

#### Lemmatization with spaCy (using a Spanish model)
spaCy and its Spanish language model must be installed:
```
python -m spacy download es_core_news_sm
```

In [None]:
import spacy
nlp = spacy.load('es_core_news_sm')
doc = nlp(df.iloc[0,8])
[tok.lemma_ for tok in doc]

In [None]:
# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for i in range(len(df)):
    # column 8 is expected to hold the punctuation-free text (com_p)
    doc = nlp(df.iloc[i, 8])
    for term in doc:
        rows.append({'comm_id': i, 'lemma': term.lemma_})
df_lem = pd.DataFrame(rows, columns=['comm_id', 'lemma'])
df_lem.head(10)

In [None]:
df_lem.shape

## N-Grams

In [None]:
from nltk import ngrams
words = df.iloc[0,9]
num_elementos = 3
n_grams = ngrams(words, num_elementos)
for grams in n_grams:
    print(grams)
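
For intuition, the sliding window that `ngrams` applies (without padding) can be sketched in plain Python. The helper `sliding_ngrams` and the sample tokens are hypothetical and only illustrate the idea. They are not NLTK's implementation.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so no partial windows appear
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["no", "puedo", "entrar", "a", "la", "app"]
for gram in sliding_ngrams(tokens, 3):
    print(gram)
```

A list of 6 tokens yields 4 trigrams, and a list shorter than `n` yields none.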

## Spelling Correction
We use the dictionary-based code written by Peter Norvig: http://www.norvig.com/spell-correct.html

In [None]:
%run C:\Users\lc250058\notebooks\Texto\src\spell.py

In [None]:
correction('natufaleza')

In [None]:
[correction(term) for term in df.iloc[1,9]]
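
The contents of `spell.py` are not shown here, so the following is only a reduced sketch of the idea behind Norvig's corrector: generate every string one edit away from the input and pick the most frequent known word. The vocabulary `WORDS` and its counts are made up, the function is named `correct_word` to avoid clashing with `correction`, and the second round of edits used in Norvig's article is omitted for brevity.

```python
from collections import Counter

# Hypothetical toy vocabulary with made-up counts; a real corrector
# builds these counts from a large text corpus.
WORDS = Counter({"naturaleza": 5, "natural": 8, "banco": 12, "tarjeta": 9})
LETTERS = "abcdefghijklmnopqrstuvwxyzáéíóúüñ"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of words that appear in the vocabulary."""
    return {w for w in words if w in WORDS}

def correct_word(word):
    """Most frequent known candidate; falls back to the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correct_word("natufaleza"))
```

Words with no known candidate within one edit are returned unchanged, so the quality of the correction depends entirely on the vocabulary.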

## Named Entity Recognition (NER)
We use the same spaCy model.

In [None]:
from spacy import displacy

In [None]:
doc = nlp(df.iloc[0,8])
for sent in doc.sents:
    displacy.render(nlp(sent.text),style='ent',jupyter=True)

In [None]:
for i in range(10):
    doc = nlp(df.iloc[i,8])
    for sent in doc.sents:
        displacy.render(nlp(sent.text),style='ent',jupyter=True)

Prepared by Luis Cajachahua under the MIT license (2022)