# Chapter 3: Building Your NLP Vocabulary

### In this chapter, we will cover the following topics:

- Lexicons
- Phonemes, graphemes, and morphemes
- Tokenization
- Understanding word normalization

**Lexicons** can be defined as the vocabulary of a person, a language, or a branch of knowledge; the individual entries of a lexicon are called lexemes.
For example, the terms used by doctors can be considered a lexicon for their profession. When building an algorithm to convert a physical prescription written by a doctor into an electronic form, the lexicon would consist mainly of medical terms. Lexicons are used in a wide variety of NLP tasks, where they are supplied as a list of words or a vocabulary. Conversations in a given field are driven by that field's vocabulary. In this chapter, we will look at the steps and processes involved in building a natural language vocabulary.
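
As a minimal sketch of a lexicon supplied as a word list, the snippet below checks which tokens of a prescription belong to a small medical lexicon. The lexicon entries, the sample sentence, and the helper name `in_lexicon_terms` are invented for illustration.

```python
# Hypothetical medical lexicon; a real one would be far larger.
medical_lexicon = {"paracetamol", "ibuprofen", "tablet", "dosage", "mg"}

def in_lexicon_terms(text, lexicon):
    """Return the tokens of `text` (lowercased, whitespace-split) found in `lexicon`."""
    return [tok for tok in text.lower().split() if tok in lexicon]

prescription = "Take one Paracetamol tablet of 500 mg twice daily"
print(in_lexicon_terms(prescription, medical_lexicon))
# ['paracetamol', 'tablet', 'mg']
```

Storing the lexicon as a `set` keeps each membership check cheap, which matters once the word list grows.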

**Phonemes, graphemes, and morphemes**

Before we start looking at the steps for building a vocabulary, we need to understand phonemes, graphemes, and morphemes:

**Phonemes** are the units of sound in speech that can distinguish one word from another in a language.

**Graphemes** are groups of one or more letters that represent these individual sounds or phonemes. The word **spoon** consists of five letters that actually represent four phonemes, identified by the graphemes s, p, oo, and n.
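
The **spoon** example can be sketched in code as a greedy split that prefers known two-letter graphemes. The grapheme inventory and the function name `split_graphemes` are assumptions made for this illustration; real grapheme-to-phoneme mapping depends on pronunciation and is much harder than this.

```python
# Hypothetical, tiny inventory of two-letter graphemes to keep together.
MULTI_LETTER_GRAPHEMES = {"oo", "sh", "ch", "th", "ee"}

def split_graphemes(word, multi=MULTI_LETTER_GRAPHEMES):
    """Greedy left-to-right split: take a known two-letter grapheme if present, else one letter."""
    graphemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in multi:
            graphemes.append(pair)
            i += 2
        else:
            graphemes.append(word[i])
            i += 1
    return graphemes

print(split_graphemes("spoon"))  # ['s', 'p', 'oo', 'n']
```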

A **morpheme** is the smallest meaningful unit in a language. The word **unbreakable** is composed of three morphemes:

**un** - a bound morpheme meaning **not**

**break** - the root morpheme

**able** - a suffix meaning **can be done** (a bound morpheme in this word, although *able* also exists as a free-standing word)
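
The decomposition above can be imitated with a naive affix-stripping sketch. The prefix and suffix lists and the function name `split_morphemes` are hypothetical, and the approach is deliberately simplistic: it only looks at spelling, so it would wrongly split a word such as "under".

```python
# Hypothetical affix lists; real morphological analysis needs far richer resources.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("able", "ness", "ing")

def split_morphemes(word):
    """Strip at most one known prefix and one known suffix; the remainder is treated as the root."""
    parts = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            parts.append(prefix)
            word = word[len(prefix):]
            break
    suffix_found = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            suffix_found = suffix
            word = word[:-len(suffix)]
            break
    parts.append(word)  # whatever is left is treated as the root
    if suffix_found:
        parts.append(suffix_found)
    return parts

print(split_morphemes("unbreakable"))  # ['un', 'break', 'able']
```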


Now, let's dig into some practical aspects that form the foundation of every NLP-based system.

# All the basic preprocessing in one place

#### Let's apply all the preprocessing methods discussed so far to our Zomato dataset and see how they work together

@author: Aman Kedia

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [11]:
import os

In [12]:
os.getcwd()

'/content'

In [20]:
os.listdir("drive/MyDrive/PERSONAL/PYTHON/NLP/notebooks/Chapter03")

['All the basic preprocessing in one place.ipynb',
 '.ipynb_checkpoints',
 '__init__.py',
 'zomato_reviews.csv',
 'Understanding Tokenization.ipynb',
 'Stemming, Lemmatization, Stopword Removal, Case-Folding, N-grams and HTML tags.ipynb']

In [21]:
df = pd.read_csv("drive/MyDrive/PERSONAL/PYTHON/NLP/notebooks/Chapter03/zomato_reviews.csv")
df.head(3)

                                              Review sentiment
0  Virat Kohli did a great thing to open his rest...  positive
1  This place have some really heathy options to ...  positive
2  Aerocity is the most finest place in Delhi for...  positive


In [23]:
df.dtypes

Review       object
sentiment    object
dtype: object

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
corpus = pd.Series(df.Review.tolist()).astype(str)

In [24]:
corpus

0       Virat Kohli did a great thing to open his rest...
1       This place have some really heathy options to ...
2       Aerocity is the most finest place in Delhi for...
3       Yesterday evening there was small team lunch ,...
4       I find aerocity to be the best place in delhi ...
                              ...                        
1591    || DESI LANE || So we were at alipore's most h...
1592    "Desi Lane" is one of the most trending place ...
1593    One of the cool and pocket pinch restaurant at...
1594    "DESI LANE" one of the best places in town and...
1595    Looking for good place for lunch but dont wann...
Length: 1596, dtype: object

### Text cleaning (removing special characters and punctuation, and lowercasing capital letters at the start and in the middle of sentences)


In [25]:
def text_clean(corpus, keep_list):
    '''
    Purpose : Function to keep only alphabets, digits and certain words (punctuations, qmarks, tabs etc. removed)
    
    Input : Takes a text corpus, 'corpus' to be cleaned along with a list of words, 'keep_list', which have to be retained
            even after the cleaning process
    
    Output : Returns the cleaned text corpus
    
    '''
    # Collect cleaned rows in a list; Series.append was removed in pandas 2.0
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # Replace every non-alphanumeric character with a space, then lowercase
                p1 = re.sub(pattern='[^a-zA-Z0-9]', repl=' ', string=word)
                p1 = p1.lower()
                qs.append(p1)
            else:
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    return pd.Series(cleaned_rows, dtype=object)
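
To see what the cleaning step does to a single review, here is a small standalone sketch that applies the same per-word rule as `text_clean` to one string. The helper name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(text, keep_list):
    """Lowercase each token and replace non-alphanumerics with spaces, unless the token is in keep_list."""
    cleaned = []
    for word in text.split():
        if word in keep_list:
            cleaned.append(word)
        else:
            cleaned.append(re.sub(r"[^a-zA-Z0-9]", " ", word).lower())
    return " ".join(cleaned)

sample = "Mr. Smith's cafe in the U.S.A is GREAT!!!"
result = clean_text(sample, keep_list=["U.S.A", "Mr."])
print(result.split())
# ['Mr.', 'smith', 's', 'cafe', 'in', 'the', 'U.S.A', 'is', 'great']
```

Two details are worth noting. Removed characters become spaces, so the raw string contains extra whitespace until it is split again. The keep-list comparison is also an exact match on the whitespace-delimited token, so `U.S.A,` with a trailing comma would not be preserved.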

### Stopwords Removal

In [26]:
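def stopwords_removal(corpus):
    # Question words are taken out of the stop list so that they are retained
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.discard(word)
    # Split each document on whitespace and drop the stopwords
    corpus = [[token for token in doc.split() if token not in stop] for doc in corpus]
    return corpus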
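
The effect of taking the question words out of the stop list can be shown with a toy example. The stop list below is a small hand-written stand-in for `stopwords.words('english')`, and `remove_stopwords` is a hypothetical helper, so the sketch runs without any NLTK data.

```python
# Hypothetical, hand-written stop list standing in for stopwords.words('english').
toy_stopwords = {"the", "is", "a", "of", "in", "what", "where", "how"}
wh_words = {"what", "where", "how"}

# Take the question words out of the stop list so they survive filtering.
stop = toy_stopwords - wh_words

def remove_stopwords(text, stop):
    """Split on whitespace and keep only tokens that are not in the stop set."""
    return [tok for tok in text.split() if tok not in stop]

question = "what is the best place in delhi"
print(remove_stopwords(question, stop))           # ['what', 'best', 'place', 'delhi']
print(remove_stopwords(question, toy_stopwords))  # ['best', 'place', 'delhi']
```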

### Lemmatization

In [27]:
def lemmatize(corpus):
    lem = WordNetLemmatizer()
    # pos='v' asks the lemmatizer to treat every token as a verb
    corpus = [[lem.lemmatize(token, pos='v') for token in doc] for doc in corpus]
    return corpus
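
Because the lemmatizer is called with `pos='v'`, only forms that can be read as verbs are reduced, and other tokens are returned unchanged. The sketch below imitates that behaviour with two tiny lookup tables. The tables and `toy_lemmatize` are hypothetical stand-ins for WordNet, not its actual data or API.

```python
# Hypothetical lookup tables standing in for WordNet; keys are inflected forms.
TOY_LEMMAS = {
    "v": {"drinks": "drink", "opened": "open", "lots": "lot"},
    "n": {"drinks": "drink", "options": "option", "lots": "lot"},
}

def toy_lemmatize(word, pos):
    """Return the lemma recorded for `word` under `pos`, or the word unchanged."""
    return TOY_LEMMAS[pos].get(word, word)

tokens = ["lots", "options", "drinks"]
print([toy_lemmatize(t, pos="v") for t in tokens])  # ['lot', 'options', 'drink']
print([toy_lemmatize(t, pos="n") for t in tokens])  # ['lot', 'option', 'drink']
```

This is one plausible reading of why a plural noun such as "options" can come through verb-only lemmatization untouched.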

### Stemming

In [28]:
def stem(corpus, stem_type=None):
    # 'snowball' selects the Snowball (Porter2) stemmer; anything else falls back to Porter
    if stem_type == 'snowball':
        stemmer = SnowballStemmer(language='english')
    else:
        stemmer = PorterStemmer()
    corpus = [[stemmer.stem(token) for token in doc] for doc in corpus]
    return corpus

### Function of Functions

In [29]:
def preprocess(corpus, keep_list, cleaning = True, stemming = False, stem_type = None, lemmatization = False, remove_stopwords = True):
    '''
    Purpose : Function to perform all pre-processing tasks (cleaning, stemming, lemmatization, stopwords removal etc.)
    
    Input : 
    'corpus' - Text corpus on which pre-processing tasks will be performed
    'keep_list' - List of words to be retained during cleaning process
    'cleaning', 'stemming', 'lemmatization', 'remove_stopwords' - Boolean variables indicating whether a particular task should 
                                                                  be performed or not
    'stem_type' - Choose between Porter stemmer or Snowball(Porter2) stemmer. Default is "None", which corresponds to Porter
                  Stemmer. 'snowball' corresponds to Snowball Stemmer https://pypi.org/project/snowballstemmer/
    
    Note : Either stemming or lemmatization should be used. There's no benefit of using both of them together
    
    Output : Returns the processed text corpus
    
    '''
    
    if cleaning:
        corpus = text_clean(corpus, keep_list)

    # Both branches turn each document into a list of tokens
    if remove_stopwords:
        corpus = stopwords_removal(corpus)
    else:
        corpus = [doc.split() for doc in corpus]

    if lemmatization:
        corpus = lemmatize(corpus)

    if stemming:
        corpus = stem(corpus, stem_type)

    # Join the tokens back into one string per document
    corpus = [' '.join(tokens) for tokens in corpus]

    return corpus

In [30]:
#Keep List 
common_dot_words = ['U.S.A', 'Mr.', 'Mrs.', 'D.C.']

In [31]:
# Preprocessing with Lemmatization here
corpus_with_lemmatization = preprocess(corpus, keep_list = common_dot_words, stemming = False, stem_type = None, lemmatization = True, remove_stopwords = True)



In [32]:
# Preprocessing with Stemming here
corpus_with_stemming = preprocess(corpus, keep_list = common_dot_words, stemming = True, stem_type = "snowball", lemmatization = False, remove_stopwords = True)



# Let's see the results of applying

### 1. Lemmatization
### 2. Stemming

Note: Stopword removal and text cleaning were applied in both cases.

In [33]:
print("Original string: ", corpus[0])

Original string:  Virat Kohli did a great thing to open his restaurant in an exquisite place of Delhi. Wide range of food with lots and lots of options on drinks. Courteous staff with a quick response on anything.


In [36]:
print("String after lemmatization: ", corpus_with_lemmatization[0])

String after lemmatization:  virat kohli great thing open restaurant exquisite place delhi wide range food lot lot options drink courteous staff quick response anything


In [35]:
print("String after stemming: ", corpus_with_stemming[0])

String after stemming:  virat koh great thing open restaur exquisit place delhi wide rang food lot lot option drink courteous staff quick respons anyth
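
To make the difference between the two outputs easier to see, the following sketch compares the lemmatized and stemmed strings printed above token by token. The strings are copied from those outputs; the variable names are chosen for this illustration.

```python
lemmatized = ("virat kohli great thing open restaurant exquisite place delhi wide range "
              "food lot lot options drink courteous staff quick response anything")
stemmed = ("virat koh great thing open restaur exquisit place delhi wide rang "
           "food lot lot option drink courteous staff quick respons anyth")

# Both strings have the same number of tokens, so they can be compared position by position.
pairs = list(zip(lemmatized.split(), stemmed.split()))
differences = [(lem, stm) for lem, stm in pairs if lem != stm]

for lem, stm in differences:
    print(f"{lem:12} -> {stm}")
```

Seven of the 21 tokens differ. In each case the lemmatized form is a dictionary word, while the stemmed form is a truncated string such as "restaur" or "anyth".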
