# **NLP Project - Text Classifier**

## **1 - Task and Data**

##### **Dataset**

TweetSentBR is a corpus of tweets in Brazilian Portuguese. It was labeled by several annotators following steps established in the literature to improve reliability in the sentiment analysis task. Each tweet was annotated with one of the following three classes:

*   **Positive** - tweets in which the user expresses a positive reaction to, or evaluation of, the main topic of the post;
*   **Negative** - tweets in which the user expresses a negative reaction to, or evaluation of, the main topic of the post;
*   **Neutral** - tweets that belong to neither of the previous classes: they usually make no point, are off topic, irrelevant, or confusing, or contain only objective data.

**Link:** https://bitbucket.org/HBrum/tweetsentbr/src/master/ 


In [144]:
pip install pyspellchecker



In [145]:
import pandas as pd
import spacy
import nltk
import re
from nltk.corpus import wordnet as wn
# from nltk.tokenize import TweetTokenizer
from spellchecker import SpellChecker
from google.colab import drive

In [146]:
spacy.cli.download('pt_core_news_sm')

✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')


In [147]:
nltk.download('wordnet')
nltk.download('omw')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

In [148]:
drive.mount('/content/drive')
path = 'drive/My Drive/AS/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##### **Expanding the polarity word list**

In [149]:
def search_synonyms(word):
  synonyms = []
  #antonyms = []
  for synonym in wn.synsets(word, lang=('por')):
    for lemma in synonym.lemmas('por'):
      synonyms.append(lemma.name())
      #if lemma.antonyms():
        #if lemma.antonyms()[0].name() != word:
          #antonyms.append(lemma.antonyms()[0].name())
  return synonyms

In [150]:
def insert_word(positive, negative, word, polarity, new_words):
  if((word not in positive) and (word not in negative)):
    if(polarity == 1):
      positive.append(word)
    else:
      negative.append(word)
    new_words.append(word)
  return positive, negative, new_words

In [151]:
def read_seeds(filename):
  df = pd.read_csv(path + filename, delimiter=';')  # read the file
  lines = df.loc[:].values
  negative = []
  positive = []
  new_words = []
  for line in lines:
    word = line[0]
    polarity = line[1]
    if(polarity == 1):
      positive.append(word)
    else:
      negative.append(word)
    synonyms = search_synonyms(word)
    for synonym in synonyms:
      positive, negative, new_words = insert_word(positive, negative, synonym,
                                                  polarity, new_words)
  return positive, negative, new_words

In [152]:
def expand_seeds(positive, negative, words):
  new_words = []
  for word in words:
    if word in positive:
      polarity = 1
    else:
      polarity = -1
    synonyms = search_synonyms(word)
    for synonym in synonyms:
        positive, negative, new_words = insert_word(positive, negative, synonym,
                                                    polarity, new_words)
  while(len(new_words) > 0):
    print(len(new_words))
    print(new_words)
    positive, negative, new_words = expand_seeds(positive, negative, new_words)
  return positive, negative, new_words
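
The expansion above can also be written iteratively, as a breadth-first traversal of the synonym graph. The sketch below is only an illustration: it assumes a hypothetical in-memory synonym table (`TOY_SYNONYMS`) in place of the WordNet lookup, and the name `expand_polarity` is not part of the notebook.

```python
from collections import deque

# Hypothetical synonym table standing in for the WordNet lookup.
TOY_SYNONYMS = {
    "bom": ["ótimo", "legal"],
    "ótimo": ["excelente", "bom"],
    "ruim": ["péssimo"],
    "péssimo": ["horrível", "ruim"],
}

def expand_polarity(seeds, synonyms):
    """Propagate each seed's polarity to its synonyms, breadth-first.

    seeds maps word -> polarity (1 or -1). A word keeps the first
    polarity it receives, mirroring the check in insert_word.
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for synonym in synonyms.get(word, []):
            if synonym not in polarity:
                polarity[synonym] = polarity[word]
                queue.append(synonym)
    return polarity

lexicon = expand_polarity({"bom": 1, "ruim": -1}, TOY_SYNONYMS)
pos_words = sorted(w for w, p in lexicon.items() if p == 1)
neg_words = sorted(w for w, p in lexicon.items() if p == -1)
print(pos_words)
print(neg_words)
```

The queue plays the role of `new_words`: the loop ends once a pass over the synonyms adds nothing new.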


In [153]:
def read_verbs(filename, positive, negative):
  df = pd.read_csv(path + filename, delimiter=';')  # read the file
  lines = df.loc[:].values
  for line in lines:
    word = line[0]
    polarity = line[1]
    if(polarity == 1):
      positive.append(word)
    else:
      negative.append(word)
  return positive, negative

##### **Reading the corpus**

In [154]:
from sklearn.model_selection import train_test_split

In [155]:
def read_corpus(filename):
  corpus = pd.read_csv(path + filename, delimiter='\t')
  #document = corpus.iloc[:,3]
  #polarity = corpus.iloc[:,1]
  document = corpus.iloc[:,0]
  polarity = corpus.iloc[:,1]
  document, document_, polarity, polarity_ = train_test_split(document, polarity, test_size=0.15, random_state=0)
  return corpus, document_, polarity_

##### **Splitting the document into sentences**

In [156]:
def search_phrases(document):
  # Split on '?', '!' or '.' followed by whitespace. re.split returns
  # [document] unchanged when no delimiter is found.
  return re.split(r"[?!.]\s", document)
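
As a quick illustration of the splitting rule, the sketch below applies the same delimiter (sentence-ending punctuation followed by whitespace) to two sample strings. The sample text and the name `split_sentences` are made up for the example.

```python
import re

# Same delimiter rule as search_phrases: '?', '!' or '.' followed by whitespace.
SENTENCE_END = re.compile(r"[?!.]\s")

def split_sentences(text):
    """Return the pieces of text separated by sentence-ending punctuation."""
    return SENTENCE_END.split(text)

parts = split_sentences("Que dia lindo! Adorei o programa. E você?")
single = split_sentences("sem pontuação aqui")
print(parts)
print(single)
```

The final `?` stays attached to the last piece because no whitespace follows it, and a string without any delimiter comes back as a one-element list.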

##### **Removing named entities (NER) from the sentence**

In [157]:
def delete_entities(phrase):
  nlp = spacy.load("pt_core_news_sm")
  doc = nlp(phrase)
  for entity in doc.ents:
    phrase = phrase.replace(entity.text, '')
  return phrase

##### **Tokenization and POS tagging**

In [158]:
def pos_tagging(phrase):
  token_list = []
  token_with_emoji = []
  pos_list = []
  classes = ['ADV', 'ADJ', 'VERB', 'NOUN', 'SCONJ', 'ADP', 'PRON', 'PROPN', 'SYM', 'X']
  nlp = spacy.load("pt_core_news_sm")
  doc = nlp(phrase)
  for token in doc:
    if token.pos_ in classes or token.text == 'nem':
      token_list.append(token.text)
      pos_list.append(token.pos_)
    token_with_emoji.append(token.text)  
  return token_list, pos_list, token_with_emoji
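
Both `delete_entities` and `pos_tagging` call `spacy.load` on every invocation, which reloads the model for each tweet. One possible way to avoid this is to cache the loaded model. The sketch below is an assumption about how that could look, not the notebook's code: `fake_load` is a hypothetical stand-in for `spacy.load` so that the example runs without spaCy installed, and `get_nlp` is an illustrative name.

```python
from functools import lru_cache

load_calls = []  # records each real load, for demonstration only

def fake_load(name):
    """Hypothetical stand-in for spacy.load, so the sketch runs without spaCy."""
    load_calls.append(name)
    return {"model": name}

@lru_cache(maxsize=None)
def get_nlp(name="pt_core_news_sm"):
    # In the notebook this line would be: return spacy.load(name)
    return fake_load(name)

first = get_nlp()
second = get_nlp()
print(first is second, len(load_calls))
```

Each function would then call `get_nlp()` instead of `spacy.load(...)`, so the model is loaded once and reused.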

## **2 - Data Visualization**


##### **Spelling correction**

In [159]:
def spell_checker(token_list):
  spell = SpellChecker(language='pt')
  misspelled = spell.unknown(token_list)
  for word in misspelled:
    try:
      index = token_list.index(word)
    except ValueError:
      # unknown() lowercases words, so tokens with capitals are not found
      print("Not found in the list!")
      continue
    # correction() may return None in some pyspellchecker versions
    token_list[index] = spell.correction(word) or word
    #spell.candidates(word) # list of possible corrections
  return token_list

##### **Lowercasing**

In [160]:
def to_lower(token_list):
  new_token_list = []
  for token in token_list:
    new_token_list.append(token.lower())
  return new_token_list

##### **Lemmatization**

In [161]:
def lemmatization(phrase):
  lemma_list = []
  nlp = spacy.load("pt_core_news_sm")
  doc = nlp(phrase)
  for token in doc:
    lemma_list.append(token.lemma_)
  return lemma_list

## **3 - Classifiers**

##### **Laughter**

In [162]:
def check_laugh(token_list, positive):
  for token in token_list:
    str = token
    x = re.search("([hH][aeiouAEIOU])([hH][aeiouAEIOU])+|[k][k]+|[K][K]+|[e][e]+|[z][z]+|[Z][Z]+|[uU][hH][uU][lL]+", str)
    if(x):
      positive += 1
  return positive

In [163]:
print(check_laugh(['cláudia', 'raia', 'hehe', 'aha', 'kaleu', 'kj'],0))

1
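
To see which tokens the laughter pattern accepts, the sketch below applies the same regular expression to a few sample tokens. The token list and the helper name `laugh_flags` are made up for the example.

```python
import re

# Pattern copied from check_laugh above.
LAUGH = re.compile(
    "([hH][aeiouAEIOU])([hH][aeiouAEIOU])+|[k][k]+|[K][K]+|[e][e]+"
    "|[z][z]+|[Z][Z]+|[uU][hH][uU][lL]+"
)

def laugh_flags(tokens):
    """Map each token to True when the pattern matches anywhere inside it."""
    return {token: bool(LAUGH.search(token)) for token in tokens}

flags = laugh_flags(["hahaha", "kkkk", "uhul", "aha", "risada", "veem", "pizza"])
for token, matched in flags.items():
    print(token, matched)
```

Because `re.search` matches substrings, ordinary words that contain `ee` or `zz` (here `veem` and `pizza`) are also counted. If that is not desired, matching the whole token with `re.fullmatch` is one option.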


##### **Searching for negation adverbs**

In [164]:
def search_negative_word(token_list, pos_list):
  negative_words = ['não', 'nunca', 'jamais', 'sem','ninguém','sqn']
  for negative_word in negative_words:
    if negative_word in token_list:
      index = token_list.index(negative_word)
      try:
        if ((pos_list[index+1] == 'VERB' or pos_list[index+1] == 'ADJ' 
          or pos_list[index+1] == 'NOUN') and (token_list.count(
              negative_word) == 1)):
          return "found", index
      except IndexError:
        print("Index out of range")
  return "not found", -1000

##### **Searching for sentiment in the sentence**

In [165]:
def search_sentiment(token_list, pos_list, positive, negative):
  positive_word = 0
  negative_word = 0
  result, index = search_negative_word(token_list, pos_list)
  # enumerate gives each token's own position; list.index would return the
  # first occurrence and mishandle repeated tokens
  for position, token in enumerate(token_list):
    negated = (result == "found") and (position == index + 1)
    if token in positive:
      if negated:
        negative_word += 1
      else:
        positive_word += 1
    elif token in negative:
      if negated:
        positive_word += 1
      else:
        negative_word += 1
  return positive_word, negative_word
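
The core idea, flipping the polarity of the word that follows a negation, can be seen in isolation with a toy lexicon. The sketch below is a simplification under stated assumptions: the three word sets are hypothetical, and it skips the POS check and the single-occurrence condition that `search_negative_word` applies.

```python
POSITIVE = {"bom", "gostar"}           # hypothetical toy lexicon
NEGATIVE = {"ruim", "odiar"}           # hypothetical toy lexicon
NEGATIONS = {"não", "nunca", "jamais"}

def count_polarity(tokens):
    """Count positive and negative tokens, flipping the one after a negation."""
    positive_count = 0
    negative_count = 0
    for i, token in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        is_positive = token in POSITIVE
        is_negative = token in NEGATIVE
        if negated:
            is_positive, is_negative = is_negative, is_positive
        positive_count += is_positive
        negative_count += is_negative
    return positive_count, negative_count

plain = count_polarity(["eu", "gostar", "do", "filme"])
negated = count_polarity(["eu", "não", "gostar", "do", "filme"])
mixed = count_polarity(["não", "ruim", "mas", "bom"])
print(plain, negated, mixed)
```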

##### **Searching for emojis in the sentence**

In [166]:
def search_emojis(token_with_emoji):
  positive = 0
  negative = 0
  positive_words = ['homão','mulherão','lacradora','crush','crushs','+','ícone',
                    'bff', 'mito', 'memes','ótimo','obrigado',
                    'xuxu','migo','fofo','deusa','deuso','top']
  negative_words = ['pqp','sdds','poxa','embuste', 'aff', 'bad', 'sad','sozinho']
  positive_emoji = ['😀', '😬', '😁', '😂', '😃', '😄', '🤣', '😆', '😇',
                    '😉', '😊', '🙂', '🌟','🐓', '🍝', '🌈','🦄', '😋',
                    '😌', '😍', '😘', '😗', '😙', '😚', '🤪', '😜', '😝',
                    '😛', '🤑', '😎', '🤜', '🍜', '🤘','😻', '🤓', '🧐',
                    '🤠', '🤗', '🤲', '🙌', '👏', '🙏', '🤝', '👍', '✌',
                    '👊', '♥', '❤','💔',':)','<3']
  negative_emoji = ['🤡', '😏', '😶', '😐', '😑', '😒', '🙄', '🤨', '🤔',
                    '🤫', '🤭', '🤥', '😳', '😞', '😟', '😠', '😡', '🤬',
                    '😔', '😕', '🙁', '😣', '😖', '😫', '😩', '😤', '😅',
                   '😮', '😱', '😨', '😰', '😯', '😦', '😧', '😢', '😥',
                    '😪', '🤤', '😓', '😭', ':(','o.O',':O' ]
  for token in token_with_emoji:
    if token in positive_emoji:
      positive += 1
    elif token in negative_emoji:
      negative += 1
  token_with_emoji = to_lower(token_with_emoji)
  for token in token_with_emoji:
    if token in positive_words:
      positive += 1
    elif token in negative_words:
      negative += 1
  return positive, negative
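
Membership tests against long lists are exact-match and linear in the list length. A possible variation, sketched below with small hypothetical subsets of the emoji lists, stores the entries in sets and strips stray whitespace from each entry so that it can match a token exactly. The names `POSITIVE_EMOJI`, `NEGATIVE_EMOJI` and `count_emojis` are illustrative.

```python
# Small hypothetical subsets; strip() removes stray spaces around an entry.
POSITIVE_EMOJI = frozenset(e.strip() for e in ["😀", "😍", "🦄 ", ":)", "<3"])
NEGATIVE_EMOJI = frozenset(e.strip() for e in ["😡", "😢", ":("])

def count_emojis(tokens):
    """Count tokens that are exactly a positive or a negative emoji."""
    positive = sum(1 for t in tokens if t in POSITIVE_EMOJI)
    negative = sum(1 for t in tokens if t in NEGATIVE_EMOJI)
    return positive, negative

emoji_counts = count_emojis(["amei", "😍", "😍", "🦄", ":("])
print(emoji_counts)
```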

##### **Defining the sentiment analysis**

In [167]:
def sentiment_analysis(positive_emoji, negative_emoji, 
                       positive_word, negative_word):
  if(positive_word == 0 and negative_word == 0 and
     positive_emoji == 0 and negative_emoji == 0):
    result = 0 # neutral
    #result = "NE"
  elif(positive_word == negative_word and positive_emoji == negative_emoji):
    result = -1 # tie in both word and emoji counts: labeled negative
    #result = "NG"
  else:
    positive = positive_word + positive_emoji
    negative = negative_word + negative_emoji
    if(positive <= negative):
      if(positive_emoji == negative_word and (
          positive_word == negative_emoji and positive_emoji > 0)):
        result = 1 #positive
        #result = "PO"
      else:
        result = -1 # negative
        #result = "NG"
    else:
      result = 1 # positive
      #result = "PO"
  return result
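
The decision rule is easier to follow with a few concrete inputs. The sketch below restates the rule as it is coded in `sentiment_analysis` in a compact form and evaluates it on some hand-picked counts. The name `decide` and the sample counts are illustrative only.

```python
def decide(positive_emoji, negative_emoji, positive_word, negative_word):
    """Compact restatement of sentiment_analysis: returns 1, -1 or 0."""
    counts = (positive_emoji, negative_emoji, positive_word, negative_word)
    if not any(counts):
        return 0   # no evidence at all: neutral
    if positive_word == negative_word and positive_emoji == negative_emoji:
        return -1  # words tie and emojis tie
    positive = positive_word + positive_emoji
    negative = negative_word + negative_emoji
    if positive > negative:
        return 1
    # Special tie case: positive emojis exactly offset the negative words.
    if (positive_emoji == negative_word and positive_word == negative_emoji
            and positive_emoji > 0):
        return 1
    return -1

cases = {
    (0, 0, 0, 0): decide(0, 0, 0, 0),  # nothing found
    (0, 0, 2, 1): decide(0, 0, 2, 1),  # more positive words
    (1, 0, 0, 1): decide(1, 0, 0, 1),  # positive emoji vs negative word
    (0, 1, 1, 0): decide(0, 1, 1, 0),  # negative emoji vs positive word
    (1, 1, 1, 1): decide(1, 1, 1, 1),  # full tie
}
for args, label in cases.items():
    print(args, label)
```

The third and fourth cases show the asymmetry in the rule: when one emoji opposes one word, the emoji decides the label.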

## **4 - Results**

In [168]:
positive, negative, new_words = read_seeds('sementes.csv')

In [169]:
positive = lemmatization(' '.join(map(str, positive)))
negative = lemmatization(' '.join(map(str, negative)))
positive, negative = read_verbs('verbos.csv', positive, negative)

In [170]:
print(len(positive))
print(len(negative))

520
538


In [171]:
def delete_repeated(words):
    # dict.fromkeys keeps the first occurrence of each word, in order
    return list(dict.fromkeys(words))

In [172]:
corpus, document, polarity = read_corpus('TweetSentBR.txt')

In [None]:
results = []
positive_word = 0
negative_word = 0
positive_emoji = 0
negative_emoji = 0
count = 0
for doc in document:
  phrase = doc
  token, pos, token_with_emoji = pos_tagging(phrase)
  positive_word = check_laugh(token, positive_word)
  token = to_lower(spell_checker(token))
  lemma = lemmatization(' '.join(map(str, token)))
  i, j = search_sentiment(lemma, pos, positive, negative)
  positive_word += i
  negative_word += j
  i, j = search_emojis(token_with_emoji)
  positive_emoji += i
  negative_emoji += j
  result = sentiment_analysis(positive_emoji, negative_emoji,
                                positive_word, negative_word)
  results.append(result)
  token.clear()
  pos.clear()
  positive_word = 0
  negative_word = 0
  positive_emoji = 0
  negative_emoji = 0
  lemma.clear()
  count += 1
  if(count%10==0 or count == 1):
    print(count)

In [None]:
final = pd.DataFrame(columns=['doc','pol','result'])
final['doc'] = document
final['pol'] = polarity
final['result'] = results


In [175]:
final

|  | doc | pol | result |
|------|-----|-----|--------|
| 1670 | esse homem é uma perdição 😍 😈 USERNAME USERNAME | 1 |  |
| 13379 | que fome da porra aí eu invento de ir assistir | -1 |  |
| 10234 | boa noite ! 📺 | 0 |  |
| 4719 | A moça rindo 😱 😱 😱 😂 😂 😂 😂 | 1 |  |
| 7003 | USERNAME agora que o bicho vai pegar 😂 😂 😂 😂 😂... | 0 |  |
| ... | ... | ... | ... |
| 3943 | o filho do serginho ❤ ️ ❤ ️ ❤ ️ to apaixonadaa... | 1 |  |
| 3183 | ok depois manda ver com FLOR DA PELE bora com ... | 1 |  |
| 11514 | se eu fosse mãe dessa criatura mandava calar a... | -1 |  |
| 14232 | de luto pela saída do USERNAME do | -1 |  |


In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, f1_score, recall_score
f1 = f1_score(polarity, results, average = 'weighted')
recall = recall_score(polarity, results, average = 'weighted')
accuracy = accuracy_score(polarity, results)
precision = precision_score(polarity, results, average = 'weighted')
matrix = confusion_matrix(polarity, results)

In [None]:
print(f1)
print(recall)
print(accuracy)
print(precision)
print(matrix)
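
With `average='weighted'`, each class's score is weighted by its share of the true labels. The pure-Python sketch below computes the same quantities by hand on made-up labels, as a way to check the definitions. It is not the scikit-learn implementation, and the name `weighted_scores` is illustrative.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1, plus accuracy."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / support[label]
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = support[label] / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return precision, recall, f1, accuracy

# Made-up labels using the corpus encoding: 1, -1 and 0.
y_true = [1, 1, -1, -1, 0, 0, 1, -1]
y_pred = [1, -1, -1, -1, 0, 1, 1, 0]
w_precision, w_recall, w_f1, acc = weighted_scores(y_true, y_pred)
print(w_precision, w_recall, w_f1, acc)
```

Weighting each class's recall by its support makes the sum collapse to the fraction of correct predictions, so weighted recall always equals accuracy. This is why the two rows coincide in the results reported in the conclusion.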

## **5 - Conclusion**

**Experiments**
*   Experiment A
 - Normalization, tokenization, stopword removal, and POS tagging
*   Experiment B
 - Addition of lemmatization
*   Experiment C
 - Addition of spelling correction


**Results**

| Metric | Experiment A | Experiment B | Experiment C |
|---------|----------------|---------------|---------------|
| F-measure | 0.544 | 0.580 | 0.578 |
| Recall | 0.539 | 0.584 | 0.585 |
| Accuracy | 0.539 | 0.584 | 0.585 |
| Precision | 0.581 | 0.578 | 0.576 |




**References**
*   Brum, H. B. and Nunes, M. d. G. V. (2017). Building a sentiment corpus of tweets in Brazilian Portuguese. CoRR, abs/1712.08917.
*   Gonçalves, P., Araújo, M., Benevenuto, F., and Cha, M. (2013). Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks, COSN '13, pages 27–38, New York, NY, USA. ACM.
*   Zhao, J., Liu, K., and Xu, L. (2016). Book review: Sentiment analysis: Mining opinions, sentiments, and emotions by Bing Liu. Computational Linguistics, 42(3):595–598.