<a href="https://colab.research.google.com/github/punkmic/Topic-Modeling-Reclame-Aqui/blob/master/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pre-Processing** 

### **Setup**

In [None]:
import pandas as pd # for data manipulation
import os # for interacting with the operating system
import nltk # for natural language processing
import string # for string manipulation
import re # for regular expressions
import matplotlib.pyplot as plt # for visualization
import spacy # for lemmatizing Portuguese text
import pickle # for saving and loading the corpus
try:
  from spellchecker import SpellChecker # for spell checking
except ImportError:
  !pip install pyspellchecker
  # Kill the kernel process so the runtime restarts with the package available
  os.kill(os.getpid(), 9)



In [None]:
%%capture
# Install spacy pt_core_news_sm for portuguese text
!python -m spacy download pt_core_news_sm

In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Download dataset with stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Download datasets for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
# Download the dependency needed to stem Portuguese text
nltk.download('rslp')

[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


True

### **Load data from [GitHub](https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git)**

In [None]:
!git clone https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git

Cloning into 'Topic-Modeling-Reclame-Aqui'...
remote: Enumerating objects: 16196, done.
remote: Counting objects: 100% (2015/2015), done.
remote: Compressing objects: 100% (1763/1763), done.
remote: Total 16196 (delta 257), reused 1916 (delta 222), pack-reused 14181
Receiving objects: 100% (16196/16196), 117.72 MiB | 21.07 MiB/s, done.
Resolving deltas: 100% (1405/1405), done.
Checking out files: 100% (14806/14806), done.


In [69]:
%pwd

'/content/Topic-Modeling-Reclame-Aqui'

In [70]:
# Change directory
%cd /content/Topic-Modeling-Reclame-Aqui 

# Update files from remote repository
!git pull 

# Return to work directory
%cd ..

# Check current directory
!pwd

/content/Topic-Modeling-Reclame-Aqui
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/punkmic/Topic-Modeling-Reclame-Aqui
   3c52906..83eaeea  master     -> origin/master
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.

    git pull <remote> <branch>

If you wish to set tracking information for this branch you can do so with:

    git branch --set-upstream-to=origin/<branch> master

/content
/content


## **Prepare data**

In [None]:
def read_data(path_csv, drop_duplicates=True, lower=True):

  # Read the CSV file into a DataFrame
  df = pd.read_csv(path_csv)

  if drop_duplicates:
    print(f"Shape before remove duplicates: {df.shape}")

    # Drop rows whose "text" value duplicates an earlier row
    df = df.drop_duplicates(subset="text")

    print(f"Shape after remove duplicates: {df.shape}")

  # Lowercasing is independent of duplicate removal, so it is not nested above
  if lower:
    # Apply str.lower() to every cell (assumes all cells hold strings)
    df = df.applymap(str.lower)
  return df

In [None]:
# Set the path to data
path_csv = "/content/Topic-Modeling-Reclame-Aqui/corpus.csv"

df = read_data(path_csv)

# Print the first 5 rows of the DataFrame
df.head(5)

Shape before remove duplicates: (12760, 2)
Shape after remove duplicates: (10510, 2)


Unnamed: 0,title,text
0,pedido cancelado sem justificativa após uma se...,eu estava pesquisando bastante uma nova tv par...
1,pedido cancelado,eu sinceramente estou decepcionada com o amazo...
2,cobrança indevida,cancelei meu plano antes de terminar o período...
3,pedido reincidente,"olha fiz compra veio errada, e veio errado nov..."
4,assinatura para vender na amazon brasil,eu me inscrevi na amazon para realizar vendas ...


This dataset contains just two columns, `title` and `text`.

In [None]:
print(df.shape)

(10510, 2)


After dropping rows with duplicate `text` values, 10510 rows remain in this dataset.

In [None]:
# join columns
df["documents"] = df["title"] + " " + df["text"]

# Remove the moderation marker "[editado pelo reclame aqui]" from every cell
df = df.replace(re.compile(r'\[editado pelo reclame aqui\]|editado pelo reclame aqui|Editado pelo Reclame Aqui'), '')

# Reset the index so it runs from 0 again after duplicates were dropped
df.reset_index(inplace = True, drop = True)

df.head()

Unnamed: 0,title,text,documents
0,pedido cancelado sem justificativa após uma se...,eu estava pesquisando bastante uma nova tv par...,pedido cancelado sem justificativa após uma se...
1,pedido cancelado,eu sinceramente estou decepcionada com o amazo...,pedido cancelado eu sinceramente estou decepc...
2,cobrança indevida,cancelei meu plano antes de terminar o período...,cobrança indevida cancelei meu plano antes de ...
3,pedido reincidente,"olha fiz compra veio errada, e veio errado nov...",pedido reincidente olha fiz compra veio errada...
4,assinatura para vender na amazon brasil,eu me inscrevi na amazon para realizar vendas ...,assinatura para vender na amazon brasil eu me ...
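
To make the effect of the marker removal concrete, here is a minimal sketch that applies a similar pattern to a single string using only `re`. The names `MARKER` and `strip_marker` and the sample sentence are invented for illustration; the pattern is case-insensitive and treats the brackets as optional, which is slightly broader than the alternatives used in the cell above.

```python
import re

# Assumed pattern for this sketch: optional brackets, any capitalization
MARKER = re.compile(r"\[?editado pelo reclame aqui\]?", re.IGNORECASE)

def strip_marker(text):
    # Remove the moderation marker, then collapse the leftover whitespace
    cleaned = MARKER.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "liguei para [Editado pelo Reclame Aqui] e ninguém atendeu"
print(strip_marker(sample))
# → liguei para e ninguém atendeu
```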


#### **Save table as CSV**

In [None]:
# Set the path to save 
path = '/content/Topic-Modeling-Reclame-Aqui/results/joined_table/'

# Create the directory if it does not exist
os.makedirs(path, exist_ok=True)

# Save the first 10 rows of the DataFrame to a CSV file
df.head(10).to_csv(path + 'joined_table.csv')

### **Preprocessing**

#### **Tokenization**

Tokenization breaks text down into its component parts (tokens).

In [None]:
WORD_TOKENIZER = nltk.tokenize.word_tokenize
def tokenize(text, lowercase=True):
  if lowercase:
    text = text.lower()
  return WORD_TOKENIZER(text, language="portuguese")

#### **Stem** 

Stem the tokens. This step removes morphological affixes and reduces each token to a standardized stem form.

In [None]:
STEMMER = nltk.stem.RSLPStemmer()
def stem(tokens):
  return [STEMMER.stem(token) for token in tokens]

#### **Lemmatize**

Lemmatize the tokens. Lemmatization retains more natural word forms than stemming, but it assumes every token is a noun unless tokens are passed as (word, pos) tuples. Note: the NLTK WordNet lemmatizer used here does not support Portuguese.

In [None]:
LEMMATIZER = nltk.WordNetLemmatizer()

def lemmatize(tokens):
  lemmas = []
  for token in tokens:
      if isinstance(token, str):
          # treats token like a noun
          lemmas.append(LEMMATIZER.lemmatize(token)) 
      else: 
          # assume a tuple of (word, pos)
          lemmas.append(LEMMATIZER.lemmatize(*token))
  return lemmas

**Lemmatization option for Portuguese text**

In [None]:
def lemmatize_pt(tokens, nlp):
  # Create a spaCy Doc object and apply the lemmatization
  doc = nlp(' '.join(tokens))

  # Return the lemma of each token
  return [token.lemma_ for token in doc]

#### **Remove stopwords**

Stop words are things like articles and conjunctions that usually do not offer a lot of value in an analysis.

In [None]:
def remove_stopwords(tokens, stopwords=None):

  # Use the default Portuguese stop words if none are passed
  if stopwords is None:
    stopwords = nltk.corpus.stopwords.words('portuguese')

  # Convert to a set so each membership test is fast
  stopwords = set(stopwords)

  # Filter the list of tokens to exclude the stop word tokens
  return [token for token in tokens if token not in stopwords]
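
Because the function accepts a custom `stopwords` argument, corpus-specific terms can be filtered together with the defaults. The sketch below illustrates the idea with small hard-coded sets so it runs without NLTK data; `BASE_STOPWORDS`, `DOMAIN_STOPWORDS`, and `remove_stopwords_set` are hypothetical names, and treating "amazon" as a stop word is an assumption, not a choice made in this notebook.

```python
# Hypothetical stop word sets for illustration; in the notebook the defaults
# come from nltk.corpus.stopwords.words('portuguese').
BASE_STOPWORDS = {"de", "a", "o", "que", "e", "para", "com", "uma"}
DOMAIN_STOPWORDS = {"amazon"}  # assumed corpus-specific addition

def remove_stopwords_set(tokens, stopwords):
    # Set membership is constant time on average, unlike a list scan
    return [token for token in tokens if token not in stopwords]

tokens = ["pedido", "cancelado", "para", "a", "amazon"]
print(remove_stopwords_set(tokens, BASE_STOPWORDS | DOMAIN_STOPWORDS))
# → ['pedido', 'cancelado']
```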

#### **Remove hyperlinks**

Removes http/s links from the tokens.

In [None]:
def remove_links(tokens):
  # Filter out tokens that start with "http://" or "https://"
  return [token for token in tokens 
          if not token.startswith("http://")
          and not token.startswith("https://")]
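
Depending on the tokenizer, a URL may be split into several tokens (scheme, colon, remainder), in which case a prefix check on individual tokens would not catch it. One alternative, sketched here as an assumption rather than as this notebook's method, is to strip links from the raw text before tokenizing. `URL_PATTERN` and `strip_links` are hypothetical names.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def strip_links(text):
    # Replace each link with a space, then normalize the whitespace
    return " ".join(URL_PATTERN.sub(" ", text).split())

print(strip_links("veja https://www.amazon.com.br/pedido e responda"))
# → veja e responda
```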

#### **Remove numbers**

In [None]:
def remove_numbers(tokens):
  # Filter out number tokens using a list comprehension and the isnumeric method
  return [token for token in tokens if not token.isnumeric()]
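
A short illustration of what `str.isnumeric` does and does not catch. The function repeats the same filter under a hypothetical name, and the sample tokens are invented: tokens made only of digits are dropped, while tokens that mix digits with separators, letters, or symbols are kept.

```python
def drop_numeric_tokens(tokens):
    # Same filter as above: drop tokens made up only of numeric characters
    return [token for token in tokens if not token.isnumeric()]

tokens = ["dia", "12", "2022", "12,50", "3x", "r$100"]
print(drop_numeric_tokens(tokens))
# → ['dia', '12,50', '3x', 'r$100']
```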


#### **Remove date**

In [None]:
def remove_date(tokens):
  # Compile a regular expression to match dates in the format dd/mm or dd/mm/yyyy
  date_regex = re.compile(r'\d{2}/\d{2}(/\d{4})?')

  # Use the regex to find all the tokens that match the date pattern
  dates = [token for token in tokens if date_regex.fullmatch(token)]

  # Filter the list of tokens to exclude the date tokens
  filtered_tokens = [token for token in tokens if token not in dates]

  # Return the filtered tokens
  return filtered_tokens
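
The date pattern uses `fullmatch`, so a token is removed only if the whole token is `dd/mm` or `dd/mm/yyyy`. The sketch below checks a few invented tokens against the same pattern; `is_date` is a hypothetical helper name.

```python
import re

# Same pattern as in remove_date: dd/mm with an optional /yyyy suffix
DATE_REGEX = re.compile(r"\d{2}/\d{2}(/\d{4})?")

def is_date(token):
    # fullmatch requires the pattern to cover the entire token
    return DATE_REGEX.fullmatch(token) is not None

for token in ["19/09", "19/09/2022", "19/09/22", "1/9", "19-09-2022"]:
    print(token, is_date(token))
# 19/09 True
# 19/09/2022 True
# 19/09/22 False
# 1/9 False
# 19-09-2022 False
```

Two-digit years, single-digit days or months, and dash-separated dates are therefore left in the token list.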

#### **Remove punctuation**

In [None]:
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):

    # Use a regular expression to match and remove repeated punctuation characters
    tokens = [re.sub(r"([!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~])\1+", "", token) for token in tokens]

    # Filter punctuation tokens
    tokens = [token for token in tokens if token not in string.punctuation]

    # Remove @ symbol from left side of tokens
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]

    # Remove # symbol from left side of tokens
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]

    return tokens
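
The following simplified sketch shows the three effects of the function above on a few invented tokens: runs of repeated punctuation are deleted, tokens that are a single punctuation mark are dropped, and leading `@` or `#` is stripped. It is not a drop-in replacement: the pattern is built from `string.punctuation` (so it also covers the backslash), and `REPEATED_PUNCT` and `strip_punctuation` are hypothetical names.

```python
import re
import string

# Assumed pattern for this sketch: any punctuation character repeated 2+ times
REPEATED_PUNCT = re.compile(rf"([{re.escape(string.punctuation)}])\1+")
PUNCT = set(string.punctuation)

def strip_punctuation(tokens):
    # Delete runs such as "!!!" or "..." inside each token
    tokens = [REPEATED_PUNCT.sub("", token) for token in tokens]
    # Drop empty tokens and tokens that are a single punctuation mark
    tokens = [token for token in tokens if token and token not in PUNCT]
    # Strip leading @ and # (mentions and hashtags)
    return [token.lstrip("@#") for token in tokens]

print(strip_punctuation(["ótimo!!!", "...", ",", "@loja", "#promo", "olá"]))
# → ['ótimo', 'loja', 'promo', 'olá']
```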

#### **Remove short tokens**

In [None]:
def remove_short_tokens(tokens):
  # Filter the list of tokens to exclude tokens that are shorter than four letters
  filtered_tokens = [token for token in tokens if len(token) >= 4]

  # Return the filtered tokens
  return filtered_tokens

#### **Correction of spelling errors**

In [None]:
def check_spell_errors(text, spell):

  # Ask the spell checker for its best correction of the input
  corrected_text = spell.correction(text)

  # If no correction is available, use the original text
  if corrected_text is None:
    corrected_text = text

  # Return the corrected text
  return corrected_text
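
pyspellchecker documents `correction` as taking a single word, so calling it on a whole document may simply return the text unchanged. A per-token variant is sketched below under that assumption. `correct_tokens` and `StubSpell` are hypothetical: the stub stands in for `SpellChecker` so the example runs without the package, and its two fixes are invented for the demo.

```python
def correct_tokens(tokens, spell):
    # `spell` is any object with a `correction(word)` method that returns the
    # best candidate or None (the behaviour assumed for SpellChecker here).
    corrected = []
    for token in tokens:
        suggestion = spell.correction(token)
        corrected.append(suggestion if suggestion is not None else token)
    return corrected

class StubSpell:
    """Hypothetical stand-in for SpellChecker, used only for this demo."""
    FIXES = {"anciosos": "ansiosos", "complei": "comprei"}

    def correction(self, word):
        # Return the known fix, or None when the word has no entry
        return self.FIXES.get(word)

print(correct_tokens(["estavamos", "anciosos", "complei"], StubSpell()))
# → ['estavamos', 'ansiosos', 'comprei']
```

Correcting token by token would have to run after `tokenize`, and it is likely to be much slower on a corpus of this size.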

In [None]:
def preprocessing(documents, nlp = None, spell = None):
  corpus = []

  # process each document and append to corpus list
  for i, text in enumerate(documents):
    if i % 1000 == 0:
      print(f"{i} documents of {len(documents)}\n")
    if spell is not None:
      text = check_spell_errors(text, spell)
    tokens = tokenize(text)
    tokens = remove_links(tokens)
    tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
    tokens = remove_numbers(tokens)
    tokens = remove_date(tokens)
    tokens = remove_short_tokens(tokens)
    tokens = remove_stopwords(tokens)
    if nlp is not None: 
      tokens = lemmatize_pt(tokens, nlp)
    #tokens = stem(tokens) 
    corpus.append(' '.join(tokens))
  return corpus

In [None]:
# Create a SpellChecker object
spell = SpellChecker(language='pt')

nlp = spacy.load('pt_core_news_sm')

corpus = preprocessing(df.documents, nlp, spell)

0 documents of 10510

1000 documents of 10510

2000 documents of 10510

3000 documents of 10510

4000 documents of 10510

5000 documents of 10510

6000 documents of 10510

7000 documents of 10510

8000 documents of 10510

9000 documents of 10510

10000 documents of 10510



In [None]:
# Print the first document before and after pre-processing it
print(df.documents[0])
print()
print(corpus[0])

pedido cancelado sem justificativa após uma semana da compra eu estava pesquisando bastante uma nova tv para comprar e resolvi aguardar a semana do cliente, porque como ocorreu tiveram vários descontos e promoções nessa semana, sendo assim recebi descontos, cashback e cupons de varias plataformas e assim decidi efetuar a compra aquela que eu entendesse ser o melhor custo beneficio. sendo assim no dia 12 de setembro de 2022 recebi uma oferta de produto da amazon que entendi estar com um preço muito bom além de ter cachback e efetuei a compra que já aguardei até a semana do consumidor para efetuar com toda a expectativa. como estavamos anciosos e mesmo estando dentro do prazo de entrega hoje no dia 19/09 resolvi enviar uma mensagem perguntando quando o pedido seria enviado, ja que após a confirmação da compra e pagamento, passado 7 dias não tive nenhum retorno, já que complei inclusive na amazon, um site em que confio e tambem sou assinante. para a minha surpresa após o pedido de informa

### **Save corpus**

In [237]:
import pickle
import os

def save_corpus(corpus, file_name="corpus.pkl"):
  file_path = f'./corpus/{file_name}'

  # Create the output directory if it does not exist
  os.makedirs('./corpus', exist_ok=True)
  with open(file_path, 'wb') as f:
    # Serialize the list of preprocessed documents to the file
    pickle.dump(corpus, f)


In [233]:
import pickle
def load_corpus(file_name = "corpus.pkl"):
  file_path = f'./corpus/{file_name}'
  # Open in a context manager so the file handle is closed after loading
  with open(file_path, 'rb') as f:
    return pickle.load(f)

In [None]:
load_corpus("corpus.pkl")[:1]
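
A quick round-trip check can confirm that what is loaded equals what was saved. The sketch below uses a temporary directory instead of `./corpus` so it leaves no files behind; `save_corpus_to` and `load_corpus_from` are hypothetical variants that take the directory as a parameter, and the two sample documents are invented.

```python
import os
import pickle
import tempfile

def save_corpus_to(corpus, directory, file_name="corpus.pkl"):
    # Create the target directory if needed, then pickle the list of documents
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, file_name)
    with open(file_path, "wb") as f:
        pickle.dump(corpus, f)
    return file_path

def load_corpus_from(directory, file_name="corpus.pkl"):
    # The context manager closes the file handle after loading
    with open(os.path.join(directory, file_name), "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    docs = ["pedido cancelado", "cobrança indevida"]
    save_corpus_to(docs, tmp)
    restored = load_corpus_from(tmp)

print(restored == docs)
# → True
```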

#### **Update GitHub**

In [None]:
from getpass import getpass

In [217]:
%cd /content/Topic-Modeling-Reclame-Aqui/

/content/Topic-Modeling-Reclame-Aqui


In [None]:
!git init

Reinitialized existing Git repository in /content/Topic-Modeling-Reclame-Aqui/.git/


In [238]:
!git add .

In [243]:
!git status

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	new file:   corpus/corpus.pkl
	deleted:    corpus/preprocessed/corpus.p
	modified:   load_corpus.py
	modified:   save_corpus.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	__pycache__/click_list_param.cpython-38.pyc
	__pycache__/load_corpus.cpython-38.pyc
	__pycache__/save_corpus.cpython-38.pyc



In [None]:
username = getpass("Username: ")

Username: ··········


In [None]:
email = getpass("Email: ")

Email: ··········


In [None]:
password = getpass("Password: ")

Password: ··········


In [None]:
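# Use $var so IPython substitutes the Python variables; bare names are passed literally
!git config --global user.email "$email"
!git config --global user.name "$username"
# No password is stored in the git config; the push authenticates with the token in the remote URL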

In [250]:
message = "Update save_corpus.py and load_corpus.py"
assert message is not None

In [251]:
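# Use $message so the variable's value is used; a bare `message` is committed as the literal text "message"
!git commit -m "$message"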

[master 7f6b06b] message


In [None]:
token = getpass("Token: ")

Token: ··········


In [246]:
!git remote rm origin
!git remote add origin https://$token@github.com/$username/Topic-Modeling-Reclame-Aqui.git

In [252]:
!git push origin master

Counting objects: 8, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (8/8), 1.25 MiB | 2.67 MiB/s, done.
Total 8 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 1 local object.
To https://github.com/punkmic/Topic-Modeling-Reclame-Aqui.git
   83eaeea..7f6b06b  master -> master


In [None]:
%cd ..