# Fórum do SIGAA - PLN

###### Holanda, Song Jong, Lourenço, Oliveira

### Data Collection

Our aim was to collect messages from the "Fórum do SIGAA". Because SIGAA does not provide an API for that data, we had to apply web scraping techniques. Note that the messages are limited to the BTI course and do not represent all messages on the web platform.

#### Detailed Code Functionality: SIGAA Forum Scraping

The [Node.js script for SIGAA Forum Scraping](https://github.com/isaacmsl/scrapping-forum-sigaa) uses Puppeteer for browser automation and data collection from the SIGAA forums of UFRN. Each part of the code is described below, along with what it does:

#### 📁 Directory and File Structure
The scraping project includes the following files and directories:

- **.env.template**: Template for the environment variables required for execution.
- **.gitignore**: Specifies which files and directories to ignore in version control.
- **readme.md**: Documentation of the project, installation, and usage instructions.
- **index.js**: The main scraping script.
- **package.json**: Project metadata, such as dependencies and available scripts.
- **pnpm-lock.yaml**: Lock file to ensure the consistency of dependencies installed with PNPM.

#### 🛠 Environment Setup

##### Prerequisites
- **Node.js** version 20.6.1 or higher.
- **Package Manager**: PNPM, NPM, or Yarn.

##### Configuration
1. Clone the repository using:
   ```bash
   git clone https://github.com/isaacmsl/scrapping-forum-sigaa
   ```
2. Navigate to the cloned project directory.
3. Copy the `.env.template` to a new file named `.env`.
4. Fill in the variables in the `.env` file with your access information and desired settings:
   ```plaintext
   SIGAA_USERNAME="your_username"
   SIGAA_PASSWORD="your_password"
   INITIAL_FORUM_PAGE=1
   TERMINAL_FORUM_PAGE=2
   ```
5. Execute `pnpm install` to install the dependencies.

#### 🚀 Script Execution
To start the scraping process, run the following command in the terminal:
```bash
node .
```
The script will:
- Initiate a browser session.
- Log into SIGAA using the provided credentials.
- Navigate to the forum page.
- Collect messages from all forum pages between `INITIAL_FORUM_PAGE` and `TERMINAL_FORUM_PAGE`.
- Save the messages to the file given by the `OUTPUT_FILE_PATH` constant, which the script builds from the page range.

##### .gitignore
The `.gitignore` file includes:
- `node_modules/`: Node.js modules directory.
- `.env`: Environment variables file with sensitive information.
- `data/`: Directory for storing extracted data.

#### 🔐 Security
- Do not commit the `.env` file to version control to protect access credentials.
- Ensure secure connections when making requests to prevent data interception.

#### 📚 Dependencies
The script utilizes the following main libraries:
- **puppeteer**: For browser automation and data collection.
- **dotenv**: To load environment variables from the `.env` file.

##### 📦 Module Import and Configuration

```javascript
const puppeteer = require('puppeteer');
const fs = require('fs');
require('dotenv').config();
```

- **puppeteer**: Module that enables programmatic control of a Chrome or Chromium browser.
- **fs**: Node.js module for file and directory manipulation.
- **dotenv**: Module to load environment variables from a `.env` file.

##### 🌍 Environment Variables

```javascript
const {
    INITIAL_FORUM_PAGE,
    TERMINAL_FORUM_PAGE,
    SIGAA_USERNAME,
    SIGAA_PASSWORD
} = process.env;
```

- **INITIAL_FORUM_PAGE** and **TERMINAL_FORUM_PAGE**: Define the range of forum pages to be accessed.
- **SIGAA_USERNAME** and **SIGAA_PASSWORD**: User credentials for logging into SIGAA.

##### 🗂 Output File Path

```javascript
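const OUTPUT_FILE_PATH = `data/forum-pages-${INITIAL_FORUM_PAGE}-${TERMINAL_FORUM_PAGE}.txt`;
// data/ is git-ignored, so create it on a fresh clone before the first write
fs.mkdirSync('data', { recursive: true });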
```

- Defines the path and name of the file where the extracted data will be stored.

##### 📄 Function `getPageMessages`

```javascript
async function getPageMessages(page) {
    return await page.evaluate(() => {
        // Extract relevant information from the page's DOM
        const forumTitle = document.querySelector('h2').innerText.split("Portal do Discente > Discussão sobre ")[1];
        const msgs = [];
        const tbody = document.querySelector('tbody');
        const headers = tbody.querySelectorAll('tr.bg-claro');
        const contents = tbody.querySelectorAll('td[style="background-color: #FCFCFC;"]');
        
        for (let i = 0; i < Math.min(contents.length, headers.length); ++i) {
            const newMsg = `[${forumTitle}]\n{${headers[i].innerText}}\n*-start-content-*\n${contents[i].innerText}\n*-end-content-*`;
            msgs.push(newMsg);
        }
        return msgs;
    });
}
```

- This function navigates through the current forum page's DOM and collects messages, which are formatted in a specific style and returned as an array.
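
The text file produced this way has to be turned back into structured records before analysis. The source does not show how that was done, so the following is only a minimal Python sketch, assuming the exact template built in `getPageMessages`. The names `MESSAGE_PATTERN` and `parse_forum_dump` and the sample text are hypothetical. It matches whole messages with a regular expression instead of splitting on `*-delimit-msg-*`, because `join` only places the delimiter between messages of the same batch, not between two consecutive appends.

```python
import re

# Mirrors the template string built in getPageMessages:
# [title]\n{header}\n*-start-content-*\ncontent\n*-end-content-*
MESSAGE_PATTERN = re.compile(
    r"\[(?P<title>[^\n]*)\]\n"
    r"\{(?P<header>.*?)\}\n"
    r"\*-start-content-\*\n"
    r"(?P<content>.*?)\n"
    r"\*-end-content-\*",
    re.DOTALL,
)


def parse_forum_dump(raw_text):
    """Return one dict (title, header, content) per message found in raw_text."""
    return [match.groupdict() for match in MESSAGE_PATTERN.finditer(raw_text)]


# Hypothetical sample: two messages separated by the delimiter, then a third
# appended without one, as happens between two separate appends to the file.
sample = (
    "[Topic A]\n{01/01/2020 10:00 - ALICE}\n*-start-content-*\n"
    "Hello\nworld\n*-end-content-*"
    "\n*-delimit-msg-*\n"
    "[Topic A]\n{01/01/2020 11:00 - BOB}\n*-start-content-*\n"
    "Hi\n*-end-content-*"
    "[[Tag] Topic B]\n{02/01/2020 09:00 - CAROL}\n*-start-content-*\n"
    "up\n*-end-content-*"
)

records = parse_forum_dump(sample)
for record in records:
    print(record["title"], "|", record["header"], "|", repr(record["content"]))
```

A list of dictionaries like `records` can then be written out with `json.dump` or loaded into a DataFrame.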

##### 🔄 Main Script

```javascript
(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    await page.goto('https://sigaa.ufrn.br/sigaa/portais/discente/discente.jsf');
    await page.type("#username", SIGAA_USERNAME);
    await page.type("#password", SIGAA_PASSWORD);
    await Promise.all([
        page.click("button"),
        page.waitForNavigation({ waitUntil: 'networkidle2' })
    ]);

    const allMsgs = [];
    for (let pageIndex = Number(INITIAL_FORUM_PAGE); pageIndex <= Number(TERMINAL_FORUM_PAGE); ++pageIndex) {
        console.log(`Getting page ${pageIndex}`);
        const selectPageForum = await page.$("select");
        await Promise.all([
            selectPageForum.select(`${pageIndex-1}`),
            page.waitForNavigation({ waitUntil: 'networkidle2' })
        ]);
        
        // Iterates over each forum topic, clicking on it and extracting messages
        let topics = await page.$$("[id*='listagem:mostrar']");
        for (let i = 0; i < topics.length; i++) {
            const topic = topics[i];
            await Promise.all([
                topic.click(),
                page.waitForNavigation({ waitUntil: 'networkidle2' })
            ]);
            const msgs = await getPageMessages(page);
            fs.appendFileSync(OUTPUT_FILE_PATH, msgs.join('\n*-delimit-msg-*\n'));

            // Navigation between pages within a topic, if there are multiple pages
            let nextPageButton = await page.$("a[id*='j_id_jsp_454333105_172:pageNext']");
            while (nextPageButton) {
                await Promise.all([
                    nextPageButton.click(),
                    page.waitForNavigation({ waitUntil: 'networkidle2' })
                ]);
                const msgsNextPage = await getPageMessages(page);
                fs.appendFileSync(OUTPUT_FILE_PATH, msgsNextPage.join('\n*-delimit-msg-*\n'));
                msgs.push(...msgsNextPage);
                nextPageButton = await page.$("a[id*='j_id_jsp_454333105_172:pageNext']");
            }
            allMsgs.push(...msgs);
            console.log(`Pushed ${msgs.length} messages to allMsgs`);
        }
    }

    await browser.close();


})();
```

##### 🔍 Detailed Functioning
- **Starts the browser**: Non-headless for visual confirmation of actions.
- **Accesses the SIGAA login page**: Fills in user and password fields and logs in.
- **Forum page navigation**: Sequentially accesses topic pages between `INITIAL_FORUM_PAGE` and `TERMINAL_FORUM_PAGE`.
- **Message extraction from each topic**: Iterates over the topics on the page, collecting messages from each one, including navigating through subsequent pages within a topic.
- **Data storage**: Saves the collected messages to a text file.
- This process is carried out for each configured page and topic, comprising a comprehensive and systematic data collection from the forum.

### Exploration

So far, we know that the forum consists of topics related to BTI, created by students and professors. Each topic is represented by its title, its author, the number of replies (not counting the author's first message), and the date and time of the last message. Each reply, in turn, is represented by the date and time it was sent, its author with registration code (matrícula), and its content.

At the time of collection, the forum had 3070 registered topics. They are listed in pages of 30 topics, for a total of 103 pages. That figure alone does not convey the volume of messages, though: in total, there were 11920 messages over almost 11 years (2013 to 2024).
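
The page count follows directly from the figures above, and the same numbers give a rough idea of how active a typical topic is. A quick check of the arithmetic (the variable names are only illustrative):

```python
import math

topics = 3070
topics_per_page = 30
messages = 11920

# The last page is only partially filled, so round up.
pages = math.ceil(topics / topics_per_page)
print(pages)  # 103

# Average number of collected messages per topic.
messages_per_topic = messages / topics
print(round(messages_per_topic, 2))  # 3.88
```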

Let's have a look at our dataset:

In [1]:
# Importing the pandas library
# Pandas is commonly used for data manipulation and analysis.
import pandas as pd

# Loading the dataset
# Here, we're reading data from a JSON file named 'forum.json' located within a 'dataset' folder.
# The read_json function from pandas is used to directly convert the JSON file into a pandas DataFrame.
df = pd.read_json('dataset/forum.json')

# Displaying a random sample of the dataset
# The 'sample' method is used to randomly select 10 entries from the DataFrame.
# 'random_state' is set to 314 to ensure the reproducibility of this sample.
# This means whenever you run this code, it will always return the same random sample.
df.sample(10, random_state=314)


| | titulo | data | hora | nome | conteudo |
|---|---|---|---|---|---|
| 4274 | Pesquisa para alunos que já pagaram VGA (ajude... | 20/09/2021 | 16:33:27 | RAFAELA HORACINA SILVA ROCHA SOARES | Olá, pessoal.\n\nUma das alunas egressas do BT... |
| 8692 | Provas de proficiencia de inglês e matematica | 27/01/2016 | 15:56:25 | JOSÉ HERICK MELO DA SILVA | Não é obrigatório, contudo se for aprovado ser... |
| 9478 | Treinamento para o UCC 2015 | 25/03/2015 | 18:09:34 | MARIA CLARA SOUZA DE FONTES PEREIRA | Professora, eu gostaria de participar do trein... |
| 10032 | Novo Modelo Camisa BTI | 03/09/2014 | 15:51:25 | JOSE VICTOR GAMA BEZERRA | Roberto, como posso entrar em contato com você... |
| 5524 | Sugestões de minicursos para o DAAL promover | 29/10/2019 | 20:10:39 | ANDRECIO COSTA BEZERRA | Docker\n |
| 5522 | Sugestões de minicursos para o DAAL promover | 29/10/2019 | 10:41:48 | VICTOR HUGO FREIRE RAMALHO | docker\n |
| 1171 | Cadê o professor? | 18/08/2023 | 21:08:15 | RANNA BEATRIZ DE LIMA LISBOA | A turma do sábado também está sem professor. (... |
| 9028 | VOTAÇÃO - ambientes de estudos | 11/08/2015 | 18:29:46 | WILSON SILVA DE FARIAS | Ótima iniciativa, Cephas! Parabéns!\n\nRespond... |
| 7329 | [URGENTE] Demanda de Grafos, Cálculo numérico,... | 17/05/2017 | 15:06:54 | CARLOS ANTÔNIO DE OLIVEIRA NETO | Grafos é uma das últimas que faltam para mim, ... |
| 2724 | Disciplinas Eletivas | 28/08/2022 | 19:20:52 | VICTOR EDUARDO NASCIMENTO | Mandei essa mesma pergunta no início da matríc... |


An initial glance at the dataset shows us two areas of interest, the titles and the contents. The texts so far also seem surprisingly clean, but let's check further.

We can start by isolating the columns of interest.

We could leave titles attached to their respective content, but notice how titles repeat themselves in that format: every message in a topic carries the topic's title. Any analysis or model built on that would be heavily biased towards the words in the titles, which would not necessarily be representative of the forum as a whole.

In [2]:
# Extracting specific columns from the DataFrame
# Selecting a single column returns a pandas Series, so df_titles and df_content are Series
# holding the 'titulo' and 'conteudo' columns respectively.
# This allows us to focus on the titles and content of the forum messages separately for further analysis.

df_titles = df['titulo']
df_content = df['conteudo']

# Printing the shape of each Series
# The 'shape' attribute of a Series is a one-element tuple: (rows,).
# Here, we print the shapes to understand the amount of data we're dealing with.

print(df_titles.shape, df_content.shape)


(11922,) (11922,)


We can then easily get rid of repeated values in the titles with the pandas `drop_duplicates` method.

In [3]:
# Removing duplicate entries from the df_titles Series
# The 'drop_duplicates' method keeps the first occurrence of each title and discards the rest.
# Reassigning the result avoids modifying a column selected from df in place.

df_titles = df_titles.drop_duplicates()

# Displaying the new shape of the df_titles DataFrame after duplicates have been removed
# After removing duplicates, it's useful to check the shape of the DataFrame again to see how many unique titles remain.
# This information can help assess the diversity or repetitiveness of topics discussed in the forum.

df_titles.shape


(2711,)
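
To make the bias concrete, here is a small sketch with hypothetical rows (the titles are borrowed from the sample above, the message contents are made up). It counts how often a title word appears before and after deduplication, using only the standard library:

```python
from collections import Counter

# Hypothetical rows: (topic title, message content). The title is repeated
# once for every message posted in the topic.
rows = [
    ("Novo Modelo Camisa BTI", "qual o valor?"),
    ("Novo Modelo Camisa BTI", "tenho interesse"),
    ("Novo Modelo Camisa BTI", "up"),
    ("Oportunidade de Bolsa", "qual o prazo?"),
]


def word_counts(texts):
    """Count lowercase whitespace-separated tokens across all texts."""
    return Counter(word for text in texts for word in text.lower().split())


all_titles = [title for title, _ in rows]
unique_titles = list(dict.fromkeys(all_titles))  # keeps first-seen order

print(word_counts(all_titles)["camisa"])     # 3
print(word_counts(unique_titles)["camisa"])  # 1
```

With repeated titles, a word's frequency reflects how many replies its topic received rather than how many topics mention it.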

Analysing content and titles separately may be worthwhile, but it may also be interesting to see what we can get using both at once.

In [4]:
# Combining df_titles and df_content into a single Series
# We use the 'concat' function from pandas to concatenate the df_titles and df_content series.
# The 'axis=0' argument specifies that the concatenation should occur along the index (row-wise), effectively stacking one series on top of the other.
# This results in a single series, df_text, which holds both titles and contents of the forum messages.

df_text = pd.concat([df_titles, df_content], axis=0)

# Printing the shape of the combined DataFrame
# The 'shape' attribute is used again to determine the total number of entries in the df_text after concatenation.
# This provides a count of all textual data available for analysis, combining both titles and content.

df_text.shape


(14633,)
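
One detail worth keeping in mind: concatenating along `axis=0` keeps each Series' original index labels, so the combined Series can hold the same label twice, once for a title and once for a message. The toy data below is hypothetical and only illustrates the effect. Passing `ignore_index=True` is one way to get fresh, unique labels.

```python
import pandas as pd

# Hypothetical miniature versions of df_titles and df_content.
titles = pd.Series(["Topic A", "Topic B"], index=[0, 2])
contents = pd.Series(["first msg", "", "third msg"], index=[0, 1, 2])

stacked = pd.concat([titles, contents], axis=0)
print(stacked.index.tolist())   # [0, 2, 0, 1, 2]
print(stacked.index.is_unique)  # False

# Dropping by label removes every row that carries that label,
# here both the title "Topic B" and the message "third msg".
print(stacked.drop([2]).tolist())  # ['Topic A', 'first msg', '']

# With ignore_index=True the rows are renumbered from 0.
renumbered = pd.concat([titles, contents], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```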

Now that we have our datasets, let's check them again. First, message contents:

In [5]:
# Displaying a random sample of content from the df_content DataFrame
# This loop iterates over a randomly selected sample of 10 entries from df_content.
# The 'sample' method is used with 'random_state=200' to ensure reproducibility, meaning the same 10 entries are selected each time this code is run.
# Each content entry in the sample is printed to the console.

for content in df_content.sample(10, random_state=200):
    print(content)


Tem alguma previsão para o início das aulas?

Me interesso em DIM0346 - GERENCIAMENTO E SEGURANCA EM REDES DE COMPUTADORES.
Turnos: vespertino ou noturno.

up

kkkkkkkkkkkkkkkkkkkkkkkkkk desculpe Anrafell

IMD0902 é "INGLES TECNICO I" ou "Introdução à Internet das Coisas"?

Boa noite galera de TI!

NOSSOS CANECOS FINALMENTE PRONTINHOS PARA DISTRIBUIÇÃO

E sabe qual a melhor parte?

Você que comprou o Projeto ReX (festa de integração) vai ter em primeira mão!!

A primeira distribuição dos canecos será durante o PROJETO REX

Para aqueles que não compraram o caneco, temos uma boa notícia, ESTAREMOS VENDENDO NO PROJETO REX!!

CANECO + TIRANTE APENAS R$ 40,00

OBS: Aqueles que não pegarem durante o integra, ainda divulgaremos próxima semana como será o esquema para todo mundo pegar o caneco

Confere o vídeo no insta no link abaixo para ver o resultado iradooooo dos nossos canecos

https://www.instagram.com/reel/Cq9GBVjrg4Y/?igshid=YmMyMTA2M2Y=

Tenho interesse em participar, quais os pré re

Second, titles:

In [6]:
# Displaying a random sample of titles from the df_titles DataFrame
# This loop iterates over a randomly selected sample of 10 titles from df_titles.
# The 'sample' method selects these titles randomly and 'random_state=200' ensures that the selection is reproducible,
# meaning the same 10 titles will always be selected when this code is executed with the same random state.
# Each title is printed on a new line to provide a clear and separated view of each entry.

for title in df_titles.sample(10, random_state=200):
    print(title, '\n')


[Sugestão] Salas de estudos divididas 

VTEX no IMD - UFRN | 29/08 

Novas turmas criadas HOJE 

Processos de aproveitamento de estudos 

GRUPO PARA OS FORMANDOS 

Solicitação de Componente Curricular 

PROJETO DE EXTENSÃO - Oficina de Expressão Oral em Língua Inglesa 

Oportunidade de Bolsa 

Ferramentas para a disciplina IMD1001 - MATEMÁTICA ELEMENTAR 

Novo Coordenador e Sala da Coordenação 



In [7]:
# Displaying a random sample of text entries from the df_text DataFrame
# This loop iterates over a randomly selected sample of 10 text entries from df_text, which includes both titles and content.
# The 'sample' method is used with 'random_state=300' to ensure reproducibility. This setting ensures that the same
# 10 entries are selected each time the code is run with this random state.
# Each text entry from the sample is printed on a new line, followed by a blank line for better readability.

for txt in df_text.sample(10, random_state=300):
    print(txt, '\n')


Nova Disciplina 

Fui assaltado também na minha primeira semana no IMD, isso em 2020, enquanto aguardava na parada próximo às residencias universitarias. Muito perigoso.
 

Existe algum meio no SiGAA para conferir se o vínculo foi confirmado?
 

Jefferson,

pode ser que para alguns alunos aconteça algum problema. Peço que tente novamente em outro PC e veja se a unidade está montada.

Caso tenha algum problema, procure o pessoal da TI na sala B115.
 

Obrigado.
 

Obrigado Rubem.
 

Pessoal, eu fiz prova na sala A306 e esqueci minha garrafa d'água na sala. É uma garrafa de alumínio prateada... Se alguém encontrar por favor deixem na secretaria. Agradeço desde ja
 

Cade as disciplinas na sexta 6N1234 ??? Quem quer pagar no mínimo 5 matérias não consegue....
 

@jessielylvr
 

Cine Empreender com a SoftUrbano 



### Cleaning

For free-form text like online forum posts, these are surprisingly clean. Most of what counts as noise for our purposes is links, emails and citations (mentions). Removing special characters later on would turn these into gibberish, and they are not particularly easy to isolate either. Selecting only the most frequent words for our vocabulary might be enough to get rid of them, but let's run some tests.

In [8]:
# Importing the regular expressions library
import re

# Defining functions to remove specific types of text from strings using regular expressions

def remove_emails_com(text):
    # Removes any string that ends with '.com'. It targets continuous non-whitespace characters ending with '.com'.
    return re.sub(r'[^\s]+\.com', '', text).strip()

def remove_emails_br(text):
    # Removes any string that ends with '.br'. It targets continuous non-whitespace characters ending with '.br'.
    return re.sub(r'[^\s]+\.br', '', text).strip()

def remove_links(text):
    # Removes any string that starts with 'http' followed by any non-whitespace characters, effectively removing URLs.
    return re.sub(r'http[^\s]+', '', text).strip()

def remove_citations(text):
    # Removes any string that starts with '@', targeting mentions or citations which are common in emails and social media.
    return re.sub(r'@[^\s]+', '', text).strip()

# Example string containing different types of data to be cleaned
test_str = "mande para o email sample_email@ufrn.edu.br ou sample_email@gmail.com, ou se inscreva no site https://url.com.br/inscricao - @user"

# Applying the cleaning functions in sequence
result = remove_links(test_str)     # First remove any URLs
result = remove_emails_br(result)   # Then remove any '.br' emails
result = remove_emails_com(result)  # Then remove any '.com' emails
result = remove_citations(result)   # Finally, remove any user citations starting with '@'

# Printing the cleaned string
print(result)


mande para o email  ou , ou se inscreva no site  -
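
Two properties of these patterns are worth checking. First, the backslash before the dot matters. With an unescaped dot, the pattern also matches a space followed by `com`, so ordinary words get eaten. The cleaned sample shown later in this section reads "Você prou" where the raw message says "Você que comprou", which is the symptom such a pattern would produce. Second, the patterns remove any token ending in `.com` or `.br`, including bare domain names, which may or may not be desired. The sketch below illustrates both points. The `EMAIL` pattern is a simplified assumption, not a full address validator.

```python
import re

# Pattern as used above, with the dot escaped.
escaped = re.compile(r"[^\s]+\.com")
# Same pattern with the dot left unescaped (a typo that is easy to make).
unescaped = re.compile(r"[^\s]+.com")

sentence = "Você que comprou o caneco"
print(escaped.sub("", sentence))    # Você que comprou o caneco
print(unescaped.sub("", sentence))  # Você prou o caneco

# Assumption: a simplified address shape (local part, '@', dotted domain).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

mixed = "veja ufrn.br ou escreva para sample_email@ufrn.edu.br"
print(EMAIL.sub("", mixed).strip())  # veja ufrn.br ou escreva para
```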


These seem to work well enough. We could also check for spelling errors. However, the texts seem clean enough, and misspellings can be filtered out by frequency, so let's skip that step for now. Let's just add functions to get rid of stray formatting such as line breaks and tabs, along with repeated symbols and spacing errors around punctuation.

In [9]:
# Importing the regular expressions library
import re

def remove_repeated_symbols(text):
    # This function removes consecutive, repeated non-word characters (symbols) in the text.
    # It uses a regular expression to find all non-word characters (\W) that appear consecutively (\1+)
    # and replaces them with a single occurrence of that character.
    # Example: "Hello!!! How are you???" becomes "Hello! How are you?"
    return re.sub(r'(\W)\1+', r'\1', text).strip()

def remove_excessive_spaces(text):
    # This function consolidates multiple consecutive spaces into a single space.
    # It uses a regular expression to find all instances of one or more spaces (\s+)
    # and replaces them with a single space, thereby standardizing the spacing in the text.
    # Example: "This    is  a  test" becomes "This is a test".
    return re.sub(r'\s+', ' ', text).strip()

def fix_isolated_commas(text):
    # This function corrects punctuation spacing issues by removing any space that precedes punctuation marks.
    # It targets common punctuation characters: periods, commas, colons, semicolons, exclamation marks and question marks,
    # ensuring they directly follow the preceding word without a space.
    # Example: "Wait , what did you do ?" becomes "Wait, what did you do?"
    # The regular expression here looks for a space followed by any of the specified punctuation marks
    # and replaces it with just the punctuation mark.
    text = re.sub(r' ([.,:;!?])', r'\1', text)
    return text.strip()
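
The order in which these three functions run affects the result. `fix_isolated_commas` removes only a single space before a punctuation mark, so it works best after whitespace has been collapsed. A short sketch with a made-up input string (the function bodies are copied from above, without the comments):

```python
import re


def remove_repeated_symbols(text):
    return re.sub(r'(\W)\1+', r'\1', text).strip()


def remove_excessive_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()


def fix_isolated_commas(text):
    return re.sub(r' ([.,:;!?])', r'\1', text).strip()


raw = "Wait  ,  what???"

# Collapse whitespace and repeated symbols first, then fix punctuation.
step = remove_excessive_spaces(raw)   # "Wait , what???"
step = remove_repeated_symbols(step)  # "Wait , what?"
good = fix_isolated_commas(step)
print(good)  # Wait, what?

# Fixing punctuation first removes only one of the two spaces before the comma.
bad = remove_excessive_spaces(fix_isolated_commas(raw))
print(bad)  # Wait , what???
```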


Now we just need to build our pipeline.

In [10]:
# Importing necessary classes from scikit-learn
from sklearn.pipeline import Pipeline  # Pipeline applies a list of transforms and can also include an estimator at the end.
from sklearn.preprocessing import FunctionTransformer  # FunctionTransformer allows the application of any arbitrary function to the data.

# Defining the text cleaning pipeline
# This pipeline is designed to sequentially apply several text cleaning functions using FunctionTransformers.
# Each FunctionTransformer wraps a predefined function for integration into the pipeline structure.

pipeline_clean_text = Pipeline([
    # Removing hyperlinks from the text
    ('remove_links', FunctionTransformer(remove_links)),
    # Removing email addresses that end with .br
    ('remove_emails_br', FunctionTransformer(remove_emails_br)),
    # Removing email addresses that end with .com
    ('remove_emails_com', FunctionTransformer(remove_emails_com)),
    # Removing user mentions or citations that start with '@'
    ('remove_citations', FunctionTransformer(remove_citations)),
    # Consolidating multiple spaces into a single space
    ('remove_excessive_spaces', FunctionTransformer(remove_excessive_spaces)),
    # Removing consecutive repeated symbols
    ('remove_repeated_symbols', FunctionTransformer(remove_repeated_symbols)),
    # Correcting isolated commas and other punctuations
    ('fix_isolated_commas', FunctionTransformer(fix_isolated_commas)),
])

# Example string for testing
test_str = "mande para o email sample_email@ufrn.edu.br ou sample_email@gmail.com, ou se inscreva no site https://url.com.br/inscricao - @user"

# Applying the pipeline to the example string
# Each FunctionTransformer passes its input straight to the wrapped function, so the pipeline
# must receive a single string here; wrapping it in a list makes re.sub raise a TypeError.
cleaned_text = pipeline_clean_text.transform(test_str)

# Printing the cleaned text
print(cleaned_text)


mande para o email ou, ou se inscreva no site -
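
Because every cleaning step is a plain function from string to string, the same chain can also be written without scikit-learn. The sketch below is an alternative, not the approach used in this notebook. It composes functions with `functools.reduce`. The three steps are hypothetical, simplified stand-ins for the functions defined earlier, and `clean` is an illustrative name.

```python
import re
from functools import reduce


# Stand-in steps (hypothetical, simplified versions of the functions above).
def strip_links(text):
    return re.sub(r'http\S+', '', text)


def strip_mentions(text):
    return re.sub(r'@\S+', '', text)


def collapse_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()


STEPS = [strip_links, strip_mentions, collapse_spaces]


def clean(text, steps=STEPS):
    """Apply each step in order, feeding the result of one into the next."""
    return reduce(lambda current, step: step(current), steps, text)


print(clean("veja https://url.com.br/inscricao   e fale com @user agora"))
# veja e fale com agora
```

A function like `clean` takes one string at a time, so it could be passed directly to a Series' `apply` method.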

Now that we have set up our pipeline, let's clean our datasets:

In [None]:
# Applying the text cleaning pipeline to various DataFrame columns
# This code demonstrates how to use the previously defined pipeline to clean different segments of textual data within the DataFrame.

# Cleaning the combined text column (titles and contents)
# The 'apply' method is used on the 'df_text' DataFrame series to apply the 'pipeline_clean_text.transform' function,
# which processes each text entry through the entire sequence of cleaning operations defined in the pipeline.
text_clean = df_text.apply(pipeline_clean_text.transform)

# Cleaning the titles column
# Similar to the combined text, the titles are processed through the pipeline,
# ensuring that all textual transformations (like removing links, emails, and excess spaces) are consistently applied.
titles_clean = df_titles.apply(pipeline_clean_text.transform)

# Cleaning the content column
# The content of the forum messages is also cleaned using the same pipeline,
# which standardizes the text formatting and removes unwanted textual elements.
content_clean = df_content.apply(pipeline_clean_text.transform)

# These cleaned data series now contain the processed text ready for further analysis or use in machine learning models,
# improving the accuracy of text-based insights and predictions by ensuring data consistency and cleanliness.


In [None]:
# Displaying a random sample of cleaned text from the content_clean DataFrame
# This loop iterates over a randomly selected sample of 10 entries from the content_clean series.
# The 'sample' method is used with 'random_state=200' to ensure reproducibility, meaning the same 10 entries are selected each time this code is run.
# Each cleaned content entry in the sample is printed on a new line, followed by a blank line for better readability.

for txt in content_clean.sample(10, random_state=200):
    print(txt, '\n')


Tem alguma previsão para o início das aulas? 

Me interesso em DIM0346 - GERENCIAMENTO E SEGURANCA EM REDES DE COMPUTADORES. Turnos: vespertino ou noturno. 

up 

kkkkkkkkkkkkkkkkkkkkkkkkkk desculpe Anrafell 

IMD0902 é "INGLES TECNICO I" ou "Introdução à Internet das Coisas"? 

Boa noite galera de TI! NOSSOS CANECOS FINALMENTE PRONTINHOS PARA DISTRIBUIÇÃO E sabe qual a melhor parte? Você prou o Projeto ReX (festa de integração) vai ter em primeira mão! A primeira distribuição dos canecos será durante o PROJETO REX Para aqueles que praram o caneco, temos uma boa notícia, ESTAREMOS VENDENDO NO PROJETO REX! CANECO + TIRANTE APENAS R$ 40,00 OBS: Aqueles que não pegarem durante o integra, ainda divulgaremos próxima o será o esquema para todo mundo pegar o caneco Confere o vídeo no insta no link abaixo para ver o resultado iradooooo dos nossos canecos 

Tenho interesse em participar, quais os pré requisitos? 

Seria possível disponibilizar Desenvolvimento WEB 1? 

Olá Any, a página do curso

In [None]:
# Displaying a random sample of cleaned titles from the titles_clean DataFrame
# This loop iterates over a randomly selected sample of 10 entries from the titles_clean series.
# The 'sample' method is used with 'random_state=200' to ensure reproducibility. This setting ensures that the same 10 titles will always be selected when this code is executed with the same random state.
# Each cleaned title in the sample is printed on a new line, followed by a blank line for better readability and separation between entries.

for txt in titles_clean.sample(10, random_state=200):
    print(txt, '\n')


[Sugestão] Salas de estudos divididas 

VTEX no IMD - UFRN | 29/08 

Novas turmas criadas HOJE 

Processos de aproveitamento de estudos 

GRUPO PARA OS FORMANDOS 

Solicitação de Componente Curricular 

PROJETO DE EXTENSÃO - Oficina de Expressão Oral em Língua Inglesa 

Oportunidade de Bolsa 

Ferramentas para a disciplina IMD1001 - MATEMÁTICA ELEMENTAR 

Novo Coordenador e Sala da Coordenação 



In [None]:
# Displaying a random sample of cleaned texts from the text_clean DataFrame
# This loop iterates over a randomly selected sample of 10 entries from the text_clean series.
# The 'sample' method is used with 'random_state=300' to ensure reproducibility, meaning the same 10 texts will always be selected each time this code is executed with the same random state.
# Each cleaned text entry in the sample is printed on a new line, followed by a blank line for better readability and separation between entries.

for txt in text_clean.sample(10, random_state=300):
    print(txt, '\n')


Nova Disciplina 

Fui assaltado também na minha primeira semana no IMD, isso em 2020, enquanto aguardava na parada próximo às residencias universitarias. Muito perigoso. 

Existe algum meio no SiGAA para conferir se o vínculo foi confirmado? 

Jefferson, pode ser que para alguns alunos aconteça algum problema. Peço que tente novamente em outro PC e veja se a unidade está montada. Caso tenha algum problema, procure o pessoal da TI na sala B115. 

Obrigado. 

Obrigado Rubem. 

Pessoal, eu fiz prova na sala A306 e esqueci minha garrafa d'água na sala. É uma garrafa de alumínio prateada. Se alguém encontrar por favor deixem na secretaria. Agradeço desde ja 

Cade as disciplinas na sexta 6N1234? Quem quer pagar no mínimo 5 matérias não consegue. 

 

Cine a SoftUrbano 



### Analysis

Let's do some analysis of our clean datasets.

In [None]:
# Calculating word counts for each entry in the cleaned content, titles, and combined text dataframes
# This is performed by splitting each string entry into words (using space as the delimiter) and then applying the 'len' function to count the words.

# Calculating word counts for the content
word_counts_content = content_clean.str.split().apply(len)

# Calculating word counts for the titles
word_counts_titles = titles_clean.str.split().apply(len)

# Calculating word counts for the combined text of titles and content
word_counts_text = text_clean.str.split().apply(len)

# Generating a statistical summary of the word counts in the content
# The 'describe' method provides a descriptive statistical summary including count, mean, standard deviation, minimum, and percentile values.
# This summary helps to understand the distribution of word counts across the cleaned content data,
# which is important for further analyses, such as identifying average post length or detecting outliers.

word_counts_content.describe()


count    11922.000000
mean        42.207012
std         71.041758
min          0.000000
25%          6.000000
50%         18.000000
75%         47.000000
max       1143.000000
Name: conteudo, dtype: float64
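
The word counts behind this summary are simply the number of whitespace-separated tokens per entry, which is why an empty string yields the minimum of 0. A standard-library sketch on a few hypothetical cleaned messages shows the same computation on a small scale:

```python
import statistics

# Hypothetical cleaned messages; the last one is empty.
texts = ["up", "Obrigado Rubem.", "Tem alguma previsão para o início das aulas?", ""]

# str.split() with no argument splits on any run of whitespace.
word_counts = [len(text.split()) for text in texts]
print(word_counts)                         # [1, 2, 8, 0]
print(statistics.mean(word_counts))        # 2.75
print(statistics.median(word_counts))      # 1.5
print(min(word_counts), max(word_counts))  # 0 8
```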

In [None]:
# Generating a statistical summary of the word counts in the cleaned titles
# The 'describe' method is used here to summarize the statistics of word counts in the titles dataset,
# which has previously been calculated to count the number of words in each title entry.
# This summary includes key statistics such as the count of entries, mean (average) word count,
# standard deviation (a measure of variability), minimum word count, and the 25th, 50th (median), and 75th percentiles,
# providing insights into the typical length and distribution of words in titles.

word_counts_titles.describe()


count    2711.000000
mean        6.901143
std         3.428527
min         1.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        20.000000
Name: titulo, dtype: float64

In [None]:
# Generating a statistical summary of the word counts in the combined text of titles and content
# The 'describe' method provides a descriptive statistical summary of the word counts for the entire cleaned text,
# which includes both titles and content. This summary offers insights into the central tendency, dispersion,
# and shape of the distribution of word counts across all text entries.
# It includes the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values.
# This analysis helps understand the overall verbosity and text density in the cleaned dataset.

word_counts_text.describe()


count    14633.000000
mean        35.666029
std         65.591024
min          0.000000
25%          5.000000
50%         13.000000
75%         37.000000
max       1143.000000
dtype: float64

You may have noticed that we have some empty content. Let's get rid of it.

In [None]:
# Removing entries with zero words from the cleaned content and combined text dataframes
# This operation is essential for cleaning data further to ensure that no empty strings are left in the dataset, which can skew analysis.

# Initializing an empty list to hold the indices of entries that are empty
indexDrop = []

# Iterating over the content_clean series to find and record the index of each empty entry
for i in range(len(content_clean)):
    if len(content_clean[i]) == 0:
        indexDrop.append(i)  # Appending the index of the empty entry to the list

# Dropping the recorded indices from the content_clean Series
# 'axis=0' indicates that rows are to be dropped
# 'inplace=True' modifies the original Series directly
content_clean.drop(indexDrop, axis=0, inplace=True)

# Similarly, dropping the same indices from the text_clean series to maintain consistency across data
text_clean.drop(indexDrop, axis=0, inplace=True)

# Recalculating word counts after removing empty entries
# This step recalculates the word counts for each entry in content_clean by splitting the text into words and counting them
word_counts_content = content_clean.str.split().apply(len)

# Describing the new word counts to see the updated statistics after cleaning
# The describe method provides a statistical summary including count, mean, std (standard deviation), min, and various percentiles
word_counts_content.describe()


count    11877.000000
mean        42.366928
std         71.128611
min          1.000000
25%          6.000000
50%         19.000000
75%         47.000000
max       1143.000000
Name: conteudo, dtype: float64
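
The loop above relies on `content_clean` having a default integer index, because `content_clean[i]` looks entries up by label. The same filtering can be sketched with a boolean mask. The small Series below (`content_demo`, `text_demo`) are hypothetical stand-ins for the real data, and the sketch assumes both share the same index, as the cell above does.

```python
import pandas as pd

# Hypothetical stand-ins for content_clean and text_clean (same index).
content_demo = pd.Series(["bom dia turma", "", "prova amanhã", ""])
text_demo = pd.Series(["aviso bom dia turma", "titulo", "prova prova amanhã", "outro"])

# True where the entry contains at least one word after splitting on whitespace.
has_words = content_demo.str.split().str.len() > 0

# Keep only the rows flagged True; the labels of the kept rows are preserved.
content_kept = content_demo[has_words]
text_kept = text_demo[has_words]

print(content_kept.tolist())
print(list(content_kept.index))
```

Unlike the `len(...) == 0` check, this mask also treats whitespace-only strings as empty, since splitting them yields no words.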

In [None]:
# Generating a statistical summary of the word counts in the cleaned and filtered combined text
# After removing entries with zero words and ensuring data consistency, we re-evaluate the word counts in the combined text data.
# The 'describe' method is used to provide a statistical summary of the word counts for the text_clean series,
# which includes both titles and contents. This summary gives insights into the central tendency, dispersion,
# and the distribution shape of the word counts across the dataset.
# It includes metrics such as the count of non-empty entries, mean (average) word count,
# standard deviation (measure of word count variability), minimum word count, and the 25th, 50th (median), and 75th percentiles.

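# Recalculating the word counts for the combined text, since entries were dropped from text_clean above
word_counts_text = text_clean.str.split().apply(len)

word_counts_text.describe()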


count    14582.000000
mean        35.787889
std         65.673049
min          1.000000
25%          5.000000
50%         13.000000
75%         37.000000
max       1143.000000
dtype: float64

Luckily there wasn't much to get rid of. Now let's continue:

In [None]:
# Importing the Plotly library for creating interactive graphs
import plotly.graph_objects as go

def word_count_distr_graph(word_counts, title):
    """
    This function creates a histogram to visualize the distribution of word counts.
    
    Parameters:
    word_counts (Series): A pandas Series containing the word counts for each text entry.
    title (str): The title of the histogram, which describes what the histogram represents.
    
    The function utilizes Plotly's Graph Objects to create the histogram, allowing for interactive exploration of the data.
    'go.Figure()' initializes a new figure for plotting, and 'go.Histogram()' specifies that we want to plot a histogram.
    The 'x' parameter in 'go.Histogram()' is set to the word_counts, which plots these counts along the x-axis.
    'update_layout()' is used to set the title of the graph, where 'title_text' specifies the actual title string,
    and 'title_x=0.5' centers the title over the histogram.
    'fig.show()' renders the figure in the output cell of the notebook or the web interface being used.
    """

    # Create a new figure for the histogram
    fig = go.Figure(data=[go.Histogram(x=word_counts)])

    # Update the layout to add a title and center it
    fig.update_layout(title_text=title, title_x=0.5)

    # Display the figure
    fig.show()

# Example usage of the function to plot the distribution of word counts in the combined text data
word_count_distr_graph(word_counts_text, 'Word count distribution in forum')


In [None]:
# Visualizing the distribution of word counts in the titles of the forum
# This call to the word_count_distr_graph function uses the word_counts_titles data,
# which contains the word count for each cleaned title in the dataset.

# The title parameter specifies what the histogram will represent: "Word count distribution in titles".
# This visualization helps to understand the typical length of titles used in the forum,
# providing insights into how succinctly information is typically presented in title form.

# Using the function defined earlier, we create and display a histogram that shows the frequency distribution of word counts in the titles.
word_count_distr_graph(word_counts_titles, 'Word count distribution in titles')


In [None]:
# Visualizing the distribution of word counts in the content of messages from the forum
# This invocation of the word_count_distr_graph function uses the word_counts_content data,
# which contains the word count for each cleaned content entry in the dataset.

# The title 'Word count distribution in messages' clearly describes the purpose of the histogram,
# indicating that the visualization aims to show how word counts are distributed across the messages in the forum.
# This helps in understanding not only the general length of the messages but also the variability and density of information within them.

# By using the previously defined function, we generate and display a histogram to visualize the frequency distribution of word counts in the messages.
word_count_distr_graph(word_counts_content, 'Word count distribution in messages')


It seems there are some words that appear very frequently in this dataset. Let's get an idea of what they are with Counter.

In [None]:
# Import necessary libraries
from collections import Counter
import plotly.graph_objects as go

def plot_histogram_word(text_list, n_most_common=30, title="text"):
    """
    This function creates a histogram of the most frequent words from a list of text entries.
    
    Parameters:
    - text_list: A list or series of text entries from which words will be counted.
    - n_most_common: The number of most frequent words to display in the histogram.
    - title: A descriptive title for the histogram to indicate the source of the text data.
    
    The function processes the list of texts by splitting each entry into words, counting the occurrences of each word,
    and then displaying the most frequent words in a histogram for easy visualization.
    """

    # Create a list of words by iterating over each text entry and splitting it into words
    words = []
    for txt in text_list:
        words += txt.split()

    # Count the frequency of each word in the list using Counter from the collections module
    word_counts = Counter(words)

    # Select the top 'n_most_common' words for display
    word_counts = dict(word_counts.most_common(n_most_common))

    # Create a bar chart using Plotly, where the x-axis represents the words and the y-axis their counts
    fig = go.Figure([go.Bar(x=list(word_counts.keys()), y=list(word_counts.values()))])
    fig.update_layout(title_text=f'Top {n_most_common} most frequent words in the {title}', title_x=0.5)
    fig.show()

# Example usages of the function:

# Plot the most frequent words in the cleaned combined text (titles and contents) of the forum
plot_histogram_word(text_clean, 30, 'forum')

# Plot the most frequent words in the cleaned titles of the forum
plot_histogram_word(titles_clean, 30, 'titles')

# Plot the most frequent words in the cleaned content of the messages in the forum
plot_histogram_word(content_clean, 30, 'messages')
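
To make the counting step inside `plot_histogram_word` concrete, here is a minimal sketch of `Counter.most_common` on a tiny corpus. The `demo_texts` list is invented for illustration and is not taken from the forum data.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the forum messages.
demo_texts = ["de a prova", "de a turma", "de pln"]

# Flatten every entry into one list of whitespace-separated tokens.
words = [word for text in demo_texts for word in text.split()]

# most_common(n) returns (word, count) pairs sorted by descending count.
top_two = Counter(words).most_common(2)
print(top_two)
```

Even in this toy example the top positions go to function words such as "de" and "a", which is the same pattern the plots above show.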


There sure are a lot of stop words we need to get rid of. We'll use both spaCy's and NLTK's lists for this:

In [None]:
# Importing the spaCy library
import spacy

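# Loading the Portuguese language model provided by spaCy
# 'pt_core_news_sm' is a small model for Portuguese that includes various NLP capabilities.
# The model is distributed separately from spaCy; if it is missing, install it first with:
#   python -m spacy download pt_core_news_sm
nlp = spacy.load('pt_core_news_sm')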

# Accessing the default stop words list from the loaded Portuguese language model
# Stop words are commonly used words (such as "and", "the", etc.) that are usually ignored in text processing tasks.
stopwords_spacy = nlp.Defaults.stop_words

# Displaying the first 10 stop words from the list to get an idea of what types of words are considered stop words in Portuguese
# This helps in understanding which words will be excluded during text processing tasks like tokenization or text normalization.
list(stopwords_spacy)[:10]



In [None]:
# Importing the Natural Language Toolkit (NLTK) library
import nltk

# Importing the stopwords resource from the NLTK corpus module
from nltk.corpus import stopwords

# Downloading the stopwords dataset from NLTK
# This dataset includes stopwords for multiple languages, including Portuguese.
nltk.download('stopwords')

# Retrieving the list of Portuguese stopwords from the NLTK corpus
# Stopwords are frequently occurring words (like conjunctions and prepositions) that are usually removed in the processing of text data
# because they contribute little to the overall meaning of text and can skew analyses based on word frequencies.
stopwords_nltk = stopwords.words('portuguese')

# Displaying the first 10 stopwords from the NLTK list to understand what types of words are included
# This preview helps in assessing the typical stopwords used in Portuguese texts.
list(stopwords_nltk)[:10]


[nltk_data] Downloading package stopwords to /home/isaac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a',
 'à',
 'ao',
 'aos',
 'aquela',
 'aquelas',
 'aquele',
 'aqueles',
 'aquilo',
 'as']

In [None]:
# Combining stop words lists from NLTK and spaCy libraries
# This step involves creating a unified set of stop words by merging the stop words from both the NLTK and spaCy libraries.
# The purpose of merging these lists is to create a comprehensive collection of stop words for the Portuguese language.

# Create a set from the NLTK Portuguese stop words
stopwords_nltk = set(stopwords.words('portuguese'))

# Create a set from the spaCy Portuguese stop words
stopwords_spacy = set(nlp.Defaults.stop_words)

# Union of both stop words sets
# The '|' operator is used to perform a union operation between the two sets, ensuring that all unique stop words from both
# NLTK and spaCy are included in the resulting set.
both_stopwords = stopwords_nltk | stopwords_spacy

# The resulting 'both_stopwords' set contains a comprehensive list of Portuguese stop words that can be used in text preprocessing
# to remove less meaningful words from text data.
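
As a small illustration of the union step, the sketch below merges two made-up excerpts (`nltk_demo` and `spacy_demo` are hypothetical; the real lists are much longer) and also shows which words only one source contributes.

```python
# Hypothetical excerpts; the real lists come from NLTK and spaCy.
nltk_demo = {"a", "ao", "de", "que"}
spacy_demo = {"de", "que", "porém", "ainda"}

# Union keeps every word that appears in at least one list, without duplicates.
combined = nltk_demo | spacy_demo

# Words contributed by only one of the two sources.
only_spacy = spacy_demo - nltk_demo

print(sorted(combined))
print(sorted(only_spacy))
```

Keeping the result as a set rather than a list also makes the later `token not in both_stopwords` membership test fast.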



In [None]:
# Importing the regular expression module used to strip punctuation
import re

# Defining a function to remove stopwords from a text
def remove_stop_word(text):
    r"""
    Removes stopwords from a given text string.
    
    Parameters:
    - text (str): The text from which stopwords are to be removed.

    Steps:
    1. Remove all non-alphanumeric characters (punctuation, special symbols) from the text to simplify word separation.
       This is done using a regular expression that matches any character not a word character (\w) or whitespace (\s),
       and replaces them with an empty string.
    2. Split the cleaned text into individual words (tokens) based on whitespace.
    3. Filter out any words that are in the predefined set of stopwords ('both_stopwords').
       This is achieved using a filter function where only words not in the 'both_stopwords' set are retained.
    4. Join the filtered words back into a single string, separated by spaces, to form the cleaned text without stopwords.

    Returns:
    - The cleaned text as a single string without any stopwords.
    """

    # Remove punctuation and special characters from the text
    text = re.sub(r'[^\w\s]', '', text)

    # Split the text into words
    tokens = text.split()

    # Filter out stopwords from the tokens
    tokens = filter(lambda token: token not in both_stopwords, tokens)

    # Join the remaining words back into a single string
    return ' '.join(tokens)


We took advantage of this process to also remove non-alphanumeric characters from our tokens, since they are not going to be relevant. Let's see if we improved somewhat.
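
The punctuation-stripping step can be tried in isolation. This is a minimal sketch using the same regular expression as `remove_stop_word`; the helper name `strip_punctuation` and the sample sentence are made up for illustration.

```python
import re

def strip_punctuation(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

sample = "Olá, turma! A avaliação será amanhã (sala_2)."
print(strip_punctuation(sample))
```

Two details are worth keeping in mind: in Python 3 `\w` is Unicode-aware, so accented Portuguese letters survive (as does the underscore), and because matches are deleted rather than replaced by a space, a hyphenated word such as "bem-vindo" becomes "bemvindo".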

In [None]:
# Applying the remove_stop_word function to different segments of the dataset
# This step is essential for refining the text data further by eliminating stopwords,
# which are words typically filtered out in pre-processing due to their minimal lexical content.

# Applying the function to the 'text_clean' Series
# This operation processes the entire text (including both titles and content) to remove all stopwords,
# resulting in a version of the text that consists only of the remaining, more meaningful words.
text_clean_no_stopwords = text_clean.apply(remove_stop_word)

# Applying the function to the 'titles_clean' Series
# Similarly, this applies the stopword removal to just the titles, sharpening the focus on their key terms.
titles_clean_no_stopwords = titles_clean.apply(remove_stop_word)

# Applying the function to the 'content_clean' Series
# Here, the function is used to refine the content text by removing all stopwords,
# which helps in text analysis tasks like sentiment analysis or topic modeling,
# where the focus is on the content's substantive terms.
content_clean_no_stopwords = content_clean.apply(remove_stop_word)

# These refined datasets now contain text that is stripped of common, less informative words,
# making them more suitable for deeper analytical tasks that require a higher level of text precision and relevance.



In [None]:
# Visualizing the most frequent words in various text data sets after stopwords have been removed
# These visualizations help to understand which words are most prevalent in each section of the dataset
# after filtering out common stopwords that typically don't contribute meaningful information.

# Plotting the most frequent words in the cleaned forum text
# The 'text_clean_no_stopwords' dataset contains the combined titles and content of the forum with stopwords removed.
# The histogram will display the top 30 most frequent words in this dataset, helping to highlight key themes and topics.
plot_histogram_word(text_clean_no_stopwords, 30, 'forum')

# Plotting the most frequent words in the cleaned titles
# The 'titles_clean_no_stopwords' dataset contains just the titles with stopwords removed.
# This histogram focuses on the most frequent terms used in titles, which can provide insights into the primary subjects.
plot_histogram_word(titles_clean_no_stopwords, 30, 'titles')

# Plotting the most frequent words in the cleaned content of the messages
# The 'content_clean_no_stopwords' dataset comprises the body text of messages with stopwords removed.
# Visualizing the top 30 most frequent words in this content helps to understand the main points of discussion or interest within the messages.
plot_histogram_word(content_clean_no_stopwords, 30, 'messages')


Some interesting things are showing up, but it would have been more useful to lowercase all text before removing stopwords: obvious stop words such as 'A' and 'O' survived because the stop word lists are lowercase and the comparison is case-sensitive.
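
The effect of case on stop word filtering is easy to reproduce. The sketch below uses a made-up three-word stop word set and a hypothetical helper `drop_stopwords` that mirrors the filtering logic of `remove_stop_word` without the punctuation step.

```python
# Hypothetical stop word set; stop word lists are typically all lowercase.
demo_stopwords = {"a", "o", "de"}

def drop_stopwords(text, stopwords):
    # Keep only tokens that are not in the stop word set (exact, case-sensitive match).
    return " ".join(token for token in text.split() if token not in stopwords)

sample = "A turma de PLN e O professor"

# Without lowercasing, the capitalized 'A' and 'O' slip through.
print(drop_stopwords(sample, demo_stopwords))

# Lowercasing first lets every stop word match.
print(drop_stopwords(sample.lower(), demo_stopwords))
```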

In [None]:
# Converting text to lowercase and removing stopwords
# Lowercasing all text data is a common preprocessing step in text analysis to standardize the text and ensure consistency,
# as it eliminates variations caused by capitalization. After converting to lowercase, the text has stopwords removed,
# which are common words that typically do not contribute meaningful information to the analysis.

# Applying lowercase conversion and stopword removal to the combined text dataset
text_clean_no_stopwords_lower = text_clean.str.lower().apply(remove_stop_word)

# Applying lowercase conversion and stopword removal to the titles dataset
titles_clean_no_stopwords_lower = titles_clean.str.lower().apply(remove_stop_word)

# Applying lowercase conversion and stopword removal to the content dataset
content_clean_no_stopwords_lower = content_clean.str.lower().apply(remove_stop_word)

# Visualizing the most frequent words after preprocessing
# These histograms display the top 30 most frequent words from each dataset.
# These visualizations are useful for identifying the key terms that dominate in each text segment,
# providing insights into the themes and topics that are most prevalent.

# Plotting the histogram for the cleaned and processed combined text of the forum
plot_histogram_word(text_clean_no_stopwords_lower, 30, 'forum')

# Plotting the histogram for the cleaned and processed titles
plot_histogram_word(titles_clean_no_stopwords_lower, 30, 'titles')

# Plotting the histogram for the cleaned and processed content of the messages
plot_histogram_word(content_clean_no_stopwords_lower, 30, 'messages')


That's better, but now we can see we're getting some extra information that might not be all that useful for now. Is the difference between 'disciplina' and 'disciplinas' particularly important for what we want right now? What about 'interesse' or 'interessado'?

It seems in this case lemmatizing might be a good idea. But let's save our current datasets first.

In [None]:
# Saving the processed text data to CSV files
# After preprocessing the text by converting it to lowercase and removing stopwords, the cleaned data is saved to CSV files.
# This allows the data to be easily accessed for future use, whether for further analysis, reporting, or machine learning applications.

# Saving the cleaned and processed combined text of the forum to a CSV file
# The Series 'text_clean_no_stopwords_lower' contains the entire text data with stopwords removed and converted to lowercase.
# The data is saved to a file named 'text_clean_no_stopwords_lower.csv', preserving the text structure for future use.
text_clean_no_stopwords_lower.to_csv('text_clean_no_stopwords_lower.csv')

# Saving the cleaned and processed titles to a CSV file
# The Series 'titles_clean_no_stopwords_lower' includes all the titles from the forum, cleaned and processed.
# It is saved to 'titles_clean_no_stopwords_lower.csv', facilitating easy access and analysis of the title data.
titles_clean_no_stopwords_lower.to_csv('titles_clean_no_stopwords_lower.csv')

# Saving the cleaned and processed content of the messages to a CSV file
# The Series 'content_clean_no_stopwords_lower' consists of the message content data, cleaned and preprocessed.
# This data is saved to 'content_clean_no_stopwords_lower.csv', ensuring that the content is available for deeper textual analysis or other processing.
content_clean_no_stopwords_lower.to_csv('content_clean_no_stopwords_lower.csv')


We can now do some lemmatizing:

In [None]:
def spacy_lemmatizer(text):
    """
    Applies lemmatization to a given text using the spaCy library, focusing on Portuguese language processing.
    
    Lemmatization is the process of reducing words to their base or root form. For example, the words "running", "runs", and "ran"
    are all forms of the word "run", which is the lemma of all these words. This normalization process is beneficial for natural
    language processing applications where the exact form of a word is less important than its concept and meaning.

    Parameters:
    - text (str): The text to be lemmatized.

    Returns:
    - str: A string where each word has been replaced by its lemma, and only lemmas with more than two characters are included.

    Steps:
    1. The text is processed using the spaCy NLP object 'nlp' to create a document object.
    2. Each token (word) in the document is then converted to its lemma form using list comprehension.
    3. The list comprehension also filters out lemmas that are shorter than three characters to focus on more significant words.
    4. The resulting list of lemmas is joined back into a single string, separated by spaces, to form the lemmatized text.
    """
    # Convert the text to a spaCy document, enabling NLP features like lemmatization
    doc = nlp(text)

    # Extract the lemma for each token in the document and filter out short words
    txt = [token.lemma_ for token in doc]
    txt = [word for word in txt if len(word) > 2]

    # Join the lemmas into a single string and return it
    return ' '.join(txt)


(This might take some minutes)

In [None]:
# Applying lemmatization to the cleaned text data
# After converting the text to lowercase and removing stopwords, this step further processes the text by applying lemmatization.
# Lemmatization converts words to their base or dictionary form, which is useful for standardizing different forms of the same word.

# Using the 'spacy_lemmatizer' function, we apply lemmatization to the 'text_clean_no_stopwords_lower' Series.
# This Series contains the text that has already been converted to lowercase and had stopwords removed.
# Lemmatization is applied to each entry in the Series, which helps reduce each word to its base form,
# ensuring that variations of a word are treated as the same term (e.g., "running", "runs" → "run").

# The result is stored in a new Series 'text_clean_lemmatized', which contains the lemmatized text data.
# This lemmatized text is more uniform and is especially useful for tasks that benefit from a normalized vocabulary,
# such as feature extraction for machine learning models or detailed text analytics.

text_clean_lemmatized = text_clean_no_stopwords_lower.apply(spacy_lemmatizer)


In [None]:
# Applying lemmatization to the cleaned titles
# After converting the titles to lowercase and removing stopwords, we proceed with lemmatization to further refine the text.
# Lemmatization is the process of reducing words to their base or dictionary form, helping to consolidate different forms of a word into a single, common root.

# Using the 'spacy_lemmatizer' function, we apply lemmatization to the 'titles_clean_no_stopwords_lower' Series.
# This Series contains titles that have already been converted to lowercase and had stopwords removed.
# The lemmatization process will ensure that variations of a word are treated as the same term,
# which is particularly beneficial for tasks that rely on text uniformity and reduced vocabulary complexity.

# The result is stored in a new Series 'titles_clean_lemmatized', which contains the lemmatized text of the titles.
# Lemmatized titles give a more standardized representation of the topics being discussed.

titles_clean_lemmatized = titles_clean_no_stopwords_lower.apply(spacy_lemmatizer)


In [None]:
# Applying lemmatization to the cleaned content
# After removing stopwords and converting the content text to lowercase, the next step involves lemmatization.
# Lemmatization is essential for normalizing the various forms of words to their lemma (base form),
# which aids in reducing the complexity of the text data and enhancing the effectiveness of subsequent analyses.

# Using the 'spacy_lemmatizer' function, we apply lemmatization to the 'content_clean_no_stopwords_lower' Series.
# This Series contains the content of messages that have been preprocessed to remove stopwords and convert to lowercase.
# Lemmatization here will help in consolidating different forms of a word into a single term, ensuring uniformity across the textual data.

# The result is stored in 'content_clean_lemmatized', a new Series that includes the lemmatized content.
# This Series is particularly useful for natural language processing tasks where consistent and simplified text is beneficial,
# such as sentiment analysis, topic modeling, or building language models that require a normalized form of input data.

content_clean_lemmatized = content_clean_no_stopwords_lower.apply(spacy_lemmatizer)


With our lemmatizing done, let's check our histograms again:

In [None]:
# Visualizing the most frequent words in lemmatized text datasets
# These visualizations will display the top 30 most frequent words in each segment of the dataset after lemmatization,
# providing insights into the core vocabulary and prominent themes within each type of text.

# Plotting the most frequent words in the lemmatized forum text
# The 'text_clean_lemmatized' contains the combined titles and content of the forum, processed to remove stopwords, 
# converted to lowercase, and lemmatized to their base forms. This histogram will help identify the primary concepts
# and terms that are prevalent across the entire forum.
plot_histogram_word(text_clean_lemmatized, 30, 'forum')

# Plotting the most frequent words in the lemmatized titles
# The 'titles_clean_lemmatized' contains the titles of posts in the forum, which have been cleaned and lemmatized.
# Visualizing the most frequent words in titles can reveal the focal points or topics that frequently appear at the forefront
# of discussions or articles, aiding in understanding how content is framed or highlighted.
plot_histogram_word(titles_clean_lemmatized, 30, 'titles')

# Plotting the most frequent words in the lemmatized content of messages
# The 'content_clean_lemmatized' contains the main body of messages in the forum, thoroughly processed to emphasize
# meaningful vocabulary through lemmatization. This histogram provides a deeper look into the substance of discussions,
# showing the main subjects and terms used within the messages themselves.
plot_histogram_word(content_clean_lemmatized, 30, 'messages')


Let's also check the bigrams and see if we get anything of interest.

In [None]:
# Importing necessary libraries
from collections import Counter

import nltk
import plotly.graph_objects as go
from nltk import word_tokenize
from nltk.util import ngrams

# Downloading the 'punkt' tokenizer models for text tokenization
# (newer NLTK releases may ask for the 'punkt_tab' resource instead)
nltk.download('punkt')

# Defining a function to generate n-grams from text
def generate_ngrams(text, n, lowercase=False):
    """
    Generates n-grams from a given text.
    
    Parameters:
    - text (str): The text from which to generate n-grams.
    - n (int): The number of elements in each n-gram.
    - lowercase (bool): Whether to convert the text to lowercase before generating n-grams.
    
    Returns:
    - list: A list of n-grams represented as strings.
    
    This function tokenizes the input text using NLTK's word_tokenize method, then forms n-grams of the specified size.
    If 'lowercase' is True, it converts the text to lowercase to ensure case uniformity across all tokens.
    """
    if lowercase:
        text = text.lower()

    # Tokenize the text and generate n-grams
    n_grams = ngrams(word_tokenize(text), n)
    # Join the tokens in each n-gram and return the list of n-grams
    return [' '.join(grams) for grams in n_grams]

# Defining a function to plot the frequency of n-grams
def plot_n_grams(dataset, n_most_common=30, n=2, title='text'):
    """
    Plots a histogram of the most common n-grams in a dataset.
    
    Parameters:
    - dataset (Series): A pandas Series where each entry is a text document.
    - n_most_common (int): The number of most common n-grams to display.
    - n (int): The n-gram size.
    - title (str): The title for the plot, which describes the dataset.

    This function processes each text entry in the dataset to extract n-grams and count their occurrences.
    It then uses Plotly to create a bar chart visualizing the frequencies of the most common n-grams.
    """
    n_grams_counter = Counter()

    # Process each text entry to count n-grams
    for text in dataset.values:
        # Update the counter with n-grams from this text
        n_grams_counter.update(generate_ngrams(text, n, lowercase=True))
    
    # Select the top 'n_most_common' frequent n-grams
    n_grams_counter = dict(n_grams_counter.most_common(n_most_common))

    # Create a bar chart using Plotly
    fig = go.Figure([go.Bar(x=list(n_grams_counter.keys()), y=list(n_grams_counter.values()))])
    fig.update_layout(title_text=f'Top {n_most_common} most frequent {n}-grams in the {title}')
    fig.show()

# Example: This function can be used to analyze how often certain phrases or sets of words appear together in a dataset,
# which is particularly useful for understanding language patterns or identifying common phrases within a body of text.


[nltk_data] Downloading package punkt to
[nltk_data]     /home/happyholand/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
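
For readers who want to see what `generate_ngrams` returns without installing NLTK, here is a plain-Python sketch of the same sliding-window idea. The helper `simple_ngrams` is hypothetical, and whitespace splitting stands in for `word_tokenize`, which is an assumption that only holds for text already stripped of punctuation.

```python
def simple_ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "trabalho de processamento de linguagem natural".split()
print(simple_ngrams(tokens, 2))
```

A text with fewer than `n` tokens simply yields an empty list, so very short messages contribute nothing to the bigram counts.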


In [None]:
# Using the plot_n_grams function to visualize the most common n-grams in different segments of the dataset
# These visualizations will help to understand the linguistic structures and common phrase patterns within each segment.

# Plotting n-grams for the lemmatized forum text
# Here, we are interested in the top 30 most common bigrams (n=2) in the lemmatized forum text.
# The title 'forum' specifies that this visualization pertains to the combined textual content of the forum.
plot_n_grams(text_clean_lemmatized, n_most_common=30, n=2, title='forum')

# Plotting n-grams for the lemmatized titles
# Similarly, this call focuses on the top 30 most common bigrams in the lemmatized titles.
# The title 'titles' helps differentiate this plot from others, focusing on the structural and thematic patterns in the titles.
plot_n_grams(titles_clean_lemmatized, n_most_common=30, n=2, title='titles')

# Plotting n-grams for the lemmatized content of messages
# This visualization targets the top 30 most common bigrams found in the content of the messages.
# The title 'messages' indicates that the analysis is specific to the body of the messages, offering insights into regular conversational or informational structures.
plot_n_grams(content_clean_lemmatized, n_most_common=30, n=2, title='messages')


It appears that while lemmatizing does get rid of some unnecessary data, it also interferes with some relevant information (such as transforming 'dado', which means data, into 'dar', which means to give).
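
One possible way to limit this kind of collision, shown here only as a sketch, is to keep a small set of protected domain terms that bypass lemmatization. Everything in the example is hypothetical: `toy_lemmas` is a hand-written lookup standing in for a real lemmatizer's output, and `protected` is an assumed list of terms to keep.

```python
# Hypothetical lemma lookup standing in for a real lemmatizer's output.
toy_lemmas = {
    "dados": "dar",
    "dado": "dar",
    "disciplinas": "disciplina",
    "interessado": "interessar",
}

# Domain terms we want to keep untouched (an assumption for this example).
protected = {"dado", "dados"}

def lemmatize_tokens(tokens, lemma_of, keep):
    # Leave protected tokens as they are; map the rest to their lemma when one is known.
    return [tok if tok in keep else lemma_of.get(tok, tok) for tok in tokens]

tokens = "dados disciplinas interessado".split()

# Without protection, 'dados' collapses into the verb 'dar'.
print(lemmatize_tokens(tokens, toy_lemmas, set()))

# With protection, 'dados' is preserved while the other words are still normalized.
print(lemmatize_tokens(tokens, toy_lemmas, protected))
```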

Let's filter out words with fewer than three characters from our non-lemmatized datasets and compare the results.

In [None]:
def remove_less_than_three(text):
    """
    Removes words from the provided text that are less than three characters long.

    This function is designed to refine the text data by excluding short words, which are often less meaningful in text analysis contexts.
    Removing these words can help focus on more substantial and relevant terms when performing text mining or natural language processing tasks.

    Parameters:
    - text (str): The text from which short words will be removed.

    Returns:
    - str: A modified version of the text with all words less than three characters removed.

    Processing Steps:
    1. The text is split into individual words based on spaces.
    2. A list comprehension is used to filter out any words that are less than three characters in length.
    3. The remaining words are then joined back together into a single string with spaces separating each word.

    Example:
    >>> remove_less_than_three("An apple a day keeps the doctor away")
    'apple day keeps the doctor away'
    """
    # Splitting the text into words
    tokens = text.split()

    # Filtering out words that are less than three characters long
    tokens = [token for token in tokens if len(token) > 2]

    # Joining the filtered words back into a single string
    return ' '.join(tokens)


In [None]:
# Drop words shorter than three characters from each stopword-free, lowercased dataset.
# Short tokens that survive stopword removal rarely carry meaning on their own.

# Forum text
text_clean_reduce = text_clean_no_stopwords_lower.apply(remove_less_than_three)

# Titles
titles_clean_reduce = titles_clean_no_stopwords_lower.apply(remove_less_than_three)

# Message content
content_clean_reduce = content_clean_no_stopwords_lower.apply(remove_less_than_three)

# Each *_reduce dataset now keeps only words with three or more characters.


In [None]:
# Plot the 30 most frequent words in each reduced dataset
# (words with fewer than three characters have already been removed).

# Forum text: the main themes discussed across the forum
plot_histogram_word(text_clean_reduce, 30, 'forum')

# Titles: the keywords used to frame topics
plot_histogram_word(titles_clean_reduce, 30, 'titles')

# Message content: the main terms used within the messages themselves
plot_histogram_word(content_clean_reduce, 30, 'messages')


In [None]:
# Plot the 30 most frequent bigrams (n-grams of size 2) in each reduced dataset.
# Bigrams expose recurring two-word phrases that single-word counts cannot show.

# Forum text: recurring phrases across the forum
plot_n_grams(text_clean_reduce, n_most_common=30, n=2, title='forum')

# Titles: recurring phrases in topic titles
plot_n_grams(titles_clean_reduce, n_most_common=30, n=2, title='titles')

# Message content: recurring phrases within the message bodies
plot_n_grams(content_clean_reduce, n_most_common=30, n=2, title='messages')


Well, that's certainly better. It seems a certain subject might be frequently mentioned in our forum. Let's check our trigrams to confirm it:

In [None]:
# Plot the 30 most frequent trigrams (n-grams of size 3) in each reduced dataset.
# Trigrams capture longer phrases, such as full subject names.

# Forum text: recurring three-word phrases across the forum
plot_n_grams(text_clean_reduce, n_most_common=30, n=3, title='forum')

# Titles: recurring three-word phrases in topic titles
plot_n_grams(titles_clean_reduce, n_most_common=30, n=3, title='titles')

# Message content: recurring three-word phrases within the message bodies
plot_n_grams(content_clean_reduce, n_most_common=30, n=3, title='messages')
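
`plot_n_grams` is defined elsewhere in the notebook and is not reproduced here. For readers who want to see what such a count involves, the following is a minimal standalone sketch using only the standard library. The helper name `count_n_grams` and the sample messages are assumptions made for illustration, and this is not necessarily how `plot_n_grams` is implemented.

```python
from collections import Counter


def count_n_grams(texts, n=3, n_most_common=30):
    """Count word n-grams across an iterable of whitespace-tokenized strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Zipping n shifted views of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(n_most_common)


# Made-up sample messages (not from the real dataset)
samples = [
    "fundamentos matemáticos computação prova",
    "dúvida fundamentos matemáticos computação",
]

top_trigrams = count_n_grams(samples, n=3, n_most_common=1)
print(top_trigrams)
```

In this sketch n-grams are built per message, so a phrase never spans the boundary between two messages.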


Yes, it is indeed cited very frequently (even if most topics don't seem to be titled after it). Another thing of note in this dataset is the name 'helena velcic maziviero', which is showing up as a frequent trigram. Sure enough, this person seems to have had, at some point, a habit of signing off their messages with their name, and they have posted a lot on this forum.

All in all, the forum seems to have mostly been used as expected. Discussions of subjects, hours, the course itself, and internship and work opportunities seem to be the most frequent topics in it. In particular, the most frequent subject is "Fundamentos Matemáticos da Computação".
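
To gauge how much a signature like this inflates the n-gram counts, one option is to count how many messages contain the phrase at all. The sketch below is a hedged illustration: `phrase_message_share` is a hypothetical helper, and the sample messages and the placeholder signature are made up rather than taken from the dataset.

```python
def phrase_message_share(messages, phrase):
    """Return (number of messages containing phrase, share of all messages)."""
    messages = list(messages)
    # Plain substring match on the already cleaned, lowercased text
    hits = sum(1 for message in messages if phrase in message)
    share = hits / len(messages) if messages else 0.0
    return hits, share


# Made-up sample; 'fulana silva souza' is a placeholder signature, not a real name
samples = [
    "prova adiada fulana silva souza",
    "alguém tem lista exercícios",
    "aula cancelada fulana silva souza",
    "vaga estágio aberta",
]

hits, share = phrase_message_share(samples, "fulana silva souza")
print(hits, share)
```

A high share concentrated in one author's posts would suggest the trigram reflects a signature rather than a topic of discussion.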