## **UFABC** 

### Master's in Information Engineering

### Topic Modeling for the Analysis of Signs of Depression in Portuguese-Language Reddit Posts

* Student: Márcio Valverde
* Advisor: Prof. Andre Takahata
* Special participation: Miro, Neuropsychologist (Neuropax)


### An Approach with **Non-Negative Matrix Factorization (NMF)**

ufabc-topic-modeling-1-NMF-Reddit Posts vr.2.0-Sklearn.ipynb
    
*(sklearn)* https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

<img src="../image/Topics.png" alt="image" width="600"/>

#### **Pipeline**
````
    1. Reading Data
    2. Pre-Processing
       2.1 Lemmatization
       2.2 Stemming
       2.3 Stop words
       2.4 Tokenization
       2.5 Data Cleaning
       2.6 Removal of empty or duplicated rows
       2.7 Extraction of the tags (annotations) of each token in the analyzed document
       2.8 Processing of bigrams and trigrams
    3. EDA - Exploratory Data Analysis (Text Data)
       3.1 Details of the resulting dataframe
       3.2 Frequency of words
       3.3 Most frequent words
       3.4 Word Cloud
       3.5 Unique words
       3.6 Descriptive statistics of the words
       3.7 Histogram
       3.8 Boxplot
       3.9 Outliers
    4. Processing
       4.1 Automatically select the Best Number of Topics
       4.2 Compute Coherence Score
       4.3 TF-IDF
       4.4 Hyper-parameters for NMF Modelling
       4.5 NMF Model Training
    5. Results
       5.1 Words with high values of component for each topic
       5.2 List of Topics
       5.3 Manual Topic Labeling
    6. Visualization of Results
       6.1 List of documents
       6.2 Find the most representative documents for each topic
       6.3 Plotting the graphs for each Topic
       6.4 Word cloud
    7. Analysis of the results
       7.1 Residue analysis
       7.2 t-Distributed Stochastic Neighbor Embedding - Visualization
       7.3 Residue analysis
    8. Final Result list
    9. Predict topic for new posts
    10. Conclusion
````


### **1. Setup**

In [None]:
from time import time
current_time = time()
print("Starting time: %0.3fs." % (time() - current_time))

import pandas as pd
print('pandas: {}'.format(pd.__version__))

import numpy as np
print('numpy: {}'.format(np.__version__))

import matplotlib.pyplot as plt
print('matplotlib: {}'.format(plt.get_backend()))

import seaborn as sns
print('seaborn: {}'.format(sns.__version__))
sns.set_style('darkgrid')

# modeling
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction import text
from sklearn.decomposition import NMF

print('sklearn: {}'.format(sklearn.__version__))


import gensim
print('gensim: {}'.format(gensim.__version__))
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models.nmf import Nmf

from collections import Counter
from operator import itemgetter

# text processing
from wordcloud import WordCloud
import nltk
print('nltk: {}'.format(nltk.__version__))
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer, RegexpTokenizer, word_tokenize
from nltk import pos_tag
from nltk import bigrams, trigrams
nltk.download('stopwords')
nltk.download('punkt')

# text cleaning
import re
import string
from bs4 import BeautifulSoup 
import unidecode

# spacy - lemmatization
import spacy 
print('spacy: {}'.format(spacy.__version__))
nlp = spacy.load("pt_core_news_lg")

import joblib

'''  
%pip install spacy
pip install -U spacy
python -m spacy download pt_core_news_lg
'''
#%pip install wordcloud

# My libs
import my_data_clean
import my_save_file
import my_date_time
import my_nltk_commons

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#### 1.1. Loading dataset

We use posts extracted from the social media platform Reddit. The posts were selected from specific subreddits and by keywords that may indicate symptoms related to depression.

We start with a brief exploratory data analysis to become familiar with the dataset.

The dataset contains n posts; the call to `df.shape` below reports the exact number of rows and columns.
 

In [None]:
path = '../data/corpus/'
input_file = 'pre-corpus.csv'
df = pd.read_csv(path+input_file)

df.head(2)

In [None]:
df.shape
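Beyond the shape, a quick check for missing texts and duplicated rows helps anticipate step 2.6 of the pipeline. The sketch below runs on a small toy frame rather than on the real CSV; its values are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the real CSV (values are illustrative)
toy = pd.DataFrame({
    "post_id": ["a1", "a2", "a3", "a3"],
    "texto": ["primeiro post", None, "terceiro post", "terceiro post"],
})

n_rows, n_cols = toy.shape                     # same call as df.shape
n_missing = int(toy["texto"].isna().sum())     # rows with no text
n_duplicates = int(toy.duplicated().sum())     # fully repeated rows

print(f"{n_rows} rows, {n_cols} columns, "
      f"{n_missing} missing texts, {n_duplicates} duplicated rows")
```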

The dataset has the following structure:

- post_id: Unique identifier of the post.
- author: Name or identifier of the user who created the post.
- subreddit: The subreddit (community) in which the post was published.
- created_utc: Creation date and time of the post, in UTC (Coordinated Universal Time).
- url: The URL associated with the post, if applicable.
- self_text: The body of the post, as written by the author.
- title: The post's title.
- texto: The text field that this notebook cleans and processes (selected below as `coluna`).
- keywords: Relevant keywords or terms found in the post.
- link_flair_text: The post's flair, a label describing its category within the subreddit.
- score: The post's score, derived from user votes (upvotes minus downvotes).
- num_comments: The number of comments on the post.
- upvote_ratio: The ratio of upvotes to the total number of votes.
- timestamp: A more human-readable representation of the creation date and time.
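The relation between `created_utc` and a human-readable timestamp can be sketched as follows. This assumes that `created_utc` stores Unix epoch seconds, as the Reddit API usually returns; the actual format of the column in this dataset is not confirmed here. The toy values and the `created_dt` column name are illustrative.

```python
import pandas as pd

# Toy rows; created_utc is ASSUMED to be Unix epoch seconds
toy = pd.DataFrame({
    "created_utc": [1672531200, 1675209600],
    "upvote_ratio": [0.95, 0.5],
    "score": [10, 0],
})

# Convert epoch seconds to timezone-aware datetimes in UTC
toy["created_dt"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True)

print(toy["created_dt"].dt.strftime("%Y-%m-%d %H:%M").tolist())
```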

In [None]:
df.columns.tolist()

In [None]:
# Column containing the text to be CLEANED AND PROCESSED
coluna = 'texto'

# Column that will hold the final corpus (processed texts)
coluna_corpus = 'corpus_text'
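The two column names are meant to be used as source and destination of the pre-processing. A minimal sketch of that pattern follows; `clean_text` is a hypothetical stand-in for the notebook's own cleaning routines (for example those in `my_data_clean`, whose interface is not shown here), and the toy texts are illustrative.

```python
import re
import pandas as pd

coluna = "texto"
coluna_corpus = "corpus_text"

def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop URLs and punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation, keep accented letters
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    coluna: ["Não consigo DORMIR!!! https://example.com", "  Me sinto   cansado. "],
})

# Read from the raw column, write the processed text to the corpus column
toy[coluna_corpus] = toy[coluna].fillna("").map(clean_text)
print(toy[coluna_corpus].tolist())
```

Keeping the raw column untouched and writing the result to a separate column makes it possible to compare the original and processed texts at any stage.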