# Text Data Exploration

Goal: use counting techniques to visualize and interpret text data.
Part of this material is based on https://github.com/JasonKessler/scattertext#emoji-analysis

"A tool for finding distinguishing terms in corpora, and presenting them in an interactive, HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points."

In [1]:
%matplotlib inline

import os, pkgutil, json, urllib
import scattertext as st
import re, io
import pandas               as pd
import numpy                as np
from   scipy.stats          import rankdata, hmean, norm
from   pprint               import pprint

import spacy

from   urllib.request       import urlopen
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from   IPython.display      import IFrame, display, HTML

display(HTML("<style>.container { width:98% !important; }</style>"))

In [2]:
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only works in spaCy 2.x

### Dataset: Republicans and Democrats

Speeches by Democrats and Republicans from 2012 (an election year).

In [3]:
convention_df = st.SampleCorpora.ConventionData2012.get_data()

In [4]:
convention_df.head()

| | party | text | speaker |
|---|---|---|---|
| 0 | democrat | Thank you. Thank you. Thank you. Thank you so ... | BARACK OBAMA |
| 1 | democrat | Thank you so much. Tonight, I am so thrilled a... | MICHELLE OBAMA |
| 2 | democrat | Thank you. It is a singular honor to be here t... | RICHARD DURBIN |
| 3 | democrat | Hey, Delaware. \nAnd my favorite Democrat, Jil... | JOSEPH BIDEN |
| 4 | democrat | Hello. \nThank you, Angie. I'm so proud of how... | JILL BIDEN |


#### Some statistics

In [5]:
print("\nCounting documents\n")
print(convention_df.groupby('party')['text'].count())

print("\nCounting words")
# Split each speech on whitespace and sum the token counts per party
convention_df.groupby('party').apply(lambda group: group.text.apply(lambda doc: len(doc.split())).sum())


Counting documents

party
democrat      123
republican     66
Name: text, dtype: int64

Counting words


party
democrat      76843
republican    58144
dtype: int64
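
The totals above come from splitting each speech on whitespace. The same counting idea can be sketched with the standard library alone. The `(party, text)` pairs below are invented for illustration, and the lowercasing step is an assumption (the cell above does not lowercase):

```python
from collections import Counter, defaultdict

# Toy stand-in for the convention data: (party, text) pairs invented for illustration.
docs = [
    ("democrat", "Thank you so much"),
    ("democrat", "Thank you Delaware"),
    ("republican", "Good afternoon and thank you"),
]

# Number of documents per party.
doc_counts = Counter(party for party, _ in docs)

word_counts = defaultdict(int)      # total tokens per party
term_counts = defaultdict(Counter)  # per-party term frequencies
for party, text in docs:
    # Whitespace tokenization; lowercasing is an added assumption so "Thank" and "thank" merge.
    tokens = text.lower().split()
    word_counts[party] += len(tokens)
    term_counts[party].update(tokens)

print(dict(doc_counts))
print(dict(word_counts))
print(term_counts["democrat"].most_common(2))
```

Per-party term counters like `term_counts` are the raw material that the visualizations later in this notebook compare across the two parties.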

#### Applying NLP with spaCy

In [6]:
convention_df['parsed'] = convention_df.text.apply(nlp)

In [7]:
convention_df.head()

| | party | text | speaker | parsed |
|---|---|---|---|---|
| 0 | democrat | Thank you. Thank you. Thank you. Thank you so ... | BARACK OBAMA | (Thank, you, ., Thank, you, ., Thank, you, ., ... |
| 1 | democrat | Thank you so much. Tonight, I am so thrilled a... | MICHELLE OBAMA | (Thank, you, so, much, ., Tonight, ,, I, am, s... |
| 2 | democrat | Thank you. It is a singular honor to be here t... | RICHARD DURBIN | (Thank, you, ., It, is, a, singular, honor, to... |
| 3 | democrat | Hey, Delaware. \nAnd my favorite Democrat, Jil... | JOSEPH BIDEN | (Hey, ,, Delaware, ., \n, And, my, favorite, D... |
| 4 | democrat | Hello. \nThank you, Angie. I'm so proud of how... | JILL BIDEN | (Hello, ., \n, Thank, you, ,, Angie, ., I, 'm,... |


In [8]:
convention_df['parsed']  # append [0][0].pos_ to inspect the part-of-speech tag of the first token of the first speech

0      (Thank, you, ., Thank, you, ., Thank, you, ., ...
1      (Thank, you, so, much, ., Tonight, ,, I, am, s...
2      (Thank, you, ., It, is, a, singular, honor, to...
3      (Hey, ,, Delaware, ., \n, And, my, favorite, D...
4      (Hello, ., \n, Thank, you, ,, Angie, ., I, 'm,...
                             ...                        
184    (As, the, elected, leader, of, 250,000, Colleg...
185    (Good, afternoon, ., I, 'm, Pete, Sessions, ,,...
186    (To, Chairman, Priebus, and, to, my, fellow, A...
187    (\n, Absolutely, ., Thank, you, ,, Mr, ., Chai...
188    (I, am, thrilled, to, add, Utah, 's, voice, in...
Name: parsed, Length: 189, dtype: object

### Building a Corpus

In [9]:

corpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()

### Visualizing terms by a frequency metric

##### Terms (unigrams and bigrams) most used by Republican and Democratic speakers: how often each term is used per 25k words, with an associated relevance score.

The same package also offers a phrase-based variant, using PhraseMachine: https://github.com/JasonKessler/scattertext#using-phrasemachine-to-find-phrases
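
The "per 25k words" frequency is a rescaling of raw counts, which makes categories with different total word counts comparable. A minimal sketch, assuming plain proportional scaling; the function name `per_25k` and the toy numbers are made up for illustration and are not part of scattertext:

```python
def per_25k(term_counts, total_words, scale=25_000):
    """Rescale raw term counts to occurrences per `scale` words."""
    return {term: count * scale / total_words
            for term, count in term_counts.items()}

# Toy numbers invented for illustration: 100 occurrences in a 50,000-word category.
toy_counts = {"jobs": 100, "middle class": 20}
toy_rates = per_25k(toy_counts, total_words=50_000)

for term, rate in toy_rates.items():
    print(f"{term}: {rate:.1f} per 25k words")
```

Because the Democratic and Republican word totals differ, comparing rescaled rates rather than raw counts avoids favoring the larger category.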

In [10]:
html = st.produce_scattertext_explorer(corpus,
          category          = 'democrat',
          category_name     = 'Democratic',
          not_category_name = 'Republican',
          width_in_pixels   = 800,
          height_in_pixels  = 500,
          metadata=convention_df['speaker'])

file_name = "Convention-Visualization.html"
# Write through a context manager so the file is closed before it is embedded
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 700, height=1000)

#### Corpus Characteristics

Often the terms of most interest are ones that are characteristic of the corpus as a whole. These are terms which occur frequently in all sets of documents being studied, but relatively infrequently in general language use.

We can produce a plot with a characteristic score on the x-axis and class-association scores on the y-axis using the function produce_characteristic_explorer.

Corpus characteristicness is the difference in dense term ranks between the words in all of the documents in the study and a general English-language frequency list. See this Talk on Term-Class Association Scores for a more thorough explanation.
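
To make the "difference in dense term ranks" concrete, here is a minimal sketch with invented frequencies. It illustrates the idea only: the helper names are hypothetical, restricting the comparison to shared terms is an assumption, and scattertext's own implementation may normalize or scale the ranks differently.

```python
def dense_ranks(freqs):
    """Dense rank per term: 1 for the lowest frequency, ties share a rank, no gaps."""
    distinct = sorted(set(freqs.values()))
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {term: rank_of[value] for term, value in freqs.items()}

def rank_difference(corpus_freqs, background_freqs):
    """Corpus rank minus background rank, over terms present in both lists (an assumption)."""
    shared = corpus_freqs.keys() & background_freqs.keys()
    corpus_ranks = dense_ranks({t: corpus_freqs[t] for t in shared})
    background_ranks = dense_ranks({t: background_freqs[t] for t in shared})
    return {t: corpus_ranks[t] - background_ranks[t] for t in shared}

# Toy frequencies invented for illustration.
corpus_freqs = {"the": 900, "america": 600, "of": 500}
background_freqs = {"the": 1000, "of": 600, "america": 5}

scores = rank_difference(corpus_freqs, background_freqs)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, scores[term])
```

A positive score marks a term that ranks higher in the studied corpus than in general usage, which is the sense of "characteristic" used here.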

In [11]:
import scattertext as st

corpus = (st.CorpusFromPandas(st.SampleCorpora.ConventionData2012.get_data(),
                              category_col='party',
                              text_col='text',
                              nlp=st.whitespace_nlp_with_sentences)
                              .build()
                              .get_unigram_corpus()
                              .compact(st.ClassPercentageCompactor(term_count=2,
                                       term_ranker=st.OncePerDocFrequencyRanker)))

html = st.produce_characteristic_explorer(
                                corpus,
                                category          = 'democrat',
                                category_name     = 'Democratic',
                                not_category_name = 'Republican',
                                width_in_pixels   = 800,
                                height_in_pixels  = 500,
                                metadata=corpus.get_df()['speaker']
)

file_name = 'demo_characteristic_chart.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 700, height=1000)

#### Visualizing Empath topics and categories

In order to visualize Empath (Fast et al., 2016) topics and categories instead of terms, we'll need to create a Corpus of extracted topics and categories rather than unigrams and bigrams. To do so, use the FeatsFromOnlyEmpath feature extractor. See the source code for examples of how to make your own.

When creating the visualization, pass the use_non_text_features=True argument into produce_scattertext_explorer. This will instruct it to use the labeled Empath topics and categories instead of looking for terms. Since the documents returned when a topic or category label is clicked will be in order of the document-level category-association strength, setting use_full_doc=True makes sense, unless you have enormous documents. Otherwise, the first 300 characters will be shown.
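
The idea of counting categories instead of terms can be sketched without Empath itself. The two-category lexicon and the function `category_counts` below are hypothetical stand-ins; Empath's real categories and word lists are far larger, and this is not how `FeatsFromOnlyEmpath` is implemented internally:

```python
from collections import Counter

# Hypothetical toy lexicon, invented for illustration.
TOY_LEXICON = {
    "work": {"job", "jobs", "work", "hire"},
    "family": {"mother", "father", "children", "family"},
}

def category_counts(text, lexicon=TOY_LEXICON):
    """Count how many tokens of `text` belong to each lexicon category."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")  # drop surrounding punctuation
        for category, words in lexicon.items():
            if word in words:
                counts[category] += 1
    return counts

example = category_counts("My mother worked two jobs so her children could find work.")
print(example)
```

Each document is then described by its category counts rather than by its words, which is why the explorer needs `use_non_text_features=True` to plot those labels.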

In [12]:
feat_builder  = st.FeatsFromOnlyEmpath()
empath_corpus = st.CorpusFromParsedDocuments(convention_df,
                                              category_col='party',
                                              feats_from_spacy_doc=feat_builder,
                                              parsed_col='text').build()

html = st.produce_scattertext_explorer(empath_corpus,
                                        category='democrat',
                                        category_name='Democratic',
                                        not_category_name='Republican',
                                        width_in_pixels=800,
                                        height_in_pixels=500,
                                        metadata=convention_df['speaker'],
                                        use_non_text_features=True,
                                        use_full_doc=True,
                                        topic_model_term_lists=feat_builder.get_top_model_term_lists())

file_name = "Convention-Visualization-Empath.html"
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 700, height=1000)

### Visualizing terms related to morality

Moral Foundations Theory proposes six psychological constructs as building blocks of moral thinking, as described in Graham et al. (2013). As listed on moralfoundations.org, these foundations are care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, sanctity/degradation, and liberty/oppression. Please see the site for a more in-depth discussion of these foundations.

Frimer et al. (2019) created the Moral Foundations Dictionary 2.0, or a lexicon of terms which invoke a moral foundation as a virtue (favorable toward the foundation) or a vice (in opposition to the foundation).

This dictionary can be used in the same way as the General Inquirer. In this example, we plot the Cohen's d scores of foundation-word counts relative to how frequently words involving those foundations were invoked.
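
For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below uses that textbook definition on invented per-document counts; scattertext's `CohensD` scorer may differ in details such as how counts are normalized:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Textbook Cohen's d: mean difference over the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Toy per-document counts of one foundation's words, invented for illustration.
democrat_counts = [2, 4, 6]
republican_counts = [1, 2, 3]

d = cohens_d(democrat_counts, republican_counts)
print(f"Cohen's d = {d:.3f}")
```

A positive value means the first group uses the foundation's words more on average, and swapping the groups flips the sign.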

We can first load the corpus as normal, and use st.FeatsFromMoralFoundationsDictionary() to extract features.

In [13]:
convention_df           = st.SampleCorpora.ConventionData2012.get_data()
moral_foundations_feats = st.FeatsFromMoralFoundationsDictionary()
corpus                  = st.CorpusFromPandas(convention_df,
                             category_col='party',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences,
                             feats_from_spacy_doc=moral_foundations_feats).build()


In [14]:
html = st.produce_frequency_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    metadata=convention_df['speaker'],
    use_non_text_features=True,
    use_full_doc=True,
    term_scorer=st.CohensD(corpus).use_metadata(),
    grey_threshold=0,
    width_in_pixels=800,
    height_in_pixels=500,
    topic_model_term_lists=moral_foundations_feats.get_top_model_term_lists(),                
    metadata_descriptions=moral_foundations_feats.get_definitions()
)

In [15]:
file_name = "Convention-Visualization-Morality.html"
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 1000, height=500)

### Visualizing terms in space

In [16]:
convention_df = st.SampleCorpora.ConventionData2012.get_data()
convention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)

corpus = (st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parse')
          .build().get_stoplisted_unigram_corpus())

html = st.produce_projection_explorer(corpus, category='democrat', category_name='Democratic',
  not_category_name='Republican', metadata=convention_df.speaker)



In [17]:
file_name = "Convention-Visualization-general.html"
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 1000, height=500)

## Advanced

This part requires training a model and an understanding of embeddings.
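
An embedding maps each word to a vector so that words used in similar contexts end up close together, and closeness is commonly measured with cosine similarity. A minimal sketch with 3-dimensional toy vectors invented for illustration (a trained word2vec model would supply real, much longer ones):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration; not produced by any trained model.
toy_vectors = {
    "jobs": [0.9, 0.1, 0.0],
    "work": [0.8, 0.2, 0.1],
    "freedom": [0.0, 0.2, 0.9],
}

target = "jobs"
# Rank the other words by similarity to the target, most similar first.
ranked = sorted(
    (word for word in toy_vectors if word != target),
    key=lambda word: cosine_similarity(toy_vectors[target], toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

Ranking words by similarity to a target term is the kind of comparison the word-similarity explorer in this section relies on.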

#### How do Democrats and Republicans use the word "jobs"?

In [18]:
from gensim.models import word2vec
from scattertext import SampleCorpora, word_similarity_explorer_gensim, Word2VecFromParsedCorpus
from scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments

convention_df           = SampleCorpora.ConventionData2012.get_data()
convention_df['parsed'] = convention_df.text.apply(nlp)
corpus                  = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()

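# Turn words into vectors.
# Note: these are gensim < 4 argument names; in gensim >= 4, `size` became `vector_size` and `iter` became `epochs`.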
model = word2vec.Word2Vec(size=300,
                          alpha=0.025,
                          window=5,
                          min_count=5,
                          max_vocab_size=None,
                          sample=0,
                          seed=1,
                          workers=1,
                          min_alpha=0.0001,
                          sg=1,
                          hs=1,
                          negative=0,
                          cbow_mean=0,
                          iter=1,
                          null_word=0,
                          trim_rule=None,
                          sorted_vocab=1)

html = word_similarity_explorer_gensim(corpus,
                                       category='democrat',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       target_term='jobs',
                                       minimum_term_frequency=5,
                                       pmi_threshold_coefficient=4,
                                       width_in_pixels=1000,
                                       metadata=convention_df['speaker'],
                                       word2vec=Word2VecFromParsedCorpus(corpus, model).train(),
                                       max_p_val=0.05,
                                       save_svg_button=True)

In [19]:
file_name = "Convention-Visualization-JOB.html"
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

IFrame(src=file_name, width = 1000, height=500)