# Visualization

When we work with images, numerical data, categorical variables, or time series, it is easy to think of visualizations that show the distribution of the data, summary statistics, and so on.

With text it may be less intuitive. Or is it?

Below are some examples of data visualization for NLP work.

# Word frequency

In [None]:
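#!pip install -r requirements.txt
# `unzip` is a system command-line tool, not a Python package, so pip cannot provide it.
# It is usually preinstalled; on Debian/Ubuntu it can be installed with:
#!apt-get install -y unzip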

In [None]:
!unzip inaugural.zip

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

In [None]:
# Preface of the NLTK book
text = (
    'This is a book about Natural Language Processing. By "natural language" we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing — or NLP for short — in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting word frequencies to compare different writing styles. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them. Technologies based on NLP are becoming increasingly widespread. For example, phones and handheld computers support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish; text analysis enables us to detect sentiment in tweets and blogs. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society. This book provides a highly accessible introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics, or as a supplement to courses in artificial intelligence, text mining, or corpus linguistics. The book is intensely practical, containing hundreds of fully-worked examples and graded exercises. The book is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). '
    'NLTK includes extensive software, data, and documentation, all freely downloadable from http://nltk.org/. Distributions are provided for Windows, Macintosh and Unix platforms. We strongly encourage you to download Python and NLTK, and try out the examples and exercises along the way.'
)

In [None]:
words_nltk = text.lower().split()

In [None]:
words_nltk[:10]

In [None]:
wf = Counter(words_nltk)

In [None]:
wf_most_common = wf.most_common(10)

In [None]:
wf_most_common

In [None]:
words = [w[0] for w in wf_most_common]
freqs = [w[1] for w in wf_most_common]

In [None]:
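# Sort ascending by frequency: barh draws the first item at the bottom,
# so the most frequent word ends up at the top of the chart
freqs, words = zip(*sorted(zip(freqs, words)))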

In [None]:
plt.barh(words, freqs)
plt.show()
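
Note that `text.lower().split()` splits on whitespace only, so punctuation stays attached to the tokens and `'language.'` is counted separately from `'language'`. A minimal sketch of one way around this, assuming a simple regular-expression tokenizer is good enough for this kind of text; the `tokenize` helper and the `sample` string are illustrative and not part of the original notebook:

```python
import re
from collections import Counter

def tokenize(raw_text):
    """Lowercase the text and keep runs of letters, digits and apostrophes (illustrative helper)."""
    return re.findall(r"[a-z0-9']+", raw_text.lower())

# Made-up sample sentence, loosely based on the preface above
sample = 'Natural language. By "natural language" we mean a language used by humans.'

naive_counts = Counter(sample.lower().split())  # punctuation stays glued to words
clean_counts = Counter(tokenize(sample))        # punctuation is dropped

print(naive_counts['language'], clean_counts['language'])
```

The same `Counter(...).most_common(10)` call used above works unchanged on the cleaned tokens.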

# ... or of n-grams

In [None]:
!pip install nltk

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer
from nltk import ngrams
from nltk.probability import FreqDist

In [None]:
bigrams_ = list(ngrams(words_nltk, 2))
trigrams_ = list(ngrams(words_nltk, 3))

In [None]:
bigrams_[:10]

In [None]:
trigrams_[:10]

In [None]:
bg_freq = FreqDist(bigrams_)
tg_freq = FreqDist(trigrams_)

In [None]:
bg_freq.most_common(10)

In [None]:
tg_freq.most_common(10)

In [None]:
bg_freq_most_common = bg_freq.most_common(10)
bgs_ = [str(bg[0]) for bg in bg_freq_most_common]
bgs_f_ = [bg[1] for bg in bg_freq_most_common]

tg_freq_most_common = tg_freq.most_common(10)
tgs_ = [str(tg[0]) for tg in tg_freq_most_common]
tgs_f_ = [tg[1] for tg in tg_freq_most_common]

In [None]:
bgs_f_, bgs_ = zip(*sorted(zip(bgs_f_, bgs_)))
tgs_f_, tgs_ = zip(*sorted(zip(tgs_f_, tgs_)))

In [None]:
plt.barh(bgs_, bgs_f_)
plt.title('Bigram frequencies')
plt.show()

In [None]:
plt.barh(tgs_, tgs_f_)
plt.title('Trigram frequencies')
plt.show()
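
For intuition about what `ngrams` produces: an n-gram is simply a window of n consecutive tokens. A minimal pure-Python sketch of the same idea, using a made-up token list; `sliding_ngrams` is an illustrative name, not an NLTK function:

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (illustrative helper)."""
    # zip stops at the shortest shifted copy, which yields exactly len(tokens) - n + 1 windows
    return list(zip(*(tokens[i:] for i in range(n))))

demo_tokens = 'natural language processing of natural language'.split()

demo_bigram_counts = Counter(sliding_ngrams(demo_tokens, 2))
print(demo_bigram_counts.most_common(1))
```

Counting the resulting tuples with `Counter` plays the same role as `FreqDist` in the cells above.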

In [None]:
!pip install stop-words


from stop_words import get_stop_words

sw = get_stop_words(language='en')

# Keep only the words that are NOT stop words
new_text = [word for word in text.lower().split() if word not in sw]
print(new_text)

In [None]:
new_bigrams = list(ngrams(new_text, 2))
new_bg_freq = FreqDist(new_bigrams)
new_bg_freq.most_common(10)

# Word cloud


In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [None]:
print(len(text))

In [None]:
print(text[:500])

In [None]:
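def plot_word_cloud(words):
    """Draw a word cloud from a list of words."""
    # WordCloud.generate expects a single string, so join the words back together
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(' '.join(words))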
    plt.figure(figsize=(12,6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

In [None]:
words_list = text.lower().strip().split()
plot_word_cloud(words_list)

# Lexical dispersion



In [None]:
import os
import glob
import matplotlib.pyplot as plt
from nltk.draw.dispersion import dispersion_plot

In [None]:
inaugural_folder = './inaugural'
# Collect the .txt files in sorted order
inaugural_paths = [os.path.join(inaugural_folder, file) for file in sorted(os.listdir(inaugural_folder)) if file.endswith('.txt')]

In [None]:
inaugural_paths

In [None]:
texts = list()
for file_path in inaugural_paths:
    with open(file_path, mode='r', encoding='latin-1') as f:
        texts.append(f.read())

In [None]:
print(texts[0][:1000])

In [None]:
words = [word.lower() for text in texts for word in text.split()]

In [None]:
target_words = [
    'democracy',
    'citizens',
    'freedom',
    'duties',
    'america'
]

In [None]:
plt.figure(figsize=(12, 9))
plt.style.use('default')
dispersion_plot(words, target_words, ignore_case=True)
plt.show()
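
A dispersion plot draws one mark per occurrence of each target word, positioned by its token offset in the corpus, so clusters of marks show where a word is concentrated. A minimal sketch of that underlying computation, assuming whitespace tokens and a made-up sentence; `word_offsets` is a hypothetical helper, not part of NLTK:

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs (illustrative helper)."""
    wanted = {t.lower() for t in targets}
    offsets = {t: [] for t in wanted}
    for position, token in enumerate(tokens):
        token = token.lower()  # case-insensitive matching
        if token in wanted:
            offsets[token].append(position)
    return offsets

demo_tokens = 'Freedom and democracy matter to citizens who value freedom'.split()
demo_offsets = word_offsets(demo_tokens, ['freedom', 'citizens'])
print(demo_offsets)
```

Each list of positions corresponds to one row of the plot: the x-axis is the offset and the y-axis is the word.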

## Example: Google Ngram Viewer

An online search tool that plots the yearly frequency of different n-grams found in the corpora Google has available for many languages.

<img src=https://images2.minutemediacdn.com/image/upload/c_fit,f_auto,fl_lossy,q_auto,w_728/v1555921104/shape/mentalfloss/screen_shot_2014-11-12_at_1.43.16_pm.png>

Link: https://books.google.com/ngrams#

Some curiosities:
- Article: _Experiments in Ngram Art_, [link](https://www.mentalfloss.com/article/60033/experiments-ngram-art)
- TED Talk: _What we learned from 5 million books_, [link](https://www.ted.com/talks/jean_baptiste_michel_erez_lieberman_aiden_what_we_learned_from_5_million_books/up-next?language=en)

# Zipf's law

Formulated in the 1940s by the linguist George Kingsley Zipf, it states that, for a given language, the frequency of the words in its vocabulary follows a distribution that can be approximated by:

<img src=https://wikimedia.org/api/rest_v1/media/math/render/svg/9fa76f350fe93da686890acfb9b8e3b1151b85bc>

Log-log plot of rank versus frequency for the 10 million most frequent words (measured on Wikipedia articles) in several languages:
<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/d/da/Zipf_30wiki_es_labels.png/1200px-Zipf_30wiki_es_labels.png>

In [None]:
import matplotlib.pyplot as plt
from nltk.probability import FreqDist

In [None]:
fd = FreqDist(words)

In [None]:
fd

In [None]:
# Sort words by descending frequency (dicts preserve insertion order)
fd = {k: v for k, v in sorted(fd.items(), key=lambda item: item[1], reverse=True)}

In [None]:
ranks = list()
freqs = list()

for rank, word in enumerate(fd):
    ranks.append(rank+1)
    freqs.append(fd[word])

In [None]:
plt.loglog(ranks, freqs)
plt.xlabel('Rank')
plt.ylabel('Freq')
plt.title('Log-Log rank-freq chart')
plt.show()
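
If frequency is roughly proportional to 1/rank^s, the log-log curve is close to a straight line with slope -s. A rough sketch of estimating that slope with ordinary least squares, run here on synthetic data rather than on the corpus above; `fit_loglog_slope`, `demo_ranks` and `demo_freqs` are illustrative names, and real corpora typically bend away from a straight line at the highest and lowest ranks:

```python
import math

def fit_loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank) (illustrative helper)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic frequencies that follow freq = 1000 / rank exactly (made-up data)
demo_ranks = list(range(1, 101))
demo_freqs = [1000 / r for r in demo_ranks]

slope = fit_loglog_slope(demo_ranks, demo_freqs)
print(round(slope, 3))  # close to -1 for this idealized data
```

Passing the notebook's own `ranks` and `freqs` lists to the same function would give an estimate for the inaugural speeches.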