# Brazilian Newspaper analysis

In this project, we'll use a dataset from a Brazilian Newspaper called "Folha de São Paulo".

We're going to use word embeddings, tensorboard and rnn's and search for political opinions and positions.

You can find the dataset at [kaggle](https://www.kaggle.com/marlesson/news-of-the-site-folhauol).

I want to find in this study case:

+ Political opinions
+ Check if this newspaper is impartial or biased

## Skip-gram model

Let's use a word embedding model to find the relationship between words in the articles. Our model will learn how one word is related to another word and we'll see this relationship in tensorboard and a T-SNE chart (to project our model in a 2D chart).

We have two options to use: CBOW (Continuous Bag-Of-Words) and Skip-gram.

In our case we'll use Skip-gram because it performs better than CBOW.

The models works like this:

![](assets/word2vec_architectures.png)

In CBOW we get some words around another word and try to predict the "middle" word.

In Skip-gram we do the opposite, we get one word and try to predict the words around it.

## Loading the data

After downloading the dataset, put it on a directory `data/` and let's load it using pandas.

**Using python 3.6 and tensorflow 1.3**

In [1]:
# Import dependencies

import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib
import os
import pickle
import random
import time
from collections import Counter

In [2]:
dataset = pd.read_csv('data/articles.csv')

dataset.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
1,"'Decidi ser escrava das mulheres que sofrem', ...","Para Oumou Sangaré, cantora e ativista malines...",2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
3,Filme 'Star Wars: Os Últimos Jedi' ganha trail...,A Disney divulgou na noite desta segunda-feira...,2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
4,CBSS inicia acordos com fintechs e quer 30% do...,"O CBSS, banco da holding Elopar dos sócios Bra...",2017-09-10,mercado,,http://www1.folha.uol.com.br/mercado/2017/10/1...


## Preprocessing the data

### Removing unnecessary articles

We are trying to find political opinions. So, let's take only the articles in category 'poder' (power).

In [4]:
political_dataset = dataset.loc[dataset.category == 'poder']

political_dataset.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
6,"Posso sair do Brasil quando e como quiser, diz...",O italiano Cesare Battisti disse nesta segunda...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
8,Supremo nega pedido para Senado analisar impea...,O STF (Supremo Tribunal Federal) negou na quin...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
10,"Dodge defende manter Joesley e Saud, da JBS, p...","A procuradora-geral da República, Raquel Dodge...",2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...


### Merging title and text
To maintain article titles and text related, let's merge then together and use this merged text as our inputs

In [5]:
# Merges the title and text with a separator (---)
merged_text = [str(title) + ' ---- ' + str(text) for title, text in zip(political_dataset.title, political_dataset.text)]

print(merged_text[0])

Lula diz que está 'lascado', mas que ainda tem força como cabo eleitoral ---- Com a possibilidade de uma condenação impedir sua candidatura em 2018, o ex-presidente Luiz Inácio Lula da Silva fez, nesta segunda (9), um discurso inflamado contra a Lava Jato, no qual disse saber que está "lascado", exigiu um pedido de desculpas do juiz Sergio Moro e afirmou que, mesmo fora da disputa pelo Planalto, será um cabo eleitoral expressivo para a sucessão de Michel Temer.  Segundo o petista, réu em sete ações penais, o objetivo de Moro é impedir sua candidatura no ano que vem, desidratando-o, inclusive, no apoio a um nome alternativo, como o do ex-prefeito de São Paulo Fernando Haddad (PT), caso ele não possa concorrer à Presidência.  "Eu sei que tô lascado, todo dia tem um processo. Eu não quero nem que Moro me absolva, eu só quero que ele peça desculpas", disse Lula durante um seminário sobre educação em Brasília. "Eles [investigadores] chegam a dizer: 'Ah, se o Lula não for candidato, ele não 

### Tokenizing punctuation
We need to tokenize all text punctuation, otherwise the network will see punctuated words differently (eg: hello != hello!)

In [6]:
def token_lookup():
    tokens = {
        '.'  : 'period',
        ','  : 'comma',
        '"'  : 'quote',
        '\'' : 'single-quote',
        ';'  : 'semicolon',
        ':'  : 'colon',
        '!'  : 'exclamation-mark',
        '?'  : 'question-mark',
        '('  : 'parentheses-left',
        ')'  : 'parentheses-right',
        '['  : 'brackets-left',
        ']'  : 'brackets-right',
        '{'  : 'braces-left',
        '}'  : 'braces-right',
        '_'  : 'underscore',
        '--' : 'dash',
        '\n' : 'return'
    }
    
    return {token: '||{0}||'.format(value) for token, value in tokens.items()}

token_dict = token_lookup()

tokenized_text = []

for text in merged_text:
    for key, token in token_dict.items():
        text = text.replace(key, ' {} '.format(token))
    
    tokenized_text.append(text)

print(tokenized_text[0])

Lula diz que está  ||single-quote|| lascado ||single-quote||  ||comma||  mas que ainda tem força como cabo eleitoral  ||dash||  ||dash||  Com a possibilidade de uma condenação impedir sua candidatura em 2018 ||comma||  o ex-presidente Luiz Inácio Lula da Silva fez ||comma||  nesta segunda  ||parentheses-left|| 9 ||parentheses-right||  ||comma||  um discurso inflamado contra a Lava Jato ||comma||  no qual disse saber que está  ||quote|| lascado ||quote||  ||comma||  exigiu um pedido de desculpas do juiz Sergio Moro e afirmou que ||comma||  mesmo fora da disputa pelo Planalto ||comma||  será um cabo eleitoral expressivo para a sucessão de Michel Temer ||period||   Segundo o petista ||comma||  réu em sete ações penais ||comma||  o objetivo de Moro é impedir sua candidatura no ano que vem ||comma||  desidratando-o ||comma||  inclusive ||comma||  no apoio a um nome alternativo ||comma||  como o do ex-prefeito de São Paulo Fernando Haddad  ||parentheses-left|| PT ||parentheses-right||  ||com

### Lookup tables

We need to create two dicts: `word_to_int` and `int_to_word`.

In [7]:
def lookup_tables(tokenized_text):
    vocab = set()
    
    for text in tokenized_text:
        text = text.lower()
        vocab = vocab.union(set(text.split()))
    
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab)}
    
    return vocab, vocab_to_int, int_to_vocab

vocab, vocab_to_int, int_to_vocab = lookup_tables(tokenized_text)

print('First ten vocab words: ')
print(list(vocab_to_int.items())[0:10])
print('\nVocab length:')
print(len(vocab_to_int))

pickle.dump((tokenized_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess.p', 'wb'))

First ten vocab words: 
[('proibidos', 0), ('estadão', 1), ('planejou', 2), ('what', 3), ('rodoviarismo', 4), ('desenvolvimentista', 5), ('amaldiçoou', 6), ('cooperaria', 7), ('divisas', 8), ('reestruturará', 9)]

Vocab length:
97648


### Convert all text to integers

Let's convert all articles to integer using the `vocab_to_int` variable.

In [8]:
tokenized_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess.p',  mode='rb'))

In [9]:
def text_to_int(text):
    int_text = []
    for word in text.split():
        if word in vocab_to_int.keys():
            int_text.append(vocab_to_int[word])
    return np.asarray(int_text, dtype=np.int32)

In [10]:
def convert_articles_to_int(tokenized_text):
    all_int_text = []
    for text in tokenized_text:
        all_int_text.append(text_to_int(text))
    return np.asarray(all_int_text)

In [11]:
converted_text = convert_articles_to_int(tokenized_text)

pickle.dump((converted_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess2.p', 'wb'))

In [12]:
converted_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess2.p',  mode='rb'))

In [13]:
converted_text[3]

array([50077,  9094, 11409, 90605, 16348,  8671, 31348, 31348, 86486,
       26397, 44546, 10094, 56182, 86486, 14696, 26397, 63668, 47423,
       11409, 90605, 63668,  9094,  8671, 16348, 13598, 44811, 64040,
       18477, 36023, 27684, 43965, 51058, 21878, 39830, 41809, 63668,
       42619,  8671, 29939,  4457, 35366, 33825, 78945, 63668, 72491,
        8671, 61128, 66833,   974, 83461, 13584, 69656, 83461,  4457,
       86486, 26397,  4457, 35366,  4457, 21878, 63413,  8671, 69112,
        4457, 47406, 63668,  9094,  8671, 16348, 13598, 18477, 34934,
       29939,  4457, 41740, 11216, 30989, 11682,  8559, 44811,  9094,
        8671, 16348, 39939, 26397, 41809, 41740, 53587, 53902, 52062,
       11409, 89820,  4457, 94188, 35366, 37683, 53902,  3483, 85836,
       83461, 18833, 26397, 41809, 28051, 60088, 76728, 86486, 41809,
       33179, 26397,  8671, 76098, 44811, 29166, 37768, 30989, 18477,
       22302,  4457, 44811, 29166, 48487, 61262, 53689, 53672, 92690,
       18477, 64004,

### Subsampling text

We need to subsample our text and remove the words that not provides meaningful information, like: 'the', 'of', 'for'.

Let's use Mikolov's subsampling formula, that's give us the probability of a word to be discarted:

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

Where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

In [14]:
# Converts all articles to one big text

all_converted_text = np.concatenate(converted_text)

In [16]:
def subsampling(int_words, threshold=1e-5):
    word_counts = Counter(int_words)
    total_count = len(int_words)
    freqs = {word: count/total_count for word, count in word_counts.items()}
    p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in word_counts}
    train_words = [word for word in int_words if random.random() < (1 - p_drop[word])]
    
    return np.asarray(train_words)

subsampled_text = subsampling(all_converted_text)

print('Lenght before sumsampling: {0}'.format(len(all_converted_text)))
print('Lenght after sumsampling: {0}'.format(len(subsampled_text)))

Lenght before sumsampling: 10536901
Lenght after sumsampling: 2133443


In [17]:
pickle.dump((subsampled_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess3.p', 'wb'))

In [18]:
subsampled_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess3.p',  mode='rb'))

### Save vocab to csv

Let's save our vocab to csv file, so that way we can use it as an embedding on tensorboard.

In [19]:
subsampled_ints = set(subsampled_text)

subsampled_vocab = []

for word in subsampled_ints:
    subsampled_vocab.append(int_to_vocab[word])

In [20]:
vocab_df = pd.DataFrame({'words': subsampled_vocab})

vocab_df.head()

Unnamed: 0,words
0,proibidos
1,planejou
2,what
3,rodoviarismo
4,desenvolvimentista


In [21]:
vocab_df.to_csv('preprocess/vocab.tsv', header=False, index=False)

### Generate batches

Now, we need to convert all text to numbers with lookup tables and create a batch generator.

In [22]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    words = words.flat
    words = list(words)
    
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R
    target_words = set(words[start:idx] + words[idx+1:stop+1])
    
    return list(target_words)

In [23]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y

## Building the Embedding Graph


In [24]:
def get_embed_placeholders(graph, reuse=False):
    with graph.as_default():
        with tf.variable_scope('placeholder', reuse=reuse):
            inputs = tf.placeholder(tf.int32, [None], name='inputs')
            labels = tf.placeholder(tf.int32, [None, None], name='labels')
            learning_rate = tf.placeholder(tf.float32, [None], name='learning_rate')
            
            return inputs, labels, learning_rate

In [25]:
def get_embed_embeddings(graph, n_vocab, n_embedding, inputs, reuse=False):
    with graph.as_default():
        with tf.variable_scope('embedding', reuse=reuse):
            embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
            embed = tf.nn.embedding_lookup(embedding, inputs)
            
            return embed

In [26]:
def get_embed_negative_samples(graph, n_vocab, n_embedding, reuse=False):
    with graph.as_default():
        with tf.variable_scope('neg_sample', reuse=reuse):
            softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
            softmax_b = tf.Variable(tf.zeros(n_vocab))
            
            # Historigram for tensorboard
            tf.summary.histogram('softmax_w', softmax_w)
            tf.summary.histogram('softmax_b', softmax_b)

            return softmax_w, softmax_b

In [42]:
def get_embed_cost(graph, n_sampled, softmax_w, softmax_b, labels, embed, n_vocab, reuse=False):
    with graph.as_default():
        with tf.variable_scope('loss_cost', reuse=reuse):
            loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, 
                                              labels, embed,
                                              n_sampled, n_vocab)
    
            cost = tf.reduce_mean(loss)
            
            # Scalar for tensorboard
            tf.summary.histogram('loss', loss)
            tf.summary.scalar('cost', cost)
            
            return loss, cost

In [28]:
def get_embed_opt(graph, learning_rate, cost, reuse=False):
    with graph.as_default():
        with tf.variable_scope('optmizer'):
            optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
            
            return optimizer

In [67]:
def train_embed(graph,
                batch_size,
                learning_rate,
                epochs,
                window_size,
                train_words,
                n_sampled,
                n_embedding,
                vocab_to_int,
                int_to_vocab):
    
    with tf.Session(graph=graph) as sess:

        inputs, labels, lr = get_embed_placeholders(graph)

        embed = get_embed_embeddings(graph, len(int_to_vocab), n_embedding, inputs)

        softmax_w, softmax_b = get_embed_negative_samples(graph, len(int_to_vocab), n_embedding, reuse=True)

        loss, cost = get_embed_cost(graph, n_sampled, softmax_w, softmax_b, labels, embed, len(int_to_vocab), reuse=True)

        optimizer = get_embed_opt(graph, learning_rate, cost, reuse=True)
        
        merged_summary = tf.summary.merge_all()
        
        train_writer = tf.summary.FileWriter('checkpoints/train')

        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()

        avg_loss = 0
        iteration = 1

        for e in range(1, epochs + 1):
            batches = get_batches(train_words, batch_size, window_size)

            start = time.time()

            for x, y in batches:
                feed = {
                    inputs: x,
                    labels: np.array(y)[:, None]
                }

                summary, train_loss, _ = sess.run([merged_summary, cost, optimizer], feed_dict=feed)

                avg_loss += train_loss
                
                train_writer.add_summary(summary, epochs + 1)

                if iteration % 100 == 0: 
                    end = time.time()
                    print("Epoch {}/{}".format(e, epochs),
                          "Iteration: {}".format(iteration),
                          "Avg. Training loss: {:.4f}".format(avg_loss/100),
                          "{:.4f} sec/batch".format((end-start)/100))
                    avg_loss = 0
                    start = time.time()
                #break
                iteration += 1
                
        save_path = saver.save(sess, "checkpoints/embed1.ckpt")

In [68]:
epochs = 1
learning_rate = 0.01
window_size = 10
batch_size = 1024
n_sampled = 100
n_embedding = 200

tf.reset_default_graph()

embed_train_graph = tf.Graph()

train_embed(embed_train_graph,
            batch_size,
            learning_rate,
            epochs,
            window_size,
            subsampled_text,
            n_sampled,
            n_embedding,
            vocab_to_int,
            int_to_vocab
            )

Epoch 1/1 Iteration: 100 Avg. Training loss: 3.4441 0.6300 sec/batch
Epoch 1/1 Iteration: 200 Avg. Training loss: 2.7294 0.6393 sec/batch
Epoch 1/1 Iteration: 300 Avg. Training loss: 2.4242 0.6400 sec/batch
Epoch 1/1 Iteration: 400 Avg. Training loss: 2.7199 0.6433 sec/batch
Epoch 1/1 Iteration: 500 Avg. Training loss: 2.9011 0.6441 sec/batch
Epoch 1/1 Iteration: 600 Avg. Training loss: 2.6993 0.6427 sec/batch
Epoch 1/1 Iteration: 700 Avg. Training loss: 2.8817 0.6430 sec/batch
Epoch 1/1 Iteration: 800 Avg. Training loss: 2.9772 0.6406 sec/batch
Epoch 1/1 Iteration: 900 Avg. Training loss: 3.0089 0.6425 sec/batch
Epoch 1/1 Iteration: 1000 Avg. Training loss: 3.1110 0.6394 sec/batch
Epoch 1/1 Iteration: 1100 Avg. Training loss: 3.1541 0.6386 sec/batch
Epoch 1/1 Iteration: 1200 Avg. Training loss: 3.0476 0.6381 sec/batch
Epoch 1/1 Iteration: 1300 Avg. Training loss: 3.1351 0.6427 sec/batch
Epoch 1/1 Iteration: 1400 Avg. Training loss: 3.0813 0.6407 sec/batch
Epoch 1/1 Iteration: 1500 Avg