# Brazilian Newspaper analysis

In this project, we'll use a dataset from a Brazilian newspaper called "Folha de São Paulo".

We're going to use word embeddings, TensorBoard and RNNs to search for political opinions and positions.

You can find the dataset on [Kaggle](https://www.kaggle.com/marlesson/news-of-the-site-folhauol).

In this case study, I want to:

+ Find political opinions
+ Check whether this newspaper is impartial or biased

## Skip-gram model

Let's use a word embedding model to find the relationships between words in the articles. The model will learn how one word relates to another, and we'll inspect these relationships in TensorBoard and in a t-SNE chart (which projects the embeddings onto a 2D plane).

We have two architectures to choose from: CBOW (Continuous Bag-Of-Words) and Skip-gram.

In our case we'll use Skip-gram because it tends to represent infrequent words better than CBOW, at the cost of slower training.

The models work like this:

![](assets/word2vec_architectures.png)

In CBOW we take the words around a position and try to predict the "middle" word.

In Skip-gram we do the opposite: we take one word and try to predict the words around it.
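
To make the difference concrete, here is a minimal sketch that builds the training pairs each architecture would see for a toy sentence. The helper names `skipgram_pairs` and `cbow_pairs`, the fixed `window` parameter and the example sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: one word predicts each of its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the neighbours together predict the middle word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


# Hypothetical toy sentence, only for illustration
sentence = "o presidente fez um discurso".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram produces one training example per (word, neighbour) combination, while CBOW produces one example per position.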

## Loading the data

After downloading the dataset, put it in a `data/` directory and let's load it using pandas.

**Using python 3.6 and tensorflow 1.3**

In [42]:
# Import dependencies

import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib
import os
import pickle
import random
import time
import math
from collections import Counter

In [3]:
dataset = pd.read_csv('data/articles.csv')

dataset.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
1,"'Decidi ser escrava das mulheres que sofrem', ...","Para Oumou Sangaré, cantora e ativista malines...",2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
3,Filme 'Star Wars: Os Últimos Jedi' ganha trail...,A Disney divulgou na noite desta segunda-feira...,2017-09-10,ilustrada,,http://www1.folha.uol.com.br/ilustrada/2017/10...
4,CBSS inicia acordos com fintechs e quer 30% do...,"O CBSS, banco da holding Elopar dos sócios Bra...",2017-09-10,mercado,,http://www1.folha.uol.com.br/mercado/2017/10/1...


## Preprocessing the data

### Removing unnecessary articles

We are looking for political opinions, so let's keep only the articles in the category 'poder' (power).

In [4]:
political_dataset = dataset.loc[dataset.category == 'poder']

political_dataset.head()

Unnamed: 0,title,text,date,category,subcategory,link
0,"Lula diz que está 'lascado', mas que ainda tem...",Com a possibilidade de uma condenação impedir ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
2,Três reportagens da Folha ganham Prêmio Petrob...,Três reportagens da Folha foram vencedoras do ...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
6,"Posso sair do Brasil quando e como quiser, diz...",O italiano Cesare Battisti disse nesta segunda...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
8,Supremo nega pedido para Senado analisar impea...,O STF (Supremo Tribunal Federal) negou na quin...,2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...
10,"Dodge defende manter Joesley e Saud, da JBS, p...","A procuradora-geral da República, Raquel Dodge...",2017-09-10,poder,,http://www1.folha.uol.com.br/poder/2017/10/192...


### Merging title and text
To keep each article's title and text together, let's merge them and use the merged text as our input.

In [5]:
# Merges the title and text with a separator (----)
merged_text = [str(title) + ' ---- ' + str(text) for title, text in zip(political_dataset.title, political_dataset.text)]

print(merged_text[0])

Lula diz que está 'lascado', mas que ainda tem força como cabo eleitoral ---- Com a possibilidade de uma condenação impedir sua candidatura em 2018, o ex-presidente Luiz Inácio Lula da Silva fez, nesta segunda (9), um discurso inflamado contra a Lava Jato, no qual disse saber que está "lascado", exigiu um pedido de desculpas do juiz Sergio Moro e afirmou que, mesmo fora da disputa pelo Planalto, será um cabo eleitoral expressivo para a sucessão de Michel Temer.  Segundo o petista, réu em sete ações penais, o objetivo de Moro é impedir sua candidatura no ano que vem, desidratando-o, inclusive, no apoio a um nome alternativo, como o do ex-prefeito de São Paulo Fernando Haddad (PT), caso ele não possa concorrer à Presidência.  "Eu sei que tô lascado, todo dia tem um processo. Eu não quero nem que Moro me absolva, eu só quero que ele peça desculpas", disse Lula durante um seminário sobre educação em Brasília. "Eles [investigadores] chegam a dizer: 'Ah, se o Lula não for candidato, ele não 

### Tokenizing punctuation
We need to tokenize all punctuation in the text; otherwise the network will treat punctuated words as different words (e.g. hello != hello!).

In [6]:
def token_lookup():
    tokens = {
        '.'  : 'period',
        ','  : 'comma',
        '"'  : 'quote',
        '\'' : 'single-quote',
        ';'  : 'semicolon',
        ':'  : 'colon',
        '!'  : 'exclamation-mark',
        '?'  : 'question-mark',
        '('  : 'parentheses-left',
        ')'  : 'parentheses-right',
        '['  : 'brackets-left',
        ']'  : 'brackets-right',
        '{'  : 'braces-left',
        '}'  : 'braces-right',
        '_'  : 'underscore',
        '--' : 'dash',
        '\n' : 'return'
    }
    
    return {token: '||{0}||'.format(value) for token, value in tokens.items()}

token_dict = token_lookup()

tokenized_text = []

for text in merged_text:
    for key, token in token_dict.items():
        text = text.replace(key, ' {} '.format(token))
    
    tokenized_text.append(text)

print(tokenized_text[0])

Lula diz que está  ||single-quote|| lascado ||single-quote||  ||comma||  mas que ainda tem força como cabo eleitoral  ||dash||  ||dash||  Com a possibilidade de uma condenação impedir sua candidatura em 2018 ||comma||  o ex-presidente Luiz Inácio Lula da Silva fez ||comma||  nesta segunda  ||parentheses-left|| 9 ||parentheses-right||  ||comma||  um discurso inflamado contra a Lava Jato ||comma||  no qual disse saber que está  ||quote|| lascado ||quote||  ||comma||  exigiu um pedido de desculpas do juiz Sergio Moro e afirmou que ||comma||  mesmo fora da disputa pelo Planalto ||comma||  será um cabo eleitoral expressivo para a sucessão de Michel Temer ||period||   Segundo o petista ||comma||  réu em sete ações penais ||comma||  o objetivo de Moro é impedir sua candidatura no ano que vem ||comma||  desidratando-o ||comma||  inclusive ||comma||  no apoio a um nome alternativo ||comma||  como o do ex-prefeito de São Paulo Fernando Haddad  ||parentheses-left|| PT ||parentheses-right||  ||com

### Lookup tables

We need to create two dicts: `vocab_to_int` and `int_to_vocab`.

In [7]:
def lookup_tables(tokenized_text):
    vocab = set()
    
    for text in tokenized_text:
        text = text.lower()
        vocab = vocab.union(set(text.split()))
    
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for ii, word in enumerate(vocab)}
    
    return vocab, vocab_to_int, int_to_vocab

vocab, vocab_to_int, int_to_vocab = lookup_tables(tokenized_text)

print('First ten vocab words: ')
print(list(vocab_to_int.items())[0:10])
print('\nVocab length:')
print(len(vocab_to_int))

pickle.dump((tokenized_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess.p', 'wb'))

First ten vocab words: 
[('a-', 0), ('retrocederem', 1), ('airton', 2), ('mab', 3), ('lovanni', 4), ('retratem', 5), ('filippeli', 6), ('roousseff', 7), ('monarquismo', 8), ('pintaram', 9)]

Vocab length:
97648


### Convert all text to integers

Let's convert all articles to integers using the `vocab_to_int` lookup table.

In [8]:
tokenized_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess.p',  mode='rb'))

In [9]:
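def text_to_int(text):
    int_text = []
    # Note: the vocab was built from lowercased text, so any token that still
    # contains uppercase letters is not found here and is silently skipped.
    for word in text.split():
        if word in vocab_to_int:
            int_text.append(vocab_to_int[word])
    return np.asarray(int_text, dtype=np.int32)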

In [10]:
def convert_articles_to_int(tokenized_text):
    all_int_text = []
    for text in tokenized_text:
        all_int_text.append(text_to_int(text))
    return np.asarray(all_int_text)

In [11]:
converted_text = convert_articles_to_int(tokenized_text)

pickle.dump((converted_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess2.p', 'wb'))

In [12]:
converted_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess2.p',  mode='rb'))

In [13]:
converted_text[3]

array([63492, 73324, 76200, 71471, 96956, 84961, 65583, 65583, 87770,
       49229, 88177, 82219, 11830, 87770, 39452, 49229, 46497, 65533,
       76200, 71471, 46497, 73324, 84961, 96956, 43111, 53034, 59515,
       56189, 68360, 91183, 66314, 49102, 27637, 82941, 49723, 46497,
       82870, 84961, 90879, 21712, 78432, 24666, 69666, 46497, 52651,
       84961, 93551, 12244, 57464,  2798, 22159,  9224,  2798, 21712,
       87770, 49229, 21712, 78432, 21712, 27637, 37398, 84961, 15766,
       21712, 82472, 46497, 73324, 84961, 96956, 43111, 56189, 84612,
       90879, 21712,  4699, 55237, 69905,  2277, 66443, 53034, 73324,
       84961, 96956, 11359, 49229, 49723,  4699, 38418, 30357, 56281,
       76200, 66672, 21712, 28503, 78432, 95264, 30357, 73982,   378,
        2798, 13306, 49229, 49723, 20010, 92464, 12641, 87770, 49723,
       80507, 49229, 84961, 45411, 53034, 88132, 68680, 69905, 56189,
       17248, 21712, 53034, 88132, 93894,  9661, 95022, 90655, 31291,
       56189, 90189,

### Subsampling text

We need to subsample our text and remove the words that don't provide meaningful information, like 'the', 'of' and 'for'.

Let's use Mikolov's subsampling formula, which gives us the probability of a word being discarded:

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

Where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.
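
As a quick worked example of the formula, the sketch below evaluates the drop probability for a few word frequencies with $t = 10^{-5}$. The helper `drop_probability` is hypothetical, and clipping negative values to zero is an assumption added here for readability: words rarer than the threshold get a negative value from the formula, which simply means they are always kept.

```python
import math


def drop_probability(freq, threshold=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))


# Relative frequencies from very common (1% of all words) to very rare
for freq in (1e-2, 1e-4, 1e-5, 1e-6):
    print("f(w) = {:.0e} -> P(drop) = {:.3f}".format(freq, drop_probability(freq)))
```

A word that makes up 1% of the corpus is dropped about 97% of the time, while a word at or below the threshold frequency is never dropped.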

In [14]:
# Converts all articles to one big text

all_converted_text = np.concatenate(converted_text)

In [15]:
def subsampling(int_words, threshold=1e-5):
    word_counts = Counter(int_words)
    total_count = len(int_words)
    freqs = {word: count/total_count for word, count in word_counts.items()}
    p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in word_counts}
    train_words = [word for word in int_words if random.random() < (1 - p_drop[word])]
    
    return np.asarray(train_words)

subsampled_text = subsampling(all_converted_text)

print('Length before subsampling: {0}'.format(len(all_converted_text)))
print('Length after subsampling: {0}'.format(len(subsampled_text)))

Length before subsampling: 10536901
Length after subsampling: 2134935


In [18]:
pickle.dump((subsampled_text, vocab, vocab_to_int, int_to_vocab, token_dict), open('preprocess/preprocess3.p', 'wb'))

In [2]:
subsampled_text, vocab, vocab_to_int, int_to_vocab, token_dict = pickle.load(open('preprocess/preprocess3.p',  mode='rb'))

### Save vocab to csv

Let's save our vocab to a file, so we can use it as the label metadata for the embeddings in TensorBoard.

In [3]:
subsampled_ints = set(subsampled_text)

subsampled_vocab = []

for word in subsampled_ints:
    subsampled_vocab.append(int_to_vocab[word])

In [4]:
vocab_df = pd.DataFrame.from_dict(int_to_vocab, orient='index')

vocab_df.head()

Unnamed: 0,0
0,a-
1,retrocederem
2,airton
3,mab
4,lovanni


In [6]:
vocab_df.to_csv('preprocess/vocab.tsv', header=False, index=False)

### Generate batches

The text is already converted to integers, so now we need a batch generator that yields (input, target) pairs for the Skip-gram model.

In [7]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    # Sample the window radius so that nearby words are picked more often
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R
    # Slice before converting to lists, so only the window is copied
    target_words = set(list(words[start:idx]) + list(words[idx+1:stop+1]))
    
    return list(target_words)

In [8]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
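
To see what the generator produces, here is a self-contained sketch of the same batching idea on a toy sequence. `window_targets` and `toy_batches` are compact stand-ins written for this illustration (assumed names, not the notebook's functions), and the integers 0 to 9 stand in for integer-encoded words.

```python
import numpy as np


def window_targets(words, idx, window_size):
    """Unique words within a randomly sized window around position idx."""
    radius = np.random.randint(1, window_size + 1)   # 1..window_size inclusive
    start = max(0, idx - radius)
    return list(set(words[start:idx]) | set(words[idx + 1:idx + radius + 1]))


def toy_batches(words, batch_size, window_size):
    """Yield (inputs, targets), repeating each input once per target word."""
    n_batches = len(words) // batch_size
    words = words[:n_batches * batch_size]           # keep only full batches
    for start in range(0, len(words), batch_size):
        batch = words[start:start + batch_size]
        x, y = [], []
        for i, word in enumerate(batch):
            targets = window_targets(batch, i, window_size)
            y.extend(targets)
            x.extend([word] * len(targets))
        yield x, y


toy_words = list(range(10))                          # stand-in for integer-encoded text
for x, y in toy_batches(toy_words, batch_size=5, window_size=2):
    print(list(zip(x, y)))
```

Each input word is repeated once per target, so `x` and `y` always have the same length, and targets never cross a batch boundary because the window is taken inside the batch.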

## Building the Embedding Graph


In [73]:
def get_embed_placeholders(graph, reuse=False):
    with graph.as_default():
        with tf.variable_scope('placeholder', reuse=reuse):
            inputs = tf.placeholder(tf.int32, [None], name='inputs')
            labels = tf.placeholder(tf.int32, [None, None], name='labels')
            # The learning rate is a single scalar, so the placeholder has an empty shape
            learning_rate = tf.placeholder(tf.float32, [], name='learning_rate')
            
            return inputs, labels, learning_rate

In [74]:
def get_embed_embeddings(graph, vocab_size, embedding_size, inputs, reuse=False):
    with graph.as_default():
        with tf.variable_scope('embedding', reuse=reuse):
            embedding = tf.Variable(tf.random_uniform((vocab_size, embedding_size),
                                                      -0.5 / embedding_size,
                                                      0.5 / embedding_size))
            embed = tf.nn.embedding_lookup(embedding, inputs)
            
            return embed
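
The lookup in the cell above amounts to selecting rows of the embedding matrix by word id. As a rough NumPy sketch of that idea (toy sizes and a fixed seed are assumptions chosen for illustration, and this is not how TensorFlow implements it internally):

```python
import numpy as np

vocab_size, embedding_size = 6, 3            # toy sizes, not the notebook's values
rng = np.random.RandomState(0)

# Same initialisation range as in the cell above: uniform in +/- 0.5 / embedding_size
embedding = rng.uniform(-0.5 / embedding_size, 0.5 / embedding_size,
                        size=(vocab_size, embedding_size))

inputs = np.array([4, 1, 4])                 # integer word ids for one batch
embed = embedding[inputs]                    # row selection, shape (3, embedding_size)

# The same result via an explicit one-hot matrix product
one_hot = np.eye(vocab_size)[inputs]
assert np.allclose(embed, one_hot @ embedding)

print(embed.shape)
```

Selecting rows directly avoids building the large one-hot matrix, which matters when the vocabulary has tens of thousands of words.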

In [75]:
def get_nce_weights_biases(graph, vocab_size, embedding_size, reuse=False):
    with graph.as_default():
        with tf.variable_scope('nce', reuse=reuse):
            nce_weights = tf.Variable(tf.truncated_normal((vocab_size, embedding_size),
                                                           stddev=1.0/math.sqrt(embedding_size)))
            nce_biases = tf.Variable(tf.zeros(vocab_size))
            
            # Histograms for TensorBoard
            tf.summary.histogram('weights', nce_weights)
            tf.summary.histogram('biases', nce_biases)

            return nce_weights, nce_biases

In [76]:
def get_embed_loss(graph, num_sampled, nce_weights, nce_biases, labels, embed, vocab_size, reuse=False):
    with graph.as_default():
        with tf.variable_scope('nce', reuse=reuse):
            loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights=nce_weights,
                                                             biases=nce_biases,
                                                             labels=labels,
                                                             inputs=embed,
                                                             num_sampled=num_sampled,
                                                             num_classes=vocab_size))
            
            # Scalar for tensorboard
            tf.summary.scalar('loss', loss)
            
            return loss

In [80]:
def get_embed_opt(graph, learning_rate, loss, reuse=False):
    with graph.as_default():
        with tf.variable_scope('optmizer'):
            optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
            
            return optimizer

In [81]:
def train_embed(graph,
                batch_size,
                learning_rate,
                epochs,
                window_size,
                train_words,
                num_sampled,
                embedding_size,
                vocab_size,
                save_dir,
                print_every):
    
    with tf.Session(graph=graph) as sess:

        inputs, labels, lr = get_embed_placeholders(graph)

        embed = get_embed_embeddings(graph, vocab_size, embedding_size, inputs)

        nce_weights, nce_biases = get_nce_weights_biases(graph, vocab_size, embedding_size, reuse=True)

        loss = get_embed_loss(graph, num_sampled, nce_weights, nce_biases, labels, embed, vocab_size, reuse=True)

        optimizer = get_embed_opt(graph, learning_rate, loss, reuse=True)
        
        merged_summary = tf.summary.merge_all()
        
        train_writer = tf.summary.FileWriter(save_dir)

        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()

        avg_loss = 0
        iteration = 1

        for e in range(1, epochs + 1):
            batches = get_batches(train_words, batch_size, window_size)

            start = time.time()

            for x, y in batches:
                feed = {
                    inputs: x,
                    labels: np.array(y)[:, None]
                }

                summary, _, train_loss  = sess.run([merged_summary, optimizer, loss], feed_dict=feed)

                avg_loss += train_loss
                
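                # Use the running batch count as the global step; a constant
                # step would put every summary at the same point in TensorBoard
                train_writer.add_summary(summary, iteration)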

                if iteration % print_every == 0: 
                    end = time.time()
                    print("Epoch {}/{}".format(e, epochs),
                          "Batch: {}".format(iteration),
                          "Training loss: {:.4f}".format(avg_loss/print_every),
                          "Speed: {:.4f} sec/batch".format((end-start)/print_every))
                    avg_loss = 0
                    start = time.time()
                #break
                iteration += 1
                
        save_path = saver.save(sess, save_dir + '/embed.ckpt')

In [82]:
epochs = 10
learning_rate = 0.01
window_size = 10
batch_size = 1024
num_sampled = 100
embedding_size = 200
vocab_size = len(vocab_to_int)
save_dir = 'checkpoints/embed/train'
print_every = 1000

tf.reset_default_graph()

embed_train_graph = tf.Graph()

train_embed(embed_train_graph,
            batch_size,
            learning_rate,
            epochs,
            window_size,
            subsampled_text,
            num_sampled,
            embedding_size,
            vocab_size,
            save_dir,
            print_every
            )

Epoch 1/10 Batch: 1000 Training loss: 3.2113 Speed: 0.6224 sec/batch
Epoch 1/10 Batch: 2000 Training loss: 4.0933 Speed: 0.6024 sec/batch
Epoch 2/10 Batch: 3000 Training loss: 4.3585 Speed: 0.5493 sec/batch
Epoch 2/10 Batch: 4000 Training loss: 4.4642 Speed: 0.6041 sec/batch
Epoch 3/10 Batch: 5000 Training loss: 4.4193 Speed: 0.5077 sec/batch
Epoch 3/10 Batch: 6000 Training loss: 4.2437 Speed: 0.6098 sec/batch
Epoch 4/10 Batch: 7000 Training loss: 4.4818 Speed: 0.4570 sec/batch
Epoch 4/10 Batch: 8000 Training loss: 4.2949 Speed: 0.6150 sec/batch
Epoch 5/10 Batch: 9000 Training loss: 4.7017 Speed: 0.4088 sec/batch
Epoch 5/10 Batch: 10000 Training loss: 4.5748 Speed: 0.6180 sec/batch
Epoch 6/10 Batch: 11000 Training loss: 4.8532 Speed: 0.3605 sec/batch
Epoch 6/10 Batch: 12000 Training loss: 5.0591 Speed: 0.6264 sec/batch
Epoch 7/10 Batch: 13000 Training loss: 4.8968 Speed: 0.3107 sec/batch
Epoch 7/10 Batch: 14000 Training loss: 5.1024 Speed: 0.6286 sec/batch
Epoch 8/10 Batch: 15000 Train