<h1>Exploratory analysis of news: How to identify fake news</h1><p style="float:left">Author: Guilherme Milan Santos. USP number: 9012966</p>

<p>Dataset taken from the paper "Fake News Detection on Social Media: A Data Mining Perspective"<br/>
Authors: Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan<br/>
Publication: ACM SIGKDD Explorations Newsletter, Volume 19, Issue 1, Pages 22-36<br/>
</p>

<h1>Library imports and initial configuration</h1>

In [1]:
# Data manipulation
import pandas as pd
import json
from pandas.io.json import json_normalize
import igraph as ig
import numpy as np
from datetime import datetime
import time
import gc
import operator
import collections

# mlxtend
from mlxtend.frequent_patterns import apriori

#Sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import confusion_matrix
from sklearn.semi_supervised import LabelPropagation
from sklearn import manifold
from sklearn.ensemble import RandomForestClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import StanfordNERTagger
from nltk.parse import CoreNLPParser
from stanfordcorenlp import StanfordCoreNLP

# Plotting libraries
import matplotlib.pyplot as plt
import networkx as nx
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Jupyter widgets
from IPython.display import display, HTML
from preproc import *
from IPython import embed
import ipywidgets as widgets



In [2]:
init_notebook_mode(connected=True)

pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older releases used -1)
pd.set_option('display.max_rows', 1000)

file_data = load_dataset()
raw_data = file_data[0]
labels = file_data[1]
raw_df = raw_data_to_df(raw_data, labels)

<h1>General information about the dataset</h1>

In [3]:
display(HTML('Total number of news articles: {}'.format(raw_df.shape[0])))
display(HTML('Number of columns: {}'.format(raw_df.shape[1])))
real_num = raw_df.loc[raw_df['label']==1].shape[0]
fake_num = raw_df.loc[raw_df['label']==0].shape[0]
display(HTML('Number of real news articles: {}'.format(real_num)))
display(HTML('Number of fake news articles: {}'.format(fake_num)))
datas = pd.to_datetime(raw_df['publish_date.$date'], unit='ms')
display(HTML('Publication period: from {} to {}'.format(min(datas), max(datas))))

<h1>Columns</h1>

In [4]:
for column in raw_df.columns:
    print(column)


authors
canonical_link
images
keywords
label
meta_data.DC.date.issued
meta_data.Date
meta_data.HandheldFriendly
meta_data.Last-Modified
meta_data.MobileOptimized
meta_data.al.android.app_name
meta_data.al.android.package
meta_data.al.android.url
meta_data.al.ipad.app_name
meta_data.al.ipad.app_store_id
meta_data.al.ipad.url
meta_data.al.iphone.app_name
meta_data.al.iphone.app_store_id
meta_data.al.iphone.url
meta_data.al.web.url
meta_data.apple-itunes-app
meta_data.apple-mobile-web-app-capable
meta_data.apple-touch-fullscreen
meta_data.application-name
meta_data.article.author
meta_data.article.modified_time
meta_data.article.published_time
meta_data.article.publisher
meta_data.article.section
meta_data.article.tag
meta_data.author
meta_data.build
meta_data.ca_image
meta_data.ca_title
meta_data.copyright
meta_data.date
meta_data.description
meta_data.fb.admins
meta_data.fb.app_id
meta_data.fb.op-recirculation-ads
meta_data.fb.pages
meta_data.fb_title
meta_data.format-detection
meta_dat

The dataset has a very large number of columns with little informative value. For the analyses that follow, the selected fields are title (the headline), text (the body of the article), publish_date.$date (publication date) and meta_data.og.site_name (source), plus label (1 for real, 0 for fake).

<h1>Example record</h1>

In [5]:
display(HTML(raw_df[['title','text','publish_date.$date','meta_data.og.site_name','label']].iloc[[0]].to_html()))

Unnamed: 0,title,text,publish_date.$date,meta_data.og.site_name,label
0,Proof The Mainstream Media Is Manipulating The Election By Taking Bill Clinton Out Of Context,"I woke up this morning to find a variation of this headline splashed all over my news feed:\n\nBill Clinton: ‘Natural’ For Foundation Donors to Seek Favors\n\nHere’s Google:\n\nNaturally, my reaction was “oh, s**t, what did Bill Clinton do to damage his wife’s campaign now?”\n\nOf course, the headline sounds really, really awful. It plays right into the idea that the Clinton Foundation is all about pay to play, just like Donald Trump has been saying all along. Unfortunately, it takes reading beyond the headlines, which is something most people don’t do, to find out the real story – and the real story is that there is no pay to play.\n\n“It was natural for people who’ve been our political allies and personal friends to call and ask for things. And I trusted the State Department wouldn’t do anything they shouldn’t do,” Clinton told NPR in an interview that aired Monday morning. Source: CNN\n\nIn other words, people can ask for favors, but that certainly doesn’t mean they’ll get them. Leaked emails have shown that some Clinton Foundation donors have gotten meetings with Clinton and that others were turned down. There is zero evidence of pay to play. In other words, people might have asked for favors, but there’s no evidence they got them.\n\nNow, let’s talk about the foundation the media doesn’t like to mention, the Trump Foundation. Trump hasn’t given to his own foundation since 2008. He does collect money from others, though, and gives it in his name. He also takes from the charity and allegedly buys things like oil paintings and football helmets, all for himself, but out of charity money.\n\nNew York Attorney General Eric Schneiderman said in a September 13 CNN interview that his office is investigating Trump’s charitable foundation over concerns that it “engaged in some impropriety” as related to New York charity laws. 
The investigation launched amid reports from The Washington Post that Trump spent money from his charity on items meant to benefit himself, such as a $20,000 oil painting of himself and a $12,000 autographed football helmet, and also recycled others’ contributions “to make them appear to have come from him” although he “hasn’t given to the foundation since 2008.” Source: Media Matters\n\nMedia Matters goes on to talk about the double standard and about how clearly the mainstream media is trying to promote Trump at the cost of Clinton’s candidacy:\n\nJournalists have been criticized for the “double standard” in the ways they cover Trump and Democratic presidential nominee Hillary Clinton. Earlier this month, cable news programs devoted 13 times more coverage to Clinton’s pneumonia diagnosis as The Washington Post’s reporting about the Trump Foundation. This week, both the Trump Foundation and Trump Organization stories were given short shrift by the broadcast news programs in favor of coverage of Donald Trump’s Dr. Oz stunt.\n\nAll of this biased coverage is hurting Clinton and helping Trump. Trump has seen major gains over the last few weeks, largely because the media covers every minor Clinton “scandal” (if you call getting sick a “scandal”) while ignoring every scandal in Trump’s closet, and trust me, there are a lot.\n\nFeatured image via Ethan Miller/Getty Images.",1474243000000.0,AddictingInfo,0


<h1>News veracity by source</h1>

In [6]:
df = raw_df[['title','text','meta_data.og.site_name','label','publish_date.$date']]
raw_df = None
gc.collect()

real_news_per_source = df.loc[df['label']==1]['meta_data.og.site_name'].value_counts()
fake_news_per_source = df.loc[df['label']==0]['meta_data.og.site_name'].value_counts()
total_news_per_source = df['meta_data.og.site_name'].value_counts()

real_news_pctgs = np.multiply(real_news_per_source.divide(total_news_per_source, fill_value=0), 100)
aligned_real_news_pctgs = real_news_pctgs.align(other=total_news_per_source,join='right')[0]

fake_news_pctgs = np.multiply(fake_news_per_source.divide(total_news_per_source, fill_value=0), 100)
aligned_fake_news_pctgs = fake_news_pctgs.align(other=total_news_per_source,join='right')[0]

x_labels = aligned_real_news_pctgs.index.tolist()
real_y_values = aligned_real_news_pctgs.values
fake_y_values = aligned_fake_news_pctgs.values

displayed_num = 20
real_trace = go.Bar(
    x=x_labels[:displayed_num],
    y=real_y_values[:displayed_num],
    text=total_news_per_source[:displayed_num],
    name='Real news'
)

fake_trace = go.Bar(
    x=x_labels[:displayed_num],
    y=fake_y_values[:displayed_num],
    text=total_news_per_source[:displayed_num],
    name='Fake news',
    marker=dict(
        color='#cc0000'
    )
)

data = [real_trace, fake_trace]

layout=go.Layout(
    barmode='stack',
    title='Percentage of real and fake news by source'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

fake_dataset = df.loc[df['label']==0]
real_dataset = df.loc[df['label']==1]

The chart shows that mainstream outlets such as CNN and Politico have a high share of real news, while smaller publications show higher shares of fake news.
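The per-source percentages behind the stacked bars can be illustrated without pandas. This is a minimal sketch on hypothetical (source, label) pairs, not the notebook's data; `real_fake_percentages` is a helper name introduced here for illustration only.

```python
from collections import Counter

# Hypothetical (source, label) pairs; label 1 = real, 0 = fake, as in the notebook.
articles = [
    ("CNN", 1), ("CNN", 1), ("CNN", 1), ("CNN", 0),
    ("SmallBlog", 0), ("SmallBlog", 0), ("SmallBlog", 1),
]

# Articles per source, and real articles per source.
totals = Counter(source for source, _ in articles)
real = Counter(source for source, label in articles if label == 1)

def real_fake_percentages(source):
    """Return (% real, % fake) for one source; the two values sum to 100."""
    pct_real = 100.0 * real[source] / totals[source]
    return pct_real, 100.0 - pct_real

for source in totals:
    pct_real, pct_fake = real_fake_percentages(source)
    print(f"{source}: {pct_real:.1f}% real, {pct_fake:.1f}% fake ({totals[source]} articles)")
```

Showing the article count next to each percentage matters: a source with only a handful of articles can reach 100% fake or 100% real by chance.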

<h1>Evolution over time: Publication frequency of real and fake news</h1>

In [7]:
date_df = df.copy()
date_df.rename(columns={'publish_date.$date':'date'},inplace=True)
date_df['date'] = pd.to_datetime(date_df['date'], unit='ms')

max_date = max(date_df['date'])
min_date = min(date_df['date'])

d_range = pd.date_range(start=min_date, end=max_date, periods=50)

true_date_df = date_df.loc[date_df['label'] == 1]
fake_date_df = date_df.loc[date_df['label'] == 0]

true_binned_dates = pd.cut(x=true_date_df['date'], bins=d_range)
fake_binned_dates = pd.cut(x=fake_date_df['date'], bins=d_range)

true_binned_dates = true_binned_dates.apply(lambda x: x.left).value_counts().sort_index()
fake_binned_dates = fake_binned_dates.apply(lambda x: x.left).value_counts().sort_index()

true_trace = go.Scatter(x=true_binned_dates.index.tolist(), y=true_binned_dates.values, name="Real news")
fake_trace = go.Scatter(x=fake_binned_dates.index.tolist(), y=fake_binned_dates.values, line=dict(color='#cc0000'), name="Fake news")

data = [true_trace, fake_trace]
layout = go.Layout(title='Number of real and fake news articles published over time')
figure = go.Figure(data=data, layout=layout)
iplot(figure)

<h1>Evolution over time: Sharing of real and fake news on social networks</h1>

In [8]:
def load_sm_dataset():
    # news index -> {'count': total shares, 'date': publication date, 'label': 0/1}
    count_dict = dict()

    with open('./FakeNewsNet/Data/BuzzFeed/BuzzFeedNewsUser.txt', 'r') as f:
        rows = f.read().replace('\t', ' ').split('\n')

        for row in rows:
            row = row.split()
            if len(row) != 0:
                # Column 0 is read as the news index, column 2 as a share count
                index = int(row[0])
                if index not in count_dict:
                    count_dict[index] = dict()
                    count_dict[index]['count'] = int(row[2])
                    # BuzzFeed indexes are shifted by one to address date_df rows
                    count_dict[index]['date'] = date_df.iloc[index-1]['date']
                    count_dict[index]['label'] = date_df.iloc[index-1]['label']
                else:
                    count_dict[index]['count'] += int(row[2])

    with open('./FakeNewsNet/Data/PolitiFact/PolitiFactNewsUser.txt', 'r') as f:
        rows = f.read().replace('\t', ' ').split('\n')

        for row in rows:
            row = row.split()
            if len(row) != 0:
                # PolitiFact indexes are offset by 91 before addressing date_df rows
                index = int(row[0])+91
                if index not in count_dict:
                    count_dict[index] = dict()
                    count_dict[index]['count'] = int(row[2])
                    count_dict[index]['date'] = date_df.iloc[index]['date']
                    count_dict[index]['label'] = date_df.iloc[index]['label']
                else:
                    count_dict[index]['count'] += int(row[2])

    return count_dict


In [9]:
def count_real_fake_shares():
    count_dict = load_sm_dataset()

    count_df = pd.DataFrame.from_dict(count_dict, orient='index')

    real_count_df = count_df[count_df['label'] == 1]
    fake_count_df = count_df[count_df['label'] == 0]

    real_counts = []
    fake_counts = []

    for i in range(0, len(d_range)-1): 
        real_inds = (real_count_df['date'] >= d_range[i]) & (real_count_df['date'] <= d_range[i+1])
        real_sum = np.sum(real_count_df[real_inds]['count'])
        real_counts.append(real_sum)

        fake_inds = (fake_count_df['date'] >= d_range[i]) & (fake_count_df['date'] <= d_range[i+1])
        fake_sum = np.sum(fake_count_df[fake_inds]['count'])
        fake_counts.append(fake_sum)
        
    return real_counts, fake_counts

real_counts, fake_counts = count_real_fake_shares()
# One count per interval, plotted at the interval's left edge
true_count_trace = go.Scatter(x=d_range[:-1], y=real_counts, name="Real news")
fake_count_trace = go.Scatter(x=d_range[:-1], y=fake_counts, line=dict(color='#cc0000'), name="Fake news")

data = [true_count_trace, fake_count_trace]
layout = go.Layout(title='Number of shares of real and fake news over time')
figure = go.Figure(data=data, layout=layout)
iplot(figure)

In both time series there is a jump in the number of articles published and shared around the US presidential election (8 November 2016). There is also another sudden increase in the sharing of fake news right after Donald Trump's inauguration as president.
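Both series rely on counting events per equal-width time interval. The sketch below shows that binning step with the standard library only. It is a simplification under stated assumptions: the dates are made up, `bin_counts` is a hypothetical helper, and intervals are closed on the left, whereas `pd.cut` in the notebook closes them on the right.

```python
from datetime import datetime

def bin_counts(dates, start, end, n_bins):
    """Count how many dates fall in each of n_bins equal-width intervals of [start, end]."""
    width = (end - start) / n_bins          # timedelta / int -> timedelta
    counts = [0] * n_bins
    for d in dates:
        # timedelta / timedelta -> float; clamp so a date equal to `end` lands in the last bin.
        i = min(int((d - start) / width), n_bins - 1)
        counts[i] += 1
    return counts

# Hypothetical publication dates around the election.
start = datetime(2016, 11, 1)
end = datetime(2016, 11, 11)
dates = [
    datetime(2016, 11, 1),
    datetime(2016, 11, 2),
    datetime(2016, 11, 8),
    datetime(2016, 11, 8, 12, 0),
    datetime(2016, 11, 11),
]

counts = bin_counts(dates, start, end, n_bins=5)
print(counts)  # [2, 0, 0, 2, 1]
```

With n_bins intervals there are n_bins + 1 edges, so each count is usually plotted against the left edge of its interval.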

<h1>Most frequent subjects in real news</h1>

In [10]:
nlp = StanfordCoreNLP('http://localhost', port=9000)  # connects to a Stanford CoreNLP server on localhost:9000

def count_named_entities(text):
    nlpout = nlp.ner(text)
    entities = {}
    entity_type_exclusions = ['O','DATE','PERCENT','NUMBER', 'DURATION']
    entity_exclusions = ['of', 'Of', 'To']
    for token in nlpout:
        if(token[1] not in entity_type_exclusions and token[0] not in entity_exclusions):
            if(token[0] in entities.keys()):
                entities[token[0]] += 1
            else:
                entities[token[0]] = 1
    return entities

def count_named_entities_df(df, column):
    total_entities = {}
    for text in df[column]:
        entities = count_named_entities(text)
        for entity in entities.keys():
            if(entity in total_entities.keys()):
                total_entities[entity] += entities[entity]
            else:
                total_entities[entity] = entities[entity]
            
    sorted_tuples = sorted(total_entities.items(), key=lambda x: x[1], reverse=True)
    return sorted_tuples

def word_frequency_tuples_to_plottable_data(tuples):
    x_labels = []
    y_counts = []
    for i in range(0,15):
        x_labels.append(tuples[i][0])
        y_counts.append(tuples[i][1])
    return x_labels, y_counts

name_frequency_tuples = count_named_entities_df(real_dataset, 'title')
x_labels, y_counts = word_frequency_tuples_to_plottable_data(name_frequency_tuples)
data = [go.Bar(x=x_labels, y=y_counts)]
layout = go.Layout(title="Most frequent subjects in real news")
figure = go.Figure(data=data, layout=layout)
iplot(figure, filename='jupyter-basic_bar')

<h1>Most frequent subjects in fake news</h1>

In [11]:
name_frequency_tuples = count_named_entities_df(fake_dataset,'title')

x_labels, y_counts = word_frequency_tuples_to_plottable_data(name_frequency_tuples)
data = [go.Bar(x=x_labels, y=y_counts, marker=dict(color='#cc0000'))]
layout = go.Layout(title="Most frequent subjects in fake news")
figure = go.Figure(data=data, layout=layout)
iplot(figure, filename='jupyter-basic_bar')

The words shown result from applying a named-entity recognition model to the article titles. The results for the two classes look similar, which is suspicious and calls for the more detailed analysis that follows.
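The counting step itself does not depend on the NER server. As a minimal sketch, assuming the tagger returns (token, tag) pairs, the filter and tally could look like this. The tagged titles are hand-written for illustration and `count_entities` is a hypothetical helper, not the notebook's function.

```python
from collections import Counter

# Tag types and tokens to ignore, mirroring the exclusion lists used in the notebook.
EXCLUDED_TYPES = {"O", "DATE", "PERCENT", "NUMBER", "DURATION"}
EXCLUDED_TOKENS = {"of", "Of", "To"}

def count_entities(tagged_titles):
    """Tally tokens whose tag marks them as a named entity."""
    counts = Counter()
    for tagged in tagged_titles:
        for token, tag in tagged:
            if tag not in EXCLUDED_TYPES and token not in EXCLUDED_TOKENS:
                counts[token] += 1
    return counts

# Hand-written (token, tag) pairs standing in for the tagger output.
tagged_titles = [
    [("Clinton", "PERSON"), ("leads", "O"), ("Trump", "PERSON"), ("in", "O"), ("Ohio", "LOCATION")],
    [("Trump", "PERSON"), ("visits", "O"), ("Ohio", "LOCATION"), ("on", "O"), ("Monday", "DATE")],
]

entity_counts = count_entities(tagged_titles)
print(entity_counts.most_common(3))
```

Because both classes of articles cover the same election, the same few names tend to dominate either tally, which is consistent with the similar charts above.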

<h1>Detecting word associations with apriori in real news</h1>

<p>Run with a minimum support of 5%, using the titles</p>

In [12]:
def df_to_bag_of_words(df, column):
    vectorizer = CountVectorizer(stop_words='english')
    bag_of_words = vectorizer.fit_transform(df[column])
    vocab = vectorizer.vocabulary_
    sorted_vocab = sorted(vocab.items(), key=operator.itemgetter(1))
    return bag_of_words, sorted_vocab

bag_of_words, vocab = df_to_bag_of_words(real_dataset[::20], 'title')
dense_bag = bag_of_words.todense()

dense_bool_bag = dense_bag == 1

dense_bool_bag = pd.DataFrame(dense_bool_bag)
dense_bool_bag.columns = [x[0] for x in vocab]

rules = apriori(dense_bool_bag, min_support=0.005, use_colnames=True)
rules['length'] = rules['itemsets'].apply(lambda x: len(x))
filtered = rules[(rules['length'] >= 4) & (rules['support'] >= 0.05 )]
print(filtered['itemsets'].iloc[0::10])

886     (ahead, 50, eagle, aren)                                            
896     (hillary, donald, clinton, according)                               
906     (hillary, clinton, according, trump)                                
916     (clinton, according, trump, poll)                                   
926     (points, donald, according, national)                               
936     (hillary, according, national, points)                              
946     (points, according, trump, leads)                                   
956     (alleged, federal, investigation, anthony)                          
966     (alleged, minor, federal, investigation)                            
976     (minor, federal, investigation, anthony)                            
986     (points, rising, eagle, aren)                                       
996     (correct, attack, rising, terrorist)                                
1006    (attack, politically, rising, terrorist)                            

<h1>Detecting word associations with apriori in fake news</h1>

In [13]:
bag_of_words, vocab = df_to_bag_of_words(fake_dataset.iloc[0::40], 'title')

dense_bag = bag_of_words.todense()

dense_bool_bag = dense_bag == 1

dense_bool_bag = pd.DataFrame(dense_bool_bag)
dense_bool_bag.columns = [x[0] for x in vocab]

rules = apriori(dense_bool_bag, min_support=0.005, use_colnames=True)
rules['length'] = rules['itemsets'].apply(lambda x: len(x))
filtered = rules[(rules['length'] >= 4) & (rules['support'] >= 0.09 )]
print(filtered['itemsets'].iloc[0::10])

496     (charlotte, means, guess, 70)                                                             
506     (rioters, means, guess, 70)                                                               
516     (daily, clinton, dollars, billions)                                                       
526     (clinton, freedom, dollars, billions)                                                     
536     (earthquake, clinton, steal, billions)                                                    
546     (clinton, sick, billions, haiti)                                                          
556     (daily, sick, dollars, billions)                                                          
566     (daily, exploited, billions, haiti)                                                       
576     (daily, steal, billions, haiti)                                                           
586     (freedom, dollars, exploited, billions)                                                   
596     (d

Apriori was more effective at revealing different themes in each type of news. In the fake news, for example, there are repeated references to the rumor that the Clinton Foundation diverted money intended to aid victims of an earthquake in Haiti. In the real news, articles analyzing a comment by Vice President Joe Biden about a crisis in the housing sector appear with some frequency.
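The quantity being thresholded is the support of a word set: the fraction of titles that contain every word in the set. The sketch below computes it by brute force on hypothetical tokenized titles. It is not mlxtend's Apriori (there is no candidate pruning), and `frequent_itemsets` is a name introduced here for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for all itemsets of `size` with support >= min_support."""
    counts = Counter()
    for words in transactions:
        # Each distinct word counts once per title; sorting gives a canonical tuple.
        for itemset in combinations(sorted(set(words)), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical tokenized titles.
titles = [
    ["clinton", "foundation", "haiti", "billions"],
    ["clinton", "haiti", "earthquake"],
    ["trump", "leads", "poll"],
    ["clinton", "trump", "poll"],
]

result = frequent_itemsets(titles, size=2, min_support=0.5)
print(result)  # {('clinton', 'haiti'): 0.5, ('poll', 'trump'): 0.5}
```

Enumerating every combination grows quickly with title length and itemset size, which is the cost that Apriori's pruning of infrequent subsets is designed to avoid.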

<h1>Occurrence of parts of speech</h1>

In [14]:
# Counts nouns, verbs and adjectives (adverbs are counted but not returned)
def count_nva(df, column):
    nouns = 0
    verbs = 0
    adjs = 0
    advs = 0
    for i in range(0, df.shape[0]):
        expr = df[column].iloc[i]
        tags = nlp.pos_tag(expr)
        for word, tag in tags:
            if tag[0] == 'N':
                nouns += 1
            elif tag[0] == 'V':
                verbs += 1
            elif tag in ('JJ', 'JJR', 'JJS'):
                adjs += 1
            elif tag in ('RB', 'RBR', 'RBS'):
                advs += 1
    return [nouns, verbs, adjs]

def count_words(df, column):
    word_count = 0
    for i in range(0, df.shape[0]):
        word_count += len(word_tokenize(df[column].iloc[i]))
    return word_count


real_gram_count = count_nva(real_dataset,'title')
fake_gram_count = count_nva(fake_dataset,'title')
real_word_count = count_words(real_dataset, 'title')
fake_word_count = count_words(fake_dataset, 'title')

real_pctgs = np.multiply(np.divide(real_gram_count, real_word_count), 100)
fake_pctgs = np.multiply(np.divide(fake_gram_count, fake_word_count), 100)
labels = ['% nouns', '% verbs', '% adjectives']
real_trace = go.Bar(
    x=labels,
    y=real_pctgs,
    name='Real news'
)
fake_trace = go.Bar(
    x=labels,
    y=fake_pctgs,
    name='Fake news',
    marker=dict(
        color='#cc0000'
    )
)
data = [real_trace, fake_trace]

layout = go.Layout(
    barmode='group',
    title='Part-of-speech analysis: Real and fake news'
)
figure = go.Figure(data=data, layout=layout)
iplot(figure)

The distribution of parts of speech turned out to be very similar for both types of news. The analysis of adverbs was omitted because of the poor adverb-tagging results of the natural language processing model used.
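As a small illustration of how such percentages are obtained from tagger output, the sketch below assumes Penn Treebank tags in (token, tag) pairs. The tagged title is hand-written and `pos_shares` is a hypothetical helper. It divides by the number of tagged tokens, whereas the notebook divides by a separate `word_tokenize` count.

```python
def pos_shares(tagged_tokens):
    """Return the percentage of nouns, verbs and adjectives among the tagged tokens."""
    counts = {"noun": 0, "verb": 0, "adjective": 0}
    for _, tag in tagged_tokens:
        if tag.startswith("N"):            # NN, NNS, NNP, NNPS
            counts["noun"] += 1
        elif tag.startswith("V"):          # VB, VBD, VBG, VBN, VBP, VBZ
            counts["verb"] += 1
        elif tag in ("JJ", "JJR", "JJS"):  # adjectives
            counts["adjective"] += 1
    total = len(tagged_tokens)
    return {name: 100.0 * value / total for name, value in counts.items()}

# Hand-written tags standing in for the tagger output on one title.
tagged = [("Clinton", "NNP"), ("leads", "VBZ"), ("national", "JJ"), ("poll", "NN")]

shares = pos_shares(tagged)
print(shares)  # {'noun': 50.0, 'verb': 25.0, 'adjective': 25.0}
```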

<h1>Classifier study</h1>

<p>Code for building the network</p>

In [15]:
def df_to_tfidf(dataframe, column):
    vectorizer = TfidfVectorizer(stop_words='english')
    bag_of_words_matrix = vectorizer.fit_transform(dataframe[column])
    return bag_of_words_matrix

def get_real_fake_indexes(df):    
    index = pd.Index(df['label'])
    print(index)
    fake_news_indexes = np.where(index.get_loc(key=0))
    real_news_indexes = np.where(index.get_loc(key=1))
    print('fake news indexes', fake_news_indexes)
    print('real news indexes', real_news_indexes)
    return real_news_indexes, fake_news_indexes
    
def get_3d_graph_coordinates(network):
    graph = ig.Graph.Adjacency(network.tolist())
    
    vertex_num = len(graph.vs)
    
    layt = graph.layout('fr3d', dim=3)
    layt.scale(15)
    #layt.fit_into(bbox=(30, 30, 30))
    
    x_nodes = [layt[k][0] for k in range(vertex_num)]
    y_nodes = [layt[k][1] for k in range(vertex_num)]
    z_nodes = [layt[k][2] for k in range(vertex_num)]

    x_edges = []
    y_edges = []
    z_edges = []

    edges = graph.get_edgelist()
    print('number of edges: ', len(edges))
    for a, b in edges:
        # Each edge is a segment from node a to node b; None breaks the line between segments
        x_edges += [layt[a][0],layt[b][0],None]
        y_edges += [layt[a][1],layt[b][1],None]
        z_edges += [layt[a][2],layt[b][2],None]    
    
    return (x_nodes, y_nodes, z_nodes), (x_edges,y_edges,z_edges)

tfidf_matrix = df_to_tfidf(df, 'text')
network = cosine_similarity(tfidf_matrix)
# Binarize: keep an edge only where cosine similarity is at least 0.15
network[network < 0.15] = 0
network[network > 0] = 1

nodes, edges = get_3d_graph_coordinates(network)
# Red for fake news (label 0), blue for real news (label 1)
colors = df['label'].map({0: '#cc0000', 1: '#1F77B4'}).tolist()


number of edges:  2988
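The network construction amounts to computing pairwise cosine similarity and keeping the pairs at or above a threshold. The following library-free sketch illustrates that idea on hypothetical term-weight vectors. Unlike the notebook's dense matrix, it lists each undirected pair once and skips self-loops; `cosine` and `similarity_edges` are names introduced here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_edges(vectors, threshold):
    """Return pairs (i, j) with i < j whose cosine similarity is >= threshold."""
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]

# Hypothetical term-weight vectors for three documents over a four-word vocabulary.
vectors = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

edges = similarity_edges(vectors, threshold=0.15)
print(edges)  # [(0, 1)]
```

Documents 0 and 1 share one of their two terms (similarity 0.5), so they are linked; document 2 shares no term with the others and stays isolated.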


<h1>Similarity network</h1>

<p>Code for generating the plot</p>

In [16]:
def get_3d_traces_layout(nodes, edges, colors):
    edge_trace = go.Scatter3d(
        visible=True,
        x=edges[0],
        y=edges[1],
        z=edges[2],
        mode='lines',
        hoverinfo='none',
        line=dict(
            color='#000000',
            width=5,
        ),
    )

    node_trace = go.Scatter3d(
        x=nodes[0],
        y=nodes[1],
        z=nodes[2],
        mode='markers',
        showlegend=False,
        hoverinfo='none',    
        marker=dict(
            symbol='circle',
            size=6,
            color=colors,
            colorscale='Viridis',
            line=dict(
                color='#FFFFFF',
                width=0.5
            )
        )
    )

    axis=dict(
        showbackground=True,
        showline=True,
        zeroline=True,
        showgrid=True,
        showticklabels=True,
        title='Similarity network'
    )

    plot_layout = go.Layout(
        title='Similarity network',
        width=1000,
        height=1000,
        scene=dict(
            xaxis=dict(axis),
            yaxis=dict(axis),
            zaxis=dict(axis),
        ),
        margin=dict(
            t=100
        ),
        showlegend=False,
        hovermode='closest',
        annotations=[
            dict(
                showarrow=False,
                xref='paper',
                yref='paper',
                x=0,
                y=0.1,
                xanchor='left',
                yanchor='bottom',
                font=dict(
                    size=14
                )
            )]
    )
    return node_trace, edge_trace, plot_layout

node_trace, edge_trace, plot_layout = get_3d_traces_layout(nodes, edges, colors)

data=[node_trace, edge_trace]
fig=go.Figure(data=data, layout=plot_layout)
iplot(fig)

<h1>Classifier implementation and cross-validation, training on 1 fold and testing on 9</h1>

<p>This validation method was chosen for two reasons. First, 
the paper consulted cites the cost and difficulty of building up-to-date datasets
as one of the main obstacles to progress in research on fake news analysis.
Good performance with a small set of labeled data is therefore a highly
desirable property for the classifier.<br/>
Second, the course covered semi-supervised learning methods, which are aimed specifically 
at reducing the dependence on a human annotator. This study was therefore
meant to test that property of this type of model.
</p>
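The "train on 1 fold, test on 9" scheme can be sketched without scikit-learn. This is a simplified illustration: it assigns folds round-robin and is not stratified, whereas the notebook swaps the index arrays produced by `StratifiedKFold`. The helper name `reversed_folds` is an assumption introduced here.

```python
def reversed_folds(n_samples, n_splits):
    """Yield (train, test) index lists where each fold in turn is the training set."""
    # Round-robin assignment: sample i goes to fold i % n_splits.
    folds = [list(range(i, n_samples, n_splits)) for i in range(n_splits)]
    for k, train in enumerate(folds):
        # Everything outside fold k becomes the test set.
        test = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

splits = list(reversed_folds(n_samples=20, n_splits=10))
for train, test in splits[:2]:
    print(len(train), "training samples,", len(test), "test samples")
```

With 10 folds, each round labels only about 10% of the samples and evaluates on the remaining 90%, which is the low-label regime the paragraph above describes.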

Label propagation validation code

In [17]:
def get_average_cm(cms):
    # Element-wise mean of the per-fold 2x2 confusion matrices
    return np.mean(np.array(cms, dtype=float), axis=0)

def get_percentile_cm(cm):
    # Rescale the matrix so that its four cells sum to 100
    cm_sum = np.sum(cm)
    cm = np.multiply(np.divide(cm, cm_sum),100)
    return cm

def count_dif(arr1, arr2):
    count = 0
    for i in range(0, len(arr1)):
        if(arr1[i] != arr2[i]):
            count += 1
    return count

def lb_prop_classify(network, labels, reverse):
    kf = StratifiedKFold(n_splits=10)
    scores = []
    cms = []

    for train_index, test_index in kf.split(network, labels):
        if reverse:
            # Swap the roles: train on the single fold, test on the other nine
            train_index, test_index = test_index, train_index

        orig_labels = np.array(labels)
        train_labels = np.array(labels)

        # -1 marks a sample as unlabeled for LabelPropagation
        train_labels[test_index] = -1

        label_spreading_model = LabelPropagation()
        label_spreading_model.fit(network, train_labels)

        # Labels inferred for every sample, including the unlabeled ones
        prediction = label_spreading_model.transduction_
        pred_sample = prediction[test_index]
        test_sample = orig_labels[test_index]

        diffs = count_dif(pred_sample, test_sample)
        diff_rate = diffs/len(test_sample)
        scores.append(1-diff_rate)
        cms.append(confusion_matrix(test_sample, pred_sample, labels=label_spreading_model.classes_))

    avg_score = np.average(scores)
    percentile_cm = get_percentile_cm(get_average_cm(cms))
    return avg_score, percentile_cm




In [18]:
#print(df.columns)
text_df = df[['text','title']].copy()
text_df['combined'] = text_df[['text','title']].apply(lambda x: ' '.join(x),axis=1)
#print(text_df.iloc[1])

tfidf_matrix = df_to_tfidf(text_df, 'text')
network = cosine_similarity(tfidf_matrix)
score_lp_text, cm_text = lb_prop_classify(network, df['label'].tolist(), True)

tfidf_matrix = df_to_tfidf(text_df, 'title')
network = cosine_similarity(tfidf_matrix)
score_lp_title, cm_title = lb_prop_classify(network, df['label'].tolist(), True)

tfidf_matrix = df_to_tfidf(text_df, 'combined')
network = cosine_similarity(tfidf_matrix)
score_lp_comb, cm_comb = lb_prop_classify(network, df['label'].tolist(), True)

print('Label propagation score using:  text {} title {} text+title {}'.format(score_lp_text, score_lp_title, score_lp_comb))


Label propagation score using:  text 0.6158660540239488 title 0.5508214981899192 text+title 0.6156042884990253


Training and validation of conventional learning models, based on an attribute-value matrix, for comparison with Label Propagation. KNN, logistic regression and Random Forest were used.

In [19]:
def attr_val_classify(data, labels, reverse_folds):
    kf = StratifiedKFold(n_splits=10)
    knn_scores = []
    rf_scores = []
    lr_scores = []

    lr_cms = []

    rf = RandomForestClassifier(criterion='entropy')
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
    lr = LogisticRegression()
    orig_labels = np.array(labels)
    # Dense ndarray so that rows can be selected by index arrays
    data = data.toarray()

    for train_index, test_index in kf.split(data, labels):
        if reverse_folds:
            # Train on the single fold, test on the other nine
            train_index, test_index = test_index, train_index

        train_labels = orig_labels[train_index]
        train_data = data[train_index]

        test_labels = orig_labels[test_index]
        test_data = data[test_index]

        knn.fit(train_data, train_labels)
        knn_scores.append(knn.score(test_data, test_labels))

        rf.fit(train_data, train_labels)
        rf_scores.append(rf.score(test_data, test_labels))

        lr.fit(train_data, train_labels)
        lr_scores.append(lr.score(test_data, test_labels))

        lr_results = lr.predict(test_data)
        lr_cms.append(confusion_matrix(test_labels, lr_results, labels=lr.classes_))
    
    print('Mean KNN accuracy', np.average(knn_scores))
    print('Mean random forest accuracy', np.average(rf_scores))
    print('Mean logistic regression accuracy', np.average(lr_scores))
    
    lr_cm = get_percentile_cm(get_average_cm(lr_cms))
    return np.average(knn_scores), np.average(rf_scores), np.average(lr_scores), lr_cm

tfidf_matrix = df_to_tfidf(text_df, 'combined')
knn_score, rf_score, lr_score, lr_cm = attr_val_classify(tfidf_matrix, df['label'].tolist(), True)

scores = [score_lp_text, score_lp_title, score_lp_comb, knn_score, rf_score, lr_score]
score_labels = ['Label Prop. (Text)', 'Label Prop. (Title)', 'Label Prop. (Text+Title)', 'KNN (Text+Title)', 'Random Forest (Text+Title)', 'Logistic Regression (Text+Title)']

Mean KNN accuracy 0.5982024505708716
Mean random forest accuracy 0.5652937900306322
Mean logistic regression accuracy 0.6398064605959343


In [20]:
colors = ['#f4f142', '#f4f142', '#f4f142', '#1F77B4', '#30c95e', '#d6411b']

score_trace = go.Bar(
    x=score_labels,
    y=scores,
    marker=dict(color=colors)
)

data = [score_trace]

layout=go.Layout(
    title='Classifier accuracy'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Interestingly, logistic regression achieved the highest accuracy (about 0.64), followed by label propagation (about 0.62 on the text and on text+title). Because it is a semi-supervised model, Label Propagation was expected to perform best under the test conditions (training on 1 fold, testing on 9). Even so, the performance of the network-based model was still satisfactory.

In [21]:
print(cm_comb)
cells = [['The article is fake', 'The article is real'],[cm_comb[0][0],cm_comb[1][0]],[cm_comb[0][1],cm_comb[1][1]]]
columns = ['Predicted fake','Predicted real']

layout = go.Layout(
    font=dict(family='Courier New, monospace', size=15, color='#7f7f7f')
)
trace = go.Table(
    columnwidth = [40],
    header=dict(values=['','Prediction: Fake', 'Prediction: Real']),
    cells=dict(values=cells, height=40)
)
display(HTML('<h1>Confusion matrix: Label Propagation (Text+Title)</h1>'))
data = [trace]
figure = go.Figure(layout=layout, data=data)
iplot(figure)

[[30.88467615 19.11532385]
 [19.32596103 30.67403897]]


In [22]:
cm_comb = lr_cm
cells = [['The article is fake', 'The article is real'],[cm_comb[0][0],cm_comb[1][0]],[cm_comb[0][1],cm_comb[1][1]]]
columns = ['Predicted fake','Predicted real']

layout = go.Layout(
    font=dict(family='Courier New, monospace', size=15, color='#7f7f7f')
)
trace = go.Table(
    columnwidth = [40],
    header=dict(values=['','Prediction: Fake', 'Prediction: Real']),
    cells=dict(values=cells, height=40)
)
display(HTML('<h1>Confusion matrix: Logistic Regression (Text+Title)</h1>'))
data = [trace]
figure = go.Figure(layout=layout, data=data)
iplot(figure)

When the classifier makes a wrong prediction, it is preferable to keep down the number of predictions that accuse an article of being fake when it is in fact real. The confusion matrix shows that logistic regression gave the more desirable result in this respect.
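The error of interest can be read directly off a percent-normalized confusion matrix. The sketch below uses the Label Propagation matrix printed earlier in the notebook, assuming rows are the true class and columns the predicted class, with index 0 for fake and 1 for real (the order of scikit-learn's `classes_` for labels 0 and 1). The logistic regression matrix is not printed in the notebook, so it is not reproduced here.

```python
# Percent-normalized confusion matrix printed by the notebook for Label Propagation
# (rows: true class, columns: predicted class; index 0 = fake, 1 = real).
cm = [[30.88467615, 19.11532385],
      [19.32596103, 30.67403897]]

# Share of all articles classified correctly (the diagonal).
accuracy = cm[0][0] + cm[1][1]

# Share of real articles wrongly predicted as fake.
real_flagged_fake = 100 * cm[1][0] / sum(cm[1])

# Share of fake articles wrongly predicted as real.
fake_missed = 100 * cm[0][1] / sum(cm[0])

print(f"accuracy: {accuracy:.2f}%")
print(f"real articles flagged as fake: {real_flagged_fake:.2f}%")
print(f"fake articles predicted as real: {fake_missed:.2f}%")
```

The diagonal sums to about 61.56%, which agrees with the text+title label propagation score reported above, and roughly 38.7% of real articles are flagged as fake under this model.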