# *Yellow is the new black*
------

## Abstract
The creation and propagation of false information has existed since the dawn of time.
Political or financial motives often hide behind this misleading content, whether to gain credibility or to make competitors lose it.
With the advent of the Internet and the ever faster and more direct flow of information, it is becoming easier every day to deceive one's fellow citizens, and to be fooled oneself.
The term *fake news* took on a new dimension during the 2016 American presidential election, when Donald Trump used it extensively to describe the media coverage of himself. In this era of instant information, it is crucial to be critical of the information we receive. With this work, we want to highlight the risks related to the propagation of false information by using the fake news items themselves, taken from the Liar dataset. The power that these fake news vehicles hold lies mostly in the use we make of them and the resonance we give them. Our credulity becomes credibility; it's up to us to turn the equation the other way around!

------

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re

------

## Data Loading and Cleaning

In [7]:
datapath = 'data/liar_dataset/'

The files are plain *tsv* files, a format similar to *csv* but with tabs instead of commas as separators. All the column names are known from the `README` file that accompanies the data.

In [8]:
columns = ['ID', 
           'Label', 
           'Statement', 
           'Subject', 
           'Speaker', 
           'Job title', 
           'Home State', 
           'Party Affiliations', 
           'Barely True Counts', 
           'False Counts', 
           'Half True Counts', 
           'Mostly True Counts', 
           'Pants on Fire Counts', 
           'Context']
liar_df = pd.read_csv(datapath + 'train.tsv', delimiter='\t', encoding='utf-8', names=columns)

* Column 1: the ID of the statement ([ID].json).
* Column 2: the label.
* Column 3: the statement.
* Column 4: the subject(s).
* Column 5: the speaker.
* Column 6: the speaker's job title.
* Column 7: the state info.
* Column 8: the party affiliation.
* Column 9-13: the total credit history count, including the current statement.
    * 9: barely true counts.
    * 10: false counts.
    * 11: half true counts.
    * 12: mostly true counts.
    * 13: pants on fire counts.
* Column 14: the context (venue / location of the speech or statement).

In [9]:
liar_df.head()

Unnamed: 0,ID,Label,Statement,Subject,Speaker,Job title,Home State,Party Affiliations,Barely True Counts,False Counts,Half True Counts,Mostly True Counts,Pants on Fire Counts,Context
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


------

All the results and interpretations we can draw from this dataset are conditioned on it. More precisely, the dataset has been constructed from different media sources, and this selection can be biased. To lower the impact of this bias, the authors have made sure to balance the number of articles extracted between the two American political parties. But this balance may or may not reflect reality. It would be interesting to check whether the sampling can be considered representative.
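
As a minimal sketch of such a check, one could compare the share of statements per party affiliation. The helper name `party_shares` and the toy data are illustrative assumptions, not part of the original analysis; on the real data one would pass `liar_df` instead. This only measures the balance inside the dataset: judging representativeness would also require an external reference distribution.

```python
import pandas as pd

def party_shares(df, column="Party Affiliations"):
    """Return the percentage of rows per party, largest share first."""
    return df[column].value_counts(normalize=True) * 100

# Toy data for illustration only (not taken from the Liar dataset).
toy = pd.DataFrame(
    {"Party Affiliations": ["republican", "democrat", "democrat", "none"]}
)
shares = party_shares(toy)
print(shares)
```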

## What are the most prominent professions among the liars?

### What are the most prominent professions in the dataset?

In [10]:
jobs = liar_df.groupby("Job title").count()
jobs = jobs.sort_values(by=['ID'], ascending = False)
jobs = jobs[["ID"]]
jobs.head()

Job title,ID
President,492
U.S. Senator,479
Governor,391
President-Elect,273
U.S. senator,263


### Among the liars?

In [32]:
lie = liar_df[liar_df['Label'] == 'false']

For each statement, we can access the job title of the speaker. That way, we can find the most frequent jobs among the liars in this specific dataset.

In [33]:
jobs_lie = lie.groupby("Job title").count()
jobs_lie = jobs_lie[["ID"]]
jobs_lie.shape

(403, 1)

In [34]:
jobs_lie = jobs_lie.sort_values(by=['ID'], ascending = False)
jobs_lie.head()

Job title,ID
President-Elect,101
Governor,75
President,68
U.S. Senator,66
U.S. senator,51


Because not all jobs get the same media coverage, the percentage of lies per job is more informative than the raw counts themselves.
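
The same ratio can also be sketched in a single pass by aggregating a boolean column per job title. The function name `lie_ratio_by_job` and the `min_statements` threshold are hypothetical additions, not part of the original notebook: titles that appear only once trivially reach 0% or 100%, so a minimum count may make the ranking more meaningful. The toy data are illustrative only.

```python
import pandas as pd

def lie_ratio_by_job(df, min_statements=1):
    """Percentage of statements labelled 'false' per job title."""
    flagged = df.assign(is_lie=df["Label"].eq("false"))
    stats = flagged.groupby("Job title")["is_lie"].agg(lies="sum", total="count")
    stats["ratio (%)"] = stats["lies"] / stats["total"] * 100
    # Keep only titles with enough statements (hypothetical threshold).
    stats = stats[stats["total"] >= min_statements]
    return stats.sort_values("ratio (%)", ascending=False)

# Toy data for illustration only (not taken from the Liar dataset).
toy = pd.DataFrame({
    "Job title": ["Governor", "Governor", "Governor", "Governor", "Analyst"],
    "Label": ["false", "true", "half-true", "true", "false"],
})
print(lie_ratio_by_job(toy, min_statements=2))
```

With `min_statements=1` the single-statement title ranks first at 100%; raising the threshold removes it from the ranking.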

In [35]:
jobs_lie = jobs_lie.join(jobs, lsuffix = '_lie', rsuffix = '_total')

In [36]:
jobs_lie["ratio (%)"] = jobs_lie["ID_lie"] / jobs_lie["ID_total"] * 100

In [38]:
jobs_lie.sort_values(by=['ratio (%)'], ascending = False)

Job title,ID_lie,ID_total,ratio (%)
"vice president, Hilex Poly Co.",1,1,100.000000
Florida House of Representatives,1,1,100.000000
Former NYPD detective,1,1,100.000000
Fox and Friends co-host,2,2,100.000000
Fox News legal analyst,1,1,100.000000
presidential candidate,1,1,100.000000
president of the Virginia Citizens Defense League,1,1,100.000000
Fulton County Commissioner,1,1,100.000000
Georgia Emergency Management Agency director,1,1,100.000000
president of the Federation for American Immigration Reform,1,1,100.000000


### Lie in politics

People who carry a political message generally have a party affiliation.

In [26]:
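# Keep only the false statements whose speaker has a party affiliation
lie_party = lie[lie['Party Affiliations'] != 'none']
# Count the statements per job title, most frequent first
lie_party = lie_party.groupby("Job title").count()
lie_party = lie_party[["ID"]].sort_values(by=['ID'], ascending = False)
lie_party.head()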

Job title,ID
President-Elect,101
Governor,75
President,68
U.S. Senator,66
U.S. senator,51


In [41]:
# Non-capturing group: str.contains warns when the pattern has capture groups
politics_regex = re.compile(r'\b(?:mayor|president|council|house|candidate|political|assembly|republican|governor|senator)s*\b', re.I)
politics_liar_sets = jobs_lie[jobs_lie.index.str.contains(politics_regex, regex=True)]
politics_liar_sets


Job title,ID_lie,ID_total,ratio (%)
President-Elect,101,273,36.996337
Governor,75,391,19.181586
President,68,492,13.821138
U.S. Senator,66,479,13.778706
U.S. senator,51,263,19.391635
Senator,33,147,22.448980
Presidential candidate,29,254,11.417323
Former governor,28,176,15.909091
U.S. House of Representatives,27,102,26.470588
State Senator,22,108,20.370370


Is the President really the most prominent liar? Maybe not, 