### LIAR dataset: https://www.kaggle.com/datasets/csmalarkodi/liar-fake-news-dataset/data

- It came as 3 separate files (train, test, valid), all of them labeled; I merged them into one
- The labels were not binary: I dropped the 'half-true' rows and mapped the rest to a binary 'fake' column (1 for 'pants-fire', 'false', 'barely-true'; 0 for 'mostly-true', 'true')
- The exported file contains only the 'text' and 'fake' columns
- NOTE: the texts in this dataset are SHORT by design
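
To put a number on "short", one option is to look at the distribution of word counts per statement. The sketch below is only an illustration: the two rows in `sample` are hypothetical stand-ins for the exported 'text' column (the second one is made up), and splitting on whitespace is an assumed, rough definition of a word.

```python
import pandas as pd

# Hypothetical stand-in for the exported 'text' column.
sample = pd.DataFrame({
    "text": [
        "Says nearly half of Oregons children are poor.",
        "Taxes went up last year.",  # made-up statement for illustration
    ]
})

# Whitespace-delimited token count per statement.
word_counts = sample["text"].str.split().str.len()

# Summary statistics (count, mean, min, max, quartiles) of statement length.
print(word_counts.describe())
```

Running the same two lines on the real `text` column would show how short the statements are before choosing a vectorizer or model.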



In [1]:
import pandas as pd
import re
import nltk
import warnings
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline    
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, classification_report, accuracy_score, f1_score

In [5]:
df_train = pd.read_csv('LIAR dataset/train.tsv', sep='\t', header=None)

df_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


In [6]:
df_test = pd.read_csv('LIAR dataset/test.tsv', sep='\t', header=None)

df_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,11972.json,true,Building a wall on the U.S.-Mexico border will...,immigration,rick-perry,Governor,Texas,republican,30,30,42,23,18,Radio interview
1,11685.json,false,Wisconsin is on pace to double the number of l...,jobs,katrina-shankland,State representative,Wisconsin,democrat,2,1,0,0,0,a news conference
2,11096.json,false,Says John McCain has done nothing to help the ...,"military,veterans,voting-record",donald-trump,President-Elect,New York,republican,63,114,51,37,61,comments on ABC's This Week.
3,5209.json,half-true,Suzanne Bonamici supports a plan that will cut...,"medicare,message-machine-2012,campaign-adverti...",rob-cornilles,consultant,Oregon,republican,1,1,3,1,1,a radio show
4,9524.json,pants-fire,When asked by a reporter whether hes at the ce...,"campaign-finance,legal-issues,campaign-adverti...",state-democratic-party-wisconsin,,Wisconsin,democrat,5,7,2,2,7,a web video


In [7]:
df_valid = pd.read_csv('LIAR dataset/valid.tsv', sep='\t', header=None)

df_valid.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,12134.json,barely-true,We have less Americans working now than in the...,"economy,jobs",vicky-hartzler,U.S. Representative,Missouri,republican,1,0,1,0,0,an interview with ABC17 News
1,238.json,pants-fire,"When Obama was sworn into office, he DID NOT u...","obama-birth-certificate,religion",chain-email,,,none,11,43,8,5,105,
2,7891.json,false,Says Having organizations parading as being so...,"campaign-finance,congress,taxes",earl-blumenauer,U.S. representative,Oregon,democrat,0,1,1,1,0,a U.S. Ways and Means hearing
3,8169.json,half-true,Says nearly half of Oregons children are poor.,poverty,jim-francesconi,Member of the State Board of Higher Education,Oregon,none,0,1,1,1,0,an opinion article
4,929.json,half-true,On attacks by Republicans that various program...,"economy,stimulus",barack-obama,President,Illinois,democrat,70,71,160,163,9,interview with CBS News


In [16]:
df = pd.concat([df_train, df_test, df_valid], ignore_index=True)


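# The TSV files have no header row, so column names are assigned by hand.
# In the original LIAR release, columns 8-12 hold the speaker's credit-history counts
# (barely-true, false, half-true, mostly-true, pants-on-fire); they are not used here.
df.columns = [
    "id", "label", "text", "subject", "speaker", "speaker_job", 
    "state", "party", "barely_used1", "barely_used2", "barely_used3", "barely_used4", "barely_used5", "source"
]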


df.head()

Unnamed: 0,id,label,text,subject,speaker,speaker_job,state,party,barely_used1,barely_used2,barely_used3,barely_used4,barely_used5,source
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0.0,1.0,0.0,0.0,0.0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0.0,0.0,1.0,1.0,0.0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70.0,71.0,160.0,163.0,9.0,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7.0,19.0,3.0,5.0,44.0,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15.0,9.0,20.0,19.0,2.0,an interview on CNN


In [12]:
# Note: .size is the number of cells (rows * columns), not the number of rows.
print(df_train.size)
print(df_test.size)
print(df_valid.size)

143360
17738
17976


In [10]:
df.size

179074

In [17]:
df_no_half_truths = df[df['label'] != 'half-true']
df_no_half_truths.size

142296
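
The values printed by `.size` above are cell counts, that is rows times columns. Dividing by the 14 columns gives 10240, 1267, and 1284 rows for train, test, and valid, 12791 rows after concatenation, and 10164 rows after dropping 'half-true'. A minimal sketch of the difference, using a hypothetical three-row frame `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
toy = pd.DataFrame({
    "label": ["true", "half-true", "false"],
    "text": ["a", "b", "c"],
})

print(toy.size)   # number of cells: rows * columns
print(len(toy))   # number of rows
print(toy.shape)  # (rows, columns)

# Same filter as in the notebook, counted in rows rather than cells.
kept = toy[toy["label"] != "half-true"]
print(len(kept))
```

`len(df)` or `df.shape[0]` is usually the more readable choice when the goal is to report how many statements survive a filter.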

In [21]:
df_no_half_truths = df_no_half_truths.reset_index(drop=True)

In [29]:
label_map = {
    "pants-fire": 1,
    "false": 1,
    "barely-true": 1,
    "mostly-true": 0,
    "true": 0
}

df_no_half_truths["fake"] = df_no_half_truths["label"].map(label_map)
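
`Series.map` with a dictionary returns NaN for any value that is not a key, so a misspelled or unexpected label would silently become a missing value. The sketch below shows one way to check for that. The `labels` series is hypothetical, and 'unknown-label' is an invented value added only to trigger the check.

```python
import pandas as pd

label_map = {
    "pants-fire": 1,
    "false": 1,
    "barely-true": 1,
    "mostly-true": 0,
    "true": 0,
}

# Hypothetical labels; the last one is deliberately absent from label_map.
labels = pd.Series(["false", "true", "pants-fire", "unknown-label"])

mapped = labels.map(label_map)

# Labels that had no entry in label_map end up as NaN after the mapping.
unmapped = labels[mapped.isna()].unique()
print(unmapped)
```

In the notebook itself, the `info()` output for the 'fake' column (10164 non-null, int64) indicates that every remaining label was covered by the mapping.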

In [30]:
df_no_half_truths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10164 entries, 0 to 10163
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            10164 non-null  object 
 1   label         10164 non-null  object 
 2   text          10164 non-null  object 
 3   subject       10162 non-null  object 
 4   speaker       10162 non-null  object 
 5   speaker_job   7310 non-null   object 
 6   state         7949 non-null   object 
 7   party         10162 non-null  object 
 8   barely_used1  10162 non-null  float64
 9   barely_used2  10162 non-null  float64
 10  barely_used3  10162 non-null  float64
 11  barely_used4  10162 non-null  float64
 12  barely_used5  10162 non-null  float64
 13  source        10056 non-null  object 
 14  fake          10164 non-null  int64  
dtypes: float64(5), int64(1), object(9)
memory usage: 1.2+ MB


In [31]:
df_for_export = df_no_half_truths[['text','fake']]

In [32]:
df_for_export

Unnamed: 0,text,fake
0,Says the Annies List political group supports ...,1
1,"Hillary Clinton agrees with John McCain ""by vo...",0
2,Health care reform legislation is likely to ma...,1
3,The Chicago Bears have had more starting quart...,0
4,Jim Dunnam has not lived in the district he re...,1
...,...,...
10159,"In the past two years, Democrats have spent mo...",1
10160,Says Donald Trump has bankrupted his companies...,0
10161,"John McCain and George Bush have ""absolutely n...",0
10162,A new poll shows 62 percent support the presid...,1


In [28]:
df_for_export['fake'].value_counts()

fake
1    5657
0    4507
Name: count, dtype: int64

In [34]:
df_for_export.to_csv('LIAR_for_modeling.csv', index=False)
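
A quick round-trip check can confirm that the exported file reads back with the expected two columns. This is a sketch under assumptions: the `export` frame and its statements are made up, and a temporary directory is used so that nothing on disk is overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_for_export; both statements are made up.
export = pd.DataFrame({
    "text": ["A made-up statement, with a comma.", "Another made-up statement."],
    "fake": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LIAR_for_modeling.csv")
    export.to_csv(path, index=False)  # same call as in the notebook
    reloaded = pd.read_csv(path)

# The file should contain exactly the two exported columns, with no index column.
print(list(reloaded.columns))
print(reloaded["fake"].value_counts())
```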