# Reproducible Analysis of Politician Claims: A Python Demo for R2CASS@ICWSM

This Jupyter Notebook demonstrates a reproducible analysis of a subset of the LIAR dataset, focusing on politician statements and their fact-checking truth ratings. It is designed to run on Binder, so the same environment can be rebuilt and the analysis rerun without any local setup.

**Goal:** To load a dataset of politician claims, analyze the distribution of their truth ratings, and visualize the results.

## 1. Setup and Data Loading

First, we'll import the necessary libraries (`pandas` for data manipulation, `matplotlib.pyplot` and `seaborn` for plotting) and then load our dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define column names as per LIAR dataset README, since our TSV has no header
column_names = [
    'ID', 'label', 'statement', 'subjects', 'speaker', 'speaker_job_title',
    'state_info', 'party_affiliation', 'barely_true_counts', 'false_counts',
    'half_true_counts', 'mostly_true_counts', 'pants_on_fire_counts', 'context'
]

# Load the dataset.tsv file
# sep='\t' specifies tab-separated values.
# 'header=None' indicates no header row.
# 'names=column_names' assigns our custom column names.
df = pd.read_csv('../data/dataset.tsv', sep='\t', header=None, names=column_names)

print("Dataset loaded successfully!")
print(f"Number of claims: {len(df)}")
print("\nFirst 5 rows of the dataset:")
print(df.head())
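
Because the TSV has no header, a misaligned column list would go unnoticed at load time. A quick check of the `label` column can catch this early. The sketch below is a minimal illustration, not part of the original notebook: the helper name `unexpected_labels` and the toy rows are hypothetical, and the six rating strings are assumed to match those documented for the LIAR dataset.

```python
import pandas as pd

# Assumed LIAR truth ratings, from least to most truthful.
LIAR_LABELS = ['pants-fire', 'false', 'barely-true',
               'half-true', 'mostly-true', 'true']


def unexpected_labels(frame: pd.DataFrame) -> set:
    """Return label values that are not one of the six assumed ratings."""
    return set(frame['label'].dropna().unique()) - set(LIAR_LABELS)


# Toy stand-in for df (hypothetical rows, not real LIAR data).
toy = pd.DataFrame({'label': ['true', 'false', 'half-true', 'TRUE']})
bad = unexpected_labels(toy)
print(bad)  # {'TRUE'}
```

In the notebook, calling `unexpected_labels(df)` right after loading should return an empty set if the columns line up as expected.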

## 2. Analyze Distribution of Truth Ratings

Let's examine how many claims fall into each truth rating category (`label` column).

In [None]:
truth_rating_counts = df['label'].value_counts().sort_index()

print("\nDistribution of Truth Ratings:")
print(truth_rating_counts)
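
Raw counts are easier to compare across subsets when shown next to proportions. As a minimal sketch, using a hypothetical toy series in place of `df['label']`, `value_counts(normalize=True)` gives each rating's share of all claims:

```python
import pandas as pd

# Toy stand-in for df['label'] (hypothetical values, not real LIAR data).
toy_labels = pd.Series(['false', 'false', 'true', 'half-true'], name='label')

counts = toy_labels.value_counts().sort_index()
shares = toy_labels.value_counts(normalize=True).sort_index()

# Combine counts and shares into one table, aligned on the label index.
summary = pd.DataFrame({'count': counts, 'share': shares.round(3)})
print(summary)
```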

## 3. Visualize Truth Rating Distribution

A bar chart provides a clear visual summary of the truth rating distribution. The labels appear in alphabetical order, as produced by `sort_index()` in the previous step.

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=truth_rating_counts.index, y=truth_rating_counts.values, palette='viridis')
plt.title('Distribution of Politician Claim Truth Ratings')
plt.xlabel('Truth Rating')
plt.ylabel('Number of Claims')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the plot before calling plt.show(): in a notebook, show() can close
# the figure, which would leave savefig() writing a blank image.
plt.savefig('truth_ratings_distribution.png')
print("Plot saved as 'truth_ratings_distribution.png'")
plt.show()
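
Alphabetical order places `barely-true` before `false` and `pants-fire` next to `true`, which does not follow the truthfulness scale. One possible way to order the bars from least to most truthful is to reindex the counts against an explicit list. This is a sketch under assumptions: `TRUTH_ORDER` is a name introduced here, its values are assumed to match the LIAR rating strings, and the counts are toy numbers.

```python
import pandas as pd

# Assumed LIAR ratings, ordered from least to most truthful.
TRUTH_ORDER = ['pants-fire', 'false', 'barely-true',
               'half-true', 'mostly-true', 'true']

# Toy stand-in for truth_rating_counts (hypothetical numbers).
toy_counts = pd.Series({'barely-true': 3, 'false': 5, 'half-true': 4, 'true': 2})

# reindex() reorders the entries; ratings absent from the data get a count of 0.
ordered = toy_counts.reindex(TRUTH_ORDER, fill_value=0)
print(ordered)
```

Passing `ordered.index` and `ordered.values` to `sns.barplot` in place of the alphabetically sorted series would then draw the bars in this order.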

## 4. Analyze Claims by Party Affiliation

Let's also see the distribution of claims by the speaker's political party.

In [None]:
# Clean party_affiliation data: replace empty strings/NaN with 'unknown'
df['party_affiliation'] = df['party_affiliation'].replace('', 'unknown').fillna('unknown')
party_counts = df['party_affiliation'].value_counts().sort_index()

print("\nDistribution of Claims by Party Affiliation:")
print(party_counts)
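
The two distributions examined so far can also be combined into a single table. As a hedged sketch, using hypothetical toy rows rather than the real dataset, `pd.crosstab` counts claims for each combination of party and truth rating:

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows, not real LIAR data).
toy = pd.DataFrame({
    'party_affiliation': ['democrat', 'republican', 'republican', None],
    'label': ['true', 'false', 'true', 'half-true'],
})

# Same cleaning step as above: missing parties become 'unknown'.
toy['party_affiliation'] = toy['party_affiliation'].fillna('unknown')

# Rows are parties, columns are truth ratings, cells are claim counts.
table = pd.crosstab(toy['party_affiliation'], toy['label'])
print(table)
```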

## 5. Visualize Claims by Party Affiliation

Finally, a bar chart for party affiliation to complete our brief exploration.

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=party_counts.index, y=party_counts.values, palette='plasma')
plt.title('Distribution of Claims by Political Party Affiliation')
plt.xlabel('Party Affiliation')
plt.ylabel('Number of Claims')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the plot before calling plt.show(), so the saved image is not blank
plt.savefig('party_affiliation_distribution.png')
print("Plot saved as 'party_affiliation_distribution.png'")
plt.show()

## Conclusion

This notebook demonstrates how to load, analyze, and visualize data related to politician claims and their truth ratings in a reproducible Python environment. By packaging this with Binder, we ensure anyone can rerun this analysis with ease.

Feel free to modify the code, add new analyses, or explore other aspects of the dataset!