---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---

We will go through each of our data sources, cleaning and combining datasets where necessary. This page includes the following sections:

1. **Mediacloud API Data**:
   - Loads data from CSV files.
   - Drops the redundant `media_url` column, which duplicates `media_name`.
   - Saves the cleaned data back to CSV files.

2. **Liar Dataset**:
   - Loads data from TSV files.
   - Defines column names.
   - Saves the data in CSV format.

3. **NewsAPI Dataset**:
   - Loads data from a JSON file.
   - Extracts and processes various fields (source, author, title, description, URL, content, and publication date).
   - Saves the processed data to a CSV file.

4. **ChatGPT Integration**:
   - Attempts to use OpenAI's GPT model to rate the truthfulness of news articles.
   - Defines a function to interact with the GPT model.
   - Loads the cleaned NewsAPI data and sends prompts to the GPT model for rating.

5. **Conclusion**:
   - Notes that OpenAI's models do not have access to real-time data for claim validation.

# Code 

#### Mediacloud API Data

The `media_url` and `media_name` columns contain exactly the same values, so we only need to keep one of them.

In [None]:
import pandas as pd


# Load Data
df = pd.read_csv("../../data/Raw_Data/mediacloud_api_data/covid-story-list.csv")
df2 = pd.read_csv("../../data/Raw_Data/mediacloud_api_data/politics-storylist-20231129_20231130.csv")

# Drop Duplicate Column
df = df.drop(columns=['media_url'])
df2 = df2.drop(columns=['media_url'])

# Save
df.to_csv("../../data/Clean_Data/mediacloud_api_data/covid-story-list.csv", index=False)
df2.to_csv("../../data/Clean_Data/mediacloud_api_data/politics-storylist-20231129_20231130.csv", index=False)
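
The claim that the two columns are identical can be checked in code before anything is dropped. The following is a minimal sketch, not the notebook's actual procedure: `drop_if_redundant` is a hypothetical helper and the toy DataFrame stands in for the real Mediacloud story list.

```python
import pandas as pd


def drop_if_redundant(df, keep, candidate):
    """Drop `candidate` only when it matches `keep` row for row."""
    if df[candidate].equals(df[keep]):
        return df.drop(columns=[candidate])
    return df


# Hypothetical toy data standing in for the Mediacloud story list
stories = pd.DataFrame({
    "media_name": ["cnn.com", "bbc.co.uk"],
    "media_url": ["cnn.com", "bbc.co.uk"],
    "title": ["Story A", "Story B"],
})

cleaned = drop_if_redundant(stories, keep="media_name", candidate="media_url")
print(list(cleaned.columns))
```

If the columns ever differ, the helper returns the frame unchanged, so no information is lost silently.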

#### Liar Dataset

This data is stored in TSV files without a header row. Let's load it, add column names, and store it in an easier-to-use CSV format.

In [None]:
# Define column names
columns = ["ID", "Label", "Statement", "Subjects", "Speaker", "Job_Title", "State_Info", "Party", "Barely_True_Count", "False_Count", "Half_True_Count", "Mostly_True_Count", "Pants_On_Fire_Count", "Context"]

# Load the data and add the column names
test_df = pd.read_csv("../../data/Raw_Data/liar_dataset/test.tsv", sep='\t', names=columns)
train_df = pd.read_csv("../../data/Raw_Data/liar_dataset/train.tsv", sep='\t', names=columns)
valid_df = pd.read_csv("../../data/Raw_Data/liar_dataset/valid.tsv", sep='\t', names=columns)

# Save the data in csv format
test_df.to_csv("../../data/Clean_Data/liar_dataset/test.csv", index=False)
train_df.to_csv("../../data/Clean_Data/liar_dataset/train.csv", index=False)
valid_df.to_csv("../../data/Clean_Data/liar_dataset/valid.csv", index=False)
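
Since the same load-and-save step is repeated for each split, it could be wrapped in a small function. This is a sketch under assumptions: `tsv_to_csv` is a hypothetical helper, and the demo uses a shortened three-column list and made-up rows in a temporary directory rather than the real LIAR files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened, hypothetical column list for the demo only
DEMO_COLUMNS = ["ID", "Label", "Statement"]


def tsv_to_csv(src, dst, columns):
    """Read a headerless TSV, attach column names, and write it out as CSV."""
    df = pd.read_csv(src, sep="\t", names=columns)
    df.to_csv(dst, index=False)
    return df


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "demo.tsv"
    # Made-up rows that mimic the layout of the raw files
    raw_path.write_text(
        "1.json\thalf-true\tFirst statement\n"
        "2.json\tpants-fire\tSecond statement\n"
    )
    out_path = Path(tmp) / "demo.csv"
    demo_df = tsv_to_csv(raw_path, out_path, DEMO_COLUMNS)
    csv_text = out_path.read_text()

print(csv_text)
```

With the full 14-name `columns` list, the same function could be called once each for the test, train, and valid files.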

#### NewsAPI Dataset

This data is stored in a JSON file, so we have to extract the fields we need and convert them to a tabular format.

In [None]:
import json

# Import data
with open("../../data/Raw_Data/news_api_data/politics_20241124_214935.json", "r") as f:
    data = json.load(f)

In [None]:
from datetime import datetime

# Split up data
sources = []
authors = []
titles = []
descriptions = []
urls = []
year = []
month = []
day = []
time = []
content = []

for article in data["articles"]:
    sources.append(article["source"]["name"])
    authors.append(article["author"])
    titles.append(article["title"])
    descriptions.append(article["description"])
    urls.append(article["url"])
    content.append(article["content"])
    x = datetime.fromisoformat(article["publishedAt"].replace("Z", "+00:00"))
    year.append(x.year)
    month.append(x.month)
    day.append(x.day)
    time.append("AM" if x.hour < 12 else "PM")


# Create dataframe for data and save data
df = pd.DataFrame({
    "Source": sources,
    "Author": authors,
    "Title": titles,
    "Description": descriptions,
    "URL": urls,
    "Post_Year": year,
    "Post_Month": month,
    "Post_Day": day,
    "Post_Time": time,
    "Post": content
})

# Convert to csv
df.to_csv("../../data/Clean_Data/news_api_data/news_api_data.csv", index=False)
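
The timestamp handling in the loop above can be illustrated on its own. This is a minimal sketch: `split_published_at` is a hypothetical helper and the timestamps are made up. The `"Z"` suffix is replaced with `"+00:00"` because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward.

```python
from datetime import datetime


def split_published_at(timestamp):
    """Split an ISO 8601 UTC timestamp into the date parts used in the CSV."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return {
        "Post_Year": parsed.year,
        "Post_Month": parsed.month,
        "Post_Day": parsed.day,
        # Hours 0-11 count as morning, 12-23 as afternoon/evening
        "Post_Time": "AM" if parsed.hour < 12 else "PM",
    }


example = split_published_at("2024-11-24T21:49:35Z")  # made-up timestamp
print(example)
# {'Post_Year': 2024, 'Post_Month': 11, 'Post_Day': 24, 'Post_Time': 'PM'}
```

Note that the AM/PM label is based on UTC, not on the reader's or publisher's local time.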

To use this dataset, I am going to have ChatGPT rate whether it believes the information in each article, so that we can compare our model's predictions against ChatGPT's.

In [3]:
# Import API key from secure file

import json
with open('../../config.json') as f:
    keys = json.load(f)
API_KEY = keys['api_key']
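
If the config file is missing the key, the cell above fails with a bare `KeyError`. A slightly more defensive loader might look like the sketch below. It assumes the config file is a flat JSON object with an `api_key` field, as the cell above implies; `load_api_key` is a hypothetical helper and the key value is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(config_path, field="api_key"):
    """Read one secret from a JSON config file, failing with a clear message."""
    with open(config_path) as f:
        keys = json.load(f)
    if not keys.get(field):
        raise KeyError(f"'{field}' is missing or empty in {config_path}")
    return keys[field]


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder config written to a temporary directory for the demo
    demo_config = Path(tmp) / "config.json"
    demo_config.write_text(json.dumps({"api_key": "sk-placeholder"}))
    demo_key = load_api_key(demo_config)

print(demo_key)
```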

In [13]:
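# Note: openai.ChatCompletion and openai.error below belong to the legacy
# (pre-1.0) interface of the openai package, so this cell needs openai<1.0.
import openai

model = "gpt-4o"
temperature = 0.0
max_tokens = 50
openai.api_key = API_KEY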

def chat_with_gpt(prompt, model=model, temperature=temperature, max_tokens=max_tokens):
    """
    Sends a prompt to the OpenAI ChatGPT model and returns the response.

    :param prompt: The input text prompt to send to the model.
    :param model: The model to use (e.g., "gpt-4o").
    :param temperature: Sampling temperature (controls creativity).
    :param max_tokens: Maximum number of tokens in the response.
    :return: The model's response as a string.
    """
    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            temperature=temperature,
            max_tokens=max_tokens,
        )

        # Extract the assistant's reply
        reply = response['choices'][0]['message']['content'].strip()
        return reply

    except openai.error.OpenAIError as e:
        # Handle errors (e.g., network issues, API errors)
        print(f"An error occurred: {e}")
        return None
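
Because `chat_with_gpt` returns free text (or `None` on error), a caller may want to confirm that the reply is one of the six rating categories used in this notebook's prompts. This is a minimal sketch, assuming exact label matching after trimming whitespace, quotes, and a trailing period; `ALLOWED_RATINGS` and `normalize_rating` are hypothetical names that are not part of the original notebook.

```python
# The six categories named in the rating prompts
ALLOWED_RATINGS = [
    "Pants-on-Fire", "False", "Mostly False",
    "Half True", "Mostly True", "True",
]
_LOOKUP = {label.lower(): label for label in ALLOWED_RATINGS}


def normalize_rating(reply):
    """Return the canonical label for a model reply, or None if it is not one."""
    if reply is None:
        return None
    # Trim whitespace, surrounding quotes, and a trailing period
    cleaned = reply.strip().strip("'\".").strip().lower()
    return _LOOKUP.get(cleaned)


print(normalize_rating(" False. "))
print(normalize_rating("I think this is probably true"))
```

Replies that do not match exactly are mapped to `None` rather than guessed, so they can be counted or retried separately.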

In [5]:
import pandas as pd
df = pd.read_csv("../../data/Clean_Data/news_api_data/news_api_data.csv")

In [14]:
i = 1
#x = chat_with_gpt(f"I need you to rate an article as one of the following categories: Pants-on-Fire, False, Mostly False, Half True, Mostly True, True. Please ONLY respond with one of these categories. Here is the article's source: {df['Source'][i]}, Here is the article's title: {df['Title'][i]}, Here is the article's description: {df['Description'][i]}, and here is part of the post: {df['Post'][i]}")
#x = chat_with_gpt(f"Rate the truthfulness of the article using these categories: Pants-on-Fire, False, Mostly False, Half True, Mostly True, True. ONLY respond with one. Evaluate the claim from - Source: {df['Source'][i]}, Title: {df['Title'][i]}, Description: {df['Description'][i]}, Post: {df['Post'][i]} - based on facts and evidence.")
x = chat_with_gpt(f"Based on the following details, fact-check and rate the claim as Pants-on-Fire, False, Mostly False, Half True, Mostly True, or True: Source: {df['Source'][i]}, Title: {df['Title'][i]}, Description: {df['Description'][i]}, Post: {df['Post'][i]}. ONLY respond with one of these categories and ensure the claim aligns with verified factual information.")



'False'
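
The cell above rates a single article (`i = 1`). Extending this to every article could look like the sketch below, which reuses the wording of the last prompt. `PROMPT_TEMPLATE`, `build_prompt`, and `rate_articles` are hypothetical names, the records are made up, and a stub stands in for `chat_with_gpt` so the example runs without an API key.

```python
PROMPT_TEMPLATE = (
    "Based on the following details, fact-check and rate the claim as "
    "Pants-on-Fire, False, Mostly False, Half True, Mostly True, or True: "
    "Source: {Source}, Title: {Title}, Description: {Description}, "
    "Post: {Post}. ONLY respond with one of these categories and ensure "
    "the claim aligns with verified factual information."
)


def build_prompt(record):
    """Fill the prompt template from one article record (a dict-like row)."""
    return PROMPT_TEMPLATE.format(
        Source=record["Source"],
        Title=record["Title"],
        Description=record["Description"],
        Post=record["Post"],
    )


def rate_articles(records, rate_fn):
    """Apply a rating function (e.g. chat_with_gpt) to every record."""
    return [rate_fn(build_prompt(record)) for record in records]


# Made-up records; a stub replaces the real API call
records = [
    {"Source": "Example News", "Title": "T1", "Description": "D1", "Post": "P1"},
    {"Source": "Sample Daily", "Title": "T2", "Description": "D2", "Post": "P2"},
]
ratings = rate_articles(records, rate_fn=lambda prompt: "False")
print(ratings)
```

With the cleaned DataFrame, `df.to_dict("records")` would yield dictionaries of this shape, and `chat_with_gpt` could be passed as `rate_fn`.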

Unfortunately, I found that none of the OpenAI API models have access to real-time data, so they cannot validate the claims in these articles.

## Sources

[https://stackoverflow.com/questions/127803/how-do-i-parse-an-iso-8601-formatted-date-and-time](https://stackoverflow.com/questions/127803/how-do-i-parse-an-iso-8601-formatted-date-and-time)