# Creating Few-Shot and Zero-Shot Training Datasets

This notebook contains the code for concatenating and shuffling (with a fixed random seed) the training sets of all datasets except one, holding out each dataset in turn as the "target" (test) dataset. This is done once per target for the five-shot scenario and once per target for the zero-shot scenario. For the five-shot scenario, five random samples from the target dataset are added to the concatenated training data, also using a fixed random state for reproducibility. In the zero-shot scenario this step is omitted, as the aim is to test a classifier trained on no samples from the target dataset.

For the five random samples for the few-shot mixed training dataset, different random seeds will be tried to ensure there is at least one instance of every label (real/fake news) for the target samples.
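
The seed search described above is done by hand later in the notebook (by re-running with a different `random_seed` and reading the printed labels). As a minimal sketch of how that search could be automated, the hypothetical helper below tries candidate seeds in order and returns the first one whose sample contains every label. The function name, its parameters and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def find_seed_with_all_labels(target_df, number_samples=5, candidate_seeds=range(50),
                              label_column="label", min_per_class=1):
    """Hypothetical helper: returns the first seed (and its sample) that covers every label."""
    all_labels = set(target_df[label_column].unique())
    for seed in candidate_seeds:
        # Same sampling call as in the few-shot function: n rows with a fixed random state
        sample = target_df.sample(n=number_samples, random_state=seed)
        counts = sample[label_column].value_counts()
        # Accepts the seed only if every label occurs at least min_per_class times
        if set(counts.index) == all_labels and counts.min() >= min_per_class:
            return seed, sample
    raise ValueError("No candidate seed produced a sample containing every label.")


# Toy target dataset (illustrative only): 10 real (0) and 10 fake (1) rows
toy_target_df = pd.DataFrame({
    "text": [f"article {i}" for i in range(20)],
    "label": [0, 1] * 10,
})

chosen_seed, chosen_sample = find_seed_with_all_labels(toy_target_df)
print("Chosen seed:", chosen_seed)
print(chosen_sample["label"].value_counts())
```

Setting `min_per_class=2` would enforce the 2-3 split mentioned in the code comments further down rather than just "at least one of each".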

The "training" datasets concatenated here are the train splits that do *not* contain a validation split, as k-fold cross-validation will be used on the total training set to find the optimal models.

## Environment Setup

In [9]:
# Imports required libraries
import os
import pandas as pd
import numpy as np
from sklearn.utils import shuffle # For randomly shuffling the merged datasets

In [10]:
def createMixedDatasetForFewShotScenario(list_of_source_train_dfs, target_df_train, number_samples=5, random_seed=5):
    """
    Creates a mixed dataset for use in the few-shot evaluation scenario.
        
        Input Parameters:
            list_of_source_train_dfs (list of pd.DataFrame): list of source training datasets (all datasets except the target) to shuffle and concatenate
            target_df_train (pd.DataFrame): the target dataset (that the model will be evaluated on)
            number_samples (int): the number of samples to include from the target dataset
            random_seed (int): the random seed for reproducibly selecting the target samples and shuffling the merged dataset
        Output:
            concat_df_shuffled (pd.DataFrame): the shuffled and merged source DataFrame to train the pipelines on in the few-shot setting
    """
    
    # Concatenates the rows of the source datasets
    concatenated_source_df = pd.concat(list_of_source_train_dfs, axis=0, ignore_index=True)
    
    # Samples rows from the target dataset and prints their labels so the class balance can be checked manually
    # (the aim is a 2-3 proportion of real-fake or fake-real labels)
    target_sample = target_df_train.sample(n=number_samples, random_state=random_seed)
    print("Target sample:\n", target_sample["label"])

    # Adds the selected samples from the target dataset to the rest of the data samples as rows (axis=0) at the end of the DataFrame
    concatenated_target_df = pd.concat([concatenated_source_df, target_sample], axis=0, ignore_index=True)
    
    # Applies scikit-learn's in-built shuffle function to remove ordering from the dataset, adding a random seed for reproducibility
    concat_df_shuffled = shuffle(concatenated_target_df, random_state=random_seed)

    # Returns the merged dataset
    return concat_df_shuffled

In [11]:
def createMixedDatasetForZeroShotScenario(list_of_source_train_dfs, random_seed=5):
    """
    Creates a mixed dataset for use in the zero-shot evaluation scenario, as well as for creating
    a dataset combining all four text-only datasets to mitigate the domain shift issue.

    Input Parameters:
        list_of_source_train_dfs (list of pd.DataFrame): list of source training datasets to shuffle and concatenate
        random_seed (int): the random seed for reproducibility of the shuffle

    Output:
        concat_df_shuffled (pd.DataFrame): the shuffled and merged DataFrame for training in zero-shot and all-combined evaluation scenarios
    """
    # Stacks the datasets vertically using axis=0
    concatenated_source_df = pd.concat(list_of_source_train_dfs, axis=0, ignore_index=True)
    
    # Shuffles the combined DataFrame using sklearn's shuffle and random seed for reproducibility
    concat_df_shuffled = shuffle(concatenated_source_df, random_state=random_seed)
    
    return concat_df_shuffled
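
The function above is called once per held-out target later in the notebook. The same leave-one-dataset-out pattern can be sketched as a single loop over a dictionary of named DataFrames. This is only an illustration: `build_leave_one_out_sets` and the toy data are hypothetical, and the sketch shuffles with pandas' `DataFrame.sample(frac=1, random_state=...)` instead of `sklearn.utils.shuffle` so that it only depends on pandas.

```python
import pandas as pd


def build_leave_one_out_sets(named_dfs, random_seed=5):
    """Hypothetical helper: builds one zero-shot training set per held-out target dataset."""
    zero_shot_sets = {}
    for target_name in named_dfs:
        # Keeps every dataset except the current target
        sources = [df for name, df in named_dfs.items() if name != target_name]
        combined = pd.concat(sources, axis=0, ignore_index=True)
        # Shuffles all rows reproducibly (stand-in for sklearn.utils.shuffle)
        zero_shot_sets[target_name] = combined.sample(frac=1, random_state=random_seed)
    return zero_shot_sets


# Toy stand-ins for the four text-only datasets (illustrative only)
toy_dfs = {
    name: pd.DataFrame({
        "text": [f"{name} article {i}" for i in range(3)],
        "label": [0, 1, 0],
        "original_dataset": name,
    })
    for name in ["WELFake", "Constraint", "PolitiFact", "GossipCop"]
}

zero_shot_sets = build_leave_one_out_sets(toy_dfs)
for target_name, train_df in zero_shot_sets.items():
    print(target_name, "->", sorted(train_df["original_dataset"].unique()))
```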

In [12]:
# Sets up all the paths to access the different training datasets


# WELFake
wf_path = "../../FPData/WELFake/"
train_path_wf = os.path.join(wf_path, "clean_train_wf.csv") # "clean" means dataset has simply been cleaned of NaNs and duplicates
# Fakeddit
fe_path = "../../FPData/Fakeddit/"
train_path_fe = os.path.join(fe_path, "clean_train.csv")
# Constraint
ct_path = "../../FPData/Constraint/"
train_path_ct = os.path.join(ct_path, "clean_train.csv")
# PolitiFact
pf_path = "../../FPData/PolitiFact/"
train_path_pf = os.path.join(pf_path, "clean_train_pf.csv")
# GossipCop
gc_path = "../../FPData/GossipCop/"
train_path_gc = os.path.join(gc_path, "clean_train_gc.csv")

# Prints the column names for all datasets
wf_df = pd.read_csv(train_path_wf)
print(f"WELFake columns: {wf_df.columns}")
fe_df = pd.read_csv(train_path_fe)
print(f"Fakeddit columns: {fe_df.columns}")
ct_df = pd.read_csv(train_path_ct)
print(f"Constraint columns: {ct_df.columns}")
pf_df = pd.read_csv(train_path_pf)
print(f"PolitiFact columns: {pf_df.columns}")
gc_df = pd.read_csv(train_path_gc)
print(f"GossipCop columns: {gc_df.columns}")

# Renames the id columns to "original_id" to keep track of the original "source" dataset
wf_df = wf_df.rename(columns={"id": "original_id"})
fe_df = fe_df.rename(columns={"id": "original_id"})
ct_df = ct_df.rename(columns={"id": "original_id"})
pf_df = pf_df.rename(columns={"id": "original_id"})
gc_df = gc_df.rename(columns={"id": "original_id"})

# Checks it has worked
print("\n\nWELFake:", wf_df.head(), "\n\n\n")
print("Fakeddit:", fe_df.head(), "\n\n\n")
print("Constraint:", ct_df.head(), "\n\n\n")
print("PolitiFact:", pf_df.head(), "\n\n\n")
print("GossipCop:", gc_df.head(), "\n\n\n")

WELFake columns: Index(['id', 'title', 'text', 'label'], dtype='object')
Fakeddit columns: Index(['author', 'clean_title', 'created_utc', 'domain', 'hasImage', 'id',
       'image_url', 'linked_submission_id', 'num_comments', 'score',
       'subreddit', 'text', 'upvote_ratio', 'label', '3_way_label',
       '6_way_label'],
      dtype='object')
Constraint columns: Index(['id', 'text', 'label'], dtype='object')
PolitiFact columns: Index(['id', 'text', 'label'], dtype='object')
GossipCop columns: Index(['id', 'text', 'label'], dtype='object')


WELFake:    original_id                                              title  \
0        56051  The Politics of Death: Cancer and Politics, a ...   
1        30084  Governor-Elect Of Kentucky Tells The EPA To Go...   
2        40781  ARE YOU READY FOR JOE? 91% Of Obama-Biden Bund...   
3        64772  Trump win, Democratic setbacks cloud Pelosi's ...   
4        67872  Investigators ask White House for details on F...   

                          

## Preprocessing Datasets to Get the Same Columns and Datatypes

In [13]:
# Adds the hasImage column to all datasets except Fakeddit, which already has one, and sets its values to False
wf_df["hasImage"] = False
ct_df["hasImage"] = False
pf_df["hasImage"] = False
gc_df["hasImage"] = False

# Checks it has worked
print(gc_df.head())

# Adds the image_url column to all text-only datasets (all except Fakeddit, which has image URLs) and sets its values to np.nan
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
wf_df["image_url"] = np.nan
ct_df["image_url"] = np.nan
pf_df["image_url"] = np.nan
gc_df["image_url"] = np.nan
# Checks it has worked
print(gc_df.head())

# Adds a column in each dataset to mark the original dataset the news sample came from
wf_df["original_dataset"] = "WELFake"
fe_df["original_dataset"] = "Fakeddit"
ct_df["original_dataset"] = "Constraint"
pf_df["original_dataset"] = "PolitiFact"
gc_df["original_dataset"] = "GossipCop"
# Check it has worked
print(gc_df.head())

# Makes sure columns are the same in each DataFrame
wf_df_cols = wf_df[["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]]
fe_df_cols = fe_df[["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]]
ct_df_cols = ct_df[["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]]
pf_df_cols = pf_df[["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]]
gc_df_cols = gc_df[["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]]

# Check it has worked
print("\n\n", wf_df_cols.columns, "\n", fe_df_cols.columns, "\n", ct_df_cols.columns, "\n", pf_df_cols.columns, "\n", 
     gc_df_cols.columns, "\n")

            original_id                                               text  \
0      gossipcop-871425  American news and talk television show\n\nToda...   
1  gossipcop-9574358029  Angelina Jolie attends First They Killed My Fa...   
2  gossipcop-1587930811  Now that Katie Holmes & Jamie Foxx’s relations...   
3      gossipcop-890268  THE race is on to become the next Christmas Nu...   
4      gossipcop-924667  As Meghan Markle prepares to officially enter ...   

   label  hasImage  
0      0     False  
1      1     False  
2      1     False  
3      0     False  
4      0     False  
            original_id                                               text  \
0      gossipcop-871425  American news and talk television show\n\nToda...   
1  gossipcop-9574358029  Angelina Jolie attends First They Killed My Fa...   
2  gossipcop-1587930811  Now that Katie Holmes & Jamie Foxx’s relations...   
3      gossipcop-890268  THE race is on to become the next Christmas Nu...   
4      gossipco
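
The preprocessing cell above repeats the same four steps for every dataset (renaming `id`, adding `hasImage` and `image_url` where missing, tagging the source, selecting the shared columns). As a rough sketch, these steps could be collected into one helper. `harmonise_columns`, `SHARED_COLUMNS` and the toy DataFrame are hypothetical names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The six columns every dataset ends up with in the cell above
SHARED_COLUMNS = ["original_id", "text", "label", "hasImage", "image_url", "original_dataset"]


def harmonise_columns(df, dataset_name):
    """Hypothetical helper: brings one dataset into the shared column layout."""
    # rename() returns a new DataFrame, so the caller's DataFrame is not modified
    df = df.rename(columns={"id": "original_id"})
    # Text-only datasets have no image information, so defaults are filled in
    if "hasImage" not in df.columns:
        df["hasImage"] = False
    if "image_url" not in df.columns:
        df["image_url"] = np.nan
    # Records which dataset each row came from
    df["original_dataset"] = dataset_name
    return df[SHARED_COLUMNS]


# Toy text-only dataset with an extra "title" column (illustrative only)
toy_df = pd.DataFrame({
    "id": [1, 2],
    "title": ["first title", "second title"],
    "text": ["first text", "second text"],
    "label": [0, 1],
})

toy_df_cols = harmonise_columns(toy_df, "ToyDataset")
print(toy_df_cols.columns.tolist())
```

Because existing `hasImage` and `image_url` columns are left untouched, the same helper would also work for a dataset that already carries image information.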

## Concatenating the Five-Shot DataFrames

### Five-Shot Datasets (including Fakeddit multimodal dataset)

In [14]:
## Fakeddit has been excluded following the Data Analysis

# five_shot_all_except_welfake = createMixedDatasetForFewShotScenario(
#     [fe_df_cols, ct_df_cols, pf_df_cols, gc_df_cols], 
#     # changed the random seed from default 5 to 10 due to some classes being missing from the 5 random rows, need a 2-3 balance
#     wf_df_cols, number_samples=5, random_seed=10 
# )

# five_shot_all_except_fakeddit = createMixedDatasetForFewShotScenario(
#     [wf_df_cols, ct_df_cols, pf_df_cols, gc_df_cols], 
#     fe_df_cols, number_samples=5, random_seed=5
# )

# five_shot_all_except_constraint = createMixedDatasetForFewShotScenario(
#     [wf_df_cols, fe_df_cols, pf_df_cols, gc_df_cols], 
#     ct_df_cols, number_samples=5, random_seed=5
# )

# five_shot_all_except_politifact = createMixedDatasetForFewShotScenario(
#     [wf_df_cols, fe_df_cols, ct_df_cols, gc_df_cols], 
#     pf_df_cols, number_samples=5, random_seed=5
# )

# five_shot_all_except_gossipcop = createMixedDatasetForFewShotScenario(
#     [wf_df_cols, fe_df_cols, ct_df_cols, pf_df_cols], 
#     gc_df_cols, number_samples=5, random_seed=7 # changed random seed from 5 to 7 due to some labels being missing, need a 2-3 balance
# )


### Five-Shot Datasets (excluding Fakeddit multimodal dataset)

In [15]:
no_fakeddit_five_shot_target_welfake = createMixedDatasetForFewShotScenario(
    [ct_df_cols, pf_df_cols, gc_df_cols], 
    wf_df_cols, number_samples=5, random_seed=10 # Experimented with different random seeds until the sample contained labels from both classes
)

no_fakeddit_five_shot_target_constraint = createMixedDatasetForFewShotScenario(
    [wf_df_cols, pf_df_cols, gc_df_cols], 
    ct_df_cols, number_samples=5, random_seed=9
)

no_fakeddit_five_shot_target_politifact = createMixedDatasetForFewShotScenario(
    [wf_df_cols, ct_df_cols, gc_df_cols], 
    pf_df_cols, number_samples=5, random_seed=5
)

no_fakeddit_five_shot_target_gossipcop = createMixedDatasetForFewShotScenario(
    [wf_df_cols, ct_df_cols, pf_df_cols], 
    gc_df_cols, number_samples=5, random_seed=20 # changed random seed from 5 to 20 due to some labels being missing, need a 2-3 balance
)


In [23]:
def dropDuplicates(df, dataset_name):
    """
        A function which removes duplicate rows based on duplication in the "text" field
        for a news dataset and prints the number of duplicates that were removed.

        Input Parameters:
            df (pd.DataFrame): the news DataFrame (with a "text" field) to clean duplicates from
            dataset_name (str): name of the dataset for logging results of duplicate removal

        Output:
            df_no_duplicates (pd.DataFrame): the news DataFrame with duplicate "text" entries removed
    """
    # Stores the original number of rows in the DataFrame
    original_len = len(df)
    
    # Drops the duplicates based on the "text" column, keeping the first occurrence only
    df_no_duplicates = df.drop_duplicates(subset=["text"], keep="first")
    
    # Prints the number of duplicate rows that were removed
    print(f"For {dataset_name}, {original_len - len(df_no_duplicates)} duplicate rows were removed.")
    
    # Returns the DataFrame with duplicates removed
    return df_no_duplicates

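# Applies the drop duplicates function to each of the Fakeddit-including five-shot datasets
# (these variables only exist if the commented-out Fakeddit-including cell above has been run)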
five_shot_all_except_welfake_no_duplicates = dropDuplicates(five_shot_all_except_welfake, "five_shot_all_except_welfake")
five_shot_all_except_fakeddit_no_duplicates = dropDuplicates(five_shot_all_except_fakeddit, "five_shot_all_except_fakeddit")
five_shot_all_except_constraint_no_duplicates = dropDuplicates(five_shot_all_except_constraint, "five_shot_all_except_constraint")
five_shot_all_except_politifact_no_duplicates = dropDuplicates(five_shot_all_except_politifact, "five_shot_all_except_politifact")
five_shot_all_except_gossipcop_no_duplicates = dropDuplicates(five_shot_all_except_gossipcop, "five_shot_all_except_gossipcop")


# Applies the function to the five-shot datasets excluding Fakeddit
no_fakeddit_five_shot_target_welfake_no_duplicates = dropDuplicates(no_fakeddit_five_shot_target_welfake,
                                                                    "no_fakeddit_five_shot_all_except_welfake")
no_fakeddit_five_shot_target_constraint_no_duplicates = dropDuplicates(no_fakeddit_five_shot_target_constraint,
                                                                    "no_fakeddit_five_shot_all_except_constraint")
no_fakeddit_five_shot_target_politifact_no_duplicates = dropDuplicates(no_fakeddit_five_shot_target_politifact,
                                                                    "no_fakeddit_five_shot_all_except_politifact")
no_fakeddit_five_shot_target_gossipcop_no_duplicates = dropDuplicates(no_fakeddit_five_shot_target_gossipcop,
                                                                    "no_fakeddit_five_shot_all_except_gossipcop")

In [132]:
# Checks that the datasets are shuffled properly by taking a stepped slice through the data
print(five_shot_all_except_fakeddit_no_duplicates["original_dataset"].iloc[1000:10000:300], "\n")

48124    PolitiFact
54051     GossipCop
4697        WELFake
2620        WELFake
25101       WELFake
3952        WELFake
26494       WELFake
43136    Constraint
37609       WELFake
10491       WELFake
49106     GossipCop
17301       WELFake
23098       WELFake
32259       WELFake
20676       WELFake
6597        WELFake
34360       WELFake
24056       WELFake
40604       WELFake
31978       WELFake
33555       WELFake
31599       WELFake
36044       WELFake
3564        WELFake
37756       WELFake
57491     GossipCop
51303     GossipCop
97          WELFake
9438        WELFake
51368     GossipCop
Name: original_dataset, dtype: object 
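
Eyeballing a stepped slice works, but WELFake dominates the output above, presumably because it contributes most of the rows, so it is hard to tell how well mixed the data is. One possible complement, sketched below under that assumption, is to compare the share of each source dataset inside a window of rows with its share in the whole DataFrame. `compare_window_to_overall` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def compare_window_to_overall(df, column="original_dataset", start=0, stop=1000):
    """Hypothetical helper: compares category proportions in a row window with the full DataFrame."""
    overall = df[column].value_counts(normalize=True)
    window = df[column].iloc[start:stop].value_counts(normalize=True)
    # Aligns both Series on the category name; categories absent from the window get 0.0
    return pd.DataFrame({"overall": overall, "window": window}).fillna(0.0)


# Toy ordered dataset: 600 rows from "A" followed by 400 rows from "B" (illustrative only)
toy_ordered = pd.DataFrame({"original_dataset": ["A"] * 600 + ["B"] * 400})
# Shuffles reproducibly with pandas (stand-in for sklearn.utils.shuffle)
toy_shuffled = toy_ordered.sample(frac=1, random_state=5)

unshuffled_report = compare_window_to_overall(toy_ordered, start=0, stop=200)
shuffled_report = compare_window_to_overall(toy_shuffled, start=0, stop=200)
print("Before shuffling:\n", unshuffled_report)
print("After shuffling:\n", shuffled_report)
```

In a well-shuffled DataFrame the `window` column should be close to the `overall` column for any window; in the unshuffled toy data the first 200 rows come entirely from "A".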



In [87]:
# Saves the new training datasets
def saveFewShotDataset(few_shot_df, target_dataset_name):
    """
    Saves the mixed few/five-shot dataset with Fakeddit.
        
        Input Parameters:
            few_shot_df (pd.DataFrame): the combined training dataset to save
            target_dataset_name(str): the name of the target dataset
    """
    save_path =  f"../FPData/CleanFewShotDatasets_withoutValSets/five_shot_train_data_except_{target_dataset_name}.csv"
    few_shot_df.to_csv(save_path, index=False)

saveFewShotDataset(five_shot_all_except_welfake_no_duplicates, "welfake")
saveFewShotDataset(five_shot_all_except_fakeddit_no_duplicates, "fakeddit")
saveFewShotDataset(five_shot_all_except_constraint_no_duplicates, "constraint")
saveFewShotDataset(five_shot_all_except_politifact_no_duplicates, "politifact")
# Fixed: the GossipCop file was previously written from the Constraint DataFrame
saveFewShotDataset(five_shot_all_except_gossipcop_no_duplicates, "gossipcop")

In [137]:
def saveFewShotDatasetWithoutFakeddit(few_shot_df, target_dataset_name,
                                      root_path="../FPData/CleanFewShotDatasets_withoutValSets_withoutFakeddit"):
    """
    Saves the mixed few/five-shot dataset without Fakeddit.
        
        Input Parameters:
            few_shot_df (pd.DataFrame): the combined training dataset to save
            target_dataset_name(str): the name of the target dataset
            root_path (str): the name of the directory to save the .csv file to
    """
    save_path =  f"{os.path.join(root_path, target_dataset_name)}.csv"
    few_shot_df.to_csv(save_path, index=False)
    
saveFewShotDatasetWithoutFakeddit(no_fakeddit_five_shot_target_welfake_no_duplicates, "five_shot_train_data_except_welfake")
saveFewShotDatasetWithoutFakeddit(no_fakeddit_five_shot_target_constraint_no_duplicates, "five_shot_train_data_except_constraint")
saveFewShotDatasetWithoutFakeddit(no_fakeddit_five_shot_target_politifact_no_duplicates, "five_shot_train_data_except_politifact")
saveFewShotDatasetWithoutFakeddit(no_fakeddit_five_shot_target_gossipcop_no_duplicates, "five_shot_train_data_except_gossipcop")
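
The save functions above assume that the output directory already exists, since `DataFrame.to_csv` does not create missing parent directories. A minimal sketch of a more defensive variant is shown below. `save_dataset` is a hypothetical helper, and the example writes to a temporary directory rather than to the notebook's `FPData` folders.

```python
import os
import tempfile

import pandas as pd


def save_dataset(df, file_stem, root_path):
    """Hypothetical helper: saves a DataFrame as <root_path>/<file_stem>.csv, creating the directory if needed."""
    # Creates the output directory (and any missing parents) without failing if it already exists
    os.makedirs(root_path, exist_ok=True)
    save_path = os.path.join(root_path, f"{file_stem}.csv")
    df.to_csv(save_path, index=False)
    return save_path


# Toy round trip in a temporary directory (illustrative only)
toy_df = pd.DataFrame({"text": ["first text", "second text"], "label": [0, 1]})
with tempfile.TemporaryDirectory() as tmp_dir:
    nested_dir = os.path.join(tmp_dir, "nested", "output")
    saved_path = save_dataset(toy_df, "five_shot_train_data_except_toy", nested_dir)
    reloaded = pd.read_csv(saved_path)

print(reloaded)
```

Returning the path makes it easy to log or reload exactly what was written.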

## Concatenating the Zero-Shot DataFrames

### Zero-Shot Datasets: With Fakeddit

In [94]:
## Fakeddit has been excluded now following data analysis

# zero_shot_all_except_welfake = createMixedDatasetForZeroShotScenario(
#     [fe_df_cols, ct_df_cols, pf_df_cols, gc_df_cols]
# )

# zero_shot_all_except_fakeddit = createMixedDatasetForZeroShotScenario(
#     [wf_df_cols, ct_df_cols, pf_df_cols, gc_df_cols]
# )

# zero_shot_all_except_constraint = createMixedDatasetForZeroShotScenario(
#     [wf_df_cols, fe_df_cols, pf_df_cols, gc_df_cols]
# )

# zero_shot_all_except_politifact = createMixedDatasetForZeroShotScenario(
#     [wf_df_cols, fe_df_cols, ct_df_cols, gc_df_cols]
# )

# zero_shot_all_except_gossipcop = createMixedDatasetForZeroShotScenario(
#     [wf_df_cols, fe_df_cols, ct_df_cols, pf_df_cols]
# )


### Zero-Shot Datasets: Without Fakeddit

In [114]:
## Creates the mixed datasets without Fakeddit

no_fakeddit_zero_shot_all_except_welfake = createMixedDatasetForZeroShotScenario(
    [ct_df_cols, pf_df_cols, gc_df_cols]
)

no_fakeddit_zero_shot_all_except_constraint = createMixedDatasetForZeroShotScenario(
    [wf_df_cols, pf_df_cols, gc_df_cols]
)

no_fakeddit_zero_shot_all_except_politifact = createMixedDatasetForZeroShotScenario(
    [wf_df_cols, ct_df_cols, gc_df_cols]
)

no_fakeddit_zero_shot_all_except_gossipcop = createMixedDatasetForZeroShotScenario(
    [wf_df_cols, ct_df_cols, pf_df_cols]
)

In [96]:
# Applies the drop duplicates function to each of the Fakeddit-including zero-shot datasets
zero_shot_all_except_welfake_no_duplicates = dropDuplicates(zero_shot_all_except_welfake, "zero_shot_all_except_welfake")
zero_shot_all_except_fakeddit_no_duplicates = dropDuplicates(zero_shot_all_except_fakeddit, "zero_shot_all_except_fakeddit")
zero_shot_all_except_constraint_no_duplicates = dropDuplicates(zero_shot_all_except_constraint, "zero_shot_all_except_constraint")
zero_shot_all_except_politifact_no_duplicates = dropDuplicates(zero_shot_all_except_politifact, "zero_shot_all_except_politifact")
zero_shot_all_except_gossipcop_no_duplicates = dropDuplicates(zero_shot_all_except_gossipcop, "zero_shot_all_except_gossipcop")

For zero_shot_all_except_welfake, 2 duplicate rows were removed.
For zero_shot_all_except_fakeddit, 2 duplicate rows were removed.
For zero_shot_all_except_constraint, 2 duplicate rows were removed.
For zero_shot_all_except_politifact, 0 duplicate rows were removed.
For zero_shot_all_except_gossipcop, 0 duplicate rows were removed.


In [113]:
# Applies the drop duplicates function to each of the zero-shot no-Fakeddit datasets
no_fakeddit_zero_shot_all_except_welfake_no_duplicates = dropDuplicates(
    no_fakeddit_zero_shot_all_except_welfake,
    "no_fakeddit_zero_shot_all_except_welfake"
)
no_fakeddit_zero_shot_all_except_constraint_no_duplicates = dropDuplicates(
    no_fakeddit_zero_shot_all_except_constraint,
    "no_fakeddit_zero_shot_all_except_constraint"
)
no_fakeddit_zero_shot_all_except_politifact_no_duplicates = dropDuplicates(
    no_fakeddit_zero_shot_all_except_politifact,
    "no_fakeddit_zero_shot_all_except_politifact"
)
no_fakeddit_zero_shot_all_except_gossipcop_no_duplicates = dropDuplicates(
    no_fakeddit_zero_shot_all_except_gossipcop,
    "no_fakeddit_zero_shot_all_except_gossipcop"
)

For no_fakeddit_zero_shot_all_except_welfake, 2 duplicate rows were removed.
For no_fakeddit_zero_shot_all_except_constraint, 2 duplicate rows were removed.
For no_fakeddit_zero_shot_all_except_politifact, 0 duplicate rows were removed.
For no_fakeddit_zero_shot_all_except_gossipcop, 0 duplicate rows were removed.


In [98]:
def saveZeroShotDataset(zero_shot_df, target_dataset_name):
    """
    Saves the mixed zero-shot dataset with Fakeddit.
        
        Input Parameters:
            zero_shot_df (pd.DataFrame): the combined training dataset to save
            target_dataset_name(str): the name of the target dataset
    """
    save_path =  f"../FPData/CleanZeroShotDatasets_withoutValSets/zero_shot_train_data_except_{target_dataset_name}.csv"
    zero_shot_df.to_csv(save_path, index=False)

saveZeroShotDataset(zero_shot_all_except_welfake_no_duplicates, "welfake")
saveZeroShotDataset(zero_shot_all_except_fakeddit_no_duplicates, "fakeddit")
saveZeroShotDataset(zero_shot_all_except_constraint_no_duplicates, "constraint")
saveZeroShotDataset(zero_shot_all_except_politifact_no_duplicates, "politifact")
saveZeroShotDataset(zero_shot_all_except_gossipcop_no_duplicates, "gossipcop")

In [145]:
# Saves the new combined training datasets without Fakeddit
def saveZeroShotDatasetWithoutFakeddit(zero_shot_df, target_dataset_name,
                                      root_path="../FPData/CleanZeroShotDatasets_withoutValSets_withoutFakeddit"):
    """
    Saves the mixed zero-shot dataset without Fakeddit.
        
        Input Parameters:
            zero_shot_df (pd.DataFrame): the combined training dataset to save
            target_dataset_name(str): the name of the target dataset
            root_path (str): the name of the directory to save the .csv file to
    """
    save_path =  f"{os.path.join(root_path, target_dataset_name)}.csv"
    zero_shot_df.to_csv(save_path, index=False)
    
saveZeroShotDatasetWithoutFakeddit(no_fakeddit_zero_shot_all_except_welfake_no_duplicates, "zero_shot_train_data_except_welfake")
saveZeroShotDatasetWithoutFakeddit(no_fakeddit_zero_shot_all_except_constraint_no_duplicates, "zero_shot_train_data_except_constraint")
saveZeroShotDatasetWithoutFakeddit(no_fakeddit_zero_shot_all_except_politifact_no_duplicates, "zero_shot_train_data_except_politifact")
saveZeroShotDatasetWithoutFakeddit(no_fakeddit_zero_shot_all_except_gossipcop_no_duplicates, "zero_shot_train_data_except_gossipcop")

## Creating the All-Four Combined Dataset

In [170]:
# Creates a dataset of all source (text-only, excluding Fakeddit) datasets for training the model for user testing on ALL data
all_train_data = createMixedDatasetForZeroShotScenario(
    [wf_df_cols, ct_df_cols, pf_df_cols, gc_df_cols]
)

In [159]:
# Checks the 4 datasets are in there, no Fakeddit
all_train_data["original_dataset"].unique() 

array(['GossipCop', 'WELFake', 'PolitiFact', 'Constraint'], dtype=object)

In [153]:
# Saves the all_train dataset
all_train_data.to_csv("../FPData/four_training_sets_combined.csv", index=False)

In [18]:
# Loads in and combines the validation sets and test sets for evaluating on all 4 text-based datasets combined

# WELFake
val_path_wf = os.path.join(wf_path, "clean_val_wf.csv")
wf_val_df = pd.read_csv(val_path_wf)
test_path_wf = os.path.join(wf_path, "clean_test_wf.csv")
wf_test_df = pd.read_csv(test_path_wf)

# Constraint
val_path_ct = os.path.join(ct_path, "clean_val.csv")
ct_val_df = pd.read_csv(val_path_ct)
test_path_ct = os.path.join(ct_path, "clean_test.csv")
ct_test_df = pd.read_csv(test_path_ct)

# PolitiFact
val_path_pf = os.path.join(pf_path, "clean_val_pf.csv")
pf_val_df = pd.read_csv(val_path_pf)
test_path_pf = os.path.join(pf_path, "clean_test_pf.csv")
pf_test_df = pd.read_csv(test_path_pf)

# GossipCop
val_path_gc = os.path.join(gc_path, "clean_val_gc.csv")
gc_val_df = pd.read_csv(val_path_gc)
test_path_gc = os.path.join(gc_path, "clean_test_gc.csv")
gc_test_df = pd.read_csv(test_path_gc)

In [162]:
# Drops the unwanted "Unnamed" column from the WELFake validation set
wf_val_df = wf_val_df.drop("Unnamed: 0", axis=1) 
wf_val_df.head()

Unnamed: 0,id,title,text,label
0,64425,Chilean economic officials resign in blow to c...,SANTIAGO (Reuters) - Chilean President Michell...,0
1,46016,Nintendo Switch Won’t Support Video Streaming ...,A few more details have dribbled out about the...,0
2,1496,Obama hits the trail for Hillary Clinton: Will...,President Obama campaigns Tuesday with Mrs. Cl...,0
3,27368,President Elect Trump – A New Era of Unpredict...,Waking Times \nPresident Elect Trump is new ti...,1
4,66323,Turkish academic on lengthy hunger strike appe...,ANKARA (Reuters) - A sacked Turkish professor ...,0


In [19]:
# Drops the unwanted "Unnamed" column from the WELFake test set
wf_test_df = wf_test_df.drop("Unnamed: 0", axis=1)
wf_test_df.head()

Unnamed: 0,id,title,text,label
0,12113,WTF: Top Trump Advisor Tells Trump To Kill Li...,As if Donald Trump isn t paranoid and delusion...,1
1,65434,EU leaders to give mandate for next phase of B...,BRUSSELS (Reuters) - European Union leaders wi...,0
2,42045,"If this is what a “Rubio surge” looks like, Re...","On one hand, it is yet another example of how ...",0
3,50226,Morneau Shepell​ sees no impact from pension l...,OTTAWA (Reuters) - Morneau Shepell Inc (MSI.TO...,0
4,8925,"Trump touts support for NATO, but expansion la...",WASHINGTON(Reuters) - In his first major speec...,0


In [160]:
# Inspects the Constraint validation set
ct_val_df.head()

Unnamed: 0,id,text,label
0,1,Chinese converting to Islam after realising th...,1
1,2,11 out of 13 people (from the Diamond Princess...,1
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",1
3,4,Mike Pence in RNC speech praises Donald Trump’...,1
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,0


In [20]:
# Inspects the Constraint test set
ct_test_df.head()

Unnamed: 0,id,text,label
0,1,Our daily update is published. States reported...,0
1,2,Alfalfa is the only cure for COVID-19.,1
2,3,President Trump Asked What He Would Do If He W...,1
3,4,States reported 630 deaths. We are still seein...,0
4,5,This is the sixth time a global health emergen...,0


In [167]:
# Drops the unwanted "Unnamed" column from the PolitiFact validation set
pf_val_df = pf_val_df.drop("Unnamed: 0", axis=1)
pf_val_df.head()

Unnamed: 0,id,text,label
0,politifact15164,Vice President Biden on Monday raised the poss...,1
1,politifact14394,Hillary Clinton had a third and most-likely fa...,1
2,politifact3693,"WASHINGTON, May 1, 2011 — -- AMANPOUR (voice-o...",0
3,politifact409,Nightly on MTV\n\nHost Rob Dyrdek is joined by...,0
4,politifact4181,OMB HOME •\n\nHistorical Tables\n\nHistorical ...,0


In [21]:
# Drops the unwanted "Unnamed" column from the PolitiFact test set
pf_test_df = pf_test_df.drop("Unnamed: 0", axis=1)
pf_test_df.head()

Unnamed: 0,id,text,label
0,politifact15241,"(Natural News) Late last month, it was reporte...",1
1,politifact14794,ARE YOU READY? GET IT NOW!\n\nIncrease more th...,1
2,politifact201,"""Some of the hardest-working and most producti...",0
3,politifact15207,Twitter Is Doing the MOST with These Cardi B G...,1
4,politifact14727,"New York City Woman Loses Her Temper, Causes B...",1


In [169]:
# Drops the unwanted "Unnamed" column from the GossipCop validation set
gc_val_df = gc_val_df.drop("Unnamed: 0", axis=1)
gc_val_df.head()

Unnamed: 0,id,text,label
0,gossipcop-867511,Andy Cohen Knows Exactly How to Push Vicki Gun...,0
1,gossipcop-7946844362,Things are getting serious between Brad Pitt a...,1
2,gossipcop-947390,Tuesday night might have made late-night histo...,0
3,gossipcop-1500928748,Blake Shelton recently brought Gwen Stefani to...,1
4,gossipcop-839877,Music festival season is right around the corn...,0


In [22]:
# Drops the unwanted "Unnamed" column from the GossipCop test set
gc_test_df = gc_test_df.drop("Unnamed: 0", axis=1)
gc_test_df.head()

Unnamed: 0,id,text,label
0,gossipcop-852394,"— -- Last month, Elle King mysteriously posted...",0
1,gossipcop-892129,Roselyn Sanchez and Eric Winter have a new bun...,0
2,gossipcop-938801,Will Smith is taking on the naysayers when it ...,0
3,gossipcop-941383,Don't let the name of's HBO hit fool you. Whil...,0
4,gossipcop-6823819000,"PALM BEACH, Fla. (AP) — President Donald Trump...",1
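
Each of the cells above drops `"Unnamed: 0"` with a plain `drop`, which raises a `KeyError` if the cell is re-run after the column has already been removed. As a small sketch of a re-runnable alternative, the hypothetical helper below removes any column whose name starts with `"Unnamed:"` and leaves other DataFrames unchanged. The helper name and toy data are assumptions for illustration.

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Hypothetical helper: removes leftover index columns such as "Unnamed: 0", if present."""
    unnamed_columns = [col for col in df.columns if str(col).startswith("Unnamed:")]
    # Dropping an empty list of columns is a no-op, so the helper is safe to call repeatedly
    return df.drop(columns=unnamed_columns)


# Toy validation set with a leftover index column (illustrative only)
toy_val_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "id": ["toy-1", "toy-2"],
    "text": ["first text", "second text"],
    "label": [0, 1],
})

cleaned_once = drop_unnamed_columns(toy_val_df)
cleaned_twice = drop_unnamed_columns(cleaned_once)
print(cleaned_twice.columns.tolist())
```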


In [24]:
# Creates an all-four combined validation set
all_val_data = createMixedDatasetForZeroShotScenario(
    [wf_val_df, ct_val_df, pf_val_df, gc_val_df]
)

In [27]:
# Drops duplicates from the all-four combined validation set
all_val_data_no_duplicates = dropDuplicates(all_val_data, "'All four validation sets combined'")

For 'All four validation sets combined', 0 duplicate rows were removed.


In [176]:
# Saves the all-four combined validation set as .csv file
all_val_data_no_duplicates.to_csv("../FPData/four_val_sets_combined.csv", index=False)

In [30]:
# Creates an all-four combined test set
all_test_data = createMixedDatasetForZeroShotScenario(
    [wf_test_df, ct_test_df, pf_test_df, gc_test_df]
)

In [39]:
# Drops duplicates from the all-four combined test set
all_test_data_no_duplicates = dropDuplicates(all_test_data, "'All four test sets combined'")

For 'All four test sets combined', 1 duplicate rows were removed.


In [41]:
# Saves the all-four combined test set as .csv file
all_test_data_no_duplicates.to_csv("../../FPData/four_test_sets_combined.csv", index=False)