**Polarization Classifier**

**Final project**

**Diego Mazorra**

**Psych 757**

This notebook aims to build the polarization classifier. For this task, we use data downloaded through the software Synthesio, with access authorized by the University of Wisconsin-Madison and the research group on Social Media and Democracy from the School of Journalism and Mass Communication of the same university.
The data were compiled by selecting all interactions on the social network Twitter (X) on the days of the presidential debates for the 2024 elections in the USA. The first debate was between the candidates Joe Biden and Donald J. Trump. The second debate was between the candidates Kamala Harris and Donald J. Trump.
The key search terms were: a) interactions (mentions and retweets); b) the boolean marker Biden AND Trump AND Kamala AND Trump; c) a date-specific search for the political debates, one for the first debate and one for the second.
The collection cap was 50,000 tweets per retrieval. The tweets were therefore downloaded in time windows in which the total interactions stayed below 50,000, and the publicly available interactions of each window were exported to a .csv file.
The objective of this notebook is to produce a single dataset from this raw data and to further anonymize the publicly available information.

**PART 1. - Create and clean the datasets**


The objective of this first part of the code is to consolidate the raw data, clean it, and produce a single dataset that is ready for our later analysis. Our raw data come from different Synthesio exports, and cleaning them for processing requires several steps.

**Structure:**

Library: Imports libraries.

Merge function: A function is defined to read all .csv files from a specified folder, handle potential errors common in Synthesio exports, and concatenate them into a single pandas DataFrame (merge_debate_csvs is the name of the function).

Data: The merge_debate_csvs function is called for two separate folders, one for the Biden-Trump debate and one for the Harris-Trump debate. As a result we have two DataFrames, one for each debate.

Debate identifier: A new column, Debate, is added to each dataset (with values 1 and 2) to distinguish between the two events before they are merged.

Column selection and merge: A subset of relevant columns is selected from the raw data, and the two debate datasets are concatenated into one master DataFrame.

Data cleaning with duplicate removal: Duplicate tweets are identified and removed based on their unique Id.

Data cleaning with content removal: Tweets marked as "Deleted or protected mention" are filtered out as they contain no analyzable text.

Text Sanitization: A sanitize_text function is defined using regular expressions (re) to clean the tweet text. This function converts text to lowercase, removes URLs and special characters, and standardizes whitespace. The output is stored in a new santext column.

Sample: A random sample of 1000 tweets is drawn from the cleaned dataset to be used for annotation in the next phase of the project.

In [8]:
# Libraries required
import os
import pandas as pd
import glob
import re

In [9]:
# We need a function to build the datasets. There are many CSV files in the same folder,
# so the best option is to merge everything in each folder rather than select files by name;
# otherwise we would have to classify every CSV file by hand.

def merge_debate_csvs(folder_path, output_filename):
    csv_files = glob.glob(os.path.join(folder_path, '*.csv'))
    dfs = []

    for file in csv_files:
        try:
            df = pd.read_csv(
                file,
                encoding='utf-8',
                sep=',',
                quotechar='"',        # fields are separated by "," and quoted with "", because Synthesio uses both in the raw files
                on_bad_lines='skip',  # skip rows that break parsing (this was one of the problems in the first versions of the code)
                engine='python'
            )
            dfs.append(df)
        except Exception as e:
            print(f"Error reading {file}: {e}")  # if a file causes problems, we can identify the source

    if dfs:
        df_combined = pd.concat(dfs, ignore_index=True)
        os.makedirs('data/raw', exist_ok=True)
        df_combined.to_csv(f'data/raw/{output_filename}', index=False)
        print(f"Saved merged file: data/raw/{output_filename} — shape: {df_combined.shape}")
        return df_combined
    else:
        print(" No valid CSV files found.")
        return pd.DataFrame()
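
Because `on_bad_lines='skip'` drops malformed rows silently, the merged shape alone does not tell us how many rows were lost. The sketch below is one possible way to count them, assuming pandas 1.4 or later, where `on_bad_lines` also accepts a callable when `engine='python'` is used. The helper name `read_csv_counting_bad_lines` and the toy CSV are illustrative and not part of the original pipeline.

```python
import io
import pandas as pd

def read_csv_counting_bad_lines(source):
    """Read a CSV, skipping malformed rows but counting how many were dropped."""
    skipped = []

    def remember_and_skip(bad_line):
        # bad_line is the list of fields pandas could not fit into the header
        skipped.append(bad_line)
        return None  # returning None tells pandas to skip the row

    df = pd.read_csv(
        source,
        sep=',',
        quotechar='"',
        on_bad_lines=remember_and_skip,
        engine='python',  # callables are only supported by the python engine
    )
    return df, len(skipped)

# Toy input (hypothetical): the second data row has one field too many
toy_csv = io.StringIO("Id,title\n1,first tweet\n2,second,extra field\n3,third tweet\n")
toy_df, n_skipped = read_csv_counting_bad_lines(toy_csv)
print(toy_df.shape, n_skipped)
```

Reporting the count per file next to the merged shape would make the number of discarded rows available for the methods section.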

In [10]:
# Here we specify the location of the folders to run the previous function. 
#Also, we separate the folders to create 2 different files, one per debate.
folder_biden_trump = '/Users/diego/Downloads/USA/DEBATE BIDEN-TRUMP'
folder_kamala_trump = '/Users/diego/Downloads/USA/DEBATE KAMALA-TRUMP'

df_biden = merge_debate_csvs(folder_biden_trump, 'biden_trump_raw.csv')
df_kamala = merge_debate_csvs(folder_kamala_trump, 'kamala_trump_raw.csv')

Saved merged file: data/raw/biden_trump_raw.csv — shape: (1098310, 57)
Saved merged file: data/raw/kamala_trump_raw.csv — shape: (2747862, 57)


In [11]:
# Now we have a data file. Before advancing, we don't want surprises with the column names,
# so before cleaning we review the structure and shape of the file and make a dictionary of variables.

# Load the merged file (for Biden-Trump debate)
df_biden = pd.read_csv('data/raw/biden_trump_raw.csv', encoding='utf-8', engine='python')
print(df_biden.shape)
df_biden.head(3)  # Look at first 3 rows to visually inspect structure

(1098310, 57)


Unnamed: 0,Id,Date,Time,Media Type,Site Name,Site Domain,Mention URL,Publisher Name,Publisher Username,title,...,Youtube Comments,Youtube Favorites,User Age,User Gender,User Family Status,User Marital Status,User Affinities,User Jobs,User biography tags,Media URL
0,281380390449,2024-06-27,22:30:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/LouisianaCRs/status/1806530...,Louisiana College Republicans,LouisianaCRs,RT @cr_national: President Trump is the clear ...,...,,,,,,,,,,
1,281380353382,0000-12-31,18:09:24 -0550 LMT,Twitter,Twitter,http://www.twitter.com,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,...,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention
2,281380386657,2024-06-27,22:30:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/barneystoneage/status/18065...,“Deplorable” Barney “the Listless Vessel”,barneystoneage,"RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...",...,,,,,,,,,,


In [12]:
# Load the merged file (for Kamala-Trump debate)
df_kamala = pd.read_csv('data/raw/kamala_trump_raw.csv', encoding='utf-8', engine='python')
print(df_kamala.shape)
df_kamala.head(3)  # Look at first 3 rows to visually inspect structure

(2747862, 57)


Unnamed: 0,Id,Date,Time,Media Type,Site Name,Site Domain,Mention URL,Publisher Name,Publisher Username,title,...,Youtube Comments,Youtube Favorites,User Age,User Gender,User Family Status,User Marital Status,User Affinities,User Jobs,User biography tags,Media URL
0,287695105673,2024-09-10,22:50:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/ced1951/status/183371463200...,Charles Davis,ced1951,RT @hodgetwins: Trump is right Virginias gover...,...,,,,,,,,,,https://pbs.twimg.com/ext_tw_video_thumb/10906...
1,287695103227,2024-09-10,22:50:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/MomInHerPJs/status/18337146...,PamajaPants ?,MomInHerPJs,RT @MrJonCryer: Hi Trump supporters I know rig...,...,,,,,,,,,,
2,287695144422,2024-09-10,22:50:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/carolynbunker9/status/18337...,Old Lefty?️‍???????woke,carolynbunker9,RT @MayoIsSpicyy: I want to see Kamala Harris ...,...,,,,,,,,,,


In [13]:
# List the columns in each df
print("📌 Columns in Biden-Trump debate dataset:")
print(df_biden.columns.tolist())

print("\n📌 Columns in Kamala-Trump debate dataset:")
print(df_kamala.columns.tolist())

📌 Columns in Biden-Trump debate dataset:
['Id', 'Date', 'Time', 'Media Type', 'Site Name', 'Site Domain', 'Mention URL', 'Publisher Name', 'Publisher Username', 'title', 'Mention Content', 'Topics', 'Subtopics', 'Classifiers', 'Classifiers tags', 'Review Products', 'Sentiment', 'Star Rating', 'Country', 'State', 'City', 'Language', 'Potential Reach', 'Engagement Rate', 'Interactions Total', 'Earned Media Value', 'Facebook Views', 'Facebook Likes', 'Facebook Comments', 'Facebook Shares', 'Facebook love reactions', 'Facebook haha reactions', 'Facebook wow reactions', 'Facebook sad reactions', 'Facebook angry reactions', 'Tiktok Likes', 'Tiktok Views', 'Tiktok Shares', 'Tiktok Comments', 'Twitter Retweets', 'Twitter Favorites', 'Twitter replies', 'Instagram Likes', 'Instagram Comments', 'Youtube Views', 'Youtube Likes', 'Youtube Dislikes', 'Youtube Comments', 'Youtube Favorites', 'User Age', 'User Gender', 'User Family Status', 'User Marital Status', 'User Affinities', 'User Jobs', 'User 
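
The two column lists above are compared by eye. As a complement, the following sketch checks programmatically whether two DataFrames share the same columns before they are concatenated. The function name `compare_columns` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def compare_columns(df_a, df_b):
    """Return the columns found only in df_a and the columns found only in df_b."""
    cols_a, cols_b = set(df_a.columns), set(df_b.columns)
    return sorted(cols_a - cols_b), sorted(cols_b - cols_a)

# Toy frames (hypothetical) standing in for df_biden and df_kamala
toy_a = pd.DataFrame(columns=['Id', 'Date', 'title', 'Sentiment'])
toy_b = pd.DataFrame(columns=['Id', 'Date', 'title', 'Media URL'])

only_in_a, only_in_b = compare_columns(toy_a, toy_b)
print("Only in first:", only_in_a)
print("Only in second:", only_in_b)
```

Two empty lists would mean the concatenation introduces no columns filled with missing values for one of the debates.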

In [14]:
#Now, to avoid misinterpretation of each debate, we add a column to identify them
# Add Debate column: 1 for Biden-Trump, 2 for Kamala-Trump
df_biden['Debate'] = 1
df_kamala['Debate'] = 2

# Preview the result to verify that the columns were created with the nominal variables for each debate.
print(df_biden[['Id', 'Debate']].head(3))
print(df_kamala[['Id', 'Debate']].head(3))

             Id  Debate
0  281380390449       1
1  281380353382       1
2  281380386657       1
             Id  Debate
0  287695105673       2
1  287695103227       2
2  287695144422       2


In [15]:
# Now we create a unique database with both debates
# Merge both df
df_all = pd.concat([df_biden, df_kamala], ignore_index=True)

# Save final combined file
df_all.to_csv('data/raw/debate_tweets_all_labeled.csv', index=False, encoding='utf-8')
print(f"Combined dataset saved. Shape: {df_all.shape}")

Combined dataset saved. Shape: (3846172, 58)
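
A quick sanity check after the concatenation is to confirm that no rows were lost and that each debate keeps its own row count. The sketch below shows the idea on toy data; the frame names are illustrative only.

```python
import pandas as pd

# Toy frames (hypothetical) standing in for the two debate datasets
toy_first = pd.DataFrame({'Id': [10, 11], 'title': ['a', 'b']})
toy_second = pd.DataFrame({'Id': [20, 21, 22], 'title': ['c', 'd', 'e']})

toy_first['Debate'] = 1
toy_second['Debate'] = 2

toy_all = pd.concat([toy_first, toy_second], ignore_index=True)

# The combined row count must equal the sum of the two parts
rows_match = len(toy_all) == len(toy_first) + len(toy_second)

# Rows per debate, ordered by the debate identifier
rows_per_debate = toy_all['Debate'].value_counts().sort_index()
print(rows_match)
print(rows_per_debate)
```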


In [16]:
#now, the next step is to verify the shape and form of the file 
# Load the merged file
df_all = pd.read_csv('data/raw/debate_tweets_all_labeled.csv', encoding='utf-8', engine='python')
print(df_all.shape)
df_all.head(3)  # Look at first 3 rows to visually inspect structure

(3846172, 58)


Unnamed: 0,Id,Date,Time,Media Type,Site Name,Site Domain,Mention URL,Publisher Name,Publisher Username,title,...,Youtube Favorites,User Age,User Gender,User Family Status,User Marital Status,User Affinities,User Jobs,User biography tags,Media URL,Debate
0,281380390449,2024-06-27,22:30:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/LouisianaCRs/status/1806530...,Louisiana College Republicans,LouisianaCRs,RT @cr_national: President Trump is the clear ...,...,,,,,,,,,,1
1,281380353382,0000-12-31,18:09:24 -0550 LMT,Twitter,Twitter,http://www.twitter.com,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,...,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,1
2,281380386657,2024-06-27,22:30:00 -0500 CDT,Twitter,Twitter,http://www.twitter.com,http://twitter.com/barneystoneage/status/18065...,“Deplorable” Barney “the Listless Vessel”,barneystoneage,"RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...",...,,,,,,,,,,1


In [17]:
# Now we clean the data to anonymize it and select only the variables we are going to need.
# Define the list of selected columns
selected_columns = [
    'Id', 'Date', 'Time', 'Media Type', 'Site Domain', 'Mention URL',
    'Publisher Name', 'Publisher Username', 'title', 'Mention Content',
    'Topics', 'Subtopics', 'Classifiers', 'Classifiers tags',
    'Sentiment', 'Potential Reach', 'Engagement Rate', 'Interactions Total',
    'Earned Media Value', 'Twitter Retweets', 'Twitter Favorites',
    'Twitter replies', 'Media URL', 'Debate'  
]

# Create the new filtered DataFrame
df_selected = df_all[selected_columns]

# Preview
print(df_selected.head())

# Save (the encoding is set explicitly because Synthesio often messes up the encoding, and programs like SPSS are sensitive to that)
df_selected.to_csv('data/raw/debate_tweets_selected.csv', index=False, encoding='utf-8')
print(f"Filtered dataset saved. Shape: {df_selected.shape}")

             Id        Date                Time Media Type  \
0  281380390449  2024-06-27  22:30:00 -0500 CDT    Twitter   
1  281380353382  0000-12-31  18:09:24 -0550 LMT    Twitter   
2  281380386657  2024-06-27  22:30:00 -0500 CDT    Twitter   
3  281380349903  2024-06-27  22:30:00 -0500 CDT    Twitter   
4  281380357944  2024-06-27  22:30:00 -0500 CDT    Twitter   

              Site Domain                                        Mention URL  \
0  http://www.twitter.com  http://twitter.com/LouisianaCRs/status/1806530...   
1  http://www.twitter.com                       Deleted or protected mention   
2  http://www.twitter.com  http://twitter.com/barneystoneage/status/18065...   
3  http://www.twitter.com  http://twitter.com/unusual_doge/status/1806530...   
4  http://www.twitter.com  http://twitter.com/KeepingUPosted/status/18065...   

                              Publisher Name            Publisher Username  \
0              Louisiana College Republicans                  Louisi
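
The selected columns still include `Publisher Name` and `Publisher Username`. Since anonymization is one of the stated goals of this notebook, the sketch below shows one possible approach: replacing the username with a salted SHA-256 pseudonym, so that the same account keeps the same identifier without exposing the handle. This is an assumption about how the anonymization could be done, not the method used in the project. The `SALT` value, the function name `pseudonymize`, and the column `publisher_pseudonym` are hypothetical. Handles that appear inside the tweet text or in `Mention URL` are not covered by this step.

```python
import hashlib
import pandas as pd

# Hypothetical project-specific secret; keep it out of any shared notebook
SALT = "replace-with-a-private-value"

def pseudonymize(username, salt=SALT):
    """Map a username to a stable 16-character pseudonym via salted SHA-256."""
    if pd.isna(username):
        return ""
    digest = hashlib.sha256((salt + str(username)).encode("utf-8")).hexdigest()
    return digest[:16]

# Toy data (hypothetical) with one account appearing twice
toy = pd.DataFrame({
    'Publisher Username': ['user_a', 'user_b', 'user_a'],
    'title': ['first tweet', 'second tweet', 'third tweet'],
})

toy['publisher_pseudonym'] = toy['Publisher Username'].apply(pseudonymize)

# Drop the identifying column once the pseudonym exists
toy_anonymous = toy.drop(columns=['Publisher Username'])
print(toy_anonymous)
```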

In [18]:
# Finally, we can clean the dataset. First, we need to find duplicate tweets by their Id.
# The way we collected the data (overlapping time windows) can lead to duplicates.
# Find duplicated Ids
duplicated_ids = df_selected[df_selected.duplicated(subset='Id', keep=False)]

# Count how many duplicates exist
num_duplicates = duplicated_ids['Id'].nunique()
print(f"Number of duplicated Ids: {num_duplicates}")
print(f"Total duplicated rows: {duplicated_ids.shape[0]}")

# View examples of duplicated
if num_duplicates > 0:
    # Show the Ids that are duplicated
    duplicate_id_list = duplicated_ids['Id'].unique()
    print(f"Sample duplicated Ids: {duplicate_id_list[:3]}")

    # For up to 3 duplicated Ids, show all associated rows
    for id_val in duplicate_id_list[:3]:
        print(f"\n Rows for duplicated Id: {id_val}")
        display(df_selected[df_selected['Id'] == id_val])

Number of duplicated Ids: 1118550
Total duplicated rows: 2581706
Sample duplicated Ids: [281380390449 281380353382 281380386657]

 Rows for duplicated Id: 281380390449


Unnamed: 0,Id,Date,Time,Media Type,Site Domain,Mention URL,Publisher Name,Publisher Username,title,Mention Content,...,Sentiment,Potential Reach,Engagement Rate,Interactions Total,Earned Media Value,Twitter Retweets,Twitter Favorites,Twitter replies,Media URL,Debate
0,281380390449,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/LouisianaCRs/status/1806530...,Louisiana College Republicans,LouisianaCRs,RT @cr_national: President Trump is the clear ...,RT @cr_national: President Trump is the clear ...,...,positive,999,,,,,,,,1
961523,281380390449,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/LouisianaCRs/status/1806530...,Louisiana College Republicans,LouisianaCRs,RT @cr_national: President Trump is the clear ...,RT @cr_national: President Trump is the clear ...,...,positive,999,,,,,,,,1



 Rows for duplicated Id: 281380353382


Unnamed: 0,Id,Date,Time,Media Type,Site Domain,Mention URL,Publisher Name,Publisher Username,title,Mention Content,...,Sentiment,Potential Reach,Engagement Rate,Interactions Total,Earned Media Value,Twitter Retweets,Twitter Favorites,Twitter replies,Media URL,Debate
1,281380353382,0000-12-31,18:09:24 -0550 LMT,Twitter,http://www.twitter.com,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,...,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,1
961526,281380353382,0000-12-31,18:09:24 -0550 LMT,Twitter,http://www.twitter.com,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,...,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,Deleted or protected mention,1



 Rows for duplicated Id: 281380386657


Unnamed: 0,Id,Date,Time,Media Type,Site Domain,Mention URL,Publisher Name,Publisher Username,title,Mention Content,...,Sentiment,Potential Reach,Engagement Rate,Interactions Total,Earned Media Value,Twitter Retweets,Twitter Favorites,Twitter replies,Media URL,Debate
2,281380386657,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/barneystoneage/status/18065...,“Deplorable” Barney “the Listless Vessel”,barneystoneage,"RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...","RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...",...,neutral,1592,,,,,,,,1
961527,281380386657,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/barneystoneage/status/18065...,“Deplorable” Barney “the Listless Vessel”,barneystoneage,"RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...","RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...",...,neutral,1592,,,,,,,,1


In [19]:
# There are too many duplicates, so we need to create a clean version.
# The new version is the selected dataset without duplicated Ids (the first occurrence is kept).
df_unique = df_selected.drop_duplicates(subset='Id', keep='first').reset_index(drop=True)

# Save df
df_unique.to_csv('data/raw/debate_tweets_selected_unique.csv', index=False)

# Confirm result and print the summary to report later this in methods
print(f"Original rows: {df_selected.shape[0]}")
print(f"Rows after removing duplicates: {df_unique.shape[0]}")

Original rows: 3846172
Rows after removing duplicates: 2383016
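
Keeping the first occurrence is safe when the repeated rows are exact copies, as in the three examples displayed earlier. The sketch below is one way to check that assumption for the whole dataset: it looks for Ids whose repeated rows differ in at least one column. The toy data and variable names are illustrative.

```python
import pandas as pd

# Toy data (hypothetical): Id 1 is an exact duplicate, Id 3 repeats with different text
toy = pd.DataFrame({
    'Id': [1, 1, 2, 3, 3],
    'title': ['a', 'a', 'b', 'c', 'c (edited)'],
    'Debate': [1, 1, 1, 2, 2],
})

# After removing fully identical rows, any Id that still appears more than
# once has at least two distinct versions
distinct_versions = toy.drop_duplicates().groupby('Id').size()
conflicting_ids = distinct_versions[distinct_versions > 1].index.tolist()
print(f"Ids with conflicting rows: {conflicting_ids}")

# Deduplicate as in the notebook: keep the first row per Id
toy_unique = toy.drop_duplicates(subset='Id', keep='first').reset_index(drop=True)
print(f"Rows after removing duplicates: {toy_unique.shape[0]}")
```

An empty list of conflicting Ids would confirm that `keep='first'` discards no distinct information.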


In [20]:
# Now, we have many "Deleted or protected mention" rows, which come from Twitter itself.
# Since it is impossible to analyze them, we remove them from the dataset.
# Count total tweets before filtering
total_before = df_unique.shape[0]

# Filter out rows with "Deleted or protected mention" in Mention Content
df_cleaned = df_unique[df_unique['Mention Content'] != "Deleted or protected mention"].copy()

# Count rows removed and rows remaining
total_removed = total_before - df_cleaned.shape[0]
total_after = df_cleaned.shape[0]

# Print summary to report these after in the methods section
print(f"Total tweets before cleaning: {total_before}")
print(f"Removed tweets with deleted/protected content: {total_removed}")
print(f"Final tweets after cleaning: {total_after}")

# check how many were removed
removed_count = df_unique.shape[0] - df_cleaned.shape[0]
print(f"Removed {removed_count} rows with deleted/protected content.")

# Save DF
df_cleaned.to_csv('data/raw/debate_tweets_selected_cleaned.csv', index=False)
print("Cleaned dataset saved as 'data/raw/debate_tweets_selected_cleaned.csv'")

Total tweets before cleaning: 2383016
Removed tweets with deleted/protected content: 144304
Final tweets after cleaning: 2238712
Removed 144304 rows with deleted/protected content.
Cleaned dataset saved as 'data/raw/debate_tweets_selected_cleaned.csv'
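
The `!=` comparison above removes only exact matches of the placeholder, and rows whose `Mention Content` is missing (NaN) pass the filter. The sketch below shows a stricter variant on toy data, assuming we would also want to drop missing content and placeholder strings that differ only in case or surrounding whitespace. Whether such rows exist in the Synthesio exports is not verified here.

```python
import numpy as np
import pandas as pd

PLACEHOLDER = "Deleted or protected mention"

# Toy data (hypothetical) with an exact placeholder, a missing value,
# and a placeholder with different case and padding
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Mention Content': [
        'RT @someone: hello',
        PLACEHOLDER,
        np.nan,
        ' deleted or protected mention ',
    ],
})

content = toy['Mention Content']
is_missing = content.isna()
# Normalize case and whitespace before comparing with the placeholder
is_placeholder = content.fillna('').str.strip().str.lower().eq(PLACEHOLDER.lower())

toy_cleaned = toy[~is_missing & ~is_placeholder].copy()
print(f"Missing content: {int(is_missing.sum())}")
print(f"Placeholder content: {int(is_placeholder.sum())}")
print(f"Rows kept: {toy_cleaned.shape[0]}")
```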


In [21]:
df_cleaned.head(50)

Unnamed: 0,Id,Date,Time,Media Type,Site Domain,Mention URL,Publisher Name,Publisher Username,title,Mention Content,...,Sentiment,Potential Reach,Engagement Rate,Interactions Total,Earned Media Value,Twitter Retweets,Twitter Favorites,Twitter replies,Media URL,Debate
0,281380390449,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/LouisianaCRs/status/1806530...,Louisiana College Republicans,LouisianaCRs,RT @cr_national: President Trump is the clear ...,RT @cr_national: President Trump is the clear ...,...,positive,999,,,,,,,,1
2,281380386657,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/barneystoneage/status/18065...,“Deplorable” Barney “the Listless Vessel”,barneystoneage,"RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...","RT @BoLoudon: 🚨@MATTGAETZ, @RONNYJACKSONTX, & ...",...,neutral,1592,,,,,,,,1
3,281380349903,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/unusual_doge/status/1806530...,(Golden) ᴅᴏɢᴇ,unusual_doge,RT @RpsAgainstTrump: Trump: “I didn't have sex...,RT @RpsAgainstTrump: Trump: “I didn't have sex...,...,neutral,2321,,,,,,,https://pbs.twimg.com/amplify_video_thumb/1806...,1
4,281380357944,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/KeepingUPosted/status/18065...,Keeping You Posted ?,KeepingUPosted,RT @NickKnudsenUS: Whoa! Wonder why the North ...,RT @NickKnudsenUS: Whoa! Wonder why the North ...,...,unassigned,11650,,,,,,,https://pbs.twimg.com/media/GL81861XAAAzCkc.jpg,1
6,281380399438,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/SteveKaye/status/1806530646...,Steve Kaye,SteveKaye,Share this. And then share this. And then Vote...,Share this. And then share this. And then Vote...,...,unassigned,610,,,,,,,https://pbs.twimg.com/media/GEDaMOYWQAAbPxj.jpg,1
7,281380375907,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/OldToad99/status/1806530607...,Old Toad (Parody),OldToad99,RT @MarinaMedvin: They didn’t just actively co...,RT @MarinaMedvin: They didn’t just actively co...,...,negative,438,,,,,,,,1
8,281380422012,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/gonzales_sid/status/1806530...,Sid Gonzales,gonzales_sid,RT @SpencerAlthouse: Biden just called out Tru...,RT @SpencerAlthouse: Biden just called out Tru...,...,negative,1831,,,,,,,https://pbs.twimg.com/ext_tw_video_thumb/18065...,1
9,281380409207,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/wuldntulik2kn/status/180653...,Wuldntulik2kn,wuldntulik2kn,RT @RpsAgainstTrump: Trump: “I didn't have sex...,RT @RpsAgainstTrump: Trump: “I didn't have sex...,...,neutral,682,,,,,,,https://pbs.twimg.com/amplify_video_thumb/1806...,1
10,281380361576,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/Tigerkenshin808/status/1806...,"""The Way is in training.""~Miyamoto Musashi, Rob M",Tigerkenshin808,RT @MarinaMedvin: They didn’t just actively co...,RT @MarinaMedvin: They didn’t just actively co...,...,negative,1873,,,,,,,,1
11,281380369003,2024-06-27,22:30:00 -0500 CDT,Twitter,http://www.twitter.com,http://twitter.com/Mike19771/status/1806530575...,Mike Rindos,Mike19771,RT @RpsAgainstTrump: Trump: “I didn't have sex...,RT @RpsAgainstTrump: Trump: “I didn't have sex...,...,neutral,1328,,,,,,,https://pbs.twimg.com/amplify_video_thumb/1806...,1


In [22]:
# Now create a sanitized text column (this is why we imported the re library).
# For background on regular expressions (the first edition also covers examples of use), see:
# Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly Media, 2009.

# Function to sanitize text: lowercasing, remove URLs, special characters, and extra whitespace
def sanitize_text(text):
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)          # remove URLs (the dot is escaped so only a literal "www." matches)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)            # remove special characters
    text = re.sub(r"\s+", " ", text).strip()              # remove extra whitespace
    return text

# Apply to 'title' and store in new column 'santext'
df_cleaned['santext'] = df_cleaned['title'].apply(sanitize_text)

# preview 
print(df_cleaned[['title', 'santext']].sample(5))

# Save with sanitized text
df_cleaned.to_csv('data/raw/debate_tweets_selected_cleaned_with_santext.csv', index=False)
print("Cleaned dataset with 'santext' saved to 'data/raw/debate_tweets_selected_cleaned_with_santext.csv'")

                                                     title  \
949317   RT @RobertKennedyJr: VP Harris sounds like she...   
7547     RT @BPUnion: To be clear, we never have and ne...   
730646   This was reminiscent of when that lady called ...   
1076511  The swifities?? Yeah Trump is finished RT @Pop...   
147600   RT @reproforall: Donald Trump fact check: The ...   

                                                   santext  
949317   rt robertkennedyjr vp harris sounds like she j...  
7547     rt bpunion to be clear we never have and never...  
730646   this was reminiscent of when that lady called ...  
1076511  the swifities yeah trump is finished rt popbas...  
147600   rt reproforall donald trump fact check the vas...  
Cleaned dataset with 'santext' saved to 'data/raw/debate_tweets_selected_cleaned_with_santext.csv'
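
To make the effect of each sanitization step concrete, the sketch below applies a self-contained copy of the same steps (with the dot in `www.` escaped) to a few invented strings. The example tweets are hypothetical and not taken from the dataset.

```python
import re
import pandas as pd

def sanitize_text(text):
    """Lowercase, remove URLs and special characters, collapse whitespace."""
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)     # remove special characters
    text = re.sub(r"\s+", " ", text).strip()       # remove extra whitespace
    return text

examples = [
    "RT @BPUnion: To be clear, we never have! https://t.co/abc123",
    "Trump didn’t win 🚨 #Debate2024",
    None,
]

sanitized = [sanitize_text(t) for t in examples]
for original, clean in zip(examples, sanitized):
    print(repr(original), "->", repr(clean))
```

Note two consequences of the character filter: the "@" and "#" symbols are removed but the handle and hashtag text remain, and all non-ASCII characters (emoji, curly apostrophes, accented letters) are dropped, so "didn’t" becomes "didnt".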


In [24]:
# Now we create the sample to be coded. We could use human coders,
# but since this is a one-person project, we are trying a different approach:
# let's use ChatGPT as a coder for the first sample.
# Load cleaned and sanitized data
df_sanitized = pd.read_csv('data/raw/debate_tweets_selected_cleaned_with_santext.csv')

# Create the sample from it (random_state sets the seed for reproducibility)
sample_df = df_sanitized.sample(n=1000, random_state=42).copy()

# Reset index for clarity
sample_df.reset_index(drop=True, inplace=True)

# Save the sample
os.makedirs('data/sample', exist_ok=True)
sample_df.to_csv('data/sample/debate_tweets_sample_1000.csv', index=False)
print(" Sample saved: data/sample/debate_tweets_sample_1000.csv")
# We have a warning due to the mixed type of data in the columns.

  df_sanitized = pd.read_csv('data/raw/debate_tweets_selected_cleaned_with_santext.csv')


 Sample saved: data/sample/debate_tweets_sample_1000.csv
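
The second debate contributes far more tweets than the first, so it can be useful to check how the random sample splits between the two debates and to confirm that the fixed seed reproduces the same rows. The sketch below illustrates both checks on toy data. The sizes and names are invented for the example.

```python
import pandas as pd

# Toy population (hypothetical): 30 tweets from debate 1 and 70 from debate 2
toy = pd.DataFrame({'Id': range(100), 'Debate': [1] * 30 + [2] * 70})

# Drawing twice with the same seed should return exactly the same rows
sample_a = toy.sample(n=20, random_state=42)
sample_b = toy.sample(n=20, random_state=42)
same_rows = sample_a.equals(sample_b)

# Compare the share of each debate in the population and in the sample
share_population = toy['Debate'].value_counts(normalize=True).sort_index()
share_sample = sample_a['Debate'].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({'population': share_population, 'sample': share_sample})

print(f"Same rows with the same seed: {same_rows}")
print(comparison)
```

If the shares differ more than desired, sampling within each debate separately would be an alternative design, at the cost of no longer being a simple random sample.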


In [1]:
# This step applies only if we choose a hand-coding process: prepare the file and add columns for the polarization measures we want to try
# Load the sample dataset
# df_sample2 = pd.read_csv('data/sample/debate_tweets_sample_1000.csv')

# Add columns for manual polarization coding
# df_sample2['polar_score'] = ""       # To be filled with -7 to 7 or 'INV'
# df_sample2['pole_label'] = ""        # 'left', 'right', 'neutral', or 'inv'
# df_sample2['valid'] = ""             # 1 for valid, 0 for invalid
# df_sample2['comments'] = ""          # Optional comments field

# Save updated file
# df_sample2.to_csv('data/sample/debate_tweets_sample_1000_for_annotation.csv', index=False)
# print("File saved with annotation columns.")
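
If the hand-coding route is taken, the filled-in file can be checked against the coding scheme described in the comments above: `polar_score` from -7 to 7 or 'INV', and `pole_label` among 'left', 'right', 'neutral', or 'inv'. The sketch below is a minimal validation under those assumptions. The function name, the treatment of scores as whole numbers, and the toy annotations are illustrative.

```python
import pandas as pd

ALLOWED_POLES = {'left', 'right', 'neutral', 'inv'}

def is_valid_polar_score(value):
    """True for whole numbers from -7 to 7 or the string 'INV'."""
    if isinstance(value, str) and value.strip().upper() == 'INV':
        return True
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return number.is_integer() and -7 <= number <= 7

# Toy annotations (hypothetical): the last two rows break the scheme
toy = pd.DataFrame({
    'polar_score': [-7, '3', 'INV', 8, ''],
    'pole_label': ['left', 'right', 'inv', 'neutral', 'centre'],
})

score_ok = toy['polar_score'].apply(is_valid_polar_score)
label_ok = toy['pole_label'].str.lower().isin(ALLOWED_POLES)

# Row positions that need to be reviewed by the coder
rows_to_review = toy.index[~(score_ok & label_ok)].tolist()
print(f"Rows to review: {rows_to_review}")
```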