# Building the dataset for the 2024 Mexican election

The purpose of this file is to translate the raw data in the New_DB file into a group of datasets that resemble the sample photo included in the project. The sample collects the number of posts, and the reactions to each post, across the social media profiles of each candidate in the upcoming 2024 election: Xóchitl Gálvez, Claudia Sheinbaum, and Jorge Álvarez Máynez.

The New_DB file is updated daily to record the candidates' presence on different social media platforms and their audiences' response to it.

In [18]:
# imports
import os
import pandas as pd
import openpyxl

In [2]:
# We will treat Claudia as candidate 1, Galvez as 2, and Mainez as 3

In [19]:
# Window sizes: the number of days before a poll date over which we'll sum the reactions
days = [1,2,3,4,5,6,7,14,21,28]

# Dates of published polls to read (as reported by Oraculus.mx)
date_strings = [
    "2023-09-15", "2023-09-16", "2023-10-17", "2023-10-15", "2023-10-16", "2023-10-17", "2023-10-18", "2023-11-15", "2023-11-16", "2023-11-17", "2023-11-18", "2023-12-15", "2023-12-16", "2023-12-17", "2023-12-18"
]

# Target variables for each polling release (reported share of preference)
targets_claudia = [66, 60, 68, 55, 62, 65, 62, 54, 59, 63, 54, 59, 63, 66, 58]

targets_xochitl = [26, 32, 26, 35, 31, 31, 34, 35, 28, 32, 39, 35, 32, 27, 36]

# Polling firm corresponding to each date
polls = ["Enkoll", "GEA-ISA", "Covarrubias", "El FInanciero", "Enkoll", "Mendoza Blanco", "Simo", "El Financiero", "Enkoll", "Mendoza Blanco", "Reforma", "GEA-ISA", "Mendoza Blanco", "Simo", "El Financiero"]

# Columns to sum (see New_DB)
columns = ['XPosts', 'Xcomments', 'XRts', 'Xlikes', 'XCommsPPost', 'XRTsPPost', 'XlikesPPost', 'FBPosts', 'FBReactions', 'FBComments', 'FBShares', 'FBReactsPPost', 'FBCommsPPost', 'FBSharesPPost', 'IGPosts', 'IGLikes', 'IGLikesPPost', 'YTPosts', 'YTViews', 'YTViewsPPost']
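
The lists above are parallel: position k in `date_strings`, `polls`, and the target lists all describe the same poll release. Because `zip` silently truncates to the shortest input, a missing entry would go unnoticed. The following is a minimal sketch of a sanity check, run here on shortened toy lists; the helper names `check_parallel_lists` and `repeated_items` are hypothetical and not part of the original notebook.

```python
def check_parallel_lists(**named_lists):
    """Raise ValueError unless every list has the same length; return that length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return next(iter(lengths.values()))


def repeated_items(values):
    """Return the items that occur more than once, in first-seen order."""
    seen, repeated = set(), []
    for value in values:
        if value in seen and value not in repeated:
            repeated.append(value)
        seen.add(value)
    return repeated


# Shortened toy lists for illustration (not the full data above).
toy_dates = ["2023-09-15", "2023-10-17", "2023-10-15", "2023-10-17"]
toy_polls = ["Enkoll", "Covarrubias", "El Financiero", "Mendoza Blanco"]
toy_targets = [26, 26, 35, 31]

n_polls = check_parallel_lists(dates=toy_dates, polls=toy_polls, targets=toy_targets)
dup_dates = repeated_items(toy_dates)  # dates shared by more than one poll release
```

A repeated date is not necessarily an error, since two firms can publish on the same day, but it is worth confirming against the source of the poll dates.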

In [30]:
# Create a set of (date, poll, target) tuples; the targets here are Xóchitl's
date_poll_set = set(zip(date_strings, polls, targets_xochitl))
for date, poll, result in date_poll_set:
  print(f"Date: {date}, Poll: {poll}, Result: {result}")

Date: 2023-11-15, Poll: El Financiero, Result: 35
Date: 2023-11-18, Poll: Reforma, Result: 39
Date: 2023-10-18, Poll: Simo, Result: 34
Date: 2023-11-17, Poll: Mendoza Blanco, Result: 32
Date: 2023-11-16, Poll: Enkoll, Result: 28
Date: 2023-10-16, Poll: Enkoll, Result: 31
Date: 2023-12-15, Poll: GEA-ISA, Result: 35
Date: 2023-10-17, Poll: Mendoza Blanco, Result: 31
Date: 2023-12-17, Poll: Simo, Result: 27
Date: 2023-10-15, Poll: El FInanciero, Result: 35
Date: 2023-10-17, Poll: Covarrubias, Result: 26
Date: 2023-12-16, Poll: Mendoza Blanco, Result: 32
Date: 2023-12-18, Poll: El Financiero, Result: 36
Date: 2023-09-15, Poll: Enkoll, Result: 26
Date: 2023-09-16, Poll: GEA-ISA, Result: 32


In [31]:
# File Path
file_path = "New_DB.xlsx"
# Read all sheets into a dictionary of DataFrames
all_sheets = pd.read_excel(file_path, skiprows=1, sheet_name=None)

# Access each DataFrame by sheet name
galvez_df = all_sheets["Galvez"]
claudia_df = all_sheets["Claudia"]

# Convert 'Date' column to datetime if it's not already
galvez_df['Date'] = pd.to_datetime(galvez_df['Date'])
claudia_df['Date'] = pd.to_datetime(claudia_df['Date'])

In [32]:
# How much data should we have? 10 windows * n poll releases
count = 0
for i in days:
  for x, y, z in date_poll_set:
    print(i, x, y, z)
    count += 1
count

1 2023-11-15 El Financiero 35
1 2023-11-18 Reforma 39
1 2023-10-18 Simo 34
1 2023-11-17 Mendoza Blanco 32
1 2023-11-16 Enkoll 28
1 2023-10-16 Enkoll 31
1 2023-12-15 GEA-ISA 35
1 2023-10-17 Mendoza Blanco 31
1 2023-12-17 Simo 27
1 2023-10-15 El FInanciero 35
1 2023-10-17 Covarrubias 26
1 2023-12-16 Mendoza Blanco 32
1 2023-12-18 El Financiero 36
1 2023-09-15 Enkoll 26
1 2023-09-16 GEA-ISA 32
2 2023-11-15 El Financiero 35
2 2023-11-18 Reforma 39
2 2023-10-18 Simo 34
2 2023-11-17 Mendoza Blanco 32
2 2023-11-16 Enkoll 28
2 2023-10-16 Enkoll 31
2 2023-12-15 GEA-ISA 35
2 2023-10-17 Mendoza Blanco 31
2 2023-12-17 Simo 27
2 2023-10-15 El FInanciero 35
2 2023-10-17 Covarrubias 26
2 2023-12-16 Mendoza Blanco 32
2 2023-12-18 El Financiero 36
2 2023-09-15 Enkoll 26
2 2023-09-16 GEA-ISA 32
3 2023-11-15 El Financiero 35
3 2023-11-18 Reforma 39
3 2023-10-18 Simo 34
3 2023-11-17 Mendoza Blanco 32
3 2023-11-16 Enkoll 28
3 2023-10-16 Enkoll 31
3 2023-12-15 GEA-ISA 35
3 2023-10-17 Mendoza Blanco 31
...

150

In [33]:
# Directory where the CSV files are saved (created if it does not exist)
output_directory = './galvez/'
os.makedirs(output_directory, exist_ok=True)
result_dataframes = {}
for i in days:
  for x, y, z in date_poll_set:
    # Keep the rows strictly before the reference date, then take the last i of them
    # (this assumes one row per day, sorted by date)
    filtered_df = galvez_df.loc[galvez_df['Date'] < x].iloc[-i:]
    # Sum the data in the selected rows
    sum_result = filtered_df[columns].sum()
    # Turn the result into a one-row DataFrame
    pd_result = sum_result.to_frame().T
    # Add metadata columns to the new sum DataFrame
    pd_result['Candidate'] = 'Galvez'
    pd_result['Window'] = i
    pd_result['Ref. Date'] = x
    pd_result['Institute'] = y
    pd_result['Target'] = z
    # Round the numeric columns
    pd_result = pd_result.round()

    # Store the dataframe in the dictionary with a meaningful key
    key = f"2_{i}_{x}_{y}"
    result_dataframes[key] = pd_result

    # Save the dataframe to a CSV file in the specified directory
    csv_filepath = os.path.join(output_directory, f"{key}.csv")
    pd_result.to_csv(csv_filepath, index=False)
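
The window in the cell above is counted in rows: it takes the last `i` rows dated before the reference date. That equals the last `i` calendar days only if the sheet has exactly one row per day with no gaps. The sketch below, on made-up toy data with one missing day, contrasts the row-based window with a calendar-based one; `sum_last_rows` and `sum_last_days` are hypothetical helper names, and which behaviour is wanted is a decision for the author.

```python
import pandas as pd

# Toy daily log with one missing day (2023-10-13); values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-10-10", "2023-10-11", "2023-10-12", "2023-10-14"]),
    "XPosts": [1, 2, 4, 8],
})


def sum_last_rows(df, ref_date, window, cols):
    """Row-based window: sum the last `window` rows dated before ref_date."""
    before = df.loc[df["Date"] < pd.Timestamp(ref_date)].sort_values("Date")
    return before.iloc[-window:][cols].sum()


def sum_last_days(df, ref_date, window, cols):
    """Calendar-based window: sum rows dated within `window` days before ref_date."""
    ref = pd.Timestamp(ref_date)
    start = ref - pd.Timedelta(days=window)
    mask = (df["Date"] >= start) & (df["Date"] < ref)
    return df.loc[mask, cols].sum()


by_rows = sum_last_rows(toy, "2023-10-15", 3, ["XPosts"])  # uses Oct 11, 12, 14
by_days = sum_last_days(toy, "2023-10-15", 3, ["XPosts"])  # uses Oct 12, 14 only
```

With the gap, the row-based window reaches one day further back than the calendar-based one, so the two sums differ.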


We have now created single-row files, one for each combination of window, poll date, and polling firm, for this candidate. Next we join the rows that share the same window into one file per window.

In [35]:
# Set the directory where your CSV files are located
directory = './galvez/'
for i in days:

    # Specify the pattern for file names to include
    file_name_pattern = f'2_{i}_'

    # Get a list of CSV files in the directory that match the pattern
    csv_files = [file for file in os.listdir(directory) if file.endswith('.csv') and file.startswith(file_name_pattern)]

    # Ensure there are matching CSV files in the directory
    if not csv_files:
        print(f"No CSV files matching the pattern {file_name_pattern} found in the specified directory.")
    else:
        # Read every matching CSV file and stack the single-row frames into one table
        frames = [pd.read_csv(os.path.join(directory, csv_file)) for csv_file in csv_files]
        df_combined = pd.concat(frames, ignore_index=True)

        # Write the combined DataFrame to a new CSV file
        combined_output_path = f'./galvez/2_{i}.csv'
        df_combined.to_csv(combined_output_path, index=False)

        print(f"CSV files matching the pattern {file_name_pattern} successfully combined. Output saved to: {combined_output_path}")


CSV files matching the pattern 2_1_ successfully combined. Output saved to: ./galvez/2_1.csv
CSV files matching the pattern 2_2_ successfully combined. Output saved to: ./galvez/2_2.csv
CSV files matching the pattern 2_3_ successfully combined. Output saved to: ./galvez/2_3.csv
CSV files matching the pattern 2_4_ successfully combined. Output saved to: ./galvez/2_4.csv
CSV files matching the pattern 2_5_ successfully combined. Output saved to: ./galvez/2_5.csv
CSV files matching the pattern 2_6_ successfully combined. Output saved to: ./galvez/2_6.csv
CSV files matching the pattern 2_7_ successfully combined. Output saved to: ./galvez/2_7.csv
CSV files matching the pattern 2_14_ successfully combined. Output saved to: ./galvez/2_14.csv
CSV files matching the pattern 2_21_ successfully combined. Output saved to: ./galvez/2_21.csv
CSV files matching the pattern 2_28_ successfully combined. Output saved to: ./galvez/2_28.csv
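
The combine step above goes through intermediate CSV files on disk. As a possible alternative, the single-row frames could be stacked in memory and split by their `Window` column. This is only a sketch on made-up rows; `combine_by_window` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def combine_by_window(single_row_frames):
    """Stack one-row frames and split the result into one table per 'Window' value."""
    stacked = pd.concat(single_row_frames, ignore_index=True)
    return {
        int(window): group.reset_index(drop=True)
        for window, group in stacked.groupby("Window")
    }


# Made-up rows shaped like the per-poll frames built earlier (fewer columns).
rows = [
    pd.DataFrame({"XPosts": [3], "Window": [1], "Institute": ["Enkoll"], "Target": [26]}),
    pd.DataFrame({"XPosts": [5], "Window": [1], "Institute": ["GEA-ISA"], "Target": [32]}),
    pd.DataFrame({"XPosts": [9], "Window": [7], "Institute": ["Enkoll"], "Target": [26]}),
]
tables = combine_by_window(rows)  # {1: two-row table, 7: one-row table}
```

Each table in `tables` could then be written with `to_csv`, which avoids depending on file-name patterns and directory listing order.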


This process will be repeated to feed the regression models as new polls are released and as posts gain new interactions each day.
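
Since the same steps are repeated for each candidate and each data refresh, they could be wrapped in a function. The sketch below is one way to do that under stated assumptions: `build_rows` is a hypothetical helper, the two small frames stand in for the "Claudia" and "Galvez" sheets with made-up values, and the numeric prefix in each key follows the candidate numbering used in this notebook (Claudia 1, Galvez 2).

```python
import pandas as pd


def build_rows(df, candidate, code, polls, windows, cols):
    """Return {key: one-row DataFrame} for every (window, poll) pair of one candidate."""
    df = df.sort_values("Date")
    out = {}
    for window in windows:
        for date, institute, target in polls:
            # Last `window` rows dated strictly before the poll date
            before = df.loc[df["Date"] < pd.Timestamp(date)].iloc[-window:]
            row = before[cols].sum().to_frame().T
            row["Candidate"] = candidate
            row["Window"] = window
            row["Ref. Date"] = date
            row["Institute"] = institute
            row["Target"] = target
            out[f"{code}_{window}_{date}_{institute}"] = row
    return out


# Toy stand-ins for the two sheets; the XPosts values are made up.
dates = pd.date_range("2023-09-10", periods=5, freq="D")
sheets = {
    "Claudia": (1, pd.DataFrame({"Date": dates, "XPosts": [1] * 5}), [("2023-09-15", "Enkoll", 66)]),
    "Galvez": (2, pd.DataFrame({"Date": dates, "XPosts": [2] * 5}), [("2023-09-15", "Enkoll", 26)]),
}

all_rows = {}
for name, (code, frame, polls_for_candidate) in sheets.items():
    all_rows.update(build_rows(frame, name, code, polls_for_candidate, [1, 2], ["XPosts"]))
```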

This dataset is released as part of the research effort of the author of this repository, who hopes it will encourage new research, since no similar databases have been found.