Analyse the data in the raw dataset to answer the following questions. You are free to choose suitable analysis methods yourself, based on your knowledge of, e.g., applied statistics or other courses in your studies:

(A) Is there a significant difference between the political preferences as expressed in the survey and the election results for both electronic and polling station votes?

(B) Is there a significant difference between political preferences of the voters depending on their demographic attributes recorded in the survey (that is, age, gender, education level…)?

(C) Is there a significant difference between voters' choice of voting channel (that is, whether they decide to vote online or in person) depending on their demographic attributes recorded in the survey?
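
Question (A) compares the survey preferences with the official results. One possible approach, sketched here under stated assumptions, is a chi-square goodness-of-fit test that checks the survey counts against the counts expected from the official vote shares. The helper `goodness_of_fit` and the numbers passed to it are hypothetical placeholders, since the column layout of `pub_data_results` is not shown in this notebook. The test would be run once per channel (electronic and polling station).

```python
import numpy as np
from scipy.stats import chisquare


def goodness_of_fit(observed_counts, reference_shares):
    """Chi-square goodness-of-fit of survey counts against reference vote shares."""
    observed = np.asarray(observed_counts, dtype=float)
    shares = np.asarray(reference_shares, dtype=float)
    shares = shares / shares.sum()        # normalise so the shares sum to 1
    expected = shares * observed.sum()    # expected counts at the survey's sample size
    return chisquare(f_obs=observed, f_exp=expected)


# Placeholder numbers for illustration only, NOT taken from the dataset
survey_counts = [90, 70, 40]              # hypothetical respondents per party
official_shares = [0.45, 0.35, 0.20]      # hypothetical official vote shares
stat, p = goodness_of_fit(survey_counts, official_shares)
print("Chi-square statistic:", stat)
print("p-value:", p)
```

A small p-value would suggest that the survey distribution differs from the official one; the parties must be listed in the same order in both inputs.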

In [8]:
import pandas as pd
from scipy.stats import chi2_contingency
import statsmodels.api as sm

In [9]:
# Load the Excel data
survey_data = pd.read_excel('data/private_dataE.xlsx')
pub_data_results = pd.read_excel('data/public_data_resultsE.xlsx')
pub_data_register = pd.read_excel('data/public_data_registerE.xlsx')

In [12]:
### B)

contingency_table_gender = pd.crosstab(survey_data['sex'], survey_data['party'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table_gender)
print("Chi-square Test Results for Gender vs Political Preference")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

contingency_table_education = pd.crosstab(survey_data['education'], survey_data['party'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table_education)
print("\nChi-square Test Results for Education vs Political Preference")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

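# Approximate age from date of birth (year difference only; assumes 'dob' is a datetime column)
survey_data['age'] = 2024 - survey_data['dob'].dt.year
# Note: one row per distinct age makes this table sparse, so the chi-square approximation is unreliable here
contingency_table_age = pd.crosstab(survey_data['age'], survey_data['party'])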
chi2, p_value, dof, expected = chi2_contingency(contingency_table_age)
print("\nChi-square Test Results for Age vs Political Preference")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)


Chi-square Test Results for Gender vs Political Preference
Chi-square statistic: 1.0948595723439902
p-value: 0.578434602059867

Chi-square Test Results for Education vs Political Preference
Chi-square statistic: 34.70051433812302
p-value: 0.0043666219362705665

Chi-square Test Results for Age vs Political Preference
Chi-square statistic: 130.03690118193933
p-value: 0.5806961048843412
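
The age test above uses one row per distinct age, which leaves many cells with very small expected counts. A minimal sketch of an alternative is to bin age into groups before cross-tabulating and to check the smallest expected count. The bin edges, the helper `chi2_by_age_group`, and the synthetic data are assumptions made for illustration; they are not part of the survey.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_by_age_group(df, age_col, target_col, bins, labels):
    """Chi-square test of independence between binned age and a categorical column."""
    groups = pd.cut(df[age_col], bins=bins, labels=labels, right=False)
    table = pd.crosstab(groups, df[target_col])
    chi2, p, dof, expected = chi2_contingency(table)
    # The usual rule of thumb asks for expected counts of at least about 5 per cell
    return chi2, p, dof, expected.min()


# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 90, size=300),
    "party": rng.choice(["A", "B", "C"], size=300),
})
bins = [18, 30, 45, 60, 120]                      # assumed age-group edges
labels = ["18-29", "30-44", "45-59", "60+"]
chi2, p, dof, min_expected = chi2_by_age_group(demo, "age", "party", bins, labels)
print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Smallest expected count:", min_expected)
```

With the survey data, the call would take `survey_data`, `'age'`, and `'party'` in place of the synthetic frame.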


In [11]:
# Prepare the data
survey_data['Political_Preference_Binary'] = (survey_data['party'] == 'Green').astype(int)

# Define independent variables (age, gender, education)
X = survey_data[['age', 'sex', 'education']]
# Convert categorical variables to dummy variables. dtype=float keeps the design
# matrix numeric; with the default dummy dtype, statsmodels raised
# "ValueError: Pandas data cast to numpy dtype of object."
X = pd.get_dummies(X, drop_first=True, dtype=float)
X = sm.add_constant(X)  # Add constant for the intercept

# Define dependent variable
y = survey_data['Political_Preference_Binary']

# Fit logistic regression
model = sm.Logit(y, X).fit()
print(model.summary())

In [None]:
### C)
# Example: Chi-square test for voting channel by gender
contingency_table_channel_gender = pd.crosstab(survey_data['sex'], survey_data['evote'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table_channel_gender)
print("Chi-square Test Results for Gender vs Voting Channel")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

contingency_table_channel_education = pd.crosstab(survey_data['education'], survey_data['evote'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table_channel_education)
print("\nChi-square Test Results for Education vs Voting Channel")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

contingency_table_channel_age = pd.crosstab(survey_data['age'], survey_data['evote'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table_channel_age)
print("\nChi-square Test Results for Age vs Voting Channel")
print("Chi-square statistic:", chi2)
print("p-value:", p_value)

Chi-square Test Results for Gender vs Voting Channel
Chi-square statistic: 2.6379263431645796
p-value: 0.10433965288985117

Chi-square Test Results for Education vs Voting Channel
Chi-square statistic: 8.784052225187445
p-value: 0.36083942749998765

Chi-square Test Results for Age vs Voting Channel
Chi-square statistic: 63.97738150893831
p-value: 0.5821109942082197
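
Because age is numeric, it can also be compared between the two voting channels directly rather than through a contingency table. The following is a minimal sketch using the Mann-Whitney U test, assuming `evote` is coded 1 for online and 0 for in-person voting (the coding is not stated in this notebook). The helper `compare_age_by_channel` and the small data frame are hypothetical.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_age_by_channel(df, age_col="age", channel_col="evote"):
    """Two-sided Mann-Whitney U test of age between online (1) and in-person (0) voters."""
    online = df.loc[df[channel_col] == 1, age_col].dropna()
    in_person = df.loc[df[channel_col] == 0, age_col].dropna()
    return mannwhitneyu(online, in_person, alternative="two-sided")


# Made-up ages for illustration only
demo = pd.DataFrame({
    "age": [20, 25, 30, 35, 60, 65, 70, 75],
    "evote": [1, 1, 1, 1, 0, 0, 0, 0],
})
result = compare_age_by_channel(demo)
print("U statistic:", result.statistic)
print("p-value:", result.pvalue)
```

The test makes no normality assumption, which suits a small sample of ages.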


In [None]:
# Define independent variables (demographics)
X = survey_data[['age', 'sex', 'education']]
X = pd.get_dummies(X, drop_first=True, dtype=float)  # Create numeric dummy variables for categorical data
X = sm.add_constant(X)

# Define the dependent variable
y = survey_data['evote']

# Fit the logistic regression model
model = sm.Logit(y, X).fit()
print(model.summary())

         Current function value: 0.584308
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                  evote   No. Observations:                  200
Model:                          Logit   Df Residuals:                      189
Method:                           MLE   Df Model:                           10
Date:                Mon, 04 Nov 2024   Pseudo R-squ.:                 0.05620
Time:                        18:43:13   Log-Likelihood:                -116.86
converged:                      False   LL-Null:                       -123.82
Covariance Type:            nonrobust   LLR p-value:                    0.1768
                                                        coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------------------------
const                                                 0.2960      
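
The summary above reports `converged: False` after 35 iterations. One common cause, offered here as an assumption rather than a diagnosis, is a category level in which every respondent has the same outcome, which pushes the corresponding coefficient towards infinity. The sketch below lists such levels; the helper `sparse_levels` and the example frame are hypothetical.

```python
import pandas as pd


def sparse_levels(df, predictor, outcome, min_count=1):
    """Return predictor levels with fewer than min_count rows in at least one outcome class."""
    table = pd.crosstab(df[predictor], df[outcome])
    return table[(table < min_count).any(axis=1)]


# Made-up data for illustration only
demo = pd.DataFrame({
    "education": ["primary", "primary", "secondary", "secondary", "phd", "phd"],
    "evote": [0, 1, 0, 1, 1, 1],
})
flagged = sparse_levels(demo, "education", "evote")
print(flagged)
```

Levels flagged this way could be merged with a neighbouring category before refitting the model.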

