1. Separate the Data

- Survey Data: This is your baseline. It represents what voters said their preferences were.
- Election Results: Separate the electronic votes from the polling station votes to analyze each type of voting independently.

Let's load the data.

In [56]:
import pandas as pd

private_data = pd.read_excel('private_dataD.xlsx')
public_data_register = pd.read_excel('public_data_registerD.xlsx')
public_data_results = pd.read_excel('public_data_resultsD.xlsx')


2. Organize Data for Comparison

Create tables showing the count or percentage of votes for each party in:

- The survey data
- The election results for electronic votes
- The election results for polling station votes
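
The cells in this notebook work with raw counts. As a minimal sketch of the percentage view mentioned above, the snippet below converts a table of counts into row percentages. The counts are copied from the survey table printed later in this notebook; the helper name `to_percentages` is an assumption introduced only for illustration.

```python
import pandas as pd

# Survey counts copied from the table printed later in this notebook.
survey_counts = pd.DataFrame(
    {"Green": [78, 43], "Red": [54, 21], "Invalid vote": [1, 3]},
    index=["paper", "electronic"],
)


def to_percentages(counts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn each row of counts into row percentages."""
    row_totals = counts.sum(axis=1)
    return counts.div(row_totals, axis=0).mul(100).round(1)


survey_pct = to_percentages(survey_counts)
print(survey_pct)
```

Percentages make the survey and the election results directly comparable even though the two sources differ greatly in size.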



In [96]:
private_totals = private_data['party'].value_counts()

# Take the last two rows of the results file (the e-vote totals and the overall totals).
# .copy() creates an independent DataFrame, so the edits below do not trigger SettingWithCopyWarning.
public_data_totals = public_data_results.tail(2).copy()

# Use the label column as the index
public_data_totals.set_index('Unnamed: 0', inplace=True)

# Paper votes = all votes minus electronic votes
public_data_totals.loc['Paper'] = public_data_totals.loc['Total_row'].apply(int) - public_data_totals.loc['E-votes'].apply(int)

# Convert private_totals to a one-row DataFrame so its layout matches the public data
private_totals = private_totals.to_frame().transpose()

print(private_totals)
print('\n')

# Align the column and row labels with the survey data
public_data_totals.rename(columns={'Invalid ballots': 'Invalid vote'}, inplace=True)
public_data_totals.rename(index={'E-votes':'electronic','Total_row':'count', 'Paper':'paper'}, inplace=True)

print(public_data_totals)
print('\n')
print(public_data_totals.index)

party  Green  Red  Invalid vote
count    121   75             4


            Red  Green  Invalid vote  Total
Unnamed: 0                                 
electronic  107    199             7    313
count       386    630            20   1036
paper       279    431            13    723


Index(['electronic', 'count', 'paper'], dtype='object', name='Unnamed: 0')


In [97]:
# Initialize the private_totals DataFrame with party names
private_totals = pd.DataFrame(index=['paper', 'electronic'], columns=['Green', 'Red', 'Invalid vote'])

# Calculate counts for electronic votes (evote == 1)
for party in ['Green', 'Red', 'Invalid vote']:
    private_totals.loc['electronic', party] = private_data.loc[(private_data['evote'] == 1) & (private_data['party'] == party), 'party'].count()

# Calculate counts for paper votes (evote == 0)
for party in ['Green', 'Red', 'Invalid vote']:
    private_totals.loc['paper', party] = private_data.loc[(private_data['evote'] == 0) & (private_data['party'] == party), 'party'].count()

# Add a 'count' row with the column sums (paper + electronic for each party)
private_totals.loc['count'] = private_totals.sum()

# Add a 'Total' column with the row sums (all parties for each voting method)
private_totals['Total'] = private_totals.sum(axis=1)

# Display the resulting DataFrame
print(private_totals)

           Green Red Invalid vote Total
paper         78  54            1   133
electronic    43  21            3    67
count        121  75            4   200
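
The two loops above can also be written as a single cross-tabulation. The sketch below uses a small made-up DataFrame named `toy` as a stand-in for `private_data` (the real file is not reproduced here); only the column names `party` and `evote` are taken from the notebook. Note that `margins_name` labels both the totals row and the totals column, so the totals row is called `Total` here rather than `count`.

```python
import pandas as pd

# Hypothetical miniature stand-in for private_data, for illustration only.
toy = pd.DataFrame({
    "party": ["Green", "Red", "Green", "Invalid vote", "Red", "Green"],
    "evote": [1, 0, 0, 1, 1, 0],
})

# Rows: voting method, columns: party, with totals on both margins.
table = pd.crosstab(
    toy["evote"].map({0: "paper", 1: "electronic"}),
    toy["party"],
    margins=True,
    margins_name="Total",
)
print(table)
```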



3. Choose a Statistical Test

Since you’re comparing categorical data (party preference distributions), a Chi-Square test of independence is a good option.

Chi-Square Test: This test checks whether there are statistically significant differences in political preferences between:

- Survey vs. Electronic Votes
- Survey vs. Polling Station Votes

Each of these comparisons shows whether the preferences in the survey align with the actual votes or whether there are notable differences.
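
The two comparisons listed above can be sketched as separate 2x3 tests, one per voting method. This is an illustration, not the analysis run in the next cell: it assumes that each survey group (respondents with `evote == 1` or `evote == 0`) should be compared with the matching row of the election results, and it re-enters the counts printed earlier in this notebook. The helper name `compare` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

parties = ["Green", "Red", "Invalid vote"]

# Counts re-entered from the tables printed earlier in this notebook.
survey = pd.DataFrame([[78, 54, 1], [43, 21, 3]],
                      index=["paper", "electronic"], columns=parties)
results = pd.DataFrame([[431, 279, 13], [199, 107, 7]],
                       index=["paper", "electronic"], columns=parties)


def compare(channel):
    """Hypothetical helper: chi-square test of survey vs. results for one voting method."""
    table = np.vstack([survey.loc[channel].to_numpy(),
                       results.loc[channel].to_numpy()])
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2, p, dof, expected


pairwise = {channel: compare(channel) for channel in ["electronic", "paper"]}
for channel, (chi2, p, dof, _) in pairwise.items():
    print(f"Survey vs. {channel} votes: chi2={chi2:.4f}, p={p:.4f}, dof={dof}")
```

Each 2x3 table has (2 - 1) x (3 - 1) = 2 degrees of freedom, so every comparison answers one question at a time instead of pooling all four rows into a single test.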


In [101]:
from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind

# Create a contingency table for the Chi-Square test
contingency_table = pd.concat([private_totals.loc[['paper', 'electronic'], ['Green', 'Red', 'Invalid vote']],
                               public_data_totals.loc[['paper', 'electronic'], ['Green', 'Red', 'Invalid vote']]])

# Perform the Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Print the results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-Value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
print('\n')

# Perform t-test for each party
for party in ['Green', 'Red', 'Invalid vote']:
    t_stat, p_val = ttest_ind(private_totals.loc[['paper', 'electronic'], party].astype(int),
                              public_data_totals.loc[['paper', 'electronic'], party].astype(int))
    print(f"T-Test for {party}:")
    print(f"T-Statistic: {t_stat}")
    print(f"P-Value: {p_val}\n")

Chi-Square Statistic: 6.360299944406317
P-Value: 0.3840618947439205
Degrees of Freedom: 6
Expected Frequencies:
[[ 80.81148867  49.60598706   2.58252427]
 [ 40.70954693  24.9894822    1.30097087]
 [439.29854369 269.66262136  14.03883495]
 [190.18042071 116.74190939   6.0776699 ]]


T-Test for Green:
T-Statistic: -2.1694171309207846
P-Value: 0.16227922122515373

T-Test for Red:
T-Statistic: -1.7757517924317536
P-Value: 0.2177604255759461

T-Test for Invalid vote:
T-Statistic: -2.5298221281347035
P-Value: 0.12712843905603044





4. Interpret the Results

If the p-value from your Chi-Square test is less than 0.05, it means there’s a statistically significant difference in preferences between the groups.
If there’s a significant difference, it could indicate a possible manipulation or other factors influencing the results.

### Interpretation of Results

#### Chi-Square Test
The Chi-Square test was performed to determine if there are statistically significant differences in political preferences between the survey data and the election results.

- **Chi-Square Statistic**: 6.3603
- **Degrees of Freedom (dof)**: 6
- **P-Value**: 0.3841

Since the p-value (0.3841) is greater than the significance level of 0.05, we fail to reject the null hypothesis. This indicates that there is no statistically significant difference in political preferences between the survey data and the election results.
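
The chi-square approximation is less reliable when expected counts are small. As a rough check, the sketch below counts the cells of the expected-frequency table printed above that fall below 5. The threshold of 5 is a commonly cited rule of thumb, not something prescribed by this notebook.

```python
import numpy as np

# Expected frequencies copied from the chi2_contingency output above.
expected = np.array([
    [80.81148867, 49.60598706, 2.58252427],
    [40.70954693, 24.9894822, 1.30097087],
    [439.29854369, 269.66262136, 14.03883495],
    [190.18042071, 116.74190939, 6.0776699],
])

# Count the cells whose expected frequency is below the rule-of-thumb threshold.
n_low = int((expected < 5).sum())
share_low = n_low / expected.size
print(f"{n_low} of {expected.size} expected counts are below 5 ({share_low:.0%})")
print(f"Smallest expected count: {expected.min():.2f}")
```

Both low cells belong to the "Invalid vote" column of the survey rows, so the result for that category deserves the most caution.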

#### T-Test
T-tests were conducted for each party to compare the mean vote counts in the survey data and the election results.

- **Green Party**:
    - T-Statistic: -2.1694
    - P-Value: 0.1623
- **Red Party**:
    - T-Statistic: -1.7758
    - P-Value: 0.2178
- **Invalid Vote**:
    - T-Statistic: -2.5298
    - P-Value: 0.1271

For all three categories the p-value is greater than the significance level of 0.05, so none of the t-tests shows a statistically significant difference between the survey data and the election results. Each test compares only two raw counts per group (paper and electronic), so it has very little power, and the negative t-statistics mainly reflect that the survey (200 respondents) is much smaller than the election (1036 votes).
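
To make clear what the t-test is actually comparing, the sketch below reruns the Green Party test on the four counts from the tables above and then prints the corresponding vote shares. The share comparison is an added illustration, not part of the original analysis.

```python
from scipy.stats import ttest_ind

# Green counts from the tables above, in the order (paper, electronic).
survey_green = [78, 43]
results_green = [431, 199]

# Same call as in the notebook: two raw counts per group.
t_stat, p_val = ttest_ind(survey_green, results_green)
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_val:.4f}")

# Shares of all votes put both sources on the same scale.
survey_share = sum(survey_green) / 200      # 200 survey respondents
results_share = sum(results_green) / 1036   # 1036 votes in the election
print(f"Green share: survey {survey_share:.3f}, results {results_share:.3f}")
```

The t-statistic is driven by the difference in raw totals, while the shares show how close the two sources are once their sizes are taken into account.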

#### Conclusion
Overall, neither test found a statistically significant difference in political preferences between the survey data and the election results. The survey data is therefore consistent with the actual votes, and these tests provide no evidence of manipulation or other influencing factors. Failing to reject the null hypothesis does not prove that the two distributions are identical.