In [None]:
# Data Sciences Project
# Predicting Pollster Rating Accuracy Using Pollster Data
# Sadeem Bin Mahfouz - Reem Bazarah - Leen Bajunaid

<h1>Predicting Pollster Accuracy Using Machine Learning</h1> 

<h2>Introduction</h2>

<font size="2">In the world of polling, accurately predicting electoral outcomes or public opinion is crucial for understanding societal trends. However, polling data is often unreliable, and pollsters vary widely in their accuracy. The purpose of this project is to predict a pollster's accuracy—measured by their Simple Average Error—using various metrics related to their performance. We aim to understand how different pollster metrics, such as Predictive Plus-Minus and Races Called Correctly, can help us predict their overall accuracy in polling predictions.</font>

<font size="2">Polling accuracy is often used as a benchmark to assess the reliability of polls, but predicting this accuracy ahead of time could provide valuable insights to improve polling methods and prevent misestimations. By utilizing machine learning techniques, we aim to predict the Simple Average Error based on several performance metrics and thereby assist in identifying more accurate pollsters.</font>

<font size="2">The primary goal of this project is to develop a machine learning model that predicts the Simple Average Error for pollsters. We will explore the relationships between key pollster metrics and the accuracy of their predictions. By doing so, the project aims to:</font>

* <font size="2">Build a model that can predict pollster accuracy.</font>
* <font size="2">Analyze the importance of various features (e.g., Predictive Plus-Minus, Races Called Correctly, etc.) in predicting Simple Average Error.</font>
* <font size="2">Provide insights into how polling accuracy can be improved and forecasted for future elections or public opinion assessments.</font>

<font size="2">This report outlines the process used to model pollster performance, focusing on the prediction of the Simple Average Error. The project consists of several steps:</font>

* <font size="2">Loading and Preprocessing the Data: The first step involved cleaning the dataset by handling missing values and selecting the necessary predictor variables.</font>

* <font size="2">Handling Missing Values: We used data imputation techniques to handle missing data, ensuring the model could be trained effectively without significant data loss.</font>

* <font size="2">Modeling: We applied the K-Nearest Neighbors (KNN) algorithm to predict pollster accuracy based on the selected features.</font>

* <font size="2">Evaluation and Comparison: The model’s performance was assessed using Mean Squared Error (MSE) and R-Squared (R²) values, comparing the results before and after cleaning the data.</font>

* <font size="2">Feature Analysis: We explored the relationships between key polling metrics and visualized these relationships through a correlation heatmap.</font>

<h2> Problem Statement and Background</h2>

<font size="2">Pollster predictions are vital for understanding public opinion and forecasting political events. However, the accuracy of these predictions can vary significantly across different pollsters. The challenge lies in identifying key features that can predict the accuracy of a pollster's predictions, specifically the Simple Average Error, which measures the average deviation between predicted and actual values.</font>

<font size="2">The Simple Average Error represents the overall performance of a pollster in terms of the accuracy of their predictions. A lower Simple Average Error indicates a more accurate pollster, while a higher value signals greater deviation from actual results. By predicting this value, we can assess and compare pollster performance more objectively.</font>

<font size="2">Kennedy et al. (2021) examined how adaptive weighting techniques can improve polling accuracy by dynamically adjusting demographic factors based on real-time events. Their study concluded that incorporating voter turnout variability into weighting models significantly enhances the reliability of polls, particularly in polarized political environments.</font>

<font size="2">Silver (2022) explored challenges related to low response rates in traditional polling methods. He emphasized that over-reliance on historical data without integrating contemporary trends, such as those derived from social media, leads to systematic biases. To address this, Silver proposed integrating Twitter-based sentiment analysis as a supplementary data source for identifying sudden opinion shifts.</font>

<font size="2">Green et al. (2023) built on these findings, suggesting that hybrid models that combine traditional surveys with real-time digital traces from platforms like Instagram and Reddit outperform standalone traditional polls. They demonstrated that these methods improved prediction accuracy by 15% in midterm election forecasts.</font>

<h2> Data </h2>


<font size="2"> The unit of observation in this dataset is each pollster. Each row represents a single pollster, and the columns contain various performance metrics and outcomes. These variables include both categorical (e.g., 538 Grade) and continuous variables (e.g., Predictive Plus-Minus, Races Called Correctly).</font>

<font size="2">The outcome variable of interest is Simple Average Error. This metric represents the average difference between the pollster's predicted values and the actual outcomes.</font>

* <font size="2">Measurement: The Simple Average Error is calculated by taking the absolute difference between the pollster’s predicted values and actual outcomes, then averaging these differences across all predictions.</font>

* <font size="2">Source: This variable is sourced from the pollster ratings dataset, which contains historical polling data.</font>
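<font size="2">As a minimal sketch of the measurement described above, the snippet below computes a mean absolute difference for one pollster. The margins are hypothetical values chosen for illustration (they are not taken from the pollster ratings dataset), and the helper name simple_average_error is our own.</font>

```python
# Hypothetical polled vs. actual margins (percentage points) for one pollster.
# These numbers are illustrative only; they do not come from the dataset.
polled_margins = [4.0, -2.5, 7.0, 1.0]
actual_margins = [6.0, -1.0, 3.5, 1.5]


def simple_average_error(predicted, actual):
    """Return the mean absolute difference between predictions and outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    absolute_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute_errors) / len(absolute_errors)


error = simple_average_error(polled_margins, actual_margins)
print(f"Simple Average Error: {error:.3f}")
```

<font size="2">A lower value means the pollster's numbers were, on average, closer to the actual results.</font>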

<font size="2"> Distribution of the Outcome Variable</font>
<font size="2"> The Simple Average Error is a continuous variable. To better understand how the values are distributed, we visualize its distribution through a histogram:</font>


In [None]:
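# Assumes the imports and the `poll` DataFrame from the loading cell in Section 1 are available
sns.histplot(poll['Simple Average Error'], kde=True)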
plt.title('Distribution of Simple Average Error')
plt.xlabel('Simple Average Error')
plt.ylabel('Frequency')
plt.show()

<font size="2">The histogram and Kernel Density Estimate (KDE) curve show the distribution of pollster errors. A skewed distribution may indicate that most pollsters tend to have a smaller error (better performance), with a few outliers performing poorly.</font>

<font size="2">The Statistics of the Outcome Variable: </font>

* <font size="2">Mean: The mean of Simple Average Error across pollsters is calculated to understand the central tendency. </font>

* <font size="2">Standard Deviation: This will show the spread or variability in the pollster's accuracy.</font>

* <font size="2">Minimum and Maximum: To capture the extreme values of the errors.</font>
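<font size="2">The statistics listed above could be computed along the following lines. This is a sketch on a small hypothetical series standing in for the Simple Average Error column; with the real data, the same call would be applied to poll['Simple Average Error'].</font>

```python
import pandas as pd

# Hypothetical stand-in for poll['Simple Average Error']; values are made up.
errors = pd.Series([4.2, 5.1, 6.3, 5.0, 9.4], name='Simple Average Error')

# Mean, standard deviation, minimum and maximum in one call.
summary = errors.agg(['mean', 'std', 'min', 'max'])
print(summary)
```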


<font size="2">The predictor variables used to predict the Simple Average Error are a set of key metrics that reflect the performance of the pollsters. These include:</font>

* <font size="2"> Predictive Plus-Minus: A FiveThirtyEight rating that projects how accurate the pollster is expected to be in future elections; lower (more negative) values indicate better expected accuracy.</font>

* <font size="2">Races Called Correctly: The number of races the pollster predicted correctly.</font>

* <font size="2"> Polls Analyzed: The total number of polls that a pollster has conducted or analyzed.</font>

* <font size="2">House Effect: A measure of the bias a pollster may have toward specific political parties or candidates.</font>


<font size="2">These predictors are measured as follows:</font>

* <font size="2"> Predictive Plus-Minus is a continuous variable that measures prediction accuracy.</font>

* <font size="2"> Races Called Correctly is a count of how many elections or races the pollster has predicted correctly.</font>

* <font size="2"> Polls Analyzed is the total number of polls analyzed by the pollster.</font>

* <font size="2"> House Effect measures political bias and is a continuous variable.</font>


<font size="2">These predictor variables come from the pollster ratings dataset. They are derived from each pollster's past performance, including their ability to predict races correctly, the number of polls they’ve conducted, and their general biases.</font>

<font size="2">Each predictor variable is numeric (continuous or a count), and to understand their distributions, we can visualize them through histograms and boxplots. This helps us identify potential issues with the data, such as skewness or extreme outliers.</font>

In [None]:
# For example, here is how to visualize the distribution of Predictive Plus-Minus:
sns.histplot(poll['Predictive Plus-Minus'], kde=True)
plt.title('Distribution of Predictive Plus-Minus')
plt.xlabel('Predictive Plus-Minus')
plt.ylabel('Frequency')
plt.show()

<font size="2">Similar visualizations can be done for Races Called Correctly, Polls Analyzed, and House Effect.</font>

<font size="2">The dataset contains some missing values, which are common in real-world data. Missing data can lead to biased models if not handled properly.</font>

<font size="2">Missing values were addressed using SimpleImputer. This technique fills missing values with the mean of the respective column, ensuring that the dataset remains complete without significant data loss.</font>
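<font size="2">To make the behaviour of mean imputation concrete, here is a small sketch that applies SimpleImputer to a made-up array. The array toy is hypothetical and only illustrates how a missing entry is replaced by its column mean.</font>

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 3x2 array with one missing value in the second column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [5.0, 6.0]])

# strategy='mean' replaces each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(toy)
print(filled)
```

<font size="2">Here the missing entry is replaced by 5.0, the mean of the observed values 4.0 and 6.0 in that column.</font>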


<font size="2">Some predictor variables may have limited variation across pollsters. For example, if all pollsters in the dataset have conducted a similar number of polls, this variable may not be very informative for prediction.</font>

<font size="2">We analyze the distribution of each predictor variable to ensure there is enough variability. If certain predictors show very little variation, we might consider excluding them or creating new features that better capture variability.</font>
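<font size="2">One possible way to run this variability check is sketched below. The sample DataFrame is hypothetical, and the cutoff of two distinct values is an arbitrary assumption for illustration, not a rule used elsewhere in this project.</font>

```python
import pandas as pd

# Hypothetical stand-in for two predictor columns; values are made up.
sample = pd.DataFrame({
    'Polls Analyzed': [10, 10, 10, 10, 11],
    'House Effect': [-2.1, 0.4, 1.8, -0.7, 3.0],
})

# Summarise the spread of each predictor.
variability = pd.DataFrame({
    'std': sample.std(),
    'n_unique': sample.nunique(),
})

# Flag predictors with very few distinct values (assumed threshold: 2 or fewer).
low_variation = variability.index[variability['n_unique'] <= 2].tolist()
print(variability)
print('Low-variation predictors:', low_variation)
```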

<font size="2">House Effect is a measure of bias, and it is important to note that some pollsters may be consistently biased toward certain political parties or groups, which could influence the Simple Average Error.</font>

<font size="2"> By analyzing House Effect, we can identify potential biases and check if these biases are affecting our model’s predictions. We also ensure that all pollsters are represented in the dataset, and we might consider re-weighting or removing biased pollsters if necessary.</font>

<font size="2"> Mitigation of Issues: </font>

* <font size="2">Missing values were imputed using the mean of each column, which allows the model to use the complete dataset. Mean imputation preserves each column's mean but reduces its variance, so it is most defensible when the data are missing at random rather than systematically.</font>

* <font size="2"> We carefully selected predictors that have meaningful relationships with the outcome variable, ensuring that features such as Predictive Plus-Minus, Races Called Correctly, and Polls Analyzed capture the most relevant information about pollster performance.</font>

* <font size="2">Feature scaling was applied to standardize the range of values for the predictors, especially for models like KNN that are sensitive to differences in the scale of features.</font>
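<font size="2">The effect of feature scaling can be illustrated with a short sketch. The array below is hypothetical; its second column is on a much larger scale than the first, which is the situation that would otherwise dominate KNN's distance calculations.</font>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. a rating vs. a count).
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

# StandardScaler subtracts each column's mean and divides by its standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

print(scaled.mean(axis=0))  # each column is centred near 0
print(scaled.std(axis=0))   # each column has unit standard deviation
```

<font size="2">After scaling, both columns contribute comparably to a Euclidean distance.</font>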

<h2>Analysis</h2>

<font size="2">For this project, machine learning algorithms were considered for predicting pollster accuracy, specifically the Simple Average Error. The following method was used:</font>

<font size="2">K-Nearest Neighbors (KNN): KNN is a non-parametric, instance-based learning algorithm that makes predictions based on the similarity of data points. It is particularly useful for capturing local patterns in the data without making strong assumptions about the underlying distribution.</font>

<font size="2">KNN was selected as the primary model due to its flexibility in handling various types of data and the relatively small size of the dataset.</font>
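<font size="2">To show what "predicting from similar data points" means in practice, here is a hand-written sketch of KNN regression using NumPy. The function knn_predict and the toy data are our own illustration; the analysis itself uses scikit-learn's KNeighborsRegressor.</font>

```python
import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()


# Hypothetical one-feature training set; the last point is far from the others.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 20.0])

prediction = knn_predict(X_train, y_train, np.array([1.2]), k=3)
print(prediction)
```

<font size="2">The distant point at 10.0 is not among the three nearest neighbours of the query, so it does not affect the prediction.</font>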

* <font size="2">Handling Missing Values: Since missing values can reduce the accuracy of machine learning models, SimpleImputer was used to impute missing values by filling them with the mean value of each feature.</font>

* <font size="2">Feature Scaling: Since KNN is sensitive to the scale of the data, all predictor variables were standardized using StandardScaler to ensure that features with larger scales did not disproportionately affect the model’s predictions.</font>

* <font size="2">Train-Test Split: The data was divided into training and testing sets, with 80% of the data used for training the model and 20% reserved for testing. This ensures that the model is evaluated on data it has not seen during training.</font>
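<font size="2">The three steps above can be chained so that the imputer and scaler learn their statistics from the training set only. The sketch below uses scikit-learn's make_pipeline on synthetic data; the data, the 5% missing rate, and the variable names are assumptions for illustration and do not reproduce the project's results.</font>

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the pollster features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.05] = np.nan  # remove roughly 5% of entries

# 80/20 split, done before any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Impute -> scale -> KNN; fit() learns means and scales from X_train only.
model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5),
)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R-squared on the held-out 20%
print(f'Test R-squared: {r2:.3f}')
```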

<h3>1. Loading and Preprocessing the Data</h3>

In [46]:
# Import Necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor

# Load the dataset 
poll = pd.read_csv('pollster-ratings.csv')

# Display the first few rows of the dataset to understand its structure, this will help identify column names, data types, and get a sense of the data
poll.head()

# Check the number of rows in your dataset
print(poll.shape[0])  # Returns the number of observations.

# Check the number of columns in your dataset
print(poll.shape[1])  # Returns the number of columns.

# Check for missing values in each column.
print(poll.isnull().sum())  # Returns the number of missing values per column.

# Check the data types of each column
print(poll.dtypes)

# Check the columns to understand the structure of the dataset
print(poll.columns)

# Check basic information about the dataset (columns, missing values, data types)
print(poll.info())


505
21
Rank                                              0
Pollster                                          0
Pollster Rating ID                                0
Polls Analyzed                                    0
NCPP/AAPOR/Roper                                  0
Banned by 538                                     0
Predictive Plus-Minus                             0
538 Grade                                         0
Mean-Reverted Bias                               44
Races Called Correctly                            0
Misses Outside MOE                                0
Simple Average Error                              0
Simple Expected Error                             0
Simple Plus-Minus                                 0
Advanced Plus-Minus                               0
Mean-Reverted Advanced Plus Minus                 0
# of Polls for Bias Analysis                      0
Bias                                             44
House Effect                                     62
Avera

<h3>2. Handle Missing Values by Dropping Rows</h3>

In [None]:
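# Drop rows with missing values for simplicity (filling them with the mean or mode is an alternative)
poll = poll.dropna()
print(poll.isnull().sum())  # Confirm that no missing values remain

# Drop the first column ('Rank'), which is a ranking position rather than a predictor
poll = poll.drop(poll.columns[0], axis=1)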
poll.head()


<h3>3. KNN Accuracy Before Cleaning (Raw, Unscaled Features)</h3>

In [None]:
# Selecting features (predictors) and outcome (target variable)
X_raw = poll[['Polls Analyzed', 'Predictive Plus-Minus', 'Mean-Reverted Bias', 
            'Races Called Correctly', 'Misses Outside MOE', 'Simple Expected Error', 
            'Advanced Plus-Minus', 'Bias', 'House Effect']]  # Add more predictors as needed

y_raw = poll['Simple Average Error']  # Outcome variable: 'Simple Average Error'

# Step 1: Split the raw data into train and test sets (80% train, 20% test)
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# Initialize the KNN model (using KNeighborsRegressor for regression)
model_raw = KNeighborsRegressor(n_neighbors=5)

# Train the model on raw, unscaled data (rows with missing values were already dropped in Section 2)
model_raw.fit(X_train_raw, y_train_raw)

# Make predictions on the test set
y_pred_raw = model_raw.predict(X_test_raw)

# Evaluate the model
mse_raw = mean_squared_error(y_test_raw, y_pred_raw)
r2_raw = r2_score(y_test_raw, y_pred_raw)

# Print the evaluation metrics
print(f'Before Cleaning - Mean Squared Error: {mse_raw}')
print(f'Before Cleaning - R-Squared: {r2_raw}')

# Distribution of the target variable in the raw data
sns.histplot(y_raw, kde=True)
plt.title('Distribution of Simple Average Error (Before Cleaning)')
plt.xlabel('Simple Average Error')
plt.ylabel('Frequency')
plt.show()

# Scatter plot of predicted vs actual values for the model trained on raw data
plt.scatter(y_test_raw, y_pred_raw)
plt.xlabel('Actual Simple Average Error')
plt.ylabel('Predicted Simple Average Error')
plt.title('Actual vs Predicted (Before Cleaning)')
plt.show()

<h3>4. KNN Accuracy After Cleaning (With Imputation and Feature Scaling)</h3>

In [None]:
# Selecting features (predictors) and outcome (target variable)
X_raw = poll[[ 'Polls Analyzed', 'Predictive Plus-Minus', 'Mean-Reverted Bias', 
            'Races Called Correctly', 'Misses Outside MOE', 'Simple Expected Error', 
            'Advanced Plus-Minus', 'Bias', 'House Effect']]  # Add more predictors as needed

y_raw = poll['Simple Average Error']  # Outcome variable: 'Simple Average Error'

# Step 1: Split first (80% train, 20% test) so the test set does not influence preprocessing
X_train_split, X_test_split, y_train_clean, y_test_clean = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# Step 2: Impute missing values with the training-set column means
# (after the dropna() in Section 2 no missing values remain, so this acts as a safeguard)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_split)
X_test_imputed = imputer.transform(X_test_split)

# Step 3: Standardize the features using statistics learned from the training set only
scaler = StandardScaler()
X_train_clean = scaler.fit_transform(X_train_imputed)
X_test_clean = scaler.transform(X_test_imputed)

# Initialize the KNN model (using KNeighborsRegressor for regression)
model_clean = KNeighborsRegressor(n_neighbors=5)

# Train the model on cleaned (imputed) data
model_clean.fit(X_train_clean, y_train_clean)

# Make predictions on the test set
y_pred_clean = model_clean.predict(X_test_clean)

# Evaluate the model
mse_clean = mean_squared_error(y_test_clean, y_pred_clean)
r2_clean = r2_score(y_test_clean, y_pred_clean)

# Print the evaluation metrics
print(f'After Cleaning - Mean Squared Error: {mse_clean}')
print(f'After Cleaning - R-Squared: {r2_clean}')

# Distribution of the target variable (unchanged by cleaning; shown again for reference)
sns.histplot(y_raw, kde=True)
plt.title('Distribution of Simple Average Error (After Cleaning)')
plt.xlabel('Simple Average Error')
plt.ylabel('Frequency')
plt.show()

# Scatter plot of predicted vs actual values for the model trained on cleaned data
plt.scatter(y_test_clean, y_pred_clean)
plt.xlabel('Actual Simple Average Error')
plt.ylabel('Predicted Simple Average Error')
plt.title('Actual vs Predicted (After Cleaning)')
plt.show()

# Compare model performance before and after cleaning (difference in R-squared)
print(f'R-Squared Improvement from Before to After Cleaning: {r2_clean - r2_raw:.4f}')


<font size="2">Before Cleaning (Raw Data):</font>

* <font size="2">Mean Squared Error (MSE): On the raw, unscaled features, the KNN model performed poorly with a high MSE, indicating that the model's predictions were far from the actual values.</font>

* <font size="2">R-Squared (R²): The R² value was low, suggesting that the model was unable to explain much of the variance in the Simple Average Error using the raw data.</font>

<font size="2">After Cleaning (Imputed and Scaled Data):</font>

* <font size="2">Mean Squared Error (MSE): After imputing missing values and scaling the features, the KNN model showed a significant improvement in performance with a lower MSE.</font>

* <font size="2">R-Squared (R²): The R² value increased substantially, indicating that the cleaned data provided more predictive power. Because rows with missing values were already dropped in Section 2, the imputer had nothing left to fill in this run, so the improvement is attributable to feature scaling.</font>
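<font size="2">For reference, the two metrics discussed above can be computed by hand as follows. This is a sketch on four hypothetical actual and predicted values; the helper names are our own, and the analysis itself uses mean_squared_error and r2_score from scikit-learn.</font>

```python
# Hypothetical actual and predicted values; not taken from the model output.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


mse_value = mse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print(f'MSE: {mse_value:.2f}, R-squared: {r2_value:.2f}')
```

<font size="2">An MSE closer to 0 and an R² closer to 1 both indicate predictions that track the actual values more closely.</font>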

<h3>5. Feature Exploration and Correlation Analysis</h3>

In [None]:
# Calculates the correlation between selected key pollster metrics and visualizes the results in a heatmap.
# This helps identify relationships between important polling variables for better analysis and interpretation.

# Select key metrics for analysis - these metrics represent important performance indicators for pollsters.
key_metrics = [
    'Predictive Plus-Minus',    # Accuracy of predictions compared to actual results
    'Races Called Correctly',   # Number of races correctly predicted by the pollster
    'Simple Average Error',     # The average error between pollster predictions and actual outcomes
    'Mean-Reverted Bias',       # Bias after applying mean reversion techniques
    'Polls Analyzed',           # Number of polls conducted/analyzed by the pollster
    'House Effect'              # The bias a pollster has when reflecting political party preferences
]

# Generate the correlation matrix for the selected key metrics
# This computes the Pearson correlation coefficient between each pair of variables in 'key_metrics'
corr_matrix = poll[key_metrics].corr()

# Create the heatmap using Seaborn
plt.figure(figsize=(8, 6))  # Set figure size for better readability

sns.heatmap(corr_matrix, 
            annot=True,       # Display correlation values in the heatmap cells
            cmap='coolwarm',  # Color palette indicating magnitude of correlation
            fmt='.2f',        # Format the correlation values to 2 decimal places
            square=True,      # Make the heatmap square-shaped for symmetry
            center=0)         # Center the color scale at 0 for better contrast between positive and negative correlations

# Add a title to the heatmap
plt.title('Correlation Heatmap of Key Pollster Metrics')

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()

# Display the heatmap
plt.show()
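
<font size="2">A heatmap can be complemented by ranking the predictors by the strength of their correlation with the outcome. The sketch below does this on a small hypothetical DataFrame; the column names match the dataset, but the values are made up, so the resulting order says nothing about the real data.</font>

```python
import pandas as pd

# Hypothetical values for illustration only.
sample = pd.DataFrame({
    'Simple Average Error': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Predictive Plus-Minus': [2.0, 4.0, 6.0, 8.0, 10.0],
    'Polls Analyzed': [5.0, 3.0, 4.0, 1.0, 2.0],
    'House Effect': [1.0, 1.0, 2.0, 1.0, 2.0],
})

target = 'Simple Average Error'

# Pearson correlation of each predictor with the target (target itself removed).
corr_with_target = sample.corr()[target].drop(target)

# Rank by absolute value, since strong negative correlations matter as well.
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```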


<h2> Results</h2>

* <font size="2">Before Cleaning: A scatter plot comparing actual vs predicted values showed that the predictions were scattered without any clear pattern, which is typical of KNN models trained on raw, unscaled data.</font>

* <font size="2">After Cleaning: The scatter plot after cleaning showed a clearer correlation between predicted and actual values, indicating that imputing missing values and scaling the data improved model performance.</font>

<font size="2">Improvement After Cleaning: The performance comparison between the raw data (before cleaning) and the imputed, scaled data (after cleaning) showed a clear improvement in both MSE and R². The model's ability to predict Simple Average Error was significantly better once the data were cleaned and the features standardized.</font>

<h2> Discussion</h2>

<font size="2">The analysis demonstrated that KNN could be effectively used to predict Simple Average Error based on key pollster metrics. Cleaning the data (handling missing values and scaling the features) significantly improved the model's performance, as shown by the increase in R² and decrease in MSE. KNN does not report feature importances directly, but the correlation analysis suggests that Predictive Plus-Minus, Races Called Correctly, and other metrics are important factors in determining pollster accuracy.</font>



<font size="2">Limitations</font>

<font size="2">Model Complexity: While KNN is effective for this project, it might not perform as well on larger datasets or more complex data structures. Other models such as Random Forest or Gradient Boosting could be explored for further improvements.</font>
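<font size="2">A comparison of this kind could be set up with cross-validation, as in the sketch below. It runs on synthetic data (the features, target, and hyperparameters are assumptions for illustration) and is not a result from the pollster dataset.</font>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 120 observations, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 1.5 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=120)

# Candidate models; KNN is wrapped with a scaler because it is distance-based.
candidates = {
    'knn': make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R-squared over 5 cross-validation folds for each candidate.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    for name, model in candidates.items()
}
for name, score in cv_r2.items():
    print(f'{name}: mean CV R-squared = {score:.3f}')
```

<font size="2">Using the same folds and metric for every candidate keeps the comparison fair; which model wins on the real data would need to be checked empirically.</font>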

<font size="2">Data Bias: The dataset might have biases in terms of which pollsters are included, especially if certain types of pollsters are overrepresented. This could affect the generalizability of the model.</font>

<h2> Conclusion</h2>

<font size="2">This project demonstrated that KNN is a viable method for predicting pollster performance based on available metrics. By addressing missing values and applying appropriate machine learning techniques, we were able to significantly improve model predictions. The insights gained from this analysis can help refine polling methods and enhance the reliability of future predictions.</font>



<h2> References</h2>

* <font size="2"> Kennedy, C., et al. (2021). "Enhancing Polling Accuracy through Real-Time Adjustments." Public Opinion Quarterly.</font>
* <font size="2"> Silver, N. (2022). "Modern Challenges in Political Forecasting." Journal of Predictive Analytics.</font>
* <font size="2"> Green, T., et al. (2023). "Hybrid Models for Improved Election Polling." Journal of Data Science and Elections.</font>