# Analysis of pollster ratings by Five-Thirty-Eight
## Dr. Tirthajyoti Sarkar, Fremont, CA, June 2020

Five-Thirty-Eight assigns a rating to every pollster whose polling data feeds its predictive models, and it counts this rating system among its distinctive strengths. According to the site, the ratings are based on the historical accuracy and methodology of each firm’s polls.

They also publish the curated dataset behind these ratings here: https://github.com/fivethirtyeight/data/tree/master/pollster-ratings

Details on this dataset can be found here: https://projects.fivethirtyeight.com/pollster-ratings/

##### Import libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.optimize import curve_fit

##### Read in the dataset directly from the URL

In [3]:
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/pollster-ratings/pollster-ratings.csv"

In [4]:
try:
    df = pd.read_csv(url)
except Exception as e:
    # Report why the download or parse failed; df stays undefined in that case
    print("Could not retrieve the data!", e)

##### Show column names

In [5]:
for c in df.columns:
    print(c,end=', ')

Rank, Pollster, Pollster Rating ID, Polls Analyzed, AAPOR/Roper, Banned by 538, Predictive Plus-Minus, 538 Grade, Mean-Reverted Bias, Races Called Correctly, Misses Outside MOE, Simple Average Error, Simple Expected Error, Simple Plus-Minus, Advanced Plus-Minus, Mean-Reverted Advanced Plus-Minus, # of Polls for Bias Analysis, Bias, House Effect, Average Distance from Polling Average (ADPA), Herding Penalty, 

##### Dataset info

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 21 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Rank                                          517 non-null    int64  
 1   Pollster                                      517 non-null    object 
 2   Pollster Rating ID                            517 non-null    int64  
 3   Polls Analyzed                                517 non-null    int64  
 4   AAPOR/Roper                                   517 non-null    object 
 5   Banned by 538                                 517 non-null    object 
 6   Predictive Plus-Minus                         517 non-null    float64
 7   538 Grade                                     517 non-null    object 
 8   Mean-Reverted Bias                            473 non-null    float64
 9   Races Called Correctly                        517 non-null    flo

##### Rename a column to remove extra spaces

In [7]:
df.rename(columns={'Predictive    Plus-Minus':'Predictive Plus-Minus'},inplace=True)

##### Convert Races Called Correctly to float from string


In [8]:
df['Races Called Correctly'][:3]

0    0.747368
1    0.811321
2    0.886364
Name: Races Called Correctly, dtype: float64

In [11]:
def percent_to_float(x):
    """
    Converts a percentage string such as '75%' to a float (0.75).
    Values that are already numeric are returned unchanged.
    """
    if isinstance(x, str):
        return float(x.strip().rstrip('%'))/100
    return x

In [10]:
df['Races Called Correctly']=df['Races Called Correctly'].apply(percent_to_float)

In [None]:
df['Races Called Correctly'][:3]
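
The column may arrive as percentage strings in one version of the CSV and as floats in another (the `float64` dtype shown above is the latter case). A vectorized conversion that tolerates both can be sketched as follows. The helper name `to_fraction` and the sample values are hypothetical, used only for illustration.

```python
import pandas as pd

def to_fraction(col: pd.Series) -> pd.Series:
    """Convert '75%'-style strings (or plain numbers) to fractions."""
    if pd.api.types.is_numeric_dtype(col):
        return col.astype(float)  # already numeric; nothing to strip
    # Strip the trailing '%' and coerce anything unparseable to NaN
    cleaned = col.astype(str).str.strip().str.rstrip('%')
    return pd.to_numeric(cleaned, errors='coerce') / 100

sample = pd.Series(['75%', '81.5%', None])  # hypothetical values
converted = to_fraction(sample)
print(converted)
```

Checking the dtype first means the cell can be re-run safely, whereas slicing off the last character fails on values that are already floats.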

##### Extract partisan bias from the Bias column

In [None]:
def bias_party_id(x):
    """
    Returns a string indicating partisan bias
    """
    if pd.isna(x): return "No data"
    x = str(x)
    if x[0]=='D': return "Democratic"
    else: return 'Republican'

In [None]:
def bias_party_degree(x):
    """
    Returns the magnitude of the partisan bias as a float
    """
    if pd.isna(x): return np.nan
    x = str(x)
    return float(x[3:])

In [None]:
df['Partisan Bias']=df['Bias'].apply(bias_party_id)

In [None]:
df['Partisan Bias Degree']=df['Bias'].apply(bias_party_degree)

In [None]:
df[['Pollster','Bias','Partisan Bias','Partisan Bias Degree']].sample(5)
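
The two helper functions slice each value by position, which implies entries shaped like `D +0.5` or `R +1.2`. Assuming that format holds, the same extraction can be sketched in one vectorized pass with a regular expression. The sample Series below is hypothetical and stands in for `df['Bias']`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the "D +0.5" / "R +1.2" format that the slicing assumes
bias = pd.Series(['D +0.5', 'R +1.2', np.nan])

# One regex captures the party letter and the magnitude together
parts = bias.str.extract(r'^(?P<party>[DR])\s*\+?(?P<degree>\d+(?:\.\d+)?)$')

party = parts['party'].map({'D': 'Democratic', 'R': 'Republican'}).fillna('No data')
degree = pd.to_numeric(parts['degree'])  # missing rows stay NaN

print(pd.DataFrame({'Bias': bias,
                    'Partisan Bias': party,
                    'Partisan Bias Degree': degree}))
```

Unlike the `else` branch in `bias_party_id`, a value that does not match the pattern ends up as "No data" instead of being labeled Republican by default.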

##### Examine and quantize the 538 Grade column

In [None]:
df['538 Grade'].unique()

In [None]:
counts = df['538 Grade'].value_counts()
plt.figure(figsize=(12,4))
plt.title("Pollster grade counts",fontsize=18)
# Take labels and heights from the same Series so each bar matches its grade
plt.bar(x=counts.index,
        height=counts.values,
       color='red',alpha=0.6,edgecolor='k',linewidth=2.5)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.grid(True)
plt.show()

In [None]:
def grade_numeric(x):
    """
    Quantizes the letter grades
    """
    if x[0]=='A': return 4
    if x[0]=='B': return 3
    if x[0]=='C': return 2
    if x[0]=='D': return 1
    else: return 0

In [None]:
df['Numeric grade']=df['538 Grade'].apply(grade_numeric)

In [None]:
df['Numeric grade'].value_counts()
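
`grade_numeric` looks only at the first letter, so `A+`, `A` and `A-` all map to 4. If a finer scale is wanted, one possible variant keeps the sign as an offset. The 0.3 step and the name `grade_numeric_fine` are assumptions made for this sketch, not something the dataset defines.

```python
def grade_numeric_fine(grade: str) -> float:
    """Map a letter grade to a number, keeping +/- as an assumed 0.3 offset."""
    base = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0}.get(grade[0], 0.0)
    if base == 0.0:
        return 0.0  # 'F' or any unrecognised grade
    if grade.endswith('+'):
        return base + 0.3
    if grade.endswith('-'):
        return base - 0.3
    return base

for g in ['A+', 'A', 'B-', 'F']:  # hypothetical sample grades
    print(g, grade_numeric_fine(g))
```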

##### Boxplots

In [None]:
def custom_boxplot(x,y,rot=90):
    plt.figure(figsize=(12,4))
    plt.title("Boxplot of \"{}\" by \"{}\"".format(y,x),fontsize=17)
    sns.boxplot(x=x,y=y,data=df)
    plt.xticks(rotation=rot,fontsize=12)
    plt.yticks(fontsize=13)
    plt.xlabel(x,fontsize=15)
    plt.ylabel(y+'\n',fontsize=15)
    plt.show()

In [None]:
custom_boxplot(x='Methodology',y='Simple Average Error')

In [None]:
custom_boxplot(x='Methodology',y='Predictive Plus-Minus')

In [None]:
custom_boxplot(x='Partisan Bias',y='Races Called Correctly',rot=0)

In [None]:
custom_boxplot(x='Partisan Bias',y='Advanced Plus-Minus')

In [None]:
custom_boxplot(x='AAPOR/Roper',y='Races Called Correctly',rot=0)

In [None]:
custom_boxplot(x='AAPOR/Roper',y='Advanced Plus-Minus',rot=0)

##### Scatter and regression plots

In [None]:
def custom_scatter(x,y,data=df,pos=(0,0),regeqn=True):
    """
    Plots customized scatter plots with regression fit using Seaborn
    """    
    sns.lmplot(x=x,y=y,data=data,height=4,aspect=1.5,
       scatter_kws={'color':'yellow','edgecolor':'k','s':100},
              line_kws={'linewidth':3,'color':'red','linestyle':'--'})
    
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel(x,fontsize=15)
    plt.ylabel(y+'\n',fontsize=15)
    ax = plt.gca()
    ax.set_title("Regression fit of \"{}\" vs. \"{}\"".format(x,y),fontsize=15)
    
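    if regeqn:
        # Fit on the same data that is plotted, skipping rows with missing values
        clean = data[[x,y]].dropna()
        slope, intercept, r_value, p_value, std_err = stats.linregress(clean[x],clean[y])
        r_squared = r_value**2
        eqn= "$y$={0:.3f}$x$+{1:.3f},\n$R^2$:{2:.3f}".format(slope,intercept,r_squared)
        # Pass the text positionally: the 's' keyword was renamed to 'text' in Matplotlib 3.3
        plt.annotate(eqn,xy=pos,fontsize=13)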

In [None]:
custom_scatter(x='Races Called Correctly',
               y='Predictive Plus-Minus',
              pos=(0.05,-1.5))

In [None]:
custom_scatter(x='Numeric grade',
               y='Simple Average Error',
              pos=(0,20))
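
The regression equation in these plots comes from `stats.linregress`, which has no special handling for missing values, so a `NaN` in either column propagates into the fitted numbers. A minimal sketch of the fit on cleaned data follows. The `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data with one missing x value
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, np.nan],
                     'y': [2.1, 3.9, 6.2, 7.8, 10.0]})

# Drop incomplete rows before fitting
clean = demo[['x', 'y']].dropna()
result = stats.linregress(clean['x'], clean['y'])

print(f"slope={result.slope:.3f}, "
      f"intercept={result.intercept:.3f}, "
      f"R^2={result.rvalue**2:.3f}")
```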

In [None]:
df_2 = df.dropna()
filtered = df_2[df_2['Polls Analyzed']>100]
custom_scatter(x='# of Polls for Bias Analysis',
              y='Partisan Bias Degree',
              data=filtered,regeqn=False)

In [None]:
x = df_2['# of Polls for Bias Analysis']
y = df_2['Partisan Bias Degree']

plt.scatter(x,y,color='yellow',edgecolors='k',s=100)

def func(x, a, b, c):
    """Exponential decay with offset c"""
    return a * np.exp(-b *0.1*x) + c

# Fit the decay model, then overlay the fitted values in red
popt, pcov = curve_fit(func, x, y)
y_fit = func(x,*popt)
plt.scatter(x,y_fit,color='red',alpha=0.5)
plt.show()

In [None]:
plt.scatter(np.log10(np.abs(x)),np.log10(np.abs(y)),color='yellow',edgecolors='k',s=100)

In [None]:
popt
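
To check that the decay model and `curve_fit` behave as expected, it can help to fit synthetic data whose parameters are known. The sketch below does that. The data, the names `decay`, `x_demo`, `y_demo`, and the starting guess `p0` are all assumptions for illustration and are not taken from the pollster dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b, c):
    """Exponential decay towards a floor c (same form as func above)."""
    return a * np.exp(-b * 0.1 * x) + c

# Synthetic data with known parameters a=5, b=0.3, c=1 plus small noise
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 300, 60)
y_demo = decay(x_demo, 5.0, 0.3, 1.0) + rng.normal(0, 0.05, x_demo.size)

# An explicit starting guess; the default of all ones may be far from the solution
p0 = (y_demo.max(), 0.1, y_demo.min())
popt_demo, pcov_demo = curve_fit(decay, x_demo, y_demo, p0=p0)

# One-sigma uncertainty of each parameter from the covariance diagonal
perr = np.sqrt(np.diag(pcov_demo))

for name, val, err in zip('abc', popt_demo, perr):
    print(f"{name} = {val:.3f} +/- {err:.3f}")
```

The square roots of the diagonal of `pcov` give a rough sense of how well each parameter is determined, which is worth inspecting alongside `popt` before reading much into the fitted curve.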

In [None]:
filtered = df[df['Polls Analyzed']>20]
plt.title("Histogram of the \'Polls Analyzed\'",fontsize=16)
plt.hist(filtered['Polls Analyzed'],color='orange',edgecolor='k')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

filtered =filtered[filtered['Polls Analyzed']<400]
custom_scatter(x='Polls Analyzed',
               y='Predictive Plus-Minus',
               pos = (200,-1),
               data=filtered)

In [None]:
df_scores = df[['Predictive Plus-Minus','Races Called Correctly',
                'Simple Average Error','Advanced Plus-Minus',
               'Numeric grade']]

In [None]:
sns.pairplot(data=df_scores,
             plot_kws={'color':'red','edgecolor':'k'},
             diag_kws={'color':'blue','edgecolor':'k'})

##### Filtering and sorting

In [None]:
df_sorted = df[df['Polls Analyzed']>50].sort_values(by=['Advanced Plus-Minus'])[:10]
df_sorted[['Pollster','Polls Analyzed','Advanced Plus-Minus','Partisan Bias','538 Grade']]

In [12]:
df['House Effect']

0       0.700506
1      -0.343495
2       0.632896
3      -0.526230
4      -0.199840
         ...    
512    11.670131
513     8.833333
514     4.727778
515    10.899998
516   -27.666667
Name: House Effect, Length: 517, dtype: float64