We will predict the win_probability of a congressional candidate from the forecast data, considering every feature that we think might influence the target value.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data/house_district_forecast.csv")

# Get to know the data

Let's see what our data looks like.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Let's take a look at the distribution of the "party" column.

In [None]:
df['party'].value_counts()

# Get to know your features

We are going to look at how the columns are skewed using histograms. After looking at the individual columns, we might want to do feature scaling or feature aggregation.

In [None]:
import matplotlib.pyplot as plt

In [None]:
dfplot = df.drop(columns=['special','incumbent'])

In [None]:
dfplot.info()

In [None]:
dfplot.hist(bins=50,figsize=(20,15))
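
Histograms show skew visually, and a numeric check can complement them. The sketch below uses synthetic data (the column names `right_skewed` and `symmetric` are invented for illustration) and pandas' `DataFrame.skew()`, where a positive value indicates a longer right tail and a negative value a longer left tail.

```python
import numpy as np
import pandas as pd

# Synthetic columns, not taken from the forecast dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# One skewness value per numeric column
skewness = demo.skew()
print(skewness)
```

The same call on `dfplot` would give one number per numeric column to read alongside the histograms.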

All we learn from the histograms is that a few of the columns are skewed to the left or have outliers. Since the district feature doesn't make sense on its own, we combine it with state.

In [None]:
df["state_district"] = df["state"].map(str)+"_"+df["district"].map(str)
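
As a small illustration of why the combined key helps, here is a sketch on a made-up three-row frame (the values are invented, not taken from the dataset). District number 1 exists in more than one state, so only the combined key identifies a race uniquely.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({"state": ["CA", "NY", "CA"], "district": [1, 1, 2]})
toy["state_district"] = toy["state"].astype(str) + "_" + toy["district"].astype(str)

# The raw district number repeats across states; the combined key does not
print(toy["district"].nunique())
print(toy["state_district"].nunique())
```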

In [None]:
df["voteshare"].value_counts()/len(df)

To categorize the voteshare data we create a new column called voteshare_cat. To keep the number of categories low, we divide voteshare by 10 and round up. We compute this column mainly so that we can use stratified sampling whenever we split the data.

In [None]:
import numpy as np
df["voteshare_cat"] = np.ceil(df["voteshare"]/10)
df["voteshare_cat"].value_counts()/len(df)
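
To make the binning concrete, the sketch below applies the same `np.ceil(x / 10)` rule to a few invented voteshare values. Each category k covers voteshares greater than 10(k-1) and up to 10k, so a voteshare of exactly 50 lands in category 5 while 50.1 lands in category 6.

```python
import numpy as np
import pandas as pd

# Invented voteshare percentages for illustration
voteshare = pd.Series([3.2, 10.0, 48.7, 50.0, 50.1, 97.5])

# Same rule as above: divide by 10 and round up
voteshare_cat = np.ceil(voteshare / 10)
print(voteshare_cat.tolist())
```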

# Splitting the data into training and test sets

We generally split the data in a 20:80 ratio between test and training sets. To keep the split stratified we use scikit-learn.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices, so select rows with iloc rather than loc
for train_index, test_index in split.split(df, df["voteshare_cat"]):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

In [None]:
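# Drop the helper column; reassigning avoids shadowing the built-in set
# and avoids an in-place drop on a slice of df
strat_train_set = strat_train_set.drop(columns=["voteshare_cat"])
strat_test_set = strat_test_set.drop(columns=["voteshare_cat"])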
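
One way to confirm that a split is stratified is to compare the category proportions in the test set with those in the full data. The sketch below does this on synthetic data; the `cat` column and its probabilities are invented for illustration and stand in for voteshare_cat.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data with an imbalanced category column
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "value": rng.normal(size=1000),
    "cat": rng.choice([1, 2, 3, 4], size=1000, p=[0.1, 0.2, 0.3, 0.4]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["cat"]))
test_set = toy.iloc[test_idx]

# Proportion of each category overall and in the test set
overall = toy["cat"].value_counts(normalize=True).sort_index()
in_test = test_set["cat"].value_counts(normalize=True).sort_index()
comparison = pd.DataFrame({"overall": overall, "test": in_test})
comparison["abs_diff"] = (comparison["overall"] - comparison["test"]).abs()
print(comparison)
```

The two proportion columns should be nearly identical, which is the point of stratifying.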
    

In [None]:
data_train = strat_train_set.copy()

In [None]:
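# Restrict the correlation to numeric columns (numeric_only needs pandas >= 1.5);
# string columns such as "party" cannot be correlated directly
correlation_matrix = data_train.corr(numeric_only=True)
data_train.info()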

# Correlation between features

We use the correlation matrix to decide which features to keep or eliminate, based on how strongly each one is correlated with our target value.

In [None]:
correlation_matrix["win_probability"].sort_values(ascending=False)
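
Keep in mind that this ranking only reflects linear relationships. The sketch below uses synthetic data (all column names are invented) to show a feature that depends on the target quadratically yet has a correlation close to zero. It assumes pandas 1.5 or later for the `numeric_only` argument, which also leaves the string column out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=2000)
toy = pd.DataFrame({
    "target": x,
    "linear": 2 * x + rng.normal(scale=0.1, size=2000),  # linear in target
    "quadratic": x ** 2,                                  # dependent, but not linearly
    "label": rng.choice(["a", "b"], size=2000),           # string column, excluded
})

corr_with_target = toy.corr(numeric_only=True)["target"].sort_values(ascending=False)
print(corr_with_target)
```

A low correlation is therefore a hint, not proof, that a feature is irrelevant.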

In [None]:
from pandas.plotting import scatter_matrix

# scatter_matrix only plots numeric columns, so string columns such as "party" are left out
attributes = ["win_probability","voteshare","p10_voteshare","p90_voteshare"]
scatter_matrix(data_train[attributes],figsize=(12,12))

We can see a strong correlation between a candidate's win probability and their voteshare in their district.

In [None]:
# Plot the training copy so the exploration does not look at the test set
data_train.plot(kind="scatter", x="voteshare", y="win_probability",alpha=0.1)
plt.show()

Only the "special" column contains null values, and it shows no correlation with win_probability, so we drop it.

In [None]:
# drop() returns a new dataframe, so assign the result back
data_train = data_train.drop("special",axis=1)

We will create two dataframes from the training set, one holding the predictors and one holding the labels, because we don't want to apply transformations to the target values. We also drop all the columns that appear irrelevant according to the correlation matrix.

# Predictors and labels

In [None]:
df_predictors = strat_train_set.drop(["win_probability","forecastdate","special","candidate","incumbent","model","p10_voteshare","p90_voteshare"],axis=1)
df_predictors_labels = strat_train_set["win_probability"].copy()

To make use of text data we need to convert it into numerical form. Here we convert state and party.

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder_party = LabelEncoder()
party_cat = df_predictors["party"]
party_cat_encoded = encoder_party.fit_transform(party_cat.astype(str))
party_cat_encoded

In [None]:
print(encoder_party.classes_)

In [None]:
encoder = LabelEncoder()
state_cat = df_predictors["state"]
state_cat_encoded = encoder.fit_transform(state_cat.astype(str))
state_cat_encoded

In [None]:
print(encoder.classes_)
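
LabelEncoder assigns integer codes in sorted order of the distinct values, and `inverse_transform` maps the codes back to the original labels. A minimal sketch with invented party labels:

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels for illustration
parties = ["R", "D", "I", "D", "R"]

enc = LabelEncoder()
codes = enc.fit_transform(parties)

print(list(enc.classes_))                     # distinct labels in sorted order
print(codes.tolist())                         # position of each label in classes_
print(enc.inverse_transform(codes).tolist())  # back to the original labels
```

The codes carry an ordering (here D < I < R) that has no real meaning for party or state, which matters more for linear models than for tree-based ones.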

We list the categorical columns so that we can transform them in the data pipeline. The non-numerical columns we don't need and the target column "win_probability" have already been removed. The only remaining categorical string columns are "party", "state_district", and "state".

In [None]:
cat_attributes = ["party","state_district","state"]

# Data pipeline

Generally, a pipeline can chain the different transformations that scikit-learn provides, but the data in this dataset seems to be pretty consistent, so we won't use an imputer or a standard scaler. We will, however, use LabelEncoder for the categorical string data. FeatureUnion from scikit-learn makes it simple to run multiple transformations on the data in parallel.

In [None]:
from sklearn.pipeline import FeatureUnion,make_pipeline
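
As a rough illustration of what FeatureUnion does, the sketch below applies two transformers to the same synthetic input and concatenates their outputs column-wise. The choice of StandardScaler and PCA is only for demonstration; this notebook does not use them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Synthetic numeric input with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Each transformer sees the same X; the results are stacked side by side
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=1)),
])
X_combined = union.fit_transform(X)
print(X_combined.shape)  # 3 scaled columns + 1 principal component
```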

Since we have multiple categorical columns, we need a label encoder that works across several columns. scikit-learn's LabelEncoder handles one column at a time and has no multi-column variant. The solution is influenced by this Stack Overflow answer (https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn). Check "data.py" for the implementation.

In [None]:
from data import MultiColumnLabelEncoder
df_predictors_prepared = MultiColumnLabelEncoder(cat_attributes).fit_transform(df_predictors)
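
Since "data.py" is not reproduced in this notebook, the following is only a sketch of what a multi-column encoder could look like, not the actual implementation. The class name `SketchMultiColumnEncoder` and its details are hypothetical. It keeps one fitted LabelEncoder per column so that the mapping learned on the training data can be reused on other data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class SketchMultiColumnEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: label-encode the given columns of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Learn one encoder per column and remember it
        self.encoders_ = {
            col: LabelEncoder().fit(X[col].astype(str)) for col in self.columns
        }
        return self

    def transform(self, X):
        # Apply the stored mappings; unseen categories raise a ValueError
        out = X.copy()
        for col in self.columns:
            out[col] = self.encoders_[col].transform(out[col].astype(str))
        return out


# Invented rows for illustration
train = pd.DataFrame({"party": ["R", "D", "I"], "state": ["NY", "CA", "TX"],
                      "voteshare": [40.0, 55.0, 5.0]})
test = pd.DataFrame({"party": ["R", "R"], "state": ["TX", "CA"],
                     "voteshare": [51.0, 49.0]})

encoder = SketchMultiColumnEncoder(["party", "state"]).fit(train)
train_encoded = encoder.transform(train)
test_encoded = encoder.transform(test)
print(test_encoded)
```

With the stored mapping, "R" keeps the code it had during training. Fitting a fresh encoder on `test` alone would map "R" to 0, because it would be the only party present.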

# Let's try out some models

Since we are predicting a numeric target value, this is a regression task. We can use linear regression, multivariate regression, decision trees, and random forests. Let's try out some.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linReg = LinearRegression()
linReg.fit(df_predictors_prepared,df_predictors_labels)

In [None]:
ex_data = df_predictors.iloc[:8]
ex_label = df_predictors_labels.iloc[:8]
# Reuse the rows that were encoded together with the full training set, so the
# category codes match the ones the model was fitted on
ex_prepared = df_predictors_prepared[:8]

In [None]:
print(linReg.predict(ex_prepared))

In [None]:
print(list(ex_label))

In [None]:
from sklearn.metrics import mean_squared_error
df_predictions = linReg.predict(df_predictors_prepared)
linMse = mean_squared_error(df_predictors_labels,df_predictions)
linRmse = np.sqrt(linMse)
linRmse
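
For reference, the RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as win_probability. The sketch below checks the manual formula against scikit-learn on a few invented numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented values for illustration
y_true = np.array([0.10, 0.50, 0.90, 1.00])
y_pred = np.array([0.20, 0.40, 0.70, 1.00])

# Manual formula: sqrt(mean((y_true - y_pred)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```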

We experimented with linear regression here. We took 8 samples, predicted their target values, and compared them with the actual values. There is a lot of discrepancy, so let's try decision trees.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtreeReg = DecisionTreeRegressor()
dtreeReg.fit(df_predictors_prepared,df_predictors_labels)

In [None]:
df_predictions = dtreeReg.predict(df_predictors_prepared)
dtreeMse = mean_squared_error(df_predictors_labels,df_predictions)
dtreeRmse = np.sqrt(dtreeMse)
dtreeRmse

That is surprisingly low. However, the error is measured on the same data the model was trained on, so the model may simply have overfit. Let's try random forests.
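
Comparing the training error with a cross-validated error is one common way to check for overfitting. The sketch below does so on synthetic data (the features and the noise level are invented); the same `cross_val_score` call could be applied to `df_predictors_prepared` and `df_predictors_labels`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a noisy target
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = X[:, 0] + rng.normal(scale=0.3, size=300)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))

# scikit-learn scorers follow "higher is better", hence the negated MSE
scores = cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores)
print(train_rmse, cv_rmse.mean())
```

An unpruned tree can memorize its training data, so a near-zero training RMSE next to a much larger cross-validated RMSE points to overfitting.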

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfReg = RandomForestRegressor()
rfReg.fit(df_predictors_prepared,df_predictors_labels)
df_predictions = rfReg.predict(df_predictors_prepared)
rfRegMse = mean_squared_error(df_predictors_labels,df_predictions)
rfRegRmse = np.sqrt(rfRegMse)
rfRegRmse

# Final test
Now it is time to evaluate on the test data. Let's finalize random forests as the model. Keep in mind that this might not be a real-life scenario, as you may want to tweak things until you find the best model.

In [None]:
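# Drop the same columns as for the training predictors so the feature sets match
X_test = strat_test_set.drop(["win_probability","forecastdate","special","candidate","incumbent","model","p10_voteshare","p90_voteshare"],axis=1)
y_test = strat_test_set["win_probability"].copy()
# Caveat: this fits the encoder again on the test set, so the integer codes only
# match the training codes if both sets contain the same categories
X_test_prepared = MultiColumnLabelEncoder(cat_attributes).fit_transform(X_test)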
final_predictions = rfReg.predict(X_test_prepared)
final_mse = mean_squared_error(y_test,final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse