# Spark analysis with MLlib and linear regression

In this notebook I use Spark's MLlib package to perform linear regression on the parliamentary data I have collected. The analysis uses two datasets: the first contains each MP's loyalty score, and the second contains the three major parties' total loyalty scores per vote.


I will use linear regression on the above datasets to test a set of three hypotheses:

1. A relationship exists between the voting loyalty of the Labour party and the Conservative party
2. A relationship exists between the voting loyalty of the Labour party and the Liberal Democrat party
3. A relationship exists between the loyalty score of an MP and their number of years in service


In [10]:
# Importing libraries.
from pyspark import SparkContext, SparkConf
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
import pandas as pd
import matplotlib.pyplot as plt

In [11]:
# Create the Spark context and the SQL context, which lets us work with Spark DataFrames.
conf = SparkConf().setAppName('Big_Data')
sc = SparkContext(conf=conf)
sq = SQLContext(sc)

In [12]:
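# Reading in CSV data for analysis.
# on_bad_lines='warn' skips malformed rows with a warning; it replaces
# error_bad_lines=False, which was removed in pandas 2.0 (use the old flag on pandas < 1.3).
party_loyalty = pd.read_csv("DATA/VOTES-PARTY-LOYALTY.csv", sep=',', index_col=0, on_bad_lines='warn')
mp_scores = pd.read_csv("DATA/MP_ID_SCORES.csv", sep=',', index_col=0, on_bad_lines='warn')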

# Checking data from mp_scores
mp_scores.head(1)

Unnamed: 0,member_id,score,constituency,date_of_birth,days_service,first_start_date,gender,list_name,party
0,337,0.967858,Harborough,1952-10-26,4693.0,1992-04-09,M,"Garnier, Sir Edward",Conservative


In [13]:
# Checking data from party_loyalty
party_loyalty.head(1)

Unnamed: 0,uin,lab_score,con_score,ld_score,total,date,title
0,CD:2001-07-04:10,1.0,1.0,1.0,1.0,2001-07-04,European Communities (Amendment) Bill (Programme)


In [15]:
# Creating Spark DataFrames from the pandas DataFrames.
spark_mps_scores = sq.createDataFrame(mp_scores)
spark_party_scores = sq.createDataFrame(party_loyalty)

In [16]:
# Split data into training and test datasets.
party_test,party_train = spark_party_scores.randomSplit([0.3,0.7], seed=4)
mp_test,mp_train = spark_mps_scores.randomSplit([0.3,0.7], seed=4)
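
The split above returns its pieces in the same order as the weights, so the 0.3 share goes to the test set and the 0.7 share to the training set. The sketch below illustrates that idea in plain Python. It is not Spark's implementation: `random_split` and the `toy_*` names are hypothetical, and the rows are made-up integers rather than the parliamentary data. As with Spark's `randomSplit`, the resulting sizes are only approximately proportional to the weights.

```python
import random


def random_split(rows, weights, seed):
    """Split rows into len(weights) lists; output order follows weight order."""
    total = sum(weights)
    # Cumulative upper bounds after normalising the weights to sum to 1.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        # Assign the row to the first bucket whose upper bound exceeds the draw.
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


# Made-up rows, purely for illustration.
toy_rows = list(range(1000))
toy_test, toy_train = random_split(toy_rows, [0.3, 0.7], seed=4)
print(len(toy_test), len(toy_train))
```

Because each row is assigned by an independent random draw, the test set here holds roughly, not exactly, 30% of the rows.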

In [17]:
# Creating Spark feature vectors for the model inputs.

# Using the Conservative and Liberal Democrat scores as separate features.
VectorCon = VectorAssembler(inputCols = ["con_score"], outputCol = "features")
VectorLD = VectorAssembler(inputCols = ["ld_score"], outputCol = "features")

# Using days service as a feature from the mp dataset.
VectorDS = VectorAssembler(inputCols = ["days_service"], outputCol = "features")

In [18]:
# Creating an object to define the linear model.

# Model with the Labour score as the label, i.e. the value being predicted.
lr = LinearRegression(predictionCol="predicted_lab_score", labelCol="lab_score", featuresCol="features", regParam=0.1)
# Model with the MP loyalty score as the label, i.e. the value being predicted.
lr2 = LinearRegression(predictionCol="predicted_score", labelCol="score", featuresCol="features", regParam=0.1)
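
Both models set `regParam=0.1` and leave `elasticNetParam` at its default of 0, which corresponds to an L2 (ridge) penalty. As a rough sketch of what that penalty does with a single feature, the code below fits a one-variable ridge regression in closed form. It assumes the objective `1/(2n) * sum((y - w*x - b)^2) + (reg_param/2) * w^2` with an unpenalised intercept, and it ignores the standardisation Spark applies internally, so the numbers would not match Spark's output exactly. `fit_ridge_1d` and the `toy_*` values are hypothetical and not taken from the parliamentary data.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, reg_param):
    """Return (slope, intercept) of a one-feature ridge fit; reg_param=0 gives ordinary least squares."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    # The penalty adds reg_param to the denominator, shrinking the slope towards 0.
    slope = cov_xy / (var_x + reg_param)
    # The intercept is not penalised, so the line still passes through the means.
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up values, purely for illustration.
toy_x = [0.2, 0.4, 0.6, 0.8, 1.0]
toy_y = [0.5, 0.6, 0.7, 0.8, 0.9]

ols_slope, ols_intercept = fit_ridge_1d(toy_x, toy_y, reg_param=0.0)
ridge_slope, ridge_intercept = fit_ridge_1d(toy_x, toy_y, reg_param=0.1)
print(ols_slope, ols_intercept)
print(ridge_slope, ridge_intercept)
```

With a penalty, the slope shrinks and the intercept moves towards the mean of the label, which is worth keeping in mind when reading the intercept as a "baseline effect".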

In [20]:
# Combine feature and linear model object to pipeline.
lrPipeCon = Pipeline(stages=[VectorCon,lr])
lrPipeLD = Pipeline(stages=[VectorLD,lr])
lrPipeDS = Pipeline(stages=[VectorDS,lr2])

In [21]:
# Using the fit function to execute pipeline on training data.
lrModelCon = lrPipeCon.fit(party_train)
lrModelLD = lrPipeLD.fit(party_train)
lrModelDS = lrPipeDS.fit(mp_train)

In [22]:
# Intercept (baseline effect) of each fitted model; stages[1] is the fitted linear regression stage.
interCon = lrModelCon.stages[1].intercept
interLD = lrModelLD.stages[1].intercept
interDS = lrModelDS.stages[1].intercept

In [23]:
# With a fit model, we are able to make some predictions using our held-out test data:
predCon = lrModelCon.transform(party_test)
predLD = lrModelLD.transform(party_test)
predDS = lrModelDS.transform(mp_test)

In [26]:
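# Function to compute a regression metric ("rmse" or "r2") on a set of predictions.
def get_RMSE(predict_col, label_col, predict, metric):
    regEval = RegressionEvaluator(predictionCol=predict_col, labelCol=label_col, metricName=metric)
    # Despite the function's name, the returned value is whichever metric was requested.
    value = regEval.evaluate(predict)
    return value

# Calling the above function to retrieve the results.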
con_rmse = get_RMSE("predicted_lab_score", "lab_score", predCon, "rmse")
ld_rmse = get_RMSE("predicted_lab_score", "lab_score", predLD, "rmse")
ds_rmse = get_RMSE("predicted_score", "score", predDS, "rmse")
con_r2 = get_RMSE("predicted_lab_score", "lab_score", predCon, "r2")
ld_r2 = get_RMSE("predicted_lab_score", "lab_score", predLD, "r2")
ds_r2 = get_RMSE("predicted_score", "score", predDS, "r2")
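
To make the two metrics concrete, here is a minimal plain-Python sketch of the usual definitions: RMSE as the square root of the mean squared error, and r2 as one minus the ratio of the residual sum of squares to the total sum of squares. I am assuming these match what `RegressionEvaluator` computes. The functions `rmse` and `r2` and the `toy_*` values are hypothetical and not taken from the parliamentary data. The example also shows that r2 is 0 when predicting the mean label and becomes negative when the predictions are worse than that.

```python
from math import sqrt
from statistics import mean


def rmse(labels, preds):
    """Root mean squared error between labels and predictions."""
    return sqrt(mean((y - p) ** 2 for y, p in zip(labels, preds)))


def r2(labels, preds):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = mean(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up values, purely for illustration.
toy_labels = [0.90, 0.95, 1.00, 0.85]
toy_mean_preds = [mean(toy_labels)] * len(toy_labels)  # always predict the mean label
toy_poor_preds = [0.95, 0.90, 0.85, 1.00]              # worse than predicting the mean

print(rmse(toy_labels, toy_poor_preds))
print(r2(toy_labels, toy_mean_preds))
print(r2(toy_labels, toy_poor_preds))
```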

In [30]:
def print_results(model, baseline, rmse, r2):
    print("RESULTS FOR THE MODEL PREDICTING "+model+" >>")
    print("The baseline effect is: "+str(baseline))
    print("The models RMSE is: "+str(rmse))
    print("The models r2 is: "+str(r2))    

## Metrics and findings

Looking at the metrics below, I can infer the following:

1. None of the three models has an r2 high enough to support rejecting the null hypotheses. The third model's r2 is slightly negative (about -0.003), which means days of service predicts the held-out loyalty scores no better than simply using the mean score.
2. The r2 value increases slightly from the first model (about 0.104) to the second (about 0.145). This could suggest a stronger relationship between Labour loyalty and Lib Dem loyalty than between Labour loyalty and Conservative loyalty.

In [31]:
print_results("LABOUR LOYALTY USING CONSERVATIVE LOYALTY", interCon, con_rmse, con_r2)

RESULTS FOR THE MODEL PREDICTING LABOUR LOYALTY USING CONSERVATIVE LOYALTY >>
The baseline effect is: 0.6552764482422851
The models RMSE is: 0.11918314383425618
The models r2 is: 0.10356915158700442


In [32]:
print_results("LABOUR LOYALTY USING LIB DEM LOYALTY", interLD, ld_rmse, ld_r2)

RESULTS FOR THE MODEL PREDICTING LABOUR LOYALTY USING LIB DEM LOYALTY >>
The baseline effect is: 0.7825971209409588
The models RMSE is: 0.11639597042427337
The models r2 is: 0.14500611230803995


In [33]:
print_results("MP LOYALTY USING DAYS SERVICE", interDS, ds_rmse, ds_r2)

RESULTS FOR THE MODEL PREDICTING MP LOYALTY USING DAYS SERVICE >>
The baseline effect is: 0.9323264195682996
The models RMSE is: 0.05512485525385629
The models r2 is: -0.0033908613087543227
