# Disagreement in Crowdsourced Policymaking

## Factors that influence idea generation

This notebook contains the regression analysis conducted for the article *The Value of Disagreement in Crowdsourced Policymaking: Idea Generation Through Elaborated Perspectives*. Multivariate regression analysis is applied to explore which factors influence idea generation and to what extent.

## Content

- [Load libraries](#0.-Load-libraries)
- [Load data](#1.-Load-data)
- [Preprocess data](#2.-Preprocess-data)
- [Regression analysis](#3.-Regression-analysis)

## 0. Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

from collections import defaultdict
from statsmodels.formula.api import glm, ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

## 1. Load data

The data is split across three files, one per discussion topic.

### Topic: Members' decision-making and rights

In [2]:
member_df = pd.read_csv('data/member_decision_making_and_right.csv')
print(f'The dataset has {member_df.shape[0]} rows and {member_df.shape[1]} columns')

The dataset has 213 rows and 49 columns


### Topic: Administration of an association

In [3]:
admin_df = pd.read_csv('data/administration_of_association.csv')
print(f'The dataset has {admin_df.shape[0]} rows and {admin_df.shape[1]} columns')

The dataset has 119 rows and 49 columns


### Topic: Informally organized groups of an association

In [4]:
informal_df = pd.read_csv('data/informally_organized_group_of_a.csv')
print(f'The dataset has {informal_df.shape[0]} rows and {informal_df.shape[1]} columns')

The dataset has 159 rows and 51 columns


## 2. Preprocess data

### Normalize column names

Convert column names to lower case, replace spaces and punctuation (`.`, `(`, `/`) with underscores, drop closing parentheses, collapse triple underscores, and strip leading and trailing whitespace.

In [5]:
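def normalize_column_names(names):
    # regex=False treats '.', '(' and ')' as literal characters, not regex syntax
    names = names.str.lower()
    names = names.str.replace(' ', '_', regex=False)
    names = names.str.replace('.', '_', regex=False)
    names = names.str.replace('(', '_', regex=False)
    names = names.str.replace(')', '', regex=False)
    names = names.str.replace('/', '_', regex=False)
    names = names.str.replace('___', '_', regex=False)
    names = names.str.strip()
    return names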

In [6]:
member_df.columns = normalize_column_names(member_df.columns)
admin_df.columns = normalize_column_names(admin_df.columns)
informal_df.columns = normalize_column_names(informal_df.columns)
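
To make the effect of the normalization concrete, the sketch below applies the same replacement steps as `normalize_column_names` to a few column names. The raw headers are hypothetical (the original headers are not shown in this notebook); they were chosen so that the results match names that appear later. `regex=False` is passed explicitly so that punctuation is treated literally regardless of the pandas version.

```python
import pandas as pd

def normalize_column_names(names):
    # Same replacement steps as in the notebook, applied literally (no regex)
    names = names.str.lower()
    for old, new in [(' ', '_'), ('.', '_'), ('(', '_'), (')', ''), ('/', '_'), ('___', '_')]:
        names = names.str.replace(old, new, regex=False)
    return names.str.strip()

# Hypothetical raw headers, for illustration only
raw = pd.Index(['Gives reason(s)', 'Clarifies position/stance', 'Number of likes'])
normalized = normalize_column_names(raw)
print(list(normalized))
```

Note that the strip step runs last, so a header with a leading or trailing space would end up with an underscore there rather than being trimmed.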

### Standardize column names

Columns that contain the same information but have slightly different names across files are renamed so that the datasets can be merged.

In [7]:
member_df = member_df.rename(columns={
    'annotations_for_disagreement_new_idea_cascade_s': 'annotations_for_disagreement_new_idea_cascades'
})

In [8]:
admin_df = admin_df.rename(columns={
    'amount_of_likes': 'number_of_likes',
    'comment\'s_id': 'comment_id',
    'proposal': 'proposals',
})

In [9]:
informal_df = informal_df.rename(columns={
    'annotations_for_disagreement_new_idea_cascade_s': 'annotations_for_disagreement_new_idea_cascades',
    'proposal': 'proposals'
})

### Merge datasets in one dataframe

Add an extra column to indicate the discussion topic

In [10]:
member_df['topic'] = 'member'
admin_df['topic'] = 'admin'
informal_df['topic'] = 'informal'

#### Merge datasets

In [11]:
all_df = pd.concat([member_df, admin_df, informal_df], axis=0, ignore_index=True)
print(f'The merged dataset has {all_df.shape[0]} rows and {all_df.shape[1]} columns')

The merged dataset has 491 rows and 53 columns
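
The merged frame has more columns than any single file, which is consistent with how `pd.concat` works along `axis=0`: it keeps the union of the column names and fills the gaps with `NaN`. A minimal sketch with toy frames (the column names and values here are illustrative, not taken from the data files):

```python
import pandas as pd

# Toy frames whose columns only partly overlap
a = pd.DataFrame({'user_id': [1, 2], 'proposals': ['p1', 'p2'], 'topic': 'member'})
b = pd.DataFrame({'user_id': [3], 'proposals': ['p3'], 'extra_flag': [1], 'topic': 'informal'})

merged = pd.concat([a, b], axis=0, ignore_index=True)

# The result has the union of the columns; rows from `a` get NaN in 'extra_flag'
print(merged.shape)
print(merged['extra_flag'].isna().sum())
```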


### Remove columns containing text in Finnish

The text of comments and responses is available in both Finnish and English. The columns `comment` and `response`, which contain the Finnish text, are removed because they are not used in this analysis.

In [12]:
all_df = all_df.drop(['comment', 'response'], axis=1)

The columns `comment_1` and `response_1` are renamed by dropping the `_1` suffix.

In [13]:
all_df = all_df.rename(columns={'comment_1': 'comment', 'response_1': 'response'})

### Check null values

Count the null values in each column.

In [14]:
all_df.isnull().sum()

background                                          3
proposals                                           3
time                                                3
user_id                                             3
comment_id                                        346
response_id                                       148
number_of_likes                                     0
users_who_liked                                   263
attachments                                       491
comment                                           348
response                                          146
topic_1                                             4
topic_2                                           147
topic_3                                           358
disagreement                                        0
agreement                                           0
simple_disagreement                                 0
elaborated_disagreement                             0
simple_agreement            

### Remove summary rows

Remove rows that contain summaries. They are identified by a null value in the `background` column.

In [15]:
idxs_to_remove = all_df[all_df.background.isnull()].index.values
all_df = all_df.drop(index=idxs_to_remove)
print(f'Data set size after removing summary rows. Rows: {all_df.shape[0]}, Columns: {all_df.shape[1]}')

Data set size after removing summary rows. Rows: 488, Columns: 51
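
The same filter can be expressed with `DataFrame.dropna(subset=...)`. The sketch below uses a toy frame (illustrative values only) and shows that the original index labels are preserved, which matters because later cells address rows by label with `.loc`.

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row stands in for a summary row with no background text
toy = pd.DataFrame({'background': ['text', np.nan, 'text'],
                    'number_of_likes': [1, 0, 2]})

# Drop rows whose 'background' is null; other nulls are left alone
kept = toy.dropna(subset=['background'])
print(kept.index.tolist())
```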


### Fix errors in variables

Rows `336` and `310` contained labeling errors, which are corrected below.

In [16]:
all_df.loc[336, 'simple_agreement'] = 0
all_df.loc[310, 'elaborated_agreement'] = 1

### Fix values in the column `number_of_ideas`

In [17]:
all_df.loc[all_df['number_of_ideas']=='unclear', 'number_of_ideas'] = 0
all_df.loc[all_df['number_of_ideas'].isna(), 'number_of_ideas'] = 0
all_df['number_of_ideas'] = pd.to_numeric(all_df['number_of_ideas'], downcast='unsigned')
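
An equivalent cleanup can be written with `pd.to_numeric(..., errors='coerce')`, which turns every non-numeric entry into `NaN` in one step. This is a sketch on a toy series, not the notebook's data. Note that it maps any unparseable value to `0`, not only the literal string `unclear`, which is an assumption that should be checked against the actual column.

```python
import pandas as pd

# Toy column mixing numbers, the 'unclear' label, and a missing value
ideas = pd.Series(['1', 'unclear', None, 2])

# Unparseable entries become NaN, then NaN becomes 0
cleaned = pd.to_numeric(ideas, errors='coerce').fillna(0).astype('uint8')
print(cleaned.tolist())
```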

### Set value `unclear` of column `gives_reason_s` to `0`

In [18]:
all_df.loc[all_df['gives_reason_s']=='unclear', 'gives_reason_s'] = '0'

### Show final columns

In [19]:
all_df.columns

Index(['background', 'proposals', 'time', 'user_id', 'comment_id',
       'response_id', 'number_of_likes', 'users_who_liked', 'attachments',
       'comment', 'response', 'topic_1', 'topic_2', 'topic_3', 'disagreement',
       'agreement', 'simple_disagreement', 'elaborated_disagreement',
       'simple_agreement', 'elaborated_agreement', 'idea_s', 'number_of_ideas',
       'new_idea', 'sourcing', 'value_s', 'topic_shift', 'brainstorming',
       'blending', 'building', 'broadening', 'fact', 'value', 'policy',
       'interpretation', 'target_of_disagreement', 'target_of_agreement',
       'gives_reason_s', 'presents_evidence', 'asks_question_s',
       'provides_information', 'clarifies_position_stance',
       'responds_to_previous_comment', 'constructive_tone', 'moderator_post',
       'acknowledges_problem', 'notes',
       'annotations_for_disagreement_new_idea_cascades', 'topic', 'irrpolicy',
       'irrinterpretation', 'irrconstructive_tone'],
      dtype='object')

### Show a small sample

In [20]:
all_df.head()

Unnamed: 0,background,proposals,time,user_id,comment_id,response_id,number_of_likes,users_who_liked,attachments,comment,...,responds_to_previous_comment,constructive_tone,moderator_post,acknowledges_problem,notes,annotations_for_disagreement_new_idea_cascades,topic,irrpolicy,irrinterpretation,irrconstructive_tone
0,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-05-29T06:46:24+00:00,5cee25de2878cf678e79d737,5cee2ac02878cf4c8d260521,,1,nina-laakso@luukku.com,,"Association members’ decision-making, access t...",...,0,1,1,0,,,member,,,
1,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T07:08:18+00:00,5cf60f17d8f1250a070160ee,,5cf618e22878cf073b0eafca,1,karin.rinne@netti.fi,,,...,1,1,0,0,,,member,,,
2,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T10:39:32+00:00,5cf6379534204f3a8d121027,,5cf64a64d8f1253e5a242b5b,0,,,,...,1,1,0,0,,,member,,,
3,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T11:34:33+00:00,5cf63071d8f12537d5632b3e,,5cf657492878cf36b01b767b,0,,,,...,1,1,0,1,,,member,,,
4,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T12:44:16+00:00,5cee2ce1d8f125593074aeeb,,5cf667a034204f5af1758bd9,2,"valtteri.tervala@vanhempainliitto.fi,tanja.sal...",,,...,1,1,1,0,,,member,,,


### Show final dataset dimensions

In [21]:
print(f'The final dataset has a dimension of {all_df.shape[0]} rows and {all_df.shape[1]} columns')

The final dataset has a dimension of 488 rows and 51 columns


## 3. Regression analysis

### Select variables

Variable selection is based on the literature and goals of the paper.

In [22]:
independent_vars = ['simple_agreement', 'elaborated_agreement', 'simple_disagreement', 'elaborated_disagreement',
                    'gives_reason_s', 'presents_evidence']
print(f"In total {len(independent_vars)} independent variables will be considered in the analysis")

In total 6 independent variables will be considered in the analysis


Select columns that include independent and dependent variables

In [23]:
analysis_df = all_df.loc[:,independent_vars + ['number_of_ideas']]
print(f"The analysis is conducted with a dataset composed of " \
      f"{analysis_df.shape[0]} rows and {analysis_df.shape[1]} columns")

The analysis is conducted with a dataset composed of 488 rows and 7 columns


### Check data consistency

No comment should be labeled with both simple agreement and elaborated agreement at the same time.

In [24]:
analysis_df[(analysis_df['simple_agreement']==1)&(analysis_df['elaborated_agreement']==1)].shape[0]

0

No comment should be labeled with both simple disagreement and elaborated disagreement at the same time.

In [25]:
analysis_df[(analysis_df['simple_disagreement']==1)&(analysis_df['elaborated_disagreement']==1)].shape[0]

0
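
Both checks follow the same pattern, so they can be folded into one helper. The function name `count_overlap` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def count_overlap(df, col_a, col_b):
    """Return the number of rows where both binary columns equal 1."""
    return int(((df[col_a] == 1) & (df[col_b] == 1)).sum())

# Toy data: the last row carries both labels and so violates the rule
toy = pd.DataFrame({
    'simple_agreement':     [1, 0, 0, 1],
    'elaborated_agreement': [0, 1, 0, 1],
})
overlap = count_overlap(toy, 'simple_agreement', 'elaborated_agreement')
print(overlap)
```

On the real data, a result of `0` for each pair corresponds to the two outputs shown above.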

### Cast variables to numeric

Before fitting the model, it is necessary to check that the independent variables are not strongly correlated with one another. The Variance Inflation Factor (VIF) is one of the most common methods for detecting multicollinearity. The Python implementation of VIF requires numeric variables, so the independent variables are cast to numeric here.

In [26]:
analysis_df.columns

Index(['simple_agreement', 'elaborated_agreement', 'simple_disagreement',
       'elaborated_disagreement', 'gives_reason_s', 'presents_evidence',
       'number_of_ideas'],
      dtype='object')

In [27]:
analysis_df[analysis_df['gives_reason_s']=='unclear']

Unnamed: 0,simple_agreement,elaborated_agreement,simple_disagreement,elaborated_disagreement,gives_reason_s,presents_evidence,number_of_ideas


In [28]:
analysis_df = analysis_df.apply(pd.to_numeric)

### Check for multicollinearity

As a rule of thumb, VIF values above 5 indicate high multicollinearity.

In [29]:
vif = pd.DataFrame()
vif["Variable"] = analysis_df[independent_vars].columns
vif["VIF"] = [variance_inflation_factor(analysis_df[independent_vars].values, i) 
              for i in range(analysis_df[independent_vars].shape[1])]
vif

Unnamed: 0,Variable,VIF
0,simple_agreement,1.047498
1,elaborated_agreement,2.123538
2,simple_disagreement,1.018652
3,elaborated_disagreement,2.688071
4,gives_reason_s,4.351844
5,presents_evidence,1.128419


All VIF values are below 5, so there is no indication of problematic multicollinearity among the independent variables.
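
One caveat worth keeping in mind: `variance_inflation_factor` treats the matrix it receives as the full design matrix, so the auxiliary regressions only contain an intercept if a constant column is passed in. The sketch below computes the textbook VIF, `1 / (1 - R^2)`, with an intercept in each auxiliary regression, using NumPy only. The helper name `vif_with_intercept` and the synthetic predictors are assumptions for illustration; this is not the computation used for the table above.

```python
import numpy as np

def vif_with_intercept(X):
    """Textbook VIF: regress each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        # Centered R^2 of the auxiliary regression
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic binary predictors (not the study data):
# x2 copies x1 about 90% of the time, x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=500)
x2 = np.where(rng.random(500) < 0.9, x1, 1 - x1)
x3 = rng.integers(0, 2, size=500)

vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The correlated pair should show clearly inflated values, while the independent predictor stays close to 1.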

### Cast independent variables to category

In [30]:
cast = {}
for independent_var in independent_vars:
    cast[independent_var] = 'category'
analysis_df = analysis_df.astype(cast)

Check variable types

In [31]:
analysis_df.dtypes

simple_agreement           category
elaborated_agreement       category
simple_disagreement        category
elaborated_disagreement    category
gives_reason_s             category
presents_evidence          category
number_of_ideas               uint8
dtype: object

### Fit model

Iterate over the independent variables to build the model formula.

In [32]:
formula = f'number_of_ideas ~ '
for idx, independent_var in enumerate(independent_vars):
    formula += f' C({independent_var})'
    if idx < (len(independent_vars)-1):
        formula += ' + '
print(f'Model formula:\n{formula}')

Model formula:
number_of_ideas ~  C(simple_agreement) +  C(elaborated_agreement) +  C(simple_disagreement) +  C(elaborated_disagreement) +  C(gives_reason_s) +  C(presents_evidence)
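
The same formula can be assembled with `str.join`, which avoids the index bookkeeping and the doubled spaces. Whitespace is not significant in the formula syntax, so both strings describe the same model. A minimal sketch:

```python
independent_vars = ['simple_agreement', 'elaborated_agreement', 'simple_disagreement',
                    'elaborated_disagreement', 'gives_reason_s', 'presents_evidence']

# Wrap each predictor in C(...) so it is treated as categorical, then join with ' + '
formula = 'number_of_ideas ~ ' + ' + '.join(f'C({var})' for var in independent_vars)
print(formula)
```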


Build model with created formula

In [33]:
model = ols(formula, data = analysis_df).fit()

In [34]:
model.summary()

0,1,2,3
Dep. Variable:,number_of_ideas,R-squared:,0.227
Model:,OLS,Adj. R-squared:,0.218
Method:,Least Squares,F-statistic:,23.58
Date:,"Wed, 31 Mar 2021",Prob (F-statistic):,1.83e-24
Time:,12:34:25,Log-Likelihood:,-636.33
No. Observations:,488,AIC:,1287.0
Df Residuals:,481,BIC:,1316.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.8582,0.098,8.800,0.000,0.667,1.050
C(simple_agreement)[T.1],-0.3725,0.141,-2.649,0.008,-0.649,-0.096
C(elaborated_agreement)[T.1],0.3053,0.097,3.143,0.002,0.114,0.496
C(simple_disagreement)[T.1],0.0596,0.197,0.303,0.762,-0.328,0.447
C(elaborated_disagreement)[T.1],0.2832,0.096,2.952,0.003,0.095,0.472
C(gives_reason_s)[T.1],0.6153,0.120,5.114,0.000,0.379,0.852
C(presents_evidence)[T.1],-0.0360,0.155,-0.232,0.816,-0.340,0.268

0,1,2,3
Omnibus:,179.523,Durbin-Watson:,1.889
Prob(Omnibus):,0.0,Jarque-Bera (JB):,765.929
Skew:,1.606,Prob(JB):,4.79e-167
Kurtosis:,8.229,Cond. No.,7.54


### Interpret significant coefficients

Of the six independent variables, four have a statistically significant effect on the dependent variable at the `0.05` alpha level: `simple_agreement`, `elaborated_agreement`, `elaborated_disagreement`, and `gives_reason_s`. The coefficients of these variables are interpreted next.

In [35]:
significant_coefficients = {}
model_variables = model.pvalues.index
alpha_level = 0.05
for idx, p_value in enumerate(model.pvalues):
    if model_variables[idx] == 'Intercept':
        continue
    if p_value < alpha_level:
        if 'C(' in model_variables[idx]:
            variable_name = model_variables[idx].split('[T')[0].replace('C(','').replace(')','')
        else:
            variable_name = model_variables[idx]
        significant_coefficients[variable_name] = model.params[model_variables[idx]]
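
The same selection can be done with boolean indexing on pandas Series. The sketch below does not refit the model; it uses the coefficients and p-values as printed (rounded) in the summary table, so the two Series stand in for `model.params` and `model.pvalues`.

```python
import pandas as pd

# Coefficients and p-values copied from the summary table (rounded values)
terms = ['Intercept', 'C(simple_agreement)[T.1]', 'C(elaborated_agreement)[T.1]',
         'C(simple_disagreement)[T.1]', 'C(elaborated_disagreement)[T.1]',
         'C(gives_reason_s)[T.1]', 'C(presents_evidence)[T.1]']
params = pd.Series([0.8582, -0.3725, 0.3053, 0.0596, 0.2832, 0.6153, -0.0360], index=terms)
pvalues = pd.Series([0.000, 0.008, 0.002, 0.762, 0.003, 0.000, 0.816], index=terms)

alpha_level = 0.05
# Keep significant terms, excluding the intercept
keep = (pvalues < alpha_level) & (pvalues.index != 'Intercept')
significant = params[keep]

# Strip the formula decoration: 'C(simple_agreement)[T.1]' -> 'simple_agreement'
significant.index = significant.index.str.replace(r'^C\((.*)\)\[T\..*\]$', r'\1', regex=True)
print(significant.to_dict())
```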

#### Simple agreement

In [36]:
print(f"Discussions with simple agreement decrease the estimated number of new ideas by "\
      f"{round(abs(significant_coefficients['simple_agreement']),3)}")

Discussions with simple agreement decrease the estimated number of new ideas by 0.373


#### Elaborated agreement

In [37]:
print(f"Discussions with elaborated agreement increase the estimated number of new ideas by "\
      f"{round(significant_coefficients['elaborated_agreement'],3)}")

Discussions with elaborated agreement increase the estimated number of new ideas by 0.305


#### Elaborated disagreement

In [38]:
print(f"Discussions with elaborated disagreement increase the estimated number of new ideas by "\
      f"{round(significant_coefficients['elaborated_disagreement'],3)}")

Discussions with elaborated disagreement increase the estimated number of new ideas by 0.283


#### Gives reasons

In [39]:
print(f"Discussions where positions are justified increase the estimated number of new ideas by "\
      f"{round(significant_coefficients['gives_reason_s'],3)}")

Discussions where positions are justified increase the estimated number of new ideas by 0.615
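
Because every predictor is a 0/1 indicator, a fitted value is the intercept plus the coefficients of the indicators that are switched on. The sketch below does that arithmetic with the rounded coefficients from the summary table. `predict_ideas` is a hypothetical helper written for this illustration, not a statsmodels function.

```python
# Rounded coefficients copied from the OLS summary table
coef = {
    'Intercept': 0.8582,
    'simple_agreement': -0.3725,
    'elaborated_agreement': 0.3053,
    'simple_disagreement': 0.0596,
    'elaborated_disagreement': 0.2832,
    'gives_reason_s': 0.6153,
    'presents_evidence': -0.0360,
}

def predict_ideas(**flags):
    """Fitted number of ideas for the given 0/1 indicators (omitted ones count as 0)."""
    return coef['Intercept'] + sum(coef[name] * value for name, value in flags.items())

# Baseline: all indicators are 0, so the prediction equals the intercept
baseline = predict_ideas()

# Elaborated disagreement that also gives reasons
reasoned_disagreement = predict_ideas(elaborated_disagreement=1, gives_reason_s=1)

print(round(baseline, 4), round(reasoned_disagreement, 4))
```

Under this model, a post with elaborated disagreement and stated reasons is expected to carry about 1.76 ideas, compared with about 0.86 at the baseline.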
