# Disagreement in Crowdsourced Policymaking

## Factors that influence idea generation (w/o moderation)

This notebook contains the regression analysis conducted for the article *The Value of Disagreement in Crowdsourced Policymaking: Idea Generation Through Elaborated Perspectives*. Multivariate regression analysis is applied to explore which factors influence idea generation and to what extent.

## Content

- [Load libraries](#0.-Load-libraries)
- [Load data](#1.-Load-data)
- [Preprocess data](#2.-Preprocess-data)
- [Regression analysis](#3.-Regression-analysis)

## 0. Load libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

from collections import defaultdict
from statsmodels.formula.api import glm, ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

## 1. Load data

The data is split across three files, one per discussion topic.

### Topic: Members' decision-making and rights

In [26]:
member_df = pd.read_csv('data/member_decision_making_and_right.csv')
print(f'The dataset has {member_df.shape[0]} rows and {member_df.shape[1]} columns')

The dataset has 213 rows and 49 columns


### Topic: Administration of an association

In [27]:
admin_df = pd.read_csv('data/administration_of_association.csv')
print(f'The dataset has {admin_df.shape[0]} rows and {admin_df.shape[1]} columns')

The dataset has 119 rows and 49 columns


### Topic: Informally organized groups of an association

In [28]:
informal_df = pd.read_csv('data/informally_organized_group_of_a.csv')
print(f'The dataset has {informal_df.shape[0]} rows and {informal_df.shape[1]} columns')

The dataset has 159 rows and 51 columns


## 2. Preprocess data

### Normalize column names

Convert column names to lowercase, replace spaces and punctuation (`.`, `(`, `/`) with underscores, remove closing parentheses, and strip leading and trailing whitespace.

In [29]:
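def normalize_column_names(names):
    # Strip surrounding whitespace first; otherwise it would be turned into underscores
    names = names.str.strip()
    names = names.str.lower()
    # regex=False treats '.', '(' and ')' as literal characters, not regex syntax
    names = names.str.replace(' ', '_', regex=False)
    names = names.str.replace('.', '_', regex=False)
    names = names.str.replace('(', '_', regex=False)
    names = names.str.replace(')', '', regex=False)
    names = names.str.replace('/', '_', regex=False)
    # Collapse the triple underscores produced by the replacements above
    names = names.str.replace('___', '_', regex=False)
    return names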

In [30]:
member_df.columns = normalize_column_names(member_df.columns)
admin_df.columns = normalize_column_names(admin_df.columns)
informal_df.columns = normalize_column_names(informal_df.columns)
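
The normalization rules can be checked on a few made-up column names before they are applied to the real data. The sketch below is self-contained; the helper `normalize_column_names_sketch` and the toy names are illustrative assumptions, not columns taken from the data files.

```python
import pandas as pd

def normalize_column_names_sketch(names: pd.Index) -> pd.Index:
    """Hypothetical stand-in that applies the same rules as the cell above."""
    names = names.str.strip().str.lower()
    replacements = [(' ', '_'), ('.', '_'), ('(', '_'), (')', ''), ('/', '_'), ('___', '_')]
    for old, new in replacements:
        # regex=False: every pattern is matched literally
        names = names.str.replace(old, new, regex=False)
    return names

# Made-up raw headers, chosen to exercise each rule
toy = pd.Index([' Number of Likes ', 'Idea(s)', 'Position/Stance'])
normalized = normalize_column_names_sketch(toy)
print(list(normalized))
```

Names such as `idea_s` and `gives_reason_s` in the final column list follow this pattern: an opening parenthesis becomes an underscore and the closing one is dropped.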

### Standardize column names

Columns that contain the same information but have slightly different names are renamed for compatibility.

In [31]:
member_df = member_df.rename(columns={
    'annotations_for_disagreement_new_idea_cascade_s': 'annotations_for_disagreement_new_idea_cascades'
})

In [32]:
admin_df = admin_df.rename(columns={
    'amount_of_likes': 'number_of_likes',
    'comment\'s_id': 'comment_id',
    'proposal': 'proposals',    
})

In [10]:
informal_df = informal_df.rename(columns={
    'annotations_for_disagreement_new_idea_cascade_s': 'annotations_for_disagreement_new_idea_cascades',
    'proposal': 'proposals'
})

### Merge datasets in one dataframe

Add an extra column to indicate the discussion topic

In [33]:
member_df['topic'] = 'member'
admin_df['topic'] = 'admin'
informal_df['topic'] = 'informal'

#### Merge datasets

In [34]:
all_df = pd.concat([member_df, admin_df, informal_df], axis=0, ignore_index=True)
print(f'The merged dataset has {all_df.shape[0]} rows and {all_df.shape[1]} columns')

The merged dataset has 491 rows and 55 columns
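
The merged frame has more columns than any single input because `pd.concat` along `axis=0` keeps the union of all column names and fills the gaps with missing values. Any column whose name still differs between files (for example, both `proposal` and `proposals` appear in the final column list) therefore ends up as two partly empty columns. A minimal sketch with toy frames, whose names and values are illustrative only:

```python
import pandas as pd

# Toy frames standing in for two topic datasets (contents are made up)
left = pd.DataFrame({'user_id': ['a', 'b'], 'proposals': ['p1', 'p2']})
right = pd.DataFrame({'user_id': ['c'], 'irrpolicy': [1]})

merged = pd.concat([left, right], axis=0, ignore_index=True)

print(merged.shape)           # rows are stacked, columns are the union of names
print(merged.isnull().sum())  # nulls appear where a frame lacked a column
```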


### Remove columns containing text in Finnish

Comment and response texts are available in both Finnish and English. The columns `comment` and `response`, which hold the Finnish text, are removed because they are not used in this analysis.

In [35]:
all_df = all_df.drop(['comment', 'response'], axis=1)

The columns `comment_1` and `response_1` are renamed by dropping the `_1` suffix.

In [36]:
all_df = all_df.rename(columns={'comment_1': 'comment', 'response_1': 'response'})

### Check null values

Count the null values in each column.

In [37]:
all_df.isnull().sum()

background                                           3
proposals                                          161
time                                                 3
user_id                                              3
comment_id                                         346
response_id                                        148
number_of_likes                                      0
users_who_liked                                    263
attachments                                        491
comment                                            348
response                                           146
topic_1                                              4
topic_2                                            147
topic_3                                            358
disagreement                                         0
agreement                                            0
simple_disagreement                                  0
elaborated_disagreement                              0
simple_agr

### Remove summary rows

Remove rows that contain summaries. They are identified by a null value in the column `background`.

In [38]:
idxs_to_remove = all_df[all_df.background.isnull()].index.values
all_df = all_df.drop(index=idxs_to_remove)
print(f'Data set size after removing summary rows. Rows: {all_df.shape[0]}, Columns: {all_df.shape[1]}')

Data set size after removing summary rows. Rows: 488, Columns: 53


### Fix errors in variables

There were labeling errors in rows `336` and `310`; they are fixed below.

In [39]:
all_df.loc[336, 'simple_agreement'] = 0
all_df.loc[310, 'elaborated_agreement'] = 1

### Fix values in the column `number_of_ideas`

In [40]:
all_df.loc[all_df['number_of_ideas']=='unclear', 'number_of_ideas'] = 0
all_df.loc[all_df['number_of_ideas'].isna(), 'number_of_ideas'] = 0
all_df['number_of_ideas'] = pd.to_numeric(all_df['number_of_ideas'], downcast='unsigned')
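
The same cleanup can be written as a single chain. The sketch below is an alternative, not the notebook's method: `errors='coerce'` turns every non-numeric entry into `NaN`, not only the literal `unclear`, so it assumes that no other text values need special handling. The toy series is made up.

```python
import pandas as pd

# Made-up values mimicking the raw column: numbers as text, 'unclear', and a null
raw = pd.Series(['2', 'unclear', None, '0', '1'], name='number_of_ideas')

# Non-numeric entries become NaN, NaN becomes 0, then cast to a small unsigned int
ideas = pd.to_numeric(raw, errors='coerce').fillna(0).astype('uint8')
print(ideas.tolist())
```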

### Set value `unclear` of column `gives_reason_s` to `0`

In [41]:
all_df.loc[all_df['gives_reason_s']=='unclear', 'gives_reason_s'] = '0'

### Remove moderator comments

In [44]:
print(f'There are {len(all_df[all_df["moderator_post"]==1])} comments that will be removed')

There are 120 comments that will be removed


In [45]:
all_df = all_df[all_df['moderator_post']!=1]

### Show final columns

In [46]:
all_df.columns

Index(['background', 'proposals', 'time', 'user_id', 'comment_id',
       'response_id', 'number_of_likes', 'users_who_liked', 'attachments',
       'comment', 'response', 'topic_1', 'topic_2', 'topic_3', 'disagreement',
       'agreement', 'simple_disagreement', 'elaborated_disagreement',
       'simple_agreement', 'elaborated_agreement', 'idea_s', 'number_of_ideas',
       'new_idea', 'sourcing', 'value_s', 'topic_shift', 'brainstorming',
       'blending', 'building', 'broadening', 'fact', 'value', 'policy',
       'interpretation', 'target_of_disagreement', 'target_of_agreement',
       'gives_reason_s', 'presents_evidence', 'asks_question_s',
       'provides_information', 'clarifies_position_stance',
       'responds_to_previous_comment', 'constructive_tone', 'moderator_post',
       'acknowledges_problem', 'notes',
       'annotations_for_disagreement_new_idea_cascades', 'topic', 'proposal',
       'irrpolicy', 'irrinterpretation', 'irrconstructive_tone',
       'annotations_for

### Show a small sample

In [47]:
all_df.head()

Unnamed: 0,background,proposals,time,user_id,comment_id,response_id,number_of_likes,users_who_liked,attachments,comment,...,moderator_post,acknowledges_problem,notes,annotations_for_disagreement_new_idea_cascades,topic,proposal,irrpolicy,irrinterpretation,irrconstructive_tone,annotations_for_disagreement_new_idea_cascade_s
1,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T07:08:18+00:00,5cf60f17d8f1250a070160ee,,5cf618e22878cf073b0eafca,1,karin.rinne@netti.fi,,,...,0,0,,,member,,,,,
2,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T10:39:32+00:00,5cf6379534204f3a8d121027,,5cf64a64d8f1253e5a242b5b,0,,,,...,0,0,,,member,,,,,
3,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T11:34:33+00:00,5cf63071d8f12537d5632b3e,,5cf657492878cf36b01b767b,0,,,,...,0,1,,,member,,,,,
7,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-04T16:31:27+00:00,5cf63071d8f12537d5632b3e,,5cf69cdfd8f12530a35e8e7c,2,"arto.paeivinen@gmail.com,stina.koivisto@eslu.fi",,,...,0,1,,,member,,,,,
9,Members’ decision-making and rights\n\nIn this...,Proposal: Allow association members’ decision-...,2019-06-07T12:01:58+00:00,5cf63071d8f12537d5632b3e,,5cfa52362878cf5c9946913f,0,,,,...,0,0,,,member,,,,,


### Show final dataset dimensions

In [48]:
print(f'The final dataset has a dimension of {all_df.shape[0]} rows and {all_df.shape[1]} columns')

The final dataset has a dimension of 368 rows and 53 columns


## 3. Regression analysis

### Select variables

Variable selection is based on the literature and goals of the paper.

In [49]:
independent_vars = ['simple_agreement', 'elaborated_agreement', 'simple_disagreement', 'elaborated_disagreement',
                    'gives_reason_s', 'presents_evidence']
print(f"In total {len(independent_vars)} independent variables will be considered in the analysis")

In total 6 independent variables will be considered in the analysis


Select columns that include independent and dependent variables

In [50]:
analysis_df = all_df.loc[:,independent_vars + ['number_of_ideas']]
print(f"The analysis is conducted with a dataset composed of " \
      f"{analysis_df.shape[0]} rows and {analysis_df.shape[1]} columns")

The analysis is conducted with a dataset composed of 368 rows and 7 columns


### Check data consistency

No comment should be labeled with both `simple_agreement` and `elaborated_agreement`.

In [51]:
analysis_df[(analysis_df['simple_agreement']==1)&(analysis_df['elaborated_agreement']==1)].shape[0]

0

No comment should be labeled with both `simple_disagreement` and `elaborated_disagreement`.

In [52]:
analysis_df[(analysis_df['simple_disagreement']==1)&(analysis_df['elaborated_disagreement']==1)].shape[0]

0
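
Both checks follow the same pattern, so they could be wrapped in a small helper. This is a sketch only; `count_overlap` is a hypothetical name and the toy frame deliberately contains one inconsistent row.

```python
import pandas as pd

def count_overlap(df: pd.DataFrame, col_a: str, col_b: str) -> int:
    """Number of rows where both binary label columns equal 1."""
    return int((df[col_a].eq(1) & df[col_b].eq(1)).sum())

# Toy labels: the last row violates the mutual-exclusivity rule
toy = pd.DataFrame({
    'simple_agreement':     [1, 0, 0, 1],
    'elaborated_agreement': [0, 1, 0, 1],
})
overlap = count_overlap(toy, 'simple_agreement', 'elaborated_agreement')
print(overlap)
```

A result of `0` on the real data, as in the two cells above, means the labels are mutually exclusive.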

### Cast variables to numeric

Before modeling, the independent variables must be checked for multicollinearity, that is, strong linear relationships among them. The Variance Inflation Factor (VIF) is a common diagnostic for this. The Python implementation of VIF requires numeric variables, so the independent variables are cast to numeric here.

In [53]:
analysis_df.columns

Index(['simple_agreement', 'elaborated_agreement', 'simple_disagreement',
       'elaborated_disagreement', 'gives_reason_s', 'presents_evidence',
       'number_of_ideas'],
      dtype='object')

In [54]:
analysis_df[analysis_df['gives_reason_s']=='unclear']

Unnamed: 0,simple_agreement,elaborated_agreement,simple_disagreement,elaborated_disagreement,gives_reason_s,presents_evidence,number_of_ideas


In [55]:
analysis_df = analysis_df.apply(pd.to_numeric)

### Check for multicollinearity

As a rule of thumb, VIF values above 5 indicate high multicollinearity.

In [56]:
vif = pd.DataFrame()
vif["Variable"] = analysis_df[independent_vars].columns
vif["VIF"] = [variance_inflation_factor(analysis_df[independent_vars].values, i) 
              for i in range(analysis_df[independent_vars].shape[1])]
vif

Unnamed: 0,Variable,VIF
0,simple_agreement,1.055298
1,elaborated_agreement,2.094704
2,simple_disagreement,1.024406
3,elaborated_disagreement,2.944082
4,gives_reason_s,4.609584
5,presents_evidence,1.114999


All VIF values are below 5 (the highest, `gives_reason_s`, is 4.61), so there is no sign of high multicollinearity among the independent variables.
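
For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The NumPy sketch below illustrates the definition on synthetic data. It adds an intercept column to each auxiliary regression, which is an assumption of the sketch; the cell above passes the predictor matrix without a constant column, so its values need not match what this version would give on the same data. The helper name `vif_with_intercept` and the data are hypothetical.

```python
import numpy as np

def vif_with_intercept(X: np.ndarray) -> np.ndarray:
    """VIF per column, with an intercept in each auxiliary regression."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on a constant plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r_squared = 1.0 - resid.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic predictors: c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)  # a and c get large values, b stays close to 1
```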

### Cast independent variables to category

In [57]:
cast = {}
for independent_var in independent_vars:
    cast[independent_var] = 'category'
analysis_df = analysis_df.astype(cast)

Check variable types

In [58]:
analysis_df.dtypes

simple_agreement           category
elaborated_agreement       category
simple_disagreement        category
elaborated_disagreement    category
gives_reason_s             category
presents_evidence          category
number_of_ideas               uint8
dtype: object

### Fit model

Iterate over independent variables to create formula.

In [59]:
formula = f'number_of_ideas ~ '
for idx, independent_var in enumerate(independent_vars):
    formula += f' C({independent_var})'
    if idx < (len(independent_vars)-1):
        formula += ' + '
print(f'Model formula:\n{formula}')

Model formula:
number_of_ideas ~  C(simple_agreement) +  C(elaborated_agreement) +  C(simple_disagreement) +  C(elaborated_disagreement) +  C(gives_reason_s) +  C(presents_evidence)
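
The same formula can be assembled without an index-tracking loop by joining the terms. This is an equivalent sketch, not the cell that produced the output above; the resulting string differs only in having single spaces around each `+`, which should make no difference to how the formula is parsed.

```python
independent_vars = ['simple_agreement', 'elaborated_agreement',
                    'simple_disagreement', 'elaborated_disagreement',
                    'gives_reason_s', 'presents_evidence']

# Wrap each variable in C(...) so it is treated as categorical, then join with ' + '
terms = ' + '.join(f'C({var})' for var in independent_vars)
formula = f'number_of_ideas ~ {terms}'
print(formula)
```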


Build model with created formula

In [60]:
model = ols(formula, data = analysis_df).fit()

In [61]:
model.summary()

0,1,2,3
Dep. Variable:,number_of_ideas,R-squared:,0.253
Model:,OLS,Adj. R-squared:,0.241
Method:,Least Squares,F-statistic:,20.39
Date:,"Tue, 23 Nov 2021",Prob (F-statistic):,1.47e-20
Time:,15:52:01,Log-Likelihood:,-501.98
No. Observations:,368,AIC:,1018.0
Df Residuals:,361,BIC:,1045.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.8331,0.139,6.008,0.000,0.560,1.106
C(simple_agreement)[T.1],-0.3966,0.169,-2.343,0.020,-0.729,-0.064
C(elaborated_agreement)[T.1],0.3167,0.117,2.706,0.007,0.087,0.547
C(simple_disagreement)[T.1],0.0792,0.234,0.338,0.735,-0.381,0.540
C(elaborated_disagreement)[T.1],0.2091,0.119,1.763,0.079,-0.024,0.442
C(gives_reason_s)[T.1],0.7977,0.159,5.019,0.000,0.485,1.110
C(presents_evidence)[T.1],0.1164,0.194,0.599,0.549,-0.266,0.498

0,1,2,3
Omnibus:,119.027,Durbin-Watson:,1.807
Prob(Omnibus):,0.0,Jarque-Bera (JB):,414.888
Skew:,1.423,Prob(JB):,8.09e-91
Kurtosis:,7.354,Cond. No.,8.04


### Interpret significant coefficients

Of the six independent variables (i.e., `simple_agreement`, `elaborated_agreement`, `simple_disagreement`, `elaborated_disagreement`, `gives_reason_s`, and `presents_evidence`), three have a statistically significant effect on the dependent variable at the `0.05` alpha level: `simple_agreement`, `elaborated_agreement`, and `gives_reason_s`. The coefficients of these variables are interpreted next.

In [62]:
significant_coefficients = {}
model_variables = model.pvalues.index
alpha_level = 0.05
for idx, p_value in enumerate(model.pvalues):
    if model_variables[idx] == 'Intercept':
        continue
    if p_value < alpha_level:
        if 'C(' in model_variables[idx]:
            variable_name = model_variables[idx].split('[T')[0].replace('C(','').replace(')','')
        else:
            variable_name = model_variables[idx]
        significant_coefficients[variable_name] = model.params[model_variables[idx]]

#### Simple agreement

In [63]:
print(f"Discussions with simple agreement decrease the estimated value of new ideas by "\
      f"{abs(round(significant_coefficients['simple_agreement'],3))}")

Discussions with simple agreement decrease the estimated value of new ideas by 0.397


#### Elaborated agreement

In [64]:
print(f"Discussions with elaborated agreement increase the estimated value of new ideas by "\
      f"{round(significant_coefficients['elaborated_agreement'],3)}")

Discussions with elaborated agreement increase the estimated value of new ideas by 0.317


#### Gives reasons

In [67]:
print(f"Discussions where positions are justified increase the estimated value of new ideas by "\
      f"{round(significant_coefficients['gives_reason_s'],3)}")

Discussions where positions are justified increase the estimated value of new ideas by 0.798
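
Because every predictor is a 0/1 label, a fitted value is simply the intercept plus the coefficients of the labels that are present. The sketch below works through this arithmetic with the coefficients copied from the summary table above; the helper `predict_ideas` is a hypothetical name, and the numbers are point estimates that ignore the reported standard errors.

```python
# Coefficients copied from the OLS summary table
coefs = {
    'Intercept': 0.8331,
    'simple_agreement': -0.3966,
    'elaborated_agreement': 0.3167,
    'simple_disagreement': 0.0792,
    'elaborated_disagreement': 0.2091,
    'gives_reason_s': 0.7977,
    'presents_evidence': 0.1164,
}

def predict_ideas(**labels):
    """Intercept plus the coefficient of every label set to 1."""
    return coefs['Intercept'] + sum(coefs[name] for name, on in labels.items() if on)

baseline = predict_ideas()                                      # no labels set
reasoned_agreement = predict_ideas(elaborated_agreement=1, gives_reason_s=1)
simple_only = predict_ideas(simple_agreement=1)

print(round(baseline, 4), round(reasoned_agreement, 4), round(simple_only, 4))
```

Under this model, a comment that agrees in an elaborated way and gives reasons is expected to contain about 1.95 ideas, against about 0.44 for a comment that only expresses simple agreement.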
