### One-Dimensional Aggregation: Respondent Metadata

NB: The unit of analysis being aggregated here is the individual vote cast by participants.

NB: In the following, only the non-free-text variables are investigated (the free-text variables will be added once we have the final data).

### Variable Reminder:

* v_5: primary working area
* v_6: free text in case of "other" in v_5
* v_11: free text years of primary working area experience
* v_118-v_121: intensity of involvement in RE (v_118 most intense)
* v_12: CS degree
* v_14: team size
* v_15: class of systems in project scope
* v_16: free text in case of "other" in v_15
* v_19: free text industry sector
* v_124: country

In [1]:
exportdate = 20180327
projectname = 'repract'

The usual preparations...

In [2]:
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns, numpy as np
import datetime
now = datetime.datetime.now().strftime('%Y%m%d')

In [3]:
%matplotlib notebook

In [4]:
sns.set_style('darkgrid')

In [5]:
ratings = ['Essential', 'Worthwhile', 'Unimportant', 'Unwise']
teamsizes = ['Small (1-4)', 'Medium (5-10)', 'Larger (10-49)', 'Very large (50+)']

In [6]:
df = pd.read_csv(f'../analysis/{exportdate}{projectname}_votelist_with_respondentmeta.csv')
df['Vote'] = pd.Categorical(df['Vote'].values, categories=ratings)
df['v_14'] = pd.Categorical(df['v_14'].values, categories=teamsizes)
df.head(1)

Unnamed: 0,EvID,PaperID,Vote,v_5,v_6,v_11,v_12,v_14,v_15,v_16,v_19,v_124,v_118,v_119,v_120,v_121
0,2,10,Worthwhile,Other (please specify),Product Management Coach,10,No,Medium (5-10),Other (please specify),Customer facing software products,Wide range (from automotive supplier to insura...,Germany,quoted,quoted,quoted,not quoted


### By Country (v_124)

In [7]:
# Count votes per (country, rating), then spread the ratings into columns.
# pivot() takes keyword arguments only in pandas >= 2.0.
bycountry = df.groupby(['v_124', 'Vote']
            ).count().reset_index(
            ).pivot(index='v_124', columns='Vote', values='EvID'
            ).sort_index(ascending=False).fillna(0)
#bycountry['TotalVotes'] = bycountry.sum(axis=1)
bycountry.head(2)

v_124,Essential,Worthwhile,Unimportant,Unwise
Uruguay,2.0,5.0,5.0,3.0
United States,49.0,102.0,48.0,1.0


Counts

In [8]:
bycountry.plot.barh(stacked=True, cmap='bwr', figsize=(9,5), alpha=0.75)
plt.ylabel('')
plt.legend(title='')
plt.tight_layout()
plt.savefig(f'../graphics/{now}_Respondents_Country_abs.pdf')

Percentages

In [9]:
(bycountry.T / bycountry.T.sum()).T

v_124,Essential,Worthwhile,Unimportant,Unwise
Uruguay,0.133333,0.333333,0.333333,0.2
United States,0.245,0.51,0.24,0.005
United Kingdom,0.258824,0.517647,0.2,0.023529
Switzerland,0.224806,0.418605,0.325581,0.031008
Sweden,0.150794,0.436508,0.357143,0.055556
Spain,0.238938,0.39823,0.274336,0.088496
South Africa,0.3,0.4,0.233333,0.066667
Romania,0.2,0.266667,0.466667,0.066667
Portugal,0.464286,0.428571,0.107143,0.0
Norway,0.166667,0.566667,0.266667,0.0
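
The row-wise shares above come from dividing each row by its own total. As a minimal sketch of the same normalization, assuming a tiny made-up vote list (the `toy` frame below is hypothetical and not taken from the survey data), `DataFrame.div` with `axis=0` and `pd.crosstab(..., normalize='index')` should give the same result as the double-transpose idiom:

```python
import pandas as pd

# Hypothetical toy vote list: one row per vote, as in the real data.
toy = pd.DataFrame({
    'v_124': ['Norway', 'Norway', 'Norway', 'Norway', 'Spain', 'Spain'],
    'Vote': ['Essential', 'Worthwhile', 'Worthwhile', 'Unimportant',
             'Essential', 'Unwise'],
})

# Absolute vote counts per country and rating.
counts = pd.crosstab(toy['v_124'], toy['Vote'])

# Divide every row by its row total so that each country sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# crosstab can normalize per row directly.
shares_direct = pd.crosstab(toy['v_124'], toy['Vote'], normalize='index')

# The double-transpose idiom used in this notebook.
shares_transposed = (counts.T / counts.T.sum()).T
```

All three variants express the same per-country percentages; which one reads best is a matter of taste.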


In [10]:
(bycountry.T / bycountry.T.sum()).T.plot.barh(stacked=True, cmap='bwr', figsize=(10,6), alpha=0.75)
plt.legend([])
plt.xlim(0,1)
plt.xticks(np.arange(0,1.1,0.1))
plt.tight_layout();
plt.savefig(f'../graphics/{now}_Respondents_Country_rel.pdf')

### By Role (v_5)

NB: At this point, without coded answers from v_6

In [11]:
# Count votes per (role, rating); pivot() takes keyword arguments only in pandas >= 2.0.
byrole = df.groupby(['v_5', 'Vote']
            ).count().reset_index(
            ).pivot(index='v_5', columns='Vote', values='EvID'
            ).sort_index(ascending=False).fillna(0)

In [12]:
byrole.plot.barh(
    stacked=True, cmap='bwr', figsize=(12,6), alpha=0.75)
plt.legend(loc='center right');
plt.savefig(f'../graphics/{now}_Respondents_Role_abs.pdf')

In [13]:
(byrole.T / byrole.T.sum()).T.plot.barh(stacked=True, cmap='bwr', figsize=(12,6), alpha=0.75)
plt.xlim(0,1)
plt.xticks(np.arange(0,1.1,0.1))
plt.legend([]);
plt.savefig(f'../graphics/{now}_Respondents_Role_rel.pdf')

### By CS Degree (v_12)

In [19]:
# pivot() takes keyword arguments only in pandas >= 2.0.
csdegree = df.groupby(['v_12', 'Vote']).count()[['EvID']].reset_index(
).pivot(index='v_12', columns='Vote', values='EvID')
csdegree

v_12,Essential,Worthwhile,Unimportant,Unwise
No,120,242,153,16
Yes,401,761,395,76


In [27]:
csdegree.plot.bar(cmap='bwr', alpha=0.75, rot=0)
plt.xlabel('CS Degree?');
plt.legend(title='')
plt.savefig(f'../graphics/{now}_Respondents_CS_Degree_abs.pdf')

In [25]:
(csdegree.T / csdegree.T.sum()).T.plot.bar(stacked=True, cmap='bwr', alpha=0.75, rot=0)
plt.ylim(0,1)
plt.xlabel('CS Degree?')
plt.ylabel('Percentage of Votes')
plt.yticks(np.arange(0,1.1,0.1))
plt.legend(loc='center')
plt.savefig(f'../graphics/{now}_Respondents_CS_Degree_rel.pdf')

### By Team Size (v_14)

In [28]:
# pivot() takes keyword arguments only in pandas >= 2.0.
teamsize = df.groupby(['v_14', 'Vote']).count()[['EvID']].reset_index(
).pivot(index='v_14', columns='Vote', values='EvID')
teamsize

v_14,Essential,Worthwhile,Unimportant,Unwise
Small (1-4),102,167,88,20
Medium (5-10),211,433,261,39
Larger (10-49),153,316,141,22
Very large (50+),55,87,58,11


In [30]:
teamsize.plot.bar(rot=0, cmap='bwr', alpha=0.75)
plt.xlabel('Team Size')
plt.legend(title='')
plt.ylabel('Vote Count');
plt.savefig(f'../graphics/{now}_Respondents_Team_Size_abs.pdf')

In [31]:
(teamsize.T / teamsize.T.sum()).T

v_14,Essential,Worthwhile,Unimportant,Unwise
Small (1-4),0.270557,0.442971,0.233422,0.05305
Medium (5-10),0.223517,0.458686,0.276483,0.041314
Larger (10-49),0.242089,0.5,0.223101,0.03481
Very large (50+),0.260664,0.412322,0.274882,0.052133


In [32]:
(teamsize.T / teamsize.T.sum()).T.plot.barh(stacked=True, cmap='bwr', figsize=(10,6))
plt.legend([])
plt.xlim(0,1)
plt.xticks(np.arange(0,1.1,0.1))
plt.ylabel('Team Size')
plt.tight_layout();
plt.savefig(f'../graphics/{now}_Respondents_Team_Size_rel.pdf')

### By Class of Systems (v_15)

NB: At this point, without coded answers from v_16

In [33]:
# pivot() takes keyword arguments only in pandas >= 2.0.
systems = df.groupby(['v_15', 'Vote']).count()[['EvID']].reset_index(
).pivot(index='v_15', columns='Vote', values='EvID')
systems

v_15,Essential,Worthwhile,Unimportant,Unwise
(Business) information systems,251,476,246,49
Hybrid / mix of embedded systems and information systems,112,230,141,24
Other (please specify),33,63,56,7
Software-intensive embedded systems,125,234,105,12


In [36]:
systems.plot.bar(cmap='bwr', rot=0, alpha=0.75)
plt.xlabel('Class of System')
plt.ylabel('Vote Count');
plt.legend(title='')
plt.xticks(np.arange(4), 
           ['Business', 'Hybrid/Mix', 'Other', 'Embedded']);
plt.savefig(f'../graphics/{now}_Respondents_System_Class_abs.pdf')

In [37]:
(systems.T / systems.T.sum()).T

v_15,Essential,Worthwhile,Unimportant,Unwise
(Business) information systems,0.245597,0.465753,0.240705,0.047945
Hybrid / mix of embedded systems and information systems,0.220907,0.453649,0.278107,0.047337
Other (please specify),0.207547,0.396226,0.352201,0.044025
Software-intensive embedded systems,0.262605,0.491597,0.220588,0.02521


In [38]:
(systems.T / systems.T.sum()).T.plot.barh(stacked=True, cmap='bwr', figsize=(10,6), alpha=0.75)
plt.xlim(0,1)
plt.yticks(np.arange(4), ['Business Information', 'Hybrid', 'Other', 'Embedded', ])
plt.ylabel('Class of System')
plt.legend([])
plt.xticks(np.arange(0,1.1,0.1))
plt.tight_layout();
plt.savefig(f'../graphics/{now}_Respondents_System_Class_rel.pdf')

### Degree of Involvement in RE

NB: I have no idea why this was not implemented as a single-selection question.
Given the way participants answered, it is impossible to determine cleanly what their answers mean.
Some might have understood the question as asking them to tick only their 'maximum' involvement, seeing the first three levels in a 'nested/subset' relationship (especially due to the word 'extent' in the question).
Others might have ticked just one without seeing the subset relationship.
Still others might have ticked all degrees of involvement that apply.

In [39]:
reqvars = ['v_'+str(x) for x in range(118,122)]

Let's see whether the answers are at least internally consistent by looking at v_118-v_121 in their bitvector representations.

In [40]:
involvement = df[reqvars].replace(['quoted', 'not quoted'], [1,0]).stack()

In [41]:
involvement.head()

0  v_118    1
   v_119    1
   v_120    1
   v_121    0
1  v_118    1
dtype: int64

In [42]:
answers = [tuple(involvement[x].values) for x in involvement.index.levels[0]]
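The stack-and-regroup step above yields one tuple per vote row. As a minimal sketch of the same bitvector construction done row-wise, assuming a tiny made-up frame (the `toy` data below is hypothetical, and unlike the original `replace` call it treats anything other than 'quoted' as 0):

```python
import pandas as pd

reqvars = ['v_' + str(x) for x in range(118, 122)]

# Hypothetical toy rows using the same 'quoted' / 'not quoted' coding.
toy = pd.DataFrame(
    [['quoted', 'quoted', 'quoted', 'not quoted'],
     ['not quoted', 'not quoted', 'quoted', 'quoted']],
    columns=reqvars,
)

# 1 where the option was ticked ('quoted'), 0 otherwise.
bits = toy[reqvars].eq('quoted').astype(int)

# One plain tuple per row, in the column order v_118..v_121.
vectors = list(bits.itertuples(index=False, name=None))
```

This variant keeps one vector per row by construction and therefore does not depend on the row index being unique.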

In [43]:
set(answers)

{(0, 0, 0, 1),
 (0, 0, 1, 0),
 (0, 0, 1, 1),
 (0, 1, 0, 0),
 (0, 1, 1, 0),
 (1, 0, 0, 0),
 (1, 0, 1, 0),
 (1, 1, 0, 0),
 (1, 1, 1, 0)}

Of the different answer vectors, (0,0,1,1), (1,0,1,0), and (1,1,0,0) seem problematic in any case. Depending on the viewpoint, so do either the single-tick vectors (1,0,0,0), (0,1,0,0), (0,0,1,0) or the multi-tick vectors (1,1,1,0), (0,1,1,0), (1,1,0,0). It will be hard to derive anything meaningful from this...
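
One way to make the two viewpoints explicit is to encode each as a predicate and check the nine observed vectors against both. This is only a sketch under stated assumptions: the "single tick" reading accepts exactly one ticked option, and the "nested" reading assumes that ticking a more intense level among v_118-v_120 implies all less intense ones, and that v_121 excludes the other three. Neither rule comes from the questionnaire itself.

```python
# The nine answer vectors observed above, ordered (v_118, v_119, v_120, v_121).
observed = [
    (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1),
    (0, 1, 0, 0), (0, 1, 1, 0), (1, 0, 0, 0),
    (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0),
]

def single_tick(vec):
    """Assumed reading A: exactly one option was ticked."""
    return sum(vec) == 1

def nested_all_that_apply(vec):
    """Assumed reading B: a ticked level implies all less intense levels
    among v_118-v_120; v_121 is taken to exclude the other three."""
    a, b, c, d = vec
    if d:
        return (a, b, c) == (0, 0, 0)
    return (a, b, c) in {(1, 1, 1), (0, 1, 1), (0, 0, 1)}

# For each vector: (consistent under A, consistent under B).
consistency = {v: (single_tick(v), nested_all_that_apply(v)) for v in observed}

# Vectors that fit neither reading.
fits_neither = sorted(v for v, (a_ok, b_ok) in consistency.items()
                      if not a_ok and not b_ok)
print(fits_neither)  # [(0, 0, 1, 1), (1, 0, 1, 0), (1, 1, 0, 0)]
```

Under these assumed rules, the vectors fitting neither reading are the three flagged as problematic in any case.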

In [44]:
reqvectors = pd.DataFrame(df[['EvID', 'Vote']], copy=True)
reqvectors['InvolvementVector'] = answers
reqvectors.head()

Unnamed: 0,EvID,Vote,InvolvementVector
0,2,Worthwhile,"(1, 1, 1, 0)"
1,2,Unwise,"(1, 1, 1, 0)"
2,2,Essential,"(1, 1, 1, 0)"
3,2,Unimportant,"(1, 1, 1, 0)"
4,2,Unwise,"(1, 1, 1, 0)"


In [45]:
# pivot() takes keyword arguments only in pandas >= 2.0.
rvdf = reqvectors.groupby(['InvolvementVector', 'Vote']).count().fillna(0).reset_index(
).pivot(index='InvolvementVector', columns='Vote', values='EvID')
rvdf

InvolvementVector,Essential,Worthwhile,Unimportant,Unwise
"(0, 0, 0, 1)",4,13,9,4
"(0, 0, 1, 0)",52,58,20,2
"(0, 0, 1, 1)",1,11,1,2
"(0, 1, 0, 0)",84,152,67,16
"(0, 1, 1, 0)",58,114,50,9
"(1, 0, 0, 0)",160,334,209,30
"(1, 0, 1, 0)",30,40,30,2
"(1, 1, 0, 0)",66,131,63,13
"(1, 1, 1, 0)",66,150,99,14


In [46]:
rvdf.plot.bar(cmap='bwr', rot=30, alpha=0.75)
plt.legend(title='')
plt.ylabel('Vote Count');
plt.savefig(f'../graphics/{now}_Respondents_Involvement_Vector_abs.pdf')

In [47]:
(rvdf.T / rvdf.T.sum()).T.plot.barh(stacked=True, cmap='bwr', rot=0, alpha=0.75)
plt.xlim(0,1)
plt.xticks(np.arange(0,1.1,0.1))
plt.ylabel('')
plt.legend([]);
plt.savefig(f'../graphics/{now}_Respondents_Involvement_Vector_rel.pdf')

In my opinion, it would be unwise to look just at the maximum involvement (e.g., pooling (1,0,0,0), (1,0,1,0), (1,1,0,0) and (1,1,1,0)), as participants might well have understood the question differently...
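
To show concretely what such pooling would merge, here is a minimal sketch that groups the vote counts from the involvement-vector table by the first ticked option. The helper `first_ticked` and the frame name `counts` are hypothetical, introduced only for this illustration. It is not a recommended analysis.

```python
import pandas as pd

# Vote counts per involvement vector, copied from the table above.
counts = pd.DataFrame({
    'InvolvementVector': [
        (0, 0, 0, 1), (0, 0, 1, 0), (0, 0, 1, 1),
        (0, 1, 0, 0), (0, 1, 1, 0), (1, 0, 0, 0),
        (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    ],
    'Essential':   [4, 52, 1, 84, 58, 160, 30, 66, 66],
    'Worthwhile':  [13, 58, 11, 152, 114, 334, 40, 131, 150],
    'Unimportant': [9, 20, 1, 67, 50, 209, 30, 63, 99],
    'Unwise':      [4, 2, 2, 16, 9, 30, 2, 13, 14],
})

def first_ticked(vec):
    """Hypothetical helper: name of the first ticked option (v_118 first)."""
    return ['v_118', 'v_119', 'v_120', 'v_121'][vec.index(1)]

counts['MaxInvolvement'] = counts['InvolvementVector'].map(first_ticked)

# Pooled vote counts per maximum involvement.
pooled = counts.drop(columns='InvolvementVector').groupby('MaxInvolvement').sum()

# Which answer vectors end up in the same pool.
merged = counts.groupby('MaxInvolvement')['InvolvementVector'].apply(list)
```

The `v_118` pool combines four different answer patterns, which is exactly the mixing of possibly different question interpretations cautioned against above.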

The End.