#### Variables coded in this notebook: 18, 1373, 8345 et seq, 8780 et seq

NB: Here, multiple tags should be allowed, so we need long-form DataFrames.
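
A minimal sketch of what long form means here, using plain Python records rather than the notebook's data. The answers and tags below are made up for illustration, and `to_long` is a hypothetical helper; each (answer, tag) pair becomes its own row, so an answer with two tags appears twice.

```python
# Hypothetical answers, each with zero or more tags (illustrative data only).
wide = [
    {"PaperID": 9, "lfdn": 110, "tags": ["reason:relevance", "reason:plausibility"]},
    {"PaperID": 5, "lfdn": 152, "tags": ["NotAnswered"]},
    {"PaperID": 7, "lfdn": 101, "tags": []},
]

def to_long(records):
    """Return one (PaperID, lfdn, Tag) row per tag; untagged answers get an empty tag."""
    rows = []
    for rec in records:
        for tag in rec["tags"] or [""]:
            rows.append((rec["PaperID"], rec["lfdn"], tag))
    return rows

long_rows = to_long(wide)
for row in long_rows:
    print(row)
```

Rows in this shape can be passed to `pd.DataFrame(long_rows, columns=['PaperID', 'lfdn', 'Tag'])`.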

##### Maybe work out satisfiers, dissatisfiers and excitement factors (or what were they called?) as RE research relevance requirements?

In [2]:
import os, pandas as pd, re

In [3]:
exportdate = 20180327
projectname = 'repract'

In [4]:
df = pd.read_csv(f'../../data/{exportdate}{projectname}.csv')
df.head(2)

Unnamed: 0,lfdn,external_lfdn,tester,dispcode,lastpage,quality,duration,v_7039,v_7040,v_7041,...,output_mode,javascript,flash,session_id,language,cleaned,ats,datetime,date_of_last_access,date_of_first_mail
0,106,0,no tester,Completed after break (32),2138658,NotShown,-1,NotShown,NotShown,0,...,HTML,NotShown,NotShown,3bb21c1b318e2f6b87557566bdd6b4d9,English,Not cleaned,1515411510,2018-01-08 11:38:30,2018-01-08 13:07:14,0000-00-00 00:00:00
1,131,0,no tester,Completed (31),2138658,NotShown,3805,NotShown,NotShown,NotShown,...,HTML,NotShown,NotShown,fc38f6556787a459c2cc604abf799448,English,Not cleaned,1515667019,2018-01-11 10:36:59,2018-01-11 11:40:24,0000-00-00 00:00:00


In [5]:
basedir = '../../data/freetext'
# only read CSV exports; this skips stray files such as .DS_Store
freetextfiles = [file for file in os.listdir(basedir) if file.endswith('.csv')]
dfs = {file[:-4]:pd.read_csv(f'{basedir}/{file}') for file in freetextfiles}

In [6]:
dfs.keys()

dict_keys(['v_11', 'v_1373', 'v_16', 'v_18', 'v_19', 'v_6', 'v_8345etseq', 'v_8780etseq'])

In [112]:
codedir = '../../analysis/freetext'
def write_coded(df, varname, prelim=False):
    filepath = f'{codedir}/{varname}_coded{"_prelim" if prelim else ""}.csv'
    df.to_csv(filepath, index=False)
    print(f'File stored at {filepath}.')

General Procedure:
- Refine Regex to the point where further refinement would lead to drastic overfitting
- Assign Reasoning Tags via Automation
- Find Answers with no Tags and Assign Manually
- Verify all tags
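
The procedure above can be sketched in a few lines of plain Python. The answers and the two regexes are simplified stand-ins, not the ones used for coding, and `tag_answers` is a hypothetical helper; the point is only the split into automatically tagged answers and answers left for manual tagging.

```python
import re

# Simplified, hypothetical regexes and answers (not the study data).
regexes = {
    "reason:relevance": r"[Pp]roblem|[Ii]mporta",
    "reason:plausibility": r"[Hh]elp|[Ii]mprove",
}
answers = [
    "This is an important problem.",
    "It would help my team.",
    "Benchmarks are the start of everything.",
]

def tag_answers(answers, regexes):
    """Map each answer to the list of tags whose regex matches it."""
    return {a: [tag for tag, rx in regexes.items() if re.search(rx, a)] for a in answers}

tagged = tag_answers(answers, regexes)
# Answers without any automatic tag are the ones to code by hand.
needs_manual = [a for a, tags in tagged.items() if not tags]

print(tagged)
print(needs_manual)
```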

In [501]:
def assign_reasoning_tags(df, regexes):
    """Return a long-form copy of df with one row per (answer, matching tag)."""
    columns = list(df.columns.values) + ['Tag']
    rows = [row for _, row in df.iterrows()]
    # one frame per tag, holding every answer whose last column matches the tag's regex
    frames = [pd.DataFrame([list(row) + [tag] for row in rows if re.search(regex, row.iloc[-1])],
                           columns=columns)
              for tag, regex in regexes.items()]
    # answers shorter than 5 characters carry no real content
    frames.append(pd.DataFrame([list(row) + ['NotAnswered'] for row in rows if len(row.iloc[-1]) < 5],
                               columns=columns))
    newdf = pd.concat(frames).sort_values(['PaperID', 'lfdn'])
    # answers without any tag get an empty tag so they can be coded manually
    tagged = set(zip(newdf.PaperID.values, newdf.lfdn.values))
    unmatched = pd.DataFrame([list(row) + [''] for row in rows
                              if (row.iloc[0], row.iloc[1]) not in tagged],
                             columns=columns)
    return pd.concat([newdf, unmatched])

In [537]:
def get_unmatched(df):
    return df[df.Tag == ''][['PaperID', 'lfdn', 'reasoning']]

In [11]:
# helper: print every answer that matches the regex and return the number of matches
def minitest(df, regex):
    x = 0
    for idx, r in enumerate(df):
        if re.search(regex, r):
            print(idx, r)
            x += 1
    return x

#### Positive Reasoning

Positive Reasoning Categories:
- Practical Problem (Relevance) / Characteristics of the Problem
- Sensible Solution (Plausibility) / Characteristics of the Solution
- Astute Approach (Originality) / Characteristics of the Solution (more generic than Plausibility)
- NotAnswered (String with no true content)

With Reference to:
- experience (explicit reference to personal experience)
- opinion    (explicit reference to personal judgment)
- perception (implicit: statements of fact made without reference to their source; a euphemism for 'no arguments presented'; not explicitly tagged)


In [87]:
regexes = {
    'reason:relevance':
        r'[Pp]roblem|[Cc]halleng|[Ee]xperienc|[Rr]elevan|[Ee]ssential|[Cc]ritical|[Cc]rucial|[Ii]mporta|[Ii]ssue(?!s)|[Cc]oncern|[Dd]ifficult|[Nn]eed(?!s)|(?:^|\W)[Ww]e\W|[Ff]undamental|[Dd]ilemma',
    'reason:plausibility':
        r'[Cc]ould|[Mm]ight|[Hh]elp(s|ful)?(?!\w)|[Ii]mproves?(?!m)|(?<!\sto\s)better',
    'reason:originality':
        r'literature|gap|interesting',
    'source:experience':
        r'(?:^|\W)my\W.*?(?:experience|work)',
    'source:opinion':
        r'(?:^|\W)my\Wopinion(?!:)|[Bb]eliev|[Tt]hink|[Ff]eel',
}
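
Several of these patterns rely on lookarounds. A short sketch of how three of them behave on made-up phrases (the phrases are illustrative, not survey answers):

```python
import re

# Negative lookahead: 'issue' matches only when not followed by 's'.
issue = re.compile(r"[Ii]ssue(?!s)")
print(bool(issue.search("a critical issue")))    # singular form
print(bool(issue.search("source of issues")))    # plural form

# Negative lookbehind: 'better' matches unless directly preceded by ' to '.
better = re.compile(r"(?<!\sto\s)better")
print(bool(better.search("leads to better specs")))
print(bool(better.search("a better understanding")))

# (?:^|\W)[Ww]e\W matches 'we' only as a standalone word.
we = re.compile(r"(?:^|\W)[Ww]e\W")
print(bool(we.search("We need this.")))
print(bool(we.search("however this works")))
```

In each pair, the first phrase matches and the second does not.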

In [503]:
postags = assign_reasoning_tags(dfs['v_8345etseq'], regexes)
postags['level_1'], postags['level_2'] = list(zip(*[tag.split(':') if len(tag.split(':')) > 1 
                                                 else tag.split(':') + ['']
                                          for tag in postags.Tag 
                                          ]))
postags.head()

Unnamed: 0,PaperID,lfdn,reasoning,Tag,level_1,level_2
0,2,116,Aligning requirements to regulatory standards ...,reason:relevance,reason,relevance
0,5,152,.,NotAnswered,NotAnswered,
0,8,94,In order to gain a better understanding from t...,reason:plausibility,reason,plausibility
1,9,110,Ambiguities are a critical source of issues. A...,reason:relevance,reason,relevance
1,9,110,Ambiguities are a critical source of issues. A...,reason:plausibility,reason,plausibility
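
The split into `level_1` and `level_2` can also be written with `str.partition`, which always returns three parts and therefore needs no special case for tags without a colon. A sketch with a hypothetical helper `split_tag`, applied to tag strings of the kind shown above:

```python
def split_tag(tag):
    """Split 'level_1:level_2' into a pair; tags without ':' get an empty level_2."""
    level_1, _, level_2 = tag.partition(":")
    return level_1, level_2

tags = ["reason:relevance", "NotAnswered", ""]
levels = [split_tag(t) for t in tags]
print(levels)
```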


In [512]:
write_coded(postags[postags.columns.values[:-2]], 'v_8345etseq', prelim=True)

File stored at ../../analysis/freetext/v_8345etseq_coded_prelim.csv.


In [546]:
postags.groupby(['level_1', 'level_2']).count()[['Tag']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Tag
level_1,level_2,Unnamed: 2_level_1
,,31
NotAnswered,,1
reason,originality,5
reason,plausibility,34
reason,relevance,63
source,experience,6
source,opinion,12


In [544]:
list(get_unmatched(postags).reasoning.values)

['A learning-by-example method is one of the most efficient tools allowing for decreasing the uncertainty in the future.',
 'Benchmarks are the start of everything.',
 'A many more automatical methods shall support and ease the work of RE. Especially for standardised requirements, security, safety, traceability analysis,etc. ',
 'non-functional reqs have the biggest impact on architecture and may not be left aside ',
 'it will give me good reasones for RE that managers will understand',
 "aren't they connected?",
 'Because of the conflict between complete requirements and fast requirements.',
 'A lot of regulations are contradictory, it is good to have an overview for decision making',
 'Green field happens from time to time; incremental improvements are very common. Change requests occur frequently',
 'Non-functional requirements do not always get the attention they deserve.  Correlating understanding of non-functional requirements to project success would be of interest. ',
 'To many

### TODO for positive reasoning
- add/check tags manually in separate file
- do count analysis on final file

### Negative Reasoning

- Lack of Practical Relevance (problem)
- Approach too theoretical (solution)
- Domain too specialized (solution)
- ...

Also:
- Rejection of the assumption contained in the question (e.g., 'it's not a lower rating...')

Observation: Practitioners are much more specific when they state what they dislike about research than when they state what they like about research.

Some answers are so specific that they can only be properly understood when taking a look at the rated summary.

#### NB: The regexes have become a little ugly and need cleanup/refactoring.

At this point, they are likely overinclusive...
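
To illustrate the concern, here is a sketch that runs two of the patterns from the next cell over made-up sentences that do not express the corresponding criticism. The sentences are hypothetical and were chosen to trigger false positives.

```python
import re

# Two patterns copied from the negative-reasoning regexes.
patterns = {
    "reason:notoriginal": r"already",
    "reason:toospecialized": r"[Tt]o.{,10}specif|particular|special.*?domain|limit|should.*?wider|narrow",
}

# Hypothetical sentences that do NOT express the tagged criticism.
sentences = [
    "We already struggle with this, so more research is welcome.",
    "There is no limit to how useful this could be.",
]

# Collect every (tag, sentence) pair where a pattern fires anyway.
false_positives = [
    (tag, s) for s in sentences for tag, rx in patterns.items() if re.search(rx, s)
]
for tag, s in false_positives:
    print(tag, "<-", s)
```

Both sentences are tagged although neither criticizes originality or scope, which is why the automatically assigned tags need manual verification.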

In [554]:
# NB: some answers refer to the problem, some to the solution, some to the question itself
# shared negation fragment (not yet used in the patterns below)
negation = r"(?:n'?t|[Nn]ot?)\W"
negexes = {
    # note that 'not important' covers both absolute and relative (= prioritization) importance reasoning
    'reason:notimportant': (r'[Nn]ot?\W.{,50}(?:impor|relev|need|necess|frequ)|'
                    + r'[Dd]istract|decr.*?relev|[Nn]ot?\W.{,20}(priori|essent)|not.*?big.*?prob'),
    'reason:notefficient': r'[Ee]ffort',
    'reason:notinteresting':  r'(?:[Nn]ot?\W.{,10}|un)interest',
    # note that 'not convincing' statements are quite diffuse: critique of assumptions, critique of procedures, ...
    'reason:notconvincing':
                (r"(?:n't|[Nn]ot?)\W.{,10}?(?:conv|impr)|[Nn]ot?\W.{,30}?(help|use)|(?:n'?t|[Nn]ot?)\W.{,10}sense|"
                    + r"pointless|worse|harm|waste|fail|simplistic|obscure|[Ii]mpractical|[Nn]ot?\W.*?good|(?:(?:[Cc]a|[Dd]o)n'?t|[Nn]ot?)\W.{,10}work|"
                    + r"(?:[Nn]ot?\W|[Dd]on'?t|[Cc]an'?t|[Cc]an\s?not).{,30}(?:value|benefit)"),
    'reason:notrealistic': r'[Tt]o.{,5}(?:theoretic|acad)|skill|scenario|real-w',
    'reason:toocomplicated': r"(?:not\W|n't).*?\Wund|[Tt]o.{,5}technical", # i.e. not understood
    'reason:toovague': r"[Ff]luffy|don't know.*?use|not.*?understood",
    'reason:toosubjective': r'[Ss]ubjecti',
    'reason:notoriginal': r'already',
    'reason:toospecialized': r'[Tt]o.{,10}specif|particular|special.*?domain|limit|should.*?wider|narrow',
    'reason:respondentattitude': r'attitude',
    'rejection:ratingnotnegative': r'not?\W.*?lower',
    'rejection:questionnotunderstood': r'not?\W.*?underst.*?quest|Sorry'
}
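
One possible direction for the cleanup, sketched under assumptions: build each pattern from a shared negation fragment and a list of stems. The helper `negated`, the name `sketch`, and the stems below are hypothetical and cover only a fraction of the patterns above, so the result is not equivalent to `negexes`.

```python
import re

# Shared fragment: "n't"/"nt" or "no"/"not", followed by a non-word character.
NEGATION = r"(?:n'?t|[Nn]ot?)\W"

def negated(stems, max_gap=30):
    """Match a negation followed within max_gap characters by one of the stems."""
    return re.compile(NEGATION + r".{0,%d}?(?:%s)" % (max_gap, "|".join(stems)))

# Hypothetical, reduced tag definitions for illustration only.
sketch = {
    "reason:notimportant": negated(["impor", "relev", "need", "necess"], max_gap=50),
    "reason:notinteresting": negated(["interest"], max_gap=10),
}

def tags_for(text):
    """Return all tags whose pattern matches the text."""
    return [tag for tag, rx in sketch.items() if rx.search(text)]

print(tags_for("No need in my environment."))
print(tags_for("This is not very interesting to me."))
print(tags_for("Very relevant for our projects."))
```

Keeping the negation fragment and the gap width in one place would make it easier to tighten or loosen all negated patterns at once.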

In [555]:
negtags = assign_reasoning_tags(dfs['v_8780etseq'], negexes)
negtags['level_1'], negtags['level_2'] = list(zip(*[tag.split(':') if len(tag.split(':')) > 1 
                                                 else tag.split(':') + ['']
                                          for tag in negtags.Tag 
                                          ]))
negtags.head(2)

Unnamed: 0,PaperID,lfdn,reasoning,Tag,level_1,level_2
0,4,30,"no one cares 'how' you developed, just that yo...",reason:notconvincing,reason,notconvincing
0,5,35,Document driven approaches are decreasing in r...,reason:notimportant,reason,notimportant


In [556]:
write_coded(negtags[negtags.columns.values[:-2]], 'v_8780etseq', prelim=True)

File stored at ../../analysis/freetext/v_8780etseq_coded_prelim.csv.


In [557]:
negtags.groupby(['level_1', 'level_2']).count()[['Tag']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Tag
level_1,level_2,Unnamed: 2_level_1
,,36
NotAnswered,,1
reason,notconvincing,26
reason,notefficient,5
reason,notimportant,18
reason,notinteresting,5
reason,notoriginal,3
reason,notrealistic,7
reason,respondentattitude,1
reason,toocomplicated,6


In [543]:
list(get_unmatched(negtags).reasoning.values)

['If we could see how ambiguous documentation can affect a software project, perhaps software industry could be persuaded to do a better requirements documentation.',
 'because it can benefit the reusability of solutions',
 'Quantitative analysis of usability is confusing and misleding',
 'not for all, some is, some not.',
 "I selected the option 'Worthwhile', so I consider it important.",
 'Form specification would solve a lot of problem, however I do not know any industry project which could specifiy the system with formal language. Yes we need a more dynamic way of specifiying our systems',
 'As a practioner, I prefer learning about the answers than learning about the problems we all know we have.',
 'In my experience, most modeling language research never goes beyond the PhD lab. It may be valuable eventually, but this needs application before asserting its utility.',
 'creativity ist nicht der entscheidende Punkt, eher geht es um Kenntnis der Probreme eines Anwendungsbereichs',
 '

In [518]:
minitest(dfs['v_8780etseq']['reasoning'], "not.*?big.*?prob")

115 Understandability of requirements is not the biggest problem in RE.


1

In [144]:
for elem in dfs['v_8780etseq']['reasoning']:
    print(elem)

no one cares 'how' you developed, just that you have your specifications. there are also too many personalities that no one would use it anyway
Document driven approaches are decreasing in relevance. 
If we could see how ambiguous documentation can affect a software project, perhaps software industry could be persuaded to do a better requirements documentation.
I don't know what techniques will be used
because it can benefit the reusability of solutions
If an analyst can communicate well with domain experts and users, if he can analyze what he heard and organize his own thoughts logically, he can create a domain model. I am not sure I trust an analyst who needs a method to explain him how to do it...
it is limited to  a company , if it would be a survey with many companies it would have more validity
No need in my environment.
Assumes that high-level goals and requirements are hierarcical. They are not in practice. They are many-to-many and attempts to make them hierarchical make thing

###  Appendix: Vocab Exploration for Positive and Negative Reasoning

In [23]:
from gensim.parsing.preprocessing import preprocess_documents
from gensim import corpora, models

In [33]:
from collections import Counter

In [25]:
documents = list(dfs['v_8345etseq']['reasoning'])
texts = preprocess_documents(documents)
dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(544 unique tokens: ['align', 'case', 'chang', 'cumbersum', 'especi']...)
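
Stem frequencies like those listed in the next cell can also be obtained with `Counter.most_common`. A sketch on made-up, already stemmed token lists (in the notebook these would come from `preprocess_documents`); note that `most_common` orders ties by first occurrence rather than reverse alphabetically.

```python
from collections import Counter

# Hypothetical pre-processed documents: one list of stems per answer.
texts = [
    ["requir", "import", "help"],
    ["requir", "method"],
    ["help", "requir"],
]

# Count every stem across all documents.
counts = Counter(token for text in texts for token in text)
top = counts.most_common(2)
print(top)
```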


In [49]:
# a list of word stems, sorted by frequency in the answers (descending)
sorted(Counter(token for text in texts for token in text).items(),
       key=lambda x: (x[1], x[0]), reverse=True)

[('requir', 44),
 ('import', 17),
 ('help', 17),
 ('method', 14),
 ('work', 13),
 ('project', 13),
 ('need', 13),
 ('understand', 11),
 ('time', 11),
 ('studi', 9),
 ('improv', 9),
 ('softwar', 8),
 ('process', 8),
 ('practic', 8),
 ('commun', 8),
 ('better', 8),
 ('think', 7),
 ('problem', 7),
 ('perform', 7),
 ('manag', 7),
 ('industri', 7),
 ('identifi', 7),
 ('good', 7),
 ('experi', 7),
 ('differ', 7),
 ('develop', 7),
 ('busi', 7),
 ('user', 6),
 ('traceabl', 6),
 ('system', 6),
 ('stakehold', 6),
 ('specif', 6),
 ('risk', 6),
 ('lot', 6),
 ('essenti', 6),
 ('critic', 6),
 ('wai', 5),
 ('result', 5),
 ('qualiti', 5),
 ('product', 5),
 ('model', 5),
 ('know', 5),
 ('interest', 5),
 ('impact', 5),
 ('goal', 5),
 ('exist', 5),
 ('especi', 5),
 ('environ', 5),
 ('chang', 5),
 ('case', 5),
 ('big', 5),
 ('base', 5),
 ('avoid', 5),
 ('agil', 5),
 ('want', 4),
 ('us', 4),
 ('test', 4),
 ('team', 4),
 ('task', 4),
 ('success', 4),
 ('standard', 4),
 ('spec', 4),
 ('solut', 4),
 ('secur', 

In [55]:
documents2 = list(dfs['v_8780etseq']['reasoning'])
texts2 = preprocess_documents(documents2)
dictionary2 = corpora.Dictionary(texts2)
print(dictionary2)

Dictionary(462 unique tokens: ['care', 'develop', 'person', 'specif', 'us']...)


In [56]:
sorted(Counter(token for text in texts2 for token in text).items(),
       key=lambda x: (x[1], x[0]), reverse=True)

[('requir', 18),
 ('work', 12),
 ('need', 11),
 ('understand', 10),
 ('us', 9),
 ('specif', 9),
 ('project', 9),
 ('industri', 9),
 ('research', 8),
 ('problem', 8),
 ('model', 8),
 ('import', 8),
 ('approach', 8),
 ('softwar', 7),
 ('relev', 7),
 ('person', 7),
 ('know', 7),
 ('interest', 7),
 ('domain', 7),
 ('differ', 7),
 ('sure', 6),
 ('process', 6),
 ('practic', 6),
 ('method', 6),
 ('manag', 6),
 ('engin', 6),
 ('appli', 6),
 ('tool', 5),
 ('result', 5),
 ('real', 5),
 ('product', 5),
 ('languag', 5),
 ('high', 5),
 ('good', 5),
 ('experi', 5),
 ('effort', 5),
 ('compani', 5),
 ('better', 5),
 ('applic', 5),
 ('wai', 4),
 ('topic', 4),
 ('think', 4),
 ('set', 4),
 ('notat', 4),
 ('help', 4),
 ('goal', 4),
 ('environ', 4),
 ('document', 4),
 ('benefit', 4),
 ('avail', 4),
 ('analyst', 4),
 ('valu', 3),
 ('usual', 3),
 ('uncertainti', 3),
 ('time', 3),
 ('technolog', 3),
 ('system', 3),
 ('studi', 3),
 ('special', 3),
 ('sound', 3),
 ('solut', 3),
 ('small', 3),
 ('prefer', 3),
 (