#### Variables coded in this notebook: 8345 et seq, 8780 et seq

NB: Here, multiple tags should be allowed, so we need long-form DataFrames.
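
A minimal sketch of what long form means here: an answer that receives two tags occupies two rows. The column names mirror the free-text files used below; the three rows are illustrative only.

```python
import pandas as pd

# Illustrative data only: the answer (PaperID 9, lfdn 110) carries two tags,
# so it appears twice in the long-form frame.
long_form = pd.DataFrame(
    [
        (2, 116, 'Aligning requirements ...', 'reason:relevance'),
        (9, 110, 'Ambiguities are a critical ...', 'reason:relevance'),
        (9, 110, 'Ambiguities are a critical ...', 'reason:plausibility'),
    ],
    columns=['PaperID', 'lfdn', 'reasoning', 'Tag'],
)

# Number of tags per answer
tags_per_answer = long_form.groupby(['PaperID', 'lfdn'])['Tag'].count()
print(tags_per_answer)
```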

General Procedure:
1. Design and refine regexes to the point where further refinement would lead to drastic overfitting
2. Assign reasoning tags via automation
3. Let human(s) validate all tags and assign at least one tag to all answers

This notebook is concerned with steps 1 and 2 only.
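
Step 3 is outside the scope of this notebook, but its completion criterion (every answer carries at least one tag) could be checked roughly as follows. This is a sketch under assumptions: the helper name `answers_without_tag` is hypothetical, and the input is assumed to be a long-form frame with `PaperID`, `lfdn`, and `Tag` columns in which an empty `Tag` marks a missing tag.

```python
import pandas as pd

def answers_without_tag(coded):
    """Return the (PaperID, lfdn) pairs whose rows all have an empty Tag."""
    has_tag = coded['Tag'].fillna('').str.strip() != ''
    any_tag = has_tag.groupby([coded['PaperID'], coded['lfdn']]).any()
    return list(any_tag[~any_tag].index)

# Illustrative data only: the last answer has not been tagged yet
coded = pd.DataFrame({
    'PaperID': [2, 9, 9, 12],
    'lfdn': [116, 110, 110, 131],
    'Tag': ['reason:relevance', 'reason:relevance', 'reason:plausibility', ''],
})
missing = answers_without_tag(coded)
print(missing)
```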

#### The Regex Configuration is considered fixed as of 2018-04-04.

In [1]:
import os, pandas as pd, re

In [2]:
exportdate = 20180327
projectname = 'repract'

In [3]:
df = pd.read_csv(f'../../data/{exportdate}{projectname}.csv')
df.head(2)

Unnamed: 0,lfdn,external_lfdn,tester,dispcode,lastpage,quality,duration,v_7039,v_7040,v_7041,...,output_mode,javascript,flash,session_id,language,cleaned,ats,datetime,date_of_last_access,date_of_first_mail
0,106,0,no tester,Completed after break (32),2138658,NotShown,-1,NotShown,NotShown,0,...,HTML,NotShown,NotShown,3bb21c1b318e2f6b87557566bdd6b4d9,English,Not cleaned,1515411510,2018-01-08 11:38:30,2018-01-08 13:07:14,0000-00-00 00:00:00
1,131,0,no tester,Completed (31),2138658,NotShown,3805,NotShown,NotShown,NotShown,...,HTML,NotShown,NotShown,fc38f6556787a459c2cc604abf799448,English,Not cleaned,1515667019,2018-01-11 10:36:59,2018-01-11 11:40:24,0000-00-00 00:00:00


In [4]:
basedir = '../../data/freetext'
# only read CSV exports; this skips stray files such as macOS's .DS_Store
freetextfiles = [file for file in os.listdir(basedir) if file.endswith('.csv')]
dfs = {file[:-4]:pd.read_csv(f'{basedir}/{file}') for file in freetextfiles}

In [5]:
dfs.keys()

dict_keys(['v_11', 'v_1373', 'v_16', 'v_18', 'v_19', 'v_6', 'v_8345etseq', 'v_8780etseq'])

In [6]:
codedir = '../../analysis/freetext'
def write_coded(df, varname, prelim=False, sep=','):
    filepath = f'{codedir}/{varname}_coded{"_prelim" if prelim else ""}.csv'
    df.to_csv(filepath, index=False, sep=sep)
    print(f'File stored at {filepath}.')

NB: If you want German-style csv files, set `sep` to `;` (default is `,`).

In [7]:
def assign_reasoning_tags(df, regexes):
    columns = list(df.columns.values) + ['Tag']
    # one frame per tag: every answer (last column) matched by that tag's regex
    frames = [pd.DataFrame([list(row) + [k] for _, row in df.iterrows()
                            if re.search(v, row.iloc[-1].lower())], columns=columns)
              for k, v in regexes.items()]
    # answers shorter than 5 characters count as not answered
    frames.append(pd.DataFrame([list(row) + ['NotAnswered'] for _, row in df.iterrows()
                                if len(row.iloc[-1]) < 5], columns=columns))
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    newdf = pd.concat(frames).sort_values(['PaperID', 'lfdn'])
    # answers without any tag are kept with an empty tag (first two columns: PaperID, lfdn)
    tagged = set(zip(newdf.PaperID.values, newdf.lfdn.values))
    untagged = pd.DataFrame([list(row) + [''] for _, row in df.iterrows()
                             if (row.iloc[0], row.iloc[1]) not in tagged], columns=columns)
    return pd.concat([newdf, untagged])
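
For comparison, the same tagging idea can be sketched with vectorized pandas string methods instead of row-wise iteration. This is not the function used in this notebook: the data and the two patterns are illustrative, and unlike the function above the sketch sorts the untagged rows together with the tagged ones.

```python
import pandas as pd

# Illustrative answers and patterns (not the fixed configuration)
answers = pd.DataFrame({
    'PaperID': [2, 5, 9],
    'lfdn': [116, 152, 110],
    'reasoning': ['A real problem that could be solved', '.', 'Benchmarks are the start'],
})
regexes = {'reason:relevance': r'problem', 'reason:plausibility': r'could'}
keys = ['PaperID', 'lfdn']

# One frame per tag, holding the answers matched by that tag's regex
lowered = answers['reasoning'].str.lower()
frames = [answers[lowered.str.contains(rx, regex=True)].assign(Tag=tag)
          for tag, rx in regexes.items()]
# Answers shorter than 5 characters count as not answered
frames.append(answers[answers['reasoning'].str.len() < 5].assign(Tag='NotAnswered'))
tagged = pd.concat(frames)

# Answers that received no tag at all get an empty tag
merged = answers.merge(tagged[keys].drop_duplicates(), on=keys, how='left', indicator=True)
untagged = merged[merged['_merge'] == 'left_only'].drop(columns='_merge').assign(Tag='')

result = pd.concat([tagged, untagged]).sort_values(keys)
print(result)
```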

In [8]:
def get_unmatched(df):
    return df[df.Tag == ''][['PaperID', 'lfdn', 'reasoning']]

In [9]:
# helper: print every answer matched by `regex` and return the number of matches
def minitest(df, regex):
    x = 0
    for idx, r in enumerate(df):
        if re.search(regex, r.lower()):
            print(idx, r)
            x += 1
    return x
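
A sketch of how the helper might be used while refining a pattern. The block repeats the helper's logic so that it runs on its own; the sample answers are taken from the unmatched answers listed further down, and the candidate pattern is hypothetical and not part of the regex configuration.

```python
import re

def minitest(answers, regex):
    """Same logic as the helper above: print matches, return their count."""
    hits = 0
    for idx, answer in enumerate(answers):
        if re.search(regex, answer.lower()):
            print(idx, answer)
            hits += 1
    return hits

# Three of the unmatched answers from the positive reasoning section
sample = [
    'Benchmarks are the start of everything.',
    "aren't they connected?",
    'A lot of regulations are contradictory, it is good to have an overview for decision making',
]

# Hypothetical candidate pattern, not part of the fixed configuration
n_hits = minitest(sample, r'benchmark|overview')
```

The return value gives a quick count of how many answers a candidate pattern would pick up before it is added to a tag.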

#### Positive Reasoning

Reasons (Content of the Reasoning Provided):
- Relevance (of the Problem), i.e., Practical Problem (seemingly) addressed
- Plausibility (of the Solution), i.e., (seemingly) Sensible Solution
- Originality (of the Approach), i.e., (seemingly) Astute Approach - this being somewhat more generic than Plausibility
- NotAnswered (String with no true content)

Sources (Evidence/Support for the Reasons Provided):
- experience (explicit reference to personal experience)
- opinion    (explicit reference to personal judgment)
- perception (implicit reference to personal world view - people making statements of fact without reference to any specific source - euphemism for 'no arguments presented' - not explicitly tagged)

In [10]:
posexes = {
    'reason:relevance':
        (r'problem|challeng|experienc|issue(?!s)|concern|need(?!s)|dilemma|'
         r'relevan|essential|critical|crucial|importa|difficult|(?:^|\W)we\W|fundamental'),
    'reason:plausibility':
        r'could|might|help(s|ful)?(?!\w)|improves?(?!m)|(?<!\sto\s)better',
    'reason:originality':
        r'literature|gap|interesting',
    'source:experience':
        r'(?:^|\W)my\W.*?(?:experience|work)',
    'source:opinion':
        r'(?:^|\W)my\Wopinion(?!:)|believ|think|feel',
}
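
Two of the lookaround constructs used above can be checked in isolation. The sketch below only shows what the constructs do; the test sentences are made up, and the helper `matches` is hypothetical.

```python
import re

def matches(pattern, text):
    """True if the pattern is found anywhere in the lower-cased text."""
    return re.search(pattern, text.lower()) is not None

# issue(?!s): the negative lookahead rejects the plural "issues"
singular = matches(r'issue(?!s)', 'This is a known issue')
plural = matches(r'issue(?!s)', 'A critical source of issues')

# (?<!\sto\s)better: the negative lookbehind rejects "better" directly after " to "
adjective = matches(r'(?<!\sto\s)better', 'It gives a better overview')
infinitive = matches(r'(?<!\sto\s)better', 'In order to better understand')

print(singular, plural, adjective, infinitive)
```

The plural form and the "to better" construction are the two cases that do not match.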

In [11]:
postags = assign_reasoning_tags(dfs['v_8345etseq'], posexes)
postags['level_1'], postags['level_2'] = list(zip(*[tag.split(':') if len(tag.split(':')) > 1 
                                                 else tag.split(':') + ['']
                                          for tag in postags.Tag 
                                          ]))
postags.head()

Unnamed: 0,PaperID,lfdn,reasoning,Tag,level_1,level_2
0,2,116,Aligning requirements to regulatory standards ...,reason:relevance,reason,relevance
0,5,152,.,NotAnswered,NotAnswered,
0,8,94,In order to gain a better understanding from t...,reason:plausibility,reason,plausibility
1,9,110,Ambiguities are a critical source of issues. A...,reason:relevance,reason,relevance
1,9,110,Ambiguities are a critical source of issues. A...,reason:plausibility,reason,plausibility
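
As an alternative to the list comprehension, the split into `level_1` and `level_2` could be sketched with `Series.str.partition`, which always returns three parts and therefore needs no special case for tags without a colon. The tag values below are illustrative.

```python
import pandas as pd

tags = pd.Series(['reason:relevance', 'NotAnswered', 'source:opinion', ''])

# str.partition returns three columns: text before the separator,
# the separator itself, and the text after it (empty if there is no separator)
parts = tags.str.partition(':')
levels = pd.DataFrame({'level_1': parts[0], 'level_2': parts[2]})
print(levels)
```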


In [12]:
postags.groupby(['level_1', 'level_2']).count()[['Tag']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Tag
level_1,level_2,Unnamed: 2_level_1
,,31
NotAnswered,,1
reason,originality,5
reason,plausibility,34
reason,relevance,63
source,experience,6
source,opinion,12


In [13]:
#write_coded(postags[postags.columns.values[:-2]], 'v_8345etseq', prelim=True, sep=';')

File stored at ../../analysis/freetext/v_8345etseq_coded_prelim.csv.


In [14]:
list(get_unmatched(postags).reasoning.values)

['A learning-by-example method is one of the most efficient tools allowing for decreasing the uncertainty in the future.',
 'Benchmarks are the start of everything.',
 'A many more automatical methods shall support and ease the work of RE. Especially for standardised requirements, security, safety, traceability analysis,etc. ',
 'non-functional reqs have the biggest impact on architecture and may not be left aside ',
 'it will give me good reasones for RE that managers will understand',
 "aren't they connected?",
 'Because of the conflict between complete requirements and fast requirements.',
 'A lot of regulations are contradictory, it is good to have an overview for decision making',
 'Green field happens from time to time; incremental improvements are very common. Change requests occur frequently',
 'Non-functional requirements do not always get the attention they deserve.  Correlating understanding of non-functional requirements to project success would be of interest. ',
 'To many

In [15]:
postags.groupby(['level_1', 'level_2']).count()[['Tag']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Tag
level_1,level_2,Unnamed: 2_level_1
,,31
NotAnswered,,1
reason,originality,5
reason,plausibility,34
reason,relevance,63
source,experience,6
source,opinion,12


#### Negative Reasoning

Reasons:
- Lack of Practical Relevance (notimportant - Critique of the Problem Addressed)
- Solution takes too much Effort (notefficient - Critique of the Solution Presented)
- Research not interesting (notinteresting - General Critique)
- Approach or Solution unconvincing (notconvincing - Critique of the Approach in General or of the Solution Presented)
- Research already exists (notoriginal - Critique of the Approach)
- Setup unrealistic (notrealistic - Critique of the Approach)
- Explicit Reference to Own Attitude as a Reason (respondentattitude - No Real Critique)
- Content too complicated (toocomplicated)
- Content too narrow (toospecialized)
- Description too vague (toovague)
- Too much Opinion (toosubjective)

Rejection:
- Of the Question itself ("Sorry, I didn't understand...")
- Of the Assumption Contained in the Question ("it's not a lower rating")

Observations: 
- Reasons for negative evaluations are much more diverse than reasons for positive evaluations
  - I.e., practitioners are much more specific when they state what they dislike about research than when they state what they like about research
  - I.e., RE Research dissatisfiers are more diverse than RE Research satisfiers/delighters
- Some answers are so specific that they can only be properly understood when taking a look at the rated summary
- Even with the more fine-grained coding scheme, there are two categories that are quite heterogeneous:
  - notimportant
  - notconvincing

#### NB: The regexes have gotten a little ugly and need cleanup/refactoring.

At this point, they are likely overinclusive...
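
One way overinclusiveness can arise is an unbounded gap such as the `not?\W.*?good` alternative in the `reason:notconvincing` pattern of the next cell. The sketch below compares it with a bounded variant; the bounded pattern and both sentences are hypothetical and not part of the fixed configuration.

```python
import re

# Alternative taken from the 'reason:notconvincing' pattern: unbounded gap
unbounded = r'not?\W.*?good'
# Hypothetical tighter variant: at most 10 characters between negation and "good"
bounded = r'not?\W.{,10}good'

# Made-up sentences: the first is not a critique, the second is
positive = 'no one cares how you developed it, as long as the result is good'
negative = 'this is not very good'

results = {
    'unbounded/positive': bool(re.search(unbounded, positive)),
    'bounded/positive': bool(re.search(bounded, positive)),
    'unbounded/negative': bool(re.search(unbounded, negative)),
    'bounded/negative': bool(re.search(bounded, negative)),
}
print(results)
```

The unbounded pattern matches both sentences, while the bounded variant matches only the actual critique. Whether a tighter bound would lose true matches in the real answers would need to be checked against the data.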

In [16]:
# NB: some patterns refer to the problem, some to the solution, and some to the question itself
negexes = {
    # note that 'not important' covers both absolute and relative (= prioritization) importance reasoning
    'reason:notimportant': (r'not?\W.{,50}(?:impor|relev|need|necess|frequ)|'
                            r'distract|decr.*?relev|not?\W.{,20}(priori|essent)|not.*?big.*?prob'),
    'reason:notefficient': r'effort',
    'reason:notinteresting': r'(?:not?\W.{,10}|un)interest',
    # note that 'not convincing' statements are quite diffuse: critique of assumptions, critique of procedures, ...
    'reason:notconvincing':
        (r"(?:n't|not?)\W.{,10}?(?:conv|impr)|not?\W.{,30}?(help|use)|(?:n'?t|not?)\W.{,10}sense|"
         r"pointless|worse|harm|waste|fail|simplistic|obscure|impractical|not?\W.*?good|"
         r"(?:(?:ca|do)n'?t|not?)\W.{,10}work|"
         r"(?:not?\W|don'?t|can'?t|can\s?not).{,30}(?:value|benefit)"),
    'reason:notoriginal': r'already',
    'reason:notrealistic': r'to.{,5}(?:theoretic|acad)|skill|scenario|real-w',
    'reason:toocomplicated': r"(?:not\W|n't).*?\Wund|to.{,5}technical",  # i.e., not understood
    'reason:toospecialized': r'to.{,10}specif|particular|special.*?domain|limit|should.*?wider|narrow',
    'reason:toovague': r"fluffy|don't know.*?use|not.*?understood",
    'reason:toosubjective': r'subjecti',
    'reason:respondentattitude': r'attitude',
    'rejection:ratingnotnegative': r'not?\W.*?lower',
    'rejection:questionnotunderstood': r'not?\W.*?underst.*?quest|sorry'
}

In [17]:
negtags = assign_reasoning_tags(dfs['v_8780etseq'], negexes)
negtags['level_1'], negtags['level_2'] = list(zip(*[tag.split(':') if len(tag.split(':')) > 1 
                                                 else tag.split(':') + ['']
                                          for tag in negtags.Tag 
                                          ]))
negtags.head(2)

Unnamed: 0,PaperID,lfdn,reasoning,Tag,level_1,level_2
0,4,30,"no one cares 'how' you developed, just that yo...",reason:notconvincing,reason,notconvincing
0,5,35,Document driven approaches are decreasing in r...,reason:notimportant,reason,notimportant


In [18]:
#write_coded(negtags[negtags.columns.values[:-2]], 'v_8780etseq', prelim=True, sep=';')

File stored at ../../analysis/freetext/v_8780etseq_coded_prelim.csv.


In [19]:
list(get_unmatched(negtags).reasoning.values)

['If we could see how ambiguous documentation can affect a software project, perhaps software industry could be persuaded to do a better requirements documentation.',
 'because it can benefit the reusability of solutions',
 'Quantitative analysis of usability is confusing and misleding',
 'not for all, some is, some not.',
 "I selected the option 'Worthwhile', so I consider it important.",
 'Form specification would solve a lot of problem, however I do not know any industry project which could specifiy the system with formal language. Yes we need a more dynamic way of specifiying our systems',
 'As a practioner, I prefer learning about the answers than learning about the problems we all know we have.',
 'In my experience, most modeling language research never goes beyond the PhD lab. It may be valuable eventually, but this needs application before asserting its utility.',
 'creativity ist nicht der entscheidende Punkt, eher geht es um Kenntnis der Probreme eines Anwendungsbereichs',
 '

In [20]:
negtags.groupby(['level_1', 'level_2']).count()[['Tag']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Tag
level_1,level_2,Unnamed: 2_level_1
,,36
NotAnswered,,1
reason,notconvincing,26
reason,notefficient,5
reason,notimportant,18
reason,notinteresting,5
reason,notoriginal,3
reason,notrealistic,7
reason,respondentattitude,1
reason,toocomplicated,6


###  Appendix: Vocab Exploration for Positive and Negative Reasoning

In [23]:
from gensim.parsing.preprocessing import preprocess_documents
from gensim import corpora, models

In [33]:
from collections import Counter

In [25]:
documents = list(dfs['v_8345etseq']['reasoning'])
texts = preprocess_documents(documents)
dictionary = corpora.Dictionary(texts)
print(dictionary)

Dictionary(544 unique tokens: ['align', 'case', 'chang', 'cumbersum', 'especi']...)


In [None]:
# a list of word stems, sorted by frequency in the answers DESC
sorted(Counter(stem for text in texts for stem in text).items(),
       key=lambda x: (x[1], x[0]), reverse=True)
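
If only the most frequent stems are of interest, `Counter.most_common` is a shorter route. Note that it orders ties by first occurrence, whereas the cell above orders ties by stem in descending order. The `texts` value below is a small stand-in for the output of `preprocess_documents`.

```python
from collections import Counter

# Stand-in for the preprocessed answers: one list of stems per answer
texts = [['requir', 'problem'], ['requir', 'practic'], ['problem', 'requir']]

stem_counts = Counter(stem for text in texts for stem in text)
top = stem_counts.most_common(2)
print(top)
```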

In [55]:
documents2 = list(dfs['v_8780etseq']['reasoning'])
texts2 = preprocess_documents(documents2)
dictionary2 = corpora.Dictionary(texts2)
print(dictionary2)

Dictionary(462 unique tokens: ['care', 'develop', 'person', 'specif', 'us']...)


In [None]:
sorted(Counter(stem for text in texts2 for stem in text).items(),
       key=lambda x: (x[1], x[0]), reverse=True)

The End.