# Script to process review comments, extract their sentiment, and turn it into a decision class (verdict)

This script processes only the review comments and applies a basic form of NLP sentiment analysis that relies solely on keyword matching.

Processing of comments:
* find comments that are positive 
* find comments that are negative
* "finding" here means looking for specific words and phrases
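
As a minimal illustration of this kind of matching, assuming plain case-insensitive substring counting with `str.count`, the sketch below shows both ordinary hits and a caveat: short keywords also match inside longer words. The sample comments and the helper name `count_hits` are made up for illustration.

```python
# Hypothetical sample comments, not taken from the data set
comments = ["Good idea, thanks!", "I do not know if this works"]

def count_hits(comment, keywords):
    """Count case-insensitive, non-overlapping substring occurrences of each keyword."""
    lowered = comment.lower()
    return {kw: lowered.count(kw.lower()) for kw in keywords}

positive_hits = count_hits(comments[0], ['good', 'idea', 'good idea', 'thank'])
negative_hits = count_hits(comments[1], ['no', 'do not'])

print(positive_hits)
print(negative_hits)
```

In the second comment the keyword `'no'` is counted twice, once inside "not" and once inside "know", although the word "no" never appears on its own.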

In [1]:
# lexicon for positive sentiments
keywordsComments_positive = ['good', 'idea', 'good idea', 
                             'done', 'beside', 'improved', 
                             'thank', 'yes', 'well', 
                             'nice', 'positive', 'better', 
                             'best', 'super', 'great', 
                             'fantastic']

# lexicon for negative sentiments 
keywordsComments_negative = ['not good', 'not improve', "don't",
                             'should', 'should not', '?', 
                             'aside', 'tend', 'not done', 
                             'bad', 'improve', 'remove', 
                             'add', 'include', 'not include', 
                             'defeat', 'no', 'do not',  
                             'chaotic', 'negative', 'worse', 
                             'worst']
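
Because matching is substring-based, some keywords in one lexicon occur inside keywords of the other, so a single phrase can be counted on both sides. The following sketch, using a hypothetical helper `find_overlaps` and short excerpts of the two lexicons above, lists such overlaps.

```python
def find_overlaps(keywords_a, keywords_b):
    """Return (a, b) pairs where keyword a occurs inside keyword b."""
    return [(a, b) for a in keywords_a for b in keywords_b if a in b]

# Excerpts of the two lexicons, chosen for illustration
positive_excerpt = ['good', 'improved', 'done']
negative_excerpt = ['not good', 'improve', 'not done']

pos_in_neg = find_overlaps(positive_excerpt, negative_excerpt)
neg_in_pos = find_overlaps(negative_excerpt, positive_excerpt)

print(pos_in_neg)
print(neg_in_pos)
```

For example, a comment containing "not good" produces one negative hit for `'not good'` and one positive hit for `'good'`.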



In [2]:
# location of the file with the input
filename = './gerrit_reviews_wireshark.csv'

# location of the file with the output feature vector
saveFilename = './gerrit_review_comments_dictionary_sentiment_wireshark.csv'

In [3]:
# import numpy
# we use it to easily work with arrays
import numpy as np

# import pandas
# we use it to build the data frame and to save the CSV file
import pandas as pd


In [4]:

# the comment is scored against the keyword lexicons passed as parameters
# the score is the share of positive keyword hits minus the share of negative ones,
# where each share is the number of hits divided by the size of its lexicon
def comment2sentiment(strComment, keywordsComments_positive, keywordsComments_negative):
    countPositive = 0
    countNegative = 0
    
    totalPositives = len(keywordsComments_positive)
    totalNegatives = len(keywordsComments_negative)
    
    for oneKeyword in keywordsComments_positive:
        countPositive += strComment.lower().count(oneKeyword.lower())
    
    for oneKeyword in keywordsComments_negative:
        countNegative += strComment.lower().count(oneKeyword.lower())
    
    quotientPositive = countPositive / totalPositives
    quotientNegative = countNegative / totalNegatives
    
    sentimentQuotient = quotientPositive - quotientNegative
    
    # once we have the quotient, we turn it into a verdict:
    # a positive quotient becomes 1 and
    # anything else (negative or zero) becomes 0
    if sentimentQuotient > 0:
        return 1
    else:
        return 0
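
To make the arithmetic of the function above concrete, here is a small worked sketch. It restates the scoring in a compact helper (`sentiment_quotient`, a name introduced only for this example) and uses hypothetical mini-lexicons of different sizes, so the effect of normalizing by lexicon size is visible.

```python
def sentiment_quotient(comment, positive, negative):
    """Share of positive hits minus share of negative hits, each divided by lexicon size."""
    lowered = comment.lower()
    count_positive = sum(lowered.count(kw.lower()) for kw in positive)
    count_negative = sum(lowered.count(kw.lower()) for kw in negative)
    return count_positive / len(positive) - count_negative / len(negative)

# Hypothetical mini-lexicons, not the ones used by the script
positive = ['good', 'great']                      # 2 keywords
negative = ['bad', 'worse', 'worst', 'remove']    # 4 keywords

# One positive hit and one negative hit: 1/2 - 1/4 = 0.25
q_mixed = sentiment_quotient('Good, but remove this', positive, negative)

# No hits at all: 0/2 - 0/4 = 0.0
q_neutral = sentiment_quotient('Why is this here', positive, negative)

# Same thresholding as in the function above
verdict_mixed = 1 if q_mixed > 0 else 0
verdict_neutral = 1 if q_neutral > 0 else 0
```

Two things follow from this scheme: with an equal number of hits on both sides, the smaller lexicon weighs more, and a comment with no keyword hits at all receives the same class (0) as a negative one.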

In [9]:
# use this to skip the first line and to print out something once every 1000 lines
iIndex = 0

# initializing a data frame with the result of the sentiment analysis
dfSentimentedLines = pd.DataFrame()

with open(filename, 'r', encoding = 'utf-8') as fInputFile:
    for strInputLine in fInputFile:
        lineElements = strInputLine.split(';')
        iIndex += 1
        
        if not iIndex % 1000:
            print(f'INFO: Processing line {iIndex}') 
        
        if len(lineElements) > 7 and iIndex > 1:
            strLineCode = lineElements[6]
            strLineComment = lineElements[7].replace('\n', '_')
            strReviewFilename = lineElements[2]
            
            # skip lines that refer to the commit message (COMMIT_MSG);
            # for all other lines, calculate the sentiment
            if 'COMMIT_MSG' not in strReviewFilename:
                sentiment = comment2sentiment(strLineComment, keywordsComments_positive, keywordsComments_negative) 
                oneRow = {'filename': strReviewFilename, 'LOC': strLineCode, 'class_value': sentiment, 'message': strLineComment}
                dfSentimentedLine = pd.DataFrame([oneRow], columns = oneRow.keys())
                dfSentimentedLines = pd.concat([dfSentimentedLines, dfSentimentedLine], axis=0)
            else:
                print(f'INFO: Skipping Commit message in line {strReviewFilename}: {strLineCode}')
                print(f'INFO: Lines processed: {dfSentimentedLines.shape[0]}')

INFO: Skipping Commit message in line /COMMIT_MSG: It is indicated by adding a "[Tree view truncated]" item.
INFO: Lines processed: 11
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 11
INFO: Skipping Commit message in line /COMMIT_MSG: Tested-by: Petri Dish Buildbot
INFO: Lines processed: 11
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Peter Wu <peter@lekensteyn.nl>
INFO: Lines processed: 97
INFO: Skipping Commit message in line /COMMIT_MSG:   value while in fact it is 32-bit wide
INFO: Lines processed: 106
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Tomasz Moń <desowin@gmail.com>
INFO: Lines processed: 115
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ie8b005ccadda39c653782fc38280ce21cf2ca0a8
INFO: Lines processed: 141
INFO: Skipping Commit message in line /COMMIT_MSG: In this release (IEC 61850-7-2:2003) there is a field name called: Test.
INFO: Lines processed: 158
INFO: Skipping Commit message in lin
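
Calling `pd.concat` inside the loop copies the accumulated data frame on every iteration, which gets slow for large inputs. One possible alternative, sketched here under assumptions, is to collect plain dictionaries in a list and build the data frame once at the end. The in-memory sample below is hypothetical and only mimics the column positions used above (2 = file name, 6 = code line, 7 = comment); the sentiment call is left out to keep the sketch short.

```python
import io
import pandas as pd

# Hypothetical stand-in for the input file: a header plus three ';'-separated rows
sample = io.StringIO(
    "c0;c1;filename;c3;c4;c5;code;comment\n"
    "a;b;/COMMIT_MSG;d;e;f;Change-Id: x;skip me\n"
    "a;b;src/main.c;d;e;f;int x = 0;Done\n"
    "a;b;src/util.c;d;e;f;return y;Why?\n"
)

rows = []
for index, line in enumerate(sample, start=1):
    elements = line.split(';')
    if index == 1 or len(elements) <= 7:
        continue  # skip the header and lines with too few fields
    if 'COMMIT_MSG' in elements[2]:
        continue  # skip comments on commit messages
    rows.append({'filename': elements[2],
                 'LOC': elements[6],
                 'message': elements[7].replace('\n', '_')})

# Build the data frame once, after the loop
df_rows = pd.DataFrame(rows, columns=['filename', 'LOC', 'message'])
print(df_rows)
```

Note that a frame built this way gets a running index (0, 1, 2, ...), whereas concatenating one-row frames leaves every row with index 0.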

In [10]:
dfSentimentedLines

| Unnamed: 0 | filename | LOC | class_value | message |
|---|---|---|---|---|
| 0 | .github/workflows/close_pr.yml | comment: "We do not accept PRs. Patche... | 0 | Maybe "We do not accept GitHub PRs." ?_ |
| 0 | epan/dissectors/packet-tls-utils.c | tvb, offset, next_offset -... | 0 | since you are not adding any extra information... |
| 0 | epan/dissectors/packet-tls-utils.c | tvb, offset, next_offset -... | 1 | Done_ |
| 0 | epan/dissectors/packet-tls-utils.c | tvb, offset, next_offset -... | 0 | Would this be worth showing as expert info ins... |
| 0 | epan/dissectors/packet-tls-utils.c | tvb, offset, next_offset -... | 0 | The number of layers added per DN is not fixed... |
| ... | ... | ... | ... | ... |
| 0 | packaging/rpm/wireshark.spec.in | * Sun Jan 19 2019 Jiri Novak | 0 | Should not it be Sun Jan 19 2020 at the previo... |
| 0 | ui/io_graph_item.c | #define get_io_graph_item_advanced_FT_xINT_uni... | 0 | Using macros makes the code more compacet but ... |
| 0 | ui/io_graph_item.c | //shift left to take the sign bit ... | 0 | Typo in comment: "durch" should be "during"_ |
| 0 | ui/io_graph_item.c | //shift left to take the sign bit ... | 0 | Is this translation needed for unsigned quanti... |


In [11]:
# saving the output into a .csv file with $ as separator
pd.DataFrame(dfSentimentedLines).to_csv(saveFilename, 
                                        sep = "$",
                                        index = False)
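
For reference, a file written this way can be read back by passing the same separator to `pd.read_csv`. The round trip below is a minimal sketch that uses an in-memory buffer and a made-up row instead of the real file. It also shows that a field containing `$` survives, because pandas quotes fields that contain the separator by default.

```python
import io
import pandas as pd

# Hypothetical single-row frame with the same columns as the result above
df_out = pd.DataFrame({'filename': ['src/main.c'],
                       'LOC': ['int x = 0;'],
                       'class_value': [1],
                       'message': ['Done, costs $0_']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, sep='$', index=False)

# Read it back with the same separator
buffer.seek(0)
df_in = pd.read_csv(buffer, sep='$')
print(df_in)
```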

The results of this script are now saved in a .csv file in which each line is tagged with its sentiment-based verdict. The file can be used as a dictionary that maps each line to a verdict. It is in a raw format, which means:
* it contains conflicting duplicates - some lines can be duplicated with a different verdict
* it contains many duplicated lines - many lines are naturally part of many commits, and sometimes even within the same commit we may have extracted them twice (the API sometimes provides the same data more than once)
* it contains irrelevant lines - some lines consist only of "#", "//" or even "", which means that we need to clean up the data set
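
A possible cleanup pass over such a raw file could look like the sketch below. It is not part of this script: the sample rows are hypothetical, and the choice of what counts as irrelevant (empty lines and bare comment markers) is an assumption based on the list above. How to resolve conflicting verdicts is deliberately left open; the sketch only flags them.

```python
import pandas as pd

# Hypothetical rows mimicking the columns of the saved file
df_raw = pd.DataFrame({
    'filename': ['a.c', 'a.c', 'a.c', 'b.c', 'b.c'],
    'LOC': ['int x = 0;', 'int x = 0;', 'int x = 0;', '//', 'return y;'],
    'class_value': [1, 1, 0, 0, 0],
    'message': ['Done_', 'Done_', 'Why?_', 'Hm_', 'Remove_'],
})

# 1. Drop lines that carry no code: empty, or only a comment marker
irrelevant = df_raw['LOC'].str.strip().isin(['', '#', '//'])
df_clean = df_raw[~irrelevant]

# 2. Drop repeated (filename, LOC, class_value) combinations
df_clean = df_clean.drop_duplicates(subset=['filename', 'LOC', 'class_value'])

# 3. Flag lines that still appear with more than one verdict
verdicts_per_line = df_clean.groupby(['filename', 'LOC'])['class_value'].nunique()
conflicting = verdicts_per_line[verdicts_per_line > 1]

print(df_clean)
print(conflicting)
```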