# Script to process review comments, extract their sentiment and turn it into a decision class (verdict)

This script processes only the review comments and applies a basic form of NLP sentiment analysis that relies on keywords alone.

Processing of comments:
* find comments that are positive
* find comments that are negative
* "finding" means looking for specific words and phrases
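
Looking for a keyword with plain substring matching means that a short keyword can also fire inside a longer word. The sketch below contrasts that behaviour with a whole-word match built on the standard `re` module. The helper names `count_substring` and `count_whole_word` are hypothetical and not part of the script; the whole-word variant is shown only as a possible alternative.

```python
import re

def count_substring(comment, keyword):
    """Count case-insensitive substring occurrences, as str.count does."""
    return comment.lower().count(keyword.lower())

def count_whole_word(comment, keyword):
    """Count occurrences of keyword that are not embedded in a longer word.

    Boundaries are only enforced on sides where the keyword starts or ends
    with a letter or digit, so a keyword such as '?' still matches 'why?'.
    """
    kw = keyword.lower()
    prefix = r'(?<!\w)' if kw[0].isalnum() else ''
    suffix = r'(?!\w)' if kw[-1].isalnum() else ''
    return len(re.findall(prefix + re.escape(kw) + suffix, comment.lower()))

comment = "I do not know if this is good"
print(count_substring(comment, "no"))   # also counts 'no' inside 'not' and 'know'
print(count_whole_word(comment, "no"))  # counts only a standalone 'no'
```

With substring matching, a keyword such as 'no' contributes to the negative count for a comment that never uses the word on its own.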

In [1]:
# lexicon for positive sentiments
keywordsComments_positive = ['good', 'idea', 'good idea', 
                             'done', 'beside', 'improved', 
                             'thank', 'yes', 'well', 
                             'nice', 'positive', 'better', 
                             'best', 'super', 'great', 
                             'fantastic']

# lexicon for negative sentiments 
keywordsComments_negative = ['not good', 'not improve', "don't",
                             'should', 'should not', '?', 
                             'aside', 'tend', 'not done', 
                             'bad', 'improve', 'remove', 
                             'add', 'include', 'not include', 
                             'defeat', 'no', 'do not',  
                             'chaotic', 'negative', 'worse', 
                             'worst']
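
Because matching is done by substring, a keyword from one lexicon can be contained in a keyword from the other, for example 'improve' inside 'improved' or 'good' inside 'not good'. A comment with the longer phrase then adds to both counts. A minimal sketch for listing such overlaps, using shortened copies of the two lists; `find_overlaps` is a hypothetical helper name.

```python
def find_overlaps(positive, negative):
    """Return (shorter, longer) keyword pairs where a keyword from one
    lexicon occurs as a substring of a keyword from the other lexicon."""
    overlaps = []
    for pos in positive:
        for neg in negative:
            if pos in neg:
                overlaps.append((pos, neg))
            elif neg in pos:
                overlaps.append((neg, pos))
    return overlaps

# shortened sample lexicons, for illustration only
positive_sample = ['good', 'done', 'improved', 'better']
negative_sample = ['not good', 'not done', 'improve', 'worse']

for shorter, longer in find_overlaps(positive_sample, negative_sample):
    print(f"'{shorter}' is contained in '{longer}'")
```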



In [None]:
# location of the file with the input
filename = './gerrit_reviews_wireshark.csv'

# location of the file with the output feature vector
saveFilename = './gerrit_review_comments_dictionary_sentiment_wireshark.csv'

In [None]:
# NumPy makes it easy to work with arrays
import numpy as np

# pandas is used to collect the results and to save them as a CSV file
import pandas as pd


In [2]:

# the comment is analyzed based on the keywords specified as parameters
# the sentiment is the number of positive keyword hits divided by the size of
# the positive lexicon, minus the same quotient for the negative lexicon
def comment2sentiment(strComment, keywordsComments_positive, keywordsComments_negative):
    countPositive = 0
    countNegative = 0
    
    totalPositives = len(keywordsComments_positive)
    totalNegatives = len(keywordsComments_negative)
    
    # lower-case the comment once; keywords are matched as substrings
    strCommentLower = strComment.lower()
    
    for oneKeyword in keywordsComments_positive:
        countPositive += strCommentLower.count(oneKeyword.lower())
    
    for oneKeyword in keywordsComments_negative:
        countNegative += strCommentLower.count(oneKeyword.lower())
    
    quotientPositive = countPositive / totalPositives
    quotientNegative = countNegative / totalNegatives
    
    sentimentQuotient = quotientPositive - quotientNegative
    
    # once we have the quotient, we change it into a verdict:
    # a positive quotient becomes 1 and
    # a negative or zero (neutral) quotient becomes 0
    if sentimentQuotient > 0:
        return 1
    else:
        return 0
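
To make the arithmetic concrete: each count is divided by the size of its own lexicon before the two are compared, so a single hit weighs more for the shorter list. The sketch below recomputes the score for small, made-up lexicons. `sentiment_score` is a hypothetical helper that returns the raw difference instead of the 0/1 verdict.

```python
def sentiment_score(comment, positive, negative):
    """Return (positive hits / #positive keywords) - (negative hits / #negative keywords)."""
    text = comment.lower()
    count_pos = sum(text.count(k.lower()) for k in positive)
    count_neg = sum(text.count(k.lower()) for k in negative)
    return count_pos / len(positive) - count_neg / len(negative)

# made-up lexicons of different sizes, for illustration only
positive = ['good', 'nice', 'great', 'thank']   # 4 keywords
negative = ['bad', 'remove']                     # 2 keywords

# one positive hit (1/4) against one negative hit (1/2) gives a negative score
score = sentiment_score("Nice, but please remove this line", positive, negative)
verdict = 1 if score > 0 else 0
print(score, verdict)
```

A comment without any keyword hit scores exactly 0 and therefore also receives the verdict 0, so that class covers neutral as well as negative comments.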

In [3]:
# use this to skip the first line and to print out something once every 1000 lines
iIndex = 0

# rows with the result of the sentiment analysis; they are turned into a data frame
# after the loop, which avoids copying the whole data frame on every pd.concat call
lstSentimentedLines = []

with open(filename, 'r', encoding = 'utf-8') as fInputFile:
    for strInputLine in fInputFile:
        lineElements = strInputLine.split(';')
        iIndex += 1
        
        if not iIndex % 1000:
            print(f'INFO: Processing line {iIndex}') 
        
        if len(lineElements) > 7 and iIndex > 1:
            strLineCode = lineElements[6]
            strLineComment = lineElements[7]
            strReviewFilename = lineElements[2]
            
            # filter if line is about COMMIT_MSG
            # if not, then we calculate the sentiment
            if 'COMMIT_MSG' not in strReviewFilename:
                sentiment = comment2sentiment(strLineComment, keywordsComments_positive, keywordsComments_negative) 
                oneRow = {'filename': strReviewFilename, 'LOC': strLineCode, 'class_value': sentiment}
                lstSentimentedLines.append(oneRow)
            else:
                print(f'INFO: Skipping Commit message in line {strReviewFilename}: {strLineCode}')
                print(f'INFO: Lines processed: {len(lstSentimentedLines)}')

# data frame with the result of the sentiment analysis
dfSentimentedLines = pd.DataFrame(lstSentimentedLines, columns = ['filename', 'LOC', 'class_value'])

INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Ivan Quach <ivan.quach@aireon.com>
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Ivan Quach <ivan.quach@aireon.com>
INFO: Lines processed: 10
INFO: Skipping Commit message in line /COMMIT_MSG: protocol, which should be also addressed.
INFO: Lines processed: 71
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Michael Mann <mmann78@netscape.net>
INFO: Lines processed: 94
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Michael Mann <mmann78@netscape.net>
INFO: Lines processed: 94
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: I38c781

INFO: Skipping Commit message in line /COMMIT_MSG: 2. Request/Response tracking with timestamp between request and response in response frame.
INFO: Lines processed: 2527
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 2528
INFO: Skipping Commit message in line /COMMIT_MSG: Removed date and time present_value field dissectors.
INFO: Lines processed: 2570
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ie13da6bfd7fefefbc5bb5df3461c7fc18261df81
INFO: Lines processed: 2601
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ie13da6bfd7fefefbc5bb5df3461c7fc18261df81
INFO: Lines processed: 2601
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 2601
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 2601
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmai

INFO: Skipping Commit message in line /COMMIT_MSG: correctly updated. Right now wireshark gives a dsn for every TCP frame (even
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: correctly updated. Right now wireshark gives a dsn for every TCP frame (even
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: correctly updated. Right now wireshark gives a dsn for every TCP frame (even
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: - Now displays mappings only for packets with data (seglen > 0).
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: - Now displays mappings only for packets with data (seglen > 0).
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: - Now displays mappings only for packets with data (seglen > 0).
INFO: Lines processed: 5150
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ibd6e5e9144df1feadbabbfe8498d33e4882f9

INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 6247
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 6247
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 6256
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Pascal Quantin <pascal.quantin@gmail.com>
INFO: Lines processed: 6263
INFO: Skipping Commit message in line /COMMIT_MSG: of spatial streams (bit 14-16)
INFO: Lines processed: 6470
INFO: Skipping Commit message in line /COMMIT_MSG: dot11_ht_vht_flags=0x00000551 <--short preamble encoded to 10th bit of
INFO: Lines processed: 6470
INFO: Skipping Commit message in line /COMMIT_MSG: dot11_ht_vht_flags.
INFO: Lines processed: 6470
INFO: Skipping Commit message in line /COMMIT_MSG: of spatial streams (bit 14-16)
INFO: Lines processed: 6472
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: If7f921f7ede7518ecbb88395d6200f600a47bd85
INFO:

INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 9406
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9446
INFO: Skipping Commit message in line /COMMIT_MSG: items exported by nProbe.
INFO: Lines processed: 9472
INFO: Skipping Commit message in line /COMMIT_MSG: Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
INFO: Lines processed: 9472
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Peter Wu <peter@lekensteyn.nl>
INFO: Lines processed: 9488
INFO: Processing line 10000
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Michael Mann <mmann78@netscape.net>
INFO: Lines processed: 10110
INFO: Skipping Commit message in line /COMMIT_MSG: Tested-by: Petri Dish Buildbot <buildbot-no-reply@wireshark.org>
INFO: Lines processed: 10131
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 10204
INFO: Skipping Commit message in line 

INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Michael Mann <mmann78@netscape.net>
INFO: Lines processed: 30985
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 30987
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Dario Lombardo <lomato@gmail.com>
INFO: Lines processed: 30987
INFO: Skipping Commit message in line /COMMIT_MSG: https://www.wireshark.org/lists/wireshark-dev/201604/msg00141.html
INFO: Lines processed: 31118
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-on: https://code.wireshark.org/review/18415
INFO: Lines processed: 31135
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-on: https://code.wireshark.org/review/18415
INFO: Lines processed: 31135
INFO: Skipping Commit message in line /COMMIT_MSG: which case it will scan until a non-escaped finishing double quote is
INFO: Lines processed: 31196
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 31285
INFO: Skipping Commi

INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-on: https://code.wireshark.org/review/16569
INFO: Lines processed: 60281
INFO: Processing line 62000
INFO: Processing line 63000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 62664
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 62664
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 62671
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Gerald Combs <gerald@wireshark.org>
INFO: Lines processed: 62678
INFO: Skipping Commit message in line /COMMIT_MSG: Petri-Dish: Gerald Combs <gerald@wireshark.org>
INFO: Lines processed: 62678
INFO: Skipping Commit message in line /COMMIT_MSG: Reviewed-by: Anders Broman <a.broman58@gmail.com>
INFO: Lines processed: 62685
INFO: Processing line 64000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 63381
INFO: Skipping Commit me
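
The loop above splits each input line on every `;`, so a field that itself contains a semicolon shifts the column indices. If the input file quotes such fields (an assumption; the export format is not described here), the standard `csv` module could parse it instead. A minimal sketch on an in-memory sample with made-up rows and header names, keeping the same column positions as the script (2 = file name, 6 = line of code, 7 = comment):

```python
import csv
import io

# made-up sample data; the header names are placeholders
sample = io.StringIO(
    'id;rev;filename;a;b;c;line;comment\n'
    '1;r1;epan/foo.c;x;x;x;"int i = 0;";"good idea"\n'
    '2;r1;/COMMIT_MSG;x;x;x;"Change-Id: abc";"typo?"\n'
)

rows = []
reader = csv.reader(sample, delimiter=';', quotechar='"')
next(reader)  # skip the header line
for fields in reader:
    # same filter as in the script: enough columns and not a commit message
    if len(fields) > 7 and 'COMMIT_MSG' not in fields[2]:
        rows.append({'filename': fields[2], 'LOC': fields[6], 'comment': fields[7]})

print(rows)
```

The quoted semicolon in `int i = 0;` stays inside the line-of-code field instead of creating an extra column.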

In [4]:
# saving the output into a .csv file with $ as separator
dfSentimentedLines.to_csv(saveFilename, 
                          sep = "$",
                          index = False)

The results of this script are now saved in a .csv file where each line is tagged with its sentiment-based verdict. The file can be used as a dictionary that maps each line to a "verdict". It is in a raw format, which means:
* it contains conflicting duplicates - the same line can appear more than once with different verdicts
* it contains many exact duplicates - many lines are naturally part of several commits, and even within the same commit a line can be extracted twice (the API sometimes provides the same data more than once)
* it contains irrelevant lines - some lines consist only of "#" or "//" or are empty (""), which means that the data set needs to be cleaned up
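
A possible clean-up pass for the three issues listed above, sketched with the standard library only. The rules are assumptions, not a method prescribed by this script: lines that are empty or consist only of a comment marker are dropped, exact duplicates are collapsed, and a line with conflicting verdicts receives the majority verdict (a tie falls back to 0). `clean_verdicts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def clean_verdicts(rows):
    """rows: iterable of (line_of_code, verdict) pairs; returns {line: verdict}."""
    irrelevant = {'', '#', '//'}
    votes = defaultdict(Counter)
    for line, verdict in rows:
        stripped = line.strip()  # assumption: surrounding whitespace is not significant
        if stripped in irrelevant:
            continue  # nothing to learn from an empty or marker-only line
        votes[stripped][verdict] += 1
    # majority vote per line; a tie falls back to verdict 0
    return {line: int(counts[1] > counts[0]) for line, counts in votes.items()}

# made-up sample data
raw = [('int i = 0;', 1), ('int i = 0;', 1), ('int i = 0;', 0),
       ('//', 0), ('', 1), ('return x;', 0)]
print(clean_verdicts(raw))
```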