# Table of Contents

1. [Data](#data)<br>
    1.1. [Data Extraction](#data-extraction)<br>
    1.2. [Manual Feature Selection](#manual-feature-selection)<br>
    1.3. [Data Cleaning](#data-cleaning)<br>
    1.4. [Generating New Features](#generating-new-features)<br>
    &nbsp;&nbsp;&nbsp;1.4.1. [Sentiment Analysis](#sentiment-analysis)<br>
    &nbsp;&nbsp;&nbsp;1.4.2. [Fix Duration](#fix-duration)<br>

2. [Implementation](#implementation)<br>
3. [Evaluation](#evaluation)<br>
4. [Conclusion](#conclusion)<br>

In [173]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import matplotlib.cm as cm
# import tensorflow as tf
# import sklearn
# import subprocess
# import shlex

import csv

# 1. Data <a class="anchor" id="data"></a>

## 1.1. Data Extraction <a class="anchor" id="data-extraction"></a>

In [174]:
path_all_data = "../dataset/data_test.csv" # only using 10 examples

df = pd.read_csv(path_all_data, sep=',')
data = np.array(df)

# TODO: Only use columns where the hand-selected features lie

## 1.2. Manual Feature Selection <a class="anchor" id="manual-feature-selection"></a>

First, the most relevant features are selected manually. The dataset originally included the following features:

<img src="./resources/features_in_dataset.png" alt="Features in Dataset" width="280">

However, after careful consideration, only the following columns will be taken into account for the purposes of this project:

<font color=red>
Need to perform manual feature selection!!!
</font>
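
Until the final list is decided, the selection step itself can be sketched as follows. This is only an illustration: the column names (`priority`, `summary`, `description`) are hypothetical placeholders rather than the features chosen for this project, and the toy frame stands in for the real dataset.

```python
import pandas as pd

# Hypothetical column names; replace with the hand-selected features.
SELECTED_COLUMNS = ["priority", "summary", "description"]

# Toy frame standing in for the dataset loaded in section 1.1.
toy_df = pd.DataFrame({
    "id": [1, 2],
    "priority": ["Minor", "Major"],
    "summary": ["Fix typo", "Crash on start"],
    "description": ["Rename a field", "Null pointer at boot"],
    "watchers": [0, 3],
})

# Keep only the selected columns, in the listed order.
selected_df = toy_df[SELECTED_COLUMNS]
print(selected_df.shape)
```

With the real file, the same list could instead be passed to `pd.read_csv` through its `usecols` parameter so that only those columns are read.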

## 1.3. Data Cleaning <a class="anchor" id="data-cleaning"></a>

In [175]:
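# Remove rows where data is missing (marked with the string 'null')
print("> SIZE BEFORE CLEANING: " + str(data.shape))
# Build the mask first: deleting rows while iterating shifts the row indices
has_null = np.array([any(isinstance(val, str) and val == 'null' for val in row)
                     for row in data], dtype=bool)
for row in data[has_null]:
    print("> REMOVING ROW:\n" + str(row))
data = data[~has_null]

print("> SIZE AFTER CLEANING: " + str(data.shape))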

> SIZE BEFORE CLEANING: (10, 32)
> REMOVING ROW:
[226 'Bug' 'Resolved' 'Fixed' '   ' 'Minor' 'Suresh Srinivas'
 '2012/05/11 18:29:32 +0100' 'null' 'Unassigned'
 '2012/05/15 03:31:24 +0100' 3.376296296 'null'
 'Make the daemon names and other field names consistent   '
 'Following names need to be consistent: Hdfs -&gt; HDFS Mapreduce -&gt; MapReduce Zookeeper -&gt; ZooKeeper HADOOP -&gt; Hadoop   '
 '0.9.0' '0.9.0' 0 1 20 0 1 1 1 0 0 0 0 0 0 0 nan]
> SIZE AFTER CLEANING: (9, 32)
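
The same cleaning rule can also be expressed on the DataFrame itself, before converting to a NumPy array. The sketch below uses a toy frame with made-up column names (`id`, `assignee`, `votes`) and assumes, as in the cell above, that a missing value is the literal string `'null'`.

```python
import pandas as pd

# Toy frame standing in for the dataset; 'null' marks a missing value.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "assignee": ["alice", "null", "bob"],
    "votes": [0, 1, 2],
})

# True for every row that contains the literal string 'null' in any column.
has_null = toy_df.isin(["null"]).any(axis=1)

# Keep only the rows without a 'null' marker.
cleaned_df = toy_df[~has_null]
print(cleaned_df.shape)
```

Keeping the data in a DataFrame preserves the column names, which may help once the hand-selected features are chosen.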


## 1.4. Generating New Features <a class="anchor" id="generating-new-features"></a>

### 1.4.1. Sentiment Analysis <a class="anchor" id="sentiment-analysis"></a>

http://sentistrength.wlv.ac.uk/

SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:

-1 (not negative) to -5 (extremely negative)

1 (not positive) to 5 (extremely positive)

Why does it use two scores? Because research from psychology has revealed that we process positive and negative sentiment in parallel - hence mixed emotions.

SentiStrength can also report binary (positive/negative), trinary (positive/negative/neutral) and single scale (-4 to +4) results. SentiStrength was originally developed for English and optimised for general short social web texts, but it can be configured for other languages and contexts by changing its input files.
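
To make the score ranges concrete, the sketch below combines a dual score into a single value by summing the two parts, which is what `RateSentiment` does further down. The sum of a value in 1..5 and a value in -5..-1 always lies in -4..+4. The `trinary_label` rule (taking the sign of the sum) is an assumption made for illustration and may differ from SentiStrength's own trinary mode.

```python
def single_scale(positive, negative):
    """Combine the dual scores into one value in the range -4 to +4."""
    if not (1 <= positive <= 5 and -5 <= negative <= -1):
        raise ValueError("expected positive in 1..5 and negative in -5..-1")
    return positive + negative


def trinary_label(positive, negative):
    """Assumed rule (not SentiStrength's): the sign of the sum picks the label."""
    total = single_scale(positive, negative)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(single_scale(2, -1), trinary_label(2, -1))
```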

In [4]:
import subprocess
import shlex
import os.path
import sys

In [13]:
# Set the paths to the SentiStrength jar and to its data folder so that the code below works.
# Only use forward slashes (/).
SentiStrengthLocation = "../sentiment_analysis/SentiStrength/SentiStrength.jar" # The location of SentiStrength on your computer
SentiStrengthLanguageFolder = "../sentiment_analysis/SentiStrength/data/" # The location of the unzipped SentiStrength data files on your computer

In [15]:
# The following code tests that the two locations above are correct. If no error message is printed, the paths are fine.
if not os.path.isfile(SentiStrengthLocation):
    print("SentiStrength not found at: ", SentiStrengthLocation)
if not os.path.isdir(SentiStrengthLanguageFolder):
    print("SentiStrength data folder not found at: ", SentiStrengthLanguageFolder)
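
Because `RateSentiment` launches SentiStrength through the `java` command, it may also be worth checking that a Java runtime is on the `PATH`. The helper below is a sketch; the name `missing_prerequisites` is made up for this example, and the paths passed to it are deliberately invalid placeholders.

```python
import os.path
import shutil


def missing_prerequisites(jar_path, data_folder):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    if not os.path.isfile(jar_path):
        problems.append("SentiStrength not found at: " + jar_path)
    if not os.path.isdir(data_folder):
        problems.append("SentiStrength data folder not found at: " + data_folder)
    return problems


# Placeholder paths that do not exist, to show the reported problems.
for problem in missing_prerequisites("no/such/SentiStrength.jar", "no/such/data/"):
    print(problem)
```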

In [19]:
# The code below allows SentiStrength to be called and run on a single line of text.
def RateSentiment(sentiString):
    # Open a subprocess, using shlex to split the command line string into the correct args list format
    command = "java -jar '" + SentiStrengthLocation + "' stdin sentidata '" + SentiStrengthLanguageFolder + "'"
    p = subprocess.Popen(shlex.split(command), stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Send the string to be rated via stdin. Note that all spaces are replaced with +
    b = bytes(sentiString.replace(" ", "+"), 'utf-8')  # Can't send a string in Python 3, must send bytes
    stdout_byte, stderr_text = p.communicate(b)
    stdout_text = stdout_byte.decode("utf-8")  # Convert from bytes
    # -------- Edit - Nov 9 2017 --------
    stdout_list = stdout_text.split("\t")      # Split by tab: ['2', '-1', '\n']
    del stdout_list[-1]                        # Get rid of the last newline element: ['2', '-1']
    results = list(map(int, stdout_list))      # Convert the strings to integers
    results = results[0] + results[1]          # Combine the positive and the negative scores
    # -------- END: Edit - Nov 9 2017 --------
    # Previous version, which returned the raw ratings followed by the input text:
    #stdout_text = stdout_text.rstrip().replace("\t"," ") #remove the tab spacing between the positive and negative ratings. e.g. 1    -5 -> 1 -5
    #return stdout_text + " " + sentiString
    return results

In [20]:
#Test to ensure that it works
print(RateSentiment("i am happy but not really"))

1
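
Turning this score into a feature means applying the rating function to a text column. The sketch below shows one way to do that; `add_sentiment_column` and `fake_rate` are names invented for this example, the `description` column is a hypothetical placeholder, and `fake_rate` is a trivial stand-in so the block runs without Java. It is not SentiStrength.

```python
import pandas as pd


def add_sentiment_column(frame, text_column, rate):
    """Return a copy of frame with a 'sentiment' column computed by rate."""
    result = frame.copy()
    result["sentiment"] = result[text_column].astype(str).map(rate)
    return result


def fake_rate(text):
    # Stand-in scorer for illustration only; NOT SentiStrength.
    return 1 if "happy" in text else 0


toy_df = pd.DataFrame({"description": ["i am happy", "daemon names differ"]})
scored_df = add_sentiment_column(toy_df, "description", fake_rate)
print(scored_df["sentiment"].tolist())
```

In the notebook, `RateSentiment` would be passed as `rate`. Since it starts a new Java process for every call, scoring a large dataset this way is likely to be slow.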


### 1.4.2. Fix Duration <a class="anchor" id="fix-duration"></a>
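
In the row printed in section 1.3, the two timestamps and the value 3.376296296 are consistent with a duration in days between them. Assuming those timestamps are the creation and resolution times of an issue (an assumption, since the column names are not shown here), the fix duration could be sketched as follows. The function name `fix_duration_days` is made up for this example.

```python
from datetime import datetime

# Timestamp layout as it appears in the printed row, e.g. '2012/05/11 18:29:32 +0100'.
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S %z"


def fix_duration_days(created, resolved):
    """Return the time between two timestamps as a fractional number of days."""
    start = datetime.strptime(created, TIMESTAMP_FORMAT)
    end = datetime.strptime(resolved, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 86400  # 86400 seconds per day


days = fix_duration_days("2012/05/11 18:29:32 +0100", "2012/05/15 03:31:24 +0100")
print(round(days, 9))
```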

# 2. Implementation <a class="anchor" id="implementation"></a>

*Note: Automatic Feature Selection will be included here*

# 3. Evaluation <a class="anchor" id="evaluation"></a>

*Note: Include comparison with other related work*

# 4. Conclusion <a class="anchor" id="conclusion"></a>