## Notebook to perform pre-processing on sentiment analysis data.

To download the dataset into a local folder, run the following:

mkdir -p datasets/words
wget http://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz -O datasets/words-temp.tar.gz
tar xzfv datasets/words-temp.tar.gz -C datasets/words
rm datasets/words-temp.tar.gz



In [22]:
import (
    "io/ioutil"
    "fmt"
    "strings"
    "strconv"
    "github.com/kniren/gota/dataframe"
)

const debug = false // set to true to run extra stuff

// Create a struct to store the sentiment dataset
type Pair struct {
    Phrase string
    Frequency int
}

In [2]:
// Load each review file into a byte slice.
// Adjust the path below to match where the archive was extracted.
const kitchenReviews = "../datasets/ML/sentiment/kitchen"

positives, err := ioutil.ReadFile(kitchenReviews + "/positive.review")
negatives, err2 := ioutil.ReadFile(kitchenReviews + "/negative.review")

if err != nil || err2 != nil {
    fmt.Println("Error(s) in loading datafile:", err, err2)
}

Convert each byte slice into a string and split it on whitespace with strings.Fields(). This produces slices of tokens, where each token is a single string holding a phrase and its count, such as 'i_can:2'.

In [3]:
pairsPositive := strings.Fields(string(positives))
pairsNegative := strings.Fields(string(negatives))
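
To make the token format concrete, here is a minimal sketch of what strings.Fields() does to one line. The sample line is made up for illustration and is not taken from the dataset; `tokenize` is a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a hypothetical helper: it splits a raw line on any run of
// whitespace, exactly as strings.Fields does.
func tokenize(line string) []string {
	return strings.Fields(line)
}

func main() {
	// Made-up sample in the "phrase:count" format (assumption, not real data).
	sample := "i_can:2 great:1  works_well:3\n"
	for _, tok := range tokenize(sample) {
		fmt.Println(tok)
	}
}
```

Repeated spaces and the trailing newline produce no empty tokens, which is why no extra trimming is needed before parsing.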

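The function below takes the slice of tokens and does the following for each token:

* Split the token into two elements, separated by a colon.
* Place the first element into a new Pair struct as Phrase.
* Convert the second element to an integer with strconv and store it in the same struct as Frequency.

It returns a slice of Pair and a separate lookup table: a map whose keys are the phrases seen (the values are simply true). Tokens with no colon or a non-integer count are left out of the slice, but their phrase is still recorded in the lookup table.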

In [4]:
func pairsAndFilters(splitPairs []string) ([]Pair, map[string]bool) {
    var pairs []Pair
    m := make(map[string]bool) // lookup of every phrase seen

    for _, pair := range splitPairs {
        // Each token looks like "phrase:count"
        p := strings.Split(pair, ":")
        phrase := p[0]
        m[phrase] = true
        if len(p) < 2 {
            continue // no count present
        }

        freq, err := strconv.Atoi(p[1])
        if err != nil {
            continue // count is not an integer
        }

        pairs = append(pairs, Pair{
            Phrase:    phrase,
            Frequency: freq,
        })
    }

    return pairs, m
}
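
The per-token logic can be sketched in isolation as follows. `parseToken` is a hypothetical helper that does not appear in the notebook, and the input tokens are made up; the sketch only shows which tokens yield a Pair and which are skipped.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Pair mirrors the notebook's struct.
type Pair struct {
	Phrase    string
	Frequency int
}

// parseToken (hypothetical) parses one "phrase:count" token and reports
// whether the token was well formed.
func parseToken(tok string) (Pair, bool) {
	parts := strings.Split(tok, ":")
	if len(parts) < 2 {
		return Pair{}, false // no colon at all
	}
	freq, err := strconv.Atoi(parts[1])
	if err != nil {
		return Pair{}, false // count is not an integer
	}
	return Pair{Phrase: parts[0], Frequency: freq}, true
}

func main() {
	// Made-up tokens: one valid, one without a colon, one with a text count.
	for _, tok := range []string{"i_can:2", "nocolon", "label:positive"} {
		p, ok := parseToken(tok)
		fmt.Printf("%q -> %+v ok=%v\n", tok, p, ok)
	}
}
```

Only the first token produces a usable Pair; the other two are rejected by the length check and by strconv.Atoi respectively.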


Phrases that appear in both positive and negative reviews are less likely to indicate sentiment, so the lookup maps are used to remove them. For example, if the positive lookup map holds a phrase, every Pair with that phrase must be deleted from the negative slice. The function below takes one sentiment's Pair slice and the lookup map from the opposite sentiment, and returns only the Pair objects whose phrase does not appear in the other data set.

In [5]:
func exclude(pairs []Pair, exclusions map[string]bool) []Pair {
    var returnPairs []Pair
    for _, p := range pairs {
        // Keep only phrases that are absent from the opposite sentiment
        if !exclusions[p.Phrase] {
            returnPairs = append(returnPairs, p)
        }
    }

    return returnPairs
}

Run the dataset through the two functions:
* Create the slice of Pairs and the lookup map for each sentiment
* Filter each slice of Pairs through the lookup map of the opposite sentiment

In [6]:
parsedPositives, posPhrases := pairsAndFilters(pairsPositive)
parsedNegatives, negPhrases := pairsAndFilters(pairsNegative)
parsedPositives = exclude(parsedPositives, negPhrases)
parsedNegatives = exclude(parsedNegatives, posPhrases)
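
A toy run of the same mutual exclusion, using made-up phrases, shows the effect. `phraseSet` is a hypothetical helper standing in for the lookup map that pairsAndFilters returns; `exclude` is repeated here only so the sketch runs on its own.

```go
package main

import "fmt"

// Pair mirrors the notebook's struct.
type Pair struct {
	Phrase    string
	Frequency int
}

// phraseSet (hypothetical) builds a lookup of every phrase in pairs.
func phraseSet(pairs []Pair) map[string]bool {
	set := make(map[string]bool)
	for _, p := range pairs {
		set[p.Phrase] = true
	}
	return set
}

// exclude keeps only the pairs whose phrase is absent from exclusions.
func exclude(pairs []Pair, exclusions map[string]bool) []Pair {
	var kept []Pair
	for _, p := range pairs {
		if !exclusions[p.Phrase] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	// Made-up data: "the" occurs in both sentiments.
	pos := []Pair{{"great", 3}, {"the", 5}}
	neg := []Pair{{"broken", 2}, {"the", 4}}

	// Build both lookups from the unfiltered data before filtering either side.
	posSet, negSet := phraseSet(pos), phraseSet(neg)

	fmt.Println(exclude(pos, negSet)) // [{great 3}]
	fmt.Println(exclude(neg, posSet)) // [{broken 2}]
}
```

Because both lookups are built before either slice is filtered, the shared phrase is removed from both sides regardless of which exclusion runs first.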

Print a sample of the parsedPositives and parsedNegatives slices. This cell only produces output when debug is set to true.

In [23]:
// Print up to 10 phrases from each slice that have a frequency above 2
if debug {
    i := 0
    j := 0

    // Print out the positives
    for _, pair := range parsedPositives {
        if pair.Frequency > 2 && i < 10 {
            fmt.Printf("Positive Phrase: %s Frequency: %d \n", pair.Phrase, pair.Frequency)
            i++
        }
    }

    // Print out the negatives
    for _, pair := range parsedNegatives {
        if pair.Frequency > 2 && j < 10 {
            fmt.Printf("Negative Phrase: %s Frequency: %d \n", pair.Phrase, pair.Frequency)
            j++
        }
    }
}


In [10]:
// Load the Struct objects into Dataframes
dfPos := dataframe.LoadStructs(parsedPositives)
dfNeg := dataframe.LoadStructs(parsedNegatives)

// Sort the dataframes by Frequency, highest first
dfPos = dfPos.Arrange(dataframe.RevSort("Frequency"))
dfNeg = dfNeg.Arrange(dataframe.RevSort("Frequency"))
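
For comparison, the same highest-first ordering can be sketched with only the standard library. This is an alternative to the dataframe approach, not what the notebook itself does; `sortByFrequencyDesc` is a hypothetical helper name, and the sample values are borrowed from the positive output shown in this notebook.

```go
package main

import (
	"fmt"
	"sort"
)

// Pair mirrors the notebook's struct.
type Pair struct {
	Phrase    string
	Frequency int
}

// sortByFrequencyDesc (hypothetical) sorts pairs in place, highest
// frequency first, keeping the original order among equal frequencies.
func sortByFrequencyDesc(pairs []Pair) {
	sort.SliceStable(pairs, func(i, j int) bool {
		return pairs[i].Frequency > pairs[j].Frequency
	})
}

func main() {
	pairs := []Pair{{"pulp", 6}, {"tic-tac-toe", 10}, {"wusthoff", 7}}
	sortByFrequencyDesc(pairs)
	for _, p := range pairs {
		fmt.Printf("%s %d\n", p.Phrase, p.Frequency)
	}
}
```

A stable sort is used so that phrases with equal counts stay in their input order, which makes the result reproducible.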

Display the dataframes. Printing a dataframe shows its first 10 rows.

In [19]:
fmt.Printf("\n Positives: %s \n\n Negatives: %s", dfPos, dfNeg)


 Positives: [46383x2] DataFrame

    Phrase       Frequency
 0: tic-tac-toe  10       
 1: wusthoff     7        
 2: emperor      7        
 3: shot_glasses 6        
 4: pulp         6        
 5: games        6        
 6: sentry       6        
 7: gravel       6        
 8: the_emperor  5        
 9: aebleskivers 5        
    ...          ...      
    <string>     <int>    
 

 Negatives: [45760x2] DataFrame

    Phrase          Frequency
 0: seeds           9        
 1: perculator      7        
 2: probes          7        
 3: cork            7        
 4: coffee_tank     5        
 5: brookstone      5        
 6: convection_oven 5        
 7: black_goo       5        
 8: waring_pro      5        
 9: packs           5        
    ...             ...      
    <string>        <int>    


811 <nil>