## Upload Review Data using AzureML

Create a batch file and execute:
    
```
cd "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy"
AzCopy /Source:C:\_ilia_share\amazon_prod_reviews_clean\raw /Dest:https://ikcentralusstore.blob.core.windows.net/amazonrev /DestKey:dLR5lH2QN/ejGmyD61nQoh7Cc2DW8jIKhR5n5uvGu8+H3Qem4J0XzWG1/7XtBxmVlWr+y/GNRlwX4Km5YU68sg== /Pattern:"aggressive_dedup.json"
pause
```

## Load Review Data (from Blob)

In [1]:
# Idea courtesy of Thomas D.
import time
STIME = { "start" : time.time() }

def tic():
    STIME["start"] = time.time()

def toc():
    elapsed = time.time() - STIME["start"]
    print("%.2f seconds elasped" % elapsed)

Creating SparkContext as 'sc'


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
21,application_1469092805367_0009,pyspark,idle,Link,Link,✔


Creating HiveContext as 'sqlContext'
SparkContext and HiveContext created. Executing user code ...


In [3]:
# paths
blob = "wasb://amazonrev@ikcentralusstore.blob.core.windows.net"
json_dta = blob + "/aggressive_dedup.json"

In [4]:
# load data
jsonFile = sqlContext.read.json(json_dta)
jsonFile.registerTempTable("reviews")

print(type(jsonFile)) #  <class 'pyspark.sql.dataframe.DataFrame'>
jsonFile.show(5)

# Note: also load the IMDB data at some point
# ...

<class 'pyspark.sql.dataframe.DataFrame'>
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|          reviewerID|   reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|B003UYU16G| [0, 0]|    5.0|It is and does ex...|11 21, 2012|A00000262KYZUE4J5...| Steven N Elich|Does what it's su...|    1353456000|
|B005FYPK9C| [0, 0]|    5.0|I was sketchy at ...| 01 8, 2013|A000008615DZQRRI9...|      mj waldon|           great buy|    1357603200|
|B000VEBG9Y| [0, 0]|    3.0|Very mobile produ...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great product but...|    1395619200|
|B001EJMS6K| [0, 0]|    4.0|Easy to use a mob...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great inexpensive...|    1395619200|
|B003XJCNVO| 

## Examine some of the reviews

In [5]:
%%sql 
SELECT overall, reviewText
FROM reviews
LIMIT 10

In [6]:
%%sql 
SELECT overall, COUNT(overall) as freq
FROM reviews
GROUP BY overall
ORDER by -freq

In [7]:
# Create a dataframe of our reviews
# To analyse class imbalance

reviews =  sqlContext.sql("SELECT " + 
                          "CASE WHEN overall < 3 THEN 'low' " +
                          "WHEN overall > 3 THEN 'high' ELSE 'mid' END as label, " + 
                          "reviewText as sentences " + 
                          "FROM reviews")
# Tally
#tally = reviews.groupBy("label").count()
#tally.show()
"""
mid| 7,039,272
low|10,963,811
high|64,453,794
"""

'\nmid| 7,039,272\nlow|10,963,811\nhigh|64,453,794\n'

In [8]:
# Let's look at some reviews to see how clean they are
# there seems to be lots of html formatting
for c,r in enumerate(reviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))

1. It is and does exactly what the description said it would be and would do. Couldn't be happier with it.
2. I was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. They are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.
3. Very mobile product. Efficient. Easy to use; however product needs a varmint guard. Critters are able to gorge themselves without a guard.
4. Easy to use a mobile. If you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.
5. Love this feeder. Heavy duty & capacity. Best feature is the large varmint guard. Definitely use a small lock or securing device on the battery housing latch. I gave 4 stars because several bolts were missing. Check contents b4 beginning.
6. Solid, stable mount. Holds iPhone with phone protector well. I have not however used the dash mount part of this product (only windshield).

In [9]:
# Some very basic cleaning
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType, DoubleType 
from bs4 import BeautifulSoup

def cleanerHTML(line):
    # html formatting
    html_clean = BeautifulSoup(line, "lxml").get_text().lower()
    # remove any double spaces, line-breaks, etc.
    return " ".join(html_clean.split())

def labelForResults(s):
    # string label to numeric
    if s == 'low':
        return 0.0
    elif s == 'high':
        return 1.0
    else:
        return -1.0
        
cleaner = UserDefinedFunction(cleanerHTML, StringType())
label = UserDefinedFunction(labelForResults, DoubleType())

cleanedReviews = reviews.select(reviews.label,
                                label(reviews.label).alias('sentiment'), 
                                cleaner(reviews.sentences).alias('sentences'))

In [10]:
# A bit cleaner ...
for c,r in enumerate(cleanedReviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))

1. it is and does exactly what the description said it would be and would do. couldn't be happier with it.
2. i was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. they are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.
3. very mobile product. efficient. easy to use; however product needs a varmint guard. critters are able to gorge themselves without a guard.
4. easy to use a mobile. if you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.
5. love this feeder. heavy duty & capacity. best feature is the large varmint guard. definitely use a small lock or securing device on the battery housing latch. i gave 4 stars because several bolts were missing. check contents b4 beginning.
6. solid, stable mount. holds iphone with phone protector well. i have not however used the dash mount part of this product (only windshield).

In [11]:
#cleanedReviews.show()

In [12]:
# Equalise classes 
neg_rev = cleanedReviews.filter("sentiment = 0.0")
pos_rev = cleanedReviews.filter("sentiment = 1.0").limit(neg_rev.count())

In [13]:
# Save data
allData = pos_rev.unionAll(neg_rev)
print(allData.count()) # 21,927,622 ( = 10,963,811 * 2)

allDataLoc = blob + "/cleaned_equal_classes.json"
allData.write.json(allDataLoc)

21927622

## Load Clean Data

In [14]:
allDataLoc = blob + "/cleaned_equal_classes.json"
allData = sqlContext.read.json(allDataLoc)

data_count = allData.count()
print(data_count)

21927622

In [32]:
# Take 1 million
sub_sample = 1000000
sub_sample_ratio = float(sub_sample)/float(data_count)

print(sub_sample_ratio)

print(type(allData))
allData.take(5)

0.0456045803781
<class 'pyspark.sql.dataframe.DataFrame'>
[Row(label=u'high', sentences=u"this is one of the best albums i've listened too, incredible. it doesn't get much heavier than sic. eyeless is a classic along with wait and bleed. in wait and bleed cory shows us his singing ability other than his aggressive screaming, which is very impressive. surfacing, incredible also. spit it out, amazing. trust me, this cd is great. the only tracks to be warned about are probably...1...7...10 and the last one. but still the last one and 10 along with 7 are listenable. very good album", sentiment=1.0), Row(label=u'high', sentences=u'wow..really..what can i say? wow. ross robinson has added another to his collection of great cd\'s. this is my favorite machine head...for some good reasons. the singer\'s voice (flynn) has really come far in this album. the metal riffs on the guitar are toooo catchy. :-) some stand out songs are "the blood the sweat the tears" "i defy" "desire to fire" well..they

In [33]:
# sub_sample -> sample(boolean withReplacement, double fraction, long seed)
subData = allData.sample(False, sub_sample_ratio, 12345)

# split intro training and test (50%, 50%)
trainingData, testData = neg_rev.unionAll(pos_rev).randomSplit([0.5, 0.5])

In [34]:
tic()
trainingData.show()
toc()

+-----+---------+--------------------+
|label|sentiment|           sentences|
+-----+---------+--------------------+
|  low|      0.0|!!!!!!!!!!!!!!!!!...|
|  low|      0.0|!is too big for m...|
|  low|      0.0|"... the conserva...|
|  low|      0.0|"aerobics and pus...|
|  low|      0.0|"attention!!the s...|
|  low|      0.0|"backlash" was a ...|
|  low|      0.0|"beautiful ruins"...|
|  low|      0.0|"boxing helena" i...|
|  low|      0.0|"clad" means that...|
|  low|      0.0|"classic pong" is...|
|  low|      0.0|"denialism"??? lo...|
|  low|      0.0|"do you want to b...|
|  low|      0.0|"doesn't work by ...|
|  low|      0.0|"duvet set" does ...|
|  low|      0.0|"fallen masters" ...|
|  low|      0.0|"full clip" is es...|
|  low|      0.0|"gods & generals"...|
|  low|      0.0|"gould and lewont...|
|  low|      0.0|"greenhouse" top ...|
|  low|      0.0|"h-he wh-whispere...|
+-----+---------+--------------------+
only showing top 20 rows

1625.88 seconds elasped

In [None]:
tic()
testData.show()
toc()

+-----+---------+--------------------+
|label|sentiment|           sentences|
+-----+---------+--------------------+
|  low|      0.0|!!!!**attention!!...|
|  low|      0.0|"...under george ...|
|  low|      0.0|"40 more years" i...|
|  low|      0.0|"a hundred years ...|
|  low|      0.0|"ain't it awful" ...|
|  low|      0.0|"all size fits mo...|
|  low|      0.0|"an unseen enemy ...|
|  low|      0.0|"as bradley put i...|
|  low|      0.0|"austenland" appe...|
|  low|      0.0|"big capacity" co...|
|  low|      0.0|"big" amazon brok...|
|  low|      0.0|"conflict of inte...|
|  low|      0.0|"convention girl"...|
|  low|      0.0|"did you take tha...|
|  low|      0.0|"dolgyal" is a in...|
|  low|      0.0|"double exposure"...|
|  low|      0.0|"easy & quick set...|
|  low|      0.0|"energy" was the ...|
|  low|      0.0|"fits most golf c...|
|  low|      0.0|"from the directo...|
+-----+---------+--------------------+
only showing top 20 rows

1624.16 seconds elasped

In [None]:
trainingData.cache()
testData.cache()

print(trainingData.count())
print(testData.count())

## 1. TFIDF

In [None]:
# Some sample data to use
# to save time ...

sampleData = sqlContext.createDataFrame([
  (0.0, "Hi, I heard about Spark"),
  (0.0, "I couldn't wish Java was any different"),
  (1.0, "Logistic regression models are super boring")], 
                        ["sentiment", "sentences"])
sampleData.show()

In [None]:
# Example to get ngrams-range
from itertools import chain
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import col, udf
from pyspark.ml.feature import NGram

class nGramRanger:
    
    """
    
    Wrapper around pyspark.ml.feature.Ngram which calls it in a loop and produces a 
    range of ngrams as features. 
    
    ngramRange = (1,3) creates Ngram(1) and Ngram(2) and Ngram(3) in interim
    and produces ngrams.
    
    """
    
    def __init__(self, ngramRange, inputCol, outputCol = "ngrams"):
        # ngramRange = (1,3)
        self.ngramRange = ngramRange
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transform(self, inDf):
        _orig = inDf
        def concat(type):
            def concat_(*args):
                return list(chain(*args))
            return udf(concat_, ArrayType(type))
        
        # Create columns for n-grams
        for ngr in range(self.ngramRange[0],self.ngramRange[-1]+1):
            ngram = NGram(inputCol=self.inputCol , n=ngr, outputCol="%sgram" % ngr)
            inDf = ngram.transform(inDf)        
        # Combine
        concat_string_arrays = concat(StringType())
        ###!!!!! FIx this at some point as 3 grams manually !!!!!
        outDta = _orig.join(inDf.select(concat_string_arrays(col("1gram"),
                                                             col("2gram"),
                                                             col("3gram")).alias(self.outputCol)))
        # Return dataframe
        return outDta

"""
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")
tokenisedData = tokenizer.transform(sampleData)

stopremover = StopWordsRemover(inputCol="words", outputCol="wordsFiltered")
stoppedData = stopremover.transform(tokenisedData)

ngrammer = nGramRanger(ngramRange = (1,3), inputCol = "wordsFiltered", outputCol = "ngrams")
output_test = ngrammer.transform(stoppedData)

for rw in output_test.take(5):
    print(rw)
"""

In [None]:
# Pipeline for feature selection and classification
# Using https://spark.apache.org/docs/1.5.2/ml-features.html

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover, NGram
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier


numfeat = 40000
ngram_range = (1,3)

tic()

# 1. Feature-extraction
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")
ngrammer = nGramRanger(ngramRange = ngram_range, inputCol = "words", outputCol = "ngrams")
hashingtf  = HashingTF(inputCol="ngrams", outputCol="rawFeatures", numFeatures=numfeat)
idf = IDF(inputCol="rawFeatures", outputCol="features")

# Train
tokenized_train = tokenizer.transform(trainingData)
ngrammed_train = ngrammer.transform(tokenized_train)
hashed_train = hashingtf.transform(ngrammed_train)
idfModel = idf.fit(hashed_train)
idf_train = idfModel.transform(hashed_train)

# Test
tokenized_test = tokenizer.transform(testData)
ngrammed_test = ngrammer.transform(tokenized_test)
hashed_test = hashingtf.transform(ngrammed_test)
idf_test = idfModel.transform(hashed_test)

toc()

In [None]:
# 2A. Classifier (Logistic Regression)
tic()

classi = LogisticRegression(labelCol="sentiment", featuresCol="features")
tfidfModel = classi.fit(idf_train)
pred = tfidfModel.transform(idf_test)

toc()  

# 3. Examine
numSuccesses = pred.where("""(prediction = sentiment)""").count()
numInspections = numSuccesses + pred.where("""(prediction != sentiment)""").count()
acc = (float(numSuccesses) / float(numInspections)) * 100
print("%.2f\% success rate" % acc) # 49.74 success rate

In [None]:
# 2B. Classifier (RandomForest)
tic()

classi = RandomForestClassifier(labelCol="sentiment", featuresCol="features")
tfidfModel = classi.fit(idf_train)
pred = tfidfModel.transform(idf_test)

toc()  

# 3. Examine
numSuccesses = pred.where("""(prediction = sentiment)""").count()
numInspections = numSuccesses + pred.where("""(prediction != sentiment)""").count()
acc = (float(numSuccesses) / float(numInspections)) * 100
print("%.2f\% success rate" % acc) # 49.74 success rate

In [None]:
# 2C. Classifier (GBTClassifier)
tic()

classi = GBTClassifier(labelCol="sentiment", featuresCol="features")
tfidfModel = classi.fit(idf_train)
pred = tfidfModel.transform(idf_test)

toc()  

# 3. Examine
numSuccesses = pred.where("""(prediction = sentiment)""").count()
numInspections = numSuccesses + pred.where("""(prediction != sentiment)""").count()
acc = (float(numSuccesses) / float(numInspections)) * 100
print("%.2f\% success rate" % acc) # 49.74 success rate

Running this locally got totally different results. 

Briefly:

```
# Cleaning
def clean_review(review):
    temp = BeautifulSoup(review, "lxml").get_text()
    punctuation = """.,?!:;(){}[]"""
    for char in punctuation
        temp = temp.replace(char, ' ' + char + ' ')
    words = " ".join(temp.lower().split()) + "\n"
    return words

# Vectoriser
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1, 3), sublinear_tf = True)

# Classification using LR
classifier_tfidf = LogisticRegression()




# Review
classifier_tfidf.fit(train_data_features, train_labels)
classifier_tfidf.score(test_data_features, test_labels) 
```

* 50k gives 0.91479748910479386
* 500k gives 0.92755933545886227
* 1mill gives 0.93064867578787447

*Cannot get anywhere near that in spark?!*

In [None]:
# 3. Evaluation
pred.select(col('prediction'),col('sentiment')).show()