## Upload Review Data using AzureML

Create a batch file and execute:
    
```
cd "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy"
AzCopy /Source:C:\_ilia_share\amazon_prod_reviews_clean\raw /Dest:https://ikcentralusstore.blob.core.windows.net/amazonrev /DestKey:dLR5lH2QN/ejGmyD61nQoh7Cc2DW8jIKhR5n5uvGu8+H3Qem4J0XzWG1/7XtBxmVlWr+y/GNRlwX4Km5YU68sg== /Pattern:"aggressive_dedup.json"
pause
```

## Load Review Data (from Blob)

In [1]:
# paths
blob = "wasb://amazonrev@ikcentralusstore.blob.core.windows.net"
json_dta = blob + "/aggressive_dedup.json"

Creating SparkContext as 'sc'


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
16,application_1469092805367_0004,pyspark,idle,Link,Link,✔


Creating HiveContext as 'sqlContext'
SparkContext and HiveContext created. Executing user code ...


In [2]:
# load data
jsonFile = sqlContext.read.json(json_dta)
jsonFile.registerTempTable("reviews")

print(type(jsonFile))
jsonFile.show(5)

# Note: also load the IMDB data at some point
# ...

<class 'pyspark.sql.dataframe.DataFrame'>
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|          reviewerID|   reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|B003UYU16G| [0, 0]|    5.0|It is and does ex...|11 21, 2012|A00000262KYZUE4J5...| Steven N Elich|Does what it's su...|    1353456000|
|B005FYPK9C| [0, 0]|    5.0|I was sketchy at ...| 01 8, 2013|A000008615DZQRRI9...|      mj waldon|           great buy|    1357603200|
|B000VEBG9Y| [0, 0]|    3.0|Very mobile produ...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great product but...|    1395619200|
|B001EJMS6K| [0, 0]|    4.0|Easy to use a mob...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great inexpensive...|    1395619200|
|B003XJCNVO| 

## Examine some of the reviews

In [None]:
%%sql -o topn
SELECT overall, reviewText
FROM reviews
LIMIT 10

In [None]:
%%sql -o revs
SELECT overall, COUNT(overall) as freq
FROM reviews
GROUP BY overall
ORDER by -freq

In [None]:
%%sql -o avgrev
SELECT reviewText
FROM reviews
WHERE overall = '3'
LIMIT 10

In [4]:
# Create a dataframe of our reviews
# To analyse class imbalance

reviews =  sqlContext.sql("SELECT " + 
                          "CASE WHEN overall < 3 THEN 'low' " +
                          "WHEN overall > 3 THEN 'high' ELSE 'mid' END as label, " + 
                          "reviewText as sentences " + 
                          "FROM reviews")
# Tally
#tally = reviews.groupBy("label").count()
#tally.show()
"""
mid| 7,039,272
low|10,963,811
high|64,453,794
"""

'\nmid| 7,039,272\nlow|10,963,811\nhigh|64,453,794\n'

In [5]:
# Let's look at some reviews to see how clean they are
# there seems to be lots of html formatting
for c,r in enumerate(reviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))

1. It is and does exactly what the description said it would be and would do. Couldn't be happier with it.
2. I was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. They are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.
3. Very mobile product. Efficient. Easy to use; however product needs a varmint guard. Critters are able to gorge themselves without a guard.
4. Easy to use a mobile. If you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.
5. Love this feeder. Heavy duty & capacity. Best feature is the large varmint guard. Definitely use a small lock or securing device on the battery housing latch. I gave 4 stars because several bolts were missing. Check contents b4 beginning.
6. Solid, stable mount. Holds iPhone with phone protector well. I have not however used the dash mount part of this product (only windshield).

In [6]:
# Some very basic cleaning
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType, DoubleType 
from bs4 import BeautifulSoup

def cleanerHTML(line):
    # html formatting
    html_clean = BeautifulSoup(line, "lxml").get_text().lower()
    # remove any double spaces, line-breaks, etc.
    return " ".join(html_clean.split())

def labelForResults(s):
    # string label to numeric
    if s == 'low':
        return 0.0
    elif s == 'high':
        return 1.0
    else:
        return -1.0
        
cleaner = UserDefinedFunction(cleanerHTML, StringType())
label = UserDefinedFunction(labelForResults, DoubleType())

cleanedReviews = reviews.select(reviews.label,
                                label(reviews.label).alias('sentiment'), 
                                cleaner(reviews.sentences).alias('sentences'))

In [7]:
# A bit cleaner ...
reviews = cleanedReviews
for c,r in enumerate(reviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))

1. it is and does exactly what the description said it would be and would do. couldn't be happier with it.
2. i was sketchy at first about these but once you wear them for a couple hours they break in they fit good on my board an have little wear from skating in them. they are a little heavy but won't get eaten up as bad by your grip tape like poser dc shoes.
3. very mobile product. efficient. easy to use; however product needs a varmint guard. critters are able to gorge themselves without a guard.
4. easy to use a mobile. if you're taller than 4ft, be ready to tuck your legs behind you as you hang and pull.
5. love this feeder. heavy duty & capacity. best feature is the large varmint guard. definitely use a small lock or securing device on the battery housing latch. i gave 4 stars because several bolts were missing. check contents b4 beginning.
6. solid, stable mount. holds iphone with phone protector well. i have not however used the dash mount part of this product (only windshield).

In [8]:
reviews.show()

+-----+---------+--------------------+
|label|sentiment|           sentences|
+-----+---------+--------------------+
| high|      1.0|it is and does ex...|
| high|      1.0|i was sketchy at ...|
|  mid|     -1.0|very mobile produ...|
| high|      1.0|easy to use a mob...|
| high|      1.0|love this feeder....|
| high|      1.0|solid, stable mou...|
| high|      1.0|i bought this pep...|
| high|      1.0|beautiful photos/...|
|  low|      0.0|my idea of colora...|
|  low|      0.0|no matter what we...|
|  low|      0.0|i do not suggest ...|
|  low|      0.0|useless - all you...|
| high|      1.0|this book is real...|
| high|      1.0|it is not a stick...|
| high|      1.0|love the size and...|
| high|      1.0|its very colorful...|
| high|      1.0|the condition of ...|
| high|      1.0|only negative. th...|
| high|      1.0|bought this for m...|
| high|      1.0|this book is a gr...|
+-----+---------+--------------------+
only showing top 20 rows

In [9]:
# Equalise classes for this example
# 10 mill pos and 10 mill neg
num_reviews = 1000 # sample
neg_rev = reviews.filter("sentiment = 0.0").limit(num_reviews)
pos_rev = reviews.filter("sentiment = 1.0").limit(num_reviews)

# split intro training and test (50%, 50%)
trainingData, testData = neg_rev.unionAll(pos_rev).randomSplit([0.5, 0.5])

In [10]:
#print(neg_rev.count(), pos_rev.count())

In [11]:
#trainingData.show()

In [12]:
#testData.show()

## TFIDF

In [20]:
# Some sample data to use
# to save time ...

sampleData = sqlContext.createDataFrame([
  (0.0, "Hi, I heard about Spark"),
  (0.0, "I couldn't wish Java was any different"),
  (1.0, "Logistic regression models are super boring")], 
                        ["sentiment", "sentences"])
sampleData.show()

+---------+--------------------+
|sentiment|           sentences|
+---------+--------------------+
|      0.0|Hi, I heard about...|
|      0.0|I couldn't wish J...|
|      1.0|Logistic regressi...|
+---------+--------------------+

In [119]:
# Example to get ngrams-range
from itertools import chain
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import col, udf
from pyspark.ml.feature import NGram

class nGramRanger:
    
    """
    
    Wrapper around pyspark.ml.feature.Ngram to call that in a loop and produce a 
    range of ngrams. 
    
    ngramRange = (1,3) creates Ngram(1) and Ngram(2) and Ngram(3)
    
    """
    
    def __init__(self, ngramRange, inputCol, outputCol = "ngrams"):
        # ngramRange = (1,3)
        self.ngramRange = ngramRange
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transform(self, inDf):
        _orig = inDf
        def concat(type):
            def concat_(*args):
                return list(chain(*args))
            return udf(concat_, ArrayType(type))
        
        # Create columns for n-grams
        for ngr in range(self.ngramRange[0],self.ngramRange[-1]+1):
            ngram = NGram(inputCol=self.inputCol , n=ngr, outputCol="%sgram" % ngr)
            inDf = ngram.transform(inDf)        
        # Combine
        concat_string_arrays = concat(StringType())
        ###!!!!! FIx this at some point as 3 grams manually !!!!!
        outDta = _orig.join(inDf.select(concat_string_arrays(col("1gram"),
                                                            col("2gram"),
                                                            col("3gram")).alias(self.outputCol)))
        # Return dataframe
        return outDta

"""
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")
tokenisedData = tokenizer.transform(sampleData)

stopremover = StopWordsRemover(inputCol="words", outputCol="wordsFiltered")
stoppedData = stopremover.transform(tokenisedData)

ngrammer = nGramRanger(ngramRange = (1,3), inputCol = "wordsFiltered", outputCol = "ngrams")
output_test = ngrammer.transform(stoppedData)

for rw in output_test.take(5):
    print(rw)
"""

Row(sentiment=0.0, sentences=u'Hi, I heard about Spark', words=[u'hi,', u'i', u'heard', u'about', u'spark'], wordsFiltered=[u'hi,', u'heard', u'spark'], ngrams=[u'hi,', u'heard', u'spark', u'hi, heard', u'heard spark', u'hi, heard spark'])
Row(sentiment=0.0, sentences=u'Hi, I heard about Spark', words=[u'hi,', u'i', u'heard', u'about', u'spark'], wordsFiltered=[u'hi,', u'heard', u'spark'], ngrams=[u"couldn't", u'wish', u'java', u'different', u"couldn't wish", u'wish java', u'java different', u"couldn't wish java", u'wish java different'])
Row(sentiment=0.0, sentences=u'Hi, I heard about Spark', words=[u'hi,', u'i', u'heard', u'about', u'spark'], wordsFiltered=[u'hi,', u'heard', u'spark'], ngrams=[u'logistic', u'regression', u'models', u'super', u'boring', u'logistic regression', u'regression models', u'models super', u'super boring', u'logistic regression models', u'regression models super', u'models super boring'])
Row(sentiment=0.0, sentences=u"I couldn't wish Java was any different"

In [None]:
# Pipeline for feature selection and classification
# Using https://spark.apache.org/docs/1.5.2/ml-features.html

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover, NGram
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

numfeat = 40000
ngram_range = (1,3)

# 1. Feature-extraction
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")
ngrammer = nGramRanger(ngramRange = ngram_range, inputCol = "words", outputCol = "ngrams")
hashingtf  = HashingTF(inputCol="ngrams", outputCol="rawFeatures", numFeatures=numfeat)
idf = IDF(inputCol="rawFeatures", outputCol="features")

# Train
tokenized_train = tokenizer.transform(trainingData)
ngrammed_train = ngrammer.transform(tokenized_train)
hashed_train = hashingtf.transform(ngrammed_train)
idfModel = idf.fit(hashed_train)
idf_train = idfModel.transform(hashed_train)

# Test
tokenized_test = tokenizer.transform(testData)
ngrammed_test = ngrammer.transform(tokenized_test)
hashed_test = hashingtf.transform(ngrammed_test)
idf_test = idfModel.transform(hashed_test)

# 2. Classifier
classi = LogisticRegression(labelCol="sentiment", featuresCol="features")

# Train
tfidfModel = classi.fit(idf_train)

# Predict
pred = tfidfModel.transform(idf_test)

# 3. Examine
numSuccesses = pred.where("""(prediction = 0 AND label = 'low') OR (prediction = 1 AND label = 'high')""").count()
numInspections = pred.count()
acc = (float(numSuccesses) / float(numInspections)) * 100
print("%.2f success rate" % acc) # 51.41 success rate