## Upload Review Data using AzCopy

Create a batch file with the following contents and run it:

```
cd "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy"
AzCopy /Source:C:\_ilia_share\amazon_prod_reviews_clean\raw /Dest:https://ikcentralusstore.blob.core.windows.net/amazonrev /DestKey:dLR5lH2QN/ejGmyD61nQoh7Cc2DW8jIKhR5n5uvGu8+H3Qem4J0XzWG1/7XtBxmVlWr+y/GNRlwX4Km5YU68sg== /Pattern:"aggressive_dedup.json"
pause
```
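
The same upload can also be scripted. The sketch below only assembles the AzCopy arguments used in the batch file above and reads the storage key from an environment variable instead of hard-coding it. The function name `build_azcopy_command`, the environment variable `AZURE_STORAGE_KEY` and the `AzCopy.exe` file name inside the install directory are assumptions made for illustration.

```python
import os

# Assumed location of the executable inside the directory used by the batch file
AZCOPY_EXE = r"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe"

def build_azcopy_command(source_dir, dest_url, dest_key, pattern):
    """Return the AzCopy argument list matching the batch file (hypothetical helper)."""
    return [
        AZCOPY_EXE,
        "/Source:" + source_dir,
        "/Dest:" + dest_url,
        "/DestKey:" + dest_key,
        "/Pattern:" + pattern,
    ]

cmd = build_azcopy_command(
    source_dir=r"C:\_ilia_share\amazon_prod_reviews_clean\raw",
    dest_url="https://ikcentralusstore.blob.core.windows.net/amazonrev",
    # AZURE_STORAGE_KEY is an assumed variable name; the fallback is a placeholder
    dest_key=os.environ.get("AZURE_STORAGE_KEY", "<storage-key>"),
    pattern="aggressive_dedup.json",
)

# Print the command with the key masked
print([a if not a.startswith("/DestKey:") else "/DestKey:***" for a in cmd])

# To run the upload on a Windows machine with AzCopy installed:
# import subprocess; subprocess.check_call(cmd)
```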

## Load Review Data (from Blob)

In [1]:
# Idea courtesy of Thomas D.
import time
STIME = { "start" : time.time() }

def tic():
    STIME["start"] = time.time()

def toc():
    elapsed = time.time() - STIME["start"]
    print("%.2f seconds elapsed" % elapsed)

Creating SparkContext as 'sc'


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
31,application_1469453428769_0009,pyspark,idle,Link,Link,✔


Creating HiveContext as 'sqlContext'
SparkContext and HiveContext created. Executing user code ...


In [2]:
# paths
blob = "wasb://amazonrev@ikcentralusstore.blob.core.windows.net"
json_dta = blob + "/aggressive_dedup.json"
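
The `wasb://` path has the form `wasb://<container>@<account>.blob.core.windows.net/<path>`. A minimal sketch of a helper that builds such paths follows; `wasb_url` is a hypothetical name, not something the notebook defines.

```python
def wasb_url(container, account, path=""):
    """Build a wasb:// URL for a blob container (hypothetical helper)."""
    base = "wasb://%s@%s.blob.core.windows.net" % (container, account)
    return base + "/" + path.lstrip("/") if path else base

blob = wasb_url("amazonrev", "ikcentralusstore")
json_dta = wasb_url("amazonrev", "ikcentralusstore", "aggressive_dedup.json")
print(json_dta)
```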

In [3]:
# load data
jsonFile = sqlContext.read.json(json_dta)
jsonFile.registerTempTable("reviews")

print(type(jsonFile)) #  <class 'pyspark.sql.dataframe.DataFrame'>
jsonFile.show(5)

# Note: also load the IMDB data at some point
# ...

<class 'pyspark.sql.dataframe.DataFrame'>
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|          reviewerID|   reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|B003UYU16G| [0, 0]|    5.0|It is and does ex...|11 21, 2012|A00000262KYZUE4J5...| Steven N Elich|Does what it's su...|    1353456000|
|B005FYPK9C| [0, 0]|    5.0|I was sketchy at ...| 01 8, 2013|A000008615DZQRRI9...|      mj waldon|           great buy|    1357603200|
|B000VEBG9Y| [0, 0]|    3.0|Very mobile produ...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great product but...|    1395619200|
|B001EJMS6K| [0, 0]|    4.0|Easy to use a mob...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great inexpensive...|    1395619200|
|B003XJCNVO| 
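
The `reviewTime` and `unixReviewTime` columns carry the same date in two encodings. As a small illustration outside Spark, the sketch below parses one record built from the untruncated fields of the first row shown above and converts both encodings to a date. The JSON line is assembled by hand for this example and omits the truncated columns.

```python
import json
from datetime import datetime, timezone

# Hand-built record using only the fields shown in full in the first row
line = ('{"asin": "B003UYU16G", "helpful": [0, 0], "overall": 5.0, '
        '"reviewTime": "11 21, 2012", "unixReviewTime": 1353456000}')
record = json.loads(line)

# Seconds since the epoch, interpreted as UTC
date_from_unix = datetime.fromtimestamp(
    record["unixReviewTime"], tz=timezone.utc).date()

# "MM D, YYYY" style string used by the reviewTime column
date_from_text = datetime.strptime(record["reviewTime"], "%m %d, %Y").date()

print(record["asin"], record["overall"], date_from_unix.isoformat())
```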

## Examine some of the reviews

In [4]:
%%sql 
SELECT overall, reviewText
FROM reviews
LIMIT 10

In [5]:
%%sql 
SELECT overall, COUNT(overall) as freq
FROM reviews
GROUP BY overall
ORDER BY freq DESC

In [6]:
# Create a dataframe of our reviews
# To analyse class imbalance
reviews =  sqlContext.sql("SELECT " + 
                          "CASE WHEN overall < 3 THEN 'low' " +
                          "WHEN overall > 3 THEN 'high' ELSE 'mid' END as label, " + 
                          "reviewText as sentences " + 
                          "FROM reviews")

tally = reviews.groupBy("label").count()
tally.show()

#mid| 7,039,272
#low|10,963,811
#high|64,453,794

+-----+--------+
|label|   count|
+-----+--------+
|  mid| 7039272|
|  low|10963811|
| high|64453794|
+-----+--------+
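
The `CASE WHEN` expression in cell 6 maps a star rating to one of three labels. The sketch below restates that rule in plain Python and uses the counts from the tally above to show each class as a share of all reviews. The function name `rating_to_label` is hypothetical.

```python
def rating_to_label(overall):
    """Same rule as the SQL CASE expression: <3 low, >3 high, otherwise mid."""
    if overall < 3:
        return "low"
    if overall > 3:
        return "high"
    return "mid"

# Counts copied from the tally output above
tally = {"mid": 7039272, "low": 10963811, "high": 64453794}
total = sum(tally.values())

for label, count in sorted(tally.items(), key=lambda kv: -kv[1]):
    print("%-4s %10d  %5.1f%%" % (label, count, 100.0 * count / total))
```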

In [None]:
"""
# Let's look at some reviews to see how clean they are
# there seems to be lots of html formatting
for c,r in enumerate(reviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))
"""

In [None]:
"""
# Some very basic cleaning
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType, DoubleType 
from bs4 import BeautifulSoup

def cleanerHTML(line):
    # html formatting
    html_clean = BeautifulSoup(line, "lxml").get_text().lower()
    # remove any double spaces, line-breaks, etc.
    return " ".join(html_clean.split())

def labelForResults(s):
    # string label to numeric
    if s == 'low':
        return 0.0
    elif s == 'high':
        return 1.0
    else:
        return -1.0
        
cleaner = UserDefinedFunction(cleanerHTML, StringType())
label = UserDefinedFunction(labelForResults, DoubleType())

cleanedReviews = reviews.select(reviews.label,
                                label(reviews.label).alias('sentiment'), 
                                cleaner(reviews.sentences).alias('sentences'))
"""
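
The disabled cell above relies on BeautifulSoup with the `lxml` parser. As a dependency-free illustration of the same steps (strip markup, lower-case, collapse whitespace, map the label to a number), the sketch below uses the standard library's `html.parser` instead of BeautifulSoup, so its output may differ on malformed markup. The names `clean_html`, `_TextExtractor` and `label_to_sentiment` are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text content of a fragment, ignoring tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_html(line):
    # Strip HTML formatting and lower-case the text
    parser = _TextExtractor()
    parser.feed(line)
    parser.close()
    text = "".join(parser.chunks).lower()
    # Remove double spaces, line breaks, etc.
    return " ".join(text.split())

def label_to_sentiment(label):
    # String label to numeric; 'mid' (or anything else) becomes -1.0
    return {"low": 0.0, "high": 1.0}.get(label, -1.0)

cleaned = clean_html("<b>Great</b>   product<br/>\nBuy   it &amp; enjoy")
print(cleaned)
```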

In [None]:
"""
# A bit cleaner ...
for c,r in enumerate(cleanedReviews.take(10)):
    print("%d. %s" % (c+1,r['sentences']))
"""

In [None]:
"""
#cleanedReviews.show()
"""

In [None]:
"""
# Equalise classes 
neg_rev = cleanedReviews.filter("sentiment = 0.0")
pos_rev = cleanedReviews.filter("sentiment = 1.0").limit(neg_rev.count())
"""
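
The cell above balances the classes by keeping all negative reviews and limiting the positives to the same count. Note that `limit` returns the first rows Spark encounters, not a random sample. The plain-Python sketch below shows the same balancing idea on a list of dicts but, as a deliberate variation, draws the rows at random with a fixed seed. `balance_classes` is a hypothetical helper.

```python
import random

def balance_classes(rows, label_key="sentiment", seed=12345):
    """Downsample so both classes have equal size; 'mid' rows (-1.0) are dropped."""
    rng = random.Random(seed)
    neg = [r for r in rows if r[label_key] == 0.0]
    pos = [r for r in rows if r[label_key] == 1.0]
    n = min(len(neg), len(pos))
    return rng.sample(pos, n) + rng.sample(neg, n)

# Toy data: 10 positive, 3 negative, 2 mid
rows = ([{"sentiment": 1.0, "id": i} for i in range(10)] +
        [{"sentiment": 0.0, "id": i} for i in range(3)] +
        [{"sentiment": -1.0, "id": i} for i in range(2)])

balanced = balance_classes(rows)
n_pos = sum(1 for r in balanced if r["sentiment"] == 1.0)
n_neg = sum(1 for r in balanced if r["sentiment"] == 0.0)
print(len(balanced), n_pos, n_neg)
```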

In [None]:
"""
# Save data
allData = pos_rev.unionAll(neg_rev)
print(allData.count()) # 21,927,622 ( = 10,963,811 * 2)

allDataLoc = blob + "/cleaned_equal_classes.json"
allData.write.json(allDataLoc)
"""

## Load Clean Data

In [3]:
allDataLoc = blob + "/cleaned_equal_classes.json"
allData = sqlContext.read.json(allDataLoc)

data_count = allData.count()
print(data_count)

21927622

In [4]:
# Take 1,000,000
sub_sample = 1000000
sub_sample_ratio = float(sub_sample)/float(data_count)

print(sub_sample_ratio)
print(type(allData))

0.00456045803781
<class 'pyspark.sql.dataframe.DataFrame'>

In [5]:
# sub_sample -> sample(boolean withReplacement, double fraction, long seed)
allData = allData.sample(False, sub_sample_ratio, 12345)

# split into training and test (50%, 50%)
trainingData, testData = allData.randomSplit([0.5, 0.5])
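
Both `sample` (without replacement) and `randomSplit` decide row by row at random, so the resulting sizes are only approximately the requested fraction and split. That is consistent with the training and test counts reported further down not being exactly equal. The sketch below imitates this behaviour in plain Python on a list of integers; it is an illustration of the idea, not Spark's implementation, and `bernoulli_sample` and `random_split` are hypothetical names.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def random_split(rows, weights, seed):
    """Assign each row to one part with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for r in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last part also catches any floating-point remainder
            if u < b or i == len(bounds) - 1:
                parts[i].append(r)
                break
    return parts

rows = list(range(200000))
sample = bernoulli_sample(rows, 0.05, seed=12345)
train, test = random_split(sample, [0.5, 0.5], seed=1)

# Sizes are close to, but usually not exactly, 10000 / 5000 / 5000
print(len(sample), len(train), len(test))
```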

In [32]:
trainingDataLoc = blob + "/training_1mill.json"
testDataLoc = blob + "/testing_1mill.json"

In [33]:
# Save
#trainingData.write.mode("overwrite").json(trainingDataLoc)
#testData.write.mode("overwrite").json(testDataLoc)

In [34]:
# Load
trainingData = sqlContext.read.json(trainingDataLoc)
testData = sqlContext.read.json(testDataLoc)

In [35]:
trainingData.cache()
testData.cache()

print(trainingData.count())
print(testData.count())

500349
499826

In [36]:
trainingData.show()

+-----+--------------------+---------+
|label|           sentences|sentiment|
+-----+--------------------+---------+
| high|!i recommend this...|      1.0|
| high|" duty, honor, co...|      1.0|
| high|" ok let first st...|      1.0|
| high|"a deadly justice...|      1.0|
| high|"a dirty job" is ...|      1.0|
| high|"a man of god" is...|      1.0|
| high|"a practical book...|      1.0|
| high|"a widow's story"...|      1.0|
| high|"abraham's burden...|      1.0|
| high|"always said if i...|      1.0|
| high|"american fool" b...|      1.0|
| high|"antsy does time"...|      1.0|
| high|"anyone who leads...|      1.0|
| high|"athlete/warrior"...|      1.0|
| high|"better living th...|      1.0|
| high|"bob cornuke writ...|      1.0|
| high|"changing seasons...|      1.0|
| high|"city of angels: ...|      1.0|
| high|"cleopatra" was t...|      1.0|
| high|"courage" by dais...|      1.0|
+-----+--------------------+---------+
only showing top 20 rows

In [37]:
testData.show()

+-----+--------------------+---------+
|label|           sentences|sentiment|
+-----+--------------------+---------+
| high|" to sheldon and ...|      1.0|
| high|"...those men who...|      1.0|
| high|"a prayer for the...|      1.0|
| high|"a seacat's love"...|      1.0|
| high|"amazing product ...|      1.0|
| high|"aves - the age o...|      1.0|
| high|"butera does it a...|      1.0|
| high|"c'era una volta ...|      1.0|
| high|"circle william" ...|      1.0|
| high|"crush proof" is ...|      1.0|
| high|"die unendliche g...|      1.0|
| high|"don't let fear h...|      1.0|
| high|"du hast" means y...|      1.0|
| high|"feedback" is an ...|      1.0|
| high|"fling" is one of...|      1.0|
| high|"forgiving maximo...|      1.0|
| high|"gangster governm...|      1.0|
| high|"gone girl" just ...|      1.0|
| high|"good parents bad...|      1.0|
| high|"great price with...|      1.0|
+-----+--------------------+---------+
only showing top 20 rows