![header](img/header.png)

In [1]:
import pickle
import pandas as pd

<a name="methodology"> </a>
# Methodology

* Identify the business case and criteria for success. 
* Acquire data and determine if fit for purpose.
* Parse, clean and transform data, enriching with further features. 
* Perform exploratory data analysis - identify trends and outliers.
* Select appropriate modelling techniques and evaluate.
* Present results, challenges, assumptions and next steps. 

# Table of Contents

* [Methodology](#methodology)
* [Executive Summary](#executive)
    * [Context](#context)
    * [Audience](#audience)
    * [Objectives](#objectives)
    * [Success Criteria](#success)
    * [Findings](#findings)
    * [Data Collection & Preprocessing](#datacollection)
        * [Amazon Dataset](#amazon)
        * [Web Scraping](#web)
    * [Data Cleaning](#datacleaning)
    * [Feature Engineering](#feature)
    * [Exploratory Data Analysis (EDA)](#eda)
    * [Model Selection & Evaluation](#modelselection)
* [Results ](#results)
    * [Multi Classification](#multiclass)
        * [Single Feature - Review Body](#reviewbody)
        * [Multiple Features](#multiplefeatures)
    * [Binary Classification](#binary)
        * [Resampling](#resampling)
        * [Misclassified results](#misclassified)
* [Challenges](#challenges)
* [Risks](#risks)
* [Assumptions](#assumptions)
* [Key Takeaways](#takeaways)
* [Next Steps](#nextsteps)

<a name="executive"> </a>
# Executive summary

<img src="img/df_us_reviewcounttime.png" width="500"/>
<a name="context"> </a>

### Context

Amazon is the world's largest online marketplace, surpassing Walmart in market value and briefly attaining a $1 trillion valuation in July 2019. Its growth has mirrored a global trend over the last decade of increasing online purchase activity and spend from consumers worldwide. In this changing landscape, the role of product reviews in informing purchase decisions has become ever more important.

Reviews provide prospective customers with real-world usage experiences and serve to validate marketing claims to judge fit for use. In an ideal world they would provide an unbiased opinion without brand affiliation. Longer term usage reports can also attest to durability while media launch reviews often focus on specifications with short turnarounds.

The importance of review ratings is striking when observing the consumer buying process online. It is common to filter out low-rated products, and algorithms also favour highly ranked products in search impressions and recommendation engines. This unfortunately raises the incentive to game the system and buy fake reviews. The need for early reviews is large enough that Amazon itself allows sellers to buy reviews legally from trusted reviewers through its Vine programme, providing those first few reviews to get a product off the ground. 

This makes understanding reviewer behaviour important for any brand. By aggregating insights over thousands of reviews, the seller can gain a macro-level understanding of the response to their product. This in turn can inform marketing communications and the product development pipeline. For digital cameras the effect can be more immediate: there is even an opportunity to resolve issues or add competitive features through firmware updates.

Another use case for analysing reviews is to perform competitor research on a competing product or across the whole digital camera market to identify unfulfilled needs from consumers.

<a name="audience"> </a>
### Audience 

Product developers and marketeers seeking to understand the digital camera market through reviewer behaviour and the identification of unmet needs. 

<a name="objectives"> </a>
### Objectives 

The aim of this project is to use the Amazon review dataset to predict the star rating on camera reviews. NLP will be performed on the review text before using a machine learning classifier to "learn" the difference between positive and negative reviews. 

This will be done both on a review scale of 1-5 and also a high/low binary classification. 

A secondary objective is to identify camera features that lead to good or bad experiences through popular words mentioned in reviews.

The predictive features used are:
- Helpful ratio (helpful votes / total votes on the review)
- Verified purchase (Y/N)
- Review body (text)
- Review length 
- Camera brand
- VADER sentiment scores (performed on the review body text)

<a name="success"> </a>
### Success Criteria

- Model accuracy above baseline score
- Model comparison across Precision/Recall/F1 scores 
- Confusion Matrix analysis

<a name="findings"> </a>
### Findings

1) Features that attract high ratings: low light, image quality, full frame, auto focus and high ISO.

2) Problems to resolve in negative reviews: memory card, customer service, camera body, image quality and high ISO.

3) The best performing binary classification model was a logistic regression using the Tomek links resampling technique. 

|           | Pred Low | Pred High |
|-----------|----------|-----------|
| True Low  | 77       | 29        |
| True High | 25       | 1081      |

4) Misclassified reviews were often due to experiences that were mixed or that changed over time. Sometimes the camera itself was satisfactory but another element of the buying experience was not. Adding context-specific keywords to the model may improve it, e.g. "hot pixels" or "noise" in an image, both of which signal a problem. 
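
As a rough check on the confusion matrix in finding 3, the success-criteria metrics can be recomputed directly from its four cells. The sketch below assumes "Low" is the class of interest and that always predicting the majority class serves as the baseline; both are assumptions for illustration, not necessarily how the notebook computed its scores.

```python
# Cells of the confusion matrix reported in finding 3.
true_low_pred_low = 77
true_low_pred_high = 29
true_high_pred_low = 25
true_high_pred_high = 1081

total = (true_low_pred_low + true_low_pred_high
         + true_high_pred_low + true_high_pred_high)

# Accuracy: share of reviews on the diagonal.
accuracy = (true_low_pred_low + true_high_pred_high) / total

# Assumed baseline: always predict the majority class (High).
baseline = (true_high_pred_low + true_high_pred_high) / total

# Precision, recall and F1 for the minority "Low" class.
precision_low = true_low_pred_low / (true_low_pred_low + true_high_pred_low)
recall_low = true_low_pred_low / (true_low_pred_low + true_low_pred_high)
f1_low = 2 * precision_low * recall_low / (precision_low + recall_low)

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"low class: precision={precision_low:.3f} "
      f"recall={recall_low:.3f} f1={f1_low:.3f}")
```

On these figures the accuracy (about 0.955) sits above the majority-class baseline (about 0.913), while the scores for the Low class are noticeably weaker, which is why accuracy alone is not a sufficient success criterion on imbalanced data.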


<a name="datacollection"> </a>
## Data Collection

<a name="amazon"> </a>
### Amazon dataset
https://registry.opendata.aws/amazon-reviews/ 

The datasets contain the customer review text with accompanying metadata. 

1. A collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews). I downloaded a subset of this dataset for the camera category.


2. Originally from a collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries). I downloaded a UK dataset composed of all product categories.

**The anatomy of a review** 

Other features not visible in the review itself are taken from the product section, such as product_id, marketplace, product title and category.
<img src="img/reviewanatomy.png" width="500"/>


**DATA DICTIONARY:**

* marketplace       - 2 letter country code of the marketplace where the review was written.
* customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
* review_id         - The unique ID of the review.
* product_id        - The unique Product ID the review pertains to. Also known as the Amazon ASIN product code.
* product_parent    - Random identifier that can be used to aggregate reviews for the same product.
* product_title     - Title of the product.
* product_category  - Broad product category that can be used to group reviews.
* star_rating       - 1-5 star rating of the review.
* helpful_votes     - Number of helpful votes.
* total_votes       - Number of total votes the review received.
* vine              - Review was written as part of the Vine verified reviewer program.
* verified_purchase - The review is on a verified purchase.
* review_headline   - The title of the review.
* review_body       - The review text.
* review_date       - The date the review was written.

**DATA FORMAT**

Tab ('\t') separated text file, without quote or escape characters.
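
Because the files are tab-separated with no quote or escape characters, quote handling can be switched off when loading them. A minimal sketch with pandas, assuming a recent pandas version (`on_bad_lines` replaced `error_bad_lines` from pandas 1.3 onwards); the two sample rows are made up for illustration.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the documented layout (only three columns shown).
sample = (
    "review_id\tstar_rating\treview_body\n"
    'R1\t5\tGreat "full frame" camera\n'
    "R2\t1\tArrived broken\n"
)

df = pd.read_csv(
    io.StringIO(sample),      # a file path works the same way
    sep="\t",
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary text
    on_bad_lines="skip",      # skip malformed rows instead of raising
)
print(df)
```

Disabling quoting matters for free-text fields such as review_body, where a stray quotation mark could otherwise be read as the start of a quoted field.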

#### Preprocessing and checking fit for use

In [3]:
#Download datasets into local folder
#!curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz > amazon_reviews.tsv.gz
#!curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Camera_v1_00.tsv.gz > amazon_reviews_us_.tsv.gz

In [1]:
#Amazon US and UK Dataset

#error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' is its replacement
#df_uk = pd.read_csv("amazon_review_datasets.tsv/amazon_reviews.tsv", delimiter='\t', on_bad_lines='skip')
#df_us = pd.read_csv("amazon_review_datasets/amazon_reviews_us.tsv", delimiter='\t', on_bad_lines='skip')

The UK dataset covers all product categories, so I filtered it on the camera product category. 

The US dataset was downloaded as camera category only, so this step was not necessary.

In [16]:
df_uk = df_uk[df_uk['product_category'] == "Camera"]

* UK Dataset:     6427 reviews
* US Dataset:     1800845 reviews

Accessories such as lenses, filters, cables and stands were also included in the camera category. This was an issue 
as I wanted reviews on cameras only.

I filtered further using a regular expression for product titles containing "body only",
a common term for a camera sold without any lenses.


In [None]:
body_only_us = df_us[df_us['product_title'].str.contains('[Bb]ody [Oo]nly', regex=True)]

In [6]:
#Example reviews after filtering on cameras

with open('Pickles/df_us_cameras.pkl', 'rb') as f:
    df_us_cameras = pickle.load(f)
df_us_cameras.head(5)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
262,US,30739283,R190J2PDOZ5GVK,B00ZDWGFR2,390090468,Sony a7R II Full-Frame Mirrorless Interchangea...,Camera,3,36,51,N,Y,"Great camera, but there are shortcomings","Great camera, but there are shortcomings : -4...",2015-08-31
917,US,15760475,R3SGZ5G1GJAWVU,B00TSR7YPK,515216474,Nikon D750 DSLR Camera (Body Only) + 32GB Extr...,Camera,5,1,1,N,Y,"This camera is amazing, I have a d610 that I r...","This camera is amazing, I have a d610 that I r...",2015-08-31
1282,US,10861723,R3BWM499VCMGS7,B00ZDWGFR2,390090468,Sony a7R II Full-Frame Mirrorless Interchangea...,Camera,5,48,52,N,Y,Absolutly Must have camera,"Im a Canon guy, but as soon as I found out thi...",2015-08-31
3287,US,22343417,R3SMKIWNMR55UB,B00O29LKN6,474362814,Canon EOS 7D Mark II Digital SLR Camera (Body ...,Camera,5,0,0,N,Y,Canon Digital,I read several review's before my purchase. T...,2015-08-29
4855,US,106593,RNBM8M0T11BV0,B009F2OUOQ,968361935,Olympus OM-G OMG OM-20 Manual Focus Film Camer...,Camera,1,3,3,N,Y,This is only for the camera and strap. There ...,This is only for the camera and strap. There i...,2015-08-29


The UK dataset was not viable at this point, as the filter left too few reviews. 
I continued with the US dataset, with **4260** reviews remaining. 

In [7]:
#This method of filtering successfully narrowed the data down to cameras only

with open('Pickles/cameras_list.pkl', 'rb') as f:
    cameras = pickle.load(f)
cameras

array(['Sony a7R II Full-Frame Mirrorless Interchangeable Lens Camera, Body Only (Black) (ILCE7RM2/B)',
       'Nikon D750 DSLR Camera (Body Only) + 32GB Extreme Pro Memory Card + Extra Battery and Charger',
       'Sony a7R II Full-Frame Mirrorless Interchangeable Lens Camera, Body Only (Black) (ILCE7RM2/B)',
       'Canon EOS 7D Mark II Digital SLR Camera (Body Only) with Canon Battery Grip BG-E16 and 64GB Deluxe Accessory Kit',
       'Olympus OM-G OMG OM-20 Manual Focus Film Camera; Body Only',
       'Sony a7R II Full-Frame Mirrorless Interchangeable Lens Camera, Body Only (Black) (ILCE7RM2/B)',
       'Sigma SD14 14MP Digital SLR Camera (Body Only)',
       'Nikon D3200 24.2 MP Digital SLR Camera (Body Only) - International Version (No Warranty) (Black, Open Box)',
       'Nikon D3300 24.2 MP CMOS Digital SLR Body Only (Grey) - International Version (No Warranty)',
       'Minolta SRT-101 35mm SLR film camera body only; lens is not included.'],
      dtype=object)

Class imbalance in the scraped data?

Insert here to show

<a name="web"> </a>
### Web scraping reviews 

I created two custom web crawlers to extract further products and reviews. 
The initial aim was to add products released after 2015 to my dataset.

The first setback was Beautiful Soup triggering Amazon's bot detection system. I circumvented this by using the Selenium package, which drives a real browser window and so behaves more like a human visitor.

The second setback was that the total_votes count is no longer shown on reviews. It has been replaced with "X people found this helpful", so I was unable to recreate the same features. However, comments are now tracked, so the number of follow-up comments on a review is available.

The additional features I was able to add were number of comments and product price. Subcategory was also available because products were scraped separately.

Because of the difference in available features, I decided to run my models on my initial dataset before considering adding further scraped data.

**Please see full jupyter notebook code for the scraping functions.**

**Product Scraper**

<img src="img/product_scraper.png" width="600"/>

I created a data structure for the Python Web Crawler to capture a range of features from the HTML web page. 

The user enters a base category URL and the number of pages to click through. The crawler then records the following for every product on each page:
- asin (product_id)
- url
- product_name
- review_count
- avg_rating
- price
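
One way to picture the per-product record is as a small dataclass. This is an illustrative sketch only: the `ProductRecord` name and the field types are assumptions, and the scraper in the full notebook may store the data differently. The example values are taken from the Nikon D750 row of the scraped DSLR sample, with the URL left out.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """One scraped product row (hypothetical structure)."""
    asin: str                 # product_id
    url: Optional[str]        # product page URL, None if not captured
    product_name: str
    review_count: int
    avg_rating: float
    price: Optional[float]    # None when no price is listed


record = ProductRecord(
    asin="B0060MVJ1Q",
    url=None,  # URL omitted in this example
    product_name="Nikon D750 FX-format Digital SLR Camera Body",
    review_count=616,
    avg_rating=4.5,
    price=1496.95,
)

# asdict() gives a plain dict, convenient for building a DataFrame or CSV row.
print(asdict(record))
```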

As Amazon now has further subcategories within the camera category, I scraped datasets using the base category URL for:
- DSLR
- Action Cameras
- Mirrorless
- Point and Shoot

In [8]:
#Example of scraped product data
DSLR = pd.read_csv("Scraped_data/DSLR.csv") 
DSLR.head(10)

Unnamed: 0,ASIN,URL,Product,Reviews,Rating,Price
0,B0060MVJ1Q,https://www.amazon.com/Nikon-D750-FX-format-Di...,Nikon D750 FX-format Digital SLR Camera Body,616,4.5,1496.95
1,B0060MVLXC,https://www.amazon.com/Nikon-FX-format-Digital...,Nikon D750 FX-format Digital SLR Camera w/ 24-...,616,4.5,1996.95
2,B00NEWZDRG,https://www.amazon.com/Canon-Mark-Digital-Came...,Canon EOS 7D Mark II Digital SLR Camera (Body ...,365,4.6,1024.99
3,B00T3ER7QO,https://www.amazon.com/Canon-Rebel-Digital-EF-...,Canon EOS Rebel T6i Digital SLR with EF-S 18-5...,204,4.6,583.99
4,B0101SRIKU,https://www.amazon.com/Canon-Creator-EF-M15-45...,Canon EOS M50 Video Creator Kit with EF-M15-45...,150,4.2,699.0
5,B01A7Q0J3Y,https://www.amazon.com/Nikon-D500-DX-Format-Di...,Nikon D500 DX-Format Digital SLR (Body Only),208,4.7,1496.95
6,B01BUYK04A,https://www.amazon.com/Canon-Digital-Camera-Me...,Canon Digital SLR Camera Body [EOS 80D] with 2...,214,4.4,999.0
7,B01D93Z89W,https://www.amazon.com/Canon-T6-Digital-Teleph...,Canon EOS Rebel T6 Digital SLR Camera with 18-...,1282,4.6,449.95
8,B01KURGS9E,https://www.amazon.com/Canon-Mark-Frame-Digita...,Canon EOS 5D Mark IV Full Frame Digital SLR Ca...,187,4.3,2799.0
9,B01M586Y9R,https://www.amazon.com/Sony-Alpha-Mirrorless-D...,Sony Alpha a6500 Mirrorless Digital Camera w/ ...,158,4.5,1198.0


**Review Scraper**

<img src="img/review_scraper.png" width="600"/>

The review scraper that I created requires two input variables:
- asin (product_id)
- review_pages (number of pages of reviews to iterate through)

I could therefore iterate through the product IDs of the scraped products above to capture:

    - user profile_name 
    - user_url_id 
    - review_date 
    - star_rating 
    - review_title 
    - helpful_votes 
    - total_comments 
    - verified_purchase 
    - review_body 
    - review_id 

In [9]:
scraped_reviews = pd.read_csv("Scraped_data/reviews30.csv") 
scraped_reviews.head(5)

Unnamed: 0,ASIN,review_id,profile_name,user_url_id,review_date,star_rating,review_title,helpful_votes,total_comments,verified_purchase,review_body
0,B0060MVJ1Q,R10H6Z68UKZ4WI,Amazon Customer,/gp/profile/amzn1.account.AGCIJLD2WJDGYHAHQ7M4...,"January 15, 2016",5,Professional camera features at a semi-pro cost,38,0,Verified Purchase,"Oh my gosh, this camera. I'm coming from using..."
1,B0060MVJ1Q,R113WZQUPLYWT,Saileswar Mohanty,/gp/profile/amzn1.account.AHP2WQU4JYD7VMHRNXNJ...,"October 8, 2018",1,The D750 sent by Amazon was defective,13,0,Verified Purchase,The camera which i received was having major F...
2,B0060MVJ1Q,R12HPXBWA21J68,BostwickBooks,/gp/profile/amzn1.account.AEI55QAOVXVYBNOKBCSN...,"April 25, 2015",5,Well worth the investment and the wait!,26,1,Verified Purchase,I have owned a number of SLR and D-SLR cameras...
3,B0060MVJ1Q,R12SETF3JC8K19,JJ,/gp/profile/amzn1.account.AFLFZE2EANDOCIBAVBMQ...,"February 10, 2017",5,GODLY high ISO noise-level performance!,16,0,Verified Purchase,This camera is a BEAST! I've been renting it f...
4,B0060MVJ1Q,R1AJESWNGQYMK0,Randy B,/gp/profile/amzn1.account.AEXWXUXVB6JP7TCBEORN...,"January 1, 2019",5,Camera is fun to operate,3,0,Verified Purchase,Just did a 2019 calendar and out of the thirte...
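
The scraped fields are formatted differently from the Amazon dataset: dates read "January 15, 2016" and verified purchases are flagged with the text "Verified Purchase" rather than Y/N. If the two sources were ever combined, a normalisation step along these lines would be needed. This is a sketch under assumptions: the helper names are hypothetical, and parsing month names with `%B` assumes an English locale.

```python
from datetime import date, datetime


def parse_review_date(text: str) -> date:
    """Convert a scraped date such as 'January 15, 2016' to a date object."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


def verified_flag(text: str) -> str:
    """Map the scraped label to the Y/N coding used in the Amazon dataset."""
    return "Y" if text.strip() == "Verified Purchase" else "N"


print(parse_review_date("January 15, 2016"))
print(verified_flag("Verified Purchase"))
```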


Class imbalance in the scraped data?

Insert here to show

<a name="datacleaning"> </a>
## Data Cleaning

**Removing unnecessary features**
- Marketplace - all belong to the US marketplace
- Product Category - all within camera category
- Vine - all values were 'No'

**Missing values**
- None

**Duplicated reviews**
- No duplicated reviews

**Data types**
- Adjusting review_date to datetime64


| Feature | Data Type|
| --- | --- |
| customer_id | int64|
| review_id | object|
| product_id | object|
| product_parent  | int64|
| product_title  | object|
| star_rating   | int64|
| helpful_votes | int64|
| total_votes | int64|
| verified_purchase | object|
| review_headline | object|
| review_body | object|
| review_date | datetime64[ns]|


**Unique categorical variable counts**
- 4260 unique reviews
- 405 total unique products
- 2701 verified purchases / 1559 non-verified

**Checking for outliers**
- star_rating ranges from 1-5 as expected, with a mean rating of 4.43 stars and a median of 5. This is very high!
- total_votes has a mean of 12 and a median of 3. The maximum count for one review was 1357 votes, which skews the mean upwards.
- helpful_votes has a mean of 8.9 per review and a median of 2. The maximum for one review was 1310. (This turned out to be a very high quality review, pasted below, so the value is expected.)

**Reset index**
- After this initial cleaning process was completed I reset the index.
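
The cleaning steps above could be expressed in pandas roughly as follows. A minimal sketch, assuming a DataFrame with the columns from the data dictionary; the `clean_reviews` helper and the tiny example frame are hypothetical.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Drop constant columns, de-duplicate and fix dtypes (illustrative)."""
    out = df.drop(columns=["marketplace", "product_category", "vine"])
    out = out.drop_duplicates(subset="review_id")
    out["review_date"] = pd.to_datetime(out["review_date"])
    return out.reset_index(drop=True)


example = pd.DataFrame({
    "marketplace": ["US", "US"],
    "product_category": ["Camera", "Camera"],
    "vine": ["N", "N"],
    "review_id": ["R1", "R2"],
    "star_rating": [5, 1],
    "review_date": ["2015-08-31", "2015-08-29"],
})

cleaned = clean_reviews(example)
print(cleaned.dtypes)
print(cleaned.isna().sum())   # missing values per column
```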


**Review with the most helpful votes (1310)**

['Does the 7D beat full frame cameras?']
["No, but it's so good that one starts to contemplate this question, which was never the case before the 7D was introduced. Both systems, crop and full frame, have their pros and cons and place in photography. But before I get into that let me say I have not been as excited about a camera since the introduction of the 5D MK I four years ago. That's because the 7D raises the crop camera bar to the point where crop users will not feel at a disadvantage to full frame camera users, especially if coupled with awesome ef-s lenses such as the 17-55 f2.8.<br /><br />How so? The 7D sets a new standard in four major ways.<br /><br />1. It produces whopping 18MP pictures, which are just 3MP shy of the current top of the line full frame Canon cameras. Just few years ago most pros were producing stellar results using the 1Ds MKII 16MP camera. Now you have more MPs in a crop sensor, that's a major achievement. This achievement translates into bigger prints and, perhaps more importantly, cropping power. Out shooting wildlife with a 300mm instead of 400mm? You can crop the 7D files down to 50% of their original file size and still obtain sharp pictures. It's just not that easy with the 1D MK III 10MP files.<br /><br />2. Many worried that extra MPs in small crop sensors would translate into nosier pictures, but the amazing thing is that this camera produces images with what seems to be less noise than the 1Ds MKII. The noise level is very good. At ISO 1600 I still prefer pictures coming from my 5D MKII, but below ISO1600 they are very close. Frankly, I can go with either camera because most of my professionally shot portraits and product pictures are shot at ISO100. At ISO100 both produce very clean files and are practically indistinguishable.<br /><br />3. Focus is the one area that was lacking on the previous 1.6 crop Canon cameras and this camera changes that. It's not a 1D in focus speed and accuracy, but it's the next best thing compared to them. 
It's faster than the Canon 5D MKII, which is known to be slightly faster or around the focus performance range of the 50D and 40D.<br /><br />4. The drive chain is fast, so fast it's beyond anything I needed in my professional work in portrait, commercial, and product photography. Going through pictures taken at 8fps produces very little difference from frame to frame. One probably has to shoot a very fast moving subject/object to see the advantage of such fast drive system.<br /><br />There are obviously many other things that I have not covered in this review. But based on the above, all I can say is that this camera has really raised the bar for all cameras and made it much more affordable to obtain a professional level camera for all types of photography. If you were considering buying the 5D MKII as an upgrade give this camera a test because it might be all you need.<br /><br />As for the advantages of crop cameras I always find it odd that casual users who shoot many things but focus on landscape think they need a full frame to realize their potential. Crop cameras such as the 7D and 50D are fine for most users and offer many advantages including:<br /><br />1. greater depth of field at lower aperture for landscape photography<br /><br />2. greater tilt and shift effect because of sensor size relative to effect (8mm in shift is greater in effect relative to a 22mm sensor compared to a 35mm sensor)<br /><br />3. greater magnification with micro lenses and extension tubes because of smaller sensor (1:1 in full frame equals 35mm, 1:1 in crop equals 22mm)<br /><br />4. 
smaller lighter lenses with wider aperture that achieve greater reach (such as the 17-55 2.8 vs the 24-70 2.8 similar reach but much lighter and smaller)<br /><br />Traditionally the three areas full frame cameras outshine crop cameras are a bigger brighter viewfinder, shallower depth of field for portrait photography, and better ISO performance, which on the last point the 7D has proven not be an issue anymore.<br />And for the second point really, most beautiful low depth of field portraits are done around f2.8-2.0 in full frame (going wider will make depth of field too narrow to place two eyes in focus). Hence, if one is using a wide prime, a crop sensor will produce the same depth of field at 2.0-1.4. Considering an affordable 50mm f1.4 lens on crop has the same field of view as 85mm lens on full frame there is really no reason to discount a crop camera any more as the 7D levels the playing field."]

<a name="feature"> </a>
## Feature Engineering

### Helpful_ratio - ratio of helpful votes to total votes.

In [None]:
# To determine which reviews were of higher quality and use to customers.
df['helpful_ratio'] = df['helpful_votes'] / df['total_votes']

In [None]:
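#Data imputing: fill NaN ratios (reviews with 0 total votes) with the mean ratio
#Assigning back avoids relying on inplace=True through attribute access
df['helpful_ratio'] = df['helpful_ratio'].fillna(df['helpful_ratio'].mean())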

838 observations had no total votes, producing NaN values from the attempted division by 0. I decided to fill these missing values with the mean helpfulness ratio.
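
To illustrate the imputation: in pandas, dividing 0 by 0 yields NaN rather than raising an error, and those NaN values can then be filled with the column mean. A small sketch with made-up vote counts:

```python
import pandas as pd

# Made-up vote counts; the last review has no votes at all.
votes = pd.DataFrame({
    "helpful_votes": [36, 1, 0],
    "total_votes": [51, 1, 0],
})

votes["helpful_ratio"] = votes["helpful_votes"] / votes["total_votes"]
n_missing = int(votes["helpful_ratio"].isna().sum())   # reviews with 0/0

# mean() ignores NaN, so the fill value is based on voted reviews only.
mean_ratio = votes["helpful_ratio"].mean()
votes["helpful_ratio"] = votes["helpful_ratio"].fillna(mean_ratio)
print(n_missing)
print(votes)
```

One side effect is that every review without votes ends up sharing the same ratio (about 0.70 in this dataset), which is worth keeping in mind when reading the helpfulness plots.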

### Review_length - word count of the review body

In [None]:
# length of review in words. To determine if longer reviews are more helpful or predict higher ratings
df['review_length'] = df['review_body'].apply(lambda x: len(str(x).split(' ')))
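
Note that `split(' ')` splits on single spaces only, so consecutive spaces produce empty strings that are counted as words, whereas `split()` with no argument splits on any run of whitespace. The comparison below only illustrates that difference on a made-up sentence; the statistics reported in this notebook come from the `split(' ')` version.

```python
text = "Great camera.  Fast focus"   # note the double space

count_single_space = len(text.split(' '))   # includes the empty string between the two spaces
count_whitespace = len(text.split())        # splits on any run of whitespace

print(count_single_space, count_whitespace)
```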

###  Camera_brand -  extracted from the product_title. 

In [32]:
# Checking value counts, initial cleaning and grouping for misspellings and errors
replace_dict = {'silicon':'silicon valley',
                'chrome':'nikon',
                'pentaxistdl':'pentax',
                'pentaxistds2':'pentax',
                'refurbished':'minolta',
               'portable,':'canon',
               '181':'canon',
               '645':'contax',
               'brand':'nikon'}

df['camera_brand'] = df.product_title.map(lambda x: x.split()[0].lower()).replace(replace_dict)

In [10]:
#Final result of camera brand value counts after further cleaning

with open('Pickles/brand_list.pkl', 'rb') as f:
    brand_list = pickle.load(f)
brand_list

canon             1330
nikon             1072
sony               465
olympus            320
pentax             312
panasonic          170
vivitar            162
vangoddy           113
minolta             84
fujifilm            51
leica               50
sigma               36
silicon valley      30
samsung             28
hasselblad          12
polaroid             5
kodak                3
ricoh                2
vixen                1
phoenix              1
lytro                1
dslrpros             1
toyo                 1
contax               1
mamiya               1
Name: camera_brand, dtype: int64

###  Review Sentiment VADER scores

Each review was assigned the features:
* vader_compound, vader_neg, vader_neu, vader_pos

Where:
* positive sentiment: compound score >= 0.05
* neutral sentiment: compound score between -0.05 and 0.05
* negative sentiment: compound score <= -0.05
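
The thresholds above translate into a small labelling function. A minimal sketch: `sentiment_label` is a hypothetical helper name, and the compound score is assumed to come from the `polarity_scores` method of vaderSentiment's `SentimentIntensityAnalyzer`, whose result dict has the keys `neg`, `neu`, `pos` and `compound`.

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER compound score to positive / neutral / negative."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


# Example compound scores, one per sentiment band.
for score in (0.8964, 0.0, -0.6808):
    print(score, sentiment_label(score))
```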

from https://github.com/cjhutto/vaderSentiment#about-the-scoring

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate. 

The pos, neu, and neg scores are ratios for proportions of text that fall in each category (so these should all add up to be 1... or close to it with float operation). These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.


###  DataFrame output with new features

In [11]:
with open('Pickles/sentiment_df.pkl', 'rb') as f:
    sentiment_df = pickle.load(f)
sentiment_df.head(5)

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,verified_purchase,review_date,review_headline,review_body,helpful_ratio,review_length,camera_brand,vader_compound,vader_neg,vader_neu,vader_pos
0,30739283,R190J2PDOZ5GVK,B00ZDWGFR2,390090468,Sony a7R II Full-Frame Mirrorless Interchangea...,3,36,51,Y,2015-08-31,"Great camera, but there are shortcomings","Great camera, but there are shortcomings : -4...",0.705882,199,sony,0.8964,0.103,0.768,0.129
1,15760475,R3SGZ5G1GJAWVU,B00TSR7YPK,515216474,Nikon D750 DSLR Camera (Body Only) + 32GB Extr...,5,1,1,Y,2015-08-31,"This camera is amazing, I have a d610 that I r...","This camera is amazing, I have a d610 that I r...",1.0,21,nikon,0.51,0.0,0.788,0.212
2,10861723,R3BWM499VCMGS7,B00ZDWGFR2,390090468,Sony a7R II Full-Frame Mirrorless Interchangea...,5,48,52,Y,2015-08-31,Absolutly Must have camera,"Im a Canon guy, but as soon as I found out thi...",0.923077,212,sony,0.9967,0.022,0.707,0.271
3,22343417,R3SMKIWNMR55UB,B00O29LKN6,474362814,Canon EOS 7D Mark II Digital SLR Camera (Body ...,5,0,0,Y,2015-08-29,Canon Digital,I read several review's before my purchase. T...,0.700245,14,canon,0.0,0.0,1.0,0.0
4,106593,RNBM8M0T11BV0,B009F2OUOQ,968361935,Olympus OM-G OMG OM-20 Manual Focus Film Camer...,1,3,3,Y,2015-08-29,This is only for the camera and strap. There ...,This is only for the camera and strap. There i...,1.0,33,olympus,-0.6808,0.185,0.815,0.0


<a name="eda"> </a>
## Exploratory Data Analysis (EDA)

### Time period Trends

Monthly average review ratings drop on occasion, in 2000, 2004 and 2010. Over the whole period the average rating remains high, between 4 and 5 stars.
<img src="img/time.png" width="600"/>

Monthly review count grows over time for cameras. This matches the trend for the whole dataset.
<img src="img/reviewcounttime.png" width="600"/>

### Product insights

There are 397 unique products in the dataset, and the majority of reviews are 5 star.
<img src="img/ratings.png" width="600"/>

Canon, Nikon and Sony had the most reviews attributed to them. This is expected, as they hold the most market share and release the most cameras.

<img src="img/brands.png" width="600"/>

Average review rating by brand. The range of values is not large, as reviews across the whole dataset are skewed towards high ratings. A few low-rated brands stand out, such as Polaroid.

In [12]:
with open('Pickles/brand_ratings.pkl', 'rb') as f:
    brand_ratings = pickle.load(f)
brand_ratings

Unnamed: 0_level_0,star_rating
camera_brand,Unnamed: 1_level_1
vixen,5.0
lytro,5.0
dslrpros,5.0
toyo,5.0
contax,5.0
mamiya,5.0
pentax,4.647436
panasonic,4.594118
olympus,4.559375
nikon,4.545709


**Top 3 most reviewed cameras**

- 609 ['Canon EOS 7D 18 MP CMOS Digital SLR Camera Body Only discontinued by manufacturer']
- 207 ['Nikon D200 10.2MP Digital SLR Camera (Body Only)']
- 152 ['Olympus 16MP Mirrorless Digital Camera with 3-Inch LCD - Body Only']

### Review helpfulness scores

Higher rated products tended to attract more helpful reviews
<img src="img/helpfulbyrating.png" width="600"/>

Helpful review ratio by brand. Recall that reviews with 0 votes had the mean ratio (0.7) imputed, which affects the interpretability of this graph.

<img src="img/brandhelpfulratio.png" width="600"/>

### Review length

The median review is 85 words long. One extremely long review occurs, at 8542 words.

| Statistic | Value|
| --- | --- |
| count | 4252.000000|
| mean | 190.542803|
| std | 343.538724|
| min  | 1.000000|
| 25%  | 32.000000|
| 50%   | 85.000000|
| 75% | 218.250000|
| max | 8542.000000|      

Review length by rating: reviews appear to be slightly shorter for low-rated products.
<img src="img/reviewlengthratings.png" width="600"/>

Do people write more for certain brands? Apart from outliers, review length is fairly consistent across Sony, Nikon and Canon, which have many reviews.
<img src="img/reviewlengthavgbrand.png" width="600"/>

In [13]:
#Length of most helpful reviews
with open('Pickles/most_helpful_length.pkl', 'rb') as f:
    most_helpful_length = pickle.load(f)
most_helpful_length

Unnamed: 0,product_title,star_rating,helpful_votes,total_votes,helpful_ratio,review_length
3946,Canon EOS 20D DSLR Camera (Body Only) (OLD MODEL),5,96,96,1.0,1507
2546,Nikon D4 16.2 MP CMOS FX Digital SLR with Full...,5,69,69,1.0,579
3506,Nikon D40 6.1MP Digital SLR Camera (Body Only),5,58,58,1.0,1800
4228,Canon EOS Elan IIe Date 35mm SLR Camera (Body ...,5,47,47,1.0,301
3104,Canon EOS Rebel XS 10.1-Megapixel Digital SLR ...,5,35,35,1.0,147


Helpfulness of reviews under 40 words is lower
<img src="img/reviewlengthhelpfulness.png" width="600"/>


### Verified purchases

A large proportion of reviews come from unverified purchases. This could pose a problem for legitimacy. 
<img src="img/verified.png" width="600"/>

### Text analysis

Popular words in review headlines
<img src="img/headline_wordcloud.png" width="600"/>

In [4]:
#Most frequent bigrams in 1 star reviews
with open('Pickles/freq_bigrams_1star.pkl', 'rb') as f:
    freq_bigrams_1star = pickle.load(f)
freq_bigrams_1star

memory card         22
customer service    16
camera body         15
image quality       13
high iso            13
fn iii              13
used camera         12
picture quality     12
hot pixels          12
low light           11
piece junk           9
brand new            9
buy camera           9
auto focus           9
purchased camera     9
take pictures        8
camera one           8
camera work          8
digital camera       8
shutter speed        8
could get            8
take picture         8
camera would         8
received camera      7
white balance        7
sd card              7
got camera           7
use camera           7
low iso              7
camera case          7
dtype: int64

In [5]:
#Most frequent bigrams in 5 star reviews
with open('Pickles/freq_bigrams_5star.pkl', 'rb') as f:
    freq_bigrams_5star = pickle.load(f)
freq_bigrams_5star

low light           457
image quality       416
full frame          331
great camera        328
point shoot         244
auto focus          221
high iso            218
love camera         192
live view           188
kit lens            182
easy use            174
ed af               163
camera body         158
picture quality     154
digital camera      153
much better         152
mark ii             147
white balance       137
dynamic range       136
digital slr         135
battery life        134
shutter speed       133
camera great        123
af dx               120
use camera          120
highly recommend    120
canon eos           119
per second          114
build quality       113
depth field         109
dtype: int64
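
A minimal sketch of how bigram counts like the two listings above could be produced, using only the standard library. The stop-word set and the sample reviews are illustrative assumptions, not the lists used in this analysis. Removing stop words before pairing tokens is what yields bigrams such as "piece junk" or "easy use".

```python
import re
from collections import Counter

# Hypothetical, heavily shortened stop-word set (assumption for illustration).
STOP_WORDS = {"the", "a", "is", "in", "and", "this", "of", "to", "it", "br"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bigram_counts(reviews):
    """Count adjacent token pairs across all reviews after stop-word removal."""
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Made-up reviews, including an HTML line break like those in the dataset.
sample_reviews = [
    "The image quality in low light is superb.<br />Great camera!",
    "Great camera and the low light performance is excellent.",
]
counts = bigram_counts(sample_reviews)
top_bigrams = counts.most_common(3)
```

Running the same function separately on the 1 star and 5 star subsets would give two lists that can be compared side by side.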

### Reviewer behaviour

In [15]:
#Most frequent reviewers. * A handful of users gave multiple reviews in this sample. Two users gave 6 reviews.

with open('Pickles/freq_reviewers.pkl', 'rb') as f:
    freq_reviewers = pickle.load(f)
freq_reviewers

45664110    6
40109303    6
9115336     5
19541636    4
27140716    4
10572690    4
46160224    4
17237889    3
44250386    3
49410262    3
Name: customer_id, dtype: int64

Fascinatingly, the most frequent reviewer always left 4 star reviews.

The reviews all follow a similar structure and end with a link recommendation, leading me to believe they are affiliate spam.

**Example**

"This Canon EOS 7D is one of the best midrange D-SLRs money can buy.<br /><br />Pros:<br />+ Excellent still-image and HD-video quality.<br />+ Fast performance.<br />+ Various HD video recording options.<br /><br />Cons:<br />- Pricey.<br />- Video recording is not as simple as with a dedicated camcorder.<br /><br />On what you will receive for the price and in terms of spec's I would recommend comparing with this Nikon: http://amzn.to/1BWQgGG",

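Reviews of this kind could be flagged automatically. The sketch below is a heuristic only: it assumes that a shortened Amazon link (`amzn.to`) at the very end of a review is a possible affiliate-spam signal. The function names and the second sample review are hypothetical.

```python
import re

# Assumption: a shortened Amazon link at the end of the text is suspicious.
SHORT_LINK_AT_END = re.compile(r"https?://amzn\.to/\w+\W*$")
BR_TAG = re.compile(r"<br\s*/?>")

def strip_br(text):
    """Replace HTML line-break tags with spaces."""
    return BR_TAG.sub(" ", text)

def ends_with_short_link(text):
    """Return True if the review ends with a shortened Amazon link."""
    return bool(SHORT_LINK_AT_END.search(strip_br(text).strip()))

spam_like = ("Pros:<br />+ Fast performance.<br /><br />I would recommend "
             "comparing with this Nikon: http://amzn.to/1BWQgGG")
genuine = "Great camera, the kit lens is sharp and battery life is good."
flags = [ends_with_short_link(spam_like), ends_with_short_link(genuine)]
```

A flag like this would only be a starting point for manual inspection, since genuine reviewers may also link to other products.
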
In [21]:
#Reviewers who write the most tend to have a high helpful ratio.
with open('Pickles/long_reviewers.pkl', 'rb') as f:
    long_reviewers = pickle.load(f)
long_reviewers

Unnamed: 0_level_0,star_rating,total_votes,helpful_ratio,review_length
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
53089089,5.0,118.0,0.983051,8542.0
47739285,5.0,233.0,0.969957,5307.0
37149793,5.0,239.0,0.983264,3933.0
49055733,5.0,57.0,0.877193,3857.0
47904504,4.0,900.0,0.98,3830.0
32013796,4.0,153.0,0.993464,3530.0
10550648,5.0,463.0,0.967603,3225.0
12019261,4.0,8.0,0.766667,2752.5
40888222,5.0,75.0,0.986667,2397.0
20866378,5.0,23.0,1.0,2147.0


### Correlation heatmap

* Helpful_ratio has a positive correlation (0.27) with the star rating.
* Vader negative correlates negatively with the star rating (-0.39).
* Vader compound has the strongest correlation with the star rating (0.5).

<img src="img/heatmap.png" width="600"/>
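
For reference, a correlation heatmap of this kind is typically built from Pearson coefficients (the default of pandas `DataFrame.corr()`). That the heatmap above uses Pearson is an assumption. The sketch computes the coefficient by hand on made-up values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: star ratings and a sentiment score for six reviews.
star_rating = [1, 2, 3, 4, 5, 5]
compound = [-0.6, -0.2, 0.1, 0.4, 0.8, 0.9]
r = pearson_r(star_rating, compound)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.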

<a name="modelselection"> </a>
## Model Selection & Evaluation

**Preprocessing**

- Train Test Split: Models were trained on 70% of the data and tested on the remaining 30% unseen data, with the split stratified by star rating.
- Multiple feature modelling used a custom Pipeline to apply the following preprocessing:
    - **Numeric_features** ['helpful_ratio', 'review_length', 'vader_pos', 'vader_neg'] were standardised with StandardScaler().
    - **Categorical_features** ['verified_purchase', 'camera_brand'] were dummified with OneHotEncoder().
    - **Text field** ['review_body'] was processed with CountVectorizer or TfidfVectorizer.
- Resampling methods accounting for imbalance in the dataset were added to the pipeline in the final series of modelling.
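
To illustrate what stratifying the split means, here is a standard-library sketch that splits each class separately so that both sets keep similar class proportions. It is a simplified illustration, not the algorithm scikit-learn uses internally, and the labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=1):
    """Return (train_idx, test_idx) keeping class proportions similar in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 70 five-star, 20 four-star and 10 one-star reviews.
labels = [5] * 70 + [4] * 20 + [1] * 10
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
```

Without stratification, a rare class such as 2 star reviews could by chance be almost absent from the test set.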

**The following classification models were trialled**

 - Logistic Regression
 - Multinomial Naive Bayes	
 - Random Forest	
 - Gradient Boosting	
 - K-Nearest Neighbors (KNN)
 - Support Vector Machines (SVM)
 
**HyperParameter tuning**
- Models were compared using both CountVectorizer and TfidfVectorizer. An ngram range of (1,2) gave the most useful words and the best results. English stopwords were removed, in addition to the html tag [br].
- Grid Search was run on the highest performing models using the following parameters:
    - Logistic regression: 'classifier__C': [0.1, 1.0, 10, 100]
    - SVM: 'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid']

**Evaluation metrics**
- Comparison to baseline accuracy
- Due to imbalance in data the **weighted average F1 score (the harmonic mean of precision and recall) is a more suitable metric for comparison.**
<img src="img/f1.png"/>

- **Precision:** 
Precision is the number of True Positives divided by the sum of True Positives and False Positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions made. Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of False Positives.

- **Recall:** Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.
Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many False Negatives.

- A more in depth look at classification behaviour using the Confusion Matrix to determine the types of errors made. 
- Viewing misclassified reviews from highest performing models to identify cause of errors. 
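
The metric definitions above can be sketched in plain Python. The counts below are taken from the binary Logistic Regression confusion matrix reported in the Results section (rows are true classes, columns are predicted classes). The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption.

```python
def per_class_metrics(matrix):
    """Precision, recall, F1 and support per class.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[i][k] for i in range(n)) - tp
        fn = sum(matrix[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": sum(matrix[k])})
    return results

def weighted_f1(matrix):
    """Average of per-class F1 scores, weighted by class support."""
    metrics = per_class_metrics(matrix)
    total = sum(m["support"] for m in metrics)
    return sum(m["f1"] * m["support"] for m in metrics) / total

# Rows: true low, true high. Columns: predicted low, predicted high.
logreg_binary = [[75, 31],
                 [26, 1080]]
metrics = per_class_metrics(logreg_binary)
overall = weighted_f1(logreg_binary)
```

Because the weighted average is dominated by the majority class, it is worth reading it alongside the per-class F1 of the minority class.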

In [None]:
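from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# `df` is the cleaned reviews DataFrame and `stop` the stop-word list
# (English stop words plus the 'br' html tag), both defined earlier.
y = df['star_rating']
X = df[['helpful_ratio','verified_purchase','review_body','review_length','camera_brand','vader_pos','vader_neg']]

# Create a stratified 70/30 train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y, random_state=1)

# Preprocessing pipelines for numeric, categorical and text data
numeric_features = ['helpful_ratio', 'review_length', 'vader_pos', 'vader_neg']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['verified_purchase', 'camera_brand']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# A single column name (not a list) so the vectoriser receives 1D text
nlp_features = 'review_body'
nlp_transformer = Pipeline(steps=[
    ('Tfidf', TfidfVectorizer(stop_words=stop, ngram_range=(1,2))),
    # ('cvec', CountVectorizer(stop_words=stop, ngram_range=(1,2))),
])

# Apply each transformer to its own set of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('nlp', nlp_transformer, nlp_features)])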

<a name="results"> </a>
# Results

<a name="multiclass"> </a>
## Multi Classification models

Baseline accuracy for multi classification: 0.714

| Star_rating | Value_counts normalized   | count|
|------|------|-----|
|   5  | 0.713547|3034 |
|   4  | 0.153104|651 |
|   1  | 0.057385|244 |
|   3  | 0.050094|213 |
|   2  | 0.025870|110 |
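
The baseline is the accuracy of a model that always predicts the most frequent class. A short sketch using the counts from the table above (the function name is hypothetical):

```python
# Review counts per star rating, copied from the table above.
rating_counts = {5: 3034, 4: 651, 1: 244, 3: 213, 2: 110}

def baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())

baseline = baseline_accuracy(rating_counts)
print(f"Baseline accuracy: {baseline:.3f}")
```

Any model that cannot beat this figure on accuracy is doing no better than always guessing 5 stars.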

<a name="reviewbody"> </a>
### Single feature - Review body

**Features used:** 
['review_body']

In [20]:
with open('Pickles/review_body_summary.pkl', 'rb') as f:
    review_body_summary = pickle.load(f)
review_body_summary

Unnamed: 0,Logreg CVEC,Logreg TFIDF,MN NaiveBayes,Random Forest,Gradient Boost,KNN,SVM
precision,0.647411,0.652059,0.618355,0.588417,0.548345,0.587785,0.687722
recall,0.710815,0.700627,0.71395,0.702978,0.71395,0.710031,0.651254
f1-score,0.67371,0.67432,0.597667,0.615236,0.597588,0.608114,0.664229
support,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0
cross_val training score,0.70766,0.714046,0.714048,0.699272,0.713374,0.704303,0.643136
test score,0.710815,0.700627,0.71395,0.702978,0.71395,0.710031,0.651254


**Analysis**
- Cross validated training scores and test scores did not vary significantly from the 0.714 baseline across models. The SVM classifier was the exception, with a lower cross validated training score of 0.64 and a test score of 0.65.

- Although SVM had the lowest accuracy score, its predictions were spread more widely across the 1-5 star ratings. This is an expected trade off against accuracy in this imbalanced dataset.

- The confusion matrices for the models below make it more evident why accuracy is not the most informative metric if we want to predict all classes.
    - Multinomial Naive Bayes and Gradient Boosting classifiers largely predicted 5 stars every time. Predicting the majority class with the 5 star review imbalance in the data is likely to optimise for accuracy but not predict 1, 2, 3 and 4 star rated reviews well.
    - Random Forest and KNN increased predictions of the lower classes in comparison. Across all models there were very few 2 star classifications.
    - Finally, Logistic Regression and SVC had a wider spread of predictions across classes and predicted 1 star reviews with much greater frequency. Predicting away from the majority class produced more errors, but it also meant more correct predictions of the minority classes. This is why the F1 score will be my main metric of comparison going forward.
    
- Initial findings show that multi class prediction poses difficulties.
    - Neighboring ratings are difficult to differentiate. It is unlikely that the nuance in sentiment provides much clarity. Across all models this is observed in the number of 4* predictions that were actually 5* reviews. 
- The next step is to consider if adding further features can increase both accuracy and the F1 score.


**Logistic regression CVEC**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|39	  |2	|3	  |5	|24   |
|True 2|12	  |0	|5	  |6	|10   |
|True 3|11	  |0	|5	  |14	|34   |
|True 4|4	  |0	|10	  |35	|146  |
|True 5|13	  |0	|7	  |63	|828  |



**Logistic regression Tfidf**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|45    |	0	|3	  |8	|17   |
|True 2|12    |	0	|2	  |12	|7    |
|True 3|11    |	3	|1 	  |23	|26   |
|True 4|9     |	2	|10	  |44	|130  |
|True 5|15    |	0	|1	  |91	|804  |


**Multinomial Naive Bayes CVEC**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	1 |	0	|0	  |0	|72   |
|True 2|	0 |	0	|0	  |0	|33   |
|True 3|	0 |	0	|0	  |0	|64   |
|True 4|	0 |	0	|0	  |1	|194  |
|True 5|    0 |	0	|0	  |2	|909  |


**Random Forest CVEC**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	6|	0|	0|	2	|65  |
|True 2|	3|	0|	1|	0|	29  |
|True 3|	3|	0|	0|	1|	60  |
|True 4|	2|	0|	1|	13|	179 |
|True 5|    8|	1|	4|	20|	878 |

	

**Gradient Boosting CVEC**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	2	|0	  |0	|0	|71  |
|True 2|	0	|0	  |0	|0	|33  |
|True 3|	0	|0	  |0	|0	|64   |
|True 4|	0 |	0	|0	  |0	|195  |
|True 5|    1 |	1	|0	  |0	|909  |


**KNeighborsClassifier CVEC**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	3|	0|	1|	2|	67   |
|True 2|	1|	0|	0|	0|	32  |
|True 3|	1|	0|	0|	0|	63   |
|True 4|	2|	0|	1|	7|	185  |
|True 5|    0|	0|	2|	13|	896  |



**SVM: Support Vector Classifier Tfidf**

|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	42|	1	|7	|7	|16   |
|True 2|	12|	1	|7	|12	|1   |
|True 3|	12|	2	|10	|24	|16   |
|True 4|	10|	0	|14	|82	|89  |
|True 5|    15|	3	|25	|172|	696  |
	




<a name="multiplefeatures"> </a>
### Multiple features and parameter tuning

**Features used:** 
['helpful_ratio','verified_purchase','review_body','review_length','camera_brand','vader_pos','vader_neg']

In [19]:
#Using TFIDF Vectoriser

with open('Pickles/multiple_features_summary_tfidf.pkl', 'rb') as f:
    multiple_features_summary_tfidf = pickle.load(f)
multiple_features_summary_tfidf

Unnamed: 0,KNN,SVC,Logreg,RandForest,GradBoost,MultinomialNB
precision,0.610031,0.691934,0.649611,0.57651,0.652488,0.509724
recall,0.695925,0.706113,0.719436,0.710031,0.728056,0.71395
f1-score,0.643642,0.69782,0.66914,0.617243,0.631536,0.594795
support,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0
cross_val training score,0.704979,0.694901,0.722777,0.709691,0.733201,0.71472
test score,0.695925,0.706113,0.719436,0.710031,0.728056,0.71395


In [18]:
#Using Count Vectoriser

with open('Pickles/multiple_features_summary_cvec.pkl', 'rb') as f:
    multiple_features_summary_cvec = pickle.load(f)
multiple_features_summary_cvec

Unnamed: 0,KNN,SVC,Logreg,RandForest,GradBoost,MultinomialNB
precision,0.596557,0.62936,0.658725,0.588333,0.570702,0.613339
recall,0.705329,0.666144,0.73511,0.709248,0.724138,0.727273
f1-score,0.623883,0.64566,0.686371,0.609829,0.626816,0.642532
support,1276.0,1276.0,1276.0,1276.0,1276.0,1276.0
cross_val training score,0.710002,0.689562,0.735233,0.709342,0.734883,0.723452
test score,0.705329,0.666144,0.73511,0.709248,0.724138,0.727273


**Analysis**

- Count Vectoriser and Tfidf Vectoriser were compared across models. An ngram range of (1,2) performed best across models, found by trial and error.

- Models with highest F1 score:
    - SVC Tfidf Vectoriser (0.697820) 
    - Logistic Regression CVEC (0.686371)

- Logistic Regression accuracy improved upon the single feature model, with a cross validated training score of 0.735233 and a test score of 0.735110.

- SVC also gained accuracy over the single feature model, rising from 0.64/0.65 (cross validated training/test score) to 0.69/0.71 with the Tfidf Vectoriser and 0.69/0.67 with the Count Vectoriser.

- A Grid Search for parameter tuning on the two best performing models follows below:

### Grid search on best performing classifiers

**Logistic regression CVEC**

Tuning Parameter: 'classifier__C': [0.1, 1.0, 10, 100],
    

|            |  precision |	recall   |   f1-score|	  support|
|------      |------      |-----       |-----      |-----      |  
|1           |0.5104      |0.6712      |0.5799     |    73     |  
|2           |0.4000      |0.0606      |0.1053     |   33      |  
|3           |0.1600      |0.0625      |0.0899     |   64      |  
|4           |0.3023      |0.1333      |0.1851     |  195      |  
|5           |0.8064      |0.9418      |0.8689     |  911      |  
|------      |------      |-----       |-----      |-----      |  
|accuracy    |            |            |**0.7359**     | 1276      |  
|macro avg   | 0.4358     |0.3739      |0.3658     | 1276      |  
|weighted avg| 0.6695     |0.7359      |**0.6890**     | 1276      |  



|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	49|   0 |  4  | 4   |16       |
|True 2|	15|   2 |  2  | 7   | 7        |
|True 3|	14|   2 |  4  |12   |32       |
|True 4|	7 |  1  |10   |26   |151       |
|True 5|    11|   0 |  5  |37   |858   |



**Support Vector Classifier (SVC) w/ Tfidf Vectoriser**

Tuning Parameter: 'classifier__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    
 |            |  precision |	recall   |   f1-score|	  support|
|------      |------      |-----       |-----      |-----      |  
|1           |0.4783      |0.6027      |0.5333     |    73     |  
|2           |0.1667      |0.0909      |0.1176     |   33      |  
|3           |0.2195      |0.1406      |0.1714     |   64      |  
|4           |0.3093      |0.3077      |0.3085     |  195      |  
|5           |0.8432      |0.8617      |0.8523     |  911      |  
|------      |------      |-----       |-----      |-----      |  
|accuracy    |            |            |**0.7061** | 1276      |  
|macro avg   | 0.4034     |0.4007      |0.3966     | 1276      |  
|weighted avg| 0.6919     |0.7061      |**0.6978**     | 1276      |  



|      |  p1  |	p2  |   p3|	  p4|	p5|
|------|------|-----|-----|-----|-----|     
|True 1|	44|   4 |  6  | 8   |11   |
|True 2|	14|   3 |  5  | 9   | 2   |
|True 3|	11|   5 |  9  |23   |16   |
|True 4|	10|   3 |  5  |60   |117  |
|True 5|    13|   3 | 16  |94   |785  |   
    
   

**Analysis continued**

- Logistic Regression predicted more 5* ratings correctly, though to achieve this it predicted 5* more often.
- Logistic Regression gave greater precision and recall on 1* reviews. It also had greater recall on 5* reviews, but lower precision.
- Both models classified 1* ratings similarly, with the Logistic Regression model edging ahead by making fewer large errors, which I define as predicting actual 5* ratings as 1*.


As the predictions seem to cluster around 1* and 5*, I will try a binary classification model, hoping that the differentiating characteristics of positive and negative reviews are enough to increase model performance.

<a name="binary"> </a>
## Binary classification model

Predicting high or low star rating.

Creating a new column in the dataset, 'binary_rating', with the following values assigned depending on the star rating. 3 star reviews receive no label and are excluded.

- Low:  1 and 2
- High: 4 and 5 

In [None]:
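#Creating a high or low label: 1 = high (4-5 stars), 0 = low (1-2 stars).
#3 star reviews are set to NaN so they can be excluded from the binary model.
import numpy as np
df['binary_rating'] = df.star_rating.map(lambda x: 1 if x > 3 else 0 if x < 3 else np.nan)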

Baseline accuracy for binary classification: **0.912**

| Binary_rating | Value_counts normalized   | count|
|------       |------                      |-----|
|   1.0       | 0.912355                  |3685 |
|   0.0       | 0.087645                  |354  |


     

**Features used:** 
['helpful_ratio','verified_purchase','review_body','review_length','camera_brand','vader_pos','vader_neg']

In [16]:
#CVEC
with open('Pickles/multiple_features_summary_binary_cvec.pkl', 'rb') as f:
    multiple_features_summary_binary_cvec = pickle.load(f)
multiple_features_summary_binary_cvec

Unnamed: 0,KNN,SVC,Logreg,RandForest,GradBoost,MultinomialNB
precision,0.888478,0.937927,0.952023,0.920469,0.931145,0.934784
recall,0.914191,0.938944,0.95297,0.920792,0.931518,0.938119
f1-score,0.888662,0.938412,0.952455,0.890858,0.91236,0.925321
support,1212.0,1212.0,1212.0,1212.0,1212.0,1212.0
cross_val training score,0.921825,0.934553,0.944815,0.918288,0.926066,0.928197
test score,0.914191,0.938944,0.95297,0.920792,0.931518,0.938119


In [17]:
#tfidf
with open('Pickles/multiple_features_summary_binary_tfidf.pkl', 'rb') as f:
    multiple_features_summary_binary_tfidf = pickle.load(f)
multiple_features_summary_binary_tfidf

Unnamed: 0,KNN,SVC,Logreg,RandForest,GradBoost,MultinomialNB
precision,0.925095,0.946223,0.937891,0.921289,0.932601,0.832732
recall,0.932343,0.948845,0.915017,0.929868,0.930693,0.912541
f1-score,0.92722,0.947197,0.923103,0.914009,0.910073,0.870812
support,1212.0,1212.0,1212.0,1212.0,1212.0,1212.0
cross_val training score,0.92749,0.943758,0.916166,0.920409,0.927127,0.912275
test score,0.932343,0.948845,0.915017,0.929868,0.930693,0.912541


**Analysis**

- As with the multi classification results, Logistic Regression CVEC and SVC Tfidf gave the highest F1 and accuracy scores. Both beat the baseline accuracy of 0.912.
- The Logistic Regression F1 score for the low class (0.7246) improves on the 1* class in the multi classification model (0.5799), so the minority class is predicted more accurately.


- In comparison to SVC, Logistic Regression predicts fewer false positives (31 vs 38), i.e. reviews predicted high that are actually low.

- It predicts marginally more false negatives than SVC (26 vs 24), i.e. reviews predicted low that are actually high.



**Logistic Regression** 

- Test score 0.9529702970297029
- Cross Val train score 0.9448149616803049

|            |  precision |	recall   |   f1-score|	  support|
|------      |------      |-----       |-----      |-----      |  
|0.0           |0.7426      |0.7075      |0.7246     |    106     |  
|1.0           |0.9721      |0.9765      |0.9743     |   1106      |   
|------      |------      |-----       |-----      |-----      |  
|micro avg    | 0.9530           | 0.9530           |0.9530 | 1212     |  
|macro avg   | 0.8573    |0.8420      | 0.8495    | 1212     |  
|weighted avg| 0.9520      |0.9530      |**0.9525**     |  1212      |
            


|      |  pred Low  |	pred High  | 
|------|------|-----|    
|True Low|	75|   31 |  
|True High|	26|   1080 | 



**SVC**

- Test score 0.9488448844884488
- Cross Val train score 0.9437580313676595


 |            |  precision |	recall   |   f1-score|	  support|
|------      |------      |-----       |-----      |-----      |  
|0.0           |0.7391      |0.6415       |0.6869     |    106    |  
|1.0           |0.9661      |0.9783      |0.9721    |   1106      |   
|------      |------      |-----       |-----      |-----      |  
|micro avg    | 0.9488           | 0.9488           |0.9488 | 1212     |  
|macro avg   | 0.8526    |0.8099     | 0.8295   | 1212     |  
|weighted avg| 0.9462     |0.9488     |**0.9472**     |  1212      |


|      |  pred Low  |	pred High  | 
|------|------|-----|    
|True Low|	68|   38 |  
|True High|	24|   1082 |   




<a name="resampling"> </a>
### Resampling

Machine learning algorithms are built to minimize errors. Since instances are far more likely to belong to the majority class, the algorithms tend to classify new observations into the majority class.
 
Up until this point, where possible, I have used the hyperparameter "class_weight='balanced'", which gives greater importance to minority classes.

I also stratified the train/test split and changed the primary performance metric to the F1 score.
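
As a rough illustration of the balanced class weighting mentioned above: scikit-learn documents the heuristic as `n_samples / (n_classes * count_of_class)`. The sketch applies that formula by hand to the binary label counts reported earlier. In practice the weights would be derived from the training split only, so the numbers are illustrative.

```python
def balanced_class_weights(class_counts):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Binary label counts from the table above (1 = high, 0 = low).
weights = balanced_class_weights({1: 3685, 0: 354})
```

With these counts, an error on a low rated review costs roughly ten times as much as an error on a high rated one.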
 
Furthermore, in this section I use the Imbalanced-learn library to re-sample the data, to mitigate the effect of class imbalance and see if any further improvements can be made to the above model output.

Here are some of the resampling methods employed:

- **Oversample** minority class - increases minority class members in the training set. No information is lost, but it is prone to overfitting.
- **Undersample** majority class - reduces majority class samples. May discard useful information. **Tomek links** also falls in this category (definition below).
- Generate **synthetic samples** - instead of replicating observations in the minority class, create new data points that are similar, e.g. for oversampling (**SMOTE and ADASYN**).

SMOTE first considers the K nearest neighbors of the minority instances. It then constructs feature space vectors between these K neighbors, generating new synthetic data points on the lines.
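
The interpolation idea behind SMOTE can be sketched with the standard library. This is a simplified illustration on made-up 2-D points, not the Imbalanced-learn implementation. The function name and parameters are hypothetical.

```python
import math
import random

def smote_sample(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority-class points.
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote_sample(minority_points)
```

Each synthetic point lies on the line segment between two existing minority points, so it stays within the region the minority class already occupies.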

ADASYN also creates synthetic data points by interpolating between minority instances and their neighbours. However, it adapts to the data: more synthetic points are generated around minority instances that are harder to learn, i.e. those with more majority class neighbours.

Tomek links removes unwanted overlap between classes. A Tomek link is a pair of nearest neighbours from opposite classes. The majority class members of these links are removed until all minimally distanced nearest neighbor pairs are of the same class.
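
A standard-library sketch of Tomek link removal on made-up 1-D data. It performs a single pass and is a simplified illustration rather than the Imbalanced-learn implementation. The function names are hypothetical.

```python
import math

def nearest_neighbour(i, points):
    """Index of the point closest to points[i] (excluding itself)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def remove_tomek_links(points, labels, majority_label):
    """Drop majority-class points that form a Tomek link, i.e. a pair of
    mutual nearest neighbours with different labels."""
    to_drop = set()
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if labels[i] != labels[j] and nearest_neighbour(j, points) == i:
            to_drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in to_drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Hypothetical 1-D data: label 1 is the majority class, label 0 the minority.
points = [(0.0,), (1.0,), (2.0,), (2.9,), (3.0,), (6.0,)]
labels = [1, 1, 1, 1, 0, 0]
kept_points, kept_labels = remove_tomek_links(points, labels, majority_label=1)
```

Here only the majority point at 2.9 is removed, because it and the minority point at 3.0 are each other's nearest neighbours.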

The sampling techniques were applied during the preprocessing step, with a grid search for the optimal sampling ratio of the minority:majority class.

In [22]:
with open('Pickles/multiple_features_summary_binary_resampling.pkl', 'rb') as f:
    multiple_features_summary_binary_resampling = pickle.load(f)
multiple_features_summary_binary_resampling

Unnamed: 0,"RandomUnderSampler(random_state=None, ratio=None, replacement=False,\n return_indices=False, sampling_strategy='auto')","RandomOverSampler(random_state=None, ratio=None, return_indices=False,\n sampling_strategy='auto')","SMOTE(k_neighbors=5, kind='deprecated', m_neighbors='deprecated', n_jobs=1,\n out_step='deprecated', random_state=None, ratio=None,\n sampling_strategy='auto', svm_estimator='deprecated')","ADASYN(n_jobs=1, n_neighbors=5, random_state=None, ratio=None,\n sampling_strategy='auto')","TomekLinks(n_jobs=1, random_state=None, ratio=None, return_indices=False,\n sampling_strategy='auto')"
precision,0.944046,0.951678,0.948005,0.950336,0.954723
recall,0.932343,0.95297,0.948845,0.95132,0.955446
f1-score,0.936694,0.952242,0.948399,0.950787,0.955057
support,1212.0,1212.0,1212.0,1212.0,1212.0
GS training score,0.969225,0.998939,0.998939,0.998585,0.998939
GS test score,0.932343,0.95297,0.948845,0.95132,0.955446


**Analysis**

- Only the Tomek links resampling method increased the F1 score compared to the previous binary classification model (0.9551 vs 0.9525). Comparing the confusion matrices shows minor but beneficial gains in model behaviour.
           
- best test score from grid search: 0.955
- best train score from grid search: 0.999


 |            |  precision |	recall   |   f1-score|	  support|
|------      |------      |-----       |-----      |-----      |  
|0.0           |0.7549      |0.7264       |0.7404     |    106    |  
|1.0           |0.9739       |0.9774      |0.9756    |   1106      |   
|------      |------      |-----       |-----      |-----      |  
|accuracy    |          |            |0.9554 | 1212     |  
|macro avg   | 0.8644    |0.8519     | 0.8580    | 1212     |  
|weighted avg|  0.9547    |0.9554     |**0.9551**     |  1212      |


|      |  pred Low  |	pred High  | 
|------|------|-----|    
|True Low|	77|   29 |  
|True High|	25|   1081 |   


<a name="misclassified"> </a>
### Misclassified results

A brief analysis of the reviews misclassified by the Tomek links model.

In [12]:
with open('../Pickles/incorrect_df.pkl', 'rb') as f:
    incorrect_df = pickle.load(f)

In [13]:
incorrect_df

Unnamed: 0,actual,predicted,mismatch,TEXT
138,1.0,1.0,False,Amazing upgrade to my T1i. I'm able to grab fo...
3500,1.0,1.0,False,The Canon 10D is a good camera as far as it's ...
3521,1.0,1.0,False,Impressive does not begin to contend for the i...
2192,1.0,1.0,False,"Being a Minolta fan for some years, I one day ..."
219,1.0,1.0,False,awesome camera. my uncle let me borrow his wit...
1111,1.0,1.0,False,excellent quality and meets all my expectation...
557,1.0,1.0,False,What a simply amazing camera. My love of photo...
3927,1.0,1.0,False,This is a great 35MM SLR. It provides a very g...
2242,1.0,1.0,False,It us a good deal for the meager price you pay...
3802,1.0,1.0,False,I'm very happy with this camera and with the S...


In [14]:
#0 is low, 1 is high. 
# Slightly more incorrect instances of predicting high, when the rating is actually low. 
incorrect_df[incorrect_df.mismatch == True].groupby(
    ['actual', 'predicted'])[['mismatch']].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,mismatch
actual,predicted,Unnamed: 2_level_1
0.0,1.0,29
1.0,0.0,25


In [19]:
incorrect_df[incorrect_df.mismatch == True].values


array([[0.0, 1.0, True,
        "I'm a filmmaker and I was excited to try this camera. The experience turned out to be a big disappointment. I shot in 24p. MOV, all-intra,  shutter 50. And I have a SD card on 95mb/s. There is motion stutter all over my footage. If everything is still then the picture will be fine. As soon as I pan, tilt, or the person in the scene moves (like waving her arms), there will be stutter (staccato effect)  and the footage become unusable. I can see the stutter on the LCD, also when downloaded to my fast Macbook Pro.  It doesn't seem to be the normal jelly effect. I googled about this and found many threads about the issue, but without a solution. When shooting in 60p Mov or in AVCHD 24MBPS the stutter get better,.<br />I wonder how all the other people got beautiful footage. Maybe they shoot in lower bitrate. Or maybe my camera is defect. Anyway I'll have to return mine. Please let me know if you have a clue about this."],
       [1.0, 0.0, True,
        "I 

**Analysis**

Themes emerging from misclassified reviews:

True low / predicted high 
- Excited at first but then disappointed. 
- Satisfied with camera but an important downside (price, warranty or customer service)

True high / predicted low
- Big problem that was resolved. e.g replacement camera
- Similarly to above, satisfied with results but experienced a difficulty
- 'No complaints' terms like this.


Some words that may not register as emotionally negative but appeared in low rated reviews:

- Technical errors e.g motion stutter
- Product experience: Unusable, defect, unreliable
- Return e.g sent back to seller


One review even stated

- "Then Mo told me that the only way I could get a full refund, and he says 'this is a deal breaker' would be if I gave a five star rating on Amazon. They got five stars. I got a full refund."

<a name="challenges"> </a>

# Challenges

- Filtering the dataset down to cameras reduced the number of reviews to only ~4000 and made the UK dataset unusable.
- Amazon has changed its helpfulness indicator, so reviews scraped today no longer show a "total votes" number. This meant I could not replicate the same features.
- There was a large proportion of unverified reviews in the dataset. It is not immediately clear how trustworthy non-verified reviews are to draw conclusions from.
- Imbalance in the dataset. Most reviews had 4/5 star ratings, so there were fewer low rated reviews for the classifiers to learn from.
- Differentiating between reviews on a 1-5 scale is difficult for neighboring star ratings as there is not much to discern one from the other. 


<a name="risks"> </a>
# Risks
- Legitimacy of unverified reviews
- Sample too small once filtered to cameras only


<a name="assumptions"> </a>
# Assumptions
- The cameras and reviews in the dataset are assumed to be representative samples of the market. Cameras with low reviews may be removed from Amazon, which could explain why there are fewer low rated products.

<a name="takeaways"> </a>
# Key Takeaways


In [27]:
#Features contributing to 5 star reviews
#Prioritise these features in a new camera development and firmware updates
freq_bigrams_5star 

low light           457
image quality       416
full frame          331
great camera        328
point shoot         244
auto focus          221
high iso            218
love camera         192
live view           188
kit lens            182
easy use            174
ed af               163
camera body         158
picture quality     154
digital camera      153
much better         152
mark ii             147
white balance       137
dynamic range       136
digital slr         135
battery life        134
shutter speed       133
camera great        123
af dx               120
use camera          120
highly recommend    120
canon eos           119
per second          114
build quality       113
depth field         109
dtype: int64

In [28]:
#Features contributing to 1 star reviews
#Resolving these issues should lead to less low ratings

freq_bigrams_1star

memory card         22
customer service    16
camera body         15
image quality       13
high iso            13
fn iii              13
used camera         12
picture quality     12
hot pixels          12
low light           11
piece junk           9
brand new            9
buy camera           9
auto focus           9
purchased camera     9
take pictures        8
camera one           8
camera work          8
digital camera       8
shutter speed        8
could get            8
take picture         8
camera would         8
received camera      7
white balance        7
sd card              7
got camera           7
use camera           7
low iso              7
camera case          7
dtype: int64

**Review characteristics for cameras**

- 4/5 star reviews are most common.
- Review counts have been growing over time. 5x monthly growth from 2013-2015.
- Market leaders Canon, Nikon and Sony gathered the most reviews, though best selling models account for a lot of these.
- Correlation between higher rated products and receiving helpful reviews. 
- Reviews under 40 words are less helpful. 
- 5* reviews average 40 words more than low star reviews.

**Modelling**

- Low/High binary classifier is more suitable for predictions. 
- Logistic Regression using Tomek links resampling gave us the highest classification accuracy and F1 score, though the gap between training and test scores suggests it is prone to overfitting.

- Test score: 0.955
- Training score: 0.999

|      |  pred Low  |	pred High  | 
|------|------|-----|    
|True Low|	77|   29 |  
|True High|	25|   1081 |   

- ngrams 1,2 gave the highest score. 
- Misclassified reviews often contained multiple opinions, or opinions that changed over the course of the review, which confuses the sentiment.



<a name="nextsteps"> </a>
# Next Steps

- Further text preprocessing, such as stemming and lemmatization, to potentially reduce misclassified results further.
- Using the web scraped products and reviews, run a similar model without the helpful ratio feature but with pricing and sub-category features - DSLR/Mirrorless/Action/Point and shoot. 
- With more data and user information, implement a recommendation engine.

# Appendix

* https://www.bloomberg.com/news/articles/2019-07-10/amazon-back-on-cusp-of-1-trillion-valuation-after-7-day-streak 
    