# Product Analysis

In [1]:
import turicreate
products = turicreate.SFrame('baby.frame_idx')

The `products` SFrame contains customer reviews of baby products.

In [3]:
products.head()

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


### The first step is to build a word counter

In [4]:
products['word_count'] = turicreate.text_analytics.count_words(products['review'])
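
As a rough illustration of what the `word_count` column holds, the sketch below builds a similar word-to-count dictionary in plain Python. The helper `count_words_sketch` is hypothetical, and its tokenization (lowercasing, splitting on whitespace, stripping surrounding punctuation) is an assumption; `turicreate.text_analytics.count_words` may tokenize differently.

```python
import string
from collections import Counter

def count_words_sketch(text):
    """Hypothetical stand-in for a bag-of-words counter (not turicreate's implementation)."""
    # Lowercase, split on whitespace, and strip punctuation from both ends of each token.
    tokens = (token.strip(string.punctuation) for token in text.lower().split())
    # Drop tokens that were pure punctuation and count the rest.
    return dict(Counter(token for token in tokens if token))

# Made-up review text, used only for illustration.
example_counts = count_words_sketch("I love it. My kids love it too!")
```

Each review is reduced to a dictionary like `example_counts`, which is the shape of the values shown later in the `word_count` column.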

In [8]:
products.groupby('rating',operations={'count':turicreate.aggregate.COUNT()}).sort('count',ascending=False)


rating,count
5.0,107054
4.0,33205
3.0,16779
1.0,15183
2.0,11310


We exclude reviews rated 3 so that the analysis focuses on clearly positive or clearly negative reviews

In [9]:
products = products[products['rating']!= 3]

We define ratings of 4 and 5 as positive, with `sentiment` = 1; the remaining ratings of 1 and 2 get `sentiment` = 0

In [10]:
products['sentiment'] = products['rating'] >= 4
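
The two steps above (dropping 3-star reviews, then labelling what remains) can be sketched on plain Python values. The ratings below are made up for illustration and are not taken from the dataset.

```python
# Made-up ratings for illustration; not taken from the dataset.
ratings = [3.0, 5.0, 1.0, 4.0, 2.0, 3.0]

# Step 1: drop reviews with a rating of exactly 3.
kept = [r for r in ratings if r != 3]

# Step 2: 4 and 5 stars become sentiment 1; 1 and 2 stars become sentiment 0.
sentiment = [int(r >= 4) for r in kept]
```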

# Build a classifier with products['word_count'] as a feature

In [11]:
train_data,test_data = products.random_split(.7,seed=0)
sentiment_model = turicreate.logistic_classifier.create(train_data,target='sentiment', features=['word_count'], validation_set=test_data)

In [13]:
products['predicted_sentiment'] = sentiment_model.predict(products, output_type = 'probability')

We evaluate this model on the test data

In [14]:
sentiment_model.evaluate(test_data)

{'accuracy': 0.914914914914915,
 'auc': 0.930971482342302,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      0       |        1        |  2100 |
 |      1       |        0        |  2150 |
 |      0       |        0        |  5929 |
 |      1       |        1        | 39771 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9492791674622876,
 'log_loss': 0.36417064002296934,
 'precision': 0.949845955434549,
 'recall': 0.9487130555091721,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+--------------------+--------------------+-------+------+
 | threshold |        fpr         |        tpr         |   p   |  n   |
 +-----------+--------------------+--------------------+-------+------+
 |    0.0 

### This model has an accuracy of about 0.915
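
The headline metrics can be recomputed from the confusion matrix printed above. This is a minimal sketch using the standard definitions of accuracy, precision, recall and F1; the variable names are ours, and the counts are copied from the output.

```python
# Counts copied from the confusion matrix printed above.
tp = 39771  # target 1, predicted 1
tn = 5929   # target 0, predicted 0
fp = 2100   # target 0, predicted 1
fn = 2150   # target 1, predicted 0

total = tp + tn + fp + fn

accuracy = (tp + tn) / total          # share of all test reviews classified correctly
precision = tp / (tp + fp)            # share of predicted positives that are truly positive
recall = tp / (tp + fn)               # share of true positives that the model finds
f1_score = 2 * precision * recall / (precision + recall)
```

These values agree with the `accuracy`, `precision`, `recall` and `f1_score` entries returned by `evaluate`.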

### Next, let's build another model that uses only a specific set of words instead of the whole word count

In [22]:
specific_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

Let's add one new column per word, holding the count of that word in each review

In [24]:
for word in specific_words:
    # Bind word as a default argument so each lambda keeps its own word.
    products[word] = products['word_count'].apply(lambda counts, word=word: counts.get(word, 0))
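
The lookup inside the loop relies on `dict.get` with a default of 0, so a word that never appears in a review gets a count of zero rather than raising an error. A minimal sketch on a made-up dictionary, shaped like one entry of `products['word_count']`:

```python
# Shortened word list, for illustration only.
specific_words = ['awesome', 'great', 'love', 'hate']

# A made-up word-count dictionary, shaped like one entry of products['word_count'].
word_count = {'i': 1.0, 'love': 2.0, 'this': 1.0, 'great': 1.0}

# dict.get returns the default (0) when the word never appears in the review.
selected_counts = {word: word_count.get(word, 0) for word in specific_words}
```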

We now have a column for each of the `specific_words`, holding that word's count

In [25]:
products.head()

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'recommend': 1.0, 'disappointed': 1.0, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'quilt': 1.0, 'the': 1.0, 'than': 1.0, 'fu ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'tool': 1.0, 'clever': 1.0, 'binky': 2.0, ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'rock': 1.0, 'many': 1.0, 'headaches': 1.0, ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'thumb': 1.0, 'or': 1.0, 'break': 1.0, 'trying': ...",1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'for': 1.0, 'barnes': 1.0, 'at': 1.0, 'is': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'right': 1.0, 'because': 1.0, 'questions': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'like': 1.0, 'and': 1.0, 'changes': 1.0, 'the': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'in': 1.0, 'pages': 1.0, 'out': 1.0, 'run': 1.0, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'tracker': 1.0, 'now': 1.0, 'its': 1.0, 'sti ...",1

predicted_sentiment,awesome,great,fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate
0.9997880149622664,0.0,0.0,0.0,0.0,1.0,0,0,0.0,0,0,0
0.9980589101293116,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0,0,0
0.9997317561755628,0.0,0.0,0.0,0.0,2.0,0,0,0.0,0,0,0
0.9999948237308968,0.0,1.0,0.0,0.0,1.0,0,0,0.0,0,0,0
0.9999998679854072,0.0,1.0,0.0,0.0,0.0,0,0,0.0,0,0,0
0.9999055787484276,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0,0,0
0.999984659546595,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0,0,0
0.999995488755006,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0,0,0
0.9961156446039708,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0,0,0
0.9999999753691046,0.0,0.0,0.0,0.0,2.0,0,0,0.0,0,0,0


In [26]:
train_data,test_data = products.random_split(.7, seed=0)
sentiment_words_model = turicreate.logistic_classifier.create(train_data,target='sentiment', features=specific_words, validation_set=test_data)

We can take a look at the coefficients found by the model:

In [27]:
sentiment_words_model.coefficients.print_rows(num_rows=12, num_columns=5)

+-------------+-------+-------+---------------------+----------------------+
|     name    | index | class |        value        |        stderr        |
+-------------+-------+-------+---------------------+----------------------+
| (intercept) |  None |   1   |  1.342040332247529  | 0.009555603535186554 |
|   awesome   |  None |   1   |  1.1391942442946887 | 0.09057896739427865  |
|    great    |  None |   1   |  0.8604360166059707 | 0.020278337106370434 |
|  fantastic  |  None |   1   |  0.9211800488726357 |  0.1217053309616806  |
|   amazing   |  None |   1   |  1.0504718174076177 | 0.10408326795563408  |
|     love    |  None |   1   |  1.3721326419955475 | 0.030266019127131654 |
|   horrible  |  None |   1   | -2.2675445277535813 | 0.08508901349011777  |
|     bad     |  None |   1   |  -0.98857018253907  | 0.04096936377590639  |
|   terrible  |  None |   1   | -2.2415346119436155 | 0.08355607343887422  |
|    awful    |  None |   1   | -2.0609833719744572 |  0.1074829324384413  |
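
To get a feel for what these coefficients mean, the sketch below turns word counts into a probability, assuming the model follows the standard logistic-regression form (sigmoid of the intercept plus the weighted counts). The helper `positive_probability` is hypothetical, the values are rounded from the table above, and `wow` and `hate` are left out because their rows are not shown.

```python
import math

# Values rounded from the coefficient table above.
# 'wow' and 'hate' do not appear in the printed rows, so they are omitted here.
intercept = 1.3420
coefficients = {
    'awesome': 1.1392, 'great': 0.8604, 'fantastic': 0.9212, 'amazing': 1.0505,
    'love': 1.3721, 'horrible': -2.2675, 'bad': -0.9886, 'terrible': -2.2415,
    'awful': -2.0610,
}

def positive_probability(counts):
    """Sigmoid of intercept + sum(coefficient * count); assumes standard logistic regression."""
    score = intercept + sum(coefficients[word] * count for word, count in counts.items())
    return 1.0 / (1.0 + math.exp(-score))

p_love = positive_probability({'love': 1})          # one occurrence of a positive word
p_horrible = positive_probability({'horrible': 1})  # one occurrence of a negative word
```

Under this assumption, a single "love" pushes the probability well above 0.5, while a single "horrible" is enough to pull it below 0.5 despite the positive intercept.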

Checking the accuracy of this new model, we obtain

In [28]:
sentiment_words_model.evaluate(test_data)

{'accuracy': 0.8454454454454454,
 'auc': 0.6907892606887874,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  250  |
 |      0       |        0        |  559  |
 |      0       |        1        |  7470 |
 |      1       |        1        | 41671 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9152225955942105,
 'log_loss': 0.3986598789498636,
 'precision': 0.8479884414236585,
 'recall': 0.9940364018033921,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+--------------------+-----+-------+------+
 | threshold |        fpr         | tpr |   p   |  n   |
 +-----------+--------------------+-----+-------+------+
 |    0.0    |        1.0         | 1.0 | 41921 | 802

### The specific-words model has an accuracy of about 0.845, compared with about 0.915 for the first model

## We conclude that, in this case, the most accurate model is the one considering the whole word count. 
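
One way to put both accuracies in context is to compare them with a majority-class baseline, that is, a classifier that always predicts a positive review. The sketch below derives that baseline from the confusion-matrix counts reported above; the variable names are ours.

```python
# Test-set counts from the confusion matrices above (both sum to 49950 rows).
positives = 39771 + 2150   # reviews whose true sentiment is 1
negatives = 5929 + 2100    # reviews whose true sentiment is 0

# Accuracy obtained by always predicting the majority class (positive).
baseline_accuracy = positives / (positives + negatives)

# Accuracies reported by evaluate() for the two models.
word_count_accuracy = 0.914914914914915
selected_words_accuracy = 0.8454454454454454

# How far each model sits above the always-positive baseline.
word_count_gain = word_count_accuracy - baseline_accuracy
selected_words_gain = selected_words_accuracy - baseline_accuracy
```

The baseline comes out near 0.839, so the selected-words model improves on it by less than one percentage point, while the whole word count model improves on it by more than seven.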

### As a final exercise, let's check how well both models behave with one particular example 

We choose one product, Baby Trend Diaper Champ, and sort its reviews by predicted sentiment, from most to least positive

In [33]:
diaper_reviews = products[products['name']== 'Baby Trend Diaper Champ']
diaper_reviews = diaper_reviews.sort('predicted_sentiment', ascending=False)

We can analyse the most positive review, according to the predicted_sentiment

In [34]:
diaper_reviews[0]['review']

"I originally put this item on my baby registry because friends of mine highly recommended the use of a diaper pail.  I decided to go with the Diaper Champ rather than the more popular Diaper Genie, because I didn't like the idea of having to buy special bags.  Too costly and not efficient if you ask me!THIS PRODUCT IS A LIFESAVER!! I HAVE NEVER HAD AN ODOR PROBLEM!No one has ever come to my house and noticed 'diaper odor.'  In fact, people who have been present while I changed my almost 3 month old son's diaper have been amazed at how the very 'potent' smell of the dirty diaper seems to disappear once I dispose of the diaper in the Diaper Champ.For those who are experiencing odor problems, here are some suggestions which should solve the problem:1. CHANGE BAGS FREQUENTLY!!  I change mine about 1 or 2 times a week.  For those who complain that the pail gets stuck, that is probably a good indication that the BAG IS GETTING TOO FULL!! Also, making sure that the tape on the diapers is sec

#### We can see it is a positive review. Let's see how the models predicted this review

In [35]:
sentiment_model.predict(diaper_reviews[0:1], output_type='probability')

dtype: float
Rows: 1
[0.9999999999993169]

In [36]:
sentiment_words_model.predict(diaper_reviews[0:1], output_type='probability')

dtype: float
Rows: 1
[0.7928252730975904]
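
The selected-words prediction is close to what a logistic model returns when none of its features are active. The check below assumes the standard logistic form and uses the intercept from the coefficient table; it does not inspect the review's actual word counts, so the match is suggestive rather than conclusive.

```python
import math

# Intercept of sentiment_words_model, copied from the coefficient table.
intercept = 1.342040332247529

# With every selected-word count equal to zero, a logistic score reduces to the intercept.
intercept_only_probability = 1.0 / (1.0 + math.exp(-intercept))

# Probability that sentiment_words_model returned for the top Diaper Champ review.
reported_probability = 0.7928252730975904

difference = abs(intercept_only_probability - reported_probability)
```

If the two values agree, the most plausible reading is that this long, clearly positive review contains none of the selected words (or that their contributions cancel out), which would explain why the selected-words model is far less confident about it.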

### Consistent with the accuracy comparison, the whole word count model assigns this review a higher probability of being positive (about 1.00) than the selected-words model (about 0.79)