# Understanding Your Customers: A Fresh Take on Analyzing Product Reviews

Aisha Al-Khaldi & Huda Joad

## Where are we in the data science pipeline?

- **Ask an interesting question**
- Get the data
- Explore the data
- Model the data
- Communicate/visualize the results

## Our Question

When customers want to share their thoughts about a product, they usually write a review and assign a rating to their experience (typically a numerical rating from 1 to 5). We want to examine **the relationship between the content of a review (whether it is positive, negative, or neutral) and the rating assigned to it**.

We will use Steam's web API to collect reviews, and perform sentiment analysis on them to answer our question.

## Where are we in the data science pipeline?

- Ask an interesting question
- **Get the data**
- Explore the data
- Model the data
- Communicate/visualize the results

In [45]:
# data collection

In [46]:
# !pip install -e git+https://github.com/gauravmm/jupyter-testing.git#egg=jupyter-testing

from testing.testing import test
import requests

In [47]:
def retrieve_html_test(retrieve_html):
    status_code, text = retrieve_html("http://www.example.com")
    test.equal(status_code, 200)
    test.true("Example Domain" in text)

@test
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): the URL of the page to retrieve.

    Returns:
        status_code (integer): the HTTP status code of the response.
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """

    r = requests.get(url)
    return r.status_code, r.text

### TESTING retrieve_html: PASSED 2/2
###



Steam exposes its store reviews through a public web API that does not require an API key. To extract what we need, which are the reviews for various games, we can simply request different URLs.

More about Steam's web API can be found [here](https://partner.steamgames.com/doc/store/getreviews).

The web API lets us filter by positive or negative reviews, so we can obtain labeled data without labeling it ourselves. Currently, there are over 1.6 million apps, although not all of them necessarily have reviews.
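As a minimal sketch of how the filter maps onto the URL, the helper below builds a review URL for a given app ID. The function name `build_review_url` is hypothetical and introduced only for illustration; the `json` and `review_type` query parameters are the ones used in the example URL in this notebook.

```python
from urllib.parse import urlencode

# Review filters used in this notebook's URLs, plus "all" for no filtering.
VALID_REVIEW_TYPES = {"all", "positive", "negative"}


def build_review_url(app_id, review_type="all"):
    """Build a Steam store review URL for one app (hypothetical helper)."""
    if review_type not in VALID_REVIEW_TYPES:
        raise ValueError(f"unknown review_type: {review_type!r}")
    # json=1 asks the endpoint to respond with JSON.
    query = urlencode({"json": 1, "review_type": review_type})
    return f"https://store.steampowered.com/appreviews/{app_id}?{query}"


print(build_review_url(50, "negative"))
```

Calling it with app ID 50 and `"negative"` reproduces the URL used in the cells that follow.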

In [48]:
# https://store.steampowered.com/appreviews/50?json=1&review_type=negative

In [49]:
url = "https://store.steampowered.com/appreviews/50?json=1&review_type=negative"
r = requests.get(url)
data = r.json()

data

{'success': 1,
 'query_summary': {'num_reviews': 2},
 'reviews': [{'recommendationid': '148298468',
   'author': {'steamid': '76561199532044655',
    'num_games_owned': 0,
    'num_reviews': 3,
    'playtime_forever': 41,
    'playtime_last_two_weeks': 0,
    'playtime_at_review': 40,
    'last_played': 1697669378},
   'language': 'english',
   'review': 'YABBADA MY HALF LUNCHER IS NOT WORKING WHEN I PUT THE CODE IN YABABDA',
   'timestamp_created': 1697397516,
   'timestamp_updated': 1697397516,
   'voted_up': False,
   'votes_up': 2,
   'votes_funny': 0,
   'weighted_vote_score': '0.523809552192687988',
   'comment_count': 0,
   'steam_purchase': True,
   'received_for_free': False,
   'written_during_early_access': False,
   'hidden_in_steam_china': True,
   'steam_china_location': ''},
  {'recommendationid': '147535618',
   'author': {'steamid': '76561198079977656',
    'num_games_owned': 1149,
    'num_reviews': 98,
    'playtime_forever': 275,
    'playtime_last_two_weeks': 0,
  

In this example, the response contains 2 negative reviews in JSON format. To get only the review text, we can read the `review` field of each entry in `data['reviews']`.

In [55]:
# Print only the text of each review
for review in data['reviews']:
    print(review['review'])

YABBADA MY HALF LUNCHER IS NOT WORKING WHEN I PUT THE CODE IN YABABDA
the game is fun, but there's one enemy that just made me upset and i never had enough ammo to deal with them. They were just really annoying and they were the only thing i regretted after beating this game.

We know these are negative reviews because we filtered for negative reviews in the URL for the Steam web API. With this information, we can now programmatically get labeled reviews. For the sake of this project, we will simply get the first 5,000 positive and negative reviews and go through the apps in order starting from app ID 1.
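The collection step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: the names `fetch_reviews` and `collect_labeled_reviews` are hypothetical, the limit is treated as a per-label cap, only the first page of reviews per app is read, and the standard-library `urllib` is used for the request (`requests.get`, as used earlier, would work just as well).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://store.steampowered.com/appreviews/{app_id}?{query}"


def fetch_reviews(app_id, review_type):
    """Fetch the first page of reviews for one app (performs a network call)."""
    query = urlencode({"json": 1, "review_type": review_type})
    with urlopen(BASE_URL.format(app_id=app_id, query=query), timeout=10) as resp:
        payload = json.load(resp)
    # Apps without reviews may return no "reviews" entries.
    return payload.get("reviews", [])


def collect_labeled_reviews(app_ids, limit_per_label, fetch=fetch_reviews):
    """Return (text, label) pairs, at most limit_per_label for each label.

    Assumption: the limit applies separately to positive and negative reviews.
    """
    collected = {"positive": [], "negative": []}
    for app_id in app_ids:
        for label, bucket in collected.items():
            if len(bucket) >= limit_per_label:
                continue  # this label is already full
            for review in fetch(app_id, label):
                if len(bucket) >= limit_per_label:
                    break
                # The label comes from the URL filter, not from manual tagging.
                bucket.append((review["review"], label))
        if all(len(b) >= limit_per_label for b in collected.values()):
            break  # both labels are full, stop visiting apps
    return collected["positive"] + collected["negative"]
```

Passing the fetch function as a parameter keeps the loop easy to try out offline with a stand-in that returns canned reviews, before pointing it at the live endpoint with something like `collect_labeled_reviews(range(1, 1000), 5000)`.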

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- **Explore the data**
- Model the data
- Communicate/visualize the results

In [51]:
# data processing

In [52]:
# exploration/visualization

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- **Model the data**
- Communicate/visualize the results

The model we will use is a random forest classifier. Because it combines many decision trees, a random forest can capture more complex, non-linear relationships between words and sentiment than a simple linear model. It is a good choice when you want to explore a slightly more advanced model without diving too deep into complex algorithms.
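A minimal sketch of what such a model might look like with scikit-learn is shown below. Several things here are assumptions rather than project decisions: the reviews are toy examples invented for illustration (not Steam data), TF-IDF is just one possible way to turn text into features, and the hyperparameters are placeholder values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy reviews invented for illustration; 1 = positive, 0 = negative.
texts = [
    "great game, really fun and addictive",
    "loved the story and the music",
    "fun with friends, highly recommend",
    "crashes constantly, total waste of money",
    "boring and full of bugs",
    "terrible controls, do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer turns each review into word weights; the forest learns from them.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Predict labels for two unseen reviews.
preds = model.predict(["really fun game", "boring and full of crashes"])
print(preds)
```

Wrapping both steps in a pipeline means the same text preprocessing is applied at training and prediction time.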

In [53]:
# analysis/machine learning

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- Model the data
- **Communicate/visualize the results**

In [54]:
# insights

# Delete Later

Questions to answer:
- Does the project clearly identify the problem?
- Does the project clearly describe the relevant data and/or its collection?
- Does the project clearly explain how the data can be used to draw conclusions about the underlying system?
- Does the report clearly explain the work that was done?
- Is the project innovative or novel?
- Is the model built accurate enough?
- Does the project use techniques presented in the course (or clearly related to topics covered in the course) to understand and analyze the data for this problem?
- Does the report explain how this work fits around related work in this subject area?
- Does the report provide directions for further investigation?