# Understanding Your Customers: A Fresh Take on Analyzing Product Reviews

Aisha Al-Khaldi & Huda Joad

## Where are we in the data science pipeline?

- **Ask an interesting question**
- Get the data
- Explore the data
- Model the data
- Communicate/visualize the results

## Our Question

Customers usually share their thoughts about a product by writing a review and assigning a rating to their experience (typically a numerical rating from 1 to 5). We want to explore **the relationship between the content of a review (whether it is positive, negative, or neutral) and the rating the customer assigned**.

We will use the Steam web API and perform sentiment analysis on the reviews to answer our question.

## Where are we in the data science pipeline?

- Ask an interesting question
- **Get the data**
- Explore the data
- Model the data
- Communicate/visualize the results

In [1]:
# data collection

In [2]:
import requests
import pandas as pd
import time

Steam exposes its reviews through a public web API. For what we need, which is the reviews for various games, we don't need an API key; we can simply request different URLs.

More about Steam's web API can be found [here](https://partner.steamgames.com/doc/store/getreviews).

The web API lets us filter by positive and negative reviews, so we can get labeled data without labeling it ourselves. Currently, there are over 1.6 million apps, although not all of them have reviews.
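As a small illustration of that filtering, the query string can be assembled from a dictionary of parameters instead of being typed by hand. This is only a sketch: `build_reviews_url` is a hypothetical helper name, and the `language` parameter comes from the Steam documentation linked above rather than from the requests made in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://store.steampowered.com/appreviews"

def build_reviews_url(app_id, review_type="all", language="english"):
    """Return the review endpoint URL for one app (hypothetical helper)."""
    # json=1 asks the endpoint for a JSON response, as in the URL used in this notebook.
    params = {"json": 1, "review_type": review_type, "language": language}
    return f"{BASE_URL}/{app_id}?{urlencode(params)}"

print(build_reviews_url(50, "negative"))
```

Keeping the parameters in a dictionary makes it easy to switch between `positive` and `negative` without editing the URL string.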

In [3]:
# https://store.steampowered.com/appreviews/50?json=1&review_type=negative

In [4]:
url = "https://store.steampowered.com/appreviews/50?json=1&review_type=negative"
r = requests.get(url)
data = r.json()

data

{'success': 1,
 'query_summary': {'num_reviews': 2},
 'reviews': [{'recommendationid': '148298468',
   'author': {'steamid': '76561199532044655',
    'num_games_owned': 0,
    'num_reviews': 3,
    'playtime_forever': 41,
    'playtime_last_two_weeks': 0,
    'playtime_at_review': 40,
    'last_played': 1697669378},
   'language': 'english',
   'review': 'YABBADA MY HALF LUNCHER IS NOT WORKING WHEN I PUT THE CODE IN YABABDA',
   'timestamp_created': 1697397516,
   'timestamp_updated': 1697397516,
   'voted_up': False,
   'votes_up': 2,
   'votes_funny': 0,
   'weighted_vote_score': '0.523809552192687988',
   'comment_count': 0,
   'steam_purchase': True,
   'received_for_free': False,
   'written_during_early_access': False,
   'hidden_in_steam_china': True,
   'steam_china_location': ''},
  {'recommendationid': '147535618',
   'author': {'steamid': '76561198079977656',
    'num_games_owned': 1150,
    'num_reviews': 98,
    'playtime_forever': 275,
    'playtime_last_two_weeks': 0,
  

In this example, the response contains 2 negative reviews in JSON format. To print only the review text, let's try this.

In [5]:
for review in data['reviews']:
    print(review['review'])

YABBADA MY HALF LUNCHER IS NOT WORKING WHEN I PUT THE CODE IN YABABDA
the game is fun, but there's one enemy that just made me upset and i never had enough ammo to deal with them. They were just really annoying and they were the only thing i regretted after beating this game.
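Each review in the response also carries a `voted_up` flag, as the JSON output above shows, so a label can be read from the payload itself. A minimal sketch, where `label_reviews` and the `sample` dictionary are made up for illustration:

```python
def label_reviews(data):
    """Pair each review text with a label derived from its voted_up flag."""
    labeled = []
    for review in data.get("reviews", []):
        sentiment = "positive" if review.get("voted_up") else "negative"
        labeled.append((review["review"], sentiment))
    return labeled

# Tiny made-up payload in the same shape as the API response (not real data).
sample = {
    "success": 1,
    "reviews": [
        {"review": "great game", "voted_up": True},
        {"review": "crashes on launch", "voted_up": False},
    ],
}
print(label_reviews(sample))
```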

We know these are negative reviews because we filtered for negative reviews in the URL for the Steam web API. With all this information, we can now programmatically get labeled reviews. For the sake of this project, we will simply collect a minimum of the first 5,000 positive and 5,000 negative reviews, going through the apps in order starting from app ID 1.

Edit: We did not end up using this method. The reason is mentioned below.

In [7]:
# Function to fetch reviews from Steam API for a given game ID and review type (positive or negative)
def fetch_reviews(game_id, review_type):
    # Construct the API URL with the game ID and review type
    url = f"https://store.steampowered.com/appreviews/{game_id}?json=1&review_type={review_type}"
    try:
        r = requests.get(url, timeout=10)  # Make an HTTP GET request; give up after 10 seconds
    except requests.RequestException:
        return []  # Return an empty list if the request cannot be completed

    # If the request is successful, process the response
    if r.status_code == 200:
        data = r.json()  # Parse the JSON response
        # Return a list of review texts (empty if the response has no 'reviews' key)
        return [review['review'] for review in data.get('reviews', [])]
    else:
        return []  # Return an empty list if the request fails

# Initialize lists to store positive and negative reviews
positive_reviews = []
negative_reviews = []
game_id = 1  # Start from the first game ID

# Loop until 5000 positive and 5000 negative reviews are collected
while len(positive_reviews) < 5000 or len(negative_reviews) < 5000:
    if len(positive_reviews) < 5000:
        # Fetch and add positive reviews for the current game ID
        positive_reviews.extend(fetch_reviews(game_id, 'positive'))

    if len(negative_reviews) < 5000:
        # Fetch and add negative reviews for the current game ID
        negative_reviews.extend(fetch_reviews(game_id, 'negative'))

    game_id += 1  # Increment the game ID for the next iteration
    time.sleep(0.2)  # Pause for 0.2 seconds to avoid hitting the rate limit

# Combine the positive and negative reviews into a single list with their corresponding sentiment labels
reviews_data = [(review, 'positive') for review in positive_reviews] + \
               [(review, 'negative') for review in negative_reviews]

# Create a DataFrame from the combined review data
df = pd.DataFrame(reviews_data, columns=['review', 'sentiment'])

# Save the DataFrame to a CSV file, without the index and using UTF-8 encoding
df.to_csv('steam_reviews.csv', index=False, encoding='utf-8')

This method would run for 8+ hours, so instead we'll try using the `steamreviews` package. More information about it can be found [here](https://pypi.org/project/steamreviews/).
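One reason the loop above is slow is that it only reads the first page of reviews for each app, which the Steam documentation describes as 20 reviews by default. The same documentation describes a `cursor` parameter (starting at `*`) and a `num_per_page` parameter (up to 100) for paging through one app's reviews. The sketch below shows that paging logic under those assumptions. `collect_reviews`, `make_steam_fetcher`, and the `fetch_page` callable are hypothetical names, and this is not the code used to build `steam_reviews.csv`.

```python
def collect_reviews(fetch_page, limit):
    """Follow cursors until `limit` review texts are collected or pages run out.

    `fetch_page(cursor)` must return a dict shaped like the API response,
    with a "reviews" list and a "cursor" string for the next page.
    """
    texts = []
    cursor = "*"          # documented starting cursor
    seen_cursors = set()  # guards against the API repeating a cursor
    while len(texts) < limit and cursor not in seen_cursors:
        seen_cursors.add(cursor)
        page = fetch_page(cursor)
        reviews = page.get("reviews", [])
        if not reviews:
            break
        texts.extend(review["review"] for review in reviews)
        cursor = page.get("cursor")
        if not cursor:
            break
    return texts[:limit]


def make_steam_fetcher(app_id, review_type):
    """Build a fetch_page callable backed by the live endpoint (untested sketch)."""
    import requests  # imported lazily so collect_reviews works without it

    url = f"https://store.steampowered.com/appreviews/{app_id}"

    def fetch_page(cursor):
        params = {"json": 1, "review_type": review_type,
                  "num_per_page": 100, "cursor": cursor}
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()

    return fetch_page


# Offline demonstration with made-up pages instead of real requests.
fake_pages = {
    "*": {"reviews": [{"review": "a"}, {"review": "b"}], "cursor": "page2"},
    "page2": {"reviews": [{"review": "c"}], "cursor": "page3"},
    "page3": {"reviews": [], "cursor": "page3"},
}
print(collect_reviews(lambda cursor: fake_pages[cursor], limit=5))
```

Passing the page fetcher in as a callable keeps the paging logic testable without network access; a live run would use something like `collect_reviews(make_steam_fetcher(10, "negative"), 5000)`.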

In [None]:
# !pip install steamreviews

In [None]:
# import steamreviews

In [None]:
# request_params_test = {"language": "english", "review_type": "negative"}
# review_dict_test = steamreviews.download_reviews_for_app_id(10, chosen_request_params=request_params_test)
# print(review_dict_test)

In [None]:
# i = 1
# for review in list(review_dict_test[0]['reviews'].values()):
#     print(review['review'])
#     print(i)
#     i += 1
# # list(review_dict_test[0]['reviews'].values())[0]['review']

So with `steamreviews`, we have to extract each review as shown above.

The following code is how we generated the steam_reviews.csv file. It is commented out because it takes about 10 minutes to run.

In [None]:
# positive_reviews = []
# negative_reviews = []
# game_id = 1

# def extract_reviews(review_dict):
#     # Extract the review text from each review in the dictionary
#     # (.values() yields the review dicts; .items() would yield (id, review) tuples)
#     return [review['review'] for review in review_dict[0]['reviews'].values()]

# while len(positive_reviews) < 5000 or len(negative_reviews) < 5000:
#     request_params = dict()
#     request_params['language'] = 'english'
    
#     if len(positive_reviews) < 5000:
#         request_params['review_type'] = 'positive'
#         review_dict = steamreviews.download_reviews_for_app_id(game_id, chosen_request_params=request_params)
#         positive_reviews.extend(extract_reviews(review_dict))

#     if len(negative_reviews) < 5000:
#         request_params['review_type'] = 'negative'
#         review_dict = steamreviews.download_reviews_for_app_id(game_id, chosen_request_params=request_params)
#         negative_reviews.extend(extract_reviews(review_dict))

#     game_id += 1

# # Determine the minimum number of reviews between positive and negative sets
# min_reviews_count = min(len(positive_reviews), len(negative_reviews))

# # Create DataFrame, slicing both lists to the minimum count
# reviews_data = [(review, 'positive') for review in positive_reviews[:min_reviews_count]] + \
#                [(review, 'negative') for review in negative_reviews[:min_reviews_count]]

# df = pd.DataFrame(reviews_data, columns=['review', 'sentiment'])

# df.to_csv('steam_reviews.csv', index=False, encoding='utf-8')

In [None]:
# positive_reviews = []
# negative_reviews = []
# game_id = 1

# def extract_reviews(review_dict):
#     # Assuming review_dict[0]['reviews'] is a dictionary
#     # Extract the review text from each review using the specified method
#     # list(review_dict_test[0]['reviews'].values())[0]['review']
#     ret_reviews = []
#     for review in list(review_dict[0]['reviews'].values()):
#         ret_reviews.append(review['review'])
#         print(len(ret_reviews))
#     # reviews = [review['review'] for review in list(review_dict[0]['reviews'].values())]
#     return ret_reviews

# while len(negative_reviews) < 1:# len(positive_reviews) < 1 or len(negative_reviews) < 1:
#     request_params = dict()
#     request_params['language'] = 'english'
    
#     # if len(positive_reviews) < 1:
#     #     request_params['review_type'] = 'positive'
#     #     print(request_params)
#     #     review_dict = steamreviews.download_reviews_for_app_id(game_id, chosen_request_params=request_params)
#     #     print(len(extract_reviews(review_dict)))
#     #     positive_reviews.extend(extract_reviews(review_dict))

#     if len(negative_reviews) < 1:
#         request_params['review_type'] = 'negative'
#         print(request_params)
#         review_dict = {}
#         review_dict = steamreviews.download_reviews_for_app_id(game_id, chosen_request_params={"language": "english", "review_type": "negative"})
#         print(len(review_dict[0]['reviews'].values()))
#         negative_reviews.extend(extract_reviews(review_dict))

#     game_id += 1

# # Determine the minimum number of reviews between positive and negative sets
# min_reviews_count = min(len(positive_reviews), len(negative_reviews))

# # Create DataFrame, slicing both lists to the minimum count
# reviews_data = [(review, 'positive') for review in positive_reviews[:min_reviews_count]] + \
#                [(review, 'negative') for review in negative_reviews[:min_reviews_count]]

# df = pd.DataFrame(reviews_data, columns=['review', 'sentiment'])

# df.to_csv('steam_reviews.csv', index=False, encoding='utf-8')

In [8]:
reviews_df = pd.read_csv('steam_reviews.csv', encoding='utf-8')

In [9]:
reviews_df

Unnamed: 0,review,sentiment
0,if ur tired of cs2 come back to 2000 and play ...,positive
1,Counter-Strike 1.6 was a significant part of m...,positive
2,rather pay for this than winrar,positive
3,old but gold,positive
4,Better then CS2,positive
...,...,...
8016,"Because of motherfucking SecuROM, this motherf...",negative
8017,Want the game to work on a modern PC?\n\nStep ...,negative
8018,Does not run on windows 10,negative
8019,Stupid games for windows account and online re...,negative


In [None]:
reviews_df.sentiment.value_counts()

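Now we have 27,705 reviews for both the positive and negative labels. Providing an equal number of examples for each label is important because a classifier trained on imbalanced data tends to favor the majority label, and accuracy then becomes a misleading measure of performance.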
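To make the balancing step concrete, the sketch below downsamples every label to the size of the smallest one. It works on a list of `(review, sentiment)` tuples like `reviews_data` in the collection code. `balance_by_label` and its `seed` parameter are hypothetical, and random downsampling is only one option (the collection code simply slices both lists to the same length).

```python
import random
from collections import defaultdict

def balance_by_label(rows, seed=0):
    """Downsample every label to the size of the rarest label."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    if not by_label:
        return []
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

# Made-up rows: three positive reviews and one negative review.
rows = [("good", "positive"), ("great", "positive"),
        ("fun", "positive"), ("broken", "negative")]
balanced = balance_by_label(rows)
print(len(balanced))
```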

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- **Explore the data**
- Model the data
- Communicate/visualize the results

In [None]:
# data processing

In [None]:
# exploration/visualization

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- **Model the data**
- Communicate/visualize the results

The model we will be using is a Random Forest Classifier. A random forest combines many decision trees, so it can capture non-linear relationships between words and sentiment. It's a good choice when you want to explore a slightly more advanced model without diving too deep into complex algorithms.

In [None]:
# analysis/machine learning

In order to model the data, first we want to tokenize it, similar to what we did in assignment 3 in this course.
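As a rough illustration of what tokenizing means here, the sketch below lowercases a review, splits it into word tokens with a regular expression, and counts them. The `tokenize` helper and the regular expression are assumptions for this example, not the approach from the assignment.

```python
import re
from collections import Counter

# Runs of lowercase letters, digits, and apostrophes count as one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Better then CS2")
print(tokens)

# Token counts are the raw material for bag-of-words features.
print(Counter(tokenize("old but gold, old but fun")))
```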

In [None]:
# tokenize etc

In [None]:
# build random forest classifier
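A minimal sketch of how such a classifier could be wired up with scikit-learn, assuming that library is installed. The bag-of-words features (`CountVectorizer`), the hyperparameters, and the toy reviews are all assumptions for illustration. They are not our final model or data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy, made-up reviews; the real input would be the review and sentiment columns.
texts = ["great game love it", "fun and great", "love the fun levels",
         "broken and boring", "crashes boring mess", "broken crashes a lot"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

# CountVectorizer turns each review into word counts; the forest learns from those counts.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(texts, labels)

predictions = model.predict(["great fun", "boring and broken"])
print(list(predictions))
```

Wrapping the vectorizer and the forest in one pipeline means the same preprocessing is applied at training and prediction time.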

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- Model the data
- **Communicate/visualize the results**

In [None]:
# insights
