# Understanding Your Customers: A Fresh Take on Analyzing Product Reviews

Aisha Al-Khaldi & Huda Joad

## Where are we in the data science pipeline?

- **Ask an interesting question**
- Get the data
- Explore the data
- Model the data
- Communicate/visualize the results

## Our Question

When customers want to share their thoughts about a product, they usually write a review and assign a rating to their experience (in our case, a boolean: positive or negative). We want to investigate **the relationship between the content of a review and its assigned rating**.

We will use the Steam Web API to collect reviews and perform sentiment analysis on them to answer our question.

## Where are we in the data science pipeline?

- Ask an interesting question
- **Get the data**
- Explore the data
- Model the data
- Communicate/visualize the results

### Data Collection

In [2]:
import requests
import pandas as pd
import time
from requests.exceptions import ConnectionError

Steam exposes a public web API for store reviews. We don't need an API key to extract what we need, which is the reviews for various games. Instead, we can simply request different URLs.

More about Steam's web API can be found [here](https://partner.steamgames.com/doc/store/getreviews).

The web API lets us filter by positive and negative reviews, so we can get labeled data without labeling it ourselves. Currently, there are over 1.6 million apps, although not all of them have reviews.
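As a small sketch of the URL-based filtering described above, the query string can be assembled with the standard library rather than typed by hand. The helper name `build_review_url` is hypothetical; `json` and `review_type` are the same parameters used in the request in this notebook, and this sketch only accepts the two labels we care about.

```python
from urllib.parse import urlencode

BASE_URL = "https://store.steampowered.com/appreviews"

def build_review_url(app_id, review_type):
    """Return the review URL for one app, filtered by review type.

    build_review_url is a hypothetical helper, not part of any Steam library.
    """
    if review_type not in ("positive", "negative"):
        raise ValueError("review_type must be 'positive' or 'negative'")
    # urlencode turns the dict into "json=1&review_type=..."
    query = urlencode({"json": 1, "review_type": review_type})
    return f"{BASE_URL}/{app_id}?{query}"

print(build_review_url(50, "negative"))
```

Building the URL from a dictionary makes it harder to mistype a parameter when the same request is repeated for thousands of app IDs.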

In [4]:
url = "https://store.steampowered.com/appreviews/50?json=1&review_type=negative"
r = requests.get(url)
data = r.json()

data

{'success': 1,
 'query_summary': {'num_reviews': 1},
 'reviews': [{'recommendationid': '150455924',
   'author': {'steamid': '76561198053624818',
    'num_games_owned': 0,
    'num_reviews': 156,
    'playtime_forever': 357,
    'playtime_last_two_weeks': 357,
    'playtime_at_review': 357,
    'last_played': 1700459686},
   'language': 'english',
   'timestamp_created': 1700430077,
   'timestamp_updated': 1700460073,
   'voted_up': False,
   'votes_up': 1,
   'votes_funny': 0,
   'weighted_vote_score': '0.523809552192687988',
   'comment_count': 0,
   'steam_purchase': True,
   'received_for_free': False,
   'written_during_early_access': False,
   'hidden_in_steam_china': True,
   'steam_china_location': ''}],
 'cursor': 'AoIIPwYYanTn+L0E'}

In this example, we get 1 negative review in JSON format. To print just the review text, let's try this.

In [5]:
# Print the text of each review returned by the API
for review in data['reviews']:
    print(review['review'])

The gameplay is great, but not as tight as Half Life. Considering Randy Pitchford made it, Opposing Force is surprising competent. Though the last few hours feel like blatant filler, the boss fight is the biggest waste of time.

The new weapons are welcome, but the squad concept is half-baked. Other than occasionally healing you, they are there solely to trigger scripted events. The rope physics are utterly broken, but that might just be a consequence of running this over 30 fps. The major problem is that the night-vision filter is the same color as the reticle and the weapon selection menu, so in the dark you not only can't aim, but you also can't see what weapon you are switching to, and this game has a lot of dark spaces. Gearbox didn't do basic QA testing evidently.

We know these are negative reviews because we filtered for negative reviews in the request URL. With this information, we can now programmatically collect labeled reviews. For this project, we will collect at least 5,000 positive and 5,000 negative reviews, going through the apps in order starting from app ID 1.
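The response shown earlier includes a `cursor` field, and each request returns a single batch of reviews. The sketch below shows one way the cursor could be used to walk through several batches for one app. It assumes, following Steam's getreviews documentation, that the first request passes `cursor=*` and each later request passes the cursor returned by the previous response. The names `page_url`, `collect_reviews`, and `fake_fetch` are hypothetical, and the page fetcher is injected so the loop can be tried without network access.

```python
from urllib.parse import urlencode

def page_url(app_id, review_type, cursor="*"):
    """Build the URL for one batch; urlencode escapes characters such as '+' in the cursor."""
    query = urlencode({"json": 1, "review_type": review_type, "cursor": cursor})
    return f"https://store.steampowered.com/appreviews/{app_id}?{query}"

def collect_reviews(app_id, review_type, fetch_page, max_pages=5):
    """Follow the cursor until a batch is empty, the cursor repeats, or max_pages is reached.

    fetch_page is any callable that takes a URL and returns the parsed JSON dict.
    """
    reviews, cursor, seen = [], "*", set()
    for _ in range(max_pages):
        data = fetch_page(page_url(app_id, review_type, cursor))
        batch = data.get("reviews", [])
        if not batch:
            break
        reviews.extend(r["review"] for r in batch)
        cursor = data.get("cursor")
        if not cursor or cursor in seen:
            break  # No further pages, or the API returned a cursor we already used
        seen.add(cursor)
    return reviews

# Fake two-page API used only to exercise the loop (made-up data).
def fake_fetch(url):
    if "cursor=%2A" in url:
        return {"reviews": [{"review": "bad"}], "cursor": "abc+1"}
    if "cursor=abc%2B1" in url:
        return {"reviews": [{"review": "worse"}], "cursor": "abc+1"}
    return {"reviews": []}

print(collect_reviews(50, "negative", fake_fetch))
```

In the real notebook, `fetch_page` could be something like `lambda u: requests.get(u).json()`. Paging this way would gather more reviews per app, at the cost of concentrating the dataset on fewer games.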

**Warning:** Do not run the cell below. It was used to generate the CSV file and takes more than 6 hours to run. You can skip it, since we have already provided the CSV for you.

In [6]:
# Fetch the first batch of reviews from the Steam API for a given game ID and review type (positive or negative)
def fetch_reviews(game_id, review_type, max_retries=5):
    url = f"https://store.steampowered.com/appreviews/{game_id}?json=1&review_type={review_type}"
    # Retry in a bounded loop instead of recursing, so repeated failures cannot exhaust the call stack
    for attempt in range(max_retries):
        try:
            r = requests.get(url)  # Make an HTTP GET request to the API
            if r.status_code == 200:
                data = r.json()  # Parse the JSON response
                # .get() guards against responses that have no 'reviews' key
                return [review['review'] for review in data.get('reviews', [])]
            return []  # Non-200 response: treat as no reviews for this game ID
        except ConnectionError:
            print(f"Connection error for game ID {game_id}. Retrying...")
            time.sleep(5)  # Wait for 5 seconds before retrying
    return []  # Give up on this game ID after max_retries failed attempts

# Initialize lists to store positive and negative reviews
positive_reviews = []
negative_reviews = []
game_id = 1  # Start from the first game ID

# Loop until 5000 positive and 5000 negative reviews are collected
while len(positive_reviews) < 5000 or len(negative_reviews) < 5000:
    print(f"Game ID: {game_id}")
    if len(positive_reviews) < 5000:
        # Fetch and add positive reviews for the current game ID
        positive_reviews.extend(fetch_reviews(game_id, 'positive'))
        print(f"    Total Positive Reviews: {len(positive_reviews)}")

    if len(negative_reviews) < 5000:
        # Fetch and add negative reviews for the current game ID
        negative_reviews.extend(fetch_reviews(game_id, 'negative'))
        print(f"    Total Negative Reviews: {len(negative_reviews)}")

    game_id += 1  # Increment the game ID for the next iteration
    time.sleep(0.2)  # Pause for 0.2 seconds to avoid hitting the rate limit

# Combine the positive and negative reviews into a single list with their corresponding sentiment labels
reviews_data = [(review, 'positive') for review in positive_reviews] + \
               [(review, 'negative') for review in negative_reviews]

# Create a DataFrame from the combined review data
df = pd.DataFrame(reviews_data, columns=['review', 'sentiment'])

# Save the DataFrame to a CSV file, without the index and using UTF-8 encoding
df.to_csv('steam_reviews.csv', index=False, encoding='utf-8')

Game ID: 1
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 2
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 3
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 4
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 5
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 6
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 7
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 8
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 9
    Total Positive Reviews: 0
    Total Negative Reviews: 0
Game ID: 10
    Total Positive Reviews: 20
    Total Negative Reviews: 3
Game ID: 11
    Total Positive Reviews: 20
    Total Negative Reviews: 3
Game ID: 12
    Total Positive Reviews: 20
    Total Negative Reviews: 3
Game ID: 13
    Total Positive Reviews: 20
    Total Negative Reviews: 3
Game ID: 14
    Total Positive Reviews: 20
    Total Negative Reviews

In [7]:
reviews_df = pd.read_csv('steam_reviews.csv', encoding='utf-8')

In [8]:
reviews_df

| | review | sentiment |
|---|---|---|
| 0 | if ur tired of cs2 come back to 2000 and play ... | positive |
| 1 | Counter-Strike 1.6 was a significant part of m... | positive |
| 2 | rather pay for this than winrar | positive |
| 3 | Better then CS2 | positive |
| 4 | Every school that i have attended, had this in... | positive |
| ... | ... | ... |
| 9998 | Its just a terrible clone of katamari. And IMO... | negative |
| 9999 | TL:DR this game was made 12 year old who got a... | negative |
| 10000 | Insultingly bad Katamari Damacy rip-off. | negative |
| 10001 | The poor man's Katamari Damacy. | negative |


In [9]:
reviews_df.sentiment.value_counts()

negative    5003
positive    5000
Name: sentiment, dtype: int64

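Now we have 10,003 total reviews from 15,970 apps. Having a roughly equal number of examples for each label is important because a classifier trained on imbalanced data can reach high accuracy simply by favoring the majority class, which makes accuracy a misleading measure of how well it separates positive from negative reviews.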
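To make the point about label balance concrete, the sketch below computes the accuracy of a trivial "model" that always predicts the most common label. The balanced counts are the ones reported above; the imbalanced counts are made up for comparison, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Label counts from this notebook: 5,003 negative and 5,000 positive.
balanced = ["negative"] * 5003 + ["positive"] * 5000
# Hypothetical imbalanced dataset (made-up counts), for comparison only.
imbalanced = ["negative"] * 500 + ["positive"] * 9500

print(round(majority_baseline_accuracy(balanced), 3))
print(round(majority_baseline_accuracy(imbalanced), 3))
```

On our data the trivial baseline scores about 50%, so any accuracy well above that reflects something the model learned. On the made-up imbalanced set the same baseline would already score 95% without looking at a single word.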

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- **Explore the data**
- Model the data
- Communicate/visualize the results

### Data Processing

In [19]:
reviews_df.isna().sum()

review       5
sentiment    0
dtype: int64

Since we have enough reviews, we can simply drop the rows with null values.

In [24]:
reviews_df.dropna(inplace=True)
reviews_df.isna().sum()

review       0
sentiment    0
dtype: int64

In [25]:
reviews_df.shape

(9998, 2)

### Exploration/Visualization

In [None]:
# code
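The cell above is still a placeholder. As one possible starting point, and not the analysis this project necessarily ends up with, the sketch below compares review length across the two labels. The sample rows are illustrative stand-ins for `(review, sentiment)` pairs from `reviews_df`, and `length_summary` is a hypothetical helper.

```python
from statistics import mean, median

# Illustrative sample rows standing in for (review, sentiment) pairs from reviews_df.
sample = [
    ("Great game, still fun after all these years", "positive"),
    ("rather pay for this than winrar", "positive"),
    ("Crashes on startup and the controls are broken", "negative"),
    ("Boring", "negative"),
]

def length_summary(rows):
    """Return {sentiment: (mean word count, median word count)}."""
    lengths = {}
    for review, sentiment in rows:
        # Whitespace-separated word count is a rough proxy for review length
        lengths.setdefault(sentiment, []).append(len(review.split()))
    return {s: (mean(v), median(v)) for s, v in lengths.items()}

for sentiment, (avg, med) in length_summary(sample).items():
    print(f"{sentiment}: mean={avg:.1f} words, median={med:.1f} words")
```

If the two labels differ strongly in length on the real data, that would be worth noting before modeling, since length alone could then act as a signal.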

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- **Model the data**
- Communicate/visualize the results

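The model we will be using is a Random Forest Classifier. A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data, that predicts by majority vote. Because each tree splits on combinations of features, the forest can capture interactions between words that a single linear model would miss. It's a good choice when you want to explore a slightly more advanced model without diving too deep into complex algorithms.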

### Analysis/Machine Learning

In order to model the data, first we want to tokenize it, similar to what we did in assignment 3 in this course.

#### Text Preprocessing

In [None]:
# code
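A minimal tokenization sketch, assuming a simple approach: lowercase the text, keep runs of letters, digits and apostrophes, and drop a few common stopwords. The stopword list here is a tiny illustrative one, the function name `tokenize` is hypothetical, and the approach from the course assignment may differ.

```python
import re

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "it", "this", "for"}

def tokenize(text):
    """Lowercase the text, keep alphanumeric runs (and apostrophes), drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("The poor man's Katamari Damacy."))
```

Keeping apostrophes inside tokens preserves words such as "man's" and "don't" as single units, which matters for negations in sentiment analysis.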

#### Feature Extraction

In [None]:
# code
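One common way to turn tokens into model features is a bag-of-words count matrix: one column per distinct word, one row per review. The sketch below is a plain-Python illustration of that idea under the assumption that reviews are already tokenized; `build_vocabulary` and `vectorize` are hypothetical names, and a library vectorizer could be used instead.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocabulary):
    """Return a list of counts, one per vocabulary entry; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]

# Two small tokenized examples (illustrative).
docs = [["better", "then", "cs2"], ["terrible", "clone", "terrible"]]
vocab = build_vocabulary(docs)
matrix = [vectorize(d, vocab) for d in docs]
print(vocab)
print(matrix)
```

The vocabulary should be built from the training reviews only, so that the test set does not leak information into the features.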

#### Split the data

In [None]:
# code
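As a sketch of the split step, the function below holds out a fraction of the rows for testing while keeping the positive/negative proportions the same in both parts (a stratified split). The 80/20 ratio, the seed, and the name `stratified_split` are assumptions made for illustration.

```python
import random

def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (features, label) rows into train/test, keeping label proportions."""
    rng = random.Random(seed)  # Fixed seed so the split is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Made-up rows: 50 positive and 50 negative placeholders.
rows = [(f"review {i}", "positive") for i in range(50)] + \
       [(f"review {i}", "negative") for i in range(50, 100)]
train, test = stratified_split(rows)
print(len(train), len(test))
```

Evaluating on rows the model never saw during training is what makes the reported accuracy an estimate of performance on new reviews.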

#### Train the Random Forest classifier

In [None]:
# code
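To illustrate the idea behind a random forest, here is a toy version in plain Python. It is a simplified stand-in, not a full implementation: each "tree" is a single split (a decision stump) on whether one word is present, trained on a bootstrap sample with a random subset of features, and the forest predicts by majority vote. Real random forests grow deeper trees and re-sample features at every split; in practice a library implementation such as scikit-learn's `RandomForestClassifier` is the usual choice. All function names and the data below are made up for illustration.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y, features):
    """Pick the feature (word present or absent) that misclassifies the fewest rows."""
    fallback = majority(y)
    best = None
    for f in features:
        present = [label for row, label in zip(X, y) if row[f] > 0]
        absent = [label for row, label in zip(X, y) if row[f] == 0]
        if_present = majority(present) if present else fallback
        if_absent = majority(absent) if absent else fallback
        errors = (sum(l != if_present for l in present)
                  + sum(l != if_absent for l in absent))
        if best is None or errors < best[0]:
            best = (errors, f, if_present, if_absent)
    return best[1:]  # (feature index, label if present, label if absent)

def train_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample with a random subset of features."""
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    k = max(1, int(n_features ** 0.5))  # Features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]  # Sample rows with replacement
        features = rng.sample(range(n_features), k)
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [if_present if row[f] > 0 else if_absent
             for f, if_present, if_absent in forest]
    return majority(votes)

# Made-up count vectors over a hypothetical vocabulary ["fun", "great", "boring", "broken"].
X = [[1, 1, 0, 0]] * 4 + [[0, 0, 1, 1]] * 4
y = ["positive"] * 4 + ["negative"] * 4
forest = train_forest(X, y)
print(predict(forest, [2, 1, 0, 0]), predict(forest, [0, 0, 1, 3]))
```

Averaging many trees that each see slightly different data is what makes the ensemble more stable than any single tree.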

## Where are we in the data science pipeline?

- Ask an interesting question
- Get the data
- Explore the data
- Model the data
- **Communicate/visualize the results**

#### Evaluate the model

In [None]:
# code
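For the evaluation step, accuracy alone can hide which kind of mistake the model makes, so precision and recall are often reported alongside it. The sketch below computes these from two label lists; the labels and predictions are made up for illustration, and `evaluate` is a hypothetical helper rather than a library function.

```python
def evaluate(y_true, y_pred, positive_label="positive"):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels and predictions, for illustration only.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "positive", "negative", "negative",
          "negative", "positive", "negative", "positive"]
print(evaluate(y_true, y_pred))
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many did the model find?".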

#### Try using the model to predict the sentiment of new reviews

In [None]:
# code

### Insights
