You made it through another section — excellent work! This cumulative lab will return to the Amazon product review dataset and allow you to flex your new skills.
You will be able to:
- Recall what you learned in the previous section
- Practice writing loops to pull multiple pieces of data from a dataset
- Practice writing functions for organization and avoiding repetition
Once again, we are going to be working with data collected by Computer Science researchers at the University of California, San Diego. Their full paper citation is here:
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019.
We are still using a cleaned-up, coffee-specific, sample version of their full dataset.
In some cases, we will write the function signature for you, e.g.
def review_sentiment(review):
    # Replace None with appropriate code
    None
Then you just need to fill in the relevant logic.
In other cases, you will need to write the function signature yourself, e.g.
# Your code here
While reusing some code from the previous cumulative lab, write code to loop over all of the records in the dataset to summarize its contents, specifically in terms of overall review sentiment and the years when the reviews were written.
Provide a sample of records that meet particular criteria.
Refactor the code from the previous cumulative lab so that it is contained in a function and prompts the user to select which review to summarize.
Once again, we've opened up the dataset and loaded it into a list of dictionaries called reviews.
# Run this cell without changes
import json
with open("coffee_product_reviews.json") as f:
    reviews = json.load(f)
type(reviews)
Previously, we found the length of the collection and looked into the data types of each record's keys and values:
# Run this cell without changes
num_reviews = len(reviews)
print("The coffee product review dataset contains {} reviews".format(num_reviews))
first_review = reviews[0]
first_review
# Run this cell without changes
first_review.keys()
# Run this cell without changes
first_review.values()
This time, let's do something a bit more sophisticated. Specifically:
- Count of positive, negative, and neutral reviews
- List of years contained in the dataset
Previously, we wrote something like this code to determine whether a specific review was positive, negative, or neutral:
# Run this cell without changes
selected_review = reviews[2]
selected_rating = selected_review["rating"]
if selected_rating >= 4:
    print("This is a positive review")
elif selected_rating <= 2:
    print("This is a negative review")
else:
    print("This is a neutral review")
Now, rewrite that code as a function review_sentiment, which takes in a review dictionary as an argument, and returns the string "positive", "negative", or "neutral":
def review_sentiment(review):
    # Replace None with appropriate code
    None
# Run this cell without changes
review_sentiment(reviews[2]) # 'positive'
# Run this cell without changes
review_sentiment(reviews[4]) # 'negative'
# Run this cell without changes
review_sentiment(reviews[47]) # 'neutral'
Ok, this is already much cleaner than copying and pasting that if/elif/else sequence like we did before!
Now, write a function to loop over all of the reviews in the list, and count how many are positive, negative, and neutral.
The function should be called get_sentiment_counts, take one argument (the list of reviews), and return a dictionary containing the counts. A counter dictionary has been initialized for you with "positive", "negative", and "neutral" as the keys and values starting at 0.
def get_sentiment_counts(review_list):
    sentiment_counts = {
        "positive": 0,
        "negative": 0,
        "neutral": 0
    }
    # Your code here
    return sentiment_counts
get_sentiment_counts(reviews) # {'positive': 67, 'negative': 15, 'neutral': 4}
This spread of sentiments seems reasonable. There is a well-known skew towards positive reviews in general, similar to "grade inflation", and people with neutral opinions are less likely to write reviews in the first place.
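The counter-dictionary pattern used above works for any set of categories. Here is a minimal sketch on toy data, assuming hypothetical records with a "size" key (the names orders and get_size_counts are invented for illustration and are not part of the lab):

```python
# Hypothetical toy data: each record is a dictionary with a "size" key
orders = [
    {"size": "small"},
    {"size": "large"},
    {"size": "small"},
    {"size": "medium"},
    {"size": "small"},
]


def get_size_counts(order_list):
    # Start every category at 0 so each key exists before the loop runs
    size_counts = {"small": 0, "medium": 0, "large": 0}
    for order in order_list:
        # Use the record's category as the key and add 1 to its count
        size_counts[order["size"]] += 1
    return size_counts


print(get_size_counts(orders))  # {'small': 3, 'medium': 1, 'large': 1}
```

Initializing the keys up front means a category with no records still shows up in the result with a count of 0.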
Previously, we wrote something like this code to extract the year of a review from the review dictionary:
# Run this cell without changes
selected_review = reviews[2]
selected_review_time = selected_review["review_time"]
selected_review_year = int(selected_review_time[-4:])
selected_review_year
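The slice [-4:] in the cell above takes the last four characters of a string. A minimal sketch of that step in isolation, assuming a hypothetical date string (the dataset's actual date format is not assumed here, only that the string ends with a four-digit year):

```python
# Hypothetical date string, invented for illustration
example_time = "June 5, 2017"

# A negative start index counts from the end of the string
last_four = example_time[-4:]
print(last_four)        # 2017
print(type(last_four))  # <class 'str'>

# The slice is still text, so convert it before doing arithmetic
year = int(last_four)
print(year + 1)         # 2018
```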
Now, rewrite that code as a function review_year, which takes in a review dictionary as an argument, and returns the year as an integer:
def review_year(review):
    # Replace None with appropriate code
    None
# Run this cell without changes
review_year(reviews[2]) # 2017
# Run this cell without changes
review_year(reviews[4]) # 2017
# Run this cell without changes
review_year(reviews[47]) # 2015
Now, write a function called get_years to loop over all of the reviews in the review list and create a list of the years you find. Each year should only appear once, in ascending order. The function should accept one argument (review_list) and should return a list of integers representing the years.
Hints:
- Remember that you can use the set() function to keep only the unique elements in a list. Just make sure you use list() afterwards to convert it back to a list data type. This is not the only solution, however!
- There is a list method named .sort() (look it up in the Python list documentation) that will automatically order the years; you don't need to write sorting logic "by hand".
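As a rough illustration of the two hints, here is a minimal sketch on a small hand-written list (years_with_repeats is a made-up example, not data from the lab). One detail worth knowing: .sort() changes the list in place and returns None, so its result should not be assigned to a variable.

```python
# Made-up list with duplicates, in no particular order
years_with_repeats = [2015, 2012, 2015, 2017, 2012]

# set() drops the duplicates; list() turns the result back into a list
unique_years = list(set(years_with_repeats))

# .sort() reorders the existing list in place and returns None
unique_years.sort()
print(unique_years)  # [2012, 2015, 2017]

# Alternative: the built-in sorted() returns a new sorted list
print(sorted(set(years_with_repeats)))  # [2012, 2015, 2017]
```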
# Your code here
print(get_years(reviews)) # [2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
print(type(get_years(reviews))) # <class 'list'>
Now we know that we have data spanning 2007-2018, with no data from 2010. In some contexts that absence might be worth investigating — is it a random artifact of our sample, or are we missing 2010 data for a reason that matters? For now we'll just keep moving on to the next section, now that we have a clearer sense of the kinds of reviews in our dataset and the years they were written.
Once you have an overall sense of a dataset, it's a good idea to look at some example records from each category. For example, what does a negative review look like?
Although 86 records are few enough that you could technically read through all of them and just mentally note what you see, let's use an approach that will scale better to larger datasets with more categories: filtering to a subset of records, then sampling to get a digestible amount of information.
Here we are going to make use of another built-in Python function: filter(). To use this function, we first need to write a helper function that returns True or False based on the value passed in.
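To see how filter() and a True/False helper fit together, here is a minimal sketch on a list of numbers (is_even and numbers are illustrative names, not part of the lab). Note that filter() returns a lazy filter object rather than a list, which is why it is wrapped in list() before printing.

```python
def is_even(number):
    # Helper that returns True or False for a single value
    return number % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]

# Pass the function itself (no parentheses); filter() calls it on each element
result = filter(is_even, numbers)

# Convert the filter object to a list to see the elements that passed the test
even_numbers = list(result)
print(even_numbers)  # [2, 4, 6]
```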
So, create a function is_negative that takes in a review dictionary as an argument and returns True if the review is negative, False otherwise:
def is_negative(review):
    # Replace None with appropriate code
    None
print(is_negative(reviews[2])) # False (positive review)
print(is_negative(reviews[4])) # True
print(is_negative(reviews[47])) # False (neutral review)
Now we can use the filter() function to create a list of negative reviews:
# Run this cell without changes
list(filter(is_negative, reviews))
Write a function called get_negative_reviews that returns a list of these reviews. It should take the list of all reviews as an argument.
(This can be a one-line function.)
# Your code here
len(get_negative_reviews(reviews)) # 15
Again, since we have a relatively small dataset, we could just look at all 15 reviews. But let's take a more scalable approach instead, and take a random sample of negative reviews.
Recall the random module, which must be imported:
# Run this cell without changes
import random
We'll use the random.sample() function, which takes in a collection and a number, and returns a list of that many randomly chosen elements from the collection, with no element chosen more than once.
So, for example, if we want 3 negative reviews:
# Run this cell without changes
# You can run as many times as you want, to see different sample examples
random.sample(get_negative_reviews(reviews), 3)
Now, put that code into a function get_negative_review_sample. This function should take a list of reviews and the number of samples to select, and should return a sample of negative reviews.
(You can assume that num_samples is a valid number. The number of samples must be less than or equal to the number of elements in the collection.)
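The note about a valid number of samples reflects how random.sample() behaves: it picks distinct positions from the collection, so asking for more elements than the collection contains raises a ValueError. A minimal sketch on a toy list (population is a made-up example):

```python
import random

# Made-up collection of five elements
population = ["a", "b", "c", "d", "e"]

# Ask for 3 elements; the result is a new list and population is unchanged
sample = random.sample(population, 3)
print(len(sample))       # 3
print(len(set(sample)))  # 3, because no position is picked twice

# Asking for more elements than exist raises a ValueError
try:
    random.sample(population, 6)
except ValueError as error:
    print("ValueError:", error)
```

Which three elements come back will differ from run to run, so only the length is shown as expected output.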
def get_negative_review_sample(review_list, num_samples):
    # Replace None with appropriate code
    None
get_negative_review_sample(reviews, 4)
Repeat the same process for positive reviews. That means we need:
- A helper function is_positive (this can't just be not is_negative, since neutral reviews are neither)
- A function get_positive_reviews which returns a list of all positive reviews
- A function get_positive_review_sample which returns a sample of positive reviews with the specified length
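The warning about not is_negative applies whenever there are three categories: negating the test for one category lumps the other two together. A minimal sketch using a hypothetical temperature example (is_cold, is_hot, and the thresholds are invented for illustration):

```python
def is_cold(temperature):
    # Hypothetical threshold: 10 degrees or below counts as cold
    return temperature <= 10


def is_hot(temperature):
    # Hypothetical threshold: 30 degrees or above counts as hot
    return temperature >= 30


mild = 20

print(is_cold(mild))      # False
print(not is_cold(mild))  # True, even though 20 degrees is not hot
print(is_hot(mild))       # False, because the middle category needs its own test
```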
# Your code here
get_positive_review_sample(reviews, 4)
In addition to summarizing the dataset overall and sampling based on criteria, we want the user to be able to query any given record in order to view a summary. Before, we created a variable called review_index that the user could modify. Now, let's write some reusable code that doesn't require the user to write any Python at all!
Recall that before, our final code looked something like this:
# Run this cell without changes
review_index = 2
# Extract review from list of reviews
selected_review = reviews[review_index]
# Extract title
selected_review_title = selected_review["review_title"]
# Extract rating and format as positive, negative, or neutral
selected_rating = selected_review["rating"]
if selected_rating >= 4:
    selected_sentiment = "positive"
elif selected_rating <= 2:
    selected_sentiment = "negative"
else:
    selected_sentiment = "neutral"
# Extract author
selected_author = selected_review["reviewer_name"]
# Extract year (doesn't need to be int for this use case)
selected_year = selected_review["review_time"][-4:]
print(f'"{selected_review_title}": This was a {selected_sentiment} review written by {selected_author} in {selected_year}.')
Rewrite that code as a function called get_review_summary, which takes a review dictionary as an argument, and returns a string that resembles the previous summary string, e.g.
"Bialetti is the Best!": This was a positive review written by Karen in 2017.
Hint: look back at the functions you have previously written to see which ones might be useful to call within this function!
# Your code here
print(get_review_summary(reviews[2])) # "Bialetti is the Best!": This was a positive review written by Karen in 2017.
Now, instead of copying and pasting that every time, we can just call it repeatedly!
Write a function that prompts the user to enter a review index, then prints the relevant review summary. The function should be called review_summary_prompt, should take a list of reviews as an argument, and should print information but not return anything.
Display the message "Please enter a review index: " when prompting for input. You can assume that the user will enter a valid index between 0 and 85.
Hints:
- Use the built-in input() function (check the documentation to see how to use it!)
- Remember that this function always returns a string, so you will have to convert the user-supplied index into an integer; otherwise you'll get the error TypeError: list indices must be integers or slices, not str
- If you're wondering about the type of a given variable, you can use the built-in type() function
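The string-to-integer conversion described in the hints can be sketched on a toy list. This is only an illustration under stated assumptions: pick_item and its ask parameter are hypothetical names, and ask defaults to the built-in input() so that typed input can be simulated here without a keyboard.

```python
def pick_item(items, ask=input):
    # ask is a hypothetical parameter; by default it is the built-in input()
    answer = ask("Please enter an index: ")  # always a string, e.g. "1"
    index = int(answer)                      # convert the text to an integer
    return items[index]


colors = ["red", "green", "blue"]

# Simulate a user typing 1 by passing a stand-in for input()
print(pick_item(colors, ask=lambda prompt: "1"))  # green

# Skipping the int() conversion reproduces the error from the hint
text_index = "1"
try:
    colors[text_index]
except TypeError as error:
    print("TypeError:", error)
```

Calling pick_item(colors) with no second argument would wait for real keyboard input instead.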
def review_summary_prompt(list_of_reviews):
    # Replace None with appropriate code
    None
Run this cell, and try entering 2, 4, or 52 (examples of positive, negative, and neutral reviews).
You can also try any index you want, between 0 and 85!
# Run this cell without changes
review_summary_prompt(reviews)
In this section, we are just calling several of the previously-created functions to double-check that they are working as expected. You do not need to write any more code, although if you notice something wrong with one of your functions you can go back and fix it! Just make sure that you re-run the cell declaring the function if you want the behavior of calling the function to change.
# Run this cell without changes
print(f"The coffee product review dataset contains {len(reviews)} reviews")
print()
print("Review sentiment:")
for key, value in get_sentiment_counts(reviews).items():
    print(f"{value} {key} reviews")
print()
print("Review years:")
print(get_years(reviews))
# Run this cell without changes
print("Examples of positive reviews:")
positive_samples = get_positive_review_sample(reviews, 5)
for review in positive_samples:
    print(get_review_summary(review))
print()
print("Examples of negative reviews:")
negative_samples = get_negative_review_sample(reviews, 5)
for review in negative_samples:
    print(get_review_summary(review))
# Run this cell without changes
review_summary_prompt(reviews)
Congratulations, you made it to the end of another cumulative lab! In this lab you practiced refactoring previously-written code to use functions, and using loops to avoid repetition and perform analyses of the whole dataset as well as certain subsets.