Python Loops and Functions - Cumulative Lab

Introduction

You made it through another section — excellent work! This cumulative lab will return to the Amazon product review dataset and allow you to flex your new skills.

Objectives

You will be able to:

  • Recall what you learned in the previous section
  • Practice writing loops to pull multiple pieces of data from a dataset
  • Practice writing functions for organization and avoiding repetition

Your Task: Dynamically Query Amazon Review Data

Once again, we are going to be working with data collected by Computer Science researchers at the University of California, San Diego. Their full paper citation is here:

Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019.

We are still using a cleaned-up, coffee-specific, sample version of their full dataset.

Image: pouring coffee (photo by Philipp Cordts on Unsplash)

In some cases, we will write the function signature for you, e.g.

def review_sentiment(review):
    # Replace None with appropriate code
    None

Then you just need to fill in the relevant logic.

In other cases, you will need to write the function signature yourself, e.g.

# Your code here

Requirements

1. Data Summary

While reusing some code from the previous cumulative lab, write code to loop over all of the records in the dataset to summarize its contents, specifically in terms of overall review sentiment and the years when the reviews were written.

2. Subset Sample

Provide a sample of records that meet particular criteria.

3. Individual Review Summary

Refactor the code from the previous cumulative lab so that it is contained in a function and prompts the user to select which review to summarize.

Data Summary

Once again, we've opened up the dataset and loaded it into a list of dictionaries called reviews.

# Run this cell without changes
import json
with open("coffee_product_reviews.json") as f:
    reviews = json.load(f)
type(reviews)

Previously, we found the length of the collection and looked into the data types of each record's keys and values:

# Run this cell without changes
num_reviews = len(reviews)
print("The coffee product review dataset contains {} reviews".format(num_reviews))

first_review = reviews[0]
first_review

# Run this cell without changes
first_review.keys()

# Run this cell without changes
first_review.values()

This time, let's do something a bit more sophisticated. Specifically, we'll find:

  1. Count of positive, negative, and neutral reviews
  2. List of years contained in the dataset

Count of Positive, Negative, and Neutral Reviews

Previously, we wrote something like this code to determine whether a specific review was positive, negative, or neutral:

# Run this cell without changes
selected_review = reviews[2]
selected_rating = selected_review["rating"]

if selected_rating >= 4:
    print("This is a positive review")
elif selected_rating <= 2:
    print("This is a negative review")
else:
    print("This is a neutral review")

Now, rewrite that code as a function review_sentiment, which takes in a review dictionary as an argument and returns the string "positive", "negative", or "neutral".
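Note the switch from printing to returning. As a minimal sketch of the difference, here is a toy example. The functions print_strength and get_strength are hypothetical and are not part of the lab. A function that only prints hands back None, while a function that returns gives the caller a value to store and reuse:

```python
# Hypothetical toy functions, not part of the lab solution

def print_strength(scoops):
    # Displays a label, but gives nothing back to the caller
    if scoops >= 3:
        print("strong")
    else:
        print("mild")

def get_strength(scoops):
    # Hands the label back so the caller can store or reuse it
    if scoops >= 3:
        return "strong"
    else:
        return "mild"

printed_result = print_strength(3)   # displays: strong
returned_result = get_strength(3)    # displays nothing

print(printed_result)   # None
print(returned_result)  # strong
```

Returning the value is what lets other functions build on this one later.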

def review_sentiment(review):
    # Replace None with appropriate code
    None

# Run this cell without changes
review_sentiment(reviews[2]) # 'positive'

# Run this cell without changes
review_sentiment(reviews[4]) # 'negative'

# Run this cell without changes
review_sentiment(reviews[47]) # 'neutral'

OK, this is already much cleaner than copying and pasting that if/elif/else sequence like we did before!

Now, write a function to loop over all of the reviews in the list, and count how many are positive, negative, and neutral.

The function should be called get_sentiment_counts, take one argument (the list of reviews), and return a dictionary containing the counts. A counter dictionary has been initialized for you with "positive", "negative", and "neutral" as the keys and values starting at 0.
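If the counter dictionary pattern is unfamiliar, here is a small sketch of it on made-up data. The drink_orders list and order_counts dictionary are hypothetical and have nothing to do with the review dataset; the point is only how a loop updates a dictionary whose values start at 0:

```python
# Hypothetical toy data, not taken from the review dataset
drink_orders = ["latte", "espresso", "latte", "mocha", "latte", "espresso"]

# Start every key at 0, similar to how sentiment_counts is initialized
order_counts = {"latte": 0, "espresso": 0, "mocha": 0}

for order in drink_orders:
    # Look up the current count for this key and add 1 to it
    order_counts[order] += 1

print(order_counts)  # {'latte': 3, 'espresso': 2, 'mocha': 1}
```

In the lab, the key to increment would come from calling review_sentiment on each review rather than from the list items directly.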

def get_sentiment_counts(review_list):

    sentiment_counts = {
        "positive": 0,
        "negative": 0,
        "neutral": 0
    }

    # Your code here

    return sentiment_counts

get_sentiment_counts(reviews) # {'positive': 67, 'negative': 15, 'neutral': 4}

This spread of sentiments seems reasonable. There is a well-known skew towards positive reviews in general, similar to "grade inflation", and people with neutral opinions are less likely to write reviews in the first place.

List of Years Contained in the Dataset

Previously, we wrote something like this code to extract the year of a review from the review dictionary:

# Run this cell without changes
selected_review = reviews[2]
selected_review_time = selected_review["review_time"]
selected_review_year = int(selected_review_time[-4:])
selected_review_year

Now, rewrite that code as a function review_year, which takes in a review dictionary as an argument, and returns the year as an integer:
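As a reminder of what the slice-and-convert step above is doing, here is a sketch on a made-up string. The value of example_date is hypothetical; the real review_time values may be formatted differently, as long as they end in a four-digit year:

```python
# Hypothetical date string; real review_time values may look different
example_date = "March 9, 2016"

# A negative start index counts from the end, so [-4:] is the last 4 characters
last_four = example_date[-4:]
print(last_four)        # 2016
print(type(last_four))  # <class 'str'>

# int() converts the text to a number so it can be sorted or compared numerically
year = int(last_four)
print(year + 1)         # 2017
```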

def review_year(review):
    # Replace None with appropriate code
    None

# Run this cell without changes
review_year(reviews[2]) # 2017

# Run this cell without changes
review_year(reviews[4]) # 2017

# Run this cell without changes
review_year(reviews[47]) # 2015

Now, write a function called get_years to loop over all of the reviews in the review list and create a list of the years you find. Each year should only appear once, in ascending order. The function should accept one argument (review_list) and should return a list of integers representing the years.

Hints:

  • Remember that you can use the set() function to keep only the unique elements in a list. Just make sure you use list() afterwards to convert it back to a list data type. This is not the only solution, however!
  • There is a list method named .sort() (look it up in the Python list documentation) that will automatically order the years; you don't need to write sorting logic "by hand".
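The two hints above can be sketched on a toy list. The roast_levels data is hypothetical and is not the solution to get_years; it only shows how set(), list(), and .sort() fit together:

```python
# Hypothetical toy data with repeated values
roast_levels = [3, 1, 2, 3, 1, 5]

# set() drops the duplicates; list() converts the result back to a list
unique_levels = list(set(roast_levels))

# .sort() reorders the list in place and returns None,
# so call it on its own line instead of assigning its result
unique_levels.sort()

print(unique_levels)  # [1, 2, 3, 5]
```

A common slip is writing unique_levels = unique_levels.sort(), which replaces the list with None.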
# Your code here

print(get_years(reviews)) # [2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
print(type(get_years(reviews))) # <class 'list'>

Now we know that we have data spanning 2007-2018, with no data from 2010. In some contexts that absence might be worth investigating — is it a random artifact of our sample, or are we missing 2010 data for a reason that matters? For now we'll just keep moving on to the next section, now that we have a clearer sense of the kinds of reviews in our dataset and the years they were written.

Subset Sampling

Once you have an overall sense of a dataset, it's a good idea to ask: what are some examples of records in each category? For example, what are some examples of a negative review?

Although 86 records are few enough that you could technically read through all of them and just mentally note what you see, let's use an approach that will scale better to larger datasets with more categories: filtering to a subset of records, then sampling to achieve a digestible amount of information.

Filtering

Here we are going to make use of another built-in Python function: filter(). To use this function, we first need to write a helper function that returns True or False based on the value passed in.
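Here is a minimal sketch of that pattern on toy data. The helper is_long_word and the words list are hypothetical. filter() calls the helper once per element and keeps the elements for which it returns True:

```python
# Hypothetical helper: returns True or False for a single element
def is_long_word(word):
    return len(word) > 5

words = ["bean", "espresso", "cup", "grinder", "milk"]

# Pass the function itself (no parentheses) along with the collection.
# filter() returns a lazy filter object, so wrap it in list() to see the results.
long_words = list(filter(is_long_word, words))

print(long_words)  # ['espresso', 'grinder']
```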

So, create a function is_negative that takes in a review dictionary as an argument and returns True if the review is negative, False otherwise:

def is_negative(review):
    # Replace None with appropriate code
    None
    
print(is_negative(reviews[2]))  # False (positive review)
print(is_negative(reviews[4]))  # True
print(is_negative(reviews[47])) # False (neutral review)

Now we can use the filter() function to create a list of negative reviews:

# Run this cell without changes
list(filter(is_negative, reviews))

Write a function called get_negative_reviews that returns a list of these reviews. It should take the list of all reviews as an argument.

(This can be a one-line function.)

# Your code here

len(get_negative_reviews(reviews)) # 15

Sampling

Again, since we have a relatively small dataset, we could just look at all 15 reviews. But let's take a more scalable approach instead, and take a random sample of negative reviews.

Recall the random module, which must be imported:

# Run this cell without changes
import random

We'll use the random.sample() function, which takes in a collection and a number, and returns a list containing that many randomly chosen elements from the collection, with no element picked more than once.
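As a quick sketch on toy data (the cup_sizes list is hypothetical), the result is always a new list of the requested length, and the original collection is left unchanged:

```python
import random

# Hypothetical toy collection
cup_sizes = ["small", "medium", "large", "extra large"]

# Ask for 2 elements chosen at random, without repeats
picked = random.sample(cup_sizes, 2)

print(picked)          # varies from run to run
print(len(picked))     # 2
print(len(cup_sizes))  # 4 (the original list is not modified)
```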

So, for example, if we want 3 negative reviews:

# Run this cell without changes
# You can run as many times as you want, to see different sample examples
random.sample(get_negative_reviews(reviews), 3)

Now, put that code into a function get_negative_review_sample. This function should take a list of reviews and the number of samples to select, and should return a sample of negative reviews.

(You can assume that num_samples is a valid number. The number of samples must be less than or equal to the number of elements in the collection.)
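You do not need to handle invalid values in this lab, but for reference, this sketch shows what happens when the requested sample is too large. The tiny_collection list is hypothetical:

```python
import random

# Hypothetical stand-in data with only 3 elements
tiny_collection = ["a", "b", "c"]

try:
    # Asking for 5 elements from a 3-element list raises ValueError
    random.sample(tiny_collection, 5)
    outcome = "sampled"
except ValueError:
    outcome = "too many samples requested"

print(outcome)  # too many samples requested
```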

def get_negative_review_sample(review_list, num_samples):
    # Replace None with appropriate code
    None
    
get_negative_review_sample(reviews, 4)

Repeat the same process for positive reviews. That means we need:

  1. A helper function is_positive (this can't just be not is_negative since neutral reviews are neither)
  2. A function get_positive_reviews which returns a list of all positive reviews
  3. A function get_positive_review_sample which returns a sample of positive reviews with the specified length
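To see why is_positive can't just be the opposite of is_negative, consider this sketch. It works on bare rating numbers instead of review dictionaries, and the helper names rating_is_negative and rating_is_positive are hypothetical. The thresholds are the same ones used earlier in this lab:

```python
# Hypothetical helpers that take a bare rating, not a review dictionary
def rating_is_negative(rating):
    return rating <= 2

def rating_is_positive(rating):
    return rating >= 4

neutral_rating = 3

# "Not negative" is True for a neutral rating...
print(not rating_is_negative(neutral_rating))  # True

# ...but a neutral rating is not positive either
print(rating_is_positive(neutral_rating))      # False
```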
# Your code here

get_positive_review_sample(reviews, 4)

Individual Review Summary

In addition to summarizing the dataset overall and sampling based on criteria, we want the user to be able to query any given record in order to view a summary. Before, we created a variable called review_index that the user could modify. Now, let's write some reusable code that doesn't require the user to write any Python at all!

Recall that before, our final code looked something like this:

# Run this cell without changes

review_index = 2

# Extract review from list of reviews
selected_review = reviews[review_index]

# Extract title
selected_review_title = selected_review["review_title"]

# Extract rating and format as positive, negative, or neutral
selected_rating = selected_review["rating"]
if selected_rating >= 4:
    selected_sentiment = "positive"
elif selected_rating <= 2:
    selected_sentiment = "negative"
else:
    selected_sentiment = "neutral"
    
# Extract author
selected_author = selected_review["reviewer_name"]

# Extract year (doesn't need to be int for this use case)
selected_year = selected_review["review_time"][-4:]

print(f'"{selected_review_title}": This was a {selected_sentiment} review written by {selected_author} in {selected_year}.')

Rewrite that code as a function called get_review_summary, which takes a review dictionary as an argument, and returns a string that resembles the previous summary string, e.g.

"Bialetti is the Best!": This was a positive review written by Karen in 2017.

Hint: look back at the functions you have previously written to see which ones might be useful to call within this function!

# Your code here

print(get_review_summary(reviews[2])) # "Bialetti is the Best!": This was a positive review written by Karen in 2017.

Now, instead of copying and pasting that every time, we can just call it repeatedly!

Write a function that prompts the user to enter a review index, then prints the relevant review summary. The function should be called review_summary_prompt, take a list of reviews as an argument, and print information without returning anything.

Display the message "Please enter a review index: " when prompting for input. You can assume that the user will enter a valid index between 0 and 85.

Hints:

  • Use the built-in input() function (check the Python documentation to see how to use it!)
  • Remember that this function always returns a string, so you will have to convert the user-supplied index into an integer. Otherwise, you'll get the error TypeError: list indices must be integers or slices, not str
  • If you're wondering about the type of a given variable, you can use the built-in type() function
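The string-to-integer conversion described in the hints can be sketched without actually prompting anyone. Here the variable typed_text is a hypothetical stand-in for whatever input() would return, and coffee_menu is toy data:

```python
# Hypothetical toy list
coffee_menu = ["drip", "latte", "mocha"]

# Stand-in for the value input() would return: always a string
typed_text = "2"
print(type(typed_text))  # <class 'str'>

try:
    coffee_menu[typed_text]  # indexing a list with a string fails
except TypeError as error:
    print(error)  # list indices must be integers or slices, not str

# Convert to an integer first, then index
menu_index = int(typed_text)
print(coffee_menu[menu_index])  # mocha
```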
def review_summary_prompt(list_of_reviews):
    # Replace None with appropriate code
    None

Run this cell, and try entering 2, 4, or 52 (examples of positive, negative, and neutral reviews).

You can also try any index you want, between 0 and 85!

# Run this cell without changes
review_summary_prompt(reviews)

Putting It All Together

In this section, we are just calling several of the previously-created functions to double-check that they are working as expected. You do not need to write any more code, although if you notice something wrong with one of your functions you can go back and fix it! Just make sure that you re-run the cell declaring the function if you want the behavior of calling the function to change.

Data Summary

# Run this cell without changes

print(f"The coffee product review dataset contains {len(reviews)} reviews")
print()
print("Review sentiment:")
for key, value in get_sentiment_counts(reviews).items():
    print(f"{value} {key} reviews")
print()
print("Review years:")
print(get_years(reviews))

Subset Samples

# Run this cell without changes

print("Examples of positive reviews:")
positive_samples = get_positive_review_sample(reviews, 5)
for review in positive_samples:
    print(get_review_summary(review))
print()
print("Examples of negative reviews:")
negative_samples = get_negative_review_sample(reviews, 5)
for review in negative_samples:
    print(get_review_summary(review))

Summary Prompt

# Run this cell without changes
review_summary_prompt(reviews)

Conclusion

Congratulations, you made it to the end of another cumulative lab! In this lab you practiced refactoring previously-written code to use functions, and using loops to avoid repetition and perform analyses of the whole dataset as well as certain subsets.
