In [1]:
import pandas as pd

## Read Data

In this section, we read in the `reviews` CSV, which contains the web-scraped review data for each movie.

The `reviews` dataframe consists of 4 columns:
* `tconst` - unique id of a movie 
* `username` - username of the user who left a rating/review for this movie
* `rating` - user's rating for the movie out of 10 (1 being the lowest, 10 being the highest) 
* `review` - user's textual review of the movie

We also read in `movie_mapping.pickle`, which contains a pickled dataframe consisting of the
following columns:
* `tconst` - unique id of a movie
* `mapping` - dictionary that maps words likely to appear in a review of the movie to their
generic replacements
   * e.g. for the film "Titanic", the mapping would contain mappings such as "dicaprio" -> "actor",
   "rose" -> "actress", "titanic" -> "movie", etc.

In [2]:
# reading in reviews
number = 81551
new = True
reviews = pd.read_csv(f'../data/reviews/raw_reviews/raw_reviews_{number}.csv')
reviews.head()

Unnamed: 0,tconst,username,rating,review
0,tt0018671,F Gwynplaine MacIntyre,2/10,Even though I've written some historical ficti...
1,tt0018673,Maleejandra,8/10,"Bare Knees is the epitome of the Jazz Age, but..."
2,tt0018673,kidboots,9/10,Sometimes you can get a truer picture of teen ...
3,tt0018673,David-240,9/10,This must be the ultimate flapper comedy - an ...
4,tt0018673,JohnHowardReid,10/10,A comedy-drama I greatly enjoyed was 1928's Ba...


In [3]:
import shared_functions.pickling as pickling

# reading in movie mappings
movie_mapping = pickling.get_pickle('movie_mapping')

## Filter to Include Only English Reviews

Analysis will only be done on English reviews, as later procedures such as lemmatization and topic modelling are English-focused.

Below, we filter the rows so that only those with English reviews are kept.
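The implementation of `cleaning.valid_english` is not shown in this notebook. As a rough illustration of what such a check could look like, here is a minimal sketch that treats a review as English when enough of its tokens are common English function words. The name `looks_english`, the word list, and the `min_ratio` threshold are all assumptions made for this example, and unlike the notebook's helper, which receives a whole row, the sketch takes the review text directly.

```python
import re

# Hypothetical stand-in for cleaning.valid_english; the real helper may work differently.
COMMON_ENGLISH_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "is", "it",
    "this", "that", "was", "for", "with", "as", "i", "not", "be", "are",
}

def looks_english(text, min_ratio=0.2):
    """Return True if at least min_ratio of the tokens are common English words."""
    if not isinstance(text, str):
        return False  # missing reviews (e.g. NaN) are rejected
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return False  # nothing to judge
    hits = sum(token in COMMON_ENGLISH_WORDS for token in tokens)
    return hits / len(tokens) >= min_ratio
```

A heuristic like this is cheap but crude; a dedicated language-detection library would likely be more reliable on short or informal reviews.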

In [None]:
import shared_functions.cleaning as cleaning

if new:
    # select only rows where the review is in valid english
    english_reviews = reviews[reviews.apply(cleaning.valid_english, axis=1)]

    # write to csv
    english_reviews.to_csv(f'../data/reviews/english_reviews/english_reviews_{number}.csv')
else:
    # read from csv
    english_reviews = pd.read_csv(f'../data/reviews/english_reviews/english_reviews_{number}.csv', index_col=0)

english_reviews.head()

## Clean Reviews

Currently, the text in `english_reviews` and `movies` contains uppercase letters, punctuation and digits.

We want to normalize the `review` column of `english_reviews` by removing these.
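The body of `cleaning.clean_text` is not shown here. A minimal sketch of the normalization described above, assuming it amounts to lowercasing and dropping punctuation and digits, could look like the following. The name `clean_text_sketch` is hypothetical, and the real helper may handle more cases.

```python
import string

# Map every punctuation character and digit to a space.
_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})

def clean_text_sketch(text):
    """Lowercase text, replace punctuation and digits with spaces, collapse whitespace."""
    text = text.lower().translate(_TO_SPACE)
    # split/join collapses the runs of spaces left behind by the replacements
    return " ".join(text.split())

print(clean_text_sketch("Bare Knees (1928) is GREAT!!! 10/10."))
# → bare knees is great
```

Replacing characters with a space rather than deleting them keeps hyphenated words such as "well-made" from fusing into a single token, at the cost of splitting contractions ("it's" becomes "it s").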

In [None]:
def update_review_with_clean_text(row):
    """
    Updates the given row with clean text for its review column.

    Parameters:
    row (Series): row that contains unclean reviews

    Returns:
    Row with cleaned review. 

   """
    text = cleaning.clean_text(row.review)
    row.review = text
    return row

In [None]:
if new:
    # clean english reviews
    clean_english_reviews = pd.DataFrame([update_review_with_clean_text(row) for i, row in english_reviews.iterrows()])

    # write to csv
    clean_english_reviews.to_csv(f'../data/reviews/english_reviews/clean_english_reviews_{number}.csv')
else:
    # read from csv
    clean_english_reviews = pd.read_csv(f'../data/reviews/english_reviews/clean_english_reviews_{number}.csv', index_col=0)

clean_english_reviews.head()

## Perform Replacements of Key Words in Reviews

Movie reviews often contain the title of the movie or cast/crew/character names. These words vary from movie to movie,
so each one is too rare across the full set of reviews to contribute to the topic distribution given by the topic
model.

For example, a review may focus heavily on the directing aspect of a movie and use the director's name often.
However, because the frequency of this director's name is very low when considering all reviews across all movies,
the topic model will most likely not recognize this review as having a high probability of being about
the directing. If we instead map the director's name in each review to the term "director", the topic model
can learn this word and correctly assign a higher probability to the "directing" topic.

We perform a replacement of these key words with more generic terms below.
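To make the idea concrete, here is a small sketch of the replacement on a toy mapping in the style of the "Titanic" example. It is a variant for illustration, not necessarily how the notebook does it: it uses a single regular-expression pass with word boundaries, so that "rose" is replaced but "arose" is left alone, whereas plain `str.replace` would also match inside longer words. The function name `replace_key_words` is hypothetical, and the sketch assumes both the review and the mapping keys are already lowercased and cleaned.

```python
import re

def replace_key_words(review, mapping):
    """Replace each mapped key word in review with its generic term, whole words only."""
    if not mapping:
        return review
    # try longer keys first so multi-word names win over their parts
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    # a single pass means replacement text is never itself replaced again
    return pattern.sub(lambda match: mapping[match.group(0)], review)

mapping = {"dicaprio": "actor", "rose": "actress", "titanic": "movie"}
print(replace_key_words("dicaprio and rose arose as titanic sank", mapping))
# → actor and actress arose as movie sank
```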


In [None]:
def replace_title_cast_crew(eng_reviews):
    """
    Replaces any mention of the title of the movie, cast & crew names and character names with their category (role).

    Parameters:
    eng_reviews (DataFrame): a dataframe that contains all cleaned English reviews, the username of the reviewer, the
    rating given alongside the review, and the tconst of the movie to which the review refers.
    Reviews of the same movie will be grouped together in adjacent rows in the dataframe.

    Returns:
    List of dictionaries containing the new rows with the replaced reviews. Each dictionary entry consists of only one
    user's review/rating for a movie corresponding with that tconst. 

   """
    prev_movie = None
    mapping = None
    entries = []
    # for each review
    for _, row in eng_reviews.iterrows():
        # if the review regards a movie that hasn't been analyzed yet
        if prev_movie != row.tconst:
            # set it as the current movie
            prev_movie = row.tconst
            # extract the mapping for this movie
            mapping = movie_mapping[movie_mapping.tconst == row.tconst].mapping.squeeze()

        # replace all mapped words in the review
        replaced_review = row.review
        for word, replacement in mapping.items():
            replaced_review = replaced_review.replace(word, replacement)

        entries.append({'tconst': row.tconst, 'username': row.username, 'rating': row.rating, 'review': replaced_review})

    return entries

replaced_reviews = pd.DataFrame(replace_title_cast_crew(clean_english_reviews))

## Lemmatize Reviews and Movies

In [None]:
from nltk.stem import WordNetLemmatizer

def lemmatize_row(row):
    """
    Updates the given row with lemmatized text for its review column.

    Parameters:
    row (Series): row that contains unlemmatized reviews

    Returns:
    Row with lemmatized review. 

   """  
    print(f'Processing row {row.name}...', end='\r')
    row.review = cleaning.lemmatize(row.review, lemmatizer)
    return row

lemmatizer = WordNetLemmatizer()
lemmatized_reviews = pd.DataFrame([lemmatize_row(row) for _, row in replaced_reviews.iterrows()])
print('Done processing all rows!')
lemmatized_reviews

In [None]:
if new:
    lemmatized_reviews.to_csv(f'../data/reviews/lemmatized_reviews/lemmatized_reviews_{number}.csv')

