# Analyzing the Reviews

This notebook shows the code for several kinds of text analysis on the reviews I collected from MyDramaList.com.

I plan to conduct three kinds of text analysis in my project:
1. Word and Phrase Frequency Analysis
2. Concordance Analysis
3. Sentiment Analysis

### Setup

In [1]:
import os
import string
import re
import random
import time
import json
import csv

import requests
import bs4
from bs4 import BeautifulSoup

from collections import Counter

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import WordPunctTokenizer

In [2]:
%run functions.ipynb

To carry out the analyses, we first need to load the list of reviews from each year and from each score range.
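
The loading cells in this section all follow the same pattern, so a small helper could keep them uniform. The sketch below is one possible way to write it: `load_reviews` is a hypothetical name that is not part of `functions.ipynb`, and it assumes each file holds a JSON list of review dictionaries.

```python
import json


def load_reviews(path):
    """Load one JSON file of reviews and return it as a list of dicts.

    Hypothetical helper (not defined in functions.ipynb). It assumes the
    file contains a JSON array, as the review files in this notebook do.
    """
    # The with-statement closes the file even if parsing fails.
    with open(path, encoding='utf-8') as f:
        reviews = json.load(f)
    if not isinstance(reviews, list):
        raise ValueError('expected a JSON list in {}'.format(path))
    return reviews
```

With this helper, a loading cell would read `reviews_2017 = load_reviews('data/Date/review_2017.json')`.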

**Load the reviews by year**

In [3]:
reviews_2017 = json.load(open('data/Date/review_2017.json'))

In [4]:
type(reviews_2017)

list

In [5]:
len(reviews_2017)

18

In [6]:
print('This is an example of reviews from 2017.')

reviews_2017[3]

This is an example of reviews from 2017.


{'Date': 'Dec 25, 2017',
 'Overall Score': '9.5',
 'Review Text': 'The only regrets I have with this drama is binge watching it so quickly. Finishing it has left such an empty void in me and I can’t help but make comparisons with any other dramas I watch.\n\nThis drama is such a unique case for me because usually I only have interest in Korean dramas and have struggled to enjoy a Chinese drama despite been Chinese myself. I also generally dislike dramas where the female chases after the male. However this drama has a special charm that I am unable to resist, making it probably even my most favourite drama ever in my 4 years of watching Asian dramas.\n\nMany people may think of Xiaoxi as a dumb bimbo whose life revolves around a guy and while this may be true we must remember that she is a high school girl and at the time where the desire for romance is very strong.\xa0She makes mistakes and is most definitely not perfect.\xa0As the drama progresses, we begin to see her development into

In [7]:
reviews_2018 = json.load(open('data/Date/review_2018.json'))

In [8]:
type(reviews_2018)

list

In [9]:
len(reviews_2018)

58

In [10]:
print('This is an example of reviews from 2018.')

reviews_2018[0]

This is an example of reviews from 2018.


{'Date': 'Dec 12, 2018',
 'Overall Score': '10',
 'Username': 'c00kie'}

In [11]:
reviews_2019 = json.load(open('data/Date/review_2019.json'))

In [12]:
type(reviews_2019)

list

In [13]:
len(reviews_2019)

24

In [14]:
print('This is an example of reviews from 2019.')

reviews_2019[5]

This is an example of reviews from 2019.


{'Date': 'Sep  6, 2019',
 'Overall Score': '6.0',
 'Review Text': "I just don't understand the whole hype over this drama or the very high ratings.\n\nIt took 10ish episodes to start getting interesting. It started because Jiang Chen started to show he liked Xiaoxi and would do little things to show that he was interested. I also liked the second lead Bu Wongsu and how he tried his best, but I especially loved how Jiang Chen would feel jealous and threatened by him.\n\nBut Xiaoxi is such a nothing, whiny, submissive, pathetic character!\n\nA lot of the series is spent on school life, but because it was hard to connect to anyone, it wasn't particularly moving or interesting, so I began fastforwarding. Only the bits with Jiang Chen showing how he liked Xiaoxi got me squealing and excited.\n\nThen it moved to being in university and they did a LOT of time jumps. There was a lot of in the sky, of the sun or of leaves montages which was just unnecessary.\n\nVery strange writing, odd beats a

**Load the reviews by score range**

In [15]:
reviews_0_to_3 = json.load(open('data/Overall_Score/score_0_to_3.json'))

In [16]:
type(reviews_0_to_3)

list

In [17]:
len(reviews_0_to_3)

0

Since no review rates the drama between 0 and 3.0, the list `reviews_0_to_3` is empty and will be left out of the analyses.

In [18]:
reviews_3_to_6 = json.load(open('data/Overall_Score/score_3_to_6.json'))

In [19]:
type(reviews_3_to_6)

list

In [20]:
len(reviews_3_to_6)

7

In [21]:
print('This is an example of reviews with a rating score between 3.0 and 6.0.')

reviews_3_to_6[0]

This is an example of reviews with a rating score between 3.0 and 6.0.


{'Date': 'May 22, 2018',
 'Overall Score': '5.0',
 'Review Text': "Story:\n\nChildhood neighbors who become a couple after knowing each other for several years. Very basic story line with no surprises. Although I like that the OTPs friends were important and had all their own story lines, this dragged out the drama.\n\nThe male lead was really static and seemed to have no emotions at all. The female lead on the other hand was often annoying (some might call it bubbly) and all over the place.\n\nAgain, the basic idea of showing their love from the beginning to end is great. However, the beginning took about 16 episodes (their high school years), while their collage and after-collage life was wrapped up in 7...this also means 16 episodes of no real deep or romantic encounters, and 7 episodes where they are adults, but again we don't see many lovely scenes. Actually, the last episode was the best, as there was a lot of love included! \n\nActing/Cast:\n\nAs this was my first cdrama I don't

In [22]:
reviews_6_to_8 = json.load(open('data/Overall_Score/score_6_to_8.json'))

In [23]:
type(reviews_6_to_8)

list

In [24]:
len(reviews_6_to_8)

13

In [25]:
print('This is an example of reviews with a rating score between 6.0 and 8.0.')

reviews_6_to_8[5]

This is an example of reviews with a rating score between 6.0 and 8.0.


{'Date': 'Mar 12, 2018',
 'Overall Score': '7.5',
 'Review Text': "Chinese school romance dramas are kind of my thing and this one does not disappoint. It has everything, innocent love, intensinty, second lead, good friendship vibes, chemistry between the girl and the boy, nice built up, drama, high school dynamics. Sure, the story wasn't original and there are tons of dramas out there with the same theme, but at least they hundled it pretty neat and the final result was decent. Also, the girl was cute and funny, whether the boy was kind of a snob at first, got better afterwards, but kept his attitude, which was refreshing because in most dramas the male lead starts with a i-hate-you-i-am-too-cynical and makes a 360 degrees turn a few episodes later. So, 7.5 out of 10.",
 'Username': 'PHope'}

In [26]:
reviews_8_to_10 = json.load(open('data/Overall_Score/score_8_to_10.json'))

In [27]:
type(reviews_8_to_10)

list

In [28]:
len(reviews_8_to_10)

80

In [29]:
print('This is an example of reviews with a rating score between 8.0 and 10.')

reviews_8_to_10[20]

This is an example of reviews with a rating score between 8.0 and 10.


{'Date': 'Nov 19, 2018',
 'Overall Score': '10',
 'Review Text': "This drama is absolutely amazing, 10/10 my favorite drama of all time. I have watched it 4 times in total! And I have never rewatched anything this much. The story and overall show and characters is very different from your typical Chinese drama. This one is very special. It isn't over exaggerated and fake like most Chinese dramas such as Meteor Garden or Boys Over Flowers. This drama depicts REAL life in China as a highschool student. It really brings you into their lives and the things they deal with. This is a must see for anyone who likes Asian dramas.",
 'Username': 'Xiaoxi'}

## Word and Phrase Frequency Analysis

### Analyzing Reviews from 2017

**1. Word frequency list**

I am interested in the frequency of a word or a phrase that appears in the review text, so from the list `reviews_2017`, I am going to extract the review text from each dictionary containing the review information and put them into a list called `review_texts_2017`.

In [30]:
review_texts_2017 = [review['Review Text'] for review in reviews_2017]

In [31]:
review_texts_2017[0:5]

['Amidst all the positivity and perfection that others see in this drama, you can count on me to be a real grinch and center in on all the reasons that it was not entirely flawless. That is not to say I hated it; far from it, but I fail to share the unwavering, absolute love that many others seem to.\n\nMany things about A Love So Beautiful was a surprise to me. I\'ll be the first to admit that I harbor a real prejudice each time I enter the realm that is a Chinese drama - watching them throughout my childhood, they were unfailingly littered with a) tragedy, b) horrible CG, and/or c) gagworthy storylines (typically, an entertaining combination of all three). And quite honestly, few nowadays seem to impress me. Call it bias, but I still think the Mainland has much to learn regarding what constitutes a good show.\n\nA Love So Beautiful was different from the moment I picked it up. It does not present melodramatic conflict for you to brood over, but instead focuses on the innocence of a t

In [32]:
print('review_texts_2017 has {} pieces of review text, each is a string object.'.format(len(review_texts_2017)))

review_texts_2017 has 18 pieces of review text, each is a string object.


To create a word frequency list named `review_texts_2017_dist`, I need to:
1. loop over the list of review texts from 2017
2. tokenize the text with the `tokenize` function, lowercasing the letters and removing certain punctuation marks
3. update the frequency list
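
The three steps above can be sketched with a stand-in tokenizer. The real `tokenize` comes from `functions.ipynb` and is not shown here, so `simple_tokenize` below is only an assumption about its behavior (split on whitespace, lowercase, strip the given characters from both ends of each token) and may differ from the actual function.

```python
from collections import Counter


def simple_tokenize(text, lowercase=True, strip_chars=''):
    """Hypothetical stand-in for the notebook's `tokenize` function."""
    if lowercase:
        text = text.lower()
    # Split on whitespace, then trim the unwanted characters from the
    # edges of each token; tokens that end up empty are dropped.
    tokens = [tok.strip(strip_chars) for tok in text.split()]
    return [tok for tok in tokens if tok]


toy_reviews = ['This drama is cute!', 'Cute story, cute cast.']
toy_dist = Counter()
for text in toy_reviews:
    toy_dist.update(simple_tokenize(text, strip_chars='!,.'))

print(toy_dist.most_common(1))  # [('cute', 3)]
```

Because only the edges of each token are trimmed in this sketch, a contraction such as `it's` stays intact as a single token.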

In [33]:
review_texts_2017_dist = Counter()

for review_text in review_texts_2017:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2017_dist.update(tok)

In [34]:
review_token_2017_cnt = sum(review_texts_2017_dist.values())
print('There are {} tokens in review_texts_2017_dist.'.format(review_token_2017_cnt))

There are 7678 tokens in review_texts_2017_dist.


In [35]:
print(review_texts_2017_dist.most_common(50))

[('the', 412), ('i', 235), ('and', 205), ('to', 181), ('a', 170), ('it', 162), ('is', 141), ('drama', 138), ('this', 137), ('of', 115), ('was', 94), ('that', 92), ('in', 86), ('so', 75), ('you', 72), ('for', 63), ('but', 61), ('love', 61), ('as', 55), ('my', 50), ('story', 45), ('not', 44), ('with', 44), ('me', 43), ('have', 42), ('their', 39), ('on', 38), ('they', 38), ('watch', 38), ('be', 37), ('school', 36), ('very', 35), ('all', 34), ('just', 34), ('cute', 32), ('really', 31), ('more', 30), ('high', 28), ('about', 27), ('how', 27), ('he', 27), ('like', 26), ('will', 26), ('beautiful', 25), ('because', 25), ('out', 25), ('at', 25), ('if', 25), ('chen', 24), ('from', 23)]


**A few observations**

* Function words like *`the`*, *`and`*, *`to`*, and *`a`* appear frequently in the reviews. Other words worthy of attention are *`i`*, *`my`*, and *`me`*. Reviewers refer to themselves very often in their reviews, probably to express their feelings and voice their opinions about the drama.
* The words *`cute`* and *`beautiful`* are among the 50 most frequent words. The reviewers might use *`cute`* to describe the plot or the characters, while *`beautiful`* is more likely to describe a character or a specific scene. It is also possible that *`beautiful`* is frequent because reviewers often mention the title of the drama, *A Love So Beautiful*, which contains the word.
* Another word that appears frequently is *`school`*. Most of the story takes place in school, so it makes sense that reviewers mention school in their reviews. But which school are they referring to? The "school" in the drama or the "school" they went to?
* In 49th place is the word *`chen`*. Interestingly, *`chen`* is the family name of the female lead character and also the given name of the male lead character, so the word frequency list cannot tell us which character the reviewers frequently write about, or whether they mention both lead characters together.

**2. Bigram frequency list and trigram frequency list**

Before we create a bigram frequency list and a trigram frequency list, we need to use the `get_ngram_tokens` function on the tokenized review texts. Instead of having one word as one token, this function treats two consecutive words (a bigram, n=2) or three consecutive words (a trigram, n=3) as one token. Then, we can create a bigram frequency list (`review_texts_2017_bigram_dist`) and a trigram frequency list (`review_texts_2017_trigram_dist`).
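
To make the idea of an n-gram token concrete, here is a minimal sketch. `get_ngram_tokens` is defined in `functions.ipynb` and not shown, so `simple_ngrams` is a hypothetical stand-in that assumes n-grams are returned as space-joined strings, which matches the form of the keys in the frequency lists in this notebook.

```python
def simple_ngrams(tokens, n=2):
    """Hypothetical stand-in for `get_ngram_tokens`: join every run of
    n consecutive tokens into one space-separated string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


toy_tokens = ['a', 'love', 'so', 'beautiful']
print(simple_ngrams(toy_tokens, n=2))  # ['a love', 'love so', 'so beautiful']
print(simple_ngrams(toy_tokens, n=3))  # ['a love so', 'love so beautiful']
```

A list with fewer than n tokens yields no n-grams in this sketch.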

In [36]:
review_texts_2017_bigram_dist = Counter()
review_texts_2017_trigram_dist = Counter()

for review_text in review_texts_2017:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    bigram_tok = get_ngram_tokens(tok, n=2)
    trigram_tok = get_ngram_tokens(tok, n=3)
    
    review_texts_2017_bigram_dist.update(bigram_tok)
    review_texts_2017_trigram_dist.update(trigram_tok)

In [37]:
reviews_2017_bigram_token_cnt = sum(review_texts_2017_bigram_dist.values())
print('There are {} tokens in review_texts_2017_bigram_dist.'.format(reviews_2017_bigram_token_cnt))

reviews_2017_trigram_token_cnt = sum(review_texts_2017_trigram_dist.values())
print('There are {} tokens in review_texts_2017_trigram_dist.'.format(reviews_2017_trigram_token_cnt))

There are 7660 tokens in review_texts_2017_bigram_dist.
There are 7642 tokens in review_texts_2017_trigram_dist.
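
These totals are consistent with the n-grams being built one review at a time: a review with k tokens yields k - 1 bigrams and k - 2 trigrams, so the 18 reviews should lose 18 and 36 tokens respectively (assuming every review has at least three tokens). A quick check using only the numbers printed above:

```python
n_reviews = 18
n_tokens = 7678

# Each review loses (n - 1) tokens when converted to n-grams.
expected_bigrams = n_tokens - n_reviews * (2 - 1)
expected_trigrams = n_tokens - n_reviews * (3 - 1)

print(expected_bigrams, expected_trigrams)  # 7660 7642
```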


In [38]:
print(review_texts_2017_bigram_dist.most_common(50))
print('\n')
print(review_texts_2017_trigram_dist.most_common(50))

[('this drama', 63), ('of the', 37), ('the drama', 31), ('high school', 27), ('the story', 26), ('it was', 24), ('and the', 20), ('to the', 18), ('to watch', 16), ('drama is', 16), ('so beautiful', 15), ('is a', 15), ('this is', 15), ('and i', 15), ('a love', 13), ('love so', 13), ('the main', 13), ('is the', 13), ('to be', 12), ('between the', 12), ('made me', 12), ('in love', 12), ('jiang chen', 12), ('all the', 11), ('in this', 11), ('but i', 11), ('that the', 11), ('i would', 11), ('the acting', 11), ('hu yi', 11), ('if you', 11), ('that i', 10), ('as the', 10), ('this show', 10), ('love with', 10), ('yi tian', 10), ('watch this', 10), ('i am', 10), ('i have', 10), ('drama i', 10), ('that is', 9), ('is not', 9), ('such a', 9), ('my heart', 9), ('the music', 9), ('i was', 9), ('xiao xi', 9), ('he is', 9), ('cute and', 9), ('i will', 9)]


[('this drama is', 15), ('love so beautiful', 13), ('a love so', 12), ('in this drama', 10), ('in love with', 10), ('hu yi tian', 10), ('of the st

**A few observations**
* The phrase *`high school`* occurs 27 times and ranks 4th in the bigram frequency list. Phrases like *`high school life`* and *`high school years`* also appear among the 50 most common trigrams. The drama probably reminds many viewers of their own time in high school.
* In the bigram frequency list, the male lead character's name *`jiang chen`* appears 12 times, and the female lead character's given name *`xiao xi`* appears 9 times. In the trigram frequency list, we can see that some reviewers write the female lead character's full name, *`chen xiao xi`*. This probably explains why *`chen`* appears so frequently. The number of occurrences of each name also suggests that reviewers talk about the male lead character and the female lead character about equally often.
* A small surprise is the phrase *`the music`*, which is among the 50 most common bigrams. This suggests that when people watch a drama, they focus not only on the story but sometimes also on the music.

### Analyzing Reviews from 2018

Repeat the steps done for analyzing reviews from 2017.
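
Because the same three distributions are built for every year, the repeated cells could also be wrapped in one function. The sketch below is only a suggestion: `build_dists` is a hypothetical name, and the tokenizer and n-gram functions are passed in as arguments so that wrappers around the notebook's own `tokenize` and `get_ngram_tokens` could be supplied.

```python
from collections import Counter


def build_dists(texts, tokenize_fn, ngram_fn):
    """Return (word, bigram, trigram) Counters for a list of texts.

    Hypothetical helper. `tokenize_fn(text)` must return a list of
    tokens and `ngram_fn(tokens, n)` a list of n-gram strings.
    """
    word_dist, bigram_dist, trigram_dist = Counter(), Counter(), Counter()
    for text in texts:
        tokens = tokenize_fn(text)
        word_dist.update(tokens)
        bigram_dist.update(ngram_fn(tokens, 2))
        trigram_dist.update(ngram_fn(tokens, 3))
    return word_dist, bigram_dist, trigram_dist


# Toy stand-ins so the sketch runs on its own; in the notebook one
# would pass wrappers around `tokenize` and `get_ngram_tokens` instead.
def toy_tokenize(text):
    return text.lower().split()


def toy_ngrams(tokens, n):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words, bigrams, trigrams = build_dists(
    ['a love so beautiful', 'so beautiful'], toy_tokenize, toy_ngrams)
print(words['beautiful'], bigrams['so beautiful'], len(trigrams))  # 2 2 2
```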

**1. Word frequency list**

In [39]:
review_texts_2018 = [review['Review Text'] for review in reviews_2018]

In [40]:
review_texts_2018[0:5]

 'A love so beautiful was a great love story. I loved the storyline and how it flowed, It wasn\'t rushed and it wasn\'t too slow it was perfect. I usually don\'t like high school centered dramas and when I found out that this story focused a lot of time in High school I almost didn\'t watch. However, The cast is awesome, great chemistry. I was drawn in throughout the storyline I didn\'t even know who to root for the second lead was as mature, sweet and handsome as the main lead.  I hated how the mother kind of disliked the second lead because of his "dark skin" but it wasn\'t a big deal and didn\'t overshadow the show at all it was just something that bugged me personally because I thought he was a cutie.  I love how the show didn\'t waste time on trivial things other shows would be the main character failed her college entrance exam and had to retake the exam instead of focusing on that for 5 episodes the audience got a condensed episode of study scene and college life which was great

In [41]:
print('review_texts_2018 has {} pieces of review text, each is a string object.'.format(len(review_texts_2018)))

review_texts_2018 has 58 pieces of review text, each is a string object.


In [42]:
review_texts_2018_dist = Counter()

for review_text in review_texts_2018:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2018_dist.update(tok)

In [43]:
review_token_2018_cnt = sum(review_texts_2018_dist.values())
print('There are {} tokens in review_texts_2018_dist.'.format(review_token_2018_cnt))

There are 16981 tokens in review_texts_2018_dist.


In [44]:
print(review_texts_2018_dist.most_common(50))

[('the', 745), ('i', 546), ('and', 440), ('a', 352), ('to', 336), ('it', 287), ('of', 256), ('this', 253), ('is', 250), ('drama', 245), ('was', 204), ('that', 199), ('but', 176), ('in', 170), ('so', 158), ('for', 153), ('love', 132), ('you', 128), ('me', 116), ('like', 102), ('really', 92), ('their', 91), ('all', 90), ('with', 89), ('they', 86), ('my', 85), ('just', 85), ('not', 82), ('as', 81), ('be', 80), ('on', 79), ('story', 75), ('he', 73), ("it's", 68), ('because', 63), ('more', 63), ('one', 63), ('she', 62), ('chen', 60), ('watch', 58), ('have', 58), ('school', 57), ('his', 57), ('her', 56), ('lead', 55), ('how', 55), ('are', 54), ('cute', 54), ('very', 54), ('also', 51)]


**A few observations**
* Besides some function words (e.g., *`the`*, *`and`*, *`a`*, *`to`*, *`of`*), the words *`i`*, *`it`*, *`this`*, and *`drama`* are among the 10 most frequent words in the review texts.
    * When reviewers write a review for a drama, they are likely to refer to themselves often and talk about their personal feelings or engagement with the drama, so the word *`i`* occurs a lot in the review text.
    * Similarly, since they are writing about a drama, they keep referring to it. This might explain the frequent occurrence of the words *`it`*, *`this`*, and *`drama`*.
* The word *`love`* ranks 17th. This is reasonable since the drama is about puppy love and might prompt the reviewers to talk about love, either their own love stories or the love stories in the drama.
* The word *`school`* occurs 57 times. The drama is set in a school, which might remind the reviewers of their own memories of school.
* The word *`cute`* occurs 54 times. It is very likely a word that the reviewers use to describe the story or the characters in the drama.
* One word worth paying attention to is *`chen`*. It is the family name of the female lead character and also the given name of the male lead character. It is common for reviews to mention the characters, but the one-word frequency list cannot tell us which character a given *`chen`* refers to. So it is a good idea to look at a bigram frequency list or even a trigram frequency list.
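
One way to see which character a lone *`chen`* belongs to is to look at the bigrams that contain it. The sketch below filters a bigram `Counter` for a target word. `bigrams_with` is a hypothetical helper, and the toy counts are made up for illustration rather than taken from the reviews.

```python
from collections import Counter


def bigrams_with(word, bigram_dist):
    """Return (bigram, count) pairs whose tokens include `word`,
    most frequent first. Hypothetical helper for illustration."""
    hits = Counter({bg: cnt for bg, cnt in bigram_dist.items()
                    if word in bg.split()})
    return hits.most_common()


# Made-up counts, used only to show the shape of the result.
toy_bigrams = Counter({'jiang chen': 5, 'chen xiao': 3, 'this drama': 9})
print(bigrams_with('chen', toy_bigrams))
# [('jiang chen', 5), ('chen xiao', 3)]
```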

**2. Bigram frequency list and trigram frequency list**

In [45]:
review_texts_2018_bigram_dist = Counter()
review_texts_2018_trigram_dist = Counter()

for review_text in review_texts_2018:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    bigram_tok = get_ngram_tokens(tok, n=2)
    trigram_tok = get_ngram_tokens(tok, n=3)
    
    review_texts_2018_bigram_dist.update(bigram_tok)
    review_texts_2018_trigram_dist.update(trigram_tok)

In [46]:
reviews_2018_bigram_token_cnt = sum(review_texts_2018_bigram_dist.values())
print('There are {} tokens in review_texts_2018_bigram_dist.'.format(reviews_2018_bigram_token_cnt))

reviews_2018_trigram_token_cnt = sum(review_texts_2018_trigram_dist.values())
print('There are {} tokens in review_texts_2018_trigram_dist.'.format(reviews_2018_trigram_token_cnt))

There are 16923 tokens in review_texts_2018_bigram_dist.
There are 16865 tokens in review_texts_2018_trigram_dist.


In [47]:
print(review_texts_2018_bigram_dist.most_common(50))
print('\n')
print(review_texts_2018_trigram_dist.most_common(50))

[('this drama', 114), ('of the', 62), ('jiang chen', 50), ('the drama', 47), ('it was', 45), ('the story', 38), ('high school', 37), ('and i', 36), ('but i', 35), ('drama is', 33), ('to be', 32), ('the main', 30), ('this is', 29), ('and the', 29), ('i was', 28), ('to watch', 28), ('xiao xi', 28), ('in the', 27), ('i think', 27), ('is a', 26), ('is the', 26), ('to the', 24), ('a love', 23), ('for the', 22), ('it is', 22), ('one of', 22), ('so beautiful', 21), ('i just', 21), ('love so', 20), ('the best', 20), ('a lot', 19), ('i really', 19), ('female lead', 19), ('that i', 19), ('want to', 19), ('all the', 19), ('on the', 19), ("i don't", 19), ('so much', 19), ('i liked', 19), ('i love', 18), ('for me', 18), ('is so', 18), ('if you', 18), ('the characters', 18), ('was a', 17), ('but it', 17), ('with the', 17), ('was so', 17), ('the end', 16)]


[('this drama is', 30), ('a love so', 18), ('love so beautiful', 18), ('a lot of', 14), ('the female lead', 14), ('one of the', 12), ('the story

**A few observations**

* The bigram *`jiang chen`* ranks third in the bigram frequency list! This is the full name of the male lead character. The bigram *`xiao xi`* ranks 17th. This is the given name of the female lead character. *`jiang chen`* occurs almost twice as often as *`xiao xi`* (50 vs. 28 times). But notice the bigram *`female lead`*, which occurs 19 times. If we add these 19 occurrences to those of *`xiao xi`*, the two lead characters are mentioned about equally often (50 vs. 47 times). The reviewers do not seem to prefer writing about one more than the other in their reviews.
* The bigram *`high school`* appears 37 times. This suggests that the high school setting leaves an impression on the audience, probably because it is a place many people can relate to, especially when it is mingled with love stories.
* More names! The trigram *`hu yi tian`* is the name of the actor who plays *`jiang chen`*. It appears 8 times. Interestingly, the name of the actress who plays *`xiao xi`*, Shen Yue, is not among the top 50 bigrams or trigrams; we have to look further down the bigram frequency list to find it. In fact, the bigram *`shen yue`* occurs 12 times. Reviewers usually write about the characters because they are directly related to the drama, but sometimes they also write about the actors and actresses, for example to discuss their acting skills or appearance.
* Reviewers did not forget the second male lead character, whose name *`wu bo song`* occurs 7 times in the reviews. Although he is mentioned far less often than *`xiao xi`* and *`jiang chen`*, this shows that *`wu bo song`* also leaves an impression and is worthy of audience discussion.
* Notice trigrams like *`jiang chen and`*, *`and xiao xi`*, *`chen and xiao`*, and *`xiao xi and`*. Reviewers in fact often talk about the two main characters together.
* One interesting trigram is *`the chemistry between`*. When people talk about love, they often use the word "chemistry" to describe the interactions between a couple, and this applies to their discussion of couples in dramas as well.
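
The comparison in the first bullet amounts to summing the counts of every phrase that refers to the same character. A minimal sketch, using the three counts printed in the 2018 bigram list above; the grouping of phrases by character is an assumption made for illustration.

```python
from collections import Counter

# Counts copied from the 2018 bigram frequency list above.
bigram_counts = Counter({'jiang chen': 50, 'xiao xi': 28, 'female lead': 19})

# Assumed grouping of phrases by the character they refer to.
aliases = {
    'male lead': ['jiang chen'],
    'female lead': ['xiao xi', 'female lead'],
}

mentions = {character: sum(bigram_counts[phrase] for phrase in phrases)
            for character, phrases in aliases.items()}
print(mentions)  # {'male lead': 50, 'female lead': 47}
```

Other ways of naming either character that fall outside the top 50 are not included here, so both totals should be read as lower bounds.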

### Analyzing Reviews from 2019

Repeat the steps done for reviews from 2017 or 2018.

**1. Word frequency list**

In [48]:
review_texts_2019 = [review['Review Text'] for review in reviews_2019]

In [49]:
review_texts_2019[0:5]

["I was putting off watching this drama for a while. I always saw it under 'top dramas', my recommendation feed, reviews where it was called as another ISWAK/Itazura na Kiss/Playful Kiss, etc.\n\nBefore I began watching this I was only familiar with the male lead in a drama where his role wasn't too big. I did know that he is quite decent so decided to go ahead. Here's what I thought finally:\n\nSTORY: On the surface it comes off as another youth drama. However what I liked most about the drama was the simplicity of the plot. It's just about a group of friends, an annoying love triangle, a jerk male lead and ditz female lead. I can see the comparisons with ISWAK but nothing tops that. It would be fair to say this was at least inspired from that story. But what sets this drama apart are the likeable characters and the trivial-yet-relatable situations they go through. Their friendship was excellent and it was best part for me over the romantic plot. \n\nCHARACTERS AND LOVE TRIANGLE: I ha

In [50]:
print('review_texts_2019 has {} pieces of review text, each is a string object.'.format(len(review_texts_2019)))

review_texts_2019 has 24 pieces of review text, each is a string object.


In [51]:
review_texts_2019_dist = Counter()

for review_text in review_texts_2019:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2019_dist.update(tok)

In [52]:
review_token_2019_cnt = sum(review_texts_2019_dist.values())
print('There are {} tokens in review_texts_2019_dist.'.format(review_token_2019_cnt))

There are 6031 tokens in review_texts_2019_dist.


In [53]:
print(review_texts_2019_dist.most_common(50))

[('the', 287), ('i', 216), ('and', 184), ('a', 140), ('to', 114), ('of', 102), ('was', 97), ('is', 97), ('it', 96), ('this', 90), ('drama', 81), ('in', 71), ('but', 64), ('for', 63), ('that', 61), ('her', 49), ('he', 48), ('love', 48), ('chen', 43), ('with', 42), ('as', 40), ('so', 38), ('not', 35), ('you', 35), ('story', 34), ('jiang', 34), ('they', 33), ('how', 33), ('like', 32), ('xiaoxi', 32), ('at', 29), ('really', 29), ('me', 28), ('lead', 27), ('my', 26), ('when', 26), ('his', 25), ('just', 25), ('she', 25), ('on', 24), ('about', 24), ('their', 23), ('all', 23), ('one', 23), ('did', 22), ('also', 22), ('have', 22), ('are', 21), ('show', 21), ('more', 20)]


**A few observations**
* Words like *`i`*, *`me`*, and *`my`* still occur frequently. But third-person pronouns are now prominent as well: *`her`* appears 49 times, *`he`* appears 48 times, and *`she`* appears 25 times.
* The family name of the male lead character *`jiang`* appears 34 times, and the given name of the female lead character *`xiaoxi`* (the one-word form of *`Xiao Xi`*) appears 32 times, along with the frequent word *`chen`* (43 times), which is shared by both lead characters.
* Also notice the word *`lead`*, which appears 27 times.
* Together, these observations suggest that reviews from 2019 talk about the characters very often.
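
Because the three years have very different numbers of tokens, raw counts are hard to compare directly. One common remedy is to normalize to occurrences per 1,000 tokens. The sketch below does this for *`chen`*, using only the counts and token totals printed earlier in this notebook.

```python
# (count of 'chen', total tokens) as printed in the cells above.
chen_counts = {
    2017: (24, 7678),
    2018: (60, 16981),
    2019: (43, 6031),
}

for year, (count, total) in chen_counts.items():
    per_thousand = count / total * 1000
    print(year, round(per_thousand, 2))
# 2017 3.13
# 2018 3.53
# 2019 7.13
```

On this measure, *`chen`* is roughly twice as frequent in 2019 as in the two earlier years, which is in line with the observation above.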

**2. Bigram frequency list and trigram frequency list**

In [54]:
review_texts_2019_bigram_dist = Counter()
review_texts_2019_trigram_dist = Counter()

for review_text in review_texts_2019:
    tok = tokenize(review_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    bigram_tok = get_ngram_tokens(tok, n=2)
    trigram_tok = get_ngram_tokens(tok, n=3)
    
    review_texts_2019_bigram_dist.update(bigram_tok)
    review_texts_2019_trigram_dist.update(trigram_tok)

In [55]:
reviews_2019_bigram_token_cnt = sum(review_texts_2019_bigram_dist.values())
print('There are {} tokens in review_texts_2019_bigram_dist.'.format(reviews_2019_bigram_token_cnt))

reviews_2019_trigram_token_cnt = sum(review_texts_2019_trigram_dist.values())
print('There are {} tokens in review_texts_2019_trigram_dist.'.format(reviews_2019_trigram_token_cnt))

There are 6007 tokens in review_texts_2019_bigram_dist.
There are 5983 tokens in review_texts_2019_trigram_dist.


In [56]:
print(review_texts_2019_bigram_dist.most_common(50))
print('\n')
print(review_texts_2019_trigram_dist.most_common(50))

[('this drama', 36), ('jiang chen', 31), ('of the', 21), ('and i', 21), ('it was', 20), ('the drama', 17), ('but i', 17), ('i was', 14), ('and the', 14), ('in the', 14), ('a lot', 13), ('i loved', 13), ('the story', 12), ('the main', 12), ('xiao xi', 11), ('the end', 11), ('is the', 11), ('for me', 10), ('the second', 10), ('drama was', 9), ('second lead', 9), ('i would', 9), ('high school', 9), ('did a', 9), ('watching this', 8), ('when it', 8), ('if you', 8), ('to see', 8), ('that the', 8), ('that i', 8), ('lot of', 8), ('at the', 8), ('the only', 8), ('a little', 8), ('he is', 7), ('on the', 7), ('the plot', 7), ('female lead', 7), ('the other', 7), ('in this', 7), ('drama is', 7), ('for the', 7), ('this series', 7), ('the time', 7), ('it is', 7), ('is a', 7), ('lu yang', 7), ('this show', 7), ('with the', 6), ('it comes', 6)]


[('a lot of', 8), ('the second lead', 6), ('in the drama', 5), ('at the end', 5), ('did a great', 5), ('a great job', 5), ('when it comes', 5), ('it comes t

**A few observations**
* The bigram frequency list shows that *`jiang chen`* occurs 31 times, but the occurrences of *`xiao xi`* and *`female lead`* add up to only 18 (11 + 7). This suggests that reviewers in 2019 talk about the male lead character more than the female lead character. Note, however, that the one-word form *`xiaoxi`* occurs 32 times in the word frequency list and is not captured by these bigrams, so this inference should be treated with caution.
* More characters show up! The phrase *`second lead`* and the name of a supporting character, *`lu yang`*, are among the 50 most common bigrams. Viewers also care about characters other than the main couple!
* In the trigram frequency list, there are two phrases related to friends -- *`group of friends`* and *`their friendship was`*. Although the drama is mainly about love stories, it also contains the theme of friendship, and the viewers do not just ignore this theme while enjoying the sweet love between the characters.

## Concordance Analysis

Though word and phrase frequency analysis can give us a general idea of what the reviewers cover in their reviews, it cannot provide more detailed information about the reviewers' thoughts about the drama. For example, we see that words such as *`school`* and *`cute`* show up very often in the reviews, but we cannot tell from the frequency list in what contexts these words appear or how they are used. So for the next part, I am going to choose a word of interest from the word frequency list for each year and use concordance analysis to take a closer look at it. By investigating a specific word in context, we can identify patterns in how the word is used and discover meaning in these patterns.

### Investigating a word of interest for reviews in 2017

Let's recall the word frequency list (the most common 50 words) for reviews from 2017 and choose a word that we are interested in investigating.

In [57]:
print(review_texts_2017_dist.most_common(50))

[('the', 412), ('i', 235), ('and', 205), ('to', 181), ('a', 170), ('it', 162), ('is', 141), ('drama', 138), ('this', 137), ('of', 115), ('was', 94), ('that', 92), ('in', 86), ('so', 75), ('you', 72), ('for', 63), ('but', 61), ('love', 61), ('as', 55), ('my', 50), ('story', 45), ('not', 44), ('with', 44), ('me', 43), ('have', 42), ('their', 39), ('on', 38), ('they', 38), ('watch', 38), ('be', 37), ('school', 36), ('very', 35), ('all', 34), ('just', 34), ('cute', 32), ('really', 31), ('more', 30), ('high', 28), ('about', 27), ('how', 27), ('he', 27), ('like', 26), ('will', 26), ('beautiful', 25), ('because', 25), ('out', 25), ('at', 25), ('if', 25), ('chen', 24), ('from', 23)]


The word *`school`* appears 36 times. We already know that the drama is mainly about Jiang Chen and Xiao Xi's love story in school, but is there another "school" that the reviewers are referring to in their reviews? Would they write about their own love stories during their school life?

To examine the use of the word *`school`* by the reviewers, we need to:
1. create a list of tokens for the review texts from 2017
2. create a KWIC (key word in context) concordance object using the `make_kwic` function
3. obtain a random sample of 20 lines
4. examine the KWIC listing and make some observations using the `sort_kwic` and `print_kwic` functions
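
The `make_kwic`, `sort_kwic`, and `print_kwic` functions come from `functions.ipynb`, which is not reproduced in this notebook. As a rough illustration of the idea only, the sketch below builds key-word-in-context lines from a list of tokens. The name `kwic_sketch` and the `(left, keyword, right)` tuple layout are assumptions made for this example and may differ from what `make_kwic` actually returns.

```python
def kwic_sketch(keyword, tokens, win=6):
    """Return one (left, keyword, right) tuple per occurrence of keyword.

    Hypothetical stand-in for illustration; not the notebook's make_kwic.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - win):i]      # up to `win` tokens before the hit
            right = tokens[i + 1:i + 1 + win]     # up to `win` tokens after the hit
            lines.append((left, tok, right))
    return lines


# Tokens taken from one of the sample lines shown in the output further down
toy_tokens = "remember that she is a high school girl and at the time where".split()

for left, kw, right in kwic_sketch("school", toy_tokens, win=3):
    # Right-align the left context so the keyword lines up in a column
    print(f"{' '.join(left):>20}  {kw}  {' '.join(right)}")
```

Right-aligning the left context is what makes the keyword line up in a column, which is the layout that makes patterns around the keyword easy to scan.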

In [58]:
review_texts_2017_tokens = []

for rev_text in review_texts_2017:
    rev_text_tok = tokenize(rev_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2017_tokens.extend(rev_text_tok)

In [59]:
school = make_kwic('school', review_texts_2017_tokens, win=6)

In [60]:
random.seed(5)
school_sample = random.sample(school, 20)

In [61]:
print_kwic(sort_kwic(school_sample, ['L5']))

                         lead syndrome a very kawaii chinese  school  novel the story is not groundbreaking
                   adulthood changing feelings love and hate  school  challenges and youthful childhood so nostalgic
                                 remember that she is a high  school  girl and at the time where
                      very smoothly it transitions from high  school  life to adult life while focusing
                      five friends trudge through their high  school  years having teachers scold you finding
                                    from being a kid at high  school  to an adult facing reality the
                        us have experienced studying in high  school  although not everyone had experiences of
                           drama life lessons that cute high  school  puppy love story and of course
                                 i'm saying from from a high  school  student this drama is beautiful in
                             the drama watchin

**Some observations of *`school`***

* In this sample, the word *`school`* is used in multiple ways:
    1. It is referring to the "school" in the drama (e.g., lines 4, 5, 10-16 and 20)
    2. It is referring to "school" in general or "school" related to the reviewers (e.g., lines 2, 3, 6, 7, 9, 17 and 19)
    3. It is modifying a noun to give specific information (e.g., lines 1, 3, 8 and 18)


* Since the drama tells stories that take place in school, reviewers bring up the word *`school`* a lot in their reviews -- whether or not related to the drama. This shows that school is an element in the drama that impresses the audience. While love stories in school are a memorable point of the drama, these stories also remind the audience of their own experiences with school.

### Investigating a word of interest for reviews in 2018

Recall the word frequency list (the most common 50 words) for reviews from 2018 and choose a word that we are interested in investigating.

In [62]:
print(review_texts_2018_dist.most_common(50))

[('the', 745), ('i', 546), ('and', 440), ('a', 352), ('to', 336), ('it', 287), ('of', 256), ('this', 253), ('is', 250), ('drama', 245), ('was', 204), ('that', 199), ('but', 176), ('in', 170), ('so', 158), ('for', 153), ('love', 132), ('you', 128), ('me', 116), ('like', 102), ('really', 92), ('their', 91), ('all', 90), ('with', 89), ('they', 86), ('my', 85), ('just', 85), ('not', 82), ('as', 81), ('be', 80), ('on', 79), ('story', 75), ('he', 73), ("it's", 68), ('because', 63), ('more', 63), ('one', 63), ('she', 62), ('chen', 60), ('watch', 58), ('have', 58), ('school', 57), ('his', 57), ('her', 56), ('lead', 55), ('how', 55), ('are', 54), ('cute', 54), ('very', 54), ('also', 51)]


The word *`cute`* occurs 54 times. What are the reviewers describing when they use this word in their reviews?

Repeat the steps used for the 2017 reviews to examine how reviewers in 2018 use the word *`cute`*.

In [63]:
review_texts_2018_tokens = []

for rev_text in review_texts_2018:
    rev_text_tok = tokenize(rev_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2018_tokens.extend(rev_text_tok)

In [64]:
cute = make_kwic('cute', review_texts_2018_tokens, win=6)

In [65]:
random.seed(2)
cute_sample = random.sample(cute, 20)

In [66]:
print_kwic(sort_kwic(cute_sample, ['R1']))

                                 they portray her love is so  cute  and amazing never did once i
                              where they were just acting so  cute  and next ep was literally hell
                                    to the end it was really  cute  and funny i watched with my
                                 too far the drama was quite  cute  and i appreciated being able to
                                  feelings and i find it rly  cute  and i think i keep reassure
                                was decent also the girl was  cute  and funny whether the boy was
                                 to expect i found the story  cute  at first i thought it was
                                 it wasn't cute that's not a  cute  behaviour that's plainly annoying to watch
                                 male lead lol no matter how  cute  he is he is so immature
                       and entertaining for those who enjoys  cute  high school youthful dramas that eventually
     

**Some observations of the usage of *`cute`***

* In this random sample, the word *`cute`* is being used:
    1. To describe the drama (e.g., lines 4 and 10)
    2. To describe the actor(s) or character(s) (e.g., lines 6, 9, 11, 12, 17-20)
    3. To describe the love story or the love itself (e.g., lines 1, 7, 14, 15, 16)
    4. To describe an action or behavior (e.g., lines 8, 13)


* Most reviewers think that the drama (in terms of the scenes), the actors and the characters they play, the love between the characters and their love stories, and the things the characters do are *cute*.
* But there might also be exceptions. For example, in line 8, the reviewer does not think that what the characters do is cute. Though the reviewer uses the word *`cute`*, they actually use it with a negation.
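
Negated uses like this one could also be flagged automatically. The sketch below checks whether a negator appears in the few tokens just before the keyword. The helper name `has_negation`, the small negator list, and the three-token lookback are all assumptions chosen for illustration; the two test contexts are the left-hand sides of the first and eighth lines of the sample above.

```python
NEGATORS = {"not", "no", "never"}   # assumed, deliberately small list


def has_negation(left_tokens, lookback=3):
    """Return True if a negator occurs in the last `lookback` tokens before the keyword."""
    window = left_tokens[-lookback:] if lookback > 0 else []
    # Contractions such as "wasn't" are caught by their "n't" ending
    return any(tok in NEGATORS or tok.endswith("n't") for tok in window)


negated_context = "it wasn't cute that's not a".split()
plain_context = "they portray her love is so".split()

print(has_negation(negated_context))
print(has_negation(plain_context))
```

A check like this is only a heuristic: it assumes straight apostrophes in the tokens and would miss negations that sit further away from the keyword.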

### Investigating a word of interest for reviews in 2019

Recall the word frequency list (the most common 50 words) for reviews from 2019 and choose a word that we are interested in investigating.

In [67]:
print(review_texts_2019_dist.most_common(50))

[('the', 287), ('i', 216), ('and', 184), ('a', 140), ('to', 114), ('of', 102), ('was', 97), ('is', 97), ('it', 96), ('this', 90), ('drama', 81), ('in', 71), ('but', 64), ('for', 63), ('that', 61), ('her', 49), ('he', 48), ('love', 48), ('chen', 43), ('with', 42), ('as', 40), ('so', 38), ('not', 35), ('you', 35), ('story', 34), ('jiang', 34), ('they', 33), ('how', 33), ('like', 32), ('xiaoxi', 32), ('at', 29), ('really', 29), ('me', 28), ('lead', 27), ('my', 26), ('when', 26), ('his', 25), ('just', 25), ('she', 25), ('on', 24), ('about', 24), ('their', 23), ('all', 23), ('one', 23), ('did', 22), ('also', 22), ('have', 22), ('are', 21), ('show', 21), ('more', 20)]


The word *`lead`* occurs 27 times. How do the reviewers use this word in their reviews? Is it referring to the lead characters? Does it describe any other things?

Repeat the steps used for the 2017 and 2018 reviews to examine how reviewers in 2019 use the word *`lead`*.

In [68]:
review_texts_2019_tokens = []

for rev_text in review_texts_2019:
    rev_text_tok = tokenize(rev_text, lowercase=True, strip_chars='!"#$%&()*+,-.:;<=>?@[\\]^_`{|}~')
    review_texts_2019_tokens.extend(rev_text_tok)

In [69]:
lead = make_kwic('lead', review_texts_2019_tokens, win=6)

In [70]:
random.seed(0)
lead_sample = random.sample(lead, 20)

In [71]:
print_kwic(sort_kwic(lead_sample, ['R1']))

                           and feels jealous of busan second  lead  a swimmer and xiaoxi's relationship these
                     easygoing i absolutely loved the second  lead  also the main couple made wonderful
                          annoying love triangle a jerk male  lead  and ditz female lead i can
                                  a great job and the female  lead  based off the plot was incredibly
                           the male protagonist is the worst  lead  character ive seen flat without any
                            couple was love busan the second  lead  guy damn i felt was teary
                          the persistence that the main girl  lead  has when it comes to chasing
                               comes to chasing the main guy  lead  however the songs are okay this
                              jerk male lead and ditz female  lead  i can see the comparisons with
                           doesn't go too overboard the male  lead  is a little petty at times


**Some observations of the usage of *`lead`***

* In this sample, the word *`lead`* is used in two ways:
    1. to refer to a character (e.g., lines 1-13, 17-20)
    2. to form the term "second lead syndrome" (e.g., lines 14-16)
    
    
* Most of the time when reviewers use the word *`lead`*, they are referring to the main couple, Jiang Chen and Xiao Xi. But sometimes they also talk about the second lead, which shows that the second lead is also an impressive character.
* Some reviewers mention Second Lead Syndrome, which happens when viewers like the second male lead more than the first male lead and would like the first female lead to be with the second male lead at the end of the story. Though the reviewers in this sample do not really have Second Lead Syndrome, it is possible that other viewers have it. This may indicate that some viewers like the second lead (see line 2 as an example).

## Sentiment Analysis

For my research question, I am interested in looking at the overall polarity of a review (i.e., whether it is positive or negative). The overall polarity should give a sense of whether the drama has a good reputation, and it is the most straightforward indication of whether the audience feels good or bad about the drama. Arousal, dominance, and emotion might tell us about the specific feelings of the reviewer towards the drama (or even a particular scene in the drama), but those are not my focus. So, for sentiment analysis, I am going to look at the dimension of valence only.

The sentiment analysis tool I am going to use is VADER -- Valence Aware Dictionary and sEntiment Reasoner. The VADER lexicon has a list of items, each of which has a valence rating.

We need to set up the `SentimentIntensityAnalyzer` in order to use VADER.

I will assign a `SentimentIntensityAnalyzer` object to the variable `sid`. We can use this object along with the VADER lexicon to classify words (or tokens) in a text.

In [72]:
sid = SentimentIntensityAnalyzer()

I will not go deep into examining the valence of every single word in each review text (since the reviews are long), but will apply the `polarity_scores` method of the `SentimentIntensityAnalyzer` to each review text. For each valence category (*negative*, *neutral*, and *positive*), this method returns a score reflecting the proportion of the text that falls into that category. It also calculates a compound score for the text (between -1 and 1, where -1 is the most negative and 1 is the most positive).
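
To make the compound score a little less of a black box: as far as I understand VADER's reference implementation, the valence scores of the words in a text are summed (after some rule-based adjustments) and the sum `x` is then squashed into the range -1 to 1 with `x / sqrt(x^2 + alpha)`, where `alpha` is 15. The sketch below re-implements only that final step. The function name `normalize_sketch` and the raw sums fed into it are made up for illustration.

```python
import math


def normalize_sketch(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Made-up raw sums: the larger the sum, the closer the result gets to +1 or -1
for raw_sum in (-4.0, 0.0, 1.0, 4.0, 20.0):
    print(raw_sum, round(normalize_sketch(raw_sum), 4))
```

One consequence is that a long review containing many positive words can reach a compound score very close to 1, which is worth keeping in mind when reading the averages below.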

For the sentiment analysis, I will analyze the review texts by date and by overall score in the following order:
1. reviews from 2017
2. reviews from 2018
3. reviews from 2019
4. reviews with an overall score between 3.0 and 6.0
5. reviews with an overall score between 6.0 and 8.0
6. reviews with an overall score between 8.0 and 10
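
The ranges in items 4 to 6 share their endpoints, so a review scored exactly 6.0 or 8.0 has to be assigned to one side. The sketch below shows one possible convention in which each range includes its upper bound. This is an assumption for illustration only (the function name `score_range` and the labels are hypothetical), and the files used in this project may have been split differently.

```python
def score_range(overall_score):
    """Assign a score to a range, treating each upper bound as inclusive (an assumption)."""
    score = float(overall_score)    # scores are stored as strings such as '8.5'
    if score <= 3.0:
        return '0_to_3'
    if score <= 6.0:
        return '3_to_6'
    if score <= 8.0:
        return '6_to_8'
    return '8_to_10'


for s in ['5.0', '8.0', '8.5', '10']:
    print(s, score_range(s))
```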

### Sentiment Analysis by Date

### A. 2017 Reviews Sentiment Analysis

In [73]:
len(reviews_2017)

18

In [74]:
for rt in reviews_2017:      ## reviews_2017 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [75]:
reviews_2017[0:5]

[{'Date': 'Dec 30, 2017',
  'Overall Score': '8.5',
  'Review Text': 'Amidst all the positivity and perfection that others see in this drama, you can count on me to be a real grinch and center in on all the reasons that it was not entirely flawless. That is not to say I hated it; far from it, but I fail to share the unwavering, absolute love that many others seem to.\n\nMany things about A Love So Beautiful was a surprise to me. I\'ll be the first to admit that I harbor a real prejudice each time I enter the realm that is a Chinese drama - watching them throughout my childhood, they were unfailingly littered with a) tragedy, b) horrible CG, and/or c) gagworthy storylines (typically, an entertaining combination of all three). And quite honestly, few nowadays seem to impress me. Call it bias, but I still think the Mainland has much to learn regarding what constitutes a good show.\n\nA Love So Beautiful was different from the moment I picked it up. It does not present melodramatic conflic

Now, each review dictionary in `reviews_2017` also holds the four VADER scores (`neg`, `neu`, `pos`, and `compound`).

In [76]:
rt_pol_score_2017 = []

for rt in reviews_2017:
    pol_score = rt['compound']
    rt_pol_score_2017.append(pol_score)

sum(rt_pol_score_2017) / len(rt_pol_score_2017)
## calculate the average compound score of review texts from 2017

0.9808444444444445

Reviews from 2017 have an average compound score of 0.9808, which is very close to 1. Therefore, in general, reviewers who wrote a review for the drama in 2017 felt positive towards the drama.

We can also check the average overall score on the website (i.e., the average rating given by the reviewers) from reviews in 2017 to see if it suggests a similar attitude.

In [77]:
r_overallscore_2017 = []

for r in reviews_2017:
    overallscore = float(r['Overall Score'])
    r_overallscore_2017.append(overallscore)

sum(r_overallscore_2017) / len(r_overallscore_2017)

9.666666666666666

The highest overall score one can give to the drama on MyDramaList is 10. So, an average overall score of 9.6667 is high and we can say that it suggests a positive attitude of the reviewers from 2017 towards the drama.

I will repeat the above steps to conduct sentiment analysis on the reviews from 2018 and 2019 and on the reviews from each overall score range.
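
Since the same two loops are repeated for every subset, they could also be wrapped in a small helper. The following is only a sketch: `average_field` is a hypothetical name, and the two toy dictionaries stand in for real review records that already carry a `compound` key.

```python
from statistics import mean


def average_field(reviews, field):
    """Average a numeric field over a list of review dictionaries.

    Values go through float() because 'Overall Score' is stored as a string.
    """
    return mean(float(r[field]) for r in reviews)


# Toy records for illustration only (not real reviews)
toy_reviews = [
    {'Overall Score': '9.0', 'compound': 0.5},
    {'Overall Score': '10', 'compound': 1.0},
]

print(average_field(toy_reviews, 'compound'))
print(average_field(toy_reviews, 'Overall Score'))
```

With such a helper, each subset would need only two calls, one for `'compound'` and one for `'Overall Score'`.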

### B. 2018 Reviews Sentiment Analysis

In [78]:
len(reviews_2018)

58

In [79]:
for rt in reviews_2018:      ## reviews_2018 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [80]:
reviews_2018[0:5]

[{'Date': 'Dec 12, 2018',
  'Overall Score': '10',
  'Username': 'c00kie',
  'compound': 0.9944,
  'neg': 0.05,
  'neu': 0.6,
  'pos': 0.35},
 {'Date': 'Dec  8, 2018',
  'Overall Score': '9.5',
  'Review Text': 'A love so beautiful was a great love story. I loved the storyline and how it flowed, It wasn\'t rushed and it wasn\'t too slow it was perfect. I usually don\'t like high school centered dramas and when I found out that this story focused a lot of time in High school I almost didn\'t watch. However, The cast is awesome, great chemistry. I was drawn in throughout the storyline I didn\'t even know who to root for the second lead was as mature, sweet and handsome as the main lead.  I hated how the mother kind of disliked the second lead because of his "dark skin" but it wasn\'t a big deal and didn\'t overshadow the show at all it was just something that bugged me personally because I thought he was a cutie.  I love how the show didn\'t waste time on trivial things other shows would

In [81]:
rt_pol_score_2018 = []

for rt in reviews_2018:
    pol_score = rt['compound']
    rt_pol_score_2018.append(pol_score)

sum(rt_pol_score_2018) / len(rt_pol_score_2018)
## calculate the average compound score of review texts from 2018

0.960603448275862

Reviews from 2018 have an average compound score of 0.9606, which is very close to 1. Therefore, in general, reviewers who wrote a review for the drama in 2018 felt positive towards the drama.

We can also check the average overall score on the website (i.e., the average rating given by the reviewers) from reviews in 2018 to see if it suggests a similar attitude.

In [82]:
r_overallscore_2018 = []

for r in reviews_2018:
    overallscore = float(r['Overall Score'])
    r_overallscore_2018.append(overallscore)

sum(r_overallscore_2018) / len(r_overallscore_2018)

9.172413793103448

With the highest overall score being 10, an average overall score of 9.1724 is high and we can say that it suggests a positive attitude of the reviewers from 2018 towards the drama.

### C. 2019 Reviews Sentiment Analysis

In [83]:
len(reviews_2019)

24

In [84]:
for rt in reviews_2019:      ## reviews_2019 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [85]:
reviews_2019[0:5]

[{'Date': 'Dec 28, 2019',
  'Overall Score': '8.5',
  'Review Text': "I was putting off watching this drama for a while. I always saw it under 'top dramas', my recommendation feed, reviews where it was called as another ISWAK/Itazura na Kiss/Playful Kiss, etc.\n\nBefore I began watching this I was only familiar with the male lead in a drama where his role wasn't too big. I did know that he is quite decent so decided to go ahead. Here's what I thought finally:\n\nSTORY: On the surface it comes off as another youth drama. However what I liked most about the drama was the simplicity of the plot. It's just about a group of friends, an annoying love triangle, a jerk male lead and ditz female lead. I can see the comparisons with ISWAK but nothing tops that. It would be fair to say this was at least inspired from that story. But what sets this drama apart are the likeable characters and the trivial-yet-relatable situations they go through. Their friendship was excellent and it was best part f

In [86]:
rt_pol_score_2019 = []

for rt in reviews_2019:
    pol_score = rt['compound']
    rt_pol_score_2019.append(pol_score)

sum(rt_pol_score_2019) / len(rt_pol_score_2019)
## calculate the average compound score of review texts from 2019

0.7681833333333334

Reviews from 2019 have an average compound score of 0.7682, which is on the positive side but not very close to 1. Therefore, in general, reviewers who wrote a review for the drama in 2019 felt more positive than negative towards the drama, but the positive feeling is not very strong.

We can also check the average overall score on the website (i.e., the average rating given by the reviewers) from reviews in 2019 to see if it suggests a similar attitude.

In [87]:
r_overallscore_2019 = []

for r in reviews_2019:
    overallscore = float(r['Overall Score'])
    r_overallscore_2019.append(overallscore)

sum(r_overallscore_2019) / len(r_overallscore_2019)

8.229166666666666

An average overall score of 8.2292 out of 10 is not a very high score, but it still expresses a positive attitude towards the drama. Relative to its maximum, the average overall score on the website (8.2292 out of 10) suggests a more positive attitude than the average compound score given by VADER (0.7682 out of 1), but the general picture does not differ much.

### D. Sentiment Analysis by Date Summary

Reviews from 2017 have the highest average compound score among the three sets of reviews from different years; reviews from 2019 have the lowest, although it is still positive. We know that the drama was on air in 2017, so by 2019 it had been over for almost two years. Maybe the audience was more excited about the drama while it was still on air than after it had ended. The reason might be that while it was on air, the audience had to follow the broadcast schedule and had no chance of being spoiled; once the drama had finished broadcasting, there may have been spoilers and comments everywhere on the Internet, which might reduce (or, to put it more neutrally, affect) the audience's interest.

### Sentiment Analysis by Score

The goal of conducting the sentiment analysis by score is to see whether the attitude suggested by the average compound score given by VADER matches the attitude suggested by the overall score given by the reviewers.
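
Besides comparing averages per score range, one could also measure how closely the two measures track each other review by review. This is a different technique from the one used in this notebook: the sketch below computes a Pearson correlation coefficient by hand, using made-up numbers that are not taken from the reviews. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (not real review data)
toy_overall = [4.0, 6.5, 8.0, 9.5, 10.0]
toy_compound = [0.10, 0.40, 0.80, 0.90, 0.95]

print(round(pearson_r(toy_overall, toy_compound), 3))
```

A value near 1 would mean that reviews with higher overall scores also tend to receive higher compound scores; a value near 0 would mean that the two measures are largely unrelated.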

### A. Sentiment Analysis on Reviews with score 3.0 to 6.0

In [88]:
len(reviews_3_to_6)

7

In [89]:
for rt in reviews_3_to_6:      ## reviews_3_to_6 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [90]:
reviews_3_to_6[0:5]

[{'Date': 'May 22, 2018',
  'Overall Score': '5.0',
  'Review Text': "Story:\n\nChildhood neighbors who become a couple after knowing each other for several years. Very basic story line with no surprises. Although I like that the OTPs friends were important and had all their own story lines, this dragged out the drama.\n\nThe male lead was really static and seemed to have no emotions at all. The female lead on the other hand was often annoying (some might call it bubbly) and all over the place.\n\nAgain, the basic idea of showing their love from the beginning to end is great. However, the beginning took about 16 episodes (their high school years), while their collage and after-collage life was wrapped up in 7...this also means 16 episodes of no real deep or romantic encounters, and 7 episodes where they are adults, but again we don't see many lovely scenes. Actually, the last episode was the best, as there was a lot of love included! \n\nActing/Cast:\n\nAs this was my first cdrama I do

In [91]:
rt_pol_score_3_to_6 = []

for rt in reviews_3_to_6:
    pol_score = rt['compound']
    rt_pol_score_3_to_6.append(pol_score)

sum(rt_pol_score_3_to_6) / len(rt_pol_score_3_to_6)

0.2401857142857143

Reviews with an overall score between 3.0 and 6.0 have an average compound score of 0.2402, which is close to 0 but on the positive side. Therefore, in general, reviewers who rated the drama with a score between 3.0 and 6.0 have a rather neutral feeling towards the drama, leaning slightly positive.

(Recall: Earlier, we classified the score range 3.0 ~ 6.0 as meaning the drama is "so-so." So, the reviewers think that the drama is not bad, but they do not like it very much.)

### B. Sentiment Analysis on Reviews with score 6.0 to 8.0

In [92]:
len(reviews_6_to_8)

13

In [93]:
for rt in reviews_6_to_8:      ## reviews_6_to_8 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [94]:
reviews_6_to_8[0:5]

[{'Date': 'Oct 14, 2018',
  'Overall Score': '8.0',
  'Review Text': "The drama was quite cute and I appreciated being able to watch the main characters age and grow together, especially the high school portions of the drama. There were a few parts of the plot and the writing which I found particularly week, especially about the main couple's inability to communicate properly. After dating for a while, I wasn't expecting their first big spat to end as childishly as it did, and I found the heroine pretty selfish, leading on a guy when she knew she wasn't fully over her first love, dragging on a relationship for actual years, which was so disrespectful on her part. There were a lot of moments where she was too irrational and too brash, but I respected that despite that, it seemed to somehow work with her character. Their relationship was awkward, but it really felt right, and the ending was more than satisfying, I admit I shed a tear when I saw them walk down the path as adults, holding 

In [95]:
rt_pol_score_6_to_8 = []

for rt in reviews_6_to_8:
    pol_score = rt['compound']
    rt_pol_score_6_to_8.append(pol_score)

sum(rt_pol_score_6_to_8) / len(rt_pol_score_6_to_8)

0.8878000000000001

Reviews with an overall score between 6.0 and 8.0 have an average compound score of 0.8878, which is close to 1. Therefore, in general, reviewers who rated the drama with a score between 6.0 and 8.0 have a positive feeling towards the drama.

(Recall: We classified the score range 6.0 ~ 8.0 as meaning the drama is "good." So, the reviewers might approve of the drama and might recommend it to others, but they might not say that it is one of their favorite dramas.)

### C. Sentiment Analysis on Reviews with score 8.0 to 10

In [96]:
len(reviews_8_to_10)

80

In [97]:
for rt in reviews_8_to_10:      ## reviews_8_to_10 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [98]:
reviews_8_to_10[0:5]

[{'Date': 'Dec 30, 2017',
  'Overall Score': '8.5',
  'Review Text': 'Amidst all the positivity and perfection that others see in this drama, you can count on me to be a real grinch and center in on all the reasons that it was not entirely flawless. That is not to say I hated it; far from it, but I fail to share the unwavering, absolute love that many others seem to.\n\nMany things about A Love So Beautiful was a surprise to me. I\'ll be the first to admit that I harbor a real prejudice each time I enter the realm that is a Chinese drama - watching them throughout my childhood, they were unfailingly littered with a) tragedy, b) horrible CG, and/or c) gagworthy storylines (typically, an entertaining combination of all three). And quite honestly, few nowadays seem to impress me. Call it bias, but I still think the Mainland has much to learn regarding what constitutes a good show.\n\nA Love So Beautiful was different from the moment I picked it up. It does not present melodramatic conflic

In [99]:
rt_pol_score_8_to_10 = []

for rt in reviews_8_to_10:
    pol_score = rt['compound']
    rt_pol_score_8_to_10.append(pol_score)

sum(rt_pol_score_8_to_10) / len(rt_pol_score_8_to_10)

0.9822987499999998

Reviews with an overall score between 8.0 and 10 have an average compound score of 0.9823, which is extremely close to 1. Therefore, in general, reviewers who rated the drama with a score between 8.0 and 10 have a strong positive feeling towards the drama.

(Recall: We classified the score range 8.0 ~ 10 as meaning the drama is "excellent." So, the reviewers might like the drama very much and would be very happy to recommend it to others.)

### D. Sentiment Analysis by Score Summary

As expected, reviews with a score ranging from 8.0 to 10 have the highest average compound score, indicating that the reviewers who gave a score in this range have a positive feeling towards the drama. Reviews with a score in the range of 3.0 ~ 6.0 have the lowest average compound score, which indicates an almost neutral feeling towards the drama. In fact, there is an even lower score range, from 0 to 3.0, which is disregarded in the analysis since no review from 2017 to 2019 rates the drama with a score in this range. But if there were such reviews, what would the average compound score be? Would it be a negative number, given that the average compound score for reviews with a score ranging from 3.0 to 6.0 is close to the neutral value 0?

### Sentiment Analysis in General

Lastly, I want to conduct the sentiment analysis on all the reviews from 2017 to 2019 together and see what the average compound score is. Would this average compound score suggest a similar attitude to the one suggested by the average overall score shown on the website? In general, has *A Love So Beautiful* gained a good reputation outside of Mainland China?

In [100]:
with open('data/review_2017_to_2019.json') as f:    ## the file is closed once the reviews are loaded
    reviews_2017_to_2019 = json.load(f)

In [101]:
len(reviews_2017_to_2019)

100

In [102]:
for rt in reviews_2017_to_2019:      ## reviews_2017_to_2019 is a list of dictionaries
    rt_pol_score = sid.polarity_scores(rt['Review Text'])    ## calculate the polarity score of each review text
    rt.update(rt_pol_score)

In [103]:
reviews_2017_to_2019[0:5]

[{'Date': 'Dec 28, 2019',
  'Overall Score': '8.5',
  'Review Text': "I was putting off watching this drama for a while. I always saw it under 'top dramas', my recommendation feed, reviews where it was called as another ISWAK/Itazura na Kiss/Playful Kiss, etc.\n\nBefore I began watching this I was only familiar with the male lead in a drama where his role wasn't too big. I did know that he is quite decent so decided to go ahead. Here's what I thought finally:\n\nSTORY: On the surface it comes off as another youth drama. However what I liked most about the drama was the simplicity of the plot. It's just about a group of friends, an annoying love triangle, a jerk male lead and ditz female lead. I can see the comparisons with ISWAK but nothing tops that. It would be fair to say this was at least inspired from that story. But what sets this drama apart are the likeable characters and the trivial-yet-relatable situations they go through. Their friendship was excellent and it was best part f

In [104]:
rt_pol_score_2017_to_2019 = []

for rt in reviews_2017_to_2019:
    pol_score = rt['compound']
    rt_pol_score_2017_to_2019.append(pol_score)

sum(rt_pol_score_2017_to_2019) / len(rt_pol_score_2017_to_2019)

0.9180660000000005

The reviews have an average compound score of 0.9181, which is close to 1. Therefore, in general, reviewers who wrote a review for the drama felt positive towards it.
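
As a quick consistency check, the pooled average should equal the size-weighted mean of the subset averages reported earlier (18, 58, and 24 reviews by year; 7, 13, and 80 reviews by score range). The sketch below redoes that arithmetic with the numbers printed in this notebook; `weighted_mean` is a hypothetical helper name.

```python
def weighted_mean(pairs):
    """Size-weighted mean of (count, average) pairs."""
    pairs = list(pairs)
    total = sum(n for n, _ in pairs)
    return sum(n * avg for n, avg in pairs) / total


# (number of reviews, average compound score) as printed above
by_year = [(18, 0.9808444444444445), (58, 0.960603448275862), (24, 0.7681833333333334)]
by_score = [(7, 0.2401857142857143), (13, 0.8878000000000001), (80, 0.9822987499999998)]

print(round(weighted_mean(by_year), 4))     # 0.9181
print(round(weighted_mean(by_score), 4))    # 0.9181
```

Both splits reproduce the pooled 0.9181, so the yearly subsets and the score-range subsets each cover the same 100 reviews.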

We can also check the average overall score on the website (i.e., the average rating given by the reviewers) for reviews from 2017 to 2019 to see if it suggests a similar attitude.

In [105]:
r_overallscore_2017_to_2019 = []

for r in reviews_2017_to_2019:
    overallscore = float(r['Overall Score'])
    r_overallscore_2017_to_2019.append(overallscore)

sum(r_overallscore_2017_to_2019) / len(r_overallscore_2017_to_2019)

9.035

The drama has an average overall score of 9.035 out of a maximum of 10 on the website, which suggests that it is a good drama and that the majority of the audience (or at least of the reviewers) like it. This is consistent with the result we get from the sentiment analysis on all the reviews.