# Collecting and Organizing Data

This notebook shows how I collected drama reviews for the Chinese drama *A Love So Beautiful* from MyDramaList.com.

I organized these reviews in two ways: by date and by rating score.

### Setup

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt

import os
import pandas as pd
import numpy as np
import string
import re
import random
import time
import json
import csv

import requests
import bs4
from bs4 import BeautifulSoup

from collections import Counter

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import WordPunctTokenizer

### Functions

In [2]:
%run functions.ipynb

## Collecting the Corpus

For my project, I need English reviews of the Chinese drama *A Love So Beautiful*. I collected them from MyDramaList.com and kept the 100 reviews posted from December 2017 to December 2019.


**First, I downloaded the HTML for each page of reviews.**


To do this, I:
1. Used `requests` to get the HTML of each page;
2. Wrote the text of each response object to a file in the `data/raw_HTML` folder.

In [3]:
## make a "raw_HTML" folder in the "data" folder
if not os.path.exists('data/raw_HTML'):
    os.makedirs('data/raw_HTML')

In [4]:
user_agent = {'User-Agent':'Mozilla/5.0'}    ## the HTTP header name is "User-Agent" (with a hyphen)

page_url = 'https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page={}'

In [5]:
for page in range(1, 10):    ## We need 9 pages and the range is end exclusive, so the range should go from 1 to 10
    current_url = page_url.format(page)
    print('Retrieving', current_url)    ## to show the link for the page that is being retrieved

    resp = requests.get(current_url, headers = user_agent)
    
    while not resp.ok:
        print('error...retrying')
        time.sleep(2.0)     ## delay execution for 2 seconds
        resp = requests.get(current_url, headers = user_agent)
    
    html = resp.text
    
    with open('data/raw_HTML/page{}.html'.format(page), 'w') as out:
        out.write(html)

Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=1
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=2
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=3
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=4
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=5
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=6
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=7
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=8
Retrieving https://mydramalist.com/24644-a-love-so-beautiful/reviews?sort=recent&xlang=en-US&page=9
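
The `while not resp.ok` loop above keeps retrying for as long as a page fails to load. A bounded variant could look like the sketch below. `fetch_with_retries`, `max_tries`, and `delay` are hypothetical names that are not part of this notebook, and the `get` argument stands in for a call such as `requests.get(url, headers = user_agent)` so that the sketch runs without network access.

```python
import time


def fetch_with_retries(get, url, max_tries=5, delay=2.0):
    """Call get(url) until the response is ok or max_tries is reached.

    `get` is any callable that returns an object with an `ok` attribute,
    for example: lambda u: requests.get(u, headers=user_agent)
    """
    for attempt in range(1, max_tries + 1):
        resp = get(url)
        if resp.ok:
            return resp
        print('error on attempt {}...retrying'.format(attempt))
        if attempt < max_tries:
            time.sleep(delay)    # wait before the next attempt
    raise RuntimeError('gave up on {} after {} tries'.format(url, max_tries))
```

Raising an error after a fixed number of attempts means a page that is permanently unavailable stops the loop instead of hanging the notebook.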


In [6]:
os.listdir('data/raw_HTML')   ## show a list of file names in the directory

['page1.html',
 'page2.html',
 'page3.html',
 'page4.html',
 'page5.html',
 'page6.html',
 'page7.html',
 'page8.html',
 'page9.html']

**Then, I extracted the reviews from each downloaded HTML page.**

This takes four steps:
1. Load the HTML into a `string` object;
2. Parse the `string` object into `BeautifulSoup`;
3. Find all `<div>` elements that contain reviews;
4. Extract the *review text*, *username*, *overall score*, and *date* on which the review was posted.

Two functions were built to carry out these steps.
* `process_html` function - To load the HTML from the files saved earlier and to get a list of `<div class="review">` BeautifulSoup objects
* `process_review` function - To take a BeautifulSoup object for the `<div class="review">` and to return a dictionary that contains the *review text*, *username*, *overall score*, and *date*

In [7]:
def process_html(filename):
    
    ## 1. load HTML
    fpath = os.path.join('data', 'raw_HTML', filename)    ## join two or more pathname components
    
    if not os.path.exists(fpath):    ## test to see if a path exists
        return []                    ## no file, so there are no reviews to loop over
    
    print('Processing', fpath)
    
    with open(fpath) as f:
        html = f.read()      ## load HTML into a string object
    
    
    ## 2. create a BeautifulSoup object
    doc = BeautifulSoup(html, 'lxml')
    
    
    ## 3. find all <div class="review"> elements
    rev_divs = doc.find_all('div', attrs = {'class':'review'})
    
    num_of_rev = len(rev_divs)
    print('Found {} reviews'.format(num_of_rev))
    
    return rev_divs        ## return a list of <div class="review"> BeautifulSoup objects

In [8]:
def process_review(rev):
    
    username = rev.find('b').text
    overall_score = rev.find('span', attrs = {'class':'score pull-right'}).text
    time_posted = rev.find('small', attrs = {'class':'datetime'}).text
    
    review_dict = {}
    review_dict['Username'] = username
    review_dict['Overall Score'] = overall_score
    review_dict['Date'] = time_posted
    
    rev_body = rev.find('div', attrs = {'class':re.compile('review-body(full-read)?')})
        ## re.compile() compiles a regular expression pattern
        ## the "?" makes the preceding group "(full-read)" optional (zero or one occurrence)
    
    rev_text_list = rev_body.findChildren(text = True, recursive = False)
        ## recursive tells BeautifulSoup whether to go all the way down the parse tree
        ## recursive = False means only the immediate children of the tag are searched
    rev_text = '\n\n'.join(rev_text_list).strip()
    
    review_dict['Review Text'] = rev_text
    
    return review_dict
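
To see what the class pattern in `process_review` accepts, the sketch below applies it to a few class values with `re.search`. It assumes that BeautifulSoup tests a compiled pattern against class values with search semantics. The class strings are made-up examples, not copied from MyDramaList.

```python
import re

pattern = re.compile('review-body(full-read)?')

# Made-up class values, for illustration only.
candidates = [
    'review-body',
    'review-body full-read',
    'review-bodyfull-read',
    'review-footer',
]

for value in candidates:
    matched = pattern.search(value) is not None
    print('{!r:<26} {}'.format(value, matched))
```

Because the optional group has no space before `full-read`, any value that contains `review-body` matches, so under this assumption the pattern behaves like a plain search for `review-body`.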

In [9]:
html_files = []

for file in sorted(os.listdir('data/raw_HTML')):    ## os.listdir() returns names in arbitrary order, so sort them
    if file.endswith('.html'):
        html_files.append(file)

In [10]:
review_data = []

for page_html in html_files:
    ## process the HTML and get <div class="review"> BeautifulSoup objects
    rev_divs = process_html(page_html)

    ## extract username, overall score, review text, and date
    for rev in rev_divs:
        rev_info = process_review(rev)
        review_data.append(rev_info)

Processing data/raw_HTML/page1.html
Found 12 reviews
Processing data/raw_HTML/page2.html
Found 12 reviews
Processing data/raw_HTML/page3.html
Found 12 reviews
Processing data/raw_HTML/page4.html
Found 12 reviews
Processing data/raw_HTML/page5.html
Found 12 reviews
Processing data/raw_HTML/page6.html
Found 12 reviews
Processing data/raw_HTML/page7.html
Found 12 reviews
Processing data/raw_HTML/page8.html
Found 12 reviews
Processing data/raw_HTML/page9.html
Found 11 reviews


We now have a list of reviews.

Each review is a dictionary that has four keys: 1) Username, 2) Overall Score, 3) Review Text, and 4) Date.

In [11]:
len(review_data)

107

There are a total of 107 English reviews for *A Love So Beautiful* on MyDramaList.com, but 7 of them were posted in 2020, which is outside the time range I am interested in (i.e., December 2017 to December 2019). I will disregard these 7 reviews in the analysis.

In [13]:
print('Here is one of the reviews.')

review_data[21]

Here is one of the reviews.


{'Date': 'Jun  2, 2019',
 'Overall Score': '9.5',
 'Review Text': "To be honest, I went into this thinking I was going to hate it. The fact that the FL already had a huge crush on the ML and was super obvious and desperate about it was something I just knew I was going to cringe at—and that's the only reason why I couldn't give this one a full 10/10.\n\nObviously, I ended up enjoying this a LOT more than I expected. It was super cute and I loved all the little things that the ML did for the FL without her knowing. I know the second lead did a lot of stuff too—and that a lot of people had second lead syndrome but I actually did not, whew!\n\nI was also hesitant about this drama because of the time jumps that I knew happened throughout the later episodes. These are so easily ruined, but the drama actually did a really good job of dealing with the relationship growths during the time we didn't see.\n\nOverall, this was a really great story about romance and friendship, and I am so happy b

**Next, I saved the list of reviews as a JSON file.**

In [14]:
with open('data/all_reviews_info.json', 'w') as out:
    out.write(json.dumps(review_data, indent = 4))
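
A quick way to check a saved file is to load it back and compare it with the original list. The sketch below does this with a one-item stand-in list and a temporary directory, so `sample_reviews` and the temporary path are illustrative assumptions rather than this notebook's data.

```python
import json
import os
import tempfile

# Stand-in for review_data: one review with the same four keys.
sample_reviews = [{
    'Username': 'example_user',
    'Overall Score': '9.5',
    'Date': 'Jun  2, 2019',
    'Review Text': 'It was super cute.',
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'all_reviews_info.json')

    # Write the list as indented JSON.
    with open(path, 'w', encoding='utf-8') as out:
        json.dump(sample_reviews, out, indent=4)

    # Read it back into a list of dictionaries.
    with open(path, encoding='utf-8') as infile:
        reloaded = json.load(infile)

print(reloaded == sample_reviews)
```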

## Organizing the Reviews

**All reviews from 2017 to 2019**
* I created a file in the `data` folder to hold all the reviews posted from 2017 to 2019.

In [15]:
review_2017_to_2019 = []

for review in review_data:
    ## keep a review if its date mentions any of the three years
    if any(year in review['Date'] for year in ('2017', '2018', '2019')):
        review_2017_to_2019.append(review)

In [16]:
len(review_2017_to_2019)

100
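
The year filter above looks for the substrings `'2017'`, `'2018'`, and `'2019'` in each date. An alternative is to parse each date and compare years, as in the sketch below. It assumes every `Date` value follows the `'Jun  2, 2019'` pattern shown earlier and that month abbreviations are in English. `parse_review_date` and `sample_dates` are hypothetical names, and the sample dates are made up.

```python
from datetime import datetime


def parse_review_date(date_text):
    """Parse a date such as 'Jun  2, 2019' into a datetime object."""
    # Whitespace in the format matches one or more whitespace characters,
    # so the double space before a single-digit day is accepted.
    return datetime.strptime(date_text.strip(), '%b %d, %Y')


# Made-up dates, for illustration only.
sample_dates = ['Dec 15, 2017', 'Jun  2, 2019', 'Jan  5, 2020']

in_range = [d for d in sample_dates
            if 2017 <= parse_review_date(d).year <= 2019]
print(in_range)
```

Parsed dates also make it possible to sort reviews chronologically or to filter by month, which substring matching cannot do.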

In [17]:
with open('data/review_2017_to_2019.json', 'w') as out:
    out.write(json.dumps(review_2017_to_2019, indent = 4))

**Organize by Year**

In [28]:
review_2017 = []

for review in review_2017_to_2019:
    if '2017' in review['Date']:
        review_2017.append(review)

In [29]:
len(review_2017)

18

In [30]:
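os.makedirs('data/Date', exist_ok = True)    ## make sure the "Date" folder exists before writing to it

with open('data/Date/review_2017.json', 'w') as out:
    out.write(json.dumps(review_2017, indent = 4))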

In [31]:
review_2018 = []

for review in review_2017_to_2019:
    if '2018' in review['Date']:
        review_2018.append(review)

In [32]:
len(review_2018)

58

In [33]:
with open('data/Date/review_2018.json', 'w') as out:
    out.write(json.dumps(review_2018, indent = 4))

In [35]:
review_2019 = []

for review in review_2017_to_2019:
    if '2019' in review['Date']:
        review_2019.append(review)

In [36]:
len(review_2019)

24

In [37]:
with open('data/Date/review_2019.json', 'w') as out:
    out.write(json.dumps(review_2019, indent = 4))

In [38]:
print('There are {} reviews in total from 2017 to 2019.'.format(len(review_2017) + len(review_2018) + len(review_2019)))

There are 100 reviews in total from 2017 to 2019.
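
The three per-year cells follow the same pattern, so they could also be written as a single grouping pass. The sketch below groups a small made-up list by the last four characters of `Date`, assuming every date ends in a four-digit year as in `'Jun  2, 2019'`. `sample_reviews` and `by_year` are hypothetical names.

```python
from collections import defaultdict

# Made-up stand-in for review_2017_to_2019.
sample_reviews = [
    {'Username': 'a', 'Date': 'Dec 15, 2017'},
    {'Username': 'b', 'Date': 'Mar  3, 2018'},
    {'Username': 'c', 'Date': 'Jun  2, 2019'},
    {'Username': 'd', 'Date': 'Aug 21, 2018'},
]

by_year = defaultdict(list)
for review in sample_reviews:
    year = review['Date'].strip()[-4:]    # e.g. '2019'
    by_year[year].append(review)

for year in sorted(by_year):
    print('{}: {} reviews'.format(year, len(by_year[year])))
```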


**Organize by Score**

In [39]:
score_0_to_3 = []

for review in review_2017:
    if float(review['Overall Score']) <= 3:       ## float() converts a string to a floating point number
        score_0_to_3.append(review)
        
for review in review_2018:
    if float(review['Overall Score']) <= 3:
        score_0_to_3.append(review)

for review in review_2019:
    if float(review['Overall Score']) <= 3:
        score_0_to_3.append(review)

In [40]:
len(score_0_to_3)

0

In [41]:
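os.makedirs('data/Overall_Score', exist_ok = True)    ## make sure the "Overall_Score" folder exists before writing to it

with open('data/Overall_Score/score_0_to_3.json', 'w') as out:
    out.write(json.dumps(score_0_to_3, indent = 4))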

In [42]:
score_3_to_6 = []

for review in review_2017:
    if float(review['Overall Score']) > 3 and float(review['Overall Score']) <= 6:
        score_3_to_6.append(review)
        
for review in review_2018:
    if float(review['Overall Score']) > 3 and float(review['Overall Score']) <= 6:
        score_3_to_6.append(review)

for review in review_2019:
    if float(review['Overall Score']) > 3 and float(review['Overall Score']) <= 6:
        score_3_to_6.append(review)

In [43]:
len(score_3_to_6)

7

In [44]:
with open('data/Overall_Score/score_3_to_6.json', 'w') as out:
    out.write(json.dumps(score_3_to_6, indent = 4))

In [45]:
score_6_to_8 = []

for review in review_2017:
    if float(review['Overall Score']) > 6 and float(review['Overall Score']) <= 8:
        score_6_to_8.append(review)
        
for review in review_2018:
    if float(review['Overall Score']) > 6 and float(review['Overall Score']) <= 8:
        score_6_to_8.append(review)
        
for review in review_2019:
    if float(review['Overall Score']) > 6 and float(review['Overall Score']) <= 8:
        score_6_to_8.append(review)

In [46]:
len(score_6_to_8)

13

In [47]:
with open('data/Overall_Score/score_6_to_8.json', 'w') as out:
    out.write(json.dumps(score_6_to_8, indent = 4))

In [48]:
score_8_to_10 = []

for review in review_2017:
    if float(review['Overall Score']) > 8 and float(review['Overall Score']) <= 10:
        score_8_to_10.append(review)

for review in review_2018:
    if float(review['Overall Score']) > 8 and float(review['Overall Score']) <= 10:
        score_8_to_10.append(review)
        
for review in review_2019:
    if float(review['Overall Score']) > 8 and float(review['Overall Score']) <= 10:
        score_8_to_10.append(review)

In [49]:
len(score_8_to_10)

80

In [50]:
with open('data/Overall_Score/score_8_to_10.json', 'w') as out:
    out.write(json.dumps(score_8_to_10, indent = 4))

Here is a list of files in the `data` folder.

In [51]:
os.listdir('data')

['Date',
 '.ipynb_checkpoints',
 'Overall_Score',
 'raw_HTML',
 'all_reviews_info.json',
 'review_2017_to_2019.json']

Here is a list of files in the `data/Date` folder and in the `data/Overall_Score` folder, respectively.

In [52]:
os.listdir('data/Date')

['review_2017.json', 'review_2018.json', 'review_2019.json']

In [53]:
os.listdir('data/Overall_Score')

['score_0_to_3.json',
 'score_3_to_6.json',
 'score_6_to_8.json',
 'score_8_to_10.json']

I organized my data in two ways:
1. by date (on which the review was posted)
2. by overall score

These are also the two key dimensions of my analysis.

For the "date" dimension, I separated the review data by year.
* Reviews from 2017
* Reviews from 2018
* Reviews from 2019

Reviews from the same year are saved in one JSON file named `review_{year}.json`. These three files are in a folder named `Date`.

For the "overall score" dimension, I separated the review data by the overall score.
* Reviews with a score from 0 to 3.0 -- these scores mean the drama is "poor"
* Reviews with a score above 3.0 and up to 6.0 -- these scores mean the drama is "so-so"
* Reviews with a score above 6.0 and up to 8.0 -- these scores mean the drama is "good"
* Reviews with a score above 8.0 and up to 10 -- these scores mean the drama is "excellent"

Reviews whose scores fall in the same range are saved in one JSON file named `score_{}_to_{}.json`. These four files are in a folder named `Overall_Score`.
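
The four ranges can also be expressed as a single lookup. The sketch below uses `bisect` with the upper bounds 3, 6, 8, and 10, so a score that sits exactly on a boundary falls into the lower range, matching the `<=` comparisons in the cells above. `score_bucket`, `UPPER_BOUNDS`, and `BUCKET_NAMES` are hypothetical names that are not part of this notebook.

```python
from bisect import bisect_left

UPPER_BOUNDS = [3, 6, 8, 10]
BUCKET_NAMES = ['score_0_to_3', 'score_3_to_6', 'score_6_to_8', 'score_8_to_10']


def score_bucket(score_text):
    """Return the bucket name for a score string such as '9.5'."""
    score = float(score_text)
    if not 0 <= score <= 10:
        raise ValueError('score out of range: {}'.format(score_text))
    # bisect_left finds the first upper bound that is >= score.
    return BUCKET_NAMES[bisect_left(UPPER_BOUNDS, score)]


for text in ['3.0', '6.5', '8.0', '9.5']:
    print(text, score_bucket(text))
```

With a function like this, one loop over the reviews could fill all four lists instead of twelve separate loops.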

## Data Descriptives

In [54]:
print('{: <25}\t{}'.format('folder name', 'number of texts'))
print('=====' * 10)

for file in os.listdir('data'):
    doc = os.path.join('data', file)
    try:
        num_of_docs = len(os.listdir(doc))
    except NotADirectoryError:         ## skip the all_reviews_info.json and review_2017_to_2019.json
        continue
    print('{: <30}\t{: <3} texts'.format(file, num_of_docs))

folder name              	number of texts
==================================================
Date                          	3   texts
.ipynb_checkpoints            	0   texts
Overall_Score                 	4   texts
raw_HTML                      	9   texts


In the `Date` folder, each file contains a list of dictionaries, one dictionary per review.

In [55]:
os.listdir('data/Date')

['review_2017.json', 'review_2018.json', 'review_2019.json']

In [56]:
for year in os.listdir('data/Date'):
    with open('data/Date/{}'.format(year)) as rev_by_year:    ## close the file once it has been read
        num_of_reviews_by_year = len(json.load(rev_by_year))
    print('There are {} dictionaries in {}.'.format(num_of_reviews_by_year, year))

There are 18 dictionaries in review_2017.json.
There are 58 dictionaries in review_2018.json.
There are 24 dictionaries in review_2019.json.


Similarly, in the `Overall_Score` folder, each file contains a list of dictionaries, one dictionary per review.

In [57]:
for score in os.listdir('data/Overall_Score'):
    with open('data/Overall_Score/{}'.format(score)) as rev_by_score:    ## close the file once it has been read
        num_of_reviews_by_score = len(json.load(rev_by_score))
    print('There are {} dictionaries in {}.'.format(num_of_reviews_by_score, score))

There are 0 dictionaries in score_0_to_3.json.
There are 7 dictionaries in score_3_to_6.json.
There are 13 dictionaries in score_6_to_8.json.
There are 80 dictionaries in score_8_to_10.json.
