# Playground

Explore the raw Movies and TV review and metadata files to decide which fields to keep.

## Load and Check

In [4]:
import numpy as np
import pandas as pd
import json
from tqdm import tqdm
import random

# consts
ORI_COMMENT_PATH = '../data/original/Movies_and_TV.json'
ORI_META_PATH = '../data/original/meta_Movies_and_TV.json'
MAX_LINES = 1000 # read no more than this number of lines


In [5]:
# read original comment
comments = []
with open(ORI_COMMENT_PATH) as f:
    with tqdm(total=MAX_LINES) as pbar:
        for i in range(MAX_LINES):
            line = f.readline()
            try:
                comments.append(json.loads(line))
            except json.JSONDecodeError:  # malformed or empty line
                print(f'Error line at line #{i + 1}: {line}')
            pbar.update(1)

print(comments[10])

100%|██████████| 1000/1000 [00:00<00:00, 75875.18it/s]

{'overall': 5.0, 'verified': True, 'reviewTime': '09 11, 2012', 'reviewerID': 'A1XIXLXK9B4DAJ', 'asin': '0005089549', 'style': {'Format:': ' Audio CD'}, 'reviewerName': 'MMnMM', 'reviewText': 'Product received quickly from seller. Product was in great condition as stated. People who enjoy southern gospel music will be thrilled with this offering by the various Cathedral performers, with the exception of Payne and Younce, down through the years. Also quite happy with service. Would use seller again.', 'summary': 'A Reunion by Cathedral Quartet', 'unixReviewTime': 1347321600}
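
The metadata file further down is read with the same line-by-line loop, so the loop could be factored into a small helper. The following is a minimal sketch, not code from the notebook: `read_json_lines` is a hypothetical name, and it assumes each line of the file holds one JSON record. `itertools.islice` stops at `max_lines` or at the end of the file, whichever comes first, so a short file does not produce spurious error messages.

```python
import json
from itertools import islice


def read_json_lines(path, max_lines):
    """Parse up to max_lines JSON records from a file with one record per line.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    records = []
    with open(path) as f:
        # islice yields at most max_lines lines and stops early at end of file
        for i, line in enumerate(islice(f, max_lines), start=1):
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # report the bad line and keep going
                print(f'Error line at line #{i}: {line}')
    return records
```

With such a helper, the cell above would reduce to `comments = read_json_lines(ORI_COMMENT_PATH, MAX_LINES)`, at the cost of losing the progress bar.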





In [11]:
rand_idx = random.randrange(len(comments))  # randint(0, MAX_LINES) is inclusive and can go out of range
print(f'Randomly showing comment #{rand_idx}')
print(json.dumps(comments[rand_idx], indent=4))

Randomly showing comment #272
{
    "overall": 5.0,
    "verified": true,
    "reviewTime": "12 21, 2016",
    "reviewerID": "A1V0KH72H0BCDZ",
    "asin": "0005019281",
    "style": {
        "Format:": " Amazon Video"
    },
    "reviewerName": "Kurt",
    "reviewText": "It was great!!!!!Been years since we watched the movie.",
    "summary": "Five Stars",
    "unixReviewTime": 1482278400
}


Select the following fields from each *comment*:

- overall `float`
- unixReviewTime `int`
- reviewerName `string` (to be moved to *reviewer* later)
- reviewText `string` (newlines replaced)
- summary `string`
- asin `string`
- reviewerID `string`
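
The selection above can be sketched as a small function. This is only an illustration under stated assumptions: `select_comment_fields` is a hypothetical name, newlines in `reviewText` are replaced with a single space (the notes do not say what replaces them), and optional text fields default to an empty string when missing.

```python
def select_comment_fields(raw):
    """Keep only the selected review fields (hypothetical helper)."""
    return {
        'overall': float(raw['overall']),
        'unixReviewTime': int(raw['unixReviewTime']),
        # Assumption: missing text fields become empty strings
        'reviewerName': raw.get('reviewerName', ''),
        # Assumption: each newline is replaced with one space
        'reviewText': raw.get('reviewText', '').replace('\n', ' '),
        'summary': raw.get('summary', ''),
        'asin': raw['asin'],
        'reviewerID': raw['reviewerID'],
    }
```

Fields that are not listed, such as `verified`, `reviewTime`, and `style`, are dropped.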


In [12]:
# read original meta
metas = []
with open(ORI_META_PATH) as f:
    with tqdm(total=MAX_LINES) as pbar:
        for i in range(MAX_LINES):
            line = f.readline()
            try:
                metas.append(json.loads(line))
            except json.JSONDecodeError:  # malformed or empty line
                print(f'Error line at line #{i + 1}: {line}')
            pbar.update(1)

print(metas[10])

100%|██████████| 1000/1000 [00:00<00:00, 25327.77it/s]

{'category': ['Movies & TV', 'Genre for Featured Categories', 'Faith & Spirituality'], 'tech1': '', 'description': ['The angel showed Dumitru all of California, Las Vegas, New York City and Florida, and then said, "you see what I have shown you. This is Sodom and Gomorrah. In one day it will burn. Its sin has reached the Holy One. I love the people of this country and I want to save them, but America will burn."\nDumitru said, "It will start with an internal revolution in America, started by the Communists. Some of the people will start fighting against the government. The government will be busy with internal problems. Then, from the oceans, Russia, Cuba, Micaragua, Central America, Mexico, and two other countries which I cannot remember, will attack! The Russians will bombard the nuclear missile silos in America and America will burn. \n"In the church there is divorce, adultery, fornication, sodomy, abortion, and all kinds of sin. Jesus Christ lives in Holiness." The angel said, "Tel




In [24]:
rand_idx = random.randrange(len(metas))  # randint(0, MAX_LINES) is inclusive and can go out of range
print(f'Randomly showing meta #{rand_idx}')
print(json.dumps(metas[rand_idx], indent=4))

Randomly showing meta #420
{
    "category": [
        "Movies & TV",
        "Studio Specials",
        "Warner Home Video",
        "Warner Video Bargains",
        "Drama"
    ],
    "tech1": "",
    "description": [
        "First broadcast on HBO in June of 1998--shortly before the theatrical release of Steven Spielberg's <I>Saving Private Ryan</I>--this World War II drama offers an equally intimate and devastating study of combat and its tragic aftermath. Set in Germany during the closing days of the war, the film uses a little-known episode of U.S. military history--the bloody battle of the Hurtigen Forest--as the backdrop for the story of a battle-weary private (Ron Eldard) who is the only surviving member of his platoon. Despite his request for dismissal on the grounds of mental disability and shell-shock, he is considered a promising soldier by his superiors, promoted to sergeant, and assigned to command a fresh platoon of young, inexperienced soldiers. The cycle of war co

Select the following fields from each *meta* record (later stored as *product*):

- title `string`
- brand `string`
- description `string` (concat with comma, choose first)
- imageURLHighRes `string` (renamed to imageUrl, take the first entry)
- asin `string`
- price `float` (strip `$`, use -1 when empty)
- category `string[]` (renamed to categories, concat with comma, remove 'Movies & TV')
- rank `int` (keep only the numeric part)
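
A minimal sketch of this mapping follows. It rests on several assumptions, because the printed samples above are truncated before these fields appear: `to_product` is a hypothetical name, `price` is assumed to be text such as `'$12.99'`, `rank` is assumed to be free text that contains a number with optional thousands separators, and a missing rank is stored as -1 by analogy with the price rule. The sketch takes the first description entry and keeps `categories` as a list; joining with commas is left out.

```python
import re


def to_product(meta):
    """Map a raw meta record to the selected product fields (hypothetical helper)."""
    description = meta.get('description') or []
    images = meta.get('imageURLHighRes') or []

    # Assumption: price is text like '$12.99'; empty or unparsable values become -1
    price_text = str(meta.get('price', '')).replace('$', '').replace(',', '').strip()
    try:
        price = float(price_text)
    except ValueError:
        price = -1.0

    # Assumption: rank is free text containing a number; keep the first one found
    rank_match = re.search(r'\d[\d,]*', str(meta.get('rank', '')))
    rank = int(rank_match.group().replace(',', '')) if rank_match else -1

    return {
        'title': meta.get('title', ''),
        'brand': meta.get('brand', ''),
        'description': description[0] if description else '',
        'imageUrl': images[0] if images else '',
        'asin': meta['asin'],
        'price': price,
        'categories': [c for c in meta.get('category', []) if c != 'Movies & TV'],
        'rank': rank,
    }
```

Applying it to the loaded records would look like `products = [to_product(m) for m in metas]`.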