In [1]:
# !pip install ndjson

## Workflow

1. Read the video game review data. Take a look at the text of the reviews and the ratings, which you will work with in this milestone.
   - Note that your data is not pure JSON but newline-delimited JSON. To read it, install and import ndjson.
2. Import Altair and pandas.
3. Create a plot of the ratings of the product. Study the distribution of the five categories.
4. Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3, and 4, and 1500 reviews with rating 5. This way you get a smaller balanced corpus, on which you will work in Milestones 2-4.
   - It is recommended to use **imblearn.under_sampling.RandomUnderSampler** from the **imbalanced-learn** package. Install the package first, then import it.
   - If you want to get identical results to the reference solution, use 42 as a random state.
5. Take a random sample of the reviews by selecting 100,000 reviews. This way you get a bigger representative corpus, on which you will work in Milestones 4 and 5.
   - It is recommended to use **numpy.random.randint**.
   - If you want to get identical results to the reference solution, use 42 as a random state.
6. Export your corpora to two separate .csv files. Both of your tables should contain a column for the reviews and a column for the ratings. From now on, we refer to the values of the JSON key **reviewText** as “reviews” and the values of the key **overall** as “ratings.” Name your corpora small_corpus and big_corpus.
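
The per-rating targets in step 4 can be illustrated without any third-party package. The sketch below is not `RandomUnderSampler`; it is a plain standard-library stand-in that shows the idea of drawing a fixed number of items from each class. The toy ratings, the scaled-down targets, and the helper name `sample_per_class` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy ratings; the real corpus is far larger.
toy_ratings = [1.0] * 40 + [2.0] * 30 + [3.0] * 50 + [4.0] * 90 + [5.0] * 300

# Per-class targets, scaled down 100x from step 4 (1500/500/500/500/1500).
targets = {1.0: 15, 2.0: 5, 3.0: 5, 4.0: 5, 5.0: 15}


def sample_per_class(labels, targets, seed=42):
    """Return sorted indices containing targets[c] random items of each class c."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    picked = []
    for label, n in targets.items():
        # random.sample draws without replacement, so no index repeats.
        picked.extend(rng.sample(by_class[label], n))
    return sorted(picked)


sample_idx = sample_per_class(toy_ratings, targets)
print(len(sample_idx))
```

Fixing the seed plays the same role as the random state mentioned in the workflow: rerunning the sketch selects the same indices.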

In [2]:
import ndjson
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Read corpus
ratings = []
reviews = []
summaries = []

with open("..\\data\\raw\\Video_Games_5.json", "r") as infile:
    reader = ndjson.reader(infile)

    for review in reader:
        try:
            rating = review["overall"]
            rv = review["reviewText"]
            s = review["summary"]
        except KeyError:
            # Skip records that lack a rating, review text, or summary.
            continue
        if len(rv) > 0 and len(s) > 0:
            ratings.append(rating)
            reviews.append(rv)
            summaries.append(s)
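
For readers who want to see what newline-delimited JSON looks like, the same format can also be parsed with the standard library alone, one line at a time. This is a minimal sketch, not a replacement for the cell above: the two sample records and the helper name `read_ndjson` are hypothetical, and only the key names are taken from the code in this notebook.

```python
import io
import json

# Hypothetical two-record sample; the keys mirror those read in the cell above.
raw = (
    '{"overall": 5.0, "reviewText": "Great game.", "summary": "Loved it"}\n'
    '{"overall": 1.0, "reviewText": "Broken on arrival.", "summary": "Avoid"}\n'
)


def read_ndjson(handle):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_ndjson(io.StringIO(raw)))
print([record["overall"] for record in records])
```

Each line is a complete JSON document, which is why calling `json.load` on the whole file would fail while parsing line by line works.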

In [3]:
# Plot the distribution of the product ratings.
from collections import Counter
import altair as alt
import pandas as pd

# the distribution of review ratings
rating_counts = Counter(ratings)
data1 = pd.DataFrame(
    {
        "ratings": [str(rating) for rating in rating_counts],
        "counts": list(rating_counts.values()),
    }
)

chart1 = alt.Chart(data1).mark_bar().encode(x="ratings", y="counts")
chart1.save(".\\plots\\00\\rating_counts.html")

In [4]:
data1 

Unnamed: 0,ratings,counts
0,5.0,299541
1,4.0,93636
2,3.0,49138
3,2.0,24129
4,1.0,30872
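
To study the distribution, it can help to turn the raw counts into shares of the whole corpus. The sketch below copies the counts from the table above into a plain dictionary; the variable names are illustrative only.

```python
# Counts copied from the table above.
counts = {"5.0": 299541, "4.0": 93636, "3.0": 49138, "2.0": 24129, "1.0": 30872}

total = sum(counts.values())
shares = {rating: n / total for rating, n in counts.items()}

for rating in sorted(shares):
    print(f"{rating}: {shares[rating]:.1%}")
```

Five-star reviews make up roughly 60% of the corpus, which is the imbalance the sampling step is meant to address.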


In [5]:
###############################################################################
####                             Sampling                                 #####
###############################################################################
NUM_SMALL_SAMPLE = 1500
NUM_BIG_SAMPLE = 100000

indices = list(range(len(reviews)))

# Using the same seed (random_state=42) reproduces the same samples.
# Note: this cell draws NUM_SMALL_SAMPLE reviews from every rating class.
rus = RandomUnderSampler(
    sampling_strategy={1.0: NUM_SMALL_SAMPLE, 
                       2.0: NUM_SMALL_SAMPLE, 
                       3.0: NUM_SMALL_SAMPLE, 
                       4.0: NUM_SMALL_SAMPLE, 
                       5.0: NUM_SMALL_SAMPLE},
    random_state=42,
)
# fit_resample expects a 2-D feature array and a 1-D target array.
indices_sample, ratings_sample = rus.fit_resample(
    np.array(indices).reshape(-1, 1), np.array(ratings)
)

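indices_sample = indices_sample.flatten()
# Use a set for fast membership tests; scanning the array for every index is very slow.
sampled = set(indices_sample.tolist())
indices_other = [i for i in range(len(reviews)) if i not in sampled]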
reviews_sample = [reviews[i] + " " + summaries[i] for i in indices_sample]
reviews_other = [reviews[i] for i in indices_other]

np.random.seed(42)
big_sample = np.random.randint(len(reviews_other), size=NUM_BIG_SAMPLE)
reviews_to_be_saved = [reviews_other[i].replace("\n", " ") for i in big_sample]
reviews_to_be_saved = "\n".join(reviews_to_be_saved)

df = pd.DataFrame({"rating": ratings_sample, "review": reviews_sample})

# Write the small corpus as tab-separated values; pandas opens the file itself.
df.to_csv("..\\data\\raw\\review_corpus.tsv", index=False, sep="\t")

with open("..\\data\\raw\\reviews_without_ratings.txt", "w") as outfile:
    outfile.write(reviews_to_be_saved)
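
One property of the big sample is worth keeping in mind: `np.random.randint` draws each index independently, so the same review can be selected more than once. The sketch below shows the difference between drawing with and without replacement using the standard library as a stand-in for NumPy; the population size and sample size are hypothetical.

```python
import random

population = list(range(1000))  # stand-in for indices into reviews_other
k = 800

rng = random.Random(42)

# Independent draws, comparable to np.random.randint: repeats are possible.
with_replacement = [rng.randrange(len(population)) for _ in range(k)]

# random.sample never returns the same element twice.
without_replacement = rng.sample(population, k)

print(len(set(with_replacement)), len(set(without_replacement)))
```

If duplicates in the big corpus matter for later milestones, drawing without replacement is one option, although the results would then no longer match a reference solution built with `randint`.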

In [6]:
df

Unnamed: 0,rating,review
0,1.0,Yet another garbage CoD game. Zombies is unpla...
1,1.0,$80? .... No way. This is NOT worth $80. $80?....
2,1.0,One of the worst games ever. I bought and down...
3,1.0,I did a lot of homework before I decided to by...
4,1.0,"I am really into RPG games, I loved Skyrim, Bo..."
...,...,...
7495,5.0,"Worked good, girlfriend loves this game. Five ..."
7496,5.0,This is my 3rd Mystery PI game and I've enjoye...
7497,5.0,work like brand new wont brake any time soon
7498,5.0,This remote works fantastic. I love it. Five S...
