<table>
    <tr><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Introduction.ipynb">
         <img alt="start" src="figures/button_previous.jpg" width= 70% height= 70%></a>
    </td><td>
        <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Index.ipynb">
         <img alt="start" src="figures/button_table-of-contents.jpg" width= 70% height= 70%></a>
    </td><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/[Data_Exploration]Data_Visualization_ratings.ipynb">
         <img alt="start" src="figures/button_next.jpg" width= 70% height= 70%></a>
    </td></tr>
</table>

# Data lookup 

In this first section of the data exploration chapter, an initial data screening is conducted with descriptive statistics to provide a high-level understanding of the dataset.

This dataset contains reviews and metadata of video games sold on Amazon. It includes hundreds of thousands of reviews spanning May 1996 to July 2014.

In [1]:
import pandas as pd
import numpy as np

#Fix the random seed so that later steps that rely on randomness (e.g. classification) give the same result on every run
np.random.seed(500)
    
#[1] Read the data
Corpus = pd.read_json(r"C:\Users\Panos\Desktop\Dissert\Code\Video_Games_5.json", lines=True, encoding='latin-1')

#Remove reviews whose reviewText is missing (null)
Corpus = Corpus[~Corpus['reviewText'].isnull()]

#Print the first 3 rows
Corpus.iloc[:3]

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A2HD75EMZR8QLN | 700099867 | 123 | [8, 12] | Installing the game was a struggle (because of... | 1 | Pay to unlock content? I don't think so. | 1341792000 | 07 9, 2012 |
| 1 | A3UR8NLLY1ZHCX | 700099867 | Alejandro Henao "Electronic Junky" | [0, 0] | If you like rally cars get this game you will ... | 4 | Good rally game | 1372550400 | 06 30, 2013 |
| 2 | A1INA0F5CWW3J4 | 700099867 | Amazon Shopper "Mr.Repsol" | [0, 0] | 1st shipment received a book instead of the ga... | 1 | Wrong key | 1403913600 | 06 28, 2014 |
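
The `unixReviewTime` values in the rows above appear to be Unix timestamps in seconds, and they line up with the human-readable `reviewTime` column. The following is a minimal sketch, using only the standard library, that converts the three values shown into calendar dates. The helper name `to_utc_date` and the list `unix_review_times` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

# Three unixReviewTime values copied from the rows shown above
unix_review_times = [1341792000, 1372550400, 1403913600]


def to_utc_date(timestamp):
    """Convert a Unix timestamp in seconds to a calendar date in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


review_dates = [to_utc_date(ts) for ts in unix_review_times]

for ts, date in zip(unix_review_times, review_dates):
    print(ts, "->", date.isoformat())
```

On the full DataFrame, the same conversion could presumably be done in one step with `pd.to_datetime(Corpus['unixReviewTime'], unit='s')`, after which the minimum and maximum of the resulting column would give the time span covered by the reviews.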


As a first step, the data were imported from a JSON file. An initial screening then printed the first three rows of the dataset to show its structure. A few columns (reviewerID, reviewTime, etc.) are not needed for the main goal at this stage, although almost every column could later provide information that helps improve the model.

The two most important columns in any classification problem, not only in natural language processing (NLP), are the data the model is trained on and the labels it learns to assign. For this problem, those columns are 'reviewText' and 'overall'. In each row, 'reviewText' holds the review a user left for a specific product (a video game), and 'overall' holds the rating, from one to five, that the user gave that product. The model will therefore 'decide' which rating category a given review text fits best.

The next step is to print some useful counts from the dataset.
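
As a rough illustration of the roles of 'reviewText' and 'overall', the sketch below separates the two columns into model inputs and labels on a toy DataFrame built from the three rows shown earlier. The names `sample`, `X`, `y` and `label_counts` are assumptions made for this example, and the review texts are kept truncated exactly as they appear in the printed output. This is not the notebook's actual training pipeline.

```python
import pandas as pd

# Toy frame holding only the two columns of interest, copied from the
# three rows printed above (the review texts are truncated there too)
sample = pd.DataFrame({
    "reviewText": [
        "Installing the game was a struggle (because of...",
        "If you like rally cars get this game you will ...",
        "1st shipment received a book instead of the ga...",
    ],
    "overall": [1, 4, 1],
})

# Inputs the model would be trained on, and the labels it would predict
X = sample["reviewText"]
y = sample["overall"]

# Number of reviews per rating, ordered by rating value
label_counts = y.value_counts().sort_index()
print(label_counts)
```

Applied to the full `Corpus` DataFrame, the same two selections would yield one text and one rating per review, and the per-rating counts would show how balanced the five classes are.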

In [2]:
import os

# Count the unique video games (ASINs) and the unique reviewers
count_asin = Corpus['asin'].nunique()
count_reviewerID = Corpus['reviewerID'].nunique()

# Count the reviews written by each reviewer and received by each video game
reviewsPerUser = Corpus['reviewerID'].value_counts()
reviewsPerVG = Corpus['asin'].value_counts()

# Set data in variables for the dataframe
totalReviews = int(Corpus.shape[0])
file_size = os.path.getsize(r"C:\Users\Panos\Desktop\Dissert\Code\Video_Games_5.json")
# int() drops the decimals, so the file size and the averages are shown as whole numbers
fileSize = int(round((file_size/2**20),2))
avgPerUser = int(round(reviewsPerUser.mean(),2))
avgPerVG = int(round(reviewsPerVG.mean(),2))
# Smallest and largest number of reviews received by a single video game
minReviews = int(reviewsPerVG.min())
maxReviews = int(reviewsPerVG.max())

# Create a new dataframe to print a clear table
d1 = {'Description': ["Total reviews",
                      "File size (MB)",
                      "Total number of unique video games",
                      "Total number of unique reviewers",
                      "Average number of reviews per user", 
                      "Average number of reviews per videogame",
                      "Videogame with the minimum reviews",
                      "Videogame with the maximum reviews"],
      'Data': [totalReviews, fileSize, count_asin, count_reviewerID, avgPerUser, avgPerVG, minReviews, maxReviews]}

df1 = pd.DataFrame(data=d1)
df1

| | Description | Data |
|---|---|---|
| 0 | Total reviews | 231780 |
| 1 | File size (MB) | 304 |
| 2 | Total number of unique video games | 10672 |
| 3 | Total number of unique reviewers | 24303 |
| 4 | Average number of reviews per user | 9 |
| 5 | Average number of reviews per videogame | 21 |
| 6 | Videogame with the minimum reviews | 5 |
| 7 | Videogame with the maximum reviews | 802 |


The dataset consists of 231,780 records, which is not considered a lot for a classification problem, although it is enough data to build a model with decent accuracy. The dataset also contains reviews of 10,672 different games written by 24,303 different users, which means that a variety of different personalities is represented, helping to build a more general model.
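
The two averages in the table are whole numbers because the code wraps them in `int()`, which drops the decimals. As a quick check, the sketch below recomputes them from the totals reported in the table; the variable names are illustrative.

```python
# Totals copied from the summary table above
total_reviews = 231780
unique_games = 10672
unique_reviewers = 24303

# Every review has exactly one reviewer and one game, so each mean is
# simply the total number of reviews divided by the number of groups
avg_per_reviewer = total_reviews / unique_reviewers
avg_per_game = total_reviews / unique_games

print(f"Reviews per reviewer:   {avg_per_reviewer:.2f}")   # 9.54
print(f"Reviews per video game: {avg_per_game:.2f}")       # 21.72
```

Truncating these values with `int()` gives the 9 and 21 shown in the table, so the table slightly understates both averages.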

In [2]:
#Save this session for the next notebook
import dill
dill.dump_session('notebook_env.db')

In the following section a more detailed analysis will be done using data visualisation.

<a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/[Data_Exploration]Data_Visualization_ratings.ipynb">
         <img alt="start" src="figures/button_next.jpg"></a>