# Working with extracted Steam Reviews

Code for cleaning the extracted reviews (see the notebook "Steam-Reviewextraction"). Note that this code is designed for a workflow that excludes review data from a certain year (in this case, the reviews from 2022).

## (A) Loop through the files and save the reviews in lists according to their update date

### Necessary Modules

In [None]:
import json
import datetime 
import glob 
import codecs 
import csv 
import os 
import uuid
import arrow  
import re 

### A.1 Sort the reviews of a language according to their update-date and save the information of interest as sublist in a pre-defined list per year - reviewSorter()

This function sorts the reviews from the JSON files into lists by year. The required year list(s) must be created as empty list(s) beforehand.
If a year is excluded from the analysis, a list must still be created for it, because that list is used later as a reference for subtracting its reviews from the total. The function takes four parameters:
- date = the year as a string
- variable = the date variable to sort by (in this case Time_updated)
- liste = the previously created empty list
- regex = a previously defined regex that matches characteristic words of the language; this is necessary because the review text does not always match the assigned language

The function stores the following data per review: a randomized ID that replaces the Steam ID; the playtime up to the review; whether the game was bought by the player or received for free (e.g. gifted by the developer); and whether the review text matches the regex for the language of interest.

It also stores the creation date of the review and the date of its (possible) update.
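
As a rough illustration of the regex parameter and the randomized ID, the sketch below builds a pattern from a few frequent German words and applies it the way `reviewSorter()` does. The word list and the names `GERMAN_MARKERS`, `example_regex`, `looks_like_language` and `make_fake_id` are assumptions made for this example, not the pattern used for the actual data.

```python
import re
import uuid

# Hypothetical marker words: frequent in the target language, rare in others.
GERMAN_MARKERS = ["und", "nicht", "ist", "das", "spiel"]

# \b keeps short words from matching inside longer ones (e.g. "ist" in "list").
example_regex = r"\b(?:" + "|".join(GERMAN_MARKERS) + r")\b"


def looks_like_language(text: str, regex: str) -> str:
    """Return 'yes' if at least one marker word occurs in the text."""
    return "yes" if re.search(regex, text, flags=re.IGNORECASE) else "no"


def make_fake_id() -> str:
    """Return the first block of a random UUID4 (8 hex characters)."""
    return str(uuid.uuid4()).split("-")[0]


print(looks_like_language("Das Spiel ist super", example_regex))
print(looks_like_language("Great game on my wish list", example_regex))
print(make_fake_id())
```

A short word list like this one will misclassify very short reviews, so the result is best read as a hint rather than as a reliable language detection.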


In [None]:
ReviewList_2020 = []
ReviewList_2021 = []
SubstractedReviewList_2022 = [] 
LanguageRegex = ' ' # <-- contains 'defining' words for a language 

def reviewSorter(date, variable, liste: list, regex: str):
    # Relies on the globals reviewdata and Time_written that are set by the
    # loop in A.2, so it must be called inside that loop.
    if date in variable:
        # random pseudonym instead of the real Steam ID (data protection)
        gamerID = str(uuid.uuid4()).split('-')[0]

        if 'playtime_at_review' not in reviewdata['author']:
            gameTime_rounded = 0  # playtime not measurable
        else:
            # playtime is divided by 60 (minutes to hours)
            gameTime_rounded = round(reviewdata['author']['playtime_at_review'] / 60, 1)

        if reviewdata['received_for_free']:
            selfbought = 'no'
        else:
            selfbought = 'yes'

        # does the review text contain at least one word matched by the language regex?
        if re.search(regex, reviewdata['review'], flags=re.IGNORECASE):
            langCorrelation = 'yes'
        else:
            langCorrelation = 'no'

        reviews = reviewdata['review']
        # variable holds the update date (Time_updated) passed in by the caller
        liste.append([gamerID, Time_written, variable, gameTime_rounded, selfbought, langCorrelation, reviews])

### A.2 time-conversion and function call inside of a loop

Opens all JSON files, stores the reviews in a variable, converts the Unix timestamps to dates, and calls the reviewSorter function inside the loop.
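
The timestamp conversion on its own can be sketched as follows. The helper name `to_date_string` and the example record are assumptions for illustration. The sketch converts in UTC so that the result does not depend on the machine's time zone, whereas the loop in this notebook uses the local time zone.

```python
import datetime


def to_date_string(unix_timestamp: int) -> str:
    """Convert a Unix timestamp (seconds) to a 'YYYY-MM-DD' string in UTC."""
    moment = datetime.datetime.fromtimestamp(unix_timestamp, tz=datetime.timezone.utc)
    return f"{moment:%Y-%m-%d}"


# Hypothetical review record with only the two timestamp fields.
example_review = {"timestamp_created": 1609459200, "timestamp_updated": 1640995200}

print(to_date_string(example_review["timestamp_created"]))  # 2021-01-01
print(to_date_string(example_review["timestamp_updated"]))  # 2022-01-01
```

Reviews written close to midnight on 31 December can land in different years depending on the time zone used, which matters when a whole year is excluded.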

In [None]:
for json_file in glob.glob("path_to_local_stored_JSON/jsonfile*.json"):
    with codecs.open(json_file, 'r','utf-8') as file: 
        reviews = json.load(file)
            
        AllReviews = reviews['reviews']
        for reviewdata in AllReviews:
            date_created = reviewdata['timestamp_created']
            date_updated = reviewdata['timestamp_updated']
            realtime_created = datetime.datetime.fromtimestamp(date_created)
            realtime_updated = datetime.datetime.fromtimestamp(date_updated)
            Time_written = (f"{realtime_created:%Y-%m-%d}")
            Time_updated = (f"{realtime_updated:%Y-%m-%d}")  
          
            # reviewSorter() must be called inside the loop because it uses variables that are defined there
            reviewSorter('2020', Time_updated, ReviewList_2020, LanguageRegex) 
            reviewSorter('2021', Time_updated, ReviewList_2021, LanguageRegex)
            reviewSorter('2022', Time_updated, SubstractedReviewList_2022, LanguageRegex)               


### A.3 Prepare the lists for table generation in future

Inserts a header row at the start of each list for later table generation.

In [None]:
headerRow = ['gamerID', 'Review written on:', 'Review updated on:', 'Playtime up to review creation', 'Game was bought by player:', 'Language is consistent with extraction parameter:', 'reviewtext']

# each list gets its own copy of the header row
ReviewList_2020.insert(0, list(headerRow))
ReviewList_2021.insert(0, list(headerRow))
SubstractedReviewList_2022.insert(0, list(headerRow))


## (B) Create an overview table with basic information on the reviews

### B.1 Function for creating an overview table - ReviewWriter_overview()

This function extracts the necessary information from the first JSON file to create a general overview file for the given language. This information includes:
- Total number of reviews
- Total number of positive reviews
- Total number of negative reviews
- Number of reviews used (if years are excluded)
- Language
- Game ranking

The function must be passed the path to the first JSON file, the language as a string, the directory in which to save the table, and the list for the year that is excluded from the research.
A list for later calculations (calculatorList) is also initialized at this point; it is filled by this function.
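
To make the calculation of the number of reviews used more concrete, here is a minimal sketch on an in-memory example. The JSON excerpt and all numbers are invented; only the key names under `query_summary` are taken from the function in the next cell.

```python
import json

# Hypothetical excerpt of a first response file; the numbers are invented.
example_json = """
{
  "query_summary": {
    "review_score_desc": "Very Positive",
    "total_positive": 90,
    "total_negative": 10,
    "total_reviews": 100
  },
  "reviews": []
}
"""

summary = json.loads(example_json)["query_summary"]

# Header row followed by the rows of the excluded year (two reviews here).
excluded_list = [["header"], ["review a"], ["review b"]]
excluded_count = len(excluded_list[1:])  # the header row is not a review

used_reviews = summary["total_reviews"] - excluded_count
print(used_reviews)
```

Because the header row is skipped with `[1:]`, the header must already have been inserted into the excluded list (step A.3) before the overview function is called.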

In [None]:
baseinfo = "path_to_saved_json/firstjsonfile.json"
tableSavepath = "C:/Users/kitsu/JSON_FILES/jsongerman/CSV"
calculatorList= [] #for calculation operations during the YearlyOverviewCreator-function later on

In [None]:
def ReviewWriter_overview(pathToFirstJson, language: str, pathToSavingLocation, excludedList: list):
    info = []
    itemsInList = excludedList[1:]  # excluded reviews without the header row
    itemsToSubstract = len(itemsInList)
    print(itemsToSubstract)

    # Create a new directory if it does not exist
    if not os.path.exists(pathToSavingLocation):
        os.makedirs(pathToSavingLocation)
        print("The new directory is created!")

    with open(pathToFirstJson, 'r', encoding='utf-8') as firstfile:  # retrieve basic info on the reviews
        json_file = json.load(firstfile)

    ReviewSum = json_file['query_summary']['total_reviews']
    RealReviewCount = ReviewSum - itemsToSubstract
    calculatorList.append(ReviewSum)  # used later by YearlyOverviewCreator()
    ReviewPos = json_file['query_summary']['total_positive']
    ReviewNeg = json_file['query_summary']['total_negative']
    ReviewLang = language
    ReviewScore = json_file['query_summary']['review_score_desc']
    info.append([ReviewSum, ReviewPos, ReviewNeg, RealReviewCount, ReviewLang, ReviewScore])
    info.insert(0, ['Overall amount of reviews', 'Positive reviews', 'Negative reviews', 'Amount of used reviews', 'Language', 'Game ranking'])

    overviewTitle = f'Reviews{language}_GeneralOverview.csv'
    saveToFolder = os.path.join(pathToSavingLocation, overviewTitle)

    with open(saveToFolder, "w", newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(info)
            

#### Function call

In [None]:
ReviewWriter_overview(baseinfo, 'German', tableSavepath , SubstractedReviewList_2022)

### B.2 Optional: function to save the content of the lists (e.g. ReviewList_2020) as a table - ReviewWriter_all()

This function retrieves the data saved in the annual review lists (e.g. ReviewList_2020) and saves it in tables (one per year). These tables contain the following data:

- fake ID
- information about when a review was written and updated
- how long the game was played until the review was written
- whether the game itself was purchased
- and the review text

The function must be passed a path variable (initialized beforehand as tableSavepath), the language as a string, the respective list, and the respective year.
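
Review texts often contain commas, quotation marks, or line breaks, which is why the tables are written with the `csv` module rather than by joining strings. The following sketch writes to an in-memory buffer instead of a file; the rows are invented for illustration.

```python
import csv
import io

# Hypothetical rows; the review text contains a comma and quotation marks.
rows = [
    ["gamerID", "reviewtext"],
    ["1a2b3c4d", 'Gutes Spiel, aber "schwer"'],
]

# newline="" lets the csv module control the line endings itself.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Reading the text back yields the original rows, quoting included.
restored = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(restored == rows)
```

The same `newline=''` argument is passed to `open()` in the function below for the same reason.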


In [None]:
def ReviewWriter_all(pathToSavingLocation, language, liste: list, year: int):

    # Create a new directory if it does not exist
    if not os.path.exists(pathToSavingLocation):
        os.makedirs(pathToSavingLocation)
        print("The new directory is created!")

    filename = f'Reviews{language}_{year}.csv'
    saveToFolder = os.path.join(pathToSavingLocation, filename)

    with open(saveToFolder, "w", newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(liste)

In [None]:
ReviewWriter_all(tableSavepath, 'German', ReviewList_2020, 2020)
ReviewWriter_all(tableSavepath, 'German', ReviewList_2021, 2021)

## (C) Create Table with statistical information

###  C.1 Function for saving statistical data on the reviews in a list - YearlyOverviewCreator ()

This function takes the required information from the annual review lists (e.g. ReviewList_2020) to create one statistical overview per year.
The function needs three parameters: the annual list in question, the excluded annual list, and a prepared list that collects the results.
**Important**: The annual list in question must contain a _header row_ at index 0. <br>
The extracted data contains: <br>

- Year in question
- Number of reviews written/updated in that year
- Number of reviews written in the assigned language
- Percentage of users who bought the game themselves
- Average playtime before the review was written
- Number of updated reviews
- Average number of days between creation and update of a review

This function has to be called for every annual review list that was created. The information is saved as one sublist per year in the prepared list.
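
The core calculations can be sketched on a few invented rows. The example uses only the standard library (`datetime.date.fromisoformat` instead of `arrow`), which is a substitution made for this sketch; the row layout follows the annual lists without the header row.

```python
from datetime import date
from statistics import mean

# Hypothetical rows in the layout of the annual lists (header row omitted):
# [gamerID, written, updated, playtime_hours, selfbought, langCorrelation, text]
rows = [
    ["a1", "2020-01-01", "2020-01-01", 2.5, "yes", "yes", "..."],
    ["b2", "2019-12-30", "2020-01-09", 10.0, "no", "yes", "..."],
    ["c3", "2020-03-01", "2020-03-31", 0.5, "yes", "no", "..."],
]

# Days between creation and update, only for reviews that were updated.
day_gaps = [
    (date.fromisoformat(row[2]) - date.fromisoformat(row[1])).days
    for row in rows
    if row[1] != row[2]
]

# Guard against a year without any updated review (division by zero).
average_gap = round(mean(day_gaps), 2) if day_gaps else 0
bought_share = round(sum(row[4] == "yes" for row in rows) / len(rows) * 100, 2)
average_playtime = round(mean(row[3] for row in rows), 2)

print(len(day_gaps), average_gap, bought_share, average_playtime)
```

The guard on `day_gaps` matters in practice: a year in which no review was updated would otherwise stop the calculation with a `ZeroDivisionError`.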

In [None]:
StatisticalOverview = [['Year', 'Reviewamount', '... % of the overall amount of reviews', '... are in the assigned language', 'equals ...% of the annual review amount', '... written in another language', 'equals ...% of the annual review amount', '...% of gamers bought the game by themselves', 'average gametime up to review creation', '... of reviews updated', 'average amount of days between creation and update']]


def YearlyOverviewCreator(reviewliste: list, ignoreddata: list, newlist: list):
    # ignoreddata is currently unused; it is kept so that existing calls still work
    # temporary lists
    gameTime_List = []
    purchaseList = []
    upDateDays = []
    langList = []

    CleanedList = reviewliste[1:]  # excludes the header row of the list in question
    Reviewamount = len(CleanedList)  # counts the reviews
    if Reviewamount == 0:
        return  # nothing to summarize for this year
    allReviews = calculatorList[0]  # overall amount of reviews, saved by ReviewWriter_overview()
    Year = CleanedList[0][2][0:4]  # year of the update date of the first review

    PercentageOfAllReviews = str(round((Reviewamount / allReviews) * 100, 2)) + ' %'

    for sublist in CleanedList:
        gameTime_List.append(sublist[3])
        purchaseList.append(sublist[4])
        langList.append(sublist[5])
        initialReviewDate = sublist[1]
        updateReviewDate = sublist[2]

        if initialReviewDate != updateReviewDate:
            # remember: the reviews have been sorted by their update date
            days = (arrow.get(updateReviewDate) - arrow.get(initialReviewDate)).days
            upDateDays.append(days)

    Updates = len(upDateDays)
    # avoid a division by zero if no review was updated
    DaysBetweenUpdates = round(sum(upDateDays) / Updates, 2) if Updates else 0
    AverageGameTime = round(sum(gameTime_List) / len(gameTime_List), 2)
    PurchaseInfo = round((purchaseList.count('yes') / Reviewamount) * 100, 2)
    rightLang = langList.count('yes')
    rightLangPerc = str(round((rightLang / Reviewamount) * 100, 2)) + ' %'
    wrongLang = langList.count('no')
    wrongLangPerc = str(round((wrongLang / Reviewamount) * 100, 2)) + ' %'

    newlist.append([Year, Reviewamount, PercentageOfAllReviews, rightLang, rightLangPerc, wrongLang, wrongLangPerc, PurchaseInfo, AverageGameTime, Updates, DaysBetweenUpdates])

#### function call 

In [None]:
YearlyOverviewCreator (ReviewList_2020, SubstractedReviewList_2022, StatisticalOverview)
YearlyOverviewCreator (ReviewList_2021, SubstractedReviewList_2022, StatisticalOverview)

### C.2 Function to save the retrieved statistical data in a table - Statistical_Overview() 

This function writes the data saved in the prepared list (e.g. StatisticalOverview) to a table. The function needs the following parameters:

- the variable that contains the path to the save folder
- the language
- the list with the statistical data

In [None]:
def Statistical_Overview(pathToSavingLocation, language: str, liste: list):

    # Create a new directory if it does not exist
    if not os.path.exists(pathToSavingLocation):
        os.makedirs(pathToSavingLocation)
        print("The new directory is created!")

    filename = f'Reviews{language}_StatisticalOverview.csv'
    saveToFolder = os.path.join(pathToSavingLocation, filename)

    with open(saveToFolder, "w", newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(liste)  # liste is the list with the statistical data

Statistical_Overview(tableSavepath , 'German', StatisticalOverview)

## (D) Save the review texts as TXT files

### D.1 Function to save the reviewtexts per year in a TXT-file to further analyse the reviewtexts - ReviewToTXT ()

This function extracts the review texts from a list (e.g. ReviewList_2020) and saves them as a TXT file. The function needs the following parameters:<br>
- annual review list
- language
- year
- variable which contains the path to the storage folder of the txt files

Since the reviews are now checked for possible language mismatches, an empty list for all unsuitable reviews is created. During the function call, all unsuitable reviews are written to this list.
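
The split into suitable and unsuitable reviews can be sketched as follows. The list contents are invented; the column positions (index 5 for the language flag, index 6 for the review text) follow the annual lists built in part A.

```python
# Hypothetical annual list: header row first, then one sublist per review.
annual_list = [
    ["gamerID", "written", "updated", "playtime", "bought", "language ok", "reviewtext"],
    ["a1", "2020-01-01", "2020-01-01", 2.5, "yes", "yes", "Tolles Spiel"],
    ["b2", "2020-02-01", "2020-02-03", 1.0, "yes", "no", "Nice game"],
]

kept, dropped = [], []
for row in annual_list[1:]:  # skip the header row explicitly
    if row[5] == "yes":
        kept.append(row[6])
    else:
        dropped.append(row[6])

print(kept)
print(dropped)
```

Skipping the header row before the split keeps it out of both result lists, so nothing has to be removed from them afterwards.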

In [None]:
pathToTXT = "pathToStoreTXTfiles"

DropoutReviewsAll = [] #for non-suitable reviews

In [None]:
def ReviewToTXT(reviewListe: list, language: str, year, pathvariable):
    # temporary lists
    ReviewCollection = []
    DropoutReviews = []

    for reviews in reviewListe[1:]:  # skip the header row
        Rev = reviews[6]
        if reviews[5] == 'yes':
            ReviewCollection.append(Rev)
        else:
            DropoutReviews.append(Rev)

    # Create a new directory if it does not exist
    if not os.path.exists(pathvariable):
        os.makedirs(pathvariable)
        print("The new directory is created!")

    filename = f'Reviews{language}-{year}.txt'
    saveToFolder = os.path.join(pathvariable, filename)

    # write the file once, after all reviews have been sorted
    with open(saveToFolder, "w", encoding='utf-8') as txtfile:
        for item in ReviewCollection:
            txtfile.write("%s\n" % item)

    DropoutReviewsAll.append(DropoutReviews)

#### Function call 

In [None]:
ReviewToTXT (ReviewList_2020, 'German', 2020, pathToTXT)
ReviewToTXT (ReviewList_2021, 'German', 2021, pathToTXT)

### D.2 Save the Dropout Reviews 

You can also save the dropout reviews if necessary. Keep in mind that they are not ordered by date. All dropout reviews are saved in a single TXT file.

In [None]:
dropout_filename = 'Dropoutfiles_german.txt'
SavePath = os.path.join(pathToTXT, dropout_filename)

with open(SavePath, "w", encoding='utf-8') as txtfile:
    for content in DropoutReviewsAll:
        for review in content:
            txtfile.write("%s\n" % review)
