In [1]:
import numpy as np
import pandas as pd
import spacy
import re
import data_utils

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
train = pd.read_csv("train_data/train.csv")
corpus = list(train['user_review'].values)

In [5]:
train.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,1,Spooky's Jump Scare Mansion,2016.0,I'm scared and hearing creepy voices. So I'll...,1
1,2,Spooky's Jump Scare Mansion,2016.0,"Best game, more better than Sam Pepper's YouTu...",1
2,3,Spooky's Jump Scare Mansion,2016.0,"A littly iffy on the controls, but once you kn...",1
3,4,Spooky's Jump Scare Mansion,2015.0,"Great game, fun and colorful and all that.A si...",1
4,5,Spooky's Jump Scare Mansion,2015.0,Not many games have the cute tag right next to...,1


In [130]:
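# Expand contractions before punctuation (including apostrophes) is stripped.
text_sent = [data_utils.expand_contractions(sent) for sent in corpus]
# Replace URLs with a space.
text_sent = [re.sub(r"https?://\S+", " ", text) for text in text_sent]
text_sent = [data_utils.remove_accented_chars(sent) for sent in text_sent]
# Replace punctuation (and, on the next line, backslashes) with a space.
text_sent = [re.sub(r"[-!\"#$%&'()*+,./:;<=>?@\][^_`|}{~']", " ", text) for text in text_sent]
text_sent = [text.replace("\\", " ") for text in text_sent]
# Collapse whitespace runs, then lowercase.
text_sent = [re.sub(r"\s+", " ", sent) for sent in text_sent]
text_sent = list(map(str.lower, text_sent))
# Only lemmas are needed, so skip NER and the dependency parser.
docs = nlp.pipe(text_sent, disable=["ner", "parser"])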
cleaned_corpus = []
for doc in docs:
    doc_text = []
    for token in doc:
        if token.lemma_ != '-PRON-':
            doc_text.append(token.lemma_)
        else:
            doc_text.append(token.text)
    cleaned_corpus.append(doc_text)
cleaned_corpus = [" ".join(cleaned_text) for cleaned_text in cleaned_corpus]

def give_train_clean():
    for i in range(train.shape[0]):
        yield[train.user_review.values[i],cleaned_corpus[i]]
gtc = give_train_clean()
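
The regex steps in this cell are repeated verbatim for the overview and test data further down. A minimal sketch of how they could be collected into one helper follows. The name `normalize_text` is hypothetical, and the accent stripping uses `unicodedata` as an assumed stand-in for `data_utils.remove_accented_chars`, whose implementation is not shown; contraction expansion is left out for the same reason.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
# Same punctuation class as the cell above, with the backslash folded in.
PUNCT_RE = re.compile(r"[-!\"#$%&'()*+,./:;<=>?@\][^_`|}{~\\]")
WS_RE = re.compile(r"\s+")


def strip_accents(text):
    """Assumed stand-in for data_utils.remove_accented_chars."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_text(text):
    """Remove URLs, accents and punctuation, collapse whitespace, lowercase."""
    text = URL_RE.sub(" ", text)
    text = strip_accents(text)
    text = PUNCT_RE.sub(" ", text)
    text = WS_RE.sub(" ", text)
    return text.lower()


print(normalize_text("Great café! See https://example.com/page now."))
```

With a helper like this, each dataset would need only one list comprehension before the call to `nlp.pipe`.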

In [205]:
vals = next(gtc)
print("{:-^50}".format("ORIGINAL"))
print(vals[0])
print("{:-^50}".format(""))
print("")
print("{:-^50}".format("CLEANED"))
print(vals[1])
print("{:-^50}".format(""))

---------------------ORIGINAL---------------------
Best game, more better than Sam Pepper's YouTube account. 10/10What you'll need to play:A computerSome extra pants.Pros:Scary as hell.Fun.Adventure.Spooky.Did I forgot to mention that its scary as hell?You'll get more pants/briefs in your wardrobe.Time consuming if you're bored.Cons:Buying pants/briefs. You haven't downloaded it yet.
--------------------------------------------------

---------------------CLEANED----------------------
good game more well than sam peppers youtube account 10 10what you will need to play a computersome extra pant pro scary as hell fun adventure spooky do i forget to mention that its scary as hell you will get more pant brief in your wardrobe time consume if you be bored con buy pant brief you have not download it yet
--------------------------------------------------
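
The cleaned text still contains fused tokens such as `10 10what`, because the source review has no space between `10/10` and `What`. One possible pre-step, which is not part of this notebook, is to insert a space between a digit and a following uppercase letter before lowercasing. The helper name below is hypothetical, and the rule is deliberately narrow: it would also split strings like `3D`, and it does not address `computerSome`, since splitting on every lowercase-to-uppercase boundary would break names like `YouTube`.

```python
import re

# Zero-width position between a digit and an uppercase letter, e.g. "10What".
DIGIT_UPPER_RE = re.compile(r"(?<=\d)(?=[A-Z])")


def split_digit_upper(text):
    """Insert a space between a digit and a following uppercase letter."""
    return DIGIT_UPPER_RE.sub(" ", text)


print(split_digit_upper("10/10What you'll need"))  # 10/10 What you'll need
```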


In [200]:
final_frame = train.copy()
final_frame['user_review'] = cleaned_corpus

In [204]:
final_frame.to_csv("latest_cleaned_data/train_lemmatized_withstop.csv",index=False)

In [206]:
"""OVERVIEW CLEANING"""

'OVERVIEW CLEANING'

In [207]:
overview = pd.read_csv("train_data/game_overview.csv")
overview.head()

Unnamed: 0,title,developer,publisher,tags,overview
0,Spooky's Jump Scare Mansion,Lag Studios,Lag Studios,"['Horror', 'Free to Play', 'Cute', 'First-Pers...",Can you survive 1000 rooms of cute terror? Or ...
1,Sakura Clicker,Winged Cloud,Winged Cloud,"['Nudity', 'Anime', 'Free to Play', 'Mature', ...",The latest entry in the Sakura series is more ...
2,WARMODE,WARTEAM,WARTEAM,"['Early Access', 'Free to Play', 'FPS', 'Multi...",Free to play shooter about the confrontation o...
3,Fractured Space,Edge Case Games Ltd.,Edge Case Games Ltd.,"['Space', 'Multiplayer', 'Free to Play', 'PvP'...",Take the helm of a gigantic capital ship and g...
4,Counter-Strike: Global Offensive,"Valve, Hidden Path Entertainment",Valve,"['FPS', 'Multiplayer', 'Shooter', 'Action', 'T...",Counter-Strike: Global Offensive (CS: GO) expa...


In [209]:
corpus = list(overview['overview'].values)
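# Expand contractions before punctuation (including apostrophes) is stripped.
text_sent = [data_utils.expand_contractions(sent) for sent in corpus]
# Replace URLs with a space.
text_sent = [re.sub(r"https?://\S+", " ", text) for text in text_sent]
text_sent = [data_utils.remove_accented_chars(sent) for sent in text_sent]
# Replace punctuation (and, on the next line, backslashes) with a space.
text_sent = [re.sub(r"[-!\"#$%&'()*+,./:;<=>?@\][^_`|}{~']", " ", text) for text in text_sent]
text_sent = [text.replace("\\", " ") for text in text_sent]
# Collapse whitespace runs, then lowercase.
text_sent = [re.sub(r"\s+", " ", sent) for sent in text_sent]
text_sent = list(map(str.lower, text_sent))
# Only lemmas are needed, so skip NER and the dependency parser.
docs = nlp.pipe(text_sent, disable=["ner", "parser"])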
cleaned_corpus = []
for doc in docs:
    doc_text = []
    for token in doc:
        if token.lemma_ != '-PRON-':
            doc_text.append(token.lemma_)
        else:
            doc_text.append(token.text)
    cleaned_corpus.append(doc_text)
cleaned_corpus = [" ".join(cleaned_text) for cleaned_text in cleaned_corpus]

def give_train_clean():
    for i in range(overview.shape[0]):
        yield[overview.overview.values[i],cleaned_corpus[i]]
gtc = give_train_clean()
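
The preview generator indexes both sequences by row position, so it depends on the row count of whichever DataFrame is named in `range(...)`. A sketch of an alternative that pairs the two sequences directly with `zip` is shown here; `paired_preview` is a hypothetical name and the toy strings stand in for the real columns.

```python
def paired_preview(originals, cleaned):
    """Yield (original, cleaned) pairs; stops at the shorter input."""
    yield from zip(originals, cleaned)


# Toy data for illustration only, not taken from the datasets.
originals = ["First RAW text!", "Second RAW text?"]
cleaned = ["first raw text", "second raw text"]

pairs = paired_preview(originals, cleaned)
original, clean = next(pairs)
print("{:-^50}".format("ORIGINAL"))
print(original)
print("{:-^50}".format("CLEANED"))
print(clean)
```

Because `zip` takes its length from the inputs, the same function works unchanged for the train, overview, and test data.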

In [210]:
vals = next(gtc)
print("{:-^50}".format("ORIGINAL"))
print(vals[0])
print("{:-^50}".format(""))
print("")
print("{:-^50}".format("CLEANED"))
print(vals[1])
print("{:-^50}".format(""))

---------------------ORIGINAL---------------------
Can you survive 1000 rooms of cute terror? Or will you break once the cuteness starts to fade off and you're running for your life from the unspeakable hideous beings that shake and writhe in bowels of this house? They wait for you, they wait and hunger for meeting you. They long to finally meet you and show you how flexible your skin can be after it has soaked in blood. Will you brave this journey, will you set to beat the impossible, the insane, and the incorporeal?
--------------------------------------------------

---------------------CLEANED----------------------
can you survive 1000 room of cute terror or will you break once the cuteness start to fade off and you be run for your life from the unspeakable hideous being that shake and writhe in bowel of this house they wait for you they wait and hunger for meet you they long to finally meet you and show you how flexible your skin can be after it have soak in blood will you brave t

In [211]:
final_frame = overview.copy()
final_frame['overview'] = cleaned_corpus
final_frame.to_csv("latest_cleaned_data/overview_lemmatized_withstop.csv",index=False)

In [212]:
"""TEST DATA"""

'TEST DATA'

In [213]:
test = pd.read_csv("test_data/test.csv")
test.head()

Unnamed: 0,review_id,title,year,user_review
0,1603,Counter-Strike: Global Offensive,2015.0,"Nice graphics, new maps, weapons and models. B..."
1,1604,Counter-Strike: Global Offensive,2018.0,I would not recommend getting into this at its...
2,1605,Counter-Strike: Global Offensive,2018.0,Edit 11/12/18I have tried playing CS:GO recent...
3,1606,Counter-Strike: Global Offensive,2015.0,The game is great. But the community is the wo...
4,1607,Counter-Strike: Global Offensive,2015.0,I thank TrulyRazor for buying this for me a lo...


In [218]:
corpus = list(test['user_review'].values)
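# Expand contractions before punctuation (including apostrophes) is stripped.
text_sent = [data_utils.expand_contractions(sent) for sent in corpus]
# Replace URLs with a space.
text_sent = [re.sub(r"https?://\S+", " ", text) for text in text_sent]
text_sent = [data_utils.remove_accented_chars(sent) for sent in text_sent]
# Replace punctuation (and, on the next line, backslashes) with a space.
text_sent = [re.sub(r"[-!\"#$%&'()*+,./:;<=>?@\][^_`|}{~']", " ", text) for text in text_sent]
text_sent = [text.replace("\\", " ") for text in text_sent]
# Collapse whitespace runs, then lowercase.
text_sent = [re.sub(r"\s+", " ", sent) for sent in text_sent]
text_sent = list(map(str.lower, text_sent))
# Only lemmas are needed, so skip NER and the dependency parser.
docs = nlp.pipe(text_sent, disable=["ner", "parser"])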
cleaned_corpus = []
for doc in docs:
    doc_text = []
    for token in doc:
        if token.lemma_ != '-PRON-':
            doc_text.append(token.lemma_)
        else:
            doc_text.append(token.text)
    cleaned_corpus.append(doc_text)
cleaned_corpus = [" ".join(cleaned_text) for cleaned_text in cleaned_corpus]

def give_train_clean():
    for i in range(test.shape[0]):
        yield[test.user_review.values[i],cleaned_corpus[i]]
gtc = give_train_clean()
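
The token loop keeps the surface form whenever the lemma is the placeholder `-PRON-`, which spaCy 2.x assigns to pronouns. The sketch below isolates that rule using plain `(text, lemma)` tuples in place of spaCy tokens, so it runs without a model. The function name and the sample pairs are illustrative and are not output from this notebook's pipeline.

```python
PRON_PLACEHOLDER = "-PRON-"


def join_lemmas(tokens):
    """Join lemmas, falling back to the token text for the pronoun placeholder."""
    return " ".join(
        text if lemma == PRON_PLACEHOLDER else lemma
        for text, lemma in tokens
    )


# Hand-written (text, lemma) pairs, for illustration only.
sample = [("you", "-PRON-"), ("are", "be"), ("running", "run")]
print(join_lemmas(sample))  # you be run
```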

In [219]:
vals = next(gtc)
print("{:-^50}".format("ORIGINAL"))
print(vals[0])
print("{:-^50}".format(""))
print("")
print("{:-^50}".format("CLEANED"))
print(vals[1])
print("{:-^50}".format(""))

---------------------ORIGINAL---------------------
Nice graphics, new maps, weapons and models. But developers should listen to the customers a bit more. Developers you are focused too much on things that are not important at all. You should focus on changing the tick rate of the match making servers to 128 and improving VAC a lot. Those two are what customers really want and you should focus on. Not stickers, UI and HUD changes or skins. And stop messing around with the weapons.
--------------------------------------------------

---------------------CLEANED----------------------
nice graphic new map weapon and model but developer should listen to the customer a bit more developer you be focus too much on thing that be not important at all you should focus on change the tick rate of the match make server to 128 and improve vac a lot those two be what customer really want and you should focus on not sticker ui and hud change or skin and stop mess around with the weapon
--------------------------------------------------

In [220]:
final_frame = test.copy()
final_frame['user_review'] = cleaned_corpus
final_frame.to_csv("latest_cleaned_data/test_lemmatized_withstop.csv",index=False)

In [221]:
final_frame.head()

Unnamed: 0,review_id,title,year,user_review
0,1603,Counter-Strike: Global Offensive,2015.0,nice graphic new map weapon and model but deve...
1,1604,Counter-Strike: Global Offensive,2018.0,i would not recommend get into this at its cur...
2,1605,Counter-Strike: Global Offensive,2018.0,edit 11 12 18i have try play cs go recently an...
3,1606,Counter-Strike: Global Offensive,2015.0,the game be great but the community be the bad...
4,1607,Counter-Strike: Global Offensive,2015.0,i thank trulyrazor for buy this for me a long ...
