### Data Preparation for Game Reviews

In this notebook, we prepare the data that we scraped from the game reviews section of the Common Sense Media website. Data preparation consists of:
- removing rows without labels
- removing rows that are marked for deletion (indicated in column 'remove')
- concatenating each review's title and the review proper to form a single column for review content

The final dataframe consists of 606 rows (420 labelled 's' and 186 labelled 'a') and retains only 2 columns: 'label' and 'title_review'. This dataframe is written out as a csv file.
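
The steps listed above can be sketched as a single function. This is only a minimal illustration on made-up rows, not the notebook's own code: the helper name `prepare_reviews` and the short column names `title` and `review` are assumptions (the raw file uses longer scraped column names, which are renamed later in the notebook).

```python
import pandas as pd

def prepare_reviews(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a two-column frame ('label', 'title_review') from raw rows."""
    # keep rows that have a label and are not flagged in 'remove';
    # .copy() gives an independent frame for the assignment below
    kept = raw[raw["label"].notna() & raw["remove"].isna()].copy()
    # a missing review body becomes "" so the title survives the join
    kept["title_review"] = kept["title"] + " " + kept["review"].fillna("")
    return kept[["label", "title_review"]]

# tiny made-up frame standing in for the scraped data
toy = pd.DataFrame({
    "title": ["Great game!", "Too violent", "Unlabelled", "Spam", "Title only"],
    "review": ["Fun for all.", "Lots of guns.", "Some text.", "Buy now.", None],
    "label": ["s", "a", None, "s", "s"],
    "remove": [None, None, None, "delete", None],
})
prepared = prepare_reviews(toy)
print(prepared)
```

Of the five toy rows, the unlabelled row and the flagged row are dropped, and the row with no review body keeps its title.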

In [1]:
import pandas as pd

In [58]:
# read in the raw scraped data
df = pd.read_csv("data/run_results13.csv")
df

Unnamed: 0,cont_button_rev_links_selection2_name,cont_button_rev_links_selection2_selection3,label,remove
0,It’s too scary,It's funny but the Car never has any clothes o...,a,
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s,
2,Great game!,"I think this game is great! My husband, son, a...",s,
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a,
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s,
...,...,...,...,...
693,Best bottle Royale game I've ever played,Well this game does have a little bit of blood...,,
694,Let your kids play this .,If you are a parent and think that your 10 yea...,,
695,delete fortnite get this,this game is very fun. you take away the blood...,,
696,Really Good!,Its really good and not gory at all. I think i...,,


In [60]:
# get the counts of the different non-null label values
df['label'].value_counts()

s    427
a    187
Name: label, dtype: int64

In [61]:
# filter out rows without labels (i.e. NaN)
df_labels = df[ df['label'].notnull() ]
df_labels

Unnamed: 0,cont_button_rev_links_selection2_name,cont_button_rev_links_selection2_selection3,label,remove
0,It’s too scary,It's funny but the Car never has any clothes o...,a,
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s,
2,Great game!,"I think this game is great! My husband, son, a...",s,
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a,
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s,
...,...,...,...,...
614,Listen,This game is one I am sure a lot of parents ar...,a,
615,An epic game that can be for anyone just needs...,I bought this game recently even tho the more ...,s,
616,Amazing game,"My son wanted the game, At first I wasnt havin...",s,
617,Probably the worst in the franchise but fine f...,,s,
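
The same filter can also be written with `dropna`. The sketch below uses a small made-up frame (the values are illustrative only) to show that the boolean mask and `dropna(subset=...)` select the same rows.

```python
import pandas as pd

# made-up rows; None in 'label' stands for a review that was never labelled
toy = pd.DataFrame({
    "title": ["Great game!", "Too violent", "Not yet labelled"],
    "label": ["s", "a", None],
})

by_mask = toy[toy["label"].notna()]       # boolean-mask filter
by_dropna = toy.dropna(subset=["label"])  # same rows, expressed with dropna

print(by_mask.equals(by_dropna))
print(by_dropna["label"].value_counts())
```

Both forms keep the original row index, which is why the filtered output above ends at index 617 rather than being renumbered.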


In [62]:
# inspect the 'remove' column: any non-NaN value marks a row for deletion
df_labels['remove'].value_counts()

delete     7
delete?    1
Name: remove, dtype: int64

In [63]:
# keep only the rows that are not marked for deletion (i.e. 'remove' is NaN)
df_clean = df_labels[ df_labels['remove'].isnull() ]
df_clean

Unnamed: 0,cont_button_rev_links_selection2_name,cont_button_rev_links_selection2_selection3,label,remove
0,It’s too scary,It's funny but the Car never has any clothes o...,a,
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s,
2,Great game!,"I think this game is great! My husband, son, a...",s,
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a,
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s,
...,...,...,...,...
614,Listen,This game is one I am sure a lot of parents ar...,a,
615,An epic game that can be for anyone just needs...,I bought this game recently even tho the more ...,s,
616,Amazing game,"My son wanted the game, At first I wasnt havin...",s,
617,Probably the worst in the franchise but fine f...,,s,


In [64]:
# drop the 'remove' column
df_clean.drop(['remove'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean.drop(['remove'], axis=1, inplace=True)
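
This warning appears because `df_clean` was produced by filtering another DataFrame, so pandas cannot tell whether an in-place change is meant to reach the parent frame. One common way to avoid it, sketched here on made-up data (the frame and its values are assumptions for illustration), is to take an explicit `.copy()` after filtering and to reassign the result of `drop` instead of using `inplace=True`.

```python
import pandas as pd

# made-up rows; 'delete' in 'remove' flags a row for deletion
toy = pd.DataFrame({
    "title": ["Great game!", "Spam", "Too violent"],
    "label": ["s", "s", "a"],
    "remove": [None, "delete", None],
})

# .copy() makes the filtered result an independent DataFrame
clean = toy[toy["remove"].isna()].copy()
# drop returns a new frame; reassigning it replaces inplace=True
clean = clean.drop(columns=["remove"])
# a later column assignment now clearly targets 'clean' itself
clean["title"] = clean["title"].str.strip()

print(clean)
print(toy.columns.tolist())  # the source frame still has its 'remove' column
```

The recorded outputs in this notebook were produced without `.copy()`, which is why the warning recurs in the cells that follow.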


In [65]:
# check the counts of the label values
df_clean['label'].value_counts()

s    420
a    186
Name: label, dtype: int64

In [66]:
# a function to replace the NaNs in 'review' column with empty str
def replace_nan( val ):
    # strings pass through unchanged; anything else (e.g. a float NaN) becomes ""
    if isinstance(val, str):
        return val
    else:
        return ""
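
For a column that holds only strings and missing values, pandas' built-in `fillna("")` should give the same result as applying `replace_nan`. The sketch below compares the two on a made-up Series; the sample values are assumptions for illustration.

```python
import pandas as pd

def replace_nan(val):
    # same behaviour as the function above: non-strings become ""
    return val if isinstance(val, str) else ""

# made-up review bodies with two kinds of missing value
reviews = pd.Series(["Fun for all.", None, float("nan")], dtype=object)

via_apply = reviews.apply(replace_nan)
via_fillna = reviews.fillna("")

print(via_apply.tolist())
print(via_fillna.tolist())
```

The two differ only when the column contains non-string values that are not missing, such as numbers: `replace_nan` blanks them, while `fillna` leaves them in place.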


In [67]:
# rename the columns
df_clean.columns = ['title', 'review', 'label']
df_clean

Unnamed: 0,title,review,label
0,It’s too scary,It's funny but the Car never has any clothes o...,a
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s
2,Great game!,"I think this game is great! My husband, son, a...",s
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s
...,...,...,...
614,Listen,This game is one I am sure a lot of parents ar...,a
615,An epic game that can be for anyone just needs...,I bought this game recently even tho the more ...,s
616,Amazing game,"My son wanted the game, At first I wasnt havin...",s
617,Probably the worst in the franchise but fine f...,,s


In [68]:
# replace all NaNs in the 'review' column with empty string
df_clean['review'] = df_clean['review'].apply( replace_nan )

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['review'] = df_clean['review'].apply( replace_nan )


In [69]:
df_clean

Unnamed: 0,title,review,label
0,It’s too scary,It's funny but the Car never has any clothes o...,a
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s
2,Great game!,"I think this game is great! My husband, son, a...",s
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s
...,...,...,...
614,Listen,This game is one I am sure a lot of parents ar...,a
615,An epic game that can be for anyone just needs...,I bought this game recently even tho the more ...,s
616,Amazing game,"My son wanted the game, At first I wasnt havin...",s
617,Probably the worst in the franchise but fine f...,,s


In [70]:
# create a new column that joins the review title and the review proper, separated by a space
df_clean['title_review'] = df_clean['title'] + " " + df_clean['review']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean['title_review'] = df_clean['title'] + " " + df_clean['review']
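
The earlier NaN replacement matters for this step: when strings are joined with `+`, a missing value on either side makes the whole result missing. The sketch below shows this on a made-up two-row frame (column names follow the renamed columns; the values are illustrative).

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Great game!", "Title only"],
    "review": ["Fun for all.", None],
})

# without filling, the missing review makes the combined value missing too
unfilled = toy["title"] + " " + toy["review"]
# filling the review first keeps the title text
filled = toy["title"] + " " + toy["review"].fillna("")

print(unfilled.isna().tolist())
print(filled.tolist())
```

This is the case for row 617 above, which has a title but no review body.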


In [71]:
df_clean

Unnamed: 0,title,review,label,title_review
0,It’s too scary,It's funny but the Car never has any clothes o...,a,It’s too scary It's funny but the Car never ha...
1,A GOOD RACING GAME!!!,"Although the movie is not that good, the game ...",s,A GOOD RACING GAME!!! Although the movie is no...
2,Great game!,"I think this game is great! My husband, son, a...",s,Great game! I think this game is great! My hus...
3,"Son loves it, but too violent",I am disappointed that there are guns and shoo...,a,"Son loves it, but too violent I am disappointe..."
4,Nostalgia.,has quite a few explosions (not that bad) and ...,s,Nostalgia. has quite a few explosions (not tha...
...,...,...,...,...
614,Listen,This game is one I am sure a lot of parents ar...,a,Listen This game is one I am sure a lot of par...
615,An epic game that can be for anyone just needs...,I bought this game recently even tho the more ...,s,An epic game that can be for anyone just needs...
616,Amazing game,"My son wanted the game, At first I wasnt havin...",s,"Amazing game My son wanted the game, At first ..."
617,Probably the worst in the franchise but fine f...,,s,Probably the worst in the franchise but fine f...


In [75]:
# check that the new 'title_review' column does not have any null values
any( df_clean['title_review'].isnull() )

False

In [76]:
# drop 'title' and 'review' columns
df_clean.drop( ['title', 'review'], axis=1, inplace=True )

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean.drop( ['title', 'review'], axis=1, inplace=True )


In [85]:
# check for any null values in the dataframe
any( df_clean['title_review'].isnull() )

False

In [86]:
any( df_clean['label'].isnull())

False

In [89]:
df_clean.isnull().any()

label           False
title_review    False
dtype: bool
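
These checks could also be gathered into one helper that fails loudly if anything is off. This is a sketch only: `check_ready` is a hypothetical name, and the allowed label set `{'s', 'a'}` is taken from the label counts shown earlier.

```python
import pandas as pd

def check_ready(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame is not in the expected final form."""
    assert list(df.columns) == ["label", "title_review"], "unexpected columns"
    assert not df.isnull().any().any(), "found missing values"
    assert set(df["label"]) <= {"s", "a"}, "unexpected label value"

# made-up frame in the expected final form
toy = pd.DataFrame({
    "label": ["s", "a"],
    "title_review": ["Great game! Fun for all.", "Too violent Lots of guns."],
})
check_ready(toy)  # passes silently when all three checks hold
```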

In [78]:
# take a final look at the data before writing it out to file
df_clean

Unnamed: 0,label,title_review
0,a,It’s too scary It's funny but the Car never ha...
1,s,A GOOD RACING GAME!!! Although the movie is no...
2,s,Great game! I think this game is great! My hus...
3,a,"Son loves it, but too violent I am disappointe..."
4,s,Nostalgia. has quite a few explosions (not tha...
...,...,...
614,a,Listen This game is one I am sure a lot of par...
615,s,An epic game that can be for anyone just needs...
616,s,"Amazing game My son wanted the game, At first ..."
617,s,Probably the worst in the franchise but fine f...


In [79]:
# write the cleaned dataframe to a csv file
df_clean.to_csv("data/labelled_reviews.csv", index=False)
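
A quick way to confirm the written file is usable is to read it back and compare. The sketch below does this round trip on a made-up frame, using an in-memory buffer in place of the real file path so that it is self-contained.

```python
import io
import pandas as pd

# made-up rows, including a comma and quotes to exercise CSV escaping
toy = pd.DataFrame({
    "label": ["s", "a"],
    "title_review": ["Great game! Fun, for all.", 'He said "wow" twice'],
})

buffer = io.StringIO()            # in-memory stand-in for the csv file
toy.to_csv(buffer, index=False)   # index=False keeps the row index out of the file
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored["title_review"].tolist() == toy["title_review"].tolist())
```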