# Capstone: Sentiment Analysis for CMON
## Data Cleaning, Pre-processing and EDA

## Problem Statement

CMON is a listed company on the HK stock Exchange that sells it's boardgames through a suite of online platforms. It brings boardgames to live by Kickstarter funding and boardgames would go into production once the funding quota is met.

In order to come up with games that are appealing to consumers, it is imperative for CMON to identify customer sentiments, their likes and dislikes and painpoints.

For phase 1 of this project, the data science team has been tasked to classify positive and negative reviews on Boardgamegeeks.com for all of CMON's games using Natural Language Procesisng (NLP).

The model that achieves the highest accuracy and recall on the validation set would be selected for production. The team would also be identifying the key contributors for the classifications.

For phase 2 of the project, the team would be looking at building a recommeder system for CMON based on ratings of the board games.

Phase 1 is crucial for the chief creative director so that CMON is able to understand what consumers like or dislike in their board games so that resources can be channeled to ensuring that the games they launch on kickstarter can be fully and hopefully over subscribed.

Phase 2 would enhance their company's sales by recommending games that players enjoy to encourage purchase.

Deadline for phase 1 would be by the beginning of August 2020 and phase 2 rollout would be dependent on the successful completion of phase 1.


## Data Guidelines

Data has been scapped from https://boardgamegeek.com/ for all CMON games. Script is used to scrape all the ratings and comments for all CMON boardgames and saved into CSV format. There are 4 columns including the index.
Other columns include the username ( registered user name for the forum), ratings given by user for the boardgame (from 1 to 10) and comments from the user.

Ratings from 5 and below would be deemed to be negative, ratings from 6 to 10 would be deemed to be positive.

There is one csv for each board games and the data would be compiled for review. Preliminary assessment indicates that there would be approximately 16K rows.

They would be split into 50% (train-test) set and 50% holdout set.

The comments column would be lemmatize/ tokenised into useful words.

Data cleaning:

Remove duplicated reviews
Remove reviews that do not have any meaningful words
Remove reviews that are non-English or gibberish
Pre-processing:

Remove HTML tags
Use regular expression to remove special characters and numbers
Lowercase words
Use NLTK to remove stopwords
Remove common occurring words that appear in both positive and negative sentiments
Use NLTK to stem words to their root form

## EDA Guidelines

EDA would be performed on the the comments.

1. Wordcloud to visualise keywords
2. Count to visualise keywords
3. Plot to identify distribution of valuable words

etc

## 1. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('max_columns', None)
pd.set_option('max_rows', None)

import regex as re
from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer
from os import path
from PIL import Image

%matplotlib inline

In [2]:
# Import Ankh Gods of Egypt comments csv into a dataframe
ankh = pd.read_csv('../GA Capstone CMON/datasets/Ankh Gods of Egypt comments.csv')  

In [3]:
# Import Arcadia Quest comments csv into a dataframe
arcadia = pd.read_csv('../GA Capstone CMON/datasets/Arcadia Quest comments.csv')  

In [4]:
# Import Blood Rage comments csv into a dataframe
rage = pd.read_csv('../GA Capstone CMON/datasets/Blood Rage comments.csv')  

In [5]:
# Import Bloodborne The Board Game comments csv into a dataframe
borne = pd.read_csv('../GA Capstone CMON/datasets/Bloodborne The Board Game comments.csv')  

In [6]:
# Import Blue Moon City comments csv into a dataframe
blue = pd.read_csv('../GA Capstone CMON/datasets/Blue Moon City comments.csv')

In [7]:
# Import Chyba comments csv into a dataframe
chyba = pd.read_csv('../GA Capstone CMON/datasets/Chyba comments.csv')  

In [8]:
# Import Cthulhu Death May Die comments csv into a dataframe
may = pd.read_csv('../GA Capstone CMON/datasets/Cthulhu Death May Die comments.csv') 

In [9]:
# Import de hielo y fuego El juego de miniaturas comments csv into a dataframe
miniaturas = pd.read_csv('../GA Capstone CMON/datasets/de hielo y fuego El juego de miniaturas comments.csv') 

In [10]:
# Import Duckomenta Art comments csv into a dataframe
art = pd.read_csv('../GA Capstone CMON/datasets/Duckomenta Art comments.csv') 

In [11]:
# Import Ethnos comments comments csv into a dataframe
ethnos = pd.read_csv('../GA Capstone CMON/datasets/Ethnos comments.csv') 

In [12]:
# Import Foodies comments comments csv into a dataframe
food = pd.read_csv('../GA Capstone CMON/datasets/Foodies comments.csv') 

In [13]:
# Import Genios Victorianos comments csv into a dataframe
victoria = pd.read_csv('../GA Capstone CMON/datasets/Genios Victorianos comments.csv') 

In [14]:
# Import Gizmos comments csv into a dataframe
gizmos = pd.read_csv('../GA Capstone CMON/datasets/Gizmos comments.csv') 

In [15]:
# Import God of War Das Kartenspiel comments csv into a dataframe
war = pd.read_csv('../GA Capstone CMON/datasets/God of War Das Kartenspiel comments.csv') 

In [16]:
# Import Ha Ver A comments csv into a dataframe
haver = pd.read_csv('../GA Capstone CMON/datasets/Ha Ver A comments.csv') 

In [17]:
# Import HATE comments csv into a dataframe
hate = pd.read_csv('../GA Capstone CMON/datasets/HATE comments.csv') 

In [18]:
# Import Los Autos Locos El Juego de Mesa comments csv into a dataframe
los = pd.read_csv('../GA Capstone CMON/datasets/Ha Ver A comments.csv') 

In [19]:
# Import Marvel United comments csv into a dataframe
marvel = pd.read_csv('../GA Capstone CMON/datasets/Marvel United comments.csv') 

In [20]:
# Import Massive Darkness comments csv into a dataframe
massive = pd.read_csv('../GA Capstone CMON/datasets/Massive Darkness comments.csv') 

In [21]:
# Import Moloch comments csv into a dataframe
moloch = pd.read_csv('../GA Capstone CMON/datasets/Moloch comments.csv') 

In [22]:
# Import Munchkin Dungeon comments csv into a dataframe
munchkin = pd.read_csv('../GA Capstone CMON/datasets/Munchkin Dungeon comments.csv') 

In [23]:
# Import Nap comments csv into a dataframe
nap = pd.read_csv('../GA Capstone CMON/datasets/Nap comments.csv') 

In [24]:
# Import Narcos hra comments into a dataframe
narcos = pd.read_csv('../GA Capstone CMON/datasets/Narcos hra comments.csv') 

In [25]:
# Import Project ELITE comments csv into a dataframe
project = pd.read_csv('../GA Capstone CMON/datasets/Project ELITE comments.csv') 

In [26]:
# Import Sheriff of Nottingham Edition comments csv into a dataframe
sheriff = pd.read_csv('../GA Capstone CMON/datasets/Sheriff of Nottingham Edition comments.csv') 

In [27]:
# Import Starcadia Quest comments csv into a dataframe
starcadia = pd.read_csv('../GA Capstone CMON/datasets/Starcadia Quest comments.csv') 

In [28]:
# Import Sugar Blast comments csv into a dataframe
sugar = pd.read_csv('../GA Capstone CMON/datasets/Sugar Blast comments.csv') 

In [29]:
# Import The Grizzled comments csv into a dataframe
grizzled = pd.read_csv('../GA Capstone CMON/datasets/The Grizzled comments.csv') 

In [30]:
# Import Trudvang Legends comments csv into a dataframe
legends = pd.read_csv('../GA Capstone CMON/datasets/Trudvang Legends comments.csv') 

In [31]:
# Import Wrath of Kings comments csv into a dataframe
wrath = pd.read_csv('../GA Capstone CMON/datasets/Wrath of Kings comments.csv') 

In [32]:
# Import Zombicide comments csv into a dataframe
zombie = pd.read_csv('../GA Capstone CMON/datasets/Zombicide comments.csv') 

In [33]:
# Combined all dataframes into one
df = pd.concat([ankh, arcadia, rage, borne, blue, chyba, may, miniaturas, art,
                ethnos, food, victoria,gizmos, war, haver, hate, los, marvel,
                massive, moloch, munchkin, nap, narcos, project, sheriff,
                starcadia, sugar, grizzled, legends, wrath, zombie])
# Reindex the new dataframe
df.reset_index(drop=True, inplace=True)

In [35]:
df.shape

(23089, 4)

In [36]:
# Check that the data has been re-indexed
df.tail()

Unnamed: 0.1,Unnamed: 0,username,rating,comment
23084,2528,Zyrallus,3.8,"So much potential here. Seriously, with a less..."
23085,2529,zyx0xyz,7.6,唯一的亮点就是模型和版图美工了，但本人作为一个十多年的生化危机fans，体验zombicid...
23086,2530,zzool73,8.0,If you are creative there are a lot of house r...
23087,2531,_ph_,9.0,"Still playing a lot of the game with friends, ..."
23088,2532,_The_Inquiry_,4.0,Prior to 2020: 1 play\n\nIf there's a single g...


In [38]:
# Drop selftext duplicates as we only want unique posts
df.drop_duplicates('comment', inplace=True)

In [40]:
df.shape

(21156, 4)

In [39]:
# Check for null values in iphonehelp
df.isnull().sum()

Unnamed: 0       0
username         0
rating        4786
comment          1
dtype: int64

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21156 entries, 0 to 23088
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  21156 non-null  int64  
 1   username    21156 non-null  object 
 2   rating      16370 non-null  float64
 3   comment     21155 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 826.4+ KB


## Data Preprocessing

Pre-processing would enable transformation of our text into a more digestible form so that our classifier can perform better. The steps taken is as follows:

- 1. Remove html tags using beautifulsoup
- 2. Lowercase all words and split word up
- 3. Remove non-letters: Remove special characters and numbers
- 4. Remove keywords that points to a speciic subreddit
- 5. Remove stopwords: These are common words that are not useful for text classification
- 6. Lemmatize words: This will convert each word to its base form
- 7. Finally, rejoin words back into a string

In [45]:
# remove bs4 warnings as scrapping includes pinned moderator posts with many url links and pictures
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

In [42]:
# Initialise Lemmatizer
lemmatizer = WordNetLemmatizer()

In [46]:
# Write a function to convert text to a string of meaningful words
def meaningful_text(self_text):
    
    # 1. Remove html tags
    words = BeautifulSoup(self_text).get_text()
    
    # 2. Convert words to lower case and split each word up
    words = self_text.lower()
    
    # iphon likely to be spelling error for iphone, removing it as we do not want iphone inside too
    #words = words.replace('iphon', '') 
    
    # 3. Remove non-letters
    words = re.sub("[^a-zA-Z]", " ", words).split()    
    
    #Searching through a set is faster than searching through a list,so we will convert stopwords to a set
    stops = set(stopwords.words('english'))
    
    # 4. Add certain keywords to stopwords as its too obvious for which reddit
    #stops.update(['Android','Iphone','android','iphon','phone','http','www','com', 'Iphon','IPHON'])
    
    # 5. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    
    # 6. Lemmatize words
    meaningful_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
   
    # 7. Join words back into one string, with a space in between each word
    return(" ".join(meaningful_words))

In [47]:
# Creating clean selftext and clean title, and store them in new columns
df['comment_clean'] = df['comment'].map(meaningful_text)

TypeError: object of type 'float' has no len()