# Actionable Insights: Extracting Constructive Material from Customer Reviews
--- 

# PART 2: EDA and Data Cleaning

### INTRODUCTION

To prepare the data for modeling, the text must be preprocessed. I will label the data using custom SpaCy attributes that flag sentences containing actionable words taken from a list. The list of actionable words is by no means exhaustive, but it should serve as an aid in labeling the data, and it can be extended as appropriate with the help of semantic similarity search (another SpaCy method). This labeling process was augmented using the MonkeyLearn API and verified manually.

To distill actionable insights, the reviews are broken up by sentence. However, the context of the insights would be lost if the text were broken down further to the individual word level, so the models will instead use tokenized and vectorized sentences. SpaCy assigns each token a 300-dimensional vector, and each sentence is also assigned a 300-dimensional vector derived from the vectors of its tokens.

Finally, the sentences are saved as dataframes for use in the following modeling notebook.

### Contents

- [Import Libraries](#Import-Libraries)
- [Load in G2 Data](#Load-in-G2-Data)
- [Search for Actionable Words](#Search-for-Actionable-Words)
- [Extracting Sentences](#Extracting-Sentences)
- [Checking the Target Distribution](#Checking-the-Target-Distribution)
- [Adding Labels](#Adding-Labels)
- [Combining the Labels](#Combining-the-Labels)
- [Saving Dataframes](#Saving-Dataframes)

## Import Libraries

In [2]:
# Basic libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd 
import numpy as np
import json
import gzip
import io


# Plot formatting
%matplotlib inline
sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

# NLP libraries
import en_core_web_md
import spacy
from spacy.tokens import Span, Doc, Token
from spacy.matcher import Matcher
import scattertext as st

# Modeling and Machine Learning libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import SGDClassifier
from sklearn.manifold import TSNE
import hdbscan

# Hide excessive future warnings
import warnings
warnings.filterwarnings('ignore')

##SQL libraries for the Google Play Store Review dataset
# from sqlalchemy import create_engine
# import psycopg2
# import sqlalchemy
# from sqlalchemy.ext.declarative import declarative_base
# import mysql
# import mysql.connector
# import plotly.express as px

In [None]:
# Load in the SpaCy NLP Library
nlp = spacy.load('en_core_web_md')

## Load in G2 Data

In [None]:
G2=pd.read_json('../Data/G2_hubspot_data.jl', lines=True)

Dropping the 106 rows that have no review content

In [4]:
G2=G2[G2['reviews']!={}]

In [5]:
G2.reviews[1].keys()

dict_keys(['What do you like best?', 'What do you dislike?', 'Recommendations to others considering the product:', 'What problems are you solving with the product?  What benefits have you realized?'])

In [6]:
G2['pro'] =[G2.reviews[x]['What do you like best?'] for x in G2.index]
G2['con'] =[G2.reviews[x]['What do you dislike?'] for x in G2.index]
# Recommendation question not always asked, so including a conditional for this column
G2['user_recs'] =[G2.reviews[x]['Recommendations to others considering the product:'] if 'Recommendations to others considering the product:'in G2.reviews[x].keys() else "" for x in G2.index]
G2['value'] =[G2.reviews[x]['What problems are you solving with the product?  What benefits have you realized?'] for x in G2.index]
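
The conditional used for the recommendation column can also be written with `dict.get`, which returns a default when a question was not asked. This is only a minimal sketch: `get_answer` and the sample review below are hypothetical and made up for illustration, not taken from the G2 data.

```python
# Hypothetical helper and sample data, for illustration only.
REC_KEY = 'Recommendations to others considering the product:'


def get_answer(review, question, default=""):
    """Return the answer to `question`, or `default` if it was not asked."""
    return review.get(question, default)


# A made-up review in which the recommendation question is missing.
sample_review = {
    'What do you like best?': 'Easy to set up.',
    'What do you dislike?': 'Reporting is limited.',
}

pro = get_answer(sample_review, 'What do you like best?')
recs = get_answer(sample_review, REC_KEY)
print(repr(pro), repr(recs))
```

The same default could be applied to all four questions, which would also guard against a review that skips one of the other prompts.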

## Search for Actionable Words
Search for actionable words, then find similar words to extend the list

The `most_similar` function below finds words similar to a given term, so that I can expand the list of actionable insight keywords to search for. This will help with the labeling process. It uses SpaCy's semantic similarity method, which relies on word vectors trained with the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm developed at Stanford. The list is by no means exhaustive, but it should prove useful in aiding the labeling process. The words were decided upon after manually reviewing a sample of reviews.

In [48]:
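def most_similar(word):
    """Return the 10 vocabulary entries most similar to `word`."""
    lexeme = nlp.vocab[word]
    # Keep entries with the same casing as the query and a log-probability of at least -15
    queries = [w for w in lexeme.vocab if w.is_lower == lexeme.is_lower and w.prob >= -15]
    by_similarity = sorted(queries, key=lambda w: lexeme.similarity(w), reverse=True)
    return [w.lower_ for w in by_similarity[:10]]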

In [49]:
# Enter search term here to find 10 similar terms
most_similar('wish')

['wish',
 'want',
 'hope',
 'wishing',
 'you',
 'know',
 'forget',
 'wished',
 'let',
 'glad']
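
The ranking above sorts candidates by SpaCy's similarity score, which by default is the cosine similarity between two word vectors. The sketch below shows that idea on made-up 3-dimensional vectors (the real model uses 300 dimensions); `cosine_similarity`, `rank_by_similarity`, and `toy_vectors` are hypothetical names for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Made-up toy vectors; real word vectors come from the language model.
toy_vectors = {
    'wish': [0.9, 0.1, 0.0],
    'want': [0.8, 0.2, 0.1],
    'hope': [0.7, 0.3, 0.0],
    'table': [0.0, 0.1, 0.9],
}


def rank_by_similarity(word, vectors, n=3):
    """Return the n keys whose vectors are closest to that of `word`."""
    query = vectors[word]
    ranked = sorted(vectors,
                    key=lambda w: cosine_similarity(query, vectors[w]),
                    reverse=True)
    return ranked[:n]


ranked = rank_by_similarity('wish', toy_vectors)
print(ranked)
```

As in the notebook output, the query word itself comes first because a vector is always most similar to itself.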

In [45]:
# Actionable insight words
words = 'able ability dislikes really favor ability should add glitch wish want hope glad maybe fix bug improve price better feature frustrating useful would hard consider try improve more issue suggest please request quality error flaw trouble could update break workaround solution want error issue hate love favorite'
actionable_words=nlp(words)
actionable_words=[token.lemma_ for token in actionable_words]    # Using lemmas to capture more instances
action_pattern = [{"LEMMA": {"IN": actionable_words}}]

Here I am setting [custom extensions](https://spacy.io/usage/processing-pipelines#custom-components-simple) that flag text containing the 'actionable' words from above: `is_action` for individual tokens and `has_action` for sentences and documents. Then I'm processing the reviews separately for each question they answer. This can be used for context when looking at the sentences once tokenized.

In [51]:
# Setting custom extensions
is_action_getter = lambda token: token.lemma_ in actionable_words
has_action_getter = lambda obj: any([t.lemma_ in actionable_words for t in obj])

Token.set_extension("is_action", getter=is_action_getter, force=True)
Doc.set_extension("has_action", getter=has_action_getter, force=True)
Span.set_extension("has_action", getter=has_action_getter, force=True)

# Setting up NLP for each question's response
ProDocs = list(nlp.pipe(G2['pro']))
ConDocs = list(nlp.pipe(G2['con']))
RecsDocs = list(nlp.pipe(G2['user_recs']))
ValueDocs = list(nlp.pipe(G2['value']))

# Creating a Dataframe Column for each question's response within the reviews
G2["pro_tokens"] = ProDocs
G2["con_tokens"] = ConDocs
G2["recs_tokens"] = RecsDocs
G2["value_tokens"] = ValueDocs
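
The `has_action` getter boils down to an `any(...)` membership test over lemmas. The sketch below reproduces that logic on plain strings, without SpaCy, and assumes a `frozenset` instead of a list so that each lookup does not scan the whole word list. The names and the shortened word list are hypothetical.

```python
# Shortened, hypothetical stand-in for the actionable lemma list.
actionable_lemmas = frozenset(['wish', 'fix', 'bug', 'improve', 'add'])


def has_action(lemmas, actionable=actionable_lemmas):
    """True if at least one lemma is in the actionable set."""
    return any(lemma in actionable for lemma in lemmas)


print(has_action(['wish', 'dashboard', 'load', 'fast']))
print(has_action(['support', 'team', 'friendly']))
```

With a word list this short the speed difference is negligible, but the getter runs once per sentence, so a set may be worth it on larger review sets.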

The new columns hold SpaCy `Doc` objects rather than plain strings, so the sentence boundaries, lemmas, and vectors of each response stay available for the steps below.

## Extracting Sentences
### Separating the sentences into a separate dataframe and removing stop words and punctuation

In [55]:
target=[]
sent_list=[]
tokens=[]
vectors=[]
G2_index=[]
column = []
for i in G2.index:
    for col in ['pro_tokens','con_tokens','recs_tokens','value_tokens']:
        for sent in G2[col][i].sents:
            column.append(col)
            G2_index.append(i)
            sent_list.append(sent)
            
            # Lemmatizing, removing stop words and punctuation
            sent = [token.lemma_ for token in sent if not token.is_stop and not token.is_punct and token.lemma_ != '-PRON-']
            sent = ' '.join(sent)
            sent = nlp.make_doc(sent)   
            tokens.append(sent)
            vectors.append(sent.vector)
            if sent._.has_action:
                target.append(1)
            else:
                target.append(0)

# Creating Dataframe of sentencized reviews
G2_sent=pd.DataFrame({'from_row': G2_index,
                      'from_col': column,
                      'sentence': sent_list,
                      'tokens'  : tokens,
                      'vector'  : vectors,
                      'is_actionable': target,
                      })
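
The `vector` column stores one vector per cleaned sentence. By default, SpaCy's `Doc.vector` is the average of the token vectors, which the sketch below illustrates with plain Python lists. `mean_vector` and the toy 3-dimensional vectors are hypothetical; the real vectors have 300 dimensions.

```python
def mean_vector(token_vectors):
    """Element-wise average of a list of equal-length vectors."""
    if not token_vectors:
        return []  # no tokens, nothing to average
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Two made-up token vectors standing in for a two-word sentence.
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

sentence_vector = mean_vector(toy_token_vectors)
print(sentence_vector)
```

Because averaging ignores word order, two sentences with the same words in a different order would receive the same vector.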

## Checking the Target Distribution

In [58]:
G2_sent['is_actionable'].value_counts(normalize=True)

0    0.715135
1    0.284865
Name: is_actionable, dtype: float64

## Adding Labels
I manually labeled 400 of the sentences and will use a [MonkeyLearn](https://monkeylearn.com/) API model trained on those 400 sentences to suggest tags for the rest, which I will then validate manually.

In [113]:
G2_labeled=pd.read_csv('../Data/G2_Review_Categories.csv')

In [114]:
G2_labeled.isnull().sum()

Text             0
Tag           4960
Confidence    5352
dtype: int64

In [115]:
# Filling null values in tags.
G2_labeled['Tag'] = G2_labeled['Tag'].fillna('none')
if G2_labeled['Tag'].isnull().sum() ==0:
    print('null values removed successfully from "Tags"')

null values removed successfully from "Tags"


In [118]:
G2_labeled = pd.concat([G2_labeled.drop('Confidence', axis=1), G2_labeled['Tag'].str.get_dummies(sep=':')], axis=1)
# Reviewing the updated target distribution
G2_sent['is_actionable'].value_counts(normalize=True)
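
The `str.get_dummies(sep=':')` call above splits each tag string on `:` and creates one 0/1 indicator column per distinct tag. The sketch below mimics that behavior in plain Python. `tag_dummies` is a hypothetical helper, and the sample tag strings are made up in the same `a:b` shape as the `Tag` column.

```python
def tag_dummies(tag_strings, sep=':'):
    """Return (sorted tag names, one {tag: 0/1} dict per input string)."""
    split = [[tag for tag in text.split(sep) if tag] for text in tag_strings]
    columns = sorted({tag for tags in split for tag in tags})
    rows = [{col: int(col in tags) for col in columns} for tags in split]
    return columns, rows


# Made-up examples of combined tags.
columns, rows = tag_dummies(['Actionable:bugs', 'none', 'Actionable:features:wishes'])
print(columns)
for row in rows:
    print(row)
```

A sentence with several tags therefore gets a 1 in several columns, which is what makes the result usable as a multi-label target.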

In [123]:
# Converting SpaCy spans back into strings to allow for the dataset merge
G2_sent['sentence']=G2_sent['sentence'].map(lambda x: x.text)
# G2_labeled['Text']=G2_labeled['Text'].map(lambda x:x.text)
# Merge datasets
G2_labeled=pd.merge(left=G2_sent, right=G2_labeled, how='left',left_on='sentence',right_on='Text', right_index=False)

In [124]:
G2_labeled.head()

| | from_row | from_col | sentence | tokens | vector | is_actionable | Text | Tag | Actionable | No tags | UX | alternatives | bugs | features | integrations | none | price | support | updates | wishes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | pro_tokens | I have used HubSpot in few companies from star... | (HubSpot, company, start, up, medium, sized, o... | [0.07115748, 0.19749413, -0.06404737, -0.10758... | 1 | I have used HubSpot in few companies from star... | none | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | pro_tokens | And based on my experirnce I think the key hig... | (base, experirnce, think, key, highlight, user... | [0.06888762, 0.08969499, -0.06350486, 0.048314... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | con_tokens | `\r` | `(\r)` | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | recs_tokens | Everything is very well connected, and you can... | (connected, develop, tool, have, user, mind) | [0.08063999, 0.120635055, -0.15673333, -0.0605... | 0 | Everything is very well connected, and you can... | none | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | recs_tokens | I have used or tested other marketing automati... | (test, marketing, automation, tool, HubSpot, e... | [-0.19968951, 0.16349284, -0.07618576, 0.02723... | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [132]:
G2_labeled['Actionable'].value_counts()

0.0    5117
1.0     220
Name: Actionable, dtype: int64

In [152]:
# Isolate the labeled training set, which can be used to train sub-categorization of the actionable sentences
G2_train = G2_labeled.loc[G2_labeled['Text'].notnull()]

# Fill NaNs
G2_labeled.fillna(0, inplace=True)
if G2_labeled.isnull().sum().sum()==0:
    print('Nulls have been successfully filled')
else:
    print('there are still null values:', G2_labeled.isnull().sum())

Nulls have been successfully filled


## Combining the Labels

In [122]:
# Re-vectorizing the text (cast to str because unmatched rows were filled with 0 above)
G2_labeled['Text']=list(nlp.pipe(G2_labeled['Text'].astype(str)))

# Combining labels for actionable sentences
G2_labeled.loc[G2_labeled['Actionable']==1, 'is_actionable']=1
G2_labeled['Actionable'] = G2_labeled['is_actionable']
G2_labeled.drop('is_actionable', inplace=True, axis=1)
G2_labeled['Actionable'].value_counts()
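
The combination step above amounts to a logical OR: a sentence ends up actionable if either the keyword flag or the manually validated tag marks it. The sketch below shows the same rule on plain lists. `combine_labels` and the sample flags are hypothetical.

```python
def combine_labels(keyword_flags, manual_flags):
    """Return 1 where either flag is set, else 0."""
    return [int(bool(k) or bool(m)) for k, m in zip(keyword_flags, manual_flags)]


# Made-up flags: keyword match (int) and manual label (float, 0.0 where unlabeled).
keyword_flags = [1, 0, 0, 1]
manual_flags = [0.0, 1.0, 0.0, 1.0]

combined = combine_labels(keyword_flags, manual_flags)
print(combined)
```

One consequence of this rule is that a keyword match is never overridden by a manual label of 0, so the combined target can only gain positives.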

## Saving Dataframes

In [153]:
G2_train.to_pickle('../Data/G2_train.pkl')
G2_labeled.to_pickle('../Data/G2_labeled.pkl')