
# [Testing] Tagging review keywords using keyword semantic similarity 

## Purpose
To auto-tag user reviews, we first need to identify the topics and subtopics that we want to assign to the reviews automatically. The post
[Best Practice for Tagging Customer Feedback](https://monkeylearn.com/blog/auto-tagging-customer-feedback-with-machine-learning/#Why-Great-Categorization-is-Important) gives recommendations on how to identify tags, how many to use, and how to define them.

## Approach
Which approach should we use to match our opinion units with topics? The post
[Approach to tag reviews](https://engineering.reviewtrackers.com/using-word2vec-to-classify-review-keywords-a5fa50ce05dc) recommends first extracting keywords from the reviews, and then using keyword semantic similarity.
Two other options are introduced, but not recommended due to their limitations:
* keyword string similarity, as it is limited to character similarity and does not take semantics into account
* topic modelling, as it tends to produce rather abstract topics, and users cannot easily (re)define those topics.
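
The limitation of string similarity can be illustrated with a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a character-based measure; the helper name `string_similarity` and the example words are assumptions made for illustration and do not come from the referenced post.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# "cost" and "coast" share most of their characters but are unrelated in meaning,
# while "cost" and "price" are close in meaning but share almost no characters.
print(string_similarity("cost", "coast"))
print(string_similarity("cost", "price"))
```

A character-based measure scores the first pair higher than the second, which is the opposite of what we want when tagging a pricing topic.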

## Import libraries

In [1]:
import os
import pandas as pd

In [2]:
import spacy
nlp = spacy.load('en_core_web_md')

In [3]:
import itertools
import numpy as np

## Define topics and keywords

To identify the main topics in the app reviews, the best approach is to read a sample of them. The definition of the topics also depends on the objectives of the app review analysis.

The main topics identified are:
* Usage
    - Sources of pollution
    - Family and activities
    - Actions taken
* Product Feedback
    - Performance
    - Usability (Content, Navigation, Complexity)
    - Feature request
* Pricing

In [4]:
topic_labels = [
    'Conditions',
    'Family and activities',
    'Actions taken',
    'Sources of pollution',
    'Performance',
    'Usability - Content',
    'Usability - Navigation',
    'Usability - Complexity',
    'Feature request',
    'Pricing'
]
len(topic_labels)

10

In [5]:
topic_keywords = [
    'sickness health COPD lung lungs disease condition chronic pneumonia asthma asthmatic breathing sensitive allergy sufferers allergies',
    'sport children dog leisure travel',
    'mask purifier window',
    'fire industry traffic burning',
    'bug issues fix reliable accuracy performance',
    'content areas information',
    'navigation design',
    'easy complexity',
    'wish',
    'cost price expensive'
]
len(topic_keywords)

10

In [6]:
topic_docs = list(nlp.pipe(topic_keywords))

In [7]:
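# Fall back to a zero vector when a doc has no vector
# (spacy.vocab is a module, so the fallback is built from nlp.vocab instead)
topic_vectors = np.array([doc.vector
                         if doc.has_vector else np.zeros(nlp.vocab.vectors_length, dtype='float32')
                         for doc in topic_docs])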

## Define keywords

In [8]:
keywords = [
            'accurate',
            'informative',
            'fires issue',
            'reports California smog',
            'comprehensive all occupied countries',
            'frequent traveller',
            'filter out pollution',
            'understand real time',
            'lots bugs',
            'easy use'
            ]

In [9]:
keyword_docs = list(nlp.pipe(keywords))

In [10]:
keyword_docs[1]

informative

In [11]:
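# Fall back to a zero vector when a doc has no vector
# (spacy.vocab is a module, so the fallback is built from nlp.vocab instead)
keywords_vectors = np.array([doc.vector
                            if doc.has_vector else np.zeros(nlp.vocab.vectors_length, dtype='float32')
                            for doc in keyword_docs])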

## Identify the closest topic

Let's test topic tagging by comparing each keyword vector with the list of topic vectors, using cosine similarity!

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
simple_sim = cosine_similarity(keywords_vectors, topic_vectors)

In [14]:
topic_idx = simple_sim.argmax(axis = 1)

In [15]:
topic_idx

array([4, 5, 3, 3, 5, 1, 2, 8, 4, 7])
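
To make the two steps above concrete without loading a spaCy model, here is a minimal NumPy sketch of the same computation: a pairwise cosine similarity matrix with keywords as rows and topics as columns, followed by `argmax` along each row. The function `cosine_sim_matrix` and the two-dimensional toy vectors are assumptions for illustration only; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on the spaCy vectors.

```python
import numpy as np


def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Avoid division by zero for all-zero rows (their similarity stays 0).
    a_unit = a / np.where(a_norm == 0, 1.0, a_norm)
    b_unit = b / np.where(b_norm == 0, 1.0, b_norm)
    return a_unit @ b_unit.T


# Toy data: 3 "topics" and 2 "keywords" in a 2-dimensional space.
toy_topic_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_keyword_vectors = np.array([[2.0, 0.1], [0.2, 3.0]])

toy_sim = cosine_sim_matrix(toy_keyword_vectors, toy_topic_vectors)
toy_topic_idx = toy_sim.argmax(axis=1)  # best-matching topic index per keyword

print(toy_sim.shape)   # (2, 3): one row per keyword, one column per topic
print(toy_topic_idx)   # [0 1]
```

Because cosine similarity only depends on the direction of the vectors, the first toy keyword matches topic 0 and the second matches topic 1, regardless of their lengths.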

In [16]:
for i in range(len(topic_labels)):
    print(f'topic no:{i} - {topic_labels[i]}')

topic no:0 - Conditions
topic no:1 - Family and activities
topic no:2 - Actions taken
topic no:3 - Sources of pollution
topic no:4 - Performance
topic no:5 - Usability - Content
topic no:6 - Usability - Navigation
topic no:7 - Usability - Complexity
topic no:8 - Feature request
topic no:9 - Pricing


In [17]:
for i in range(len(topic_idx)):
    print(f'"{keywords[i]}" is about: "{topic_labels[topic_idx[i]]}"')

"accurate" is about: "Performance"
"informative" is about: "Usability - Content"
"fires issue" is about: "Sources of pollution"
"reports California smog" is about: "Sources of pollution"
"comprehensive all occupied countries" is about: "Usability - Content"
"frequent traveller" is about: "Family and activities"
"filter out pollution" is about: "Actions taken"
"understand real time" is about: "Feature request"
"lots bugs" is about: "Performance"
"easy use" is about: "Usability - Complexity"


The tags assigned to the extracted keywords are correct, except for "understand real time", which was tagged as "Feature request". The keywords listed for the "Feature request" topic (currently only "wish") might need to be reviewed.
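
When reviewing a doubtful assignment, it can help to look at the full ranking of topics for that keyword rather than only the best match. The sketch below shows one possible way to do this; the helper `rank_topics`, the shortened label list, and the similarity scores are hypothetical and made up for illustration, not taken from the run above.

```python
import numpy as np


def rank_topics(sim_row, labels, top_n=3):
    """Return the top_n (label, score) pairs for one keyword, best match first."""
    order = np.argsort(sim_row)[::-1][:top_n]
    return [(labels[i], float(sim_row[i])) for i in order]


# Hypothetical labels and scores for a single keyword (illustrative values only).
example_labels = ['Performance', 'Usability - Content', 'Feature request']
example_row = np.array([0.41, 0.47, 0.49])

for label, score in rank_topics(example_row, example_labels):
    print(f'{label}: {score:.2f}')
```

In the notebook, the same helper could be called with a row of `simple_sim` and `topic_labels`. A small gap between the first and second scores would suggest that the topic keywords, rather than the keyword itself, are the part to revise.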