# Leveraging data analytics and machine learning to improve customer satisfaction
--- 

Jose Oliveira da Cruz | jose [at] jfocruz [dot] com


## Index of Jupyter Notebook: `nb03_modeling-part2.ipynb`
---
- [Background](#background)
- [Load and clean data](#load)
- [Topic Modeling](#model)


## Background

The available data contains comments from users that can be explored to extract the reasons for a specific score. 

- What are three main complaints in case of tickets with bad CSAT based on comments?

To solve this complex problem, I used NLP Topic Modeling with [Latent Dirichlet Allocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation).

In [1]:
# Load libraries
import os
import datetime
import missingno as msno # to visualize missing data
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# fix random seed
seed = 42
np.random.seed(seed)



save_figs = True
plt.style.use('ggplot')
fig_kwargs = dict(bbox_inches='tight')

<a id="load"><a/>
## Load and Clean Data

In [2]:
# Get tickets for which we have CSAT reviews
df_tickets_with_satisfaction = pd.read_csv('../data/processed/merged_datasets.csv', index_col=0)

In [3]:
# Extract comments
comments = df_tickets_with_satisfaction[df_tickets_with_satisfaction.satisfaction.isin(['Bad'])].comment.to_frame().dropna()

# Remove comments without information
comments = comments[~comments.comment.isin(['N / A'])].reset_index(drop=True)

<a id="model"><a/>
## Topic Modeling

In [4]:
# for npl
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True);
nltk.download('wordnet', quiet=True);
nltk.download('punkt', quiet=True);
nltk.download('averaged_perceptron_tagger', quiet=True);
nltk.download('omw-1.4', quiet=True)

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

STOP_WORDS_ENG = stopwords.words('english')
STOP_WORDS_ENG.extend(['la', 'de', 'le', 'pa', 'del', 'el'])
ENGLISH_VOCABULARY = list(w.lower() for w in nltk.corpus.words.words())

In [5]:
def tokenize(text):
    """Case normalize, clean, tokenize, verify english corpus and lemmatize text.
    
    Parameters
    ----------
    text : str
    
    Returns
    -------
    tokens_lem : list
        List of clean, normalized and lemmatized tokens.
    """
    # Remove non-alphanumeric characters
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)

    # tokenization
    tokens = word_tokenize(text)
    
    # remove words that are not in english and with len < 2 char
    tokens = [word for word in tokens if word in ENGLISH_VOCABULARY and len(word) > 1]

    # lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(token.strip().lower()) for token in tokens
                  if token not in STOP_WORDS_ENG]

    return tokens_lem

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.pipeline import make_pipeline

In [7]:
# How many topics?
number_of_topics = 3

# Create model object
model = make_pipeline(TfidfVectorizer(tokenizer=tokenize), # use custom tokenizer
                      LatentDirichletAllocation(n_components=number_of_topics,
                                                n_jobs=-1,
                                                random_state=seed)) 

In [8]:
# Fit model and tranform data
id_topic = model.fit_transform(comments.comment)

# get feature names (== tokens)
vocabulary = model.named_steps.get('tfidfvectorizer').get_feature_names()

In [9]:
def display_topics(number_of_top_words, model_pipeline, vocab):
    """Returns words per topic.
    
    Parameters
    ----------
    number_of_top_words : int
        Number of words per topic
        
    model_pipeline : sklearn.pipeline
        Must have a fited LDA model
    
    Notes
    ----- 
    Visualization taken from: https://stackoverflow.com/questions/44208501/getting-topic-word-distribution-from-lda-in-scikit-learn
    
    
    """

    topic_words = {}

    for topic, component in enumerate(model.named_steps.get('latentdirichletallocation').components_):
        # for the n-dimensional array "arr":
        # argsort() returns a ranked n-dimensional array of arr, call it "ranked_array"
        # which contains the indices that would sort arr in a descending fashion
        # for the ith element in ranked_array, ranked_array[i] represents the index of the
        # element in arr that should be at the ith index in ranked_array
        # ex. arr = [3,7,1,0,3,6]
        # np.argsort(arr) -> [3, 2, 0, 4, 5, 1]
        # word_idx contains the indices in "topic" of the top num_top_words most relevant
        # to a given topic ... it is sorted ascending to begin with and then reversed (desc. now)    
        word_idx = np.argsort(component)[::-1][:number_of_top_words]

        # store the words most relevant to the topic
        topic_words[topic] = [vocab[i] for i in word_idx]
        
    for topic, words in topic_words.items():
        
        print(f'Topic: {topic + 1}')
        
        print(f'Words:  {", ".join(words)}')

In [10]:
display_topics(number_of_top_words=5, model_pipeline=model, vocab=vocabulary)

Topic: 1
Words:  service, delivery, mode, inefficient, bad
Topic: 2
Words:  agent, solution, communication, behavior, slow
Topic: 3
Words:  like, unreliable, resolution, useless, dont


---
2022 - Jose Oliveira da Cruz