# Vulnerable Customers

Four key drivers of vulnerability:
1. Health – disabilities or illnesses that affect the ability to carry out day-to-day tasks
2. Life Events – major life events such as bereavement, job loss or relationship breakdown
3. Resilience – low ability to withstand financial or emotional shocks
4. Capability – low knowledge of financial matters or low confidence in managing money (financial capability), and low capability in other relevant areas such as literacy or digital skills

These drivers break down into 16 distinct topics:
1. Learning disability
2. Low income
3. Mental health issues
4. Health problems
5. Being a carer
6. Age
7. Physical disability
8. Lack of connectivity
9. Living alone
10. Lone parent
11. Loss of income
12. Leaving care
13. Bereavement
14. Relationship breakdown
15. Release from prison
16. Legal proceedings

Potential approaches:
1. Bag-of-words: needs enough training data to arrive at sensible features. In practice this is goal-seeking, because the only sensible features would be synonyms of the topic at hand.
2. Similarity measure: use a WordNet-based similarity measure to monitor a stream of text for words close in meaning to these topics, i.e. synonyms.

Decision: I will use the similarity measure, since it does not require collecting a large training set.
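To make the similarity-measure idea concrete, here is a minimal sketch of the Wu-Palmer score (the measure behind the `wup_similarity` call used later in this notebook) on a small hand-made hierarchy. The taxonomy, the `PARENT` dictionary and the function names are hypothetical and exist only for illustration. WordNet's real hierarchy is much larger, and NLTK's implementation may differ in details.

```python
# Hypothetical toy taxonomy (child -> parent). This is NOT WordNet's real hierarchy.
PARENT = {
    "entity": None,
    "condition": "entity",
    "event": "entity",
    "illness": "condition",
    "disability": "condition",
    "death": "event",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def toy_wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(toy_wup_similarity("illness", "disability"))  # shares "condition": about 0.67
print(toy_wup_similarity("illness", "death"))       # shares only the root: about 0.33
```

In this toy setting a threshold of 0.6 would treat "illness" and "disability" as related while keeping "illness" and "death" apart, which is the behaviour the similarity approach relies on.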

# NLTK semantic similarity

In [38]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet

In [206]:
def sentence_cleaner(phrase):
    '''Removes stopwords and basic punctuation from text, and converts it into a list of word tokens.
    Args:
    phrase = text string

    Outputs:
    list of word tokens
    '''
    # NLTK's English stopword list, plus the comma and full stop tokens
    stop_words = set(stopwords.words('english'))
    stop_words.update([',', '.'])
    word_tokens = word_tokenize(phrase)
    # Matching is case-sensitive, so capitalised stopwords (e.g. 'Many') are kept
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return filtered_sentence

def synonym_scorer(filtered_sentence, topic, sim_thresh=0.6):
    '''
    For each word in a sentence, retrieves its synsets. For each synset we measure the wup_similarity
    to the topic at hand. If any similarity > sim_thresh, the word counts as one mention of the topic.

    Args:
    filtered_sentence = tokenized sentence, preferably stripped of stopwords
    topic = Synset of the topic in question
    sim_thresh = threshold for topic similarity (default = 0.6)

    Outputs:
    Integer count of the number of mentions of the topic in the filtered_sentence
    '''
    word_scores = []
    for word in filtered_sentence:
        syns_sim = [topic.wup_similarity(syn) for syn in wordnet.synsets(word)]
        # wup_similarity can return None when the synsets cannot be compared; treat that as 0
        syns_sim = [x if x is not None else 0 for x in syns_sim]
        # default=0 covers words with no synsets, where np.max would raise on an empty list
        word_scores.append(max((1 if x > sim_thresh else 0 for x in syns_sim), default=0))
    return int(np.sum(word_scores))
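The counting rule in `synonym_scorer` can be checked without WordNet by feeding in similarity scores directly. The sketch below is a hypothetical helper (`count_mentions` is not part of the notebook or of NLTK) that takes one list of per-sense similarities for each word. It assumes that a word with no senses at all contributes nothing, and that `None` scores count as 0.

```python
def count_mentions(per_word_similarities, sim_thresh=0.6):
    """Count words that have at least one sense scoring above sim_thresh."""
    mentions = 0
    for sims in per_word_similarities:
        # Treat None (no comparable path between synsets) as a similarity of 0
        scores = [0 if s is None else s for s in sims]
        if any(score > sim_thresh for score in scores):
            mentions += 1  # a word counts once, however many senses pass
    return mentions

example = [
    [0.9, 0.7, 0.2],  # two senses pass, but the word is counted once
    [0.3, None],      # nothing passes
    [],               # word with no senses (assumed to contribute 0)
    [0.6],            # exactly at the threshold: the comparison is strict
]
print(count_mentions(example))  # 1
```

Because each word contributes at most one mention, the final score is a count of matching words rather than a sum of similarities.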

In [207]:
topic_dictionary = {'disability': 'disabled.n.01',
                    'death': 'die.v.01',
                    'health problems': 'ill.a.01',
                    'being a carer': 'care.v.02',
                    'living alone': 'alone.s.01'}

In [208]:
phrase = 'People grieve in different ways and there is no right or wrong way to react to the death of a colleague, friend or family member.  Many people find it helpful to reach out and talk to someone about their feelings, other may wish to deal with the loss in private. Disability ill disabled'

sim_scores = {}
tokens = sentence_cleaner(phrase)  # tokenize once and reuse for every topic
for topic, synset_name in topic_dictionary.items():
    topic_synset = wordnet.synset(synset_name)
    sim_scores[topic] = synonym_scorer(tokens, topic_synset)
sim_scores

{'disability': 3,
 'death': 1,
 'health problems': 1,
 'being a carer': 0,
 'living alone': 0}
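A possible next step is to turn these counts into a list of flagged topics. The sketch below copies the scores shown above into a literal dictionary so it runs on its own. The `flag_topics` helper and its `min_mentions` parameter are hypothetical, and the cut-off of one mention is an assumption rather than something this notebook has validated.

```python
# Scores copied from the output above
sim_scores = {'disability': 3,
              'death': 1,
              'health problems': 1,
              'being a carer': 0,
              'living alone': 0}

def flag_topics(scores, min_mentions=1):
    """Return the topics whose mention count reaches min_mentions (assumed cut-off)."""
    return [topic for topic, count in scores.items() if count >= min_mentions]

print(flag_topics(sim_scores))
# ['disability', 'death', 'health problems']
```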