# Eleos - Resource Recommendation Engine

## Introduction

...

In [613]:
# Importing the pandas and numpy libraries, as well as particular modules from sklearn.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

# Importing our resource database.
df = pd.read_csv('resources.csv')
df.head()

Unnamed: 0,r_id,name,url,tags,vote_average,vote_count
0,1,Headspace,https://www.headspace.com/,"[mindfulness, meditation, sleep, stress, anxie...",5,343
1,2,Samaritans Website,https://www.samaritans.org/,"[information, suicide, suicidal, kill, harm, d...",4,1964
2,3,NHS Suicide Help,https://www.nhs.uk/conditions/suicide/,"[information, suicide]",5,5787
3,4,NHS Self Harm Help,https://www.nhs.uk/conditions/self-harm/,"[information, selfharm]",5,1575
4,5,Samaritans Support,https://www.samaritans.org/how-we-can-help/con...,"[helpline, phoneline, chat, suicide, suicidal,...",4,5868


## Resource Search

In our resource database, we have inputted resources that fit under a specific category of other related resources. During the onboarding experience, users are yet to have informed us of their preferences of certain resources, meaning it is necessary to deliver to them a list of relevant resources that can fulfill a particular need. Here, we are creating a new dataframe that contains a list of resources that are only related to `grief`.

In [614]:
# Creating a new dataframe that contains resources with the keyword `grief`
dfS = df[df['tags'].str.contains("grief")]
dfS

Unnamed: 0,r_id,name,url,tags,vote_average,vote_count
5,6,Young Minds: Grief and Loss,https://youngminds.org.uk/find-help/feelings-a...,"[grief, loss, bereavement, grieve, death]",2,3031
6,7,NHS: Bereavement and Young People,https://www.nhs.uk/conditions/stress-anxiety-d...,"[grief, loss, bereavement, grieve, death]",3,4093
7,8,Grief Encounter Phoneline,https://www.griefencounter.org.uk/,"[helpline, grief, loss, bereavement, grieve, d...",3,5281
21,22,Grief: Support for Young People,https://apps.apple.com/gb/app/grief-support-fo...,"[app, journaling, grief, loss, bereavement, gr...",2,1694


## Resource Ranking

It is important that, from our resource dataframe, we are able to recommend users the best resources based on the feedback of other users. There are two ways in which we use this system of ranking resources:

1. Ranking a set of resources for a particular category, e.g. `grief`, based on user feedback in order to inform users during the onboarding experience of the best resources for a emotional wellbeing need. This is **not** personalised, and is simply a way of delivering users resources that have worked for other users.
2. Ranking a set of resources for a particular category, e.g. `grief`, based on feedback from users similar to other users who have rated other resources similarly, in order to allow them to explore options for resources that they may like. This **is** personalised and is only possible after a user has registered and informed us of their ratings of other resources. 

The ranking system employed in this recommendation engine is based on the weighting formula designed by IMDB. It works as follows:

1. An average is taken of the average rating for each resource in the resource dataframe.
2. The 60th percentile of the number of votes of the resources is taken. This is our 'minimum' vote count.
3. A new dataframe is created containing a list of resources from the main dataframe such that the vote count of each resource is equal to or greater than the 'minimum' vote count. These are now the resources that 'qualify' to be displayed to the user.
4. Using the vote count and vote average for each qualifying resources, a score is calculated using the IMDB weighting formula and stored in a new column, `score`.
5. The dataframe containing all qualifying resources is ordered by score in descending fashion.
6. These resources are finally outputted.

In [615]:
# Calculating the average vote average for each resource in the dataframe
C = dfS['vote_average'].mean()
C

2.5

In [616]:
# Calculating the 30th percentile for the vote count of all resources in the dataframe
m = dfS['vote_count'].quantile(0.3)
m

2897.2999999999997

In [617]:
# Creating a new dataframe containing only resources that meet the vote count threshold
dfQ = dfS.copy().loc[dfS['vote_count'] >= m]
dfQ.shape

(3, 6)

In [618]:
# Weighting the vote count and average for each resource
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [619]:
# Defining a 'score' and calculating its value with `weighted_rating()`
dfQ['score'] = dfQ.apply(weighted_rating, axis=1)

# Sorting the qualifying recommendations based on their score
dfQ = dfQ.sort_values('score', ascending=False)

# Printing the top recommendations
dfQ[['name', 'url']].head()

Unnamed: 0,name,url
7,Grief Encounter Phoneline,https://www.griefencounter.org.uk/
6,NHS: Bereavement and Young People,https://www.nhs.uk/conditions/stress-anxiety-d...
5,Young Minds: Grief and Loss,https://youngminds.org.uk/find-help/feelings-a...


## Content-Based Filtering

Once a user has informed us of their preferences of resources by individually rating them, it is important for us to be able to recommend other resources within that specific category that meet similar emotional wellbeing needs. This is achieved by vectorising the tags assigned to each resource and using trigonometric ratios to determine how similar a given resource is to the other resources available in the dataframe. The output of this function is then stripped down to 3 recommendations but is not ordered in any way.

In [620]:
df[['name', 'tags', 'url']].head()

Unnamed: 0,name,tags,url
0,Headspace,"[mindfulness, meditation, sleep, stress, anxie...",https://www.headspace.com/
1,Samaritans Website,"[information, suicide, suicidal, kill, harm, d...",https://www.samaritans.org/
2,NHS Suicide Help,"[information, suicide]",https://www.nhs.uk/conditions/suicide/
3,NHS Self Harm Help,"[information, selfharm]",https://www.nhs.uk/conditions/self-harm/
4,Samaritans Support,"[helpline, phoneline, chat, suicide, suicidal,...",https://www.samaritans.org/how-we-can-help/con...


In [621]:
count = CountVectorizer()
count_matrix = count.fit_transform(df['tags'])
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [622]:
indices = pd.Series(df['name'])

In [623]:
def recommend(name, cosine_sim = cosine_sim):
    recommendations = []
    idx = indices[indices == name].index[0]
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    top = list(score_series.iloc[1:4].index)
    
    for i in top:
        recommendations.append(list(df['name'])[i])
        
    dfR = df[df['name'].isin([recommendations[0], recommendations[1], recommendations[2]])]
    dfR = dfR[['name', 'url']]
    return dfR

In [624]:
rec = input("Ask me to recommend you a resource: ")
dfR = recommend(rec)
dfR

Ask me to recommend you a resource: Wysa


Unnamed: 0,name,url
0,Headspace,https://www.headspace.com/
18,Stoic,https://www.stoicroutine.com/
39,Flora,https://flora.appfinca.com/


## Collaborative Filtering

In order to further personalise the resources that are recommended to users, we will use a machine learning model that is able to predict how a user might rate a given resource by capturing the similarity between users and items, through a latent factor model. We are able to do this by provided the prediction algorithm with a dataframe of ratings made by users for the resources found in the main resources dataframe. By predicting the rating that a user might give to a set of resources, we are then able to rank this list by the product of the predicted rating and the already calculated weighted score, in order to generate a personalised list of resources based on the both the users interest in similar resources and the interests of similar users.

In [625]:
from surprise import Reader
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
reader = Reader()
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,u_id,r_id,rating
0,44,27,1
1,90,9,5
2,40,4,3
3,50,6,2
4,68,3,4


In [626]:
data = Dataset.load_from_df(ratings[['u_id', 'r_id', 'rating']], reader)
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.4657  1.6393  1.3414  1.3810  1.4625  1.4580  0.1024  
MAE (testset)     1.2607  1.4698  1.1560  1.2001  1.2781  1.2729  0.1076  
Fit time          0.06    0.03    0.02    0.02    0.02    0.03    0.02    
Test time         0.01    0.00    0.00    0.00    0.00    0.00    0.00    


{'test_rmse': array([1.46568765, 1.63926273, 1.34135454, 1.38103464, 1.46251326]),
 'test_mae': array([1.2607092 , 1.46975444, 1.15595313, 1.20006824, 1.27806652]),
 'fit_time': (0.05863308906555176,
  0.033702850341796875,
  0.021668672561645508,
  0.01994800567626953,
  0.018419981002807617),
 'test_time': (0.007441043853759766,
  0.0009410381317138672,
  0.0007350444793701172,
  0.0007178783416748047,
  0.0005171298980712891)}

In [627]:
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x11eefcbb0>

In [628]:
u_id = int(input('Ask me to predict the ratings for a user: '))
# ratings[ratings['u_id'] == u_id]

predictions = []
for i in range (0, len(df)):
    prediction = algo.predict(u_id, i)
    predictions.append(prediction.est)

dfP = pd.DataFrame(predictions, columns = ["prediction"])
dfP = pd.merge(df, dfP, left_index=True, right_index=True)
dfP = dfP[dfP['tags'].str.contains("grief")]


# Calculating the average vote average for each resource in the dataframe
C = dfP['vote_average'].mean()
C

# Calculating the 60th percentile for the vote count of all resources in the dataframe
m = dfP['vote_count'].quantile(0.6)
m

# Defining a 'score' and calculating its value with `weighted_rating()`
dfP['score'] = dfP.apply(weighted_rating, axis=1)

#dfS = pd.merge(dfS, dfP, left_index=True, right_index=True)

# Weighting the vote count and average for each qualifying resource
def weighted_prediction(x):
    score = x['score']
    prediction = x['prediction']
    return score * prediction

# Defining a 'score' and calculating its value with `weighted_rating()`
dfP['prediction'] = dfP.apply(weighted_prediction, axis=1)

# Sorting the qualifying recommendations based on their score
dfP = dfP.sort_values('prediction', ascending=False)

# Printing the top recommendations
dfP[['name', 'prediction']].head()

Ask me to predict the ratings for a user: 5


Unnamed: 0,name,prediction
7,Grief Encounter Phoneline,8.37934
5,Young Minds: Grief and Loss,7.600437
6,NHS: Bereavement and Young People,7.504209
21,Grief: Support for Young People,6.003086


In [629]:
from monkeylearn import MonkeyLearn

ml = MonkeyLearn('da0088bcfbb0add6f5cccff301a6d98bfbac4d77')
data = ["School has been stressing me out."]
model_id = 'ex_YCya9nrn'
result = ml.extractors.extract(model_id, data)
print(result.body)

[{'text': 'School has been stressing me out.', 'external_id': None, 'error': False, 'extractions': [{'tag_name': 'KEYWORD', 'parsed_value': 'school', 'count': 1, 'relevance': '0.909', 'positions_in_text': [0]}]}]


In [630]:
import pandas as pd
from pandas.io.json import json_normalize
dfB = pd.json_normalize(result.body)
dfB

Unnamed: 0,text,external_id,error,extractions
0,School has been stressing me out.,,False,"[{'tag_name': 'KEYWORD', 'parsed_value': 'scho..."


In [631]:
dfB = pd.json_normalize(dfB.iloc[0]['extractions']).head()
dfB.rename(columns={'parsed_value':'keywords'}, inplace=True)
dfB = dfB[['keywords', 'relevance']]

In [632]:
dfB.head(1)

Unnamed: 0,keywords,relevance
0,school,0.909


In [633]:
#Python3 code to remove whitespace 
def strip(keyword): 
    return keyword.replace(" ", "") 

keyword = strip(dfB.iloc[0]['keywords']).lower()
dfL = df[df['tags'].str.contains(keyword)]
dfL

Unnamed: 0,r_id,name,url,tags,vote_average,vote_count
24,25,For Me,https://apps.apple.com/gb/app/for-me-app/id109...,"[app, chat, forums, counselling, mood tracking...",4,5474
34,35,Student Health App,https://www.nhs.uk/apps-library/student-health...,"[app, student, school, health]",3,1609
35,36,Headspace App,https://www.headspace.com/headspace-meditation...,"[app, relax, meditation, midnfullness, student...",1,2122
37,38,Surviving Exams,https://www.nhs.uk/conditions/stress-anxiety-d...,"[information, exams, school, students, stress]",4,4597
39,40,Flora,https://flora.appfinca.com/,"[app, focus, productivity, stress, school, exa...",5,5097
40,41,My Life,https://my.life/,"[app, students, school, study, calm, breathe, ...",4,5554
41,42,NHS: Student Stress,https://www.nhs.uk/conditions/stress-anxiety-d...,"[information, school, stress, students]",2,867
