# DonorsChoose: Donor-Project Matching with Recommender Systems
Data and project idea come from a [Kaggle competition](https://www.kaggle.com/donorschoose/io).
Much of the recommender work is based on a [tutorial](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/code) by [Gabriel Moreira](https://www.kaggle.com/gspmoreira).

# Donors Choose
Founded in 2000 by a Bronx history teacher, DonorsChoose.org has raised $685 million for America's classrooms. Teachers at three-quarters of all the public schools in the U.S. have come to DonorsChoose.org to request what their students need, making DonorsChoose.org the leading platform for supporting public education.

To date, 3 million people and partners have funded 1.1 million DonorsChoose.org projects. But teachers still spend more than a billion dollars of their own money on classroom materials. To get students what they need to learn, the team at DonorsChoose.org needs to be able to connect donors with the projects that most inspire them.

In the second Kaggle Data Science for Good challenge, DonorsChoose.org, in partnership with Google.org, is inviting the community to help them pair up donors to the classroom requests that will most motivate them to make an additional gift. To support this challenge, DonorsChoose.org has supplied anonymized data on donor giving from the past five years. The winning methods will be implemented in DonorsChoose.org email marketing campaigns.

# Problem Statement
DonorsChoose.org has funded over 1.1 million classroom requests through the support of 3 million donors, the majority of whom were making their first-ever donation to a public school. If DonorsChoose.org can motivate even a fraction of those donors to make another donation, that could have a huge impact on the number of classroom requests fulfilled.

A good solution will enable DonorsChoose.org to build targeted email campaigns recommending specific classroom requests to prior donors. Part of the challenge is to assess the needs of the organization, uncover insights from the data available, and build the right solution for this problem. Submissions will be evaluated on the following criteria:

Performance - How well does the solution match donors to project requests to which they would be motivated to donate? DonorsChoose.org will not be able to live test every submission, so a strong entry will clearly articulate why it will be effective at motivating repeat donations.

Adaptable - The DonorsChoose.org team wants to put the winning submissions to work, quickly. Therefore a good entry will be easy to implement in production.

Intelligible - A good entry should be easily understood by the DonorsChoose.org team should it need to be updated in the future to accommodate a changing marketplace.

# Proposed Solution

I will address the problem by using [Recommender System](https://en.wikipedia.org/wiki/Recommender_system) (RecSys) techniques. The objective of a RecSys is to recommend relevant items for users, based on their preference. Preference and relevance are subjective, and they are generally inferred by items users have consumed previously.

The main RecSys techniques are:  
   - [**Collaborative Filtering**](https://en.wikipedia.org/wiki/Collaborative_filtering): This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.   
   - [**Content-Based Filtering**](http://recommender-systems.org/content-based-filtering/): This method uses only information about the description and attributes of the items users has previously consumed to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.  
   - **Hybrid methods**:  Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering could be more effective than pure approaches in some cases. These methods can also be used to overcome some of the common problems in recommender systems such as cold start and the sparsity problem.

## Load libraries

In [1]:
#enable auto complete
%config IPCompleter.greedy=True
%matplotlib inline

In [2]:
import pandas as pd # package for high-performance, easy-to-use data structures and data analysis
import numpy as np # fundamental package for scientific computing with Python
import matplotlib as cm
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import scipy
from scipy.sparse.linalg import svds
import math
import random
import sklearn

from numpy import array

from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


#import os
#print(os.listdir("../input"))

from sklearn import preprocessing
# Supress unnecessary warnings so that presentation looks clean
import warnings
warnings.filterwarnings("ignore")

# Print all rows and columns
pd.set_option('display.max_columns', 21)
pd.set_option('display.max_rows', None)

## Load data

In [3]:
projects = pd.read_csv('../input/Projects.csv')
donations = pd.read_csv('../input/Donations.csv')
donors = pd.read_csv('Donors.csv', low_memory=False)

print(' donations: ',donations.shape,'\n','donors: ',donors.shape,'\n','schools',schools.shape,'\n','teachers',teachers.shape,'\n','projects',teachers.shape,'\n','resources',resources.shape)

In [4]:
donations.rename(index=str, columns={"Project ID": "project_id"},inplace=True)
projects.rename(index=str, columns={"Project ID": "project_id"},inplace=True)

In [5]:
donors.head(5)

Unnamed: 0,Donor ID,Donor City,Donor State,Donor Is Teacher,Donor Zip
0,00000ce845c00cbf0686c992fc369df4,Evanston,Illinois,No,602
1,00002783bc5d108510f3f9666c8b1edd,Appomattox,other,No,245
2,00002d44003ed46b066607c5455a999a,Winton,California,Yes,953
3,00002eb25d60a09c318efbd0797bffb5,Indianapolis,Indiana,No,462
4,0000300773fe015f870914b42528541b,Paterson,New Jersey,No,75


In [6]:
#this piece of code converts Donor_ID which is a 32-bit Hex int to the log of the base 10 equivalent
#dfM = pd.read_csv('Donors.csv', low_memory=False)

#dfM.rename(index=str, columns={"Donor ID": "donor_id"},inplace=True)

dfM=donors['Donor ID'].apply(pd.Series)

dfM["new0"] = np.nan
dfM["new1"] = np.nan
dfM["new2"] = np.nan
dfM["new3"] = np.nan
dfM["new4"] = np.nan
dfM["power"] = np.nan

dfM.new1 = dfM[0].apply(lambda x: str(x[-int(len(x)/2):]))
dfM.new0 = dfM[0].apply(lambda x: str(x[:-int(len(x)/2)]))

dfM.new0 = dfM.new0.apply(lambda x: (int(x, 16)))
dfM.new1 = dfM.new1.apply(lambda x: (int(x, 16)))

dfM.power = dfM.new1.apply(lambda x: int(math.log10(x)))

dfM.new2 = dfM.new0.apply(lambda x: math.log10(x))
dfM.new3 = dfM.new1.apply(lambda x: math.log10(x))

dfM.new4 = ((dfM.new2 + dfM.new3 + dfM.power)*10**15)
dfM.new4.astype(int)
dfM.new4.apply(np.floor)
dfM = dfM.drop(['new0', 'new1', 'new2','new3','power'], axis=1)

dfM.rename(index=str, columns={0: "Donor ID", "new4":"donor_id"},inplace=True)


#add logDonorID as donor_id to donors
donors = donors.merge(dfM, left_on='Donor ID', right_on='Donor ID', how="left")

donations = donations.merge(dfM, left_on='Donor ID', right_on='Donor ID', how="right")

In [7]:
#check for donor_id duplicates
from collections import Counter
mylist = donors.donor_id
a=[k for k,v in Counter(mylist).items() if v>1]
len(a)

0

In [None]:
donors.shape

In [8]:
donations.head(5)

Unnamed: 0,project_id,Donation ID,Donor ID,Donation Included Optional Donation,Donation Amount,Donor Cart Sequence,Donation Received Date,donor_id
0,000009891526c0ade7180f8423792063,688729120858666221208529ee3fc18e,1f4b5b6e68445c6c4a0509b3aca93f38,No,178.37,11.0,2016-08-23 13:15:57,5.508017e+16
1,016510b8226e70d740130ac2dcfb6c5e,f7fc7cf0b8980fb00840b4afe7c1e761,1f4b5b6e68445c6c4a0509b3aca93f38,No,807.92,20.0,2016-12-21 13:03:59,5.508017e+16
2,03c8766872a129240d14be8c385b5f1a,5015b2df023ed47e7609e91ca65f7559,1f4b5b6e68445c6c4a0509b3aca93f38,No,288.99,71.0,2018-01-25 17:01:41,5.508017e+16
3,04bfceb168d816a3cbe52f1e70d30bf0,b8871d3666020f0a527c8d6b56361d1e,1f4b5b6e68445c6c4a0509b3aca93f38,No,1200.05,38.0,2017-10-18 12:26:15,5.508017e+16
4,05a4e3418a97f2df3a6cc8ae8fbde60c,8bc4de01f65d42a611236e083c6f3473,1f4b5b6e68445c6c4a0509b3aca93f38,No,565.26,75.0,2018-01-25 18:00:23,5.508017e+16


In [9]:
donors.head(10)

Unnamed: 0,Donor ID,Donor City,Donor State,Donor Is Teacher,Donor Zip,donor_id
0,00000ce845c00cbf0686c992fc369df4,Evanston,Illinois,No,602.0,4.78244e+16
1,00002783bc5d108510f3f9666c8b1edd,Appomattox,other,No,245.0,4.972488e+16
2,00002d44003ed46b066607c5455a999a,Winton,California,Yes,953.0,4.836073e+16
3,00002eb25d60a09c318efbd0797bffb5,Indianapolis,Indiana,No,462.0,5.026328e+16
4,0000300773fe015f870914b42528541b,Paterson,New Jersey,No,75.0,5.071083e+16
5,00004c31ce07c22148ee37acd0f814b9,,other,No,,5.064371e+16
6,00004e32a448b4832e1b993500bf0731,Stamford,Connecticut,No,69.0,5.045585e+16
7,00004fa20a986e60a40262ba53d7edf1,Green Bay,Wisconsin,No,543.0,5.201484e+16
8,00005454366b6b914f9a8290f18f4aed,Argyle,New York,No,128.0,5.072579e+16
9,0000584b8cdaeaa6b3de82be509db839,Valparaiso,Indiana,No,463.0,5.209977e+16


In [10]:
donors.donor_id.iloc[0]

47824396695228792.0

# EDA

In [11]:
#schools.head(3)

In [12]:
#teachers.head(3)

In [13]:
#plt.rcParams["figure.figsize"] = [12,6]
#teachers['Teacher Prefix'].plot(kind = 'bar')
#sns.countplot(x='Teacher Prefix', data=teachers);

In [14]:
projects.head(3)

Unnamed: 0,project_id,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,77b7d3f2ac4e32d538914e4a8cb8a525,c2d5cb0a29a62e72cdccee939f434181,59f7d2c62f7e76a99d31db6f62b7b67c,2,Teacher-Led,Anti-Bullying Begins with Me,do you remember your favorite classroom from e...,"Applied Learning, Literacy & Language","Character Education, Literacy",Grades PreK-2,Books,$490.38,2013-01-01,Fully Funded,2013-03-12
1,fd928b7f6386366a9cad2bea40df4b25,8acbb544c9215b25c71a0c655200baea,8fbd92394e20d647ddcdc6085ce1604b,1,Teacher-Led,Ukuleles For Middle Schoolers,what sound is happier than a ukulele? we have...,Music & The Arts,Music,Grades 6-8,Supplies,$420.61,2013-01-01,Expired,
2,7c915e8e1d27f10a94abd689e99c336f,0ae85ea7c7acc41cffa9f81dc61d46df,9140ac16d2e6cee45bd50b0b2ce8cd04,2,Teacher-Led,"Big Books, Flip Books, And Everything In Between","my 1st graders may be small, but they have big...","Literacy & Language, Special Needs","Literacy, Special Needs",Grades PreK-2,Books,$510.46,2013-01-01,Fully Funded,2013-01-07


In [15]:
#resources.head(3)

Donors who have donated to more than 1 campaign

In [16]:
donations_per_donor = donations.groupby('donor_id')['Donor Cart Sequence'].max()
donations_per_donor1 = round(((donations_per_donor == 1).mean() *100),2)
print("No more than 1 donation is given by: "+ str(donations_per_donor1) +"% donors")
donations_per_donor_more_than_1 = round(((donations_per_donor > 1).mean() *100),2)
print("More than 1 donation is given by: "+ str(donations_per_donor_more_than_1) +"% donors")

No more than 1 donation is given by: 65.5% donors
More than 1 donation is given by: 29.61% donors


In [17]:
donations_per_donor = donations.groupby('donor_id')['Donor Cart Sequence'].max()
donations_per_donor1 = round(((donations_per_donor == 1).mean() *100),3)
donations_per_donor_under_5 = (donations_per_donor < 5).mean() *100
donations_per_donor_under_10 = (donations_per_donor <10).mean() *100
donations_per_donor_under_15 = (donations_per_donor <15).mean() *100
donations_per_donor_under_20 = (donations_per_donor <20).mean() *100
donations_per_donor_under_25 = (donations_per_donor <25).mean() *100
donations_per_donor_under_30 = (donations_per_donor <30).mean() *100
donations_per_donor_over_29 = round(((donations_per_donor > 29).mean() *100),3)

between1_5=round(donations_per_donor_under_5-donations_per_donor1,3)
between5_10=round(donations_per_donor_under_10-donations_per_donor_under_5,3)
between10_15=round(donations_per_donor_under_15-donations_per_donor_under_10,3)
between15_20=round(donations_per_donor_under_20-donations_per_donor_under_15,3)
between20_25=round(donations_per_donor_under_25-donations_per_donor_under_20,3)
between25_30=round(donations_per_donor_under_30-donations_per_donor_under_25,3)

print("Only one time donation is given by: "+ str(donations_per_donor1) +"% donors")
print("2 to 4 donations are given by: "+ str(between1_5) +"% donors")
print("5 to 9 donations are given by: "+ str(between5_10) +"% donors")
print("10 to 14 donations are given by: "+ str(between10_15) +"% donors")
print("15 to 19 donations are given by: "+ str(between15_20) +"% donors")
print("20 to 24 donations are given by: "+ str(between20_25) +"% donors")
print("25 to 29 donations are given by: "+ str(between25_30) +"% donors")
print("29 or more donations are given by: "+ str(donations_per_donor_over_29) +"% donors")

Only one time donation is given by: 65.502% donors
2 to 4 donations are given by: 22.388% donors
5 to 9 donations are given by: 4.662% donors
10 to 14 donations are given by: 1.206% donors
15 to 19 donations are given by: 0.49% donors
20 to 24 donations are given by: 0.257% donors
25 to 29 donations are given by: 0.15% donors
29 or more donations are given by: 0.461% donors


In [18]:
total=donations_per_donor1+between1_5+between5_10+between10_15+between15_20+between20_25+between25_30+donations_per_donor_over_29

print('Percentages added together: '+ str(total)+'%')

Percentages added together: 95.116%


In [19]:
donations_per_donor_under_10 = (donations_per_donor <10).mean() *100
donations_per_donor0 = (donations_per_donor > 0).mean() *100
donations_per_donor_over_9=round((donations_per_donor0-donations_per_donor_under_10),2)
print("10 or more donations are given by "+str(donations_per_donor_over_9)+"% donors")

10 or more donations are given by 2.56% donors


Before modeling, we need to measure the relation strength between a donor and a project. Although most donors only donate once in the dataset, there are donors who donated to the same project multiple times, and users who donated to multiple projects. The donation amount also varies. To better measure this strength, we combine the times and amounts of donations, and create a new dataset containing unique donation relations between a donor, a project, and the relation strength. he number of projects and unique donor-project donation events:

### Set up test mode where only 10000 rows of donation/donor dataframe are used. 
When testing is complete we will need to turn off test mode.

In [20]:
# Set up test mode to save some time
test_mode = True

# Merge datasets
donations = donations.merge(donors, on="donor_id", how="left")
df = donations.merge(projects,on="project_id", how="left")


# only load a few lines in test mode
if test_mode:
    df = df.head(10000)

donations_df = df
print('shape of df is ',df.shape)

shape of df is  (10000, 27)


In [21]:
donors.head(3)

Unnamed: 0,Donor ID,Donor City,Donor State,Donor Is Teacher,Donor Zip,donor_id
0,00000ce845c00cbf0686c992fc369df4,Evanston,Illinois,No,602,4.78244e+16
1,00002783bc5d108510f3f9666c8b1edd,Appomattox,other,No,245,4.972488e+16
2,00002d44003ed46b066607c5455a999a,Winton,California,Yes,953,4.836073e+16


In [22]:
donors.head(2)

Unnamed: 0,Donor ID,Donor City,Donor State,Donor Is Teacher,Donor Zip,donor_id
0,00000ce845c00cbf0686c992fc369df4,Evanston,Illinois,No,602,4.78244e+16
1,00002783bc5d108510f3f9666c8b1edd,Appomattox,other,No,245,4.972488e+16


In [23]:
#df -> donors + donations + projects
print('shape of df is ',df.shape)

shape of df is  (10000, 27)


In [24]:
# Deal with missing values
donations["Donation Amount"] = donations["Donation Amount"].fillna(0)

# Define event strength as the donated amount to a certain project
donations_df['eventStrength'] = donations_df['Donation Amount']

def smooth_donor_preference(x):
    return math.log(1+x, 2)
    
donations_full_df = donations_df \
                    .groupby(['donor_id', 'project_id'])['eventStrength'].sum() \
                    .apply(smooth_donor_preference).reset_index()
        
# Update projects dataset
project_cols = projects.columns
projects = df[project_cols].drop_duplicates()

print('# of projects: %d' % len(projects))
print('# of unique user/project donations: %d' % len(donations_full_df))

# of projects: 8810
# of unique user/project donations: 8907


In [25]:
donations_full_df.head()

Unnamed: 0,donor_id,project_id,eventStrength
0,5.161127e+16,0000c0bdc0f15bd239cfffa884791a10,4.392317
1,5.161127e+16,e2c3be0c5473779d91e1a3c9a1cfac5a,4.392317
2,5.257667e+16,0000bbd74feb563a324fe441eae19feb,6.044394
3,5.282365e+16,0000c0ea0aecb2ad60e8d234eab6ed28,3.459432
4,5.289632e+16,0000fc11407901bcacdfad1db909b9f6,4.70044


# Evaluation

Evaluation is important for machine learning projects, because it allows to compare objectivelly different algorithms and hyperparameter choices for models.
One key aspect of evaluation is to ensure that the trained model generalizes for data it was not trained on, using Cross-validation techniques. We are using here a simple cross-validation approach named holdout, in which a random data sample (20% in this case) are kept aside in the training process, and exclusively used for evaluation. All evaluation metrics reported here are computed using the test set.

Ps. A more robust evaluation approach could be to split train and test sets by a reference date, where the train set is composed by all interactions before that date, and the test set are interactions after that date. For the sake of simplicity, we chose the first random approach for this notebook, but you may want to try the second approach to better simulate how the recsys would perform in production predicting "future" users interactions.

In [26]:
donations_train_df, donations_test_df = train_test_split(donations_full_df,
                                   test_size=0.20,
                                   random_state=42)

print('# donations on Train set: %d' % len(donations_train_df))
print('# donations on Test set: %d' % len(donations_test_df))

# donations on Train set: 7125
# donations on Test set: 1782


In [27]:
#Indexing by donor_id to speed up the searches during evaluation
donations_full_indexed_df = donations_full_df.set_index('donor_id')
donations_train_indexed_df = donations_train_df.set_index('donor_id')
donations_test_indexed_df = donations_test_df.set_index('donor_id')

person_id -> 'donor_id'
contentId -> project_id
articles_df -> donations_df
item_id -> 'project_id'
interactions_df -> donations_df
interactions -> donations
items -> projects
interacted -> donated`

In [28]:
#get_projects_donated replaced with get_proj_donated
def get_proj_donated(donor_id, donations_df):
    # Get the user's data and merge in project info
    donated_projects = donations_df.loc[donor_id]['project_id']
    return set(donated_projects if type(donated_projects) == pd.Series else [donated_projects])

In [29]:
df.head(5)

Unnamed: 0,project_id,Donation ID,Donor ID_x,Donation Included Optional Donation,Donation Amount,Donor Cart Sequence,Donation Received Date,donor_id,Donor ID_y,Donor City,...,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date,eventStrength
0,000009891526c0ade7180f8423792063,688729120858666221208529ee3fc18e,1f4b5b6e68445c6c4a0509b3aca93f38,No,178.37,11.0,2016-08-23 13:15:57,5.508017e+16,1f4b5b6e68445c6c4a0509b3aca93f38,West Jordan,...,the music students in our classes perform freq...,Music & The Arts,Music,Grades 6-8,Other,$529.68,2016-05-13,Fully Funded,2016-08-23,178.37
1,016510b8226e70d740130ac2dcfb6c5e,f7fc7cf0b8980fb00840b4afe7c1e761,1f4b5b6e68445c6c4a0509b3aca93f38,No,807.92,20.0,2016-12-21 13:03:59,5.508017e+16,1f4b5b6e68445c6c4a0509b3aca93f38,West Jordan,...,the biggest challenge i face is providing enou...,Math & Science,"Applied Sciences, Environmental Science",Grades 6-8,Technology,"$1,229.32",2016-09-02,Fully Funded,2016-12-21,807.92
2,03c8766872a129240d14be8c385b5f1a,5015b2df023ed47e7609e91ca65f7559,1f4b5b6e68445c6c4a0509b3aca93f38,No,288.99,71.0,2018-01-25 17:01:41,5.508017e+16,1f4b5b6e68445c6c4a0509b3aca93f38,West Jordan,...,my students love to do hands on activities in ...,Math & Science,Applied Sciences,Grades 9-12,Computers & Tablets,$679.98,2017-11-15,Fully Funded,2018-01-25,288.99
3,04bfceb168d816a3cbe52f1e70d30bf0,b8871d3666020f0a527c8d6b56361d1e,1f4b5b6e68445c6c4a0509b3aca93f38,No,1200.05,38.0,2017-10-18 12:26:15,5.508017e+16,1f4b5b6e68445c6c4a0509b3aca93f38,West Jordan,...,my students come to school excited to learn. t...,Literacy & Language,Literacy,Grades PreK-2,Books,"$2,848.64",2017-09-04,Fully Funded,2017-10-18,1200.05
4,05a4e3418a97f2df3a6cc8ae8fbde60c,8bc4de01f65d42a611236e083c6f3473,1f4b5b6e68445c6c4a0509b3aca93f38,No,565.26,75.0,2018-01-25 18:00:23,5.508017e+16,1f4b5b6e68445c6c4a0509b3aca93f38,West Jordan,...,my students are amazing teenagers who want to ...,Applied Learning,College & Career Prep,Grades 9-12,Trips,"$1,468.24",2017-10-09,Fully Funded,2018-01-25,565.26


In [30]:
donations_full_indexed_df.head(10)

Unnamed: 0_level_0,project_id,eventStrength
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1
5.161127e+16,0000c0bdc0f15bd239cfffa884791a10,4.392317
5.161127e+16,e2c3be0c5473779d91e1a3c9a1cfac5a,4.392317
5.257667e+16,0000bbd74feb563a324fe441eae19feb,6.044394
5.282365e+16,0000c0ea0aecb2ad60e8d234eab6ed28,3.459432
5.289632e+16,0000fc11407901bcacdfad1db909b9f6,4.70044
5.289632e+16,5b355d7256f4c0059084f648aa53e207,4.954196
5.289632e+16,96cca7084fc1fcbb75923bd019b39eeb,3.459432
5.289632e+16,990f8ac152b0f62451660c0380134abc,2.584963
5.289632e+16,994e024c4301e2ac6a24e34e4ba14197,2.584963
5.289632e+16,ca8c719cfe6029a2968b8f7a5c756697,2.584963


In [31]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_PROJECTS = 100

class ModelEvaluator:


    def get_not_donated_projects_sample(self, donor_id, sample_size, seed=42):
        donated_projects = get_proj_donated(donor_id, donations_full_indexed_df)
        all_projects = set(projects_df['project_id'])
        non_donated_projects = all_projects - donated_projects

        random.seed(seed)
        non_donated_projects_sample = random.sample(non_donated_projects, sample_size)
        return set(non_donated_projects_sample)

    def _verify_hit_top_n(self, project_id, recommended_projects, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_projects) if c == project_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, donor_id):
        #Getting the projects in test set
        donated_values_testset = donations_test_indexed_df.loc[donor_id]
        if type(donated_values_testset['project_id']) == pd.Series:
            person_donated_projects_testset = set(donated_values_testset['project_id'])
        else:
            person_donated_projects_testset = set([int(donated_values_testset['project_id'])])  
        donated_projects_count_testset = len(person_donated_projects_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_projects(donor_id, 
                                               projects_to_ignore=get_proj_donated(donor_id, 
                                                                                    donations_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has donated in test set
        for project_id in person_donated_projects_testset:
            #Getting a random sample (100) projects the user has not donated 
            #(to represent projects that are assumed to be no relevant to the user)
            non_donated_projects_sample = self.get_not_donated_projects_sample(donor_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_PROJECTS, 
                                                                          seed=project_id%(2**32))

            #Combining the current donated item with the 100 random projects
            projects_to_filter_recs = non_donated_projects_sample.union(set(['project_id']))

            #Filtering only recommendations that are either the donated item or from a random sample of 100 non-donated projects
            valid_recs_df = person_recs_df[person_recs_df['project_id'].isin(projects_to_filter_recs)]                    
            valid_recs = valid_recs_df['project_id'].values
            #Verifying if the current donated item is among the Top-N recommended projects
            hit_at_5, index_at_5 = self._verify_hit_top_n(project_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(project_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the donated projects that are ranked among the Top-N recommended projects, 
        #when mixed with a set of non-relevant projects
        recall_at_5 = hits_at_5_count / float(donated_projects_count_testset)
        recall_at_10 = hits_at_10_count / float(donated_projects_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'donated_count': donated_projects_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, donor_id in enumerate(list(donations_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, donor_id)  
            person_metrics['_donor_id'] = donor_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('donated_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['donated_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['donated_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator() 

## Content-Based Filtering model
We will use Content-Based Filtering method to find projects that are similar to the project(s) that a donor has already donated to. We can calculate the similarity between projects based on data and/or text features extracted from the text data.

I used [a tutortial by](https://www.kaggle.com/gunnvant/building-content-recommender-tutorial/notebook) user [gunnvant](https://www.kaggle.com/gunnvant) to construct word vectors with TF-IDF.

In [32]:
# Preprocessing of text data
textfeats = ["Project Title","Project Essay"]
for cols in textfeats:
    projects[cols] = projects[cols].astype(str) 
    projects[cols] = projects[cols].astype(str).fillna('') # FILL NA
    projects[cols] = projects[cols].str.lower() # Lowercase all text, so that capitalized words dont get treated differently
 
text = projects["Project Title"] + ' ' + projects["Project Essay"]
vectorizer = TfidfVectorizer(strip_accents='unicode',
                             analyzer='word',
                             lowercase=True, # Convert all uppercase to lowercase
                             stop_words='english', # Remove commonly found english words ('it', 'a', 'the') which do not typically contain much signal
                             max_df = 0.9, # Only consider words that appear in fewer than max_df percent of all documents
                             # max_features=5000 # Maximum features to be extracted                    
                            )                        
project_ids = projects['project_id'].tolist()
tfidf_matrix = vectorizer.fit_transform(text)
tfidf_feature_names = vectorizer.get_feature_names()
tfidf_matrix

<8810x23439 sparse matrix of type '<class 'numpy.float64'>'
	with 857980 stored elements in Compressed Sparse Row format>

To model the user profile, we take all the item profiles the user has interacted and average them. The average is weighted by the interaction strength, in other words, the articles the user has interacted the most (eg. liked or commented) will have a higher strength in the final user profile.

In [33]:
def get_project_profile(project_id):
    idx = project_ids.index(project_id)
    project_profile = tfidf_matrix[idx:idx+1]
    return project_profile

def get_project_profiles(ids):
    project_profiles_list = [get_project_profile(x) for x in np.ravel([ids])]
    project_profiles = scipy.sparse.vstack(project_profiles_list)
    return project_profiles

def build_donors_profile(donor_id, donations_indexed_df):
    donations_donor_df = donations_indexed_df.loc[donor_id]
    donor_project_profiles = get_project_profiles(donations_donor_df['project_id'])
    donor_project_strengths = np.array(donations_donor_df['eventStrength']).reshape(-1,1)
    #Weighted average of project profiles by the donations strength
    donor_project_strengths_weighted_avg = np.sum(donor_project_profiles.multiply(donor_project_strengths), axis=0) / (np.sum(donor_project_strengths)+1)
    donor_profile_norm = sklearn.preprocessing.normalize(donor_project_strengths_weighted_avg)
    return donor_profile_norm


def build_donors_profiles(): 
    donations_indexed_df = donations_full_df[donations_full_df['project_id'].isin(projects['project_id'])].set_index('donor_id')
    donor_profiles = {}
    for donor_id in donations_indexed_df.index.unique():
        donor_profiles[donor_id] = build_donors_profile(donor_id, donations_indexed_df)
    return donor_profiles

In [34]:
donor_profiles = build_donors_profiles()
print("# of donors with profiles: %d" % len(donor_profiles))

# of donors with profiles: 87


In [35]:
donations_full_indexed_df.head(10)

Unnamed: 0_level_0,project_id,eventStrength
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1
5.161127e+16,0000c0bdc0f15bd239cfffa884791a10,4.392317
5.161127e+16,e2c3be0c5473779d91e1a3c9a1cfac5a,4.392317
5.257667e+16,0000bbd74feb563a324fe441eae19feb,6.044394
5.282365e+16,0000c0ea0aecb2ad60e8d234eab6ed28,3.459432
5.289632e+16,0000fc11407901bcacdfad1db909b9f6,4.70044
5.289632e+16,5b355d7256f4c0059084f648aa53e207,4.954196
5.289632e+16,96cca7084fc1fcbb75923bd019b39eeb,3.459432
5.289632e+16,990f8ac152b0f62451660c0380134abc,2.584963
5.289632e+16,994e024c4301e2ac6a24e34e4ba14197,2.584963
5.289632e+16,ca8c719cfe6029a2968b8f7a5c756697,2.584963


Get top 5 terms for 10 donors 

In [36]:
ind_donor=donations_full_indexed_df.index.values[0:10]
ind_donor

array([  5.16112736e+16,   5.16112736e+16,   5.25766747e+16,
         5.28236548e+16,   5.28963184e+16,   5.28963184e+16,
         5.28963184e+16,   5.28963184e+16,   5.28963184e+16,
         5.28963184e+16])

In [37]:
donor1 = ind_donor[0]
donor2 = ind_donor[1]
donor3 = ind_donor[2]
donor4 = ind_donor[3]
donor5 = ind_donor[4]
donor6 = ind_donor[5]
donor7 = ind_donor[6]
donor8 = ind_donor[7]
donor9 = ind_donor[8]
donor10 = ind_donor[9]

In [38]:
donor1_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor1].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token1', 'relevance1'])
donor2_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor2].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token2', 'relevance2'])
donor3_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor3].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token3', 'relevance3'])
donor4_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor4].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token4', 'relevance4'])
donor5_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor5].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token5', 'relevance5'])
donor6_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor6].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token6', 'relevance6'])
donor7_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor7].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token7', 'relevance7'])
donor8_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor8].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token8', 'relevance8'])
donor9_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor9].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token9', 'relevance9'])
donor10_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor10].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token10', 'relevance10'])

Join all the info into one table

In [39]:
example_profiles = donor1_profile.join(donor2_profile)
example_profiles = example_profiles.join(donor3_profile)
example_profiles = example_profiles.join(donor4_profile)
example_profiles = example_profiles.join(donor6_profile)
example_profiles = example_profiles.join(donor7_profile)
example_profiles = example_profiles.join(donor8_profile)
example_profiles = example_profiles.join(donor9_profile)
example_profiles = example_profiles.join(donor10_profile)

Examine the results

In [40]:
example_profiles.head(10)

Unnamed: 0,token1,relevance1,token2,relevance2,token3,relevance3,token4,relevance4,token6,relevance6,token7,relevance7,token8,relevance8,token9,relevance9,token10,relevance10
0,puzzles,0.297099,puzzles,0.297099,flocabulary,0.567676,bulletin,0.477752,little,0.193138,little,0.193138,little,0.193138,little,0.193138,little,0.193138
1,green,0.149853,green,0.149853,hip,0.252473,board,0.306504,dots,0.185108,dots,0.185108,dots,0.185108,dots,0.185108,dots,0.185108
2,yoga,0.146921,yoga,0.146921,hop,0.246232,22k,0.230062,wow,0.180755,wow,0.180755,wow,0.180755,wow,0.180755,wow,0.180755
3,games,0.141635,games,0.141635,personalized,0.214115,ruin,0.213081,alot,0.166482,alot,0.166482,alot,0.166482,alot,0.166482,alot,0.166482
4,kids,0.140523,kids,0.140523,engage,0.17805,median,0.199371,math,0.165857,math,0.165857,math,0.165857,math,0.165857,math,0.165857


## Content-Based Recommender

In [41]:
projects.head(3)

Unnamed: 0,project_id,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,000009891526c0ade7180f8423792063,5aa86a53f658c198fd4e42c541411c76,6d5b22d39e68c656071a842732c63a0c,6.0,Teacher-Led,ohms musician chair cart,the music students in our classes perform freq...,Music & The Arts,Music,Grades 6-8,Other,$529.68,2016-05-13,Fully Funded,2016-08-23
1,016510b8226e70d740130ac2dcfb6c5e,b489371612a5613a68568a97355a7574,c06f15c6dd7ebe89c00426e16d54ff8d,2.0,Teacher-Led,ipad crazy!!,the biggest challenge i face is providing enou...,Math & Science,"Applied Sciences, Environmental Science",Grades 6-8,Technology,"$1,229.32",2016-09-02,Fully Funded,2016-12-21
2,03c8766872a129240d14be8c385b5f1a,21732e18374c452f163298db4a84ac40,57f35f5085ca75a04ac5b21a68827933,7.0,Teacher-Led,kindles for education,my students love to do hands on activities in ...,Math & Science,Applied Sciences,Grades 9-12,Computers & Tablets,$679.98,2017-11-15,Fully Funded,2018-01-25


In [42]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, projects=None):
        self.project_ids = project_ids
        self.projects = projects
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_projects_to_donor_profile(self, donor_id, topn=1000):
        #Computes the cosine similarity between the donor profile and all project profiles
        cosine_similarities = cosine_similarity(donor_profiles[donor_id], tfidf_matrix)
        #Gets the top similar projects
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar projects by similarity
        similar_projects = sorted([(project_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_projects
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10, verbose=False):
        similar_projects = self._get_similar_projects_to_donor_profile(donor_id)
        #Ignores projects the donor has already donated
        similar_projects_filtered = list(filter(lambda x: x[0] not in projects_to_ignore, similar_projects))
        
        recommendations_df = pd.DataFrame(similar_projects_filtered, columns=['project_id', 'recStrength']).head(topn)

        recommendations_df = recommendations_df.merge(self.projects, how = 'left', 
                                                    left_on = 'project_id', 
                                                    right_on = 'project_id')[['recStrength', 'project_id', 'Project Title', 'Project Essay']]


        return recommendations_df

In [43]:
content_based_recommender_model = ContentBasedRecommender(projects)
content_based_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,0.724031,e2c3be0c5473779d91e1a3c9a1cfac5a,fidgeting our way to success,our school is located in what's considered the...
1,0.724031,0000c0bdc0f15bd239cfffa884791a10,code green!,"this year, i'm welcoming 22 new first graders ..."
2,0.302676,f34a1e255fc9525f6c6370bf367b059f,"puzzles, puzzles, puzzles!","as a teacher in a low-income, high-poverty sch..."
3,0.301598,265b369ab0b6b812eaa15c4c2ec23f19,making learning fun!,"my students attend school in los angeles, cali..."
4,0.282747,86bc136dbee9fb9a7686cb6ec786fc71,college bound learners are hands on!,my students come from low income/high poverty ...
5,0.269707,7a9af2599c12896967b2e0b6173861e1,learning is less puzzling with learning puzzles,studies have shown that there is a strong conn...
6,0.263415,c3c08d97210cfa5625c1d6cd77919433,help us learn our sight words,my students love kindergarten. they are always...
7,0.244557,c9519eebcd71c4946e4dcc1380ffc674,math mania!,my students are always eager to learn and very...
8,0.239769,008c3cbef08cc405169bf9c9ead358e1,puzzled pre-k,"my class consists of 22 bright-eyed, energetic..."
9,0.239679,70f59b4356c64870e57415636f78ad2e,games galore for first grade!,my first grade students are inquisitive and lo...


In [44]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...


TypeError: not all arguments converted during string formatting

# Collaborative Filtering model


## Create the donor-project matrix

Matrix Factorization

In [46]:
#Creating a sparse pivot table with donors in rows and projects in columns
donors_projects_pivot_matrix_df = donations_train_df.pivot(index='donor_id', 
                                                          columns='project_id', 
                                                          values='eventStrength').fillna(0)

donors_projects_pivot_matrix_df.head(3)

project_id,000009891526c0ade7180f8423792063,00000ce845c00cbf0686c992fc369df4,00002d44003ed46b066607c5455a999a,00002eb25d60a09c318efbd0797bffb5,00005454366b6b914f9a8290f18f4aed,00006084c3d92d904a22e0a70f5c119a,00008f7aaca8ab932c1bc1d0bc449186,0000bbd74feb563a324fe441eae19feb,0000be4b3c81e1cef858d536bb740052,0000c0bdc0f15bd239cfffa884791a10,...,ffa911b07a68c45fa7f508056f3ef5f1,ffb558e5a555557f2ec4c82a32695610,ffbc2293e6696f1a101c858bc744caa4,ffc7c9769b42f288649a169a39d169d4,ffd2ae03255d6b2b9e4550d593487826,ffe96054e89cce1fbda7d1aaa3cd44be,ffed4e0ab9bc52d207c649dd8beced1a,ffedc8c49d2737c349021af22b0164df,fff3f56002231520103512f3e6b1cec4,fff6e4dcf92365bfd08936a53d8a2986
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5.161127e+16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.257667e+16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.044394,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.282365e+16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
# Transform the donor-project dataframe into a matrix
donors_projects_pivot_matrix = donors_projects_pivot_matrix_df.as_matrix()
donors_projects_pivot_matrix[:3]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [48]:
# Get donor_ids
donors_ids = list(donors_projects_pivot_matrix_df.index)
donors_ids[:10]

[51611273639215696.0,
 52576674669238528.0,
 52823654786098544.0,
 52896318438453456.0,
 53481021534723608.0,
 53605679345801232.0,
 53825276492916632.0,
 54246872853671864.0,
 54523272816550024.0,
 54618988597161768.0]

In [49]:
# Print the first 5 rows of the donor-project matrix
donors_projects_pivot_matrix[:5]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Now we will use SVD to get latent factors. After the factorization, we will try to reconstruct the original matrix by multiplying its factors. The resulting matrix is not sparse any more. It is the generated predictions for projects the donor have not yet donated to, which we will exploit for recommendations.

In [50]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(donors_projects_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

In [51]:
U.shape

(78, 15)

In [52]:
Vt.shape

(15, 7060)

In [53]:
sigma = np.diag(sigma)
sigma.shape

(15, 15)

In [54]:
# Reconstruct the matrix by multiplying its factors
all_donor_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_donor_predicted_ratings

array([[ -1.53185983e-16,   1.19107493e-18,  -1.40358768e-19, ...,
         -1.77661507e-16,  -1.77661507e-16,  -6.87288529e-17],
       [ -1.47000968e-19,   2.56265784e-21,  -3.20328825e-22, ...,
         -3.77284642e-19,  -3.77284642e-19,  -1.45953623e-19],
       [  6.59237774e-16,  -2.81314799e-18,   3.75789799e-19, ...,
          4.56032332e-16,   4.56032332e-16,   1.76417388e-16],
       ..., 
       [  7.42315826e-16,   1.21983721e-08,   1.26279735e-06, ...,
          1.33121595e-03,   1.33121595e-03,   5.14984627e-04],
       [ -1.96754046e-15,  -2.63614824e-18,  -5.27533564e-20, ...,
          9.73964090e-17,   9.73964090e-17,   3.76780742e-17],
       [ -1.94869737e-16,   2.41002307e-18,  -2.79500353e-19, ...,
         -3.39311416e-16,  -3.39311416e-16,  -1.31263574e-16]])

In [55]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_donor_predicted_ratings, 
                           columns = donors_projects_pivot_matrix_df.columns, 
                           index=donors_ids).transpose()
#cf_preds_df.head(10)
## Error: IOPub data rate exceeded.

In [56]:
len(cf_preds_df.columns)

78

## Build the Collaborative Filtering Model

In [57]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, projects=None):
        self.cf_predictions_df = cf_predictions_df
        self.projects = projects
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10):
        # Get and sort the donor's predictions
        sorted_donor_predictions = self.cf_predictions_df[donor_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={donor_id: 'recStrength'})

        # Recommend the highest predicted projects that the donor hasn't donated to
        recommendations_df = sorted_donor_predictions[~sorted_donor_predictions['project_id'].isin(projects_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

 
        recommendations_df = recommendations_df.merge(self.projects, how = 'left', 
                                                          left_on = 'project_id', 
                                                          right_on = 'project_id')[['recStrength', 'project_id', 'Project Title', 'Project Essay']]


        return recommendations_df

In [58]:
cf_recommender_model = CFRecommender(cf_preds_df, projects)

In [59]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...


TypeError: not all arguments converted during string formatting

In [60]:
cf_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.283118e-15,5941f4c48d7eff309fd616d7a8f3852b,the power of art: first and last day with clay,"i am writing this project on january 20, 2013...."
1,1.233796e-15,9a61dd94cd1800dafbb5b9fd30eeebdf,get oganized! chair pockets needed,i was the ell teacher for five years at my sch...
2,1.224936e-15,669db86d5e0dc70e2b435002fb6ce9d3,boogie down to success,my 5th grade students are more fully engaged i...
3,1.216916e-15,835a7ce055bd7e04b9b2de07e1a3bbed,aquaponics: when hydroponics and aquaculture c...,"over the past few years, the development of a ..."
4,1.216301e-15,707ee58288d515b1e82b1042bb8ead1b,magnet pull to environmental literature,my school was just converted to a environmenta...
5,1.187906e-15,07de9e633d7c5b0816df960c25a729a2,watercolor techniques and experimenting.,"""don't you wish we could stay in art all day?..."
6,1.149117e-15,31102da36a270a45d3cea10c59428db9,longing for a library,a typical day in my classroom is spent explori...
7,1.147324e-15,4a49f485e22a42bac3ac83ae162e20fd,garden galore ii,"""sometimes, it's just easier to say yes to tha..."
8,1.139512e-15,33d2f58ff0a5a6bb74208b27924b2849,school-wide poetry performance,poetry is word magic. i want to bring the enc...
9,1.138176e-15,109041f8e8c90c9bb3c9f4e8129153cc,holiday cards for patients,everyone deserves to know that someone is thi...


In [61]:
cf_recommender_model.recommend_projects(donor2)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.283118e-15,5941f4c48d7eff309fd616d7a8f3852b,the power of art: first and last day with clay,"i am writing this project on january 20, 2013...."
1,1.233796e-15,9a61dd94cd1800dafbb5b9fd30eeebdf,get oganized! chair pockets needed,i was the ell teacher for five years at my sch...
2,1.224936e-15,669db86d5e0dc70e2b435002fb6ce9d3,boogie down to success,my 5th grade students are more fully engaged i...
3,1.216916e-15,835a7ce055bd7e04b9b2de07e1a3bbed,aquaponics: when hydroponics and aquaculture c...,"over the past few years, the development of a ..."
4,1.216301e-15,707ee58288d515b1e82b1042bb8ead1b,magnet pull to environmental literature,my school was just converted to a environmenta...
5,1.187906e-15,07de9e633d7c5b0816df960c25a729a2,watercolor techniques and experimenting.,"""don't you wish we could stay in art all day?..."
6,1.149117e-15,31102da36a270a45d3cea10c59428db9,longing for a library,a typical day in my classroom is spent explori...
7,1.147324e-15,4a49f485e22a42bac3ac83ae162e20fd,garden galore ii,"""sometimes, it's just easier to say yes to tha..."
8,1.139512e-15,33d2f58ff0a5a6bb74208b27924b2849,school-wide poetry performance,poetry is word magic. i want to bring the enc...
9,1.138176e-15,109041f8e8c90c9bb3c9f4e8129153cc,holiday cards for patients,everyone deserves to know that someone is thi...


# Hybrid Method

In [62]:
class HybridRecommender:
    
    MODEL_NAME = 'Hybrid'
    
    def __init__(self, cb_rec_model, cf_rec_model, projects_df):
        self.cb_rec_model = cb_rec_model
        self.cf_rec_model = cf_rec_model
        self.projects_df = projects_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10):
        #Getting the top-1000 Content-based filtering recommendations
        cb_recs_df = self.cb_rec_model.recommend_projects(donor_id, projects_to_ignore=projects_to_ignore, 
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
        
        #Getting the top-1000 Collaborative filtering recommendations
        cf_recs_df = self.cf_rec_model.recommend_projects(donor_id, projects_to_ignore=projects_to_ignore,  
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
        
        #Combining the results by project_id
        recs_df = cb_recs_df.merge(cf_recs_df,
                                   how = 'inner', 
                                   left_on = 'project_id', 
                                   right_on = 'project_id')
        
        #Computing a hybrid recommendation score based on CF and CB scores
        recs_df['recStrengthHybrid'] = recs_df['recStrengthCB'] * recs_df['recStrengthCF']
        
        #Sorting recommendations by hybrid score
        recommendations_df = recs_df.sort_values('recStrengthHybrid', ascending=False).head(topn)

        recommendations_df = recommendations_df.merge(self.projects_df, how = 'left', 
                                                    left_on = 'project_id', 
                                                    right_on = 'project_id')[['recStrengthHybrid', 
                                                                              'project_id', 'Project Title', 
                                                                              'Project Essay']]


        return recommendations_df
    
hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_recommender_model, projects)

In [63]:
hybrid_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,2.137515e-16,265b369ab0b6b812eaa15c4c2ec23f19,making learning fun!,"my students attend school in los angeles, cali..."
1,1.831127e-16,c3c08d97210cfa5625c1d6cd77919433,help us learn our sight words,my students love kindergarten. they are always...
2,1.722116e-16,2b8c9c1568df2ffedee49e0845a77c10,creating passionate readers,my fourth grade students struggle significantl...
3,1.72e-16,70f59b4356c64870e57415636f78ad2e,games galore for first grade!,my first grade students are inquisitive and lo...
4,1.636152e-16,aa12a6397cc3da0f2327ce219ff76d7d,an ipad for tech savvy second graders!,imagine a 7 year old who reads below grade lev...
5,1.634996e-16,bee54d525555c03f767afd77d24f6cef,restless in our classroom,"i am privileged to teach a group of energetic,..."
6,1.612091e-16,636b4912ddf19e6f1ad1709867c66a1f,"small groups, whole groups, we love to read gr...",i have a class of twenty-five wonderful studen...
7,1.601675e-16,c80597e9d5d34e8f0322442995f9bc77,marvelous math manipulatives!,this year my students are thriving in an envir...
8,1.580826e-16,f7733b45a6f2f58c2a4830673b21bea9,organizational tools,"at the end of the school year, i want my stude..."
9,1.579042e-16,95e945ccf4d56688c53a6aea83dec3c0,"learning to read, so we can read to learn",my students love to learn. they get so proud ...


In [64]:
hybrid_recommender_model.recommend_projects(donor2)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,2.137515e-16,265b369ab0b6b812eaa15c4c2ec23f19,making learning fun!,"my students attend school in los angeles, cali..."
1,1.831127e-16,c3c08d97210cfa5625c1d6cd77919433,help us learn our sight words,my students love kindergarten. they are always...
2,1.722116e-16,2b8c9c1568df2ffedee49e0845a77c10,creating passionate readers,my fourth grade students struggle significantl...
3,1.72e-16,70f59b4356c64870e57415636f78ad2e,games galore for first grade!,my first grade students are inquisitive and lo...
4,1.636152e-16,aa12a6397cc3da0f2327ce219ff76d7d,an ipad for tech savvy second graders!,imagine a 7 year old who reads below grade lev...
5,1.634996e-16,bee54d525555c03f767afd77d24f6cef,restless in our classroom,"i am privileged to teach a group of energetic,..."
6,1.612091e-16,636b4912ddf19e6f1ad1709867c66a1f,"small groups, whole groups, we love to read gr...",i have a class of twenty-five wonderful studen...
7,1.601675e-16,c80597e9d5d34e8f0322442995f9bc77,marvelous math manipulatives!,this year my students are thriving in an envir...
8,1.580826e-16,f7733b45a6f2f58c2a4830673b21bea9,organizational tools,"at the end of the school year, i want my stude..."
9,1.579042e-16,95e945ccf4d56688c53a6aea83dec3c0,"learning to read, so we can read to learn",my students love to learn. they get so proud ...


In [65]:
print('Evaluating Hybrid model...')
hybrid_global_metrics, hybrid_detailed_results_df = model_evaluator.evaluate_model(hybrid_recommender_model)
print('\nGlobal metrics:\n%s' % hybrid_global_metrics)
hybrid_detailed_results_df.head(10)

Evaluating Hybrid model...


TypeError: not all arguments converted during string formatting

# Comparing Methods

In [66]:
global_metrics_df = pd.DataFrame([pop_global_metrics, cf_global_metrics, cb_global_metrics, hybrid_global_metrics]) \
                        .set_index('modelName')
global_metrics_df

NameError: name 'pop_global_metrics' is not defined

In [67]:
%matplotlib inline
ax = global_metrics_df.transpose().plot(kind='bar', figsize=(15,8))
for p in ax.patches:
    ax.annotate("%.3f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

NameError: name 'global_metrics_df' is not defined

# Testing

In [68]:
def inspect_donations(donor_id, test_set=True):
    if test_set:
        donations_df = donations_test_indexed_df
    else:
        donations_df = donations_train_indexed_df
    return donations_df.loc[donor_id].merge(projects_df, how = 'left', 
                                                      left_on = 'project_id', 
                                                      right_on = 'project_id') \
                          .sort_values('eventStrength', ascending = False)[['eventStrength', 
                                                                          'project_id']]

In [69]:
inspect_donations(donor1, test_set=False).head(20)


AttributeError: 'Series' object has no attribute 'merge'

In [70]:
hybrid_recommender_model.recommend_projects(donor1, topn=20)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,2.137515e-16,265b369ab0b6b812eaa15c4c2ec23f19,making learning fun!,"my students attend school in los angeles, cali..."
1,1.831127e-16,c3c08d97210cfa5625c1d6cd77919433,help us learn our sight words,my students love kindergarten. they are always...
2,1.722116e-16,2b8c9c1568df2ffedee49e0845a77c10,creating passionate readers,my fourth grade students struggle significantl...
3,1.72e-16,70f59b4356c64870e57415636f78ad2e,games galore for first grade!,my first grade students are inquisitive and lo...
4,1.636152e-16,aa12a6397cc3da0f2327ce219ff76d7d,an ipad for tech savvy second graders!,imagine a 7 year old who reads below grade lev...
5,1.634996e-16,bee54d525555c03f767afd77d24f6cef,restless in our classroom,"i am privileged to teach a group of energetic,..."
6,1.612091e-16,636b4912ddf19e6f1ad1709867c66a1f,"small groups, whole groups, we love to read gr...",i have a class of twenty-five wonderful studen...
7,1.601675e-16,c80597e9d5d34e8f0322442995f9bc77,marvelous math manipulatives!,this year my students are thriving in an envir...
8,1.580826e-16,f7733b45a6f2f58c2a4830673b21bea9,organizational tools,"at the end of the school year, i want my stude..."
9,1.579042e-16,95e945ccf4d56688c53a6aea83dec3c0,"learning to read, so we can read to learn",my students love to learn. they get so proud ...
