# DonorsChoose: Donor-Project Matching with Recommender Systems
Data and project idea come from a [Kaggle competition](https://www.kaggle.com/donorschoose/io).
Much of the recommender work is based on a [tutorial](https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/code) by [Gabriel Moreira](https://www.kaggle.com/gspmoreira).

# Donors Choose
Founded in 2000 by a Bronx history teacher, DonorsChoose.org has raised $685 million for America's classrooms. Teachers at three-quarters of all the public schools in the U.S. have come to DonorsChoose.org to request what their students need, making DonorsChoose.org the leading platform for supporting public education.

To date, 3 million people and partners have funded 1.1 million DonorsChoose.org projects. But teachers still spend more than a billion dollars of their own money on classroom materials. To get students what they need to learn, the team at DonorsChoose.org needs to be able to connect donors with the projects that most inspire them.

In the second Kaggle Data Science for Good challenge, DonorsChoose.org, in partnership with Google.org, is inviting the community to help them pair up donors to the classroom requests that will most motivate them to make an additional gift. To support this challenge, DonorsChoose.org has supplied anonymized data on donor giving from the past five years. The winning methods will be implemented in DonorsChoose.org email marketing campaigns.

# Problem Statement
DonorsChoose.org has funded over 1.1 million classroom requests through the support of 3 million donors, the majority of whom were making their first-ever donation to a public school. If DonorsChoose.org can motivate even a fraction of those donors to make another donation, that could have a huge impact on the number of classroom requests fulfilled.

A good solution will enable DonorsChoose.org to build targeted email campaigns recommending specific classroom requests to prior donors. Part of the challenge is to assess the needs of the organization, uncover insights from the data available, and build the right solution for this problem. Submissions will be evaluated on the following criteria:

Performance - How well does the solution match donors to project requests to which they would be motivated to donate? DonorsChoose.org will not be able to live test every submission, so a strong entry will clearly articulate why it will be effective at motivating repeat donations.

Adaptable - The DonorsChoose.org team wants to put the winning submissions to work, quickly. Therefore a good entry will be easy to implement in production.

Intelligible - A good entry should be easily understood by the DonorsChoose.org team should it need to be updated in the future to accommodate a changing marketplace.

# Proposed Solution

I will address the problem by using [Recommender System](https://en.wikipedia.org/wiki/Recommender_system) (RecSys) techniques. The objective of a RecSys is to recommend relevant items for users, based on their preference. Preference and relevance are subjective, and they are generally inferred by items users have consumed previously.

The main RecSys techniques are:  
   - [**Collaborative Filtering**](https://en.wikipedia.org/wiki/Collaborative_filtering): This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person.   
   - [**Content-Based Filtering**](http://recommender-systems.org/content-based-filtering/): This method uses only information about the description and attributes of the items users has previously consumed to model user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present). In particular, various candidate items are compared with items previously rated by the user and the best-matching items are recommended.  
   - **Hybrid methods**:  Recent research has demonstrated that a hybrid approach, combining collaborative filtering and content-based filtering could be more effective than pure approaches in some cases. These methods can also be used to overcome some of the common problems in recommender systems such as cold start and the sparsity problem.

## Load libraries

In [1]:
#enable auto complete
%config IPCompleter.greedy=True
%matplotlib inline

In [2]:
import pandas as pd # package for high-performance, easy-to-use data structures and data analysis
import numpy as np # fundamental package for scientific computing with Python
import matplotlib as cm
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
import scipy
from scipy.sparse.linalg import svds
import math
import random
import sklearn

from numpy import array

from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


#import os
#print(os.listdir("../input"))

from sklearn import preprocessing
# Supress unnecessary warnings so that presentation looks clean
import warnings
warnings.filterwarnings("ignore")

# Print all rows and columns
pd.set_option('display.max_columns', 21)
pd.set_option('display.max_rows', 21)

## Load data

In [3]:
projects = pd.read_csv('../input/Projects.csv')
donations = pd.read_csv('../input/Donations.csv')
donors = pd.read_csv('Donors.csv', low_memory=False)

In [155]:
donors.donor_id.apply(lambda x: int(x, 16))

def str2int(s, chars):
    i = 0
    for c in reversed(s):
        i *= len(chars)
        i += chars.index(c)
    return i


OverflowError: Python int too large to convert to C unsigned long

In [51]:
#print(' donations: ',donations.shape,'\n','donors: ',donors.shape,'\n','schools',schools.shape,'\n','teachers',teachers.shape,'\n','projects',teachers.shape,'\n','resources',resources.shape)

In [52]:
donors.rename(index=str, columns={"Donor ID": "donor_id"},inplace=True)

In [53]:
donations.rename(index=str, columns={"Donor ID": "donor_id", "Project ID": "project_id"},inplace=True)

In [54]:
projects.rename(index=str, columns={"Project ID": "project_id"},inplace=True)

# EDA

In [55]:
donations.head()

Unnamed: 0,project_id,Donation ID,donor_id,Donation Included Optional Donation,Donation Amount,Donor Cart Sequence,Donation Received Date
0,000009891526c0ade7180f8423792063,688729120858666221208529ee3fc18e,1f4b5b6e68445c6c4a0509b3aca93f38,No,178.37,11,2016-08-23 13:15:57
1,000009891526c0ade7180f8423792063,dcf1071da3aa3561f91ac689d1f73dee,4aaab6d244bf3599682239ed5591af8a,Yes,25.0,2,2016-06-06 20:05:23
2,000009891526c0ade7180f8423792063,18a234b9d1e538c431761d521ea7799d,0b0765dc9c759adc48a07688ba25e94e,Yes,20.0,3,2016-06-06 14:08:46
3,000009891526c0ade7180f8423792063,38d2744bf9138b0b57ed581c76c0e2da,377944ad61f72d800b25ec1862aec363,Yes,25.0,1,2016-05-15 10:23:04
4,000009891526c0ade7180f8423792063,5a032791e31167a70206bfb86fb60035,6d5b22d39e68c656071a842732c63a0c,Yes,25.0,2,2016-05-17 01:23:38


In [56]:
donors.head(10)

Unnamed: 0,donor_id,Donor City,Donor State,Donor Is Teacher,Donor Zip
0,00000ce845c00cbf0686c992fc369df4,Evanston,Illinois,No,602.0
1,00002783bc5d108510f3f9666c8b1edd,Appomattox,other,No,245.0
2,00002d44003ed46b066607c5455a999a,Winton,California,Yes,953.0
3,00002eb25d60a09c318efbd0797bffb5,Indianapolis,Indiana,No,462.0
4,0000300773fe015f870914b42528541b,Paterson,New Jersey,No,75.0
5,00004c31ce07c22148ee37acd0f814b9,,other,No,
6,00004e32a448b4832e1b993500bf0731,Stamford,Connecticut,No,69.0
7,00004fa20a986e60a40262ba53d7edf1,Green Bay,Wisconsin,No,543.0
8,00005454366b6b914f9a8290f18f4aed,Argyle,New York,No,128.0
9,0000584b8cdaeaa6b3de82be509db839,Valparaiso,Indiana,No,463.0


In [57]:
#schools.head(3)

In [58]:
#teachers.head(3)

In [59]:
#plt.rcParams["figure.figsize"] = [12,6]
#teachers['Teacher Prefix'].plot(kind = 'bar')
#sns.countplot(x='Teacher Prefix', data=teachers);

In [60]:
projects.head(3)

Unnamed: 0,project_id,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,...,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,77b7d3f2ac4e32d538914e4a8cb8a525,c2d5cb0a29a62e72cdccee939f434181,59f7d2c62f7e76a99d31db6f62b7b67c,2,Teacher-Led,...,Books,$490.38,2013-01-01,Fully Funded,2013-03-12
1,fd928b7f6386366a9cad2bea40df4b25,8acbb544c9215b25c71a0c655200baea,8fbd92394e20d647ddcdc6085ce1604b,1,Teacher-Led,...,Supplies,$420.61,2013-01-01,Expired,
2,7c915e8e1d27f10a94abd689e99c336f,0ae85ea7c7acc41cffa9f81dc61d46df,9140ac16d2e6cee45bd50b0b2ce8cd04,2,Teacher-Led,...,Books,$510.46,2013-01-01,Fully Funded,2013-01-07


In [61]:
#resources.head(3)

Donors who have donated to more than 1 campaign

In [62]:
donations_per_donor = donations.groupby('donor_id')['Donor Cart Sequence'].max()
donations_per_donor1 = round(((donations_per_donor == 1).mean() *100),2)
print("No more than 1 donation is given by: "+ str(donations_per_donor1) +"% donors")
donations_per_donor_more_than_1 = round(((donations_per_donor > 1).mean() *100),2)
print("More than 1 donation is given by: "+ str(donations_per_donor_more_than_1) +"% donors")

No more than 1 donation is given by: 68.93% donors
More than 1 donation is given by: 31.07% donors


In [95]:
donations_per_donor = donations.groupby('donor_id')['Donor Cart Sequence'].max()
donations_per_donor1 = round(((donations_per_donor == 1).mean() *100),3)
donations_per_donor_under_5 = (donations_per_donor < 5).mean() *100
donations_per_donor_under_10 = (donations_per_donor <10).mean() *100
donations_per_donor_under_15 = (donations_per_donor <15).mean() *100
donations_per_donor_under_20 = (donations_per_donor <20).mean() *100
donations_per_donor_under_25 = (donations_per_donor <25).mean() *100
donations_per_donor_under_30 = (donations_per_donor <30).mean() *100
donations_per_donor_over_29 = round(((donations_per_donor > 29).mean() *100),3)

between1_5=round(donations_per_donor_under_5-donations_per_donor1,3)
between5_10=round(donations_per_donor_under_10-donations_per_donor_under_5,3)
between10_15=round(donations_per_donor_under_15-donations_per_donor_under_10,3)
between15_20=round(donations_per_donor_under_20-donations_per_donor_under_15,3)
between20_25=round(donations_per_donor_under_25-donations_per_donor_under_20,3)
between25_30=round(donations_per_donor_under_30-donations_per_donor_under_25,3)

print("Only one time donation is given by: "+ str(donations_per_donor1) +"% donors")
print("2 to 4 donations are given by: "+ str(between1_5) +"% donors")
print("5 to 9 donations are given by: "+ str(between5_10) +"% donors")
print("10 to 14 donations are given by: "+ str(between10_15) +"% donors")
print("15 to 19 donations are given by: "+ str(between15_20) +"% donors")
print("20 to 24 donations are given by: "+ str(between20_25) +"% donors")
print("25 to 29 donations are given by: "+ str(between25_30) +"% donors")
print("29 or more donations are given by: "+ str(donations_per_donor_over_29) +"% donors")

Only one time donation is given by: 68.933% donors
2 to 4 donations are given by: 23.49% donors
5 to 9 donations are given by: 4.889% donors
10 to 14 donations are given by: 1.265% donors
15 to 19 donations are given by: 0.514% donors
20 to 24 donations are given by: 0.269% donors
25 to 29 donations are given by: 0.157% donors
29 or more donations are given by: 0.483% donors


In [96]:
total=donations_per_donor1+between1_5+between5_10+between10_15+between15_20+between20_25+between25_30+donations_per_donor_over_29

print('Percentages added together: '+ str(total)+'%')

Percentages added together: 100.0%


In [97]:
donations_per_donor_under_10 = (donations_per_donor <10).mean() *100
donations_per_donor0 = (donations_per_donor > 0).mean() *100
donations_per_donor_over_9=round((donations_per_donor0-donations_per_donor_under_10),2)
print("10 or more donations are given by "+str(donations_per_donor_over_9)+"% donors")

10 or more donations are given by 2.69% donors


Before modeling, we need to measure the relation strength between a donor and a project. Although most donors only donate once in the dataset, there are donors who donated to the same project multiple times, and users who donated to multiple projects. The donation amount also varies. To better measure this strength, we combine the times and amounts of donations, and create a new dataset containing unique donation relations between a donor, a project, and the relation strength. he number of projects and unique donor-project donation events:

### Set up test mode where only 10000 rows of donation/donor dataframe are used. 
When testing is complete we will need to turn off test mode.

In [66]:
# Set up test mode to save some time
test_mode = True

# Merge datasets
donations = donations.merge(donors, on="donor_id", how="left")
df = donations.merge(projects,on="project_id", how="left")

# only load a few lines in test mode
if test_mode:
    df = df.head(10000)

donations_df = df
print('shape of df is ',df.shape)

shape of df is  (10000, 25)


In [67]:
#df -> donors + donations + projects
print('shape of df is ',df.shape)

shape of df is  (10000, 25)


In [68]:
# Deal with missing values
donations["Donation Amount"] = donations["Donation Amount"].fillna(0)

# Define event strength as the donated amount to a certain project
donations_df['eventStrength'] = donations_df['Donation Amount']

def smooth_donor_preference(x):
    return math.log(1+x, 2)
    
donations_full_df = donations_df \
                    .groupby(['donor_id', 'project_id'])['eventStrength'].sum() \
                    .apply(smooth_donor_preference).reset_index()
        
# Update projects dataset
project_cols = projects.columns
projects = df[project_cols].drop_duplicates()

print('# of projects: %d' % len(projects))
print('# of unique user/project donations: %d' % len(donations_full_df))

# of projects: 1889
# of unique user/project donations: 8648


In [69]:
donations_full_df.head()

Unnamed: 0,donor_id,project_id,eventStrength
0,0003aba06ccf49f8c44fc2dd3b582411,0081553d51ed5d2529e2e38b0827133a,5.672425
1,000f7306e8ddb36296f0d97a34d67d76,007e2a1a47ce50ded4538692d0bf601b,4.70044
2,00125f251b05d9e447a5448bef981028,0055c89fe4b1085db791edeb67ace2e0,4.70044
3,0013dfb2a873420fe6e7d750ef24ce98,004baba788df541cc469c0f4f21493d6,3.087463
4,0016b23800f7ea46424b3254f016007a,004c7c5e1a8cbce0ee63d14574096aeb,5.672425


# Evaluation

Evaluation is important for machine learning projects, because it allows to compare objectivelly different algorithms and hyperparameter choices for models.
One key aspect of evaluation is to ensure that the trained model generalizes for data it was not trained on, using Cross-validation techniques. We are using here a simple cross-validation approach named holdout, in which a random data sample (20% in this case) are kept aside in the training process, and exclusively used for evaluation. All evaluation metrics reported here are computed using the test set.

Ps. A more robust evaluation approach could be to split train and test sets by a reference date, where the train set is composed by all interactions before that date, and the test set are interactions after that date. For the sake of simplicity, we chose the first random approach for this notebook, but you may want to try the second approach to better simulate how the recsys would perform in production predicting "future" users interactions.

In [70]:
donations_train_df, donations_test_df = train_test_split(donations_full_df,
                                   test_size=0.20,
                                   random_state=42)

print('# donations on Train set: %d' % len(donations_train_df))
print('# donations on Test set: %d' % len(donations_test_df))

# donations on Train set: 6918
# donations on Test set: 1730


In [71]:
#Indexing by donor_id to speed up the searches during evaluation
donations_full_indexed_df = donations_full_df.set_index('donor_id')
donations_train_indexed_df = donations_train_df.set_index('donor_id')
donations_test_indexed_df = donations_test_df.set_index('donor_id')

person_id -> 'donor_id'
contentId -> project_id
articles_df -> donations_df
item_id -> 'project_id'
interactions_df -> donations_df
interactions -> donations
items -> projects
interacted -> donated`

In [72]:
#get_projects_donated replaced with get_proj_donated
def get_proj_donated(donor_id, donations_df):
    # Get the user's data and merge in project info
    donated_projects = donations_df.loc[donor_id]['project_id']
    return set(donated_projects if type(donated_projects) == pd.Series else [donated_projects])

In [73]:
df.describe

<bound method NDFrame.describe of                             project_id                       Donation ID  \
0     000009891526c0ade7180f8423792063  688729120858666221208529ee3fc18e   
1     000009891526c0ade7180f8423792063  dcf1071da3aa3561f91ac689d1f73dee   
2     000009891526c0ade7180f8423792063  18a234b9d1e538c431761d521ea7799d   
3     000009891526c0ade7180f8423792063  38d2744bf9138b0b57ed581c76c0e2da   
4     000009891526c0ade7180f8423792063  5a032791e31167a70206bfb86fb60035   
...                                ...                               ...   
9995  00897214102859c52600f4acab28eeeb  fa04ac16e0ac9c4bdde191a557e4506c   
9996  0089759ee8b44b9f3908059f10510d94  be2d0a4c03de4eaeb24d953861a33fb1   
9997  0089759ee8b44b9f3908059f10510d94  668503b774f0aeb89501865ae9d74dc9   
9998  0089759ee8b44b9f3908059f10510d94  aaedccae1ccc3c81b4bf4ea6bea7f05f   
9999  0089759ee8b44b9f3908059f10510d94  9e042c745dae05f2c63753eb0dfbba2c   

                              donor_id Donation Inclu

In [129]:
donations_full_indexed_df.head(5)

Unnamed: 0_level_0,project_id,eventStrength
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0003aba06ccf49f8c44fc2dd3b582411,0081553d51ed5d2529e2e38b0827133a,5.672425
000f7306e8ddb36296f0d97a34d67d76,007e2a1a47ce50ded4538692d0bf601b,4.70044
00125f251b05d9e447a5448bef981028,0055c89fe4b1085db791edeb67ace2e0,4.70044
0013dfb2a873420fe6e7d750ef24ce98,004baba788df541cc469c0f4f21493d6,3.087463
0016b23800f7ea46424b3254f016007a,004c7c5e1a8cbce0ee63d14574096aeb,5.672425


In [130]:
#Top-N accuracy metrics consts
EVAL_RANDOM_SAMPLE_NON_PROJECTS = 100

class ModelEvaluator:


    def get_not_donated_projects_sample(self, donor_id, sample_size, seed=42):
        donated_projects = get_proj_donated(donor_id, donations_full_indexed_df)
        all_projects = set(projects_df['project_id'])
        non_donated_projects = all_projects - donated_projects

        random.seed(seed)
        non_donated_projects_sample = random.sample(non_donated_projects, sample_size)
        return set(non_donated_projects_sample)

    def _verify_hit_top_n(self, project_id, recommended_projects, topn):        
            try:
                index = next(i for i, c in enumerate(recommended_projects) if c == project_id)
            except:
                index = -1
            hit = int(index in range(0, topn))
            return hit, index

    def evaluate_model_for_user(self, model, donor_id):
        #Getting the projects in test set
        donated_values_testset = donations_test_indexed_df.loc[donor_id]
        if type(donated_values_testset['project_id']) == pd.Series:
            person_donated_projects_testset = set(donated_values_testset['project_id'])
        else:
            person_donated_projects_testset = set([int(donated_values_testset['project_id'])])  
        donated_projects_count_testset = len(person_donated_projects_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_projects(donor_id, 
                                               projects_to_ignore=get_proj_donated(donor_id, 
                                                                                    donations_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has donated in test set
        for project_id in person_donated_projects_testset:
            #Getting a random sample (100) projects the user has not donated 
            #(to represent projects that are assumed to be no relevant to the user)
            non_donated_projects_sample = self.get_not_donated_projects_sample(donor_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_PROJECTS, 
                                                                          seed=project_id%(2**32))

            #Combining the current donated item with the 100 random projects
            projects_to_filter_recs = non_donated_projects_sample.union(set(['project_id']))

            #Filtering only recommendations that are either the donated item or from a random sample of 100 non-donated projects
            valid_recs_df = person_recs_df[person_recs_df['project_id'].isin(projects_to_filter_recs)]                    
            valid_recs = valid_recs_df['project_id'].values
            #Verifying if the current donated item is among the Top-N recommended projects
            hit_at_5, index_at_5 = self._verify_hit_top_n(project_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(project_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the donated projects that are ranked among the Top-N recommended projects, 
        #when mixed with a set of non-relevant projects
        recall_at_5 = hits_at_5_count / float(donated_projects_count_testset)
        recall_at_10 = hits_at_10_count / float(donated_projects_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'donated_count': donated_projects_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, donor_id in enumerate(list(donations_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, donor_id)  
            person_metrics['_donor_id'] = donor_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('donated_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['donated_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['donated_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator() 

## Content-Based Filtering model
We will use Content-Based Filtering method to find projects that are similar to the project(s) that a donor has already donated to. We can calculate the similarity between projects based on data and/or text features extracted from the text data.

I used [a tutortial by](https://www.kaggle.com/gunnvant/building-content-recommender-tutorial/notebook) user [gunnvant](https://www.kaggle.com/gunnvant) to construct word vectors with TF-IDF.

In [131]:
# Preprocessing of text data
textfeats = ["Project Title","Project Essay"]
for cols in textfeats:
    projects[cols] = projects[cols].astype(str) 
    projects[cols] = projects[cols].astype(str).fillna('') # FILL NA
    projects[cols] = projects[cols].str.lower() # Lowercase all text, so that capitalized words dont get treated differently
 
text = projects["Project Title"] + ' ' + projects["Project Essay"]
vectorizer = TfidfVectorizer(strip_accents='unicode',
                             analyzer='word',
                             lowercase=True, # Convert all uppercase to lowercase
                             stop_words='english', # Remove commonly found english words ('it', 'a', 'the') which do not typically contain much signal
                             max_df = 0.9, # Only consider words that appear in fewer than max_df percent of all documents
                             # max_features=5000 # Maximum features to be extracted                    
                            )                        
project_ids = projects['project_id'].tolist()
tfidf_matrix = vectorizer.fit_transform(text)
tfidf_feature_names = vectorizer.get_feature_names()
tfidf_matrix

<1889x12490 sparse matrix of type '<class 'numpy.float64'>'
	with 182757 stored elements in Compressed Sparse Row format>

To model the user profile, we take all the item profiles the user has interacted and average them. The average is weighted by the interaction strength, in other words, the articles the user has interacted the most (eg. liked or commented) will have a higher strength in the final user profile.

In [132]:
def get_project_profile(project_id):
    idx = project_ids.index(project_id)
    project_profile = tfidf_matrix[idx:idx+1]
    return project_profile

def get_project_profiles(ids):
    project_profiles_list = [get_project_profile(x) for x in np.ravel([ids])]
    project_profiles = scipy.sparse.vstack(project_profiles_list)
    return project_profiles

def build_donors_profile(donor_id, donations_indexed_df):
    donations_donor_df = donations_indexed_df.loc[donor_id]
    donor_project_profiles = get_project_profiles(donations_donor_df['project_id'])
    donor_project_strengths = np.array(donations_donor_df['eventStrength']).reshape(-1,1)
    #Weighted average of project profiles by the donations strength
    donor_project_strengths_weighted_avg = np.sum(donor_project_profiles.multiply(donor_project_strengths), axis=0) / (np.sum(donor_project_strengths)+1)
    donor_profile_norm = sklearn.preprocessing.normalize(donor_project_strengths_weighted_avg)
    return donor_profile_norm


def build_donors_profiles(): 
    donations_indexed_df = donations_full_df[donations_full_df['project_id'].isin(projects['project_id'])].set_index('donor_id')
    donor_profiles = {}
    for donor_id in donations_indexed_df.index.unique():
        donor_profiles[donor_id] = build_donors_profile(donor_id, donations_indexed_df)
    return donor_profiles

In [133]:
donor_profiles = build_donors_profiles()
print("# of donors with profiles: %d" % len(donor_profiles))

# of donors with profiles: 8015


In [134]:
donations_full_indexed_df.head(10)

Unnamed: 0_level_0,project_id,eventStrength
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0003aba06ccf49f8c44fc2dd3b582411,0081553d51ed5d2529e2e38b0827133a,5.672425
000f7306e8ddb36296f0d97a34d67d76,007e2a1a47ce50ded4538692d0bf601b,4.70044
00125f251b05d9e447a5448bef981028,0055c89fe4b1085db791edeb67ace2e0,4.70044
0013dfb2a873420fe6e7d750ef24ce98,004baba788df541cc469c0f4f21493d6,3.087463
0016b23800f7ea46424b3254f016007a,004c7c5e1a8cbce0ee63d14574096aeb,5.672425
00199e3565635f8a5ebefd3b5985a7f3,006a17f2eff0c3dae79630c295e2a666,7.838069
00309a47b765e12714d817ee3215de1e,006a366c97f485d4f349fad018d95f42,4.075533
0036448e416b71ab040182c428958b6f,000c43686474a41cbd1b04110149160c,4.70044
00393e12bc4f2eefa1a342a83559c2be,006a366c97f485d4f349fad018d95f42,5.672425
0052dd04a7cf2d91db791c94dec448ac,002a3115d0e459d096baa65e9f9e3d6e,4.70044


Get top 5 terms for 10 donors 

In [135]:
donor1 = "0003aba06ccf49f8c44fc2dd3b582411"
donor2 = "0016b23800f7ea46424b3254f016007a"
donor3 = "00125f251b05d9e447a5448bef981028"
donor4 = "0013dfb2a873420fe6e7d750ef24ce98"
donor5 = "0016b23800f7ea46424b3254f016007a"
donor6 = "00199e3565635f8a5ebefd3b5985a7f3"
donor7 = "00309a47b765e12714d817ee3215de1e"
donor8 = "0036448e416b71ab040182c428958b6f"
donor9 = "00393e12bc4f2eefa1a342a83559c2be"
donor10 = "0052dd04a7cf2d91db791c94dec448ac"
donor1_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor1].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token1', 'relevance1'])
donor2_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor2].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token2', 'relevance2'])
donor3_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor3].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token3', 'relevance3'])
donor4_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor4].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token4', 'relevance4'])
donor5_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor5].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token5', 'relevance5'])
donor6_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor6].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token6', 'relevance6'])
donor7_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor7].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token7', 'relevance7'])
donor8_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor8].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token8', 'relevance8'])
donor9_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor9].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token9', 'relevance9'])
donor10_profile = pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor10].flatten().tolist()), 
                        key=lambda x: -x[1])[:5],
                        columns=['token10', 'relevance10'])


Join all the info into one table

In [136]:
example_profiles = donor1_profile.join(donor2_profile)
example_profiles = example_profiles.join(donor3_profile)
example_profiles = example_profiles.join(donor4_profile)
example_profiles = example_profiles.join(donor6_profile)
example_profiles = example_profiles.join(donor7_profile)
example_profiles = example_profiles.join(donor8_profile)
example_profiles = example_profiles.join(donor9_profile)
example_profiles = example_profiles.join(donor10_profile)

Examine the results

In [137]:
example_profiles.head(10)

Unnamed: 0,token1,relevance1,token2,relevance2,token3,relevance3,token4,relevance4,token6,relevance6,token7,relevance7,token8,relevance8,token9,relevance9,token10,relevance10
0,sets,0.313237,pollinators,0.672316,castles,0.322847,computers,0.342839,brockton,0.318679,pinocchio,0.515412,cubbies,0.595062,pinocchio,0.515412,diary,0.440622
1,reading,0.28672,plants,0.306352,ed,0.259267,technology,0.293033,play,0.268543,conversations,0.264213,coat,0.228447,conversations,0.264213,freedom,0.248476
2,zoom,0.283351,module,0.224105,frames,0.253062,computer,0.246495,kits,0.214771,engage,0.217155,belong,0.185914,engage,0.217155,writers,0.224292
3,levels,0.232165,pollination,0.212532,art,0.224827,complete,0.186191,challenged,0.189753,limited,0.208959,guessing,0.160592,limited,0.208959,8th,0.20912
4,books,0.213696,seeds,0.181173,love,0.200784,laptop,0.18204,bonded,0.168017,experiences,0.197402,mittens,0.160592,experiences,0.197402,literature,0.187446


## Content-Based Recommender

In [138]:
class ContentBasedRecommender:
    
    MODEL_NAME = 'Content-Based'
    
    def __init__(self, projects=None):
        self.project_ids = project_ids
        self.projects = projects
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def _get_similar_projects_to_donor_profile(self, donor_id, topn=1000):
        #Computes the cosine similarity between the donor profile and all project profiles
        cosine_similarities = cosine_similarity(donor_profiles[donor_id], tfidf_matrix)
        #Gets the top similar projects
        similar_indices = cosine_similarities.argsort().flatten()[-topn:]
        #Sort the similar projects by similarity
        similar_projects = sorted([(project_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
        return similar_projects
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10, verbose=False):
        similar_projects = self._get_similar_projects_to_donor_profile(donor_id)
        #Ignores projects the donor has already donated
        similar_projects_filtered = list(filter(lambda x: x[0] not in projects_to_ignore, similar_projects))
        
        recommendations_df = pd.DataFrame(similar_projects_filtered, columns=['project_id', 'recStrength']).head(topn)

        recommendations_df = recommendations_df.merge(self.projects, how = 'left', 
                                                    left_on = 'project_id', 
                                                    right_on = 'project_id')[['recStrength', 'project_id', 'Project Title', 'Project Essay']]


        return recommendations_df

In [139]:
content_based_recommender_model = ContentBasedRecommender(projects)
content_based_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,0081553d51ed5d2529e2e38b0827133a,help us zoom up through the reading levels!,i am working hard to advance my first graders'...
1,0.336841,006d4d96be19ec61c2a393377727953b,we want to read!,my 7th and 8th grade students come from povert...
2,0.329714,007cea81560c630edefe71c4d7a862e3,reading about us,i have 27 third grade students who are ready t...
3,0.311516,000f7306e8ddb36296f0d97a34d67d76,learning to read is fun with leveled books!,our school is part of a very diverse district ...
4,0.309028,0016e8d58b28067a2f03e0ad84e8af3a,creating life-long readers,"in our classroom, we thrive to be the best we ..."
5,0.307533,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."
6,0.305458,007cf7ea1a98f1fd5cbe34e8fe2ab813,we are in need of books,my students are amazing students for many reas...
7,0.304062,006961199ddeb18ad3b0999d7f8a73ca,help readers grow by growing their library!,i am a veteran teacher. i have taught in nyc p...
8,0.302696,003bcb350495dc3faca41238632892d4,love for literacy,"as dr. seuss best said, ""the more that you rea..."
9,0.296526,005ea039e7fdbfbd6bd097dd0b64ac1c,guided reading leveled library needed for vora...,"students love shopping for books, but finding ..."


In [140]:
content_based_recommender_model.recommend_projects(donor2)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,004c7c5e1a8cbce0ee63d14574096aeb,power partnerships: plants and pollinators!,"my students are creative, curious, and excited..."
1,0.213049,0016309bd7290ade640f436ad894dab2,let's plant and learn,our school is a title 1 school. 100% of stude...
2,0.190093,004986d49a0b6a0f1b6bbe2e5f42b485,what time is it? it's time to plant,my students are active and eager learners who ...
3,0.189295,004b8c9575a1d1a37df067d8dc016df0,"don't plant it, clone it! the cloning of an a...",being a small rural school we do a lot of trad...
4,0.173815,00022a0f4f0062d861b26fcd96abc68c,pollinating their minds! stem in action,"""science is a way of life...science is the pro..."
5,0.159281,00236e176405ce085a6f7200e148dd7e,help us learn about life science,my second grade students love to come to schoo...
6,0.15538,006ad0535b78bb00ffee54200e747fa5,intriguing reading for intelligent writing,i teach 28 fourth graders in a neighborhood sc...
7,0.147112,006b49a52fdba1ef30d71c075ce0f203,bookworms rule the world,in my classroom we are working hard to become ...
8,0.142242,0062b388efbc3b5e23dcdf6faf6344ef,"read, read to learn!","as a teacher in a diverse, low-income, high-po..."
9,0.137087,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."


In [141]:
content_based_recommender_model.recommend_projects(donor3)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,0055c89fe4b1085db791edeb67ace2e0,kindergarten students empowered through art...,"at the end of the school year, i would love fo..."
1,0.253029,0000d4777d14b33a1406dd6c9019fe89,artistic creativity here we come!,our school is amazing with wonderful artistic ...
2,0.237152,003cf9c97245f7b95171b2fc4dc8a9a4,the colorful art classroom,i have been able to introduce my students to n...
3,0.221125,001206ea335bb1e6b91614e915de941d,"""every child is an artist"" pablo picasso","""creativity is inventing, experimenting, growi..."
4,0.21138,003cfcc0c2ced54a9acfa9478fe33899,artful learning through reading,our third grade is comprised of 22 creative an...
5,0.209826,006e8bb6283132856529410247aea983,little picassos need art supplies,my pre-k students are four and five year of a...
6,0.20962,00584269b48696db32f60172d15e3ecf,little humans art exhibit,"every day, i begin with a lesson focusing on t..."
7,0.198851,004f3f81045ab9c1bc31d5f1e5dd4e13,creating works of art in writing and beyond!,our upper elementary school serves 3rd-5th gra...
8,0.194365,007eb73952edcd90d48cf1e2454462e5,we need color!,"""your attitude is like a box of crayons that c..."
9,0.194199,001cd1a7b01d4630d217128fd6235e60,i can make a 3-d painting!,save our art class! we need your help to conti...


In [142]:
content_based_recommender_model.recommend_projects(donor4)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,004baba788df541cc469c0f4f21493d6,technology enhances content learning,it is imperative that elementary students have...
1,0.3315,006113e4fe3d94b2d218477e07949342,time to be technologically advanced,all of my students are english language learne...
2,0.299312,0005a4aaf3799c37553d1329ddc8fdce,technology for technology magnet,a typical day in the classroom involves studen...
3,0.283818,00589577d61473566a0d72e01ce2d523,technolgy in the classroom,remember your first computer? how exciting it...
4,0.275195,0031737ad4b56c73ac452e601250cfa3,you can help my second graders excel in readin...,"""class, can someone tell me what an encycloped..."
5,0.272246,00710aef02686a61bb26b693a936b1cc,creating 21st century learners,my class consists of 35 fourth graders who ar...
6,0.271018,001181cd7805e4d5d888d95a900a65e8,"listening, learning and loving it!",as a teacher in a low-income/high poverty scho...
7,0.26993,0067eff06d195e1aa561d9de7c5aa4ed,flamingo techno tekkies flip over technology!!!,now more than ever we need to expose students ...
8,0.267556,00840b72210776ac10dd204112f77d58,bringing technology into the classroom,"""technology can bring the real world into the ..."
9,0.266485,0000e4e8ebb8ebacc6374cb2096ab7f4,tech savvy second graders,i teach second grade in an urban setting....


In [143]:
content_based_recommender_model.recommend_projects(donor5)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,004c7c5e1a8cbce0ee63d14574096aeb,power partnerships: plants and pollinators!,"my students are creative, curious, and excited..."
1,0.213049,0016309bd7290ade640f436ad894dab2,let's plant and learn,our school is a title 1 school. 100% of stude...
2,0.190093,004986d49a0b6a0f1b6bbe2e5f42b485,what time is it? it's time to plant,my students are active and eager learners who ...
3,0.189295,004b8c9575a1d1a37df067d8dc016df0,"don't plant it, clone it! the cloning of an a...",being a small rural school we do a lot of trad...
4,0.173815,00022a0f4f0062d861b26fcd96abc68c,pollinating their minds! stem in action,"""science is a way of life...science is the pro..."
5,0.159281,00236e176405ce085a6f7200e148dd7e,help us learn about life science,my second grade students love to come to schoo...
6,0.15538,006ad0535b78bb00ffee54200e747fa5,intriguing reading for intelligent writing,i teach 28 fourth graders in a neighborhood sc...
7,0.147112,006b49a52fdba1ef30d71c075ce0f203,bookworms rule the world,in my classroom we are working hard to become ...
8,0.142242,0062b388efbc3b5e23dcdf6faf6344ef,"read, read to learn!","as a teacher in a diverse, low-income, high-po..."
9,0.137087,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."


In [144]:
content_based_recommender_model.recommend_projects(donor6)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,006a17f2eff0c3dae79630c295e2a666,getting active adventure,a typical day in my classroom is my students b...
1,0.192123,005a09fdb0c48cd470e1d2affb9f0292,moving to learn math and literacy skills,my students are very excited about learning. ...
2,0.191665,006382f8b99f98a9b2c3f007d162f83e,our students play like pros,imagine your 9 year-old self experiencing the ...
3,0.158888,0038ed8f7ee0db9fc6dbb5b0bbb93e68,purposeful play on the playground,our students come from a title i school in jer...
4,0.152216,0015ebccd7a0902cb417339693ad9453,dramatic play center for my kinders!,i teach kindergarten at a charter school and s...
5,0.151949,0034b07d897333d083e0ad64e2581d7a,let the pretend play begin!,my classroom is full of amazing students who w...
6,0.144469,00204bb45ffe8a1a273f6e0e1cbc2606,play doh needed for building hand strength,i teach some of the youngest kindergarten stud...
7,0.143173,004ee732fdf548bcb5943caf76505b4a,logical thinking games,welcome to the discovery classroom where learn...
8,0.13359,00728836eab95708d5d6e011768f2bf9,"first downs to touchdowns, that's how we roll!",our students live in a high poverty and high c...
9,0.130605,0021f3c899bea4fd4afdf802cb484abc,math games rule!,i teach in a low income / high poverty inner-c...


In [145]:
content_based_recommender_model.recommend_projects(donor7)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,006a366c97f485d4f349fad018d95f42,growing with pinocchio,our students come from diverse backgrounds and...
1,0.141485,002c90beca45ceee5dcdfc860f808b65,it's just the bare necessities!,my students are four-year-old preschool studen...
2,0.135485,005daf098e244c8958dceacebdeb68bc,fairy tale origins,our students were learning about different hom...
3,0.130705,007b3ab53712de9badc0a533b82d2f0d,fun through fine motor and sensory experiences!,my preschool classroom is two half-day session...
4,0.124836,007fbc43516cd661644822b13ff148c3,learning socialization in the kindergarten cla...,"abraham lincoln once said ""the best way to pre..."
5,0.12436,0015ebccd7a0902cb417339693ad9453,dramatic play center for my kinders!,i teach kindergarten at a charter school and s...
6,0.122791,006382f8b99f98a9b2c3f007d162f83e,our students play like pros,imagine your 9 year-old self experiencing the ...
7,0.121548,002ce7dd7a7dee02c7fdf583621ef927,toning our perspectives for tolerance,i am a general education fourth grade teacher ...
8,0.120889,0084fa052ee4d1a72e03db26c4fb0538,reading is the key to success!,how can we close the achievement gap when stud...
9,0.117799,008739b9a6f6ae82e478bfbe709a1040,virtual learning in the classroom,i would describe my classroom as a place where...


In [146]:
content_based_recommender_model.recommend_projects(donor8)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,000c43686474a41cbd1b04110149160c,cubbies for change - hats off to no more missi...,"""it was ability that mattered, not disability,..."
1,0.271619,004d63521a5129e542e2456085c9a976,making over our math manipulatives with cubbies,a place for everything and everything in its p...
2,0.21828,001ff199bc52e78c9a85f99143081e46,flexible seating cubbies,"i work at a low-income school in clarkston, wa..."
3,0.138,0053126b5f7b53c89abf3bc3dba6c2d8,second grade seat sacks,this is my first year teaching second grade! i...
4,0.137508,0020f731697b2f9946fb511245c2237e,trays for cubbies,when i look out at my students everyday i see ...
5,0.120256,000b95f12aff8580e1315505914cc52b,improving students attitude to learning,my students are mostly minority students who l...
6,0.111929,0011f7ff0ebb09e07210be73c13163ea,kids with special needs need class tools for l...,my students start every day with the calendar ...
7,0.104302,0069fa1654a8647fae7fa0842b0a7b10,"cuisinart rods, rekenreks, counting bears, oh ...","""mistakes are the portal to discovery. ""i once..."
8,0.104009,0029fa889f61af02aa78bcecbbab7e0e,technology in the life skills world 2015,i will use these items for communication with ...
9,0.100254,002a0b0e44190c3913519bd22d1aeafc,kindergarten leaders in organization and goal-...,i teach in a title 1 school where my students ...


In [147]:
content_based_recommender_model.recommend_projects(donor9)


Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,006a366c97f485d4f349fad018d95f42,growing with pinocchio,our students come from diverse backgrounds and...
1,0.141485,002c90beca45ceee5dcdfc860f808b65,it's just the bare necessities!,my students are four-year-old preschool studen...
2,0.135485,005daf098e244c8958dceacebdeb68bc,fairy tale origins,our students were learning about different hom...
3,0.130705,007b3ab53712de9badc0a533b82d2f0d,fun through fine motor and sensory experiences!,my preschool classroom is two half-day session...
4,0.124836,007fbc43516cd661644822b13ff148c3,learning socialization in the kindergarten cla...,"abraham lincoln once said ""the best way to pre..."
5,0.12436,0015ebccd7a0902cb417339693ad9453,dramatic play center for my kinders!,i teach kindergarten at a charter school and s...
6,0.122791,006382f8b99f98a9b2c3f007d162f83e,our students play like pros,imagine your 9 year-old self experiencing the ...
7,0.121548,002ce7dd7a7dee02c7fdf583621ef927,toning our perspectives for tolerance,i am a general education fourth grade teacher ...
8,0.120889,0084fa052ee4d1a72e03db26c4fb0538,reading is the key to success!,how can we close the achievement gap when stud...
9,0.117799,008739b9a6f6ae82e478bfbe709a1040,virtual learning in the classroom,i would describe my classroom as a place where...


In [148]:
content_based_recommender_model.recommend_projects(donor10)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,1.0,002a3115d0e459d096baa65e9f9e3d6e,freedom writers diary: hook to literature,"ya author sherman alexie has said ""i write to ..."
1,0.175032,006d4d96be19ec61c2a393377727953b,we want to read!,my 7th and 8th grade students come from povert...
2,0.174181,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."
3,0.172251,003b4bc5a9dcaa132d6b5d5ac4fc1f69,mythology for middle schoolers,"my students are hardworking, dedicated, salt o..."
4,0.171606,007cf7ea1a98f1fd5cbe34e8fe2ab813,we are in need of books,my students are amazing students for many reas...
5,0.169766,0087adde3167d7749ea9dcb639e08940,wildcats love literature,we are a rural country town and one of the che...
6,0.15956,0054bd1a84d329a2d2f4e52261b331fd,books we'd love to get our hands on!,"frank serafini once said, “there is no such th..."
7,0.158991,00232e9d509f4052b669dd9ac605cedd,daily 5 make reading alive!,daily 5 is a literacy structure that gives chi...
8,0.15898,0010faf6abb4eb5430c621528233f91d,reading gives us a place to go when we have t...,my busy bees are always reading and trying new...
9,0.158946,0012d94ac914624f70e45fb22206e47e,best books of 2015,there's no such thing as a kid who hates readi...


In [149]:
print('Evaluating Content-Based Filtering model...')
cb_global_metrics, cb_detailed_results_df = model_evaluator.evaluate_model(content_based_recommender_model)
print('\nGlobal metrics:\n%s' % cb_global_metrics)
cb_detailed_results_df.head(10)

Evaluating Content-Based Filtering model...


ValueError: invalid literal for int() with base 10: '0053a266af3840dbf8b033a7c8331cf1'

# Collaborative Filtering model


## Create the donor-project matrix

Matrix Factorization

In [100]:
#Creating a sparse pivot table with donors in rows and projects in columns
donors_projects_pivot_matrix_df = donations_train_df.pivot(index=donor_id, 
                                                          columns='project_id', 
                                                          values='eventStrength').fillna(0)

donors_projects_pivot_matrix_df.head(3)

project_id,000009891526c0ade7180f8423792063,00000ce845c00cbf0686c992fc369df4,00002d44003ed46b066607c5455a999a,00002eb25d60a09c318efbd0797bffb5,0000300773fe015f870914b42528541b,...,0089118ee1816d34fd680c8b12e0d31a,008914744ac8a5c273eff7c29c6cc169,00891f05dd76342b8f287ac50f0d2525,00897214102859c52600f4acab28eeeb,0089759ee8b44b9f3908059f10510d94
donor_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0003aba06ccf49f8c44fc2dd3b582411,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0
000f7306e8ddb36296f0d97a34d67d76,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0
00125f251b05d9e447a5448bef981028,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0


In [101]:
# Transform the donor-project dataframe into a matrix
donors_projects_pivot_matrix = donors_projects_pivot_matrix_df.as_matrix()
donors_projects_pivot_matrix[:3]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [102]:
# Get donor_ids
donors_ids = list(donors_projects_pivot_matrix_df.index)
donors_ids[:10]

['0003aba06ccf49f8c44fc2dd3b582411',
 '000f7306e8ddb36296f0d97a34d67d76',
 '00125f251b05d9e447a5448bef981028',
 '0013dfb2a873420fe6e7d750ef24ce98',
 '0016b23800f7ea46424b3254f016007a',
 '00199e3565635f8a5ebefd3b5985a7f3',
 '00309a47b765e12714d817ee3215de1e',
 '0036448e416b71ab040182c428958b6f',
 '0052dd04a7cf2d91db791c94dec448ac',
 '0056442e9cb3c6b6a2d7fceef36e1c1c']

In [103]:
# Print the first 5 rows of the donor-project matrix
donors_projects_pivot_matrix[:5]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Now we will use SVD to get latent factors. After the factorization, we will try to reconstruct the original matrix by multiplying its factors. The resulting matrix is not sparse any more. It is the generated predictions for projects the donor have not yet donated to, which we will exploit for recommendations.

In [104]:
#The number of factors to factor the user-item matrix.
NUMBER_OF_FACTORS_MF = 15
#Performs matrix factorization of the original user item matrix
U, sigma, Vt = svds(donors_projects_pivot_matrix, k = NUMBER_OF_FACTORS_MF)

In [105]:
U.shape

(6471, 15)

In [106]:
Vt.shape

(15, 1756)

In [107]:
sigma = np.diag(sigma)
sigma.shape

(15, 15)

In [108]:
# Reconstruct the matrix by multiplying its factors
all_donor_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
all_donor_predicted_ratings

array([[  1.39737772e-33,   2.58294461e-36,  -2.80034123e-33, ...,
          5.48172675e-22,  -6.75928123e-34,   2.59931190e-19],
       [  2.77341101e-33,   1.22059731e-35,  -1.05169616e-32, ...,
          2.71459004e-21,  -3.72645233e-34,   1.28719773e-18],
       [  6.85927256e-35,   7.08416850e-38,  -9.56648797e-35, ...,
         -3.00903831e-23,   1.63508398e-35,  -1.42681849e-20],
       ..., 
       [  4.06169420e-34,  -3.82615081e-37,   8.25439117e-35, ...,
         -3.33223449e-23,   1.50289774e-34,  -1.58007087e-20],
       [  1.31141538e-35,  -1.58437973e-36,   1.15780720e-33, ...,
         -4.68073259e-22,   4.27057768e-34,  -2.21949844e-19],
       [  8.84650990e-35,  -2.83520386e-36,   8.92884131e-34, ...,
          1.81324554e-21,   1.26929147e-33,   8.59800375e-19]])

In [109]:
#Converting the reconstructed matrix back to a Pandas dataframe
cf_preds_df = pd.DataFrame(all_donor_predicted_ratings, 
                           columns = donors_projects_pivot_matrix_df.columns, 
                           index=donors_ids).transpose()
#cf_preds_df.head(10)
## Error: IOPub data rate exceeded.

In [110]:
len(cf_preds_df.columns)

6471

## Build the Collaborative Filtering Model

In [111]:
class CFRecommender:
    
    MODEL_NAME = 'Collaborative Filtering'
    
    def __init__(self, cf_predictions_df, projects=None):
        self.cf_predictions_df = cf_predictions_df
        self.projects = projects
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10):
        # Get and sort the donor's predictions
        sorted_donor_predictions = self.cf_predictions_df[donor_id].sort_values(ascending=False) \
                                    .reset_index().rename(columns={donor_id: 'recStrength'})

        # Recommend the highest predicted projects that the donor hasn't donated to
        recommendations_df = sorted_donor_predictions[~sorted_donor_predictions['project_id'].isin(projects_to_ignore)] \
                               .sort_values('recStrength', ascending = False) \
                               .head(topn)

 
        recommendations_df = recommendations_df.merge(self.projects, how = 'left', 
                                                          left_on = 'project_id', 
                                                          right_on = 'project_id')[['recStrength', 'project_id', 'Project Title', 'Project Essay']]


        return recommendations_df

In [112]:
cf_recommender_model = CFRecommender(cf_preds_df, projects)

In [150]:
print('Evaluating Collaborative Filtering (SVD Matrix Factorization) model...')
cf_global_metrics, cf_detailed_results_df = model_evaluator.evaluate_model(cf_recommender_model)
print('\nGlobal metrics:\n%s' % cf_global_metrics)
cf_detailed_results_df.head(10)

Evaluating Collaborative Filtering (SVD Matrix Factorization) model...


ValueError: invalid literal for int() with base 10: '0053a266af3840dbf8b033a7c8331cf1'

In [114]:
cf_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,6.319834e-17,006256a8e6df10612ff859688b67ed61,all hands on deck to see the siege of boston!,taunton public schools is a low income/hi...
1,4.3786570000000006e-17,0078e2e24ae7e0de0817a5a17bd2f48c,little people - big minds,my students are african american and hispanic....
2,2.512098e-17,006e4d8a02dde4626662fcfdb5a4bc41,1st full-year h.s. astrochemistry class in nat...,emerson was once asked what we would do if the...
3,2.127461e-17,0031eb3832f55a565b91b467ccb961d4,"may i have this dance, please?",the tango music begins. the students look at o...
4,1.830023e-17,004a152bbe8952ea5e9d5ef89c179933,claymation experimentation,after seeing my students sewing a jabba the hu...
5,1.6144930000000002e-17,0029e426fd3296af4fc333580fa895fe,"everyone needs an address, especially maniac m...","""...people will forget what you said, people w..."
6,1.4917870000000002e-17,0066cfb9ded063e2078cf3973e2fa6aa,music for oyler,"my students do not have money, but they do hav..."
7,1.2811980000000001e-17,0012d94ac914624f70e45fb22206e47e,best books of 2015,there's no such thing as a kid who hates readi...
8,1.1630860000000001e-17,008847ff394b52dc61013213ac34ed44,help us document our artistic growth!,the art room is a very busy place! it is full...
9,1.084886e-17,0015703508d8a6703bc0d7f71027fdb4,lets all become aware of different cultures,help my students make a difference by acknowle...


In [115]:
cf_recommender_model.recommend_projects(donor2)

Unnamed: 0,recStrength,project_id,Project Title,Project Essay
0,7.475687e-17,0078e2e24ae7e0de0817a5a17bd2f48c,little people - big minds,my students are african american and hispanic....
1,4.5593770000000004e-17,0066cfb9ded063e2078cf3973e2fa6aa,music for oyler,"my students do not have money, but they do hav..."
2,3.557929e-17,00177290279939fb33386b29198c450e,reader's workshop = instilling a love of readi...,we are a brand new charter school that has onl...
3,3.2766000000000006e-17,006256a8e6df10612ff859688b67ed61,all hands on deck to see the siege of boston!,taunton public schools is a low income/hi...
4,2.0831820000000003e-17,0031eb3832f55a565b91b467ccb961d4,"may i have this dance, please?",the tango music begins. the students look at o...
5,1.791936e-17,004a152bbe8952ea5e9d5ef89c179933,claymation experimentation,after seeing my students sewing a jabba the hu...
6,1.580891e-17,0029e426fd3296af4fc333580fa895fe,"everyone needs an address, especially maniac m...","""...people will forget what you said, people w..."
7,1.414506e-17,00388e34c088cfc273a32642b67c7e60,chromebooks for keyboarding,our students are some of the hardest working k...
8,1.3330620000000002e-17,0066d5cb75c16a32bb569097798d747e,urban middle school band program needs new ins...,in my school 50% of the students are socioecon...
9,1.3057860000000001e-17,00882cd1730cf05b346c91604d609f15,scooter boards for safe learning of new motor ...,as i walk through the halls of our new public ...


# Hybrid Method

In [116]:
class HybridRecommender:
    
    MODEL_NAME = 'Hybrid'
    
    def __init__(self, cb_rec_model, cf_rec_model, projects_df):
        self.cb_rec_model = cb_rec_model
        self.cf_rec_model = cf_rec_model
        self.projects_df = projects_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_projects(self, donor_id, projects_to_ignore=[], topn=10):
        #Getting the top-1000 Content-based filtering recommendations
        cb_recs_df = self.cb_rec_model.recommend_projects(donor_id, projects_to_ignore=projects_to_ignore, 
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCB'})
        
        #Getting the top-1000 Collaborative filtering recommendations
        cf_recs_df = self.cf_rec_model.recommend_projects(donor_id, projects_to_ignore=projects_to_ignore,  
                                                           topn=1000).rename(columns={'recStrength': 'recStrengthCF'})
        
        #Combining the results by project_id
        recs_df = cb_recs_df.merge(cf_recs_df,
                                   how = 'inner', 
                                   left_on = 'project_id', 
                                   right_on = 'project_id')
        
        #Computing a hybrid recommendation score based on CF and CB scores
        recs_df['recStrengthHybrid'] = recs_df['recStrengthCB'] * recs_df['recStrengthCF']
        
        #Sorting recommendations by hybrid score
        recommendations_df = recs_df.sort_values('recStrengthHybrid', ascending=False).head(topn)

        recommendations_df = recommendations_df.merge(self.projects_df, how = 'left', 
                                                    left_on = 'project_id', 
                                                    right_on = 'project_id')[['recStrengthHybrid', 
                                                                              'project_id', 'Project Title', 
                                                                              'Project Essay']]


        return recommendations_df
    
hybrid_recommender_model = HybridRecommender(content_based_recommender_model, cf_recommender_model, projects)

In [117]:
hybrid_recommender_model.recommend_projects(donor1)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,3.220508e-18,0078e2e24ae7e0de0817a5a17bd2f48c,little people - big minds,my students are african american and hispanic....
1,2.892621e-18,0012d94ac914624f70e45fb22206e47e,best books of 2015,there's no such thing as a kid who hates readi...
2,1.231851e-18,004a152bbe8952ea5e9d5ef89c179933,claymation experimentation,after seeing my students sewing a jabba the hu...
3,9.062529e-19,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."
4,8.697161999999998e-19,003ed2285483b759f508f29076143447,language arts resources,"""the more that you read, the more things you w..."
5,6.418145999999999e-19,00173eb8e417bbe9fecc3da05893878c,a calming classroom carpet,"""sometimes the questions are complicated and t..."
6,4.0534529999999995e-19,0029c10b4286065811a493b54f85c97e,writing our way through kindergarten!!,my students are motivated and eager to learn. ...
7,3.6730359999999997e-19,003c964592d53d6089d0d8b0d3ee4c0a,to read or not to read to meet ccss? we want t...,students are in a rut with a lack of reading m...
8,1.8376629999999999e-19,0072118edd7e0c5e84d23be2424bebd9,our learning is on fire!,we are an urban school with a large shelter (h...
9,1.8163939999999998e-19,0002d777804d485adbcca7aad4ad96c5,document cameras for student centered learning!,has someone ever tried to explain a concept to...


In [118]:
hybrid_recommender_model.recommend_projects(donor2)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,5.052022e-18,0078e2e24ae7e0de0817a5a17bd2f48c,little people - big minds,my students are african american and hispanic....
1,4.1583490000000004e-18,00177290279939fb33386b29198c450e,reader's workshop = instilling a love of readi...,we are a brand new charter school that has onl...
2,1.369002e-18,0012d94ac914624f70e45fb22206e47e,best books of 2015,there's no such thing as a kid who hates readi...
3,6.394845999999999e-19,0015703508d8a6703bc0d7f71027fdb4,lets all become aware of different cultures,help my students make a difference by acknowle...
4,5.558455e-19,0029c10b4286065811a493b54f85c97e,writing our way through kindergarten!!,my students are motivated and eager to learn. ...
5,4.733063e-19,003ed2285483b759f508f29076143447,language arts resources,"""the more that you read, the more things you w..."
6,4.0042119999999997e-19,004a094179fcb1dabc82c73d468c5ee7,astronomical astronomy,my classroom is a melting pot in a suburb of n...
7,3.9556529999999997e-19,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."
8,2.1034429999999998e-19,0052af01fcc5aa908b6e2aa537fb6462,binders for writing success,writing is an important part of my student's e...
9,1.8050319999999999e-19,003c964592d53d6089d0d8b0d3ee4c0a,to read or not to read to meet ccss? we want t...,students are in a rut with a lack of reading m...


In [151]:
print('Evaluating Hybrid model...')
hybrid_global_metrics, hybrid_detailed_results_df = model_evaluator.evaluate_model(hybrid_recommender_model)
print('\nGlobal metrics:\n%s' % hybrid_global_metrics)
hybrid_detailed_results_df.head(10)

Evaluating Hybrid model...


ValueError: invalid literal for int() with base 10: '0053a266af3840dbf8b033a7c8331cf1'

# Comparing Methods

In [152]:
global_metrics_df = pd.DataFrame([pop_global_metrics, cf_global_metrics, cb_global_metrics, hybrid_global_metrics]) \
                        .set_index('modelName')
global_metrics_df

NameError: name 'pop_global_metrics' is not defined

In [153]:
%matplotlib inline
ax = global_metrics_df.transpose().plot(kind='bar', figsize=(15,8))
for p in ax.patches:
    ax.annotate("%.3f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

NameError: name 'global_metrics_df' is not defined

# Testing

In [122]:
def inspect_donations(donor_id, test_set=True):
    if test_set:
        donations_df = donations_test_indexed_df
    else:
        donations_df = donations_train_indexed_df
    return donations_df.loc[donor_id].merge(projects_df, how = 'left', 
                                                      left_on = 'project_id', 
                                                      right_on = 'project_id') \
                          .sort_values('eventStrength', ascending = False)[['eventStrength', 
                                                                          'project_id']]

In [154]:
inspect_donations(donor1, test_set=False).head(20)


AttributeError: 'Series' object has no attribute 'merge'

In [128]:
hybrid_recommender_model.recommend_projects(donor1, topn=20)

Unnamed: 0,recStrengthHybrid,project_id,Project Title,Project Essay
0,3.220508e-18,0078e2e24ae7e0de0817a5a17bd2f48c,little people - big minds,my students are african american and hispanic....
1,2.892621e-18,0012d94ac914624f70e45fb22206e47e,best books of 2015,there's no such thing as a kid who hates readi...
2,1.231851e-18,004a152bbe8952ea5e9d5ef89c179933,claymation experimentation,after seeing my students sewing a jabba the hu...
3,9.062529e-19,0012f7359b9705f46355a1c2b8ecbc1d,leveled books to help us read!,"have you ever been told you need to read, but ..."
4,8.697161999999998e-19,003ed2285483b759f508f29076143447,language arts resources,"""the more that you read, the more things you w..."
5,6.418145999999999e-19,00173eb8e417bbe9fecc3da05893878c,a calming classroom carpet,"""sometimes the questions are complicated and t..."
6,4.0534529999999995e-19,0029c10b4286065811a493b54f85c97e,writing our way through kindergarten!!,my students are motivated and eager to learn. ...
7,3.6730359999999997e-19,003c964592d53d6089d0d8b0d3ee4c0a,to read or not to read to meet ccss? we want t...,students are in a rut with a lack of reading m...
8,1.8376629999999999e-19,0072118edd7e0c5e84d23be2424bebd9,our learning is on fire!,we are an urban school with a large shelter (h...
9,1.8163939999999998e-19,0002d777804d485adbcca7aad4ad96c5,document cameras for student centered learning!,has someone ever tried to explain a concept to...
