# Donors Choose 
## Problem Statement
DonorsChoose is a platform that connects teachers in public schools to donors who want to support public education. 

This Kaggle-hosted competition asks participants to improve the existing recommender system by predicting which projects a donor is likely to support, so that a recommendation might yield an additional donation. 

More information can be found at [https://www.kaggle.com/donorschoose/io](https://www.kaggle.com/donorschoose/io).

## Intro to Recommenders
In my exploration notebook, I explored the characteristics that could be used to build a recommender system. Using those characteristics, we can build a content-based recommender system. A simple content-based recommender is built on feature extraction: each feature becomes one element of a vector that serves as the item's profile, and the system recommends items whose profiles are similar. Similarity can be computed in many ways; I will use mean-centered cosine similarity, which is equivalent to the Pearson correlation. The similarity scores then identify the most similar items, and the values of those items are averaged.  
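
To illustrate the claim that mean-centered cosine similarity matches the Pearson correlation, here is a minimal sketch on two made-up item vectors. The vectors and the helper names `cosine` and `centered_cosine` are assumptions for illustration only and are not part of the pipeline further down.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def centered_cosine(u, v):
    """Cosine similarity after subtracting each vector's own mean."""
    return cosine(u - u.mean(), v - v.mean())

# Hypothetical feature vectors for two items (made-up values).
item_a = np.array([1.0, 2.0, 3.0, 4.0])
item_b = np.array([2.0, 1.0, 4.0, 3.0])

pearson = float(np.corrcoef(item_a, item_b)[0, 1])
print(centered_cosine(item_a, item_b), pearson)  # the two values should agree
```

Without the centering step, the plain cosine of these two vectors is higher, because all-positive vectors always point in broadly the same direction.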

Other recommender systems that will be touched on, but may not be used in this notebook, are collaborative filtering recommenders and hybrid recommenders. At a simple level, collaborative filtering predicts ratings from the behavior of similar users or similar items. For example, user-user collaborative filtering predicts a user's rating of an item from the ratings that similar users gave it. Item-item collaborative filtering predicts a user's rating of an item from that user's ratings of similar items. Hybrid methods combine other recommending systems with either content-based or collaborative filtering systems. 
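
As a rough sketch of user-user collaborative filtering, the snippet below predicts a missing rating as a similarity-weighted average of other users' ratings. The ratings matrix is entirely hypothetical (the DonorsChoose data records donations, not ratings), and `predict_user_user` is an illustrative name rather than a library function.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0],   # target user, has not rated item 2
    [5.0, 4.0, 5.0],   # similar taste, liked item 2
    [1.0, 2.0, 1.0],   # different taste, disliked item 2
])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_user_user(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    raters = [u for u in range(len(ratings)) if u != user and ratings[u, item] > 0]
    # compare users only on the items other than the one being predicted
    others = [i for i in range(ratings.shape[1]) if i != item]
    sims = np.array([cosine(ratings[user, others], ratings[u, others]) for u in raters])
    vals = np.array([ratings[u, item] for u in raters])
    return float(np.dot(sims, vals) / sims.sum())

print(predict_user_user(ratings, user=0, item=2))
```

Plain cosine on non-negative ratings makes every pair of users look fairly similar, which is one reason mean-centering is often applied first.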

Other components of a hybrid recommender could be a naive recommender, such as one that provides the most popular item, a random item, or a sequential item, or a more complex model, such as a combination of a content-based (CB) and a collaborative filtering (CF) model. The main purpose of the hybrid model is to offset the downsides of CB or CF models and to complement them. The downsides may include data sparsity, cold starts for new users, unique niches, and others. 
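
One simple way to picture a hybrid is a weighted blend of CB and CF scores with a popularity fallback for cold starts. The sketch below is only an assumption about how such a blend could look; `hybrid_score` and the mixing weight `alpha` are hypothetical and untuned.

```python
def hybrid_score(cb_score, cf_score, popularity, n_donations, alpha=0.5):
    """Blend content-based and collaborative scores; fall back to popularity.

    alpha is a hypothetical mixing weight, not a tuned value.
    """
    if n_donations == 0:
        # cold start: with no history, neither CB nor CF has anything to go on
        return popularity
    return alpha * cb_score + (1 - alpha) * cf_score

print(hybrid_score(0.8, 0.4, popularity=0.3, n_donations=3))  # blended score
print(hybrid_score(0.0, 0.0, popularity=0.3, n_donations=0))  # popularity fallback
```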

## My Approach
I will use a content-based recommender built on TF-IDF (term frequency-inverse document frequency). This approach captures how often terms are used in projects and indicates what a person may be interested in. I chose this model because I assume project descriptions use a typical, widely shared vocabulary: the purpose of the website is to connect donors to projects, and based on the text samples I viewed on the website, I expect the vocabulary to be simple and to the point rather than esoteric. 
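
To make the TF-IDF idea concrete, here is a small sketch on three made-up essays that stand in for the `Project Essay` column. It uses default vectorizer settings apart from `stop_words`, so it is not the configuration used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up essays standing in for the 'Project Essay' column.
essays = [
    "My students need chromebooks to practice reading online.",
    "We need chromebooks and tablets for our computer lab.",
    "Our class needs soccer balls and cones for recess.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(essays)           # sparse matrix, one row per essay

sims = cosine_similarity(tfidf[0], tfidf).ravel()  # essay 0 against every essay
ranked = sims.argsort()[::-1]                      # most similar first
print(ranked, sims.round(3))
```

Essay 0 should score highest against itself, share some similarity with the other technology essay, and have no overlap with the sports essay.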





In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import sklearn.preprocessing  # binds `sklearn` and loads the preprocessing submodule used for normalize()
import math
import random

import os
print(os.listdir("../input"))

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

['Donations.csv', 'Donors.csv', 'Projects.csv', 'Resources.csv', 'Schools.csv', 'Teachers.csv']


In [31]:
donations = pd.read_csv('../input/Donations.csv')
donations.columns

Index(['Project ID', 'Donation ID', 'Donor ID',
       'Donation Included Optional Donation', 'Donation Amount',
       'Donor Cart Sequence', 'Donation Received Date'],
      dtype='object')

In [32]:
# Keep only the ID columns: the minimal dataframe needed for joins
id_df = donations.loc[:,['Project ID', 'Donation ID', 'Donor ID']]

In [33]:
id_df.head()

,Project ID,Donation ID,Donor ID
0,000009891526c0ade7180f8423792063,688729120858666221208529ee3fc18e,1f4b5b6e68445c6c4a0509b3aca93f38
1,000009891526c0ade7180f8423792063,dcf1071da3aa3561f91ac689d1f73dee,4aaab6d244bf3599682239ed5591af8a
2,000009891526c0ade7180f8423792063,18a234b9d1e538c431761d521ea7799d,0b0765dc9c759adc48a07688ba25e94e
3,000009891526c0ade7180f8423792063,38d2744bf9138b0b57ed581c76c0e2da,377944ad61f72d800b25ec1862aec363
4,000009891526c0ade7180f8423792063,5a032791e31167a70206bfb86fb60035,6d5b22d39e68c656071a842732c63a0c


### Build User Profile
The user profile will be the average of the TF-IDF vectors of the projects a donor has supported. I would like to refine this with a weighted average that gives more weight to projects the donor gave more to or donated to more frequently. 

    id_df.groupby(['Donor ID', 'Project ID']).count()

Plan: get the set of donor IDs, then for each donor ID select that donor's rows:
```
id_df[id_df['Donor ID'] == donor_id]    
```
This is the subset of id_df for that donor ID; each row has a unique donation ID and the project ID.

Left join that with the project IDs behind the TF-IDF matrix, then use iloc to get the columns with the vectorized information. 

Create a new dataframe indexed by the set of donor IDs, with the TF-IDF values as the features. 
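
The plan above can be sketched on toy data as follows. Everything here is made up for illustration: the IDs, the dense `tfidf` array (the real matrix is sparse), and the weights, which stand in for donation size or frequency.

```python
import numpy as np

# Toy stand-ins: 3 projects x 4 TF-IDF features (made-up numbers).
project_ids = ["p1", "p2", "p3"]
tfidf = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# (donor, project, weight) triples; the weight stands in for donation size or count.
donations = [("d1", "p1", 0.75), ("d1", "p2", 0.25), ("d2", "p3", 1.0)]

def donor_profile(donor_id):
    """Weighted average of the donor's project vectors, scaled to unit length."""
    mine = [(project_ids.index(p), w) for d, p, w in donations if d == donor_id]
    rows = [r for r, _ in mine]
    weights = [w for _, w in mine]
    profile = np.average(tfidf[rows], axis=0, weights=weights)
    return profile / np.linalg.norm(profile)

profiles = {d: donor_profile(d) for d in {d for d, _, _ in donations}}
print(profiles["d1"])
```

Donor `d1` ends up with a profile dominated by the first feature, because that donor's larger weight went to project `p1`.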
   

In [34]:
# multi-index matrix with the count of donations to a project as eventStrength
sample = id_df.groupby(['Donor ID', 'Project ID']).count().head(20)

In [35]:
sample

Donor ID,Project ID,Donation ID
00000ce845c00cbf0686c992fc369df4,5bab6101eed588c396a59f6bd64274b6,1
00002783bc5d108510f3f9666c8b1edd,9db61b1b1e43a7b256eec9b20b42d854,1
00002d44003ed46b066607c5455a999a,2f53e5f31890e647048ac217cda3b83f,2
00002d44003ed46b066607c5455a999a,2f7996f08052785e9b146f72c0c4990d,1
00002d44003ed46b066607c5455a999a,64f54f1efcbeb986114a7a13e6b27257,1
00002d44003ed46b066607c5455a999a,75131d2e94930082aa8ed1e4cd4d21da,1
00002d44003ed46b066607c5455a999a,c5821d32012efd7df4f6fa12e230e991,1
00002d44003ed46b066607c5455a999a,dfdaf35bb33f9c105530c82984960ff3,1
00002d44003ed46b066607c5455a999a,e09933470f4256cc2643341c1d299e55,2
00002d44003ed46b066607c5455a999a,e2beb818569f66adaa4ced21ca299ac6,1


In [36]:
# grab donor id's donations
sample.loc['00002d44003ed46b066607c5455a999a']

Project ID,Donation ID
2f53e5f31890e647048ac217cda3b83f,2
2f7996f08052785e9b146f72c0c4990d,1
64f54f1efcbeb986114a7a13e6b27257,1
75131d2e94930082aa8ed1e4cd4d21da,1
c5821d32012efd7df4f6fa12e230e991,1
dfdaf35bb33f9c105530c82984960ff3,1
e09933470f4256cc2643341c1d299e55,2
e2beb818569f66adaa4ced21ca299ac6,1
eb6d91cbeab5037ca2f45fc3f6a4de8c,1


In [37]:
# grab the count of occurrences
sample.loc['00002d44003ed46b066607c5455a999a', '2f53e5f31890e647048ac217cda3b83f']

Donation ID    2
Name: (00002d44003ed46b066607c5455a999a, 2f53e5f31890e647048ac217cda3b83f), dtype: int64

In [39]:
sample.iloc[1:5]

Donor ID,Project ID,Donation ID
00002783bc5d108510f3f9666c8b1edd,9db61b1b1e43a7b256eec9b20b42d854,1
00002d44003ed46b066607c5455a999a,2f53e5f31890e647048ac217cda3b83f,2
00002d44003ed46b066607c5455a999a,2f7996f08052785e9b146f72c0c4990d,1
00002d44003ed46b066607c5455a999a,64f54f1efcbeb986114a7a13e6b27257,1


# Reading in Projects

In [40]:
projects = pd.read_csv('../input/Projects.csv', nrows=500)


# Vectorizing TFIDF

In [41]:
# # Do you need to split?
# features_train, features_test, labels_train, labels_test = train_test_split(word_data, 
#                                                                                authors, 
#                                                                                test_size=0.2, 
#                                                                                random_state=42)

In [42]:
vectorizer = TfidfVectorizer(analyzer='word', 
                                 ngram_range=(1,2), 
                                 sublinear_tf=True, 
                                 max_df=0.5,
                                 lowercase=True,
                                 max_features=1000, # keep only the 1000 most frequent terms across the corpus
                                 stop_words='english')

tfidf_matrix = vectorizer.fit_transform(projects['Project Essay'])
# features_test_transformed  = vectorizer.transform(features_test)
tfidf_feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tfidf_matrix

<500x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 35194 stored elements in Compressed Sparse Row format>

In [43]:
tfidf_matrix[0]

<1x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 88 stored elements in Compressed Sparse Row format>

# Count donation by Donor ID

In [44]:
# This is the number of donations made by each donor; it can be used to weight the donations.
id_df.groupby('Donor ID').count().head(10)

Donor ID,Project ID,Donation ID
00000ce845c00cbf0686c992fc369df4,1,1
00002783bc5d108510f3f9666c8b1edd,1,1
00002d44003ed46b066607c5455a999a,11,11
00002eb25d60a09c318efbd0797bffb5,5,5
0000300773fe015f870914b42528541b,1,1
00004c31ce07c22148ee37acd0f814b9,1,1
00004e32a448b4832e1b993500bf0731,1,1
00004fa20a986e60a40262ba53d7edf1,1,1
00005454366b6b914f9a8290f18f4aed,1,1
0000584b8cdaeaa6b3de82be509db839,2,2


In [45]:
apple = [1, 2, 3, 4, 5, 6]

In [46]:
apple.index(4)

3

Now I need to line up rows of the TF-IDF matrix with each donor.
I think this function only needs to be used once, so it may be easier to stick with what I know (`list.index`, as tested above). 
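
As an alternative to repeated `list.index` calls, a dictionary built once can map each project ID to its row in the TF-IDF matrix. This is a sketch under assumptions: the IDs are shortened, made-up values, and `row_of` and `rows_for` are hypothetical names.

```python
project_ids = ["7685f0", "f9f4af", "afd99a"]   # shortened, made-up IDs

# Build the lookup once: project ID -> row number in the TF-IDF matrix.
row_of = {pid: row for row, pid in enumerate(project_ids)}

def rows_for(ids):
    """Row numbers for known project IDs; unknown IDs are skipped."""
    return [row_of[pid] for pid in ids if pid in row_of]

print(rows_for(["afd99a", "not-a-project", "7685f0"]))
```

Unlike `list.index`, which scans the list on every call and raises `ValueError` for a missing ID, the dictionary lookup is constant time and lets missing IDs be skipped explicitly.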

In [47]:
# Total donated per donor, broadcast back onto each donation row;
# dividing by it (next cell) gives the event strength.
donations['Donation Sum'] = donations.groupby('Donor ID')['Donation Amount'].transform('sum')

In [48]:
donations['eventStrength'] = donations['Donation Amount'] / donations['Donation Sum']
donations.head()

,Project ID,Donation ID,Donor ID,Donation Included Optional Donation,Donation Amount,Donor Cart Sequence,Donation Received Date,Donation Sum,eventStrength
0,000009891526c0ade7180f8423792063,688729120858666221208529ee3fc18e,1f4b5b6e68445c6c4a0509b3aca93f38,No,178.37,11,2016-08-23 13:15:57,139767.73,0.001276
1,000009891526c0ade7180f8423792063,dcf1071da3aa3561f91ac689d1f73dee,4aaab6d244bf3599682239ed5591af8a,Yes,25.0,2,2016-06-06 20:05:23,25.0,1.0
2,000009891526c0ade7180f8423792063,18a234b9d1e538c431761d521ea7799d,0b0765dc9c759adc48a07688ba25e94e,Yes,20.0,3,2016-06-06 14:08:46,60.0,0.333333
3,000009891526c0ade7180f8423792063,38d2744bf9138b0b57ed581c76c0e2da,377944ad61f72d800b25ec1862aec363,Yes,25.0,1,2016-05-15 10:23:04,25.0,1.0
4,000009891526c0ade7180f8423792063,5a032791e31167a70206bfb86fb60035,6d5b22d39e68c656071a842732c63a0c,Yes,25.0,2,2016-05-17 01:23:38,195.0,0.128205


In [49]:
# Create copy with donor ID, Project ID, and eventStrength
don_df = donations[['Donor ID', 'Project ID', 'eventStrength']].copy()

In [50]:
don_df.head()

,Donor ID,Project ID,eventStrength
0,1f4b5b6e68445c6c4a0509b3aca93f38,000009891526c0ade7180f8423792063,0.001276
1,4aaab6d244bf3599682239ed5591af8a,000009891526c0ade7180f8423792063,1.0
2,0b0765dc9c759adc48a07688ba25e94e,000009891526c0ade7180f8423792063,0.333333
3,377944ad61f72d800b25ec1862aec363,000009891526c0ade7180f8423792063,1.0
4,6d5b22d39e68c656071a842732c63a0c,000009891526c0ade7180f8423792063,0.128205


In [51]:
project_ids = list(projects['Project ID'])
project_ids[0:10]

['7685f0265a19d7b52a470ee4bac883ba',
 'f9f4af7099061fb4bf44642a03e5c331',
 'afd99a01739ad5557b51b1ba0174e832',
 'c614a38bb1a5e68e2ae6ad9d94bb2492',
 'ec82a697fab916c0db0cdad746338df9',
 '563958074d7b12b48b939279eb59e6ca',
 '717c7a01215d532d68f6fe9e666c88c3',
 '4202c4e251fe483dfd93520da022f987',
 '49825532f85d0cdb569797df3ab8ec46',
 '60dddb9495e5ed60c1f6c1b86fe9a7e4']

In [52]:
don_df = don_df.iloc[:100000]

In [53]:
# Adapted from [1], with these renames:
#   user         -> donor
#   contentid    -> Project ID
#   interactions -> donations
#   item         -> project


def get_project_profile(project_id):
    idx = project_ids.index(project_id)
    project_profile = tfidf_matrix[idx:idx+1]
    return project_profile

def get_project_profiles(ids):
    project_profiles_list = [get_project_profile(x) for x in ids]
    project_profiles = scipy.sparse.vstack(project_profiles_list)
    return project_profiles

def build_donors_profile(donor_id, don_indexed_df):
    donations_donor_df = don_indexed_df[don_indexed_df['Donor ID'] == donor_id]
    donor_project_profiles = get_project_profiles(list(donations_donor_df['Project ID']))
    donor_project_strengths = np.array(donations_donor_df['eventStrength']).reshape(-1,1)
    # Weighted average of project profiles by donation strength
    donor_project_strengths_weighted_avg = np.sum(donor_project_profiles.multiply(donor_project_strengths),
                                                  axis=0) / np.sum(donor_project_strengths)
    # np.sum on a sparse matrix returns np.matrix; recent scikit-learn needs a plain array
    donor_profile_norm = sklearn.preprocessing.normalize(np.asarray(donor_project_strengths_weighted_avg))
    return donor_profile_norm

def build_donors_profiles(): 
    don_indexed_df = don_df[don_df['Project ID'].isin(projects['Project ID'])]
    donor_profiles = {}
    for donor_id in don_indexed_df['Donor ID'].unique():
        donor_profiles[donor_id] = build_donors_profile(donor_id, don_indexed_df)
    return donor_profiles

donor_profiles = build_donors_profiles()

In [28]:
donor_id = don_df[don_df['Project ID'].isin(projects['Project ID'])]['Donor ID'].unique()[0]

myprofile = donor_profiles[donor_id]
print(myprofile.shape)
print(donor_id)
pd.DataFrame(sorted(zip(tfidf_feature_names, 
                        donor_profiles[donor_id].flatten().tolist()), key=lambda x: -x[1])[:20],
             columns=['token', 'relevance'])

(1, 3000)
dbfe64ac9b09eb049378e7147019d11d


,token,relevance
0,chromebook,0.251673
1,websites,0.230577
2,computer,0.218609
3,expose students,0.196446
4,educational,0.174033
5,technology students,0.171345
6,expose,0.168447
7,second graders,0.158732
8,specific,0.156663
9,internet,0.147786


## References
[1] Moreira, G. (2017). Recommender Systems in Python 101 (Version 2.0). Kaggle.    
https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101

# Documenting Debugging

### 0 is not in the list

In [54]:
don_indexed_df = don_df[don_df['Project ID'].isin(projects['Project ID'])].set_index('Donor ID')
donor_id = don_indexed_df.index.unique()[0] # grab one example
donations_donor_df = don_indexed_df.loc[donor_id]
donor_project_profiles = get_project_profiles(donations_donor_df['Project ID'])
don_indexed_df.loc[donor_id]
donations_donor_df = don_indexed_df.loc[donor_id]
get_project_profiles(donations_donor_df['Project ID'])
# This raises: ValueError: '0' is not in list

ValueError: '0' is not in list

In [55]:
def get_project_profiles(ids):
    project_profiles_list = [get_project_profile(x) for x in ids]
    project_profiles = scipy.sparse.vstack(project_profiles_list)
    return project_profiles

In [56]:
print(donations_donor_df['Project ID'])

# I see: the code breaks when a single string is passed in, because the loop iterates over the string's characters instead of over a list of IDs
# As written, this only works for frequent donors (those with more than one donation)
for x in ['00589577d61473566a0d72e01ce2d523']: # passing a list
    print(x)

for x in '00589577d61473566a0d72e01ce2d523': # passing a string
    print(x)

00589577d61473566a0d72e01ce2d523
00589577d61473566a0d72e01ce2d523
0
0
5
8
9
5
7
7
d
6
1
4
7
3
5
6
6
a
0
d
7
2
e
0
1
c
e
2
d
5
2
3


### Ambiguous boolean
    idx = project_ids.index(project_id)

The code breaks after printing this:
```
Donor ID
2144d56b1947ebb26a19e7f1d07c970a    006fb95c63fe9baedf6754b62e520e94
2144d56b1947ebb26a19e7f1d07c970a    11191e5286b65b68e915e4781b878852
2144d56b1947ebb26a19e7f1d07c970a    d4678b1a597c0afcf45e8b85af77749c
Name: Project ID, dtype: object
```

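The issue occurs because donor IDs are not unique in the table: setting the donor ID as the index when the same donor_id appears multiple times creates a non-unique index. `.loc[donor_id]` then returns several rows for some donors and a single row for others, which prevents each donor's projects from being parsed consistently.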
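
The behavior can be reproduced on a tiny made-up frame. The IDs below are placeholders, and the boolean-mask lookup at the end mirrors the approach used in `build_donors_profile`.

```python
import pandas as pd

# Made-up IDs: donor "a" has two donations, donor "b" has one.
df = pd.DataFrame({
    "Donor ID":   ["a", "a", "b"],
    "Project ID": ["p1", "p2", "p3"],
}).set_index("Donor ID")

many = df.loc["a"]["Project ID"]   # Series holding two project IDs
one = df.loc["b"]["Project ID"]    # plain str, because .loc["b"] is a single row

print(type(many).__name__, type(one).__name__)
print(list(one))                   # iterating a str yields its characters

# Boolean masking always returns a DataFrame, so list(...) is safe either way.
flat = df.reset_index()
safe = list(flat[flat["Donor ID"] == "b"]["Project ID"])
print(safe)
```

The character-by-character iteration is what produced `'0' is not in list` earlier: the first character of the project ID was looked up as if it were a whole ID.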

Second fix: change the first fix from
```
get_project_profiles(list(donations_donor_df['Project ID'])] if 
                                                  type(donations_donor_df['Project ID']) else 
                                                  donations_donor_df['Project ID'])
```
into
```
get_project_profiles(list(donations_donor_df['Project ID']))
```

Unlike the first problem, I do not have example code that reproduces the failure. I solved the problem without keeping a backup of the mistakes I made while implementing Moreira's work [1]. 

In [57]:
don_indexed_df = don_df[don_df['Project ID'].isin(projects['Project ID'])]

In [58]:
donor_profiles = {}
for donor_id in don_indexed_df['Donor ID'].unique():
    donor_profiles[donor_id] = build_donors_profile(donor_id, don_indexed_df)

In [59]:
for donor_id in don_indexed_df['Donor ID'].unique():
    donations_donor_df = don_indexed_df[don_indexed_df['Donor ID'] == donor_id]
    get_project_profiles(list(donations_donor_df['Project ID'])) # Had to add list

In [60]:
list(donations_donor_df['Project ID'])

['04a068322915b71d0b728d7629ec16c8']

In [61]:
donor_id = '0d5b4cc12b2eb00013460d0ac38ce2a2'
donations_donor_df = don_indexed_df[don_indexed_df['Donor ID'] == donor_id]
donations_donor_df['Project ID']

63930    037719bf60853f234610458a210f45a9
63932    037719bf60853f234610458a210f45a9
63933    037719bf60853f234610458a210f45a9
Name: Project ID, dtype: object

In [62]:
test = id_df.head(40)

In [63]:
# Duplicate donor IDs make set_index produce a non-unique index, which breaks the code later on.
len(test.set_index('Donor ID').index.unique())

37