Find your next great cause at http://www.charityrecommender.com/
A 1-page project summary is available here
- This project was motivated by the desire to connect people with charitable organizations in their communities.
- Recommending charities for people to donate their time or resources to can transform communities and positively change lives.
- Recommend similar charities to a user-selected charity
- Recommend local charities from a user-selected category and zipcode
- Recommend charities that match a keyword search or description
The first dataset comes from CharityNavigator.org and contains detailed charity information, including description, motto, and overall score. The data is available here.
- Original Data Set
- 8,400 charities (rows) with 20 features for each charity (columns)
- Data Cleaning:
- Create "corpus" column that contains information about charity category, description, motto, and state
- Cleaned Data Set
- 8,400 charities (rows) with 8 features for each charity (columns)
name | ein | category | description | motto | score | state |
---|---|---|---|---|---|---|
1000 Friends of Oregon | 930642086 | Environment | Working with Oregonians to enhance our quality... | Great communities. Working lands. Iconic Places. | 91.94 | OR |
WYPR | 311770828 | Arts, Culture, Humanities | Serving the metropolitan Baltimore area and th... | 88.1 FM -. Your NPR News Station | 85.59 | MD |
Two Ten Footwear Foundation | 222579809 | Human Services | Funded solely by the footwear industry, Two Te... | Shoepeople Helping Shoepeople | 90.26 | MA |
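The "corpus" column described in the data cleaning step could be built roughly as follows. This is a minimal sketch, not the project's actual code: the helper name `build_corpus` and the simple space-joined format are assumptions, and the description is left truncated exactly as it appears in the table above.

```python
def build_corpus(charity, fields=("category", "description", "motto", "state")):
    """Join the selected text fields of one charity record into a single string.

    Fields that are missing or empty are skipped.
    """
    parts = [str(charity[f]).strip() for f in fields if charity.get(f)]
    return " ".join(parts)


# One row from the cleaned data set (description truncated as in the table).
charity = {
    "name": "1000 Friends of Oregon",
    "category": "Environment",
    "description": "Working with Oregonians to enhance our quality...",
    "motto": "Great communities. Working lands. Iconic Places.",
    "state": "OR",
}

corpus = build_corpus(charity)
print(corpus)
```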
- Receive a charity name as user input
- Access CharityNavigator database to refer to that charity's information (category, description, and motto)
- Compare that charity's information to the TF-IDF model of charities in our database to generate "similarity scores"
- TF-IDF stands for Term Frequency - Inverse Document Frequency
- Term Frequency refers to how frequently a word occurs within a document
- Inverse Document Frequency refers to how important a given word is in creating a match, determined by the "rarity" of the word across all documents
- The Top 3 highest similarity-scoring charities are recommended to the user
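The steps above can be sketched in plain Python. This is an illustrative sketch only, not the project's implementation: the function names, the smoothed IDF formula, the simple tokenizer, and the toy corpus (with made-up charity names) are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build an L2-normalized sparse TF-IDF vector (dict of term -> weight)."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def recommend_similar(name, corpus, top_n=3):
    """Return the top_n (charity, similarity score) pairs most similar to `name`."""
    tokens = {n: tokenize(text) for n, text in corpus.items()}
    idf = fit_idf(list(tokens.values()))
    vectors = {n: tfidf_vector(toks, idf) for n, toks in tokens.items()}
    scored = [(cosine(vectors[name], vectors[n]), n) for n in corpus if n != name]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(n, score) for score, n in scored[:top_n]]


# Hypothetical corpus strings, for illustration only.
toy_corpus = {
    "River Trust": "environment protecting rivers and wetlands OR",
    "Forest Alliance": "environment protecting forests and wildlife OR",
    "City Orchestra": "arts culture humanities symphony concerts MD",
    "Youth Theater": "arts culture humanities theater for youth MD",
    "Food Pantry": "human services meals for families MA",
}

results = recommend_similar("River Trust", toy_corpus)
for charity_name, score in results:
    print(f"{charity_name}: {score:.3f}")
```

Because every vector is normalized to unit length, the cosine similarity reduces to a dot product over the terms the two documents share.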
The Similar Charity Recommender webpage features an autofill feature to find charities in the database, coded in JavaScript.
The code for this can be found here
The two criteria used to score the similar recommendation model are Category Scores and Similarity Scores.
The CharityNavigator Dataset contains 11 categories. Recommending one of these categories at random would result in the same category getting recommended 1/11 or 9.09% of the time, which will be the baseline for the model. The goal for the recommendation model is to improve category score to above 50%.
The Similarity Scores are calculated using the cosine similarity between the TF-IDF vector representations of documents. The three documents with the highest similarity scores are returned as the top 3 recommended charities.
The two hyperparameters used to tune the model are minimum document count and maximum document percentage.
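As a rough illustration of the category criterion, the sketch below computes the random baseline and one possible category score. The source does not define the category score precisely, so treating it as the fraction of recommendations that share the selected charity's category is an assumption, as are the function name and the made-up charity names.

```python
def category_score(recommendations, categories):
    """Fraction of recommended charities whose category matches the query's.

    recommendations: dict mapping a query charity to its recommended charities.
    categories: dict mapping every charity name to its category.
    """
    hits = 0
    total = 0
    for query, recommended in recommendations.items():
        for name in recommended:
            total += 1
            if categories[name] == categories[query]:
                hits += 1
    return hits / total if total else 0.0


# Baseline: picking one of the 11 categories uniformly at random.
baseline = 1 / 11
print(f"baseline: {baseline:.2%}")

# Hypothetical charities, for illustration only.
categories = {
    "River Trust": "Environment",
    "Forest Alliance": "Environment",
    "City Orchestra": "Arts, Culture, Humanities",
    "Food Pantry": "Human Services",
}
recommendations = {
    "River Trust": ["Forest Alliance", "City Orchestra", "Food Pantry"],
}
score = category_score(recommendations, categories)
print(f"category score: {score:.2%}")
```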
Minimum Document Count refers to the minimum number of documents a word must appear in to be included as a token in the model.
Maximum Document Percentage refers to the maximum percentage of documents a word can appear in before it is excluded. For example, a 30% maximum percentage indicates that a word cannot appear in more than 30% of documents.
Raising the minimum document count or lowering the maximum document percentage decreases the number of tokens (words) used to analyze charities and make similarity recommendations. Hyperparameter tuning for this model seeks to balance the optimization of category and similarity scores against the loss of tokens from the corpus.
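The effect of the two hyperparameters on vocabulary size can be seen in a small sketch. The function name `filter_vocabulary`, its parameter names, and the toy documents are assumptions for illustration. The project may have used a library for this step; for instance, gensim's `Dictionary.filter_extremes(no_below, no_above)` is comparable in spirit.

```python
from collections import Counter


def filter_vocabulary(docs_tokens, min_doc_count=1, max_doc_pct=1.0):
    """Keep tokens that appear in at least min_doc_count documents
    and in at most max_doc_pct (a fraction from 0 to 1) of all documents."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each token once per document
    return {
        token
        for token, count in doc_freq.items()
        if count >= min_doc_count and count / n_docs <= max_doc_pct
    }


# Hypothetical tokenized documents.
docs = [
    ["environment", "rivers", "charity"],
    ["environment", "forests", "charity"],
    ["arts", "music", "charity"],
    ["arts", "theater", "charity"],
]

full = filter_vocabulary(docs)
no_rare = filter_vocabulary(docs, min_doc_count=2)
no_rare_or_common = filter_vocabulary(docs, min_doc_count=2, max_doc_pct=0.5)
print(len(full), len(no_rare), len(no_rare_or_common))
```

In this toy example the minimum count drops words that appear in only one document, and the maximum percentage drops "charity", which appears in every document and so cannot help distinguish one charity from another.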
The code for this can be found here
The second dataset comes from IRS.gov and contains information about charitable organizations in the United States.
- Original Data Set
- 1,719,673 charities (rows) with 28 features for each charity (columns)
- Data Cleaning:
- Keep charities that offer fully tax deductible donations and have an NTEE Category Code
- Translate NTEE code into category column and keep essential column features
- Cleaned Data Set
- 992,318 charities (rows) with 10 features for each charity (columns)
EIN | NAME | STATE | INCOME_CD | ZIP_FIVE | NTEE_Major_Category | County |
---|---|---|---|---|---|---|
10130427 | BRIDGTON HOSPITAL | ME | 9 | 04009 | Health - General and Rehabilitative | Cumberland County |
10024645 | BANGOR SYMPHONY ORCHESTRA | ME | 6 | 04402 | Arts, Culture and Humanities | Penobscot County |
10015091 | HANOVER SOCCER CLUB INC | NJ | 4 | 07927 | Recreation, Sports, Leisure, Athletics | Morris County |
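The cleaning steps for this dataset might look like the sketch below. It is not the project's code: the raw column names `DEDUCTIBILITY`, `NTEE_CD`, and `ZIP`, the meaning of deductibility code `"1"`, and the partial NTEE lookup table are assumptions about the IRS file, and the input rows are invented for illustration. The County column is omitted because the source does not say how it was derived.

```python
# Partial, illustrative mapping from the first letter of an NTEE code
# to a major category (only categories shown in the table above).
NTEE_MAJOR = {
    "A": "Arts, Culture and Humanities",
    "E": "Health - General and Rehabilitative",
    "N": "Recreation, Sports, Leisure, Athletics",
}


def clean_irs_rows(rows):
    """Keep fully deductible charities with a known NTEE code and
    reduce each row to the essential columns."""
    cleaned = []
    for row in rows:
        ntee = (row.get("NTEE_CD") or "").strip()
        # Assumption: deductibility code "1" marks fully deductible donations.
        if row.get("DEDUCTIBILITY") != "1" or not ntee:
            continue
        category = NTEE_MAJOR.get(ntee[0].upper())
        if category is None:
            continue
        cleaned.append({
            "EIN": row["EIN"],
            "NAME": row["NAME"],
            "STATE": row["STATE"],
            "INCOME_CD": int(row["INCOME_CD"]),
            "ZIP_FIVE": row["ZIP"][:5],  # drop any ZIP+4 suffix
            "NTEE_Major_Category": category,
        })
    return cleaned


# Hypothetical raw rows, not taken from the IRS file.
raw_rows = [
    {"EIN": "000000001", "NAME": "EXAMPLE HOSPITAL", "STATE": "ME",
     "INCOME_CD": "9", "ZIP": "04009-1234", "DEDUCTIBILITY": "1", "NTEE_CD": "E22"},
    {"EIN": "000000002", "NAME": "EXAMPLE ORCHESTRA", "STATE": "ME",
     "INCOME_CD": "6", "ZIP": "04402-0000", "DEDUCTIBILITY": "1", "NTEE_CD": "A69"},
    {"EIN": "000000003", "NAME": "EXAMPLE CLUB", "STATE": "NJ",
     "INCOME_CD": "4", "ZIP": "07927-0000", "DEDUCTIBILITY": "2", "NTEE_CD": "N64"},
    {"EIN": "000000004", "NAME": "EXAMPLE FUND", "STATE": "NJ",
     "INCOME_CD": "3", "ZIP": "07927-0000", "DEDUCTIBILITY": "1", "NTEE_CD": ""},
]

cleaned_rows = clean_irs_rows(raw_rows)
for record in cleaned_rows:
    print(record)
```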
- Receive a category and zipcode as user input
- Filter charities on selected category
- Assign a "locality score" to each charity
- **Locality Score = (Income Code * Locality Factor) / 50**
- Income code represents annual income of charity, higher income generally means more established charities
- Locality Factor represents how "local" the charity is to the user, determined by a zipcode, county, or state match
- 50 represents the maximum score, and is used to convert from a score to a percentage
- The Top 3 highest-scoring charities are recommended to the user
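The scoring steps above can be sketched as follows. The formula comes from the list above, but the locality factor values (5 for a zipcode match, 3 for a county match, 1 for a state match) are placeholders, since the source does not state them. The function names, the `user` dictionary (which assumes county and state have already been looked up from the zipcode), and the charity records are also hypothetical.

```python
MAX_SCORE = 50

# Placeholder weights: the actual factor values are not given in the source.
LOCALITY_FACTORS = {"zip": 5, "county": 3, "state": 1}


def locality_factor(charity, user):
    """Return the factor for the closest geographic match, or 0 if none."""
    if charity["ZIP_FIVE"] == user["zip"]:
        return LOCALITY_FACTORS["zip"]
    if charity["STATE"] == user["state"] and charity["County"] == user["county"]:
        return LOCALITY_FACTORS["county"]
    if charity["STATE"] == user["state"]:
        return LOCALITY_FACTORS["state"]
    return 0


def locality_score(charity, user):
    """Locality Score = (Income Code * Locality Factor) / 50."""
    return charity["INCOME_CD"] * locality_factor(charity, user) / MAX_SCORE


def recommend_local(charities, category, user, top_n=3):
    """Filter on category, score each charity, and return the top_n names."""
    matches = [c for c in charities if c["NTEE_Major_Category"] == category]
    ranked = sorted(matches, key=lambda c: locality_score(c, user), reverse=True)
    return [c["NAME"] for c in ranked[:top_n]]


# Hypothetical records, for illustration only.
health = "Health - General and Rehabilitative"
charities = [
    {"NAME": "CLINIC A", "STATE": "ME", "INCOME_CD": 9, "ZIP_FIVE": "04009",
     "County": "Cumberland County", "NTEE_Major_Category": health},
    {"NAME": "CLINIC B", "STATE": "ME", "INCOME_CD": 6, "ZIP_FIVE": "04101",
     "County": "Cumberland County", "NTEE_Major_Category": health},
    {"NAME": "CLINIC C", "STATE": "ME", "INCOME_CD": 9, "ZIP_FIVE": "04402",
     "County": "Penobscot County", "NTEE_Major_Category": health},
    {"NAME": "CLINIC D", "STATE": "NJ", "INCOME_CD": 9, "ZIP_FIVE": "07927",
     "County": "Morris County", "NTEE_Major_Category": health},
    {"NAME": "ORCHESTRA E", "STATE": "ME", "INCOME_CD": 9, "ZIP_FIVE": "04009",
     "County": "Cumberland County",
     "NTEE_Major_Category": "Arts, Culture and Humanities"},
]
user = {"zip": "04009", "county": "Cumberland County", "state": "ME"}

top_three = recommend_local(charities, health, user)
print(top_three)
```

With these placeholder weights, a smaller charity in the user's county can outrank a larger charity elsewhere in the state, which is the trade-off the formula is meant to capture.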
The code to do this can be found here
- Charity Recommender is a functional web application and can be accessed at www.charityrecommender.com/
- Next steps for this project include adding more charities to the corpus and integrating recommendation features with www.givz.com/
Find your next great cause to donate to today!
- Python - Coding Language for Machine Learning Application
- Gensim - Used for Latent Dirichlet Allocation - Topic Modeling
- Flask - Framework for Creating the Web Application
- Elastic Beanstalk - Service for Deploying Web Applications
- Tableau - Tool for creating advanced data visualizations
- Thank you to those that support charitable organizations and help to make our world a better place