Charity Recommender Web Application built using Natural Language Processing

jackvessa/Charity_Recommender

Charity Recommender

Find your next great cause at http://www.charityrecommender.com/

CharityRecommender Homepage

A 1-page project summary is available here

Project Motivation

  • This project was motivated by the desire to connect people with charitable organizations in their communities.
  • Recommending charities for people to donate their time or resources to can transform communities and positively change lives.

Goals:

  • Recommend similar charities to a user-selected charity
  • Recommend local charities from a user-selected category and zipcode
  • Recommend charities that match a keyword search or description

Overview of the Data

First Dataset - CharityNavigator:

The first dataset comes from CharityNavigator.org and contains detailed charity information, including description, motto, and overall score. The data is available here.

  • Original Data Set
    • 8,400 charities (rows) with 20 features for each charity (columns)
  • Data Cleaning:
    • Create "corpus" column that contains information about charity category, description, motto, and state
  • Cleaned Data Set
    • 8,400 charities (rows) with 8 features for each charity (columns)
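
The "corpus" column described above could be built roughly as follows. This is a minimal sketch, assuming each charity is a dictionary keyed by the column names shown in the data preview; `build_corpus` is a hypothetical helper name, not necessarily what the project's code uses.

```python
# Fields that the README says are combined into the "corpus" column.
CORPUS_FIELDS = ("category", "description", "motto", "state")


def build_corpus(record):
    """Join the text fields of one charity into a single corpus string.

    Missing or empty fields are skipped so they do not add stray spaces.
    """
    parts = [str(record[field]).strip() for field in CORPUS_FIELDS if record.get(field)]
    return " ".join(parts)


# Toy record for illustration only (not taken from the dataset).
example = {
    "name": "Example River Trust",
    "category": "Environment",
    "description": "Protecting local rivers and wetlands.",
    "motto": "",
    "state": "OR",
}
example_corpus = build_corpus(example)
print(example_corpus)
```

Keeping the category and state inside the corpus text means that charities in the same category or state share tokens, which nudges their similarity scores upward.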

Preview of CharityNavigator Data Set:

| name | ein | category | description | motto | score | state |
|---|---|---|---|---|---|---|
| 1000 Friends of Oregon | 930642086 | Environment | Working with Oregonians to enhance our quality... | Great communities. Working lands. Iconic Places. | 91.94 | OR |
| WYPR | 311770828 | Arts, Culture, Humanities | Serving the metropolitan Baltimore area and th... | 88.1 FM -. Your NPR News Station | 85.59 | MD |
| Two Ten Footwear Foundation | 222579809 | Human Services | Funded solely by the footwear industry, Two Te... | Shoepeople Helping Shoepeople | 90.26 | MA |

Back to top

CharityNavigator Dataset - Exploratory Data Analysis

The distribution of charity description lengths is left-skewed with a median of 690 characters. These descriptions will serve as the documents to create the TF-IDF vectorizer which will be used to find similarity between charities.
The distribution of charity scores, as rated by CharityNavigator, is left-skewed with a median of 88.31%. These scores are indicators of a charity's financial health and its Accountability & Transparency.
The most common charity category in our database is Human Services, which constitutes 28.32% of charities. The second most common is Arts, Culture, and Humanities at 14.5% followed by Health and Community Development at about 10% each.
Investigating the mean charity score by category reveals that Community Development, Animal, and Environmental charities are the most highly rated. The lowest rated charity categories, on average, are Religion, Research & Public Policy, and Arts, Culture, & Humanities.

Back to top

Similar Recommender

Steps for Building the Similar Recommender:

  • Receive a charity name as user input
  • Access CharityNavigator database to refer to that charity's information (category, description, and motto)
  • Compare that charity's information to the TF-IDF model of charities in our database to generate "similarity scores"
    • TF-IDF stands for Term Frequency - Inverse Document Frequency
    • Term Frequency refers to how frequently a word occurs within a document
    • Inverse Document Frequency refers to how important a given word is in creating a match, determined by how rare the word is across all documents
  • The Top 3 highest similarity-scoring charities are recommended to the user
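
The steps above can be sketched in plain Python. This is an illustrative from-scratch version, not the project's code (which may well use a library vectorizer); the tokenizer, the function names, and the toy corpus are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-letters; a real pipeline would likely also drop stop words.
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def recommend_similar(name, corpus, top_n=3):
    """Return the names of the top_n charities most similar to `name`."""
    names = list(corpus)
    vectors = tfidf_vectors([corpus[n] for n in names])
    query = vectors[names.index(name)]
    scored = [(cosine(query, vec), other) for other, vec in zip(names, vectors) if other != name]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]


# Toy corpus for illustration only (made-up charities).
toy_corpus = {
    "River Keepers": "environment protect rivers clean water wildlife",
    "Forest Trust": "environment protect forests wildlife habitat",
    "City Symphony": "arts music orchestra concerts",
    "Food Pantry": "human services food meals families",
    "Youth Theater": "arts theater performances youth",
}
similar = recommend_similar("River Keepers", toy_corpus)
print(similar)
```

Words that appear in every document get an inverse document frequency of log(1) = 0, so they contribute nothing to the similarity score, which is the "rarity" weighting described above.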

SimilarSelector
The Similar Charity Recommender webpage includes an autofill feature, coded in JavaScript, for finding charities in the database.

SimilarRecommendations
The top three most similar charities are recommended by the model. The table below each recommendation shows the similarity score, CharityNavigator rating, and top keywords used in matching the charities.

The code for this can be found here

Back to top

Scoring the Similar Recommendation Model

The two criteria used to score the similar recommendation model are Category Scores and Similarity Scores.

Category Scores

The CharityNavigator Dataset contains 11 categories. Recommending a charity from a category chosen uniformly at random would match the selected charity's category 1/11, or 9.09%, of the time, which serves as the baseline for the model. The goal for the recommendation model is to raise the category score above 50%.
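
As a rough illustration of the metric, the category score can be read as the fraction of recommended charities that share a category with the charity they were recommended for. The sketch below assumes that reading; `category_score` and the toy data are hypothetical, not taken from the project.

```python
def category_score(recommendations, categories):
    """Fraction of recommendations whose category matches the query charity's category.

    recommendations: {query_name: [recommended_name, ...]}
    categories:      {charity_name: category}
    """
    matches = 0
    total = 0
    for query, recommended in recommendations.items():
        for name in recommended:
            total += 1
            if categories[name] == categories[query]:
                matches += 1
    return matches / total if total else 0.0


# Baseline from the text: a uniformly random pick among 11 categories.
baseline = 1 / 11
print(f"baseline: {baseline:.2%}")

# Toy data for illustration only.
toy_categories = {"A": "Environment", "B": "Environment", "C": "Arts"}
toy_recommendations = {"A": ["B", "C"]}
toy_score = category_score(toy_recommendations, toy_categories)
print(f"toy category score: {toy_score:.2%}")
```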

Similarity Scores

The Similarity Scores are calculated as the cosine similarity between the TF-IDF vector representations of documents. The three documents with the highest similarity scores are returned as the top 3 recommended charities.

Hyperparameter Tuning

The two hyperparameters used to tune the model are minimum document count and maximum document percentage.

Minimum Document Count refers to the minimum number of documents a word must appear in to be included as a token in the model.

Maximum Document Percentage refers to the maximum percentage of documents a word can appear in before it is excluded. For example, a 30% maximum percentage indicates that a word cannot appear in more than 30% of documents.

Increasing the minimum document count or decreasing the maximum document percentage will decrease the number of tokens (words) used to analyze charities and make similarity recommendations. The hyperparameter tuning of this model seeks to balance the optimization of category and similarity scores against the loss of tokens from the corpus.
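
The effect of the two hyperparameters on the vocabulary can be sketched in a few lines. This is a plain-Python illustration under assumed names (`filter_vocabulary`, `min_doc_count`, `max_doc_fraction`); libraries expose comparable settings, for example `min_df` and `max_df` in scikit-learn's `TfidfVectorizer` or `no_below` and `no_above` in Gensim's `Dictionary.filter_extremes`, but the README does not say which one the project uses.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_doc_count=4, max_doc_fraction=0.30):
    """Keep tokens that appear in at least min_doc_count documents
    and in at most max_doc_fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {
        tok
        for tok, df in doc_freq.items()
        if df >= min_doc_count and df / n_docs <= max_doc_fraction
    }


# Toy corpus: "charity" is in all 10 documents, "water" in 3, "zebra" in 1.
toy_docs = [["charity"] for _ in range(10)]
for i in range(3):
    toy_docs[i].append("water")
toy_docs[0].append("zebra")

loose = filter_vocabulary(toy_docs, min_doc_count=1, max_doc_fraction=1.0)
strict = filter_vocabulary(toy_docs, min_doc_count=2, max_doc_fraction=0.5)
print(sorted(loose), sorted(strict))
```

In the toy run, the strict setting drops both the word that is too common ("charity") and the word that is too rare ("zebra"), leaving only tokens that can actually distinguish one charity from another.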

The category scores graph shows that the category scores gradually increase as minimum document count increases, but jump significantly as maximum document percentage increases. The optimal parameters for category score are 4 minimum documents and 30% max documents, with a category score of 71.73%.
The similarity scores increase significantly up to 4 minimum documents and then experience a consistent gradual increase. Changing the maximum document percentage does not appear to significantly change similarity scores. Using the 4 minimum documents and 30% max documents parameters from the tuned category score results in a similarity score of 35.72%.

Creating the Latent Dirichlet Allocation (LDA) Model

The distribution of token counts per charity is approximately normal with a mean of 45 tokens. These tokens are the unique words that represent each charity and are used by the model to compare and recommend similar charities. Tuning the model to a minimum document count of 4 and a maximum document percentage of 30% trimmed the unique tokens in the model from 21,799 to 6,430.
The Jaccard Similarity graph shows the similarity, or overlap, across topics for various numbers of topics used by the LDA model. Based on this graph, 15 topics is optimal for breaking the corpus into coherent topics, with only 5.64% mean topic overlap.
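
One way to compute a mean topic overlap like the one described above is to take the top words of each topic and average the Jaccard similarity over all topic pairs. The sketch below assumes that approach; the function names and the toy topics are illustrative, and the project's exact procedure may differ.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_topic_overlap(topics):
    """Average Jaccard similarity over every pair of topics.

    topics: a list of word lists, one list of top words per topic.
    """
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy topics for illustration only; with an LDA model these would be
# the highest-weighted words of each learned topic.
toy_topics = [
    ["water", "river", "clean"],
    ["music", "arts", "river"],
    ["food", "meals", "families"],
]
overlap = mean_topic_overlap(toy_topics)
print(f"mean topic overlap: {overlap:.2%}")
```

Repeating this calculation for models trained with different numbers of topics, and picking the count with low overlap, is one plausible way to arrive at a choice such as 15 topics.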

The code for this can be found here

Back to top

Second Dataset - IRS:

The second dataset comes from IRS.gov and contains information about charitable organizations in the United States.

  • Original Data Set
    • 1,719,673 charities (rows) with 28 features for each charity (columns)
  • Data Cleaning:
    • Keep charities that offer fully tax deductible donations and have an NTEE Category Code
    • Translate NTEE code into category column and keep essential column features
  • Cleaned Data Set
    • 992,318 charities (rows) with 10 features for each charity (columns)
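
The cleaning steps above might look roughly like the following. This is a sketch under several assumptions: the input column names (`DEDUCTIBILITY`, `NTEE_CD`, `ZIP`) and the rule that a deductibility code of "1" means fully deductible are assumptions about the raw IRS file, the NTEE letter table is only a partial excerpt, and the sample rows are made up.

```python
# Partial excerpt of the NTEE major groups (the full table has one entry per letter A-Z).
NTEE_MAJOR_CATEGORY = {
    "A": "Arts, Culture and Humanities",
    "B": "Education",
    "E": "Health - General and Rehabilitative",
    "N": "Recreation, Sports, Leisure, Athletics",
}

# Assumption: code "1" marks organizations whose donations are fully deductible.
DEDUCTIBLE_CODE = "1"


def clean_irs_rows(rows):
    """Keep deductible charities that have an NTEE code and add a category column."""
    cleaned = []
    for row in rows:
        ntee = (row.get("NTEE_CD") or "").strip()
        if row.get("DEDUCTIBILITY") != DEDUCTIBLE_CODE or not ntee:
            continue
        category = NTEE_MAJOR_CATEGORY.get(ntee[0].upper())
        if category is None:
            continue  # letter not covered by this partial table
        cleaned.append({
            "EIN": row["EIN"],
            "NAME": row["NAME"],
            "STATE": row["STATE"],
            "INCOME_CD": int(row["INCOME_CD"]),
            "ZIP_FIVE": row["ZIP"][:5],  # drop the ZIP+4 suffix
            "NTEE_Major_Category": category,
        })
    return cleaned


# Made-up rows for illustration only.
toy_rows = [
    {"EIN": "000000001", "NAME": "EXAMPLE HOSPITAL", "STATE": "ME", "INCOME_CD": "9",
     "ZIP": "04009-0000", "NTEE_CD": "E22", "DEDUCTIBILITY": "1"},
    {"EIN": "000000002", "NAME": "EXAMPLE CLUB", "STATE": "NJ", "INCOME_CD": "4",
     "ZIP": "07927-0000", "NTEE_CD": "", "DEDUCTIBILITY": "1"},
    {"EIN": "000000003", "NAME": "EXAMPLE LEAGUE", "STATE": "NY", "INCOME_CD": "3",
     "ZIP": "10001-0000", "NTEE_CD": "N60", "DEDUCTIBILITY": "2"},
]
cleaned_rows = clean_irs_rows(toy_rows)
print(cleaned_rows)
```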

Preview of IRS Data Set:

| EIN | NAME | STATE | INCOME_CD | ZIP_FIVE | NTEE_Major_Category | County |
|---|---|---|---|---|---|---|
| 10130427 | BRIDGTON HOSPITAL | ME | 9 | 04009 | Health - General and Rehabilitative | Cumberland County |
| 10024645 | BANGOR SYMPHONY ORCHESTRA | ME | 6 | 04402 | Arts, Culture and Humanities | Penobscot County |
| 10015091 | HANOVER SOCCER CLUB INC | NJ | 4 | 07927 | Recreation, Sports, Leisure, Athletics | Morris County |

IRS Dataset - Exploratory Data Analysis

Charity Income Codes, along with a "locality factor," are used to generate recommendations. This chart shows that a large share of the charities in the dataset have a code of 3 or 4, with about 8% having a code of 7, 8, or 9.
Investigating the charity counts by state reveals that California is the most common charity headquarters location, constituting about 11% of our dataset. The next three most common are Texas, New York, and Florida at about 7% each.
Charity Categories are another factor used for generating recommendations. The most common charity categories in the USA are Religious and Education charities, followed by Human Services, Philanthropy, and Arts, Culture, and Humanities.
This map takes a closer look at the charity distribution in California. Within California, the charitable organizations are centered around The SF Bay Area and Los Angeles. Charities also appear to be more frequent along the coast than inland.

Local Recommender

Steps for Building the Local Recommender:

  • Receive a category and zipcode as user input
  • Filter charities on selected category
  • Assign a "locality score" to each charity
    • Locality Score = (Income Code * Locality Factor) / 50
    • Income Code represents the annual income of the charity; higher income generally means a more established charity
    • Locality Factor represents how "local" the charity is to the user, determined by a zipcode, county, or state match
    • 50 represents the maximum score, and is used to convert from a score to a percentage
  • The Top 3 highest-scoring charities are recommended to the user
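
The locality scoring above can be sketched as follows. The formula and the divisor of 50 come from the steps listed; the specific Locality Factor values for a zipcode, county, or state match are not given in this README, so the numbers in `LOCALITY_FACTORS` are purely hypothetical, as are the function names and toy charities.

```python
# Hypothetical factor values: the README only says the factor reflects
# a zipcode, county, or state match, not what the numbers are.
LOCALITY_FACTORS = {"zip": 5, "county": 3, "state": 1}
MAX_SCORE = 50  # divisor stated in the README


def locality_factor(charity, user):
    """Return the factor for the closest geographic match, or 0 for no match."""
    if charity["zip"] == user["zip"]:
        return LOCALITY_FACTORS["zip"]
    same_state = charity["state"] == user["state"]
    if same_state and charity["county"] == user["county"]:
        return LOCALITY_FACTORS["county"]
    if same_state:
        return LOCALITY_FACTORS["state"]
    return 0


def locality_score(income_code, factor):
    # Locality Score = (Income Code * Locality Factor) / 50
    return income_code * factor / MAX_SCORE


def recommend_local(charities, category, user, top_n=3):
    """Filter on category, score each charity, and return the top_n (score, name) pairs."""
    scored = [
        (locality_score(c["income_code"], locality_factor(c, user)), c["name"])
        for c in charities
        if c["category"] == category
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Toy data for illustration only.
toy_user = {"zip": "04009", "county": "Cumberland County", "state": "ME"}
toy_charities = [
    {"name": "Toy Hospital A", "category": "Health", "income_code": 9,
     "zip": "04009", "county": "Cumberland County", "state": "ME"},
    {"name": "Toy Clinic B", "category": "Health", "income_code": 6,
     "zip": "04101", "county": "Cumberland County", "state": "ME"},
    {"name": "Toy Clinic C", "category": "Health", "income_code": 4,
     "zip": "04402", "county": "Penobscot County", "state": "ME"},
    {"name": "Toy Clinic D", "category": "Health", "income_code": 9,
     "zip": "07927", "county": "Morris County", "state": "NJ"},
    {"name": "Toy Orchestra", "category": "Arts", "income_code": 9,
     "zip": "04009", "county": "Cumberland County", "state": "ME"},
]
local_results = recommend_local(toy_charities, "Health", toy_user)
print(local_results)
```

Because the score multiplies income by locality, a large charity far away can still be outranked by a smaller charity in the user's own zipcode, which matches the stated goal of recommending local organizations.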

LocalSelector

LocalRecommendations

The code to do this can be found here

Conclusion and Next Steps

  • Charity Recommender is a functional web application and can be accessed at www.charityrecommender.com/
  • Next steps for this project include adding more charities to the corpus and integrating recommendation features with www.givz.com/

Find your next great cause to donate to today!

Built With

  • Python - Coding Language for Machine Learning Application
  • Gensim - Used for Latent Dirichlet Allocation - Topic Modeling
  • Flask - Framework for Creating the Web Application
  • Elastic Beanstalk - Service for Deploying Web Applications
  • Tableau - Tool for creating advanced data visualizations

Acknowledgments

  • Thank you to those that support charitable organizations and help to make our world a better place

Back to top
