## White Papers

[Yodlee FAQ](https://developer.yodlee.com/Knowledge_Base/Transaction_Categorization)

* categorizes using transaction description field 
    * has a list of keywords (e.g. burrito -> food, mortgage -> loan)
    * (this probably means they maintain a list of rules; rules tend not to be as good as state-of-the-art ML)

* categories - https://developer.yodlee.com/Yodlee_API/docs/v1_1/Transaction_Categories
* api - https://developer.yodlee.com/apidocs/index.php#!/transactions/getTransactions
    * their transaction rules api is also pretty awesome - https://developer.yodlee.com/apidocs/index.php#!/transactions/getTransactionCategorizationRules
* important features for their algorithm:
    * credit or debit
    * transaction amount
    * time 
    * description (e.g. ACH Withdrawal-Debit XXXXXXXX00 - PPD US BANK - LOAN A BILL PAYMT)
    * merchant (name, address, type)
* user-defined categories
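
To make the keyword idea concrete, here is a minimal sketch of a keyword-to-category lookup on the description field. The keyword table and the function name `categorize_by_keyword` are made up for illustration; the notes above only tell us that Yodlee keeps some list of keywords, not what it contains or how matching works.

```python
# Hypothetical keyword table; Yodlee's actual list is not known from the FAQ.
KEYWORD_TO_CATEGORY = {
    "burrito": "food",
    "mortgage": "loan",
    "loan": "loan",
}


def categorize_by_keyword(description, default="uncategorized"):
    """Return the category of the first known keyword in the description."""
    # Lowercase and treat hyphens as separators so "Withdrawal-Debit" splits.
    tokens = description.lower().replace("-", " ").split()
    for token in tokens:
        if token in KEYWORD_TO_CATEGORY:
            return KEYWORD_TO_CATEGORY[token]
    return default


print(categorize_by_keyword(
    "ACH Withdrawal-Debit XXXXXXXX00 - PPD US BANK - LOAN A BILL PAYMT"
))
```

A lookup like this is easy to audit but only as good as the keyword list, which is the weakness of rule lists noted above.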

[MX white paper](https://go.mx.com/hs-fs/hub/456459/file-2662087128-pdf/PDFs/MX_TransactionData.pdf?t=1528465291015)

* takes transaction feed
    * e.g. COSTCxx 04ROCHESTER XXX726 XXX-XXX-1189 XXX027 --> costco
* 4 filters
    * user preference - do what the user asked
    * parser - try to parse the feed with a rule based model
    * matcher - rules that correlate with specific transaction feeds
    * crowdsourced - try to categorize by what other users labeled
* classify transactions (e.g. find all of your bill payments), then aggregate the classified transactions with competitors, and lastly sell the data
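
The 4 filters read like a cascade, so here is a rough sketch of one way to chain them: each filter either returns a category or passes the feed on to the next one. The ordering, the lookup tables, and every function name are assumptions for illustration; the white paper summary above does not say how MX actually wires the filters together.

```python
from collections import Counter
from typing import Optional

# All tables are made-up examples, not MX data.
USER_OVERRIDES = {"COSTCxx 04ROCHESTER XXX726": "business supplies"}
MERCHANT_PREFIXES = {"COSTC": "groceries"}
MATCHER_RULES = {"BILL PAYMT": "bill payment"}
CROWD_LABELS = {"JOES BIKES 0042": ["sporting goods", "sporting goods", "gifts"]}


def user_preference(feed: str) -> Optional[str]:
    # Do what the user asked for this exact feed, if anything.
    return USER_OVERRIDES.get(feed)


def parser(feed: str) -> Optional[str]:
    # Rule-based parse: look at the first token for a known merchant prefix.
    tokens = feed.split()
    first_token = tokens[0].upper() if tokens else ""
    for prefix, category in MERCHANT_PREFIXES.items():
        if first_token.startswith(prefix):
            return category
    return None


def matcher(feed: str) -> Optional[str]:
    # Rules tied to fragments that show up in specific feeds.
    for fragment, category in MATCHER_RULES.items():
        if fragment in feed.upper():
            return category
    return None


def crowdsourced(feed: str) -> Optional[str]:
    # Fall back to the most common label other users gave this feed.
    labels = CROWD_LABELS.get(feed)
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


FILTERS = [user_preference, parser, matcher, crowdsourced]


def categorize(feed: str) -> str:
    # The first filter that produces a category wins.
    for apply_filter in FILTERS:
        category = apply_filter(feed)
        if category is not None:
            return category
    return "uncategorized"


print(categorize("COSTCxx 11SEATTLE XXX999"))
```

Putting the user preference first means an explicit user choice always beats the automated guesses.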

[Strands white paper](https://finance.strands.com/white-paper-design-principles-of-transaction-categorization/)

[Zoho FAQ](https://www.zoho.com/us/books/kb/banking/auto-categorization-of-transactions.html)

**I called up customer service for Zoho Books (their accounting software)**
* heavily rule-based (a rule can be user-generated or one of the generic ones offered by the system)
    * deposit / withdrawals
    * Zoho Books will automatically display the matching transactions for the selected transaction. Select the matching Zoho Books transactions and click on Match.
    * some type of regex matching against the existing transaction feed
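
A guess at what such a rule could look like: a direction (deposit / withdrawal) plus a regex over the description. The `Rule` class, the example patterns, and `apply_rules` are hypothetical; nothing here is Zoho's actual rule format, which I only heard described over the phone.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    direction: str  # "deposit" or "withdrawal"
    pattern: str    # regex applied to the transaction description
    category: str


# Made-up rules; in a real system these would be user-generated or generic defaults.
RULES = [
    Rule("withdrawal", r"\bAWS\b|AMAZON WEB SERVICES", "hosting"),
    Rule("deposit", r"\bSTRIPE\b.*\bPAYOUT\b", "sales"),
]


def apply_rules(direction: str, description: str,
                rules: List[Rule] = RULES) -> Optional[str]:
    """Return the category of the first rule matching direction and description."""
    for rule in rules:
        if rule.direction != direction:
            continue
        if re.search(rule.pattern, description, re.IGNORECASE):
            return rule.category
    return None


print(apply_rules("deposit", "Stripe payout 2018-06-01"))
```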

[Plaid blog](https://blog.plaid.com/making-sense-of-messy-data/)

* categorize based on transaction description
* bag-of-words model --> basic natural language processing approach
* word embeddings --> standard approach from around 5 years ago
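
For reference, a bag-of-words model just counts the tokens in a description and ignores their order. The sketch below pairs that with a nearest-labeled-example lookup by cosine similarity, which is my own stand-in classifier and not something the Plaid post specifies; the labeled examples are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(description):
    """Count alphabetic tokens, ignoring order and case."""
    return Counter(re.findall(r"[a-z]+", description.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented labeled descriptions.
LABELED = [
    ("STARBUCKS COFFEE SEATTLE", "coffee"),
    ("SHELL OIL GAS STATION", "gas"),
]


def categorize(description):
    """Return the label of the most similar labeled example, or None if nothing overlaps."""
    query = bag_of_words(description)
    score, label = max(
        (cosine(query, bag_of_words(text)), label) for text, label in LABELED
    )
    return label if score > 0 else None


print(categorize("starbucks store 0421 coffee"))
```

Word embeddings replace the sparse count vector with a dense learned vector per token, so that related tokens end up close together even when they never literally match.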

## Existing Projects

*resources / references for how to do transaction categorization*

https://github.com/eli-goodfriend/banking-class
* some type of parsing library (SOMETHINGLIKETHIS -> something like this)
* 2 categorization rules
    1. if the transaction is from a big institution, have rules to do it directly
    2. otherwise, categorize using ML model
* ML model details
    * merchant name -> apply word embedding -> ML classifier (SVM, logistic regression, naive bayes)
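
The parsing step (SOMETHINGLIKETHIS -> something like this) can be sketched as dictionary-based word segmentation. This is not the library that repo uses; it is a small dynamic-programming illustration with a made-up vocabulary, preferring splits with fewer words.

```python
from typing import List, Optional, Set

# Tiny made-up vocabulary; a real one would come from a word list or corpus.
VOCAB = {"something", "some", "thing", "like", "this", "bank", "us"}


def segment(text: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split run-together text into vocabulary words, or return None if impossible."""
    text = text.lower()
    # best[i] is the shortest word list that exactly covers text[:i], else None.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[-1]


print(" ".join(segment("SOMETHINGLIKETHIS")))
```

Preferring fewer words is what picks "something" over "some" + "thing" here; a frequency-weighted score would be the more usual choice with a real vocabulary.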
      

https://github.com/tmerr/bank_wrangler/tree/master/bank_wrangler

* https://github.com/tmerr/bank_wrangler/blob/master/bank_wrangler/rules.py
* pretty good implementation of a rule / string matching algorithm

## Applications to Crypto

* it seems that, unlike financial institutions, most coins don't have a transaction description field
    * e.g. Bitcoin - 9-byte description (1 character = 1 byte)

* Googling "crypto transaction categories" makes me believe that no one is trying to categorize transactions atm
* this means that **crypto transactions need to be classified purely based on crowdsourced data from fintech partners**
* unfortunately, this seems like a graph theory problem - https://medium.com/chainalysis/visualizing-bitcoin-transactions-dd0e67d8e104
    * I remember talking with Adam Draper in 2014 (can't recall the details atm); he definitely has a few blockchain visualization / informatics companies, so it's a good idea to check them out
    * https://blockchain.info/tree/59587897 <-- not sure if this can help atm
* data from PFM partners will look like this:
    * address, institution (e.g. addr1, chase bank)
    * AND/OR address - address, category (e.g. addr1-addr2, loans)
* some approaches
    * DeepWalk - word embeddings for graphs (https://arxiv.org/abs/1403.6652)
        * main idea is that we can create a vector for every entity (node in the graph), and then find which nodes have similar vectors
        * so for example, if we know that this node is like 3 other nodes, and all 3 other nodes are banks, then we can infer that this is a bank
    * graph clustering algorithms (check out http://web.stanford.edu/class/cs224w/handouts.html - Spectral Clustering section)
        * main idea is also that if we can say these nodes belong in the same cluster, then we can infer the properties of unlabeled nodes from the labeled nodes in that cluster
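
To illustrate the "similar nodes share a label" idea, here is a rough sketch on a toy address graph. It keeps DeepWalk's first step (truncated random walks) but swaps the skip-gram embedding for raw co-occurrence counts, so it is a simplified stand-in and not DeepWalk itself. The addresses, labels, and edges are invented to mimic the two partner data shapes listed above.

```python
import math
import random
from collections import Counter, defaultdict

# Invented partner data: address -> institution type, plus address-address transfers.
KNOWN = {"a1": "bank", "a2": "bank", "a3": "bank",
         "b1": "exchange", "b2": "exchange", "b3": "exchange"}
GROUP_A = ["a1", "a2", "a3", "ax"]  # "ax" is unlabeled
GROUP_B = ["b1", "b2", "b3", "bx"]  # "bx" is unlabeled
EDGES = [(u, v) for group in (GROUP_A, GROUP_B)
         for i, u in enumerate(group) for v in group[i + 1:]]
EDGES.append(("a1", "b1"))  # a single transfer linking the two groups


def build_graph(edges):
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Sorted lists keep the walks reproducible for a fixed seed.
    return {node: sorted(adj) for node, adj in neighbors.items()}


def random_walks(graph, walks_per_node=20, walk_length=8, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in sorted(graph):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks


def context_vectors(walks, window=2):
    # Vector for a node = counts of other nodes seen within `window` steps of it.
    vectors = defaultdict(Counter)
    for walk in walks:
        for i, node in enumerate(walk):
            for other in walk[max(0, i - window): i + window + 1]:
                if other != node:
                    vectors[node][other] += 1
    return vectors


def cosine(a, b):
    dot = sum(count * b[node] for node, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def infer_label(node, vectors, known, k=3):
    # Majority vote among the k labeled nodes with the most similar vectors.
    ranked = sorted(known, key=lambda other: cosine(vectors[node], vectors[other]),
                    reverse=True)
    votes = Counter(known[other] for other in ranked[:k])
    return votes.most_common(1)[0][0]


graph = build_graph(EDGES)
vectors = context_vectors(random_walks(graph))
print(infer_label("ax", vectors, KNOWN), infer_label("bx", vectors, KNOWN))
```

The same majority vote would work on top of cluster assignments from a graph clustering algorithm: replace "k most similar nodes" with "labeled nodes in the same cluster".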