________________
# Jack's money went here. 

## Where is Twitter likely to lean more and less now that he's leaving? Where will there be matching donations?

Jack Dorsey is pledging over 466 million dollars and wants matching donations. His rationale is simple: billionaires can spare a tithe to help communities and people, and compounded over a few hundred of his closest friends, that can have a tremendous impact.

This dataset is based on the tweet https://twitter.com/jack/status/1247616214769086465, which lists the pledged organizations and their donations.
__________________________
### We will learn how to quickly data science this dataset. We will select feature representations and visualize the resulting graph using UMAP.

Featurization is the foundation of data science. Likewise, Graph Thinking requires edges between nodes. Often the data we have from databases/dataframes is tabular and row-like, with no inkling of an edge table. This does *not* have to be an impediment for *Graph Thinking and materialization of data science workflows*.

UMAP is a powerful tool that projects complex, heterogeneous data coming from potentially many different distributions, down to lower dimensional embeddings and projections. The embedding estimates similarity between the rows, or nodes of the data, and thus forms a graph. 

Standardizing a feature set across the databases used in every modern company and then sending it to UMAP serves as a powerful graph generation tool.  
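To make the idea of "similar rows become edges" concrete, here is a minimal, dependency-free sketch. It is not UMAP itself: it only shows a nearest-neighbor step of the kind UMAP starts from. The function name `knn_edges` and the toy feature vectors are made up for illustration.

```python
import math

def knn_edges(rows, k=1):
    """Connect each row to its k nearest rows by Euclidean distance."""
    edges = []
    for i, a in enumerate(rows):
        # distance from row i to every other row, closest first
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(rows) if j != i)
        for d, j in dists[:k]:
            edges.append((i, j, round(d, 3)))
    return edges

# Hypothetical 2-d feature vectors: rows 0-2 form one group, rows 3-4 another
features = [(0.0, 0.1), (0.1, 0.0), (0.3, 0.2), (5.0, 5.1), (5.1, 5.0)]
edges = knn_edges(features, k=1)
for src, dst, dist in edges:
    print(f"{src} -> {dst}  (distance {dist})")
```

Every edge stays inside its own group, which is the behavior the plots in this notebook rely on: rows with similar features end up connected and drawn near each other.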
____________________________
Here we demonstrate how to featurize and use UMAP to generate implicit graphs. The features may then be used in subsequent modeling using your favorite libraries -- sklearn, tensorflow, pytorch[, geometric, lightning, ...], cuGraph, DGL, etc. We demonstrate four featurization methods (latent embeddings, transformer embeddings, ngram embeddings, and one-hot encodings) that may be mixed to build different features for different columns, automatically.

Furthermore, when we `g.plot()` the results, the graph is laid out according to the 2-dimensional UMAP projection of the data -- nearness in that projection represents nearness in the resulting features. We will test this empirically using the different featurization methods for textual, numeric, and categorical data.

In [None]:
#!pip install graphistry[ai]  # install the AI dependencies of Graphistry

In [None]:
# cd ..

In [None]:
import os

import pandas as pd
import graphistry
import numpy as np

import matplotlib.pylab as plt
%matplotlib inline

In [None]:
np.random.seed(137)

In [None]:
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username=os.environ['USERNAME'], password=os.environ['GRAPHISTRY_PASSWORD'])

## Data cleaning
We already saved a copy of the dataset linked from the tweet (downloaded from Google Drive as of May 2022). We need to remove the first few rows to make a valid dataframe.

In [None]:
df = pd.read_csv('https://gist.githubusercontent.com/silkspace/f8d7b8f279a5ffbd710c301fc402ec43/raw/95a722f5c65812322eaf085c1123b58d3ec3da3a/jack_donations.csv')
df = df.fillna('')
columns = df.iloc[3].values  
ndf = pd.DataFrame(df[4:].values, columns=columns)
ndf

In [None]:
ndf.Category.unique()

# Create the Graph

We will use `g.umap` to featurize and create edges. The details of how UMAP creates edges between rows of the data are beyond the scope of this tutorial; suffice it to say that it automatically infers a network of related entities from their column features.

Here is the dataset as a graph:


In [None]:
g = graphistry.nodes(ndf).bind(point_title='Category').umap()
g.plot()  # fly around the clusters and click on nodes and edges. 

## The above featurized every column over the entire dataset. Exploring the nodes and their nearest neighbors shows that similar rows do cluster together -- all in two lines of code!

# Some light analysis and enrichment 

Let's convert the `Amount` column to numeric, and then see who is getting what by category and grantee.

In [None]:
#ndf.columns
ndf[' Amount ']

In [None]:
# let's convert money into float money (get it?)
from re import sub
from decimal import Decimal

def convert_money_string_to_float(money: str, return_float: bool = True):
    value = Decimal(sub(r"[^\d\-.]", "", money))  # preserves minus signs
    if return_float:
        return float(value)
    return value

ndf['$ amount'] = ndf[' Amount '].apply(lambda x: convert_money_string_to_float(x))

In [None]:
ndf['$ amount']

## Many of these categories are not distinct. They only seem distinct because the data comes in with different notation.

We will show in the next section how to deal with this by using the graphistry pipeline to convert the `Category` into a latent target that organizes the labels.


In [None]:
current_funding_by_category = ndf.groupby('Category')['$ amount'].sum()
current_funding_by_category.map(lambda x: '${:3,}'.format(x))

In [None]:
fig = plt.figure(figsize=(15,7))
current_funding_by_category.plot(kind='bar', rot=52)

In [None]:
grantees = ndf.groupby('Grantee')['$ amount'].sum()
grants_sorted = grantees.sort_values()
# top 10 recipients
grants_sorted[-10:].map(lambda x: '${:3,}'.format(x))[::-1]

In [None]:
# largest grants
fig = plt.figure(figsize=(15,7))
ax= plt.subplot()
# ax.set_xticks(range(len(label_list)))
# ax.set_xticklabels(label_list, rotation=19)
res = grants_sorted[-10:]

res.plot(kind='bar', rot=52)

In [None]:
# smallest grants
fig = plt.figure(figsize=(15,7))
ax= plt.subplot()
# ax.set_xticks(range(len(label_list)))
# ax.set_xticklabels(label_list, rotation=19)
res = grants_sorted[:10]

res.plot(kind='bar', rot = 52)

In [None]:
'Total Pledged ${:3,}'.format(current_funding_by_category.sum())

In [None]:
# and this should be the same too
'Total Pledged ${:3,}'.format(grantees.sum())

## Notice that the Category labels are mixed and interwoven 
We will show how a judicious choice of parameters can standardize them without having to do data cleaning or mapping.

In [None]:
ndf.Category.unique()  # seems like there are 4-6 topics here

_______________________________________

# Featurize II

Let's do it again and concentrate on a subset of the columns, to get a sense of the different ways to featurize named columns.
____________________________

In the following, we concentrate on the textual `Why?` column as it describes the row/entity in question. Further, we select `y='Category'` as a target variable, and will encode it using a Topic Model as well as standard One-Hot-Encoding.


In the following we will show how to encode textual and categorical data using 

1) Topic Models

2) Sentence Transformers

3) Ngrams 

And see the resulting graphs. We will use the Topic label generated by `y='Category'` to color the graphs, as well as `$ amount` 


# Topic Model (latent-) features

We encode the data using Topic Models. This turns the textual features into latent vectors. Likewise, we can do the same for the target data. 


Notice that we set `cardinality_threshold_target` very low and `min_words` very high to force featurization as topic models rather than one-hot or text (sentence, ngram) encodings:

1) Encode the target using a topic model, and set `n_topics_target` as the dimension of the latent target factorization. This choice is based on the fact that there are really only 4-6 or so distinct categories across the labels, but they are mixed together. The labels are in fact hierarchical categories. We can use the topic model to find the lowest moments of this hierarchical classification in the distributional sense.

2) Likewise for the features `Why?`, set `n_topics` as the dimension of the latent feature factorization.

In [None]:
g = graphistry.nodes(ndf).bind(point_title='Category')

g2 = g.umap(X=['Why?'], y = ['Category'], 
            min_words=50000, # encode as topic model by setting min_words high
            n_topics_target=4, # turn categories into a 4dim vector of regressive targets
            n_topics=21, # latent embedding size 
            cardinality_threshold_target=2, # make sure that we throw targets into topic model over targets
            ) 

In [None]:
g2._node_encoder.label_encoder

In [None]:
# pretend you have a minibatch of new data -- transform under the fit from the above
new_df, new_y = ndf.sample(5), ndf.sample(5) # pd.DataFrame({'Category': ndf['Category'].sample(5)})
a, b = g2.transform(new_df, new_y, kind='nodes')
a

In [None]:
b

In [None]:
plt.figure()
plt.imshow(g2._node_target, aspect='auto', cmap='hot')

In [None]:
g2._node_encoder.label_encoder

In [None]:
g2._node_encoder.y.plot(kind='bar', figsize=(15,7)) # easier to see than before

In [None]:
# likewise you can play with how many edges to include using,
g2 = g2.filter_weighted_edges(scale=0.25)  # lower positive values of scale mean closer similarity 


## We have featurized the data and also run UMAP, which projects the features into a 2-dimensional space while generating edges.

Plotting the result shows the similarity between entities. It does a good job overall at clustering by topic. Click in and check out some nearby nodes. 

In [None]:
g2.plot()

In [None]:
X = g2._node_features 
X

In [None]:
y = g2._node_target  # we've reduced 22 columns into 4 (n_topics_target=4)
y

In [None]:
## we can inspect the topics from the column headers
label_list = y.columns
label_list

In [None]:
## and see them across rows of the data
fig = plt.figure(figsize=(17,10))
ax = plt.subplot()
plt.imshow(y, aspect='auto', cmap='hot')
plt.colorbar()
plt.ylabel('row number of data')
ax.set_xticks(range(len(label_list)))
ax.set_xticklabels(label_list, rotation=39)
print(f'See the abundance of the data in the latent vector of the corresponding targets')

In [None]:
# find the marginal in the category topic distribution
y.sum(0).plot(kind='bar', ylabel='support across data', rot=79)

In [None]:
## Looking at the above bar chart we may read off the most 

In [None]:
# Let's see how the category columns are supported by the data
from collections import Counter
tops = y.values.argmax(1)
for topic_number in range(y.shape[1]):
    indices = np.where(tops == topic_number)[0]  # row positions whose strongest topic is this one
    top_category = Counter(ndf.loc[indices].Category)
    print()
    print('-'*50)
    print(f'Topic {topic_number}: \t\t\t\t Evidence')
    print(f'{y.columns[topic_number]}')
    print('-'*35)
    for t, c in top_category.most_common():
        print(f'-- {t},    {c}')

### We see that different spellings, spacing, and separators (`;`, `,`, etc.) map to the same topic. This is a useful way to disambiguate many similar categories without a lot of data cleaning and prep.

The choice of `n_topics_target` sets the prior on the dirty_cat `GapEncoder` used under the hood.
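One way to see why variant spellings can land in the same topic is to compare labels by their character n-grams, which is the kind of signal an n-gram-based encoder works from. The sketch below is only an illustration, not the `GapEncoder` algorithm: the helper names, the lowercasing step, and the label strings are all assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text."""
    lowered = text.lower()
    return {lowered[i:i + n] for i in range(len(lowered) - n + 1)}

def jaccard(a, b):
    """Overlap between two sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b)

# Hypothetical label variants, written the way messy category columns often are
labels = ["Crisis Relief, Health", "crisis relief;health", "Social Justice"]
grams = [char_ngrams(label) for label in labels]

same_topic = jaccard(grams[0], grams[1])       # two spellings of one label
different_topic = jaccard(grams[0], grams[2])  # two unrelated labels
print(f"variant spellings: {same_topic:.2f}, unrelated labels: {different_topic:.2f}")
```

The two spellings share most of their n-grams despite the different case and separator, while the unrelated label shares none, so a factorization over n-gram counts has good reason to group the variants.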

## Let's add the Category Topic Number as a feature to help us visualize using the Histogram Feature of the Graphistry UI

This reduces the naive one-hot-encoding of 22 columns down to the number set by `n_topics_target=4`.

In [None]:
tops

In [None]:
g2._nodes['topic'] = y.columns[tops]
ndf['topic'] = y.columns[tops]

In [None]:
g2._nodes.topic

------------------------------------------------------------------------------
In the plot below, use the histogram feature on the bottom right of the UI to color by `topic`


In [None]:
g3 = g2.bind(point_title='topic')
g3.plot()

In [None]:
## let's sum $$ across major topics

In [None]:
topic_sums = ndf.groupby('topic')['$ amount'].sum()
topic_sums.sort_values()[::-1].apply(lambda x : '${:3,}'.format(x))

## Hence we have Crisis Relief, Social Justice, Health Education Girls, and UBI occupying the main topics across the target.

------------------------------------------------------------------------------------------
# Let's move on to point 2) 
# Sentence Transformer Encodings

To trigger the sentence encoder, lower `min_words` (which we previously set higher than the number of words across the `Why?` column) to a small value or zero. `min_words` sets the minimum number of words a column needs before it is passed on to the (sentence, ngram) embedding pipelines, so a low value forces text encoding of any `X=[..]` columns.

Here, UMAP will work directly on the sentence transformer vector and expose a search interface.
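As a rough mental model of embedding-based search, the sketch below ranks rows by cosine similarity between a query vector and each row vector. Whether `g.search` uses exactly this metric is an assumption; the function names and the tiny 3-dimensional vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def search(query_vec, row_vecs, top_n=2):
    """Return (row_index, similarity) pairs, best match first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(row_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical 3-d embeddings; a real sentence transformer emits hundreds of dimensions
row_vecs = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.1), (0.8, 0.2, 0.1)]
query_vec = (1.0, 0.0, 0.0)

hits = search(query_vec, row_vecs)
print(hits)
```

Rows whose vectors point in nearly the same direction as the query rank first, which is why a query can match rows that share meaning rather than exact words.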

In [None]:
g2 = g.umap(X = ['Why?', 'Grantee'], y = 'Category', 
            min_words=0, 
            model_name ='paraphrase-MiniLM-L6-v2', 
            cardinality_threshold_target=2,
            scale=0.6)

In [None]:
g2.search('carbon neutral')[0][['Why?']]

In [None]:
'${:3,}'.format(g2.search('carbon neutral')[0]['$ amount'].sum())

In [None]:
g2.search('sustainable homes and communities')[0][['Why?','$ amount']]#.sum()

In [None]:
'${:3,}'.format(g2.search('sustainable homes and communities')[0]['$ amount'].sum())

In [None]:
# see the queries landscape  -- paste url with .plot(render=False)
g2.search_graph('sustainable homes and communities', scale=0.90, top_n=10).bind(point_title='Why?').plot(render=False)

In [None]:
# or transform on new data as before
a, b = g2.transform(new_df, new_y, kind='nodes')
a

## Clicking around to nearest neighbors demonstrates good semantic similarity, as seen by the Paraphrase Model `paraphrase-MiniLM-L6-v2`

In [None]:
g2.plot()

## Suppose we wanted to add the Grantee column as a feature: 
To include it in the sentence transformer model, reduce the `min_words` threshold until the column qualifies. If we instead want the column `Grantee` to be encoded as a topic model, set `min_words` between the average word count of `Why?` (higher) and those of `Grantee` (lower) and `$ amount` (which is just 1). This may seem a bit sloppy as an API, but it is useful across many datasets, since if a column is truly categorical, its cardinality is usually well under that of a truly textual feature. Moreover, if you want all columns to be textually encoded, set `min_words=0`.
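The routing described above can be pictured with a small sketch. It assumes the rule is "average words per cell at or above `min_words` means text encoding"; the exact comparison inside the library may differ, and `average_words`, `route_columns`, and the sample rows are hypothetical.

```python
def average_words(values):
    """Mean number of whitespace-separated words per cell."""
    return sum(len(str(v).split()) for v in values) / len(values)

def route_columns(table, min_words):
    """Split column names into text-encoded and other, by average word count."""
    text_cols, other_cols = [], []
    for name, values in table.items():
        if average_words(values) >= min_words:
            text_cols.append(name)   # long enough to treat as free text
        else:
            other_cols.append(name)  # left for topic, one-hot, or numeric handling
    return text_cols, other_cols

# Hypothetical rows standing in for the notebook's columns
table = {
    "Why?": ["to fund food banks across the state", "to supply protective equipment to clinics"],
    "Grantee": ["Food Bank", "Clinic Alliance"],
    "$ amount": [100000.0, 250000.0],
}
text_cols, other_cols = route_columns(table, min_words=3)
print(text_cols, other_cols)
```

With `min_words=3` only the long `Why?` column qualifies as text, while `min_words=0` would send every column down the text path.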

In [None]:
g2 = g.umap(X = ['Why?', 'Grantee', '$ amount'], y = 'Category',
            min_words=2,
            model_name ='paraphrase-MiniLM-L6-v2',
            use_scaler=None,
           ) 

In [None]:
g2._node_encoder.text_cols

In [None]:
# just for fun, can we find outliers (which we know will be influenced by the numeric $ amount)
from graphistry.outliers import detect_outliers

# organized by amount
embedding = g2._xy
clfs, ax, fig = detect_outliers(embedding.values, name='Donations', contamination=0.3)

In [None]:
# the different models
clfs

In [None]:
g2.plot() # color/size the nodes by `$ amount`

# Lastly, suppose we want a plain Ngrams model matrix, and for a change, one-hot-encode the target `Category`

Set `use_ngrams = True`
and set the `cardinality_threshold_target` > cardinality(`Category`).

UMAP will work directly on the ngrams matrix, and any other feature column one may transform. 
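For readers new to the ngram parameters, here is a minimal sketch of how `ngram_range`, `min_df`, and `max_df` shape a vocabulary. It assumes they carry their usual scikit-learn vectorizer meanings (an integer `min_df` is a document count, a float `max_df` is a fraction of documents). The helper names and the six sample documents are made up, and the example uses `max_df=0.5` so that a corpus this small still keeps some terms.

```python
from collections import Counter

def word_ngrams(text, ngram_range=(1, 3)):
    """All word n-grams of a text for n in the inclusive range."""
    words = text.lower().split()
    lo, hi = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(words) - n + 1)]

def build_vocabulary(docs, ngram_range=(1, 3), min_df=2, max_df=0.3):
    """Keep n-grams found in at least min_df docs and at most max_df of all docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(word_ngrams(doc, ngram_range)))  # count each doc once
    ceiling = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= ceiling)

# Hypothetical mini-corpus
docs = [
    "clean water for schools", "clean water for villages", "food for schools",
    "food for families", "masks for nurses", "shelter for families",
]
vocab = build_vocabulary(docs, max_df=0.5)
print(vocab)
```

Terms that appear in every document (here `for`) are dropped as uninformative, terms that appear only once are dropped as noise, and multi-word phrases survive when they recur.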

In [None]:
g3 = g.umap(X = ['Why?', 'Grantee'], y = 'Category', 
            use_ngrams=True, 
            ngram_range=(1,3), 
            min_df=2, 
            max_df=0.3,
            cardinality_threshold_target=400
           )  # this will one-hot-encode the target, as we have fewer than 400 total `categories`

In [None]:
g3.bind(point_title='Category').plot()

In [None]:
g3._node_features  # a standard tfidf ngrams matrix

In [None]:
g3._node_encoder.text_model  #sklearn pipeline 

In [None]:
## vocab size
len(g3._node_encoder.text_model[0].vocabulary_)

In [None]:
# or transform new data: 
emb, a, b = g3.transform_umap(new_df, new_y, kind='nodes')  # use the ngrams model fit above
emb

In [None]:
# we include the naive indicator variable for completeness.
y = g3._node_target
label_list = y.columns  # tick labels must come from the matrix being plotted

fig = plt.figure(figsize=(17,10))
ax = plt.subplot()
plt.imshow(y, aspect='auto', cmap='hot')
plt.colorbar()
plt.ylabel('row number of data')
ax.set_xticks(range(len(label_list)))
ax.set_xticklabels(label_list, rotation=49)
print('Naive Indicator Variables')

# Contributions

We've seen how we may pull in tabular data that exists in the wild and quickly make features and graphs that allow semantic and topological exploration and traversals. 

In this way one can quickly track a variety of datasets and (in this case) gauge growth, investment, and promise fulfillment transparently using Graph Thinking and analysis.

Encoding text, categorical, and numeric features while exploring the relationships can be time consuming tasks. We hope that Graphistry[ai] demonstrates an exciting and visually compelling way to explore Graph Data. 

Now you can mix and match features, augment them with more columns via enrichment, and pivot large amounts of data using natural language search, all using a few lines of code. The features produced may then be used in downstream models, whose outputs could be added and the entire process repeated.

Let us know what you think!

Join our Slack: Graphistry-Community
