# Section 1: Linear Algebra and Cosine Similarity
We have four wealthy donors and want to understand how similar their donation behaviors are. These donors (James, Beth, Albert, and Stacy) each give to several different charities on an annual basis, some of which overlap:

```
|-------------------------------------------------------------|
| Charity                 |  James |   Beth | Albert |  Stacy |
| The Nature Conservancy  |    $50 |   $400 |   $100 |        |
| National Public Radio   |    $30 |        |   $100 |   $300 |
| St. Jude’s Children’s   |        |   $400 |   $200 |   $500 |
| Doctors Without Borders |    $10 |   $100 |   $500 |   $200 |
| World Wildlife Fund     |    $40 |   $500 |        |   $400 |
|-------------------------------------------------------------|
```


How similar are the philanthropic interests of our donors (as demonstrated by the magnitude of donations to the same charities)?

```
|          James   Beth  Albert   Stacy
| James        1  0.976   0.423   0.983
| Beth     0.976      1   0.540   0.966
| Albert   0.423  0.540       1   0.681
| Stacy    0.983  0.966   0.681       1
```

How do we do this?

Write some Python code using NumPy linear algebra functions such as np.dot() and np.linalg.norm().
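
For reference, the standard definition of cosine similarity between two donation vectors $u$ and $v$ is sketched below. The symbols $u$, $v$, and the charity index $k$ are chosen here to match the arguments of the cosine() function later in this section.

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_k u_k v_k}{\sqrt{\sum_k u_k^2}\,\sqrt{\sum_k v_k^2}}
$$

np.dot() computes the numerator and np.linalg.norm() computes each factor in the denominator. Because donation amounts are never negative, the result falls between 0 and 1.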

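Extra consideration: do we compare donors across ALL charities, or only across the charities to which both members of the pair have donated? The similarity matrix above, and the code below, use only the charities that both donors gave to.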
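
To see how much this choice matters, here is a minimal sketch for one pair, James and Beth, using the amounts from the table above. The helper name cos_sim and the decision to treat a missing donation as $0 in the "all charities" case are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

# Donations in table order; np.nan marks "no donation".
james = np.array([50, 30, np.nan, 10, 40])
beth = np.array([400, np.nan, 400, 100, 500])

def cos_sim(u, v):
    """Plain cosine similarity of two 1-D arrays with no missing values."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Option 1: all charities, treating a missing donation as $0 (an assumption).
sim_all = cos_sim(np.nan_to_num(james, nan=0.0), np.nan_to_num(beth, nan=0.0))

# Option 2: only the charities that both donors gave to.
both = ~np.isnan(james) & ~np.isnan(beth)
sim_overlap = cos_sim(james[both], beth[both])

print(round(sim_all, 3), round(sim_overlap, 3))
```

With these numbers the all-charities score comes out near 0.754, while the overlap-only score is about 0.976, matching the James/Beth entry in the matrix above. Zero-filling penalizes a pair whenever one donor skips a charity the other supports; masking ignores those charities entirely.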


In [None]:
# Import python packages
import streamlit as st
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


In [None]:
import numpy as np

data = pd.DataFrame({
    "James": [50, 30, np.nan, 10, 40],
    "Beth": [400, np.nan, 400, 100, 500],
    "Albert": [100, 100, 200, 500, np.nan],
    "Stacy": [np.nan, 300, 500, 200, 400]
}, index = [
    "Nature Conservancy",
    "National Public Radio",
    "St. Jude's",
    "Doctors Without Borders",
    "World Wildlife Fund"
])

In [None]:
data

In [None]:
# Function to compute cosine similarity with overlap masking
def cosine(u, v):
    """Cosine similarity of u and v over the entries where both are present."""

    # Only TRUE when both are not NAN
    mask = ~np.isnan(u) & ~np.isnan(v)

    # If none overlap, return None
    if not mask.any():
        return None

    u_masked, v_masked = u[mask], v[mask]
    dot = np.dot(u_masked, v_masked)
    norm_u = np.linalg.norm(u_masked)
    norm_v = np.linalg.norm(v_masked)

    # If denominator is 0, return None
    if norm_u == 0 or norm_v == 0:
        return None
    return dot / (norm_u * norm_v)

In [None]:
# Pairwise similarities
donors = data.columns
sim_matrix = pd.DataFrame(index=donors, columns=donors, dtype=float)

# Iterate through all the pairs of donors
for i in donors:
    for j in donors:
        if i == j:
            sim_matrix.loc[i, j] = 1.0
        else:
            sim_matrix.loc[i, j] = cosine(data.loc[:,i].values, data.loc[:,j].values)

sim_matrix.round(3)
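
As a variation on the loop above, the same overlap-masked similarities can be computed with a few matrix products. This is a sketch under the assumption that NaN is the only marker for a missing donation. The names present, filled, sq_norms, and sim_vectorized are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "James": [50, 30, np.nan, 10, 40],
    "Beth": [400, np.nan, 400, 100, 500],
    "Albert": [100, 100, 200, 500, np.nan],
    "Stacy": [np.nan, 300, 500, 200, 400],
}, index=["Nature Conservancy", "National Public Radio", "St. Jude's",
          "Doctors Without Borders", "World Wildlife Fund"])

present = data.notna().to_numpy().astype(float)  # 1.0 where a donation exists
filled = data.fillna(0).to_numpy()               # missing donations become 0

# Dot products over shared charities: a 0 on either side drops that charity.
dots = filled.T @ filled

# sq_norms[i, j] = squared norm of donor i over the charities donor j gave to.
sq_norms = (filled ** 2).T @ present

# Pairs with no usable overlap have a zero denominator; report NaN for them.
denom = np.sqrt(sq_norms * sq_norms.T)
with np.errstate(divide="ignore", invalid="ignore"):
    sims = np.where(denom > 0, dots / denom, np.nan)

sim_vectorized = pd.DataFrame(sims, index=data.columns, columns=data.columns)
print(sim_vectorized.round(3))
```

The trick is that zero-filling already removes non-shared charities from the dot product, so only the norms need an explicit mask. For four donors the loop is just as practical; the matrix form mainly helps when the number of donors grows.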

## Alternative 2: sklearn cosine_similarity

scikit-learn has a built-in cosine_similarity() function that we can use here, too. Check the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
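
Before using it in a loop, it helps to look at the shapes involved. The sketch below passes one pair of already-masked vectors (James and Beth on the three charities they share) to cosine_similarity(). The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# James and Beth on the three charities both gave to (from the table above).
james = np.array([50, 10, 40])
beth = np.array([400, 100, 500])

# Each input must be 2-D with one row per sample, so reshape (3,) to (1, 3).
result = cosine_similarity(james.reshape(1, -1), beth.reshape(1, -1))

print(result.shape)            # a 2-D array, even for a single pair
print(round(result[0, 0], 3))  # index [0, 0] to pull out the scalar
```

For a single pair the function returns a 1x1 array rather than a plain number, so the scalar has to be extracted with [0, 0] before it is stored in a DataFrame cell. The value here is about 0.976, in line with the matrix at the top of this section.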

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise similarities
donors = data.columns
sk = pd.DataFrame(index=donors, columns=donors, dtype=float)

# Iterate through all the pairs of donors
for i in donors:
    for j in donors:
        if i == j:
            sk.loc[i, j] = 1.0
        else:
            # We still have to mask the data
            u = data.loc[:,i].values
            v = data.loc[:,j].values
            mask = ~np.isnan(u) & ~np.isnan(v)

            # But we can use the cosine_similarity function from
            # sklearn after forcing our values into the right shape.
            # We have 1-D arrays of shape (n,); cosine_similarity wants
            # 2-D input with one row per sample, here shape (1, n).
            # It returns a 1x1 array, so take element [0, 0].
            sk.loc[i, j] = cosine_similarity(
                u[mask].reshape(1, -1),
                v[mask].reshape(1, -1))[0, 0]

sk.round(3)