<a href="https://colab.research.google.com/github/jamiehadd/Math189AD-MathematicalDataScienceAndTopicModeling/blob/main/tutorials/Tensor_Decomposition_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tensor Decomposition Models

In this notebook, we'll investigate tensor structure and the CP decomposition model.

## **Activity**

In this notebook, we'll explore the CP decomposition on a real-world dataset. You'll then have the opportunity to explore these models and applications on a new dataset!  Try to select a dataset where you expect the outputs of these models to be interpretable and easy to visualize.

### Tasks

* Explore the code below for executing and investigating a CP decomposition on a real-world dataset!
* Select a dataset and investigate it -- what are the elements and entries?  Try visualizing the data in various formats (e.g., heatmaps, histograms/counts of values, visualizing individual elements, etc.).
* Apply a CP decomposition model to the data.  You can simply pick a value for the model rank R, as in the example below, or you can try to select it systematically, as we did in the NMF live script (using the error plot).
* Create some summative visualizations of the results -- think about what visualizations best serve what you have observed/want to show about the results!
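
One way to pick R systematically, as the third task suggests, is to fit the model for several candidate ranks and look for an elbow in the relative reconstruction error. The sketch below is a minimal illustration on a small synthetic three-way tensor. The helper names `cp_als` and `relative_error` are hypothetical, and the plain NumPy alternating least squares loop is a stand-in for illustration, not TensorLy's `parafac` implementation.

```python
import numpy as np

def cp_als(X, rank, n_iter=100, seed=0):
    """Fit a CP model to a dense 3-way array with plain ALS (illustrative only)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem for one factor
        # while the other two are held fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def relative_error(X, factors):
    """Relative Frobenius-norm error of the CP reconstruction."""
    A, B, C = factors
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Synthetic tensor built from two rank-one components
rng = np.random.default_rng(1)
true_factors = [rng.standard_normal((n, 2)) for n in (6, 5, 4)]
X = np.einsum('ir,jr,kr->ijk', *true_factors)

# Fit each candidate rank and record the error
errors = {R: relative_error(X, cp_als(X, R)) for R in (1, 2, 3)}
for R, err in errors.items():
    print(f"R={R}: relative error {err:.4f}")
```

Plotting `errors` against R gives the kind of error plot mentioned in the task; the rank at which the curve flattens is a reasonable candidate.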

### Downloading Software

We'll be using a Python package known as [TensorLy](http://tensorly.org/stable/index.html), which contains many useful functions for working with tensors and their decompositions.  We'll also need the `sparse` package, which TensorLy's sparse tensor support depends on.

In [None]:
%pip install -U tensorly

In [None]:
%pip install sparse

In [None]:
import tensorly as tl
import numpy as np
import sparse
import pandas as pd
import tensorly.contrib.sparse as stl
from tensorly.contrib.sparse.decomposition import parafac
import matplotlib.pyplot as plt

### Loading Data

We're working here with a dataset consisting of six months of Uber pickup data in New York City, obtained by [fivethirtyeight](https://www.kaggle.com/datasets/fivethirtyeight/uber-pickups-in-new-york-city) through a Freedom of Information request. The data covers April 2014 through September 2014. Latitude and longitude values are rounded to three decimal places (roughly 110 meters of resolution). Tensor values are integer counts of pickups.  The tensor is 183 x 24 x 1,140 x 1,717 (days x hours x latitude x longitude).
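
A quick back-of-the-envelope calculation suggests why this tensor is stored in a sparse format. The sketch below assumes 8-byte floating-point entries; that storage size is an assumption made for the estimate.

```python
# Dimensions quoted above: days x hours x latitude x longitude
shape = (183, 24, 1140, 1717)

n_entries = 1
for dim in shape:
    n_entries *= dim

bytes_per_entry = 8  # assumption: float64 storage
dense_gb = n_entries * bytes_per_entry / 1e9

print(f"{n_entries:,} entries, about {dense_gb:.1f} GB if stored densely")
```

That is roughly 8.6 billion entries, far more than fits comfortably in memory as a dense array, so only the nonzero counts are kept.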

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd '/content/drive/Shareddrives/Math 189AD FA22: Datasets'

In [None]:
def load_data(path):
    """Load the sparse tensor dataset at `path`"""
    values = []
    coords = []
    with open(path, 'rb') as f:
        for line in f:
            data = line.strip().split(b' ')
            coords.append([int(i) - 1 for i in data[:-1]])
            values.append(float(data[-1]))
    coords = np.array(coords, dtype=np.int64).T
    values = np.array(values, dtype=np.float64)
    return sparse.COO(coords, data=values)

In [None]:
tensor = load_data('uber.tns')
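
Judging from `load_data`, each line of the file holds one nonzero entry: space-separated 1-based indices, one per mode, followed by the value. The sketch below parses a few made-up lines the same way to show how the coordinate array ends up with one row per mode and 0-based indices. The sample lines are invented for illustration and are not taken from `uber.tns`.

```python
import io
import numpy as np

# Three invented entries in "day hour lat long count" form (1-based indices)
sample = io.BytesIO(b"1 1 5 7 2.0\n1 2 5 8 1.0\n3 24 9 7 4.0\n")

coords, values = [], []
for line in sample:
    fields = line.strip().split(b' ')
    coords.append([int(i) - 1 for i in fields[:-1]])  # shift to 0-based
    values.append(float(fields[-1]))

coords = np.array(coords, dtype=np.int64).T   # shape: (n_modes, n_nonzeros)
values = np.array(values, dtype=np.float64)

print(coords.shape)             # one row per mode, one column per nonzero
print(coords[:, 2], values[2])  # indices and value of the third entry
```

This `(n_modes, n_nonzeros)` layout is the coordinate format that `sparse.COO` is given in `load_data`.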

### Training a CP Model

Now we will train a rank-three CP decomposition model on the tensor of data we've constructed.



In [None]:
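# Wrap the sparse COO array as a TensorLy sparse tensor
tensorly_tensor = stl.tensor(tensor, dtype='float')

# Fit a rank-3 CP model; with only 10 iterations the fit may not have converged
weights, factors = parafac(tensorly_tensor, 3, n_iter_max=10, init='random')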

The fitted factor matrices (one per mode) are now held in `factors`, and the component weights in `weights`.
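
To make the roles of the weights and factor matrices concrete: a rank-R CP model approximates entry (i, j, k, l) of the tensor by a weighted sum of R products of factor-matrix entries. The sketch below spells this out in NumPy on small random stand-in factors. The sizes and the names `w`, `A`, `B`, `C`, `D` are illustrative and are not the fitted Uber factors.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 3
shape = (5, 4, 3, 2)  # toy sizes for the four modes
w = rng.random(R)     # stand-in component weights
A, B, C, D = (rng.random((n, R)) for n in shape)

# Full reconstruction: X_hat[i,j,k,l] = sum_r w[r] A[i,r] B[j,r] C[k,r] D[l,r]
X_hat = np.einsum('r,ir,jr,kr,lr->ijkl', w, A, B, C, D)

# The same formula evaluated for a single entry
i, j, k, l = 2, 1, 0, 1
entry = sum(w[r] * A[i, r] * B[j, r] * C[k, r] * D[l, r] for r in range(R))

print(np.isclose(X_hat[i, j, k, l], entry))
```

For a tensor as large as the Uber data, forming the full dense reconstruction is likely impractical, so evaluating individual entries (or small slices) this way is the safer option.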

### Investigating the Decomposition Results

We will visualize the information from the tensor decomposition to understand patterns within the data.  First, it is important to note the dimensions of the factor matrices so we are sure which modes of the tensor they correspond to.

In [None]:
days_factor = factors[0].todense()
hours_factor = factors[1].todense()
lat_factor = factors[2].todense()
long_factor = factors[3].todense()

print(np.shape(days_factor))
print(np.shape(hours_factor))
print(np.shape(lat_factor))
print(np.shape(long_factor))

Check that you recognize the row dimension of each of these four factor matrices!

The row dimension of the first factor matrix corresponds to the 183 days from April through September 2014.  Let's plot each column of this factor matrix as a time series.

In [None]:
plt.figure(figsize=[15,10])

plt.suptitle("Day Components")
plt.subplot(3,1,1);
plt.plot(days_factor[:,0])
plt.title("First Component")

plt.subplot(3,1,2);
plt.plot(days_factor[:,1])
plt.title("Second Component")

plt.subplot(3,1,3);
plt.plot(days_factor[:,2])
plt.title("Third Component")
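
To interpret the day components, it can help to map row indices back to calendar dates. The sketch below assumes row 0 corresponds to April 1, 2014, which is an assumption about the tensor's indexing rather than something stated in the dataset description. It averages a synthetic stand-in component by weekday; with the fitted model, a column such as `days_factor[:, 0]` would take the place of `component`.

```python
from datetime import date, timedelta
import numpy as np

start = date(2014, 4, 1)  # assumption: row 0 is April 1, 2014
dates = [start + timedelta(days=i) for i in range(183)]

# Synthetic stand-in for one column of days_factor
rng = np.random.default_rng(0)
component = rng.random(183)

weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
weekdays = np.array([d.weekday() for d in dates])  # 0 = Monday

# Average component intensity for each day of the week
weekday_means = {name: component[weekdays == k].mean()
                 for k, name in enumerate(weekday_names)}

print(dates[0], dates[-1])
for name, mean in weekday_means.items():
    print(f"{name}: {mean:.3f}")
```

A component whose weekday averages differ sharply between weekdays and weekends would suggest that it captures a weekly rhythm.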

The row dimension of the second factor matrix corresponds to the 24 hours of the day.  Let's again plot the intensity of the columns in this factor matrix as time series.

In [None]:
plt.figure(figsize=[15,10])

plt.suptitle("Hour Components")
plt.subplot(3,1,1);
plt.plot(hours_factor[:,0])
plt.title("First Component")

plt.subplot(3,1,2);
plt.plot(hours_factor[:,1])
plt.title("Second Component")

plt.subplot(3,1,3);
plt.plot(hours_factor[:,2])
plt.title("Third Component")

Finally, the row dimensions of the remaining two factor matrices correspond to the latitude and longitude of the Uber pickups.  Since location is two-dimensional, we can visualize these two factors together.  For each component, we form a heatmap of the joint latitude and longitude intensity by taking the outer product of the corresponding columns of the two factors.

In [None]:
plt.figure(figsize=[15,4])

plt.suptitle("Geographic Components")
plt.subplot(1,3,1);
plt.imshow(np.outer(lat_factor[:,0],long_factor[:,0]))
plt.title("First Component")

plt.subplot(1,3,2);
plt.imshow(np.outer(lat_factor[:,1],long_factor[:,1]))
plt.title("Second Component")

plt.subplot(1,3,3);
plt.imshow(np.outer(lat_factor[:,2],long_factor[:,2]))
plt.title("Third Component")

# Note: this colorbar is attached to the last subplot only
plt.colorbar()
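
The outer product used for each heatmap can be checked on a tiny example. The sketch below uses made-up stand-in vectors (`lat` and `lon` are illustrative, not columns of the fitted factors) to show that entry (i, j) of the map is the product of the i-th latitude weight and the j-th longitude weight, so every such map has rank one.

```python
import numpy as np

lat = np.array([0.0, 1.0, 2.0])       # stand-in latitude factor column
lon = np.array([1.0, 0.5, 0.0, 2.0])  # stand-in longitude factor column

heat = np.outer(lat, lon)  # heat[i, j] == lat[i] * lon[j]

print(heat.shape)                   # rows index latitude, columns index longitude
print(np.linalg.matrix_rank(heat))  # numerical rank of the map
```

Note that `plt.imshow` places row 0 at the top by default. If the latitude index increases with latitude (an assumption about how the tensor was built), passing `origin='lower'` would orient the maps with north at the top.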