# Ch3: Geometry & Algebra of Tensors
    
## 3.1 Motivation and Intuition

An example of video analysis with tensor decomposition in Python can be found at:
https://towardsdatascience.com/video-analysis-with-tensor-decomposition-in-python-3a1fe088831c

Other examples are presented below. Each uses a tensorisation step that is hand-tailored to the example, with some attempts to generalise.

In [None]:
import math
import numpy as np

def tensorisation_Values (Df, components, value="value", aggFunc=np.mean):
    # the first `components` columns are the coordinates, `value` names the column to aggregate
    if components >= Df.shape[1]:
        print ("Number of components must be less than the number of columns in the input matrix. Exiting without creating the tensor")
        return
    Df = Df.copy() # work on a copy, so the caller's DataFrame is not modified
    tensorShape = []
    for i in range(components):
        col = Df.columns[i]
        # this will be an index, therefore flooring to integers and starting from zero is necessary
        minVal = math.floor(Df[col].min())
        Df[col] = (np.floor(Df[col]) - minVal).astype(int)
        maxVal = int(Df[col].max())
        print("mode " + str(i) + " max value = " + str(maxVal) + ", min value = 0")
        # the max index plus one is the size of this mode in the tensor shape
        tensorShape.append(maxVal + 1)

    print (tensorShape)
    tensorShape = tuple(tensorShape)
    tensor_array = np.zeros(tensorShape)
    # aggregate all rows that share the same coordinates, then write each aggregate into the tensor
    aggregated = Df.groupby(list(Df.columns[:components]))[value].agg(aggFunc)
    for idx, val in aggregated.items():
        t_index = idx if isinstance(idx, tuple) else (idx,)
        tensor_array[t_index] = val

    return tensor_array
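
The idea behind the tensorisation step can be sketched on toy data with numpy only. This is a minimal illustration, not the function above: the names `coords`, `values`, `sums` and `counts` are made up for the example, and the aggregation is fixed to the mean. Each row of `coords` is one observation's index into the tensor, and repeated coordinates are averaged.

```python
import numpy as np

# toy observations: (row, column) coordinates and one value per observation
coords = np.array([[0, 0], [0, 1], [1, 1], [1, 1]])
values = np.array([10.0, 20.0, 30.0, 50.0])

# the size of each mode is the largest index in that mode plus one
shape = tuple(coords.max(axis=0) + 1)

# accumulate sums and counts per coordinate; np.add.at handles repeated indices
sums = np.zeros(shape)
counts = np.zeros(shape)
np.add.at(sums, tuple(coords.T), values)
np.add.at(counts, tuple(coords.T), 1)

# mean per cell, leaving cells without observations at zero
tensor = np.divide(sums, counts, out=np.zeros(shape), where=counts > 0)
print(tensor)
```

The cell at (1, 1) receives two observations, so it holds their mean, while the empty cell at (1, 0) stays at zero, the same fill value used by `np.zeros` in the function above.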

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np

userdata = pd.read_csv('data/GlobalLandTemperaturesByMajorCity.csv')
columns = ['dt','AverageTemperature','AverageTemperatureUncertainty','City','Country','Latitude','Longitude'] # such that these will be x
df=userdata[columns].copy()
Values = userdata["AverageTemperature"] # this will be the values of the tensor
df

In [None]:
# We will tensorise in the same order mentioned in chapter 3 first section on motivational problem
# First example is one temperature value as a scalar (rank-zero tensor), which does not require tensorisation, just indexing will do
df.iloc[0, 1] # specifying the index values, which is not usually interpreted easily


In [None]:
df.loc[(df['City']== 'Abidjan') & (df['dt']=='1849-01-01') ,['AverageTemperature']] # specifying the values needed (city and date values)

In [None]:
# Second example is all temperature values in this dataset or in a specific city as a vector (rank-one tensor), which does not require tensorisation, just indexing will do

df.loc[df['City'] == 'Abidjan']['AverageTemperature']
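
As a small aside, the progression from scalar to vector to matrix to higher-way tensor used in these examples can be checked with numpy's `ndim`, which counts the number of modes. The arrays below are made-up toy values, not taken from the temperature dataset.

```python
import numpy as np

# toy arrays with an increasing number of modes
scalar = np.array(26.7)                       # a single temperature value
vector = np.array([26.7, 27.4, 28.1])         # temperatures of one city over time
matrix = np.array([[26.7, 27.4],
                   [5.1, 6.3]])               # cities x dates
cube = np.zeros((2, 3, 4))                    # e.g. latitude x longitude x date

modes = [a.ndim for a in (scalar, vector, matrix, cube)]
for name, a in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(name, "modes:", a.ndim, "shape:", a.shape)
```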

In [None]:
# Third example is temperature values per city (rank-two tensor). We will do it using pivot tables, and again with tensorisation
# pivot tables are another method for handling multi-way analysis
# computing the mean temperature for each city

cityTempMeans = df.pivot_table(values='AverageTemperature',
                               index='City', 
                               aggfunc=np.mean,               ## aggfunc='size', # size if you want to aggregate by frequency counting
                               fill_value=0)
cityTempMeans

In [None]:
cityTempMeans.iloc[0,0] # Average temperature for Abidjan by indexing the pivot table

In [None]:
df.loc[df['City'] == 'Abidjan', 'AverageTemperature'].mean() # reaching the same value from the original DataFrame; any slight difference from the pivot table is probably a floating-point rounding error rather than a different mean function
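
To see how the two routes to the mean compare, here is a small sketch on made-up data (the `toy` DataFrame is an assumption for illustration, not the real dataset). It suggests that the pivot table and the direct `.mean()` agree and both skip missing values, while `np.mean` on the raw numpy array does not skip them.

```python
import numpy as np
import pandas as pd

# toy data with one missing temperature for Abidjan
toy = pd.DataFrame({
    "City": ["Abidjan", "Abidjan", "Abidjan", "Cairo"],
    "AverageTemperature": [26.0, 27.0, np.nan, 21.0],
})

abidjan = toy.loc[toy["City"] == "Abidjan", "AverageTemperature"]

# mean through a pivot table (rows are sorted by city, so Abidjan is first)
pivot_mean = toy.pivot_table(values="AverageTemperature", index="City", aggfunc="mean").iloc[0, 0]
# mean directly on the selected rows; pandas skips NaN by default
direct_mean = abidjan.mean()
# mean on the raw numpy array; NaN propagates
raw_mean = np.mean(abidjan.to_numpy())

print(pivot_mean, direct_mean, raw_mean)
```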

In [None]:
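# quick check of the minimum of the first coordinate column (sdf is defined in the next cell, so run that cell first)
for i in range(1):
    print(sdf.iloc[:,i].min())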

In [None]:
# tensorising by arranging the columns as coordinates first, and last column is the value to aggregate on

columns = ['City','AverageTemperature'] # such that these will be x
sdf=userdata[columns].copy()

# we need to encode all coordinate columns numerically
from sklearn import preprocessing

# the city_encoder object maps word labels to integer codes.
city_encoder = preprocessing.LabelEncoder()

sdf['CityBasis']= city_encoder.fit_transform(sdf['City'])
sdf['CityBasis'].unique()

In [None]:
city_encoder.classes_

In [None]:
city_encoder.inverse_transform([0])

In [None]:
city_encoder.transform(['Abidjan'])
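
What the encoder does can be sketched with numpy alone. This swaps in `np.unique` for `LabelEncoder`; both assign integer codes to the sorted unique labels. The city list below is made up for the illustration.

```python
import numpy as np

cities = np.array(["Cairo", "Abidjan", "Cairo", "Berlin"])

# classes are the sorted unique labels, codes are the index of each label in classes
classes, codes = np.unique(cities, return_inverse=True)
print(classes)   # sorted unique labels
print(codes)     # integer code per row

# decoding is plain indexing, the counterpart of inverse_transform
decoded = classes[codes]
```

The integer codes are what we use as the coordinate basis of a tensor mode, and `classes` lets us translate an index back to a city name.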

In [None]:
columns = ['CityBasis', 'City','AverageTemperature'] # such that these will be x
sdf=sdf[columns].copy()
sdf

In [None]:
import tensorly as tl

tensor_array = tensorisation_Values (sdf, 1, value="AverageTemperature", aggFunc=np.mean)
tensor2 = tl.tensor(tensor_array)
tensor2.shape


In [None]:
tensor2 # the tensorisation function can also be updated to handle NaN values, and to be vectorised and optimised for parallel processing

In [None]:
# Fourth example is temperature values per location as Latitude & Longitude  (rank-three tensor). 
# Attempting pivot tables will still give rank-two, as we create a matrix with both location columns flattened into one mode
columns = ['dt','AverageTemperature','AverageTemperatureUncertainty','City','Country','Latitude','Longitude'] # such that these will be x


locTempMeans = df.pivot_table(values='AverageTemperature',
                               index=['Latitude','Longitude'], 
                               aggfunc=np.mean,               ## aggfunc='size', # size if you want to aggregate by frequency counting
                               fill_value=0)
locTempMeans



In [None]:
#We will do it using tensorisation

columns = ['Latitude','Longitude','AverageTemperature'] 
sdf=userdata[columns].copy()

sdf # 'Latitude' and 'Longitude' contain numeric values followed by N/S in the first and E/W in the second, which are degree minute second (DMS) coordinates
# We will need to encode them numerically, so they can be turned into a coordinate basis using decimal degrees
# there is a solution here https://medium.com/@quinn.dougherty92/simple-geographical-encoding-8293fde9e964

In [None]:
!pip install dms2dec

In [None]:
# geopandas has interesting solutions, but will use dms2dec for simplicity
from dms2dec.dms_convert import dms2dec
sdf['Latitude'] = sdf['Latitude'].apply(dms2dec)
sdf['Longitude'] = sdf['Longitude'].apply(dms2dec)
sdf['Latitude'] = sdf['Latitude'].astype(float)
sdf['Longitude'] = sdf['Longitude'].astype(float)
sdf
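
For reference, the conversion itself is simple enough to sketch by hand. The two helpers below are hypothetical and are not part of `dms2dec`. The first assumes the text is a number followed by a single hemisphere letter (for example `'5.63N'`), and the second applies the usual degrees + minutes/60 + seconds/3600 formula. South and West become negative.

```python
def to_signed_degrees(text):
    """Convert text such as '5.63N' or '3.23W' to signed decimal degrees (assumed format)."""
    value, hemisphere = float(text[:-1]), text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("expected a trailing N, S, E or W: " + text)
    return -value if hemisphere in ("S", "W") else value


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds and a hemisphere letter to signed decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere.upper() in ("S", "W") else decimal


print(to_signed_degrees("5.63N"), to_signed_degrees("3.23W"))
print(dms_to_decimal(30, 30, 0, "S"))
```

Negative values are why the tensorisation function has to shift every coordinate column so that its minimum becomes zero before using it as an index.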

In [None]:
# we update the tensorisation function so that each tensor entry is the aggregate (mean or other function) of
# the values of all rows returned from a condition per coordinate column in the dataframe
# this is almost tailored to Latitude & Longitude specific values; it would be better to update it to ranges, such that
# values are aggregated when they are >= the current basis index and < the next basis index
# many optimisations can be achieved through vectorisation, parallelisation and quantisation
def tensorisation2_Values (Df, components, value="value", aggFunc=np.mean):
    if components >= Df.shape[1]:
        print ("Number of components must be less than the number of columns in the input matrix. Exiting without creating the tensor")
        return
    Df = Df.copy() # work on a copy, so the caller's DataFrame is not modified
    tensorShape = []
    for i in range(components):
        col = Df.columns[i]
        # this will be an index, therefore flooring to integers and starting from zero is necessary
        minVal = math.floor(Df[col].min())
        Df[col] = (np.floor(Df[col]) - minVal).astype(int)
        maxVal = int(Df[col].max())
        print("mode " + str(i) + " max value = " + str(maxVal) + ", min value = 0")
        # the max index plus one is the size of this mode in the tensor shape
        tensorShape.append(maxVal + 1)

    print (tensorShape)
    tensorShape = tuple(tensorShape)
    tensor_array = np.zeros(tensorShape)

    ## two useful functions
    # np.unravel_index(flat_index, tensorShape): flat linear index to multidimensional index
    # np.ravel_multi_index(multi_index, tensorShape): multidimensional index to flat linear index

    for i in range(int(np.prod(tensorShape))): # iterate through all the tensor flat indices
        t_index = np.unravel_index(i, tensorShape) # get the tensor multidimensional index
        # accumulate the conditions selecting the rows whose coordinates equal t_index
        condition = np.ones(len(Df), dtype=bool)
        for j in range(len(t_index)):       # iterate through the coordinate columns
            condition &= (Df.iloc[:, j] == t_index[j]).to_numpy()

        # cells without any matching rows keep the zero fill value
        if condition.any():
            tensor_array[t_index] = aggFunc(Df.loc[condition, value])

    return tensor_array
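
The two numpy index helpers mentioned in the function can be tried on their own. The shape and flat index below are arbitrary values picked for the illustration.

```python
import numpy as np

shape = (2, 3, 4)
flat = 17

# flat linear index to multidimensional index: 17 = 1*12 + 1*4 + 1
multi = np.unravel_index(flat, shape)
# multidimensional index back to the flat linear index
back = np.ravel_multi_index(multi, shape)
print(multi, back)

# visiting every cell of a tensor through its flat indices
all_indices = [np.unravel_index(i, shape) for i in range(int(np.prod(shape)))]
print(len(all_indices), all_indices[-1])
```

Note that the loop needs `range(np.prod(shape))` to reach the last cell; stopping one short would leave the final index unvisited.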

In [None]:
tensor_array =  tensorisation2_Values (sdf, 2, value="AverageTemperature", aggFunc=np.mean)
tensor2 = tl.tensor(tensor_array)
tensor2.shape

In [None]:
tensor2


In [None]:
# Fifth example is temperature values per location as Latitude & Longitude & date  (rank-four tensor). We will do it using tensorisation
# Attempting pivot tables will still give rank-two, as we create a matrix with both location columns and the time column flattened into one mode
columns = ['dt','AverageTemperature','AverageTemperatureUncertainty','City','Country','Latitude','Longitude'] # such that these will be x


locTempMeans = df.pivot_table(values='AverageTemperature',
                               index=['Latitude','Longitude','dt'], 
                               aggfunc=np.mean,               ## aggfunc='size', # size if you want to aggregate by frequency counting
                               fill_value=0)
locTempMeans


In [None]:
#We will do it using tensorisation

columns = ['Latitude','Longitude','dt', 'AverageTemperature'] 
sdf=userdata[columns].copy()

sdf


In [None]:
# we have here a date column, we can create time series by lagging function and quantisation of values, but will simply encode here
date = pd.to_datetime(sdf['dt'])
date

In [None]:
sdf['Latitude'] = sdf['Latitude'].apply(dms2dec)
sdf['Longitude'] = sdf['Longitude'].apply(dms2dec)
sdf['Latitude'] = sdf['Latitude'].astype(float)
sdf['Longitude'] = sdf['Longitude'].astype(float)

date_encoder = preprocessing.LabelEncoder()

sdf['dtBasis']= date_encoder.fit_transform(pd.to_datetime(sdf['dt']))
sdf['dtBasis'] = sdf['dtBasis'].astype(float)
sdf['dtBasis'].unique()

In [None]:
columns = ['Latitude','Longitude','dtBasis', 'AverageTemperature'] 
sdf=sdf[columns].copy()

sdf

In [None]:
tensor_array =  tensorisation2_Values (sdf, 3, value="AverageTemperature", aggFunc=np.mean)
tensor2 = tl.tensor(tensor_array)
tensor2.shape

In [None]:
tensor2

## We will consider another problem using data that is already in tensor form.

The data set is from http://www.models.life.ku.dk/nwaydata
in Matlab format. It can be read with the scipy loadmat function and saved as numpy arrays for easier loading later.

The X variable is a 3-way tensor of shape 5 × 51 × 201, with the 5 samples in mode-1 (rows). The samples contain different amounts of three amino acids (tyrosine, tryptophan and phenylalanine) dissolved in phosphate buffered water. The samples were measured by fluorescence (excitation 250-300 nm, emission 250-450 nm, 1 nm intervals) on a spectrofluorometer.

The Y variable is the ground truth: the known concentrations of the three chemicals (mode-2) in the samples (mode-1).

In [None]:
import scipy.io
amino = scipy.io.loadmat('data/amino.mat')

In [None]:
X = amino.get('X')

In [None]:
Y = amino.get('Y')

In [None]:
import numpy as np

np.save('data/amino_x', X)
np.save('data/amino_y', Y)
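
Loading the saved arrays later is a single call to `np.load`. The sketch below shows the round trip on a made-up array in a temporary folder, so it does not touch the `data` directory; the file name `amino_x_demo` is an assumption for the example.

```python
import os
import tempfile

import numpy as np

X_demo = np.arange(24, dtype=float).reshape(2, 3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "amino_x_demo")
    np.save(path, X_demo)               # np.save appends the .npy extension
    X_loaded = np.load(path + ".npy")   # the extension is needed when loading

print(X_loaded.shape)
```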

In [None]:
X.shape

In [None]:
Y.shape

In [None]:
# doing Parafac decomposition
from tensorly.decomposition import parafac
weights, factors = parafac(X,rank = 3, verbose = 2)

## PARAFAC should have three components and therefore a 5 × 3 so-called score matrix (first mode loading matrix). Each column in this score matrix should approximately match the concentration of one of the three amino acids, which are held in the 5 × 3 Y matrix. Matching, in this case, means that the corresponding columns should be correlated.

In [None]:
len(factors)

In [None]:
[f.shape for f in factors]

In [None]:
np.isclose(Y, factors[0]) # the values are not element-wise close (the factors are only determined up to scaling and column order), so we will correlate all column pairs instead

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def corrEstimate (i, j, Y, factors):
    # correlate ground truth column i of Y with estimated column j of the first mode factor
    data = {
        'Y': Y[:,i], 
        'Y_p': factors[0][:,j]
    }

    df = pd.DataFrame(data, columns=['Y', 'Y_p'])
    corr = df.corr()
    print("Correlation matrix of ground truth column " + str(i) + " with estimated column " + str(j) + " is : ")
    print(corr)
    
    return corr

In [None]:
plt.figure(figsize=(15, 12))
corrList = []
for i in range(Y.shape[1]):
    for j in range(factors[0].shape[1]):
        corrList.append(corrEstimate(i, j, Y, factors))

# each 2 x 2 correlation matrix has ones on the diagonal, so with the default colour scaling
# (normalised per image) every subplot looks the same; fixing vmin and vmax makes them comparable
n = 0
for i in range(Y.shape[1]):
    for j in range(factors[0].shape[1]):
        ax = plt.subplot(3, 3, n+1)
        ax.set_title("Y column " + str(i) + " vs factor column " + str(j))
        plt.imshow(corrList[n], cmap='coolwarm', interpolation='nearest', vmin=-1, vmax=1)
        n = n+1

### It is obvious that the first estimated chemical matches the first ground truth, more than the other two

### The second estimated chemical matches the third ground truth, 

### The third estimated chemical matches the second ground truth

### The second and third are swapped
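
The matching can also be read off numerically instead of visually. The sketch below uses made-up data, not the amino acid results: `Y_demo` plays the role of the ground truth, and `estimated` is built by reordering and rescaling its columns, which mimics the scaling and ordering freedom of PARAFAC factors. The variable names are assumptions for the example.

```python
import numpy as np

# toy ground truth: 5 samples x 3 chemicals
Y_demo = np.array([[1., 0., 2.],
                   [0., 1., 1.],
                   [2., 3., 0.],
                   [4., 1., 3.],
                   [3., 2., 5.]])

# toy estimate: columns 2 and 3 swapped, each column rescaled (one with a sign flip)
true_order = [0, 2, 1]
estimated = Y_demo[:, true_order] * np.array([2.0, -1.0, 0.5])

# cross-correlation block: rows are ground truth columns, columns are estimated columns
n = Y_demo.shape[1]
corr = np.corrcoef(Y_demo.T, estimated.T)[:n, n:]

# for each estimated column, the ground truth column with the largest absolute correlation
best_match = np.abs(corr).argmax(axis=0)
print(best_match)
```

Taking the absolute value matters because a factor column can come out with a flipped sign and still describe the same chemical.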

# Ch6: Fundamentals of Tensor Analysis & Applications

## 6.1 Scientific Computing Applications

## The second problem uses survey answers about happiness scores related to 3 conditions. The data form a three-way contingency table whose variables are: Happiness reported by participants in a survey (i, mode-1: 3 categories), Number of siblings (j, mode-2: Siblings, 5 categories) and Years of schooling completed (k, mode-3: Schooling, 4 categories). It is thus a frequency table whose cells contain the number of people with a particular combination of categories.

## The data is stored in a dat file that holds a 2-dimensional matricised tensor of shape (12, 5). We assume that reshaping it into (3, 4, 5), which swaps mode-2 and mode-3 relative to the problem definition at https://three-mode.leidenuniv.nl/, is acceptable.

## We will decompose with Tucker to find which rank gives the best fit.

In [None]:
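# read Happiness.dat into a list of lists, one list of values per non-empty line
with open("data/Happiness.dat") as f:
    datContent = [line.strip().split() for line in f if line.strip()]

X = np.array(datContent, dtype=float) # convert the text values to numbers
X.shape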

In [None]:
X = X.reshape(3, 4, 5)
X.shape
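
How the reshape maps rows of the matricised table to tensor cells can be checked on a stand-in matrix. This sketch assumes the 12 rows are ordered with happiness as the slowest index and schooling as the fastest, which is an assumption about the file layout, not something verified against it. `M`, `T` and `T_swapped` are names made up for the example.

```python
import numpy as np

# stand-in for the (12, 5) matricised table, filled with 0..59 so cells are easy to trace
M = np.arange(60).reshape(12, 5)

# row-major reshape: row r of M becomes T[r // 4, r % 4, :]
T = M.reshape(3, 4, 5)
print(T[1, 2, 3], M[1 * 4 + 2, 3])

# swapping the last two modes gives the (happiness, siblings, schooling) order of the problem definition
T_swapped = np.swapaxes(T, 1, 2)
print(T_swapped.shape)
```

Swapping modes only relabels the axes, so the Tucker fit for a given rank is unaffected as long as the rank tuple is permuted in the same way.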

In [None]:
# doing Tucker decomposition
from tensorly.decomposition import tucker
import tensorly as tl
import math



def doTucker (X, rank):
    X = tl.tensor(X,dtype=float)
    core, factors = tucker(X, rank=rank)
    tucker_reconstruction_2 = tl.tucker_to_tensor((core, factors))

    tk_RMSE = math.sqrt(np.square(np.subtract(X,tucker_reconstruction_2)).mean() )
    print ("Tucker " + str(rank) + " RMSE = ", tk_RMSE)
    return core ,factors, tk_RMSE
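
To make the Tucker model and its reconstruction error more concrete, here is a numpy-only sketch of a truncated higher-order SVD (HOSVD). This is a simpler stand-in technique, not the algorithm that tensorly's `tucker` runs, and the helper names (`unfold`, `mode_dot`, `hosvd`, `reconstruct`) are defined here for the illustration only. The random tensor stands in for the survey data.

```python
import numpy as np

def unfold(tensor, mode):
    # matricise: the chosen mode becomes the rows, all other modes are flattened into columns
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    # multiply a (J x I_mode) matrix along the given mode of the tensor
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=(1, 0)), 0, mode)

def hosvd(tensor, ranks):
    # one factor per mode: the leading left singular vectors of the mode unfolding
    factors = [np.linalg.svd(unfold(tensor, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = tensor
    for m, u in enumerate(factors):
        core = mode_dot(core, u.T, m)   # project onto the factor subspace
    return core, factors

def reconstruct(core, factors):
    out = core
    for m, u in enumerate(factors):
        out = mode_dot(out, u, m)
    return out

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

X_demo = np.random.default_rng(0).random((3, 4, 5))
full_err = rmse(X_demo, reconstruct(*hosvd(X_demo, (3, 4, 5))))
core_low, factors_low = hosvd(X_demo, (2, 2, 2))
low_err = rmse(X_demo, reconstruct(core_low, factors_low))
print(full_err, low_err)
```

At full rank nothing is discarded, so the error is zero up to rounding, while reducing every mode to rank 2 leaves a visible error. This is the same trade-off the RMSE comparison explores.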

In [None]:
# trying all possible Tucker ranks
from tensorly.decomposition import tucker
import tensorly as tl

ranks = []
tucker_RMSE=[]

for i in range(np.prod(X.shape)): # iterate through the flat index
    # the multidimensional index plus one in every mode is used as a rank, so this traverses all ranks from (1, 1, 1) up to the full shape
    rank = tuple(int(r) + 1 for r in np.unravel_index(i, X.shape))
    core ,factors, tk_RMSE = doTucker(X, rank)
    tucker_RMSE.append(tk_RMSE)
    ranks.append(str(rank))


best = np.argmin(tucker_RMSE)
print ("Lowest RMSE achieved at rank = " + ranks[best]) # this will show the full rank, where nothing is reduced

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure( figsize=(20,6))
plt.style.use('ggplot')

x_pos = [i for i, _ in enumerate(ranks)]

plt.bar(ranks, tucker_RMSE, color='green')
plt.xlabel("Tucker rank")
plt.ylabel("RMSE")
plt.title("Tucker reconstruction RMSE for each rank")

plt.xticks(x_pos, ranks, rotation=90)

plt.show()

## Looking at the reconstruction error for all possible ranks, it seems that the highest errors occurred when all modes were reduced. When only the first mode was reduced, the error was small, which suggests that this mode is not very dominant in this dataset, and the third mode seems to be the most dominant.

In the book (Kroonenberg, P.M., 2008. Applied multi-way data analysis, Wiley series in probability and statistics. Wiley-Interscience, Hoboken, N.J.) the author showed that rank (2,2,2) was the best fit for this dataset, which the results here also appear to support. The software he used can be downloaded from https://three-mode.leidenuniv.nl/