# Tensor Decomposition for Malware

Put setup code like this at the top of every notebook.

In [1]:
import datetime
import numpy as np
import pickle
import sys
import time

#from LZJD import *         # don't seem to need LZJD in this notebook, yet
from whereAmI import *
from os import chdir
from os import getcwd
from os import listdir
from os import path

(home, dropBoxDir, dataDir) = whereAmI()

print(dropBoxDir)

ModuleNotFoundError: No module named 'whereAmI'

## Installing TensorFlow

We need TensorFlow to do the machine learning work in this notebook.  To install TensorFlow on an Ubuntu guest, try this:

which pip     # to find out which version of pip to use

/home/charles/anaconda3/bin/pip install -U tensorflow  # may need full path if using sudo

Contrary to the documentation at 
https://www.tensorflow.org/install/install_linux#InstallingNativePip,
sudo does not seem to be necessary.

To check whether TensorFlow is installed, run the following as a stand-alone command.  The line shown below is markdown, and is not intended for execution inside this notebook.

`python -c "import tensorflow as tf; print(tf.__version__)"`

If that works, then the following cell should also work!

In [2]:
import tensorflow as tf
print(tf.__version__)

1.0.0


After installing TensorFlow the first time, this warning appeared:
    
`/home/charles/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: 
FutureWarning: Conversion of the second argument of issubdtype from 
"float" to "np.floating" is deprecated. In future, it will be treated 
as "np.float64 == np.dtype(float).type".  from ._conv import register_converters as _register_converters`

After a couple of Google searches, upgrading to a newer version of h5py seems to make the warning go away.

`pip install h5py==2.8.0rc1`

`python -c 'import h5py; print(h5py.version.info)'`

Some options for the program...

We need some tensor decomposition code that does not come with TensorFlow but is built on it.  See https://github.com/ebigelow/tf-decompose.  pip doesn't seem to know about ktensor and its friends, so unzip the code from GitHub and copy ktensor.py, dtensor.py, and utils.py into the directory the Jupyter process runs in.

The tensor decomposition code uses tqdm, which looks handy anyway, and is easily installed:
    `pip install tqdm`
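
Because several of the packages used below are installed by hand, a quick check of what is importable can save a failed cell later. This is a minimal sketch that assumes only the package names mentioned in this notebook; `missing_modules` is a hypothetical helper, not part of any of these libraries.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    # find_spec locates a module without importing it, so this check is cheap.
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages this notebook imports; ktensor is the file copied from tf-decompose.
required = ["tensorflow", "h5py", "tqdm", "plotly", "sklearn", "scipy", "ktensor"]
print("Missing:", missing_modules(required))
```

An empty list means every name resolved; anything listed still needs a `pip install` or a copied file.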

In [3]:
from ktensor import KruskalTensor

Let's see if we can parse the Shakespeare corpus, which I have in ~/Dropbox/working/vx/WS.

In [4]:
# Defines the TermDocumentTensor class, whose methods create the tensor, decompose it,
# generate the similarity matrix, and cluster it

from scipy import spatial
from collections import deque
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import _pickle as pickle
import time

flag = 1   # print comments in a few places

def flag_function_tdm(cmts):
        global flag
        flag = cmts

The next block of code needs plotly.  That's easy to install: `pip install plotly`

In [5]:
# TensorVisualization class, which takes the similarity matrix and visualizes
# it as clusters or heatmaps
flag = 1
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
import time
import scipy.cluster.hierarchy
import scipy.spatial.distance
from collections import Counter
#import accuracy    # Phani, is this package important?

def flag_function_visualization(cmts):
    global flag
    flag = cmts

Several of the classes below have names that end in "mixin".  Because these classes hold no data, only methods (often just one), another class can inherit from them.  This lets each method live in its own cell in this notebook, with markdown cells between them as documentation.
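
To make the pattern concrete, here is a small stand-alone sketch. The class names are made up for illustration and do not appear elsewhere in this notebook.

```python
class GreetMixin:
    # No __init__ and no data: the mixin only contributes a method.
    def greet(self):
        return "Hello, " + self.name

class ShoutMixin:
    # A mixin method may call methods supplied by other mixins.
    def shout(self):
        return self.greet().upper() + "!"

class Person(GreetMixin, ShoutMixin):
    # The combining class owns the data that the mixin methods read.
    def __init__(self, name):
        self.name = name

person = Person("Ada")
print(person.greet())
print(person.shout())
```

Each mixin could sit in its own cell, as long as all of them are defined before the class that inherits from them.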

In [6]:
class generate_heat_map_mixin:
    def generate_heat_map(self, data, axis_labels):
        """
        Generates a heat map for the current data
        Currently only meant to support using a cosine similarity matrix
        :param data:
        :param axis_labels:
        :return:
        """
        if flag == 1:
            print("Generating heatmap...")
        axis_labels_abbreviated = [label[:14] for label in axis_labels]
        info = [go.Heatmap(z=data,
                           x=axis_labels_abbreviated,
                           y=axis_labels_abbreviated,
                           colorscale='Hot',
                           )]

        layout = go.Layout(title='Cosine Similarity Between Documents',
                           xaxis=dict(ticks=''),
                           yaxis=dict(ticks=''),
                           plot_bgcolor='#444',
                           paper_bgcolor='#eee'
                           )
        fig = go.Figure(data=info, layout=layout)
        plotly.offline.plot(fig, filename='notebook_heatmap.html')

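Use SVD to reduce the factor matrix to two dimensions, then cluster the documents with k-means.  Note that k-means runs on the full factor matrix; the two SVD components are used only as the x and y coordinates of the scatter plot.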

In [7]:
class k_mean_clustering_mixin:
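    def k_means_clustering(self, factor_matrix, file_names=None, clusters=2):
        # Honor the caller's cluster count instead of resetting it to 2,
        # and avoid a mutable default argument for file_names
        file_names = [] if file_names is None else file_names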
        svd = TruncatedSVD(n_components=2, n_iter=20, random_state=42)
        reduced = svd.fit_transform(factor_matrix)
        kmeans = KMeans(n_clusters=clusters, random_state=0).fit(factor_matrix)
        labels_predicted = kmeans.labels_
        data = [plotly.graph_objs.Scatter(x=[entry[0] for entry in reduced],
                                          y=[entry[1] for entry in reduced],
                                          mode='markers',
                                          marker=dict(color=kmeans.labels_),
                                          text=file_names
                                          )
                ]
        fig = go.Figure(data=data)
        plotly.offline.plot(fig, filename='kmeans_cluster.html')
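
The cell above projects to two dimensions only to get plot coordinates, while the clustering uses every column of the factor matrix. The following sketch shows that split with plain NumPy on made-up data. It swaps in a hand-written Lloyd's k-means and `np.linalg.svd` for scikit-learn's `KMeans` and `TruncatedSVD`, so its output will not match those classes exactly; `project_2d` and `simple_kmeans` are hypothetical names.

```python
import numpy as np

def project_2d(matrix):
    """Project rows onto the top two right singular vectors (no centering)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :2] * s[:2]

def simple_kmeans(matrix, clusters, n_iter=20, seed=0):
    """Lloyd's algorithm: assign rows to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    centers = matrix[rng.choice(len(matrix), size=clusters, replace=False)]
    for _ in range(n_iter):
        distances = np.linalg.norm(matrix[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for c in range(clusters):
            if np.any(labels == c):      # keep the old center if a cluster empties
                centers[c] = matrix[labels == c].mean(axis=0)
    return labels

# Made-up "factor matrix": two well-separated groups of three rows each.
factor_matrix = np.array([[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
                          [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1]])
labels = simple_kmeans(factor_matrix, clusters=2)   # clustering uses all columns
coords = project_2d(factor_matrix)                  # 2-D coordinates are for plotting only
print(labels, coords.shape)
```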

In [8]:
class TensorVisualization(generate_heat_map_mixin, k_mean_clustering_mixin):
    def __init__(self):
        from plotly import __version__
        print( "Using Plotly version "+__version__) # requires version >= 1.9.0
        plotly.tools.set_credentials_file(username='cknicholas', 
                                  api_key='pa9Z110GEeh029O4jTV0')

In [9]:
class generate_cosine_similarity_mixin:
    def generate_cosine_similarity_matrix(self, matrix):
        if flag == 1:
            print("Generating a cosine similarity matrix")
        cosine_sim = []
        # The with-block closes cosine.txt even if a computation fails
        with open('cosine.txt', 'w') as f:
            for entry in matrix:
                sim = []
                for other_entry in matrix:
                    # Cosine similarity is 1 minus the cosine distance; compute it once
                    similarity = 1 - spatial.distance.cosine(entry, other_entry)
                    sim.append(similarity)
                    f.write(str(similarity))
                    f.write("\n")
                cosine_sim.append(sim)
        return cosine_sim
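
The nested loop above calls the distance function once per pair of rows. For larger factor matrices, the same similarities can be computed in one step by normalizing the rows and taking a matrix product. This is a sketch under the assumption that no row is all zeros; `cosine_similarity_matrix` is a hypothetical helper and does not write `cosine.txt`.

```python
import numpy as np

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    rows = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit_rows = rows / norms          # assumes no all-zero rows
    return unit_rows @ unit_rows.T

example = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
similarity = cosine_similarity_matrix(example)
print(np.round(similarity, 3))
```

The diagonal is 1 (each row compared with itself), and orthogonal rows give 0.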

This logic for estimating the rank of the tensor is interesting.  But I'm not sure if it's being used.  

In [10]:
class get_estimated_rank_mixin:
    def get_estimated_rank(self):
        """
        Getting the rank of a tensor is an NP hard problem
        Therefore we use an estimation based on the size of the dimensions of our tensor.
        These numbers are taken from Table 3.3 of Tammy Kolda's paper:
        http://www.sandia.gov/~tgkolda/pubs/pubfiles/TensorReview.pdf
        :return:
        """
        # At the moment the rank returned by this function is normally too high for either
        # my machine or the tensorly library to handle, therefore I have made it just 
        # return 1 for right now
        if flag == 1:
            print("Estimating the rank of the tensor...")
        I = len(self.tensor[0])
        J = len(self.tensor[0][0])
        K = len(self.tensor)

        if I == 1 or J == 1 or K == 1:
            return 1
        elif I == J == K == 2:
            return 2
        elif I == J == 3 and K == 2:
            return 3
        elif I == 5 and J == K == 3:
            return 5
        elif I >= 2 * J and K == 2:
            return 2 * J
        elif 2 * J > I > J and K == 2:
            return I
        elif I == J and K == 2:
            return I
        elif I >= J * K:
            return J * K
        elif J * K - J < I < J * K:
            return I
        # The boundary case of the range above is I = JK - J (not JK - I)
        elif I == J * K - J:
            return I
        else:
            print(I, J, K, "did not have an exact estimation")
            return min(I * J, I * K, J * K)


In [11]:
class create_term_document_tensor_mixin:
    def create_term_document_tensor(self, **kwargs):
        """
        Generic tensor creation function. Returns different tensor based on user input.
        :param kwargs:
        :return:
        """
        if flag == 1:
            print("Creating a Term Document Tensor")
        if self.type == "binary":
            return self.create_binary_term_document_tensor(**kwargs)
        else:
            return self.create_term_document_tensor_text(**kwargs)


In [12]:
class create_binary_term_document_tensor_mixin:
    def create_binary_term_document_tensor(self, **kwargs):
        start_time1 = time.time()
        if flag == 1:
            print("Binary Term Document Tensor")
        doc_content = []
        first_occurences_corpus = {}
        ngrams = kwargs["ngrams"] if kwargs["ngrams"] is not None else 1
        print("ngrams is %s" % (ngrams))

        for file_name in os.listdir(self.directory):
            previous_bytes = deque()
            first_occurences = {}
            byte_count = 0
            with open(self.directory + "/" + file_name, "rb") as file:
                #print("Reading %s\n" % self.directory + "/" + file_name)
                my_string = ""
                while True:
                    byte_count += 1
                    current_byte = file.read(1).hex()
                    if not current_byte:
                        break
                    if byte_count >= ngrams:
                        byte_gram = "".join(list(previous_bytes)) + current_byte
                        if byte_gram not in first_occurences:
                            first_occurences[byte_gram] = byte_count
                        if byte_count % ngrams == 0:
                            my_string += byte_gram + " "
                        if ngrams > 1:
                            previous_bytes.popleft()
                    if ngrams > 1:
                        previous_bytes.append(current_byte)
                first_occurences_corpus[file_name] = first_occurences
            doc_content.append(my_string)
        doc_names = os.listdir(self.directory)

        # Convert a collection of text documents to a matrix of token counts
        vectorizer = TfidfVectorizer(use_idf=False)
        # Learn the vocabulary dictionary and return term-document matrix.
        x1 = vectorizer.fit_transform(doc_content).toarray()
        del doc_content
        self.vocab = ["vocab"]

        self.vocab.extend(vectorizer.get_feature_names())
        tdm = []
        for i in range(len(doc_names)):
            row = x1[i]
            tdm.append(row)
        svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
        reduced_tdm = svd.fit_transform(tdm)
        tdm_first_occurences = []
        self.corpus_names = doc_names
        # Create a first occurences matrix that corresponds with the tdm
        for j in range(len(doc_names)):
            item = doc_names[j]
            this_tdm = []
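            for i in range(0, len(tdm[0])):
                # self.vocab[0] is the "vocab" header, so term i is at index i + 1
                word = self.vocab[i + 1]
                # 0 marks a term with no recorded first occurrence in this file
                this_tdm.append(first_occurences_corpus.get(item, {}).get(word, 0))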
            # print(this_tdm)
            tdm_first_occurences.append(this_tdm)
        reduced_tdm_first_occurences = svd.fit_transform(tdm_first_occurences)
        del tdm_first_occurences
        del tdm
        tdt = [reduced_tdm, reduced_tdm_first_occurences]
        self.tensor = tdt
        #tdm_sparse = scipy.sparse.csr_matrix(tdm)
        #tdm_first_occurences_sparse = scipy.sparse.csr_matrix(tdm_first_occurences)
        if flag == 1:
            print("  %s seconds for TDM Binary" % format((time.time() - start_time1), '.2f'))
        return self.tensor
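
The byte-level bookkeeping in the cell above is easier to follow in isolation. The sketch below records, for each hex-encoded byte n-gram, the 1-based position of the byte that first completes it, which is what the `first_occurences` dictionary stores. It works on an in-memory `bytes` object rather than reading a file one byte at a time, and `first_occurrences` is a hypothetical helper name.

```python
def first_occurrences(data, ngrams=1):
    """Map each hex-encoded byte n-gram to the 1-based position where it first ends."""
    seen = {}
    for end in range(ngrams, len(data) + 1):
        gram = data[end - ngrams:end].hex()
        seen.setdefault(gram, end)     # keep only the earliest position
    return seen

sample = bytes([0x4D, 0x5A, 0x90, 0x4D, 0x5A])
print(first_occurrences(sample, ngrams=1))   # {'4d': 1, '5a': 2, '90': 3}
print(first_occurrences(sample, ngrams=2))   # {'4d5a': 2, '5a90': 3, '904d': 4}
```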


In [13]:
class create_term_document_tensor_text_mixin:
    def create_term_document_tensor_text(self, **kwargs):
        """
        Creates term-sentence-document tensor out of files in directory
        Attempts to save this tensor to a pickle file
        
        :return: 3-D dense numpy array, self.tensor
        """
        start_time2 = time.time()
        if flag == 1:
            print("Word-based Term Document Tensor")

        self.tensor = None
        vectorizer = TfidfVectorizer(use_idf=False, analyzer="word")
        document_cutoff_positions = []
        doc_content = []
        pos = 0
        max_matrix_height = 0
        max_sentences = kwargs["lines"]
        self.corpus_names = os.listdir(self.directory)

        # If given Pickle file, read it in
        if self.file_name is not None:
            # The with-block closes the pickle file once it has been read
            with open(self.file_name, 'rb') as file:
                self.tensor = pickle.load(file)
            return self.tensor

        # Create one large term document matrix from all documents. 
        # Done to ensure same vocabulary.
        for file_name in self.corpus_names:
            document_cutoff_positions.append(pos)
            with open(self.directory + "/" + file_name, "r", errors="ignore") as file:
                print("Reading %s/%s" % (self.directory, file_name))
                for line in file:
                    if len(line) > 2:
                        pos += 1
                        doc_content.append(line)
                    if pos - document_cutoff_positions[-1] >= max_sentences:
                        break
                if max_matrix_height < pos - document_cutoff_positions[-1]:
                    max_matrix_height = pos - document_cutoff_positions[-1]

        document_cutoff_positions.append(pos)

        x1 = vectorizer.fit_transform(doc_content)
        matrix_length = len(vectorizer.get_feature_names())

        # Split large term document matrix into term document tensor. 
        # Splits happen where one document ends.
        for i in range(len(document_cutoff_positions) - 1):
            temp = x1[document_cutoff_positions[i]:document_cutoff_positions[i + 1], :]
            temp = temp.todense()
            # Make all matrix slices the same size
            term_sentence_matrix = np.zeros((max_matrix_height, matrix_length))
            term_sentence_matrix[:temp.shape[0], :temp.shape[1]] = temp
            if self.tensor is None:
                self.tensor = term_sentence_matrix
            else:
                self.tensor = np.dstack((self.tensor, term_sentence_matrix))

        self.file_name = self.directory + ".pkl"
        if flag == 1:
            print("Finished tensor construction.")
        if flag == 1:
            print("Tensor shape:" + str(self.tensor.shape))
        try:
            # The with-block closes the pickle file even if the dump fails
            with open(self.file_name, "wb") as pickle_file:
                pickle.dump(self.tensor, pickle_file)
        except OverflowError:
            print("ERROR: Tensor cannot be saved to pickle file due to size larger than 4 GB")
        if flag == 1:
            print("  %s seconds for TDM document" % format((time.time() - start_time2),'.2f'))
        return self.tensor
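
The padding and stacking step at the end of the cell above can be sketched on its own. This version allocates the whole tensor once instead of calling `np.dstack` repeatedly, and it assumes every slice has the same number of columns (a shared vocabulary); `pad_and_stack` is a hypothetical helper name.

```python
import numpy as np

def pad_and_stack(matrices):
    """Zero-pad 2-D arrays to a common row count and stack them along a third axis."""
    height = max(m.shape[0] for m in matrices)
    width = matrices[0].shape[1]          # assumes the same column count everywhere
    tensor = np.zeros((height, width, len(matrices)))
    for k, m in enumerate(matrices):
        tensor[:m.shape[0], :, k] = m     # rows beyond the document's length stay zero
    return tensor

doc_a = np.ones((2, 4))      # a made-up document: 2 sentences over 4 terms
doc_b = np.ones((3, 4))      # a made-up document: 3 sentences over 4 terms
tensor = pad_and_stack([doc_a, doc_b])
print(tensor.shape)          # (3, 4, 2)
```

Unlike the loop above, this always yields a three-dimensional array, even when the directory holds a single document.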


In [14]:
class parafac_decomposition_mixin:
    def parafac_decomposition(self):
        """
        Computes a parafac decomposition of the tensor.
        This will return n rank 3 factor matrices, where n represents the 
        dimensionality of the tensor.
        :return:
        """
        start_time3 = time.time()
        if flag == 1:
            print("Ready to decompose the TDM")
        decompose = KruskalTensor(self.tensor.shape, rank=3, regularize=1e-6, 
                                  init='nvecs', X_data=self.tensor)
        if flag == 1:
            print("Returned from decomposing the TDM")
        self.factors = decompose.U
        with tf.Session() as sess:
            for i in range(len(self.factors)):
                sess.run(self.factors[i].initializer)
                self.factors[i] = self.factors[i].eval()
        if flag == 1:
            print("  %s seconds for decomposition of the tensor" % 
                  format((time.time() - start_time3), '.2f'))
        return self.factors
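
To show what a PARAFAC (CP) decomposition produces, here is a minimal NumPy sketch of alternating least squares for a three-way tensor. It is not the algorithm or the API of `KruskalTensor` from tf-decompose, whose internals are not shown in this notebook. `cp_als` and `reconstruct` are hypothetical names, and the random initialization is an assumption.

```python
import numpy as np

def reconstruct(factors):
    """Rebuild the tensor as a sum of outer products of matching factor columns."""
    a, b, c = factors
    return np.einsum('ir,jr,kr->ijk', a, b, c)

def cp_als(tensor, rank, n_iter=50, seed=0):
    """Fit a CP model of the given rank to a 3-way array by alternating least squares."""
    rng = np.random.default_rng(seed)
    a, b, c = (rng.standard_normal((dim, rank)) for dim in tensor.shape)
    for _ in range(n_iter):
        # Each update is a linear least-squares solve with the other two factors fixed.
        a = np.einsum('ijk,jr,kr->ir', tensor, b, c) @ np.linalg.pinv((b.T @ b) * (c.T @ c))
        b = np.einsum('ijk,ir,kr->jr', tensor, a, c) @ np.linalg.pinv((a.T @ a) * (c.T @ c))
        c = np.einsum('ijk,ir,jr->kr', tensor, a, b) @ np.linalg.pinv((a.T @ a) * (b.T @ b))
    return [a, b, c]

# A made-up rank-1 tensor, so a rank-1 model can fit it exactly.
x = np.einsum('i,j,k->ijk', [1.0, 2.0], [1.0, 0.5, 2.0], [3.0, 1.0])
factors = cp_als(x, rank=1)
error = np.linalg.norm(x - reconstruct(factors)) / np.linalg.norm(x)
print([f.shape for f in factors], error)
```

There is one factor matrix per tensor axis, each with `rank` columns. The main program later takes the factor matrix for the document axis and compares its rows.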

In [15]:
class TermDocumentTensor(generate_cosine_similarity_mixin, 
                         get_estimated_rank_mixin, 
                         create_term_document_tensor_mixin, 
                         create_binary_term_document_tensor_mixin,
                         create_term_document_tensor_text_mixin, 
                         parafac_decomposition_mixin):
    
    def __init__(self, directory, type="binary", file_name=None):
        if flag==1:
            print("Initializing the tensor")
        self.vocab = []
        self.tensor = []
        self.corpus_names = []
        self.directory = directory
        self.type = type
        self.rank_approximation = None
        self.factor_matrices = []
        # These are the outputs of our tensor decomposition.
        self.factors = []
        self.file_name = file_name

    def print_formatted_term_document_tensor(self):
        if flag == 1:
            print("Print the Term Document Tensor")
        for matrix in self.tensor:
            print(self.vocab)
            for i in range(len(matrix)):
                print(self.corpus_names[i], matrix[i])

In [16]:
# To run this notebook, the directory named in args['directory'] ("output") must exist
# in the folder from which this notebook is executed

args = {}
args['Comments'] = "N"
args['axis'] = 2
args['binary'] = False
args['components'] = 2
args['decom'] = 'parafac'
args['directory'] = 'output'   # was 'output' which I don't understand
args['file'] = None
args['heatmap'] = True
args['kmeans']= True
args['lines'] = 300                              # used to be only 100, which seems short
args['ngrams'] = 1
args['text'] = True
args['output_option'] = False

# create the tensor and its decompositions
def main():
    start_time = time.time()
    file_type = "binary" if args['binary'] else "text"
    print("Reading files from %s" % args['directory'])
    tdt = TermDocumentTensor(args['directory'], type=file_type, file_name=args['file'])
    tdt.create_term_document_tensor(ngrams=args['ngrams'], lines=args['lines'])
    if args['decom'] == "parafac":
        factors = tdt.parafac_decomposition()
    visualize = TensorVisualization()
    if args['heatmap']:
        cos_sim = tdt.generate_cosine_similarity_matrix(factors[args['axis']])
        visualize.generate_heat_map(cos_sim, tdt.corpus_names)
    if args['kmeans']:
        visualize.k_means_clustering(factors[args['axis']], tdt.corpus_names, 
                                     clusters=args['components'])
        # what is this doing here?  if we're not using it
        #tdt.generate_cosine_similarity_matrix(factors[args['axis']])
    print("  %s seconds is the total time for program to execute" % 
          format((time.time() - start_time), '.2f'))
main()

Reading files from output
Initializing the tensor
Creating a Term Document Tensor
Word-based Term Document Tensor
Reading output/JC.txt
Reading output/1957-Eisenhower.txt
Reading output/TNK.txt
Reading output/Tmp.txt
Reading output/1881-Garfield.txt
Reading output/1905-Roosevelt.txt
Reading output/1985-Reagan.txt
Reading output/1913-Wilson.txt
Reading output/Cym.txt
Reading output/.DS_Store
Reading output/2Henry4.txt
Reading output/Cor.txt
Reading output/1793-Washington.txt
Reading output/1885-Cleveland.txt
Reading output/Rom.txt
Reading output/1789-Washington.txt
Reading output/Mac.txt
Reading output/2Henry6.txt
Reading output/1845-Polk.txt
Reading output/1925-Coolidge.txt
Reading output/1837-VanBuren.txt
Reading output/Henry5.txt
Reading output/1941-Roosevelt.txt
Reading output/Tim.txt
Reading output/2001-Bush.txt
Reading output/1961-Kennedy.txt
Reading output/1909-Taft.txt
Reading output/1945-Roosevelt.txt
Reading output/1817-Monroe.txt
Reading output/1949-Truman.txt
Reading output/

OSError: [Errno 22] Invalid argument
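
The listing above shows that `os.listdir` also returns entries such as `.DS_Store`, which are not part of the corpus. Whether such an entry is what triggers the `OSError` is not established here, but skipping hidden and non-regular entries is a reasonable precaution. A minimal sketch, where `corpus_files` is a hypothetical helper and the file names in the demonstration are made up:

```python
import os
import tempfile

def corpus_files(directory):
    """Return the sorted names of regular, non-hidden files in `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if not name.startswith(".") and os.path.isfile(os.path.join(directory, name))
    )

# Demonstration on a throwaway directory containing a hidden file and a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Mac.txt", ".DS_Store", "JC.txt"]:
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "subdir"))
    names = corpus_files(tmp)
print(names)   # ['JC.txt', 'Mac.txt']
```

Sorting also makes the document order repeatable from run to run, which keeps the rows of the similarity matrix aligned with `corpus_names`.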