# New featurizer

This notebook implements a new featurizer for source code and demonstrates its usage. It was originally used to featurize lines from Gerrit's comments.

This version contains extra comments that explain how it works.

(c) Miroslaw Staron

## Class declaration

The class FeatureMaker takes a set of features and a line as input and produces the feature vector for that line as output.

It is meant to be used as a utility and contains little code. The main function is featurize. It has three variants with different performance, but the one called simply featurize is the best choice for most cases.

The other class, DataSet, holds the feature vectors for all lines in a specific piece of code. It is similar to a DataFrame, but specific to this purpose.
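As a rough illustration of the idea (not the class itself, which follows in the next cell), the sketch below counts how often each feature occurs as a substring of a line. The function name `count_features` and the sample line are made up for this example.

```python
def count_features(features, line):
    """Return the non-overlapping substring count of each feature in the line."""
    return [line.count(feature) for feature in features]


features = ['int', '=', ';']
vector = count_features(features, 'int x = 0; print(x);')
print(vector)  # [2, 1, 2]
```

Note that 'int' is counted twice because it also occurs inside 'print'. Substring counting does not respect token boundaries, which is worth keeping in mind when interpreting the vectors.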

In [1]:
# This is a python program to create feature vectors from the programming language code
# in such a way that every distinct line is indeed distinct

# author: Miroslaw Staron

import re
import time
from collections import Counter

import pandas as pd

class FeatureMaker:
    """Class makes a feature vector based on the set of features defined in the class parameters"""

    def __init__(self):
        """The initial state of the feature vector is actually an empty vector"""
        self.featureVector = []

    def addNewFeature(self, newFeature):
        """Adds just one feature to the list of features"""
        self.featureVector.append(newFeature)

    def addNewFeatures(self, newFeatures):
        """Adds a list of features"""
        self.featureVector = self.featureVector + newFeatures
    
    def getFeatureVector(self):
        return self.featureVector

    def featurize(self, line):
        """Counts the frequency of each feature in a given line"""
        self.features = []
        for feature in self.featureVector:
            self.features.append(str(line.count(feature)))
        return self.features
    
    def featurize2(self, line):
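        """Counts features using a Counter over the characters of the line.

        Counter(line) counts single characters, so only single-character
        features can get a non-zero count in this variant."""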
        self.features = []
        counter = Counter(line)
        for feature in self.featureVector:
            self.features.append(str(counter[feature]))
        return self.features
    
    def featurize3(self, line):
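        """Counts the frequency of each feature in a given line using regular expressions"""
        self.features = []
        for feature in self.featureVector:
            # re.escape makes the feature match literally, so tokens such as '+' or '*' do not break the pattern
            self.features.append(str(len(re.findall(re.escape(str(feature)), line))))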
        return self.features
    
    def featuresToString(self):
        strFeatures = '$'.join(str(e) for e in self.features)
        return strFeatures
    
    def findNewFeatures(self, lstTokens):
        newElements = list(set(lstTokens) - set(self.featureVector))
        return newElements

def tokenizeString(myString):
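    """This function takes a line and returns a list of tokens; empty strings are removed"""
    # a character class of delimiters; a raw string avoids invalid escape warnings
    # note that '|' is itself one of the delimiters here
    tokenList = r'[()|",.;\[\]{} \n\t:]'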
    tokens = re.split(tokenList, myString)
    tokens = list(filter(None, tokens))
    return tokens

class DataSet:
    """The class makes the connection between the code and its features"""

    def __init__(self):
        self.dictRows = {}
        self.featureVector = []
    
    def addFeatureVector(self, lstFeatureVector):
        self.featureVector = lstFeatureVector

    def addNewLine(self, strLine, lstFeatures):
        self.dictRows[lstFeatures] = strLine 

    def hasLine(self, lstFeatures):
        return (lstFeatures in self.dictRows.keys())
            
    def getLine(self, lstFeatures):
        return self.dictRows[lstFeatures]

    def toCSV(self, strFilename):
        # write the header, then one row per line: the line text followed by its feature counts
        with open(strFilename, 'w', encoding='utf8') as fFile:
            strFirstLine = 'line$'
            strFirstLine += '$'.join(self.featureVector) + '\n'
            fFile.write(strFirstLine)
            for key, value in self.dictRows.items():
                # strip characters that would break the '$'-separated format
                value = value.replace("\n", "").replace("$","").replace("\r","").replace("\t","")
                strToFile = f'{value}${key}\n'
                fFile.write(strToFile)
    
    def flush(self):
        self.dictRows = {}
        self.featureVector = []

## The feature finder function

This function is the main algorithm. It takes a list of lines to featurize and the name of the file where the list of features is stored. The file is needed because the list of features is saved periodically (whenever the number of features is a multiple of 10).

Periodic saving matters because the algorithm can take weeks to find the optimal feature set (for code bases with over 100,000 LOC), and the computer may restart during that time. To avoid losing the work, the feature set is saved periodically.
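To make use of a saved checkpoint after a restart, the feature list has to be read back. A minimal sketch follows, assuming the file holds a single line of '$'-separated features, as the function below writes it. The helper name `load_feature_list` is hypothetical, and the feature finder as written does not accept a starting list, so actually resuming would need a small change to it.

```python
import os
import tempfile


def load_feature_list(path):
    """Read a '$'-separated feature list from the first line of the file.

    Returns an empty list if the file does not exist or is empty.
    Assumes that no feature contains the '$' character itself.
    """
    if not os.path.exists(path):
        return []
    with open(path, encoding='utf8') as f:
        first_line = f.readline().rstrip('\n')
    return first_line.split('$') if first_line else []


# round trip on a temporary file, written the same way as the checkpoint
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, 'feature_list.csv')
    with open(checkpoint, 'w', encoding='utf8') as f:
        f.write('$'.join(['int', '=', 'return']) + '\n')
    restored = load_feature_list(checkpoint)
    missing = load_feature_list(os.path.join(tmp_dir, 'no_such_file.csv'))

print(restored)  # ['int', '=', 'return']
```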

In [2]:

# this is an iterative function to find a set of features for a set of lines
# in each iteration it goes through all lines and checks which lines are identical given the feature list
# then it takes one new token from the identical lines, adds it as a feature, and repeats
def findFeatureListIterative(lstLines, strOutputFeatureFile):
    #print(f'Lines: {len(lstLines)}, features: {len(lstFeatures)}')
    start_time = time.time()
    
    featurizer = FeatureMaker()
    featureList = []

    if len(lstLines) > 0:
        initialFeatures = tokenizeString(' '.join(lstLines))
        featurizer.addNewFeature(initialFeatures[0])
        featureList.append(initialFeatures[0])
    else:
        return featureList

    featureAdded = True

    while featureAdded:
        dictLinesUnique = {}
        lstNotUnique = []
        featureAdded = False

        # featurizing all lines in this iteration
        for line in lstLines:
            mFeatures = featurizer.featurize(line)
            strFeatures = '$'.join(mFeatures)
            if not (strFeatures in dictLinesUnique.keys()):
                dictLinesUnique[strFeatures] = line
            else:
                lstNotUnique.append(line)
                lstNotUnique.append(dictLinesUnique[strFeatures])
        
        lstNotUnique = list(set(lstNotUnique))
        strTime = f'{(time.time() - start_time):.2f} sec.'
        start_time = time.time()
        print(f'Non-unique lines remaining: {len(lstNotUnique)}, features found: {len(featureList)} in {strTime}')
        
        # checkpoint: save the feature list whenever its length is a multiple of 10
        if len(featureList) % 10 == 0:
            print('Saving feature list...')
            with open(strOutputFeatureFile, 'w', encoding='utf8') as fFile:
                fFile.write('$'.join(featureList) + '\n')
            print('Done...')

        # and kicking-off the next iteration if necessary
        # by necessary I mean that there are lines that are not different
        if len(lstNotUnique) > 0:
            allLines = lstNotUnique
            getTokens = tokenizeString(' '.join(allLines))
            #getTokens.sort(reverse=True)
            for oneToken in getTokens:
                if oneToken not in featureList:
                    featureList.append(oneToken)
                    featurizer.addNewFeature(oneToken)
                    featureAdded = True
                    break
            if featureAdded:
                #featureList = findFeatureList(allLines, featureList)
                lstLines = lstNotUnique
        
    # returning the feature list
    return featureList

## Demonstration of how to use it

This code demonstrates how to use the featurizer.

It first reads the code (in this case into a Pandas data frame) and reduces it to a plain list of lines of code. It then removes duplicated lines so that the algorithm runs faster.
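Removing duplicates through a set does not keep the original order of the lines, and for strings the resulting order can differ between Python runs. Since the feature finder picks tokens in the order it meets them, that may lead to a different feature list from run to run. If reproducibility matters, an order-preserving alternative can be sketched as follows; the helper name `unique_lines` is made up for this example.

```python
def unique_lines(lines):
    """Drop duplicate lines while keeping the order of first occurrence."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


sample = ['a = 1', 'b = 2', 'a = 1', 'c = 3', 'b = 2']
deduplicated = unique_lines(sample)
print(deduplicated)  # ['a = 1', 'b = 2', 'c = 3']
```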

In [3]:
##
## This is a block where we test the featurizer class and the dataset class
##

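# on_bad_lines='warn' (pandas 1.3 or newer) replaces the removed
# error_bad_lines=False / warn_bad_lines=True pair: bad lines are skipped with a warning
dfCode = pd.read_csv('./main.csv', 
                    sep='$', 
                    on_bad_lines='warn', 
                    header=0, 
                    index_col=False)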

mLines = [line for line in dfCode['code_content'] if str(line) != 'nan' ]

print(f'All Lines: {len(mLines)}')

mLines = list(set(mLines))

print(f'Unique Lines: {len(mLines)}')
#print(mLines)



All Lines: 8
Unique Lines: 8


This cell runs the feature finder and gets the list of features as a result.

The feature finder prints how many non-unique lines remain, how many features it has found so far, and how long the last iteration took. FeatureMaker.featurize loops over all features for every line, so each iteration takes longer as the feature list grows.

In [5]:
features = findFeatureListIterative(mLines, './feature_list.csv')

Non-unique lines remaining: 7, features found: 1 in 0.00 sec.
Non-unique lines remaining: 6, features found: 2 in 0.00 sec.
Non-unique lines remaining: 6, features found: 3 in 0.00 sec.
Non-unique lines remaining: 6, features found: 4 in 0.00 sec.
Non-unique lines remaining: 6, features found: 5 in 0.00 sec.
Non-unique lines remaining: 4, features found: 6 in 0.00 sec.
Non-unique lines remaining: 2, features found: 7 in 0.00 sec.


## Function to featurize the code based on a predefined list of features

This function goes through the code once again and creates the feature vector for each line. It is separate from the previous one as we can use any feature list we want, not only the one from this new featurizer. 
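When the feature list comes from somewhere else, it can be useful to check first whether it separates all lines. The sketch below groups lines by their count vector and reports the groups that collide. It is an illustration only: `find_collisions` is a hypothetical helper, and it uses plain substring counts in the same way as FeatureMaker.featurize.

```python
from collections import defaultdict


def find_collisions(lines, features):
    """Return groups of distinct lines that share the same feature-count vector."""
    groups = defaultdict(set)
    for line in lines:
        key = tuple(line.count(feature) for feature in features)
        groups[key].add(line)
    # keep only the vectors that more than one distinct line maps to
    return [sorted(group) for group in groups.values() if len(group) > 1]


lines = ['x = 1', 'y = 1', 'return x']
print(find_collisions(lines, ['=']))       # [['x = 1', 'y = 1']]
print(find_collisions(lines, ['=', 'x']))  # []
```

An empty result means every distinct line gets a distinct vector for the given features.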

In [6]:

def featurizeListPredefined(lstLines, lstFeatures):
    dtLines = DataSet()
    
    featurizer = FeatureMaker()

    featurizer.addNewFeatures(lstFeatures)

    foundNewFeature = True
    i = 1
    while foundNewFeature:
        foundNewFeature = False
        print(f'Pass number: {i}')
        i += 1
        iLine = 0
        total = len(lstLines)
        for line in lstLines:
            iLine += 1
            if not foundNewFeature: 
                mFeatures = featurizer.featurize(line)  
                # featurize returns the counts as strings, so compare against '0'
                if not all(v == '0' for v in mFeatures):
                    strFeatures = featurizer.featuresToString()
                    if not dtLines.hasLine(strFeatures):
                        dtLines.addNewLine(line, strFeatures)
                    else:
                        strLine = dtLines.getLine(strFeatures)
                        if strLine != line:
                            lineTokens = tokenizeString(line)
                            oldLineTokens = tokenizeString(strLine)
                            newFeatures = featurizer.findNewFeatures(lineTokens+oldLineTokens)
                            if len(newFeatures) > 0:
                                featurizer.addNewFeature(newFeatures[0])
                                foundNewFeature = True
                                dtLines.flush()
                                dtLines.addFeatureVector(featurizer.featureVector)
                                print(f'Found new feature at line {iLine} of {total}')                           

    return dtLines

## Demonstration of featurizing the code

In this piece of code, we simply apply the featurizer. We already have a list of features, and now we need the feature vector for each line.

This step takes a while; most of the time goes into saving the feature table to CSV, which tends to be very large. To check the progress, watch the memory consumption of the process or the size of the output file on disk.
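The output file can later be read back row by row without loading it all into memory. The following is a minimal sketch using the standard csv module. It assumes the layout written by DataSet.toCSV (a header starting with line, then the line text followed by its counts, all separated by '$') and that no feature name contains '$'. The sample content is invented for the example.

```python
import csv
import io

# stand-in for an open output file; the real file would be opened with
# open(path, encoding='utf8', newline='')
sample = 'line$int$=\nint x = 0$1$1\nprint("hi")$1$0\n'

# QUOTE_NONE keeps quote characters in the code lines as ordinary text
reader = csv.reader(io.StringIO(sample), delimiter='$', quoting=csv.QUOTE_NONE)
header = next(reader)
rows = [(row[0], [int(count) for count in row[1:]]) for row in reader]

print(header)  # ['line', 'int', '=']
for text, counts in rows:
    print(text, counts)
```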

In [7]:
strOutputFile = './output_main.csv'
dtLines = featurizeListPredefined(mLines, features)
dtLines.addFeatureVector(features)
dtLines.toCSV(strOutputFile)

Pass number: 1
