## File System Feature Engineering
#### Author: Nathan Tibbetts
#### Date: 2 Dec. 2019
#### Class: ACME Volume 3

Note: A few bug fixes to my scraper are not shown here.

Note: Lacking both the time and enough scraped filesystems to build an accurate macro-stats database, we restrict our analysis to a single scraped filesystem.

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
from matplotlib import pyplot as plt
from datetime import datetime

In [2]:
### Engineer our Data

# Load pickled data
a = pd.read_pickle("linux_stem_filesystem0_public.pkl")

# Drop unnecessary columns
#   Our current analyses won't be using these.
a.drop(["Inode", "Device", "Group ID", "Sticky", "User Execute",
        "Group Read", "Group Write", "Group Execute", "Other Read",
        "Other Write", "Other Execute"], axis=1, inplace=True)

# Define our estimate of what is in user space
#   We want a binary flag separating user space from OS space. A row
#   counts as user space if it is a regular file, directory, or link;
#   is not hidden or inside anything hidden; is user-readable and
#   user-writable; is not owned by root; and lies in the user's home
#   directory. This is not a perfect definition, but it should be
#   good enough for our purposes.
a["Irregular"] = (a["Is Directory"] == 0) & (a["Is Regular File"] == 0) & (a["Is Link To"] < 0)
a["Userspace"] = ((a["User ID"] != 0) &
                  (a["Sub-Hidden"] == 0) &
                  (a["User Read"] == 1) &
                  (a["User Write"] == 1) &
                  (a["Sub-Desktop-Parent"] == 1) &
                  (a["Irregular"] == 0))
a.drop(["User ID", "Hidden", "Sub-Hidden", "User Read", "User Write"],
       axis=1, inplace=True)
print("Defined Userspace")

# Process time format
#   What we want is a representation of how often files are used, but
#   the closest approximation we can get is how long it has been since
#   each file was last accessed or modified.
a["Time"] = pd.to_datetime(a["Access Time"])
a["Time2"] = pd.to_datetime(a["Modify Time"])
newest = a["Time"].max()  # newest access time, the reference for both recency columns
a["Recency"] = newest - a["Time"]
a["Modification Recency"] = newest - a["Time2"]
a.drop(["Time", "Time2", "Access Time", "Modify Time", "Metachange Time"],
       axis=1, inplace=True)
print("Defined Time")

# Prep for visualization
#   These flags don't need to stay separate, so we sum them into one
#   score. A nonzero value still works as a binary flag for the home
#   directory, and the value itself can color-code a graph by how
#   close a path is to the user.
a["Desktop"] = a["Desktop"].astype(int) + a["Sub-Desktop"].astype(int) + a["Sub-Desktop-Parent"].astype(int)
a.drop(["Sub-Desktop", "Sub-Desktop-Parent"], axis=1, inplace=True)
print("Defined Desktop Property")

Defined Userspace
Defined Time
Defined Desktop Property
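
The recency columns above measure both timestamps against the newest access time. The sketch below uses made-up data and assumes that a modify time can be later than every access time (for example, when writes update the modify time but reads are not recorded). Under that assumption the modification recency goes negative, and a reference taken across both columns avoids it. The toy frame and the name `reference` are illustrative and not part of the scraper output.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are invented for illustration).
toy = pd.DataFrame({
    "Access Time": ["2019-12-01 10:00:00", "2019-12-01 12:00:00"],
    "Modify Time": ["2019-11-30 09:00:00", "2019-12-01 13:00:00"],
})
access = pd.to_datetime(toy["Access Time"])
modify = pd.to_datetime(toy["Modify Time"])

# Reference taken from access times only, as in the cell above.
newest_access = access.max()
mod_recency_access_ref = newest_access - modify

# Assumed alternative: use the newest timestamp across both columns.
reference = max(access.max(), modify.max())
mod_recency_shared_ref = reference - modify

print(mod_recency_access_ref.min())  # negative when a modify time is newest
print(mod_recency_shared_ref.min())  # never below zero
```

If the real data never has a modify time later than the newest access time, the two references coincide and nothing changes.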


In [3]:
# Feature engineering for the tree structure: node depth and child count
P = list(a.columns).index("Parent")
children = np.zeros(len(a), dtype=np.uint32)
depth = np.zeros(len(a), dtype=np.uint32)

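for i, row in enumerate(a.values):
    # Depth of node in tree: count the hops up the Parent chain until we
    # reach the root, whose Parent is -1.
    j = row[P]
    d = 0
    while j != -1:
        d += 1
        j = a.at[j, "Parent"]
    depth[i] = d

    # Number of children: credit this row to its parent. This assumes the
    # index labels are the row positions 0..len(a)-1.
    if row[P] >= 0:
        children[row[P]] += 1

    # Progress indicator
    if i % 1000 == 0:
        print(i, end="\r")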
        
a["Child Count"] = children
a["Depth"] = depth
print(i)
print("Defined Children and Depth")

1008683
Defined Children and Depth
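
The loop above walks the full parent chain once per row, which is slow on a million rows. As a possible alternative, here is a minimal vectorized sketch. It is not the method used in this notebook. It assumes the `Parent` values are row positions (with -1 for a root) and that the pointers contain no cycles. The function name `tree_features` is hypothetical.

```python
import numpy as np

def tree_features(parent):
    """Return (depth, child_count) for a parent-pointer array.

    Assumes parent[i] is the row position of node i's parent, -1 for a root,
    and that the pointers contain no cycles.
    """
    parent = np.asarray(parent, dtype=np.int64)
    n = len(parent)

    # Child count: tally how often each position appears as a parent.
    child_count = np.bincount(parent[parent >= 0], minlength=n)

    # Depth: move every node one step up per pass until all reach a root.
    depth = np.zeros(n, dtype=np.int64)
    current = parent.copy()
    active = current >= 0
    while active.any():
        depth[active] += 1
        current[active] = parent[current[active]]
        active = current >= 0
    return depth, child_count

# Tiny tree: 0 is the root, 1 and 2 are its children, 3 is under 1, 4 under 3.
depth, child_count = tree_features([-1, 0, 0, 1, 3])
print(depth.tolist())        # [0, 1, 1, 2, 3]
print(child_count.tolist())  # [2, 1, 0, 1, 0]
```

The number of passes equals the depth of the deepest node, so each pass is one NumPy operation over the whole table rather than a Python-level lookup per row.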


In [5]:
# A little more feature engineering: take log_2 of the file sizes.
#   Sizes span such a wide range that the log-scale distribution is closer
#   to normal or bimodal, and is easier to read.
#   Zero-byte files give -inf, which we replace with -1 for graphability.
a["Size Log2"] = np.log2(a.Size)
a["Size Log2"] = a["Size Log2"].clip(lower=-1)

# Correct some data types
a.index = a.index.astype(int)
a.Parent = a.Parent.astype(int)
a["Is Link To"] = a["Is Link To"].astype(int)
a.Desktop = a.Desktop.astype(int)
a["Child Count"] = a["Child Count"].astype(int)
a.Depth = a.Depth.astype(int)

# Show a bit of the user space
a[a.Userspace == 1]



Index,Parent,Size,Is Directory,Is Regular File,Is Link To,Desktop,Path,Irregular,Userspace,Recency,Modification Recency,Child Count,Depth,Size Log2
848763,22,4096,True,False,-1,1,/home/nate/,False,True,0 days 00:36:41.996604,0 days 00:37:20.644943,37,2,12.000000
848764,848763,4096,True,False,-1,1,/home/nate/Projects/,False,True,0 days 01:29:51.600762,72 days 02:15:50.955432,30,3,12.000000
848770,848763,4096,True,False,-1,3,/home/nate/Desktop/,False,True,0 days 22:54:09.251098,29 days 18:36:28.881459,7,3,12.000000
848771,848763,4096,True,False,-1,1,/home/nate/Documents/,False,True,0 days 18:47:47.089086,3 days 15:06:10.032516,24,3,12.000000
848775,848763,4096,True,False,-1,1,/home/nate/Pictures/,False,True,0 days 18:47:38.272122,66 days 10:50:52.985739,4,3,12.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
989939,848789,2405497,False,True,-1,1,/home/nate/Downloads/sublist 16 Nov 2019.pdf,False,True,17 days 14:01:01.051904,17 days 14:01:30.772572,0,4,21.197904
989940,848789,2409462,False,True,-1,1,/home/nate/Downloads/sublist 2 Nov 2019.pdf,False,True,25 days 17:08:26.736386,31 days 21:03:26.888391,0,4,21.200280
989941,848789,2784701,False,True,-1,1,/home/nate/Downloads/sublist 14 sept 2019.pdf,False,True,62 days 13:07:49.768834,78 days 20:25:01.157452,0,4,21.409091
989942,848789,2411318,False,True,-1,1,/home/nate/Downloads/sublist 25 Oct 2019.pdf,False,True,17 days 14:01:03.975597,35 days 23:45:49.229982,0,4,21.201390
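
Taking the logarithm of a zero-byte size produces -inf along with a NumPy runtime warning. A minimal sketch of one way to get the same -1 placeholder without computing the log of zero is shown below. It uses the `out` and `where` arguments of `np.log2` on a made-up array. With the real table, `a.Size.to_numpy()` would presumably take the place of `sizes`.

```python
import numpy as np

sizes = np.array([0, 1, 1024, 4096])  # made-up byte counts

# Start from the -1 placeholder, then overwrite only the positive sizes.
size_log2 = np.full(sizes.shape, -1.0)
np.log2(sizes, out=size_log2, where=sizes > 0)

print(size_log2.tolist())
```

Entries where the mask is false are never computed, so they keep the -1 they were initialized with.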


In [6]:
a.to_pickle("data0.pkl")
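
Pickling keeps the engineered dtypes, including the timedelta columns, which a CSV export would flatten to strings. A small round-trip sketch on a toy frame illustrates this. The paths and values are hypothetical, and a temporary directory is used so nothing is written next to the notebook.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "Path": ["/home/user/", "/home/user/notes.txt"],  # hypothetical paths
    "Recency": pd.to_timedelta(["0 days 00:36:41", "17 days 14:01:01"]),
    "Depth": [2, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values or dtypes differ after the round trip.
pd.testing.assert_frame_equal(toy, restored)
print(restored.dtypes)
```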