# Analyze Notebook

Jupyter notebook used to test and document the Analyze operations performed by `analyze.py`.


## Initializations

Required initializations and configuration.

Please **note that file names must be set to valid values before running the notebook and cleared again before submitting code to the repository**. Exceptions such as 'File not found' are deliberately left unhandled to keep the code clear.

### CSV Structure

Fields in CSV files are: `PATH;FILENAME;SIZE;CHANGED;HASH;PREV_HASH;DATE;MODIFICATION`
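
As a quick illustration of this layout, the sketch below parses a small in-memory CSV with the same header and the `;` separator. The two rows, their paths, hashes and date formats are invented for the example and are not taken from the real test files. It also assumes `CHANGED` is stored as `True`/`False`, which pandas reads as booleans by default.

```python
import io
import pandas as pd

# Hypothetical sample rows, invented for illustration only
sample_csv = io.StringIO(
    "PATH;FILENAME;SIZE;CHANGED;HASH;PREV_HASH;DATE;MODIFICATION\n"
    "C:\\data;a.txt;10;False;abc;abc;2020-01-01;2019-12-31\n"
    "C:\\data;b.txt;20;True;def;xyz;2020-01-01;2019-12-30\n"
)

# Same separator as the notebook uses for the real files
df_sample = pd.read_csv(sample_csv, sep=';')

print(list(df_sample.columns))
print(df_sample['CHANGED'].dtype)  # boolean column, usable directly as a mask
```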

In [1]:
import pandas as pd
import numpy as np

VALUE_DEFAULT_SEPARATOR = ';'
TXT_O_CSV_HEADER = "PATH;FILENAME;SIZE;CHANGED;HASH;PREV_HASH;DATE;MODIFICATION"


filename1 = "tests\\test.csv"
filename2 = "tests\\testBitrot.csv"

# Load CSV files as dataframes
df_file1 = pd.read_csv(filename1, sep=VALUE_DEFAULT_SEPARATOR)
df_file2 = pd.read_csv(filename2, sep=VALUE_DEFAULT_SEPARATOR)

# Append a new column with the source CSV used
df_file1 = df_file1.assign(SOURCE=filename1)
df_file2 = df_file2.assign(SOURCE=filename2)

# Dataframe to return with selected items
#df_return

## Search for duplicated files

An entry is a duplicate of another when both entries share the same hash and file size. You cannot rely on the hash alone, because different files can produce the same hash (a collision).
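
A minimal sketch of the (hash, file size) criterion, using `DataFrame.duplicated` on a toy dataframe. The rows are hypothetical. The last row shares a hash with the first two but has a different size, so it is not reported as a duplicate.

```python
import pandas as pd

# Toy data, invented for illustration only
df_demo = pd.DataFrame({
    "FILENAME": ["a.txt", "copy_of_a.txt", "b.txt", "c.txt"],
    "SIZE": [10, 10, 10, 20],
    "HASH": ["abc", "abc", "xyz", "abc"],
})

# keep=False marks every member of a duplicate group, not only the repeats
is_duplicate = df_demo.duplicated(subset=["HASH", "SIZE"], keep=False)
df_duplicates = df_demo[is_duplicate]

print(df_duplicates)
```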

### CASE 1. Files marked as `CHANGED` are candidates to have suffered Bit-rot

In [2]:
df_modified_files1 = df_file1[df_file1['CHANGED']]
df_modified_files2 = df_file2[df_file2['CHANGED']]

### CASE 2. Files with same relevant information but different `HASH`

In [3]:
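# Build a key from size, modification date and full path, so that entries describing the same file match
df_file1['MarkerFull'] = df_file1['SIZE'].astype(str) + '|' + df_file1['MODIFICATION'] + '|' + df_file1['PATH'] + "\\" + df_file1['FILENAME']
df_file2['MarkerFull'] = df_file2['SIZE'].astype(str) + '|' + df_file2['MODIFICATION'] + '|' + df_file2['PATH'] + "\\" + df_file2['FILENAME']
# Inner join keeps only files present in both CSVs; with a left join, unmatched rows get NaN in HASH_2 and would be flagged as different
df_joint = df_file1.set_index('MarkerFull').join(df_file2.set_index('MarkerFull'), rsuffix='_2', how='inner')

# Same file in both CSVs, but the hash differs
df_diff_hash = df_joint[(df_joint['HASH'] != df_joint['HASH_2'])]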
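
When comparing two listings, it can help to separate files that exist in both CSVs with different hashes from files that simply have no counterpart. The sketch below is one possible way to do that with `DataFrame.merge` and its `indicator` option. It is not what `analyze.py` does. The toy dataframes and the names `df_a`, `df_b`, `df_hash_mismatch` and `df_missing` are hypothetical.

```python
import pandas as pd

# Columns that must agree for two entries to describe the same file
keys = ["PATH", "FILENAME", "SIZE", "MODIFICATION"]

# Toy data, invented for illustration only
df_a = pd.DataFrame({
    "PATH": ["d", "d", "d"],
    "FILENAME": ["a.txt", "b.txt", "c.txt"],
    "SIZE": [10, 20, 30],
    "MODIFICATION": ["2019-12-31"] * 3,
    "HASH": ["h1", "h2", "h3"],
})
df_b = pd.DataFrame({
    "PATH": ["d", "d"],
    "FILENAME": ["a.txt", "b.txt"],
    "SIZE": [10, 20],
    "MODIFICATION": ["2019-12-31"] * 2,
    "HASH": ["h1", "h2_changed"],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
df_merged = df_a.merge(df_b, on=keys, how="left", suffixes=("", "_2"), indicator=True)

in_both = df_merged["_merge"] == "both"
# Present in both listings, but with a different hash: bit-rot candidates
df_hash_mismatch = df_merged[in_both & (df_merged["HASH"] != df_merged["HASH_2"])]
# Present only in the first listing: nothing to compare against
df_missing = df_merged[~in_both]
```

Merging on a list of columns avoids building a concatenated string key, at the cost of requiring the key columns to have the same dtype in both dataframes.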

### CASE 3. Files with same relevant information, different `PATH` (moved or under different folder tree structure) and different `HASH`
We assume that a file whose name has changed was modified deliberately, so it is not considered a bit-rot candidate.

In [9]:
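# Key on size, modification date and file name only, so that moved files still match
df_file1['MarkerName'] = df_file1['SIZE'].astype(str) + '|' + df_file1['MODIFICATION'] + '|' + df_file1['FILENAME']
df_file2['MarkerName'] = df_file2['SIZE'].astype(str) + '|' + df_file2['MODIFICATION'] + '|' + df_file2['FILENAME']
# Both sides must be indexed on 'MarkerName' for the keys to match
df_joint = df_file1.set_index('MarkerName').join(df_file2.set_index('MarkerName'), rsuffix='_2', how='inner')

# Keep pairs that live under a different PATH and whose hash differs
df_diff_name_hash = df_joint[(df_joint['PATH'] != df_joint['PATH_2']) & (df_joint['HASH'] != df_joint['HASH_2'])]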
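
The moved-file comparison can be tried on toy data before pointing it at real CSV files. The sketch below is an illustration under assumptions: the dataframes are invented, and it uses `DataFrame.merge` on the name, size and modification columns instead of a string marker. `a.txt` appears under a different path with a different hash and is reported. `b.txt` stays in the same path, so it belongs to CASE 2 and is excluded here.

```python
import pandas as pd

# Columns that identify a file regardless of the folder it lives in
name_keys = ["FILENAME", "SIZE", "MODIFICATION"]

# Toy data, invented for illustration only
df_old = pd.DataFrame({
    "PATH": ["old", "old"],
    "FILENAME": ["a.txt", "b.txt"],
    "SIZE": [10, 20],
    "MODIFICATION": ["2019-12-31", "2019-12-30"],
    "HASH": ["h1", "h2"],
})
df_new = pd.DataFrame({
    "PATH": ["new", "old"],
    "FILENAME": ["a.txt", "b.txt"],
    "SIZE": [10, 20],
    "MODIFICATION": ["2019-12-31", "2019-12-30"],
    "HASH": ["h1_rotted", "h2_rotted"],
})

# Inner merge: only files that can be paired up by name, size and date
df_pairs = df_old.merge(df_new, on=name_keys, how="inner", suffixes=("", "_2"))

moved = df_pairs["PATH"] != df_pairs["PATH_2"]
hash_differs = df_pairs["HASH"] != df_pairs["HASH_2"]
df_moved_diff_hash = df_pairs[moved & hash_differs]

print(df_moved_diff_hash[["FILENAME", "PATH", "PATH_2", "HASH", "HASH_2"]])
```

If several files share the same name, size and modification date, the merge pairs every combination, so the result may contain more rows than either input.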

In [5]:
#Option 1: using a dataframe with duplicated Ids to look for duplicated hashes
#hashes = df_file1["HASH"]
#df_file1_duplicated_hashes = df_file1[hashes.isin(hashes[hashes.duplicated()])]
#df_file1_single_hashes = df_file1[~hashes.isin(hashes[hashes.duplicated()])] # The other part of the files

#Option 2: Duplicated values grouped by hash and file size (slower)
#df_file1_grouped_duplicates = pd.concat(g for _, g in df_file1.groupby(["HASH", "SIZE"]) if len(g) > 1)

In [6]:
#df_modified_files = df_everything[df_everything['CHANGED']]   # Maintain all the columns
#df_modified_files = df_everything.loc[df_everything['CHANGED'], ["PATH", "FILENAME", "SIZE", "HASH", "MODIFICATION", "SOURCE"]] # Maintain only selected columns

In [7]:
# Joining tests
#df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
#other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['B0', 'B1', 'B2']})
#d1=df.join(other, lsuffix='_caller', rsuffix='_other')
#d2=df.set_index('key').join(other.set_index('key'), lsuffix='_caller', rsuffix='_other')
#d3=df.join(other.set_index('key'), on='key', lsuffix='_caller', rsuffix='_other')