# Calculating offset for ECCO reuse data

By running this code you should be able to calculate the correct start and end values for the reuse data so that the Octavo Reader works correctly. 

I am no expert in Python or in data structures and algorithms, so the code is not very optimised. It seems to run in reasonable time, though, so I did not spend much effort on optimising it.

## Running the code

Let's start by importing the libraries we need for this.
If some of the libraries fail to import, you probably have to install them first.
`json` and `os` come with Python, so in practice this means `pandas`, which can usually be installed with `pip install pandas`.

In [None]:
import json
import os
import pandas as pd

Determine the path to the folder where your offset data is. There should be no other files or folders in it.

A fast way to do this is to locate one of the files in that folder, right-click it, choose Properties and copy the path shown there.

At least on Windows the path will contain backslashes. A backslash is an escape character in Python strings, so you need to add another backslash before each one for the path to work. Forward slashes instead of double backslashes should work too. There is an example below.

Characters like Å, Ä, Ö might cause problems with paths so you might want to avoid them.

Relative paths should also work.
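
The different ways of writing a Windows path can be compared directly. The following is a minimal sketch that reuses the example folder from this notebook; `PureWindowsPath` is used only so that the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Doubled backslashes: each "\\" in the source becomes one backslash in the string.
escaped = "C:\\Users\\mikko\\ECCO_data\\offset_data\\"

# Raw string: backslashes are taken literally. A raw string cannot end in a
# single backslash, so the trailing separator is appended separately.
raw = r"C:\Users\mikko\ECCO_data\offset_data" + "\\"

# Forward slashes describe the same location on Windows.
forward = "C:/Users/mikko/ECCO_data/offset_data/"

print(escaped == raw)
print(PureWindowsPath(escaped) == PureWindowsPath(forward))
```

Both comparisons should print `True`: the first two strings are identical, and the third one points to the same folder once it is parsed as a Windows path.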

In [None]:
# Replace this path with your own. 
offsetPath = "C:\\Users\\mikko\\ECCO_data\\offset_data\\" 

Then we get a list of all the offset files in that folder.
You can print the list to make sure that all the files are there.

In [None]:
arrayOfOffsetFiles = os.listdir(offsetPath)

In [None]:
print(arrayOfOffsetFiles)
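
If it is hard to keep the folder free of other files, one option is to filter the listing instead. The sketch below assumes that the offset files end in `.json`, which this notebook does not actually state; `list_json_files` is a hypothetical helper and the temporary folder only stands in for the real one.

```python
import os
import tempfile

def list_json_files(folder):
    """Return the sorted names of regular files in `folder` that end in .json."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(".json")
        and os.path.isfile(os.path.join(folder, name))
    )

# Demonstration on a throwaway folder with one JSON file, one text file
# and one subfolder whose name happens to end in .json.
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "a.json"), "w").close()
    open(os.path.join(folder, "notes.txt"), "w").close()
    os.mkdir(os.path.join(folder, "sub.json"))
    found = list_json_files(folder)

print(found)
```

With the real data this would be called as `list_json_files(offsetPath)` in place of `os.listdir(offsetPath)`.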

Next we will make a new dictionary that contains all the JSON objects in the files as dictionaries. This will make it easy to search the right ids later.

I have commented out a piece of code where you can check if some id is actually in the data.
If the id exists, the code prints the name of the file it is in, followed by "true".

When I was running this, some ids were missing and some were in multiple files, so if the offsets are not working after
running this, it might be worth it to check that the id is actually in the data.

In [None]:
offsetObject = dict()

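for name in arrayOfOffsetFiles:
    # Parse the JSON content of each offset file; the with-block closes the file afterwards
    with open(offsetPath + name, encoding="utf8") as file:
        data = json.load(file)

    #Replace the id with your own
    #if data.get("1702800101"):
    #      print(name)
    #      print("true")

    # Merge into the combined dictionary; if an id is in several files, the file read last wins
    offsetObject.update(data)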
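
Because the loop above merges every file into one dictionary, an id that occurs in several files keeps only the entry from the file that was read last. To see which ids are duplicated or missing, something like the following sketch could be used. It is not part of the original code: `files_per_id` is a hypothetical helper, and the ids and file names in the toy data are made up.

```python
from collections import defaultdict

def files_per_id(data_by_file):
    """Map each id to the sorted list of file names whose JSON object contains it."""
    seen = defaultdict(list)
    for name, data in data_by_file.items():
        for doc_id in data:
            seen[doc_id].append(name)
    return {doc_id: sorted(names) for doc_id, names in seen.items()}

# Toy stand-in for the parsed files (file name -> parsed JSON object).
data_by_file = {
    "offsets_a.json": {"0000000001": {}, "0000000002": {}},
    "offsets_b.json": {"0000000002": {}},
}

index = files_per_id(data_by_file)
duplicates = {doc_id: names for doc_id, names in index.items() if len(names) > 1}

print(duplicates)             # ids found in more than one file
print("0000000003" in index)  # False means the id is missing from the data
```

With the real data, `data_by_file` would have to be filled inside the loading loop, for example by storing `data` under `name` for each file.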


We also need to locate the path of the reuse data folder.
Same procedure as with the offset data.
Again, check that there are no other files or folders there, or the code will try to run on those too, which will probably lead to errors or unnecessary work.

In [None]:
# Replace this path with your own. 
dataPath = "C:\\Users\\mikko\\ECCO_data\\spectator1720\\"
dataFiles = os.listdir(dataPath)

Here is a function for actually calculating the new offset.
It takes the following arguments: the row we are currently changing, the name of the column where the id is located, and the name of the column where the start/end point of the reuse is located.

You probably don't need to change anything here unless you want to optimize my code.
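
As a small worked example of the calculation, assume that the offset data for one id looks like the toy dictionary below. The shape (position keys stored as strings, each with an `offset` and a `header`) is inferred from how the function reads the data; the numbers and header names are invented.

```python
# Toy offset entries for a single document id (invented values).
offsets_for_id = {
    "0": {"offset": 0, "header": "Title page"},
    "100": {"offset": 7, "header": "Chapter 1"},
    "250": {"offset": 19, "header": "Chapter 2"},
}

base = 120  # start or end position taken from the reuse data

# Find the largest offset position that is strictly smaller than the current position.
candidates = [int(key) for key in offsets_for_id if int(key) < base]
nearest = str(max(candidates))

# Add the cumulative offset stored at that position to the current position.
corrected = base + offsets_for_id[nearest]["offset"]
print(nearest, corrected, offsets_for_id[nearest]["header"])

# Ids from the reuse data are left-padded with zeros to ten characters before the lookup.
padded = str(12345).zfill(10)
print(padded)
```

Here the nearest position below 120 is 100, so the corrected value is 120 + 7 = 127 and the header is "Chapter 1". The padded id is `0000012345`.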

In [None]:
def calcOffset(row, id_column, char_column):
    
    # The value of start/end currently
    base = int(row[char_column])
    
    # The cumulative offset we will add to base
    biggestSmaller = None
    
    # The offset dictionary that matches the id of the current row.
    # Here I have added some zeros as padding to the left side of the id since the offset data and reuse data had the ids in
    # different form
    offset = offsetObject.get(str(row[id_column]).zfill(10))
        
    #If the offset is not found we can just return the initial value
    if not offset:
        return pd.Series([base, None])
    
    # Determining the new offset works here so that we are looking for the biggest offset point from the offset data that
    # is smaller than the current position. After finding that we can just add the cumulative offset to the current position.
    # Currently we are checking all the items in dictionary which is not very optimized. 
    for key, value in offset.items():
        if int(key) < base and (not biggestSmaller or int(key) > int(biggestSmaller[0])):
            biggestSmaller = (key, value)
        
    if not biggestSmaller:
        return pd.Series([base, None])
    
    return pd.Series([base + biggestSmaller[1].get("offset"), biggestSmaller[1].get("header")])
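
The loop above scans every offset position for every row. If that becomes too slow, one possible alternative is to sort the positions once per id and then use binary search. This is only a sketch under that assumption and not the code used in this notebook; `build_index` and `find_biggest_smaller` are hypothetical names and the toy data is invented.

```python
from bisect import bisect_left

def build_index(offset):
    """Return (sorted integer positions, entries keyed by integer position) for one id."""
    by_position = {int(key): value for key, value in offset.items()}
    return sorted(by_position), by_position

def find_biggest_smaller(positions, base):
    """Return the largest position strictly smaller than base, or None if there is none."""
    index = bisect_left(positions, base)
    if index == 0:
        return None
    return positions[index - 1]

# Toy offset entries for one id, deliberately not in sorted order.
offset = {
    "250": {"offset": 19, "header": "Chapter 2"},
    "0": {"offset": 0, "header": "Title page"},
    "100": {"offset": 7, "header": "Chapter 1"},
}
positions, by_position = build_index(offset)

position = find_biggest_smaller(positions, 120)
corrected = 120 + by_position[position]["offset"]
print(position, corrected)
print(find_biggest_smaller(positions, 100))  # equal positions do not count as smaller
print(find_biggest_smaller(positions, 0))
```

The index would need to be built once per id and reused across rows for this to pay off; building it again for every row would be slower than the original loop.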

Lastly, we want to run this function on all rows of all data files.

If you have different column names, you might need to change the parameters, but you probably don't need to.

Choose a folder where you want to save the new files. This should probably be a different folder from the one where the current data is. Also make sure that the folder actually exists.

The new start and end points are stored in new columns called "offsetPrimaryStart" and so on, but those names can also be changed if needed.

In [None]:
# From an optimization point of view you should probably run just a single lambda to make it faster, but I did not bother.
# In other words, this might take a moment.

for file in dataFiles:

    currentFile = pd.read_csv(dataPath + file)
    
    
    # Changing the names in the brackets allows you to change the columns where the new values are created
    currentFile[['offsetPrimaryStart', 'primaryStartHeader']] = currentFile.apply(
        # Changing these arguments allows you to change from which columns the values are read
        lambda row: calcOffset(row, "id_primary", "text_start_primary"), axis=1
    )
    currentFile[['offsetPrimaryEnd', 'primaryEndHeader']] = currentFile.apply(
        lambda row: calcOffset(row, "id_primary", "text_end_primary"), axis=1
    )
    currentFile[['offsetSecondaryStart', 'secondaryStartHeader']] = currentFile.apply(
        lambda row: calcOffset(row, "id_secondary", "text_start_secondary"), axis=1
    )
    currentFile[['offsetSecondaryEnd', 'secondaryEndHeader']] = currentFile.apply(
        lambda row: calcOffset(row, "id_secondary", "text_end_secondary"), axis=1
    )
    
    # Here you can change where the new files are saved. Keep the file name at the end so that each file is saved
    # under its own name. For example, in my case this would result in file names such as:
    # "C:\Users\mikko\ECCO_data\fixedOffset\spectator1720\fixedOffset_spectator1.csv"
    currentFile.to_csv("C:\\Users\\mikko\\ECCO_data\\fixedOffset\\spectator1720\\fixedOffset_" + file)
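
The output folder has to exist before `to_csv` is called. A minimal sketch of creating it and building the file name follows. It is not part of the original code: `build_output_path` is a hypothetical helper, and a temporary directory stands in for the real output location.

```python
import os
import tempfile

def build_output_path(output_folder, file_name, prefix="fixedOffset_"):
    """Create output_folder if it is missing and return the path for the fixed file."""
    os.makedirs(output_folder, exist_ok=True)
    return os.path.join(output_folder, prefix + file_name)

# Demonstration in a throwaway location instead of the real ECCO_data folder.
with tempfile.TemporaryDirectory() as root:
    output_folder = os.path.join(root, "fixedOffset", "spectator1720")
    output_path = build_output_path(output_folder, "spectator1.csv")
    folder_exists = os.path.isdir(output_folder)

print(os.path.basename(output_path))
print(folder_exists)
```

Inside the loop, the result of `build_output_path` could then be passed to `currentFile.to_csv` in place of the concatenated string.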