In [1]:
import pandas as pd

# For preprocessing
import csv
from os import path
from pathlib import Path

# Set the project root folder to where this file is located
root_dir = Path().absolute()

## Importing the data

The import raises an error, so we first wrap it in a try-except block to capture the message.

In [2]:
# Attempting to read the file
try:
    data = pd.read_csv('../data/accre-jobs-2020.csv')
except Exception as error:
    print(error)

Error tokenizing data. C error: Expected 13 fields in line 3461, saw 15



The parser reports the problem at line 3461 of the file. That count is 1-based, so with our 0-based indexing it corresponds to index 3460.

Looking at the file directly, the import fails because some values in that line are formatted as `[1,2,3,4]`. The commas inside the brackets trick the CSV parser into treating the value as several columns instead of one. Let's fix that!
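
Before fixing it, it can help to check how many lines are affected. The following is a minimal sketch, not part of the original notebook: `find_irregular_rows` is a hypothetical helper, and the three-column sample is made up so that the block runs without the data file. It counts the fields on each line and reports those that differ from the expected count.

```python
import csv
import io


def find_irregular_rows(lines, expected_fields):
    """Return (line_number, field_count) for rows whose field count differs."""
    irregular = []
    # Number rows from 1 so the header is line 1, as in the parser's message
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_fields:
            irregular.append((line_number, len(row)))
    return irregular


# Made-up sample: the last row holds an unquoted bracketed node list
sample = io.StringIO(
    "JOBID,STATE,NODELIST\n"
    "1,COMPLETED,cn1531\n"
    "2,COMPLETED,cn1441\n"
    "3,COMPLETED,cn[449,463,911]\n"
)
print(find_irregular_rows(sample, expected_fields=3))
```

On the real data, the same helper could be given the opened source file and `expected_fields=13`.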

If we put that last column within quotes as `"[1,2,3,4]"`, the csv parser would be able to interpret it as one column.
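
As a small illustration of the effect of quoting (a sketch on made-up two-column strings, not on the real file), the standard `csv` reader splits the unquoted value at every comma but keeps the quoted one whole:

```python
import csv
import io

# The same two-column record, without and with quotes around the node list
unquoted = 'COMPLETED,cn[449,463,911,913]\n'
quoted = 'COMPLETED,"cn[449,463,911,913]"\n'

unquoted_row = next(csv.reader(io.StringIO(unquoted)))
quoted_row = next(csv.reader(io.StringIO(quoted)))

print(len(unquoted_row), unquoted_row)  # the list is split into separate fields
print(len(quoted_row), quoted_row)      # the list stays in a single field
```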

## Preprocessing the file

Here, we read the original file line by line. The first 12 columns are kept as they are, and everything from the 13th column onward is joined back into a single value. When that value contains commas, `csv.writer` wraps it in quotes automatically. We write the result to a new file, which we will use as our dataset with Pandas.
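
To see the slice-and-join step in isolation, here is a minimal sketch on a single in-memory row. The helper name `merge_trailing_fields` is hypothetical, and the raw line is reconstructed from the row displayed later in this notebook on the assumption that the node list was unquoted in the original file.

```python
import csv
import io


def merge_trailing_fields(row, n_columns=13):
    """Keep the first n_columns - 1 fields and join the rest into the last one."""
    keep = n_columns - 1
    return row[:keep] + [",".join(row[keep:])]


# Assumed raw form of the problematic row (unquoted node list)
raw = ("17050901_91,winged,lavonda,4096Mn,669.61M,12:00:00,00:06:05,"
       "4,1,production,0:0,COMPLETED,cn[449,463,911,913]")
row = next(csv.reader([raw]))
merged = merge_trailing_fields(row)

# csv.writer quotes the merged field because it contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerow(merged)

print(len(row), len(merged))
print(buffer.getvalue())
```

Note that a row with fewer than 13 fields would get an empty string appended as its last field, so this approach assumes every row has at least 13 fields.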

In [3]:
def preprocess_original_csv_data():
    try:
        # The data might have already been preprocessed previously.
        # If not, this line raises FileNotFoundError and we fall back to the handler below.
        data = pd.read_csv('../data/accre-jobs-2020-processed.csv')
        print('Reading from previously processed file...')
        display(data.head())

    except FileNotFoundError:
        print("Preprocessed file not found. Creating a new one. Please wait...")

        # Open 2 files: One is the source (read) and one is the destination (write)
        with open(path.join(root_dir, '..', 'data', 'accre-jobs-2020.csv'), mode='r') as source,\
        open(path.join(root_dir, '..', 'data', 'accre-jobs-2020-processed.csv'), mode='w', newline='') as destination:

            # Set file reader and file writer on the source and destination
            reader = csv.reader(source)
            writer = csv.writer(destination)

            # Go through each line in the source file
            for line in reader:
                # Keep the first 12 columns and join everything after them into one value.
                # csv.writer quotes that value if it contains commas, so it stays a single column.
                newline = line[:12] + [",".join(line[12:])]
                # Write newline to the destination file
                writer.writerow(newline)

    except Exception as error:
        print("Something went wrong:", error)

**This line only needs to run once!\
Uncomment and run if needed to re-process the data file!\
This line will take a while to run!**

In [4]:
# preprocess_original_csv_data()

## New Import

Now we can work from the processed file instead.

In [5]:
# Try importing as normal again on the newly created destination file
data = pd.read_csv('../data/accre-jobs-2020-processed.csv')
data.head()

| | JOBID | ACCOUNT | USER | REQMEM | USEDMEM | REQTIME | USEDTIME | NODES | CPUS | PARTITION | EXITCODE | STATE | NODELIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15925210 | treviso | arabella | 122880Mn | 65973.49M | 13-18:00:00 | 13-18:00:28 | 1 | 24 | production | 0:0 | COMPLETED | cn1531 |
| 1 | 15861126 | treviso | arabella | 122880Mn | 67181.12M | 13-18:00:00 | 12-14:50:56 | 1 | 24 | production | 0:0 | COMPLETED | cn1441 |
| 2 | 15861125 | treviso | arabella | 122880Mn | 69111.86M | 13-18:00:00 | 13-18:00:20 | 1 | 24 | production | 0:0 | COMPLETED | cn1464 |
| 3 | 16251645 | treviso | arabella | 122880Mn | 65317.33M | 13-18:00:00 | 12-03:50:32 | 1 | 24 | production | 0:0 | COMPLETED | cn1473 |
| 4 | 16251646 | treviso | arabella | 122880Mn | 65876.11M | 13-18:00:00 | 13-18:00:03 | 1 | 24 | production | 0:0 | COMPLETED | cn1440 |


Let's look at line 3461 (Index 3460) to confirm.

In [6]:
# Check the previously problematic line
data.iloc[[3460]]

| | JOBID | ACCOUNT | USER | REQMEM | USEDMEM | REQTIME | USEDTIME | NODES | CPUS | PARTITION | EXITCODE | STATE | NODELIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3460 | 17050901_91 | winged | lavonda | 4096Mn | 669.61M | 12:00:00 | 00:06:05 | 4 | 1 | production | 0:0 | COMPLETED | cn[449,463,911,913] |


Yep, the row now imports normally, so we are good to go! Moving forward, we will work from the *preprocessed* file.