# Removing first column index

While first clustering the output of extract_transform_attentions, we got some puzzling results. In particular, we used PCA to test how effective dimensionality reduction would be, and it turned out that PCA does a poor job of retaining explained variance on our data.

We only discovered that after our first head-scratcher: on inspection, the first principal component was monotonically increasing. Looking at the raw output of extract_transform_attentions, we realized that when saving to CSV we had written the DataFrame index as the first column. Since the pipeline had taken the better part of a week to run and had produced ~400GB of data, we wrote this short script to strip that first column from the dataset in a few hours rather than rerun the pipeline. This time we could at least take advantage of multiprocessing, but RAM usage was still a concern, so only 2 files could be processed concurrently.
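
A quick check like the following could have flagged the stray column before any PCA was run. It is only a heuristic sketch, not part of the original pipeline: the function name `looks_like_saved_index` and the `sample_rows` parameter are made up for illustration, and it assumes the saved index is an unnamed, consecutive integer index, which is what pandas writes by default when `to_csv` is called without `index=False`.

```python
import csv
from itertools import islice


def looks_like_saved_index(path, sample_rows=1000):
    """Heuristic: does the first CSV column look like a saved pandas index?"""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if not header:
            return False
        # pandas writes an unnamed index with an empty header cell; after a
        # round trip through read_csv that column is labelled "Unnamed: 0".
        if header[0] not in ("", "Unnamed: 0"):
            return False
        # A default index counts upwards by exactly 1 from row to row.
        previous = None
        for row in islice(reader, sample_rows):
            try:
                value = int(row[0])
            except (IndexError, ValueError):
                return False
            if previous is not None and value != previous + 1:
                return False
            previous = value
    # Only report True if at least one data row was inspected.
    return previous is not None
```

Only the header and the first `sample_rows` rows are read, so the check stays cheap even on very large files.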

The same was true of the copy of the data that had taken overnight to upload to a Google Cloud Storage bucket, so a slightly modified version of this script was run overnight on a GCP Compute Engine VM as well.

In [None]:
import pandas as pd
import logging
import time

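import multiprocessing as mp
# Two workers only: each worker loads a whole file into memory, so RAM limits concurrency.
# Create the pool after clean_index is defined: with the fork start method,
# workers only see functions that already exist when the pool is created.
p = mp.Pool(2)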

In [2]:
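def clean_index(count):
    """Drop the stray index column from one CSV file and write the cleaned copy."""
    start_time = time.time()
    print(f'Beginning transform of {count} ...')
    # The whole file is loaded into memory, which is what limits concurrency.
    df = pd.read_csv(f'representation_df_{count}.csv')
    # Column 0 is the index that was saved by accident; keep everything after it.
    df = df.iloc[:, 1:]
    print(f'Writing transformed {count} ...')
    # index=False so the index is not written out a second time.
    df.to_csv(f'final/final_representation_df_{count}.csv', index=False)
    print(f'--- Finished writing transformed {count} in {(time.time() - start_time)/60} minutes ---"')
    print("--- %s seconds ---" % (time.time() - start_time))
    return count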
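
Because clean_index loads each file fully into a DataFrame, memory is what capped us at 2 concurrent files. A streaming alternative, sketched below with the standard-library `csv` module instead of pandas, copies one row at a time so memory use stays roughly constant. This is not what the notebook ran; `drop_first_column_streaming` is a hypothetical name, and the sketch assumes plain comma-separated files with a single header row.

```python
import csv


def drop_first_column_streaming(src_path, dst_path):
    """Copy a CSV row by row without its first column.

    Returns the number of data rows written (the header is not counted).
    """
    rows_written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        # lineterminator="\n" avoids the csv module's default "\r\n" endings.
        writer = csv.writer(dst, lineterminator="\n")
        header = next(reader, None)
        if header is None:
            return 0  # empty input file: nothing to copy
        writer.writerow(header[1:])
        for row in reader:
            # Values are copied as text, so numbers are not re-parsed or reformatted.
            writer.writerow(row[1:])
            rows_written += 1
    return rows_written
```

Since only one row is held in memory at a time, a variant like this would let the pool size be chosen by CPU and disk throughput rather than by RAM.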

In [3]:
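## For running on GCP: reads from and writes to the GCS bucket directly (pandas needs gcsfs installed to handle gs:// paths)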
# def clean_index(count):
#         start_time = time.time()
#         print(f'Beginning transform of {count} ...')
#         df = pd.read_csv(f'gs://representations/raw_representations_with_index/representation_df_{count}.csv')
#         df = df.iloc[:,1:]
#         print(f'Writing transformed {count} ...')
#         df.to_csv(f'gs://representations/final/final_representation_df_{count}.csv', index=False)  
#         print(f'--- Finished writing transformed {count} in {(time.time() - start_time)/60} minutes ---"')
#         print("--- %s seconds ---" % (time.time() - start_time))
#         return count
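
The local and GCP versions differ only in where they read from and write to, so one way to avoid keeping a commented-out copy is to build the paths from parameters. The helper below is a sketch of that idea; `csv_paths` and its arguments are hypothetical names, while the file-name pattern and bucket paths are the ones used in the two cells above.

```python
def csv_paths(count, src_dir="", dst_dir="final"):
    """Return (source, destination) paths for one representation file."""
    name = f"representation_df_{count}.csv"
    # An empty src_dir means the file sits in the current working directory.
    src = f"{src_dir.rstrip('/')}/{name}" if src_dir else name
    dst = f"{dst_dir.rstrip('/')}/final_{name}"
    return src, dst


# Local run: same paths as the original clean_index.
local_src, local_dst = csv_paths(5000)

# GCP run: same paths as the commented-out variant.
gcs_src, gcs_dst = csv_paths(
    5000,
    src_dir="gs://representations/raw_representations_with_index",
    dst_dir="gs://representations/final",
)
```

clean_index could then take the two directories as arguments and pass the resulting paths to `pd.read_csv` and `df.to_csv` unchanged.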

In [None]:
x = p.map(clean_index, range(5000,135000,5000))

Beginning transform of 45000 ...
Beginning transform of 65000 ...
Writing transformed 45000 ...
Writing transformed 65000 ...
--- Finished writing transformed 65000 in 29.023671944936115 minutes ---"
--- 1741.4221730232239 seconds ---
Beginning transform of 70000 ...
--- Finished writing transformed 45000 in 29.028982969125114 minutes ---"
--- 1741.742642402649 seconds ---
Beginning transform of 50000 ...
Writing transformed 70000 ...
Writing transformed 50000 ...
--- Finished writing transformed 70000 in 28.93064546585083 minutes ---"
--- 1735.8410482406616 seconds ---
Beginning transform of 75000 ...
--- Finished writing transformed 50000 in 29.01656938791275 minutes ---"
--- 1740.996126651764 seconds ---
Beginning transform of 55000 ...
Writing transformed 75000 ...
Writing transformed 55000 ...
--- Finished writing transformed 75000 in 28.92327152490616 minutes ---"
--- 1735.3983399868011 seconds ---
Beginning transform of 80000 ...
--- Finished writing transformed 55000 in 28.8600
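
In the log above, each worker appears to move through consecutive counts (45000, 50000, 55000 on one; 65000, 70000, 75000, 80000 on the other). That is what we would expect from how `Pool.map` batches work when no `chunksize` is given. The sketch below reproduces the default chunking heuristic from CPython's `multiprocessing.pool` source; it is an implementation detail rather than a documented guarantee, and `default_map_chunks` is a name made up for illustration.

```python
def default_map_chunks(items, n_workers):
    """Split items the way CPython's Pool.map does when chunksize is omitted."""
    items = list(items)
    if not items:
        return []
    # Aim for roughly four chunks per worker, rounding the chunk size up.
    chunksize, extra = divmod(len(items), n_workers * 4)
    if extra:
        chunksize += 1
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


# The 26 file counts passed to p.map, with the pool of 2 workers.
chunks = default_map_chunks(range(5000, 135000, 5000), n_workers=2)
for chunk in chunks:
    print(chunk)
```

With 26 files and 2 workers this gives chunks of 4 files each, so a worker that starts on 45000 keeps going through 60000 before picking up new work. Passing `chunksize=1` to `p.map` would instead hand out one file at a time.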

In [None]:
x = p.map(clean_index, range(5000,45000,5000))

Beginning transform of 15000 ...
Beginning transform of 20000 ...
Writing transformed 20000 ...
Writing transformed 15000 ...
--- Finished writing transformed 20000 in 27.99247100353241 minutes ---"
--- 1679.5508012771606 seconds ---
Beginning transform of 25000 ...
--- Finished writing transformed 15000 in 28.01156253417333 minutes ---"
--- 1680.6961460113525 seconds ---
Beginning transform of 30000 ...
Writing transformed 25000 ...
Writing transformed 30000 ...
--- Finished writing transformed 30000 in 28.235480531056723 minutes ---"
--- 1694.1308937072754 seconds ---
Beginning transform of 35000 ...
--- Finished writing transformed 25000 in 28.256190037727357 minutes ---"
--- 1695.3736989498138 seconds ---
Beginning transform of 40000 ...
Writing transformed 35000 ...
Writing transformed 40000 ...
--- Finished writing transformed 40000 in 28.742069808642068 minutes ---"
--- 1724.5259702205658 seconds ---
--- Finished writing transformed 35000 in 28.77057419617971 minutes ---"
--- 17

In [3]:
clean_index(5000)

Beginning transform of 5000 ...
Writing transformed 5000 ...
--- Finished writing transformed 5000 in 28.037185633182524 minutes ---"
--- 1682.231214761734 seconds ---


5000

In [4]:
clean_index(10000)

Beginning transform of 10000 ...
Writing transformed 10000 ...
--- Finished writing transformed 10000 in 27.491266107559206 minutes ---"
--- 1649.47602891922 seconds ---


10000