This notebook performs initial preprocessing of the raw data obtained from AcademicTorrents, then uploads the processed data to an S3 bucket.

## Preprocess CSV

Spark was not reading the CSV correctly because of some punctuation characters and entries split across multiple lines, so the file had to be preprocessed locally before being uploaded to S3 for further cleaning and analysis. Although this step is somewhat slow (~3 min), it was an important one.

Use of AI - Part of this code was created with ChatGPT using the prompt "I have a huge dataset in a csv and some entries have different rows, how can I do in pyspark so it recognizes an entire cell as just one row?".

The generated code was then modified to also remove punctuation and make parsing easier.

In [1]:
import csv
import string

# Build the translation table once: default punctuation plus the curly
# apostrophe (’), which string.punctuation does not include
to_remove = string.punctuation + "’"
translator = str.maketrans('', '', to_remove)

# newline='' lets the csv module handle line breaks inside quoted fields
with open('submissions.csv', 'r', newline='') as csvfile, \
     open('processed_submissions.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(csvfile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for field in ('selftext', 'title'):
            # Concatenate multiline fields, then remove punctuation
            row[field] = ' '.join(row[field].splitlines()).translate(translator)

        # Write the processed row to the output file
        writer.writerow(row)

print("Preprocessing complete. Processed data saved")

Preprocessing complete. Processed data saved
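
The two transformations above (joining multiline fields and removing punctuation) can be checked on a tiny in-memory sample before running them on the full file. The sketch below is only an illustration: the `sample` string and the `clean` helper are hypothetical and not part of the original notebook, and unlike the cell above it applies the cleaning to every column, because the sample contains only `title` and `selftext`.

```python
import csv
import io
import string

# Hypothetical sample: one record whose 'selftext' spans two physical lines
sample = 'title,selftext\n"Hello, world!","first line\nsecond line"\n'

# Same character set as the preprocessing cell: ASCII punctuation plus ’
translator = str.maketrans('', '', string.punctuation + "’")

def clean(text):
    """Join embedded line breaks with spaces, then drop punctuation."""
    return ' '.join(text.splitlines()).translate(translator)

# The csv module keeps a quoted multiline field together as a single record
reader = csv.DictReader(io.StringIO(sample, newline=''))
rows = [{name: clean(value) for name, value in row.items()} for row in reader]

print(rows)
```

The sample has three physical lines but yields a single record, which is the behavior the preprocessing relies on.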


## Upload to S3

In [3]:
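import boto3

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')  # Not used by the upload below

# Look up the 'LabRole' IAM role; the result is not used by the upload below
iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='LabRole')

# Upload the preprocessed CSV to the bucket under the same file name
s3.upload_file(Filename='processed_submissions.csv',
               Bucket='finalproject-nat-s3',
               Key='processed_submissions.csv')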
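
Since `upload_file` returns nothing on success, one way to confirm the upload is to compare the object's size in S3 with the local file size. The following is a minimal sketch, not part of the original notebook: `upload_and_verify` is a hypothetical helper, and it assumes the client passed in is a boto3 S3 client with valid credentials and permission to call `head_object` on the bucket.

```python
import os

def upload_and_verify(s3_client, filename, bucket, key):
    """Upload a file, then compare the remote object size with the local size.

    Hypothetical helper: s3_client is assumed to be a boto3 S3 client.
    """
    s3_client.upload_file(Filename=filename, Bucket=bucket, Key=key)

    # head_object returns the object's metadata, including its size in bytes
    remote_size = s3_client.head_object(Bucket=bucket, Key=key)['ContentLength']
    local_size = os.path.getsize(filename)

    if remote_size != local_size:
        raise RuntimeError(
            f"Size mismatch for {key}: local {local_size} B, remote {remote_size} B"
        )
    return remote_size

# Usage with the names from the cell above (requires AWS credentials):
# import boto3
# upload_and_verify(boto3.client('s3'), 'processed_submissions.csv',
#                   'finalproject-nat-s3', 'processed_submissions.csv')
```

A size comparison only catches truncated or missing uploads; it does not verify the file's contents.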