# File Import

The raw JSON files containing posts and comments from the r/ADHD subreddit are stored in an S3 bucket as a .tar.gz archive named `radhd.tar.gz`. We will use the `boto3` package to download it:

In [2]:
import boto3
import botocore
import pandas as pd
from glob import glob


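BUCKET_NAME = "radhd-records"
ARCHIVE_KEY = "radhd.tar.gz"  # object key of the archive in the bucket

s3 = boto3.resource("s3")

try:
    # Save the archive locally; note the local filename differs from the S3 key
    s3.Bucket(BUCKET_NAME).download_file(ARCHIVE_KEY, "../data/raw/adhd_records.tar.gz")
except botocore.exceptions.ClientError as e:
    # A 404 means the key is missing from the bucket; re-raise anything else
    if e.response["Error"]["Code"] == "404":
        print("The object does not exist.")
    else:
        raise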
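
The notebook does not show the decompression step. A minimal sketch using the standard-library `tarfile` module could look like the following. The helper name `extract_archive` is hypothetical, and the sketch assumes the archive was saved to the path used in the download cell and that its JSON files should be unpacked into `../data/raw`.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions support extraction filters that reject
        # unsafe members (absolute paths, "..", device files).
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


# Assumed usage in this notebook (paths taken from the download cell):
# extract_archive("../data/raw/adhd_records.tar.gz", "../data/raw")
```

If the archive stores its files inside a subdirectory, the glob pattern in the import cell would need to point at that subdirectory instead.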

After decompressing the .tar.gz, we can import the JSON files into a pandas dataframe:

In [None]:
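# Collect one dataframe per JSON file and concatenate once at the end.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
frames = []

for path in sorted(glob("../data/raw/*.json")):
    print(path)

    df = pd.read_json(path, orient="records")
    # Tag each row by source, based on whether the path contains "posts"
    df["record"] = "post" if "posts" in path else "comment"
    frames.append(df)

# pd.concat raises on an empty list, so fall back to an empty dataframe
records = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()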


