# smart-open

`smart_open` is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

`smart_open` is a drop-in replacement for Python’s built-in `open()`: it can do anything `open` can (100% compatible, falls back to native `open` wherever possible), plus lots of nifty extra stuff on top.

In [None]:
%pip install 'smart_open[s3]'
# The quotes are needed in zsh, which treats [] as a glob pattern; in other shells `pip install smart_open[s3]` also works
# Other extras include [gcs] and [azure]

See the [smart_open project page](https://pypi.org/project/smart-open/) and the [online help file](https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt) for more details. The same help is also available built in via `help('smart_open')`.

In [None]:
help('smart_open')

`smart_open` uses `boto3` to talk to S3, so it authenticates the same way `boto3` does. If the AWS CLI is configured on your system, or your credentials are stored in environment variables, you can access S3 without passing credentials explicitly. This is very useful for running notebooks:

In [None]:
from smart_open import open

with open("s3://audantic-data-test/attom/avm_staging/ds=20160623/AVM_20160623_001.txt.gz") as f:
    line_count = 0
    for line in f:
        if line_count > 2:
            break
        print(line)
        line_count += 1
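
The counter loop above is one way to preview a stream. As a minimal sketch of the same idea, `itertools.islice` stops after a fixed number of lines without reading the rest. The helper name `head` and the in-memory sample are assumptions for illustration; an object returned by `smart_open`'s `open()` can be iterated line by line in the same way.

```python
import io
from itertools import islice


def head(stream, n=3):
    """Return the first n lines of any iterable of lines without consuming the rest."""
    return list(islice(stream, n))


# In-memory stand-in for a remote file (hypothetical sample data).
sample = io.StringIO("row1\nrow2\nrow3\nrow4\n")

for line in head(sample):
    # Lines keep their trailing newline, so suppress print's own newline.
    print(line, end="")
```

Passing `end=""` avoids the blank line that a bare `print(line)` adds after every row.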


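Otherwise, you can pass a `boto3` session or specify the credentials within the S3 URI. Note that the `session` transport parameter applies to `smart_open` versions before 5.0; later versions take a `boto3` client via `transport_params={"client": ...}` instead.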

In [None]:
import os, boto3

session = boto3.Session(
     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)

with open(
    "s3://audantic-data-test/attom/avm_staging/ds=20160623/AVM_20160623_001.txt.gz",
    transport_params={"session": session},    
) as f:
    line_count = 0
    for line in f:
        if line_count > 2:
            break
        print(line)
        line_count += 1
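
The cell above reads the keys with `os.environ[...]`, which raises a `KeyError` when a variable is unset. As a small sketch, a check like the following can report what is missing before any S3 call is made. The helper name `missing_aws_env` is hypothetical; only the two environment variable names come from the cell above.

```python
import os

# The two variables the session cell reads from the environment.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED if not environ.get(name)]


missing = missing_aws_env()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both AWS credential variables are set.")
```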


In [None]:
aws_access_key_id = os.environ['AWS_ACCESS_KEY_ID']
aws_secret_access_key = os.environ['AWS_SECRET_ACCESS_KEY']

# The credentials are embedded in the URI, so no session is passed here
with open(
    f"s3://{aws_access_key_id}:{aws_secret_access_key}@audantic-data-test/attom/avm_staging/ds=20160623/AVM_20160623_001.txt.gz",
) as f:
    line_count = 0
    for line in f:
        if line_count > 2:
            break
        print(line)
        line_count += 1

`smart_open` natively reads and writes gzip and bzip2 files over HTTP, S3 and other protocols, choosing the codec from the file extension (`.gz` or `.bz2`). Support for other file extensions and compression formats can be easily added.

In [None]:
with open("s3://audantic-test/smart_open-test/smart_open_test.gz", "w") as f:
    f.write("Here we are")

In [None]:
# Re-compress a gzip file as bzip2; the output needs the .bz2 extension to be compressed
with open("s3://audantic-data-test/attom/avm_staging/ds=20160623/AVM_20160623_001.txt.gz") as fin:
    with open("s3://audantic-test/smart_open-test/smart_open_test2.bz2", "w") as fout:
        for line in fin:
            fout.write(line)

In [None]:
help(open)

## Iterating Over an S3 Bucket’s Contents
Since going over all (or select) keys in an S3 bucket is a very common operation, there’s also an extra function `smart_open.s3.iter_bucket()` that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

In [None]:
from smart_open import s3
help(s3.iter_bucket)

In [None]:
bucket = "audantic-data"
prefix = "attom/foreclosure_staging/"
for key, content in s3.iter_bucket(
    bucket, prefix=prefix, 
    accept_key=lambda key: "/ds=202009" in key, 
    workers=2,
    key_limit=3, 
):
    print(key, len(content))
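
The `accept_key` argument is an ordinary function from a key string to a boolean, so the filter can be written and checked without touching S3. A minimal sketch follows; the factory name `ds_month_filter` and the sample keys are made up for illustration and do not refer to real objects in the bucket.

```python
def ds_month_filter(yyyymm):
    """Build a predicate accepting keys that sit in a ds=YYYYMMDD partition of the given month."""
    marker = f"/ds={yyyymm}"

    def accept(key):
        return marker in key

    return accept


# Hypothetical keys, shaped like the partitioned paths used above.
keys = [
    "attom/foreclosure_staging/ds=20200831/part_001.txt.gz",
    "attom/foreclosure_staging/ds=20200915/part_001.txt.gz",
    "attom/foreclosure_staging/ds=20201001/part_001.txt.gz",
]

accept = ds_month_filter("202009")
selected = [key for key in keys if accept(key)]
print(selected)
```

The resulting predicate can then be passed as `accept_key=ds_month_filter("202009")` in place of the lambda.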
