Google Cloud Storage (GCS) #404

Merged on Jan 24, 2020 (87 commits)
Changes from 28 commits
6eeb5b4
GCS support working
Nov 11, 2019
45e1019
Start updating setup.py and README
Nov 12, 2019
ae6c252
Add .idea/ to .gitignore
Nov 12, 2019
38905ae
Move integration tests to their proper location
Nov 12, 2019
d054a87
All unit tests passing with mocks
Nov 15, 2019
a5e2a6e
Add gcs integration tests
Nov 15, 2019
4fc43fa
Add .env to .gitignore
Nov 19, 2019
67feb7b
Integration tests passing
Jan 5, 2020
62252f0
Remove kms integration test
Jan 5, 2020
40cbb7d
Fix failing test
Jan 5, 2020
5769e9b
Change writer from buffer size to min part size
Jan 5, 2020
056abe9
Merge branch 'master' into gcs
petedannemann Jan 5, 2020
c31f79d
Fix typo
petedannemann Jan 5, 2020
e2be283
Change buffer size to min part size
petedannemann Jan 5, 2020
0eca1d3
Fix another missed min_part_size
petedannemann Jan 5, 2020
9357d48
Fix broken example
petedannemann Jan 5, 2020
eb2817b
Change smart_open to open
petedannemann Jan 5, 2020
00f7c86
Make WHENCE_CHOICES a tuple
petedannemann Jan 5, 2020
2544aca
Specify client type
petedannemann Jan 5, 2020
d3e0d8c
Make init one line instead of multiple
petedannemann Jan 5, 2020
b2f38cf
Shorten _parse_uri_gcs
petedannemann Jan 5, 2020
0afb007
Use open instead of smart_open
petedannemann Jan 5, 2020
5b95e87
Fix memory issue with RawReader
petedannemann Jan 5, 2020
81f50ae
Fix type hints
petedannemann Jan 5, 2020
2f8f22c
Fix exists in test
petedannemann Jan 5, 2020
7e07484
Fix typo in smart_open_lib about schemes
petedannemann Jan 5, 2020
b779ccd
Remove RawReaders and BufferedInputBase
petedannemann Jan 5, 2020
66a0b33
Remove unneeded arg passing in tests
petedannemann Jan 5, 2020
ca90299
Fix bug with seek
petedannemann Jan 5, 2020
ca4c26b
Internal and doc stringed constants
petedannemann Jan 5, 2020
c4d1595
Internal functions and doc strings
petedannemann Jan 5, 2020
79ee333
Fix issue with double issue on upload and add __str__ and __repr__
petedannemann Jan 6, 2020
9622bc6
Minor cleanup
petedannemann Jan 6, 2020
780c549
Fix flake8 errors
petedannemann Jan 6, 2020
4382710
Additional flake8 resolution
petedannemann Jan 6, 2020
13872de
Add source code encoding to test file
petedannemann Jan 6, 2020
9b5c802
Docstrings in imperative mode
Jan 6, 2020
daae9fe
Test grammar
Jan 6, 2020
86c24db
Fix mock_gcs docstring
Jan 6, 2020
f4c4781
Only support gs scheme
Jan 6, 2020
e03e5fe
Remove additional occurences of removed gcs scheme
Jan 6, 2020
7321bbd
Add test_read_past_end
petedannemann Jan 6, 2020
5d1baf8
Clean up tests with class level decorator
petedannemann Jan 7, 2020
56112cd
Remove stub function
petedannemann Jan 7, 2020
f4ad80b
Use equality instead of in for scheme
Jan 7, 2020
8611dba
Fix repr and mock import
petedannemann Jan 7, 2020
8eb3e0a
Merge branch 'gcs' of https://github.com/petedannemann/smart_open int…
petedannemann Jan 7, 2020
03ac2a9
Specify ImportError
petedannemann Jan 7, 2020
93e4b0f
Use BytesIO in integration test
petedannemann Jan 8, 2020
2bcf8e3
Explicit encoding in integration tests
petedannemann Jan 8, 2020
0a48ba4
Remove unneeded variable in integration test
petedannemann Jan 8, 2020
18ca5f8
Remove unneeded data variable
petedannemann Jan 8, 2020
ffa8083
Remove .format from test_gcs
petedannemann Jan 8, 2020
f56909c
Merge branch 'gcs' of https://github.com/petedannemann/smart_open int…
petedannemann Jan 8, 2020
9579b1d
Move RESUMEABLE_SESSION_URI_TEMPLATE to module scope
petedannemann Jan 8, 2020
aa90d64
Remove unnecessary explicit encoding
petedannemann Jan 8, 2020
adf5d56
Remove unnecessary variable in read
petedannemann Jan 8, 2020
03a7d72
Change assertion to use _REQUIRED_CHUNK_MULTIPLE
petedannemann Jan 8, 2020
115940a
Import clamp from s3
Jan 8, 2020
b735b7e
Add comment on buffering being read-only
Jan 8, 2020
1d4805c
Fix client docstring
Jan 8, 2020
0b5797f
Make SeekableRawReader internal
Jan 8, 2020
6dd7e76
Add return value to seek in _SeekableRawReader
Jan 8, 2020
94ce2bd
Allow integration test to take a prefix
Jan 8, 2020
d2a4c2e
Add doc to NotFound exception
Jan 8, 2020
674ad80
Fix misleading log statement
Jan 8, 2020
910bdde
Various formatting changes
Jan 8, 2020
fb44cd2
Add additional assertion for min_part_size
Jan 8, 2020
f17b9b1
Clean up _upload_next_part
petedannemann Jan 8, 2020
5b86291
Improve UploadFailedError
petedannemann Jan 8, 2020
52cb2d1
Add docstring to terminate
petedannemann Jan 9, 2020
e02fdab
Add additional clean up to close
petedannemann Jan 9, 2020
dcef33d
Remove useless terminate in SeekableBufferedInputBase
petedannemann Jan 9, 2020
8ddb98d
Fix data type for status_code in UploadFailedError
petedannemann Jan 9, 2020
c3fc44a
Clean up UploadedFailedError msg
petedannemann Jan 9, 2020
7cc0509
Start on mock tests
Jan 12, 2020
bec8a6d
Add tests for mocks
petedannemann Jan 20, 2020
be21104
Clean up registering dependencies
petedannemann Jan 20, 2020
a760f71
Get initialize_bucket to work without gsutil
petedannemann Jan 20, 2020
dadda49
Change buffering to buffer_size
Jan 23, 2020
343d5a2
Add copyright header, fix logging styles, and move result outside ctx…
Jan 23, 2020
8373740
Change _upload_empty_part to debug msg
Jan 23, 2020
ab4d936
Clean up patching style
Jan 23, 2020
f8dd13b
Add tests for smart_open_lib
Jan 23, 2020
1d2613c
Add blank lines before constants to help readability
Jan 23, 2020
9ec0b40
Add missing clean up of raw_reader in close
Jan 23, 2020
a8c272e
Remove aws related credentials from gs tests
Jan 23, 2020
6 changes: 6 additions & 0 deletions .gitignore
@@ -56,3 +56,9 @@ target/
# vim
*.swp
*.swo

# PyCharm
.idea/

# env files
.env
12 changes: 11 additions & 1 deletion README.rst
@@ -12,7 +12,7 @@ smart_open — utils for streaming large files in Python
What?
=====

-``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storages such as S3, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
+``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.

@@ -80,6 +80,7 @@ Other examples of URLs that ``smart_open`` accepts::
s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
@@ -174,6 +175,14 @@ More examples
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

Supported Compression Formats
-----------------------------

@@ -212,6 +221,7 @@ Transport-specific Options
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS

Each option involves setting up its own set of parameters.
For example, for accessing S3, you often need to set up authentication, like API keys or a profile name.
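
The ``gs://my_bucket/my_blob`` URL form added to the README splits into a bucket name and a blob name. The following is a minimal sketch of that split, assuming only the ``gs`` scheme is accepted (as the commit list suggests). ``parse_gs_uri`` is a hypothetical helper written for illustration; it is not the ``_parse_uri_gcs`` function from this PR, whose body is not shown here.

```python
def parse_gs_uri(uri):
    """Split a gs:// URI into (bucket, blob).

    Hypothetical helper for illustration only; not part of smart_open.
    """
    scheme, sep, rest = uri.partition('://')
    # Compare the scheme by equality so that look-alike schemes are rejected.
    if not sep or scheme != 'gs':
        raise ValueError('expected a gs:// URI, got %r' % uri)
    # Everything up to the first slash is the bucket; the remainder is the blob.
    bucket, _, blob = rest.partition('/')
    if not bucket or not blob:
        raise ValueError('URI must name both a bucket and a blob: %r' % uri)
    return bucket, blob


if __name__ == '__main__':
    print(parse_gs_uri('gs://my_bucket/my_blob'))
```

Splitting on the first ``/`` rather than running a general URL parser keeps characters such as ``?`` and ``#`` inside the blob name.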
120 changes: 120 additions & 0 deletions integration-tests/test_gcs.py
@@ -0,0 +1,120 @@
# -*- coding: utf-8 -*-
import io
import os

import google.cloud.storage

import smart_open

_GCS_BUCKET = os.environ.get('SO_GCS_BUCKET')
# Check before building the URL: 'gs://' + None would raise a TypeError first.
assert _GCS_BUCKET is not None, 'please set the SO_GCS_BUCKET environment variable'
_GCS_URL = 'gs://' + _GCS_BUCKET


def initialize_bucket():
    # Empty the bucket so that every test starts from a clean state.
    storage_client = google.cloud.storage.Client()
    bucket = storage_client.get_bucket(_GCS_BUCKET)
    blobs = bucket.list_blobs()
    for blob in blobs:
        blob.delete()


def write_read(key, content, write_mode, read_mode, encoding=None, **kwargs):
    # Write the content to the key, then read it back for comparison.
    with smart_open.open(key, write_mode, encoding=encoding, **kwargs) as fout:
        fout.write(content)
    with smart_open.open(key, read_mode, encoding=encoding, **kwargs) as fin:
        actual = fin.read()
    return actual


def read_length_prefixed_messages(key, read_mode, encoding=None, **kwargs):
    # Each message is one length byte followed by that many payload bytes.
    with smart_open.open(key, read_mode, encoding=encoding, **kwargs) as fin:
        actual = b''
        length_byte = fin.read(1)
        while len(length_byte):
            actual += length_byte
            msg = fin.read(ord(length_byte))
            actual += msg
            length_byte = fin.read(1)
    return actual


def test_gcs_readwrite_text(benchmark):
    initialize_bucket()

    key = _GCS_URL + '/sanity.txt'
    text = 'с гранатою в кармане, с чекою в руке'
    actual = benchmark(write_read, key, text, 'w', 'r', 'utf-8')
    assert actual == text


def test_gcs_readwrite_text_gzip(benchmark):
    initialize_bucket()

    key = _GCS_URL + '/sanity.txt.gz'
    text = 'не чайки здесь запели на знакомом языке'
    actual = benchmark(write_read, key, text, 'w', 'r', 'utf-8')
    assert actual == text


def test_gcs_readwrite_binary(benchmark):
    initialize_bucket()

    key = _GCS_URL + '/sanity.txt'
    binary = b'this is a test'
    actual = benchmark(write_read, key, binary, 'wb', 'rb')
    assert actual == binary


def test_gcs_readwrite_binary_gzip(benchmark):
    initialize_bucket()

    key = _GCS_URL + '/sanity.txt.gz'
    binary = b'this is a test'
    actual = benchmark(write_read, key, binary, 'wb', 'rb')
    assert actual == binary


def test_gcs_performance(benchmark):
    initialize_bucket()

    one_megabyte = io.BytesIO()
    for _ in range(1024*128):
        one_megabyte.write(b'01234567')
    one_megabyte = one_megabyte.getvalue()

    key = _GCS_URL + '/performance.txt'
    actual = benchmark(write_read, key, one_megabyte, 'wb', 'rb')
    assert actual == one_megabyte


def test_gcs_performance_gz(benchmark):
    initialize_bucket()

    one_megabyte = io.BytesIO()
    for _ in range(1024*128):
        one_megabyte.write(b'01234567')
    one_megabyte = one_megabyte.getvalue()

    key = _GCS_URL + '/performance.txt.gz'
    actual = benchmark(write_read, key, one_megabyte, 'wb', 'rb')
    assert actual == one_megabyte


def test_gcs_performance_small_reads(benchmark):
    initialize_bucket()

    ONE_MIB = 1024**2
    one_megabyte_of_msgs = io.BytesIO()
    msg = b'\x0f' + b'0123456789abcde'  # a length-prefixed "message"
    for _ in range(0, ONE_MIB, len(msg)):
        one_megabyte_of_msgs.write(msg)
    one_megabyte_of_msgs = one_megabyte_of_msgs.getvalue()

    key = _GCS_URL + '/many_reads_performance.bin'

    with smart_open.open(key, 'wb') as fout:
        fout.write(one_megabyte_of_msgs)

    actual = benchmark(read_length_prefixed_messages, key, 'rb', buffering=ONE_MIB)
    assert actual == one_megabyte_of_msgs
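
The small-reads benchmark above depends on a simple framing: one length byte followed by that many payload bytes. The sketch below exercises the same framing locally, substituting an in-memory ``io.BytesIO`` for the ``smart_open`` stream so that no bucket is required. ``build_payload`` and ``iter_messages`` are hypothetical names introduced for illustration and are not part of this PR.

```python
import io

ONE_MIB = 1024 ** 2
MSG = b'\x0f' + b'0123456789abcde'  # 1 length byte (15) + 15 payload bytes


def build_payload(total_size=ONE_MIB, msg=MSG):
    """Repeat msg until total_size bytes have been written."""
    buf = io.BytesIO()
    for _ in range(0, total_size, len(msg)):
        buf.write(msg)
    return buf.getvalue()


def iter_messages(fin):
    """Yield each payload from a binary stream of length-prefixed messages."""
    while True:
        length_byte = fin.read(1)
        if not length_byte:
            return  # end of stream
        yield fin.read(ord(length_byte))


payload = build_payload()
messages = list(iter_messages(io.BytesIO(payload)))

if __name__ == '__main__':
    print(len(payload), len(messages))
```

Each message is 16 bytes, so 1 MiB holds 65,536 of them, and the reader issues two small ``read`` calls per message. That access pattern is what makes the read-side buffer size matter in the benchmark.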
5 changes: 3 additions & 2 deletions setup.py
@@ -59,14 +59,15 @@ def read(fname):
    'boto >= 2.32',
    'requests',
    'boto3',
    'google-cloud-storage',
]
if sys.version_info[0] == 2:
    install_requires.append('bz2file')

setup(
    name='smart_open',
    version=__version__,
-    description='Utils for streaming large files (S3, HDFS, gzip, bz2...)',
+    description='Utils for streaming large files (S3, HDFS, GCS, gzip, bz2...)',
    long_description=read('README.rst'),

    packages=find_packages(),
@@ -82,7 +83,7 @@ def read(fname):
    url='https://github.com/piskvorky/smart_open',
    download_url='http://pypi.python.org/pypi/smart_open',

-    keywords='file streaming, s3, hdfs',
+    keywords='file streaming, s3, hdfs, gcs',

    license='MIT',
    platforms='any',