
Reading Gzip file on azure blob containing json #25

Open
ashwinmuni opened this issue Apr 21, 2022 · 6 comments
@ashwinmuni

Is it possible to read gz files stored on an Azure blob? The gz files contain JSON. I have used the following config, and below is the error:

input {
    azure_blob_storage {
        storageaccount => "ashwin"
        access_key => "12WB3f+exT2wImZgX+N7KgJw=="
        container => "india"
        codec => "json"
    }
}

output {
      elasticsearch {
        user => "elastic"
        password => "F##@AbwOzN"
        ssl => true
        ssl_certificate_verification => false
        hosts => [ "https://127.0.0.1:9200/" ]
        index => "assam-blob-%{+YYYY.MM.dd}"
        cacert => "/etc/logstash/http_ca_1.crt"
      }
}

I also tried a different codec, gzip_lines, but it didn't work either.

[2022-04-21T20:10:28,589][INFO ][logstash.inputs.azureblobstorage][main][2b92afafbd9b3b3a837d391ec4215c55812dc93f150871c0c89b7bcf205559ed] learn json one of the attempts failed BlobArchived (409): This operation is not permitted on an archived blob.
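The BlobArchived (409) in that log line means the blob is in the Archive access tier, which cannot be read until it has been rehydrated. A possible workaround using the Azure CLI (the blob name data.json.gz is hypothetical, and rehydration can take hours) is to move the blob back to the Hot tier:

az storage blob set-tier --account-name ashwin --container-name india --name data.json.gz --tier Hot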

@janmg
Owner

janmg commented Apr 21, 2022

This one is a bit more complicated, as ideally logstash-input-azure_blob_storage is only an input plugin, but this is a codec / filtering issue, so I need to think about how the flow would work best. Files can grow, and the plugin tries to deal with partial reads; for JSON there is a prefix and a postfix that should be taken into account. The learning can be skipped by configuring those manually, but I'm not sure how well Logstash understands gzipped files, at least for reading them.
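As a sketch of skipping the learning step, manually configuring the prefix and postfix could look like the following, assuming the plugin's file_head and file_tail options and an NSG-style {"records":[...]} wrapper around the JSON (untested against gzipped blobs):

input {
    azure_blob_storage {
        storageaccount => "ashwin"
        access_key => "12WB3f+exT2wImZgX+N7KgJw=="
        container => "india"
        codec => "json"
        file_head => '{"records":['
        file_tail => ']}'
    }
}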

There is a codec that handles gzipped JSON files, but I don't think anyone has actually published it yet.
http://speakmy.name/2014/01/13/gzipped-json-files-and-logstash/
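For illustration, the core of such a codec is just gunzip-then-parse. A minimal Ruby sketch, assuming the whole gzipped blob arrives as one string (decode_gzipped_json is a hypothetical helper, not part of any published codec):

require "zlib"
require "stringio"
require "json"

# Hypothetical helper: gunzip the blob's bytes, then parse the result as JSON.
def decode_gzipped_json(data)
  json = Zlib::GzipReader.new(StringIO.new(data)).read
  JSON.parse(json)
end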

I don't have much time in the near future, but I can try to troubleshoot something in a couple of months.

@ashwinmuni
Author

Thanks @janmg, let me also try to modify the bits and see if it's successful. I will keep you posted.

@dimuskin

Any news on this feature?

@janmg
Owner

janmg commented Apr 22, 2023

Can you describe what is contained in those gzip files, how they are created, and whether they can grow? Maybe I can create an experimental gzip decoder and let the rest be dealt with by the JSON codec. Ideally I only load files from azureblobstorage, but my plugin already became a bit more than an input plugin, so maybe cramming in a gzip decoder isn't going to be such a sin... I would appreciate a bit of feedback, though, on what the use case is.

@dimuskin

@janmg thank you for your response. I meant the functionality that is present in Filebeat:
https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-azure-blob-storage.html

If we have a gzip file, it must be unpacked before processing, and its content should then be processed as a regular text/JSON file.

One file per archive would be correct; otherwise the situation only becomes more complicated.

About growing: gzip is not a streaming protocol, so the files can't grow.
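A minimal Ruby sketch of that detect-then-unpack step, assuming the whole file has been read into memory (gzipped? and unpack are hypothetical helpers):

require "zlib"
require "stringio"

# Gzip files start with the magic bytes 0x1f 0x8b.
def gzipped?(data)
  data.getbyte(0) == 0x1f && data.getbyte(1) == 0x8b
end

# Unpack if gzipped, otherwise pass the content through unchanged.
def unpack(data)
  gzipped?(data) ? Zlib::GzipReader.new(StringIO.new(data)).read : data
end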

@janmg
Owner

janmg commented Apr 29, 2023

I found two codecs named json_gz that can read gzipped JSON files. Both versions of the codec work with the azure_blob_storage input plugin, but I haven't tested exactly how they process the JSON file. I assume it's processed as a single Logstash event, which means that you have to split the events in the Logstash filter stage. If the codec doesn't work because your input is, for instance, json_lines, it's still easiest to modify the codec rather than the input plugin.

https://github.com/dterziev/logstash-codec-json_gz/
https://github.com/ador-mg/logstash-codec-json_gz

The dterziev version is on RubyGems as version 1.0.1 and can be installed with:
sudo -u logstash /usr/share/logstash/bin/logstash-plugin install logstash-codec-json_gz

Both codecs can be configured like this, because they share the same name:

input {
    azure_blob_storage {
        codec => "json_gz"
    }
}
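If the codec indeed yields one event per file, the split could then happen in the filter stage. A hypothetical example, assuming the gzipped JSON is an object whose events sit in a records array:

filter {
    split {
        field => "records"
    }
}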

For future reference, Elastic themselves are working on this in Filebeat. It's in beta and in x-pack. I like that it's written in Go rather than Ruby, but I don't know its state or whether it can replace my plugin.
https://github.com/elastic/beats/blob/main/x-pack/filebeat/input/azureblobstorage/input.go
