Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from_pretrained making internet connection if internet turned on #2867

Closed
Swarzkopf314 opened this issue Feb 15, 2020 · 15 comments · Fixed by #2930
Closed

from_pretrained making internet connection if internet turned on #2867

Swarzkopf314 opened this issue Feb 15, 2020 · 15 comments · Fixed by #2930

Comments

@Swarzkopf314
Copy link

I'd like to ask why model.from_pretrained makes ssl connection event though I provide cache_dir? If I turn off the internet everything works just fine.

│     └─ 0.726 from_pretrained  transformers/tokenization_utils.py:256
│        └─ 0.726 _from_pretrained  transformers/tokenization_utils.py:311
│           ├─ 0.570 cached_path  transformers/file_utils.py:205
│           │  └─ 0.570 get_from_cache  transformers/file_utils.py:333
│           │     └─ 0.570 head  requests/api.py:91
│           │        └─ 0.570 request  requests/api.py:16
│           │           └─ 0.565 request  requests/sessions.py:466
│           │              └─ 0.558 send  requests/sessions.py:617
│           │                 └─ 0.558 send  requests/adapters.py:394
│           │                    └─ 0.557 urlopen  urllib3/connectionpool.py:494
│           │                       └─ 0.557 _make_request  urllib3/connectionpool.py:351
│           │                          ├─ 0.413 _validate_conn  urllib3/connectionpool.py:986
│           │                          │  └─ 0.413 connect  urllib3/connection.py:298
│           │                          │     ├─ 0.281 ssl_wrap_socket  urllib3/util/ssl_.py:296
│           │                          │     │  ├─ 0.263 wrap_socket  ssl.py:410
│           │                          │     │  │  └─ 0.263 _create  ssl.py:813
│           │                          │     │  │     └─ 0.263 do_handshake  ssl.py:1132
│           │                          │     │  └─ 0.018 [self]  
│           │                          │     └─ 0.132 _new_conn  urllib3/connection.py:143
│           │                          │        └─ 0.132 create_connection  urllib3/util/connection.py:33
│           │                          │           └─ 0.130 [self]  
│           │                          └─ 0.144 getresponse  http/client.py:1300
│           │                             └─ 0.144 begin  http/client.py:299
│           │                                └─ 0.144 _read_status  http/client.py:266
│           │                                   └─ 0.144 readinto  socket.py:575
│           │                                      └─ 0.144 recv_into  ssl.py:1060
│           │                                         └─ 0.144 read  ssl.py:920

and here's the output with internet turned off

     └─ 0.358 from_pretrained  transformers/tokenization_utils.py:256
│        └─ 0.358 _from_pretrained  transformers/tokenization_utils.py:311
│           ├─ 0.255 __init__  transformers/tokenization_bert.py:138
│           │  ├─ 0.163 load_vocab  transformers/tokenization_bert.py:98
│           │  │  └─ 0.160 [self]  
│           │  ├─ 0.056 <listcomp>  transformers/tokenization_bert.py:186
│           │  └─ 0.036 [self]  
│           └─ 0.102 cached_path  transformers/file_utils.py:205
│              └─ 0.101 get_from_cache  transformers/file_utils.py:333
│                 ├─ 0.083 head  requests/api.py:91
│                 │  └─ 0.083 request  requests/api.py:16
│                 │     └─ 0.080 request  requests/sessions.py:466
│                 │        ├─ 0.066 send  requests/sessions.py:617
│                 │        │  └─ 0.066 send  requests/adapters.py:394
│                 │        │     ├─ 0.046 urlopen  urllib3/connectionpool.py:494
│                 │        │     │  ├─ 0.035 _make_request  urllib3/connectionpool.py:351
│                 │        │     │  │  └─ 0.035 _validate_conn  urllib3/connectionpool.py:986
│                 │        │     │  │     └─ 0.035 connect  urllib3/connection.py:298
│                 │        │     │  │        └─ 0.035 _new_conn  urllib3/connection.py:143
│                 │        │     │  │           ├─ 0.015 create_connection  urllib3/util/connection.py:33
│                 │        │     │  │           │  └─ 0.014 getaddrinfo  socket.py:735
│                 │        │     │  │           ├─ 0.012 [self]  
│                 │        │     │  │           └─ 0.008 __init__  urllib3/exceptions.py:20
│                 │        │     │  └─ 0.006 increment  urllib3/util/retry.py:355
│                 │        │     ├─ 0.008 [self]  
│                 │        │     └─ 0.008 __init__  requests/exceptions.py:17
│                 │        ├─ 0.006 merge_environment_settings  requests/sessions.py:690
│                 │        │  └─ 0.005 get_environ_proxies  requests/utils.py:755
│                 │        └─ 0.006 [self]  
│                 └─ 0.014 filter  fnmatch.py:48
│                    └─ 0.009 _compile_pattern  fnmatch.py:38
│                       └─ 0.005 compile  re.py:232
│                          └─ 0.005 _compile  re.py:271
│                             └─ 0.005 compile  sre_compile.py:759
@BramVanroy
Copy link
Collaborator

BramVanroy commented Feb 18, 2020

I might be mistaken, but it seems that s3_etag verifies that the etag of a cached (downloaded) file is the same as the one that is in the S3 bucket, to ensure that you have the right files (in terms of versions, or corruption). If those files are not in the cached folder, they are downloaded.

See

@s3_request
def s3_etag(url, proxies=None):
"""Check ETag on S3 object."""
s3_resource = boto3.resource("s3", config=Config(proxies=proxies))
bucket_name, s3_path = split_s3_path(url)
s3_object = s3_resource.Object(bucket_name, s3_path)
return s3_object.e_tag

@Swarzkopf314
Copy link
Author

Is there any way to turn that off?

@BramVanroy
Copy link
Collaborator

Not as far as I can see. What is your use-case? Why do you need this?

@minimaxir
Copy link

minimaxir commented Feb 18, 2020

A use case where validating against external servers is not ideal is if the network is behind a firewall and/or is a containerized microservice, and you want to avoid pinging outside the firewall as much as possible.

I would appreciate a config flag that disables all external pinging.

@Swarzkopf314
Copy link
Author

Swarzkopf314 commented Feb 19, 2020

It's not comfortable for development - I'm doing many tests with the pretrained model and it's pretty annoying as it slows down my experiments considerably. I quess I could just save and load the model myself but I was curious why from_pretrained takes so long.

@BramVanroy
Copy link
Collaborator

I think it should be possible by skipping this block (and setting etag=None)

if url.startswith("s3://"):
etag = s3_etag(url, proxies=proxies)
else:
try:
response = requests.head(url, allow_redirects=True, proxies=proxies, timeout=etag_timeout)
if response.status_code != 200:
etag = None
else:
etag = response.headers.get("ETag")
except (EnvironmentError, requests.exceptions.Timeout):
etag = None

which will then fallback to

if etag is None:
if os.path.exists(cache_path):
return cache_path
else:
matching_files = [
file
for file in fnmatch.filter(os.listdir(cache_dir), filename + ".*")
if not file.endswith(".json") and not file.endswith(".lock")
]
if len(matching_files) > 0:
return os.path.join(cache_dir, matching_files[-1])
else:
return None

A flag should be added to the signature, something like: disable_outgoing=False. When True, it will skip the lookup and possible download.

I might be able to work on this in the future, but it's not high on my priority list.

Opinions? @minimaxir @Swarzkopf314

@Swarzkopf314
Copy link
Author

Yeah that would be great :)

@BramVanroy
Copy link
Collaborator

BramVanroy commented Feb 20, 2020

@Swarzkopf314 Can you tell me how you made the graphs in OP? (some library, I presume) So I can use them for testing.

@Swarzkopf314
Copy link
Author

Swarzkopf314 commented Feb 20, 2020

I made a wrapper for pyinstrument, feel free to use it:

import pyinstrument

# with TreeProfiler(show_all=True):
#   # code to profie...
class TreeProfiler(object):

  def __init__(self, show_all=False):
    self.profiler = pyinstrument.Profiler()
    self.show_all = show_all # verbose output of pyinstrument profiler

  def __enter__(self):
    print("WITH TREE_PROFILER:")
    self.profiler.start()

  def __exit__(self, *args):
    self.profiler.stop()
    print(self.profiler.output_text(unicode=True, color=True, show_all=self.show_all))

@BramVanroy
Copy link
Collaborator

You can try out my PR #2930 if you want.

import pyinstrument
from transformers import DistilBertConfig, DistilBertModel, DistilBertTokenizer


class TreeProfiler():
    def __init__(self, show_all=False):
        self.profiler = pyinstrument.Profiler()
        self.show_all = show_all # verbose output of pyinstrument profiler

    def __enter__(self):
        print("WITH TREE_PROFILER:")
        self.profiler.start()

    def __exit__(self, *args):
        self.profiler.stop()
        print(self.profiler.output_text(unicode=True, color=True, show_all=self.show_all))


def main():
    with TreeProfiler(show_all=True):
        config = DistilBertConfig.from_pretrained('distilbert-base-uncased', disable_outgoing=True)
        model = DistilBertModel.from_pretrained('distilbert-base-uncased', disable_outgoing=True)
        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', disable_outgoing=True)


if __name__ == '__main__':
    main()

The above snippet will throw an error message when the expected files are not present in the cache. When they are, though, everything is loaded fine without the need of any additional lookups.

@Swarzkopf314
Copy link
Author

Amazing, thanks a lot! <3

@BramVanroy
Copy link
Collaborator

No problem. Note that I have not written tests for this functionality yet. I don't think it should break the library, but if you do find some inconsistencies, please let me know.

@minimaxir
Copy link

Excellent! :D

@BramVanroy
Copy link
Collaborator

Note that the parameter name has been changed to local_files_only.

@James4Ever0
Copy link

Note that in practice, I find some parameter "local_files_first" which will resolve this issue even further. As named, it will first check if the model is cached. If not, it will make internet connection and download that model. I find this useful for production and testing, thus might write some pull requests for this new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants