amazon parser: store all product details #2316

Closed
wants to merge 10 commits into from



@milahu milahu commented Jun 23, 2024

I needed a generic Amazon parser to get all product details,
so these are now stored in mi._details.

I also fixed some other things; probably not all of them will be merged.

example use
#!/usr/bin/env python3

# based on /nix/store/q220vq57bgpijr7qbymhcv8b26jb861n-calibre-7.10.0/bin/.calibre-wrapped

import os
import re
import sys



# set calibre paths

calibre_source = os.path.dirname(__file__) + "/calibre"
calibre_prefix = "/nix/store/q220vq57bgpijr7qbymhcv8b26jb861n-calibre-7.10.0"

#path = os.environ.get('CALIBRE_PYTHON_PATH', calibre_prefix + '/lib/calibre')
path = os.environ.get('CALIBRE_PYTHON_PATH', calibre_source + "/src")
if path not in sys.path:
    sys.path.insert(0, path)

#sys.resources_location = os.environ.get('CALIBRE_RESOURCES_PATH', calibre_prefix + "/share/calibre")
sys.resources_location = os.environ.get('CALIBRE_RESOURCES_PATH', calibre_source + "/resources")

sys.extensions_location = os.environ.get('CALIBRE_EXTENSIONS_PATH', calibre_prefix + '/lib/calibre/calibre/plugins')

sys.executables_location = os.environ.get('CALIBRE_EXECUTABLES_PATH', calibre_prefix + '/bin')

sys.system_plugins_location = None



#from calibre.gui_launch import calibre
#sys.exit(calibre())

url = sys.argv[1]
cache_path = sys.argv[2]

import calibre.ebooks.metadata.sources.amazon

import calibre.utils.browser

import calibre.utils.logging

log = calibre.utils.logging.Log()

import traceback

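# note: this function is assigned to the Log instance below, so it is not
# bound as a method and must not take "self" (otherwise the message is lost)
def log_exception(*args, **kwargs):
    limit = kwargs.pop('limit', None)
    print(*args, **kwargs)
    print(traceback.format_exc(limit))
    #raise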

log.exception = log_exception

import queue
result_queue = queue.Queue()

# Get book details from Amazon's book page
worker = calibre.ebooks.metadata.sources.amazon.Worker(
    url,
    result_queue,
    browser=calibre.utils.browser.Browser(),
    log=log,
    relevance=None,
    domain=None,
    plugin=None,
    #timeout=20,
    #testing=False,
    #preparsed_root=None,
    #cover_url_processor=None,
    #filter_result=None
    cache_path=cache_path,
)

worker.run()
print("result_queue", result_queue)

mi = result_queue.get()

'''
print("mi")
print(mi)
print("details")
for key, val in mi._details.items():
    print(f"{key}: {val}")
'''



# split sentences
# note: the results are bad and need manual fixing
# mi.comments can be None when the page has no description
product_description = mi.comments or ""
print("product_description", product_description)
product_description = re.split("([a-zA-Z]{2,}[,:;.?!]) ", product_description)
res = []
for idx, part in enumerate(product_description):
    if idx % 2 == 0:
        #res.append(part.strip())
        res.append(part)
    else:
        # add end of sentence
        res[-1] += part
#res = list(filter(lambda s: s != "", res))
res = "\n".join(res)
res = re.sub(r"</p>\s*<p>", "\n\n", res)
if res.startswith("<p>"):
    res = res[3:]
if res.endswith("</p>"):
    res = res[:-4]
product_description = res



def trim(s):
    return re.sub(r"\s+", " ", s).strip()



#authors = mi.authors
authors = mi.authors_with_roles
authors = ", ".join(map(trim, authors))



###



print(url)
print()

print(mi.title)
print()

#print(mi.authors)
print(authors)
print()



#print(mi.comments)
print(product_description)
print()

for k, v in mi._details.items():
    # Publisher: ABOD Verlag; 1st edition (9 Nov. 2015)
    if k == "Publisher" and "; " in v and " (" in v:
        kv = []
        v1, v2 = v.split("; ", 1)
        kv.append((k, v1))
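        v2, v3 = v2.split(" (", 1)
        # prefer the parsed date; fall back to the raw text without the closing ")"
        # "%Y-%m-%d" is the portable spelling of "%F"
        v3 = mi.pubdate.strftime("%Y-%m-%d") if mi.pubdate else v3.rstrip(")")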
        kv.append(("Edition", v2))
        kv.append(("Release Date", v3))
        for k, v in kv:
            print(f"{k}: {v}")
        continue
    if re.fullmatch(r"Audible\.[a-z.]{2,10} Release Date", k):
        if mi.pubdate:
            v = mi.pubdate.strftime("%Y-%m-%d")
    if k == "Best Sellers Rank" and "ASIN" not in mi._details:
        k2 = "ASIN"
        v2 = mi.identifiers.get("amazon")
        if v2:
            print(f"{k2}: {v2}")
    if k == "ISBN-13":
        v = v.replace("-", "")
    if isinstance(v, str):
        print(f"{k}: {v}")
    elif isinstance(v, list):
        print(f"{k}:")
        for line in v:
            print(f"  {line}")
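
The Publisher special case in the loop above could also be pulled out into a small helper that is easier to test in isolation. This is only a sketch: `split_publisher` is a hypothetical name that is not part of calibre, and the pattern assumes the `Name; edition (date)` shape from the comment in the script.

```python
import re

# Hypothetical helper, not part of calibre. It splits an Amazon "Publisher"
# detail such as "ABOD Verlag; 1st edition (9 Nov. 2015)" into three parts.
# The expected shape "name; edition (date)" is an assumption.
_PUBLISHER_RE = re.compile(
    r"(?P<name>.+?); (?P<edition>.+?) \((?P<date>[^)]*)\)\s*$"
)


def split_publisher(value):
    """Return (name, edition, date_text), or None if the shape differs."""
    match = _PUBLISHER_RE.match(value)
    if match is None:
        return None
    return match.group("name"), match.group("edition"), match.group("date")


print(split_publisher("ABOD Verlag; 1st edition (9 Nov. 2015)"))
```

Returning None for unexpected shapes lets the caller fall back to printing the raw value instead of raising.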

@kovidgoyal
Owner

I am not going to commit to maintaining parsing of metadata from Amazon that calibre doesn't use. As for the rest of your fixes, I can't see how any of them are relevant, or even correct. For example, the ISO 639-2 standard code for German is both deu and ger. So why prefer ger? Why special case formatting of zero times in isoformat? And doing so means your datetime becomes naive, losing timezone information. Why change Authors(s) to Authors?
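
The timezone concern can be illustrated with the standard library alone. The PR's actual isoformat change is not shown in this thread, so the sketch below only demonstrates the general effect of dropping `tzinfo`; the date is taken from the Publisher example earlier in the conversation and the UTC timezone is an assumption.

```python
from datetime import datetime, timezone

# An aware datetime at midnight ("zero time"); UTC is assumed for illustration.
aware = datetime(2015, 11, 9, tzinfo=timezone.utc)

# Dropping tzinfo yields a naive datetime: the UTC offset is gone.
naive = aware.replace(tzinfo=None)

aware_text = aware.isoformat()
naive_text = naive.isoformat()

print(aware_text)  # includes the "+00:00" offset
print(naive_text)  # no offset, so the timezone can no longer be recovered
```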

@kovidgoyal kovidgoyal closed this Jun 23, 2024
@milahu
Author

milahu commented Jun 23, 2024

these may be useful

518b921
73a9485
35b409a

@kovidgoyal
Owner

The rating is correct already. Source plugins are supposed to return values on a scale of 10, they get normalized by other code in the pipeline. I have merged the audiobooks one, thanks.

@milahu
Author

milahu commented Jun 23, 2024

> The rating is correct already.

fails in at least 2 cases
https://www.amazon.de/-/en/dp/B086GX5SNN
https://www.amazon.de/-/en/dp/3954714493

also 35b409a to avoid exceptions

@kovidgoyal
Owner

kovidgoyal commented Jun 23, 2024 via email

@milahu
Author

milahu commented Jun 23, 2024

> Gives me a rating of 4.5 as expected

no, rating should be 4.7 * 2 = 9.4 of 10

https://www.amazon.de/-/en/dp/B086GX5SNN

4.7 out of 5 stars

with calibre master
mi.rating is 4.7 of 5
print(mi) says Rating : 4.7

with 518b921 and a5fdcb3
mi.rating is 9.4 of 10
print(mi) says Rating : 9.4

... or rating should be renamed to stars
to separate 10-based from 5-based values

... or all ratings should be 5-based to make it consistent
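
The two scales under discussion differ only by a factor of two. A minimal sketch of the conversion follows; the function names are hypothetical and this is not calibre's normalization code, which lives elsewhere in its pipeline and is not shown in this thread.

```python
def stars_to_ten_scale(stars):
    """Convert a 5-star rating (as shown on the product page) to a 0-10 value."""
    if not 0 <= stars <= 5:
        raise ValueError(f"expected a value between 0 and 5, got {stars!r}")
    # Round to one decimal to hide floating-point noise.
    return round(stars * 2, 1)


def ten_scale_to_stars(rating):
    """Convert a 0-10 rating back to 5-based stars."""
    if not 0 <= rating <= 10:
        raise ValueError(f"expected a value between 0 and 10, got {rating!r}")
    return round(rating / 2, 2)


print(stars_to_ten_scale(4.7))  # the "4.7 out of 5 stars" example above
```

Keeping the conversion in one named place makes it explicit which scale a given value is on.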

> running via command line

I'm calling calibre.ebooks.metadata.sources.amazon.Worker directly,
see my first comment

> What exception does that avoid?

dateutil.parser._parser.ParserError: Unknown string format: Some Publisher
via self.log.exception('Failed to parse pubdate: %s' % val)
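
One way to avoid raising on a non-date string is to return None when nothing matches. The sketch below is not the content of 35b409a, which is not shown in this thread. It uses the standard library's `strptime` instead of dateutil; `parse_pubdate` is a hypothetical name, and the format list is an assumption based on the "9 Nov. 2015" example.

```python
from datetime import datetime

# Assumed date formats, based on the "9 Nov. 2015" example in this thread.
# Month names are matched in the current locale (English by default).
_FORMATS = ("%d %b. %Y", "%d %b %Y", "%d %B %Y", "%B %d, %Y")


def parse_pubdate(text):
    """Return a datetime for a recognised date string, else None."""
    text = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # Not this format; try the next one.
            continue
    return None


print(parse_pubdate("9 Nov. 2015"))
print(parse_pubdate("Some Publisher"))
```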

@milahu
Author

milahu commented Jun 23, 2024

also a2bbd55

@kovidgoyal
Owner

kovidgoyal commented Jun 23, 2024 via email

@milahu
Author

milahu commented Jun 23, 2024

> effective rating of 4.5

it's confusing when "rating" can be 5-based or 10-based

... or rating should be renamed to stars
to separate 10-based from 5-based values

... or all ratings should be 5-based to make it consistent
