Spiegel example from Gist #37

Open
wants to merge 5 commits into
base: master

Conversation

antonengelhardt

This example is taken from a gist by @lorey (the original author).

Signed-off-by: Anton Engelhardt <antoncengelhardt@icloud.com>
@antonengelhardt
Author

@lorey Here you go. Can you please check if this runs on your system?

When I run it, I get an error.

@lorey
Owner

lorey commented Apr 14, 2023

Thanks for adding.

What's the error?

@antonengelhardt
Author

@lorey These are just the last few lines. Do you want me to post the whole output?

INFO:root:found len(value_matches)=2 on page (self.value='24.06.2022, 14.26 Uhr', self.page=<Page self.soup.name='[document]' classes=None, text=Kristina H...>)
INFO:root:value_matches=[<ValueMatch self.node=<Node self.soup.name='time' classes=['timeformat'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>, <ValueMatch self.node=<Node self.soup.name='div' classes=['font-sansUI', 'lg:text-base', 'md:text-base', 'sm:text-s', 'text-shade-dark', 'dark:text-shade-light'], text=24.06.2022...>, self.extractor=<TextValueExtractor>>]
Traceback (most recent call last):
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 93, in <module>
    train_and_scrape()
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 48, in train_and_scrape
    scraper = train_spon_scraper()
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/Documents/SaveStrike Code/mlscraper-ae/examples/spiegel.py", line 67, in train_spon_scraper
    scraper = train_scraper(training_set, complexity=5)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 44, in train_scraper
    sample_matches = [
                     ^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <listcomp>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/training.py", line 45, in <lambda>
    sorted(s.get_matches(), key=lambda m: m.span)[:100]
                                          ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 131, in span
    return sum(
           ^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 132, in <genexpr>
    m.span + get_relative_depth(m.root, self.root)
    ^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in span
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/matches.py", line 165, in <genexpr>
    return sum(get_relative_depth(m.root, self.root) + m.span for m in self.matches)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/antonengelhardt/.pyenv/versions/3.11.2/lib/python3.11/site-packages/mlscraper/html.py", line 181, in get_relative_depth
    i = node_parents.index(root.soup)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: <a class="text-black dark:text-shade-lightest font-bold border-b border-shade-light hover:border-black dark:hover:border-white" href="https://www.spiegel.de/impressum/autor-1a9752a4-0001-0003-0000-000000020534" target="_self" title="Nike Laurenz">
Nike Laurenz</a> is not in list
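
For context, the last frame fails inside `list.index`, which raises `ValueError` when the item it is asked for is not in the list. Judging only from the traceback, it looks as if the root passed to `get_relative_depth` (here the author link) is not among the ancestors of the node being measured. The sketch below reproduces that failure mode in isolation. It is not mlscraper's code: the `Node` class, the `relative_depth` helper, and the small tree are hypothetical stand-ins that use only the standard library.

```python
class Node:
    """Tiny stand-in for a DOM node (hypothetical, not mlscraper's Node)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        # Walk upwards, yielding each ancestor from nearest to farthest.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def __repr__(self):
        return f"<{self.name}>"


def relative_depth(node, root):
    """Number of steps from node up to root (hypothetical helper)."""
    if node is root:
        return 0
    ancestors = list(node.parents)
    # list.index raises ValueError if root is not an ancestor of node.
    return ancestors.index(root) + 1


# Assumed page layout: the <time> tag and the author link sit in
# different branches of the same article.
article = Node("article")
header = Node("header", parent=article)
time_tag = Node("time", parent=header)
author_link = Node("a", parent=Node("div", parent=article))

# Works: header is an ancestor of the <time> tag.
depth_ok = relative_depth(time_tag, header)

# Fails: the author link is a sibling branch, not an ancestor.
try:
    relative_depth(time_tag, author_link)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(depth_ok, error_message)
```

If this is indeed what happens, the question would be why two matches from different branches of the page end up being measured against each other, which the full log might show.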
