
Determine mandatory fields in the config file #154

Closed
bidoubiwa opened this issue Sep 29, 2021 · 14 comments
Labels
good first issue Good for newcomers

Comments

@bidoubiwa
Contributor

From @sanders41 in this comment

When using a minimal docs-scraper config:

{
  "index_uid": "docs",
  "start_urls": ["https://docs.meilisearch.com"]
}

Errors are thrown for missing fields:

Then I get the error TypeError: argument of type 'NoneType' is not iterable, which happens here. I am thinking the JSON I am using is not what you have in mind? What makes me question that, and wonder if there could be an issue, is that the parser is called from ConfigLoader, and in that class selectors gets initialized to None. This makes me think calling the parser with selectors set to None shouldn't throw an error. Should I instead be using the basic config JSON file?

We should determine which fields are currently mandatory, and decide which fields should or should not be mandatory.
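For context, the error in the quote is what Python raises when a membership check runs against a value that is still None. A minimal sketch of that failure mode, assuming selectors was simply never provided (illustrative only, not the scraper's actual code):

selectors = None       # ConfigLoader initializes selectors to None when the field is absent
"lvl0" in selectors    # TypeError: argument of type 'NoneType' is not iterable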

@askalik

askalik commented Oct 2, 2021

Can I take this one?

@bidoubiwa
Contributor Author

Of course! Thanks a lot @askalik. Maybe add verification of the provided options so we can throw elegantly when a field is missing? Good luck :)
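A minimal sketch of what that verification could look like, assuming the config is loaded as a plain dict (the field list and the validate_config helper are hypothetical; deciding the exact mandatory set is what this issue is about):

MANDATORY_FIELDS = ["index_uid", "start_urls"]

def validate_config(config: dict) -> None:
    # Fail early with a readable message instead of a late TypeError.
    missing = [field for field in MANDATORY_FIELDS if not config.get(field)]
    if missing:
        raise ValueError(f"Missing mandatory config field(s): {', '.join(missing)}")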

@bidoubiwa
Contributor Author

Hello @askalik, just an update: we can only guarantee the assignment for 5 days, after which, if someone else wants to give it a try, we will reassign the issue. Good luck with the PR!

@askalik

askalik commented Oct 5, 2021

@bidoubiwa I will work on this today. I will reach out if I have any questions.

@yankeeinlondon

It would be really helpful if you published a TypeScript interface that described the config format. I think it would be a much more compact representation of what is allowed than parsing through prose.

I've created a boilerplate today which is not yet correct but might serve as a starting point:

export type ScrapeSelector =
  /** the simple representation is just to put a selector in as a string */
  | string
  /** but there is an object notation for greater control */
  | {
      selector: string;
      global: boolean;
      /**
       * will be the displayed value if no content in selector was found.
       */
      default_value: string;
    };

export type ScrapeSelectorTargets = {
  lvl0: ScrapeSelector;
  lvl1: ScrapeSelector;
  lvl2: ScrapeSelector;
  lvl3: ScrapeSelector;
  lvl4: ScrapeSelector;
  lvl5: ScrapeSelector;
  lvl6: ScrapeSelector;
  /** the main body of text */
  text: ScrapeSelector;
};

export type ScrapeUrls =
  | string
  | {
      url: string;
      selectors_key: string;
    };

export interface MeiliSearchConfig {
  /**
   * The index_uid field is the index identifier in your MeiliSearch instance
   * in which your website content is stored. The scraping tool will create a
   * new index if it does not exist.
   */
  index_uid: string;
  /** allows the scraper to index pages by following links */
  start_urls: ScrapeUrls[];
  /**
   * Sitemaps can contain alternative links for URLs. Those are other versions
   * of the same page, in a different language, or with a different URL. By
   * default docs-scraper will ignore those URLs.
   *
   * Set this to true if you want those other versions to be scraped as well.
   */
  sitemap_alternate_links?: boolean;
  /** allows the scraper to index using the sitemap XML */
  sitemap_urls: string[];
  /** The scraper will not follow links that match stop_urls. */
  stop_urls: string[];
  /**
   * DOM selector references which bracket the relevant sections.
   *
   * **Note:** `lvl0` is highest priority and `lvl6` lowest
   *
   * **Note:** the object notation allows you to have different selectors between
   * different pages (`default` is the fallback). To use this format you must also add
   * `selectors_key` props to the start_urls
   */
  selectors: ScrapeSelectorTargets | Record<string, ScrapeSelectorTargets>;
  /**
   * This expects an array of CSS selectors. Any element matching one of those selectors
   * will be removed from the page before any data is extracted from it.
   *
   * This can be used to remove a table of contents, a sidebar, or a footer, to make
   * other selectors easier to write.
   */
  selectors_exclude?: string[];

  custom_settings?: {
    /**
     * The synonym SSG <=> Static Site Generator allows the user to find all the results
     * containing "Static Site Generator" by typing only "SSG" (and the opposite). Here
     * is the [dedicated page about synonyms](https://docs.meilisearch.com/reference/features/synonyms.html)
     * in the official documentation.
     */
    synonyms?: Record<string, string[]>;

    /**
     * Because your website likely contains structured English sentences, we recommend
     * adding stop words. The search engine then ignores common linking words and focuses
     * on the main terms of the query, giving more relevant results.
     *
     * Here is the [dedicated page about stop words](https://docs.meilisearch.com/reference/features/stop_words.html)
     * in the official documentation. You can find more complete lists of
     * English stop words [like this one](https://gist.github.com/sebleier/554280).
     */
    stopWords?: string[];

    /**
     * The default value is 0. By increasing it, you can choose not to index some records
     * if they don't have enough lvlX matching. For example, with a min_indexed_level: 2,
     * the scraper indexes temporary records having at least lvl0, lvl1 and lvl2 set.
     *
     * This is useful when your documentation has pages that share the same lvl0 and lvl1
     * for example. In that case, you don't want to index all the shared records, but want
     * to keep the content different across pages.
     */
    min_indexed_level?: number;
    /**
     * When only_content_level is set to true, then the scraper won't create records for
     * the lvlX selectors.
     *
     * If used, min_indexed_level is ignored.
     */
    only_content_level?: boolean;

    /**
     * When js_render is set to true, the scraper will use ChromeDriver. This is needed for
     * pages that are rendered with JavaScript, for example, pages generated with React or Vue,
     * or applications running in development mode with autoreload/watch.
     *
     * After installing ChromeDriver, provide the path to the binary via the environment
     * variable CHROMEDRIVER_PATH (default value is /usr/bin/chromedriver).
     *
     * The default value of js_render is false.
     */
    js_render?: boolean;

    /**
     * This setting specifies the domains that the scraper is allowed to access. In most cases
     * the allowed_domains will be automatically set using the start_urls and stop_urls. When
     * scraping a domain that contains a port, for example http://localhost:8080, the domain
     * needs to be manually added to the configuration.
     */
    allowed_domains?: string[];
  };
}

@sanders41
Collaborator

How would the TypeScript interface be used? The scraper is written in Python, so I don't see where the TypeScript interface would get used.

@bidoubiwa
Contributor Author

@ksnyde
I'm very curious about how to implement this. Do you have any sources that I might read?

@sanders41
Collaborator

@bidoubiwa I'm curious to hear as well. If it turns out the TypeScript interface won't work but it is something you want, dataclasses, attrs, and Pydantic are Python options for doing the same kind of thing.
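For illustration, a rough Pydantic sketch, assuming the config is read into a model like this (field names come from the example in the issue description; which fields end up required is exactly what this issue needs to decide):

from pydantic import BaseModel, Field

class ScraperConfig(BaseModel):
    # Missing required fields produce a readable ValidationError up front.
    index_uid: str
    start_urls: list[str]

    # Optional fields get safe defaults, so nothing downstream iterates over None.
    stop_urls: list[str] = Field(default_factory=list)
    selectors: dict = Field(default_factory=dict)

# ScraperConfig(**config_dict) then validates the whole file in one place.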

@yankeeinlondon

@sanders41 I find the best documentation is the documentation closest to the code, and for JS/TS that means providing good types. My suggestion would be to include these types in the docs-meilisearch.js repo, but because the document structure is a shared assumption between the scraper and that repo, it is worth mentioning their availability in the scraper docs too, as a footnote.

Currently there are a lot of hidden rules in how this dropdown works, which strongly encourages using it rather than venturing into the more powerful territory of custom/bespoke dropdowns. If a JS/TS developer were building their own dropdown, however, they'd immediately benefit from having this sort of guard rail and the confidence of knowing how to interface with the scraper's document structure.

@yankeeinlondon

yankeeinlondon commented Feb 3, 2022

I'm not very familiar with Python, but since it's dynamically typed, the benefit would likely be harder to reach for Python developers.

@yankeeinlondon

In this regard, I did make a mistake in not considering the SDK. Looking at it ... it does have types, but they're not commented, and I haven't yet gauged them for completeness or for whether they cover the scraper's assumed document structure.

SDK Types

@yankeeinlondon

Finally, I really would like to solve the original problem I mentioned here. I have a valid JSON file (that's in fact now just a cut-and-paste from what's in the docs), but for some reason it is not working and the scraper complains that the JSON is invalid.

@yankeeinlondon

I'm using AST representations of Rust, TS, and Markdown to build indexes, but I'd really like to be able to compare the results to just scraping the HTML.

@alallema
Contributor

alallema commented Sep 6, 2023

As this repo is now low-maintenance, this issue is no longer relevant today. I'm closing all issues that are not bugs.

@alallema alallema closed this as completed Sep 6, 2023