
Determine mandatory fields in the config file #154

Closed
bidoubiwa opened this issue Sep 29, 2021 · 14 comments
Labels
good first issue Good for newcomers

Comments

@bidoubiwa
Contributor

From @sanders41 in this comment

When using a minimal docs-scraper config:

{
  "index_uid": "docs",
  "start_urls": ["https://docs.meilisearch.com"]
}

Errors are thrown for missing fields:

Then I get the error TypeError: argument of type 'NoneType' is not iterable, which happens here. I am thinking the JSON I am using is not what you have in mind? What makes me question that, and wonder if there could be an issue, is that the parser is called from ConfigLoader, and in that class selectors gets initialized to None. This makes me think calling the parser with selectors set to None shouldn't throw an error. Should I instead be using the basic config JSON file?

We should determine which fields are currently mandatory, and decide which fields should or should not be mandatory.
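For context, the error in the quote is what Python raises when a membership check runs against a value that is still None. A minimal sketch of that failure mode, assuming selectors was simply never provided (illustrative only, not the scraper's actual code):

selectors = None       # ConfigLoader initializes selectors to None when the field is absent
"lvl0" in selectors    # TypeError: argument of type 'NoneType' is not iterable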

@askalik

askalik commented Oct 2, 2021

Can I take this one?

@bidoubiwa
Contributor Author

Of course! Thanks a lot @askalik. Maybe add verification of the provided options so we can throw elegantly when a field is missing? Good luck :)
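A minimal sketch of what that verification could look like, assuming the config is loaded as a plain dict (the field list and the validate_config helper are hypothetical; deciding the exact mandatory set is what this issue is about):

MANDATORY_FIELDS = ["index_uid", "start_urls"]

def validate_config(config: dict) -> None:
    # Fail early with a readable message instead of a late TypeError.
    missing = [field for field in MANDATORY_FIELDS if not config.get(field)]
    if missing:
        raise ValueError(f"Missing mandatory config field(s): {', '.join(missing)}")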

@bidoubiwa
Contributor Author

Hello @askalik, just an update: we can only guarantee the assignment for 5 days, after which, if someone else wants to give it a try, we will reassign the issue. Good luck with the PR!

@askalik

askalik commented Oct 5, 2021

@bidoubiwa I will work on this today. I will reach out if I have any questions.

@yankeeinlondon

It would be really helpful if you published a TypeScript interface that described the config format. I think it would be a much more compact representation of what is allowed than parsing through prose.

I've created a boilerplate today which is not yet correct but might serve as a starting point:

export type ScrapeSelector =
  /** the simple representation is just to put a selector in as a string */
  | string
  /** but there is an object notation for greater control */
  | {
      selector: string;
      global: boolean;
      /**
       * will be the displayed value if no content in selector was found.
       */
      default_value: string;
    };

export type ScrapeSelectorTargets = {
  lvl0: ScrapeSelector;
  lvl1: ScrapeSelector;
  lvl2: ScrapeSelector;
  lvl3: ScrapeSelector;
  lvl4: ScrapeSelector;
  lvl5: ScrapeSelector;
  lvl6: ScrapeSelector;
  /** the main body of text */
  text: ScrapeSelector;
};

export type ScrapeUrls =
  | string
  | {
      url: string;
      selectors_key: string;
    };

export interface MeiliSearchConfig {
  /**
   * The index_uid field is the index identifier in your MeiliSearch instance
   * in which your website content is stored. The scraping tool will create a
   * new index if it does not exist.
   */
  index_uid: string;
  /** allows the scraper to index pages by following links */
  start_urls: ScrapeUrls[];
  /**
   * Sitemaps can contain alternative links for URLs. Those are other versions
   * of the same page, in a different language, or with a different URL. By
   * default docs-scraper will ignore those URLs.
   *
   * Set this to true if you want those other versions to be scraped as well.
   */
  sitemap_alternate_links?: boolean;
  /** allows the scraper to index using the sitemap XML */
  sitemap_urls: string[];
  /** The scraper will not follow links that match stop_urls. */
  stop_urls: string[];
  /**
   * DOM selector references which bracket the relevant sections.
   *
   * **Note:** `lvl0` is highest priority and `lvl6` lowest
   *
   * **Note:** the object notation allows you to have different selectors between
   * different pages (`default` is the fallback). To use this format you must also add
   * `selectors_key` props to the start_urls
   */
  selectors: ScrapeSelectorTargets | Record<string, ScrapeSelectorTargets>;
  /**
   * This expects an array of CSS selectors. Any element matching one of those selectors
   * will be removed from the page before any data is extracted from it.
   *
   * This can be used to remove a table of contents, a sidebar, or a footer, to make
   * other selectors easier to write.
   */
  selectors_exclude?: string[];

  custom_settings?: {
    /**
     * The synonym SSG <=> Static Site Generator allows the user to find all the results
     * containing "Static Site Generator" by typing only "SSG" (and the opposite). Here
     * is the [dedicated page about synonyms](https://docs.meilisearch.com/reference/features/synonyms.html)
     * in the official documentation.
     */
    synonyms?: Record<string, string[]>;

    /**
     * Because your website likely contains structured English sentences, we recommend
     * adding stop words. The search engine then ignores common linking words and focuses
     * on the main terms of the query, giving more relevant results.
     *
     * Here is the [dedicated page about stop words](https://docs.meilisearch.com/reference/features/stop_words.html)
     * in the official documentation. You can find more complete lists of
     * English stop words [like this one](https://gist.github.com/sebleier/554280).
     */
    stopWords?: string[];

    /**
     * The default value is 0. By increasing it, you can choose not to index some records
     * if they don't have enough lvlX matching. For example, with a min_indexed_level: 2,
     * the scraper indexes temporary records having at least lvl0, lvl1 and lvl2 set.
     *
     * This is useful when your documentation has pages that share the same lvl0 and lvl1
     * for example. In that case, you don't want to index all the shared records, but want
     * to keep the content different across pages.
     */
    min_indexed_level?: number;
    /**
     * When only_content_level is set to true, then the scraper won't create records for
     * the lvlX selectors.
     *
     * If used, min_indexed_level is ignored.
     */
    only_content_level?: boolean;

    /**
     * When js_render is set to true, the scraper will use ChromeDriver. This is needed for
     * pages that are rendered with JavaScript, for example, pages generated with React or Vue,
     * or applications running in development mode with autoreload/watch.
     *
     * After installing ChromeDriver, provide the path to the binary via the environment
     * variable CHROMEDRIVER_PATH (default value is /usr/bin/chromedriver).
     *
     * The default value of js_render is false.
     */
    js_render?: boolean;

    /**
     * This setting specifies the domains that the scraper is allowed to access. In most cases
     * the allowed_domains will be automatically set using the start_urls and stop_urls. When
     * scraping a domain that contains a port, for example http://localhost:8080, the domain
     * needs to be manually added to the configuration.
     */
    allowed_domains?: string[];
  };
}

@sanders41
Collaborator

How would the TypeScript interface be used? The scraper is written in Python, so I don't see where the TypeScript interface would get used.

@bidoubiwa
Contributor Author

@ksnyde
I'm very curious about how to implement this. Do you have any sources that I might read?

@sanders41
Collaborator

@bidoubiwa I'm curious to hear as well. If it turns out the TypeScript interface won't work but it is something you want, dataclasses, attrs, and Pydantic are Python options for doing the same kind of thing.
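For illustration, a rough Pydantic sketch, assuming the config is read into a model like this (field names come from the example in the issue description; which fields end up required is exactly what this issue needs to decide):

from pydantic import BaseModel, Field

class ScraperConfig(BaseModel):
    # Missing required fields produce a readable ValidationError up front.
    index_uid: str
    start_urls: list[str]

    # Optional fields get safe defaults, so nothing downstream iterates over None.
    stop_urls: list[str] = Field(default_factory=list)
    selectors: dict = Field(default_factory=dict)

# ScraperConfig(**config_dict) then validates the whole file in one place.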

@yankeeinlondon

@sanders41 I find the best documentation is the documentation closest to the code, and for JS/TS that means providing good types. My suggestion would be to include these types in the docs-meilisearch.js repo, but because the document structure is a shared assumption between the scraper and that repo, it is worth mentioning their availability in the scraper docs too, as a footnote.

Currently there are a lot of hidden rules in how this dropdown works, which strongly encourages using it rather than venturing into the more powerful territory of custom/bespoke dropdowns. If a JS/TS developer were building their own dropdown, however, they'd immediately benefit from having this sort of guard rail and the confidence of knowing how to interface with the scraper's document structure.

@yankeeinlondon

yankeeinlondon commented Feb 3, 2022

I'm not very familiar with Python, but since it's dynamically typed, the benefit would likely be harder to reach for Python developers.

@yankeeinlondon

In this regard, I did make a mistake in not considering the SDK. Looking at it ... it does have types, but they're not commented, and I haven't yet gauged them for completeness or for whether they cover the scraper's assumed document structure.

SDK Types

@yankeeinlondon

Finally, I really would like to solve the original problem I mentioned here. I have a valid JSON file (that's in fact now just a cut-and-paste from what's in the docs), but for some reason it is not working and the scraper complains that the JSON is invalid.

@yankeeinlondon

I'm using AST representations of Rust, TS, and Markdown to build indexes, but I'd really like to be able to compare the results to just scraping the HTML.

@alallema
Contributor

alallema commented Sep 6, 2023

As this repo is now low-maintenance, this issue is no longer relevant today. I'm closing all issues that are not bugs.

@alallema alallema closed this as completed Sep 6, 2023