Analytics #1800

Merged
merged 68 commits into from Oct 29, 2021

Conversation

irevoire
Member

@irevoire irevoire commented Oct 12, 2021

Closes #1784
Implements this spec

Anonymous Analytics Policy

1. Functional Specification

I. Summary

This specification gives an exhaustive list of the anonymous metrics collected by the MeiliSearch binary. It also describes the tools we use for this collection and how we identify a MeiliSearch instance.

II. Motivation

At MeiliSearch, our vision is to provide an easy-to-use search solution that meets the essential needs of our users. At all times, we strive to understand our users better and meet their expectations in the best possible way.

Although we can gather needs and understand our users through several channels such as GitHub, Slack, surveys, interviews or roadmap votes, we realize that this is not enough to give us a complete view of MeiliSearch usage and feature adoption. By cross-referencing our product discovery phases with aggregated quantitative data, we want to make the product much better than it is today and to ground our decisions in how MeiliSearch is actually used.

III. Explanation

General Data Protection Regulation (GDPR)

The metrics collected are non-sensitive, non-personal and do not identify an individual or a group of individuals using MeiliSearch. The data collected is secured and anonymized. We do not collect any data from the values stored in the documents.

We, the MeiliSearch team, provide an email address so that users can request the removal of their data: privacy@meilisearch.com.

Thanks to the unique identifier generated for their MeiliSearch installation (Instance uuid when launching MeiliSearch), we can remove the corresponding data from all the tools we describe below. Any questions regarding the management of the data collected can be sent to the email address as well.

Tools

Segment

The collected data is sent to Segment. Segment is a platform for data collection and provides data management tools.

Amplitude

Amplitude is a tool for graphing and highlighting collected data. Segment feeds Amplitude so that we can build visualizations according to our needs.


The identify call we send every hour:

System Configuration system

This property allows us to gather essential information to better understand on which type of machine MeiliSearch is used. This allows us to better advise users on the machines to choose according to their data volume and their use-cases.

  • system => Never changes but is still sent every hour
    • distribution | On which distribution MeiliSearch is launched, eg: Arch Linux
    • kernel_version | On which kernel version MeiliSearch is launched, eg: 5.14.10-arch1-1
    • cores | How many cores does the machine have, eg: 24
    • ram_size | Total capacity of the machine's RAM. Expressed in Kb, eg: 33604210
    • disk_size | Total capacity of the biggest disk. Expressed in Kb, eg: 336042103
    • server_provider | Users can tell us which provider hosts MeiliSearch by setting the MEILI_SERVER_PROVIDER env var. It is also set by our providers' deploy scripts (e.g. the GCP cloud-config.yaml), eg: gcp

MeiliSearch Configuration

  • context.app.version: MeiliSearch version, eg: 0.23.0
  • env: production / development, eg: production
  • has_snapshot: Whether the MeiliSearch instance has snapshots activated, eg: true

MeiliSearch Statistics stats

  • stats
    • database_size: Size of indexed data. Expressed in Kb, eg: 180230
    • indexes_number: Number of indexes, eg: 2
    • documents_number: Number of indexed documents, eg: 165847
    • start_since_days: How many days ago was the instance launched?, eg: 328
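
As a rough illustration of start_since_days, here is a minimal sketch that counts whole days between a first-launch timestamp and the current time. The function name and the way the launch time is obtained are assumptions made for this example, not taken from the PR.

```rust
use std::time::SystemTime;

/// Whole days elapsed between the first launch and `now`.
/// Hypothetical helper: how the launch time is stored is not specified here.
/// Returns 0 if `now` is earlier than `first_launch` (e.g. clock adjusted).
fn start_since_days(first_launch: SystemTime, now: SystemTime) -> u64 {
    const SECS_PER_DAY: u64 = 24 * 60 * 60;
    now.duration_since(first_launch)
        .map(|elapsed| elapsed.as_secs() / SECS_PER_DAY)
        .unwrap_or(0)
}
```

Partial days are truncated, so an instance launched 23 hours ago would report 0.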

  • Launched | This is the first event sent, to mark that MeiliSearch has been launched for the first time

  • Documents Searched POST: The Documents Searched event is sent once an hour. The event's properties are averaged over all search operations during that time so as not to track everything and generate unnecessary noise.
    • user-agent: Represents all the user-agents encountered on this endpoint during one hour, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • requests
      • 99th_response_time: The maximum latency, in ms, for the fastest 99% of requests, eg: 57ms
      • total_suceeded: The total number of succeeded search requests, eg: 3456
      • total_failed: The total number of failed search requests, eg: 24
      • total_received: The total number of received search requests, eg: 3480
    • sort
      • with_geoPoint: Whether the built-in sort rule _geoPoint has been used, eg: true / false
      • avg_criteria_number: The average number of sort criteria among all the requests containing the sort parameter. "sort": [] counts as 0, while not sending sort does not influence the average, eg: 2
    • filter
      • with_geoRadius: Whether the built-in filter rule _geoRadius has been used, eg: true / false
      • avg_criteria_number: The average number of filter criteria among all the requests containing the filter parameter. "filter": [] counts as 0, while not sending filter does not influence the average, eg: 4
      • most_used_syntax: The most used filter syntax among all the requests containing the filter parameter. One of string / array / mixed, eg: mixed
    • q
      • avg_terms_number: The average number of terms for the q parameter among all requests, eg: 5
    • pagination:
      • max_limit: The maximum limit encountered among all requests, eg: 20
      • max_offset: The maximum offset encountered among all requests, eg: 1000
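
The request counters, the 99th-percentile latency and the pagination maxima can be accumulated in memory and read once per hour. The sketch below is one possible way to do that. The type and field names are hypothetical, and the nearest-rank percentile is an assumption about how "the maximum latency for the fastest 99% of requests" could be computed, not the method used in this PR.

```rust
/// Hypothetical hourly aggregate for search requests (illustrative names).
#[derive(Default)]
struct SearchAggregate {
    total_received: usize,
    succeeded: usize,
    response_times_ms: Vec<u64>,
    max_limit: usize,
    max_offset: usize,
}

impl SearchAggregate {
    /// Record one search request. `elapsed_ms` is `Some` only when it succeeded.
    fn record(&mut self, limit: usize, offset: usize, elapsed_ms: Option<u64>) {
        self.total_received += 1;
        self.max_limit = self.max_limit.max(limit);
        self.max_offset = self.max_offset.max(offset);
        if let Some(ms) = elapsed_ms {
            self.succeeded += 1;
            self.response_times_ms.push(ms);
        }
    }

    /// Failed requests are the ones received but not counted as succeeded.
    fn total_failed(&self) -> usize {
        self.total_received - self.succeeded
    }

    /// Nearest-rank 99th percentile: the largest latency among the
    /// fastest 99% of recorded requests, or `None` if nothing was recorded.
    fn p99_response_time(&self) -> Option<u64> {
        if self.response_times_ms.is_empty() {
            return None;
        }
        let mut sorted = self.response_times_ms.clone();
        sorted.sort_unstable();
        // 1-based rank = ceil(0.99 * n), computed with integer arithmetic.
        let rank = (sorted.len() * 99 + 99) / 100;
        Some(sorted[rank - 1])
    }
}
```

Keeping every latency in a vector is simple but grows with traffic; a bounded histogram would be a reasonable alternative on a busy instance.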

  • Documents Searched GET: The Documents Searched event is sent once an hour. The event's properties are averaged over all search operations during that time so as not to track everything and generate unnecessary noise.
    • user-agent: Represents all the user-agents encountered on this endpoint during one hour, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • requests
      • 99th_response_time: The maximum latency, in ms, for the fastest 99% of requests, eg: 57ms
      • total_suceeded: The total number of succeeded search requests, eg: 3456
      • total_failed: The total number of failed search requests, eg: 24
      • total_received: The total number of received search requests, eg: 3480
    • sort
      • with_geoPoint: Whether the built-in sort rule _geoPoint has been used, eg: true / false
      • avg_criteria_number: The average number of sort criteria among all the requests containing the sort parameter. "sort": [] counts as 0, while not sending sort does not influence the average, eg: 2
    • filter
      • with_geoRadius: Whether the built-in filter rule _geoRadius has been used, eg: true / false
      • avg_criteria_number: The average number of filter criteria among all the requests containing the filter parameter. "filter": [] counts as 0, while not sending filter does not influence the average, eg: 4
      • most_used_syntax: The most used filter syntax among all the requests containing the filter parameter. One of string / array / mixed, eg: mixed
    • q
      • avg_terms_number: The average number of terms for the q parameter among all requests, eg: 5
    • pagination:
      • max_limit: The maximum limit encountered among all requests, eg: 20
      • max_offset: The maximum offset encountered among all requests, eg: 1000
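
The averaging rule for avg_criteria_number is easy to get wrong: an empty array counts as a request with 0 criteria, while an absent parameter is ignored entirely. The sketch below illustrates that rule and a simple count-based choice for most_used_syntax. All type and method names are hypothetical, and the tie-breaking behaviour is an arbitrary choice for this example.

```rust
/// Running average over the requests that contain a given parameter.
#[derive(Default)]
struct CriteriaAverage {
    requests_with_param: usize,
    criteria_sum: usize,
}

impl CriteriaAverage {
    /// `criteria` is `None` when the parameter was not sent at all,
    /// and `Some(0)` for an empty array such as `"sort": []`.
    fn record(&mut self, criteria: Option<usize>) {
        if let Some(count) = criteria {
            self.requests_with_param += 1;
            self.criteria_sum += count;
        }
    }

    /// `None` when no request contained the parameter during the period.
    fn average(&self) -> Option<f64> {
        if self.requests_with_param == 0 {
            None
        } else {
            Some(self.criteria_sum as f64 / self.requests_with_param as f64)
        }
    }
}

/// How many filter expressions used each syntax during the period.
#[derive(Default)]
struct FilterSyntaxCounts {
    string: usize,
    array: usize,
    mixed: usize,
}

impl FilterSyntaxCounts {
    /// Name of the most frequent syntax, or `None` if no filter was seen.
    /// On a tie, the entry listed later wins (an arbitrary choice).
    fn most_used(&self) -> Option<&'static str> {
        [("string", self.string), ("array", self.array), ("mixed", self.mixed)]
            .iter()
            .filter(|(_, n)| *n > 0)
            .max_by_key(|(_, n)| *n)
            .map(|(name, _)| *name)
    }
}
```

For example, recording `Some(4)`, `Some(0)` and `None` yields an average of 2.0, because only the two requests that sent the parameter are counted.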

  • Index Created
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • primary_key: The name of the field used as primary key if set, otherwise null, eg: id

  • Index Updated
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • primary_key: The name of the field used as primary key if set, otherwise null, eg: id

  • Documents Added: The Documents Added event is sent once an hour. The event's properties are aggregated over all POST /documents addition operations during that time so as not to track everything and generate unnecessary noise.
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • payload_type: Represents all the payload_type encountered on this endpoint during one hour, eg: [text/csv]
    • primary_key: The name of the field used as primary key if set, otherwise null, eg: id
    • index_creation: Whether an index creation happened, eg: false

  • Documents Updated: The Documents Updated event is sent once an hour. The event's properties are aggregated over all PUT /documents operations during that time so as not to track everything and generate unnecessary noise.
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • payload_type: Represents all the payload_type encountered on this endpoint during one hour, eg: [application/json]
    • primary_key: The name of the field used as primary key if set, otherwise null, eg: id
    • index_creation: Whether an index creation happened, eg: false
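
The hourly batching described for the Documents Added and Documents Updated events could look like the sketch below: distinct user-agents and payload types are collected in sets, and the batch is taken once per period. The names are hypothetical, and returning nothing for an empty period is an assumption made for this example, prompted by the note later in this thread that these events were still sent when no documents had been received.

```rust
use std::collections::BTreeSet;

/// Hypothetical hourly aggregate for document addition or update requests.
#[derive(Default)]
struct DocumentsAggregate {
    user_agents: BTreeSet<String>,
    payload_types: BTreeSet<String>,
    index_creation: bool,
    requests: usize,
}

impl DocumentsAggregate {
    /// Record one request on the endpoint.
    fn record(&mut self, user_agent: &str, payload_type: &str, created_index: bool) {
        self.requests += 1;
        self.user_agents.insert(user_agent.to_owned());
        self.payload_types.insert(payload_type.to_owned());
        self.index_creation |= created_index;
    }

    /// Take the batch for this period, leaving an empty aggregate behind.
    /// Returns `None` when nothing was recorded, so no event needs sending.
    fn flush(&mut self) -> Option<DocumentsAggregate> {
        if self.requests == 0 {
            None
        } else {
            Some(std::mem::take(self))
        }
    }
}
```

A periodic task would call `flush` once an hour and send an event only when it returns `Some`.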

  • Settings Updated
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • ranking_rules
      • sort_position: Position of the sort ranking rule if any, otherwise null, eg: 5
    • sortable_attributes
      • total: Number of sortable attributes, eg: 3
      • has_geo: Indicates whether _geo is set as a sortable attribute, eg: false
    • filterable_attributes
      • total: Number of filterable attributes, eg: 3
      • has_geo: Indicates whether _geo is set as a filterable attribute, eg: false
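
The settings properties above can be derived directly from the submitted settings. The following sketch uses hypothetical helper names and assumes a 0-based position for the sort rule; the spec does not say whether the position is 0-based or 1-based. It also follows the follow-up notes in this thread: sort_position is null when not specified, and total is 0 when no field was specified.

```rust
/// 0-based position of the `sort` ranking rule (an assumption), or `None`
/// (to be serialized as null) when the rules are absent or lack `sort`.
fn sort_position(ranking_rules: Option<&[&str]>) -> Option<usize> {
    ranking_rules.and_then(|rules| rules.iter().position(|rule| *rule == "sort"))
}

/// Returns `(total, has_geo)` for a sortable or filterable attribute list.
/// A missing list yields a total of 0 and no `_geo`.
fn attribute_stats(attributes: Option<&[&str]>) -> (usize, bool) {
    match attributes {
        Some(attrs) => (attrs.len(), attrs.iter().any(|attr| *attr == "_geo")),
        None => (0, false),
    }
}
```

The same two helpers would cover the dedicated RankingRules, SortableAttributes and FilterableAttributes events, since they carry the same properties.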

  • RankingRules Updated
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • sort_position: Position of the sort ranking rule if any, otherwise null, eg: 5

  • SortableAttributes Updated
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • total: Number of sortable attributes, eg: 3
    • has_geo: Indicates whether _geo is set as a sortable attribute, eg: false

  • FilterableAttributes Updated
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]
    • total: Number of filterable attributes, eg: 3
    • has_geo: Indicates whether _geo is set as a filterable attribute, eg: false

  • Dump Created
    • user-agent: Represents the user-agent encountered for this API call, eg: ["MeiliSearch Ruby (2.1)", "Ruby (3.0)"]

Ensure the user-id file is correctly saved and loaded with:

  • the dumps

  • the snapshots

  • Ensure the CLI only shows the uuid if analytics are activated at launch or if the uuid already exists (i.e. even if MeiliSearch was launched without analytics)

@irevoire
Member Author

irevoire commented Oct 27, 2021

  • Currently on macOS and probably *BSD we can't cat the instance-id config file because it starts with a -
  • Meilisearch should display an instance-id and not a user-id when starting
  • When no documents are sent we still send the Documents Added and Documents Updated events

@irevoire
Member Author

irevoire commented Oct 28, 2021

  • index creation is inverted
  • send the sort_position with null if it's not specified
  • set the total of sortable-attributes and filterable-attributes to 0 when there was no field specified
  • check when creating an index
  • move the user-agent out of the context
  • searchableAttributes
  • move start_since_days in the root of the identify

@MarinPostma
Contributor

bors try

bors bot added a commit that referenced this pull request Oct 28, 2021
@bors
Contributor

bors bot commented Oct 28, 2021

try

Build failed:

@irevoire
Member Author

bors merge

@bors
Contributor

bors bot commented Oct 29, 2021

@bors bors bot merged commit c32f13a into main Oct 29, 2021
@bors bors bot deleted the segment branch October 29, 2021 16:43
Development

Successfully merging this pull request may close these issues.

Improve the way of handling analytics
4 participants