Maintenance
Authors: Mustafa Tariq, Khoa Nguyen
- Optimize effort upfront to minimize troubleshooting later.
- Adhere to the project's pull request (PR) template for detailed guidance.
- **Always Use Pull Requests:** Direct pushes to `main` bypass safeguards. Create PRs for collaborative review and integrated testing.
- **Follow the Checklist:** When submitting a PR, submitters and reviewers must complete their respective checklists provided in `.github/pull_request_template.md`.
- **Ensure All Tests Pass:** Do not merge PRs with failing tests. Strive to deliver production-ready, bug-free code. Server tests live in `server/tests`. The frontend has no testing pipeline, so run it locally and verify that it looks and behaves as expected.
- **Downtime:**
  - Investigate via SSH: check that the Docker containers are running and rebuild/restart them if needed (`docker compose up --build -d`).
  - Review recent PRs to assess potential issues; consider rollbacks.
- **Slowness:**
  - Monitor resource utilization. Scale the server as needed.
  - Consider upgrading your domain beyond free DNS services for improved routing.
- **Cost Management:**
  - Employ fixed-cost resources (VM, domain) and impose spending limits where applicable.
  - Consider alternative cloud providers for potential cost optimization.
- **Removal:** Prioritize removing frontend calls to deprecated features for ease of reversal, then proceed to backend cleanup.
- **Addition:** Implement and test backend functionality extensively before focusing on frontend styling.
- **Change:** Test thoroughly in a development environment. Submit a PR for review after adapting both frontend and backend.
- **Language Model Updates:**
  - Modify `llm` and `llm_embedder` in `server/controllers/utils/ai.py`.
  - Refer to the LangChain documentation: https://python.langchain.com/docs/integrations/components
  - Explore Hugging Face's prompting guide for refinement: https://huggingface.co/docs/transformers/main/en/tasks/prompting
  - Review the model selection rationale: https://github.com/mustafa-tariqk/mindscape/issues/23#issuecomment-1936730018
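Because model choice is concentrated in two names (`llm` and `llm_embedder`), one way to keep swaps low-risk is a single selection point. The sketch below is hypothetical: the provider keys, model names, and `pick_models` helper are illustrative and not the project's actual code in `server/controllers/utils/ai.py`.

```python
from dataclasses import dataclass


@dataclass
class ModelChoice:
    """Hypothetical container pairing a chat model with its embedder."""
    llm_name: str
    embedder_name: str


# Illustrative options only; not the project's actual configuration.
MODEL_CHOICES = {
    "openai": ModelChoice("gpt-4o-mini", "text-embedding-3-small"),
    "local": ModelChoice("mistral", "all-MiniLM-L6-v2"),
}


def pick_models(provider: str) -> ModelChoice:
    """Resolve a provider key to a model pair, so a model swap
    only touches this one table."""
    return MODEL_CHOICES[provider]
```

With this shape, `llm` and `llm_embedder` would be constructed from the result of `pick_models(...)` rather than hard-coded, making a model swap a one-line edit.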
- **Disclaimer:** Update `client/public/disclaimer.txt` as needed.
- **Aesthetics:**
  - Modify CSS: `client/src/style.scss` and `client/src/index.scss` (global), or component-specific `.scss` files.
  - Edit React components in `client/src/components/` and `client/src/pages/` for layout and content changes.
  - Replace images in `client/src/img/`.
- **Managed Database:** Consider a service like DigitalOcean's Managed PostgreSQL for enhanced robustness and reliability.
- **Language Model Agility:** See the "Common Changes" section.
- **Security Enhancements:**
  - Use UUIDs for database IDs.
  - Leverage Google User IDs for authentication.
  - Currently, a user who obtains another user's chat ID can access that chat. While the ID is very hard to guess, restricting chat endpoints by checking the requesting user's email against the chat's owner would close this gap.
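The two suggestions above can be sketched together: `uuid4()` for unguessable IDs, plus an ownership check before serving a chat. The function and parameter names below are illustrative, not the project's API.

```python
import uuid


def new_chat_id() -> str:
    """Random, unguessable chat ID (UUID4), per the suggestion above."""
    return str(uuid.uuid4())


def may_access_chat(chat_owner_email: str, requester_email: str) -> bool:
    """Ownership check: an endpoint would return 403 when this is False.
    A case-insensitive comparison guards against cosmetic mismatches."""
    return chat_owner_email.lower() == requester_email.lower()
```

An endpoint would call `may_access_chat(...)` before returning any chat data, rather than relying on the ID's unguessability alone.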
The project follows a client-server architecture with hierarchical components.
The server follows the component hierarchy below, where a component may only interface with components beneath it:
- Low-level components:
  - `controller.utils.ai` (interfaces with and provides an abstraction layer over the LLM).
  - `controller.utils.database` (interfaces with and provides an abstraction layer over the SQL database).
  - `controller.utils.vstore` (interfaces with and provides an abstraction layer over the LangChain vectorstore).
  - `models` (defines and provides SQL functionality, partly abstracted by `controller.utils.database`).
- Controllers: `controller.analytics` provides analytics functionality.
- API: `app` defines the functionality offered by the API layer, which handles communication with the client.
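The hierarchy can be made concrete as a dependency rule: a component may only call components at or below its own layer. The checker below is purely illustrative (the project has no such module); the layer numbers mirror the list above.

```python
# Lower number = lower layer, mirroring the hierarchy described above.
LAYERS = {
    "controller.utils.ai": 0,
    "controller.utils.database": 0,
    "controller.utils.vstore": 0,
    "models": 0,
    "controller.analytics": 1,
    "app": 2,
}


def may_depend(caller: str, callee: str) -> bool:
    """A component may interface only with components at or below its layer."""
    return LAYERS[callee] <= LAYERS[caller]
```

For example, `app` may call `controller.analytics`, but `controller.utils.ai` must never reach up into `app`.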
- Adapt to database changes by deleting the `.db` file in `server/instances/` and restarting the application (note: this currently erases existing data).
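The reset step above amounts to deleting the SQLite file(s) and restarting. A cautious sketch, assuming the default instance directory (this erases all data, as noted):

```python
from pathlib import Path


def reset_database(instance_dir: str = "server/instances") -> int:
    """Delete SQLite files so the app rebuilds the schema on next start.
    WARNING: this erases all existing data, as noted above."""
    removed = 0
    for db_file in Path(instance_dir).glob("*.db"):
        db_file.unlink()
        removed += 1
    return removed
```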
- **Erowid Seeding:** Data from Erowid can be scraped using the provided script. Change the ID range (low inclusive, high exclusive) to extract the amount you want. Note that scraping will get you blacklisted from the website, although the script should still work afterwards (untested). For a set of cleaned, extracted data from roughly 200 Erowid submissions, please contact the team. Placing the correct JSON files in `server/data/erowid` and deleting the `.db` file in `server/instances/` will trigger the seeding process, including the Erowid data.
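The ID range convention (low inclusive, high exclusive) matches Python's `range`. A sketch of how the script's URL list might be built; the base URL pattern is an assumption, not confirmed from the actual script:

```python
BASE_URL = "https://www.erowid.org/experiences/exp.php?ID="  # assumed pattern


def experience_urls(low: int, high: int) -> list[str]:
    """Build URLs to scrape; low is inclusive and high exclusive,
    matching the ID range convention described above."""
    return [f"{BASE_URL}{i}" for i in range(low, high)]
```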
- **Infrastructure:** Supports multiple languages with appropriate models and word-frequency data.
- **Requirements:**
  - An LLM that supports the language. Note that clustering will only work if submissions are in a common language. You can decide on a common clustering language and an embedder for that language, and provide an additional module that produces summaries in the common language within the chat submission pipeline, located in `controllers.utils.ai.summarize_submission()`.
  - A word-frequency CSV (see example: https://www.kaggle.com/datasets/rtatman/english-word-frequency/).
  - Seed the language and dataset in `models.create_app()`. Use the language name format used by NLTK as the ID (simply lowercase language names such as 'english' or 'dutch'). Ensure the language is supported by NLTK's tokenizer and stopwords corpus (most common languages are supported automatically).
- **NLTK Gaps:** Implement custom tokenizers and stop word lists as needed.
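The word-frequency CSV mentioned above can be loaded into a simple mapping. This sketch assumes two columns, word and count, as in the linked Kaggle dataset; the helper name is illustrative:

```python
import csv
import io


def load_frequencies(csv_text: str) -> dict[str, int]:
    """Parse a word-frequency CSV (assumed columns: word, count).
    A header row, if present, is skipped."""
    freqs: dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skip the header or malformed rows
        freqs[row[0]] = int(row[1])
    return freqs
```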
Metrics: Employs SI units (kg for weight, cm for height).
- **Customization:**
  - Modify the schema as instructed in `handle_submission()` in `server/ai.py`.
  - Modify the `Chats_Categories` database model accordingly in `server/models.py`.
  - Modify the process of writing to the database (TBA, yet to be implemented).
  - Re-seed the database (this will delete existing submissions made using this app).
- **Infrastructure:** New messages are assigned a classification based on similar submissions. Substance information is not taken into account, for research purposes, although that information is still stored alongside the classification. If a new submission does not fit known clusters, a new cluster is created.
- **Method:** Set up a cron job to update the experience clusters and re-cluster the submissions based on the new clusters. Running this periodically is recommended to keep the clusters up to date with new data.
- **Customization:**
  - Modify the k value as instructed in `server/app.py`; this changes the number of clusters.
  - Modify the `SIMILARITY_THRESHOLD` constant in `server/app.py`; a lower value makes it easier for new submissions to form new clusters.
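A sketch of how such a threshold might gate cluster creation. Since the text says a lower value makes new clusters easier to form, the constant is modeled here as a distance-style cutoff; the value and names are made up for illustration, and the real constant lives in `server/app.py`.

```python
SIMILARITY_THRESHOLD = 0.75  # illustrative value; the real one is in server/app.py


def assign_cluster(distance_to_nearest: float) -> str:
    """Start a new cluster when the nearest existing cluster is too far
    away. With a distance-style measure, lowering the threshold makes
    new clusters form more easily, matching the note above."""
    return "new" if distance_to_nearest > SIMILARITY_THRESHOLD else "existing"
```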