Maintenance
Authors: Mustafa Tariq, Khoa Nguyen
- Optimize effort upfront to minimize troubleshooting later.
- Adhere to the project's pull request (PR) template for detailed guidance.
- **Always Use Pull Requests:** Direct pushes to `main` bypass safeguards. Create PRs for collaborative review and integrated testing.
- **Follow the Checklist:** When submitting a PR, submitters and reviewers must complete their respective checklists provided in `.github/pull_request_template.md`.
- **Ensure All Tests Pass:** Do not merge PRs with failing tests. Strive to deliver production-ready, bug-free code. Server tests live in `server/tests`. The frontend has no testing pipeline, so run it locally and verify that it looks and behaves as expected.
- **Downtime:**
  - Investigate via SSH: check that the Docker containers are running and rebuild/restart them if needed (`docker compose up --build -d`).
  - Review recent PRs to assess potential issues; consider rollbacks.
- **Slowness:**
  - Monitor resource utilization. Scale the server as needed.
  - Consider upgrading your domain beyond free DNS services for improved routing.
- **Cost Management:**
  - Employ fixed-cost resources (VM, domain) and impose spending limits where applicable.
  - Consider alternative cloud providers for potential cost optimization.
- **Removal:** Prioritize removing frontend calls to deprecated features for ease of reversal, then proceed to backend cleanup.
- **Addition:** Implement and test backend functionality extensively before focusing on frontend styling.
- **Change:** Test thoroughly in a development environment. Submit a PR for review after adapting both frontend and backend.
- **Language Model Updates:**
  - Modify `llm` and `llm_embedder` in `server/controllers/utils/ai.py`.
  - Refer to the LangChain documentation: https://python.langchain.com/docs/integrations/components
  - Explore Hugging Face's prompting guide for refinement: https://huggingface.co/docs/transformers/main/en/tasks/prompting
  - Review the model selection rationale: https://github.com/mustafa-tariqk/mindscape/issues/23#issuecomment-1936730018
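Because model choice is concentrated in two names (`llm` and `llm_embedder`), one way to keep swaps low-risk is a single selection point. The sketch below is hypothetical: the provider keys, model names, and `pick_models` helper are illustrative and not the project's actual code in `server/controllers/utils/ai.py`.

```python
from dataclasses import dataclass


@dataclass
class ModelChoice:
    """Hypothetical container pairing a chat model with its embedder."""
    llm_name: str
    embedder_name: str


# Illustrative options only; not the project's actual configuration.
MODEL_CHOICES = {
    "openai": ModelChoice("gpt-4o-mini", "text-embedding-3-small"),
    "local": ModelChoice("mistral", "all-MiniLM-L6-v2"),
}


def pick_models(provider: str) -> ModelChoice:
    """Resolve a provider key to a model pair, so a model swap
    only touches this one table."""
    return MODEL_CHOICES[provider]
```

With this shape, `llm` and `llm_embedder` would be constructed from the result of `pick_models(...)` rather than hard-coded, making a model swap a one-line edit.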
- **Disclaimer:** Update `client/public/disclaimer.txt` as needed.
- **Aesthetics:**
  - Modify CSS: `client/src/style.scss` and `client/src/index.scss` (global), or component-specific `.scss` files.
  - Edit React components in `client/src/components/` and `client/src/pages/` for layout and content changes.
  - Replace images in `client/src/img/`.
- **Managed Database:** Consider a service like DigitalOcean's Managed PostgreSQL for enhanced robustness and reliability.
- **Language Model Agility:** See the "Common Changes" section.
- **Security Enhancements:**
  - Use UUIDs for database IDs.
  - Leverage Google User IDs for authentication.
  - Currently, a user who obtains another user's chat ID can access that chat. While the ID is very hard to guess, restricting chat endpoints by checking the requesting user's email against the chat's owner would close this gap.
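The two suggestions above can be sketched together: `uuid4()` for unguessable IDs, plus an ownership check before serving a chat. The function and parameter names below are illustrative, not the project's API.

```python
import uuid


def new_chat_id() -> str:
    """Random, unguessable chat ID (UUID4), per the suggestion above."""
    return str(uuid.uuid4())


def may_access_chat(chat_owner_email: str, requester_email: str) -> bool:
    """Ownership check: an endpoint would return 403 when this is False.
    A case-insensitive comparison guards against cosmetic mismatches."""
    return chat_owner_email.lower() == requester_email.lower()
```

An endpoint would call `may_access_chat(...)` before returning any chat data, rather than relying on the ID's unguessability alone.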
The project follows a client-server architecture with hierarchical components.
The server follows the component hierarchy below, where a component may only interface with components beneath it:
- Low-level components:
  - `controller.utils.ai` (interfaces with and provides an abstraction layer over the LLM).
  - `controller.utils.database` (interfaces with and provides an abstraction layer over the SQL database).
  - `controller.utils.vstore` (interfaces with and provides an abstraction layer over the LangChain vectorstore).
  - `models` (defines and provides SQL functionality, partly abstracted by `controller.utils.database`).
- Controllers: `controller.analytics` provides analytics functionality.
- API: `app` defines the functionality offered by the API layer, which handles communication with the client.
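The hierarchy can be made concrete as a dependency rule: a component may only call components at or below its own layer. The checker below is purely illustrative (the project has no such module); the layer numbers mirror the list above.

```python
# Lower number = lower layer, mirroring the hierarchy described above.
LAYERS = {
    "controller.utils.ai": 0,
    "controller.utils.database": 0,
    "controller.utils.vstore": 0,
    "models": 0,
    "controller.analytics": 1,
    "app": 2,
}


def may_depend(caller: str, callee: str) -> bool:
    """A component may interface only with components at or below its layer."""
    return LAYERS[callee] <= LAYERS[caller]
```

For example, `app` may call `controller.analytics`, but `controller.utils.ai` must never reach up into `app`.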
- Adapt to database changes by deleting the `.db` file in `server/instances/` and restarting the application (note: this currently erases existing data).
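The reset step above amounts to deleting the SQLite file(s) and restarting. A cautious sketch, assuming the default instance directory (this erases all data, as noted):

```python
from pathlib import Path


def reset_database(instance_dir: str = "server/instances") -> int:
    """Delete SQLite files so the app rebuilds the schema on next start.
    WARNING: this erases all existing data, as noted above."""
    removed = 0
    for db_file in Path(instance_dir).glob("*.db"):
        db_file.unlink()
        removed += 1
    return removed
```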
- **Erowid Seeding:** Data from Erowid can be scraped using the provided script. Change the ID range (low inclusive, high exclusive) to extract the amount you want. Note that scraping will get you blacklisted from the website, although the script should still work afterwards (untested). For a set of cleaned, extracted data from roughly 200 Erowid submissions, please contact the team. Placing the correct JSON files in `server/data/erowid` and deleting the `.db` file in `server/instances/` will trigger the seeding process, including the Erowid data.
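The ID range convention (low inclusive, high exclusive) matches Python's `range`. A sketch of how the script's URL list might be built; the base URL pattern is an assumption, not confirmed from the actual script:

```python
BASE_URL = "https://www.erowid.org/experiences/exp.php?ID="  # assumed pattern


def experience_urls(low: int, high: int) -> list[str]:
    """Build URLs to scrape; low is inclusive and high exclusive,
    matching the ID range convention described above."""
    return [f"{BASE_URL}{i}" for i in range(low, high)]
```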
- **Infrastructure:** Supports multiple languages with appropriate models and word-frequency data.
- **Requirements:**
  - An LLM that supports the language. Note that clustering will only work if submissions are in a common language. You can decide on a common clustering language and an embedder for that language, and provide an additional module that produces summaries in the common language within the chat submission pipeline, located in `controllers.utils.ai.summarize_submission()`.
  - A word-frequency CSV (see example: https://www.kaggle.com/datasets/rtatman/english-word-frequency/).
  - Seed the language and dataset in `models.create_app()`. Use the language name format used by NLTK as the ID (simply lowercase language names such as 'english' or 'dutch'). Ensure the language is supported by NLTK's tokenizer and stopwords corpus (most common languages are supported automatically).
- **NLTK Gaps:** Implement custom tokenizers and stop word lists as needed.
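The word-frequency CSV mentioned above can be loaded into a simple mapping. This sketch assumes two columns, word and count, as in the linked Kaggle dataset; the helper name is illustrative:

```python
import csv
import io


def load_frequencies(csv_text: str) -> dict[str, int]:
    """Parse a word-frequency CSV (assumed columns: word, count).
    A header row, if present, is skipped."""
    freqs: dict[str, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skip the header or malformed rows
        freqs[row[0]] = int(row[1])
    return freqs
```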
Metrics: Employs SI units (kg for weight, cm for height).
- **Customization:**
  - Modify the schema as instructed in `handle_submission()` in `server/ai.py`.
  - Modify the `Chats_Categories` database model accordingly in `server/models.py`.
  - Modify the process of writing to the database (TBA, yet to be implemented).
  - Re-seed the database (this will delete existing submissions made using this app).
- **Infrastructure:** New messages are assigned a classification based on similar submissions. Substance information is not taken into account, for research purposes, although that information is still stored alongside the classification. If a new submission does not fit known clusters, a new cluster is created.
- **Method:** Set up a cron job to update the experience clusters and re-cluster the submissions based on the new clusters. Running this periodically is recommended to keep the clusters up to date with new data.
- **Customization:**
  - Modify the k value as instructed in `server/app.py`; this changes the number of clusters.
  - Modify the `SIMILARITY_THRESHOLD` constant in `server/app.py`; a lower value makes it easier for new submissions to form new clusters.
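A sketch of how such a threshold might gate cluster creation. Since the text says a lower value makes new clusters easier to form, the constant is modeled here as a distance-style cutoff; the value and names are made up for illustration, and the real constant lives in `server/app.py`.

```python
SIMILARITY_THRESHOLD = 0.75  # illustrative value; the real one is in server/app.py


def assign_cluster(distance_to_nearest: float) -> str:
    """Start a new cluster when the nearest existing cluster is too far
    away. With a distance-style measure, lowering the threshold makes
    new clusters form more easily, matching the note above."""
    return "new" if distance_to_nearest > SIMILARITY_THRESHOLD else "existing"
```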