
Script to export database #36

Merged · 21 commits · Nov 8, 2020

Conversation

@iakshat (Member) commented Oct 30, 2020

This PR intends to solve Issue #33, exporting from mLab. I've included a simple script that collects each notice from the original database and sends it, one by one, to the target database.

Proposed Step by Step Process:

  1. Create an Atlas account and make a database (the free tier should be sufficient)
  2. Ensure MONGODB_URI is set in the env for the mftp script, i.e. the connection URI for the original database
  3. Get the connection URI from Atlas and add it as MONGODB_URI_ATLAS in the env for the mftp script (ensure your IP is whitelisted too)
  4. Ensure you're calling the export_db function in main
  5. Once the script starts, it'll push notices to the Atlas database one by one. If an error is encountered, the affected notices are pushed into defaulter files, which you can process further with another function, insert_from_file

If there are problems while running the export_db function, we can run it multiple times and the error handling will sort out what to do.
The inserted documents keep the _id from the original mLab database, so we'll actually be replicating the same database in Atlas.
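The one-by-one approach described above could be sketched roughly like this (a minimal sketch, not the PR's actual code; pymongo and the collection name "notices" are assumptions). Because the original _id is kept, a re-run raises a duplicate-key error for documents already copied, so restarting cannot create duplicates:

```python
# Minimal sketch of the one-by-one export with "defaulter" handling.
# The core loop is DB-agnostic: source_docs is any iterable of dicts
# (e.g. source_col.find()) and insert_one is any callable that writes
# one document (e.g. target_col.insert_one).
import json

def export_notices(source_docs, insert_one, defaulter_path=None):
    """Copy documents one by one; failed documents become 'defaulters'.

    Returns (inserted_count, defaulters).
    """
    inserted, defaulters = 0, []
    for doc in source_docs:
        try:
            insert_one(doc)       # keeps the original _id, replicating the DB
            inserted += 1
        except Exception:         # e.g. DuplicateKeyError on a re-run
            defaulters.append(doc)
    if defaulter_path and defaulters:
        # Dump failures so insert_from_file-style reprocessing can retry them.
        with open(defaulter_path, "w") as fh:
            json.dump(defaulters, fh, default=str)
    return inserted, defaulters

# Usage (requires pymongo and live URIs; names are illustrative):
#   from pymongo import MongoClient
#   src = MongoClient(OLD_URI).get_default_database()["notices"]
#   dst = MongoClient(NEW_URI).get_default_database()["notices"]
#   export_notices(src.find(), dst.insert_one, "defaulters.json")
```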

Once the database is ready in Atlas, we can test it with a local mftp script (without the export scripts) and see whether any problems come up.

PS: I don't think we'll need to run the defaulter-reprocessing part, as there's very little chance of defaulting: almost all lines of code are API calls to the databases.
PS: Please ignore the changes made to update.py; I forgot to pull before pushing my changes 😰

@thealphadollar (Contributor) left a comment

Hey @iakshat, thank you very much for working on this crucial issue; we really needed this migration done, and the deadline is close. I also liked that there are so many debugging prints in the script, so we'll know right away when things go wrong.

PS. I'll do an in-depth review once we have generalized the migration script.

Review threads (resolved): update.py, main.py, export_database.py
@iakshat (Member, Author) commented Oct 31, 2020

@thealphadollar Now the migration script is independent of the main file and accepts two raw inputs for OLD_URI and NEW_URI. I couldn't make them script arguments, because Python seemed to treat the URIs as directory paths and threw an error.

Review thread (resolved): export_database.py
@thealphadollar (Contributor) left a comment

Hey Akshat, I've left a few more comments on the code structure. There are three major updates that would be required:

  • Having a CLI workflow; for instance, with an argument --try-defaulters the script would ask only for the new DB URI, while running without it would require both. You may implement it however you think best; this is just an example.
  • Adding instructions to the README.md about what the script does, its various arguments, how to use it, and anything else you see fit.
  • Letting us know that you have tested the script with two MongoDB instances and providing any test results or anomalies that you noticed and we should be aware of.
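The suggested CLI workflow could be sketched with argparse (the --try-defaulters flag name follows the reviewer's example above; everything else is illustrative, not the script's actual interface). For what it's worth, argparse accepts URIs as ordinary string arguments as long as they are quoted in the shell:

```python
# Sketch of the proposed CLI: --try-defaulters only needs the new DB URI,
# a full run needs both the old and the new URI.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Migrate notices between two MongoDB instances")
    parser.add_argument("--try-defaulters", action="store_true",
                        help="only re-insert previously failed documents")
    parser.add_argument("--new-uri", required=True,
                        help="destination MongoDB URI")
    parser.add_argument("--old-uri",
                        help="source MongoDB URI (required unless --try-defaulters)")
    args = parser.parse_args(argv)
    if not args.try_defaulters and not args.old_uri:
        parser.error("--old-uri is required unless --try-defaulters is given")
    return args

# Example invocation (URIs quoted so the shell passes them through intact):
#   python export_database.py --old-uri "mongodb://..." --new-uri "mongodb+srv://..."
```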

Review threads (resolved): export_database.py (×4), update.py
@icyflame (Member) left a comment

Sorry, I'm not sure why we need a script; MongoDB has native helpers for this: mongoexport and mongoimport. I don't think we should use a hand-made script, as there are several corner cases that won't be handled (e.g. a connection error would force us to clean the destination DB and restart the script).

As long as a host can connect to both databases, we can simply dump the original DB and clone it into the new DB. ✅

In general, changing master DBs is done with downtime and the following process is followed:

  1. Stop service (downtime starts)
  2. Export original DB using mongoexport
  3. Import into new master DB using mongoimport
  4. Update environment variable MONGODB_URI (say) to the new DB's host
  5. Restart service (downtime ends)

It's completely fine to have a few minutes of downtime during this upgrade. It is the safest way to ensure that we don't mess something up during the upgrade.
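Steps 2 and 3 above map directly onto MongoDB's native tools. A sketch, reusing the env variables mentioned earlier in the thread; the collection name notices and the output file name are assumptions:

```shell
# Step 2: dump the original DB (downtime has started, service is stopped)
mongoexport --uri "$MONGODB_URI" --collection notices --out notices.json

# Step 3: load the dump into the new master DB
mongoimport --uri "$MONGODB_URI_ATLAS" --collection notices --file notices.json
```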


In this PR, there are some changes to update.py, I am not sure what they are or why they are in this PR 🙇‍♂️ Master DB changes should not require any application changes 😌

@iakshat (Member, Author) commented Nov 2, 2020

@icyflame I'm really not aware of the process usually followed in these situations, so I felt this was a good approach: it ensured zero downtime for the script, which people generally depend on completely. But if this method is prone to errors, then it's really better to stop the script for some time and keep the data safe.

@thealphadollar Please guide me on whether I should move ahead with the changes you suggested, or scrap this PR and work out the mongoexport approach. AFAIK @mukul-mehta looked into that route; he'll be able to give better insight.


Regarding the script:

"Connection error will cause us to clean destination DB and restart the script; etc"

I'm not sure why we'd need to clean the destination DB. The script I have written handles such problems; if the connection is somehow lost, we can just restart the script with no risk of duplication.

"In this PR, there are some changes to update.py, I am not sure what they are or why they are in this PR 🙇‍♂️ Master DB changes should not require any application changes 😌"

Sorry about this, but the changes showing up in update.py are already in the main of this repo. They really should not show up here 😕

@icyflame (Member) left a comment

Discussed with @thealphadollar, I will change my review status to Comment 🙆‍♂️

@icyflame (Member) commented Nov 2, 2020

@iakshat Thanks for working on this!

In the future, please do search for existing solutions that are native or near-native and built or associated with the project, before deciding to roll your own solution.

These native solutions are well tested (unit tests), maintained by the community, well documented, and used by a lot of people in production environments, which leaves far fewer corner cases to run into when running them in production. This is extremely important, as database-related operations are high risk. 😅 In general, running a well-tested tool used by many other people is safer than running a hand-written script tested a few times in controlled environments.

@thealphadollar (Contributor) commented Nov 2, 2020

@iakshat Please go ahead with the changes I had requested. @icyflame and I have discussed it, and it is definitely not a good idea to reinvent the wheel when an easier method already exists. That said, I see you've taken considerable caution and even implemented error handling and resumption in case of issues; that is great, and for this purpose your script seems to be doing fine. It will also save us from installing mongo and its associated import/export tools, which is an added plus ➕

Let's finish this script with the required changes, thorough testing, and add all instructions to readme so that we can do the migration by the end of the week.

PS. I'd like to understand your thought process: what made you write the script if you were aware of an existing solution? This will help us guide each other better and understand each other more deeply in the future.

@amrav (Member) commented Nov 2, 2020

@iakshat Thanks for the PR! +1 to using their recommended migration tool instead, as @thealphadollar had linked previously: https://docs.mlab.com/how-to-migrate-sandbox-databases-to-atlas/ Even if we have some code already, and even if we've reviewed it, it's impossible to write bug-free code; the safer option is to use a well-tested migration tool written by people with more expertise and time than you and me :)

After transferring data to the destination db, and before switching to it, it would be good to carry out some sanity checks on the source and destination databases:

  1. same number of records
  2. same storage used
  3. same indices
  4. run some sample queries (e.g. notices in the last day) and check you get the same results

Also, storage on Atlas's free tier is capped at 512 MB; what's our current usage?
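The count/index checks and the storage question above could be sketched like this. compare_stats is pure, so it works on any two stat dicts; the pymongo snippet for gathering the real stats (collection name assumed) is shown commented out:

```python
# Sanity checks before cutting over: same record count, same indexes,
# and the destination fits inside Atlas's 512 MB free tier.
def compare_stats(old, new, free_tier_bytes=512 * 1024 * 1024):
    """old/new: dicts with 'count', 'indexes', and 'data_size' keys."""
    return {
        "same_count": old["count"] == new["count"],
        "same_indexes": old["indexes"] == new["indexes"],
        "fits_free_tier": new["data_size"] < free_tier_bytes,
    }

# Gathering the stats with pymongo (requires live URIs; names illustrative):
#   from pymongo import MongoClient
#   def gather_stats(uri, coll="notices"):
#       db = MongoClient(uri).get_default_database()
#       return {"count": db[coll].count_documents({}),
#               "indexes": set(db[coll].index_information()),
#               "data_size": db.command("dbStats")["dataSize"]}
```

Sample queries (point 4) still need eyeballing by hand, but the mechanical checks above catch a truncated or partial copy immediately.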

@amrav (Member) commented Nov 2, 2020

@iakshat I too have talked to @thealphadollar 😆 My comment is not a blocker for this PR, please go ahead (with appropriate caution)!

@iakshat (Member, Author) commented Nov 2, 2020

@icyflame @thealphadollar @amrav Thank you for all your insights. I'm really learning a lot of new things working on these issues! I too feel this should have been done with a less error-prone, properly tested tool.

To be honest, I was a bit lazy and didn't properly check out each step of the suggested export procedure (which had way too many steps 🤕), and I felt it would be easier to write this script since the database has only one collection. I think we can try this way, and if anything goes wrong we can fall back to the mLab-prescribed procedure mentioned by @amrav.

@thealphadollar (Contributor) left a comment

Cool! Thanks for making the changes. Below are some queries I have.

  1. Should we include a requirements.txt file? It would help install all the required dependencies.
  2. Do you think specifying the python version would be helpful in the README.md?
  3. I propose that we create a new folder with the script, the requirements file, and a README.md specific to the script. In the root README.md we would just mention that, for migration, one should look into that folder. This way we can have all the related code in one place.

Let me know what you think of the above suggestions. Let's aim for a migration by the end of the week (8th November, Sunday).

Review threads (resolved): README.md (×3), export_database.py (×2)
iakshat and others added 2 commits November 4, 2020 00:53
Co-authored-by: Shivam Kumar Jha <shivam.cs.iit.kgp@gmail.com>
@iakshat (Member, Author) commented Nov 3, 2020

I've included the suggested changes: the migration script is reorganised into its own directory, along with all the other minor changes suggested. If possible, please add requirements.txt yourself, because my environment adds some unused libraries to fresh envs due to some setup issues.

@thealphadollar (Contributor) left a comment

Hmm, I have some more doubts, and I think we might have forgotten to update the readme.

Review threads (resolved): migration/export_database.py (×3), migration/README.md (×2)
@thealphadollar (Contributor) left a comment

Ah! This is beautiful <3 I'm now trying to use this to migrate MFTP.

@thealphadollar merged commit 4781031 into metakgp:master Nov 8, 2020
@thealphadollar (Contributor)

@iakshat Can we please update the README.md with the latest environment variable details and the fact that we have migrated from mLab to Atlas? The document is outdated at the moment.
