
Documentation should describe advantages over DataFrame constructor (of Pandas) #107

Closed
sanjaydasgupta opened this issue Nov 18, 2022 · 6 comments

Comments

@sanjaydasgupta

Converting the output of the pymongo "find()" method to a Pandas DataFrame can be done directly by the DataFrame constructor.

The "find()" method returns a cursor that yields Python dictionaries; once materialized as a list, that collection can be handled directly by the DataFrame constructor.

Moreover, the Pandas DataFrame constructor can already handle data of all Python types (particularly lists and dictionaries).

In view of the above, the documentation should discuss the need for this library, and any advantages it has over the Pandas DataFrame constructor should be spelled out.

@blink1073
Member

Hi @sanjaydasgupta, thanks for the suggestion. I've opened https://jira.mongodb.org/browse/ARROW-129 to track the issue.

Summarizing here as well:

We should list the pros and cons of using this library versus using the PyMongo API directly, highlighting the benchmarks as well as the limitations.

We should give examples showing how the same tasks could be accomplished with each.
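
For illustration only, a minimal sketch of the two approaches side by side (the connection string, database, and collection names are placeholders, the empty query fetches every document, and this is not taken from the project documentation):

from pymongo import MongoClient
from pymongoarrow.api import find_pandas_all
import pandas as pd

# Placeholder connection details
client = MongoClient('mongodb+srv://????.mongodb.net/')
coll = client.get_database('???').get_collection('???')

# 1) Plain PyMongo: materialize the cursor as a list of dicts,
#    then let the DataFrame constructor infer the columns
df_plain = pd.DataFrame(list(coll.find({})))

# 2) PyMongoArrow: run the same query and get a DataFrame back directly
df_arrow = find_pandas_all(coll, {})

The second form avoids building the intermediate list of Python dicts, which is the main difference the benchmarks would be measuring.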

@sanjaydasgupta
Author

Hi @blink1073, thanks for your response.

Here is some sample code that illustrates the direct approach to obtaining a pandas DataFrame from the contents of a MongoDB collection:

from pymongo import MongoClient
import pandas as pd

# Connection string, database, and collection names are placeholders
client = MongoClient('mongodb+srv://????.mongodb.net/')

# find() returns a cursor; materialize it as a list of dicts
records = list(client.get_database('???').get_collection('???').find())

# The DataFrame constructor accepts a list of dicts directly
df = pd.DataFrame(records)

The code above handles all Python types (including lists and dicts), and does a fair job of deducing the column data types.
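
For a small, self-contained illustration of that type inference (the records below are made up for the example):

import pandas as pd

records = [
    {'name': 'a', 'score': 1.5, 'tags': ['x', 'y']},
    {'name': 'b', 'score': 2.0, 'tags': []},
]

df = pd.DataFrame(records)
print(df.dtypes)
# name      object
# score    float64
# tags      object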

I hope this is helpful.

@blink1073
Member

It is, thank you!

@Khushali22

@sanjaydasgupta I agree with your suggestion to properly document the advantages of using this library versus using the PyMongo API directly.

I have experimented with both libraries, and what I was looking for was a faster response. With a huge dataset, converting the PyMongo cursor to a list and then to a DataFrame is time-consuming compared with using the find_pandas_all API to get the result as a pandas DataFrame directly, as in the sketch below.
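
A rough way to see the difference (a sketch only; `coll` is a pymongo collection set up as in the earlier snippet, and timings will of course vary with the data):

import time
import pandas as pd
from pymongoarrow.api import find_pandas_all

def timed(label, fn):
    # Crude wall-clock timing; repeat and average for anything serious
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.3f}s')
    return result

df_plain = timed('pymongo + DataFrame constructor',
                 lambda: pd.DataFrame(list(coll.find({}))))
df_arrow = timed('pymongoarrow find_pandas_all',
                 lambda: find_pandas_all(coll, {}))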

@blink1073 Can you help me confirm whether I have this right, i.e. that the elapsed time should be lower when using pymongoarrow? Also, is there any plan to support nested data structures directly, without an aggregation pipeline in between?

One more observation: ObjectId data is being converted to bindata. Is that expected, or is it just on my side? If it is a real issue, I can open a new one for it.

Looking forward to your response.

Thank you.

@blink1073
Member

Hi @Khushali22, thank you for the further insight. I am currently working on nested data in #104. The ObjectId representation is tracked in https://jira.mongodb.org/browse/ARROW-55.

@blink1073
Member

Added in the 1.0 release.
