
Documentation should describe advantages over DataFrame constructor (of Pandas) #107

Closed
sanjaydasgupta opened this issue Nov 18, 2022 · 6 comments

Comments

@sanjaydasgupta

Converting the output of the pymongo "find()" method to a Pandas DataFrame can be done directly by the DataFrame constructor.

The "find()" method returns a cursor that yields Python dictionaries; once materialized as a list, that collection can be handled directly by the DataFrame constructor.

Moreover, the Pandas DataFrame constructor can already handle data of all Python types (particularly lists and dictionaries).

In view of the above, the documentation should discuss the need for this library, and any advantages it has over the Pandas DataFrame constructor should be spelled out.

@blink1073
Member

Hi @sanjaydasgupta, thanks for the suggestion. I've opened https://jira.mongodb.org/browse/ARROW-129 to track the issue.

Summarizing here as well:

We should list the pros and cons of using this library versus using the PyMongo API directly, highlighting the benchmarks as well as the limitations.

We should give examples showing how the same tasks could be accomplished with each.
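
For illustration only, a minimal sketch of the two approaches side by side (the connection string, database, and collection names are placeholders, the empty query fetches every document, and this is not taken from the project documentation):

from pymongo import MongoClient
from pymongoarrow.api import find_pandas_all
import pandas as pd

# Placeholder connection details
client = MongoClient('mongodb+srv://????.mongodb.net/')
coll = client.get_database('???').get_collection('???')

# 1) Plain PyMongo: materialize the cursor as a list of dicts,
#    then let the DataFrame constructor infer the columns
df_plain = pd.DataFrame(list(coll.find({})))

# 2) PyMongoArrow: run the same query and get a DataFrame back directly
df_arrow = find_pandas_all(coll, {})

The second form avoids building the intermediate list of Python dicts, which is the main difference the benchmarks would be measuring.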

@sanjaydasgupta
Author

Hi @blink1073, thanks for your response.

Here is some sample code that illustrates the direct approach to obtaining a pandas DataFrame from the contents of a MongoDB collection:

from pymongo import MongoClient
import pandas as pd

# Connection string, database, and collection names are placeholders
client = MongoClient('mongodb+srv://????.mongodb.net/')

# find() returns a cursor; materialize it as a list of dicts
records = list(client.get_database('???').get_collection('???').find())

# The DataFrame constructor accepts a list of dicts directly
df = pd.DataFrame(records)

The code above handles all Python types (including lists and dicts), and does a fair job of deducing the column data types.
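
For a small, self-contained illustration of that type inference (the records below are made up for the example):

import pandas as pd

records = [
    {'name': 'a', 'score': 1.5, 'tags': ['x', 'y']},
    {'name': 'b', 'score': 2.0, 'tags': []},
]

df = pd.DataFrame(records)
print(df.dtypes)
# name      object
# score    float64
# tags      object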

I hope this is helpful.

@blink1073
Member

It is, thank you!

@Khushali22

@sanjaydasgupta I agree with your suggestion to properly document the advantages of using this library versus using the PyMongo API directly.

I have experimented with both libraries, and what I was looking for was a faster response. With a huge dataset, converting the PyMongo cursor to a list and then to a DataFrame is time-consuming compared with using the find_pandas_all API to get the result as a pandas DataFrame directly, as in the sketch below.
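
A rough way to see the difference (a sketch only; `coll` is a pymongo collection set up as in the earlier snippet, and timings will of course vary with the data):

import time
import pandas as pd
from pymongoarrow.api import find_pandas_all

def timed(label, fn):
    # Crude wall-clock timing; repeat and average for anything serious
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.3f}s')
    return result

df_plain = timed('pymongo + DataFrame constructor',
                 lambda: pd.DataFrame(list(coll.find({}))))
df_arrow = timed('pymongoarrow find_pandas_all',
                 lambda: find_pandas_all(coll, {}))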

@blink1073 Can you help me confirm whether I have this right, i.e. that the elapsed time should be lower when using pymongoarrow? Also, is there any plan to support nested data structures directly, without an aggregation pipeline in between?

One more observation: ObjectId data is being converted to bindata. Is that expected, or is it just on my side? If it is a real issue, I can open a new one for it.

Looking forward to your response.

Thank you.

@blink1073
Member

Hi @Khushali22, thank you for the further insight. I am currently working on nested data in #104. The ObjectId representation is tracked in https://jira.mongodb.org/browse/ARROW-55.

@blink1073
Member

Added in the 1.0 release.
