Possible memory leak bug with to_sql with chunksize ? #26569

Closed
Santosh-Gupta opened this issue May 29, 2019 · 9 comments

Comments

@Santosh-Gupta

Code Sample, a copy-pastable example if possible

import sqlalchemy

# dataframe has 2 text columns and ~6 million rows
engine = sqlalchemy.create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)

Problem description

I created a large dataframe in Pandas, about 6 million rows of text data. I wanted to save it to a SQL database file, but when I try to save it, I get an out-of-memory error. I even reduced the chunk size to 100 and it still crashes.

However, if I save a smaller version of that dataframe with 100,000 rows to a database with no chunksize specified, I have no issues saving that dataframe.

My understanding was that since it's only processing 100 rows at a time, the RAM usage should reflect that of a save of 100 rows. Is there something else happening behind the scenes? Perhaps multi-threading?
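In other words, I expected chunksize=100 to behave roughly like the manual loop below (just a sketch with a toy frame; the column names are made up, the real dataframe is the 6-million-row one described here):

import pandas as pd
import sqlalchemy

# toy stand-in for the real two-column, ~6-million-row text dataframe
dataframe = pd.DataFrame({"title": ["some text"] * 1000,
                          "abstract": ["much longer text"] * 1000})

engine = sqlalchemy.create_engine("sqlite:///databasefile.db")

chunk = 100
for start in range(0, len(dataframe), chunk):
    # each slice is only `chunk` rows, so peak memory should stay small
    dataframe.iloc[start:start + chunk].to_sql(
        "CS_table", engine, if_exists="append", index=False)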

Before I run this code, I am using 4.8 GB of the 12.8 GB of RAM available in Google Colab. Running the above code eats up all the RAM until the environment crashes.

I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe has 2 columns and ~6 million rows. Each cell contains about this much text:

"The dominant sequence transduction models are based on complex
recurrent or convolutional neural networks in an encoder-decoder
configuration. The best performing models also connect the encoder and
decoder through an attention mechanism. We propose a new simple
network architecture, the Transformer, based solely on attention
mechanisms, dispensing with recurrence and convolutions entirely.
Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring
significantly less time to train. Our model achieves 28.4 BLEU on the
WMT 2014 English-to-German translation task, improving over the
existing best results, including ensembles by over 2 BLEU. On the WMT
2014 English-to-French translation task, our model establishes a new
single-model state-of-the-art BLEU score of 41.8 after training for
3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer
generalizes well to other tasks by applying it successfully to English
constituency parsing both with large and limited training data."

I posted a Stack Overflow question here:

https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi

And one of the users said:

From stepping through the code I think it's this line, which creates a bunch of DataFrames:

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

This looks like it's probably a bug. Specifically, this happens prior to database insertion, during preparation.
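For reference, here is a standalone sketch of the pattern that line implements (the surrounding names are illustrative guesses, not copied from the pandas internals): each chunk slices every column array and zips the slices into per-row tuples before they are handed to the driver.

import numpy as np

# stand-in for the per-column arrays pandas prepares before insertion
data_list = [
    np.array(["row %d, col %d" % (i, c) for i in range(1000)])
    for c in range(2)
]
keys = ["col_a", "col_b"]
nrows = len(data_list[0])
chunksize = 100

for start_i in range(0, nrows, chunksize):
    end_i = min(start_i + chunksize, nrows)
    # the quoted line: slice every column, zip the slices into row tuples
    chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
    # each row tuple then becomes a dict of column -> value for the insert
    rows = [dict(zip(keys, row)) for row in chunk_iter]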

Expected Output

The code should execute without crashing.

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.6.4
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.11.3
IPython: 5.5.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 3.0.3
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0
gcsfs: None

@WillAyd
Member

WillAyd commented May 30, 2019

Do you have any way of profiling?

@Santosh-Gupta
Author

I am not very well versed in profiling methods but I am more than happy to figure it out. Do you have any recommended methods? If it makes a difference, I am using Google Colab.

Also, I would be happy to provide the notebook I am using. It is self-contained, and downloads the data directly from my Google Drive, so all you have to do is press run and then go to the final cell to check the SQL code.

@Santosh-Gupta
Author

Here is the notebook I am using, with the to_sql code at the bottom:

https://drive.google.com/open?id=1j7JnDlmIP5QE9lwLuQo-z3jABElaesZN

The code needs my data from my Google Drive folder. I could not find a way to let a new user directly download all the contents of my folder, so I am just going to share the folder with this link:

https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8

Import this into your drive (it should not take up any room in your drive, since I am hosting it).

Then run all the cells in the notebook. Warning: this will ask for your Google Drive credentials in the first block of code.

It's possible for me to make a minimal example, but it's tricky because I need to make the dataframe big enough for it to fail with to_sql, yet small enough to still fit inside Colab.

Let me know if there's anything else I can do to make looking into this issue easier.

@WillAyd
Member

WillAyd commented May 31, 2019

Here's an example of profiling if you can isolate the issue in your code:

#25893 (comment)

@Santosh-Gupta
Author

Not quite sure how to use that.

I awkwardly tried:

import os
import psutil

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    dfAllT.to_sql("CS_table2", engine, chunksize = 100)
    sink()

and

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    dfAllT.to_sql("CS_table2", engine, chunksize = 100).rolling(4000).max()
    sink()

But neither produced any output; my environment crashed without printing anything.
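Note that sink() only runs after to_sql over the whole frame returns, so nothing could be printed before the crash. A per-chunk measurement would look something like the sketch below (untested; the toy frame and database file are placeholders for dfAllT and the real engine):

import os
import psutil
import pandas as pd
import sqlalchemy

process = psutil.Process(os.getpid())
engine = sqlalchemy.create_engine("sqlite:///databasefile.db")

# toy frame standing in for dfAllT
dfAllT = pd.DataFrame({"a": ["x" * 100] * 10000,
                       "b": ["y" * 100] * 10000})

chunk = 100
for n, start in enumerate(range(0, len(dfAllT), chunk)):
    dfAllT.iloc[start:start + chunk].to_sql(
        "CS_table2", engine, if_exists="append", index=False)
    if n % 100 == 0:
        # resident memory after every 100 chunks
        print("mem %fMB" % (process.memory_info().rss / 1e6))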

@TomAugspurger
Contributor

TomAugspurger commented May 31, 2019 via email

@Santosh-Gupta
Author

Santosh-Gupta commented Jun 2, 2019

I believe the issue is not with pandas but with SQLAlchemy. I used something like for i in range(100): df.iloc[i * 100000:(i+1) * 100000].to_sql(...) as an alternative method of chunking, but the RAM still built up and crashed. So perhaps there is some sort of memory leak in SQLAlchemy.

Here's a Stack Overflow link for anyone interested in continuing to monitor this issue:

https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi

Here is a minimal example for anyone wanting to replicate the issue:

import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([ df,  dfAll ])

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql('test_table22', engine, index=False, if_exists='append')

@Santosh-Gupta
Author

I mentioned this to SQLAlchemy and they said it's unlikely the leak is happening there, so the issue might be in the SQLite driver or in pandas.

It seems that, under certain conditions, there is a memory leak with repeated insertions to SQLite via SQLAlchemy.

I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close. With my data it would result in a memory crash no matter what I did, whereas with a self-contained minimal example I have to play around with the parameters to cause the leak.

The following code, when run in Google Colab in CPU mode, leaks memory, although not severely enough to cause a crash.

import string
import numpy as np
import pandas as pd
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([ df,  dfAll ])

import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql('test_table22', engine, index=False, if_exists='append')

This was run in a Google Colab CPU environment.

The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database does not cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via a loop of to_sql calls, or a single to_sql call with chunksize specified.
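A sketch of the kind of check described above (not the exact code that was run): reconnect to the existing file in a fresh environment after the inserts, confirm the rows are there, and watch that memory stays flat. The table name matches the example above.

import os
import psutil
import sqlalchemy

process = psutil.Process(os.getpid())
print("before connecting: %.1f MB" % (process.memory_info().rss / 1e6))

# reconnect to the database file written by the previous run
engine = sqlalchemy.create_engine("sqlite:///testtt.db")
with engine.connect() as conn:
    count = conn.execute(
        sqlalchemy.text("SELECT COUNT(*) FROM test_table22")).scalar()

print("rows already inserted:", count)
print("after connecting: %.1f MB" % (process.memory_info().rss / 1e6))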

Santosh-Gupta reopened this Jun 4, 2019