PySpark kernel for Jupyterhub creates file “???????e?” in working directory #9613

Closed
LMtx opened this issue Jun 14, 2016 · 29 comments

@LMtx

LMtx commented Jun 14, 2016

On CentOS 6 I use JupyterHUB 0.6.1 with a PySpark kernel (Spark 1.5.0 + Python 2.7).

When I start PySpark kernel JupyterHUB creates file named "????????????????????????????????????????????????????????????????????????????????????????????????????????????????????e?" in my working directory.

A mirror server running JupyterHUB 0.5.0 does not have this issue.

Please see details on http://stackoverflow.com/questions/37808659/pyspark-kernel-for-jupyterhub-creates-file-e-in-working-directory

@Carreau
Member

Carreau commented Jun 14, 2016

That seems like a PySpark kernel issue... and I'm not sure which one... this one?

@Carreau Carreau added this to the not ipython milestone Jun 14, 2016
@LMtx
Author

LMtx commented Jun 15, 2016

But I do not use the Apache Toree kernel. I have modified the Python2.7 kernel in the following way:

{
 "display_name": "pySpark (Spark 1.5.0)",
 "language": "python",
 "argv": [
  "/usr/bin/python2.7",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "PYSPARK_PYTHON": "python2.7",
  "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark",
  "PYTHONPATH": "/opt/cloudera/parcels/CDH/lib/spark/python/:/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.2.1-src.zip",
  "PYTHONSTARTUP": "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
 }
}

This kernel is the same on both servers and with JupyterHUB 0.5.0 everything works just fine.

I updated the Stack Overflow question - please check the code I found.

Please find attached shell.py - maybe it could help in further investigation.

shell.zip

Thank you for your time.

@takluyver
Member

That sounds like something has used binary data as a filename. I'm not sure what might be doing that - I can't think of anything in our code that would.

@minrk
Member

minrk commented Jun 16, 2016

@LMtx what happens if you leave out the PYTHONSTARTUP env? It's likely to be something in the spark startup that's doing it. Can you compare the os.environ dict when run in the version that works and the version that doesn't?
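
One way to make that comparison concrete is sketched below. It is only an illustration: the helper names `dump_environ` and `diff_environ` and the output file are made up for this example, not part of IPython or JupyterHUB.

```python
import json
import os


def dump_environ(path):
    """Write the current process environment to `path` as sorted JSON."""
    with open(path, "w") as f:
        json.dump(dict(os.environ), f, indent=1, sort_keys=True)


def diff_environ(env_a, env_b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    keys = set(env_a) | set(env_b)
    return {k: (env_a.get(k), env_b.get(k))
            for k in sorted(keys) if env_a.get(k) != env_b.get(k)}
```

Run `dump_environ` in a notebook cell on each server, copy both files to one machine, load them with `json.load`, and pass the two dicts to `diff_environ`.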

@Dom-nik

Dom-nik commented Jun 16, 2016

Hello,
I'm working with LMtx on this and I tried to run the experiment that you outlined: I removed the PYTHONSTARTUP variable from kernel.json, restarted Jupyter and printed os.environ on both instances.

The only difference I found was:
'PYTHONSTARTUP': '/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py'

If I don't remove PYTHONSTARTUP, I'm not able to run print os.environ.
The startup script is the same on both instances, as the underlying Cloudera distribution is the same.

@takluyver
Member

Does it write the same file every time, or a different one? If it's the same, try making it read-only - maybe you'll get a failure that shows what's trying to open it.
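
A minimal sketch of the read-only step, assuming the path of the stray file is already known (for example from a raw directory listing). The helper name `make_read_only` is hypothetical; it accepts a `str` or `bytes` path, which matters because the file name here is not valid text.

```python
import os
import stat

# All three write-permission bits: owner, group, other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Clear every write bit on `path` and return the resulting mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode & ~WRITE_BITS
    os.chmod(path, new_mode)
    return new_mode
```

Whatever then tries to open the file for writing should fail with a permission error, and the traceback or log line names the culprit. Note that a process running as root ignores these bits.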

@LMtx
Author

LMtx commented Jun 17, 2016

Yes, it writes the same file every time. Just once it created additional file "????????????????????????????????????????????????????????????????????????????????????????????????????????????????????e**-journal**" with content similar to "???..." file (but I could not reproduce this).

The "???...-journal" file ended with:

end timestamp, num_cmds integer, remark text)9!N¦

That line is missing in the "???..." file.

I removed all permissions from the file "???...". When I tried to start the pyspark kernel, it got stuck in a reboot loop. No additional file similar to "???..." was created.

Please find attached the tail of the jupyterhub.log file.
jupyterhub.log.tail.zip

@minrk
Member

minrk commented Jun 17, 2016

Interesting, it looks to be failing to open the IPython history database. Do you have any IPython configuration to set the history file to something in particular?

@LMtx
Author

LMtx commented Jun 17, 2016

No, we are using the default settings. The ipykernel version:

Name: ipykernel
Version: 4.3.1

@minrk
Member

minrk commented Jun 17, 2016

In that case, can you move your IPython directory to a temporary location, so that it starts fresh:

mv ~/.ipython ~/.save_ipython

and launch again?

@LMtx
Author

LMtx commented Jun 17, 2016

I moved the directory and started notebook - still the reboot loop :(

@LMtx
Author

LMtx commented Jun 17, 2016

The file name "???..." decoded to ASCII:

<80><80><90><81><81><90><81><90><81><81><81><90><81><80><80><81><90><80><90><81><80><81><81><80><80><81><90><81><90><81><80><81><9f><81><90><81><80><81><80><80><81><81><90><81><80><80><81><80><81>e^A

Maybe that will help?
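
For anyone wanting to reproduce that kind of listing without a terminal that mangles the bytes, here is a small sketch. It assumes Python 3 (for `os.fsencode`), and the helper name `raw_names` is made up for this example.

```python
import os


def raw_names(directory="."):
    """Return (raw_bytes, escaped_text) pairs for each entry in `directory`.

    Passing a bytes path to os.listdir makes it return undecoded bytes
    names, so unprintable or non-UTF-8 names are not replaced by "?".
    """
    entries = sorted(os.listdir(os.fsencode(directory)))
    return [(raw, repr(raw)) for raw in entries]
```

The escaped form shows every non-ASCII or control byte as `\x..`, which makes two stray names easy to compare byte by byte.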

@takluyver
Member

Try configuring IPython to store history in memory instead of on disk: HistoryAccessor.hist_file=':memory:'. You lose persistent history by doing this, but it avoids attempting to open a history database on disk.

@LMtx
Author

LMtx commented Jun 17, 2016

Can I reconfigure IPython only for my user? I do not want to change the configuration of JupyterHUB on the production server for everyone (at least not yet).

@takluyver
Member

Yes, run ipython profile create as your user, and then edit ~/.ipython/profile_default/ipython_config.py.
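
As a sketch, the per-user config file would then contain something like the following. This is a config fragment, not a standalone script: `get_config()` is supplied by IPython when it loads the file.

```python
# ~/.ipython/profile_default/ipython_config.py
c = get_config()  # injected by IPython at load time

# Keep the history database in memory instead of on disk.
# ':memory:' is SQLite's special in-memory name; note the colon at both ends.
c.HistoryAccessor.hist_file = ':memory:'
```

Because the file lives under the user's own `~/.ipython`, other JupyterHUB users are not affected.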

@LMtx
Author

LMtx commented Jun 17, 2016

The PySpark kernel seems to work, but it still generates strange files:

filename: <90><80><90><81><81>y<83>羳:

content (plus lots of binary characters):

<82>7tableoutput_historyoutput_history^FCREATE TABLE output_history
(session integer, line integer, output text,
PRIMARY KEY (session, line));^F^F^WO)^A^@indexsqlite_autoindex_output_history_1output_history^G<81>*^C^G^W^[^[^A<82>+tablehistoryhistory^DCREATE TABLE history (session integer, line integer, ]^A<82>mtablesessionssessions^BCREATE TABLE sessions (session integer
primary key autoincrement, start timestamp,

Any more ideas what is going on?
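
The `CREATE TABLE` fragments above suggest an SQLite file. A rough way to confirm that, sketched with only the standard library: check the 16-byte SQLite header and list the tables. The helper names are hypothetical, and the table names in `HISTORY_TABLES` are taken from the dump above, not from IPython's documentation.

```python
import sqlite3

# Every SQLite 3 database file starts with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"

# Table names visible in the file content quoted in this thread.
HISTORY_TABLES = {"sessions", "history", "output_history"}


def looks_like_sqlite(path):
    """True if the file at `path` starts with the SQLite 3 header."""
    with open(path, "rb") as f:
        return f.read(len(SQLITE_MAGIC)) == SQLITE_MAGIC


def table_names(path):
    """Return the set of table names stored in the database at `path`."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")
        return {row[0] for row in rows}
    finally:
        conn.close()
```

If `looks_like_sqlite(path)` is true and `HISTORY_TABLES <= table_names(path)`, the stray file is very likely a history database written under the wrong name. Work on a copy of the file rather than the original.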

@takluyver
Member

Not really. It looks like part of the history database, but I've no idea why it's getting written to garbage filenames. Maybe your sqlite library or the Python bindings are corrupted? That's a total guess, though. I don't think anyone's ever reported something like that.
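
One rough way to test that guess outside IPython is a round trip through the `sqlite3` module alone: create a database under a known name in an empty directory and see what actually lands on disk. This is a sketch; the function name and the file name `history_check.sqlite` are made up for the example.

```python
import os
import shutil
import sqlite3
import tempfile


def sqlite_round_trip(file_name="history_check.sqlite"):
    """Create a database called `file_name` in a fresh temporary directory
    and return the sorted directory listing seen afterwards."""
    workdir = tempfile.mkdtemp()
    try:
        conn = sqlite3.connect(os.path.join(workdir, file_name))
        conn.execute("CREATE TABLE t (x integer)")
        conn.commit()
        conn.close()
        return sorted(os.listdir(workdir))
    finally:
        # Clean up whatever was created, under whatever name.
        shutil.rmtree(workdir, ignore_errors=True)
```

On a healthy install the result is `['history_check.sqlite']`. Any other name in the listing would point at the sqlite library or its Python bindings rather than at IPython.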

@takluyver
Member

For that matter, maybe the filesystem is corrupted so that part of the data that should be in a file is showing up as a filename. Can you fsck it? Still guesswork, though.

@LMtx
Author

LMtx commented Jun 17, 2016

That could be the case, but I think one should not run fsck on a running system. Especially on production :/

@takluyver
Member

Yeah, I believe you have to unmount a filesystem to fsck it, so if it's the root fs, that means rebooting.

@LMtx
Author

LMtx commented Jun 17, 2016

What would you recommend to totally remove JupyterHUB and IPython (with all configuration files etc.)? I am going to take my chances and reinstall Jupyter - maybe it will help somehow.

@takluyver
Member

I have no reason to think that would fix it, but as I don't understand what's gone wrong, I can't rule it out.

@LMtx
Author

LMtx commented Jun 21, 2016

Quick update: neither reinstalling version 0.6.1 nor downgrading to 0.5.0 solved my issue. I am seriously considering bringing down the server to fsck the HDDs.

Any additional thoughts on this subject are very welcome.

@takluyver
Member

The version of Jupyterhub almost certainly won't affect it, because the history database is written by IPython. But I don't think a different version of IPython is likely to fix it either.

@LMtx
Author

LMtx commented Jun 22, 2016

We have tried reinstalling IPython as well. The problem occurs not only with the PySpark kernel but also with the Python2.7 kernel. We are now verifying whether Python2.7 was compiled in the same way on both servers - we used the same shell commands, but maybe the packages installed on the two servers differ in a way that affected the compilation.
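
A sketch of how two builds could be compared from inside Python itself, using only the standard library. The function name `build_fingerprint` and the choice of fields are assumptions about what is worth comparing, not a known cause of this issue; `sys.maxunicode` is included because Python 2.7 can be built with either narrow or wide Unicode.

```python
import sqlite3
import sys
import sysconfig


def build_fingerprint():
    """Collect details that can differ between two builds of Python."""
    return {
        "python_version": sys.version,
        "executable": sys.executable,
        # 65535 on a narrow (UCS-2) build, 1114111 on a wide build.
        "max_unicode": sys.maxunicode,
        # Arguments passed to ./configure; may be None on some platforms.
        "configure_args": sysconfig.get_config_var("CONFIG_ARGS"),
        # Version of the SQLite C library the interpreter is linked against.
        "sqlite_library_version": sqlite3.sqlite_version,
        # Location of the compiled extension, if it is a separate file.
        "sqlite_module_file": getattr(
            sys.modules.get("_sqlite3"), "__file__", None),
    }
```

Printing this dict on both servers and comparing the values field by field narrows down where the two installations diverge.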

@LMtx
Author

LMtx commented Jun 22, 2016

It turned out that the compilation of Python2.7 on one server was somehow corrupted - copying the whole installation folder from one machine to the other solved our issue. I guess there are some differences in the installed rpm packages on the two machines - we are going to investigate this, but the most urgent problem is fixed.

Thank you for your time @takluyver and @minrk

@LMtx LMtx closed this as completed Jun 22, 2016
@takluyver
Member

Great, thanks for letting us know.

@LMtx
Author

LMtx commented Jul 14, 2016

Hello again,

It looks like IPython (python2.7 kernel) generates the "???..." files when SQLite3 is installed on the box. Did you encounter any issues regarding this version of SQLite?

@LMtx LMtx reopened this Jul 14, 2016
@takluyver
Member

SQLite3 has been the standard version of sqlite for many years, as far as I know. I have 3.11.0 and 3.13.0 in different Pythons on my machine. I have never seen an issue similar to this with any version of sqlite.

@LMtx LMtx closed this as completed Mar 5, 2019