Joblib import error after a run of failed gridsearch cv in Azure VM #354

Closed
krishnateja614 opened this issue May 24, 2016 · 9 comments · Fixed by #355

@krishnateja614 krishnateja614 commented May 24, 2016

Hi, I am running GridSearchCV on a 16-core Azure VM. The same code works fine on my PC, but on the VM I get the error below. With verbose=10 I can see that all the folds are fitted; the error is raised after the last CV fold finishes but before "best_score_" and "best_estimator_" are computed. The traceback is below. (I get the same error with n_jobs=-1 or with any number less than 16; refit is True.)

[CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False
[CV] ml__C=0.5, ml__penalty=l1, ml__tol=0.01, ml__max_iter=300, ml__warm_start=False, score=0.620114 - 47.6s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in <module>
    model.fit(x_train, y_train)
  File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 818, in __call__
    self._terminate_pool()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 549, in _terminate_pool
    self._pool.terminate()  # terminate does a join()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 583, in terminate
    super(MemmapingPool, self).terminate()
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 465, in terminate
    self._terminate()
  File "C:\Anaconda2\lib\multiprocessing\util.py", line 207, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 513, in _terminate_pool
    p.terminate()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 137, in terminate
    self._popen.terminate()
  File "C:\Anaconda2\lib\multiprocessing\forking.py", line 312, in terminate
    _subprocess.TerminateProcess(int(self._handle), TERMINATE)
WindowsError: [Error 5] Access is denied

Also, on the very next run of the same code, I get the following joblib error:
Fitting 4 folds for each of 3 candidates, totalling 12 fits

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Preston/Documents/Python Scripts/ml_cpt_cv.py", line 111, in <module>
    model.fit(x_train, y_train)
  File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 766, in __call__
    n_jobs = self._initialize_pool()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 515, in _initialize_pool
    raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information.

This is the code:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, precision_score, recall_score, precision_recall_curve, accuracy_score,confusion_matrix
import sklearn.grid_search
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import pipeline,metrics, grid_search

if name=="main":
    train_data=pd.read_csv("train.csv",delimiter=",")
    test_data=pd.read_csv("test.csv",delimiter=",")
    n_cv=int(raw_input("Please tell us how many fold cross validation you would like to perform,please give a value between 3-10"))
    assert (n_cv>=3) and (n_cv<=10), "Looks like the number of folds is not in between 3 and 10"
    #scl = StandardScaler()
    y_train=train_data["class"]
    y_test=test_data["class"]
    x_train=train_data
    x_test=test_data
    #x_test=list(x_test)
    y_train=list(y_train)
    y_train=[1 if i==True else 0 for i in y_train]

    y_test=list(y_test)
    y_test=[1 if i==True else 0 for i in y_test]

    type_of_ml_model=raw_input("Please input rf for random forests, lr for logistic regression, svm for support vector machines and gbdt for gradient boosting")
    if type_of_ml_model=="rf":
        ml_model=RandomForestClassifier(random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {
            'ml__n_estimators':[100,500],
            'ml__max_features':["auto",None,"log2"],
            'ml__max_depth':[5,4,3],
            'ml__min_samples_split':[1,2],
            'ml__oob_score':[True,False],
            'ml__class_weight':["balanced","balanced_subsample"]
        }

    elif type_of_ml_model=="lr":
        ml_model=LogisticRegression(class_weight="balanced",random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {#'ml__fit_intercept':[True],'ml__intercept_scaling':[1,2],
                      'ml__tol':[0.01],'ml__max_iter':[300],'ml__warm_start':[False],
                      'ml__C': [0.5,0.2,0.6],
                      'ml__penalty':["l1"]}

    elif type_of_ml_model=="svm":
        ml_model=LinearSVC(class_weight="balanced",random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {'ml__loss':["squared_hinge"],'ml__dual':[False],'ml__fit_intercept':[True],
                      'ml__intercept_scaling':[2,3],
                      'ml__tol':[0.001,0.0001],'ml__max_iter':[200],
                      'ml__C': [1,0.8],
                      'ml__penalty':["l1"]}

    elif type_of_ml_model=="gbdt":
        ml_model=GradientBoostingClassifier(random_state=5)
        clf = pipeline.Pipeline([('ml', ml_model)])
        param_grid = {'ml__loss':["deviance"],
                      'ml__n_estimators':[100,200],
                      'ml__max_features':["auto",None],
                      'ml__max_depth':[4,3],
                      'ml__min_samples_split':[1,2],
                      'ml__subsample':[1.0],'ml__warm_start':[False],
                      'ml__min_samples_leaf':[1,2]
                      }

    precision_scorer = metrics.make_scorer(precision_score, greater_is_better = True)

    # Initialize Grid Search Model
    model = grid_search.GridSearchCV(estimator = clf, param_grid=param_grid, scoring=precision_scorer,
                                     verbose=10,n_jobs=-1, iid=True, refit=True, cv=n_cv)

    # Fit Grid Search Model
    model.fit(x_train, y_train)

Is there something in the parameters that we need to change when running on a VM?
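
As a sanity check on the log line "Fitting 4 folds for each of 3 candidates, totalling 12 fits", here is a small sketch of how that count follows from the grid. It assumes the "lr" branch was selected with 4 folds, which is inferred from the parameters shown in the log rather than stated in the post; `count_fits` is a hypothetical helper written for this note, not part of scikit-learn.

```python
def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of fits) for a full grid."""
    n_candidates = 1
    for values in param_grid.values():
        # A full grid search tries every combination, so the counts multiply.
        n_candidates *= len(values)
    return n_candidates, n_candidates * n_folds


# Grid copied from the "lr" branch of the script above.
lr_grid = {
    "ml__tol": [0.01],
    "ml__max_iter": [300],
    "ml__warm_start": [False],
    "ml__C": [0.5, 0.2, 0.6],
    "ml__penalty": ["l1"],
}

n_candidates, n_fits = count_fits(lr_grid, n_folds=4)
print(n_candidates, n_fits)  # 3 12
```

Only `ml__C` has more than one value, so the grid has 3 candidates, and 4 folds each gives 12 fits.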

@ogrisel ogrisel commented May 25, 2016

The correct syntax is:

if __name__ == "__main__":
    # put code here

instead of:

if name=="main":
    # ....
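
To show the guard in a complete script, here is a minimal Python 3 sketch that uses the standard-library `multiprocessing.Pool` as a stand-in for the joblib pool used by scikit-learn. `square` and `run_parallel` are hypothetical names for illustration only, and the toy function merely stands in for fitting one cross-validation fold.

```python
import multiprocessing


def square(x):
    """Toy stand-in for the work done on one cross-validation fold."""
    return x * x


def run_parallel(values, n_workers=2):
    """Map `square` over `values` in a pool of worker processes."""
    with multiprocessing.Pool(processes=n_workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # On platforms that start workers by spawning a fresh interpreter
    # (Windows), every worker re-imports this file. The guard keeps the
    # workers from trying to start a pool of their own during that import.
    print(run_parallel([1, 2, 3]))  # [1, 4, 9]
```

Everything that starts worker processes sits under the guard, while the functions the workers need stay at module level so they can be imported.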

@ogrisel ogrisel commented May 25, 2016

The original WindowsError problem, which occurs in multiprocessing at:

File "C:\Anaconda2\lib\multiprocessing\forking.py", line 312, in terminate

is really weird: the master process does not seem to have permission to kill its own child worker processes.

Maybe you could try to use Python 3.5 (you can install it easily with Anaconda). The multiprocessing module of Python has been improved a lot between Python 2.7 and Python 3.5. I am not sure this will help solve this specific issue though.

@ogrisel ogrisel commented May 25, 2016

I think I found the cause of the issue, as documented in scikit-learn/scikit-learn#4016 (comment). The problem should indeed be less likely to happen under Python 3.5 than under Python 2.7.

@krishnateja614 krishnateja614 commented May 25, 2016

I'm sorry, there must have been a formatting issue, but the actual code did use
if __name__=="__main__". Sorry, I hadn't noticed that it was pasted incorrectly.

@krishnateja614 krishnateja614 commented May 25, 2016

Do you think I could prevent this by just changing the timeout value in Python 2.7's multiprocessing/forking.py at line 312 from 0.1 to 1.0? I have been working in 2.7 for a long time and don't want to switch to 3.5 until all other avenues have been tried. I'll make this change and let you know. Thanks a lot.
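
For readers following along, the timeout discussed here appears to be a grace period: when the terminate call fails, the parent waits that long for the worker to exit on its own before giving up. Below is a minimal Python 3 sketch of that retry-with-grace-period pattern. It is not the `multiprocessing` source; `terminate_with_grace`, its parameters, and the callables passed to it are hypothetical names chosen for illustration.

```python
import time


def terminate_with_grace(terminate, is_alive, grace=1.0, poll=0.05):
    """Call terminate(); on OSError, wait up to `grace` seconds for exit.

    If the process is already shutting down, the OS may refuse the
    terminate request. In that case, poll is_alive() until the process is
    gone, and re-raise the original error only if it is still alive once
    the grace period has elapsed.
    """
    try:
        terminate()
    except OSError:
        deadline = time.monotonic() + grace
        while is_alive():
            if time.monotonic() >= deadline:
                raise  # still alive after the grace period: a real failure
            time.sleep(poll)
```

With a real `multiprocessing.Process` named `proc`, this could be called as `terminate_with_grace(proc.terminate, proc.is_alive)`. A longer grace period makes the race less likely to surface as an error, but it does not remove the race.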

@krishnateja614 krishnateja614 commented May 25, 2016

Changing the timeout from 0.1 to 1.0 seems to work, at least for a small number of CV fits (<50). I'll close this issue and reopen it if I run into the same problem with a large number of CV fits (>10000). Thanks a lot for your help @ogrisel

@ogrisel ogrisel commented May 26, 2016

Do as you wish, but maintaining a patched version of Python 2.7 on your own is certainly more hassle than writing code that works with Python 3 directly.

Have you tried to run your code with Anaconda? For most numerical analysis / machine learning stuff you don't have to change anything besides writing print "stuff" as print("stuff"), which works in both Python 2 and Python 3.

At some point in the future, people like us will stop maintaining support for Python 2.7 in their libraries. You'd better switch your code to Python 3 (or at least make it compatible with both Python 2 and Python 3, which is generally very easy). Otherwise you will accumulate a lot of technical debt and be very frustrated when projects stop supporting Python 2.
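
As a rough illustration of how small the changes can be, here is a sketch of the input-handling part of the script written so that it runs on both Python 2 and Python 3. `read_line` and `parse_fold_count` are hypothetical names introduced for this example; the 3 to 10 range comes from the assertion in the original script.

```python
from __future__ import print_function  # enables print() on Python 2, no-op on Python 3

try:
    read_line = raw_input  # Python 2: raw_input returns the typed text
except NameError:
    read_line = input  # Python 3: input does the same job


def parse_fold_count(text):
    """Convert user input to a fold count and enforce the 3-10 range."""
    n_cv = int(text)
    if not 3 <= n_cv <= 10:
        raise ValueError("number of folds must be between 3 and 10")
    return n_cv


# In the real script the text would come from read_line("..."); a fixed
# string is used here so that the example runs without waiting for input.
print("Using", parse_fold_count("4"), "folds")
```

Raising `ValueError` instead of using `assert` is a design choice made here so that the check still runs when Python is started with `-O`.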

@krishnateja614 krishnateja614 commented May 26, 2016

This was on Anaconda. I am slowly switching, but for my current job I'm using Python 2.7. I'm starting with 3.5 independently, though. Thanks.
