
model_selection.learning_curve breaks on some datasets #1705

Closed
stan29308 opened this issue Feb 11, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@stan29308

Describe the bug
When generating a learning curve on the Adult Income dataset with KNN or SVC, model_selection.learning_curve returns many NaN scores, depending on the dataset size. This does not happen on an unpatched version of sklearn, nor on a patched version of sklearn with the Wine dataset.

To Reproduce

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearnex import patch_sklearn, unpatch_sklearn
from sklearn.model_selection import learning_curve

patch_sklearn()

headers = ['age', 'work_class', 'final_weight', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
some_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', sep=',', names=headers)
more_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', sep=',', names=headers, header=None, skiprows=1)
data = pd.concat([some_data, more_data], ignore_index=True)

# Start cleaning

# Shuffle the rows
data = data.sample(frac=1, random_state=0)

# Trim all whitespace
data_obj = data.select_dtypes(['object'])
data[data_obj.columns] = data_obj.apply(lambda x: x.str.strip())
data['income'] = data['income'].str.strip()

# This is technically metadata
data = data.drop(columns='final_weight')

# Covered by education-num
data = data.drop(columns='education')

# Non-US native_country rows are 19.3% >= 50k and make up 3930 rows
data = data.drop(data[data['native_country'] != 'United-States'].index)
data = data.drop(columns='native_country')

# Drop any row with ?
data = data.replace('?', np.nan)
data = data.dropna()

# # Drop capital gain and loss
# data = data.drop(columns='capital_gain')
# data = data.drop(columns='capital_loss')

# Simplify marital status (Widowed and Never-married already have the desired labels)
data['marital_status'] = data['marital_status'].replace(['Married-civ-spouse', 'Married-AF-spouse'], 'Married')
data['marital_status'] = data['marital_status'].replace(['Married-spouse-absent', 'Separated', 'Divorced'], 'Separated')

# Make income a binary value
data['income'] = data['income'].map({ '>50K.': 1, '>50K': 1, '<=50K.': 0, '<=50K': 0 })
data['sex'] = data['sex'].map({ 'Male': 0, 'Female': 1 })

# One hot encode everything
data = pd.get_dummies(data, columns=data.select_dtypes(['object']).columns)

# Split data
np.random.seed(0)
train, validate, test = np.split(data, [int(0.6 * len(data)), int(0.8 * len(data))])

X_train = train.drop(columns='income')
X_validate = validate.drop(columns='income')
X_test = test.drop(columns='income')
X = data.drop(columns='income')

y_train = train['income']
y_validate = validate['income']
y_test = test['income']
y = data['income']

# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier


# Create learning curve for K-Nearest Neighbors Classifier
estimator = KNeighborsClassifier(n_neighbors=5)
train_size_fractions = np.linspace(0.1, 1, 10)
train_sizes, train_scores, test_scores = learning_curve(estimator, X.tail(5000), y.tail(5000), train_sizes=train_size_fractions, cv=5)

fig, ax = plt.subplots()
fig.suptitle("Learning Curve for KNN Classifier (Income)")
fig.supylabel("accuracy")
fig.set_tight_layout(True)

ax.set_xlabel("training examples")
ax.plot(train_sizes, np.mean(train_scores, axis=1), marker="o", label="train", ms=2)
ax.plot(train_sizes, np.mean(test_scores, axis=1), marker="o", label="test", ms=2)
ax.legend()
plt.show()
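For context on how a learning curve can come back full of NaNs rather than raising: learning_curve defaults to error_score=np.nan, so any CV fold where the estimator raises inside fit is recorded as NaN and only a FitFailedWarning is emitted. A minimal sketch of that mechanism, using a made-up SometimesFails estimator (invented for illustration, not part of this report):

```python
# Sketch of why learning_curve can return NaNs instead of raising:
# error_score defaults to np.nan, so any fold where fit() raises is
# recorded as NaN. SometimesFails is a toy estimator for illustration.
import warnings
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve

class SometimesFails(BaseEstimator, ClassifierMixin):
    """Toy classifier that refuses to fit on larger training sets."""
    def fit(self, X, y):
        if len(X) > 500:
            raise ValueError("simulated backend failure")
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.classes_[0])

X, y = make_classification(n_samples=1000, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide FitFailedWarning in the demo
    sizes, train_scores, test_scores = learning_curve(
        SometimesFails(), X, y, train_sizes=np.linspace(0.1, 1, 5), cv=5)
# Folds whose training subset exceeded 500 rows come back as NaN.
print(sizes, np.isnan(test_scores).any(axis=1))
```

So "many nan depending on dataset size" is consistent with the patched estimator failing to fit above some training-set size, with the failures silently converted to NaN scores.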

Expected behavior
The learning curve completes without NaN scores, matching the behavior of unpatched sklearn on the same data.

Environment:

  • OS: Ubuntu 22.04
  • Compiler: gcc 11.4.0
  • Version: 2024.1.0
@stan29308 stan29308 added the bug Something isn't working label Feb 11, 2024
@stan29308
Author

Just tried the repro code in Colab and it seems to work; sorry for the noise.
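For anyone who hits this again, a sketch of how to isolate whether the sklearnex patch itself is responsible (guarded with try/except so the same snippet also runs where scikit-learn-intelex is not installed; the dataset and estimator here are stand-ins, not the original repro):

```python
# Sketch: run the same learning_curve call with and without the sklearnex
# patch to see whether the patch introduces the NaNs. Note sklearn modules
# are imported after patch_sklearn() so the patched classes are picked up.
import numpy as np

try:
    from sklearnex import patch_sklearn, unpatch_sklearn
    patch_sklearn()
    patched = True
except ImportError:
    patched = False

from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, random_state=0)
_, _, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.1, 1, 5), cv=5)
print("patched:", patched, "| NaN scores:", int(np.isnan(test_scores).sum()))

if patched:
    unpatch_sklearn()  # restore stock sklearn for the comparison run
```

If the NaN count is nonzero only on the patched run, that points at the accelerated estimator rather than the data preparation.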

@stan29308 closed this as not planned (can't repro) on Feb 11, 2024