To be continued: When I run Wide and DeepTab, I got stuck #70
Thank you for your answers. Following your suggestion, I changed the labels to start from 0 and added `pred_dim=5` to the `Wide` and `WideDeep` constructors, but the problem arose again, i.e. my problem was not solved, sorry. Previously, I guessed that the problem might be caused by missing data. I randomly deleted part of the data in the original "predicting adult salary level" dataset, so that there were missing values in the input data, and found that the model still ran normally in that case. So I am now wondering whether the large number of 0 values in my data is causing problem one.
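For context on the "labels must start from 0" suggestion above: multiclass losses expect targets encoded as a contiguous `0..n_classes-1` range. A minimal NumPy sketch of that remapping (the label values here are toy values, not the actual dataset):

```python
import numpy as np

# Labels drawn from an arbitrary set, e.g. salary levels coded 1..5.
raw = np.array([1, 2, 3, 5, 2, 5])

# np.unique returns the sorted unique values AND, with return_inverse=True,
# the index of each element within that sorted array -- exactly the 0..n-1
# encoding that a multiclass loss expects.
classes, remapped = np.unique(raw, return_inverse=True)

print(classes)   # [1 2 3 5]
print(remapped)  # [0 1 2 3 1 3]
```

`classes[remapped]` recovers the original labels, so the mapping is lossless and can be inverted after prediction.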
Hi there @zhang-HZAU. It is impossible for me to debug or solve an issue without reproducible code (unless it is relatively simple), which means I need to see some code AND a dataset that can pass through the model so I can see what the problem is. Unfortunately, the 5 rows of data you sent me are not enough. With that in mind, I built a toy example using `make_classification`:

```python
import numpy as np
import pandas as pd
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.models import TabMlp, Wide, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor, WidePreprocessor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# manually adjust the dataset so that it has categorical cols
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_classes=3)
df = pd.DataFrame(X, columns=["_".join(["col", str(i)]) for i in range(X.shape[1])])
for c in df.columns[10:]:
    df[c] = df[c].astype(int)
df["target"] = y

# train/test split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.target)

# cols 0 to 9 are continuous
continuous_cols = df_train.columns.tolist()[:10]

# cols 10 to 19 are categorical (well, integers treated as categories) except
# for the target. Let's use 5 of them as wide cols (the number 5 is arbitrary
# here; how many columns to use and cross depends very much on your knowledge
# of the problem and the data) and all of them as categorical embeddings
wide_cols = df_train.columns.tolist()[10:15]

# we will "cross" two (random) categorical columns. NOTE: crossed columns MUST
# BE CATEGORICAL
cross_cols = [("col_12", "col_15")]
cat_embed_cols = [c for c in df_train.columns.tolist()[10:] if c != "target"]

# target
target = df_train["target"].values

# have a look at the documentation and the attributes of the preprocessors to
# understand what these do
wide_preproc = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preproc.fit_transform(df_train)
tab_preproc = TabPreprocessor(
    embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
)
X_tab = tab_preproc.fit_transform(df_train)

# the wide/linear component is connected DIRECTLY to the output neuron(s).
# Therefore, it needs to know the size of the final prediction layer
# beforehand, via the pred_dim param
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=3)

# the remaining components/modules of the model are combined via the WideDeep
# class, and therefore we ONLY need to specify pred_dim when we build the
# WideDeep class
tab_mlp = TabMlp(
    column_idx=tab_preproc.column_idx,
    embed_input=tab_preproc.embeddings_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[64, 32],
    mlp_dropout=[0.2, 0.2],
)
model = WideDeep(wide=wide, deeptabular=tab_mlp, pred_dim=3)

trainer = Trainer(model, objective="multiclass", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=1,
    batch_size=32,
    val_split=0.2,
)

X_wide_test = wide_preproc.transform(df_test)
X_tab_test = tab_preproc.transform(df_test)
preds = trainer.predict(X_wide=X_wide_test, X_tab=X_tab_test)
```

Hope this helps. If you need more help, the best thing to do is to join the Slack group, and/or maybe we could have a Zoom call. Let me know if it helps. Cheers
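As a side note on the crossed columns in the snippet above, the idea behind crossing and the shared wide vocabulary (what `wide_dim` counts) can be sketched without the library. This is an illustration of the concept with toy data, not `WidePreprocessor`'s actual implementation:

```python
import pandas as pd

# Toy frame with two categorical columns (hypothetical data).
df = pd.DataFrame({
    "col_12": ["a", "a", "b", "b"],
    "col_15": ["x", "y", "x", "y"],
})

# A "crossed" column is just the pairwise combination of two categorical
# columns, treated as one new categorical feature.
df["col_12_col_15"] = df["col_12"] + "-" + df["col_15"]

# The wide/linear component then label-encodes every category of every wide
# column (originals plus crosses) into one shared vocabulary; the size of
# that vocabulary plays the role of wide_dim.
vocab = sorted(set(df["col_12"]) | set(df["col_15"]) | set(df["col_12_col_15"]))
print(len(vocab))  # a, b, x, y plus 4 crosses = 8
```

This is also why crossed columns must be categorical: crossing continuous values would create a near-unique category per row.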
Hi @jrzaurin,
Hey @zhang-HZAU, sorry, long week at work. I will have a look at the file this week 🙂
Hi @jrzaurin,
Similarly, I did not find a way to obtain the loss and accuracy of the prediction step from the `predict` function. How should I get these values?
Hey @zhang-HZAU I will have a look at the notebook later and check that val split problem. In the meantime, regarding the loss and accuracy: the `predict` method does not compute those for you. If you have a specific test dataset where you know the target and you want to get the metric and loss, you have to do it externally, i.e. run predict and then compute the losses and the metrics, such as:

```python
from sklearn.metrics import accuracy_score

y_pred = trainer.predict(X_test)
acc = accuracy_score(y_true, y_pred)
```

Same with the loss. Just bear in mind that for the loss you would need probabilities, not actual classes, for which you should use the trainer's `predict_proba` method.

Hope this helps
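The loss computation described above can be sketched with plain NumPy. The probabilities below are hypothetical stand-ins for what a `predict_proba`-style call would return, and the manual formula mirrors what `sklearn.metrics.log_loss` computes:

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples over 3 classes.
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 2, 0])

# Log loss: mean negative log-probability assigned to the true class.
loss = -np.mean(np.log(y_proba[np.arange(len(y_true)), y_true]))

# Hard class predictions, for accuracy-style metrics.
y_pred = y_proba.argmax(axis=1)
```

Note the asymmetry: accuracy only needs `y_pred`, while the loss needs the full probability rows, which is why hard class output alone is not enough to recover it.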
Hi @zhang-HZAU so I looked into your notebook. Simply, the issue you are facing is that while you define `X_tab` with the train dataset:

```python
X_tab = tab_preprocessor.fit_transform(df_train)
```

...you then define the target with the entire set:

```python
target = df["index_0"].values
```

Simply change that to:

```python
target = df_train["index_0"].values
```

and, at least in my case, it runs :) Let me know if this fixes the issue
Hi @jrzaurin,
Hi @zhang-HZAU, no problem, thank you for using the library 🙂
Summary: the original question described how, while following the "Predicting Adult Salary Levels" example project, I changed the input classification categories and a RuntimeWarning appeared; and when splitting the validation set from the training set I ran into an index-out-of-range problem. The original issue is here: #68