
To be continued: When I run Wide and DeepTab, I get stuck #70

Closed
zhang-HZAU opened this issue Dec 11, 2021 · 9 comments

@zhang-HZAU

Summary: In the original issue I mentioned that, following the "predicting adult salary levels" example, I changed the input classification categories and a "RuntimeWarning" appeared; when splitting a validation set from the training set I also hit an index-out-of-range error. The original issue is here: #68

@zhang-HZAU
Author

Thank you for your answers. Following your suggestion, I changed the labels to start from 0 and added "pred_dim=5" to the "Wide" and "WideDeep" constructors, but the problem arose again, that is, my problem was not solved, sorry. Previously I guessed that the problem might be caused by missing data, so I randomly deleted part of the data in the original "predicting adult salary level" dataset (so that the input contained missing values) and found that the model still runs normally in that case. So I am now guessing that the large number of 0 values in my data is what causes problem one.
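A minimal sketch of these two adjustments (illustrative only; wide_dim=100 is a placeholder value, not taken from the actual data):

import numpy as np
from pytorch_widedeep.models import Wide

# 1) re-code the labels so they start at 0 (i.e. run 0..4 instead of 1..5)
y = np.array([1, 2, 3, 4, 5, 3, 2])
y = y - y.min()

# 2) match pred_dim to the number of classes; the same value is also passed
#    when the components are combined, e.g. WideDeep(..., pred_dim=5)
wide = Wide(wide_dim=100, pred_dim=5)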
In fact, I had previously tried to fill in the missing data with matrix factorization, but the imputation quality was not very good, so I tried using the unfilled data as input.
Sorry, for some reason I can only provide a subset of the data. Column 0 is the label; the rest are indicators.
sub_data.xlsx

@jrzaurin
Owner

jrzaurin commented Dec 11, 2021

Hi there @zhang-HZAU.

So, it is impossible for me to debug or solve an issue without reproducible code (unless it is relatively simple), which means I need to see some code AND a dataset that can pass through the model, so I can see what the problem is.

Unfortunately, the 5 rows of data you sent me in sub_data.xlsx are not enough for me to debug. Therefore I decided to do the following. I see that your dataset is a mix of categorical and numerical/continuous cols, and I also see that you might be confused as to how to prepare some of them and pass them to the model (you are passing numerical columns as wide columns, while these have to be categorical).

With that in mind, I built a toy example using sklearn's datasets that I hope is close enough to your problem so that you can adapt the code:

import numpy as np
import pandas as pd
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.models import TabMlp, Wide, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor, WidePreprocessor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


# manually adjusting the dataset so that it has categorical cols
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_classes=3)
df = pd.DataFrame(X, columns=["_".join(["col", str(i)]) for i in range(X.shape[1])])
for c in df.columns[10:]:
    df[c] = df[c].astype(int)
df["target"] = y

# train test split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.target)

# col 0 to 9 are continuous
continuous_cols = df_train.columns.tolist()[:10]

# cols 10 to 19 are categorical (well, integers treated as categories)
# except for the target. Let's use 5 of them as wide cols (the number 5 is
# arbitrary; the choice of wide and crossed columns depends very much on
# your knowledge of the problem and the data) and all of them as
# categorical embeddings
wide_cols = df_train.columns.tolist()[10:15]
# we will "cross" two (random) categorical colums. NOTE: crossed columns MUST
# BE CATEGORICAL
cross_cols = [("col_12", "col_15")]
cat_embed_cols = [c for c in df_train.columns.tolist()[10:] if c != "target"]

# target
target = df_train["target"].values

# have a look at the documentation and the attributes of the preprocessors
# to understand what these do
wide_preproc = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preproc.fit_transform(df_train)

tab_preproc = TabPreprocessor(
    embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
)
X_tab = tab_preproc.fit_transform(df_train)

# the wide/linear component is connected DIRECTLY to the output neuron(s).
# Therefore, it needs to know the size of the final prediction layer
# beforehand, via the pred_dim param
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=3)

# the remaining components of the model are combined via the WideDeep
# class, and therefore we ONLY need to specify the pred_dim when we build
# the WideDeep class
tab_mlp = TabMlp(
    column_idx=tab_preproc.column_idx,
    embed_input=tab_preproc.embeddings_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[64, 32],
    mlp_dropout=[0.2, 0.2],
)
model = WideDeep(wide=wide, deeptabular=tab_mlp, pred_dim=3)

trainer = Trainer(model, objective="multiclass", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=1,
    batch_size=32,
    val_split=0.2,
)

X_wide_test = wide_preproc.transform(df_test)
X_tab_test = tab_preproc.transform(df_test)
preds = trainer.predict(X_wide=X_wide_test, X_tab=X_tab_test)

Hope this helps. If you need more help, the best thing to do is to join the slack group, and/or maybe we could have a zoom call.

Let me know if it helps

Cheers
J.

@zhang-HZAU
Author

Hi jrzaurin,
Sorry, I haven't been able to provide feedback sooner because I have been dealing with other things recently.
After reading your example, I realized that there is indeed a problem with the categorical data. I have now processed the data as follows:
① The missing data is filled with the matrix factorization (Funk_SVD) algorithm (the filling process still needs to be optimized; for now I only need to get the prediction pipeline running);
② For the categorical data, I originally intended to pass the category labels (0, 1, 2, 3...) as categorical data, but a previous mistake left this part of the data as "float". I now convert these columns to integers and then to "str" before using them as input (again, for now I only want to get the process running); a minimal sketch of this conversion is shown below.
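The sketch below is illustrative only; "cat_col" is a hypothetical column name, not one of the actual columns:

import pandas as pd

df = pd.DataFrame({"cat_col": [0.0, 1.0, 2.0, 1.0]})
# float -> int -> str, so the column is treated as categorical downstream
df["cat_col"] = df["cat_col"].astype(int).astype(str)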
Result:
The filled data solves the first problem mentioned above, but the second problem still exists: when a validation set is split from the training set in the following code, the error reappears unless I comment out the "val_split" parameter. I hope you can suggest a solution, thank you.

trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=150,
    batch_size=16,
    val_split=0.1,
)
In addition, I have packaged the original input data and the notebook code into an attachment; the detailed runtime information is retained in the notebook. Regarding the slack link you provided, I cannot join due to network restrictions, but I will keep trying.

issue_sum.zip

@jrzaurin
Owner

Hey @zhang-HZAU

sorry, long week at work, I will have a look at the file this week 🙂

@zhang-HZAU
Author

Hi jrzaurin,
I updated the model's input data (see the attachment for the new data and code), namely "wide_cols" and "cross_cols", but the problem with splitting a validation set from the training set still exists.
In the current setup I can obtain the training loss and accuracy through the "History" callback and visualize the training process, as follows:

import matplotlib.pyplot as plt


plt.plot(range(len(trainer.history['train_acc'])), trainer.history['train_acc'])
plt.xlabel("epochs")
plt.ylabel("train_acc")
plt.title('Model_train_acc')
plt.show()

However, I did not find a way to obtain the loss and accuracy of the prediction step from the "predict" function. How should I get these data?
Looking forward to your answer, thank you.
issue_feedback.zip

@jrzaurin
Owner

jrzaurin commented Dec 23, 2021

Hey @zhang-HZAU, I will have a look at the notebook later and check that val_split problem. In the meantime, regarding the loss and accuracy from the predict method, here is the thing: the predict method in the Trainer does not take a target variable, as it is supposed to be used with test data where you normally don't have the target. Note this is common in most libraries (e.g. sklearn).

If you have a specific test dataset where you know the target and you want to get the metric and loss, you have to do it externally, i.e. run predict and then compute the losses and the metrics, such as:

from sklearn.metrics import accuracy_score

# using the test arrays from the toy example above
y_pred = trainer.predict(X_wide=X_wide_test, X_tab=X_tab_test)
acc = accuracy_score(df_test["target"].values, y_pred)

Same with the loss. Just bear in mind that for the loss you would need probabilities, not actual classes, for which you should use the trainer's predict_proba method.
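For example, a minimal sketch along those lines (again using the toy example's test split; log_loss is just one possible choice of loss):

from sklearn.metrics import log_loss

# probabilities, not classes, are needed for the loss
y_proba = trainer.predict_proba(X_wide=X_wide_test, X_tab=X_tab_test)
loss = log_loss(df_test["target"].values, y_proba)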

Hope this helps

@jrzaurin
Owner

jrzaurin commented Dec 23, 2021

Hi @zhang-HZAU

So I looked into your notebook and, simply, the issue you are facing is that while you define X_tab with the train dataset:

...
X_tab = tab_preprocessor.fit_transform(df_train)
...

you then define the target with the entire dataset:

...
target = df["index_0"].values
...

Simply change that to:

target = df_train["index_0"].values

and at least in my case, it runs

:)

Let me know if this fixes the issue
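As a quick sanity check (not from the original reply), X_tab and target must have the same number of rows, otherwise val_split ends up indexing beyond the training arrays:

# both arrays must come from the same split (df_train here)
assert X_tab.shape[0] == target.shape[0], (X_tab.shape[0], target.shape[0])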

@zhang-HZAU
Author

Hi jrzaurin,
Indeed, I followed your suggestion and the dataset-splitting error disappeared, sorry. Thank you again for your patient answers.

@jrzaurin
Owner

Hi @zhang-HZAU

No problem, thank you for using the library 🙂
