
To be continued: When I run Wide and DeepTab, I get stuck #70

Closed
zhang-HZAU opened this issue Dec 11, 2021 · 9 comments

@zhang-HZAU

Summary: In the original issue I mentioned that, following the "predicting adult salary levels" example, I changed the input classification categories and a "RuntimeWarning" appeared; when splitting a validation set from the training set I also hit an index-out-of-range error. The original issue is here: #68

@zhang-HZAU
Author

Thank you for your answers. Following your suggestion, I changed the labels to start from 0 and added "pred_dim=5" to the "Wide" and "WideDeep" constructors, but the problem arose again, that is, my problem was not solved, sorry. Previously I guessed that the problem might be caused by missing data, so I randomly deleted part of the data in the original "predicting adult salary level" dataset (so that the input contained missing values) and found that the model still runs normally in that case. So I am now guessing that the large number of 0 values in my data is what causes problem one.
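A minimal sketch of these two adjustments (illustrative only; wide_dim=100 is a placeholder value, not taken from the actual data):

import numpy as np
from pytorch_widedeep.models import Wide

# 1) re-code the labels so they start at 0 (i.e. run 0..4 instead of 1..5)
y = np.array([1, 2, 3, 4, 5, 3, 2])
y = y - y.min()

# 2) match pred_dim to the number of classes; the same value is also passed
#    when the components are combined, e.g. WideDeep(..., pred_dim=5)
wide = Wide(wide_dim=100, pred_dim=5)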
In fact, I had previously tried to fill in the missing data with matrix factorization, but the imputation quality was not very good, so I tried using the unfilled data as input.
Sorry, for some reason I can only provide a subset of the data. Column 0 is the label; the rest are indicators.
sub_data.xlsx

@jrzaurin
Owner

jrzaurin commented Dec 11, 2021

Hi there @zhang-HZAU.

So, it is impossible for me to debug or solve an issue without reproducible code (unless it is relatively simple), which means I need to see some code AND a dataset that can pass through the model, so I can see what the problem is.

Unfortunately, the 5 rows of data you sent me in sub_data.xlsx are not enough for me to debug. Therefore I decided to do the following. I see that your dataset is a mix of categorical and numerical/continuous cols, and I also see that you might be confused as to how to prepare some of them and pass them to the model (you are passing numerical columns as wide columns, while these have to be categorical).

With that in mind, I built a toy example using sklearn's datasets that I hope is close enough to your problem so that you can adapt the code:

import numpy as np
import pandas as pd
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import Accuracy
from pytorch_widedeep.models import TabMlp, Wide, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor, WidePreprocessor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


# manually adjusting the dataset so that it has categorical cols
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_classes=3)
df = pd.DataFrame(X, columns=["_".join(["col", str(i)]) for i in range(X.shape[1])])
for c in df.columns[10:]:
    df[c] = df[c].astype(int)
df["target"] = y

# train test split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.target)

# col 0 to 9 are continuous
continuous_cols = df_train.columns.tolist()[:10]

# cols 10 to 19 are categorical (well, integers treated as categories)
# except for the target. Let's use 5 of them as wide cols (the number 5 is
# arbitrary; the choice of wide and crossed columns depends very much on
# your knowledge of the problem and the data) and all of them as
# categorical embeddings
wide_cols = df_train.columns.tolist()[10:15]
# we will "cross" two (random) categorical colums. NOTE: crossed columns MUST
# BE CATEGORICAL
cross_cols = [("col_12", "col_15")]
cat_embed_cols = [c for c in df_train.columns.tolist()[10:] if c != "target"]

# target
target = df_train["target"].values

# have a look at the documentation and the attributes of the preprocessors
# to understand what these do
wide_preproc = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = wide_preproc.fit_transform(df_train)

tab_preproc = TabPreprocessor(
    embed_cols=cat_embed_cols, continuous_cols=continuous_cols  # type: ignore[arg-type]
)
X_tab = tab_preproc.fit_transform(df_train)

# the wide/linear component is connected DIRECTLY to the output neuron(s).
# Therefore, it needs to know the size of the final prediction layer
# beforehand, via the pred_dim param
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=3)

# the remaining components of the model are combined via the WideDeep
# class, and therefore we ONLY need to specify the pred_dim when we build
# the WideDeep class
tab_mlp = TabMlp(
    column_idx=tab_preproc.column_idx,
    embed_input=tab_preproc.embeddings_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[64, 32],
    mlp_dropout=[0.2, 0.2],
)
model = WideDeep(wide=wide, deeptabular=tab_mlp, pred_dim=3)

trainer = Trainer(model, objective="multiclass", metrics=[Accuracy])
trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=1,
    batch_size=32,
    val_split=0.2,
)

X_wide_test = wide_preproc.transform(df_test)
X_tab_test = tab_preproc.transform(df_test)
preds = trainer.predict(X_wide=X_wide_test, X_tab=X_tab_test)

Hope this helps. If you need more help, the best thing to do is to join the slack group, and/or maybe we could have a zoom call.

Let me know if it helps

Cheers
J.

@zhang-HZAU
Author

Hi jrzaurin,
Sorry, I haven't been able to provide feedback sooner because I have been dealing with other things recently.
After reading your example, I realized that there is indeed a problem with the categorical data. I have now processed the data as follows:
① The missing data is filled with the matrix factorization (Funk_SVD) algorithm (the filling process still needs to be optimized; for now I only need to get the prediction pipeline running);
② For the categorical data, I originally intended to pass the category labels (0, 1, 2, 3...) as categorical data, but a previous mistake left this part of the data as "float". I now convert these columns to integers and then to "str" before using them as input (again, for now I only want to get the process running); a minimal sketch of this conversion is shown below.
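The sketch below is illustrative only; "cat_col" is a hypothetical column name, not one of the actual columns:

import pandas as pd

df = pd.DataFrame({"cat_col": [0.0, 1.0, 2.0, 1.0]})
# float -> int -> str, so the column is treated as categorical downstream
df["cat_col"] = df["cat_col"].astype(int).astype(str)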
Result:
The filled data solves the first problem mentioned above, but the second problem still exists: when a validation set is split from the training set in the following code, the error reappears unless I comment out the "val_split" parameter. I hope you can suggest a solution, thank you.

trainer.fit(
    X_wide=X_wide,
    X_tab=X_tab,
    target=target,
    n_epochs=150,
    batch_size=16,
    val_split=0.1,
)
In addition, I have packaged the original input data and the notebook code into an attachment; the detailed runtime information is retained in the notebook. Regarding the slack link you provided, I cannot join due to network restrictions, but I will keep trying.

issue_sum.zip

@jrzaurin
Owner

Hey @zhang-HZAU

sorry, long week at work, I will have a look at the file this week 🙂

@zhang-HZAU
Author

Hi jrzaurin,
I updated the model's input data (see the attachment for the new data and code), namely "wide_cols" and "cross_cols", but the problem with splitting a validation set from the training set still exists.
In the current setup I can obtain the training loss and accuracy through the "History" callback and visualize the training process, as follows:

import matplotlib.pyplot as plt


plt.plot(range(len(trainer.history['train_acc'])), trainer.history['train_acc'])
plt.xlabel("epochs")
plt.ylabel("train_acc")
plt.title('Model_train_acc')
plt.show()

However, I did not find a way to obtain the loss and accuracy of the prediction step from the "predict" function. How should I get these data?
Looking forward to your answer, thank you.
issue_feedback.zip

@jrzaurin
Owner

jrzaurin commented Dec 23, 2021

Hey @zhang-HZAU, I will have a look at the notebook later and check that val_split problem. In the meantime, regarding the loss and accuracy from the predict method, here is the thing: the predict method in the Trainer does not take a target variable, as it is supposed to be used with test data where you normally don't have the target. Note this is common in most libraries (e.g. sklearn).

If you have a specific test dataset where you know the target and you want to get the metric and loss, you have to do it externally, i.e. run predict and then compute the losses and the metrics, such as:

from sklearn.metrics import accuracy_score

# using the test arrays from the toy example above
y_pred = trainer.predict(X_wide=X_wide_test, X_tab=X_tab_test)
acc = accuracy_score(df_test["target"].values, y_pred)

Same with the loss. Just bear in mind that for the loss you would need probabilities, not actual classes, for which you should use the trainer's predict_proba method.
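For example, a minimal sketch along those lines (again using the toy example's test split; log_loss is just one possible choice of loss):

from sklearn.metrics import log_loss

# probabilities, not classes, are needed for the loss
y_proba = trainer.predict_proba(X_wide=X_wide_test, X_tab=X_tab_test)
loss = log_loss(df_test["target"].values, y_proba)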

Hope this helps

@jrzaurin
Owner

jrzaurin commented Dec 23, 2021

Hi @zhang-HZAU

So I looked into your notebook and, simply, the issue you are facing is that while you define X_tab with the train dataset:

...
X_tab = tab_preprocessor.fit_transform(df_train)
...

you then define the target with the entire dataset:

...
target = df["index_0"].values
...

Simply change that to:

target = df_train["index_0"].values

and at least in my case, it runs

:)

Let me know if this fixes the issue
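As a quick sanity check (not from the original reply), X_tab and target must have the same number of rows, otherwise val_split ends up indexing beyond the training arrays:

# both arrays must come from the same split (df_train here)
assert X_tab.shape[0] == target.shape[0], (X_tab.shape[0], target.shape[0])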

@zhang-HZAU
Author

Hi jrzaurin,
Indeed, I followed your suggestion and the dataset-splitting error disappeared, sorry. Thank you again for your patient answers.

@jrzaurin
Owner

Hi @zhang-HZAU

No problem, thank you for using the library 🙂
