about the wide feature and deep feature #3

Closed
Usernamezhx opened this issue Sep 19, 2019 · 2 comments
Usernamezhx commented Sep 19, 2019

First of all, thanks for your work. I am a little confused about how the wide and deep features are split across embeddings_cols, continuous_cols, standardize_cols, wide_cols, crossed_cols, and already_dummies. As I understand it, embeddings_cols holds the categorical features and continuous_cols holds the continuous features. What is the difference between standardize_cols and continuous_cols? The wide_cols features look similar to the categorical features. For crossed_cols, I understand that you want interaction features, but how do you select which columns to combine into an interaction feature? Thanks in advance.

jrzaurin (Owner) commented

Hi @Usernamezhx

OK, let me explain. If you go here:
https://github.com/jrzaurin/Wide-and-Deep-PyTorch/blob/master/prepare_data.py#L52

you will read the following:

    embeddings_cols: List
        List containing just the name of the columns that will be represented
        with embeddings or a Tuple with the name and the embedding dimension.
        e.g.: [('education', 32), ('relationship', 16)]
    continuous_cols: List
        List with the name of the so called continuous cols
    standardize_cols: List
        List with the name of the continuous cols that will be standardised.
        Only included because the Airbnb dataset includes Longitude and
        Latitude and it does not make sense to normalise those

The functions in prepare_data.py are highly customised to this particular problem. So, given this input for the Airbnb dataset:

continuous_cols = ['latitude', 'longitude', 'security_deposit', 'extra_people']
standardize_cols = ['security_deposit', 'extra_people']

What will happen is that 'security_deposit' and 'extra_people' will be standardised, while 'latitude' and 'longitude' will not (because it does not make sense).
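To make the difference between the two lists concrete, here is a minimal sketch in plain Python. It is not the code from prepare_data.py: the helper name `standardize_subset` and the toy values are assumptions made up for illustration, and zero-mean/unit-variance scaling is assumed as the meaning of "standardised".

```python
from statistics import mean, pstdev

# Hypothetical toy rows; the values are invented for illustration only.
rows = [
    {"latitude": 51.50, "longitude": -0.12, "security_deposit": 100.0, "extra_people": 0.0},
    {"latitude": 51.52, "longitude": -0.10, "security_deposit": 300.0, "extra_people": 10.0},
    {"latitude": 51.48, "longitude": -0.15, "security_deposit": 200.0, "extra_people": 20.0},
]

continuous_cols = ["latitude", "longitude", "security_deposit", "extra_people"]
standardize_cols = ["security_deposit", "extra_people"]


def standardize_subset(rows, continuous_cols, standardize_cols):
    """Keep every continuous column, but scale only those in standardize_cols."""
    # Mean and (population) standard deviation for each column to be scaled.
    stats = {}
    for col in standardize_cols:
        values = [r[col] for r in rows]
        stats[col] = (mean(values), pstdev(values))

    prepared = []
    for r in rows:
        new_row = {}
        for col in continuous_cols:
            if col in stats:
                mu, sigma = stats[col]
                # Guard against constant columns (sigma == 0).
                new_row[col] = (r[col] - mu) / sigma if sigma > 0 else 0.0
            else:
                # Not listed in standardize_cols: pass the raw value through.
                new_row[col] = r[col]
        prepared.append(new_row)
    return prepared


prepared = standardize_subset(rows, continuous_cols, standardize_cols)
```

In this sketch every column in `continuous_cols` reaches the output, and `standardize_cols` only selects which of them are rescaled, so 'latitude' and 'longitude' come out unchanged.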

Regarding the other column-type inputs, if you go here:
https://github.com/jrzaurin/Wide-and-Deep-PyTorch/blob/master/prepare_data.py#L128

you will read the following:

    wide_cols: List
        List with the name of the columns that will be one-hot encoded and
        passed through the Wide model
    crossed_cols: List
        List of Tuples with the name of the columns that will be "crossed"
        and then one-hot encoded. e.g. (['education', 'occupation'], ...)
    already_dummies: List
        List of columns that are already dummies/one-hot encoded

The wide columns are normally one-hot encoded and then passed through the model. However, there might be some columns that are already one-hot encoded, and I call them already_dummies.
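As a rough illustration of that distinction, the sketch below one-hot encodes the wide columns and copies the already-encoded ones through untouched. The helper `one_hot_wide` and the column names 'room_type' and 'has_wifi' are hypothetical and are not taken from the repository.

```python
def one_hot_wide(rows, wide_cols, already_dummies):
    """One-hot encode wide_cols; copy already_dummies through unchanged."""
    # Sorted list of the categories observed in each wide column.
    categories = {c: sorted({r[c] for r in rows}) for c in wide_cols}

    encoded = []
    for r in rows:
        new_row = {}
        for c in wide_cols:
            for cat in categories[c]:
                # One 0/1 indicator per (column, category) pair.
                new_row[f"{c}_{cat}"] = 1 if r[c] == cat else 0
        for c in already_dummies:
            # Already a 0/1 indicator, so no further encoding is needed.
            new_row[c] = r[c]
        encoded.append(new_row)
    return encoded


# Hypothetical toy data: 'room_type' is categorical, 'has_wifi' is already 0/1.
rows = [
    {"room_type": "private", "has_wifi": 1},
    {"room_type": "entire", "has_wifi": 0},
]
encoded = one_hot_wide(rows, wide_cols=["room_type"], already_dummies=["has_wifi"])
```

Encoding a column that is already 0/1 a second time would only create redundant indicator columns, which is presumably why such columns are listed separately.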

Regarding your last question, "how to select the element to constitute interaction feature?", the answer is that you have to experiment; there is no rule for that. If you have a couple of features and you think that their relationship might add useful information, then it is probably worth "crossing" them. For example, directly from the TensorFlow tutorials: "...If you have a feature 'favorite_sport' and a feature 'home_city' and you're trying to predict whether a person likes to wear red, your linear model won't be able to learn that baseball fans from St. Louis especially like to wear red..."
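One common way to build such a crossed feature is to concatenate the values of the chosen columns into a single new categorical column, which can then be one-hot encoded like any other wide column. The sketch below assumes that approach; the helper `add_crossed_columns`, the separator, and the toy rows are illustrative choices, not the repository's implementation.

```python
def add_crossed_columns(rows, crossed_cols, sep="-"):
    """Add one new categorical column per group of columns to be crossed."""
    out = []
    for r in rows:
        new_row = dict(r)
        for cols in crossed_cols:
            # e.g. ['favorite_sport', 'home_city'] -> 'favorite_sport_home_city'
            name = "_".join(cols)
            # The crossed value is the concatenation of the individual values.
            new_row[name] = sep.join(str(r[c]) for c in cols)
        out.append(new_row)
    return out


# Toy rows using the feature names from the quoted example.
rows = [
    {"favorite_sport": "baseball", "home_city": "St. Louis"},
    {"favorite_sport": "baseball", "home_city": "Boston"},
    {"favorite_sport": "hockey", "home_city": "St. Louis"},
]
crossed = add_crossed_columns(rows, crossed_cols=[["favorite_sport", "home_city"]])
```

After one-hot encoding, the combination "baseball-St. Louis" gets its own indicator and therefore its own weight in the linear (wide) part, which is what lets the model capture the interaction that the two separate columns cannot express.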

Let me know if this helps

Usernamezhx (Author) commented

Thank you very much for your patient reply.
