Accurate name prior #20

Open
marcoct opened this issue Mar 7, 2021 · 0 comments

A good prior distribution over person names (first names, last names, etc.), and over many other kinds of names, including place names, seems important whenever a model needs to account for typos in names. If we model an observed name field with a typo model but without an accurate name prior, the model can easily infer that a correctly spelled name is a typo-corrupted version of another name. I encountered this when writing a simple model of first names. Here is a minimal example:

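# Assumes `df` (the observed table) and `all_given_names` (a collection of
# given-name strings) are already defined.
using PClean

PClean.@model CustomerModel begin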

    @class FirstNames begin
        name ~ StringPrior(1, 60, all_given_names)
    end

    @class Person begin
        given_name ~ FirstNames
    end;

    @class Obs begin
        begin
            person ~ Person
            given_name ~ AddTypos(person.given_name.name)
        end
    end;

end;

query = @query CustomerModel.Obs [
    given_name person.given_name.name given_name
];

observations = [ObservedDataset(query, df)]
config = PClean.InferenceConfig(5, 2; use_mh_instead_of_pg=true)
@time begin 
    tr = initialize_trace(observations, config);
    run_inference!(tr, config)
end
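
The failure mode described above can be reproduced outside PClean with a toy Bayesian calculation. The sketch below is written in Python rather than Julia so that it does not guess at PClean internals. The typo likelihood (a fixed penalty per edit), the two candidate names, and all probabilities are illustrative assumptions, not values taken from PClean or from real name data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def typo_score(observed: str, true_name: str, typo_rate: float = 0.1) -> float:
    """Hypothetical, unnormalized typo model: each edit costs a factor of typo_rate."""
    return typo_rate ** edit_distance(observed, true_name)


def posterior(observed: str, prior: dict, typo_rate: float = 0.1) -> dict:
    """Posterior over candidate true names given one observed string."""
    weights = {name: p * typo_score(observed, name, typo_rate)
               for name, p in prior.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# A prior that badly underestimates the rarer spelling (made-up numbers).
skewed_prior = {"John": 0.999, "Jon": 0.001}
# A prior that gives the rarer spelling more realistic mass (also made up).
calibrated_prior = {"John": 0.9, "Jon": 0.1}

skewed_post = posterior("Jon", skewed_prior)
calibrated_post = posterior("Jon", calibrated_prior)
```

Under the skewed prior, the correctly spelled observation "Jon" is explained as a one-edit typo of "John"; under the calibrated prior, "Jon" is kept. The only thing that changes between the two runs is the prior.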

Coming up with a good name prior seems like a very nontrivial task. Intuitively, a human performing this task would rely on prior experience with names, including common spellings, translations and transliterations, and knowledge of the variety of closely related names with common phonetic origins. A name expert would have a much more accurate name prior than a random person. Also, the statistics of names (frequency distributions, etc.) might vary widely across populations and sub-populations. One longer-term goal could be to develop an accurate name prior that represents the knowledge of a "global name expert".

Intermediate steps could be to

  • Train a more accurate n-gram text model on a data set of names.

  • Train or find an existing deep generative model for names.
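
As a rough illustration of the n-gram option, here is a minimal character-bigram model with add-alpha smoothing, written in Python. It is a sketch only: the tiny training list, the `START`/`END` markers, and the assumption of a 26-letter lowercase alphabet are all made up for the example, and this is not how PClean's `StringPrior` is implemented.

```python
import math
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical boundary markers


def train_bigram(names):
    """Count character bigrams, including start and end markers."""
    counts = defaultdict(Counter)
    for name in names:
        chars = [START] + list(name.lower()) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def log_prob(name, counts, alpha=1.0, n_symbols=27):
    """Log-probability of a name under the smoothed bigram model.

    n_symbols is the number of possible next symbols: 26 lowercase letters
    plus the end marker. Names with other characters would need a larger
    symbol set for the model to stay normalized.
    """
    chars = [START] + list(name.lower()) + [END]
    total = 0.0
    for prev, nxt in zip(chars, chars[1:]):
        context = counts.get(prev, Counter())
        numerator = context[nxt] + alpha
        denominator = sum(context.values()) + alpha * n_symbols
        total += math.log(numerator / denominator)
    return total


# Toy training data; a real prior would need a large, representative name list.
training_names = ["john", "jon", "joan", "jane", "james", "mary", "maria", "marie"]
model = train_bigram(training_names)
```

A name-like string such as "jona" scores higher than an arbitrary string such as "xqzk", even though neither appears in the training list. That is the property a name prior needs in order to avoid "correcting" unseen but plausible names.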

Other steps that don't involve coming up with a name prior, but that might mitigate the issue described above, include:

  • Come up with a more precise typo model, or an approximate typo model that somehow alleviates the issue (e.g. by upper-bounding the number of typos in a name). (This should be a separate issue.)

  • Use a large data set of names as a directly observed table in the model. This is equivalent to using a name prior that is a frequency-weighted distribution over these names. (A likely issue with this approach is that a name that does not appear at least once in the data set is likely to be corrected to a name that does.)

  • Change the Pitman-Yor parameters for the underlying name table to better match the statistics of real names and, more generally, to admit more rare names.
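
The last two options can be compared with the textbook Pitman-Yor predictive rule. With strength θ, discount d, n observations, and K distinct names, an already seen name with count n_k has probability (n_k − d) / (n + θ), and a previously unseen name has total probability (θ + d·K) / (n + θ). The Python sketch below assumes one table per distinct name and a positive strength, which is a simplification and not necessarily how PClean parameterizes its name table. The counts are made up.

```python
from collections import Counter


def pitman_yor_predictive(counts, strength, discount):
    """Return ({name: probability}, probability_of_a_new_name).

    Simplifying assumptions: each distinct name occupies exactly one table,
    strength > 0, and 0 <= discount < 1.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    if strength <= 0:
        raise ValueError("strength must be positive in this sketch")
    n = sum(counts.values())
    k = len(counts)
    denom = n + strength
    seen = {name: (c - discount) / denom for name, c in counts.items()}
    p_new = (strength + discount * k) / denom
    return seen, p_new


# Hypothetical name counts from an observed table.
name_counts = Counter({"john": 50, "mary": 40, "jon": 1})

# Close to a pure frequency-weighted prior: almost no mass for unseen names.
freq_seen, freq_new = pitman_yor_predictive(name_counts, strength=0.01, discount=0.0)

# Larger strength and discount: more mass for rare and unseen names.
py_seen, py_new = pitman_yor_predictive(name_counts, strength=5.0, discount=0.5)
```

With a tiny strength and zero discount, the predictive distribution is essentially the frequency-weighted table, so an unseen name gets almost no prior mass and is likely to be "corrected" to a seen one. Raising the strength and discount shifts mass toward unseen names at the expense of the seen ones.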

Also, a review of the potential consequences of a biased name prior, of approaches to reducing bias in name priors, and of ways to mitigate the downstream consequences of this bias could be valuable.
