Weighted distribution across multiple variables #4

Open
soooh opened this issue Apr 13, 2017 · 4 comments

@soooh

soooh commented Apr 13, 2017

Could be a useful addition to your library. As an example, I'm interested in getting stats on race and gender in a group over time. Something like:

data_by_year = data.groupby(['year'])
race_gender_demographics = calc.distribution(data_by_year, ['race', 'gender']).round(3)
@jsvine
Owner

jsvine commented Apr 13, 2017

Hi @soooh! You should actually be able to do this with the current code, though it depends on exactly what you're looking to calculate. Let's say you're looking for the weighted distribution of race, by gender and over time. In that case, this should work:

grouped = data.groupby([ "year", "gender" ])
dist = calc.distribution(grouped, "race").round(3)

Does that work? Are you aiming for something slightly different?

@soooh
Author

soooh commented Apr 13, 2017

Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on. What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?
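To make the distinction concrete, here is a minimal pandas-only sketch on a made-up table. It does not use the library's API; the `weight` column name and all data values are hypothetical. It contrasts the share of each race within a gender against the share of each race/gender combination in the whole group.

```python
import pandas as pd

# Hypothetical toy data; "weight" is an assumed survey-weight column.
data = pd.DataFrame({
    "race":   ["white", "white", "black", "black"],
    "gender": ["man", "woman", "man", "woman"],
    "weight": [3.0, 1.0, 1.0, 3.0],
})

# Weighted total for each gender (rows) by race (columns) cell.
cells = data.pivot_table(
    index="gender", columns="race", values="weight", aggfunc="sum"
)

# Conditional: share of each race within a gender (each row sums to 1).
within_gender = cells.div(cells.sum(axis=1), axis=0)

# Joint: share of each race/gender cell in the whole group (all cells sum to 1).
of_whole_group = cells / cells.to_numpy().sum()
```

In this toy table, white men make up 75% of men but only 37.5% of the whole group, which is the difference being described.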

@jsvine
Owner

jsvine commented Apr 13, 2017

> Ah, so what that does is give me the racial demographics of women and men separately. E.g., of all the women, 20% are white, 10% are black, and so on.

Yep.

> What I want is something like, 10% of the group is white men, 8% white women, etc. Does that make sense?

Ah, sounds like I misunderstood the goal. In that case, the easiest way might be like so:

data["race_x_gender"] = data[[ "race", "gender" ]].apply(" x ".join, axis=1)
dist = calc.distribution(data.groupby("year"), "race_x_gender").round(3)

Does that achieve your goal? (It assumes that race and gender are strings.)

I'll also think about ways I could incorporate a generic feature like this into the library itself. Thanks for the suggestion!
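For reference, the combined-column idea can be sketched in plain pandas, without the library, to show what the per-year numbers represent. This is only an illustration under assumptions: the `weight` column name and the toy rows are hypothetical, and `str.cat` assumes both columns hold non-null strings.

```python
import pandas as pd

# Hypothetical toy data; "weight" is an assumed column name.
data = pd.DataFrame({
    "year":   [2016, 2016, 2016, 2017, 2017],
    "race":   ["white", "white", "black", "white", "black"],
    "gender": ["man", "woman", "woman", "man", "man"],
    "weight": [2.0, 1.0, 1.0, 1.0, 3.0],
})

# Combine the two variables into a single label such as "white x man".
data["race_x_gender"] = data["race"].str.cat(data["gender"], sep=" x ")

# Sum weights per (year, label), then divide by each year's total weight.
cell_weight = data.groupby(["year", "race_x_gender"])["weight"].sum()
year_weight = cell_weight.groupby(level="year").transform("sum")

# One row per year, one column per combination; absent combinations become 0.
shares = (cell_weight / year_weight).unstack(fill_value=0).round(3)
```

Each row of `shares` sums to 1, so every value reads as "this combination's weighted share of the whole group in that year".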

@soooh
Author

soooh commented Apr 13, 2017

Ah yes, that is actually what I am doing! 😄
I thought it could be a useful feature, though, which is why I suggested it.
