## Gender Stats

Now that we have our author dataset and inferred genders, it's time to do some exploratory data analysis. I'm going to do this in [Julia](http://julialang.org). I wrote two functions, `importauthors()` and `getgenderprob()`, that take the CSV files written by `write_names_to_file` (found in [xml parsing](xml_parsing.py)) and create Julia `DataFrames`.

In [5]:
include("dataimport.jl")

getgenderprob (generic function with 1 method)

In [6]:
bio = importauthors("../data/pubs/bio.csv", "bio")
comp = importauthors("../data/pubs/comp.csv", "comp")
git = importauthors("../data/pubs/git.csv", "git")

| Row | PMID | Date | Journal | Author_Name | Position | Dataset |
|---|---|---|---|---|---|---|
| 1 | 26357045 | 2015-09-11 | IEEE/ACM Trans Comput Biol Bioinform | Fumihide | first | git |
| 2 | 26357045 | 2015-09-11 | IEEE/ACM Trans Comput Biol Bioinform | Eberhard | last | git |
| 3 | 26357045 | 2015-09-11 | IEEE/ACM Trans Comput Biol Bioinform | Erika | second | git |
| 4 | 25601296 | 2015-01-20 | JMIR Med Inform | Abhishek | first | git |
| 5 | 25601296 | 2015-01-20 | JMIR Med Inform | Richard | last | git |
| 6 | 25558360 | 2015-01-05 | Ecol Evol | Sean | first | git |
| 7 | 25558360 | 2015-01-05 | Ecol Evol | Lawrence | last | git |
| 8 | 25558360 | 2015-01-05 | Ecol Evol | Helen | second | git |
| 9 | 25558360 | 2015-01-05 | Ecol Evol | Andy | penultimate | git |
| 10 | 25558360 | 2015-01-05 | Ecol Evol | Rogier | other | git |


Now let's combine all the data; we can subset it again later.

We'll also use the `getgenderprob()` function to add two columns: the probability that the author is female (`Pfemale`), and the number of times that name showed up in the genderize.io database (`Count`), which gives us some sense of how certain we can be in the result.

Finally, we'll use `pool!`, which makes the representation of categorical data (data that takes a limited set of distinct values rather than continuous ones) a bit more efficient in memory (and will make queries faster later on).

In [7]:
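# Stack the three datasets into a single DataFrame
alldata = vcat(bio, comp, git)

# Add the female probability and the genderize.io name count for each author
alldata[:Pfemale], alldata[:Count] = getgenderprob(alldata, :Author_Name)

# Pool the data so repeated values are stored more compactly
pool!(alldata)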
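
To make the idea behind pooling concrete, here is a minimal sketch in plain Julia (no packages). It is not how `pool!` is implemented; `poolvalues` is a hypothetical helper written only for illustration. It stores each distinct value once and replaces every row with a small integer reference.

```julia
# Hypothetical helper (not part of DataFrames): split a vector of strings
# into its distinct levels plus one integer reference per row.
function poolvalues(values::Vector{String})
    levels = unique(values)  # each distinct value is stored once
    index = Dict(level => i for (i, level) in enumerate(levels))
    refs = [index[v] for v in values]  # one integer per original row
    return levels, refs
end

# Toy column modelled on the `Position` values in the tables above
positions = ["first", "last", "second", "first", "last"]
levels, refs = poolvalues(positions)

# Indexing the levels with the references rebuilds the original column
restored = levels[refs]
```

With only a handful of distinct values across many rows, comparing integers is cheaper than comparing strings, which is the intuition behind the faster queries mentioned above.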

In Julia, we can [subset our dataframes](http://dataframesjl.readthedocs.io/en/latest/subsets.html) pretty easily. For example, we can pull back out rows for our different datasets.

In [None]:
biodata = alldata[alldata[:Dataset] .== "bio", :]
gitdata = alldata[alldata[:Dataset] .== "git", :]
compdata = alldata[alldata[:Dataset] .== "comp", :]
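
The subsetting above relies on a boolean mask: `.==` compares every element of the `Dataset` column to a string, and the resulting vector of `true`/`false` values selects the rows. A small sketch with plain vectors (toy data, assumed for illustration) shows the same mechanism without a DataFrame:

```julia
# Toy columns; the values are illustrative, not the real dataset
datasets = ["bio", "git", "bio", "git"]
authors = ["Suwit", "Fumihide", "Fred", "Erika"]

# Element-wise comparison yields one Bool per row
mask = datasets .== "bio"

# Indexing with the mask keeps only the rows where it is true
bioauthors = authors[mask]
```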

In [15]:
biodata[1:5, :] # get the first 5 rows, and all columns

| Row | PMID | Date | Journal | Author_Name | Position | Dataset | Pfemale | Count |
|---|---|---|---|---|---|---|---|---|
| 1 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suwit | first | bio | 0.0 | 2 |
| 2 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Prapas | last | bio | 0.0 | 1 |
| 3 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Suvichai | second | bio | NA | 0 |
| 4 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Pakpoom | penultimate | bio | 0.0 | 1 |
| 5 | 26466425 | 2015-10-15 | Southeast Asian J. Trop. Med. Public Health | Fred | other | bio | 0.02 | 966 |


Let's find out how many times each name shows up in our bio dataset. 

In [20]:
bionames = by(biodata, :Author_Name, nrow)
sort!(bionames, cols = [:x1], rev=true)

bionames[1:10, :]

| Row | Author_Name | x1 |
|---|---|---|
| 1 | M | 24033 |
| 2 | J | 20916 |
| 3 | A | 18676 |
| 4 | S | 15745 |
| 5 | C | 12175 |
| 6 | R | 12106 |
| 7 | David | 10955 |
| 8 | D | 10180 |
| 9 | Michael | 10055 |
| 10 | P | 9894 |
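
The grouping step above (count rows per name, then sort by the count) can be sketched in plain Julia with a `Dict`. This is only an illustration of the logic, not what `by` does internally; `countnames` is a hypothetical helper and the input is toy data.

```julia
# Hypothetical helper: count how often each name occurs and
# return (name => count) pairs sorted from most to least frequent.
function countnames(authors)
    counts = Dict{String,Int}()
    for name in authors
        counts[name] = get(counts, name, 0) + 1
    end
    # `last` picks the count out of each name => count pair
    return sort(collect(counts), by = last, rev = true)
end

toy = ["M", "David", "M", "J", "M", "David"]
ranked = countnames(toy)
```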


Hmm... Looks like many of the most frequent names are single letters (presumably initials), so we can't infer a gender for them. Let's remove the names where we don't know the gender.

In [None]:
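# Look up the gender probability and name count for each distinct name
bionames[:Pfemale], bionames[:Count] = getgenderprob(bionames, :Author_Name)
# Split the names by whether a female probability is available
knownbionames = bionames[!isna(bionames[:Pfemale]), :]
unknownbionames = bionames[isna(bionames[:Pfemale]), :]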
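
The split above uses `isna`, which belongs to the version of DataFrames used in this notebook. In recent Julia versions, missing values are represented by `missing` and tested with `ismissing`. As a rough sketch of the same known/unknown split on plain vectors (toy values, assumed for illustration):

```julia
# Toy data: a probability where one is known, `missing` where it is not
authors = ["David", "M", "Fred", "J"]
pfemale = [0.0, missing, 0.02, missing]

# true where a probability is available
known = .!ismissing.(pfemale)

knownauthors = authors[known]
unknownauthors = authors[.!known]
```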

In [23]:
sort!(knownbionames, cols = [:x1], rev=true)
knownbionames[1:10, :]

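| Row | Author_Name | x1 | Pfemale | Count |
|---|---|---|---|---|
| 1 | David | 10955 | 0.0 | 12597 |
| 2 | Michael | 10055 | 0.0 | 11160 |
| 3 | John | 7460 | 0.01 | 9931 |
| 4 | Peter | 6025 | 0.0 | 4373 |
| 5 | Robert | 5813 | 0.0 | 5245 |
| 6 | Thomas | 5043 | 0.0 | 3753 |
| 7 | Richard | 4638 | 0.0 | 4381 |
| 8 | James | 4629 | 0.01 | 6359 |
| 9 | Mark | 4344 | 0.0 | 6178 |
| 10 | Daniel | 4300 | 0.0 | 8186 |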
