### Exploring gender through authorship and journal information
Author: Peter Laurin
This is a brief exploration of the variables we've collected and how well they might predict author gender in our data.

In [4]:
import numpy as np
import scipy
import pandas as pd
import sqlite3

First, load our variables into a pandas DataFrame so we can view them:

In [5]:
conn = sqlite3.connect('journals.db')
sql_text = ('SELECT first_name, last_name, institution, gender, country, field, rank, num_authors '
            'FROM authors JOIN papers JOIN author_key_rank '
            'ON author_key_rank.author_identifier = authors.author_identifier '
            'AND author_key_rank.paper_identifier = papers.paper_identifier;')
author_frame = pd.read_sql_query(sql_text, conn)
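To make the shape of this three-table join easier to see, here is a minimal sketch on an in-memory database. The tiny schema and rows are hypothetical: only the column names come from the query above, and which table holds which column is an assumption. The join is written with one explicit `ON` clause per table, which should be equivalent to the single combined condition used above.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature schema; the real journals.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_identifier INTEGER, first_name TEXT, last_name TEXT, gender TEXT);
    CREATE TABLE papers (paper_identifier INTEGER, field TEXT, num_authors INTEGER);
    CREATE TABLE author_key_rank (author_identifier INTEGER, paper_identifier INTEGER, rank INTEGER);
    INSERT INTO authors VALUES (1, 'A', 'One', 'girl'), (2, 'B', 'Two', 'boy');
    INSERT INTO papers VALUES (10, 'biological-sciences', 2);
    INSERT INTO author_key_rank VALUES (1, 10, 1), (2, 10, 2);
""")

# author_key_rank is the link table: each row ties one author to one paper.
sql_text = """
    SELECT first_name, last_name, gender, field, rank, num_authors
    FROM author_key_rank
    JOIN authors ON authors.author_identifier = author_key_rank.author_identifier
    JOIN papers  ON papers.paper_identifier  = author_key_rank.paper_identifier
    ORDER BY rank;
"""
toy_frame = pd.read_sql_query(sql_text, conn)
conn.close()
print(toy_frame)
```

Each row of the result is one (author, paper) pair, so an author who appears on several papers shows up several times.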

The data should now be loaded. Here are the first five rows:

In [6]:
author_frame[:5]

index,first_name,last_name,institution,gender,country,field,rank,num_authors
0,Diego R.,Barneche,indian ocean marine research centre,boy,australia,biological-sciences,1,7
1,Chris J.,Hulatt,queen mary university of london,gender neutral,uk,biological-sciences,2,7
2,Matteo,Dossena,queen mary university of london,boy,uk,biological-sciences,3,7
3,Daniel,Padfield,university of exeter,boy,uk,biological-sciences,4,7
4,Guy,Woodward,imperial college london,boy,uk,biological-sciences,5,7


Now, keep only authors affiliated with large institutions, i.e. those with more than 10 author records (this reduces the number of variables and removes low-quality data).

In [7]:
inst_count = author_frame.groupby('institution').count().sort_values(by = 'num_authors', ascending=False).iloc[:,-1:]
large_institutions = inst_count[inst_count['num_authors'] > 10].index

In [8]:
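# .copy() makes the filtered frame independent, so later column assignments avoid SettingWithCopyWarning
author_frame = author_frame[author_frame['institution'].isin(large_institutions)].copy()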

This removed about 15,000 authors, but the dataset is still large (60,000 authors). The loss is acceptable, since institutions with very few authors would have made issues such as multicollinearity hard to work around. I would also filter for countries with more than 10 authors, but the institution filter should already take care of that.
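As a small illustration of the institution filter, here is a sketch on toy data. The rows and the lowered threshold are hypothetical, and `transform("count")` is an alternative way of writing the same per-institution count rather than the code used above.

```python
import pandas as pd

# Toy data (hypothetical): institution "a" has 3 author records, "b" has 1.
toy = pd.DataFrame({
    "institution": ["a", "a", "a", "b"],
    "num_authors": [2, 5, 3, 4],
})
threshold = 2  # the notebook uses 10; lowered here so the toy data shows the effect

# Count the author records per institution and broadcast that count onto each row.
records_per_inst = toy.groupby("institution")["num_authors"].transform("count")
kept = toy[records_per_inst > threshold]

print(len(toy) - len(kept), "rows dropped")
```

Comparing `len(toy)` with `len(kept)` is also a quick way to check how many authors a given threshold removes.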

Now to determine author ranking. We want to reduce rank to a categorical variable with three levels, 'first_author', 'middle_author' or 'last_author', so that the regression does not interpret it as a numerical variable in which author status increases with rank.

In [9]:
author_frame.loc[:,'author_status'] = 'middle_author'
author_frame.loc[author_frame['rank'] == 1, 'author_status'] = 'first_author'
author_frame.loc[author_frame['rank'] == author_frame['num_authors'], 'author_status'] = 'last_author'
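To check how these three assignments behave, here is a sketch on toy data (hypothetical rows, not taken from journals.db) that applies them in the same order, including the edge case of a single-author paper.

```python
import pandas as pd

# Toy data (hypothetical): one three-author paper followed by one solo-author paper.
toy = pd.DataFrame({
    "rank":        [1, 2, 3, 1],
    "num_authors": [3, 3, 3, 1],
})

# Same three assignments as above, in the same order.
toy["author_status"] = "middle_author"
toy.loc[toy["rank"] == 1, "author_status"] = "first_author"
toy.loc[toy["rank"] == toy["num_authors"], "author_status"] = "last_author"

print(toy)
```

Because the 'last_author' assignment runs last, the solo author (rank 1 of 1) ends up labeled 'last_author'. Swapping the last two assignments would label solo authors 'first_author' instead.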

In [10]:
regression_frame = author_frame.loc[:,['author_status', 'gender', 'institution', 'country', 'field']]
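# astype returns a new frame, so assign it back to keep the categorical dtypes
regression_frame = regression_frame.astype('category')
regression_frame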

index,author_status,gender,institution,country,field
1,middle_author,gender neutral,queen mary university of london,uk,biological-sciences
2,middle_author,boy,queen mary university of london,uk,biological-sciences
3,middle_author,boy,university of exeter,uk,biological-sciences
4,middle_author,boy,imperial college london,uk,biological-sciences
5,middle_author,boy,queen mary university of london,uk,biological-sciences
...,...,...,...,...,...
84070,first_author,boy,,united states of america,business-and-commerce
84071,middle_author,boy,,united states of america,business-and-commerce
84072,middle_author,boy,,united states of america,business-and-commerce
84073,middle_author,girl,,united states of america,business-and-commerce


In [95]:
regression_frame = regression_frame[regression_frame['institution'] != '']
regression_frame = regression_frame[regression_frame['gender'].isin(['boy', 'girl'])]

In [13]:
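import statsmodels.api as sm
# Assumption: y and X are not defined in the cells shown; one plausible construction follows.
y = (regression_frame['gender'] == 'girl').astype(int)  # 1 = 'girl', 0 = 'boy'
X = pd.get_dummies(regression_frame.drop(columns='gender').astype(str), drop_first=True)
X = sm.add_constant(X.astype(float))  # Logit needs numeric columns and an explicit intercept
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())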

array(['boy', 'girl'], dtype=object)