### Exploring gender through authorship and journal information
Author: Peter Laurin
This is a brief exploration of the variables we've collected and how well they might predict author gender in our data.

In [136]:
import numpy as np
import scipy
import pandas as pd
import sqlite3
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

In [11]:
import os
os.chdir('..')

First, let's load our variables into a pandas DataFrame so we can view them.

In [12]:
conn = sqlite3.connect('journals.db')
# One row per (author, paper) pair, with the author's rank on that paper.
sql_text = """
    SELECT first_name, last_name, institution, gender, country, field, rank, num_authors
    FROM authors
    JOIN papers
    JOIN author_key_rank
      ON author_key_rank.author_identifier = authors.author_identifier
     AND author_key_rank.paper_identifier = papers.paper_identifier;
"""
author_frame = pd.read_sql_query(sql_text, conn)

The data should now be loaded. Here are the first five rows:

In [6]:
author_frame[:5]

Unnamed: 0,first_name,last_name,institution,gender,country,field,rank,num_authors
0,Diego R.,Barneche,indian ocean marine research centre,boy,australia,biological-sciences,1,7
1,Chris J.,Hulatt,queen mary university of london,gender neutral,uk,biological-sciences,2,7
2,Matteo,Dossena,queen mary university of london,boy,uk,biological-sciences,3,7
3,Daniel,Padfield,university of exeter,boy,uk,biological-sciences,4,7
4,Guy,Woodward,imperial college london,boy,uk,biological-sciences,5,7


Now we keep only authors affiliated with large institutions (more than 10 author entries). This reduces the number of dummy variables and removes low-quality data.

In [13]:
inst_count = author_frame.groupby('institution').count().sort_values(by = 'num_authors', ascending=False).iloc[:,-1:]
large_institutions = inst_count[inst_count['num_authors'] > 10].index

In [14]:
author_frame = author_frame[author_frame['institution'].isin(large_institutions)]

This removed about 15,000 authors, but the dataset is still large (60,000 authors), and the loss is acceptable: with so many sparsely populated institution categories it would have been hard to get around multicollinearity and related issues. I would also filter to countries with more than 10 authors, but the institution filter should already take care of that.
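
The frequency filter above, and the claim that filtering by institution also covers country, can be sketched and checked on toy data. The helper name `keep_frequent` and the toy values are made up for illustration; this is not the code that produced the numbers in this notebook.

```python
import pandas as pd

def keep_frequent(frame, column, min_count=10):
    """Keep rows whose value in `column` occurs more than `min_count` times."""
    counts = frame[column].value_counts()
    frequent = counts[counts > min_count].index
    # .copy() returns an independent frame, so later column assignments are safe.
    return frame[frame[column].isin(frequent)].copy()

# Toy stand-in for author_frame (hypothetical values).
toy = pd.DataFrame({
    "institution": ["a"] * 12 + ["b"] * 3 + ["c"] * 11,
    "country": ["uk"] * 12 + ["fr"] * 3 + ["uk"] * 11,
})

filtered = keep_frequent(toy, "institution", min_count=10)

# Check whether every remaining country also clears the threshold.
smallest_country = filtered["country"].value_counts().min()
print(len(filtered), smallest_country)
```

Running the same `value_counts().min()` check on the real frame would confirm, rather than assume, that no small countries remain.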

Now to determine author ranking. We want to collapse rank into a categorical variable with the levels 'first_author', 'middle_author' and 'last_author', so that the regression does not interpret it as a numerical variable in which author status increases with rank.

In [15]:
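# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
author_frame = author_frame.copy()
author_frame['author_status'] = 'middle_author'
author_frame.loc[author_frame['rank'] == 1, 'author_status'] = 'first_author'
# Assigned last, so a sole author (rank == num_authors == 1) ends up as 'last_author'.
author_frame.loc[author_frame['rank'] == author_frame['num_authors'], 'author_status'] = 'last_author'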
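
In the cell above, a paper with a single author satisfies both conditions, and because the last assignment wins, that author is labelled 'last_author'. If that is not intended, one possible variant is sketched below with `np.select`. The extra 'sole_author' level and the helper name `label_author_status` are assumptions for illustration, not part of the analysis in this notebook.

```python
import numpy as np
import pandas as pd

def label_author_status(frame):
    """Label each row as first, middle, last, or sole author from rank and num_authors."""
    is_first = frame["rank"] == 1
    is_last = frame["rank"] == frame["num_authors"]
    # np.select takes the first matching condition, so the sole-author case goes first.
    conditions = [is_first & is_last, is_first, is_last]
    labels = ["sole_author", "first_author", "last_author"]
    status = np.select(conditions, labels, default="middle_author")
    return pd.Series(status, index=frame.index)

# Toy rows: a three-author paper followed by a single-author paper.
toy = pd.DataFrame({"rank": [1, 2, 3, 1], "num_authors": [3, 3, 3, 1]})
toy["author_status"] = label_author_status(toy)
print(toy["author_status"].tolist())
```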

In [110]:
regression_frame = author_frame.loc[:,['gender', 'institution', 'author_status', 'field', 'country']]
regression_frame = regression_frame[regression_frame['institution'] != '']
regression_frame = regression_frame[regression_frame['gender'].isin(['boy', 'girl'])]
regression_frame = regression_frame.astype('category')

In [111]:
regression_frame.loc[regression_frame['gender'] == 'boy', 'gender_bin'] = 1
regression_frame.loc[regression_frame['gender'] == 'girl', 'gender_bin'] = 0

In [101]:
y = regression_frame['gender_bin']
X = regression_frame.iloc[:, 1:5]
X = pd.get_dummies(data = X, drop_first=True)

In [122]:
def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
    return X_train, X_test, y_train, y_test
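
As written, `split_data` draws a different random split on every call, so the scores further down are not exactly reproducible. A hedged sketch of a reproducible, class-balanced split is shown here on toy arrays. The seed value and the choice to stratify are assumptions, not what was used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 15 of class 1 and 5 of class 0.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 15 + [0] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```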

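Ready for regression! Because this is still a rather large dataset with ~40,000 entries, the plan is to use the saga solver with an elastic net penalty, which helps account for both overfitting and multicollinearity between our predictive variables. Note that the cells below fit scikit-learn's plain `LinearRegression` (ordinary least squares, no regularization) to the 0/1 `gender_bin` target, so saga and elastic net are not applied yet.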
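
For reference, a minimal sketch of what a saga plus elastic net fit could look like in scikit-learn, using `LogisticRegression` on synthetic data. The arrays `X_demo` and `y_demo` and the settings `l1_ratio=0.5`, `C=1.0` and `max_iter=5000` are illustrative assumptions, not tuned values and not the model fitted in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a one-hot design matrix and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 8)).astype(float)
logits = 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Elastic net mixes L1 and L2 penalties; saga is the solver that supports it.
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=1.0, max_iter=5000)
clf.fit(X_tr, y_tr)

# Predicted probabilities stay in [0, 1], unlike least-squares predictions.
proba = clf.predict_proba(X_te)[:, 1]
print(accuracy_score(y_te, clf.predict(X_te)), log_loss(y_te, proba))
```

A classifier of this kind would be scored with accuracy or log loss rather than MSE and R², so its numbers would not be directly comparable with the outputs below.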

In [92]:
X_train, X_test, y_train, y_test = split_data(X, y)
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)

Let's see how the model looks

In [94]:
print(mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))

6.116139238701662e+20 -2.6190322690391047e+21


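Some truly terrible numbers there. An MSE on the order of 1e20 for a 0/1 target means at least some predictions are astronomically large, which may point to an unstable least-squares fit on collinear or very sparse dummy columns as much as to ordinary overfitting. Institution adds by far the most dummy columns, so let's try again without it.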
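
A quick check worth running before dropping columns is whether the dummy matrix is rank-deficient, since exactly collinear columns can make an unregularized least-squares fit unstable. The sketch below uses made-up toy data in which every institution belongs to exactly one country. That nesting is an assumption about our data, not something verified here.

```python
import numpy as np
import pandas as pd

# Toy data: each institution sits in a single country (hypothetical values).
toy = pd.DataFrame({
    "institution": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "country": ["uk", "uk", "uk", "uk", "fr", "fr", "fr", "fr"],
})

dummies = pd.get_dummies(toy, drop_first=True).astype(float)
# Add the intercept column that LinearRegression fits by default.
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)
print(n_columns, rank)  # rank below n_columns means some columns are redundant
```

Here the country dummy can be rebuilt from the intercept and the institution dummies, so the rank falls one short of the column count. The same comparison on the real `X` would show whether this applies to our data.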

In [104]:
y = regression_frame['gender_bin']
X = regression_frame.iloc[:, 2:5]
X = pd.get_dummies(data = X, drop_first = True)

In [108]:
X_train, X_test, y_train, y_test = split_data(X, y)
lin_model_no_inst = LinearRegression()
lin_model_no_inst.fit(X_train, y_train)
y_pred = lin_model_no_inst.predict(X_test)

(22546, 132)
(5637, 132)
(22546,)
(5637,)


In [109]:
print(mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))

6.107386932387707e+22 -2.6430480936598035e+23


No luck. Let's try with just author status and field, to be safe.

In [137]:
y = regression_frame['gender_bin']
X = regression_frame.iloc[:, 2:4]
X = pd.get_dummies(data = X, drop_first = True)

In [138]:
X_train, X_test, y_train, y_test = split_data(X, y)
lin_model_min = LinearRegression()
lin_model_min.fit(X_train, y_train)
y_pred = lin_model_min.predict(X_test)

(22546, 9)
(5637, 9)
(22546,)
(5637,)


In [139]:
print(mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))

0.22366388942758628 0.018701172615155248


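The numbers are now sane, and we're getting closer, but the model clearly still has faults: an R² of about 0.02 means author status and field together explain only around 2% of the variance in gender. We'll focus on field and author status in our visualizations. Let's take a look at our predictions.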
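
To put an MSE like the one above in context, it helps to compare it with a baseline that always predicts the training-set mean. The helper below is a sketch with a hypothetical name (`baseline_report`) and made-up inputs; it is not run on the notebook's data.

```python
import numpy as np

def baseline_report(y_train, y_test, y_pred):
    """Compare model MSE with the MSE of always predicting the training mean."""
    y_train, y_test, y_pred = (np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred))
    baseline_mse = np.mean((y_test - y_train.mean()) ** 2)
    model_mse = np.mean((y_test - y_pred) ** 2)
    return {
        "baseline_mse": baseline_mse,
        "model_mse": model_mse,
        # Fraction of the baseline error removed by the model.
        "relative_improvement": 1 - model_mse / baseline_mse,
    }

# Made-up numbers, only to show the call.
report = baseline_report([1, 0, 1, 1], [1, 0, 1, 0], [0.8, 0.6, 0.7, 0.6])
print(report)
```

The relative improvement is close in spirit to R², although `r2_score` measures against the test-set mean rather than the training mean, so the two can differ slightly.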