# Mathematicians of Wikipedia

This project analyzes a dataset of more than 8,500 famous mathematicians.

## Table of Contents
- [Importing Dataset and Libraries](#importing)
- [Exploratory Data Analysis](#exploration)

## Importing Dataset and Libraries <a name="importing"></a>

In [143]:
using Suppressor # TODO: find an alternative that works
using CSV
using DataFrames
using StatsPlots

In [144]:
data = CSV.read("../wikipedia-mathematicians/data_cleaned.csv", DataFrame);

## Exploratory Data Analysis <a name="exploration"></a>

In [145]:
describe(data)[:, [:variable, :eltype, :nmissing]]



| | variable (Symbol) | eltype (Type) | nmissing (Union…) |
|---|---|---|---|
| 1 | mathematicians | String | |
| 2 | occupation | String | |
| 3 | country of citizenship | Union{Missing, String} | 1721 |
| 4 | place of birth | Union{Missing, String} | 3026 |
| 5 | date of death | Union{Missing, String} | 3702 |
| 6 | educated at | Union{Missing, String} | 3869 |
| 7 | employer | Union{Missing, String} | 5086 |
| 8 | place of death | Union{Missing, String} | 5436 |
| 9 | member of | Union{Missing, String} | 5505 |
| 10 | employer_1 | Union{Missing, String} | 5086 |


The table above lists the variables (columns) in our dataset, together with the element type of each variable and its number of missing values. Before working with these variables, we rename the columns so they are easier to handle:

In [146]:
newcols = ["mathematicians", "occupation", "citizenship", "birth_place", "death_date", "education", "employer", 
    "death_place", "member", "employer_1", "doctoral_advisor", "languages", "academic_degree", "doctoral_student", 
    "death_manner", "position", "field", "award", "erdos_number", "instance_of", "sex_or_gender", "approx_birth_date",
    "birth_day", "birth_month", "birth_year", "approx_death_date", "death_day", "death_month", "death_year"]
rename!(data, newcols)
@show names(data);

names(data) = ["mathematicians", "occupation", "citizenship", "birth_place", "death_date", "education", "employer", "death_place", "member", "employer_1", "doctoral_advisor", "languages", "academic_degree", "doctoral_student", "death_manner", "position", "field", "award", "erdos_number", "instance_of", "sex_or_gender", "approx_birth_date", "birth_day", "birth_month", "birth_year", "approx_death_date", "death_day", "death_month", "death_year"]


In [147]:
# Number of rows in our dataset:
nrow(data)

8596

Next, we check which columns contain missing values:

In [148]:
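function missing_values(col)
    # `any(ismissing, ...)` is true if at least one entry of the column is missing
    if any(ismissing, data[:, col])
        println("Column number $col has missing values")
    end
end

# Loop over every column instead of hard-coding the column count
for i in 1:ncol(data)
    missing_values(i)
end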
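
The per-column check above can also be sketched on toy data, which makes the expected result easy to verify by eye. The following is only a minimal illustration: the `toy` columns and their values are hypothetical stand-ins, not rows from the real dataset, and a plain named tuple is used so that the snippet needs nothing beyond Base Julia.

```julia
# Toy columns standing in for the real dataset (hypothetical values).
toy = (
    mathematicians = ["Ada", "Bea", "Cy"],
    citizenship    = ["France", missing, missing],
    birth_place    = [missing, "Oslo", "Lima"],
)

# Count the missing entries in each column.
missing_counts = Dict(name => count(ismissing, column) for (name, column) in pairs(toy))

# Keep only the columns that actually contain missing values.
incomplete = sort([name for (name, n) in missing_counts if n > 0])
println(incomplete)
```

Counting with `count(ismissing, column)` gives the same information as the `nmissing` column of `describe`, while `any(ismissing, column)` is enough when only a yes/no answer is needed.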

# TODO

One interesting quantity to measure is the proportion of women relative to men:

In [149]:
function count_gender(data, gender::String)
    # Column 21 is "sex_or_gender"; missing entries are skipped before counting
    return count(==(gender), skipmissing(data[:, 21]))
end

male = count_gender(data,"['male']")
female = count_gender(data, "['female']")

println("There are $male men and $female women in our dataset. In $(nrow(data)-male-female) cases, gender is not specified.")

There are 7774 men and 787 women in our dataset. In 35 cases, gender is not specified.


In [150]:
# Multiply by 100 so the values are percentages, matching the "%" in the message
prop_w = round(100 * female / (male + female); digits=1)
prop_m = round(100 * male / (male + female); digits=1)

println("Only $prop_w% of mathematicians for whom we have data are women, the remaining $prop_m% are men.")

Only 9.2% of mathematicians for whom we have data are women, the remaining 90.8% are men.


The **Erdős number** is the *collaborative distance* between a given person and the famous mathematician Paul Erdős, measured by co-authorship of mathematical papers: Erdős himself has number 0, his direct co-authors have number 1, their co-authors have number 2, and so on.
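
To make the idea of collaborative distance concrete, here is a minimal sketch that computes it with a breadth-first search over a co-authorship graph. The graph, the author labels `"A"` to `"E"`, and the helper `collaboration_distances` are all hypothetical and exist only for illustration; the dataset itself already stores Erdős numbers and does not need this computation.

```julia
# Toy co-authorship graph (hypothetical authors, not taken from the dataset).
# Each key maps to the list of people that author has published with.
coauthors = Dict(
    "Erdős" => ["A", "B"],
    "A"     => ["Erdős", "C"],
    "B"     => ["Erdős"],
    "C"     => ["A", "D"],
    "D"     => ["C"],
    "E"     => String[],   # no co-authors in this graph
)

# Breadth-first search from `source`.
# Returns the distance to every author reachable from `source`.
function collaboration_distances(graph::Dict{String,Vector{String}}, source::String)
    dist = Dict(source => 0)
    queue = [source]
    while !isempty(queue)
        current = popfirst!(queue)
        for neighbor in get(graph, current, String[])
            if !haskey(dist, neighbor)
                dist[neighbor] = dist[current] + 1
                push!(queue, neighbor)
            end
        end
    end
    return dist
end

erdos_numbers = collaboration_distances(coauthors, "Erdős")
println(erdos_numbers["D"])
```

Authors who cannot be reached from Erdős, such as `"E"` here, simply do not appear in the result; they are conventionally said to have an infinite or undefined Erdős number.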

We now analyze the Erdős numbers corresponding to the different mathematicians in our dataset:

In [151]:
# TODO

The `instance_of` column (originally "instance of") contains `['human']` for most mathematicians, but a few entries carry additional or unusual values. Let's list every entry that differs from `['human']`:

In [161]:
# Print every value that is not exactly "['human']"
for value in data[:, "instance_of"]
    if value != "['human']"
        println(value)
    end
end

['human', 'Russian Wikipedia']
['human', 'twin']
['Q5']
['human', 'twin']
['Q5']
['Q5']
['twin', 'human']
['Q5']
['human', 'English Wikipedia']
['Q5']
['Q5']
['human', 'Russian Wikipedia']
['human', 'data.bnf.fr', '10 October 2015', 'http://data.bnf.fr/ark:/12148/cb119176085']
['human', 'Russian Wikipedia', 'data.bnf.fr', '10 October 2015', 'http://data.bnf.fr/ark:/12148/cb118976048']
['Q5']
['Q5']
['Q5']
['Q5']
['Q5']
['eunuch', 'human']
['Q5']
['human', 'male']
['Q5']
['human', 'twin']
['Q5']
['human', 'Russian Wikipedia']
['human', 'emeritus']
['Q5']


The most interesting cases among those displayed above are the mathematicians who had a twin sibling or who were eunuchs:

In [183]:
for row in eachrow(data)
    name, kind = row["mathematicians"], row["instance_of"]
    # The twin label appears in both orders, so check both spellings
    if kind in ("['human', 'twin']", "['twin', 'human']")
        println("$name had a twin sibling")
    elseif kind == "['eunuch', 'human']"
        println("$name was castrated!")
    end
end

Leon O. Chua had a twin sibling
Carl Gustav Axel Harnack had a twin sibling
Nathan D'Laryea had a twin sibling
Jia Xian was castrated!
Milutin Milanković had a twin sibling
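
Because the same labels can appear in a different order (`['human', 'twin']` versus `['twin', 'human']`), comparing whole strings requires listing every ordering. One possible alternative, sketched below, is to split each string into its quoted items and test membership instead. The helper `parse_items` is hypothetical and assumes that no item contains an apostrophe of its own.

```julia
# Extract the quoted items from a list-like string such as "['twin', 'human']".
# Assumes items never contain an apostrophe themselves.
parse_items(s::AbstractString) = [String(m.captures[1]) for m in eachmatch(r"'([^']*)'", s)]

# Membership no longer depends on the order in which the items were written.
is_twin(s::AbstractString) = "twin" in parse_items(s)

println(parse_items("['twin', 'human']"))
println(is_twin("['human', 'twin']"))
```

With such a helper, the condition in the loop above could be written as a single membership test per label rather than one comparison per ordering.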
