# Machine learning 101
Machine learning is not difficult, and that is what we will show here.
First we load the lengths dataset that was generated in the statistics notebook, and use it in the explanation that follows.

In [None]:
import csv
import numpy as np
with open("lengths.csv", "rt") as f:
    dataset = np.array(list(csv.reader(f))).astype(float)

lengths = dataset[:,0]
gender = dataset[:,1]

Nfemale = (gender == 1).sum()
Nmale = (gender == 0).sum()
N = gender.shape[0]

The next graph shows the histogram of the lengths. The averages for men and women are shown as two red lines. The black line lies exactly halfway between the two averages.

In [None]:
import matplotlib.pyplot as plt
plt.hist(lengths, bins=40);
plt.vlines([170,184], 0, 1400000, 'r')
plt.vlines([177], 0, 1400000, 'k')

## Question
Given that the average length of men in the Netherlands is 1.84 m and that of women is 1.70 m, what would be a strategy to determine the gender of each individual?

Please describe the solution in the field below and try to write a short piece of code to determine the gender.

## Answer


In [None]:
## try to write some code here
## fill a variable y_ with your guesses of the gender
## what should be in place of the question marks?
y_ = lengths < ???


Let us first compare the true counts with the estimated ones:

In [None]:
print(f"Number of women: {y_.sum()} (true: {Nfemale})")
print(f"Number of men: {(~y_).sum()} (true: {Nmale})")

As we can see, the true values and the estimated values are not that different. We actually did not do such a bad job. However, we can also look at the numbers of correct and incorrect classifications:

In [None]:
correct_f = (y_ & (gender==1)).sum()
correct_m = (~y_ & (gender==0)).sum()
incorrect_f = (y_ & (gender==0)).sum()
incorrect_m = (~y_ & (gender==1)).sum()
print(correct_f, correct_m, incorrect_f, incorrect_m)
print(f"misclassifications: {incorrect_f+incorrect_m}")
print(f"correct classifications: {correct_f+correct_m}")

A quality measure we sometimes use is the accuracy, which is the number of correct classifications divided by the total number of cases:

In [None]:
print(f"Accuracy = {(correct_f+correct_m)/N}")

This means that the gender of about 84% of the population is correctly determined.
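To make the bookkeeping concrete, here is a minimal sketch of the same four counts and the accuracy on ten made-up people. The names `toy_lengths`, `toy_gender` and `THRESHOLD` are only for illustration, and the cut-off of 177 is an assumption (the midpoint of the two averages), not the value you were asked to find yourself.

```python
# Toy data, invented for illustration: lengths in cm, gender 1 = female, 0 = male
toy_lengths = [160, 168, 172, 175, 179, 181, 176, 180, 185, 190]
toy_gender = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
THRESHOLD = 177  # assumed cut-off: halfway between 170 and 184

# Guess "female" for everyone shorter than the threshold
predicted_female = [x < THRESHOLD for x in toy_lengths]
pairs = list(zip(predicted_female, toy_gender))

correct_f = sum(p and g == 1 for p, g in pairs)          # women guessed as women
correct_m = sum((not p) and g == 0 for p, g in pairs)    # men guessed as men
incorrect_f = sum(p and g == 0 for p, g in pairs)        # men guessed as women
incorrect_m = sum((not p) and g == 1 for p, g in pairs)  # women guessed as men

accuracy = (correct_f + correct_m) / len(toy_gender)
print(correct_f, correct_m, incorrect_f, incorrect_m, accuracy)
```

Note that the estimated number of women can be exactly right even when individuals are misclassified, because the two kinds of error cancel each other in the count but not in the accuracy.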

## Scikit-learn
We are now going to use scikit-learn. We will use one of the simplest classifiers to work with, the so-called logistic regression, and see whether we find the same result.

In [None]:
from sklearn.linear_model import LogisticRegression
X = lengths[0:100].reshape(100,1)
y = gender[0:100]
clf = LogisticRegression(random_state=0).fit(X, y)

It is not important how it was calculated, but the criterion (the threshold) that we estimated by hand in the previous approach is now determined automatically by the algorithm. Don't worry about the calculation, just look at the result:

In [None]:
print(f"the criterion is equal to: {-clf.intercept_[0] / clf.coef_[0, 0]}")
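For the curious, a short sketch of where that formula comes from. Logistic regression models the probability of class 1 as a sigmoid of `coef * x + intercept`; the model switches its prediction where that probability is 0.5, which happens when `coef * x + intercept = 0`, so at `x = -intercept / coef`. The parameter values below are hypothetical, chosen only so that the numbers come out round; they are not the values of the model fitted above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters (NOT taken from the fitted model above)
coef = -0.5       # negative: the taller someone is, the less likely class 1 (female)
intercept = 88.5

threshold = -intercept / coef                           # where coef * x + intercept == 0
p_at_threshold = sigmoid(coef * threshold + intercept)  # probability exactly at the boundary
p_shorter = sigmoid(coef * 170 + intercept)             # someone of 170 cm
p_taller = sigmoid(coef * 184 + intercept)              # someone of 184 cm

print(threshold, p_at_threshold, p_shorter, p_taller)
```

Everyone shorter than the threshold gets a probability above 0.5 and is predicted as class 1, which is the same kind of rule as `lengths < threshold` in the manual approach.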

## Question
As you can see, the criterion agrees quite well with the value we used earlier. However, it is not exactly the same. Can you think of reasons why it is not the same?

## Answer


Let us now see how far off the estimate of the number of women in the Netherlands is. For that, we first predict the gender with the model and then count how many people are predicted to be women:

In [None]:
y_ = clf.predict(lengths.reshape(N,1))
print(f"number of women: {y_.sum()} (real: {Nfemale})")

Hmm, the number is not that accurate. How could that be? In the rest of the project, we are going to investigate this.

# Populations and samples

As in the previous notebook, what we ultimately want to do is count: for instance, the number of women in the Dutch population. As the previous cells show, machine learning algorithms are not always very accurate at this. Part of the uncertainty in the estimate comes from the small sample that we use as the training set. In the next cell we repeat what we did when determining the variance as a function of the sample size, but now we vary the training set size and measure the variance of the estimated number of women.


In [None]:
from tqdm import tqdm
from numpy.random import default_rng
rng = default_rng()

variances = []
SAMPLE_SIZES = [50,100,200,400,800,1600]
for sample_size in tqdm(SAMPLE_SIZES):
    estimated_f = []
    for n in range(100):
        # generate a training set as a sample of the population
        training = rng.choice(dataset, replace = False, size = sample_size)
        # we call the features in the training set X
        X = training[:,0].reshape(sample_size,1)
        # we call the target variable (gender) y
        y = training[:,1]
        # clf is the logistic regression model.
        clf = LogisticRegression().fit(X, y)
        # y_ are the predictions for the whole population
        y_ = clf.predict(lengths.reshape(N,1))
        estimated_f.append(y_.sum())
    variances.append(np.array(estimated_f).var())
    

We generate the same plot as before:

In [None]:
import matplotlib.pyplot as plt
variances = np.array(variances)
sample_sizes = np.array(SAMPLE_SIZES)
plt.plot(sample_sizes, variances)

As we can see, from a training set size of about 600 the variance does not change that much anymore.
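If you do not have `lengths.csv` at hand, the same effect can be reproduced with a small self-contained sketch. Two things are swapped in here and are assumptions, not the notebook's method: the population is synthetic (means 170 and 184 from the text, a made-up spread of 7 cm), and instead of logistic regression we use a simpler midpoint classifier whose threshold is halfway between the two sample averages.

```python
import random
import statistics
from bisect import bisect_left

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population (assumption): (length in cm, gender), 1 = female, 0 = male
POPULATION_SIZE = 20_000
population = []
for _ in range(POPULATION_SIZE):
    g = random.randint(0, 1)
    mean = 170 if g == 1 else 184
    population.append((random.gauss(mean, 7), g))

# Sorted lengths let us count "everyone shorter than x" with a binary search
sorted_lengths = sorted(x for x, _ in population)

def estimate_women(sample):
    """Fit a midpoint threshold on the sample; count predicted women in the population."""
    women = [x for x, g in sample if g == 1]
    men = [x for x, g in sample if g == 0]
    if not women or not men:
        return None  # sample lacks one of the classes; the caller skips it
    threshold = (statistics.fmean(women) + statistics.fmean(men)) / 2
    return bisect_left(sorted_lengths, threshold)  # number of lengths < threshold

toy_variances = {}
for sample_size in (50, 200, 800):
    estimates = []
    for _ in range(100):
        est = estimate_women(random.sample(population, sample_size))
        if est is not None:
            estimates.append(est)
    # population variance, the same definition that np.var uses by default
    toy_variances[sample_size] = statistics.pvariance(estimates)

for size, var in toy_variances.items():
    print(f"training set size {size}: variance of the estimate {var:.0f}")
```

The variance of the estimated number of women should drop sharply as the training set grows, which is the same trend as in the plot above.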