
# Assignment - Univariate Gender Classification

## 1. An univariate classifier from first principles

### a. Data Generation
- Generate distributions (gaussian to start with) for male and female heights (1000 samples each) 
- Fix the mean of female heights to 152 cm and male mean height to 166 cm
- Label the appropriate gender for samples in each of the distribution (M or F)

### b. Standard Deviation
- Fix the sd of both the distributions to 5

### c. Classification Approaches
Implement the following approaches with aim to minimize misclassification:

i. **Likelihood-based Classification**
   - Assign gender based on likelihood calculated from distributions
   - Empirically estimate mean and sd
   - Calculate probability assuming gaussian distributions

ii. **Threshold-based Classification**
   - Derive a threshold height to separate male female

iii. **Quantized Classification**
   - Quantize the data at scale of 0.5 cm
   - Empirically estimate the likelihood of male female in each segment based on majority

iv. **Evaluation**
   - Output a confusion matrix for classification in each of the above cases

### d. Standard Deviation Analysis
- Try following values of sd: 2.5, 7.5 and 10
- Repeat steps 3.a, 3.b, 3.c, 3.d
- Observe impact of change in sd on classification accuracy

### e. Quantization Interval Analysis
- Change the quantization interval length (0.001, 0.05, 0.1, 0.3, 1, 2, 5, 10 cm etc)
- Repeat steps 3.a, 3.b, 3.c, 3.d
- Observe impact of change in sd on classification accuracy

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm

def generate_Heights(n=1000, femalex=152, malex=166, sd=5):
    female_h = np.random.normal(femalex,sd,n)
    male_h = np.random.normal(malex,sd,n)
    df_fm = pd.DataFrame({'height': female_h, 'gender': ['F'] * n})
    df_m = pd.DataFrame({'height': male_h, 'gender': ['M'] * n})
    df = pd.concat([df_fm,df_m])
    return df

print(generate_Heights())
df = generate_Heights()
generate_Heights().to_csv('heights.csv', index=False)

def probability_based_on_likelihood(df,femalex = 152 , malex = 166 , sd = 5):
    female_prob = norm.pdf(df['height'], femalex, sd)
    male_prob = norm.pdf(df['height'], malex, sd)
    result = np.where(male_prob>female_prob, 'M', 'F')
    return result

result = probability_based_on_likelihood(df)
print(result)
df['predicted_gender'] = result
df.to_csv('pdf.csv', index=False)

def threshold_classifier(df, threshold=None):
    if threshold is None:
        threshold = (152 + 166) / 2
    result = np.where(df['height'] > threshold, 'M', 'F')
    return result

print(threshold_classifier(df))
result = threshold_classifier(df)
df['predicted_gender'] = result
df.to_csv('threshold.csv', index=False)





         height gender
0    151.808896      F
1    153.849567      F
2    147.467315      F
3    157.398612      F
4    149.575639      F
..          ...    ...
995  159.822442      M
996  165.794818      M
997  165.266990      M
998  164.038541      M
999  158.786550      M

[2000 rows x 2 columns]
['F' 'F' 'F' ... 'M' 'M' 'M']
['F' 'F' 'F' ... 'M' 'M' 'M']


In [4]:
pd.read_csv('threshold.csv')

Unnamed: 0,height,gender,predicted_gender
0,155.805494,F,F
1,148.332763,F,F
2,148.172724,F,F
3,153.371773,F,F
4,148.681061,F,F
...,...,...,...
1995,165.580025,M,M
1996,168.085098,M,M
1997,167.172716,M,M
1998,166.713052,M,M
