# Predicting T-shirt size using the ANSUR II dataset
Here we try to predict a person's T-shirt size from their weight and height. We use the ANSUR II dataset, which contains many physical measurements for a large number of people.
 
We first map each person in the dataset to a T-shirt size. It is hard to find a concise size chart for T-shirts, so we create our own initial chart based on these assumptions:

We only look at two measurements, Shoulder Width and Chest Circumference.

Our first problem is that Shoulder Width is not one of the measurements taken in the dataset. However, the dataset does have Biacromial Breadth, which is the distance between the two acromion processes. We assume that this is the same as Shoulder Width.

This gives us the following initial rules, where each size covers a percentile range of the measurement:
 
| Size | Percentile |
|------|------------|
| XS   | 0-5        |
| S    | 5-25       |
| M    | 25-50      |
| L    | 50-75      |
| XL   | 75-90      |
| XXL  | 90-97      |
| XXXL | 97-100     |

What I know:

We have a dataset with shoulder and chest measurements, and we need to derive the T-shirt size from these two measurements.

“Let’s rank people from smallest to largest, and assign sizes based on where they fall.”

Mapping rule: if a person falls in a given percentile range, they get the size for that range.

Conflict: for example, person 1 falls in S based on the chest measurement and in M based on the shoulder measurement.

if chest size == shoulder size → match  
if chest size != shoulder size → conflict


Steps:
1. Rank people by chest → assign size  
(if we sort 100 people from smallest to largest, people 1-5 are XS, 6-25 are S, and so on up to 98-100, who are XXXL)
2. Rank people by shoulders → assign size
3. Compare the two sizes and count the people with the same size and the people with different sizes
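
The three steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the `toy` data, the column names `chest` and `shoulder`, and the helper `size_by_rank` are hypothetical and not part of ANSUR II or of the cells below, which use percentile cut points instead of ranks.

```python
import pandas as pd

# Percentile edges and size labels taken from the table above
EDGES = [0, 5, 25, 50, 75, 90, 97, 100]
SIZES = ['XS', 'S', 'M', 'L', 'XL', 'XXL', 'XXXL']

def size_by_rank(column):
    """Turn each value into its percentile rank (0-100] and bin it into a size."""
    pct = column.rank(pct=True) * 100
    return pd.cut(pct, bins=EDGES, labels=SIZES).astype(str)

# Hypothetical measurements for seven people (not real ANSUR II rows)
toy = pd.DataFrame({
    'chest':    [800, 850, 900, 950, 1000, 1050, 1100],
    'shoulder': [330, 340, 360, 350, 370, 380, 390],
})

toy['chest_size'] = size_by_rank(toy['chest'])        # step 1
toy['shoulder_size'] = size_by_rank(toy['shoulder'])  # step 2
same = toy['chest_size'] == toy['shoulder_size']      # step 3
toy['result'] = same.map({True: 'match', False: 'conflict'})

print(toy)
print(toy['result'].value_counts())
```

With so few rows some sizes stay empty; the point is only to show how a person can rank differently on the two measurements.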




In [1]:
import pandas as pd

In [2]:
female = pd.read_csv('Data/female.csv')
male = pd.read_csv('Data/male.csv')

In [3]:
print(f'for women we have (rows, columns): {female.shape}')
print(f'for men we have (rows, columns): {male.shape}')

for women we have (rows, columns): (1986, 108)
for men we have (rows, columns): (4082, 108)


## Checking the percentiles

Count how many people fall into each percentile range of a measurement.

In [4]:
def compute_percentile_ranges(column):
    # Percentile intervals, one per size (XS ... XXXL)
    ranges = [(0,5) , (5,25) , (25,50) , (50,75) , (75,90) ,(90,97) , (97,100)]

    # Map each (low, high) percentile pair to the measurement values at those percentiles.
    # Tuples are hashable, so they can be used as dictionary keys.
    # For (0, 5) this calls quantile(0/100) = quantile(0.0) and quantile(5/100) = quantile(0.05),
    # i.e. the values at the 0th and 5th percentiles.
    # For the sorted measurements 78, 80, 82, 85, 88, 90, 92, 95, 98, 100, 102, 105
    # we get (78.0, 79.1) with pandas' default linear interpolation,
    # so people measuring at least 78 and less than 79.1 are size XS.
    percentiles = {(low,high): (column.quantile(low/100), column.quantile(high/100)) for low,high in ranges}

    counts = {}

    # r is the percentile pair, e.g. (0, 5); low and high are the measurement values at those percentiles
    for r,(low,high) in percentiles.items():
        # int() turns the numpy int64 count into a plain Python int (optional).
        # Note: the strict `< high` leaves the largest value out of the last range.
        counts[r] = int(((column >=low) & (column < high)).sum())
    return counts

print(compute_percentile_ranges(female['chestcircumference']))
print(compute_percentile_ranges(female['biacromialbreadth'])) #shoulder width


print(compute_percentile_ranges(male['chestcircumference']))
print(compute_percentile_ranges(male['biacromialbreadth'])) #shoulder width



{(0, 5): 100, (5, 25): 396, (25, 50): 492, (50, 75): 499, (75, 90): 299, (90, 97): 140, (97, 100): 59}
{(0, 5): 93, (5, 25): 377, (25, 50): 477, (50, 75): 541, (75, 90): 297, (90, 97): 139, (97, 100): 61}
{(0, 5): 199, (5, 25): 810, (25, 50): 1025, (50, 75): 1012, (75, 90): 616, (90, 97): 295, (97, 100): 124}
{(0, 5): 191, (5, 25): 787, (25, 50): 989, (50, 75): 1079, (75, 90): 610, (90, 97): 303, (97, 100): 122}
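
Each dictionary above sums to one less than the number of rows (1985 of 1986 women, 4081 of 4082 men). This is consistent with the strict `column < high` check leaving out the person with the largest measurement. A minimal sketch of one way to close the last range on the right follows; the function name `count_per_range` and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def count_per_range(column, edges=(0, 5, 25, 50, 75, 90, 97, 100)):
    """Count values per percentile range; only the last range includes its upper bound."""
    cuts = [column.quantile(p / 100) for p in edges]
    counts = {}
    for i in range(len(edges) - 1):
        low, high = cuts[i], cuts[i + 1]
        is_last = i == len(edges) - 2
        below_upper = (column <= high) if is_last else (column < high)
        counts[(edges[i], edges[i + 1])] = int(((column >= low) & below_upper).sum())
    return counts

toy = pd.Series(range(1, 101))  # hypothetical measurements 1..100
counts = count_per_range(toy)
print(counts)
print('total counted:', sum(counts.values()), 'of', len(toy))
```

With this variant every row lands in exactly one range, so the counts add up to the number of rows.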


## Generate the T-shirt size chart

Example first iteration:

p = 0

p/100 = 0/100 = 0.0

data[chest_column].quantile(0.0) → smallest chest measurement

Example second iteration:

p = 5

p/100 = 0.05

data[chest_column].quantile(0.05) → value below which 5% of people fall

Example third iteration:

p = 25

.quantile(0.25) → value below which 25% of people fall

…and so on for all p in [0, 5, 25, 50, 75, 90, 97, 100].
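
The iterations described above can be tried on a short series. The twelve values below are made up for illustration; `quantile` uses pandas' default linear interpolation, so the cut points can fall between observed values.

```python
import pandas as pd

# Hypothetical chest measurements, already sorted for readability
toy = pd.Series([78, 80, 82, 85, 88, 90, 92, 95, 98, 100, 102, 105])

# One cut point per percentile, exactly like the dictionary comprehension in the next cell
cut_points = {p: toy.quantile(p / 100) for p in [0, 5, 25, 50, 75, 90, 97, 100]}

for p, value in cut_points.items():
    print(f'{p:>3}th percentile -> {value:.2f}')
```

The 0th percentile is the smallest value (78) and the 100th is the largest (105); everything in between is interpolated.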

In [5]:
def compute_size_percentile_measurents(data,chest_column,shoulder_column):
    sizes = ['XS','S','M','L','XL','XXL','XXXL']
    ranges = [0,5,25,50,75,90,97,100]

    # Measurement value at each percentile,
    # e.g. {0: value at 0th percentile, 5: value at 5th percentile, ...}
    chest_percentiles = {p: data[chest_column].quantile(p/100) for p in ranges}
    shoulder_percentiles = {p: data[shoulder_column].quantile(p/100) for p in ranges}

    # Map each T-shirt size to the chest and shoulder values at its upper percentile
    size_mappings ={}
    for i,size in enumerate(sizes): # i=0 -> 'XS', i=1 -> 'S', ...
        size_mappings[size]={
            # ranges[i+1] is the upper percentile of the size, e.g. 5 for i=0 ('XS').
            # int() truncates the value to a whole number.
            'Chest':int(chest_percentiles[ranges[i+1]]),
            'Shoulder':int(shoulder_percentiles[ranges[i+1]])
        }
    return size_mappings

print(compute_size_percentile_measurents(female,'chestcircumference','biacromialbreadth'))
print(compute_size_percentile_measurents(male,'chestcircumference','biacromialbreadth'))


{'XS': {'Chest': 824, 'Shoulder': 335}, 'S': {'Chest': 889, 'Shoulder': 353}, 'M': {'Chest': 940, 'Shoulder': 365}, 'L': {'Chest': 999, 'Shoulder': 378}, 'XL': {'Chest': 1057, 'Shoulder': 389}, 'XXL': {'Chest': 1117, 'Shoulder': 400}, 'XXXL': {'Chest': 1266, 'Shoulder': 422}}
{'XS': {'Chest': 922, 'Shoulder': 384}, 'S': {'Chest': 996, 'Shoulder': 403}, 'M': {'Chest': 1056, 'Shoulder': 415}, 'L': {'Chest': 1117, 'Shoulder': 428}, 'XL': {'Chest': 1172, 'Shoulder': 441}, 'XXL': {'Chest': 1233, 'Shoulder': 452}, 'XXXL': {'Chest': 1469, 'Shoulder': 489}}


In [6]:
# Female size chart
female_size_chart = compute_size_percentile_measurents(
    female, 'chestcircumference', 'biacromialbreadth'
)

# Male size chart
male_size_chart = compute_size_percentile_measurents(
    male, 'chestcircumference', 'biacromialbreadth'
)

print(female_size_chart)


{'XS': {'Chest': 824, 'Shoulder': 335}, 'S': {'Chest': 889, 'Shoulder': 353}, 'M': {'Chest': 940, 'Shoulder': 365}, 'L': {'Chest': 999, 'Shoulder': 378}, 'XL': {'Chest': 1057, 'Shoulder': 389}, 'XXL': {'Chest': 1117, 'Shoulder': 400}, 'XXXL': {'Chest': 1266, 'Shoulder': 422}}


In [None]:
def assign_size_from_chart(value, size_chart, measurement):
    """
    value       : one measurement (chest or shoulder)
    size_chart  : dictionary from compute_size_percentile_measurents
    measurement : 'Chest' or 'Shoulder'
    """
    for size, limits in size_chart.items():
        if value <= limits[measurement]:
            return size
    return 'XXXL'  # fallback if measurement > largest size
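
A quick spot check of the lookup, using the female chart values printed earlier in this notebook (presumably in millimetres). The function is repeated here only so the snippet runs on its own; the three test values are made up.

```python
# Female chart as printed above: upper bound per size
female_size_chart = {
    'XS': {'Chest': 824, 'Shoulder': 335}, 'S': {'Chest': 889, 'Shoulder': 353},
    'M': {'Chest': 940, 'Shoulder': 365}, 'L': {'Chest': 999, 'Shoulder': 378},
    'XL': {'Chest': 1057, 'Shoulder': 389}, 'XXL': {'Chest': 1117, 'Shoulder': 400},
    'XXXL': {'Chest': 1266, 'Shoulder': 422},
}

def assign_size_from_chart(value, size_chart, measurement):
    # Sizes are visited from smallest to largest; the first bound that holds wins
    for size, limits in size_chart.items():
        if value <= limits[measurement]:
            return size
    return 'XXXL'  # fallback if the value exceeds every bound

print(assign_size_from_chart(824, female_size_chart, 'Chest'))   # exactly on the XS bound -> XS
print(assign_size_from_chart(825, female_size_chart, 'Chest'))   # just above it -> S
print(assign_size_from_chart(2000, female_size_chart, 'Chest'))  # above every bound -> XXXL
```

The lookup relies on the dictionary keeping its insertion order (smallest size first), which Python dictionaries do since version 3.7.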


In [None]:
#Assigning chest & shoulder sizes to each person

# Female
female['chest_size'] = female['chestcircumference'].apply(
    lambda x: assign_size_from_chart(x, female_size_chart, 'Chest')
)
female['shoulder_size'] = female['biacromialbreadth'].apply(
    lambda x: assign_size_from_chart(x, female_size_chart, 'Shoulder')
)

# Male
male['chest_size'] = male['chestcircumference'].apply(
    lambda x: assign_size_from_chart(x, male_size_chart, 'Chest')
)
male['shoulder_size'] = male['biacromialbreadth'].apply(
    lambda x: assign_size_from_chart(x, male_size_chart, 'Shoulder')
)


In [41]:
def match_or_conflict(row):
    if row['chest_size'] == row['shoulder_size']:
        return 'match'
    else:
        return 'conflict'


In [42]:
# Female
female['result'] = female.apply(match_or_conflict, axis=1)

# Male
male['result'] = male.apply(match_or_conflict, axis=1)


In [43]:
# Female
print("Female results:")
print(female['result'].value_counts())

# Male
print("\nMale results:")
print(male['result'].value_counts())


Female results:
result
conflict    1517
match        469
Name: count, dtype: int64

Male results:
result
conflict    2914
match       1168
Name: count, dtype: int64
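
Only 469 of 1986 women (about 24%) and 1168 of 4082 men (about 29%) get the same size from both measurements. To see how far apart the two sizes are, one option is a cross-tabulation plus the gap in size steps. The sketch below uses made-up labels in place of the real `chest_size` and `shoulder_size` columns.

```python
import pandas as pd

SIZES = ['XS', 'S', 'M', 'L', 'XL', 'XXL', 'XXXL']

# Hypothetical labels standing in for the chest_size / shoulder_size columns
toy = pd.DataFrame({
    'chest_size':    ['S', 'M', 'M', 'L', 'XL', 'L'],
    'shoulder_size': ['S', 'L', 'M', 'M', 'XL', 'XXL'],
})

# Rows: chest size, columns: shoulder size; the diagonal holds the matches
table = pd.crosstab(toy['chest_size'], toy['shoulder_size'])
table = table.reindex(index=SIZES, columns=SIZES, fill_value=0)

# Signed gap in size steps: positive means the shoulder size is the larger one
gap = toy['shoulder_size'].map(SIZES.index) - toy['chest_size'].map(SIZES.index)

print(table)
print(gap.value_counts().sort_index())
```

If most conflicts are only one step apart, a simple tie-break rule may be enough; large gaps would suggest the two measurements disagree more fundamentally.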


In [48]:
female[['chestcircumference','biacromialbreadth','chest_size','shoulder_size','result']].to_csv('female_sizes_clean.csv', index=False)

male[['chestcircumference','biacromialbreadth','chest_size','shoulder_size','result']].to_csv('male_sizes_clean.csv', index=False)


In [None]:
def get_size(data,size_chart):
    # Per size: number of people for whom exactly this one size fits both measurements
    matches ={size :0 for size in size_chart.keys()}
    ties=0  # people for whom more than one size fits

    for _, row in data.iterrows():
        possible_sizes=[]

        for size,measurements in size_chart.items():
            # The chart keys are capitalized: 'Shoulder' and 'Chest'
            if(row['biacromialbreadth'] <= measurements['Shoulder'] and 
               row['chestcircumference'] <= measurements['Chest']):
                possible_sizes.append(size)

        if len(possible_sizes) ==1 :
            matches[possible_sizes[0]] +=1
        elif len(possible_sizes) >1:
            ties +=1

    return matches,ties



female_matches, female_ties = get_size(female,female_size_chart)
male_matches , male_ties = get_size(male,male_size_chart)

print('Female matches : ' , female_matches)
print('Female ties : ' , female_ties)

print('Male matches : ' , male_matches)
print('Male ties : ' , male_ties)
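
Since every chart entry is an upper bound, a person who fits one size also fits every larger size, so `get_size` is likely to count most people as ties. One possible way to resolve this, assuming we want the smallest size that accommodates both measurements, is sketched below. The helper `smallest_fitting_size` is hypothetical, and the chart is the female chart printed earlier.

```python
# Female chart as printed earlier in this notebook: upper bound per size
female_size_chart = {
    'XS': {'Chest': 824, 'Shoulder': 335}, 'S': {'Chest': 889, 'Shoulder': 353},
    'M': {'Chest': 940, 'Shoulder': 365}, 'L': {'Chest': 999, 'Shoulder': 378},
    'XL': {'Chest': 1057, 'Shoulder': 389}, 'XXL': {'Chest': 1117, 'Shoulder': 400},
    'XXXL': {'Chest': 1266, 'Shoulder': 422},
}

def smallest_fitting_size(chest, shoulder, size_chart):
    """Return the first (smallest) size whose bounds cover both measurements."""
    for size, limits in size_chart.items():
        if chest <= limits['Chest'] and shoulder <= limits['Shoulder']:
            return size
    return None  # larger than every bound in the chart

print(smallest_fitting_size(850, 360, female_size_chart))   # chest says S, shoulder says M -> M
print(smallest_fitting_size(824, 335, female_size_chart))   # both on the XS bound -> XS
print(smallest_fitting_size(1300, 400, female_size_chart))  # chest above every bound -> None
```

This amounts to taking the larger of the chest size and the shoulder size. Whether that is the right tie-break for a T-shirt is a design choice, not something the data decides.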