# Clustering Masculinity data

This project is a brief, beginner-level introduction to the KMeans algorithm, applied to masculinity survey data.

## Import required modules

In [47]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

## Import the dataset and run the initial data inspection

In [48]:
df = pd.read_csv(r"C:\Users\pjhop\OneDrive\Documents\Programming & Coding\Python\Projects\Datasets\masculinity.csv", index_col=0)
df.head()

Unnamed: 0,StartDate,EndDate,q0001,q0002,q0004_0001,q0004_0002,q0004_0003,q0004_0004,q0004_0005,q0004_0006,...,q0035,q0036,race2,racethn4,educ3,educ4,age3,kids,orientation,weight
1,5/10/18 4:01,5/10/18 4:06,Somewhat masculine,Somewhat important,Not selected,Not selected,Not selected,Pop culture,Not selected,Not selected,...,Middle Atlantic,Windows Desktop / Laptop,Non-white,Hispanic,College or more,College or more,35 - 64,No children,Gay/Bisexual,1.714026
2,5/10/18 6:30,5/10/18 6:53,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Not selected,...,East North Central,iOS Phone / Tablet,White,White,Some college,Some college,65 and up,Has children,Straight,1.24712
3,5/10/18 7:02,5/10/18 7:09,Very masculine,Not too important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Other (please specify),...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,Has children,Straight,0.515746
4,5/10/18 7:27,5/10/18 7:31,Very masculine,Not too important,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,Some college,Some college,65 and up,Has children,No answer,0.60064
5,5/10/18 7:35,5/10/18 7:42,Very masculine,Very important,Not selected,Not selected,Other family members,Not selected,Not selected,Not selected,...,East North Central,Windows Desktop / Laptop,White,White,College or more,College or more,35 - 64,No children,Straight,1.0334


In [49]:
print(df.shape)
print(df.columns)
print('Number of responses: ', len(df))

(1189, 97)
Index(['StartDate', 'EndDate', 'q0001', 'q0002', 'q0004_0001', 'q0004_0002',
       'q0004_0003', 'q0004_0004', 'q0004_0005', 'q0004_0006', 'q0005',
       'q0007_0001', 'q0007_0002', 'q0007_0003', 'q0007_0004', 'q0007_0005',
       'q0007_0006', 'q0007_0007', 'q0007_0008', 'q0007_0009', 'q0007_0010',
       'q0007_0011', 'q0008_0001', 'q0008_0002', 'q0008_0003', 'q0008_0004',
       'q0008_0005', 'q0008_0006', 'q0008_0007', 'q0008_0008', 'q0008_0009',
       'q0008_0010', 'q0008_0011', 'q0008_0012', 'q0009', 'q0010_0001',
       'q0010_0002', 'q0010_0003', 'q0010_0004', 'q0010_0005', 'q0010_0006',
       'q0010_0007', 'q0010_0008', 'q0011_0001', 'q0011_0002', 'q0011_0003',
       'q0011_0004', 'q0011_0005', 'q0012_0001', 'q0012_0002', 'q0012_0003',
       'q0012_0004', 'q0012_0005', 'q0012_0006', 'q0012_0007', 'q0013',
       'q0014', 'q0015', 'q0017', 'q0018', 'q0019_0001', 'q0019_0002',
       'q0019_0003', 'q0019_0004', 'q0019_0005', 'q0019_0006', 'q0019_0007',
       'q

In [50]:
missing_counts = df.isnull().sum()
missing_columns = missing_counts[missing_counts > 0]
print(missing_columns)

q0010_0001      613
q0010_0002      613
q0010_0003      613
q0010_0004      613
q0010_0005      613
q0010_0006      613
q0010_0007      613
q0010_0008      613
q0011_0001      613
q0011_0002      613
q0011_0003      613
q0011_0004      613
q0011_0005      613
q0012_0001      613
q0012_0002      613
q0012_0003      613
q0012_0004      613
q0012_0005      613
q0012_0006      613
q0012_0007      613
q0013          1160
q0014           613
q0015           704
q0019_0001      264
q0019_0002      264
q0019_0003      264
q0019_0004      264
q0019_0005      264
q0019_0006      264
q0019_0007      264
q0034             2
q0035            12
q0036             2
educ3             1
educ4             1
age3              1
kids              6
orientation       1
weight            1
dtype: int64


Many of these missing values come from participants skipping questions or declining to answer them.

## Data manipulation

In [51]:
cols_to_map = ["q0007_0001", "q0007_0002", "q0007_0003", "q0007_0004",
       "q0007_0005", "q0007_0006", "q0007_0007", "q0007_0008", "q0007_0009",
       "q0007_0010", "q0007_0011"]

for col_name in cols_to_map:
    df[col_name] = df[col_name].map({"Often": 4, "Sometimes": 3, "Rarely" : 2, "Never, but open to it": 1, 
                                    "Never, and not open to it": 0})
    

In [53]:
df['q0001'] = df['q0001'].map({'Very masculine':3, 'Somewhat masculine':2, 'Not very masculine':1, 'Not at all masculine':0})

In [54]:
relevant_cols = cols_to_map
rows_to_cluster = df.dropna(subset=relevant_cols)
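
A note on why `dropna` matters here even though the question 7 columns did not appear in the missing-value check earlier: `Series.map` with a dictionary turns any value that is not one of its keys into `NaN`. The toy example below illustrates this behaviour. The column name `q_demo` and the answer string `"No answer"` are assumptions for illustration, not values confirmed from the survey file.

```python
import pandas as pd

# Toy answers; "No answer" is a hypothetical unmapped value, not taken from the dataset.
answers = pd.DataFrame({"q_demo": ["Often", "Rarely", "No answer", "Sometimes"]})
scale = {"Often": 4, "Sometimes": 3, "Rarely": 2,
         "Never, but open to it": 1, "Never, and not open to it": 0}

answers["q_demo"] = answers["q_demo"].map(scale)  # values missing from the dict become NaN
n_unmapped = int(answers["q_demo"].isna().sum())
kept = answers.dropna(subset=["q_demo"])          # surviving rows keep their original index labels

print(n_unmapped)
print(kept.index.tolist())
```

Because the surviving rows keep their original index labels, the index of the filtered frame can have gaps after this step.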

## Fit the K-Means algorithm

KMeans is an unsupervised machine learning algorithm that groups similar datapoints into k clusters.

Here is how it works:
1. First, you need to know how many clusters you are grouping the data into.
2. The algorithm selects 'k' datapoints to be the initial centers of each cluster. 
3. Each datapoint is assigned to the cluster with the nearest center point.
4. The center of each cluster is then updated to the mean of the datapoints assigned to it in step 3.
5. Steps 3 and 4 repeat until either the centers no longer move or the maximum number of iterations is reached.
6. The algorithm then outputs the clusters.
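
The steps above can be sketched in a few lines of NumPy. This is a toy illustration only, not how scikit-learn implements `KMeans`; the function name `kmeans_sketch`, its parameters, and the sample points are all made up for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Toy k-means loop; illustrative only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct datapoints as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                new_centers[j] = members.mean(axis=0)
        # Step 5: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Step 6: output the centers and the cluster label of each point.
    return centers, labels

# Two well-separated toy blobs (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(X, k=2)
print(labels)
```

Which blob ends up as cluster 0 depends on the random starting centers, so only the grouping is meaningful, not the label numbers.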

In [36]:
km = KMeans(n_clusters=2)
km.fit(rows_to_cluster[relevant_cols])

KMeans(n_clusters=2)

In [37]:
print(km.cluster_centers_)

[[1.91052632 1.85263158 0.95789474 1.66578947 0.53947368 2.88421053
  0.08421053 2.80789474 2.17894737 0.60789474 1.66315789]
 [2.85381026 2.83359253 2.83981337 2.44012442 0.71695179 2.74339036
  0.52410575 2.97045101 2.80248834 1.53654743 2.39502333]]
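
Each row of this array is one cluster center and each column corresponds to one of the fitted features, in order. The array is easier to read with column names attached. A minimal sketch, using a small made-up frame in place of `rows_to_cluster[relevant_cols]` (the column names `q_a` and `q_b` and their values are hypothetical); `n_init` and `random_state` are set only so the toy run is repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the mapped question 7 columns.
toy = pd.DataFrame({"q_a": [0, 1, 0, 4, 3, 4],
                    "q_b": [1, 0, 0, 3, 4, 4]})

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# One row per cluster, one named column per feature.
centers = pd.DataFrame(km_toy.cluster_centers_, columns=toy.columns).round(2)
print(centers)
```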


In [43]:
labels = km.labels_

In [60]:
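# Collect respondent IDs for each cluster.
# Note: labels[i] is the label of the i-th row of rows_to_cluster, so i+1
# equals the df index label only if dropna() above removed no rows.
cluster_zero_df = []
cluster_one_df = []
for i in range(len(labels)):
    if labels[i] == 0:
        cluster_zero_df.append(i+1)
    elif labels[i] == 1:
        cluster_one_df.append(i+1)
    else:
        print('Error')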

In [61]:
print(cluster_zero_df)

[3, 4, 6, 7, 9, 10, 12, 17, 18, 19, 26, 28, 32, 41, 46, 47, 49, 50, 51, 52, 53, 55, 56, 60, 61, 62, 63, 66, 72, 76, 82, 83, 84, 85, 86, 87, 88, 90, 91, 92, 93, 94, 97, 102, 103, 107, 109, 112, 114, 115, 119, 124, 125, 126, 127, 128, 130, 135, 137, 138, 139, 144, 150, 168, 172, 175, 180, 181, 182, 184, 186, 190, 192, 193, 195, 196, 199, 203, 206, 223, 225, 227, 230, 231, 234, 235, 242, 244, 254, 255, 258, 263, 266, 269, 273, 277, 278, 279, 285, 290, 293, 294, 295, 299, 304, 306, 311, 313, 317, 320, 328, 329, 330, 331, 336, 339, 340, 342, 347, 348, 351, 353, 354, 356, 362, 363, 371, 375, 376, 377, 378, 379, 382, 383, 384, 388, 394, 395, 397, 401, 402, 403, 404, 410, 413, 414, 417, 432, 434, 436, 438, 439, 440, 442, 444, 445, 446, 453, 455, 456, 457, 460, 461, 462, 463, 469, 470, 474, 475, 476, 477, 478, 479, 481, 483, 484, 487, 489, 493, 494, 495, 497, 500, 501, 505, 507, 510, 511, 513, 515, 516, 525, 529, 533, 538, 539, 542, 543, 545, 546, 547, 548, 549, 552, 554, 555, 559, 561, 562, 56

In [62]:
print(cluster_one_df)

[1, 2, 5, 8, 11, 13, 14, 15, 16, 20, 21, 22, 23, 24, 25, 27, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 42, 43, 44, 45, 48, 54, 57, 58, 59, 64, 65, 67, 68, 69, 70, 71, 73, 74, 75, 77, 78, 79, 80, 81, 89, 95, 96, 98, 99, 100, 101, 104, 105, 106, 108, 110, 111, 113, 116, 117, 118, 120, 121, 122, 123, 129, 131, 132, 133, 134, 136, 140, 141, 142, 143, 145, 146, 147, 148, 149, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 169, 170, 171, 173, 174, 176, 177, 178, 179, 183, 185, 187, 188, 189, 191, 194, 197, 198, 200, 201, 202, 204, 205, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 224, 226, 228, 229, 232, 233, 236, 237, 238, 239, 240, 241, 243, 245, 246, 247, 248, 249, 250, 251, 252, 253, 256, 257, 259, 260, 261, 262, 264, 265, 267, 268, 270, 271, 272, 274, 275, 276, 280, 281, 282, 283, 284, 286, 287, 288, 289, 291, 292, 296, 297, 298, 300, 301, 302, 303, 305, 307, 308, 309, 310, 312, 314, 315, 316, 318, 319, 321, 322,

In [63]:
cluster_zero_mean = df.q0001[cluster_zero_df].mean()

In [64]:
cluster_one_mean = df.q0001[cluster_one_df].mean()

In [65]:
print(cluster_zero_mean)
print(cluster_one_mean)

2.2705570291777186
2.2574568288854002


Comparing how respondents in the two clusters rated their own masculinity (q0001), the mean values are extremely close (about 2.27 versus 2.26). Based on this comparison, clustering on the question 7 features does not separate respondents by self-rated masculinity.
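
One thing worth checking in this comparison: `km.labels_` is ordered by row position in `rows_to_cluster`, so building IDs as `i+1` matches the `df` index only if `dropna` removed no rows. A hedged sketch of an index-aligned alternative follows, on made-up data; the frame, the column names `feature` and `self_rating`, and the label array are hypothetical stand-ins for `df`, `q0001`, and `km.labels_`. Applied to the real data, this approach may give means that differ from those printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: index starts at 1, one row has a missing feature.
toy = pd.DataFrame({"feature": [4.0, 0.0, np.nan, 3.0, 1.0],
                    "self_rating": [3, 1, 2, 3, 0]},
                   index=[1, 2, 3, 4, 5])
rows = toy.dropna(subset=["feature"])  # index labels 1, 2, 4, 5 remain

# Stand-in for km.labels_: one label per row of `rows`, in row order.
labels = np.array([1, 0, 1, 0])

# Attach labels by index so each label stays with its own respondent.
cluster = pd.Series(labels, index=rows.index, name="cluster")
means = toy.loc[rows.index, "self_rating"].groupby(cluster).mean()
print(means)
```

Grouping by a Series that shares the filtered frame's index avoids any manual bookkeeping of row positions.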