## Masculinity Survey
#### Ideas about masculinity

masculinity-survey.csv contains the results of a survey of 1,615 adult men conducted by SurveyMonkey in partnership with FiveThirtyEight and WNYC Studios from May 10-22, 2018. The modeled error estimate for this survey is plus or minus 2.5 percentage points. The percentages have been weighted for age, race, education, and geography using the Census Bureau’s American Community Survey to reflect the demographic composition of the United States age 18 and over. Crosstabs with fewer than 100 respondents have been left blank because responses would not be statistically significant.

raw-responses.csv contains all 1,615 responses to the survey including the weights for each response. Responses to open-ended questions have been omitted, including those where a respondent explained what they meant by selecting the "other" option in response to a question.

https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/masculinity-survey.pdf

The columns of the raw responses were named numerically ('question 0001', etc.). The data providers had also already created dummy-variable-like columns for any question that allowed respondents to select more than one response.

The structure of the survey created an interesting problem in terms of handling null values: the survey had 20 questions that all respondents saw, plus up to 10 additional questions that only a subset of respondents saw, based on their responses to previous questions. These were dispersed throughout the survey. So if a respondent did not see a question, what should I do with that "null" value?

In this case, it didn't make sense to me to fill it with a value (mean, most common category, etc.). That would create too much 'fake' data and could skew the results.

For example, the biggest split happened at the question asking about employment: of 1,615 survey takers, 880 were employed full-time or part-time, 732 were not currently employed, and 3 gave no answer. Thus, if I wanted to predict on any of the questions seen only by employed men, I would be working with 880 values, not 1,615.
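The idea of working only with the respondents who saw a question can be sketched as follows. This is a minimal illustration on a made-up toy frame, assuming that a question a respondent never saw is stored as NaN; the column names mirror the survey, but the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): employed respondents saw the follow-up question,
# everyone else has NaN because the question was never shown to them.
toy = pd.DataFrame({
    'Employment': ['Employed, working full-time', 'Not employed-retired',
                   'Employed, working part-time', 'Not employed, student'],
    'me_too_work_behavior': ['Yes', np.nan, 'No', np.nan],
})

# Keep only the rows where the conditional question has a value,
# instead of imputing answers for people who never saw it.
saw_question = toy[toy['me_too_work_behavior'].notna()]

print(len(toy), len(saw_question))
```

Subsetting this way shrinks the sample for that target, but every remaining value is a real answer.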


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [23]:
responses = pd.read_excel('masculinity-responses-renamed.xlsx')

In [25]:
responses.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1615 entries, 1 to 1615
Data columns (total 97 columns):
StartDate                    1615 non-null datetime64[ns]
EndDate                      1615 non-null datetime64[ns]
self_manly                   1615 non-null object
others_manly                 1615 non-null object
source_ideas_father          1615 non-null object
source_ideas_mother          1615 non-null object
source_ideas_family          1615 non-null object
source_ideas_popculture      1615 non-null object
source_ideas_friends         1615 non-null object
source_ideas_other           1615 non-null object
societal_pressure            1615 non-null object
prof_advice                  1615 non-null object
personal_advice              1615 non-null object
phys_aff                     1615 non-null object
cry                          1615 non-null object
phys_fight                   1615 non-null object
sex_women                    1615 non-null object
sex_men                    

In [26]:
responses.shape

(1615, 97)

In [24]:
responses['me_too_awareness'].value_counts()

A lot             432
Some              225
Nothing at all    137
Only a little      85
No answer           1
Name: me_too_awareness, dtype: int64

In [25]:
responses = responses[responses.me_too_awareness != 'No answer']

In [27]:
responses['me_too_work_behavior'].value_counts()

No           489
Yes          247
No answer      6
Name: me_too_work_behavior, dtype: int64

In [28]:
responses['me_too_romantic_behavior'].value_counts()

No           1442
Yes           144
No answer      29
Name: me_too_romantic_behavior, dtype: int64

I'd like to make predictions on the two columns above. They are responses to the questions: <br>
> As a man, would you say you think about your behavior differently at work in the wake of #MeToo? <br>
> Have you changed your behavior in romantic relationships in the wake of the #MeToo movement?

### 3 formats of survey question

The survey had 3 distinct formats of questions, each of which requires a different approach to cleaning the data. Also, none of the questions were required, so respondents could skip any of them. If a respondent skipped a question, it was recorded as 'No answer' **(not null/NaN)**.
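Because skips are stored as the string 'No answer', the usual null checks will not find them. A small sketch on made-up rows (only the 'No answer' label and the column names come from the survey) shows the difference:

```python
import pandas as pd

# Toy frame (hypothetical rows) with skipped questions recorded as 'No answer'.
toy = pd.DataFrame({
    'self_manly': ['Very masculine', 'No answer', 'Somewhat masculine'],
    'others_manly': ['No answer', 'No answer', 'Very important'],
})

# isna() does not see skipped questions, because 'No answer' is an ordinary string.
null_counts = toy.isna().sum()

# Comparing against the literal string counts the skips per column.
skipped_counts = (toy == 'No answer').sum()

print(null_counts.sum())
print(skipped_counts)
```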

Here is the breakdown of answers to the first question: 
>In general, how masculine or 'manly' do you feel?

In [29]:
responses['self_manly'].value_counts()

Somewhat masculine      826
Very masculine          612
Not very masculine      131
Not at all masculine     32
No answer                14
Name: self_manly, dtype: int64

With these 'multiple choice, single answer' questions, I needed to address the respondents who did not answer the question.
I didn't feel comfortable filling the value with the most common response, 'Somewhat masculine', so I decided to drop those rows.

In [26]:
responses = responses[responses.self_manly != 'No answer']

And the same with the next question:
>How important is it to you that others see you as masculine?

In [3]:
responses['others_manly'].value_counts()

Somewhat important      628
Not too important       541
Not at all important    240
Very important          197
No answer                 9
Name: others_manly, dtype: int64

In [27]:
responses = responses[responses.others_manly != 'No answer']

I also needed to decide how to make these parseable: should I map each response to a numeric value, or create dummy columns?
<br>The response options were on a graded scale, from 'Not at all' to 'Very', so I decided to map them to numeric values (0-3).

In [28]:
responses['self_manly'] = responses['self_manly'].map({'Not at all masculine':0, 
                                                       'Not very masculine':1, 
                                                       'Somewhat masculine':2,
                                                        'Very masculine': 3})
responses['others_manly'] = responses['others_manly'].map({'Not at all important':0, 
                                                           'Not too important':1, 
                                                           'Somewhat important':2, 
                                                           'Very important':3})
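To make the trade-off between the two encodings concrete, here is a small sketch on a toy Series (the three answers are made up; the scale dictionary is the one used above):

```python
import pandas as pd

answers = pd.Series(['Very masculine', 'Not very masculine', 'Somewhat masculine'])

# Ordinal mapping: one numeric column that preserves the order of the scale.
scale = {'Not at all masculine': 0, 'Not very masculine': 1,
         'Somewhat masculine': 2, 'Very masculine': 3}
ordinal = answers.map(scale)

# Dummy columns: one indicator column per observed response, with no ordering implied.
dummies = pd.get_dummies(answers)

print(ordinal.tolist())
print(dummies.columns.tolist())
```

Note that `Series.map` with a dictionary returns NaN for any label missing from the dictionary, which is one reason the 'No answer' rows are dropped before mapping.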

What happens if I drop all rows that skipped any of the multiple-choice, single-answer questions (i.e. ignoring the 'select all that apply' questions for now)? How many rows am I left with?

In [30]:
responses.shape

(1596, 97)

In [None]:
# Draft: drop rows that skipped any of the single-answer questions,
# using the columns this notebook filters one at a time further down.
single_answer_cols = ['societal_pressure', 'prof_advice', 'personal_advice', 'phys_aff', 'cry', 'phys_fight',
                      'sex_women', 'sex_men', 'sports', 'workout', 'therapist', 'lonely']
for col in single_answer_cols:
    responses = responses[responses[col] != 'No answer']

There were also questions that allowed the respondent to select multiple answers: questions that ended with 'Select all that apply'. These were represented by a set of columns, one for each option (kind of like dummy columns).
The question 'Where have you gotten your ideas about what it means to be a good man?' had the following response options:

-  Father or father figure
-  Mother or mother figure
-  Friends
-  Other family members
-  Pop culture
-  Other

Each of the options above had its own column in the raw responses. Here is the value breakdown for the column representing the respondent selecting 'Father or father figure(s)':

In [4]:
responses['source_ideas_father'].value_counts()

Father or father figure(s)    1103
Not selected                   498
Name: source_ideas_father, dtype: int64

For each of these types of questions, I need to check the columns for all possible answers, and verify that at least one option was selected. If all columns contain 'Not selected', then I know the question was skipped.
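The check described above can be sketched directly on the string columns, before any mapping. This is a minimal illustration on made-up rows with only two of the six option columns:

```python
import pandas as pd

# Toy frame (hypothetical rows) in the raw format: option label or 'Not selected'.
toy = pd.DataFrame({
    'source_ideas_father': ['Father or father figure(s)', 'Not selected', 'Not selected'],
    'source_ideas_mother': ['Not selected', 'Not selected', 'Mother or mother figure(s)'],
})
option_cols = ['source_ideas_father', 'source_ideas_mother']

# True where at least one option column holds something other than 'Not selected'.
answered = (toy[option_cols] != 'Not selected').any(axis=1)

print(answered.tolist())
```

The second row has 'Not selected' in every option column, so it is the only one flagged as skipped.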

I also already know that I want to drop several questions, so I will do that first.

In [39]:
responses = responses.drop(['harass_no_response_reason', 'pay_right_thing', 
                           'pay_make_more', 'pay_feel_good', 'pay_expectation', 
                           'pay_initiator_obligation', 'pay_test_share', 'pay_other',
                            'children_u18', 'children_18+', 'children_none', 'orientation', 
                           'device', 'race2', 'racethn4', 'educ3', 'educ4', 'weight'], axis = 1)

In [6]:
responses.shape

(1601, 80)

Next I'm going to create a new column flagging any rows that did not select any of the options. Before I do this, I need to map the column values using this pattern: 
<br> 
>Response selected = 1 (i.e. cell value is 'Father or father figure(s)') <br>
>Not selected = 0 

In [40]:
cols = ['source_ideas_father', 'source_ideas_mother', 'source_ideas_friends', 'source_ideas_family','source_ideas_popculture','source_ideas_other']

for col in cols:
    responses[col] = responses[col].map({'Not selected': 0})

In [24]:
responses['source_ideas_father'].value_counts()

0.0    498
Name: source_ideas_father, dtype: int64

In [25]:
responses['source_ideas_father'].head()

1    0.0
2    NaN
3    NaN
4    NaN
5    0.0
Name: source_ideas_father, dtype: float64

In [41]:
responses[cols] = responses[cols].fillna(1)

In [42]:
responses[cols].head()

Unnamed: 0,source_ideas_father,source_ideas_mother,source_ideas_friends,source_ideas_family,source_ideas_popculture,source_ideas_other
1,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0
4,1.0,1.0,0.0,1.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0


In [43]:
responses['source_ideas_father'].value_counts()

1.0    1103
0.0     498
Name: source_ideas_father, dtype: int64
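One caveat with mapping 'Not selected' to 0 and then filling everything else with 1: a value that was genuinely missing (for example, a question the respondent never saw) would also become 1. The sketch below, on a made-up Series, contrasts that pattern with an alternative that keeps missing values missing. It is an illustration, not a change to the cleaning above.

```python
import numpy as np
import pandas as pd

col = pd.Series(['Father or father figure(s)', 'Not selected', np.nan])

# Pattern used above: anything that is not 'Not selected' ends up as 1,
# which includes values that were missing to begin with.
map_then_fill = col.map({'Not selected': 0}).fillna(1)

# Alternative sketch: compare against 'Not selected', then restore NaN
# wherever the original value was missing.
keeps_missing = col.ne('Not selected').astype(float).where(col.notna())

print(map_then_fill.tolist())
print(keeps_missing.tolist())
```

For the `source_ideas_*` columns this distinction does not matter, since `info()` reports no nulls in them, but it may matter for questions shown only to a subset of respondents.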

In [44]:
responses['answered_source_ideas'] = responses[cols].sum(axis=1) > 0

In [45]:
responses['answered_source_ideas'].value_counts()

True     1593
False       8
Name: answered_source_ideas, dtype: int64

Which of the following do you worry about on a daily or near daily basis?
>Your height
<br>Your weight
<br>Your hair or hairline
<br>Your physique
<br>Appearance of your genitalia
<br>Your clothing/ style
<br>Sexual performance or amount of sex
<br>Your mental health
<br>Your physical health
<br>Your finances including current or future assets and debt
<br>Your ability to provide for your current or anticipated family

In which of the following ways would you say it's an **advantage** to be a man at work right now?
>Men make more money
<br>Men are taken more seriously
<br>Men have more choice
<br>Men have more promotion/ professional development opportunities
<br>Men are explicitly praised more often
<br>Men generally have more support from their managers
<br>Other (Please specify)
<br>None of the above

In which of the following ways would you say it's a **disadvantage** to be a man at work right now?
>Managers want to hire and promote women
<br>Greater risk of being accused of sexual harassment
<br>Greater risk of being accused of being sexist or racist
<br>Other (please specify)
<br>None of the above

In [46]:
worry_cols = ['worry_height', 'worry_weight', 'worry_hair', 'worry_physique', 'worry_genitalia', 'worry_style', 'worry_sex', 'worry_mental_health', 'worry_phys_health', 'worry_finances', 'worry_provide', 'no_worries']
male_adv_cols = ['men_earn_more', 'men_taken_serious', 'men_more_choice', 'men_more_promo', 'men_more_praise', 'men_more_support', 'no_male_advantages', 'men_advantage_other']
male_disadv_cols = ['hire_women', 'sex_harass', 'sexist_racist', 'men_disadvantage_none', 'men_disadvantage_other']
harass_cols = ['harass_confront', 'harass_HR', 'harass_manager', 'harass_support', 'harass_no_response', 'harass_never_seen', 'harass_other']
int_cols = ['int_body_lang', 'int_verb_conf', 'int_phys_move', 'int_diff_sit', 'int_unclear', 'int_other']
bound_cols = ['bound_wonder', 'bound_talk', 'bound_contact', 'bound_none']

In [50]:
col_list = [worry_cols, male_adv_cols, male_disadv_cols, harass_cols, int_cols, bound_cols]

for group in col_list:
    # Map 'Not selected' to 0; every other value becomes NaN for now
    for col in group:
        responses[col] = responses[col].map({'Not selected': 0})

    # Anything that was not 'Not selected' counts as a selected option
    responses[group] = responses[group].fillna(1)

In [53]:
responses['answered_worry'] = responses[worry_cols].sum(axis=1) > 0
responses['answered_male_advantages'] = responses[male_adv_cols].sum(axis=1) > 0
responses['answered_male_disadv'] = responses[male_disadv_cols].sum(axis=1) > 0
responses['answered_harass'] = responses[harass_cols].sum(axis=1) > 0
responses['answered_int'] = responses[int_cols].sum(axis=1) > 0
responses['answered_bound'] = responses[bound_cols].sum(axis=1) > 0
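The same flags can be built from a dictionary, which also makes it easy to count how many respondents skipped each question. A minimal sketch on a toy frame that is assumed to be already mapped to 1/0; the values and the shortened column groups are hypothetical:

```python
import pandas as pd

# Toy frame (hypothetical values) already mapped to 1 = selected, 0 = not selected.
toy = pd.DataFrame({
    'worry_height': [1, 0, 0],
    'worry_weight': [0, 0, 1],
    'bound_wonder': [0, 0, 0],
    'bound_talk':   [1, 0, 0],
})

# One entry per question: flag name -> the option columns that belong to it.
groups = {
    'answered_worry': ['worry_height', 'worry_weight'],
    'answered_bound': ['bound_wonder', 'bound_talk'],
}

for flag, option_cols in groups.items():
    toy[flag] = toy[option_cols].sum(axis=1) > 0

# Number of respondents who selected nothing for each question.
skipped = {flag: int((~toy[flag]).sum()) for flag in groups}
print(skipped)
```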

#### Drop rows with 'No answer' values, or treat 'No answer' as another category ('Not comfortable providing answer')?
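The two options in this heading can be compared on a toy column. In this sketch the answer labels 'Often' and 'Never' are made up for illustration; only 'No answer' and the column name `cry` come from the survey data.

```python
import pandas as pd

# Toy column (hypothetical answer labels, apart from 'No answer').
toy = pd.DataFrame({'cry': ['Often', 'No answer', 'Never', 'No answer']})

# Option 1: drop the rows that skipped the question.
dropped = toy[toy['cry'] != 'No answer']

# Option 2: keep every row and let 'No answer' become its own indicator column.
encoded = pd.get_dummies(toy['cry'], prefix='cry')

print(len(dropped), encoded.columns.tolist())
```

Option 2 keeps the full sample at the cost of losing the ordered scale for that question, since dummy columns carry no ordering.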

In [19]:
responses = responses[responses.societal_pressure != 'No answer']
responses.shape

(1591, 80)

In [20]:
responses = responses[responses.prof_advice != 'No answer']
responses.shape

(1579, 80)

In [21]:
responses = responses[responses.personal_advice != 'No answer']
responses.shape

(1568, 80)

In [22]:
responses = responses[responses.phys_aff != 'No answer']
responses.shape

(1562, 80)

In [23]:
responses = responses[responses.cry != 'No answer']
responses.shape

(1481, 80)

> 81 men chose not to respond to the question about crying

In [24]:
responses = responses[responses.phys_fight != 'No answer']
responses.shape

(1478, 80)

In [25]:
responses = responses[responses.sex_women != 'No answer']
responses.shape

(1458, 80)

In [26]:
responses = responses[responses.sex_men != 'No answer']
responses.shape

(1455, 80)

In [27]:
responses = responses[responses.sports != 'No answer']
responses.shape

(1448, 80)

In [28]:
responses = responses[responses.workout != 'No answer']
responses.shape

(1422, 80)

In [29]:
responses = responses[responses.therapist != 'No answer']
responses.shape

(1407, 80)

In [30]:
responses = responses[responses.lonely != 'No answer']
responses.shape

(1393, 80)
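The column-by-column filtering above can also be done in one pass, with a per-column count of skips computed first. A minimal sketch on a made-up frame (the answer values are placeholders; the column names come from the cells above):

```python
import pandas as pd

# Toy frame (hypothetical answers) with 'No answer' marking skipped questions.
toy = pd.DataFrame({
    'cry': ['a', 'No answer', 'b', 'c'],
    'sports': ['x', 'y', 'No answer', 'z'],
    'lonely': ['p', 'No answer', 'q', 'r'],
})
single_answer_cols = ['cry', 'sports', 'lonely']

# Boolean frame: True wherever a single-answer question was skipped.
is_skip = toy[single_answer_cols] == 'No answer'

# Skips per column, counted on the full frame.
skips_per_col = is_skip.sum()

# Keep only the rows with no skipped single-answer question.
kept = toy[~is_skip.any(axis=1)]

print(skips_per_col.to_dict(), kept.shape)
```

Unlike the sequential cells, the per-column counts here are taken before any rows are removed, so a respondent who skipped several questions is counted once in each of those columns.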

In [5]:
responses['Employment'].value_counts()

Employed, working full-time           737
Not employed-retired                  524
Employed, working part-time           143
Not employed, NOT looking for work    110
Not employed, looking for work         68
Not employed, student                  30
No answer                               3
Name: Employment, dtype: int64

Next, for 'Select all that apply' questions, I need to verify that at least one response was chosen; otherwise I treat the question as 'No answer'.

In [13]:
responses.loc[range(1,5)]

Unnamed: 0,StartDate,EndDate,self_manly,others_manly,source_ideas_father,source_ideas_mother,source_ideas_friends,source_ideas_fam,source_ideas_popculture,source_ideas_other,...,marital_status,children_u18,children_18+,children_none,race,education,state,income,region,orientation.1
1,2018-05-10 04:01:00,2018-05-10 04:06:00,Somewhat masculine,Somewhat important,Not selected,Not selected,Not selected,Pop culture,Not selected,Not selected,...,Never married,Not selected,Not selected,No children,Hispanic,College graduate,New York,"$0-$9,999",Middle Atlantic,Gay/Bisexual
2,2018-05-10 06:30:00,2018-05-10 06:53:00,Somewhat masculine,Somewhat important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Not selected,...,Widowed,Not selected,"Yes, one or more children 18 or older",Not selected,White,Some college,Ohio,"$50,000-$74,999",East North Central,Straight
3,2018-05-10 07:02:00,2018-05-10 07:09:00,Very masculine,Not too important,Father or father figure(s),Not selected,Not selected,Not selected,Not selected,Other (please specify),...,Married,Not selected,"Yes, one or more children 18 or older",Not selected,White,College graduate,Michigan,"$50,000-$74,999",East North Central,Straight
4,2018-05-10 07:27:00,2018-05-10 07:31:00,Very masculine,Not too important,Father or father figure(s),Mother or mother figure(s),Other family members,Not selected,Not selected,Not selected,...,Married,Not selected,"Yes, one or more children 18 or older",Not selected,White,Some college,Indiana,"$50,000-$74,999",East North Central,No answer


In [None]:
# Map values to 1 = option selected, 0 = 'Not selected'
source_ideas = ['source_ideas_father', 'source_ideas_mother', 'source_ideas_friends',
                'source_ideas_fam', 'source_ideas_popculture', 'source_ideas_other']
responses[source_ideas] = (responses[source_ideas] != 'Not selected').astype(int)

# Flag each row: 1 if at least one option was selected, 0 if the question was skipped
for i in responses.index:
    if responses.loc[i, source_ideas].sum() > 0:
        responses.loc[i, 'source_ideas'] = 1
    else:
        responses.loc[i, 'source_ideas'] = 0