# An Interactive Exploration of Gender, Work, and Attitudes in the U.S.
## An Exploratory Analysis and Interactive Dashboard Using the 2018 General Social Survey

### Introduction
This project analyzes data from the 2018 General Social Survey (GSS) to explore the relationships between gender and key socioeconomic and attitudinal indicators in the United States. The GSS is a comprehensive, nationally representative survey that has tracked the opinions and behaviors of American adults for decades, making it an invaluable resource for social science research.

The analysis focuses on several key areas:

- **Economic Outcomes:** Comparing income, job prestige, and socioeconomic status between genders.

- **Societal Attitudes:** Examining differing views on gender roles in the family and workplace.

The primary outcome of this project is an interactive web dashboard that allows users to explore these relationships themselves.

**Context: The Gender Pay Gap**  

The gender pay gap is persistent. In 2024, women earned about 84 cents for every dollar a man earned, a figure that has changed little in two decades, and the gap is wider for Hispanic and Black women. The disparity stems from several factors, including historical educational and career segregation, societal gender norms, and discrimination. The gap widens in higher-paying, male-dominated professions and is significantly affected by the "motherhood penalty," in which caregiving responsibilities reduce women's earnings while fatherhood can boost men's. Despite significant educational gains, women remain overrepresented in low-paying occupations and underrepresented in high-paying "greedy jobs" that demand long hours. To address the issue, experts suggest organizational and governmental actions, such as banning salary history inquiries and promoting pay transparency, rather than focusing solely on individual negotiation skills.

**About the Data: The General Social Survey (GSS)**  

This dashboard uses data from the General Social Survey (GSS), an influential study that has monitored American society for five decades. Since 1972, the GSS has used full-probability, personal interviews to collect data from a nationally representative sample of adults in the U.S., making it one of the most widely used sources of sociological and attitudinal trend data in the country. The survey contains a standard core of demographic, behavioral, and attitudinal questions on topics including social mobility, civil liberties, crime, and psychological well-being, allowing researchers to track societal changes over time. You can read more about it [here](https://gss.norc.org/us/en/gss/about-the-gss.html).

### 1. Setup and Data Preparation  
First, we'll import the necessary libraries for data manipulation (pandas) and visualization (plotly), as well as the tools to build our web app (dash).

In [1]:
import numpy as np
import pandas as pd

# Plotly modules/methods/settings
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True) # enables display of plotly figures in HTML/PDF notebooks

# Dash modules/methods/settings
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css'] # Controls default visual appearance of the dashboard

Next, we load the 2018 GSS data directly from its source and perform some initial cleaning. This involves selecting relevant columns, renaming them for clarity, and correcting data types.

In [2]:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                                               'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"],
                                               low_memory=False)
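
The `na_values` argument above tells pandas which survey codes to treat as missing. The following is a minimal sketch of that behavior on a hypothetical three-row extract (the rows and the shortened code list are invented for illustration, not taken from the GSS file):

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real file has many more columns and rows.
raw = io.StringIO(
    "id,satjob,fefam\n"
    "1,very satisfied,agree\n"
    "2,IAP,DK\n"
    "3,mod. satisfied,NOT SURE\n"
)

# Same idea as the na_values argument in the cell above, with a shorter list.
toy = pd.read_csv(raw, na_values=["IAP", "DK", "NOT SURE"])

# Count how many values in each column were converted to NaN.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Checking `isna().sum()` on the real dataframe right after loading is a quick way to confirm that the non-response codes were caught.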

Here is code that cleans the data and gets it ready to be used for data visualizations:

In [3]:
mycols = ['id', 'wtss', 'sex', 'educ', 'region', 'age', 'coninc',
          'prestg10', 'mapres10', 'papres10', 'sei10', 'satjob',
          'fechld', 'fefam', 'fepol', 'fepresch', 'meovrwrk'] 
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
gss_clean = gss[mycols].copy()
# Rename only the columns selected in mycols
gss_clean = gss_clean.rename({'wtss':'weight', 
                              'educ':'education', 
                              'coninc':'income', 
                              'prestg10':'job_prestige',
                              'mapres10':'mother_job_prestige', 
                              'papres10':'father_job_prestige', 
                              'sei10':'socioeconomic_index', 
                              'fechld':'relationship', 
                              'fefam':'male_breadwinner', 
                              'fepol':'men_bettersuited', 
                              'fepresch':'child_suffer',
                              'meovrwrk':'men_overwork'},axis=1)
# Age is top-coded as the string '89 or older'; map it to 89 before converting to float
gss_clean.age = gss_clean.age.replace({'89 or older':'89'})
gss_clean.age = gss_clean.age.astype('float')
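
The age conversion above works in two steps because the column contains a top-coded text label. A small sketch on a hypothetical series shows the difference between that approach and simply coercing non-numeric values to missing (the values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical age column as it might arrive from the CSV: strings plus a top-coded label.
age_raw = pd.Series(["43", "74", "89 or older", np.nan])

# Approach used in the cell above: map the label to "89", then convert to float.
age_clean = age_raw.replace({"89 or older": "89"}).astype("float")

# Alternative: coerce anything non-numeric to NaN. This silently drops the top-coded group.
age_coerced = pd.to_numeric(age_raw, errors="coerce")

print(age_clean.tolist())
print(age_coerced.tolist())
```

The first approach keeps the oldest respondents in the data at the cost of understating their age; the second discards them.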

In [4]:
gss_clean.head()

| | id | weight | sex | education | region | age | income | job_prestige | mother_job_prestige | father_job_prestige | socioeconomic_index | satjob | relationship | male_breadwinner | men_bettersuited | child_suffer | men_overwork |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.357493 | male | 14.0 | new england | 43.0 | NaN | 47.0 | 31.0 | 45.0 | 65.3 | very satisfied | strongly agree | disagree | agree | strongly disagree | agree |
| 1 | 2 | 0.942997 | female | 10.0 | new england | 74.0 | 22782.5 | 22.0 | 32.0 | 39.0 | 14.8 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | 0.942997 | male | 16.0 | new england | 42.0 | 112160.0 | 61.0 | 32.0 | 72.0 | 83.4 | mod. satisfied | strongly agree | disagree | disagree | disagree | disagree |
| 3 | 4 | 0.942997 | female | 16.0 | new england | 63.0 | 158201.8412 | 59.0 | NaN | 39.0 | 69.3 | very satisfied | agree | disagree | disagree | disagree | neither agree nor disagree |
| 4 | 5 | 0.942997 | male | 18.0 | new england | 71.0 | 158201.8412 | 53.0 | 35.0 | 45.0 | 68.6 | NaN | NaN | NaN | NaN | NaN | NaN |


The `gss_clean` dataframe now contains the following features:

* `id` - a numeric unique ID for each person who responded to the survey
* `weight` - survey sample weights
* `sex` - male or female
* `education` - years of formal education
* `region` - region of the country where the respondent lives
* `age` - age
* `income` - the respondent's personal annual income
* `job_prestige` - the respondent's occupational prestige score, as measured by the GSS
* `mother_job_prestige` - the respondent's mother's occupational prestige score, as measured by the GSS
* `father_job_prestige` - the respondent's father's occupational prestige score, as measured by the GSS
* `socioeconomic_index` - an index measuring the respondent's socioeconomic status
* `satjob` - responses to "On the whole, how satisfied are you with the work you do?"
* `relationship` - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* `male_breadwinner` - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* `men_bettersuited` - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* `child_suffer` - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* `men_overwork` - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

### 2. Exploratory Data Analysis & Visualization
In this section, we'll create the individual plots that will form the basis of our dashboard. Each visualization is designed to be interactive, allowing for deeper exploration of the data.

### 2.1. Overall Comparison: Key Economic Indicators
To start, let's look at a high-level comparison of mean income, socioeconomic status, and education levels between men and women in the survey.

In [5]:
# Group by sex and calculate means
summary_stats = gss_clean.groupby('sex')[['income','socioeconomic_index','education']].mean().reset_index()

# Round for presentation
summary_stats = summary_stats.round(2)

# Rename columns for clarity
summary_stats = summary_stats.rename(columns = {
    'income':'Mean Income',
    'socioeconomic_index':'Mean Socioeconomic Index',
    'education':'Mean Years of Education'
})

# Create the interactive table
table_fig = ff.create_table(summary_stats)
table_fig.show()
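
The means above treat every respondent equally. Because the data include a `weight` column of survey sample weights, a weighted mean is one possible refinement. The sketch below shows the arithmetic on hypothetical rows; the helper `weighted_mean` is an illustrative name, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the renamed columns from gss_clean.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "income": [40000.0, 60000.0, 30000.0, 50000.0],
    "weight": [1.0, 3.0, 2.0, 2.0],
})

def weighted_mean(group, value_col="income", weight_col="weight"):
    """Weighted mean of value_col, skipping rows where either column is missing."""
    valid = group.dropna(subset=[value_col, weight_col])
    return float(np.average(valid[value_col], weights=valid[weight_col]))

# One weighted mean per sex.
weighted_income = {sex: weighted_mean(group) for sex, group in toy.groupby("sex")}
print(weighted_income)
```

On real data the weighted and unweighted figures can differ, so it is worth stating which one a table reports.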

### 2.2. Attitudes on Gender Roles
This bar plot visualizes responses to the statement: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."  
This helps us gauge societal attitudes toward traditional gender roles.

In [6]:
# Create the interactive bar plot for the 'male_breadwinner' variable
bar_fig = px.bar(gss_clean.dropna(subset=['male_breadwinner']),
    x='male_breadwinner',
    color='sex',
    barmode='group',
    category_orders={'male_breadwinner': ['strongly agree','agree','neither agree nor disagree','disagree','strongly disagree']},
    labels={
        'male_breadwinner':'Response to "Man is Achiever, Woman is Homemaker"',
        'count':'Number of Respondents',
        'sex':'Sex'
    }
)
bar_fig.show()
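
The bar plot above passes the raw rows to Plotly and lets it count them. An alternative is to compute the counts explicitly first, which makes the numbers behind each bar easy to inspect. A minimal sketch on hypothetical responses:

```python
import pandas as pd

# Hypothetical responses; None stands in for a missing answer.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "male_breadwinner": ["agree", "disagree", "disagree", "disagree", None],
})

# One row per (sex, response) pair with the number of respondents in it.
counts = (
    toy.dropna(subset=["male_breadwinner"])
       .groupby(["sex", "male_breadwinner"])
       .size()
       .reset_index(name="count")
)
print(counts)
```

A table like `counts` could then be passed to `px.bar` with `y='count'`, so that each bar is a single element rather than a stack of individual rows.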

### 2.3. Income vs. Job Prestige

Is there a direct relationship between the prestige of a job and the income it provides? This scatter plot explores that question, separating the data by gender and including trendlines to visualize the relationship for both men and women. Hovering over any point reveals the respondent's education and socioeconomic index.

If you see an error that says the package "statsmodels" is not installed, add it to your conda environment via the terminal by activating the environment then typing `conda install statsmodels`.

In [7]:
# Create the scatter plot with trendlines
plot_data_scatter = gss_clean.dropna(subset=['income','job_prestige'])
scatter_fig = px.scatter(plot_data_scatter,
    x='job_prestige',
    y='income',
    color='sex',
    trendline='ols', # Adds Ordinary Least Squares regression lines
    hover_data=['education','socioeconomic_index'],
    labels={
        'job_prestige':'Occupational Prestige Score',
        'income':'Annual Income',
        'sex':'Sex'
    }
)
scatter_fig.show()
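
Each trendline in the plot above is an ordinary least squares fit of income on job prestige. As a rough illustration of what is being estimated, the sketch below fits a straight line with NumPy on hypothetical points; it is a stand-in for the statsmodels fit that Plotly performs, not a reproduction of it:

```python
import numpy as np

# Hypothetical prestige scores and incomes that lie exactly on a line.
prestige = np.array([20.0, 40.0, 60.0, 80.0])
income = 1000.0 * prestige + 5000.0

# Degree-1 least-squares fit: polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(prestige, income, deg=1)

print(f"Each extra prestige point is associated with about {slope:.0f} more in income")
```

In the dashboard data, comparing the slopes for men and women is what shows whether income rises with prestige at a different rate for each group.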

### 2.4. Distribution of Income and Job Prestige

While averages are useful, box plots allow us to see the full distribution of data, including the median, interquartile range, and outliers. Here we create two plots to compare the distributions of income and job prestige between men and women.

In [8]:
# Box plot for Income
income_box_fig = px.box(gss_clean.dropna(subset=['income']),
                        x='sex',
                        y='income',
                        labels={'income': 'Annual Income ($)'}
                       )
income_box_fig.update_layout(xaxis_title=None, showlegend=False)

# Box plot for Job Prestige
prestige_box_fig = px.box(gss_clean.dropna(subset=['job_prestige']),
                          x='sex',
                          y='job_prestige',
                          labels={'job_prestige': 'Occupational Prestige Score'}
                         )
prestige_box_fig.update_layout(xaxis_title=None, showlegend=False)

# Display both figures
income_box_fig.show()
prestige_box_fig.show()
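
The elements of a box plot can also be computed directly. The sketch below derives the quartiles, the interquartile range, and the conventional 1.5 x IQR outlier fences for a hypothetical income series. Plotly's own quartile computation may differ slightly from the pandas default used here:

```python
import pandas as pd

# Hypothetical incomes with one extreme value.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 500], dtype="float")

# Quartiles (pandas uses linear interpolation by default).
q1, median, q3 = income.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]

print(q1, median, q3, outliers.tolist())
```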

### 2.5. Deeper Dive: Income Gap Across Prestige Levels

To investigate the income gap more closely, we can segment the population by job prestige. This faceted plot breaks job prestige into six equal-width tiers and displays an income comparison box plot for each, revealing how the income gap may vary across different levels of professional standing.

In [9]:
# Create a new dataframe for this specific plot
prestige_df = gss_clean[['income', 'sex', 'job_prestige']].copy()
prestige_df.dropna(inplace=True)

# Create six equal-width categories for job_prestige
prestige_labels = ['Category 1 (Lowest)', 'Category 2', 'Category 3',
                   'Category 4', 'Category 5', 'Category 6 (Highest)']
prestige_df['prestige_category'] = pd.cut(prestige_df['job_prestige'],
                                          bins=6,
                                          labels=prestige_labels)

# Create the faceted box plot
facet_fig = px.box(prestige_df,
                   x='sex',
                   y='income',
                   color='sex',
                   color_discrete_map={'male':'blue', 'female':'red'},
                   facet_col='prestige_category',
                   facet_col_wrap=2, # Arrange plots in 2 columns
                   category_orders={'prestige_category': prestige_labels}, # Keep facets in tier order
                   labels={'income': 'Annual Income', 'sex': 'Sex'}
                  )

# Clean up facet titles
facet_fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
facet_fig.update_xaxes(title_text=None)

facet_fig.show()
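
Note that `pd.cut` with an integer `bins` argument splits the range of scores into intervals of equal width, so the tiers can hold very different numbers of respondents. The sketch below contrasts this with `pd.qcut`, which splits by quantile, on hypothetical scores:

```python
import pandas as pd

# Hypothetical prestige scores: most are low, two are much higher.
scores = pd.Series([10, 12, 14, 16, 18, 20, 60, 100], dtype="float")

# Equal-width bins: the range 10 to 100 is cut at its midpoint.
equal_width = pd.cut(scores, bins=2)

# Equal-count bins: the cut point is the median instead.
equal_count = pd.qcut(scores, q=2)

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()
print(width_counts, count_counts)
```

With equal-width tiers, sparse categories at the extremes produce box plots based on few observations, which is worth keeping in mind when reading the facets.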

### 3. The Interactive Dashboard
Finally, we assemble all the components into a self-contained, interactive dashboard using Dash. This application allows users to explore different facets of the data on their own. The bar plot at the bottom is fully interactive, enabling dynamic comparisons of different survey questions grouped by sex, region, or education level.

Note: The following code block defines and runs the web application. For deployment (e.g., on PythonAnywhere), this entire script would be saved as a single app.py file.

In [None]:
# --- 1. IMPORTS ---
import pandas as pd
import plotly.express as px
import dash
from dash import dcc, html
from dash.dependencies import Input, Output

# --- 2. DATA LOADING AND CLEANING ---
# This block runs once when the app starts, creating the gss_clean DataFrame.
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', 
                 na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                            'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"],
                 low_memory=False)

mycols = ['id', 'wtss', 'sex', 'educ', 'region', 'age', 'coninc',
          'prestg10', 'mapres10', 'papres10', 'sei10', 'satjob',
          'fechld', 'fefam', 'fepol', 'fepresch', 'meovrwrk']
gss_clean = gss[mycols].copy() # explicit copy avoids SettingWithCopyWarning on the age assignments below
gss_clean = gss_clean.rename({'wtss':'weight', 'educ':'education', 'coninc':'income',
                              'prestg10':'job_prestige', 'mapres10':'mother_job_prestige',
                              'papres10':'father_job_prestige', 'sei10':'socioeconomic_index',
                              'fechld':'relationship', 'fefam':'male_breadwinner',
                              'fepol':'men_bettersuited', 'fepresch':'child_suffer',
                              'meovrwrk':'men_overwork'}, axis=1)
gss_clean.age = gss_clean.age.replace({'89 or older':'89'})
gss_clean.age = gss_clean.age.astype('float')


# --- 3. APP SETUP ---
app = dash.Dash(__name__)
server = app.server # This line is needed for services like PythonAnywhere
app.title = "GSS Interactive Dashboard"

# --- 4. LAYOUT DEFINITION ---
app.layout = html.Div([
    html.H1("GSS Interactive Dashboard", style={'textAlign': 'center'}),
    html.P("An exploration of gender, work, and attitudes from the 2018 General Social Survey.", style={'textAlign': 'center'}),
    html.Hr(),
    html.H3("Explore Attitudes by Topic and Demographic"),
    html.P("Select a survey question and a demographic group to see how responses vary."),
    
    html.Div([
        # Dropdown for selecting the variable
        dcc.Dropdown(
            id='variable-dropdown',
            options=[
                {'label': 'Job Satisfaction', 'value': 'satjob'},
                {'label': 'Working Mother Relationship', 'value': 'relationship'},
                {'label': 'Male Breadwinner', 'value': 'male_breadwinner'},
                {'label': 'Men Suited for Politics', 'value': 'men_bettersuited'},
                {'label': 'Working Mother and Child Suffering', 'value': 'child_suffer'},
                {'label': 'Men Overworking', 'value': 'men_overwork'}
            ],
            value='satjob', # Default value
            style={'color': '#000000'}
        ),
        # Dropdown for selecting the grouping
        dcc.Dropdown(
            id='group-dropdown',
            options=[
                {'label': 'Sex', 'value': 'sex'},
                {'label': 'Region', 'value': 'region'},
                {'label': 'Education Level', 'value': 'education'}
            ],
            value='sex', # Default value
            style={'color': '#000000'}
        ),
    ], style={'width': '80%', 'margin': 'auto'}),

    # Graph that will be updated by the callback
    dcc.Graph(id='interactive-barplot')
])

# --- 5. CALLBACK DEFINITION ---
@app.callback(
    Output('interactive-barplot', 'figure'),
    [Input('variable-dropdown', 'value'),
     Input('group-dropdown', 'value')]
)
def update_graph(selected_variable, selected_group):
    # Use a copy of the gss_clean DataFrame to avoid modifying the original
    df_copy = gss_clean.copy()
    
    # Drop missing values for the selected columns to prevent errors
    df_copy.dropna(subset=[selected_variable, selected_group], inplace=True)
    
    # For 'education', group into bins to make the plot readable
    if selected_group == 'education':
        # include_lowest=True keeps respondents with 0 years of schooling in the first bin
        df_copy['education'] = pd.cut(df_copy['education'],
                                     bins=[0, 8, 12, 15, 21],
                                     labels=['< High School', 'High School', 'Some College', 'College Grad+'],
                                     include_lowest=True)
        df_copy.dropna(subset=['education'], inplace=True)

    # Create the figure with a title that updates based on user selection
    fig = px.bar(
        df_copy,
        x=selected_variable,
        color=selected_group,
        barmode='group',
        labels={
            selected_variable: selected_variable.replace('_', ' ').title(),
            'count': 'Number of Respondents',
            selected_group: selected_group.title()
        },
        title=f"Responses for '{selected_variable.replace('_', ' ').title()}' grouped by '{selected_group.title()}'"
    )
    fig.update_layout(transition_duration=500)
    return fig

# --- 6. RUN THE APP ---
if __name__ == '__main__':
    app.run(debug=True)
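
The education grouping inside the callback relies on `pd.cut` with explicit bin edges. Intervals are closed on the right, so 12 years falls in 'High School' and 13 in 'Some College'. A minimal sketch on hypothetical values, assuming `include_lowest=True` is wanted so that 0 years stays in the first bin:

```python
import pandas as pd

# Hypothetical years of education, chosen to sit on the bin edges.
years = pd.Series([0, 8, 9, 12, 13, 15, 16, 20], dtype="float")

levels = pd.cut(
    years,
    bins=[0, 8, 12, 15, 21],
    labels=["< High School", "High School", "Some College", "College Grad+"],
    include_lowest=True,  # without this, a value of exactly 0 becomes NaN
)

level_names = levels.astype(str).tolist()
print(level_names)
```

Running a check like this on edge values before wiring the bins into a callback helps confirm that no respondents are dropped unintentionally.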