## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest. (You can also check out `get_gss.ipynb` for some processed data.)
2. Write a short description of the data you chose, and why. (~500 words)
3. Load the data using Pandas. Clean them up for EDA. Do this in this notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations.
5. Describe your findings. (500 - 1000 words, or more)

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.


#1.

In [13]:
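import io

import pandas as pd
from google.colab import files

# Upload the spreadsheet with the 7 downloaded variables
uploaded = files.upload()

# Read the bytes that were just uploaded. Colab renames repeat uploads on disk
# (for example "eda_lab_data (4).xlsx"), so a fixed path can load a stale copy.
file_bytes = next(iter(uploaded.values()))
gss_df = pd.read_excel(io.BytesIO(file_bytes))  # load the data using Pandas (task 3)

gss_df.tail()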

Saving eda_lab_data.xlsx to eda_lab_data (4).xlsx


| | year | id_ | hrs2 | educ | happy | health | ballot |
|---|---|---|---|---|---|---|---|
| 75694 | 2024 | 3305 | .i: Inapplicable | 8 or more years of college | Pretty happy | Fair | Ballot c |
| 75695 | 2024 | 3306 | .i: Inapplicable | 2 years of college | Pretty happy | Fair | Ballot a |
| 75696 | 2024 | 3307 | .i: Inapplicable | 12th grade | Pretty happy | Good | Ballot a |
| 75697 | 2024 | 3308 | .i: Inapplicable | 8 or more years of college | Pretty happy | Excellent | Ballot b |
| 75698 | 2024 | 3309 | .i: Inapplicable | 6 years of college | Pretty happy | Good | Ballot b |


#2.

The dataset we selected allows us to explore how individual experiences of work, education, health, and happiness are connected across time. It includes seven variables: id, ballot, year, hrs2, educ, happy, and health. Each variable contributes to understanding broader social and personal trends, while also offering clear avenues for statistical and interpretive analysis.

The id and ballot variables record the respondent's assigned ID and the ballot they used to provide their responses.

The year variable records when a respondent’s data was collected. This matters because it introduces time as a factor, allowing us to examine how people’s experiences shift across survey years. For example, changes in the economy, healthcare systems, or cultural values might influence how much people work, how they evaluate their health, or how they perceive happiness. By including year, we can look not only at individual-level variation but also at larger social changes that affect people’s lives.

The hrs2 variable represents the number of hours a respondent usually works per week. This allows us to examine how an individual's working hours affect their personal well-being. Longer hours can lead to stress and health problems, but they can also provide financial security, which contributes to happiness.

The educ variable reflects the highest year of school a respondent completed. Education often plays a foundational role in shaping a person's employment opportunities, income, hours worked, and subsequently, their health and happiness.
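Because educ arrives as text labels (for example "12th grade" or "2 years of college") rather than numbers, it may help to convert it to years of schooling before computing summaries. The sketch below is one possible approach, not part of the original analysis: `educ_to_years` is a hypothetical helper, and it assumes that "N years of college" means 12 + N years in total.

```python
import re

import pandas as pd


def educ_to_years(label):
    """Convert an educ label to years of schooling (hypothetical helper).

    Assumes "N years of college" means 12 + N years in total.
    Returns NaN for missing values and for labels without a number.
    """
    if not isinstance(label, str):
        return float("nan")
    match = re.search(r"\d+", label)
    if match is None:
        return float("nan")  # labels without a digit need explicit handling
    years = int(match.group())
    if "college" in label.lower():
        years += 12
    return float(years)


# Labels copied from the rows displayed in this notebook, plus one missing value
toy_educ = pd.Series(["12th grade", "2 years of college", "8 or more years of college", None])
toy_years = toy_educ.map(educ_to_years)
print(toy_years.tolist())
```

Any label without a digit comes back as NaN under this sketch, so it is worth listing the unique educ values first and handling such labels explicitly.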

The happy variable captures a respondent’s self-reported happiness. Happiness is a subjective measure of well-being, and although many factors influence it, several of the major ones are the variables examined here. Studying happiness gives us a window into an individual's quality of life. We hope to see how happiness interacts with the other variables and how those variables influence how happy an individual perceives themselves to be.

The health variable reflects respondents’ self-reported physical health. Health is a crucial component of overall well-being and may also mediate the relationship between work and happiness. For instance, individuals working extremely long hours might report worse health, which in turn could lower happiness. Conversely, those with good health might feel more capable of handling demanding work schedules.
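Both happy and health are ordinal: their labels have a natural order that plain strings do not preserve. A minimal sketch of encoding that order with pandas ordered categoricals follows. The full label lists are assumptions (only "Pretty happy", "Fair", "Good", and "Excellent" appear in the rows shown in this notebook) and should be checked against the codebook; the data frame is toy data for illustration only.

```python
import pandas as pd

# Assumed label sets, ordered from lowest to highest; verify against the codebook
HAPPY_ORDER = ["Not too happy", "Pretty happy", "Very happy"]
HEALTH_ORDER = ["Poor", "Fair", "Good", "Excellent"]

# Toy rows for illustration only, not GSS data
toy = pd.DataFrame({
    "happy": ["Pretty happy", "Very happy", "Not too happy", "Pretty happy"],
    "health": ["Fair", "Excellent", "Poor", "Good"],
})

toy["happy"] = pd.Categorical(toy["happy"], categories=HAPPY_ORDER, ordered=True)
toy["health"] = pd.Categorical(toy["health"], categories=HEALTH_ORDER, ordered=True)

# Ordered categoricals compare by rank rather than alphabetically
good_or_better = toy["health"] >= "Good"
print(good_or_better.tolist())

# Integer ranks (0 = lowest category, -1 = missing) for rank-based statistics
health_rank = toy["health"].cat.codes
print(health_rank.tolist())
```

With the order encoded, sorting, plotting, and rank correlations follow the intended scale instead of alphabetical order.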

By using this dataset, our primary goal is to explore the relationships between these variables. Specifically, we want to examine how the number of hours worked per week correlates with self-reported health and happiness. We are also interested in how responses vary across different years, as well as how education influences both work hours and well-being. Ultimately, this dataset allows us to connect individual-level experiences with broader social trends, providing a richer understanding of how work, education, health, and happiness intersect over time.

#3.

In [14]:
import numpy as np

# Replace GSS missing-value codes such as ".i: Inapplicable" with NaN
gss_df_clean = gss_df.replace(r'^\..*', np.nan, regex=True)

# Strip whitespace from string values (column-wise Series.map avoids the deprecated DataFrame.applymap)
gss_df_clean = gss_df_clean.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# Standardize column names (lowercase, no spaces)
gss_df_clean.columns = gss_df_clean.columns.str.lower().str.strip().str.replace(" ", "_")

gss_df_clean.tail()



| | year | id_ | hrs2 | educ | happy | health | ballot |
|---|---|---|---|---|---|---|---|
| 75694 | 2024 | 3305 | | 8 or more years of college | Pretty happy | Fair | Ballot c |
| 75695 | 2024 | 3306 | | 2 years of college | Pretty happy | Fair | Ballot a |
| 75696 | 2024 | 3307 | | 12th grade | Pretty happy | Good | Ballot a |
| 75697 | 2024 | 3308 | | 8 or more years of college | Pretty happy | Excellent | Ballot b |
| 75698 | 2024 | 3309 | | 6 years of college | Pretty happy | Good | Ballot b |
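Every hrs2 value in the rows shown above is missing after cleaning, so it may be worth checking how much of each column survives before analyzing it. The sketch below computes the share of missing values per column on toy data; on the real data the same expression could be applied to `gss_df_clean`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only, not GSS data
toy = pd.DataFrame({
    "hrs2": [np.nan, 40, np.nan, np.nan],
    "happy": ["Pretty happy", "Very happy", None, "Pretty happy"],
    "year": [2024, 2024, 2024, 2024],
})

# isna() marks missing cells; the column mean of those booleans is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly missing can still be analyzed, but any summary of it describes only the respondents who were asked that question.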


#4.
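
One possible starting point for the numeric summaries in this task is a grouped summary of hours by happiness and a row-normalized cross-tabulation of happiness against health. This is a sketch on toy data, not results from the GSS; on the real data, `toy` would be replaced by `gss_df_clean` with hrs2 converted to a numeric column.

```python
import pandas as pd

# Toy rows for illustration only, not GSS data
toy = pd.DataFrame({
    "hrs2": [40, 50, 35, 60, 45, 20],
    "happy": ["Very happy", "Pretty happy", "Very happy",
              "Not too happy", "Pretty happy", "Very happy"],
    "health": ["Excellent", "Good", "Good", "Fair", "Good", "Excellent"],
})

# Count, mean, and median weekly hours within each happiness level
hours_by_happy = toy.groupby("happy")["hrs2"].agg(["count", "mean", "median"])
print(hours_by_happy)

# Share of each health rating within each happiness level (each row sums to 1)
health_by_happy = pd.crosstab(toy["happy"], toy["health"], normalize="index")
print(health_by_happy.round(2))
```

Normalizing by row makes groups of different sizes comparable, which matters when some answers are much more common than others.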

#5.
