## EDA LAB

The General Social Survey (GSS) is a biennial nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest. (You can also check out `get_gss.ipynb` for some processed data.)
2. Write a short description of the data you chose, and why. (~500 words)
3. Load the data using Pandas. Clean them up for EDA. Do this in this notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations.
5. Describe your findings. (500 - 1000 words, or more)

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.


In [16]:
import pandas as pd

# Variables we chose to keep
var_list = [
    'year',            # year of survey
    'wrkstat',         # labor force status
    'degree',          # respondent's highest degree
    'childs',          # number of children
    'marital',         # marital status
    'sex',             # respondent's sex
    'race',            # race of respondent
    'income',          # total family income
    'happy',           # self-reported happiness
    'partyid'          # political party affiliation
]

output_filename = 'selected_gss_data.csv'  # CSV file the combined data is saved to

# Collect the selected columns from each parquet chunk
all_chunks = []

for k in range(3):  # the data are split across three chunk files
    url = f'https://github.com/DS3001/project_gss/raw/main/gss_chunk_{k + 1}.parquet'
    df = pd.read_parquet(url)
    all_chunks.append(df.loc[:, var_list])

# Concatenate all chunks into a single DataFrame
gss_df = pd.concat(all_chunks, ignore_index=True)

# Drop 1972, when income was recorded under a different variable
gss_df = gss_df[gss_df['year'] != 1972]

# Save the combined DataFrame to a CSV file (easier to use for EDA)
gss_df.to_csv(output_filename, index=False)
print(f"Loaded and saved as '{output_filename}'")
display(gss_df.head(20))

Loaded and saved as 'selected_gss_data.csv'


|      | year | wrkstat | degree | childs | marital | sex | race | income | happy | partyid |
|------|------|---------|--------|--------|---------|-----|------|--------|-------|---------|
| 1613 | 1973 | working full time | less than high school | 1.0 | married | male | white | $10,000 to $14,999 | not too happy | other party |
| 1614 | 1973 | keeping house | less than high school | 2.0 | married | female | white | $7,000 to $7,999 | very happy | not very strong democrat |
| 1615 | 1973 | working full time | less than high school | 8.0 | married | female | white | $10,000 to $14,999 | pretty happy | independent, close to republican |
| 1616 | 1973 | working full time | high school | 2.0 | married | male | white | $10,000 to $14,999 | pretty happy | not very strong democrat |
| 1617 | 1973 | keeping house | less than high school | 4.0 | married | female | white | $10,000 to $14,999 | pretty happy | independent, close to republican |
| 1618 | 1973 | working full time | high school | 3.0 | divorced | female | white | $10,000 to $14,999 | pretty happy | not very strong democrat |
| 1619 | 1973 | keeping house | less than high school | 3.0 | married | female | white | $4,000 to $4,999 | very happy | independent, close to democrat |
| 1620 | 1973 | working full time | bachelor's | 0.0 | never married | male | white | $7,000 to $7,999 | pretty happy | not very strong democrat |
| 1621 | 1973 | keeping house | less than high school | 2.0 | widowed | female | white | $1,000 to $2,999 | pretty happy | not very strong democrat |
| 1622 | 1973 | working full time | bachelor's | 1.0 | married | male | white | $15,000 to $19,999 | very happy | independent, close to republican |


### Why These Variables
We chose these variables because they seem to be significant indicators of both social status and general well-being. Each has also been asked consistently in the survey from at least 1973 onward (income was recorded under a separate variable in the 1972 survey, so we dropped that year).

Each variable serves a distinct purpose in our analysis. 'year' is a consistent independent variable that lets us see trends over time and make inferences about how contemporary events and circumstances affect the data. 'wrkstat' records the respondent's labor force status, which shapes how they spend most of their waking hours. 'degree' is the highest degree earned, which often correlates with income, number of children, party affiliation, and demographic data. 'childs' is the number of children the respondent has, which may be correlated with marital status, happiness, and work status. 'marital' is our indicator of marital status, which can be split by value to see how, for example, divorced or married individuals describe their general social status and well-being. The two demographic columns, 'sex' and 'race', allow us to make demographic comparisons. 'income' is one of the most useful columns because it often determines how much access to leisure the respondent has, and it is likely to be correlated with most of the other variables. 'happy' is a self-reported measure of happiness that lets us examine how people view their own happiness given their social status and other characteristics. Finally, 'partyid' gives detailed values for political party affiliation, so we can examine how political leaning relates to the other variables.
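
One way to examine relationships like these is a row-normalized cross-tabulation. The sketch below is only an illustration: `toy` is a small made-up frame standing in for `gss_df`, and its values are not GSS results.

```python
import pandas as pd

# Hypothetical toy rows standing in for gss_df; not real GSS responses.
toy = pd.DataFrame({
    "marital": ["married", "married", "divorced",
                "divorced", "never married", "married"],
    "happy": ["very happy", "pretty happy", "not too happy",
              "pretty happy", "pretty happy", "very happy"],
})

# Share of each happiness level within each marital status (each row sums to 1).
happy_by_marital = pd.crosstab(toy["marital"], toy["happy"], normalize="index")
print(happy_by_marital.round(2))
```

Normalizing by row makes groups of very different sizes comparable, which matters in the real data because married respondents are likely to outnumber the other categories.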

Most importantly, because most of these variables come from questions asked consistently across many years of the survey, there are fewer NaN values to process, which reduces the risk of inaccuracies or overgeneralizations.
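
The claim about missing values can be checked directly. A minimal sketch, assuming a frame shaped like `gss_df`; the `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for gss_df; not real GSS responses.
toy = pd.DataFrame({
    "year": [1973, 1973, 1974, 1974],
    "income": ["$10,000 to $14,999", np.nan,
               "$7,000 to $7,999", "$10,000 to $14,999"],
    "happy": ["very happy", "pretty happy", np.nan, np.nan],
})

# Fraction of missing values in each column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Fraction missing per survey year, to spot years where a question was not asked.
missing_by_year = toy.drop(columns="year").isna().groupby(toy["year"]).mean()
print(missing_by_year)
```

A column that is entirely missing in a given year usually means the question was not asked that year, rather than that respondents declined to answer.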

Because these columns are common questions across the many years of this survey, there is reason to believe they are consistently important considerations when studying social phenomena. We can also use these data to check for biases from the survey sampling. For example, if we find that the only respondents in certain years, sexes, etc. are individuals with an income above a certain threshold, we should treat the data as unrepresentative of people below that income. By using these well-established variables, we are more likely to recognize such biases before drawing conclusions that are really just the outcome of selection bias.
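
A rough way to run this kind of representativeness check is to compute the share of each group within each survey year and flag cells that fall below a cutoff. The sketch below uses made-up rows (`toy`) and an arbitrary 10% threshold (`min_share`); both are assumptions for illustration, and the same pattern would apply to 'income' or 'race'.

```python
import pandas as pd

# Hypothetical toy rows standing in for gss_df; not real GSS responses.
toy = pd.DataFrame({
    "year": [1973, 1973, 1973, 1974, 1974, 1974],
    "sex": ["male", "female", "female", "female", "female", "female"],
})

# Share of respondents in each category within each year (each row sums to 1).
sex_by_year = pd.crosstab(toy["year"], toy["sex"], normalize="index")

# Reshape to a Series indexed by (year, sex), then keep cells below the cutoff.
min_share = 0.10  # arbitrary threshold chosen for this sketch
shares = sex_by_year.stack()
underrepresented = shares[shares < min_share]
print(underrepresented)
```

Any year and category pair that shows up in `underrepresented` is a signal to interpret results for that group with caution.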