## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7,000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also covers a variety of controversial topics. No matter what you're curious about, there's something interesting in here to check out. The codebook runs to 904 pages, so use Ctrl+F to search it.

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The full datasets are large, so it may make sense to pick the variables you want and download just those from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated (or not) with their career or family choices as an adult, or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [1]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data using Pandas

In [6]:
# import data from the GSS
# Define file path
file_path = "/content/GSS.xlsx"

# Read the Excel file
df = pd.read_excel(file_path)

In [7]:
print(df.head())

   year  id_               hrs2        marital childs age     sex   race  \
0  1990    1  .i:  Inapplicable  Never married      0  65  FEMALE  White   
1  1990    2  .i:  Inapplicable  Never married      0  42    MALE  White   
2  1990    3  .i:  Inapplicable  Never married      0  25    MALE  White   
3  1990    4  .i:  Inapplicable  Never married      0  39  FEMALE  White   
4  1990    5  .i:  Inapplicable  Never married      6  55    MALE  Black   

          happy             hapmar               life    ballot  
0  Pretty happy  .i:  Inapplicable  .i:  Inapplicable  Ballot b  
1    Very happy  .i:  Inapplicable  .i:  Inapplicable  Ballot b  
2  Pretty happy  .i:  Inapplicable           Exciting  Ballot a  
3  Pretty happy  .i:  Inapplicable           Exciting  Ballot a  
4  Pretty happy  .i:  Inapplicable  .i:  Inapplicable  Ballot b  


In [8]:
# see information about df, ensure that all variables I want are there
print(df.shape, '\n') # List the dimensions of df
print(df.dtypes, '\n') # The types of the variables

(47497, 12) 

year        int64
id_         int64
hrs2       object
marital    object
childs     object
age        object
sex        object
race       object
happy      object
hapmar     object
life       object
ballot     object
dtype: object 



# Clean the data

In [None]:
# age should be converted to numeric
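
One possible way to do this, sketched on a few made-up rows rather than the real file: `pd.to_numeric` with `errors="coerce"` turns anything that is not a number into NaN. It is worth listing the values that failed to parse first, since a top-coded label such as "89 or older" (an assumed wording; check the codebook) would otherwise silently become missing.

```python
import pandas as pd

# Toy stand-in for df["age"]; the text labels are assumptions, not real GSS rows
toy = pd.DataFrame({"age": [65, "42", "89 or older", ".n:  No answer", 25]})

# Attempt the conversion; non-numeric entries become NaN
age_num = pd.to_numeric(toy["age"], errors="coerce")

# Inspect which original values were lost before deciding what to do with them
lost = toy.loc[age_num.isna() & toy["age"].notna(), "age"].unique()
print(lost)

# Assumed choice: treat the top-coded label as 89, leave other codes missing
toy["age_num"] = age_num.mask(toy["age"] == "89 or older", 89)
print(toy)
```

Recoding the top category to 89 understates the oldest respondents' ages, so that choice should be mentioned in the write-up.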

In [None]:
# number of children should be numeric, need to account for answers that are not numbers
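
A minimal sketch of one approach, again on made-up rows: pull the leading digits out of each answer so that a label like "8 or more" (assumed wording) becomes 8, while missing codes with no leading digit become missing. The column is cast to `str` first because the `.str` accessor returns NaN for cells that Excel loaded as real integers.

```python
import pandas as pd

# Toy stand-in for df["childs"]; "8 or more" and the ".d:" label are assumptions
toy = pd.DataFrame({"childs": [0, "6", "8 or more", ".d:  Do not Know", 2]})

# Cast to str first: the .str accessor returns NaN for non-string cells
digits = toy["childs"].astype(str).str.extract(r"^(\d+)", expand=False)

# Nullable integer dtype keeps whole numbers and still allows missing values
toy["childs_num"] = pd.to_numeric(digits).astype("Int64")
print(toy)
```

With this recoding, 8 means "at least 8", so means computed from the column are slight underestimates.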

In [None]:
# hours usually worked per week (hrs2) should also be numeric; need to account for non-numeric answers
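
The `df.head()` output shows missing codes written as text beginning with a dot (`.i:  Inapplicable`). A hedged sketch of handling them, using made-up rows: tabulate the dot-prefixed codes first, so the reason for missingness is recorded, then convert what is left.

```python
import pandas as pd

# Toy stand-in for df["hrs2"]; the numeric values are made up
toy = pd.DataFrame({"hrs2": [".i:  Inapplicable", 40, "35", ".i:  Inapplicable", 20]})

as_text = toy["hrs2"].astype(str).str.strip()

# Tabulate the dot-prefixed missing codes before discarding them
is_code = as_text.str.startswith(".")
print(as_text[is_code].value_counts())

# Convert the rest; codes (and anything else non-numeric) become NaN
toy["hrs2_num"] = pd.to_numeric(as_text.where(~is_code), errors="coerce")
print(f"share missing: {toy['hrs2_num'].isna().mean():.0%}")
```

If most of the real column turns out to be inapplicable, as the first rows suggest, any analysis using it applies only to the subset of respondents who were asked.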

# Numeric Summaries and Visualizations

In [None]:
# description of age variable
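
Once age is numeric, `describe()` gives the usual summary. The sketch below uses a handful of made-up ages in place of the cleaned column and also reports the missing count, which `describe()` leaves out.

```python
import pandas as pd

# Toy numeric ages standing in for the cleaned column (made-up values)
age = pd.Series([65, 42, 25, 39, 55, None], name="age", dtype="float64")

summary = age.describe()  # count, mean, std, min, quartiles, max
print(summary)
print("missing:", age.isna().sum())
```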

In [None]:
# describe marital status
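
For a categorical variable, counts and shares say more than `describe()`. A small sketch on made-up answers (labels other than "Never married" are assumed):

```python
import pandas as pd

# Toy stand-in for df["marital"]
marital = pd.Series(["Never married", "Married", "Married",
                     "Divorced", "Never married", "Married"])

# Counts and proportions side by side; dropna=False keeps missing values visible
counts = marital.value_counts(dropna=False)
shares = marital.value_counts(normalize=True, dropna=False).round(3)
table = pd.DataFrame({"n": counts, "share": shares})
print(table)
```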

In [None]:
# visualization: scatterplot showing general happiness by age, sex, and race
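
Because `happy` is a category, a scatterplot against the other variables would stack the points on a few lines. One alternative (a swap for the scatterplot, not the plot named above) is a grouped bar chart of the share answering "Very happy" by sex and race. The rows below are made up, "Not too happy" is an assumed label, and age is left out of this sketch.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows; not real GSS data
toy = pd.DataFrame({
    "sex":   ["FEMALE", "MALE", "MALE", "FEMALE", "MALE", "FEMALE"],
    "race":  ["White", "White", "Black", "Black", "White", "White"],
    "happy": ["Pretty happy", "Very happy", "Pretty happy",
              "Very happy", "Very happy", "Not too happy"],
})

# Share of each sex/race group answering "Very happy"
very = toy["happy"].eq("Very happy")
share = very.groupby([toy["sex"], toy["race"]]).mean().unstack("race")
print(share)

# One cluster of bars per sex, one bar per race
ax = share.plot(kind="bar", rot=0)
ax.set_ylabel('share "Very happy"')
ax.set_title("General happiness by sex and race (toy data)")
```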

In [None]:
# visualization: scatterplot showing marital happiness with number of children, hours usually worked, and sex
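
Before plotting, a table of group means can show whether there is anything to see. The sketch below is a numeric companion rather than the scatterplot itself; it assumes cleaned numeric columns named `childs_num` and `hrs_num` (hypothetical names) and uses made-up rows. In the output above, `hapmar` is inapplicable for never-married respondents, so those rows are dropped first.

```python
import pandas as pd

# Toy rows; column names childs_num / hrs_num are hypothetical
toy = pd.DataFrame({
    "hapmar": ["Very happy", "Pretty happy", ".i:  Inapplicable",
               "Very happy", "Pretty happy"],
    "sex":    ["FEMALE", "MALE", "MALE", "MALE", "FEMALE"],
    "childs_num": [2, 3, 0, 1, 4],
    "hrs_num":    [40.0, 45.0, 38.0, None, 20.0],
})

# Keep only respondents with a real answer to the marital happiness question
asked = toy[~toy["hapmar"].str.startswith(".")]

# Mean children and mean hours for each answer, split by sex
means = asked.pivot_table(index="hapmar", columns="sex",
                          values=["childs_num", "hrs_num"], aggfunc="mean")
print(means)
```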

In [None]:
# visualization: life exciting or dull with age, sex
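
One option here is a box plot of age for each combination of answer and sex. This is a sketch on made-up rows; "Routine" and "Dull" are assumed labels, and rows carrying a missing code are removed before plotting.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows; not real GSS data
toy = pd.DataFrame({
    "life": ["Exciting", "Routine", "Exciting", "Dull", "Routine",
             ".i:  Inapplicable"],
    "sex":  ["MALE", "FEMALE", "FEMALE", "MALE", "MALE", "FEMALE"],
    "age":  [25, 39, 31, 70, 52, 65],
})

# Keep only respondents who actually got the question
asked = toy[~toy["life"].str.startswith(".")]

# Median age per answer and sex, as a numeric companion to the plot
med = asked.groupby(["life", "sex"])["age"].median()
print(med)

# One box of ages per (life, sex) combination
fig, ax = plt.subplots()
asked.boxplot(column="age", by=["life", "sex"], ax=ax, rot=45)
ax.set_ylabel("age")
```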

In [None]:
# visualization: want to see distribution of happiness at different ages
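
A possible approach, sketched with made-up rows: bin age with `pd.cut`, build a row-normalised crosstab so each age group's answers sum to 1, and draw it as a stacked bar chart. The bin edges are an arbitrary assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows; not real GSS data
toy = pd.DataFrame({
    "age":   [22, 27, 34, 38, 45, 51, 63, 70],
    "happy": ["Very happy", "Pretty happy", "Pretty happy", "Not too happy",
              "Very happy", "Pretty happy", "Very happy", "Very happy"],
})

# Assumed bin edges; right=False makes each bin [low, high)
bins = [18, 30, 45, 60, 90]
toy["age_group"] = pd.cut(toy["age"], bins=bins, right=False)

# Row-normalised crosstab: each age group's answers sum to 1
dist = pd.crosstab(toy["age_group"], toy["happy"], normalize="index")
print(dist.round(2))

# Stacked bars make the within-group composition comparable across ages
ax = dist.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("share of respondents")
```

Normalising within each age group matters because the groups differ in size; raw counts would mostly show which ages were sampled most.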

In [None]:
# visualization: have people gotten happier in general over the years?
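
One way to look at this, as a sketch on made-up rows: compute the share of respondents answering "Very happy" in each survey year, after dropping missing codes, and plot it as a line. The `.n:` label is an assumed missing code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows; not real GSS data
toy = pd.DataFrame({
    "year":  [1990, 1990, 1990, 1991, 1991, 1993, 1993, 1993, 1993],
    "happy": ["Very happy", "Pretty happy", "Pretty happy",
              "Very happy", ".n:  No answer",
              "Pretty happy", "Very happy", "Very happy", "Not too happy"],
})

# Drop dot-prefixed missing codes so they do not count as "not very happy"
known = toy[~toy["happy"].str.startswith(".")]

# Share answering "Very happy" in each survey year
trend = known["happy"].eq("Very happy").groupby(known["year"]).mean()
print(trend)

fig, ax = plt.subplots()
ax.plot(trend.index, trend.values, marker="o")
ax.set_xlabel("year")
ax.set_ylabel('share "Very happy"')
```

A single share per year hides the split between the other answers, so a second line for "Not too happy" may be worth adding.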