## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also includes a variety of questions on controversial topics. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [37]:
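import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Remove an earlier copy only if one exists; a bare os.remove() raises
# FileNotFoundError on a fresh runtime.
if os.path.exists('GSS.xlsx'):
    os.remove('GSS.xlsx')
upload=files.upload()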

Saving GSS.xlsx to GSS.xlsx


In [38]:
gss=pd.read_excel('GSS.xlsx')
gss

Unnamed: 0,year,id_,wrkstat,marital,age,degree,sex,race,res16,family16,ballot
0,1972,9,Working part time,Never married,21,High school,FEMALE,Black,50000 TO 250000,M AND F RELATIVES,.i: Inapplicable
1,1972,20,In school,Never married,21,High school,FEMALE,White,50000 TO 250000,MOTHER & STPFATHER,.i: Inapplicable
2,1972,23,Working full time,Married,21,High school,MALE,White,TOWN LT 50000,MOTHER & FATHER,.i: Inapplicable
3,1972,29,Working full time,Married,20,High school,MALE,White,"COUNTRY,NONFARM",MOTHER & FATHER,.i: Inapplicable
4,1972,35,"With a job, but not at work because of tempora...",Married,21,High school,MALE,White,TOWN LT 50000,MOTHER & FATHER,.i: Inapplicable
...,...,...,...,...,...,...,...,...,...,...,...
4359,2022,3401,"Unemployed, laid off, looking for work",Married,22,High school,FEMALE,White,"COUNTRY,NONFARM",MOTHER & FATHER,Ballot a
4360,2022,3461,In school,Never married,22,Less than high school,MALE,Black,TOWN LT 50000,MOTHER,Ballot c
4361,2022,3465,In school,Never married,20,High school,FEMALE,White,TOWN LT 50000,MOTHER & FATHER,Ballot b
4362,2022,3477,Working part time,Never married,21,High school,FEMALE,Black,TOWN LT 50000,FEMALE RELATIVE,Ballot c
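
Before cleaning, it can help to list which non-answer codes actually occur in each column. The following is a minimal sketch, assuming that reserved codes are strings beginning with a period (as with `.i: Inapplicable` in the `ballot` column above). The `toy` frame and the `reserved_codes` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real extract (illustration only).
toy = pd.DataFrame({
    "marital": ["Married", "Never married", ".n:  No answer", "Married"],
    "ballot": [".i:  Inapplicable", "Ballot a", "Ballot b", ".i:  Inapplicable"],
    "age": ["21", "89 or older", "20", ".d:  Do not Know/Cannot Choose"],
})


def reserved_codes(df):
    """Count, per column, the values that look like reserved codes.

    Assumption: a reserved code is any value whose string form starts with '.'.
    """
    found = {}
    for col in df.columns:
        values = df[col].astype(str)
        counts = values[values.str.startswith(".")].value_counts()
        if not counts.empty:
            found[col] = counts.to_dict()
    return found


codes = reserved_codes(toy)
for col, counts in codes.items():
    print(col, counts)
```

On the real data, the same call would be `reserved_codes(gss)`. Matching the exact strings it reports (including their spacing) avoids a `replace` that silently changes nothing.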


In [68]:
### Take out all non-answers and convert any numerical answers from strings to numbers.
# Reserved codes that mark a non-answer (each has two spaces after the colon).
non_answers=['.i:  Inapplicable','.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web']
coded_cols=['ballot','marital','wrkstat','age','degree','sex','race','res16','family16']
gss[coded_cols]=gss[coded_cols].replace(non_answers,np.nan)

# Age is top-coded: '89 or older' is stored as 90 so the column can be numeric.
gss['age']=gss['age'].replace('89 or older',90)
gss['age']=pd.to_numeric(gss['age'],errors='coerce')

# Keep only rows that have an answer in every column.
gss=gss.dropna()
gss

Unnamed: 0,year,id_,wrkstat,marital,age,degree,sex,race,res16,family16,ballot
1686,1988,15,Keeping house,Never married,19,Less than high school,FEMALE,White,BIG-CITY SUBURB,MOTHER,Ballot b
1687,1988,24,In school,Never married,22,High school,MALE,White,CITY GT 250000,MOTHER & FATHER,Ballot b
1688,1988,52,Working full time,Never married,22,Bachelor's,MALE,White,FARM,MOTHER & FATHER,Ballot a
1689,1988,58,In school,Never married,21,High school,MALE,Black,50000 TO 250000,FATHER,Ballot b
1690,1988,63,Working full time,Never married,22,Associate/junior college,FEMALE,Black,CITY GT 250000,MOTHER,Ballot b
...,...,...,...,...,...,...,...,...,...,...,...
4359,2022,3401,"Unemployed, laid off, looking for work",Married,22,High school,FEMALE,White,"COUNTRY,NONFARM",MOTHER & FATHER,Ballot a
4360,2022,3461,In school,Never married,22,Less than high school,MALE,Black,TOWN LT 50000,MOTHER,Ballot c
4361,2022,3465,In school,Never married,20,High school,FEMALE,White,TOWN LT 50000,MOTHER & FATHER,Ballot b
4362,2022,3477,Working part time,Never married,21,High school,FEMALE,Black,TOWN LT 50000,FEMALE RELATIVE,Ballot c
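
Calling `dropna()` with no arguments removes every row that is missing a value in any column, which is why the cleaned frame above starts in 1988 rather than 1972. Counting missing values per column first shows which variable is responsible. The following is a minimal sketch on made-up rows (the `toy` frame is hypothetical). It also shows `dropna(subset=...)` as one possible alternative that restricts the check to the columns an analysis actually uses.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real extract has many more.
toy = pd.DataFrame({
    "year": [1972, 1972, 1988, 2022],
    "degree": ["High school", np.nan, "Bachelor's", "High school"],
    "family16": ["MOTHER", "MOTHER & FATHER", "FATHER", "MOTHER"],
    "ballot": [np.nan, np.nan, "Ballot a", "Ballot c"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with a missing value in ANY column.
all_complete = toy.dropna()

# Drop rows only when one of the listed columns is missing.
needed_only = toy.dropna(subset=["degree", "family16"])

print(len(all_complete), len(needed_only))
```

Whether to keep the stricter rule is a judgment call: it gives one consistent sample for every table, at the cost of discarding rows that are complete for the variables being compared.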


In [83]:
gss['family16'].describe()

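# Counts of respondents by family structure at age 16 (rows) and highest degree (columns).
family_degree=pd.crosstab(gss['family16'],gss['degree'])
family_degree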

family16,Associate/junior college,Bachelor's,Graduate,High school,Less than high school
FATHER,2,1,0,79,22
FATHER & STPMOTHER,0,1,0,52,16
FEMALE RELATIVE,0,0,0,32,20
M AND F RELATIVES,2,0,0,34,8
MALE RELATIVE,0,0,0,8,4
MOTHER,24,9,0,366,149
MOTHER & FATHER,64,37,3,1101,252
MOTHER & STPFATHER,4,1,0,165,54
OTHER,3,1,0,32,33
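
Raw counts are dominated by the large MOTHER & FATHER group, so row proportions may be easier to compare across family types. On the real frame, `pd.crosstab(gss['family16'], gss['degree'], normalize='index')` should give these directly. The sketch below instead rebuilds three rows from the counts printed above so that it runs on its own; the names `counts` and `row_share` are illustrative.

```python
import pandas as pd

degrees = ["Associate/junior college", "Bachelor's", "Graduate",
           "High school", "Less than high school"]

# Three rows copied from the crosstab output above.
counts = pd.DataFrame(
    [[2, 1, 0, 79, 22],
     [24, 9, 0, 366, 149],
     [64, 37, 3, 1101, 252]],
    index=["FATHER", "MOTHER", "MOTHER & FATHER"],
    columns=degrees,
)

# Divide each row by its total so every row sums to 1.
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(3))
```

Small groups such as MALE RELATIVE (12 respondents in the table above) give unstable proportions, so it is worth reporting the row totals alongside any shares.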
