## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also covers a variety of controversial topics. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might look at how aspects of a person's childhood family correlate with their career or family choices as an adult, or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.
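
As a minimal sketch of this kind of comparison, a row-normalized cross-tabulation shows how one categorical variable is distributed within each level of another. The tiny DataFrame below is made up for illustration (its values are not GSS results); the column names `marital` and `wrkstat` are borrowed from the extract used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; NOT real GSS responses
toy = pd.DataFrame({
    "marital": ["Married", "Married", "Never married",
                "Never married", "Married", "Divorced"],
    "wrkstat": ["Working full time", "Retired", "Working full time",
                "Working part time", "Working full time", "Working full time"],
})

# Share of each work status within each marital status (each row sums to 1)
shares = pd.crosstab(toy["marital"], toy["wrkstat"], normalize="index")
print(shares.round(2))
```

Normalizing by row makes groups of different sizes comparable, which matters in survey data where some categories have far fewer respondents than others.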

Feel free to work with other people in groups, and ask questions!

In [34]:
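import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Delete any earlier copy first so the new upload keeps the name 'GSS.xlsx'.
# The existence check avoids a FileNotFoundError on the first run.
if os.path.exists('GSS.xlsx'):
    os.remove('GSS.xlsx')
upload=files.upload()  # opens a file picker in Colab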

Saving GSS.xlsx to GSS.xlsx


In [84]:
gss=pd.read_excel('GSS.xlsx')
gss

Unnamed: 0,year,id_,wrkstat,hrs2,wrkslf,commute,occ10,marital,agewed,widowed,ballot
0,1972,1,Working full time,.i: Inapplicable,Someone else,.i: Inapplicable,"Wholesale and retail buyers, except farm products",Never married,.i: Inapplicable,.i: Inapplicable,.i: Inapplicable
1,1972,2,Retired,.i: Inapplicable,Someone else,.i: Inapplicable,First-line supervisors of production and opera...,Married,21,.i: Inapplicable,.i: Inapplicable
2,1972,3,Working part time,.i: Inapplicable,Someone else,.i: Inapplicable,Real estate brokers and sales agents,Married,20,.i: Inapplicable,.i: Inapplicable
3,1972,4,Working full time,.i: Inapplicable,Someone else,.i: Inapplicable,Accountants and auditors,Married,24,.i: Inapplicable,.i: Inapplicable
4,1972,5,Keeping house,.i: Inapplicable,Someone else,.i: Inapplicable,Telephone operators,Married,22,.i: Inapplicable,.i: Inapplicable
...,...,...,...,...,...,...,...,...,...,...,...
72385,2022,3541,Working full time,.i: Inapplicable,Someone else,.y: Not available in this year,"Hotel, motel, and resort desk clerks",Never married,.y: Not available in this year,.i: Inapplicable,Ballot a
72386,2022,3542,Working full time,.i: Inapplicable,Someone else,.y: Not available in this year,Elementary and middle school teachers,Married,.y: Not available in this year,NO,Ballot a
72387,2022,3543,Working full time,.i: Inapplicable,Someone else,.y: Not available in this year,Respiratory therapists,Never married,.y: Not available in this year,.i: Inapplicable,Ballot b
72388,2022,3544,Working full time,.i: Inapplicable,Someone else,.y: Not available in this year,Elementary and middle school teachers,Married,.y: Not available in this year,NO,Ballot c
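
Before cleaning, it can help to see how much of each column consists of missing-value labels. The sketch below assumes, based only on the rows displayed above, that every such label starts with a period, a lowercase letter, and a colon (for example `.i:` or `.y:`). The helper `count_missing_codes` and the three-row `sample` frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the labels seen in the output above
sample = pd.DataFrame({
    "agewed": [".i: Inapplicable", 21, ".y: Not available in this year"],
    "marital": ["Never married", "Married", "Married"],
})

def count_missing_codes(df):
    """Count, per column, the values that look like '.x: ...' missing labels."""
    as_text = df.astype(str)
    # str.match anchors the pattern at the start of each string
    return as_text.apply(lambda col: col.str.match(r"\.[a-z]:").sum())

counts = count_missing_codes(sample)
print(counts)
```

On the real data the call would be `count_missing_codes(gss)`. Columns that turn out to be almost entirely missing labels may not be worth keeping for the analysis.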


In [87]:
# Categorical columns: collapse the "don't know", "no answer", and "skipped" labels into 'Unknown'
gss['ballot']=gss['ballot'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

# Numeric columns: errors='coerce' turns every non-numeric label (the three above,
# '.i: Inapplicable', '.y: Not available in this year') into NaN. NaN is skipped by
# mean() and describe(), whereas a sentinel such as -999 would be averaged in.
gss['agewed']=pd.to_numeric(gss['agewed'],errors='coerce')

# hrs2 and commute are top-coded: recode the open-ended label to its lower bound first
gss['hrs2']=gss['hrs2'].replace('89+ hrs',89)
gss['hrs2']=pd.to_numeric(gss['hrs2'],errors='coerce')

gss['commute']=gss['commute'].replace('97+ MINUTES',97)
gss['commute']=pd.to_numeric(gss['commute'],errors='coerce')

gss['occ10']=gss['occ10'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

gss['marital']=gss['marital'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

gss['widowed']=gss['widowed'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

gss['wrkstat']=gss['wrkstat'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

# Check which work-status categories remain after cleaning
gss['wrkstat'].unique()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gss['ballot']=gss['ballot'].replace(['.d:  Do not Know/Cannot Choose','.n:  No answer','.s:  Skipped on Web'],'Unknown')

array(['With a job, but not at work because of temporary illness, vacation, strike'],
      dtype=object)
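
The message in the output above is the text of pandas' `SettingWithCopyWarning`, which appears when columns are assigned on a DataFrame that was itself produced by slicing or filtering another one. The single value returned by `unique()` suggests that `gss` had been filtered on `wrkstat` in an earlier cell, although that cell is not shown here. A common way to avoid the warning, sketched below on hypothetical toy data, is to take an explicit `.copy()` of the filtered subset before modifying it.

```python
import pandas as pd

# Hypothetical toy data standing in for the full extract
full = pd.DataFrame({
    "wrkstat": ["Working full time", "Retired", "Working full time"],
    "hrs2": ["40", ".i: Inapplicable", "89+ hrs"],
})

# Explicit copy: later assignments change this subset only, never `full`
subset = full[full["wrkstat"] == "Working full time"].copy()

# Recode the top-coded label to its lower bound, then convert to numbers
subset["hrs2"] = pd.to_numeric(
    subset["hrs2"].replace("89+ hrs", "89"), errors="coerce"
)
print(subset)
```

With the copy in place, the original frame keeps its raw labels, so the cleaning step can be rerun or revised without reloading the file.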