## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also includes a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small set (5-15) of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.
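
Steps 3 and 4 can be sketched roughly as follows. This is a minimal sketch, not a required approach: the toy frame stands in for a downloaded extract, its values are invented for illustration, and the list of non-response codes is an assumption that should be checked against the codes in your own download.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a GSS extract; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": ["25", "41", ".n:  No answer", "33", "58"],
    "degree": ["High school", "Bachelor's", "Graduate",
               ".i:  Inapplicable", "High school"],
})

# Step 3: turn non-response codes into NaN and make age numeric.
# The codes listed here are an assumption; check your own extract.
missing_codes = [".i:  Inapplicable", ".n:  No answer"]
toy = toy.replace(missing_codes, np.nan)
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Step 4: numeric summaries for one numeric and one categorical variable.
age_summary = toy["age"].describe()
degree_counts = toy["degree"].value_counts()
print(age_summary)
print(degree_counts)
```

`describe()` and `value_counts()` both skip missing values by default, so the summaries reflect only the respondents who gave a usable answer.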

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult, or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
import os
# os.remove('GSS.xlsx')
# upload=files.upload()

In [3]:
gss=pd.read_excel('GSS.xlsx')
gss.shape

(72390, 11)

In [122]:
# Replace GSS non-response codes with NaN and convert age from strings to numbers.
# gss=gss.drop(columns='ballot')
missing_codes=['.i:  Inapplicable','.n:  No answer','.s:  Skipped on Web']
dont_know='.d:  Do not Know/Cannot Choose'

# Categorical variables: non-responses become NaN, while "don't know"
# is kept as its own category under the label given here.
dont_know_labels={'marital':'Unknown','wrkstat':'Did Not Answer',
                  'degree':'Did Not Answer','sex':'Did Not Answer',
                  'race':'Did Not Answer','res16':'Did Not Answer',
                  'family16':'Did Not Answer'}
for col,label in dont_know_labels.items():
    gss[col]=gss[col].replace(missing_codes,np.nan).replace(dont_know,label)

# Age: every non-response code, including "don't know", counts as missing.
gss['age']=gss['age'].replace(missing_codes+[dont_know],np.nan)
gss['age']=pd.to_numeric(gss['age'],errors='coerce')

# Drop every row that has a missing value in any column.
gss=gss.dropna()
gss['res16'].value_counts()

res16,count
TOWN LT 50000,1429
50000 TO 250000,795
CITY GT 250000,649
BIG-CITY SUBURB,645
"COUNTRY,NONFARM",501
FARM,235
Did Not Answer,3
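
The counts above sum to 4,257, against 72,390 rows at load time. Called with no arguments, `dropna()` removes a row if any column is missing, so it can be worth checking which columns drive that loss. A minimal sketch on a toy frame: the values are invented, the column names mirror the ones used here, and restricting the drop with `subset` is offered as an option rather than as what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names mirror the notebook's.
toy = pd.DataFrame({
    "age": [21.0, np.nan, 19.0, 22.0],
    "res16": ["FARM", "FARM", np.nan, "TOWN LT 50000"],
    "ballot": [np.nan, np.nan, np.nan, "Ballot a"],
})

# Missing values per column, to see which variable drives the row loss.
missing_per_column = toy.isna().sum()

# Dropping on all columns keeps only fully complete rows.
complete_rows = toy.dropna()

# Restricting the check to the analysis variables keeps more rows.
analysis_rows = toy.dropna(subset=["age", "res16"])

print(missing_per_column)
print(len(complete_rows), len(analysis_rows))
```

In the toy frame a single sparsely answered column (`ballot`) is responsible for most of the dropped rows, which is the pattern to look for in the real data.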


In [97]:
gss['family16'].describe()

family_degree=pd.crosstab(gss['family16'],gss['degree'])
family_degree

family16 \ degree,Associate/junior college,Bachelor's,Graduate,High school,Less than high school,Unknown
FATHER,2,1,0,79,22,1
FATHER & STPMOTHER,0,1,0,52,16,0
FEMALE RELATIVE,0,0,0,32,20,0
M AND F RELATIVES,2,0,0,34,8,0
MALE RELATIVE,0,0,0,8,4,0
MOTHER,24,9,0,367,149,0
MOTHER & FATHER,65,37,3,1101,252,0
MOTHER & STPFATHER,4,1,0,165,55,1
OTHER,3,1,0,32,33,0
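
The groups in this table differ widely in size (1,458 respondents for MOTHER & FATHER, 12 for MALE RELATIVE), so raw counts are hard to compare across rows. One option is to convert each row to shares with the `normalize` argument of `pd.crosstab`. A minimal sketch on toy data: the values are invented and only the column names mirror the ones used here.

```python
import pandas as pd

# Toy data with invented values; the column names mirror the notebook's.
toy = pd.DataFrame({
    "family16": ["MOTHER", "MOTHER", "MOTHER",
                 "MOTHER & FATHER", "MOTHER & FATHER"],
    "degree": ["High school", "High school", "Less than high school",
               "High school", "Bachelor's"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
row_shares = pd.crosstab(toy["family16"], toy["degree"], normalize="index")
print(row_shares.round(2))
```

Applied to the real data, the same call would be `pd.crosstab(gss['family16'],gss['degree'],normalize='index')`, and the result could be passed to `sns.heatmap(..., annot=True)` for a quick visual comparison. Shares for very small groups should be read with caution.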


In [123]:
gss['age'].unique()

array([21, 20, 22, 19, 18])