# Getting GSS Data

In [1]:
import pandas as pd

var_list = ['trust', 'age', 'sex', 'race', 'educ', 'relig', 'region']
output_filename = 'selected_gss_data.csv'

for k in range(3):
    # The chunks are numbered 1 through 3 in the repository
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet'
    print(url)
    df = pd.read_parquet(url)
    print(df.head())
    first_chunk = (k == 0)
    # Write the first chunk with a header row; append later chunks without one
    df.loc[:,var_list].to_csv(output_filename,
                              mode='w' if first_chunk else 'a',
                              header=first_chunk,
                              index=False)

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year  id            wrkstat  hrs1  hrs2 evwork    occ  prestige  \
0  1972   1  working full time   NaN   NaN    NaN  205.0      50.0   
1  1972   2            retired   NaN   NaN    yes  441.0      45.0   
2  1972   3  working part time   NaN   NaN    NaN  270.0      44.0   
3  1972   4  working full time   NaN   NaN    NaN    1.0      57.0   
4  1972   5      keeping house   NaN   NaN    yes  385.0      40.0   

         wrkslf wrkgovt  ...  agehef12 agehef13 agehef14  hompoph wtssps_nea  \
0  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
1  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
2  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
3  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   
4  someone else     NaN  ...       NaN      NaN      NaN      NaN        NaN   

   wtssnrps_nea  wtssps_next wt

## EDA LAB

The General Social Survey (GSS) is a nationally representative survey of Americans, currently conducted every two years, with almost 7000 different questions asked since the survey began in 1972. It asks straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also covers a variety of controversial topics. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest. (You can also check out `get_gss.ipynb` for some processed data.)
2. Write a short description of the data you chose, and why. (~500 words)
3. Load the data using Pandas. Clean them up for EDA. Do this in this notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations.
5. Describe your findings. (500 - 1000 words, or more)

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.
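
As a rough illustration of steps 3 and 4, the sketch below builds a tiny made-up table with a few of the column names used in this notebook and produces a numeric summary, a frequency table, and a cross-tabulation. The values in `sample` are invented for illustration and are not real GSS responses; with real data you would start from the loaded DataFrame instead.

```python
import pandas as pd

# Tiny made-up sample; column names follow this notebook, values are illustrative only
sample = pd.DataFrame({
    'trust': ['can trust', "can't be too careful", 'depends',
              "can't be too careful", 'can trust', None],
    'age': [23.0, 70.0, 48.0, 27.0, 61.0, 35.0],
    'educ': [16.0, 10.0, 12.0, 17.0, None, 12.0],
    'sex': ['female', 'male', 'female', 'female', 'male', 'male'],
})

# Numeric summary of the quantitative variables (missing values are skipped)
numeric_summary = sample[['age', 'educ']].describe()
print(numeric_summary)

# Frequency table for a categorical variable, keeping missing answers visible
trust_counts = sample['trust'].value_counts(dropna=False)
print(trust_counts)

# Cross-tabulation of two categorical variables; rows with a missing answer are left out
trust_by_sex = pd.crosstab(sample['sex'], sample['trust'])
print(trust_by_sex)
```

Passing `dropna=False` to `value_counts` is a deliberate choice here: in survey data the share of missing answers is often informative in its own right.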


In [2]:
df = pd.read_csv('selected_gss_data.csv')
df.head()

                  trust   age     sex   race  educ       relig              region
0               depends  23.0  female  white  16.0      jewish  east north central
1             can trust  70.0    male  white  10.0    catholic  east north central
2  can't be too careful  48.0  female  white  12.0  protestant  east north central
3  can't be too careful  27.0  female  white  17.0       other  east north central
4  can't be too careful  61.0  female  white  12.0  protestant  east north central


In [6]:
# Find the range of the numeric variables (age in years, educ in years of schooling)
print(df.age.max())
print(df.age.min())

print(df.educ.max())
print(df.educ.min())

89.0
18.0
20.0
0.0
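
The same range check can be written as a single aggregation, which scales better when there are more numeric columns. This is a minimal sketch on a made-up frame called `demo`; the values are illustrative only, and in the notebook the call would be applied to the loaded `df`.

```python
import pandas as pd

# Illustrative values only; stands in for the numeric columns of the loaded data
demo = pd.DataFrame({'age': [23.0, 70.0, 48.0, None],
                     'educ': [16.0, 10.0, None, 0.0]})

# One row per statistic, one column per variable; NaN is skipped by default
ranges = demo[['age', 'educ']].agg(['min', 'max'])
print(ranges)
```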


In [7]:
# Replace blank or unknown answers with missing values
df.replace(['DK', 'NA', '', 'N/A'], pd.NA, inplace=True)
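
It can help to check how many values are actually affected by a replacement like this. By default, `pd.read_csv` already parses empty fields, `NA`, and `N/A` as missing, so of the codes listed above only `DK` can still be present after loading. The sketch below demonstrates this on a small made-up CSV string (`csv_text` is an assumption for illustration, not the real file).

```python
import io
import pandas as pd

# Made-up CSV text standing in for selected_gss_data.csv
csv_text = "trust,age\ncan trust,23\nDK,70\nNA,48\n,27\nN/A,61\n"

demo = pd.read_csv(io.StringIO(csv_text))

# 'NA', 'N/A', and empty fields are already missing after read_csv
missing_after_read = demo['trust'].isna().sum()

# 'DK' is not a default missing-value marker, so it still needs replacing
cleaned = demo.replace('DK', pd.NA)
missing_after_clean = cleaned['trust'].isna().sum()

print(missing_after_read, missing_after_clean)
```

Running `df.isna().sum()` before and after cleaning the real data gives the same kind of per-column check.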