## Getting the GSS Data

The data files are about 40GB zipped, so we can't store a compressed or uncompressed version on GitHub, and the entire dataset won't fit in memory on Colab.

One option is to use Rivanna: download the data, unzip it, and work on it in a persistent environment.

The other option is to avoid opening the entire file at once, and instead work with chunks of the data. That's what this code does for you.

On GitHub, the data are broken into three smaller files, saved in .parquet format. The code below loads these chunks into memory one at a time. You specify the variables you want in `var_list`, and the results are saved in `selected_gss_data.csv`.
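The write-then-append pattern described here can be tried offline on small in-memory tables before running it on the real chunks. The following is a minimal sketch, not the notebook's own code: the helper name `save_selected`, the toy DataFrames, and the temporary file standing in for `selected_gss_data.csv` are all made up for illustration.

```python
import os
import tempfile

import pandas as pd


def save_selected(chunks, var_list, output_filename):
    """Write the chosen columns of each chunk to one CSV file.

    Hypothetical helper: the first chunk creates the file and writes the
    header row; every later chunk is appended without a header.
    """
    for k, df in enumerate(chunks):
        first = (k == 0)
        df.loc[:, var_list].to_csv(
            output_filename,
            mode='w' if first else 'a',  # overwrite once, then append
            header=first,                # column names only on the first write
            index=False,                 # no row index saved
        )


# Toy stand-ins for the three GSS chunks (values are invented).
toy_chunks = [
    pd.DataFrame({'year': [1972, 1972], 'age': [23.0, 70.0], 'extra': ['a', 'b']}),
    pd.DataFrame({'year': [1973], 'age': [48.0], 'extra': ['c']}),
    pd.DataFrame({'year': [1974, 1974], 'age': [27.0, 61.0], 'extra': ['d', 'e']}),
]

demo_path = os.path.join(tempfile.mkdtemp(), 'selected_demo.csv')
save_selected(toy_chunks, ['year', 'age'], demo_path)

# Read the combined file back to confirm all chunks landed in one table.
combined = pd.read_csv(demo_path)
print(combined.shape)
```

Only one chunk is held in memory at a time, and the unselected column (`extra` in this toy example) never reaches the output file.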

You can add more cleaning instructions between the line where the data are loaded (`df = pd.read_parquet(url)`) and the lines where the data are saved (`df.loc...`). It's probably easiest to use this code to get only the variables you want, and then clean that subset of the data.
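As one illustration of such a cleaning step, the sketch below defines a hypothetical `clean_chunk` function that could be called on `df` right after it is loaded. The specific rules (coercing `age` to numeric, tidying the `sex` labels, dropping rows with no `year`) are assumptions chosen for the example, not recommendations from the GSS documentation, and the toy values are invented.

```python
import pandas as pd


def clean_chunk(df):
    """Return a cleaned copy of one chunk (illustrative rules only)."""
    out = df.copy()
    # Turn anything non-numeric in 'age' into NaN instead of raising an error.
    out['age'] = pd.to_numeric(out['age'], errors='coerce')
    # Normalize text labels: strip stray whitespace and lowercase them.
    out['sex'] = out['sex'].str.strip().str.lower()
    # Drop rows that have no survey year.
    out = out.dropna(subset=['year'])
    return out


# Toy chunk with invented values, standing in for a freshly loaded `df`.
toy = pd.DataFrame({
    'year': [1972, None, 1974],
    'age': ['23', '70', 'unknown'],
    'sex': [' Female', 'male', 'MALE '],
})

cleaned = clean_chunk(toy)
print(cleaned)
```

In the loop, a call such as `df = clean_chunk(df)` would sit between the load line and the save lines, so each chunk is cleaned before it is written.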

import pandas as pd
#
var_list = ['year', 'age', 'sex', 'race', 'educ', 'marital', 'realinc', 'region', 'polviews', 'partyid', 'relig', 'happy'] # List of variables you want to save
output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to
#
phase = 0 # Starts in write mode; after one iteration of loop, switches to append mode
#
for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    print(url) # Check the url is correct
    df = pd.read_parquet(url) # Download this chunk of data
    print(df.loc[:, var_list].head())
    if phase == 0 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='w', # control write versus append
                                header=var_list, # variable names
                                index=False) # no row index saved
        phase = 1 # Switch from write mode to append mode
    elif phase == 1 :
        df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                                mode='a', # append to the file created on the first pass
                                header=False, # do not repeat the variable names when appending
                                index=False) # no row index saved

https://github.com/DS3001/project_gss/raw/main/gss_chunk_1.parquet
   year   age     sex   race  educ        marital  realinc  \
0  1972  23.0  female  white  16.0  never married  18951.0   
1  1972  70.0    male  white  10.0        married  24366.0   
2  1972  48.0  female  white  12.0        married  24366.0   
3  1972  27.0  female  white  17.0        married  30458.0   
4  1972  61.0  female  white  12.0        married  50763.0   

               region polviews                             partyid  \
0  east north central      NaN      independent, close to democrat   
1  east north central      NaN            not very strong democrat   
2  east north central      NaN  independent (neither, no response)   
3  east north central      NaN            not very strong democrat   
4  east north central      NaN                     strong democrat   

        relig          happy  
0      jewish  not too happy  
1    catholic  not too happy  
2  protestant   pretty happy  
3       other  