## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7,000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" It also includes a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.
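
Step 3 is usually where most of the work is. As a minimal sketch, assuming your extract encodes missing responses as strings such as `.i: Inapplicable` or `.n: No answer`, one way to turn those codes into proper missing values is shown below. The toy rows and the helper name `mask_missing_codes` are illustrative only and are not part of any GSS tooling.

```python
import pandas as pd

# Toy rows mimicking the coded-missing strings that can appear in a GSS extract.
toy = pd.DataFrame({
    "gunlaw": ["FAVOR", ".i: Inapplicable", "OPPOSE", ".n: No answer"],
    "owngun": ["NO", ".i: Inapplicable", "YES", "NO"],
})

def mask_missing_codes(frame):
    """Return a copy where strings like '.i: Inapplicable' become missing values."""
    out = frame.copy()
    for col in out.columns:
        # Only text-like columns can hold these codes; skip numeric ones.
        if not pd.api.types.is_numeric_dtype(out[col]):
            is_code = out[col].astype(str).str.match(r"^\.[a-z]:")
            out[col] = out[col].mask(is_code)
    return out

clean = mask_missing_codes(toy)
print(clean.isna().sum())  # number of masked entries per column
```

The regular expression assumes every missing code starts with a period, one lowercase letter, and a colon. Check `df[col].unique()` on your own extract before relying on that pattern.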

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

## Part 1

* Clone the repo below and adjust the file path as necessary to import the data.

In [2]:
! git clone https://github.com/kimberlyyliuu/EDA/

Cloning into 'EDA'...
remote: Enumerating objects: 84, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 84 (delta 16), reused 15 (delta 15), pack-reused 62 (from 1)
Receiving objects: 100% (84/84), 10.39 MiB | 18.12 MiB/s, done.
Resolving deltas: 100% (28/28), done.


In [4]:
import pandas as pd

df = pd.read_excel('/content/EDA/lab/GSS.xlsx')
df.head()

|   | year | id_ | age | sex | race | income06 | rincom06 | gunlaw | abany | owngun | conrinc | ballot |
|---|------|-----|-----|-----|------|----------|----------|--------|-------|--------|---------|--------|
| 0 | 2010 | 1 | 31 | MALE | Other | $75000 TO $89999 | $75000 TO $89999 | .i: Inapplicable | .i: Inapplicable | .i: Inapplicable | 66247.5 | Ballot b |
| 1 | 2010 | 2 | 23 | FEMALE | White | $15000 TO 17499 | $7 000 TO 7 999 | .i: Inapplicable | .i: Inapplicable | .i: Inapplicable | 6022.5 | Ballot b |
| 2 | 2010 | 3 | 71 | FEMALE | Black | $20000 TO 22499 | .i: Inapplicable | FAVOR | NO | NO | -100.0 | Ballot a |
| 3 | 2010 | 4 | 82 | FEMALE | White | $8 000 TO 9 999 | .i: Inapplicable | .i: Inapplicable | .i: Inapplicable | .i: Inapplicable | -100.0 | Ballot b |
| 4 | 2010 | 5 | 78 | FEMALE | Black | .d: Do not Know/Cannot Choose | .i: Inapplicable | FAVOR | .n: No answer | NO | -100.0 | Ballot c |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11771 entries, 0 to 11770
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   year      11771 non-null  int64  
 1   id_       11771 non-null  int64  
 2   age       11771 non-null  object 
 3   sex       11771 non-null  object 
 4   race      11771 non-null  object 
 5   income06  11771 non-null  object 
 6   rincom06  11771 non-null  object 
 7   gunlaw    11771 non-null  object 
 8   abany     11771 non-null  object 
 9   owngun    11771 non-null  object 
 10  conrinc   11771 non-null  float64
 11  ballot    11771 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 1.1+ MB
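
The `object` dtype on `age` suggests the column mixes numbers with text labels. Here is a hedged sketch of one way to coerce such a column. It uses made-up values rather than the real file, and the `.n: No answer` label is only an assumed example of what the text entries might look like.

```python
import pandas as pd

# Made-up stand-in for an `age` column that mixes numbers and text labels.
age_raw = pd.Series([31, 23, ".n: No answer", 71], name="age")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
age_num = pd.to_numeric(age_raw, errors="coerce")

print(age_num.dtype)         # numeric dtype after coercion
print(age_num.isna().sum())  # count of entries that were not numeric
```

Coercion silently discards any text label that carries information, so it is worth inspecting `age_raw[age_num.isna()].unique()` before dropping those rows.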


## Part 2

**Data Description and Background**

> As mentioned above, our data come from the General Social Survey (GSS), a biennial, nationally representative survey of Americans with an abundance of variables to choose from. In short, it is a rich source of data on a wide range of topics. We extracted 12 variables, forming a dataset of 11,771 entries.

> Our dataset centers on attitudes toward particular social issues. Specifically, we selected the gunlaw variable, which records a FAVOR or OPPOSE response to the question "Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?" Additionally, we chose the abany variable, which records whether the respondent believes it should be possible for a pregnant woman to obtain a legal abortion if the woman wants it for any reason.

> To support those primary variables of interest, we also collected the year variable, which gives the GSS survey year for the respondent, and the id_ variable, which is the respondent's unique identification number. Additionally, we gathered demographic and economic information on each respondent.


> The demographic variables we collected are age, sex, race, and owngun (whether someone has a gun or revolver in their house or garage). The economic variables we gathered on each respondent are income06 and rincom06, which refer to total family income and respondent income, respectively. These income variables are available for the years 2006, 2008, 2010, 2012, and 2014. Additionally, we gathered conrinc, which is inflation-adjusted personal income.
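
In the `df.head()` preview, `conrinc` is -100.0 exactly where `rincom06` is inapplicable. This suggests, but does not prove, that negative values act as a missing-data sentinel; the codebook should be checked. Under that assumption, a minimal sketch on the five preview values:

```python
import pandas as pd

# Values copied from the preview rows; -100.0 is ASSUMED to mean "no income reported".
toy = pd.DataFrame({"conrinc": [66247.5, 6022.5, -100.0, -100.0, -100.0]})

# Keep non-negative incomes; everything else becomes NaN.
toy["conrinc_clean"] = toy["conrinc"].where(toy["conrinc"] >= 0)

# Summary statistics skip NaN by default, so the sentinel no longer distorts them.
print(toy["conrinc"].mean(), toy["conrinc_clean"].mean())
```

Leaving the sentinel in place would pull the mean and other summaries toward zero, which is why it is handled before computing any statistics.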

> With these data collected, we felt confident that we could analyze public opinion on these controversial topics and gain a deeper perspective on how American society views them.
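
One simple way to start on that kind of analysis is a row-normalized cross-tabulation of the two attitude variables. The rows below are invented for illustration (real responses should first have their missing codes removed), so the proportions mean nothing beyond showing the mechanics.

```python
import pandas as pd

# Invented responses; not real GSS data.
toy = pd.DataFrame({
    "gunlaw": ["FAVOR", "FAVOR", "OPPOSE", "FAVOR", "OPPOSE", "FAVOR"],
    "abany":  ["YES",   "NO",    "NO",     "YES",   "YES",    "YES"],
})

# normalize="index" makes each row sum to 1, giving the share of
# abany answers within each gunlaw group.
shares = pd.crosstab(toy["gunlaw"], toy["abany"], normalize="index")
print(shares)
```

Normalizing by row rather than over the whole table makes the groups comparable even when far more respondents favor the permit law than oppose it.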

