<a href="https://colab.research.google.com/github/madisonhgallagher/project_gss/blob/main/GSS_integrated_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DS 3001 Project #1

Madison Gallagher, Isabella Wright, Megan Vander Wiele

Due Date: 3/11/2024

## Summary

Since 1972, the General Social Survey (GSS) has asked Americans a range of questions about themselves and their personal beliefs. Alongside an extensive list of more detailed questions, the survey asks about participants' age, level of education, religion, and whether they favor or oppose capital punishment for murder. From this data, we wanted to find out whether religion, age, or education correlates with people's opinions on capital punishment.

To answer this question, the GSS data first needed to be "cleaned," meaning that missing values and nonsensical answers were recategorized or dropped. The data was also tidied by binning ages into groups and re-categorizing religions into broader groups to allow for a clearer analysis. Initial exploration was done by cross-tabulating variables to look for clear correlations and by creating count plots to see the distribution within each variable. Kernel density estimate (KDE) plots were then made to further analyze how age, education, and religion relate to opinions on capital punishment. KDE plots visualize the estimated distribution of observations within a variable. They are especially useful for our question because each variable, such as education or age, can be split by those who favor or oppose the death penalty. This allows a "switch" in majority opinion to be observed where the density curve of one opinion crosses the other.

The initial cross-tabulation showed that every religion had a majority in favor of the death penalty, and that Christians, non-religious people, and Jewish people had the highest percentages in favor. The KDE plot comparing opinion on capital punishment to age revealed that people between 20 and 40 have the strongest preference for capital punishment, and that this preference decreases with age. A KDE plot of opinion over level of education also revealed that highly educated people tend not to favor the death penalty. KDE plots of age and education split by religion showed that Eastern religions such as Hinduism have higher opposition rates at older ages and higher levels of education compared to religions which fall under the general umbrella of

## Data
All data used was collected from the General Social Survey. The variables in question were religion, age, education, and opinion on capital punishment, which were coded as the variables RELIG, AGE, EDUC, and CAPPUN, respectively. The survey question regarding religion was phrased as “What is your religious preference? Is it Protestant, Catholic, Jewish, some other religion, or no religion?” with the possible answer choices of protestant, catholic, jewish, none, other, buddhism, hinduism, other eastern religions, muslim/islam, orthodox christian, christian, native american, interdenominational, don't know, no answer, or skipped. All values for don’t know, no answer, and skipped were recorded as n/a, for a total of 437 n/a values. Despite the answer choices given in the codebook, some recorded answers to the religion question were the literal string “relig,” which is not a meaningful response. Given the wide range of possible answer choices for Christians, the answers catholic, protestant, inter-nondenominational, christian, and orthodox-christian were grouped into a new “Christian” answer choice so that clearer conclusions could be drawn. The “relig” answers were grouped into “Other”, and because the n/a values made up less than 1% of total responses, the n/a values were dropped.

Age data was given as a range from 19 to 89, with 647 n/a values, which were dropped since they were less than 1% of the survey population. Ages were grouped into 18-30, 31-40, 41-50, 51-60, 61-70, 71-80, and 81-90 to make the analysis clearer. Education data was given as a number from 0 to 20, with 0 meaning no formal schooling and 20 meaning 8 or more years of college education. There were 178 n/a values, which were dropped for the same reason. Education data was coerced to numeric values to allow for quantitative analysis.

The question regarding opinion on capital punishment was phrased as “Do you favor or oppose the death penalty for persons convicted of murder?” with answer choices of favor, oppose, don’t know, no answer, or skipped. The latter three were coded as n/a, which makes the response a binary variable. There were 204 n/a responses, which were dropped since they represented less than 1% of the survey population.

After data wrangling, age group, opinion on capital punishment, and religion were represented as categorical values, while education was numerical. The decision to group age into categories arose from difficulties with creating plots with age as numerical data. The age distribution was concentrated around a few specific ages, making it extremely difficult to determine where trends in opinion on capital punishment lay. Grouping the data resolved this issue by giving a more even distribution across the age ranges and making the plots clearer.

Another issue arose when grouping religions. The vast majority of participants identified as some type of Christian, and there were 5 values that all correspond to the overarching Christian faith. When these values were not grouped, correlations were difficult to determine because of the large number of categories and the wide spread across Christian denominations. Grouping the Christian values solved this issue but created another: over-dominance in the count plots. Christians outnumbered other religions to such an extent that graphs comparing all religions were not useful, since the scale was too large to read the other religions. This was solved by giving religions their own KDE plots, which use probability densities and therefore do not depend on a count scale shared across all religions.


## Does religion, age, or education correlate with people's opinions on capital punishment?

Importing the data and necessary libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv('selected_gss_data.csv',low_memory=False)


Our selected variables to answer our question are relig, cappun, age, and educ, which according to the GSS codebook have the following meanings.

- RELIG: What is your religious preference?
- CAPPUN: Do you favor or oppose the death penalty for persons convicted of murder?
- AGE: RESPONDENT'S AGE
- EDUC: RESPONDENT'S EDUCATION

In [None]:
print(df.shape)
print(df.dtypes)
df.head()

## Handling errors and missing data

In [None]:
cappun = df['cappun']
cappun.unique()

In [None]:
relig = df['relig']
relig.unique()

In [None]:
age = df['age']
age.unique()

In [None]:
educ = df['educ']
educ.unique()

It looks like there are quite a few NaNs, as well as some entries with the value "cappun". "cappun" as an entry doesn't make sense, and since there are only 2 of them we will drop the rows with that entry. We do the same for age, religion, and education: any row whose entry equals the name of the variable is dropped.

In [None]:
df = df.drop(df[df['cappun'] == 'cappun'].index)
df = df.drop(df[df['age'] == 'age'].index)
df = df.drop(df[df['relig'] == 'relig'].index)
df = df.drop(df[df['educ'] == 'educ'].index)
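
The four `drop` calls above can also be written as a single filter. The following is a minimal sketch on a made-up frame; the `raw` data and the `is_header_echo` name are illustrative assumptions and not part of the GSS file.

```python
import pandas as pd

# Made-up rows: the second one repeats the column names, like the stray
# "cappun" / "age" entries seen in the real data
raw = pd.DataFrame({
    "cappun": ["favor", "cappun", "oppose", None],
    "age": ["34", "age", "51", "28"],
})

columns = ["cappun", "age"]

# True wherever a cell simply repeats its own column name
is_header_echo = pd.DataFrame({col: raw[col] == col for col in columns})

# Keep only rows where no column echoes its name
clean = raw[~is_header_echo.any(axis=1)]
print(clean)
```

Missing values are left in place by this filter, so the NaN handling that follows is unaffected.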

In [None]:
df['relig'].isna().sum() #count the nans in the religion column

There are 437 NaN values in the religion category. This is less than 1% of the observations, so we can drop these values.

In [None]:
df = df.dropna(subset=["relig"])
df['relig'].isna().sum() # now we can see that there are no NaN values for religion
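
The "less than 1%" check can be made explicit by computing the share of missing values per column rather than the raw count. A small sketch on made-up data (the `toy` frame is an assumption for illustration, so the shares here are far larger than in the survey):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries
toy = pd.DataFrame({
    "relig": ["catholic", None, "none", "jewish"],
    "educ": [12, 16, np.nan, np.nan],
})

# isna() gives True/False per cell; the mean of booleans is the missing share
missing_share = toy.isna().mean()
print(missing_share)
```

Running the same `isna().mean()` on the real frame before dropping rows would show directly whether each column stays under the 1% threshold.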

In [None]:
education = df['educ']
education.unique()

In [None]:
df['educ'].isna().sum() #count the nans in the educ column

In [None]:
df = df.dropna(subset=["educ"])
df['educ'].isna().sum() # now we can see that there are no NaN values for educ

In [None]:
df['age'].isna().sum() #count the nans in the age column

In [None]:
df = df.dropna(subset=["age"])
df['age'].isna().sum() # now we can see that there are no NaN values for age

In [None]:
# Group ages into bins
bins = [18, 30, 40, 50, 60, 70, 80, 90]
labels = ['18-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90']

# Convert 'age' to numeric; values that cannot be parsed become NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Drop rows with NaN values in the 'age' column
df = df.dropna(subset=['age'])

# Apply age binning. Right-closed bins with include_lowest=True match the labels:
# '18-30' holds ages 18 through 30, '31-40' starts at 31, and so on.
df['difage'] = pd.cut(df['age'], bins=bins, labels=labels, right=True, include_lowest=True)

print(df['difage'].value_counts())
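
Because labels such as '18-30' and '31-40' imply that age 30 belongs to the first group, it is worth checking how `pd.cut` treats the bin edges. A minimal sketch on a few made-up ages, comparing left-closed and right-closed bins:

```python
import pandas as pd

ages = pd.Series([18, 30, 31, 40, 89])  # made-up edge cases
bins = [18, 30, 40, 50, 60, 70, 80, 90]
labels = ['18-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90']

# right=False: intervals are [18, 30), [30, 40), ... so 30 lands in '31-40'
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True with include_lowest=True: [18, 30], (30, 40], ... so 30 lands in '18-30'
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({"age": ages, "left_closed": left_closed, "right_closed": right_closed}))
```

Only the ages that sit exactly on a bin edge (30 and 40 here) change groups between the two settings.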

In [None]:
# Convert 'educ' to numeric; values that cannot be parsed become NaN
df['educ'] = pd.to_numeric(df['educ'], errors='coerce')

# Drop rows with NaN values in the 'educ' column
df = df.dropna(subset=['educ'])

print(df['educ'].value_counts())

## Cleaning Categorical Data

We want to group the Christian denominations into one category, and the small leftover categories into another, to get a better understanding of how opinions differ generally across Christian vs. non-Christian religions.

In [None]:
#cleaning data - combining religion variables
difrelig = df['relig'] # Create a temporary vector of values for the relig variable to play with

difrelig = difrelig.replace(['catholic', 'protestant','inter-nondenominational','christian', 'orthodox-christian'],'christian') # All christian values

difrelig = difrelig.replace(['other', 'relig', 'other eastern religions'], 'Other Religions')

df['difrelig'] = difrelig # create a new 'difrelig' column with the grouped version
df['difrelig'].value_counts()
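
The same grouping can be kept in one place by passing a dictionary to `replace`. This is only a sketch on made-up values; the `group_map` name is hypothetical, and the category spellings are copied from the cell above rather than checked against the codebook.

```python
import pandas as pd

# Made-up sample of raw religion answers
relig = pd.Series(["catholic", "protestant", "jewish", "relig", "buddhism", "none"])

# One mapping from raw answer to broader group; unlisted answers are left unchanged
group_map = {
    "catholic": "christian",
    "protestant": "christian",
    "inter-nondenominational": "christian",
    "orthodox-christian": "christian",
    "other": "Other Religions",
    "relig": "Other Religions",
    "other eastern religions": "Other Religions",
}

grouped = relig.replace(group_map)
print(grouped.value_counts())
```

A dictionary makes it easy to review which raw answers end up in which group before the column is overwritten.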

## Results


Investigating the influence of religion, age, and education on people’s stance towards capital punishment, we observed several trends. The prevailing inclination across groups is a favorable view of capital punishment. Cross-tabulating the data shows that Buddhists, Hindus, inter-nondenominational respondents, Muslims, and Native Americans slightly favor the death penalty, while Catholics, Christians, Jewish people, non-religious people, Orthodox Christians, Protestants, and others strongly favor it. No religious group opposes it more than it favors it.


In [None]:
print(pd.crosstab(df['cappun'],df['relig']),'\n')

In [None]:
print(pd.crosstab(df['cappun'],df['difrelig']),'\n')
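
Raw counts make it hard to compare a very large group with a very small one. One option is to normalize each column of the cross-tabulation so that it shows the share of each opinion within a religion. A minimal sketch on made-up rows (the `toy` frame is an assumption, not GSS data):

```python
import pandas as pd

# Made-up responses
toy = pd.DataFrame({
    "cappun": ["favor", "favor", "oppose", "favor", "oppose", "oppose"],
    "difrelig": ["christian", "christian", "christian", "none", "none", "hinduism"],
})

# normalize="columns" divides each column by its total,
# giving the share of favor/oppose within each religion
shares = pd.crosstab(toy["cappun"], toy["difrelig"], normalize="columns")
print(shares.round(2))
```

With shares, "slightly favor" and "strongly favor" can be read off as how far the favor row sits above 0.5 for each group.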

Other basic statistics help put the study in context: the most frequent education level is 12 years, which corresponds to a high school education, and a majority of respondents were Christians between 18 and 50 years old. A simple bar chart with count on the y-axis and religion on the x-axis shows that the largest group of respondents was Protestant, followed by Catholic, and then those with no religion.

In [None]:
df['educ'].describe()

In [None]:
my_plot = sns.countplot(df, x="relig")
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=90)

A simple kernel density estimate (KDE) plot shows that, overall, more people favor capital punishment: the density curve for those in favor is higher. As people age, however, the difference between the two opinions narrows.

In [None]:
sns.kdeplot(data=df,x='age',hue='cappun')
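
When reading KDE curves split by `hue`, it matters how the curves are normalized. By default seaborn's `kdeplot` uses `common_norm=True`, which scales each group's density by its number of observations, so a larger group sits higher overall and a crossing point marks where one opinion becomes locally more common than the other. The sketch below illustrates the idea with plain NumPy on made-up ages and a hand-picked bandwidth; seaborn chooses its own bandwidth, so this is not a reproduction of the plot above.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
favor_age = rng.normal(45, 15, size=700)   # made-up ages, larger group
oppose_age = rng.normal(50, 17, size=300)  # made-up ages, smaller group

grid = np.linspace(-40, 140, 1801)
step = grid[1] - grid[0]
favor_density = gaussian_kde_1d(favor_age, grid, bandwidth=5.0)
oppose_density = gaussian_kde_1d(oppose_age, grid, bandwidth=5.0)

# Independent normalization: each curve integrates to about 1 on its own
favor_area = favor_density.sum() * step
oppose_area = oppose_density.sum() * step

# Common normalization: weight each curve by its group's share of respondents,
# so the two curves together integrate to about 1
n_total = len(favor_age) + len(oppose_age)
favor_scaled = favor_density * len(favor_age) / n_total
oppose_scaled = oppose_density * len(oppose_age) / n_total
total_area = (favor_scaled + oppose_scaled).sum() * step

# Estimated local share in favor; values below 0.5 would mark a "switch"
favor_share = favor_scaled / (favor_scaled + oppose_scaled)

print(round(favor_area, 3), round(oppose_area, 3), round(total_area, 3))
```

Under independent normalization (`common_norm=False` in seaborn) a crossing only compares the shapes of the two age distributions, so it should not be read as a change in majority opinion.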

As for age, 31-40 year olds were interviewed the most, followed by 18-30 year olds, and 81-90 year olds were interviewed the least. This distribution makes sense, as it follows the general population trend for age. Plotting the age bins against their counts, 31-40 year olds show the highest count of responses in favor of capital punishment.

In [None]:
my_plot = sns.countplot(df, x="difage")
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=90)

In [None]:
my_plot = sns.countplot(df, x="difage", hue="cappun")
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=90)

When looking more specifically into trends for those who oppose capital punishment, we noticed older individuals who are not affiliated with any religious beliefs and possess higher education levels exhibit the highest opposition rates to capital punishment. When looking more into religions, it seems that Christians, irrespective of their educational background, tend to predominantly favor the death penalty.

In [None]:
print(pd.crosstab(df['difage'],df['relig']),'\n')

In [None]:
print(pd.crosstab(df['educ'],df['relig']),'\n')

In [None]:
my_plot = sns.countplot(df, x="difrelig", hue="cappun")
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=90)

In [None]:
rslt_df = df[df['difrelig'] == 'christian']
other_df = df[df['difrelig'] != 'christian']
jewish = df[df['difrelig'] == 'jewish']
hindu = df[df['relig'] == 'hinduism']

KDE Christians opinion by education

In [None]:
sns.kdeplot(data=rslt_df, x="educ", hue="cappun") # Grouped by cappun

On the other hand, individuals practicing Hinduism, particularly those in their forties, exhibit a higher tendency to oppose capital punishment compared to those who favor it. Furthermore, we noticed that a positive correlation exists between higher education levels and increased opposition to capital punishment.

In [None]:
sns.kdeplot(data=hindu, x="educ", hue="cappun") # Grouped by cappun

In [None]:
sns.kdeplot(data=df, x="educ", hue="cappun") # Grouped by cappun

It is important for us to include Kernel Density Estimation plots to help illustrate the distribution of attitudes towards capital punishment within specific subgroups, making it easier to discern patterns and trends with a more accurate representation of the data.  The following KDE plot reveals that individuals with advanced education levels who do not identify as Christian tend to exhibit a stronger inclination towards opposing capital punishment compared to supporting it.
On the flip side, among Christians of various age groups, a consistent and low opposition rate is observed through the following plot, indicating a persistent preference for the death penalty.

KDE Non Christians opinion by education

In [None]:
sns.kdeplot(data=other_df, x="educ", hue="cappun") # Grouped by cappun

KDE Christians opinion by age

In [None]:
sns.kdeplot(data=rslt_df, x="age", hue="cappun") # Grouped by cappun

When exploring central tendencies in the age and education variables, we noticed that the average age of both those who favor and those who oppose capital punishment is around 46 years old. This suggests that a significant portion of the study’s data falls within the middle range of ages. Similarly, the average education level for both groups is around 13 years, which corresponds to roughly one year of education beyond high school. This indicates that the study’s data is concentrated around a moderate level of educational attainment.

The concentration of data in the middle range of age and education has implications for the generalizability of the study’s findings. While the results provide valuable insight into the attitudes of individuals with average age and education levels, caution should be exercised when extrapolating these findings to extreme age or education brackets. On the other hand, the fact that the data is centered around average age and education levels makes it relevant to public opinion more broadly: the study captures the sentiments of a considerable segment of the population, which makes the findings pertinent to discussions of capital punishment across a broad cross-section of society.

In [None]:
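# Average only the numeric columns; the frame also holds text columns that cannot be averaged
result1 = df.groupby(['cappun','difrelig'])[['age','educ']].mean()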
print(result1)

In [None]:
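# Average only the numeric columns; the frame also holds text columns that cannot be averaged
result = df.groupby('cappun')[['age','educ']].mean()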
print(result)
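
Two groups can share the same mean while being spread out very differently, so it can help to report the standard deviation and group size next to the mean. A minimal sketch on made-up values (the `toy` frame is an assumption chosen so that both groups have identical means):

```python
import pandas as pd

# Made-up respondents: both opinion groups average 46 years of age and 13 years of education
toy = pd.DataFrame({
    "cappun": ["favor", "favor", "favor", "oppose", "oppose", "oppose"],
    "age": [30, 46, 62, 25, 46, 67],
    "educ": [12, 13, 14, 10, 13, 16],
})

# Mean, standard deviation, and group size for each numeric column
summary = toy.groupby("cappun")[["age", "educ"]].agg(["mean", "std", "count"])
print(summary)
```

In this toy example the means match exactly, yet the "oppose" group is more spread out in both age and education, which a comparison of means alone would not reveal.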

In [None]:
my_plot = sns.countplot(df, x="educ", hue="cappun")
my_plot.set_xticklabels(my_plot.get_xticklabels(), rotation=90)

The previous findings shed light on the nuanced relationships between religious beliefs, age, education, and attitudes towards capital punishment, providing valuable insights into the complexities of public opinion on this issue.


## Conclusion
Using data from the General Social Survey, we analyzed patterns among the following variables: age, education, religion, and opinion on capital punishment. We found that, in general, every religious group has a majority opinion that favors the death penalty. The favoring opinion is especially strong among people who identify as Christian. Only when analyzing specific sub-groups, such as highly educated non-Christians or middle-aged to older Hindus, were we able to identify groups where a majority opposed capital punishment.

Our data preparation methods cleaned the data while maintaining the integrity of the information; dropped answers made up approximately 1% of all responses. Decisions to group variables and bin information were made in the interest of analyzing the data meaningfully, and not in an effort to misconstrue the information.

There is, however, a benefit to further analysis with more information. The data we used was limited in certain regards; there were few entries from religious groups such as Hindus, Buddhists, and Native Americans. A more robust analysis would require a greater number of data points from these groups to produce a more reliable sample.

Further research could explore each religion in more depth, analyzing sub-groups or denominations within each religion. For the purpose of our analysis we generalized Christians, but there may be different trends across different Christian sects. Additionally, Islam, Hinduism, and Native American religions have different sects that were not identified in this particular survey. More information with more specific questions would allow for a more in-depth understanding of how religion, age, and education affect people’s opinions on capital punishment.
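
To give a sense of why small groups limit the analysis, one common way to express the uncertainty in a proportion is the Wilson score interval. The sketch below uses made-up counts (not GSS results) and a hypothetical helper, `wilson_interval`, to compare a group of 20 respondents with a group of 2,000 where the same 60% favor the death penalty.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Same observed share (60% in favor), very different group sizes
small_low, small_high = wilson_interval(12, 20)
large_low, large_high = wilson_interval(1200, 2000)

print(f"n=20:   {small_low:.2f} to {small_high:.2f}")
print(f"n=2000: {large_low:.2f} to {large_high:.2f}")
```

The interval for the small group is much wider, so an apparent majority in a small group may not be distinguishable from an even split.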


## Appendices

In [None]:
rslt_df = df[df['difrelig'] == 'christian']  # filter on the grouped religion column, not 'educ'
other_df = df[df['difrelig'] != 'christian']

In [None]:
sns.kdeplot(data=jewish, x="educ", hue="cappun") # Grouped by cappun

In [None]:
sns.kdeplot(data=hindu, x="age", hue="cappun") # Grouped by cappun