**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Manya Jain
- Jammy Luo
- Katie Le
- Vidhi Oswal

# Research Question

Do mismatches in sociodemographic factors, measured through age gaps greater than five years, household income gaps, and differences in education levels, affect the quality of committed romantic relationships in the United States for people above 22 years old?

## Background and Prior Work

There are many factors that can influence the length of a relationship. Although some may seem arbitrary, there is a general idea that certain mismatches in compatibility tend to result in shorter relationships. Sociodemographic factors such as age, economic status, race, education level, and even previous relationship histories play a role in shaping the interactions and compatibility of individuals within a relationship. 

Vogue describes a general rule of thumb for the acceptable age gap: halve your age and add seven for the lower bound, and subtract seven from your age and double the result for the upper bound.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Age is not always a deciding factor that makes or breaks a relationship, but the power dynamic that results can be emotionally taxing. 
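As a quick illustration of the arithmetic behind this rule of thumb, the sketch below computes both bounds for a given age. The function name `acceptable_age_range` is our own and is not taken from the article.

```python
def acceptable_age_range(age):
    """Return (lower, upper) partner ages under the half-plus-seven rule of thumb."""
    lower = age / 2 + 7    # half your age, plus seven
    upper = (age - 7) * 2  # your age minus seven, doubled
    return lower, upper


print(acceptable_age_range(30))  # (22.0, 46)
```

For a 30-year-old, the rule suggests partners between 22 and 46 years old.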

A study by Li and Feng examined how a woman's happiness relates to her education compared with her husband's. The data covered a wide variety of sociodemographic variables, with a specific focus on education and self-esteem. The results showed a mean of -0.167 for education, reflecting that only 22% of women had a higher education than their counterparts.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Additionally, a woman's happiness decreases from 0.064 to 0.058 as her education increases. The study suggests a possible correlation between happiness in a relationship and the education level of both partners. A clear sign of either partner “marrying up” can introduce negative emotions into the relationship. However, the study does not specify the effect of education on the relationship itself, which is a point to consider. 

According to The Guardian, significant income or wealth gaps in relationships can create unique challenges, often intensifying power dynamics that can influence relationship stability and satisfaction. Such differences may highlight contrasting financial habits or values, potentially leading to tensions around shared expenses, lifestyle choices, or long-term goals. Studies indicate that individuals who maintain financial independence within relationships may experience less pressure tied to monetary support, but they still often face societal judgments about income disparity, especially when gender norms are challenged.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Previous studies and articles have mentioned many of these sociodemographic factors, but not particularly in terms of how they affect the stability or length of the relationship itself. This project seeks to explore how these factors interact to predict the length of romantic relationships, providing a quantitative analysis that aims to uncover patterns of relationship longevity across diverse sociodemographic backgrounds.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Hunt, E. (2024, September 21). “I earn £2m – my partner £20k. it’s a bit ridiculous”: The truth about wealth-gap relationships. The Guardian. https://www.theguardian.com/lifeandstyle/2024/sep/21/the-truth-about-wealth-gap-relationships
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Li, Z., & Feng, X. (2023). Educational Difference Between Partners and Wife’s Happiness. Journal of Family Issues, 44(10), 2684-2707. https://doi.org/10.1177/0192513X221106731
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Rasmussen, T. (2024, June 4). What is an Acceptable Age Gap in Relationships? Vogue. https://www.vogue.com/article/age-gap-relationships

# Hypothesis


We hypothesize that the greater the mismatch in these variables between partners, the greater the likelihood of a committed romantic relationship not lasting past three years; this expectation is based on our personal observations. We therefore expect a negative relationship between sociodemographic mismatches and the length of a long-term relationship.

# Data

## Data overview

- Dataset #1
  - Dataset Name: How Couples Meet and Stay Together 2017 (HCMST 2017)
  - Link to the dataset: https://data.stanford.edu/hcmst2017
  - Number of observations: 3510
  - Number of variables: 725

Dataset #1 contains information about 3510 survey respondents and their romantic partners, detailing aspects of their relationships in 2017. Important variables include who earned more between the two partners, the education level of the respondent and of their partner, and the age of each. Earnings and education are categorical variables encoded as numbers, while age is recorded numerically in years. The dataset was originally an Excel file, so we used pd.read_excel() to load it and .dropna() to drop rows with a missing value for ‘w1_q34’, which categorically describes the quality of the relationship and is the outcome we are focusing on. The resulting subset, ‘subset_df’, has 2847 observations and no missing values in the selected columns.

## Dataset #1 (How Couples Meet and Stay Together 2017 (HCMST 2017))

In [1]:
import pandas as pd

In [2]:
# Loading the dataset
df = pd.read_excel('output_file.xlsx')

print(df.head())

   caseid_new  w3_Weight  w3_Weight_LGB  w3_combo_weight  \
0       53001     0.4422            NaN         0.495308   
1       71609     0.8284            NaN         0.927891   
2      106983     0.8255            NaN         0.924643   
3      121759        NaN            NaN              NaN   
4      158083     0.8810            NaN         0.986809   

   w3_attrition_adj_weight  w2_weight_genpop  w2_weight_LGB  w2_combo_weight  \
0                 0.400185            0.3856            NaN         0.437670   
1                 0.879258            0.9196            NaN         1.043778   
2                 0.706467            0.7748            NaN         0.879425   
3                      NaN            0.9177            NaN         1.041622   
4                 0.655467            0.8697            NaN         0.987140   

   w2_attrition_adj_weights  w1_weight_combo  ...  p20_pppa1634  p20_pppa1902  \
0                  0.380351         0.426861  ...           2.0           0.0

In [3]:
# Selecting relevant variables (each column listed once; a repeated name would create a duplicate column)
columns_to_select = ['w1_ppage','w1_ppagecat','w1_ppeduc','w1_ppeducat','w1_ppethm','w1_ppgender','w1_q23','w1_ppincimp','w1_ppincimp_cat','w1_ppmarit','w1_q4','w1_q6b','w1_q9','w1_q10','w1_q34']
subset_df = df[columns_to_select].copy()

subset_df.head()

| | w1_ppage | w1_ppagecat | w1_ppeduc | w1_ppeducat | w1_ppethm | w1_ppgender | w1_q23 | w1_ppincimp | w1_ppincimp_cat | w1_ppmarit | w1_q4 | w1_q6b | w1_q9 | w1_q10 | w1_q34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 4 | 9 | 2 | 5 | 2 | 3.0 | 13 | 2 | 1 | 1.0 | 1.0 | 46.0 | 11.0 | 1.0 |
| 1 | 68 | 6 | 10 | 3 | 1 | 2 | 3.0 | 12 | 2 | 1 | 1.0 | 1.0 | 71.0 | 10.0 | 1.0 |
| 2 | 39 | 3 | 11 | 3 | 1 | 1 | 1.0 | 15 | 3 | 1 | 2.0 | 1.0 | 49.0 | 10.0 | 1.0 |
| 3 | 54 | 4 | 9 | 2 | 1 | 1 | 3.0 | 16 | 3 | 1 | 2.0 | 4.0 | 59.0 | 13.0 | 1.0 |
| 4 | 48 | 4 | 10 | 3 | 1 | 1 | 1.0 | 14 | 3 | 3 | 2.0 | 1.0 | 34.0 | 11.0 | NaN |

In [4]:
# Drop rows with missing values in the quality of relationship variable
subset_df = subset_df.dropna(subset=['w1_q34'])

# Summary statistics for the remaining rows
subset_df.describe()

| | w1_ppage | w1_ppagecat | w1_ppeduc | w1_ppeducat | w1_ppethm | w1_ppgender | w1_q23 | w1_ppincimp | w1_ppincimp_cat | w1_ppmarit | w1_q4 | w1_q6b | w1_q9 | w1_q10 | w1_q34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 | 2847.0 |
| mean | 49.357218 | 3.981033 | 10.536705 | 2.976818 | 1.644187 | 1.505093 | 2.108535 | 13.325255 | 2.530032 | 2.107482 | 1.484721 | 1.449596 | 48.76326 | 10.656832 | 1.51844 |
| std | 16.230093 | 1.619776 | 2.010855 | 0.966129 | 1.182051 | 0.500062 | 1.04627 | 4.591033 | 1.105341 | 1.805854 | 0.505447 | 1.095336 | 16.5201 | 1.920204 | 0.74575 |
| min | 18.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -1.0 | 1.0 | 1.0 | 1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| 25% | 35.0 | 3.0 | 9.0 | 2.0 | 1.0 | 1.0 | 1.0 | 11.0 | 2.0 | 1.0 | 1.0 | 1.0 | 35.0 | 9.0 | 1.0 |
| 50% | 51.0 | 4.0 | 10.0 | 3.0 | 1.0 | 2.0 | 2.0 | 14.0 | 3.0 | 1.0 | 1.0 | 1.0 | 50.0 | 10.0 | 1.0 |
| 75% | 62.0 | 5.0 | 12.0 | 4.0 | 2.0 | 2.0 | 3.0 | 16.0 | 3.0 | 3.0 | 2.0 | 1.0 | 62.0 | 12.0 | 2.0 |
| max | 93.0 | 7.0 | 14.0 | 4.0 | 5.0 | 2.0 | 4.0 | 21.0 | 4.0 | 6.0 | 3.0 | 5.0 | 95.0 | 14.0 | 5.0 |
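The summary statistics show a minimum of -1.0 for several survey items (for example ‘w1_q9’ and ‘w1_q10’), which cannot be a real age or education level. A possible next cleaning step, sketched below, is to treat -1 as a missing value. This assumes that -1 encodes a refused or skipped answer, which should be confirmed against the HCMST codebook; the helper name `recode_refused` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def recode_refused(frame, columns, refused_code=-1):
    """Return a copy of frame where refused_code in the given columns becomes NaN.

    Assumption: refused_code (-1 by default) marks a refused or skipped answer.
    """
    cleaned = frame.copy()
    cleaned[columns] = cleaned[columns].replace(refused_code, np.nan)
    return cleaned


# Toy values for illustration only, not taken from the survey
toy = pd.DataFrame({
    "w1_q9": [46.0, -1.0, 49.0],
    "w1_q10": [11.0, 10.0, -1.0],
    "w1_q34": [1.0, 2.0, 1.0],
})
cleaned = recode_refused(toy, ["w1_q9", "w1_q10"])
print(cleaned.isna().sum())
```

Recoding before computing gaps keeps a -1 from being read as a partner age or education level.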


# Results

## Exploratory Data Analysis


We decided to explore the relationships between the gaps in age, household income, and difference in education levels, and the quality of a committed romantic relationship.
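One way this exploration could begin is sketched below: derive the age and education gaps for each couple, then compare the average relationship-quality rating across groups. This is a minimal sketch under stated assumptions, not our final analysis. It assumes ‘w1_q9’ holds the partner's age and ‘w1_q10’ the partner's education code, as the column values suggest. The helper `add_gap_columns`, the new column names, and the toy values are hypothetical.

```python
import pandas as pd


def add_gap_columns(frame):
    """Return a copy of frame with absolute age and education gaps added."""
    out = frame.copy()
    # Absolute difference between respondent age and (assumed) partner age
    out["age_gap"] = (out["w1_ppage"] - out["w1_q9"]).abs()
    # Flag gaps greater than five years, matching the research question
    out["large_age_gap"] = out["age_gap"] > 5
    # Education is an ordinal category code, so this gap counts category steps, not years
    out["educ_gap"] = (out["w1_ppeduc"] - out["w1_q10"]).abs()
    return out


# Toy values for illustration only, not survey results
toy = pd.DataFrame({
    "w1_ppage": [48, 68, 39, 54],
    "w1_q9": [46, 71, 49, 59],
    "w1_ppeduc": [9, 10, 11, 9],
    "w1_q10": [11, 10, 10, 13],
    "w1_q34": [1, 1, 2, 3],
})
gaps = add_gap_columns(toy)
print(gaps.groupby("large_age_gap")["w1_q34"].mean())
```

Income is not included here because ‘w1_q23’ records who earned more as a category rather than an amount, so it would need a separate comparison such as a cross-tabulation against ‘w1_q34’.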

### Section 1 of EDA

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION


# Ethics & Privacy

- Are there any biases/privacy/terms of use issues with the data you proposed?

  We do not expect major sampling biases in the proposed data because the survey was distributed to respondents through random digit dial phone calls. This gives a more representative picture of the American population we aim to study, since the survey is not directed at one region but spans the United States. We are also not aware of any privacy or terms of use issues with the data we have proposed. Each individual is identified only by an ID for anonymization.
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)

  It is unlikely that there are any potential biases in our dataset(s) as we intend for it to be composed of a diverse pool of people across the country through randomized voluntary phone call surveys. However, our data excludes particular populations that do not have access to cellular devices or telephones, which could be problematic as it could mean exclusion of populations with lower household income or other reasons for this restricted access.
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?

  To check for potential biases in our dataset(s) and if any groups are underrepresented, we will perform an exploratory data analysis (EDA). During analysis, we will check if there are sociodemographic factors that disproportionately influence our model, indicating any sort of bias.
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?

  A possible issue with anonymization is that unique combinations of sociodemographic data could make participants of the study re-identifiable. The findings in our research could unintentionally reinforce societal biases and discourage individuals from pursuing non-conventional relationships. For instance, based on our study’s correlations, those in relationships with significant age gaps might feel that their relationship is being judged as even more unconventional. Another potential problem is that our findings could be used to promote narrow views of the “ideal” relationship and to discourage differences between partners. Our research only analyzes data for potential correlations, and does not imply causation.
- How will you handle issues you identified?

  To handle these issues, we could make sure variables such as incomes are aggregated rather than exact to protect the privacy of participants even further. We will communicate clearly that any correlations in our findings do not imply causation or any predictive, causal factors in relationship success or failure. We will also emphasize the importance of diversity in relationships and validate sociodemographic differences to ensure that our findings are not misinterpreted to reinforce stereotypes.
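The checks described in the answers above could be sketched as follows: compute the share of each demographic group to spot underrepresentation, and count rare combinations of attributes that might make a participant re-identifiable. The helper names `group_shares` and `small_cells`, the `min_count` threshold, and the toy values are our own assumptions for illustration.

```python
import pandas as pd


def group_shares(frame, column):
    """Return the proportion of rows in each category of column."""
    return frame[column].value_counts(normalize=True).sort_index()


def small_cells(frame, columns, min_count=5):
    """Return observed attribute combinations with fewer than min_count rows."""
    counts = frame.groupby(columns).size()
    return counts[counts < min_count]


# Toy values for illustration only, not survey results
toy = pd.DataFrame({
    "w1_ppgender": [1, 2, 2, 1, 2, 2],
    "w1_ppethm": [1, 1, 1, 5, 1, 1],
})
print(group_shares(toy, "w1_ppgender"))
print(small_cells(toy, ["w1_ppgender", "w1_ppethm"], min_count=2))
```

Groups with very small shares, or combinations that appear only a handful of times, are candidates for aggregation before results are reported.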

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team members are expected to keep up to date on the proposed timeline below.*
* *Team members are expected to keep up to date on group communication and messages through Discord.*
* *The team is expected to meet at least once a week, according to the timeline below.*
* *If an individual member is struggling to fulfill their expected role, they must communicate clearly with the rest of the group to come to a compromise.*
* *To resolve any issues, the team will allocate effort equally to finish the work.*
* *For decision making, everything will follow a unanimous vote.*

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/21  |  5 PM | Brainstorm topics/questions  | Determine best form of communication; create Discord group chat; exchange contacts; begin background research, brainstorm into Google Docs | 
| 10/30  |  5 PM |  Brainstorm more topics, add to Google Docs | Work on Project Proposal; decide future meeting times; edit and finalize proposal for submission | 
| 11/06  | 5 PM  | Search for datasets  | Discuss Wrangling and possible analytical approaches; Have each member in the group work on specific parts   |
| 11/13  | 4:30 PM  | Import & Wrangle Data | Import & Wrangle Data; Review/Edit wrangling/EDA; Review Project Proposal feedback from TA/IA and revise   |
| 11/20  | 4 PM  | Await feedback for Data Checkpoint | Start wrangling/EDA; Work on EDA Checkpoint |
| 11/25  | 4:30 PM  | Review feedback for Data Checkpoint | Complete wrangling/EDA; Discuss/edit Analysis; Complete project check-in |
| 11/27  | TBA, Online  | Draft results/conclusion/discussion | Finish analysis; Work on results/conclusion/discussion; Discuss/edit full project |
| 12/04  | 5 PM  | Work on draft for results/conclusion/discussion; Finalize results/conclusion/discussion | Discuss/edit full project |
| 12/11  | Before 11:59 PM  | N/A | Turn in Final project, video, team evaluation survey, post-course survey |