# COGS 108 - Project Proposal

# Names

- Hillary Co
- Itzary Vences
- Jewel Nguyen
- Kanon Kitazume

# Research Question

### How do the political views shown on a user's profile influence match success for young adults (in their 20s and 30s) on online dating platforms?

We aim to analyze the role of political affiliation in match success, considering whether individuals are more likely to match with users who share similar political views. Variables measured will include political affiliation (e.g., liberal, conservative, moderate), match rates, user preferences, and demographic factors. The study will also examine potential biases and privacy concerns surrounding political data usage on dating platforms.
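
One way to make "more likely to match with users who share similar political views" measurable is to compare match rates for same-affiliation and different-affiliation pairs. The sketch below is only an illustration under stated assumptions: the pair-level table and its column names (`user_affiliation`, `partner_affiliation`, `matched`) are hypothetical and do not come from any of our datasets.

```python
import pandas as pd

# Hypothetical pair-level records: each row is one profile a user was shown,
# with both affiliations and whether the pair ended up matching.
pairs = pd.DataFrame({
    "user_affiliation": ["liberal", "liberal", "conservative", "moderate",
                         "conservative", "liberal", "moderate", "conservative"],
    "partner_affiliation": ["liberal", "conservative", "conservative", "liberal",
                            "liberal", "liberal", "moderate", "moderate"],
    "matched": [1, 0, 1, 0, 1, 1, 0, 0],
})

# Label each pair by whether the two affiliations agree.
same = pairs["user_affiliation"] == pairs["partner_affiliation"]
pairs["alignment"] = same.map({True: "same", False: "different"})

# Match rate (share of pairs that matched) within each alignment group.
match_rate = pairs.groupby("alignment")["matched"].mean()
print(match_rate)
```

A gap between the two rates on real data would only be suggestive; demographic factors such as age and location would still need to be controlled for.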


## Background and Prior Work

### Background

In the digital age, online dating apps have become a primary medium for forming romantic connections, especially among young adults. While traditional factors like age and income continue to play pivotal roles in dating success, increasing political polarization has introduced a new dynamic into the dating scene. Political alignment is becoming a crucial factor in relationship compatibility, reflecting broader societal trends. This study examines the impact of political views on match success among young adults, hypothesizing that a shared political ideology may significantly influence relationship formation and stability. A person's political affiliation may affect not only their match rates but also the stability of relationships formed through online dating platforms. With political affiliation becoming an increasingly significant factor in the dating lives of 20- to 30-year-olds, many dating platforms have begun catering to specific political preferences. The question is whether a shared political ideology hinders or enhances the likelihood of forming connections.

### Prior Work
The first source we reviewed, from June 2024, is the 'Predict Online Dating Matches Dataset' <a name='cite_ref-1'></a>[<sup>1</sup>](#cite_note-1). It is used to predict online dating matches from variables such as gender, membership status, income, number of children, age, attractiveness ratings, and match count. Although it reveals patterns in the success rate of matches, it does not include political alignment as a variable, a factor that is becoming increasingly relevant in modern dating among young adults.

Delving deeper into the significance of political views and compatibility, research titled 'Political Party Identification and Romantic Relationship Quality' from 2021 demonstrated that individuals identifying as Republicans are more likely to adjust and be compatible with their partners than those identifying as Democrats. This underscores the increasing importance of political beliefs in determining compatibility and relationship success <a name='cite_ref-2'></a>[<sup>2</sup>](#cite_note-2).

However, few studies have examined the impact of political affiliation on online dating success specifically. Our research seeks to fill this gap by investigating how political alignment affects match success in online dating among young adults aged 20-30. By doing so, we aim to add a crucial layer of understanding to the complex dynamics of relationship formation in a politically polarized environment.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Rabie El Kharoua (2024). 💌 Predict Online Dating Matches Dataset. Kaggle. DOI: 10.34740/KAGGLE/DSV/8744629. https://www.kaggle.com/datasets/rabieelkharoua/predict-online-dating-matches-dataset
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Brandt, M. J., & Crawford, J. T. (2021). Political Homophily in Online Dating: The Role of Ideological Similarity in Partner Preferences. Personality and Social Psychology Bulletin. PMC8266382. https://pure.uvt.nl/ws/portalfiles/portal/30699390/2019.BrandtCrawford.Worldviewconflictprejudice.Advances.pdf


# Hypothesis

We predict a strong correlation between political affiliation and match success among young adults in their 20s and 30s. We also predict a strong correlation between political views and compatibility when a user's matches hold aligned political views.
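
Because political affiliation is categorical, one possible way to test the first prediction is a chi-square test of independence with an effect size. The sketch below is an assumption about how we might proceed, not a committed analysis plan: the toy data and the column names `political_affiliation` and `matched` are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: one row per user, with a self-reported affiliation
# and whether the user received at least one match (1) or none (0).
toy = pd.DataFrame({
    "political_affiliation": ["liberal"] * 40 + ["moderate"] * 40 + ["conservative"] * 40,
    "matched": [1] * 30 + [0] * 10 + [1] * 20 + [0] * 20 + [1] * 12 + [0] * 28,
})

# Contingency table of affiliation (rows) by match outcome (columns).
contingency = pd.crosstab(toy["political_affiliation"], toy["matched"])

# Chi-square test of independence between affiliation and match outcome.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Cramer's V: effect size between 0 (no association) and 1 (perfect association).
n = contingency.to_numpy().sum()
cramers_v = (chi2 / (n * (min(contingency.shape) - 1))) ** 0.5

print(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}, V = {cramers_v:.2f}")
```

Reporting an effect size alongside the p-value matters here because the hypothesis is about a *strong* association, which a small p-value alone does not establish.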


# Data

## Data overview

- **Dataset #1**
  - Dataset Name: *Tinder Millennial Match Rate - EDA & Hypothesis Test*
  - Link to the dataset: https://www.kaggle.com/code/jintaepark95/tinder-millennial-match-rate-eda-hypothesis-test
  - Number of observations: 1,000
  - Number of variables: 11 variables

- **Dataset #2**
  - Dataset Name: *Data and Materials for 'The influence of algorithms on political and dating decisions'*
  - Link to the dataset: https://osf.io/qujza/
  - Number of observations: 40
  - Number of variables: 11 variables

- **Dataset #3**
  - Dataset Name: *A perfect ML model?*
  - Link to the dataset: https://www.kaggle.com/code/andreyukio/a-perfect-ml-model
  - Number of observations: 100,000
  - Number of variables: 20 variables


For each dataset, the following information is included, along with specific preparation steps:

- **Dataset #1: Tinder Millennial Match Rate - EDA & Hypothesis Test**
  - **Description:** This dataset provides insights into how millennials are engaging with Tinder, focusing on match rates from various universities. Key variables include ID (unique identifier for each user), Segment Type (mode of Tinder usage), Segment Description (university name), Answer (Tinder usage confirmation), Count (number of matches), Percentage (% of matches), and It became a Relationship (outcome of matches).
  - **Preparation Steps:** Convert the 'It became a Relationship' column to boolean values, handle potential outliers in the 'Count' and 'Percentage' columns, and standardize university names for consistency.

- **Dataset #2: Data and Materials for "The influence of algorithms on political and dating decisions"**
  - **Description:** This dataset examines the effects of AI algorithms on decision-making in political and dating contexts. Important variables are Experiment Type (context of the decision), Persuasion Type (explicit or covert), and Decision Outcome (effectiveness of the persuasion).
  - **Preparation Steps:** Encode categorical data such as 'Persuasion Type', check for and handle missing values, and normalize 'Decision Outcome' across different scales for comparative analysis.

- **Dataset #3: A perfect ML model? (Online Dating Behavior)**
  - **Description:** This dataset delves into factors influencing match success on dating platforms, with variables like Gender, PurchasedVIP (whether VIP service was purchased), Income, Children, Age, Attractiveness, and Matches.
  - **Preparation Steps:** Ensure categorical variables are properly encoded, handle outliers in 'Income' and 'Attractiveness', and potentially create new features such as age groups or income brackets for more detailed analysis.
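
The preparation steps above could look roughly like the following sketch. It runs on a small hypothetical table that reuses column names from the descriptions ('Age', 'Income', 'It became a Relationship'); the Yes/No labels, the 1.5 × IQR outlier rule, and the age bin edges are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows that mimic column names described above.
toy = pd.DataFrame({
    "Age": [21, 24, 27, 29, 33, 38],
    "Income": [30000, 42000, 51000, 48000, 55000, 400000],
    "It became a Relationship": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# Convert a Yes/No text column to booleans (the labels are assumed).
toy["became_relationship"] = toy["It became a Relationship"].map(
    {"Yes": True, "No": False}
)

def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Flag income outliers rather than dropping them, so the choice stays visible.
toy["income_outlier"] = ~iqr_mask(toy["Income"])

# Create an age-group feature from fixed bins (right edges are inclusive).
toy["age_group"] = pd.cut(
    toy["Age"], bins=[19, 24, 29, 34, 39],
    labels=["20-24", "25-29", "30-34", "35-39"],
)

print(toy)
```

Flagging instead of deleting lets us compare results with and without the flagged rows before deciding how to treat them.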

## Combining Datasets

To effectively combine these datasets for a comprehensive analysis, we focus on aligning key variables that are common across the datasets, particularly demographic and behavioral data.

**This involves:**

- Standardizing shared demographic variables such as age and gender across the Tinder Millennial Match Rate and Online Dating Behavior datasets, allowing for cross-analysis of user engagement metrics and match outcomes across different dating platforms.
- Integrating findings from the Influence of Algorithms on Decisions dataset by correlating algorithmic persuasion types with user behaviors and outcomes observed in the dating datasets. This linkage explores whether there's a correlation between the effectiveness of algorithmic persuasion and user interaction outcomes on dating platforms.

By consolidating these datasets through shared demographic information and behavioral responses, we aim to derive a more nuanced understanding of how algorithms influence user choices in online dating environments and the broader implications for digital matchmaking efficacy. This merged dataset will serve as a robust basis for examining the interplay between user characteristics, platform interaction, and algorithmic influence in shaping relationship dynamics.
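
Since the datasets do not share a user ID, rows cannot be joined person to person. A minimal sketch of one alternative is to harmonize the demographic keys, aggregate each source to one row per group, and join the group summaries. The two small tables below are hypothetical stand-ins, and whether each real dataset actually exposes age and gender in this form is an assumption we still need to verify.

```python
import pandas as pd

# Hypothetical stand-ins for two sources with no shared user ID.
dating = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Age": [22, 23, 31, 34],
    "Matches": [80, 40, 60, 20],
})
experiment = pd.DataFrame({
    "gender": ["f", "m", "f", "m"],
    "age": [21, 24, 30, 33],
    "mtarget": [3.5, 2.5, 4.0, 3.0],
})

def add_keys(df, age_col, gender_col):
    """Add harmonized 'age_group' and 'gender' key columns to a copy of df."""
    out = df.copy()
    out["age_group"] = pd.cut(
        out[age_col], bins=[19, 29, 39], labels=["20s", "30s"]
    ).astype(str)
    out["gender"] = out[gender_col].str.upper()
    return out

dating_keys = add_keys(dating, "Age", "Gender")
experiment_keys = add_keys(experiment, "age", "gender")

# Aggregate each source to one row per demographic group.
keys = ["age_group", "gender"]
dating_summary = dating_keys.groupby(keys, as_index=False)["Matches"].mean()
experiment_summary = experiment_keys.groupby(keys, as_index=False)["mtarget"].mean()

# Join the group-level summaries on the shared keys.
combined = dating_summary.merge(experiment_summary, on=keys, how="inner")
print(combined)
```

Any relationship found in such a table holds at the group level only and should not be read as a statement about individual users.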

## Dataset #1 (Tinder Millennial Match Rate - EDA & Hypothesis Test)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the data
data_path = '/path/to/your/dataset/Online_Dating_Behavior_Dataset.csv'
data = pd.read_csv(data_path)
print(data.head())

In [None]:
# Convert data types if necessary
data['Gender'] = data['Gender'].astype('category')
print(data.info())

In [None]:
# Keep only users with at least one match (drops rows where Matches is 0)
data_clean = data[data['Matches'] > 0]


In [None]:
# Visualize distributions of key variables
sns.histplot(data_clean['Age'], bins=10)
plt.title('Distribution of Age')
plt.show()

## Dataset #2 (Data and Materials for "The influence of algorithms on political and dating decisions")

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/COGS108/Group012_WI25/refs/heads/master/Dataset_2.csv?token=GHSAT0AAAAAAC6VLPK6577ZTCO36GB5GBZWZ54F2SA', delimiter=';', skipinitialspace=True)
# Changing column names
new_columns = ['group', 'balancing', 'age', 'gender', 'political_orientation', 'personality', 'mtarget', 'mcontrol']
df.columns = new_columns
# Dropping balancing column
df.drop('balancing', axis=1, inplace=True)
# Changing commas to decimal points
df['mtarget'] = df['mtarget'].astype(str).str.replace(',', '.').astype(float)
df['mcontrol'] = df['mcontrol'].astype(str).str.replace(',', '.').astype(float)


In [None]:
# Scatterplot of Mtarget and Mcontrol
plt.scatter(df['mtarget'], df['mcontrol'])
plt.title('Relationship Between Mtarget and Mcontrol')
plt.xlabel('Mtarget')
plt.ylabel('Mcontrol')
plt.show()


## Dataset #3 (A perfect ML model?)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.stats.multicomp as multi
from scipy.stats import ttest_ind
import os
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp


In [None]:
# Load the data
data_path = '/kaggle/input/predict-online-dating-matches-dataset/Online_Dating_Behavior_Dataset.csv'
data = pd.read_csv(data_path)
print(data.head())
print(data.info())


In [None]:
# Further analysis
data['Match_Status'] = data['Matches'].apply(lambda x: 'With Matches' if x > 0 else 'Without Matches')
sns.boxplot(x='Match_Status', y='Age', data=data)
plt.title('Age Distribution by Match Status')
plt.show()


In [None]:
# Data exploration and visualization
sns.countplot(x='Gender', data=data)
plt.title('Gender Distribution')
plt.show()


In [None]:
# Hypothesis testing
group1 = data[data['Match_Status'] == 'With Matches']['Income']
group2 = data[data['Match_Status'] == 'Without Matches']['Income']
t_stat, p_value = ttest_ind(group1, group2)
print(f'T-test results -- T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}')


In [None]:
# Model building
features = ['Gender', 'PurchasedVIP', 'Income', 'Children', 'Age', 'Attractiveness']
target = 'Matches'
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = RandomForestRegressor(random_state=42)  # fixed seed so results are reproducible
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')


# Team Expectations 


**Communication:** Regular updates via group chats and in-person meetings every Wednesday at 3 PM.

**Respect:** Listening to and considering all group members' ideas and contributions.

**Punctuality:** Adhering to deadlines and arriving on time for meetings.

**How we will communicate:** Mobile text messages, Google Docs, In-person meetings.

# Project Timeline Proposal




| Meeting Date  | Meeting Time | Completed Before Meeting  | Discuss at Meeting |
|--------------|-------------|--------------------------|--------------------|
| 02/05       | 3 PM        | Decide on a topic and start drafting a proposal. | Draft and Complete Project Proposal |
| **Deadline- 02/09** | N/A | Submission Before 11:59 PM | Complete the Proposal including edits by Friday. Submission of Project Proposal by Sunday Before 11:59 PM. |
| 02/12       | 3 PM        | Research potential datasets and ethical considerations. | Decide on the ideal datasets and discuss any ethical concerns. |
| 02/19       | 3 PM        | Import and wrangle data; conduct exploratory data analysis (EDA) | Review the data wrangling and EDA, and plan further analysis. |
| **Deadline- 02/23** | N/A | Submission Before 11:59 PM | Checkpoint #1: DATA - Submit Checkpoint data by Sunday Before 11:59 PM. |
| 02/26       | 3 PM        | Finalize wrangling/EDA; Begin Analysis | Review and edit the ongoing analysis, and project check-in. |
| 03/05       | 3 PM        | Complete analysis; Draft results/conclusion/discussion | Edit and finalize the full project draft. |
| **Deadline- 03/09** | N/A | Submission Before 11:59 PM | Checkpoint #2: EDA - Submit Checkpoint #2 EDA by Sunday Before 11:59 PM. |
| 03/12       | 3 PM        | Conduct an in-depth analysis and draft the final report. | Review and finalize the report. |
| **Deadline- 03/19** | N/A | Submission Before 11:59 PM | Finalize project - Submit the Final Project by Sunday Before 11:59 PM. |