
### COMP 4447 DSTools1 Final Project
### Authors: Elizabeth Fugikawa & Heather Lemon

### How Online Dating and Dating App Usage Affect Relationships

### GitHub Repository Link 
https://github.com/hypothetical-lemon/COMP4447-final-project 

The motivation behind this analysis is to better understand how online dating and cell phone dating app usage affect relationships, using the data collected through [Stanford's How Couples Meet and Stay Together (HCMST) 2017](https://data.stanford.edu/hcmst2017) as well as the earlier 2011 study, [How Couples Meet and Stay Together 2011](https://data.stanford.edu/hcmst).

Some details of the data collected include political affiliation, mother's highest level of education, demographics, and whether the respondent met their significant other online.

We will cover exploratory data analysis, feature engineering, cleaning, and visualization, including basic transformations and normalizations of the data.

### Detailed Notes Regarding Original Data Collection 2011

How Couples Meet and Stay Together (HCMST) is a study of how Americans meet their spouses and romantic partners.

The study is a nationally representative study of American adults.
<font color='red'>4,002 adults</font> responded to the survey; 3,009 of those had a spouse or main romantic partner.
The study oversamples self-identified gay, lesbian, and bisexual adults.
Follow-up surveys were implemented one and two years after the main survey, to study couple dissolution rates. Version 3.0 of the dataset includes two follow-up surveys, waves 2 and 3.
Waves 4 and 5 are provided as separate data files that can be linked back to the main file via variable caseid_new.

### Detailed Notes Regarding the New Data Collection 2017
The 2017 survey was administered by the GfK Group on behalf of the Stanford couples study.
This new survey, How Couples Meet and Stay Together 2017 (HCMST 2017), features a fresh set of <font color='red'>3,510 survey respondents</font>, with no overlap in subjects from the original HCMST survey which was first fielded in 2009.
HCMST 2017 features new questions about subjects' use of phone apps like Tinder and Grindr for dating and meeting partners.

Specifically, the purpose of this study is to bring knowledge of how couples meet up-to-date by
asking detailed questions about both the timing and the social contexts of how Americans meet
their romantic partners. Same-sex couples have been oversampled both in order to provide
better information about the difficult-to-study sexual minority population, and in order to
provide new perspectives on the changing nature of same-sex couple mating in the US.
Another key purpose is to examine how technology, specifically online dating and cell phone
apps like Tinder and Grindr, affect relationship formation, relationship quality, attachment to
the idea of monogamy, and relationship stability.

### Other work done in this area
Our work uses the basic features (mostly demographics) and keeps the analysis simple, without over-architecting the main goal.

A couple of published papers
- https://web.stanford.edu/~mrosenfe/Rosenfeld_Tinder_and_dating_apps.pdf 
- https://web.stanford.edu/~mrosenfe/Rosenfeld_et_al_Disintermediating_Friends.pdf 

A couple of news articles written about the dataset 
- https://flowingdata.com/2019/03/19/the-relationship-timeline-continues-to-stretch/
- https://flowingdata.com/2019/03/15/shifts-in-how-couples-meet-online-takes-the-top-spot/


### Reference
Rosenfeld, Michael J., Reuben J. Thomas, and Sonia Hausen. 2019. How Couples Meet and Stay Together 2017 fresh sample. Stanford, CA: Stanford University Libraries.

## Table of Contents
> 0. [Importing Data](#Importing-Data)
> 1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    * 1.0 [Read Stata](#Read-Stada) 
    * 1.1 [Feature Selection](#Feature-Selection)
> 2. [Introduction to Dataset](#Introduction-to-Dataset)
    * 2.0 [DataTypes](#DataTypes)
    * 2.1 [MAR (Missing data at Random)](#MAR-(Missing-data-at-Random)) 
    * 2.2 [One-Hot Encode](#One-Hot-Encode)
    * 2.3 [What month was most popular to meet your Significant Other?](#What-month-was-most-popular-to-meet-your-Significant-Other?)
    * 2.4 [Gender](#Gender)
    * 2.5 [Political](#Political)
    * 2.6 [Age](#Age)
    * 2.7 [Age Income](#Age-Income)
    * 2.8 [Bar Chart Race](#Bar-Chart-Race)
    * 1.5 [Visualizing the Data](#Visualization-of-Data)
         * 1.2.1 [Seaborn PairPlot](#Seaborn-Pairplot)
         * 1.2.2 [Correlation Table](#Correlation-Table)
    * 1.4 [Exploratory Data Analysis Conclusion](#Exploratory-Data-Analysis-Conclusion)
    

# Importing Data

In [None]:
%%bash
# pull data
wget 'https://stacks.stanford.edu/file/druid:hg921sg6829/HCMST_2017_public_data_v1.1_stata.zip'
unzip HCMST_2017_public_data_v1.1_stata.zip
# remove zipped file
rm HCMST_2017_public_data_v1.1_stata.zip
# rename file
mv 'HCMST 2017 fresh sample for public sharing draft v1.1.dta' HCMST2017.dta

In [None]:
%%bash
# pull data (2011)
wget 'https://stacks.stanford.edu/file/druid:ns183dp7831/HCMST_ver_3.04_Stata.zip'
unzip HCMST_ver_3.04_Stata.zip
# remove zipped file
rm HCMST_ver_3.04_Stata.zip
# file name is HCMST_ver_3.04.dta

# Exploratory Data Analysis

We begin by importing the required libraries and then reading in the data files.

In [None]:
import pandas as pd
import logging
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import scipy.stats  # load the stats submodule explicitly; scipy.stats.shapiro is used below
import leafmap.foliumap as leafmap
pd.options.display.max_columns = None
import warnings
warnings.filterwarnings("ignore")

# Read Stada

In [None]:
df2011 = pd.read_stata('HCMST_ver_3.04.dta')
df2011.head()

In [None]:
df2017 = pd.read_stata('HCMST2017.dta')
df2017.head()
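
As a minimal sketch of what `pd.read_stata` does with labelled Stata variables, the toy example below writes a small frame to a temporary `.dta` file and reads it back. The data and the file name `toy.dta` are made up for illustration and are not part of the HCMST files. By default, value labels come back as pandas categoricals, which is why many survey columns show up with a `category` dtype.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for a labelled survey file (hypothetical values)
toy = pd.DataFrame({
    'ppage': [34, 52, 27],
    'ppgender': pd.Categorical(['female', 'male', 'female']),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.dta')
    toy.to_stata(path, write_index=False)

    # Default: Stata value labels are converted to pandas categoricals
    labelled = pd.read_stata(path)
    # convert_categoricals=False keeps the underlying integer codes instead
    coded = pd.read_stata(path, convert_categoricals=False)

print(labelled.dtypes)
print(coded.dtypes)
```

Keeping the labels is usually more readable for exploration, while the integer codes can be handy when matching values against a codebook.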

# Feature Selection

In [None]:
df2011.shape

In [None]:
df2017.shape

Each `shape` is a `(rows, columns)` tuple: the first value is the total number of respondents and the second value is the number of columns (survey variables) in the dataset.
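
A small illustration of how to read `shape`, using a made-up frame rather than the survey data:

```python
import pandas as pd

# Toy frame: 3 respondents (rows) and 2 survey variables (columns)
toy = pd.DataFrame({
    'ppage': [34, 52, 27],
    'ppgender': ['female', 'male', 'female'],
})

# shape unpacks into (number of rows, number of columns)
n_respondents, n_variables = toy.shape
print(n_respondents, n_variables)
```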

In [None]:
df_numeric2011 = pd.DataFrame()
df_numeric_encoded2011 = pd.DataFrame()
df_categorical2011 = pd.DataFrame()
df_categorical_encoded2011 = pd.DataFrame()

# Divide data into numeric or categorical responses
df_numeric2011 = df2011[['caseid_new', 'ppage', 'ppagecat', 'hhinc']].rename\
    ({'caseid_new': 'id', 'ppage': 'age', 'ppagecat': 'cat_age'}, axis=1)

df_categorical2011 = df2011[['ppgender', 'ppeducat', 'ppincimp',
                               'ppwork', 'pppartyid3', 'ppreg9',
                               'ppmarit', 'q24_met_online',
                               'papreligion', 'relationship_quality']].rename(
            columns={'ppgender': 'gender',
                     'ppeducat': 'educ',
                     'ppincimp': "incomecat",
                     'ppwork': 'job_status',
                     'pppartyid3': 'political_aff',
                     'ppreg9': 'region',
                     'papreligion': 'religion',
                     'ppmarit': 'marital_status',
                     'q24_met_online': 'met_online'})

print(df_numeric2011.head())
print(df_categorical2011.head())

In [None]:
df_numeric2017 = pd.DataFrame()
df_numeric_encoded2017 = pd.DataFrame()
df_categorical2017 = pd.DataFrame()
df_categorical_encoded2017 = pd.DataFrame()

# Divide data into numeric or categorical responses
df_numeric2017 = df2017[['ppage', 'ppagecat', 'w6_q21a_month', 'CaseID']].rename\
    ({'ppage': 'age', 'ppagecat': 'cat_age', 'w6_q21a_month': 'month_met', 'CaseID': 'id'}, axis=1)

df_categorical2017 = df2017[['ppgender', 'ppeducat', 'ppincimp',
                               'ppwork', 'partyid7', 'PPREG4',
                               'ppmarit', 'hcm2017q24_met_online',
                               'w6_relationship_quality', ]].rename(
            columns={'ppgender': 'gender',
                     'ppeducat': 'educ',
                     'ppincimp': "incomecat",
                     'ppwork': 'job_status',
                     'partyid7': 'political_aff',
                     'PPREG4': 'region',
                     'w6_otherdate_app_2': 'app_used',
                     'ppmarit': 'marital_status',
                     'hcm2017q24_met_online': 'met_online', 
                     })

print(df_numeric2017.head())
print(df_categorical2017.head())

# Introduction to Dataset

### DataTypes

### 2011 datatype check

In [None]:
print(df_numeric2011.dtypes)
print(df_categorical2011.dtypes)

### 2017 datatype check

In [None]:
print(df_numeric2017.dtypes)
print(df_categorical2017.dtypes)

### Fix up the numeric datasets 


In [None]:
df_numeric2011.age = df_numeric2011.age.astype(int)
df_numeric2017.age = df_numeric2017.age.astype(int)
print(df_numeric2011.dtypes)
print(df_numeric2017.dtypes)

### Null Check For 2011 data

In [None]:
print(df_categorical2011.isnull().sum())
print(df_numeric2011.isnull().sum())

![met_online_pdf.PNG](attachment:met_online_pdf.PNG)

We can double-check our work against the PDF codebook, which shows the missing values from the survey results.

### Null Check for 2017 data 

In [None]:
print(df_categorical2017.isnull().sum())
print(df_numeric2017.isnull().sum())

In [None]:
df_categorical2017.met_online

In [None]:
df_categorical2017.w6_relationship_quality

# MAR (Missing data at Random)

A few features have missing data. How should we deal with it? We will drop the `political_aff` category, as it is not needed.
However, `met_online` and `w6_relationship_quality` are being used, so we will drop their missing values before drawing the graphs.
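
One way the approach described above could look, sketched on a toy frame (the values are invented; only the column names follow the ones used in this notebook). We assume "dropping" means removing the unused column entirely and removing rows that lack a value in either plotted column:

```python
import numpy as np
import pandas as pd

# Toy data with gaps in each column (hypothetical values)
toy = pd.DataFrame({
    'met_online': ['yes', np.nan, 'no', 'yes'],
    'w6_relationship_quality': ['good', 'excellent', np.nan, 'fair'],
    'political_aff': [np.nan, 'democrat', 'republican', np.nan],
})

# Drop the column that is not needed, then drop rows that are missing
# a value in either of the two columns used for plotting
plot_ready = (
    toy.drop(columns=['political_aff'])
       .dropna(subset=['met_online', 'w6_relationship_quality'])
)
print(plot_ready)
```

Using `subset` keeps rows that are only missing values in columns we do not plot, so less data is discarded than with a bare `dropna()`.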

# One-Hot Encode

In [None]:
df_numeric2011['hhinc'] = df_numeric2011['hhinc'].astype(int)
df_numeric_encoded2011 = pd.get_dummies(df_numeric2011)
df_categorical_encoded2011 = pd.get_dummies(df_categorical2011)
print(df_categorical_encoded2011.head(3))
print(df_numeric_encoded2011.head(3))
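
To make the encoding step concrete, here is a minimal sketch on made-up data. `pd.get_dummies` leaves numeric columns alone and expands each categorical column into one indicator column per category, named `<column>_<category>`:

```python
import pandas as pd

# Toy data: one categorical column and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'gender': pd.Categorical(['female', 'male', 'female']),
    'age': [34, 52, 27],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Summing an indicator column counts the respondents in that category
female_count = int(encoded['gender_female'].sum())
print(female_count)
```

Summing the indicator works whether the dummies are stored as 0/1 integers or as booleans, which varies between pandas versions.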

# What month was most popular to meet your Significant Other? 

In [None]:
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# Count respondents per month once, then read the counts out in calendar order
month_counts = df_numeric2017['month_met'].value_counts()
counts = [month_counts[m] for m in months]
month_df = pd.DataFrame({'month': months, 'count': counts})
p = sns.barplot(x='month', y='count', data=month_df, palette='hls')
p.set(xlabel='Month', ylabel='Count', title='Month That Couples Met')
p.tick_params(axis='x', rotation=45)
plt.show()
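
If a month never appears in the data, looking it up in the counts raises a `KeyError`. A hedged alternative, shown here on a shortened toy example rather than the survey column, is to reindex the counts against the calendar order and fill absent months with zero:

```python
import pandas as pd

months = ['January', 'February', 'March']  # shortened list for the toy example
toy_month_met = pd.Series(['March', 'January', 'March', 'March'])

# value_counts sorts by frequency; reindex restores calendar order and
# fills months that never occur with 0 instead of raising a KeyError
month_counts = toy_month_met.value_counts().reindex(months, fill_value=0)
print(month_counts.tolist())
```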

# Gender 

In [None]:
# Visualize gender representation
# Summing a 0/1 (or boolean) indicator column counts the respondents in that category
female_count = df_categorical_encoded2011['gender_female'].sum()
male_count = df_categorical_encoded2011['gender_male'].sum()
gender = pd.DataFrame({'gender': ['Female', 'Male'], 'count': [female_count, male_count]})
fig_gender, ax_gender = plt.subplots(figsize=(10,10))
sns.barplot(x='gender', y='count', data=gender, palette='hls', ax=ax_gender)
ax_gender.set_xlabel('Gender')
ax_gender.set_ylabel('Count')
ax_gender.set_title('Gender Representation')

# Political 

In [None]:
# Visualize political representation
# Summing each indicator column counts the respondents with that affiliation
democrat_count = df_categorical_encoded2011['political_aff_democrat'].sum()
republican_count = df_categorical_encoded2011['political_aff_republican'].sum()
other_count = df_categorical_encoded2011['political_aff_other'].sum()
party_aff = pd.DataFrame({'political_party': ['democrat', 'republican', 'other'],
                          'count': [democrat_count, republican_count, other_count]})
fig_pparty, ax_pparty = plt.subplots(figsize=(10,10))
sns.barplot(x='political_party', y='count', data=party_aff, palette="hls", ax=ax_pparty)
ax_pparty.set_xlabel('Political Party')
ax_pparty.set_ylabel('Count')
ax_pparty.set_title('Political Party Affiliation Representation')

# Age vs HouseHold Income (2011)

In [56]:
np.random.seed(1)
print(scipy.stats.shapiro(df_categorical_encoded2011['political_aff_democrat']))
print(scipy.stats.shapiro(df_categorical_encoded2011['political_aff_republican']))
print(scipy.stats.shapiro(df_categorical_encoded2011['political_aff_other'].sample(n=500)))

ShapiroResult(statistic=0.623072624206543, pvalue=0.0)
ShapiroResult(statistic=0.6139096021652222, pvalue=0.0)
ShapiroResult(statistic=0.19334465265274048, pvalue=1.1784920084971712e-41)
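
The Shapiro-Wilk test checks the null hypothesis that a sample comes from a normal distribution, so the very small p-values above reject normality. That is the expected outcome for a 0/1 indicator column, which can only take two values. The sketch below illustrates this on synthetic data (not the survey columns), comparing an indicator with a continuous sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A 0/1 indicator column, like a one-hot encoded party affiliation (toy data)
indicator = rng.integers(0, 2, size=500)
# A continuous, roughly bell-shaped sample for comparison
continuous = rng.normal(loc=0.0, scale=1.0, size=500)

w_indicator, p_indicator = stats.shapiro(indicator)
w_continuous, p_continuous = stats.shapiro(continuous)

print(f'indicator:  W={w_indicator:.3f}, p={p_indicator:.3g}')
print(f'continuous: W={w_continuous:.3f}, p={p_continuous:.3g}')
```

A W statistic close to 1 is consistent with normality; the indicator sample scores far lower than the continuous one.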


# Bar Chart Race

In [27]:
#!pip install bar-chart-race

import bar_chart_race as bcr 

how_couples_met_2017 = df2017.loc[:, 'hcm2017q24_R_cowork':'hcm2017q24_met_online']
year = df2017['Q21A_Year']
year = year.rename('year')
how_couples_met_2017 = how_couples_met_2017.merge(year, left_index=True, right_index=True)
how_couples_met_2017.dropna(inplace=True)
how_couples_met_2017 = how_couples_met_2017.set_index('year')
how_couples_met_2017 = how_couples_met_2017.sort_index()
how_couples_met_2017 = how_couples_met_2017.drop(index=['Refused'], axis=0)
how_couples_met_2017.index = pd.to_datetime(how_couples_met_2017.index).year


In [28]:
how_couples_met_2017 = how_couples_met_2017.rename(columns={'hcm2017q24_R_cowork':'Respondents coworker', 'hcm2017q24_R_friend':'Respondents friend', 'hcm2017q24_R_family':'Respondents Family',
       'hcm2017q24_R_sig_other':'Respondents Significant Other',
       'hcm2017q24_R_neighbor': 'Respondents residential neighbor',
       'hcm2017q24_P_cowork': 'Partners coworker',
       'hcm2017q24_P_friend': 'Partners friend',
       'hcm2017q24_P_family': 'Partners family',
       'hcm2017q24_P_sig_other': 'Partners Significant Other',
       'hcm2017q24_P_neighbor': 'Partners Neighbor',
       'hcm2017q24_btwn_I_cowork': 'coworker relationship between intermediaries',
       'hcm2017q24_btwn_I_friend': 'friendship between intermediaries',
       'hcm2017q24_btwn_I_family': 'friendship between family',
       'hcm2017q24_btwn_I_sig_other': 'friendship between sign other',
       'hcm2017q24_btwn_I_neighbor': 'intermediaries are neighbors',
       'hcm2017q24_school' : 'met in primary or secondary school',
       'hcm2017q24_college': 'met in college',
       'hcm2017q24_mil': 'met during military service',
       'hcm2017q24_church': 'met in or through church or religious organization',
       'hcm2017q24_vol_org': 'met through voluntary organization',
       'hcm2017q24_customer': 'customer-client relationship',
       'hcm2017q24_bar_restaurant': 'bar, restaurant, public social gathering place',
       'hcm2017q24_party': 'private party',
       'hcm2017q24_internet_other': 'Internet',
       'hcm2017q24_internet_dating': 'met through Internet dating or phone app',
       'hcm2017q24_internet_soc_network': 'met through internet social networking',
       'hcm2017q24_internet_game': 'met through online gaming',
       'hcm2017q24_internet_chat': 'met through Internet chat',
       'hcm2017q24_internet_org': 'met through Internet site not mainly dedicated to dating',
       'hcm2017q24_public': 'met in public place',
       'hcm2017q24_blind_date': 'met on blind date',
       'hcm2017q24_vacation': 'met while on vacation',
       'hcm2017q24_single_serve_nonint': 'non internet single service',
       'hcm2017q24_business_trip': 'met while on business trip',
       'hcm2017q24_work_neighbors': 'met as work neighbors',
       'hcm2017q24_met_online': 'met online, all kinds'})

In [29]:
how_couples_met_2017 = pd.get_dummies(how_couples_met_2017)
how_couples_met_2017 = how_couples_met_2017[how_couples_met_2017.columns.drop(list(how_couples_met_2017.filter(regex='_no')))]
# groupby every 5 years
how_couples_met_2017 = how_couples_met_2017.groupby((how_couples_met_2017.index//5)*5).sum()
how_couples_met_2017.columns = how_couples_met_2017.columns.str.replace('_yes', '')
how_couples_met_2017.head(3)

year,Respondents coworker,Respondents friend,Respondents Family,Respondents Significant Other,Respondents residential neighbor,Partners coworker,Partners friend,Partners family,Partners Significant Other,Partners Neighbor,coworker relationship between intermediaries,friendship between intermediaries,friendship between family,friendship between sign other,intermediaries are neighbors,met in primary or secondary school,met in college,met during military service,met in or through church or religious organization,met through voluntary organization,customer-client relationship,"bar, restaurant, public social gathering place",private party,Internet,met through Internet dating or phone app,met through internet social networking,met through online gaming,met through Internet chat,met through Internet site not mainly dedicated to dating,met in public place,met on blind date,met while on vacation,non internet single service,met while on business trip,met as work neighbors,"met online, all kinds"
1935,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1945,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,2,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0
1950,0,7,3,1,3,0,6,3,1,4,0,0,0,1,0,3,2,2,2,1,0,3,1,0,0,0,0,0,0,1,1,1,0,0,0,0


In [30]:
how_couples_met_2017.index = pd.to_datetime(how_couples_met_2017.index, format='%Y')
how_couples_met_2017.index

DatetimeIndex(['1935-01-01', '1945-01-01', '1950-01-01', '1955-01-01',
               '1960-01-01', '1965-01-01', '1970-01-01', '1975-01-01',
               '1980-01-01', '1985-01-01', '1990-01-01', '1995-01-01',
               '2000-01-01', '2005-01-01', '2010-01-01', '2015-01-01'],
              dtype='datetime64[ns]', name='year', freq=None)

In [36]:
bcr.bar_chart_race(df=how_couples_met_2017, sort='desc',
                   orientation='h', steps_per_period=65, period_length=1000,
                   bar_size=1, dpi=200,
                   label_bars=True,
                   bar_label_size=7,
                   tick_label_size=7,
                   n_bars=10,
                   fixed_max=True,
                   title='How Couples Met (1938-2017)',
                   period_label={'x': .99, 'y': .1, 'ha': 'right', 'color': 'red'},
                   period_fmt='%Y',
                   filename=None)