
### COMP 4447 DSTools1 Final Project
### Authors: Elizabeth Fugikawa & Heather Lemon

### How Online Dating and Dating App Usage Affect Relationships

**Love**. Love permeates many of the decisions we make in life. The motivation behind this analysis is to better understand how online dating and cell phone dating app usage affect relationships, using the data collected through [Stanford's How Couples Meet and Stay Together (HCMST) 2017](https://data.stanford.edu/hcmst2017).

Details in the collected data include political affiliation, mother's highest level of education, demographics, and whether respondents met their significant other online.

We will work through exploratory data analysis, feature engineering, cleaning, and visualization, including basic transformations and normalizations of the data.

### Detailed Notes Regarding Original Data Collection
The survey was administered by GfK Group on behalf of the Stanford couples study, as described in the GfK project report.
This new survey, How Couples Meet and Stay Together 2017 (HCMST 2017), features a fresh set of 3,510 survey respondents, with no overlap in subjects from the original HCMST survey which was first fielded in 2009.
HCMST 2017 features new questions about subjects' use of phone apps like Tinder and Grindr for dating and meeting partners.

Specifically, the purpose of this study is to bring knowledge of how couples meet up‐to‐date by
asking detailed questions about both the timing and the social contexts of how Americans meet
their romantic partners. Same‐sex couples have been oversampled both in order to provide
better information about the difficult‐to‐study sexual minority population, and in order to
provide new perspectives on the changing nature of same‐sex couple mating in the US.
Another key purpose is to examine how technology, specifically online dating and cell phone
apps like Tinder and Grindr, affect relationship formation, relationship quality, attachment to
the idea of monogamy, and relationship stability.

### Reference
Rosenfeld, Michael J., Reuben J. Thomas, and Sonia Hausen. 2019. How Couples Meet and Stay Together 2017 fresh sample. Stanford, CA: Stanford University Libraries.

## Table of Contents
> 1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    * 1.1 [Missing-Values](#Missing-Values) 
    * 1.2 [Basic Transformations](#Basic-Transformations)
    * 1.3 [Visualizing the Data](#Visualization-of-Data)
         * 1.3.1 [Seaborn PairPlot](#Seaborn-Pairplot)
         * 1.3.2 [Correlation Table](#Correlation-Table)
    * 1.4 [Exploratory Data Analysis Conclusion](#Exploratory-Data-Analysis-Conclusion)
    

# Importing Data

In [1]:
%%bash
# pull data
wget 'https://stacks.stanford.edu/file/druid:hg921sg6829/HCMST_2017_public_data_v1.1_stata.zip'
unzip HCMST_2017_public_data_v1.1_stata.zip
# remove zipped file
rm HCMST_2017_public_data_v1.1_stata.zip
# rename file
mv 'HCMST 2017 fresh sample for public sharing draft v1.1.dta' HCMST2017.dta


--2022-10-28 18:12:38--  https://stacks.stanford.edu/file/druid:hg921sg6829/HCMST_2017_public_data_v1.1_stata.zip
Resolving stacks.stanford.edu (stacks.stanford.edu)... 171.67.37.91
Connecting to stacks.stanford.edu (stacks.stanford.edu)|171.67.37.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 463647 (453K) [application/zip]
Saving to: ‘HCMST_2017_public_data_v1.1_stata.zip’


Archive:  HCMST_2017_public_data_v1.1_stata.zip
  inflating: HCMST 2017 fresh sample for public sharing draft v1.1.dta  


# Exploratory Data Analysis

We begin by importing the required libraries and loading the data file.

In [10]:
import pandas as pd
import logging
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import scipy.stats  # import the stats submodule explicitly; scipy.stats.shapiro is used below
pd.options.display.max_columns = None
import warnings
warnings.filterwarnings("ignore")

In [11]:
df = pd.read_stata('HCMST2017.dta')
df.head()

Unnamed: 0,CaseID,CASEID_NEW,qflag,weight1,weight1_freqwt,weight2,weight1a,weight1a_freqwt,weight_combo,weight_combo_freqwt,duration,speed_flag,consent,xlgb,S1,S2,S3,DOV_Branch,Q3_Refused,Q4,Q5,Q6A,Q6B,Q9,Q10,Q11,Q12,Q14,Q15A7,Q16,Q16_Refused,Q17A,Q17B,Q17C,Q17D,Q19,Q20,Q21A_Year,Q21A_Month,Q21B_Year,Q21B_Month,Q21C_Year,Q21C_Month,Q21D_Year,Q21D_Month,w6_identity,w6_outness,w6_outness_timing,Q23,Q24_Refused,Q25,Q26,Q27,Q28,w6_friend_connect_1,w6_friend_connect_2,w6_friend_connect_3,w6_friend_connect_4,w6_friend_connect_Refused,Q32,Q34,Q35_Refused,w6_sex_frequency,w6_otherdate,w6_how_many,w6_how_meet_Refused,w6_otherdate_app,w6_how_many_app,Past_Partner_Q1,w6_relationship_end_nonmar,w6_breakup_nonmar,w6_relationship_end_mar,w6_who_breakup,Q5_2,Q6A_2,Q9B_2,Q10_2,Q11_2,Q12_2,Q14_2,Q15A7_2_1,Q16_2,Q16_2_Codes,Q17B_2,Q17C_2,Q17D_2,Q20_2,Q21A_2_Year,Q21A_2_Month,Q21B_2_Year,Q21B_2_Month,Q21C_2_Year,Q21C_2_Month,Q21D_2_Year,Q21D_2_Month,Q21E_2_Year,Q21E_2_Month,Q21F_2_start_range,Q21F_2_Year,Q21F_2_Month,w6_identity_2,w6_outness_2,w6_outness_timing_2,Q23_2,Q25_2,Q26_2,Q27_2,Q28_2,w6_friend_connect_2_1,w6_friend_connect_2_2,w6_friend_connect_2_3,w6_friend_connect_2_4,w6_friend_connect_2_Refused,Q32_2,w6_otherdate_2,w6_how_many_2,w6_otherdate_app_2,w6_how_many_app_2,partyid7,PERSNET_hom,ppc10017,ppc21310,ppp20071,ppp20072,ppage,ppagecat,ppagect4,ppeduc,ppeducat,ppethm,ppgender,pphhhead,pphhsize,pphouse,ppincimp,ppmarit,ppmsacat,PPREG4,ppreg9,pprent,PPT01,PPT25,PPT612,PPT1317,PPT18OV,ppwork,Race_1,Race_2,Race_3,Race_4,Race_5,Race_6,race1,race2,race3,race4,race5,race6,race7,race8,race9,race10,race11,race12,race13,race14,race15,w6_took_the_survey,w6_prior_identity_lgb,w6_same_sex_couple,w6_same_sex_couple_gender,w6_q4,w6_q5,w6_q6a,w6_q6b,w6_q9,w6_q10,w6_q11,w6_q12,w6_q14,w6_q15a1_truncated,w6_q15a4_truncated,w6_q15a7,w6_q16,w6_q17,w6_attraction,w6_q19,w6_q20,w6_q21a_year,w6_q21a_month,w6_q21a_month_flag,w6_q21b_year,w6_q21b_month,w6_q21b_month_flag,w6_q21c_year,w6_q21c_month,w
6_q21c_month_flag,w6_q21d_year,w6_q21d_month,w6_q21e_year,w6_q21e_month,w6_q21f_year,w6_q21f_month,w6_identity_all,w6_outness_all,w6_outness_timing_all,w6_q23,w6_q24_length,w6_q25,w6_q26,w6_q27,w6_q28,w6_friend_connect_1_all,w6_friend_connect_2_all,w6_friend_connect_3_all,w6_friend_connect_4_all,w6_q32,w6_q34,w6_otherdate_all,w6_how_many_all,w6_otherdate_app_all,w6_how_many_app_all,w6_number_people_met,w6_otherdate_dichotomous,w6_married,relate_duration_at_w6_years,w6_number_people_met_app,weight_combo_v2,partnership_status,female,year_fraction_met,year_fraction_relstart,age_when_met,time_from_met_to_rel,year_fraction_first_cohab,time_from_rel_to_cohab,hcm2017q24_R_cowork,hcm2017q24_R_friend,hcm2017q24_R_family,hcm2017q24_R_sig_other,hcm2017q24_R_neighbor,hcm2017q24_P_cowork,hcm2017q24_P_friend,hcm2017q24_P_family,hcm2017q24_P_sig_other,hcm2017q24_P_neighbor,hcm2017q24_btwn_I_cowork,hcm2017q24_btwn_I_friend,hcm2017q24_btwn_I_family,hcm2017q24_btwn_I_sig_other,hcm2017q24_btwn_I_neighbor,hcm2017q24_school,hcm2017q24_college,hcm2017q24_mil,hcm2017q24_church,hcm2017q24_vol_org,hcm2017q24_customer,hcm2017q24_bar_restaurant,hcm2017q24_party,hcm2017q24_internet_other,hcm2017q24_internet_dating,hcm2017q24_internet_soc_network,hcm2017q24_internet_game,hcm2017q24_internet_chat,hcm2017q24_internet_org,hcm2017q24_public,hcm2017q24_blind_date,hcm2017q24_vacation,hcm2017q24_single_serve_nonint,hcm2017q24_business_trip,hcm2017q24_work_neighbors,hcm2017q24_met_online,hcm2017_q24_length,hcm2017q24_summary_all_codes,w6_relationship_quality,hcm2017q24_met_through_family,hcm2017q24_met_through_friend,hcm2017q24_met_through_as_nghbrs,hcm2017q24_met_as_through_cowork,w6_subject_race,interracial_5cat,partner_mother_yrsed,subject_mother_yrsed,partner_yrsed,subject_yrsed
0,2,2014039,Qualified,,,0.8945,,,0.277188,19240.0,9,Completed survey in over 2 minutes,"Yes, I agree to participate",LGB sample,"No, I am not Married","No, I am single, with no boyfriend, no girlfri...",Yes,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,No,We broke up,I wanted to break up more,,,"Yes, we were a same-sex couple",No (Not Latino or Hispanic),1991.0,HS graduate or GED,HS graduate or GED,Leans Republican,Associate degree,I met [Partner Name] in [answer in Q15A7_2],1.0,,Once,,I am equally sexually attracted to men and women,No,2017.0,March,2017.0,March,,,,,2017.0,June,Q21B_2,,,bisexual,Only a few of them,13.0,I earned more,Different High School,,No,No,No,No,No,Yes,No,"Yes, an Internet dating or matchmaking site (l...","Yes, I have met at least one person for dating...",Two to Five people. I met between two and five...,"No, I have not used a phone dating app in the ...",,Leans Democrat,Yes,Yes,Every day,Not asked,Never,30,25-34,30-44,Associate degree,Some college,"White, Non-Hispanic",Male,Yes,1,A one-family house detached from any other house,"$40,000 to $49,999",Divorced,Metro,Northeast,Mid-Atlantic,Owned or being bought by you or someone in you...,0,0,0,0,1,Working - as a paid employee,Yes,No,No,No,No,No,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,took the survey,LGB,same_sex_couple,gay male couple,[Partner Name] is Male,"Yes, we are a same-sex couple",No (Not Latino or Hispanic),White,26.0,HS graduate or GED,HS graduate or GED,Leans Republican,Associate degree,United States,United States,I met [Partner Name] in [Answer in Q15A6],1.0,1.0,sexually attracted to men and women equally,,No,2017.0,March,no,2017.0,March,0.0,,,,,,2017.0,June,,,bisexual,Only a few of them,13.0,I earned more,232.0,Different High School,,No,No,no,no,no,yes,"Yes, an Internet dating or matchmaking site (l...",,"Yes, I have met at least one person for dating...","Yes, I have met at least one person for dating...",1.0,,3.5,yes,no,,0.0,0.39005,"unpartnered, has had past 
partner",0,2017.208374,2017.208374,30.0,0.0,,,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,yes,no,no,no,no,no,no,no,no,no,no,yes,232.0,1.0,,no,no,no,no,White,no,12.0,14.0,12.0,14.0
1,3,2019003,Qualified,0.9078,71115.0,,0.9026,70707.0,1.020621,70841.0,11,Completed survey in over 2 minutes,"Yes, I agree to participate",gen pop,"Yes, I am Married",,,1,,[Partner Name] is Male,,No (Not Latino or Hispanic),White,52.0,Masters degree,HS graduate or GED,Leans Republican,Bachelors degree,I met [Partner Name] in [Answer in Q15A6],1.0,,Once (this is my first marriage),,I am sexually attracted only to men,,Yes,,1983.0,May,1995.0,August,1996.0,February,1996.0,February,heterosexual or straight,,,[Partner Name] earned more,,Different High School,Did not attend same college or university,No,No,No,No,No,Yes,No,"No, I did NOT meet [Partner Name] through the ...",Excellent,,Once a month or less,"No, I have not met anyone for dating, romance,...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Not Strong Republican,Not asked,Yes,Every day,No,Never,55,55-64,45-59,Masters degree,Bachelor's degree or higher,"White, Non-Hispanic",Female,Yes,4,A one-family house detached from any other house,"$150,000 to $174,999",Married,Metro,Midwest,East-North Central,Owned or being bought by you or someone in you...,0,0,0,2,2,Not working - other,Yes,No,No,No,No,No,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,took the survey,straight/ non LGB,NOT same-sex souple,hetero couple,[Partner Name] is Male,,No (Not Latino or Hispanic),White,52.0,Masters degree,HS graduate or GED,Leans Republican,Bachelors degree,United States,United States,I met [Partner Name] in [Answer in Q15A6],1.0,1.0,sexually attracted only to opposite gender,Yes,,1983.0,May,no,1995.0,August,0.0,1996.0,February,,1996.0,February,,,,,heterosexual or straight,,,[Partner Name] earned more,213.0,Different High School,2.0,No,No,no,no,no,yes,"No, I did NOT meet [Partner Name] through the ...",Excellent,"No, I have not met anyone for dating, 
romance,...",,,,0.0,no,yes,21.916666,,0.999373,married,1,1983.375,1995.625,21.0,12.25,1996.125,0.5,yes,no,no,no,no,yes,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,213.0,2.0,excellent,no,no,no,yes,White,no,12.0,16.0,17.0,17.0
2,5,2145527,Qualified,0.7205,56442.0,,0.7164,56121.0,0.810074,56227.0,7,Completed survey in over 2 minutes,"Yes, I agree to participate",gen pop,"Yes, I am Married",,,1,,[Partner Name] is Female,,No (Not Latino or Hispanic),White,45.0,Associate degree,9th grade,Leans Democrat,7th or 8th grade,I met [Partner Name] in [Answer in Q15A6],0.0,,Once (this is my first marriage),,,I am sexually attracted only to women,Yes,,2006.0,January,2006.0,June,2006.0,July,2008.0,May,heterosexual or straight,,,I earned more,,Different High School,Did not attend same college or university,No,No,No,No,No,Yes,No,"Yes, an Internet dating or matchmaking site (l...",Good,Refused,2 to 3 times a month,"No, I have not met anyone for dating, romance,...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Leans Democrat,Yes,Yes,Every day,Yes,Once or twice a month,47,45-54,45-59,Masters degree,Bachelor's degree or higher,"White, Non-Hispanic",Male,Yes,5,A one-family house detached from any other house,"$200,000 to $249,999",Married,Metro,South,South Atlantic,Owned or being bought by you or someone in you...,0,1,2,0,2,Working - as a paid employee,Yes,No,No,No,No,No,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,took the survey,straight/ non LGB,NOT same-sex souple,hetero couple,[Partner Name] is Female,,No (Not Latino or Hispanic),White,45.0,Associate degree,9th grade,Leans Democrat,7th or 8th grade,"Another country, please specify","Another country, please specify",I met [Partner Name] in [Answer in Q15A6],0.0,1.0,sexually attracted only to opposite gender,Yes,,2006.0,January,no,2006.0,June,0.0,2006.0,July,,2008.0,May,,,,,heterosexual or straight,,,I earned more,87.0,Different High School,2.0,No,No,no,no,no,yes,"Yes, an Internet dating or matchmaking site (l...",Good,"No, I have not met anyone for dating, 
romance,...",,,,0.0,no,yes,11.083333,,0.793209,married,0,2006.041626,2006.458374,36.0,0.416748,2006.541626,0.083252,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,yes,no,yes,no,no,no,no,no,no,no,no,no,no,no,yes,87.0,2.0,good,no,no,no,no,White,no,9.0,7.5,14.0,17.0
3,6,2648857,Qualified,1.2597,98682.0,1.3507,1.2524,98110.0,0.418556,29052.0,5,Completed survey in over 2 minutes,"Yes, I agree to participate",gen pop,"No, I am not Married","No, I am single, with no boyfriend, no girlfri...",Yes,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,No,We broke up,[Partner Name] wanted to break up more,,,"Yes, we were a same-sex couple",No (Not Latino or Hispanic),1991.0,HS graduate or GED,Bachelors degree,Undecided/Independent/Other,HS graduate or GED,"I met [Partner Name] somewhere else , Please s...",0.0,,Never married,"I am mostly sexually attracted to women, less ...",,No,2012.0,March,2013.0,April,,,,,2013.0,June,Q21B_2,,,bisexual,All or most of them,17.0,I earned more,Different High School,,No,No,No,No,No,Yes,No,"Yes, a social networking site (like Facebook o...","No, I have not met anyone for dating, romance,...",,,,Strong Democrat,Yes,Yes,Every day,No,Never,28,25-34,18-29,12th grade NO DIPLOMA,Less than high school,"White, Non-Hispanic",Female,No,3,A one-family house detached from any other house,"$40,000 to $49,999",Never married,Metro,Midwest,West-North Central,Owned or being bought by you or someone in you...,0,0,0,0,3,Not working - other,Yes,No,No,No,No,No,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,took the survey,LGB,same_sex_couple,lesbian couple,[Partner Name] is Female,"Yes, we are a same-sex couple",No (Not Latino or Hispanic),White,26.0,HS graduate or GED,Bachelors degree,Undecided/Independent/Other,HS graduate or GED,United States,United States,"I met [Partner Name] somewhere else, Please sp...",0.0,0.0,"sexually attracted mostly to same gender, some...",,No,2012.0,March,no,2013.0,April,0.0,,,,,,2013.0,June,,,bisexual,All or most of them,17.0,I earned more,80.0,Different High School,,No,No,no,no,no,yes,"Yes, a social networking site (like Facebook o...",,"No, I have not met anyone for dating, romance,...",,,,0.0,no,no,,,0.588978,"unpartnered, has had past 
partner",1,2012.208374,2013.291626,23.0,1.083252,,,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,yes,no,no,no,no,no,no,no,no,yes,80.0,1.0,,no,no,no,no,White,no,16.0,12.0,12.0,12.0
4,7,2623465,Qualified,0.8686,68044.0,,0.8636,67652.0,0.976522,67781.0,13,Completed survey in over 2 minutes,"Yes, I agree to participate",gen pop,"Yes, I am Married",,,1,,[Partner Name] is Male,,No (Not Latino or Hispanic),White,59.0,Bachelors degree,Associate degree,Strong Democrat,Masters degree,I met [Partner Name] in [Answer in Q15A6],0.0,,Once (this is my first marriage),,"I am mostly sexually attracted to men, less of...",,Yes,,1983.0,September,1983.0,October,1984.0,August,1984.0,August,heterosexual or straight,,,[Partner Name] earned more,,Different High School,Did not attend same college or university,No,No,No,No,No,Yes,No,"No, I did NOT meet [Partner Name] through the ...",Excellent,,3 to 6 times a week,"No, I have not met anyone for dating, romance,...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Strong Democrat,Yes,Yes,Every day,No,Once a year or less,59,55-64,45-59,Bachelors degree,Bachelor's degree or higher,"White, Non-Hispanic",Female,Yes,4,A one-family house detached from any other house,"$175,000 to $199,999",Married,Metro,South,South Atlantic,Owned or being bought by you or someone in you...,0,0,0,0,4,Working - as a paid employee,Yes,No,No,No,No,No,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,took the survey,straight/ non LGB,NOT same-sex souple,hetero couple,[Partner Name] is Male,,No (Not Latino or Hispanic),White,59.0,Bachelors degree,Associate degree,Strong Democrat,Masters degree,United States,United States,I met [Partner Name] in [Answer in Q15A6],0.0,1.0,"sexually attracted mostly to opposite gender, ...",Yes,,1983.0,September,no,1983.0,October,0.0,1984.0,August,,1984.0,August,,,,,heterosexual or straight,,,[Partner Name] earned more,648.0,Different High School,2.0,No,No,no,no,no,yes,"No, I did NOT meet [Partner Name] through the ...",Excellent,"No, I have not met anyone for dating, 
romance,...",,,,0.0,no,yes,33.75,,0.956191,married,1,1983.708374,1983.791626,25.0,0.083252,1984.625,0.833374,no,no,no,no,yes,no,no,no,no,yes,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,no,yes,no,no,no,no,no,818.0,3.0,excellent,no,no,yes,no,White,no,14.0,17.0,16.0,16.0


### Read Stata & Feature Selection

In [14]:
df = pd.read_stata('HCMST2017.dta')

# Divide data into numeric or categorical responses.
# Column names follow the df.head() output above. 'hhinc' and 'papreligion'
# are not in that output, so they are not selected here. 'partyid7',
# 'hcm2017q24_met_online', and 'w6_relationship_quality' appear to be the
# closest counterparts of 'pppartyid3', 'q24_met_online', and
# 'relationship_quality'.
df_numeric = df[['ppage', 'ppagecat']].rename(
    columns={'ppage': 'age', 'ppagecat': 'cat_age'})

df_categorical = df[['ppgender', 'ppeducat', 'ppincimp',
                     'ppwork', 'partyid7', 'ppreg9',
                     'ppmarit', 'hcm2017q24_met_online',
                     'w6_otherdate_app_2',
                     'w6_relationship_quality']].rename(
    columns={'ppgender': 'gender',
             'ppeducat': 'educ',
             'ppincimp': 'incomecat',
             'ppwork': 'job_status',
             'partyid7': 'political_aff',
             'ppreg9': 'region',
             'w6_otherdate_app_2': 'app_used',
             'ppmarit': 'marital_status',
             'hcm2017q24_met_online': 'met_online',
             'w6_relationship_quality': 'relationship_quality'})

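Selecting a list of columns raises a `KeyError` when any requested name is absent, so it can help to compare the wanted names against the frame first. Below is a minimal sketch of that check, not part of the original analysis. `split_present_missing` is a hypothetical helper, and the toy frame with made-up values stands in for the survey data.

```python
import pandas as pd


def split_present_missing(frame, wanted):
    """Return (present, missing) column names, preserving the order of `wanted`."""
    present = [c for c in wanted if c in frame.columns]
    missing = [c for c in wanted if c not in frame.columns]
    return present, missing


# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({"ppage": [30, 55], "ppgender": ["Male", "Female"]})

present, missing = split_present_missing(toy, ["ppage", "ppgender", "hhinc"])
print("present:", present)
print("missing:", missing)
```

Running the same helper on `df` with the full list of wanted names would show which ones need to be renamed or dropped before slicing.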

### Introduction to Dataset

In [6]:
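# Cast 'hhinc' to int only if that column was selected.
if 'hhinc' in df_numeric.columns:
    df_numeric['hhinc'] = df_numeric['hhinc'].astype(int)
# One-hot encode before printing, so the previews show the encoded frames.
df_numeric_encoded = pd.get_dummies(df_numeric)
df_categorical_encoded = pd.get_dummies(df_categorical)
print(df_categorical_encoded.head())
print(df_numeric_encoded.head())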

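To make the encoded column names used later easier to follow, here is a small sketch of how `pd.get_dummies` names its output. The toy frame and its labels are illustrative assumptions, not values from the survey.

```python
import pandas as pd

# Toy categorical frame; the labels are illustrative, not survey values.
toy = pd.DataFrame(
    {"political_aff": ["democrat", "republican", "other", "democrat"]}
)

# Each category becomes its own indicator column named "<column>_<category>".
encoded = pd.get_dummies(toy)
print(sorted(encoded.columns))

# Summing an indicator column counts the rows in that category,
# whether pandas stores the indicators as booleans or as 0/1 integers.
democrat_count = int(encoded["political_aff_democrat"].sum())
print(democrat_count)
```

The encoded names depend on the exact category labels in the data, so it is worth printing `encoded.columns` before referring to a specific indicator column.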

### Data Types

In [None]:
print(df_numeric.dtypes)
print(df_categorical.dtypes)

### Null Check

In [None]:
print(df_categorical.isnull().sum())
print(df_numeric.isnull().sum())
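Raw null counts are easier to compare across columns when expressed as a share of rows. The sketch below shows one way to do that on a toy frame with made-up values. It is an illustration under those assumptions, not a result from the survey.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (values are made up).
toy = pd.DataFrame({
    "age": [30, 55, np.nan, 28],
    "met_online": ["yes", None, None, "no"],
})

# isnull() marks missing cells; mean() of the boolean mask gives the
# fraction of missing entries per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `df_categorical` or `df_numeric` would rank columns by how incomplete they are.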

### Describe dataset

In [None]:
print(df_categorical_encoded.describe())
print(df_numeric_encoded.describe())

In [None]:
# Interesting question: in what season or month did you meet your significant other?
# TODO: viz of map/region, pull month-met data, pairplot, (ggqqplot) normal Q-Q plot for numeric values,
# income, pivot_tables, regplot, avg age vs income, missing data?

In [None]:
big_df = pd.concat([df_numeric, df_categorical], axis=1)
t1 = big_df.pivot_table(values=["hhinc"], index=["region"], aggfunc=np.mean)
t2 = big_df.pivot_table(values=["id"], index=["marital_status", "met_online"], aggfunc='count')
t3 = big_df.pivot_table(values=["id"], index=["political_aff", "cat_age"], aggfunc='count')
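As a reference for reading the count tables, the following sketch runs the same kind of `pivot_table` call on a toy frame. The `id`, `marital_status`, and `met_online` values here are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for big_df (values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "marital_status": ["Married", "Married", "Divorced", "Married"],
    "met_online": ["yes", "no", "yes", "yes"],
})

# Count respondents in each (marital_status, met_online) combination.
counts = toy.pivot_table(
    values="id", index=["marital_status", "met_online"], aggfunc="count"
)
print(counts)
```

Each row of the result is one combination of the index columns, and the `id` column holds the number of respondents in it.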

In [None]:
# Visualize gender representation
female_count = df_categorical['gender'].value_counts()['Female']
male_count = df_categorical['gender'].value_counts()['Male']
gender = pd.DataFrame({'gender': ['female', 'male'], 'count': [female_count, male_count]})
sns.barplot(x='gender', y='count', data=gender, palette='hls')

In [None]:
# Visualize political representation
# Summing an indicator column counts its members (works for bool or 0/1 dummies).
democrat_count = df_categorical_encoded['political_aff_democrat'].sum()
republican_count = df_categorical_encoded['political_aff_republican'].sum()
other_count = df_categorical_encoded['political_aff_other'].sum()
party_aff = pd.DataFrame({'political_party': ['democrat', 'republican', 'other'],
                          'count': [democrat_count, republican_count, other_count]})
sns.barplot(x='political_party', y='count', data=party_aff, palette="hls")
plt.xlabel('Political Party Affiliation')
plt.ylabel('Count')
plt.title('Political Party Affiliation Representation')

In [None]:
# Keep only the age-bucket indicator columns: 'id', 'age', and 'hhinc' are
# dropped in one call, and errors='ignore' skips any that are not present.
age_col_df = df_numeric_encoded.iloc[2:, :10].drop(
    columns=['id', 'age', 'hhinc'], errors='ignore')
count_df = pd.DataFrame({'count': age_col_df.sum()})
count_df.rename(index={'cat_age_18-24': '18-24', 'cat_age_25-34': '25-34',
                       'cat_age_35-44': '35-44', 'cat_age_45-54': '45-54',
                       'cat_age_55-64': '55-64', 'cat_age_65-74': '65-74',
                       'cat_age_75+': '75+'}, inplace=True)
count_df.reset_index(inplace=True)
count_df.rename(columns={'index': 'age'}, inplace=True)
sns.barplot(x='age', y='count', data=count_df, palette='hls')

In [None]:
sns.relplot(x='age', y='hhinc', kind='line', data=df_numeric)
# sns.pairplot(df_numeric, x_vars=['age'], y_vars=['hhinc'], palette='hls', hue='hhinc', height=5)

In [None]:
model = sm.OLS(df_categorical_encoded['met_online_met offline'], df_numeric['hhinc'])
results = model.fit()
print(results.params)
print(results.summary())

In [None]:
np.random.seed(1)
print(scipy.stats.shapiro(df_categorical_encoded['political_aff_democrat']))
print(scipy.stats.shapiro(df_categorical_encoded['political_aff_republican']))
print(scipy.stats.shapiro(df_categorical_encoded['political_aff_other'].sample(n=500)))

In [3]:
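# qqplot needs numeric input, so plot the integer category codes
# (assumes 'political_aff' was loaded as a pandas categorical by read_stata).
sm.qqplot(data=df_categorical['political_aff'].cat.codes, line='45')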