GitHub - keithmcnulty/speed_dating: Analysis of data from a speed dating experiment

It’s Valentines Day - a day when people think about love and relationships. How people meet and form relationship works a lot quicker than in our parent’s or grandparent’s day. I’m sure many of you are told how it used to be - you met someone, dated them for a while, proposed, got married. People who grew up in small towns maybe had one shot at finding love, so they made sure they didn’t mess it up.

Today finding a date is not a challenge - finding a match is probably the issue. In the last 20 years we’ve gone from traditional dating to online dating to speed dating to online speed dating. Now you just swipe left or swipe right, if that’s your thing.

In 2002-2004, Columbia University ran a speed-dating experiment where they tracked data over 21 speed dating sessions for mostly young adults meeting people of the opposite sex. I found the dataset and the key to the data here: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/.

I was interested in finding out what it was about someone during that short interaction that determined whether or not someone viewed them as a match. This is a great opportunity to practice simple logistic regression if you’ve never done it before.

The speed dating dataset

The dataset at the link above is quite substantial - over 8,000 observations with almost 200 datapoints for each. However, I was only interested in the speed dates themselves, and so I simplified the data and uploaded a smaller version of the dataset to my Github account here. I’m going to pull this dataset down and do some simple regression analysis on it to determine what it is about someone that influences whether someone sees them as a match.

Let’s pull the data and take a quick look:

library(tidyverse)
library(corrplot)

download.file("https://raw.githubusercontent.com/keithmcnulty/speed_dating/master/speed_data_data.RDS", "speed_dating_data.RDS")

data <- readRDS("speed_dating_data.RDS")

head(data, 3) %>% 
  knitr::kable()

age	income	goal	career	dec	attr	sinc	intel	fun	amb	shar	like	prob	met
21	69487	2	lawyer	1	6	9	7	7	6	5	7	6	2
21	69487	2	lawyer	1	7	8	7	8	5	6	7	5	1
21	69487	2	lawyer	1	5	8	9	8	5	7	7	NA	1

We can work out from the key that:

The first five columns are demographic - we may want to use them to look at subgroups later.
The next seven columns are important. dec is the raters decision on whether this individual was a match and then follows scores out of ten on six characteristics: attractiveness, sincerity, intelligence, fun, ambitiousness and shared interests.
The like column is an overall rating. The prob column is a rating on whether the rater believed that interest would be reciprocated and the final column is a binary on whether the two had met prior to the speed date, with the lower value indicating that they had met before.

We can leave the first four columns out of any analysis we do. Our outcome variable here is dec. I’m interested in the rest as potential explanatory variables. Before I start to do any analysis, I want to check if any of these variables are highly colinear - ie, have very high correlations. If two variables are measuring pretty much the same thing, I should probably remove one of them.

corr_matrix <- data %>% 
  dplyr::select(attr, sinc, intel, fun, amb, shar, like, prob, met) %>% 
  as.matrix() 

M <- cor(corr_matrix, use = "complete.obs")

corrplot::corrplot(M)

OK, clearly there’s mini-halo effects running wild when you speed date. But none of these get up really high (eg past 0.75), so I’m going to leave them all in because this is just for fun. I might want to spend a bit more time on this issue if my analysis had serious consequences here.

Running a logistic regression on the data

The outcome of this process is binary. The respondent decides yes or no. That’s harsh, I give you. But for a statistician it points straight to a binomial logistic regression as our primary analytic tool. Let’s run a logistic regression model on the outcome and potential explanatory variables I’ve identified above, and take a look at the results.

model <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
             data = data, 
             family = "binomial")

summary(model)

## 
## Call:
## glm(formula = dec ~ attr + sinc + intel + fun + amb + shar + 
##     like + prob + met, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8210  -0.7602  -0.2311   0.7921   3.6432  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.0769718  0.2101454 -28.918  < 2e-16 ***
## attr         0.4296537  0.0235843  18.218  < 2e-16 ***
## sinc        -0.2088995  0.0273358  -7.642 2.14e-14 ***
## intel       -0.0005595  0.0327799  -0.017    0.986    
## fun          0.1357729  0.0258310   5.256 1.47e-07 ***
## amb         -0.1904715  0.0251996  -7.559 4.08e-14 ***
## shar         0.1073134  0.0208044   5.158 2.49e-07 ***
## like         0.5607669  0.0325454  17.230  < 2e-16 ***
## prob         0.1503168  0.0178518   8.420  < 2e-16 ***
## met          0.0021065  0.0310848   0.068    0.946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9322.9  on 6818  degrees of freedom
## Residual deviance: 6460.5  on 6809  degrees of freedom
##   (1559 observations deleted due to missingness)
## AIC: 6480.5
## 
## Number of Fisher Scoring iterations: 5

So, perceived intelligence doesn’t really matter. This could be a factor of the population being studied, whom I believe were all undergraduates at Columbia and so would all have a high average SAT I suspect, and so intelligence might be less of a differentiator. Neither does whether on not you’d met someone before - not that surprising. Everything else seems to play a significant role.

More interesting is how much of a role each factor plays. The Coefficients Estimates in the model output above tell us the effect of each variable, assuming other variables are held still. But in the form above they are expressed in log odds, and we need to convert them to regular odds ratios so we can understand them better, so let’s adjust our results to do that.

ctable <- coef(summary(model))
odds_ratio <- exp(coef(summary(model))[ , c("Estimate")])
(coef_summary <-  cbind(ctable, as.data.frame(odds_ratio, nrow = nrow(ctable), ncol = 1))) %>% 
  knitr::kable()

	Estimate	Std. Error	z value	Pr(>\|z\|)	odds_ratio
(Intercept)	-6.0769718	0.2101454	-28.9179377	0.0000000	0.0022951
attr	0.4296537	0.0235843	18.2178165	0.0000000	1.5367252
sinc	-0.2088995	0.0273358	-7.6419635	0.0000000	0.8114768
intel	-0.0005595	0.0327799	-0.0170688	0.9863817	0.9994406
fun	0.1357729	0.0258310	5.2562008	0.0000001	1.1454217
amb	-0.1904715	0.0251996	-7.5585158	0.0000000	0.8265693
shar	0.1073134	0.0208044	5.1582160	0.0000002	1.1132831
like	0.5607669	0.0325454	17.2302765	0.0000000	1.7520156
prob	0.1503168	0.0178518	8.4202745	0.0000000	1.1622023
met	0.0021065	0.0310848	0.0677678	0.9459705	1.0021088

So we have some interesting observations:

Unsurprisingly, the respondents overall rating on someone is the biggest indicator of whether they decide to match with them.
Attractiveness seems substantially the primary positive indicator of a match.
Interestingly, sincerity and ambitiousness decreased the likelihood of a match - they were seemingly turn-offs for potential dates.
Other factors played a minor positive role, including whether or not the respondent believed the interest to be reciprocated.

Comparing the genders

It’s of course natural to ask whether there are gender differences in these dynamics. So I’m going to rerun the analysis on the two gender subsets and then create a chart that illustrates any differences.

# females only

model_f <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
               data = data %>% 
                 dplyr::filter(gender == 0), 
               family = "binomial")

ctable_f <- coef(summary(model_f))
odds_ratio_f <- exp(coef(summary(model_f))[ , c("Estimate")])
coef_summary_f <-  cbind(ctable_f, as.data.frame(odds_ratio_f, nrow = nrow(ctable_f), ncol = 1))


model_m <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
               data = data %>% 
                 dplyr::filter(gender == 1), 
               family = "binomial")

ctable_m <- coef(summary(model_m))
odds_ratio_m <- exp(coef(summary(model_m))[ , c("Estimate")])
coef_summary_m <-  cbind(ctable_m, as.data.frame(odds_ratio_m, nrow = nrow(ctable_m), ncol = 1))

chart_data <- coef_summary_f %>% 
  dplyr::add_rownames() %>% 
  dplyr::left_join(coef_summary_m %>% 
                     dplyr::add_rownames(), by = "rowname") %>% 
  dplyr::select(rowname, odds_ratio_f, odds_ratio_m) %>% 
  tidyr::pivot_longer(cols = c("odds_ratio_f", "odds_ratio_m"), names_to = "odds_ratio") %>% 
  dplyr::mutate(Effect = value - 1,
                Gender = ifelse(odds_ratio == "odds_ratio_f", "Female", "Male"),
                Factor = dplyr::recode(rowname, amb = "Ambitious", attr = "Attractive",
                                       fun = "Fun", intel = "Intelligent", like = "Liked",
                                       met = "Never met\nbefore", prob = "Believe\nthey like\nme",
                                       shar = "Shared\nInterests", sinc = "Sincere"))

ggplot(data=chart_data %>% 
              dplyr::filter(rowname != "(Intercept)"), 
            aes(x=Factor, y=Effect, fill=Gender)) +
  geom_bar(stat="identity", color="black", position=position_dodge()) +
  theme_minimal() +
  labs(x = "", title = "What matters in speed dating?") +
  scale_fill_manual(values=c('#FFC0CB', '#0000FF'))

We find a couple of interesting differences. As per the long suggested stereotype, intelligence does matter more to women. It has a significant positive effect versus men where it doesn’t seem to play a meaningful role. The other interesting difference is that whether you have met someone before does have a significant effect on both groups, but we didn’t see it before because it has the opposite effect for men and women. Men seemingly prefer new interactions, versus women who like to see a familiar face.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
speed_dating_files/figure-markdown_strict		speed_dating_files/figure-markdown_strict
README.md		README.md
speed_data_data.RDS		speed_data_data.RDS
speed_data_data.csv		speed_data_data.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The speed dating dataset

Running a logistic regression on the data

Comparing the genders

About

Releases

Packages

keithmcnulty/speed_dating

Folders and files

Latest commit

History

Repository files navigation

The speed dating dataset

Running a logistic regression on the data

Comparing the genders

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages