Skip to content

keithmcnulty/speed_dating

Repository files navigation

It’s Valentines Day - a day when people think about love and relationships. How people meet and form relationship works a lot quicker than in our parent’s or grandparent’s day. I’m sure many of you are told how it used to be - you met someone, dated them for a while, proposed, got married. People who grew up in small towns maybe had one shot at finding love, so they made sure they didn’t mess it up.

Today finding a date is not a challenge - finding a match is probably the issue. In the last 20 years we’ve gone from traditional dating to online dating to speed dating to online speed dating. Now you just swipe left or swipe right, if that’s your thing.

In 2002-2004, Columbia University ran a speed-dating experiment where they tracked data over 21 speed dating sessions for mostly young adults meeting people of the opposite sex. I found the dataset and the key to the data here: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/.

I was interested in finding out what it was about someone during that short interaction that determined whether or not someone viewed them as a match. This is a great opportunity to practice simple logistic regression if you’ve never done it before.

The speed dating dataset

The dataset at the link above is quite substantial - over 8,000 observations with almost 200 datapoints for each. However, I was only interested in the speed dates themselves, and so I simplified the data and uploaded a smaller version of the dataset to my Github account here. I’m going to pull this dataset down and do some simple regression analysis on it to determine what it is about someone that influences whether someone sees them as a match.

Let’s pull the data and take a quick look:

library(tidyverse)
library(corrplot)

download.file("https://raw.githubusercontent.com/keithmcnulty/speed_dating/master/speed_data_data.RDS", "speed_dating_data.RDS")

data <- readRDS("speed_dating_data.RDS")

head(data, 3) %>% 
  knitr::kable()
gender age income goal career dec attr sinc intel fun amb shar like prob met
0 21 69487 2 lawyer 1 6 9 7 7 6 5 7 6 2
0 21 69487 2 lawyer 1 7 8 7 8 5 6 7 5 1
0 21 69487 2 lawyer 1 5 8 9 8 5 7 7 NA 1

We can work out from the key that:

  1. The first five columns are demographic - we may want to use them to look at subgroups later.
  2. The next seven columns are important. dec is the raters decision on whether this individual was a match and then follows scores out of ten on six characteristics: attractiveness, sincerity, intelligence, fun, ambitiousness and shared interests.
  3. The like column is an overall rating. The prob column is a rating on whether the rater believed that interest would be reciprocated and the final column is a binary on whether the two had met prior to the speed date, with the lower value indicating that they had met before.

We can leave the first four columns out of any analysis we do. Our outcome variable here is dec. I’m interested in the rest as potential explanatory variables. Before I start to do any analysis, I want to check if any of these variables are highly colinear - ie, have very high correlations. If two variables are measuring pretty much the same thing, I should probably remove one of them.

corr_matrix <- data %>% 
  dplyr::select(attr, sinc, intel, fun, amb, shar, like, prob, met) %>% 
  as.matrix() 

M <- cor(corr_matrix, use = "complete.obs")

corrplot::corrplot(M)

OK, clearly there’s mini-halo effects running wild when you speed date. But none of these get up really high (eg past 0.75), so I’m going to leave them all in because this is just for fun. I might want to spend a bit more time on this issue if my analysis had serious consequences here.

Running a logistic regression on the data

The outcome of this process is binary. The respondent decides yes or no. That’s harsh, I give you. But for a statistician it points straight to a binomial logistic regression as our primary analytic tool. Let’s run a logistic regression model on the outcome and potential explanatory variables I’ve identified above, and take a look at the results.

model <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
             data = data, 
             family = "binomial")

summary(model)

## 
## Call:
## glm(formula = dec ~ attr + sinc + intel + fun + amb + shar + 
##     like + prob + met, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8210  -0.7602  -0.2311   0.7921   3.6432  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.0769718  0.2101454 -28.918  < 2e-16 ***
## attr         0.4296537  0.0235843  18.218  < 2e-16 ***
## sinc        -0.2088995  0.0273358  -7.642 2.14e-14 ***
## intel       -0.0005595  0.0327799  -0.017    0.986    
## fun          0.1357729  0.0258310   5.256 1.47e-07 ***
## amb         -0.1904715  0.0251996  -7.559 4.08e-14 ***
## shar         0.1073134  0.0208044   5.158 2.49e-07 ***
## like         0.5607669  0.0325454  17.230  < 2e-16 ***
## prob         0.1503168  0.0178518   8.420  < 2e-16 ***
## met          0.0021065  0.0310848   0.068    0.946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9322.9  on 6818  degrees of freedom
## Residual deviance: 6460.5  on 6809  degrees of freedom
##   (1559 observations deleted due to missingness)
## AIC: 6480.5
## 
## Number of Fisher Scoring iterations: 5

So, perceived intelligence doesn’t really matter. This could be a factor of the population being studied, whom I believe were all undergraduates at Columbia and so would all have a high average SAT I suspect, and so intelligence might be less of a differentiator. Neither does whether on not you’d met someone before - not that surprising. Everything else seems to play a significant role.

More interesting is how much of a role each factor plays. The Coefficients Estimates in the model output above tell us the effect of each variable, assuming other variables are held still. But in the form above they are expressed in log odds, and we need to convert them to regular odds ratios so we can understand them better, so let’s adjust our results to do that.

ctable <- coef(summary(model))
odds_ratio <- exp(coef(summary(model))[ , c("Estimate")])
(coef_summary <-  cbind(ctable, as.data.frame(odds_ratio, nrow = nrow(ctable), ncol = 1))) %>% 
  knitr::kable()
Estimate Std. Error z value Pr(>|z|) odds_ratio
(Intercept) -6.0769718 0.2101454 -28.9179377 0.0000000 0.0022951
attr 0.4296537 0.0235843 18.2178165 0.0000000 1.5367252
sinc -0.2088995 0.0273358 -7.6419635 0.0000000 0.8114768
intel -0.0005595 0.0327799 -0.0170688 0.9863817 0.9994406
fun 0.1357729 0.0258310 5.2562008 0.0000001 1.1454217
amb -0.1904715 0.0251996 -7.5585158 0.0000000 0.8265693
shar 0.1073134 0.0208044 5.1582160 0.0000002 1.1132831
like 0.5607669 0.0325454 17.2302765 0.0000000 1.7520156
prob 0.1503168 0.0178518 8.4202745 0.0000000 1.1622023
met 0.0021065 0.0310848 0.0677678 0.9459705 1.0021088

So we have some interesting observations:

  1. Unsurprisingly, the respondents overall rating on someone is the biggest indicator of whether they decide to match with them.
  2. Attractiveness seems substantially the primary positive indicator of a match.
  3. Interestingly, sincerity and ambitiousness decreased the likelihood of a match - they were seemingly turn-offs for potential dates.
  4. Other factors played a minor positive role, including whether or not the respondent believed the interest to be reciprocated.

Comparing the genders

It’s of course natural to ask whether there are gender differences in these dynamics. So I’m going to rerun the analysis on the two gender subsets and then create a chart that illustrates any differences.

# females only

model_f <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
               data = data %>% 
                 dplyr::filter(gender == 0), 
               family = "binomial")

ctable_f <- coef(summary(model_f))
odds_ratio_f <- exp(coef(summary(model_f))[ , c("Estimate")])
coef_summary_f <-  cbind(ctable_f, as.data.frame(odds_ratio_f, nrow = nrow(ctable_f), ncol = 1))


model_m <- glm(dec ~ attr + sinc + intel + fun + amb + shar + like + prob + met, 
               data = data %>% 
                 dplyr::filter(gender == 1), 
               family = "binomial")

ctable_m <- coef(summary(model_m))
odds_ratio_m <- exp(coef(summary(model_m))[ , c("Estimate")])
coef_summary_m <-  cbind(ctable_m, as.data.frame(odds_ratio_m, nrow = nrow(ctable_m), ncol = 1))

chart_data <- coef_summary_f %>% 
  dplyr::add_rownames() %>% 
  dplyr::left_join(coef_summary_m %>% 
                     dplyr::add_rownames(), by = "rowname") %>% 
  dplyr::select(rowname, odds_ratio_f, odds_ratio_m) %>% 
  tidyr::pivot_longer(cols = c("odds_ratio_f", "odds_ratio_m"), names_to = "odds_ratio") %>% 
  dplyr::mutate(Effect = value - 1,
                Gender = ifelse(odds_ratio == "odds_ratio_f", "Female", "Male"),
                Factor = dplyr::recode(rowname, amb = "Ambitious", attr = "Attractive",
                                       fun = "Fun", intel = "Intelligent", like = "Liked",
                                       met = "Never met\nbefore", prob = "Believe\nthey like\nme",
                                       shar = "Shared\nInterests", sinc = "Sincere"))

ggplot(data=chart_data %>% 
              dplyr::filter(rowname != "(Intercept)"), 
            aes(x=Factor, y=Effect, fill=Gender)) +
  geom_bar(stat="identity", color="black", position=position_dodge()) +
  theme_minimal() +
  labs(x = "", title = "What matters in speed dating?") +
  scale_fill_manual(values=c('#FFC0CB', '#0000FF'))

We find a couple of interesting differences. As per the long suggested stereotype, intelligence does matter more to women. It has a significant positive effect versus men where it doesn’t seem to play a meaningful role. The other interesting difference is that whether you have met someone before does have a significant effect on both groups, but we didn’t see it before because it has the opposite effect for men and women. Men seemingly prefer new interactions, versus women who like to see a familiar face.

About

Analysis of data from a speed dating experiment

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published