In [1]:
import pandas as pd

## Sources and Citations

Fisman, R. J., Iyengar, S. S., Kamenica, E., & Simonson, I. (2006). *Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment*. The Quarterly Journal of Economics. https://academiccommons.columbia.edu/doi/10.7916/D8FB585Z. Data file share: [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating](http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating)

Short post and supplemental information about the dataset and the associated experiment: [https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1](https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1). 

The link for the file share in the formal citation at the start of this cell has both the data in CSV format and a key for understanding the CSV: [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating](http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating)

Dataset that I originally found and that led me to the source at the Columbia site: [https://www.openml.org/search?type=data&sort=runs&status=active&id=40536](https://www.openml.org/search?type=data&sort=runs&status=active&id=40536)

## Data and Research Paper Summary 

This dataset comes from a speed dating experiment run by Raymond Fisman and Sheena Iyengar of the Columbia Business School. I found an open copy of the data hosted on a Columbia University file share from a well-known statistics professor at Columbia.

The researchers conducted a series of separate speed date events where they rotated participants through four-minute speed dates and asked participants to rate their partners after each date. The ratings are based around six attributes:

- Attractive
- Sincere
- Intelligent
- Fun
- Ambitious
- Shared interests

The researchers end up using attractiveness, intelligence, and ambition while dropping the remaining three attributes. Participants are also asked to fill out pre-event surveys to provide non-event data that is also used to determine degrees of similarity between participants. 

Part of the large feature space in the CSV is due to the combinations of attribute ratings across different partners, averages about individuals based on ratings by all of their partners, and other ways of combining the event survey and pre-event survey results.

The researchers focus on differential gender preferences about these attributes, trying to glean ways in which women and men value different characteristics during these dates. 

It is important to note a couple of potential issues with the experiment design. The participants were students in Columbia University graduate and professional programs, and the dates are heteronormative. These both restrict the generalizability of the findings. I am also curious about the justification of using speed dating as a stand-in for all dating since speed dating seems to me to be a very different experience from dating over a longer time scale.

The data explores how different demographics and personal characteristics affect participants' feelings about the speed dates they participate in. One target variable is the `match` column, which indicates whether both participants in the speed date want to meet again. The experimental design also allows for decisions that are more granular than match or no match: the associated paper focuses on the `dec` column, which holds each individual's yes or no decision about a date rather than matches that require yes decisions on both sides. The researchers use the `dec` values to determine what women and men value in partners, homing in on differences in what women and men prioritize.
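
As a small illustration of the difference between the two targets, here is a sketch on made-up rows. It assumes the partner's decision is stored in a column named `dec_o`; that name does not appear in this notebook and should be checked against the data key. A match is then simply both decisions being yes.

```python
import pandas as pd

# Toy rows, not taken from the dataset. `dec` is the participant's decision;
# `dec_o` is an ASSUMED column name for the partner's decision.
dates = pd.DataFrame({
    "dec":   [1, 1, 0, 0],
    "dec_o": [1, 0, 1, 0],
})

# A match requires a yes from both sides of the date.
dates["match"] = (dates["dec"].eq(1) & dates["dec_o"].eq(1)).astype(int)

print(dates)
```

Only the first toy row is a match, even though two rows have `dec` equal to 1, which is why `dec` is the more granular target.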

The paper also references different social theories and starts to explore larger meanings about men and women based on this research's findings, including starting to challenge some of their own findings.

## Project

The choice of marriage partner is a perennial question. The more successful relationships out there, the better. On a personal note, I am looking at starting dating again myself, so I am curious about research that can help me to understand my own preferences and how those may fit in with dates that I go on. 

Because of the size of the feature space, layers of complexity, social theory, and prior knowledge built into the approach in the paper, I am going to attempt something simpler: can I predict if I will want a second date with someone and/or evaluate what factors are more important in terms of determining if I want a second date?

This is a classification problem that will use different features to predict if I will make a yes or no decision to see a person again at the end of the date. After training the model, I will provide most of the inputs before going on a date. Only one set of features will come from my experience on the date. This reduces the complexity of the model considerably and keeps me from having to gather information from others in order to use the model, a simplification that I want to lean into for the first iteration. It admittedly cuts out important information from the other person on the date, information that would help refine the model further and more honestly capture both people's opinions about the date and about each other.

A next step of this analysis would be to begin to look at features leading to matches -- situations where both date participants say yes to a second date -- but I would rather break the progression to that up into steps.

## Fixing Bad Characters in Source CSV

The source CSV from the link above has some characters that got encoded as the Unicode replacement character, the diamond with a question mark in it. You can search for it with a regex that looks for the code point `\uFFFD` or its UTF-8 byte sequence `\xEF\xBF\xBD`. These replacement characters appear to be the only non-ASCII characters in the file. You can search for non-ASCII characters with the character set `[^\x00-\x7F]`. A visual scan of the bad rows suggests that the characters were meant to be 'é'.

Sample bad value in the `undergra` field from the source CSV: Ecole Normale Sup�rieure, Paris


The following code block calls out to the shell to create a copy of the speed dating CSV and then run `sed` to replace the replacement characters with regular 'e's to keep everything in ASCII. I decided to use shell commands because I had trouble finding a way to read the bad characters in with Python, so I could not get to the step of replacing them in Python. I was able to manually change them with find and replace in Vim, so I modified that for `sed` for the solution below.

I ran this on macOS. You may need to make some tweaks on other platforms, Windows in particular.

In [2]:
! cp ./data/Speed\ Dating\ Data.csv ./data/speed_dating_data_fixed.csv
! sed -i '' 's/\xEF\xBF\xBD/e/g' ./data/speed_dating_data_fixed.csv  
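
For readers who would rather stay in Python, one possible alternative is to do the same replacement on the raw bytes, which sidesteps decoding entirely. This is only a sketch and not what was run for this notebook. It assumes the file literally contains the UTF-8 bytes for U+FFFD, which is what the `sed` pattern above targets. The function names `replace_bad_bytes` and `fix_csv` are made up for illustration.

```python
from pathlib import Path

# UTF-8 encoding of the Unicode replacement character U+FFFD.
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")  # b'\xef\xbf\xbd'


def replace_bad_bytes(raw: bytes, substitute: bytes = b"e") -> bytes:
    """Swap every replacement-character byte sequence for a plain ASCII byte."""
    return raw.replace(REPLACEMENT_BYTES, substitute)


def fix_csv(src: str, dst: str) -> None:
    """Read src as bytes, replace the bad sequences, and write the result to dst."""
    Path(dst).write_bytes(replace_bad_bytes(Path(src).read_bytes()))


# Demonstration on the sample value from the `undergra` field.
sample = "Ecole Normale Sup\ufffdrieure, Paris".encode("utf-8")
fixed = replace_bad_bytes(sample)
print(fixed.decode("ascii"))  # Ecole Normale Superieure, Paris

# Hypothetical usage with the paths from the shell cell above:
# fix_csv("./data/Speed Dating Data.csv", "./data/speed_dating_data_fixed.csv")
```

Working on bytes means the source file is never modified and no platform-specific `sed` flags are needed.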

## Loading and Initial Exploration

In [3]:
# The low_memory=False param tells pandas to determine column data types by looking at all rows
# in each column, resulting in needing to read in the entire CSV before being able to determine
# data types. low_memory=True results in chunking when reading in the file, and each chunk can
# infer a different data type.

df_raw = pd.read_csv('./data/speed_dating_data_fixed.csv', low_memory=False)
df_raw.head(10)

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1.0,1.0,0.0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1.0,1.0,0.0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1.0,1.0,0.0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1.0,1.0,0.0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1.0,1.0,0.0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1.0,1.0,0.0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1.0,1.0,0.0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1.0,1.0,0.0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1.0,1.0,0.0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1.0,1.0,0.0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [4]:
print(f'The dataset has {df_raw.shape[0]:,} rows and {df_raw.shape[1]} columns')

The dataset has 8,379 rows and 195 columns


## Column Summary

One of the more difficult aspects of this project is understanding and deciding what to do with the feature space of 195 columns. There are features with answers to the same questions asked at different times. There are features with speculative responses by participants. There are features with self-reported answers about questions such as what attributes someone likes in a partner, what someone thinks others find important in partners, and how someone thinks others perceive them. There are different approaches to ratings between different groups of participants. These and other particulars make the preprocessing for this dataset a bit trickier than expected. That said, because of the range of data available, the dataset lends itself to a range of questions.

One clarifying point for terminology. I use "participant" to refer to the individual that a row is for, "partner" for the other person participating in the date with the participant in that row, and "participants" (plural) to refer to more than one row or rows in general.

The details for these columns come from the "Speed Dating Data Key.doc" data dictionary reference document. 

Here is a general summary of the features:

There is a group of 28 questions that show up 4 times. Combined, these account for about 57% of the total columns. There is an interesting research question looking at what participants change during each of these repetitions, but I will go a different route and will not need to use all of these repetitions for my analysis.

These 28 repeating questions are related to these six attributes:

- Attractive
- Sincere
- Intelligent
- Fun
- Ambitious
- Shared interests (sometimes shared interests is not included)

This block of questions for each row asks that participant how they value these attributes, how they think fellow men or women value these attributes, how they think the opposite sex rates the importance of each of these attributes in potential partners, how participants rate themselves on these attributes, and how participants think others rate them on each of these attributes.

Note how within this repeating block of questions there are features that come from ratings that participants give themselves, that they predict about others, that they give to date partners, and that they receive from date partners. There is another interesting research question here about how participants may change their answers between repetitions, and there is another question looking at how self-ratings line up with ratings from date partners. I will not go into either of those directions with my analysis, but they would be interesting to look at.

Outside of the repeating block of features, there are a couple of other feature groupings that can help in understanding the feature space.

Intro info for a particular speed date:
- Different types of identifiers with different groupings
- Metadata about the event and date at event
- Boolean flag for match or no match
- Correlation between both participants' ratings of interest
- Boolean flag for same race
- Info about or from partner for that date, including preferences, decision about meeting again, and ratings of individual for that row

Signup/Time1 -- survey filled out by students interested in participating in speed dating event:
- Demographic and other personal info about survey applicant
- Ratings of interest in different activities
- Expectations for the dating event
- First time answering the repeating block of questions

Scorecard -- filled out by participants after each date:
- Rate date partner on the six attributes
- Decision about wanting to meet that partner again
- Overall rating of partner and whether you think they will say yes to wanting to see you again

Halfway point of speed dating event:
- Answer the repeating block of questions again

Followup/Time2 -- filled out the day after participating in an event:
- Feedback on the event
- Answer the repeating block of questions again

Followup2/Time3 -- 3-4 weeks after being sent matches:
- Feedback related to participants' matches
- Answer the repeating block of questions again

## What to Do With All These Features

Next is to narrow the feature space down to what we will use for modeling. Note that the only categorical variable that will need encoding is `goal`. I am treating the 1-10 ratings as discrete numeric variables instead of as ordinal categorical variables since the current numeric encoding captures the order that we want for them.

Here are the features that we will keep:
- gender (boolean): indicates if participant is male or female
- int_corr (continuous & in range [-1, 1]): correlation between the two date participants' ratings of interests; this summarizes participants' interest ratings for a group of 17 activities
- age_o (continuous & in range [18,55]): partner's age
- age (continuous & in range [18,55]): participant's age
- goal (categorical): participant's primary goal in participating in event
    1. Seemed like a fun night out
    2. To meet new people
    3. To get a date
    4. Looking for a serious relationship
    5. To say I did it
    6. Other
- exphappy (discrete & in range [1,10]): how happy participant expects to be with people they meet at event
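
Since `goal` is the one categorical feature that needs encoding, here is a minimal sketch of one way to one-hot encode it with `pd.get_dummies`. The short label names in `GOAL_LABELS` are my own abbreviations of the six options listed above, and the toy rows are made up. This is an illustration rather than the final preprocessing.

```python
import pandas as pd

# Hypothetical short labels for the six goal codes listed above.
GOAL_LABELS = {
    1: "fun_night_out",
    2: "meet_new_people",
    3: "get_a_date",
    4: "serious_relationship",
    5: "say_i_did_it",
    6: "other",
}

# Toy rows, including a missing value like the ones in the real CSV.
toy = pd.DataFrame({"goal": [2.0, 1.0, None, 4.0]})

# Fix the category list up front so every option gets a column,
# even if it never occurs in a particular sample.
goal_dtype = pd.CategoricalDtype(categories=list(GOAL_LABELS.values()))
goal_labeled = toy["goal"].map(GOAL_LABELS).astype(goal_dtype)

# Rows with a missing goal end up with zeros in every goal_* column.
goal_dummies = pd.get_dummies(goal_labeled, prefix="goal", dtype=int)

print(goal_dummies)
```

Declaring the categories explicitly keeps the encoded columns stable between the training data and any single row I later fill in for myself.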

From here onwards, we encounter repetitions of the six attributes above. I will provide headers that explain what the repetitions are asking about instead of adding that as a description for each feature.

Here are the abbreviations:

- attr#_#: attractive
- sinc#_#: sincere
- intel#_#: intelligent
- fun#_#: fun
- amb#_#: ambitious
- shar#_#: shared interests

The data dictionary says that waves 6-9 rate some groups of these features on a 1-10 scale while the remaining waves distribute 100 points across all five or six features. Some feature groupings that the dictionary indicates should be [1,10] are already rescaled in the CSV so that the features in the grouping sum to 100 for each row, which means the standardization turned the [1,10] ratings into relative shares on a [0,100] percentage scale. For groupings that require further standardization, we will do the same: sum the grouping for each row and, if the total is not 100, divide each value by the total and multiply by 100. This check cannot mistake raw 1-10 ratings for already-rescaled values, since five or six ratings of at most 10 sum to at most 60. Note that a participant who gives the same rating to every attribute, such as all 10s, ends up with equal shares (about 16.7 each across six attributes).

This means that the data type for each of these columns will end up being continuous & in range [0,100].
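
The row-wise standardization described above could be sketched as follows. The helper name `rescale_group_to_100` and the toy rows are made up for illustration, and dividing by the row total is my reading of the approach rather than a confirmed reproduction of how the source data was rescaled.

```python
import pandas as pd

PREF_COLUMNS = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]


def rescale_group_to_100(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df where each row of `columns` is rescaled to sum to 100."""
    out = df.copy()
    # min_count=1 keeps the total as NaN when every value in the row is missing.
    row_totals = out[columns].sum(axis=1, min_count=1)
    # Zero or missing totals become NaN so that no division by zero happens.
    safe_totals = row_totals.where(row_totals > 0)
    # Rows that already sum to 100 are left unchanged up to floating point error.
    out[columns] = out[columns].div(safe_totals, axis=0) * 100
    return out


# Toy rows: the first already sums to 100, the second uses 1-10 ratings.
toy = pd.DataFrame(
    [
        [15.0, 20.0, 20.0, 15.0, 15.0, 15.0],
        [8.0, 6.0, 7.0, 9.0, 5.0, 5.0],
    ],
    columns=PREF_COLUMNS,
)

rescaled = rescale_group_to_100(toy, PREF_COLUMNS)
print(rescaled.round(1))
```

One design choice to be aware of: a row with some missing ratings is rescaled using only the values that are present, so those rows may deserve a separate look before modeling.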

Now, back to the columns.

These six features are related to what the participant looks for in the opposite sex:
- attr1_1
- sinc1_1
- intel1_1
- fun1_1
- amb1_1
- shar1_1

The next six features are what the participant thinks most of their fellow men/women look for in the opposite sex:
- attr4_1
- sinc4_1
- intel4_1
- fun4_1
- amb4_1
- shar4_1

The next six features are what the participant thinks the opposite sex looks for in a date:
- attr2_1
- sinc2_1
- intel2_1
- fun2_1
- amb2_1
- shar2_1

The next five features are how the participant rates themselves (discrete & in range [1,10]):
- attr3_1
- sinc3_1
- intel3_1
- fun3_1
- amb3_1

The next five features are how the participant thinks others would rate them (discrete & in range [1,10]):
- attr5_1
- sinc5_1
- intel5_1
- fun5_1
- amb5_1

Moving on to information collected at the dates:
- dec (boolean): does the participant want to meet their date partner again

Next is how the participant rates their date on the six attributes from above (discrete & in range [1,10]): 
- attr
- sinc
- intel
- fun
- amb
- shar

And two final features related to the specific date:
- like (discrete & in range [1,10]): how much the participant likes their date overall
- prob (discrete & in range [1,10]): how probable do you think it is that your partner will want to see you again

## Preprocessing

In [6]:
df = df_raw[[
    'gender'
    , 'int_corr'
    , 'age_o'
    , 'age'
    , 'goal'
    , 'exphappy'
    , 'attr1_1'
    , 'sinc1_1'
    , 'intel1_1'
    , 'fun1_1'
    , 'amb1_1'
    , 'shar1_1'
    , 'attr4_1'
    , 'sinc4_1'
    , 'intel4_1'
    , 'fun4_1'
    , 'amb4_1'
    , 'shar4_1'
    , 'attr2_1'
    , 'sinc2_1'
    , 'intel2_1'
    , 'fun2_1'
    , 'amb2_1'
    , 'shar2_1'
    , 'attr3_1'
    , 'sinc3_1'
    , 'intel3_1'
    , 'fun3_1'
    , 'amb3_1'
    , 'attr5_1'
    , 'sinc5_1'
    , 'intel5_1'
    , 'fun5_1'
    , 'amb5_1'
    , 'dec'
    , 'attr'
    , 'sinc'
    , 'intel'
    , 'fun'
    , 'amb'
    , 'shar'
    , 'like'
    , 'prob'
]].copy()  # explicit copy avoids SettingWithCopyWarning when columns are modified later

df.head()

Unnamed: 0,gender,int_corr,age_o,age,goal,exphappy,attr1_1,sinc1_1,intel1_1,fun1_1,...,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob
0,0.0,0.14,27.0,21.0,2.0,3.0,15.0,20.0,20.0,15.0,...,,1.0,6.0,9.0,7.0,7.0,6.0,5.0,7.0,6.0
1,0.0,0.54,22.0,21.0,2.0,3.0,15.0,20.0,20.0,15.0,...,,1.0,7.0,8.0,7.0,8.0,5.0,6.0,7.0,5.0
2,0.0,0.16,22.0,21.0,2.0,3.0,15.0,20.0,20.0,15.0,...,,1.0,5.0,8.0,9.0,8.0,5.0,7.0,7.0,
3,0.0,0.61,23.0,21.0,2.0,3.0,15.0,20.0,20.0,15.0,...,,1.0,7.0,6.0,8.0,7.0,6.0,8.0,7.0,6.0
4,0.0,0.21,24.0,21.0,2.0,3.0,15.0,20.0,20.0,15.0,...,,1.0,5.0,6.0,7.0,7.0,6.0,6.0,6.0,6.0


## EDA

## Model Identification

I want to test out a few different classification models. The obvious model for a binary classification problem is logistic regression. I also want to test out support vector machines (SVM), especially for a larger feature space. Finally, I would like to test out a tree-based model, but the specifics for this one will depend on what I find from some of the feature reduction.
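
As a rough sketch of how these three candidates could be compared, the block below runs 5-fold cross-validation with scikit-learn on synthetic data from `make_classification`. The synthetic data stands in for the real features and `dec` target, and the hyperparameters (`max_iter`, the RBF kernel, `max_depth`) are placeholder assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features (X) and the dec target (y).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Logistic regression and SVM are sensitive to feature scale, so they are
# wrapped in a pipeline with a scaler; the tree does not need scaling.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean accuracy across 5 cross-validation folds for each candidate.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in mean_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold into the scaling step.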


## Model Building

## Model Training

## Results

## Conclusion