In [1]:
import pandas as pd

## Sources and Citations

Fisman, R. J., Iyengar, S. S., Kamenica, E., & Simonson, I. (2006). *Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment*. [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating]. The Quarterly Journal of Economics. https://academiccommons.columbia.edu/doi/10.7916/D8FB585Z

Short post and supplemental information about the dataset and the associated experiment: [https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1](https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1). 

The link for the file share in the formal citation at the start of this cell has both the data in CSV format and a key for understanding the CSV: [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating](http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating)

Dataset that I originally found and that led me to the source at the Columbia site: [https://www.openml.org/search?type=data&sort=runs&status=active&id=40536](https://www.openml.org/search?type=data&sort=runs&status=active&id=40536)

## Data and Research Paper Summary 

This dataset comes from a speed dating experiment run by Raymond Fisman and Sheena Iyengar of the Columbia Business School. I found an open copy of the data on a Columbia University file share maintained by a well-known statistics professor at Columbia.

The researchers conducted a series of separate speed date events where they rotated participants through four-minute speed dates and asked participants to rate their partners after each date. The ratings cover six attributes:

- Attractive
- Sincere
- Intelligent
- Fun
- Ambitious
- Shared interests

The researchers ended up using attractiveness, intelligence, and ambition while dropping the remaining three attributes. Participants were also asked to fill out pre-event surveys, which provide non-event data that is used to determine degrees of similarity between participants.

Part of the large feature space in the CSV is due to the combinations of attribute ratings across different partners, averages about individuals based on ratings by all of their partners, and other ways of combining the event survey and pre-event survey results.

The researchers focus on differential gender preferences about these attributes, trying to glean ways in which women and men value different characteristics during these dates. 

It is important to note a couple of potential issues with the experiment design. The participants were students in Columbia University graduate and professional programs, and the dates are heteronormative. These both restrict the generalizability of the findings. I am also curious about the justification of using speed dating as a stand-in for all dating since speed dating seems to me to be a very different experience from dating over a longer time scale.

The data captures how demographics and personal characteristics affect participants' feelings about their speed dates. One target variable is the `match` column, which indicates whether both participants in a speed date want to meet again. The experimental design also allows for decisions more granular than match or no match, and the associated paper focuses on the `dec` column, which records each individual's yes or no decision about a date rather than matches that require a yes on both sides. The researchers use the `dec` values to determine what women and men value in partners, homing in on differences in what each prioritizes.

The paper also draws on several social theories and begins to explore what its findings might mean for men and women more broadly, including challenging some of its own findings.

## Project

The choice of marriage partner is a perennial question. The more successful relationships out there, the better. On a personal note, I am looking at starting dating again myself, so I am curious about research that can help me to understand my own preferences and how those may fit in with dates that I go on. 

Because of the size of the feature space, layers of complexity, social theory, and prior knowledge built into the approach in the paper, I am going to attempt something simpler: can I learn what factors on a date will influence me to lean towards wanting to meet for a second date? 

This is a classification problem that will use different features to predict whether I will make a yes or no decision at the end of the date. This is useful for helping me make sense of the general signals that people tend to pick up on when making this decision. I can then reflect on whether I want to settle for that more automatic decision or whether I might be missing something that could affect it.

This also does not tell me whether the other person will choose yes. A next step of this analysis would be to look at features leading to matches, but I would rather work up to that in steps. Also, the paper mentions a lot of complexity and social theory that I would like to read about before pivoting to the more targeted match decision.


## Fixing Bad Characters in Source CSV

The source CSV from the link above has some characters that were replaced with the Unicode replacement character (U+FFFD), the diamond with a question mark in it. You can search for it with a regex using the code point `\uFFFD` or its UTF-8 byte sequence `\xEF\xBF\xBD`. These replacement characters appear to be the only non-ASCII characters in the file; you can search for any non-ASCII character with the character set `[^\x00-\x7F]`. A visual scan of the affected rows suggests the characters were meant to be 'é'.

Sample bad value in the `undergra` field from the source CSV: Ecole Normale Sup�rieure, Paris
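
As a rough illustration of the regex search described above, the following sketch scans a few lines of text for non-ASCII characters and reports which ones it finds. The sample lines and the `find_non_ascii` helper are made up for this example; only the bad `undergra` value comes from the source CSV.

```python
import re

# Hypothetical sample lines standing in for rows of the source CSV.
sample_lines = [
    "1,Columbia University",
    "2,Ecole Normale Sup\ufffdrieure, Paris",
    "3,Harvard",
]

# Same character set as in the text: anything outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(lines):
    """Return (line_number, characters_found) for each line with non-ASCII text."""
    hits = []
    for number, line in enumerate(lines, start=1):
        found = set(NON_ASCII.findall(line))
        if found:
            hits.append((number, found))
    return hits


hits = find_non_ascii(sample_lines)
for number, found in hits:
    # Print code points so invisible or odd characters are easy to identify.
    print(number, [f"U+{ord(ch):04X}" for ch in found])
```

If every hit reports only `U+FFFD`, that supports the observation that the replacement character is the only non-ASCII character present.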


The following code block calls out to the shell to create a copy of the speed dating CSV and then run `sed` to replace the replacement characters with regular 'e's to keep everything in ASCII. I decided to use shell commands because I had trouble finding a way to read the bad characters in with Python, so I could not get to the step of replacing them in Python. I was able to manually change them with find and replace in Vim, so I modified that for `sed` for the solution below.

I ran this on macOS. You may need to make some tweaks for other platforms, Windows in particular.

In [2]:
! cp ./data/Speed\ Dating\ Data.csv ./data/speed_dating_data_fixed.csv
! sed -i '' 's/\xEF\xBF\xBD/e/g' ./data/speed_dating_data_fixed.csv  
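
For anyone who cannot run `sed`, the same replacement might be done in Python along the following lines. This is only a sketch that I have not run against the real CSV: it assumes the file is otherwise UTF-8, the `replace_bad_chars` helper is a made-up name, and the demonstration writes a throwaway file rather than touching `./data`.

```python
from pathlib import Path
import tempfile


def replace_bad_chars(src: Path, dst: Path, replacement: str = "e") -> int:
    """Copy src to dst, swapping every U+FFFD for `replacement`.

    Returns the number of characters replaced.
    """
    # errors="replace" also turns any undecodable bytes into U+FFFD, so both
    # literal replacement characters and invalid bytes are handled the same way.
    text = src.read_bytes().decode("utf-8", errors="replace")
    count = text.count("\ufffd")
    dst.write_text(text.replace("\ufffd", replacement), encoding="utf-8")
    return count


# Demonstration on a temporary file containing the UTF-8 bytes for U+FFFD.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw.csv"
    dst = Path(tmp) / "fixed.csv"
    src.write_bytes(b"undergra\nEcole Normale Sup\xef\xbf\xbdrieure, Paris\n")
    n_replaced = replace_bad_chars(src, dst)
    fixed_text = dst.read_text(encoding="utf-8")

print(n_replaced, fixed_text)
```

Reading the file as bytes and decoding explicitly avoids a decode error stopping the read before the replacement step.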

## Loading and Initial Exploration

The research paper goes into detail about what features and combinations of features are in the dataset and primarily looks at differential gender effects between male and female participants.

The main target feature is `dec` (decision), which takes the values 1 and 0 for yes and no to indicate whether that individual wants to see the other person again. A secondary target feature could be `match`, which indicates whether both participants want to see each other again. The paper focuses on differences between genders for `dec`.
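
To make the relationship between the two targets concrete, here is a small sketch on invented data. The `dec_o` column name (the partner's decision) is an assumption for illustration, and the values are not taken from the dataset.

```python
import pandas as pd

# Invented decisions for six dates: `dec` is the participant's decision and
# `dec_o` (assumed name) is the partner's decision.
toy = pd.DataFrame({
    "dec":   [1, 0, 1, 1, 0, 0],
    "dec_o": [1, 1, 0, 1, 0, 1],
})

# A match requires a yes from both sides.
toy["match"] = (toy["dec"] & toy["dec_o"]).astype(int)

# Class balance of each candidate target.
dec_rate = toy["dec"].mean()
match_rate = toy["match"].mean()
print(f"yes rate: {dec_rate:.2f}, match rate: {match_rate:.2f}")
```

Because a match needs two yes decisions, `match` can never be more common than `dec`, so it will usually be the more imbalanced target.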



In [16]:
# The low_memory=False param tells pandas to determine column data types by looking at all rows
# in each column, resulting in needing to read in the entire CSV before being able to determine
# data types. low_memory=True results in chunking when reading in the file, and each chunk can
# infer a different data type. I will likely pass in a stricter data type spec later on to keep
# chunking but to get explicit data types

df_raw = pd.read_csv('./data/speed_dating_data_fixed.csv', low_memory=False)
df_raw.head(10)

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1.0,1.0,0.0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1.0,1.0,0.0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1.0,1.0,0.0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1.0,1.0,0.0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1.0,1.0,0.0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1.0,1.0,0.0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1.0,1.0,0.0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1.0,1.0,0.0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1.0,1.0,0.0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1.0,1.0,0.0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [15]:
print(f'The dataset has {df_raw.shape[0]:,} rows and {df_raw.shape[1]} columns')
df_raw.dtypes

The dataset has 8,379 rows and 195 columns


iid         float64
id          float64
gender      float64
idg           int64
condtn        int64
             ...   
attr5_3     float64
sinc5_3     float64
intel5_3    float64
fun5_3      float64
amb5_3      float64
Length: 195, dtype: object
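
The output above shows identifier-like columns such as `iid` and `gender` inferred as `float64`, which is what pandas typically does when an integer column contains missing values. One possible shape for the stricter data type spec mentioned in the loading cell is sketched below, using pandas' nullable `Int64` type on an invented three-row extract. The column choices here are assumptions, and whether this parses cleanly against the real file depends on how the values are written in the CSV.

```python
import io

import pandas as pd

# Invented extract; the blank `iid` in the second row mimics a missing value.
csv_text = "iid,gender,wave,dec\n1,0,1,1\n,1,1,0\n3,1,2,1\n"

# Nullable integer type keeps whole numbers as integers even with gaps.
dtype_spec = {"iid": "Int64", "gender": "Int64", "wave": "Int64", "dec": "Int64"}

df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtype_spec)
print(df_typed.dtypes)
print(df_typed)
```

With explicit types, pandas no longer needs to infer them, so `low_memory=False` would not be needed for the listed columns.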

## Pre-Cleaning Exploratory Data Analysis (EDA)


## Preprocessing

The first goal is to reduce the feature space, since the CSV has a large number of features. The paper helps explain where all these columns come from and how they contribute to the study. The feature reduction will take a few different approaches. The first is to perform some exploratory data analysis (EDA) to see whether I can manually remove features based on the EDA findings combined with domain knowledge gleaned from the paper. Another is to limit features based on domain knowledge and then test simplification methods such as principal component analysis (PCA) and ridge or lasso regression.
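
A minimal sketch of the two automated approaches, run on synthetic data from scikit-learn rather than the speed dating CSV. Since the target here is a binary decision, the lasso idea is illustrated with L1-penalized logistic regression, which is a swap on my part and not something the paper prescribes. The sample sizes, the 90% variance threshold, and `C=0.05` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 40 features, only a handful of which carry signal.
X, y = make_classification(
    n_samples=500, n_features=40, n_informative=5, n_redundant=10, random_state=0
)

# Both methods are scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# PCA: keep as many components as needed to explain 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)
print(f"PCA kept {X_pca.shape[1]} of {X.shape[1]} dimensions")

# L1 penalty: drives some coefficients to exactly zero, acting as feature selection.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_clf.fit(X_scaled, y)
kept = np.flatnonzero(l1_clf.coef_[0])
print(f"L1 logistic regression kept {kept.size} of {X.shape[1]} features")
```

One difference worth keeping in mind: PCA produces new combined features that are harder to interpret, while the L1 approach keeps a subset of the original columns. A ridge (L2) penalty shrinks coefficients but does not zero them out, so on its own it does not remove features.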

After that, I will need to determine how to handle missing values, as well as when and how to substitute level values for factor features.
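
As one possible starting point, the sketch below maps a coded factor to readable levels and fills a missing rating with the column median. The data is invented, the `attr` column and the `gender` coding (0 for female, 1 for male) are assumptions that should be checked against the data key, and median imputation is just one option among several.

```python
import numpy as np
import pandas as pd

# Invented rows; `attr` stands in for a rating column with a gap.
toy = pd.DataFrame({
    "gender": [0, 1, 1, 0],
    "attr": [6.0, np.nan, 8.0, 7.0],
})

# Substitute level names for the numeric codes (assumed coding; verify with the key).
gender_levels = {0: "female", 1: "male"}
toy["gender"] = toy["gender"].map(gender_levels).astype("category")

# Record where values were missing before filling, so the information is not lost.
toy["attr_missing"] = toy["attr"].isna()
toy["attr"] = toy["attr"].fillna(toy["attr"].median())

print(toy)
```

Keeping an indicator column such as `attr_missing` lets a later model learn whether missingness itself is informative.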


## EDA

## Model Identification

I also want to test out a few different classification models. The obvious model for a binary classification problem is logistic regression. I also want to test out support vector machines (SVM), especially for a larger feature space. Finally, I would like to test out a tree-based model, but the specifics for this one will depend on what I find from some of the feature reduction.
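
A rough sketch of how the three model families could be compared with cross-validation, again on synthetic data rather than the speed dating features. The specific settings (RBF kernel, tree depth of 4, 5 folds, accuracy as the metric) are placeholder assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the `dec` target.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Logistic regression and SVM are scale-sensitive, so they get a scaler;
# the tree does not need one.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside each pipeline means it is refit on every training fold, which avoids leaking information from the held-out fold.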


## Model Building

## Model Training

## Results

## Conclusion