## To-Do list

I am using JetBrains' DataSpell IDE for this project. It is essentially a JetBrains IDE experience wrapped around Jupyter notebooks. JetBrains IDEs normally have built-in support for to-dos, but I cannot get it working in DataSpell, so I will use a separate markdown cell here to track outstanding items.

### Project Cell

- Finish story around why this dataset/study/analysis is important, both generally and personally
- Call out issues with the design and what improvements could be made to it. For example, the only matches they look at are for opposite-sex individuals.
- Talk about methods used for EDA and analysis. For example, maybe look at some unsupervised learning such as PCA to trim the feature space.
- Determine a couple of questions that make sense after reading the paper and test out some models and statistical questions based on those
- Make sure to talk about some of the assumptions of the paper. For example, the paper mentions a bunch of social theories related to dating. What happens if we do not follow some of those theories?


### Notes from Published Paper

- Researchers looked at yes/no decisions separate from matches
- Four-minute speed dates
- Participants were students in graduate and professional schools at Columbia University (possible selection bias)
- Score card
    - Yes/No (main variable of interest)
        - Decision$_{ij}$ is the decision of subject *i* about person *j*
    - Six attributes to rate the other person on
        - Attractive
        - Sincere
        - Intelligent
        - Fun
        - Ambitious
        - Shared interests
- Women stayed seated and men rotated
- Study primarily looks at differential gender effects
    - Male$_i$ indicator variable
- Study only looks at attractiveness, intelligence, and ambition. The omitted characteristics had similar weights. 
    - Rating$_{ijc}$ is subject *i*'s rating on a ten-point scale about *j* on characteristic *c* $\in$ {Attractiveness, Intelligence, Ambition}
    - Observations that have missing values for one of these three characteristics are omitted from the regression
    - $\overline{Rating}_{-ijc}$ is the average rating of person *j* on characteristic *c* across everyone who rated *j*; the $-i$ means that *i*'s own rating is excluded from this average
- There are a few log measures for SAT at undergrad institution, median income in zip code, and population density in zip code
- Self$_{ic}$ is subject *i's* rating for themselves on characteristic *c*
    - Others$_{ic}$ is how others who rated *i* rated them on *c*
- Each participant did a pre-event survey where the researchers gathered some of the non-event data about subjects
- Pre-event survey is used to determine ahead of time a few measures SameABC$_{ij}$ to say if the two people had the same ABC attribute
- Yeses$_i$ and other related measures relate to how many people *i* said yes to
- Table 2a shows descriptive statistics. Maybe include a section in the project with summary descriptive statistics to orient ourselves to the data. Also look at Table 2b.
- Page 685 starts to have some conditions and assumptions about men fearing rejection leading to lower desire for more intelligent or more ambitious women. Also asks why there is a difference in fear of rejection between genders. At least the paper is starting to question some of these things.
- Some survey results are subjective (ratings) and others are objective (SAT, zip code, etc)
    - Objective data is much more sparse if you try to include all three features
- In the pre-event survey, participants rated their interest in seventeen activities, and the paper has SharedInterests features built from those ratings to compare participants
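
To make the $\overline{Rating}_{-ijc}$ note above concrete for myself, here is a minimal sketch of a leave-one-out average in pandas. The column names `rater`, `ratee`, and `attr` are hypothetical stand-ins that I made up for a toy frame; they are not the names used in the paper or the CSV, and this is my reading of the definition rather than the authors' code.

```python
import pandas as pd

# Toy ratings: each row is one rater's attractiveness score for one ratee.
# Column names are hypothetical and chosen only for this illustration.
ratings = pd.DataFrame({
    "rater": [1, 2, 3, 1, 2],
    "ratee": [10, 10, 10, 11, 11],
    "attr": [6.0, 8.0, 7.0, 5.0, 9.0],
})


def leave_one_out_mean(df, group_col, value_col):
    """Mean of value_col within each group, excluding the current row."""
    grouped = df.groupby(group_col)[value_col]
    total = grouped.transform("sum")      # sum of all ratings for this ratee
    count = grouped.transform("count")    # number of non-missing ratings
    # Remove the row's own rating from both the sum and the count.
    # A ratee with a single rating ends up with NaN (0 / 0).
    return (total - df[value_col]) / (count - 1)


ratings["attr_others_mean"] = leave_one_out_mean(ratings, "ratee", "attr")
print(ratings)
```

For ratee 10 the three ratings are 6, 8, and 7, so rater 1's leave-one-out average is (8 + 7) / 2 = 7.5.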

### Preprocessing Cell (not there yet)

- Look at categorical features and decide what types of mappings need to happen (numerical to string categories).
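
A rough sketch of what one of these mappings could look like, using `Series.map` on a toy frame. The `gender` column name comes from the CSV header, but the label mapping (0 to "female", 1 to "male") is an assumption on my part that needs to be checked against the key file before I rely on it.

```python
import pandas as pd

# Toy stand-in for the real data; only the column name comes from the CSV.
toy = pd.DataFrame({"gender": [0, 1, 1, 0]})

# ASSUMPTION: this code-to-label mapping must be verified against the data key.
gender_labels = {0: "female", 1: "male"}

# Map the numeric codes to strings and store them as a categorical column.
# Codes missing from the dictionary would become NaN, which makes gaps easy to spot.
toy["gender_label"] = toy["gender"].map(gender_labels).astype("category")
print(toy)
```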

In [1]:
import pandas as pd

## Data

Fisman, R. J., Iyengar, S. S., Kamenica, E. & Simonson, I. (2006). *Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment*. [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating]. The Quarterly Journal of Economics. https://academiccommons.columbia.edu/doi/10.7916/D8FB585Z

Short post and supplemental information about the dataset and the associated experiment: [https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1](https://statmodeling.stat.columbia.edu/2008/01/21/the_speeddating_1). 

The link for the file share in the formal citation at the start of this cell has both the data in CSV format and a key for understanding the CSV: [http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating](http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating)

Data source that I originally found and that led me to the less processed and closer-to-raw source at the Columbia site: [https://www.openml.org/search?type=data&sort=runs&status=active&id=40536](https://www.openml.org/search?type=data&sort=runs&status=active&id=40536)

This dataset comes from an experiment by Ray Fisman and Sheena Iyengar of the Columbia Business School and contains data about participants in a speed dating experiment. The most salient target is the `match` column, but the experimental design allows for observing decisions that are more granular than match or no match. The associated paper focuses on yes/no decisions by individuals rather than on matches, which require a yes decision on both sides.
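
To illustrate the difference between an individual decision and a match, here is a small sketch on toy data. The column names `dec` (the subject's decision) and `dec_o` (the partner's decision) are assumptions for this example and should be checked against the data key; the point is only that a match is the logical AND of the two yes/no decisions.

```python
import pandas as pd

# Toy decisions for four dates; column names are assumed for illustration.
dates = pd.DataFrame({
    "dec": [1, 1, 0, 0],    # subject said yes (1) or no (0)
    "dec_o": [1, 0, 1, 0],  # partner said yes (1) or no (0)
})

# A match requires a yes on both sides.
both_yes = (dates["dec"] == 1) & (dates["dec_o"] == 1)
dates["match_from_decisions"] = both_yes.astype(int)
print(dates)
```

Only the first row is a match, even though three of the four rows contain at least one yes decision, which is why the individual decisions carry more information than `match` alone.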

## Project

Determining which shared features between date participants lead to a successful match seems like a perennial question. The more successful relationships out there, the better. On a personal note, I am looking at starting dating again myself, so I am curious about research that can help me to understand the different preferences that I or my date hold and how those may affect a second date.

The speed dating setting and the demographics of the study's participants may not generalize well enough to the larger population, but the findings provide a starting point that is backed by empirical evidence.

The paper that reports on this data is really useful for determining ways to trim the number of features down.

## Fixing Bad Characters in Source CSV

The source CSV from the link above has some characters that were replaced with the Unicode replacement character (U+FFFD), the diamond with a question mark in it. You can search for it with a regex, using either the code point `\uFFFD` or its UTF-8 bytes `\xEF\xBF\xBD`. These replacement characters look like the only non-ASCII characters in the file, which you can check with the character class `[^\x00-\x7F]`. From a visual scan of the bad rows, the characters look like they were meant to be 'é'.

Sample bad value in the `undergra` field from the source CSV: Ecole Normale Sup�rieure, Paris
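
As a small illustration of the character class search described above, the sketch below scans a few lines of text and reports every non-ASCII character it finds. The helper name `find_non_ascii` and the sample lines are made up for this example; only the bad `undergra` value is taken from the source CSV.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(lines):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for match in NON_ASCII.finditer(line):
            hits.append((line_no, match.start(), match.group()))
    return hits


sample_lines = [
    "undergra",
    "Ecole Normale Sup\ufffdrieure, Paris",  # \ufffd is the replacement character
    "Columbia University",
]
hits = find_non_ascii(sample_lines)
print(hits)

# True if every non-ASCII character found is the replacement character.
only_replacement_chars = all(char == "\ufffd" for _, _, char in hits)
print(only_replacement_chars)
```

Running the same function over a file handle opened with `encoding="utf-8"` should work the same way, assuming the file decodes cleanly as UTF-8.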


The following code block calls out to the shell to create a copy of the speed dating CSV and then run `sed` to replace the replacement characters with regular 'e's to keep everything in ASCII. I decided to use shell commands because I had trouble finding a way to read the bad characters in with Python, so I could not get to the step of replacing them in Python. I was able to manually change them with find and replace in Vim, so I modified that for `sed` for the solution below.

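I ran this on macOS, where the BSD `sed` needs the empty `''` argument after `-i`. GNU `sed` on Linux takes `-i` without that separate argument, and Windows in particular may need further tweaks.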

In [2]:
! cp ./data/Speed\ Dating\ Data.csv ./data/speed_dating_data_fixed.csv
! sed -i '' 's/\xEF\xBF\xBD/e/g' ./data/speed_dating_data_fixed.csv  
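
For anyone who would rather stay in Python, one possible alternative is to skip text decoding entirely and work on the raw bytes. This is only a sketch that I have tried on a toy file, not on the real CSV. It assumes the replacement character is stored as the UTF-8 byte sequence `\xEF\xBF\xBD`, the same bytes the `sed` command targets. The function name `replace_bad_bytes` and the temporary file names are hypothetical.

```python
import tempfile
from pathlib import Path

# UTF-8 encoding of U+FFFD, i.e. b"\xef\xbf\xbd".
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def replace_bad_bytes(src, dst, substitute=b"e"):
    """Copy src to dst, swapping each replacement character for substitute.

    Returns the number of replacement characters found in src.
    """
    raw = Path(src).read_bytes()  # no decoding, so no decode errors
    count = raw.count(REPLACEMENT_UTF8)
    Path(dst).write_bytes(raw.replace(REPLACEMENT_UTF8, substitute))
    return count


# Demonstration on a throwaway file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "toy.csv"
    dst = Path(tmp) / "toy_fixed.csv"
    src.write_bytes(b"undergra\nEcole Normale Sup\xef\xbf\xbdrieure, Paris\n")
    n_replaced = replace_bad_bytes(src, dst)
    fixed_text = dst.read_bytes().decode("ascii")

print(n_replaced)
print(fixed_text)
```

Because the output decodes as pure ASCII in the toy case, the same check could double as a sanity test on the real file after the fix.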

## Loading


In [3]:
# The low_memory=False param tells pandas to determine column data types by looking at all rows
# in each column, resulting in needing to read in the entire CSV before being able to determine
# data types. low_memory=True results in chunking when reading in the file, and each chunk can
# infer a different data type. I will likely pass in a stricter data type spec later on to keep
# chunking but to get explicit data types

df_raw = pd.read_csv('./data/speed_dating_data_fixed.csv', low_memory=False)
df_raw.head(10)

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1.0,1.0,0.0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1.0,1.0,0.0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1.0,1.0,0.0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1.0,1.0,0.0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1.0,1.0,0.0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1.0,1.0,0.0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1.0,1.0,0.0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1.0,1.0,0.0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1.0,1.0,0.0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1.0,1.0,0.0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,
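
The loading comment mentions passing a stricter data type spec later on. A minimal sketch of what that could look like is below, run against a tiny in-memory CSV so it does not depend on the real file. The column names come from the dataset, but the toy values and the dtype choices are assumptions that I have not yet validated against the full data.

```python
import io

import pandas as pd

# Tiny in-memory CSV with made-up values, including missing entries.
csv_text = (
    "iid,gender,positin1,undergra\n"
    "1,0,,Ecole Normale Superieure\n"
    "2,1,3,\n"
)

# ASSUMED dtypes for illustration. The nullable "Int64" and "string" dtypes
# can hold missing values, unlike the plain NumPy integer dtypes.
dtype_spec = {
    "iid": "int64",
    "gender": "int8",
    "positin1": "Int64",
    "undergra": "string",
}

# With explicit dtypes, pandas does not need to infer types per chunk.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype=dtype_spec)
print(df_typed.dtypes)
```

Integer columns that contain missing values are the main thing to watch for: a plain `int64` spec raises an error on them, which is why `positin1` uses the nullable `Int64` here.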


## Pre-Cleaning Exploratory Data Analysis (EDA)
