# Practicum USA - Code Pudding 2 - Team Coding Challenge
## 1. Team "Three for Three" - Nare Chitturi, Ian Dizney, Josh Greenberg

Jupyter Notebook analysis by Josh Greenberg

### Introduction
Given the choice between three data sets - Tinder data, OKCupid profile data, and speed dating data - we opted to investigate the speed dating data.

The data was collected from surveys and scorecards filled out by speed dating participants. It covers who they are, which traits they value in a partner, what they like to do for fun, how they rate themselves and their dates, and how they experienced the event itself.

In this notebook, I will load, inspect, and clean the data where needed. I will create additional columns to aid in analysis, and I will output results to be used to create data visualizations on the web page.

## 2. Load and inspect data

In [13]:
# load libraries
import pandas as pd
import plotly.express as px
import plotly.offline as pyo
import re
import numpy as np

In [14]:
# load data
directory = "C:\\Users\\joshg\\OneDrive\\Desktop\\code_pudding\\code_pudding_v2\\code_pudding2"
df = pd.read_csv(directory + "\\speed_dating_data.csv", encoding='latin1')

In [15]:
# viewing a sample of the data
print(df.info())
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Columns: 195 entries, iid to amb5_3
dtypes: float64(174), int64(13), object(8)
memory usage: 12.5+ MB
None


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
1580,107,14.0,1,28,2,4,18,4,,11,...,,,,,,,,,,
5341,358,18.0,0,35,2,14,18,18,18.0,10,...,,,,,,,,,,
8274,548,18.0,1,36,2,21,22,10,9.0,2,...,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,9.0,7.0
5000,338,7.0,1,14,1,13,10,2,2.0,4,...,,,,,,,,,,
4245,284,12.0,1,24,2,11,21,13,13.0,18,...,7.0,7.0,7.0,7.0,6.0,7.0,6.0,8.0,7.0,7.0
4566,302,10.0,0,18,2,12,14,13,13.0,9,...,8.0,9.0,9.0,10.0,10.0,8.0,8.0,9.0,8.0,9.0
4585,303,11.0,0,20,2,12,14,2,2.0,11,...,8.0,10.0,8.0,7.0,7.0,8.0,9.0,9.0,7.0,7.0
152,16,6.0,1,12,1,1,10,9,,8,...,6.0,6.0,8.0,8.0,8.0,,,,,
147,15,5.0,1,10,1,1,10,10,,10,...,7.0,7.0,9.0,7.0,9.0,,,,,
7327,496,1.0,0,1,1,20,7,2,2.0,2,...,,,,,,,,,,


In [80]:
# get rid of a bad entry
df = df.dropna(subset='pid') # there was an iid/pid for which no data was returned, disrupting counts

## 3. Feature Engineering

In [17]:
# number of dates per attendee
dates_count = df.groupby('iid')['iid'].count()

def get_dates_count(obs):
    return dates_count[obs['iid']]

df['dates_count'] = df.apply(get_dates_count, axis=1)
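
The same per-attendee count can be computed without a row-wise `apply`. Here is a minimal sketch using `groupby(...).transform('count')`, which broadcasts each group's size back onto its rows. The `toy` frame is made up for illustration and only mimics the `iid`/`pid` layout of the real data.

```python
import pandas as pd

# toy stand-in for the speed dating frame: one row per (iid, pid) date
toy = pd.DataFrame({'iid': [1, 1, 1, 2, 2],
                    'pid': [10, 11, 12, 10, 11]})

# broadcast each attendee's number of rows (dates) back onto every row
toy['dates_count'] = toy.groupby('iid')['iid'].transform('count')
print(toy)
```

The same pattern with `transform('sum')` should also cover the `matches_given`, `matches_received`, and `mutual_count` columns built later in this section.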

In [79]:
# check to see if above matches 'round'
df['sanity'] = df['dates_count'] - df['round']
df[df['sanity'] != 0][['iid', 'wave', 'pid', 'round', 'dates_count', 'sanity']]

Unnamed: 0,iid,wave,pid,round,dates_count,sanity
1746,122,5,112.0,10,9,-1
1747,122,5,113.0,10,9,-1
1748,122,5,114.0,10,9,-1
1749,122,5,115.0,10,9,-1
1750,122,5,116.0,10,9,-1
...,...,...,...,...,...,...
6799,453,17,435.0,11,10,-1
6800,453,17,436.0,11,10,-1
6801,453,17,437.0,11,10,-1
6802,453,17,438.0,11,10,-1


There are two waves where 'round' does not match 'dates_count': Wave 5, where the errant 'pid' was removed above, and Wave 17, which looks to be a recording error. 'dates_count' will be used because it is verified against the data we have.

In [81]:
# count how many times each participant said they'd match with their partner
matches_given = df.groupby('iid')['dec'].sum()

def get_matches_given(obs):
    return matches_given[obs['iid']]

df['matches_given'] = df.apply(get_matches_given, axis=1)

In [82]:
# create a dictionary of which 'pid's each 'iid' said they'd match with
matches = {}

for iid in df['iid'].unique():
    matches[iid] = []

def get_matches(obs):
    if obs['dec'] == 1:
        matches[obs['iid']].append(obs['pid'])

df.apply(get_matches, axis=1)

0       None
1       None
2       None
3       None
4       None
        ... 
8373    None
8374    None
8375    None
8376    None
8377    None
Length: 8368, dtype: object

In [84]:
# generate a new 'dec_o' column from each respondent's own data - should match 'dec_o'
def get_dec_o2(obs):
    if obs['iid'] in matches[obs['pid']]:
        return 1
    else:
        return 0

df['dec_o2'] = df.apply(get_dec_o2, axis=1)
df[df['dec_o'] != df['dec_o2']]

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,mutual_rate,gender_name,goal_dc,career_name,satisfaction,length_dc,wave_size,numdat_dc,age_group,contacted
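
The partner's decision can also be looked up with a self-merge instead of a dictionary of lists. This is only a sketch on invented data: it assumes each (`iid`, `pid`) pair appears at most once, and the `toy` and `partner` names are mine, not part of the notebook.

```python
import pandas as pd

# toy data: each date appears twice, once from each participant's side
toy = pd.DataFrame({'iid': [1, 2, 1, 3],
                    'pid': [2, 1, 3, 1],
                    'dec': [1, 0, 0, 1]})

# the partner's row has iid and pid swapped; its 'dec' becomes our 'dec_o2'
partner = toy.rename(columns={'iid': 'pid', 'pid': 'iid', 'dec': 'dec_o2'})
toy = toy.merge(partner, on=['iid', 'pid'], how='left')

# a missing partner row is treated as "no", like the else branch above
toy['dec_o2'] = toy['dec_o2'].fillna(0).astype(int)
print(toy)
```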


In [85]:
# count the number of times each 'iid's date said they'd match
matches_received = df.groupby('iid')['dec_o2'].sum()

def get_matches_received(obs):
    return matches_received[obs['iid']]

df['matches_received'] = df.apply(get_matches_received, axis=1)

In [86]:
# count 'match' column for each 'iid', for number of times both 'iid' and 'pid' said they'd match
mutual_count = df.groupby('iid')['match'].sum()

def get_mutual_count(obs):
    return mutual_count[obs['iid']]
    
df['mutual_count'] = df.apply(get_mutual_count, axis=1)

In [87]:
# compute normalized rates: likes given and received per date, and mutual matches per like given
df['match_given_rate'] = df['matches_given'] / df['dates_count']
df['match_receive_rate'] = df['matches_received'] / df['dates_count']
df['mutual_rate'] = df['mutual_count'] / df['matches_given']

In [25]:
# get words for genders instead of numbers
# on review- I could have done this in a dictionary when using the data in section 5
def decode_genders(obs):
    if obs['gender'] == 1:
        return 'Male'
    else:
        return 'Female'
    
df['gender_name'] = df.apply(decode_genders, axis=1)

In [26]:
# get actual responses instead of codes for 'goal'
# on review- I could have referenced this dictionary when using the data in section 5 instead of adding a column

goals = {1.0: 'Seemed like a fun night out',
         2.0: 'To meet new people',
         3.0: 'To get a date',
         4.0: 'Looking for a serious relationship',
         5.0: 'To say I did it',
         6.0: 'Other'}

def decode_goal(obs):
    if obs['goal'] in list(goals.keys()):
        return goals[obs['goal']]
    
df['goal_dc'] = df.apply(decode_goal, axis=1)
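
As noted in the "on review" comment, a dictionary lookup does not need a helper function. A minimal sketch with `Series.map`, using a shortened copy of the `goals` dictionary and an invented `toy` frame:

```python
import pandas as pd

# shortened copy of the goals dictionary, for illustration only
goals = {1.0: 'Seemed like a fun night out',
         2.0: 'To meet new people',
         3.0: 'To get a date'}

toy = pd.DataFrame({'goal': [1.0, 3.0, None, 9.0]})

# codes missing from the dictionary (including NaN) come back as NaN
toy['goal_dc'] = toy['goal'].map(goals)
print(toy)
```

One difference from `decode_goal`: unmatched codes become `NaN` rather than `None`. Appending `.fillna('Other')` would give a default label where one is wanted.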

In [88]:
# get categorical responses instead of codes for 'career'
# on review- I could have referenced this dictionary when using the data in section 5 instead of adding a column
career_codes = {1.0: 'Lawyer',
                2.0: 'Academic/Research',
                3.0: 'Psychologist',
                4.0: 'Doctor/Medicine',
                5.0: 'Engineer',
                6.0: 'Creative Arts/Entertainment',
                7.0: 'Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin',
                8.0: 'Real Estate',
                9.0: 'International/Humanitarian Affairs',
                10.0: 'Undecided',
                11.0: 'Social Work',
                12.0: 'Speech Pathology',
                13.0: 'Politics',
                14.0: 'Pro sports/Athletics',
                15.0: 'Other',
                16.0: 'Journalism',
                17.0: 'Architecture'}

career_codes = pd.Series(career_codes)

def decode_careers(obs):
    if obs['career_c'] in career_codes.keys():
        return career_codes[obs['career_c']]
    else:
        return 'Other'

df['career_name'] = df.apply(decode_careers, axis=1)

In [28]:
# categorize responses instead of codes for 'satis'
# on review- I could have done this in dictionary when using the data in section 5 instead of adding a column
def demand_satisfaction(obs):
    satis = obs['satis_2']
    if satis <= 2:
        return 'Dissatisfied'
    elif 2 < satis <=4:
        return 'Somewhat Dissatisfied'
    elif 4 < satis <= 6:
        return 'Somewhat Satisfied'
    elif 6 < satis <= 8:
        return 'Satisfied'
    elif 8 < satis:
        return 'Very Satisfied'

df['satisfaction'] = df.apply(demand_satisfaction, axis=1)
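
The chain of comparisons in `demand_satisfaction` is a binning operation, so `pd.cut` could express it directly. A minimal sketch on invented scores, with bin edges chosen to mirror the `<=` comparisons above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'satis_2': [1, 2, 3, 6, 7, 9, np.nan]})

# right-closed bins: (-inf, 2], (2, 4], (4, 6], (6, 8], (8, inf)
bins = [-np.inf, 2, 4, 6, 8, np.inf]
labels = ['Dissatisfied', 'Somewhat Dissatisfied', 'Somewhat Satisfied',
          'Satisfied', 'Very Satisfied']

# missing scores stay missing, as they do with the function above
toy['satisfaction'] = pd.cut(toy['satis_2'], bins=bins, labels=labels)
print(toy)
```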

In [89]:
# many responses in this section were out of the bounds (1 for yes, 2 for no)
# all errant responses counted as not met before
def clean_met(obs):
    if obs['met'] == 1:
        return 'Familiar'
    else:
        return 'Strangers'
    
def clean_met_o(obs):
    if obs['met_o'] == 1:
        return 'Familiar'
    else:
        return 'Strangers'

df['met'] = df.apply(clean_met, axis=1)
df['met_o'] = df.apply(clean_met_o, axis=1)

In [30]:
# get categorical responses instead of codes for 'length'
# on review- I could have referenced this dictionary when using the data in section 5 instead of adding a column
length_codes = {1.0: 'Too short', 2.0: 'Too long', 3.0: 'Just right'}

def decode_length(obs):
    if obs['length'] in list(length_codes.keys()):
        return length_codes[obs['length']]
    
df['length_dc'] = df.apply(decode_length, axis=1)

In [91]:
# break wave sizes into groups to see if they had different opinions on number of dates
# on review- I could have done this in a dictionary when using the data in section 5 instead of adding a column

small_waves = [6,16,18,20]
medium_waves = [1,3,5,7,9,11,12,15]
large_waves = [2,4,6,8,10,13,14,16]

def get_wave_size(obs):
    if obs['wave'] in small_waves:
        return 'small'
    elif obs['wave'] in medium_waves:
        return 'medium'
    elif obs['wave'] in large_waves:
        return 'large'
    
df['wave_size'] = df.apply(get_wave_size, axis=1)
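
Hand-written group lists are easy to get out of sync, so a quick set check can help. The sketch below copies the three lists from the cell above and looks for waves that appear in two groups or in none. The wave range of 1 to 21 is my assumption about the dataset, not something verified in this notebook.

```python
small_waves = [6, 16, 18, 20]
medium_waves = [1, 3, 5, 7, 9, 11, 12, 15]
large_waves = [2, 4, 6, 8, 10, 13, 14, 16]

# waves listed in more than one size group
overlap = set(small_waves) & set(large_waves)

# waves that fall in no group; the 1..21 range is an assumption
all_waves = set(range(1, 22))
unassigned = all_waves - set(small_waves) - set(medium_waves) - set(large_waves)

print(sorted(overlap))      # [6, 16]
print(sorted(unassigned))   # [17, 19, 21]
```

Because `get_wave_size` checks `small_waves` first, waves 6 and 16 are labelled 'small', and any wave in no list gets `None`.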

In [32]:
# categorize responses instead of codes for 'numdat_2'
# on review- I could have done this in dictionary when using the data in section 5 instead of adding a column
numdat_codes = {1.0: 'Too few', 2.0: 'Too many', 3.0: 'Just right'}

def decode_numdat(obs):
    if obs['numdat_2'] in list(numdat_codes.keys()):
        return numdat_codes[obs['numdat_2']]
    
df['numdat_dc'] = df.apply(decode_numdat, axis=1)

In [33]:
# group ages into buckets
# on review- I could have done this in dictionary when using the data in section 5 instead of adding a column
def get_age_group(obs):
    if obs['age'] < 24:
        return '23 and under'
    elif 24 <= obs['age'] <= 28:
        return '24 to 28'
    elif 28 < obs['age']:
        return '29 and over'
    
df['age_group'] = df.apply(get_age_group, axis=1)

In [92]:
# get a true/false response for whether or not a call was made in either direction
# original dataset has number of calls for each column
def get_contacted(obs):
    if obs['you_call'] >= 1:
        return True
    elif obs['them_cal'] >= 1:
        return True
    else:
        return False
    
df['contacted'] = df.apply(get_contacted, axis=1)
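
The same flag can be built from two column-wise comparisons. A minimal sketch on invented call counts:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'you_call': [1, 0, np.nan, np.nan],
                    'them_cal': [0, 0, 2, np.nan]})

# comparisons with NaN are False, so missing call counts read as "no contact"
toy['contacted'] = (toy['you_call'] >= 1) | (toy['them_cal'] >= 1)
print(toy)
```

This matches `get_contacted` on missing values, since `NaN >= 1` is also `False` in the row-wise version.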

In [67]:
# some waves rated their trait preference on a 1-10 scale, instead of allocating from 100 pts
# this cell rescales those ratings so each participant's preferences sum to 100
pref_alloc = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1'] # traits to scale
waves_to_scale = [6, 7, 8, 9] # waves with 1-10 rating

to_scale = df['wave'].isin(waves_to_scale) # rows belonging to those waves
pref_sums = df.loc[to_scale, pref_alloc].sum(axis=1) # total points given in each row

# divide each rating by its row total and write the result back into df
df.loc[to_scale, pref_alloc] = df.loc[to_scale, pref_alloc].div(pref_sums, axis=0) * 100


In [95]:
# create a dataset where each 'iid' (participant) shows up only once
# loses individual date data, drops all duplicated response and aggregated data
df_uniq_id = df.drop_duplicates('iid')

### Looking for some correlations to investigate

In [37]:
career_counts = df_uniq_id[['career_name']].value_counts().reset_index()
career_counts.columns = ['career_name', 'count']

#career_counts

In [38]:
career_success = df_uniq_id.groupby('career_name')['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
career_success.columns = ['career_name', 'match_receive_rate']
#career_success[['career_name', 'match_receive_rate']]

In [39]:
age_count = df_uniq_id.groupby(['gender', 'age'])['match_receive_rate'].count()
#age_count

In [40]:
age_success = df_uniq_id.groupby('age')['match_receive_rate'].mean()
#age_success

In [41]:
scorecard_ratings = ['attr', 'sinc', 'intel', 'fun', 'amb']
others_ratings = ['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o']
self_rate_before = ['attr3_1', 'sinc3_1', 'intel3_1', 'fun3_1', 'amb3_1']
self_rate_day_after = ['attr3_2', 'sinc3_2', 'intel3_2', 'fun3_2', 'amb3_2']
self_rate_weeks_after = ['attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3']
# average rating each participant received from their dates, indexed by 'pid'
others_ratings_avg = df.groupby('pid')[scorecard_ratings].mean()


In [42]:
self_ratings = df[['iid','attr3_2', 'sinc3_2', 'intel3_2', 'fun3_2', 'amb3_2']]
self_ratings = self_ratings.drop_duplicates(subset='iid')

all_ratings = self_ratings.merge(others_ratings_avg, left_on='iid', right_on='pid')

all_ratings['diff_attr'] = all_ratings['attr'] - all_ratings['attr3_2']
all_ratings['diff_sinc'] = all_ratings['sinc'] - all_ratings['sinc3_2']
all_ratings['diff_intel'] = all_ratings['intel'] - all_ratings['intel3_2']
all_ratings['diff_fun'] = all_ratings['fun'] - all_ratings['fun3_2']
all_ratings['diff_amb'] = all_ratings['amb'] - all_ratings['amb3_2']

all_ratings.loc[:,'diff_attr':].describe()

Unnamed: 0,diff_attr,diff_sinc,diff_intel,diff_fun,diff_amb
count,485.0,485.0,485.0,485.0,485.0
mean,-0.901542,-0.716509,-0.82492,-1.137294,-0.643552
std,1.532698,1.624215,1.310515,1.518988,1.816832
min,-5.733333,-4.4375,-5.125,-5.888889,-4.6
25%,-1.777778,-1.904762,-1.666667,-2.166667,-2.0
50%,-0.888889,-0.875,-0.842105,-1.166667,-0.8
75%,0.0,0.266667,-0.055556,-0.222222,0.4
max,4.166667,5.666667,4.3,5.6,5.0


## 4. Visualizations

Visualizations in this section are mainly for exploring the data and getting a sense of outputs. Visualizations used on the website are generated from json data created in section 5.

In [43]:
# histogram of frequency of input column
inputs = ['age','career_name','goal']

for col in inputs:
    fig = px.histogram(df_uniq_id, x=col, color='gender', barmode='overlay')
    fig.show()
    #pyo.plot(fig, filename=col + '_histogram.html')

In [44]:
# pie chart of attendee responses per input
inputs = ['satisfaction', 'length', 'numdat_2']

for col in inputs:
    breakdown_by_input = df_uniq_id.groupby(col)['iid'].count().reset_index()
    fig = px.pie(breakdown_by_input, values='iid', names=col)
    fig.show()
    #pyo.plot(fig, filename=col + '_pie.html')

In [45]:
# bar chart of match receive rate by input
inputs = ['age', 'career_name', 'met_o', 'goal']

for col in inputs:
    match_rate_by_input = df_uniq_id.groupby(['gender', col])['match_receive_rate'].mean().reset_index()

    fig = px.bar(match_rate_by_input, x=col, y='match_receive_rate', color='gender', color_discrete_sequence=['#FF5733', '#33FF7A'], barmode='overlay')
    fig.show()
    #pyo.plot(fig, filename=col + '_match_bar.html')

In [46]:
corr_cols = ['shar_o'] + others_ratings
target = 'dec_o2'
corr1 = df[corr_cols].corrwith(df[target])

corr_cols2 = ['shar'] + scorecard_ratings
target2 = 'dec'
corr2 = df[corr_cols2].corrwith(df[target2])

print("Correlations between others' ratings of attendee's attributes and whether or not the other says 'match'\n", corr1)
print("Correlations between attendee's ratings of others' attributes and whether or not they say 'match'\n", corr2)


Correlations between others' ratings of attendee's attributes and whether or not the other says 'match'
 shar_o     0.400501
attr_o     0.486885
sinc_o     0.209811
intel_o    0.216704
fun_o      0.414276
amb_o      0.183216
dtype: float64
Correlations between attendee's ratings of others' attributes and whether or not they say 'match'
 shar     0.400501
attr     0.486881
sinc     0.209811
intel    0.216704
fun      0.414387
amb      0.183216
dtype: float64


## 5. Creating and Exporting Data for Website Display

In this section, I extract and tidy the data needed to answer each question we set out to investigate. Each result is written out as JSON for use in the website code.

In [47]:
# setting orientation preferred by SE
orientation = 'records'

### Organizers

#### How do people feel about 4 minutes as a length for each date?


In [48]:
# get value counts for each 'length_dc' response in df_uniq_id
date_length_survey = df_uniq_id['length_dc'].value_counts().reset_index()
date_length_survey = date_length_survey.rename(
    columns={'index': 'response', 'length_dc': 'amount'})
date_length_survey.to_json('./date_length_survey.json', orient=orientation)

date_length_survey


Unnamed: 0,response,amount
0,Too short,269
1,Just right,202
2,Too long,14
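
The column names that come out of `value_counts().reset_index()` depend on the pandas version (as far as I know, pandas 2.0 changed them), so renaming from `'index'` can silently do nothing. A minimal sketch that names both columns explicitly and should not depend on those defaults; the `toy` responses are invented:

```python
import pandas as pd

toy = pd.Series(['Too short', 'Just right', 'Too short', 'Too long'],
                name='length_dc')

# name the index and the counts explicitly instead of renaming old labels
survey = (toy.value_counts()
             .rename_axis('response')
             .reset_index(name='amount'))
print(survey)
```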


#### How many dates should each person go on?


In [72]:
# count values in numdat_2 and group by wave size
num_dates_survey_small_wave = df_uniq_id[df_uniq_id['wave_size'] == 'small']['numdat_dc'].value_counts().reset_index()
num_dates_survey_small_wave = num_dates_survey_small_wave.rename(
    columns={'index':'response', 'numdat_dc':'amount'})
num_dates_survey_small_wave.to_json('./num_dates_survey_small_wave.json', orient=orientation)

num_dates_survey_medium_wave = df_uniq_id[df_uniq_id['wave_size'] == 'medium']['numdat_dc'].value_counts().reset_index()
num_dates_survey_medium_wave = num_dates_survey_medium_wave.rename(
    columns={'index':'response', 'numdat_dc':'amount'})
num_dates_survey_medium_wave.to_json('./num_dates_survey_medium_wave.json', orient=orientation)

num_dates_survey_large_wave = df_uniq_id[df_uniq_id['wave_size'] == 'large']['numdat_dc'].value_counts().reset_index()
num_dates_survey_large_wave = num_dates_survey_large_wave.rename(
    columns={'index':'response', 'numdat_dc':'amount'})
num_dates_survey_large_wave.to_json('./num_dates_survey_large_wave.json', orient=orientation)

num_dates_survey = df_uniq_id['numdat_dc'].value_counts().reset_index()
num_dates_survey = num_dates_survey.rename(
    columns={'index':'response', 'numdat_dc':'amount'})
num_dates_survey.to_json('./num_dates_survey.json', orient=orientation)

print('small waves\n', num_dates_survey_small_wave.values)
print('medium waves\n', num_dates_survey_medium_wave.values)
print('large waves\n', num_dates_survey_large_wave.values)
print('all waves\n', num_dates_survey)
num_dates_survey.to_json(orient=orientation)

small waves
 [['Too few' 22]
 ['Just right' 13]
 ['Too many' 1]]
medium waves
 [['Too many' 94]
 ['Just right' 90]
 ['Too few' 20]]
large waves
 [['Too many' 72]
 ['Just right' 66]
 ['Too few' 16]]
all waves
      response  amount
0  Just right     213
1    Too many     206
2     Too few      63


'[{"response":"Just right","amount":213},{"response":"Too many","amount":206},{"response":"Too few","amount":63}]'
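
The three per-size blocks above repeat the same steps, so a loop over `groupby('wave_size')` could produce them in one pass. A minimal sketch on invented responses; it collects JSON strings in a dictionary (the `surveys` name is mine) rather than writing files:

```python
import pandas as pd

toy = pd.DataFrame({'wave_size': ['small', 'small', 'large', 'large', 'large'],
                    'numdat_dc': ['Too few', 'Too few', 'Too many',
                                  'Just right', 'Too many']})

surveys = {}
for size, group in toy.groupby('wave_size'):
    counts = (group['numdat_dc'].value_counts()
                                .rename_axis('response')
                                .reset_index(name='amount'))
    # same 'records' layout as the files above, kept as a string here
    surveys[size] = counts.to_json(orient='records')

print(surveys['small'])
```

Passing a path such as `'./num_dates_survey_' + size + '_wave.json'` to `to_json` would write the same file names used above.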

#### What professions get the most matches from their dates?

In [51]:
# get a sorted list of mean match_receive_rate by career
career_match_success = df_uniq_id.groupby('career_name')['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
career_match_success = career_match_success.rename(
    columns={'match_receive_rate':'like_receive_rate'})
career_match_success.to_json('./career_match_success.json', orient=orientation)

career_match_success

Unnamed: 0,career_name,like_receive_rate
0,International/Humanitarian Affairs,0.48083
1,Other,0.462108
2,Lawyer,0.459086
3,Banking/Consulting/Finance/Marketing/Business/...,0.444399
4,Journalism,0.431818
5,Doctor/Medicine,0.431586
6,Social Work,0.430159
7,Undecided,0.409943
8,Academic/Research,0.407258
9,Creative Arts/Entertainment,0.400911


#### Are participants more likely to receive a match if they've met before?

In [52]:
# get mean 'match_receive_rate' by response to 'met_o'
met_before_gender_match_success = df_uniq_id.groupby(['gender_name', 'met_o'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
met_before_gender_match_success = met_before_gender_match_success.rename(
    columns={'gender_name':'gender', 'met_o':'status', 'match_receive_rate':'like_receive_rate'})
met_before_gender_match_success.to_json('./met_before_gender_match_success.json', orient=orientation)

met_before_match_success = df_uniq_id.groupby(['met_o'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
met_before_match_success = met_before_match_success.rename(
    columns={'met_o':'status', 'match_receive_rate':'like_receive_rate'})
met_before_match_success.to_json('./met_before_match_success.json', orient=orientation)

print('with genders\n', met_before_gender_match_success)
print('altogether\n', met_before_match_success)


with genders
    gender     status  like_receive_rate
0  Female   Familiar           0.483754
1  Female  Strangers           0.478962
2    Male   Familiar           0.398432
3    Male  Strangers           0.372313
altogether
       status  like_receive_rate
0   Familiar           0.450234
1  Strangers           0.424720


#### What ages get the most matches?

In [53]:
# get a sorted list of mean match_receive_rate by age
age_gender_match_success = df_uniq_id.groupby(['gender_name', 'age'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
age_gender_match_success = age_gender_match_success.rename(
    columns={'gender_name':'gender', 'age':'age', 'match_receive_rate':'like_receive_rate'})
age_gender_match_success.to_json('./age_gender_match_success.json', orient=orientation)

age_match_success = df_uniq_id.groupby(['age'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
age_match_success = age_match_success.rename(
    columns={'age':'age', 'match_receive_rate':'like_receive_rate'})
age_match_success.to_json('./age_match_success.json', orient=orientation)

age_group_gender_match_success = df_uniq_id.groupby(['gender_name', 'age_group'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
age_group_gender_match_success = age_group_gender_match_success.rename(
    columns={'gender_name':'gender', 'match_receive_rate':'like_receive_rate'})
age_group_gender_match_success.to_json('./age_group_gender_match_success.json', orient=orientation)

age_group_match_success = df_uniq_id.groupby(['age_group'])['match_receive_rate'].mean().sort_values(ascending=False).reset_index()
age_group_match_success = age_group_match_success.rename(
    columns={'match_receive_rate':'like_receive_rate'})
age_group_match_success.to_json('./age_group_match_success.json', orient=orientation)

print('gendered, by age\n', age_gender_match_success)
print('altogether, by age\n', age_match_success)
print('gendered, by age group\n', age_group_gender_match_success)
print('altogether, by age group\n', age_group_match_success)

gendered, by age
     gender   age  like_receive_rate
0   Female  19.0           0.800000
1     Male  20.0           0.688889
2   Female  31.0           0.619907
3   Female  25.0           0.611240
4   Female  20.0           0.566667
5   Female  38.0           0.526316
6   Female  27.0           0.509001
7   Female  23.0           0.505795
8   Female  29.0           0.498640
9   Female  26.0           0.491021
10  Female  28.0           0.481589
11  Female  30.0           0.472454
12    Male  21.0           0.461435
13    Male  34.0           0.458333
14    Male  18.0           0.444444
15  Female  24.0           0.442078
16  Female  22.0           0.440437
17  Female  21.0           0.412266
18    Male  28.0           0.410830
19  Female  35.0           0.392460
20    Male  25.0           0.387204
21    Male  27.0           0.385411
22    Male  30.0           0.383656
23    Male  33.0           0.377778
24    Male  23.0           0.375183
25    Male  29.0           0.371514
26    Male

#### How satisfied were participants with the people they met?

In [54]:
# count values in satisfaction
satisfaction_survey = df_uniq_id['satisfaction'].value_counts().reset_index()
satisfaction_survey = satisfaction_survey.rename(
    columns={'index': 'response', 'satisfaction': 'amount'})
satisfaction_survey.to_json('./satisfaction_survey.json', orient=orientation)

satisfaction_survey

Unnamed: 0,response,amount
0,Somewhat Satisfied,208
1,Satisfied,144
2,Somewhat Dissatisfied,83
3,Dissatisfied,31
4,Very Satisfied,19


### Participants


#### What hobbies are most popular among participants?


In [55]:
# get mean hobby ratings and sort
hobby_ratings = df_uniq_id.loc[:,'sports':'yoga'].mean().sort_values(ascending=False).reset_index()
hobby_ratings = hobby_ratings.rename(
    columns={'index': 'hobby', 0: 'avg_rating'})
hobby_ratings.to_json('./hobby_ratings.json', orient=orientation)

hobby_ratings

Unnamed: 0,hobby,avg_rating
0,movies,7.898897
1,music,7.875
2,dining,7.775735
3,reading,7.647059
4,museums,6.972426
5,concerts,6.84375
6,theater,6.761029
7,art,6.689338
8,sports,6.395221
9,exercise,6.286765


#### What is the age of participants?

In [56]:
age_counts = df_uniq_id['age'].value_counts().reset_index()
age_counts = age_counts.rename(
    columns={'index': 'age', 'age': 'amount'})
age_counts.to_json('./age_counts.json', orient=orientation)


age_group_counts = df_uniq_id['age_group'].value_counts().reset_index()
age_group_counts = age_group_counts.rename(
    columns={'index': 'age_group', 'age_group': 'amount'})
age_group_counts.to_json('./age_group_counts.json', orient=orientation)

print('counts by age\n', age_counts)
print('counts by age group\n', age_group_counts)

counts by age
      age  amount
0   27.0      68
1   24.0      56
2   23.0      56
3   25.0      55
4   26.0      55
5   28.0      47
6   22.0      44
7   29.0      40
8   30.0      36
9   21.0      22
10  32.0      13
11  33.0      12
12  34.0      11
13  31.0       7
14  20.0       5
15  35.0       4
16  36.0       4
17  19.0       2
18  39.0       1
19  18.0       1
20  37.0       1
21  42.0       1
22  38.0       1
23  55.0       1
counts by age group
       age_group  amount
0      24 to 28     281
1   29 and over     132
2  23 and under     130


#### What are participants' goals?


In [57]:
# count values in goal_dc
goals_survey = df_uniq_id['goal_dc'].value_counts().reset_index()
goals_survey = goals_survey.rename(
    columns={'index': 'goal', 'goal_dc': 'amount'})
goals_survey.to_json('./goals_survey.json', orient=orientation)

goals_survey

Unnamed: 0,goal,amount
0,Seemed like a fun night out,228
1,To meet new people,189
2,To get a date,40
3,To say I did it,35
4,Other,30
5,Looking for a serious relationship,22


#### How likely are you to match? 

In [58]:
# get mean of matches logged, mutual matches, a "yes or no" to calls either way, and date
match_to_date_funnel = pd.Series({'Likes': df_uniq_id['matches_given'].mean(),
                                  'Matches': df_uniq_id['mutual_count'].mean(),
                                  'Contact made': df_uniq_id['contacted'].mean(),
                                  'Had a date': df_uniq_id['date_3'].fillna(0).mean()}).reset_index()

match_to_date_funnel = match_to_date_funnel.rename(
    columns={'index': 'stage', 0: 'avg_per_participant'})
match_to_date_funnel.to_json('./match_to_date_funnel.json', orient=orientation)

match_to_date_funnel

Unnamed: 0,stage,avg_per_participant
0,Likes,6.37931
1,Matches,2.504537
2,Contact made,0.3049
3,Had a date,0.170599


#### Does self-perception match dates' ratings?

In [59]:
traits = ['Attractiveness', 'Sincerity', 'Intelligence', 'Fun', 'Ambition']

traits_dict = {0: 'Female',
               1: 'Male'}

for i in range(0,len(others_ratings)):
    traits_dict[others_ratings[i]] = traits[i] + '_others'
    traits_dict[self_rate_before[i]] = traits[i] +'_self'

self_vs_others_ratings = df[list(others_ratings) + list(self_rate_before)].mean().reset_index()
self_vs_others_ratings = self_vs_others_ratings.rename(
    columns={'index': 'trait', 0: 'avg_rating'}).replace(traits_dict)
self_vs_others_ratings.to_json('./self_vs_others_ratings.json', orient=orientation)

self_vs_others_gender_ratings = df.groupby('gender')[list(others_ratings) + list(self_rate_before)].mean().reset_index()
self_vs_others_gender_ratings = self_vs_others_gender_ratings.rename(
    columns=traits_dict).replace(traits_dict)
self_vs_others_gender_ratings.to_json('./self_vs_others_gender_ratings.json', orient=orientation)

print('altogether\n', self_vs_others_ratings)
print('gendered\n', self_vs_others_gender_ratings)

altogether
                    trait  avg_rating
0  Attractiveness_others    6.190411
1       Sincerity_others    7.175256
2    Intelligence_others    7.369301
3             Fun_others    6.400599
4        Ambition_others    6.778409
5    Attractiveness_self    7.084352
6         Sincerity_self    8.295776
7      Intelligence_self    8.404454
8               Fun_self    7.704345
9          Ambition_self    7.577877
gendered
    gender  Attractiveness_others  Sincerity_others  Intelligence_others  \
0  Female               6.461401          7.251053             7.291202   
1    Male               5.919422          7.099778             7.447362   

   Fun_others  Ambition_others  Attractiveness_self  Sincerity_self  \
0    6.520164         6.604591             7.219092        8.458343   
1    6.280555         6.952773             6.950555        8.134346   

   Intelligence_self  Fun_self  Ambition_self  
0           8.320622  7.893612       7.632499  
1           8.487699  7.516401     

#### How important are these traits?



In [94]:
# get the mean preference score for each attribute
traits.append('Shared Interests')

trait_importance = df[pref_alloc].mean()
trait_importance = trait_importance.set_axis(
    ['Attractiveness', 'Sincerity', 'Intelligence', 'Fun', 'Ambition', 'Shared Interests']).reset_index()
trait_importance = trait_importance.rename(
    columns={'index': 'trait', 0: 'score'})
trait_importance.to_json('./trait_importance.json', orient=orientation)

trait_importance

Unnamed: 0,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1
0,15.0,20.0,20.0,15.0,15.0,15.0
1,15.0,20.0,20.0,15.0,15.0,15.0
2,15.0,20.0,20.0,15.0,15.0,15.0
3,15.0,20.0,20.0,15.0,15.0,15.0
4,15.0,20.0,20.0,15.0,15.0,15.0
...,...,...,...,...,...,...
8373,70.0,0.0,15.0,15.0,0.0,0.0
8374,70.0,0.0,15.0,15.0,0.0,0.0
8375,70.0,0.0,15.0,15.0,0.0,0.0
8376,70.0,0.0,15.0,15.0,0.0,0.0


#### Which traits correlate best with receiving a like?


In [61]:
# pair each participant's mean received trait ratings with their match_receive_rate
trait_match_correlation = others_ratings_avg.merge(df_uniq_id[['iid','match_receive_rate']], left_on='pid', right_on='iid').drop('iid', axis=1)

# export one (trait, match_receive_rate) file per trait for use on the website
for trait in list(others_ratings_avg.columns):
    trait_match_correlation[[trait, 'match_receive_rate']].to_json('./' + trait + '_correlation.json', orient=orientation)
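
The files above hold the raw (trait, rate) pairs. To rank the traits numerically, one option is a Pearson correlation per trait with `corrwith`. A minimal sketch on invented numbers; `toy` stands in for `trait_match_correlation` and its values are not taken from the dataset:

```python
import pandas as pd

# toy stand-in: mean ratings received per person plus their like-receive rate
toy = pd.DataFrame({'attr': [4.0, 6.0, 8.0, 9.0],
                    'sinc': [7.0, 7.5, 6.5, 7.0],
                    'match_receive_rate': [0.1, 0.4, 0.6, 0.9]})

# Pearson correlation of each trait column with the like-receive rate
trait_corr = (toy[['attr', 'sinc']]
              .corrwith(toy['match_receive_rate'])
              .sort_values(ascending=False))
print(trait_corr)
```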