# Data Filter
We apply the CRISP-DM framework for data analysis (as outlined at https://www.datascience-pm.com/crisp-dm-2/).
This notebook deals with the first three phases: Business Understanding, Data Understanding and Data Preparation. The later phases are covered in the Doc2Vec Jupyter Notebook. 

## Business Understanding

*Business understanding – What does the business need?*
The business, in this case, is OkCupid, and it might be facing a large number of men dropping out of the system because of high competition and limited responses. We cannot confirm this directly with OkCupid, but we do know that self-representation through online dating is a relatively new skill to gain in our species' long and checkered history. So why not provide some guidance along the way? 

The goal would be to provide data-driven guidance to male users of the service so that they stand out from the competition and get matched more often. This will increase the rating of the app, and lead to more sign-ups and revenue. 

Getting present-day data would be challenging. Many researchers have gained access to profiles and conversation data, but usually have the funding support and credentials of their universities to back them. Moreover, online dating data involves considerable privacy concerns. In such a situation, it would be best to acquire a low-cost data set that anonymizes data, but need not include all aspects of profiles or even be up-to-date. 

Success for us would involve first testing the extent of homogeneity in dating profiles, and then supporting users with UX features that offer tips for reducing that homogeneity and sounding memorable vis-a-vis other users. In technical terms, this means identifying the most common topics, language patterns and keywords in the text, and then providing guidance to prevent such repetition. It would also be useful to check whether these patterns vary across subgroups of users, as indicated by variables like height, weight/fitness level, race and education level. 

Given these objectives, we will use Python and R, depending on which offers the most suitable packages for our specific and evolving tasks.

## Data Understanding
    
- Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
- Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
- Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
- Verify data quality: How clean/dirty is the data? Document any quality issues.

### Step 1: Collect Initial Data

In [2]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm
tqdm.pandas()
# For Data Cleaning
from bs4 import BeautifulSoup
from split_utils import *
#from text_complexity_utils import get_npoly, get_flesch

In [10]:
from IPython.display import Image

In [4]:
#reading in raw data
df = pd.read_csv('../profiles.csv/profiles.csv')
df.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


### Step 2: Describe data

In [5]:
print(df.shape)

(59946, 31)


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       59946 non-null  int64  
 19  job 

In [7]:
df.describe()

Unnamed: 0,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0


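There are 10 essays and a number of descriptors. The only numerical variables are income, height and age. 
Income is -1 at the minimum and at every quartile, which suggests that -1 is a placeholder for unreported income. The minimum height of 1 inch and the maximum age of 110 also look like data-entry errors. 
We also have a large dataset with close to 60,000 entries (59,946). 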
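
The summary statistics above can be probed a little further. The following is a minimal sketch on a small made-up frame standing in for `df`; treating -1 as "income not reported" and the plausibility bounds for height and age are assumptions, not something documented with the dataset.

```python
import pandas as pd

# Toy stand-in for the profiles data; all values are made up for illustration.
toy = pd.DataFrame({
    "age": [22, 35, 110, 29],
    "height": [75.0, 70.0, 1.0, 66.0],
    "income": [-1, 80000, -1, -1],
})

# Assumption: -1 marks an unreported income rather than a real value.
share_income_missing = (toy["income"] == -1).mean()

# Assumption: rough plausibility bounds for adult users; adjust as needed.
implausible = toy[(toy["height"] < 36) | (toy["age"] > 100)]

print(f"income reported as -1: {share_income_missing:.0%}")
print(implausible)
```

Running the same two checks on the full `df` would show how much of the income column is usable at all.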

### Step 3: Explore Data

With mostly categorical data, the suite of available methods for exploratory data analysis is somewhat limited. Here, I am including a treemap plot generated in R.

![Treemap of OkCupid Dating Profiles Data](full_treemap.png)

We can also detect interesting relationships across cross-tabulations of categorical variables. Here again, I have included cross-tabulations from R plots for Ethnicity and Gender against the remaining variables. 

![Treemap of OkCupid Data by Gender](img/by_gender.png)

![Treemap of OkCupid Data by Ethnicity](ethnicity_treemap.png)

### Step 4: Verify Data Quality

We are concerned with quality along two dimensions. 
- Extent of missing values. 
- Reliability of text data (free of spelling errors, URLs and any unnecessary characters).

In [14]:
df.isna().mean()

age            0.000000
body_type      0.088346
diet           0.406950
drinks         0.049795
drugs          0.234878
education      0.110566
essay0         0.091549
essay1         0.126314
essay2         0.160778
essay3         0.191439
essay4         0.175775
essay5         0.180996
essay6         0.229723
essay7         0.207704
essay8         0.320705
essay9         0.210239
ethnicity      0.094752
height         0.000050
income         0.000000
job            0.136756
last_online    0.000000
location       0.000000
offspring      0.593217
orientation    0.000000
pets           0.332316
religion       0.337404
sex            0.000000
sign           0.184433
smokes         0.091949
speaks         0.000834
status         0.000000
dtype: float64
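
Both quality dimensions can be checked programmatically. The sketch below uses a made-up three-row frame in place of `df`; the regular expression for leftover markup is an assumption about what counts as residue (HTML tags, HTML entities and URL prefixes), not a rule taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the profiles data; values are made up for illustration.
toy = pd.DataFrame({
    "essay0": ["about me:<br />\ni love hiking", "see http://example.com", None],
    "diet": [None, "vegetarian", None],
    "age": [22, 35, 38],
})

# Dimension 1: share of missing values per column, worst first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Dimension 2: essays that still contain HTML tags, HTML entities or URL prefixes.
residue_pattern = r"<[^>]+>|&\w+;|https?://|www\."
has_residue = toy["essay0"].str.contains(residue_pattern, regex=True, na=False)

print(missing_share)
print(f"essays with markup or links: {has_residue.mean():.0%}")
```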

## Data Preparation
**Data Preparation – How do we organize the data for modeling?**

Fortunately, this is our sole dataset and does not seem to require any form of integration.

**Select Data: Determine which data sets will be used and document reasons for inclusion/exclusion.**
We would need data on OkCupid users, but without violating user privacy. We therefore leverage the anonymized and open dataset from 2012 (linked in README.md). With our focus on single male heterosexual users, we are able to filter this original dataset down from about 60,000 users (men and women) to about 29,000, and to about 20,500 once rows with missing values in the variables of interest are dropped. 
   
**Clean Data**   
The focus of our cleaning will be on the core dating profiles themselves, fixing common misspellings and conjoined words. 
We will also 'shrink' the large number of categories for variables such as fitness and ethnicity, to save on degrees of freedom. 
    
**Construct data**: 
One key aspect of this research exercise is language use. Measures of it are not present in the data to begin with. We will leverage spaCy for this purpose to generate readability metrics like the Flesch-Kincaid index. 
We will also create a variable to classify heights, since our focus on topic models fits better with categorical rather than continuous outcomes. 
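
For reference, the readability metrics mentioned above can be sketched in plain Python. This is not the spaCy-based implementation used in the project: the syllable counter is a rough vowel-group heuristic and the function names are hypothetical. The two formulas are the standard Flesch Reading Ease and Flesch-Kincaid Grade Level definitions.

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text):
    """Return (reading_ease, grade_level), or None if the text has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Flesch Reading Ease: higher means easier to read.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

print(flesch_scores("I work in a library and go to school. I read a lot."))
```

Short, simple sentences push the reading-ease score up and the grade level down, which is the behaviour one would want to compare across profile subgroups.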

**Integrate data:**
Fortunately, in this instance, this dataset itself is sufficient for addressing our research questions. The anonymized nature of the data (with no ID or identifying variable) would have made integration challenging. 

Out of all of these, our research question only concerns the dating profiles of straight, single males, so we filter accordingly.

In [15]:
#correct subset of data
df = df[(df['sex']=="m")
        &(df['orientation']=="straight") 
        & (df['status']=="single")]

In [16]:
df.shape

(29163, 31)
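
To see how much each condition contributes to the reduction in rows, the filters can be applied cumulatively and counted at each step. A minimal sketch on made-up rows; the column names match the dataset, but the values and the `attrition` bookkeeping are illustrative.

```python
import pandas as pd

# Toy stand-in for the profiles data; values are made up for illustration.
toy = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "m"],
    "orientation": ["straight", "gay", "straight", "straight", "straight"],
    "status": ["single", "single", "single", "available", "single"],
})

# Each filter as a named boolean mask, applied one after the other.
filters = {
    "sex == m": toy["sex"] == "m",
    "orientation == straight": toy["orientation"] == "straight",
    "status == single": toy["status"] == "single",
}

keep = pd.Series(True, index=toy.index)
attrition = [("all rows", len(toy))]
for label, mask in filters.items():
    keep = keep & mask
    attrition.append((label, int(keep.sum())))

subset = toy[keep]
for label, remaining in attrition:
    print(f"{label:<25} {remaining}")
```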

## Imputation Decision
The variables of interest are categorical, and therefore not easily imputed. 
It may be possible to impute missing height, but the remaining categorical values in the data bear no causal relationship with height (other than, perhaps, race).
In any case, we just saw the percentage of missing data for height is negligible.

In [5]:
#Focus on the chosen variables of importance and the essay
must_haves = ['body_type', 'height', 'education', 'ethnicity', 'sex', 'essay0']
#drop the rest
df = df[must_haves]
#drop null values
df = df.dropna(subset= must_haves)

In [4]:
# Some of the essays have just a link in the text. BeautifulSoup sees that and gets 
# the wrong idea. This line hides those warnings.
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
def clean(text):
    """
    Takes in raw text of essays
    Strips HTML markup, line breaks and URL markers
    
    Parameters
    ---------
    text: string
        Usually, this is the raw profile essay 
    
    Returns
    -------
    t: string or NaN
        The cleaned, lower-cased profile essay; NaN if the input is
        null or nothing is left after cleaning
    """
    if pd.isnull(text):
        return np.nan
    t = BeautifulSoup(text, 'lxml').get_text()
    t = t.lower()
    # Replace line breaks and tabs with spaces so adjacent words are not joined
    t = t.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ')
    # Note: this removes only the literal markers, not the rest of the address
    bad_words = ['http', 'www']
    for b in bad_words:
        t = t.replace(b, '')
    t = t.strip()
    # After these substitutions, the string may become empty
    if t == '':
        t = np.nan
    
    return t

#Clearing out all HTML and unnecessary characters
df['essay0'] = df['essay0'].progress_apply(clean)

100%|██████████████████████████████████████████████████████████████████████████| 29163/29163 [00:12<00:00, 2302.66it/s]
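
Note that `clean` deletes only the literal substrings `http` and `www`, so the rest of an address (for example `://example.com`) stays in the essay. If removing whole links is preferred, a regular expression is one option. The sketch below is an alternative, not the approach used in this notebook; `strip_urls` and `URL_PATTERN` are hypothetical names.

```python
import re

# Matches http(s):// links and bare www. links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Remove whole URLs and collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()

example = "my art lives at http://example.com/gallery and www.example.org too"
print(strip_urls(example))
```

A pattern like this would need to run before any step that deletes `http` or `www`, since it relies on those prefixes to find the links.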


In [6]:
df.shape

(20576, 6)

### CREATING NEW COLUMNS


Many of the functions here are adapted, with specific modifications, from:
https://github.com/UM-CSS/CSSLabs-NLP/blob/master/1_Data_munging.ipynb

In [7]:
def recode(text, dictionary, default=np.nan):
    """
    Function for recoding categories in a column based on exact matches
    
    Parameters
    ----------
    text: a string
    
    dictionary: dictionary
        contains desired values as keys, and all the
        labels to be matched with it used as values
    
    default: string or None
        the value to be used if no match is found with the 
        dictionary keys
    
    Returns
    ------
    out: a string or None
    
    """
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y == text: #exact match
                out = x
                return out
    return out

#Might be possible to refactor this function out completely
def recode_fuzzy(text, dictionary, default=np.nan):
    """
    Function for recoding categories in a column based on partial matches
    
    text: a string
    
    dictionary: dictionary
        contains desired values as keys, and all the
        labels to be matched with it used as values
    
    default: string or None
        the value to be used if no match is found with the 
        dictionary keys
        
    Returns
    ------
    out: a string or None

    """
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y in text: #partial match
                out = x
                return out
    return out
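
For exact matches, the same recoding can also be expressed with an inverted lookup table and `Series.map`, which avoids the nested loop on every row. This is a sketch of an alternative, not a drop-in replacement: `bodies_demo` is a shortened, made-up dictionary, and unlike `recode` it does not convert values to strings first.

```python
import pandas as pd

# Same shape as the dictionaries used with recode: target label -> raw labels.
bodies_demo = {
    "fit": ["fit", "athletic", "jacked"],
    "not_fit": ["average", "thin"],
}

# Invert once so that each raw label points at its target category.
lookup = {raw: target for target, raws in bodies_demo.items() for raw in raws}

raw_values = pd.Series(["athletic", "thin", "rather not say", None])

# Labels missing from the lookup (and nulls) become NaN, then the default.
recoded = raw_values.map(lookup).fillna("unknown")
print(recoded.tolist())
```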

In [8]:
#These dictionaries were created from all the observed unique values

#Education
ed_levels = {'High School or less': ['dropped out of high school', 'working on high school','graduated from high school', 'working on college/university', 
                    'two-year college', 'dropped out of college/university', 
                    'high school'], 
             'More than High School': ['graduated from college/university', 
                    'working on masters program', 'working on ph.d program', 
                    'college/university', 'working on law school', 
                    'dropped out of masters program', 
                    'dropped out of ph.d program', 'dropped out of law school', 
                    'dropped out of med school',
                    'graduated from masters program',
                    'graduated from ph.d program',                           
                    'graduated from law school', 
                    'graduated from med school', 'masters program', 
                    'ph.d program', 'law school', 'med school']}

#body type
bodies = {'fit': ['fit', 'athletic', 'jacked'], 
          'not_fit': ['average', 'thin', 'skinny','curvey', 'a little extra', 
                      'full figured', 'overweight', 'rather not say', 'used up']
         }

In [9]:
df['edu'] = df.education.apply(recode, dictionary=ed_levels, 
                                            default='unknown')
df['fit'] = df.body_type.apply(recode, dictionary=bodies, 
                                            default='unknown')

In [10]:
# race/ethnicity for exact matching
ethn = {'White': ['white', 'middle eastern', 'middle eastern, white'], 
        'Asian': ['asian', 'indian', 'asian, pacific islander'], 
        'Black': ['black']
       }   

# race/ethnicity for fuzzy matching
ethn2 = {'Latinx': ['latin'], 
         'multiple': [','], 
         np.nan: ['nan']
        }

In [11]:
def census_2010_ethnicity(t):
    '''
    recodes ethnicity variables according to census categories
    This conversion happens through dictionaries declared in the
    previous cell. 
    
    Parameters
    ----------
    t- string
    
    Returns
    -------
    e- string
    '''
    text = str(t)
    e = recode(text, ethn, default='other')
    if 'other' == e:
        e = recode_fuzzy(text, ethn2, default='other')
    return e

df['race_ethnicity'] = df.ethnicity.apply(census_2010_ethnicity)

In [12]:
#there may be some way to build in the calculation of the first quartile
def height_check(inches):
    """
    takes in height and returns a label of short or not short
    uses the first quartile as the cutoff for being short
    
    parameters
    ----------
    inches: float
        The height of the user in inches
    
    returns
    ------
    h: string
        A label- 'short' or 'not_short'
    
    """
    h = 'not_short'
    if inches <= 69:
        #This number was extracted as the first quartile of the distribution of height
        h = 'short'
    return h
df['height'] = pd.to_numeric(df['height'])
df['height_group'] = df.height.apply(height_check)
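
As the comment at the top of this cell suggests, the cutoff can be computed from the data instead of being hard-coded. A minimal sketch with made-up heights; `Series.quantile(0.25)` gives the first quartile and `np.where` applies the same rule as `height_check` in one vectorised step.

```python
import numpy as np
import pandas as pd

# Made-up heights in inches, standing in for df['height'].
heights = pd.Series([64.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0, 76.0])

# First quartile of the observed heights, computed rather than hard-coded.
cutoff = heights.quantile(0.25)

# Vectorised equivalent of height_check, using the computed cutoff.
height_group = pd.Series(
    np.where(heights <= cutoff, "short", "not_short"), index=heights.index
)

print(cutoff)
print(height_group.value_counts())
```

Computing the cutoff on the filtered subset keeps the label consistent with whichever group of users is being analysed.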

In [13]:
#Now drop the original variables
df.drop(columns=['body_type', 'ethnicity','height','education'], inplace=True)

In [14]:
df.to_csv('profiles_filtered.csv')

## PROFILE LENGTH AND VARIABLES OF INTEREST

In [None]:
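import seaborn as sns

# Assumption: profile length is the number of whitespace-separated words in essay0
df['profile_length'] = df['essay0'].str.split().str.len()

# By Ethnicity
sns_race_plot = sns.boxplot(x="race_ethnicity", y="profile_length", data=df)
sns_race_plot.set(title = 'Racial Background and Length of Dating Profile', 
                  xlabel = 'Race', ylabel = 'Number of Words')
sns_race_plot.figure.savefig('profile_race.png')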

In [None]:
# By Education 
sns_plot = sns.boxplot(x="edu", y="profile_length", data=df)
sns_plot.set(title = 'Education and Length of Dating Profile', 
                                                           xlabel = 'Education', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_educ.png')

In [None]:
# By Height 
sns_plot = sns.boxplot(x="height_group", y="profile_length", data=df)
sns_plot.set(title = 'Height and Length of Dating Profile', 
                                                           xlabel = 'Height', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_height.png')

In [None]:
# By Fitness Level
sns_plot = sns.boxplot(x="fit", y="profile_length", data=df)
sns_plot.set(title = 'Fitness and Length of Dating Profile', 
                                                           xlabel = 'Fitness', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_fitness.png')
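
The boxplots can be paired with a numeric summary of the same groups. The sketch below uses made-up essays and assumes that profile length means the whitespace-separated word count of `essay0`; the quartiles it reports are the values each box is drawn from.

```python
import pandas as pd

# Toy stand-in for the filtered data; essays and labels are made up.
toy = pd.DataFrame({
    "essay0": [
        "i work in a library",
        "hey how's it going",
        "work work work work play hard",
    ],
    "edu": ["More than High School", "High School or less", "More than High School"],
})

# Assumption: profile length is the whitespace-separated word count of the essay.
toy["profile_length"] = toy["essay0"].str.split().str.len()

# Quartiles per group: the same numbers each box in a boxplot is built from.
summary = toy.groupby("edu")["profile_length"].describe()[["25%", "50%", "75%"]]
print(summary)
```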

## TEXT EDITING

In [None]:
# First, fix conjoined words in the essay
# This is slow: the run recorded below took roughly 80 hours
df['essay0'] = df['essay0'].progress_apply(split_incorrect)

100%|█████████████████████████████████████████████████████████████████████████| 20576/20576 [79:41:19<00:00, 10.85s/it]
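
The implementation of `split_incorrect` lives in `split_utils` and is not shown here. As a rough illustration of what splitting conjoined words involves, one common technique is dictionary-based segmentation with dynamic programming. The sketch below is that general technique under a toy vocabulary, not the project's code; `segment` and `vocab` are hypothetical names.

```python
def segment(token, vocabulary):
    """Split token into vocabulary words using the fewest pieces.

    Returns a list of words, or None if no full segmentation exists.
    """
    n = len(token)
    # best[i] holds the shortest list of words covering token[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Toy vocabulary; a real one would come from a word list or corpus counts.
vocab = {"i", "love", "hiking", "hi", "king", "lo", "ve"}

print(segment("ilovehiking", vocab))
print(segment("xyz", vocab))
```

Preferring the fewest pieces is a simple tie-breaker; weighting candidates by word frequency is a frequent refinement when the vocabulary is large.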

In [None]:
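# get_npoly and get_flesch come from the import that is commented out in the first cell
from text_complexity_utils import get_npoly, get_flesch
df['long_words'] = df['essay0'].progress_apply(get_npoly)
df['flesch'] = df['essay0'].progress_apply(get_flesch)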

In [None]:
#this will be the main data file for the rest of the analysis
df.to_csv('compressed_okcupid.csv')