# Data Filter
### The purpose of this notebook is four-fold:
1) Filter data to only the relevant rows

2) Delete the unnecessary columns

3) Suitably edit the text to allow for topic modeling

4) Create new variables to assist with demographic comparisons of topics


We apply the CRISP-DM framework for data analysis here (as outlined at
https://www.datascience-pm.com/crisp-dm-2/).


**Business understanding – What does the business need?**
The business, in this case, is OkCupid, and it might be facing a large number of men dropping out of the system because of high competition and limited responses. We cannot confirm this directly with OkCupid, but we do know that self-representation through online dating is a relatively new skill to gain in our species' long and checkered history. So why not provide some guidance along the way? The goal would be to provide data-driven guidance to male users of the service so that they stand out from the competition and get matched more often. This will increase the rating of the app, and lead to more sign-ups and revenue. 

Getting present-day data would be challenging. Many researchers have gained access to profiles and conversation data, but usually have the funding support and credentials of their universities to back them. Moreover, online dating data involves considerable privacy concerns. In such a situation, it would be best to acquire a low-cost data set that anonymizes data, but need not include all aspects of profiles or even be up-to-date. 

Success for us would involve first testing for the extent of homogeneity in dating profiles, and then supporting users with UX features that offer tips for reducing that homogeneity and sounding memorable relative to other users. In technical terms, this means identifying the most common topics, language patterns and keywords in the text, and then providing guidance to prevent such repetition. It would also be useful to check whether these patterns vary across subgroups of users, as indicated by variables like height, weight/fitness level, race and education level.

Given these objectives, we will use Python and R, depending on which of them offers the most suitable packages for our specific and evolving tasks.

**Data understanding – What data do we have / need? Is it clean?**



    Collect initial data: Acquire the necessary data and (if necessary) load it into your analysis tool.
    Describe data: Examine the data and document its surface properties like data format, number of records, or field identities.
    Explore data: Dig deeper into the data. Query it, visualize it, and identify relationships among the data.
    Verify data quality: How clean/dirty is the data? Document any quality issues.


**Data preparation – How do we organize the data for modeling?**

Fortunately, this is our sole dataset and does not seem to require any form of integration.

    Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.
    Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.
    Construct data: Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.
    Integrate data: Create new data sets by combining data from multiple sources.
    Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.

**Modeling – What modeling techniques should we apply?**

We are focused on different versions of topic models here. One option is the commonly used Latent Dirichlet Allocation (LDA). This model is advantageous because it can be extended into more complex models, such as Structural Topic Models.

On the other hand, the alternative Non-Negative Matrix Factorization (NMF) model may be useful too, and has been shown to offer many advantages over LDA when dealing with very short documents such as SMS messages and tweets (see the references in the thesis linked in the readme).
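To make the comparison concrete, here is a minimal sketch of fitting both models with scikit-learn on a handful of made-up sentences. The toy essays, the `top_words` helper, and the choice of two topics are illustrative assumptions, not the settings used in this project.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for profile essays (hypothetical, not taken from the dataset)
toy_essays = [
    "i love hiking and camping in the mountains",
    "hiking trails and mountains make me happy",
    "i enjoy cooking pasta and baking bread",
    "baking bread and cooking dinner for friends",
    "camping trips and long hiking weekends",
    "cooking new recipes and baking desserts",
]


def top_words(model, vectorizer, n_words=3):
    """Return the n_words highest-weighted terms for each fitted topic."""
    # Order the vocabulary by column index so it lines up with components_
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [[terms[i] for i in topic.argsort()[::-1][:n_words]]
            for topic in model.components_]


# LDA is usually fitted on raw term counts
count_vec = CountVectorizer(stop_words="english")
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(count_vec.fit_transform(toy_essays))

# NMF is usually fitted on TF-IDF weights
tfidf_vec = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
nmf.fit(tfidf_vec.fit_transform(toy_essays))

lda_topics = top_words(lda, count_vec)
nmf_topics = top_words(nmf, tfidf_vec)
print("LDA:", lda_topics)
print("NMF:", nmf_topics)
```

Comparing the top words side by side like this is only a first, informal check; on the real essays the number of topics and the preprocessing would need to be tuned.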

We will need to check how these models play out on our data, and choose accordingly. However, a priori, it seems like 

    Select modeling techniques: Determine which algorithms to try (e.g. regression, neural net).
    Generate test design: Pending your modeling approach, you might need to split the data into training, test, and validation sets.
    Build model: As glamorous as this might sound, this might just be executing a few lines of code like “reg = LinearRegression().fit(X, y)”.
    Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design.

**Evaluation – Which model best meets the business objectives?**

    Evaluate results: Do the models meet the business success criteria? Which one(s) should we approve for the business?
    Review process: Review the work accomplished. Was anything overlooked? Were all steps properly executed? Summarize findings and correct anything if needed.
    Determine next steps: Based on the previous three tasks, determine whether to proceed to deployment, iterate further, or initiate new projects.

**Deployment – How do stakeholders access the results?**

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm
tqdm.pandas()
# For Data Cleaning
from bs4 import BeautifulSoup
from split_utils import *
from text_complexity_utils import get_npoly, get_flesch

In [2]:
# Read in the raw data
df = pd.read_csv('../profiles.csv/profiles.csv')
# Keep only the relevant rows: single, straight, male users
df = df[(df['sex'] == "m")
        & (df['orientation'] == "straight")
        & (df['status'] == "single")]

In [3]:
df.shape

(29163, 31)

In [4]:
# Some of the essays have just a link in the text. BeautifulSoup sees that and gets 
# the wrong idea. This line hides those warnings.
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
def clean(text):
    """
    Takes in raw text of essays
    Strips HTML markup, lowercases the text, and removes the
    substrings 'http' and 'www' left over from links
    
    Parameters
    ---------
    text: string
        Usually, this is the raw profile essay 
    
    Returns
    -------
    t: string or NaN
        The cleaned profile essay, or NaN if the essay is missing
        or empty after cleaning
    """
    if pd.isnull(text):
        return np.nan

    t = BeautifulSoup(text, 'lxml').get_text()
    t = t.lower()
    # Collapse newlines, tabs and repeated spaces into single spaces, so that
    # the words on either side of a line break are not joined together
    t = ' '.join(t.split())
    bad_words = ['http', 'www', '\nnan']

    for b in bad_words:
        t = t.replace(b, '')
    # After these substitutions, the string may become empty
    if t == '':
        t = np.nan
    
    return t

#Clearing out all HTML and unnecessary characters
df['essay0'] = df['essay0'].progress_apply(clean)

100%|██████████████████████████████████████████████████████████████████████████| 29163/29163 [00:12<00:00, 2302.66it/s]


Most of the variables of interest are categorical, and therefore not easily imputed. 
Height is numeric, so it may be possible to impute missing values for it. However, the remaining categorical variables in the data bear no causal relationship with height (other than, perhaps, race), so rows with missing values are dropped instead.

In [5]:
#Focus on the chosen variables of importance and the essay
must_haves = ['body_type', 'height', 'education', 'ethnicity', 'sex', 'essay0']
#drop the rest
df = df[must_haves]
#drop null values
df = df.dropna(subset= must_haves)

In [6]:
df.shape

(20576, 6)
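The filter above reduces the data from 29,163 to 20,576 rows. Before dropping, it can help to see which columns account for the missing values. The sketch below shows one way to do that on a small made-up frame; the `toy` rows are hypothetical and only three of the six columns are included for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the filtered profiles
toy = pd.DataFrame({
    "body_type": ["fit", np.nan, "average", "thin"],
    "height": [70.0, 68.0, np.nan, 72.0],
    "essay0": ["hello there", "i like tea", "new in town", np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Rows that would survive dropna on all three columns
rows_kept = len(toy.dropna(subset=["body_type", "height", "essay0"]))

print(missing_per_column)
print(f"{rows_kept} of {len(toy)} rows kept")
```

On the real data, the same two calls would be run on `df[must_haves]` just before the `dropna` step.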

### CREATING NEW COLUMNS


Many of the functions in this section are adapted, with specific modifications, from:
https://github.com/UM-CSS/CSSLabs-NLP/blob/master/1_Data_munging.ipynb

In [7]:
def recode(text, dictionary, default=np.nan):
    """
    Function for recoding categories in a column based on exact matches
    
    Parameters
    ----------
    text: a string
    
    dictionary: dictionary
        contains desired values as keys, and all the
        labels to be matched with it used as values
    
    default: string or None
        the value to be used if no match is found with the 
        dictionary keys
    
    Returns
    ------
    out: a string or None
    
    """
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y == text: #exact match
                out = x
                return out
    return out

#Might be possible to refactor this function out completely
def recode_fuzzy(text, dictionary, default=np.nan):
    """
    Function for recoding categories in a column based on partial matches
    
    Parameters
    ----------
    text: a string
    
    dictionary: dictionary
        contains desired values as keys, and all the
        labels to be matched with it used as values
    
    default: string or None
        the value to be used if no match is found with the 
        dictionary keys
        
    Returns
    ------
    out: a string or None

    """
    out = default
    text = str(text)
    
    for x in dictionary.keys():
        for y in dictionary[x]:
            if y in text: #partial match
                out = x
                return out
    return out
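The two functions above differ only in the comparison they use, so one possible refactor is a single function with a flag. This is a sketch under that assumption; `recode_any` and its `exact` parameter are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np


def recode_any(text, dictionary, default=np.nan, exact=True):
    """
    Recode text to the first dictionary key with a matching label.

    exact=True  reproduces the exact-match behaviour (label == text)
    exact=False reproduces the partial-match behaviour (label in text)
    """
    text = str(text)
    for key, labels in dictionary.items():
        for label in labels:
            matched = (label == text) if exact else (label in text)
            if matched:
                return key
    return default


# Toy dictionary and input, for illustration only
toy_map = {"Latinx": ["latin"], "multiple": [","]}
exact_result = recode_any("hispanic / latin", toy_map, default="other")
fuzzy_result = recode_any("hispanic / latin", toy_map, default="other",
                          exact=False)
print(exact_result, fuzzy_result)
```

With pandas, the flag can be passed through `apply` as a keyword argument, for example `df.ethnicity.apply(recode_any, dictionary=toy_map, exact=False)`.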

In [8]:
#These dictionaries were created from all the observed unique values

#Education
ed_levels = {'High School or less': ['dropped out of high school', 'working on high school','graduated from high school', 'working on college/university', 
                    'two-year college', 'dropped out of college/university', 
                    'high school'], 
             'More than High School': ['graduated from college/university', 
                    'working on masters program', 'working on ph.d program', 
                    'college/university', 'working on law school', 
                    'dropped out of masters program', 
                    'dropped out of ph.d program', 'dropped out of law school', 
                    'dropped out of med school',
                    'graduated from masters program',
                    'graduated from ph.d program',                           
                    'graduated from law school', 
                    'graduated from med school', 'masters program', 
                    'ph.d program', 'law school', 'med school']}

#body type
bodies = {'fit': ['fit', 'athletic', 'jacked'], 
          # 'curvy' is the spelling used in the OkCupid body_type column;
          # recode() needs an exact match
          'not_fit': ['average', 'thin', 'skinny', 'curvy', 'a little extra', 
                      'full figured', 'overweight', 'rather not say', 'used up']
         }

In [9]:
df['edu'] = df.education.apply(recode, dictionary=ed_levels, 
                                            default='unknown')
df['fit'] = df.body_type.apply(recode, dictionary=bodies, 
                                            default='unknown')

In [10]:
# race/ethnicity for exact matching
ethn = {'White': ['white', 'middle eastern', 'middle eastern, white'], 
        'Asian': ['asian', 'indian', 'asian, pacific islander'], 
        'Black': ['black']
       }   

# race/ethnicity for fuzzy matching
ethn2 = {'Latinx': ['latin'], 
         'multiple': [','], 
         np.nan: ['nan']
        }

In [11]:
def census_2010_ethnicity(t):
    '''
    recodes ethnicity variables according to census categories
    This conversion happens through dictionaries declared in the
    previous cell. 
    
    Parameters
    ----------
    t- string
    
    Returns
    -------
    e- string
    '''
    text = str(t)
    e = recode(text, ethn, default='other')
    if 'other' == e:
        e = recode_fuzzy(text, ethn2, default='other')
    return e

df['race_ethnicity'] = df.ethnicity.apply(census_2010_ethnicity)

In [12]:
#there may be some way to build in the calculation of the first quartile
def height_check(inches):
    """
    takes in height and returns a label of 'short' or 'not_short'
    heights at or below the first quartile (69 inches) are labeled 'short'
    
    parameters
    ----------
    inches: float
        The height of the user in inches
    
    returns
    ------
    h: string
        A label- 'short' or 'not_short'
    
    """
    h = 'not_short'
    if inches <= 69:
        #This number was extracted as the first quartile of the distribution of height
        h = 'short'
    return h
df['height'] = pd.to_numeric(df['height'])
df['height_group'] = df.height.apply(height_check)
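As the comment at the top of this cell suggests, the cutoff could be computed from the data instead of being hard-coded. The following is a minimal sketch of that idea on made-up heights; `height_group_from_quantile` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def height_group_from_quantile(heights, q=0.25):
    """
    Label each height as 'short' if it is at or below the q-th quantile
    of the series, and 'not_short' otherwise. Returns (labels, cutoff).
    """
    cutoff = heights.quantile(q)
    labels = heights.apply(
        lambda h: "short" if h <= cutoff else "not_short")
    return labels, cutoff


# Toy heights in inches, for illustration only
toy_heights = pd.Series([66.0, 68.0, 69.0, 70.0, 71.0,
                         72.0, 74.0, 75.0, 76.0])
toy_labels, toy_cutoff = height_group_from_quantile(toy_heights)
print(toy_cutoff)
print(toy_labels.tolist())
```

Computing the cutoff this way keeps the label consistent if the filtering steps earlier in the notebook change the height distribution.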

In [13]:
#Now drop the original variables
df.drop(columns=['body_type', 'ethnicity','height','education'], inplace=True)

In [14]:
df.to_csv('profiles_filtered.csv')

## PROFILE LENGTH AND VARIABLES OF INTEREST

In [None]:
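# Assumption: profile_length is the number of words in essay0, matching the
# 'Number of Words' axis label; it is not computed earlier in this notebook
import seaborn as sns
df['profile_length'] = df['essay0'].str.split().str.len()

# By Ethnicity
sns_race_plot = sns.boxplot(x="race_ethnicity", y="profile_length", data=df)
sns_race_plot.set(title = 'Racial Background and Length of Dating Profile', 
                  xlabel = 'Race', ylabel = 'Number of Words')
sns_race_plot.figure.savefig('profile_race.png')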

In [None]:
# By Education 
sns_plot = sns.boxplot(x="edu", y="profile_length", data=df)
sns_plot.set(title = 'Education and Length of Dating Profile', 
                                                           xlabel = 'Education', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_educ.png')

In [None]:
# By Height 
sns_plot = sns.boxplot(x="height_group", y="profile_length", data=df)
sns_plot.set(title = 'Height and Length of Dating Profile', 
                                                           xlabel = 'Height', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_height.png')

In [None]:
# By Fitness Level
sns_plot = sns.boxplot(x="fit", y="profile_length", data=df)
sns_plot.set(title = 'Fitness and Length of Dating Profile', 
                                                           xlabel = 'Fitness', 
                                                           ylabel = 'Number of Words' )
sns_plot.figure.savefig('profile_fitness.png')
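A numeric summary can complement the boxplots, since it reports the same quartiles as a table. The sketch below runs on a made-up frame and assumes that profile length means the number of whitespace-separated words in the essay; the `toy` rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "edu": ["High School or less", "More than High School",
            "More than High School", "High School or less"],
    "essay0": ["i like music", "i enjoy long walks on the beach",
               "new to the city and looking around", "ask me"],
})

# Assumption: profile length is the number of whitespace-separated words
toy["profile_length"] = toy["essay0"].str.split().str.len()

# Count, quartiles and median per group: the quantities a boxplot draws
length_summary = toy.groupby("edu")["profile_length"].describe()
print(length_summary[["count", "25%", "50%", "75%"]])
```

The same `groupby(...).describe()` call works for `race_ethnicity`, `height_group` and `fit` by swapping the grouping column.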

## TEXT EDITING

In [None]:
# First, fix conjoined words in the essay
# This step is slow: the run recorded below took roughly 80 hours (~11 s per essay)
df['essay0'] = df['essay0'].progress_apply(split_incorrect)

100%|█████████████████████████████████████████████████████████████████████████| 20576/20576 [79:41:19<00:00, 10.85s/it]

In [None]:
df['long_words'] = df['essay0'].progress_apply(get_npoly)
df['flesch'] = df['essay0'].progress_apply(get_flesch)
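The helpers `get_npoly` and `get_flesch` come from `text_complexity_utils`, which is not shown here. For orientation, the sketch below illustrates the standard Flesch Reading Ease formula and a count of polysyllabic words. It is an assumption about what those helpers measure, not their actual implementation, and the syllable counter is a rough vowel-group heuristic.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """
    Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    Higher scores indicate easier text.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z]+", text)
    if not sentences or not words:
        return float("nan")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def count_polysyllables(text, min_syllables=3):
    """Count the words with at least min_syllables syllables."""
    words = re.findall(r"[a-zA-Z]+", text)
    return sum(1 for w in words if count_syllables(w) >= min_syllables)


simple_score = flesch_reading_ease("the cat sat. the dog ran.")
complex_score = flesch_reading_ease(
    "organizational complexity necessitates deliberation.")
print(simple_score, complex_score)
print(count_polysyllables("organizational complexity is hard"))
```

A dedicated library would handle syllable counting more accurately; the heuristic here is only meant to show how the two columns could be derived from the essay text.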

In [None]:
#this will be the main data file for the rest of the analysis
df.to_csv('compressed_okcupid.csv')