In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, you will analyze some data from OKCupid, an app that focuses on using multiple choice and short answers to match users.

You will also create a presentation about your findings from this OKCupid dataset.

The purpose of this project is to practice formulating questions and implementing machine learning techniques to answer those questions. However, the questions you ask and how you answer them are entirely up to you.

We’re excited to see the different topics you explore.

Project Objectives:
- Complete a project to add to your portfolio
- Use Jupyter Notebook to communicate findings
- Build, train, and evaluate a machine learning model

Prerequisites:
- Natural Language Processing
- Supervised Machine Learning
- Unsupervised Machine Learning


The dataset provided has the following columns of multiple-choice data:

- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- offspring
- orientation
- pets
- religion
- sex
- sign
- smokes
- speaks
- status

And a set of open short-answer responses to the following prompts:

- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, shows, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import re
import spacy
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_lg')

In [2]:
df = pd.read_csv('profiles.csv')

In [3]:
df = df[:100]

In [4]:
pd.options.display.max_columns = 100

In [5]:
df.shape

(100, 31)

In [6]:
df.fillna('',axis=0,inplace=True)
df.rename(columns={'essay0': 'my_self', 'essay1': 'life', 'essay2': 'good_at', 'essay3': 'people_notice', 
                         'essay4': 'favorites', 'essay5': 'six_needed', 'essay6': 'think_to', 'essay7': 'friday_night', 
                         'essay8': 'private_admit', 'essay9': 'message_me_if'}, inplace=True)

#### Preprocess Text

Not every user answered every essay question. My first approach was to consolidate all the answers into a single column called `essay`; the cells that follow instead clean each essay column separately.

##### Code to combine all essay columns into one. No longer used
df['essay'] = df[df.columns[6:16]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
df['essay'] = df['essay'].astype(str)

In [7]:
def essay_prep(data):
    
    data = data.str.replace("\n", " ", regex=False)
    data = data.str.replace(r"<[^>]*>", "", regex=True)
    data = data.str.replace(r'[^\w\s]', '', regex=True)
    data = data.str.lower()
    print('regex applied')
    
    def remove_numbers(data):
        number_pattern = r'\d+'
        data = data.apply(
            lambda text: re.sub(pattern=number_pattern, repl=" ", string=text))
        print('remove_numbers applied')
        return data
    
    data = remove_numbers(data)
    
    def remove_frequent_words(data):
        cnt = Counter()
        for text in data.values:
            for word in text.split(' '):
                cnt[word] += 1
        FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
        data = data.apply(
            lambda text: " ".join([word for word in str(text).split(' ') if word not in FREQWORDS]))
        print('remove_frequent_words applied')
        return data

    data = remove_frequent_words(data)
    
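    def lemmatize_words(data):
        # WordNetLemmatizer needs the NLTK 'wordnet' corpus (nltk.download('wordnet')).
        # Without a POS tag every token is lemmatized as a noun, which is why
        # 'was' shows up as 'wa' and 'as' as 'a' in the output.
        lemmatizer = WordNetLemmatizer()
        data = data.apply(
            lambda text: " ".join([lemmatizer.lemmatize(word) for word in text.split()]))
        print('lemmatize_words applied')
        return data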

    data = lemmatize_words(data)
    
    # Iterate over the values directly so this does not depend on a 0-based index.
    data = [nlp(text) for text in data]
    print('nlp data applied')
    
    return data
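
The first cleaning steps of `essay_prep` are easier to follow on a single string. The sketch below is a minimal illustration, not the notebook's actual cell: `clean_text` and the sample sentence are hypothetical, and it covers only the regex and digit-removal steps (no frequent-word removal, lemmatization, or spaCy).

```python
import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Hypothetical helper mirroring the first steps of essay_prep."""
    series = series.str.replace("\n", " ", regex=False)      # newlines -> spaces
    series = series.str.replace(r"<[^>]*>", "", regex=True)  # strip HTML tags
    series = series.str.replace(r"[^\w\s]", "", regex=True)  # strip punctuation
    series = series.str.lower()
    series = series.str.replace(r"\d+", " ", regex=True)     # digits -> a space
    return series


# Made-up sample text, not taken from profiles.csv
sample = pd.Series(["I'm <b>6</b> feet tall!\nHello."])
cleaned = clean_text(sample)
print(cleaned.iloc[0].split())
```

Replacing digits with a space leaves runs of consecutive spaces behind, so splitting with `split()` (no argument) rather than `split(' ')` avoids producing empty tokens.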

In [8]:
essays_cols = df.columns.to_list()[6:16]

for col in essays_cols:
    print([col])
    df[col] = essay_prep(df[col])

['my_self']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['life']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['good_at']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['people_notice']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['favorites']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['six_needed']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['think_to']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['friday_night']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['private_admit']
regex ap

In [9]:
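userid = 0  # index label of the reference user

def userid_v_others(data):
    # Similarity between the reference user's essay and every user's essay.
    # This includes the reference user's own row, which is why row 0 scores 1.0.
    return [data[userid].similarity(data[i]) for i in range(len(data))]

for col in essays_cols:
    df[col + '_score'] = userid_v_others(df[col])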
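
As far as I understand it, spaCy's `Doc.similarity` defaults to the cosine similarity between the two documents' vectors, where each document vector is the average of its token vectors. The sketch below reproduces that calculation with NumPy on made-up 2-D vectors; `cosine_similarity` and `doc_vectors` are illustrative names, not part of spaCy or of this notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy stand-ins for three users' averaged essay vectors (assumed values)
doc_vectors = np.array([
    [1.0, 0.0],  # reference user
    [1.0, 1.0],  # partly similar
    [0.0, 2.0],  # orthogonal, so no similarity
])

scores = [cosine_similarity(doc_vectors[0], v) for v in doc_vectors]
print(scores)
```

Because cosine similarity ignores vector length, a long essay and a short essay that use similar vocabulary can still score close to 1.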


#### Find out how much of a match someone is based on their response to essay questions

In [12]:
df.good_at.iloc[43]

designing thing using my hand thinking creative solution problem but im not so good identifying problem coping with chaos

In [13]:
df.good_at.iloc[0]

people laugh ranting about good salting finding simplicity complexity complexity simplicity

In [11]:
# Calculate Average Score of matched essays
df['essay_match'] = df[df.columns[-10:]].mean(axis=1)

# Show the five highest matches (head() returns 5 rows by default)
df.sort_values(['essay_match'],ascending=False).head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,my_self,life,good_at,people_notice,favorites,six_needed,think_to,friday_night,private_admit,message_me_if,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,my_self_score,life_score,good_at_score,people_notice_score,favorites_score,six_needed_score,think_to_score,friday_night_score,private_admit_score,message_me_if_score,essay_match
0,22,a little extra,strictly anything,socially,never,working on college/university,"(about, me, would, love, think, that, wa, some...","(currently, working, a, an, international, age...","(people, laugh, ranting, about, good, salting,...","(way, look, am, six, foot, half, asian, half, ...","(book, absurdistan, republic, mouse, men, only...","(food, water, cell, phone, shelter)","(duality, humorous, thing)","(trying, to, find, someone, to, hang, am, down...","(am, new, california, looking, for, someone, w...","(want, be, swept, off, your, foot, tired, norm...","asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
43,40,fit,,socially,,graduated from college/university,"(do, nt, really, like, summarizing, myself, bu...","(spending, lot, of, time, building, thing, bus...","(designing, thing, using, my, hand, thinking, ...","(i, ve, never, figured, out, answer, this, que...","(book, almost, anything, by, vonnegut, or, ste...","(ignoring, obvious, air, water, food, shelter,...","(build, thingsmake, thing, better, cosmos, our...","(there, is, no, typical, night, for, me, somet...","(have, no, secret, wo, nt, tell, but, have, as...","(you, re, still, reading, at, least, say, hell...",white,71.0,60000,construction / craftsmanship,2012-06-30-00-01,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs,agnosticism but not too serious about it,m,gemini and it&rsquo;s fun to think about,no,"english (okay), french (poorly), spanish (poor...",single,0.976987,0.921732,0.793368,0.904595,0.895263,0.764673,0.610774,0.877851,0.773201,0.930465,0.844891
59,31,average,,socially,,graduated from college/university,"(when, it, come, own, life, do, a, please, but...","(read, book, soak, up, a, much, sun, a, humanl...","(conversation, analyzing, movie, spelling, gra...","(people, generally, ca, nt, discern, ethnicity...","(short, list, awakening, intelligence, one, hu...","(bicycle, cell, phone, laptop, food, water, ob...","(finding, balance, all, aspect, life, also, co...","(jumble, in, between, extreme, ton, at, root, ...","(list, dirty, dancing, a, one, deep, down, gui...","(like, bike, riding, you, re, swimmer, that, s...",,71.0,-1,,2012-06-05-13-04,"san francisco, california",,straight,likes dogs and likes cats,agnosticism and somewhat serious about it,m,libra but it doesn&rsquo;t matter,when drinking,"english (fluently), spanish (poorly)",single,0.930046,0.915309,0.753655,0.886667,0.875889,0.871213,0.557309,0.915803,0.781329,0.921246,0.840847
40,30,average,,often,never,graduated from masters program,"(am, new, san, francisco, bay, area, looking, ...","(write, software, fun, profit, complain, about...","(like, think, am, good, communicating, friend,...","(wear, funny, tshirts, they, have, funny, ando...","(book, almost, anything, fantasy, lord, ring, ...","(happiness, fun, hug, fresh, air, internet, i,...","(life, universe, everything, me, friend, every...","(doing, same, thing, do, every, night, in, cas...","(hmm, so, want, know, little, secret, ey, well...","(should, message, me, ifwell, feel, like, it, ...",,76.0,-1,computer / hardware / software,2012-06-29-22-56,"menlo park, california",doesn&rsquo;t have kids,straight,likes cats,agnosticism,m,,no,"english (fluently), dutch (fluently), lisp (fl...",single,0.975224,0.891383,0.789386,0.917571,0.903603,0.629294,0.617737,0.9159,0.805243,0.912192,0.835753
16,33,fit,,socially,,working on masters program,"(just, moved, bay, area, from, austin, tx, ori...","(making, music, programming, getting, back, in...","(i, m, from, louisiana, so, cooking, eating, a...","(lately, keep, getting, asked, are, you, with,...","(moviestvetc, big, lebowski, other, cohen, bro...","(in, no, particular, order, food, music, outdo...","(methodology, for, practicing, creative, skill...","(just, moved, here, am, still, getting, to, kn...","(am, in, s, still, can, not, grow, mustache, i...","(want, help, me, assemble, ikea, stuff, andor,...",white,70.0,-1,entertainment / media,2012-06-29-16-08,"oakland, california",,straight,likes dogs and likes cats,,m,pisces but it doesn&rsquo;t matter,sometimes,"english (fluently), c++ (fluently), german (po...",single,0.975612,0.87778,0.740459,0.906484,0.85354,0.725052,0.608091,0.938251,0.802734,0.913045,0.834105
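
The top row of the ranking is the reference user matched against themselves, so every score is 1.0. A small variation on the averaging step, sketched here on a toy frame with assumed values, selects the score columns by name and drops the reference user before sorting. The `toy` frame, `score_cols`, and `reference_user` are illustrative names only.

```python
import pandas as pd

# Toy scores for three users; row 0 is the reference user (assumed values)
toy = pd.DataFrame({
    "my_self_score": [1.0, 0.9, 0.5],
    "life_score": [1.0, 0.7, 0.3],
    "age": [22, 40, 31],
})

# Pick score columns by suffix instead of by position
score_cols = [c for c in toy.columns if c.endswith("_score")]
toy["essay_match"] = toy[score_cols].mean(axis=1)

# Exclude the reference user, then rank the remaining users
reference_user = 0
ranked = toy.drop(index=reference_user).sort_values("essay_match", ascending=False)
print(ranked[["age", "essay_match"]])
```

Selecting by suffix also keeps the result stable if the cell is run twice, since a positional slice such as `df.columns[-10:]` would pick up the newly added `essay_match` column on the second run.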
