In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, you will analyze some data from OKCupid, an app that focuses on using multiple choice and short answers to match users.

You will also create a presentation about your findings from this OKCupid dataset.

The purpose of this project is to practice formulating questions and implementing machine learning techniques to answer those questions. However, the questions you ask and how you answer them are entirely up to you.

We’re excited to see the different topics you explore.

Project Objectives:
- Complete a project to add to your portfolio
- Use Jupyter Notebook to communicate findings
- Build, train, and evaluate a machine learning model

Prerequisites:
- Natural Language Processing
- Supervised Machine Learning
- Unsupervised Machine Learning


The dataset provided has the following columns of multiple-choice data:

- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- offspring
- orientation
- pets
- religion
- sex
- sign
- smokes
- speaks
- status

And a set of open short-answer responses to:

- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, shows, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import re
import spacy
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_lg')

In [13]:
df = pd.read_csv('profiles.csv')

In [3]:
df = df[:1000]

In [4]:
pd.options.display.max_columns = 100

In [14]:
df.shape

(59946, 31)

In [15]:
df.fillna('',axis=0,inplace=True)
df.rename(columns={'essay0': 'my_self', 'essay1': 'life', 'essay2': 'good_at', 'essay3': 'people_notice', 
                         'essay4': 'favorites', 'essay5': 'six_needed', 'essay6': 'think_to', 'essay7': 'friday_night', 
                         'essay8': 'private_admit', 'essay9': 'message_me_if'}, inplace=True)
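
The cells further down pick out the essay columns by position (`df.columns.to_list()[6:16]`), which only works while the CSV keeps its current column order. A minimal sketch of a name-based alternative follows. It uses a toy DataFrame with made-up values, and `essay_names` is a hypothetical, shortened version of the rename mapping above:

```python
import pandas as pd

# Toy stand-in for profiles.csv (made-up values, only three of the ten essays)
toy = pd.DataFrame({
    "age": [25, 31],
    "essay0": ["i like hiking", None],
    "essay1": [None, "grad student"],
    "essay2": ["cooking", "chess"],
    "sex": ["f", "m"],
})

# Hypothetical mapping in the same spirit as the rename above
essay_names = {"essay0": "my_self", "essay1": "life", "essay2": "good_at"}

# Fill missing answers with empty strings, then rename the essay columns
toy = toy.fillna("").rename(columns=essay_names)

# Select the essay columns by name instead of by position
essay_cols = [name for name in essay_names.values() if name in toy.columns]
print(essay_cols)
print(toy[essay_cols])
```

Because the list is derived from the mapping, it stays correct if columns are added or reordered.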

#### Preprocess Text

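Since not all the essay questions are populated for every user, I filled the missing answers with empty strings above. Each essay column is then cleaned separately with the function below: it strips HTML tags and punctuation, lowercases the text, removes numbers, drops the 10 most frequent words in that column, lemmatizes the remaining words, and parses the result with spaCy.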

In [7]:
def essay_prep(data):
    """Clean one essay column (a Series of strings) and return a list of spaCy Docs."""

    # Replace newlines, strip HTML tags and punctuation, then lowercase
    data = data.str.replace("\n", " ", regex=False)
    data = data.str.replace(r"<[^>]*>", "", regex=True)
    data = data.str.replace(r'[^\w\s]', '', regex=True)
    data = data.str.lower()

    def remove_numbers(data):
        # Replace every run of digits with a single space
        number_pattern = r'\d+'
        data = data.apply(
            lambda text: re.sub(pattern=number_pattern, repl=" ", string=text))
        return data

    data = remove_numbers(data)

    def remove_frequent_words(data):
        # Drop the 10 most common tokens in this column.
        # Note: splitting on ' ' also counts the empty strings left by repeated spaces.
        cnt = Counter()
        for text in data.values:
            for word in text.split(' '):
                cnt[word] += 1
        FREQWORDS = {w for (w, wc) in cnt.most_common(10)}
        data = data.apply(
            lambda text: " ".join([word for word in str(text).split(' ') if word not in FREQWORDS]))
        return data

    data = remove_frequent_words(data)

    def lemmatize_words(data):
        # WordNetLemmatizer needs the NLTK 'wordnet' corpus to be downloaded
        lemmatizer = WordNetLemmatizer()
        data = data.apply(
            lambda text: " ".join([lemmatizer.lemmatize(word) for word in text.split()]))
        return data

    data = lemmatize_words(data)

    # Parse each cleaned essay with spaCy; iterate over values so the index labels do not matter
    data = [nlp(text) for text in data]

    return data
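
To make the cleaning steps easier to follow, here is a minimal sketch of the regex and frequent-word steps applied to plain strings, using only the standard library. It is not the notebook's function: `clean_text` and `drop_frequent` are illustrative names, the sample sentences are made up, lemmatization and spaCy parsing are left out, and tokens are split on any whitespace rather than on single spaces.

```python
import re
from collections import Counter

def clean_text(text):
    """Apply the regex steps from essay_prep to a single string."""
    text = text.replace("\n", " ")
    text = re.sub(r"<[^>]*>", "", text)    # drop HTML tags
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # each run of digits becomes a space
    return text

def drop_frequent(texts, n):
    """Remove the n most common whitespace-separated tokens across all texts."""
    counts = Counter(word for text in texts for word in text.split())
    frequent = {word for word, _ in counts.most_common(n)}
    return [" ".join(w for w in text.split() if w not in frequent) for text in texts]

# Made-up sample essays
samples = [
    "I love <b>Coffee</b> and books!",
    "I love hiking,\nand I have 2 dogs.",
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
print(drop_frequent(cleaned, n=3))
```

In this toy corpus the three most frequent tokens are "i", "love", and "and", so only the more distinctive words survive the second step.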

In [16]:
essays_cols = df.columns.to_list()[6:16]

for col in essays_cols:
    print([col])
    df[col] = essay_prep(df[col])

['my_self']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['life']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['good_at']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['people_notice']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['favorites']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['six_needed']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['think_to']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['friday_night']
regex applied
remove_numbers applied
remove_frequent_words applied
lemmatize_words applied
nlp data applied
['private_admit']
regex ap

### Finding the best match
#### Based on responses to essay questions
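
The scores in this section come from spaCy's `Doc.similarity`, which by default is the cosine similarity between the two documents' vectors (each one the average of its token vectors). The sketch below reproduces that idea with made-up 3-dimensional vectors and the standard library only. `cosine` and `essay_match` are illustrative helpers, not part of spaCy or of this notebook, and returning 0.0 for an all-zero vector is an assumption made here to handle empty essays.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up document vectors: two essay questions, three users each
essays = {
    "my_self": [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]],
    "life":    [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]],
}
userid = 0  # reference user

def essay_match(candidate):
    """Average per-essay similarity between the reference user and a candidate."""
    scores = [cosine(vecs[userid], vecs[candidate]) for vecs in essays.values()]
    return sum(scores) / len(scores)

# Rank all users from most to least similar to the reference user
ranking = sorted(range(3), key=essay_match, reverse=True)
print(ranking)
```

The reference user always ranks first against themself, so a real ranking needs to exclude that row unless another filter already removes it.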

In [17]:
userid = 0
essays_cols = df.columns.to_list()[6:16]

def find_best_match(data):
    essays_cols = data.columns.to_list()[6:16]
    score_cols = [str(col) + '_score' for col in essays_cols]

    def userid_v_others(docs):
        # Similarity of the reference user's essay to every user's essay
        reference = docs.iloc[userid]
        return [reference.similarity(doc) for doc in docs]

    for col, score_col in zip(essays_cols, score_cols):
        data[score_col] = userid_v_others(data[col])

    def orientation(data):
        # Keep only profiles compatible with the reference user's orientation and sex
        user_orientation = data.orientation.iloc[userid]
        user_sex = data.sex.iloc[userid]
        if user_orientation == 'straight' and user_sex == 'm':
            return data[(data.orientation == 'straight') & (data.sex == 'f')]
        elif user_orientation == 'straight' and user_sex == 'f':
            return data[(data.orientation == 'straight') & (data.sex == 'm')]
        elif user_orientation == 'gay' and user_sex == 'm':
            return data[(data.orientation == 'gay') & (data.sex == 'm')]
        elif user_orientation == 'gay' and user_sex == 'f':
            return data[(data.orientation == 'gay') & (data.sex == 'f')]
        elif user_orientation == 'bisexual':
            return data[data.orientation == 'bisexual']
        return data

    # Work on a copy so adding a column does not raise SettingWithCopyWarning
    data = orientation(data).copy()

    # Calculate the average score across the ten essay similarity columns
    data['essay_match'] = data[score_cols].mean(axis=1)

    # Keep the 5 highest matches (head() returns 5 rows by default)
    data = data.sort_values(['essay_match'], ascending=False).head()

    return data
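
The orientation filter above can also be written as a lookup table, which keeps the rules in one place. The following is a minimal sketch on a made-up DataFrame, not the notebook's implementation: `COMPATIBLE` and `candidates` are hypothetical names, only the straight and gay rules are covered (any other orientation falls through and keeps everyone), and excluding the reference user from their own results is an added assumption.

```python
import pandas as pd

# Made-up profiles; column names follow the dataset
toy = pd.DataFrame({
    "sex": ["m", "f", "f", "m", "m"],
    "orientation": ["straight", "straight", "gay", "gay", "gay"],
    "score": [1.0, 0.8, 0.4, 0.6, 0.7],
})

# (orientation, sex) of the reference user -> (orientation, sex) of candidates
COMPATIBLE = {
    ("straight", "m"): ("straight", "f"),
    ("straight", "f"): ("straight", "m"),
    ("gay", "m"): ("gay", "m"),
    ("gay", "f"): ("gay", "f"),
}

def candidates(data, userid):
    """Return an independent copy of the rows compatible with the reference user."""
    me = data.iloc[userid]
    target = COMPATIBLE.get((me["orientation"], me["sex"]))
    if target is None:
        return data.copy()           # no rule for this user: keep everyone
    mask = (data["orientation"] == target[0]) & (data["sex"] == target[1])
    mask.iloc[userid] = False        # never match the user with themself
    return data[mask].copy()         # a copy, so new columns can be added safely

matches = candidates(toy, userid=3)
matches["rank"] = matches["score"].rank(ascending=False)
print(matches)
```

Returning a copy means new columns can be assigned to the filtered result without pandas warning about a slice, and the original frame is left unchanged.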

In [19]:
best_match = find_best_match(df)

  temp_list.append(data[userid].similarity(data[i]))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['essay_match'] = data.iloc[: , -10:].mean(axis=1)


In [20]:
best_match.head(10)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,my_self,life,good_at,people_notice,favorites,six_needed,think_to,friday_night,private_admit,message_me_if,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,my_self_score,life_score,good_at_score,people_notice_score,favorites_score,six_needed_score,think_to_score,friday_night_score,private_admit_score,message_me_if_score,essay_match
43284,25,average,mostly anything,socially,never,working on ph.d program,"(onequarter, nerd, onequarter, explorer, onesi...","(left, new, zealand, for, california, three, y...","(smiling, raising, one, eyebrow, time, baking,...","(think, this, depends, on, context, in, which,...","(think, taste, book, movie, make, me, out, be,...","(warm, blanket, comfortable, shoe, good, laugh...","(concept, god, religion, science, determinism,...","(doing, something, low, key, like, going, rest...","(do, nt, have, any, secret, often, share, more...","(idea, church, doe, not, send, running, for, n...",indian,63.0,-1,student,2012-06-24-15-14,"berkeley, california",,straight,likes dogs and dislikes cats,catholicism and somewhat serious about it,f,aquarius and it&rsquo;s fun to think about,no,english (fluently),single,0.974263,0.925613,0.804371,0.925844,0.912147,0.674843,0.669407,0.89163,0.807152,0.926472,0.851174
4455,24,,mostly anything,socially,never,graduated from college/university,"(hi, there, really, is, nt, much, for, me, say...","(came, back, from, studying, abroad, uk, miss,...","(finding, positive, in, all, negative, like, s...","(probably, me, laughing, tend, laugh, lot, at,...","(book, complete, work, sherlock, holmes, bridg...","(music, cell, phone, good, food, laughter)","(future, all, possibility, because, they, are,...","(either, at, home, staying, in, watching, good...","(like, sing, mostly, myself, tend, have, song,...","(want, meet, somebody, new, just, want, chat, ...",asian,60.0,-1,other,2012-06-28-23-18,"rodeo, california",,straight,likes dogs and likes cats,catholicism but not too serious about it,f,pisces and it&rsquo;s fun to think about,no,"english (fluently), spanish (okay)",single,0.982187,0.910961,0.775361,0.919054,0.920145,0.82958,0.550076,0.913259,0.773535,0.92672,0.850088
42377,29,fit,mostly vegetarian,socially,,graduated from college/university,"(an, interview, interviewer, thanks, for, comi...","(so, past, you, ve, written, some, poetry, her...","(where, do, you, have, some, talent, situation...","(what, s, first, thing, going, notice, when, l...","(list, list, list, i, m, just, going, come, ou...","(let, just, say, that, you, already, have, foo...","(so, what, s, your, thinkpot, filled, with, th...","(what, s, your, social, life, looking, like, t...","(give, u, some, juicy, dirty, secret, feminist...","(why, should, someone, contact, they, made, it...",white,65.0,30000,other,2012-06-28-14-42,"berkeley, california",,straight,likes dogs and likes cats,other,f,virgo,no,english,single,0.965347,0.881772,0.816727,0.93816,0.899005,0.69355,0.642792,0.92669,0.81597,0.918816,0.849883
57338,35,thin,mostly anything,socially,,graduated from college/university,"(true, story, behind, this, profile, is, one, ...","(trying, take, it, day, by, day, sometimes, se...","(staying, up, all, night, thinking, about, eve...","(have, redhairor, auburn, be, exact, look, you...","(joe, meno, hunter, s, thompson, chuck, palahn...","(coffee, soft, comforter, good, conversation, ...","(oh, please, find, switch, calm, worried, head...","(why, must, it, be, friday, how, about, tuesda...","(am, an, open, book, but, you, will, have, tak...","(can, hold, conversation, make, me, laugh, wil...",white,67.0,30000,medicine / health,2012-06-30-20-19,"san mateo, california",,straight,likes dogs and likes cats,christianity but not too serious about it,f,cancer and it&rsquo;s fun to think about,,"english (fluently), german (okay), japanese (p...",single,0.984959,0.927562,0.822452,0.944342,0.924152,0.628592,0.605569,0.916261,0.807689,0.936935,0.849851
54274,34,fit,,socially,,graduated from college/university,"(hm, tigertamer, synchronized, swimming, chore...","(making, thing, all, kind, head, with, hand, f...","(playing, thinking, way, remind, u, that, we, ...","(spectacular, dog, by, side, no, do, nt, know,...","(read, way, le, than, wish, did, these, day, b...","(music, good, black, ink, pen, preferably, pil...","(accept, address, appreciate, that, art, is, s...","(driving, up, route, good, songlist, notebook,...","(turned, off, by, men, who, use, term, baby, d...","(need, someone, make, song, scene, picture, st...",white,64.0,-1,artistic / musical / writer,2012-06-30-01-18,"mill valley, california",,straight,has dogs,agnosticism,f,pisces and it&rsquo;s fun to think about,no,"english, french (okay)",single,0.97901,0.920432,0.803636,0.929837,0.871048,0.691017,0.64164,0.914959,0.809861,0.930192,0.849163
