In recent years, the use of dating apps to find love has risen sharply. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, you will analyze data from OKCupid, an app that uses multiple-choice and short-answer questions to match users.

You will also create a presentation about your findings from this OKCupid dataset.

The purpose of this project is to practice formulating questions and implementing machine learning techniques to answer those questions. However, the questions you ask and how you answer them are entirely up to you.

We’re excited to see the different topics you explore.

Project Objectives:
- Complete a project to add to your portfolio
- Use Jupyter Notebook to communicate findings
- Build, train, and evaluate a machine learning model

Prerequisites:
- Natural Language Processing
- Supervised Machine Learning
- Unsupervised Machine Learning


The dataset provided has the following columns of multiple-choice data:

- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- offspring
- orientation
- pets
- religion
- sex
- sign
- smokes
- speaks
- status

And a set of open short-answer responses to:

- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, shows, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import re

In [2]:
df = pd.read_csv('profiles.csv')

In [3]:
# Work with the first 1,000 profiles only
df = df[:1000]

In [4]:
pd.options.display.max_columns = 100

In [5]:
df.head(2)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...","books:<br />\nabsurdistan, the republic, of mi...",food.<br />\nwater.<br />\ncell phone.<br />\n...,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet!<br />\nyou...,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories.<br /...,,,i am very open and will share just about anyth...,,white,70.0,80000,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single


In [6]:
df.shape

(1000, 31)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          1000 non-null   int64  
 1   body_type    926 non-null    object 
 2   diet         626 non-null    object 
 3   drinks       946 non-null    object 
 4   drugs        753 non-null    object 
 5   education    884 non-null    object 
 6   essay0       889 non-null    object 
 7   essay1       865 non-null    object 
 8   essay2       851 non-null    object 
 9   essay3       813 non-null    object 
 10  essay4       813 non-null    object 
 11  essay5       800 non-null    object 
 12  essay6       769 non-null    object 
 13  essay7       772 non-null    object 
 14  essay8       662 non-null    object 
 15  essay9       791 non-null    object 
 16  ethnicity    892 non-null    object 
 17  height       1000 non-null   float64
 18  income       1000 non-null   int64  
 19  job    
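The non-null counts in the `df.info()` output show that several columns are sparsely filled. As a rough illustration, the sketch below turns the counts listed above into missing-value shares for the 1,000-row sample. The dictionary and variable names are made up for this example.

```python
# Non-null counts copied from the df.info() output for the 1,000-row sample
TOTAL_ROWS = 1000
non_null = {
    "body_type": 926, "diet": 626, "drinks": 946, "drugs": 753,
    "education": 884, "essay0": 889, "essay1": 865, "essay2": 851,
    "essay3": 813, "essay4": 813, "essay5": 800, "essay6": 769,
    "essay7": 772, "essay8": 662, "essay9": 791, "ethnicity": 892,
}

# Share of rows with no value in each column
missing_share = {col: (TOTAL_ROWS - n) / TOTAL_ROWS for col, n in non_null.items()}

# Rank columns from most to least incomplete
ranked = sorted(missing_share.items(), key=lambda item: item[1], reverse=True)
for col, share in ranked[:5]:
    print(f"{col:<10} {share:.1%}")
```

On a DataFrame that still contains its missing values, `df.isna().mean()` gives the same per-column shares directly.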

In [8]:
# Give the essay columns descriptive names
df.rename(columns={'essay0': 'my_self', 'essay1': 'life', 'essay2': 'good_at', 'essay3': 'people_notice',
                   'essay4': 'favorites', 'essay5': 'six_needed', 'essay6': 'think_to', 'essay7': 'friday_night',
                   'essay8': 'private_admit', 'essay9': 'message_me_if'}, inplace=True)

In [9]:
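# Replace missing values with empty strings so the string cleaning and tokenizing steps always receive text
df.fillna('', axis=0, inplace=True)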

#### Preprocess Text

In [10]:
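# Requires the NLTK 'stopwords' and 'punkt' resources (newer NLTK releases use 'punkt_tab');
# fetch them once with nltk.download(...) if they are missing
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))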

In [11]:
# Save the names of the ten essay columns (positions 6 through 15) in a list
essays_cols = df.columns.to_list()[6:16]

# Remove newlines, HTML tags, and punctuation from the essay columns, then lowercase the text
for col in essays_cols:
    df[col] = df[col].str.replace("\n", " ", regex=False)
    df[col] = df[col].str.replace(r"<[^>]*>", "", regex=True)
    df[col] = df[col].str.replace(r'[^\w\s]', '', regex=True)
    df[col] = df[col].str.lower()
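To see what these four steps do to a single response, here is a minimal sketch that applies the same replacements to one string with the standard library. The function name `clean_essay` and the sample text are hypothetical. The `html.unescape` call is an extra step not present in the cell above: without it, an entity such as `&rsquo;` loses only its `&` and `;` in the punctuation pass, so `doesn&rsquo;t` would become `doesnrsquot`.

```python
import html
import re


def clean_essay(text: str) -> str:
    """Apply the cleaning steps from the cell above to one string."""
    text = html.unescape(text)           # extra step (assumption): decode entities like &rsquo;
    text = text.replace("\n", " ")       # newlines become spaces
    text = re.sub(r"<[^>]*>", "", text)  # drop HTML tags such as <br />
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return text.lower()


# Hypothetical sample in the style of the essay fields
sample = "Making people laugh.<br />\nDoesn&rsquo;t mind rain!"
print(clean_essay(sample))
```

Whether decoding entities first is worthwhile depends on how often they appear in the essays; it is shown here only as an option.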

#### Tokenize Essay Fields

In [12]:
# Tokenize each essay and drop English stop words
for col in essays_cols:
    df[col] = [
        [w for w in word_tokenize(row) if w not in stop_words]
        for row in df[col]
    ]

In [13]:
df.head(1)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,my_self,life,good_at,people_notice,favorites,six_needed,think_to,friday_night,private_admit,message_me_if,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,"[would, love, think, kind, intellectual, eithe...","[currently, working, international, agent, fre...","[making, people, laugh, ranting, good, salting...","[way, look, six, foot, half, asian, half, cauc...","[books, absurdistan, republic, mice, men, book...","[food, water, cell, phone, shelter]","[duality, humorous, things]","[trying, find, someone, hang, anything, except...","[new, california, looking, someone, wisper, se...","[want, swept, feet, tired, norm, want, catch, ...","asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single


#### Find out how much of a match someone is based on their response to my_self
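One simple way to score how closely two tokenized `my_self` responses match is cosine similarity over word counts. The sketch below is only an illustration of that idea under stated assumptions, not necessarily the approach taken in this notebook. The function name and the two token lists are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both lists contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty essay is treated as matching nothing
    return dot / (norm_a * norm_b)


# Hypothetical token lists, shaped like the tokenized my_self column
me = ["love", "hiking", "cooking", "music"]
other = ["love", "music", "travel"]
print(round(cosine_similarity(me, other), 3))
```

Scores range from 0 (no shared words) to 1 (identical word distributions). Raw counts weight every word equally; a TF-IDF weighting, for example with the `TfidfVectorizer` imported earlier, would down-weight words that appear in most profiles.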