In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, you will analyze data from OKCupid, an app that matches users based on their answers to multiple-choice and short-answer questions.

You will also create a presentation about your findings from this OKCupid dataset.

The purpose of this project is to practice formulating questions and implementing machine learning techniques to answer those questions. However, the questions you ask and how you answer them are entirely up to you.

We’re excited to see the different topics you explore.

Project Objectives:
- Complete a project to add to your portfolio
- Use Jupyter Notebook to communicate findings
- Build, train, and evaluate a machine learning model

Prerequisites:
- Natural Language Processing
- Supervised Machine Learning
- Unsupervised Machine Learning


The dataset provided has the following columns of multiple-choice data:

- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- offspring
- orientation
- pets
- religion
- sex
- sign
- smokes
- speaks
- status

And a set of open short-answer responses to the following prompts:

- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, shows, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import re
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')

In [34]:
df = pd.read_csv('profiles.csv')

In [35]:
df = df[:1000]

In [4]:
pd.options.display.max_columns = 100

In [5]:
df.shape

(1000, 31)

In [36]:
df.fillna('',axis=0,inplace=True)
df.rename(columns={'essay0': 'my_self', 'essay1': 'life', 'essay2': 'good_at', 'essay3': 'people_notice', 
                         'essay4': 'favorites', 'essay5': 'six_needed', 'essay6': 'think_to', 'essay7': 'friday_night', 
                         'essay8': 'private_admit', 'essay9': 'message_me_if'}, inplace=True)

#### Preprocess Text

In [6]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english')) 

In [37]:
# Save a list with essay columns names
essays_cols = df.columns.to_list()[6:16]

# Remove newlines, HTML tags, and punctuation from the essay columns, then lowercase the text
for col in essays_cols:
    df[col] = df[col].str.replace("\n", " ", regex=False)
    df[col] = df[col].str.replace(r"<[^>]*>", "", regex=True)
    df[col] = df[col].str.replace(r'[^\w\s]', '', regex=True)
    df[col] = df[col].str.lower()

Not every user answered every essay question, so I am going to consolidate all the answers into a single column called essay.

In [38]:
# Join the ten essay columns into a single text field per profile
df['essay'] = df[essays_cols].apply(lambda x: ' '.join(x.astype(str)), axis=1)
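
To see the consolidation step in isolation, here is a minimal sketch on a hypothetical two-column frame (`toy`, with made-up answers, not rows from profiles.csv). It assumes missing answers were already replaced with empty strings, as `fillna('')` does above, and shows both the share of blank answers per prompt and the joined text.

```python
import pandas as pd

# Hypothetical stand-in for the essay columns after fillna('').
toy = pd.DataFrame({
    "my_self": ["i love hiking", "", "coffee and books"],
    "life": ["", "", "working in tech"],
})

# Share of profiles that left each prompt blank.
blank_share = (toy == "").mean()

# Join all answers per profile; strip() drops the padding left by blank answers.
toy["essay"] = (
    toy[["my_self", "life"]]
    .apply(lambda row: " ".join(row.astype(str)), axis=1)
    .str.strip()
)

print(blank_share)
print(toy["essay"].tolist())
```

A profile with no answers at all ends up with an empty essay, which is worth keeping in mind before comparing profiles by text.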

In [10]:
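def remove_frequent_words(data):
    """Drop the 10 most common tokens in the whole Series from every text."""
    cnt = Counter()
    for text in data.values:
        # split(' ') keeps an empty-string token for each repeated space,
        # so '' is counted too and may land in the top 10
        for word in text.split(' '):
            cnt[word] += 1
    FREQWORDS = {w for w, _ in cnt.most_common(10)}
    data = data.apply(
        lambda text: " ".join([word for word in str(text).split(' ') if word not in FREQWORDS]))
    print('remove_frequent_words applied')
    return data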

df['essay'] = remove_frequent_words(df['essay'])
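
One side effect of splitting on a single space is easier to see on a small case: consecutive spaces produce empty-string tokens, and these can end up among the most common "words". The sketch below uses made-up text; `texts` and the cutoff of 2 are illustrative assumptions, not values from this notebook.

```python
from collections import Counter

# Made-up texts containing double spaces, similar to joined essays with blanks.
texts = ["i like  tea", "i  like coffee", "tea  time"]

# split(' ') keeps an empty token for every extra space; split() does not.
by_space = Counter(w for t in texts for w in t.split(" "))
by_whitespace = Counter(w for t in texts for w in t.split())

print(by_space.most_common(2))
print(by_whitespace.most_common(2))
```

If the empty token takes one of the top slots, one fewer real word is removed, so the choice between `split(' ')` and `split()` changes which words count as frequent.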

In [39]:
doc4 = nlp(df.essay[0])
doc5 = nlp(df.essay[1])
doc6 = nlp(df.essay[90])

In [53]:
print(doc4.similarity(doc5))
print(doc4.similarity(doc6))

0.8857345011788965
0.7192977675925847
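
The small `en_core_web_sm` pipeline does not ship static word vectors, so these scores are best read as rough indications. For comparison, a fully transparent baseline is cosine similarity over raw word counts. The sketch below is not what spaCy's `similarity` computes; `cosine_bow` is a hypothetical helper and the sample strings are made up.

```python
import math
from collections import Counter


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    # An empty text has zero norm; treat it as having no similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_bow("hiking and coffee", "coffee and books"))
print(cosine_bow("hiking and coffee", "manga laptop"))
```

Unlike the scores above, this baseline is exactly 0 when two essays share no words, which makes it easier to interpret but blind to synonyms.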



In [54]:
df.essay[90]

'mmmmmm idk what say so just ask me hair manga black tie laptop someone funny ill comeback dont have anything else do really'

remove_frequent_words applied


#### Lemmatize and Clean Essay Fields

In [46]:
from nltk.stem import WordNetLemmatizer

In [47]:
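def lemmatize_words(data):
    """Lemmatize every whitespace-separated word in each text of the Series."""
    # WordNetLemmatizer needs the NLTK 'wordnet' corpus (nltk.download('wordnet'))
    lemmatizer = WordNetLemmatizer()
    data = data.apply(
        lambda text: " ".join([lemmatizer.lemmatize(word) for word in text.split()]))
    print('lemmatize_words applied')
    return data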

In [48]:
df['essay'] = lemmatize_words(df['essay'])

lemmatize_words applied


In [51]:
def remove_numbers(data):
    number_pattern = r'\d+'
    data = data.apply(
        lambda text: re.sub(pattern=number_pattern, repl=" ", string=text))
    print('remove_numbers applied')
    return data

In [52]:
df['essay'] = remove_numbers(df['essay'])

remove_numbers applied
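
Replacing each number with a space leaves runs of whitespace behind. A minimal sketch of what that looks like and one way to collapse it, using a made-up sentence; the extra `\s+` pass is an optional addition, not a step this notebook performs.

```python
import re

text = "born in 1985 moved 2 times"

# Same substitution as remove_numbers: each run of digits becomes one space.
no_digits = re.sub(r"\d+", " ", text)

# Optional follow-up: collapse repeated whitespace and trim the ends.
collapsed = re.sub(r"\s+", " ", no_digits).strip()

print(repr(no_digits))
print(repr(collapsed))
```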


#### Find out how much of a match someone is based on their response to my_self