
In [1]:
!pip install kmodes

Collecting kmodes
  Downloading https://files.pythonhosted.org/packages/b2/55/d8ec1ae1f7e1e202a8a4184c6852a3ee993b202b0459672c699d0ac18fc8/kmodes-0.10.2-py2.py3-none-any.whl
Installing collected packages: kmodes
Successfully installed kmodes-0.10.2


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import datetime
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes

In [3]:
missing_values = ["n/a", "na", "--", '-1', "NaN"]
profile_data = pd.read_csv('/content/drive/MyDrive/profiles.csv', encoding = 'utf8', na_values = missing_values)

In [4]:
profile_data.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

In [6]:
profile_data.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...","books:<br />\nabsurdistan, the republic, of mi...",food.<br />\nwater.<br />\ncell phone.<br />\n...,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet!<br />\nyou...,"asian, white",75.0,,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories.<br /...,,,i am very open and will share just about anyth...,,white,70.0,80000.0,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement<br />\nconversation<br />\ncreation<b...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ...",,68.0,,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . .<br />\nlynch, j...",,cats and german philosophy,,,you feel so inclined.,white,71.0,20000.0,student,2012-06-28-14-22,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians<br />\nat the...",,,,,,"asian, black, other",66.0,,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [5]:
# Suppressing the chained-assignment warning (SettingWithCopyWarning).
pd.set_option('mode.chained_assignment', None)

In [7]:
profile_data = profile_data.dropna()

-----------------------

Profile Demographics:

In [8]:
'''
Creating a date-time object from the feature: last_online.
'''
profile_data['last_online'] = pd.to_datetime(profile_data.last_online, format='%Y-%m-%d-%H-%M')

'''
Splitting DATETIME instance of last_online into Date AND Time.
'''
profile_data['last_date_online'] = pd.to_datetime(profile_data['last_online']).dt.date
profile_data['last_time_online'] = pd.to_datetime(profile_data['last_online']).dt.time

'''
Splitting date into Day, Month, and Year.
'''
profile_data['year'] = pd.to_datetime(profile_data['last_date_online']).dt.year
profile_data['month'] = pd.to_datetime(profile_data['last_date_online']).dt.month
profile_data['day'] = pd.to_datetime(profile_data['last_date_online']).dt.day
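
The cell above calls pd.to_datetime again on columns that have already been parsed. As a minimal sketch, assuming the same '%Y-%m-%d-%H-%M' format and a small hypothetical frame named sample standing in for profile_data, the same derived columns can come from a single parse and the .dt accessor:

```python
import pandas as pd

# Hypothetical two-row stand-in for profile_data; values mimic the last_online format.
sample = pd.DataFrame({"last_online": ["2012-06-28-20-30", "2012-05-28-15-18"]})

# Parse the strings once, then reuse the parsed Series for every derived column.
parsed = pd.to_datetime(sample["last_online"], format="%Y-%m-%d-%H-%M")
sample["last_date_online"] = parsed.dt.date
sample["last_time_online"] = parsed.dt.time
sample["year"] = parsed.dt.year
sample["month"] = parsed.dt.month
sample["day"] = parsed.dt.day

print(sample[["year", "month", "day"]])
```

Parsing once keeps the intermediate datetime Series in one place, so the date, time, and calendar parts cannot drift apart.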

--------------------------------------

In [9]:
'''
Separating location into CITY and STATE.
'''
location_separated = profile_data["location"].str.split(", ", n = 1, expand = True)
profile_data['city'] = location_separated[0]
profile_data['state'] = location_separated[1]

-------------------------------------------

In [10]:
'''
Retaining the religion, no other information.
'''
profile_data['religion'] = profile_data['religion'].str.split(' ').str[0]

---------------------

In [11]:
'''
Retaining just the zodiac sign, no other information.
'''
profile_data['sign'] = profile_data['sign'].str.split(' ').str[0]

----------------

In [12]:
'''
Exploding language into multiple columns, each column containing one language.
'''
languages_separated = profile_data['speaks'].str.split(", ", n = 5, expand = True)

profile_data['language_1'] = languages_separated[0]
profile_data['language_2'] = languages_separated[1]
profile_data['language_3'] = languages_separated[2]
profile_data['language_4'] = languages_separated[3]
profile_data['language_5'] = languages_separated[4]

'''
Converting type to STRING:
'''
profile_data['language_1']=profile_data['language_1'].apply(str)
profile_data['language_2']=profile_data['language_2'].apply(str)
profile_data['language_3']=profile_data['language_3'].apply(str)
profile_data['language_4']=profile_data['language_4'].apply(str)
profile_data['language_5']=profile_data['language_5'].apply(str)

'''
Eliminating fluency levels in languages to obtain one single language.
'''
# Importing re package for using regular expressions
import re

# Function to clean the names
def language_names(language_name):
    # Search for an opening bracket followed by any characters repeated any number of times
    match = re.search(r'\(.*', language_name)
    if match:
        # Return the name up to the beginning of the pattern
        return language_name[:match.start()]
    else:
        # If no clean up is needed, return the same name
        return language_name
          
'''
Updating dataframes with clean text.
'''
profile_data['language_1'] = profile_data['language_1'].apply(language_names) 
profile_data['language_2'] = profile_data['language_2'].apply(language_names) 
profile_data['language_3'] = profile_data['language_3'].apply(language_names) 
profile_data['language_4'] = profile_data['language_4'].apply(language_names) 
profile_data['language_5'] = profile_data['language_5'].apply(language_names) 

'''
Getting rid of SPACE post eliminating FLUENCY.
'''
profile_data['language_1'] = profile_data['language_1'].str.replace(" ", "")
profile_data['language_2'] = profile_data['language_2'].str.replace(" ", "")
profile_data['language_3'] = profile_data['language_3'].str.replace(" ", "")
profile_data['language_4'] = profile_data['language_4'].str.replace(" ", "")
profile_data['language_5'] = profile_data['language_5'].str.replace(" ", "")
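
The per-column steps above (split, strip the fluency level, remove the leftover space) can also be expressed with vectorized string methods. This is only a sketch on a hypothetical speaks Series, not the notebook's method. Unlike the cells above, it leaves missing languages as missing values instead of converting them to strings, and it trims surrounding whitespace rather than removing every space:

```python
import pandas as pd

# Hypothetical sample of the 'speaks' column.
speaks = pd.Series([
    "english (fluently), spanish (poorly)",
    "english, c++",
    "english",
])

# One column per language; rows with fewer languages get missing values.
languages = speaks.str.split(", ", expand=True)

# Drop everything from the first "(" onward, then trim surrounding whitespace.
cleaned = languages.apply(
    lambda col: col.str.replace(r"\s*\(.*", "", regex=True).str.strip()
)

print(cleaned)
```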

---------

In [13]:
'''
Cleaning the column offspring by replacing the HTML entity &rsquo; with an apostrophe.
'''
profile_data["offspring"] = profile_data['offspring'].str.replace("&rsquo;", "'")

-----

In [14]:
# Converting height from inches to centimeters.
profile_data['height'] = profile_data['height'] * 2.54

--------

In [15]:
profile_data = profile_data.drop(columns = (['last_online', 'location', 'religion', 'sign', 'speaks']))

----------

In [27]:
# list_of_complete_col = []
# list_of_missing_cols = []
# for cols in profile_data.columns:
#   if profile_data[cols].isna().sum() > 0:
#     list_of_missing_cols.append(cols)
#   else:
#     list_of_complete_col.append(cols)

In [20]:
# print('Features with complete data: {}'.format(list_of_complete_col))

Features with complete data: ['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job', 'offspring', 'orientation', 'pets', 'sex', 'smokes', 'status', 'last_date_online', 'last_time_online', 'year', 'month', 'day', 'city', 'state', 'language_1', 'language_2', 'language_3', 'language_4', 'language_5']


In [21]:
# print('Features with missing values: {}'.format(list_of_missing_cols))

Features with missing values: []


-----------------------------

Profile Essays:

In [24]:
'''
preprocess_text: Function to strip HTML tags, newline characters and apostrophes from text.
params: data_frame - DataFrame whose columns all contain text.
'''
def preprocess_text(data_frame):
  for cols in data_frame.columns:
    data_frame[cols] = data_frame[cols].str.replace('<[^<]+?>', ' ', regex = True)
    data_frame[cols] = data_frame[cols].str.replace('\n', ' ')
    data_frame[cols] = data_frame[cols].str.replace('\'', '')

  return data_frame

In [25]:
profile_data[['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
                               'essay8', 'essay9']] = preprocess_text(profile_data[['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
                               'essay8', 'essay9']])
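
To see what the three replacements in preprocess_text do, here is a minimal sketch on one hypothetical essay string. The regex flag is passed explicitly because the default of Series.str.replace has differed across pandas versions:

```python
import pandas as pd

# Hypothetical essay text with HTML line breaks, newlines and an apostrophe.
sample = pd.Series(["about me:<br />\n<br />\ni don't bite"])

cleaned = (
    sample.str.replace(r"<[^<]+?>", " ", regex=True)  # HTML tags -> space
          .str.replace("\n", " ", regex=False)         # newlines -> space
          .str.replace("'", "", regex=False)           # apostrophes removed
)

print(cleaned.iloc[0].split())
```

Note that the tags and newlines become spaces, so runs of several spaces remain in the cleaned text; CountVectorizer tokenization is not affected by them.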

----------

Topic Modeling:

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')

In [30]:
# doc_term_matrix = count_vect.fit_transform(profile_data['essay9'].values.astype('U'))

Each of the 1424 documents is represented as a 2279-dimensional vector, which means that our vocabulary has 2279 words. Next, we will use LDA to create topics, along with a probability distribution over the words in our vocabulary for each topic.
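
The relationship between documents, vector dimension and vocabulary size can be checked on a toy corpus. This is a sketch with hypothetical documents and default CountVectorizer settings (the notebook additionally sets max_df, min_df and stop_words):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the essay9 responses.
docs = [
    "message me if you like hiking",
    "message me if you like cooking",
    "you enjoy hiking and cooking",
]

toy_vect = CountVectorizer()
toy_matrix = toy_vect.fit_transform(docs)

# Rows are documents, columns are vocabulary terms.
n_docs, n_terms = toy_matrix.shape
print(n_docs, n_terms)
print(len(toy_vect.vocabulary_))  # same number as the column count
```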

In [38]:
from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=5, random_state=42)

In [None]:
# LDA.fit(doc_term_matrix)

In the cell above I used the LatentDirichletAllocation class from the sklearn.decomposition module to perform LDA on our document-term matrix. The parameter n_components specifies the number of categories, or topics, that we want our text to be divided into.
Let's randomly fetch words from our vocabulary. The count vectorizer holds all the words in our vocabulary: get_feature_names() returns them as a list, which we can index with the ID of the word that we want to fetch.
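
A minimal sketch of that lookup on a hypothetical toy corpus follows. Newer scikit-learn releases expose get_feature_names_out() in place of get_feature_names(), so the sketch uses whichever one the installed version provides:

```python
import random
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; any list of strings works.
docs = ["you like hiking", "you like cooking", "message me about music"]
vect = CountVectorizer().fit(docs)

# Use the method available in the installed scikit-learn version.
if hasattr(vect, "get_feature_names_out"):
    feature_names = list(vect.get_feature_names_out())
else:
    feature_names = vect.get_feature_names()

# randrange excludes the upper bound, so the ID is always a valid index.
random_id = random.randrange(len(feature_names))
print(random_id, feature_names[random_id])
```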

In [None]:
# import random

# for i in range(10):
#     random_id = random.randrange(len(count_vect.get_feature_names()))
#     print(count_vect.get_feature_names()[random_id])

In [43]:
# first_topic = LDA.components_[0]

In [None]:
# first_topic

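The first topic contains one weight for each of the 2279 words in the vocabulary. These weights are unnormalized; dividing them by their sum gives the word probabilities for topic 1, and the ranking of the words is the same either way. To sort the indexes according to these values, we can use the argsort() function. Because argsort() sorts in ascending order, the 10 words with the highest values end up in the last 10 positions of the sorted array.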
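
The argsort() step can be illustrated with a small hypothetical weight vector (six words instead of the full vocabulary):

```python
import numpy as np

# Hypothetical weights of a 6-word vocabulary for one topic.
weights = np.array([0.10, 2.50, 0.30, 4.00, 1.20, 0.05])

order = weights.argsort()   # indexes sorted by ascending weight
top3 = order[-3:]           # the three largest weights sit at the end
top3_desc = top3[::-1]      # reversed, so the strongest word comes first

print(top3)       # [4 1 3]
print(top3_desc)  # [3 1 4]
```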

In [45]:
# top_topic_words = first_topic.argsort()[-10:]

In [46]:
# top_topic_words

array([2229, 1518,  834, 1245, 1167, 2187,  833, 1875, 1219, 2280])

These indexes can then be used to retrieve the corresponding words from the count_vect object.

In [None]:
# for i in top_topic_words:
#     print(count_vect.get_feature_names()[i])

These are the top 10 words obtained for the prompt 'You should message me if...'. From them we can infer that the first topic might be about users wanting to be messaged by people who are looking to have fun or who value a sense of humor.

Printing the 10 words with highest probabilities for all the five topics:

In [None]:
# for i,topic in enumerate(LDA.components_):
#     print(f'Top 10 words for topic #{i}:')
#     print([count_vect.get_feature_names()[i] for i in topic.argsort()[-10:]])
#     print('\n')

The five topics above summarize what users want other users to start a conversation about, or what others should be looking for before messaging or matching with them.

As a final step, we will add a column to the original data frame that stores the most probable topic for each text. To do so, we can use the LDA.transform() method and pass it our document-term matrix. This method returns the probability of every topic for each document, and taking the argmax of each row gives the topic with the highest probability.
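
A minimal sketch of this step on a hypothetical toy corpus with two topics (the notebook uses five) shows the shape of the transform() output and how argmax picks one topic per document:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus standing in for one essay column.
docs = [
    "you like hiking and camping",
    "you like cooking and baking",
    "hiking camping and climbing",
    "baking bread and cooking pasta",
]

toy_matrix = CountVectorizer().fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_matrix)

# One row per document, one column per topic.
doc_topics = toy_lda.transform(toy_matrix)
print(doc_topics.shape)

# Each row is a distribution over topics, so it sums to roughly 1.
print(np.round(doc_topics.sum(axis=1), 6))

# Index of the most probable topic for each document.
print(doc_topics.argmax(axis=1))
```

Keeping only the argmax discards how confident the assignment is; the full row of probabilities can be stored instead if that matters downstream.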

In [49]:
# topic_values = LDA.transform(doc_term_matrix)
# topic_values.shape

(1424, 5)

In [50]:
# profile_data['essay9_topic'] = topic_values.argmax(axis=1)

In [51]:
# profile_data.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,offspring,orientation,pets,sex,smokes,status,last_date_online,last_time_online,year,month,day,city,state,language_1,language_2,language_3,language_4,language_5,essay9_topic
94,29,fit,mostly anything,socially,sometimes,graduated from college/university,"my names josh, and i create art for a living. ...",living it,everything,i honestly couldnt say....,-books: anything joseph campbell - osho - terr...,invalid question,the world,out.,no,youre curious.,white,170.18,40000.0,artistic / musical / writer,doesn't want kids,straight,likes dogs and likes cats,m,no,single,2012-05-28,15:18:00,2012,5,28,san francisco,california,english,,,,,0
113,23,curvy,mostly anything,rarely,never,working on college/university,"hey im angel, heres a little about myself. i ...",discovering and exploring! finding myself. ta...,"art, math, learnin things fast, puzzles and br...",id say my hair and my eyes. i would say my smi...,favorite books are: my life as a teenage fairy...,"uumm, well music thats one, i love food but th...","everything; the world, what i can do for fun, ...",home happy to have some quite and a good movie...,sometimes i like to sit in the shower.. :) th...,"you are chill, nice, down to earth, educated, ...","black, native american, hispanic / latin",167.64,20000.0,other,has a kid,straight,likes dogs and likes cats,f,when drinking,single,2012-06-28,18:40:00,2012,6,28,san francisco,california,english,,,,,0
123,21,thin,strictly anything,socially,often,working on space camp,"ill-matic, drastically fantastic, orgasmic, in...",enjoying it.,"producing music, entertaining, struggling with...","hair or clothes, or that im longboarding.","the art of war, fear and loathing in las vegas...",i dont need a god damn thing.,everything. also your mom.,doing the most.,im probably down for a lot more than you realize.,you feel so inclined.,"hispanic / latin, white",177.8,1000000.0,medicine / health,"doesn't have kids, and doesn't want any",straight,likes dogs,m,sometimes,single,2012-06-29,17:01:00,2012,6,29,san francisco,california,english,,,,,0
137,50,average,mostly anything,often,never,college/university,i am a good guy looking to find that someone s...,working hard and learning about myself.,cooking on the grill!,my smile,"i dont read a lot, mostly mags and trade journ...",this is good first date stuff...,not enough room here,unwinding from the week with a glass of wine &...,"if you ask, i will answer!",you have a great smile!!! are not too uptight...,white,185.42,80000.0,transportation,doesn't want kids,straight,has cats,m,no,single,2012-06-29,21:54:00,2012,6,29,benicia,california,english,,,,,2
167,26,curvy,mostly anything,socially,never,working on college/university,i am strong woman who loves to dance (even tho...,"so, i work part time in sales while attending ...",i am really good at boxing so if any luck lady...,my hair . . . oddly enough . . . my friends jo...,remedios: stories of eatrh and iron from the h...,my sorority sisters (lmao dont hate!) car my...,about my film. its kind of taken up a huge prt...,i ususally just kick it with my friends and ha...,i am really awkward when it comes to the club ...,just message me if you feel like it. its chill...,"hispanic / latin, white",160.02,20000.0,sales / marketing / biz dev,"doesn't have kids, but might want them",gay,likes dogs and likes cats,f,no,single,2012-06-23,23:10:00,2012,6,23,berkeley,california,english,,,,,1


Wrapping topic modeling in a function:

In [None]:
count_vect = CountVectorizer(max_df = 0.8, min_df = 2, stop_words = 'english')
LDA = LatentDirichletAllocation(n_components = 5, random_state = 42)

In [56]:
def topic_modeling(data_frame, cols, count_vectorizer, lda):
  '''
  Function: To perform topic modeling on one text column of a df.
  Params: data_frame - Dataframe as an input.
          cols - name of the column to obtain topics for.
          count_vectorizer - CountVectorizer used to build the document-term matrix.
          lda - LatentDirichletAllocation model to fit on that matrix.
  Returns: copy of data_frame with an added '<cols>_new' column holding each row's most probable topic.
  '''

  og_data_frame = data_frame.copy()

  # Obtaining the document-term matrix with count_vectorizer:
  dt_matrix = count_vectorizer.fit_transform(data_frame[cols].values.astype('U'))

  # Fitting LDA and getting the topic distribution of every document:
  lda.fit(dt_matrix)
  topic_values = lda.transform(dt_matrix)

  # Storing the most probable topic for each document:
  og_data_frame[str(cols)+'_new'] = topic_values.argmax(axis=1)

  return og_data_frame

In [62]:
asd = profile_data
for column in profile_data[['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
                               'essay8', 'essay9']]:
  asd = topic_modeling(asd, column, count_vect, LDA)

In [None]:
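# essay9_topic only exists if the earlier essay9 cells were run, so a missing column is ignored.
asd = asd.drop(columns = ['essay9_topic'], errors = 'ignore')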

In [69]:
asd.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,offspring,orientation,pets,sex,smokes,status,last_date_online,last_time_online,year,month,day,city,state,language_1,language_2,language_3,language_4,language_5,essay0_new,essay1_new,essay2_new,essay3_new,essay4_new,essay5_new,essay6_new,essay7_new,essay8_new,essay9_new
94,29,fit,mostly anything,socially,sometimes,graduated from college/university,"my names josh, and i create art for a living. ...",living it,everything,i honestly couldnt say....,-books: anything joseph campbell - osho - terr...,invalid question,the world,out.,no,youre curious.,white,170.18,40000.0,artistic / musical / writer,doesn't want kids,straight,likes dogs and likes cats,m,no,single,2012-05-28,15:18:00,2012,5,28,san francisco,california,english,,,,,2,4,0,3,4,1,3,0,0,0
113,23,curvy,mostly anything,rarely,never,working on college/university,"hey im angel, heres a little about myself. i ...",discovering and exploring! finding myself. ta...,"art, math, learnin things fast, puzzles and br...",id say my hair and my eyes. i would say my smi...,favorite books are: my life as a teenage fairy...,"uumm, well music thats one, i love food but th...","everything; the world, what i can do for fun, ...",home happy to have some quite and a good movie...,sometimes i like to sit in the shower.. :) th...,"you are chill, nice, down to earth, educated, ...","black, native american, hispanic / latin",167.64,20000.0,other,has a kid,straight,likes dogs and likes cats,f,when drinking,single,2012-06-28,18:40:00,2012,6,28,san francisco,california,english,,,,,2,4,0,2,3,2,1,3,2,0
123,21,thin,strictly anything,socially,often,working on space camp,"ill-matic, drastically fantastic, orgasmic, in...",enjoying it.,"producing music, entertaining, struggling with...","hair or clothes, or that im longboarding.","the art of war, fear and loathing in las vegas...",i dont need a god damn thing.,everything. also your mom.,doing the most.,im probably down for a lot more than you realize.,you feel so inclined.,"hispanic / latin, white",177.8,1000000.0,medicine / health,"doesn't have kids, and doesn't want any",straight,likes dogs,m,sometimes,single,2012-06-29,17:01:00,2012,6,29,san francisco,california,english,,,,,3,4,1,2,1,3,1,4,0,0
137,50,average,mostly anything,often,never,college/university,i am a good guy looking to find that someone s...,working hard and learning about myself.,cooking on the grill!,my smile,"i dont read a lot, mostly mags and trade journ...",this is good first date stuff...,not enough room here,unwinding from the week with a glass of wine &...,"if you ask, i will answer!",you have a great smile!!! are not too uptight...,white,185.42,80000.0,transportation,doesn't want kids,straight,has cats,m,no,single,2012-06-29,21:54:00,2012,6,29,benicia,california,english,,,,,2,2,1,0,3,1,2,1,1,2
167,26,curvy,mostly anything,socially,never,working on college/university,i am strong woman who loves to dance (even tho...,"so, i work part time in sales while attending ...",i am really good at boxing so if any luck lady...,my hair . . . oddly enough . . . my friends jo...,remedios: stories of eatrh and iron from the h...,my sorority sisters (lmao dont hate!) car my...,about my film. its kind of taken up a huge prt...,i ususally just kick it with my friends and ha...,i am really awkward when it comes to the club ...,just message me if you feel like it. its chill...,"hispanic / latin, white",160.02,20000.0,sales / marketing / biz dev,"doesn't have kids, but might want them",gay,likes dogs and likes cats,f,no,single,2012-06-23,23:10:00,2012,6,23,berkeley,california,english,,,,,1,2,1,3,2,1,0,1,4,1
