<a href="https://colab.research.google.com/github/mrinaaall/OkCupid/blob/main/OkCupid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import datetime
from sklearn.cluster import KMeans

In [86]:
missing_values = ["n/a", "na", "--", '-1', "NaN"]
profile_data = pd.read_csv('/content/drive/MyDrive/profiles.csv', encoding = 'utf8', na_values = missing_values)
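A small sketch of what the `na_values` list does at load time, using an in-memory CSV instead of `profiles.csv`. The two-row table and the name `toy_profiles` are made up for illustration. The point is that `-1` ("rather not say" for income) is read as missing, which is why `income` ends up with so many NaN entries.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for profiles.csv.
csv_text = "age,income,diet\n22,-1,anything\n35,80000,--\n"

missing_values = ["n/a", "na", "--", '-1', "NaN"]
toy_profiles = pd.read_csv(io.StringIO(csv_text), na_values = missing_values)

# Both the '-1' income and the '--' diet are now NaN.
print(toy_profiles.isna().sum())
```

Note that the list applies to every column, so a literal `-1` anywhere in the file is treated as missing.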

In [87]:
'''
Count of missing values in each column.
'''
profile_data.isna().sum()

age                0
body_type       5296
diet           24395
drinks          2985
drugs          14080
education       6628
essay0          5489
essay1          7572
essay2          9639
essay3         11476
essay4         10538
essay5         10851
essay6         13771
essay7         12451
essay8         19227
essay9         12603
ethnicity       5680
height             3
income         48442
job             8198
last_online        0
location           0
offspring      35561
orientation        0
pets           19921
religion       20226
sex                0
sign           11056
smokes          5512
speaks            50
status             0
dtype: int64
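Raw counts are easier to compare as a share of all rows. The sketch below shows one way to get that, on a toy frame with made-up values (`toy_profiles` is a stand-in, not the real `profile_data`).

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile_data (hypothetical values).
toy_profiles = pd.DataFrame({
    'age': [22, 35, 38, 23],
    'diet': ['anything', np.nan, np.nan, 'vegetarian'],
    'income': [np.nan, 80000, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing rows per column.
missing_share = (toy_profiles.isna().mean() * 100).sort_values(ascending = False)
print(missing_share)
```

The same expression applied to `profile_data` would rank columns such as `income` and `offspring` by how sparse they are.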

In [89]:
profile_data.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

Partitioning the data into two different dataframes:
1. Demographics of the user.
2. Responses to the 10 questions on OkCupid.

Demographics:

1. body_type- rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up

2. diet- mostly/strictly; anything, vegetarian, vegan, kosher, halal, other

3. drinks- very often, often, socially, rarely, desperately, not at all

4. drugs- never, sometimes, often

5. education- graduated from, working on, dropped out of; high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp

6. ethnicity- Asian, middle eastern, black, native American, indian, pacific islander, Hispanic/latin, white, other

7. height- inches

8. income- (US $, -1 means rather not say) -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000

9. job- student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired

10. offspring- has a kid, has kids, doesn't have a kid, doesn't want kids; optionally followed by "and" / "but" plus: might want them, wants them, doesn't want any, doesn't want more

11. orientation- straight, gay, bisexual

12. pets- has dogs, likes dogs, dislikes dogs; and has cats, likes cats, dislikes cats

13. religion- agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other; and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it

14. sex- m, f

15. sign- aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn; but it doesn’t matter, and it matters a lot, and it’s fun to think about

16. smokes- yes, sometimes, when drinking, trying to quit, no

17. speaks- English (fluently, okay, poorly). Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish (fluently, okay, poorly)

18. status- single, seeing someone, married, in an open relationship

Questions:

1. essay0- My self summary
2. essay1- What I’m doing with my life
3. essay2- I’m really good at
4. essay3- The first thing people usually notice about me
5. essay4- Favorite books, movies, shows, music, and food
6. essay5- The six things I could never do without
7. essay6- I spend a lot of time thinking about
8. essay7- On a typical Friday night I am
9. essay8- The most private thing I am willing to admit
10. essay9- You should message me if...

In [92]:
# Suppressing the pandas chained-assignment warning (SettingWithCopyWarning).
pd.set_option('mode.chained_assignment', None)

In [93]:
# Creating a dataframe containing the demographic information.
# .copy() makes it an independent frame, so new columns can be added safely.
profile_demographics = profile_data[['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity', 'height', 'income', 'job',
              'last_online', 'location', 'offspring', 'orientation', 'pets',
              'religion', 'sex', 'sign', 'smokes', 'speaks', 'status' ]].copy()

# Creating a dataframe containing just the responses to the 10 OkCupid questions (essay0 to essay9).
profile_essays = profile_data[['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
                               'essay8', 'essay9']].copy()

-----------------------

Profile Demographics:

In [14]:
'''
Creating a date-time object from the feature: last_online.
'''
profile_demographics['last_online'] = pd.to_datetime(profile_demographics['last_online'], format='%Y-%m-%d-%H-%M')

'''
Splitting DATETIME instance of last_online into Date AND Time.
'''
profile_demographics['last_date_online'] = profile_demographics['last_online'].dt.date
profile_demographics['last_time_online'] = profile_demographics['last_online'].dt.time

'''
Splitting date into Day, Month, and Year.
'''
profile_demographics['year'] = profile_demographics['last_online'].dt.year
profile_demographics['month'] = profile_demographics['last_online'].dt.month
profile_demographics['day'] = profile_demographics['last_online'].dt.day
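To see what the `'%Y-%m-%d-%H-%M'` format string does, here is a minimal sketch on two made-up timestamps written in the same dash-separated layout. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical last_online values in the raw "YYYY-MM-DD-HH-MM" layout.
toy = pd.DataFrame({'last_online': ['2012-06-28-20-30', '2012-06-29-21-41']})

# Parse the dash-separated string into a proper datetime column.
toy['last_online'] = pd.to_datetime(toy['last_online'], format = '%Y-%m-%d-%H-%M')

# Date, time and calendar parts all come from the .dt accessor.
toy['last_date_online'] = toy['last_online'].dt.date
toy['last_time_online'] = toy['last_online'].dt.time
toy['year'] = toy['last_online'].dt.year
toy['month'] = toy['last_online'].dt.month
toy['day'] = toy['last_online'].dt.day

print(toy)
```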

--------------------------------------

In [12]:
'''
Separating location into CITY and STATE.
'''
location_separated = profile_demographics["location"].str.split(", ", n = 1, expand = True)
profile_demographics['city'] = location_separated[0]
profile_demographics['state'] = location_separated[1]
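The `n = 1` argument matters when a location has more than one comma: only the first `", "` is used as the split point, so everything after the city stays together. A short sketch on made-up locations (the second value is hypothetical and only there to show the effect):

```python
import pandas as pd

toy = pd.DataFrame({'location': ['san francisco, california',
                                 'london, england, united kingdom']})

# Split once: column 0 is the city, column 1 is everything after the first ", ".
location_separated = toy['location'].str.split(", ", n = 1, expand = True)
toy['city'] = location_separated[0]
toy['state'] = location_separated[1]

print(toy[['city', 'state']])
```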

-------------------------------------------

In [17]:
'''
Retaining just the religion, dropping the "how serious about it" qualifier.
'''
profile_demographics['religion'] = profile_demographics['religion'].str.split(' ').str[0]

---------------------

In [13]:
'''
Retaining just the zodiac sign, no other information.
'''
profile_demographics['sign'] = profile_demographics['sign'].str.split(' ').str[0]
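Both `religion` and `sign` put the category first and the qualifier after it, so keeping the first space-separated token is enough. A minimal sketch on made-up rows (the frame `toy` is hypothetical); missing values simply stay missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'religion': ['agnosticism and very serious about it', 'atheism', np.nan],
    'sign': ['gemini but it doesn&rsquo;t matter', 'cancer', np.nan],
})

# Keep only the first word of each entry; NaN propagates through .str.
for column in ['religion', 'sign']:
    toy[column] = toy[column].str.split(' ').str[0]

print(toy)
```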

----------------

In [7]:
'''
Exploding language into multiple columns, each column containing one language.
'''
languages_separated = profile_demographics['speaks'].str.split(", ", n = 5, expand = True)
language_columns = ['language_1', 'language_2', 'language_3', 'language_4', 'language_5']

'''
Eliminating fluency levels in languages to obtain one single language.
'''
# Importing re package for using regular expressions
import re

# Function to clean the names
def language_names(language_name):
    # Missing entries (None / NaN) are passed through instead of becoming the text 'None'.
    if not isinstance(language_name, str):
        return language_name

    # Remove the opening bracket and everything after it,
    # e.g. "english (fluently)" becomes "english".
    # strip() only trims the ends, so "sign language" keeps its inner space.
    return re.sub(r'\(.*', '', language_name).strip()

'''
Updating dataframes with clean text.
'''
for position, column in enumerate(language_columns):
    profile_demographics[column] = languages_separated[position].apply(language_names)

# Print the updated dataframe
print(profile_demographics[language_columns])
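As an aside, a long format (one row per user and language) is sometimes handier than fixed `language_1` to `language_5` columns, for instance when counting speakers. The sketch below is an alternative layout, not the one this notebook uses, and `toy` holds made-up values.

```python
import pandas as pd

toy = pd.DataFrame({'speaks': ['english (fluently), spanish (poorly)',
                               'english',
                               None]})

languages_long = (
    toy['speaks']
    .str.split(', ')                             # list of languages per user
    .explode()                                   # one row per language
    .str.replace(r'\s*\(.*', '', regex = True)   # drop "(fluently)" etc.
    .dropna()                                    # users with no entry
)

# Number of users listing each language.
language_counts = languages_long.value_counts()
print(language_counts)
```

The index of `languages_long` still points at the original row, so it can be joined back to the profile it came from.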

-----------------------------

Profile Essays:

In [79]:
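def preprocess_text(data_frame):
  '''
  preprocess_text: Function to strip html tags, newlines and http(s) links from text.
  params: data_frame
  '''
  for cols in data_frame.columns:
    # Replace html tags such as <br /> with a space.
    data_frame[cols] = data_frame[cols].str.replace(r'<[^<]+?>', ' ', regex = True)
    # Replace any remaining http / https links with a space.
    data_frame[cols] = data_frame[cols].str.replace(r'https?://\S+', ' ', regex = True)
    # Replace newlines with a space.
    data_frame[cols] = data_frame[cols].str.replace('\n', ' ', regex = False)

  return data_frame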
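A quick check of the tag and newline stripping on one made-up essay. The final whitespace collapse is an extra step assumed here for readability, and `toy_essays` is a hypothetical stand-in for `profile_essays`.

```python
import pandas as pd

toy_essays = pd.DataFrame({'essay0': ['i love <b>hiking</b><br />\nand coffee', None]})

cleaned = (
    toy_essays['essay0']
    .str.replace(r'<[^<]+?>', ' ', regex = True)   # html tags -> space
    .str.replace('\n', ' ', regex = False)         # newlines -> space
    .str.replace(r'\s+', ' ', regex = True)        # collapse runs of spaces (assumed extra step)
    .str.strip()
)

print(cleaned[0])
```

Empty essays stay missing, so the same chain can run over columns with many NaN entries.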

In [18]:
# user preferences api request
ajeya_age = 26
ajeya_height = 75
ajeya_job = 'student'
ajeya_sex_pref = 'f'
ajeya_location = 'san francisco'
ajeya_drinks_pref = 'socially'
ajeya_status_pref = 'single'
ajeya_drugs_pref = 'never'
ajeya_pets_pref = 'likes dogs'

In [19]:
# Profiles within one year of ajeya's age.
curated_profiles_user = profile_demographics[profile_demographics.age.between(ajeya_age - 1, ajeya_age + 1)]

# Gender preference filter.
curated_profiles_user = curated_profiles_user[curated_profiles_user.sex == ajeya_sex_pref]

# Same job as ajeya.
curated_profiles_user = curated_profiles_user[curated_profiles_user.job == ajeya_job]

# No taller than ajeya.
curated_profiles_user = curated_profiles_user[curated_profiles_user.height <= ajeya_height]

# Same city.
curated_profiles_user = curated_profiles_user[curated_profiles_user.city == ajeya_location]

# Drinking preference.
curated_profiles_user = curated_profiles_user[curated_profiles_user.drinks == ajeya_drinks_pref]

# Relationship status preference.
curated_profiles_user = curated_profiles_user[curated_profiles_user.status == ajeya_status_pref]

# Drug use preference.
curated_profiles_user = curated_profiles_user[curated_profiles_user.drugs == ajeya_drugs_pref]

# Pet preference.
curated_profiles_user = curated_profiles_user[curated_profiles_user.pets == ajeya_pets_pref]

# 1. clustering of profiles
# 2. check where ajeya_profile would be clustered
# 3. combine rule based + clustering for more personalized results
# postprocessing. engines read
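The chain of filters above can also be written as a single boolean mask, which makes it easier to reuse for another user. This is a sketch only: `filter_profiles`, its parameters, and the toy rows are hypothetical names and values, not part of the notebook.

```python
import numpy as np
import pandas as pd

def filter_profiles(profiles, age, exact_matches, age_window = 1, max_height = None):
    """Return rows within the age window, under max_height, matching every exact preference."""
    mask = profiles['age'].between(age - age_window, age + age_window)
    if max_height is not None:
        # Rows with a missing height compare as False and are dropped.
        mask &= profiles['height'] <= max_height
    for column, wanted in exact_matches.items():
        mask &= profiles[column] == wanted
    return profiles[mask]

toy_profiles = pd.DataFrame({
    'age': [26, 25, 40, 27],
    'height': [62.0, 80.0, 65.0, np.nan],
    'sex': ['f', 'f', 'f', 'f'],
    'city': ['san francisco', 'san francisco', 'san francisco', 'oakland'],
})

curated = filter_profiles(
    toy_profiles,
    age = 26,
    exact_matches = {'sex': 'f', 'city': 'san francisco'},
    max_height = 75,
)
print(curated)
```

Only the first toy row survives: the second is too tall, the third is outside the age window, and the fourth has no height and a different city.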

In [20]:
curated_profiles_user.shape

(16, 33)

In [21]:
curated_profiles_user.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,language_1,language_2,language_3,language_4,language_5,city,state,last_date_online,last_time_online,year,month,day
2107,26,average,mostly anything,socially,never,working on masters program,white,62.0,,student,2012-06-29 18:10:00,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs,agnosticism,f,cancer,no,"english (fluently), spanish (fluently)",single,english,spanish,,,,san francisco,california,2012-06-29,18:10:00,2012,6,29
8235,25,average,anything,socially,never,graduated from college/university,asian,63.0,,student,2012-06-28 12:17:00,"san francisco, california",wants kids,straight,likes dogs,agnosticism,f,capricorn,no,"english (fluently), chinese (fluently), spanis...",single,english,chinese,spanish,,,san francisco,california,2012-06-28,12:17:00,2012,6,28
8828,26,average,mostly anything,socially,never,working on masters program,asian,65.0,,student,2012-06-30 02:39:00,"san francisco, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs,christianity,f,aries,no,"english, chinese",single,english,chinese,,,,san francisco,california,2012-06-30,02:39:00,2012,6,30
12164,27,curvy,,socially,never,working on masters program,asian,67.0,,student,2012-06-28 21:27:00,"san francisco, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs,christianity,f,aquarius,no,"english, chinese",single,english,chinese,,,,san francisco,california,2012-06-28,21:27:00,2012,6,28
15413,26,fit,anything,socially,never,graduated from college/university,white,66.0,,student,2012-06-29 20:28:00,"san francisco, california",,straight,likes dogs,,f,pisces,,english,single,english,,,,,san francisco,california,2012-06-29,20:28:00,2012,6,29


Feature Engineering:
1. Eliminate the - that separates date and time in last_online.
2. Split datetime into DATE and TIME.
3. City / State - split location into city / state

In [22]:
# Missing values that can be replaced with mean: HEIGHT
# Missing values that can be replaced with values before / after NA: body_type, diet, drinks, drugs, education, ethnicity, job

In [23]:
profile_demographics['body_type'].mode()[0]

'average'

In [24]:
profile_demographics['diet'].mode()[0]

'mostly anything'

In [25]:
profile_demographics['drinks'].mode()[0]

'socially'

In [26]:
profile_demographics['drugs'].mode()[0]

'never'

In [27]:
profile_demographics['education'].mode()[0]

'graduated from college/university'

In [28]:
profile_demographics['ethnicity'].mode()[0]

'white'

In [29]:
profile_demographics['height'].mean()

68.29528051649066

In [30]:
profile_demographics['job'].mode()[0]

'other'

In [31]:
profile_demographics['offspring'].mode()[0]

'doesn&rsquo;t have kids'

In [32]:
profile_demographics['pets'].mode()[0]

'likes dogs and likes cats'
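The mean and mode values above could be used to fill the gaps. A minimal sketch, assuming mean imputation for `height` and mode imputation for the categorical columns (forward or backward filling, as noted earlier, would be another option). The frame `toy` contains made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'height': [62.0, np.nan, 70.0],
    'drinks': ['socially', np.nan, 'socially'],
    'pets': [np.nan, 'likes dogs', 'likes dogs'],
})

imputed = toy.copy()

# Numeric column: fill with the column mean.
imputed['height'] = imputed['height'].fillna(imputed['height'].mean())

# Categorical columns: fill with the most frequent value.
for column in ['drinks', 'pets']:
    imputed[column] = imputed[column].fillna(imputed[column].mode()[0])

print(imputed)
```

For sparse columns such as `offspring` or `diet`, filling with the mode changes the distribution considerably, so it may be better to keep an explicit "unknown" category there.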

Part A: Analysis of demographic data. Ask questions.
1. Age distribution in the dataset, with an additional orientation distribution.
2. Age distribution filtering on the sex of our users.
3. What types of Jobs do our users have and how many of them are female?
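Question 3 can also be answered with a table before plotting. The sketch below uses `pd.crosstab` on a toy frame with made-up rows; the column name `female_share` is introduced here for illustration.

```python
import pandas as pd

toy_profiles = pd.DataFrame({
    'job': ['student', 'student', 'technology', 'technology', 'technology'],
    'sex': ['f', 'm', 'f', 'm', 'm'],
})

# Rows are jobs, columns are sexes, cells are user counts.
job_by_sex = pd.crosstab(toy_profiles['job'], toy_profiles['sex'])

# Fraction of users in each job who are female.
job_by_sex['female_share'] = job_by_sex['f'] / (job_by_sex['f'] + job_by_sex['m'])

print(job_by_sex)
```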

In [None]:
fig1 = px.histogram(profile_demographics, x = 'age', color = "orientation", title = 'Age distribution with orientation information.')
fig1.update_layout(xaxis = dict(tickmode = 'linear'))
fig1.show()

In [None]:
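# Share of users by sex: count profiles per sex rather than summing their ages.
sex_counts = profile_demographics['sex'].value_counts()
fig2 = px.pie(names = sex_counts.index, values = sex_counts.values)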
fig2.show()

In [None]:
fig3 = px.histogram(profile_demographics, x = 'age', color = 'sex')
fig3.update_layout(xaxis = dict(tickmode = 'linear'))
fig3.show()

In [None]:
fig4 = px.histogram(profile_demographics, x = 'job', color = 'sex', title = 'Gender distribution in jobs.')
fig4.show()

In [None]:
fig5 = px.histogram(profile_demographics, x = 'age', color = 'status')
fig5.update_layout(xaxis = dict(tickmode = 'linear'))
fig5.show()

In [None]:
fig6 = px.histogram(profile_demographics, x = 'age', color = 'location', title = 'Age distribution based on location.')
fig6.update_layout(xaxis = dict(tickmode = 'linear'))
fig6.show()