# Supervised Learning Project

In [1]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

import plotly.express as px
import plotly.graph_objects as go

# Read in data
ok_cupid_df = pd.read_csv('data/okcupid_profiles.csv')
ok_cupid_df.columns

Index(['age', 'status', 'sex', 'orientation', 'body_type', 'diet', 'drinks',
       'drugs', 'education', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'pets', 'religion', 'sign',
       'smokes', 'speaks', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4',
       'essay5', 'essay6', 'essay7', 'essay8', 'essay9'],
      dtype='object')

### Step 1: Gather data, determine the method of data collection and provenance of the data

I obtained the OkCupid profiles dataset from Kaggle: https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles

### Step 2: Identify a Supervised Machine Learning Problem

This project explores the OkCupid dataset to find out whether its users fall into meaningful clusters. Specifically, I try to answer the following questions:

1. Is there a lot of variety among the people OkCupid presents to its users, or is everybody the same?
2. Are there distinct groups into which OkCupid users can be separated?

I extracted the OkCupid dataset from Kaggle in CSV format and used this single .csv file throughout the project. Age, income, and height are the only numeric features; all others are of type object.

For this dataset, I will use clustering techniques to understand the different kinds of users it contains.

### Step 3: Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

#### Describe the factors or components that make up the dataset (The "factors" here are called "features" in the machine learning term. These factors are often columns in the tabulated data). For each factor, use a box-plot, scatter plot, histogram, etc., to describe the data distribution as appropriate.

The features of this dataset are:

- age
- status
- sex
- orientation
- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- last_online
- location
- offspring
- pets
- religion
- sign
- smokes
- speaks
- essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9

The OkCupid dataset has many rows and columns with missing (null) values. The income and offspring features will be removed entirely because more than 59% of their values are missing or out of range (income uses -1 as a placeholder).

1. smokes, drinks, education, body_type, and drugs will be label encoded (for example, doing drugs is 1 and not doing drugs is 0).

2. religion, speaks, diet, and sign will be transformed using CountVectorizer together with some form of label encoding. These features carry a level of seriousness or fluency that cannot be ignored. For example, speaking English fluently is 3 and speaking it poorly is 0, and being serious about a religion scores higher than not being serious.

- ethnicity, all essay columns, and pets will also be transformed using CountVectorizer (for pets, only the "likes X" and "has X" values are kept)
- For essays, the length of each essay will also become a feature
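The fluency scoring described above could look roughly like the following. This is a minimal sketch, assuming `speaks` values such as `english (fluently), spanish (poorly)`. Only "fluently is 3" and "poorly is 0" come from the plan above; the helper `score_languages`, the other scores, and the -1 fill value are hypothetical choices for illustration.

```python
import pandas as pd

# Only fluently = 3 and poorly = 0 come from the text; "okay" is an assumed level.
FLUENCY_SCORES = {"fluently": 3, "okay": 2, "poorly": 0}
DEFAULT_SCORE = 1  # assumed score when no fluency level is given


def score_languages(speaks: str) -> dict:
    """Turn 'english (fluently), spanish (poorly)' into {'english': 3, 'spanish': 0}."""
    scores = {}
    for entry in speaks.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith(")") and "(" in entry:
            # Split 'spanish (poorly' into the language and its fluency level.
            language, level = entry[:-1].split("(", 1)
            scores[language.strip()] = FLUENCY_SCORES.get(level.strip(), DEFAULT_SCORE)
        else:
            scores[entry] = DEFAULT_SCORE
    return scores


speaks = pd.Series(["english (fluently), spanish (poorly)", "english"])

# One column per language; a language a user does not list is filled with -1 (assumption).
speaks_features = pd.DataFrame(speaks.map(score_languages).tolist()).fillna(-1).astype(int)
print(speaks_features)
```

Keeping the scores in one dictionary makes it easy to try a different ordering later without touching the parsing logic.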

In [2]:
# Check shape
ok_cupid_df.shape

(59946, 31)

In [3]:
ok_cupid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   sex          59946 non-null  object 
 3   orientation  59946 non-null  object 
 4   body_type    54650 non-null  object 
 5   diet         35551 non-null  object 
 6   drinks       56961 non-null  object 
 7   drugs        45866 non-null  object 
 8   education    53318 non-null  object 
 9   ethnicity    54266 non-null  object 
 10  height       59943 non-null  float64
 11  income       59946 non-null  int64  
 12  job          51748 non-null  object 
 13  last_online  59946 non-null  object 
 14  location     59946 non-null  object 
 15  offspring    24385 non-null  object 
 16  pets         40025 non-null  object 
 17  religion     39720 non-null  object 
 18  sign         48890 non-null  object 
 19  smok

In [4]:
ok_cupid_df.head(2)

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,ethnicity,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
0,22,single,m,straight,a little extra,strictly anything,socially,never,working on college/university,"asian, white",...,about me: i would love to think that i was so...,currently working as an international agent fo...,making people laugh. ranting about a good salt...,"the way i look. i am a six foot half asian, ha...","books: absurdistan, the republic, of mice and ...",food. water. cell phone. shelter.,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet! you are ti...
1,35,single,m,straight,average,mostly other,often,sometimes,working on space camp,white,...,i am a chef: this is what that means. 1. i am ...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories. my b...,,,i am very open and will share just about anyth...,


In [5]:
# replace the -1 placeholder in income with NaN so it is counted as missing
ok_cupid_df.loc[ok_cupid_df['income'] == -1, 'income'] = np.nan

In [6]:
# check percentage of null values
round(ok_cupid_df.isna().sum()/ok_cupid_df.shape[0]*100,2)

age             0.00
status          0.00
sex             0.00
orientation     0.00
body_type       8.83
diet           40.69
drinks          4.98
drugs          23.49
education      11.06
ethnicity       9.48
height          0.01
income         80.81
job            13.68
last_online     0.00
location        0.00
offspring      59.32
pets           33.23
religion       33.74
sign           18.44
smokes          9.19
speaks          0.08
essay0          9.15
essay1         12.63
essay2         16.08
essay3         19.14
essay4         17.58
essay5         18.10
essay6         22.97
essay7         20.77
essay8         32.07
essay9         21.02
dtype: float64

In [7]:
# restore the -1 placeholder in income
ok_cupid_df['income'] = ok_cupid_df['income'].fillna(-1)
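The three cells above convert the -1 placeholder to NaN, measure the missing share, and then restore the placeholder. A minimal sketch of the same idea without modifying the frame, on a made-up `toy_df` (the values and the name `THRESHOLD_PCT` are assumptions for illustration), also shows how columns above a cutoff could be selected for removal:

```python
import numpy as np
import pandas as pd

# Toy stand-in for ok_cupid_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "age": [22, 35, 38, 23, 29],
    "income": [-1, 80000, -1, -1, -1],
    "offspring": [None, None, "has kids", None, "wants kids"],
})

# Treat the -1 sentinel in income as missing without modifying toy_df itself.
with_nan = toy_df.assign(income=toy_df["income"].replace(-1, np.nan))

# isna().mean() gives the fraction of missing values per column.
missing_pct = (with_nan.isna().mean() * 100).round(2)

THRESHOLD_PCT = 59  # assumed cutoff, based on the "over 59%" remark earlier
to_drop = missing_pct[missing_pct > THRESHOLD_PCT].index.tolist()
reduced_df = toy_df.drop(columns=to_drop)

print(missing_pct)
print(to_drop)
```

Because `assign` returns a new frame, the original income values stay untouched and no "refill" step is needed.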

#### Fill null values:

In [8]:
# drugs
ok_cupid_df['drugs'] = ok_cupid_df['drugs'].fillna('unknown_drugs')
ok_cupid_df['drugs'].value_counts()

never            37724
unknown_drugs    14080
sometimes         7732
often              410
Name: drugs, dtype: int64

In [9]:
# find the index of rows where the drugs value is unknown and drop those rows
index_drugs = ok_cupid_df[ok_cupid_df['drugs'] == 'unknown_drugs'].index
ok_cupid_df.drop(index_drugs, inplace = True)
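The fill-then-drop pattern is repeated for several columns in this notebook. When the placeholder is not needed for counting, `dropna(subset=...)` should give the same rows in one step. A small sketch on a made-up `toy_df` (an assumption for illustration) comparing the two approaches:

```python
import pandas as pd

# Toy stand-in for ok_cupid_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "drugs": ["never", None, "sometimes", None],
    "age": [22, 35, 38, 23],
})

# Approach used in the notebook: fill a placeholder, then drop by index.
filled = toy_df.copy()
filled["drugs"] = filled["drugs"].fillna("unknown_drugs")
dropped_by_index = filled.drop(filled[filled["drugs"] == "unknown_drugs"].index)

# One-step alternative: drop rows whose drugs value is missing.
dropped_by_dropna = toy_df.dropna(subset=["drugs"])

# Both approaches keep the same rows.
print(dropped_by_index.equals(dropped_by_dropna))
```

The placeholder approach remains useful when the number of unknown values should appear in `value_counts()` before the rows are removed.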

In [10]:
# Diet
# fill null diet with unknown
ok_cupid_df['diet'] = ok_cupid_df['diet'].fillna('unknown_diet')
ok_cupid_df['diet'].value_counts()

unknown_diet           18560
mostly anything        13105
anything                4805
strictly anything       3653
mostly vegetarian       2530
mostly other             785
strictly vegetarian      698
vegetarian               450
strictly other           333
other                    257
mostly vegan             252
strictly vegan           182
vegan                     96
mostly kosher             70
mostly halal              41
strictly kosher           15
strictly halal            15
halal                     10
kosher                     9
Name: diet, dtype: int64

##### Fill unknown_diet based on essay columns:
It occurred to me that I might be able to fill in other features using information from the essay columns. I read these essays to determine whether a user's diet was X.

In [11]:
# use .loc to find specific essay string and fill 'diet' column with appropriate diet
ok_cupid_df.loc[ok_cupid_df['essay0'] == "im looking for someone to share some raging adhd. im a self motivated and light hearted superhero who enjoy's riding my bike everywhere and eating every goddamn thing i can.  im looking for someone to go adventuring with. i enjoy blind drunken adventures sometimes but you dont have to be a drinker. no vegans, i will eat anything... including people... especially hipsters. im not really a nerd (i don't play magic cards/excessive videogames) but i can like nerdy girls.  i just got this account, so gimmie some time to write down more shenanigans that are important  if u make chiptunes hit me the fuck up! i wanna make some!  i am awesome, eccentric, and energetic", 'diet'] = 'strictly anything'

ok_cupid_df.loc[ok_cupid_df['essay0'] == "rabid bibliophile, humorless feminist (that's a joke), eternal student. i like to write poetry on people, bake (vegan) cupcakes, make art and dress-up.  i identify as queer but my choices here are limited so i chose bisexual.  i am quiet, empathetic, and geeky", 'diet'] = 'vegan'

ok_cupid_df.loc[ok_cupid_df['essay2'] == "cooking all types of foods, caring for others, being very social and outgoing. i think i am really good at horseback riding it is one of my true passions even though i have not gone in a long time. singing in the shower is always a good time....lol.", 'diet'] = 'anything'

ok_cupid_df.loc[ok_cupid_df['essay2'] == 'writing and listening. in addition, i am an amazing troubleshooter. i thoroughly enjoy (and therefore have gotten good at) making people laugh, acting, bicycling, juggling, copy editing, computers... and lots more. - why does okcupid rate me "less energetic" than other guys? that\'s kind of weird if you know me. i\'m, like, chatty. plus i manufacture heat really well. i save on heating costs!  also, i\'ve recently learned that i\'m a good cook, mainly because i can follow a recipe. ask me about my unerring chicken & dumplings or my stupid good pot pie. it\'s so easy when it\'s easy!', 'diet'] = 'anything'

ok_cupid_df.loc[ok_cupid_df['essay2'] == "i'm a musician at heart and i'm great at playing the drums. i also love to cook, especially butter chicken.", 'diet'] = 'anything'

ok_cupid_df.loc[ok_cupid_df['essay2'] == "eating chicken wings, shooting guns(recreational lol not at people) i'm actually pretty good at a lot if stuff it's just those first two are probably the most relevant lol :-) if you get to know me youll find out ;-) i'd say im a pretty interesting individual, but i know most of you lady's are basically looking at my photos and only that and couldn't give a dam about what i'm bout besides whether or not i'm making money..", 'diet'] = 'anything'

essay4_diet_string = ok_cupid_df[ok_cupid_df['diet'] == 'unknown_diet']['essay4'].tolist()[2]
essay4_diet_string

ok_cupid_df.loc[ok_cupid_df['essay4'] == essay4_diet_string, 'diet'] = 'anything'

ok_cupid_df.loc[ok_cupid_df['essay4'] == "one of my hobbies is making sushi.  overall i like most foods. one of my personal rules is to try things at least once, so i get to experience a lot of variety.  books: i have enjoyed harry potter series, da vinci code, and what i wish i knew when i was 20. i don't read as much as i would like to, but when i do i enjoy it.  i am open with music, i can find good music in any genre, even country, one can find good in even the worst. for the most part, i enjoy techno/dance, 80's rock (queen, bon jovi, scorpions), classic/jazz (sinatra, elvis, big bad voodoo daddy, red elvises), and current top charts. one of my favorite new discovery's have been apple trees & tangerines.  movies: i enjoy watching movies in theaters. one of my favorite is lord of war but the list includes; ferris bueler, sean connery movies (the rock, finding forester, hunt for red october, bond), indiana jones trilogy, quentin terantino flicks (pulp fiction, reservoir dogs, inglorious bastards, etc.). leonardo dicaprio has put himself in awesome roles since titanic, catch me if you can and the aviator. the brothers bloom. .", 'diet'] = 'anything'
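Matching on the full essay text only works for essays that have already been read. As a complement, a keyword search across all essay columns can surface candidate rows. The sketch below runs on a made-up `toy_df`, and the keyword rules in `KEYWORD_TO_DIET` are hypothetical; keyword hits can be misleading (for example "no vegans"), so the flagged essays would still need a manual read.

```python
import pandas as pd

# Toy stand-in for ok_cupid_df; the essays are made up for illustration.
toy_df = pd.DataFrame({
    "diet": ["unknown_diet", "unknown_diet", "mostly anything", "unknown_diet"],
    "essay0": ["i bake (vegan) cupcakes", None, "i am vegan", "i like hiking"],
    "essay2": [None, "i will eat anything", None, None],
})

essay_cols = ["essay0", "essay2"]  # the real frame has essay0 .. essay9

# Join all essays per row into one string; missing essays become empty strings.
all_text = toy_df[essay_cols].fillna("").agg(" ".join, axis=1)

# Hypothetical keyword -> diet rules, for illustration only.
KEYWORD_TO_DIET = {r"\bvegan\b": "vegan", r"\beat anything\b": "anything"}

for pattern, diet in KEYWORD_TO_DIET.items():
    # Recompute the mask each time so already-filled rows are left alone.
    unknown = toy_df["diet"] == "unknown_diet"
    candidates = unknown & all_text.str.contains(pattern, regex=True)
    toy_df.loc[candidates, "diet"] = diet

print(toy_df["diet"].tolist())
```

Restricting the assignment to rows that are still `unknown_diet` keeps self-reported diets from being overwritten.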

In [12]:
ok_cupid_df['diet'].value_counts()

unknown_diet           18553
mostly anything        13105
anything                4810
strictly anything       3654
mostly vegetarian       2530
mostly other             785
strictly vegetarian      698
vegetarian               450
strictly other           333
other                    257
mostly vegan             252
strictly vegan           182
vegan                     97
mostly kosher             70
mostly halal              41
strictly kosher           15
strictly halal            15
halal                     10
kosher                     9
Name: diet, dtype: int64

Change unknown_diet to rather not say:

In [13]:
# replace 'unknown_diet' with 'rather not say'
ok_cupid_df['diet'] = ok_cupid_df['diet'].str.replace('unknown_diet', 'rather not say')
ok_cupid_df['diet'].value_counts()

rather not say         18553
mostly anything        13105
anything                4810
strictly anything       3654
mostly vegetarian       2530
mostly other             785
strictly vegetarian      698
vegetarian               450
strictly other           333
other                    257
mostly vegan             252
strictly vegan           182
vegan                     97
mostly kosher             70
mostly halal              41
strictly kosher           15
strictly halal            15
halal                     10
kosher                     9
Name: diet, dtype: int64

In [14]:
ok_cupid_df.shape

(45866, 31)

##### Status:

In [15]:
ok_cupid_df['status'].value_counts()

single            42697
seeing someone     1552
available          1357
married             254
unknown               6
Name: status, dtype: int64

In [16]:
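# replace unknown with 'unknown_status' so it's more distinct
# (assign the result back; inplace=True on a selected column may not update the dataframe in newer pandas)
ok_cupid_df["status"] = ok_cupid_df["status"].replace({'unknown': 'unknown_status'})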

In [17]:
ok_cupid_df.loc[ok_cupid_df['essay7'] == 'at home with my boy friend usually, listening to music on the i:pod in shuffle mode and painting', 'status'] = 'seeing someone'

Drop unknown_status:

In [18]:
index_status = ok_cupid_df[ok_cupid_df['status'] == 'unknown_status'].index
ok_cupid_df.drop(index_status, inplace = True)

In [19]:
ok_cupid_df['status'].value_counts()

single            42697
seeing someone     1552
available          1357
married             254
Name: status, dtype: int64

Body Type:

In [20]:
# fill nulls in body_type with unknown_body_type
ok_cupid_df['body_type'] = ok_cupid_df['body_type'].fillna('unknown_body_type')
ok_cupid_df['body_type'].value_counts()

average              11653
fit                   9413
athletic              9116
unknown_body_type     3589
thin                  3580
curvy                 3010
a little extra        2169
skinny                1392
full figured           838
overweight             368
jacked                 309
used up                256
rather not say         167
Name: body_type, dtype: int64

Fill unknown_body_type with info from essay columns (similar to diet):

In [21]:
ok_cupid_df.loc[ok_cupid_df['essay0'] == 'gainfully self-employed by day, amateur musician in my spare time, i play guitar and bass guitar ok, organ/ keyboards not so well. i sing and play in a band as a hobby. i read, tinker with electronics and mechanical stuff, write music and other stuff occasionally, in the little spare time my career affords me. i am fascinated with astronomy, but i\'m not an expert astronomer. i did, however, enjoy building my own 8" newtonian reflector telescope. i just like to set up and look at the moon, venus, jupiter, whatever i can find on any given night. i\'m a "blue-collar intellectual", i suppose. i can fix anything. the things that i have left blank in my details are blank because there were no suitable multiple-choice answers. for body type, i would have to say; "sturdy, stocky, strong as an ox". my income varies because i\'m self-employed. i\'ll say; "income: enough that i don\'t have to watch my pennies, but not enough to get rich quick". i have diverse interests, including astronomy, music, guitar, saxophone, organ, keyboard, electronics, antiques, camping, outdoors, commerce, and more!  i am sincere, spontaneous, and scatterbrained', 'body_type'] = 'rather not say'
ok_cupid_df.loc[ok_cupid_df['essay0'] == 'sweet queer poly geeky kinky feminist liberal overly verbose cynical optimist with a dirty and dark sense of humor and fever dreams of polymathy iso... awesome people i haven\'t met yet, for surprising and delicious and hilarious interactions across the spectrum of possibilities. (this legitimately and honestly includes new friends -- i\'m not interested in filtering and discarding neat individuals out of my life just because there\'s no mutual relationship chemistry.)  i consider \'there\'s no way to do that\' a statement of challenge, not a concession to inevitability. i take sex both as intensely and as casually as it deserves. i don\'t like losing, but i try very hard to be gracious when i do. i\'ve got a sadly limited vocal range, but that doesn\'t stop me from singing when the mood moves me. my pronunciation is irrevocably and curiously erroneous in amusing ways. i\'m happiest when i\'ve just made someone laugh -- the harder the better. my (white, male, financial) privilege is examined, and kept under recurring surveillance. i\'ll try almost anything once, and on zero notice if my schedule is free.  (as for the body type self-summary in the details sidebar, i prefer "stocky", but that\'s not an available option on the list, so it gets put here.)', 'body_type'] = 'rather not say'
ok_cupid_df.loc[ok_cupid_df['essay0'] == 'i\'m specifically looking to connect with those nearby* who have cultivated the ability to notice thought as it arises, and grok that thoughts aren\'t personal. (*or near bos, nyc, hnl -- places i\'ll be going to in the near future)  points of reference: rumi, yoda, tolle, ramana maharshi  gratitude, moments of noticing being both the ocean and a drop of water, easy laughter, embodied awareness/presence practices  fun-loving -- whether being deeply silent or raucously silly committed to serving the evolution of consciousness (and open to other ways of describing it) sometimes simultaneously grounded and expanded/ethereal  once in a while, i run; often dance. love to walk in forests, sit in forests, stand in forests, breathe in forests :) also have come to deeply appreciate the silence and magic of swimming with fishes, turtles, and the like. (i would say "snorkeling," but i only use a mask.)  seem to have developed a green-ish thumb, and envision gardening more in the upcoming months.  a note about children -- for a while, wanted to adopt 10 and birth a couple; life unfurled differently: no biological children, and 2 sort-of-semi-stepdaughters who are now adults.  unsure what to put down for "body type" -- kinda fit, kinda not, kinda average, kinda thin = ? when seated, am taller than one of my 6\' tall friends.', 'body_type'] = 'average'
ok_cupid_df.loc[ok_cupid_df['essay0'] == "i would describe my self as someone loves making people laugh and loves to follow their visions. i am generally into open relationships, queerdos and newcummers too. i am open to being in a closed relationships too if it feels right. oh ya and i am kinda poor, so don't expect me to buy you gifts all the time but you can expect me to warm your little tender boygirl heart with a cup of hot chocolate with marsh mellows inside. or show you a documentary on direct actions, or videos of circus and clowns.  i love movies, nature, kink, love making, radical faeries, cage fighting (for some gay reason), thrift shopping, being supportive, booty shaking fun time, sex positivity, all body type positivity. i am big bodied, wacky, muscular, cuddly, chewy, and have tattood hair and smile lines. i am loving, strange, sweet and have luscious lips. mwahahaha!", 'body_type'] = 'curvy'
ok_cupid_df.loc[ok_cupid_df['essay1'] == 'i am not afraid of working hard. but i am not defined by my work. trust me, i love to play! i have been very busy traveling over the past few years. i also like to keep myself busy by trying new activities. in the past i have trained for two century rides (100 mile bike rides) and in the past few months - since the end of 2011 - i have really gotten into crossfit (which is why i didn\'t answer the body type question. there wasn\'t an option for improving everyday-lol). i am also hoping to resume taking guitar lessons, if time allows this year. basically as the years go on, my "bucket list" keeps growing , and i keep trying to check things off of it. there doesn\'t really seem to be any good place to add this, but i also love going to watch baseball games either at the park or local bars.', 'body_type'] = 'rather not say'
ok_cupid_df.loc[ok_cupid_df['essay8'] == 'i didn\'t pick a body type because "fat kid with sticky fingers trapped inside" wasn\'t an option.', 'body_type'] = 'rather not say'

Replace remaining unknown_body_type with rather not say:

In [22]:
ok_cupid_df['body_type'] = ok_cupid_df['body_type'].str.replace('unknown_body_type', 'rather not say')

In [23]:
ok_cupid_df.isna().sum()/ok_cupid_df.shape[0]*100

age             0.000000
status          0.000000
sex             0.000000
orientation     0.000000
body_type       0.000000
diet            0.000000
drinks          3.517226
drugs           0.000000
education      10.928914
ethnicity       8.637157
height          0.002181
income          0.000000
job            12.812909
last_online     0.000000
location        0.000000
offspring      57.738770
pets           33.185783
religion       31.689926
sign           18.118186
smokes          5.994331
speaks          0.054514
essay0          9.498474
essay1         13.279546
essay2         16.761884
essay3         19.156127
essay4         18.587004
essay5         18.556476
essay6         23.826864
essay7         21.046664
essay8         32.810728
essay9         21.829481
dtype: float64

Education:

In [24]:
# fill null with unknown education
ok_cupid_df['education'] = ok_cupid_df['education'].fillna('unknown_education')
ok_cupid_df['education'].value_counts()

graduated from college/university    17706
graduated from masters program        7022
unknown_education                     5012
working on college/university         4595
graduated from two-year college       1302
working on masters program            1234
graduated from high school            1223
graduated from ph.d program           1028
working on two-year college            921
graduated from law school              804
dropped out of college/university      778
working on ph.d program                724
college/university                     609
graduated from space camp              498
graduated from med school              370
dropped out of space camp              361
working on space camp                  304
working on law school                  190
two-year college                       186
working on med school                  173
dropped out of two-year college        154
masters program                        105
dropped out of masters program         103
dropped out

In [25]:
# drop unknown education
index_education = ok_cupid_df[ok_cupid_df['education'] == 'unknown_education'].index
ok_cupid_df.drop(index_education, inplace = True)
ok_cupid_df.isna().sum()/ok_cupid_df.shape[0]*100

age             0.000000
status          0.000000
sex             0.000000
orientation     0.000000
body_type       0.000000
diet            0.000000
drinks          1.938895
drugs           0.000000
education       0.000000
ethnicity       7.689483
height          0.000000
income          0.000000
job             8.622209
last_online     0.000000
location        0.000000
offspring      56.225519
pets           30.182628
religion       28.446925
sign           15.692323
smokes          4.205836
speaks          0.058754
essay0          8.590384
essay1         11.652957
essay2         15.158637
essay3         17.596945
essay4         16.438993
essay5         16.661770
essay6         21.942323
essay7         19.141696
essay8         30.838719
essay9         20.179691
dtype: float64

Job:

In [26]:
ok_cupid_df['job'] = ok_cupid_df['job'].fillna('rather not say')
ok_cupid_df['job'].value_counts()

other                                5345
rather not say                       3852
student                              3797
science / tech / engineering         3530
computer / hardware / software       3349
sales / marketing / biz dev          3160
artistic / musical / writer          3036
medicine / health                    2725
education / academia                 2577
executive / management               1737
banking / financial / real estate    1694
entertainment / media                1474
law / legal services                  947
hospitality / travel                  944
construction / craftsmanship          692
clerical / administrative             583
political / government                541
transportation                        269
unemployed                            227
retired                               194
military                              175
Name: job, dtype: int64

Ethnicity:

In [27]:
ok_cupid_df['ethnicity'] = ok_cupid_df['ethnicity'].fillna('unknown_ethnicity')
ok_cupid_df['ethnicity'].value_counts()

white                                                                       22526
asian                                                                        4679
unknown_ethnicity                                                            3141
hispanic / latin                                                             1911
black                                                                        1504
                                                                            ...  
black, native american, pacific islander, hispanic / latin, white, other        1
middle eastern, black, native american, white, other                            1
asian, middle eastern, black, white, other                                      1
middle eastern, pacific islander, hispanic / latin                              1
middle eastern, native american, hispanic / latin, white                        1
Name: ethnicity, Length: 195, dtype: int64

In [28]:
# drop rows with unknown ethnicity:
index_ethnicity = ok_cupid_df[ok_cupid_df['ethnicity'] == 'unknown_ethnicity'].index
ok_cupid_df.drop(index_ethnicity, inplace = True)
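Many ethnicity values are comma-separated combinations, which is why there are 195 distinct strings. The plan above uses CountVectorizer for this column. As a lighter alternative (a swapped-in technique, not the method used later), pandas' `str.get_dummies` can produce one indicator column per label. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up sample in the same comma-separated format as the ethnicity column.
ethnicity = pd.Series(["asian, white", "white", "hispanic / latin", "asian"])

# Split on ", " and create one 0/1 indicator column per ethnicity label.
ethnicity_features = ethnicity.str.get_dummies(sep=", ")
print(ethnicity_features)
```

Splitting on the full `", "` separator keeps labels such as `hispanic / latin` intact as a single column.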

Offspring:

In [29]:
ok_cupid_df['offspring'] = ok_cupid_df['offspring'].fillna('unknown_offspring')
ok_cupid_df['offspring'].value_counts()

unknown_offspring                          21067
doesn't have kids                           4742
doesn't have kids, but might want them      2712
doesn't have kids, but wants them           2615
doesn't want kids                           2022
has kids                                    1346
has a kid                                   1279
doesn't have kids, and doesn't want any      793
has kids, but doesn't want more              342
has a kid, but doesn't want more             211
has a kid, and might want more               169
wants kids                                   144
might want kids                              114
has kids, and might want more                 79
has a kid, and wants more                     52
has kids, and wants more                      20
Name: offspring, dtype: int64

Pets:

In [30]:
ok_cupid_df['pets'] = ok_cupid_df['pets'].fillna('unknown_pets')

index_pets = ok_cupid_df[ok_cupid_df['pets'] == 'unknown_pets'].index

# drop rows with unknown pets
ok_cupid_df.drop(index_pets, inplace = True)
ok_cupid_df.isna().sum()/ok_cupid_df.shape[0]*100

age             0.000000
status          0.000000
sex             0.000000
orientation     0.000000
body_type       0.000000
diet            0.000000
drinks          0.991915
drugs           0.000000
education       0.000000
ethnicity       0.000000
height          0.000000
income          0.000000
job             0.000000
last_online     0.000000
location        0.000000
offspring       0.000000
pets            0.000000
religion       22.443480
sign           10.768828
smokes          2.676299
speaks          0.044917
essay0          7.231622
essay1          9.645905
essay2         12.576733
essay3         14.448271
essay4         13.490043
essay5         13.478814
essay6         18.625543
essay7         16.012876
essay8         26.369966
essay9         17.169486
dtype: float64

Religion:

In [31]:
ok_cupid_df['religion'] = ok_cupid_df['religion'].fillna('unknown_religion')
ok_cupid_df['religion'].value_counts()

unknown_religion                              5996
agnosticism but not too serious about it      1535
catholicism but not too serious about it      1392
agnosticism and laughing about it             1300
other                                         1255
agnosticism                                   1227
christianity but not too serious about it     1219
other and laughing about it                   1100
atheism and laughing about it                 1050
other but not too serious about it             956
christianity                                   905
atheism                                        881
judaism but not too serious about it           773
atheism but not too serious about it           674
christianity and somewhat serious about it     581
other and somewhat serious about it            498
atheism and somewhat serious about it          481
catholicism                                    441
catholicism and laughing about it              412
buddhism but not too serious ab
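The religion values combine a religion with an optional seriousness phrase, which is what the encoding plan above wants to score. A minimal sketch of separating the two parts follows. The helper `split_religion` and the numeric scores are hypothetical, and only the phrases visible in the output above are handled.

```python
import pandas as pd

# Hypothetical seriousness scale; the ordering is an assumption for illustration.
SERIOUSNESS_SCORES = {
    "and laughing about it": 1,
    "but not too serious about it": 2,
    "and somewhat serious about it": 3,
}
NO_MODIFIER_SCORE = 0  # assumed score when no seriousness is stated


def split_religion(value: str) -> tuple:
    """Split 'atheism and laughing about it' into ('atheism', 1)."""
    for suffix, score in SERIOUSNESS_SCORES.items():
        if value.endswith(suffix):
            return value[: -len(suffix)].strip(), score
    return value, NO_MODIFIER_SCORE


religion = pd.Series([
    "agnosticism but not too serious about it",
    "atheism and laughing about it",
    "christianity",
])

parsed = pd.DataFrame(
    religion.map(split_religion).tolist(),
    columns=["religion_name", "seriousness"],
)
print(parsed)
```

Separating the name from the score lets the name be one-hot encoded while the seriousness stays ordinal.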

Find religion from essay columns (similar to diet):

In [32]:
ok_cupid_df.loc[ok_cupid_df['essay0'] == "i just have to say that internet dating is so weird. i love perusing these intimate little snapshots of all of you, and trying to imagine what it would be like to go out with you, or what your skin smells like, or how your face moves when you talk. and probably the imagination is more fun then the actual date. meaning, that undeniable chemistry that you feel when you meet someone is not something you can feel except within sight of that person. and btw i would really like to hear the scientific explanation for the way i can feel someone looking at me and raise my head and zero in on that person's face in two milliseconds without even having to look around. i already know exactly the coordinates of those eyes. what is that? but back to what i was saying, which is that this is really a poor substitute for meeting you in person. but, considering that most of the time i'm too busy to even read my damn personal email, i guess it will have to do. (update; it is much easier to keep up with email now that it is in my phone.)  i was raised atheist by scientists and now i go to church sometimes. i am dichotomous. i have been called little miss kick ass, and if there is one reason why i'm still single, i think it's probably summed up best by my chinese horoscope -- fire dragon.", 'religion'] =  'christianity but not too serious about it'

# na=False treats missing essays as non-matches (a != np.nan comparison is always True and filters nothing)
unknown_religion = ok_cupid_df['religion'] == 'unknown_religion'
mentions_agnostic = ok_cupid_df['essay0'].str.contains("agnostic", na=False)
ok_cupid_df.loc[unknown_religion & mentions_agnostic, 'religion'] = 'agnosticism'

# recompute the mask so rows just labelled as agnosticism are not overwritten
unknown_religion = ok_cupid_df['religion'] == 'unknown_religion'
mentions_atheist = ok_cupid_df['essay0'].str.contains("atheist", na=False)
ok_cupid_df.loc[unknown_religion & mentions_atheist, 'religion'] = 'atheism'

ok_cupid_df.loc[ok_cupid_df['essay6'] == "i'm an atheist. i am usually not a closeted anti-theist, but sometimes i am. i think this is it, but the notion of an afterlife, like many of the popular religions profess to, would be a pretty rad idea. unfortunately, faith alone, isn't a basis with which i wish to base this enticing concept. having said that, i'm generally a subscriber of the old adage, live life and let live. if it rains for 40 days and 40 nights i'm not going to start building an ark, but if you do, it's ok, as long as i get to laugh during construction time.", 'religion'] =  'atheism'
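Since many essays are missing, keyword searches on the essay columns need to handle NaN explicitly. The small sketch below, on a made-up object-dtype series (matching the `object` dtype reported by `info()` above), shows why comparing against `np.nan` does not filter anything and how `notna()` and `na=False` behave instead:

```python
import numpy as np
import pandas as pd

# Made-up essays; object dtype matches what read_csv produced for the essay columns.
essays = pd.Series(["i am agnostic", np.nan, "i like hiking"], dtype=object)

# NaN never compares equal to anything, including itself,
# so this comparison is True for every row and filters nothing.
not_nan_by_comparison = essays != np.nan

# notna() identifies the missing entry correctly.
not_nan = essays.notna()

# na=False makes str.contains return False for missing essays,
# which yields a clean boolean mask.
mentions_agnostic = essays.str.contains("agnostic", na=False)

print(not_nan_by_comparison.tolist())
print(not_nan.tolist())
print(mentions_agnostic.tolist())
```

With `na=False`, the result can be combined with other boolean masks using `&` without a separate null check.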

Save the dataframe for use in the models notebook:

In [33]:
ok_cupid_df.to_csv(r'data/okcupid_profiles_half_clean.csv', index = False, header=True)