# Introduction

In recent years, the use of dating apps to find love has risen sharply. Many of these apps use sophisticated data science techniques to recommend possible matches and to optimize the user experience. They also give us access to a wealth of information we have never had before about how different people experience romance.

The purpose of this project is to practice formulating questions and applying machine learning techniques to answer them.

**Data sources:**

`profiles.csv` was provided by Codecademy.com.

## Scoping

### Goals

In this project, the goal is to apply the skills learned through Codecademy, in particular machine learning techniques, to a data set.

The primary research question is whether the other variables in a user's profile are linked to that user's religious commitment (or lack thereof).

This project is important because users have a strong incentive to describe themselves accurately and truthfully: misrepresenting themselves carries a major cost, namely not finding a suitable partner. The data therefore affords a look at answers that individuals are motivated to report honestly.


### Data

The project has one data set, provided by Codecademy, called `profiles.csv`. Each row represents an OkCupid user, and the columns hold that user's responses to the profile questions, which include multiple-choice and short-answer questions.

### Analysis

This solution will use descriptive statistics and data visualization to find key figures for understanding the distribution, count, and relationship between variables. Since the goal of the project is to make predictions about a user's religious commitment, classification algorithms from the supervised learning family of machine learning models will be implemented.

### Evaluation

The project will conclude with an evaluation of the selected machine learning model on a validation data set. The predictions can be checked with a confusion matrix and with metrics such as accuracy, precision, recall, F1, and Kappa scores.
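As a rough illustration of some of the metrics named above, the following sketch computes a confusion matrix, accuracy, and Cohen's kappa by hand for a handful of made-up labels. The label lists and the function names (`confusion_matrix`, `accuracy`, `cohen_kappa`) are assumptions for illustration only and are not results from this project; in practice a library such as scikit-learn would normally supply these metrics.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a nested list: rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Share of predictions on the diagonal (true label == predicted label)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Agreement between truth and prediction beyond what chance would produce."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of the marginal shares, summed over labels
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical labels, made up for this example
y_true = [0, 0, 2, 2, 3, 3, 5, 5]
y_pred = [0, 0, 2, 3, 3, 3, 5, 2]
labels = [0, 2, 3, 5]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print("accuracy:", accuracy(matrix))
print("kappa:", round(cohen_kappa(matrix), 3))
```

Kappa is worth reporting next to accuracy because it discounts the agreement a model would reach by chance, which matters when some classes are much more common than others.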


## Imports

In [1]:
import pandas as pd

## Loading the Data

In [2]:
profiles = pd.read_csv('profiles.csv')
profiles.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


The columns in the dataset include: 

- **age:** continuous variable of age of user
- **body_type:** categorical variable of body type of user
- **diet:** categorical variable of dietary information
- **drinks:**  categorical variable of alcohol consumption
- **drugs:** categorical variable of drug usage
- **education:** categorical variable of educational attainment
- **ethnicity:** categorical variable of ethnic backgrounds
- **height:** continuous variable of height of user
- **income:** continuous variable of income of user
- **job:** categorical variable of employment description
- **offspring:** categorical variable of children status
- **orientation:** categorical variable of sexual orientation
- **pets:** categorical variable of pet preferences
- **religion:** categorical variable of religious background
- **sex:** categorical variable of gender
- **sign:** categorical variable of astrological sign
- **smokes:** categorical variable of smoking habits
- **speaks:** categorical variable of languages spoken
- **status:** categorical variable of relationship status
- **last_online:** date variable of last login
- **location:** categorical variable of user locations

And a set of open short-answer responses to the following prompts:

- **essay0:** My self summary
- **essay1:** What I’m doing with my life
- **essay2:** I’m really good at
- **essay3:** The first thing people usually notice about me
- **essay4:** Favorite books, movies, shows, music, and food
- **essay5:** The six things I could never do without
- **essay6:** I spend a lot of time thinking about
- **essay7:** On a typical Friday night I am
- **essay8:** The most private thing I am willing to admit
- **essay9:** You should message me if…

In [3]:
list(profiles.columns)

['age',
 'body_type',
 'diet',
 'drinks',
 'drugs',
 'education',
 'essay0',
 'essay1',
 'essay2',
 'essay3',
 'essay4',
 'essay5',
 'essay6',
 'essay7',
 'essay8',
 'essay9',
 'ethnicity',
 'height',
 'income',
 'job',
 'last_online',
 'location',
 'offspring',
 'orientation',
 'pets',
 'religion',
 'sex',
 'sign',
 'smokes',
 'speaks',
 'status']

## Data Exploration

Let's look at the `religion` column, which combines a religion with a level of adherence. There are only a few distinct religions, but the qualifiers describing the level of seriousness inflate the number of distinct answers to 45.

In [4]:
print('number of religions:', profiles.religion.nunique())
print('religions:', profiles.religion.unique())

number of religions: 45
religions: ['agnosticism and very serious about it'
 'agnosticism but not too serious about it' nan 'atheism' 'christianity'
 'christianity but not too serious about it'
 'atheism and laughing about it' 'christianity and very serious about it'
 'other' 'catholicism' 'catholicism but not too serious about it'
 'catholicism and somewhat serious about it'
 'agnosticism and somewhat serious about it'
 'catholicism and laughing about it' 'agnosticism and laughing about it'
 'agnosticism' 'atheism and somewhat serious about it'
 'buddhism but not too serious about it'
 'other but not too serious about it' 'buddhism'
 'other and laughing about it' 'judaism but not too serious about it'
 'buddhism and laughing about it' 'other and somewhat serious about it'
 'other and very serious about it' 'hinduism but not too serious about it'
 'atheism but not too serious about it' 'judaism'
 'christianity and somewhat serious about it'
 'hinduism and very serious about it' 'atheis

### Label cleanup

Each answer consists of a religion, optionally followed by an adherence qualifier. Splitting the religion off from the rest of the string leaves the qualifier, which is the label we will use for our predictions.
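A minimal sketch of that split, using three answers that appear in the output above. The helper name `split_answer` is hypothetical and only shows how `str.partition` separates the religion from its qualifier; the notebook's own functions follow in the next cells.

```python
def split_answer(answer):
    """Split a religion answer into (religion, qualifier)."""
    # partition() splits at the first space only; the qualifier is ''
    # when the answer contains no space
    religion, _, qualifier = answer.partition(' ')
    return religion, qualifier

for answer in ['atheism',
               'agnosticism and very serious about it',
               'other but not too serious about it']:
    print(split_answer(answer))
```

Because every religion in the data is a single word, splitting at the first space is enough to separate the two parts.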

In [5]:
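def get_religion(religion_str):
    """Map the religion named in a profile answer to an integer code."""
    religion_map = {'unknown': -1, 'atheism': 0, 'agnosticism': 1,
                    'christianity': 2, 'catholicism': 3, 'judaism': 4,
                    'buddhism': 5, 'hinduism': 6, 'islam': 7,
                    'other': 8}
    # The religion is always the first word; any qualifier follows the
    # first space. split() returns the whole string when there is no
    # space, so answers without a qualifier need no special case.
    religion = religion_str.split(' ', 1)[0]
    return religion_map[religion]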

Some of the answers for religion have no qualifier. My guess is that such a generic answer falls between "not too serious" and "somewhat serious", so it is coded as 3. If no religion was specified, the answer gets its own seriousness code of 0.

In [6]:
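def get_religious_seriousness(religion_str):
    """Map the adherence qualifier of a profile answer to an ordinal code (0-5)."""
    serious_map = {'and laughing about it': 1,
                   'but not too serious about it': 2,
                   'and somewhat serious about it': 4,
                   'and very serious about it': 5}
    if religion_str == 'unknown':
        # No religion was given at all
        return 0
    if ' ' not in religion_str:
        # Religion given without a qualifier: assumed to sit between
        # "not too serious" (2) and "somewhat serious" (4)
        return 3
    # Everything after the first space is the qualifier
    seriousness = religion_str.split(' ', 1)[1]
    return serious_map[seriousness]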

In [7]:
profiles['religion'] = profiles['religion'].fillna('unknown')
profiles['religious_code'] = profiles.religion.apply(get_religion)
profiles['religious_seriousness'] = profiles.religion.apply(get_religious_seriousness)
profiles.head(5)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,orientation,pets,religion,sex,sign,smokes,speaks,status,religious_code,religious_seriousness
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single,1,5
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single,1,2
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,straight,has cats,unknown,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available,-1,0
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,straight,likes cats,unknown,m,pisces,no,"english, german (poorly)",single,-1,0
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,straight,likes dogs and likes cats,unknown,m,aquarius,no,english,single,-1,0


In [8]:
print("number of levels of seriousness:",profiles.religious_seriousness.nunique())
print("levels of seriousness:", profiles.religious_seriousness.unique())

profiles.religious_seriousness.value_counts()


number of levels of seriousness: 6
levels of seriousness: [5 2 0 3 1 4]


0    20226
2    12212
3    11781
1     8995
4     4516
5     2216
Name: religious_seriousness, dtype: int64
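
The counts above are unevenly spread across the six levels, which matters when reading accuracy later on. As a small sketch, the shares can be computed directly from the printed counts; the dictionary below simply copies those numbers, and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {0: 20226, 2: 12212, 3: 11781, 1: 8995, 4: 4516, 5: 2216}

total = sum(counts.values())
shares = {level: n / total for level, n in counts.items()}

for level in sorted(shares):
    print(f"level {level}: {shares[level]:.1%}")

# Accuracy of always predicting the most common level: a simple
# baseline to compare a classifier against
majority_level = max(counts, key=counts.get)
print("majority level:", majority_level)
print("baseline accuracy:", round(shares[majority_level], 3))
```

With the DataFrame in memory, `profiles.religious_seriousness.value_counts(normalize=True)` gives the same shares.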