# Introduction

This data science project analyzes data from the dating application OkCupid. As the use of dating apps continues to rise, the amount of data collected has risen as well. This data can be used to gain insights and uncover patterns that would be difficult to find without data science techniques.

The goals of this project are to scope, prepare, and analyze the data, and then build a machine learning model from it.

**Data Sources**

`profiles.csv`, provided by Codecademy.com (not added to the git repository due to file size restrictions).

### Project Goals

The primary goal of this project is to apply the skills learned through the Codecademy Data Science career path, in particular machine learning techniques, to a dataset. The question in focus is whether a user's astrological sign can be predicted from the other variables in their profile.

### Data

The data is provided by Codecademy in `profiles.csv`. Each row represents an individual user, while the columns represent the features from their profile.

### Analysis

This project will use descriptive statistics and data visualization to find key features and relationships between variables. Since the target is a user's astrological sign, which is a categorical label, a supervised classification algorithm will be used.

### Evaluation

The project will conclude by evaluating the model's performance with a confusion matrix and metrics such as accuracy, precision, recall, and F1 score.
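To make these metrics concrete, here is a minimal sketch of how they can be derived from a confusion matrix. The labels `y_true` and `y_pred` are made up for illustration and are not results from this project; only pandas is used, although scikit-learn's `sklearn.metrics.confusion_matrix` and `classification_report` provide equivalent functionality.

```python
import pandas as pd

# Hypothetical true and predicted labels, for illustration only
# (not results from this project).
y_true = pd.Series(["leo", "leo", "gemini", "gemini", "cancer", "cancer"], name="actual")
y_pred = pd.Series(["leo", "gemini", "gemini", "gemini", "cancer", "leo"], name="predicted")

labels = sorted(y_true.unique())

# Confusion matrix: rows are actual signs, columns are predicted signs.
confusion = (
    pd.crosstab(y_true, y_pred)
    .reindex(index=labels, columns=labels, fill_value=0)
)

# Diagonal entries are the correctly classified samples per sign.
correct = pd.Series({label: confusion.loc[label, label] for label in labels})

accuracy = correct.sum() / confusion.to_numpy().sum()
precision = correct / confusion.sum(axis=0)  # correct / all predicted as that sign
recall = correct / confusion.sum(axis=1)     # correct / all actually that sign
f1 = 2 * precision * recall / (precision + recall)

print(confusion)
print(f"accuracy: {accuracy:.2f}")
print(pd.DataFrame({"precision": precision, "recall": recall, "f1": f1}).round(2))
```

With twelve roughly balanced signs, random guessing would give an accuracy near 1/12, which is a useful baseline when reading these numbers.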



### Importing required libraries

In [17]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np 
pd.set_option('display.max_columns', None)

### Load the dataset


In [18]:
data = pd.read_csv('profiles.csv')
data.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...","books:<br />\nabsurdistan, the republic, of mi...",food.<br />\nwater.<br />\ncell phone.<br />\n...,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet!<br />\nyou...,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories.<br /...,,,i am very open and will share just about anyth...,,white,70.0,80000,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement<br />\nconversation<br />\ncreation<b...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ...",,68.0,-1,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . .<br />\nlynch, j...",,cats and german philosophy,,,you feel so inclined.,white,71.0,20000,student,2012-06-28-14-22,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians<br />\nat the...",,,,,,"asian, black, other",66.0,-1,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


### Understanding the data

We can extract some key information from the data by listing the columns present, counting the rows, and checking whether any column names need to be renamed for easier access.

In [19]:
print(data.columns)
print(data.shape)

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')
(59946, 31)


From the information above, we can see that there is plenty of data available for training a machine learning model: 59,946 rows and 31 columns. The column names are already short and lowercase, so none need to be renamed.
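Row and column counts alone do not show how complete the data is. The following is a minimal sketch of a per-column overview of data types and missing values. It runs on a tiny stand-in frame named `sample` with made-up rows; on the real data the same expressions would be applied to `data`. Treating `-1` in `income` as a placeholder for "not reported" is an assumption based on the preview above, not something the dataset documents here.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the profiles data (made-up rows; the real frame comes
# from pd.read_csv('profiles.csv')).
sample = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["strictly anything", "mostly other", "anything", np.nan],
    "income": [-1, 80000, -1, 20000],
    "sign": ["gemini", "cancer", np.nan, "pisces"],
})

# One row per column: its dtype and how many values are missing.
overview = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
})
print(overview.sort_values("missing", ascending=False))

# Assumption: -1 in income is a sentinel rather than a real value.
# isna() does not count it, so it has to be checked separately.
income_placeholders = (sample["income"] == -1).sum()
print(income_placeholders)
```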

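As we are interested in the astrological signs of the users, let's look at that column and do any necessary cleaning. Many entries append a qualifier to the sign (for example, `pisces but it doesn&rsquo;t matter`), so we keep only the first word.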

In [20]:
print(data['sign'])
# Keep only the first whitespace-separated token (the sign itself); missing values stay NaN
data['sign_cleaned'] = data['sign'].str.split().str.get(0)
print(data['sign_cleaned'])

0                                          gemini
1                                          cancer
2              pisces but it doesn&rsquo;t matter
3                                          pisces
4                                        aquarius
                           ...                   
59941    cancer and it&rsquo;s fun to think about
59942             leo but it doesn&rsquo;t matter
59943     sagittarius but it doesn&rsquo;t matter
59944       leo and it&rsquo;s fun to think about
59945    gemini and it&rsquo;s fun to think about
Name: sign, Length: 59946, dtype: object
0             gemini
1             cancer
2             pisces
3             pisces
4           aquarius
            ...     
59941         cancer
59942            leo
59943    sagittarius
59944            leo
59945         gemini
Name: sign_cleaned, Length: 59946, dtype: object


In [21]:
print(data.sign_cleaned.nunique())
print(data.sign_cleaned.unique())

12
['gemini' 'cancer' 'pisces' 'aquarius' 'taurus' 'virgo' 'sagittarius'
 'leo' nan 'aries' 'libra' 'scorpio' 'capricorn']
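The output above lists `nan` among the unique values even though `nunique()` reports 12, because `nunique()` ignores missing values by default. A minimal sketch of how to count the missing signs and inspect the class balance follows. The series `sign` is a made-up stand-in for `data['sign']` that mimics the formats shown earlier.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for data['sign'], mimicking the formats shown above.
sign = pd.Series([
    "gemini",
    "pisces but it doesn&rsquo;t matter",
    "leo and it&rsquo;s fun to think about",
    np.nan,
    "leo",
])
sign_cleaned = sign.str.split().str.get(0)

# nunique() skips NaN by default, so only real signs are counted.
print(sign_cleaned.nunique())

# Number of profiles with no sign at all.
print(sign_cleaned.isna().sum())

# dropna=False keeps the missing group visible next to the real signs.
print(sign_cleaned.value_counts(dropna=False))
```

Rows without a sign cannot serve as labeled examples for a classifier, so knowing how many there are helps decide how to handle them before training.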


Now that the `sign` column contains only the 12 sign names (plus missing values), the next step is to visualize the data and see which features relate to it. That will allow us to remove uninformative features that would only add noise.
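One way to prepare such a comparison is a normalized cross-tabulation of the sign against a categorical feature. This is a sketch under assumptions: the frame `sample` holds made-up rows, and `drinks` is chosen only as an example feature.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the dataset.
sample = pd.DataFrame({
    "sign_cleaned": ["leo", "leo", "gemini", "gemini", "cancer", "cancer"],
    "drinks": ["socially", "often", "socially", "socially", "often", "socially"],
})

# Share of each drinking habit within each sign (every row sums to 1).
shares = pd.crosstab(sample["sign_cleaned"], sample["drinks"], normalize="index")
print(shares.round(2))
```

If the rows of `shares` look nearly identical across signs, the feature carries little information about the sign. The same table could be drawn as a heatmap, for example with `sns.heatmap(shares)`.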