
## BA820 Project: OkCupid Profiles

In [2]:
# Importing libraries and mounting the drive

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
!pip install -U googlemaps
drive.mount('/content/drive')
data_folder = '/content/drive/MyDrive/Colab Notebooks/BA820/Data/'

Collecting googlemaps
  Downloading googlemaps-4.10.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: googlemaps
  Building wheel for googlemaps (setup.py) ... done
  Created wheel for googlemaps: filename=googlemaps-4.10.0-py3-none-any.whl size=40711 sha256=de16f3851ed4c6984ca00471a9426b1495751baf6e3ca938a4c967536573dddb
  Stored in directory: /root/.cache/pip/wheels/17/f8/79/999d5d37118fd35d7219ef57933eb9d09886c4c4503a800f84
Successfully built googlemaps
Installing collected packages: googlemaps
Successfully installed googlemaps-4.10.0
Mounted at /content/drive


In [3]:
data = pd.read_csv(data_folder+'okcupid_profiles.csv')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   sex          59946 non-null  object 
 3   orientation  59946 non-null  object 
 4   body_type    54650 non-null  object 
 5   diet         35551 non-null  object 
 6   drinks       56961 non-null  object 
 7   drugs        45866 non-null  object 
 8   education    53318 non-null  object 
 9   ethnicity    54266 non-null  object 
 10  height       59943 non-null  float64
 11  income       59946 non-null  int64  
 12  job          51748 non-null  object 
 13  last_online  59946 non-null  object 
 14  location     59946 non-null  object 
 15  offspring    24385 non-null  object 
 16  pets         40025 non-null  object 
 17  religion     39720 non-null  object 
 18  sign         48890 non-null  object 
 19  smok

In [5]:
data.describe()

            age     height        income
count   59946.0    59943.0       59946.0
mean   32.34029  68.295281  20033.222534
std    9.452779   3.994803  97346.192104
min        18.0        1.0          -1.0
25%        26.0       66.0          -1.0
50%        30.0       68.0          -1.0
75%        37.0       71.0          -1.0
max       110.0       95.0     1000000.0


In [6]:
num_rows, num_columns = data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Number of rows: 59946
Number of columns: 31


In [7]:
# Correlate the numeric columns only (age, height, income)
data.corr(numeric_only=True)


              age     height     income
age           1.0  -0.022262  -0.001004
height  -0.022262        1.0   0.065049
income  -0.001004   0.065049        1.0


In [8]:
data.isna().sum()

age                0
status             0
sex                0
orientation        0
body_type       5296
diet           24395
drinks          2985
drugs          14080
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
offspring      35561
pets           19921
religion       20226
sign           11056
smokes          5512
speaks            50
essay0          5488
essay1          7572
essay2          9638
essay3         11476
essay4         10537
essay5         10850
essay6         13771
essay7         12451
essay8         19225
essay9         12603
dtype: int64
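
Raw counts are easier to compare as shares of the rows. A minimal sketch, using a small invented frame in place of `data` (the `toy` name and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; values are invented for illustration.
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, "vegan"],
    "offspring": [np.nan, np.nan, np.nan, "has kids"],
})

# isna().mean() gives the fraction of missing entries per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `data`, the same expression would rank the columns from most to least incomplete.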

In [9]:
data["income"] = data["income"].replace(-1, 0)
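
The `describe()` output shows -1 at the 25th, 50th, and 75th percentiles of `income`, so -1 appears to be a placeholder for an unreported income rather than a real value. Replacing it with 0, as in the cell above, counts those profiles as zero earners. An alternative, sketched here on invented values and under that assumption, is to mark them as missing so that summary statistics skip them:

```python
import numpy as np
import pandas as pd

# Invented sample; -1 is assumed to mean "income not reported".
income = pd.Series([-1, 20000, -1, 80000, -1])

as_zero = income.replace(-1, 0)          # approach used in the cell above
as_missing = income.replace(-1, np.nan)  # alternative: treat as missing

print(as_zero.mean())     # averages over all five rows
print(as_missing.mean())  # averages over the two reported incomes only
```

Which option fits depends on whether later steps should interpret an unreported income as zero or as unknown.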

Convert non-numeric values in the "age" column to NaN, remove rows where "age" is NaN, and cap ages above 100 at 100.

In [10]:
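# Coerce non-numeric entries to NaN
data['age'] = pd.to_numeric(data['age'], errors='coerce')

# Drop rows without a valid age; copy so later assignments act on an independent frame
data = data.dropna(subset=['age']).copy()

# Cap ages above 100 at 100
data['age'] = data['age'].clip(upper=100)

data["age"]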

0        22
1        35
2        38
3        23
4        29
         ..
59941    59
59942    24
59943    42
59944    27
59945    39
Name: age, Length: 59946, dtype: int64

Map the raw body type labels to standardized categories, fill any unmapped or missing values in the "body_type" column with 'rather not say', and count each category to show the distribution of body types.

In [11]:
category_mapping = {
    'curvy': 'curvy',
    'Thin': 'thin', 'skinny': 'thin',
    'athletic': 'fit', 'fit': 'fit',
    'rather not say': 'rather not say', 'nan': 'rather not say',
    'Overweight': 'Overweight',
    'used up': 'average', 'average': 'average',
    'Jacked': 'jacked',
    'A little extra': 'A little extra',
    'full figured': 'full figured'
}

data['body_type'] = data['body_type'].map(category_mapping)

data['body_type'] = data['body_type'].fillna('rather not say')

data['body_type'].value_counts()

fit               24530
average           15007
rather not say    13699
curvy              3924
thin               1777
full figured       1009
Name: body_type, dtype: int64
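
The counts above contain no 'jacked', 'Overweight', or 'A little extra' rows even though `category_mapping` defines them. One possible explanation is that the capitalized keys do not match the raw labels, so those rows become NaN and are then filled with 'rather not say'. A minimal sketch of a case-insensitive lookup, on invented values and assuming the raw labels are lowercase strings (`mapping` here is a shortened, hypothetical stand-in for `category_mapping`):

```python
import numpy as np
import pandas as pd

# Invented sample; assumes the raw labels are lowercase strings.
body_type = pd.Series(["thin", "skinny", "jacked", "a little extra", np.nan])

# Hypothetical mapping with mixed-case keys, as in the cell above.
mapping = {"Thin": "thin", "skinny": "thin", "Jacked": "jacked",
           "A little extra": "a little extra"}

# Lowercase the keys so lookups no longer depend on capitalization.
lower_mapping = {key.lower(): value for key, value in mapping.items()}

strict = body_type.map(mapping).fillna("rather not say")
relaxed = body_type.str.lower().map(lower_mapping).fillna("rather not say")

print(strict.value_counts())   # unmatched labels fall into 'rather not say'
print(relaxed.value_counts())  # only the truly missing row does
```

Comparing `data['body_type'].unique()` before mapping against the dictionary keys would confirm whether this is what happened.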

Map the raw diet labels to standardized categories, fill any remaining missing values in the "diet" column with 'Prefer not to say', and count each category to show the distribution of diets.

In [12]:
diet_mapping = {
    'strictly vegetarian': 'vegetarian',
    'vegetarian': 'vegetarian',
    'mostly vegetarian': 'vegetarian',
    'strictly anything': 'Prefer not to say',
    'anything': 'Prefer not to say',
    'mostly anything': 'Prefer not to say',
    'mostly other': 'Prefer not to say',
    'strictly other': 'Prefer not to say',
    'other': 'Prefer not to say',
    'strictly vegan': 'vegan',
    'mostly vegan': 'vegan',
    'vegan': 'vegan',
    'strictly halal': 'halal',
    'halal': 'halal',
    'mostly halal': 'halal',
    'strictly kosher': 'kosher',
    'kosher': 'kosher',
    'mostly kosher': 'kosher'
}

data['diet'] = data['diet'].map(diet_mapping)

data['diet'] = data['diet'].fillna('Prefer not to say')

data['diet'].value_counts()

Prefer not to say    54066
vegetarian            4986
vegan                  702
kosher                 115
halal                   77
Name: diet, dtype: int64
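
Every key in `diet_mapping` is a base diet with an optional `strictly` or `mostly` prefix, so the same grouping can be derived by stripping the prefix instead of listing each combination. A sketch on invented values; the `base_diet` helper is hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

def base_diet(series: pd.Series) -> pd.Series:
    """Drop a leading 'strictly ' or 'mostly ' and group the rest."""
    base = series.str.replace(r"^(strictly|mostly)\s+", "", regex=True)
    # As in diet_mapping, 'anything' and 'other' are folded into the
    # catch-all label, together with missing values.
    base = base.replace({"anything": "Prefer not to say",
                         "other": "Prefer not to say"})
    return base.fillna("Prefer not to say")

sample = pd.Series(["mostly vegan", "strictly kosher", "anything",
                    "mostly other", np.nan, "halal"])
result = base_diet(sample).tolist()
print(result)
```

The explicit dictionary is easier to audit, while the prefix rule also covers any combination that was left out of the dictionary by accident.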

Fill missing values in the "drinks" column with 'prefer not to say', then count each category to show the distribution of drinking habits.

In [13]:
data['drinks'] = data['drinks'].fillna('prefer not to say')
data['drinks'].value_counts()

socially             41780
rarely                5957
often                 5164
not at all            3267
prefer not to say     2985
very often             471
desperately            322
Name: drinks, dtype: int64

Fill missing values in the "drugs" column with 'prefer not to say', then count each category to show the distribution of drug use.

In [14]:
data['drugs'] = data['drugs'].fillna('prefer not to say')
data['drugs'].value_counts()

never                37724
prefer not to say    14080
sometimes             7732
often                  410
Name: drugs, dtype: int64
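
The "drinks" and "drugs" cells apply the same fill, so the two columns could share one step. A sketch on an invented frame (the `toy` frame and the `cols` list are illustrative):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the two columns of `data`.
toy = pd.DataFrame({
    "drinks": ["socially", np.nan, "often"],
    "drugs": [np.nan, "never", np.nan],
})

cols = ["drinks", "drugs"]
# Fill both columns at once and assign the result back to the frame.
toy[cols] = toy[cols].fillna("prefer not to say")

print(toy["drugs"].value_counts())
```

Other columns that use the same placeholder label could be added to `cols`.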

Map the raw education labels to standardized categories, fill any remaining missing values in the "education" column with 'Prefer not to say', and count each category to show the distribution of education levels.

In [15]:
education_mapping = {
    'working on high school': 'In high school',
    'high school': 'In high school',
    'working on college/university': 'In college',
    'working on two-year college': 'In college',
    'college/university': 'In college',
    'two-year college': 'In college',
    'working on med school': 'In college',
    'med school': 'In college',
    'law school': 'In college',
    'working on law school': 'In college',
    'working on masters program': 'In grad school',
    'masters program': 'In grad school',
    'graduated from high school': 'High school degree',
    'graduated from college/university': 'College degree',
    'graduated from two-year college': 'College degree',
    'graduated from med school': 'College degree',
    'graduated from law school': 'College degree',
    'graduated from masters program': 'Graduate degree',
    'ph.d program': 'PHD',
    'working on ph.d program': 'PHD',
    'graduated from ph.d program': 'PHD title',
    'dropped out of ph.d program': 'PHD dropout',
    'dropped out of med school': 'Dropped out of high school',
    'dropped out of high school': 'Dropped out of high school',
    'dropped out of college/university': 'Dropped out of college',
    'dropped out of two-year college': 'Dropped out of college',
    'dropped out of law school': 'Dropped out of college',
    'dropped out of masters program': 'Dropped out of grad school',
    'working on space camp': 'In space camp',
    'space camp': 'In space camp',
    'graduated from space camp': 'Graduated from space camp',
    'dropped out of space camp': 'Dropped out of space camp'
}

data['education'] = data['education'].map(education_mapping)

data['education'] = data['education'].fillna('Prefer not to say')

data['education'].value_counts()

College degree                27058
Graduate degree                8961
In college                     8320
Prefer not to say              6628
In grad school                 1819
High school degree             1428
PHD title                      1272
Dropped out of college         1204
PHD                            1009
Graduated from space camp       657
Dropped out of space camp       523
In space camp                   503
In high school                  183
Dropped out of grad school      140
PHD dropout                     127
Dropped out of high school      114
Name: education, dtype: int64

Fill missing values in the "ethnicity" column with 'prefer not to say', then count each category to show the distribution of ethnicities.

In [16]:
data['ethnicity'] = data['ethnicity'].fillna('prefer not to say')
data['ethnicity'].value_counts()

white                                                                 32831
asian                                                                  6134
prefer not to say                                                      5680
hispanic / latin                                                       2823
black                                                                  2008
                                                                      ...  
middle eastern, indian, white                                             1
asian, middle eastern, black, white, other                                1
asian, middle eastern, indian, hispanic / latin, white, other             1
black, native american, indian, pacific islander, hispanic / latin        1
asian, black, indian                                                      1
Name: ethnicity, Length: 218, dtype: int64
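
The 218 distinct values arise because a profile can list several ethnicities in one comma-separated string. To count each label on its own, the strings can be split and exploded. A sketch on invented rows, assuming `', '` is the separator as it appears in the output above:

```python
import pandas as pd

# Invented sample in the same comma-separated style as the column.
ethnicity = pd.Series(["white", "asian, white", "black, indian",
                       "prefer not to say", "asian"])

# Split each string into a list, then give every label its own row.
labels = ethnicity.str.split(", ").explode()

# A profile listing two ethnicities contributes to two counts.
counts = labels.value_counts()
print(counts)
```

Because multi-label profiles are counted once per label, these totals can exceed the number of profiles.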

Handle missing values in the "height" column by filling them with the median height.