### Import Python packages and read the CSV into a pandas DataFrame

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../original_data/okcupid/okcupid_profiles.csv')

I don't think I need to reset the index: the CSV has no index column, so pandas adds a default one when it reads the file.
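A quick way to check this, sketched on a tiny in-memory CSV rather than the real file (the two rows below are made up for illustration): when `index_col` is not passed, `pd.read_csv` assigns a default `RangeIndex`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with no index column, standing in for the real file.
csv_text = "age,sex\n22,m\n35,f\n"

toy = pd.read_csv(io.StringIO(csv_text))

# With no index_col argument, pandas assigns a default 0..n-1 RangeIndex.
print(type(toy.index).__name__)
print(list(toy.index))
```

The same check on the real data would be `type(df.index)`, which matches the `RangeIndex: 59946 entries` line that `df.info()` prints.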

### General dataframe info

In [2]:
print(df.info())
print(len(df.age))
print(df.columns)
print(df.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   sex          59946 non-null  object 
 3   orientation  59946 non-null  object 
 4   body_type    54650 non-null  object 
 5   diet         35551 non-null  object 
 6   drinks       56961 non-null  object 
 7   drugs        45866 non-null  object 
 8   education    53318 non-null  object 
 9   ethnicity    54266 non-null  object 
 10  height       59943 non-null  float64
 11  income       59946 non-null  int64  
 12  job          51748 non-null  object 
 13  last_online  59946 non-null  object 
 14  location     59946 non-null  object 
 15  offspring    24385 non-null  object 
 16  pets         40025 non-null  object 
 17  religion     39720 non-null  object 
 18  sign         48890 non-null  object 
 19  smok

In [3]:
print(df.head()) 
# I put this in a separate cell because sometimes I like to run the other four without this one. 

   age     status sex orientation       body_type               diet  \
0   22     single   m    straight  a little extra  strictly anything   
1   35     single   m    straight         average       mostly other   
2   38  available   m    straight            thin           anything   
3   23     single   m    straight            thin         vegetarian   
4   29     single   m    straight        athletic                NaN   

     drinks      drugs                          education  \
0  socially      never      working on college/university   
1     often  sometimes              working on space camp   
2  socially        NaN     graduated from masters program   
3  socially        NaN      working on college/university   
4  socially      never  graduated from college/university   

             ethnicity  ...  \
0         asian, white  ...   
1                white  ...   
2                  NaN  ...   
3                white  ...   
4  asian, black, other  ...   

             

### Checking what the possible values of sex variable are

In [4]:
print(df.sex.unique())

['m' 'f']


### Check for duplicates

In [5]:
duplicates = df.duplicated()
print(duplicates.value_counts())

False    59946
dtype: int64


There are no duplicates.

### Check for missing

In [6]:
# This tells us if there are null values somewhere in the df
print(df.isnull().values.any())

# This prints out the sum of null values by column 
print(df.isnull().sum())

True
age                0
status             0
sex                0
orientation        0
body_type       5296
diet           24395
drinks          2985
drugs          14080
education       6628
ethnicity       5680
height             3
income             0
job             8198
last_online        0
location           0
offspring      35561
pets           19921
religion       20226
sign           11056
smokes          5512
speaks            50
essay0          5488
essay1          7572
essay2          9638
essay3         11476
essay4         10537
essay5         10850
essay6         13771
essay7         12451
essay8         19225
essay9         12603
dtype: int64


There are missing values in some columns; I will address them later, as they come up in the analyses.
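Raw counts are hard to compare across columns, so a percent-missing view can help decide which columns to deal with first. A minimal sketch on a made-up toy frame (the values are placeholders, not rows from the profiles data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use df instead.
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, "vegetarian"],
    "height": [75.0, 70.0, np.nan, 71.0],
})

# isnull().mean() gives the fraction of missing values per column.
pct_missing = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing)
```

Swapping `toy` for `df` would give the same table for all 31 columns.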

### Income 

In [7]:
# Explore the income variable here
print(df.income.isnull().sum())
# While this tells us there are no missing in income, -1 indicates the person chose not to provide income

print(df.income.nunique())
print(df.income.unique())
# One-way freq table of income
print(pd.crosstab(index=df.income, columns='count'))

# Percent -1 AKA missing
print((48442/59946)*100)

# Count of rows where income != -1
print(59946 - 48442)

0
13
[     -1   80000   20000   40000   30000   50000   60000 1000000  150000
  100000  500000   70000  250000]
col_0     count
income         
-1        48442
 20000     2952
 30000     1048
 40000     1005
 50000      975
 60000      736
 70000      707
 80000     1111
 100000    1621
 150000     631
 250000     149
 500000      48
 1000000    521
80.80939512227671
11504


While print(df.income.isnull().sum()) tells us there are no missing income values, -1 indicates the person chose not to provide an income value ([via okcupid_codebook_revised.txt](https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook_revised.txt)).
We can consider -1 values as missing values. 
From the one-way frequency table of income, we can see that about 81% of respondents (48442 of 59946) did not provide an income value, which makes the income variable much less useful.
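If -1 is treated as missing, one option is to recode it as `NaN` so that `isnull()` and the other pandas missing-data tools pick it up, and to compute the share from the data instead of typing the counts in by hand. A minimal sketch on made-up toy values (the column name `income_clean` is my own placeholder):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -1 is the codebook's "not provided" sentinel.
toy = pd.DataFrame({"income": [-1, 80000, -1, 20000, -1]})

# Share of rows carrying the sentinel, computed rather than hard-coded.
pct_not_provided = (toy["income"] == -1).mean() * 100

# Recode the sentinel as NaN so pandas treats it as missing.
toy["income_clean"] = toy["income"].replace(-1, np.nan)

print(pct_not_provided)
print(toy["income_clean"].isnull().sum())
```

Note that the recoded column becomes `float64`, because `NaN` cannot be stored in an `int64` column.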

There are situations where I could justify replacing missing numerical data with an average or some other value, but filling in ~80% of the income data would compromise the integrity of the data. I don't think anything useful would be gained from such an endeavor.

I can do income-related analyses with the ~20% of respondents who did provide an income value, but I cannot use these analyses to make statements about all 59946 respondents. I will do analyses with that ~20% (e.g. a histogram of income responses) for the purpose of demonstrating my quantitative data analysis skills.

Note that the income responses are already discrete as opposed to continuous. 

For the record, as an OkCupid user (in 2014 but not in 2015), I have no recollection of OkCupid ever asking for income data, but perhaps I just forgot or overlooked that question.

Count of rows where income != -1 is 11504

#### Create a new dataframe that has only the rows where income != -1

In [8]:
# Create a new dataframe that has only the rows where income != -1
df_hasincome = df[df.income != -1]
print(df_hasincome.info())
print(len(df_hasincome.age))
print(df_hasincome.columns)
print(df_hasincome.dtypes)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11504 entries, 1 to 59943
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          11504 non-null  int64  
 1   status       11504 non-null  object 
 2   sex          11504 non-null  object 
 3   orientation  11504 non-null  object 
 4   body_type    10938 non-null  object 
 5   diet         7465 non-null   object 
 6   drinks       11277 non-null  object 
 7   drugs        9692 non-null   object 
 8   education    10783 non-null  object 
 9   ethnicity    10784 non-null  object 
 10  height       11504 non-null  float64
 11  income       11504 non-null  int64  
 12  job          11163 non-null  object 
 13  last_online  11504 non-null  object 
 14  location     11504 non-null  object 
 15  offspring    5612 non-null   object 
 16  pets         8717 non-null   object 
 17  religion     9019 non-null   object 
 18  sign         10136 non-null  object 
 19  smok

In [9]:
# This tells us there are null values somewhere in the df
print("(df_hasincome.isnull().values.any())")
print(df_hasincome.isnull().values.any())

# This prints out the sum of null values by column 
print("(df_hasincome.isnull().sum())")
print(df_hasincome.isnull().sum())

(df_hasincome.isnull().values.any())
True
(df_hasincome.isnull().sum())
age               0
status            0
sex               0
orientation       0
body_type       566
diet           4039
drinks          227
drugs          1812
education       721
ethnicity       720
height            0
income            0
job             341
last_online       0
location          0
offspring      5892
pets           2787
religion       2485
sign           1368
smokes          531
speaks            5
essay0          903
essay1         1221
essay2         1522
essay3         1817
essay4         1900
essay5         1945
essay6         2301
essay7         2032
essay8         3084
essay9         2013
dtype: int64


Fun thing: the 3 people with a missing height response are not in this subset, which means none of them provided an income value. <br>
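This observation can be checked directly instead of inferred from the null counts. A minimal sketch on made-up toy rows (on the real data, `toy` would be `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row with a missing height also has income == -1.
toy = pd.DataFrame({
    "height": [75.0, np.nan, 70.0, np.nan],
    "income": [80000, -1, -1, -1],
})

# Income values of the rows where height is missing.
income_where_height_missing = toy.loc[toy["height"].isnull(), "income"]
print(income_where_height_missing.tolist())

# True if none of the missing-height rows reported an income.
all_unreported = (income_where_height_missing == -1).all()
print(all_unreported)
```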

In [10]:
print(df_hasincome.income.nunique())
print(df_hasincome.income.unique())
# One-way freq table of income
print(pd.crosstab(index=df_hasincome.income, columns='count'))

12
[  80000   20000   40000   30000   50000   60000 1000000  150000  100000
  500000   70000  250000]
col_0    count
income        
20000     2952
30000     1048
40000     1005
50000      975
60000      736
70000      707
80000     1111
100000    1621
150000     631
250000     149
500000      48
1000000    521


I wonder if the income distribution of respondents resembles the San Francisco income distribution at the time of collection. I'm putting this here because it is a thought I had at this point in the analysis, and it's my analysis. I would need to confirm the year of the data collection. I think Kaggle said 2015, but the data owner's GitHub said "2010s". I will have to search the data owner's GitHub for the year later.

Maybe it is because it is late and I am tired, but now I'm wondering: since the income data is already binned, is it considered categorical? Would a histogram be appropriate? This is definitely me being tired and a sign that it is time for bed.
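On the question above: because the responses are bracket labels with uneven gaps between them, one option is to treat income as an ordered categorical and look at counts per bracket (the table a bar chart would draw) rather than a histogram on a continuous axis. A minimal sketch under that assumption, on a made-up toy sample:

```python
import pandas as pd

# Hypothetical toy sample of bracketed income responses.
income = pd.Series([20000, 100000, 20000, 50000, 1000000, 50000, 20000])

# The brackets actually present, in ascending order.
brackets = sorted(income.unique())

# An ordered categorical keeps the bracket order without implying equal spacing.
income_cat = pd.Categorical(income, categories=brackets, ordered=True)

# Counts per bracket, sorted by bracket order rather than by frequency.
counts = pd.Series(income_cat).value_counts().sort_index()
print(counts)
```

Calling `counts.plot.bar()` would then draw one bar per bracket, with evenly spaced bars regardless of the dollar gap between brackets.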

### Notes to self for later

Best rounding practices? <br>

### Possible things I can do with data:

make sure you're using the words "value" and "variable" right <br>

deal with missing values <br>

1. split the sex column into two columns (male and female), or change the sex column so that male = 1 and female = 0
2. maybe take advantage of age being a number to do some quantitative analyses 
3. make demographics tables (including treating age as a categorical variable)
4. I can potentially do quantitative analyses with income 
5. This is your reminder to ADD HYPERLINKS TO read_me.md LATER <br>
to be continued
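A rough sketch of items 1 and 3 from the list above, on a made-up toy frame. The names `is_male`, `sex_dummies`, and `age_band`, and the age band edges, are my own placeholders and not decisions made anywhere in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the profiles data.
toy = pd.DataFrame({"sex": ["m", "f", "f", "m"], "age": [22, 35, 58, 41]})

# Option A: one indicator column, male = 1 and female = 0.
toy["is_male"] = toy["sex"].map({"m": 1, "f": 0})

# Option B: one indicator column per value of sex.
sex_dummies = pd.get_dummies(toy["sex"], prefix="sex")

# Age as a categorical variable; these band edges are arbitrary placeholders.
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[17, 29, 39, 49, 120],
    labels=["18-29", "30-39", "40-49", "50+"],
)

print(toy)
print(sex_dummies)
```

Option A keeps everything in one column, which is usually enough for a two-valued variable; option B generalizes to columns with more than two values.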