### Import Python packages and read the CSV into a pandas DataFrame

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../original_data/okcupid/okcupid_profiles.csv')
#df = pd.read_csv('../original_data/okcupid/okcupid_profiles.csv', index_col = False)

# Do not do this one (index_col=0). It removes age, etc.
# index_col=0 uses the first CSV column as the index, which would explain age
# disappearing as a regular column. Just don't use it in this context
#df = pd.read_csv('../original_data/okcupid/okcupid_profiles.csv', index_col = 0)

### On the index


###### Initially (days ago):
I don't think I need to reset the index because there is no index column in the CSV, so pandas will add one when reading.

###### Today 10/1/21 2:45pm ish
I left off messing around with the index of the df. I noticed the df doesn't have an "id" column, which might be nice for groupby, so I was trying to make one.

I think the index exists regardless of whether there is a literal column in the dataframe for the index?
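
To check the question above, here is a minimal sketch using a made-up three-row CSV (the rows and the `demo` names are hypothetical stand-ins, not the real file). It suggests that pandas keeps an index whether or not the CSV has an index column, and shows one possible way to copy that index into an explicit "id" column.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV (no id column, like the original)
csv_text = "age,sex,income\n22,m,-1\n35,f,80000\n38,m,-1\n"

# Default read: pandas adds a RangeIndex that is not stored as a column
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.index)
print(list(demo.columns))

# index_col=0 promotes the first CSV column (age here) to the index,
# so it no longer appears among the regular columns
demo_idx0 = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(demo_idx0.columns))

# One way to get an explicit id column: move the index into a column
demo_with_id = demo.reset_index().rename(columns={"index": "id"})
print(list(demo_with_id.columns))
```

So the index is always there. An "id" column is only needed if I want the row label to behave like ordinary data.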

https://pandas.pydata.org/pandas-docs/dev/reference/indexing.html

https://towardsdatascience.com/pandas-index-explained-b131beaf6f7b

While reading the Towards Data Science article, I noticed they were getting nicely formatted table output when they called dataframe.head() in their notebook. So now I'm learning how to get that pretty table output, and I think it might have something to do with IPython. So I'm gonna work on understanding IPython now.
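
A small sketch of what seems to be going on with the pretty tables, using a made-up two-row frame (`demo` is hypothetical). `print()` always uses the plain-text representation. Jupyter asks the object for an HTML representation through `_repr_html_()` when the DataFrame is the last expression in a cell.

```python
import pandas as pd

# Hypothetical data, only for illustration
demo = pd.DataFrame({"age": [22, 35], "sex": ["m", "f"]})

# print() goes through the plain-text representation
text_repr = str(demo.head())
print(text_repr)

# Jupyter uses the HTML representation for the last expression in a cell
html_repr = demo.head()._repr_html_()
print("<table" in html_repr)
```

In practice this means ending a cell with `df.head()` rather than `print(df.head())` should give the table rendering.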

In [None]:
print("Shape of data: {}".format(df.shape))

### General dataframe info

In [None]:
df.info()  # info() prints its report directly and returns None, so wrapping it in print() adds a stray "None"
print(len(df.drugs))
#print(df.columns)
#print(df.dtypes)

In [None]:
print(df.head()) 
# I put this in a separate cell because sometimes I like to run commands without this one. 

### Checking the possible values of the sex variable

In [None]:
print(df.sex.unique())

### Check for duplicates

In [None]:
duplicates = df.duplicated()
print(duplicates.value_counts())

There are no duplicates.
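
As a side note, a single count can be easier to read than `value_counts()` on the boolean mask. This is a minimal sketch on made-up rows (`demo` is hypothetical), not a rerun of the check above.

```python
import pandas as pd

# Hypothetical rows: the third repeats the first exactly
demo = pd.DataFrame({"age": [22, 35, 22], "sex": ["m", "f", "m"]})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(demo.duplicated().sum())
print(n_dupes)  # 1

# drop_duplicates() keeps the first occurrence of each row
deduped = demo.drop_duplicates()
print(len(deduped))  # 2
```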

### Check for missing

In [None]:
# I can see from df.info() that some variables have missing values, but I prefer to also use these commands to check
# This tells us if there are null values somewhere in the df
print(df.isnull().values.any())

# This prints out the sum of null values by column 
print(df.isnull().sum())

There are missing values in some columns, and I will address them later as they come up in analyses.
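
When deciding which columns to worry about, the share of missing values per column may be more useful than the raw count. A minimal sketch on made-up data (the `demo` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps
demo = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "drugs": ["never", np.nan, np.nan, "sometimes"],
    "height": [75.0, 70.0, np.nan, 71.0],
})

# The mean of a boolean mask is the fraction of True values per column
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```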

### Messing Around

In [None]:
print(df['age'].nunique())
print(df['age'].dtypes)
print(df['age'].min())
print(df['age'].max())
# the max age is 110... someone was probably messing around 
print(pd.crosstab(index=df['age'], columns='count'))
print()

### Income 

In [None]:
# Explore the income variable here
print(df.income.isnull().sum())
# While this tells us there are no missing in income, -1 indicates the person chose not to provide income

print(df.income.nunique())
print(df.income.unique())
# One-way freq table of income
print(pd.crosstab(index=df.income, columns='count'))

# Percent -1 AKA missing (48442 of 59946 rows)
print((df.income == -1).mean() * 100)

# Count of rows where income != -1 (59946 - 48442)
print((df.income != -1).sum())

While `print(df.income.isnull().sum())` tells us there are no null income values, -1 indicates the person chose not to provide an income value ([via okcupid_codebook_revised.txt](https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook_revised.txt)).
We can treat -1 values as missing values.
From the one-way frequency table of income, we can see that 48,442 of 59,946 respondents (80.8%, so roughly 80%) did not provide an income value. So the income variable is missing ~80% of responses. This makes the income variable less useful.
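
One way to make the "-1 means missing" convention visible to pandas is to recode -1 as NaN, so the usual null checks pick it up. This is a sketch on a made-up Series (the values are hypothetical), not something applied to `df` in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical income values using the codebook's -1 convention
income = pd.Series([-1, 20000, -1, -1, 80000], name="income")

# Recode -1 as NaN so isnull() counts it as missing
income_clean = income.replace(-1, np.nan)
n_missing = int(income_clean.isnull().sum())
print(n_missing)  # 3

# Fraction missing, comparable to the ~80% figure above
share_missing = income_clean.isnull().mean()
print(share_missing)
```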

There are situations where I could justify replacing missing numerical data with an average or some other value, but replacing 80% of the income data with fill-ins would compromise the integrity of the data. I don't think anything useful would be gained from such an endeavor.

I can do income-related analyses with the ~20% of respondents who did provide an income value, but I cannot use these analyses to make statements about all 59,946 respondents. I will do analyses with that ~20% (e.g. a histogram of income responses) for the purpose of demonstrating my quantitative data analysis skills.

Note that the income responses are already discrete as opposed to continuous. 

For the record, as an OkCupid user (in 2014 but not in 2015), I have no recollection of OkCupid ever asking for income data, but perhaps I just forgot or overlooked that question.

Count of rows where income != -1 is 11504

#### Create a new dataframe that has only the rows where income != -1

In [None]:
# Create a new dataframe that has only the rows where income != -1
df_hasincome = df[df.income != -1]
df_hasincome.info()  # info() prints directly and returns None
print(len(df_hasincome.age))
print(df_hasincome.columns)
print(df_hasincome.dtypes)

In [None]:
# This tells us whether there are null values somewhere in the df
print("(df_hasincome.isnull().values.any())")
print(df_hasincome.isnull().values.any())

# This prints out the sum of null values by column 
print("(df_hasincome.isnull().sum())")
print(df_hasincome.isnull().sum())

Fun thing: I noticed that the 3 people with a missing height response are not included in this subset. <br>

In [None]:
print(df_hasincome.income.nunique())
print(df_hasincome.income.unique())
# One-way freq table of income
print(pd.crosstab(index=df_hasincome.income, columns='count'))

I wonder if the income distribution of respondents resembles San Francisco's income distribution at the time of collection. I'm putting this here because it is a thought I had at this point in the analysis, and it's my analysis. I would need to confirm the year of the data collection. I think Kaggle said 2015, but the data owner's GitHub said "2010s". I will have to search the data owner's GitHub for a year later.

Maybe it is because it is late and I am tired, but now I'm wondering whether the income data, since it is already binned, should be considered categorical. Would a histogram be appropriate? This is definitely me being tired and a sign that it is time for bed.
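
One possible answer to the question above is to treat the reported levels as ordered categories and count each level, which avoids having a histogram choose its own bin edges. This is a sketch on made-up values (the `income` Series here is hypothetical), and whether it is the right framing is still a judgment call.

```python
import pandas as pd

# Hypothetical income responses at a few discrete levels
income = pd.Series([20000, 20000, 50000, 100000, 20000, 50000], name="income")

# Count each reported level, then order by income level rather than by count
counts = income.value_counts().sort_index()
print(counts)
```

Assuming matplotlib is installed, `counts.plot.bar()` would then draw one bar per reported level.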

### Notes to self for later

Best rounding practices? <br>

### Possible things I can do with data:

make sure you're using the words "value" and "variable" right <br>

deal with missing values <br>

1. split the sex column into two columns: male and female OR change the sex column to be male = 1, female = 0
2. maybe take advantage of age being a number to do some quantitative analyses 
3. make demographics tables (including treating age as a categorical variable)
4. I can potentially do quantitative analyses with income 
5. This is your reminder to ADD HYPERLINKS TO read_me.md LATER <br>
to be continued
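
Items 1 and 3 in the list above could be sketched roughly like this. The `demo` frame, the `m`/`f` codes, the `is_male` and `age_group` column names, and the age band edges are all assumptions for illustration, not values checked against the real data.

```python
import pandas as pd

# Hypothetical rows; assumes sex is coded as "m" / "f"
demo = pd.DataFrame({"sex": ["m", "f", "f", "m"], "age": [22, 35, 41, 67]})

# Item 1: encode sex as male = 1, female = 0
demo["is_male"] = (demo["sex"] == "m").astype(int)

# Item 3: treat age as categorical by cutting it into bands
# Intervals are right-inclusive: (17, 29], (29, 39], (39, 49], (49, 120]
age_bins = [17, 29, 39, 49, 120]
age_labels = ["18-29", "30-39", "40-49", "50+"]
demo["age_group"] = pd.cut(demo["age"], bins=age_bins, labels=age_labels)

print(demo)
```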