# Milestone 4 - Independent Project

## Author - Matthew Denko



## Instructions

1. Generate a clear problem statement and provide the location of the dataset you use.
2. Provide a clear solution to the problem for a non-technical audience.
3. Visually explore the data to generate insight and include summary statistics.
4. Use an appropriate statistical analysis method.
5. Prepare the data via cleaning, normalization, encoding, et cetera.
6. Generate and evaluate a working model (hypothesis, linear, or time series).
7. Draw direct inferences and conclusions from model results.
8. Use professional coding standards and techniques including:

    - explanatory markdown text
    - proper code comments
    - functions to minimize redundant code
    - minimize hard-coded variables

### Note
Please use the <a class="icon-pdf" title="Independent Project Rubric" href="https://library.startlearninglabs.uw.edu/DATASCI410/Handouts/DATASCI%20410%20Independent%20Project%20Rubric.pdf" target="_blank" rel="noopener">Rubric</a> as a general guide for your project.

# Abstract

This dataset contains demographic data from the 1994 Census database, which was gathered to see whether it could predict if an adult makes more than 50K annually.

## Problem

Can education level, in the presence of capital gains, be used as an indicator of whether or not an adult earns an annual salary above 50K?

## Conclusion

Based on the results of the model, we cannot determine that education level and capital gain are good indicators of whether or not a person makes more than 50K annually.

In [None]:
# Source Citation

source_citation = "Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science."
print("source citation = ",source_citation)
url = 'https://archive.ics.uci.edu/ml/datasets/Adult'
print("url =",url)

In [None]:
# Load necessary libraries

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Import only the metrics used below instead of a wildcard import
from sklearn.metrics import (accuracy_score, auc, confusion_matrix,
                             precision_score, recall_score, roc_curve)

In [None]:
# Defining Functions

##k-means

def Plot2DKMeans(Points, Labels, ClusterCentroids, Title):
    for LabelNumber in range(max(Labels)+1):
        LabelFlag = Labels == LabelNumber
        color =  ['c', 'm', 'y', 'b', 'g', 'r', 'c', 'm', 'y', 'b', 'g', 'r', 'c', 'm', 'y'][LabelNumber]
        marker = ['s', 'o', 'v', '^', '<', '>', '8', 'p', '*', 'h', 'H', 'D', 'd', 'P', 'X'][LabelNumber]
        plt.scatter(Points.loc[LabelFlag,0], Points.loc[LabelFlag,1],
                    s= 100, c=color, edgecolors="black", alpha=0.3, marker=marker)
        plt.scatter(ClusterCentroids.loc[LabelNumber,0], ClusterCentroids.loc[LabelNumber,1], s=200, c="black", marker=marker)
    plt.title(Title)
    plt.show()

In [None]:
##Reading url

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
Adult= pd.read_csv(url, header=None)

##Assigning reasonable column names

Adult.columns = ["age","workclass","fnlwgt","education","education-num","marital-status","occupation",
                 "relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country",">50K, <=50k"]
print(Adult.columns)
Adult.describe()

# PART 1: Data Cleanup

## Checking the Distribution of Numeric Variables

I want to examine the distribution of the numeric variables to see if any should be normalized
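A numeric skewness value can back up what a histogram suggests visually. The following is a minimal sketch on made-up data (the `toy` frame and its column names are assumptions for illustration, not the census columns), using pandas' `skew()` and `describe()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the census file: one symmetric and one right-skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=40, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=1000, size=5000),
})

# Sample skewness per column: near 0 for symmetric data, positive for a long right tail.
skewness = toy.skew()
print(skewness.round(2))

# describe() gives the usual summary statistics alongside.
print(toy.describe().round(1))
```

Applied to the numeric census columns, the same two calls would give a quick ranking of which variables are most skewed.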

In [None]:
#age

age_hist = plt.hist(Adult.loc[:,'age'])
plt.title("Age Histogram")
plt.xlabel('age')
plt.ylabel('frequency')
plt.show()
age_comment = """Age is strongly right skewed and does not follow a
normal distribution; there is a higher concentration of younger participants
than older ones."""
print(age_comment)


In [None]:
#fnlwgt

fnlwgt_hist = plt.hist(Adult.loc[:,'fnlwgt'])
plt.title("Fnlwgt Histogram")
plt.xlabel('fnlwgt')
plt.ylabel('frequency')
plt.show()
fnlwg_comment = """fnlwgt is also strongly right skewed. It represents the final weight,
which is the number of units in the target population that the responding unit
represents."""
print(fnlwg_comment)

In [None]:
#education-num

education_num_hist = plt.hist(Adult.loc[:,'education-num'])
plt.title("Education Num Histogram")
plt.xlabel('education-num')
plt.ylabel('frequency')
plt.show()
education_num_comment = """education num has a somewhat bi-modal distribution
with one center around 8-12 and another at 14"""
print(education_num_comment)


In [None]:
#capital-gain

capital_gain_hist = plt.hist(Adult.loc[:,'capital-gain'])
plt.title("Capital Gain Histogram")
plt.xlabel('capital-gain')
plt.ylabel('frequency')
plt.show()
capital_gain_comment = """capital gain is a unimodal distribution that
appears slightly right skewed"""
print(capital_gain_comment)

In [None]:
#capital-loss

capital_loss_hist = plt.hist(Adult.loc[:,'capital-loss'])
plt.title("Capital Loss Histogram")
plt.xlabel('capital-loss')
plt.ylabel('frequency')
plt.show()
capital_loss_comment = """capital loss is a unimodal distribution that has
some outliers in its right tail"""
print(capital_loss_comment)

In [None]:
#hours-per-week

hours_per_week_hist = plt.hist(Adult.loc[:,'hours-per-week'])
plt.title("Hours-Per-Week Histogram")
plt.xlabel('hours-per-week')
plt.ylabel('frequency')
plt.show()
hours_per_week_comment = """hours per week appears to be close to a normal
distribution, with some slight right skewness"""
print(hours_per_week_comment)

Based on the distributions, I will normalize age, fnlwgt, capital-gain, and capital-loss.
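Because the same z-score formula is applied to several columns, it can be wrapped in a small helper. This is a sketch only: the function name `z_scale` and the toy rows are assumptions, and it uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import pandas as pd

def z_scale(frame, columns):
    """Return a copy of `frame` with each listed column z-scaled (population std, ddof=0)."""
    scaled = frame.copy()
    for column in columns:
        values = frame[column].astype(float)
        scaled[column] = (values - values.mean()) / values.std(ddof=0)
    return scaled

# Hypothetical toy rows standing in for the census data.
toy = pd.DataFrame({"age": [25, 38, 52, 61], "fnlwgt": [77516, 83311, 215646, 234721]})
toy_scaled = z_scale(toy, ["age", "fnlwgt"])
print(toy_scaled)
```

Looping over column names means each column is always scaled with its own mean and standard deviation, which removes the risk of a copy-and-paste slip between columns.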

In [None]:
#Extracting the numeric columns which make sense to normalize

age = Adult.loc[:,'age']
fnlwgt = Adult.loc[:,'fnlwgt']
capital_gain = Adult.loc[:,'capital-gain']
capital_loss = Adult.loc[:,'capital-loss']

In [None]:
# Normalizing numeric variables using numpy and z normalization

age_zscaled = (age - np.mean(age))/np.std(age)
fnlwgt_zscaled = (fnlwgt - np.mean(fnlwgt))/np.std(fnlwgt)
capital_gain_zscaled = (capital_gain - np.mean(capital_gain))/np.std(capital_gain)
capital_loss_zscaled = (capital_loss - np.mean(capital_loss))/np.std(capital_loss)

In [None]:
#replacing the numeric values with the normalized values

# Plain column assignment swaps in the float z-scores without an integer-dtype conflict
Adult["age"] = age_zscaled
Adult["fnlwgt"] = fnlwgt_zscaled
Adult["capital-gain"] = capital_gain_zscaled
Adult["capital-loss"] = capital_loss_zscaled
print(Adult.head())


## Missing Data

In [None]:
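#Checking for missing data

# Fields in this file carry a leading space, so the missing-value marker is " ?" rather than "?"
Adult = Adult.replace(to_replace=r"^\s*\?\s*$", value=np.nan, regex=True)
Adult_null = Adult.isnull().sum()
print(Adult_null)
print("Number of columns with missing data:", (Adult_null > 0).sum())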
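The `' >50K'` comparison in the encoding step suggests that fields in the raw file are padded with a leading space, which matters when searching for the `?` missing-value marker. The sketch below uses a made-up three-row extract (an assumption, not real census rows) to show the effect, along with the `skipinitialspace` and `na_values` options of `pd.read_csv`:

```python
import io
import pandas as pd

# Hypothetical three-row extract in the same comma-plus-space layout.
raw = ("39, State-gov, Adm-clerical, <=50K\n"
       "54, ?, ?, >50K\n"
       "32, Private, Sales, <=50K\n")
names = ["age", "workclass", "occupation", "income"]

# Read as-is: the marker is stored as " ?", so an exact match on "?" finds nothing.
naive = pd.read_csv(io.StringIO(raw), header=None, names=names)
naive_hits = (naive == "?").sum().sum()

# skipinitialspace drops the padding, and na_values turns "?" into NaN on load.
clean = pd.read_csv(io.StringIO(raw), header=None, names=names,
                    skipinitialspace=True, na_values="?")
missing_per_column = clean.isnull().sum()
print(naive_hits)
print(missing_per_column)
```

Note that reading with `skipinitialspace=True` also removes the leading space from the income labels, so any later comparison would need to use `'>50K'` without the space.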


## Encoding

In [None]:
# Create dummy column for >50K,<50K

Adult.loc[:, ">50K"] = (Adult.loc[:, ">50K, <=50k"] == ' >50K').astype(int)

# Removing obsolete columns

Adult = Adult.drop(">50K, <=50k", axis=1)

### Summary:
    The columns used in the model (education-num and capital-gain) have no missing values, and the skewed numeric columns have been z-scaled. I encoded >50K so that it can be used in a model. I will now begin examining the relationship between variables using visuals.

# PART 2: Visualization

I will now use plots to examine the relationship between education and whether or not an adult makes >50k annually. I will first examine a scatter plot of education num vs capital gain. Since >50K, <=50K is not a numeric variable I am using capital-gain as a proxy.

In [None]:
#Scatter plot of education num vs capital gain

ax = plt.figure(figsize=(6, 6)).gca() # define axis
Adult.plot.scatter(x = 'education-num', y = 'capital-gain', ax = ax)
ax.set_title('Capital Gain vs Education Number') # Give the plot a main title
ax.set_ylabel('Capital Gain')# Set text for y axis
ax.set_xlabel('Education Number')

## Comments:
    This plot shows there is a slight positive correlation between capital gain and education number, however there is significant overplotting so I am going to add a hex bin plot.
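The strength of a relationship like this can also be quantified rather than read off a plot. Once the income label is encoded as 0/1, a Pearson correlation against it is the point-biserial correlation. The sketch below uses simulated data (the probabilities are an assumption chosen purely for illustration) to show the call:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a 0/1 outcome whose probability rises with education level.
rng = np.random.default_rng(1)
education_num = rng.integers(1, 17, size=2000)   # levels 1..16
probability = education_num / 20.0                # assumed, purely illustrative
over_50k = (rng.random(2000) < probability).astype(int)
toy = pd.DataFrame({"education-num": education_num, ">50K": over_50k})

# Pearson correlation with a 0/1 column is the point-biserial correlation.
r = toy["education-num"].corr(toy[">50K"])
print(round(r, 2))
```

The same `.corr()` call on the real frame would measure the association with the encoded target directly, without going through a proxy variable.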

In [None]:
#Hexbin Plot

ax = plt.figure(figsize=(6, 6)).gca() # define axis
Adult.plot.hexbin(x = 'education-num', y = 'capital-gain', gridsize = 16, ax = ax)
ax.set_title('Capital Gain vs Education Num') # Give the plot a main title
ax.set_ylabel('Capital Gain')# Set text for y axis
ax.set_xlabel('Education Num')

## Comments:
    The most common pairs appear to be around Education Num of 10 and Capital Gain of zero. 
    It does not appear there is any significant density of capital gain values greater than 

In [None]:
#Facet Plot

g = sns.FacetGrid(Adult, col=">50K", row='education-num')
g = g.map(plt.hist, "capital-gain")

### Comments:
    For education levels 9 and below, adults who do not make >50K generally have larger capital gains. However, for education levels 10 and above, adults who do make >50K generally have larger capital gains. I would like to examine the relationship between these three variables further by looking at a grouped box plot.

In [None]:
# Grouped Box Plot

fig, axes = plt.subplots(1, 2, figsize=(12, 6)) # One axis per plotted column
Adult.loc[:,['capital-gain', 'education-num','>50K']].boxplot(by = '>50K', ax = axes)
fig.suptitle('Capital Gain and Education Num by >50K') # Give the plot a main title
for axis in axes:
    axis.set_ylabel('value') # capital-gain is z-scaled at this point, education-num is not

### Comments:
    Based on the side-by-side box plots, both education num and capital gain appear to be correlated with >50K. Generally, if you make >50K you are more likely to have a higher capital gain and a higher education level than if you do not make >50K.
    
    Based on these plots, I want to examine whether education in the presence of capital gain is a good predictor of whether someone makes >50k.
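A grouped summary table is one way to put numbers behind a box-plot comparison. The following sketch uses a handful of invented rows (assumptions, with capital-gain on its raw scale for readability) to show the `groupby` pattern:

```python
import pandas as pd

# Hypothetical toy rows; capital-gain is shown on its raw scale for readability.
toy = pd.DataFrame({
    "education-num": [9, 10, 13, 14, 9, 16, 10, 13],
    "capital-gain":  [0, 0, 5000, 7000, 0, 15000, 2000, 0],
    ">50K":          [0, 0, 1, 1, 0, 1, 0, 1],
})

# Median and mean of each feature within the two income groups.
group_summary = toy.groupby(">50K")[["education-num", "capital-gain"]].agg(["median", "mean"])
print(group_summary)
```

Reporting the median next to the mean is useful here because a few very large capital gains can pull the mean far from the typical value.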

# PART 3: Modeling

## Unsupervised Learning

In [None]:
### I want to view the relationship between holding the maximum education num and the capital gain received

#extracting the columns

max_education_num = (Adult.loc[:,"education-num"] == 16).astype(int) # 1 if the adult has the highest education num
capital_gain = Adult.loc[:,"capital-gain"]

#creating the dataframe

kmeansdf = pd.DataFrame({0: max_education_num, 1: capital_gain})

#Centroid Guesses

ClusterCentroidGuesses = pd.DataFrame({0: [-1, 1], 1: [-1, 1]})

#Doing the clustering

kmeans = KMeans(n_clusters=2, init=ClusterCentroidGuesses.values, n_init=1).fit(kmeansdf)
Labels = kmeans.labels_
ClusterCentroids = pd.DataFrame(kmeans.cluster_centers_)
Plot2DKMeans(kmeansdf, Labels, ClusterCentroids, 'my cluster of max education-num vs capital-gain')

#Adding the Label to the model

Adult.loc[:,"cluster_label"] = Labels

### Comments:
    There are two main clusters of capital gain values at 0 and 1. The spread of education levels is high for both capital gain clusters of 0 and 1.

## Supervised Learning - Logistic Regression

In [None]:
#Creating Training and Test Sets

#Subsetting dataset for wanted columns

Adult_Data = Adult.loc[:,["education-num","capital-gain",">50K","cluster_label"]].copy()

#splitting data into training (X) and test (Y) sets using sklearn

X, Y = train_test_split(Adult_Data,test_size = .20)

print(X,"This is the Training Set")
print(Y,"This is the testing Set")


In [None]:
#Creating the classifier

print ('\n Use logistic regression to predict >50K from education num, capital gain, and the cluster label')
Target = ">50K"
Inputs = list(Adult_Data.columns)
Inputs.remove(Target)
clf = LogisticRegression()
clf.fit(X.loc[:,Inputs], X.loc[:,Target])
BothProbabilities = clf.predict_proba(Y.loc[:,Inputs])
probabilities = BothProbabilities[:,1]



In [None]:
#Confusion Matrix

# I will use a probability threshold of .5 in order to have a balance of precision vs recall. 

print ('\nConfusion Matrix and Metrics')
Threshold = 0.5 # Some number between 0 and 1
print ("Probability Threshold is chosen to be:", Threshold)
predictions = (probabilities > Threshold).astype(int)
CM = confusion_matrix(Y.loc[:,Target], predictions)
tn, fp, fn, tp = CM.ravel()
print ("TP, TN, FP, FN:", tp, ",", tn, ",", fp, ",", fn)



In [None]:
#Accuracy Metrics

AR = accuracy_score(Y.loc[:,Target], predictions)
print ("Accuracy rate:", np.round(AR, 2))
P = precision_score(Y.loc[:,Target], predictions)
print ("Precision:", np.round(P, 2))
R = recall_score(Y.loc[:,Target], predictions)
print ("Recall:", np.round(R, 2))



## Comments:
    Accuracy and precision are both fairly high, while recall is low. The accuracy rate says that this model correctly predicted 81% of the observations in the test set. The precision means that 78% of the observations predicted positive were actually positive. The low recall implies a large number of false negatives: many adults who make >50K were predicted as not making >50K. This means the net predictive power of my model is not very high.
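For reference, the three metrics can be written directly in terms of the confusion-matrix counts. The sketch below is illustrative only: the helper name `classification_metrics` and the counts passed to it are assumptions, not the results of this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the predicted positives, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    return accuracy, precision, recall

# Hypothetical counts chosen for illustration only, not the notebook's results.
accuracy, precision, recall = classification_metrics(tp=20, tn=146, fp=5, fn=29)
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

Because false negatives appear only in the recall denominator, a model can reach high accuracy and precision on an imbalanced target while still missing most of the positive cases.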

In [None]:
# Calculating the ROC curve and its AUC

# Creating False Positive Rate, True Positive Rate, and probability thresholds

fpr, tpr, th = roc_curve(Y.loc[:,Target], probabilities)

#Calculating AUC

AUC = auc(fpr, tpr)

# Plotting the ROC Curve, presenting AUC in the plot

plt.rcParams["figure.figsize"] = [10, 10] # Square
font = {'family' : 'DejaVu Sans', 'weight' : 'bold', 'size' : 12}
matplotlib.rc('font', **font)
plt.figure()
plt.title('ROC Curve')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.plot(fpr, tpr, lw=3, label='ROC curve (AUC = %0.2f)' % AUC)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--') # reference line for random classifier
plt.legend(loc="lower right")
plt.show()






## Comments:
   The ROC curve shows the trade-off between the true positive rate (recall) and the false positive rate as the probability threshold varies. The AUC summarizes the model's overall ability to rank positive cases above negative ones, so a value of .76 is not great but is encouraging. The model overall appears to be a decent predictor on this dataset, but its low recall indicates that we cannot determine that education level and capital gain are good indicators of whether or not a person makes greater than 50K annually.
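To make the construction of an ROC curve concrete, the sketch below computes the true and false positive rates by hand at each threshold and integrates with the trapezoidal rule. It is a simplified illustration, not scikit-learn's implementation; the function name `roc_points` and the four labelled scores are assumptions.

```python
import numpy as np

def roc_points(y_true, scores):
    """TPR and FPR at each distinct score used as a threshold (predict 1 when score >= threshold)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Start above every score so the curve begins at (0, 0), then lower the threshold step by step.
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    positives = (y_true == 1).sum()
    negatives = (y_true == 0).sum()
    tpr = np.array([((scores >= t) & (y_true == 1)).sum() / positives for t in thresholds])
    fpr = np.array([((scores >= t) & (y_true == 0)).sum() / negatives for t in thresholds])
    return fpr, tpr

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoidal rule.
auc_value = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(auc_value)
```

Each point on the curve corresponds to one choice of probability threshold, so the single 0.5 threshold used for the confusion matrix is just one point along it.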
  