# Project 4 - Hackathon (Predicting Income under Team Features Constraint)

* Choose algorithm and choice of samples
* Limited to a maximum of 20 features

### This exercise is to build the best model possible under those constraints. 

### The task is to predict if a person's income is in excess of 50,000 dollars given certain profile information, and more specifically to generate predicted probabilities of income being above 50,000 dollars for each row in the test set. The output will be a .csv file with a single column of the probability with 'wage' as a header. The file is to be submitted by the end of the day.

### This section runs the chosen model on the test data.

The KNN model was chosen for this exercise with the following features and parameters:

Features:
- age
- education-num
- sex
- hours-per-week
- marital_status_num
- occupation_com_House_Services
- occupation_com_Professional
- occupation_com_Specialty
- occupation_com_Tech/sales
- workclass_com_ Government
- workclass_com_ Private
- workclass_com_ Self-employed
- cap_gain_binary
- cap_loss_binary
- gdp_pc

Parameters:
- n_neighbors = 25

The test set is the cleaned file './data/clean_test.csv'. We set up the KNN model with the chosen features and then run it on the test data. The output is the 'wage_final_mb_jw_DEN.csv' file, which will be submitted to complete this project.
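The workflow described above (scale the features, fit KNN with 25 neighbors, then produce positive-class probabilities) can also be sketched as a single scikit-learn pipeline. This is only an illustration: the data is synthetic, and the names `X_demo`, `y_demo`, `knn_pipe`, and `proba_demo` are hypothetical and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned data: 15 numeric features, binary target.
# (Assumption: this is NOT the project data, only a placeholder of similar shape.)
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=33)

# Chain scaling and KNN so the scaler is fit only on the rows passed to fit().
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=25))
knn_pipe.fit(X_demo[:400], y_demo[:400])

# Column 1 of predict_proba holds the probability of the positive class (label 1).
proba_demo = knn_pipe.predict_proba(X_demo[400:])[:, 1]
print(proba_demo[:5])
```

Bundling the scaler with the classifier means the same fitted scaling is applied automatically at prediction time, which removes the need to call `transform` separately on new data.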
 

In [1]:
# Import the core libraries
import numpy as np
import pandas as pd

from scipy import stats

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV


In [2]:
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

In [3]:
np.random.seed(33) 

df_data_clean = pd.read_csv('./data/clean_train.csv')
df_test_clean = pd.read_csv('./data/clean_test.csv')

# Instantiate and fit model: KNN

In [4]:
X = df_data_clean[['age', 'education-num', 'sex',
       'hours-per-week', 'marital_status_num',
       'occupation_com_House_Services',
       'occupation_com_Professional', 'occupation_com_Specialty',
       'occupation_com_Tech/sales', 'workclass_com_ Government', 'workclass_com_ Private',
       'workclass_com_ Self-employed', 'cap_gain_binary', 'cap_loss_binary',
       'gdp_pc']]

In [5]:
y = df_data_clean['wage'] 

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=33)

In [7]:
# KNN is distance-based, so the features need to be scaled first
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [8]:
# Instantiate our model.
knn = KNeighborsClassifier(n_neighbors = 25)

In [9]:
knn.fit(X_train_sc, y_train);
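The notebook holds out 33% of the training data but does not show a score on it. A minimal sketch of how such a check might look is given below, using synthetic data in place of the project files. The variable names and the choice of F1 and ROC AUC as metrics are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned training data (hypothetical, not the project data).
X_demo, y_demo = make_classification(n_samples=600, n_features=15, random_state=33)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=33
)

# Fit the scaler on the training split only, then reuse it on the held-out split.
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_ho_sc = scaler.transform(X_ho)

model = KNeighborsClassifier(n_neighbors=25).fit(X_tr_sc, y_tr)

# F1 uses hard class predictions; ROC AUC uses the positive-class probabilities.
holdout_f1 = f1_score(y_ho, model.predict(X_ho_sc))
holdout_auc = roc_auc_score(y_ho, model.predict_proba(X_ho_sc)[:, 1])
print(f"F1: {holdout_f1:.3f}  ROC AUC: {holdout_auc:.3f}")
```

Since the submission consists of probabilities rather than class labels, a ranking metric such as ROC AUC is arguably the more relevant of the two.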

# Generate prediction on test data


In [10]:
X = df_test_clean[['age', 'education-num', 'sex',
       'hours-per-week', 'marital_status_num',
       'occupation_com_House_Services',
       'occupation_com_Professional', 'occupation_com_Specialty',
       'occupation_com_Tech/sales', 'workclass_com_ Government', 'workclass_com_ Private',
       'workclass_com_ Self-employed', 'cap_gain_binary', 'cap_loss_binary',
       'gdp_pc']]

In [11]:
X_tran = ss.transform(X)

In [12]:
wage_proba = knn.predict_proba(X_tran)

In [13]:
wage_proba[2][1]

0.4
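With uniform weights (the scikit-learn default), each KNN probability is the share of the 25 nearest neighbors that belong to a class, so a value of 0.4 corresponds to 10 of 25 neighbors. The columns of `predict_proba` follow the order of `classes_`. The sketch below illustrates both points on synthetic data; all names in it are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data with labels 0 and 1 (placeholder, not the project data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=33)

k = 25
model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

# Look up the positive-class column by label instead of assuming its position.
pos_col = list(model.classes_).index(1)
pos_proba = proba[:, pos_col]

# Each probability times k recovers the number of positive neighbors (an integer).
votes = pos_proba * k
print(model.classes_, np.unique(np.round(votes)).astype(int)[:5])
```

One consequence is that the predicted probabilities can take at most 26 distinct values (0/25 through 25/25), which limits how finely rows can be ranked.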

In [14]:
# Keep the second column of predict_proba (the second class in knn.classes_)
wage = wage_proba[:, 1]

print(np.mean(wage))

0.24375161230882628


In [15]:
wage_final = pd.DataFrame(wage, columns=['wage'])

### Save prediction data

In [29]:
# index=False keeps the row index out of the file, so it has the single 'wage' column
wage_final.to_csv('./data/wage_final_mb_jw_DEN.csv', index=False)
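The task statement asks for a file with a single column headed 'wage'. A quick round-trip check on the written file can confirm this. The sketch below uses a temporary file and made-up probabilities rather than the real submission, and passes `index=False` so the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

# Made-up probabilities standing in for the real predictions.
demo = pd.DataFrame({"wage": [0.04, 0.40, 0.88]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "wage_demo.csv")  # hypothetical file name
    demo.to_csv(path, index=False)  # omit the row index from the output
    check = pd.read_csv(path)

# The file should contain exactly one column, named 'wage'.
columns_written = list(check.columns)
print(columns_written, check.shape)
```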