# Creating features and labels
-----


## Setup
---
Load the initial packages; [`scikit-learn`](http://scikit-learn.org) is the library used for fitting models. Note that the original tutorial connects to the database with `psycopg2`, but we use `sqlalchemy` instead.

In [None]:
%pylab inline
import pandas as pd
pd.set_option('display.max_columns', 300)

#import psycopg2
from sqlalchemy import create_engine

### Connect to the database

In [None]:
db_name = "appliedda"
hostname = "10.10.2.10"
schema = 'M3'
# Build the connection URL from the variables above instead of repeating the literals
pgsql_engine = create_engine("postgresql://{}/{}".format(hostname, db_name))

In [None]:
sql_string = " SELECT *"
sql_string +=" FROM M3.prisonerwages"

full_data = pd.read_sql(sql_string, con = pgsql_engine)    
full_data.head()

In [None]:
# Print all the column names, comma-terminated, for easy copying later
for i in full_data.columns.sort_values():
    print(i + ",")

# Variable descriptions


# Date variables
- ## current admission date variables
    curadm_date,curadmdt,curadmmo,curadmyr,
- ## admission type at reception
    admtypo,
- ## actual release date to supervision variables
    actmsrdt,actmsrmo,actmsryr,
- ## work release date variables
    cccadmdt,cccadmmo,cccadmyr,
- ## actual discharge date variables
    actdis_date, actdisdt,actdismo,actdisyr,
- ## Exit date variables
    exit_date,exitda,exitmo,exitqtr,exityr,
- ## Exit type
    exittyp,



# Demographic and descriptive variables
- ## demographic
    citzshp,vetf,kids,marstat,educlvl,race,sex,
- ## birth info
    birth_date,birth_year,birthpl,birthyr,
- ## Hashed name variables
    name_first_hash,name_full_hash,name_last_hash,name_middle_hash,ssn_hash,
- ## gang variables
    gang,gangact,
- ## drug flag variables
    drugalcf,drugampf,drugcocf,drugherf,drugmarf,drugothf,drugpcpf,drugunkf,
- ## Illinois DOC number
    docnbr,ildoc_docnbr,
- ## sexual offense variables
    sexoff,sexreg,
- ## employment prospects at release
    empplanf,
- ## possibly last known address zipcode variables
    zip5,zipcode,
- ## Did they have any wages in the quarter prior to admission
    priorwage,

# Crime variables
- ## total jail time in days
    jailtime,
- ## holding class and holding offense code
    hclass,hofnscd,
- ## last security level
    lstsclvl,
- ## releasing institution
    relinst,

# Wage variables
- ## actual wage variables
    totalwageq0,totalwageq1,totalwageq2,totalwageq3,
    totalwageq4,totalwageq5,totalwageq6,totalwageq7,
    wageq0,wageq1,wageq2,wageq3,wageq4,wageq5,wageq6,wageq7,
- ## number of employers
    counteinq0,counteinq1,counteinq2,counteinq3,counteinq4,counteinq5,
    counteinq6,counteinq7,
- ## percent of wage from primary employer
    pctwageq0,pctwageq1,pctwageq2,pctwageq3,pctwageq4,pctwageq5,pctwageq6,pctwageq7,
- ## Did they have any wage in the quarter
    anywageq0,anywageq1,anywageq2,anywageq3,anywageq4,anywageq5,anywageq6,anywageq7,
- ## quarter and year for possible employment
    q0,q1,q2,q3,q4,q5,q6,q7,
    yearplus2,end,start,

# Employer variables
- ## Auxiliary NAICS code for most prevalent job
    auxiliary_naicsq0,auxiliary_naicsq1,auxiliary_naicsq2,auxiliary_naicsq3,
    auxiliary_naicsq4,auxiliary_naicsq5,auxiliary_naicsq6,auxiliary_naicsq7,
- ## census block of employer
    census_blockq0,census_blockq1,census_blockq2,census_blockq3,
    census_blockq4,census_blockq5,census_blockq6,census_blockq7,
    census_idq0,census_idq1,census_idq2,census_idq3,census_idq4,
    census_idq5,census_idq6,census_idq7,
- ## EIN of employer
    einq0,einq1,einq2,einq3,einq4,einq5,einq6,einq7,
- ## Employer legal and trade name
    name_legalq0,name_legalq1,name_legalq2,name_legalq3,
    name_legalq4,name_legalq5,name_legalq6,name_legalq7,
    name_tradeq0,name_tradeq1,name_tradeq2,name_tradeq3,
    name_tradeq4,name_tradeq5,name_tradeq6,name_tradeq7,

## Creating label variables
The variable `employed` is a binary variable built from whether or not an individual was ever employed. That is, did the individual have positive wages in any quarter? Note that the last assignment in the cell flips the indicator, so in the final column a value of 1 marks individuals with no positive wages in any quarter.

Note that the current line creating `employed` produces a result equivalent to the original, commented-out line. Feel free to verify. The point of the change was to learn a little more about the methods available to `DataFrame` and `Series` objects, and perhaps to reduce the width of the code. The logic of the code is to check whether or not an individual's maximum wage is positive.

To walk through the syntax of this line: first, `filter` selects the subset of columns in `full_data` whose names match the regular expression in the parentheses, i.e., names consisting of "wageq" followed by a single digit (here, `wageq0` through `wageq7`). Next, the method `max(axis = 1)` evaluates the maximum quarterly wage for each individual (`NaN` values are ignored by default). Finally, `apply(lambda x: (x > 0))` applies the `lambda` function to each element of the resulting series: is the maximum wage for a given individual positive? (Side note: you could reverse the order of the `max()` and `apply()` methods here with slight modification and achieve the same result; however, this is _extremely_ slow compared to the other two ways.)

In [None]:
#full_data['employed'] = ((full_data['wageq0']>0) | (full_data['wageq1']>0) | (full_data['wageq2']>0) | (full_data['wageq3']>0) | (full_data['wageq4']>0) | (full_data['wageq5']>0) | (full_data['wageq6']>0) | (full_data['wageq7']>0))
full_data['employed'] = full_data.filter(regex = "^wageq[0-9]$").max(axis = 1).apply(lambda x: (x > 0))
# Flip the indicator: 1 now marks individuals with no positive wages in any quarter
full_data['employed'] = 1 - full_data['employed']
full_data['employed'].value_counts()
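
The equivalence between the chained comparisons and the `filter`/`max` version can be checked on a small made-up frame. The sketch below uses toy data (the `toy` frame and its values are invented for illustration, not taken from `full_data`) and a vectorized `> 0` in place of `apply`, which should give the same element-wise result:

```python
import numpy as np
import pandas as pd

# Toy data: three people, two quarterly wage columns,
# plus a column the anchored regex is expected to skip.
toy = pd.DataFrame({
    "wageq0": [0.0, 1200.0, np.nan],
    "wageq1": [np.nan, 0.0, np.nan],
    "totalwageq0": [50.0, 1300.0, 10.0],
})

# Chained comparison, in the style of the commented-out line
chained = (toy["wageq0"] > 0) | (toy["wageq1"] > 0)

# filter + row-wise max; ^ and $ keep totalwageq0 out of the selection
wage_cols = toy.filter(regex="^wageq[0-9]$")
via_max = wage_cols.max(axis=1) > 0

print(list(wage_cols.columns))
print(via_max.tolist())
```

A row whose wages are all `NaN` has a `NaN` maximum, and `NaN > 0` evaluates to `False`, so such a person is treated as having no positive wages under both formulations.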

In [None]:
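wage_total = full_data.filter(regex = "^wageq[0-9]$").sum(axis = 1)
poverty_total = full_data.filter(regex = "^baselinepovertyq[0-9]$").sum(axis = 1)
full_data['below_poverty_on_avg'] = 1 - (wage_total - poverty_total > 0)  # 1 = total wages do not exceed total baseline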
print(full_data['below_poverty_on_avg'].value_counts(normalize=True))
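
To see how the poverty flag behaves at the edges, here is a minimal sketch on invented numbers. The baseline of 3000 per quarter and the `toy` frame are assumptions for illustration only; the real `baselinepovertyq*` values come from the database.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical wages and a hypothetical baseline of 3000 per quarter
toy = pd.DataFrame({
    "wageq0": [3000.0, 500.0, np.nan],
    "wageq1": [3500.0, np.nan, np.nan],
    "baselinepovertyq0": [3000.0, 3000.0, 3000.0],
    "baselinepovertyq1": [3000.0, 3000.0, 3000.0],
})

# Row-wise totals; sum() skips NaN, so missing wages contribute 0
wage_total = toy.filter(regex="^wageq[0-9]$").sum(axis=1)
poverty_total = toy.filter(regex="^baselinepovertyq[0-9]$").sum(axis=1)

# 1 when total wages do not exceed the total baseline, 0 otherwise
below = 1 - (wage_total - poverty_total > 0)
print(below.tolist())
```

Two consequences of this construction are worth keeping in mind: missing quarterly wages are counted as zero, and a person whose total wages exactly equal the baseline is flagged as below it, because the comparison is strict.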

## Creating feature variables

In [None]:
full_data['prisontime']=full_data['prisontime'].round().astype('object')
full_data['prisontime'].value_counts()

In [None]:
full_data['HasKids']=(full_data['kids']>0)-0
print(full_data['HasKids'].value_counts(normalize=True))
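
The `(condition)-0` pattern used for `HasKids` (and for several flags below) converts a boolean series to integers. A small sketch on made-up values, comparing it with an explicit cast:

```python
import numpy as np
import pandas as pd

# Toy values; NaN stands in for a record with no information on children
kids = pd.Series([0, 2, np.nan, 1])

flag_minus_zero = (kids > 0) - 0      # idiom used in this notebook
flag_astype = (kids > 0).astype(int)  # explicit cast, same values

print(flag_minus_zero.tolist())
print(flag_astype.tolist())
```

Because `NaN > 0` is `False`, a missing value ends up as 0 in the flag rather than staying missing, so these derived columns will not show up in a later search for `NaN`s.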

In [None]:
DrugVars=['drugalcf','drugampf','drugcocf','drugherf','drugmarf','drugothf','drugpcpf','drugunkf']
for i in DrugVars:
    # Recode 'X' and 'F' as 'Y' and missing values as 'N' (np is provided by %pylab)
    full_data[i]=full_data[i].replace(['X','F',np.nan],['Y','Y','N'])
    print(i)
    print(full_data[i].value_counts(normalize=True))
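
The list form of `replace` pairs each entry of the first list with the entry at the same position in the second. A minimal sketch on invented codes (the real drug flag columns may contain other values):

```python
import numpy as np
import pandas as pd

# Toy flag column; the codes here are made up for illustration
flag = pd.Series(["X", "F", np.nan, "Y"])

# 'X' -> 'Y', 'F' -> 'Y', NaN -> 'N'; values not listed are left alone
recoded = flag.replace(["X", "F", np.nan], ["Y", "Y", "N"])
print(recoded.tolist())
```

After this recoding the drug flags contain no missing values, since every `NaN` has been mapped to `'N'`.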


In [None]:
full_data['anypriorwage']=(full_data['priorwage']>0)-0
print(full_data['anypriorwage'].value_counts(normalize=True))

In [None]:
full_data['vetf'].value_counts(normalize=True)

In [None]:
full_data['birthdecade']=(full_data['birth_year']-(full_data['birth_year']%10)).astype('object')
full_data[['birthdecade','birth_year']].head(25)

In [None]:
full_data['birthdecade1950orprior']=(full_data['birth_year']<1960)-0
print(full_data['birthdecade1950orprior'].value_counts(normalize=True))
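
The decade arithmetic can be checked by hand on a few years. This sketch uses made-up birth years:

```python
import pandas as pd

# Toy birth years chosen to sit on either side of decade boundaries
birth_year = pd.Series([1949, 1950, 1959, 1987])

# Subtracting the remainder modulo 10 floors each year to its decade
birthdecade = birth_year - (birth_year % 10)

# Born in the 1950s or earlier, i.e. strictly before 1960
decade_1950_or_prior = (birth_year < 1960) - 0

print(birthdecade.tolist())
print(decade_1950_or_prior.tolist())
```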

In [None]:
full_data['release_year']=full_data['exityr'].astype('object')
full_data['release_year'].value_counts()

In [None]:
full_data['active_gang_member']=(full_data['gangact']=='A')-0
full_data['active_gang_member'].value_counts()

In [None]:
#earned time binary vars
full_data['meritorious_good_time']=(full_data['gttyp15']>0)-0
full_data['education_in_prison']=(full_data['gttyp17']>0)-0
full_data['substanceabuse_treatment']=(full_data['gttyp18']>0)-0
full_data['working_in_prison']=(full_data['gttyp19']>0)-0
good_time_vars=['meritorious_good_time','education_in_prison','substanceabuse_treatment','working_in_prison']

for i in good_time_vars:
    print(i)
    print(full_data[i].value_counts(normalize=True))

What do we want our features to be? Let's make one list containing the variables from which we will derive our features.

In [None]:
feat_source = ['active_gang_member','birthdecade','birthdecade1950orprior','release_year', 'race', 'sex','hclass','sexoff','sexreg','lstsclvl','HasKids','prisontime','anypriorwage']
feat_source+=good_time_vars+DrugVars
feat_source
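
Several of the source variables above were cast to `object` so that they are treated as categories rather than numbers. One common way to derive model-ready features from such columns is one-hot encoding with `pd.get_dummies`. This is only a sketch under that assumption: the notebook does not show which encoding it uses, and the `toy` frame is invented.

```python
import pandas as pd

# Toy stand-in for a few of the feature source columns
toy = pd.DataFrame({
    "release_year": pd.Series([2010, 2011, 2010], dtype="object"),
    "sex": ["M", "F", "M"],
    "HasKids": [1, 0, 1],
})

# One indicator column per category level; HasKids is passed through unchanged
features = pd.get_dummies(toy, columns=["release_year", "sex"])
print(features.columns.tolist())
```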

## Cleaning

No matter how tight our SQL game is, there will always be data aberrations to be cleaned. First, let's identify missing values for the features of interest.

In [None]:
isnan_rows = full_data.filter(items=feat_source).isnull().any(axis=1) # Find the rows where there are NaNs
full_data[isnan_rows].shape

In [None]:
full_data[isnan_rows].head()

In [None]:
nrows_full = full_data.shape[0]
nrows_full_isnan = full_data[isnan_rows].shape[0]
print('% of rows with NaNs: {}'.format(float(nrows_full_isnan)/nrows_full*100))

Now, let's remove the offending records from `full_data`.

In [None]:
full_data = full_data[~isnan_rows]
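
The cleaning steps above can be traced on a small made-up frame. In this sketch `toy` and `feat_cols` are hypothetical stand-ins for `full_data` and `feat_source`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "race": ["A", None, "B", "C"],
    "HasKids": [1, 0, np.nan, 1],
    "notes": [None, None, None, "x"],  # not a feature, so its gaps are ignored
})
feat_cols = ["race", "HasKids"]

# True for every row with a missing value in at least one feature column
isnan_rows = toy.filter(items=feat_cols).isnull().any(axis=1)

# Share of rows that will be dropped, in percent
pct_dropped = 100.0 * isnan_rows.sum() / len(toy)

# Keep only the complete rows
cleaned = toy[~isnan_rows]

print(pct_dropped)
print(cleaned.index.tolist())
```

Only the columns listed in `feat_cols` influence the mask, so a row can survive cleaning while still holding missing values in columns that are not features.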

In [None]:
full_data.to_sql( "cleaned_data", con = pgsql_engine, schema = 'm3' )