## Problem Statement

I have two attributes: a binary attribute "income" with values "high" or "low",
and a continuous attribute "capital".
I want to know whether "capital" is a good predictor of "income".

## Solution 1: Logistic Regression

## Preprocessing

The features `capital-gain` and `capital-loss` are mutually exclusive,
meaning that they are never both non-zero for the same object.
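
The mutual-exclusivity claim can be checked directly before relying on it. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration; only the column names come from this notebook). Running the same mask against the real `df` should report zero rows if the claim holds.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data (assumption);
# only the column names are taken from the notebook.
toy = pd.DataFrame({
    'capital-gain': [2174, 0, 0, 14084],
    'capital-loss': [0, 0, 2042, 0],
})

# A row violates mutual exclusivity if both columns are non-zero
both_nonzero = (toy['capital-gain'] != 0) & (toy['capital-loss'] != 0)
n_violations = int(both_nonzero.sum())

print(f'Rows with both a gain and a loss: {n_violations}')
```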

### Eliminate Objects

We will remove the objects that have no "capital" information,
meaning that both `capital-gain` and `capital-loss` are zero.

### Feature Creation

Combine `capital-gain` and `capital-loss` into one feature, which we will call `capital`.
This new feature keeps `capital-gain` as is, but negates `capital-loss` to reflect the "loss" property.

In [11]:
import pandas as pd

In [12]:
df = pd.read_excel('data/existing-customers.xlsx')
print(f'The full dataframe has {len(df)} rows')

  warn("Workbook contains no default style, apply openpyxl's default")


The full dataframe has 32561 rows


In [13]:
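# Keep only rows with capital information; .copy() avoids a SettingWithCopyWarning when adding columns later
df = df.loc[~((df['capital-gain'] == 0) & (df['capital-loss'] == 0))].copy()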
print(f'The filtered dataframe has {len(df)} rows')

The filtered dataframe has 4231 rows


In [21]:
df['capital'] = df['capital-gain'] - df['capital-loss']
df[['capital', 'class']].head()

|    | capital | class |
|---:|--------:|:------|
| 0  | 2174    | <=50K |
| 8  | 14084   | >50K  |
| 9  | 5178    | >50K  |
| 23 | -2042   | <=50K |
| 32 | -1408   | <=50K |


In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Separate predictor variable (capital) and dependent variable (class)
X = df[['capital']]
y = df['class']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create logistic regression model and fit to training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Use model to predict income for testing data
y_pred = logreg.predict(X_test)

# Calculate accuracy of model
accuracy = accuracy_score(y_test, y_pred)

print('Accuracy:', accuracy)


Accuracy: 0.6055118110236221
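
An accuracy figure on its own is hard to interpret without a reference point. One common way to judge whether a single feature is a useful predictor is to compare the model against a majority-class baseline. The sketch below illustrates the idea on synthetic stand-in data (the `toy` frame, its values, and the `toy_`-prefixed names are assumptions, not the real dataset or its results); the same comparison could be run on the notebook's `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, NOT the real customer data)
rng = np.random.default_rng(0)
capital = rng.normal(loc=0, scale=5000, size=500)
noise = rng.normal(loc=0, scale=5000, size=500)
labels = np.where(capital + noise > 0, '>50K', '<=50K')
toy = pd.DataFrame({'capital': capital, 'class': labels})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy[['capital']], toy['class'], test_size=0.3, random_state=0
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)
baseline_pred = baseline.predict(toy_X_test)
baseline_acc = accuracy_score(toy_y_test, baseline_pred)

# Logistic regression on the single feature, as in the cell above
model = LogisticRegression()
model.fit(toy_X_train, toy_y_train)
model_acc = accuracy_score(toy_y_test, model.predict(toy_X_test))

print('Baseline accuracy:', baseline_acc)
print('Model accuracy:', model_acc)
```

If the model's accuracy is not clearly above the baseline's, the feature adds little predictive value in this setup.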
