
# Census Income
This is a scikit-learn + pandas example of a classification problem. The dataset comes from http://archive.ics.uci.edu/.

Data extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The data was also preprocessed for the purpose of this example.

The prediction task is to determine whether a person makes over 50K a year.


### List of attributes:

##### Features
- age: continuous. 
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
- education-num: continuous. 
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
- sex: Female, Male. 
- hours-per-week: continuous. 
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



##### Labels
- income: >50K, <=50K. 

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [None]:
# Run this cell if you are using Google Colab (comment out the line below otherwise)
!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LogisticRegressionCensus/census.csv

--2023-04-01 13:44:15--  https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LogisticRegressionCensus/census.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3152687 (3.0M) [text/plain]
Saving to: ‘census.csv’


2023-04-01 13:44:16 (114 MB/s) - ‘census.csv’ saved [3152687/3152687]



### Load dataset

In [None]:
df = pd.read_csv("./census.csv")

print (df.shape)
print (df.columns)
df.head(10)

(32561, 12)
Index(['age', 'workclass', 'education', 'education-num', 'marital-status',
       'occupation', 'relationship', 'race', 'sex', 'hours-per-week',
       'native-country', 'income'],
      dtype='object')


Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K
5,37,Private,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K
6,49,Private,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K
7,52,Self-emp-not-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K
8,31,Private,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,50,United-States,>50K
9,42,Private,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,40,United-States,>50K


## Task 1 - Initial analysis
Perform initial analysis to understand the data.

In [None]:
df.describe()

Unnamed: 0,age,education-num,hours-per-week
count,32561.0,32561.0,32561.0
mean,38.581647,10.080679,40.437456
std,13.640433,2.57272,12.347429
min,17.0,1.0,1.0
25%,28.0,9.0,40.0
50%,37.0,10.0,40.0
75%,48.0,12.0,45.0
max,90.0,16.0,99.0


In [None]:
df.isnull().any()

age               False
workclass         False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
hours-per-week    False
native-country    False
income            False
dtype: bool

In [None]:
df.income.value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

In [None]:
df.sex.value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

## Task 2 - Preparing data
- Select features `X` and labels `y`. Make sure that your selection makes sense.
- Change the data into a numerical form to let your algorithm (logistic regression) deal with them
- Perform One-hot encoding if necessary
- Split your data into train and test subsets. Make sure that your split is reasonable. Use `stratify` if you consider it helpful.

Selecting the features `X` and the label `y` (kept together in one frame, `Xy`, for now)

In [None]:
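# .copy() makes Xy an independent frame, so the later column assignments do not trigger SettingWithCopyWarning
Xy = df[['age', 'workclass', 'education-num', 'occupation', 'sex', 'hours-per-week', 'income']].copy()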

Xy.head()

Unnamed: 0,age,workclass,education-num,occupation,sex,hours-per-week,income
0,39,State-gov,13,Adm-clerical,Male,40,<=50K
1,50,Self-emp-not-inc,13,Exec-managerial,Male,13,<=50K
2,38,Private,9,Handlers-cleaners,Male,40,<=50K
3,53,Private,7,Handlers-cleaners,Male,40,<=50K
4,28,Private,13,Prof-specialty,Female,40,<=50K


Changing the income data into numerical form (1 for `>50K`, 0 for `<=50K`)

In [None]:
Xy.income = (df.income == '>50K').astype(int)
print(Xy.income.value_counts())
Xy.head()

0    24720
1     7841
Name: income, dtype: int64




Changing the sex data into numerical form (1 for `Male`, 0 for `Female`)

In [None]:
Xy.sex = (df.sex == 'Male').astype(int)
print(Xy.sex.value_counts())
Xy.head()

1    21790
0    10771
Name: sex, dtype: int64




Replacing rare classes: if we one-hot encode a rare class, the resulting column contains almost only 0s and just a few 1s, so the algorithm can learn barely anything from it.

In [None]:
df.workclass.value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

We have two workclass values that won't be helpful: Without-pay (only 14 records) and Never-worked (only 7).
We place them in the '?' class.

In [None]:
Xy.loc[df.workclass.isin(['Without-pay', 'Never-worked']), 'workclass'] = '?'
Xy.workclass.value_counts()

Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1857
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Name: workclass, dtype: int64

Now we do the same thing with the occupation column.

In [None]:
df.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64

There are more categories in the occupation column than in workclass, so we decide to replace the categories that have fewer than 700 records.
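The same grouping can also be driven by a count threshold instead of a hand-written list of categories. The sketch below is only an illustration: the helper name `merge_rare_categories` and the toy data are assumptions and are not part of the original notebook.

```python
import pandas as pd

def merge_rare_categories(series, min_count, other_label="?"):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare, replace the rest
    return series.where(~series.isin(rare), other_label)

# Toy data, made up for illustration
occupation = pd.Series(
    ["Sales"] * 5 + ["Tech-support"] * 4 + ["Armed-Forces"] * 1 + ["?"] * 2
)
merged = merge_rare_categories(occupation, min_count=3)
print(merged.value_counts())
```

On the census data, the equivalent call would use `min_count=700` on the `occupation` column.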

In [None]:
Xy.loc[df.occupation.isin(['Armed-Forces','Priv-house-serv','Protective-serv']), 'occupation'] = '?'
Xy.occupation.value_counts()

Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
?                    2650
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Name: occupation, dtype: int64

Lastly, we apply one-hot encoding to the prepared data.
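To make the effect of one-hot encoding concrete, here is a small sketch on a toy frame (the three rows are made up for illustration). Each distinct value of the encoded column becomes its own 0/1 column, while the numeric column is left untouched. `dtype=int` is passed explicitly so that the result contains 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# One new column per distinct workclass value
encoded = pd.get_dummies(toy, columns=["workclass"], dtype=int)
print(encoded)
```

`pd.get_dummies` also accepts `drop_first=True`, which drops one column per encoded feature; whether that helps here is a modeling choice that this notebook does not explore.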

In [None]:
Xy = pd.get_dummies(Xy, columns = ['workclass', 'occupation'])

In [None]:
print(Xy.shape)
Xy.head()

(32561, 24)


Unnamed: 0,age,education-num,sex,hours-per-week,income,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,...,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Prof-specialty,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,39,13,1,40,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,50,13,1,13,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,38,9,1,40,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
3,53,7,1,40,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,28,13,0,40,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0


Splitting the data set into labels and features.
- Labels: the data that we want to predict (income)
- Features: everything else

`stratify`: this parameter of `train_test_split` makes the split preserve class proportions, so that the train and test subsets contain the same proportion of each value as the array passed to `stratify`.

For example, if `y` is a binary categorical variable with 25% zeros and 75% ones, `stratify=y` makes sure that each subset of the random split also has 25% zeros and 75% ones.
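The 25% / 75% example can be checked directly. The sketch below uses synthetic labels and a placeholder feature (both are assumptions made for illustration); `random_state=0` is set only to make the demo repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 25% zeros and 75% ones
y_demo = np.array([0] * 250 + [1] * 750)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# With stratify, both subsets keep the original share of ones
print("share of ones in train:", y_tr.mean())
print("share of ones in test: ", y_te.mean())
```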

In [None]:
y = Xy.income
X = Xy.drop('income', axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3,stratify = y)

print ('X train shape:', X_train.shape)
print ('X test shape:', X_test.shape)
print ('y train shape:', y_train.shape)
print ('y test shape:', y_test.shape)

X train shape: (22792, 23)
X test shape: (9769, 23)
y train shape: (22792,)
y test shape: (9769,)


## Task 4 - Logistic Regression
Train and test a logistic regression model. If you want to get the maximum score, you must make sure that your model:
- Does not overfit
- Does not underfit
- Achieves at least 80% accuracy on the test subset.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
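The warning above names two remedies: raising `max_iter` or scaling the features. A minimal sketch of both follows, on synthetic data standing in for the census features. The data and variable names are assumptions, and the scores reported in this notebook come from the unscaled default model, not from this pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census features (not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale than the others

# Standardize the features, then fit; max_iter is raised above the default of 100
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
print("train accuracy:", scaled_model.score(X_demo, y_demo))
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training data only and are reapplied automatically at prediction time.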


### Accuracy

Train data:

In [None]:
model.score(X_train,y_train)

0.8112495612495613

Test data:

In [None]:
model.score(X_test,y_test)

0.8121609171870202
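Because the classes are imbalanced, it helps to compare this accuracy with a trivial baseline: a model that always predicts the majority class (`<=50K`). The sketch below computes that baseline from the class counts printed by `df.income.value_counts()` earlier in the notebook.

```python
# Class counts taken from the df.income.value_counts() output
n_low, n_high = 24720, 7841

# Accuracy of a model that always predicts the majority class (<=50K)
baseline_accuracy = n_low / (n_low + n_high)
print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

The baseline is roughly 76%, so a test accuracy of about 81% is only a moderate improvement over always answering `<=50K`.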

## Task 5 - Precision and recall
- Compute precision and recall for your model, for both the train and test subsets.
- Make sure that you understand these metrics; you may be asked to explain what they mean.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

pred_train = model.predict(X_train)
pred_test = model.predict(X_test)


In [None]:
print ('Precision (train set): {:.2f}%'.format(100*precision_score(y_train, pred_train)))
print ('Precision (test set): {:.2f}%'.format(100*precision_score(y_test, pred_test)))

Precision (train set): 67.10%
Precision (test set): 66.80%


In [None]:
print ('Recall (train set): {:.2f}%'.format(100*recall_score(y_train, pred_train)))
print ('Recall (test set): {:.2f}%'.format(100*recall_score(y_test, pred_test)))

Recall (train set): 42.43%
Recall (test set): 43.71%


In [None]:
print ('Accuracy (train set): {:.2f}%'.format(100*accuracy_score(y_train, pred_train)))
print ('Accuracy (test set): {:.2f}%'.format(100*accuracy_score(y_test, pred_test)))

Accuracy (train set): 81.12%
Accuracy (test set): 81.22%
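To see what these metrics mean, they can be computed by hand from the four kinds of outcomes. The sketch below uses toy labels and predictions that are made up for illustration; they are not the census results.

```python
import numpy as np

# Toy labels and predictions, made up for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # positives predicted as positive
fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives predicted as positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # positives the model missed
tn = np.sum((y_true == 0) & (y_pred == 0))  # negatives predicted as negative

precision = tp / (tp + fp)          # of all predicted positives, how many are right
recall = tp / (tp + fn)             # of all real positives, how many are found
accuracy = (tp + tn) / len(y_true)  # share of all predictions that are right
print(precision, recall, accuracy)
```

Read this way, the results above say that about two thirds of the people the model labels as `>50K` really are, but it finds fewer than half of all people who actually earn more than 50K.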


## Task 6: Applying the model
Use your model to check if you will earn above $50,000 per year. Check both the response from the model (true/false) and the probability that the response will be true. Check using the data about yourself:
- right now
- two years from now
- ten years from now

In [None]:
print (X_train.columns)

Index(['age', 'education-num', 'sex', 'hours-per-week', 'workclass_?',
       'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private',
       'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
       'workclass_State-gov', 'occupation_?', 'occupation_Adm-clerical',
       'occupation_Craft-repair', 'occupation_Exec-managerial',
       'occupation_Farming-fishing', 'occupation_Handlers-cleaners',
       'occupation_Machine-op-inspct', 'occupation_Other-service',
       'occupation_Prof-specialty', 'occupation_Sales',
       'occupation_Tech-support', 'occupation_Transport-moving'],
      dtype='object')


In [None]:
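# Values follow the order of X_train.columns: age, education-num, sex, hours-per-week,
# then the one-hot columns (here workclass_Private = 1 and occupation_Prof-specialty = 1)
right_now = np.array([21,9,1,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]).reshape(1,-1)
in_two_years = np.array([23,13,1,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]).reshape(1,-1)
in_ten_years = np.array([31,14,1,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]).reshape(1,-1)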
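Writing 23 positional values by hand is error-prone, and because the model was fitted on a DataFrame, recent scikit-learn versions may warn about missing feature names when given a plain array. One possible alternative is to build the row by column name. In the sketch below, the helper `make_person` is a hypothetical name introduced for illustration; the column list is copied from the `X_train.columns` output.

```python
import pandas as pd

# Column order copied from the X_train.columns output
columns = [
    'age', 'education-num', 'sex', 'hours-per-week', 'workclass_?',
    'workclass_Federal-gov', 'workclass_Local-gov', 'workclass_Private',
    'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
    'workclass_State-gov', 'occupation_?', 'occupation_Adm-clerical',
    'occupation_Craft-repair', 'occupation_Exec-managerial',
    'occupation_Farming-fishing', 'occupation_Handlers-cleaners',
    'occupation_Machine-op-inspct', 'occupation_Other-service',
    'occupation_Prof-specialty', 'occupation_Sales',
    'occupation_Tech-support', 'occupation_Transport-moving',
]

def make_person(age, education_num, sex, hours, workclass, occupation):
    """Build a one-row DataFrame; every one-hot column defaults to 0."""
    values = dict.fromkeys(columns, 0)
    values.update({'age': age, 'education-num': education_num,
                   'sex': sex, 'hours-per-week': hours})
    for key in (f'workclass_{workclass}', f'occupation_{occupation}'):
        if key not in values:
            raise KeyError(f'unknown category column: {key}')
        values[key] = 1
    return pd.DataFrame([values], columns=columns)

right_now_df = make_person(21, 9, 1, 40, 'Private', 'Prof-specialty')
print(right_now_df.iloc[0].tolist())
```

The resulting frame can be passed to `model.predict` and `model.predict_proba` in place of the reshaped array.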

In [None]:
print("RIGHT NOW")
print(model.predict(right_now))
print(model.predict_proba(right_now))

RIGHT NOW
[0]
[[0.86479243 0.13520757]]




In [None]:
print("TWO YEARS FROM NOW")
print(model.predict(in_two_years))
print(model.predict_proba(in_two_years))

TWO YEARS FROM NOW
[0]
[[0.68237244 0.31762756]]




In [None]:
print("TEN YEARS FROM NOW")
print(model.predict(in_ten_years))
print(model.predict_proba(in_ten_years))


TEN YEARS FROM NOW
[0]
[[0.54130404 0.45869596]]
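The link between `predict` and `predict_proba` can be made explicit. For a binary model, the predicted label is 1 when the probability of class 1 exceeds 0.5. The sketch below reuses the three probability rows printed above; the alternative threshold of 0.4 is an arbitrary value chosen for illustration.

```python
import numpy as np

# Probabilities copied from the three predict_proba outputs
proba = np.array([
    [0.86479243, 0.13520757],  # right now
    [0.68237244, 0.31762756],  # two years from now
    [0.54130404, 0.45869596],  # ten years from now
])

p_high_income = proba[:, 1]                         # P(income > 50K)
default_labels = (p_high_income > 0.5).astype(int)  # default decision rule
lenient_labels = (p_high_income > 0.4).astype(int)  # illustrative lower threshold
print(default_labels)
print(lenient_labels)
```

With the default rule all three cases are labeled 0, although the probability of earning above 50K rises from about 14% to about 46% over the ten years. Lowering the threshold would flip only the last case.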


