## Problem Statement:
- Census income data plays an important role in democratic government and strongly affects economic sectors. Governments use census figures to allocate federal funding to states and localities. Census data also supports post-census population estimates and projections, economic and social science research, and many other applications.
- The importance of this data, and of accurate predictions made from it, is therefore clear. The main aim is to raise awareness of how income affects not only the lives of individual citizens but also the nation as a whole. You will examine data extracted from the 1994 Census Bureau database and look for insights into how various features affect an individual's income.
- The data contains approximately 32,000 observations with 15 variables. The strategy is to analyze the data and then perform a classification task: predicting whether an individual makes more than 50K a year, using logistic regression.

In [147]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


In [148]:
# Read the data and treat '?' entries as missing values
df = pd.read_csv('census-income.csv', na_values='?')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [149]:
df.duplicated().sum()

24
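The count above is the number of rows that exactly repeat an earlier row. As a minimal sketch of how `duplicated` relates to `drop_duplicates`, the snippet below uses a small made-up frame (`toy` is hypothetical and not the census data):

```python
import pandas as pd

# Hypothetical toy frame, not the census data.
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["State-gov", "Private", "State-gov", "Private"],
})

# duplicated() flags each row that repeats an earlier row;
# the first occurrence is not flagged.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame without the flagged rows.
deduped = toy.drop_duplicates()

print(n_dupes, len(deduped))
```

Whether duplicates should be removed here is a judgment call: two different respondents can legitimately share the same recorded values.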

In [150]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
annual_income        0
dtype: int64
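The missing counts above exist because `na_values='?'` converts the `?` placeholders to NaN when the file is read. The sketch below illustrates this on a three-row in-memory CSV (the rows are made up for illustration) and also shows one way to get a single total across all columns:

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for census-income.csv.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39,State-gov,Adm-clerical\n"
    "54,?,?\n"
    "28,Private,?\n"
)

# '?' cells are parsed as NaN instead of the literal string '?'.
toy = pd.read_csv(raw, na_values="?")

per_column = toy.isnull().sum()          # missing cells per column
total_missing = int(per_column.sum())    # missing cells in the whole frame

print(per_column)
print(total_missing)
```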

In [151]:
df['annual_income'].unique()

array(['<=50K', '>50K'], dtype=object)

In [152]:
# Work on a copy so the original frame stays untouched
df1 = df.copy()

In [153]:
#df1.drop_duplicates(inplace= True)

In [154]:
df1.dropna(inplace=True)

In [155]:
df1['occupation'].unique()

array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
       'Prof-specialty', 'Other-service', 'Sales', 'Transport-moving',
       'Farming-fishing', 'Machine-op-inspct', 'Tech-support',
       'Craft-repair', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)

In [156]:
## 1. How many types of occupations do we have?
df1['occupation'].nunique()

14

In [158]:
## 2. How many people are working as tech support and have an annual income greater than 50k?
df.loc[(df['occupation'] == 'Tech-support') & (df['annual_income'] == '>50K')].shape[0]

283

In [96]:
## 3. How many total missing values are present in the dataset?
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
annual_income        0
dtype: int64

In [97]:
# Total missing cells: sum the per-column counts instead of typing them by hand
df.isnull().sum().sum()

4262

In [98]:
## 4. If there are missing values in the Marital Status column, which of the
# following options should be used to replace the missing values?
df['marital-status'].value_counts()

Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: marital-status, dtype: int64
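For a categorical column such as this one, a common choice is to fill gaps with the most frequent category (the mode), since a mean or median is not defined for text labels. The sketch below assumes a small made-up series; the real `marital-status` column has no missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical values with two gaps, for illustration only.
toy = pd.Series(
    ["Married-civ-spouse", "Never-married", np.nan,
     "Married-civ-spouse", "Divorced", np.nan],
    name="marital-status",
)

# mode() ignores NaN by default and returns the most frequent value(s).
most_common = toy.mode()[0]

# Replace every missing entry with that category.
filled = toy.fillna(most_common)

print(most_common)
print(filled.value_counts())
```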

In [99]:
## 5. How many people have the Private work class and are not from the United States of America?
df1.loc[(df1['workclass'] == 'Private') & (df1['native-country'] != 'United-States')].shape[0]

2151

In [100]:
## 6. How many people have either an annual income (last column) of at most 50K
# or working hours of at least 40 per week?
df1.loc[(df1['annual_income'] == '<=50K') | (df1['hours-per-week'] >= 40)].shape[0]

29505
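The filters in questions 5 and 6 combine boolean masks with `&` (both conditions) and `|` (either condition). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` or `>=`. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "annual_income": ["<=50K", ">50K", ">50K", "<=50K"],
    "hours-per-week": [40, 45, 20, 13],
})

# Naming the masks keeps the combined condition readable.
low_income = toy["annual_income"] == "<=50K"
long_hours = toy["hours-per-week"] >= 40

either = int((low_income | long_hours).sum())  # rows matching at least one condition
both = int((low_income & long_hours).sum())    # rows matching both conditions

print(either, both)
```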

### Perform the following tasks to answer the remaining questions
- Rename the last column as Annual Income
- Remove the missing values from the dataset
- Change the labels of categorical data into numerical data using Label Encoder.
- Split the dataset into train and test sets in a 70:30 ratio and set the random state to 0.
- Build a Logistic Regression Model on the data.


In [101]:
## Rename the last column as Annual Income
df.rename(columns = {'annual_income' : 'Annual Income'}, inplace = True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Annual Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [102]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
Annual Income        0
dtype: int64

In [103]:
## Remove the missing values from the dataset
df.dropna(inplace = True)
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
Annual Income     0
dtype: int64

In [104]:
## Change the labels of categorical data into numerical data using Label Encoder.
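cat = df.select_dtypes(include='object')
le = LabelEncoder()

# Encode each text column in place as integer codes (classes are sorted alphabetically).
# The same encoder object is refit for every column, so after the loop `le` holds
# the classes of the last column, Annual Income.
for col in cat.columns:
    df[col] = le.fit_transform(cat[col])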

    

In [105]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,Annual Income
0,39,5,77516,9,13,4,0,1,4,1,2174,0,40,38,0
1,50,4,83311,9,13,2,3,0,4,1,0,0,13,38,0
2,38,2,215646,11,9,0,5,1,4,1,0,0,40,38,0
3,53,2,234721,1,7,2,5,0,2,1,0,0,40,38,0
4,28,2,338409,9,13,2,9,5,2,0,0,0,40,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,2,257302,7,12,2,12,5,4,0,0,0,38,38,0
32557,40,2,154374,11,9,2,6,0,4,1,0,0,40,38,1
32558,58,2,151910,11,9,6,0,4,4,0,0,0,40,38,0
32559,22,2,201490,11,9,4,0,3,4,1,0,0,20,38,0
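The encoded frame above no longer shows which label each integer stands for, and a single shared encoder only remembers the last column it was fitted on. One possible variation, sketched here on made-up data, keeps one encoder per column in a dictionary (`encoders` is a hypothetical name) so any column can be decoded later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "annual_income": ["<=50K", ">50K", "<=50K"],
})

# One fitted encoder per column, stored by column name.
encoders = {}
for col in ["sex", "annual_income"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Any column can now be mapped back to its original labels.
decoded = encoders["sex"].inverse_transform(toy["sex"])

print(toy)
print(decoded)
```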


In [109]:
## Split the dataset into train and test sets in a 70:30 ratio and set the random state to 0
X = df.iloc[:, :-1]   # features: every column except the last
y = df.iloc[:, -1]    # target: the last column (Annual Income)
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.3,
                                                   random_state=0)
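As a small illustration of what `test_size=0.3` and a fixed `random_state` do, the sketch below splits ten made-up rows and repeats the split to show that the same seed reproduces the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 7 + [1] * 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Splitting again with the same random_state gives the same rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```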

In [117]:
# Column-vector view of the training labels (not used by the cells below)
y_t = y_train.values.reshape(-1, 1)

In [119]:
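## Build a Logistic Regression Model on the data.

# Default settings on the unscaled features; the fit stops at the
# iteration limit, as the warning in the output shows.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)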

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
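The warning suggests two remedies: scaling the data or raising `max_iter`. The sketch below applies both to synthetic data with one small-scale and one large-scale feature (a stand-in for columns like `fnlwgt`). It is an illustration under those assumptions, not the model used in this notebook, and the scores reported here come from the unscaled fit, so applying it to the census data would change them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales.
small = rng.normal(size=500)
large = rng.normal(loc=200_000, scale=100_000, size=500)
X_toy = np.column_stack([small, large])
y_toy = (small + (large - 200_000) / 100_000 > 0).astype(int)

# Standardize the features, then fit with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_toy, y_toy)

# Iterations the solver actually used, and accuracy on the training data.
n_iter = int(scaled_model.named_steps["logisticregression"].n_iter_[0])
train_acc = scaled_model.score(X_toy, y_toy)

print(n_iter, train_acc)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting on test data.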


In [121]:
## 14. What is the accuracy score of the above model?
accuracy_score(y_test, y_pred)


0.792794783954028

In [126]:
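## 15. What is the specificity of the above model?
# The report has no line named "specificity"; with >50K (1) as the positive class,
# specificity is the recall of class 0.
print(classification_report(y_test, y_pred))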

              precision    recall  f1-score   support

           0       0.79      0.98      0.88      6764
           1       0.80      0.24      0.37      2285

    accuracy                           0.79      9049
   macro avg       0.80      0.61      0.62      9049
weighted avg       0.79      0.79      0.75      9049
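In the report above, the recall of class 0 (0.98) is the specificity and the recall of class 1 (0.24) is the sensitivity, provided class 1 is treated as the positive class. One way to get the two numbers directly is `recall_score` with `pos_label`, sketched here on made-up labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels, for illustration only (1 = positive class).
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

# Recall of the negative class is the specificity.
specificity = recall_score(y_true, y_hat, pos_label=0)

# Recall of the positive class is the sensitivity.
sensitivity = recall_score(y_true, y_hat, pos_label=1)

print(specificity, sensitivity)
```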



In [145]:
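from sklearn.metrics import confusion_matrix

# labels=[1, 0] puts the positive class (>50K) first:
# row 0 = actual >50K -> [TP, FN], row 1 = actual <=50K -> [FP, TN]
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
print('Confusion Matrix : \n', cm)

# Accuracy: correct predictions (the diagonal) over all predictions
total = cm.sum()
accuracy = (cm[0, 0] + cm[1, 1]) / total
print('Accuracy : ', accuracy)

# Sensitivity (recall of >50K): TP / (TP + FN)
sensitivity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Sensitivity : ', sensitivity)

# Specificity (recall of <=50K): TN / (FP + TN)
specificity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
print('Specificity : ', specificity)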

Confusion Matrix : 
 [[ 547 1738]
 [ 137 6627]]
Accuracy :  0.792794783954028
Sensitivity :  0.23938730853391685
Specificity :  0.979745712596097
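The printed figures can be cross-checked by hand from the four counts in the matrix. The sketch below copies those counts from the output above and recomputes each metric; the variable names (`cm_reported`, `tp`, `fn`, `fp`, `tn`) are introduced here for illustration:

```python
import numpy as np

# Counts copied from the confusion matrix printed above (labels=[1, 0]).
cm_reported = np.array([[547, 1738],
                        [137, 6627]])

tp, fn = cm_reported[0]   # actual >50K: predicted >50K, predicted <=50K
fp, tn = cm_reported[1]   # actual <=50K: predicted >50K, predicted <=50K

correct = int(tp + tn)
accuracy = correct / cm_reported.sum()
sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
precision = tp / (tp + fp)   # matches the class-1 precision in the report

print(correct, accuracy, sensitivity, specificity, precision)
```

The low sensitivity next to the high specificity shows that the model labels most people as `<=50K`, so it misses the majority of actual `>50K` earners.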


In [130]:
le.classes_

array(['<=50K', '>50K'], dtype=object)

In [138]:
le.inverse_transform([0,1])

array(['<=50K', '>50K'], dtype=object)

In [141]:
# Side-by-side frame of actual and predicted labels for the test set
pred_df = pd.DataFrame({'y_true': y_test, 'y_pred': y_pred})

In [144]:
# True positives: actual >50K and predicted >50K (matches cm[0, 0])
pred_df.loc[(pred_df['y_true'] == 1) & (pred_df['y_pred'] == 1)].shape[0]

547

In [146]:
## 19. How many records are correctly classified by the model?
cm[1,1]+cm[0,0]

7174