# Case Study: Income group classification (WHO data) using Logistic Regression

### Context:
- DeltaSquare is an NGO that works with the Government on matters of social policy to improve the lives of underprivileged sections of society. It has been tasked with drafting a policy framework based on data the government obtained from WHO. As a data scientist at DeltaSquare, you are asked to solve this problem and share a proposal with the government.

### Problem:

#### The dataset is used to answer the following key questions:

- What are the different factors that influence the income of an individual?

- Does a good predictive model for income exist? What does the performance assessment look like for such a model?

### Attribute Information:
The data contains the following characteristics of each person:

- age: Age of the person - continuous.
- workclass: Where the person works - categorical - Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: Weight assigned by the Current Population Survey (CPS). It is designed so that people with similar demographic characteristics receive similar weights - continuous.
- education: Degree the person has - Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: Number of years the person studied - continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: Investment gain of the person other than salary - continuous
- capital-loss: Loss from investments - continuous
- hours-per-week: No. of hours a person works - continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- salary: >50K, <=50K (dependent variable, the salary is in Dollars per year)

### Loading Libraries

In [3]:
import warnings
warnings.filterwarnings("ignore")

# libraries for data reading and manipulating
import numpy as np
import pandas as pd

pd.set_option("display.max_columns",None)
pd.set_option("display.max_rows",200)

# libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# libraries for splitting data into train and test data
from sklearn.model_selection import train_test_split

# libraries - build model for prediction
from sklearn.linear_model import LogisticRegression

# libraries to get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    #plot_confusion_matrix,
    precision_recall_curve, 
    roc_curve,
)


#### Load Data

In [7]:
data = pd.read_csv("who_data.csv")

In [8]:
# copying data into another variable to avoid any changes to the original dataset
df = data.copy()

In [11]:
# view the first 5 rows of dataset
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_no_of_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,working_hours_per_week,native_country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [12]:
# view the last 5 rows from dataset
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education_no_of_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,working_hours_per_week,native_country,salary
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [13]:
df.shape

(32561, 15)

##### There are 32561 rows and 15 columns in the dataset

### Let's create lists of numerical and categorical variables

In [20]:
nu_col = df.select_dtypes(include=[np.number]).columns
cat_col = df.select_dtypes(include=["object"]).columns

print("Numerical Columns: ",nu_col)
print("Categorical Columns: ",cat_col)

Numerical Columns:  Index(['age', 'fnlwgt', 'education_no_of_years', 'capital_gain',
       'capital_loss', 'working_hours_per_week'],
      dtype='object')
Categorical Columns:  Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'salary'],
      dtype='object')


### Summary of numerical data

In [21]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,32561.0,38.581647,13.640433,17.0,28.0,37.0,48.0,90.0
fnlwgt,32561.0,189778.366512,105549.977697,12285.0,117827.0,178356.0,237051.0,1484705.0
education_no_of_years,32561.0,10.080679,2.57272,1.0,9.0,10.0,12.0,16.0
capital_gain,32561.0,1077.648844,7385.292085,0.0,0.0,0.0,0.0,99999.0
capital_loss,32561.0,87.30383,402.960219,0.0,0.0,0.0,0.0,4356.0
working_hours_per_week,32561.0,40.437456,12.347429,1.0,40.0,40.0,45.0,99.0


- The average age is about 38.6 years; the minimum and maximum ages are 17 and 90 respectively.
- The average education_no_of_years is about 10 years.
- capital_gain and capital_loss are 0 at the 25th, 50th and 75th percentiles but have large maximum values (99999 and 4356), which means both columns are mostly zero with a few large outliers.
- The average working_hours_per_week is about 40.
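The observation about capital_gain and capital_loss can be made more concrete by measuring how many values are exactly zero. The following is a minimal sketch on a small toy frame; the values are illustrative rather than rows from the dataset, and `zero_share` is a hypothetical helper name.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from who_data.csv.
toy = pd.DataFrame(
    {
        "capital_gain": [0, 0, 0, 0, 2174, 0, 0, 15024, 0, 99999],
        "capital_loss": [0, 0, 0, 1485, 0, 0, 0, 0, 0, 0],
    }
)


def zero_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of rows equal to 0 for each column (hypothetical helper)."""
    return frame.eq(0).mean()


# Put the share of zeros next to the maximum to show "mostly zero, few large values".
summary = pd.DataFrame({"zero_share": zero_share(toy), "max": toy.max()})
print(summary)
```

On the real data, the same call would be `zero_share(df[["capital_gain", "capital_loss"]])`.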

### Checking different levels in categorical data

In [24]:
for i in cat_col:
    # absolute counts per level
    print(df[i].value_counts())
    # relative frequencies per level
    print(df[i].value_counts(normalize=True))
    print("*" * 50)

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
 Private             0.697030
 Self-emp-not-inc    0.078038
 Local-gov           0.064279
 ?                   0.056386
 State-gov           0.039864
 Self-emp-inc        0.034274
 Federal-gov         0.029483
 Without-pay         0.000430
 Never-worked        0.000215
Name: workclass, dtype: float64
**************************************************
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: 

- Some values in the workclass, occupation and native_country columns are recorded as '?', which requires further investigation.
- The native_country column has many distinct values, which can be reduced by grouping countries into continents.
- The number of distinct levels in marital_status can also be reduced.
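Rather than reading through the printed counts, the columns that contain the '?' placeholder can be located programmatically. This is a minimal sketch on a toy frame (illustrative values, keeping the leading space seen in the outputs above); the variable names are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only; labels keep the leading space seen in the outputs above.
toy = pd.DataFrame(
    {
        "workclass": [" Private", " ?", " State-gov", " ?"],
        "occupation": [" Sales", " ?", " Adm-clerical", " ?"],
        "native_country": [" United-States", " Mexico", " ?", " United-States"],
        "sex": [" Male", " Female", " Male", " Female"],
    }
)

# Strip surrounding whitespace before comparing, so ' ?' and '?' both match.
is_placeholder = toy.apply(lambda s: s.str.strip().eq("?"))

# Count placeholders per column and keep only the columns that have any.
placeholder_counts = is_placeholder.sum()
cols_with_placeholder = placeholder_counts[placeholder_counts > 0].index.tolist()

print(placeholder_counts)
print(cols_with_placeholder)
```

On the real data, `toy` would be replaced by `df[cat_col]`.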

### Data Cleaning

#### We can assume that a '?' anywhere in the dataset means the value is missing or unknown
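One alternative way to act on this assumption is to declare '?' as a missing-value marker when the file is read. This notebook does not take that route, since it keeps the '?' rows and relabels them later. The sketch below uses an in-memory toy CSV and assumes the file stores a space after each comma, which would explain the leading space in values such as ' ?'.

```python
import io

import pandas as pd

# Toy CSV text for illustration; it mimics the ', ?' pattern (comma, space, question mark).
raw = io.StringIO(
    "age, workclass, occupation\n"
    "39, State-gov, Adm-clerical\n"
    "19, ?, ?\n"
)

# skipinitialspace drops the blank after each comma; na_values maps '?' to NaN.
toy = pd.read_csv(raw, skipinitialspace=True, na_values="?")

# Missing values per column, now visible to isna() and related methods.
print(toy.isna().sum())
```

With this approach the missing entries show up in `isna()` and can be imputed or dropped, whereas an explicit category keeps every row and lets the model treat "unknown" as information in its own right.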

#### workclass

In [35]:
#print(df[df['workclass'] == ' ?'].count())
df[df['workclass'] == ' ?'].sample(5)


Unnamed: 0,age,workclass,fnlwgt,education,education_no_of_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,working_hours_per_week,native_country,salary
15743,19,?,131982,Some-college,10,Never-married,?,Own-child,White,Male,0,0,40,United-States,<=50K
695,25,?,202480,Assoc-acdm,12,Never-married,?,Other-relative,White,Male,0,0,45,United-States,<=50K
14860,72,?,237229,Assoc-voc,11,Widowed,?,Not-in-family,White,Female,0,0,30,United-States,<=50K
7352,18,?,192399,Some-college,10,Never-married,?,Own-child,White,Male,0,0,60,United-States,<=50K
27022,63,?,29859,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,1485,40,United-States,>50K


- It looks like workclass and occupation are both '?' in the same rows.
- All five sampled records belong to United-States, so we need to check whether this holds in general.


In [36]:
df[df['workclass'] == ' ?']['occupation'].value_counts()

 ?    1836
Name: occupation, dtype: int64

- The result confirms that wherever workclass is '?', occupation is also '?' (all 1836 rows).
- This indicates that missingness in the two columns follows a strong shared pattern.

In [37]:
df[df['workclass'] == ' ?']['native_country'].value_counts()

 United-States         1659
 Mexico                  33
 ?                       27
 Canada                  14
 Philippines             10
 South                    9
 Germany                  9
 Taiwan                   9
 China                    7
 El-Salvador              6
 Italy                    5
 Puerto-Rico              5
 Poland                   4
 England                  4
 Portugal                 3
 Columbia                 3
 Vietnam                  3
 Dominican-Republic       3
 Japan                    3
 Cuba                     3
 Haiti                    2
 France                   2
 Ecuador                  1
 Peru                     1
 Cambodia                 1
 Thailand                 1
 Honduras                 1
 Laos                     1
 Hong                     1
 Guatemala                1
 Trinadad&Tobago          1
 Iran                     1
 Nicaragua                1
 Jamaica                  1
 Scotland                 1
Name: native_country

- The United-States pattern does not hold: rows with '?' in workclass also come from many other countries.

#### occupation

In [38]:
df[df['occupation'] == ' ?'].sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_no_of_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,working_hours_per_week,native_country,salary
8189,23,?,234970,Some-college,10,Never-married,?,Own-child,Black,Female,0,0,40,United-States,<=50K
8908,29,?,41281,Bachelors,13,Married-spouse-absent,?,Not-in-family,White,Male,0,0,50,United-States,<=50K
15266,18,?,234648,11th,7,Never-married,?,Own-child,Black,Male,0,0,15,United-States,<=50K
27038,22,?,330571,HS-grad,9,Never-married,?,Not-in-family,White,Female,0,0,45,United-States,<=50K
1282,65,?,36039,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,40,United-States,>50K


In [39]:
df[df['occupation'] == ' ?']['workclass'].value_counts()

 ?               1836
 Never-worked       7
Name: workclass, dtype: int64

- We observe the same pattern in reverse: where occupation is '?', workclass is '?' in 1836 rows, and the remaining 7 rows have workclass Never-worked.
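The two one-directional checks above can be combined into a single contingency table of the '?' indicators. This is a minimal sketch on toy rows (illustrative values, including one Never-worked row with a '?' occupation, as in the counts above); the Series names are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only; one Never-worked row has occupation ' ?'.
toy = pd.DataFrame(
    {
        "workclass": [" ?", " ?", " Private", " Never-worked", " Private"],
        "occupation": [" ?", " ?", " Sales", " ?", " Craft-repair"],
    }
)

# Boolean indicators for the placeholder in each column.
workclass_missing = toy["workclass"].eq(" ?").rename("workclass_is_?")
occupation_missing = toy["occupation"].eq(" ?").rename("occupation_is_?")

# Rows: workclass indicator; columns: occupation indicator.
overlap = pd.crosstab(workclass_missing, occupation_missing)
print(overlap)
```

A zero in the (True, False) cell means occupation is never known when workclass is '?'. A non-zero (False, True) cell corresponds to rows such as Never-worked, where only occupation is '?'.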

In [40]:
df[df['occupation'] == ' ?']['native_country'].value_counts()

 United-States         1666
 Mexico                  33
 ?                       27
 Canada                  14
 Philippines             10
 South                    9
 Germany                  9
 Taiwan                   9
 China                    7
 El-Salvador              6
 Italy                    5
 Puerto-Rico              5
 Poland                   4
 England                  4
 Portugal                 3
 Columbia                 3
 Vietnam                  3
 Dominican-Republic       3
 Japan                    3
 Cuba                     3
 Haiti                    2
 France                   2
 Ecuador                  1
 Peru                     1
 Cambodia                 1
 Thailand                 1
 Honduras                 1
 Laos                     1
 Hong                     1
 Guatemala                1
 Trinadad&Tobago          1
 Iran                     1
 Nicaragua                1
 Jamaica                  1
 Scotland                 1
Name: native_country

- As with workclass, rows with '?' in occupation come from many different countries, so these missing values are not tied to native_country.

#### native_country

In [41]:
df[df['native_country'] == ' ?'].sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_no_of_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,working_hours_per_week,native_country,salary
29777,26,Private,130620,Assoc-acdm,12,Married-spouse-absent,Craft-repair,Other-relative,Asian-Pac-Islander,Female,0,0,40,?,<=50K
15953,31,Private,591711,Some-college,10,Married-spouse-absent,Transport-moving,Not-in-family,Black,Male,0,0,40,?,<=50K
10777,53,Private,88725,HS-grad,9,Never-married,Craft-repair,Not-in-family,Other,Female,0,0,40,?,<=50K
29680,64,Local-gov,199298,5th-6th,3,Divorced,Other-service,Not-in-family,White,Female,0,0,45,?,<=50K
15863,34,Private,220631,Assoc-voc,11,Never-married,Other-service,Not-in-family,White,Male,0,0,50,?,<=50K


In [42]:
df[df['native_country'] == ' ?']['occupation'].value_counts()

 Prof-specialty       102
 Other-service         83
 Exec-managerial       74
 Craft-repair          69
 Sales                 66
 Adm-clerical          49
 Machine-op-inspct     36
 ?                     27
 Transport-moving      25
 Handlers-cleaners     20
 Tech-support          16
 Priv-house-serv        6
 Farming-fishing        5
 Protective-serv        5
Name: occupation, dtype: int64

In [43]:
df[df['native_country'] == ' ?']['workclass'].value_counts()

 Private             410
 Self-emp-not-inc     42
 Self-emp-inc         42
 ?                    27
 Local-gov            26
 State-gov            19
 Federal-gov          17
Name: workclass, dtype: int64

- There is no clear pattern: rows with '?' in native_country are spread across many occupation and workclass values.

### Observations:

- In every row where workclass is '?', occupation is also '?'.
- This strong pattern between workclass and occupation makes sense, as both variables capture closely related information about a person's employment.
- The '?' values in occupation and workclass show no strong association with '?' values in native_country.
- For now, we will replace these '?' values with the 'Unknown' category.
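The claim about native_country can be checked with simple arithmetic on the counts printed above. If '?' in native_country were tied to '?' in workclass, the rate of missing native_country would be much higher among the rows with missing workclass than in the full dataset. A short sketch follows; the variable names are chosen for this example.

```python
# Counts copied from the value_counts outputs above.
n_rows = 32561                 # total rows, from df.shape
n_workclass_missing = 1836     # rows with workclass == ' ?'
n_both_missing = 27            # of those, rows with native_country == ' ?'

# workclass breakdown of the rows with native_country == ' ?'
country_missing_by_workclass = [410, 42, 42, 27, 26, 19, 17]
n_country_missing = sum(country_missing_by_workclass)

# Rate of missing native_country within the missing-workclass rows vs. overall.
rate_given_workclass_missing = n_both_missing / n_workclass_missing
rate_overall = n_country_missing / n_rows

print(n_country_missing)
print(f"{rate_given_workclass_missing:.4f}")
print(f"{rate_overall:.4f}")
```

The two rates are about 1.5% and 1.8%, so a missing workclass does not make a missing native_country any more likely.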

### Replacing ? with 'Unknown'

In [44]:
# replace the ' ?' placeholder (note the leading space) with an explicit "Unknown" category
for col in ["workclass", "native_country", "occupation"]:
    df[col] = df[col].replace(" ?", "Unknown")
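After this replacement the new label has no leading space, while the original labels (for example ' Private') still do. The sketch below shows an optional extra step, which is an assumption and not part of this notebook: trimming whitespace so that all labels are consistent, then checking that no placeholder remains. It uses toy rows with illustrative values.

```python
import pandas as pd

# Toy data for illustration only; labels keep the leading space seen in the outputs above.
toy = pd.DataFrame(
    {
        "workclass": [" Private", " ?", " State-gov"],
        "occupation": [" Sales", " ?", " Adm-clerical"],
        "age": [38, 19, 39],
    }
)
text_cols = ["workclass", "occupation"]

# Same replacement as in the cell above, applied to the toy frame.
for col in text_cols:
    toy[col] = toy[col].replace(" ?", "Unknown")

# Optional extra step (an assumption, not part of the notebook): trim whitespace.
for col in text_cols:
    toy[col] = toy[col].str.strip()

# Sanity check: count any '?' placeholders left in the text columns.
remaining = int(toy[text_cols].eq("?").sum().sum())
print(toy["workclass"].tolist())
print(remaining)
```

If the labels are trimmed this way, any comparison written against labels with a leading space, such as `' Private'`, has to drop the space as well.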

### Mapping countries to continents to reduce the number of unique values

In [45]:
df.native_country.nunique()

42