<a href="https://colab.research.google.com/github/nikshargithub/ML_PROJECTS/blob/main/2024_01_15_NIKHIL_SHARMA_PROJECT_79.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project 17: Census Income Analysis



---

## Instructions

### Goal of the Project:

From class 67 to class 79, you learned the following concepts:

 - Feature Encoding.
 - Recursive Feature Elimination (RFE).
 - Logistic Regression classification using `sklearn` module.

In this project, you will apply what you have learned in class 67 - 79 to achieve the following goals.

|||
|-|-|
|**Main Goal**|Create a Logistic Regression classification model with the ideal number of features selected using RFE.|



---

### Context

According to the government, census income is the income received by an individual regularly before payments for personal income taxes, medicare deductions, and so on. This information is asked annually from the people to record in the census. It helps to identify the eligible families for various funds and programs rolled out by communities and the government.



---

#### Getting Started

Follow the steps described below to solve the project:

1. Click on the link provided below to open the Colab file for this project.
   
   https://colab.research.google.com/drive/1pZqPvDkOf0QA48v1GYzd_Vlp_7lPciiQ

2. Create a duplicate copy of the Colab file. Here are the steps to create the duplicate copy:

    - Click on the **File** menu. A new drop-down list will appear.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/0_file_menu.png' width=500>

    - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab in your web browser.

      <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/1_create_colab_duplicate_copy.png' width=500>

     - After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_CapstoneProject17** format.

3. Now, write your code in the prescribed code cells.

---

### Problem Statement

The dataset is extracted from the 1994 Census Bureau database. It contains anonymous individual records with features such as work experience, age, gender, country, and so on. The records are divided into two labels, people with a salary of **more than 50K** and people with a salary of **less than or equal to 50K**, so that the eligibility of individuals for government programs can be determined.

As a data scientist, your job is to build a prediction model that predicts whether a particular individual has an annual income of **<=50k** or **>50k**.

**Things To Do:**

1. Importing and Analysing the Dataset

2. Data Cleaning

3. Feature Engineering

4. Train-Test Split

5. Data Standardisation

6. Logistic Regression - Model Training

7. Model Prediction and Evaluation

8. Features Selection Using RFE

9. Model Training and Prediction Using Ideal Features


----

### Dataset Description

The dataset includes 32561 instances with 14 features and 1 target column, summarised below:

|Field|Description|
|---:|:---|
|age|age of the person, Integer|
|workclass| employment information about the individual, Categorical|
|fnlwgt| unknown weights, Integer|
|education| highest level of education obtained, Categorical|
|education-years|number of years of education, Integer|
|marital-status| marital status of the person, Categorical|
|occupation|job title, Categorical|
|relationship| individual's relation in the family, such as wife, husband, and so on, Categorical|
|race|Categorical|
|sex| gender, Male or Female|
|capital-gain| gain from sources other than salary/wages, Integer|
|capital-loss| loss from sources other than salary/wages, Integer|
|hours-per-week| hours worked per week, Integer|
|native-country| name of the native country, Categorical|
|income-group| annual income, Categorical,  **<=50k** or **>50k** |


**Notes:**
1. The dataset has no header row for the column names; the column names can be added manually.
2. There are invalid values in the dataset marked as **"?"**.
3. As no information about **fnlwgt** is available, it can be removed before model training.
4. Take note of the leading **whitespaces (" ")** throughout the dataset.
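
The notes above can also be handled while reading the file. The following is a minimal sketch, not the approach used in this notebook: it reads a small inline sample instead of the real file (the second row has its `workclass` replaced by `?` purely for illustration) and assumes the pandas `read_csv` options `header=None`, `names`, `skipinitialspace`, and `na_values` are acceptable for the task.

```python
import io
import pandas as pd

column_name = ['age', 'workclass', 'fnlwgt', 'education', 'education-years',
               'marital-status', 'occupation', 'relationship', 'race', 'sex',
               'capital-gain', 'capital-loss', 'hours-per-week',
               'native-country', 'income-group']

# Hypothetical inline sample in the same layout as the dataset: no header row,
# a space after each comma, and "?" marking an invalid value.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
)

sample_df = pd.read_csv(
    sample,
    header=None,            # the file has no header row
    names=column_name,      # assign the column names while reading
    skipinitialspace=True,  # drop the space that follows each comma
    na_values='?',          # read "?" as a missing value (NaN)
)

print(sample_df.shape)
print(sample_df['workclass'].isna().sum())
```

With these options the string values carry no leading space, so later steps would match `'Male'` rather than `' Male'`.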



**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/adult

**Dataset Creator:**
```
Ronny Kohavi and Barry Becker
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk '@' live.com for questions.
```

---


#### Activity 1:  Importing and Analysing the Dataset

In this activity, we have to load the dataset and analyse it.


**Perform the following tasks:**
- Load the dataset into a DataFrame.
- Rename the columns with the given list.
- Verify the number of rows and columns.
- Print the information of the DataFrame.


**1.** Start with importing all the required modules:



In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


**2.** Create a Pandas DataFrame for the **Adult Income** dataset using the below link with `header=None`.
> **Dataset Link:**
https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv

**3.** Print the first five rows of the dataset:

In [None]:
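# Load the Adult Income dataset into a DataFrame.
# Note: the task asks for `header=None`; it is not passed here, so pandas reads the
# first record as the header row and the DataFrame has 32560 rows instead of 32561.
df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv')
df.head()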

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


**4.** Rename the columns by applying the `rename()` function using the following column list:

>```python
column_name =['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race','sex','capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-group']
```



In [None]:
# Rename the column names in the DataFrame using the list given above.

# Create the list
column_name = ['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race','sex','capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income-group']

# Rename the columns using 'rename()'
df.rename(columns = dict(zip(list(df.columns),column_name)), inplace = True)

# Print the first five rows of the DataFrame
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K



**Hint:**

Syntax for `rename()` function:

`DataFrame.rename(columns={old_column_name:new_column_name})`
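
As a small illustration of the hint, the sketch below builds the `{old: new}` dictionary with `zip()` and passes it to `rename()`. The toy DataFrame and its three column names are assumptions made for the example.

```python
import pandas as pd

# Toy DataFrame with default integer column labels, as produced by header=None.
toy_df = pd.DataFrame([[39, 'State-gov', 77516], [50, 'Self-emp-not-inc', 83311]])

new_names = ['age', 'workclass', 'fnlwgt']

# Pair each old label with its new name; rename() returns a new DataFrame
# and leaves toy_df unchanged because inplace=True is not used.
name_map = dict(zip(toy_df.columns, new_names))
renamed_df = toy_df.rename(columns=name_map)

print(list(renamed_df.columns))
```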



**5.** Verify the number of rows and columns in the DataFrame:

In [None]:
# Print the number of rows and columns of the DataFrame
df.shape


(32560, 15)

**6.** Get the information of the DataFrame:

In [None]:
# Get the information of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32560 non-null  int64 
 1   workclass        32560 non-null  object
 2   fnlwgt           32560 non-null  int64 
 3   education        32560 non-null  object
 4   education-years  32560 non-null  int64 
 5   marital-status   32560 non-null  object
 6   occupation       32560 non-null  object
 7   relationship     32560 non-null  object
 8   race             32560 non-null  object
 9   sex              32560 non-null  object
 10  capital-gain     32560 non-null  int64 
 11  capital-loss     32560 non-null  int64 
 12  hours-per-week   32560 non-null  int64 
 13  native-country   32560 non-null  object
 14  income-group     32560 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


**Q:** Which is the target column?

**A:** `income-group`

**7.** Print the labels in the target column and their distribution as well:

In [None]:
# Check the distribution of the labels in the target column.
df['income-group'].value_counts()

 <=50K    24719
 >50K      7841
Name: income-group, dtype: int64

**Q:** Which target label has more records?

**A:** <=50k

**After performing this activity, you must obtain the DataFrame with renamed columns and the target column identified.**

---


#### Activity 2: Data Cleaning


In this activity, we need to clean the DataFrame step by step.

**Perform the following tasks:**
- Check for the null or missing values in the DataFrame.
- Observe the categories in column `native-country`, `workclass`, and `occupation`.
- Replace the invalid `" ?"` values in the columns with `np.nan` using `replace()` function.
- Drop the rows having `nan` values using the `dropna()` function.
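
The tasks above can be sketched on a toy DataFrame before running them on the real data. The three columns and their values are made up for illustration; the point is that the invalid marker is stored with a leading space.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'workclass': [' Private', ' ?', ' State-gov'],
    'occupation': [' Sales', ' ?', ' Adm-clerical'],
    'native-country': [' United-States', ' Cuba', ' ?'],
})

# The invalid marker carries a leading space, so ' ?' (not '?') must be matched.
cleaned_df = toy_df.replace(' ?', np.nan)
print(cleaned_df.isnull().sum())

# Drop every row that contains at least one NaN.
cleaned_df = cleaned_df.dropna()
print(cleaned_df.shape)
```

Replacing `'?'` without the space would leave the toy data unchanged, because `replace()` matches whole cell values.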



**1.** Verify the missing values in the DataFrame:

In [None]:
# Check for null values in the DataFrame.
df.isnull().sum()

age                0
workclass          0
fnlwgt             0
education          0
education-years    0
marital-status     0
occupation         0
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income-group       0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** No



**2.**  Observe the unique categories in columns `native-country`, `workclass`, and `occupation` to find the invalid values:

In [None]:
# Print the distribution of the columns mentioned to find the invalid values.

# Print the categories in column 'native-country'
print(df['native-country'].value_counts())
print()
# Print the categories in column 'workclass'
print(df['workclass'].value_counts())
print()
# Print the categories in column 'occupation'
print(df['occupation'].value_counts())

 United-States                 29169
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

**Q:** Is there any invalid value or category in any of the three columns?

**A:** Yes, there is a '?'

---

**3.** Replace the invalid values with `np.nan` and verify the number of null values in the DataFrame again.

In [None]:
# Replace the invalid values ' ?' with 'np.nan'.
df = df.replace(to_replace=" ?", value= np.nan)
# Check for null values in the DataFrame again.
df.isnull().sum()

age                   0
workclass          1836
fnlwgt                0
education             0
education-years       0
marital-status        0
occupation         1843
relationship          0
race                  0
sex                   0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      583
income-group          0
dtype: int64

**Q:** Are there any missing/null values that can be observed in the DataFrame?

**A:** Yes

---



**4.** Delete the rows having invalid values and drop the column `fnlwgt`. Print the number of rows of the DataFrame after dropping invalid values:

In [None]:
# Delete the rows with invalid values and the column not required

# Delete the rows with the 'dropna()' function
df.dropna(inplace=True)
# Delete the column with the 'drop()' function
df.drop('fnlwgt', axis = 'columns', inplace = True)

# Print the first five rows of the DataFrame (use 'df.shape' for the number of rows and columns).
df.head()

Unnamed: 0,age,workclass,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [None]:
df.isnull().sum()

age                0
workclass          0
education          0
education-years    0
marital-status     0
occupation         0
relationship       0
race               0
sex                0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income-group       0
dtype: int64

**After this activity, the DataFrame should neither have any null or invalid values nor the `fnlwgt` column.**

----


#### Activity 3: Feature Engineering

The dataset contains certain features that are categorical. To convert these features into numerical ones, use the `map()` and `get_dummies()` functions.


**Perform the following tasks for feature engineering:**

- Create a list of numerical columns.

- Map the values of the column `sex` to:
  - **`Male: 0`**
  - **`Female: 1`**

- Map the values of the column `income-group` to:
  - **` <=50K: 0`**
  - **` >50K: 1`**

- Create a list of categorical columns.

- Perform **one-hot encoding** to obtain numeric values for the rest of the categorical columns.
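
One detail matters for the mapping tasks above: `map()` returns `NaN` for any value that is not a key of the dictionary, so the keys must match the stored strings exactly, including the leading whitespace. A minimal sketch on a toy Series (the values are illustrative):

```python
import pandas as pd

sex = pd.Series([' Male', ' Female', ' Male'])

# Keys without the leading space match nothing, so every value becomes NaN.
unmatched = sex.map({'Male': 0, 'Female': 1})
print(unmatched.isna().all())

# Option 1: include the leading space in the keys.
mapped = sex.map({' Male': 0, ' Female': 1})

# Option 2: strip the whitespace first, then map with clean keys.
mapped_stripped = sex.str.strip().map({'Male': 0, 'Female': 1})

print(mapped.tolist())
```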

---

**1.** First, separate the numeric columns by creating a list of numeric columns using the `select_dtypes()` function:


In [None]:
# Select the numeric columns using 'select_dtypes()'; their names are available as 'numeric_dtypes.columns'.
numeric_dtypes = df.select_dtypes(include = ['int64','float64'])
numeric_dtypes

Unnamed: 0,age,education-years,capital-gain,capital-loss,hours-per-week
0,50,13,0,0,13
1,38,9,0,0,40
2,53,7,0,0,40
3,28,13,0,0,40
4,37,14,0,0,40
...,...,...,...,...,...
32555,27,12,0,0,38
32556,40,9,0,0,40
32557,58,9,0,0,40
32558,22,9,0,0,20


**2.** Map the labels of the column `sex` to convert it into a numerical attribute using the `map()` function:
  - **`Male`** to **`0`**
  - **`Female`** to **`1`**

In [None]:
# Map the 'sex' column to integer labels.

# Define the mapping; the keys include the leading space present in the data
dict1 = {' Male':0, ' Female':1}
def mapping(ser):
  return ser.map(dict1)
# Map the values of the column to convert the categorical values to integer
df['sex'] = mapping(df['sex'])
# Print the first five rows after mapping
df.head()

Unnamed: 0,age,workclass,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,<=50K
1,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,<=50K
2,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,<=50K
3,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,<=50K
4,37,Private,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,1,0,0,40,United-States,<=50K


**3.** Map the labels of the column `income-group` to convert it into a numerical attribute from categorical one using `map()` function:
  - **` <=50K`** to **`0`**
  - **` >50K`** to **`1`**
  

In [None]:
# Map the 'income-group' column to integer labels.

# Define the mapping; the keys include the leading space present in the data
dict2 = {' <=50K':0, ' >50K':1}

# Map the values of the column to convert the categorical values to integer
df['income-group'] = df['income-group'].map(dict2)
# Print the first five rows after mapping
df.head()

Unnamed: 0,age,workclass,education,education-years,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-group
0,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,0
1,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,0
2,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,0
3,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,0
4,37,Private,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,1,0,0,40,United-States,0


**4.** Create a list of categorical columns names using `select_dtypes()` function:

In [None]:
# Select the categorical columns using 'select_dtypes()'; their names are available as 'cat_dtypes.columns'.
cat_dtypes = df.select_dtypes(include = ['object'])
cat_dtypes.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,native-country
0,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,United-States
1,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,United-States
2,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,United-States
3,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Cuba
4,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,United-States


**5.** Perform **one-hot encoding** on the columns of the DataFrame in the list above and save it in a **dummy DataFrame**. Also use parameter `drop_first= True` in the `get_dummies()` function.

***Recall:***
*This process of obtaining numeric values from non-numeric categorical values is called **one-hot encoding**. In this process a column is added for each of the categories in a particular feature and value in the columns will be binary `0` and `1` based on the original value in the feature and the category column. The `get_dummies()` function can be used to apply **one-hot encoding** to the non-numeric categorical feature columns*.
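
The effect of `drop_first=True` is easiest to see on a toy column. The sketch below is illustrative only; the DataFrame is made up, and `dtype=int` is passed as an assumption so that the result holds `0`/`1` values on recent pandas versions, where the default output is boolean.

```python
import pandas as pd

toy_df = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Local-gov', 'Private']})

# One column per category.
full = pd.get_dummies(toy_df, dtype=int)

# drop_first=True removes the first category (alphabetically 'Local-gov'),
# which is then represented by all remaining columns being 0.
reduced = pd.get_dummies(toy_df, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column per feature loses no information, since the dropped category is implied when all other dummy columns of that feature are `0`.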

In [None]:
# Create a 'income_dummies_df' DataFrame using the 'get_dummies()' function on the non-numeric categorical columns
income_dummies_df = pd.get_dummies(cat_dtypes, drop_first = True)
income_dummies_df.head()

Unnamed: 0,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


 **6.** Drop the non-numeric categorical columns from the original income DataFrame using the `drop()` function:

In [None]:
# Drop the categorical columns from the income DataFrame 'df'
df.drop(list(cat_dtypes.columns), axis = 'columns', inplace = True)

In [None]:
df.head()

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group
0,50,13,0,0,0,13,0
1,38,9,0,0,0,40,0
2,53,7,0,0,0,40,0
3,28,13,1,0,0,40,0
4,37,14,1,0,0,40,0


**7.** Concatenate the income DataFrame and the dummy DataFrame to create the final DataFrame for the model.

Print the first five rows of the final DataFrame:

In [None]:
# Concat the income DataFrame and dummy DataFrame using 'concat()' function
income_df = pd.concat([df,income_dummies_df],axis=1)

In [None]:
income_df.head()

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,income-group,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,50,13,0,0,0,13,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,9,0,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
2,53,7,0,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,28,13,1,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,37,14,1,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [None]:
income_df.isnull().sum()

age                                0
education-years                    0
sex                                0
capital-gain                       0
capital-loss                       0
                                  ..
native-country_ Thailand           0
native-country_ Trinadad&Tobago    0
native-country_ United-States      0
native-country_ Vietnam            0
native-country_ Yugoslavia         0
Length: 96, dtype: int64

**8.** Get the information of the DataFrame to verify the final columns and their data types:

In [None]:
# Get the information of the DataFrame
income_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30161 entries, 0 to 32559
Data columns (total 96 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   age                                         30161 non-null  int64
 1   education-years                             30161 non-null  int64
 2   sex                                         30161 non-null  int64
 3   capital-gain                                30161 non-null  int64
 4   capital-loss                                30161 non-null  int64
 5   hours-per-week                              30161 non-null  int64
 6   income-group                                30161 non-null  int64
 7   workclass_ Local-gov                        30161 non-null  uint8
 8   workclass_ Private                          30161 non-null  uint8
 9   workclass_ Self-emp-inc                     30161 non-null  uint8
 10  workclass_ Self-emp-not-inc       

**Q:** How many columns are present in the final DataFrame?

**A:** 96

**Q:** What is the data type of the columns in the final DataFrame?

**A:** Integer (`int64` for the original numeric columns and `uint8` for the dummy columns)


**After this activity, the DataFrame should not have any non-numeric columns.**

---

#### Activity 4: Train-Test Split

We need to predict the value of the `income-group` variable, using other variables. Thus, `income-group` is the target or dependent variable and other columns except `income-group` are the features or the independent variables.

**1.** Split the dataset into the training set and test set such that the training set contains 70% of the instances and the remaining instances will become the test set.

**2.** Set `random_state = 42`:

In [None]:
# Split the training and testing data

# Import the module
from sklearn.model_selection import train_test_split

# Separate the feature columns from the target column
features = list(income_df.columns)
features.remove('income-group')
x = income_df[features]
y = income_df['income-group']

# Keep 70% of the instances for training and 30% for testing
X_train,X_test,y_train,y_test = train_test_split(x, y, test_size = 0.30, random_state = 42)

**After this activity, the feature and target data should be distributed into training and testing data.**
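
Because the two target labels are not equally frequent, one optional variation is a stratified split, which keeps the label ratio roughly equal in the training and test sets. This is a sketch on synthetic data and is not part of the steps required above; the `stratify` argument and the `_demo` names are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 3 features, 25% positive labels.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([1] * 25 + [0] * 75)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.30,     # 70% train, 30% test
    random_state=42,    # reproducible shuffle
    stratify=y_demo,    # keep the class ratio similar in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```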

In [None]:
print(X_train.shape)
print(y_train.shape)

(21112, 95)
(21112,)


In [None]:
print(X_test.shape)

(9049, 95)


In [None]:
X_test.head()

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
234,59,9,0,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
26878,41,9,0,0,0,40,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
19182,29,13,1,0,0,35,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
13485,41,10,0,0,0,45,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
9626,41,3,1,0,0,40,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
X_test.isnull().sum()

age                                0
education-years                    0
sex                                0
capital-gain                       0
capital-loss                       0
                                  ..
native-country_ Thailand           0
native-country_ Trinadad&Tobago    0
native-country_ United-States      0
native-country_ Vietnam            0
native-country_ Yugoslavia         0
Length: 95, dtype: int64

In [None]:
X_test.shape

(9049, 95)

In [None]:
X_train.isnull().sum()

age                                0
education-years                    0
sex                                0
capital-gain                       0
capital-loss                       0
                                  ..
native-country_ Thailand           0
native-country_ Trinadad&Tobago    0
native-country_ United-States      0
native-country_ Vietnam            0
native-country_ Yugoslavia         0
Length: 95, dtype: int64

---

#### Activity 5: Data Standardisation

To avoid the `ConvergenceWarning` message, scale the data using one of the normalisation methods, for instance, standard normalisation.

**1.** Create a function `standard_scalar()` to normalise the numeric columns of `X_train` and `X_test` data-frames using the standard normalisation method:


In [None]:
# Normalise the train and test data-frames using the standard normalisation method.

# Define the 'standard_scalar()' function for calculating Z-scores
def standard_scalar(ser):
  z = (ser-ser.mean())/ser.std()
  return z

# Create the DataFrames norm_X_train and norm_X_test by applying 'standard_scalar()' with the apply() function.
# Note: axis=1 passes each row to the function, so every record is scaled by its own mean and
# standard deviation; per-feature (column-wise) standardisation would use the default axis=0.
norm_X_train = X_train.apply(standard_scalar, axis=1)
norm_X_test = X_test.apply(standard_scalar, axis=1)


In [None]:
norm_X_train.head()

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
29252,6.209719,1.46845,-0.172759,-0.172759,-0.172759,7.121501,-0.172759,0.009598,-0.172759,-0.172759,...,-0.172759,-0.172759,-0.172759,-0.172759,-0.172759,-0.172759,-0.172759,0.009598,-0.172759,-0.172759
14267,6.338102,2.14377,0.046604,-0.186415,-0.186415,6.804139,-0.186415,0.046604,-0.186415,-0.186415,...,-0.186415,-0.186415,-0.186415,-0.186415,-0.186415,-0.186415,-0.186415,0.046604,-0.186415,-0.186415
26020,5.100078,1.586217,-0.035565,-0.170714,-0.170714,7.938197,-0.170714,-0.035565,-0.170714,-0.170714,...,-0.170714,-0.170714,-0.170714,-0.170714,-0.170714,-0.170714,-0.170714,-0.035565,-0.170714,-0.170714
24277,6.750286,1.71078,0.030944,-0.179035,-0.179035,6.540307,-0.179035,0.030944,-0.179035,-0.179035,...,-0.179035,-0.179035,-0.179035,-0.179035,-0.179035,-0.179035,-0.179035,0.030944,-0.179035,-0.179035
4225,6.341719,1.655667,0.025736,-0.178006,-0.178006,6.952943,-0.178006,0.025736,-0.178006,-0.178006,...,-0.178006,-0.178006,-0.178006,-0.178006,-0.178006,-0.178006,-0.178006,0.025736,-0.178006,-0.178006


In [None]:
norm_X_test.head()

Unnamed: 0,age,education-years,sex,capital-gain,capital-loss,hours-per-week,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
234,7.900525,1.066161,-0.164025,-0.164025,-0.164025,5.303467,-0.164025,-0.164025,-0.164025,-0.164025,...,-0.164025,-0.164025,-0.164025,-0.164025,-0.164025,-0.164025,-0.164025,-0.027337,-0.164025,-0.164025
26878,6.779069,1.354386,-0.171306,-0.171306,-0.171306,6.609547,-0.171306,-0.001784,-0.171306,-0.171306,...,-0.171306,-0.171306,-0.171306,-0.171306,-0.001784,-0.171306,-0.171306,-0.171306,-0.171306,-0.171306
19182,5.854799,2.522978,0.024112,-0.184127,-0.184127,7.104232,-0.184127,0.024112,-0.184127,-0.184127,...,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127,-0.184127
13485,6.361749,1.422295,-0.171078,-0.171078,-0.171078,6.999098,-0.171078,-0.011741,-0.171078,-0.171078,...,-0.171078,-0.171078,-0.171078,-0.171078,-0.171078,-0.171078,-0.171078,-0.011741,-0.171078,-0.171078
9626,6.85154,0.347712,0.005405,-0.165749,-0.165749,6.680387,-0.165749,0.005405,-0.165749,-0.165749,...,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749,-0.165749


In [None]:
norm_X_test.dropna(inplace=True)

In [None]:
norm_X_test.shape

(9049, 95)

**After this activity, training and testing feature data should be normalised using Data Standardisation.**
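
The cell above calls `apply()` with `axis=1`, which scales each row rather than each feature. A per-feature alternative is sketched below on toy data; it is not the code that produced the outputs in this notebook. The `_demo` names and values are assumptions, and the test data is scaled with the training mean and standard deviation so that both sets share one scale.

```python
import pandas as pd

train_demo = pd.DataFrame({'age': [20, 30, 40, 50], 'hours-per-week': [35, 40, 40, 45]})
test_demo = pd.DataFrame({'age': [25, 45], 'hours-per-week': [38, 50]})

# Column-wise statistics, computed on the training data only.
train_mean = train_demo.mean()
train_std = train_demo.std()

# Subtraction and division align on column names, so each feature
# uses its own mean and standard deviation.
norm_train_demo = (train_demo - train_mean) / train_std
norm_test_demo = (test_demo - train_mean) / train_std

print(norm_train_demo.describe().loc[['mean', 'std']])
```

A column that is constant in the training data has a standard deviation of `0` and would produce `NaN` values, so such columns need separate handling.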

---

#### Activity 6: Logistic Regression - Model Training

Implement Logistic Regression Classification using `sklearn` module to estimate the values of $\beta$ coefficients in the following way:

1. Deploy the model by importing the `LogisticRegression` class and create an object of this class.
2. Call the `fit()` function on the Logistic Regression object and print the score using the `score()` function.


In [None]:
# Deploy the 'LogisticRegression' model using the 'fit()' function.
from sklearn.linear_model import LogisticRegression
lg_reg = LogisticRegression()
lg_reg.fit(norm_X_train, y_train)
print(lg_reg.score(norm_X_train, y_train))

0.8193918150814703


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**After this activity, a multi-variate logistic regression should be trained with all features.**
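
The warning in the output above suggests either scaling the data or increasing `max_iter`. The sketch below shows both on synthetic data; it is an illustration under assumptions, not a replacement for the cell above. The `_demo` names, the synthetic features, and the choice of `max_iter=1000` are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the census data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 100 + X_demo[:, 2] / 1000 > 0).astype(int)

# Scale each feature, then fit with a higher iteration limit than the default of 100.
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_model.fit(X_demo, y_demo)

print(demo_model.score(X_demo, y_demo))
```

Wrapping the scaler and the classifier in one pipeline means the scaling learned during `fit()` is reused automatically by `predict()` and `score()`.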

---

#### Activity 7: Model Prediction and Evaluation

**1.** Predict the values for both training and test sets by calling the `predict()` function on the Logistic Regression object:

In [None]:
# Make predictions on the test dataset by using the 'predict()' function.
y_test_pred = lg_reg.predict(norm_X_test)
y_test_pred

array([0, 0, 0, ..., 1, 0, 1])

**2.** Display the confusion matrix:

In [None]:
# Display the results of confusion_matrix
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_test_pred))

[[6144  677]
 [ 857 1371]]


**Q:** What is the positive outcome out of both labels?

**A:** Label `1` (annual income of more than 50K) is the positive outcome.

**Q:** What are the counts of True Positives and True Negatives?

**A:** TP: 1371, TN: 6144
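To make the reading of the confusion matrix explicit, here is a small sketch that unpacks the four cells. It assumes the `sklearn` convention that rows are actual labels and columns are predicted labels, and it treats label `1` as the positive class. The matrix values are copied from the output above.

```python
# Confusion matrix printed above: rows = actual labels, columns = predicted labels.
cm = [[6144, 677],
      [857, 1371]]

# With label 1 treated as the positive class:
tn, fp = cm[0]  # actual 0: predicted 0 (TN), predicted 1 (FP)
fn, tp = cm[1]  # actual 1: predicted 0 (FN), predicted 1 (TP)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("actual label 0 count:", tn + fp)
print("actual label 1 count:", fn + tp)
```

The two row sums match the `support` column of the classification report (6821 and 2228).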


**3.** Print the classification report values to evaluate the accuracy of your model:

In [None]:
# Display the results of classification_report
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.88      0.90      0.89      6821
           1       0.67      0.62      0.64      2228

    accuracy                           0.83      9049
   macro avg       0.77      0.76      0.77      9049
weighted avg       0.83      0.83      0.83      9049



**Q:** What are the f1-scores of both labels?

**A:** 0: 0.89, 1: 0.64

**After this activity, a multi-variate logistic regression model is used to predict and evaluate using all the features.**
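The values in the classification report can be reproduced by hand from the confusion matrix counts. The sketch below uses a hypothetical helper, `precision_recall_f1()`, written only for this illustration; the counts are those printed earlier in this activity.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the confusion matrix, with label 1 as the positive class.
tn, fp, fn, tp = 6144, 677, 857, 1371

# Label 1: positives are the actual-1 rows.
label1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# Label 0: the roles swap, so TN acts as the "true positive" count.
label0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("label 0 (precision, recall, f1):", [round(v, 2) for v in label0])
print("label 1 (precision, recall, f1):", [round(v, 2) for v in label1])
print("accuracy:", round(accuracy, 2))
```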

----

#### Activity 8: Features Selection Using RFE

Using RFE, select the features that contribute the most to classifying individuals into income groups.

**Steps:**

**1.** Create an empty dictionary and store it in a variable.

**2.** Create a `for` loop that iterates through all the columns in the normalised training data-frame.
Inside the loop:


   - Create an object of the Logistic Regression class and store it in a variable.
   
   - Create an object of RFE class and store it in a variable. Inside the RFE class constructor, pass the object of logistic regression and the number of features to be selected by RFE as inputs.
   
   - Call the `fit()` function of the `RFE` object on the training set so that RFE keeps `i` features, where `i` goes from `1` to the number of columns in the training dataset.
   
   - Create a list to store the important features using the `support_` attribute.
   
   - Create a new DataFrame containing only the features selected by RFE and store it in a variable.
   
   - Create another Logistic Regression object, store it in a variable, and train it on the new training DataFrame (containing only the RFE-selected features) and the target series.
   
   - Predict the target values for the normalised test set (containing the feature(s) selected by RFE) by calling the `predict()` function on the recent model object.
   
   - Calculate the f1-scores using the `f1_score()` function of the `sklearn.metrics` module, which (with `average = None`) returns a NumPy array containing the f1-scores for both classes. Store the array in a variable called `f1_scores_array`.

    The syntax for `f1_score()` is given as:

      >**Syntax:** `f1_score(y_true, y_pred, average = None)`

        Where,
      
        **a.** `y_true`:  the actual labels

        **b.** `y_pred`: the predicted labels
  
        **c.** `average = None`: returns the score for each class instead of a single averaged value.

  - Add the number of selected features and the corresponding features & f1-scores as key-value pairs in the dictionary.
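As a small illustration of the `average = None` parameter described above, the sketch below compares it with the default call on a toy pair of label lists. The lists `y_true_demo` and `y_pred_demo` are made up for this example and are not project data.

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical, for illustration only).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# average=None returns one f1-score per class, ordered by label (0, then 1).
per_class = f1_score(y_true_demo, y_pred_demo, average=None)

# The default (average='binary') returns a single score for label 1 only.
binary = f1_score(y_true_demo, y_pred_demo)

print("per-class f1:", per_class)
print("default f1:  ", binary)
```

With the default call, only the label-1 score is reported, so it equals the second element of the per-class array.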


**Note:**   
As the number of features is very high, this code is computationally heavy. It takes some time to fit RFE on the training data for every feature count and then make predictions on the test data. A GPU runtime is suggested below, although scikit-learn estimators run on the CPU, so the speed-up may be limited.

To turn on the **GPU** in Google Colab, follow the steps below:
1. Click on the **Edit** menu option on the top-left.
2. Click on the **Notebook settings** option from the menu. A pop-up will appear.
3. Click on the drop-down for selecting **Hardware accelerator**.
4. Select **GPU** from the drop-down options.
5. Click on **Save**.
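Because the loop in this activity refits RFE once per feature count, most of the run time goes into repeating the same eliminations. One possible shortcut, sketched below on synthetic data, is to fit RFE a single time with `n_features_to_select=1` and read the elimination order from its `ranking_` attribute. This assumes the default `step=1` and a deterministic estimator; the names `X_rank`, `y_rank` and `top_k_mask()` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the normalised training data (assumption).
X_rank, y_rank = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# One RFE fit down to a single feature records the full elimination order:
# rank 1 is the last survivor, rank 2 was eliminated last, and so on.
rfe_full = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe_full.fit(X_rank, y_rank)

def top_k_mask(k):
    """Boolean mask of the k features that RFE would keep."""
    return rfe_full.ranking_ <= k

# Check against a separate RFE fit that stops at 3 features.
rfe_3 = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_3.fit(X_rank, y_rank)
same_selection = np.array_equal(top_k_mask(3), rfe_3.support_)

print("ranking:", rfe_full.ranking_)
print("same 3 features as a separate fit:", same_selection)
```

With such a mask, only the final logistic regression needs to be refitted for each `k`, rather than the whole RFE procedure.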

In [None]:
# Create a dictionary containing the different combination of features selected by RFE and their corresponding f1-scores.

# Import the libraries
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
# Create the empty dictionary.
d = {}
# Create a 'for' loop.
for i in range(1,len(list(norm_X_train.columns))):

  # Create the Logistic Regression Model
  log_reg2 = LogisticRegression()

  # Create the RFE model with 'i' number of features
  rfe1 = RFE(log_reg2, n_features_to_select = i)

  # Train the rfe model on the normalised training data using 'fit()'
  rfe1.fit(norm_X_train, y_train)

  # Create a list of important features chosen by RFE.
  rfe_features = norm_X_train.columns[rfe1.support_]

  # Create the normalised training DataFrame with rfe features
  rfe_X_train = norm_X_train[rfe_features]

  # Create the logistic regression
  log_reg3 = LogisticRegression()

  # Train the model normalised training DataFrame with rfe features using 'fit()'
  log_reg3.fit(rfe_X_train, y_train)

  # Predict 'y' values only for the test set as generally, they are predicted quite accurately for the train set.
  y_test_pred = log_reg3.predict(norm_X_test[rfe_features])

  # Calculate the f1-score
  f1_values = f1_score(y_test, y_test_pred, average=None)

  # Add the name of features and f1-scores in the dictionary
  d[i] = {'rfe_features':rfe_features,'f1_score':f1_values}



**6.** Print the dictionary with features and f1-scores.

In [None]:
# Print the dictionary
print(d)

{1: {'rfe_features': Index(['education_ Prof-school'], dtype='object'), 'f1_score': array([0.86067758, 0.05993151])}, 2: {'rfe_features': Index(['education_ Prof-school', 'relationship_ Wife'], dtype='object'), 'f1_score': array([0.86899973, 0.3859208 ])}, 3: {'rfe_features': Index(['sex', 'education_ Prof-school', 'relationship_ Wife'], dtype='object'), 'f1_score': array([0.87462928, 0.42979767])}, 4: {'rfe_features': Index(['sex', 'education_ Prof-school', 'marital-status_ Never-married',
       'relationship_ Wife'],
      dtype='object'), 'f1_score': array([0.8603515 , 0.05819427])}, 5: {'rfe_features': Index(['sex', 'education_ Masters', 'education_ Prof-school',
       'marital-status_ Never-married', 'relationship_ Wife'],
      dtype='object'), 'f1_score': array([0.88323271, 0.51675185])}, 6: {'rfe_features': Index(['sex', 'education_ Doctorate', 'education_ Masters',
       'education_ Prof-school', 'marital-status_ Never-married',
       'relationship_ Wife'],
      dtype='ob

**7.** Convert the dictionary into a DataFrame by using the `from_dict()` function of the `DataFrame` class.

**Note:** Set `pd.options.display.max_colwidth` to `200`.

In [None]:
# Convert the dictionary to the DataFrame
pd.options.display.max_colwidth = 200
df2 = pd.DataFrame.from_dict(d)
df2

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,85,86,87,88,89,90,91,92,93,94
rfe_features,"Index(['education_ Prof-school'], dtype='object')","Index(['education_ Prof-school', 'relationship_ Wife'], dtype='object')","Index(['sex', 'education_ Prof-school', 'relationship_ Wife'], dtype='object')","Index(['sex', 'education_ Prof-school', 'marital-status_ Never-married',  'relationship_ Wife'],  dtype='object')","Index(['sex', 'education_ Masters', 'education_ Prof-school',  'marital-status_ Never-married', 'relationship_ Wife'],  dtype='object')","Index(['sex', 'education_ Doctorate', 'education_ Masters',  'education_ Prof-school', 'marital-status_ Never-married',  'relationship_ Wife'],  dtype='object')","Index(['sex', 'education_ Bachelors', 'education_ Doctorate',  'education_ Masters', 'education_ Prof-school',  'marital-status_ Never-married', 'relationship_ Wife'],  dtype='obj...","Index(['sex', 'education_ Bachelors', 'education_ Doctorate',  'education_ Masters', 'education_ Prof-school',  'marital-status_ Never-married', 'relationship_ Own-child',  'rela...","Index(['sex', 'education_ Bachelors', 'education_ Doctorate',  'education_ Masters', 'education_ Prof-school',  'marital-status_ Never-married', 'relationship_ Own-child',  'rela...","Index(['sex', 'education_ Bachelors', 'education_ Doctorate',  'education_ Masters', 'education_ Prof-school',  'marital-status_ Never-married', 'relationship_ Not-in-family',  '...",...,"Index(['sex', 'capital-gain', 'capital-loss', 'workclass_ Local-gov',  'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workclass_ State-gov',  '...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'workclass_ Local-gov',  'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workclass_ State-gov',  ...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 
'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',  'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc',  'workclass_ Self-emp-not-inc', 'workcl...","Index(['age', 'education-years', 'sex', 'capital-gain', 'capital-loss',  'hours-per-week', 'workclass_ Local-gov', 'workclass_ Private',  'workclass_ Self-emp-inc', 'workclass_ Self-em...","Index(['age', 'education-years', 'sex', 'capital-gain', 'capital-loss',  'hours-per-week', 'workclass_ Local-gov', 'workclass_ Private',  'workclass_ Self-emp-inc', 'workclass_ Self-em..."
f1_score,"[0.8606775789874381, 0.059931506849315065]","[0.8689997318315903, 0.38592080452545574]","[0.874629280129415, 0.4297976701410178]","[0.860351500539306, 0.05819426615318785]","[0.8832327113062569, 0.5167518455423056]","[0.8750609713608808, 0.5214838537496664]","[0.8714509915822515, 0.5583333333333333]","[0.8754673569168824, 0.5866348448687351]","[0.8761477839635602, 0.5985469885165221]","[0.8804922186029677, 0.6145225309362596]",...,"[0.8894664645151974, 0.6395102425241348]","[0.8900750144258511, 0.6400566839867737]","[0.8892576293196739, 0.637715364644796]","[0.8894820372240658, 0.6383380547686497]","[0.8888407680092393, 0.6371347785108388]","[0.8878020713463752, 0.6280400572246067]","[0.8883905967450272, 0.6388953896559794]","[0.8886317616430431, 0.6393442622950819]","[0.8889370619264398, 0.6391171636534397]","[0.8892583489952292, 0.6407129455909943]"
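The DataFrame above has one column per feature count, which makes it hard to scan. As an alternative sketch, passing `orient='index'` to `from_dict()` gives one row per feature count instead. The dictionary `d_demo` below is a hand-copied excerpt of the first three entries with f1-scores rounded to four decimals, and the column name `f1_label_1` is an assumption added for illustration.

```python
import pandas as pd

# Excerpt of the results dictionary (first three entries, scores rounded).
d_demo = {
    1: {'rfe_features': ['education_ Prof-school'],
        'f1_score': [0.8607, 0.0599]},
    2: {'rfe_features': ['education_ Prof-school', 'relationship_ Wife'],
        'f1_score': [0.8690, 0.3859]},
    3: {'rfe_features': ['sex', 'education_ Prof-school', 'relationship_ Wife'],
        'f1_score': [0.8746, 0.4298]},
}

# orient='index' turns each outer key into a row instead of a column.
df_rows = pd.DataFrame.from_dict(d_demo, orient='index')
df_rows.index.name = 'n_features'

# Pull the label-1 score into its own numeric column for sorting or plotting.
df_rows['f1_label_1'] = df_rows['f1_score'].apply(lambda scores: scores[1])

print(df_rows[['f1_score', 'f1_label_1']])
```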


**Q:** How many features are required for the best f1-scores and why?

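**A:** 3 features are used in the next activity, but the scores above do not level off there: the label-1 f1-score is about 0.43 with 3 features, 0.61 with 10 features, and roughly 0.64 from 85 features onwards. Around 10 features are needed before the remaining variations become small.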
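One way to answer this question from the numbers rather than by eye is sketched below. It uses only the label-1 f1-scores that are visible in the table above (feature counts 1 to 10 and 85 to 94, rounded to four decimals; the elided middle columns are not included). The tolerance of `0.03` is an arbitrary assumption.

```python
# Label-1 f1-scores copied from the visible columns of the table above.
f1_label_1 = {
    1: 0.0599, 2: 0.3859, 3: 0.4298, 4: 0.0582, 5: 0.5168,
    6: 0.5215, 7: 0.5583, 8: 0.5866, 9: 0.5985, 10: 0.6145,
    85: 0.6395, 86: 0.6401, 87: 0.6377, 88: 0.6383, 89: 0.6371,
    90: 0.6280, 91: 0.6389, 92: 0.6393, 93: 0.6391, 94: 0.6407,
}

# Feature count with the highest label-1 f1-score among the visible entries.
best_k = max(f1_label_1, key=f1_label_1.get)
best_score = f1_label_1[best_k]

# Smallest feature count whose score is within the tolerance of the best.
tolerance = 0.03
smallest_good_k = min(
    k for k, score in f1_label_1.items() if score >= best_score - tolerance
)

print("best feature count:", best_k, "score:", best_score)
print("smallest count within tolerance:", smallest_good_k)
```

The second figure trades a small loss in f1-score for a much smaller model; a different tolerance would give a different cut-off.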

**After this activity, RFE should have been used to find the ideal features for logistic regression.**

---

#### Activity 9: Model Training and Prediction Using Ideal Features

**1.** Create the logistic regression model again using RFE with the ideal number of features and predict the target variable:


In [None]:
# Logistic Regression with the ideal number of features and predict the target.

# Create the Logistic Regression Model
log_reg4 = LogisticRegression()

# Create the RFE model with ideal number of features
rfe1 = RFE(log_reg4, n_features_to_select = 3)
# Train the rfe model on the normalised training data
rfe1.fit(norm_X_train, y_train)
# Create a list of important features chosen by RFE.
rfe_features = norm_X_train.columns[rfe1.support_]
# Create the normalised training DataFrame with rfe features
X_train_rfe = norm_X_train[rfe_features]
# Create the Regression Model again
log_reg5 = LogisticRegression()
# Train the model with the normalised training features DataFrame with best rfe features and target training DataFrame
log_reg5.fit(X_train_rfe, y_train)
# Predict the target using the normalised test DataFrame with rfe features
y_test_pred = log_reg5.predict(norm_X_test[rfe_features])
# Calculate the final f1-score and print it
fin_f1_score = f1_score(y_test,y_test_pred)
print(fin_f1_score)


0.4297976701410178


**Hint:** Create the model using the same steps mentioned in the **Features Selection Activity.**  

**Q:** What is the final f1-score?

**A:** Approximately 0.43 (the f1-score for label `1`)


---

**Write your interpretation of the results here.**

- Interpretation 1:
- Interpretation 2:
- Interpretation 3:

**After this activity, a Logistic Regression model should be ready with the ideal number of features to accurately predict the income group of an individual, that is, whether their annual income is `less than or equal to 50K (label 0)` or `more than 50K (label 1)`, based on the selected features.**

---

### Submitting the Project

Follow the steps described below to submit the project.

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>


3. The link of the duplicate copy (named as **YYYY-MM-DD_StudentName_CapstoneProject17**) of the notebook will get copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_CapstoneProject17** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>


---