# Data Information
Source - https://www.kaggle.com/datasets/wenruliu/adult-income-dataset


## Author: Matt S.

## Project Overview:

Description:
Analysis and prediction of an individual's income based on several features, including but not limited to sex, race, marital status, and location.

- The target will be `income`, a two-class label (`>50K` or `<=50K`). The remaining features will be used to predict which class a person falls into.

- A row represents a single person and the factors that may influence that person's income.

Column information:

- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- class: >50K, <=50K.

Note: the loaded CSV names some of these columns differently: `education-num` appears as `educational-num`, `sex` as `gender`, and `class` as `income`.

# Data Load & Inspection

In [1]:
# imports here
import pandas as pd

In [2]:
# Read the CSV from a Google Drive path (a file path, not a URL)
path = "/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week06/project2/adult.csv"
df = pd.read_csv(path)

# Check df
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [3]:
# Check shape
df.shape

(48842, 15)

In [5]:
# Get overview of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


- There are 48,842 rows
- 6 int columns
- 9 object columns
- No null values, but missing data may still be encoded another way: row 4 of the `df.head()` output shows `?` in both `workclass` and `occupation`.
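
Since `df.info()` only counts true nulls, string placeholders need a separate check. The following is a minimal sketch, assuming `?` is the only placeholder in use (an assumption based on the `df.head()` output, not something verified against the full file). It runs on a small hand-built sample copied from the rows shown above; `sample`, `placeholder_counts`, and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small sample copied from the df.head() output above (subset of columns)
sample = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "Protective-serv",
                   "Machine-op-inspct", "?"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Count cells equal to the string "?" in each column
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Replace the placeholder with NaN so that isna() reports it
cleaned = sample.replace("?", np.nan)
print(cleaned.isna().sum())
```

On the full dataset the same two steps would apply to `df` in place of `sample`.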


In [13]:
# See what type of values are within the target
df['income'].value_counts()

<=50K    37155
>50K     11687
Name: income, dtype: int64

In [14]:
df['fnlwgt'].head()

0    226802
1     89814
2    336951
3    160323
4    103497
Name: fnlwgt, dtype: int64

Because `income` takes only two distinct values (`<=50K` and `>50K`), this is a binary classification problem, so a classification model will need to be used.
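
The class counts above also show how balanced the two classes are. A minimal sketch, using the counts copied from the `value_counts()` output; the 0/1 mapping in `income_map` is a hypothetical encoding chosen for illustration, not one taken from the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"<=50K": 37155, ">50K": 11687}, name="income")

# Share of each class; value_counts(normalize=True) gives the same figures
shares = counts / counts.sum()
print(shares.round(3))

# Hypothetical 0/1 encoding of the target for a binary classifier
income_map = {"<=50K": 0, ">50K": 1}
labels = pd.Series(["<=50K", "<=50K", ">50K", ">50K", "<=50K"]).map(income_map)
print(labels.tolist())
```

Roughly 76% of rows fall in the `<=50K` class, so accuracy alone may be a misleading metric for whichever model is chosen.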

In [7]:
# Count null values in each row (axis=1); the display shows only the first and last rows
df.isna().sum(axis=1)

0        0
1        0
2        0
3        0
4        0
        ..
48837    0
48838    0
48839    0
48840    0
48841    0
Length: 48842, dtype: int64
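
Per-row counts are truncated in the display, so they cannot confirm the absence of nulls on their own. A per-column count or a single boolean may be easier to read. The sketch below uses a made-up stand-in frame (`demo`) that deliberately contains nulls so the difference between the two axes is visible; it is not data from the dataset.

```python
import pandas as pd

# Made-up stand-in frame; in the notebook this would be the loaded df
demo = pd.DataFrame({
    "age": [25, 38, None],
    "workclass": ["Private", None, "Local-gov"],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Nulls per column (default axis=0): one number per column
per_column = demo.isna().sum()
print(per_column)

# Nulls per row (axis=1), as in the cell above: one number per row
per_row = demo.isna().sum(axis=1)
print(per_row.tolist())

# Single yes/no answer for the whole frame
has_nulls = bool(demo.isna().any().any())
print(has_nulls)
```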

All but one of the columns are self-explanatory, both in the dtype they should have and in what they mean.

`fnlwgt` is the exception: it is an integer with no obvious meaning. It may hurt the final model if it turns out not to be a useful predictor.

Research online shows that `fnlwgt` stands for 'final weight':
- "The continuous variable fnlwgt represents final weight, which is the number of units in the target population that the responding unit represents."
Citation - https://cseweb.ucsd.edu//classes/sp15/cse190-c/reports/sp15/048.pdf
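
To illustrate what that definition implies, the sketch below compares an unweighted share of `>50K` rows with one where each row counts as `fnlwgt` people. It uses only the five rows shown in the `df.head()` output, so the numbers are illustrative and say nothing about the full dataset. Treating `fnlwgt` as a weight here is one possible reading of the quoted definition, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# First five rows of fnlwgt and income from the df.head() output above
sample = pd.DataFrame({
    "fnlwgt": [226802, 89814, 336951, 160323, 103497],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# 1 where the row is in the >50K class, 0 otherwise
high_income = (sample["income"] == ">50K").astype(int)

# Unweighted share: every row counts once
unweighted = high_income.mean()

# Weighted share: every row counts as fnlwgt people in the population
weighted = np.average(high_income, weights=sample["fnlwgt"])

print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

If the column is treated purely as a survey weight, it describes how a row was sampled rather than the person, which is one argument for leaving it out of the feature set.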