
jieyima/US_Census_Income_Classification

About

This is an individual project for UC Davis BAX452 Machine Learning.

The objective of the project is to predict whether a person makes over 50K a year given their demographic attributes. To achieve this, several classification techniques are explored. In the end, the random forest model yields the best prediction result.

How to navigate this notebook

Below is the table of contents:

Visualization Excerpts

1. Gini Index for US:

The Gini index is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation's residents. It is the most commonly used measure of inequality.

Source: Gini coefficient
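
As a rough illustration of the definition above, the Gini coefficient of a small list of incomes can be computed with the standard rank-weighted formula. This is a minimal sketch, not code from the notebook: the function name `gini_coefficient` and the sample incomes are assumptions made for the example.

```python
def gini_coefficient(incomes):
    """Return the Gini coefficient of a list of non-negative incomes.

    0 means everyone has the same income; values near 1 mean that
    almost all income is held by one person.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted formula on sorted data:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical incomes, made up for illustration only.
equal_incomes = [40_000, 40_000, 40_000, 40_000]
unequal_incomes = [0, 0, 0, 160_000]

print(gini_coefficient(equal_incomes))
print(gini_coefficient(unequal_incomes))
```

With a finite sample of size n, the maximum value is (n - 1) / n rather than exactly 1, which is why the second list does not reach 1.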

2. Violin Plot

This violin plot, drawn with matplotlib, shows how salary varies across occupations while controlling for age.

3. Bivariate Analysis

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.

Source: Bivariate Analysis
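
One simple way to quantify the empirical relationship between two numeric variables is the Pearson correlation coefficient. The sketch below is an illustration only; the helper name `pearson_r` and the sample values are assumptions and are not taken from the notebook.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient between two sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


# Made-up values standing in for two continuous attributes (X, Y).
education_num = [9, 10, 13, 14, 16]
hours_per_week = [35, 40, 40, 45, 50]

r = pearson_r(education_num, hours_per_week)
print(f"Pearson r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear relationship, although a non-linear one may still exist.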

4. Principal Component Analysis

A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Source: PCA
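
To make the definition concrete, here is a minimal pure-Python sketch of PCA for just two variables, where the eigen-decomposition of the 2x2 covariance matrix has a closed form. It is an illustration under stated assumptions, not the notebook's implementation: the function name `pca_2d` and the sample points are made up for the example.

```python
from math import hypot


def pca_2d(points):
    """PCA for a list of (x, y) pairs.

    Returns (eigenvalues, (pc1, pc2), scores), where scores are the
    centered points expressed in the principal-component basis.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = hypot((sxx - syy) / 2, sxy)
    eigenvalues = (half_trace + radius, half_trace - radius)

    # Eigenvector for the largest eigenvalue (first principal component).
    if sxy != 0:
        vx, vy = sxy, eigenvalues[0] - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    pc2 = (-pc1[1], pc1[0])  # orthogonal to pc1

    # Orthogonal transformation: project centered points onto pc1 and pc2.
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    return eigenvalues, (pc1, pc2), scores


# Made-up (age, hours-per-week) pairs, for illustration only.
sample = [(25, 38), (32, 40), (47, 45), (51, 50), (38, 40), (29, 35)]
eigenvalues, components, scores = pca_2d(sample)
print("variance along each component:", eigenvalues)
```

The two columns of `scores` are linearly uncorrelated, and the variance of the first column equals the largest eigenvalue. For more than two variables a numerical linear algebra library would normally be used instead.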

Data dictionary

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
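
The extraction condition above can be read as a simple record filter. The sketch below expresses it in Python; the function name `is_clean_record`, the dictionary layout, and the sample records are assumptions made for illustration, since the original extraction code is not part of this repository.

```python
def is_clean_record(record):
    """Apply the condition (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)."""
    return (
        record["AAGE"] > 16        # age
        and record["AGI"] > 100    # adjusted gross income
        and record["AFNLWGT"] > 1  # final weight
        and record["HRSWK"] > 0    # hours worked per week
    )


# Hypothetical raw records, made up for illustration only.
raw_records = [
    {"AAGE": 39, "AGI": 42_000, "AFNLWGT": 77_516, "HRSWK": 40},  # kept
    {"AAGE": 15, "AGI": 5_000, "AFNLWGT": 83_311, "HRSWK": 10},   # too young
    {"AAGE": 52, "AGI": 60_000, "AFNLWGT": 209_642, "HRSWK": 0},  # no hours
]

clean_records = [r for r in raw_records if is_clean_record(r)]
print(len(clean_records), "of", len(raw_records), "records kept")
```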

Target

  1. Predclass: >50K, <=50K.
  • Categorical; the income level is either above $50K or at most $50K

Categorical Attributes

  1. workclass: (categorical) Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • Individual work category
  2. education: (categorical) Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • Individual's highest education degree
  3. marital-status: (categorical) Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • Individual marital status
  4. occupation: (categorical) Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • Individual's occupation
  5. relationship: (categorical) Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • Individual's relation in a family
  6. race: (categorical) White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • Race of Individual
  7. sex: (categorical) Female, Male.
  8. native-country: (categorical) United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • Individual's native country
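
Most classifiers need categorical attributes like these converted to numbers. One common option is one-hot encoding, sketched below for the `sex` and `race` attributes. This is an assumption about a reasonable preprocessing step, not a description of what the notebook does; the helper name `one_hot` and the sample row are hypothetical.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Category lists copied from the data dictionary above.
SEX = ["Female", "Male"]
RACE = ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"]

# Hypothetical record, made up for illustration only.
row = {"sex": "Male", "race": "Black"}

encoded = one_hot(row["sex"], SEX) + one_hot(row["race"], RACE)
print(encoded)
```

Attributes with many levels, such as native-country, produce wide and sparse vectors under this scheme, so grouping rare levels first may be worth considering.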

Continuous Attributes

  1. age: continuous.
  • Age of an individual
  2. education-num: number of years of education, continuous.
  • Number of years of education the individual received
  3. fnlwgt: final weight, continuous.
  • The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These are prepared monthly for us by Population Division here at the Census Bureau.
  4. capital-gain: continuous.
  5. capital-loss: continuous.
  6. hours-per-week: continuous.
  • Number of hours the individual works per week