# <center>Part I - Exploratory Analysis of Adult Income Dataset</center>
### <center>by</center>

## <center>Chukwudi Collins Ozoede</center>

## Introduction
> The US Adult Income dataset is a collection of 48,842 entries extracted from the 1994 US Census database.
> The file analysed here, `adult.data.csv`, holds 32,561 of those entries and has 15 columns before cleaning. The columns and their meanings are listed below.

### The Dataset
- **age:** the age of an individual.
- **workclass:** a general term for the employment status of an individual.
- **fnlwgt:** final weight, i.e. the number of people the census believes the entry represents.
- **education:** the highest level of education achieved by an individual.
- **education-num:** the highest level of education achieved, in numerical form.
- **marital-status:** the marital status of an individual.
- **occupation:** the general type of occupation of an individual.
- **relationship:** the individual's relationship to others, such as Husband, Wife or Own-child.
- **race:** a description of an individual’s race.
- **sex:** the biological sex of the individual.
- **capital-gain:** capital gains for an individual.
- **capital-loss:** capital losses for an individual.
- **hours-per-week:** the number of hours an individual reports working per week.
- **native-country:** the country of origin of an individual.
- **salary:** whether or not an individual makes more than $50,000 annually.


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline


In [2]:
adult_df = pd.read_csv('adult.data.csv')
adult_df.sample(50)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
21815,22,Private,99199,Some-college,10,Never-married,Handlers-cleaners,Own-child,White,Male,0,0,20,United-States,<=50K
18523,26,Private,201481,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,40,United-States,<=50K
28134,32,Private,260868,Bachelors,13,Married-civ-spouse,Sales,Husband,Black,Male,0,0,40,United-States,>50K
25818,45,Local-gov,235431,HS-grad,9,Separated,Other-service,Unmarried,Black,Female,0,0,40,United-States,<=50K
10067,56,Private,146660,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,60,United-States,<=50K
15110,29,Private,53642,Assoc-voc,11,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K
15112,44,Federal-gov,102238,HS-grad,9,Divorced,Craft-repair,Unmarried,White,Male,0,0,40,United-States,<=50K
22632,35,Private,165930,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
11075,52,Local-gov,40641,10th,6,Married-civ-spouse,Craft-repair,Husband,White,Male,5013,0,40,United-States,<=50K
7760,20,Federal-gov,147352,HS-grad,9,Never-married,Other-service,Not-in-family,White,Female,0,0,40,United-States,<=50K


In [3]:
# overview of the data shape and composition
print(adult_df.shape)
print(adult_df.dtypes)
adult_df.info()  # info() prints its report itself and returns None, so no print() is needed

(32561, 15)
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
salary            object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   s

In [4]:
# making a copy of the dataframe before cleaning and analysis
df = adult_df.copy()

### Data cleaning
- Unnecessary columns have to be removed from the dataset.
- I discovered that some cells have '?' as their value; these should all be changed to NaN.
- I need to create a new column, `age_group`, to store age groups.
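Before cleaning, it can help to see how many `'?'` placeholders each column holds. The following is a minimal sketch on a small made-up frame named `demo` (a stand-in for `adult_df`; the values are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Tiny made-up frame standing in for adult_df (illustrative values only)
demo = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov', '?'],
    'occupation': ['Sales', '?', 'Adm-clerical', 'Craft-repair'],
    'age': [25, 38, 52, 41],
})

# Element-wise comparison gives a boolean frame; summing counts the True cells per column
placeholder_counts = (demo == '?').sum()
print(placeholder_counts)
```

Columns with a non-zero count are the ones the `'?'` replacement will actually change.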

#### Issue 1:
Removing unnecessary columns from the dataframe

In [5]:
# subsetting the columns needed for this analysis
df = df[['age','workclass','education','marital-status', 'occupation', 'relationship', 
         'race', 'sex', 'hours-per-week', 'native-country','salary']]
df.head(1)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,salary
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
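The same result could also be reached by naming the columns to discard rather than the ones to keep. A minimal sketch, assuming a small made-up frame `demo` with only a few of the original columns (the values are illustrative):

```python
import pandas as pd

# Made-up two-row frame with a handful of the original column names
demo = pd.DataFrame({
    'age': [39, 50],
    'fnlwgt': [77516, 83311],
    'education': ['Bachelors', 'Bachelors'],
    'education-num': [13, 13],
    'capital-gain': [2174, 0],
    'capital-loss': [0, 0],
    'salary': ['<=50K', '<=50K'],
})

# The four columns that the subsetting step leaves out
unused = ['fnlwgt', 'education-num', 'capital-gain', 'capital-loss']

# drop() returns a new frame; demo itself is left unchanged
trimmed = demo.drop(columns=unused)
print(list(trimmed.columns))
```

Listing the kept columns, as done in the cell above, makes the final column order explicit; `drop` is shorter when only a few columns are removed.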


#### Issue 2: 
Changing '?' values to NaN

In [6]:
#solution:
df = df.replace('?', np.nan)
df.head(5)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,salary
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K
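One way to check the effect of the replacement is to count missing values per column afterwards. The sketch below uses a made-up frame `demo` and also covers a possible edge case: if the raw file stored the placeholder with surrounding spaces (an assumption, not something the output above shows), an exact match on `'?'` would miss it, while a regex-based replacement would not.

```python
import numpy as np
import pandas as pd

# Made-up frame; the ' ?' entry simulates a placeholder padded with a space
demo = pd.DataFrame({
    'workclass': ['Private', '?', ' ?', 'State-gov'],
    'native-country': ['Cuba', 'United-States', '?', 'United-States'],
})

# Exact-match replacement, as in the cell above: ' ?' is not matched
exact = demo.replace('?', np.nan)

# Regex replacement that tolerates whitespace around the question mark
padded = demo.replace(r'^\s*\?\s*$', np.nan, regex=True)

# Missing-value counts per column for each variant
missing_exact = exact.isna().sum()
missing_padded = padded.isna().sum()
print(missing_exact)
print(missing_padded)
```

If both counts agree on the real data, the simpler exact-match replacement is sufficient.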


#### Issue 3:
Create new column `age_group` 

In [7]:
# create age groups to avoid noisy, cluttered visuals
age_group = []
for age in df['age']:
    # each elif is only reached when the previous upper bound was exceeded
    if age <= 20:
        age_group.append('17-20')
    elif age < 31:
        age_group.append('21-30')
    elif age < 41:
        age_group.append('31-40')
    elif age < 51:
        age_group.append('41-50')
    elif age < 61:
        age_group.append('51-60')
    elif age < 71:
        age_group.append('61-70')
    elif age < 81:
        age_group.append('71-80')
    else:
        age_group.append('above 80')
df['age_group'] = age_group


# this method can also be used
"""
conditions = [
    (df['age']<=20),
    (df['age']>20) & (df['age']<31),
    (df['age']>30) & (df['age']<41),
    (df['age']>40) & (df['age']<51),
    (df['age']>50) & (df['age']<61),
    (df['age']>60) & (df['age']<71),
    (df['age']>70) & (df['age']<81),
    (df['age']>80),
    ]

# create a list of the values we want to assign for each condition
values = ['17-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', 'above 80']

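# create a new column and use np.select to assign values to it using our lists as arguments
# a string default keeps choices and default on one dtype (recent NumPy versions may reject the integer default 0)
df['age_group'] = np.select(conditions, values, default='unknown')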



# display updated DataFrame
df.head()
"""

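A third option for the same binning is `pd.cut`, which assigns each age to a labelled interval in one call. This is a minimal sketch on a made-up `demo` frame, not the method used in this notebook; the lower edge of 0 and the open upper edge are assumptions chosen so that the bins reproduce the groups above for integer ages.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the group boundaries
demo = pd.DataFrame({'age': [17, 20, 21, 30, 31, 45, 60, 61, 80, 81, 90]})

# Right-closed intervals: (0, 20], (20, 30], ..., (70, 80], (80, inf)
bins = [0, 20, 30, 40, 50, 60, 70, 80, float('inf')]
labels = ['17-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', 'above 80']

# pd.cut returns an ordered categorical, so the groups keep their natural order
demo['age_group'] = pd.cut(demo['age'], bins=bins, labels=labels)
print(demo)
```

Because the result is an ordered categorical, plots and `value_counts(sort=False)` list the groups from youngest to oldest without extra sorting.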