# Naive Bayes Classifiers

## Introduction

Naive Bayes is a class of simple classifiers based on Bayes' Rule and strong (or naive) independence assumptions between features. In this problem, you will implement a Naive Bayes Classifier for the Census Income Data Set from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/).

## Dataset Description

The dataset consists of 32561 instances, each representing an individual. The goal is to predict whether a person makes over 50K a year based on 14 features. The features are:

| column | type | description |
| --- |:---:|:--- |
| age | continuous | trips around the sun to date |
| final_weight | continuous | census weight attribute; constructed from the original census data |
| education_num | continuous | numeric education scale -- their maximum educational level as a number |
| capital_gain | continuous | income from investment sources |
| capital_loss | continuous | losses from investment sources |
| hours_per_week | continuous | number of hours worked every week |
| work_class | categorical | `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked` |
| education | categorical | `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool` |
| marital_status | categorical | `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse` |
| occupation | categorical | `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces` |
| relationship | categorical | `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried` |
| race | categorical | `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black` |
| sex | categorical | `Female`, `Male` |
| native_country | categorical | (41 values not shown here) |

In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

import gzip
from testing.testing import test

## Q1. Data Preparation

First, you need to load the above data, provided to you as a CSV file. As the data is from the UCI repository, it is already quite clean. However, some instances have a missing `occupation`, `native_country` or `work_class` (represented as `?` in the CSV file), and these have to be discarded from the training set. Also, replace the `income` column with `label`, which is 1 if `income` is `>50K` and 0 otherwise. Finally, ensure you reset the index so the row numbers are contiguous.

In [2]:
def read_csv(fn):
    with gzip.open(fn, "rt", newline='', encoding="UTF-8") as file:
        return pd.read_csv(file)

def load_data_test(load_data):
    df = load_data()
    
    DF_TYPES = {
        "age"  : "int64",
        "work_class"  : "object",
        "final_weight"  : "int64",
        "education"  : "object",
        "education_num"  : "int64",
        "marital_status"  : "object",
        "occupation"  : "object",
        "relationship"  : "object",
        "race"  : "object",
        "sex"  : "object",
        "capital_gain"  : "int64",
        "capital_loss"  : "int64",
        "hours_per_week"  : "int64",
        "native_country"  : "object",
        "label"  : "int64"
    }

    test.equal(DF_TYPES, { k: str(df[k].dtypes) for k in DF_TYPES })

    # Check for blank entries:
    test.equal(any(df['occupation'].eq("?")), False)
    test.equal(any(df['native_country'].eq("?")), False)
    test.equal(any(df['work_class'].eq("?")), False)
    
    # Make sure there's no income column:
    test.true("income" not in df.columns)

    # Index handling:
    test.equal(repr(df.index), "RangeIndex(start=0, stop=30162, step=1)")
    
@test
def load_data(file_name="census.csv.gz"):
    """ loads and processes data in the manner specified above

    args:
        file_name : str -- path to csv file containing data

    returns: pd.DataFrame -- processed dataframe
    """
    csv_df = read_csv(file_name)
    # Discard rows with a missing ("?") value in any of the three affected columns.
    missing = (csv_df[["work_class", "occupation", "native_country"]] == "?").any(axis=1)
    csv_df = csv_df[~missing]
    # Replace income with a binary int64 label: 1 if income is ">50K", 0 otherwise.
    csv_df = csv_df.assign(label=(csv_df["income"] == ">50K").astype("int64"))
    csv_df = csv_df.drop(columns="income")
    # Make the row numbers contiguous again after dropping rows.
    return csv_df.reset_index(drop=True)

### TESTING load_data: PASSED 6/6
###



## Overview of Naive Bayes classifier

Let $X_1, X_2, \ldots, X_k$ be the $k$ features of a dataset, with class label given by the variable $y$. A probabilistic classifier assigns the most probable class to each instance $(x_1,\ldots,x_k)$, as expressed by
$$ \hat{y} = \arg\max_y P(y\ |\ x_1,\ldots,x_k) $$

Using Bayes' theorem, the above *posterior probability* can be rewritten as
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) P(x_1,\ldots,x_k\ |\ y)}{P(x_1,\ldots,x_k)} $$
where
- $P(y)$ is the prior probability of the class
- $P(x_1,\ldots,x_k\ |\ y)$ is the likelihood of data under a class
- $P(x_1,\ldots,x_k)$ is the evidence for data

Naive Bayes classifiers assume that the feature values are conditionally independent given the class label, that is,
$ P(x_1,\ldots,x_k\ |\ y) = \prod_{i=1}^{k}P(x_i\ |\ y) $. This strong assumption helps simplify the expression for posterior probability to
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) \prod_{i=1}^{k}P(x_i\ |\ y)}{P(x_1,\ldots,x_k)} $$

For a given input $(x_1,\ldots,x_k)$, $P(x_1,\ldots,x_k)$ is constant. Hence, we can say that:
$$ P(y\ |\ x_1,\ldots,x_k) \propto P(y) \prod_{i=1}^{k}P(x_i\ |\ y) $$

Thus, the class of a new instance can be predicted as:

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{k}P(x_i\ |\ y)$$

where $P(y)$ is commonly known as the **class prior** and $P(x_i\ |\ y)$ is the **feature predictor**.

Observe that this is the product of $k+1$ probability values, which can result in very small numbers. When working with real-world data, this often leads to an [arithmetic underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow). We will instead be adding the logarithm of the probabilities:

$$\hat{y} = \arg\max_y \underbrace{\log P(y)}_\text{log-prior} + \underbrace{\sum_{i=1}^{k} \log P(x_i\ |\ y)}_\text{log-likelihood}$$
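
To see why the logarithm helps, here is a minimal sketch with made-up numbers (400 hypothetical likelihood terms of 0.01 each, chosen only for illustration). The direct product falls below the smallest positive `float64` and collapses to zero, while the sum of logarithms stays finite:

```python
import numpy as np

# Hypothetical likelihood terms for a single instance: 400 values of 0.01.
probs = np.full(400, 0.01)

# The exact product is 1e-800, far below the float64 range, so it underflows to 0.
direct_product = np.prod(probs)

# Summing logarithms carries the same information in a representable range.
log_sum = np.sum(np.log(probs))

print(direct_product)  # 0.0
print(log_sum)         # 400 * log(0.01), roughly -1842.07
```

Because the logarithm is monotonically increasing, the class that maximizes the sum of logs is the same class that maximizes the original product.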

The rest of the assignment deals with how each of these probability distributions -- $P(y), P(x_1\ |\ y), \ldots, P(x_k\ |\ y)$ -- is estimated from data.


### Feature Predictor

Naive Bayes classifiers are popular because we can model each feature independently and mix and match model types based on prior knowledge. For example, we might know (or assume) that $(X_i|y)$ has some distribution, so we can directly use the probability density or mass function of that distribution to model $(X_i|y)$.

In this assignment, you will be using two classes of likelihood models:
- Gaussian models, for continuous real-valued features (parameterized by mean $\mu$ and standard deviation $\sigma$)
- Categorical models, for features in discrete categories (parameterized by $\mathbf{p} = (p_0, p_1, \ldots)$, one parameter per category)

You need to implement a generic predictor class for each type of model. Your class should have the following methods:

- `fit()`: Learn parameters for the likelihood model using an appropriate Maximum Likelihood Estimator.
- `partial_log_likelihood()`: Use the previously learnt parameters to compute the probability density or mass of a given feature value, and return the natural logarithm of this value.

## Q2. Gaussian Feature Predictor

The Gaussian distribution is characterized by two parameters - mean $\mu$ and standard deviation $\sigma$:
$$ f_Z(z) = \frac{1}{\sqrt{2\pi}\sigma} \exp{(-\frac{(z-\mu)^2}{2\sigma^2})} $$

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the MLEs for the mean and standard deviation are:
$$ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} z_j $$

$$ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (z_j-\hat{\mu})^2} $$

`scipy.stats.norm` may be helpful, as may `pandas.DataFrame.var`. If you use the latter, remember to correctly set the `ddof`!
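
As a quick sanity check on the `ddof` remark, the following sketch uses a small made-up sample (not taken from the census data) to compare the MLE, which divides by $n$, with the pandas default, which divides by $n-1$. It also checks the closed-form log density against `scipy.stats.norm.logpdf`:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy sample, used only for illustration.
z = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu_hat = z.mean()                         # 5.0
sigma_mle = np.sqrt(z.var(ddof=0))        # divides by n, giving 2.0
sigma_unbiased = np.sqrt(z.var(ddof=1))   # divides by n - 1 (the pandas default)

# Closed-form log density at one point, compared with the scipy implementation.
point = 6.0
manual = -np.log(np.sqrt(2 * np.pi) * sigma_mle) - (point - mu_hat) ** 2 / (2 * sigma_mle ** 2)
library = stats.norm(loc=mu_hat, scale=sigma_mle).logpdf(point)

print(sigma_mle, sigma_unbiased)
print(manual, library)
```

The two standard deviations differ noticeably on a sample this small, which is why leaving `ddof` at its default gives an estimate that is not the MLE above.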

In [3]:
def gaussian_pred_test(gaussian_predictor):
    g = gaussian_predictor(2)
    
    np.random.seed(0xDEADBEEF)
    rnd = np.random.normal(loc=0.0, scale=1.0, size=(1000,))

    data = pd.Series(np.concatenate([rnd, 100-rnd]))
    labels = pd.Series(np.array([0]*1000 + [1]*1000))

    g.fit(data, labels)
    
    test.equal(tuple(g.partial_log_likelihood([0., 50., 100.]).shape), (2, 3))
    # If the equality is not exact, you may need to change the test to ensure the absolute difference is no more than 1e-4
    test.equal(g.partial_log_likelihood([0., 50., 100.]).tolist(),
               [[-0.9234573135702573, -1242.233086628376, -4963.217354198167], [-4963.217354198166, -1242.2330866283753, -0.9234573135702564]])

class GaussianPredictor:
    """ Feature predictor for a normally distributed real-valued, continuous feature.

        attr:
            k : int -- number of classes
            mu : np.ndarray[k] -- vector containing per class mean of the feature
            sigma : np.ndarray[k] -- vector containing per class std. deviation of the feature
    """
    
    def __init__(self, k):
        """ constructor

        args : k -- number of classes
        """
        self.k=k
        self.mu=np.zeros(self.k)
        self.sigma=np.zeros(self.k)
        pass

    def fit(self, x, y):
        """update predictor statistics (mu, sigma) for Gaussian distribution

        args:
            x : pd.Series -- feature values
            y : pd.Series -- class labels
            
        return : GaussianPredictor -- return self for convenience
        """
        # Per-class MLE: sample mean and population (ddof=0) standard deviation.
        # Iterating over range(self.k) keeps entry i aligned with class id i.
        self.mu = np.array([x[y.values == i].mean() for i in range(self.k)])
        self.sigma = np.sqrt([x[y.values == i].var(ddof=0) for i in range(self.k)])
        
        return self
            
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class

        args:
            x : pd.Series -- feature values

        return: np.ndarray[self.k, len(x)] : log likelihood for this feature for each class
        """
        # One row per class: log density of every feature value under that class's Gaussian.
        return np.array([stats.norm(self.mu[i], self.sigma[i]).logpdf(x) for i in range(self.k)])

@test
def gaussian_pred(k):
    return GaussianPredictor(k)

### TESTING gaussian_pred: PASSED 1/2
# 1	: Failed: [[-0.9234573135702573, -1242.233086628376, -4963.217354198167], [-4963.217354198178, -1242.2330866283805, -0.9234573135702585]] is not equal to [[-0.9234573135702573, -1242.233086628376, -4963.217354198167], [-4963.217354198166, -1242.2330866283753, -0.9234573135702564]]
###



## Q3. Categorical Feature Predictor

The categorical distribution with $l$ categories $\{0,\ldots,l-1\}$ is characterized by parameters $\mathbf{p} = (p_0,\dots,p_{l-1})$ where $\sum\mathbf p = 1$.

If $C$ is categorically distributed, the probability of observing $z$ is:

$$ \Pr(C=z; \mathbf{p}) = \begin{cases}
    p_0 & \text{ if } z=0
\\  p_1 & \text{ if } z=1
\\  \vdots
\\  p_{l-1} & \text{ if } z=(l-1)
\end{cases}$$

Given $n$ samples $z_1, \ldots, z_n$ from $C$, the smoothed Maximum Likelihood Estimator for $\mathbf p$ is:
$$ \hat{p_t} = \frac{n_t + \alpha}{n + l\alpha} $$

where $n_t = \sum_{j=1}^{n} [z_j=t]$ (i.e., the number of times the label $t$ occurred in the sample), $n$ is the total number of samples, and $\alpha$ is the smoothing factor. The smoothing is done to avoid the zero-count problem (similar in spirit to smoothing in $n$-gram models in NLP).
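
The effect of smoothing is easiest to see on a tiny example. The sketch below uses a hypothetical helper, `smoothed_estimate`, and a made-up sample in which the category `"C"` is known but never observed; neither is part of the assignment's API:

```python
from collections import Counter

def smoothed_estimate(samples, categories, alpha=1.0):
    """Return {t: (n_t + alpha) / (n + l * alpha)} for every category t."""
    counts = Counter(samples)  # a missing category counts as 0
    n, l = len(samples), len(categories)
    return {t: (counts[t] + alpha) / (n + l * alpha) for t in categories}

# Hypothetical sample: 50 x "A", 25 x "B", and no "C" at all.
samples = ["A"] * 50 + ["B"] * 25
p_hat = smoothed_estimate(samples, categories=["A", "B", "C"])

# With alpha = 1: A -> 51/78, B -> 26/78, C -> 1/78
print(p_hat)
```

Without smoothing, `"C"` would get probability 0 and its logarithm would be $-\infty$, which would rule out a class on the basis of a single unseen value. The smoothed estimates still sum to 1 because the denominator adds $\alpha$ once per category.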

In this problem, you need to write a predictor that learns a different categorical distribution $C_i$ for each of $k$ possible classes. You should maintain a dictionary from each possible input token (i.e. each value) to an array of length $k$ that contains $(\Pr(C_0=z), \Pr(C_1=z), ..., \Pr(C_{k-1}=z))$.

In [4]:
def categorical_pred_test(categorical_pred):
    # Test One:
    p = categorical_pred(3)
    
    data = pd.Series(["A"]*99 + ["B"]*99 + ["C"]*99)
    labels = pd.Series([0]*99 + [1]*99 + [2]*99)
    p.fit(data, labels)
    
    test.true(np.allclose(p.p['A'], [0.98039216, 0.00980392, 0.00980392], atol=1e-6))

    pll = p.partial_log_likelihood(["A", "B", "C", "A", "B", "C"])
    test.equal(tuple(pll.shape), (3, 6))
    n = np.log(1/102)
    p = np.log(100/102)
    
    test.true(np.allclose(pll, [[p, n, n, p, n, n], [n, p, n, n, p, n], [n, n, p, n, n, p]]))
    
    # Test Two:
    
    p = categorical_pred(2)
    
    data = pd.Series(["A"]*50 + ["B"]*50 + ["C"]*50)
    labels = pd.Series([0]*75 + [1]*75)
    p.fit(data, labels)
    
    test.true(np.allclose(p.p['A'], [0.65384614, 0.01282051], atol=1e-6))
    test.true(np.allclose(p.p['B'], [0.33333334, 0.33333334], atol=1e-6))
    test.true(np.allclose(p.p['C'], [0.01282051, 0.65384614], atol=1e-6))

    pll = p.partial_log_likelihood(["A", "B", "C"])
    test.equal(tuple(pll.shape), (2, 3))
    n = np.log(1/78)
    m = np.log(51/78)
    l = np.log(26/78)

    test.true(np.allclose(pll, [[m, l, n], [n, l, m]], atol=1e-6))
    # If test two fails but test one passes, check the smoothing term in the denominator!

class CategoricalPredictor:
    """ Feature predictor for a categorical feature.

        attr: 
            k : int -- number of classes
            p : Dict[feature_value, np.ndarray[k]] -- dictionary of vectors containing per-class probability of a feature value;
    """
    
    def __init__(self, k):
        """ constructor

        args : k -- number of classes
        """
        self.k=k
        self.p={}
        pass

    def fit(self, x, y, alpha=1.):
        """ initializes the predictor statistics (p) for Categorical distribution
        
        args:
            x : pd.Series -- feature values
            y : pd.Series -- class labels
        
        kwargs:
            alpha : float -- smoothing factor

        return : CategoricalPredictor -- returns self for convenience
        """
        categories = x.unique()
        n_categories = len(categories)
        # Split the feature values by class once, instead of once per category.
        per_class = [x[y.values == c] for c in range(self.k)]
        for f in categories:
            # Smoothed MLE (n_t + alpha) / (n + l * alpha), one entry per class id.
            self.p[f] = np.array([((xc == f).sum() + alpha) / (xc.count() + n_categories * alpha)
                                  for xc in per_class])
        return self

    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class

        args:
            x : pd.Series -- vector of feature values

        return : np.ndarray[self.k, len(x)] -- matrix of log likelihood for this feature
        """
        # Stack the per-class probability vectors (one row per value), then
        # transpose so that the result has one row per class.
        return np.log(np.array([self.p[v] for v in x]).T)

@test
def categorical_pred(k):
    return CategoricalPredictor(k)

### TESTING categorical_pred: PASSED 8/8
###



## Q4. Putting things together

It's time to put all the feature predictors together and do something useful! You will implement a class that puts these classifiers to good use:

- `__init__()`: Compute the log prior for each class and initialize the feature predictors (based on feature type). The smoothed prior for class $t$ is given by
$$ \text{prior}(t) = \frac{n_t + \alpha}{n + k\alpha} $$
where $n_t = \sum_{j=1}^{n} [y_j=t]$ (i.e., the number of times the label $t$ occurred in the sample), $n$ is the number of entries in the sample, and $k$ is the number of label values.
- `log_likelihood()`: Compute the sum of the log prior and partial log likelihoods for all features. Use it to predict the final class label.
- `predict()`: Use the output of `log_likelihood()` to predict a class label; break ties by predicting the class with the lower id.
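
The prior formula and the tie-breaking rule can both be checked on toy inputs. The sketch below uses made-up labels and a made-up log-likelihood matrix; it relies on `argmax` returning the first maximum along an axis, which is what sends ties to the lower class id:

```python
import numpy as np

# Hypothetical labels for k = 3 classes; class 2 never occurs in this sample.
y = np.array([0, 0, 0, 1])
k, alpha = 3, 1.0

counts = np.bincount(y, minlength=k)                          # [3, 1, 0]
log_prior = np.log((counts + alpha) / (len(y) + k * alpha))   # log of [4/7, 2/7, 1/7]

# Hypothetical log-likelihood matrix of shape (k, number of instances).
# In the second column, classes 0 and 1 tie at -2.0.
ll = np.array([[-1.0, -2.0],
               [-3.0, -2.0],
               [-5.0, -4.0]])

# argmax returns the first maximum in each column, so the tie goes to class 0.
pred = ll.argmax(axis=0)

print(np.exp(log_prior))
print(pred)  # [0 0]
```

Thanks to smoothing, the unseen class 2 still receives a finite log prior instead of $-\infty$.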

**Note:** Your implementation should not assume the data will always be the same as the census data. We may pass any dataset to your class. You can assume that:

1. the input will contain a `label` column of type `int64` with values $0,\ldots,k-1$ for some $k$
2. all other columns will be either of type `object` (for categorical data) or `int64` (for integer data)
3. if you encounter a column of an invalid type, raise an exception
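
One possible way to act on these assumptions is to branch on each column's dtype. The helper name `model_kind`, the string tags it returns, and the choice of `TypeError` are illustrative assumptions, not requirements of the assignment:

```python
import pandas as pd

def model_kind(column):
    """Map a column's dtype to the kind of feature model it calls for."""
    if column.dtype == "int64":
        return "gaussian"
    if column.dtype == "object":
        return "categorical"
    # Anything else (for example float64) is treated as invalid here.
    raise TypeError(f"unsupported dtype {column.dtype} for column {column.name!r}")

# Hypothetical frame; dtypes are set explicitly so the example does not
# depend on platform or library defaults.
df = pd.DataFrame({
    "age": pd.Series([39, 50], dtype="int64"),
    "sex": pd.Series(["Male", "Female"], dtype="object"),
    "score": pd.Series([0.5, 1.5], dtype="float64"),
})

kinds = {name: model_kind(df[name]) for name in ["age", "sex"]}
print(kinds)  # {'age': 'gaussian', 'sex': 'categorical'}
```

Calling `model_kind(df["score"])` raises the `TypeError`, since `float64` is neither of the two accepted types.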

In [11]:
def naive_bayes_test(naive_bayes):
    df = load_data()
    cl = naive_bayes(df)
    test.equal(cl.log_prior.tolist(), [-0.28626858222129903, -1.3905468592226538])

    test.true(isinstance(cl.predictor['age'], GaussianPredictor) and
        isinstance(cl.predictor['work_class'], CategoricalPredictor) and
        isinstance(cl.predictor['final_weight'], GaussianPredictor) and
        isinstance(cl.predictor['education'], CategoricalPredictor) and
        isinstance(cl.predictor['education_num'], GaussianPredictor) and
        isinstance(cl.predictor['marital_status'], CategoricalPredictor) and
        isinstance(cl.predictor['occupation'], CategoricalPredictor) and
        isinstance(cl.predictor['relationship'], CategoricalPredictor) and
        isinstance(cl.predictor['race'], CategoricalPredictor) and
        isinstance(cl.predictor['sex'], CategoricalPredictor) and
        isinstance(cl.predictor['capital_gain'], GaussianPredictor) and
        isinstance(cl.predictor['capital_loss'], GaussianPredictor) and
        isinstance(cl.predictor['hours_per_week'], GaussianPredictor) and
        isinstance(cl.predictor['native_country'], CategoricalPredictor))    

    ll = cl.log_likelihood(df.drop("label", axis="columns"))
    test.equal(tuple(ll.shape), (2, 30162))
    test.true(np.allclose(ll[:,:2].tolist(), [[-49.84977999441486, -50.38520793711001], [-53.407383777033196, -51.30832341372758]]))

    lp = cl.predict(df.drop("label", axis="columns"))
    test.equal(tuple(lp.shape), (30162,))
    test.equal(sum(lp), 5407)
    test.equal(lp[:10].tolist(), [0]*8 + [1]*2)

class NaiveBayesClassifier:
    """ Naive Bayes classifier for a mixture of continuous and categorical attributes.
        We use GaussianPredictor for continuous attributes and CategoricalPredictor for categorical ones.
        
        attr:
            predictor : Dict[column_name,model] -- model for each column
            log_prior : np.ndarray -- the (log) prior probability of each class
    """

    def __init__(self, df, alpha=1.):
        """initialize predictors for each feature and compute class prior
        
        args:
            df : pd.DataFrame -- processed dataframe, without any missing values.
        
        kwargs:
            alpha : float -- smoothing factor for prior probability
        """
        labels = df['label']
        self.k = labels.nunique()
        self.predictor = {}
        # Smoothed log prior: log((n_t + alpha) / (n + k * alpha)) for each class t.
        self.log_prior = np.array([np.log(((labels.values == t).sum() + alpha) / (len(labels) + self.k * alpha))
                                   for t in range(self.k)])
        for col in df.columns.drop('label'):
            if df[col].dtype == 'int64':
                self.predictor[col] = GaussianPredictor(self.k).fit(df[col], labels)
            elif df[col].dtype == 'object':
                self.predictor[col] = CategoricalPredictor(self.k).fit(df[col], labels)
            else:
                # Columns of any other type are invalid and must raise an exception.
                raise TypeError("unsupported dtype %s for column %r" % (df[col].dtype, col))

    def log_likelihood(self, x):
        """log_likelihood for input instances from log_prior and partial_log_likelihood of feature predictors

        args:
            x : pd.DataFrame -- processed dataframe without label

        returns : np.ndarray[num_classes, len(x)] -- array of log-likelihood
        """
        if 'label' in x.columns:
            x=x.drop('label', axis=1)
        # Start from the log prior, repeated across all rows, then add each feature's term.
        likelih = np.tile(self.log_prior[:, None], (1, len(x)))
        for col in x.columns:
            likelih = likelih + self.predictor[col].partial_log_likelihood(x[col])
        return likelih

    def predict(self, x):
        """predicts label for input instances, breaks ties in favor of the class with lower id.

        args:
            x : pd.DataFrame -- processed dataframe without label.

        returns : np.ndarray[len(x)] -- vector of class labels
        """
        # log_likelihood already ignores a 'label' column if one is present.
        # argmax returns the first maximum along the class axis, so ties go to the lower class id.
        return self.log_likelihood(x).argmax(axis=0)

@test
def naive_bayes(*args, **kwargs):
    return NaiveBayesClassifier(*args, **kwargs)
