# Working with numerical data

This notebook covers:

* identifying numerical data in a heterogeneous dataset;
* selecting the subset of columns corresponding to numerical data;
* using a scikit-learn helper to separate data into train-test sets;
* training and evaluating a more complex scikit-learn model.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [4]:
adult_census = pd.read_csv("../datasets/adult-census.csv")

# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census.drop(columns="education-num", inplace=True)
adult_census.head()

| | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | Some-college | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [5]:
data, target = adult_census.drop(columns=['class']), adult_census['class']

In [6]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object
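
The numerical columns can also be picked out programmatically instead of being read off the `dtypes` listing. A minimal sketch, assuming a small hypothetical frame `toy` that mimics a few of the census columns (it is not the real dataset) and using the pandas method `select_dtypes`:

```python
import pandas as pd

# Hypothetical miniature frame mimicking a few census columns (not the real data).
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": ["Private", "Private", "Local-gov"],
        "capital-gain": [0, 0, 7688],
        "hours-per-week": [40, 50, 40],
    }
)

# Keep only the columns whose dtype is numeric (integers and floats).
numeric_toy = toy.select_dtypes(include="number")
numeric_column_names = numeric_toy.columns.tolist()
print(numeric_column_names)
```

A numeric dtype does not guarantee that a column holds a genuine quantity: integer-coded categories would pass this filter too, so the selected names are still worth checking by eye.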

In [10]:
numerical_columns = ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
data[numerical_columns].head()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |


In [12]:
data['age'].describe()

count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

In [13]:
data_numeric = data[numerical_columns]

In [15]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25
)

In [18]:
print(
    f"Number of samples in the training set: {data_train.shape[0]} -> "
    f"{len(data_train) / data_numeric.shape[0] * 100:.2f}% of the original dataset"
)

Number of samples in the training set: 36631 -> 75.00% of the original dataset


In [19]:
print(
    f"Number of samples in the test set: {data_test.shape[0]} -> "
    f"{len(data_test) / data_numeric.shape[0] * 100:.2f}% of the original dataset"
)

Number of samples in the test set: 12211 -> 25.00% of the original dataset
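
The two counts are not exactly 25% and 75% of the 48842 rows, since that total is not divisible by four. A short sketch of the arithmetic, assuming scikit-learn rounds a fractional `test_size` up and gives the remaining rows to the training set (the variable names here are illustrative):

```python
import math

n_samples = 48842  # row count reported by data['age'].describe()
test_size = 0.25

# Assumption: a fractional test size is rounded up; the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
print(f"{n_train / n_samples * 100:.2f}% train, {n_test / n_samples * 100:.2f}% test")
```

Under that assumption the sketch reproduces the 12211 and 36631 samples reported by the notebook, and both percentages round to the values printed there.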


In [21]:
model = LogisticRegression()

In [22]:
model.fit(data_train, target_train)

In [24]:
accuracy = model.score(data_test, target_test)
print(f"Accuracy of the logistic regression: {accuracy * 100:.3f}%")

Accuracy of the logistic regression: 80.706%
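
For a classifier, `score` reports accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch that computes the same quantity by hand, using synthetic data from `make_classification` as a stand-in for the census table (so the number it prints is unrelated to the result above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; this is not the census dataset.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25
)

clf = LogisticRegression().fit(X_train, y_train)

# Accuracy by hand: share of test samples predicted correctly.
manual_accuracy = (clf.predict(X_test) == y_test).mean()
builtin_accuracy = clf.score(X_test, y_test)

print(manual_accuracy, builtin_accuracy)
```

Both values should agree, which makes explicit that the percentage printed by the notebook is the share of correctly classified test rows.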
