In [None]:
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import torch.nn as nn
from collections import Counter
torch.__version__

# 5.2 Handling structured data with PyTorch
## Introduction
Before we begin, let's clarify what structured data is. As the name suggests, structured data is highly organized and neatly formatted: it is the kind of data that fits in tables and spreadsheets. For our purposes, structured data can be understood as a two-dimensional table; a csv file is one example. It is also commonly called tabular data. Let's look at an example of structured data.

The data file used below comes from fastai's own sample data sets. The fastai example is here:
https://github.com/fastai/fastai/blob/master/examples/tabular.ipynb


## Data preprocessing
The structured data we get is generally a csv file or a table in a database, so we can process it directly with the pandas library.

In [None]:
#Read file
df = pd.read_csv('./data/adult.csv')
# salary is the classification target of this data set
df['salary'].unique()

In [None]:
# Preview the first rows and the columns
df.head()

In [None]:
# pandas' describe gives summary statistics for the numeric columns, a quick way to get an overview of the data set
df.describe()

In [None]:
# Check how many rows there are in total
len(df)

Models can only be trained on numeric data, so we first divide the columns into three categories:
- Training label: the result to predict. It tells us what the training task is, i.e. whether it is a classification task or a regression task.
- Categorical data: this type of data is discrete and cannot be fed directly into the model, so it has to be processed first. This is one of the main tasks of data preprocessing.
- Numerical data: this type of data can be fed directly into the model, but it may still be discrete, so it can be processed further if necessary. Such processing can improve training accuracy considerably, but it is not discussed here.

In [None]:
# Training label
result_var = 'salary'
# Categorical columns
cat_names = ['workclass','education','marital-status','occupation','relationship','race','sex','native-country']
# Numerical columns
cont_names = ['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']
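
The manual split into label, categorical and numerical columns could also be cross-checked programmatically. The following is a minimal sketch that assumes a small made-up frame (`toy`) instead of the real adult.csv; the names `toy`, `cat_guess` and `cont_guess` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the adult data set (values are made up)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "hours-per-week": [40, 13, 40],
    "salary": [">=50k", "<50k", "<50k"],
})

target = "salary"
features = toy.drop(columns=target)

# Non-numeric columns are treated as categorical candidates,
# numeric columns as numerical candidates
cat_guess = features.select_dtypes(exclude="number").columns.tolist()
cont_guess = features.select_dtypes(include="number").columns.tolist()

print(cat_guess)
print(cont_guess)
```

This is only a first guess based on dtypes. A numeric column such as education-num may still be categorical in meaning, so the result should be reviewed by hand.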

After manually confirming the column types, we can look at the number of distinct values and their distribution for each categorical column

In [None]:
for col in df.columns:
    if col in cat_names:
        ccol=Counter(df[col])
        print(col,len(ccol),ccol)
        print("\r\n")

The next step is to convert the categorical data into numeric data. In this step we also fill in the missing values

In [None]:
for col in df.columns:
    if col in cat_names:
        # fillna returns a new Series, so the result must be assigned back
        df[col] = df[col].fillna('---')
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    if col in cont_names:
        df[col]=df[col].fillna(0)

In the above code:

We first use pandas' fillna function to fill the null values in the categorical data. Any marker that differs from the existing values will do; here three dashes --- are used as the marker. Then sklearn's LabelEncoder converts each category into an integer code.

The numerical data is filled with 0. You could also fill numerical data with the mean or use another strategy; this is not our focus, so it is not explained in detail.
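
To make the two fill strategies concrete, here is a small sketch on made-up Series (the variable names and values are illustrative only, not taken from adult.csv). It shows the integer codes LabelEncoder assigns after filling with the --- marker, and the mean fill mentioned as an alternative for numerical data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Categorical column with one missing value (made-up data)
cat_col = pd.Series(["Private", None, "State-gov", "Private"])
cat_filled = cat_col.fillna("---")

encoder = LabelEncoder()
codes = encoder.fit_transform(cat_filled.astype(str))

# classes_ holds the sorted category names; a code is an index into it
print(list(encoder.classes_))
print(list(codes))

# The encoder can map the codes back to the original labels
restored = list(encoder.inverse_transform(codes))

# Numerical column filled with its mean instead of 0
num_col = pd.Series([1.0, None, 3.0])
num_filled = num_col.fillna(num_col.mean())
print(num_filled.tolist())
```

Keeping a reference to the fitted encoder, as done here, makes it possible to decode predictions or to encode new data with the same mapping later.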

In [None]:
df.head()

After the preprocessing is complete, you can see that all the data is now numeric and can be fed directly into the model for training.

In [None]:
# Split the data into features and labels
Y = df['salary']
Y_label = LabelEncoder()
Y=Y_label.fit_transform(Y)
Y

In [None]:
X=df.drop(columns=result_var)
X

This completes the basic data preprocessing. Only the necessary steps are shown above; there are many techniques that can improve training accuracy, but they are not explained in detail here.
## Define data set
To process data with PyTorch, you need to define a data set with Dataset. A simple data set is defined below

In [None]:
class tabularDataset(Dataset):
    def __init__(self, X, Y):
        # Convert the DataFrame to a NumPy array once, instead of on every access
        self.x = X.to_numpy()
        self.y = Y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Return one (features, label) pair
        return (self.x[idx], self.y[idx])

In [None]:
train_ds = tabularDataset(X, Y)

You can access the items of the data set directly by index

In [None]:
train_ds[0]
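
The data set above returns NumPy values, which the training loop later casts with .float() and .long(). One possible variation, shown here only as a sketch, is to convert everything to tensors once in the constructor. The class name `TensorTabularDataset` and the demo data are hypothetical and not part of the original notebook.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class TensorTabularDataset(Dataset):
    """Hypothetical variant that stores features and labels as tensors."""

    def __init__(self, features, labels):
        # float32 features and int64 labels are what nn.Linear and
        # nn.CrossEntropyLoss expect
        self.x = torch.as_tensor(np.asarray(features), dtype=torch.float32)
        self.y = torch.as_tensor(np.asarray(labels), dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Made-up data: 4 rows with 3 features each
demo_ds = TensorTabularDataset(np.arange(12).reshape(4, 3), [0, 1, 0, 1])
x0, y0 = demo_ds[0]
print(x0, y0)
```

With this variant the per-batch casts in the training loop would become redundant, at the cost of holding a tensor copy of the data in memory.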

## Define the model
The data is ready, so the next step is to define our model. Here we use a simple model with 3 linear layers

In [None]:
class tabularModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(14, 500)
        self.lin2 = nn.Linear(500, 100)
        self.lin3 = nn.Linear(100, 2)
        self.bn1 = nn.BatchNorm1d(14)
        self.bn2 = nn.BatchNorm1d(500)
        self.bn3 = nn.BatchNorm1d(100)


    def forward(self,x_in):
        #print(x_in.shape)
        x=x_in
        x = self.bn1(x)
        x = F.relu(self.lin1(x))
        #print(x)

        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        #print(x)

        x = self.bn3(x)
        x = self.lin3(x)
        x=torch.sigmoid(x)
        return x

As you can see, the model definition includes Batch Normalization layers, which normalize each batch.
For more on batch normalization, please refer to this article: https://mp.weixin.qq.com/s/FFLQBocTZGqnyN79JbSYcQ

Or scan this QR code and view it in WeChat:
![](https://raw.githubusercontent.com/zergtant/pytorch-handbook/master/deephub.jpg)
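
As a quick illustration of what the nn.BatchNorm1d layers in the model do, the sketch below passes a made-up batch with a large offset and scale through a single layer in training mode. The tensor values are random and the variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One batch-norm layer over 3 features, in training mode so that
# the statistics of the current batch are used
bn = nn.BatchNorm1d(3)
bn.train()

# Made-up batch: 64 rows, features centered near 5 with a spread of about 10
x = torch.randn(64, 3) * 10 + 5
out = bn(x)

# After normalization each feature has roughly zero mean and unit variance
out_mean = out.mean(dim=0)
out_std = out.std(dim=0, unbiased=False)
print(out_mean)
print(out_std)
```

This is why columns with very different ranges, such as fnlwgt and age, can be fed to the model without scaling them by hand first. In eval mode the layer uses its running statistics instead of the batch statistics.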

## Training

In [None]:
# Select the device to use before training
DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
print(DEVICE)

In [None]:
#Loss function
criterion =nn.CrossEntropyLoss().to(DEVICE)
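
For readers unfamiliar with nn.CrossEntropyLoss, the sketch below shows on made-up scores that it is equivalent to applying log-softmax followed by the negative log-likelihood loss. The tensors `logits` and `targets` are illustrative values, not outputs of the model in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for 2 samples and 2 classes, plus their true classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
targets = torch.tensor([0, 1])

# CrossEntropyLoss takes raw scores and integer class labels
ce = nn.CrossEntropyLoss()(logits, targets)

# The same value computed in two explicit steps
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(ce.item(), manual.item())
```

Because the loss already applies log-softmax, it is designed for raw scores. The model above applies torch.sigmoid to its outputs before they reach the loss; returning the raw output of lin3 instead is a common alternative that may be worth trying, although it is not evaluated here.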

In [None]:
#Instantiate the model
model = tabularModel().to(DEVICE)
print(model)

In [None]:
#Test whether the model is ok
rn=torch.rand(3,14).to(DEVICE)
model(rn)

In [None]:
#Learning rate
LEARNING_RATE=0.01
#Batch size
batch_size = 2048
#Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


In [None]:
#Load the data in batches with DataLoader
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)
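
To see what a DataLoader yields, the following sketch iterates over a small made-up data set and collects the batch shapes. It uses TensorDataset with random tensors in place of train_ds, and a batch size of 4 instead of 2048, purely for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 10 rows with 14 features each and binary labels
demo_x = torch.randn(10, 14)
demo_y = torch.randint(0, 2, (10,))
demo_dl = DataLoader(TensorDataset(demo_x, demo_y), batch_size=4, shuffle=True)

# Each iteration yields one (features, labels) batch
shapes = [tuple(xb.shape) for xb, yb in demo_dl]
print(shapes)
print(len(demo_dl))
```

The last batch is smaller when the number of rows is not a multiple of the batch size. shuffle=True changes which rows land in which batch on every epoch, but not the batch sizes.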

The steps above are needed in every training run, so we won't go into more detail. Let's start training the model

In [None]:
%%time
model.train()
# Train for 10 epochs
TOTAL_EPOCHS = 10
# Record the loss values
losses = []
for epoch in range(TOTAL_EPOCHS):
    for i, (x, y) in enumerate(train_dl):
        x = x.float().to(DEVICE)  # the input must be of type float
        y = y.long().to(DEVICE)  # the labels must be of type long
        # Clear the gradients from the previous step
        optimizer.zero_grad()
        outputs = model(x)
        # Compute the loss
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        print('Epoch: %d/%d, Loss: %.4f' % (epoch + 1, TOTAL_EPOCHS, loss.item()))

After training is complete, we can look at the accuracy of the model on the training data

In [None]:
model.eval()
correct = 0
total = 0
# No gradients are needed for evaluation
with torch.no_grad():
    for i, (x, y) in enumerate(train_dl):
        x = x.float().to(DEVICE)
        y = y.long()
        outputs = model(x).cpu()
        # The predicted class is the index of the largest output
        _, predicted = torch.max(outputs, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item()
print('Accuracy: %.4f %%' % (100 * correct / total))
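
The accuracy above is computed on train_dl, that is, on the same rows the model was trained on. A held-out split usually gives a more realistic estimate. The sketch below shows one way to create such a split with sklearn's train_test_split; random arrays stand in for X and Y, and the 80/20 ratio and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the feature table and the encoded labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 14))
y_demo = rng.integers(0, 2, size=100)

# Hold out 20% of the rows; stratify keeps the class ratio similar in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_va.shape)
```

The training part would then be wrapped in one data set and DataLoader for training, and the validation part in a second one that is used only in the evaluation loop.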

With this basic training process the accuracy reaches 86%, but the loss stays at around 0.4 and stops decreasing, which suggests that the network has reached the limit of what it can do with this setup. So what can be done to improve the accuracy? More advanced data processing methods will be introduced later to improve it.