### Source:
https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_03_3_feature_encode.ipynb

### Encoding a feature vector for PyTorch deep learning
Neural networks can accept many types of data. We will begin with tabular data, where there are well-defined rows and columns. This is the kind of data you typically see in MS Excel. Neural networks require numeric input. This numeric form is called a feature vector. Each input neuron receives one feature (or column) from this vector. Each row of training data typically becomes one vector. This section shows how to encode the following tabular data into a feature vector. You can see an example of tabular data below.

In [1]:
import pandas as pd
from sklearn import preprocessing

In [6]:
df = pd.read_csv('data/jh-simple-dataset.csv', na_values=["NA", "?"])
df.head()

,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b
1,2,kd,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a


You can make the following observations from the above data:

- The target column is the column that you seek to predict. There are several candidates here. However, we will initially use the column "product". This field specifies what product someone bought.
- There is an ID column. You should exclude this column because it contains no information useful for prediction.
- Many of these fields are numeric and might not require further processing.
- The income column does have some missing values.
- There are categorical values: job, area and product.
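
These observations can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a small hypothetical frame named `sample` that mimics a few columns of the dataset; its values are illustrative and are not read from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of the dataset (illustrative values only).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job": ["vv", "kd", "pe", "kd"],
    "income": [50876.0, np.nan, 55126.0, 51690.0],
    "product": ["b", "c", "b", "a"],
})

# Count missing values per column and keep only the columns that have gaps.
missing = sample.isnull().sum()
missing = missing[missing > 0]

# Non-numeric columns are the candidates for categorical encoding.
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(missing.index.tolist())  # columns that contain missing values
print(categorical)             # columns that hold category codes
```

Running the same two checks on the full dataframe would be one way to confirm which columns need filling and which need encoding before any transformation is applied.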

### Convert categories to dummy variables
To begin with, convert the job column into dummy variables (one-hot encoding):

In [7]:
dummies = pd.get_dummies(df["job"], prefix="job")
print(dummies.shape)

dummies.info()

(2000, 33)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 33 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   job_11  2000 non-null   bool 
 1   job_al  2000 non-null   bool 
 2   job_am  2000 non-null   bool 
 3   job_ax  2000 non-null   bool 
 4   job_bf  2000 non-null   bool 
 5   job_by  2000 non-null   bool 
 6   job_cv  2000 non-null   bool 
 7   job_de  2000 non-null   bool 
 8   job_dz  2000 non-null   bool 
 9   job_e2  2000 non-null   bool 
 10  job_f8  2000 non-null   bool 
 11  job_gj  2000 non-null   bool 
 12  job_gv  2000 non-null   bool 
 13  job_kd  2000 non-null   bool 
 14  job_ke  2000 non-null   bool 
 15  job_kl  2000 non-null   bool 
 16  job_kp  2000 non-null   bool 
 17  job_ks  2000 non-null   bool 
 18  job_kw  2000 non-null   bool 
 19  job_mm  2000 non-null   bool 
 20  job_nb  2000 non-null   bool 
 21  job_nn  2000 non-null   bool 
 22  job_ob  2000 non-null   bool 
 23  jo

With 33 different job codes, there are 33 dummy variables. Next, merge these dummies back into the main dataframe and drop the original "job" field:

In [8]:
df = pd.concat([df, dummies], axis=1)
df.drop("job", axis=1, inplace=True)
df.head()

,id,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,...,job_pe,job_po,job_pq,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz
0,1,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,...,False,False,False,False,False,False,False,False,True,False
1,2,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,...,False,False,False,False,False,False,False,False,False,False
2,3,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,...,True,False,False,False,False,False,False,False,False,False
3,4,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,...,False,False,False,False,False,False,False,False,False,False
4,5,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,...,False,False,False,False,False,False,False,False,False,False


In [9]:
df = pd.concat([df, pd.get_dummies(df["area"], prefix="area")], axis=1)
df.drop("area", axis=1, inplace=True)
df.head()

,id,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,...,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz,area_a,area_b,area_c,area_d
0,1,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,...,False,False,False,False,True,False,False,False,True,False
1,2,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,...,False,False,False,False,False,False,False,False,True,False
2,3,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,...,False,False,False,False,False,False,False,False,True,False
3,4,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,...,False,False,False,False,False,False,False,False,True,False
4,5,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,...,False,False,False,False,False,False,False,False,False,True


In [10]:
print(df.shape)

(2000, 49)
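
The concat-and-drop pattern used for "job" and "area" can also be written as a single call, because `pd.get_dummies` accepts a `columns` argument and removes the encoded columns itself. The following is a minimal sketch on a hypothetical three-row frame; the name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the dataset with two categorical columns.
toy = pd.DataFrame({
    "job": ["vv", "kd", "vv"],
    "area": ["c", "c", "d"],
    "age": [49, 51, 38],
})

# Encode both categorical columns at once; the original columns are dropped automatically.
encoded = pd.get_dummies(toy, columns=["job", "area"], prefix=["job", "area"])

print(list(encoded.columns))
print(encoded.shape)  # → (3, 5)
```

Each row has exactly one true value within each group of dummy columns, which is what makes the encoding "one-hot". The column order can differ from the concat approach, so code that relies on column positions should select columns by name.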


Fill the missing income values with the median of the column:

In [11]:
med = df["income"].median()
df["income"] = df["income"].fillna(med)
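
The median is a common choice for this kind of fill because it is less sensitive to extreme values than the mean. The following is a minimal sketch with invented income figures, including one deliberately large outlier, to show the difference; none of the numbers come from the dataset.

```python
import numpy as np
import pandas as pd

# Invented incomes with one missing value and one extreme outlier.
income = pd.Series([20000.0, 30000.0, np.nan, 40000.0, 1000000.0])

# Both statistics skip NaN by default.
med = income.median()
mean = income.mean()

# Replace the missing entry with the median.
filled = income.fillna(med)

print(med)   # → 35000.0
print(mean)  # → 272500.0
print(filled.isnull().sum())
```

Here the outlier pulls the mean far above most of the observed values, while the median stays near the middle of them.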

Create a list of the dataframe's columns, dropping the "id" field and the target (y) field, which in this case is "product":

In [12]:
x_columns = df.columns.drop("product").drop("id")
print(list(x_columns))

['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']
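
The two chained `drop` calls can be collapsed into one, since `Index.drop` also accepts a list of labels. The following is a minimal sketch on a hypothetical four-column index; the column names are borrowed from the dataset purely for illustration.

```python
import pandas as pd

# Hypothetical column index with an ID, two features, and the target.
cols = pd.Index(["id", "income", "age", "product"])

# Chained form, as in the cell above.
chained = cols.drop("product").drop("id")

# Equivalent single call with a list of labels.
single = cols.drop(["product", "id"])

print(list(single))  # → ['income', 'age']
```

Both forms raise a `KeyError` if a label is not present, which is a useful guard against misspelled column names.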


Generate x and y for a classification neural network:

In [None]:
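# Cast to float32; mixed bool and numeric columns would otherwise come back as an object array
x = df[x_columns].values.astype("float32")
# Encode the product labels as integer class indices
le = preprocessing.LabelEncoder()
y = le.fit_transform(df["product"])
products = le.classes_  # original label for each class index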
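
To make the role of `LabelEncoder` concrete, the following is a minimal sketch on four hypothetical product labels. It shows how the encoder assigns integer class indices in sorted label order and how `inverse_transform` maps predictions back to the original labels; the `labels` array is an assumption for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical product labels, standing in for the "product" column.
labels = np.array(["b", "c", "b", "a"])

le = preprocessing.LabelEncoder()
y = le.fit_transform(labels)

# classes_ holds the sorted unique labels; position i is the label for class index i.
print(le.classes_.tolist())  # → ['a', 'b', 'c']
print(y.tolist())            # → [1, 2, 1, 0]

# Map integer class indices back to the original labels.
decoded = le.inverse_transform(y)
print(decoded.tolist())
```

Keeping `le.classes_` around, as the cell above does with `products`, makes it possible to translate a network's predicted class index back into a product code later.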