
# Data Preprocessing

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting. This section, while no substitute for a proper pandas tutorial, will give you a crash course on some of the most common routines.

# Reading the Dataset

Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics”. To demonstrate how to load CSV files with pandas, we create a CSV file below at ../data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price).

In [1]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [2]:
data_file

'../data/house_tiny.csv'

Now let's import pandas and load the dataset with read_csv.

In [3]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


# Data Preparation

In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing (iloc).
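
The two selection styles mentioned above can be compared side by side. This is a minimal sketch: the DataFrame is rebuilt by hand as a stand-in for the house dataset, and the variable names (`inputs_by_name`, `inputs_by_pos`, and so on) are chosen here purely for illustration.

```python
import pandas as pd

# Hand-built stand-in for the house dataset used in this section.
data = pd.DataFrame({
    "NumRooms": [float("nan"), 2.0, 4.0, float("nan")],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Select columns by name ...
inputs_by_name = data[["NumRooms", "RoofType"]]
targets_by_name = data["Price"]

# ... or by integer position with iloc.
inputs_by_pos = data.iloc[:, 0:2]
targets_by_pos = data.iloc[:, 2]

# Check that both approaches pick out the same columns.
print(inputs_by_name.equals(inputs_by_pos))
print(targets_by_name.equals(targets_by_pos))
```

Selecting by name tends to be more robust when the column order of a file may change, while iloc is convenient when columns are unnamed.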

You might have noticed that pandas replaced all CSV entries with value NA with a special NaN (not a number) value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.
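
To make the deletion option concrete, here is a small sketch using pandas' `dropna`. The CSV text is inlined through `io.StringIO` so the example runs on its own; the names `complete_rows` and `complete_cols` are illustrative and do not appear elsewhere in this section.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, read from memory instead of disk.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Deletion by row: keep only rows without any missing value.
complete_rows = data.dropna(axis=0)

# Deletion by column: keep only columns without any missing value.
complete_cols = data.dropna(axis=1)

print(complete_rows)
print(complete_cols)
```

On a dataset this small, deleting rows leaves a single record, which illustrates why imputation is often preferred when missing values are frequent.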

Here are some common imputation heuristics. For categorical input fields, we can treat NaN as a category. Since the RoofType column takes values Slate and NaN, pandas can convert this column into two columns RoofType_Slate and RoofType_nan. A row whose roof type is Slate will set values of RoofType_Slate and RoofType_nan to 1 and 0, respectively. The converse holds for a row with a missing RoofType value.

In [4]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1
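
Depending on the installed pandas version, the indicator columns may print as True/False rather than 1/0, because pandas 2.0 and later default to boolean dummies. If integer indicators are wanted, one option is to pass `dtype=int`, as in this sketch (the small DataFrame is a hand-built stand-in for `inputs`, and `encoded` is a name chosen here):

```python
import pandas as pd

# Hand-built stand-in for the two input columns of the house dataset.
inputs = pd.DataFrame({
    "NumRooms": [float("nan"), 2.0, 4.0, float("nan")],
    "RoofType": [None, None, "Slate", None],
})

# dummy_na=True adds an indicator column for missing values;
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(encoded)
```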


For missing numerical values, one common heuristic is to replace the NaN entries with the mean value of the corresponding column.

In [5]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1
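
The value 3.0 comes from averaging only the observed entries, since `mean` skips NaN by default: (2 + 4) / 2 = 3. The sketch below reproduces this for a single column; `num_rooms`, `col_mean`, and `filled` are illustrative names.

```python
import pandas as pd

# The NumRooms column on its own, with two missing entries.
num_rooms = pd.Series([float("nan"), 2.0, 4.0, float("nan")], name="NumRooms")

# mean() ignores NaN by default, so only 2.0 and 4.0 contribute.
col_mean = num_rooms.mean()

# Replace each missing entry with the column mean.
filled = num_rooms.fillna(col_mean)

print(col_mean)
print(filled.tolist())
```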


# Conversion to the Tensor Format

Now that all the entries in inputs and targets are numerical, we can load them into a tensor.

In [6]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))

There is much more material to explore. As a further exercise, the cells below apply the same steps to the Iris dataset.

In [22]:
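import urllib.request

import pandas as pd

# URL of the Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Download the data file into the Google Colab working directory
urllib.request.urlretrieve(url, "iris.data")

# Read the data file into a pandas DataFrame.
# Note: iris.data has no header row, so without header=None the first
# record (5.1,3.5,1.4,0.2,Iris-setosa) is used as the column labels.
data = pd.read_csv("iris.data")
df = pd.DataFrame(data)
column_names = ["sepal length in cm", "sepal width in cm", "petal length in cm", "petal width in cm", "class"]

# Note: set_axis returns a new DataFrame; the result is not assigned here,
# so df keeps its original column labels.
df.set_axis(column_names, axis=1)

df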


     5.1  3.5  1.4  0.2     Iris-setosa
0    4.9  3.0  1.4  0.2     Iris-setosa
1    4.7  3.2  1.3  0.2     Iris-setosa
2    4.6  3.1  1.5  0.2     Iris-setosa
3    5.0  3.6  1.4  0.2     Iris-setosa
4    5.4  3.9  1.7  0.4     Iris-setosa
..   ...  ...  ...  ...             ...
144  6.7  3.0  5.2  2.3  Iris-virginica
145  6.3  2.5  5.0  1.9  Iris-virginica
146  6.5  3.0  5.2  2.0  Iris-virginica
147  6.2  3.4  5.4  2.3  Iris-virginica
148  5.9  3.0  5.1  1.8  Iris-virginica

[149 rows x 5 columns]
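
The output above has 149 rows and uses the first record as column labels, because the file has no header row. One possible way to keep all records and attach readable labels is to pass `header=None` and `names=` to `read_csv`. The sketch below inlines the first three records so that it runs offline; `sample` and `df_fixed` are names introduced here for illustration.

```python
import io

import pandas as pd

# First three records of iris.data, inlined so the sketch runs offline.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

column_names = ["sepal length in cm", "sepal width in cm",
                "petal length in cm", "petal width in cm", "class"]

# header=None tells pandas the file has no header row;
# names= supplies the column labels directly.
df_fixed = pd.read_csv(io.StringIO(sample), header=None, names=column_names)

print(df_fixed.shape)
print(df_fixed.columns.tolist())
```

With the real file, the same two arguments would apply to `pd.read_csv("iris.data", ...)`; note that the outputs shown in the rest of this section come from the original call without them.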


In [9]:
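# Inputs: the first three columns (0:3 stops before column 3, petal width).
# Targets: the class column at position 4.
inputs, targets = df.iloc[:, 0:3], df.iloc[:, 4]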

In [10]:
# All input columns are numeric here, so get_dummies leaves them unchanged.
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

     5.1  3.5  1.4
0    4.9  3.0  1.4
1    4.7  3.2  1.3
2    4.6  3.1  1.5
3    5.0  3.6  1.4
4    5.4  3.9  1.7
..   ...  ...  ...
144  6.7  3.0  5.2
145  6.3  2.5  5.0
146  6.5  3.0  5.2
147  6.2  3.4  5.4
148  5.9  3.0  5.1

[149 rows x 3 columns]


In [11]:
print(targets)

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
144    Iris-virginica
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
Name: Iris-setosa, Length: 149, dtype: object


In [12]:
targets.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [15]:
# Replace the string class labels in the target column with floats
for i in range(len(targets)):
    if targets[i] == "Iris-setosa":
        targets[i] = 0.0
    elif targets[i] == "Iris-versicolor":
        targets[i] = 1.0
    elif targets[i] == "Iris-virginica":
        targets[i] = 2.0


# Print the class column of the DataFrame
print(df.iloc[:, 4])

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
144    2.0
145    2.0
146    2.0
147    2.0
148    2.0
Name: Iris-setosa, Length: 149, dtype: object


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  targets[i] = 2.0
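
The warning above appears because `targets` was taken as a slice of `df`, and the loop assigns into it element by element. A vectorized alternative that avoids in-place assignment is to map the labels through a dictionary. This is a sketch under stated assumptions: the short Series is a toy stand-in for the class column, and `class_to_id` and `encoded` are names introduced here.

```python
import pandas as pd

# Toy stand-in for the class column of the Iris DataFrame.
targets = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="class",
)

# Same label-to-number mapping as in the loop above.
class_to_id = {
    "Iris-setosa": 0.0,
    "Iris-versicolor": 1.0,
    "Iris-virginica": 2.0,
}

# map() builds a new float Series and leaves the original untouched.
encoded = targets.map(class_to_id)

print(encoded.tolist())
print(encoded.dtype)
```

Unlike the loop, this leaves the source DataFrame unmodified and yields a float64 column rather than an object column, so the later `to_numpy(dtype=float)` conversion has nothing to coerce.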


In [17]:
print(df)

     5.1  3.5  1.4  0.2 Iris-setosa
0    4.9  3.0  1.4  0.2         0.0
1    4.7  3.2  1.3  0.2         0.0
2    4.6  3.1  1.5  0.2         0.0
3    5.0  3.6  1.4  0.2         0.0
4    5.4  3.9  1.7  0.4         0.0
..   ...  ...  ...  ...         ...
144  6.7  3.0  5.2  2.3         2.0
145  6.3  2.5  5.0  1.9         2.0
146  6.5  3.0  5.2  2.0         2.0
147  6.2  3.4  5.4  2.3         2.0
148  5.9  3.0  5.1  1.8         2.0

[149 rows x 5 columns]


In [18]:
Inputs = torch.tensor(inputs.to_numpy(dtype=float))
Label = torch.tensor(targets.to_numpy(dtype=float))
Inputs, Label

(tensor([[4.9000, 3.0000, 1.4000],
         [4.7000, 3.2000, 1.3000],
         [4.6000, 3.1000, 1.5000],
         [5.0000, 3.6000, 1.4000],
         [5.4000, 3.9000, 1.7000],
         [4.6000, 3.4000, 1.4000],
         [5.0000, 3.4000, 1.5000],
         [4.4000, 2.9000, 1.4000],
         [4.9000, 3.1000, 1.5000],
         [5.4000, 3.7000, 1.5000],
         [4.8000, 3.4000, 1.6000],
         [4.8000, 3.0000, 1.4000],
         [4.3000, 3.0000, 1.1000],
         [5.8000, 4.0000, 1.2000],
         [5.7000, 4.4000, 1.5000],
         [5.4000, 3.9000, 1.3000],
         [5.1000, 3.5000, 1.4000],
         [5.7000, 3.8000, 1.7000],
         [5.1000, 3.8000, 1.5000],
         [5.4000, 3.4000, 1.7000],
         [5.1000, 3.7000, 1.5000],
         [4.6000, 3.6000, 1.0000],
         [5.1000, 3.3000, 1.7000],
         [4.8000, 3.4000, 1.9000],
         [5.0000, 3.0000, 1.6000],
         [5.0000, 3.4000, 1.6000],
         [5.2000, 3.5000, 1.5000],
         [5.2000, 3.4000, 1.4000],
         [4.7000, 3.