This notebook gives an overview of basic Pandas usage.

# Load the dataset

In [1]:
import pandas as pd

df = pd.read_csv('0_Preliminary/0_Training/Pre_train_D_0.csv')
df

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
0,1.597708e+09,260,8,06 25 05 30 FF CF 71 55,Normal
1,1.597708e+09,329,8,4A C5 7E 8C 31 2D 01 10,Normal
2,1.597708e+09,38D,8,00 00 49 00 90 7F FE 01,Normal
3,1.597708e+09,420,8,50 1E 00 C8 FC 4F 6A 00,Normal
4,1.597708e+09,421,8,FE 07 00 FF E3 7F 00 52,Normal
...,...,...,...,...,...
179341,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
179342,1.597708e+09,260,8,06 39 1A 30 FF D1 A1 63,Normal
179343,1.597708e+09,421,8,FE 07 00 FF E3 7F 00 9E,Normal
179344,1.597708e+09,130,8,94 8E F0 81 00 00 0B AC,Normal


# Pandas usage: how to access the data
## Terms (Pandas class)
- [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html): a Pandas class representing a table (2D array)
- [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html): a Pandas class representing a 1D array

### DataFrame
- Case 1. Access a column: **`df[COL_IDX]`**, `df.loc[:, COL_IDX]` -> It returns a *Series*.
- Case 2. Access a row: **`df.loc[ROW_IDX]`**, `df.loc[ROW_IDX, :]` -> It returns a *Series*.
- Case 3. Access an element: `df.loc[ROW_IDX, COL_IDX]` -> It returns a *value*.


### Series
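A Series behaves much like a Python list: `series[IDX]`. Note that `IDX` refers to the index label, which matches the position only when the Series has the default index.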

[Pandas user guide: Indexing and selecting data](https://pandas.pydata.org/docs/user_guide/indexing.html)
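
A minimal sketch of where labels and positions diverge in a Series. The three values are made up for illustration and are not read from the dataset.

```python
import pandas as pd

# Toy Series with the default index 0, 1, 2 (values are illustrative).
s = pd.Series(['260', '329', '38D'])

# Filtering keeps the original labels, so the result is labelled 0 and 2.
filtered = s[s != '329']

by_label = filtered.loc[2]      # label-based: the entry whose index label is 2
by_position = filtered.iloc[1]  # position-based: the second remaining entry

print(list(filtered.index))
print(by_label, by_position)
```

Both lookups return the same entry here, but `filtered.loc[1]` would raise a `KeyError` because label 1 was filtered out.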

In [2]:
df['Class']  # column

0         Normal
1         Normal
2         Normal
3         Normal
4         Normal
           ...  
179341    Normal
179342    Normal
179343    Normal
179344    Normal
179345    Normal
Name: Class, Length: 179346, dtype: object

In [3]:
df.loc[2]  # row

Timestamp                1597707827.05467
Arbitration_ID                        38D
DLC                                     8
Data              00 00 49 00 90 7F FE 01
Class                              Normal
Name: 2, dtype: object

In [4]:
df.loc[2:4]  # rows (slicing)

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
2,1597708000.0,38D,8,00 00 49 00 90 7F FE 01,Normal
3,1597708000.0,420,8,50 1E 00 C8 FC 4F 6A 00,Normal
4,1597708000.0,421,8,FE 07 00 FF E3 7F 00 52,Normal


In [5]:
df.loc[[2, 4, 8, 16, 30000]]  # rows (with specific indices)

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
2,1597708000.0,38D,8,00 00 49 00 90 7F FE 01,Normal
4,1597708000.0,421,8,FE 07 00 FF E3 7F 00 52,Normal
8,1597708000.0,389,8,00 00 00 20 00 00 C2 00,Normal
16,1597708000.0,366,7,26 82 10 25 21 0B 01,Normal
30000,1597708000.0,164,4,00 08 1A B4,Normal


In [6]:
df.loc[2, 'Data']  # element

'00 00 49 00 90 7F FE 01'

In [7]:
# three ways to access the element at row 2 and col `Data`
print(df['Data'][2])
print(df.loc[2]['Data'])
print(df.loc[2, 'Data'])

00 00 49 00 90 7F FE 01
00 00 49 00 90 7F FE 01
00 00 49 00 90 7F FE 01


In [8]:
df.shape  # the number of rows and columns

(179346, 5)

## Conditional indexing

The loop below works, but avoid evaluating conditions row by row: it is inefficient on large DataFrames.
```python
output = list()
for rowidx, row in df.iterrows():
    if row['DLC'] != 8:
        output.append(row)

len(output)
```

In [9]:
df['DLC'] != 8 # returns a series of booleans

0         False
1         False
2         False
3         False
4         False
          ...  
179341    False
179342    False
179343    False
179344    False
179345    False
Name: DLC, Length: 179346, dtype: bool
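
Because `True` counts as 1, summing a boolean Series gives the number of matching rows without a Python loop. The sketch below uses a made-up five-row frame (not the dataset) and checks that the row-by-row count and the vectorized count agree.

```python
import pandas as pd

# Toy frame; the DLC values are made up for illustration.
toy = pd.DataFrame({'DLC': [8, 8, 6, 4, 8]})

# Row-by-row version (slow on large frames).
loop_count = 0
for _, row in toy.iterrows():
    if row['DLC'] != 8:
        loop_count += 1

# Vectorized version: compare the whole column at once, then sum the booleans.
mask = toy['DLC'] != 8
vector_count = int(mask.sum())

print(loop_count, vector_count)
```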

In [10]:
df2 = df[df['DLC'] != 8] # select rows by booleans
df2

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
13,1.597708e+09,2B0,6,67 00 00 07 CA 5B,Normal
14,1.597708e+09,164,4,00 08 1C FA,Normal
16,1.597708e+09,366,7,26 82 10 25 21 0B 01,Normal
22,1.597708e+09,453,5,00 88 90 00 6F,Normal
30,1.597708e+09,4F1,4,40 1B 60 31,Normal
...,...,...,...,...,...
179317,1.597708e+09,485,4,02 00 00 00,Normal
179323,1.597708e+09,164,4,00 08 0C 37,Normal
179324,1.597708e+09,366,7,39 7A 11 39 1F 08 01,Normal
179326,1.597708e+09,2B0,6,34 0D 22 07 F2 4A,Normal


In [11]:
df2 = df.query('DLC != 8') # df.query() -> another convenient way
df2

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
13,1.597708e+09,2B0,6,67 00 00 07 CA 5B,Normal
14,1.597708e+09,164,4,00 08 1C FA,Normal
16,1.597708e+09,366,7,26 82 10 25 21 0B 01,Normal
22,1.597708e+09,453,5,00 88 90 00 6F,Normal
30,1.597708e+09,4F1,4,40 1B 60 31,Normal
...,...,...,...,...,...
179317,1.597708e+09,485,4,02 00 00 00,Normal
179323,1.597708e+09,164,4,00 08 0C 37,Normal
179324,1.597708e+09,366,7,39 7A 11 39 1F 08 01,Normal
179326,1.597708e+09,2B0,6,34 0D 22 07 F2 4A,Normal


In [12]:
df2 = df.query('DLC in (4, 6)') # df.query() supports Python-like expressions
df2

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
13,1.597708e+09,2B0,6,67 00 00 07 CA 5B,Normal
14,1.597708e+09,164,4,00 08 1C FA,Normal
30,1.597708e+09,4F1,4,40 1B 60 31,Normal
38,1.597708e+09,164,4,00 08 1E C0,Normal
39,1.597708e+09,2B0,6,67 00 00 07 DB 8B,Normal
...,...,...,...,...,...
179300,1.597708e+09,164,4,00 08 0A 79,Normal
179302,1.597708e+09,2B0,6,41 0D 23 07 F1 8D,Normal
179317,1.597708e+09,485,4,02 00 00 00,Normal
179323,1.597708e+09,164,4,00 08 0C 37,Normal
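
`df.query()` can also read Python variables from the surrounding scope when the name is prefixed with `@`. A small sketch on made-up rows follows; `wanted` is a hypothetical variable name chosen for this example.

```python
import pandas as pd

# Toy frame; the rows are illustrative, not read from the dataset file.
toy = pd.DataFrame({
    'Arbitration_ID': ['260', '164', '2B0', '366'],
    'DLC': [8, 4, 6, 7],
})

wanted = [4, 6]  # hypothetical list of DLC values to keep

by_query = toy.query('DLC in @wanted')   # @wanted refers to the variable above
by_mask = toy[toy['DLC'].isin(wanted)]   # equivalent boolean-mask form

print(by_query)
```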


### Condition chaining

In [13]:
condition1 = (df['Data'] == "00 00 00 00 00 00 08 EB")
condition2 = (df['Timestamp'] >= 50)
df3 = df[condition1 & condition2]
# df3 = df.query('Data == "00 00 00 00 00 00 08 EB" and Timestamp >= 50')
df3

Unnamed: 0,Timestamp,Arbitration_ID,DLC,Data,Class
280,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
1037,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
1809,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
2579,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
3346,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
...,...,...,...,...,...
176274,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
177046,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
177810,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
178579,1.597708e+09,391,8,00 00 00 00 00 00 08 EB,Normal
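
Boolean conditions combine with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `==` or `>=`. A sketch on made-up rows; the `Attack` label is an illustrative value, not taken from the output above.

```python
import pandas as pd

# Toy frame; all rows are made up for illustration.
toy = pd.DataFrame({
    'Arbitration_ID': ['391', '260', '391', '164'],
    'DLC': [8, 8, 8, 4],
    'Class': ['Normal', 'Normal', 'Attack', 'Normal'],
})

is_391 = (toy['Arbitration_ID'] == '391')
is_normal = (toy['Class'] == 'Normal')

both = toy[is_391 & is_normal]    # rows matching both conditions
either = toy[is_391 | is_normal]  # rows matching at least one condition
not_391 = toy[~is_391]            # rows that do not match the first condition

print(len(both), len(either), len(not_391))
```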


# Checking the integrity of the dataset

## Q1. Are there any missing values (also known as NA or NaN)?

In [14]:
df.isna().any()

Timestamp         False
Arbitration_ID    False
DLC               False
Data              False
Class             False
dtype: bool
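
`any()` only says whether a column contains a missing value; `sum()` says how many. A sketch on a made-up frame with one deliberately missing entry:

```python
import pandas as pd

# Toy frame with one missing payload (values are illustrative).
toy = pd.DataFrame({
    'DLC': [8, 4, 8],
    'Data': ['00 01', None, 'FF'],
})

has_na = toy.isna().any()                   # per column: is anything missing?
na_counts = toy.isna().sum()                # per column: how many are missing?
rows_with_na = toy[toy.isna().any(axis=1)]  # the rows that contain a missing value

print(na_counts)
print(rows_with_na)
```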

## Q2. Are the CAN messages properly sorted by timestamp?

In [15]:
# case 1. manual iteration
for i in range(len(df['Timestamp']) - 1):
    if not (df.loc[i + 1, 'Timestamp'] > df.loc[i, 'Timestamp']):
        print('[Case 1] Something went wrong.')
        break
else:
    print('[Case 1] The dataset is sorted by timestamp.')

# case 2. Pandas API
is_sorted = df['Timestamp'].is_monotonic_increasing
if is_sorted:
    print('[Case 2] The dataset is sorted by timestamp.')
else:
    print('[Case 2] Something went wrong.')

[Case 1] The dataset is sorted by timestamp.
[Case 2] The dataset is sorted by timestamp.
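
The two checks are not exactly equivalent: the loop requires each timestamp to be strictly greater than the previous one, while `is_monotonic_increasing` also accepts equal neighbours. The sketch below uses made-up timestamps containing a tie to show the difference; using `diff()` is one possible way to do the strict check without a loop.

```python
import pandas as pd

# Made-up timestamps with a tie between the second and third value.
ts = pd.Series([0.0, 0.5, 0.5, 1.25])

# Non-strict check: equal neighbours are allowed.
non_strict = bool(ts.is_monotonic_increasing)

# Strict check: every difference between neighbours must be positive.
# The first diff is NaN (no predecessor), so it is dropped.
strict = bool((ts.diff().dropna() > 0).all())

print(non_strict, strict)
```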


## Q3. Is data pre-processing necessary?

Check the data types.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179346 entries, 0 to 179345
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Timestamp       179346 non-null  float64
 1   Arbitration_ID  179346 non-null  object 
 2   DLC             179346 non-null  int64  
 3   Data            179346 non-null  object 
 4   Class           179346 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 6.8+ MB


Two problems:
1. `Timestamp` is a raw Unix epoch value in seconds, which is hard to read.
2. `Arbitration_ID` should be an integer, but it is stored as a string (dtype `object`).

We will derive two new timestamp fields:
 - `abstime`: absolute time
 - `monotime`: elapsed time *starting at 0*

In [17]:
df['abstime'] = pd.to_datetime(df['Timestamp'], unit='s').round('us')
df['monotime'] = df['Timestamp'] - df['Timestamp'].min()
df[['Timestamp', 'abstime', 'monotime']]

Unnamed: 0,Timestamp,abstime,monotime
0,1.597708e+09,2020-08-17 23:43:47.052591,0.000000
1,1.597708e+09,2020-08-17 23:43:47.053980,0.001389
2,1.597708e+09,2020-08-17 23:43:47.054670,0.002079
3,1.597708e+09,2020-08-17 23:43:47.054904,0.002313
4,1.597708e+09,2020-08-17 23:43:47.055140,0.002549
...,...,...,...
179341,1.597708e+09,2020-08-17 23:45:01.706675,74.654084
179342,1.597708e+09,2020-08-17 23:45:01.706905,74.654314
179343,1.597708e+09,2020-08-17 23:45:01.707144,74.654553
179344,1.597708e+09,2020-08-17 23:45:01.707378,74.654787
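
As a sanity check, the same two conversions can be tried on three made-up epoch values. This is a sketch, not part of the original notebook. `unit='s'` tells `pd.to_datetime` to read the numbers as seconds since 1970-01-01, and `elapsed` is a hypothetical helper variable used only to compare the two new fields.

```python
import pandas as pd

# Made-up epoch timestamps in seconds (illustrative values).
toy = pd.DataFrame({'Timestamp': [1597707827.0, 1597707827.5, 1597707829.0]})

# Absolute time: interpret the floats as seconds since the Unix epoch.
toy['abstime'] = pd.to_datetime(toy['Timestamp'], unit='s').round('us')

# Monotonic time: seconds elapsed since the first message.
toy['monotime'] = toy['Timestamp'] - toy['Timestamp'].min()

# The elapsed time derived from abstime should agree with monotime.
elapsed = (toy['abstime'] - toy['abstime'].iloc[0]).dt.total_seconds()

print(toy)
```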


In [18]:
def func_hexstr_to_int(value):
    return int(value, 16)

df['aid_int'] = df['Arbitration_ID'].map(func_hexstr_to_int)  # parse each hex string as an integer
df[['Arbitration_ID', 'aid_int']]

Unnamed: 0,Arbitration_ID,aid_int
0,260,608
1,329,809
2,38D,909
3,420,1056
4,421,1057
...,...,...
179341,391,913
179342,260,608
179343,421,1057
179344,130,304


In [19]:
0x260, 0x329, 0x38d

(608, 809, 909)
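
The payload in `Data` can be parsed in a similar way. The sketch below takes three rows copied from the outputs above, converts the ID with `int(value, 16)`, and turns the space-separated hex payload into `bytes` with `bytes.fromhex`, which ignores the spaces. Comparing the payload length with `DLC` assumes that `DLC` holds the payload length in bytes, as the displayed rows suggest. `data_bytes` and `length_ok` are hypothetical column names.

```python
import pandas as pd

# Three rows copied from the outputs shown earlier in this notebook.
toy = pd.DataFrame({
    'Arbitration_ID': ['260', '38D', '164'],
    'DLC': [8, 8, 4],
    'Data': ['06 25 05 30 FF CF 71 55', '00 00 49 00 90 7F FE 01', '00 08 1C FA'],
})

# Hex string -> integer, as in the cell above.
toy['aid_int'] = toy['Arbitration_ID'].map(lambda value: int(value, 16))

# Space-separated hex payload -> bytes object.
toy['data_bytes'] = toy['Data'].map(bytes.fromhex)

# Assumption: DLC equals the number of payload bytes.
toy['length_ok'] = toy['data_bytes'].map(len) == toy['DLC']

print(toy[['aid_int', 'data_bytes', 'length_ok']])
```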