## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics

## Data Cleaning Level Ⅰ - Clean up the table
We consider a dataset clean at Level Ⅰ when it has the following characteristics:
- It is in a standard and preferred data structure.
- It has codable and intuitive column titles.
- Each row has a unique identifier.
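
The last two characteristics can be checked programmatically. The following is only a minimal sketch on a made-up table: `toy_df`, its column names, and its values are hypothetical, and treating "codable" as "a valid Python identifier" is one possible reading rather than a definition from this chapter.

```python
import pandas as pd

# Hypothetical toy table; column names and values are invented for illustration.
toy_df = pd.DataFrame(
    {"customer_id": [101, 102, 103], "total_spent": [25.0, 40.5, 13.2]}
)

# Assumed reading of "codable": every title is a valid Python identifier,
# so it is short enough and safe enough to type in code.
codable = all(str(col).isidentifier() for col in toy_df.columns)

# A unique identifier means that no index label repeats.
toy_df = toy_df.set_index("customer_id")
unique_rows = toy_df.index.is_unique

print(codable, unique_rows)
```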

<img src="https://drive.google.com/uc?id=1Zy4HQLazo8lFI_Z-XbKbcXJ8K-ifHp-u" width="900"/>

In [None]:
import numpy as np
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Example 1.1 – Reindexing (Multi-level Indexing)

In [None]:
air_df = pd.read_csv('Temp Data.csv')
air_df

In [None]:
air2016_df = air_df.drop(columns=['Year'])

In [None]:
# Use Month, Day, and Time together as a multi-level row index
air2016_df.set_index(['Month','Day','Time'],inplace=True)

In [None]:
air2016_df

In [None]:
# Look up one row by giving one label per index level: Month=2, Day=24, Time='00:30:00'
air2016_df.loc[2,24,'00:30:00']

*When data values are used as the index (rather than row numbers)*
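
To try label-based lookup on a multi-level index without the 'Temp Data.csv' file, here is a minimal sketch on a made-up table. `toy_df`, the `Temp` column name, and all values are assumptions for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the temperature data; values are invented.
toy_df = pd.DataFrame({
    "Month": [1, 1, 2, 2],
    "Day": [1, 1, 24, 24],
    "Time": ["00:00:00", "00:30:00", "00:00:00", "00:30:00"],
    "Temp": [10.5, 10.1, 12.3, 12.0],
})
toy_indexed = toy_df.set_index(["Month", "Day", "Time"])

# A tuple with one label per level, plus a column label, selects one cell.
temp_value = toy_indexed.loc[(2, 24, "00:30:00"), "Temp"]

# A partial key (fewer labels than levels) selects every matching row.
feb_24 = toy_indexed.loc[(2, 24), :]

print(temp_value)
print(feb_24)
```

Writing the row key as an explicit tuple, followed by a column selector, keeps it clear which labels refer to rows and which to columns.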

> **Property** *DataFrame.loc*

(a) Access a group of rows and columns by label(s) or a boolean array.

(b) .loc[] is primarily label based, but may also be used with a boolean array.

Allowed inputs are:
*   A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
*   A list or array of labels, e.g. ['a', 'b', 'c'].
*   A slice object with labels, e.g. 'a':'f'.
*   A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
*   An alignable boolean Series. The index of the key will be aligned before masking.
*   An alignable Index. The Index of the returned selection will be the input.
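
The most common of these input types can be tried on a small frame. This is an illustrative sketch only: `toy_df`, its labels, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with string row labels; values are invented.
toy_df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

single = toy_df.loc["b"]                          # single label: that row as a Series
several = toy_df.loc[["a", "c"]]                  # list of labels: a DataFrame
sliced = toy_df.loc["a":"c"]                      # label slice: includes both endpoints
masked = toy_df.loc[[True, False, True, False]]   # boolean list, same length as the axis
aligned = toy_df.loc[toy_df["x"] > 2]             # boolean Series, aligned on the index

print(sliced)
```

Note that a label slice such as `'a':'c'` includes the stop label, unlike ordinary Python slicing.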

> **Property** *DataFrame.iloc* (integer location), used to select rows by row number

(a) Purely integer-location based indexing for selection by position.

(b) .iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:
*   An integer, e.g. 5.
*   A list or array of integers, e.g. [4, 3, 0].
*   A slice object with ints, e.g. 1:7.
*   A boolean array.
*   A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).
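
The same kind of toy frame shows how position-based selection differs from label-based selection. Again, this is a sketch under assumptions: `toy_df` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; row labels are strings, so positions and labels differ.
toy_df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

first = toy_df.iloc[0]                             # integer: first row as a Series
picked = toy_df.iloc[[3, 0]]                       # list of ints: rows in the requested order
sliced = toy_df.iloc[1:3]                          # int slice: stop position is excluded
masked = toy_df.iloc[[True, False, False, True]]   # boolean list, same length as the axis
cell = toy_df.iloc[0, 1]                           # (row, column) positions: a single value

print(sliced)
```

Integer slices follow ordinary Python rules, so `1:3` returns positions 1 and 2 only.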

### Example 1.2 – Intuitive but long column titles

In [None]:
response_df = pd.read_csv('OSMI 2019.csv')
response_df.head(1)

In [None]:
response_df['Do you know the options for mental health care available under your employer-provided health coverage?']

In [None]:
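# Short substitute titles 'Q1' through 'Q82', one per column of response_df
keys = ['Q{}'.format(i) for i in range(1,83)]

# Lookup table: short key -> original long column title
columns_dic = pd.Series(response_df.columns,index=keys)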

1. First, the code creates *keys*, a list of short substitutes ('Q1' through 'Q82') for the column titles, using a list comprehension.
2. Second, the code creates a pandas Series called *columns_dic*, whose index is *keys* and whose values are the original titles in *response_df.columns*, so each long title can still be looked up by its short key.
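
The two steps can be tried end to end on a tiny made-up survey table. This is a sketch under assumptions: `toy_df` and its question texts are hypothetical, and the number of keys is derived from the column count instead of being hardcoded, which is a small variation on the code above.

```python
import pandas as pd

# Hypothetical survey responses; the question texts are invented.
toy_df = pd.DataFrame({
    "How many employees does your company have?": ["26-100"],
    "Is your employer primarily a tech company?": ["Yes"],
})

# Step 1: one short key per column (Q1, Q2, ...).
keys = ["Q{}".format(i) for i in range(1, len(toy_df.columns) + 1)]

# Step 2: keep the long titles in a Series indexed by the short keys.
columns_dic = pd.Series(toy_df.columns, index=keys)

# Replace the long titles with the short keys.
toy_df.columns = keys

# The original wording of any question is still available by key.
print(columns_dic["Q2"])
```

Storing the original titles before overwriting `columns` matters, because the long titles are no longer in the DataFrame after the assignment.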

In [None]:
columns_dic['Q4']

In [None]:
response_df.columns = keys

In [None]:
response_df.head(1)