# Data Preparation

## Data Preparation with Pandas

This chapter shows how to use the pandas package to import and preprocess data. Preprocessing is the step of examining raw data and converting it to a standard, normalized format before analysis.
Preprocessing covers several tasks, including:

* missing values
* data normalization 
* data standardization 
* data binning

In this session we will deal only with missing values.
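
The four tasks listed above can be sketched on a toy column. This is only an illustration: the DataFrame `toy` and its column `x` are made up, and it takes "normalization" to mean min-max scaling and "standardization" to mean z-scores, which is one common reading of those terms.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name 'x' is an illustrative assumption.
toy = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0, 5.0]})

# Missing values: count them, then drop the incomplete rows.
n_missing = toy['x'].isna().sum()
clean = toy.dropna().copy()

# Normalization (min-max): rescale the values to the range [0, 1].
x = clean['x']
clean['x_norm'] = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to mean 0, scale to unit standard deviation.
clean['x_std'] = (x - x.mean()) / x.std()

# Binning: group the values into two equal-width intervals.
clean['x_bin'] = pd.cut(x, bins=2, labels=['low', 'high'])

print(n_missing)
print(clean)
```

Dropping rows is only one way to handle missing values; filling them with `fillna` is another option, depending on the analysis.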

#### Importing data 

This tutorial uses the Iris dataset, which can be downloaded from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris.

The Iris dataset is probably the best-known dataset in the pattern recognition literature. Fisher's paper is considered a classic in the field and is still cited frequently (see, for example, Duda & Hart). The dataset contains three classes of 50 instances each, where each class refers to a species of iris plant. The three classes are:

* Iris Setosa
* Iris Versicolour
* Iris Virginica

To begin, use the pandas library to import the data and load it into a DataFrame.

In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/iris.data'

iris = pd.read_csv(url, header=None, 
                   names = ['sepal_length', 'sepal_width', 
                            'petal_length', 'petal_width', 'class'])

type(iris)

pandas.core.frame.DataFrame

Here, `header=None` tells pandas that the file has no header row, and `names` supplies the column names as a list.

The resulting object `iris` is a pandas DataFrame. 
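
To see why `header=None` matters, the sketch below reads the same two-line text with and without it. The text is an inline stand-in for the file (its two rows are the first two rows of the dataset), so the example runs without network access; the variable names are illustrative.

```python
import io
import pandas as pd

# Two rows in the same layout as iris.data (no header line).
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None: every line is data, and `names` supplies the column labels.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# Default behaviour: the first line is consumed as the column labels.
default = pd.read_csv(io.StringIO(raw))

print(with_names.shape)       # both lines are kept as data rows
print(default.shape)          # one data row is lost to the header
print(list(default.columns))  # the first observation became the labels
```

Without `header=None`, the first observation would silently disappear from the data, which is easy to miss on a 150-row file.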

To get an idea of the contents of the dataset, we print its first 10 rows with the `head(10)` method and its last 10 rows with the `tail(10)` method.

In [3]:
iris.head(10)

|  | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [4]:
iris.tail(10)

|  | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


To get the names of the columns (the variable names), use the `columns` attribute.

In [5]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')

To extract the `class` column, index the DataFrame with the column name:

In [6]:
iris['class']

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object

Selecting a single column returns a pandas Series, a one-dimensional labeled array that can hold any type of data (integers, strings, floats, Python objects, etc.).

In [7]:
type(iris['class'])

pandas.core.series.Series
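
As a small sketch of how the selection syntax affects the result type, the example below uses a three-row stand-in for `iris` (rows 0, 1 and 149 of the dataset, two columns only). The variable names are illustrative.

```python
import pandas as pd

# Small stand-in for the iris DataFrame, so the example runs offline.
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 5.9],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

one_col = sample['class']       # a single name in brackets gives a Series
sub_frame = sample[['class']]   # a list of names gives a DataFrame

# Count how many rows carry each class label.
counts = one_col.value_counts()

print(type(one_col).__name__, type(sub_frame).__name__)
print(counts)
```

Applied to the full dataset, `iris['class'].value_counts()` should report 50 rows for each of the three classes if the file matches the description above.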

In [8]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class            object
dtype: object

In Jupyter, tab completion for column names and public attributes (`iris.<TAB>`) is enabled by default.

For example, type `iris.` and then press the TAB key. Look for the `shape` attribute.

The `shape` attribute of a pandas DataFrame stores the number of rows and columns as a tuple `(number of rows, number of columns)`.

In [9]:
iris.shape

(150, 5)
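
Because `shape` is a plain tuple, it can be unpacked directly. A minimal sketch on a made-up three-row DataFrame (the names `sample`, `a` and `b` are illustrative):

```python
import pandas as pd

# Hypothetical toy DataFrame with 3 rows and 2 columns.
sample = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# shape is an attribute, so it is read without parentheses.
n_rows, n_cols = sample.shape

print(n_rows, n_cols)
print(len(sample))    # len() gives the number of rows
print(sample.size)    # size gives rows * columns
```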

In [10]:
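# Note: `info` is a method. Referenced without parentheses, it is not called,
# so the output below is the bound method object itself.
iris.info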

<bound method DataFrame.info of      sepal_length  sepal_width  petal_length  petal_width           class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]>
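
Calling `info()` with parentheses prints a concise summary of the columns, their non-null counts and their dtypes. The sketch below uses a three-row stand-in for `iris`; the missing value in it is made up to show how the non-null count reflects missing data.

```python
import io
import pandas as pd

# Small stand-in for iris; the None entry is a made-up missing value.
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, None],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

# Calling the method prints the summary to standard output and returns None.
sample.info()

# The same summary can be captured as text through the `buf` parameter.
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()
```

The non-null count per column is a quick first check for missing values before any cleaning step.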

In [11]:
import pandas as pd

values = {'dates':  ['20210305','20210316','20210328'],
          'status': ['Opened','Opened','Closed']
          }

demo = pd.DataFrame(values)

In [12]:
demo

|  | dates | status |
|---|---|---|
| 0 | 20210305 | Opened |
| 1 | 20210316 | Opened |
| 2 | 20210328 | Closed |


In [13]:
demo['dates'] = pd.to_datetime(demo['dates'], format='%Y%m%d')

In [14]:
demo

|  | dates | status |
|---|---|---|
| 0 | 2021-03-05 | Opened |
| 1 | 2021-03-16 | Opened |
| 2 | 2021-03-28 | Closed |
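
The displayed values alone do not show that the column type changed. The sketch below repeats the conversion and checks the dtype before and after, then uses the `.dt` accessor that datetime columns provide. The variable names `before`, `after`, `days` and `span` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'dates': ['20210305', '20210316', '20210328'],
                     'status': ['Opened', 'Opened', 'Closed']})

before = demo['dates'].dtype   # strings before the conversion
demo['dates'] = pd.to_datetime(demo['dates'], format='%Y%m%d')
after = demo['dates'].dtype    # a datetime64 dtype after the conversion

# Datetime columns expose date components through the .dt accessor.
days = demo['dates'].dt.day.tolist()

# Datetime arithmetic works directly: time between first and last date.
span = demo['dates'].max() - demo['dates'].min()

print(before, after)
print(days)
print(span)
```

Passing an explicit `format` avoids ambiguity in how compact strings such as `20210305` are parsed.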
