# Pandas - Data Analysis Library

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

## Importing Pandas

In [1]:
import pandas as pd

In [2]:
pd.__version__

'1.1.3'

## Series

In [3]:
pd.Series(data = [1,2,3,4])

0    1
1    2
2    3
3    4
dtype: int64

## DataFrame

In [4]:
pd.DataFrame(data= {"Nama" : ["Selly", "Emir"], "Umur": [12, 13]})

| | Nama | Umur |
|---|---|---|
| 0 | Selly | 12 |
| 1 | Emir | 13 |


![](pandas/series-and-dataframe.width-1200.png)

## Creating DataFrame from dictionary

In [5]:
df = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}

In [6]:
df0 = pd.DataFrame(df)
df0

| | col_1 | col_2 |
|---|---|---|
| 0 | 3 | a |
| 1 | 2 | b |
| 2 | 1 | c |
| 3 | 0 | d |


Specify orient='index' to create the DataFrame using dictionary keys as rows:

In [7]:
df2 = pd.DataFrame.from_dict(df, orient='index')
df2

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |


When using orient='index', the column names can be specified manually:

In [8]:
df3 = pd.DataFrame.from_dict(df, orient='index',
                       columns=['A', 'B', 'C', 'D'])
df3

| | A | B | C | D |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |


The column names can be inspected and changed through the `columns` attribute:

In [9]:
df3.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [10]:
df3.columns = ["first", "second", "third", "fourth"]

In [11]:
df3

| | first | second | third | fourth |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |


In [12]:
df3.columns.values[0]

'first'

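In [13]:
# Writing into df3.columns.values mutates the index in place, which pandas does not support;
# rename returns a new DataFrame with only the listed column renamed
df3 = df3.rename(columns={"first": "zero"})

In [14]:
df3 = df3.rename(columns={"second": "satu"})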

In [15]:
df3

| | zero | satu | third | fourth |
|---|---|---|---|---|
| col_1 | 3 | 2 | 1 | 0 |
| col_2 | a | b | c | d |


### Exercise 1

1. Create the following dataframe
![](pandas/ex00.png)

2. Rename the "Location" column to "City"

## Open CSV file

We will use a dataset of Uber drives from 2016. The data can be obtained from Kaggle (https://www.kaggle.com/zusmani/uberdrives).

In [16]:
data = pd.read_csv("datasets/My Uber Drives - 2016.csv")
data

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/My Uber Drives - 2016.csv'
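The cell above could not find the CSV at that relative path, so the file has to be downloaded and placed in a `datasets/` folder next to the notebook first. To follow along without the file, here is a minimal sketch that reads a small stand-in through `io.StringIO`. The column names are the ones used later in this notebook, while the rows (including the trailing `Totals` row) are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration; only the column names come from this notebook
sample_csv = io.StringIO(
    "START_DATE*,END_DATE*,START*,STOP*,MILES*\n"
    "1/1/2016 21:11,1/1/2016 21:17,Fort Pierce,Fort Pierce,5.1\n"
    "1/2/2016 1:25,1/2/2016 1:37,Fort Pierce,West Palm Beach,5.0\n"
    "Totals,,,,10.1\n"
)

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(sample_csv)
print(sample.shape)
print(sample.dtypes)
```

The same `pd.read_csv` call works unchanged once a real path is passed instead of the in-memory buffer.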

### Basic Operations

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.shape

In [None]:
data.dtypes

### Convert data type

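The output of `data.dtypes` shows that `START_DATE*` and `END_DATE*` are stored as `object` columns, although they actually hold dates. The conversion functions are first tried on a small example DataFrame.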

In [None]:
data1 = pd.DataFrame({"Cost":["5","5","7"],"Amount":[11,12,13],"Date": ["11-10-2020","12-10-2020","13-10-2020"]})
data1

In [None]:
data1.dtypes

In [None]:
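# the dates are day-first (13-10-2020 has no valid month-first reading), so state the format explicitly
data1["Date"] = pd.to_datetime(data1["Date"], format="%d-%m-%Y")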

In [None]:
data1["Cost"] = pd.to_numeric(data1["Cost"])

In [None]:
data1

In [None]:
data1.dtypes
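When a column contains values that are not clean numbers, `pd.to_numeric` raises by default. The sketch below, using made-up values, shows two options that may be useful: `errors="coerce"` to turn unparseable entries into `NaN`, and `astype` for columns where every value is already a valid number string.

```python
import pandas as pd

costs = pd.Series(["5", "5", "7", "n/a"])  # made-up values for illustration

# errors="coerce" replaces values that cannot be parsed with NaN
numeric = pd.to_numeric(costs, errors="coerce")
print(numeric.tolist())

# astype works when every value is a clean number string
clean = pd.Series(["5", "5", "7"]).astype("int64")
print(clean.dtype)
```

Note that a column containing `NaN` ends up as a float column, because the default integer dtype cannot hold missing values.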

#### Apply to our dataframe

In [None]:
# convert data to datetime format
pd.to_datetime(data["START_DATE*"], format='%m/%d/%Y %H:%M')

In [None]:
pd.to_datetime(data["START_DATE*"],format='%m/%d/%Y %H:%M', errors = 'coerce')

In [None]:
data.dtypes

Why is `START_DATE*` still `object`? Because `pd.to_datetime` returns a new Series; the DataFrame stays unchanged until the result is assigned back to the column.

In [None]:
data["START_DATE*"] = pd.to_datetime(data["START_DATE*"],format='%m/%d/%Y %H:%M', errors = 'coerce')

In [None]:
data.dtypes
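The difference between calling `pd.to_datetime` and assigning its result can be reproduced on a tiny stand-in frame. This is a sketch with a hypothetical column name `when` and made-up values, including one string that is not a date.

```python
import pandas as pd

# Hypothetical column with two dates and one value that is not a date
frame = pd.DataFrame({"when": ["1/1/2016 21:11", "1/2/2016 1:25", "Totals"]})

# The call returns a new Series; frame itself is unchanged
converted = pd.to_datetime(frame["when"], format="%m/%d/%Y %H:%M", errors="coerce")
print(frame["when"].dtype)   # still the original string column
print(converted.dtype)       # a datetime64 dtype

# Assigning the result back is what changes the column in the DataFrame
frame["when"] = converted
print(frame["when"].isna().sum())  # counts values coerced to NaT
```

With `errors="coerce"`, values that do not match the format become `NaT` instead of raising an error, so it is worth counting them after the conversion.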

In [None]:
data["END_DATE*"] = pd.to_datetime(data["END_DATE*"],format='%m/%d/%Y %H:%M', errors = 'coerce')

In [None]:
data.dtypes

In [None]:
data

In [None]:
data2 = pd.read_csv("datasets/My Uber Drives - 2016.csv")
data2

### Dataset summarization

In [None]:
data.describe()

In [None]:
data.describe(include='all')

In [None]:
data.info()

In [None]:
# count of unique start locations
data["START*"].value_counts()

### Exercise 2

1. Create the following dataframe with "Umur" stored as object type, then convert that column to integer
![](pandas/ex1.png)

2. Go to Kaggle, download the Titanic dataset, and do a basic exploration of it with:\
head, tail, describe, info, size, shape

## Data Manipulation Tasks

There are five common data manipulation tasks:
1. Selecting/Indexing
2. Filtering
3. Sorting
4. Mutating/conditionally adding columns
5. Groupby/summarize
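
As a quick preview, the five tasks listed above can be sketched on a toy DataFrame. The column names and values here are made up for illustration and do not come from the Uber dataset.

```python
import pandas as pd

# Made-up trips for illustration
trips = pd.DataFrame({
    "start": ["Cary", "Cary", "Durham", "Apex"],
    "miles": [5.0, 12.0, 3.5, 20.5],
})

selected = trips[["start", "miles"]]                     # 1. selecting columns
long_trips = trips[trips["miles"] > 5]                   # 2. filtering rows
by_miles = trips.sort_values("miles", ascending=False)   # 3. sorting
trips["is_long"] = trips["miles"] > 10                   # 4. adding a column from a condition
total = trips.groupby("start")["miles"].sum()            # 5. group by and summarize
print(total)
```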

## 1. Selecting/Indexing

### `loc` and `iloc`

![](pandas/loc.png)

In [None]:
data.head()

### Positional indexing

In [None]:
data.iloc[0:3, [1,3]]

In [None]:
data.iloc[:, 3:6]

In [None]:
data.iloc[1:3, 3:6]

### Label indexing

In [None]:
data.loc[0:5, :"START*"]
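
One difference between the two indexers is easy to miss: `iloc` slices exclude the stop position, while `loc` slices include the stop label. A minimal sketch on a made-up frame with the default integer index:

```python
import pandas as pd

# Made-up frame for illustration
frame = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": ["w", "x", "y", "z"]}
)

by_position = frame.iloc[0:2, 0:2]   # rows 0-1, columns a-b: stop is excluded
by_label = frame.loc[0:2, "a":"b"]   # rows 0-2, columns a-b: both ends included
print(by_position.shape)
print(by_label.shape)
```

This is why `data.loc[0:5, ...]` returns six rows, whereas `data.iloc[0:5, ...]` would return five.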

In [None]:
data.loc[:, ["START_DATE*", "MILES*"]].head()

In [None]:
data[["START_DATE*", "MILES*"]]

In [None]:
a = data.loc[:, "START*"]
a

In [None]:
type(a)

In [None]:
b = data.loc[:, ["START*"]].head()
b

In [None]:
type(b)

##### Note: a single label returns a Series, a list of labels returns a DataFrame, and not every DataFrame method is available on a Series
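
The Series/DataFrame distinction can be checked directly. This sketch uses a made-up two-column frame, and `to_frame` is shown as one way to get a DataFrame back from a Series.

```python
import pandas as pd

# Made-up frame for illustration
frame = pd.DataFrame({"start": ["Cary", "Durham"], "miles": [5.0, 3.5]})

as_series = frame.loc[:, "start"]      # single label -> Series
as_frame = frame.loc[:, ["start"]]     # list of labels -> DataFrame
print(type(as_series).__name__, type(as_frame).__name__)

# A Series has no `columns` attribute, while a DataFrame does
print(hasattr(as_series, "columns"), hasattr(as_frame, "columns"))

# to_frame converts a Series into a one-column DataFrame
print(as_series.to_frame().equals(as_frame))
```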

### Exercise 3

1. Select columns: `START_DATE*, START*, STOP*`

2. Extract the first & last 10 rows of the previous columns