# Data Manipulation With Python

Data manipulation here does not mean fabricating data or making it inconsistent with its original values. It means reshaping and simplifying the data so that it is easier for the machine to analyze.

Import the libraries that will be needed.

In [None]:
import pandas as pd
import numpy as np

Pandas has two main objects: Series and DataFrame.

# The Series Object

A Series holds one-dimensional data. It has no column name because it has only one column, and it has an index.

In [None]:
data = [0.25, 0.50, 0.75, 1]

Convert the data into a Series.

In [None]:
data = pd.Series(data)

In [None]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Convert the Series to an array.

In [None]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])
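As a side note, the pandas documentation recommends `Series.to_numpy()` as an alternative to `.values` when a NumPy array is wanted. The sketch below uses a throwaway variable `s` (an assumed name, not part of the notebook) to show that both give the same array for this data.

```python
import numpy as np
import pandas as pd

# Throwaway Series with the same values as the notebook's data
s = pd.Series([0.25, 0.50, 0.75, 1])

# to_numpy() returns the values as a NumPy array
arr = s.to_numpy()

print(type(arr))
print(np.array_equal(arr, s.values))
```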

Display the index.

The index is a range, where the start point is included in the range and the stop point is excluded.

In [None]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
list(range(1,10))

[1, 2, 3, 4, 5, 6, 7, 8, 9]

How to access a value by its index:

In [None]:
data[2]

0.75

The implicit index is the default index.

We can also define our own index; this is called the explicit index.

When defining an index, the number of index labels must equal the number of data values.

In [None]:
data = pd.Series([0.25, 0.50, 0.75, 1], index=['a','b','c','d'])

In [None]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [None]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [None]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')
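The rule that the number of index labels must match the number of values can be checked directly. This is a minimal sketch with assumed names (`values`, `mismatch_error`): it passes four values but only three labels and catches the error pandas raises.

```python
import pandas as pd

values = [0.25, 0.50, 0.75, 1]

try:
    # Four values but only three labels: pandas refuses to build the Series
    pd.Series(values, index=['a', 'b', 'c'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(mismatch_error)
```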

Access the data by its explicit index.

In [None]:
# explicit index

data['a']

0.25

This is data selection.

Even with an explicit index defined, we can still access values through the implicit index.

In [None]:
# implicit index

data[3]

1.0
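Recent pandas versions (2.1 and later, as far as we know) emit a FutureWarning when an integer is used as a position on a label-based index like this one. A sketch that avoids the ambiguity uses the `.loc` and `.iloc` accessors, which are covered in a later section; the variable names here are assumptions for illustration.

```python
import pandas as pd

s = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

by_label = s.loc['d']     # explicit index: the label 'd'
by_position = s.iloc[3]   # implicit index: the fourth position

print(by_label, by_position)
```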

When the explicit index is itself made of integers, like the implicit index, accessing a value with data[...] relies only on the explicit index.

In [None]:
data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2,5,3,7])

In [None]:
data_2[2]

0.25

In [None]:
data_2[0]

KeyError: ignored

Next, we will try slicing the data.

In [None]:
data = pd.Series([0.25, 0.50, 0.75, 1], index=['a','b','c','d'])

In [None]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

For example, we will select the data from b to c.

In [None]:
data['b':'c'] # explicit index

b    0.50
c    0.75
dtype: float64

But when we slice by the implicit index, only the starting point appears, because the implicit index behaves like a range: the stop point is excluded. A slice on the explicit index, by contrast, includes both endpoints.

In [None]:
data[1:2] # implicit index

b    0.5
dtype: float64
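The two slicing behaviours can be put side by side. This is a small sketch with assumed variable names; it uses `.iloc` for the positional slice so that the intent is explicit.

```python
import pandas as pd

s = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

label_slice = s['b':'c']       # label-based: 'b' and 'c' are both included
position_slice = s.iloc[1:2]   # position-based: position 2 is excluded

print(list(label_slice.index))     # ['b', 'c']
print(list(position_slice.index))  # ['b']
```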

# loc and iloc

An example of data whose explicit index consists of integers, just like the implicit index:

In [None]:
data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2,5,3,7])

In [None]:
data_2

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

When we select a single value, the explicit index is used.

In [None]:
data_2[2] # explicit index: selecting

0.25

But when we slice from index 2 to index 3, the values that appear come from the implicit index instead.

In [None]:
data_2[2:3] # implicit index: slicing

3    0.75
dtype: float64

When the explicit index consists of integers like the implicit index, inconsistencies like the case above arise.

To overcome this inconsistency, we use loc and iloc.

loc accesses data by its explicit index.

iloc accesses data by its implicit index.

In [None]:
# loc

data_2.loc[3] # select by explicit index

0.75

In [None]:
data_2.loc[2:3] # slice by explicit index

2    0.25
5    0.50
3    0.75
dtype: float64

In [None]:
# iloc

data_2.iloc[3] # select by implicit index

1.0

In [None]:
data_2.iloc[2:3] # slice by implicit index

3    0.75
dtype: float64
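Note that the label slice `loc[2:3]` above also returned label 5, because a label slice covers everything stored between the two labels. If only labels 2 and 3 are wanted, a list of labels is one option. This is a sketch with assumed variable names:

```python
import pandas as pd

s2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2, 5, 3, 7])

# A label slice runs from where label 2 sits to where label 3 sits,
# so label 5, which lies between them, is included
between = s2.loc[2:3]

# A list of labels selects exactly those labels and nothing else
only_2_and_3 = s2.loc[[2, 3]]

print(list(between.index))       # [2, 5, 3]
print(list(only_2_and_3.index))  # [2, 3]
```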


A DataFrame is a collection of Series that share the same index, with at least one Series.

In [None]:
dict_populasi = {'Jakarta':750, 
                 'Bogor':490,
                 'Depok':350,
                 'Tanggerang':270,
                 'Bekasi':670}

# This is just an example, not real population figures

In [None]:
dict_populasi

{'Jakarta': 750, 'Bogor': 490, 'Depok': 350, 'Tanggerang': 270, 'Bekasi': 670}

In [None]:
# convert the dictionary to a Series

populasi = pd.Series(dict_populasi)

In [None]:
populasi

Jakarta       750
Bogor         490
Depok         350
Tanggerang    270
Bekasi        670
dtype: int64

In [None]:
populasi.loc['Depok']

350

In [None]:
populasi.iloc[2]

350

In [None]:
dict_luas = {'Jakarta':737,
             'Bogor':325,
             'Depok':247,
             'Tanggerang':302,
             'Bekasi':355}

#This is just an example, not a real area number

In [None]:
luas = pd.Series(dict_luas)

In [None]:
luas

Jakarta       737
Bogor         325
Depok         247
Tanggerang    302
Bekasi        355
dtype: int64

In [None]:
daerah = pd.DataFrame({'pop':populasi, 'luas':luas})

In [None]:
daerah

            pop  luas
Jakarta     750   737
Bogor       490   325
Depok       350   247
Tanggerang  270   302
Bekasi      670   355
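When a DataFrame is built from several Series, pandas matches the values by index label rather than by position. The sketch below uses two small made-up Series (`pop` and `area` are assumed names, and the second one deliberately lacks 'Bogor') to show that a missing label becomes NaN.

```python
import pandas as pd

pop = pd.Series({'Jakarta': 750, 'Bogor': 490, 'Depok': 350})
area = pd.Series({'Depok': 247, 'Jakarta': 737})  # different order, no 'Bogor'

# Values are aligned on the index labels, not on their position
df = pd.DataFrame({'pop': pop, 'luas': area})

print(df)
print(pd.isna(df.loc['Bogor', 'luas']))
```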


In [None]:
daerah['luas']['Jakarta']

737

When we access the column with the attribute syntax daerah.pop, the result is as shown below, because pop is also the name of a DataFrame method.

In [None]:
daerah.pop

<bound method NDFrame.pop of             pop  luas
Jakarta     750   737
Bogor       490   325
Depok       350   247
Tanggerang  270   302
Bekasi      670   355>

It is therefore safer to access the column with the syntax daerah['pop'].

In [None]:
daerah['pop']

Jakarta       750
Bogor         490
Depok         350
Tanggerang    270
Bekasi        670
Name: pop, dtype: int64

We rename the column pop to populasi by rebuilding the DataFrame.

In [None]:
daerah = pd.DataFrame({'populasi':populasi, 'luas':luas})

In [None]:
daerah

            populasi  luas
Jakarta          750   737
Bogor            490   325
Depok            350   247
Tanggerang       270   302
Bekasi           670   355


In [None]:
daerah['populasi']

Jakarta       750
Bogor         490
Depok         350
Tanggerang    270
Bekasi        670
Name: populasi, dtype: int64

In [None]:
daerah['populasi']['Jakarta':'Depok'] # explicit index

Jakarta    750
Bogor      490
Depok      350
Name: populasi, dtype: int64

In [None]:
daerah['populasi'].iloc[0:3] # implicit index

Jakarta    750
Bogor      490
Depok      350
Name: populasi, dtype: int64

In [None]:
#add new column

daerah['pop_per_area']=daerah['populasi']/daerah['luas']

In [None]:
daerah

            populasi  luas  pop_per_area
Jakarta          750   737      1.017639
Bogor            490   325      1.507692
Depok            350   247      1.417004
Tanggerang       270   302       0.89404
Bekasi           670   355      1.887324


In [None]:
# add a new row

daerah_tambahan = pd.DataFrame({'Bandung': [151, 148, 0.18]})

In [None]:
daerah_tambahan

   Bandung
0    151.0
1    148.0
2     0.18


In [None]:
daerah_tambahan=daerah_tambahan.T

In [None]:
daerah_tambahan

             0      1     2
Bandung  151.0  148.0  0.18


In [None]:
daerah_tambahan.columns=daerah.columns

In [None]:
daerah_tambahan

         populasi   luas  pop_per_area
Bandung     151.0  148.0          0.18


In [None]:
# combine daerah and daerah_tambahan with concat

pd.concat([daerah, daerah_tambahan])

            populasi   luas  pop_per_area
Jakarta        750.0  737.0      1.017639
Bogor          490.0  325.0      1.507692
Depok          350.0  247.0      1.417004
Tanggerang     270.0  302.0       0.89404
Bekasi         670.0  355.0      1.887324
Bandung        151.0  148.0          0.18
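The new row can also be built in one step, without the transpose and the column reassignment. The sketch below is one possible shortcut rather than the notebook's method: `daerah_demo`, `baris_baru`, and `gabungan` are assumed names, and the data is a two-row stand-in for `daerah`. It also stores the result, since `pd.concat` returns a new DataFrame instead of changing its inputs.

```python
import pandas as pd

# Small stand-in for the notebook's daerah DataFrame
daerah_demo = pd.DataFrame(
    {'populasi': [750, 490], 'luas': [737, 325]},
    index=['Jakarta', 'Bogor'],
)
daerah_demo['pop_per_area'] = daerah_demo['populasi'] / daerah_demo['luas']

# Build the new row directly with its index label and the same column names
baris_baru = pd.DataFrame(
    [[151, 148, 0.18]],
    index=['Bandung'],
    columns=daerah_demo.columns,
)

# concat returns a new DataFrame; assign it to keep the result
gabungan = pd.concat([daerah_demo, baris_baru])

print(gabungan)
```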


In [None]:
# dropping a column is not permanent: drop returns a new DataFrame and daerah itself is unchanged

daerah.drop('pop_per_area', axis=1)

            populasi  luas
Jakarta          750   737
Bogor            490   325
Depok            350   247
Tanggerang       270   302
Bekasi           670   355


In [None]:
daerah

            populasi  luas  pop_per_area
Jakarta          750   737      1.017639
Bogor            490   325      1.507692
Depok            350   247      1.417004
Tanggerang       270   302       0.89404
Bekasi           670   355      1.887324


In [None]:
# dropping a row is not permanent: drop returns a new DataFrame and daerah itself is unchanged

daerah.drop('Bekasi', axis=0)

            populasi  luas  pop_per_area
Jakarta          750   737      1.017639
Bogor            490   325      1.507692
Depok            350   247      1.417004
Tanggerang       270   302       0.89404


In [None]:
daerah

            populasi  luas  pop_per_area
Jakarta          750   737      1.017639
Bogor            490   325      1.507692
Depok            350   247      1.417004
Tanggerang       270   302       0.89404
Bekasi           670   355      1.887324
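To keep the result of a drop, one option is to assign the returned DataFrame to a name. This is a minimal sketch on a two-row stand-in; `df`, `tanpa_luas`, and `df_kecil` are assumed names.

```python
import pandas as pd

df = pd.DataFrame(
    {'populasi': [750, 490], 'luas': [737, 325]},
    index=['Jakarta', 'Bogor'],
)

# drop returns a new DataFrame and leaves df untouched
tanpa_luas = df.drop('luas', axis=1)   # without the 'luas' column
df_kecil = df.drop('Bogor', axis=0)    # without the 'Bogor' row

print(list(df.columns))          # ['populasi', 'luas']
print(list(tanpa_luas.columns))  # ['populasi']
print(list(df_kecil.index))      # ['Jakarta']
```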


In [None]:
#rename column

In [None]:
daerah.columns

Index(['populasi', 'luas', 'pop_per_area'], dtype='object')

In [None]:
# method 1: assign a new list to columns (permanent)

daerah.columns = ['populasi', 'luas_m2', 'pop_per_area']

In [None]:
daerah

            populasi  luas_m2  pop_per_area
Jakarta          750      737      1.017639
Bogor            490      325      1.507692
Depok            350      247      1.417004
Tanggerang       270      302       0.89404
Bekasi           670      355      1.887324


In [None]:
# method 2: rename (not permanent, returns a new DataFrame)

daerah.rename(columns={'populasi':'populasi_daerah','luas_m2':'luas_daerah'})

            populasi_daerah  luas_daerah  pop_per_area
Jakarta                 750          737      1.017639
Bogor                   490          325      1.507692
Depok                   350          247      1.417004
Tanggerang              270          302       0.89404
Bekasi                  670          355      1.887324


In [None]:
daerah

            populasi  luas_m2  pop_per_area
Jakarta          750      737      1.017639
Bogor            490      325      1.507692
Depok            350      247      1.417004
Tanggerang       270      302       0.89404
Bekasi           670      355      1.887324


In [None]:
# method 2: rename with inplace=True (permanent)

daerah.rename(columns={'populasi':'populasi_daerah','luas_m2':'luas_daerah'}, inplace=True)

In [None]:
daerah

            populasi_daerah  luas_daerah  pop_per_area
Jakarta                 750          737      1.017639
Bogor                   490          325      1.507692
Depok                   350          247      1.417004
Tanggerang              270          302       0.89404
Bekasi                  670          355      1.887324


In [None]:
# rename the columns to all caps (not permanent)

daerah.rename(columns=str.upper)

            POPULASI_DAERAH  LUAS_DAERAH  POP_PER_AREA
Jakarta                 750          737      1.017639
Bogor                   490          325      1.507692
Depok                   350          247      1.417004
Tanggerang              270          302       0.89404
Bekasi                  670          355      1.887324


In [None]:
daerah

            populasi_daerah  luas_daerah  pop_per_area
Jakarta                 750          737      1.017639
Bogor                   490          325      1.507692
Depok                   350          247      1.417004
Tanggerang              270          302       0.89404
Bekasi                  670          355      1.887324


In [None]:
# sort by the populasi_daerah column in ascending order

daerah.sort_values('populasi_daerah')

            populasi_daerah  luas_daerah  pop_per_area
Tanggerang              270          302       0.89404
Depok                   350          247      1.417004
Bogor                   490          325      1.507692
Bekasi                  670          355      1.887324
Jakarta                 750          737      1.017639


In [None]:
# sort by the populasi_daerah column in descending order

daerah.sort_values('populasi_daerah', ascending=False)

            populasi_daerah  luas_daerah  pop_per_area
Jakarta                 750          737      1.017639
Bekasi                  670          355      1.887324
Bogor                   490          325      1.507692
Depok                   350          247      1.417004
Tanggerang              270          302       0.89404
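Like drop and rename, sort_values returns a new DataFrame and leaves the original order untouched. The sketch below illustrates this on a three-row stand-in; `df` and `terurut` are assumed names.

```python
import pandas as pd

df = pd.DataFrame(
    {'populasi_daerah': [750, 490, 350], 'luas_daerah': [737, 325, 247]},
    index=['Jakarta', 'Bogor', 'Depok'],
)

# Keep the sorted result under a new name; df keeps its original order
terurut = df.sort_values('populasi_daerah')

print(list(df.index))       # ['Jakarta', 'Bogor', 'Depok']
print(list(terurut.index))  # ['Depok', 'Bogor', 'Jakarta']
```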
