

---

# **Data Manipulation With Python**

---

## Data manipulation here does not mean fabricating data or making it inconsistent with its original values. It means reshaping and simplifying the data so that it is easier for a machine to analyze.


---

## Import the required libraries

---
### Import the pandas and numpy libraries


In [85]:
# Import the pandas and numpy libraries

import pandas as pd
import numpy as np



---

## **1. Object Series**

---

### pandas has two main objects: Series and DataFrame. A Series is one-dimensional: it holds a single column of values together with an index. It has no column names, because there is only one column; at most, the Series itself can carry a name.



---

### Assign List as Variable

---
#### Assign a list of 4 elements, namely 0.25, 0.50, 0.75, and 1, to the variable data

In [86]:
# Assign a list of 4 elements, namely 0.25, 0.50, 0.75, and 1, to the variable data.

data = [0.25, 0.50, 0.75, 1]



---

### Converting List to Series

---
Convert the variable data (a list) into a Series


In [87]:
# Convert list to series using pd.Series()

data = pd.Series(data)



---

### Displaying The Data Series Values

---

Display the Series as two columns: the index and its values

In [88]:
# Display the values, default index, and data type of the variable data by calling the name of the variable

data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

---

### Convert Series to Array

---
#### Converting the data series to the array 


In [89]:
# Converting series to array using series_name.values

data.values

array([0.25, 0.5 , 0.75, 1.  ])
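The conversion above uses the `.values` attribute. pandas also provides the `Series.to_numpy()` method for the same purpose. The sketch below, which rebuilds the same four-element Series so that it runs on its own, checks that both give the same NumPy array; the names `arr_values` and `arr_numpy` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the series used above
data = pd.Series([0.25, 0.50, 0.75, 1])

arr_values = data.values      # attribute form used in the cell above
arr_numpy = data.to_numpy()   # method form

print(type(arr_numpy))
print(np.array_equal(arr_values, arr_numpy))
```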

---

### Displaying Index

---

#### The default index is a range: the start point is included in the range and the stop point is excluded. Here the index starts at 0 and stops before 4, so it holds 0, 1, 2, and 3.

In [90]:
# Displaying the index using variable_name.index

data.index

RangeIndex(start=0, stop=4, step=1)

In [91]:
# Converting range to list

list(range(1,10))

[1, 2, 3, 4, 5, 6, 7, 8, 9]
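The same start-inclusive, stop-exclusive rule can be checked directly on the default index of the series. This is a small sketch that recreates the series so it is self-contained; `index_labels` is an illustrative name.

```python
import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1])

# The default RangeIndex(start=0, stop=4, step=1) behaves like range(0, 4)
index_labels = list(data.index)
print(index_labels)
print(index_labels == list(range(0, 4)))

# The stop value 4 is not part of the index
print(4 in data.index)
```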

---

### How to call data

---

#### We can call the values of the data using the variable name alone, or the variable name together with an implicit or explicit index.

In [92]:
# Calling data using variable name

data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [93]:
# Calling data using variable name and implicit index (original index)

data[2]

0.75

* The implicit index is the default index

* We can define the index, called explicit, i.e., the defined index

* When defining an index, the number of index labels must equal the number of values
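To illustrate the last point, the sketch below builds one series with a matching number of labels and then tries one with too few. The names `ok` and `length_error` are only for illustration; the example relies on pandas raising a `ValueError` when the lengths differ.

```python
import pandas as pd

values = [0.25, 0.50, 0.75, 1]

# Four values and four labels: this works
ok = pd.Series(values, index=['a', 'b', 'c', 'd'])

# Four values but only three labels: pandas refuses to build the series
try:
    pd.Series(values, index=['a', 'b', 'c'])
    length_error = None
except ValueError as exc:
    length_error = exc

print(length_error)
```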

In [94]:
# Assign series and index to the variable data

data = pd.Series([0.25, 0.50, 0.75, 1], index=['a','b','c','d'])

In [95]:
# Calling the values of the variable data using its name

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [96]:
# Convert series to array using series_name.values

data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [97]:
# Displaying the index and data type of the variable data

data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [98]:
# Calling the data using explicit index 'a'

data['a']

0.25

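* This is data selection
* Even with an explicit (label) index, the implicit (positional) index still exists. Whether data[3] falls back to position depends on the pandas version (recent releases deprecate this fallback), so data.iloc[3] is the more reliable way to select by position.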

In [99]:
# Calling implicit index 3 (original index 3)

data[3]

1.0
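
The label-based and position-based selections above can also be written with the `loc` and `iloc` accessors, which are introduced later in this section. A minimal sketch, rebuilding the same series; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

by_label = data.loc['d']      # explicit index: the label 'd'
by_position = data.iloc[3]    # implicit index: position 3

print(by_label, by_position)
```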

* When the explicit index is itself made of integers, it can be confused with the implicit index
* In that case, calling the data with a single integer relies only on its explicit index

In [100]:
# Assign series and index to the variable data_2

data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2,5,3,7])

In [101]:
# Calling the values, index and data type of the series data_2

data_2

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [102]:
# Calling the explicit index 2 from the data_2 series

data_2[2]

0.25

In [103]:
# Calling the index 0 from the data_2 series

data_2[0]

# This raises a KeyError because the label 0 does not exist in the explicit index

KeyError: 0
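
One way to avoid the error is to check whether a label exists before using it, or to ask for the first element by position instead. The sketch below assumes the same `data_2` series; `missing` and `first_value` are illustrative names.

```python
import pandas as pd

data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2, 5, 3, 7])

# Membership test against the explicit index
print(0 in data_2.index)

# Looking up a label that does not exist raises KeyError
try:
    data_2.loc[0]
    missing = False
except KeyError:
    missing = True

# The first element by position is still available through iloc
first_value = data_2.iloc[0]
print(missing, first_value)
```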


---

### Data Slicing

---

#### Slice the data using implicit or explicit index

In [104]:
# Call the values, index and data type of the series data

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

* For example, we will select from label b to label c

In [105]:
# Slicing the data from explicit index b to c (both endpoints are included)

data['b':'c']

b    0.50
c    0.75
dtype: float64

* But if we slice by implicit index, the stop point is excluded, because an implicit slice behaves like a Python range. Slicing 1:2 therefore returns only the element at position 1

In [106]:
# Slicing the data from implicit index 1 to 2 (position 2 is excluded)

data[1:2] 

b    0.5
dtype: float64
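
The difference between the two kinds of slice can be compared side by side. This sketch uses the `loc` and `iloc` accessors (covered in the next part) to make the kind of index explicit; the three variable names are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.50, 0.75, 1], index=['a', 'b', 'c', 'd'])

label_slice = data.loc['b':'c']        # explicit: 'b' and 'c' are both included
position_slice = data.iloc[1:2]        # implicit: position 2 is excluded
position_slice_both = data.iloc[1:3]   # implicit slice that also reaches 'c'

print(list(label_slice.index))
print(list(position_slice.index))
print(label_slice.equals(position_slice_both))
```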



---

### loc and iloc

---
* #### loc accesses a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
* #### iloc is purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
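
The boolean-array use mentioned in both descriptions can be sketched as follows. The threshold 0.4 and the names `mask`, `by_loc`, and `by_iloc` are assumptions made for this illustration; the series is the integer-labelled `data_2` used in the cells that follow.

```python
import pandas as pd

data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2, 3, 5, 7])

# Boolean Series: True where the value is greater than 0.4
mask = data_2 > 0.4

by_loc = data_2.loc[mask]               # loc accepts the boolean Series
by_iloc = data_2.iloc[mask.to_numpy()]  # iloc is given a plain boolean array

print(by_loc)
print(by_loc.equals(by_iloc))
```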


---




* Calling and slicing data without loc and iloc

In [107]:
# Assign series and explicit index to the variable data_2 

data_2 = pd.Series([0.25, 0.50, 0.75, 1], index=[2,3,5,7])

In [108]:
# Calling the values, index and data type of the data_2 series

data_2

2    0.25
3    0.50
5    0.75
7    1.00
dtype: float64

* When we select a single value with an integer, the integer is matched against the explicit index

In [109]:
# Calling the value of the explicit index 2 of the data_2 series(data selection)

data_2[2] 

0.25

* But when we slice from 2 to 3 with the same syntax, the numbers are treated as implicit (positional) indexes, so the result is the value at position 2

In [110]:
# Slicing from implicit index 2 to index 3 (position 3 is excluded)

data_2[2:3]

5    0.75
dtype: float64

* When the explicit index is made of integers, selection and slicing behave inconsistently, as in the case above
* To overcome this inconsistency, we use loc and iloc
* loc calls the explicit index
* iloc calls the implicit index
---



* loc

In [111]:
# Calling explicit index 3 using loc (data selection)

data_2.loc[3] 

0.5

In [112]:
# Calling explicit index 2 to index 3 using loc (data slicing, both endpoints included)

data_2.loc[2:3] 

2    0.25
3    0.50
dtype: float64

* iloc

In [113]:
# Calling implicit index 3 using iloc (data selection)

data_2.iloc[3] 

1.0

In [114]:
# Calling implicit index 2 to 3 using iloc (data slicing, position 3 excluded)

data_2.iloc[2:3] 

5    0.75
dtype: float64



---

## **2. Data Frame**

---
### A data frame is a two-dimensional collection of Series (its columns) that share the same index

In [115]:
# Assign keys and values of dictionary dict_population

dict_population ={'Jakarta':750,
                'Bogor':490,
                'Depok':350,
                'Tangerang':270,
                'Bekasi':670}

# This is just an example, not a real population figure

In [116]:
# Calling the keys and values of dict_population

dict_population

{'Bekasi': 670, 'Bogor': 490, 'Depok': 350, 'Jakarta': 750, 'Tangerang': 270}

In [117]:
# Convert dictionary to series using pd.Series() and assign it to variable population

population = pd.Series(dict_population)

In [118]:
# Calling values of the population series

population

Jakarta      750
Bogor        490
Depok        350
Tangerang    270
Bekasi       670
dtype: int64

In [119]:
# Calling the number of population in Depok using explicit index 'Depok'

population.loc['Depok']

350

In [120]:
# Calling the number of population in Depok using implicit index 2 

population.iloc[2]

350

In [121]:
# Assign Keys and Values to the variable dict_area

dict_area = {'Jakarta':737,
            'Bogor':325,
            'Depok':247,
            'Tangerang':302,
            'Bekasi':355}

# This is just an example, not a real area number

In [122]:
# Convert the dictionary dict_area to series using pd.Series() and assign it to the variable area

area = pd.Series(dict_area)

In [123]:
# Calling the values, index, and data types of the area series

area

Jakarta      737
Bogor        325
Depok        247
Tangerang    302
Bekasi       355
dtype: int64

In [124]:
# Combine the two series into a data frame using pd.DataFrame(). The dictionary keys become the column names, and the result is assigned to pop_area

pop_area = pd.DataFrame({'pop':population, 'area':area})

In [125]:
# Calling the values of the pop_area Data Frame 

pop_area

           pop  area
Jakarta    750   737
Bogor      490   325
Depok      350   247
Tangerang  270   302
Bekasi     670   355
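
Since a data frame is built from series, each of its columns can be taken back out as a Series that shares the frame's index. A short sketch with a two-city subset of the example figures; `frame` and `column` are illustrative names.

```python
import pandas as pd

population = pd.Series({'Jakarta': 750, 'Bogor': 490})
area = pd.Series({'Jakarta': 737, 'Bogor': 325})

frame = pd.DataFrame({'population': population, 'area': area})

# Selecting one column returns a Series named after that column
column = frame['population']
print(type(column).__name__)
print(column.name)
print(column.index.equals(frame.index))
```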


In [126]:
# Calling the data in the column 'area' and the explicit index 'Jakarta'

pop_area['area']['Jakarta']

737

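* Accessing the column with the pop_area.pop syntax returns the output below instead of the column, because pop is also the name of a DataFrame method. (This cell was run later, as In [140], so its output shows the columns as they were after the renaming and the density step.)

In [140]:
# Accessing pop_area.pop returns the bound method DataFrame.pop, not the 'pop' column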

pop_area.pop

<bound method DataFrame.pop of            population  area   density
Jakarta           750   737  1.017639
Bogor             490   325  1.507692
Depok             350   247  1.417004
Tangerang         270   302  0.894040
Bekasi            670   355  1.887324>
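
The clash can be reproduced on a small frame. In this sketch, attribute access works for the `area` column, which shares no name with a DataFrame method, while `pop` resolves to the method; bracket access returns the column in both cases. The two-row frame is a cut-down version of the example data.

```python
import pandas as pd

pop_area = pd.DataFrame({'pop': [750, 490], 'area': [737, 325]},
                        index=['Jakarta', 'Bogor'])

# Attribute access finds the DataFrame.pop method first
print(callable(pop_area.pop))

# Bracket access always returns the column
print(type(pop_area['pop']).__name__)

# 'area' is not a method name, so attribute access returns the column
print(type(pop_area.area).__name__)
```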

In [127]:
# Calling the data of the pop_area in column 'pop'

pop_area['pop']

Jakarta      750
Bogor        490
Depok        350
Tangerang    270
Bekasi       670
Name: pop, dtype: int64

* To avoid this clash, we rename the pop column to population

In [128]:
# Rebuild the data frame with the column named 'population' instead of 'pop'

pop_area = pd.DataFrame({'population':population,'area':area})
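
The cell above rebuilds the frame from the two series. As an alternative sketch, an existing frame can be renamed with `DataFrame.rename`, which returns a new frame and leaves the original unchanged; the small frame and the name `renamed` are assumptions for this illustration.

```python
import pandas as pd

pop_area = pd.DataFrame({'pop': [750, 490], 'area': [737, 325]},
                        index=['Jakarta', 'Bogor'])

# Map the old column name to the new one
renamed = pop_area.rename(columns={'pop': 'population'})

print(list(renamed.columns))
print(list(pop_area.columns))
```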

In [129]:
# Calling the data of the column 'population'

pop_area['population']

Jakarta      750
Bogor        490
Depok        350
Tangerang    270
Bekasi       670
Name: population, dtype: int64

In [130]:
# Calling the data in the column 'population' using the explicit index 'Jakarta' to 'Depok'

pop_area['population']['Jakarta':'Depok']

Jakarta    750
Bogor      490
Depok      350
Name: population, dtype: int64

In [131]:
# Calling the data in the column 'population' using the implicit index 0 to 3 (position 3 is excluded)

pop_area['population'].iloc[0:3] 

Jakarta    750
Bogor      490
Depok      350
Name: population, dtype: int64

In [132]:
# Add a new column called 'density' whose contents are the result of dividing the 'population' column by the 'area' column

pop_area['density']=pop_area['population']/pop_area['area']

In [133]:
# Calling the data of the pop_area data frame

pop_area

           population  area   density
Jakarta           750   737  1.017639
Bogor             490   325  1.507692
Depok             350   247  1.417004
Tangerang         270   302  0.894040
Bekasi            670   355  1.887324
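
The division that produces the density column is element-wise, and pandas matches the two operands by index label rather than by row order. The sketch below deliberately lists the cities in a different order in each series to show this; it uses three of the example cities.

```python
import pandas as pd

population = pd.Series({'Jakarta': 750, 'Bogor': 490, 'Depok': 350})
# Same cities, listed in a different order
area = pd.Series({'Depok': 247, 'Jakarta': 737, 'Bogor': 325})

# Each population value is divided by the area with the same label
density = population / area

print(density['Jakarta'])
print(density['Depok'])
```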


In [134]:
# Build a data frame with a single column 'Bandung' holding the values 151, 148, and 0.18; it becomes a row after transposing

new_area=pd.DataFrame({'Bandung':[151,148,0.18]})

In [135]:
# Transpose the new_area data frame

new_area=new_area.T

In [136]:
# Calling the data of the new_area data frame

new_area

             0      1     2
Bandung  151.0  148.0  0.18


In [137]:
# Change the column name the same as the column names of pop_area data frame

new_area.columns=pop_area.columns

In [138]:
# Calling the data of the new_area data frame

new_area

         population   area  density
Bandung       151.0  148.0     0.18


In [139]:
# Concatenate the data of pop_area and new_area using pd.concat()

pd.concat([pop_area, new_area])

# The 'Bandung' row now appears as the last row of the result

           population   area   density
Jakarta         750.0  737.0  1.017639
Bogor           490.0  325.0  1.507692
Depok           350.0  247.0  1.417004
Tangerang       270.0  302.0  0.894040
Bekasi          670.0  355.0  1.887324
Bandung         151.0  148.0  0.180000
