### DataFrame

Pandas provides the DataFrame, a two-dimensional tabular data structure with labeled axes (rows and columns). The labels make it easy to select, manipulate, and analyze data.

A DataFrame can be thought of as a collection of Series that share the same index, where each column is one Series.
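
As a minimal sketch of that idea, the snippet below builds a small DataFrame from two Series and checks that a column is itself a Series. The variable names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
ages = pd.Series({'Ageng': 24, 'Anna': 25})
cities = pd.Series({'Ageng': 'Yogyakarta', 'Anna': 'Jakarta'})

# Each dictionary key becomes a column; the Series index becomes the row labels.
people = pd.DataFrame({'Age': ages, 'City': cities})

# Each column is itself a Series that shares the DataFrame's index.
print(type(people['Age']).__name__)
print(people.index.tolist())
```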

In [4]:
import pandas as pd

#### From previous PANDAS-SERIES

In [5]:
personal_information = {'Name': 'Ageng',
                        'Age': 24,
                        'Occupation':'Lecturer',
                        'City': 'Yogyakarta',
                        'Address': 'Jl. Merdeka No.1'}

In [6]:
personal_information

{'Name': 'Ageng',
 'Age': 24,
 'Occupation': 'Lecturer',
 'City': 'Yogyakarta',
 'Address': 'Jl. Merdeka No.1'}

In [7]:
#Transforming a dictionary into a series.

information = pd.Series(personal_information)

In [8]:
information

Name                     Ageng
Age                         24
Occupation            Lecturer
City                Yogyakarta
Address       Jl. Merdeka No.1
dtype: object

In [9]:
information.loc['Name']

'Ageng'

In [10]:
information.iloc[1]

24

In [11]:
additional_information = {'Name': 'Anna',
                          'Age': 25,
                          'Occupation': 'Designer',
                          'City': 'Jakarta',
                          'Address': 'Jl. Sultan Agung No. 2'}

In [12]:
additional_information = pd.Series(additional_information)

In [13]:
additional_information

Name                            Anna
Age                               25
Occupation                  Designer
City                         Jakarta
Address       Jl. Sultan Agung No. 2
dtype: object

In [14]:
additional_information.loc['City']

'Jakarta'

In [15]:
additional_information.iloc[0]

'Anna'

In [18]:
#Building a DataFrame: each dictionary key becomes a column name
information_2 = pd.DataFrame({'data':personal_information, 'data_2':additional_information})

In [19]:
information_2

Unnamed: 0,data,data_2
Name,Ageng,Anna
Age,24,25
Occupation,Lecturer,Designer
City,Yogyakarta,Jakarta
Address,Jl. Merdeka No.1,Jl. Sultan Agung No. 2


In [20]:
information_2['data_2']

Name                            Anna
Age                               25
Occupation                  Designer
City                         Jakarta
Address       Jl. Sultan Agung No. 2
Name: data_2, dtype: object

In [21]:
information_2['data_2']['Name']

'Anna'

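- A column can also be accessed as an attribute, for example information_2.data, which returns the column as shown below.
- Attribute access is not always reliable: if a column name is the same as a DataFrame method name (for example, a column named "pop"), the attribute refers to the method instead of the column.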

In [22]:
information_2.data

Name                     Ageng
Age                         24
Occupation            Lecturer
City                Yogyakarta
Address       Jl. Merdeka No.1
Name: data, dtype: object

Therefore, it is safer to access a column with bracket syntax, such as information_2['data'].

In [23]:
information_2['data']

Name                     Ageng
Age                         24
Occupation            Lecturer
City                Yogyakarta
Address       Jl. Merdeka No.1
Name: data, dtype: object
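
The difference between attribute access and bracket access can be illustrated with a small sketch. The frame below is hypothetical: it has a column deliberately named 'pop', which collides with the existing DataFrame.pop() method.

```python
import pandas as pd

# Hypothetical frame; the column name 'pop' clashes with DataFrame.pop().
areas = pd.DataFrame({'area': [10, 20], 'pop': [100, 200]})

print(areas.area.tolist())     # attribute access works for 'area'
print(callable(areas.pop))     # True: this is the method, not the column
print(areas['pop'].tolist())   # bracket access returns the column
```

Bracket access also works for column names that contain spaces or are not valid Python identifiers.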

Next, we change the column name "data" to "employee_1" and "data_2" to "employee_2" by rebuilding the DataFrame with new dictionary keys.

In [26]:
information_2 = pd.DataFrame({'employee_1':personal_information, 'employee_2':additional_information})

In [27]:
information_2

Unnamed: 0,employee_1,employee_2
Name,Ageng,Anna
Age,24,25
Occupation,Lecturer,Designer
City,Yogyakarta,Jakarta
Address,Jl. Merdeka No.1,Jl. Sultan Agung No. 2
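
Rebuilding the DataFrame is one way to change column names. An alternative, sketched below on a hypothetical two-row frame, is the rename() method, which returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical frame mirroring the column names used above.
frame = pd.DataFrame({'data': ['Ageng', 24], 'data_2': ['Anna', 25]},
                     index=['Name', 'Age'])

# rename() maps old column names to new ones and returns a new DataFrame.
renamed = frame.rename(columns={'data': 'employee_1', 'data_2': 'employee_2'})

print(renamed.columns.tolist())
print(frame.columns.tolist())   # the original frame keeps its names
```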


information_2['employee_1']

In [31]:
information_2['employee_1'].loc['Name':'City'] #explicit (label-based) slicing: the end label is included

Name               Ageng
Age                   24
Occupation      Lecturer
City          Yogyakarta
Name: employee_1, dtype: object

In [32]:
information_2['employee_1'].iloc[0:2] #implicit (position-based) slicing: the end position is excluded

Name    Ageng
Age        24
Name: employee_1, dtype: object
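
The two slicing styles differ in how they treat the end of the slice. The sketch below uses a hypothetical Series to show that .loc includes the end label while .iloc excludes the end position.

```python
import pandas as pd

# Hypothetical Series with string labels, for illustration only.
person = pd.Series({'Name': 'Ageng', 'Age': 24,
                    'Occupation': 'Lecturer', 'City': 'Yogyakarta'})

by_label = person.loc['Name':'Occupation']   # includes 'Occupation'
by_position = person.iloc[0:2]               # excludes position 2

print(by_label.index.tolist())
print(by_position.index.tolist())
```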

#### LOAD DATA CSV IN PANDAS
Import CSV files using the read_csv() function from the Pandas library. You can optionally choose which column to use as the index when reading the data into memory. Make sure the file path passed to read_csv() matches the location of the CSV file.

In [33]:
df = pd.read_csv('Kelahiran Bayi_Jakarta_2020.csv')

In [34]:
df.head() #Used to get the first rows. By default, it returns the top five rows.

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
0,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU PANGGANG,Laki-Laki,5,2020
1,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU KELAPA,Laki-Laki,1,2020
2,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU HARAPAN,Laki-Laki,1,2020
3,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU UNTUNG JAWA,Laki-Laki,2,2020
4,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU TIDUNG,Laki-Laki,2,2020
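
To illustrate setting the index while reading, here is a minimal sketch that reads a tiny CSV held in memory, so it does not depend on any file on disk. The CSV text is a hypothetical two-row sample, and index_col is the read_csv() parameter that selects the index column.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so no file is required.
csv_text = "kelurahan,jumlah\nPULAU PANGGANG,5\nPULAU KELAPA,1\n"

# index_col selects the column to use as the row labels.
births = pd.read_csv(io.StringIO(csv_text), index_col='kelurahan')

print(births.index.name)
print(births.loc['PULAU PANGGANG', 'jumlah'])
```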


In [35]:
# Viewing data info

df.info() #Provides a concise summary of a DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6408 entries, 0 to 6407
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tahun           6408 non-null   int64 
 1   bulan           6408 non-null   int64 
 2   kota_kabupaten  6408 non-null   object
 3   kecamatan       6408 non-null   object
 4   kelurahan       6408 non-null   object
 5   jenis_kelamin   6408 non-null   object
 6   jumlah          6408 non-null   int64 
 7   periode_data    6408 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 400.6+ KB


In [36]:
# Viewing the count of non-null values in the data

df.notnull().sum() #Counts the non-missing values (not NaN or None) in each column.

tahun             6408
bulan             6408
kota_kabupaten    6408
kecamatan         6408
kelurahan         6408
jenis_kelamin     6408
jumlah            6408
periode_data      6408
dtype: int64

In [37]:
# Viewing the count of null values in the data

df.isnull().sum() #Used to check for missing values (NaN or None) in a DataFrame.

tahun             0
bulan             0
kota_kabupaten    0
kecamatan         0
kelurahan         0
jenis_kelamin     0
jumlah            0
periode_data      0
dtype: int64
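
Because the dataset above has no missing values, the two counts are not very revealing. The sketch below uses a hypothetical frame with missing entries to show that the isnull() and notnull() counts add up to the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
sample = pd.DataFrame({'jumlah': [5, np.nan, 2],
                       'kecamatan': ['GAMBIR', 'KOJA', None]})

missing = sample.isnull().sum()    # missing values per column
present = sample.notnull().sum()   # non-missing values per column

print(missing.to_dict())
print(present.to_dict())
```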

In [40]:
# Viewing the sum of each column (numeric columns are added; string columns are concatenated)

df.sum()

tahun                                                      12944160
bulan                                                         41652
kota_kabupaten    ADM. KEPULAUAN SERIBUADM. KEPULAUAN SERIBUADM....
kecamatan         KEPULAUAN SERIBU UTARAKEPULAUAN SERIBU UTARAKE...
kelurahan         PULAU PANGGANGPULAU KELAPAPULAU HARAPANPULAU U...
jenis_kelamin     Laki-LakiLaki-LakiLaki-LakiLaki-LakiLaki-LakiL...
jumlah                                                       137161
periode_data                                               12944160
dtype: object
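
The concatenated strings in the output above are rarely useful. One option, sketched here on a hypothetical frame, is to pass numeric_only=True so that only numeric columns are summed.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
sample = pd.DataFrame({'kecamatan': ['GAMBIR', 'KOJA'], 'jumlah': [5, 7]})

# numeric_only=True skips the text column instead of concatenating it.
totals = sample.sum(numeric_only=True)

print(totals.to_dict())
```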

In [41]:
# Viewing from the bottom of the data
df.tail() #Displays the last few rows of the DataFrame. By default, it shows the last 5 rows but can be customized to display a different number of rows.

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
6403,2020,12,JAKARTA TIMUR,CIPAYUNG,MUNJUL,Perempuan,23,2020
6404,2020,12,JAKARTA TIMUR,CIPAYUNG,SETU,Perempuan,20,2020
6405,2020,12,JAKARTA TIMUR,CIPAYUNG,BAMBU APUS,Perempuan,22,2020
6406,2020,12,JAKARTA TIMUR,CIPAYUNG,LUBANG BUAYA,Perempuan,100,2020
6407,2020,12,JAKARTA TIMUR,CIPAYUNG,CEGER,Perempuan,16,2020


In [42]:
# Viewing the number of rows and columns

df.shape #To get the number of rows and columns in the DataFrame.

(6408, 8)

In [43]:
# Viewing columns

df.columns #To view the column names of the DataFrame.

Index(['tahun', 'bulan', 'kota_kabupaten', 'kecamatan', 'kelurahan',
       'jenis_kelamin', 'jumlah', 'periode_data'],
      dtype='object')

In [44]:
# Viewing index

df.index #To view the index labels of the DataFrame.

RangeIndex(start=0, stop=6408, step=1)

In [45]:
# Displaying information from numeric columns

df.describe() #Generates descriptive statistics of numerical columns in the DataFrame

Unnamed: 0,tahun,bulan,jumlah,periode_data
count,6408.0,6408.0,6408.0,6408.0
mean,2020.0,6.5,21.40465,2020.0
std,0.0,3.452322,18.487518,0.0
min,2020.0,1.0,0.0,2020.0
25%,2020.0,3.75,9.0,2020.0
50%,2020.0,6.5,17.0,2020.0
75%,2020.0,9.25,29.0,2020.0
max,2020.0,12.0,178.0,2020.0


In [46]:
# For example, displaying the mean of the 'jumlah' column

df['jumlah'].mean()

21.40465043695381

In [47]:
# For example, displaying the median of the 'jumlah' column

df['jumlah'].median() #Calculates the median of the column.

17.0

In [48]:
# For example, displaying the mode of the 'jumlah' column

df['jumlah'].mode()[0] #mode() returns a Series because several values can tie; [0] takes the first mode.

10
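
The three statistics can be compared on a small hypothetical Series. The sketch also shows why mode() is indexed with [0]: it returns a Series, since more than one value can be the most frequent.

```python
import pandas as pd

# Hypothetical counts, for illustration only.
counts = pd.Series([1, 2, 2, 3, 10])

print(counts.mean())           # 3.6
print(counts.median())         # 2.0
print(counts.mode().tolist())  # [2]

# With a tie, mode() returns every tied value in sorted order.
tied = pd.Series([1, 1, 2, 2])
print(tied.mode().tolist())    # [1, 2]
```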

In [49]:
# For example, checking NaN values in the 'jumlah' column

df[df['jumlah'].isnull()] #Returns the rows where 'jumlah' is missing; the result is empty here.

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data


In [50]:
# The boolean mask itself (wrapped in a list here); True would mark rows where 'jumlah' is missing
[df.jumlah.isnull()]

[0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 6403    False
 6404    False
 6405    False
 6406    False
 6407    False
 Name: jumlah, Length: 6408, dtype: bool]

In [51]:
# For example, displaying unique data from the 'jenis_kelamin' column
df.jenis_kelamin.unique() #Returns an array of the unique values in the column.

array(['Laki-Laki', 'Perempuan'], dtype=object)

In [52]:
# For example, displaying the number of unique values in the 'jenis_kelamin' column
df.jenis_kelamin.nunique()

2

In [53]:
df['kecamatan'].unique()

array(['KEPULAUAN SERIBU UTARA', 'KEPULAUAN SERIBU SELATAN', 'GAMBIR',
       'SAWAH BESAR', 'KEMAYORAN', 'S E N E N', 'CEMPAKA PUTIH',
       'MENTENG', 'TANAH ABANG', 'JOHAR BARU', 'PENJARINGAN',
       'TANJUNG PRIOK', 'KOJA', 'CILINCING', 'PADEMANGAN',
       'KELAPA GADING', 'CENGKARENG', 'GROGOL PETAMBURAN', 'TAMAN SARI',
       'TAMBORA', 'MAMPANG PRAPATAN', 'PASAR MINGGU', 'KEBAYORAN LAMA',
       'CILANDAK', 'KEBAYORAN BARU', 'PANCORAN', 'JAGAKARSA',
       'PESANGGRAHAN', 'MATRAMAN', 'PULO GADUNG', 'JATINEGARA',
       'KRAMATJATI', 'PASAR REBO', 'CAKUNG', 'DUREN SAWIT', 'MAKASAR',
       'CIRACAS', 'KEBON JERUK', 'KALI DERES', 'PAL MERAH', 'KEMBANGAN',
       'TEBET', 'SETIA BUDI', 'CIPAYUNG'], dtype=object)

In [54]:
df['kecamatan'].nunique()

44

In [55]:
df.jenis_kelamin.value_counts() #Used to count the occurrences of unique values in a DataFrame column. It returns a Series containing counts of unique values.

jenis_kelamin
Laki-Laki    3204
Perempuan    3204
Name: count, dtype: int64
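
The relationship between unique(), nunique(), and value_counts() is easier to see on a small example. The Series below is hypothetical and intentionally unbalanced.

```python
import pandas as pd

# Hypothetical column with an uneven split between two values.
gender = pd.Series(['Laki-Laki', 'Perempuan', 'Laki-Laki', 'Laki-Laki'])

print(list(gender.unique()))            # the distinct values
print(gender.nunique())                 # how many distinct values there are
print(gender.value_counts().to_dict())  # how often each value occurs
```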

In [56]:
# Selecting specific columns

df[['kecamatan','kelurahan','jenis_kelamin','jumlah']]
#Retrieving certain columns from a DataFrame df. This notation will return a new DataFrame containing only the specified columns.

Unnamed: 0,kecamatan,kelurahan,jenis_kelamin,jumlah
0,KEPULAUAN SERIBU UTARA,PULAU PANGGANG,Laki-Laki,5
1,KEPULAUAN SERIBU UTARA,PULAU KELAPA,Laki-Laki,1
2,KEPULAUAN SERIBU UTARA,PULAU HARAPAN,Laki-Laki,1
3,KEPULAUAN SERIBU SELATAN,PULAU UNTUNG JAWA,Laki-Laki,2
4,KEPULAUAN SERIBU SELATAN,PULAU TIDUNG,Laki-Laki,2
...,...,...,...,...
6403,CIPAYUNG,MUNJUL,Perempuan,23
6404,CIPAYUNG,SETU,Perempuan,20
6405,CIPAYUNG,BAMBU APUS,Perempuan,22
6406,CIPAYUNG,LUBANG BUAYA,Perempuan,100
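
The number of brackets matters when selecting columns. As a small sketch on a hypothetical frame: single brackets with one name return a Series, while double brackets (a list of names) return a DataFrame, even for one column.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
sample = pd.DataFrame({'kecamatan': ['GAMBIR', 'KOJA'], 'jumlah': [5, 7]})

one_column = sample['jumlah']     # a Series
sub_frame = sample[['jumlah']]    # a DataFrame with one column

print(type(one_column).__name__, type(sub_frame).__name__)
```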


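Selecting data with conditions means retrieving only the rows of a dataset that satisfy certain conditions.
This can be done with comparison operators such as ==, !=, <, >, <=, and >=, combined with the element-wise logical operators & (and), | (or), and ~ (not).
Python's and, or, and not keywords do not work on a pandas Series, and each condition must be wrapped in parentheses when conditions are combined.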
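
A minimal sketch of these operators on a hypothetical three-row frame follows. It also shows that combining two boolean Series with Python's and keyword raises a ValueError.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
sample = pd.DataFrame({'jumlah': [5, 25, 30],
                       'jenis_kelamin': ['Laki-Laki', 'Perempuan', 'Laki-Laki']})

high = sample['jumlah'] > 20
female = sample['jenis_kelamin'] == 'Perempuan'

print(sample[high & female].index.tolist())   # both conditions hold
print(sample[high | female].index.tolist())   # at least one condition holds
print(sample[~high].index.tolist())           # the condition does not hold

# The 'and' keyword asks for a single True/False value, which is ambiguous
# for a whole Series, so pandas raises ValueError.
try:
    sample[high and female]
except ValueError as error:
    print('ValueError:', error)
```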

In [57]:
df[df['jumlah']==2]

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
3,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU UNTUNG JAWA,Laki-Laki,2,2020
4,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU TIDUNG,Laki-Laki,2,2020
12,2020,1,JAKARTA PUSAT,SAWAH BESAR,PASAR BARU,Laki-Laki,2,2020
25,2020,1,JAKARTA PUSAT,S E N E N,SENEN,Laki-Laki,2,2020
37,2020,1,JAKARTA PUSAT,MENTENG,GONDANGDIA,Laki-Laki,2,2020
...,...,...,...,...,...,...,...,...
6146,2020,12,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU PARI,Perempuan,2,2020
6178,2020,12,JAKARTA PUSAT,MENTENG,GONDANGDIA,Perempuan,2,2020
6285,2020,12,JAKARTA SELATAN,SETIA BUDI,SETIA BUDI,Perempuan,2,2020
6286,2020,12,JAKARTA SELATAN,SETIA BUDI,KARET SEMANGGI,Perempuan,2,2020


In [58]:
#For example, we will select the rows where 'jumlah' is greater than 20

df[df['jumlah']>20]

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
50,2020,1,JAKARTA UTARA,PENJARINGAN,PENJARINGAN,Laki-Laki,27,2020
56,2020,1,JAKARTA UTARA,TANJUNG PRIOK,SUNTER JAYA,Laki-Laki,29,2020
57,2020,1,JAKARTA UTARA,TANJUNG PRIOK,PAPANGGO,Laki-Laki,21,2020
63,2020,1,JAKARTA UTARA,KOJA,TUGU UTARA,Laki-Laki,23,2020
64,2020,1,JAKARTA UTARA,KOJA,LAGOA,Laki-Laki,34,2020
...,...,...,...,...,...,...,...,...
6400,2020,12,JAKARTA TIMUR,CIPAYUNG,CIPAYUNG,Perempuan,23,2020
6401,2020,12,JAKARTA TIMUR,CIPAYUNG,CILANGKAP,Perempuan,27,2020
6403,2020,12,JAKARTA TIMUR,CIPAYUNG,MUNJUL,Perempuan,23,2020
6405,2020,12,JAKARTA TIMUR,CIPAYUNG,BAMBU APUS,Perempuan,22,2020


In [59]:
#For example, we will select the rows where 'jumlah' is less than or equal to 20 by negating the previous condition with ~

df[~(df['jumlah']>20)]

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
0,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU PANGGANG,Laki-Laki,5,2020
1,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU KELAPA,Laki-Laki,1,2020
2,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU UTARA,PULAU HARAPAN,Laki-Laki,1,2020
3,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU UNTUNG JAWA,Laki-Laki,2,2020
4,2020,1,ADM. KEPULAUAN SERIBU,KEPULAUAN SERIBU SELATAN,PULAU TIDUNG,Laki-Laki,2,2020
...,...,...,...,...,...,...,...,...
6391,2020,12,JAKARTA TIMUR,MAKASAR,PINANG RANTI,Perempuan,20,2020
6398,2020,12,JAKARTA TIMUR,CIRACAS,SUSUKAN,Perempuan,19,2020
6402,2020,12,JAKARTA TIMUR,CIPAYUNG,PONDOK RANGGON,Perempuan,14,2020
6404,2020,12,JAKARTA TIMUR,CIPAYUNG,SETU,Perempuan,20,2020


In [60]:
#For example, we will select the rows where 'jumlah' is greater than 20 and 'jenis_kelamin' is 'Perempuan' (female)

df[(df['jumlah']>20)&(df['jenis_kelamin']=='Perempuan')]

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
261,2020,1,JAKARTA TIMUR,CIRACAS,CIRACAS,Perempuan,22,2020
262,2020,1,JAKARTA TIMUR,CIRACAS,CIBUBUR,Perempuan,30,2020
473,2020,1,JAKARTA UTARA,PENJARINGAN,PENJARINGAN,Perempuan,29,2020
479,2020,1,JAKARTA UTARA,TANJUNG PRIOK,SUNTER JAYA,Perempuan,30,2020
483,2020,1,JAKARTA UTARA,TANJUNG PRIOK,SUNTER AGUNG,Perempuan,25,2020
...,...,...,...,...,...,...,...,...
6400,2020,12,JAKARTA TIMUR,CIPAYUNG,CIPAYUNG,Perempuan,23,2020
6401,2020,12,JAKARTA TIMUR,CIPAYUNG,CILANGKAP,Perempuan,27,2020
6403,2020,12,JAKARTA TIMUR,CIPAYUNG,MUNJUL,Perempuan,23,2020
6405,2020,12,JAKARTA TIMUR,CIPAYUNG,BAMBU APUS,Perempuan,22,2020


In [61]:
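#For example, we will select the rows where 'jumlah' is greater than 20 or 'jenis_kelamin' is female
#Note: string comparison is case-sensitive. The column stores 'Perempuan', so 'perempuan' matches no rows and the result below is the same as df['jumlah']>20. Use 'Perempuan' to apply the intended filter.

df[(df['jumlah']>20)|(df['jenis_kelamin']=='perempuan')]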

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
50,2020,1,JAKARTA UTARA,PENJARINGAN,PENJARINGAN,Laki-Laki,27,2020
56,2020,1,JAKARTA UTARA,TANJUNG PRIOK,SUNTER JAYA,Laki-Laki,29,2020
57,2020,1,JAKARTA UTARA,TANJUNG PRIOK,PAPANGGO,Laki-Laki,21,2020
63,2020,1,JAKARTA UTARA,KOJA,TUGU UTARA,Laki-Laki,23,2020
64,2020,1,JAKARTA UTARA,KOJA,LAGOA,Laki-Laki,34,2020
...,...,...,...,...,...,...,...,...
6400,2020,12,JAKARTA TIMUR,CIPAYUNG,CIPAYUNG,Perempuan,23,2020
6401,2020,12,JAKARTA TIMUR,CIPAYUNG,CILANGKAP,Perempuan,27,2020
6403,2020,12,JAKARTA TIMUR,CIPAYUNG,MUNJUL,Perempuan,23,2020
6405,2020,12,JAKARTA TIMUR,CIPAYUNG,BAMBU APUS,Perempuan,22,2020
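
String comparisons in pandas are case-sensitive, which is easy to overlook when filtering on text. The sketch below uses a hypothetical Series to compare an exact match with a case-insensitive match built with the .str.lower() accessor.

```python
import pandas as pd

# Hypothetical values written with a capital first letter.
gender = pd.Series(['Laki-Laki', 'Perempuan', 'Perempuan'])

print((gender == 'perempuan').sum())              # 0: the case differs
print((gender == 'Perempuan').sum())              # 2: exact match
print((gender.str.lower() == 'perempuan').sum())  # 2: case-insensitive match
```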


In [62]:
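#For example, we will select the rows where 'jumlah' is between 20 and 50
#Note: as written, both conditions use "greater than", so only rows with jumlah > 50 are returned (see the output below). For values between 20 and 50, the second condition should be df['jumlah']<50.

df[((df['jumlah']>20)&(df['jumlah']>50))]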

Unnamed: 0,tahun,bulan,kota_kabupaten,kecamatan,kelurahan,jenis_kelamin,jumlah,periode_data
359,2020,2,JAKARTA BARAT,CENGKARENG,KAPUK,Laki-Laki,51,2020
864,2020,2,JAKARTA UTARA,KOJA,TUGU UTARA,Perempuan,51,2020
1328,2020,3,JAKARTA BARAT,CENGKARENG,KAPUK,Laki-Laki,74,2020
2001,2020,5,JAKARTA UTARA,CILINCING,KALI BARU,Laki-Laki,57,2020
2373,2020,5,JAKARTA UTARA,CILINCING,KALI BARU,Perempuan,57,2020
...,...,...,...,...,...,...,...,...
6384,2020,12,JAKARTA TIMUR,DUREN SAWIT,PONDOK BAMBU,Perempuan,55,2020
6385,2020,12,JAKARTA TIMUR,DUREN SAWIT,KLENDER,Perempuan,60,2020
6386,2020,12,JAKARTA TIMUR,DUREN SAWIT,PONDOK KELAPA,Perempuan,68,2020
6396,2020,12,JAKARTA TIMUR,CIRACAS,CIBUBUR,Perempuan,56,2020
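
A range filter needs a lower bound and an upper bound. Here is a minimal sketch on hypothetical values, showing both an explicit pair of comparisons and Series.between(), which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical counts, for illustration only.
counts = pd.Series([10, 20, 21, 35, 50, 51, 74])

# Strictly between 20 and 50: greater than 20 AND less than 50.
strictly_between = counts[(counts > 20) & (counts < 50)]

# between() includes both endpoints by default.
inclusive_between = counts[counts.between(20, 50)]

print(strictly_between.tolist())    # [21, 35]
print(inclusive_between.tolist())   # [20, 21, 35, 50]
```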


In [63]:
#For example, we will display the 'kota_kabupaten', 'kecamatan', 'jumlah', and 'jenis_kelamin' columns for the rows where 'jumlah' is greater than 80 or 'jenis_kelamin' is 'Perempuan'

df[['kota_kabupaten','kecamatan','jumlah','jenis_kelamin']][(df['jumlah']>80)|(df['jenis_kelamin']=='Perempuan')]

Unnamed: 0,kota_kabupaten,kecamatan,jumlah,jenis_kelamin
259,JAKARTA TIMUR,MAKASAR,3,Perempuan
260,JAKARTA TIMUR,MAKASAR,11,Perempuan
261,JAKARTA TIMUR,CIRACAS,22,Perempuan
262,JAKARTA TIMUR,CIRACAS,30,Perempuan
263,JAKARTA TIMUR,CIRACAS,18,Perempuan
...,...,...,...,...
6403,JAKARTA TIMUR,CIPAYUNG,23,Perempuan
6404,JAKARTA TIMUR,CIPAYUNG,20,Perempuan
6405,JAKARTA TIMUR,CIPAYUNG,22,Perempuan
6406,JAKARTA TIMUR,CIPAYUNG,100,Perempuan
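
Selecting columns and then filtering rows in two bracket steps works, but the same result can be obtained in a single step with .loc, which takes the row condition and the column list together. The sketch below assumes a small hypothetical frame and checks that both approaches agree.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
sample = pd.DataFrame({'kecamatan': ['GAMBIR', 'KOJA', 'TEBET'],
                       'jumlah': [5, 90, 12],
                       'jenis_kelamin': ['Laki-Laki', 'Laki-Laki', 'Perempuan']})

mask = (sample['jumlah'] > 80) | (sample['jenis_kelamin'] == 'Perempuan')
columns = ['kecamatan', 'jumlah']

chained = sample[columns][mask]          # two steps: columns, then rows
single_step = sample.loc[mask, columns]  # one step: rows and columns together

print(single_step.equals(chained))
print(single_step.index.tolist())
```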
