When I collect external data for an analytics project,  
I go through the following process:

1. Read the provided data specification and check the following first:
   - Whether the data can be used without risk of copyright infringement
   - Whether the raw data has the columns I need
   - Whether the raw data can be turned into meaningful information by cleaning and processing it
  
2. If the data looks usable, download it.  
   Then check whether the raw data actually follows the provided data specification.  
   In my experience, open public data often does not match its specification.  
   So, to determine whether the data is correct and complete, check the following with Python code:
   - Column names
   - Data types
   - Presence of missing values
   - Whether a specific column consists only of unique values
   - Example data values
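The first of these checks, comparing column names with the specification, can be scripted so that mismatches surface right away. The following is only a minimal sketch: the function name `compare_with_spec`, the toy DataFrame, and the expected-column list are hypothetical stand-ins for the real raw data and specification.

```python
import pandas as pd

def compare_with_spec(raw: pd.DataFrame, expected_cols: list) -> dict:
    """Report columns missing from the raw data and columns absent from the spec."""
    raw_cols = set(raw.columns)
    spec_cols = set(expected_cols)
    return {
        "missing_in_raw": sorted(spec_cols - raw_cols),
        "not_in_spec": sorted(raw_cols - spec_cols),
    }

# Toy raw data (hypothetical): 'A2' from the spec is absent, 'geometry' is extra
raw = pd.DataFrame({"A0": [1, 2], "A1": ["x", "y"], "geometry": [None, None]})
report = compare_with_spec(raw, expected_cols=["A0", "A1", "A2"])
print(report)
```

An empty list under both keys would mean the raw data and the specification agree on column names.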

In [13]:
import pandas as pd
import numpy as np
import geopandas as gpd

### Read the shapefile (raw data)
- Data name: "건축물연령공간정보" (building age spatial information), 부산광역시 중구 (Jung-gu, Busan)
- Data source: The National Spatial Information Portal (http://openapi.nsdi.go.kr/nsdi/index.do)

In [14]:
# The attribute table is cp949-encoded (Korean text)
df = gpd.read_file('data/국가공간정보포털_건축물연령정보_부산중구/AL_26110_D196_20230111.shp',
                   encoding='cp949')
print(df.shape)

(6536, 32)


### Check and rename columns (using the data specification)

In [15]:
# Read the Excel file (data specification)

df_col = pd.read_excel('data/datasetDetail.xlsx')
print(df_col.shape)

(31, 2)


In [16]:
# Build a dictionary with column codes as keys and column names as values.
# to_dict() returns {'col_name': {code: name, ...}}, so the mapping is a['col_name'].

a = df_col.set_index('col_name_code').to_dict()
a.get('col_name')

{'A0': '도형ID',
 'A1': 'GIS건물통합식별번호',
 'A2': '고유번호',
 'A3': '법정동코드',
 'A4': '법정동명',
 'A5': '특수지구분코드',
 'A6': '특수지구분명',
 'A7': '지번',
 'A8': '건물식별번호',
 'A9': '집합건물구분코드',
 'A10': '집합건물구분',
 'A11': '대장종류코드',
 'A12': '대장종류',
 'A13': '건물명',
 'A14': '건물동명',
 'A15': '건물연면적',
 'A16': '건축물구조코드',
 'A17': '건축물구조명',
 'A18': '주요용도코드',
 'A19': '주요용도명',
 'A20': '건물높이',
 'A21': '지상층수',
 'A22': '지하층수',
 'A23': '허가일자',
 'A24': '사용승인일자',
 'A25': '건물연령',
 'A26': '연령대구분코드',
 'A27': '연령대구분명',
 'A28': '연령대5계급코드',
 'A29': '연령대5계급명',
 'A30': '데이터기준일자'}

In [17]:
# Rename the raw data columns using this dictionary

df.rename(columns = a.get('col_name'), inplace=True)
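
By default, `rename` silently skips dictionary keys that do not exist in the DataFrame, so a code listed in the specification but missing from the raw data goes unnoticed. One option, sketched here on a toy DataFrame with a hypothetical mapping (the English names are made up for illustration), is to pass `errors="raise"`, which makes pandas raise a `KeyError` in that case.

```python
import pandas as pd

# Toy stand-ins (hypothetical): the mapping lists 'A2', which the data lacks
toy = pd.DataFrame({"A0": [1], "A1": ["x"], "geometry": [None]})
mapping = {"A0": "shape_id", "A1": "building_id", "A2": "unique_no"}

try:
    toy.rename(columns=mapping, errors="raise")
    missing_detected = False
except KeyError as exc:
    # pandas names the missing label(s) in the exception message
    missing_detected = True
    print("Codes in the specification but not in the data:", exc)

# Default behaviour: missing keys are ignored, unmapped columns keep their names
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))
```

Columns that are not in the mapping, such as `geometry`, keep their original names in both cases.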

### Several ways to check basic information such as missing-value counts and data types
1. df.info()
2. df.describe()
3. self-made code

#### 1. df.info()

In [19]:
df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 6536 entries, 0 to 6535
Data columns (total 32 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   도형ID         6536 non-null   int64   
 1   GIS건물통합식별번호  6536 non-null   object  
 2   고유번호         6536 non-null   object  
 3   법정동코드        6536 non-null   object  
 4   법정동명         6536 non-null   object  
 5   특수지구분코드      6536 non-null   object  
 6   특수지구분명       6536 non-null   object  
 7   지번           6536 non-null   object  
 8   건물식별번호       6536 non-null   object  
 9   집합건물구분코드     6536 non-null   object  
 10  집합건물구분       6536 non-null   object  
 11  대장종류코드       6536 non-null   object  
 12  대장종류         6536 non-null   object  
 13  건물명          768 non-null    object  
 14  건물동명         350 non-null    object  
 15  건물연면적        6536 non-null   float64 
 16  건축물구조코드      6536 non-null   object  
 17  건축물구조명       6534 non-null   object  
 18  주요용도코드       6535 non-null   object  
 ...

#### 2. df.describe()

In [20]:
df.describe()

|  | 도형ID | 건물연면적 | 건물높이 | 지상층수 | 지하층수 | 건물연령 |
|---|---|---|---|---|---|---|
| count | 6536.0 | 6536.0 | 6533.0 | 6536.0 | 6518.0 | 4448.0 |
| mean | 61807630.0 | 589.936258 | 5.753874 | 3.35817 | 0.331697 | 33.597572 |
| std | 2230.548 | 1902.859245 | 9.658369 | 2.290211 | 0.565065 | 13.174698 |
| min | 61803660.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 61805680.0 | 83.865 | 0.0 | 2.0 | 0.0 | 26.0 |
| 50% | 61807680.0 | 179.275 | 0.0 | 3.0 | 0.0 | 34.0 |
| 75% | 61809580.0 | 453.37 | 11.35 | 4.0 | 1.0 | 45.0 |
| max | 61811400.0 | 40348.18 | 115.88 | 33.0 | 7.0 | 88.0 |


In [21]:
df.describe(include='O')

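|  | GIS건물통합식별번호 | 고유번호 | 법정동코드 | 법정동명 | 특수지구분코드 | 특수지구분명 | 지번 | 건물식별번호 | 집합건물구분코드 | 집합건물구분 | ... | 건축물구조명 | 주요용도코드 | 주요용도명 | 허가일자 | 사용승인일자 | 연령대구분코드 | 연령대구분명 | 연령대5계급코드 | 연령대5계급명 | 데이터기준일자 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6536 | 6536 | 6536 | 6536 | 6536 | 6536 | 6536 | 6536 | 6536 | 6536 | ... | 6534 | 6535 | 6535 | 4015 | 4448 | 6536 | 6536 | 6536 | 6536 | 6536 |
| unique | 6535 | 6293 | 40 | 40 | 2 | 2 | 4079 | 6521 | 2 | 2 | ... | 13 | 26 | 25 | 2869 | 3047 | 10 | 10 | 19 | 19 | 1 |
| top | 2018203345041779409000000000 | 2611012000100410331 | 2611010100 | 부산광역시 중구 영주동 | 1 | 일반 | 11-1 | 100171006 | 1 | 일반건축물 | ... | 철근콘크리트구조 | 1000 | 단독주택 | 1975-06-27 | 1975-09-15 | ZZ | 기타 | ZZZ | 구분없 | 2023-01-11 |
| freq | 2 | 17 | 1339 | 1339 | 6524 | 6524 | 17 | 2 | 5541 | 5541 | ... | 3462 | 2928 | 2928 | 69 | 85 | 2088 | 2088 | 2088 | 2088 | 6536 |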
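
In this output, `GIS건물통합식별번호` has 6535 unique values for 6536 rows and its top value has a `freq` of 2, so one identifier appears twice. To inspect such rows directly, something like the following could be used. It is a sketch on toy data, with a hypothetical `id` column standing in for the identifier column.

```python
import pandas as pd

# Toy data (hypothetical): 'id' is expected to be unique but one value repeats
toy = pd.DataFrame({"id": ["a", "b", "b", "c"], "floors": [1, 3, 3, 2]})

# Quick yes/no answer for a single column
print(toy["id"].is_unique)

# keep=False marks every member of a duplicate group, not only the later ones
dupes = toy[toy["id"].duplicated(keep=False)]
print(dupes)
```

Looking at the full duplicated rows helps decide whether they are exact repeats that can be dropped or distinct records that share an identifier.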


#### 3. self-made code

In [22]:
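col_list = list(df.columns.values)

# For each column, collect dtype, null / non-null counts, number of unique values,
# and example values
info_list = []
for i in col_list:
    if df[i].nunique() > 10:
        # Many distinct values: show only the first one as an example
        ex = df[i].unique()[0]
    else:
        # Few distinct values: show all of them
        ex = df[i].unique()
    info_list_a = [i, df[i].dtype, df[i].isnull().sum(), df[i].notnull().sum(), df[i].nunique(), ex]
    info_list.append(info_list_a)

df_info = pd.DataFrame(data = info_list, 
                       columns = ['col_name','dtype','isnull_sum','notnull_sum','nunique','ex'])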

In [23]:
df_info

|  | col_name | dtype | isnull_sum | notnull_sum | nunique | ex |
|---|---|---|---|---|---|---|
| 0 | 도형ID | int64 | 0 | 6536 | 6536 | 61803665 |
| 1 | GIS건물통합식별번호 | object | 0 | 6536 | 6535 | 1988202940501795945100000000 |
| 2 | 고유번호 | object | 0 | 6536 | 6293 | 2611010100100010092 |
| 3 | 법정동코드 | object | 0 | 6536 | 40 | 2611010100 |
| 4 | 법정동명 | object | 0 | 6536 | 40 | 부산광역시 중구 영주동 |
| 5 | 특수지구분코드 | object | 0 | 6536 | 2 | [1, 2] |
| 6 | 특수지구분명 | object | 0 | 6536 | 2 | [일반, 산] |
| 7 | 지번 | object | 0 | 6536 | 4079 | 1-92 |
| 8 | 건물식별번호 | object | 0 | 6536 | 6521 | 2376 |
| 9 | 집합건물구분코드 | object | 0 | 6536 | 2 | [2, 1] |
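
The counting part of this summary can also be assembled from pandas' column-wise methods instead of a loop. The sketch below is an alternative, not the code used above: the helper name `summarize` and the toy DataFrame are hypothetical, and the example-value column is left out for brevity.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, null / non-null counts, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "isnull_sum": frame.isnull().sum(),
        "notnull_sum": frame.notnull().sum(),
        "nunique": frame.nunique(),
    }).rename_axis("col_name").reset_index()

# Toy data (hypothetical): 'name' has one missing value and one distinct value
toy = pd.DataFrame({"id": [1, 2, 3], "name": ["a", None, "a"]})
summary = summarize(toy)
print(summary)
```

Each of these methods returns a Series indexed by column name, so pandas aligns them into one table without explicit iteration.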
