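### Language identification model based on letter frequencies
* Dataset: lang.zip
* Features: the 26 letters of the alphabet (a-z)
* Target/label: one `class` column with 4 values (en: English, fr: French, id: Indonesian, tl: Tagalog)
* Learning task: supervised learning >> classification >> multiclass classification (4 classes)
* Algorithm: deep learning with 3 layers (input layer, 1 hidden layer, output layer)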
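
The three-layer layout listed above can be illustrated with a forward pass written in plain Python. This is only a sketch under assumptions: the hidden size (`N_HIDDEN = 16`), the ReLU activation, the softmax output, and the helper names `init_layer`, `dense`, and `forward` are all illustrative choices that the notebook does not specify, and the weights are random rather than trained.

```python
import math
import random

N_FEATURES = 26   # one relative frequency per letter a-z
N_HIDDEN = 16     # assumption: the notebook does not state the hidden size
N_CLASSES = 4     # en, fr, id, tl


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Affine transform: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_layer, output_layer):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(x, *hidden_layer))
    return softmax(dense(hidden, *output_layer))


rng = random.Random(0)
hidden_layer = init_layer(N_FEATURES, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_CLASSES, rng)

# A uniform frequency vector stands in for one row of the feature table.
sample = [1 / N_FEATURES] * N_FEATURES
probs = forward(sample, hidden_layer, output_layer)
print(len(probs), round(sum(probs), 6))
```

The output is one probability per language class; with untrained weights the values are close to uniform, and training would adjust the weights so that the correct class receives the highest probability.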

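#### Key points
* Drop every character that is not one of the 26 letters
* Normalize letter case (lowercase everything)
* **The total number of letters differs from file to file => divide each letter count by the file's total letter count so that every file is expressed as relative frequencies.**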
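
The effect of the last point can be checked on a toy example. The sketch below uses a hypothetical helper `letter_ratios` (not part of the notebook) and `collections.Counter` to show that a short text and a much longer text with the same letter mix end up with identical feature values once the counts are divided by the total.

```python
from collections import Counter
import string


def letter_ratios(text):
    """Return the relative frequency of each letter a-z in `text`."""
    # Lowercase first, then keep only the 26 ASCII letters.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        # No letters at all: avoid dividing by zero.
        return {ch: 0.0 for ch in string.ascii_lowercase}
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


short = letter_ratios("Banana!")
long = letter_ratios("Banana! " * 1000)
print(short["a"], long["a"])  # both 0.5, although the raw counts are 3 and 3000
```

Raw counts would make the longer file look like it has 1000 times more of every letter; the ratios remove that dependence on file length.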

In [1]:
# Modules for data preprocessing and visualization
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt


In [5]:
# Open a *.txt file and return a dict of relative letter frequencies

def read_count(f_name, alphabets, data_path):
    """Return {letter: count / total letters} for one text file.

    Raises FileNotFoundError if the file is missing, so the caller never
    receives an undefined result.
    """
    file_path = os.path.join(data_path, f_name)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f'{file_path} does not exist.')

    # The recorded output shows the files being decoded as UTF-8; make that explicit
    with open(file_path, "r", encoding="UTF-8") as f:
        print(f, file_path, os.path.exists(file_path))

        # Lowercase everything so that 'A' and 'a' are counted together
        data = f.read().lower()

    # Drop every character outside a-z
    print("Length before letter filtering: ", len(data))
    data = ''.join(ch for ch in data if 'a' <= ch <= 'z')
    print("Length after letter filtering: ", len(data))

    total_len = len(data)
    if total_len == 0:
        # No letters at all: return zeros instead of dividing by zero
        return {ch: 0.0 for ch in alphabets}

    # Count each letter and divide by the total so that files of different
    # lengths are comparable. Following the order of `alphabets` keeps the
    # DataFrame columns sorted a-z.
    a_z = {ch: data.count(ch) / total_len for ch in alphabets}

    # Alternative: collections.Counter(data) produces the same counts in a
    # single pass, but its keys are not in alphabetical order, so the
    # DataFrame columns would need to be sorted afterwards.

    return a_z


In [6]:
path = "../language/train/"
file_list = os.listdir(path)

print ("file_list: {}".format(file_list))


alphabets=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

lang_class_list = []
alltxt_data_count = []

for file1 in file_list:

    lang_class_list.append(file1[:2])				# extract the language code: en, fr, id, tl

    alltxt_data_count.append( read_count(file1, alphabets, data_path=path) )


# Build the DataFrame
rawDF = pd.DataFrame(alltxt_data_count)
rawDF.head(10)

print(rawDF.isna().sum())

# Add the language class column
rawDF['class'] = lang_class_list
rawDF.head(3)

file_list: ['en-1.txt', 'en-2.txt', 'en-3.txt', 'en-4.txt', 'en-5.txt', 'fr-10.txt', 'fr-6.txt', 'fr-7.txt', 'fr-8.txt', 'fr-9.txt', 'id-11.txt', 'id-12.txt', 'id-13.txt', 'id-14.txt', 'id-15.txt', 'tl-16.txt', 'tl-17.txt', 'tl-18.txt', 'tl-19.txt', 'tl-20.txt']
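
The class label is taken from the start of each file name (`file1[:2]` in the cell above). The sketch below shows one way to extract that label and turn it into a numeric target for a 4-class model. The names `LANG_CODES`, `label_from_filename`, and `one_hot` are hypothetical, and splitting on `-` is an alternative to the fixed two-character slice, assuming every file name follows the `<code>-<number>.txt` pattern seen in `file_list`.

```python
LANG_CODES = ["en", "fr", "id", "tl"]   # prefixes that appear in file_list
CODE_TO_INDEX = {code: i for i, code in enumerate(LANG_CODES)}


def label_from_filename(f_name):
    """Return the text before the first '-' (e.g. 'fr-10.txt' -> 'fr')."""
    return f_name.split("-", 1)[0]


def one_hot(index, n_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(n_classes)]


files = ["en-1.txt", "fr-10.txt", "id-11.txt", "tl-16.txt"]
labels = [label_from_filename(name) for name in files]
targets = [CODE_TO_INDEX[label] for label in labels]
print(labels, targets)
print(one_hot(targets[2], len(LANG_CODES)))
```

Integer indices suit loss functions that expect class ids, while the one-hot form suits losses that compare against a full probability vector; which one is needed depends on the framework used for training.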


In [10]:
## TEST_DF
path = "../language/test/"
file_list = os.listdir(path)

print ("file_list: {}".format(file_list))


alphabets=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

lang_class_list = []
alltxt_data_count = []

for file1 in file_list:

    lang_class_list.append(file1[:2])				# extract the language code: en, fr, id, tl

    alltxt_data_count.append( read_count(file1, alphabets, data_path=path) )


# Build the DataFrame
raw_test_DF = pd.DataFrame(alltxt_data_count)
raw_test_DF.head(10)

# Check the test frame (not the training frame) for missing values
print(raw_test_DF.isna().sum())

# Add the language class column
raw_test_DF['class'] = lang_class_list
raw_test_DF.head(3)

file_list: ['en-1.txt', 'en-2.txt', 'fr-3.txt', 'fr-4.txt', 'id-5.txt', 'id-6.txt', 'tl-7.txt', 'tl-8.txt']
<_io.TextIOWrapper name='../language/test/en-1.txt' mode='r' encoding='UTF-8'> ../language/test/en-1.txt True
Length before letter filtering:  61410
Length after letter filtering:  45619
<_io.TextIOWrapper name='../language/test/en-2.txt' mode='r' encoding='UTF-8'> ../language/test/en-2.txt True
Length before letter filtering:  141276
Length after letter filtering:  101952
<_io.TextIOWrapper name='../language/test/fr-3.txt' mode='r' encoding='UTF-8'> ../language/test/fr-3.txt True
Length before letter filtering:  36833
Length after letter filtering:  26566
<_io.TextIOWrapper name='../language/test/fr-4.txt' mode='r' encoding='UTF-8'> ../language/test/fr-4.txt True
Length before letter filtering:  65045
Length after letter filtering:  45301
<_io.TextIOWrapper name='../language/test/id-5.txt' mode='r' encoding='UTF-8'> ../language/test/id-5.txt True
Length before letter filtering:  8455
Length after letter filtering:  6154
<_io.TextIOWrapper name='../language/test/id-6.txt' mode='r' encoding='UTF-8'> ../language/test/id-6.txt True
Length before letter filtering:  33524
Length after letter filtering:  25641
<_io.TextIOWrapp

| | a | b | c | d | e | f | g | h | i | j | ... | r | s | t | u | v | w | x | y | z | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.067823 | 0.013459 | 0.034328 | 0.048817 | 0.116114 | 0.020014 | 0.016002 | 0.022798 | 0.07692 | 0.002411 | ... | 0.070124 | 0.07955 | 0.075122 | 0.02591 | 0.014775 | 0.036103 | 0.005634 | 0.013087 | 0.000416 | en |
| 1 | 0.080283 | 0.016174 | 0.03535 | 0.038342 | 0.129865 | 0.016704 | 0.01895 | 0.042697 | 0.073986 | 0.004463 | ... | 0.066227 | 0.063599 | 0.07888 | 0.027631 | 0.013026 | 0.01488 | 0.002119 | 0.0133 | 0.001491 | en |
| 2 | 0.056764 | 0.012008 | 0.035835 | 0.049876 | 0.127155 | 0.013476 | 0.00862 | 0.007303 | 0.08605 | 0.002786 | ... | 0.067304 | 0.090078 | 0.068433 | 0.042912 | 0.013852 | 0.028909 | 0.009298 | 0.005157 | 0.000414 | fr |


In [12]:
# Save the DataFrames
SAVE_PATH = '../language/'
rawDF.to_csv(SAVE_PATH+'train_feature.csv', index=False)
raw_test_DF.to_csv(SAVE_PATH+'test_feature.csv', index=False)


In [11]:
pwd

'c:\\Users\\KDP-43\\Desktop\\딥러닝\\과제'