### Foreign Language Identification
: Determine which language a natural-language text written in the Latin alphabet belongs to

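### Identification Method
- Different languages use the letters of the alphabet with different frequencies (a well-known linguistic fact).
- The frequencies differ because each language has its own commonly used words and expressions.
- Count the occurrences of each letter from a to z and use the relative frequencies as features.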
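
As a minimal sketch of the feature described above, the function below turns a string into a 26-element vector of relative letter frequencies. The name `letter_frequencies` is an assumption made for this illustration and is not part of the notebook. Returning all zeros for text without any a-z letters is also a choice made here, to avoid dividing by zero.

```python
def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in `text`.

    Hypothetical helper, for illustration only.
    """
    counts = [0] * 26
    base = ord('a')
    for ch in text.lower():
        idx = ord(ch) - base
        if 0 <= idx < 26:  # skip digits, spaces, punctuation, accented letters
            counts[idx] += 1
    total = sum(counts)
    if total == 0:  # assumption: return zeros rather than divide by zero
        return [0.0] * 26
    return [c / total for c in counts]


freq = letter_frequencies("Abba")
print(freq[0], freq[1])  # shares of 'a' and 'b'
```

Because the counts are divided by the total, texts of different lengths produce comparable vectors.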

#### Collecting Sample Data
- Text collected from Wikipedia
- Four languages: English (en), French (fr), Indonesian (id), and Tagalog (tl)
- Train data: 20 files (5 each for English, French, Indonesian, and Tagalog)
- Test data: 8 files (2 per language)

In [100]:
from sklearn import svm, metrics
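import glob # collect the file names matching a pattern into a list
import os # operating-system utilities (path handling)
import re # regular expressions for parsing file names
import json # read and write JSON files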

### Processing the Train Data

In [101]:
# Load the train data
file_list = glob.glob('../Data/lang/train/*.txt')
file_list

['../Data/lang/train\\en-1.txt',
 '../Data/lang/train\\en-2.txt',
 '../Data/lang/train\\en-3.txt',
 '../Data/lang/train\\en-4.txt',
 '../Data/lang/train\\en-5.txt',
 '../Data/lang/train\\fr-10.txt',
 '../Data/lang/train\\fr-6.txt',
 '../Data/lang/train\\fr-7.txt',
 '../Data/lang/train\\fr-8.txt',
 '../Data/lang/train\\fr-9.txt',
 '../Data/lang/train\\id-11.txt',
 '../Data/lang/train\\id-12.txt',
 '../Data/lang/train\\id-13.txt',
 '../Data/lang/train\\id-14.txt',
 '../Data/lang/train\\id-15.txt',
 '../Data/lang/train\\tl-16.txt',
 '../Data/lang/train\\tl-17.txt',
 '../Data/lang/train\\tl-18.txt',
 '../Data/lang/train\\tl-19.txt',
 '../Data/lang/train\\tl-20.txt']
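
The listing above shows that each file name starts with a lowercase language code followed by a hyphen and a number, so the label can be recovered from the name alone. Here is a minimal sketch, assuming that naming convention holds for every file. `lang_from_filename` is a hypothetical helper, and `pathlib.PureWindowsPath` is used only so that both `/` and `\` separators (as in the output above) are handled on any operating system.

```python
import re
from pathlib import PureWindowsPath


def lang_from_filename(path):
    """Extract the leading language code from a name such as 'fr-10.txt'."""
    name = PureWindowsPath(path).name  # accepts both '/' and '\\' separators
    m = re.match(r'^([a-z]{2,})-', name)
    return m.group(1) if m else None  # None when the name does not fit the pattern


print(lang_from_filename('../Data/lang/train\\fr-10.txt'))  # fr
print(lang_from_filename('../Data/lang/inputTest.txt'))     # None
```

Requiring the hyphen after the code makes unlabeled files return `None` instead of an accidental label.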

In [102]:
freqs = []
labels = []

for fname in file_list:
    name = os.path.basename(fname) # keep only the file name from the full path
    lang = re.match(r'^[a-z]{2,}', name).group() # language code at the start of the file name
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower() # convert to lowercase
    cnt = [0 for _ in range(0, 26)] # initialize the 26 letter counts to 0
    # Code points of 'a' and 'z' bound the range of characters to count
    code_a = ord('a')
    code_z = ord('z')

    # Count how often each letter occurs
    for ch in text:
        n = ord(ch)
        if code_a <= n <= code_z: # only characters between a and z
            cnt[n - code_a] += 1

    # Normalize counts to relative frequencies
    total = sum(cnt)
    freq = [cnt[i] / total for i in range(0, len(cnt))]
    freqs.append(freq)
    labels.append(lang)

In [103]:
len(freqs)

20

In [104]:
labels[0]

'en'

In [105]:
data = {
    'freqs': freqs,
    'labels': labels
}

data 


{'freqs': [[0.07595212187159957,
   0.012840043525571273,
   0.04570184983677911,
   0.04613710554951034,
   0.10533188248095757,
   0.015669205658324265,
   0.019151251360174103,
   0.043743199129488576,
   0.07399347116430903,
   0.0017410228509249185,
   0.00544069640914037,
   0.05375408052230685,
   0.026332970620239392,
   0.07747551686615888,
   0.08966267682263329,
   0.016539717083786723,
   0.0,
   0.07769314472252448,
   0.061371055495103376,
   0.08052230685527748,
   0.02589771490750816,
   0.009793253536452665,
   0.014145810663764961,
   0.0006528835690968443,
   0.02002176278563656,
   0.0004352557127312296],
  [0.08417789436031954,
   0.019911768212710148,
   0.030404196971503518,
   0.038869679265529984,
   0.13699773458924527,
   0.017407893167998092,
   0.031238821986407535,
   0.02742339334684631,
   0.07535471563133421,
   0.0026231071896983425,
   0.009777035888875641,
   0.042327411470132345,
   0.024204125432216526,
   0.05353523309884345,
   0.0687969476570883

In [106]:
data['freqs'][0]

[0.07595212187159957,
 0.012840043525571273,
 0.04570184983677911,
 0.04613710554951034,
 0.10533188248095757,
 0.015669205658324265,
 0.019151251360174103,
 0.043743199129488576,
 0.07399347116430903,
 0.0017410228509249185,
 0.00544069640914037,
 0.05375408052230685,
 0.026332970620239392,
 0.07747551686615888,
 0.08966267682263329,
 0.016539717083786723,
 0.0,
 0.07769314472252448,
 0.061371055495103376,
 0.08052230685527748,
 0.02589771490750816,
 0.009793253536452665,
 0.014145810663764961,
 0.0006528835690968443,
 0.02002176278563656,
 0.0004352557127312296]
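
Each position in this vector corresponds to one letter, with index 0 for `a` and index 25 for `z`. As a small sketch for reading the vector, the hypothetical helper `top_letters` (not part of the notebook) pairs the values with their letters and lists the most frequent ones. The numbers are copied from the output above.

```python
import string

# First train vector, copied verbatim from the output above.
freq = [
    0.07595212187159957, 0.012840043525571273, 0.04570184983677911,
    0.04613710554951034, 0.10533188248095757, 0.015669205658324265,
    0.019151251360174103, 0.043743199129488576, 0.07399347116430903,
    0.0017410228509249185, 0.00544069640914037, 0.05375408052230685,
    0.026332970620239392, 0.07747551686615888, 0.08966267682263329,
    0.016539717083786723, 0.0, 0.07769314472252448,
    0.061371055495103376, 0.08052230685527748, 0.02589771490750816,
    0.009793253536452665, 0.014145810663764961, 0.0006528835690968443,
    0.02002176278563656, 0.0004352557127312296,
]


def top_letters(freq, k=3):
    """Return the k most frequent letters with their relative frequencies."""
    pairs = zip(string.ascii_lowercase, freq)
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


for letter, share in top_letters(freq):
    print(letter, round(share, 3))
```

For this file the three most frequent letters are `e`, `o`, and `t`, while `q` does not occur at all.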

In [107]:
data['labels'][0]

'en'

---
### Processing the Test Data


In [108]:
# Load the test data
file_list = glob.glob('../Data/lang/test/*.txt')
file_list

['../Data/lang/test\\en-1.txt',
 '../Data/lang/test\\en-2.txt',
 '../Data/lang/test\\fr-3.txt',
 '../Data/lang/test\\fr-4.txt',
 '../Data/lang/test\\id-5.txt',
 '../Data/lang/test\\id-6.txt',
 '../Data/lang/test\\tl-7.txt',
 '../Data/lang/test\\tl-8.txt']

In [109]:
freqs = []
labels = []

for fname in file_list:
    name = os.path.basename(fname) # keep only the file name from the full path
    lang = re.match(r'^[a-z]{2,}', name).group() # language code at the start of the file name
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower() # convert to lowercase
    cnt = [0 for _ in range(0, 26)] # initialize the 26 letter counts to 0
    # Code points of 'a' and 'z' bound the range of characters to count
    code_a = ord('a')
    code_z = ord('z')

    # Count how often each letter occurs
    for ch in text:
        n = ord(ch)
        if code_a <= n <= code_z: # only characters between a and z
            cnt[n - code_a] += 1

    # Normalize counts to relative frequencies
    total = sum(cnt)
    freq = [cnt[i] / total for i in range(0, len(cnt))]
    freqs.append(freq)
    labels.append(lang)


In [110]:
test = {
    'freqs': freqs,
    'labels': labels
}

test 

{'freqs': [[0.06782261776891207,
   0.013459304237269558,
   0.03432780201231943,
   0.04881737872377737,
   0.11611389991012516,
   0.020013590828382912,
   0.016002104386330256,
   0.022797518577785572,
   0.07691970450908613,
   0.002411276003419628,
   0.005502093425984787,
   0.03827352638155155,
   0.030097108660864992,
   0.07400425261404239,
   0.08373703939148162,
   0.028321532694710536,
   0.000657620728205353,
   0.07012429031763082,
   0.07955018742190754,
   0.0751222078519915,
   0.02591025669129091,
   0.014774545693680264,
   0.036103377978473884,
   0.005633617571625857,
   0.013086652491286526,
   0.00041649312786339027],
  [0.08028287821720025,
   0.016174278091650972,
   0.03534996861268048,
   0.0383415725047081,
   0.12986503452605147,
   0.016703939108600126,
   0.01895009416195857,
   0.04269656308851224,
   0.07398579723791589,
   0.004462884494664155,
   0.00673846516007533,
   0.043324309478970494,
   0.025384494664155682,
   0.07053319209039548,
   0.073926

In [111]:
test['freqs'][0]

[0.06782261776891207,
 0.013459304237269558,
 0.03432780201231943,
 0.04881737872377737,
 0.11611389991012516,
 0.020013590828382912,
 0.016002104386330256,
 0.022797518577785572,
 0.07691970450908613,
 0.002411276003419628,
 0.005502093425984787,
 0.03827352638155155,
 0.030097108660864992,
 0.07400425261404239,
 0.08373703939148162,
 0.028321532694710536,
 0.000657620728205353,
 0.07012429031763082,
 0.07955018742190754,
 0.0751222078519915,
 0.02591025669129091,
 0.014774545693680264,
 0.036103377978473884,
 0.005633617571625857,
 0.013086652491286526,
 0.00041649312786339027]

In [112]:
test['labels'][0]

'en'

### Saving the Results as JSON

In [113]:
with open('../Data/lang/freq2.json', 'w', encoding='utf-8') as fp:
    json.dump([data, test], fp)
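
The saved file can be read back with `json.load`, which returns the same two-element list `[data, test]`. Below is a minimal round-trip sketch. So that it runs on its own, it uses a temporary file and tiny placeholder values rather than the real frequencies; the file name `freq_demo.json` and the variable names are assumptions for illustration.

```python
import json
import os
import tempfile

# Placeholder structures with the same shape as `data` and `test` above;
# the numbers are made up for illustration.
data_demo = {'freqs': [[0.5, 0.5]], 'labels': ['en']}
test_demo = {'freqs': [[0.25, 0.75]], 'labels': ['fr']}

path = os.path.join(tempfile.mkdtemp(), 'freq_demo.json')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as fp:
    json.dump([data_demo, test_demo], fp)

with open(path, 'r', encoding='utf-8') as fp:
    loaded_data, loaded_test = json.load(fp)  # unpack the two-element list

print(loaded_data['labels'], loaded_test['labels'])
```

Storing the features this way means the text files do not need to be parsed again before retraining.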

In [114]:
# Train the model
clf = svm.SVC()
clf.fit(data['freqs'], data['labels'])

In [115]:
# Predict
pred = clf.predict(test['freqs'])
pred


array(['en', 'en', 'fr', 'fr', 'id', 'id', 'tl', 'tl'], dtype='<U2')

In [116]:
metrics.accuracy_score(test['labels'], pred)

1.0
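
The accuracy score is the fraction of test files whose predicted label equals the true label. With only 8 test files, each file accounts for 12.5 percentage points, so a score of 1.0 here means all 8 predictions matched. The sketch below recomputes the score by hand and adds a per-language breakdown. The helper names are assumptions for illustration; the true labels follow from the test file names listed earlier, and the predictions are taken from the output above.

```python
from collections import Counter

# True labels (from the test file names) and predictions (from the output above).
y_true = ['en', 'en', 'fr', 'fr', 'id', 'id', 'tl', 'tl']
y_pred = ['en', 'en', 'fr', 'fr', 'id', 'id', 'tl', 'tl']


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


print(accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```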

---
### Predicting on an External Text

In [117]:
fname = '../Data/lang/inputTest.txt'

with open(fname, "r", encoding="utf-8") as f:
    text = f.read()
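text = text.lower() # convert to lowercase
cnt = [0 for _ in range(0, 26)] # initialize the 26 letter counts to 0
# Code points of 'a' and 'z' bound the range of characters to count
code_a = ord("a")
code_z = ord("z")

# Count how often each letter occurs
for ch in text:
    n = ord(ch)
    if code_a <= n <= code_z: # only characters between a and z
        cnt[n - code_a] += 1

# Normalize counts to relative frequencies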
total = sum(cnt)
freq = [cnt[i]/total for i in range(0, len(cnt))]
freq

[0.1048498845265589,
 0.006004618937644342,
 0.03048498845265589,
 0.04387990762124711,
 0.1117782909930716,
 0.020323325635103928,
 0.011085450346420323,
 0.039722863741339494,
 0.09330254041570439,
 0.005542725173210162,
 0.004157043879907622,
 0.03926096997690531,
 0.020323325635103928,
 0.08129330254041571,
 0.05450346420323326,
 0.022632794457274827,
 0.0,
 0.05912240184757506,
 0.09145496535796767,
 0.08960739030023095,
 0.018013856812933025,
 0.010623556581986143,
 0.020785219399538105,
 0.0018475750577367205,
 0.017551963048498844,
 0.0018475750577367205]

In [118]:
# Predict
clf.predict([freq]).tolist()

['en']

---
### Refactoring into Functions

In [119]:
def check_freq(fname):
    # Label: language code at the start of the file name
    name = os.path.basename(fname)
    lang = re.match(r'^[a-z]{2,}', name).group()
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower()
    cnt = [0 for _ in range(26)]
    code_a = ord('a')
    code_z = ord('z')

    # Count how often each letter a-z occurs
    for ch in text:
        n = ord(ch)
        if code_a <= n <= code_z:
            cnt[n-code_a] += 1
    
    # Normalize counts to relative frequencies
    total = sum(cnt)
    freq = [cnt[i] / total for i in range(len(cnt))]
    return (freq, lang)

In [120]:
# Process every file matching the pattern
def load_files(path):
    freqs = []
    labels = []  # ground-truth labels
    file_list = glob.glob(path)
    for fname in file_list:
        r = check_freq(fname)
        freqs.append(r[0])
        labels.append(r[1])
    return {'freqs': freqs, 'labels': labels}

In [121]:
data = load_files('../Data/lang/train/*.txt')

In [122]:
test = load_files('../Data/lang/test/*.txt')

In [123]:
with open('../Data/lang/freq1.json', 'w', encoding='utf-8') as fp:
    json.dump([data, test], fp)

In [124]:
clf = svm.SVC()
clf.fit(data['freqs'], data['labels'])

In [125]:
pred = clf.predict(test['freqs'])

In [126]:
metrics.accuracy_score(test['labels'], pred)

1.0