### Language Detection with Naive Bayes

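This is the same kind of language detection that Google Translate and Baidu Translate perform on input text.

The dataset here covers 6 languages. Let's take a look.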

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', header=None)

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9066 entries, 0 to 9065
Data columns (total 2 columns):
0    9066 non-null object
1    9066 non-null object
dtypes: object(2)
memory usage: 141.7+ KB


In [3]:
from sklearn.model_selection import train_test_split

split_train, split_test = train_test_split(data, random_state=0)

In [4]:
X_train = split_train[0]
y_train = split_train[1]
X_test = split_test[0]
y_test = split_test[1]
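
`train_test_split` is called above without a `test_size`, so scikit-learn's default of holding out 25% of the rows applies. The sketch below only reproduces the arithmetic for the 9066 rows reported by `data.info()`; the helper name `default_split_sizes` is made up for illustration and is not part of scikit-learn.

```python
import math

def default_split_sizes(n_samples, test_fraction=0.25):
    """Row counts for a split that holds out `test_fraction` of the data.

    Assumes the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = default_split_sizes(9066)
print(n_train, n_test)  # 6799 2267
```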

Remove noise (URLs, @mentions, and #hashtags) from the text:

In [5]:
import re
def remove_noise(document):
    # Strip URLs, @mentions, and #hashtags.
    # Raw strings avoid invalid-escape warnings for \S and \w.
    noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))
    clean_text = noise_pattern.sub("", document)
    return clean_text.strip()

remove_noise("Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html")

'Trump images are now more popular than cat gifs.'

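To extract word features from the text we again use the bag-of-words model, optionally with n-grams. For example:

### Bag-of-words model (1-gram by default)

我是中国人，我爱我的祖国 ("I am Chinese, I love my country") => 我 是 中国 人 ，我 爱 我 的 祖国 => (我:3 是:1 中国:1 人:1 爱:1 的:1 祖国:1)

Vocabulary: 40,000 words

Represented as a bag of words, a sentence becomes a count vector with one entry per vocabulary word, for example: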

[0, 0, 0, ..., 0]
[3, 1, 0, 1, ...]
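
The counting step can be sketched with `collections.Counter`. This assumes the sentence has already been segmented into words, and it uses a tiny made-up five-word vocabulary (the word 喜欢 is included only to show a zero entry) rather than the 40,000-word one mentioned above.

```python
from collections import Counter

# Pre-segmented tokens of the example sentence (segmentation is assumed).
tokens = ["我", "是", "中国", "人", "，", "我", "爱", "我", "的", "祖国"]

# Count every token except the punctuation mark.
counts = Counter(tok for tok in tokens if tok != "，")

# Hypothetical toy vocabulary; its order fixes the vector's column order.
vocabulary = ["我", "是", "喜欢", "中国", "人"]

# A Counter returns 0 for words that never occurred.
vector = [counts[word] for word in vocabulary]
print(vector)  # [3, 1, 0, 1, 1]
```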

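### n-gram language model (captures the change in meaning when subject and object are swapped)

李雷喜欢韩梅梅 ("Li Lei likes Han Meimei")

[李雷, 喜欢, 韩梅梅, 李雷 喜欢, 喜欢 韩梅梅]

韩梅梅喜欢李雷 ("Han Meimei likes Li Lei")

[李雷, 喜欢, 韩梅梅, 韩梅梅 喜欢, 喜欢 李雷]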
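
A minimal sketch of how such feature lists can be built, assuming the sentences are already tokenized. The helper `word_ngrams` is illustrative only. It shows that the two sentences share the same unigrams but have no bigram in common, which is what lets the model tell them apart.

```python
def word_ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n, each joined with a space."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = word_ngrams(["李雷", "喜欢", "韩梅梅"])
b = word_ngrams(["韩梅梅", "喜欢", "李雷"])

print(a)  # ['李雷', '喜欢', '韩梅梅', '李雷 喜欢', '喜欢 韩梅梅']
print(b)  # ['韩梅梅', '喜欢', '李雷', '韩梅梅 喜欢', '喜欢 李雷']

# The first three entries are unigrams, the last two are bigrams.
print(set(a[:3]) == set(b[:3]))  # True
print(set(a[3:]) & set(b[3:]))   # set()
```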

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

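vec = CountVectorizer(
            lowercase=True,       # note: has no effect once a custom preprocessor is supplied
            analyzer='char_wb',   # character n-grams taken inside word boundaries
            ngram_range=(1, 2),   # single characters and character pairs
            max_features=1000,    # keep only the 1000 most frequent n-grams
            preprocessor=remove_noise)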
vec.fit(X_train)

CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2),
        preprocessor=<function remove_noise at 0x114181ae8>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

In [7]:
vec.transform(['10 der welt sind bei'])

<1x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 34 stored elements in Compressed Sparse Row format>
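
To see what kind of features the `char_wb` analyzer with `ngram_range=(1, 2)` produces, here is a rough pure-Python approximation. It is a sketch, not scikit-learn's implementation: it assumes each whitespace-separated word is padded with one space on either side and then sliced into character n-grams, and it ignores the library's extra handling of whitespace and very short words.

```python
def char_wb_ngrams(text, min_n=1, max_n=2):
    """Approximate character n-grams taken inside word boundaries."""
    grams = []
    for word in text.split():
        padded = " " + word + " "  # mark the word boundaries
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams

print(char_wb_ngrams("der"))
# [' ', 'd', 'e', 'r', ' ', ' d', 'de', 'er', 'r ']
```

Character-level features like these work well for language identification because letter pairs such as `de` or `er` occur at very different rates across languages.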

In [8]:
# Vocabulary
# vec.vocabulary_

In [9]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(X_train), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
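
For intuition about what the classifier computes, the following is a minimal multinomial Naive Bayes sketch with additive smoothing (`alpha=1.0`, matching the value shown in the output above). It is a simplified illustration on a made-up four-document corpus, not scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc}
    class_sizes = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_sizes.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)  # additive (Laplace) smoothing
        log_prior = math.log(n_docs / len(docs))
        log_like = {tok: math.log((token_counts[label][tok] + alpha) / denom)
                    for tok in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        # Tokens never seen during training are skipped.
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)

# Toy corpus, invented for this example.
docs = [["the", "cat"], ["the", "dog"], ["der", "hund"], ["die", "katze"]]
labels = ["en", "en", "de", "de"]
model = train_nb(docs, labels)

print(predict_nb(model, ["the", "dog"]))    # en
print(predict_nb(model, ["der", "katze"]))  # de
```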

In [10]:
classifier.score(vec.transform(X_test), y_test)

0.976621085134539
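
For a scikit-learn classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label equals the true label. A minimal sketch of that calculation on made-up labels (the `accuracy` helper is illustrative, not a library function):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Invented labels: three of four predictions are correct.
print(accuracy(["en", "de", "es", "nl"], ["en", "de", "es", "en"]))  # 0.75
```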

Training on a larger corpus would likely push the accuracy higher.

### Wrapping it in a class

In [11]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

class LanguageDetector():
    
    def __init__(self):
        self.classifier = MultinomialNB()
        # No analyzer is passed here, so this vectorizer uses the default
        # word-level analyzer rather than the 'char_wb' analyzer from In [6].
        self.vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1000,
                                          preprocessor=self._remove_noise)
        
    def _remove_noise(self, document):
        # Strip URLs, @mentions, and #hashtags.
        noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))
        clean_text = noise_pattern.sub("", document)
        return clean_text.strip()
    
    def features(self, X):
        return self.vectorizer.transform(X)
    
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)
        
    def predict(self, x):
        return self.classifier.predict(self.features(x))
    
    def score(self, x, y):
        return self.classifier.score(self.features(x), y)

In [12]:
language_detector = LanguageDetector()
language_detector.fit(X_train, y_train)

In [13]:
language_detector.predict(X_test)

array(['nl', 'nl', 'es', ..., 'es', 'es', 'en'], dtype='<U2')

In [14]:
language_detector.predict(['This is an English sentence'])

array(['en'], dtype='<U2')

In [15]:
language_detector.score(X_test, y_test)

0.9801499779444199