# Introduction
#### she changlue
24th April 2017

This project uses a simple classification model to handle a topic classification problem. It is a classical supervised learning task: we manually labeled about 2,000 samples and use them to train a logistic regression model. This version ignores word order (bag of words, BOW) and uses TF-IDF to select the key words that serve as features.
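To make the bag-of-words assumption concrete, here is a minimal sketch showing that two token lists with the same words in a different order produce the same representation. The token lists and the helper name `bag_of_words` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def bag_of_words(tokens):
    """Return token counts; the order of the tokens is discarded."""
    return Counter(tokens)

# Two hypothetical token lists: same words, different order.
doc_a = ["客户", "承诺", "还款"]
doc_b = ["还款", "客户", "承诺"]

same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_bag)
```

Any information carried only by word order, such as who promised what to whom, is therefore invisible to the model.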

This notebook proceeds as follows:
1. load the libraries and the raw corpus data
2. cut the corpus into token lists (money terms, number terms, and infrequent terms are mapped to the special tokens MONE, NUMB, and IFQT)
3. compute the TF-IDF scores and keep the 100 most important tokens as the features
4. construct the training and testing data sets
5. train the logistic regression model
6. evaluate the model results
7. save the model

### 1) load libraries and raw corpus data

In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
import matplotlib
from random import shuffle

from collections import Counter

from sklearn.linear_model import LogisticRegression  # linear classification model
 
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
                rc={"lines.linewidth": 2.5})

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import jieba.posseg as pseg # cut the documents with token and tags
import jieba
 

###### set hyperparameters

In [9]:
HYPA_featureNums = 100  # number of top TF-IDF tokens kept as features

In [10]:
rawdata = pd.read_csv('corpus/催收sample2.csv', header=0,encoding='gbk')

#### corpus briefing

In [11]:
rawdata.head()

Unnamed: 0,opr_rem,Unnamed: 1,cls,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,今天下午5点472,,4,,,,
1,\t告知客户电话15226088432,,3,,,,
2,说是叔侄关系 转告,,3,,,,
3,客户承诺今天下午五点存入184,,4,,,,
4,已提醒征信影响。已告知从今天凌晨开始又会多增加75元的滞纳金。客户敷衍明天去解决,,4,,,,


In [12]:
rawdata.shape

(1996, 7)

In [13]:
rawdata.describe()



Unnamed: 0,Unnamed: 1,cls,Unnamed: 3,Unnamed: 4,Unnamed: 5
count,0.0,1996.0,0.0,0.0,7.0
mean,,4.115731,,,4.0
std,,1.149538,,,2.160247
min,,1.0,,,1.0
25%,,4.0,,,
50%,,4.0,,,
75%,,4.0,,,
max,,7.0,,,7.0
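Because the 25%, 50%, and 75% quartiles of `cls` all equal 4, roughly half or more of the labels are class 4, so any accuracy figure is easier to read next to a majority-class baseline. The sketch below uses hypothetical toy labels rather than the real column, and `majority_baseline` is an illustrative helper that does not appear in the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, share of samples carrying that label)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels, NOT the real `cls` column.
toy_labels = [4, 3, 3, 4, 4, 4, 1, 4]

label, share = majority_baseline(toy_labels)
print(label, share)
```

On the real data, `majority_baseline(list(rawdata['cls']))` would give the accuracy of a classifier that always predicts the most frequent class.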


### 2) cut the corpus into a list format

In [14]:
tokenCorpus  = []  # corpus as lists of segmented tokens
rawSentences = []  # raw text
labels       = []  # training labels
#documents    = list(rawdata[rawdata['消息目标']=='机器人']['消息内容'])  # text which is sent by customers
documents    = list(rawdata['opr_rem'])  # text to classify
LABELS       = list(rawdata['cls'])

In [15]:
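# construct the corpus
for idx,sentence in enumerate(documents):
    if len(str(sentence))>4:  # skip missing or very short entries
        sentence = sentence.replace('\t','')
        tokens = []    
        for pair in pseg.lcut(sentence):
            # keep words whose part-of-speech flag is in this list (such as time words, nouns, verbs)
            if pair.flag in ['t','n','ns','vs','nv','v']:
                tokens.append(pair.word)
            elif pair.flag=='m':  # numerals
                if len(str(pair.word))==11:  # 11 characters: treated as a phone number
                    tokens.append('NUMB')
                else:  # every other numeral is treated as a money amount
                    tokens.append('MONE')   
        if len(tokens)>1:  # keep only sentences with at least two tokens
            tokenCorpus.append(tokens)
            rawSentences.append(sentence)
            labels.append(LABELS[idx])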

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\changlue.she\AppData\Local\Temp\jieba.cache
Loading model cost 1.108 seconds.
Prefix dict has been built succesfully.
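The tagging rules in the loop above can be isolated from jieba and checked on their own. This is a minimal sketch under stated assumptions: `normalize` and `KEPT_FLAGS` are illustrative names, and the `(word, flag)` pairs are written by hand, whereas in the notebook they come from `pseg.lcut`.

```python
KEPT_FLAGS = {'t', 'n', 'ns', 'vs', 'nv', 'v'}  # same flag list as the cell above

def normalize(pairs):
    """Map (word, flag) pairs to kept words and NUMB/MONE placeholders."""
    tokens = []
    for word, flag in pairs:
        if flag in KEPT_FLAGS:
            tokens.append(word)
        elif flag == 'm':
            # 11-character numerals are treated as phone numbers, the rest as money
            tokens.append('NUMB' if len(word) == 11 else 'MONE')
        # every other flag is dropped
    return tokens

# Hand-written pairs for illustration; real flags are assigned by jieba's tagger.
pairs = [('告知', 'v'), ('客户', 'n'), ('电话', 'n'), ('15226088432', 'm'), ('的', 'uj')]
tokens = normalize(pairs)
print(tokens)
```

Note that the rule is purely length-based, so any numeral that is not exactly 11 characters long, including times and dates, ends up as `MONE`.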


In [16]:
tokenCorpus[:3],labels[:3]

([['MONE', 'MONE', 'MONE'],
  ['告知', '客户', '电话', 'NUMB'],
  ['说', '是', '叔侄', '关系', '转告']],
 [4, 3, 3])

In [17]:
rawSentences[:5],labels[:5]

(['今天下午5点472',
  '告知客户电话15226088432',
  '说是叔侄关系  转告 ',
  '客户承诺今天下午五点存入184',
  '已提醒征信影响。已告知从今天凌晨开始又会多增加75元的滞纳金。客户敷衍明天去解决'],
 [4, 3, 3, 4, 4])
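The step list in the introduction also mentions an `IFQT` token for infrequent terms, which the cells above do not apply. One possible way to add it is sketched below; the helper `replace_infrequent`, the `min_count` threshold, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

RESERVED = {'MONE', 'NUMB'}  # placeholders that are never replaced

def replace_infrequent(token_corpus, min_count=2):
    """Replace tokens seen fewer than min_count times in the corpus with 'IFQT'."""
    counts = Counter(token for tokens in token_corpus for token in tokens)
    return [
        [t if t in RESERVED or counts[t] >= min_count else 'IFQT' for t in tokens]
        for tokens in token_corpus
    ]

# Hypothetical tokenized corpus.
toy_corpus = [['客户', '承诺', 'MONE'], ['客户', '转告'], ['承诺', 'NUMB']]
cleaned = replace_infrequent(toy_corpus)
print(cleaned)
```

If used, it would be applied to `tokenCorpus` before the TF-IDF step so that rare tokens share a single feature.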

### 3) compute the TF-IDF

In [18]:
TF_term  = []   # one term-frequency dict per document
IDF_term = []   # each distinct token once per document, used to count document frequencies
ALL_term = []   # every token in the corpus, used to count overall token frequencies

##### put tokens into the containers

In [19]:
for tokens in tokenCorpus:
    tf = dict(Counter(tokens))
    TF_term.append(tf)
    IDF_term+=tf.keys()
    ALL_term+=tokens

###### compute the tf-idf values

In [20]:
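wordFreq  = dict(Counter(ALL_term))  # overall token counts (not used below)
allDocNum = len(TF_term)             # number of documents
IDF       = dict(Counter(IDF_term))  # document frequency of each token
TF_IDF    = dict()                   # TF-IDF score of each token, summed over all documents
for tf in TF_term:
    for word in tf.keys():
        TF_IDF.setdefault(word,0)
        TF_IDF[word]+=tf[word]*np.log(allDocNum/IDF[word]) 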
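The score computed above is corpus-level: for each token, the per-document term frequency times log(number of documents / document frequency) is summed over all documents. The sketch below reproduces that computation on a hypothetical three-document corpus using only the standard library; `corpus_tfidf` is an illustrative name.

```python
import math
from collections import Counter

def corpus_tfidf(token_corpus):
    """Sum tf * log(N / df) over all documents for every token."""
    n_docs = len(token_corpus)
    doc_freq = Counter()
    for tokens in token_corpus:
        doc_freq.update(set(tokens))  # count each token once per document
    scores = Counter()
    for tokens in token_corpus:
        for word, tf in Counter(tokens).items():
            scores[word] += tf * math.log(n_docs / doc_freq[word])
    return dict(scores)

# Hypothetical corpus: '客户' occurs in every document, 'MONE' twice in one document.
toy_corpus = [['客户', '电话'], ['客户', 'MONE', 'MONE'], ['客户']]
scores = corpus_tfidf(toy_corpus)
print(scores)
```

A token that appears in every document gets a score of exactly zero, because log(N / N) = 0, so very common words cannot be selected as features by this criterion.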

###### get the 100 most important tokens as the features

In [21]:
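vocab = list(TF_IDF.keys())  # build the token list once instead of once per selected feature
idxs = np.argsort(-np.array(list(TF_IDF.values())))[:HYPA_featureNums]  # indices of the highest scores
features = [vocab[idx] for idx in idxs]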
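The same top-k selection can be written without index bookkeeping by sorting the dictionary keys by score. This is a sketch with a hypothetical score dictionary; `top_k_tokens` is an illustrative name.

```python
def top_k_tokens(scores, k):
    """Return the k tokens with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical scores.
toy_scores = {'客户': 0.0, '电话': 1.1, 'MONE': 2.2, '转告': 0.7}
selected = top_k_tokens(toy_scores, 2)
print(selected)
```

In the notebook this would correspond to `top_k_tokens(TF_IDF, HYPA_featureNums)`.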

### 4) construct the training and testing data sets

In [22]:
featMat = np.array([[1  if feature in tokens else 0 for feature in features ] for tokens in tokenCorpus])
labels  = np.array(labels) 
samp_idx= np.arange(len(featMat))
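Each row of `featMat` marks which of the selected features occur in a document. A minimal sketch of the same encoding on toy data follows; `to_binary_features` and the inputs are illustrative assumptions, and the result is a plain list of lists that could be wrapped in `np.array`.

```python
def to_binary_features(token_corpus, features):
    """Encode each document as 0/1 indicators of feature presence."""
    rows = []
    for tokens in token_corpus:
        present = set(tokens)  # set lookup avoids rescanning the token list per feature
        rows.append([1 if feature in present else 0 for feature in features])
    return rows

# Hypothetical features and documents.
toy_features = ['MONE', '客户', '电话']
toy_corpus = [['客户', 'MONE', 'MONE'], ['转告', '关系']]
matrix = to_binary_features(toy_corpus, toy_features)
print(matrix)
```

The encoding records presence only, so a token that occurs twice in a document contributes the same value as one that occurs once, and a document with none of the selected features becomes an all-zero row.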

In [23]:
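shuffle(samp_idx)  # shuffle the sample indices in place; no seed is set, so the split changes between runs
trainNum = int(0.7*len(samp_idx))  # 70% of the samples for training
trainIdx = samp_idx[:trainNum]
testIdx  = samp_idx[trainNum:]     # remaining 30% for testing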
trainX   = featMat[trainIdx]
trainy   = labels[trainIdx]
testX    = featMat[testIdx]
testy    = labels[testIdx]
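The split above is not seeded, so the reported scores will vary from run to run. A reproducible variant is sketched below with the standard library; `split_indices` and the `seed` parameter are assumptions added for illustration.

```python
import random

def split_indices(n_samples, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # private generator; global random state is untouched
    n_train = int(train_frac * n_samples)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

The returned index lists can be used exactly like `trainIdx` and `testIdx` to slice `featMat` and `labels`.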

### 5) train the logistic regression model

In [24]:
cpara = [1,10,100,1000,0.1,0.01,0.001,0.0001,2,3,4,0.5,0.2,0.3,0.4,0.6,0.7,0.8,0.9]
petype =['l1','l2']

In [25]:
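bstScore = 0
for pty in petype:
    for cp in cpara:
        # liblinear supports both 'l1' and 'l2'; the default solver in recent scikit-learn versions rejects 'l1'
        clf = LogisticRegression(C=cp,penalty=pty,solver='liblinear')
        clf.fit(trainX,trainy)
        teScore = clf.score(testX,testy)
        if teScore>bstScore:
            bstScore = teScore
            # print train accuracy, test accuracy, and the parameters of the best model so far
            print(clf.score(trainX,trainy),teScore,'|para:[c]',cp,'[penalty]',pty)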

0.803230543319 0.799657534247 |para:[c] 1 [penalty] l1
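The loop above picks `C` and the penalty by their score on the test set, which makes the reported test accuracy optimistic. A common alternative is to select on a separate validation split and touch the test set only once at the end. The sketch below shows just the selection loop, with the model hidden behind a hypothetical `fit_and_score(C, penalty)` callable that is assumed to return validation accuracy; the stand-in scorer exists only to make the example runnable.

```python
from itertools import product

def grid_search(c_values, penalties, fit_and_score):
    """Return (best_score, best_C, best_penalty) according to fit_and_score."""
    best = None
    for penalty, c in product(penalties, c_values):
        score = fit_and_score(c, penalty)
        if best is None or score > best[0]:  # strict '>' keeps the first of tied candidates
            best = (score, c, penalty)
    return best

def fake_scorer(c, penalty):
    """Stand-in for training a model; peaks at C=1 with a small bonus for 'l1'."""
    return 1 / (1 + abs(c - 1)) + (0.01 if penalty == 'l1' else 0.0)

best_score, best_c, best_penalty = grid_search([0.1, 1, 10], ['l1', 'l2'], fake_scorer)
print(best_c, best_penalty)
```

With scikit-learn, `fit_and_score` could fit `LogisticRegression(C=c, penalty=penalty, solver='liblinear')` on the training split and return its `score` on a validation split, which this notebook does not create.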
