# ML Pipeline 
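Follow the instructions below to build your machine learning pipeline.
### 1. Import libraries and load data
- Import Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y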

In [2]:
# import libraries
import pickle
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# download the NLTK data packages
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('messages_pie', engine)
X = df['message']
Y = df.iloc[:, 4:]

### 2. Write a tokenization function to process the text

In [4]:
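def tokenize(text):
    # raw string so that the backslashes in the pattern reach the regex engine unchanged
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # replace every URL with a fixed placeholder token
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # split the text into word tokens
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # lemmatize, lowercase, and strip whitespace from each token
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens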
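
To see what the URL-replacement step at the top of `tokenize` does in isolation, here is a minimal sketch that needs only the standard library. The helper name `replace_urls` and the sample messages are assumptions made for illustration; the pattern is the one used in `tokenize`, written as a raw string.

```python
import re

# Same pattern as in tokenize(), written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder token."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text


# Hypothetical sample messages.
plain = replace_urls("Water needed, see http://example.org/help now")
trailing = replace_urls("Details at https://example.org/a?b=1.")

print(plain)     # Water needed, see urlplaceholder now
print(trailing)  # Details at urlplaceholder
```

The second call shows a quirk of this pattern: the character range `$-_` spans most ASCII punctuation, so a period or comma that directly follows a URL is absorbed into the match and disappears with it.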

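### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.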
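
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on a tiny synthetic dataset with three target columns. The toy arrays and the choice of `DecisionTreeClassifier` as the base estimator are assumptions for this example only and are not part of the project.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: 20 samples, 2 binary features, 3 binary target columns.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
Y_toy = np.column_stack([
    X_toy[:, 0],                 # target 1 copies feature 1
    X_toy[:, 1],                 # target 2 copies feature 2
    X_toy[:, 0] & X_toy[:, 1],   # target 3 is the logical AND of both
])

# One independent copy of the base estimator is fitted per target column.
clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy)
print(pred.shape)            # (20, 3)
print(len(clf.estimators_))  # 3
```

The prediction has one column per target, which is why the notebook can later wrap it in a `DataFrame` that reuses the target column names.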

In [5]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline
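
A small sketch of how `train_test_split` behaves when called with its defaults, on made-up data. The `random_state=42` argument is an addition for this example (the notebook does not set one); it makes the shuffle reproducible between runs.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the message and label columns.
messages = ["message {}".format(i) for i in range(100)]
labels = [i % 2 for i in range(100)]

# With no test_size given, a quarter of the samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(messages, labels, random_state=42)
print(len(X_tr), len(X_te))  # 75 25

# The same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(messages, labels, random_state=42)
print(X_te == X_te2)  # True
```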

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)

In [7]:
#train classifier
pipeline.fit(X_train, y_train)

#predict on test data
y_pred = pipeline.predict(X_test)

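### 5. Test the model
Report the f1 score, precision, and recall for each output category in the dataset. You can do this by iterating over the columns and calling sklearn's `classification_report` on each one.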

In [10]:
y_pred_pd = pd.DataFrame(y_pred, columns = y_test.columns)


In [11]:
for column in y_test.columns:
    print('\n')
    print('FEATURE: {}\n'.format(column))
    print(classification_report(y_test[column],y_pred_pd[column]))



FEATURE: related

             precision    recall  f1-score   support

          0       0.59      0.25      0.35      1524
          1       0.80      0.95      0.87      4974
          2       0.27      0.07      0.11        56

avg / total       0.75      0.78      0.74      6554



FEATURE: request

             precision    recall  f1-score   support

          0       0.92      0.96      0.94      5443
          1       0.76      0.57      0.66      1111

avg / total       0.89      0.90      0.89      6554



FEATURE: offer

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6534
          1       0.00      0.00      0.00        20

avg / total       0.99      1.00      0.99      6554



FEATURE: aid_related

             precision    recall  f1-score   support

          0       0.76      0.87      0.81      3850
          1       0.77      0.60      0.67      2704

avg / total       0.76      0.76      0.76      6554



FEA

In [12]:

accuracy = (y_pred == y_test).mean().mean()

print("Accuracy:", accuracy)

Accuracy: 0.948589495813
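
An average accuracy near 0.95 should be read with the class imbalance in mind. The sketch below reuses the support counts printed for the `offer` column (6534 negatives, 20 positives) and evaluates a hypothetical classifier that never predicts the positive class; the variable names are illustrative.

```python
# Support counts taken from the `offer` report above.
n_negative, n_positive = 6534, 20

y_true_offer = [0] * n_negative + [1] * n_positive
y_pred_offer = [0] * len(y_true_offer)  # hypothetical model: always predicts 0

correct = sum(t == p for t, p in zip(y_true_offer, y_pred_offer))
offer_accuracy = correct / len(y_true_offer)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true_offer, y_pred_offer))
offer_recall = true_positives / n_positive

print(round(offer_accuracy, 4))  # 0.9969
print(offer_recall)              # 0.0
```

Accuracy stays close to 1.0 even though no positive message is found, which is why the per-class precision and recall in the reports are the more informative numbers for rare categories.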


### 6. Improve the model
Use grid search to find the best combination of parameters.

In [21]:
parameters = {
    'vect__max_df': (0.75, 1.0),
    'vect__max_features': (None, 5000),
    'tfidf__use_idf': (True, False),
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=5, verbose=4, n_jobs=-1)
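
Before starting a long search it can help to count how many fits the grid implies. The sketch below enumerates the grid with `itertools.product`; it only illustrates the arithmetic and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'vect__max_df': (0.75, 1.0),
    'vect__max_features': (None, 5000),
    'tfidf__use_idf': (True, False),
}
n_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 8 40
```

This matches the "8 candidates, totalling 40 fits" line in the fitting log. Because `refit=True`, the best candidate is then fitted once more on the full training set.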

In [22]:
cv.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__memory', 'estimator__steps', 'estimator__vect', 'estimator__tfidf', 'estimator__clf', 'estimator__vect__analyzer', 'estimator__vect__binary', 'estimator__vect__decode_error', 'estimator__vect__dtype', 'estimator__vect__encoding', 'estimator__vect__input', 'estimator__vect__lowercase', 'estimator__vect__max_df', 'estimator__vect__max_features', 'estimator__vect__min_df', 'estimator__vect__ngram_range', 'estimator__vect__preprocessor', 'estimator__vect__stop_words', 'estimator__vect__strip_accents', 'estimator__vect__token_pattern', 'estimator__vect__tokenizer', 'estimator__vect__vocabulary', 'estimator__tfidf__norm', 'estimator__tfidf__smooth_idf', 'estimator__tfidf__sublinear_tf', 'estimator__tfidf__use_idf', 'estimator__clf__estimator__algorithm', 'estimator__clf__estimator__base_estimator', 'estimator__clf__estimator__learning_rate', 'estimator__clf__estimator__n_estimators', 'estimator__clf__estimator__random_state', 'estimator__clf__estim

In [23]:
cv.fit(X_train, y_train)
# y_pred = cv_fit.best_estimator_.predict(X_test)
# print(classification_report(y_test, y_pred, target_names=y_test.columns))

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.23468090516145437, total= 2.3min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None .


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  2.5min remaining:    0.0s


[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.23569794050343248, total= 2.3min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None .


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  5.0min remaining:    0.0s


[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.23626653102746695, total= 2.3min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None .


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  7.6min remaining:    0.0s


[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.22100712105798576, total= 2.3min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.22609359104781282, total= 2.3min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000 .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.24357996440376303, total= 2.0min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000 .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.22323925756420035, total= 2.0min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000 .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.23397761953204477, total= 2.0min
[CV] tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000 .
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.231943

[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 90.6min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.75, 1.0), 'vect__max_features': (None, 5000), 'tfidf__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=4)
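
The cross-validation scores in the log (roughly 0.22 to 0.24) are far below the 0.95 accuracy printed in step 5 because the two numbers measure different things. With `scoring=None`, `GridSearchCV` uses the estimator's own `score` method, which for `MultiOutputClassifier` is, as far as I understand it, the fraction of samples for which every label is predicted correctly. The sketch below contrasts that exact-match ratio with the per-label accuracy on a made-up 4 x 3 example.

```python
import numpy as np

# Hypothetical true labels and predictions: 4 samples, 3 target columns.
y_true_toy = np.array([[1, 0, 0],
                       [1, 1, 0],
                       [0, 0, 1],
                       [1, 0, 1]])
y_pred_toy = np.array([[1, 0, 0],
                       [1, 0, 0],
                       [0, 0, 1],
                       [1, 1, 1]])

matches = (y_true_toy == y_pred_toy)

# Per-label accuracy: share of individual cells predicted correctly.
per_label_accuracy = matches.mean()

# Exact-match ratio: share of rows where all labels are correct.
exact_match_ratio = matches.all(axis=1).mean()

print(round(per_label_accuracy, 4))  # 0.8333
print(exact_match_ratio)             # 0.5
```

With 36 target columns a single wrong label makes the whole row count as a miss, so a low exact-match score can coexist with a high per-label accuracy.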

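### 7. Test the model
Print the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, development process, and pipeline techniques, there is no minimum requirement for model performance metrics. However, tuning the model to improve accuracy, precision, and recall can make your project stand out, especially on your résumé.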

In [24]:
y_pred_new = cv.predict(X_test)

In [25]:
y_pred_pd_new = pd.DataFrame(y_pred_new, columns = y_test.columns)

for column in y_test.columns:
    print('\n')
    print('FEATURE: {}\n'.format(column))
    print(classification_report(y_test[column],y_pred_pd_new[column]))



FEATURE: related

             precision    recall  f1-score   support

          0       0.59      0.25      0.35      1524
          1       0.80      0.95      0.87      4974
          2       0.31      0.07      0.12        56

avg / total       0.75      0.78      0.74      6554



FEATURE: request

             precision    recall  f1-score   support

          0       0.92      0.96      0.94      5443
          1       0.76      0.57      0.66      1111

avg / total       0.89      0.90      0.89      6554



FEATURE: offer

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6534
          1       0.00      0.00      0.00        20

avg / total       0.99      1.00      0.99      6554



FEATURE: aid_related

             precision    recall  f1-score   support

          0       0.76      0.87      0.81      3850
          1       0.77      0.60      0.67      2704

avg / total       0.76      0.76      0.76      6554



FEA

In [26]:
accuracy = (y_pred_new == y_test).mean().mean()

print("Accuracy:", accuracy)

Accuracy: 0.948602210694


### 8. Keep improving the model, for example:
* Try other machine learning algorithms
* Try features other than TF-IDF
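
One possible way to add a feature alongside TF-IDF is to combine transformers with `FeatureUnion`. The sketch below is only an illustration: `TextLengthExtractor` is a hypothetical transformer invented for this example, and message length is not claimed to be a useful feature for this dataset. `TfidfVectorizer` is used as shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works in a pipeline.
        return self

    def transform(self, X):
        # One row per message, one column holding its length.
        return np.array([[len(text)] for text in X])


# Stack the TF-IDF columns and the length column side by side.
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', TextLengthExtractor()),
])

sample_messages = ["we need water", "road blocked near the bridge"]
feature_matrix = features.fit_transform(sample_messages)

# 8 distinct words from TF-IDF plus 1 length column.
print(feature_matrix.shape)  # (2, 9)
```

In the notebook's pipeline, a union like this could take the place of the `vect` and `tfidf` steps, with the classifier step left unchanged.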

### 9. Export the model as a pickle file

In [27]:
with open("classifier_my.pickle", 'wb') as pickle_file:
    pickle.dump(cv, pickle_file)
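
Loading the file back is the mirror image of the dump. The sketch below round-trips a plain dictionary that stands in for the fitted model, using a temporary directory; the file name `classifier_demo.pickle` and the dictionary contents are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted GridSearchCV object.
stand_in_model = {"name": "toy-classifier", "params": {"n_estimators": 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier_demo.pickle")

    # Write in binary mode, as in the cell above.
    with open(model_path, "wb") as pickle_file:
        pickle.dump(stand_in_model, pickle_file)

    # Read back in binary mode.
    with open(model_path, "rb") as pickle_file:
        restored_model = pickle.load(pickle_file)

print(restored_model == stand_in_model)  # True
```

With the real model, the code that loads the file also needs a `tokenize` function importable under the same module path, because pickle stores functions by reference rather than by value.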

### 10. Use this notebook to complete `train.py`
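Use the template file provided in the Resources folder to write a script that runs the steps above, creates a database, and exports a model based on a new dataset specified by the user.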
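
Since the script works on a dataset chosen by the user, it needs some way to receive file paths. The sketch below shows one way to do that with `argparse`. The argument names `database_filepath` and `model_filepath` and the function `parse_args` are hypothetical; the actual template may define its interface differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments of train.py."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file.")
    parser.add_argument("database_filepath",
                        help="path of the SQLite database holding the messages")
    parser.add_argument("model_filepath",
                        help="path where the pickled model is written")
    return parser.parse_args(argv)


# Passing an explicit list instead of reading sys.argv, for demonstration.
args = parse_args(["messages.db", "classifier.pickle"])
print(args.database_filepath)  # messages.db
print(args.model_filepath)     # classifier.pickle
```

The parsed paths would then be handed to the loading, training, and export steps from the sections above.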