# ML Pipeline 
按照如下的指导要求，搭建你的机器学习管道。
### 1. 导入与加载
- 导入 Python 库
- 使用 [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html) 从数据库中加载数据集
- 定义特征变量X 和目标变量 Y

In [30]:
# import libraries
import numpy as np
import pandas as pd
import re
from sqlalchemy import create_engine
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import classification_report,precision_score, recall_score, f1_score

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse',engine)

In [4]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df.groupby('genre').count()['message']

genre
direct    10766
news      13054
social     2396
Name: message, dtype: int64

In [None]:
df.groupby('genre').sum()

In [6]:
X = df['message']
y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
category_names = list(y.columns)

In [7]:
X.sample(20).tolist()

['The political forces in the region must be alert to the phenomenon of thousands of young people who recently took to the streets in more than one Arab country, demanding that citizenship be the basis for partnership and refusing either to be represented on a sectarian basis, or that sectarian representation be used to cover up corruption.',
 "could people enter rooms that was made in concrete and whish not crack? cause the wind's blows out. ",
 "I'm at Carrefour Feuille we need food",
 'Weeks of torrential flooding and cyclones have claimed many lives and left hundreds of thousands homeless and without access to the most basic needs.',
 'We all know that these liberated communities are still not fully safe and habitable.',
 'River levels have also returned to normal with flood waters receding quickly.',
 'World oilseed production (soybeans, cottonseed, peanut oil, sunflower seed oil, etc.) is forecast at a record 307 million metric tons (MMt).',
 'In the neighbouring storm-hit state 

### 2. 编写分词函数，开始处理文本

In [8]:
def tokenize(text): 
    url_pat = "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    urls = re.findall(url_pat, text)
    
    for url in urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens]
    stops = list(set(stopwords.words('english')))
    stops += ["http","&","-",":",",",".","(",")","#"]

    clean_tokens = [token for token in clean_tokens if token not in stops]

    return clean_tokens

### 3. 创建机器学习管道 
这个机器学习管道应该接收 `message` 列作输入，输出分类结果，分类结果属于该数据集中的 36 个类。你会发现 [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) 在预测多目标变量时很有用。

In [9]:
def build_model():
    pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf',TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                ])
    
    return pipeline

### 4. 训练管道
- 将数据分割成训练和测试集
- 训练管道

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
model = build_model()
model.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. 测试模型
报告数据集中每个输出类别的 f1 得分、准确度和召回率。你可以对列进行遍历，并对每个元素调用 sklearn 的 `classification_report`。

In [11]:
def evaluate_model(model, X_test, y_test, category_names):
    y_pred = model.predict(X_test)
    scores = []
    for j in range(len(y_test.columns)):
        scores.append([precision_score(y_test.iloc[:, j].values, y_pred[:, j], average='weighted'),
                                     recall_score(y_test.iloc[:, j].values, y_pred[:, j], average='weighted'),
                                     f1_score(y_test.iloc[:, j].values, y_pred[:, j], average='weighted'),
                                     Counter(y_test.iloc[:,j]),
                                    Counter(y_pred[:,j]) 
                                    ])
    scores = pd.DataFrame(scores, columns=['precision','recall_score','f1_score','y_test','y_pred'],\
                          index=category_names)
    return scores

In [12]:
scores = evaluate_model(model, X_test, y_test, category_names)

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [13]:
round(scores,2)

Unnamed: 0,precision,recall_score,f1_score,y_test,y_pred
related,0.78,0.79,0.78,"{1: 4944, 0: 1563, 2: 47}","{1: 5449, 0: 1095, 2: 10}"
request,0.87,0.88,0.87,"{0: 5443, 1: 1111}","{0: 5984, 1: 570}"
offer,0.99,0.99,0.99,"{0: 6521, 1: 33}",{0: 6554}
aid_related,0.74,0.74,0.74,"{0: 3884, 1: 2670}","{0: 4451, 1: 2103}"
medical_help,0.9,0.92,0.9,"{0: 6019, 1: 535}","{0: 6449, 1: 105}"
medical_products,0.94,0.95,0.93,"{0: 6210, 1: 344}","{0: 6524, 1: 30}"
search_and_rescue,0.97,0.98,0.97,"{0: 6395, 1: 159}","{0: 6538, 1: 16}"
security,0.97,0.98,0.97,"{0: 6438, 1: 116}","{0: 6549, 1: 5}"
military,0.96,0.97,0.96,"{0: 6354, 1: 200}","{0: 6524, 1: 30}"
child_alone,1.0,1.0,1.0,{0: 6554},{0: 6554}


In [34]:
scores.f1_score.median()

0.94391920460076517

In [223]:
for j in np.arange(y_pred.shape[1]):
    print(category_names[j],'====')
    print(" y_test: {0}\n y_pred: {1}".format(Counter(y_test.iloc[:,j]), Counter(y_pred[:,j])))
    print(classification_report(y_test.iloc[:,j], y_pred[:,j])) 

related ====
 y_test: Counter({1: 4944, 0: 1563, 2: 47})
 y_pred: Counter({1: 5435, 0: 1114, 2: 5})
             precision    recall  f1-score   support

          0       0.61      0.43      0.51      1563
          1       0.83      0.91      0.87      4944
          2       0.60      0.06      0.12        47

avg / total       0.78      0.79      0.78      6554

request ====
 y_test: Counter({0: 5443, 1: 1111})
 y_pred: Counter({0: 5984, 1: 570})
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5443
          1       0.79      0.41      0.54      1111

avg / total       0.87      0.88      0.87      6554

offer ====
 y_test: Counter({0: 6521, 1: 33})
 y_pred: Counter({0: 6554})
             precision    recall  f1-score   support

          0       0.99      1.00      1.00      6521
          1       0.00      0.00      0.00        33

avg / total       0.99      0.99      0.99      6554

aid_related ====
 y_test: Counter({0: 3884

  'precision', 'predicted', average, warn_for)


In [15]:
model.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f2eaada8f28>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

### 6. 优化模型
使用网格搜索来找到最优的参数组合。 

In [26]:
# GridSearchCV太慢了，跑不出来，请审阅老师指教这个应该怎么办
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf',TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

parameters = {
        'clf__estimator__max_features': ['log2','sqrt'],
        'clf__estimator__n_estimators': [100,200]
        }

cv = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=5)

cv.fit(X_train,y_train)

KeyboardInterrupt: 

### 7. 测试模型
打印微调后的模型的精确度、准确率和召回率。  

因为本项目主要关注代码质量、开发流程和管道技术，所有没有模型性能指标的最低要求。但是，微调模型提高精确度、准确率和召回率可以让你的项目脱颖而出——特别是让你的简历更出彩。

In [18]:
scores_cv = evaluate_model(cv, X_test, y_test, category_names)

NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

### 8. 继续优化模型，比如：
* 尝试其他的机器学习算法
* 尝试除 TF-IDF 外其他的特征

In [32]:
model_improved = Pipeline([
                                ('vect', CountVectorizer(tokenizer=tokenize)),
                                ('tfidf', TfidfTransformer()),
                                ('clf', MultiOutputClassifier(AdaBoostClassifier()))
                            ])
model_improved.fit(X_train, y_train)


scores_model_improved = evaluate_model(model_improved, X_test, y_test, category_names)

In [33]:
scores_model_improved

Unnamed: 0,precision,recall_score,f1_score,y_test,y_pred
related,0.714674,0.758773,0.701779,"{1: 4944, 0: 1563, 2: 47}","{1: 6123, 0: 414, 2: 17}"
request,0.883943,0.891517,0.881626,"{0: 5443, 1: 1111}","{0: 5840, 1: 714}"
offer,0.990557,0.993592,0.991995,"{0: 6521, 1: 33}","{0: 6543, 1: 11}"
aid_related,0.755134,0.756332,0.751321,"{0: 3884, 1: 2670}","{0: 4369, 1: 2185}"
medical_help,0.908696,0.924779,0.910713,"{0: 6019, 1: 535}","{0: 6324, 1: 230}"
medical_products,0.946085,0.954684,0.947384,"{0: 6210, 1: 344}","{0: 6383, 1: 171}"
search_and_rescue,0.968402,0.976045,0.9699,"{0: 6395, 1: 159}","{0: 6504, 1: 50}"
security,0.967797,0.978944,0.972813,"{0: 6438, 1: 116}","{0: 6524, 1: 30}"
military,0.966142,0.97162,0.967829,"{0: 6354, 1: 200}","{0: 6440, 1: 114}"
child_alone,1.0,1.0,1.0,{0: 6554},{0: 6554}


In [35]:
scores_model_improved.f1_score.median()

0.96259612661900251

In [47]:
# 查看两个模型在各个类别预测f1分数的差异
scores_diff = pd.concat([scores.f1_score,scores_model_improved.f1_score],axis=1)
scores_diff.columns = ['f1_rf','f1_ab']
scores_diff['diff'] = scores_diff.f1_ab - scores_diff.f1_rf
scores_diff

Unnamed: 0,f1_rf,f1_ab,diff
related,0.77824,0.701779,-0.076461
request,0.86587,0.881626,0.015756
offer,0.992454,0.991995,-0.000459
aid_related,0.73544,0.751321,0.01588
medical_help,0.897386,0.910713,0.013327
medical_products,0.928704,0.947384,0.01868
search_and_rescue,0.966818,0.9699,0.003081
security,0.973593,0.972813,-0.00078
military,0.959656,0.967829,0.008173
child_alone,1.0,1.0,0.0


### 9. 导出模型为 pickle file

In [230]:
import pickle

with open('model.pickle','wb') as fw:
    pickle.dump(model,fw)

### 10. Use this notebook to complete `train.py`
使用资源 (Resources)文件里附带的模板文件编写脚本，运行上述步骤，创建一个数据库，并基于用户指定的新数据集输出一个模型。