# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [309]:
# import libraries
import re
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')
nltk.download('wordnet')

import string
import pandas as pd
from sqlalchemy import create_engine
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import contractions
from sklearn.model_selection import train_test_split
import enchant

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/spmccar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/spmccar/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to /home/spmccar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/spmccar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /home/spmccar/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /home/spmccar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [237]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql('SELECT * FROM Messages', engine)

# the message text is the feature; every column after the first four is a target
X = df['message'].values
Y = df.iloc[:, 4:]
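
The instructions for this step mention `read_sql_table`, which takes a table name rather than a SQL query. The following is a minimal sketch of that variant, not the notebook's actual data: it uses an in-memory SQLite database, and the toy rows and column layout (four metadata columns followed by one column per category) are assumptions made for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for DisasterResponse.db.
engine = create_engine('sqlite://')

# Toy rows; the column names below are hypothetical.
toy = pd.DataFrame({
    'id': [1, 2],
    'message': ['We need water', 'Roads are flooded'],
    'original': ['We need water', 'Roads are flooded'],
    'genre': ['direct', 'news'],
    'water': [1, 0],
    'floods': [0, 1],
})
toy.to_sql('Messages', engine, index=False)

# read_sql_table loads the whole table by name.
df = pd.read_sql_table('Messages', engine)

X = df['message'].values   # raw message text
Y = df.iloc[:, 4:]         # every column after the first four
```

Slicing by position, as in `df.iloc[:, 4:]`, only works while the table keeps the same column order, so selecting targets by name may be more robust if the schema changes.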

### 2. Write a tokenization function to process your text data
https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate/34294022

https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python

https://stackoverflow.com/questions/40144473/do-we-need-to-use-stopwords-filtering-before-pos-tagging

https://stackoverflow.com/questions/27673527/how-should-i-vectorize-the-following-list-of-lists-with-scikit-learn

https://en.wiktionary.org/wiki/Category_talk:English_one_letter_words

https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python

In [321]:
english_stopwords = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

translator = str.maketrans('', '', string.punctuation)

def tokenize(text,
             translator,
             english_stopwords,
             lemmatizer):
    """Clean a message and return its lemmatized tokens."""

    # expand contractions such as "we're"
    clean_text = contractions.fix(text)

    # remove well-formed URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    clean_text = re.sub(url_regex, ' ', clean_text)

    # remove malformed URLs, retweet markers, and the "yr old" abbreviation
    clean_text = re.sub(r'http[a-zA-Z\s&\.]+\s', ' ', clean_text)
    clean_text = re.sub(r'http\s+:\s+[/a-zA-Z0-9\.]+', ' ', clean_text)
    clean_text = re.sub(r'[w]{3}\.([A-Za-z0-9-]+)\.com', ' ', clean_text)
    clean_text = re.sub(r'\bRT\b', ' ', clean_text)
    clean_text = re.sub('yr old', ' ', clean_text)

    # strip punctuation
    clean_text = clean_text.translate(translator)

    # lowercase, tag parts of speech, then lemmatize using the tag
    tagged_tokens = pos_tag([elem.lower() for elem in word_tokenize(clean_text)])
    tokens = [lemmatize(elem, lemmatizer) for elem in tagged_tokens]

    # drop stopwords and tokens that start with a digit
    tokens = [elem for elem in tokens if elem not in english_stopwords]
    pattern_obj = re.compile('^[0-9].*$')
    return [elem for elem in tokens if pattern_obj.match(elem) is None]

def lemmatize(tagged_token,
              lemmatizer):

    if tagged_token[1].startswith('J'):
        pos = wordnet.ADJ
    elif tagged_token[1].startswith('V'):
        pos = wordnet.VERB
    elif tagged_token[1].startswith('N'):
        pos = wordnet.NOUN
    elif tagged_token[1].startswith('R'):
        pos = wordnet.ADV
    else:
        pos = None

    if pos is None:
        return lemmatizer.lemmatize(tagged_token[0])
    else:
        return lemmatizer.lemmatize(tagged_token[0], pos)
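
The text-cleaning part of `tokenize` can be tried in isolation, without NLTK or `contractions`. The sketch below is a simplified stand-in rather than the function above: `basic_clean`, the shortened URL pattern, and the sample message are illustrative assumptions. It shows the `str.maketrans` punctuation table and uses `\b` word boundaries so the retweet pattern does not match inside longer words.

```python
import re
import string

# Simplified patterns, assumed for illustration only.
URL_REGEX = re.compile(r'https?://\S+')
RETWEET_REGEX = re.compile(r'\bRT\b')

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def basic_clean(text):
    """Remove URLs, retweet markers, and punctuation, then split on whitespace."""
    text = URL_REGEX.sub(' ', text)
    text = RETWEET_REGEX.sub(' ', text)
    text = text.translate(PUNCT_TABLE)
    return text.lower().split()

sample = 'RT @user: SMART people need water! See http://example.com/help now.'
tokens = basic_clean(sample)
print(tokens)
# ['user', 'smart', 'people', 'need', 'water', 'see', 'now']
```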

In [240]:
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    random_state=1011768029,
                                                    train_size=0.7)



### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [322]:
for idx in range(1000):
    print('----------------------------------------------------------')
    print(X_train[idx])
    print(tokenize(X_train[idx],
                   translator,
                   english_stopwords,
                   lemmatizer))

----------------------------------------------------------
Relief camps have been set up to house evacuated residents and tourists.
['relief', 'camp', 'set', 'house', 'evacuated', 'resident', 'tourist']
----------------------------------------------------------
Please bring us help today, if you continue down siko ( ? ), in Carrefour, because we're dying of hunger
['please', 'bring', 'u', 'help', 'today', 'continue', 'siko', 'carrefour', 'die', 'hunger']
----------------------------------------------------------
I would like to know your latest news
['would', 'like', 'know', 'late', 'news']
----------------------------------------------------------
Dengue is characterized by a sudden onset of headaches, severe muscle and joint pains and often also a rash.
['dengue', 'characterize', 'sudden', 'onset', 'headache', 'severe', 'muscle', 'joint', 'pain', 'often', 'also', 'rash']
----------------------------------------------------------
The ability to pick dengue from influenza is crucial, a

['addition', 'actively', 'solicit', 'donation', 'support', 'work', 'across', 'country', 'thus', 'create', 'impression', 'nationally', 'concerned', 'flood', 'victim', 'government']
----------------------------------------------------------
- Latrines, washroom kits, purification tablets and hygiene kits to 4,500 households
['latrine', 'washroom', 'kit', 'purification', 'tablet', 'hygiene', 'kit', 'household']
----------------------------------------------------------
It also includes reusing wastewater and greywater to increase green spaces in the city and using drought-resistant plants.
['also', 'include', 'reuse', 'wastewater', 'greywater', 'increase', 'green', 'space', 'city', 'use', 'droughtresistant', 'plant']
----------------------------------------------------------
Please hurry. We are hungry. We are in Fontamara.
['please', 'hurry', 'hungry', 'fontamara']
----------------------------------------------------------
We are asking for help at Delmas 9 you've completely forgotten us

['club', 'others', 'eastern', 'high', 'sierra', 'region', 'california', 'raise', 'bulk', 'tsunami', 'relief', 'fund', 'though', 'onehour', 'telethon', 'broadcast', 'tv', 'radio', 'january']
----------------------------------------------------------
My love I can't live without you, but alas! Tell me why you are angry and you are humiliating me. I can't anymore, you said the words but that's you. 
['love', 'live', 'without', 'alas', 'tell', 'angry', 'humiliate', 'anymore', 'say', 'word']
----------------------------------------------------------
Exploratory drilling in May last year by local gas company PT Lapindo Brantas pierced an underground chamber of hydrogen sulphide, forcing hot mud to the surface.
['exploratory', 'drilling', 'may', 'last', 'year', 'local', 'gas', 'company', 'pt', 'lapindo', 'branta', 'pierce', 'underground', 'chamber', 'hydrogen', 'sulphide', 'force', 'hot', 'mud', 'surface']
----------------------------------------------------------
For those who survived, the 

['nard', 'think', 'turn', 'water', 'part', 'deal', 'sandy']
----------------------------------------------------------
The quake's victims are still visible on the bumpy road -- a boulder-smashed bus, a truck and a jeep stranded next to an abyss where the sliding earth obliterated a kilometer-long stretch.
['quake', 'victim', 'still', 'visible', 'bumpy', 'road', 'bouldersmashed', 'bus', 'truck', 'jeep', 'strand', 'next', 'abyss', 'slide', 'earth', 'obliterate', 'kilometerlong', 'stretch']
----------------------------------------------------------
..30 people in the neighborhood of the Montana, on the road to Bresilienne and arriving at 39 Denard, ask for Institute Forgere Jacques 
['people', 'neighborhood', 'montana', 'road', 'bresilienne', 'arrive', 'denard', 'ask', 'institute', 'forgere', 'jacques']
----------------------------------------------------------
The children require suitable shelter, hygienic living conditions, proper schooling and water for drinking and bathing.
['child'

['unicef', 'already', 'send', 'supply', 'affected', 'area', 'include', 'water', 'purification', 'tablet', 'diesel', 'centrifugal', 'pump', 'basic', 'family', 'water', 'kit', 'collapsible', 'water', 'tank', 'soap', 'lime', 'household', 'chlorine', 'continue', 'support', 'drainage', 'clearing']
----------------------------------------------------------
It therefore urged the ECOWAS Commission to continue on-going discussions with the African Union and the UN on the selection of the command of the force for which the UN is believed to have approached 11 countries to nominate candidates to head the force, including from outside the region.
['therefore', 'urge', 'ecowas', 'commission', 'continue', 'ongoing', 'discussion', 'african', 'union', 'un', 'selection', 'command', 'force', 'un', 'believe', 'approach', 'country', 'nominate', 'candidate', 'head', 'force', 'include', 'outside', 'region']
----------------------------------------------------------
She strongly encouraged Member States to 

['problem', 'registre', 'akaye']
----------------------------------------------------------
"This company can buy animals from our farmers, put them in feedlots in South Africa and ensure we reap the benefits collectively.
['company', 'buy', 'animal', 'farmer', 'put', 'feedlot', 'south', 'africa', 'ensure', 'reap', 'benefit', 'collectively']
----------------------------------------------------------
The International Crisis Group says that pressing ahead with July 28 would risk an election so "technically deficient" and with such low turnout that it would fail to bestow legitimacy on the new president and could feed a new cycle of instability.
['international', 'crisis', 'group', 'say', 'press', 'ahead', 'july', 'would', 'risk', 'election', 'technically', 'deficient', 'low', 'turnout', 'would', 'fail', 'bestow', 'legitimacy', 'new', 'president', 'could', 'fee', 'new', 'cycle', 'instability']
----------------------------------------------------------
I need to call lmy family. I do not 

['cnnbrk', 'late', 'development', 'earthquake', 'haiti', 'follow', 'break', 'news', 'twitter', 'list']
----------------------------------------------------------
Airlift to Turkmenistan - two chartered planes carrying 1,000 rolls of plastic sheeting for emergency shelter arrived in Ashgabat October 18 and were consigned to UNICEF.
['airlift', 'turkmenistan', 'two', 'chartered', 'plane', 'carry', 'roll', 'plastic', 'sheeting', 'emergency', 'shelter', 'arrive', 'ashgabat', 'october', 'consign', 'unicef']
----------------------------------------------------------
Keita, prime minister from 1994 to 2000 and president of the National Assembly for five years from 2002, was one of several of the high-profile presidential hopefuls to hold news conferences or rallies announcing their candidatures in front of thousands of backers, in his case in Bamako.
['keita', 'prime', 'minister', 'president', 'national', 'assembly', 'five', 'year', 'one', 'several', 'highprofile', 'presidential', 'hopeful', 

In [196]:
h_tokenize = lambda elem: tokenize(elem,
                                   translator,
                                   english_stopwords,
                                   lemmatizer)

In [213]:
vectorizer = TfidfVectorizer(tokenizer=h_tokenize, lowercase=False)
bag_of_words = vectorizer.fit_transform(X_train)
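
One way to combine the vectorizer with `MultiOutputClassifier` is a scikit-learn `Pipeline`. This is a minimal sketch under stated assumptions, not the notebook's final model: the toy messages, the two category columns, and the choice of `RandomForestClassifier` are all illustrative, and the default tokenizer is used so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two hypothetical binary categories (water, shelter).
messages = np.array([
    'we need drinking water',
    'no water in the village',
    'our house collapsed we need shelter',
    'families are sleeping outside without tents',
])
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

pipeline = Pipeline([
    # The notebook would pass tokenizer=h_tokenize here; the default
    # tokenizer keeps this sketch self-contained.
    ('tfidf', TfidfVectorizer()),
    # MultiOutputClassifier fits one copy of the estimator per target column.
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

pipeline.fit(messages, labels)
predictions = pipeline.predict(messages)  # shape: (n_messages, n_categories)
print(predictions.shape)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the data passed to `fit`, which avoids leaking test-set vocabulary into training.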

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
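
The per-column loop described above could look like the following sketch. The labels and predictions are made-up values for two hypothetical categories; in the notebook they would come from the test split and the fitted pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels (DataFrame) and predictions (array) for two categories.
Y_true = pd.DataFrame({'water': [1, 0, 1, 0], 'shelter': [0, 0, 1, 1]})
Y_pred = np.array([[1, 0], [0, 0], [1, 1], [1, 1]])

reports = {}
for idx, column in enumerate(Y_true.columns):
    # Compare one target column against the matching column of predictions.
    reports[column] = classification_report(Y_true[column],
                                            Y_pred[:, idx],
                                            zero_division=0)
    print(column)
    print(reports[column])
```

Setting `zero_division=0` reports 0 instead of raising a warning when a category has no predicted positives, which can happen for rare categories.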

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = 

cv = 
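
As a hedged illustration of how the two placeholders might be filled in, the sketch below runs `GridSearchCV` over a small pipeline. The toy data, the `RandomForestClassifier` estimator, and the two-parameter grid are assumptions chosen to keep the example fast; they are not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages alternating between two hypothetical categories (water, shelter).
messages = np.array([
    'we need drinking water',
    'our house collapsed we need shelter',
    'no water in the village',
    'families are sleeping outside without tents',
    'please send bottled water',
    'the storm destroyed our shelter',
    'water supply is contaminated',
    'we have no roof and need tents',
])
labels = np.array([[1, 0], [0, 1]] * 4)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Keys follow the <step>__<parameter> convention; the wrapped estimator
# inside MultiOutputClassifier is reached through "estimator".
parameters = {
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [5, 10],
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=2)
cv.fit(messages, labels)
print(cv.best_params_)
```

After fitting, `cv` behaves like the best pipeline refitted on all the training data, so `cv.predict` can be used directly in the next step.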

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
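
A minimal sketch of the export step, assuming a small stand-in model: the pipeline, its training data, and the `classifier.pkl` file name are illustrative, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in for the tuned model.
texts = ['need water', 'need shelter', 'water please', 'shelter please']
model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])
model.fit(texts, [1, 0, 1, 0])

# Write the fitted model to disk, then read it back.
model_path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(model_path, 'wb') as file_obj:
    pickle.dump(model, file_obj)

with open(model_path, 'rb') as file_obj:
    restored = pickle.load(file_obj)

print(list(restored.predict(texts)) == list(model.predict(texts)))
```

Note that the standard `pickle` module cannot serialize lambda functions, so a pipeline whose vectorizer uses a lambda such as `h_tokenize` would need that tokenizer rewritten as a module-level `def` before it can be exported this way.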

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
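
The template file itself is not shown here, so the following is only a sketch of how such a script might read the user-specified paths. The function names, argument names, and printed placeholder steps are assumptions; each placeholder would call code developed in this notebook.

```python
import argparse

def parse_args(argv=None):
    """Parse the database and model paths supplied by the user."""
    parser = argparse.ArgumentParser(
        description='Train a message classifier and export it as a pickle file.')
    parser.add_argument('database_filepath',
                        help='SQLite database containing the messages table')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps: load data, build and train the pipeline, export it.
    print('Loading data from', args.database_filepath)
    print('Saving model to', args.model_filepath)
    return args

# Example invocation with explicit arguments instead of sys.argv.
args = main(['DisasterResponse.db', 'classifier.pkl'])
```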