In [None]:
import sys
import os
import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split

sys.path.append('./data')
sys.path.append('./models')

# from train_classifier import load_data

# Downloads for NLTK tools
nltk.download('words', quiet=True);
nltk.download('wordnet', quiet=True);
nltk.download('punkt_tab', quiet=True);
nltk.download('stopwords', quiet=True);


# Project Overview
## ETL Pipeline: process_data.py
In a Python script, process_data.py, write a data cleaning pipeline that:
- Loads the messages and categories datasets
- Merges the two datasets
- Cleans the data
- Stores it in a SQLite database
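
The merge, clean, and store steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the contents of process_data.py: the table name `messages`, the in-memory database, and the use of `drop_duplicates` as the cleaning step are all assumptions.

```python
import sqlite3

import pandas as pd

# Made-up stand-ins for the two CSV files (note the repeated row for id 2).
messages = pd.DataFrame({
    "id": [1, 2, 2],
    "message": ["need water", "road blocked", "road blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["related-1;request-1", "related-1;request-0"],
})

# Merge on the shared 'id' column, then drop exact duplicate rows.
merged = messages.merge(categories, on="id")
cleaned = merged.drop_duplicates()

# Store in SQLite; ':memory:' and the table name are hypothetical choices.
conn = sqlite3.connect(":memory:")
cleaned.to_sql("messages", conn, index=False, if_exists="replace")

# Read the table back to confirm the round trip.
restored = pd.read_sql("SELECT * FROM messages", conn)
conn.close()
```

Passing `if_exists="replace"` lets the script be re-run without failing on an existing table.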

## ML pipeline: train_classifier.py
In a Python script, train_classifier.py, write a machine learning pipeline that:
- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
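
A rough sketch of these steps on toy data is shown below. The choice of `TfidfVectorizer`, `RandomForestClassifier` wrapped in `MultiOutputClassifier`, and the small parameter grid are assumptions made for illustration; train_classifier.py may use different components.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two binary response columns (water, shelter).
messages = np.array([
    "we need water", "need shelter tonight", "water and shelter please",
    "road is blocked", "send water now", "shelter collapsed",
    "no water here", "bridge is down",
])
labels = np.array([
    [1, 0], [0, 1], [1, 1], [0, 0],
    [1, 0], [0, 1], [1, 0], [0, 0],
])

X_train, X_test, Y_train, Y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42)

# Text processing followed by one classifier per response column.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])

# Tune a single hyperparameter of the wrapped estimator (hypothetical grid).
param_grid = {"clf__estimator__n_estimators": [10, 20]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, Y_train)

Y_pred = search.predict(X_test)

# Export the tuned model as pickle bytes and load it back.
model_bytes = pickle.dumps(search.best_estimator_)
restored = pickle.loads(model_bytes)
```

In a script, `pickle.dump` to a file path would replace `pickle.dumps`; the in-memory round trip here only keeps the sketch self-contained.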

## Flask Web App
We are providing much of the Flask web app for you, but feel free to add extra features depending on your knowledge of Flask, HTML, CSS, and JavaScript. For this part, you'll need to:
- Modify file paths for database and model as needed
- Add data visualizations using Plotly in the web app. One example is provided for you.
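
As one possible starting point for an extra visualization, a Plotly figure can be described as a plain dictionary with `data` and `layout` keys and serialized to JSON. The sketch below uses made-up rows; the chart title and the way the JSON reaches the page template are assumptions, not part of the provided app.

```python
import json

import pandas as pd

# Made-up rows standing in for the cleaned messages table.
df = pd.DataFrame({
    "message": ["need water", "road blocked", "send help", "storm coming"],
    "genre": ["direct", "news", "direct", "social"],
})

# Count the messages per genre.
genre_counts = df.groupby("genre").count()["message"]
genre_names = list(genre_counts.index)

# Figure specification in Plotly's JSON shape: a list of traces plus a layout.
figure = {
    "data": [{"type": "bar", "x": genre_names, "y": genre_counts.tolist()}],
    "layout": {
        "title": "Distribution of Message Genres",  # hypothetical title
        "xaxis": {"title": "Genre"},
        "yaxis": {"title": "Count"},
    },
}

# Serialize so the specification can be handed to the front end.
figure_json = json.dumps(figure)
```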

## GitHub and Code Quality
Your project will also be graded based on the following:
- Use of Git and GitHub
- Strong documentation
- Clean and modular code
- Follow the [RUBRIC](https://learn.udacity.com/nanodegrees/nd025/parts/cd0018/lessons/e692c8ed-b713-464b-95ac-72d93a35b4fc/concepts/e692c8ed-b713-464b-95ac-72d93a35b4fc-project-rubric) as you work on your project to ensure you meet all of the necessary criteria for developing the pipelines and web app.

In [None]:
DATA_DIR = './data'
database_filepath = os.path.join(DATA_DIR, 'DisasterResponse.db')
messages_filepath = os.path.join(DATA_DIR, 'disaster_messages.csv')
categories_filepath = os.path.join(DATA_DIR, 'disaster_categories.csv')

# debug train_classifier.py

In [None]:
from train_classifier import load_data
from process_data import load_data as load_data_from_process_data, clean_data as clean_data_from_process_data

# train_classifier.py

In [None]:
df = clean_data_from_process_data(load_data_from_process_data(messages_filepath, categories_filepath))

categories = [k for k, v in df.dtypes.items() if v in [int, np.int64]]

X = df.message
Y = df[categories]


X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=3/4, test_size=1/4, random_state=42)

In [None]:
DATA_DIR = './data'
database_filepath = os.path.join(DATA_DIR, 'DisasterResponse.db')
X, Y, category_names = load_data(database_filepath)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=3/4, test_size=1/4, random_state=42, stratify=None)

In [None]:
categories = [k for k, v in df.dtypes.items() if v in [int, np.int64]]
df[categories].describe().T

In [None]:
df[categories].eq(0).all(axis=1).sum()

In [None]:
genre_counts = df.groupby('genre').count()['message']
genre_names = list(genre_counts.index)


In [None]:
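# Sum down the rows (axis=0) to count the messages tagged with each category;
# the resulting index holds the category names, which serve as the labels.
cat_counts = df[categories].sum(axis=0)
cat_labels = list(cat_counts.index)
cat_values = cat_counts.tolist()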

In [None]:
cat_labels

# app.py

# CRISP-DM Flow
- ALL of this should go in the README.md file at the end of the project.
- The rubric states the README should contain a lot of info.

## Business Understanding
- The purpose of this project is to identify an efficient way to model emergency messages and determine which are most informative for making a quick disaster response decision. This matters because disaster response resources are limited, and false positives can divert them from where they are needed most.

## Data Understanding
- The dataset is raw, containing an identifier along with communication information that can contain an English 'message', a non-English 'original' message, and a 'genre' (type of communication). This information is provided in CSV form.
- The content of each row is inconsistent. For example, some lines contain more than 3 commas (exactly 3 would cleanly separate the 4 expected fields), some contain consecutive commas (where one of the messages is missing), and some contain quotation marks separating the English and non-English messages.
- There is also categorical information stored in a separate CSV, which provides some informative identifiers that are pre-associated with each message (and link by an 'id' field in each set of data).
    - This CSV data is further subdivided into key/value pairs separated by a semicolon, e.g. 'id,key1-value1;key2-value2; ...' etc.
    - There are 36 categories in the file, and the rubric says these are to be used as the ***responses*** for the multi-output classification task.
    - Note that the category values are integers: mostly binary (0 or 1), except for **related**, which is ternary (0/1/2).
        - If using a decision tree style classifier, we do not need to create dummies for the **related** variable.
    - These categories are also NOT mutually exclusive (the cross-sectional sum of them can be greater than 1).
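
Both observations about the category values can be checked directly. The sketch below uses a made-up label table with three of the category names; the real data has 36 columns.

```python
import pandas as pd

# Toy labels (made up): 'related' is ternary, the others are binary.
labels = pd.DataFrame({
    "related": [1, 2, 0, 1],
    "request": [1, 0, 0, 1],
    "offer":   [0, 0, 0, 1],
})

# Columns holding any value other than 0 or 1.
non_binary = [col for col in labels.columns
              if not labels[col].isin([0, 1]).all()]

# Rows where more than one category is set, i.e. the labels overlap.
# Counting non-zero entries avoids treating the value 2 as two categories.
multi_label_rows = int((labels.ne(0).sum(axis=1) > 1).sum())
```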

- **load_data.py** loads the raw message and category information. It first splits out the 'id' and the 'categories' key/value pairs, then splits the key/value pairs into integer-valued columns, joins the two datasets on 'id', and passes the joined dataframe along for cleaning in the next phase.
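
The splitting and joining described above might look roughly like the following. This is a sketch on made-up rows under the 'key-value;key-value' format noted earlier, not the actual loading code; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows in the 'id,key1-value1;key2-value2' layout.
categories_raw = pd.DataFrame({
    "id": [2, 7],
    "categories": ["related-1;request-0;offer-0",
                   "related-2;request-1;offer-0"],
})
messages = pd.DataFrame({
    "id": [2, 7],
    "message": ["need water", "road blocked"],
    "genre": ["direct", "news"],
})

# One column per key/value pair.
split = categories_raw["categories"].str.split(";", expand=True)

# Take the column names from the keys in the first row.
split.columns = [pair.rsplit("-", 1)[0] for pair in split.iloc[0]]

# Keep only the value after the last '-' and convert it to an integer.
for col in split.columns:
    split[col] = split[col].str.rsplit("-", n=1).str[-1].astype(int)

# Reattach the ids and join with the messages on 'id'.
categories = pd.concat([categories_raw[["id"]], split], axis=1)
df = messages.merge(categories, on="id")
```

Splitting on the last hyphen keeps the parsing correct even if a category name itself contains a hyphen.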

## Data Preparation
- **clean_data.py** attempts to separate the English and non-English components of the raw message and stores the processed English-only message as a new field in the data, so that it can be processed more easily in the machine learning pipeline.
- It also converts the 'genre' field into a set of dummies with boolean values (n-1 categories).
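
The n-1 dummy encoding of 'genre' can be sketched with `pandas.get_dummies` and `drop_first=True`. The genre values below are made up for illustration, and the `genre_` prefix is an assumption.

```python
import pandas as pd

# Made-up genre values; the real column may hold different labels.
df = pd.DataFrame({"genre": ["direct", "news", "social", "direct"]})

# drop_first=True drops the first level (alphabetically), leaving n-1 columns.
genre_dummies = pd.get_dummies(df["genre"], prefix="genre", drop_first=True)

# Replace the original column with its dummy columns.
df = pd.concat([df.drop(columns="genre"), genre_dummies], axis=1)
```

Here the dropped level, 'direct', becomes the baseline: a row with both dummies false belongs to it.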

## Data Modeling

## Result Evaluation

## Deployment