This multioutput classification project, part of the Udacity Data Scientist Nanodegree Program, analyzes and classifies messages to improve communication during disasters. It uses data provided by Appen (formerly Figure Eight) containing real messages that were sent during disaster events.
A web application can be hosted locally to classify new messages.
The result should look like this:
To run the app locally, as well as any other step of the project, it is recommended to create a new working environment and install the required libraries. For example, if using Anaconda:
conda create -n <env_name>
conda activate <env_name>
conda install -c anaconda pip
pip install -r requirements.txt
Or, in a single step (although this can raise a "PackagesNotFoundError" when a package is not available in the configured conda channels):
conda create --name <env_name> --file requirements.txt
Configuration is handled in the "config" directory through two files: core.py and config.yml. The file paths needed to run this project (messages and categories data, database file, and model pickle file) are specified in this folder, and they are validated to ensure everything works as intended. Users therefore do not need to pass file names as extra arguments when running the Python scripts, which prevents errors and saves time.
To re-train the model, e.g. if new data becomes available, either update config.yml with the new file names or replace the previous data files.
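The validation step described above could look roughly like the following sketch. It is not the project's actual core.py: the `AppConfig` class, the key names, the file names, and the extension checks are all assumptions made for illustration, and a plain dictionary stands in for the parsed contents of config.yml.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical container for the paths listed in config.yml."""
    messages_file: str
    categories_file: str
    database_file: str
    model_file: str


def validate_config(raw: dict) -> AppConfig:
    """Check that every expected key exists and has a plausible extension."""
    # Assumed key names and extensions; the real config.yml may differ.
    expected = {
        "messages_file": ".csv",
        "categories_file": ".csv",
        "database_file": ".db",
        "model_file": ".pkl",
    }
    missing = sorted(set(expected) - set(raw))
    if missing:
        raise ValueError(f"Missing config keys: {missing}")
    for key, suffix in expected.items():
        if Path(str(raw[key])).suffix != suffix:
            raise ValueError(f"{key} should end with {suffix}, got {raw[key]!r}")
    return AppConfig(**{key: str(raw[key]) for key in expected})


# Stand-in for the dictionary a YAML parser would return for config.yml.
raw_config = {
    "messages_file": "data/messages.csv",
    "categories_file": "data/categories.csv",
    "database_file": "data/disaster.db",
    "model_file": "models/classifier.pkl",
}
config = validate_config(raw_config)
print(config.database_file)
```

Failing early with a clear error when a key is missing or mistyped is what lets the other scripts run without any file-name arguments.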
The data directory contains the raw data (messages and categories) in CSV files, as well as the cleaned data in a database (.db) file.
It also contains the Python script that applies the entire ETL process, process_data.py, which extracts data from the CSV files, transforms them, and then loads them into a single SQLite database.
To run this script on the command line, from the project folder:
python data/process_data.py
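As a rough illustration of the extract, transform, and load steps, here is a minimal sketch that uses only the Python standard library. It is not the contents of process_data.py: the column names, the table name, the sample rows, and the cleaning rules (join on `id`, drop duplicates and unlabeled messages) are assumptions, and an in-memory database replaces the .db file.

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-ins for the two CSV files (columns are assumptions).
MESSAGES_CSV = "id,message\n1,We need water\n2,Road is blocked\n"
CATEGORIES_CSV = "id,water,transport\n1,1,0\n2,0,1\n2,0,1\n"


def extract(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(messages, categories):
    """Join both tables on id, cast labels to int, and drop duplicates."""
    # Keying by id means a repeated category row overwrites the earlier one.
    labels = {row["id"]: row for row in categories}
    merged = []
    for msg in messages:
        cat = labels.get(msg["id"])
        if cat is None:
            continue  # skip messages that have no labels
        merged.append(
            (int(msg["id"]), msg["message"], int(cat["water"]), int(cat["transport"]))
        )
    return merged


def load(rows, connection):
    """Write the cleaned rows into a single SQLite table."""
    connection.execute("DROP TABLE IF EXISTS messages")
    connection.execute(
        "CREATE TABLE messages "
        "(id INTEGER PRIMARY KEY, message TEXT, water INTEGER, transport INTEGER)"
    )
    connection.executemany("INSERT INTO messages VALUES (?, ?, ?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")  # a file path would be used in practice
load(transform(extract(MESSAGES_CSV), extract(CATEGORIES_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(row_count)
```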
The models directory contains train_classifier.py, the Python script that handles all the machine learning steps needed for this project. It also holds the pickle file containing the best model found by the GridSearchCV run on the training set.
To run train_classifier.py from the command line:
python models/train_classifier.py
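The sketch below shows one way a multioutput text classifier can be tuned with GridSearchCV and then pickled. Only the use of GridSearchCV and a pickled best model comes from the description above; the TF-IDF features, the random forest estimator, the parameter grid, and the toy messages and labels are assumptions chosen to keep the example small.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data: each message has two binary labels (water, transport).
messages = [
    "we need water", "the road is blocked", "no drinking water left",
    "bridge collapsed on the highway", "send water bottles", "trucks cannot pass",
    "water supply is cut", "the road is flooded",
]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

# Assumed pipeline: TF-IDF features feeding one classifier per output column.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Parameters of the wrapped estimator are reached through the double-underscore path.
param_grid = {"clf__estimator__n_estimators": [10, 20]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(messages, labels)

# Serialize the best model, then reload it as the web app would.
model_bytes = pickle.dumps(search.best_estimator_)
prediction = pickle.loads(model_bytes).predict(["please send water"])
print(search.best_params_, prediction.shape)
```

In a real run the model would be written to the pickle path given in the configuration instead of being kept in memory.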
The app directory contains the files needed to run the web application. These include two Python scripts:
- run.py, which contains the Flask code needed to render the HTML files as well as the Plotly figures
- functions.py, which contains extra functions needed to execute run.py (to keep the code modular and clean)
In addition, two directories, templates and static, contain the necessary HTML and CSS files.
To access the web application on a local computer, run:
python app/run.py
and open the URL printed in the terminal in a web browser.
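For orientation, a Flask route that classifies a submitted message might be sketched as follows. This is not the project's run.py: the `/classify` route, the `query` parameter, and the keyword-based `classify` helper are hypothetical, the helper stands in for the pickled model, and the sketch returns JSON rather than rendering the HTML templates and Plotly figures.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(message):
    """Hypothetical stand-in for calling predict on the pickled model."""
    keywords = {"water": "water", "road": "transport"}
    text = message.lower()
    return [label for word, label in keywords.items() if word in text]


@app.route("/classify")
def classify_route():
    """Read the message from the query string and return its categories."""
    query = request.args.get("query", "")
    return jsonify(query=query, categories=classify(query))


# Exercise the route without starting a server.
client = app.test_client()
response = client.get("/classify", query_string={"query": "We need water"})
print(response.get_json())
```

Flask's test client makes it possible to check a route like this without binding to a port; serving it locally would instead call `app.run()`.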