# Natural Language Processing with Disaster Tweets
**By Thanh Son Nghiem & Minh Kien Nguyen**

<a id='0'>Table of Contents (ToC):</a>
* <a href='#1'>1. Frame the Problem and Look at the Big Picture</a>
* <a href='#2'>2. Get the Data</a>
* <a href='#3'>3. Explore the Data</a>
* <a href='#4'>4. Prepare the Data</a>
* <a href='#5'>5. Short-List Promising Models</a>
* <a href='#6'>6. Fine-Tune the System</a>
* <a href='#7'>7. Present the Solution</a>
* <a href='#8'>8. Launch!</a>

# TODO (as of 31.05):
* Label test set ```labeled_test.csv``` (Deadline 07.06)

# DONE (as of 31.05):
* Sections 1.1, 1.2, 1.3
* Section 2

<a id='1'></a>

## 1. Frame the Problem and Look at the Big Picture

### 1.1 Define the objective in business terms

*The primary goal of this Jupyter Notebook is to build machine learning models that predict which Tweets are about real disasters and which ones are not.*

### 1.2 How should you frame this problem?

This task is framed as a supervised, batch, model-based learning problem.

### 1.3 How should performance be measured?

Model performance will be evaluated with the F-beta score, the weighted harmonic mean of precision and recall.
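
To make the metric concrete, here is a minimal pure-Python sketch of the F-beta score for binary labels. The notebook does not fix a value for beta, so the values used below are illustrative assumptions; the helper `f_beta` and the toy labels are hypothetical and not part of the project code.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels, with 1 as the positive (disaster) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: the score is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]

f1 = f_beta(y_true, y_pred, beta=1.0)  # precision and recall weighted equally
f2 = f_beta(y_true, y_pred, beta=2.0)  # recall weighted more than precision
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

A beta above 1 favours recall (missing a real disaster costs more), while a beta below 1 favours precision. In practice, `sklearn.metrics.fbeta_score` computes the same quantity.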

### 1.4 What would be the minimum performance needed to reach the business objective?

### 1.5 List all of the assumptions made in the project

<a id='2'></a>
<a href='#0'>Back to ToC</a>

## 2. Get the Data

### 2.1 Find and document where to get the data

The dataset used in this Notebook was created by the company *figure-eight* and popularized by *Kaggle*, which uses it as the training and test sets for its Getting Started Prediction Competition "Natural Language Processing with Disaster Tweets".

The *Kaggle* training and test datasets, together with their description, can be found [here](https://www.kaggle.com/c/nlp-getting-started/data).

Note that the *Kaggle* test set ```test.csv``` is not labeled. The creators of this Notebook labeled it manually and saved the result separately in the file ```labeled_test.csv```.

### 2.2 Get and take a quick look at the data

In [None]:
import pandas as pd

# Read the Kaggle training set and the manually labeled test set
train_df = pd.read_csv('train.csv', sep=",", header="infer", index_col=0)
test_df = pd.read_csv('labeled_test.csv', sep=",", header="infer", index_col=0)

*Kaggle* dataset columns description:
* **id** - a unique identifier for each tweet
* **text** - the text of the tweet
* **location** - the location the tweet was sent from (may be blank)
* **keyword** - a particular keyword from the tweet (may be blank)
* **target** - this denotes whether a tweet is about a real disaster (1) or not (0)
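
Since **location** and **keyword** may be blank, it can help to count the missing entries per column. The sketch below does this on a tiny made-up table that only mimics the column layout described above; the rows and the names `toy_csv` and `toy_df` are hypothetical.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout; not real data
toy_csv = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Example tweet one,1\n"
    "2,fire,Berlin,Example tweet two,0\n"
    "3,flood,,Example tweet three,1\n"
)
toy_df = pd.read_csv(toy_csv, index_col=0)

# Blank CSV fields are read as NaN, so isna() flags them
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

The same call, `train_df.isna().sum()`, should work on the real training set once it is loaded.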

In [None]:
# Print a concise summary of the training set
train_df.info()

In [None]:
# Print a concise summary of the test set
test_df.info()

In [None]:
# Calculate the number and the proportion of Disaster Tweets and Non-Disaster Tweets in the training set
print(train_df["target"].value_counts())
print(train_df["target"].value_counts()/len(train_df))

In [None]:
# Calculate the number and the proportion of Disaster Tweets and Non-Disaster Tweets in the test set
print(test_df["target"].value_counts())
print(test_df["target"].value_counts()/len(test_df))

<a id='3'></a>
<a href='#0'>Back to ToC</a>

## 3. Explore the Data

### 3.1 Create a copy of the data for exploration

In [None]:
scaffolds = strat_train_set.copy()

### 3.2 Study each attribute and its characteristics

In [None]:
scaffolds.info()

Numerical attributes

In [None]:
scaffolds.describe()

Categorical attributes

In [None]:
scaffolds["Origin"].value_counts()

### 3.3 Study the correlations between attributes

In [None]:
import seaborn as sns
# Heatmap of the pairwise correlations between attributes
ax = sns.heatmap(scaffolds.corr(), vmin=-1, vmax=1, center=0, square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right');

In [None]:
scaffolds.corr()['Organelle'].sort_values(ascending=False)

The correlation coefficients of tRNA and rRNAp are very small. Can both attributes be removed?

### 3.4 Document what you have learned

Notes:

### 3.5 Identify promising transformations to apply

* Remove outliers
* Remove the attribute "Origin"
* Standardize all attributes

<a id='4'></a>
<a href='#0'>Back to ToC</a>

## 4. Prepare the Data

Notes:
* Work on copies of the data
* Write functions for all data transformations that will be applied

In [None]:
scaffolds_predictors = strat_train_set.drop("Organelle", axis=1)
scaffolds_label = strat_train_set["Organelle"].copy()

### 4.1 Data cleaning:
* Fix or remove outliers
* Fill in missing values
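
As one possible way to fill in missing values for the tweet data, the sketch below replaces blank **keyword** and **location** entries with a placeholder. It follows the notes above (work on a copy, wrap the transformation in a function). The function name `fill_missing_text`, the empty-string placeholder, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def fill_missing_text(data, columns=("keyword", "location"), placeholder=""):
    """Return a copy of data with missing values in the given columns replaced."""
    cleaned = data.copy()  # work on a copy, leave the input untouched
    for col in columns:
        cleaned[col] = cleaned[col].fillna(placeholder)
    return cleaned


# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    "keyword": [None, "fire"],
    "location": ["Berlin", None],
    "text": ["Example tweet one", "Example tweet two"],
})

cleaned = fill_missing_text(toy_df)
print(cleaned)
```

Whether an empty string, a token such as "unknown", or dropping the columns works best is itself a choice that can be compared during model selection.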

### 4.2 Feature selection:
* Drop the attributes that provide no useful information for the task

In [None]:
attr_to_remove = ["Origin"]

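def remove_attr(data, which_to_remove=attr_to_remove, remove_attr=True):
    """Return data without the columns in which_to_remove (unchanged if remove_attr is False)."""
    if remove_attr:
        # Use the function argument rather than the global list
        return data.drop(which_to_remove, axis=1)
    else:
        return data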

### 4.3 Feature engineering, where appropriate:
* Discretize continuous features
* Decompose features
* Add promising transformations of features
* Aggregate features into promising new features

### 4.4 Feature scaling:
* Standardize or normalize features

Prepare the data for machine learning algorithms.

In [None]:
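from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Drop the unused attributes, then standardize the remaining ones
full_pipeline = Pipeline([
    ('attr_remover', FunctionTransformer(remove_attr, validate=False)),
    ('std_scaler', StandardScaler()),
])
scaffolds_predictors_prepared = full_pipeline.fit_transform(scaffolds_predictors)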


In [None]:
scaffolds_predictors_prepared

In [None]:
scaffolds_predictors_prepared.shape

<a id='5'></a>
<a href='#0'>Back to ToC</a>

## 5. Short-List Promising Models

### 5.1 Train many quick and dirty models from different categories (e.g. linear, naive Bayes, SVM, Random Forest, neural net, etc.) using different parameters

### 5.2 Measure and compare their performance

For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measured on the N folds.
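
The comparison described here could look roughly like the following sketch, which cross-validates two quick candidate models on TF-IDF features and reports the mean and standard deviation of the fold scores. The toy texts and labels are made up, the two models are arbitrary picks, and `scoring="f1"` (F-beta with beta = 1) is an assumption, since the notebook does not fix beta.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy tweets and labels (1 = disaster, 0 = not), for illustration only
texts = [
    "Forest fire spreading near the village",
    "Flood waters rising downtown, roads closed",
    "Earthquake damages several buildings",
    "Wildfire evacuation ordered for residents",
    "Storm floods streets and cuts power",
    "Explosion reported, emergency crews on site",
    "This new album is fire",
    "I am flooded with homework tonight",
    "That concert was an explosion of fun",
    "My phone battery died again",
    "Loving the sunny weather today",
    "Traffic was a disaster but dinner was great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Quick candidate models from different categories (arbitrary selection)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}

cv_summary = {}
for name, model in candidates.items():
    # Vectorizer and model are wrapped together so each fold fits its own TF-IDF
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the validation fold into training. On the real data, `texts` and `labels` would be replaced by `train_df["text"]` and `train_df["target"]`.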

### 5.3 Analyze the most significant variables for each algorithm

### 5.4 Analyze the type of errors the models make

### 5.5 Have a quick round of feature selection and engineering

### 5.6 Have one or two more quick iterations of the five previous steps

### 5.7 Short-list the top three to five most promising models, preferring models that make different types of errors

<a id='6'></a>
<a href='#0'>Back to ToC</a>

## 6. Fine-Tune the System

### 6.1 Fine-tune the hyperparameters using cross-validation

* Treat your data transformation choices as hyperparameters, especially when you are not sure about them
* Unless there are very few hyperparameter values to explore, prefer random search over grid search
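
The two points above can be sketched with scikit-learn's `RandomizedSearchCV`: the TF-IDF settings (a data transformation choice) are searched together with the classifier's regularization strength. The toy data, the chosen model, the parameter ranges, and `scoring="f1"` are all assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy tweets and labels (1 = disaster, 0 = not), for illustration only
texts = [
    "Forest fire spreading near the village",
    "Flood waters rising downtown, roads closed",
    "Earthquake damages several buildings",
    "Wildfire evacuation ordered for residents",
    "Storm floods streets and cuts power",
    "Explosion reported, emergency crews on site",
    "This new album is fire",
    "I am flooded with homework tonight",
    "That concert was an explosion of fun",
    "My phone battery died again",
    "Loving the sunny weather today",
    "Traffic was a disaster but dinner was great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Transformation choices and model hyperparameters share one search space
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__lowercase": [True, False],
    "clf__C": loguniform(1e-2, 1e2),  # sampled on a log scale
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,          # number of random configurations to try
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(texts, labels)
print(search.best_params_)
print(f"best mean CV score: {search.best_score_:.3f}")
```

Random search samples a fixed number of configurations regardless of how many hyperparameters are involved, which is why it tends to scale better than an exhaustive grid.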

### 6.2 Try ensemble methods

Combining your best models often performs better than running them individually.
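
One simple way to combine models is soft voting, which averages the predicted class probabilities of several classifiers. The sketch below uses scikit-learn's `VotingClassifier` on made-up toy tweets; the two base models are arbitrary stand-ins for whatever ends up on the short list.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy tweets and labels (1 = disaster, 0 = not), for illustration only
texts = [
    "Forest fire spreading near the village",
    "Flood waters rising downtown, roads closed",
    "Earthquake damages several buildings",
    "Wildfire evacuation ordered for residents",
    "Storm floods streets and cuts power",
    "Explosion reported, emergency crews on site",
    "This new album is fire",
    "I am flooded with homework tonight",
    "That concert was an explosion of fun",
    "My phone battery died again",
    "Loving the sunny weather today",
    "Traffic was a disaster but dinner was great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Soft voting averages predict_proba outputs, so every base model must provide it
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",
    ),
)
ensemble.fit(texts, labels)

new_tweets = ["Fire crews respond to the flood", "Homework again tonight"]
predictions = ensemble.predict(new_tweets)
print(predictions)
```

An ensemble tends to help most when its members make different kinds of errors, which is the reason for the preference stated in step 5.7.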

### 6.3 Once you are confident about your final model, measure its performance on the test set to estimate the generalization error

Do NOT tweak your model after measuring the generalization error: doing so risks overfitting the test set.

<a id='7'></a>
<a href='#0'>Back to ToC</a>

## 7. Present the Solution

### 7.1 Document what you have done

### 7.2 Create a nice presentation

Highlight the big picture first

### 7.3 Explain why your solution achieves the business objective

### 7.4 Present interesting points along the way

* Describe what worked and what did not
* List your assumptions and your system's limitations

### 7.5 Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements

<a id='8'></a>
<a href='#0'>Back to ToC</a>

## 8. Launch!