# COMP 562 - Machine Learning Final Project
## Plan

Our goal is to distinguish tweets about real disasters from tweets about fake, metaphorical, or otherwise unreal ones.
### Turning tweets into features

- Start with trigrams, can tune later
- Can consider bigrams, bag of words, or other n-grams
- Ignore location information, at least for now
- Almost all tweets have keywords, use as another feature
- Make sure to process "keyword" values, removing special characters
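
As a rough sketch of the keyword processing mentioned in the last bullet: the helper below lowercases a keyword, decodes URL-style escapes, and replaces any remaining special characters with spaces. The function name `clean_keyword` and the sample values are hypothetical, and the idea that keywords may contain escapes such as `%20` is an assumption rather than something shown in this notebook.

```python
import re
from urllib.parse import unquote


def clean_keyword(keyword):
    """Normalize a raw keyword into lowercase words separated by single spaces."""
    if not isinstance(keyword, str):
        # Missing keywords (for example NaN after loading with pandas) become "".
        return ""
    decoded = unquote(keyword).lower()              # assumed escapes: "%20" -> " "
    decoded = re.sub(r"[^a-z0-9 ]", " ", decoded)   # any other special character -> space
    return " ".join(decoded.split())                # collapse repeated spaces


# Hypothetical sample values, not taken from the dataset
print(clean_keyword("forest%20fire"))
print(clean_keyword("Bush-Fires!"))
```

The cleaned keyword could then be appended to the tweet text or vectorized as a separate feature; which works better is left open here.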

### Criteria for disaster
- Meant to track if tweets are referring to ongoing disasters
- Also includes historical events


### Training
- Train and validate our model on `train.csv` 
- Test by sending results to Kaggle

### Random forest
- Use Gini criterion for efficiency
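
To make the criterion above concrete: the Gini impurity of a node is one minus the sum of squared class proportions, so it needs no logarithm (unlike entropy), which is presumably the efficiency argument. The helper name `gini_impurity` is hypothetical, and this is only an illustrative sketch, not scikit-learn's internal implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# A perfectly mixed binary node is maximally impure; a pure node has impurity 0.
print(gini_impurity([0, 0, 1, 1]))
print(gini_impurity([1, 1, 1]))
```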

### Neural networks
- Use multi-layer perceptron classifier
- Tweak alpha values

## Disaster tweet classification
### Important modules

In [None]:
import numpy as np
import pandas as pd
import string, re

### Importing data

In [None]:
train_df = pd.read_csv("disaster-tweets/data/train.csv")
test_df = pd.read_csv("disaster-tweets/data/test.csv")

### Finding all characters in dataset

In [None]:
def standardize_string(s):
    # Lowercase the tweet and strip t.co links (http and https)
    s = s.lower()
    s = re.sub(r"https?://t\.co/\S+", "", s)
    return s

In [None]:
all_characters = set()

for tweet in train_df['text']:
    all_characters = all_characters.union(set(standardize_string(tweet)))

char_list = list(all_characters)
char_list.sort()
print(char_list)

### Narrowing down characters

We decided that from these characters, we would only keep letters, numbers, and a few accented characters. We also kept '#' and '@' due to their importance on Twitter.

In [None]:
included_chars = list(string.ascii_lowercase + string.digits) + ['#', '@', 'â', 'ã', 'å', 'ç', 'è', 'ê', 'ì', 'ï', 'ñ', 'ò', 'ó', 'ü', ' ']
print(included_chars)

### Removing invalid characters
- Try both with and without removing special characters
- Consider skipping data points with bad characters

In [None]:
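def remove_special_characters(s):
    # Keep only characters in included_chars. Filtering on the allowed set
    # (rather than looping over char_list, which is built from the training
    # tweets) also drops characters that appear only in the test tweets.
    allowed = set(included_chars)
    return "".join(c for c in s if c in allowed)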

def format_tweet(t):
    # Makes lowercase
    formatted_tweet = t.lower()
    # Removes links, along with a single preceding space if present
    formatted_tweet = re.sub(r" ?https?://t\.co/\S+", "", formatted_tweet)
    # Removes any special characters not listed in included_chars
    formatted_tweet = remove_special_characters(formatted_tweet)
    # Collapses multiple consecutive spaces and drops leading spaces
    final_tweet = re.sub(r" +", " ", formatted_tweet).lstrip(" ")
    return final_tweet

In [None]:
formatted_train_tweets = [format_tweet(tweet) for tweet in train_df["text"]]
formatted_test_tweets = [format_tweet(tweet) for tweet in test_df["text"]]

test_ids = test_df['id']

### Splitting tweets into bigrams

Tweets were processed into bigram representations, which capture two consecutive words at a time.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
bigram_train = bigram_vectorizer.fit_transform(formatted_train_tweets)
bigram_test = bigram_vectorizer.transform(formatted_test_tweets)
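
To show what a word n-gram representation looks like, here is a small pure-Python sketch. The function `word_ngrams` and the example sentence are hypothetical. It splits on whitespace only, whereas `CountVectorizer` applies its own tokenization (by default it ignores punctuation and single-character tokens), so the two will not always agree.

```python
def word_ngrams(text, n):
    """Return the list of n consecutive-word sequences in `text`."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


example = "forest fire near town"  # hypothetical tweet text
print(word_ngrams(example, 1))  # bag-of-words vocabulary entries
print(word_ngrams(example, 2))  # bigrams
print(word_ngrams(example, 3))  # trigrams
```

A tweet with fewer than `n` words produces no n-grams at all, which is one reason higher-order n-grams alone can yield very sparse features on short texts.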

In [None]:
# 1-gram no string formatting
# array([0.55543823, 0.50891089, 0.54221388, 0.51913133, 0.68794326])
# 1 and 2-gram, no string formatting
# array([0.46118721, 0.45027322, 0.43412527, 0.44141069, 0.61523626])
# 1-gram basic string formatting
# array([0.57556936, 0.48219736, 0.5530303 , 0.51859504, 0.68586387])
# 1 & 2-gram, basic string formatting
# array([0.5039019 , 0.41150442, 0.41241685, 0.45823928, 0.62327416])
# bigram only, basic string formatting
# array([0.24096386, 0.25725095, 0.1682243 , 0.17475728, 0.31060606])

### Random forest classifier
#### Creating the model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(n_jobs=10, max_depth=None, class_weight="balanced")

#### Cross-validation

In [None]:
rf_parameters = {
    'min_samples_split': range(2, 5),
    'min_samples_leaf': range(1, 4),
    'n_estimators': [50, 100, 500]
}

rf_cv = GridSearchCV(rf, rf_parameters, verbose=3, n_jobs=10)
rf_cv.fit(bigram_train, train_df['target'])
print(rf_cv.best_params_)

# {'min_samples_leaf': 2, 'min_samples_split': 4, 'n_estimators': 500}
# {'class_weight': 'balanced', 'max_depth': None, 'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1}
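
For a sense of how much work this grid search involves, the sketch below counts the parameter combinations. It assumes the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not given (scikit-learn 0.22 and later); the variable names other than `rf_parameters` are illustrative.

```python
from itertools import product

rf_parameters = {
    'min_samples_split': range(2, 5),
    'min_samples_leaf': range(1, 4),
    'n_estimators': [50, 100, 500],
}

# Every combination of the three parameter lists is one candidate model.
combinations = list(product(*rf_parameters.values()))
n_folds = 5  # assumption: GridSearchCV's default when cv is not specified
total_fits = len(combinations) * n_folds

print(len(combinations), "candidates,", total_fits, "cross-validation fits")
```

With the default `refit=True`, one additional fit on the full training set follows, using the best combination.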

#### Predicting test data

As the disaster tweets dataset is from a Kaggle competition, the creators chose not to make the test labels public. We therefore had the random forest model make predictions on the test data and submitted those predictions to Kaggle to be scored.

In [None]:
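rf_predicted_classes = rf_cv.predict(bigram_test)
print(rf_predicted_classes)
# Pair each test id with its predicted class
rf_out_array = [[int(tweet_id), int(pred_class)] for tweet_id, pred_class in zip(test_ids, rf_predicted_classes)]

# Write an "id,target" header row, matching Kaggle's sample submission format
np.savetxt("disaster-tweets/rf-results.csv", rf_out_array, delimiter=',', fmt='%i', header='id,target', comments='')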

### Multilayer perceptron classifier
#### Creating the model

In [None]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(verbose=True, tol=.001)

#### Cross-validation

In [None]:
mlpc_parameters = {
    "alpha": [.0001, .001, .01, .1]
}
mlpc_cv = GridSearchCV(mlpc, mlpc_parameters, verbose=3, n_jobs=-1)
mlpc_cv.fit(bigram_train, train_df['target'])
print(mlpc_cv.best_params_)

#### Predicting test data

In [None]:
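mlpc_predicted_classes = mlpc_cv.predict(bigram_test)
# Pair each test id with its predicted class
mlpc_out_array = [[int(tweet_id), int(pred_class)] for tweet_id, pred_class in zip(test_ids, mlpc_predicted_classes)]

# Write an "id,target" header row, matching Kaggle's sample submission format
np.savetxt("disaster-tweets/mlpc-results.csv", mlpc_out_array, delimiter=',', fmt='%i', header='id,target', comments='')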

#### Retrying with tri-grams

Because the accuracy was undesirably low, we trained the multilayer perceptron classifier again, this time with tweets represented as tri-grams rather than bi-grams.

In [None]:
trigram_vectorizer = CountVectorizer(ngram_range=(3,3))
trigram_train = trigram_vectorizer.fit_transform(formatted_train_tweets)
trigram_test = trigram_vectorizer.transform(formatted_test_tweets)

mlpc_cv.fit(trigram_train, train_df['target'])
print(mlpc_cv.best_params_)

In [None]:
mlpc_trigram_predicted_classes = mlpc_cv.predict(trigram_test)
# Pair each test id with its predicted class
mlpc_trigram_out_array = [[int(tweet_id), int(pred_class)] for tweet_id, pred_class in zip(test_ids, mlpc_trigram_predicted_classes)]

# Write an "id,target" header row, matching Kaggle's sample submission format
np.savetxt("disaster-tweets/mlpc-trigram-results.csv", mlpc_trigram_out_array, delimiter=',', fmt='%i', header='id,target', comments='')

## Humor detection
### Important modules

In [None]:
import numpy as np
import pandas as pd

### Importing data

We decided to train on only 10 percent of the data, as there are 200,000 data points. Training on, for example, 60 percent of the data would take prohibitively long (more than 10 minutes for one fit). However, evaluation time is more reasonable than training time, so we can still test on the remaining 90 percent of the data.

In [None]:
from sklearn.model_selection import train_test_split

# Read the CSV once and pull both columns from the same DataFrame
humor_df = pd.read_csv("humor-dataset.csv")
all_text = humor_df["text"]
all_humor = humor_df["humor"]

text_train, text_test, humor_train, humor_test = train_test_split(all_text, all_humor, train_size=.1)
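
The 10/90 split described above works out to 20,000 training points and 180,000 test points. The pure-Python sketch below illustrates the idea with a seeded shuffle; `split_indices` and its `seed` argument are hypothetical stand-ins and do not reproduce `train_test_split` exactly.

```python
import random


def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(200_000, 0.1)
print(len(train_idx), len(test_idx))
```

In the notebook itself, passing a fixed `random_state` to `train_test_split` would similarly make the split repeatable between runs.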

### Making bi-grams

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2,2))
bigram_train = bigram_vectorizer.fit_transform(text_train)
bigram_test = bigram_vectorizer.transform(text_test)

### Random forest classifier
#### Making the model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(max_depth=None, class_weight="balanced", n_jobs=8, verbose=3)

#### Cross-validation

In [None]:
rf_params = {
    'n_estimators': [50, 100, 500]
}

rf_cv = GridSearchCV(rf, rf_params, n_jobs=8, verbose=3)
rf_cv.fit(bigram_train, humor_train)
print(rf_cv.best_params_)
# {'n_estimators': 100}: 0.8332

#### Evaluation

The humor dataset, unlike the disaster tweets dataset, included labels for every piece of text in the dataset. As such, we are able to directly compute accuracy/F1 scores and also to plot confusion matrices.

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sb
import matplotlib.pyplot as plt

def get_cm(preds):
    return confusion_matrix(humor_test, preds, labels=[False, True], normalize='all')

def plot_confusion_matrix(cm, title):
    fx = sb.heatmap(cm, annot=True, cmap='turbo')

    # labels the title and x, y axis of plot
    fx.set_title(title + '\n\n')
    fx.set_xlabel('Predicted Values')
    fx.set_ylabel('Actual Values')

    # labels the boxes
    fx.xaxis.set_ticklabels(['False','True'])
    fx.yaxis.set_ticklabels(['False','True'])

    plt.show()
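
For reference, the quantities mentioned above can be computed by hand as follows. This is a sketch with hypothetical names and toy labels; the matrix layout (rows are actual classes, columns are predicted classes, every cell divided by the total) is meant to mirror `confusion_matrix(..., labels=[False, True], normalize='all')`.

```python
def binary_metrics(actual, predicted):
    """Return (normalized confusion matrix, accuracy, F1) for boolean labels."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a and not p:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    total = tn + fp + fn + tp
    # Rows: actual False/True. Columns: predicted False/True.
    cm = [[tn / total, fp / total], [fn / total, tp / total]]
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return cm, accuracy, f1


# Toy labels for illustration only, not results from the humor dataset
cm, accuracy, f1 = binary_metrics([True, True, False, False],
                                  [True, False, False, False])
print(cm, accuracy, f1)
```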

#### Confusion matrix and accuracy

In [None]:
# Predict with the fitted grid search; GridSearchCV fits clones, so `rf` itself is never fitted
humor_preds = rf_cv.predict(bigram_test)
rf_cm = get_cm(humor_preds)
plot_confusion_matrix(rf_cm, "Random forest with bigrams")
rf_cv.score(bigram_test, humor_test)

### Multilayer perceptron classifier
#### Making the model

In [None]:
from sklearn.neural_network import MLPClassifier
mlpc = MLPClassifier(tol=.001, verbose=True)

#### Cross-validation

In [None]:
mlpc_params = {
    "alpha": [.0001, .001, .01, .1]
}
print("Starting CV...")
mlpc_cv = GridSearchCV(mlpc, mlpc_params, verbose=3, n_jobs=8)
mlpc_cv.fit(bigram_train, humor_train)

#### Confusion matrix and accuracy

In [None]:
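# Predict with the fitted grid search; GridSearchCV fits clones, so `mlpc` itself is never fitted
humor_preds = mlpc_cv.predict(bigram_test)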
mlpc_cm = get_cm(humor_preds)
plot_confusion_matrix(mlpc_cm, "Multilayer perceptron with bigrams")
mlpc_cv.score(bigram_test, humor_test)