## 1A Download Data

(1) how many training and test data points are there?

**There are 7613 test data points, and 3263 test data points.**

(2) what percentage of the training tweets are of real disasters, and what percentage is not? Note that the meaning of each column is explained in the data description on Kaggle.

**42.966% of tweets are of real disasters, while 57.034% of tweets are not from real disasters.**

In [1]:
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sub = pd.read_csv('sample_submission.csv')

In [2]:
train.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
train.shape[0]

7613

In [4]:
round(100*(train['target'].sum())/train.shape[0], 4)

42.966

In [5]:
100-round(100*(train['target'].sum())/train.shape[0], 4)

57.034

In [6]:
test.head(5)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
test.shape[0]

3263

### 1B Split training data

In [8]:
training_set = train.sample(frac = .7)
dev_set = df = train.merge(training_set, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

### 1C Preprocess the data

In [9]:
training_set.head(5)

Unnamed: 0,id,keyword,location,text,target
1527,2209,chemical%20emergency,,Emergency Response and Hazardous Chemical Mana...,0
543,790,avalanche,Canada,What a feat! Watch the #BTS of @kallemattson's...,0
4819,6860,mass%20murder,Los Angeles,RT owenrbroadhurst RT JuanMThompson: At this h...,1
6446,9221,suicide%20bombing,GCC,Alleged driver in #Kuwait attack 'joined Daesh...,1
2870,4124,drought,"Las Cruces, NM",Pretty neat website to get the latest drought ...,1


I first drop the ID column because the label of the tweet will likely not have anything to do with whether or not the tweet is related to a real disaster.

In [10]:
training_set.drop(['id'], inplace=True, axis=1)
dev_set.drop(['id', '_merge'], inplace=True, axis=1)

In [11]:
training_set.head(5)

Unnamed: 0,keyword,location,text,target
1527,chemical%20emergency,,Emergency Response and Hazardous Chemical Mana...,0
543,avalanche,Canada,What a feat! Watch the #BTS of @kallemattson's...,0
4819,mass%20murder,Los Angeles,RT owenrbroadhurst RT JuanMThompson: At this h...,1
6446,suicide%20bombing,GCC,Alleged driver in #Kuwait attack 'joined Daesh...,1
2870,drought,"Las Cruces, NM",Pretty neat website to get the latest drought ...,1


I make all text lowercase and strip punctuation in order for real disaster tweet predictions to rely only on the content and word sequences in my dataset, as opposed to formatting differences (upper vs. lower case). My data will all be uniform in order to eliminate the possiblity that text formatting influences my classification of tweets.

I also get rid of all Twitter usernames, hyperlinks, and random other non-normal English alphabet or Phonecian integers (for example, there's a weird U character that didn't fit into any weird. I remove it below). These text attributes are additional information from the true content of the tweet and therefore I don't anticipate that they will have a major influence on the accurate classification of tweets as real or not real disaster.

In [12]:
def clean_df(df):
    for col in ['keyword', 'location', 'text']:
        df[col] = df[col].str.lower()
        #get rid of twitter usernames
        df[col] = df[col].str.replace('@([\w]+)','')
        #get rid of hyperlinks
        df[col] = df[col].str.replace('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+','')
        #get rid of all punctuation
        df[col] = df[col].str.replace('[^\w\s]','')
        #weird letter type U in some responses; get rid of it
        df[col] = df[col].str.replace('û_', '')
    return df

In [13]:
training_set = clean_df(training_set)
dev_set = clean_df(dev_set)

I also get rid of stop words and lemmatize all of the words in order to extract only the unique words that have important meaning in each tweet. I anticipate that these unique words or range of specific words will contribute to accurate classification of tweets.

In [1]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [15]:
def stopwords(df):
    
    pat = r'\b(?:{})\b'.format('|'.join(stop))
    df['text'] = df['text'].str.replace(pat, '')
    df['text'] = df['text'].str.replace(r'\s+', ' ')
    
    return df

In [16]:
training_set = stopwords(training_set)
dev_set = stopwords(dev_set)

In [17]:
training_set.head(5)

Unnamed: 0,keyword,location,text,target
1527,chemical20emergency,,emergency response hazardous chemical manageme...,0
543,avalanche,canada,feat watch bts incredible music video avalanche,0
4819,mass20murder,los angeles,rt owenrbroadhurst rt juanmthompson hour 70 yr...,1
6446,suicide20bombing,gcc,alleged driver kuwait attack joined daesh day ...,1
2870,drought,las cruces nm,pretty neat website get latest drought conditi...,1


I lemmatize all of the words in order to keep tweets consistent in order for comparison soon (to try to achieve an accurate model).

In [18]:
from nltk.stem import WordNetLemmatizer
import nltk

def lemmatize(df):
    wnl = WordNetLemmatizer()

    new = []
    for x in df['text']:
        y = nltk.word_tokenize(x)
        store = []
        for word in y:
            store.append(wnl.lemmatize(word))
        new.append(' '.join(store))

    df['text'] = new
    
    return df

In [19]:
training_set = lemmatize(training_set)
dev_set = lemmatize(dev_set)

In [20]:
training_set.head(5)

Unnamed: 0,keyword,location,text,target
1527,chemical20emergency,,emergency response hazardous chemical manageme...,0
543,avalanche,canada,feat watch bts incredible music video avalanche,0
4819,mass20murder,los angeles,rt owenrbroadhurst rt juanmthompson hour 70 yr...,1
6446,suicide20bombing,gcc,alleged driver kuwait attack joined daesh day ...,1
2870,drought,las cruces nm,pretty neat website get latest drought conditi...,1


https://medium.com/biaslyai/beginners-guide-to-text-preprocessing-in-python-2cbeafbf5f44

### 1D Bag of Words

I chose 10 as the minimum number of different tweets words have to appear in order to be included in my model. After calculated all F1 scores when using min_dfs equal in the range of 10 to the entire length of the training set, 10 resulted in the highest F1 score for both training and development sets.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorize the training set
count_vect = CountVectorizer(binary=True, min_df=10)
X_train = count_vect.fit_transform(training_set['text'])
X_test = count_vect.transform(dev_set['text'])

### 1E Logistic Regression: No Regularization

I use the same solver parameter for all 3 classifiers below to try to maintain consistent manipulation in order to accurately compare each of their performance. 'Saga' solver was the only solver that works with both no regularization, L1 reg, and L2 reg in sklearn.

In [22]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(verbose=True, penalty='none', solver='saga')
model_noreg = logreg.fit(X_train, training_set['target'])

max_iter reached after 0 seconds


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


In [23]:
predicted_train = model_noreg.predict(X_train)
predicted_dev = model_noreg.predict(X_test)

In [24]:
from sklearn.metrics import f1_score

f1_train_noreg = f1_score(training_set['target'], predicted_train)
f1_dev_noreg = f1_score(dev_set['target'], predicted_dev)

print('F1 score of training:', f1_train_noreg)
print('F1 score of development:', f1_dev_noreg)

F1 score of training: 0.8501362397820164
F1 score of development: 0.7194994786235662


With current Bag of Words model, I slightly overfit my data as seen in the higher F1 score of my predictions using my training set, as opposed to my predictions using my development set.

### 1E Logistic Regression: L1 Regularization

In [25]:
logreg = LogisticRegression(verbose=True, penalty='l1', solver='saga')
model_l1 = logreg.fit(X_train, training_set['target'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 2 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.5s finished


In [26]:
predicted_train_l1 = model_l1.predict(X_train)
predicted_dev_l1 = model_l1.predict(X_test)

In [27]:
from sklearn.metrics import f1_score

f1_train_l1 = f1_score(training_set['target'], predicted_train_l1)
f1_dev_l1 = f1_score(dev_set['target'], predicted_dev_l1)

print('F1 score of training:', f1_train_l1)
print('F1 score of development:', f1_dev_l1)

F1 score of training: 0.8171992481203008
F1 score of development: 0.7241003271537622


### 1E Logistic Regression: L2 Regularization

In [28]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(verbose=True, penalty='l2', solver='saga')
model_l2 = logreg.fit(X_train, training_set['target'])

convergence after 79 epochs took 0 seconds


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


In [29]:
predicted_train_l2 = model_l2.predict(X_train)
predicted_dev_l2 = model_l2.predict(X_test)

In [30]:
from sklearn.metrics import f1_score

f1_train_l2 = f1_score(training_set['target'], predicted_train_l2)
f1_dev_l2 = f1_score(dev_set['target'], predicted_dev_l2)
print('F1 score of training:', f1_train_l2)
print('F1 score of development:', f1_dev_l2)

F1 score of training: 0.8287910552061495
F1 score of development: 0.7305002689618074


## Summary of Logistic Regression Models

In [31]:
print('F1 Score of training set using:')
print('\tNo Regularization:', f1_train_noreg)
print('\tL1 Regularization:', f1_train_l1)
print('\tL2 Regularization:', f1_train_l2)

print('\nF1 Score of development set using:')
print('\tNo Regularization:', f1_dev_noreg)
print('\tL1 Regularization:', f1_dev_l1)
print('\tL2 Regularization:', f1_dev_l2)

F1 Score of training set using:
	No Regularization: 0.8501362397820164
	L1 Regularization: 0.8171992481203008
	L2 Regularization: 0.8287910552061495

F1 Score of development set using:
	No Regularization: 0.7194994786235662
	L1 Regularization: 0.7241003271537622
	L2 Regularization: 0.7305002689618074


**Evaluation:** Logistic Regression with L2 Regularization showed the best performance on the training and development set. Compared to model performances of the models with no regularization and L1 regularization, the model with L2 regularization decreases overfitting on the training set, and maximized performance on the development set, respectively. Even though the classifier with L1 regularization also increased performance of the testing set, it didn't increase model performance of the testing set as much as the model with L2 regularization, and on top of this, the classifier with L1 regularziation showed decreased performance on the development set. This shows that L2 regularization is the optimal regularization for this specific problem.

In [32]:
import numpy as np

words_coeffs = {}

coeffs = model_l1.coef_.tolist()[0]
words = [key for key in count_vect.vocabulary_.keys()]
for i, val in enumerate(coeffs):
    words_coeffs[words[i]] = val
    
sorted_dict = {}
sorted_keys = sorted(words_coeffs, key=words_coeffs.get)  # [1, 3, 2]

for w in sorted_keys:
    sorted_dict[w] = words_coeffs[w]

In [33]:
top_words_neg = list(sorted_dict.keys())[:25]
top_words_pos = list(sorted_dict.keys())[::-1][:25]

In [34]:
print('The most important words for deciding whether a tweet IS about a real disaster:')
print('\n', top_words_pos)

The most important words for deciding whether a tweet IS about a real disaster:

 ['second', 'bioterror', 'wreck', 'outrage', 'costlier', 'men', 'gon', 'london', 'ship', 'give', 'hailstorm', 'severe', 'tsunami', 'tried', 'game', 'shoulder', 'link', 'life', 'may', 'night', 'destruction', 'wont', 'detonate', 'pakistan', 'russia']


In [35]:
print('The most important words for deciding whether a tweet IS NOT about a real disaster:')
print('\n', top_words_neg)

The most important words for deciding whether a tweet IS NOT about a real disaster:

 ['reddit', 'season', 'nigerian', 'wound', 'train', 'like', 'fatality', 'teen', 'bear', 'let', 'horrible', 'hey', 'annihilation', 'course', 'enough', 'lead', 'geller', 'cliff', 'think', 'chile', 'listen', 'move', 'owner', '50', 'well']


### Bernoulli Naive Bayes

In [36]:
X_train = X_train.toarray()
n = X_train.shape[0] # size of the dataset
d = X_train.shape[1] # number of features in our dataset
K = 2 # number of clases

# these are the shapes of the parameters
psis = np.zeros([K,d])
phis = np.zeros([K])

# we now compute the parameters
for k in range(K):
    X_k = X_train[training_set['target'] == k]
    #Laplace smoothing
    psis[k] = (X_k.shape[0] + 1)/(float(n)+2)
    phis[k] = X_k.shape[0] / float(n)

# print out the class proportions
print(psis, phis)

[[0.57137498 0.57137498 0.57137498 ... 0.57137498 0.57137498 0.57137498]
 [0.42862502 0.42862502 0.42862502 ... 0.42862502 0.42862502 0.42862502]] [0.57140176 0.42859824]


In [37]:
#  we can implement this in numpy
def nb_predictions(x, psis, phis):
    """This returns class assignments and scores under the NB model.
    
    We compute \arg\max_y p(y|x) as \arg\max_y p(x|y)p(y)
    """
    # adjust shapes
    n, d = x.shape
    x = np.reshape(x, (1, n, d))
    psis = np.reshape(psis, (K, 1, d))
    
    # clip probabilities to avoid log(0)
    psis = psis.clip(1e-14, 1-1e-14)
    
    # compute log-probabilities
    logpy = np.log(phis).reshape([K,1])
    logpxy = x * np.log(psis) + (1-x) * np.log(1-psis)
    logpyx = logpxy.sum(axis=2) + logpy

    return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K,n])

idx, logpyx = nb_predictions(X_train, psis, phis)

In [38]:
predictions, logpyx = nb_predictions(X_test.toarray(), psis, phis)

In [39]:
#accuracy of BNB on development set
print('F1 score on development set using Naive Bayes:', f1_score(dev_set['target'], predictions))

F1 score on development set using Naive Bayes: 0.603485172730052


**Model Comparison:**

- The Logistic Regression with L2 regularization performed the best in predicting whether a tweet is of a real disaster or not (looking only at performance on development set): it's F1 score was 0.7396 while Naive Bayes predicted the same outcome at 0.6153 (on development set only). Generative models like Naive Bayes have high explanatory power, can detect outliers, and  aren't as computationally expensive as discriminative models. However they can only be used in supervised machine learning because they learn classifications by learning about all of the details of the datatypes/classes it's given, and often result in poorer perfromance than discriminative models. Discriminative models like logistic regression are good because they require less data to make accurate predictions, and produce generally accurate predictions across different types of problems. However, discriminative models are harder to interpret because they seem like black-boxes, and are usually more computationally expensive than generative models.

- Naive Bayes assumes that the position of the words in the tweet doesn't affect whether or not the tweet is of a real disaster or not (Bag of Words methods), that each word in a tweet is independent of the other words in the tweet (which makes Naive Bayes for classification purposes robust to outliers), and that their can be any amount of classes to be predicted. These assumptions are different than the ones made by Logistic Regression. Binary logistic Regression assumes that there only two possible predicted classes, that each word in a tweet might affect whether or not the tweet is of a real disaster or not, and that each word in a tweet is linearly related to the log odds. Naive Bayes is a valid model to use for text classification because it doesn't restrict what types of features are inputted into the model - it can predict classes using specific word features or long strings of words. However, it works the best for text classification if it's condition independence assumption about words in tweets holds, and only if the programmer handles the cold start problem - what to do with new, unseen words that it doesn't have prior knowledge for. Overall Naive Bayes is an okay model to use for text classification because of its robustness, efficient computability, and generalizability, but there are more accurate classifiers out there that make less unrealistic assumptions about the data: like logistic regression.

http://primo.ai/index.php?title=Discriminative_vs._Generative

https://web.stanford.edu/~jurafsky/slp3/slides/7_NB.pdf

### N-Gram Model

After using the code below to determine the best M for the CountVectorizer, I found 10 to be the best min_df that optimized model performance, just like as in the logistic regression classifiers above. 

In [40]:
# #choose M

# def choose_M(val):
    
    
#     count_vect = CountVectorizer(binary=True, min_df=val, ngram_range=(1,2))
#     X_train = count_vect.fit_transform(training_set['text'])
#     X_test = count_vect.transform(dev_set['text'])

#     logreg = LogisticRegression(verbose=True, penalty='l1', solver='saga')
#     model_l1_ngram = logreg.fit(X_train, training_set['target'])

#     predicted_train_l1_ngram = model_l1_ngram.predict(X_train)
#     predicted_dev_l1_ngram = model_l1_ngram.predict(X_test)



#     f1_train_l1_ngram = f1_score(training_set['target'], predicted_train_l1_ngram)
#     f1_dev_l1_ngram = f1_score(dev_set['target'], predicted_dev_l1_ngram)

# #     print('F1 score of training - with Ngrams:', f1_train_l1_ngram)
# #     print('F1 score of development - with Ngrams:', f1_dev_l1_ngram)
    
#     return f1_train_l1_ngram, f1_dev_l1_ngram
    
# final_min_df = 0
# dev = 0
# training = 0

# for num in range(10, training_set.shape[0], 10):
#     new_training, new_dev = choose_M(num)
#     if new_training > training and new_dev > dev:
#         final_min_df = num
#     else:
#         pass

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorize the training set
count_vect = CountVectorizer(binary=True, min_df=10, ngram_range=(1,2))
X_train = count_vect.fit_transform(training_set['text'])
X_test = count_vect.transform(dev_set['text'])

In [42]:
print('Total number of 1-grams and 2-grams:', len(count_vect.vocabulary_))

twograms = [grams for grams in count_vect.vocabulary_.keys() if len(grams.split(' ')) == 2]

print('10 2-grams in CountVectorizer Vocab:', twograms[:10])

Total number of 1-grams and 2-grams: 1260
10 2-grams in CountVectorizer Vocab: ['mass murder', 'suicide bombing', 'take quiz', 'cross body', 'dont know', 'christian attacked', 'attacked muslim', 'muslim temple', 'temple mount', 'mount waving']


In [43]:
#Log Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

logreg = LogisticRegression(verbose=True, penalty='l2', solver='saga')
model_l2_ngram = logreg.fit(X_train, training_set['target'])

predicted_train_l2_ngram = model_l2_ngram.predict(X_train)
predicted_dev_l2_ngram = model_l2_ngram.predict(X_test)



f1_train_l2_ngram = f1_score(training_set['target'], predicted_train_l2_ngram)
f1_dev_l2_ngram = f1_score(dev_set['target'], predicted_dev_l2_ngram)

print('F1 score of training - with Ngrams:', f1_train_l2_ngram)
print('F1 score of development - with Ngrams:', f1_dev_l2_ngram)

max_iter reached after 0 seconds
F1 score of training - with Ngrams: 0.831114225648213
F1 score of development - with Ngrams: 0.730124391563007


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


I use L2 regularization in this logistic regression classifier because of the high L2 regularized classifier performance in section E above. Compared to the performance on the L2 regularized classifier above using Bag of Words, this L2 regularized classifier using Ngrams performs slightly better: the results of the classifier using Bag of Words was 0.8186 for the training set and 0.7396 for the development set. Now, using the Ngrams method, the classifier overfits the training data slightly more (by .03), and performs slightly better on the development set. Overall the results are comprable.

In [44]:
X_train = X_train.toarray()
n = X_train.shape[0] # size of the dataset
d = X_train.shape[1] # number of features in our dataset
K = 2 # number of clases

# these are the shapes of the parameters
psis = np.zeros([K,d])
phis = np.zeros([K])

# we now compute the parameters
for k in range(K):
    X_k = X_train[training_set['target'] == k]
    #Laplace smoothing
    psis[k] = (X_k.shape[0] + 1)/(float(n)+2)
    phis[k] = X_k.shape[0] / float(n)

# print out the class proportions
print(psis, phis)

[[0.57137498 0.57137498 0.57137498 ... 0.57137498 0.57137498 0.57137498]
 [0.42862502 0.42862502 0.42862502 ... 0.42862502 0.42862502 0.42862502]] [0.57140176 0.42859824]


In [45]:
#  we can implement this in numpy
def nb_predictions_ngrams(x, psis, phis):
    """This returns class assignments and scores under the NB model.
    
    We compute \arg\max_y p(y|x) as \arg\max_y p(x|y)p(y)
    """
    # adjust shapes
    n, d = x.shape
    x = np.reshape(x, (1, n, d))
    psis = np.reshape(psis, (K, 1, d))
    
    # clip probabilities to avoid log(0)
    psis = psis.clip(1e-14, 1-1e-14)
    
    # compute log-probabilities
    logpy = np.log(phis).reshape([K,1])
    logpxy = x * np.log(psis) + (1-x) * np.log(1-psis)
    logpyx = logpxy.sum(axis=2) + logpy

    return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K,n])

In [46]:
predictions_ngrams, logpyx = nb_predictions_ngrams(X_test.toarray(), psis, phis)

In [47]:
#accuracy of BNB on development set
print('F1 score on development set using Naive Bayes and Ngrams:', f1_score(dev_set['target'], predictions_ngrams))

F1 score on development set using Naive Bayes and Ngrams: 0.603485172730052


Compared to the Naive Bayes classifier performance using Bag of Words (0.6153), the Naive Bayes classifier using Ngrams performs the same (printed above). 

Both Logistic Regression classifiers and Naive Bayes classifiers perform basically the same, if not better, when using the Ngrams technique instead of the Bag of Words technique to vectorize the training set. This implies that the Ngrams created in the vectorization of the training set have the same or slightly higher predictive power than the vecotrs created using the Bag of Words technique, or that the combination of 2 words slightly more impact on the classification of whether a tweet is of a real disaster or not than using single words (no combinations of words, or 2-grams). 

### Final Model: N-grams using Logistic Regression

In [48]:
train

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [49]:
ids = test['id']

In [50]:
train.drop(columns=['id'], inplace=True)
test.drop(columns=['id'], inplace=True)

In [51]:
train = clean_df(train)
test = clean_df(test)

In [52]:
train = stopwords(train)
test = stopwords(test)

In [53]:
train = lemmatize(train)
test = lemmatize(test)

In [54]:
train

Unnamed: 0,keyword,location,text,target
0,,,deed reason earthquake may allah forgive u,1
1,,,forest fire near la ronge sask canada,1
2,,,resident asked shelter place notified officer ...,1
3,,,13000 people receive wildfire evacuation order...,1
4,,,got sent photo ruby alaska smoke wildfire pour...,1
...,...,...,...,...
7608,,,two giant crane holding bridge collapse nearby...,1
7609,,,control wild fire california even northern par...,1
7610,,,m194 0104 utc5km volcano hawaii,1
7611,,,police investigating ebike collided car little...,1


In [55]:
test

Unnamed: 0,keyword,location,text
0,,,happened terrible car crash
1,,,heard earthquake different city stay safe ever...
2,,,forest fire spot pond goose fleeing across str...
3,,,apocalypse lighting spokane wildfire
4,,,typhoon soudelor kill 28 china taiwan
...,...,...,...
3258,,,earthquake safety los angeles ûò safety fasten...
3259,,,storm ri worse last hurricane cityamp3others h...
3260,,,green line derailment chicago
3261,,,meg issue hazardous weather outlook hwo


In [59]:
#N grams
from sklearn.feature_extraction.text import CountVectorizer

# vectorize the training set
count_vect = CountVectorizer(binary=True, min_df=10, ngram_range=(1,2))
X_train = count_vect.fit_transform(train['text'])
X_test = count_vect.transform(test['text'])

In [63]:
#Log Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

logreg = LogisticRegression(verbose=True, penalty='l2', solver='saga')
model_l2_ngram = logreg.fit(X_train, train['target'])

predicted_train_l2_ngram = model_l2_ngram.predict(X_train)
predicted_test_l2_ngram = model_l2_ngram.predict(X_test)


f1_train_l2_ngram = f1_score(train['target'], predicted_train_l2_ngram)

print('F1 score of training - with Ngrams:', f1_train_l2_ngram)

convergence after 99 epochs took 1 seconds
F1 score of training - with Ngrams: 0.8297042573935652


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s finished


In [64]:
df_submit = pd.DataFrame(data={'id': ids, 'target': predicted_test_l2_ngram})

In [65]:
df_submit.to_csv('sub.csv', index=False)

In [66]:
print('Kaggle F1 score=', 0.78516)

Kaggle F1 score= 0.78516


This F1 score was about what I expected, except performed better than anticipated. I assumed that my model might have been overfitting my training data, therefore it would have performed badly when exposed to the unseen test data. However, the combination of n-grams and logistic regression resulted in a high performance score. 

Despite my initial prediction that my model would perform badly on unseen data, it performed better than all of my other models in the past sections. This is probably because the assumption in logistic regression that features are dependent holds in the case of our test data: the presence of each word in a tweet is related to the presence of other words in the tweets, OR it may be the case that our dataset is small enough that if any independencies between the presence of words in tweets were to exist, our sample would be too small for those independencies to severly impact model performance.