# Kaggle comp - Twitter Disaster Tweets

##### This notebook uses:

##### Model - Logistic Regression from scikit-learn. In this notebook, only the 'text' feature is used.
##### Feature Extraction - CountVectorizer from scikit-learn. I am attempting to improve the predictions by fitting the vectorizer on both the train and test tweets. (Previously, the vectorizer was fitted on the train data only.)
##### Metrics - where model evaluation is used, the metric is F1 from scikit-learn.
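
As a rough orientation, the three pieces listed above can be chained in a few lines. This is only a minimal sketch on invented toy tweets, not the notebook's actual workflow: `make_pipeline` and the `toy_*` names are assumptions introduced for illustration, and the score is computed on the training rows purely to show the call.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy tweets invented for illustration (1 = real disaster, 0 = not).
toy_texts = [
    "forest fire near the town",
    "earthquake shakes the city",
    "flood destroys homes",
    "i love this song",
    "my day was great",
    "this pizza is fire",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts feed straight into logistic regression.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_texts, toy_labels)

# Scoring on the training rows only demonstrates the API; it says
# nothing about how the model generalises.
toy_preds = toy_pipeline.predict(toy_texts)
toy_f1 = f1_score(toy_labels, toy_preds)
print("F1 on the toy training data:", toy_f1)
```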

#### 1. Import libraries

In [1]:
# getting started

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [2]:
from sklearn import metrics
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
import time

In [3]:
from sklearn.preprocessing import OneHotEncoder

#### 2. Import data into Pandas data frames

In [4]:
train = pd.read_csv('train.csv') # creates pandas data-frame objects from the train & test data
test = pd.read_csv('test.csv') 

#### 3. Look at the Data

In [None]:
train.head()

In [None]:
# replace NaNs with empty strings

train = train.fillna('')

In [None]:
train.head()

In [None]:
train.describe()

From the mean of the `target` column we can see that about 43% of the training tweets describe real disasters.
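
The reason the mean can be read as a percentage is that the target is coded 0/1, so its mean equals the share of 1s. A small sketch with an invented seven-row target (not the competition data) shows the equivalence with `value_counts(normalize=True)`:

```python
import pandas as pd

# Invented 0/1 target: three positives out of seven rows.
toy_target = pd.Series([1, 0, 0, 1, 0, 1, 0], name="target")

# For a 0/1 column the mean is the fraction of rows equal to 1.
positive_share = toy_target.mean()
print(positive_share)

# The same fraction, read from the normalised class counts.
class_shares = toy_target.value_counts(normalize=True)
print(class_shares)
```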

In [None]:
test.head()

In [None]:
test.describe()

##### Try a model using just text...

#### 3. Set up X and Y arrays for Test and Train

#### Use this section if trying to evaluate a model

In [None]:
X_train = train['text']

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
# Vectorize the train tweets and convert to a document-term matrix (TF-IDF alternative, not used)
#vect = TfidfVectorizer()
#X_train_dtm = vect.fit_transform(X_train)

In [None]:
vectorizer = CountVectorizer()

In [None]:
vectorizer.fit(X_train)

In [None]:
X_train_vec = vectorizer.transform(X_train)

In [None]:
X_train_vec.shape

In [None]:
X_train_vec

#### 4. Import, Initialise and Fit Model

In [None]:
# target column
target_y = train['target']

In [None]:
target_y.shape

In [None]:
model = LogisticRegression()

##### Split into train and validation set to fit model and evaluate performance

##### Again, this section uses a train/validation split to trial and evaluate a model. The section below trains the model on the whole of the 'train' set and then uses it to make predictions on the test set. These predictions are then used to produce a submission to the Kaggle competition.
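
The two workflows described here can be put side by side on toy data. This is a sketch under assumptions: the tweets are invented, the `toy_*` names are hypothetical, and `stratify=` is added only to keep both classes in the tiny validation set (the notebook's own split does not use it).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Invented toy tweets (1 = real disaster, 0 = not).
toy_texts = [
    "forest fire spreads fast", "earthquake hits the coast",
    "flood warning for the valley", "wildfire smoke over town",
    "typhoon kills dozens", "car crash on the highway",
    "i love this song", "great pizza tonight",
    "my phone is on fire lol", "what a lovely day",
    "this movie was a disaster", "new shoes arrived",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

toy_vec = CountVectorizer().fit(toy_texts)
toy_X = toy_vec.transform(toy_texts)

# Workflow 1: hold out a validation set to estimate F1.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_labels, test_size=0.33, random_state=2018, stratify=toy_labels
)
eval_model = LogisticRegression().fit(X_tr, y_tr)
toy_valid_f1 = f1_score(y_va, eval_model.predict(X_va))
print("validation F1:", toy_valid_f1)

# Workflow 2: refit on every labelled row before predicting the test set.
final_model = LogisticRegression().fit(toy_X, toy_labels)
```

The held-out score is only an estimate for the model from workflow 1; the model from workflow 2 sees more data but has no validation rows left to score it on.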

In [None]:
start_time = time.time()
X_train, X_valid, y_train, y_valid = train_test_split(X_train_vec, target_y, test_size=0.33, random_state=2018)

train_f1 = []
valid_f1 = []

In [None]:
#preds_train = np.zeros((X_train.shape[0], 1))
#preds_valid = np.zeros((X_valid.shape[0], 1))

In [None]:
model.fit(X_train,y_train)

In [None]:
#predict method predicts class labels rather than probability of each class label

preds_train = model.predict(X_train)
preds_valid = model.predict(X_valid)

In [None]:
# no need to round when using model.predict rather than model.predict_proba

#preds_train_int = np.rint(preds_train)
#preds_valid_int = np.rint(preds_valid)
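
To illustrate the comment above: for a binary `LogisticRegression`, `predict` gives the same labels as thresholding the positive-class column of `predict_proba` at 0.5. The sketch below uses random toy features rather than tweet counts, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random toy features; the label depends noisily on the first column.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=50) > 0).astype(int)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Hard class labels straight from the model.
labels = toy_model.predict(toy_X)

# Column 1 of predict_proba is the probability of class 1.
proba_pos = toy_model.predict_proba(toy_X)[:, 1]
labels_from_proba = (proba_pos > 0.5).astype(int)

print(np.array_equal(labels, labels_from_proba))
```

Keeping the probabilities is mainly useful when a threshold other than 0.5 is to be tried.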

In [None]:
train_f1_class = f1_score(y_train,preds_train)
valid_f1_class = f1_score(y_valid,preds_valid)

In [None]:
y_train

In [None]:
preds_train

In [None]:
train_f1.append(train_f1_class)
valid_f1.append(valid_f1_class)
print('mean F1: Train dataset', np.mean(train_f1))
print('mean F1: Validation dataset', np.mean(valid_f1))

In [None]:
print('Class:= Real or Not')
print('Train f1:', train_f1_class)
print('Valid f1:', valid_f1_class)
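
For reference, F1 is the harmonic mean of precision and recall for the positive class. The following sketch recomputes it by hand on six invented labels and compares the result with `f1_score`; the `toy_*` names are hypothetical.

```python
from sklearn.metrics import f1_score

# Invented labels and predictions.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(toy_true, toy_pred))
```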

In [None]:
end_time=time.time()
print("total time for model",end_time-start_time)

#### 5. Using the test set - making predictions

In [5]:
test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [6]:
test.shape

(3263, 4)

In [7]:
X_test = test['text']

In [8]:
X_test.shape

(3263,)

In [9]:
X_train = train['text']

In [12]:
X_train.shape

(7613,)

In [10]:
train_y = train['target']

In [16]:
vectorizer = CountVectorizer()

In [13]:
# combine X_train and X_test for the vectorizer
# (pd.concat replaces Series.append, which was removed in pandas 2.0)

X_all = pd.concat([X_train, X_test])

In [14]:
X_all.shape

(10876,)
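
What fitting on the combined text changes is the vocabulary, and with it the number of columns in the document-term matrix. A small sketch on invented tweets (the `toy_*` and `vec_*` names are hypothetical) compares the two fits and checks what the extra columns contain for the train rows:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Invented tweets: "flood" and "valley" occur only in the test text.
toy_train = pd.Series(["fire in the forest", "lovely sunny day"])
toy_test = pd.Series(["flood in the valley"])

vec_train_only = CountVectorizer().fit(toy_train)
toy_all = pd.concat([toy_train, toy_test])
vec_all = CountVectorizer().fit(toy_all)

# Words that enter the vocabulary only through the test text.
extra_words = sorted(set(vec_all.vocabulary_) - set(vec_train_only.vocabulary_))
print(len(vec_train_only.vocabulary_), len(vec_all.vocabulary_), extra_words)

# Counts for those words in the train rows.
train_counts = vec_all.transform(toy_train)
extra_cols = [vec_all.vocabulary_[w] for w in extra_words]
extra_total = train_counts[:, extra_cols].sum()
print(extra_total)
```

In this toy case the columns added by the test text are all zero for the train rows, which is worth keeping in mind when comparing scores between the two fitting strategies.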

In [17]:
vectorizer.fit(X_all)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [18]:
train_vec = vectorizer.transform(X_train)

In [19]:
train_vec

<7613x27922 sparse matrix of type '<class 'numpy.int64'>'
	with 111497 stored elements in Compressed Sparse Row format>

In [20]:
test_vec = vectorizer.transform(X_test)

In [21]:
test_vec

<3263x27922 sparse matrix of type '<class 'numpy.int64'>'
	with 48133 stored elements in Compressed Sparse Row format>

In [22]:
model = LogisticRegression()

In [23]:
model.fit(train_vec, train_y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
preds_test = model.predict(test_vec)

In [25]:
preds_test.shape

(3263,)

In [26]:
preds_test

array([1, 1, 1, ..., 1, 1, 0], dtype=int64)

In [27]:
df = pd.DataFrame(preds_test)

In [28]:
df.columns = ['target']

In [29]:
df.head()

| | target |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [36]:
df_sub = pd.read_csv('sample_submission.csv') 

In [37]:
df_sub.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |


In [38]:
df_sub.loc[ : , 'target'] = df['target']

In [39]:
df_sub.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 0 |
| 4 | 11 | 1 |


In [40]:
df_sub.to_csv('my_sub.csv', encoding='utf-8', index=False)
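
An alternative way to assemble the submission, sketched here on toy data, is to pair the predictions with the `id` column of the test frame directly, so the result does not depend on the sample file's row order matching the test set. The `toy_*` names and the placeholder text values are assumptions for illustration; the ids and predictions mirror the first rows shown above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test frame and the model's predictions.
toy_test = pd.DataFrame(
    {"id": [0, 2, 3, 9, 11], "text": ["a", "b", "c", "d", "e"]}
)
toy_preds = np.array([1, 1, 1, 0, 1])

# Pair each prediction with the id of the row it was made for.
toy_sub = pd.DataFrame({"id": toy_test["id"].to_numpy(), "target": toy_preds})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy_sub.to_csv(index=False)
print(csv_text)
```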