# Real or Not

* **OBJECTIVE** (Quoted from Kaggle): 
    - Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
    - But, it’s not always clear whether a person’s words are actually announcing a disaster.
    - In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which one’s aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. 
    

* **DATA** Columns: 
    - **id** - a unique identifier for each tweet
    - **text** - the text of the tweet
    - **location** - the location the tweet was sent from (may be blank)
    - **keyword** - a particular keyword from the tweet (may be blank)
    - **target** - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
    
    
    
* **My Approach**: 
    - I used a target encoder to convert each 'keyword' into a numeric feature: the estimated probability that a tweet with that keyword is about a real disaster. 
    - I dropped 'location' because the column is messy, though I may include it later. 
    - Counts of hashtags, mentions, text length, etc. showed low correlation with the target in the training data, but some of these numerical features were still added to the dataset. 
    - The 'text' feature was tokenized with TweetTokenizer and lemmatized using the nltk package. 
    - Hashtags were extracted into an additional feature column. 
    - Finally, the cleaned 'text' was vectorized with TF-IDF and the 'hashtags' with a separate count vectorizer. Together with 'keyword_target' and the additional numerical features, the dataset was run through hyperopt to identify the model and parameters that give the highest F1 score. 


* **Result**: 
    - An F1 score of 85% was obtained on the training data. 
    - An F1 score of 78% was obtained on the test data after submission to Kaggle. 
    - Further analysis is underway. 

# Imports & read in data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import string
import re
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize, TweetTokenizer
from sklearn.metrics import classification_report, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.feature_extraction.text import CountVectorizer
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

In [2]:
np.random.seed(500)
%cd ~

C:\Users\Orpita Das


In [3]:
train = pd.read_csv("../input/realornot/train.csv")
test = pd.read_csv("../input/realornot/test.csv")
sub = pd.read_csv("../input/realornot/sample_submission.csv")

# Explore Data

## Getting a sense of data

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.shape, test.shape, sub.shape

In [None]:
sum(train.duplicated()), sum(test.duplicated())

## Target class distribution

In [None]:
sns.countplot(x=train.target)  # pass the data by keyword; recent seaborn versions reject positional data

## Focussing on individual columns

### keywords

In [None]:
# Find blank rows 
print(train.isnull().any()) 
print(test.isnull().any())

In [None]:
train.keyword.unique()

In [None]:
# Keywords that appear in only one of the two datasets (order-independent comparison)
set(train.keyword.dropna()) ^ set(test.keyword.dropna())

In [None]:
kw_d = train[train.target==1].keyword.value_counts().head(10)
kw_nd = train[train.target==0].keyword.value_counts().head(10)

plt.figure(figsize=(13,5))
plt.subplot(121)
sns.barplot(x=kw_d.values, y=kw_d.index, color='c')  # counts on x, keywords on y
plt.title('Top keywords for disaster tweets')
plt.subplot(122)
sns.barplot(x=kw_nd.values, y=kw_nd.index, color='y')
plt.title('Top keywords for non-disaster tweets')
plt.show()

In [4]:
# Target encoding
encoder = ce.TargetEncoder(cols=['keyword'])
encoder.fit(train['keyword'],train['target'])

train = train.join(encoder.transform(train['keyword']).add_suffix('_target'))
test = test.join(encoder.transform(test['keyword']).add_suffix('_target'))
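
As a rough illustration of what target encoding does, the sketch below replaces each keyword with a mean of the target for that keyword, shrunk toward the global mean. The helper `smoothed_target_mean`, its `weight` parameter, and the toy data are assumptions made for illustration; `category_encoders.TargetEncoder` applies its own smoothing scheme, so its numbers will differ.

```python
from collections import defaultdict

def smoothed_target_mean(categories, targets, weight=2.0):
    """Map each category to the mean of its targets, shrunk toward the global mean.

    Hypothetical helper: `weight` acts like a number of pseudo-observations
    carrying the global mean, so rare categories are pulled toward it.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + weight * global_mean) / (counts[cat] + weight)
        for cat in counts
    }

# Toy data (made up): keyword of each tweet and whether it was a real disaster
keywords = ["fire", "fire", "fire", "party", "party", "flood"]
targets = [1, 1, 0, 0, 0, 1]

encoding = smoothed_target_mean(keywords, targets)
for keyword in sorted(encoding):
    print(keyword, round(encoding[keyword], 3))
```

Because the encoding is learned from the labels, it should be fitted on the training data only and then applied to the test data, as the cell above does.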

### location

In [None]:
len(train.location.unique()), train.location.unique()

In [None]:
# Number of locations that appear in only one of the two datasets
len(set(train.location.dropna()) ^ set(test.location.dropna()))

### text : elements & correlations

In [5]:
def clean_text(text):
    text = re.sub(r'https?://\S+', '', text) # Remove link
    text = re.sub(r'\n',' ', text) # Remove line breaks
    text = re.sub(r'\s+', ' ', text).strip() # Remove leading, trailing, and extra spaces
    return text

def find_hashtags(tweet):
    return " ".join([match.group(0)[1:] for match in re.finditer(r"#\w+", tweet)]) or 'no'

def find_mentions(tweet):
    return " ".join([match.group(0)[1:] for match in re.finditer(r"@\w+", tweet)]) or 'no'

def find_links(tweet):
    return " ".join([match.group(0)[:] for match in re.finditer(r"https?://\S+", tweet)]) or 'no'

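def find_emojis(tweet):
    # One character class covering common emoji and symbol ranges
    emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
    # Keep the whole match: each match is a single character, so slicing it would leave nothing
    return " ".join([match.group(0) for match in re.finditer(emoji, tweet)]) or 'no'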

def process_text(df):
    df['text_clean'] = df['text'].apply(lambda x: clean_text(x))
    df['hashtags'] = df['text'].apply(lambda x: find_hashtags(x))
    df['mentions'] = df['text'].apply(lambda x: find_mentions(x))
    df['links'] = df['text'].apply(lambda x: find_links(x))
    df['emojis'] = df['text'].apply(lambda x: find_emojis(x))
    # df['hashtags'].fillna(value='no', inplace=True)
    # df['mentions'].fillna(value='no', inplace=True)
    return df

In [6]:
train = process_text(train)
test = process_text(test)
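
To make the extraction step concrete, here is a small self-contained check of the same regular expressions on a single made-up tweet. The `extract` helper and the example text are assumptions for illustration and are not part of the notebook's pipeline.

```python
import re

def extract(pattern, tweet):
    """Return all matches of `pattern` in the tweet, without a leading '#' or '@'."""
    return [m.group(0).lstrip("#@") for m in re.finditer(pattern, tweet)]

# Made-up example tweet
tweet = "Forest #fire near La Ronge @CBCNews http://t.co/abc #wildfire"

hashtags = extract(r"#\w+", tweet)
mentions = extract(r"@\w+", tweet)
links = re.findall(r"https?://\S+", tweet)

# Same cleaning idea as clean_text: drop links, then collapse whitespace
without_links = re.sub(r"\s+", " ", re.sub(r"https?://\S+", "", tweet)).strip()

print(hashtags, mentions, links)
print(without_links)
```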

In [7]:
def create_stat(df):
    stop_words = set(stopwords.words('english'))  # build once; set lookups are fast

    def count_found(value):
        # The find_* helpers return 'no' when nothing matched, which must count as 0
        return 0 if value == 'no' else len(str(value).split())

    # Tweet length
    df['text_len'] = df['text_clean'].apply(len)
    # Word count
    df['word_count'] = df["text_clean"].apply(lambda x: len(str(x).split()))
    # Stopword count
    df['stop_word_count'] = df['text_clean'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))
    # Punctuation count
    df['punctuation_count'] = df['text_clean'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
    # Count of hashtags (#)
    df['hashtag_count'] = df['hashtags'].apply(count_found)
    # Count of mentions (@)
    df['mention_count'] = df['mentions'].apply(count_found)
    # Count of links
    df['link_count'] = df['links'].apply(count_found)
    # Count of uppercase letters
    df['caps_count'] = df['text_clean'].apply(lambda x: sum(1 for c in str(x) if c.isupper()))
    # Ratio of uppercase letters
    df['caps_ratio'] = df['caps_count'] / df['text_len']
    return df

train = create_stat(train)
test = create_stat(test)

In [None]:
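# Restrict to numeric columns; the text and list columns cannot be correlated
train.select_dtypes('number').corr()['target'].drop('target').sort_values()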

* **These text statistics show little correlation with the target, so they could be dropped.**
* **The information in the *mentions* is likely irrelevant, but the information in the *hashtags* may be useful.**

In [8]:
# Any lowercase
#print(sum(train.keyword.str.islower()), sum(train.text.str.islower()))

# Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
train['text_lower']=[doc.lower() for doc in train.text]
test['text_lower']=[doc.lower() for doc in test.text]

In [10]:
# Tokenization: break each entry in the corpus into a list of words
tknzr=TweetTokenizer(strip_handles=True, reduce_len=True)
train['text_token']=[tknzr.tokenize(doc) for doc in train.text_lower]
test['text_token']=[tknzr.tokenize(doc) for doc in test.text_lower]

In [11]:
# Remove stop words and non-alphabetic tokens, then lemmatize.
# WordNetLemmatizer needs a POS tag to know whether a word is a noun, verb, adjective, etc.
# By default it treats every word as a noun.

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once instead of once per token

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wn.ADJ
    elif nltk_tag.startswith('V'):
        return wn.VERB
    elif nltk_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN  # nouns and everything else

def lemmatize_tokens(tokens):
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    lemmas = [lemmatizer.lemmatize(word, nltk_tag_to_wordnet_tag(tag)) for word, tag in pos_tag(tokens)]
    return str(lemmas)  # stored as a string, as before, so TfidfVectorizer can consume it

# Assigning whole columns via .apply avoids the chained-indexing SettingWithCopyWarning
train['text_final'] = train['text_token'].apply(lemmatize_tokens)
test['text_final'] = test['text_token'].apply(lemmatize_tokens)

In [12]:
# Vectorize columns
Tfidf_vect = TfidfVectorizer(max_features=5000, min_df = 10, ngram_range = (1,2))

In [13]:
text_vec = Tfidf_vect.fit_transform(train['text_final'])
text_vec_test = Tfidf_vect.transform(test['text_final'])
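# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
X_train_text = pd.DataFrame(text_vec.toarray(), columns=Tfidf_vect.get_feature_names_out())
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=Tfidf_vect.get_feature_names_out())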

train = train.join(X_train_text, rsuffix='_text')
test = test.join(X_test_text, rsuffix='_text')
print(Tfidf_vect.vocabulary_)
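
For intuition about the numbers the vectorizer produces, the following pure-Python sketch computes TF-IDF weights for three made-up documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of scikit-learn's documented defaults; it handles unigrams only and ignores `min_df`, `max_features`, and bigrams. The function name `tfidf` and the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (whitespace tokens, unigrams only)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up documents, already cleaned and lemmatized
docs = ["fire forest fire", "forest evacuation", "love summer"]
vocab, rows = tfidf(docs)

print(vocab)
print([round(v, 3) for v in rows[0]])
```

In the first document, "fire" receives a larger weight than "forest" because it occurs twice there and appears in fewer documents overall.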






In [None]:
train.head()

### Hashtags

In [14]:
vec_hash = CountVectorizer(min_df = 5)
hash_vec = vec_hash.fit_transform(train['hashtags'])
hash_vec_test = vec_hash.transform(test['hashtags'])
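# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
X_train_hash = pd.DataFrame(hash_vec.toarray(), columns=vec_hash.get_feature_names_out())
X_test_hash = pd.DataFrame(hash_vec_test.toarray(), columns=vec_hash.get_feature_names_out())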
print (X_train_hash.shape)

train = train.join(X_train_hash, rsuffix='_hashtag')
test = test.join(X_test_hash, rsuffix='_hashtag')

(7613, 107)


In [None]:
# Extract hashtags (without the leading '#') into a new column of lists
train['text_hash'] = train.text_lower.str.findall(r'#(\w+)')
test['text_hash'] = test.text_lower.str.findall(r'#(\w+)')

# Disabled: convert the hashtag lists to strings
# train['text_hash'] = train['text_hash'].astype(str)
# test['text_hash'] = test['text_hash'].astype(str)

In [None]:
# Re-append processed hashtags to the cleaned text
hashtag_stop_words = set(stopwords.words('english'))

def clean_hashtags(tags):
    # Keep alphabetic, non-stopword hashtags and lemmatize them
    tags = [t for t in tags if t.isalpha() and t not in hashtag_stop_words]
    return [lemmatizer.lemmatize(word, nltk_tag_to_wordnet_tag(tag)) for word, tag in pos_tag(tags)]

for df in (train, test):
    df['text_hash'] = df['text_hash'].apply(clean_hashtags)
    # text_final is already a string, so append the hashtags as extra space-separated tokens
    df['text_final'] = df['text_final'] + ' ' + df['text_hash'].apply(' '.join)

# Trying out other classification models

* Using hyperopt

In [20]:
features_to_drop = ['id', 'keyword','location','text', 'target','text_clean', 'hashtags', 
                    'mentions','links', 'emojis', 'text_lower', 'text_token', 'text_final']
# 'target_text' is presumably the TF-IDF column for the word "target", renamed by rsuffix when joined onto train
X_train = train.drop(columns=features_to_drop+['target_text'])
X_test = test.drop(columns=features_to_drop)
y_train = train.target

In [21]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)
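
The scaling step standardises each feature to zero mean and unit variance using statistics learned from the training data, and the same statistics are later reused for the test data. The sketch below shows that idea for a single made-up feature column; `fit_scaler` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
import math

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature column."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance) or 1.0  # constant columns are left unscaled
    return mean, std

def transform(column, mean, std):
    """Apply previously learned statistics to any column (train or test)."""
    return [(v - mean) / std for v in column]

# Made-up values for one feature, e.g. tweet length
train_len = [10.0, 20.0, 30.0]
test_len = [20.0, 40.0]

mean, std = fit_scaler(train_len)            # statistics come from train only
train_scaled = transform(train_len, mean, std)
test_scaled = transform(test_len, mean, std)  # test reuses the train statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

Fitting on the training data alone keeps information about the test distribution out of the model, which is why the test set is only transformed.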

In [None]:
X_train[:5]  # X_train is a NumPy array after scaling, so slice it instead of calling .head()

In [None]:
models = {'logistic_regression' : LogisticRegression, 
   'rf' : RandomForestClassifier, 
   'naive_bayes' : BernoulliNB, 'svc' : SVC
}

In [None]:
def search_space(model):  
    model = model.lower()
    space = {}
    if model == 'naive_bayes':
        space = {'alpha': hp.choice('alpha', [0,1])
                }
    elif model == 'svc':
        space = {'C': hp.lognormal('C', 0, 1.0),
                 'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
                 'gamma': hp.uniform('gamma', 0, 20)
                }
    elif model == 'logistic_regression':
        space = {'warm_start' : hp.choice('warm_start', [True, False]), 
                 'fit_intercept' : hp.choice('fit_intercept', [True, False]),
                 'tol' : hp.uniform('tol', 0.00001, 0.0001),
                 'C' : hp.uniform('C', 0.05, 3),
                 'solver' : hp.choice('solver', ['newton-cg', 'lbfgs', 'liblinear']),
                 'max_iter' : hp.choice('max_iter', range(100,1000)),
                 'class_weight' : 'balanced',
                 'n_jobs' : -1
                }
    elif model == 'rf':
        space = {'max_depth': hp.choice('max_depth', range(1,20)),
                 'n_estimators': hp.choice('n_estimators', range(50,300)),
                 #'n_estimators': 150,
                 #'criterion': hp.choice('criterion', ["gini", "entropy"]),
                 'criterion' : 'gini',
                 'min_samples_split': hp.choice("min_samples_split", range(2,40)),
                 'n_jobs' : -1
                }
    space['model'] = model
    return space

In [None]:
def get_acc_status(clf,X,y):
    acc = cross_val_score(clf, X, y, cv=3, scoring='f1').mean()
    #y_pred = clf.fit(X,y).predict(X_test)
    #print(confusion_matrix(y_hold, y_pred))
    #print(classification_report(y_hold, y_pred))
    return {'loss': -acc, 'status': STATUS_OK}

In [None]:
def obj_fnc(params) :
    model = params.get('model').lower()
    del params['model']
    clf = models[model](**params)
    return(get_acc_status(clf,X_train,y_train))
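
One detail worth keeping in mind when reading the results of the searches that follow: for parameters defined with `hp.choice`, `fmin` returns the index of the chosen option rather than the option itself, and hyperopt provides `space_eval` to map the result back to actual values. The pure-Python sketch below only mimics that decoding step; `decode_choices` and the example dictionaries are hypothetical and not part of hyperopt.

```python
def decode_choices(best, choice_options):
    """Replace index-valued entries in `best` with the option each index refers to."""
    decoded = dict(best)
    for name, options in choice_options.items():
        if name in decoded:
            decoded[name] = options[decoded[name]]
    return decoded

# Options as declared in the logistic regression search space
choice_options = {
    "solver": ["newton-cg", "lbfgs", "liblinear"],
    "max_iter": list(range(100, 1000)),
    "warm_start": [True, False],
}

# Example of raw, index-based output (values made up for illustration)
best = {"C": 0.42, "solver": 2, "max_iter": 39, "warm_start": 1}

decoded = decode_choices(best, choice_options)
print(decoded)
```

If a recorded value such as `'max_iter': 39` was raw `fmin` output, the setting actually evaluated would have been 139, so it may be worth decoding the result before hard-coding parameters into a final model.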

## Random Forest 

In [None]:
model= 'rf'
best_params = fmin(obj_fnc, 
                   search_space(model), 
                   algo=tpe.suggest, 
                   max_evals=100)
 
print(best_params)
# with bigrams
#{'max_depth': 18, 'min_samples_split': 6, 'n_estimators': 182}

In [None]:
rf=RandomForestClassifier(criterion='gini', max_depth= 18, min_samples_split = 0.5070744524836673, n_estimators= 157)
y_pred = rf.fit(X_train, y_train).predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

## Logistic Regression

In [None]:
model= 'logistic_regression'
best_params = fmin(obj_fnc, 
                   search_space(model), 
                   algo=tpe.suggest, 
                   max_evals=100)
 
print(best_params)
#first
#{'C': 0.8625782737995314, 'fit_intercept': True, 'max_iter': 318, 'solver': 'lbfgs', 
# 'tol': 3.3702355408712684e-05, 'warm_start': False}
# with bigrams
#{'C': 0.4191297475916065, 'fit_intercept': True, 'max_iter': 39, 'solver': 'liblinear', 
# 'tol': 6.003262415227273e-05, 'warm_start': False}

In [22]:
logistic_regresion=LogisticRegression(C=0.4191297475916065, fit_intercept=True, max_iter = 39, solver='liblinear', 
                                      tol = 6.003262415227273e-05, warm_start=False)
y_pred = logistic_regresion.fit(X_train, y_train).predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

[[3987  355]
 [ 597 2674]]
              precision    recall  f1-score   support

           0       0.87      0.92      0.89      4342
           1       0.88      0.82      0.85      3271

    accuracy                           0.87      7613
   macro avg       0.88      0.87      0.87      7613
weighted avg       0.88      0.87      0.87      7613
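
The class 1 row of the report can be reproduced by hand from the confusion matrix above, assuming scikit-learn's layout of true classes in rows and predicted classes in columns. This is only an arithmetic check; the variable names are mine.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 3987, 355   # true class 0
fn, tp = 597, 2674   # true class 1

precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.88 0.82 0.85
```

These scores are computed on the same data the model was fitted on, so they are an optimistic estimate compared with the score on unseen test data.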



## Naive Bayes

In [None]:
model= 'naive_bayes'
best_params = fmin(obj_fnc, 
                   search_space(model), 
                   algo=tpe.suggest, 
                   max_evals=100)
 
print(best_params)

In [None]:
naive_bayes=BernoulliNB()
y_pred = naive_bayes.fit(X_train, y_train).predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

## SVC

In [None]:
model= 'svc'
best_params = fmin(obj_fnc, 
                   search_space(model), 
                   algo=tpe.suggest, 
                   max_evals=100)
 
print(best_params)
#first
#{'C': 1.5426972107125763, 'gamma': 0.9018573366743587, 'kernel': 'rbf'}
# with bigrams
#

In [None]:
svc=SVC(C = 1.5426972107125763, gamma = 0.9018573366743587, kernel = 'rbf')
y_pred = svc.fit(X_train, y_train).predict(X_train)
print(confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))

# Submission

In [23]:
#scaler = StandardScaler()
X_test = scaler.transform(X_test)

In [None]:
X_train.target_text

In [None]:
for val in X_train.columns:
    if val not in X_test.columns:
        print(val)

In [24]:
logistic_regresion=LogisticRegression(C=0.4191297475916065, fit_intercept=True, max_iter = 39, solver='liblinear', 
                                      tol = 6.003262415227273e-05, warm_start=False)
y_pred = logistic_regresion.fit(X_train, y_train).predict(X_test)

In [26]:
sub['target'] = y_pred
sub.head(10)

   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1
5  12       1
6  21       0
7  22       0
8  27       0
9  29       0


In [27]:
sub.to_csv("submission.csv", index=False, header=True)

# Leak 

In [39]:
test.drop('target', axis=1, inplace=True)

In [40]:
leak = pd.read_csv("../input/realornot/socialmedia-disaster-tweets-DFE.csv", encoding='latin_1')
leak['target'] = (leak['choose_one']=='Relevant').astype(int)
leak['id'] = leak.index
leak = leak[['id', 'target','text']]
merged_df = pd.merge(test, leak, on='id')
sub1 = merged_df[['id', 'target']]
sub1.to_csv('submit_1.csv', index=False)

In [38]:
test.target

0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
3258    0.0
3259    0.0
3260    0.0
3261    0.0
3262    0.0
Name: target, Length: 3263, dtype: float64

In [37]:
sum(merged_df.target_x!=merged_df.target_y)

1408