# Novelty Detection

Steps:

1. preprocess tweets for feature extraction 

2. build a model from historical tweets

3. use the model to predict the nearest neighbor for tweets in an anomalous period; rank tweets and keep the novel one(s)

### 1. Preprocessing

Preprocessing steps for feature extraction from tweets: tokenization, part-of-speech tagging, lemmatization, normalization. 

In [64]:
from IPython.display import display

In [16]:
from nltk.tokenize import TweetTokenizer
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
import re

In [17]:
PUNCTUATION = [':', '.', ')', '(']
PUNCTUATION_AND_URL = PUNCTUATION + ['URL']
LEMMATIZER_POS = ['n', 'v', 'a', 'r']
TWITTER_SYMBOLS = ['#', '@']
URL_PATTERN = re.compile(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?")

In [18]:
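from sklearn.feature_extraction.text import TfidfVectorizer
# LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21, so this needs an older release
from sklearn.neighbors import LSHForest

tweet_tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()
vectorizer = TfidfVectorizer()
lsh_forest = LSHForest()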

In [19]:
def pos_to_lemmatizer_pos(pos_initial):        
    if pos_initial != 'J':
        return pos_initial.lower()
    else:
        return 'a'

In [20]:
def normalize_tagged_token(tagged_token):
    token_str = tagged_token[0]
    token_pos = tagged_token[1]
    if token_str[0] in TWITTER_SYMBOLS:
        return (token_str[1:], token_pos)
    elif URL_PATTERN.match(token_str):
        return (token_str, 'URL')
    else:
        return tagged_token

In [21]:
def lemmatize_tagged_token(tagged_token):
    token_str = tagged_token[0]
    token_pos = tagged_token[1]
    pos_initial = pos_to_lemmatizer_pos(token_pos[0])
    if pos_initial in LEMMATIZER_POS:
        return(wordnet_lemmatizer.lemmatize(token_str, pos=pos_initial), token_pos)
    else:
        return tagged_token

In [22]:
def tweet_to_features(tweet):
    # tokenization
    tokens = tweet_tokenizer.tokenize(tweet)
   
    # pos tagging
    pos_tagged_tokens = pos_tag(tokens)
    
    # lemmatize
    lemmatized_tagged_tokens = [lemmatize_tagged_token(tagged_tok) for tagged_tok in pos_tagged_tokens]
    
    # normalized tokens 
    normalized_tagged_tokens = [normalize_tagged_token(tagged_tok) for tagged_tok in lemmatized_tagged_tokens]
    
    # keep valid tokens: drop tokens tagged as punctuation or as a URL
    valid_tweet_str =  ' '.join([ntoken[0] 
                                 for ntoken in normalized_tagged_tokens 
                                 if ntoken[1] not in PUNCTUATION_AND_URL])
    
    return valid_tweet_str

Example preprocessing result:

In [24]:
tweet_to_features('Next @Yahoo victim? Hearing aid group Sonova buys AudioNova in $953 million deal: S) is buying D... https://t.co/h1x5b05afu @ArchiveTeam')

u'Next Yahoo victim Hearing aid group Sonova buy AudioNova in $ 953 million deal S be buy D ArchiveTeam'
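The normalization and filtering steps can be illustrated without NLTK. The sketch below is a simplified stand-in, not the notebook's pipeline: it splits on whitespace instead of using `TweetTokenizer`, skips tagging and lemmatization, and uses a simplified URL test. `simple_normalize`, `SIMPLE_URL`, and `PUNCT_ONLY` are hypothetical names introduced only for this example.

```python
import re

TWITTER_SYMBOLS = ('#', '@')
# Simplified URL test for illustration only; this is not the notebook's URL_PATTERN.
SIMPLE_URL = re.compile(r'^(https?://)?[\da-z.-]+\.[a-z]{2,6}(/\S*)?$', re.IGNORECASE)
# A token made up entirely of non-word characters, e.g. ':' or '...'
PUNCT_ONLY = re.compile(r'^\W+$')


def simple_normalize(tweet):
    """Hypothetical helper: drop URLs and bare punctuation, strip a leading '#' or '@'."""
    kept = []
    for token in tweet.split():
        if SIMPLE_URL.match(token) or PUNCT_ONLY.match(token):
            continue
        if token[0] in TWITTER_SYMBOLS:
            token = token[1:]
        kept.append(token)
    return ' '.join(kept)


print(simple_normalize('RT @vwpress_en The #VWGolfGTI is here : https://t.co/abc123'))
```

Because hashtags and mentions lose only their leading symbol, `#VWGolfGTI` and `VWGolfGTI` end up as the same term, which is presumably the intent of the normalization step in the notebook.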

### 2. Dataset and model building

Given a dataset of tweets spanning a certain time period, use the tweets posted before a cutoff timestamp T as the historical corpus. The model built from this corpus is then used to detect novel tweets posted after T.

In [71]:
import pandas as pd
from datetime import datetime, timedelta

In [126]:
df_input = pd.read_csv('pydata_vw_tweets.csv')
display(df_input.head())
display(df_input.tail())
display(df_input.describe())

Unnamed: 0,created_at,text
0,2016-05-04T14:59:43.000Z,RT @vwpress_en: The #VWGolfGTI Clubsport S bre...
1,2016-05-04T14:57:28.000Z,RT @vwpress_en: The #VWGolfGTI Clubsport S bre...
2,2016-05-04T14:55:37.000Z,The Volkswagen Parts Sale continues. \nDon't m...
3,2016-05-04T14:55:23.000Z,RT @vwpress_en: The #VWGolfGTI Clubsport S bre...
4,2016-05-04T14:54:40.000Z,RT @vwpress_en: The #VWGolfGTI Clubsport S bre...


Unnamed: 0,created_at,text
195020,2014-06-01T06:37:09.000Z,A few improvements prior to #bristolvolksfest ...
195021,2014-06-01T06:31:47.000Z,VW to accelerate US model rollout - Filed unde...
195022,2014-06-01T03:19:06.000Z,#Volkswagen Group to invest 100 million euros ...
195023,2014-06-01T01:27:42.000Z,VW to accelerate US model rollout: Filed under...
195024,2014-06-01T01:08:41.000Z,The end. Once more time thank you so much @veu...


Unnamed: 0,created_at,text
count,195025,195025
unique,164939,137802
top,2015-11-14T19:44:03.000Z,RT @business: BREAKING: #Volkswagen's shares f...
freq,83,993


Assume the training dataset includes all tweets from 1 June 2014 up to, but not including, 1 September 2015.

In [127]:
df_input['created_at'] = pd.to_datetime(df_input['created_at'])
df_train = df_input.loc[df_input['created_at'] < datetime(2015, 9, 1)].copy()
display(df_train.head())
display(df_train.tail())
display(df_train.describe())

Unnamed: 0,created_at,text
168025,2015-08-31 23:40:22,RT @India_Business: #india #business : Suzuki ...
168026,2015-08-31 22:05:37,"Suzuki, to the appropriation to the policy rep..."
168027,2015-08-31 21:59:33,RT @iMiaSanMia: Volkswagen boss Martin Winterk...
168028,2015-08-31 21:54:30,RT @iMiaSanMia: Volkswagen boss Martin Winterk...
168029,2015-08-31 21:20:40,IndustryWeek: Suzuki set to buy back shares as...


Unnamed: 0,created_at,text
195020,2014-06-01 06:37:09,A few improvements prior to #bristolvolksfest ...
195021,2014-06-01 06:31:47,VW to accelerate US model rollout - Filed unde...
195022,2014-06-01 03:19:06,#Volkswagen Group to invest 100 million euros ...
195023,2014-06-01 01:27:42,VW to accelerate US model rollout: Filed under...
195024,2014-06-01 01:08:41,The end. Once more time thank you so much @veu...


Unnamed: 0,created_at,text
count,27000,27000
unique,24025,22328
top,2015-04-16 21:46:36,"RT @ubercool: Check out that cool, full-screen..."
freq,63,252
first,2014-06-01 01:08:41,
last,2015-08-31 23:40:22,
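The cutoff used above is strict: a tweet stamped exactly at midnight on 1 September 2015 would not be part of the training data. A minimal sketch of the same split with plain `datetime` objects follows; the records are made up for illustration and `CUTOFF` is a hypothetical name.

```python
from datetime import datetime

CUTOFF = datetime(2015, 9, 1)  # same cutoff date as in the notebook

# Made-up (timestamp, text) records standing in for rows of the tweet dataset.
records = [
    (datetime(2015, 8, 31, 23, 40, 22), 'last historical tweet'),
    (datetime(2015, 9, 1, 0, 0, 0), 'tweet exactly at the cutoff'),
    (datetime(2015, 9, 18, 16, 5, 0), 'tweet in an anomalous period'),
]

# Strictly-less-than keeps the cutoff instant itself out of the training set.
train = [text for ts, text in records if ts < CUTOFF]
later = [text for ts, text in records if ts >= CUTOFF]

print(train)
print(later)
```

Keeping the two sets disjoint matters here because a tweet that is also in the historical corpus would have a nearest-neighbor distance of zero and could never be ranked as novel.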


Extract the features used to build the model from each tweet.

In [128]:
df_train['features'] = df_train['text'].apply(tweet_to_features)
display(df_train.head())

Unnamed: 0,created_at,text,features
168025,2015-08-31 23:40:22,RT @India_Business: #india #business : Suzuki ...,RT India_Business india business Suzuki start ...
168026,2015-08-31 22:05:37,"Suzuki, to the appropriation to the policy rep...","Suzuki , to the appropriation to the policy re..."
168027,2015-08-31 21:59:33,RT @iMiaSanMia: Volkswagen boss Martin Winterk...,RT iMiaSanMia Volkswagen boss Martin Winterkor...
168028,2015-08-31 21:54:30,RT @iMiaSanMia: Volkswagen boss Martin Winterk...,RT iMiaSanMia Volkswagen boss Martin Winterkor...
168029,2015-08-31 21:20:40,IndustryWeek: Suzuki set to buy back shares as...,IndustryWeek Suzuki set to buy back share as V...


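Drop rows whose extracted features exactly duplicate those of another row, keeping the first occurrence of each feature string.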

In [130]:
df_train = df_train.drop_duplicates(subset='features')
display(df_train.describe())

Unnamed: 0,created_at,text,features
count,13764,13764,13764
unique,13580,13764,13764
top,2014-11-20 21:39:15,VW: Huge funding in the future in China: Germa...,Volkswagen bet big on emerge market like India...
freq,4,1,1
first,2014-06-01 01:08:41,,
last,2015-08-31 23:40:22,,
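Deduplicating on the `features` column rather than on the raw text is what reduces the 27000 training rows to 13764: tweets that differ only in stripped material, such as the trailing link, collapse to the same feature string. A minimal sketch of that keep-first behavior, using made-up rows and a hypothetical helper `drop_duplicate_features`:

```python
def drop_duplicate_features(rows):
    """Hypothetical helper: keep the first (text, features) pair for each distinct feature string."""
    seen = set()
    unique_rows = []
    for text, features in rows:
        if features in seen:
            continue  # an earlier row already produced these features
        seen.add(features)
        unique_rows.append((text, features))
    return unique_rows


# Made-up rows: the first two differ only in their URL, which preprocessing removes.
rows = [
    ('VW to accelerate US model rollout http://t.co/aaa', 'VW to accelerate US model rollout'),
    ('VW to accelerate US model rollout http://t.co/bbb', 'VW to accelerate US model rollout'),
    ('Volkswagen Group to invest in China', 'Volkswagen Group to invest in China'),
]

for text, _ in drop_duplicate_features(rows):
    print(text)
```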


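Build the LSHForest model: fit a TF-IDF vectorizer on the training features, then index the resulting vectors for approximate nearest-neighbor search.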

In [123]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

In [131]:
feature_vectors = vectorizer.fit_transform(df_train['features'])
lsh_forest.fit(feature_vectors)

LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10, n_neighbors=5,
     radius=1.0, radius_cutoff_ratio=0.9, random_state=None)
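To make clear what the fitted model is queried for, the sketch below computes an exact nearest neighbor by cosine distance over TF-IDF vectors in plain Python. It is an illustration under simplifying assumptions, not a replacement for the cells above: it tokenizes on whitespace, does no lowercasing, and searches by brute force, whereas `LSHForest` approximates the same query with locality-sensitive hashing. All function names are hypothetical. On scikit-learn releases that no longer ship `LSHForest`, `sklearn.neighbors.NearestNeighbors` with `metric='cosine'` is one exact alternative.

```python
import math
from collections import Counter


def vectorize(doc, idf):
    """L2-normalized TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in doc.split() if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(corpus):
    """Return (idf, vectors) for a list of whitespace-tokenized documents."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc.split()))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return idf, [vectorize(doc, idf) for doc in corpus]


def cosine_distance(a, b):
    """1 - cosine similarity; both inputs are already L2-normalized."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def nearest(query, idf, vectors):
    """Index of, and distance to, the closest historical document (brute force)."""
    q = vectorize(query, idf)
    distances = [cosine_distance(q, v) for v in vectors]
    best = min(range(len(vectors)), key=distances.__getitem__)
    return best, distances[best]


# Made-up historical corpus of already preprocessed tweets.
history = [
    'vw to accelerate us model rollout',
    'volkswagen group to invest in china',
    'winterkorn contract extended',
]
idf, vectors = build_tfidf(history)

print(nearest('vw accelerate us rollout', idf, vectors))
print(nearest('epa accuses volkswagen emissions cheating', idf, vectors))
```

The second query shares only one term with the corpus, so its nearest-neighbor distance is larger than that of the first. That gap is the signal the novelty ranking relies on.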

### 3. Make predictions

Read the anomalous periods produced by the anomaly detection algorithm. Each anomalous period starts at `period_start` and ends one hour later.

In [135]:
import numpy as np

In [99]:
df_anomalies = pd.read_csv('anomalies_vw.csv')
df_anomalies = df_anomalies.drop('Unnamed: 0', axis=1)
df_anomalies['period_start'] = pd.to_datetime(df_anomalies['period_start'])
display(df_anomalies.head())

Unnamed: 0,period_start
0,2015-09-02 08:00:00
1,2015-09-02 11:00:00
2,2015-09-02 12:00:00
3,2015-09-03 12:00:00
4,2015-09-03 16:00:00


In [105]:
def tweets_in_period(period_start):
    mask = (df_input['created_at'] >= period_start) & (df_input['created_at'] <= period_start + timedelta(hours=1))
    tweets = df_input.loc[mask]['text'].tolist()
    return {'tweets': tweets}
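Note that the window test in `tweets_in_period` is closed at both ends. Since the anomaly list contains adjacent periods (for example 11:00 and 12:00 on 2015-09-02), a tweet stamped exactly on the hour is counted in both. The sketch below contrasts this with a half-open window. Both helper names are hypothetical, and whether the double counting matters for this dataset is not something the notebook establishes.

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def in_period_closed(ts, period_start, length=ONE_HOUR):
    """Closed window [start, start + length], the test used in tweets_in_period."""
    return period_start <= ts <= period_start + length


def in_period_half_open(ts, period_start, length=ONE_HOUR):
    """Half-open window [start, start + length): a boundary tweet belongs to one period only."""
    return period_start <= ts < period_start + length


# Two adjacent period starts taken from the anomaly list, and a tweet exactly on the boundary.
starts = [datetime(2015, 9, 2, 11), datetime(2015, 9, 2, 12)]
boundary_tweet = datetime(2015, 9, 2, 12, 0, 0)

closed_hits = [s for s in starts if in_period_closed(boundary_tweet, s)]
half_open_hits = [s for s in starts if in_period_half_open(boundary_tweet, s)]

print(len(closed_hits), len(half_open_hits))
```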

In [162]:
tw = tweets_in_period(datetime(2015, 9, 18, 16, 0, 0))
tw

{'tweets': ['#News #Stocks U.S. accuses Volkswagen of clean air violations  http://t.co/s51tPet1Ec',
  '@volkskrant  @vwgroup_en  DONT HAVE A CAR VOLKSWIFI   DEVICE OR CAR  VEHICLE WATERVAPORIZES RECYCLES MYBE BANSELLINGBOTTLEDWATERS',
  'EPA: Volkswagen cheated on emissions standards http://t.co/oSfYIb0Xjc #Stocks',
  '#SA #markets Volkswagen takes the fall as EPA flexes clean air muscles:  http://t.co/ISJSi9E22t http://t.co/j3ZcgctIzR',
  '\xe2\x80\x9cBy 2020, we will have transformed all of our new cars into smartphones on wheels\xe2\x80\x9d -Martin Winterkorn, @Volkswagen http://t.co/9XtQMYDbDG',
  'Our wonderful Volkswagen sales staff shares their favorite things about #VW. What do you love about your #VeeDub?... http://t.co/zootQBuxBR',
  "RT @AliceMartin8: VW Does Some Party Planning With Adam Scott, Michael Pena, and McLovin: The carmaker's App Connect feature mak... http://\xe2\x80\xa6",
  'Volkswagen Is Ordered to Recall Nearly 500000 Vehicles Over Emissions Software - New Yo

In [109]:
# tweets_in_period returns a dict; unpack it so the column holds plain lists
df_anomalies['tweets'] = df_anomalies['period_start'].apply(lambda start: tweets_in_period(start)['tweets'])
df_anomalies['len_tweets'] = df_anomalies['tweets'].apply(len)
display(df_anomalies.head())

Unnamed: 0,period_start,tweets,len_tweets
0,2015-09-02 08:00:00,[UPDATE 1-VW committee proposes extending CEO ...,115
1,2015-09-02 11:00:00,[VW's CEO Martin Winterkorn set to stay on for...,72
2,2015-09-02 12:00:00,[GERMANY: VW extending Winterkorn's contract: ...,124
3,2015-09-03 12:00:00,[#News #Stocks Volkswagen CFO Poetsch to becom...,23
4,2015-09-03 16:00:00,[Love that signature shape. #ClassicCars #VWBe...,21


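Identify novel tweets: for each tweet in a period, find the distance to its nearest neighbor among the historical tweets, divide that distance by the tweet's character length, and keep the three tweets with the largest normalized distance.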

In [218]:
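def novel_tweets(tweets):
    # apply the same preprocessing as for the training data
    features = [tweet_to_features(tweet) for tweet in tweets]
    # character length of each raw tweet, used to normalize the distances
    len_tweets = np.array([len(t) for t in tweets])

    # distance from each tweet to its nearest neighbor among the historical tweets
    vectors = vectorizer.transform(features)
    distances, indices = lsh_forest.kneighbors(vectors, n_neighbors=1)

    # flatten the (n_tweets, 1) result and normalize by tweet length
    distances = distances.ravel() / len_tweets

    # argsort is ascending, so the last three entries are the most novel tweets
    novelty_result = np.argsort(distances)
    best_novelty_result = novelty_result[-3:].tolist()

    result_tweets = [(tweets[r], distances[r]) for r in best_novelty_result]

    return {'novel_tweets': result_tweets}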

In [219]:
novel_tweets(tw.get('tweets'))

{'novel_tweets': [('The EPA accuses Volkswagen of Clean Air Act violations http://t.co/5MoTlWHxOH $VLKAY',
   0.010224082412261139),
  ('VW AG Shares Down 5.4%, News Doesn\xe2\x80\x99t Get Better For The German Auto Company',
   0.010506235383665526),
  ("Read this old Ward's 10 Best Engine piece on VW 2.0L diesel http://t.co/4J7cubLDjU",
   0.010940330250019606)]}
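The ranking step can be reproduced in isolation to see how the length normalization affects the result. The sketch below mirrors the logic of `novel_tweets` without NumPy or the fitted model; `top_k_novel` is a hypothetical helper and the distances are made up.

```python
def top_k_novel(tweets, distances, k=3):
    """Hypothetical helper: rank by distance / character length and keep the k largest scores."""
    scores = [d / len(t) for t, d in zip(tweets, distances)]
    # Ascending order of score, like np.argsort in the notebook
    order = sorted(range(len(tweets)), key=scores.__getitem__)
    return [(tweets[i], scores[i]) for i in order[-k:]]


# Made-up tweets and nearest-neighbor distances.
tweets = ['aaaa', 'bb', 'cccccccc', 'dddd']
distances = [0.8, 0.9, 0.8, 0.2]

for tweet, score in top_k_novel(tweets, distances, k=2):
    print(tweet, score)
```

In this toy input `'aaaa'` and `'cccccccc'` have the same raw distance, but only the shorter one makes the top two. Dividing by character length therefore favors short tweets, and a zero-length tweet would cause a division by zero. As in the notebook, the result is in ascending order, so the most novel tweet comes last.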

In [220]:
df_anomalies['novel_tweets'] = df_anomalies['tweets'].apply(lambda tweets: novel_tweets(tweets)['novel_tweets'])
display(df_anomalies['novel_tweets'].head())

0    [(Volkswagen will likely extend CEO Winterkorn...
1    [(UPDATE 2-VW's CEO Martin Winterkorn set to s...
2    [(Martin Winterkorn’s mandate at VW extended t...
3    [(#News #Stocks VW CFO Poetsch to be proposed ...
4    [(FreedomWonInc. Volkswagen finance chief Poet...
Name: novel_tweets, dtype: object

In [221]:
display(df_anomalies.head(15))

Unnamed: 0,period_start,tweets,len_tweets,novel_tweets
0,2015-09-02 08:00:00,[UPDATE 1-VW committee proposes extending CEO ...,115,[(Volkswagen will likely extend CEO Winterkorn...
1,2015-09-02 11:00:00,[VW's CEO Martin Winterkorn set to stay on for...,72,[(UPDATE 2-VW's CEO Martin Winterkorn set to s...
2,2015-09-02 12:00:00,[GERMANY: VW extending Winterkorn's contract: ...,124,[(Martin Winterkorn’s mandate at VW extended t...
3,2015-09-03 12:00:00,[#News #Stocks Volkswagen CFO Poetsch to becom...,23,[(#News #Stocks VW CFO Poetsch to be proposed ...
4,2015-09-03 16:00:00,[Love that signature shape. #ClassicCars #VWBe...,21,[(FreedomWonInc. Volkswagen finance chief Poet...
5,2015-09-04 07:00:00,[RT @vwgroup_en: In 2014 #Porsche was able to ...,74,[(Martin Winterkorn Stays At The Helm As VW En...
6,2015-09-09 06:00:00,[RT @lizcastro: Volkswagen is not afraid of Ca...,19,[(Volkswagen is not afraid of Catalan independ...
7,2015-09-14 17:00:00,[General Motors: Buy For The 4.9% Dividend Alo...,39,[(The @vwgroup_en #Frankfurt press conference ...
8,2015-09-14 18:00:00,[Surprise! @Porsche drops all electric 4-door ...,32,[(@E6Principal @E6General @Bugatti @vwgroup_de...
9,2015-09-14 19:00:00,[ICYMI: #VWGroup Night at the #IAA. Watch rebr...,48,[(It's like a unicorn. @Bugatti @vwgroup_en #I...
