# Data Agnostic Topic Modelling (DATM)
### This exploratory notebook demonstrates the topic modelling tool that I have created. See the submitted paper for details.

Please follow the steps below. Most of them are optional since the data has already been cleaned and transformed. 

# 1. Import Statements

In [54]:
# Run this cell if the NLTK stop words are not available locally
import nltk
nltk.download('stopwords')
# pip install has no --yes flag, so it is omitted here
# !pip install --prefix {sys.prefix} nltk

In [73]:
#These are the generic import statements.
#-------------------------------------------------------------------------
#Classic Imports
import sys
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
import re
import ast
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pathlib

from biterm.btm import oBTM
from biterm.cbtm import oBTM as c_oBTM
from biterm.utility import vec_to_biterms, topic_summuary
import pickle
#--------------------------------------------------------------
#Local Imports
from src.data import DataPreprocessing as dp # General data processing functions
from src.sompy import sompy as sompy #SOM module
from src.data import PreprocessAirlineData as PAD #Specific data processing for the airline data
from src.models import Modelling as mdl #Contains functions for the data transformation and topic models
from src.evaluations import evaluation as ev #evaluation modules
import importlib  # replaces the deprecated imp module (removed in Python 3.12)
# importlib.reload(ev)
from src.generalFunctions import functions as fs # general functions 


# 2. Data preprocessing

Please note that the data has already been preprocessed and cleaned.
The following cells let us replicate all the data preprocessing steps that we have already developed.

In [12]:
#This cell specifies where all data must be stored and retrieved from
# dataFolder = str(pathlib.Path().absolute().parents[1]) +"/data" #points to the data folder
dataFolder = "./data" #points to the data folder
airlineRawData = dataFolder+"/airlineDataset/raw" #points to the raw data folder
airlineProcessedData = dataFolder+"/airlineDataset/processed" #points to the processed data folder
airlineResults = dataFolder+"/airlineDataset/results" #points to the results folder

In [13]:
#-1- Read csv file and/or convert to dataframe - Airline Dataset
airline_DF_Clean = PAD.preprocessAirline(convertcsv = True, save = False, clean = True, 
                                         dataSource = airlineRawData, 
                                         dataStorage = airlineProcessedData)
#airline_DF_Clean.count()

### The topics captured in the dataset

In [74]:
#-2- Get only topic labelled tweets - This is for exploration only
tweets_Labelled = dp.getTopicLabelledTweets(dataSource=airlineProcessedData+'/airline_DF_Clean.pkl', 
                                              exclude = ["Can't Tell"], save = False)
#The following lets us see which topics exist
topicList = set(tweets_Labelled.negativeReason)
topicList
#The following gives a count of the number of tweets we are dealing with
#tweets_Labelled.count()

## The following allows you to access the stored preprocessed data

In [76]:
airline_DF_Clean = dp.getData(airlineProcessedData+'/airline_DF_Clean.pkl')
airline_DF_Clean.head()

| | tweetID | text | author | publicationDate | negativeReason |
|---|---|---|---|---|---|
| 0 | 5.70306e+17 | say | cairdin | 2/24/15 11:35 | |
| 1 | 5.70301e+17 | plus youve add commercial experience tacky | jnardino | 2/24/15 11:15 | |
| 2 | 5.70301e+17 | didnt today must mean need take another trip | yvonnalynn | 2/24/15 11:15 | |
| 3 | 5.70301e+17 | really aggressive blast obnoxious entertainmen... | jnardino | 2/24/15 11:15 | Bad Flight |
| 4 | 5.70301e+17 | really big bad thing | jnardino | 2/24/15 11:14 | Can't Tell |


# 3. Data Transformation

### We extract the corpus from our dataframe

In [15]:
tweets = airline_DF_Clean['text'].tolist()

### We now demonstrate our transformation algorithm.
Please note that we only transform a fraction of the corpus for the sake of time. The fully transformed document set has already been computed and is available (see the next step).

In [28]:
# the corpus as a slice of the tweets - feel free to change this 
corpus = tweets[:20]
# the noOfBuckets is the number of documents that will be returned. 
# Here we want the number of documents returned to be the same as the 
# original number of documents 
noOfBuckets = len(corpus)

In [29]:
# the actual transformation
transformed_docs = mdl.malg(corpus, noOfBuckets)

  5%|▍         | 50/1018 [00:00<00:01, 499.55it/s]

The total initial cost is 1037.9989337433067
The number of folds to perform is  1018
The number documents is  1038


100%|██████████| 1018/1018 [00:00<00:00, 2487.78it/s]

The new total cost after round x is  19.999762106972675 and new corpus size is  20





In [34]:
# transformed_docs

### The fully transformed documents
Load the saved, fully transformed documents.

In [64]:
fully_transformed_docs = './data/airlineDataset/interim/transformedCorpus_Noisy.pkl'

with open(fully_transformed_docs, 'rb') as f:
    transformed_data = pickle.load(f)
    
# transformed_data

# 4. Modelling Approaches
### (used for both transformed and untransformed data)

## a) LDA Topics 

In [77]:
foundTopics = mdl.getLDATopics(transformed_docs, n_topics= 9)
print(foundTopics)

Number of features too low, do not use 'n_features = auto' option,        specify a number
[['reserve', 'away', 'seat', 'flight'], ['seat', 'reserve', 'flight', 'away'], ['seat', 'flight', 'reserve', 'away'], ['seat', 'flight', 'reserve', 'away'], ['flight', 'seat', 'reserve', 'away'], ['seat', 'flight', 'reserve', 'away'], ['seat', 'away', 'flight', 'reserve'], ['seat', 'flight', 'reserve', 'away'], ['seat', 'flight', 'reserve', 'away']]
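
All nine topics printed above consist of the same four words in different orders, which is consistent with the warning about too few features when only 20 documents are transformed. A minimal sketch of a sanity check for this situation follows. `summarise_topics` is a hypothetical helper and not part of `src.models`; the topic lists are copied from the printed output.

```python
def summarise_topics(topics):
    """Return (vocabulary size, number of distinct topic word sets)."""
    # Every word that appears in at least one topic
    vocabulary = {word for topic in topics for word in topic}
    # Topics that differ only in word order collapse to the same frozenset
    distinct_sets = {frozenset(topic) for topic in topics}
    return len(vocabulary), len(distinct_sets)


# Topic lists copied from the LDA output above
found_topics = [
    ['reserve', 'away', 'seat', 'flight'],
    ['seat', 'reserve', 'flight', 'away'],
    ['seat', 'flight', 'reserve', 'away'],
    ['seat', 'flight', 'reserve', 'away'],
    ['flight', 'seat', 'reserve', 'away'],
    ['seat', 'flight', 'reserve', 'away'],
    ['seat', 'away', 'flight', 'reserve'],
    ['seat', 'flight', 'reserve', 'away'],
    ['seat', 'flight', 'reserve', 'away'],
]

vocab_size, n_distinct = summarise_topics(found_topics)
print(vocab_size, n_distinct)  # prints: 4 1
```

A single distinct word set across nine topics suggests the model had too little vocabulary to separate topics, so a larger corpus slice (or the saved `transformed_data`) is likely needed for meaningful LDA results.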


## b) BTM Topics

### Use the function below to train a BTM model and summarise its topics

In [78]:
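import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from biterm.btm import oBTM
from biterm.utility import vec_to_biterms, topic_summuary


def basic_evaluation(data, num_topics=9, iterations=100):
    # Bag-of-words counts for every document
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data).toarray()
    # get_feature_names() was removed in scikit-learn 1.2
    vocab = np.array(vectorizer.get_feature_names_out())

    # Word pairs (biterms) that co-occur within each document
    biterms = vec_to_biterms(X)
    btm = oBTM(num_topics=num_topics, V=vocab)

    print("\n\n Train BTM ..")
    btm.fit_transform(biterms, iterations=iterations)
    # Summarise the top 10 words of each topic
    return topic_summuary(btm.phi_wz.T, X, vocab, 10)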

In [43]:
# btm_airline_res = basic_evaluation(transformed_docs)
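
BTM models a corpus through biterms: unordered pairs of words that co-occur in the same short document. The sketch below illustrates the idea on a single cleaned tweet from the dataframe shown earlier. It assumes whitespace tokenisation, and `extract_biterms` is a hypothetical helper, not the `vec_to_biterms` implementation from the `biterm` package (which operates on count vectors).

```python
from itertools import combinations


def extract_biterms(document):
    """Return every unordered word pair in a whitespace-tokenised document."""
    tokens = document.split()
    # Sort each pair so that (a, b) and (b, a) are treated as the same biterm
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


biterms = extract_biterms("really big bad thing")
for biterm in biterms:
    print(biterm)
```

A document with n tokens yields n(n-1)/2 biterms, which is why the approach suits short texts such as tweets.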

## c) PTM Topics
The three functions below are required to
* generate corpus for the PTM model
* run the PTM model
* get the topics from the model

In [79]:
import tomotopy as tp

def getCorpus(input_data):
    corpus = tp.utils.Corpus(tokenizer=tp.utils.SimpleTokenizer(), stopwords=['.'])
    corpus.process(input_data)
    return corpus

In [80]:
def PTM_model(input_corpus, num_topics, a=None, b=None):
    # Default both priors to 1/num_topics when they are not supplied
    if a is None:
        a = 1 / num_topics
    if b is None:
        b = 1 / num_topics
    ptm_model = tp.PTModel(k=num_topics, alpha=a, eta=b, corpus=input_corpus)
    # 1000 training iterations in total, run in steps of 10
    for _ in range(0, 1000, 10):
        ptm_model.train(10)
    return ptm_model

def get_mdl_topics(mdl, k):
    # Collect the 10 most probable words for each of the k topics
    k_topics = []
    for topic_id in range(k):
        word_set = mdl.get_topic_words(topic_id, top_n=10)
        k_topics.append([w[0] for w in word_set])
    return k_topics

In [82]:
ptm_corpus = getCorpus(transformed_data)
ptm_mdl = PTM_model(ptm_corpus, 9)

In [66]:
ptm_topics = get_mdl_topics(ptm_mdl, 9)

## Comparing discovered topics with true topics

In [84]:
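# Read the ground-truth topics: each line is split into a label and its words at the first ':'
ActualTopics = {}
with open('data/airlineDataset/results/gsrTopics.txt', 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        key, val = line.split(':', 1)
        ActualTopics[key] = val.strip().split(', ')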

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# BTM results must be transformed to lists
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# btm_topics = []
# for e in btm_results['topic_summary']['top_words']:
#     btm_topics.append(list(e))
# btm_topics

ev.evaluate_correctness(ptm_topics, ActualTopics, 2)

{'RecallActual': 0.8888888888888888,
 'Precision': 0.8888888888888888,
 'Purity': 0.35555555555555557}
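
The meaning of the third argument to `ev.evaluate_correctness` is not documented in this notebook. One plausible reading, and purely an assumption here, is a minimum number of shared words required for a discovered topic to count as matching a true topic. The sketch below computes precision and recall under that assumption. `match_topics` is a hypothetical function, not the implementation in `src.evaluations`, it omits purity, and the toy topics are made up for illustration.

```python
def match_topics(found_topics, actual_topics, min_overlap=2):
    """Score discovered topics against labelled topics by word overlap.

    ASSUMPTION: a discovered topic matches a true topic when they share
    at least `min_overlap` words. This is not taken from src.evaluations.
    """
    matched_found = 0       # discovered topics that match some true topic
    matched_actual = set()  # true topics matched by some discovered topic
    for words in found_topics:
        hits = [label for label, truth in actual_topics.items()
                if len(set(words) & set(truth)) >= min_overlap]
        if hits:
            matched_found += 1
            matched_actual.update(hits)
    precision = matched_found / len(found_topics) if found_topics else 0.0
    recall = len(matched_actual) / len(actual_topics) if actual_topics else 0.0
    return {"Precision": precision, "RecallActual": recall}


# Toy data, invented for illustration only
toy_actual = {
    "Late Flight": ["delay", "late", "hour"],
    "Lost Luggage": ["bag", "lost", "luggage"],
}
toy_found = [["flight", "delay", "late"], ["seat", "food", "crew"]]

print(match_topics(toy_found, toy_actual))
# prints: {'Precision': 0.5, 'RecallActual': 0.5}
```

Raising `min_overlap` makes matching stricter, so both scores can only stay the same or fall.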