<a href="https://colab.research.google.com/github/hvarS/NLPRefer/blob/main/ContextualisedTextModellingBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [1]:
import pandas as pd
import numpy as np

In [8]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list, TopicModelDataPreparation
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import common_texts
from gensim.models import ldamodel 
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import os
import pickle

## Kaggle API Setup and Dataset Download

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')


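def setup_dataset(text):
  """Download a Kaggle dataset into its own folder on Google Drive.

  `text` is the CLI command copied from the dataset page, e.g.
  "kaggle datasets download -d <owner>/<dataset>".
  """
  import shutil
  import os
  %cd /content/gdrive/My Drive/Kaggle/
  # The dataset slug (<owner>/<dataset>) is the last token of the command.
  strings = text.split(' ')[-1]
  folder = strings.split('/')[1]
  # Point the Kaggle CLI at the folder that will hold kaggle.json.
  os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle/"+folder
  print(strings)
  print(folder)
  # mkdir prints a harmless "File exists" message if the folder is already there.
  !mkdir $folder
  shutil.copy2("kaggle.json","./"+folder+"/kaggle.json")
  %cd $folder
  !kaggle datasets download -d $strings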
  
setup_dataset("kaggle datasets download -d onurserbetci/data-science-job-market-in-uk")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Kaggle
onurserbetci/data-science-job-market-in-uk
data-science-job-market-in-uk
mkdir: cannot create directory ‘data-science-job-market-in-uk’: File exists
/content/gdrive/My Drive/Kaggle/data-science-job-market-in-uk
data-science-job-market-in-uk.zip: Skipping, found more recently modified local copy (use --force to force download)
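The string handling inside `setup_dataset` can be tried outside Colab, without the `%cd` and `!` magics. The following is a minimal sketch, assuming the command always ends with an `owner/dataset` slug. The helper name `parse_kaggle_command` is hypothetical and not part of the notebook or of the Kaggle CLI.

```python
def parse_kaggle_command(command):
    """Split a copied Kaggle CLI command into (dataset_slug, folder_name)."""
    # The dataset slug ("owner/dataset") is the last whitespace-separated token.
    slug = command.split()[-1]
    owner, sep, name = slug.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/dataset', got {slug!r}")
    return slug, name


slug, folder = parse_kaggle_command(
    "kaggle datasets download -d onurserbetci/data-science-job-market-in-uk"
)
print(slug)
print(folder)
```

Unlike the notebook's `split('/')[1]`, this version raises a `ValueError` when the last token contains no slash, so a mistyped command fails before any folder is created.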


In [6]:
%ls

data-science-job-market-in-uk.zip  indeed-uk.csv  kaggle.json  ml.csv
data_scientist.csv                 junior.csv     lead.csv     not_lead.csv


In [3]:
!unzip data-science-job-market-in-uk.zip

Archive:  data-science-job-market-in-uk.zip
  inflating: data_scientist.csv      
  inflating: indeed-uk.csv           
  inflating: junior.csv              
  inflating: lead.csv                
  inflating: ml.csv                  
  inflating: not_lead.csv            


In [3]:
df1 = pd.read_csv("data_scientist.csv")
df2 = pd.read_csv("indeed-uk.csv")
df3 = pd.read_csv("junior.csv")
df4 = pd.read_csv("lead.csv")
df5 = pd.read_csv("ml.csv")
df6 = pd.read_csv("not_lead.csv")

In [4]:
df = pd.DataFrame(columns = ["Description"])

In [5]:
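# Stack the Description columns of all six files; ignore_index gives a clean 0..n-1 index
# instead of repeating each file's own row labels.
df["Description"] = pd.concat([df1.Description, df2.Description, df3.Description, df4.Description, df5.Description, df6.Description], ignore_index=True)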

In [6]:
df.shape

(10455, 1)

*The six files together contain 10,455 job descriptions.*
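The notebook does not say whether the six files overlap, so a quick duplicate count is a reasonable sanity check before modelling. The sketch below uses only the standard library. The function name `count_duplicates` and the three sample strings are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def count_duplicates(texts):
    """Return (number of distinct texts, number of surplus repeated copies)."""
    # Strip surrounding whitespace so trivially different copies compare equal.
    counts = Counter(text.strip() for text in texts)
    distinct = len(counts)
    surplus = sum(counts.values()) - distinct
    return distinct, surplus


# Hypothetical sample; in the notebook one would pass df.Description instead.
sample = ["Data Scientist, London", "ML Engineer, Leeds", " Data Scientist, London "]
print(count_duplicates(sample))
```

If the second number is large for the real data, dropping duplicates before training may be worth considering, since repeated adverts would otherwise be weighted more heavily.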

## Installing Relevant Packages

In [7]:
!pip install contextualized-topic-models==1.8.1
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


## Data Preprocessing

In [9]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
# Strip leading and trailing whitespace from every job description.
# Iterating over the column directly avoids rebuilding the full list on each step.
descriptions = [text.strip() for text in df.Description]

In [17]:
print(descriptions[9])

SoulTeks client, a fast growing fin tech startup in London are looking for an Insights Data Scientist to join the team. This role is focussing primarily on taking large data sets and gaining insights for 3rd party clients. The data is truly massive and the possibilities it presents are just as big.
For this role the key skills you will need are:
Commercial experience working as a Data Scientist.
Data Visualisation experience
Python or R coding experience
Experience working independently and the ability to pick up new technologies quickly.
Ideally you will have worked with financial data before.
If this job sounds interesting to you please apply below.

Job Overview
Expiration date:
31st July 2020
Location:
London
Job Title:
Insights Data Scientist
Salary:
£45,000 - £60,000


In [18]:
sp = WhiteSpacePreprocessing(descriptions,"english")

In [19]:
preprocessed_documents_for_bow, unpreprocessed_corpus_for_contextual, vocab = sp.preprocess()

In [21]:
preprocessed_documents_for_bow[9]

'client fast growing tech startup london looking insights data scientist join team role primarily taking large data sets insights party clients data truly big role key skills need commercial experience working data scientist data visualisation experience python coding experience experience working independently ability new technologies quickly ideally worked financial data job interesting please apply job overview date july location london job title insights data scientist salary'
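Comparing the raw and preprocessed versions of document 9 shows the text lowercased, punctuation and tokens such as `3rd` removed, and stop words dropped. The sketch below imitates that kind of cleaning with the standard library only. It is not the implementation of `WhiteSpacePreprocessing`, and the stop-word set is a deliberately tiny, hypothetical stand-in for the NLTK English list, so its output will not match the library's exactly.

```python
import string

# Hypothetical, deliberately tiny stop-word list (the notebook relies on NLTK's English list).
STOPWORDS = {"a", "an", "are", "for", "in", "is", "the", "to"}


def simple_preprocess(text, stopwords=STOPWORDS):
    """Lowercase, turn punctuation into spaces, drop stop words and non-alphabetic tokens."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    kept = [tok for tok in tokens if tok.isalpha() and tok not in stopwords]
    return " ".join(kept)


print(simple_preprocess(
    "A fast growing startup in London are looking for an Insights Data Scientist."
))
```

The cleaned text is what feeds the bag-of-words side of the model, while the unpreprocessed text is kept for the contextual embeddings.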

In [22]:
df["preprocessed_documents_for_bow"] = preprocessed_documents_for_bow
df["unpreprocessed_corpus_for_contextual"] = unpreprocessed_corpus_for_contextual

## Training Dataset Creation

In [24]:
qt = TopicModelDataPreparation("bert-base-nli-mean-tokens")

training_dataset = qt.create_training_set(unpreprocessed_corpus_for_contextual, preprocessed_documents_for_bow)

100%|██████████| 405M/405M [00:16<00:00, 23.9MB/s]


Batches:   0%|          | 0/53 [00:00<?, ?it/s]
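Judging by the argument names, `create_training_set` pairs a contextual embedding of each unpreprocessed document with a bag-of-words representation of its preprocessed counterpart. The sketch below illustrates only the bag-of-words half, in plain Python. `build_vocab`, `bow_vector`, and the two sample documents are hypothetical, and the library's own vectoriser may differ in details such as token ordering.

```python
from collections import Counter


def build_vocab(documents):
    """Map each distinct whitespace-separated token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(document, vocab):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(document.split())
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = n
    return vector


docs = ["data scientist data", "python data"]  # hypothetical preprocessed documents
vocab = build_vocab(docs)
print(vocab)
print([bow_vector(d, vocab) for d in docs])
```

The length of each vector equals the vocabulary size, which is why the model is later constructed with `input_size=len(qt.vocab)`.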

In [27]:
# Cache the prepared training set; the with-block closes the file even if dump() fails.
with open("training_data.pickle", "wb") as pickle_out:
  pickle.dump(training_dataset, pickle_out)
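Saving the training set is only useful if it can be read back in a later session. The sketch below shows a save-and-load round trip with the standard `pickle` module, using a small placeholder dictionary in a temporary directory rather than the real `training_dataset`. The helper names `save_pickle` and `load_pickle` are hypothetical.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path; the with-block guarantees the file is closed."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder standing in for the real training set.
placeholder = {"bow": [[2, 0, 1]], "embedding_size": 768}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training_data.pickle")
    save_pickle(placeholder, path)
    restored = load_pickle(path)
print(restored == placeholder)
```

Two caveats apply to the real file: unpickling needs the library that defines the object's class to be importable, and pickle files should only be loaded from trusted sources because loading can execute arbitrary code.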

## Training

In [28]:
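# input_size is the bag-of-words vocabulary size; bert_input_size must match the
# embedding width of the sentence encoder (768 for bert-base-nli-mean-tokens);
# n_components is the number of topics to learn.
ctm = CombinedTM(input_size=len(qt.vocab), bert_input_size=768, n_components=6, num_epochs=15)

ctm.fit(training_dataset)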

Epoch: [15/15]	 Seen Samples: [156825/156825]	Train Loss: 2263.4745460455524	Time: 0:00:11.129987: : 15it [02:47, 11.16s/it]


## Topics Identified

In [32]:
ctm.get_topics(k = 1)

defaultdict(list,
            {0: ['architectural'],
             1: ['management'],
             2: ['azure'],
             3: ['architectural'],
             4: ['even'],
             5: ['languages']})
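Topics 0 and 3 share the same top word in the output above, which hints that some topics may overlap. A simple way to quantify this is the fraction of distinct words among all listed top words. The sketch below is a plain-Python illustration of that idea, not the `InvertedRBO` measure imported earlier, and the function name `unique_word_fraction` is hypothetical.

```python
def unique_word_fraction(topics):
    """Fraction of distinct words among all top words listed across topics."""
    words = [w for topic_words in topics.values() for w in topic_words]
    if not words:
        raise ValueError("no topic words given")
    return len(set(words)) / len(words)


# Top-1 words copied from the get_topics(k=1) output above.
top_words = {
    0: ["architectural"],
    1: ["management"],
    2: ["azure"],
    3: ["architectural"],
    4: ["even"],
    5: ["languages"],
}
print(round(unique_word_fraction(top_words), 3))
```

A value of 1.0 would mean no word is shared between topics. With a single word per topic the measure is coarse, so a larger `k` gives a more informative picture.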

## Analysing Document Topics

In [33]:
topics_predictions = ctm.get_doc_topic_distribution(training_dataset, n_samples=20)

Sampling: [20/20]: : 20it [01:31,  4.55s/it]


In [35]:
preprocessed_documents_for_bow[0]

'company description years ostc become one leading proprietary trading companies world success based hiring developing talented people helping perform maximum fulfil potential trading company however unlike many trading companies continually innovating new products support business zishi comprehensive platform knowledge training ostc trading business across world partnering world leading artificial intelligence specialists university college london upon extensive ostc data bring valuable performance enhancing insights tools individual collective trading experience job description role working closely co head zishi chief technology officer senior stakeholders within business external expert partners primary focus help create embed ai powered technology key areas business main responsibilities include handling vast amount data various data sources integrate clean verify integrity translating business opportunities challenges data data actionable business insights effective decision makin

In [36]:
# Index of the most probable topic for the first document.
topic_number = np.argmax(topics_predictions[0])

In [37]:
ctm.get_topic_lists(5)[topic_number]

['languages', 'data', 'investigation', 'pipeline', 'apple']
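The last two cells pick the most probable topic for a document and then look up that topic's words. The sketch below reproduces the selection step in plain Python. The probabilities in `doc_topics` are hypothetical, since the notebook does not print `topics_predictions[0]`, while the top-1 words are copied from the earlier `get_topics(k=1)` output. The function name `most_likely_topic` is also hypothetical.

```python
def most_likely_topic(distribution):
    """Return the index of the largest probability (first index on ties, like np.argmax)."""
    if len(distribution) == 0:
        raise ValueError("empty topic distribution")
    return max(range(len(distribution)), key=lambda i: distribution[i])


# Hypothetical topic distribution for one document (six topics, sums to 1).
doc_topics = [0.05, 0.10, 0.05, 0.10, 0.10, 0.60]
# Top-1 word per topic, taken from the get_topics(k=1) output.
top_word = ["architectural", "management", "azure", "architectural", "even", "languages"]

best = most_likely_topic(doc_topics)
print(best, top_word[best])
```

Taking only the arg-max discards how confident the assignment is, so inspecting the full distribution is advisable when the top two probabilities are close.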