**<h1><center>Conversational Analytics</center></h1>**

Conversational analytics is a technology-driven process that involves analysing data from human conversations to extract meaningful insights. These conversations can occur across various channels such as phone calls, chatbots, social media, customer service interactions, and more. The primary goal of conversational analytics is to understand the content, context, and sentiment of conversations to improve customer experiences, streamline operations, and inform strategic decision-making.

## **What Is Our Approach?**

For our analysis of the Cornell Movie-Dialogs Corpus, we employ natural language processing (NLP) techniques, including topic modelling, to uncover hidden thematic structures within the dialogues. Additionally, we create a conversation network to organise and summarise the extensive textual information, allowing us to understand the connections between entities in the dataset.

In [4]:
#   Importing the necessary libraries/modules.

from convokit import Corpus, download
import pandas as pd
from tqdm import tqdm
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [5]:
#   Downloading the necessary packages.

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/huzaifa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/huzaifa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/huzaifa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/huzaifa/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
corpus=Corpus(filename=download("movie-corpus"))    #   Loading the corpus.

Downloading movie-corpus to /Users/huzaifa/.convokit/downloads/movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done


In [7]:
corpus.print_summary_stats()

Number of Speakers: 9035
Number of Utterances: 304713
Number of Conversations: 83097


In [8]:
speaker_dataframe=corpus.get_speakers_dataframe()   #   Retrieving the speakers pandas.DataFrame.
speaker_dataframe

id,vectors,meta.character_name,meta.movie_idx,meta.movie_name,meta.gender,meta.credit_pos
u0,[],BIANCA,m0,10 things i hate about you,f,4
u2,[],CAMERON,m0,10 things i hate about you,m,3
u3,[],CHASTITY,m0,10 things i hate about you,?,?
u4,[],JOEY,m0,10 things i hate about you,m,6
u5,[],KAT,m0,10 things i hate about you,f,2
...,...,...,...,...,...,...
u9029,[],CREALOCK,m616,zulu dawn,?,?
u9033,[],STUART SMITH,m616,zulu dawn,?,?
u9028,[],COGHILL,m616,zulu dawn,?,?
u9031,[],MELVILL,m616,zulu dawn,?,?


In [9]:
conversation_dataframe=corpus.get_conversations_dataframe() #   Retrieving the conversations pandas.DataFrame.
conversation_dataframe

id,vectors,meta.movie_idx,meta.movie_name,meta.release_year,meta.rating,meta.votes,meta.genre
L1044,[],m0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']"
L984,[],m0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']"
L924,[],m0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']"
L870,[],m0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']"
L866,[],m0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']"
...,...,...,...,...,...,...,...
L666324,[],m616,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w..."
L666262,[],m616,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w..."
L666520,[],m616,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w..."
L666369,[],m616,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w..."


In [10]:
utterance_dataframe=corpus.get_utterances_dataframe()   #   Retrieving the utterances pandas.DataFrame.
utterance_dataframe

id,timestamp,text,speaker,reply_to,conversation_id,meta.movie_id,meta.parsed,vectors
L1045,,They do not!,u0,L1044,L1044,m0,"[{'rt': 1, 'toks': [{'tok': 'They', 'tag': 'PR...",[]
L1044,,They do to!,u2,,L1044,m0,"[{'rt': 1, 'toks': [{'tok': 'They', 'tag': 'PR...",[]
L985,,I hope so.,u0,L984,L984,m0,"[{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP',...",[]
L984,,She okay?,u2,,L984,m0,"[{'rt': 1, 'toks': [{'tok': 'She', 'tag': 'PRP...",[]
L925,,Let's go.,u0,L924,L924,m0,"[{'rt': 0, 'toks': [{'tok': 'Let', 'tag': 'VB'...",[]
...,...,...,...,...,...,...,...,...
L666371,,Lord Chelmsford seems to want me to stay back ...,u9030,L666370,L666369,m616,"[{'rt': 2, 'toks': [{'tok': 'Lord', 'tag': 'NN...",[]
L666370,,I'm to take the Sikali with the main column to...,u9034,L666369,L666369,m616,"[{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP',...",[]
L666369,,"Your orders, Mr Vereker?",u9030,,L666369,m616,"[{'rt': 1, 'toks': [{'tok': 'Your', 'tag': 'PR...",[]
L666257,,"Good ones, yes, Mr Vereker. Gentlemen who can ...",u9030,L666256,L666256,m616,"[{'rt': 1, 'toks': [{'tok': 'Good', 'tag': 'JJ...",[]


In [11]:
speaker_utterance_dataframe=pd.merge(speaker_dataframe, utterance_dataframe, left_on="id", right_on="speaker")  #   Joining the speakers and utterances pandas.DataFrames.
speaker_utterance_dataframe.head()

Unnamed: 0,vectors_x,meta.character_name,meta.movie_idx,meta.movie_name,meta.gender,meta.credit_pos,timestamp,text,speaker,reply_to,conversation_id,meta.movie_id,meta.parsed,vectors_y
0,[],BIANCA,m0,10 things i hate about you,f,4,,They do not!,u0,L1044,L1044,m0,"[{'rt': 1, 'toks': [{'tok': 'They', 'tag': 'PR...",[]
1,[],BIANCA,m0,10 things i hate about you,f,4,,I hope so.,u0,L984,L984,m0,"[{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP',...",[]
2,[],BIANCA,m0,10 things i hate about you,f,4,,Let's go.,u0,L924,L924,m0,"[{'rt': 0, 'toks': [{'tok': 'Let', 'tag': 'VB'...",[]
3,[],BIANCA,m0,10 things i hate about you,f,4,,Okay -- you're gonna need to learn how to lie.,u0,L871,L870,m0,"[{'rt': 4, 'toks': [{'tok': 'Okay', 'tag': 'UH...",[]
4,[],BIANCA,m0,10 things i hate about you,f,4,,I'm kidding. You know how sometimes you just ...,u0,,L870,m0,"[{'rt': 2, 'toks': [{'tok': 'I', 'tag': 'PRP',...",[]


In [12]:
corpus_dataframe=pd.merge(conversation_dataframe, speaker_utterance_dataframe, left_on="id", right_on="conversation_id")    #   Joining the conversations and speakers and utterances pandas.DataFrames.
corpus_dataframe.head()

Unnamed: 0,vectors,meta.movie_idx_x,meta.movie_name_x,meta.release_year,meta.rating,meta.votes,meta.genre,vectors_x,meta.character_name,meta.movie_idx_y,...,meta.gender,meta.credit_pos,timestamp,text,speaker,reply_to,conversation_id,meta.movie_id,meta.parsed,vectors_y
0,[],m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",[],BIANCA,m0,...,f,4,,They do not!,u0,L1044,L1044,m0,"[{'rt': 1, 'toks': [{'tok': 'They', 'tag': 'PR...",[]
1,[],m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",[],CAMERON,m0,...,m,3,,They do to!,u2,,L1044,m0,"[{'rt': 1, 'toks': [{'tok': 'They', 'tag': 'PR...",[]
2,[],m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",[],BIANCA,m0,...,f,4,,I hope so.,u0,L984,L984,m0,"[{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP',...",[]
3,[],m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",[],CAMERON,m0,...,m,3,,She okay?,u2,,L984,m0,"[{'rt': 1, 'toks': [{'tok': 'She', 'tag': 'PRP...",[]
4,[],m0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",[],BIANCA,m0,...,f,4,,Let's go.,u0,L924,L924,m0,"[{'rt': 0, 'toks': [{'tok': 'Let', 'tag': 'VB'...",[]


In [13]:
print(list(corpus_dataframe.columns))

['vectors', 'meta.movie_idx_x', 'meta.movie_name_x', 'meta.release_year', 'meta.rating', 'meta.votes', 'meta.genre', 'vectors_x', 'meta.character_name', 'meta.movie_idx_y', 'meta.movie_name_y', 'meta.gender', 'meta.credit_pos', 'timestamp', 'text', 'speaker', 'reply_to', 'conversation_id', 'meta.movie_id', 'meta.parsed', 'vectors_y']


In [14]:
corpus_dataframe=corpus_dataframe.drop(columns=["vectors", "meta.movie_idx_x", "vectors_x", "meta.movie_idx_y", "meta.movie_name_y", "meta.gender", "meta.credit_pos", "timestamp", "conversation_id", "meta.movie_id", "meta.parsed", "vectors_y"])
corpus_dataframe

Unnamed: 0,meta.movie_name_x,meta.release_year,meta.rating,meta.votes,meta.genre,meta.character_name,text,speaker,reply_to
0,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']",BIANCA,They do not!,u0,L1044
1,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']",CAMERON,They do to!,u2,
2,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']",BIANCA,I hope so.,u0,L984
3,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']",CAMERON,She okay?,u2,
4,10 things i hate about you,1999,6.90,62847,"['comedy', 'romance']",BIANCA,Let's go.,u0,L924
...,...,...,...,...,...,...,...,...,...
304708,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w...",DURNFORD,"Your orders, Mr Vereker?",u9030,
304709,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w...",VEREKER,I think Chelmsford wants a good man on the bor...,u9034,L666371
304710,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w...",VEREKER,I'm to take the Sikali with the main column to...,u9034,L666369
304711,zulu dawn,1979,6.40,1911,"['action', 'adventure', 'drama', 'history', 'w...",DURNFORD,"Good ones, yes, Mr Vereker. Gentlemen who can ...",u9030,L666256


### **• Conversation Network:**

A conversation network is a graphical representation of interactions between entities within a dataset, particularly focusing on dialogues and conversations. In the context of the Cornell Movie-Dialogs Corpus, a conversation network can help visualise and analyse the relationships and communication patterns between characters.

- **Nodes:** Each node represents an entity in the dataset; in this case, the nodes represent movie characters.
- **Edges:** Each edge represents a conversation or interaction between two nodes (characters).

A conversation network facilitates various types of analysis, such as:

- **Centrality Analysis:** Identifying key characters who are central to the story or have the most interactions.
- **Community Detection:** Finding groups of characters who interact more frequently with each other than with others.
- **Interaction Patterns:** Understanding how conversations flow, who initiates them, and how they propagate through the network.
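
As a rough illustration of centrality analysis, the sketch below computes a simple degree centrality: for each character, the share of other characters they talk to directly. The character pairs are made-up toy data, and `degree_centrality` is a hypothetical helper written for this example, not part of ConvoKit.

```python
from collections import defaultdict

# Toy (speaker, addressee) pairs; illustrative only, not taken from the corpus.
toy_edges = [
    ("BIANCA", "CAMERON"),
    ("CAMERON", "BIANCA"),
    ("BIANCA", "KAT"),
    ("JOEY", "BIANCA"),
]


def degree_centrality(edges):
    """Return, for each node, the share of other nodes it is directly connected to."""
    neighbours = defaultdict(set)
    for source, target in edges:
        if source == target:
            continue  # Ignore self-loops.
        # Treat the conversation as undirected: both sides gain a neighbour.
        neighbours[source].add(target)
        neighbours[target].add(source)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


centrality = degree_centrality(toy_edges)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Degree centrality is only one of several possible measures; it was chosen here because it needs nothing beyond the standard library.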

In [15]:
edges=list(zip(corpus_dataframe["meta.character_name"], corpus_dataframe["reply_to"], corpus_dataframe["text"]))    #   Creating a list of (source, target, text) tuples: the speaking character, the ID of the utterance being replied to, and the utterance text.
edges=[(source, target, text) for source, target, text in edges if pd.notna(target)]  #   Removing the edges without a target (conversation-opening utterances, whose reply_to is None).
edges

[('BIANCA', 'L1044', 'They do not!'),
 ('BIANCA', 'L984', 'I hope so.'),
 ('BIANCA', 'L924', "Let's go."),
 ('BIANCA', 'L871', "Okay -- you're gonna need to learn how to lie."),
 ('CAMERON', 'L870', 'No'),
 ('BIANCA', 'L868', 'Like my fear of wearing pastels?'),
 ('BIANCA', 'L866', 'What good stuff?'),
 ('CAMERON', 'L867', 'The "real you".'),
 ('BIANCA',
  'L863',
  "Me.  This endless ...blonde babble. I'm like, boring myself."),
 ('CAMERON',
  'L864',
  'Thank God!  If I had to hear one more story about your coiffure...'),
 ('CAMERON', 'L862', 'What crap?'),
 ('CAMERON', 'L860', 'No...'),
 ('BIANCA', 'L697', 'But'),
 ('CAMERON', 'L698', 'You always been this selfish?'),
 ('CAMERON', 'L696', "Then that's all you had to say."),
 ('BIANCA', 'L693', 'I was?'),
 ('CAMERON', 'L694', "You never wanted to go out with 'me, did you?"),
 ('BIANCA', 'L662', 'Tons'),
 ('CAMERON', 'L577', 'I believe we share an art instructor'),
 ('CAMERON', 'L575', 'Looks like things worked out tonight, huh?'),
 (

In [16]:
json_edges=[]

#   Creating a list of dictionaries containing the source, target, and text of each edge.

for source, target, text in edges:
    json_edges.append({"source": source, "target": target, "text": text})
    
json_edges

[{'source': 'BIANCA', 'target': 'L1044', 'text': 'They do not!'},
 {'source': 'BIANCA', 'target': 'L984', 'text': 'I hope so.'},
 {'source': 'BIANCA', 'target': 'L924', 'text': "Let's go."},
 {'source': 'BIANCA',
  'target': 'L871',
  'text': "Okay -- you're gonna need to learn how to lie."},
 {'source': 'CAMERON', 'target': 'L870', 'text': 'No'},
 {'source': 'BIANCA',
  'target': 'L868',
  'text': 'Like my fear of wearing pastels?'},
 {'source': 'BIANCA', 'target': 'L866', 'text': 'What good stuff?'},
 {'source': 'CAMERON', 'target': 'L867', 'text': 'The "real you".'},
 {'source': 'BIANCA',
  'target': 'L863',
  'text': "Me.  This endless ...blonde babble. I'm like, boring myself."},
 {'source': 'CAMERON',
  'target': 'L864',
  'text': 'Thank God!  If I had to hear one more story about your coiffure...'},
 {'source': 'CAMERON', 'target': 'L862', 'text': 'What crap?'},
 {'source': 'CAMERON', 'target': 'L860', 'text': 'No...'},
 {'source': 'BIANCA', 'target': 'L697', 'text': 'But'},
 {'

In [17]:
#   Saving the list of dictionaries as a JSON file.

with open(r"../data/conversation_network.json", "w") as f:
    json.dump(json_edges, f)
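
Note that the edges saved here point from a character to an utterance ID (the `reply_to` value), not to another character. If character-to-character edges are wanted, one possible approach is to look up who spoke the utterance being replied to. The sketch below assumes each record carries its own utterance ID; the merged frame in this notebook no longer does, so in practice the lookup would presumably be built from the utterances frame. The field names `id` and `character` are hypothetical.

```python
from collections import Counter

# Four utterances from the corpus preview shown earlier, as plain records.
toy_utterances = [
    {"id": "L1044", "character": "CAMERON", "reply_to": None, "text": "They do to!"},
    {"id": "L1045", "character": "BIANCA", "reply_to": "L1044", "text": "They do not!"},
    {"id": "L984", "character": "CAMERON", "reply_to": None, "text": "She okay?"},
    {"id": "L985", "character": "BIANCA", "reply_to": "L984", "text": "I hope so."},
]

# Map each utterance ID to the character who spoke it.
speaker_of = {utterance["id"]: utterance["character"] for utterance in toy_utterances}

# Count replies per (replying character, addressed character) pair.
character_edges = Counter()
for utterance in toy_utterances:
    parent = utterance["reply_to"]
    if parent is None or parent not in speaker_of:
        continue  # Conversation openers, or replies to unknown utterances.
    character_edges[(utterance["character"], speaker_of[parent])] += 1

print(character_edges)
```

The resulting counts can serve as edge weights, so repeated exchanges between the same two characters collapse into a single weighted edge.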

### **• Topic Modelling:**

Topic modelling can be effectively applied to the Cornell Movie-Dialogs Corpus to reveal hidden thematic structures within the dialogues.

#### **What is Topic Modelling?**

Topic modelling is a type of statistical modelling used to identify abstract topics that occur in a collection of documents. It helps in organising and summarising large collections of text. Two of the most common techniques for topic modelling are Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorisation (NMF).

In [18]:
tqdm.pandas()   #   Creating a progress bar.

In [19]:
stop_words=set(stopwords.words("english"))   #   Creating a set of stopwords (membership tests on a set are faster than on a list).
lemmatiser=WordNetLemmatizer()  #   Creating an instance of the WordNetLemmatizer class.

In [20]:
#   Function to preprocess the text data.

def preprocess_text(text):
    text=text.lower()   #   Converting the text to lowercase.
    text=re.sub(r"\d+", "", text)   #   Removing digits.
    text=re.sub(r"[^\w\s]", "", text)   #   Removing punctuation.
    text=re.sub(r"\s+", " ", text)  #   Removing extra whitespace.
    text=text.strip()   #   Removing leading and trailing whitespace.
    text=word_tokenize(text)    #   Tokenising the text.
    text=[i for i in text if i not in stop_words]   #   Removing stopwords.
    text=[lemmatiser.lemmatize(word=w, pos="v") for w in text]  #   Lemmatising the text.
    text=" ".join(text) #   Joining the text back together.
    return text
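
To see why short lines such as "They do not!" end up empty after preprocessing, the sketch below replays the cleaning steps using only the standard library. It is a simplified stand-in, not the function above: `TOY_STOP_WORDS` is a small hypothetical subset of a stopword list, whitespace splitting replaces `word_tokenize`, and lemmatisation is omitted.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration.
TOY_STOP_WORDS = {"they", "do", "not", "to", "i", "so", "you", "how"}


def toy_preprocess(text):
    """Lowercase, strip digits and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)        # Remove digits.
    text = re.sub(r"[^\w\s]", "", text)    # Remove punctuation, including apostrophes.
    text = re.sub(r"\s+", " ", text).strip()
    tokens = [token for token in text.split() if token not in TOY_STOP_WORDS]
    return " ".join(tokens)


empty_result = toy_preprocess("They do not!")         # Every token is a stopword.
kept_result = toy_preprocess("I hope so.")            # Only "hope" survives.
contraction_result = toy_preprocess("You're right.")  # The apostrophe is stripped first.

print(repr(empty_result), repr(kept_result), repr(contraction_result))
```

Because punctuation is removed before stopword filtering, a contraction such as "you're" becomes "youre", which no longer matches a stopword entry that contains an apostrophe. This may explain why tokens like `youre`, `dont`, and `im` appear among the topic words later in the notebook.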

In [21]:
topic_modelling_dataframe=corpus_dataframe.copy()   #   Creating a copy of the corpus pandas.DataFrame.

In [22]:
topic_modelling_dataframe["text"]=topic_modelling_dataframe["text"].progress_apply(preprocess_text) #   Preprocessing the text data.
topic_modelling_dataframe.head()

100%|██████████| 304713/304713 [00:17<00:00, 17080.90it/s]


Unnamed: 0,meta.movie_name_x,meta.release_year,meta.rating,meta.votes,meta.genre,meta.character_name,text,speaker,reply_to
0,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,,u0,L1044
1,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",CAMERON,,u2,
2,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,hope,u0,L984
3,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",CAMERON,okay,u2,
4,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,let go,u0,L924


In [23]:
topic_modelling_dataframe=topic_modelling_dataframe[topic_modelling_dataframe["text"]!=""]  #   Removing rows with empty text.
topic_modelling_dataframe.head()

Unnamed: 0,meta.movie_name_x,meta.release_year,meta.rating,meta.votes,meta.genre,meta.character_name,text,speaker,reply_to
2,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,hope,u0,L984
3,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",CAMERON,okay,u2,
4,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,let go,u0,L924
5,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",CAMERON,wow,u2,
6,10 things i hate about you,1999,6.9,62847,"['comedy', 'romance']",BIANCA,okay youre gon na need learn lie,u0,L871


In [24]:
len(topic_modelling_dataframe)

291834

In [25]:
count_vectorizer=CountVectorizer(stop_words="english")  #   Creating an instance of the CountVectorizer class.
count_data=count_vectorizer.fit_transform(topic_modelling_dataframe["text"])    #   Creating a document-term matrix.
count_data

<291834x56303 sparse matrix of type '<class 'numpy.int64'>'
	with 1297581 stored elements in Compressed Sparse Row format>
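
For intuition, a document-term matrix has one row per document and one column per vocabulary word, with each cell holding a word count. The sketch below builds a dense version by hand for three preprocessed utterances from the table above. It is a simplification: it ignores the stop-word and token-length filtering that `CountVectorizer` applies, and it stores every zero, whereas the real matrix is sparse.

```python
from collections import Counter

# Three preprocessed utterances from the notebook's output.
toy_documents = ["hope", "let go", "okay youre gon na need learn lie"]

# Vocabulary: every distinct word, sorted so that column order is stable.
vocabulary = sorted({word for document in toy_documents for word in document.split()})
column_of = {word: column for column, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary word.
matrix = []
for document in toy_documents:
    counts = Counter(document.split())
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        row[column_of[word]] = count
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```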

#### **Latent Dirichlet Allocation (LDA):**

Latent Dirichlet Allocation (LDA) is a generative probabilistic model commonly used for topic modelling in natural language processing (NLP). It helps to uncover the underlying thematic structure in a collection of documents by identifying topics and the distribution of these topics within the documents.

##### **Documents and Corpus:**

- **Document:** A single piece of text, such as an article or a book chapter. In this notebook, each document is a single preprocessed utterance from the Cornell Movie-Dialogs Corpus (one row of the document-term matrix).
- **Corpus:** A collection of documents.

##### **Topics:**

- **Topic:** A distribution over a fixed vocabulary of words. Each topic is characterised by a set of words that frequently appear together.
- **Word Distribution:** For each topic, Latent Dirichlet Allocation (LDA) assumes a probability distribution over all the words in the corpus vocabulary.

##### **Assumptions:**

- Each document is a mixture of a small number of topics.
- Each word in a document can be attributed to one of the document's topics.
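
The assumptions above correspond to a generative story: draw a topic mixture for the document, then, for each word, pick a topic from that mixture and a word from that topic. The sketch below simulates this with the standard library. The two topics, their word probabilities, and the `alpha` values are invented for illustration; fitting LDA runs this reasoning in reverse, inferring the mixtures and topics from observed words.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
toy_topics = [
    {"money": 0.6, "pay": 0.3, "job": 0.1},
    {"love": 0.5, "kiss": 0.4, "heart": 0.1},
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # Per-document topic mixture.
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # Pick a topic.
        vocab = list(topics[k])
        weights = [topics[k][word] for word in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])   # Pick a word.
    return theta, words


rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, words)
```

Small `alpha` values (below 1) tend to produce mixtures dominated by one topic, which matches the assumption that each document draws on only a few topics.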

In [26]:
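lda=LatentDirichletAllocation(n_components=10, learning_decay=0.7, max_iter=100, random_state=0, n_jobs=-1, verbose=1)  #   Creating an instance of the LatentDirichletAllocation class. Note: learning_decay only affects the online learning method; the default learning method is batch.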
lda.fit(count_data) #   Fitting the model to the document-term matrix.

iteration: 1 of max_iter: 100
iteration: 2 of max_iter: 100
iteration: 3 of max_iter: 100
iteration: 4 of max_iter: 100
iteration: 5 of max_iter: 100
iteration: 6 of max_iter: 100
iteration: 7 of max_iter: 100
iteration: 8 of max_iter: 100
iteration: 9 of max_iter: 100
iteration: 10 of max_iter: 100
iteration: 11 of max_iter: 100
iteration: 12 of max_iter: 100
iteration: 13 of max_iter: 100
iteration: 14 of max_iter: 100
iteration: 15 of max_iter: 100
iteration: 16 of max_iter: 100
iteration: 17 of max_iter: 100
iteration: 18 of max_iter: 100
iteration: 19 of max_iter: 100
iteration: 20 of max_iter: 100
iteration: 21 of max_iter: 100
iteration: 22 of max_iter: 100
iteration: 23 of max_iter: 100
iteration: 24 of max_iter: 100
iteration: 25 of max_iter: 100
iteration: 26 of max_iter: 100
iteration: 27 of max_iter: 100
iteration: 28 of max_iter: 100
iteration: 29 of max_iter: 100
iteration: 30 of max_iter: 100
iteration: 31 of max_iter: 100
iteration: 32 of max_iter: 100
iteration: 33 of 

In [27]:
#   Function to retrieve the top words for each topic.

def get_topics(model, feature_names, n_top_words):
    topics=[]

    #   Iterating through each topic.

    for topic in model.components_:
        topics.append(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))  #   Retrieving the top words for each topic.
        
    return topics
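
The slice `[:-n_top_words - 1:-1]` can look cryptic: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end, keeping the last `n_top_words` indices, largest first. The sketch below reproduces the idea with plain lists; the feature names and weights are hypothetical.

```python
# Hypothetical vocabulary and topic weights, for illustration only.
toy_feature_names = ["good", "know", "money", "tell", "want"]
toy_topic_weights = [3.0, 0.5, 1.0, 9.0, 7.0]
n_top = 3

# Indices sorted by ascending weight (a pure-Python stand-in for argsort).
ascending = sorted(range(len(toy_topic_weights)), key=toy_topic_weights.__getitem__)

# Walk backwards from the end, keeping the n_top largest, largest first.
top_indices = ascending[:-n_top - 1:-1]
top_words = [toy_feature_names[i] for i in top_indices]

print(top_words)
```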

In [28]:
n_top_words=3
topic_list=get_topics(lda, count_vectorizer.get_feature_names_out(), n_top_words)   #   Retrieving the top words for each topic.
topic_list

['tell want good',
 'know dont like',
 'make fuck money',
 'say talk mean',
 'ill hell mr',
 'im think oh',
 'youre right thats',
 'hes man let',
 'whats need thank',
 'yes kill sure']

In [29]:
topic_distribution=lda.transform(count_data)    #   Retrieving the topic distribution for each document.
topic_distribution

array([[0.05000905, 0.05      , 0.05      , ..., 0.05      , 0.54998198,
        0.05000897],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.54998747, 0.05      ,
        0.05      ],
       ...,
       [0.34999203, 0.01666763, 0.01666951, ..., 0.01666667, 0.01666667,
        0.01667485],
       [0.12222243, 0.01111111, 0.01111155, ..., 0.13408382, 0.34443742,
        0.332584  ],
       [0.01250337, 0.0125    , 0.15685496, ..., 0.0125    , 0.31044864,
        0.33956012]])
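
The recurring values 0.05 and 0.55 have a likely explanation. This is an interpretation rather than something the notebook states: it assumes scikit-learn's default document-topic prior of `1 / n_components` (0.1 here) and that `transform` returns the normalised document-topic parameters. Under those assumptions, a document with N counted tokens has parameters that sum to 1 + N, so a one-token document puts (0.1 + 1) / 2 = 0.55 on its dominant topic and 0.1 / 2 = 0.05 on every other topic. `expected_share` is a hypothetical helper for checking this arithmetic.

```python
def expected_share(tokens_on_topic, total_tokens, n_topics=10, prior=None):
    """Approximate topic share when tokens are assigned cleanly to topics."""
    if prior is None:
        prior = 1.0 / n_topics  # Assumed default document-topic prior.
    return (prior + tokens_on_topic) / (n_topics * prior + total_tokens)


one_token_main = expected_share(1, 1)     # Dominant topic of a one-token document.
one_token_other = expected_share(0, 1)    # Any other topic of that document.
three_token_main = expected_share(3, 3)   # All three tokens on one topic.
seven_token_other = expected_share(0, 7)  # Unused topic of a seven-token document.

print(one_token_main, one_token_other, three_token_main, seven_token_other)
```

These values line up with entries such as 0.549982, 0.050000, 0.774982, and 0.012500 in the output, which suggests that many documents contain only a handful of counted tokens and that their topic assignments rest on very little evidence.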

In [30]:
topic_dataframe=pd.DataFrame(topic_distribution, columns=topic_list)    #   Creating a pandas.DataFrame from the topic distribution.
topic_dataframe

Unnamed: 0,tell want good,know dont like,make fuck money,say talk mean,ill hell mr,im think oh,youre right thats,hes man let,whats need thank,yes kill sure
0,0.050009,0.050000,0.050000,0.050000,0.050000,0.050000,0.050000,0.050000,0.549982,0.050009
1,0.050000,0.050000,0.050000,0.050000,0.050000,0.050000,0.550000,0.050000,0.050000,0.050000
2,0.050000,0.050000,0.050000,0.050000,0.050005,0.050000,0.050007,0.549987,0.050000,0.050000
3,0.050000,0.549995,0.050000,0.050000,0.050000,0.050000,0.050005,0.050000,0.050000,0.050000
4,0.137499,0.012500,0.012501,0.012500,0.012500,0.012500,0.637500,0.012500,0.137500,0.012500
...,...,...,...,...,...,...,...,...,...,...
291829,0.025000,0.025000,0.025006,0.025000,0.025010,0.025000,0.025000,0.025000,0.774982,0.025002
291830,0.128865,0.148061,0.259302,0.007144,0.007143,0.007146,0.007143,0.278052,0.150000,0.007144
291831,0.349992,0.016668,0.016670,0.016668,0.349993,0.183335,0.016667,0.016667,0.016667,0.016675
291832,0.122222,0.011111,0.011112,0.011111,0.011116,0.011111,0.011111,0.134084,0.344437,0.332584


In [31]:
data=topic_dataframe.to_dict(orient="records")  #   Creating a list of dictionaries from the pandas.DataFrame.
data

[{'tell want good': 0.05000904963595088,
  'know dont like': 0.050000000000807523,
  'make fuck money': 0.05000000000100708,
  'say talk mean': 0.05000000000132095,
  'ill hell mr': 0.050000000001151616,
  'im think oh': 0.050000000001203526,
  'youre right thats': 0.05000000000119459,
  'hes man let': 0.050000000000937836,
  'whats need thank': 0.5499819807157692,
  'yes kill sure': 0.05000896964065685},
 {'tell want good': 0.05000000000012814,
  'know dont like': 0.05000000000010083,
  'make fuck money': 0.050000000000125826,
  'say talk mean': 0.05000000000016499,
  'ill hell mr': 0.05000000000014385,
  'im think oh': 0.05000000000015033,
  'youre right thats': 0.5499999999987552,
  'hes man let': 0.050000000000117166,
  'whats need thank': 0.050000000000147864,
  'yes kill sure': 0.050000000000165794},
 {'tell want good': 0.05000000000016822,
  'know dont like': 0.05000000000013236,
  'make fuck money': 0.050000000000165176,
  'say talk mean': 0.050000000000216614,
  'ill hell mr':

In [32]:
columns=topic_dataframe.columns.tolist()    #   Creating a list of the columns.

#   Creating a list of dictionaries containing the topic distribution for each document.

json_data=[
    {column: row[column] for column in columns} 
    for row in data
]

json_data

[{'tell want good': 0.05000904963595088,
  'know dont like': 0.050000000000807523,
  'make fuck money': 0.05000000000100708,
  'say talk mean': 0.05000000000132095,
  'ill hell mr': 0.050000000001151616,
  'im think oh': 0.050000000001203526,
  'youre right thats': 0.05000000000119459,
  'hes man let': 0.050000000000937836,
  'whats need thank': 0.5499819807157692,
  'yes kill sure': 0.05000896964065685},
 {'tell want good': 0.05000000000012814,
  'know dont like': 0.05000000000010083,
  'make fuck money': 0.050000000000125826,
  'say talk mean': 0.05000000000016499,
  'ill hell mr': 0.05000000000014385,
  'im think oh': 0.05000000000015033,
  'youre right thats': 0.5499999999987552,
  'hes man let': 0.050000000000117166,
  'whats need thank': 0.050000000000147864,
  'yes kill sure': 0.050000000000165794},
 {'tell want good': 0.05000000000016822,
  'know dont like': 0.05000000000013236,
  'make fuck money': 0.050000000000165176,
  'say talk mean': 0.050000000000216614,
  'ill hell mr':

In [33]:
#   Saving the list of dictionaries as a JSON file.

with open(r"../data/topic_modelling.json", "w") as f:
    json.dump(json_data, f)
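
As a possible next step, the saved records can be read back and reduced to a dominant topic per document. The sketch below writes and reads a small temporary file instead of the notebook's output path; the three records use rounded, made-up probabilities over a subset of the topic labels.

```python
import json
import os
import tempfile
from collections import Counter

# Toy records shaped like the notebook's output; values are illustrative.
toy_records = [
    {"tell want good": 0.05, "youre right thats": 0.55, "whats need thank": 0.05},
    {"tell want good": 0.35, "youre right thats": 0.02, "whats need thank": 0.18},
    {"tell want good": 0.02, "youre right thats": 0.03, "whats need thank": 0.77},
]

# Round-trip through a temporary JSON file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_topic_modelling.json")
    with open(path, "w") as f:
        json.dump(toy_records, f)
    with open(path) as f:
        loaded = json.load(f)

# The dominant topic is the label with the highest probability in each record.
dominant_topics = [max(record, key=record.get) for record in loaded]
topic_counts = Counter(dominant_topics)

print(dominant_topics)
print(topic_counts)
```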