## Results
We found that captions are more predictive of a user's engagement with a post than the topics depicted within the image: a model trained on captions alone reached 69% accuracy, compared with 63% for a model trained on image labels alone. This suggests Zara may be able to increase user response to posts by focusing on creating engaging captions.

The images yielding the greatest engagement featured models walking outside, wearing coats and dresses. Images that featured close-ups of faces, full-body shots of models sitting down wearing nice shoes, and models showcasing their bags were also associated with higher engagement.

## Our process:
1. Scraping Instagram: Extracting image URLs, captions, number of likes, and number of comments from the most recent 700 posts on the Zara Instagram account
2. Obtaining Image Labels From Google Cloud Vision: Accessing the Google Cloud Vision API to detect and classify photo labels for each image in our dataset
3. Measuring Engagement: Created a metric for engagement using a weighted sum of the number of likes and number of comments per post. We scaled each variable by dividing by its maximum value and assigned a weight of 0.4 to the number of likes and 0.6 to the number of comments. Comments were assigned a greater weight as they indicate the user engaged more actively with the post.

  A binary engagement score was assigned to each post with a value of 1 if the engagement score is above the median engagement score and a value of 0 if below. This allows our metric for engagement to be relative to the posts we scraped. 

4. Predicting Engagement: Used three models to predict engagement using TF-IDF scores. A TF-IDF score weights a word by how often it appears in one body of text, discounted by how many of the documents in the whole collection contain it. It is a helpful tool for finding key, relevant words within text documents. Using TF-IDF in this context helps us identify which labels or caption words distinguish a post and contribute to the user's engagement with it.

  ##### 1. Our first model predicted engagement using just image labels. 
  ##### 2. Our second model predicted engagement using just captions. 
  ##### 3. Our third model predicted engagement using both image labels and captions.

5. Topic modeling (LDA) on the image labels:
  
  LDA topic modeling identifies the hidden semantic structure in our image labels. It is a probabilistic, unsupervised approach that represents each document (here, the labels of one image) as a mixture of topics and each topic as a distribution over words. The topics are estimated iteratively by finding the topic assignments that make the labels observed for each image most probable.



## Scraping Instagram

### Extracting image URLs, captions, number of likes, and number of comments from the most recent 700 posts on the Zara Instagram account

In [None]:
# Imports and Installs
#!pip install instaloader
import instaloader
import pandas as pd 
import time
import os

In [None]:
# Accessing Zara Profile 
L = instaloader.Instaloader()
user_name = 'zara'
profile = instaloader.Profile.from_username(L.context, user_name)

# Converting to DF
posts_df = pd.DataFrame(columns=['num_comments', 'num_likes', 'caption', 'image_url', 'is_video'])

# URL of Zara's Instagram
url = 'https://www.instagram.com/p/{}/'

# Grabbing 700 posts 
posts = profile.get_posts()
posts_scrape = 700
number = 0 

In [None]:
# Scraping # of comments, # of likes, caption, image_url, is_video
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0
rows = []
for post in posts:
    if number >= posts_scrape:
        break
    number += 1
    rows.append({'num_comments': post.comments,
                 'num_likes': post.likes,
                 'caption': post.caption,
                 'image_url': post.url,
                 'is_video': post.is_video})

posts_df = pd.DataFrame(rows, columns=posts_df.columns)
# Storing the converted column (the original astype call discarded its result)
posts_df['is_video'] = posts_df['is_video'].astype('bool')

In [None]:
# Only grabbing posts by dropping any values where is_video column is true
posts_df = posts_df[posts_df['is_video'] == False]
# Dropping is_video column as it is no longer needed
posts_df = posts_df.drop(columns='is_video')
# Exporting as CSV
posts_df.to_csv('insta.csv', index=False)

In [None]:
posts_df = posts_df.reset_index(drop=True)
posts_df.head(5)

| | num_comments | num_likes | caption | image_url |
|---|---|---|---|---|
| 0 | 421 | 73609 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-dfw5-2.cdninstagram.com/v/t51... |
| 1 | 126 | 38690 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-dfw5-2.cdninstagram.com/v/t51... |
| 2 | 196 | 56404 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-dfw5-2.cdninstagram.com/v/t51... |
| 3 | 106 | 34923 | FW20 Campaign. Man Collection\nCreative Direct... | https://scontent-dfw5-2.cdninstagram.com/v/t51... |
| 4 | 189 | 34424 | FW20 Campaign. Man Collection\nCreative Direct... | https://scontent-dfw5-2.cdninstagram.com/v/t51... |


## Obtaining Image Labels From Google Cloud Vision

### Accessing the Google Vision API to detect and classify photo labels for each image in our dataset



In [None]:
#!pip install google-cloud-vision

In [None]:
from google.cloud import vision
import os

# Initialize environment variables to enable authentication with the Google Vision API.
# Replace this path with the location of your own service-account key file.
credential_path = r"C:\Users\india\Documents\text-assignment3-94cea931a4be.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credential_path

client = vision.ImageAnnotatorClient()
image = vision.Image() 

"""storing urls for each image and file names"""
file_name=[]
labels=[]
image_paths = []
counter=0

for each_img in posts_df.image_url:
  image_paths.append(each_img)
  file_name.append(str(posts_df.image_url.index[counter]+1) + ".jpg")
  counter+=1

In [None]:
def getlabelsforRemoteImage(uri):
    """Detects labels in the file located in Google Cloud Storage or on the
    Web."""
    labels_list=[]
    image.source.image_uri = uri

    response = client.label_detection(image=image)
    labels = response.label_annotations
    
    for label in labels:
        labels_list.append(label.description)
    print(uri)
    print(labels_list)
    return(labels_list)

"""grabbing photo labels for each image url in our dataset"""
for i in image_paths:
    labels.append(getlabelsforRemoteImage(i))

In [None]:
posts_df['labels']=pd.Series(labels) #storing labels in our dataframe


## Measuring Engagement

### We created a metric for engagement using a weighted sum of the number of likes and number of comments per post. We scaled each variable by dividing by its maximum value and assigned a weight of 0.4 to the number of likes and 0.6 to the number of comments. Comments were assigned a greater weight as they indicate the user engaged more actively with the post.

### A binary engagement score was assigned to each post with a value of 1 if the engagement score is above the median engagement score and a value of 0 if below. This allows our metric for engagement to be relative to the posts we scraped. 
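
As a rough illustration of the metric described above, the sketch below scores three made-up posts. The like and comment counts are hypothetical, and the scaling divides each count by the column maximum, mirroring the code cell in this section.

```python
from statistics import median

# Hypothetical (likes, comments) counts for three posts; not real Zara data
posts = [(50_000, 100), (100_000, 400), (20_000, 300)]

max_likes = max(likes for likes, _ in posts)
max_comments = max(comments for _, comments in posts)

# Weighted sum of the scaled counts: 0.4 for likes, 0.6 for comments
scores = [0.4 * likes / max_likes + 0.6 * comments / max_comments
          for likes, comments in posts]

# A post is labelled 1 (high engagement) only if it is strictly above the median
threshold = median(scores)
engagement = [1 if score > threshold else 0 for score in scores]

print([round(score, 3) for score in scores])
print(engagement)
```

Note that the third post has fewer likes than the first but scores higher because comments carry more weight. With an odd number of posts, the median post itself is labelled 0, so the two classes are only approximately balanced.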


In [4]:
# Scaling Likes and Comments
posts_df['scaled_likes'] = posts_df['num_likes'] / posts_df['num_likes'].max()
posts_df['scaled_comms'] = posts_df['num_comments'] / posts_df['num_comments'].max()

# Creating Engagement Score
posts_df['engagement_score'] = .4 * posts_df['scaled_likes'] + .6 * posts_df['scaled_comms']

# Qualitative 'High' or 'Low'
def engagement_qual(eng_score):
    if eng_score > posts_df['engagement_score'].median():
        return 'High'
    else:
        return 'Low'

# Binary 1 or 0 to perform other tasks
def engagement_bin(eng_score):
    if eng_score > posts_df['engagement_score'].median():
        return 1
    else:
        return 0

In [5]:
#Creating Engagement Columns
posts_df['engagement'] = posts_df['engagement_score'].apply(engagement_bin)
posts_df['engagement_qual'] = posts_df['engagement_score'].apply(engagement_qual)

In [8]:
posts_df.head()

| | num_comments | num_likes | caption | image_url | labels | scaled_likes | scaled_comms | engagement_score | engagement | engagement_qual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 405 | 73222 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-iad3-1.cdninstagram.com/v/t51... | ['Clothing', 'Fashion', 'Outerwear', 'Fur', 'S... | 0.249823 | 0.124769 | 0.174791 | 1 | High |
| 1 | 124 | 38537 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-iad3-1.cdninstagram.com/v/t51... | ['Sky', 'Darkness', 'Room', 'Adventure game', ... | 0.131483 | 0.038201 | 0.075514 | 0 | Low |
| 2 | 194 | 56141 | FW20 Campaign. Kids Collection\nCreative Direc... | https://scontent-iad3-1.cdninstagram.com/v/t51... | ['Cool', 'Fashion', 'Jeans', 'Sitting', 'Denim... | 0.191545 | 0.059766 | 0.112478 | 0 | Low |
| 3 | 104 | 34764 | FW20 Campaign. Man Collection\nCreative Direct... | https://scontent-iad3-1.cdninstagram.com/v/t51... | ['Hair', 'Face', 'Hairstyle', 'Eyebrow', 'Fore... | 0.11861 | 0.032039 | 0.066668 | 0 | Low |
| 4 | 187 | 34312 | FW20 Campaign. Man Collection\nCreative Direct... | https://scontent-iad3-1.cdninstagram.com/v/t51... | ['Snapshot', 'Standing', 'Hand', 'Arm', 'Human... | 0.117068 | 0.057609 | 0.081393 | 0 | Low |


In [None]:
posts_df.to_csv('insta_labels.csv', index=False) #saving to csv

## Predicting Engagement

### We used three models to predict engagement using TF-IDF scores. A TF-IDF score weights a word by how often it appears in one body of text, discounted by how many of the documents in the whole collection contain it. It is a helpful tool for finding key, relevant words within text documents. Using TF-IDF in this context helps us identify which labels or caption words distinguish a post and contribute to the user's engagement with it.

### 1. Our first model predicted engagement using just image labels. 
### 2. Our second model predicted engagement using just captions. 
### 3. Our third model predicted engagement using both image labels and captions.

### The confusion matrix and classification report for each model are shown below.
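
To make the TF-IDF idea concrete, here is a small standard-library sketch on three invented label strings. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, but it is a simplified illustration and not a replacement for the vectorizer (there is no stop-word removal, and tokens are split on whitespace only). The documents and the helper name `tfidf` are assumptions made for the example.

```python
import math
from collections import Counter

# Three invented "documents" of image labels (hypothetical, for illustration only)
docs = ["fashion coat street", "fashion dress", "fashion coat bag"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
           for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

In the first document, 'street' receives the largest weight because it appears in no other document, while 'fashion', which appears everywhere, receives the smallest. This is the property that lets the classifier focus on distinguishing words.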

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


def predicting_eng(X,y):
  """ logistic regression model using 80% train, 20% test split"""
  X_train, X_test, y_train,y_test = train_test_split(X,y,test_size= 0.2, random_state=1)

  # initialize TfidfVectorizer and fit it on the training data only
  tfidf_vectorizer = TfidfVectorizer(stop_words='english',use_idf=True)
  fitted_vectorizer=tfidf_vectorizer.fit(X_train)
  X_train_tfidf=fitted_vectorizer.transform(X_train)
  X_test_tfidf = tfidf_vectorizer.transform(X_test)

  # fitting logistic regression model to data
  model = LogisticRegression()
  model.fit(X_train_tfidf,y_train)

  # predicting on test data
  y_fitted = model.predict(X_test_tfidf)

  # confusion matrix
  print(confusion_matrix(y_test,y_fitted))

  # classification report
  print(classification_report(y_test,y_fitted))


### Predicting engagement using only image labels

In [None]:
X = posts_df['labels'].astype(str) # TfidfVectorizer expects strings, so label lists are converted to text
y = posts_df['engagement']
print("Confusion Matrix and Classification Report: ")
predicting_eng(X,y)

Confusion Matrix and Classification Report: 
[[32 25]
 [17 40]]
              precision    recall  f1-score   support

           0       0.65      0.56      0.60        57
           1       0.62      0.70      0.66        57

    accuracy                           0.63       114
   macro avg       0.63      0.63      0.63       114
weighted avg       0.63      0.63      0.63       114



### Predicting engagement using only captions

In [None]:
X = posts_df['caption']
y = posts_df['engagement']
print("Confusion Matrix and Classification Report: ")
predicting_eng(X,y)

Confusion Matrix and Classification Report: 
[[38 19]
 [16 41]]
              precision    recall  f1-score   support

           0       0.70      0.67      0.68        57
           1       0.68      0.72      0.70        57

    accuracy                           0.69       114
   macro avg       0.69      0.69      0.69       114
weighted avg       0.69      0.69      0.69       114



### Predicting engagement using a combination of captions and image labels

In [None]:
posts_df['caption_labels'] = posts_df['caption'] + ' ' + posts_df['labels'].astype(str) # labels converted to text before concatenating
X = posts_df['caption_labels']
y = posts_df['engagement']
print("Confusion Matrix and Classification Report: ")
predicting_eng(X,y)

Confusion Matrix and Classification Report: 
[[35 22]
 [14 43]]
              precision    recall  f1-score   support

           0       0.71      0.61      0.66        57
           1       0.66      0.75      0.70        57

    accuracy                           0.68       114
   macro avg       0.69      0.68      0.68       114
weighted avg       0.69      0.68      0.68       114



### Interpreting Our Models

The engagement score was calculated by taking a weighted sum of the scaled number of likes and comments on a post. We set the binary engagement label to 1 if the user's level of engagement with a post was greater than the median level for all posts, and to 0 otherwise. This allowed our dataset to be balanced between the two engagement classes.

When predicting engagement using only image labels, 65% of the posts the model predicted as low engagement (0) and 62% of the posts it predicted as high engagement (1) were classified correctly (precision), for an overall accuracy of 63%. Using only captions, precision rose to 70% for low engagement and 68% for high engagement, with 69% accuracy. Combining the two gave 71% precision for low engagement and 66% for high engagement, with 68% accuracy.

From these results, we can infer that captions predict engagement better than image labels: accuracy on our 114-post test set rose from 63% to 69%. Adding labels to captions raised precision for posts with low engagement only slightly (70% to 71%) and lowered it for posts with high engagement (68% to 66%). This suggests that captions are more influential on a user's engagement with a post than the topics depicted within the image. Zara may be able to increase user response to posts by focusing on creating engaging captions.
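
The figures in the classification reports can be reproduced by hand from the confusion matrices, which helps keep precision and recall apart. The sketch below uses the label-only matrix printed earlier; rows are actual classes and columns are predicted classes, following scikit-learn's convention. The `metrics` dictionary is just a name chosen for this example.

```python
# Confusion matrix from the label-only model (rows = actual, columns = predicted)
matrix = [[32, 25],
          [17, 40]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

metrics = {}
for cls in (0, 1):
    true_pos = matrix[cls][cls]
    predicted_as_cls = matrix[0][cls] + matrix[1][cls]  # column sum
    actually_cls = sum(matrix[cls])  # row sum
    metrics[cls] = {"precision": true_pos / predicted_as_cls,
                    "recall": true_pos / actually_cls}
    print(f"class {cls}: precision={metrics[cls]['precision']:.2f} "
          f"recall={metrics[cls]['recall']:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Precision answers "of the posts predicted as this class, how many were right?", while recall answers "of the posts that truly belong to this class, how many did the model find?". For the label-only model, 65% is the precision for class 0, whereas its recall is 56%.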

## Topic modeling (LDA) on the image labels.

#### LDA topic modeling identifies the hidden semantic structure in our image labels. It is a probabilistic, unsupervised approach that represents each document (here, the labels of one image) as a mixture of topics and each topic as a distribution over words. The topics are estimated iteratively by finding the topic assignments that make the labels observed for each image most probable.


In [None]:
!pip install pyLDAvis 
!pip install gensim

In [7]:
import ast
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer 
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')
import pyLDAvis
import pyLDAvis.gensim

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Pre-Processing Image Labels

### We cleaned our image labels by removing stopwords and lemmatizing the words. Lemmatizing reduces a word to its dictionary base form (lemma). For example, it converts 'is' and 'are' to 'be', and 'walks' and 'walking' to 'walk'. Our code treats every token as a verb, which is why 'Clothing' appears as 'clothe' in the output. This helps us compare topics more easily when we apply LDA.

In [8]:
def lemmatize(text):
  ''' returns lemmatized root word of all words'''
  return WordNetLemmatizer().lemmatize(text, pos='v')
    
def preprocess(text):
  ''' iterates through each word in the text, removes all stopwords, and returns a list of all lemmatized words'''
  result = []
  for token in gensim.utils.simple_preprocess(text):
      if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
          result.append(lemmatize(token))
  return result

In [9]:
clean_labels = posts_df['labels'].map(preprocess)


### Building a Bag of Words

### We will run our topic modeling using a bag of words approach. This creates a list of words and their corresponding frequencies. We will soon feed this to our LDA model so that the model can identify which photo labels are most important. 
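
A minimal standard-library sketch of what a bag-of-words corpus looks like, assuming two invented documents of cleaned labels. gensim's `Dictionary` and `doc2bow` do the real work in the next cell; the `token_to_id` mapping and `to_bow` helper here are hypothetical stand-ins that only mimic the shape of the output.

```python
from collections import Counter

# Two invented documents of cleaned labels (hypothetical)
docs = [["fashion", "coat", "fashion"], ["coat", "bag"]]

# Assign each distinct token an integer id in order of first appearance
token_to_id = {}
for doc in docs:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

bow_corpus_demo = [to_bow(doc) for doc in docs]
print(token_to_id)
print(bow_corpus_demo)
```

Word order is discarded: each document is reduced to which tokens occur and how often, which is the input LDA expects.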


In [10]:
# Mapping each distinct label token to an integer id
dictionary = gensim.corpora.Dictionary(clean_labels)
# Dropping tokens that appear in fewer than 7 images or in more than 50% of images, keeping at most 100,000 tokens
dictionary.filter_extremes(no_below=7, no_above=0.5, keep_n=100000)
# Creating the bag-of-words corpus: for each image, a list of (token id, count) tuples
bow_corpus = [dictionary.doc2bow(doc) for doc in clean_labels]

### Running LDA On Image Labels To Identify Significant Topics

In [None]:
np.random.seed(2020)

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=8, id2word=dictionary, passes=8, workers=2) #setting up LDA model using our bag of words and dictionary for topics 

In [13]:
import re

def clean_topics(string):
  ''' returns cleaned topics by removing any punctuation and splitting on spaces to format into a list'''
  string = re.sub('[^A-Za-z ]+','', string)
  words = list(string.split("  "))
  return words

topic_words={} #dictionary to store words in each topic
for topic, word in lda_model.show_topics():
    topic_words[topic]=clean_topics(word) #filling in dictionary with cleaned topic for each topic number
    print('Topic Number:',topic,'\nWords:',topic_words[topic])

Topic Number: 0 
Words: ['hair', 'beauty', 'hairstyle', 'skin', 'long', 'child', 'face', 'model', 'chin', 'photography']
Topic Number: 1 
Words: ['photography', 'black', 'white', 'monochrome', 'fashion', 'photograph', 'stand', 'blue', 'denim', 'jeans']
Topic Number: 2 
Words: ['leg', 'human', 'sit', 'photography', 'footwear', 'joint', 'fashion', 'shoulder', 'shoe', 'arm']
Topic Number: 3 
Words: ['design', 'eyewear', 'uniform', 'glass', 'cool', 'sunglasses', 'room', 'shoe', 'product', 'sandal']
Topic Number: 4 
Words: ['clothe', 'neck', 'shoulder', 'shirt', 'sleeve', 'blue', 'fashion', 'outerwear', 'jeans', 'white']
Topic Number: 5 
Words: ['fashion', 'clothe', 'model', 'coat', 'outerwear', 'beauty', 'shoot', 'photo', 'shoulder', 'bag']
Topic Number: 6 
Words: ['fashion', 'clothe', 'outerwear', 'dress', 'shoulder', 'model', 'sleeve', 'design', 'white', 'neck']
Topic Number: 7 
Words: ['fashion', 'outerwear', 'wear', 'formal', 'suit', 'photography', 'blazer', 'clothe', 'vehicle', 'stand']

### Interpreting Topics

The labels that were given the most weight appear at the beginning of each list. This helps us identify which labels are most characteristic of each topic.

Topic 0 is related to images that focus on the head, emphasizing the face, hair, and skin. 

Topic 1 captures black and white photography featuring jeans. 

Topic 2 appears to be capturing full body images of someone sitting down, highlighting their footwear. 

Topic 3 is focused on accessory products: glasses and shoes.

Topic 4 seems related to images that feature white tops and blue jeans.

Topic 5 captures models dressed up for walking outside, wearing coats while holding bags.

Topic 6 is very similar to topic 5, but has more of a focus on images that contain dresses as opposed to bags.

Topic 7 captures images of models in work outfits, wearing suits and blazers.

### Looking into the weight given to each topic for each image

In [None]:
data_labels = posts_df[['labels']].copy() #copying so that new columns are not written to a slice of posts_df
data_labels['index'] = data_labels.index

#grabbing topic name and associated weight for each image
for i in range(0,len(bow_corpus)):
    for index, score in sorted(lda_model[bow_corpus[i]]):
        arr = "Topic "+ str(index)
        data_labels.loc[i,arr]= score


In [46]:
#Displaying topic weights for each image's labels
data_labels[1:10]

| | labels | index | Topic 2 | Topic 5 | Topic 6 | Topic 0 | Topic 1 | Topic 3 | Topic 4 | Topic 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ['Sky', 'Darkness', 'Room', 'Adventure game', ... | 1 | 0.031338 | 0.031256 | 0.031272 | 0.031292 | 0.031441 | 0.780802 | 0.03125 | 0.03135 |
| 2 | ['Cool', 'Fashion', 'Jeans', 'Sitting', 'Denim... | 2 | 0.015666 | 0.015646 | 0.015644 | 0.015633 | 0.71856 | 0.187564 | 0.015646 | 0.01564 |
| 3 | ['Hair', 'Face', 'Hairstyle', 'Eyebrow', 'Fore... | 3 | 0.011374 | 0.011366 | 0.011364 | 0.920424 | 0.011365 | 0.011365 | 0.011375 | 0.011366 |
| 4 | ['Snapshot', 'Standing', 'Hand', 'Arm', 'Human... | 4 | 0.466011 | 0.012508 | 0.012504 | 0.012526 | 0.458934 | 0.012502 | 0.012501 | 0.012514 |
| 5 | ['Clothing', 'Fashion', 'Outerwear', 'Beige', ... | 5 | 0.013891 | 0.01391 | 0.702047 | 0.013894 | 0.013894 | 0.013899 | 0.013897 | 0.214567 |
| 6 | ['People in nature', 'Photograph', 'Black-and-... | 6 | NaN | NaN | NaN | NaN | 0.932644 | NaN | NaN | NaN |
| 7 | ['Photograph', 'Standing', 'People', 'Suit', '... | 7 | NaN | NaN | NaN | NaN | 0.471041 | NaN | NaN | 0.478937 |
| 8 | ['Cool', 'Human', 'Font', 'Fur', 'Outerwear', ... | 8 | 0.010457 | 0.010435 | 0.398883 | 0.010471 | 0.01045 | 0.538442 | 0.010432 | 0.01043 |
| 9 | ['Photograph', 'Clothing', 'Formal wear', 'Sta... | 9 | NaN | NaN | NaN | NaN | 0.21049 | NaN | NaN | 0.731772 |


### Identifying the most dominant topic for each image

In [31]:

def top_topics(lda_model=None, corpus=bow_corpus, texts=clean_labels):
  ''' returns best topic match for each document and the corresponding score'''

  rows = [] #collecting one row per document (DataFrame.append was removed in pandas 2.0)

  # Get main topic in each document
  for i, row_list in enumerate(lda_model[corpus]):
      row = row_list[0] if lda_model.per_word_topics else row_list
      # Sort topics by weight so that the dominant topic comes first
      row = sorted(row, key=lambda x: (x[1]), reverse=True)
      # Get the Dominant topic, Perc Contribution and Keywords for each document
      topic_num, prop_topic = row[0]
      wp = lda_model.show_topic(topic_num)
      topic_keywords = ", ".join([word for word, prop in wp])
      rows.append([int(topic_num), round(prop_topic,4), topic_keywords])

  topics = pd.DataFrame(rows, columns=['Dominant Topic', 'Percent Contribution', 'Keywords'])

  # Add original text to the end of the output
  contents = pd.Series(texts)
  topics = pd.concat([topics, contents], axis=1)
  return topics

In [32]:
dominant_topics = top_topics(lda_model, corpus=bow_corpus, texts=clean_labels)

In [33]:
dominant_topics

| | Dominant Topic | Percent Contribution | Keywords | labels |
|---|---|---|---|---|
| 0 | 6.0 | 0.6099 | fashion, clothe, outerwear, dress, shoulder, m... | [clothe, fashion, outerwear, fur, street, fash... |
| 1 | 3.0 | 0.7808 | design, eyewear, uniform, glass, cool, sunglas... | [sky, darkness, room, adventure, game, music, ... |
| 2 | 1.0 | 0.7185 | photography, black, white, monochrome, fashion... | [cool, fashion, jeans, sit, denim, shoe, style] |
| 3 | 0.0 | 0.9204 | hair, beauty, hairstyle, skin, long, child, fa... | [hair, face, hairstyle, eyebrow, forehead, chi... |
| 4 | 2.0 | 0.4662 | leg, human, sit, photography, footwear, joint,... | [snapshot, stand, hand, arm, human, photograph... |
| ... | ... | ... | ... | ... |
| 565 | 6.0 | 0.8138 | fashion, clothe, outerwear, dress, shoulder, m... | [clothe, fashion, fashion, model, footwear, fa... |
| 566 | 0.0 | 0.8120 | hair, beauty, hairstyle, skin, long, child, fa... | [hair, face, hairstyle, beauty, lip, skin, fas... |
| 567 | 7.0 | 0.4669 | fashion, outerwear, wear, formal, suit, photog... | [leg, footwear, street, fashion, photography, ... |
| 568 | 5.0 | 0.3376 | fashion, clothe, model, coat, outerwear, beaut... | [pink, fashion, street, fashion, outerwear, gl... |
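
The core of the dominant-topic step is picking the highest-weighted entry from each image's topic distribution. A small sketch, using the weights reported for image 3 in the topic-weight table earlier, written in the (topic id, weight) pair format that gensim returns:

```python
# Topic distribution for image 3, copied from the topic-weight table
distribution = [(0, 0.920424), (1, 0.011365), (2, 0.011374), (3, 0.011365),
                (4, 0.011375), (5, 0.011366), (6, 0.011364), (7, 0.011366)]

# The dominant topic is the pair with the largest weight
dominant_topic, weight = max(distribution, key=lambda pair: pair[1])
print(dominant_topic, round(weight, 4))  # prints: 0 0.9204
```

This agrees with row 3 of the dominant-topic output, where Topic 0 (face, hair and skin) accounts for about 92% of the image's labels.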


### Engagement scores vs Topic Weights: Splitting the engagement scores into quartiles to identify the highest and lowest score quartiles. We then analyzed the proportion of each topic within each quartile.

In [47]:
data_labels['image_URL'] = posts_df['image_url']
cols = ['image_URL','labels','index','Topic 0','Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7']
data_labels[cols].to_csv('topics_and_weights.csv')
data_labels['engagement_score'] = posts_df['engagement_score']
data_labels.engagement_score.quantile([0.25,0.5,0.75])
#identifying engagement scores for each quartile

0.25    0.095768
0.50    0.132158
0.75    0.198502
Name: engagement_score, dtype: float64

In [40]:
#Selecting the lowest and highest quartiles of engagement scores
quart_25 = data_labels[data_labels['engagement_score']<0.095768] #lowest
quart_75 = data_labels[data_labels['engagement_score']>0.198502] #highest

#Average of topic weights in each of these quartiles
avg_low = pd.DataFrame(quart_25[['Topic 0','Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7']].mean())
avg_low = avg_low.transpose()

avg_high = pd.DataFrame(quart_75[['Topic 0','Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7']].mean())
avg_high = avg_high.transpose()

In [42]:
eng_by_topic = pd.concat([avg_low, avg_high], ignore_index=True) #DataFrame.append was removed in pandas 2.0
eng_by_topic['Engagement_levels'] = ['Low Engagement','High Engagement']
eng_by_topic.set_index('Engagement_levels',inplace=True)
eng_by_topic

| Engagement_levels | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 |
|---|---|---|---|---|---|---|---|---|
| Low Engagement | 0.159203 | 0.185159 | 0.12691 | 0.105928 | 0.08269 | 0.093169 | 0.180964 | 0.170946 |
| High Engagement | 0.177665 | 0.097137 | 0.172274 | 0.048284 | 0.139917 | 0.156746 | 0.274314 | 0.100644 |
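
One way to read this table is to subtract the low-engagement average from the high-engagement average for each topic. The sketch below does that with the values reported above; a positive difference means the topic is more prominent in the top quartile. It is a descriptive comparison only, not a significance test.

```python
# Average topic weights copied from the table above (Topic 0 to Topic 7)
low = [0.159203, 0.185159, 0.12691, 0.105928, 0.08269, 0.093169, 0.180964, 0.170946]
high = [0.177665, 0.097137, 0.172274, 0.048284, 0.139917, 0.156746, 0.274314, 0.100644]

# Difference per topic (high minus low), sorted from most to least positive
diffs = {f"Topic {i}": round(h - l, 4) for i, (l, h) in enumerate(zip(low, high))}
ranked = sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

for topic, diff in ranked:
    print(f"{topic}: {diff:+.4f}")
```

Topics 6 and 5 (dresses, coats and bags) gain the most in the high-engagement quartile, followed by Topics 4, 2 and 0, while Topics 1, 7 and 3 are more prominent among low-engagement posts. This matches the image findings summarised in the Results section.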


### LDA Visualisation: a fun interactive visual to explore the topics and identify the differences between them

In [43]:
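pyLDAvis.enable_notebook()
# Note: pyLDAvis 3.x renamed this module to pyLDAvis.gensim_models; adjust the import and this call if you use a newer version
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
vis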