# Topic modeling on grphclus elyzée communities
The data concerns Twitter profiles active during the campaign (from November 2016 to May 2017), their corresponding tweets and retweets, and the retweet and mention networks related to these profiles. Each community is a set of Twitter users and belongs to one level of a community tree. There is one community tree per timestamp.

Throughout this notebook, we
 - Retrieve and preprocess the tweets so that they are ready to be fed into the topic modeling algorithm.
 - Retrieve the communities and extract the tweets from them.
 - Build topic models for a given level of a timestamp
 - Build keyword summarization models for a given community
 - Study topics of a topic model
 - Study community topic distributions
 - Display results as wordclouds
 
NOTE: cells that are in comments (# or """...""") are examples; feel free to uncomment them to try out a model.


***

  ## <b>1.</b> <a href="#1">Retrieve necessary data</a>
  
  
 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>1.1.</b>  <a href="#1-1">Retrieve tweets</a>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>1.2.</b>  <a href="#1-2">Retrieve tweets for each user</a>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>1.3.</b>  <a href="#1-3">Read pickle file containing the communities for the elyzée dataset</a>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>1.4.</b>  <a href="#1-4">Retrieve all users connections</a>


 
 ## <b>2.</b> <a href="#2"> Extract tweets from communities</a>
 
 ## <b>3.</b> <a href="#3">Build NLP Models</a>
 
 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>3.1.</b>  <a href="#3-1">LDA Topic Modeling</a>

 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>3.2.</b>  <a href="#3-2">Keyword Summarization</a>

 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>3.3.</b>  <a href="#3-3">Check parameters and choose model</a>

 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>3.4.</b>  <a href="#3-4">Build a model</a>
 
 
 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>3.5.</b>  <a href="#3-5">Update a model</a>


 ## <b>4.</b> <a href="#4">Study purely the topics</a>
 
 
 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>4.1.</b>  <a href="#4-1">Show topics</a>
  
 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>4.2.</b>  <a href="#4-2">Get unique words for each topic</a>

 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>4.3.</b>  <a href="#4-3">Intertopic distance map</a>


 
 ## <b>5.</b> <a href="#5">Study the topic distributions of communities</a>
 
 ## <b>6.</b> <a href="#6">Visualization</a>


 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>6.1.</b>  <a href="#6-1">Produce community topic distribution Word Clouds</a>


 ### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>6.2.</b>  <a href="#6-2">Visualize keyword summarization (word cloud)</a>

 

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Necessary libraries

In [2]:
import json
import pickle
import re

# removing stopwords from tweets
from gensim.parsing.preprocessing import preprocess_string, STOPWORDS
from stop_words import get_stop_words

# for comparing dates
from datetime import datetime

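# gensim
from gensim import corpora, models, similarities
import pyLDAvis
import pyLDAvis.gensim  # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
from gensim.summarization import keywords  # gensim.summarization was removed in gensim 4.0, so this requires gensim < 4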

# parameter tuning
from gensim.models import CoherenceModel
from collections import defaultdict 
import numpy as np
import matplotlib.pyplot as plt

# word clouds
from wordcloud import WordCloud

# 1. Retrieve necessary data
<div id='1'></div>

## 1.1. Retrieve tweets
<div id='1-1'></div>
(See preProcessTweets.ipynb)

In [3]:
with open('jtweets.json', 'r') as fp:
    jtweets = json.load(fp)

In [4]:
allPreProcessedTweets = [jtweet['txt_pp'] for jtweet in jtweets]

## 1.2. Retrieve tweets for each user
<div id='1-2'></div>
(See getAllUsersTweets.ipynb)

In [5]:
with open("allUsersTweets.json","r") as fp:
    allUsersTweets = json.load(fp)

## 1.3. Read pickle file containing the communities for the elyzée dataset
<div id='1-3'></div>
It consists of a graph containing the Twitter users with their tweets, and a dictionary where each key is a date and each value is a community tree.

The community tree is another graph, where communities are organised by hierarchical level. In the tree, each node is a Twitter user (level 0) or a community (level>0).

The children of a node (which is thus a community id) are the sub-communities belonging to that node 
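
To make this structure concrete, here is a minimal sketch of such a tree. It uses plain dictionaries as a stand-in for the graph object stored in the pickle, and the node names as well as the `levels` and `links` dictionaries are invented for illustration.

```python
# Toy community tree. Plain dicts stand in for the real graph object (assumption):
# `levels` maps each node to its level, `links` maps each community to its direct children.
levels = {"u1": 0, "u2": 0, "u3": 0, "c1": 1, "c2": 1, "root": 2}
links = {"c1": ["u1", "u2"], "c2": ["u3"], "root": ["c1", "c2"]}

def leaves_below(node):
    """Return the Twitter users (level-0 nodes) found below `node`."""
    children = links.get(node, [])
    if not children:          # a node without children is a user
        return [node]
    users = []
    for child in children:
        users += leaves_below(child)
    return users

# Communities of level 1, each mapped to its users
level_1 = {n: leaves_below(n) for n, lvl in levels.items() if lvl == 1}
print(level_1)                 # {'c1': ['u1', 'u2'], 'c2': ['u3']}
print(leaves_below("root"))    # ['u1', 'u2', 'u3']
```

The recursion follows the same idea as the helper functions of this section: walk down from a community until nodes without children (the users) are reached.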

In [6]:
import pickle

# Path to the pickled communities (adapt it to your environment)
pickle_file = "/home/pgay/grphclus_stuff/elyzee_communities.pck" 
# Other locations used previously:
#"/home/pgay/grphclus4py/data/elyzee_communities.pck"
#"/home/pgay/go/src/grphclus/elyzee_communities.pck"
with open(pickle_file, 'rb') as fp:
    data = pickle.load(fp)


def aggregate_childrens(community_tree, level):
    """
    create a dictionary given the community tree and a level
    all the keys are the community ids of the level "level" 
    and the values are the twitter users belonging to this community
    """
    communities = {}
    if level == 0:
        return {}
    com_id_this_level = [ n for n in community_tree.nodes if community_tree.nodes[n]['level'] == level ]
    for n in com_id_this_level:
        communities[n] = get_all_leaves(community_tree, n)
    return communities

def get_childrens(G, node_id):
    """
    get the children : the nodes which are directly below in the graph
    """
    return [ n for n in G.neighbors(node_id) if G.nodes[n]['level'] < G.nodes[node_id]['level'] ]

def get_all_leaves(community_tree, n):
    """
    get the leaves (i.e. the twitter user) which are below a given community id
    """
    leaves = []
    childrens = get_childrens(community_tree, n)
    if len(childrens) == 0:
        leaves.append(n)
    for child in childrens:  # reuse the list computed above
        leaves += get_all_leaves(community_tree, child)
    return leaves

## 1.4. Retrieve all users connections
<div id='1-4'></div>
(See getAllUsersConnections.ipynb)

In [7]:
with open('allUsersConnections.json', 'r') as fp:
    allUsersConnections = json.load(fp)

# 2. Extract tweets from communities
<div id='2'></div>
The next set of functions allows us to extract the tweets from communities. Each community represents a set of users, so for each user we retrieve their tweets in order to construct the set of tweets of a community.

The data is structured as follows: (community $\subset$ level $\subset$ community tree / timestamp). We are interested in constructing topic models for a given level, so more precisely we construct a set of tweets for each community of a given level. It is also possible to retrieve the tweets of the users only, or of the users and their connections.
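
This nesting can be illustrated on toy data. The sketch below assumes two hypothetical dictionaries, `communities` (community id to user ids) and `user_tweets` (user id to tokenized tweets); neither is the notebook's real data.

```python
# Hypothetical toy data: one level containing two communities
communities = {"c1": ["u1", "u2"], "c2": ["u3"]}
user_tweets = {
    "u1": [["vote", "macron"], ["debat", "tv"]],
    "u2": [["meeting", "lyon"]],
    "u3": [["sondage", "second", "tour"]],
}

# Tweets of a community = tweets of all of its users, in user order
level_tweets = {
    com: [tweet for user in users for tweet in user_tweets.get(user, [])]
    for com, users in communities.items()
}
print(len(level_tweets["c1"]))  # 3
print(level_tweets["c2"])       # [['sondage', 'second', 'tour']]
```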

In [8]:
def getLevelsFromTimeStamp(timeStamp):
    """
    get the levels of a community tree given a timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    
    :type timeStamp: str
    
    :return: a set of levels (set of ints)
    :rtype: set
    
    """
    communityTree = data['communities'][timeStamp]['community_tree']
    
    return {communityTree.nodes[n]['level'] for n in communityTree.nodes}

In [9]:
def getCommunitiesFromLevel(timeStamp,level):
    """
    get the communities of a level given the timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    
    :type timeStamp: str
    :type level: int
    
    :return: a dictionary where keys are community IDs and values are lists of userIDs 
    :rtype: dict
    
    """
    return aggregate_childrens(data['communities'][timeStamp]['community_tree'], level)

In [10]:
def getUserIDsFromCommunity(timeStamp, level, community, communitiesDict):
    """
    get all the user IDs from community 'community' given a timestamp and level
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: is the community id (ex: b'-10000875')
    :param communitiesDict: getCommunitiesFromLevel(timeStamp,level)

    
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type communitiesDict: dict
    
    :return: list of userIDs  
    :rtype: list
    
    """
    # Do not call getCommunitiesFromLevel(timeStamp,level) again here: the caller passes its result as communitiesDict.
    # Keep only the all-digit tokens of each raw ID, then remove duplicates while preserving order.
    return list(dict.fromkeys(["".join([str(int(s)) for s in userID.split() if s.isdigit()]) for userID in communitiesDict[community]]))

In [11]:
def getAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, communitiesDict):
    """
    get all user id's and user connections id's from a community given a timestamp, level and community
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: is the community id (ex: b'-10000875')
    :param communitiesDict: getCommunitiesFromLevel(timeStamp,level)

    
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type communitiesDict: dict
    
    :return: list of allIDs
    :rtype: list
    """
    
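    userIDs = getUserIDsFromCommunity(timeStamp,level,community,communitiesDict)
    # KNOWN BUG (to fix): the test below is never true for a real user ID, so connections
    # are currently never added and this function returns the same IDs as getUserIDsFromCommunity().
    # Note also that append() would add the whole connections entry as a single element,
    # and that the list is modified while it is being iterated over.
    for userID in userIDs:
        if userID == ' ':
            userIDs.append(allUsersConnections[userID])
    return list(dict.fromkeys(userIDs))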
    

In [12]:
def getTweetsFromAllUsersFromCommunity(timeStamp, level, community, dateMargin, communitiesDict):
    """
    get all tweets from all users (not connections) from community given a level and timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: is the community id (ex: b'-10000875')
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param communitiesDict: getCommunitiesFromLevel(timeStamp,level)

    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type dateMargin: int
    :type communitiesDict: dict
    
    :return: - list of lists. Each sublist is preprocessed tweet tokens. ex: [["word1","word2"],["ok","help","nice"],...]
             - string of all raw tweets (each tweet is separated by a full stop (".")).

    :rtype: (list,str)
    """
    
    allUserIDs = getUserIDsFromCommunity(timeStamp, level, community, communitiesDict)
    dateFormat = "%a %b %d %H:%M:%S %z %Y"
    refDate = datetime.strptime(timeStamp, dateFormat)  # parse the reference date only once
    ppTweets = []
    rawTweets = []
    
    for userID in allUserIDs:
        # keep the tweets whose age relative to timeStamp is below dateMargin days
        # (tweets published after timeStamp have a negative age and also pass this test)
        for tweet in allUsersTweets[userID]:
            if (refDate - datetime.strptime(tweet['created_at'], dateFormat)).days < dateMargin:
                ppTweets.append(tweet['txt_pp'])
                rawTweets.append(tweet['txt_ori'])
        
    # raw tweets are de-duplicated (first occurrence kept) and joined with ". "
    return ppTweets, ". ".join(dict.fromkeys(rawTweets))
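
The date test used in the function above can be tried in isolation. The following sketch relies only on the standard library; the three tweets are invented for illustration. It shows that the test bounds the age of a tweet from above only, so a tweet published after the timestamp is also kept unless a lower bound is added.

```python
from datetime import datetime

DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"
time_stamp = "Fri Apr 14 09:58:53 +0000 2017"
date_margin = 10  # days

# Hypothetical tweets: one recent, one old, one posted after the timestamp
tweets = [
    {"id": "a", "created_at": "Mon Apr 10 12:00:00 +0000 2017"},
    {"id": "b", "created_at": "Wed Mar 01 08:00:00 +0000 2017"},
    {"id": "c", "created_at": "Sat Apr 15 10:00:00 +0000 2017"},
]

ref = datetime.strptime(time_stamp, DATE_FORMAT)

def age_in_days(tweet):
    """Whole days between the tweet and the timestamp (negative if the tweet is later)."""
    return (ref - datetime.strptime(tweet["created_at"], DATE_FORMAT)).days

# Same test as in the function above: only an upper bound on the age
kept = [t["id"] for t in tweets if age_in_days(t) < date_margin]
# Stricter variant: also require the tweet to precede the timestamp
kept_past_only = [t["id"] for t in tweets if 0 <= age_in_days(t) < date_margin]

print(kept)            # ['a', 'c']
print(kept_past_only)  # ['a']
```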

In [13]:
def getTweetsFromAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, dateMargin, communitiesDict):
    """
    get all tweets from all users and users connections from community given a timestamp and level
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: is the community id (ex: b'-10000875')
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param communitiesDict: getCommunitiesFromLevel(timeStamp,level)

    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type dateMargin: int
    :type communitiesDict: dict
    
    :return: - list of lists. Each sublist is preprocessed tweet tokens. ex: [["word1","word2"],["ok","help","nice"],...]
             - string of all raw tweets (each tweet is separated by a full stop (".")).

    :rtype: (list,str)
    """
    
    allUserIDs = getAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, communitiesDict)
    dateFormat = "%a %b %d %H:%M:%S %z %Y"
    refDate = datetime.strptime(timeStamp, dateFormat)  # parse the reference date only once
    ppTweets = []
    rawTweets = []

    for userID in allUserIDs:
        # keep the tweets whose age relative to timeStamp is below dateMargin days
        # (tweets published after timeStamp have a negative age and also pass this test)
        for tweet in allUsersTweets[userID]:
            if (refDate - datetime.strptime(tweet['created_at'], dateFormat)).days < dateMargin:
                ppTweets.append(tweet['txt_pp'])
                rawTweets.append(tweet['txt_ori'])
        
    # raw tweets are de-duplicated (first occurrence kept) and joined with ". "
    return ppTweets, ". ".join(dict.fromkeys(rawTweets))

In [14]:
def getTweetsFromAllUsersFromAllCommunitiesFromLevel(timeStamp, level, dateMargin):
    """
    get all tweets from all users from all communities given a level and timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    
    :type timeStamp: str
    :type level: int
    :type dateMargin: int
    
    :return: dictionary where keys are communities and values are list of tweets
             example: {b'-100000765': ([['word','ok'],['hello','good'],...]," raw tweets"),
                       b'-100006795': ([['word','ok'],['hello','good'],...]," raw tweets"),
                       ...}
    :rtype: dict
    """
    
    comsTweets = {}
    communitiesDict = getCommunitiesFromLevel(timeStamp,level)
    
    for community in communitiesDict:
        comsTweets[community] = getTweetsFromAllUsersFromCommunity(timeStamp, level, community, dateMargin, communitiesDict) 
        
    return comsTweets    

In [15]:
def getTweetsFromAllUsersAndUsersConnectionsFromAllCommunitiesFromLevel(timeStamp, level, dateMargin):
    """
    get all tweets from all users (and their connections) from all communities given a level and timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    
    :type timeStamp: str
    :type level: int
    :type dateMargin: int
    
    :return: dictionary where keys are communities and values are list of tweets
             example: {b'-100000765': ([['word','ok'],['hello','good'],...]," raw tweets"),
                       b'-100006795': ([['word','ok'],['hello','good'],...]," raw tweets"),
                       ...}
    :rtype: dict
    """
    
    comsTweets = {}
    communitiesDict = getCommunitiesFromLevel(timeStamp,level)

    for community in communitiesDict:
        comsTweets[community] = getTweetsFromAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, dateMargin, communitiesDict) 
    
    return comsTweets    

# 3. Build NLP models
<div id='3'></div>
Up to now, we have done all the preparation needed in order to be able to build some NLP models, more precisely text summarization models. In this notebook, we're primarily focusing on Topic Modeling. 

To give a brief recap, <b>Topic Modeling</b> consists of building a set of topics to describe a set of documents. In `LDAModelForLevel()`, a document is a community and the set of documents is a level; this builds a topic model for one level. In `LDAModelForAllTweets()`, a document is a tweet and the set of documents is all the tweets in the dataset; this builds a general topic model covering all aspects of the clustering. We'll focus more on this function. Each topic is composed of a set of keywords, and it is up to the user to decide what the topic is about from these keywords. Then, we assign each community a topic distribution. For example, community x could be made up of topic 1 at 80% and topic 5 at 20%, so topic 1 is the most prevalent topic in this community.
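
As a small illustration of the 80% / 20% example, a topic distribution can be represented as a list of (topic id, probability) pairs and sorted to find the most prevalent topic. The values below are invented, not the output of a trained model.

```python
# Hypothetical topic distribution of one community: (topic id, probability) pairs
community_topics = [(5, 0.20), (1, 0.80)]

# Sort by decreasing probability, most prevalent topic first
ranked = sorted(community_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, share = ranked[0]
print(ranked)                 # [(1, 0.8), (5, 0.2)]
print(dominant_topic, share)  # 1 0.8
```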

The other model presented in this notebook is <b>Keyword Summarization</b>, which consists in extracting the most prevalent words in a document. Keyword Summarization isn't necessarily as informative as Topic Modeling, as it only gives a general view of a document.

## 3.1. LDA Topic Modeling
<div id='3-1'></div>
To build our topic model, we will use the LDA (Latent Dirichlet Allocation) approach. LDA is a form of unsupervised learning that views documents as bags of words, meaning that word order doesn't matter. We won't go into the details here; see https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd for more.
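
The bag-of-words idea can be sketched without any library. The code below is a dependency-free illustration, not gensim's implementation: it maps each word to an integer id and counts occurrences, which is the kind of (word id, count) representation this notebook later obtains from gensim's `doc2bow`. The two documents are invented.

```python
from collections import Counter

# Two tokenized documents containing the same words in a different order
doc_a = ["macron", "debat", "macron", "vote"]
doc_b = ["vote", "macron", "debat", "macron"]

# Vocabulary: one integer id per distinct word (sorted for a reproducible mapping)
vocabulary = {word: idx for idx, word in enumerate(sorted(set(doc_a + doc_b)))}

def to_bow(tokens):
    """Return sorted (word id, count) pairs, ignoring word order."""
    counts = Counter(vocabulary[token] for token in tokens)
    return sorted(counts.items())

print(vocabulary)     # {'debat': 0, 'macron': 1, 'vote': 2}
print(to_bow(doc_a))  # [(0, 1), (1, 2), (2, 1)]
print(to_bow(doc_a) == to_bow(doc_b))  # True
```

Both documents yield the same representation, which is exactly what "order doesn't matter" means.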

In our case, we will proceed with the <b>gensim</b> library (https://radimrehurek.com/gensim/) to construct our LDA topic model. gensim comes with a built-in class for building LDA models, called `LdaMulticore` ('multicore' because training is parallelized to speed up processing). This class allows for a lot of hyper-parameter tuning. The important parameters are as follows: 
 - <b>num_topics</b> (int) (default = 100): number of topics of the model.
 - <b>passes</b> (int) (default = 1): controls how often we train the model on the entire corpus (`BuildModelForLevel()` defaults to 1).
 - <b>iterations</b> (int) (default = 50): maximum number of iterations through the corpus when inferring the topic distribution of a corpus. This parameter is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of "passes" and "iterations" high enough.
 - <b>alpha</b> ({float,str}) (default='symmetric'): Document-Topic Density. With a higher alpha, documents are assumed to be made up of more topics, which results in a more specific topic distribution per document.
 - <b>eta</b> ({float,str}) (default=None): Topic-Word Density. With a higher eta, topics are assumed to be made up of most of the words, which results in a more specific word distribution per topic.
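
To see where alpha and eta enter, it may help to recall the standard generative model behind LDA, written here in its usual textbook form. The symbols $\theta_d$, $\beta_k$, $z_{d,n}$ and $w_{d,n}$ are notation introduced for this recap and do not appear in the code.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of document } d\\
\beta_k &\sim \mathrm{Dirichlet}(\eta) && \text{word distribution of topic } k\\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of } d\\
w_{d,n} &\sim \mathrm{Categorical}(\beta_{z_{d,n}}) && \text{the word itself}
\end{aligned}
$$

In this reading, alpha is the Dirichlet prior on the document-topic proportions and eta is the Dirichlet prior on the topic-word distributions.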

In [16]:
def LDAModelForLevel(timeStamp, level, dateMargin, connectionsToo, numTopics, passes, ppComsDict, rawComsDict, alpha, eta, eval_every, iterations):
    """
    construct LDA topic model for a level given a timestamp
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param numTopics: number of topic for topic model
    :param passes: controls how often we train the lda model on the entire corpus (set to 1)
    :param ppComsDict: preprocessed coms as a dict. For running tests on existing models
    :param rawComsDict: raw coms as a dict. For running tests on existing models
    :param alpha: Document-Topic Density.
    :param eta: Topic-Word Density.
    :param eval_every: log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x (set to 10)
    :param iterations: Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

    :type timeStamp: str
    :type level: int
    :type dateMargin: int
    :type connectionsToo: bool
    :type numTopics: int
    :type passes: int
    :type ppComsDict: {dict,NoneType}
    :type rawComsDict: {dict,NoneType}
    :type alpha: {float,str}
    :type eta: {float,str}
    :type eval_every: int
    :type iterations: int
    
    :return: LDA Model, dictionary, corpus, corpusTransformedSorted, ppComsDict, rawComsDict

    """

    # check if we have already collected the tweet data. If not then we need to do it again.
    # This process of getting the tweets can take a while so for testing it's good if we don't have to do it twice.
    if (ppComsDict is None) and (rawComsDict is None):
        
        # get tweets from level
        if connectionsToo:
            coms = getTweetsFromAllUsersAndUsersConnectionsFromAllCommunitiesFromLevel(timeStamp, level, dateMargin)
        else:
            coms = getTweetsFromAllUsersFromAllCommunitiesFromLevel(timeStamp, level, dateMargin)

        # separate the raw tweets from the pre processed ones
        rawComsDict = {k:v[1] for (k,v) in coms.items()}
        ppComsDict = {k:v[0] for (k,v) in coms.items()}

    # concatenate all the tweets of a community into one list.
    # So we get a list where each elt is a list of words. ex [["word","hello",...],["well","ok",...],...]
    #                                                              com1                com2
    preProcessedComs = [[item for sublist in tweets for item in sublist] for tweets in list(ppComsDict.values())]

    dictionaryComs = corpora.Dictionary(preProcessedComs)
    corpusComs = [dictionaryComs.doc2bow(community) for community in preProcessedComs]

    # create model
    model = models.ldamulticore.LdaMulticore(corpus=corpusComs, 
                                     num_topics=numTopics, 
                                     id2word=dictionaryComs, 
                                     passes=passes,
                                     alpha=alpha,
                                     eta=eta,
                                     eval_every=eval_every, 
                                     iterations=iterations)

    # calculate the document-topic distributions (will need adjusting later)
    docTopicDist = model[corpusComs]
    
    # sort each doc-topic distribution by most prevalent topics first 
    docTopicDistSorted = [sorted(c,key= lambda a: a[1], reverse=True) for c in docTopicDist]
    
    """
    !IMPORTANT!
    Un-comment the next section of code if you want to save the model as a file.
    """
    #modelName = re.sub("[ :]","",str(timeStamp))+";"+str(level)+".gensim"
    #model.save(modelName)
    
    #corpusName = re.sub("[ :]","",str(timeStamp))+";"+str(level)+".pkl"
    #pickle.dump(corpusComs, open(corpusName, 'wb'))
    
    return model, dictionaryComs, corpusComs, docTopicDistSorted, ppComsDict, rawComsDict
        

In [17]:
def LDAModelForAllTweets(numTopics, passes, alpha, eta, eval_every, iterations):
    """
    Construct LDA model where in the document-term matrix we have: document = tweet.
    We consider all tweets in the dataset
    
    :param numTopics: number of topic for topic model
    :param passes: controls how often we train the lda model on the entire corpus (set to 1)
    :param alpha: Document-Topic Density.
    :param eta: Topic-Word Density.
    :param eval_every: log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x (set to 10)
    :param iterations: Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

    :type numTopics: int
    :type passes: int
    :type alpha: {float,str}
    :type eta: {float,str}
    :type eval_every: int
    :type iterations: int
    
    :return: model, dictionary, corpus, allPreProcessedTweets

    """

    # create dictionary and corpus
    dictionary = corpora.Dictionary(allPreProcessedTweets)
    corpus = [dictionary.doc2bow(tweet) for tweet in allPreProcessedTweets]
    
    # create model
    model = models.ldamulticore.LdaMulticore(
                                     corpus=corpus, 
                                     num_topics=numTopics, 
                                     id2word=dictionary, 
                                     passes=passes,
                                     alpha=alpha,
                                     eta=eta,
                                     eval_every=eval_every, 
                                     iterations=iterations)
    
    """
    !IMPORTANT!
    Un-comment the next section of code if you want to save the model as a file.
    Note: timeStamp and level are not defined in this function, so the file names below are placeholders to adapt.
    """
    #modelName = "allTweets.gensim"
    #model.save(modelName)
    
    #corpusName = "allTweets.pkl"
    #pickle.dump(corpus, open(corpusName, 'wb'))
    
    return model, dictionary, corpus, allPreProcessedTweets

## 3.2. Keyword Summarization
<div id='3-2'></div>
First and foremost, this summarization only applies to a specific community, not a level (for now). For the keyword summarization we will use <b>gensim</b> again, through its `keywords` function. The raw output of this function isn't great, so we do a little bit of extra processing: we filter out irrelevant words and limit the number of keywords used to describe a community.
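
The extra processing can be sketched on invented data. The helper `simplify` below is a hypothetical stand-alone illustration of the filtering described here (minimum score, at most three words per keyword, best keywords first); the (keyword, score) pairs are made up rather than produced by gensim.

```python
# Hypothetical (keyword, score) pairs, in the shape returned when scores are requested
raw_keywords = [
    ("debat", 0.04),
    ("le pen", 0.22),
    ("second tour de la presidentielle", 0.20),
    ("macron", 0.31),
]

def simplify(pairs, nb_keywords, min_weight):
    """Keep keywords scoring above min_weight with at most three words, best first."""
    kept = [(kw, score) for kw, score in pairs
            if score > min_weight and kw.count(" ") <= 2]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:nb_keywords]

print(simplify(raw_keywords, nb_keywords=2, min_weight=0.05))
# [('macron', 0.31), ('le pen', 0.22)]
```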

In [18]:
def getInitialKeyWords(timeStamp, level, community, dateMargin, connectionsToo):
    """
    perform a first key word summarization (nothing fancy)
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: community id
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param connectionsToo: true if you want users connections tweets too, false if not
    
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type dateMargin: int
    :type connectionsToo: bool
    
    :return: list of (keyword, score) tuples for the community
    :rtype: list
    """
    
    communitiesDict = getCommunitiesFromLevel(timeStamp,level)
    if connectionsToo:
        allUsersIDs = getAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, communitiesDict)
    else:
        allUsersIDs = getUserIDsFromCommunity(timeStamp, level, community, communitiesDict)
        
    tweets = []
    for userID in allUsersIDs:
        
        # get semi-preprocessed tweet data (it's good enough for now)
        rawUserTweets = [tweet['txt'] for tweet in allUsersTweets[userID] if tweet['source'] == userID and (datetime.strptime(timeStamp,"%a %b %d %H:%M:%S %z %Y")-datetime.strptime(tweet['created_at'],"%a %b %d %H:%M:%S %z %Y")).days < dateMargin]
        tweets.append(rawUserTweets)
        
    tweets = list(set([item for sublist in tweets for item in sublist]))
    
    return keywords('. '.join(tweets), scores=True, lemmatize=True)

In [19]:
def simplifyKeyWords(nbKeyWords,minWeight,communityKeyWordsU):
    """
    reduce number of keywords, filter out irrelevant words and sort
    
    :param nbKeyWords: number of keywords to describe community desired
    :param minWeight: minimal weight for keywords
    :param communityKeyWordsU: list of initial keywords (from getInitialKeyWords())
    
    :type nbKeyWords: int
    :type minWeight: float
    :type communityKeyWordsU: list
    
    :return: list of keywords
    :rtype: list
    """
    return  sorted([keyword for keyword in communityKeyWordsU if keyword[1] > minWeight and keyword[0].count(' ') <= 2],key=lambda x:x[1],reverse=True)[:nbKeyWords]

In [20]:
def KeyWordModelForCommunity(timeStamp, level, community, dateMargin, connectionsToo, nbKeyWords, minWeight):
    """
    summing up a community of a level by the keywords extracted from the concatenation of its tweets
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: community id
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param nbKeyWords: number of keywords to describe community desired
    :param minWeight: minimal weight for keywords
    
    
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}
    :type dateMargin: int
    :type connectionsToo: bool
    :type nbKeyWords: int
    :type minWeight: float
    
    :return: list of keywords
    :rtype: list
    """
    return simplifyKeyWords(nbKeyWords, minWeight, getInitialKeyWords(timeStamp, level, community, dateMargin, connectionsToo))

## 3.3. Check parameters and choose model
<div id='3-3'></div>
In the previous section of this notebook, we developed some functions to make Topic Models and perform Keyword Summarization. The following function checks that all the parameters are valid and allows us to choose between the two models.
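
The checking pattern itself is simple: test each parameter and raise an explicit error as soon as one is out of range. The sketch below uses a hypothetical helper, `check_lda_params`, that covers only three parameters; the bounds mirror this notebook's own choices rather than constraints imposed by gensim.

```python
def check_lda_params(num_topics, passes, alpha):
    """Raise ValueError with an explicit message when a parameter is out of range."""
    if num_topics < 2:
        raise ValueError("num_topics must be at least 2")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    if isinstance(alpha, float) and not 0 <= alpha <= 1:
        raise ValueError("alpha must lie between 0 and 1 when given as a float")

check_lda_params(num_topics=10, passes=1, alpha="symmetric")  # valid: nothing happens

try:
    check_lda_params(num_topics=1, passes=1, alpha="symmetric")
except ValueError as err:
    message = str(err)
    print(message)  # num_topics must be at least 2
```

Raising a specific exception type such as `ValueError` lets callers distinguish bad parameters from other failures.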

In [21]:
def BuildModelForLevel(timeStamp, 
                       level, 
                       dateMargin, 
                       modelType, 
                       connectionsToo, 
                       nbKeyWords=30, 
                       minWeight=0.05, 
                       numTopics=10, 
                       community=None, 
                       passes=1, 
                       ppComsDict=None,
                       rawComsDict=None,
                       alpha='symmetric',
                       eta=None,
                       eval_every=10,
                       iterations=50):
    """
    construct NLP Model for a level given a timestamp.
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param modelType: is the desired model type (0 for topic model, 1 for key word summarization)
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param nbKeyWords: (optional) number of desired keywords for Key Word Summarization
    :param minWeight: (optional) minimal weight of key word for key word summarization
    :param numTopics: (optional) number of topic for topic model
    :param community: (optional) desired community for key word summarization
    :param passes: (optional) controls how often we train the lda model on the entire corpus (set to 1)
    :param ppComsDict: (optional) preprocessed coms as a dict. For running tests on existing models
    :param rawComsDict: (optional) raw coms as a dict. For running tests on existing models
    :param alpha: (optional) Document-Topic Density.
    :param eta: (optional) Topic-Word Density.
    :param eval_every: (optional) log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x (set to 10)
    :param iterations: (optional) Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

    :type timeStamp: str
    :type level: int
    :type dateMargin: int
    :type modelType: int
    :type connectionsToo: bool
    :type nbKeyWords: int
    :type minWeight: float
    :type numTopics: int
    :type community: {bytes,str}
    :type passes: int
    :type ppComsDict: {dict,NoneType}
    :type rawComsDict: {dict,NoneType}
    :type alpha: {float,str}
    :type eta: {float,str}
    :type eval_every: int
    :type iterations: int
    
    :return: model
    
    """
    
    # check all parameters   
    
    if timeStamp not in data['communities']:
        raise Exception('time stamp non-existent')
        
    if level not in getLevelsFromTimeStamp(timeStamp):
        raise Exception("level non-existent")
        
    if dateMargin < 0:
        raise Exception('margin invalid')

    if not modelType in [0,1]:
        raise Exception("model type not valid, choose 0 or 1")
        
    if not connectionsToo in [True, False]:
        raise Exception("boolean value for connectionsToo")
        
    if nbKeyWords < 1:
        raise Exception('not enough key words')
        
    if minWeight < 0 or minWeight > 1:
        raise Exception('min weight needs to be between 0 and 1')
        
    if numTopics <= 1:
        raise Exception('nb of topics has to be at least 2')
            
    if community is not None: 
        if community not in getCommunitiesFromLevel(timeStamp,level):
            raise Exception("community does not exist")
            
    if passes  < 1:
        raise Exception("passes needs to be at least 1")
        
    if isinstance(alpha, float):
        if alpha < 0 or alpha > 1:
            raise Exception("alpha needs to be between 0 and 1 when it is a float")
            
    if isinstance(alpha, str):
        if alpha not in ["auto","symmetric","asymmetric"]:
            raise Exception('alpha needs to be one of ["auto","symmetric","asymmetric"] when it is a string')
            
    if eta is not None:
        if isinstance(eta, float):
            if eta < 0 or eta > 1:
                raise Exception("eta needs to be between 0 and 1 when it is a float")
        if isinstance(eta, str):
            if eta != 'symmetric':
                raise Exception('eta needs to be equal to "symmetric" when it is a string')
                
    if eval_every < 0:
        raise Exception("eval_every cannot be negative")
        
    if iterations < 50:
        raise Exception("iterations needs to be at least 50")

    # returns model depending on decision by user
    if modelType == 0:
        # build topic model
        return LDAModelForLevel(timeStamp, level, dateMargin, connectionsToo, numTopics, passes, ppComsDict, rawComsDict, alpha, eta, eval_every, iterations)
    else:
        # build keyword model
        return KeyWordModelForCommunity(timeStamp, level, community, dateMargin, connectionsToo, nbKeyWords, minWeight)

## 3.4. Build a model
<div id='3-4'></div>
Let's build an example topic model. Using the results from our tests we can build a topic model for level 3 of the community tree at Sun Apr 30 10:30:11 +0000 2017.

In [22]:
"""
myModel, myDictionary, myCorpus, myDocTopicDistSorted, myppComs, myrawComs = BuildModelForLevel(
                          timeStamp='Sun Apr 30 10:30:11 +0000 2017',
                          level=3,
                          dateMargin=500,
                          modelType=0,
                          connectionsToo=True,
                          numTopics=15,
                          passes=10,
                          eval_every=10,
                          iterations=200,
                          alpha=0.31,
                          eta=0.61)
                          #ppComsDict=myppComs,
                          #rawComsDict=myrawComs)
"""

Let's build a keyword summarization model too:

In [23]:
"""
myKWModel = BuildModelForLevel(
    dateMargin=50,
    modelType=1,
    timeStamp='Sun Apr 30 10:30:11 +0000 2017',
    level=3,
    community= b'-30000094722',
    connectionsToo=False,
    nbKeyWords=50,
    minWeight=0.05)
"""

## 3.5. Update a model
<div id='3-5'></div>
The following function allows us to update a pre-existing model with a new community.

In [24]:
def updateLDAModelWithNewCom(model, dictionary, ppComs, rawComs, timeStamp, level, community, connectionsToo, dateMargin):
    
    """ 
    update an LDA model with a new community
  
    :param model: (gensim.models.ldamulticore): is the old model
    :param dictionary: (gensim.corpora): is the old dictionary
    :param ppComs: preprocessed coms as a dict (the new community is added to it)
    :param rawComs: raw coms as a dict (the new community is added to it)
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: new community id
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 

    :type model: gensim.models.ldamulticore
    :type dictionary: gensim.corpora
    :type ppComs: dict
    :type rawComs: dict
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}    
    :type connectionsToo: bool
    :type dateMargin: int

    :return: model, dictionary, newDocTopicDistSorted, ppComs, rawComs
    
    """
    
    # get tweets from new com
    if connectionsToo:
        comTweets = getTweetsFromAllUsersAndUsersConnectionsFromCommunity(timeStamp,level,community, dateMargin)
    else:
        comTweets = getTweetsFromAllUsersFromCommunity(timeStamp, level, community, dateMargin)
    
    # add new com to the rest of other coms
    ppComs[community] = comTweets[0]
    rawComs[community] = comTweets[1]
    
    preProcessedComs = [[item for sublist in tweets for item in sublist] for tweets in ppComs.values()] 
    
    # the dictionary is not extended here (add_documents is commented out),
    # so words of the new community that are missing from it are ignored by doc2bow
    #dictionary.add_documents([[item for sublist in comTweets[0] for item in sublist]])
    newCorpus = [dictionary.doc2bow(com) for com in preProcessedComs]
    
    model.update(newCorpus)
    newDocTopicDist = model[newCorpus]
    newDocTopicDistSorted = [sorted(c,key= lambda a: a[1], reverse=True) for c in newDocTopicDist]
    
    return model, dictionary, newDocTopicDistSorted, ppComs, rawComs # remove ppComs, rawComs maybe
 

# 4. Study purely the topics
<div id='4'></div>
As we know, our topic model has a set of topics. This next part of the notebook is purely for studying the topics as they are, not the topic distributions over communities. There are three possibilities:
 - Show the topics, how many you want and how many words per topic to show.
 - Get the unique words from each topic (the words that are present in only one topic)
 - Show the intertopic distance map (see below for further details)

## 4.1. Show topics
<div id='4-1'></div>

In [25]:
def getTopics(model,nbTopics,nbWords):
    """
    get topics for LDA Model
    
    :param model: the model
    :param nbTopics: is the number of topics to be displayed
    :param nbWords: is the number of words in topics 
    
    :type model: gensim.models.ldamulticore
    :type nbTopics: int
    :type nbWords: int
    
    :return: list of (topic id, 'weight1*"word1" + weight2*"word2" + ...') tuples
    :rtype: list
    """
    return list(dict.fromkeys([topic for topic in model.print_topics(num_topics=nbTopics, num_words=nbWords)]))

In [26]:
#getTopics(myModel,30,7)

## 4.2. Get unique words for each topic
<div id='4-2'></div>

In [27]:
def getTopicKeyWords(model,topicNb,nbTopics,nbWords):
    """
    get just the keywords from a topic.
    
    :param model: the model
    :param topicNb: the topic from which we want the keywords
    :param nbTopics: is the number of topics to be displayed (see prev func)
    :param nbWords: is the number of words in topics 
    
    :type model: gensim.models.ldamulticore
    :type topicNb: int
    :type nbTopics: int
    :type nbWords: int
    
    
    :return: a list of keywords for topic 'topicNb'
    :rtype: list
    """
    return [re.sub('"',"",elt.split("*")[1]).replace(" ","") for elt in [topic for topic in getTopics(model,nbTopics,nbWords) if topic[0] == topicNb][0][1].split("+")]

In [28]:
def getTopicKeyWordsForAllTopics(model,nbTopics,nbWords):
    """
    return a list of lists. each sublist is a topic's keywords
    
    :param model: the model
    :param nbTopics: is the number of topics to be displayed (see prev func)
    :param nbWords: is the number of words in topics 
    
    :type model: gensim.models.ldamulticore
    :type nbTopics: int
    :type nbWords: int
    
    :return: list of topic keyword lists
    :rtype: list
    """
    return [getTopicKeyWords(model,i,nbTopics,nbWords) for i in range(len(getTopics(model,nbTopics,nbWords)))]
    #return [i for i in range(len(getTopics(model,nbTopics,nbWords)))]

In [29]:
def getUniqueWordsForeachTopic(model,nbTopics,nbWords):
    """
    return dictionary where keys are topic nbs and values are lists of unique words to the topic
    
    :param model: the model
    :param nbTopics: is the number of topics to be displayed (see prev func)
    :param nbWords: is the number of words in topics 
    
    :type model: gensim.models.ldamulticore
    :type nbTopics: int
    :type nbWords: int
    
    :return: dictionary where keys are topic nbs and values are lists of unique words to the topic
    :rtype: dict
    
    """
    # compute the keyword lists once instead of once per topic
    allTopicWords = getTopicKeyWordsForAllTopics(model,nbTopics,nbWords)
    reducedTopics = {}
    for i, topicWords in enumerate(allTopicWords):
        # words used by any other topic (compared by position, so identical lists are handled too)
        otherWords = {word for j, words in enumerate(allTopicWords) if j != i for word in words}
        reducedTopics[i] = [word for word in topicWords if word not in otherWords]
    return reducedTopics

In [30]:
#getUniqueWordsForeachTopic(myModel,30,7)
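
To make the logic concrete without training a model, here is a minimal, self-contained sketch. It assumes topics arrive as `(topic id, 'weight*"word" + ...')` tuples, the shape produced by `print_topics`. The toy topics and the helper names `parse_keywords` and `unique_words_per_topic` are illustrative and are not part of the notebook.

```python
import re
from collections import Counter

# Toy topics in the (topic_id, 'weight*"word" + ...') shape; the values are made up.
toy_topics = [
    (0, '0.030*"macron" + 0.020*"vote" + 0.010*"debat"'),
    (1, '0.040*"lepen" + 0.020*"vote" + 0.010*"front"'),
    (2, '0.050*"fillon" + 0.020*"justice" + 0.010*"debat"'),
]

def parse_keywords(topic_string):
    """Extract the quoted words from a 'weight*"word" + ...' string."""
    return re.findall(r'"([^"]+)"', topic_string)

def unique_words_per_topic(topics):
    """Map each topic id to the words that appear in no other topic."""
    keywords = {topic_id: parse_keywords(text) for topic_id, text in topics}
    # Count in how many topics each word occurs (set() avoids double counting within a topic)
    counts = Counter(word for words in keywords.values() for word in set(words))
    return {topic_id: [w for w in words if counts[w] == 1]
            for topic_id, words in keywords.items()}

print(unique_words_per_topic(toy_topics))
```

Here "vote" and "debat" are shared by two topics each, so they are dropped and only the words specific to a single topic remain.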

## 4.3. Intertopic distance map
<div id='4-3'></div>
The intertopic distance map displays the model's topics as a bubble graph. Each bubble corresponds to a topic. The bigger the bubble, the more prevalent that topic is in the corpus. A good topic model will have fairly large bubbles that are spread out rather than clustered in one place. Overlapping bubbles mean that the topics are similar.
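
To give an intuition for what "similar topics" means numerically, here is a small sketch that compares toy topic-word distributions with the Jensen-Shannon divergence. We assume this is the kind of distance behind the map (it is, to our understanding, pyLDAvis's default), but the code below is only an illustration: the distributions are made up and the function names are ours.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions, at most 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topic-word distributions over the same 4-word vocabulary (made-up values)
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]   # close to topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]   # very different from topic_a

print(js_divergence(topic_a, topic_b))  # small: the bubbles would sit close together
print(js_divergence(topic_a, topic_c))  # larger: the bubbles would sit far apart
```
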

In [31]:
def intertopicDistanceMap(model, corpus, dictionary):
    """
    intertopic Distance Map (bubbles graph) for LDA model 
    """
    
    pyLDAvis.enable_notebook()
    
    # prepare the visualization once and return it so the notebook renders it
    return pyLDAvis.gensim.prepare(model, corpus, dictionary)

In [32]:
#intertopicDistanceMap(myModel,myCorpus,myDictionary)

We can see in this case that our model isn't great (30 topics).

# 5. Study the topic distributions of communities
<div id='5'></div>
As mentioned previously, each community can be represented as a distribution of topics. This next part of the notebook allows us to display these distributions.

We can show the distributions for all the communities of the level at once, or see only the topic distribution of one community. 

In the single-community case, the community can be internal or external to the level, which means we can study the topic distribution of a community that isn't a part of the original LDA topic model.
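
As a minimal sketch of what a community topic distribution looks like once it is sorted, filtered by a minimum weight, and formatted for display, consider the following. The helper names `filter_topic_dist` and `format_topic_dist` and the toy values are assumptions for illustration only; they mirror the idea, not the notebook's functions.

```python
def filter_topic_dist(doc_topics, min_weight):
    """Sort (topic_id, weight) pairs by decreasing weight and keep those above min_weight."""
    ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
    return [(topic_id, weight) for topic_id, weight in ranked if weight > min_weight]

def format_topic_dist(topic_dist):
    """Render a distribution as 'topic -> percent' text."""
    return ", ".join("{} -> {:.0f}%".format(t, w * 100) for t, w in topic_dist)

# Made-up distribution for one community: topic 0 is below the 0.05 threshold
raw = [(0, 0.03), (1, 0.80), (2, 0.17)]
kept = filter_topic_dist(raw, min_weight=0.05)

print(kept)
print(format_topic_dist(kept))
```

Formatting the percentage directly, rather than rounding the fraction first and multiplying by 100 afterwards, avoids floating-point artifacts such as `28.999999999999996%`.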

In [33]:
def getComTopicDist(community, model, minWeight, timeStamp, level, dateMargin, connectionsToo):
    """
    get community topic distribution
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param community: new community id
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param model: (gensim.models.ldamulticore): is the old model
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param minWeight: minimal weight for a topic in community


    :type model: gensim.models.ldamulticore
    :type timeStamp: str
    :type level: int
    :type community: {bytes,str}    
    :type connectionsToo: bool
    :type dateMargin: int
    :type minWeight: float
 
    :return: list of (topic, weight) tuples sorted by decreasing weight (ex: [(1, 0.8), (2, 0.2)])
    :rtype: list
    """
    
    # include the tweets of the users' connections only when asked to
    if connectionsToo:
        com = getTweetsFromAllUsersAndUsersConnectionsFromCommunity(timeStamp, level, community, dateMargin)
    else:
        com = getTweetsFromAllUsersFromCommunity(timeStamp, level, community, dateMargin)
            
    preProcessedCom = [item for sublist in com[0] for item in sublist]
    bow = model.id2word.doc2bow(preProcessedCom) # convert to bag of words format first
    doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True,minimum_probability=-1.0)
    
    return [couple for couple in sorted(doc_topics,key= lambda a: a[1], reverse=True) if couple[1] > minWeight]



In [34]:
def getCompactTopicDistForAllComs(model, timeStamp, level, minWeight, dateMargin, connectionsToo):
    """
    returns a dict of topic distributions, one entry per community.
    No fancy display.
    
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param connectionsToo: true if you want users connections tweets too, false if not
    :param model: (gensim.models.ldamulticore): is the old model
    :param dateMargin: is the maximal time (in days) between the date of publication of a tweet and the timeStamp (ex: 10) 
    :param minWeight: minimal weight for a topic in community


    :type model: gensim.models.ldamulticore
    :type timeStamp: str
    :type level: int
    :type connectionsToo: bool
    :type dateMargin: int
    :type minWeight: float
    
    :return: dict where key is community and value is its list of (topic, weight) tuples
    :rtype: dict
 
    """
    
    comsTopicDists = {}
    levelComs = getCommunitiesFromLevel(timeStamp,level).keys()
    for community in levelComs:
        comsTopicDists[community] = getComTopicDist(community=community,
                                              model=model,
                                              minWeight=minWeight,
                                              timeStamp=timeStamp,
                                              level=level,
                                              dateMargin=dateMargin,
                                              connectionsToo=connectionsToo)
    
    return comsTopicDists
        

In [35]:
def getModelTopicDistribution(timeStamp, level, model, docTopicDistSorted):
    """
    get for each community, the topical distribution
        (ex: com "x": topic 1 80%, topic 2 20%) 
        
    :param timeStamp: is the date (ex: 'Fri Apr 14 09:58:53 +0000 2017')
    :param level: is the level for the date (ex: 7)
    :param model: (gensim.models.ldamulticore): is the old model
    :param docTopicDistSorted: community topic distributions


    :type model: gensim.models.ldamulticore
    :type timeStamp: str
    :type level: int
    :type docTopicsDistSorted: list

    :return: list of strings
    :rtype: list
    """
    
    # format the percentage directly to avoid float artifacts such as 28.999999999999996%
    communityTopics = list(zip([", ".join([str(couple[0])+" -> "+"{:.0f}".format(couple[1]*100)+"%" for couple in lst]) for lst in docTopicDistSorted], getCommunitiesFromLevel(timeStamp,level).keys()))
    
    return [str("community n° "+str(community[1])+" topics: "+community[0]) for community in communityTopics]

In [36]:
#getModelTopicDistribution('Sun Apr 30 10:30:11 +0000 2017',3,myModel,myDocTopicDistSorted)

# 6. Visualization
<div id='6'></div>

## 6.1. Produce community topic distribution Word Clouds 
<div id='6-1'></div>
For each community we have its corresponding topic distribution. This part of the notebook allows us to display a set of word clouds for a given community. Each word cloud is sized relative to the community's most prevalent topic. For example, if we are interested in community x (topic 0 80%, topic 5 20%), then we'll display a word cloud for topic 0 at 100% size, and a word cloud for topic 5 at 25% size (20/80).
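
The sizing rule can be sketched on its own, without any plotting. The function name `relative_figure_sizes` and its `max_size` parameter are hypothetical. The sketch assumes the distribution is a list of `(topic, weight)` pairs sorted by decreasing weight, and adds a guard for an empty distribution.

```python
def relative_figure_sizes(topic_dist, max_size=10.0):
    """Scale each topic's figure size by its weight relative to the dominant (first) topic."""
    if not topic_dist:
        return []  # nothing to plot, e.g. when every topic fell below the minimum weight
    top_weight = topic_dist[0][1]
    return [max_size * (weight / top_weight) for _, weight in topic_dist]

# Community x from the example above: topic 0 at 80%, topic 5 at 20%
sizes = relative_figure_sizes([(0, 0.8), (5, 0.2)])
print(sizes)
```

The dominant topic gets the full figure size, and topic 5 gets a quarter of it.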

In [37]:
def getWordCloudsFromCommunity(model, comTopicDist):
    """
    get word clouds from community
    
    :param model: the model
    :param comTopicDist: the community topic distribution, a list of (topic, weight) tuples sorted by decreasing weight
    
    :type model: gensim.models.ldamulticore
    :type comTopicDist: list
    
    :return: -list of word clouds
             -list of word clouds size dimensions (smaller for less prevalent topics)
             
    :rtype: (list,list)
    """    
    #comTopicDistAsList = list(comTopicDist.values())[0]

    wordclouds = [WordCloud(width=1000,height=1000,background_color="white",min_font_size = 20).fit_words(dict(model.show_topic(t[0],200))) for t in comTopicDist]
    figSizeDimensions = [10.0*(t[1]/comTopicDist[0][1]) for t in comTopicDist]

    return wordclouds, figSizeDimensions

In [38]:
def plotWordCloudsFromCommunity(model, comTopicDist):
    """
    plot topic model word clouds for a community
    
    :param model: the model
    :param comTopicDist: the community topic distribution, a list of (topic, weight) tuples sorted by decreasing weight
    
    :type model: gensim.models.ldamulticore
    :type comTopicDist: list
    """
    
    wordClouds, figDimensions = getWordCloudsFromCommunity(model, comTopicDist)
    
    # one figure per topic, sized by the topic's relative weight
    for wordcloud, size in zip(wordClouds, figDimensions):
        plt.figure(figsize = (size, size), facecolor = None) 
        plt.imshow(wordcloud)
        plt.axis('off')
        plt.tight_layout(pad=0)
        plt.show()
        

In [39]:
#comTopicDist = getComTopicDist(community=b'-30000094722',timeStamp='Sun Apr 30 10:30:11 +0000 2017',level=3, model=myModel,minWeight=0.05,dateMargin=500,connectionsToo=True)
#plotWordCloudsFromCommunity(myModel,comTopicDist)

## 6.2. Visualize keyword summarization (word cloud)
<div id='6-2'></div>
Just like previously, we'll use word clouds to illustrate the results of a keyword summarization. Unlike the community topic distribution word clouds, here there is only one word cloud, containing the keywords.

In [40]:
def showKeyWordsWordCloud(model):
    """
    show community keywords as word cloud
    
    :param model: the keyword summarization model, a collection of (keyword, weight) pairs (anything dict() accepts)
    
    :type model: {list,dict}
    """
    
    # plot the WordCloud image                        
    plt.figure(figsize = (10, 10), facecolor = None) 
    plt.imshow(WordCloud( width=1000,height=1000,
               background_color ='white', 
               min_font_size = 20).fit_words(dict(model)))
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.show() 

In [41]:
#showKeyWordsWordCloud(myKWModel)