## Demo for the Calculation of the Semantic Brand Score - Basic Version

In [1]:
# Read text documents from an example CSV file
import csv
# Open the file in a context manager so that it is closed after reading
with open("AliceWonderland.csv", 'rt', encoding="utf8", newline="") as f:
    readfile = csv.reader(f, delimiter = "|", quoting=csv.QUOTE_NONE)
    texts = [line[0] for line in readfile]
#4 Chapters of Alice in Wonderland
print(len(texts))
print(texts[0][:200])

4
Down the Rabbit-Hole. Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it


**I imported the example text file in Python as a list of text documents (texts), which are processed to remove punctuation, stop-words and special characters. Words are lowercased and split into tokens, thus obtaining a new texts variable, which is a list of lists. More complex text preprocessing operations are always possible (such as the removal of html tags or ‘#’), for which I recommend reading one of the many tutorials on Natural Language Processing in Python. The stopwords list is taken from the NLTK package. Lastly, word affixes are removed through Snowball Stemming; brand names are excluded from stemming so that they stay recognizable.**

In [2]:
##Import re, string and nltk, and download stop-words
import re
import nltk
import string
from nltk.stem.snowball import SnowballStemmer

#Define stopwords
#nltk.download("stopwords")
stopw = nltk.corpus.stopwords.words('english')

#Define brands (lowercase)
brands = ['alice', 'rabbit']

# texts is a list of strings, one for each document analyzed.

#Convert to lowercase
texts = [t.lower() for t in texts]
#Remove words that start with HTTP
texts = [re.sub(r"http\S+", " ", t) for t in texts]
#Remove words that start with WWW
texts = [re.sub(r"www\S+", " ", t) for t in texts]
#Remove punctuation
regex = re.compile('[%s]' % re.escape(string.punctuation))
texts = [regex.sub(' ', t) for t in texts]
#Remove words made of single letters
texts = [re.sub(r'\b\w{1}\b', ' ', t) for t in texts]
#Remove stopwords
pattern = re.compile(r'\b(' + r'|'.join(stopw) + r')\b\s*')
texts = [pattern.sub(' ', t) for t in texts]
#Remove additional whitespaces
texts = [re.sub(' +',' ',t) for t in texts]

#Tokenize text documents (becomes a list of lists)
texts = [t.split() for t in texts]

# Snowball Stemming
stemmer = SnowballStemmer("english")
texts = [[stemmer.stem(w) if w not in brands else w for w in t] for t in texts]
texts[0][:6]

['rabbit', 'hole', 'alice', 'begin', 'get', 'tire']
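
The cell above strips all punctuation, which also erases html tags only partially, leaves no trace of ‘#’, and removes emoticons such as :-). A minimal sketch of an optional cleaning step that could run before lowercasing is shown here. The function name `pre_clean`, the `EMOTICONS` mapping and its placeholder tokens are assumptions made for illustration and are not part of the original demo.

```python
import re

# Hypothetical mapping from emoticons to word-like placeholder tokens.
# The tokens contain no punctuation, so they survive the punctuation step.
EMOTICONS = {":-)": " emosmile ", ":)": " emosmile ",
             ":-(": " emosad ", ":(": " emosad "}

def pre_clean(text):
    """Optional cleaning to apply before lowercasing and punctuation removal."""
    # Drop html tags such as <p> or <br/>
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep the word of a hashtag but drop the '#' sign
    text = text.replace("#", " ")
    # Replace each emoticon with its placeholder token
    for emo, token in EMOTICONS.items():
        text = text.replace(emo, token)
    return text

sample = "<p>Loved the #rabbit scene :-)</p>"
print(pre_clean(sample).split())
```

The placeholder tokens may still be altered by the stemmer unless they are exempted in the same way as the brand names.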

**During text preprocessing we should take care not to lose useful information. Emoticons such as :-) are made of punctuation and can be very important if we calculate sentiment.
We can now proceed with the calculation of prevalence, which counts the frequency of occurrence of each brand name and is subsequently standardized considering the scores of all the words in the texts. My choice of standardization here is to subtract the mean and divide by the standard deviation. Other approaches are also possible. This step is important for comparing measures taken over different time frames or sets of documents (e.g. brand importance on Twitter in April and May). Normalization of absolute scores is necessary before summing prevalence, diversity and connectivity to obtain the Semantic Brand Score.**

In [3]:
#PREVALENCE
#Import Counter and Numpy
from collections import Counter
import numpy as np

#Create a dictionary with frequency counts for each word
countPR = Counter()
for t in texts:
    countPR.update(Counter(t))

#Calculate average score and standard deviation
avgPR = np.mean(list(countPR.values()))
stdPR = np.std(list(countPR.values()))

#Calculate standardized Prevalence for each brand
PREVALENCE = {}
for brand in brands:
    PR_brand = (countPR[brand] - avgPR) / stdPR
    PREVALENCE[brand] = PR_brand
    print("Prevalence", brand, PR_brand)

Prevalence alice 6.716253083416179
Prevalence rabbit 0.23502089447171487
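
To illustrate why standardization helps when comparing time frames, here is a small sketch that uses only the standard library. The two token lists are invented toy data, and `standardized_prevalence` is a hypothetical helper that mirrors the calculation in the cell above (`pstdev` is the population standard deviation, like the default of `np.std`).

```python
from collections import Counter
from statistics import mean, pstdev

def standardized_prevalence(tokens, brand):
    """Z-score of a brand's frequency against the frequencies of all words."""
    counts = Counter(tokens)
    values = list(counts.values())
    return (counts[brand] - mean(values)) / pstdev(values)

# Toy token lists standing in for two time frames (invented for illustration).
# The second one is simply twice as large as the first.
april = ["alice"] * 4 + ["rabbit"] * 2 + ["tea", "hat", "cat", "queen"]
may = ["alice"] * 8 + ["rabbit"] * 4 + ["tea", "hat", "cat", "queen"] * 2

for name, tokens in [("april", april), ("may", may)]:
    raw = Counter(tokens)["alice"]
    z = standardized_prevalence(tokens, "alice")
    print(name, "raw count:", raw, "standardized:", round(z, 3))
```

The raw count of the brand doubles from one toy period to the next only because the second corpus is larger, while the standardized score stays the same.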


**The next and most important step is to transform texts (a list of lists of tokens) into a social network where nodes are words and links are weighted according to the number of co-occurrences between each pair of words. In this step we have to define a co-occurrence range, i.e. a maximum distance between co-occurring words (here set to 3). In addition, we might want to remove links which represent negligible co-occurrences, for example those of weight = 1. Sometimes it can also be useful to remove isolates, if these are not brands.**

In [4]:
#Import Networkx
import networkx as nx

#Choose a co-occurrence range
co_range = 3

#Create an undirected Network Graph
G = nx.Graph()

#Each word is a network node
nodes = set([item for sublist in texts for item in sublist])
G.add_nodes_from(nodes)

#Add links based on co-occurrences
for doc in texts:
    w_list = []
    length= len(doc)
    for k, w in enumerate(doc):
        #Define range, based on document length
        if (k+co_range) >= length:
            superior = length
        else:
            superior = k+co_range+1
        #Create the list of co-occurring words
        if k < length-1:
            for i in range(k+1,superior):
                linked_word = doc[i].split()
                w_list = w_list + linked_word
        #If the list is not empty, create the network links
        if w_list:    
            for p in w_list:
                if G.has_edge(w,p):
                    G[w][p]['weight'] += 1
                else:
                    G.add_edge(w, p, weight=1)
        w_list = []

#Remove negligible co-occurrences based on a filter
link_filter = 2
#Create a new Graph which keeps only links whose weight is
#at or above the minimum co-occurrence threshold
G_filtered = nx.Graph() 
G_filtered.add_nodes_from(G)
for u,v,data in G.edges(data=True):
    if data['weight'] >= link_filter:
        G_filtered.add_edge(u, v, weight=data['weight'])

#Optional removal of isolates
isolates = set(nx.isolates(G_filtered))
isolates -= set(brands)
G_filtered.remove_nodes_from(isolates)

#Check the resulting graph (for small test graphs)
#G_filtered.nodes()
#G_filtered.edges(data = True)
print("Filtered Network\nNo. of Nodes:", G_filtered.number_of_nodes(), "No. of Edges:", G_filtered.number_of_edges())

Filtered Network
No. of Nodes: 519 No. of Edges: 1514
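
To see what a co-occurrence range of 3 and a link filter of 2 do on a tiny input, the following sketch counts word pairs with the standard library only. The toy document and the helper `cooccurrence_counts` are assumptions made for illustration; the counting rule is meant to match the loop above (each word is linked to the next `co_range` words, ignoring direction).

```python
from collections import Counter

def cooccurrence_counts(doc, co_range=3):
    """Count unordered word pairs that are at most co_range positions apart."""
    pairs = Counter()
    for k, w in enumerate(doc):
        # Look ahead at the next co_range tokens (fewer near the end)
        for p in doc[k + 1 : k + 1 + co_range]:
            pairs[tuple(sorted((w, p)))] += 1
    return pairs

# Invented toy document, already tokenized
toy_doc = ["alice", "see", "white", "rabbit", "alice", "follow", "white", "rabbit"]
pairs = cooccurrence_counts(toy_doc)
print(pairs[("alice", "rabbit")])  # 3

# Keep only links at or above the minimum co-occurrence threshold
link_filter = 2
kept = {pair: n for pair, n in pairs.items() if n >= link_filter}
print(sorted(kept))
```

Pairs that co-occur only once, such as 'see' and 'white' in this toy document, are dropped by the filter.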


**Having determined the co-occurrence network, we can now calculate diversity and connectivity, which are, respectively, the degree centrality and the betweenness centrality of a brand node. We standardize these values as we did with prevalence.**

In [5]:
#DIVERSITY
DIVERSITY_sequence=dict(nx.degree(G_filtered))
#Calculate average score and standard deviation
avgDI = np.mean(list(DIVERSITY_sequence.values()))
stdDI = np.std(list(DIVERSITY_sequence.values()))
#Calculate standardized Diversity for each brand
DIVERSITY = {}
for brand in brands:
    DI_brand = (DIVERSITY_sequence[brand] - avgDI) / stdDI
    DIVERSITY[brand] = DI_brand
    print("Diversity", brand, DI_brand)

Diversity alice 5.594199479718734
Diversity rabbit -0.13888252606943657
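
As a quick illustration of what the degree used for diversity measures, here is a sketch on an invented edge list, using the standard library only. The helper `degree_from_edges` is hypothetical; it counts the distinct words each node is linked to and, like the unweighted degree used in the cell above, ignores link weights.

```python
from collections import defaultdict

def degree_from_edges(edges):
    """Number of distinct neighbours of each node in an undirected edge list."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {node: len(adj) for node, adj in neighbours.items()}

# Invented toy links between words
toy_edges = [("alice", "rabbit"), ("alice", "hole"),
             ("alice", "queen"), ("rabbit", "hole")]
degree = degree_from_edges(toy_edges)
print(degree["alice"], degree["rabbit"])  # 3 2
```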


**If we calculate connectivity as weighted betweenness centrality, we first have to define inverse weights, because Networkx treats weights as distances (which is the opposite of our case, where a higher weight means a stronger link).**

In [6]:
#Define inverse weights 
for u,v,data in G_filtered.edges(data=True):
    if 'weight' in data and data['weight'] != 0:
        data['inverse'] = 1/data['weight']
    else:
        data['inverse'] = 1   

#CONNECTIVITY
CONNECTIVITY_sequence=nx.betweenness_centrality(G_filtered, normalized=False, weight ='inverse')
#Calculate average score and standard deviation
avgCO = np.mean(list(CONNECTIVITY_sequence.values()))
stdCO = np.std(list(CONNECTIVITY_sequence.values()))
#Calculate standardized Connectivity for each brand
CONNECTIVITY = {}
for brand in brands:
    CO_brand = (CONNECTIVITY_sequence[brand] - avgCO) / stdCO
    CONNECTIVITY[brand] = CO_brand
    print("Connectivity", brand, CO_brand)

Connectivity alice 1.270318490240101
Connectivity rabbit 0.1026273469656663
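
The effect of inverting the weights can be seen on a three-word toy network. The sketch below is a plain Dijkstra search written with the standard library; it is not how Networkx is implemented, and the function `shortest_path` and the edge weights are assumptions made for illustration.

```python
import heapq

def shortest_path(edges, source, target, invert):
    """Dijkstra on an undirected edge dict {(u, v): co-occurrence weight}.

    With invert=True the distance of a link is 1/weight, otherwise the weight.
    """
    graph = {}
    for (u, v), w in edges.items():
        d = 1 / w if invert else w
        graph.setdefault(u, []).append((v, d))
        graph.setdefault(v, []).append((u, d))
    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == target:
            return dist, path
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, d in graph[node]:
            new = dist + d
            if new < best.get(nxt, float("inf")):
                best[nxt] = new
                heapq.heappush(heap, (new, nxt, path + [nxt]))
    return float("inf"), []

# Invented weights: two strong links and one weak direct link
toy = {("alice", "rabbit"): 5, ("rabbit", "hole"): 5, ("alice", "hole"): 1}
print(shortest_path(toy, "alice", "hole", invert=False)[1])  # ['alice', 'hole']
print(shortest_path(toy, "alice", "hole", invert=True)[1])   # ['alice', 'rabbit', 'hole']
```

With raw weights the weak direct link looks like the shortest route. With inverse weights the route through the two strong links is shorter, so 'rabbit' lies between the other two words, which is the behaviour we want for connectivity.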


**The Semantic Brand Score of each brand is finally obtained by summing the standardized values of prevalence, diversity and connectivity. Different approaches are also possible, such as taking the geometric mean of unstandardized coefficients.**

In [7]:
#Obtain the Semantic Brand Score of each brand
SBS = {}
for brand in brands:
    SBS[brand] = PREVALENCE[brand] + DIVERSITY[brand] + CONNECTIVITY[brand]
    print("SBS", brand, SBS[brand])
print("SBS: ",SBS)

SBS alice 13.580771053375013
SBS rabbit 0.1987657153679446
SBS:  {'alice': 13.580771053375013, 'rabbit': 0.1987657153679446}
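
As a sketch of the alternative mentioned above, the geometric mean can be computed from unstandardized scores. The function name and the input values are invented for illustration. The inputs are assumed to be non-negative raw scores: standardized values can be negative, so they are not suitable for this calculation.

```python
import math

def geometric_mean_sbs(prevalence, diversity, connectivity):
    """Geometric mean of three non-negative, unstandardized scores."""
    return math.prod([prevalence, diversity, connectivity]) ** (1 / 3)

# Invented raw values (frequency, degree, betweenness), for illustration only
print(round(geometric_mean_sbs(8, 27, 1), 6))  # 6.0
# A zero in any dimension brings the whole score to zero
print(geometric_mean_sbs(8, 27, 0))  # 0.0
```

Unlike the sum, this variant gives a score of zero to a brand that is absent in one of the three dimensions, which may or may not be desirable depending on the analysis.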


In [8]:
#Generate a final pandas data frame with all results
import pandas as pd

PREVALENCE = pd.DataFrame.from_dict(PREVALENCE, orient="index", columns = ["PREVALENCE"])
DIVERSITY = pd.DataFrame.from_dict(DIVERSITY, orient="index", columns = ["DIVERSITY"])
CONNECTIVITY = pd.DataFrame.from_dict(CONNECTIVITY, orient="index", columns = ["CONNECTIVITY"])
SBS = pd.DataFrame.from_dict(SBS, orient="index", columns = ["SBS"])

SBS = pd.concat([PREVALENCE, DIVERSITY, CONNECTIVITY, SBS], axis=1, sort=False)
SBS

| | PREVALENCE | DIVERSITY | CONNECTIVITY | SBS |
|---|---|---|---|---|
| alice | 6.716253 | 5.594199 | 1.270318 | 13.580771 |
| rabbit | 0.235021 | -0.138883 | 0.102627 | 0.198766 |
