# JS and KL divergence

Purpose: this notebook computes word probability distributions for two sets of data (ad and non-ad video metadata), then passes the two distributions to the Jensen-Shannon (JS) divergence function, which is built on the Kullback-Leibler (KL) divergence.  
Author: Lillie Godinez

**Outline**  
1. [Helper functions](#helpers)  
2. [Word probability function](#probs)
3. [JS and KL calculation functions](#calc)  
4. [Main function](#main)  
5. [Calculate for all data files](#df)

Load in the relevant libraries

In [1]:
import pandas as pd
from math import log2
from math import sqrt
from numpy import asarray
from collections import Counter
import os
import re

<a name="helpers"></a>
## Helper functions

In [4]:
def removeNumber(input_string):
    """
    Given a string, remove any numbers.
    """
    return re.sub(r'\d+', '', input_string)

In [5]:
def flatten_list(lst):
    """Flatten a list of lists of phrases, then split each phrase into single words."""
    wordlist = [item for sublist in lst for item in sublist]
    return [item for w in wordlist for item in w.split()]

In [6]:
# making the datalists
def makeDFs(df):
    """
    Given a dataframe of pyktok results, create a list of all suggested words for
    ad and non-ad videos separately. Returns a tuple of lists
    """
    #remove rows where suggested words are NaN
    df = df[df['suggested_words'].notna()]
    
    #clean the suggested words by removing the numbers
    df['suggested_words_cleaned'] = df['suggested_words'].apply(removeNumber)
    
    #create separate dfs for ads and non-ads
    ad = df[df['video_is_ad']==True]
    nonad = df[df['video_is_ad']==False]
    
    #turn the suggested words column into a list
    suggested_ad = list(ad['suggested_words_cleaned'])
    suggested_nonad = list(nonad['suggested_words_cleaned'])
    
    #split the words and lowercase them for each video
    #this gives us a list of lists (one for each video)
    suggested_ad = [string.lower().split(', ') for string in suggested_ad]
    suggested_nonad = [string.lower().split(', ') for string in suggested_nonad]

    # final list, flattened by our helper function
    suggested_ad = flatten_list(suggested_ad)
    suggested_nonad = flatten_list(suggested_nonad)

    return suggested_ad, suggested_nonad
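
To make the cleaning steps in `makeDFs` concrete, here is a minimal sketch that traces them on plain strings, without pandas. The two input strings are made-up examples, not rows from the dataset, and `remove_number` is a local stand-in for `removeNumber`.

```python
import re

def remove_number(input_string):
    """Strip every run of digits (stand-in for removeNumber above)."""
    return re.sub(r'\d+', '', input_string)

def flatten_list(lst):
    """Flatten a list of lists of phrases, then split phrases into words."""
    phrases = [item for sublist in lst for item in sublist]
    return [word for phrase in phrases for word in phrase.split()]

# Hypothetical 'suggested_words' cells for two videos (not real data)
rows = ["Summer Outfits 2023, Makeup Tutorial", "top 10 recipes, summer outfits"]

cleaned = [remove_number(r) for r in rows]            # digits removed
per_video = [c.lower().split(', ') for c in cleaned]  # one list of phrases per video
words = flatten_list(per_video)                       # one flat list of single words

print(words)
```

Because `flatten_list` finishes with `str.split()`, multi-word phrases are broken into individual words and any stray whitespace left behind by the digit removal disappears.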

In [7]:
def findUnion(lst1, lst2):
    """
    Given two lists of words, finds the union of their unique words.
    Returns a tuple of the union as a list and the size as an int.
    The size is the sum of the two unique-word counts, so a word
    that appears in both lists is counted twice.
    """
    set1 = set(lst1)
    set2 = set(lst2)
    
    size = len(set1) + len(set2)
    union = list(set1.union(set2))

    #print(f"Set of words in union: {len(union)}")
    #print(f"Set of total words: {size}")

    return union, size
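
The difference between the two return values is easy to miss, so here is a small sketch with made-up word lists. `find_union` is a local copy of the logic above (with the union sorted only to make the output stable), and `shared` is a name introduced here for illustration.

```python
def find_union(lst1, lst2):
    """Return the union of unique words and the sum of the two unique-word counts."""
    set1, set2 = set(lst1), set(lst2)
    size = len(set1) + len(set2)
    union = sorted(set1 | set2)  # sorted only so the example prints deterministically
    return union, size

# Hypothetical word lists (not real data)
ad_words = ["sale", "shoes", "sale", "summer"]
nonad_words = ["summer", "recipe", "dance"]

union, size = find_union(ad_words, nonad_words)

# Words present in both lists are counted twice in `size`,
# so the gap between the two numbers is the size of the overlap.
shared = size - len(union)

print(union, size, shared)
```

In other words, `size - len(union)` gives the number of unique words that ads and non-ads have in common.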

<a name="probs"></a>
## Word probabilities

In [8]:
def calculate_word_probabilities(word_list):
    """
    Given a word list, count the occurrences of each unique word
    and divide each count by the length of the list. Returns a
    dictionary with unique words as keys and each word's probability
    of occurrence within the dataset as values.
    """
    # Step 1: Count occurrences of each word
    word_counts = Counter(word_list)

    # Step 2: Calculate total number of words
    total_words = len(word_list)

    # Step 3: Calculate probability distribution
    word_probabilities = {}
    for word, count in word_counts.items():  
        word_probabilities[word] = (count / total_words)

    return word_probabilities
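
As a quick check of the function above, the following sketch runs a compact equivalent on a four-word list. The input words are invented for illustration, and the dictionary comprehension is a stylistic shortcut rather than the notebook's own code.

```python
from collections import Counter
from math import isclose

def calculate_word_probabilities(word_list):
    """Map each unique word to its relative frequency in word_list."""
    total_words = len(word_list)
    return {word: count / total_words for word, count in Counter(word_list).items()}

# Hypothetical input (not real data)
probs = calculate_word_probabilities(["summer", "outfits", "summer", "recipes"])
print(probs)

# A valid probability distribution sums to 1
total = sum(probs.values())
print(isclose(total, 1.0))

# An empty list yields an empty dict; no division happens because there are no counts
empty = calculate_word_probabilities([])
```

The probabilities are relative frequencies within one list, so the ad and non-ad distributions each sum to 1 on their own.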

<a name="calc"></a>
## JS and KL calculation functions
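
For reference, these are the standard definitions that the functions in this section follow, written with base-2 logarithms to match the `log2` import. The symbols $P$, $Q$, $M$, $p_i$, $q_i$ and $d_{\mathrm{JS}}$ are notation introduced here, not names from the notebook.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i},
\qquad
M = \tfrac{1}{2}\,(P + Q)

\mathrm{JS}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \,\|\, M),
\qquad
d_{\mathrm{JS}}(P, Q) = \sqrt{\mathrm{JS}(P \,\|\, Q)}
```

Terms with $p_i = 0$ contribute zero by convention. With base-2 logarithms the JS divergence lies between 0 and 1, and swapping $P$ and $Q$ leaves it unchanged.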

In [9]:
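def kl_divergence(p, q):
    """
    Calculate the KL divergence KL(p || q) in bits (log base 2).
    Terms where p[i] or q[i] is zero are skipped. This is safe when
    called from js_divergence, where m[i] is zero only if p[i] and
    q[i] are both zero. For a standalone KL with q[i] == 0 and
    p[i] > 0, the true value is infinite, which this function
    does not report.
    """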
    s=[]
    for i in range(len(p)):
        if q[i]!=0 and p[i]!=0:
            s.append(p[i] * log2(p[i]/q[i]))
    return sum(s)

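def js_divergence(p, q):
    """
    Calculate the JS divergence between p and q, given as NumPy
    arrays of equal length. With log base 2 the result lies
    between 0 and 1.
    """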
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
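
A few sanity checks help confirm that the two functions behave as expected. The sketch below is a pure-Python variant that works on plain lists; that is a substitution made here so the example needs no NumPy. In the notebook, `p + q` relies on NumPy arrays for elementwise addition, whereas with plain lists `p + q` would concatenate. The test distributions are invented.

```python
from math import log2, sqrt, isclose

def kl_divergence(p, q):
    """KL(p || q) in bits, skipping terms where either probability is zero."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi != 0 and qi != 0)

def js_divergence(p, q):
    """JS divergence in bits, with the mixture m built elementwise."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Identical distributions: divergence is 0
same = js_divergence([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])

# Distributions with no shared support: divergence reaches its maximum of 1 bit
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

# Swapping the arguments does not change the result
p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
forward = js_divergence(p, q)
backward = js_divergence(q, p)

# The JS distance is the square root of the divergence
distance = sqrt(forward)

print(same, disjoint, forward, backward, distance)
```

The symmetry check explains why any "pq" and "qp" results computed with `js_divergence` come out identical.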

<a name="main"></a>
## Main function

In [12]:
def main(df):
    """
    Given a df of pyktok results for one user:
        1. create lists of all words for ads and non-ads
        2. find the union of these two lists
        3. calculate the probability of each unique word
        4. create lists of word probabilities for each word in
        the union, separately for ads and non-ads
        5. plug the lists from step 4 into the JS divergence
        equation to find the divergence and distance
        6. gather all relevant info into a dictionary
        and return the dict
    """
    #find word lists
    suggested_ad, suggested_nonad = makeDFs(df)
    
    #find union and size
    union, size = findUnion(suggested_ad, suggested_nonad)
    
    #calc word probabilities
    word_probabilities_ad = calculate_word_probabilities(suggested_ad)
    word_probabilities_nonad = calculate_word_probabilities(suggested_nonad)
    
    #create the lists of probability values for each word in union
    #separately for ads and non-ads
    word_probabilities_ad_union = []
    word_probabilities_nonad_union = []
    for i in union:
        if i in word_probabilities_ad:
            word_probabilities_ad_union.append(word_probabilities_ad[i])
        else:
            word_probabilities_ad_union.append(0)
        if i in word_probabilities_nonad:
            word_probabilities_nonad_union.append(word_probabilities_nonad[i])
        else:
            word_probabilities_nonad_union.append(0)

    p = asarray(word_probabilities_ad_union)
    q = asarray(word_probabilities_nonad_union)
    
    # calculate JS(P || Q)
    js_pq = js_divergence(p, q)
    
    # calculate JS(Q || P); JS divergence is symmetric, so this
    # should match js_pq and serves as a sanity check
    js_qp = js_divergence(q, p)
    
    #gather info into a dict
    oneRow = {'num words in union': len(union), 'num total words': size,
            'pq divergence': js_pq, 'pq distance': sqrt(js_pq), 
            'qp divergence': js_qp, 'qp distance': sqrt(js_qp)}
    
    return oneRow
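
Step 4 of `main` is the part that makes the two distributions comparable: both are laid out over the same vocabulary, with 0 for words a group never used. A minimal sketch of that alignment, using `dict.get` with a default in place of the `if`/`else` branches, could look like this. `align_on_union` and the two small dictionaries are hypothetical names and values introduced for illustration.

```python
def align_on_union(probs_a, probs_b):
    """Return a shared vocabulary and two probability lists in the same word order."""
    vocab = sorted(set(probs_a) | set(probs_b))  # sorted for a reproducible order
    p = [probs_a.get(word, 0.0) for word in vocab]  # 0.0 when the word is absent
    q = [probs_b.get(word, 0.0) for word in vocab]
    return vocab, p, q

# Hypothetical word probabilities (not real data)
ad = {"sale": 0.5, "summer": 0.5}
nonad = {"summer": 0.25, "recipe": 0.75}

vocab, p, q = align_on_union(ad, nonad)
print(vocab)
print(p)
print(q)
```

Position `i` in both lists refers to the same word, which is what the divergence functions assume when they index `p[i]` and `q[i]` together.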

<a name="df"></a>
## Calculate the divergence and distance for each file in the dataset

Create a list of dictionaries with information for all pyktok data files. 

In [48]:
allData = []

for f in os.listdir('data'):
    #dataframe of pyktok data
    df = pd.read_csv(os.path.join('data', f))
    
    #generate dict with divergence and distance scores
    oneRow = main(df)
    
    #finds user code to add to df
    code = re.search(r'\d+', f)[0]
    oneRow['code'] = code
    
    #append dict to list
    allData.append(oneRow)
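
The loop above assumes every entry in `data` is a CSV whose name contains a run of digits. If that might not hold, `re.search` returns `None` and indexing it with `[0]` raises a `TypeError`. A defensive sketch follows. The file names and the helper `user_code_from_filename` are hypothetical, chosen only to illustrate the pattern.

```python
import os
import re

def user_code_from_filename(filename):
    """Return the first run of digits in filename, or None if there is none."""
    match = re.search(r'\d+', filename)
    return match[0] if match else None

# Hypothetical directory listing (not the real contents of 'data')
names = ["user_10824.csv", "notes.txt", "user_12345.csv"]

# Keep only CSV files and build their paths portably
csv_paths = [os.path.join("data", name) for name in names if name.endswith(".csv")]

# Files without digits yield None instead of raising
codes = [user_code_from_filename(name) for name in names]

print(csv_paths)
print(codes)
```

Skipping entries whose code is `None` would keep stray files from interrupting the run.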

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['suggested_words_cleaned'] = df['suggested_words'].apply(removeNumber)
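
This pandas warning is raised when a new column is assigned to a dataframe that was produced by filtering another dataframe. One common way to avoid it is to take an explicit copy of the filtered rows before adding columns. The sketch below assumes a tiny made-up dataframe with the same column names, and uses `Series.str.replace` as a substitute for `apply(removeNumber)`.

```python
import pandas as pd

# Hypothetical stand-in for one pyktok results file (not real data)
df = pd.DataFrame({
    "suggested_words": ["summer outfits 2023", None, "top 10 recipes"],
    "video_is_ad": [True, False, False],
})

# Filter, then copy, so the new column is added to an independent dataframe
kept = df[df["suggested_words"].notna()].copy()

# Remove digits from the suggested words
kept["suggested_words_cleaned"] = kept["suggested_words"].str.replace(r"\d+", "", regex=True)

print(kept)
```

The original `df` is left untouched, and the cleaned column lives only on `kept`.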

Load the list from above into a new dataframe

In [49]:
divergence_df = pd.DataFrame(allData)
divergence_df = divergence_df.set_index('code')

Find the statistics of the dataframe 

In [50]:
divergence_df.describe()

| | num words in union | num total words | pq divergence | pq distance | qp divergence | qp distance |
|---|---|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 15167.6 | 16983.6 | 0.523338 | 0.72106 | 0.523338 | 0.72106 |
| std | 2995.366806 | 3623.468476 | 0.096211 | 0.065297 | 0.096211 | 0.065297 |
| min | 12052.0 | 12470.0 | 0.410198 | 0.640467 | 0.410198 | 0.640467 |
| 25% | 12210.0 | 14039.0 | 0.500175 | 0.70723 | 0.500175 | 0.70723 |
| 50% | 15382.0 | 17659.0 | 0.504739 | 0.71045 | 0.504739 | 0.71045 |
| 75% | 17761.0 | 20287.0 | 0.525473 | 0.724895 | 0.525473 | 0.724895 |
| max | 18433.0 | 20463.0 | 0.676108 | 0.822258 | 0.676108 | 0.822258 |


Add mean and standard deviation as rows to the dataframe

In [51]:
toAdd = pd.DataFrame(divergence_df.describe().loc[['mean','std']])
divergence_df = pd.concat([divergence_df,toAdd])

Final results

In [53]:
divergence_df

| | num words in union | num total words | pq divergence | pq distance | qp divergence | qp distance |
|---|---|---|---|---|---|---|
| 10824 | 12052.0 | 12470.0 | 0.676108 | 0.822258 | 0.676108 | 0.822258 |
| 12345 | 18433.0 | 20463.0 | 0.504739 | 0.71045 | 0.504739 | 0.71045 |
| 26301 | 15382.0 | 17659.0 | 0.525473 | 0.724895 | 0.525473 | 0.724895 |
| 33534 | 12210.0 | 14039.0 | 0.500175 | 0.70723 | 0.500175 | 0.70723 |
| 50405 | 17761.0 | 20287.0 | 0.410198 | 0.640467 | 0.410198 | 0.640467 |
| mean | 15167.6 | 16983.6 | 0.523338 | 0.72106 | 0.523338 | 0.72106 |
| std | 2995.366806 | 3623.468476 | 0.096211 | 0.065297 | 0.096211 | 0.065297 |
