# CS 315 - Jaccard Index Analysis

Author: Audrey Yip (with help from Eni Mustafaraj)

Date: Feb 14 2024

**Table of Contents**
1. [Importing Data](#sec1)
2. [Exploratory Analysis](#sec2)
3. [Jaccard Analysis](#sec3)
4. [Visualizations](#sec4)

In [1]:
import os, csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


<a id="sec1"></a>
### 1. Importing Data

In [36]:
files = os.listdir('data') # list the contents of the data directory
files

['02-14-16-57_save_data_all_videos_AY.csv',
 '02-14-16-57_data_saved_videos_AY.csv',
 '.DS_Store',
 '02-14-19-04_save_data_all_videos_AY.csv',
 '02-14-16-57_control_data_all_videos_AY.csv',
 '02-14-19-03_control_data_all_videos_AY.csv',
 '02-14-19-04_data_saved_videos_AY.csv']
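
The file names above appear to encode a collection timestamp and an experimental condition. The sketch below shows one way to pull both out. It assumes the first 11 characters are `MM-DD-HH-MM` and that the year is 2024 (taken from the notebook date); `parse_filename` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime

# File names copied from the directory listing above
files = ['02-14-16-57_save_data_all_videos_AY.csv',
         '02-14-16-57_data_saved_videos_AY.csv',
         '.DS_Store',
         '02-14-19-04_save_data_all_videos_AY.csv',
         '02-14-16-57_control_data_all_videos_AY.csv',
         '02-14-19-03_control_data_all_videos_AY.csv',
         '02-14-19-04_data_saved_videos_AY.csv']

def parse_filename(name, year=2024):
    """Hypothetical helper: return (collection datetime, condition)."""
    stamp = name[:11]  # assumed layout: MM-DD-HH-MM
    collected = datetime.strptime(f"{year}-{stamp}", "%Y-%m-%d-%H-%M")
    condition = "control" if "control" in name else "save"
    return collected, condition

# Keep only the "all videos" files, which also skips .DS_Store
all_video_files = [f for f in files if "data_all" in f]
parsed = {f: parse_filename(f) for f in all_video_files}
```

Parsing the timestamp into a `datetime` makes it possible to sort or compare runs chronologically instead of by string.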

In [39]:
## Helper functions
def createPostID(row):
    """helper function: create a new value using music and author"""
    return f"{row['music']}_{row['author']}"

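# str has no .type() method, so check the whole value with isinstance instead; NaN (float) hashtags become []
# splitter = lambda x: [s.strip() for s in x.split(',')] if isinstance(x, str) else []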
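
As a possible alternative to the row-wise `apply`, the same `music_author` ID could be built with vectorized string concatenation. This is a minimal sketch on a two-row toy frame (values copied from the sample output in this notebook); it assumes both columns can be safely cast to strings.

```python
import pandas as pd

# Toy frame with two rows taken from the sample data
df = pd.DataFrame({
    "music": ["original sound - ESPN", "Asn - thsituan._."],
    "author": ["espn", "ccabots"],
})

# Column-wise concatenation gives the same result as applying createPostID per row
df["postID"] = df["music"].astype(str) + "_" + df["author"].astype(str)
print(df["postID"].tolist())
```

On large frames the vectorized form is usually faster than `apply(..., axis=1)`, though for a handful of small CSV files either approach works.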

In [40]:
# create 3 dfs: control data, save data, all data
onlyAllData = [f for f in files if 'data_all' in f] # filtering with list comprehension

control_dfs = []
save_dfs = []
all_dfs = []



for fN in onlyAllData: # iterate over the files that contain all posts
    path = os.path.join('data', fN) # create file path
    df = pd.read_csv(path) # create dataframe

    # add column to indicate date/time collected
    df['collectionTime'] = fN[:11]

    # add column with unique post id
    df['postID'] = df.apply(createPostID, axis=1) # use axis=1 to process one row at a time

    # split hashtags into list
    #df['hashtag'] = df['hashtag'].apply(splitter)

    if 'control' in fN:
        control_dfs.append(df)
    else:
        save_dfs.append(df)

    all_dfs.append(df)


df_control = pd.concat(control_dfs, ignore_index=True)
df_save = pd.concat(save_dfs, ignore_index=True)
df_all = pd.concat(all_dfs, ignore_index=True)

print("control shape:", df_control.shape)
print("save shape:", df_save.shape)
print("all data shape:", df_all.shape)


AttributeError: 'str' object has no attribute 'type'

In [16]:
df_control.head()

Unnamed: 0,batch,index,music,hashtag,author,likes,comments,shares,saves,collectionTime,postID
0,1,0,original sound - ®️ Louieveegang💙,@louieveemoonteenmoonfam,louieveedee,297300,1571,20300,721,02-14-16-57,original sound - ®️ Louieveegang💙_louieveedee
1,1,1,original sound - dayz915,"atodamadre%F0%9F%8F%A7, %F0%9F%A4%99%F0%9F%8F%...",dayz.915,522299,3713,36500,142900,02-14-16-57,original sound - dayz915_dayz.915
2,1,2,original sound - DermDoctor | Dr. Shah,"@poblanopepp, dermatographia",dermdoctor,430100,4758,24900,6591,02-14-16-57,original sound - DermDoctor | Dr. Shah_dermdoctor
3,1,3,original sound - 𝐝𝐢𝐧𝐨 🎀,"@lashedchars, fy, fyp, fyp%E3%82%B7%E3%82%9Avi...",marschvarl,251,29,31,0,02-14-16-57,original sound - 𝐝𝐢𝐧𝐨 🎀_marschvarl
4,1,4,original sound - AfroSamuraiT,"lifehack, lifehacks, challenge, ksi, prime",afrosamuraitt,65200,3487,11700,189,02-14-16-57,original sound - AfroSamuraiT_afrosamuraitt


In [29]:
type(df_control['hashtag'][0])

str

<a id="sec2"></a>
### 2. Exploratory Analysis

In [4]:
ourpath = "data" # make sure this matches the directory that holds your data files

df_test = pd.read_csv(os.path.join(ourpath, onlyAllData[0])) # get the first file
df_test.head()

Unnamed: 0,batch,index,music,hashtag,author,likes,comments,shares,saves
0,1,0,original sound - ESPN,"@wendy.bri, snow, cold",espn,2000000,19900,170800,108600
1,1,1,I Have No Enemies - ★DGK132105,"@asianjeffontop, toastyog, streamer, fortnite,...",toasty1k,51200,441,3322,59
2,1,2,Asn - thsituan._.,,ccabots,203000,4712,124700,3817
3,1,3,Like That Sped Up - Laila!,"@bareandneutral, ingrown",dermdoctor,357800,561,11300,2068
4,1,4,original sound - Funny,,baby_ok7,3100000,34500,378200,81300


In [18]:
# check for null values
df_all[['music','author', 'hashtag']].isna().sum() 

music        0
author       0
hashtag    475
dtype: int64

In [21]:
df_all['hashtag'][0].split(',')

['@wendy.bri', ' snow', ' cold']
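
The split above leaves leading spaces on the tags, and the null check shows that some `hashtag` values are missing (pandas reads them as float `NaN`). A minimal sketch of a cleaner splitter follows; `split_hashtags` is a hypothetical name, and returning a set is an assumption made so the result can feed directly into a set-based comparison.

```python
def split_hashtags(value):
    """Hypothetical helper: turn a comma-separated hashtag string into a set.

    Non-string values (such as the float NaN pandas uses for missing cells)
    yield an empty set.
    """
    if not isinstance(value, str):
        return set()
    # Strip whitespace around each tag and drop empty fragments
    return {tag.strip() for tag in value.split(',') if tag.strip()}

print(split_hashtags('@wendy.bri, snow, cold'))
print(split_hashtags(float('nan')))
```

With a DataFrame this could be applied as `df_all['hashtag'].apply(split_hashtags)`.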

<a id="sec3"></a>
### 3. Jaccard Analysis 

In [3]:
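## Define function for calculating the Jaccard index

def jaccard_index(list_1, list_2):
    """Return |A ∩ B| / |A ∪ B| for two collections; 0 if both are empty."""
    set_1, set_2 = set(list_1), set(list_2)  # accept lists as well as sets
    intersection = len(set_1.intersection(set_2))
    union = len(set_1.union(set_2))
    return intersection / union if union != 0 else 0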
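
The Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|: the share of all distinct items that appear in both. The example below is a small worked case. The function is restated so the block runs on its own, and the two tag sets (`run_a`, `run_b`) are made up for illustration rather than taken from the collected data.

```python
def jaccard_index(items_1, items_2):
    """Jaccard index of two collections; defined as 0 when both are empty."""
    set_1, set_2 = set(items_1), set(items_2)
    union = set_1 | set_2
    return len(set_1 & set_2) / len(union) if union else 0

# Illustrative tag sets, not real data
run_a = {"fyp", "snow", "cold"}
run_b = {"fyp", "cold", "espn", "nba"}

# Shared: {"fyp", "cold"} (2 items); union has 5 distinct items
print(jaccard_index(run_a, run_b))  # 0.4
print(jaccard_index([], []))        # 0
```

A value of 1 means the two collections contain exactly the same items, and 0 means they share nothing. The same call could compare the `postID` values of two collection runs, for example `jaccard_index(df_control['postID'], df_save['postID'])`.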

<a id="sec4"></a>
### 4. Visualizations