The main goal of this notebook is to create CSV files ready to be imported into Gephi.

GB (Great Britain), US (United States) and CA (Canada) probably have the most closely related popular videos, so the CSV files can be built from the GB dataset only, from GB+US, or from GB+US+CA.

Set the 'country' global variable to 'GB', 'GBUS' or 'GBUSCA' to choose the dataset. Then click '>>' to run the notebook.

Attention: some cells may take a long time to run.

In [2]:
import pandas as pd
#options: GB, GBUS, GBUSCA
country = "GBUS"

### First : original videos

Load the raw data from the original CSV files.

In [2]:
#load data from the CSV files
videosGB = pd.read_csv('../OriginalCSVs/GBvideos.csv', sep=',', decimal='.', header=0)
videosUS = pd.read_csv('../OriginalCSVs/USvideos.csv', sep=',', decimal='.', header=0)
videosCA = pd.read_csv('../OriginalCSVs/CAvideos.csv', sep=',', decimal='.', header=0)

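#concatenate the selected datasets; ignore_index=True gives every row a unique
#index label, which the pairwise comparison in the edges step (i2 > i1) relies on
if country == 'GB':
    videos = videosGB
elif country == 'GBUS':
    videos = pd.concat([videosGB, videosUS], ignore_index=True)
elif country == 'GBUSCA':
    videos = pd.concat([videosGB, videosUS, videosCA], ignore_index=True)
else:
    raise ValueError("country must be 'GB', 'GBUS' or 'GBUSCA'")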
    
#process the dates
videos['trending_date'] = pd.to_datetime(videos['trending_date'], format='%y.%d.%m')
videos['publish_time'] = pd.to_datetime(videos['publish_time'], format='%Y-%m-%dT%H:%M:%S.%fZ')
    
#GB: 34297 videos
#GB+US: 69646 videos
#GB+US+CA: 104929 videos
videos.shape

(34297, 16)
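
The `trending_date` format is easy to misread, because it is year.day.month rather than year.month.day. The sketch below checks both format strings with the standard library's `datetime.strptime`, which uses the same strftime-style directives that `pd.to_datetime` accepts. The two raw strings are illustrative values written by hand, assumed to match the layout of the original CSV files.

```python
from datetime import datetime

# Illustrative raw values (assumed to match the layout of the original CSV files)
raw_trending = '17.14.11'                    # two-digit year . day . month
raw_published = '2017-11-11T17:00:00.000Z'   # ISO-like timestamp with a literal Z

# Same format strings as in the cell above
trending = datetime.strptime(raw_trending, '%y.%d.%m')
published = datetime.strptime(raw_published, '%Y-%m-%dT%H:%M:%S.%fZ')

print(trending.date())
print(published)
```

If the day and month were swapped in the format string, a value such as '17.14.11' would fail to parse, since 14 is not a valid month.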

### Second : no duplicated videos

Now we want to build a graph with all the videos, but some videos trend on several consecutive days and therefore appear more than once. The idea is to remove the duplicates and keep the row for the last trending day, so that each video keeps its most recent statistics.

In [3]:
#remove duplicated videos, keeping the last occurrence of each video_id
videos = videos.drop_duplicates(subset='video_id', keep='last')
#remove videos without tags
videosNone = videos[videos.tags != '[none]']
#keep only the statistics we are interested in
nodes = videosNone.loc[:, ['video_id','title','channel_title','category_id','trending_date','publish_time','views','likes','dislikes','comment_count']]

#create the CSV
newheader = ['id','Label','channel_title','category_id','trending_date','publish_time','views','likes','dislikes','comment_count']
nodes.to_csv('../CreatedCSVs/nodes'+country+'.csv',header=newheader,index=False)

#GB: 2847 nodes
#GB+US: 7630 nodes
#GB+US+CA: 24827 nodes
print(nodes.shape)

nodes.head()

(2847, 10)


Unnamed: 0,video_id,title,channel_title,category_id,trending_date,publish_time,views,likes,dislikes,comment_count
5,AumaWl0TNBo,How My Relationship Started!,PointlessBlogVlogs,24,2017-11-14,2017-11-11 17:00:00,1182775,52708,1431,2333
7,-N5eucPMTTc,CHRISTMAS HAS GONE TO MY HEAD,MoreZoella,22,2017-11-14,2017-11-10 19:19:43,1164201,57309,749,624
22,fiusxyygqGk,Marshmello - You & Me (Official Music Video),marshmello,10,2017-11-14,2017-11-10 15:00:03,3407008,207262,3167,13279
90,sLJdBmAeB_U,COME SHOPPING WITH ME AND TRY ON NEW CLOTHING ...,Inthefrow,26,2017-11-14,2017-11-07 19:00:50,87772,2617,86,192
92,mCx26FLXWuI,Seth Rollins & Dean Ambrose vs. Cesaro & Sheam...,WWE,17,2017-11-14,2017-11-07 04:52:25,1689382,24186,3330,3414
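
Note that `keep='last'` keeps the last row in file order, which is only the last trending day if the rows are in chronological order. That may not hold once several country files are concatenated. A minimal sketch of one way to guard against it, on a hand-made toy frame (the values are invented for illustration), is to sort by `trending_date` before dropping duplicates:

```python
import pandas as pd

# Toy data: video 'a' appears three times, and its rows are not in date order,
# as could happen after concatenating two country files.
toy = pd.DataFrame({
    'video_id': ['a', 'a', 'a', 'b'],
    'trending_date': pd.to_datetime(
        ['2017-11-14', '2017-11-16', '2017-11-15', '2017-11-14']),
    'views': [100, 300, 180, 50],
})

# Without sorting, keep='last' keeps the row that comes last in the frame
naive = toy.drop_duplicates(subset='video_id', keep='last')

# Sorting by date first makes "last" mean "most recent trending day"
latest = (toy.sort_values('trending_date', kind='stable')
             .drop_duplicates(subset='video_id', keep='last')
             .sort_index())

print(naive)
print(latest)
```

In the toy frame the naive version keeps the 2017-11-15 row for video 'a', while the sorted version keeps the 2017-11-16 row.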


### Third : tags

Transform the tag string of each video (dataset row), formatted as ' tag1 "|" tag2 "|" ... ', into a list of tags.

In [4]:
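#get tags (copy, so that the assignment below does not touch videosNone)
nodesTemp = videosNone.loc[:, ['video_id','tags','channel_title']].copy()

def split_tags(raw):
    #split the raw tag string into a list of tags
    parts = raw.split('"|"')
    #fix the last tag: drop the trailing quote
    parts[-1] = parts[-1].split('"')[0]
    #fix the first tag: it is not quoted, so it is still glued to the second one
    first = parts[0].split('|"')
    parts[0] = first[0]
    if len(first) == 2:
        parts.append(first[1])
    return parts

#build the list of tags for each video; rows yielded by iterrows() may be copies,
#so the column is assigned explicitly instead of being modified inside a loop
nodesTemp['tags'] = nodesTemp['tags'].apply(split_tags)

#save the tags so that they can be reused in other files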
#GB: tagsGB
#GB+US: tagsGBUS
#GB+US+CA: tagsGBUSCA
nodesTemp.to_csv('../CreatedCSVs/tags'+country+'.csv',header=['id','tags','channel'],index=False)

nodesTemp.head()

Unnamed: 0,video_id,tags,channel_title
5,AumaWl0TNBo,"[pointlessblog, pointlessblogvlogs, games, gam...",PointlessBlogVlogs
7,-N5eucPMTTc,"[zoe sugg, zoe, vlog, vlogging, vlogs, daily, ...",MoreZoella
22,fiusxyygqGk,"[selena gomez wolves, marshmello - alone, mars...",marshmello
90,sLJdBmAeB_U,"[Inthefrow, COME SHOPPING WITH ME, LUXURY BEAU...",Inthefrow
92,mCx26FLXWuI,"[wwe, wrestling, wrestler, wrestle, superstars...",WWE
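
As a point of comparison, the same parsing can be sketched more compactly by splitting on the bare `|` separator and stripping the quotes from each piece. This is an alternative under one stated assumption, namely that no tag contains a `|` character itself. It is not the code used above, and unlike the cell above it keeps the tags in their original order. The function name `split_tags_simple` is hypothetical.

```python
def split_tags_simple(raw):
    """Split a raw tag string such as 'a|"b c"|"d"' into ['a', 'b c', 'd'].

    Assumes that individual tags never contain the '|' character.
    """
    if raw == '[none]':
        # Marker used in the dataset for videos without tags
        return []
    return [part.strip('"') for part in raw.split('|')]


print(split_tags_simple('zoe sugg|"zoe"|"vlog"'))
print(split_tags_simple('single'))
print(split_tags_simple('[none]'))
```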


### Fourth : edges

Now we'll create the edges CSV. First, we need to know which nodes have common tags. Then, we can create the dataset and save it.

In [5]:
weights = []
sources = []
targets = []
tags = []

def comparelists( id1, list1, id2, list2 ):
    intersect = set(list1).intersection(list2)
    w = len(intersect)
    if w > 0:
        tags.append(list(intersect))
        weights.append(w)
        sources.append(id1)
        targets.append(id2)
            
#for each node, find all the nodes that share at least one tag with it
for i1, tags1 in nodesTemp.iterrows():
    for i2, tags2 in nodesTemp.iterrows():
        #only compare with the rows that come after, so each pair is checked once
        if i2 > i1:
            #check whether the two videos have any tag in common
            comparelists(tags1['video_id'],tags1['tags'],tags2['video_id'],tags2['tags'])

print("done")

done
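
The double loop above compares every pair of videos, so its running time grows quadratically with the number of nodes. A possible alternative, sketched here on invented toy data, is to build an inverted index from each tag to the videos that use it, and to generate pairs only among videos that share a tag. The function `build_edges` and its input format (a dict from video id to tag list) are assumptions made for this illustration; they are not part of the notebook.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(tag_lists):
    """Return {(source, target): set of shared tags} for videos sharing a tag.

    tag_lists maps a video id to its list of tags (hypothetical input format).
    """
    # Inverted index: tag -> ids of the videos that carry it, in input order
    videos_by_tag = defaultdict(list)
    for video_id, tags in tag_lists.items():
        for tag in set(tags):  # set() so a repeated tag is counted once
            videos_by_tag[tag].append(video_id)

    # Only videos listed under the same tag can be connected
    shared = defaultdict(set)
    for tag, ids in videos_by_tag.items():
        for source, target in combinations(ids, 2):
            shared[(source, target)].add(tag)
    return dict(shared)


toy = {
    'v1': ['vlog', 'funny', 'zoe'],
    'v2': ['vlog', 'zoe'],
    'v3': ['funny'],
    'v4': ['music'],
}
toy_edges = build_edges(toy)
for (source, target), common in toy_edges.items():
    # The edge weight is the number of shared tags
    print(source, target, len(common), sorted(common))
```

The saving depends on the data: if one very common tag is shared by most videos, the number of generated pairs is still close to quadratic.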


In [6]:
#create dataframe            
df = {'Source' : sources, 'Target' : targets}   
edges = pd.DataFrame(df)
edges['Type'] = 'Undirected'
edges.to_csv('../CreatedCSVs/edges'+country+'.csv',index=False)

#GB: 194178 edges
#GB+US:
#GB+US+CA: 
print(edges.shape)

edges.head()

(194178, 3)


Unnamed: 0,Source,Target,Type
0,AumaWl0TNBo,-N5eucPMTTc,Undirected
1,AumaWl0TNBo,sLJdBmAeB_U,Undirected
2,AumaWl0TNBo,iDXlWTRyxgE,Undirected
3,AumaWl0TNBo,FmYMq9EHhdE,Undirected
4,AumaWl0TNBo,tSjvx_c4meE,Undirected


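### Fifth : weighted edges

While building the edges, we also stored the number of tags that each pair of videos has in common (the `weights` list) and the shared tags themselves (the `tags` list), so we can now add them to the edges as the Weight and Tags columns.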

In [7]:
#copy, so that the unweighted edges dataframe is left unchanged
edgesWeight = edges.copy()
edgesWeight['Weight'] = weights
edgesWeight['Tags'] = tags
edgesWeight.to_csv('../CreatedCSVs/edgesWeightned'+country+'.csv',index=False)

edgesWeight.head()

Unnamed: 0,Source,Target,Type,Weight,Tags
0,AumaWl0TNBo,-N5eucPMTTc,Undirected,8,"[zoella, british, puppy, zoe, daily, nala, poi..."
1,AumaWl0TNBo,sLJdBmAeB_U,Undirected,1,[vlog]
2,AumaWl0TNBo,iDXlWTRyxgE,Undirected,1,[funny]
3,AumaWl0TNBo,FmYMq9EHhdE,Undirected,2,"[first, vlog]"
4,AumaWl0TNBo,tSjvx_c4meE,Undirected,1,[vlog]


In [3]:
edgesWeight = pd.read_csv('../CreatedCSVs/edgesWeightned'+country+'.csv', sep=',', decimal='.', header=0)
edgesWeight.Weight.describe()

count    1.254112e+06
mean     1.854477e+00
std      2.907944e+00
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      2.000000e+00
max      5.700000e+01
Name: Weight, dtype: float64