# **Problem Statement:**
Explore the product co-purchasing network data sets shared by SNAP (Stanford Large Network Dataset Collection): [link](https://snap.stanford.edu/data/#amazon). The key task is to create an undirected graph and, using standard graph functions, derive key insights from it. The assessment expectations were shared earlier in this document. The tools and data sets listed below must be used. The approach to building this capability is expected to be independent and original.


## **Tools & Data Set Details (IMPORTANT)**
**Notebook (Google Colab):** https://colab.research.google.com/

#### **Languages & framework:** PySpark | Spark 2.4+

---
#### **Data Set Details (SNAP):** https://snap.stanford.edu/data/#amazon
#### **Meta Data:** https://snap.stanford.edu/data/amazon-meta.html
#### **Data Set:** https://snap.stanford.edu/data/amazon0601.html
#### **SNAP Libraries:** https://pypi.org/project/snap-stanford/

---





# **Approach towards deriving insights**
We read, extracted, processed, and analyzed the data with the pipeline below. Based on that analysis, we present insights for increasing sales by:

1.   Targeting popular products
2.   Targeting products that create demand and lead to co-purchases
3.   Increasing the number of co-purchase pairs



# **Index**

<a href='#1'>Step 1: Load the data set (only using Spark based) </a> <br>
<a href='#2'>Step 2: Data extraction (both transaction and meta data)</a><br>
<a href='#3'>Step 3: Understanding of the provided data set (both transaction and meta data)</a></br>
<a href='#4'>Step 4: Linking transactional data and meta data</a></br>
<a href='#5'>Step 5: Analytics based on metadata linked transaction information </a></br>


1.   <a href='#5.1'>How does purchasing a product depend on the average rating of the product bought first?</a></br>
2.   <a href='#5.2'>How does co-purchasing a product depend on its average rating?</a></br>
3.   <a href='#5.3'>Analyzing co-purchases within the same group</a></br>
4.   <a href='#5.4'>Analyzing co-purchases across different groups</a></br>

<a href='#6'>Step 6: Graph Representation</a></br>
1. <a href='#6.1'>Creation of graph using transaction data </a></br>
2. <a href='#6.2'>Connected component </a></br>
3. <a href='#6.3'>Get indegree and outdegree for every node</a></br>
4. <a href='#6.4'>Importance/Popularity Score for every node</a></br>

<a href='#7'>Step 7: Using Graph based analysis to improve the sale and drive the focus area </a></br>
1. <a href='#7.1'>Popular products are the ones with high page_rank/centrality score </a></br>
2. <a href='#7.2'>Products which might rise to popularity </a></br>
3. <a href='#7.3'>Products which should be targeted in order to trigger the sale of others</a></br>
4. <a href='#7.4'>Product sale within the same and diverse product group </a></br>

<a href='#8'>Step 8: Improving product recommendation </a></br>
1.  <a href='#8.1'>Using open triads </a></br>
2. <a href='#8.2'>Using Node2vec </a></br>

<a href='#9'>Step 9: Proposed extension </a></br>
1.  <a href='#9.1'>Title-based similarity score </a></br>
2. <a href='#9.2'>Using category navigation  </a></br>






## **Step 1:** Load the data set (only using Spark based) <a name='1'></a>

In [None]:
'''Preparing the environment'''
import os
import sys
if 'COLAB_GPU' in os.environ:
    !pip install pyspark
    !pip install snap-stanford
    !pip install node2vec
    !git clone https://github.com/snap-stanford/snap-python.git
    !git clone https://github.com/snap-stanford/snap.git
    amazon0302_link='https://snap.stanford.edu/data/amazon0302.txt.gz'
    amazon0312_link='https://snap.stanford.edu/data/amazon0312.txt.gz'
    amazonmeta_link='https://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz'
    amazon0601_link='https://snap.stanford.edu/data/amazon0601.txt.gz'
    downloads=[amazon0302_link,amazon0312_link,amazonmeta_link,amazon0601_link]
    for link in downloads:
        !wget {link}
        !gunzip {link.split('/')[-1]}


fatal: destination path 'snap-python' already exists and is not an empty directory.
fatal: destination path 'snap' already exists and is not an empty directory.
--2021-05-27 02:24:46--  https://snap.stanford.edu/data/amazon0302.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4664334 (4.4M) [application/x-gzip]
Saving to: ‘amazon0302.txt.gz’


2021-05-27 02:24:47 (5.46 MB/s) - ‘amazon0302.txt.gz’ saved [4664334/4664334]

gzip: amazon0302.txt already exists; do you wish to overwrite (y or n)? ^C
--2021-05-27 02:25:08--  https://snap.stanford.edu/data/amazon0312.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11304294 (11M) [application/x-gzip]
Saving to: ‘amazon0312.txt

In [None]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import *
import re

In [None]:
spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()

In [None]:
trans_file='amazon0302.txt'
meta_file='amazon-meta.txt'

In [None]:
contentRDD=sc.textFile(trans_file)
filterDD = contentRDD.filter(lambda l: not l.startswith('#'))
filterDD.take(10)

['0\t1',
 '0\t2',
 '0\t3',
 '0\t4',
 '0\t5',
 '1\t0',
 '1\t2',
 '1\t4',
 '1\t5',
 '1\t15']

In [None]:
'''Extracting edge record set'''
def get_edge_record(x):
    return [*x.split('\t')]
trans_data=filterDD.map(lambda x:get_edge_record(x))
trans_data.take(2)

[['0', '1'], ['0', '2']]

In [None]:
# Records in the meta file are separated by a blank line
text_rdd=spark.read.text(meta_file,lineSep='\r\n\r\n')
# Skip comment blocks and discontinued products
records=text_rdd.rdd.filter(lambda l: not l.value.startswith('#') and '  discontinued product' not in l.value)
records=records.map(lambda x : x.value.split('\r\n'))
records.take(3)

[['Id:   1',
  'ASIN: 0827229534',
  '  title: Patterns of Preaching: A Sermon Sampler',
  '  group: Book',
  '  salesrank: 396585',
  '  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]',
  '  reviews: total: 2  downloaded: 2  avg rating: 5',
  '    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9',
  '    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5'],
 ['Id:   2',
  'ASIN: 0738700797',
  '  title: Candlemas: Feast of Flames',
  '  group: Book',
  '  salesrank: 168596',
  '  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940',
  '  categories: 2',
  '   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca

## **Step 2:** Data extraction (both transaction and meta data) <a name='2'></a>

In [None]:
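# Extracting product meta information
def extract_meta_info(x):
    # Map each 'key: value' line to its key and the list of colon-separated value parts.
    # Note: because every ':' is a split point, only the text before the first colon
    # of a title is kept (e.g. 'Candlemas: Feast of Flames' becomes 'Candlemas').
    x_dict=dict()
    for ele in x:
        split=ele.split(':')
        x_dict[split[0].strip()]=split[1:]
    ID=int(x_dict['Id'][0])
    asin=x_dict['ASIN'][0].strip()
    title=x_dict.get('title',["None"])[0].strip()
    group=x_dict.get('group',["None"])[0].strip()
    sales_rank=int(x_dict.get('salesrank',["-1"])[0].strip())
    nproduct_similar=int(x_dict.get('similar',['0'])[0].split('  ')[0])
    
    total_rating=0
    avg_rating=0
    if "reviews" in x_dict.keys():
        # After the split and join, the line reads like ' total 2  downloaded 2  avg rating 5'
        review=''.join(x_dict['reviews'])
        total_rating = int(re.findall(r"total (\d+)",review)[0])
        # Only the integer part of the average rating is captured
        avg_rating=int(re.findall(r"avg rating (\d+)",review)[0])

    
    return [ID,asin,group,title,sales_rank,nproduct_similar,total_rating,avg_rating]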


meta_table=spark.createDataFrame(records.map(lambda x: extract_meta_info(x)),schema=['ID','ASIN','group','title','sales_rank','np_similar','total_rating','avg_rating'])

meta_table.show()


+---+----------+-----+--------------------+----------+----------+------------+----------+
| ID|      ASIN|group|               title|sales_rank|np_similar|total_rating|avg_rating|
+---+----------+-----+--------------------+----------+----------+------------+----------+
|  1|0827229534| Book|Patterns of Preac...|    396585|         5|           2|         5|
|  2|0738700797| Book|           Candlemas|    168596|         5|          12|         4|
|  3|0486287785| Book|World War II Alli...|   1270652|         0|           1|         5|
|  4|0842328327| Book|Life Application ...|    631289|         5|           1|         4|
|  5|1577943082| Book|Prayers That Avai...|    455160|         5|           0|         0|
|  6|0486220125| Book|How the Other Hal...|    188784|         5|          17|         4|
|  7|B00000AU3R|Music|               Batik|      5392|         5|           3|         4|
|  8|0231118597| Book| Losing Matt Shepard|    277409|         5|          15|         4|
|  9|18596
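
The parser above splits each line on every colon, which is why the title of product 2 shows up as `Candlemas` in the table. A minimal sketch of an alternative follows, written in plain Python (no Spark) so it can be tried on a single record. The helper name `parse_meta_record` and the returned keys are illustrative assumptions, and the `reviews` line in the sample is made up for the example rather than copied from the data set.

```python
import re

def parse_meta_record(lines):
    """Parse one amazon-meta record (a list of lines) into a dict.

    Hypothetical helper: each line is split on its FIRST colon only,
    so a title such as 'Candlemas: Feast of Flames' stays intact.
    """
    fields = {}
    for line in lines:
        key, sep, value = line.partition(':')
        if sep:
            # setdefault keeps the first occurrence of each key
            fields.setdefault(key.strip(), value.strip())
    reviews = fields.get('reviews', '')
    total = re.search(r'total:\s*(\d+)', reviews)
    avg = re.search(r'avg rating:\s*(\d+(?:\.\d+)?)', reviews)
    return {
        'ID': int(fields['Id']),
        'ASIN': fields['ASIN'],
        'title': fields.get('title', 'None'),
        'group': fields.get('group', 'None'),
        'sales_rank': int(fields.get('salesrank', '-1')),
        'total_rating': int(total.group(1)) if total else 0,
        # Keep the fractional part of the rating instead of truncating it
        'avg_rating': float(avg.group(1)) if avg else 0.0,
    }

record = [
    'Id:   2',
    'ASIN: 0738700797',
    '  title: Candlemas: Feast of Flames',
    '  group: Book',
    '  salesrank: 168596',
    # Illustrative reviews line, not copied from the data set
    '  reviews: total: 12  downloaded: 12  avg rating: 4.5',
]
parsed = parse_meta_record(record)
print(parsed['title'], parsed['avg_rating'])
```

The same function could be passed to `records.map(...)` in place of `extract_meta_info`, provided the schema is adjusted to the returned keys.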

In [None]:
# Reading transaction data to spark dataframe
trans_sdf=spark.createDataFrame(trans_data,schema=['src','dest'])

## **Step 3:** Understanding of the provided data set (both transaction and meta data) <a name='3'> </a>

In [None]:
print("Product group count present in meta data")
group_items=meta_table.groupby('group').count()
group_items.show()

Product group count present in meta data
+------------+------+
|       group| count|
+------------+------+
|       Video| 26131|
|         Toy|     8|
|         DVD| 19828|
|      Sports|     1|
|Baby Product|     1|
| Video Games|     1|
|        Book|393561|
|       Music|103144|
|    Software|     5|
|          CE|     4|
+------------+------+



In [None]:
# Get the bidirectional edges: pairs (a, b) for which the reverse edge (b, a) also exists
birec=trans_sdf.alias('a').join(trans_sdf.alias('b'),(col('a.src')==col('b.dest')) & (col('a.dest')==col('b.src'))).select(trans_sdf['*'])
birec.show(3)

+------+------+
|   src|  dest|
+------+------+
|100071| 76343|
|100084| 78003|
|100187|156222|
+------+------+
only showing top 3 rows



In [None]:
bidir_pairs=birec.count()/2
bidir_pairs

335085.0

In [None]:
total_pairs=trans_sdf.groupby('src','dest').count().count()
total_pairs

1234877

In [None]:
print("{0:0.2f} % of the purchase pairs are two-way".format(100*bidir_pairs/total_pairs))

27.14 % of the purchase pairs are two-way
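
The self-join above can be cross-checked on a small sample with plain Python sets. The sketch below uses the same definition as the notebook (reciprocal unordered pairs divided by distinct directed edges); the function name `reciprocal_share` is an assumption made for this example.

```python
def reciprocal_share(edges):
    """Return (reciprocal_pairs, distinct_directed_edges, percent)."""
    edge_set = set(edges)  # distinct directed edges
    # An unordered pair is reciprocal when both directions are present;
    # the a < b test counts each such pair once and skips self-loops.
    reciprocal = sum(1 for a, b in edge_set if a < b and (b, a) in edge_set)
    percent = 100 * reciprocal / len(edge_set)
    return reciprocal, len(edge_set), percent

# The first ten edges shown in the Step 1 output
sample = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
          (1, 0), (1, 2), (1, 4), (1, 5), (1, 15)]
print(reciprocal_share(sample))
```

Only the pair 0 and 1 is linked in both directions in this sample, so one reciprocal pair is found among ten directed edges.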


In [None]:
trans_sdf.columns

['src', 'dest']

## **Step 4:** Linking transactional data and meta data  <a name='4'></a>


In [None]:
node_meta=meta_table.join(trans_sdf,meta_table.ID==trans_sdf.src)

In [None]:
node_meta.columns

['ID',
 'ASIN',
 'group',
 'title',
 'sales_rank',
 'np_similar',
 'total_rating',
 'avg_rating',
 'src',
 'dest']

In [None]:
meta_table=meta_table.select('ID', 'ASIN', 'group','sales_rank','avg_rating')
meta_table.columns

['ID', 'ASIN', 'group', 'sales_rank', 'avg_rating']

In [None]:
# Suffix every column with '2' so that the destination-side metadata does not clash with the source side
meta_table=meta_table.toDF(*[c+'2' for c in meta_table.columns])

In [None]:
meta_table.columns

['ID2', 'ASIN2', 'group2', 'sales_rank2', 'avg_rating2']

In [None]:
node_meta_src_des=node_meta.join(meta_table,node_meta.dest==meta_table.ID2)

In [None]:
node_meta_src_des.printSchema()

root
 |-- ID: long (nullable = true)
 |-- ASIN: string (nullable = true)
 |-- group: string (nullable = true)
 |-- title: string (nullable = true)
 |-- sales_rank: long (nullable = true)
 |-- np_similar: long (nullable = true)
 |-- total_rating: long (nullable = true)
 |-- avg_rating: long (nullable = true)
 |-- src: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- ID2: long (nullable = true)
 |-- ASIN2: string (nullable = true)
 |-- group2: string (nullable = true)
 |-- sales_rank2: long (nullable = true)
 |-- avg_rating2: long (nullable = true)



In [None]:
node_meta_src_des.columns

['ID',
 'ASIN',
 'group',
 'title',
 'sales_rank',
 'np_similar',
 'total_rating',
 'avg_rating',
 'src',
 'dest',
 'ID2',
 'ASIN2',
 'group2',
 'sales_rank2',
 'avg_rating2']

## **Step 5:** Analytics based on metadata linked transaction information <a name='5'></a>

#### 5.1. How does purchasing a product depend on its average rating? <a name='5.1'></a>

### To analyze purchases from the side of the product bought first, we focus on the count per src attribute

In [None]:
print('Grouping by the product that is bought first')
src_grouped=node_meta_src_des.groupby('src').count()
src_grouped.columns

Grouping by the product that is bought first


['src', 'count']

In [None]:
src_grouped_avg=src_grouped.join(node_meta_src_des,src_grouped['src']==node_meta_src_des['src'])

In [None]:
src_grouped_avg=src_grouped_avg.select(src_grouped['src'],src_grouped['count'],node_meta_src_des['sales_rank'],node_meta_src_des['group'],node_meta_src_des['avg_rating'])
src_grouped_avg.show()

+------+-----+----------+-----+----------+
|   src|count|sales_rank|group|avg_rating|
+------+-----+----------+-----+----------+
|100010|    5|   1105443| Book|         3|
|100010|    5|   1105443| Book|         3|
|100010|    5|   1105443| Book|         3|
|100010|    5|   1105443| Book|         3|
|100010|    5|   1105443| Book|         3|
|100140|    3|   1556441| Book|         5|
|100140|    3|   1556441| Book|         5|
|100140|    3|   1556441| Book|         5|
|100227|    5|    753595| Book|         3|
|100227|    5|    753595| Book|         3|
|100227|    5|    753595| Book|         3|
|100227|    5|    753595| Book|         3|
|100227|    5|    753595| Book|         3|
|100263|    4|    269450|Music|         0|
|100263|    4|    269450|Music|         0|
|100263|    4|    269450|Music|         0|
|100263|    4|    269450|Music|         0|
|100320|    5|     45323| Book|         4|
|100320|    5|     45323| Book|         4|
|100320|    5|     45323| Book|         4|
+------+---

In [None]:
src_grouped_avg.corr('count','avg_rating')

0.0012499627347239994

In [None]:
src_groupbased_corr=src_grouped_avg.groupby('group').agg(corr('count','avg_rating'))

In [None]:
src_groupbased_corr.show()

+------------+-----------------------+
|       group|corr(count, avg_rating)|
+------------+-----------------------+
|       Video|   0.022525243463348173|
|         Toy|   -0.28867513459481303|
|         DVD|   0.003657015314166712|
|Baby Product|                   null|
| Video Games|                   null|
|        Book|    -8.8567924985705E-5|
|       Music|   1.698285532912330...|
|    Software|                   null|
|          CE|     0.3779644730092271|
+------------+-----------------------+
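
The joined table above repeats each `src` once per outgoing edge (product 100010 appears five times), so products with more co-purchases weigh more heavily in `corr`. A minimal sketch of the alternative, one row per product, is shown below in plain Python; the `pearson` helper is written out for the example and is not the notebook's code. In Spark, the same effect could probably be obtained by calling `dropDuplicates(['src'])` before `corr`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # one side is constant, so there is no correlation to measure
    return sxy / sqrt(sxx * syy)

# (src, count, avg_rating) rows as they appear in the joined table above
rows = [('100010', 5, 3)] * 5 + [('100140', 3, 5)] * 3 + [('100263', 4, 0)] * 4
# Keep one row per product before correlating
unique_rows = sorted(set(rows))
counts = [r[1] for r in unique_rows]
ratings = [r[2] for r in unique_rows]
print(len(rows), len(unique_rows), pearson(counts, ratings))
```

The `None` branch is the likely explanation for the `null` entries in the per-group table: a group with a single product has no variance to correlate.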



#### 5.2. How does co-purchasing a product depend on its average rating? <a name='5.2'></a>

### To analyze co-purchases, we focus on the dest attribute, since it identifies the co-purchased product

In [None]:
print('Grouping by co purchased product')
dest_grouped=node_meta_src_des.groupby('dest').count()
dest_grouped.columns

Grouping by co purchased product


['dest', 'count']

In [None]:
dest_grouped_avg=dest_grouped.join(node_meta_src_des,dest_grouped['dest']==node_meta_src_des['dest'])

In [None]:
dest_grouped_avg=dest_grouped_avg.select(dest_grouped['dest'],dest_grouped['count'],node_meta_src_des['sales_rank'],node_meta_src_des['group2'],node_meta_src_des['avg_rating2'])
dest_grouped_avg.show()

+------+-----+----------+------+-----------+
|  dest|count|sales_rank|group2|avg_rating2|
+------+-----+----------+------+-----------+
|100010|    1|      8076|  Book|          3|
|100140|    1|    386221|  Book|          5|
|100227|    5|   1115219|  Book|          3|
|100227|    5|    750026|  Book|          3|
|100227|    5|   1549622|  Book|          3|
|100227|    5|     39872|  Book|          3|
|100227|    5|     18399|  Book|          3|
|100263|    3|       464| Music|          0|
|100263|    3|    312566| Music|          0|
|100263|    3|    419694| Music|          0|
|100320|    5|    304322|  Book|          4|
|100320|    5|    277731|  Book|          4|
|100320|    5|   1804743|  Book|          4|
|100320|    5|    561260|  Book|          4|
|100320|    5|     29543|  Book|          4|
|100553|    3|    880630|  Book|          0|
|100553|    3|     54489|  Book|          0|
|100553|    3|    265462|  Book|          0|
|100704|    5|     61804|  Book|          3|
|100704|  

In [None]:
dest_grouped_avg.corr('count','avg_rating2')

0.021753365799190683

In [None]:
groupbased_corr=dest_grouped_avg.groupby('group2').agg(corr('count','avg_rating2'))

In [None]:
groupbased_corr.show()

+------------+------------------------+
|      group2|corr(count, avg_rating2)|
+------------+------------------------+
|       Video|    0.038297240052678136|
|         Toy|     -0.9626342561216908|
|         DVD|     -0.0370511073291901|
|Baby Product|                    null|
| Video Games|                    null|
|        Book|    0.018959076357230306|
|       Music|     0.04148909959045391|
|    Software|      0.9999999999999998|
|          CE|                     1.0|
+------------+------------------------+



In [None]:
print("From the above, we do not see a correlation with the rating of either the purchased or the co-purchased product")

From the above, we do not see a correlation with the rating of either the purchased or the co-purchased product


#### 5.3. Analyzing co-purchases within the same group <a name='5.3'></a>






In [None]:
co_purchase_group=node_meta_src_des.groupby(['group','group2']).count().sort('count',ascending=False)

In [None]:
same_group=co_purchase_group[co_purchase_group.group==co_purchase_group.group2]
same_group=same_group.withColumnRenamed('count','purchase_pairs')
same_group.show()

+-----+------+--------------+
|group|group2|purchase_pairs|
+-----+------+--------------+
| Book|  Book|        637896|
|Music| Music|         45806|
|Video| Video|          3372|
|  DVD|   DVD|          3043|
+-----+------+--------------+



In [None]:
group_items_df=group_items.toPandas()

In [None]:
group_items_df

Unnamed: 0,group,count
0,Video,26131
1,Toy,8
2,DVD,19828
3,Sports,1
4,Baby Product,1
5,Video Games,1
6,Book,393561
7,Music,103144
8,Software,5
9,CE,4


In [None]:
same_group=same_group.alias('a').join(group_items.alias('b'),col('a.group')==col('b.group')).select(same_group['*'],group_items['count'])

In [None]:
same_group=same_group.withColumnRenamed('count','total_products').toPandas()
same_group

Unnamed: 0,group,group2,purchase_pairs,total_products
0,Video,Video,3372,26131
1,DVD,DVD,3043,19828
2,Book,Book,637896,393561
3,Music,Music,45806,103144


In [None]:
def co_pur_ratio(x):
    # Percentage of all possible unordered product pairs in the group that are co-purchased
    possible_pairs=(x.total_products*(x.total_products-1))/2
    ratio = 100*x.purchase_pairs/possible_pairs
    return ratio

In [None]:
same_group['co_purchase_ratio']=same_group.apply(lambda x:co_pur_ratio(x),axis=1)
same_group

Unnamed: 0,group,group2,purchase_pairs,total_products,co_purchase_ratio
0,Video,Video,3372,26131,0.000988
1,DVD,DVD,3043,19828,0.001548
2,Book,Book,637896,393561,0.000824
3,Music,Music,45806,103144,0.000861


In [None]:
highest_copurchase_group=same_group[same_group.co_purchase_ratio==same_group.co_purchase_ratio.max()].group.values[0]
print("Highest co purchase ratio is '{}'".format(highest_copurchase_group))

Highest co purchase ratio is 'DVD'
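
The ratio above is `100 * purchase_pairs / (n * (n - 1) / 2)` for a group of `n` products. The sketch below reproduces the table values in plain Python. The `directed` flag is an assumption added for this example: `purchase_pairs` counts directed edges, so a denominator of `n * (n - 1)` ordered pairs may be the more consistent choice.

```python
def co_purchase_ratio(purchase_pairs, total_products, directed=False):
    """Percentage of possible within-group pairs that are co-purchased.

    directed=False mirrors the notebook: n*(n-1)/2 unordered pairs.
    directed=True is an assumed alternative: n*(n-1) ordered pairs,
    which matches a numerator that counts directed edges.
    """
    possible = total_products * (total_products - 1)
    if not directed:
        possible /= 2
    return 100 * purchase_pairs / possible

# (purchase_pairs, total_products) taken from the same-group table above
groups = {'Video': (3372, 26131), 'DVD': (3043, 19828),
          'Book': (637896, 393561), 'Music': (45806, 103144)}
for name, (pairs, n) in groups.items():
    print(name, round(co_purchase_ratio(pairs, n), 6))
best = max(groups, key=lambda g: co_purchase_ratio(*groups[g]))
print(best)
```

Switching to the directed denominator halves every ratio, so the ranking of the groups is unchanged.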


#### 5.4. Analyzing co-purchases across different groups <a name='5.4'></a>


In [None]:
diverse_group=co_purchase_group[co_purchase_group.group!=co_purchase_group.group2].toPandas()


In [None]:
diverse_group

Unnamed: 0,group,group2,count
0,Book,Music,164434
1,Music,Book,163471
2,Video,Book,42329
3,Book,Video,42241
4,DVD,Book,31270
5,Book,DVD,31008
6,Video,Music,11007
7,Music,Video,10905
8,DVD,Music,8078
9,Music,DVD,8063


## **Step 6:** Graph Representation <a name='6'></a>


In [None]:
import snap

In [None]:
edge_dataframe=trans_sdf.toPandas()

In [None]:
edge_dataframe.shape

(1234877, 2)

In [None]:
edge_dataframe['src']=edge_dataframe['src'].astype('int64')
edge_dataframe['dest']=edge_dataframe['dest'].astype('int64')
edge_dataframe.dtypes

src     int64
dest    int64
dtype: object

In [None]:
print(edge_dataframe.src.nunique())
print(edge_dataframe.dest.nunique())

257570
262111


In [None]:
len(set.intersection(set(edge_dataframe.src),set(edge_dataframe.dest)))

257570

In [None]:
nodes=list(set.union(set(edge_dataframe.src),set(edge_dataframe.dest)))
len(nodes)

262111

In [None]:
np.max(nodes)

262110

#### 1 . Creation of graph using transaction data <a name='6.1'></a>


In [None]:
'''Creating a directed graph '''
graph=snap.TNGraph.New()

In [None]:
node_dict={}
for i,node in enumerate(nodes):
    node_id=i
    node_dict[node]=node_id
    graph.AddNode(node_id)

In [None]:
len(node_dict)

262111

In [None]:
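# Add one directed edge per transaction row (zip over the columns is much faster than iterrows)
for src, dest in zip(edge_dataframe['src'], edge_dataframe['dest']):
    graph.AddEdge(node_dict[src], node_dict[dest])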

In [None]:
FOut = snap.TFOut("co_purchase_directed_all.graph")
graph.Save(FOut)

In [None]:
# shared_neighbours=[]
# for i,row in edge_dataframe.iterrows():
#     shared_neighbours.append(graph.GetCmnNbrs(int(row['src']),int(row['dest'])))

#### 2 . Connected component <a name='6.2'></a>

In [None]:
Components = graph.GetSccs()

In [None]:
len(Components)

6594

In [None]:
print(graph.IsConnected())

True


In [None]:
print(graph.IsWeaklyConn())

True
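
The results above say that the directed graph is weakly connected yet has 6594 strongly connected components. The toy sketch below shows how both can hold at once. It is plain Python with helper names chosen for this example, and the component count uses a simple quadratic method that is only suitable for small graphs.

```python
from collections import defaultdict

def reachable(start, adj):
    """Return the set of nodes reachable from start (start included)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_weakly_connected(nodes, edges):
    """True if the graph is connected once edge direction is ignored."""
    undirected = defaultdict(set)
    for a, b in edges:
        undirected[a].add(b)
        undirected[b].add(a)
    nodes = set(nodes)
    return reachable(next(iter(nodes)), undirected) >= nodes

def count_sccs(nodes, edges):
    """Count strongly connected components (toy-sized graphs only)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for a, b in edges:
        fwd[a].add(b)
        bwd[b].add(a)
    remaining, count = set(nodes), 0
    while remaining:
        node = next(iter(remaining))
        # The component of node: everything it can reach that can also reach it
        remaining -= reachable(node, fwd) & reachable(node, bwd)
        count += 1
    return count

toy_nodes = range(4)
toy_edges = [(0, 1), (1, 0), (1, 2), (2, 3)]
print(is_weakly_connected(toy_nodes, toy_edges), count_sccs(toy_nodes, toy_edges))
```

Nodes 2 and 3 can be reached from the pair 0 and 1 but cannot get back, so each forms its own strongly connected component even though the graph is in one piece when direction is ignored.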


#### 3 . Get indegree and outdegree for every node <a name='6.3'></a>

In [None]:
InDegV = graph.GetNodeInDegV()

In [None]:
OutDegV = graph.GetNodeOutDegV()

In [None]:
degree_attrs=[]
for item1,item2 in zip(InDegV,OutDegV):
    node_id=item1.GetVal1()
    in_degree=item1.GetVal2()
    out_degree=item2.GetVal2()
    degree_attrs.append([node_id,in_degree,out_degree])

In [None]:
node_attributes=pd.DataFrame(degree_attrs,columns=['Node_id','in_degree','out_degree'])

In [None]:
node_attributes.head()

Unnamed: 0,Node_id,in_degree,out_degree
0,0,2,5
1,1,1,5
2,2,2,5
3,3,1,5
4,4,25,5


In [None]:
node_attributes[['in_degree','out_degree']].describe()

Unnamed: 0,in_degree,out_degree
count,262111.0,262111.0
mean,4.711275,4.711275
std,5.707922,0.95154
min,1.0,0.0
25%,2.0,5.0
50%,3.0,5.0
75%,6.0,5.0
max,420.0,5.0
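
In-degree and out-degree have the same mean above because every edge contributes exactly one to each, so both means equal edges divided by nodes. A small plain-Python sketch of the degree computation, with a hypothetical helper `degree_table`, makes that explicit:

```python
from collections import Counter

def degree_table(edges):
    """Map each node to its (in_degree, out_degree)."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dest for _, dest in edges)
    nodes = sorted(set(out_deg) | set(in_deg))
    return {n: (in_deg[n], out_deg[n]) for n in nodes}

# The first ten edges shown in the Step 1 output
sample = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
          (1, 0), (1, 2), (1, 4), (1, 5), (1, 15)]
table = degree_table(sample)
total_in = sum(i for i, _ in table.values())
total_out = sum(o for _, o in table.values())
print(table[0], table[15], total_in, total_out)

# Edge and node counts reported earlier in the notebook
print(round(1234877 / 262111, 6))
```

The last line uses the edge and node counts reported earlier and can be compared with the `mean` row of the summary.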


#### 4 . Importance/Popularity Score for every node <a name='6.4'></a>


In [None]:
from fastprogress import progress_bar
from tqdm import tqdm
tqdm.pandas()



In [None]:
PRankH = graph.GetPageRank()

In [None]:
node_attributes['page_rank']=node_attributes.apply(lambda x :PRankH(int(x.Node_id)),axis=1)

In [None]:
NIdHubH, NIdAuthH = graph.GetHits()

In [None]:
node_attributes['hub_score']=node_attributes.apply(lambda x :NIdHubH(x.Node_id),axis=1)

In [None]:
node_attributes['auth_score']=node_attributes.apply(lambda x :NIdAuthH(x.Node_id),axis=1)

In [None]:
node_attributes

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score
0,0,2,5,8.886472e-07,4.763850e-05,5.674574e-06
1,1,1,5,7.595271e-07,6.395747e-05,2.337667e-06
2,2,2,5,8.886472e-07,5.127175e-05,5.496262e-06
3,3,1,5,7.595271e-07,5.395929e-05,2.337667e-06
4,4,25,5,1.076645e-05,4.904757e-04,2.237277e-05
...,...,...,...,...,...,...
262106,262106,2,5,1.895723e-06,4.449405e-12,5.125588e-12
262107,262107,3,5,2.441936e-06,3.485290e-12,4.254202e-13
262108,262108,1,5,1.800167e-06,2.991218e-12,4.995048e-12
262109,262109,1,5,9.144734e-07,1.060986e-10,1.415069e-13


In [None]:
node_attributes['n_triads']=node_attributes.apply(lambda x :graph.GetNodeTriads(int(x.Node_id)),axis=1)

In [None]:
node_attributes.head()

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score,n_triads
0,0,2,5,8.886472e-07,4.8e-05,6e-06,3
1,1,1,5,7.595271e-07,6.4e-05,2e-06,3
2,2,2,5,8.886472e-07,5.1e-05,5e-06,2
3,3,1,5,7.595271e-07,5.4e-05,2e-06,6
4,4,25,5,1.076645e-05,0.00049,2.2e-05,41


In [None]:
node_attributes['Clustr_cf']=node_attributes.apply(lambda x :graph.GetNodeClustCf(int(x.Node_id)),axis=1)

In [None]:
node_attributes

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score,n_triads,Clustr_cf
0,0,2,5,8.886472e-07,4.763850e-05,5.674574e-06,3,0.300000
1,1,1,5,7.595271e-07,6.395747e-05,2.337667e-06,3,0.300000
2,2,2,5,8.886472e-07,5.127175e-05,5.496262e-06,2,0.133333
3,3,1,5,7.595271e-07,5.395929e-05,2.337667e-06,6,0.400000
4,4,25,5,1.076645e-05,4.904757e-04,2.237277e-05,41,0.108466
...,...,...,...,...,...,...,...,...
262106,262106,2,5,1.895723e-06,4.449405e-12,5.125588e-12,12,0.800000
262107,262107,3,5,2.441936e-06,3.485290e-12,4.254202e-13,11,0.733333
262108,262108,1,5,1.800167e-06,2.991218e-12,4.995048e-12,9,0.900000
262109,262109,1,5,9.144734e-07,1.060986e-10,1.415069e-13,13,0.866667
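
As a rough illustration of what the `page_rank` column measures, the sketch below runs a plain power-iteration PageRank on a toy graph. It is not SNAP's implementation; the damping factor of 0.85, the iteration count, and the handling of nodes without out-links are assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a list of directed edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dest in edges:
        out_links[src].append(dest)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dest in out_links[src]:
                new_rank[dest] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

toy_edges = [(0, 2), (1, 2), (2, 3), (3, 2)]
rank = pagerank(toy_edges)
print(max(rank, key=rank.get), round(sum(rank.values()), 6))
```

Node 2 receives links from every other node and therefore ends up with the highest score, which is the sense in which a high `page_rank` marks a popular product.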


## **Step 7:** Using Graph based analysis to improve the sale and drive the focus area <a name="7"></a>


#### 1 . Popular products are the ones with high page_rank/centrality score <a name='7.1'></a>

In [None]:
ex_high_pgrnk=node_attributes.page_rank.mean()+node_attributes.page_rank.std()
ex_high_pgrnk

1.3096018928027278e-05

In [None]:
node_attributes[node_attributes.page_rank>ex_high_pgrnk]

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score,n_triads,Clustr_cf
5,5,54,5,0.000131,3.311300e-03,9.762999e-04,79,0.051299
6,6,98,5,0.000185,2.937866e-03,1.020202e-02,66,0.013069
7,7,34,5,0.000201,2.864395e-03,1.878698e-03,38,0.057057
8,8,293,5,0.000906,2.508434e-04,5.663351e-02,166,0.003802
9,9,20,0,0.000168,0.000000e+00,9.653628e-04,14,0.073684
...,...,...,...,...,...,...,...,...
261213,261213,17,5,0.000016,2.822956e-06,2.092580e-08,28,0.183007
261572,261572,23,5,0.000023,3.264110e-12,1.332903e-12,50,0.197628
261581,261581,10,5,0.000013,3.330485e-12,3.349291e-11,20,0.444444
261582,261582,17,5,0.000015,8.532305e-14,5.961414e-13,35,0.257353


#### 2 . Products which might rise to popularity <a name='7.2'></a>

In [None]:
# Products with a high hub score and a low Clustr_cf should be promoted, since they might trigger the sale of products with a high authority score

In [None]:
ex_high_hub=node_attributes.hub_score.mean()+node_attributes.hub_score.std()
ex_high_auth=node_attributes.auth_score.mean()+node_attributes.auth_score.std()
ex_low_clcf=node_attributes.Clustr_cf.mean()-node_attributes.Clustr_cf.std()

In [None]:
products_for_target_sell = node_attributes[(node_attributes.page_rank>ex_high_pgrnk) & (node_attributes.Clustr_cf<ex_low_clcf) ]
products_for_target_sell

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score,n_triads,Clustr_cf
5,5,54,5,0.000131,3.311300e-03,9.762999e-04,79,0.051299
6,6,98,5,0.000185,2.937866e-03,1.020202e-02,66,0.013069
7,7,34,5,0.000201,2.864395e-03,1.878698e-03,38,0.057057
8,8,293,5,0.000906,2.508434e-04,5.663351e-02,166,0.003802
9,9,20,0,0.000168,0.000000e+00,9.653628e-04,14,0.073684
...,...,...,...,...,...,...,...,...
205477,205477,32,5,0.000035,3.131274e-10,1.095330e-08,53,0.106855
213261,213261,23,5,0.000017,1.987423e-08,6.344874e-08,34,0.134387
218156,218156,23,5,0.000018,2.513467e-09,2.596640e-08,34,0.134387
228899,228899,16,5,0.000015,1.235220e-05,1.310521e-04,17,0.141667


#### 3 . Products which should be targeted in order to trigger the sale of others <a name='7.3'></a>

In [None]:
hub_products_to_sell = node_attributes[(node_attributes.hub_score>ex_high_hub) & (node_attributes.Clustr_cf<ex_low_clcf) ]
hub_products_to_sell

Unnamed: 0,Node_id,in_degree,out_degree,page_rank,hub_score,auth_score,n_triads,Clustr_cf
5,5,54,5,1.314575e-04,0.003311,9.762999e-04,79,0.051299
6,6,98,5,1.851205e-04,0.002938,1.020202e-02,66,0.013069
7,7,34,5,2.010097e-04,0.002864,1.878698e-03,38,0.057057
17,17,21,5,1.508724e-05,0.003309,8.334128e-05,27,0.090000
18,18,172,5,3.026123e-04,0.006516,8.366780e-03,141,0.009477
...,...,...,...,...,...,...,...,...
261947,261947,3,5,1.376405e-06,0.005635,1.538398e-06,1,0.100000
261985,261985,2,5,2.177681e-06,0.002613,1.191276e-06,1,0.100000
261986,261986,1,3,1.536207e-06,0.003997,3.620334e-07,0,0.000000
261990,261990,1,2,9.034880e-07,0.006145,8.716964e-07,0,0.000000


#### 4 . Product sale within the same and diverse product group <a name='7.4'></a>

In [None]:
def get_avg_clstr_coef(x):
    src_nodes=x['src_product_list']
    dest_nodes=x['dest_product_list']
    nodes=set([*src_nodes,*dest_nodes])
    nodes=[int(i) for i in nodes]
    clus_coefs=[]
    for node in nodes:
        clus_coefs.append(graph.GetNodeClustCf(node))
    avg_clus_coef=np.mean(clus_coefs)
    return avg_clus_coef

In [None]:
same_group_ID=node_meta_src_des.groupBy(['group','group2']).agg(F.collect_list('src').alias('src_product_list'),F.collect_list('dest').alias('dest_product_list')).toPandas()

In [None]:
same_group_ID['Clustr_coef']=same_group_ID.apply(lambda x: get_avg_clstr_coef(x),axis=1)

In [None]:
same_group_ID.sort_values('Clustr_coef')

Unnamed: 0,group,group2,src_product_list,dest_product_list,Clustr_coef
27,Book,Software,"[175183, 131901, 105223, 73545, 105221]","[96696, 96696, 96696, 96696, 96696]",0.128307
5,Baby Product,Book,"[197564, 197564, 197564, 197564]","[197563, 90079, 61309, 209480]",0.168571
31,Video,Toy,"[1014, 72791]","[922, 11660]",0.185577
17,Book,Baby Product,"[197563, 209480]","[197564, 197564]",0.211111
8,Music,Toy,"[773, 11661, 55157, 4830]","[922, 11660, 11660, 11660]",0.219163
1,Toy,Book,"[11660, 922, 922, 257106, 922, 11660, 922, 116...","[11662, 621, 1012, 255366, 1013, 19731, 1011, ...",0.234828
32,Toy,Video,[922],[1014],0.239286
15,DVD,Baby Product,[151435],[197564],0.25
7,Baby Product,DVD,[197564],[151435],0.25
35,Book,Toy,"[255366, 621, 9026, 1011, 7334, 11662, 172212,...","[257106, 922, 922, 922, 922, 11660, 11660, 116...",0.270901


## **Step 8:** Improving product recommendation <a name='8'></a>


 #### 1.  Using open triads <a name='8.1'></a>

In [None]:
graph.CntUniqBiDirEdges()

335085

In [None]:
nodes=list(node_dict.keys())
len(nodes)

262111


In [None]:
## A<-->B-->C
## A-->B-->C
## A-->B>-->C

In [None]:
# Build the out-neighbour list of every node in a single pass over the edges
node_neighbours={node:[] for node in nodes}
for src,dest in zip(edge_dataframe['src'],edge_dataframe['dest']):
  node_neighbours[src].append(dest)

In [None]:
def triadic_closure(x):
  src=x['src']
  dest=x['dest']
  src_neighbours=set(node_neighbours[src])
  dest_neighbours=set(node_neighbours[dest])
  new_co_pairs=set.union(src_neighbours,dest_neighbours)-set.intersection(src_neighbours,dest_neighbours)
  new_nbrs=len(new_co_pairs)
  old_nbrs=len(src_neighbours)
  dseries=pd.Series([src,dest,new_co_pairs,new_nbrs,old_nbrs])
  return dseries

In [None]:
closed_triads=edge_dataframe.apply(lambda x :triadic_closure(x),axis=1)

In [None]:
colname={0:'src',1:'dest',2:'new_co_pairs',3:'new_nbrs',4:'old_nbrs'}

In [None]:
closed_triads=closed_triads.rename(columns=colname)
closed_triads

Unnamed: 0,src,dest,new_co_pairs,new_nbrs,old_nbrs
0,0,1,"{0, 1, 3, 15}",4,5
1,0,2,"{0, 1, 2, 3, 4, 5, 11, 12, 13, 14}",10,5
2,0,3,"{64, 1, 2, 3, 4, 5, 65, 66, 67, 63}",10,5
3,0,4,"{1, 2, 3, 4, 5, 7, 16, 17, 18, 19}",10,5
4,0,5,"{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}",10,5
5,1,0,"{0, 1, 3, 15}",4,5
6,1,2,"{2, 4, 5, 11, 12, 13, 14, 15}",8,5
7,1,4,"{0, 2, 4, 5, 7, 15, 16, 17, 18, 19}",10,5
8,1,5,"{0, 2, 4, 5, 6, 7, 8, 9, 10, 15}",10,5
9,1,15,"{0, 2, 4, 5, 68, 69, 70, 71, 72, 15}",10,5


In [None]:
closed_triads.new_nbrs.sum()
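
`triadic_closure` above takes the symmetric difference of the two neighbour sets, which can include `src` and `dest` themselves. One possible reading of an open triad A-->B-->C is sketched below in plain Python: for each product A, suggest every C that is co-purchased with one of A's co-purchases but not yet with A. The function name and the exact rule are assumptions for illustration, not the notebook's definition.

```python
from collections import defaultdict

def open_triad_candidates(edges):
    """For each product A, list products C with A->B and B->C but no A->C edge."""
    out_nbrs = defaultdict(set)
    for src, dest in edges:
        out_nbrs[src].add(dest)
    candidates = {}
    for a, nbrs in out_nbrs.items():
        two_hop = set()
        for b in nbrs:
            two_hop |= out_nbrs.get(b, set())
        # Drop A itself and products already co-purchased with A
        candidates[a] = two_hop - nbrs - {a}
    return candidates

# The first ten edges shown in the Step 1 output
sample = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
          (1, 0), (1, 2), (1, 4), (1, 5), (1, 15)]
print(open_triad_candidates(sample))
```

On this sample, product 15 is suggested for product 0 (through product 1), and product 3 is suggested for product 1 (through product 0).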


#### 2. Using Node2vec <a name='8.2'></a>

This is an attempt to represent the products as vectors. We use node2vec to create shallow embeddings.

In [None]:
import networkx as nx
from node2vec import Node2Vec

In [None]:
G = nx.from_pandas_edgelist(edge_dataframe, "src", "dest", create_using=nx.Graph())

In [None]:
node2vec = Node2Vec(G, dimensions=100, walk_length=5, num_walks=5)




In [None]:
n2w_model = node2vec.fit(window=5, min_count=1)


In [None]:
def get_similar_nodes(node):
  # Query the word vectors; most_similar returns (node, similarity) pairs
  similar_nodes=[int(x[0]) for x in n2w_model.wv.most_similar(str(node))]
  return similar_nodes

In [None]:
# Recommendations come from the 'dest' column of the table below: we can recommend the products which are co-purchased with products similar to the given one

In [None]:
recommended_product=node_meta_src_des[node_meta_src_des.src.isin(get_similar_nodes(1))].toPandas()

  


In [None]:
recommended_product

Unnamed: 0,ID,ASIN,group,title,sales_rank,np_similar,total_rating,avg_rating,src,dest,ID2,ASIN2,group2,sales_rank2,avg_rating2
0,119081,0451181468,Book,Praying for Sleep,397061,5,0,0,119081,119082,119082,0736905189,Book,306956,5
1,128758,B00008DDX0,DVD,Pete's a Pizza... and More William Steig Stori...,19428,5,0,0,128758,138381,138381,188301039X,Book,23239,4
2,243397,B00004TXTU,Music,Home,465394,0,0,0,243397,109292,109292,0873513193,Book,77751,4
3,16189,0275955230,Book,Staffing the Contemporary Organization,662020,5,2,4,16189,190,190,6303454488,Video,23333,4
4,243397,B00004TXTU,Music,Home,465394,0,0,0,243397,153522,153522,1566563607,Book,514509,3
5,13086,0312180845,Book,The Devil's Hunt,509335,5,3,5,13086,3405,3405,B0000057DN,Music,196579,4
6,119081,0451181468,Book,Praying for Sleep,397061,5,0,0,119081,84994,84994,1587200376,Book,435438,3
7,128758,B00008DDX0,DVD,Pete's a Pizza... and More William Steig Stori...,19428,5,0,0,128758,177943,177943,0375410538,Book,352627,3
8,74019,6305047480,DVD,New Jack City,5112,5,41,4,74019,57024,57024,0312867956,Book,652660,5
9,128758,B00008DDX0,DVD,Pete's a Pizza... and More William Steig Stori...,19428,5,0,0,128758,47579,47579,1931657017,Book,1281997,0
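
Lookups of the kind used in `get_similar_nodes` rank the other nodes by the cosine similarity of their embedding vectors. The plain-Python sketch below shows that ranking step on made-up three-dimensional vectors; the vectors and helper names are purely illustrative and are not taken from the trained model.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(node, embeddings, topn=2):
    """Return the topn (node, similarity) pairs closest to the given node."""
    scores = [(other, cosine(embeddings[node], vec))
              for other, vec in embeddings.items() if other != node]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]

# Made-up embeddings keyed by node id (as strings, like the node2vec vocabulary)
embeddings = {'1': [1.0, 0.0, 0.0], '2': [0.9, 0.1, 0.0],
              '3': [0.0, 1.0, 0.0], '4': [0.5, 0.5, 0.0]}
print(most_similar('1', embeddings))
```

The products co-purchased with the returned neighbours are then the recommendation candidates, as in the table above.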



## Proposed extensions <a name='9'></a>
#### 1. Using node attributes
#### 2. Community detection
#### 3. Using a product's location in the category hierarchy as an ontology
#### 4. Similarity scores on title, category name, and authors
#### 5. Collaborative filtering using customer information
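
As a starting point for the title-based similarity score, one simple option is the Jaccard overlap of the words in two titles. The sketch below is only one possible choice, with hypothetical helper names; it does no stop-word removal, so common words such as "of" count as matches.

```python
import re

def title_tokens(title):
    """Lower-cased alphanumeric words of a title, as a set."""
    return set(re.findall(r'[a-z0-9]+', title.lower()))

def jaccard(title_a, title_b):
    """Shared words divided by all distinct words of the two titles."""
    ta, tb = title_tokens(title_a), title_tokens(title_b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Titles of products 1 and 2 from the meta data shown in Step 1
t1 = 'Patterns of Preaching: A Sermon Sampler'
t2 = 'Candlemas: Feast of Flames'
print(jaccard(t1, t2), jaccard(t1, t1))
```

The two titles share only the word "of" out of nine distinct words, so the score is low, as expected for unrelated books.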