# Data Driven Dealings Development


*   EDA on Sales Data
*   RFM Clustering
*   Predicting Sales
*   Market Basket Analysis
*   Recommending Items per Customer







# Reading in the Data

In [3]:
# Optional (Google Colab only): to use data stored in your Google Drive, you first need to mount the drive so you can load and save files to it.
# from google.colab import drive
# drive.mount('/content/gdrive')
# You will need to enter a token, which Google generates for you as soon as you click on the link.

In [4]:
import pandas as pd
data = pd.read_excel('DDDD.xlsx')
data.head()
data=data[0:300]

# Sparsity

In [5]:
DataPrep = data[['SalesItem', 'SalesAmount', 'Customer']] #we will only use SalesItem, SalesAmount and Customer for our recommending purpose
DataPrep.head()

Unnamed: 0,SalesItem,SalesAmount,Customer
0,0,10,0
1,0,10,0
2,0,30,0
3,1,10,0
4,2,2,0


In [6]:
#DataPrep.rename(columns={'SalesItem':'movieId','Customer':'userId'},inplace=True)

In [7]:
DataPrep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   SalesItem    300 non-null    int64
 1   SalesAmount  300 non-null    int64
 2   Customer     300 non-null    int64
dtypes: int64(3)
memory usage: 7.2 KB


In [8]:
DataGrouped = DataPrep.groupby(['Customer', 'SalesItem']).sum().reset_index() # Group together
DataGrouped.head()



Unnamed: 0,Customer,SalesItem,SalesAmount
0,0,0,154
1,0,1,25
2,0,2,5
3,1,3,1
4,1,4,1


In [9]:
#make sure that no values <=0 exist
DataGroupedZero = DataGrouped.query('SalesAmount <= 0')
DataGroupedZero.head()

Unnamed: 0,Customer,SalesItem,SalesAmount


In [10]:
# The check above confirms that no rows with SalesAmount <= 0 exist. That is fine!
# Only use the following in case your data includes values <= 0
# DataGrouped.SalesAmount.loc[DataGrouped.SalesAmount == 0] = 1 # Replace a sum of zero purchases with a one
# DataGrouped.head()

# Another way to handle such rows is the query function
# DataGrouped = DataGrouped.query('SalesAmount > 0') # Only keep customer/item pairs whose purchase totals were positive
# DataGrouped.head()


In [11]:
import numpy as np
customers = list(np.sort(DataGrouped.Customer.unique())) # sorted unique customer IDs (6 in this 300-row sample; IDs start at 0)
products = list(DataGrouped.SalesItem.unique()) # unique products that were purchased (177 in this sample)
quantity = list(DataGrouped.SalesAmount) # purchased amount for every customer/item pair (186 pairs)
# list() turns each result into a plain Python list, so customers now stores the 6 unique customer IDs.

In [12]:
from pandas import DataFrame
DfCustomerUnique = DataFrame(customers,columns=['Customer'])
DfCustomerUnique.head()

Unnamed: 0,Customer
0,0
1,1
2,3
3,4
4,5


In [13]:
len(DfCustomerUnique),len(products),len(quantity)

(6, 177, 186)

In [14]:
from scipy import sparse
from pandas.api.types import CategoricalDtype

# Row indices: the position of each row's customer in the sorted customer list (one code per DataGrouped row)
rows = DataGrouped.Customer.astype(CategoricalDtype(categories=customers)).cat.codes

# Column indices: the position of each row's SalesItem in the product list
cols = DataGrouped.SalesItem.astype(CategoricalDtype(categories= products)).cat.codes

# Compressed Sparse Row matrix with one row per customer and one column per product (6 x 177 in this sample)
PurchaseSparse = sparse.csr_matrix((quantity, (rows, cols)), shape=(len(customers), len(products)))
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
# where data, row_ind and col_ind satisfy the relationship a[row_ind[k], col_ind[k]] = data[k], see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html

PurchaseSparse
#a sparse matrix is not a pandas dataframe, but sparse matrices are efficient for row slicing and fast matrix vector products


<6x177 sparse matrix of type '<class 'numpy.intc'>'
	with 186 stored elements in Compressed Sparse Row format>
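
The constructor call above takes three parallel sequences. As a dependency-free sketch of the relationship `a[row_ind[k], col_ind[k]] = data[k]`, the hypothetical helper `dense_from_triplets` below builds a plain nested list (not a SciPy matrix) from made-up triplets:

```python
def dense_from_triplets(data, rows, cols, shape):
    """Build a dense nested list from (data, (rows, cols)) coordinate triplets."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for value, r, c in zip(data, rows, cols):
        dense[r][c] += value  # duplicates are summed, which is also what SciPy does
    return dense

# Three purchases: customer 0 bought item 0 (154 units) and item 2 (5 units),
# customer 1 bought item 1 (1 unit).
matrix = dense_from_triplets([154, 5, 1], [0, 0, 1], [0, 2, 1], shape=(2, 3))
print(matrix)  # [[154, 0, 5], [0, 1, 0]]
```

The sparse format stores only the three non-zero cells, while the dense list has to hold all six.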

In [15]:
# In this sample we have 6 customers and 177 items. Of all possible customer/item combinations, 186 had a purchase.
# In terms of sparsity of the matrix, that makes:
MatrixSize = PurchaseSparse.shape[0]*PurchaseSparse.shape[1] # 1062 possible interactions in the matrix (6 unique customers * 177 unique SalesItems = 1062)
PurchaseAmount = len(PurchaseSparse.nonzero()[0]) # 186 customer/item pairs with a purchase
sparsity = 100*(1 - (PurchaseAmount/MatrixSize))
sparsity


82.48587570621469

Since we will use matrix factorization for our collaborative filtering, it should not be a problem that about 82.5% of the interaction matrix is empty. In plain English, 82.5% sparsity means that only 17.5% of the customer-item combinations are filled, so most items have not been purchased by most customers. Collaborative filtering is said to work well even with considerably sparser data, and we can judge how well it works here by checking the recommendations at the end. Cosine similarity is a good measure for sparse data, so we will stick to cosine (instead of Pearson, Euclidean or Manhattan).
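
The sparsity figure is plain arithmetic on three counts. A minimal sketch, using the counts reported by the cells above (the helper name `sparsity_percent` is illustrative and not part of the notebook):

```python
def sparsity_percent(n_customers: int, n_items: int, n_filled: int) -> float:
    """Share of empty cells in a customer x item matrix, in percent."""
    matrix_size = n_customers * n_items
    return 100 * (1 - n_filled / matrix_size)

# 6 customers, 177 items and 186 stored purchases
print(round(sparsity_percent(6, 177, 186), 2))  # 82.49
```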

# Recommending

We have already talked about sparsity. We will start with a simple recommender before we come to more advanced techniques that also use optimization for sparse matrices. In addition, items can be normalized by purchase frequency across all users, which is done in section 3.3 below.

In [16]:
# For every row we add a 1 as "purchased". It means that this customer has purchased this item, no matter how many units. We use this binary data for our recommending. Another approach would be to use the SalesAmount and
# normalize it, in case you want to treat the amount purchased as a kind of taste factor: a customer who bought SalesItem x only 5 times would then be assumed to like it less than a customer who bought it 100 times.
# I believe that in sales a binary approach very often makes more sense, but of course that depends on the data.
def create_DataBinary(frame):
    # Work on a copy of whichever frame is passed in, so the caller's data stays untouched
    DataBinary = frame.copy()
    DataBinary['PurchasedYes'] = 1
    return DataBinary

# DataPrep still has one row per sales line. The pivot further below collapses repeated (Customer, SalesItem) rows, so passing DataGrouped would give the same matrix.
DataBinary = create_DataBinary(DataPrep)
DataBinary.head()



Unnamed: 0,SalesItem,SalesAmount,Customer,PurchasedYes
0,0,10,0,1
1,0,10,0,1
2,0,30,0,1
3,1,10,0,1
4,2,2,0,1


In [17]:
data2=DataBinary.drop(['SalesAmount'], axis=1)
data2.head()

Unnamed: 0,SalesItem,Customer,PurchasedYes
0,0,0,1
1,0,0,1
2,0,0,1
3,1,0,1
4,2,0,1


In [18]:
# Item labels are stored as strings so that they can serve as column headers later on. Optionally, prefix every SalesItem with 'I' (for Item), because plain customer and item numbers can be a little puzzling:
#data2['SalesItem'] = 'I' + data2['SalesItem'].astype(str)
data2['SalesItem'] = data2['SalesItem'].astype(str)

In [19]:

#DfMatrix = pd.pivot_table(data,index=["Customer"], columns='SalesItem')
DfMatrix = pd.pivot_table(data2, values='PurchasedYes', index='Customer', columns='SalesItem')
DfMatrix.head()

Customer,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
3,,,1.0,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
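
To make the pivot step concrete, here is a small sketch on made-up purchase rows (the toy values are assumptions, only the column names follow the notebook). It also shows why repeated rows for the same customer and item are harmless: `pivot_table` averages the flag by default, and the mean of several ones is still one.

```python
import pandas as pd

# Toy purchase log; customer 0 bought item "A" twice
toy = pd.DataFrame({
    "Customer":     [0, 0, 0, 1],
    "SalesItem":    ["A", "A", "B", "B"],
    "PurchasedYes": [1, 1, 1, 1],
})

# The default aggfunc is the mean, so duplicate (Customer, SalesItem) rows collapse to 1.0
toy_matrix = pd.pivot_table(toy, values="PurchasedYes",
                            index="Customer", columns="SalesItem").fillna(0)
print(toy_matrix)
```

Customer 1 never bought item "A", so that cell is missing after the pivot and becomes 0 through `fillna(0)`.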


In [20]:
len(DfMatrix)

6

In [21]:
# Since we are only using 1 and 0, we do not need to think about normalization. But talk is cheap, so let's check that min-max normalization leaves the matrix unchanged:
DfMatrix=DfMatrix.fillna(0) # NaN values are replaced by 0, meaning the item has not been purchased yet.
DfMatrixNorm3 = (DfMatrix-DfMatrix.min())/(DfMatrix.max()-DfMatrix.min())
DfMatrixNorm3.head()
# The proof is in the pudding. We will come back to normalization later, when we also take the real sales amount into consideration for recommending.

Customer,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
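
Min-max scaling maps a column's smallest value to 0 and its largest to 1, so a column that already contains only 0 and 1 comes back unchanged. A minimal check in plain Python (the helper name is illustrative, and it assumes the column has at least two distinct values, otherwise the range is zero):

```python
def min_max(column):
    """Rescale values linearly so the minimum becomes 0 and the maximum 1."""
    low, high = min(column), max(column)
    return [(x - low) / (high - low) for x in column]

binary_flags = [1.0, 0.0, 0.0, 1.0]
amounts = [154, 25, 5]

print(min_max(binary_flags))                    # [1.0, 0.0, 0.0, 1.0]
print([round(x, 3) for x in min_max(amounts)])  # [1.0, 0.134, 0.0]
```

Only the second list changes, which is why scaling becomes relevant once real sales amounts are used instead of flags.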


In [22]:
# We need to bring our pivot table into the desired format via reset_index and rename_axis.
DfResetted = DfMatrix.reset_index().rename_axis(None, axis=1) 
DfResetted.head()
# Now each row represents one customer's buying behaviour: 1 means the customer has purchased the item, 0 means the customer has not purchased it yet

Unnamed: 0,Customer,0,1,10,100,101,102,103,104,105,...,90,91,92,93,94,95,96,97,98,99
0,0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:

DfMatrix.shape

(6, 177)

In [24]:
df=DfResetted # Works now because the item labels used as column headers are strings; with integer labels the CustItemSimilarity step fails

In [25]:
# Make sure that no NaN values remain (DfMatrix was already filled with 0 above), because our function will not work on NaN values.
# Please note that we only check whether a specific customer bought a specific item, yes or no. That is called binary: 1 if the customer bought the item, 0 if not. Because the data is binary,
# there is no use in applying any further scaling techniques.
df=df.fillna(0)
df.head()

Unnamed: 0,Customer,0,1,10,100,101,102,103,104,105,...,90,91,92,93,94,95,96,97,98,99
0,0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Create a dataframe that only includes the sales items. Customers are identified by the row index instead.
DfSalesItem = df.drop(columns='Customer')
DfSalesItem.head()

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
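# Calculate the item-based recommendation
import numpy as np
# Divide every item column by its Euclidean (L2) length, so that the dot product of two columns equals their cosine similarity.
# For pure 0/1 data this only rescales items that were bought by more than one customer, but let us compare.
# Vectorized over all columns at once
DfSalesItemNorm = DfSalesItem / np.sqrt(np.square(DfSalesItem).sum(axis=0)) 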
DfSalesItemNorm.head()

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
# Calculating with Vectors to compute Cosine Similarities
ItemItemSim = DfSalesItemNorm.transpose().dot(DfSalesItemNorm) 
ItemItemSim.head()

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
101,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [29]:
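# Another approach would be the corr function, which computes the Pearson correlation.
# The difference: Pearson subtracts each column's mean before comparing, so items without a common buyer get a negative value instead of 0.
SalesItemCorrelation = DfSalesItem.corr()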
SalesItemCorrelation.head()

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,...,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2
1,1.0,1.0,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,...,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2
10,-0.2,-0.2,1.0,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,...,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2,-0.2
100,-0.2,-0.2,-0.2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
101,-0.2,-0.2,-0.2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
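
The -0.2 entries can be checked by hand. With six customers, two items that were each bought by a single, different customer share no buyer, so their cosine similarity is 0. Pearson correlation subtracts each column's mean first and therefore comes out negative. A small sketch in plain Python (the helper and the two toy columns are illustrative, chosen to be consistent with the pattern seen above):

```python
import math

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    dot = sum(x * y for x, y in zip(ca, cb))
    return dot / math.sqrt(sum(x * x for x in ca) * sum(y * y for y in cb))

# Six customers; each item was bought by a single, different customer
item_x = [1, 0, 0, 0, 0, 0]
item_y = [0, 1, 0, 0, 0, 0]

print(round(pearson(item_x, item_y), 3))  # -0.2
```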


In [30]:
#ItemItemSim.to_excel("ExportItem-Item.xlsx")
# Create a placeholder for the closest neighbours of each item
ItemNeighbours = pd.DataFrame(index=ItemItemSim.columns,columns=range(1,10))
ItemNeighbours.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,,,,,,,,,
1,,,,,,,,,
10,,,,,,,,,
100,,,,,,,,,
101,,,,,,,,,


In [31]:
len(ItemNeighbours)

177

In [32]:
len(ItemItemSim.columns)


177

In [33]:
ItemItemSim.head()

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,90,91,92,93,94,95,96,97,98,99
0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
101,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [34]:
ItemNeighbours.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,,,,,,,,,
1,,,,,,,,,
10,,,,,,,,,
100,,,,,,,,,
101,,,,,,,,,


In [35]:
# Create a placeholder for the closest neighbours of each item
#ItemNeighbours = pd.DataFrame(index=ItemItemSim.columns,columns=range(1,10)) 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(ItemItemSim.columns)):
    ItemNeighbours.iloc[i,:9] = ItemItemSim.iloc[0:,i].sort_values(ascending=False)[:9].index
    # We keep the 9 most similar items per item (normally including the item itself), so we can recommend at most 9 items
 


In [36]:
ItemNeighbours.head()


Unnamed: 0,1,2,3,4,5,6,7,8,9
0,0,2,1,97,41,34,35,36,37
1,0,2,1,97,41,34,35,36,37
10,14,10,19,15,16,12,11,9,13
100,178,166,179,177,176,175,174,173,172
101,178,166,179,177,176,175,174,173,172
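
The loop above keeps, for each item, the labels of its most similar items. The same idea on a single row, as a sketch with made-up similarity values (`top_neighbours` is a hypothetical helper). It drops the item itself by label instead of by position, which stays correct when several items tie at similarity 1.0:

```python
def top_neighbours(item, similarities, k):
    """Return the k most similar item labels, excluding the item itself."""
    others = {label: s for label, s in similarities.items() if label != item}
    # sorted() is stable, so tied items keep their original (insertion) order
    ranked = sorted(others, key=others.get, reverse=True)
    return ranked[:k]

# Similarity of item "10" to every item, including itself
sim_to_10 = {"10": 1.0, "14": 1.0, "19": 0.5, "0": 0.0, "15": 0.8}
print(top_neighbours("10", sim_to_10, k=3))  # ['14', '15', '19']
```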


In [37]:
ItemNeighbours.head().iloc[:11,1:9]
# Start at position 1, because position 0 is normally the item itself (with tied similarities another item can take that place, as for item 10 above)

Unnamed: 0,2,3,4,5,6,7,8,9
0,2,1,97,41,34,35,36,37
1,2,1,97,41,34,35,36,37
10,10,19,15,16,12,11,9,13
100,166,179,177,176,175,174,173,172
101,166,179,177,176,175,174,173,172


In [38]:
ItemNeighbours.to_excel("ExportItem-Item-data_neighbours.xlsx")

In [39]:
df.head()

Unnamed: 0,Customer,0,1,10,100,101,102,103,104,105,...,90,91,92,93,94,95,96,97,98,99
0,0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we will create a customer-based recommendation, for which we need our item similarity matrix. For each customer we look at every item the customer has not bought yet and take that item's top N neighbours. We then look up the customer's purchase history for those neighbours and combine it with the similarities into a score. In the end we just have to recommend the items with the highest scores.

In [40]:
# Now we will build a customer-based recommendation, which is built upon the item-item similarity matrix that we have just calculated above.
# Create a placeholder matrix for similarities, and fill in the customer column
CustItemSimilarity = pd.DataFrame(index=df.index,columns=df.columns)
CustItemSimilarity.iloc[:,:1] = df.iloc[:,:1]

In [41]:
CustItemSimilarity.head()

Unnamed: 0,Customer,0,1,10,100,101,102,103,104,105,...,90,91,92,93,94,95,96,97,98,99
0,0,,,,,,,,,,...,,,,,,,,,,
1,1,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,,,,,,,,...,,,,,,,,,,
4,5,,,,,,,,,,...,,,,,,,,,,


In [42]:
len(CustItemSimilarity.index),len(CustItemSimilarity.columns)

(6, 178)

In [43]:
# Getting the similarity scores
def getScore(history, similarities):
   return sum(history*similarities)/(sum(similarities)+0.0001 )
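
`getScore` is a similarity-weighted average of the customer's purchase flags over an item's neighbours. The small constant keeps the division defined when all similarities are zero. A worked example with made-up numbers (the function is restated under a different name so that the block runs on its own):

```python
def get_score(history, similarities):
    """Similarity-weighted share of neighbouring items the customer already bought."""
    weighted = sum(h * s for h, s in zip(history, similarities))
    return weighted / (sum(similarities) + 0.0001)

# Three neighbours of a candidate item: the customer bought the 1st and the 3rd
history = [1, 0, 1]
similarities = [0.9, 0.5, 0.1]

print(round(get_score(history, similarities), 3))  # 0.667
print(get_score([0, 0, 0], [0.0, 0.0, 0.0]))       # 0.0, no division by zero
```

A score close to 1 means the customer already owns most of the items that are similar to the candidate.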

In [44]:
# This takes a long time on the full data set, because every customer/item pair is scored in a Python loop (6 customers * 177 items in this sample)
# We now loop through the rows and columns, filling in the empty cells with similarity scores.
# Note that we score items that the customer has already purchased as 0, because there is no point in recommending them again.

from timeit import default_timer as timer # to see how long the computation takes
start = timer()


for i in range(0,len(CustItemSimilarity.index)):
    for j in range(1,len(CustItemSimilarity.columns)):
        user = CustItemSimilarity.index[i]
        product = CustItemSimilarity.columns[j]

        # .iloc[i, j] reads and writes one cell directly; chained indexing such as df.loc[i][j] = ... may assign to a temporary copy
        if df.iloc[i, j] == 1:
            CustItemSimilarity.iloc[i, j] = 0
        else:
            ItemTop = ItemNeighbours.loc[product].iloc[1:9] # neighbours at positions 1 to 8 (position 0 is normally the item itself)
            # use sort_values; order is no longer available in recent pandas
            ItemTopSimilarity = ItemItemSim.loc[product].sort_values(ascending=False).iloc[1:9]
            # here we use the item dataframe that we generated for the item-item matrix
            CustomerPurchasings = DfSalesItem.loc[user,ItemTop]
            # print(CustomerPurchasings) # debugging aid: prints one Series per customer/item pair

            CustItemSimilarity.iloc[i, j] = getScore(CustomerPurchasings,ItemTopSimilarity)

end = timer()

print('\nRuntime: %0.2fs' % (end - start))

# If a strange error such as tz=getattr(series.dtype, 'tz', None) .. pandas index.. occurs, the cause may be integer
# column headers instead of strings


In [45]:
CustItemSimilarity.head()

Unnamed: 0,Customer,0,1,10,100,101,102,103,104,105,...,90,91,92,93,94,95,96,97,98,99
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
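
The double loop scores one customer/item pair at a time. As a hedged sketch of how the same kind of weighted average could be computed with a single matrix product in NumPy: all names and the toy data below are assumptions, and unlike the loop it excludes each item from its own neighbour list by zeroing the diagonal rather than by position, so results can differ where similarities tie.

```python
import numpy as np

def score_all(purchases, item_sim, k, eps=1e-4):
    """Score every customer/item pair from the k nearest neighbours of each item."""
    sim = item_sim.astype(float).copy()
    np.fill_diagonal(sim, 0.0)                    # an item is not its own neighbour
    # Keep only the k largest similarities in each row, zero out the rest
    weakest = np.argsort(sim, axis=1)[:, :-k]
    np.put_along_axis(sim, weakest, 0.0, axis=1)
    # Weighted sum of purchase flags divided by the sum of the weights
    scores = purchases @ sim.T / (sim.sum(axis=1) + eps)
    scores[purchases == 1] = 0.0                  # never re-recommend owned items
    return scores

# Toy data: 2 customers x 4 items, and a symmetric item-item similarity matrix
purchases = np.array([[1, 1, 0, 0],
                      [0, 0, 1, 0]], dtype=float)
item_sim = np.array([[1.0, 0.8, 0.1, 0.6],
                     [0.8, 1.0, 0.0, 0.5],
                     [0.1, 0.0, 1.0, 0.2],
                     [0.6, 0.5, 0.2, 1.0]])

print(np.round(score_all(purchases, item_sim, k=2), 3))
```

For the first customer, item 3 receives the highest score because both of its two nearest neighbours (items 0 and 1) were already purchased.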


In [46]:
# start = timer()
# CustItemSimilarity.head()
# end = timer()
# print(end - start)

In [47]:
# Now generate a matrix of customer-based recommendations
CustItemRecommend = pd.DataFrame(index=CustItemSimilarity.index, columns=['Customer Name','Item1','Item2','Item3','Item4','Item5','Item6']) # Top 1,2..6
CustItemRecommend.head()

Unnamed: 0,Customer Name,Item1,Item2,Item3,Item4,Item5,Item6
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,,,,,,,
4,,,,,,,


In [48]:
CustItemRecommend.iloc[0:,0] = CustItemSimilarity.iloc[:,0]
CustItemRecommend.head()

Unnamed: 0,Customer Name,Item1,Item2,Item3,Item4,Item5,Item6
0,0,,,,,,
1,1,,,,,,
2,3,,,,,,
3,4,,,,,,
4,5,,,,,,


In [49]:
# Instead of a matrix filled with similarity scores we want to see the product names.
# Note that each sorted row still includes the Customer column; iloc[1:7] skips the top entry and keeps the next six labels.
for i in range(0,len(CustItemSimilarity.index)):
    CustItemRecommend.iloc[i,1:] = CustItemSimilarity.iloc[i,:].sort_values(ascending=False).iloc[1:7,].index.transpose()

In [50]:
CustItemRecommend.head()


Unnamed: 0,Customer Name,Item1,Item2,Item3,Item4,Item5,Item6
0,0,52,34,35,36,37,38
1,1,52,34,35,36,37,38
2,3,52,34,35,36,37,38
3,4,52,34,35,36,37,38
4,5,52,34,35,36,37,38
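
Turning a score table into a list of recommendations means sorting each customer's scores and keeping the top labels. A sketch on made-up scores (the frame, the column names and the helper are assumptions). It sets the ID column aside before sorting, so the ID can never be ranked as if it were a score:

```python
import pandas as pd

def top_n_items(score_table, id_column, n):
    """Return one row per customer with the labels of the n best-scoring items."""
    scores = score_table.set_index(id_column).astype(float)
    rows = [list(row.nlargest(n).index) for _, row in scores.iterrows()]
    columns = [f"Item{rank}" for rank in range(1, n + 1)]
    return pd.DataFrame(rows, index=scores.index, columns=columns)

toy_scores = pd.DataFrame({
    "Customer": [0, 3],
    "34": [0.9, 0.0],
    "52": [0.2, 0.7],
    "97": [0.5, 0.1],
})
print(top_n_items(toy_scores, "Customer", n=2))
```

When all scores of a customer are equal, as in the table above where every customer gets the same six items, the ranking only reflects the tie-breaking order and carries no preference information.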


In [51]:

CustItemRecommend.to_excel("ExportCustomer-Item-CustItemRecommend.xlsx")
# We have coded a binary recommender engine, which only works adequately on a small data set. Let us see in the next chapter whether we can enhance performance and scalability.

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=20)