# E-commerce Recommendation Engine

__This dataset contains 500 actual SKUs from an outdoor apparel brand's product catalog.__

### Preliminaries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from collections import Counter
import string
import re

### Loading Data 

In [3]:
df = pd.read_csv('sample-data.csv')
print('Shape:{}'.format(df.shape))
df.head()

Shape:(500, 2)


| | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |


### Cleaning Data

In [4]:
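def clean(text):
    # Strip HTML tags first, while '<' and '>' are still present.
    # The pattern matches one tag at a time instead of spanning from the first '<' to the last '>'.
    text = re.sub(r'<[^>]*>', ' ', text)
    # Split hyphenated words, e.g. 'stand-alone' becomes 'stand alone'.
    text = text.replace('-', ' ')
    # Drop the remaining punctuation.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Collapse runs of spaces into a single space.
    text = re.sub(' +', ' ', text)
    return text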

In [5]:
df_clean = df.copy()
df_clean['description'] = df_clean['description'].str.lower()
df_clean['description'] = df_clean['description'].apply(clean)
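
When descriptions contain HTML markup, the order of the cleaning steps matters: if punctuation is stripped before the tags, the angle brackets disappear and the tag names stay behind as ordinary words. The sketch below illustrates this on a made-up description. The helper names `punctuation_first` and `tags_first` are hypothetical and exist only for this comparison.

```python
import re
import string

def punctuation_first(text):
    # Removing punctuation first deletes '<' and '>', so no tag is left to match.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'<[^>]*>', ' ', text)
    return re.sub(' +', ' ', text).strip()

def tags_first(text):
    # Removing tags first drops the markup before punctuation is stripped.
    text = re.sub(r'<[^>]*>', ' ', text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(' +', ' ', text).strip()

# Made-up sample, not taken from the dataset.
sample = "wicks moisture.<br><b>details:</b><ul><li>zip pocket</li></ul>"
print(punctuation_first(sample))
print(tags_first(sample))
```

With the first ordering the tag names are fused into the surrounding words, whereas the second ordering returns only the words of the description.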

__First, let's try a statistical NLP approach: compute TF-IDF features for every product description, then use cosine similarity to rank the products most similar to a given product.__

In [6]:
## Create a TF-IDF matrix of unigrams, bigrams, and trigrams for each product.
## The 'stop_words' param tells the TF-IDF module to ignore common English words like 'the'.
## min_df=1 (the default) keeps every term; recent scikit-learn releases reject an integer min_df of 0.

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_clean['description'])
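
To make the `ngram_range=(1, 3)` setting concrete, here is a minimal pure-Python sketch of how word n-grams are formed once stop words are dropped. It is an illustration, not scikit-learn's implementation: the `word_ngrams` helper and the tiny `STOP_WORDS` set are hypothetical, and the real vectorizer uses a much longer English stop list.

```python
STOP_WORDS = {"the", "a", "and", "with"}  # hypothetical, deliberately tiny

def word_ngrams(text, min_n=1, max_n=3, stop_words=STOP_WORDS):
    # Drop stop words first, then slide windows of min_n..max_n tokens.
    tokens = [t for t in text.split() if t not in stop_words]
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("the shell sheds snow and blocks wind")
print(grams)
```

Because this sketch removes stop words before building the windows, an n-gram such as `snow blocks` can join two words that were separated by a stop word in the original text.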

In [7]:
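# TF-IDF rows are L2-normalized by default, so the linear kernel (dot product) equals cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)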
cosine_similarities.shape

(500, 500)
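
`linear_kernel` computes plain dot products, which coincide with cosine similarity only when every row has unit L2 norm (the default `norm='l2'` in `TfidfVectorizer` provides this). The numpy sketch below checks that equivalence on a small made-up matrix; the values are illustrative and not taken from the dataset.

```python
import numpy as np

# Hypothetical term-weight matrix: 3 documents x 4 terms.
X = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [3.0, 0.0, 0.0, 4.0]])

# Scale each row to unit L2 norm.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of the unit-norm rows.
dot_products = X_unit @ X_unit.T

# Cosine similarity computed directly from the unnormalized rows.
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))
print(np.allclose(np.diag(dot_products), 1.0))
```

If the rows were not normalized, the dot products would also reflect description length, so longer descriptions would tend to score higher regardless of content.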

### Model

In [8]:
def recommender(product_id, cosine_similarities, df, number_similar_products=10):
    # Assumes ids run 1..N in row order, so the row index is product_id - 1.
    row = product_id - 1
    # Rank rows by descending similarity and drop the product itself.
    ranked = cosine_similarities[row].argsort()[::-1]
    similar_products = [i for i in ranked if i != row][:number_similar_products]
    print('Description of Product:')
    print(df['description'].iloc[row])
    print('\n')
    print('Products similar to {0}:'.format(product_id))
    for i in similar_products:
        # Print the product id rather than the zero-based row index.
        print(df['id'].iloc[i])
    print('\n')
    print('Descriptions of the Two Most Similar Products:')
    print(df['description'].iloc[similar_products[0]])
    print('\n')
    print(df['description'].iloc[similar_products[1]])

In [9]:
recommender(5,cosine_similarities,df_clean)

Description of Product:
alpine wind jkt on high ridges steep ice and anything alpine this jacket serves as a true best of all worlds staple it excels as a stand alone shell for blustery rock climbs cool weather trail runs and high output ski tours and then when conditions have you ice and alpine climbing it functions as a lightly insulated windshirt on the approach as well as a frictionless midlayer when its time to bundle up and tie in the polyester ripstop shell with a deluge dwr durable water repellent treatment sheds snow and blocks wind while the smooth lightly brushed hanging mesh liner wicks moisture dries fast and doesnt bind to your baselayers superlight stretch woven underarm panels enhance breathability and allow for unimpaired arm motion and the two hand pockets close with zippers a drawcord hem elastic cuffs a heat transfer reflective logo and a regular coil center front zipper with dwr finish round out the features updated this season for an improved fit recyclable throug
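
Printing results is convenient in a notebook, but returning them makes the lookup easier to reuse and test. The following is a sketch of such a variant, assuming, as `recommender` does, that product ids run 1..N in row order. `top_similar` is a hypothetical name and the similarity matrix is made up.

```python
import numpy as np

def top_similar(product_id, similarities, k=2):
    # Assumption: ids are 1..N in row order, so row index = product_id - 1.
    row = product_id - 1
    scores = similarities[row]
    # Rank all rows by descending score, then skip the product itself.
    ranked = np.argsort(scores)[::-1]
    ranked = [i for i in ranked if i != row][:k]
    # Return (product id, similarity score) pairs.
    return [(int(i) + 1, float(scores[i])) for i in ranked]

# Made-up symmetric similarity matrix for four products.
sims = np.array([[1.0, 0.8, 0.1, 0.3],
                 [0.8, 1.0, 0.2, 0.5],
                 [0.1, 0.2, 1.0, 0.4],
                 [0.3, 0.5, 0.4, 1.0]])

print(top_similar(2, sims))
```

Filtering out the query row explicitly, rather than dropping the first ranked item, keeps the result correct even when another product ties with the query at the top score.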

***