In [1]:
#referred to lesson 9.05
#https://stackoverflow.com/questions/61757170/python-unstacked-dataframe-is-too-big-causing-int32-overflow
#https://stackoverflow.com/questions/28651079/pandas-unstack-problems-valueerror-index-contains-duplicate-entries-cannot-re#:~:text=The%20reason%20why%20you%20get%20ValueError%3A%20Index%20contains,%22%20date%20%22%20combinations%20are%20no%20longer%20unique.
#https://stackoverflow.com/questions/43945653/python-pandas-return-dataframe-where-value-count-is-above-a-set-number
#https://stackoverflow.com/questions/54822879/how-can-i-combine-multiple-sparse-and-dense-matrices-together
#got help from Dan Wilhelm (GA) on creating sparse matrix
#https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
#https://stackoverflow.com/questions/17097643/search-for-does-not-contain-on-a-dataframe-in-pandas
#https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
#https://www.geeksforgeeks.org/python-ways-to-create-a-dictionary-of-lists/
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity
import utils as ut
import sys

In [2]:
df=pd.read_csv('./data/video_games.csv')

In [3]:
#ut.make_recommender_df(vg_df)

Again, I circled back to run the function I created over my dataframe after I was done (above), but the code below shows the process I followed to create it.

In [4]:
df.isnull().sum()

customer_id           0
review_id             0
product_id            0
product_parent        0
product_title         0
star_rating           0
helpful_votes         0
total_votes           0
verified_purchase     0
review_date           0
full_review          47
dtype: int64

47 is an extremely small percentage of my overall data set; I'm good to drop these.

In [5]:
df.dropna(inplace=True)

Don't want to forget to convert my review_date column to datetime format! Then I'll see how the whole thing looks.

In [6]:
df['review_date'] = pd.to_datetime(df['review_date'])

In [7]:
df.head()

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,verified_purchase,review_date,full_review
0,12039526,RTIS3L2M1F5SM,B001CXYMFS,737716809,Thrustmaster T-Flight Hotas X Flight Stick,5,0,0,1,2015-08-31,amazing joystick I especially love twist Used Elite Dangerous mac amazing joystick I especially love twist stick different movement binding well move normal way
1,2331478,R3BH071QLH8QMC,B0029CSOD2,98937668,Hidden Mysteries: Titanic Secrets of the Fateful Voyage,1,0,1,1,2015-08-31,One Star poor quality work advertised
2,52495923,R127K9NTSXA2YH,B00GOOSV98,23143350,GelTabz Performance Thumb Grips - PlayStation 4 and PlayStation 3,3,0,0,1,2015-08-31,good could bettee nice tend slip away stick intense hard pressed gaming session
3,14533949,R32ZWUXDJPW27Q,B00Y074JOM,821342511,Zero Suit Samus amiibo - Japan Import (Super Smash Bros Series),4,0,0,1,2015-08-31,Great flawed Great amiibo great collecting Quality material desired since perfect
4,17521011,R2F0POU5K6F73F,B008XHCLFO,24234603,Protection for your 3DS XL,5,0,0,1,2015-08-31,A Must I 2012 2013 XL durable comfortable really cool looking


In [8]:
df.columns

Index(['customer_id', 'review_id', 'product_id', 'product_parent',
       'product_title', 'star_rating', 'helpful_votes', 'total_votes',
       'verified_purchase', 'review_date', 'full_review'],
      dtype='object')

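I need to see if any customers wrote multiple reviews for the same product. That would cause problems when reshaping: `pivot` and `unstack` raise a ValueError when the index contains duplicate entries, and `pivot_table` would silently average the duplicates.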

In [9]:
df.groupby('customer_id')['product_id'].value_counts().sort_values(ascending=False)

customer_id  product_id
38142327     B00005NH6B    13
43411792     B00000JRSB    13
50763007     B0000AHOOJ    10
25214010     B0000C4M22    10
30995260     B00005BW6Z    10
                           ..
38010719     B000035Y6N     1
38010773     B001C91H4G     1
             B001DL8PES     1
38010873     B0083RDT8C     1
10018        B006VE40JQ     1
Name: product_id, Length: 1641949, dtype: int64

Hm... for example, customer 38142327 wrote 13 reviews for product B00005NH6B... what do those reviews look like?

In [10]:
df.loc[(df['product_id']=='B00005NH6B') & (df['customer_id']==38142327)]

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,verified_purchase,review_date,full_review
1571527,38142327,R3NY8KHMR5X2KP,B00005NH6B,199086987,Batman Vengeance,1,1,2,0,2002-06-29,Pitiful I huge Batman fan Anything Batman bet I watch So I heard game I wa excited I actually bought PS2 mostly I could play game I thought game would cool boy wa I wrong It I like dark control supposed dark control work perfectly good control But basically game wa boring It wa really really boring And think one second use BatGrapple anytime want Another thing I hate Robin would made game flaw better would made interesting I mean good guy Batman Batgirl Now isnt messed I know I mean Robin Nightwing Wheres Catwoman She one classic villain And villain cartoon They got Joker Girlfriend Dr Freeze Poison Ivy They pitiful job Joker Dr Freeze face Dr Freeze eye look nothing like I mean classic character besides Batman Joker br Gameplay bad Its messed They make Batman seem weak For instance fall building way save Why make bad know could make cool Graphics detail good make gameplay better Another thing I like cuff junk I mean seriously guy Everytime run cuff henchman keep coming back And slow react movement And Batman style This game almost make hate Batman want go running Superman game PS2 coming September Man I played better Batman game cartoonnetwork com I love Batman much crushed see messed game wa Take advice Batman fan get game And Batman fan still get game I traded cousin game Scooby Doo one And seemed like best game ever I got playing Batman Vengeance Overall worst game based best comic
1572890,38142327,R2U1NNLJS8OHHQ,B00005NH6B,199086987,Batman Vengeance,5,0,1,0,2002-06-17,Batman come alooong way From 1930 new millenium Batman ha nothing outstanding All way Batman still continues appear screen new adventure Batman Beyond Justice League e c And hit today finest gaming quality since Gameboy Advance PS2 It seems Batman actually hold Superman The greatest 1st Superhero ever invented Batman wa one superheroes made cape one first thing cross mind hear word quot Superhero quot And Superhero finally get big break PS2 worst classical enimies And character game dedicated game ha fine quality real gamer ask If miss game truly sorry The flaw game control pretty hard beginner Other game absolutely full proof
1572901,38142327,R122RU8UP19R4T,B00005NH6B,199086987,Batman Vengeance,5,0,0,0,2002-06-17,FUN FOR FAMILY If like playing game family special occasion like birthday Father Day Christmas e c perfect game Parents love remembering Batman old fight playing time kid love new animated series PERFECT CHRISTMAS GIFT Make sure one PS2 list
1572912,38142327,R22Z951P3JMRBQ,B00005NH6B,199086987,Batman Vengeance,5,0,0,0,2002-06-17,Best game hit store There isnt superhero like DC Comics superhero Runner beside Superman Batman still hold And duo ha new show Justice League And Batmans hit PlayStation2 br Now ABOUT game Sony PlayStation2 The ONE AND ONLY PROBLEM game hard But quality game surely make Amazing graphic look like watching Batman The Animated Series DVD Fun age Good gameplay PERFECT storyline Trust miss game
1572927,38142327,R2RCRESG64DH7D,B00005NH6B,199086987,Batman Vengeance,5,1,1,0,2002-06-17,Fairly good game kid 12 First I would like say please stop reviewing game never played life make look bad This game ha great graphic great storyline Absolutely positively good age 12 kid enjoy battling night kinda stuff However kid 10 probably find difficult confusing game control hard If like Tenchu love After beating game game lose Grown ups find game family fun well Isn one popular game year overwhelmingly amazing Be dark knight superhero DC Comics 1 Comic writer America creator Superman
1572929,38142327,RHMREFYJF0IWP,B00005NH6B,199086987,Batman Vengeance,4,0,2,0,2002-06-17,WHATCHA LOOKIN AT WAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAH WAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAH WAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAH Why cry ask sniffle cat farted blew house poor Oh yeah batman vengeance best according vibe magazine well gotta go floss cat
1572933,38142327,R1TZYG1TID6GE6,B00005NH6B,199086987,Batman Vengeance,4,1,2,0,2002-06-17,ha spider man beat I really dont feel like writing long page stuff bare line better spider man movie game The thing Spidey got Batman beat movie movie spider man wa great short period time I mean Spider man long forgotten next 5 year Star Wars Episode 2 knocked long I mean spider man trying avoid drowning getting sliced giant fan getting electricuted fighting marvel game goodness sake Now see wa superman bend slicing fan lift crashing plane Spider man weak
1572935,38142327,R1QDQSKQ7XJPLI,B00005NH6B,199086987,Batman Vengeance,5,0,1,0,2002-06-17,You sorry miss UBI Soft best Good work Warner Brothers
1572936,38142327,R2RI3Y4DAWBISL,B00005NH6B,199086987,Batman Vengeance,5,1,2,0,2002-06-17,Best yet If PS2 nice fun game get It pass time And ha quality game need Most reviewer complain hard hard use control stuff OF COURSE ITS HARD THIS IS BATMAN NOT SPIDER MAN EVERY THING YOU DO IN THIS GAME WONT BE AS SIMPLE AS AVOIDING DROIDS AND AND SISSY stuff Okay get Overall one best game year give ability Gotham City dark hero This one miss Its flaw little confusing every exactly going pick flower I sick tired people saying hard I MEAN HELLO THIS ISN T A BABY GAME SOME THINGS WILL BE HARD IF YOU WANT A BABY GAME THEN GO RENT BARNEY ADVENTURES OR SOMETHING people think going win everything first 5 minute
1572979,38142327,RVES7ZUE7F43B,B00005NH6B,199086987,Batman Vengeance,4,0,0,0,2002-06-17,One DC Comic greatest superheroes finally go PS2 I prefer Wonderwoman actually guess Superman Batman okay Superman Wonderwoman Batman leader new Justice League Okay game cool It one best Superhero game made Its almost like Batman world stare TV long And seriously mean But I recomend kid 10 clue I wish Batmans ally like Superman Wonderwoman would schway


So this person made 13 reviews for the same item, rating 4 or 5 stars EXCEPT for the most recent one (6/29) which was 1 star. Perhaps the ending was disappointing?

In [11]:
review_bools = df.groupby('customer_id')['product_id'].value_counts()>1
#T/F if more than 1 review for same product

In [12]:
review_bools[review_bools==True].count()

5309

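Okay, based on this there are 5,309 customer/product pairs where the customer wrote at least 2 reviews for the same product (a customer can appear more than once if they did this for several products).  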

I could try to get the most recent review for each customer based on date. It's likely reasonable to assume that their latest review is probably the one that best reflects their final sentiment about the product. Unfortunately, because I don't have precise timestamps, if they wrote two reviews on the same day, I can't know which was their last word on the product. Other options would be to average all of their reviews' star values or to simply drop them, as they're a fairly small proportion of the overall set.
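For comparison, here is a minimal sketch of the two alternatives (averaging, or dropping the repeat reviewers entirely) on a toy frame. The data is made up for illustration; the notebook itself continues with the keep-the-latest approach.

```python
import pandas as pd

# Toy data (hypothetical): customer 1 reviewed product "A" twice.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id": ["A", "A", "A"],
    "star_rating": [5, 1, 4],
})
keys = ["customer_id", "product_id"]

# Alternative 1: average all ratings a customer gave the same product.
averaged = toy.groupby(keys, as_index=False)["star_rating"].mean()

# Alternative 2: drop every customer/product pair that has more than one review.
dropped = toy[~toy.duplicated(subset=keys, keep=False)]

print(averaged)
print(dropped)
```

Averaging keeps every reviewer but can produce non-integer ratings; dropping is simplest but discards those customers' opinions of the product altogether.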

In [13]:
xtra_rev_cs = []
for key, value in dict(review_bools[review_bools==True]).items():
    xtra_rev_cs.append({key[0]:key[1]}) #return customer id and product id ONLY
#list of one-item dicts {customer_id: product_id}, one per pair with more than 1 review for the same item
xtra_rev_cs[:5]  #preview

[{198273: 'B00DNGQQUQ'},
 {224068: 'B0049LAVB4'},
 {815919: 'B0000TSR4C'},
 {884129: 'B000WMEEB2'},
 {884129: 'B00ERDGMT4'}]

In [14]:
df[(df['customer_id'] == 38142327) & 
   (df['product_id']=='B00005NH6B')].sort_values(
    by='review_date', ascending=False).index[1:] #test

Int64Index([1572890, 1572901, 1572912, 1572927, 1572929, 1572933, 1572935,
            1572936, 1572979, 1572980, 1572983, 1573111],
           dtype='int64')

Okay - with the above code in conjunction with the xtra_rev_cs list, I can filter the dataframe by customer_id and product_id to isolate the rows that pertain to a single product reviewed multiple times by the same user, and then order them by date so that the most recent date is on top. While this does open the possibility that two reviews with the same date may be conflated as far as which ACTUALLY came first, the examples I've previewed in this situation are typically duplicates (i.e. someone submitted the same review twice). In that circumstance, either review is equally valid to keep.  

The next step is to loop through my xtra_rev_cs list and collect the indexes of every review EXCEPT the most recent one for each pair. I'll drop those others, and be left with one review per person per item.

In [15]:
rev_indexes_to_drop = []
for pair in xtra_rev_cs:
    for key, value in pair.items():
        rev_indexes_to_drop.append(  #add index numbers to the empty list
        df[(df['customer_id'] == key) &    #where customer id is the key from xtra_rev_cs...
       (df['product_id']==value)].sort_values( #and product id is the value from xtra_rev_cs
        by='review_date', ascending=False).index[1:] #starting with the SECOND index number
        )
rev_indexes_to_drop[:5] #preview

[Int64Index([26651], dtype='int64'),
 Int64Index([70501], dtype='int64'),
 Int64Index([90766], dtype='int64'),
 Int64Index([422385], dtype='int64'),
 Int64Index([421883], dtype='int64')]

In [16]:
rev_indexes_to_drop[0][0] #test

26651

In [17]:
ritd_2 = []
for n in rev_indexes_to_drop:
    for k in n:
        ritd_2.append(k)
ritd_2[:5] #preview

[26651, 70501, 90766, 422385, 421883]

In [18]:
len(ritd_2) #6140 rows of duplicate reviews to be dropped

6140

In [19]:
df.drop(index=ritd_2, inplace=True) #drop all index numbers in list ritd_2
df.groupby('customer_id')['product_id'].value_counts().sort_values(ascending=False) #check if gone

customer_id  product_id
53096565     B00006599W    1
17664004     B000U5RRX8    1
17663554     B000SH3XGS    1
17663609     B002TK1PX0    1
17663621     B00E20STAW    1
                          ..
38073024     B0088MVOES    1
38073066     B006VB2UNM    1
38073092     B007P6Y684    1
             B00A9ZHWH0    1
10018        B006VE40JQ    1
Name: product_id, Length: 1641949, dtype: int64
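The loop above filters the full dataframe once per customer/product pair, which gets slow on large data. As a hedged alternative, the same "keep only the most recent review" rule can be sketched with a sort followed by `drop_duplicates`. The toy frame below is hypothetical, and ties on the same date are still resolved arbitrarily, just as in the loop.

```python
import pandas as pd

# Toy data (hypothetical): customer 1 reviewed product "A" on two dates.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "product_id": ["A", "A", "B", "A"],
    "star_rating": [5, 1, 3, 4],
    "review_date": pd.to_datetime(
        ["2002-06-17", "2002-06-29", "2002-06-20", "2002-07-01"]),
})

latest = (
    toy.sort_values("review_date", ascending=False, kind="stable")  # newest first
       .drop_duplicates(subset=["customer_id", "product_id"], keep="first")
       .sort_index()  # restore the original row order
)
print(latest)
```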

In [20]:
df2 = df[['customer_id', 'product_id', 'product_title', 'star_rating']].copy()

In [21]:
#pivot = pd.pivot_table(df2, index='product_title', columns='customer_id', values='star_rating')

My first attempt at making a pivot table from this data failed because the resulting dataframe is too big to hold in memory. I considered building it in chunks, but ended up building a sparse matrix instead.

In [22]:
#got help from Dan Wilhelm on creating sparse matrix

In [23]:
df2.shape #rows

(1641949, 4)

In [24]:
df2['customer_id'].nunique() #unique customers

979917

In [25]:
df2['product_id'].nunique() #unique products

20952

In [26]:
len(df2)/37

44377.0

In [27]:
df2.dtypes

customer_id       int64
product_id       object
product_title    object
star_rating       int64
dtype: object

My matrix will need the title input to be a numerical value (int); I will create a list of unique products and match each one with an integer from 0 to (# of products - 1) in a dictionary, and then add a new column with that number to my dataframe.
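As an aside, pandas has a built-in for this kind of label-to-integer mapping. The sketch below uses `pd.factorize` on a few made-up titles; it is an alternative to the dictionary approach, not what the notebook actually runs.

```python
import pandas as pd

# Hypothetical titles, with one repeat.
titles = pd.Series(["Zoo Cube", "NBA Live 15", "Zoo Cube", "Hotel - PC"])

# codes: one integer per row, starting at 0 in order of first appearance.
# uniques: the distinct titles, so uniques[code] recovers the title.
codes, uniques = pd.factorize(titles)

print(codes)
print(list(uniques))
```

Unlike `list(set(...))`, whose ordering can differ between runs, the codes here follow order of first appearance, so the mapping is reproducible.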

In [28]:
unique_prods = list(set(df2['product_title'])) #create list of unique products
prod_index = {p:i for i,p in enumerate(unique_prods)} #match unique products with integer values

In [29]:
df2['prod_numerical'] = df2['product_title'].apply(lambda x: prod_index[x])

In [30]:
df2.head() #preview

Unnamed: 0,customer_id,product_id,product_title,star_rating,prod_numerical
0,12039526,B001CXYMFS,Thrustmaster T-Flight Hotas X Flight Stick,5,12951
1,2331478,B0029CSOD2,Hidden Mysteries: Titanic Secrets of the Fateful Voyage,1,8231
2,52495923,B00GOOSV98,GelTabz Performance Thumb Grips - PlayStation 4 and PlayStation 3,3,5116
3,14533949,B00Y074JOM,Zero Suit Samus amiibo - Japan Import (Super Smash Bros Series),4,5453
4,17521011,B008XHCLFO,Protection for your 3DS XL,5,8393


I should try to save memory in other ways too. I can convert my star rating to an 8-bit integer format, since it will only ever be a number from 1 to 5. I can also map my unique customer id numbers (already integers) to smaller consecutive integers so that my matrix doesn't create empty columns for all the unused id values in between.
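A quick sketch of what the 8-bit conversion buys, on a made-up column of 1,000 ratings. The range check is an extra safety step I'm assuming here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings column, stored as int64 the way read_csv would load it.
ratings = pd.Series([5, 1, 3, 4] * 250, dtype="int64")

# Confirm every value fits in int8 (-128..127) before downcasting.
limits = np.iinfo(np.int8)
assert ratings.between(limits.min, limits.max).all()

small = ratings.astype(np.int8)

bytes_before = ratings.memory_usage(index=False)  # 8 bytes per value
bytes_after = small.memory_usage(index=False)     # 1 byte per value
print(bytes_before, bytes_after)
```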

In [31]:
df2['star_rating'] = df2['star_rating'].astype(np.int8) #convert to take less memory

In [32]:
unique_cs = list(set(df2['customer_id']))
cs_index = {p:i for i,p in enumerate(unique_cs)}
df2['cs_numerical'] = df2['customer_id'].apply(lambda x: cs_index[x])

In [33]:
df2.head() #preview

Unnamed: 0,customer_id,product_id,product_title,star_rating,prod_numerical,cs_numerical
0,12039526,B001CXYMFS,Thrustmaster T-Flight Hotas X Flight Stick,5,12951,727492
1,2331478,B0029CSOD2,Hidden Mysteries: Titanic Secrets of the Fateful Voyage,1,8231,113959
2,52495923,B00GOOSV98,GelTabz Performance Thumb Grips - PlayStation 4 and PlayStation 3,3,5116,32535
3,14533949,B00Y074JOM,Zero Suit Samus amiibo - Japan Import (Super Smash Bros Series),4,5453,912813
4,17521011,B008XHCLFO,Protection for your 3DS XL,5,8393,358299


Okay - that's probably about as minimal as we can get our inputs; let's make this sparse matrix!

In [34]:
sparse_reviews = sparse.csr_matrix((df2.star_rating, (df2.prod_numerical, df2.cs_numerical)), dtype=np.int8)

In [35]:
sparse_reviews.shape

(15938, 979917)
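For a sense of why the dense pivot table failed, here is the back-of-the-envelope arithmetic using the counts printed in this notebook. It assumes the pivot would have stored 8-byte floats (NaN for missing ratings), which is my assumption about how pandas would fill it.

```python
# Counts taken from the outputs above.
n_titles = 15_938        # rows of the sparse matrix
n_customers = 979_917    # columns of the sparse matrix
n_ratings = 1_641_949    # rows in df2, an upper bound on filled cells

dense_cells = n_titles * n_customers
dense_gb = dense_cells * 8 / 1e9   # 8 bytes per float64 cell
density = n_ratings / dense_cells  # share of cells that hold a rating

print(f"{dense_cells:,} cells, about {dense_gb:.0f} GB dense, density <= {density:.2e}")
```

That works out to roughly 125 GB for the dense table, with only about one cell in ten thousand holding a rating, which is exactly the situation a sparse matrix is designed for.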

In [36]:
print(sparse_reviews[1]) #preview

  (0, 7466)	5
  (0, 24556)	5
  (0, 43120)	4
  (0, 56791)	5
  (0, 68845)	4
  (0, 80103)	5
  (0, 85878)	4
  (0, 89855)	5
  (0, 100872)	3
  (0, 111387)	4
  (0, 116121)	2
  (0, 121817)	2
  (0, 147474)	5
  (0, 155347)	1
  (0, 165769)	5
  (0, 175603)	5
  (0, 182488)	4
  (0, 182869)	3
  (0, 209617)	5
  (0, 229909)	5
  (0, 240390)	5
  (0, 248679)	4
  (0, 249271)	4
  (0, 251424)	2
  (0, 253226)	5
  :	:
  (0, 476996)	5
  (0, 493866)	3
  (0, 495000)	4
  (0, 501839)	5
  (0, 503132)	2
  (0, 518339)	5
  (0, 526403)	4
  (0, 537005)	1
  (0, 554537)	5
  (0, 565224)	4
  (0, 568129)	5
  (0, 610841)	4
  (0, 614414)	2
  (0, 618300)	5
  (0, 636867)	5
  (0, 646786)	4
  (0, 650607)	3
  (0, 808376)	5
  (0, 809097)	5
  (0, 814836)	10
  (0, 826612)	3
  (0, 866206)	4
  (0, 900646)	5
  (0, 944804)	4
  (0, 972750)	3


In [37]:
df2[df2['prod_numerical']==1] #test - cs numbers above should match cs_numerical column below

Unnamed: 0,customer_id,product_id,product_title,star_rating,prod_numerical,cs_numerical
56707,1360659,B00020V4RG,NCAA Football 2005,5,1,636867
958445,27565503,B00020V4RG,NCAA Football 2005,5,1,147474
996118,22353312,B00020V4RG,NCAA Football 2005,4,1,646786
1196639,14998396,B00020V4QM,NCAA Football 2005,1,1,155347
1272806,42415094,B00020V4RG,NCAA Football 2005,5,1,229909
...,...,...,...,...,...,...
1495115,17343847,B00020V4QM,NCAA Football 2005,5,1,275747
1495127,37575847,B00020V4QM,NCAA Football 2005,5,1,900646
1495145,17351614,B00020V4QM,NCAA Football 2005,5,1,279519
1498672,52071209,B00020V4RG,NCAA Football 2005,5,1,814836


The matrix looks good - now it's time to make the recommender model. To do this, we will compute the cosine distance between every pair of items' rating vectors; items that the same customers rated in a similar way will have a smaller distance.
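One caveat before moving on: the preview shows a value of 10 at column 814836, which is outside the 1-5 scale. `csr_matrix` sums entries that share the same (row, column) pair, so this most likely comes from a customer who rated two product IDs that share one title (the title maps to two IDs in the table above). Below is a minimal sketch of that behaviour and one possible fix, averaging per (title, customer) pair first. The toy data and the choice to average are assumptions on my part.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data (hypothetical): customer 0 rated title 0 twice, 5 stars each time.
toy = pd.DataFrame({
    "prod_numerical": [0, 0, 1],
    "cs_numerical":   [0, 0, 1],
    "star_rating":    [5, 5, 3],
})

# Built directly, the two 5s are added together.
summed = sparse.csr_matrix(
    (toy.star_rating, (toy.prod_numerical, toy.cs_numerical)), dtype=np.int8)

# Averaging per (title, customer) pair first keeps ratings on the 1-5 scale.
deduped = toy.groupby(
    ["prod_numerical", "cs_numerical"], as_index=False)["star_rating"].mean()
clean = sparse.csr_matrix(
    (deduped.star_rating, (deduped.prod_numerical, deduped.cs_numerical)),
    dtype=np.float32)

print(summed[0, 0], clean[0, 0])
```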

In [38]:
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity

In [39]:
dists = pairwise_distances(sparse_reviews, metric='cosine')

In [40]:
dists #preview

array([[0.        , 1.        , 1.        , ..., 1.        , 1.        ,
        1.        ],
       [1.        , 0.        , 1.        , ..., 1.        , 1.        ,
        0.98659978],
       [1.        , 1.        , 0.        , ..., 1.        , 1.        ,
        1.        ],
       ...,
       [1.        , 1.        , 1.        , ..., 0.        , 1.        ,
        1.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 0.        ,
        1.        ],
       [1.        , 0.98659978, 1.        , ..., 1.        , 1.        ,
        0.        ]])
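To make the numbers above concrete: cosine distance is 1 minus the cosine of the angle between two rating vectors, i.e. `1 - (u . v) / (|u| |v|)`. Here is a small hand-rolled version in NumPy on made-up vectors. `cosine_distance` is a hypothetical helper for illustration, not the scikit-learn function used above.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical items rated by three customers (0 = no rating).
a = [5, 0, 4]
b = [4, 0, 5]  # rated by the same customers as a
c = [0, 3, 0]  # no customers in common with a

print(cosine_distance(a, a))  # identical vectors
print(cosine_distance(a, b))  # shared customers, small distance
print(cosine_distance(a, c))  # no overlap
```

Two items with no reviewers in common always have a distance of exactly 1, which is why the matrix above is dominated by 1s.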

At this point, I realize how large my returned recommender (product x product) dataframe is going to be. The next few cells are tests to see how I can create a sparse dataframe and what benefit it provides.

In [85]:
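#note: for floats, SparseArray's default fill_value is NaN, so no values are actually compressed here
dists_sparse = [pd.arrays.SparseArray(dists[n]) for n in range(len(dists))]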

In [86]:
print(sys.getsizeof(dists))
sys.getsizeof(dists_sparse) #caution: for a list this counts only the list object itself, not the arrays it holds

2032158864


140576
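The second number is deceptively small: `sys.getsizeof` on a list measures only the list's own pointers, not the arrays inside it. A fairer check is to look at each `SparseArray` directly. The sketch below uses a made-up row of distances and the `npoints` and `nbytes` attributes to show how much the choice of `fill_value` matters.

```python
import numpy as np
import pandas as pd

# Hypothetical row of distances: almost all 1.0, two closer items.
row = np.ones(1000)
row[[3, 10]] = 0.9

# Default fill_value for floats is NaN, so every value is stored explicitly.
default_fill = pd.arrays.SparseArray(row)

# With fill_value=1.0 only the two non-1 values are stored.
ones_fill = pd.arrays.SparseArray(row, fill_value=1.0)

print(default_fill.npoints, default_fill.nbytes)
print(ones_fill.npoints, ones_fill.nbytes)
```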

In [243]:
l1 = [0,1,2,3,0,1,2]
l2 = [1,0,1,1,2,1,0]
l3 = [0,1,0,1,3,1,0]
l4 = [0,0,1,1,0,0,1]
l5 = [1,1,2,1,1,2,1]
l6 = [0,1,1,2,1,1,1]
l7 = [2,1,2,1,1,0,1]
a = pd.DataFrame({'one':pd.arrays.SparseArray(l1, fill_value=1),
                 'two':pd.arrays.SparseArray(l2, fill_value=1),
                 'three':pd.arrays.SparseArray(l3, fill_value=1),
                 'four':pd.arrays.SparseArray(l4, fill_value=1),
                 'five':pd.arrays.SparseArray(l5, fill_value=1),
                 'six':pd.arrays.SparseArray(l6, fill_value=1),
                 'seven':pd.arrays.SparseArray(l7, fill_value=1)
                 })
b = pd.DataFrame({'one':l1, 'two':l2, 'three':l3, 'four':l4,
                 'five':l5, 'six':l6, 'seven':l7})

In [244]:
print(a.dtypes)
b.dtypes

one      Sparse[int64, 1]
two      Sparse[int64, 1]
three    Sparse[int64, 1]
four     Sparse[int64, 1]
five     Sparse[int64, 1]
six      Sparse[int64, 1]
seven    Sparse[int64, 1]
dtype: object


one      int64
two      int64
three    int64
four     int64
five     int64
six      int64
seven    int64
dtype: object

In [246]:
a

Unnamed: 0,one,two,three,four,five,six,seven
0,0,1,0,0,1,0,2
1,1,0,1,0,1,1,1
2,2,1,0,1,2,1,2
3,3,1,1,1,1,2,1
4,0,2,3,0,1,1,1
5,1,1,1,0,2,1,0
6,2,0,0,1,1,1,1


In [247]:
b

Unnamed: 0,one,two,three,four,five,six,seven
0,0,1,0,0,1,0,2
1,1,0,1,0,1,1,1
2,2,1,0,1,2,1,2
3,3,1,1,1,1,2,1
4,0,2,3,0,1,1,1
5,1,1,1,0,2,1,0
6,2,0,0,1,1,1,1


In [245]:
sys.getsizeof(a)

420

In [248]:
sys.getsizeof(b)

536

In [250]:
array1 = [l1,l2,l3,l4,l5,l6,l7]
cols = ['one','two','three','four', 'five', 'six', 'seven']

array_dict = {}
for n in range(len(array1)):
    array_dict[cols[n]] = pd.arrays.SparseArray(array1[n], fill_value=1)
#array_dict

In [251]:
c = pd.DataFrame(array_dict, index=cols, columns=cols)
d = pd.DataFrame(array1, index=cols, columns=cols)

In [252]:
c

Unnamed: 0,one,two,three,four,five,six,seven
one,0,1,0,0,1,0,2
two,1,0,1,0,1,1,1
three,2,1,0,1,2,1,2
four,3,1,1,1,1,2,1
five,0,2,3,0,1,1,1
six,1,1,1,0,2,1,0
seven,2,0,0,1,1,1,1


In [253]:
d

Unnamed: 0,one,two,three,four,five,six,seven
one,0,1,2,3,0,1,2
two,1,0,1,1,2,1,0
three,0,1,0,1,3,1,0
four,0,0,1,1,0,0,1
five,1,1,2,1,1,2,1
six,0,1,1,2,1,1,1
seven,2,1,2,1,1,0,1


In [254]:
print(c.dtypes)
d.dtypes

one      Sparse[int64, 1]
two      Sparse[int64, 1]
three    Sparse[int64, 1]
four     Sparse[int64, 1]
five     Sparse[int64, 1]
six      Sparse[int64, 1]
seven    Sparse[int64, 1]
dtype: object


one      int64
two      int64
three    int64
four     int64
five     int64
six      int64
seven    int64
dtype: object

In [255]:
sys.getsizeof(c)

718

In [256]:
sys.getsizeof(d)

834

In [259]:
recommender_df['NBA Live 15'].value_counts() #example

1.000000    15622
0.974699        2
0.978993        1
0.988036        1
0.997491        1
            ...  
0.998127        1
0.979906        1
0.984334        1
0.982730        1
0.000000        1
Name: NBA Live 15, Length: 316, dtype: int64

This is a good example of how high a proportion of 1's each product has: 15,622 of this title's 15,938 distances (about 98%) are exactly 1. Making them sparse is going to have huge benefits!

In [278]:
dists_smaller = dists.astype(np.float32)

In [279]:
#array1 = [l1,l2,l3,l4,l5,l6,l7]
rec_cols = unique_prods

rec_dict = {}
for n in range(len(dists_smaller)):
    rec_dict[rec_cols[n]] = pd.arrays.SparseArray(dists_smaller[n], fill_value=1)
#rec_dict

In [280]:
recommender_df_sparse = pd.DataFrame(rec_dict,
                             index=rec_cols,
                             columns=rec_cols)

#recommender_df_sparse.head()

In [47]:
recommender_df = pd.DataFrame(dists,
                             index=unique_prods,
                             columns=unique_prods)

#recommender_df.head()

In [281]:
print(sys.getsizeof(recommender_df)/1_000_000_000)
sys.getsizeof(recommender_df_sparse)/1_000_000_000

2.034354337


0.051519521
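A possible variation, sketched here under my own assumptions rather than taken from the notebook: storing cosine *similarity* instead of distance makes the common value 0 rather than 1, so the result can stay sparse from start to finish and the dense 2 GB array never has to exist. The toy ratings and titles are hypothetical, and the sketch assumes every item has at least one rating (no all-zero rows).

```python
import numpy as np
import pandas as pd
from scipy import sparse

titles = ["A", "B", "C"]  # hypothetical product titles
ratings = sparse.csr_matrix(np.array([[5, 0, 4],
                                      [4, 0, 5],
                                      [0, 3, 0]], dtype=np.float64))

# Scale each row to unit length; dot products of unit rows are cosine similarities.
norms = np.sqrt(np.asarray(ratings.multiply(ratings).sum(axis=1)).ravel())
unit_rows = sparse.diags(1.0 / norms) @ ratings  # assumes no all-zero rows
similarity = (unit_rows @ unit_rows.T).tocsr()

# Wrap the sparse result in a DataFrame with sparse columns (fill value 0).
sim_df = pd.DataFrame.sparse.from_spmatrix(similarity, index=titles, columns=titles)
print(sim_df)
```

With similarities, the closest items are the largest values, so a lookup would sort in descending order instead of ascending.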

In [282]:
recommender_df.dtypes

NBA Live 15                            float64
NCAA Football 2005                     float64
Hotel - PC                             float64
Nintendo Wii Fit Bundle                float64
Capcom Classics Collection Reloaded    float64
                                        ...   
Pirates: The Legend of Black Kat       float64
Harvest Moon: Island of Happiness      float64
Zoo Cube                               float64
Escape the Museum 2                    float64
Mvp Baseball 2003: Xbox                float64
Length: 15938, dtype: object

In [283]:
recommender_df_sparse.dtypes

NBA Live 15                            Sparse[float32, 1]
NCAA Football 2005                     Sparse[float32, 1]
Hotel - PC                             Sparse[float32, 1]
Nintendo Wii Fit Bundle                Sparse[float32, 1]
Capcom Classics Collection Reloaded    Sparse[float32, 1]
                                              ...        
Pirates: The Legend of Black Kat       Sparse[float32, 1]
Harvest Moon: Island of Happiness      Sparse[float32, 1]
Zoo Cube                               Sparse[float32, 1]
Escape the Museum 2                    Sparse[float32, 1]
Mvp Baseball 2003: Xbox                Sparse[float32, 1]
Length: 15938, dtype: object

In [53]:
#rec_sparse = [pd.arrays.SparseArray(rec[n]) for n in range(len(dists))]

Alright! There's our recommender dataframe. Each cell holds the cosine distance between two titles: 0 means a perfect match (that's why each item has a 0 where it is matched with itself), and 1 is the furthest distance possible here, since ratings are never negative.

In [286]:
q = 'Harvest Moon 3D: A New Beginning'
titles = list(df2[df2['product_title'].str.contains(q)]['product_title'])
#for title in titles:
    #print(title)
    #print('Average rating:', df2.loc[df2['product_title']==title, 'star_rating'].mean())
    #print('Number of ratings:', (df2['product_title']==title).sum())
    #print()
print('10 closest items:', recommender_df_sparse.loc[titles[0],:].sort_values()[0:10])
    #print('\n***************************************\n')

10 closest items: Harvest Moon 3D: A New Beginning - Nintendo 3DS    0.000000
Harvest Moon: Tale of Two Towns - Nintendo 3DS     0.940594
Hometown Story with Ember the Dragon Plush         0.948023
Style Savvy: Trendsetters - Nintendo 3DS           0.957537
Rune Factory 4 - Nintendo 3DS                      0.958158
Harvest Moon: The Lost Valley - Nintendo 3DS       0.959587
Harvest Moon: Grand Bazaar - Nintendo DS           0.966500
Story of Seasons - Nintendo 3DS                    0.968455
Animal Crossing: New Leaf                          0.968942
Nintendo 3DS XL - Pink/White                       0.973937
Name: Harvest Moon 3D: A New Beginning - Nintendo 3DS, dtype: Sparse[float32, 1]


In [54]:
q = 'Mass Effect 2'
titles = list(df2[df2['product_title'].str.contains(q)]['product_title'])
#for title in titles:
    #print(title)
    #print('Average rating:', df2.loc[df2['product_title']==title, 'star_rating'].mean())
    #print('Number of ratings:', (df2['product_title']==title).sum())
    #print()
print('10 closest items:', recommender_df.loc[titles[0],:].sort_values()[0:10])
    #print('\n***************************************\n')

10 closest items: Mass Effect 2                       0.000000
Mass Effect - Xbox 360 (Limited)    0.866848
Mass Effect 3                       0.886100
Dragon Age: Origins                 0.920578
Dragon Age 2                        0.948162
Alpha Protocol                      0.948719
Bioshock 2                          0.950226
Fallout 3                           0.951378
Dead Space 2                        0.951836
Fallout New Vegas                   0.954123
Name: Mass Effect 2, dtype: float64


In [43]:
sys.getsizeof(recommender_df)/1_000_000_000 #check size of recommender df

2.034354337

At this point I'm going to pause and consolidate what I have done so far into a single function that I can generalize to my other two data sets (movies and books). I'll save this function in my Python utility file.

In [44]:
def make_recommender_df(df):
    #drop any null values remaining from cleaning (will only be a handful in concatenated NLP column)
    df.dropna(inplace=True)
    #make T/F list for if cs has written more than 1 review for the same product
    review_bools = df.groupby('customer_id')['product_id'].value_counts()>1
    
    #list of one-item dicts {customer_id: product_id}, one per pair with more than 1 review for the same item
    xtra_rev_cs = []
    for key, value in dict(review_bools[review_bools==True]).items():
        xtra_rev_cs.append({key[0]:key[1]}) #return customer id and product id ONLY
    
    #make list of original df indexes corresponding to reviews that need to be dropped
    rev_indexes_to_drop = []
    for pair in xtra_rev_cs:
        for key, value in pair.items():
            rev_indexes_to_drop.append(  #add index numbers to the empty list
            df[(df['customer_id'] == key) &    #where customer id is the key from xtra_rev_cs...
           (df['product_id']==value)].sort_values( #and product id is the value from xtra_rev_cs
            by='review_date', ascending=False).index[1:] #starting with the SECOND index number
            )
    ritd_2 = [] #list for indexes
    for n in rev_indexes_to_drop:
        for k in n:
            ritd_2.append(k)
    print(f'Dropping {len(ritd_2)} duplicate values.') #print status update for number of duplicates being dropped
    df.drop(index=ritd_2, inplace=True) #drop all index numbers in list ritd_2
    
    #make new dataframe for recommender build
    df2 = df[['customer_id', 'product_id', 'product_title', 'star_rating']].copy()
    print(f"Unique customers: {df2['customer_id'].nunique()}") #preview number of unique customers
    print(f"Unique products: {df2['product_id'].nunique()}") #preview number of unique products
    
    unique_prods = list(set(df2['product_title'])) #create list of unique products
    prod_index = {p:i for i,p in enumerate(unique_prods)} #match unique products with integer values
    df2['prod_numerical'] = df2['product_title'].apply(lambda x: prod_index[x]) #add column to df2
    
    unique_cs = list(set(df2['customer_id'])) #create list of unique customers
    cs_index = {p:i for i,p in enumerate(unique_cs)} #match unique customers with (smaller) integer values
    df2['cs_numerical'] = df2['customer_id'].apply(lambda x: cs_index[x]) #add column to df2
    
    df2['star_rating'] = df2['star_rating'].astype(np.int8) #convert to take up less memory
    
    #create sparse matrix comparing customer ratings and products
    sparse_reviews = sparse.csr_matrix((df2.star_rating, (df2.prod_numerical, df2.cs_numerical)), dtype=np.int8)
    print(f'Size of matrix: {sparse_reviews.shape}') #preview size of sparse matrix
    #get cosine distances between items
    dists = pairwise_distances(sparse_reviews, metric='cosine')
    #create recommender df
    recommender_df = pd.DataFrame(dists,
                             index=unique_prods,
                             columns=unique_prods)
    print(f'Size of Recommender: {sys.getsizeof(recommender_df)/1_000_000_000} GB') #check size in GB
    return recommender_df

Generally speaking, the more specific my search, the more accurate the results. There are a few searches I've noticed that aren't returning values I'd expect:  

- Playstation - gives game not console
- Xbox 360 - gives game not console
- Batman Arkham - gives console not game

I'd also like to see if I can add an input for words to *not* return in the search. For example - I might want to see games like Age of Empires but I don't want the returned titles to include "PC" because I don't want computer games.

In [150]:
q = 'NBA'
wout = 'Nintendo'
titles = list(df2[(df2['product_title'].str.contains(q)) &  #contains this keyword
                  (~df2['product_title'].str.contains(wout))]['product_title']) #does not contain this keyword
#for title in titles:
    #print(title)
    #print('Average rating:', df2.loc[df2['product_title']==title, 'star_rating'].mean())
    #print('Number of ratings:', (df2['product_title']==title).sum())
    #print()
print('10 closest items:', recommender_df.loc[titles[0],:].sort_values()[0:10])
    #print('\n***************************************\n')

10 closest items: NBA 2K12(Covers May Vary)     0.000000
NBA 2K13                      0.941754
Madden NFL 12                 0.956399
NBA 2K11                      0.961517
MLB 11: The Show              0.964082
MLB 12 The Show               0.970108
FIFA Soccer 12                0.973340
NBA 2K10                      0.976300
Major League Baseball 2K12    0.980191
Major League Baseball 2K11    0.980923
Name: NBA 2K12(Covers May Vary), dtype: float64
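The include/exclude filter could be wrapped in a small helper so it is easier to reuse. This is only a sketch: `find_titles` is a hypothetical name, the catalog below is made up, and the choices of case-insensitive matching and `regex=False` (so characters like `(` in a query are treated literally) are my own assumptions.

```python
import pandas as pd

def find_titles(titles, include, exclude=None):
    """Return unique titles that contain `include` and, optionally, not `exclude`."""
    mask = titles.str.contains(include, case=False, regex=False, na=False)
    if exclude:
        mask &= ~titles.str.contains(exclude, case=False, regex=False, na=False)
    return titles[mask].drop_duplicates().tolist()

# Hypothetical catalog for illustration.
catalog = pd.Series([
    "NBA 2K12(Covers May Vary)",
    "NBA 2K13 - Nintendo Wii U",
    "Age of Empires III - PC",
    "NBA 2K12(Covers May Vary)",
])

print(find_titles(catalog, "nba", exclude="Nintendo"))
print(find_titles(catalog, "Age of Empires", exclude="PC"))
```

Returning the whole list of matches, rather than taking `titles[0]`, also makes it easier to see when a broad query like "Playstation" is matching a game instead of the console.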


Not bad! But before I get too far ahead on the recommender, I need to take the building function I just wrote and apply it to the books and movies data sets.