# Recommendation system for Kickstarter

Like many other vendor web applications (though "vendor" isn't completely accurate here, as Kickstarter itself doesn't sell products), Kickstarter hosts many products that a user (a "backer") may sift through and possibly buy (more accurately, "back" or "give money to").

It's in Kickstarter's interest to show a potential backer as many products as possible, because it makes 5% of the money that funds a product. If a backer only sees a limited number of products, it is harder for them to find other products they are interested in, and Kickstarter may earn less than it could. To lower that difficulty, Kickstarter can recommend products based on what a user has clicked on in the past or, given the page a person is looking at, recommend products similar to it.

# How I'll approach building the recommendation system

The Kickstarter website does not have an API, though web scraping is a "trivial" task that would give me access to public information such as a project's title or the amount it needs, just to name a few. Here I actually used data that had already been scraped, but scraping it myself would involve fetching pages with a package like Python's ```MechanicalSoup``` and searching them for fields such as the project's ```name```, ```category```, or ```description```.

In principle I would take all of those fields (and whatever else I can find) and break the text into pieces that a text analyzer can process easily, but to keep the amount of processing relatively low, I'll only consider the ```name```, ```category```, and ```main_category```. The main idea is to vectorize each project and measure how close any two projects are via an inner product. Roughly speaking, I'll be creating recommendations by analyzing how often those fields match.

To make this a bit more concrete, suppose that there are two products whose ```main_category``` is "Food", but the ```category``` of one is "Restaurants" and the ```category``` of the other is "Drinks". These two products are not exact matches, but they still match up relatively well because of the match on ```main_category```. Now suppose that the ```name``` of the "Restaurants" product is "Monarch Espresso Bar" and the ```name``` of the other product is "Espresso Machine with Bean Grinder." You would expect those two products to be related because of the match in the ```name``` field on the word "espresso".
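To put numbers on the espresso example, here is a minimal sketch in plain Python. It is not the code used later in this notebook: `bag_of_words` and `inner_product` are hypothetical helpers, the wallet project and its category labels are made up for contrast, and each project is reduced to raw word counts with no stemming or weighting.

```python
from collections import Counter


def bag_of_words(*fields):
    """Lowercase the given text fields and count whitespace-separated words."""
    return Counter(" ".join(fields).lower().split())


def inner_product(a, b):
    """Sum of count products over the words the two bags share."""
    return sum(count * b[word] for word, count in a.items())


# name, category, main_category for each (illustrative) project
espresso_bar = bag_of_words("Monarch Espresso Bar", "Restaurants", "Food")
espresso_machine = bag_of_words("Espresso Machine with Bean Grinder", "Drinks", "Food")
wallet = bag_of_words("Carbon Fiber Wallet", "Product Design", "Design")  # made-up contrast case

print(inner_product(espresso_bar, espresso_machine))  # shared words: "espresso", "food"
print(inner_product(espresso_bar, wallet))            # no shared words
```

The two espresso projects overlap on two words, while the wallet shares none with either, so its inner product with them is zero.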

Ultimately, this should work better if I could include more data from the ```description``` field of each product, but again, in the interest of making a product that is a proof-of-concept and in the interest of not using too much processing power and memory, I'll only look at the fields I mentioned a couple paragraphs ago.

# Selecting the data

I'm not necessarily trying to predict general features from a small sample; I will just take the entire population, as if I were exploring the data a bit, and select only the fields I'm interested in (```name```, ```category```, and ```main_category```).

In [150]:
import pandas as pd

# Only load the columns used for the recommendations.
fields = ['name','category','main_category']

df1 = pd.read_csv('ks-projects-201612.csv', sep=',', header=0, encoding='latin1', usecols=fields)
df2 = pd.read_csv('ks-projects-201801.csv', sep=',', header=0, encoding='latin1', usecols=fields)

# Stack the two files; df_original keeps the untouched names for display later.
df = pd.concat([df1,df2], ignore_index=True)
df_original = df

The data here is actually a bit corrupted, and I'd rather not manually put quotation marks around the items that ought to be treated as part of ```name```. There are not that many, but with the number of rows on the order of hundreds of thousands, it would take a certain type of person to fix them by hand (as of writing this I have an idea of how to fix it, but I'd like to get the basic idea down first). If I drop the weirdly-categorized entries, I won't lose much: there are only a few hundred of them.

I had to fiddle with the filtering below because, by coincidence, some names spill the same string into the ```category``` field, so a bad label can show up more than once. The code isn't shown here, but I wrote the categories out to a text file and did a quick check by eye to make sure nothing suspicious was in there.
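As an illustration of the kind of corruption described above, the sketch below assumes the cause is a comma inside an unquoted ```name``` (my reading of the problem, not something verified against the files). The two rows are made up, and Python's built-in `csv` module stands in for `pd.read_csv`.

```python
import csv
import io

# Two made-up rows with the same project name: one quoted, one not.
raw = (
    "name,category,main_category\n"
    '"Tea, Cakes and More",Food,Food\n'
    "Tea, Cakes and More,Food,Food\n"
)

header, quoted_row, unquoted_row = list(csv.reader(io.StringIO(raw)))

# With quotes, the comma stays inside the name and the row has three fields.
print(len(quoted_row), repr(quoted_row[1]))

# Without quotes, the name is split and its tail lands in the category column.
print(len(unquoted_row), repr(unquoted_row[1]))
```

A label like the spilled one would normally be rare, which is why filtering out categories with very few rows removes most of the damaged entries.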

In [151]:
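# Count how many projects carry each category label.
category_counts = df.groupby('category')['name'].count()

# Labels seen 7 times or fewer are treated as spill-over from corrupted rows.
good_categories = category_counts[category_counts > 7].index

# Keep rows with a trusted category; .copy() gives an independent frame to modify below.
df = df[df['category'].isin(good_categories)].copy()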

Aside from fixing the categories, I want to begin normalizing the text data so that I can vectorize it. There are a few techniques that I'd like to use here:

* Stemming

* Normalizing the case of the letter

* Taking out stopwords

This only really applies to the ```name``` feature, since ```category``` and ```main_category``` are made up of discrete, pre-determined labels. In the stop words, you'll notice that I add some extra words, which come from an early analysis of the most common words. Words like "project" and "cancel" can unintentionally link unrelated projects, e.g. "Project for a surrealist film" and "A dungeons & dragons project for kids" (these aren't actual names, just something I made up on the spot to demonstrate).

In [152]:
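import re
from string import punctuation

from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# Lowercase the names; astype(str) makes sure the split/apply below never sees NaN.
df['name'] = df['name'].str.lower().astype(str)

# NLTK's English stop words plus a few words that were very common in an early pass.
stop_words = set(stopwords.words('english'))
extra_stop_words = set(['project','cancel','one'])
stop_words = stop_words.union(extra_stop_words)

# Drop stop words from each name.
df['name'] = df['name'].str.split(' ').apply(lambda x: ' '.join(w for w in x if w not in stop_words))

# Strip punctuation; regex=True is required on recent pandas, where str.replace is literal by default.
df['name'] = df['name'].str.replace('[{}]'.format(re.escape(punctuation)), '', regex=True)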


To ensure that the vector space isn't too large, I'll cull some words by reducing them to their roots. There are a couple of methods for this: lemmatizing and stemming. To summarize what these do:

* Stemmers more or less do a crude chopping of the end of a word with the hope that it can find the root word or something close to it.

* Lemmatizers try to find actual words by using the morphological rules of a language.

Stemmers are generally quicker because their algorithm more or less simply chops off part of a word, whereas lemmatizers have to work through a set of rules to morph a word back into its root form. In some cases, lemmatizers don't do their job properly: I used ```nltk```'s ```WordNetLemmatizer``` to try to find the root of "chatting," but it did not return "chat" as expected, while ```SnowballStemmer``` did, and much faster. I'd like the recommender system to be reasonably quick and I don't actually care what the true root of a word is, so I'll stick with a stemmer.

The reason I don't particularly care about the true root is that a stemmer only needs to cut related words down to the same string, even if that string isn't exactly the root word. For example, perhaps "saves" and "saving" will both be stemmed down to "sav"; they would still be recognized as coming from the same word, so the vector that originally had "saves" and the vector that originally had "saving" can have a non-zero inner product. In particular, the ```SnowballStemmer``` (also known as the "Porter2" stemmer) is generally considered to be a compromise between a more relaxed Porter stemmer and an aggressive Lancaster stemmer \[1\].

\[1\] [Answer to differences between Porter and Lancaster stemming. This answer fails to give further citations, however.](https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg)

In [4]:
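from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Stem every word of every name.
df['name'] = df['name'].str.split(' ').apply(lambda x: ' '.join(stemmer.stem(w) for w in x))

# Join all names into one long string for the word-frequency check.
text = df['name'].str.cat(sep=' ')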


In [85]:
df['name_and_categories'] = df['name'] + ' ' + df['category'] + ' ' + df['main_category']


I want to take a look at a word cloud to make sure that the corpus does, in fact, contain words that recur with varying frequency. That way, the inner product between two vectors can show measurable differences. Put more simply: suppose that each word in the corpus appeared only once. Then it would be a bit pointless to use the ```name``` feature because no two names would overlap, which is to say, in the language of linear algebra, that any two vectors with elements strictly in ```name``` would be orthogonal.

Basically, all I need to do is examine the frequency of each word in the corpus to see how common it is. Word clouds are an easy way to visualize this. For this part, I'm using Andreas Mueller's [```word_cloud``` package for creating word clouds in Python](https://github.com/amueller/word_cloud).
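Before drawing anything, the same check can be done with a plain frequency count. This is a minimal sketch using three made-up, already-cleaned names rather than the real corpus; the point is only that some words must occur more than once for any two names to overlap.

```python
from collections import Counter

# Made-up names standing in for the cleaned `name` column.
names = [
    "carbon fiber wallet",
    "carbon fiber kevlar wallet",
    "espresso machin bean grinder",
]

# Frequency of every word across the whole toy corpus.
word_counts = Counter(word for name in names for word in name.split())

# Words that occur more than once are the only ones that can link two names.
shared_words = {word: count for word, count in word_counts.items() if count > 1}

print(word_counts.most_common(5))
print(shared_words)
```

If `shared_words` came back empty, every pair of name vectors would be orthogonal and the ```name``` field would contribute nothing to the recommendations.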

In [5]:
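from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Count the 100 most frequent words, treating the whole corpus as one document
# (min_df is left at its default of 1).
cv = CountVectorizer(max_features=100)
counts = cv.fit_transform([text]).toarray().ravel()
# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() replaces it.
words = np.array(cv.get_feature_names_out())

# Scale so that the most frequent word has weight 1.
counts = counts / float(counts.max())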

In [6]:
from wordcloud import WordCloud

wordcloud = WordCloud(max_font_size=50).generate(text)

import matplotlib.pyplot as plt

plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: word cloud of the stemmed project names]

I think "cancel" is a word that might accidentally cause interactions between unrelated projects. I thought I had filtered it out above, but it still shows up, so perhaps I need to filter it after the stemming. In any case, this seems like a good starting point. Hopefully, with the more dominant words in the word cloud, I can get some on-point recommendations.
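One possible explanation is the order of the cleaning steps: a token such as "(canceled)" does not equal the stop word "cancel" until punctuation is stripped and the word is stemmed. The sketch below illustrates this under loud assumptions: the project name is made up, `toy_stem` is a crude suffix-stripper standing in for ```SnowballStemmer```, and punctuation is only stripped from the ends of each word.

```python
from string import punctuation

STOP_WORDS = {"project", "cancel", "one"}  # the extra stop words from earlier


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def filter_then_clean(name):
    """Drop stop words first, then strip punctuation and stem."""
    kept = [w for w in name.lower().split() if w not in STOP_WORDS]
    cleaned = [w.strip(punctuation) for w in kept]
    return [toy_stem(w) for w in cleaned if w]


def clean_then_filter(name):
    """Strip punctuation and stem first, then drop stop words."""
    cleaned = [w.strip(punctuation) for w in name.lower().split()]
    stems = [toy_stem(w) for w in cleaned if w]
    return [s for s in stems if s not in STOP_WORDS]


name = "Espresso Machine Project (Canceled)"  # hypothetical project name
print(filter_then_clean(name))  # "cancel" survives this order
print(clean_then_filter(name))  # "cancel" is removed in this order
```

If this is what is happening, applying the stop-word filter once more after stemming should be enough to remove the word.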

In [153]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer().fit_transform(df['name'])

from sklearn.metrics.pairwise import linear_kernel

In [286]:
given_index = 1000

# Look up the query project, then restrict the candidates to its main category.
name_of_item = df.loc[given_index, 'name']
main_category = df.loc[given_index, 'main_category']
main_category_df = df[df['main_category'] == main_category].reset_index(drop=True)
# Position of the query project in the restricted frame (first match on the cleaned name).
item_index = main_category_df[main_category_df['name'] == name_of_item].index[0]

# TF-IDF rows are L2-normalised by default, so the linear kernel equals cosine similarity.
tv = TfidfVectorizer(analyzer='word').fit_transform(main_category_df['name'])

cosine_similarities = linear_kernel(tv[item_index], tv).flatten()
# Indices of the 9 highest scores, best first; the query project itself is among them.
related_projects_indices = cosine_similarities.argsort()[:-10:-1]
related_projects_scores = cosine_similarities[related_projects_indices]

print(df_original.loc[given_index, 'name'])
print(df_original.loc[given_index, 'main_category'])
print()

for index in related_projects_indices:
    print(main_category_df.loc[index, 'name'])

NERO: RFID Blocking Wallet Leather or Kevlar + Carbon Fiber
Design

nero rfid blocking wallet leather kevlar  carbon fiber
nero rfid blocking wallet leather kevlar  carbon fiber
carbon fiber  kevlar wallet
carbon fiber  kevlar wallet
carbon fiber wallet
carbon fiber wallet
ultimate rfid blocking minimalist wallet
ultimate rfid blocking minimalist wallet
w1 wallet rfid blocking
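To make the scoring step less of a black box, here is a small pure-Python sketch of TF-IDF followed by a dot product. It is a simplified illustration rather than scikit-learn's implementation: `tfidf_vectors` and `dot` are hypothetical helpers, the weighting uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is my understanding of the ```TfidfVectorizer``` default, and the four names are a made-up mini corpus.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one L2-normalised {word: weight} dict per document."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors.append({word: v / norm for word, v in weights.items()})
    return vectors


def dot(a, b):
    """Inner product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(v * b.get(word, 0.0) for word, v in a.items())


names = [
    "nero rfid blocking wallet leather kevlar carbon fiber",
    "carbon fiber kevlar wallet",
    "carbon fiber wallet",
    "wooden chess set",
]
vectors = tfidf_vectors(names)

# Score every name against the first one and rank from most to least similar.
scores = [dot(vectors[0], v) for v in vectors]
ranking = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(round(scores[i], 3), names[i])
```

Because every vector has unit length, the query scores 1 against itself, which is why the query project tops its own recommendation list above. A name with no shared words scores exactly 0.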


Good/interesting ones:

* \#20 (Food) Mountain brew

* \#60 (Crafts) "Flying" Carpets

Something weird here: a lot of the food projects seem to attract beer projects (cf. \#101).