# Coffee Trading Rec System / Exercise

Hamel Husain

hamel.husain@gmail.com

I did not have time to fully implement these functions exactly as specified in the challenge.  In the interest of time, I skipped the following items:

1)  I did not use the framework and the class you provided

2)  I did not implement these functions as command line tools

3)  I built a recommendation system, but it does not predict a rating.  It is a top-K recommender based on item-item collaborative filtering.  To predict a rating, I would have taken the similarity-weighted average of the ratings the user has already given to items similar to the product we want to predict.
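
The weighted-average prediction described in item 3 was not implemented in this notebook. A minimal sketch of the idea follows, assuming that a rating of 0 means "not rated" and that similarities are non-negative; `predict_rating` and its arguments are hypothetical names used only for illustration.

```python
def predict_rating(user_ratings, sims_to_target):
    """Similarity-weighted average of the ratings one user has already given.

    user_ratings   -- the user's rating for each item (0 = not rated, an assumption)
    sims_to_target -- similarity of each of those items to the item being predicted
    """
    # Keep only the items this user actually rated
    pairs = [(r, s) for r, s in zip(user_ratings, sims_to_target) if r > 0]
    total_weight = sum(s for _, s in pairs)
    if total_weight == 0:
        # No rated item is similar to the target, so no prediction is possible
        return None
    return sum(r * s for r, s in pairs) / total_weight


# Made-up numbers: the user rated items 0 and 2, and both are equally
# similar to the target item, so the prediction is their plain average.
predicted = predict_rating([4, 0, 2], [0.5, 0.9, 0.5])
print(predicted)
```

Items the user has not rated contribute nothing, however similar they are to the target.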

### Import Stuff

In [15]:
import pandas as pd
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
#The handy dictionary you provided!
COUNTRIES = {
    "Balinese": "Bali",
    "Bolivian": "Bolivia",
    "Brazilian": "Brazil",
    "Costa Rican": "Costa Rican",
    "Dominican": "Dominican Republic",
    "Salvadorean": "El Salvador",
    "Ethiopian": "Ethiopia",
    "Guatemalan": "Guatemala",
    "Indian": "India",
    "Kenyan": "Kenya",
    "Malian": "Mali",
    "Mexican": "Mexico",
    "Panamanian": "Panama",
    "Peruvian": "Peru",
    "Sumatran": "Sumatra",
}

### Put Headers On The File

In [16]:
!echo 'ID\tName\tRating' > header.txt
!cat header.txt coffee_ratings.txt > new_ratings.txt
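
An alternative that avoids the shell step, sketched under the assumption that `coffee_ratings.txt` is tab-separated with no header row: `pd.read_csv` can assign the column names itself through its `names` argument. The inline sample below stands in for the real file.

```python
import io

import pandas as pd

# Two sample rows standing in for coffee_ratings.txt (tab-separated, no header)
sample = (
    "191\tDecaf Cuzcachapa Kenyan\t4\n"
    "16\tOrganic Fair Trade Decaf Paradise Valley Kenyan\t3\n"
)

# names= supplies the header, so no header.txt / cat step is needed
df_sample = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["ID", "Name", "Rating"],
    index_col=0,
)
print(df_sample)
```

With the real file, the `io.StringIO(sample)` argument would be replaced by the path to `coffee_ratings.txt`.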

### Read in the file

In [17]:
df = pd.read_csv('new_ratings.txt', sep = '\t', index_col=0)
df.head()

ID,Name,Rating
191,Decaf Cuzcachapa Kenyan,4
16,Organic Fair Trade Decaf Paradise Valley Kenyan,3
163,Fair Trade Decaf Honey Burst Brazilian,3
118,Organic Fair Trade Decaf Supremo Indian,5
149,Decaf Swiss Water Guatemalan,3


### Define Parse Function

In [18]:
def parse(name):
    """Split a coffee name into (Decaf, Organic, Fair_trade, Adjective, Country)."""
    lowered = name.lower()
    #Look for key words and assign to variables
    Decaf, Organic, Fair_trade = 'decaf' in lowered, 'organic' in lowered, 'fair trade' in lowered
    #Get the keys from the provided country dict that appear in the name
    matched = [x for x in COUNTRIES.keys() if x in name]
    #Map the matched keys to country names
    Country = ''.join([COUNTRIES[x] for x in matched])
    #Strip the key words and matched keys from the name; what remains is the adjective
    re_text = '|'.join(['decaf', 'organic', 'fair trade'] + [x.lower() for x in matched])
    Adjective = re.sub(re_text, '', lowered).strip()
    
    return Decaf, Organic, Fair_trade, Adjective, Country
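
As a quick, self-contained check of the parsing idea, here is a sketch of a variant that also collapses leftover whitespace and takes only the first matching country. It uses a shortened copy of the country mapping, and `parse_name` and `KEYWORDS` are illustrative names, not part of the challenge framework.

```python
import re

# Shortened copy of the COUNTRIES mapping, for illustration only
COUNTRIES = {"Kenyan": "Kenya", "Brazilian": "Brazil", "Indian": "India"}
KEYWORDS = ["decaf", "organic", "fair trade"]


def parse_name(name):
    """Return (decaf, organic, fair_trade, adjective, country) for a coffee name."""
    lowered = name.lower()
    flags = tuple(k in lowered for k in KEYWORDS)
    # Keys from the mapping that appear in the name
    matched = [k for k in COUNTRIES if k.lower() in lowered]
    country = COUNTRIES[matched[0]] if matched else ""
    # Remove keywords and matched keys, then collapse the leftover whitespace
    pattern = "|".join(re.escape(t) for t in KEYWORDS + [k.lower() for k in matched])
    adjective = " ".join(re.sub(pattern, " ", lowered).split())
    return flags + (adjective, country)


print(parse_name("Organic Fair Trade Decaf Paradise Valley Kenyan"))
```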

### Use Parser To Add Elements To A DataFrame For Sanity Check

In [19]:
dftemp = df.copy()
dftemp[['Decaf', 'Organic', 'Fair_trade', 'Adjective', 'Country']] = dftemp.Name.apply(parse).apply(pd.Series)
dftemp.head()

ID,Name,Rating,Decaf,Organic,Fair_trade,Adjective,Country
191,Decaf Cuzcachapa Kenyan,4,True,False,False,cuzcachapa,Kenya
16,Organic Fair Trade Decaf Paradise Valley Kenyan,3,True,True,True,paradise valley,Kenya
163,Fair Trade Decaf Honey Burst Brazilian,3,True,False,True,honey burst,Brazil
118,Organic Fair Trade Decaf Supremo Indian,5,True,True,True,supremo,India
149,Decaf Swiss Water Guatemalan,3,True,False,False,swiss water,Guatemala


### Summary of Data

In [20]:
def summary(df):
    """Print the totals and the value counts of each parsed column. Takes a dataframe, not a file."""
    #len(df) counts rows, and each row of the file is one rating
    print('Total Ratings: %s' % len(df))
    print('Total Coffee Types: %s' % len(df.Name.unique()))
    for x in [col for col in df.columns if col not in ['Name', 'Rating']]:
        print('\nSummary of %s \n---------------------' % x)
        #Get the counts
        srs = df[x].value_counts()
        #Print out the counts
        for n,v in zip(srs.index, srs):
            print(n, ' ', v)

summary(dftemp)

Total Ratings: 20000
Total Coffee Types: 189

Summary of Decaf 
---------------------
True   10716
False   9284

Summary of Organic 
---------------------
True   10556
False   9444

Summary of Fair_trade 
---------------------
False   11794
True   8206

Summary of Adjective 
---------------------
supremo   1881
paradise valley   1877
swiss water   1873
honey burst   1737
peaberry   1718
caturra   1617
sidamo   1595
cuzcachapa   1558
aa   1486
black satin   1342
reserve   1226
longberry   1180
mandheling   909
daysiss water   1

Summary of Country 
---------------------
India   1774
Bolivia   1613
Brazil   1612
El Salvador   1597
Bali   1592
Panama   1574
Peru   1475
Sumatra   1475
Dominican Republic   1337
Kenya   1235
Guatemala   1235
Mali   1163
Costa Rican   1030
Ethiopia   888
Mexico   400


## Build Features For Rec Engine - Collaborative Filtering

I thought about using the features that we created in the last exercise, but as a first pass I want to see how a collaborative filtering approach will do rather than a content-based method, especially because we only extracted very simple features for each coffee and people can have very complex taste profiles.  So I'm going to take a purely collaborative filtering approach here.

In [21]:
scores = pd.pivot_table(data = df.reset_index(), values=["Rating"], 
               columns = ['ID'], index = ['Name'], aggfunc='mean').fillna(0)
scores.columns = scores.columns.droplevel()
#scores.reset_index(inplace=True)
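
To make the shape of this matrix concrete, here is a small sketch on made-up ratings (the `toy` frame is illustrative, not taken from the data set). It shows the two things the pivot does: repeated ratings of the same coffee by the same person are averaged, and coffees a person never rated are filled with 0.

```python
import pandas as pd

# Made-up ratings: person 1 rated 'AA Ethiopian' twice, person 2 never rated it
toy = pd.DataFrame({
    "ID": [0, 0, 1, 1, 1, 2],
    "Name": ["AA Ethiopian", "Supremo Kenyan", "AA Ethiopian",
             "AA Ethiopian", "Supremo Kenyan", "Supremo Kenyan"],
    "Rating": [2, 5, 1, 3, 4, 3],
})

# One row per coffee, one column per person.
# Passing values as a plain string keeps the columns single-level.
toy_scores = toy.pivot_table(
    values="Rating", index="Name", columns="ID", aggfunc="mean"
).fillna(0)
print(toy_scores)
```

Filling with 0 means "not rated" is treated like a very low score, which is a simplification worth keeping in mind when reading the similarities.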

In [22]:
#scores[['Decaf', 'Organic', 'Fair_trade', 'Adjective', 'Country']] = scores['Name'].apply(parse).apply(pd.Series)
scores.head()

Name \ ID,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
AA Ethiopian,2,1,0,1,1,0,1,1,1,0,...,0,4,0,4,0,0,2,2,0,1
Black Satin Indian,0,0,1,0,0,0,0,1,0,0,...,0,4,0,4,0,1,2,0,0,1
Caturra Guatemalan,0,0,0,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,2,0,0
Cuzcachapa Balinese,0,0,1,0,0,1,1,0,0,1,...,0,0,0,0,0,1,2,2,1,0
Cuzcachapa Costa Rican,2,1,0,1,0,0,0,1,1,0,...,0,0,2,4,0,0,2,2,0,0


### Calculate & Build Pairwise Similarity Matrix

In [23]:
dists = cosine_similarity(scores)
dists = pd.DataFrame(dists, columns=scores.index)
dists.index = dists.columns
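
Despite the variable name `dists`, the matrix holds similarities: values near 1 mean two coffees were rated alike by the same people, and 0 means their rating vectors share nothing. A plain-Python sketch of the cosine formula on made-up rating vectors follows; returning 0.0 for an all-zero vector is a choice made here for the sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Choice made for this sketch: an unrated coffee is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


# Each vector is one coffee's ratings across four people (0 = not rated)
a = [2, 1, 0, 1]
b = [0, 0, 1, 0]  # rated only by a person who did not rate a
c = [4, 2, 0, 2]  # same pattern as a, with every rating doubled

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
print(sim_ab, sim_ac)
```

Because the measure depends only on direction, `a` and `c` come out as essentially identical even though the ratings in `c` are twice as high.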

In [24]:
dists.head()

Name,AA Ethiopian,Black Satin Indian,Caturra Guatemalan,Cuzcachapa Balinese,Cuzcachapa Costa Rican,Cuzcachapa Sumatran,Decaf AA Ethiopian,Decaf AA Guatemalan,Decaf Black Satin Balinese,Decaf Black Satin Bolivian,...,Peaberry Indian,Peaberry Salvadorean,Reserve Balinese,Reserve Salvadorean,Sidamo Panamanian,Supremo Brazilian,Supremo Kenyan,Supremo Salvadorean,Swiss Water Dominican,daysiss Water Salvadorean
AA Ethiopian,1.0,0.42186,0.32383,0.270808,0.462563,0.366947,0.358424,0.422221,0.300824,0.394914,...,0.396393,0.381742,0.281906,0.406597,0.358838,0.316956,0.405759,0.383367,0.300438,0.0
Black Satin Indian,0.42186,1.0,0.373754,0.372577,0.40345,0.476225,0.314837,0.432908,0.218132,0.314309,...,0.465158,0.424165,0.460635,0.460391,0.505981,0.437504,0.517713,0.445092,0.452405,0.0
Caturra Guatemalan,0.32383,0.373754,1.0,0.283851,0.395258,0.264893,0.33827,0.479878,0.286577,0.27723,...,0.393488,0.246502,0.374747,0.374528,0.345906,0.435384,0.413619,0.286201,0.424627,0.065653
Cuzcachapa Balinese,0.270808,0.372577,0.283851,1.0,0.203554,0.432824,0.419807,0.418791,0.320408,0.378721,...,0.445391,0.38835,0.423934,0.293018,0.487053,0.550915,0.449466,0.254604,0.334487,0.0
Cuzcachapa Costa Rican,0.462563,0.40345,0.395258,0.203554,1.0,0.360105,0.37193,0.452338,0.247916,0.296006,...,0.401738,0.37803,0.405239,0.585091,0.342881,0.349745,0.34902,0.40935,0.440203,0.056796


## Build simple rec engine function

In [25]:
def get_similar(coffee, n=None):
    """
    ranks coffees by their summed similarity to the ones provided. 
    
    Parameters
    ----------
    coffee: list
        coffee names to match against (must be a list; a bare string is
        iterated character by character, so no coffee names match)
    n: int, optional
        number of coffees to return (default: all of them)
    
    Returns
    -------
    ranked_coffee: list
        rank ordered coffee, most similar first
    """
    #keep only the requested coffees that are columns in the similarity matrix
    cof = [c for c in coffee if c in dists.columns]
    #sum the similarities over the coffees that we selected (columns in the matrix)
    cof_summed = dists[cof].apply(lambda row: np.sum(row), axis=1)
    #Cosine similarity: sort in descending order because 1 means perfectly similar
    cof_summed = cof_summed.sort_values(ascending=False)
    #Get rid of the coffees that we queried for
    ranked_cof = cof_summed.index[~cof_summed.index.isin(cof)]
    #Transform into a list
    ranked_cof = ranked_cof.tolist()
    if n is None:
        return ranked_cof
    else:
        return ranked_cof[:n]

## Make Recommendations

Recommendations if you like Black Satin Indian

In [26]:
get_similar(['Black Satin Indian'], 5)

['Organic Longberry Bolivian',
 'Supremo Kenyan',
 'Organic Caturra Indian',
 'Organic Supremo Peruvian',
 'Sidamo Panamanian']

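Recommendations if you like Decaf AA Guatemalan. Note that this call passes a bare string instead of a list, so `get_similar` iterates over its characters, matches no columns, and returns the first coffees in index order (including the queried coffee itself) rather than real recommendations. Pass `['Decaf AA Guatemalan']` instead.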

In [27]:
get_similar('Decaf AA Guatemalan', 10)

['AA Ethiopian',
 'Black Satin Indian',
 'Caturra Guatemalan',
 'Cuzcachapa Balinese',
 'Cuzcachapa Costa Rican',
 'Cuzcachapa Sumatran',
 'Decaf AA Ethiopian',
 'Decaf AA Guatemalan',
 'Decaf Black Satin Balinese',
 'Decaf Black Satin Bolivian']

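Recommendations if you like Cuzcachapa Costa Rican. This call also passes a bare string, so the list below is just the first five coffees in index order (the queried coffee among them), not a real recommendation. Pass `['Cuzcachapa Costa Rican']` instead.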

In [28]:
get_similar('Cuzcachapa Costa Rican', 5)

['AA Ethiopian',
 'Black Satin Indian',
 'Caturra Guatemalan',
 'Cuzcachapa Balinese',
 'Cuzcachapa Costa Rican']
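
One way to guard against the bare-string calls is to wrap a single name in a list before matching. The sketch below assumes Python 3 and uses a tiny made-up similarity matrix in place of `dists`; `get_similar_safe` is a hypothetical name, and raising `KeyError` for unknown coffees is a design choice of the sketch, not behaviour of the original function.

```python
import pandas as pd

# Toy similarity matrix standing in for `dists`; the values are made up
names = ["AA Ethiopian", "Black Satin Indian", "Supremo Kenyan"]
toy_dists = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.4],
     [0.7, 0.4, 1.0]],
    index=names, columns=names,
)


def get_similar_safe(coffee, sims, n=None):
    """Rank coffees by summed similarity to `coffee` (a name or a list of names)."""
    # Wrap a single name so it is not iterated character by character
    if isinstance(coffee, str):
        coffee = [coffee]
    known = [c for c in coffee if c in sims.columns]
    if not known:
        raise KeyError("none of the requested coffees are in the matrix")
    # Sum similarities across the queried coffees, most similar first
    summed = sims[known].sum(axis=1).sort_values(ascending=False)
    # Drop the queried coffees themselves from the ranking
    ranked = summed.drop(labels=known).index.tolist()
    return ranked if n is None else ranked[:n]


from_string = get_similar_safe("AA Ethiopian", toy_dists)
from_list = get_similar_safe(["AA Ethiopian"], toy_dists, 1)
print(from_string, from_list)
```

Failing loudly on an unknown name avoids silently returning an arbitrary ordering.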