# Wine Recommendation System 

This notebook contains all the necessary code to fit, train and run the wine recommendation system. It is split into three parts: 
    
    Part 1. Helper Functions 
        - Preprocessing Text 
        - Preprocessing Data 
        - Vectorize Text 
    
    Part 2. Function that makes the recommendations. Takes an input text string, vectorizes it and prints information for the 5 wines with the most similar vector representations. 
    
    Part 3. Runs the recommendation function using user-supplied strings.

## Part 1: Helper Functions 

### Text Preprocessing Function 

In [1]:
def preprocess_text(text: str): 

    '''
    Take a text string and return the preprocessed version of that string. 
    Preprocessing steps: 
        Removing stop words.
        Removing punctuation (and any other non-alphabetic tokens).
        Lemmatizing each word using WordNet through NLTK. 
        Making the string all lowercase letters.
    '''

    # Import nlp library 
    import nltk
    from nltk.corpus   import stopwords 
    from nltk.tokenize import wordpunct_tokenize
    from nltk.stem import WordNetLemmatizer

    # set stop words to english 
    engStop = set(stopwords.words('english'))

    # Initiate list of preprocessed tokens
    prep_text = []
    # Tokenize the text 
    tok = wordpunct_tokenize(text)

    # Initiate the lemmatizer 
    lemmatizer = WordNetLemmatizer()

    # Loop to preprocess and append tokens; they are joined into the processed string below 
    for t in tok:
        # Lowercase first: NLTK's stop word list is all lowercase
        t = t.lower()
        if t.isalpha() and t not in engStop: 
            lem_t = lemmatizer.lemmatize(t)
            prep_text.append(lem_t)

    # Return processed string 
    return ' '.join(prep_text)
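
NLTK's English stop word list is all lowercase, so whether a capitalized token such as "The" is filtered depends on lowercasing it before the membership check. A minimal sketch of just the filtering step, assuming a tiny hand-written stop list (`STOP`) and a regex tokenizer in place of NLTK, and with no lemmatization; `filter_tokens` is a hypothetical helper, not part of the notebook:

```python
import re

# Hypothetical, tiny stop list for illustration only; the notebook uses
# NLTK's English stop words instead.
STOP = {"a", "the", "that", "like", "is"}

def filter_tokens(text):
    """Keep alphabetic tokens, lowercased, that are not stop words."""
    # Regex tokenization stands in for wordpunct_tokenize + isalpha()
    tokens = re.findall(r"[A-Za-z]+", text)
    # Lowercase before the stop word check so "The" is treated like "the"
    return [t.lower() for t in tokens if t.lower() not in STOP]

print(filter_tokens("The wine tastes like blueberries."))
```

With this stop list, the call prints `['wine', 'tastes', 'blueberries']`.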

### Data Preprocessing Function 

In [2]:
def group_and_prep_wines(data_dir, overwrite = False): 

    '''
    Function to preprocess the wine data using the raw data directory (raw file is split in two). 
    This function: 
        1. Loads and concatenates raw wine review data.
        2. Drops duplicate reviews.
        3. Groups the wine into unique type and winery groups.
            These groups represent wine varieties and are the unit of recommendation.
        4. Preprocess the review text. 
        5. Save/Load the preprocessed data. 
    '''

    # Import packages 
    import os 
    from glob import glob
    import pandas as pd
    import numpy as np

    # Make directory to store processed data (including any missing parent directories) 
    processed_dir = os.path.join(os.getcwd(), 'data', 'processed')
    os.makedirs(processed_dir, exist_ok=True)

    # Set processed data file name 
    processed_fn = os.path.join(processed_dir, 'wine_groups.csv')

    # Rebuild the processed data if overwrite is requested or the file does not exist yet     
    if overwrite is True or not os.path.exists(processed_fn): 
    
        # Get file names from the raw data directory passed in 
        fnames = glob(os.path.join(data_dir, '*.csv'))
    
        # Concatenate review data
        wine = pd.concat([pd.read_csv(f, header=0) for f in fnames], 
                         ignore_index=True)
    
        # Drop duplicated reviews and the unneeded first column
        wine = wine.drop_duplicates(subset=['description']).drop(wine.columns[0], axis=1)
    
        # Fill NaNs in string columns with empty strings
        # Here, a missing value is meaningful
        for col in ['designation', 'variety', 'country']:
            wine[col] = wine[col].fillna(value = '')
    
        # Helper function to add extracted type variable to df 
        def winetype_col_maker(row): 
            # Uses variety/designation if the other is unavailable 
            if row.variety == '': return row.designation.lower()
            elif row.designation == '': return row.variety.lower()
            # If both are available: combine them and alphabetize the words for comparability 
            else:
                s = str(row.designation) + ' ' + str(row.variety)
                w = [w.lower() for w in s.split(' ')]
                if 'the' in w: w.remove('the')
                w.sort()
                return ' '.join(w)
        # add type column to df 
        wine['winetype'] = wine.apply(winetype_col_maker, axis=1)

        # Collect column names for grouped data 
        wine_sub = wine.drop(columns=['region_1', 'region_2', 'province', 'taster_name', 
                                      'taster_twitter_handle', 'title'])
        group_cols = wine_sub.columns.tolist()
        group_cols.remove('winetype')

        # Iterate over groups to save group attributes as rows
        from collections import Counter
        rows = []
        for name, group in wine.groupby(['winery', 'winetype']):  
            row = []
            # Simply add rows for wines with only one review for that type 
            if len(group) == 1:
                for col in group_cols:
                    if col == 'description': 
                        row.append(list(group[col]))
                    else: 
                        row.append(group[col].iloc[0])
            # Combine attributes for groups 
            else: 
                for col in group_cols: 
                    # Create list of reviews
                    if col == 'description': 
                        row.append(list(group[col]))
                    # Average numeric attributes 
                    elif np.issubdtype(group[col].dtype, np.number): 
                        row.append(round(group[col].mean(), 2))
                    # Save the most common for all other string attributes 
                    else: 
                        row.append(Counter(group[col].tolist()).most_common(1)[0][0])
            rows.append(row)
        groups = pd.DataFrame(data=rows, columns = group_cols)

        # Concatenate list of reviews to single string 
        groups['full_desc'] = groups.description.apply(lambda x: ' '.join(x))
        # Preprocess and lemmatize the reviews 
        groups['lemDesc'] = groups['full_desc'].apply(lambda x: preprocess_text(x))
        # Save the length of the lemmatized review 
        groups['lemLen'] = groups['lemDesc'].apply(lambda x: len(x.split(' ')))

        # Remove groups whose reviews contain only importer information 
        filt_idx = [idx for idx, r in groups.iterrows() if r.lemLen < 10 and 'imported' in r.lemDesc] 
        # Drop the filtered rows and reset the index  
        groups = groups.drop(filt_idx, axis=0).reset_index(drop=True)

        # Save preprocessed data 
        groups.to_csv(processed_fn, header=True, index=False)

    else: 
        # Load preprocessed data; the list-valued 'description' column comes back as a string 
        groups = pd.read_csv(processed_fn, header=0)

    return groups
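
The grouping logic above can be illustrated without pandas. This is a simplified sketch under stated assumptions: the three reviews are hypothetical, `winetype_key` is an illustrative stand-in for `winetype_col_maker` (it drops every "the" rather than only the first), and only two attributes are aggregated.

```python
from collections import Counter
from statistics import mean

def winetype_key(designation, variety):
    """Build an order-independent, lowercase key from designation and variety."""
    words = f"{designation} {variety}".lower().split()
    words = [w for w in words if w != "the"]
    return " ".join(sorted(words))

# Hypothetical reviews for one winery (not taken from the dataset)
reviews = [
    {"designation": "Reserve", "variety": "Pinot Noir", "points": 90, "country": "US"},
    {"designation": "Pinot Noir", "variety": "Reserve", "points": 88, "country": "US"},
    {"designation": "", "variety": "Chardonnay", "points": 85, "country": "France"},
]

# Group reviews that share the same key
groups = {}
for r in reviews:
    groups.setdefault(winetype_key(r["designation"], r["variety"]), []).append(r)

# Average numeric attributes, keep the most common value for string attributes
summary = {
    key: {
        "points": round(mean(r["points"] for r in rows), 2),
        "country": Counter(r["country"] for r in rows).most_common(1)[0][0],
        "n_reviews": len(rows),
    }
    for key, rows in groups.items()
}
print(summary)
```

Because the words are sorted, "Reserve" + "Pinot Noir" and "Pinot Noir" + "Reserve" both map to the key `noir pinot reserve` and land in the same group.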

### Helper Function to Vectorize Text using TF-IDF 

In [3]:
def vectorize_text(text:str):

    '''
    Takes a string and vectorizes it using a vectorizer trained on the preprocessed wine review data.
    '''

    # Import packages (os is needed to build the raw data path below) 
    import os
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Preprocess the text 
    prepped_text = preprocess_text(text)
    # Load/Create grouped data 
    groups = group_and_prep_wines(os.path.join(os.getcwd(), 'data', 'raw'), overwrite=False)

    # Initiate vectorizer and train it on the preprocessed data 
    tfidf_vec = TfidfVectorizer(max_df = 1.0, min_df = 1) 
    tfidf_vec.fit(groups.lemDesc)

    # Vectorize the preprocessed string 
    X = tfidf_vec.transform([prepped_text])

    # Return the vector representation of the string
    return X
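
To make the fit/transform split concrete, here is a minimal sketch on a hypothetical three-review corpus standing in for `groups.lemDesc`. The vocabulary is fixed when the vectorizer is fitted, so query words that never appear in the reviews (here the made-up word "unicorn") contribute nothing to the query vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed review strings (stand-ins for groups.lemDesc)
corpus = [
    "sweet jammy blueberry blackberry",
    "dry crisp citrus mineral",
    "oak vanilla cherry tannin",
]

# Fit once on the corpus: this learns the vocabulary and the IDF weights
vec = TfidfVectorizer()
vec.fit(corpus)

# Transform a query with the fitted vectorizer; the result is a sparse row vector
query = vec.transform(["sweet blueberry unicorn"])

print(query.shape)                  # (1, number of vocabulary terms)
print(query.nnz)                    # number of query terms found in the vocabulary
print("unicorn" in vec.vocabulary_)
```

Note that `vectorize_text` refits the vectorizer on every call. Fitting once and reusing the fitted object, as in this sketch, is one way to avoid repeating that work.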

## Part 2: Defining KNN Model and Recommendation Algorithm 

In [4]:
def make_recs_desc(): 

    '''
    Prompt the user for a text description and print the top 5 recommended wines. 
    '''

    # Import Packages
    from sklearn.neighbors import NearestNeighbors
    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np 
    import pandas as pd
    from ast import literal_eval
    import os

    # Prompt user for description of wine 
    des = input("Please describe the wine you're interested in finding: ")

    # Preprocess the user text input 
    prepped = preprocess_text(des)

    # Load the preprocessed wine data 
    groups = group_and_prep_wines(os.path.join(os.getcwd(), 'data', 'raw'), overwrite=False)

    # Train TF-IDF on wine reviews (TfidfVectorizer is imported above) 
    tfidf_vec = TfidfVectorizer(max_df = 1.0, min_df = 1) 
    tfidf_vec.fit(groups.lemDesc)

    # Vectorize all reviews and the input string 
    X_full = tfidf_vec.transform(groups.lemDesc)
    X_new = tfidf_vec.transform([prepped])

    # Fit KNN Model on wine review vectors
    knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors = 16, n_jobs=-1)
    nbrs = knn_model.fit(X_full)

    # Get the distances and index of the closest wines 
    new_dis, new_idx = nbrs.kneighbors(X_new)
    # Get DataFrame of top distances and wine ids 
    new_df = pd.DataFrame(data = zip(list(new_dis[0]), list(new_idx[0])), columns=['dis', 'idx'])
    # Extract the wine information from the preprocessed data, add them to new df sorted by distance 
    top_hits = groups.iloc[new_df.sort_values('dis').idx].reset_index()

    # The description is a list on a fresh build but a string after a CSV round trip; parse only strings
    top_hits['description'] = top_hits['description'].apply(lambda x: literal_eval(x) if isinstance(x, str) else x)

    # Finally, print the top 5 recommended wines 
    print(f'\nTop Recommendations based on description: {des}') 

    rec_info = ['country', 'winery', 'variety', 'designation', 'price', 'points']
    
    for i, row in top_hits[:5].iterrows(): 
        info_dict = { v : (row[v] if not pd.isna(row[v]) else 'Unavailable') for v in rec_info }
        # Use the longest review as the sample description
        info_dict['description'] = max(row['description'], key=len)
        # Format the price only when it is available
        price = info_dict['price']
        price_str = price if isinstance(price, str) else f'${int(round(price))}'
        # Join the sentences outside the f-string: a backslash inside an
        # f-string expression is a SyntaxError before Python 3.12
        desc_lines = '\n\t\t'.join([s + '.' for s in info_dict['description'].split('.')[:-1]])
        print(f'''
        Recommendation {i + 1}:
        
            Winery: {info_dict['winery']}
            Variety: {info_dict['variety']} 
            Designation: {info_dict['designation']}
            Country: {info_dict['country']}
            Score: {int(info_dict['points'])}
            Approximate Price (per 750 mL): {price_str}

            Sample Expert Description: 
                 {desc_lines}
        ''')
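
The core of the recommendation step is a TF-IDF representation paired with a brute-force cosine nearest-neighbor search. A minimal sketch on a hypothetical four-review corpus; the strings and `n_neighbors=2` are illustrative assumptions, not the notebook's data or settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical, already-preprocessed reviews (stand-ins for groups.lemDesc)
corpus = [
    "sweet jammy blueberry blackberry",
    "dry crisp citrus mineral",
    "oak vanilla cherry tannin",
    "sweet honey apricot",
]

# Vectorize the reviews and index them for cosine-distance lookups
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
knn.fit(X)

# Vectorize the query with the same fitted vectorizer, then look up neighbors
dist, idx = knn.kneighbors(vec.transform(["sweet blueberry"]))

for d, i in zip(dist[0], idx[0]):
    print(f"{d:.3f}  {corpus[i]}")
```

`kneighbors` returns neighbors in order of increasing distance, so the first printed review is the closest match. Here that is the review sharing both "sweet" and "blueberry", followed by the one sharing only "sweet".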

## Part 3: Recommendation System 

In [5]:
make_recs_desc()

Please describe the wine you're interested in finding:  A sweet wine that tastes like blueberries.



Top Recommendations based on description: A sweet wine that tastes like blueberries.

        Recommendation 1:
        
            Winery: Francis Ford Coppola
            Variety: Syrah 
            Designation: Diamond Collection Green Label
            Country: US
            Score: 84
            Approximate Price (per 750 mL): $16

            Sample Expert Description: 
                 A big, jammy, acidic, and slightly sweet wine for washing down barbecue.
		 The blackberry, cherry and blueberry flavors taste like they were baked into a pie.
        

        Recommendation 2:
        
            Winery: Cambridge & Sunset
            Variety: Zinfandel 
            Designation: California Series
            Country: US
            Score: 88
            Approximate Price (per 750 mL): $14

            Sample Expert Description: 
                 An especially tasty fruit character sets this full-bodied wine apart from the pack.
		 It smells like black currants and tastes li