![image wine](https://images.unsplash.com/photo-1516594915697-87eb3b1c14ea?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=750&q=80)

# WINE REVIEW - EXPLORATORY DATA ANALYSIS

# Background
Wine has always been a topic of discussion among people. The wine industry is blossoming and expanding, and buyers want products that are tried and tested. In this project we will try to find some answers.

## Some important questions

1. Which is the most reviewed country and most reviewed variety?
2. Is there any relationship between price and points received?
3. What are some characteristics of wine, country wise?
4. What are some common terms appearing in the lowest rated and highest rated wine?
5. Can we create a recommender system?
6. What should a company keep in mind to get good reviews?
7. Is there any relationship between points and any other attributes?
8. Which variety of grape will be best to make wine?
9. Which reviewer is the most useful to refer to?
10. Should we go with what's common or with something less common, in terms of variety?

We will do exploratory data analysis of wine reviews and try to build a recommender system for the same.<br> The dataset is taken from [here](https://github.com/RoaldSchuring/wine_recommender). <br> The data is scraped from a famous wine magazine named [Winemag](https://www.winemag.com/).

Starting with importing the libraries

In [None]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud,STOPWORDS
import seaborn as sns
from scipy.stats import kurtosis, skew
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

In [None]:
wine = pd.read_csv(r'../input/wine-reviews/winemag-data-130k-v2.csv')

## Sneak peek of dataset

In [None]:
wine.head(5)

In [None]:
wine.loc[925,'description']

## Dimension of the dataset

In [None]:
wine.shape

Lots of data to work with.

## Overview of dataset

In [None]:
wine.describe()

We can see some interesting statistics:<br>
1. All the points lie between 80 and 100.
2. The variance in points is much lower than the variance in price.
3. Wine prices have a huge range, from 4 to 3300.
4. In prices, the IQR (interquartile range) is 25, whereas the max value is 3300.
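
To make the IQR remark concrete, here is a minimal sketch in plain Python. The five prices are toy values chosen to echo the range described above, not rows from the dataset, and the 1.5 × IQR rule is the usual box-plot convention rather than something specific to this notebook.

```python
import statistics

# Toy prices (hypothetical), chosen to echo the range described above.
prices = [4, 17, 25, 42, 3300]

# method="inclusive" interpolates linearly between data points, which is the
# same convention pandas uses by default for its 25%/50%/75% rows.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: values further than 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

With an IQR this small compared to the maximum, a single very expensive bottle ends up far beyond the upper fence, which is why price shows so many outliers in the box plot later on.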

# Cleaning the dataset

In [None]:
wine.isnull().values.any()

So, we do have missing values. Let's see them.

In [None]:
sns.heatmap(wine.isnull(),yticklabels=False,cbar=False,cmap='viridis')

So we have a lot of missing data.

In [None]:
wine = wine.drop(columns=['Unnamed: 0','region_1','region_2','taster_twitter_handle','designation'])
wine1 = wine
sns.heatmap(wine.isnull(),yticklabels=False,cbar=False,cmap='viridis')

__Dealing with missing data__
1. First, I am replacing missing prices with the median price.
2. Then I am dropping the rows that still have missing data.
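
The fill value used in the next cell is the median. A small sketch of why that is a reasonable choice for a skewed column like price, using made-up prices (not taken from the dataset) with `None` standing in for missing values:

```python
import statistics

# Hypothetical prices with gaps (None) and one very expensive bottle.
prices = [12.0, None, 18.0, 25.0, None, 40.0, 3300.0]

known = [p for p in prices if p is not None]
mean_price = statistics.mean(known)      # pulled up by the 3300 bottle
median_price = statistics.median(known)  # stays near the typical bottle

# Replace each missing entry with the median of the known prices.
filled = [median_price if p is None else p for p in prices]

print(f"mean={mean_price}, median={median_price}")
print(filled)
```

Filling with the mean here would assign a price of several hundred dollars to unknown bottles, while the median keeps them close to what most wines in the list cost.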

In [None]:
# Fill missing prices with the median; assigning back avoids a chained in-place edit
wine['price'] = wine['price'].fillna(wine['price'].median())
wine1 = wine.dropna()
sns.heatmap(wine1.isnull(),yticklabels=False,cbar=False,cmap='viridis')

## Outliers

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=wine['price'],palette = 'colorblind')

We have a lot of outliers in price.

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x=wine['points'],palette = 'colorblind')

Here two values fall outside the whiskers, but they can't be called outliers, since the maximum possible score is 100.

## Getting more depth

In [None]:
for feature in wine1.columns:
    uniq = np.unique(wine1[feature])
    print('{}: {} distinct values\n'.format(feature,len(uniq)))

So we have wines from 42 countries reviewed which come from 10949 wineries. 608 types of wine are reviewed.

# Doing some Analysis

## Country Feature

## Most reviewed country

In [None]:
plt.figure(figsize=(20,10))
cnt = wine['country'].value_counts().head(20)
#plt.xscale('log')
# Pass counts and labels directly; the name of the count column differs across pandas versions
sns.barplot(x=cnt.values, y=cnt.index, palette='colorblind', orient='h')
plt.title('Distribution of Wine Reviews of Top 20 Countries');

## Distribution of country and price

In [None]:
# Select the price column before aggregating so text columns are not passed to median()
cnt = wine.groupby('country')['price'].median().sort_values(ascending=False).to_frame()

plt.figure(figsize=(20,15))
sns.pointplot(x = cnt['price'] ,y = cnt.index ,color='r',orient='h',markers='o')
plt.title('Country wise median wine price')
plt.xlabel('Price')
plt.ylabel('Country');

In [None]:
plt.figure(figsize = (30,20))
g2 = sns.stripplot(y='price', x= 'country', 
                   data=wine, 
                   jitter=True,
                   dodge=True,
                   marker='o', 
                   alpha=0.5)
g2.set_title("Country X Price Distribution", fontsize=30)
plt.show()

In the first graph we can see that Switzerland has the highest median price, but the strip plot clearly shows that France has a wide range of prices and also the most expensive wine.

### Countries producing expensive wine

In [None]:
plt.figure(figsize=(16,8))

cnt = wine.groupby(['country'])['price'].max().sort_values(ascending=False).to_frame()[:20]
g2 = sns.barplot(x = cnt['price'], y = cnt.index, palette= 'colorblind')
g2.set_title('Most expensive wine in country')
g2.set_ylabel('Country')
g2.set_xlabel('')

## Distribution of country and points

In [None]:
# Select the points column before aggregating so text columns are not passed to mean()
cnt = wine.groupby('country')['points'].mean().sort_values(ascending=False).to_frame()

plt.figure(figsize=(20,15))
sns.pointplot(x = cnt['points'] ,y = cnt.index ,color='r',orient='h',markers='o')
plt.title('Country wise average wine points')
plt.xlabel('Points')
plt.ylabel('Country')


Let's check the same using a strip plot.

In [None]:
plt.figure(figsize = (30,20))
g2 = sns.stripplot(y='points', x= 'country', 
                   data=wine, 
                   jitter=True,
                   dodge=True,
                   marker='o', 
                   alpha=0.5)
g2.set_title("Country X Points Distribution", fontsize=30)
plt.show()

## Variety Feature

## Most reviewed variety

But, before that, what do we mean by variety? <br> Wine “varietals” simply means wine made from a specific winegrape.  Varietal wines in the United States are often named after the dominant grapes used in making the wine.  Cabernet Sauvignon, Merlot, Chardonnay, Riesling, Pinot Noir, and Chenin Blanc are examples of grape varieties. When a wine bottle shows a varietal designation on the label (like Merlot) it means that the wine in the bottle is at least 75%  that grape variety (at least 75% Merlot, for example). <br> [Source](https://www.wines.com/wine-varietals/)

In [None]:
plt.figure(figsize=(20,10))
cnt = wine['variety'].value_counts().head(20)
# Pass counts and labels directly; the name of the count column differs across pandas versions
sns.barplot(x=cnt.values, y=cnt.index, palette='colorblind', orient='h')
plt.title('Distribution of Wine Reviews of Top 20 Varieties');

## Price and variety distribution

In [None]:
plt.figure(figsize=(20,18))
cnt = wine.groupby(['variety'])['price'].max().sort_values(ascending=False).to_frame()[:15]
g2 = sns.barplot(x = cnt['price'], y = cnt.index, palette= 'colorblind')
g2.set_title('The grapes used for most expensive wine')
g2.set_ylabel('Variety')
g2.set_xlabel('')
plt.show()

Cool. The most expensive wine is a Bordeaux-style Red Blend. <br> Is it also the highest rated? <br> Let's see.

In [None]:
plt.figure(figsize=(20,18))
cnt = wine.groupby(['variety'])['points'].max().sort_values(ascending=False).to_frame()[:20]
g2 = sns.barplot(x = cnt['points'], y = cnt.index, palette= 'colorblind')
g2.set_title('Varieties who got highest point')
g2.set_ylabel('Variety')
g2.set_xlabel('')
plt.show()

So, quite a few varieties have received full points.

## Taster Feature

## Most frequent taster

In [None]:
plt.figure(figsize=(20,10))
cnt = wine['taster_name'].value_counts().head(20)
# Pass counts and labels directly; the name of the count column differs across pandas versions
sns.barplot(x=cnt.values, y=cnt.index, palette='colorblind', orient='h')
plt.title('Top 20 tasters')
plt.show()

Roger Voss has reviewed the most wines, about 10000 more than the second most frequent taster, Michael Schachner.

## Taster and point distribution

In [None]:
wine.groupby("taster_name")["points"].describe()

Let's look at the same data visually.

In [None]:
plt.figure(figsize = (30,10))
g2 = sns.stripplot(y='points', x='taster_name', 
                   data=wine, 
                   jitter=True,
                   dodge=True,
                   marker='o', 
                   alpha=0.5)
g2.set_title("Taster Name Points Distribution", fontsize=25)
plt.show()

To understand the statistics and the outliers better, let us draw a box plot over the strip plot.

In [None]:
plt.figure(figsize = (30,10))
# Keep the Axes returned by boxplot so the overlay and the title land on this figure
ax = sns.boxplot(y='points', x='taster_name', 
                 data=wine )
sns.stripplot(y='points', x='taster_name', 
                   data=wine, 
                   jitter=True,
                   dodge=True,
                   marker='o', 
                   alpha=0.5,
                   ax=ax)
ax.set_title("Taster Name Points Distribution", fontsize=25)
plt.show()

We can see that most of the reviewers give points within the same range. A few reviewers sit below that range, but this may be because they have reviewed fewer wines.

## Description Feature

## Analysing descriptions

Let us see what the wines with the lowest points have in their descriptions.

In [None]:
plt.figure(figsize= (16,8))
plt.title('Word cloud of Description of lowest rated wine')
wc = WordCloud(max_words=1000,max_font_size=40,background_color='black', stopwords = STOPWORDS,colormap='Set1')
wc.generate(' '.join(wine[wine['points']==80]['description']))
plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.show()

We can see words such as heavy, flavor, bitter, burnt and sour.

Let us see the descriptions of the most expensive wines.

In [None]:
plt.figure(figsize= (16,8))
plt.title('Word cloud of Description of most expensive wines')
wc = WordCloud(max_words=1000,max_font_size=40,background_color='black', stopwords = STOPWORDS,colormap='Set1')
wc.generate(' '.join(wine[wine['price']>=108]['description']))
plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.show()

We can see words such as show, dense and elegant.

Doing the same for the highest rated wines.

In [None]:
plt.figure(figsize= (16,8))
plt.title('Word cloud of Description of highest rated wines')
wc = WordCloud(max_words=1000,max_font_size=40,background_color='black', stopwords = STOPWORDS,colormap='Set1')
wc.generate(' '.join(wine[wine['points']==100]['description']))
plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.show()

The words vintage, age and aging appear a lot. This suggests that aging is an important aspect of a wine, but not the *only* characteristic that makes it score 100.

__Let's take a look at all the descriptions__

In [None]:
sns.set_context('poster')
plt.figure(figsize= (16,8))
plt.title('Word cloud of Description')
wc = WordCloud(max_words=1000,max_font_size=40,background_color='black', stopwords = STOPWORDS,colormap='Set1')
wc.generate(' '.join(wine['description']))
plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
wine = wine.assign(desc_length = wine['description'].apply(len))

In [None]:
sns.set_context('paper')
plt.figure(figsize=(14,6))

g = sns.regplot(x='desc_length', y='price',
                data=wine, fit_reg=True,  line_kws={'color':'black'},color = 'red' )
g.set_title('Price by Description Length', fontsize=20)
g.set_ylabel('Price(USD)', fontsize = 16) 
g.set_xlabel('Description Length', fontsize = 16)
plt.xticks(rotation=45)  # rotate the numeric tick labels without overwriting their text

plt.show()

## Point Feature

## Average point of wine

In [None]:
g = sns.countplot(x='points', data=wine, palette = 'colorblind') # seaborn countplot to show the points distribution
g.set_title("Points Count Distribution", fontsize=20) # title and font size
g.set_xlabel("Points", fontsize=15) # x label and font size
g.set_ylabel("Count", fontsize=15) # y label and font size


plt.show() #rendering the graphs

The graph looks roughly like a normal distribution, with the majority of wines scoring between 82 and 95.

## Is there any relationship between point and price?

Let's try to see the correlation between price and points.

In [None]:
# Finding the relations between the variables.
plt.figure(figsize=(10,5))
c = wine.select_dtypes(include='number').corr()  # correlate numeric columns only
sns.heatmap(c,cmap="coolwarm",annot=True) #BrBG, RdGy, coolwarm
c

There is a positive correlation between price and points, but it is weak. To understand it better, let us draw a scatter plot. 
There is also a relationship between desc_length and points. This too is weak, but stronger than the one with price.
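
For readers who want to see what the numbers in the heatmap measure: `corr()` reports the Pearson coefficient by default. A minimal sketch of that formula in plain Python follows. The function name `pearson` and the sample values are illustrative assumptions, not data from the notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (points, price) pairs, not taken from the dataset.
points = [82, 85, 88, 90, 93, 96]
prices = [15, 12, 30, 22, 60, 45]

print(round(pearson(points, prices), 3))
print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly linear, decreasing
```

The coefficient only captures linear association, so a value well below 1 can still hide a clear but curved relationship, which is one reason to look at the scatter plot as well.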

In [None]:
plt.figure(figsize=(20,8))

g = sns.regplot(x='points', y='price', 
                data=wine, line_kws={'color':'black'},
                x_jitter=True, fit_reg=True, color = 'red')
g.set_title("Points x Price Distribution", fontsize=20)
g.set_xlabel("Points", fontsize= 15)
g.set_ylabel("Price", fontsize= 15)

plt.show()

We can clearly see that the highest rated wine is not the most expensive one. <br> This brings us to the next part of the project, which is the recommender. The recommendation system will return wines with similar characteristics.

But before that, let's see whether the length of the description has anything to do with points.

In [None]:
wine = wine.assign(description_length = wine['description'].apply(len))
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(x='points', y='description_length', data=wine)
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Description Length per Points', fontweight="bold", size=25) # Title
ax.set_ylabel('Description Length', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
plt.show()

We can see a roughly linear, positive relation between points and description length. So our recommendation system can be based on the description.

# The Model

![](http://shabal.in/visuals/kmeans/random.gif)

Let us first simplify the points, because all of them lie between 80 and 100.

Let's try to simplify the model with 6 different values:

0 -> Points 80 to 82 (Below Average wines)

1 -> Points 83 to 86 (Average wines)

2 -> Points 87 to 89 (Good wines)

3 -> Points 90 to 93 (Very Good wines)

4 -> Points 94 to 97 (Excellent wines)

5 -> Points 98 to 100 (Best wines)

In [None]:
def cat_points(points):
    """Map a score between 80 and 100 to a rating category from 0 to 5."""
    if points < 83:
        return 0
    elif points < 87:
        return 1
    elif points < 90:
        return 2
    elif points < 94:
        return 3
    elif points < 98:
        return 4
    else:
        return 5

wine["rating_cat"] = wine["points"].apply(cat_points)
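
The same binning can be written as a lookup against a sorted list of cutoffs. This is only an alternative sketch: `CUTOFFS` and `rating_category` are hypothetical names, and the boundaries are copied from `cat_points` above.

```python
from bisect import bisect_right

# Lower bounds of categories 1..5; scores below 83 fall into category 0.
CUTOFFS = [83, 87, 90, 94, 98]

def rating_category(points):
    """Return the 0-5 rating category for a score in the 80-100 range."""
    # bisect_right counts how many cutoffs are <= points.
    return bisect_right(CUTOFFS, points)

for score in (80, 82, 83, 89, 90, 97, 98, 100):
    print(score, "->", rating_category(score))
```

Keeping the boundaries in one list makes it easier to change the bins later without touching a chain of `elif` branches.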

Let us see the distribution of new simplified points

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Number of wines per rating category', fontweight="bold", size=25) # Title
ax.set_ylabel('Number of wines', fontsize = 25) # Y label
ax.set_xlabel('Rating category', fontsize = 25) # X label
wine.groupby(['rating_cat']).count()['description'].plot(ax=ax, kind='bar')

In [None]:
fig, ax = plt.subplots(figsize=(30,10))
sns.boxplot(x='points', y='desc_length', data=wine)
plt.xticks(fontsize=20) # X Ticks
plt.yticks(fontsize=20) # Y Ticks
ax.set_title('Description Length per Points', fontweight="bold", size=25) # Title
ax.set_ylabel('Description Length', fontsize = 25) # Y label
ax.set_xlabel('Points', fontsize = 25) # X label
plt.show()

## Recommender (Naive Try)

In [None]:
given_point = int(input("Your preferred point: "))
given_price = float(input("What's your budget, in dollars?: "))
found = False
for row_index,row in wine.iterrows():
    if row['points']==given_point and row['price']<given_price:
        print(row['title'], "   ", row['price'], "   ", row['points'])
        found = True
if not found:
    print("Sorry, not found.")

This is clearly not the way. Even if the points match and the price is within budget, the wine is not necessarily the preferred one, because not all wines have the same flavour and texture. A better model would match wines according to their descriptions.

## Description Vectorization

One of the simplest ways to classify text with ML is vectorization.<br>
Basically, we want to represent our texts in a vector space, with a weight for each word (number of occurrences etc.), so that our classification algorithm can interpret them.<br>
For example, if our dictionary contains the words {Jupyter, is, the, not, great}, and we want to vectorize the text “Jupyter is great”, we would have the following vector: (1, 1, 0, 0, 1).
A few vectorization algorithms and options are available, the most common being:
- CountVectorizer: weights each word by its count, as stated by its name
- TF-IDF Vectorizer: the weight increases proportionally to the count, but is offset by how many documents in the corpus contain the word. This offset is called the IDF (Inverse Document Frequency), and it lets the vectorizer down-weight frequent words like “the”, “a” etc.
- n-grams: sequences of n adjacent words are counted instead of single words
- stop words: very common words are dropped before counting
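
The Jupyter example above can be reproduced in a few lines of plain Python. This is a sketch of the counting idea only, with a hypothetical helper `count_vector`; scikit-learn's `CountVectorizer` has its own tokenization rules and returns a sparse matrix instead of a list.

```python
import re
from collections import Counter

def count_vector(text, dictionary):
    """Count how often each dictionary word occurs in text (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # One entry per dictionary word, in dictionary order; missing words count 0.
    return [counts[word.lower()] for word in dictionary]

dictionary = ["Jupyter", "is", "the", "not", "great"]
print(count_vector("Jupyter is great", dictionary))
# [1, 1, 0, 0, 1]
print(count_vector("The wine is not great, not at all", dictionary))
# [0, 1, 1, 2, 1]
```

Words that are not in the dictionary ("wine", "at", "all") simply do not appear in the vector, and word order is lost.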

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.<br>
[Source](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
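
To illustrate how the IDF offset works, here is a sketch of the textbook formula, tf × log(N / df), on three made-up sentences. The helper `tf_idf` is hypothetical, and the numbers will not match `TfidfVectorizer`, which by default uses a smoothed IDF and L2-normalises each row.

```python
import math

# Three made-up mini "reviews".
docs = [
    "the wine is rich and the finish is long",
    "the wine is thin",
    "the finish is bitter",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, doc_tokens):
    """Textbook TF-IDF: count in the document times log(N / df).

    Assumes the term occurs in at least one document (df > 0).
    """
    tf = doc_tokens.count(term)
    df = sum(term in tokens for tokens in tokenized)
    return tf * math.log(len(tokenized) / df)

first = tokenized[0]
for term in ("the", "wine", "rich"):
    print(term, round(tf_idf(term, first), 3))
```

"the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes, while "rich", which appears in only one document, gets the largest weight of the three.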

## Using Count Vectorizer

__CountVectorizer__ <br>
Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.<br> 
[Source](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

## N-Grams
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles. [Source](https://en.wikipedia.org/wiki/N-gram) <br>
[To understand N Grams](https://kavita-ganesan.com/what-are-n-grams/)
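
A minimal sketch of word-level n-grams, assuming simple whitespace tokenization. The function name `word_ngrams` and the sample phrase are illustrative only.

```python
def word_ngrams(text, n):
    """Return the contiguous n-word sequences found in text."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text, one token at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "ripe black cherry flavors"
print(word_ngrams(sentence, 2))
# ['ripe black', 'black cherry', 'cherry flavors']
print(word_ngrams(sentence, 3))
# ['ripe black cherry', 'black cherry flavors']
```

Bigrams and trigrams keep a little of the word order that single-word counts throw away, so "black cherry" survives as one feature.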

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

`ngram_range` is set to (2, 3), which means bigrams and trigrams are extracted from the text. 
<br> [This link](https://stackoverflow.com/a/35615151/7449819) has a brief and clear description of `min_df` and `max_df`. 
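
As a rough illustration of what `min_df` and `max_df` do, the sketch below filters terms by document frequency in plain Python. It assumes an integer `min_df` (an absolute number of documents) and a float `max_df` (a fraction of all documents). The helper name and the four sentences are made up.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms found in at least min_df documents and in at most a
    max_df fraction of all documents."""
    n_docs = len(tokenized_docs)
    # set(tokens) so a term counts once per document, however often it repeats.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

docs = [
    "black cherry and spice",
    "black cherry and oak",
    "green apple and citrus",
    "ripe pear and honey",
]
tokenized_docs = [doc.split() for doc in docs]
print(filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.5))
# ['black', 'cherry']
```

"and" is dropped for being in every document, and words that occur in a single document are dropped for being too rare, leaving only the terms in between.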

In [None]:
import matplotlib.gridspec as gridspec # to do the grid of plots
country = wine.country.value_counts()[:20]

grid = gridspec.GridSpec(5, 2)
plt.figure(figsize=(16,7*4))

for n, cat in enumerate(country.index[:10]):
    
    ax = plt.subplot(grid[n])   

    vectorizer = TfidfVectorizer(ngram_range = (2, 3), min_df=5, 
                                 stop_words='english',
                                 max_df=.5) 
    
    X2 = vectorizer.fit_transform(wine.loc[(wine.country == cat)]['description']) 
    features = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    
    # Getting top ranking features 
    sums = X2.sum(axis = 0) 
    data1 = [] 
    
    for col, term in enumerate(features): 
        data1.append( (term, sums[0,col] )) 

    ranking = pd.DataFrame(data1, columns = ['term','rank']) 
    words = (ranking.sort_values('rank', ascending = False))[:15]
    
    sns.barplot(x='term', y='rank', data=words, ax=ax, 
                color='blue', orient='v')
    ax.set_title(f"Wines from {cat} N-grams", fontsize=19)
    ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
    ax.set_ylabel(' ')
    ax.set_xlabel(" ")

plt.subplots_adjust(top = 0.95, hspace=.9, wspace=.1)

plt.show()

We can see a few interesting facts: each country has its own characteristic vocabulary for describing wine.

## Doing Sentiment Analysis

**What is Sentiment Analysis?**

![funny picture](https://www.brandwatch.com/wp-content/resize/uploads/2015/01/lego-640.jpg__w0)

Sentiment analysis is a type of data mining that measures the inclination of people’s opinions through natural language processing (NLP), computational linguistics and text analysis, which are used to extract and analyze subjective information from the Web - mostly social media and similar sources. The analyzed data quantifies the general public's sentiments or reactions toward certain products, people or ideas and reveal the contextual polarity of the information. [Source](https://www.techopedia.com/definition/29695/sentiment-analysis)

In [None]:
wine['price_log'] = np.log(wine['price'])
wine.head(2)

In [None]:
# Requires the VADER lexicon; if it is missing, run nltk.download('vader_lexicon') once
from nltk.sentiment.vader import SentimentIntensityAnalyzer

SIA = SentimentIntensityAnalyzer()

# Applying Model, Variable Creation
sentiment = wine.sample(15000).copy()
# Score each description once and reuse the result for all four columns
scores = sentiment['description'].apply(SIA.polarity_scores)
sentiment['polarity_score'] = scores.apply(lambda s: s['compound'])
sentiment['neutral_score'] = scores.apply(lambda s: s['neu'])
sentiment['negative_score'] = scores.apply(lambda s: s['neg'])
sentiment['positive_score'] = scores.apply(lambda s: s['pos'])

# Label each review by the sign of its compound score
sentiment['sentiment'] = 'NEUTRAL'
sentiment.loc[sentiment.polarity_score>0,'sentiment']='POSITIVE'
sentiment.loc[sentiment.polarity_score<0,'sentiment']='NEGATIVE'
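
The labelling step above splits on the sign of the compound score. A small sketch of the same rule as a function with an adjustable cutoff follows. `label_sentiment` and its `threshold` parameter are hypothetical; a cutoff of about 0.05 on either side of zero is commonly suggested for VADER, whereas this notebook uses 0.

```python
def label_sentiment(compound, threshold=0.0):
    """Turn a compound score in [-1, 1] into a coarse label."""
    if compound > threshold:
        return "POSITIVE"
    if compound < -threshold:
        return "NEGATIVE"
    return "NEUTRAL"

# Compare the notebook's rule (threshold 0) with a slightly wider neutral band.
for score in (0.82, 0.03, 0.0, -0.4):
    print(score, label_sentiment(score), label_sentiment(score, threshold=0.05))
```

With a threshold of 0, a barely positive score such as 0.03 is labelled POSITIVE; a wider neutral band would treat it as NEUTRAL.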

In [None]:
def sentiment_analyzer_scores(sentence):
    score = SIA.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))

In [None]:
# The helper prints its own output and returns None, so call it directly
sentiment_analyzer_scores("yaaaaay")
sentiment_analyzer_scores("Today is a sunny day #love")
sentiment_analyzer_scores("UGGHHH SUCH A BORING DAY")
sentiment_analyzer_scores("i like kaggle a lot lol")
sentiment_analyzer_scores("i like kaggle a lot!!")

## Plotting the Analysis
Ok cool, we have the analysis, but little can be understood from the raw numbers. We need to plot them to see what exactly is going on.

In [None]:
plt.figure(figsize=(14,5))

plt.suptitle('Sentiment of the reviews by: \n- Points and Price(log) -', size=22)

plt.subplot(121)
ax = sns.boxplot(x='sentiment', y='points', data=sentiment,palette = 'pastel')
ax.set_title("Sentiment by Points Distribution", fontsize=19)
ax.set_ylabel("Points ", fontsize=17)
ax.set_xlabel("Sentiment Label", fontsize=17)

plt.subplot(122)
ax1= sns.boxplot(x='sentiment', y='price_log', data=sentiment,palette = 'pastel')
ax1.set_title("Sentiment by Price Distribution", fontsize=19)
ax1.set_ylabel("Price (log) ", fontsize=17)
ax1.set_xlabel("Sentiment Label", fontsize=17)

plt.subplots_adjust(top = 0.75, wspace=.2)
plt.show()

Here we can see that price does not vary much with the sentiment of the description, which is kind of obvious. But the sentiment of the description does appear to be related to the points. So, we can base our recommender system on the sentiment and the points involved.

## Recommender System using a Collaborative Filtering method
A small recommender system is built using the Nearest Neighbors algorithm.

- Similarity is the cosine of the angle between the item vectors of A and B
- The closer the vectors, the smaller the angle and the larger the cosine
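
The two points above can be checked with a short sketch in plain Python. The three vectors are hypothetical rows of a variety-by-province table (0 meaning no review), not values from the dataset. A `cosine` metric in a nearest-neighbour search works with the cosine distance, which is 1 minus this similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical points per province for three varieties (0 = no review).
variety_a = [90, 0, 88, 0]
variety_b = [91, 0, 87, 0]  # reviewed in the same provinces as variety_a
variety_c = [0, 89, 0, 92]  # reviewed in completely different provinces

print(round(cosine_similarity(variety_a, variety_b), 4))  # close to 1
print(cosine_similarity(variety_a, variety_c))            # 0.0, no overlap
print(1 - cosine_similarity(variety_a, variety_c))        # cosine distance
```

Because the measure depends on direction rather than length, two varieties reviewed in the same provinces with similar scores come out as near neighbours, while varieties with no province in common are as far apart as possible.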

Importing some more needed libraries

In [None]:
from sklearn.neighbors import NearestNeighbors # Unsupervised nearest-neighbour search
from scipy.sparse import csr_matrix # Compressed Sparse Row matrix
from sklearn.decomposition import TruncatedSVD # Dimensional Reduction

Creating another dataframe containing the columns that we need. <br> Next we want to create a matrix of the same for analysis

In [None]:
# Use points as the rating, with varieties as rows and provinces as columns
col = ['province','variety','points']

wine1 = wine[col]
wine1 = wine1.dropna(axis=0)
wine1 = wine1.drop_duplicates(['province','variety'])
wine1 = wine1[wine1['points'] > 85]

wine_pivot = wine1.pivot(index= 'variety',columns='province',values='points').fillna(0)
wine_pivot_matrix = csr_matrix(wine_pivot)

Let's see the type of the ```wine_pivot_matrix```

In [None]:
wine_pivot_matrix

## Instantiating the KNN algorithm and fitting the Wine Matrix to it

In [None]:
from sklearn.cluster import KMeans
from scipy.cluster.vq import kmeans, vq

In [None]:
trial = wine[['price', 'points']]
data = np.asarray([np.asarray(trial['price']), np.asarray(trial['points'])]).T

In [None]:
X = data
distortions = []
for k in range(2,30):
    k_means = KMeans(n_clusters = k)
    k_means.fit(X)
    distortions.append(k_means.inertia_)

fig = plt.figure(figsize=(15,10))
plt.plot(range(2,30), distortions, 'bx-')
plt.title("Elbow Curve")

In [None]:
knn = NearestNeighbors(n_neighbors=7, algorithm = 'brute',metric = 'cosine')
model_knn = knn.fit(wine_pivot_matrix)

## Running our baseline Model

In [None]:
for n in range(5):
    query_index = np.random.choice(wine_pivot.shape[0])
    #print(n, query_index)
    distance, indice = model_knn.kneighbors(wine_pivot.iloc[query_index,:].values.reshape(1,-1), n_neighbors=6)
    for i in range(0, len(distance.flatten())):
        if  i == 0:
            print('Recommendation for ## {0} ##:'.format(wine_pivot.index[query_index]))
        else:
            print('{0}: {1} with distance: {2}'.format(i,wine_pivot.index[indice.flatten()[i]],distance.flatten()[i]))
    print('\n')

Woohoo!!

# Future Work

* Build the recommender based on points, price, variety or title as given by the user.
* Deploy the recommender on Flask or Django
* Try to predict points using the description
* Use a better language model

# Some interesting blogs
1. [Sentiment Analysis](https://www.brandwatch.com/blog/understanding-sentiment-analysis/)
2. [N Grams](https://www.youtube.com/watch?v=MZIm_5NN3MY)
3. [Collaborative Filtering](https://www.youtube.com/watch?v=9AP-DgFBNP4&t=390s)
4. [KNN Algorithm](https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e)
5. [To understand Numpy](https://www.youtube.com/watch?v=NlZXAytUeeE&list=PLWKjhJtqVAblvI1i46ScbKV2jH1gdL7VQ&index=3)
6. [To understand Pandas](https://www.youtube.com/watch?v=b2mLDkMSyn4&list=PLWKjhJtqVAblvI1i46ScbKV2jH1gdL7VQ&index=4)
7. [Exploratory Data Analysis](https://www.youtube.com/watch?v=Ea_KAcdv1vs)
8. [EDA on wine review](https://towardsdatascience.com/wine-ratings-prediction-using-machine-learning-ce259832b321)
9. [Wine Recommender](https://github.com/RoaldSchuring/wine_recommender)
10. [My notebook inspiration](https://www.kaggle.com/kabure/wine-review-s-eda-recommend-systems)
11. [An amazing guide to Sentiment Analysis](https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f)

## Important links
1. [Matplotlib Documentation](https://matplotlib.org/)
2. [Seaborn Documentation](http://seaborn.pydata.org/)
3. [Scikit Learn Documentation](https://scikit-learn.org/stable/)