In [None]:
import numpy as np
import pandas as pd
import re

wine = pd.read_csv("../input/winemag-data-130k-v2.csv")

In [None]:
# review wine dataset
wine.head()

In [None]:
# Check that the columns used as input and output contain no missing values;
# any NaN / null rows in those columns would have to be dropped.
wine.info()

**Fetch the necessary information for the input (question) and output (answer)**

From the info we can see that neither *description* (input) nor *title* (output) contains any NaN values.
Next, check whether there are any duplicates and how many wines have more than one entry.

**Reasoning for checking the count of 'title' entries:**
To validate the results of our NLP / deep learning approach we need some kind of test set. If we keep every entry, including titles that appear only once, we cannot test the model automatically, because such a title has no second description to hold back for testing. Evaluating on the test set later also tells us whether a natural language model can realistically be built from this dataset, measured by the accuracy of correct "findings" in the test set.

In [None]:
# remove rows that repeat the same description for the same title
wine.drop_duplicates(subset = ['description', 'title'], inplace = True)

Count the wines that have at least 2 descriptions per title.

In [None]:
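# Count the descriptions per title. rename_axis / reset_index(name=...) yields
# the columns 'title' and 'count' on both older and newer pandas versions.
wine_title_count = wine['title'].value_counts().rename_axis('title').reset_index(name='count')
boolWines = wine_title_count[wine_title_count['count'] > 1]
wine_titles = list(boolWines['title'])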

In [None]:
# Keep only the rows whose title has more than one description.
# DataFrame.append was removed in pandas 2.0, so filter with isin instead of
# appending the rows title by title.
wine_cleaned = wine[wine['title'].isin(wine_titles)].reset_index(drop=True)
            
len(wine_cleaned.index)

In [None]:
len(wine_cleaned['title'].value_counts())

Now we have drastically reduced the number of inputs to 2082 rows covering 934 different wines.
That is an immense reduction compared to the original dataset of roughly 130k rows.

The reduction results from eliminating duplicates and keeping only titles with at least 2 entries.
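
The two filtering steps can be sketched on a toy frame. This is only an illustration under assumptions: the `toy` frame and its rows are made up, and `groupby(...).transform("size")` is one possible way to keep titles with at least two rows, not necessarily the approach used in the cells above.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "description": ["fruity", "fruity", "oaky", "dry", "crisp", "floral"],
})

# Step 1: drop rows that repeat the same (description, title) pair.
deduped = toy.drop_duplicates(subset=["description", "title"])

# Step 2: keep only titles that still have at least two descriptions.
per_title = deduped.groupby("title")["title"].transform("size")
kept = deduped[per_title >= 2].reset_index(drop=True)

print(kept)
```

Title "B" disappears because it has a single description, and the repeated "fruity" row for "A" is counted only once.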

## Now we're starting the fun by preparing everything for our model

In [None]:
# create input list and output list 
questions = list(wine_cleaned['description'])
answers = list(wine_cleaned['title'])

We will keep the answers as they are and only prepare the questions, to make them easier for the machine to process and to improve performance.

The raw question / answer input is finished. Let's look at an example of the status quo. Afterwards we can compare it with the cleaned text (the machine-friendly input).
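
To give a feel for what "machine-friendly" means here, the following is a dependency-free sketch of the cleaning idea: lowercase the text, keep letters only, and drop stop words. It makes two simplifying assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for a real stop-word list, and stemming is left out entirely. The helper name `simple_clean` is made up for this example.

```python
import re

# Tiny hypothetical stop-word set; a real list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this", "with"}

def simple_clean(text):
    """Lowercase, keep letters a-z only, and drop stop words (no stemming)."""
    text = text.lower()
    # Everything that is not a lower-case ASCII letter becomes a space.
    text = re.sub(r"[^a-z]", " ", text)
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)

example = "This wine is ripe and fruity, with 13% alcohol."
print(simple_clean(example))  # wine ripe fruity alcohol
```

Note that the letters-only filter also removes digits and accented characters, which is acceptable for this sketch but may lose information in other kinds of text.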

In [None]:
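import nltk 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Only needed if the stop-word corpus is not already installed locally
nltk.download('stopwords', quiet=True)
porterStemmer = PorterStemmer()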

In [None]:
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # convert everything to lower case
    text = text.lower()

    # Replace every character that is not a lower-case letter with a space.
    # For the wine descriptions the letters alone are enough; for other kinds of
    # text analysis (e.g. technical data) the numbers might be quite important.
    text = re.sub('[^a-z]', ' ', text)

    # split the whole description into single words
    text = text.split()
    # Remove NLTK stop words such as "is" and "and" to reduce the input of our
    # model and keep the focus on relevant words. Then stem each remaining word
    # so that inflected forms share one root, e.g. "loved" becomes "love".
    text = [porterStemmer.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    return text

In [None]:
questions_nlp = []
for question in questions:
    questions_nlp.append(clean_text(question))

Let's compare the original questions with the cleaned questions.

In [None]:
questions[0:4]

In [None]:
questions_nlp[0:4]

Since every title has at least 2 descriptions, we can split the data into a train and a test set of 50% each. This also means that a title with exactly two descriptions contributes only one training example, which may be quite a low number. Still, I'm curious what the outcome of the training will be. I'll use Naive Bayes to classify the wine based on its description. But before we do that, let's vectorize the questions and prepare the train and test sets.

In [None]:
# create a dataframe to make it easier to prepare the correct sets
questions_and_answers = pd.DataFrame(columns = ['questions', 'answers'])
questions_and_answers['questions'] = questions_nlp
questions_and_answers['answers'] = answers
questions_and_answers = questions_and_answers.sort_values(by = 'answers')

In [None]:
questions_and_answers.head(n = 10)

Let's recap what we did. First we reviewed the data and found some titles which occurred only once. Those titles were dropped because we can't validate them automatically later on. We therefore kept every title which occurs at least twice. To make it easy to separate test and train data, we sorted the questions and answers by title.
As you can see above, all descriptions and answers are now sorted.
Now we can vectorize the questions, because they are aligned with the sorted answers.

Our next steps:

* Vectorize our questions
* Split our dataset into train and test set
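
As a conceptual sketch of the vectorizing step, the snippet below builds a small word-count matrix by hand. It is an illustration of the bag-of-words idea only, not a drop-in replacement for a library vectorizer (which may tokenize differently); the two example documents and the helper `to_counts` are made up.

```python
from collections import Counter

# Two hypothetical, already cleaned descriptions.
docs = ["ripe cherri oak", "oak vanilla oak"]

# Vocabulary: every distinct word, sorted so the columns have a stable order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)  # ['cherri', 'oak', 'ripe', 'vanilla']
print(matrix)      # [[1, 1, 1, 0], [0, 2, 0, 1]]
```

Each row is one description and each column is one word, so the width of the matrix equals the number of distinct words across all descriptions.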

In [None]:
questions_final = questions_and_answers['questions']
answers_final = questions_and_answers['answers']

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [None]:
questions_ML = cv.fit_transform(questions_final).toarray()
answers_ML = answers_final
len(questions_ML[0])

In [None]:
len(questions_ML)

In [None]:
len(answers_ML)

In [None]:
# number of unique classes
answers_ML.nunique()

We have 2082 entries in both the answers and the questions. Across all questions there are 3416 different words, each of which may hint at which wine we are searching for.
Now: split the dataset.
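
The idea behind an alternating split on sorted labels can be sketched with plain lists. The labels and feature names below are hypothetical. Because equal labels sit next to each other after sorting, any label with at least two rows covers at least one even and one odd position, so it appears in both halves.

```python
# Hypothetical labels, already sorted so equal titles are adjacent.
labels = ["A", "A", "B", "B", "B", "C", "C"]
features = [f"desc_{i}" for i in range(len(labels))]

# Even positions go to training, odd positions to testing.
train = [(features[i], labels[i]) for i in range(0, len(labels), 2)]
test = [(features[i], labels[i]) for i in range(1, len(labels), 2)]

train_labels = {label for _, label in train}
test_labels = {label for _, label in test}

# Every label that occurs at least twice shows up in both sets.
print(train_labels == test_labels)  # True
```

The pairing of feature and label relies on both lists being accessed by the same position, which is worth double-checking whenever one of them is a pandas object with its own index.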

In [None]:
# Split the data into train and test set
questions_train = []
questions_test = []
answers_train = []
answers_test = []
# Divide the whole set 50 / 50. Since the answers are sorted by title,
# each title ends up at least once in both the train and the test set.
# Use .iloc for positional access: answers_ML still carries the index labels
# from before the sort, so answers_ML[row] would not line up with questions_ML[row].
for row in range(0, len(answers_ML)):
    if row % 2 == 0:
        answers_train.append(answers_ML.iloc[row])
        questions_train.append(questions_ML[row])
    else:
        answers_test.append(answers_ML.iloc[row])
        questions_test.append(questions_ML[row])
        


Let's start the learning and review the results of the first attempt.

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(questions_train, answers_train)

In [None]:
y_pred = classifier.predict(questions_test)

In [None]:
Evaluation = pd.DataFrame(columns = ['prediction', 'testvalues'])
Evaluation['prediction'] = y_pred
Evaluation['testvalues'] = answers_test

In [None]:
Evaluation.head(n = 15)

In [None]:
Evaluation['prediction_output'] = Evaluation['prediction'] == Evaluation['testvalues']

In [None]:
Evaluation['prediction_output'].value_counts()

Well, I would say that the first iteration was a complete disaster. Based on my course of action I've come to some conclusions:
* I shrank the dataset drastically, to around 2100 of roughly 130k rows. That is a huge loss of data.
* We split the dataset 50 / 50. I would now say that I need many more different descriptions per wine.
* I tried to classify 934 different wines based on the description alone. Unfortunately the results are much worse than I expected. Another approach could be to use e.g. a Decision Tree / Random Forest.
* Add additional information to the predictor, e.g.
    * taster: each taster has a different way of describing what they taste
    * variety: might help to get a better classification

**Conclusion for now:**
Based on the descriptions of some wines which I read, I think that the wording of the descriptions for one and the same wine varies quite drastically. The test and train sets have been prepared, and several different models can now be tested on them. Maybe my idea of dividing the data that way was wrong.