<div class="alert alert-info">
<h3> Groups All Day</h3>
<p> Group numbers are in <code>data/groups.json</code>. Find your group. Move tables and chairs so that folks are not in rows and no one has to turn around to see the board.

Start a new notebook where you will do your work for today. Make the first cell a markdown cell and put a title or notes in it. The second cell can hold your <code>import</code> statements.
</div>



In [None]:
%matplotlib inline

import pandas as pd

pd.set_option('display.max_colwidth', 120)

In [None]:
wine_df_full = pd.read_csv('data/wine_reviews.csv')

# Reduce the dataset to a random sample so that it is more manageable.
wine_df = wine_df_full.sample(n = 10000)

In [None]:
wine_df.info()

Turn words into numbers

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# 1. Set the parameters.
vectorizer = CountVectorizer(lowercase   = True,
                             ngram_range = (1,1),
                             stop_words  = 'english',
                             max_df      = .50,
                             min_df      = .01,
                             max_features = None)

In [None]:
# 2. Fit the data

vectorizer.fit(wine_df['description'])

In [None]:
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
len(vectorizer.get_feature_names_out())

In [None]:
# 3. Transform based on the model
review_word_counts = vectorizer.transform(wine_df['description'])
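To make the fit and transform steps concrete, here is a minimal sketch on a made-up three-review corpus. The reviews and the `toy_` names are illustrative assumptions, not part of the wine data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews standing in for wine_df['description'].
toy_reviews = ["ripe cherry and soft tannins",
               "crisp apple and bright citrus",
               "ripe apple with soft citrus notes"]

toy_vectorizer = CountVectorizer(lowercase=True, stop_words='english')
toy_vectorizer.fit(toy_reviews)                      # learn the vocabulary
toy_counts = toy_vectorizer.transform(toy_reviews)   # sparse document-term matrix

# One row per review, one column per vocabulary word.
print(toy_counts.shape)
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts.toarray())
```

Each cell of the matrix counts how often one vocabulary word appears in one review; stop words such as "and" are dropped before counting.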

![](images/knn1.png)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# 1. Set the parameters.
knn_classifier = KNeighborsClassifier(n_neighbors = 3)

In [None]:
# 2. Fit the data
knn_classifier.fit(review_word_counts, wine_df['rating'])


In [None]:
# 3. Predict based on the model

knn_prediction = knn_classifier.predict(review_word_counts)

In [None]:

print(accuracy_score(wine_df['rating'], knn_prediction))




<div class="alert alert-info">
<h3> Your turn</h3>
<p> What is the f1 score for the model?

</div>
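One way to approach this is with `f1_score` and `classification_report` from `sklearn.metrics`. The sketch below uses made-up labels (`'High'` and `'Low'` are assumptions, not necessarily the values in the `rating` column) and macro averaging, which is only one of several averaging choices.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels.
y_true = ['High', 'High', 'Low', 'Low', 'Low', 'High']
y_pred = ['High', 'Low', 'Low', 'Low', 'High', 'High']

# Macro averaging computes F1 per class, then takes the unweighted mean.
toy_f1 = f1_score(y_true, y_pred, average='macro')
print(toy_f1)

# The report lists precision, recall and F1 for every class.
print(classification_report(y_true, y_pred))
```

With the real data, the same calls would take `wine_df['rating']` and `knn_prediction` in place of the toy lists.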


What about the fit on different data?

In [None]:
wine_df_test = wine_df_full.sample(n = 10000)
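Because both samples are drawn from the same frame, some test rows may also appear in the training sample. A minimal sketch, on a made-up frame, of one way to draw a test sample with no overlap; `toy_full` and the `random_state` values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for wine_df_full.
toy_full = pd.DataFrame({'review_id': range(1000)})

toy_train = toy_full.sample(n=100, random_state=1)

# Drop the training rows before drawing the test sample.
toy_test = toy_full.drop(toy_train.index).sample(n=100, random_state=2)

# Count rows that appear in both samples.
overlap = toy_train.index.intersection(toy_test.index)
print(len(overlap))  # prints 0
```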

In [None]:
# words into numbers
# don't refit the vectorizer, just transform.

wdt_tf = vectorizer.transform(wine_df_test['description'])

In [None]:
# don't rebuild the model, just predict.

test_prediction = knn_classifier.predict(wdt_tf)

In [None]:
print(accuracy_score(wine_df_test['rating'], test_prediction))



![](images/knn2.png)

<div class="alert alert-info">
<h3> Your turn</h3>
<p> What about changing your model to 6 neighbors? Does it fit better? Do you have the same results as other members of your group?

</div>
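Results will likely differ across group members because `sample()` draws a different random subset each time it runs. A minimal sketch, on a made-up frame, of pinning the draw with the `random_state` argument; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the full review table.
toy_df = pd.DataFrame({'review_id': range(100)})

# The same seed gives the same rows on every run and every machine.
sample_a = toy_df.sample(n=10, random_state=42)
sample_b = toy_df.sample(n=10, random_state=42)

# Without a seed, the rows change from run to run.
sample_c = toy_df.sample(n=10)

print(sample_a.equals(sample_b))  # prints True
```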


In [None]:
for n in [2, 4, 6, 12]:
    print(n)
    knn_classifier = KNeighborsClassifier(n_neighbors = n)
    knn_classifier.fit(review_word_counts, wine_df['rating'])
    
    train_predict = knn_classifier.predict(review_word_counts)
    print(accuracy_score(wine_df['rating'], train_predict))
    
    test_predict = knn_classifier.predict(wdt_tf)
    print(accuracy_score(wine_df_test['rating'], test_predict))



In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# old model: knn_classifier = KNeighborsClassifier(n_neighbors = 3)

parameters = {'n_neighbors' : [3, 5],
              'weights'     : ['distance', 'uniform']}
              

In [None]:
grid = GridSearchCV(KNeighborsClassifier(), 
                    parameters, 
                    cv=5)

![](images/cv.png)

In [None]:
grid.fit(review_word_counts, wine_df['rating'])

In [None]:
grid.cv_results_

In [None]:
pd.DataFrame(grid.cv_results_)

In [None]:
grid.best_estimator_

In [None]:
train_prediction = grid.best_estimator_.predict(review_word_counts)

print(accuracy_score(wine_df['rating'], train_prediction))

In [None]:
test_prediction  = grid.best_estimator_.predict(wdt_tf)
print(accuracy_score(wine_df_test['rating'], test_prediction))

<div class="alert alert-info">
<h3> Your turn</h3>
<p> What are the optimal settings for the k-nearest neighbors model?
</div>
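A fitted `GridSearchCV` object also exposes the winning combination directly through `best_params_` and its mean cross-validated score through `best_score_`. The sketch below runs the same grid on synthetic data from `make_classification`, which is an assumption standing in for the word counts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, used here only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_grid = GridSearchCV(KNeighborsClassifier(),
                        {'n_neighbors': [3, 5],
                         'weights': ['distance', 'uniform']},
                        cv=5)
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)   # parameter combination with the highest mean CV score
print(toy_grid.best_score_)    # that mean score across the 5 folds
```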


<div class="alert alert-info">
<h3> Your turn</h3>
<p> How does this compare to a logistic regression model?
<code> google sklearn logistic regression </code>
</div>
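As a starting point, a logistic regression follows the same set, fit, predict pattern as the k-nearest neighbors model. This is a minimal sketch on made-up reviews and labels; the `max_iter` value is an assumption chosen to give the solver room to converge.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical reviews and ratings, for illustration only.
toy_reviews = ["rich ripe fruit and elegant finish",
               "elegant structure with ripe fruit",
               "complex and rich with long finish",
               "long elegant finish and complex fruit",
               "thin sour and flat",
               "flat and watery with sour notes",
               "watery thin and dull",
               "dull flat finish and sour fruit"]
toy_ratings = ['High', 'High', 'High', 'High', 'Low', 'Low', 'Low', 'Low']

toy_counts = CountVectorizer().fit_transform(toy_reviews)

# 1. Set the parameters. 2. Fit the data. 3. Predict.
logit = LogisticRegression(max_iter=1000)
logit.fit(toy_counts, toy_ratings)
logit_prediction = logit.predict(toy_counts)

print(accuracy_score(toy_ratings, logit_prediction))
```

For a fair comparison with the k-nearest neighbors model, score both on the held-out test sample rather than on the training data.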



And now for something different

<div class="alert alert-info">
<h3> Your turn</h3>
<p> As a group, take a look at the text of the wine descriptions. Ignore the ratings. What different themes do you find?
</div>




![](images/lda.jpg)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation



In [None]:
vectorizer = CountVectorizer(lowercase   = True,
                             ngram_range = (1,2),
                             max_df      = .50,
                             min_df      = .01,
                             max_features = None)

In [None]:
vectorizer.fit(wine_df['description'])

In [None]:
review_word_counts = vectorizer.transform(wine_df['description'])

In [None]:
lda = LatentDirichletAllocation(n_components   = 5)

In [None]:
lda.fit(review_word_counts)

What words are associated with what topics?

LatentDirichletAllocation is bad at showing results in a pretty way.

In [None]:
def column_swap(column):
    column = column.sort_values(ascending = False)
    return column.index

def topic_words_df(lda_model, vectorizer):
    '''
    Generate dataframe of words associated with a topic model.
    '''
    
    word_topic_scores = lda_model.components_.T
    vocabulary        = vectorizer.get_feature_names_out()
    
    
    topic_words_df = pd.DataFrame(word_topic_scores,
                                  index = vocabulary)
    
    topic_words_df = topic_words_df.apply(column_swap).reset_index(drop = True).rename_axis('rank')
    
    topic_words_df.index = topic_words_df.index + 1
    
    return topic_words_df

In [None]:
topic_words_df(lda, vectorizer).head(10)

<div class="alert alert-info">
<h3> Your turn</h3>
<p> As a group, try different options for your vectorizer and number of topics. What set of parameters creates the most coherent topics?

</div>



<div class="alert alert-info">
<h3> Your turn</h3>
<p> What were the major themes in Donald Trump campaign speeches?
</div>



What documents are associated with what topics?

In [None]:
wine_topics = lda.transform(review_word_counts)

In [None]:
wine_topics

In [None]:
pd.DataFrame(wine_topics).head(10)

We can now use our topics as features

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors = 3, weights = 'distance')

knn_classifier.fit(wine_topics, wine_df['rating'])

In [None]:
train_prediction = knn_classifier.predict(wine_topics)

In [None]:
print(accuracy_score(wine_df['rating'], train_prediction))



In [None]:
test_tf     = vectorizer.transform(wine_df_test['description'])
test_topics = lda.transform(test_tf)
test_prediction = knn_classifier.predict(test_topics)

In [None]:
print(accuracy_score(wine_df_test['rating'], test_prediction))


<div class="alert alert-info">
<h3> Your turn</h3>
<p> Using your best topic model, what is the prediction rate for your best k nearest neighbors model?

</div>


Let's do it again, but with a different data set

In [None]:
bg_df = pd.read_csv('data/boardgames.csv')

In [None]:
bg_df.info()

In [None]:
bg_df.head()

<div class="alert alert-info">
<h3> Your turn</h3>
<p> Load up this dataset in your other workbook. Topic model the game descriptions.

</div>



In [None]:
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
vectorizer = CountVectorizer(max_df=.6,
                             min_df=.01,
                             stop_words= 'english')

In [None]:
vectorizer.fit(bg_df['description'])

In [None]:
bg_wf = vectorizer.transform(bg_df['description'])

In [None]:
pd.DataFrame(bg_wf.toarray(), columns=vectorizer.get_feature_names_out()).sum().sort_values().tail()

In [None]:
len(vectorizer.get_feature_names_out())

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
lda = LatentDirichletAllocation(n_components    = 10,
                                n_jobs          = -1,
                                learning_method = 'online')

In [None]:
lda.fit(bg_wf)

In [None]:
topics = lda.transform(bg_wf)

In [None]:
topics

In [None]:
pd.DataFrame(topics)

In [None]:
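# document_topics() is never defined in this notebook; the lda_predict()
# helper defined further down wraps the same lda.transform() call.
pd.DataFrame(lda.transform(bg_wf))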

In [None]:
def column_swap(column):
    column = column.sort_values(ascending = False)
    return column.index

def topic_words_df(lda_model, vectorizer):
    '''
    Generate dataframe of words associated with a topic model.
    '''
    
    word_topic_scores = lda_model.components_.T
    vocabulary        = vectorizer.get_feature_names_out()
    
    
    topic_words_df = pd.DataFrame(word_topic_scores,
                                  index = vocabulary)
    
    topic_words_df = topic_words_df.apply(column_swap).reset_index(drop = True).rename_axis('rank')
    
    topic_words_df.index = topic_words_df.index + 1
    
    return topic_words_df




In [None]:
top_words = topic_words_df(lda, vectorizer)



In [None]:
top_words.head(10)

In [None]:
def lda_predict(model, tf_matrix):
    prediction = model.transform(tf_matrix)
    return pd.DataFrame(prediction)

In [None]:
lda_predict(lda, bg_wf)

What about a different method?

In [None]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.tree import export_graphviz
from IPython.display import Image



dtc = DecisionTreeClassifier(max_depth = 3,         # Allow at most three levels of splits.
                             min_samples_leaf = 10) # Require at least 10 samples in each leaf.




In [None]:
x_names = ['max_players', 'min_players', 'min_playtime', 'max_playtime', 'min_age']

dtc.fit(bg_df[x_names], bg_df['quality_game'])



In [None]:
export_graphviz(dtc, 
                out_file='dtc.dot', 
                feature_names=x_names)
                
!dot -Tpng dtc.dot -o  dtc.png
Image(filename='dtc.png') 

In [None]:
from sklearn.ensemble import RandomForestClassifier




In [None]:
rf = RandomForestClassifier()
rf

In [None]:
rf.fit(bg_df[x_names], bg_df['quality_game'])



In [None]:
imp = pd.DataFrame(rf.feature_importances_, index = x_names)
imp

In [None]:

# Google "sklearn random forest"
from sklearn.model_selection import GridSearchCV

param_dist = {"max_features"      : [4],
              "min_samples_split" : [10],
              "class_weight"      : ["balanced", None],
              "n_estimators"      : [3, 5, 10, 15, 25, 50]}


rfgs = GridSearchCV( RandomForestClassifier(),
                  param_dist, 
                  cv = 5,                  
                  verbose=1 )

In [None]:
rfgs.fit(bg_df[x_names], bg_df['quality_game'])



In [None]:
rfgs.best_estimator_

In [None]:
rf_best = rfgs.best_estimator_
rf_best.get_params()

In [None]:
results = pd.DataFrame(rfgs.cv_results_)

results

<div class="alert alert-info">
<h3> Your super big challenge</h3>
<p> You want to make a quality game. Based on this dataset, what sort of game should you make? Use a random forest model to find the best setup parameters.
<p> Bonus challenge: Use both features in the data set and ones you construct from a topic model!

</div>
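For the bonus challenge, one possible approach is to turn the document-topic weights into columns and join them to the existing features before fitting. The sketch below uses a made-up six-game frame; the descriptions, values, and `toy_` names are all assumptions, and only the column names `min_players`, `max_playtime`, and `quality_game` mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature version of bg_df.
toy_games = pd.DataFrame({
    'description': ["roll dice and move armies across the map",
                    "trade cards and build a trading empire",
                    "armies battle with dice on a hex map",
                    "collect cards and trade goods for points",
                    "move armies and roll dice to battle",
                    "build an empire by trading goods and cards"],
    'min_players':  [2, 3, 2, 2, 2, 3],
    'max_playtime': [180, 60, 240, 45, 120, 90],
    'quality_game': [0, 1, 0, 1, 0, 1]})

# Topic model the descriptions.
toy_counts = CountVectorizer(stop_words='english').fit_transform(toy_games['description'])
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = pd.DataFrame(toy_lda.fit_transform(toy_counts),
                          columns=['topic_0', 'topic_1'],
                          index=toy_games.index)

# Put the existing features and the topic weights side by side.
toy_features = pd.concat([toy_games[['min_players', 'max_playtime']], toy_topics],
                         axis=1)

toy_rf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_rf.fit(toy_features, toy_games['quality_game'])

print(pd.Series(toy_rf.feature_importances_, index=toy_features.columns))
```

Sharing the frame's index between the topic columns and the original columns keeps each game's topic weights aligned with its own row.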


In [None]:
categories = ['category_cardgame',
       'category_wargame', 'category_fantasy', 'category_dice',
       'category_partygame', 'category_fighting', 'category_sciencefiction',
       'category_abstractstrategy', 'category_economic',
       'category_childrensgame', 'category_worldwarii', 'category_bluffing',
       'category_animals', 'category_humor', 'category_actiondexterity',
       'category_adventure', 'category_moviestvradiotheme',
       'category_medieval', 'category_deduction', 'category_miniatures']

mechanics = ['mechanic_dicerolling', 'mechanic_handmanagement',
       'mechanic_hexandcounter', 'mechanic_setcollection',
       'mechanic_variableplayerpowers', 'mechanic_none',
       'mechanic_tileplacement', 'mechanic_modularboard',
       'mechanic_carddrafting', 'mechanic_rollspinandmove',
       'mechanic_areacontrolareainfluence', 'mechanic_auctionbidding',
       'mechanic_simulation', 'mechanic_areamovement',
       'mechanic_simultaneousactionselection',
       'mechanic_actionpointallowancesystem', 'mechanic_cooperativeplay',
       'mechanic_pointtopointmovement', 'mechanic_partnerships',
       'mechanic_memory']

In [None]:
rf_prediction = rf_best.predict_proba(bg_df[x_names])

In [None]:
from sklearn.calibration import calibration_curve


def calplot(y_observed, y_predicted):
    rf_y, rf_x = calibration_curve(y_observed, y_predicted[:,1], n_bins=10)
    pd.DataFrame([rf_x , rf_y]).T.plot.scatter(x=0, y=1, figsize = (5,5))

In [None]:
calplot(bg_df['quality_game'], rf_prediction)

In [None]:
idf = pd.Series(rf_best.feature_importances_, index = x_names)

idf.sort_values()

In [None]:
idf.sort_values().plot(kind='barh')

In [None]:
bg_df.keys()