**This notebook analyzes data from [this Kaggle competition](https://www.kaggle.com/c/whats-cooking/overview)**

### Import some important libraries here

In [None]:
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
import typing as ty
import os

# Data analysis

### Let's download and unzip our data from Kaggle:

https://www.kaggle.com/c/whats-cooking/data

### Let's try to evaluate some bash commands from Jupyter Notebook:

#### Unzip data

In [None]:
!unzip -l whats-cooking.zip

In [None]:
!unzip whats-cooking.zip test.json.zip train.json.zip sample_submission.csv.zip

In [None]:
for i in ['train.json', 'test.json', 'sample_submission.csv']:
    assert i in os.listdir('./')

#### Remove unnecessary .zip archives:

In [None]:
!rm test.json.zip train.json.zip sample_submission.csv.zip

Let's meet the [most common package for working with tabular data](https://pandas.pydata.org/pandas-docs/stable/index.html)

In [None]:
pd.read_csv('./sample_submission.csv')

#### Let's try to submit `sample_submission.csv` to [Kaggle](https://www.kaggle.com/c/whats-cooking/submit)

Wait: test, train, public, private... what is going on?

![Picture about private and public dataset](https://storage.googleapis.com/kaggle-forum-message-attachments/543450/13399/Untitled.jpg "Title")

**Let's open the JSON files and check their sizes**

In [None]:
train_X = <YOUR CODE HERE>

In [None]:
test_X = <YOUR CODE HERE>

In [None]:
assert len(train_X) == 39774
assert len(test_X) == 9944

In [None]:
assert isinstance(test_X[0], dict)

### Let's look at the distribution of cuisines

In [None]:
cuisines_distribution = <YOUR CODE>

In [None]:
assert len(cuisines_distribution) == 20

### Let's visualize the distribution with a bar plot

**Check [Matplotlib Gallery](https://matplotlib.org/3.3.3/gallery/index.html)**

**[Horizontal Bar Chart](https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/barh.html#sphx-glr-gallery-lines-bars-and-markers-barh-py) is what we need!**

In [None]:
fig, ax = plt.subplots(figsize=(20, 9))

<YOUR CODE>
ax.set_yticks(<YOUR CODE>)
ax.set_yticklabels(<YOUR CODE>)
ax.invert_yaxis()  # labels read top-to-bottom

ax.set_xlabel('Number of recipes')
ax.set_title('Cuisines by number of recipes')

#### Let's check the share of each cuisine, in percent:

In [None]:
<YOUR CODE>

### Let's talk about the scoring metric of this competition

#### Here we are solving a multiclass classification problem

![Multiclass problem](https://miro.medium.com/max/972/1*SwXHlCzh-d9UqHOglp3vcA.png)

#### Accuracy - Pros & Cons

![Multiclass Classification](http://gabrielelanaro.github.io/public/post_resources/multiclass/text4384.png)

![Acc formula](https://miro.medium.com/max/2868/1*WGK_3mj_KBZh9yTiLXGh-A.png)

![Acc Matrix](https://miro.medium.com/max/1064/1*5XuZ_86Rfce3qyLt7XMlhw.png)

**Graphically, it is the sum of the elements on the diagonal of the confusion matrix divided by the total number of predictions made.**

In the example above:

accuracy = (2+2+3) / (2+2+1+2+3)

In [None]:
(2+2+3) / (2+2+1+2+3)
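
The diagonal-over-total rule can be sketched in a few lines of plain Python. The 3x3 matrix below is hypothetical: its diagonal (2, 2, 3) and its total of 10 predictions match the numbers used above, but the positions of the off-diagonal errors are an assumption made purely for illustration.

```python
# Hypothetical confusion matrix: rows are true classes, columns are predicted classes.
confusion = [
    [2, 1, 0],
    [0, 2, 0],
    [2, 0, 3],
]

# Correct predictions sit on the main diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
# Every cell counts one or more predictions, so the grand total is the sum of all cells.
total = sum(sum(row) for row in confusion)

accuracy = correct / total
print(correct, total, accuracy)
```

Moving the errors to other off-diagonal cells would leave the accuracy unchanged, because only the diagonal and the total enter the formula.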

**Thus, what's the problem with Accuracy?**

In [None]:
rus_cuisine_df = <YOUR CODE>

In [None]:
rus_cuisine_df.to_csv('constant_submission_rus.csv', index=False)

**Let's submit our patriotic submission to Kaggle and check the score :)**
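
One way to see the weakness of accuracy is a constant predictor on imbalanced data: it scores exactly the share of the class it always predicts, whatever the other classes look like. The sketch below uses a small, made-up label list (not the competition data) to illustrate this.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced labels (not the competition data).
y_true = ["italian"] * 8 + ["mexican"] * 1 + ["russian"] * 1


def constant_accuracy(labels, constant_label):
    """Accuracy of a model that predicts `constant_label` for every sample."""
    return sum(label == constant_label for label in labels) / len(labels)


majority_class, _ = Counter(y_true).most_common(1)[0]

print(majority_class, constant_accuracy(y_true, majority_class))
print("russian", constant_accuracy(y_true, "russian"))
```

Here the majority-class constant reaches a high accuracy without learning anything, while the constant for a rare class scores close to zero.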

#### Do we have any alternatives? (In theory?)
**Of course! But we can't choose the metric for a Kaggle competition**

![Precision & Recall](http://gabrielelanaro.github.io/public/post_resources/multiclass/text4385.png)
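
As a rough illustration of per-class precision and recall, here is a minimal pure-Python sketch. The labels and the helper name `precision_recall` are hypothetical; in practice scikit-learn provides ready-made metrics for this.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true and predicted cuisines.
y_true = ["greek", "greek", "greek", "russian", "russian", "italian"]
y_pred = ["greek", "greek", "italian", "russian", "greek", "italian"]

for cuisine in ["greek", "russian", "italian"]:
    print(cuisine, precision_recall(y_true, y_pred, cuisine))
```

Unlike accuracy, these numbers are reported per class, so a model that ignores a rare cuisine shows up as a recall of zero for that cuisine.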

### Let's use another tool for text analysis: [WordCloud](https://github.com/amueller/word_cloud)

In [None]:
from wordcloud import WordCloud

### Let's check the minimal example from the [official site](http://amueller.github.io/word_cloud/auto_examples/simple.html#sphx-glr-auto-examples-simple-py)

**Documentation describing WordCloud as a [Python object](http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud)**

In [None]:
def draw_wordcloud_by_text(text, title, is_save_to_file=False):
    <YOUR CODE>

    plt.figure(figsize=(10, 8))
    plt.title(title)
    <YOUR CODE>
    plt.axis("off")
    plt.show()

In [None]:
draw_wordcloud_by_text('Hello, world! The world is on fire!', 'Example')

### Let's concatenate all ingredients into one big string and visualize it:

In [None]:
ingridients_by_cusinies = <YOUR CODE>

In [None]:
assert 'Maggi' in ingridients_by_cusinies['russian']
assert 'mayonaise' in ingridients_by_cusinies['russian']
assert 'feta cheese crumbles' in ingridients_by_cusinies['greek']
assert 'flat leaf parsley' in ingridients_by_cusinies['greek']

In [None]:
for cuisine_name, ingridients_list in ingridients_by_cusinies.items():
    <YOUR CODE>

## We can use pandas to read JSON files

pandas works not only with .csv files but also with [JSON](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)

In [None]:
train_X_df = pd.read_json('./train.json', encoding='utf8').set_index('id')
test_X_df = pd.read_json('./test.json', encoding='utf8').set_index('id')

In [None]:
train_X_df.head()

In [None]:
assert train_X_df.shape == (39774, 2)
assert test_X_df.shape == (9944, 1)

## Let's analyze the relationship between the number of ingredients per recipe and the cuisine

A very convenient plot type for this kind of analysis is the [boxplot](https://en.wikipedia.org/wiki/Box_plot).

[Matplotlib implementation of boxplot](https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.boxplot.html)

### Let's generate a new feature, 'amount_of_ingridients'

**The [apply method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) may help you here**

In [None]:
train_X_df['amount_of_ingridients'] = <YOUR CODE>

In [None]:
train_X_df.head()

In [None]:
assert train_X_df.iloc[14]['amount_of_ingridients'] == 12
assert train_X_df.iloc[1]['amount_of_ingridients'] == 11
assert train_X_df.iloc[-6]['amount_of_ingridients'] == 10

In [None]:
cuisines = <YOUR CODE>
recipe_ing_amount_sequence = <YOUR CODE>

**Let's visualize using a [boxplot](https://ru.wikipedia.org/wiki/%D0%AF%D1%89%D0%B8%D0%BA_%D1%81_%D1%83%D1%81%D0%B0%D0%BC%D0%B8)**

![boxplot](https://www.simplypsychology.org/boxplot.jpg)

**Let's check boxplot [demo](https://matplotlib.org/3.3.3/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py) from matplotlib**
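
To make the picture concrete, the numbers a boxplot draws can be computed by hand. The sketch below uses hypothetical ingredient counts (not taken from the dataset) and the common 1.5 x IQR rule for outliers, which is also matplotlib's default for whiskers; the quartile interpolation method chosen here is an assumption and other tools may differ slightly.

```python
from statistics import quantiles

# Hypothetical ingredient counts for one cuisine (not taken from the dataset).
counts = [3, 5, 7, 8, 9, 11, 12, 13, 30]

# Box: first quartile, median, third quartile.
q1, median, q3 = quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in counts if c < lower_fence or c > upper_fence]

print(q1, median, q3, iqr, outliers)
```

Comparing these five-number summaries across cuisines is exactly what the side-by-side boxplots let us do visually.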

In [None]:
# Sort the cuisines by any summary function you like
sorted_ind = np.argsort([*map(<YOUR CODE>, recipe_ing_amount_sequence)])
fig = plt.figure(figsize=(14, 8))
# Create the axes instance
ax = fig.add_axes([0, 0, 1, 1])

# Create the plot
<YOUR CODE>

### There are many great visualization libraries besides matplotlib, for instance [seaborn](https://seaborn.pydata.org/)

In [None]:
import seaborn as sns

f, ax = plt.subplots(figsize=(20, 8))
sns.boxplot(x='cuisine',
            y='amount_of_ingridients',
            data=train_X_df)

# Extracting features from data

### Let's remove the column with the number of ingredients and move on

In [None]:
<YOUR CODE>

### Let's convert the list of ingredients to a string

In [None]:
# Which separator works best?
<YOUR CODE>

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
<YOUR CODE>

In [None]:
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

### CountVectorizer example (Bag of words)

![](https://miro.medium.com/max/1239/1*WN18F5oVHKzf_DXcCpSFiQ.png)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

In [None]:
X.toarray()
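
To show what a bag-of-words matrix contains, here is a minimal pure-Python sketch. It is not how `CountVectorizer` is implemented; the recipes are hypothetical, and splitting on `'#'` is an assumption about how the ingredients were joined into one string.

```python
from collections import Counter

# Hypothetical recipes, with ingredients joined by an assumed '#' separator.
recipes = [
    "feta cheese crumbles#olive oil#tomatoes",
    "olive oil#garlic#tomatoes#tomatoes",
]

# Tokenize: each whole ingredient is one token.
tokenized = [recipe.split("#") for recipe in recipes]

# Vocabulary: every distinct token, in a fixed (sorted) column order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Matrix: one row per recipe, one column per vocabulary entry, cells are counts.
matrix = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    matrix.append([token_counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

Treating a whole ingredient as one token keeps multi-word names such as "olive oil" together, which word-level tokenization would split apart.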

### Apply to our data:

In [None]:
vectorizer = <YOUR CODE>
X_train_vec = <YOUR CODE>

In [None]:
len(vectorizer.get_feature_names_out())

In [None]:
vectorizer.get_feature_names_out()[::500]

In [None]:
X_train_vec.shape

In [None]:
X_train_vec.todense()  # convert the sparse matrix to a dense one

In [None]:
indices_first_receipe = <YOUR CODE>

In [None]:
np.array(vectorizer.get_feature_names_out())[indices_first_receipe]

In [None]:
X_train.iloc[0]['ingredients'].split('#')

## Check that the tokenization is correct:

In [None]:
assert (np.sort(np.array(vectorizer.get_feature_names_out())[indices_first_receipe]) ==
        np.sort(np.array(X_train.iloc[0]['ingredients'].split('#')))).all()

### Let's inspect our X matrix as a DataFrame

In [None]:
bag_of_words_df = pd.DataFrame(X_train_vec.todense(),
                               columns=vectorizer.get_feature_names_out())
bag_of_words_df.head()

### Let's check that we converted our data to a matrix correctly:

In [None]:
bag_of_words_df[bag_of_words_df['(   oz.) tomato paste'] != 0].index

In [None]:
train_X_df.iloc[bag_of_words_df[bag_of_words_df['(   oz.) tomato paste'] != 0].index]['cuisine']

## Let's add some ~magic~ Machine Learning here! 🔮🔮🔮

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
X_train_vec = <YOUR CODE>
X_val_vec = <YOUR CODE>

In [None]:
print(X_train_vec.shape, X_val_vec.shape)

In [None]:
type(X_train_vec)

In [None]:
reg = <YOUR CODE>

![machine learns](https://resources.hacware.com/content/images/2020/08/memeticMemoryheader.png)

In [None]:
y_val_proba = <YOUR CODE>

### Compute the metric (multiclass accuracy)

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

In [None]:
accuracy_score(y_val, y_val_proba)

In [None]:
assert accuracy_score(y_val, y_val_proba) > 0.7

### Check the confusion matrix

**The most convenient way to visualize a confusion matrix is [ConfusionMatrixDisplay.from_estimator](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html), but you can also visualize it yourself :) You can get the matrix with [this method](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)**

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_estimator(reg, X_val_vec, y_val, xticks_rotation='vertical', ax=ax)
plt.show()

**Let's discuss: which cuisines do we mix up most often?**

### Wow! Impressive!

### Let's predict on the test set and submit to Kaggle

In [None]:
test_X_df.head()

In [None]:
test_X = <YOUR CODE>

In [None]:
test_X.shape

In [None]:
test_proba = reg.predict(test_X)
test_proba[::1000]

In [None]:
sumbit_df = test_X_df.copy()
<YOUR CODE>

In [None]:
sumbit_df.to_csv('./bag_of_words_log_regr.csv', index=False)

## Interactive part (let's cook!) (and check our intuition)

In [None]:
test_X_df['ingredients'].to_numpy()[0]

**Can our bag-of-words approach handle every ingredient we can think of?**

In [None]:
def is_food_in_our_bag_of_words(food_name: str, vect_features_hash_map: ty.Set[str]) -> str:
    <YOUR CODE>

In [None]:
assert is_food_in_our_bag_of_words('pierogi', vect_features_hash_map) == 'Ok!'

In [None]:
for i in []:
    is_food_in_our_bag_of_words(i, vect_features_hash_map)

In [None]:
whatnot_reciepe = []
whatnot_vec = vectorizer.transform(['#'.join(whatnot_reciepe)])

## Let's try a Decision Tree and search over its hyperparameters:

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

**[An excellent explanation of Gini impurity (in English)](https://victorzhou.com/blog/gini-impurity/)**
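
For reference, Gini impurity of a set of labels is one minus the sum of squared class shares. The helper below is a minimal sketch with a hypothetical name, `gini_impurity`, and made-up labels; a decision tree evaluates this quantity on the samples in each candidate split.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class shares p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# A pure node has impurity 0; an even two-class mix has impurity 0.5.
print(gini_impurity(["greek", "greek", "greek"]))
print(gini_impurity(["greek", "russian"]))
print(gini_impurity(["greek", "greek", "russian", "italian"]))
```

The lower the impurity of the child nodes, the better a split separates the cuisines.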

**To iterate over hyperparameters, researchers often use [GridSearchCV from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)**

In [None]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV, KFold

cv = KFold(n_splits=3, shuffle=True, random_state=14)
max_depth_array = [2, 4, 8, 16, 32, 64, 128, 256, 512]

decision_tree = tree.DecisionTreeClassifier(random_state=42)
grid_dtc = <YOUR CODE>

grid_dtc.fit(X_train_vec, y_train.to_numpy())

### Check feature importance of the tree

In [None]:
importances = grid_dtc.best_estimator_.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(min(10, len(indices))):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f],
                                      bag_of_words_df.columns[indices[f]],
                                      importances[indices[f]]))

In [None]:
def makePlotOfResultsOfGridSearchCVDependsOnParam(method_name, hyperparam_name, X_axis_label, cv_results_df,
                                                  type_of_x_values=float, is_xticks_rotated=False):
    plt.figure(figsize=(7, 7))
    plt.title("GridSearchCV evaluating result ({} method):".format(method_name), fontsize=16)
    ax = plt.gca()

    X_axis_data = np.array(cv_results_df[hyperparam_name], dtype=type_of_x_values)
    print(X_axis_data)
    test_mean = np.array(cv_results_df['mean_test_score'], dtype=float)
    train_mean = np.array(cv_results_df['mean_train_score'], dtype=float)
    test_std_error = np.array(cv_results_df['std_test_score'], dtype=float)
    train_std_error = np.array(cv_results_df['std_train_score'], dtype=float)

    ax.fill_between(X_axis_data, test_mean - test_std_error,
                    test_mean + test_std_error,
                    alpha=0.4, color='red', label='Std error on test data')

    ax.fill_between(X_axis_data, train_mean - train_std_error,
                    train_mean + train_std_error,
                    alpha=0.4, color='green', label='Std error on train data')

    ax.plot(X_axis_data, test_mean,
            alpha=1,
            label="test_mean")
    ax.plot(X_axis_data, train_mean,
            alpha=1,
            label="train_mean")

    plt.xlabel(X_axis_label)
    plt.ylabel('Accuracy')
    plt.legend(loc=1)

    if is_xticks_rotated:
        plt.xticks(rotation=90)

    plt.grid(True)
    plt.show()

In [None]:
makePlotOfResultsOfGridSearchCVDependsOnParam('Decision Tree',
                                              'param_max_depth',
                                              'Depth of trees',
                                              grid_dtc.cv_results_
                                             )

**What can we see here?**
<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>



















**(SPOILER)**
### Looks like overfitting

### How do we fight it?

**We can reduce the complexity of our model, e.g. fix the depth at its optimal value**

[Decision Tree Classifier sklearn Docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
min_samples_leaf_array = np.arange(1, 20)
decision_tree = tree.DecisionTreeClassifier(<YOUR CODE>)

grid_dtc = <YOUR CODE>

grid_dtc.fit(X_train_vec, y_train.to_numpy())

### What's next?

![whats next](https://freecontent.manning.com/wp-content/uploads/Elgendy_tCVPp4_06b.png)

- [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)
- [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [LSA](https://en.wikipedia.org/wiki/Latent_semantic_analysis)

...

NLP approaches

### Algorithms?

- SVM
- Boosting (XGBoost, CatBoost)
- Bagging (Random Forest)
- Neural Nets (of course!)

![Tom](https://storage.googleapis.com/kaggle-forum-message-attachments/703594/14673/DATAAC.jpeg)