# Introduction

Before running this notebook, run [Prepare data](Prepare%20data.ipynb) to create `../data/CleanedReviews.pickle`, which this notebook requires.

This notebook explores the food reviews data set. It currently covers the distributions of products and users by number of reviews, plus a chart of how the number of reviews evolves over time.

# Setup

In [None]:
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np

from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white", color_codes=True)

In [None]:
figsize = (16, 5)

## Data import

In [None]:
reviews = pd.read_pickle('../data/CleanedReviews.pickle')

In [None]:
reviews.info()

In [None]:
reviews.head(2)

# Products

In [None]:
ax = reviews.groupby(['ProductId']).size().hist(bins=np.arange(1, 50, 1), figsize=figsize)
ax.set_title('Histogram of the number of reviews by product')
ax.set_xlabel('Reviews')
ax.set_ylabel('Products')
plt.show()

In [None]:
# Named aggregation replaces the nested-dict renaming syntax, which pandas removed in 1.0.
# Sorting by Time first makes 'first' and 'last' the earliest and latest review per product.
products = (
    reviews.sort_values('Time')
    .groupby('ProductId')
    .agg(
        Reviews=('Score', 'size'),
        MeanScore=('Score', 'mean'),
        FirstTime=('Time', 'first'),
        LastTime=('Time', 'last'),
    )
    .reset_index()
)
products['Duration'] = products['LastTime'] - products['FirstTime']

## Product durations

The following chart shows the number of days between the first and the last review of each product. For a proper estimate of product durations (based on reviews), we could use [survival analysis](https://en.wikipedia.org/wiki/Survival_analysis).

In [None]:
ax = (products.loc[products['Reviews'] >= 5, 'Duration'] / timedelta(days=1)).hist(bins=100, figsize=figsize)
ax.set_title('Days between first review and last review for products with at least 5 reviews')
ax.set_xlabel('Duration (days)')
ax.set_ylabel('Products')
plt.show()
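The survival analysis idea mentioned above can be sketched with a hand-written Kaplan-Meier estimator. This is only an illustration under stated assumptions: the function name `kaplan_meier`, the toy durations, and the `observed` flags are hypothetical and do not come from the data set. A product is treated as censored when it may still be receiving reviews, so its true duration is only known to be at least the measured one.

```python
def kaplan_meier(durations, observed):
    """Return (time, survival probability) pairs for the Kaplan-Meier estimator.

    durations: days between the first and last review of each product.
    observed:  True if the product is assumed to have stopped receiving reviews,
               False if it is censored (possibly still active).
    """
    # Only times at which an observed "end" happens change the estimate.
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        # Products whose duration is at least t are still at risk at time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Made-up durations in days; the third product is treated as censored.
toy_durations = [1, 2, 3, 4]
toy_observed = [True, True, False, True]
curve = kaplan_meier(toy_durations, toy_observed)
```

In this notebook, `observed` could for instance be derived by marking products whose `LastTime` falls close to the end of the data set as censored. Where to put that cutoff is an assumption that would need to be justified.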

# Users

From the following chart, we can observe that most users wrote a single review.

In [None]:
ax = reviews.groupby(['UserId']).size().hist(bins=np.arange(1, 50, 1), figsize=figsize)
ax.set_title('Histogram of the number of reviews by user')
ax.set_xlabel('Reviews')
ax.set_ylabel('Users')
plt.show()
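The share of single-review users can also be checked numerically rather than read off the histogram. The sketch below uses a made-up `toy_reviews` frame as a stand-in for `reviews`; only the `UserId` column is assumed, and the values are illustrative, not taken from the data set.

```python
import pandas as pd

# Toy stand-in for the `reviews` frame; only the UserId column matters here.
toy_reviews = pd.DataFrame({'UserId': ['a', 'b', 'b', 'c', 'd', 'd', 'd']})

# Number of reviews written by each user.
reviews_per_user = toy_reviews.groupby('UserId').size()

# Fraction of users with exactly one review (mean of a boolean Series).
single_review_share = (reviews_per_user == 1).mean()
print(f'{single_review_share:.0%} of users wrote exactly one review')
```

Replacing `toy_reviews` with `reviews` should give the corresponding share for the real data.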

# Reviews

In [None]:
# pd.Grouper replaces pd.TimeGrouper, which has been removed from pandas.
ax = reviews.groupby(pd.Grouper(key='Time', freq='W')).size().plot(figsize=figsize)
ax.set_title('Reviews per week')
ax.set_xlabel('')
ax.set_ylabel('Reviews')
plt.show()