### Summary from part 2

We start by reloading the reviews data from the CSV file.

In [None]:
import pandas as pd

In [None]:
reviews_df = pd.read_csv('yelp_reviews_clemson_sc.csv')

In [None]:
reviews_df.head()

We also load the following libraries to help with the descriptive analysis.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Remove invalid rows

In [None]:
reviews_df = reviews_df.dropna()
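
One detail worth keeping in mind after `dropna()`: the surviving rows keep their original index labels, so the index can have gaps. The sketch below uses a small made-up table (the column values are hypothetical, not taken from the Yelp file) to show the effect and one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews table
toy = pd.DataFrame({"restaurant": ["A", None, "B", "C"],
                    "rating": ["5.0", "4.0", None, "3.0"]})

# Rows 1 and 2 each contain a missing value and are dropped
dropped = toy.dropna()
print(list(dropped.index))     # [0, 3]

# Optional: renumber the rows 0..n-1 so label-based access such as df.col[i] is safe
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```

Whether to reset the index is a design choice; keeping the original labels makes it easier to trace rows back to the raw file.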

#### Remove unnecessary symbols and phrases

In [None]:
clean_reviews_df = pd.DataFrame()

In [None]:
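# Print the first 20 raw restaurant values.
# head() selects by position, so it still works if dropna() left gaps in the index.
for name in reviews_df.restaurant.head(20):
    print(name)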

In [None]:
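# Slice off the leading and trailing bracket/quote characters left over from scraping
clean_reviews_df['Review'] = reviews_df.Review.str[2:-2]
clean_reviews_df['author'] = reviews_df.author.str[2:-2]
# Keep only the date portion of the raw string, then trim stray characters at both ends
clean_reviews_df['date'] = reviews_df.date.str[12:22].str.lstrip(' ').str.rstrip('\\n')
clean_reviews_df['rating'] = reviews_df.rating.str[2:5]
clean_reviews_df['restaurant'] = reviews_df.restaurant.str.replace("', '", " ").str[2:-2]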

In [None]:
clean_reviews_df.head(15)
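
The cleaning step relies on two string tools: fixed-position slicing and `lstrip`/`rstrip`. A minimal sketch of how each behaves is shown below. The raw values are hypothetical and the real file's formatting may differ. Note that the strip methods take a set of characters rather than a literal prefix or suffix; `str.removeprefix` (Python 3.9+) is one alternative when a literal prefix is meant.

```python
import pandas as pd

# Hypothetical raw values; the real file's formatting may differ
raw = pd.Series(["['Great tacos']", "['Slow service']"])

# Slicing drops a fixed number of characters from each end
sliced = raw.str[2:-2]
print(sliced.tolist())   # ['Great tacos', 'Slow service']

# lstrip/rstrip remove ANY leading/trailing characters found in the given set
eaten = "one, 'enoteca".lstrip("one, '")
print(eaten)             # teca  (the name's own leading letters are stripped too)

trimmed = "brown\\n".rstrip("\\n")
print(trimmed)           # brow  (both the backslash and every trailing 'n' go)

# removeprefix only removes the exact literal prefix
exact = "one, 'enoteca".removeprefix("one, '")
print(exact)             # enoteca
```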

In [None]:
clean_reviews_df['score'] = clean_reviews_df.rating.astype(float)

In [None]:
clean_reviews_df.describe()

In [None]:
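# Review length measured in non-space characters
clean_reviews_df['review_length'] = clean_reviews_df['Review'].apply(lambda x: len(x) - x.count(' '))
# Split 'month/day/year' strings into three columns.
# This assumes every date has exactly three parts; unpacking the .str accessor no longer works in recent pandas.
clean_reviews_df[['Month', 'Day', 'Year']] = clean_reviews_df['date'].str.split('/', expand=True)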

In [None]:
clean_reviews_df.head(15)
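
As a small illustration of the date handling, the sketch below splits slash-separated dates into separate columns and, as an alternative, parses them into real datetimes. The sample dates are hypothetical and assume the month/day/year layout used above.

```python
import pandas as pd

# Hypothetical dates in the month/day/year layout assumed by the notebook
dates = pd.Series(["3/14/2015", "11/2/2012"])

# expand=True returns one column per piece instead of a Series of lists
parts = dates.str.split("/", expand=True)
parts.columns = ["Month", "Day", "Year"]
parts = parts.astype(int)
print(parts)

# Alternative: parse to datetimes and read the components from the .dt accessor
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
print(parsed.dt.year.tolist())   # [2015, 2012]
```

Parsing to datetimes keeps the values sortable and avoids a later `astype(int)` on the year, at the cost of failing loudly on malformed dates.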

Next, we remove restaurants with too few reviews and keep only those with more than 40.

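In [None]:
# Tidy the restaurant names before counting, so the names in the count index
# match the names used for filtering below.
# Note: lstrip() removes any leading characters from the given set, not a literal prefix.
clean_reviews_df['restaurant'] = clean_reviews_df.restaurant.str.lstrip("one, '")

In [None]:
restaurants = clean_reviews_df.groupby('restaurant').rating.count()
restaurants

In [None]:
# Keep only restaurants with more than 40 reviews
restaurants = restaurants[restaurants > 40]
restaurants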

In [None]:
list(restaurants.index)

In [None]:
restaurant_df = clean_reviews_df[clean_reviews_df.restaurant.isin(list(restaurants.index))]
restaurant_df
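
The same kind of filter can also be written without an intermediate list of names. The sketch below is one possible alternative on made-up data, using `transform('count')` to attach each group's size to its rows. The threshold and the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: A has 3 reviews, B has 1, C has 2
toy = pd.DataFrame({"restaurant": ["A", "A", "A", "B", "C", "C"],
                    "score": [5.0, 4.0, 3.0, 2.0, 4.0, 5.0]})

min_reviews = 2  # hypothetical threshold; the notebook uses 40

# transform('count') gives every row the number of reviews in its own group
group_size = toy.groupby("restaurant")["score"].transform("count")
kept = toy[group_size > min_reviews]
print(kept["restaurant"].unique().tolist())   # ['A']
```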

#### First analysis: Visualizing review lengths

In [None]:
hist = sns.FacetGrid(data=restaurant_df, col='rating')
hist.map(plt.hist, 'review_length', bins=50)

This looks interesting, but the differences between ratings are hard to see. Let's try another view.

In [None]:
sns.boxplot(x='rating', y='review_length', data=restaurant_df)
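
A box plot draws the quartiles of each group, so the exact values behind the boxes can be read off with a grouped quantile calculation. The sketch below shows the idea on made-up numbers; the toy ratings and lengths are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"rating": ["5.0", "5.0", "1.0", "1.0", "1.0"],
                    "review_length": [100, 300, 400, 600, 800]})

# One row per rating, one column per quartile (0.25, 0.5, 0.75)
summary = (toy.groupby("rating")["review_length"]
              .quantile([0.25, 0.5, 0.75])
              .unstack())
print(summary)
```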

#### Yet another graphical library: plotnine (ggplot for Python)

In [None]:
!pip install plotnine

In [None]:
%matplotlib inline
import plotnine as p9
p9.options.figure_size = (8, 6)

#### Plotting distributions

In [None]:
(p9.ggplot(data=restaurant_df,
           mapping=p9.aes(x='restaurant',
                          y='score',color='restaurant'))
    + p9.geom_jitter(alpha=0.2)
    + p9.geom_boxplot(alpha=0.)
    + p9.theme_bw()
    + p9.theme(axis_text_x = p9.element_text(angle=90))
)

#### How have the review ratings evolved over time for the restaurants?

In [None]:
yearly_counts = restaurant_df.groupby(['Year', 'restaurant'])['score'].mean()
yearly_counts.head()

Checking the result of the previous calculation, we see that both the year and the restaurant form the row index. We can reset this index to turn both into regular columns:

In [None]:
yearly_counts = yearly_counts.reset_index(name='score')
yearly_counts['Year'] = yearly_counts.Year.astype(int)
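
To make the index behaviour concrete, the sketch below repeats the group-then-reset pattern on a tiny made-up table (all values are hypothetical). Grouping by two columns produces a two-level row index, and `reset_index` turns those levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Year": ["2014", "2014", "2015", "2015"],
                    "restaurant": ["A", "B", "A", "A"],
                    "score": [4.0, 2.0, 5.0, 3.0]})

means = toy.groupby(["Year", "restaurant"])["score"].mean()
print(list(means.index.names))   # ['Year', 'restaurant']

# name='score' labels the value column of the resulting DataFrame
flat = means.reset_index(name="score")
print(flat.columns.tolist())     # ['Year', 'restaurant', 'score']
```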

In [None]:
yearly_counts=yearly_counts[yearly_counts.Year > 2007]

In [None]:
(p9.ggplot(data=yearly_counts,
           mapping=p9.aes(x='Year',
                          y='score'))
    + p9.geom_line()
)

Unfortunately this does not work, because the data for all restaurants are drawn as a single line. We need to tell plotnine to draw one line per restaurant by mapping the restaurant to the color aesthetic:

In [None]:
(p9.ggplot(data=yearly_counts,
           mapping=p9.aes(x='Year',
                          y='score',
                          color='restaurant'))
    + p9.geom_point()
    + p9.geom_line()
)
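
A quick way to see why a single line cannot work is to count how many scores share each year, and then pivot the table so that each restaurant gets its own column. This is only a pandas-side sketch on made-up values, not part of the plotting code.

```python
import pandas as pd

# Hypothetical yearly mean scores for two restaurants
flat = pd.DataFrame({"Year": [2014, 2014, 2015, 2015],
                     "restaurant": ["A", "B", "A", "B"],
                     "score": [4.0, 2.0, 4.5, 3.0]})

# Several scores per year: one ungrouped line would have to zigzag between them
per_year = flat.groupby("Year")["score"].count()
print(per_year.tolist())   # [2, 2]

# One column per restaurant corresponds to one line per restaurant
wide = flat.pivot(index="Year", columns="restaurant", values="score")
print(wide)
```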

#### Faceting

In [None]:
(p9.ggplot(data=restaurant_df,
           mapping=p9.aes(x='score',
                          y='review_length',
                          color='Year'))
    + p9.geom_point(alpha=0.1)
)

In [None]:
(p9.ggplot(data=restaurant_df,
           mapping=p9.aes(x='score',
                          y='review_length',
                          color='Year'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_wrap("restaurant")
    + p9.theme(strip_text_x = p9.element_text(size = 4, colour = "red"))
)