## Question 4: Reviewers

Is there a statistically significant difference in average score between reviewers? In other words, is there a clearly most critical and a clearly least critical reviewer?

For a comparable Bayesian approach applied to baseball, see: https://www.linkedin.com/pulse/worst-hitter-history-major-league-baseball-bayesian-approach-damico/

- Alternative Hypothesis: There will be one reviewer with significantly higher scores than other reviewers, and one reviewer with significantly lower scores than others.
- Null Hypothesis: All reviewers will have statistically similar review scores.
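
One way to test the null hypothesis above is a permutation test on the one-way ANOVA F statistic. The sketch below is only an illustration and not necessarily the analysis this notebook goes on to use: the reviewer names and scores are synthetic, and `f_statistic` and `permutation_p_value` are hypothetical helper names. It checks whether reviewers differ overall; singling out the most or least critical reviewer would need a follow-up comparison.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((s - m) ** 2
                    for g, m in zip(groups, group_means) for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of label shufflings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [s for g in groups for s in g]
    sizes = [len(g) for g in groups]
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Synthetic scores (assumption): three made-up reviewers, five reviews each.
scores_by_reviewer = {
    "reviewer_a": [5.0, 5.5, 4.8, 5.2, 5.1],
    "reviewer_b": [7.0, 7.2, 6.8, 7.1, 6.9],
    "reviewer_c": [8.5, 8.8, 8.4, 8.6, 8.7],
}

p_value = permutation_p_value(list(scores_by_reviewer.values()))
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value would lead us to reject the null hypothesis that all reviewers score similarly. A permutation test is used here because it makes no normality assumption about the scores.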


### ALTERNATIVE: Using Document Classification with Naive Bayes, can you predict the author of a review?

Steps:

- Remove common words (using a list of common stop words)
    - Francois helpfully sent me the word list over Slack
- Remove authors with fewer than 30 reviews (this only affects authors who have written just a few reviews for Pitchfork)
    - This leaves 119 candidate authors to predict, out of 432 total authors
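
The steps above, followed by a multinomial Naive Bayes classifier, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the stop word list is a tiny hypothetical stand-in for the real one, the threshold is 2 instead of 30 so the toy data stays small, the reviews are invented, and `tokenize`, `train`, and `predict` are hypothetical names. On the real data, a library implementation such as scikit-learn's `MultinomialNB` would likely be the more practical choice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the real stop word list.
STOP_WORDS = {"the", "a", "and", "of", "is", "are", "this", "with"}
MIN_REVIEWS = 2  # the notebook uses 30; lowered for the toy data


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def train(reviews):
    """Count words per author, skipping authors below MIN_REVIEWS."""
    review_counts = Counter(author for author, _ in reviews)
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for author, text in reviews:
        if review_counts[author] < MIN_REVIEWS:
            continue
        tokens = tokenize(text)
        word_counts[author].update(tokens)
        doc_counts[author] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the author with the highest log posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]
    best_author, best_score = None, -math.inf
    for author in doc_counts:
        total_words = sum(word_counts[author].values())
        score = math.log(doc_counts[author] / total_docs)  # log prior
        for token in tokens:
            # Laplace (add-one) smoothing avoids log(0) for unseen words.
            score += math.log((word_counts[author][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Invented reviews (assumption), not taken from the Pitchfork data.
toy_reviews = [
    ("author_a", "The guitar riffs are loud and raw"),
    ("author_a", "Loud drums and a raw guitar tone"),
    ("author_b", "This synth melody is soft and dreamy"),
    ("author_b", "A dreamy synth with soft vocals"),
    ("author_c", "The only review of this author"),  # dropped: 1 review
]

model = train(toy_reviews)
print(predict(model, "A loud guitar record"))
print(predict(model, "Soft, dreamy synth"))
```

Words that never appear in the training vocabulary are ignored at prediction time, which is one common way to handle unseen terms.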

In [1]:
# Imports

import sqlite3
import pandas as pd
import numpy as np
from scipy import stats 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# Creating a connection to the SQLite database

conn = sqlite3.connect("database.sqlite")
c = conn.cursor()

In [34]:
# Querying for reviewers with at least 30 reviews, along with the content of
# each of their reviews

c.execute("""SELECT author, author_type, content
            FROM reviews
            JOIN content
            USING(reviewid)
            WHERE author IN
                (SELECT author
                FROM reviews
                GROUP BY author
                HAVING count(author) >= 30)
            ;""")

# Creating a pandas dataframe from our SQLite query
df = pd.DataFrame(c.fetchall())
df.columns = [x[0] for x in c.description]
df.head()

| | author | author_type | content |
|---|---|---|---|
| 0 | nate patrin | contributor | “Trip-hop” eventually became a ’90s punchline,... |
| 1 | zoe camp | contributor | Eight years, five albums, and two EPs in, the ... |
| 2 | jenn pelly | associate reviews editor | Kleenex began with a crash. It transpired one ... |
| 3 | kevin lozano | tracks coordinator | It is impossible to consider a given release b... |
| 4 | katherine st. asaph | contributor | Rapper Simbi Ajikawo, who records as Little Si... |
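
To see what the `HAVING` subquery in the query above does, the same statement can be run against a small in-memory database. This is a sketch under assumptions: the toy tables contain only the columns the query touches, the rows are invented, and the threshold is lowered from 30 to 3 (the hypothetical `MIN_REVIEWS`) so the effect is visible on four rows.

```python
import sqlite3

# In-memory database with made-up rows (assumption, not the real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (reviewid INTEGER, author TEXT, author_type TEXT);
    CREATE TABLE content (reviewid INTEGER, content TEXT);
    INSERT INTO reviews VALUES
        (1, 'prolific writer', 'contributor'),
        (2, 'prolific writer', 'contributor'),
        (3, 'prolific writer', 'contributor'),
        (4, 'occasional writer', 'contributor');
    INSERT INTO content VALUES
        (1, 'first review'),
        (2, 'second review'),
        (3, 'third review'),
        (4, 'lone review');
""")

MIN_REVIEWS = 3  # the notebook uses 30

# Same shape as the notebook query, with the threshold as a bound parameter.
rows = conn.execute(
    """SELECT author, author_type, content
       FROM reviews
       JOIN content
       USING(reviewid)
       WHERE author IN
           (SELECT author
            FROM reviews
            GROUP BY author
            HAVING count(author) >= ?)
       ;""",
    (MIN_REVIEWS,),
).fetchall()

for row in rows:
    print(row)

conn.close()
```

Only the rows for the author who meets the threshold come back; the single review by the other author is filtered out before the join result is returned.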


In [33]:
df.shape

(16050, 3)

In [32]:
authors = list(df["author"].unique())
len(authors)

119

In [37]:
df.author.value_counts()

joe tangari                   818
stephen m. deusner            725
ian cohen                     699
brian howe                    500
mark richardson               476
stuart berman                 445
marc hogan                    439
nate patrin                   347
marc masters                  312
jayson greene                 299
brandon stosuy                290
grayson currin                289
matthew murphy                274
dominique leone               273
jess harvell                  273
andrew gaerig                 270
jason crock                   270
rob mitchum                   267
andy beta                     250
paul thompson                 222
joshua klein                  217
larry fitzmaurice             217
chris dahlen                  214
nick neyland                  211
philip sherburne              209
adam moerder                  209
tom breihan                   208
amanda petrusich              200
matt lemay                    196
ryan dombal   