# Big Data Modeling and Management Assignment

## Group 32
- António Pinto
- Davide Farinati
- Henrique Vaz
- João César
- Philipp Metzger


## 🍺 The Beer project  🍺 

As shown in class, graph databases are a natural way of navigating distinct types of data. For this first project we will use a graph database to analyse beer and breweries!

_For reference, the dataset used for this project was extracted from [Kaggle](https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews), where it was released by Evan Hallmark. Although the author does not provide metadata on the origin of the data, it is probably a collection of open data from places like [BeerAdvocate](https://www.beeradvocate.com/)._

#### Problem description

Explore the database via the Python neo4j connector and/or the graphical tool on the Neo4j webpage. Answer the questions and submit the results by following the instructions.

#### Connection details to the neo4j database
```
Host: rhea.isegi.unl.pt:7474
Username: neo4j  
Password: F3cfcrnvBev57KZ8mcMk78L9wHgJVZuJ 
Connect URL : bolt://rhea.isegi.unl.pt:7687
```


#### Questions


0. __Example Question__ _How many beers does the database contain?_
1. How many different countries exist in the database?
1. Most reviews:  
    1. Which `Beer` has the most reviews?  
    1. Which `Brewery` has the most reviews for its beers?
    1. Which `Country` has the most reviews for its beers? 
1. Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?
1. Which Portuguese brewery has the most beers?
1. From those beers (the ones returned from the previous question), which has the most reviews?
1. On average how many different beer styles does each brewery produce?
1. Which brewery produces the strongest beers according to ABV?
1. If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?
1. Using Graph Algorithms answer **two** of the following questions:
    1. Which two Countries are most similar when it comes to their **top 10** most produced Beer styles?
    1. Which beer has the most similar reviews as the beer `Super Bock Stout`?
    1. Which user is the most influential when it comes to reviews made?
    1. Which beer styles are more central when it comes to the amount of beers? \
    Note: In case of a tie for the top entity, in terms of the metrics output by the algorithms, **simply output the first.**
1. If you had to pick 3 beers to recommend using only this database, which would you pick and why?


Questions 8 to 10 are somewhat open, which means we'll also be evaluating the reasoning behind your answer. There are not necessarily bad results, only wrong criteria, explanations or execution.
 
##### Groups  

Groups should have 4 to 5 people  
You should register your group on moodle. An email will be going out to everyone with the credentials for the database to use when storing the results.


##### Submission      

The query results should be submitted to the group's redis database (explained in the first class, credentials sent via email).  
The following format is expected:
```
    >>> redis.set("0", "358873")
```

This result should be the group's answer to question 0.
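The submission step can be sketched as a small helper that writes every answer in one pass. This is only an illustration: `submit_answers` and `FakeRedis` are hypothetical names, and the sketch assumes the real client exposes `set(key, value)` the way redis-py's `redis.Redis` does. The in-memory stand-in is there so the example runs without a Redis server.

```python
from typing import Dict, Protocol


class KeyValueClient(Protocol):
    """Anything with a redis-style set(key, value) method."""

    def set(self, key: str, value: str) -> object: ...


def submit_answers(client: KeyValueClient, answers: Dict[str, object]) -> Dict[str, str]:
    """Store each answer under its question id, converting values to strings."""
    stored: Dict[str, str] = {}
    for question_id, answer in answers.items():
        value = str(answer)
        client.set(question_id, value)
        stored[question_id] = value
    return stored


class FakeRedis:
    """Hypothetical in-memory stand-in so the sketch runs without a server."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> bool:
        self.data[key] = value
        return True


client = FakeRedis()
submit_answers(client, {"0": 358873})
print(client.data)  # {'0': '358873'}
```

With a real connection, the `FakeRedis()` instance would be replaced by the client created from the credentials sent by email.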

The code used to produce the results and the respective explanations should be uploaded to moodle. They should have a clear reference to the group, either in the file name or in the document itself. Preferably one Jupyter notebook per group.

Delivery date: Until the **midnight of May 2nd**

##### Evaluation   

This will be 20% of the final grade.   
Each solution will be evaluated on 2 components: correctness of results and simplicity of the solution.  
All code will go through automated plagiarism checks. Groups with the same code will undergo investigation.

**Note:**
Remember that Neo4j is a shared database, so when creating in-memory graphs please use your group's prefix.  
Ex. Instead of `my-graph` as the name of your graph please use `group0-my-graph`.
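
The naming rule above can be captured in a tiny helper, sketched here under the assumption that the prefix is always `group` followed by the group number. The function name `group_graph_name` is hypothetical. Stripping whitespace avoids accidentally creating a graph whose name ends in a space.

```python
def group_graph_name(group_number: int, name: str) -> str:
    """Return the in-memory graph name prefixed with the group identifier."""
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("graph name must not be empty")
    return f"group{group_number}-{cleaned}"


print(group_graph_name(0, "my-graph"))  # group0-my-graph
```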

# Connection

In [1]:
import py2neo
from pprint import pprint
Host = 'rhea.isegi.unl.pt:7474'  
Username = 'neo4j'  
Password = 'F3cfcrnvBev57KZ8mcMk78L9wHgJVZuJ' 
beer_graph = py2neo.Graph(f"http://{Username}:{Password}@{Host}")
#http://rhea.isegi.unl.pt:7474/browser/
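
As an alternative to hard-coding the credentials in the notebook, the connection URL could be assembled from environment variables. This is a sketch only: `build_neo4j_url` is a hypothetical helper, and the variable names `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions, not something the assignment defines. The credentials are percent-encoded so that special characters do not break the URL.

```python
import os
from urllib.parse import quote


def build_neo4j_url(host: str, username: str, password: str, scheme: str = "http") -> str:
    """Assemble a connection URL, percent-encoding the credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}"


# NEO4J_USER and NEO4J_PASSWORD are assumed names; the fallbacks are placeholders.
username = os.environ.get("NEO4J_USER", "neo4j")
password = os.environ.get("NEO4J_PASSWORD", "example-password")
url = build_neo4j_url("rhea.isegi.unl.pt:7474", username, password)
```

The resulting `url` string has the same shape as the one passed to `py2neo.Graph` in the cell above.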

# Resolutions

1. How many different countries exist in the database?

In [2]:
# Query returns the number of distinct Country nodes in the database
result = beer_graph.run("""
        MATCH (c:Country)
        RETURN COUNT(DISTINCT c) as Unique_countries
""").data()
pprint(result)

[{'Unique_countries': 200}]


2. Most reviews:  
    1. Which `Beer` has the most reviews?

In [3]:
# Query returns the beer with the most reviews by ordering the review counts in descending order
# and keeping the first row with `LIMIT` (the `no_beers` alias holds the review count)
result = beer_graph.run(
    """
    MATCH (r:Reviews)-[:ABOUT]->(b:Beers)
    RETURN b.name, count(r) as no_beers
    ORDER BY no_beers DESC
    LIMIT 1
    """
).data()
pprint(result)

[{'b.name': 'IPA', 'no_beers': 31387}]


2. Most reviews:
    2. Which `Brewery` has the most reviews for its beers?

In [4]:
# Query returns the brewery with the most reviews, by matching all the reviews about the beers produced by each brewery,
# ordering the counts in descending order and keeping the first row with `LIMIT`
result = beer_graph.run(
    """
    MATCH (r:Reviews)-[:ABOUT]->(:Beers)-[:BREWED_AT]->(brewery:Breweries)
    RETURN brewery.name, count(r)
    ORDER BY count(r) DESC
    LIMIT 1
    """
).data()
pprint(result)

[{'brewery.name': 'Sierra Nevada Brewing Co.', 'count(r)': 175161}]


2. Most reviews:
    3. Which `Country` has the most reviews for its beers?

In [5]:
# Query returns the country with the most reviews, by matching all the reviews about the beers produced by the breweries
# of each country, ordering the counts in descending order and keeping the first row with `LIMIT`
result = beer_graph.run(
    """
    MATCH (r:Reviews)-[:ABOUT]->(:Beers)-[:BREWED_AT]->(:Breweries)-[:FROM]->(country:Country)
    RETURN country.country_digit, count(r)
    ORDER BY count(r) DESC
    LIMIT 1
    """
).data()
pprint(result)

[{'count(r)': 7524410, 'country.country_digit': 'US'}]


3. Find the user/users that have the most shared reviews (reviews of the same beers) with the user CTJman?

In [6]:
# To find the user with the most shared reviews with CTJman, we first match the reviews made by CTJman to get the beers
# CTJman reviewed. We then match all the other reviews of those beers, aggregate them by user and order the users by the
# number of such reviews (in descending order)
result = beer_graph.run("""
        MATCH (u1:Username{user_name: 'CTJman'})-[:MADE]->(:Reviews)-[:ABOUT]->(:Beers)<-[:ABOUT]-(r2:Reviews)<-[:MADE]-(u2:Username)
        RETURN u2 as username, count(r2) as no_reviews
        ORDER BY no_reviews DESC
        LIMIT 1
""").data()
pprint(result)

[{'no_reviews': 1428, 'username': Node('Username', user_name='acurtis')}]
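
The counting logic of this query can be sketched in plain Python on made-up data. The function name `most_shared_reviewers` and the toy reviews are hypothetical, and the sketch simplifies by assuming each user reviews a given beer at most once.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_shared_reviewers(reviews: Iterable[Tuple[str, str]], target: str) -> List[Tuple[str, int]]:
    """Rank other users by how many of their reviews cover beers the target reviewed."""
    rows = list(reviews)
    target_beers = {beer for user, beer in rows if user == target}
    shared = Counter(
        user for user, beer in rows
        if user != target and beer in target_beers
    )
    return shared.most_common()


# Made-up (user, beer) pairs for illustration only.
toy_reviews = [
    ("CTJman", "Beer A"), ("CTJman", "Beer B"), ("CTJman", "Beer C"),
    ("alice", "Beer A"), ("alice", "Beer B"), ("alice", "Beer D"),
    ("bob", "Beer C"),
]
print(most_shared_reviewers(toy_reviews, "CTJman"))  # [('alice', 2), ('bob', 1)]
```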


4. Which Portuguese brewery has the most beers?

In [7]:
# To get the Portuguese brewery with the most beers we first filter the countries to query just the breweries 
# from Portugal and aggregate them by brewery, counting the amount of beers
result = beer_graph.run("""
        MATCH (:Country{country_digit:'PT'})<-[:FROM]-(bwr:Breweries)<-[:BREWED_AT]-(beer:Beers)
        RETURN bwr.name, count(beer)
        ORDER BY count(beer) DESC
        LIMIT 1
""").data()
pprint(result)

[{'bwr.name': 'Dois Corvos Cervejeira', 'count(beer)': 40}]


5. From those beers (the ones returned from the previous question), which has the most reviews?

In [8]:
# To get the most reviewed beer from 'Dois Corvos Cervejeira', we filter the brewery by name and count the reviews of each of its beers
result = beer_graph.run(
    """
    MATCH (reviews:Reviews)-[a:ABOUT]->(beers:Beers)-[ba:BREWED_AT]->(breweries:Breweries{name: 'Dois Corvos Cervejeira'})
    RETURN DISTINCT beers.name as beer_name, count(reviews) as review_count
    ORDER BY review_count DESC
    LIMIT 1
    """
).data()
pprint(result)

[{'beer_name': 'Finisterra', 'review_count': 10}]


6. On average how many different beer styles does each brewery produce?

In [9]:
# Retrieve the beer style count for each brewery (that produces at least one beer style)
result = beer_graph.run(
    """
    MATCH (style:Style)<-[:OF_TYPE]-(:Beers)-[:BREWED_AT]->(brewery:Breweries)
    WITH DISTINCT brewery.name as names, count(DISTINCT style.name) as style_counts
    RETURN AVG(style_counts) as average_style_count_per_brewery
    """
).data()
pprint(result)

[{'average_style_count_per_brewery': 10.669977315921768}]
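
The two-step aggregation (distinct styles per brewery, then the average of those counts) can be sketched in plain Python. The function name `average_styles_per_brewery` and the toy rows are hypothetical and only illustrate the shape of the computation.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Set, Tuple


def average_styles_per_brewery(beers: Iterable[Tuple[str, str]]) -> float:
    """Average number of distinct styles over breweries with at least one beer."""
    styles_by_brewery: Dict[str, Set[str]] = defaultdict(set)
    for brewery, style in beers:
        styles_by_brewery[brewery].add(style)  # the set removes duplicate styles
    return mean(len(styles) for styles in styles_by_brewery.values())


# Made-up (brewery, style) pairs, one per beer.
toy_beers = [
    ("Brewery X", "Stout"), ("Brewery X", "IPA"), ("Brewery X", "IPA"),
    ("Brewery Y", "Lager"),
]
print(average_styles_per_brewery(toy_beers))  # 1.5
```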


7. Which brewery produces the strongest beers according to ABV?

In [10]:
# Returns the brewery with the highest average ABV. To check that this brewery really produces the strongest beers
# and does not owe its average to a single beer, the second query lists the five strongest beers: this brewery holds
# the first and second places, which supports the result (its average ABV is about 25). Beers with an 'Unknown' ABV
# are excluded because they cannot be used in the computation
result = beer_graph.run("""
            MATCH (b:Beers)-[:BREWED_AT]->(bwr:Breweries)
            WHERE NOT b.abv CONTAINS 'Unknown'
            RETURN bwr.name as brewery, avg(toInteger(b.abv)) as average_abv
            ORDER BY average_abv DESC
            LIMIT 1
                """).data()
pprint(result)

print('Strongest beers:')
result = beer_graph.run("""
            MATCH (b:Beers)-[:BREWED_AT]->(bwr:Breweries)
            WHERE NOT b.abv CONTAINS 'Unknown'
            RETURN bwr.name as brewery, toInteger(b.abv) as ABV
            ORDER BY ABV DESC
            LIMIT 5
                """).data()
pprint(result)

[{'average_abv': 25.111111111111114, 'brewery': '1648 Brewing Company Ltd'}]
Strongest beers:
[{'ABV': 100, 'brewery': '1648 Brewing Company Ltd'},
 {'ABV': 100, 'brewery': '1648 Brewing Company Ltd'},
 {'ABV': 100, 'brewery': 'Avondale Brewing Co.'},
 {'ABV': 80, 'brewery': 'Morgan Territory Brewing'},
 {'ABV': 67, 'brewery': 'Brewmeister'}]


8. If I typically enjoy a beer due to its aroma and appearance, which beer style should I try?

In [11]:
# The recommendation assumes that look and taste are the most appreciated properties, so we average these two fields
# per style and recommend the style with the highest combined average. Reviews with 'Unknown' values are excluded
# because they would conflict with the computation of the average
result = beer_graph.run("""
        MATCH (r:Reviews)-[:ABOUT]->(:Beers)-[:OF_TYPE]->(s:Style)
        WHERE NOT r.look CONTAINS 'Unknown' AND NOT r.taste CONTAINS 'Unknown' 
        RETURN s, avg(toFloat(r.look)) as look_avg, avg(toFloat(r.taste)) as taste_avg
        ORDER BY (look_avg + taste_avg) DESC
        LIMIT 1
""").data()
pprint(result)

[{'look_avg': 4.383595613210904,
  's': Node('Style', name='New England IPA'),
  'taste_avg': 4.418206168244489}]


9. Using Graph Algorithms answer two of the following questions:
    3. Which user is the most influential when it comes to reviews made?

In [12]:
# Order users by their score regarding the reviews made, with a graph between usernames and reviews (reverse orientation)

data = beer_graph.run(
    """
        CALL gds.graph.create(
            'group32-9c ',
            [
                'Username',
                'Reviews'
            ],
            {
                MADE: {
                    orientation: 'REVERSE'
                }
            }
        )
    """
).data()
data

data = beer_graph.run(
    """
        CALL gds.pageRank.stream('group32-9c ') YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).user_name AS username, score
        ORDER BY score DESC LIMIT 1
    """
).data()
data

[{'username': 'Sammy', 'score': 1759.3712493896485}]
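
To illustrate why reversing the `MADE` relationship makes users accumulate score from their reviews, here is a textbook power-iteration PageRank on a made-up graph. It is a sketch, not the GDS implementation: the function `pagerank`, the damping factor, the handling of nodes without outgoing edges and the toy edges are all assumptions, and the scores are normalised, so they are not on the same scale as the result above.

```python
from typing import Dict, Iterable, List, Tuple


def pagerank(edges: Iterable[Tuple[str, str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; nodes without out-links spread their score evenly."""
    edge_list = list(edges)
    nodes = sorted({node for edge in edge_list for node in edge})
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for source, target in edge_list:
        out_links[source].append(target)

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes with no outgoing edge is shared among all nodes.
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        new_scores = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source in nodes:
            targets = out_links[source]
            if targets:
                share = damping * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Reversed MADE edges point from each review to the user who wrote it (toy data).
toy_edges = [
    ("review1", "Sammy"), ("review2", "Sammy"), ("review3", "Sammy"),
    ("review4", "other_user"),
]
toy_scores = pagerank(toy_edges)
print(max(toy_scores, key=toy_scores.get))  # Sammy
```

In this toy graph the user with the most incoming review edges ends up with the highest score.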

9. Using Graph Algorithms answer two of the following questions:
    4. Which beer styles are more central when it comes the amount of beers?

In [13]:
# Order styles by the amount of beers, with a graph between styles and beers (natural orientation)

data = beer_graph.run(
    """
        CALL gds.graph.create(
            'group32-9D',
            [
                'Style',
                'Beers'
            ],
            {
                OF_TYPE: {
                    orientation: 'NATURAL'
                }
            }
        )
    """
).data()

data = beer_graph.run(
    """
        CALL gds.pageRank.stream('group32-9D') YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS Style, score
        ORDER BY score DESC LIMIT 1
    """
).data()
data

[{'Style': 'American IPA', 'score': 5702.417230224609}]

10. If you had to pick 3 beers to recommend using only this database, which would you pick and why?

In [14]:
# This recommendation is based on the average review score of each beer. To base the recommendation on sound
# reasoning, we exclude beers with few reviews (1000 or fewer), as their averages could be biased by a small group of
# reviewers with unusual tastes. Retired beers are excluded as well.
result = beer_graph.run("""
        MATCH (r:Reviews)-[:ABOUT]->(b:Beers)
        WHERE toFloat(r.score) IS NOT NULL AND b.retired =~ 'f'
        WITH b, avg(toFloat(r.score)) as beer_score, COUNT(r) as no_reviews
        WHERE no_reviews > 1000
        RETURN b.name, beer_score, no_reviews
        ORDER BY beer_score DESC
        LIMIT 3
""").data()
pprint(result)

[{'b.name': 'Barrel-Aged Abraxas',
  'beer_score': 4.742700831024935,
  'no_reviews': 1444},
 {'b.name': 'Marshmallow Handjee',
  'beer_score': 4.735540457072266,
  'no_reviews': 1619},
 {'b.name': "Hunahpu's Imperial Stout - Double Barrel Aged",
  'beer_score': 4.728153067678674,
  'no_reviews': 1581}]
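
The effect of the minimum-review threshold can be sketched in plain Python on made-up scores. The function name `top_beers` and the toy data are hypothetical. The point is that a beer with a single perfect review is filtered out before ranking.

```python
from statistics import mean
from typing import Dict, List, Tuple


def top_beers(scores_by_beer: Dict[str, List[float]], min_reviews: int,
              limit: int = 3) -> List[Tuple[str, float, int]]:
    """Rank beers by average score, keeping only those with more than min_reviews reviews."""
    eligible = [
        (beer, mean(scores), len(scores))
        for beer, scores in scores_by_beer.items()
        if len(scores) > min_reviews
    ]
    eligible.sort(key=lambda row: row[1], reverse=True)
    return eligible[:limit]


# Made-up review scores for illustration only.
toy_scores_by_beer = {
    "Niche Beer": [5.0],
    "Popular Beer": [4.5, 4.0, 4.5],
    "Average Beer": [3.0, 3.5, 3.0],
}
ranking = top_beers(toy_scores_by_beer, min_reviews=2)
print([beer for beer, _, _ in ranking])  # ['Popular Beer', 'Average Beer']
```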


# Answers

In [16]:
redis.set("0", "358873")
redis.set("1", "200")
redis.set("2A", "IPA")
redis.set("2B", "Sierra Nevada Brewing Co.")
redis.set("2C", "US")
redis.set("3", "acurtis")
redis.set("4", "Dois Corvos Cervejeira")
redis.set("5", "Finisterra")
redis.set("6", "10.7")
redis.set("7", "1648 Brewing Company Ltd")
redis.set("8", "New England IPA")
redis.set("9C", "Sammy")
redis.set("9D", "American IPA")
redis.set("10", "Barrel-Aged Abraxas, Marshmallow Handjee and Hunahpu's Imperial Stout - Double Barrel Aged")

True