Skip to content

lju-lazarevic/wine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Graph adventures with wine

Weeks so far
photo 1423483641154 5411ec9c0ddf?ixlib=rb 1.2
Grapes by Maja Petric

Based on the data from Kaggle, this repository will hold all models, thoughts, questions and code as we explore this data set on a weekly basis.

Follow along on the Neo4j Twitch channel!

Need to catch up? Here’s the adventures with wine data playlist.

Week 1 (18 May): Let’s explore the data!

Want to participate?

So this was an exciting start to the week! Having only just found the wine data set not that much before the start of the Twitch session, it certainly was a voyage of discovery. In this session we:

  • Had a look at the data set, thinking about some of the questions we might ask and insight we could discover

  • Had an initial think about what might the data model look like, and some of the modelling decisions we’re going to have to make over the coming weeks

  • Decided upon pairing down the data set for now, and start thinking about importing that data

Exploring the data

First of all, let’s take a sample peak at the data:

id country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery

0

Italy

Aromas include tropical fruit, broom <…​>

Vulkà Bianco

87

Sicily & Sardinia

Etna

Kerin O’Keefe

@kerinokeefe

Nicosia 2013 Vulkà Bianco (Etna)

White Blend

Nicosia

1

Portugal

This is ripe and fruity, <…​>

Avidagos

87

15

Douro

Roger Voss

@vossroger

Quinta dos Avidagos 2011 Avidagos Red (Douro)

Portuguese Red

Quinta dos Avidagos

2

US

Tart and snappy, the flavors of lime <…​>

87

14

Oregon

Willamette Valley

Willamette Valley

Paul Gregutt

@paulgwineÂ

Rainstorm 2013 Pinot Gris (Willamette Valley)

Pinot Gris

Rainstorm

We’ll probably find having a definition of the headers helpful for some of the less obvious fields!

Header Description

<id>

The row id of the data set

country

The country that the wine is from

description

Description used to describe the tasting features of the wine

designation

The vineyard within the winery where the grapes that made the wine are from. A winery may have more than one vineyard

points

The number of points WineEnthusiast rated the wine on a scale of 1-100

price

The cost for a bottle of the wine

province

The province or state that the wine is from

region_1

The wine growing area in a province or state (ie Napa Valley)

region_2

Sometimes there are more specific regions specified within a wine growing area (ie Rutherford inside the Napa Valley)

taster_name

Name of the taster

taster_twitter_handle

Taster’s twitter handle

title

The title of the wine review, which often contains the vintage if you’re interested in extracting that feature

variety

The type of grapes used to make the wine (ie Pinot Noir)

winery

The winery that made the wine

A very interesting data set indeed. There are some rather cool things that we can explore, such as:

  • Do tasters stick to certain wines, or do they go across different grape varieties/countries?

  • How do points compare to price?

  • How do varities cross countries?

  • We can tokenise the description - can we recommend wines based on description elements?

  • …​and so many more!

Data set challenges

As we explored this data, a number of questions arose:

  • Could we safely make the assumption that each line represented a unique wine? Or were we looking at a wine with multiple entries becauses different reviewers reviewed it?

  • Were there data duplications?

  • What’s the year of the wine?

  • and so forth

We decided to have a quick look at the data using MS Excel. We discovered some things about the data:

  • there is only one wine per reviewer, so we’re not dealing with multiple reviews per wine

  • there are indeed duplications in the wine, we need to resolve those

  • we are going to need to do some work on the wine name - we’ll need to extract the year, and also we’ll want to keep the title as a wine can be across many years

As an outcome of the session, I will revise the data and removed the duplicates in the dataset. Note that we could have done this within Neo4j, but I am always a fan of cleaning the data prior to a load if it is straightforward to do so!

Modelling

We then turned our thoughts to modelling. Using Arrows, we took a first pass at taking all of the data elements available, and then assigning them as either node labels, relationship types, or properties on either. The first pass looked like this:

img\model1
Figure 1. The initial pass - getting the data down on paper!

Whilst this is far from the finished article, this initial pass allows us to start thinking about what questions we were looking to answer, and based on that, how would we change this model. There will be other things we’ll need to think about resolving too, such as:

  • How are we going to manage ProvinceRegion1Region2? Not all wines have all those details

  • How are we going to represent Wine (of which the title contains name wine + year), and then the WineTitle and it’s respective Year?

  • We’ve also got Designation to add! Where’s that going?

We will visit all of these questions, and more, as we continue our wine adventure!

Importing the data

For now, we’ve got a cut-down model we’re going to import (we’ll import the rest in the next session!), which is the following

img\model2
Figure 2. Model based on partial data - slightly less contravertial!

Based on the approaches we use below, we are dealing with some of the duplicate values for now.

Setting indexes

As we would expect Winery and Country to be unique names, we are going to be setting some indexes to allow use to MERGE as we load the new data. MERGE behaves like a CREATE if the data doesn’t already exist, and a MATCH if it does. You can read more about MERGE in the documentation.

For now, we’re going to assume that Province is also unique, and we’ll set an index on a property for that too. Not the end of the world we’ve we’re wrong - we’ll just correct the data later. All part of the journey.

I suggest you enable multi statement query editor in browser as we start to do multiple queries in a row!

To set the indexes, run the following in Neo4j Browser:

CREATE INDEX ON :Winery(name);
CREATE INDEX ON :Province(name);
CREATE INDEX ON :Country(name);

Dealing with null values

So we discovered some null values in our data! We have some decisions to make. Do we want to skip values if they have a null, or do we want to set a default value? For the purposes of Country, Province and Winery, we are going to want to set some sort of value. Either we can later on extrapolate the information and correct it, or we are getting useful information by knowing it does not exist. So for this scenario we are going to stick with 'No Country', 'No Province' and 'No Winery' as our default options.

There are a number of ways we can deal with null values, and in this instance, I’m going to use the FOREACH - IN CASE WHEN trick. We are going to be combining FOREACH to allow us to do a MERGE, and CASE to check for nulls and set a default. I’m still looking for any content talking about this trick, and I’ll update this document when I find it!

Loading the nodes and relationships can be memory hungry, so I am going to load the data in two passes. First of all, let’s load all the nodes:

:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
FOREACH (i IN
    CASE WHEN row.country IS NOT NULL
         THEN [row.country]
         ELSE ["No Country"]
    END | MERGE (c:Country {name:i}))
FOREACH (i IN
    CASE WHEN row.province IS NOT NULL
         THEN [row.province]
         ELSE ["No Province"]
    END | MERGE (p:Province {name:i}))
FOREACH (i IN
    CASE WHEN row.winery IS NOT NULL
         THEN [row.winery]
         ELSE ["No Winery"]
    END | MERGE (w:Winery {name:i}))

Now we need to add the relationships between country, province and winery. I’m sure there’s a prettier way to do this, which I will update when I think of it! For now, we also need to think about those null values again when we’re doing the second pass:

:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 1000
//:auto
//USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH
    CASE row.country
        WHEN null
        THEN "No Country"
        ELSE row.country
    END AS country,
    CASE row.province
        WHEN null
        THEN "No Province"
        ELSE row.province
    END AS province,
    CASE row.winery
        WHEN null
        THEN "No Winery"
        ELSE row.winery
    END AS winery
MATCH (c:Country {name:country}),
    (p:Province {name:province}),
    (w:Winery {name:winery})
MERGE (w)-[:FROM_PROVENCE]->(p)
WITH p,c
MERGE (p)-[:PROVINCE_COUNTRY]->(c)

And that’s the data in! As we’ve been using MERGE, that will take care of any duplicate values, so we don’t need to worry about those.

What next?

Phew! So we’ve got some data in. I’ll leave it to you, dear reader, to think what questions you might ask of the data. I’ll suggest the following to get you started, and we’ll cover them in the next session:

  • Which countries have the most wineries?

  • Are there any wineries across different countries?

  • Which wineries are across multiple provinces?

I would love to hear what interesting discoveries you have also found. Let me know during the session!

See you on the 1st June!

Week 2 (30 May - 9am PDT, 12pm EDT, 5pm BST, 6pm CEST): Importing the data, asking some questions, revising our data model

photo 1464036388609 747537735eab?ixlib=rb 1.2
Vineyard and Hills by Tim Mossholder

Want to participate?

Can’t make the session? Not to worry, the recording will be on the Neo4j YouTube channel soon after.

This week was an interesting installment! We spent the first part of the session walking through the rationale behind the load queries for the data. As you may recall we had some duplicates to deal with, so we made some decisions, such as assuming that all wineries and provinces has unique names, and we imported the data.

I also posed some questions to you all to ask of this data, as a reminder, the questions were:

  • Which countries have the most wineries?

  • Are there any wineries across different countries?

  • Which wineries are across multiple provinces?

As well as using Cypher to answer the questions, we also had a brief look at the data using Neo4j Bloom. For those of you who have not come across Bloom before, it’s a near natural language visualisation tool for graphs. You can read more about it here.

So, onto those queries!

Which countries have the most wineries?
MATCH path=(w:Winery)-[:FROM_PROVENCE]->(p:Province)-[:PROVINCE_COUNTRY]->(c:Country)
RETURN c.name AS Country, count(DISTINCT w) AS Total
ORDER BY Total DESC
Which wineries are across multiple provinces?
MATCH (w:Winery)-[:FROM_PROVENCE]->(p:Province)
WITH w, COLLECT(p.name) AS Provinces, count(p) AS Total
RETURN w.name AS Winery, Provinces, Total
ORDER BY Total DESC

So, having had a look at that data, we then decided to add more data. Again, we’re going to leave the sticky matter of how to deal with regions for now. We decided we’d add Taster, Designation and Variety, as well as the Wine itself. If you recall, Wine name also contained the year - and we’ll probably want to extract that. But for now, we’ll import it as is.

img\model3
Figure 3. The expanded data model that now includes Designation, Taster, Wine and Variety

First of all, we need to set some indexes. We do this because this will allow us to search for data faster by using an index look up on properties of interest rather than doing a full database scan. It’s also super helpful if we’re using MERGE, again for the same reason.

Set the required indexes
//indexes for additional data
CREATE INDEX ON :Wine(id);
CREATE INDEX ON :Taster(name);
CREATE INDEX ON :Variety(name);
CREATE INDEX ON :Designation(name);

And now we can load the data. As before, we will do two passes of this, one to create the nodes, and then the second pass to create the relationships. You will notice in the second pass we also search for the Winery node - this is so that we can connect it up to Wine!

Loading the Designation, Taster, Variety and Wine nodes. Note that we use CREATE for Wine - that’s because we expect all instances to be unique, so we can just create them
:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
FOREACH (i IN
    CASE WHEN row.designation IS NOT NULL
         THEN [row.designation]
         ELSE ["No Designation"]
    END | MERGE (d:Designation {name:i}))
FOREACH (i IN
    CASE WHEN row.taster_name IS NOT NULL
         THEN [row.taster_name]
         ELSE ["No Taster"]
    END | MERGE (t:Taster {name:i}))
FOREACH (i IN
    CASE WHEN row.variety IS NOT NULL
         THEN [row.variety]
         ELSE ["No Variety"]
    END | MERGE (v:Variety {name:i}))
CREATE (w:Wine {id:row.id, title:row.title})
And following up with creating the relationships. You will notice all of the relationship types are CREATE, this is because they all join onto Wine, which as we said previously, we assume to be unique for all entries.
:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH
    CASE row.designation
        WHEN null
        THEN "No Designation"
        ELSE row.designation
    END AS designation,
    CASE row.taster_name
        WHEN null
        THEN "No Taster"
        ELSE row.taster_name
    END AS taster,
    CASE row.variety
        WHEN null
        THEN "No Variety"
        ELSE row.variety
    END AS variety,
    CASE row.winery
        WHEN null
        THEN "No Winery"
        ELSE row.winery
    END AS winery,
    row.id as id
MATCH (d:Designation {name:designation}),
    (t:Taster {name:taster}),
    (v:Variety {name:variety}),
    (w:Wine {id:id}),
    (win:Winery {name:winery})
CREATE (w)-[:FROM_WINERY]->(win)
CREATE (w)-[:HAS_VARIETY]->(v)
CREATE (t)-[:RATES_WINE]->(w)
CREATE (w)-[:HAS_DESIGNATION]->(d)

Excellent! So we’ve got that data in, and now we can think about some different questions we can ask. For example:

  • Who is the most prolific wine taster?

  • How many wine varieties contain the word 'red'?

Querying for the most prolific wine taster
//Most prolific taster
MATCH (t:Taster)
WHERE t.name <> "No Taster"
WITH t
MATCH (t)-[:RATES_WINE]->(w:Wine)-[:HAS_VARIETY]->(v:Variety)
WITH t, count(w) AS total, COLLECT(DISTINCT v.name) AS varieties
RETURN t.name AS taster, varieties, total
ORDER BY total DESC
Finding all the varieties that contain the word 'red' in them
MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'red'
RETURN v.name
ORDER BY v.name

Have a go at some other questions yourself! You can always use Bloom to help think about what you might want to investigate too. If you can think of a question that you’re not sure how to write a query for, we can cover it in the session.

Week 3 (6 June - 9am PDT, 12pm EDT, 5pm BST, 6pm CEST): Let’s tackle those wine years!

photo 1568213816046 0ee1c42bd559?ixlib=rb 1.2
Four glasses of wine by Maksym Kaharlytskyi

Want to participate?

Well…​ this was a slightly painful week! We all have bad days and I guess it was my turn :). Nevertheless, there is some very good learning to be had, and hopefully you’ll be equiped that little bit better to spot issues.

Let’s start off with a top tip for this week…​

Tip
Do you have a smallish dataset? Is it taking a very long time to MERGE your data? Check if you have correctly set your indexes! More on that shortly <blush>.

So, what we were aiming to do this week:

  • Refactor the model yet again to think how we’ll show years and titles. The big difference here being we’re not importing data, we’re working with data that we already have in the database

  • Pull the years out of the wine titles and create separate Year and WineTitle nodes

  • Ask some questions!

So, let’s get going!

Yet another model revision

We started of with trying to decide how would we go about modelling the relationship between:

  • Wine Group (the term we decided to refer to a wine’s title but without the year)

  • Year

  • Wine (which has a title that contains wine group and year)

For the puroses of just getting something to work with, we ended up with a sketch of the following to walk through the rationale:

img\model4
Figure 4. A worked example of how the physical data might connect

We don’t expect there to be many relationships coming off of WineGroup - there’ll probably be a few years and that’s it. The more tricky one may well be WineGroup to Year - Year could quite possibly become a dense node. Not all dense nodes are bad - if you’re not traversing between multiple dense nodes it might be fine. For now, let’s leave it as it is, and we can always refactor the model to deal with it if necessary later.

So, based on this, let’s have a look at what our data model now stands:

img\model5
Figure 5. An update…​

It didn’t feel quite right, so I decided that Year should come off Wine and not WineGroup:

img\model6
Figure 6. The latest iteration of the wine data model

I’m still not crazy about this model, but that’s completely fine. That’s one of the things that I love about graph databases - we don’t have to get the data model perfect, we just get something that’s in the right direction, and we just iterate and refine it as we go along and understand our data better. Undoubtedly we’ll be back soon enough with some changes.

I’m coming for you, wine year and group!

Ok, now with that out of the way, we are going to extract that year, and a title without the year for our two new nodes, Year and WineGroup. We could have dealt with this before importing the data, but I thought it would be a good opportunity to use APOC to help us make changes with what we already have, in the database.

Due to how the wine title is structured, we are going to be using some fancy regex patterns to get the job done. We’ll also be using the text helper function apoc.text.replace(). Let’s look at some examples.

Getting the yearless wine group

This is the easy bit - we want to find 4 digits next to each other, and then replace them with nothing. Just pulling a few to look at as an example, if we run the following:

MATCH (w:Wine)
RETURN w.title, apoc.text.replace(w.title, '([0-9][0-9][0-9][0-9])', '') AS test LIMIT 5

We get:

img\img1

Brilliant! Turns out getting the year is a teeny bit more involved with it comes to regex…​ Undoubtedly there’ll be a better way to do this, but I’m glad I’ve got something at all:

MATCH (w:Wine)
RETURN w.title, apoc.text.replace(w.title, '[^0-9]|[^0-9][0-9]{1,3}[^0-9]|^[0-9]{1,3}[^0-9]|[^0-9][0-9]{1,3}$', '') AS test LIMIT 5

Woah! Yes…​ ok, taking each option (separated with |), we have:

  • Is not a digit

  • Is not a digit, followed by a digit that repeats up to 3 times followed by not a digit

  • Starts with a number that repeats up to 3 times followed by not a digit

  • Not a digit, followed with a number that repeats up to 3 times at the end

I would love to hear input on how to make that pattern more elegant! Let’s have a look at a sample:

img\img2

Success! We can extract both the year and wine groups - now let’s convert them into nodes, and hook everything up according to our model. To do that, let’s use another APOC tool to help, the procedue apoc.periodic.iterate().

First of all, let’s set those all important indexes:

CREATE INDEX ON :WineGroup(title);
CREATE INDEX ON :Year(value);

So - some of you who were watching me last week may have noticed I was just not getting any joy when trying to create the WineGroup node…​ in the end it took over an hour(!!!) to finish - I stopped the video when it was taking minutes and just left it. I only just spotted the shocking mistake I had made when I came to do the write up…​ I had created an index for :WineGroup(title), but I was trying to create a node of :WineTitle(title). Ouch. Rerunning the whole thing again (with the indexes/right label names), reader I can assure you it only took a mere 4s. So, as for our impromptu tip of the day - if it’s a small data set and it’s taking a long time, check your indexes.

That aside, let’s get to the business of the day, updating the data. As before, we’ll go in 3 passes, the two different node labels, and then join them with the relationships:

//Create the WineGroup nodes
CALL apoc.periodic.iterate(
  "MATCH (w:Wine) RETURN apoc.text.replace(w.title, '([0-9][0-9][0-9][0-9])', '') AS wineTitle",
  "MERGE (g:WineGroup{title:wineTitle})",
  {batchSize:100, parallel:false})
//Create the Year nodes
CALL apoc.periodic.iterate(
  "MATCH (w:Wine) RETURN apoc.text.replace(w.title, '[^0-9]|[^0-9][0-9]{1,3}[^0-9]|^[0-9]{1,3}[^0-9]|[^0-9][0-9]{1,3}$', '') AS year",
  "MERGE (y:Year{value:year})",
  {batchSize:100, parallel:false})
//join it all together
MATCH (w:Wine)
WITH apoc.text.replace(w.title, '([0-9][0-9][0-9][0-9])', '') AS wineTitle,
     apoc.text.replace(w.title, '[^0-9]|[^0-9][0-9]{1,3}[^0-9]|^[0-9]{1,3}[^0-9]|[^0-9][0-9]{1,3}$', '') AS year, w
MATCH (y:Year {value:year}), (wg:WineGroup{title:wineTitle})
CREATE (w)-[:FROM_YEAR]->(y),
       (w)-[:IN_WINE_GROUP]->(wg)

But uh oh…​. we have a problem…​ Looking at the years we have, we get the following:

img\img3

and

img\img4

Not to worry, let’s get that fixed next :).

Week 4 (13 June - 9am PDT, 12pm EDT, 5pm BST, 6pm CEST): Oh no! We have a year problem!

photo 1491924778227 f225b115dd5f?ixlib=rb 1.2
Pile of brown corks by Elisha Terada

Want to participate?

A quick entry for this week, and I’ll expand accordingly later.

So it turns out the regex didn’t quite work as expected from last week. So in this week’s session we did a bit of digging. We discovered that we returning all digits for year, and other peculiarities.

For example, we were seeing things like this: image:: img/img5.jpg[width="600"]

We did some further investigation, and spotted some other fun things going on. For example:

  • Some wines have 'weird' years - branding years

  • Some wines have 2 years, the branding year and the actual wine year

  • Some wines have no years!

After we interrogated the data, and decided one of the sensible ways to tackle this would be the following:

  • Assume year range of 1970-2017 - any wine with a year that doesn’t fall within this range we will assume is a branding year

  • Wines with a branding year or no year, we’ll tag with 'No Year', like we have done for other labels

  • If a wine has two years, go for the one that sits in the above range

  • If a wine has two years of which both sit in the above range, go for the first one that appears

Whilst it may involve some creative querying to enforce the above rules, the fact that we have them will be a huge help, irrespective of what approach we use to resolve this particular challenge.

So, with the above in mind, let’s try some things out.

*Finding wines with two years in the same range

So, how many wines do we actually have to deal with that have multiple years? Let’s find out:

MATCH (w:Wine)
WITH w, apoc.text.replace(w.title, '( [2][0-1][0-1][0-9])', '') AS test1, apoc.text.replace(w.title, '( [1][9][7-9][0-9])', '') AS test2
    WHERE size(test1)=size(test2) AND size(w.title)>size(test1)
RETURN w.title, test1, test2
//LIMIT 5

Only 14?! Ok - I’m tempted to leave them for now and deal with them later.

So, let’s turn our attention to the rest of the years. We’ll exclude those 14, and process the rest

Get the Wine Group

Using our range defined above, I put together this (rather cumbersome) query that will try and pick out the year based on

//display wines without years
MATCH (w:Wine)
WITH w, apoc.text.replace(w.title, '( [2][0-1][0-1][0-9])|( [1][9][7-9][0-9])', '') AS wineGroup
    WHERE size(wineGroup)=size(w.title)-5 OR size(wineGroup)=size(w.title)
RETURN w.title, wineGroup

It’s not pretty, but it does the job, and I’ll take it!

Clean up on aisle 7

Before we continue, we should clear up the wine years and groups. We can do that with the following code snippets:

//Remove the year nodes
MATCH (y:Year)
DETACH
DELETE y;
//Remove the wine group nodes
MATCH (wg:WineGroup)
DETACH
DELETE wg;

One more attempt…​ and then we’ll move onto the next part of the journey

Week 5 (28 June - 9am PDT, 12pm EDT, 5pm BST, 6pm CEST): Wine years and titles, final attempt!

photo 1562601579 599dec564e06?ixlib=rb 1.2
Wine shelf by Hermes Rivera

Want to participate?

Last attempt of trying to sort out the year and wine groups before I turn to alternative means so that we can move on!

Upon having a dig around in the APOC documentation, I spot something interesting…​ apoc.text.regexGroups() - could this be what we are looking for? We can reuse the regex pattern we have for finding our in range years, without the convoluted logic for replace(). Obviously, we now have to give that a spin!

MATCH (w:Wine)
WITH w, apoc.text.regexGroups(w.title, '([2][0-1][0-1][0-9])|([1][9][7-9][0-9])') AS years
WITH w, years, size(years) AS s
    WHERE s>0 //get rid of no year wines
RETURN w.title, years[0][0], size(years) AS s ORDER BY s LIMIT 10

Which gives us:

img6

Also, we can now deal with the awkward 14 using replace(), now that we know what year to get rid of. Success!

Right, given we can successfully extract the years, it makes the most sense to do the years first, and then use those for determining the WineGroup nodes.

Firstly, process all the Wine nodes that have a year:

  • Create the Year node

  • Connect the Year node to the Wine node

  • Then use both to create the WineGroup node and connect up

//Create the Year nodes
CALL apoc.periodic.iterate(
  "MATCH (w:Wine) WITH w, apoc.text.regexGroups(w.title, '([2][0-1][0-1][0-9])|([1][9][7-9][0-9])') AS years WITH w, years, size(years) AS s WHERE s>0 RETURN years[0][0] as year",
  "MERGE (y:Year{value:year})",
  {batchSize:100, parallel:false});
//Connect Year node to Wine node
MATCH (w:Wine)
WITH w, apoc.text.regexGroups(w.title, '([2][0-1][0-1][0-9])|([1][9][7-9][0-9])') AS years
WITH w, years, size(years) AS s WHERE s>0
MATCH (y:Year {value:years[0][0]})
CREATE (w)-[:FROM_YEAR]->(y);
//Create the WineGroup nodes
MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
MERGE (wg:WineGroup {title:apoc.text.replace(w.title, y.value,'')})
WITH wg, w
CREATE (w)-[:IN_WINE_GROUP]->(wg);

What about wines with no year?

Interestingly, given how we deduplicated our data (assume all wine are unique, we do have the interesting question of whether it’s worth creating a WineGroup node for those that don’t have a valid year. I think we’ll leave them for now, and we can revisit all of this if it turns out to be a wrong assumption!

Finally, let’s ask some questions!

I’m so excited! We can now start to ask some questions around popular wine years, which wine groups have had successful years, and so forth. In this session we looked at the following:

  • Which Year had the most Wine?

  • Which WineGroup has the most Year nodes?

  • Which Winery produces the most Wine for a given Year?

Let’s have a look at the first query:

MATCH (w:Wine)-[:FROM_YEAR]->(y:Year)
WITH y, collect(w) AS wines
RETURN y.value, size(wines) AS s ORDER BY s DESC
img7

I’ve taken the liberty of turning the results into a chart (you can export CSV straight from Neo4j Browser, I’ve circled the button on the above image):

img8

Neck and neck between 2012 and 2013! Very interesting, given our wine range goes up to 2017. I was not expecting that at all, more like 2014-15.

Onto the next query, let’s look at the WineGroup with the most Year nodes:

MATCH (wg:WineGroup)<-[:IN_WINE_GROUP]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wg, collect(y.value) AS years
RETURN wg.title, years, SIZE(years) AS cy ORDER BY cy DESC
img9

Double years? Hmm, let’s go investigate:

MATCH (wg:WineGroup)<-[:IN_WINE_GROUP]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WHERE id(wg)=576395
RETURN *
img10

Ah - our assumption that each Wine we imported was unique was a false assumption! How big is the problem?

MATCH (wg:WineGroup)<-[:IN_WINE_GROUP]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wg, y, collect(w) AS wines WHERE size(wines)>1
RETURN count(wg)

Returns 471 - ok, not so bad. I’m feeling dangerous, let’s try and fix it in a (questionable) query…​ I’m going to take the first wine in the collect, and remove the others remaining.

MATCH (wg:WineGroup)<-[:IN_WINE_GROUP]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wg, y, collect(w) AS wines WHERE size(wines)>1
WITH wines[1] AS w
DETACH DELETE w

Let’s go into a little what the above query does. You’ll probably remember doing something like MATCH (n:MyNode) DETACH DELETE n will match all nodes with label MyNode, detach any relationships and then delete all the nodes that have that label. What the above query does with the WITH statement is pull out the 2nd node from our collection (which we know is a duplicate), and then apply the same method. Hopefully that makes sense!

(I cheated slightly, 2 of the 471 were in fact triplicates, so I ran the above query twice to remove those too)

So now, if we run our query again for the WineGroup with the most Year nodes, we now get:

img11

Success! On to the next query…​ Finding out the Winery with the most most Wine for a given Year.

As before, we’re using Wine and also Year as the method to connect things together.

MATCH (wy:Winery)<-[:FROM_WINERY]-(w:Wine)-[:FROM_YEAR]->(y:Year)
WITH wy, y, COLLECT(w) AS wines
RETURN wy.name AS Winery, y.value AS Year, size(wines) AS `No of Wines`
ORDER BY `No of Wines` DESC
img12

Very popular, that Wines & Winemakers. Perhaps we’ll investigate a bit more at some point in the future.

Week 6 (3 July - 8am PDT, 11pm EDT, 4pm BST, 5pm CEST): So what exactly is designation?!

photo 1516594915697 87eb3b1c14ea?ixlib=rb 1.2
Napa in a Bottle by Scott Warman

Want to participate?

Welcome back, everybody! Well, I have to say, lots of fun was had this week.

So we set out to understand what exactly was Designation. We had a quick look, and it very much appeared like it was a generated data field, most likely from the wine title, and didn’t seem particularly useful. So it was quickly discounted for now - and an thorough investigation into the original data set may be useful to do in the future.

So.. with all this time left for the session, it was time to decide what to do next! I was quite keen to have a play with Variety. When we looked at it last time (at the point of data import), we noticed there were wine grape varieties in the data, and there are lots of interesting queries we could do off the back of that, i.e. recommending different blends of wine, and so forth.

First of all, we had a look at the different varieties we had:

MATCH (v:Variety)
RETURN v.name ORDER BY v.name

Which lets us know we have 708 records for variety.

There are some interesting things going on, when we explore the list of varieties returned. For example:

  • different spellings for the same grape, e.g. Aragonez and Aragonês

  • different names for the same grape, e.g. Syrah and Shiraz

  • different ordering of wine blends, e.g. Cabernet-Shiraz and Shiraz-Cabernet

Ideally, we want to be able to clean these up, as well as being able to link up all wines that have a grape variety in common together, e.g.

MATCH (v:Variety)
WHERE tolower(v.name) CONTAINS 'shiraz'
RETURN v.name ORDER BY v.name
img13
Figure 7. We want all of these to link to Shiraz Variety!

This problem that we tackled this week pretty much follows the same approach I used when working on food ingredients data. This problem comes up a lot, and you will see it in many places. If this is something of interest, I suggest you read my BBC GoodFood blog post on it!

First of all, let’s have a look at the similar names. Once again, we are going to use APOC for helping us do the text manipulation. Before we do major surgery on our data, let’s do a bit of comparison work.

I’ve previously used both Levenstein similarity and Sorensen dice similarity for comparting text, each have their pros and cons. Where strings vary in size compared to each other, I am not a big fan of Levenstein distance - I find the value a bit meaningless as returning 3 might mean very little similarity between two words, but would suggest two paragraphs are nearly the same!

When picking your approach for comparing strings, do your homework, and be well aware of what the chosen algorithm is doing, so that you are not caught by surprise. You can look at the documentation for the various fuzzy text matching options in the APOC documentation.

Using Levenstein similarity (it returns us a value between 0-1) means we at least have two fixed points that we can apply a threshold to. Again, another warning, dear reader. When deciding what threshold value you decide to chose, beware the underfitting and overfitting problem! Even with something as simple as this, you can be caught unaware.

So, back to our comparitor query:

MATCH (v1:Variety), (v2:Variety)
WHERE ID(v1)<ID(v2)
WITH v1, v2, apoc.text.levenshteinSimilarity(v1.name, v2.name) as d
RETURN v1.name, v2.name, d ORDER BY d DESC
img14

Nice! This is picking up the varieties with similar names. However, if we scroll down a bit…​

img15

Ah, we spot a couple of problems:

  • What do we do about those barrel names? This doesn’t help the ordering problem!

  • The challenge with Levenstein similarity the longer the word gets, as long as lots of letters match, it’s going to get a good similarity score. Blanco and Blend are going to be matched if they are part of a longer word.

Before we continue on how we’re going to fix that, perhaps another thing to mention. It is highly unlikely we’re going to be able to apply 'automated' means to process the data, and get a 100% result. Underfitting and overfitting aside, we’re going to have to accept that what we do won’t be completely right, but as long as it’s 'good enough' (depending on what we define 'good enough'), then we are winning. For me, given this data set and what we’re trying, getting a few wrong varieties joined together is completely fine, as long as the vast majority are correct.

Ok, enough of that, let’s talk about how we’re going to resolve the problems above in one, fell swoop. We are going to tokenise! Time to revamp that data model (again)…​

We are going to create a new node label, VarietyName, and this will have the tokenised name(s) from Variety. For example, for 'Grenanche Blanc', there are going to be two new nodes created with the values 'Grenance' and 'Blanc', connecting back to 'Grenanche Blanc'. Like this, we do away with the ordering problem, and our words are now nice and short, so that similarity score is going to be more meainingful. Then we can do all the comparitor work on top.

img16
Figure 8. Our latest update!

We are going to add our new node following these steps:

  • Create an index for VarietyName

  • Get all the variety names, split the strings by spaces and hyphens, apply a text cleaning function (to remove accents, etc.), and then MERGE them. Attach back to originating Variety node

  • Go across and compare all the VarietyName nodes against each other, using Levenstein Similarity. Any names that match, reconnect the originating Variety node to one of the VarietyName, and delete the other one

Let’s get that index sorted:

CREATE INDEX ON :VarietyName(name)

First pass - I’m using APOC again to do a multiple split:

MATCH (v:Variety)
WITH v, apoc.text.split(v.name, "[ ]|[-]") AS names
FOREACH (n IN names|
 MERGE (vn:VarietyName {name:apoc.text.clean(n)})
 MERGE (vn)-[:IS_COMPONENT_OF]->(v)
    )

And the second pass:

MATCH (v1:VarietyName), (v2:VarietyName)-[:IS_COMPONENT_OF]->(var:Variety)
WHERE ID(v1)<ID(v2)
WITH v1, v2, apoc.text.levenshteinSimilarity(v1.name, v2.name) as d WHERE d >= 0.8
MERGE (v1)-[:IS_COMPONENT_OF]->(var)
WITH v2
DETACH DELETE v2

We decided to live dangerously and a threshold of 0.8 seemed to do a good job. Remember, folks, if we were doing this for a more serious project, we’d be a bit more careful with how we chose the threshold.

How many VarietyName do we have?

img17

Wow, that’s a fair fewer than the original Variety node.

There’s probably a bunch of other things we can do to clean it up further, and we’ll revisit that later. But for now, let’s ask some new questions!

Show Variety linked to VarietyName

MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)
WITH vn, COLLECT(v) AS var
RETURN vn.name, var, size(var) AS s
ORDER BY s DESC LIMIT 5
img18

Which VarietyName have the most Wine?

MATCH (vn:VarietyName)-[:IS_COMPONENT_OF]->(v:Variety)<-[:HAS_VARIETY]-(w:Wine)
WITH vn, COLLECT(w) AS wines
RETURN vn.name, size(wines) AS s
ORDER BY s DESC LIMIT 5
img19

Sooo, we’ll probably need to use some knowledge to deal with those, but don’t worry too much that things like 'blend' are popping up. What we will do very soon is use the VarietyName nodes to help use 'dedulicate' our Variety nodes, and the 'blend' problem will go away. More on that to come soon!

Week 7 (10 July - 8am PDT, 11pm EDT, 4pm BST, 5pm CEST): Time for some description fun!

photo 1586864985444 ddd0ba84c323?ixlib=rb 1.2
Selling rose wine by Raissa Lara Lütolf

Want to participate?

All about the descriptions!

So this week we had a look at the descriptions for the data. These are those notes you get about the wine, such as what does it taste of, smell of, etc. We hadn’t previously imported those before, so this is the week to do it!

I am really excited about the descriptions, they give us yet another layer of data to explore that allow us to see how different wines are connected together. Perhaps we can use them to discover new grape varieties we might like, based on the wine’s tasting notes. Or perhaps there’s some specific about the soil in a vineyard that we like - many interesting things to look into!

We’re going to use a similar process to what we did with the Variety nodes - tokenise the words. The difference here being that some of the descriptions are quite long - paragraphs long! There will be lots of words we’ll want to get rid of, our, 'stop' words such as 'the, as, wine' and so forth. Our focus for now is to get that data in to begin with, and then we can start the clean up.

Updates to the model

As mentioned above, we haven’t added Description to our data, and, as we’ll be tokenising the description into words, we’ll need to add a DescriptionWord as well. Let’s take a look at the next model iteration:

model8

Undoubtedly, we’ll be doing a few more revisions in the future, so don’t get too attached to that model!

Let’s load up the data

As always, the first order of the day is set up an index. We don’t need to do one for Description. It is highly unlikely we’ll ever do a search directly on the description - we won’t be MERGE-ing on it (we assume each description is unique, like the wine). However, we do need one for DescriptionWord. We’ll be doing loads of querying on that, so let’s get one set up:

CREATE INDEX ON :DescriptionWord(value)

Next, let’s load up the data. I’m going to start with Description. This is slightly different to the Twitch stream, as I’ve altered the query so that we connect Description to Wine at the time of load:

:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
CREATE (d:Description {value:coalesce(row.description, "No Description")})
WITH d, row
MATCH (w:Wine {id:row.id})
CREATE (w)-[:HAS_DESCRIPTION]->(d)

For those of you who were watching the stream, you will have spotted my error - I had put an index on id, and not name. That’s why it was taking so long! I must remind myself to check the :Schema each time I think about adding data 😀. This took less than 12s, again that’s what I’d expect. If it takes a long time, check those indexes/queries where you assume an index is being used.

Ok, let’s tokenise those description words! We’re going to use apoc.periodic.iterate() to help us

//Create the Description Word nodes
CALL apoc.periodic.iterate(
  "MATCH (d:Description) RETURN d",
  "WITH d, split(d.value, ' ') AS words
FOREACH (n IN words|
 MERGE (dw:DescriptionWord {value: lower(n)})
 MERGE (dw)-[:IS_DESCRIPTION_WORD]->(d)
    )",
  {batchSize:100, parallel:false})

We did an initial creation of the words splitting with space, but we quickly discovered that there are other things, such as slashes, hyphens, etc, we need to include as well as spaces for splitting. We also have other characters such as commas causing mischief. So, let’s have another go at loading the words to account for this. The quickest way will be to delete all the DescriptionWord nodes, and start again…​ Note that we have 71135 DescriptionWord nodes…​

img20
//Delete the DescriptionWord nodes
call apoc.periodic.commit(
"match (dw:DescriptionWord)
WITH dw LIMIT {limit}
DETACH DELETE dw
RETURN count(*)", {limit:100})

You may be wondering why is limit set so low? Don’t forget that we’ve got some very densely connected DescriptionWord nodes, so we’ll want to keep the limit low as they might be connected to 80k+ wines!

Ok, let’s load those nodes back in. We’ll also use 'apoc.text.clean()' to strip out commas, accents, and the like:

CALL apoc.periodic.iterate(
  "MATCH (d:Description) RETURN d",
  "WITH d, apoc.text.split(d.value, '[ ]|[-]|[/]') AS words
FOREACH (n IN words|
 MERGE (dw:DescriptionWord {value: apoc.text.clean(n)})
 MERGE (dw)-[:IS_DESCRIPTION_WORD]->(d)
    )",
  {batchSize:100, parallel:false})
img21

Much better! Down to 36344 nodes…​ but we have lots of numbers…​ How many do we have?

MATCH (d:DescriptionWord)
WHERE size(apoc.text.replace(d.value,"[0-9]", ""))=0
RETURN count(d)

956…​ Ok - that’s for DescriptionWord nodes with just numbers, do we have any that have letters and numbers?

MATCH (d:DescriptionWord)
WITH size(apoc.text.replace(d.value,"[0-9]", "")) as hasNum, d
WHERE hasNum < size(d.value) AND hasNum >0
RETURN d.value
img22

I think it’s safe to say we can get rid of all of those, so let’s do that now:

MATCH (d:DescriptionWord)
WHERE size(apoc.text.replace(d.value,"[0-9]", ""))<size(d.value)
DETACH DELETE d

And we’re down to 35086 DescriptionWord nodes, far less than originally! We’ll still have to deal with plurals and the sort - but let’s leave that for next week.

So, what are the most commonly used DescriptionWord values? I’m going to guess there are a lot of stop words taking pole position…​

MATCH (dw:DescriptionWord)-[:IS_DESCRIPTION_WORD]->(d:Description)
WITH dw, COLLECT(d) AS descriptions
RETURN dw.value, size(descriptions) as size ORDER BY size DESC
img23

We’ll have a look at those next week!

We then finished off the week looking at the similarity of the words. We quickly discovered we needed to do a fair bit of throttling on the words so that relevant words were compared together, e.g.

MATCH (d1:DescriptionWord), (d2:DescriptionWord)
WHERE id(d1)<id(d2) AND size(d1.value) > 7 AND size(d1.value) <10 AND size(d2.value) > 7 AND size(d2.value) <10 AND left(d1.value,1)=left(d2.value,1)
WITH d1, d2, apoc.text.levenshteinSimilarity(d1.value, d2.value) as res
RETURN d1.value, d2.value, res ORDER BY res DESC

To try and make sure we are comparing similar words to begin with, we are:

  • Making sure that the words are within 2-3 characters in length of each other

  • Making sure they start with the same letter

img24

We can see we have some very interesting things to deal with. Along with plurals, we also have spelling mistakes. Again, another one we’ll take a closer look at next week. Should be fun!

Week 8 (17 July - 8am PDT, 11am EDT, 4pm BST, 5pm CEST, 8:30pm IST): Let’s do more with descriptions!

photo 1521153144914 aae4bcd3d201?ixlib=rb 1.2
Nimb Brasserie, København, Denmark by Nick Karvounis

Want to participate?

Welcome back, dear oenophiles! We’re picking up where we left off last week, dealing with the various irregularities around plurals, stop words and so forth.

We had a look at a rough distribution of how many of each word length we had in DescriptionWord:

MATCH (dw:DescriptionWord)
RETURN size(dw.value) as size, count(dw) as c ORDER by c DESC
img25

Interestingly we have a few very long words. We had a look at those and discovered that we missed some things to split by (commas, long hyphens). We’ll need to redo that soon!

We also took the 'live by the seat of our pants' philosophy to assume that all words 3 characters or less are stop words, so we removed all of those:

MATCH (dw:DescriptionWord)
WHERE size(dw.value) < 4
DETACH DELETE dw

We also used Levenshtien Similarity to remove similar words, which helpfully deals with plurals for us:

//Remove duplicate description words by similarity
CALL apoc.periodic.iterate(
  "MATCH (d1:DescriptionWord), (d2:DescriptionWord)-[:IS_DESCRIPTION_WORD]->(d:Description)
  WHERE id(d1)<id(d2) AND left(d1.value,1)=left(d2.value,1) AND size(d1.value) > 7 AND size(d1.value) <11 AND 	size(d2.value) > 7 AND size(d2.value) <11
	WITH d1, d2, apoc.text.levenshteinSimilarity(d1.value, d2.value) as res WHERE res >0.88
  RETURN d1, d2, d",
  "MERGE (d1)-[:IS_DESCRIPTION_WORD]->(d)
	WITH d2
	DETACH DELETE d2)",
  {batchSize:10, parallel:false})

We used apoc.periodic.iterate() because there are some very densely connect nodes in there, so we want to do them in small batches.

There’s more stuff we can do to clean up, and we’ll tackle that later/I’ll do it on the sly and let you know what to do.

Let’s do some fun stuff!

So one of the things I was really keen to try out was how did Description compare with grape Variety. Going straight into the deep end, let’s try this…​

MATCH (v:VarietyName)
WITH v
MATCH (dw:DescriptionWord {value:v.name})
WITH dw
MATCH (dw)-[:IS_DESCRIPTION_WORD]->(d:Description)<-[:HAS_DESCRIPTION]-(wine:Wine)-[:HAS_VARIETY]->(v:Variety)
WHERE dw.value <> vn.name
RETURN DISTINCT v.name, d.value, wine.title limit 50

What we’re doing here:

  • Get all the VarietyName nodes (tokenised grape varieties)

  • Filter all of the DescriptionWords against the VarietyName nodes to keep the ones that match

  • Using the matching DescriptionWords, pull back all of the Wine and Variety nodes that match. Note that the Variety nodes don’t necessarily have to link back to the original VarietyName nodes!

For those of you who saw my stream for this week, some of the results that came back made me very excited indeed!

img26

Take a close look at result 3 - the variety name listed is 'Rhone-style Red Blend' - no mention of any specific grapes, just 'Red Blend'. That’s not particularly useful if you wanted to recommend a wine based on specific grames! But…​ graphs to the rescue! What we’ve been able to do is use a combination of the description and the variety (over the whole data set) to figure out what are grape varieties within the description…​ and then use that to tell us what the grape varieties are for 'Red Blend' 🤯. We can use this to enrich our data! We can now say what are the grapes used in these mysterious blends.

Let’s go one further, and specifically look for those wines that are declared as 'Red Blends':

MATCH (v:VarietyName)
WITH v
MATCH (dw:DescriptionWord {value:v.name})
WITH dw
MATCH (dw)-[:IS_DESCRIPTION_WORD]->(d:Description)<-[:HAS_DESCRIPTION]-(wine:Wine)-[:HAS_VARIETY]->(v:Variety)<-[:IS_COMPONENT_OF]-(vn:VarietyName)
WHERE dw.value <> vn.name
    AND tolower(v.name) contains('red')
RETURN DISTINCT v.name, d.value, wine.title limit 50
img27

Amazing!

So, whilst we’re here, let’s find out what are the popular grapes in this 'red blend' variety…​

MATCH (v:VarietyName)
WHERE NOT v.name in ['black', 'red', 'white', 'blend','style', 'other']
WITH v //LIMIT 20
MATCH (dw:DescriptionWord {value:v.name})
WITH dw
MATCH (dw)-[:IS_DESCRIPTION_WORD]->(d:Description)<-[:HAS_DESCRIPTION]-(wine:Wine)-[:HAS_VARIETY]->(v:Variety)
WHERE tolower(v.name) contains("red blend")
WITH wine, dw ORDER BY dw.value
WITH wine, collect(dw.value) as grapes
RETURN grapes, count(grapes) as popularity order by popularity desc

Allow me to briefly talk you through this query…​

  • Firstly we want to avoid all the variety names that contain generic words. We did a quick query, and ascertained the above array was it

  • Then we match those description words to the variety names

  • After, we pull back all of the wines that match that have 'red blend' as a variety

  • Then we do a count against against the most popular grape varieties to wine!

img28

There’s probably a bit more data cleaning to go, but the point still stands - this is a very powerful way we can use all of these connections and the data that we’ve been processing to enrich our understanding further. I think we’re going to be having a lot of fun with this in future episodes!

Week 9 (17 July - 8am PDT, 11am EDT, 4pm BST, 5pm CEST, 8:30pm IST): Cleaning up the description words

photo 1498811077893 814242077ef6?ixlib=rb 1.2
New by Tim Mossholder

Want to participate?

A brief update for this week! We spent a bit of time this week looking at how could we remove some of the common words away from the wine description, such as stop words, verbs, etc. that do not really describe the wine!

The goal behind this is so that we can try and eventually be able to recommend wines based on certain description words, e.g. 'berries', or 'citrus'.

We did bit of a hunt around the internet for potential lists of stop words, and found the following:

Having downloaded these lists and got them into a csv-friendly format, we then used them to search for matching description words, and then remove:

call apoc.periodic.commit(
"LOAD CSV WITH HEADERS FROM 'file:///sw1.csv' as row
MATCH (dw:DescriptionWord {value:apoc.text.clean(row.i)})
WITH dw limit $limit
detach delete dw
RETURN count(*)", {limit:10})

The verbs were in table format, so with a bit of unwind magic, we can wrangle that too:

call apoc.periodic.commit("
LOAD CSV WITH HEADERS FROM 'file:///sw2.csv' as row
WITH [row.a,row.b,row.c,row.d,row.e] as words
unwind words as word
MATCH (dw:DescriptionWord {value:apoc.text.clean(word)})
with dw limit $limit
detach delete dw
return count(*)",
{limit:20})

Week 10 (11 September - 8am PDT, 11am EDT, 4pm BST, 5pm CEST, 8:30pm IST): More work on the description words

photo 1500669950830 1a4f3a4b0879?ixlib=rb 1.2
Collection of wine bottles by Rebecca Matthews

Want to participate?

We’ve spoken about some of the challenges with have with the description words, including with all of the various splitters. I propose this week we:

  • Remove and redo the description words, taking into account all splitters

  • Use word similarity comparison to remove similar words

  • Use the stop word and verb lists to remove those common words

  • Examine what we’ve got left!

I will update this accordingly, along with proposed clean up for the description words eventually :)

Week 11 (5 October - 8am PDT, 11am EDT, 4pm BST, 5pm CEST, 8:30pm IST) - Adding the rating

photo 1545729317 a7beae3bf432?ixlib=rb 1.2
A view to drink by Daniel Vogel

Want to participate?

This week we looked at adding in the ratings the reviewers gave for the wines. If you recall, we have the following set up with the data:

  • Only one review per wine

  • Reviews are tied to the reviewer

  • Points theoretically range between 0-100

Originally, in our data model, we assigned for now that the review would be a relationship property between Wine and Reviewer:

model8

Now - there’s potentially a problem here. We have approx 120k wines, which means up to 120k points…​ which will mean up to 120k relationships between reviewer and wine. This may be an issue if we want to do things with those points, and if we do, we potentially have to scan every single relationship to look at it’s score property. Not great…​

In this scenario it might be worth thinking out of the box (or bottle?!) and consider how many different points we may have for the reviews.

So, how many Tasters do we actually have? We’ll also check out how many wine’s they’ve reviewed too. Let’s take a look:

MATCH (t:Taster)-[:RATES_WINE]->(w:Wine)
WITH t, COLLECT(w) AS wines
RETURN t.name AS Taster, SIZE(wines) AS s ORDER BY s DESC
img31

Roger has tasted how many wines?!?!?!? 😱 Roger certainly shows dedication to the task at hand 😂

We have a total of 19 named reviewers in our data, and a potential of 101 different points (0-100), so if we had the points as nodes, worst case scenario, we’d have to look up 19 * 101 = 1919 nodes for points. That’s significantly better than looking up 120k properties!

I’ve also got a hunch that we don’t even have that many points available. So, first thing’s first, let’s have a look.

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH toInteger(row.points) AS point
RETURN DISTINCT point ORDER BY point

Ah…​ in fact, the points only range between 80-100! Whilst this now has shattered my faith in all things related to points awarded to wines and how seriously I can take them, it does mean that worst case scenario, we’re only looking at 19 * 21 nodes to look up (not all reviewers will award all those points).

You know what this means. Time to refactor that model, oh yes!

model9

And of course, it is also time to load in our Points data too. We are going to do this in two parts. The first part we’re going to generate the range of possible points available, and assign them to the wine tasters. We’ll also assign the Points nodes to 'No Taster'. Maybe that will come in useful at some point in the future. Then we’ll map wines to reviewers point scores.

First of all, we’ll remove that existing relationship between Taster and Wine:

CALL apoc.periodic.commit("
MATCH (t:Taster)-[r:RATES_WINE]->(w:Wine)
WITH r LIMIT $limit
DELETE r
RETURN COUNT(*)",
{limit:10})

Now let’s add the Points nodes to Taster:

//Add the Points nodes to reviewers
WITH range(80,100) AS points //generate an array with values from 80-100
UNWIND points AS point //unwind the array
MATCH (t:Taster)
CREATE (p:Points {value:point})<-[:GAVE_POINTS]-(t)

And finally the Points to Wine:

:auto //add this line if you're using Neo4j Browser
//Add the reviewers to points nodes based on wines they tasted
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH COALESCE(row.taster_name, 'No Taster') AS taster, row WHERE row.points <> ""
MATCH (t:Taster {name:taster})-[:GAVE_POINTS]->(p:Points {value:toInteger(row.points)})
MATCH (w:Wine {id:row.id})
CREATE (p)<-[:HAS_POINTS]-(w)

Excellent! All done. Now, one thing that grabbed my attention was those 100-point wines! Let’s find out what they were and who reviewed them:

MATCH (p:Points {value:100})<-[:HAS_POINTS]-(w:Wine)-[:HAS_VARIETY]-(v:Variety)
MATCH (p)<-[:GAVE_POINTS]-(t:Taster)
RETURN t.name AS `Reviewer`,  w.title AS `Wine title`, v.name AS `Grape variety` ORDER BY `Grape variety`
img29

There’s Roger again. Although, I guess that isn’t much of a surprise, given what we learned earlier!

Someone suggested we looked at the grape variety of wines wines rated 90 and above, so let’s do that:

MATCH (p:Points)<-[:HAS_POINTS]-(w:Wine)-[:HAS_VARIETY]-(v:Variety)
WHERE p.value>89
WITH v.name AS grapes, collect(v) AS freq
RETURN grapes, size(freq) AS `well rated` ORDER BY `well rated` DESC
img30

Rather well-rated, that Pinot Noir!

A viewer suggested we check out the following question. Let’s have a look at that soon:

  • What´s the best value wine, according to the tasters?

That’s all for now. See you next week!

Week 12 (12 October - 8am PDT, 11am EDT, 4pm BST, 5pm CEST, 8:30pm IST) - Adding and exploring the price

photo 1581919769013 950a210fe5ea?ixlib=rb 1.2
Collection of wine bottles by Alexander Gamanyuk

Want to participate?

This week we decided to have a look at the wine price data. Given our 'discovery' from the previous week that we only had 21 different scores available for 120k wines (and therefore we went for 21 Point nodes, rather than 120k properties on relationships), we thought we’d check out what the situation was on prices:

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH toInteger(row.price) AS price WHERE price <>""
RETURN DISTINCT price ORDER BY price

So we’ve only got 390 different wine prices. You know what this means…​ We’re also going to have a Price node :). Time to refactor that data model once again:

model10

Also, let’s get that price data in too:

:auto //add this line if you're using Neo4j Browser
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/wine/master/data/winemag-data-130k-v3.csv' AS row
WITH row.price AS price, row.id AS id WHERE price <>""
MATCH (w:Wine {id:id})
MERGE (p:Price {value:toInteger(price)})
CREATE (p)<-[:HAS_PRICE]-(w)

When we were checking what wine prices we had, we did spot some expensive wines…​ Let’s check those out again, along with what points they scored:

MATCH (point:Points)<-[:HAS_POINTS]-(w:Wine)-[:HAS_PRICE]->(p:Price)
WHERE p.value >=1000
RETURN w.title AS Title, p.value AS Price, point.value AS Points ORDER BY p.value
img32

So, despite having a wine that cost $3300(!), it only scored 88 points, shocking! There were a couple of insanely priced wines that scored a perfect 100 though.

For a bit of contect, let’s have a look at the wines, in ascending order of wine, that scored 97 and above points, and their price and variety:

MATCH (p:Price)<-[:HAS_PRICE]-(w:Wine)-[:HAS_POINTS]->(point:Points)
WHERE point.value >96
WITH p, w, point
MATCH (w)-[:HAS_VARIETY]->(v:Variety)
RETURN w.title AS Title, v.name AS Variety, point.value AS POINTS, p.value AS Price ORDER BY Price
img33

That’s right, folks! You can get yourself some well-rated wine without having to dip your hand early into the pension fund. For a mere $35, you can pick up a fine bottle of Syrah. The lowest priced wine scoring 100 (also a Syrah, yum!) comes in at $80. Ok, bit on the pricy side, but you don’t have to raid the kid’s education fund to sample some of the good stuff.

Time for a bit of stats and back to those points

So, we decided to do a bit of exploring. Remember our 19 named tasters? Let’s find out what is the cheapest and most expensive wine they’ve sampled, along with the average price:

MATCH (p:Price)<-[:HAS_PRICE]-(w:Wine)-[:HAS_POINTS]->(point:Points)<-[:GAVE_POINTS]-(t:Taster)
WHERE t.name <> 'No Taster'
WITH t, p.value AS prices
RETURN t.name AS Taster, min(prices) AS Cheapest, max(prices) AS MostExpensive, avg(prices) AS Average ORDER BY MostExpensive
img34

Ok, so we have some interesting ranges, but the averages pretty much sit around $20-40. But how big the range? If we’ve got a big variance around the price, then that’s not a meaningful view for the 'average' price of the wines being reviewed. Let’s add the standard deviation value to that picture:

MATCH (p:Price)<-[:HAS_PRICE]-(w:Wine)-[:HAS_POINTS]->(point:Points)<-[:GAVE_POINTS]-(t:Taster)
WHERE t.name <> 'No Taster'
WITH t, p.value AS prices
RETURN t.name AS Taster, min(prices) AS Cheapest, max(prices) AS MostExpensive, avg(prices) AS Average, stdev(prices) ORDER BY MostExpensive

Wow, there can be a fair bit of variance there! What does this mean? The larger the standard deviation, the more the prices for the taster vary wildly away from the average (mean) price.

Let’s leave that there for now. A viewer was quite keen to see which Countries had the best scoring wines. Let’s check that out now

MATCH (p:Points)<-[:HAS_POINTS]-(w:Wine)-[:FROM_WINERY]->(:Winery)-[:FROM_PROVENCE]->(:Province)-[:PROVINCE_COUNTRY]->(c:Country)
WHERE c.name <> 'No Country'
RETURN c.name, min(p.value) AS min, max(p.value) AS max, avg(p.value) AS avg ORDER BY avg DESC
img35

Oh yeah! Get in, check out our fine selection of English wines - admittedly, we do some very good sparking wines. I’m a big fan of the Ridgeview stuff myself.

That’s all for this week. If there’s something you’d like me to look into the next episode, let me know via the usual channels.

Week 13 (26 October - 8am PDT, 11am EDT, 4pm UTC, 5pm CET, 9:30pm IST) - A week of reflection

photo 1543060534 2c124acc29ba?ixlib=rb 1.2
Wine bottles by Francesco Ungaro

Want to participate?

Not much of a write-up for this week :)

We took the opportunity this session to reflect on what we’ve done on this wine journey, the decisions made along the way, and where we’re up to now.

Week 14 (9 November - 8am PST, 11am EST, 4pm UTC, 5pm CET, 9:30pm IST) - Location, Location, Location!

photo 1508471608746 b7f6b8a5b0b0?ixlib=rb 1.2
Holidaying with my friend by Elle Hughes

Want to participate?

Week 15 (23 November - 8am PST, 11am EST, 4pm UTC, 5pm CET, 9:30pm IST) - Blooming Marvelous

photo 1503427128716 12b0ed4822bb?ixlib=rb 1.2
Stellenbosch, South Africa by Matthieu Joannon

Want to participate?

Hello everybody! This week was Neo4j Bloom week, by popular request the previous week. And didn’t we have fun?

We did a mix of show and tell where we jumped between different parts of the Bloom app, and implementing actual functionality. I’d heartily suggest you watch the catch-up session to see what exactly we covered.

We looked at:

*

Week 16 (30 November - 8am PST, 11am EST, 4pm UTC, 5pm CET, 9:30pm IST) - Let’s sort that location

photo 1516154767575 2146adebdf32?ixlib=rb 1.2
Catania, Italy by Brandy Turner

Want to participate?

  • Download and install Neo4j Desktop

  • Create a new project in Neo4j Desktop called wine, and add a database. You may find this developer guide helpful

  • Complete the steps for Week 1 - 2, 5-8, 11-12 to load the data

So this week we’re going to pick up where we left off in week 14, and make those location changes!

About

Exploring a wine data set with graphs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published