<a href="https://colab.research.google.com/github/kleczekr/tolkenizer/blob/master/scraping_avocado_post_2_retake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from tabulate import tabulate
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [3]:
products = pd.read_csv('drive/My Drive/deite/avocado/wohnen.csv')
sales = pd.read_csv('drive/My Drive/deite/avocado/sales_randint.csv')
pageviews = pd.read_csv('drive/My Drive/deite/avocado/pageviews_randint.csv')

In [4]:
products.head(2)

Unnamed: 0,name,link,price,postage,brand,metadescription,description,criteria,category
0,"\nKopfkissen Bio Lotus Natural®, Bio-Baumwolle...",/products/128594-kopfkissen-bio-lotus-natural-...,"\n\n\n\n\n24,77\n\n€\n\n","\nVersand 5,95 €\n/\nVersandkostenfrei ab 90,0...",\nLotus Design\n,Bio Kopfkissen befüllt mit Kapok; Kapok Schlaf...,"\n\nKopfkissen Bio Lotus Natural®, Bio-Baumwol...",\n\nAvocadostore-Kriterien\n\n\n\n\nRohstoffe ...,wohnen
1,\nGeschirrtuch aus Halbleinen mit Bio-Baumwoll...,/products/201249-geschirrtuch-aus-halbleinen-m...,"\n\n\n\n\n10,23\n\n€\n\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",\nKONTOR 1710\n,Geschirrtuch aus Bio-Baumwolle & Leinen fussel...,\n\nGeschirrtuch aus Halbleinen mit Bio-Baumwo...,\n\nAvocadostore-Kriterien\n\n\n\n\nRohstoffe ...,wohnen


In [5]:
sales.head(2)

Unnamed: 0,articles_sold,unique_sales,total_revenue,average_revenue,article_name
0,83,55,20,99,XL Set Vorratsgläser aus Glas mit Deckel aus 1...
1,94,40,33,41,Wiederverwendbarer Lebensmittelbeutel ”Large” ...


In [6]:
pageviews.head(2)

Unnamed: 0,pageviews,unique_pageviews,time_on_page,path
0,351,770,977,/220612-xl-set-vorratsglaeser-aus-glas-mit-dec...
1,794,744,590,/203165-wiederverwendbarer-lebensmittelbeutel-...


It seems that we can connect the sales data with the product data based on name (product dataframe) and article_name (sales dataframe), but we need to remove the newline characters from the product data first. While we are at it, we can also collapse the runs of newline characters in some other columns of the product df.

In [7]:
# cleaning the line breaks
# we don't want to get rid of all of the newline characters. in name and brand
# we remove them entirely; in the other columns we collapse each run of
# newline characters into a single newline character.
# regex=True is passed explicitly, because newer pandas versions treat the
# pattern as a literal string by default
products.name = products.name.str.replace(r'\n', '', regex=True)
products.brand = products.brand.str.replace(r'\n', '', regex=True)
for col in ['price', 'postage', 'description', 'criteria']:
  products[col] = products[col].str.replace(r'\n+', '\n', regex=True)
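
To see what this cleanup does, here is a minimal sketch on two made-up values shaped like the scraped columns. The `raw` Series is an assumption for illustration only; it is not read from the actual CSV.

```python
import pandas as pd

# Toy values shaped like the scraped columns (made up for illustration).
raw = pd.Series(["\n\n\n\n\n24,77\n\n€\n\n", "\nLotus Design\n"])

# Collapse every run of newlines into a single newline.
collapsed = raw.str.replace(r"\n+", "\n", regex=True)

# Remove newlines entirely, as done for the name and brand columns.
stripped = raw.str.replace("\n", "", regex=False)

print(collapsed.tolist())
print(stripped.tolist())
```

Passing `regex` explicitly avoids depending on the default, which differs between pandas versions.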

When it comes to the pageview data, we can probably connect the two dataframes through the columns link (product DF) and path (pageview DF). Notice that the path recorded in the pageview DF does not have the '/products' prefix, so we need to get rid of it in the product DF.

In [8]:
products.link = products.link.str.replace(r'/products', '')
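
Some product links also carry a `?variant_id=...` query string. The sketch below shows one way to anchor the prefix removal to the start of the string and to drop the query string as well. The toy `links` Series is made up, and the notebook does not establish whether stripping the query string would recover any matches in this dataset, so treat it as an optional normalisation step.

```python
import pandas as pd

# Toy links modelled on the ones in the product dataframe.
links = pd.Series([
    "/products/41162-feueranzuender-if-you-care-iyc",
    "/products/221340-22-tuerchen-bluetooth-box-kreafunk?variant_id=1743177",
])

# Remove the prefix only where it starts the string, then drop any
# query string (everything from the first '?' onwards).
normalised = (
    links.str.replace(r"^/products", "", regex=True)
         .str.replace(r"\?.*$", "", regex=True)
)
print(normalised.tolist())
```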

In [9]:
products.shape

(144, 9)

In [10]:
sales.shape

(143, 5)

In [11]:
pageviews.shape

(143, 4)

In [12]:
product_views = pd.merge(products, pageviews, left_on='link', right_on='path', how='outer')

In [13]:
product_views.head()

Unnamed: 0,name,link,price,postage,brand,metadescription,description,criteria,category,pageviews,unique_pageviews,time_on_page,path
0,"Kopfkissen Bio Lotus Natural®, Bio-Baumwolle k...",/128594-kopfkissen-bio-lotus-natural-r-bio-bau...,"\n24,77\n€\n","\nVersand 5,95 €\n/\nVersandkostenfrei ab 90,0...",Lotus Design,Bio Kopfkissen befüllt mit Kapok; Kapok Schlaf...,"\nKopfkissen Bio Lotus Natural®, Bio-Baumwolle...",\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,,,,
1,Geschirrtuch aus Halbleinen mit Bio-Baumwolle ...,/201249-geschirrtuch-aus-halbleinen-mit-bio-ba...,"\n10,23\n€\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",KONTOR 1710,Geschirrtuch aus Bio-Baumwolle & Leinen fussel...,\nGeschirrtuch aus Halbleinen mit Bio-Baumwoll...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,,,,
2,Feueranzünder,/41162-feueranzuender-if-you-care-iyc,"\n2,90\n€\n(8,89€/100stück)\n","\nVersand 4,90 €\n/\nVersandkostenfrei ab 100,...",promavis,ideal für Holzkohlegrill oder Kachelöfen,\nFeueranzünder\nFeueranzünder aus FSC zertifi...,\nAvocadostore-Kriterien\nRecycelt & Recycleba...,wohnen,463.0,29.0,191.0,/41162-feueranzuender-if-you-care-iyc
3,Duschseife Milchsamt BIO - mit Kokosmilch,/186601-duschseife-milchsamt-bio-mit-kokosmilc...,"\n7,95\n€\n(8,37)\n","\nVersand 5,90 €\n/\nVersandkostenfrei ab 100,...",ASAVO,"Duschseife Milchsamt - handgemachte, vegane Ko...",\nDuschseife Milchsamt BIO - mit Kokosmilch\nB...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,188.0,922.0,372.0,/186601-duschseife-milchsamt-bio-mit-kokosmilc...
4,Bio Schoko-Kürbis Zartbitter,/49787-bio-schoko-kuerbis-zartbitter-landgarten,"\n2,49\n€\n(4,98/100g)\n","\nVersand 4,90 €\n/\nVersandkostenfrei ab 100,...",promavis,Bio Landgarten Bio Schoko-Kürbis Zartbitter Ve...,\nBio Schoko-Kürbis Zartbitter\nBio Schoko-Kür...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,466.0,882.0,653.0,/49787-bio-schoko-kuerbis-zartbitter-landgarten


In [14]:
product_views.shape

(244, 13)

In [15]:
for index, row in product_views.iterrows():
  print('Link: {}\nPath: {}\nLink==Path: {}\n{}'.format(row.link,
                                                     row.path,
                                                     row.link==row.path,
                                                     '*'*60))

Link: /128594-kopfkissen-bio-lotus-natural-r-bio-baumwolle-kba-fuellung-kapokfaser-lotus-natural-r?variant_id=932804
Path: nan
Link==Path: False
************************************************************
Link: /201249-geschirrtuch-aus-halbleinen-mit-bio-baumwolle-50-x-70-cm-aspegren?variant_id=1560821
Path: nan
Link==Path: False
************************************************************
Link: /41162-feueranzuender-if-you-care-iyc
Path: /41162-feueranzuender-if-you-care-iyc
Link==Path: True
************************************************************
Link: /186601-duschseife-milchsamt-bio-mit-kokosmilch-asavo-1
Path: /186601-duschseife-milchsamt-bio-mit-kokosmilch-asavo-1
Link==Path: True
************************************************************
Link: /49787-bio-schoko-kuerbis-zartbitter-landgarten
Path: /49787-bio-schoko-kuerbis-zartbitter-landgarten
Link==Path: True
************************************************************
Link: /208453-saatgutkonfetti-kompostierbares-konfet

In [16]:
for index, row in product_views.iterrows():
  if row.link!=row.path:
    print('{}\n{}'.format(row.link, row.path))

/128594-kopfkissen-bio-lotus-natural-r-bio-baumwolle-kba-fuellung-kapokfaser-lotus-natural-r?variant_id=932804
nan
/201249-geschirrtuch-aus-halbleinen-mit-bio-baumwolle-50-x-70-cm-aspegren?variant_id=1560821
nan
/208453-saatgutkonfetti-kompostierbares-konfetti-das-saatgut-24-heimischer-wild-pflanzenarten-enthaelt-saatgutkonfetti?variant_id=1697669
nan
/96254-spork-bambus-mini-besteck-fuer-unterwegs-bambu
nan
/194120-costa-rica-cola-el-puente
nan
/74236-rasierpinsel-kunsthaar-mit-olivenholzgriff-vegan-olivenholz-erleben
nan
/221340-22-tuerchen-bluetooth-box-kreafunk?variant_id=1743177
nan
/152740-glas-trinkhalme-glas-strohhalme-trinkhalm-glas-plus-buerste-6st-215cm-ecoyou
nan
/209325-veganes-kaugummi-himbeere-vanille-true-gum
nan
/43560-schafmilch-seife-herz-saling-naturprodukte
nan
/144074-ras-el-hanout-bio-gewuerzmischung-mill-und-mortar
nan
/197524-becher-elsa-tranquillo
nan
/174876-bio-saatgut-box-fuer-das-ganze-jahr-jahresbox-gartenjahr-rankwerk
nan
/193124-blumenwiese-fuer-bienen-

It seems that there is a large number of articles for which we do not have pageview data. It took me quite some time to figure out why. Then I remembered: I produced the pageview data on a different day than the CSV file listing the products, so it is effectively based on a different selection of products. It appears that either there is some randomness in which products appear first on the Avocadostore homepage, or the selection is updated periodically.
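
Printing every row is one way to spot the mismatches; counting them is another. The following sketch uses the `indicator` parameter of `pd.merge` on toy frames. The column names follow the notebook, while `products_toy`, `pageviews_toy` and their values are made up.

```python
import pandas as pd

# Toy frames; the column names follow the notebook, the values are made up.
products_toy = pd.DataFrame({"link": ["/a", "/b", "/c"]})
pageviews_toy = pd.DataFrame({"path": ["/b", "/c", "/d"], "pageviews": [10, 20, 30]})

merged = pd.merge(products_toy, pageviews_toy, left_on="link", right_on="path",
                  how="outer", indicator=True)

# _merge is 'left_only', 'right_only' or 'both' for each row.
counts = merged["_merge"].value_counts()
print(counts)
```

Applied to the real frames, the `left_only` count would be the number of products without pageview data.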

I thought about updating both files to make them compatible, but decided against it. Files from different sources rarely match perfectly, and sometimes a merge produces more missing values than present ones. Let's truncate the merged dataframe, keeping the first 144 rows, which correspond to the products in the product DF (the pageview data for products that are not in the product DF is not needed), and replace the missing values with 0s.

In [17]:
product_views = product_views[:144]

In [18]:
product_views.shape

(144, 13)
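
The truncation above relies on the outer merge placing all product rows first, which may depend on the pandas version. A left merge keeps only the rows of the left dataframe, so no positional slicing is needed. This is a minimal sketch on made-up toy frames, not the notebook's actual data.

```python
import pandas as pd

# Toy frames; names and values are made up for illustration.
products_toy = pd.DataFrame({"name": ["A", "B", "C"], "link": ["/a", "/b", "/c"]})
pageviews_toy = pd.DataFrame({"path": ["/d", "/b", "/c"], "pageviews": [30, 10, 20]})

# how='left' keeps every product row and drops pageview rows without a product.
left = pd.merge(products_toy, pageviews_toy, left_on="link", right_on="path", how="left")
print(left.shape)
print(left["name"].tolist())
```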

In [19]:
# drop the path column, since it is the same as the link column
product_views = product_views.drop('path', axis=1) 

In [20]:
# fill nan values with zeroes
product_views = product_views.fillna(0)
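
Calling `fillna(0)` on the whole dataframe also writes 0 into any missing text cells. Whether the text columns here contain missing values is not shown in the notebook, so the following is only a sketch of a narrower variant that fills numeric columns only, on a made-up `toy` frame.

```python
import pandas as pd

# Toy frame with one missing text value and two missing numbers.
toy = pd.DataFrame({
    "name": ["A", None],
    "pageviews": [463.0, None],
    "time_on_page": [None, 191.0],
})

# Restrict the fill to numeric columns; text columns are left untouched.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(0)
print(toy)
```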

In [21]:
product_views.tail()

Unnamed: 0,name,link,price,postage,brand,metadescription,description,criteria,category,pageviews,unique_pageviews,time_on_page
139,Bulgarische Joghurtkulturen - Naturjoghurt sel...,/176295-bulgarische-joghurtkulturen-naturjoghu...,"\n14,99\n€\n19,99\n€\n(2,50€/1g)\n","\nVersand 6,90 €\n/\nVersandkostenfrei ab 75,0...",Wellness-Drinks,Echten Bulgarischen Joghurt (Kiselo Mlyako) en...,\nBulgarische Joghurtkulturen - Naturjoghurt s...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,268.0,497.0,764.0
140,SoulSpice Sweet Kashmir Mango Curry Gewürzmix,/223037-soulspice-sweet-kashmir-mango-curry-ge...,"\n8,20\n€\n(16,40€/100g)\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",KONTOR 1710,"Mit edler Bio-Mango, Ingwer, Kurkuma, Zimt und...",\nSoulSpice Sweet Kashmir Mango Curry Gewürzmi...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0
141,"Baumschmuck aus Holz, verschiedene Motive",/182279-baumschmuck-aus-holz-verschiedene-moti...,"\n5,90\n€\n6,50\n€\n","\nVersand 4,90 €\n/\nVersandkostenfrei ab 99,0...",Mitienda Shop,Baumschmuck aus Holz. Handgemachter Christbaum...,"\nBaumschmuck aus Holz, verschiedene Motive\nH...",\nAvocadostore-Kriterien\nHaltbar\nMit der ide...,wohnen,0.0,0.0,0.0
142,Mousepad aus recyceltem Leder,/11529-mousepad-aus-recyceltem-leder-vireo?var...,"\n5,90\n€\n6,90\n€\n","\nVersand 5,90 €\n/\nVersandkostenfrei ab 600,...",Vireo,Mit Deinem Mousepad aus recycletem Naturleder ...,\nMousepad aus recyceltem Leder\nMOUSEPAD AUS ...,\nAvocadostore-Kriterien\nRecycelt & Recycleba...,wohnen,0.0,0.0,0.0
143,Handcreme (fest) | Biokosmetik | Biologisch | ...,/184966-handcreme-fest-biokosmetik-biologisch-...,"\n12,90\n€\n(43,00 €/100g)\n","\nVersand 3,90 €\n/\nVersandkostenfrei ab 35,0...",Daumenschmaus,nachhaltige feste Handcreme - Seife - Naturkos...,\nHandcreme (fest) | Biokosmetik | Biologisch ...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0


Now we merge this dataframe with the sales data. This could be more problematic: I did some work on the product names, so they are not 100% compatible, and that was before I knew that the titles did not match to begin with!

In [22]:
df = pd.merge(product_views, sales, left_on='name', right_on='article_name', how='outer')

In [23]:
df.shape

(246, 17)

In [24]:
df.head()

Unnamed: 0,name,link,price,postage,brand,metadescription,description,criteria,category,pageviews,unique_pageviews,time_on_page,articles_sold,unique_sales,total_revenue,average_revenue,article_name
0,"Kopfkissen Bio Lotus Natural®, Bio-Baumwolle k...",/128594-kopfkissen-bio-lotus-natural-r-bio-bau...,"\n24,77\n€\n","\nVersand 5,95 €\n/\nVersandkostenfrei ab 90,0...",Lotus Design,Bio Kopfkissen befüllt mit Kapok; Kapok Schlaf...,"\nKopfkissen Bio Lotus Natural®, Bio-Baumwolle...",\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0,,,,,
1,Geschirrtuch aus Halbleinen mit Bio-Baumwolle ...,/201249-geschirrtuch-aus-halbleinen-mit-bio-ba...,"\n10,23\n€\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",KONTOR 1710,Geschirrtuch aus Bio-Baumwolle & Leinen fussel...,\nGeschirrtuch aus Halbleinen mit Bio-Baumwoll...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0,,,,,
2,Geschirrtuch aus Halbleinen mit Bio-Baumwolle ...,/201249-geschirrtuch-aus-halbleinen-mit-bio-ba...,"\n10,23\n€\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",KONTOR 1710,Geschirrtuch aus Bio-Baumwolle & Leinen fussel...,\nGeschirrtuch aus Halbleinen mit Bio-Baumwoll...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0,,,,,
3,Feueranzünder,/41162-feueranzuender-if-you-care-iyc,"\n2,90\n€\n(8,89€/100stück)\n","\nVersand 4,90 €\n/\nVersandkostenfrei ab 100,...",promavis,ideal für Holzkohlegrill oder Kachelöfen,\nFeueranzünder\nFeueranzünder aus FSC zertifi...,\nAvocadostore-Kriterien\nRecycelt & Recycleba...,wohnen,463.0,29.0,191.0,71.0,22.0,0.0,38.0,Feueranzünder
4,Duschseife Milchsamt BIO - mit Kokosmilch,/186601-duschseife-milchsamt-bio-mit-kokosmilc...,"\n7,95\n€\n(8,37)\n","\nVersand 5,90 €\n/\nVersandkostenfrei ab 100,...",ASAVO,"Duschseife Milchsamt - handgemachte, vegane Ko...",\nDuschseife Milchsamt BIO - mit Kokosmilch\nB...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,188.0,922.0,372.0,72.0,48.0,87.0,28.0,Duschseife Milchsamt BIO - mit Kokosmilch


Okay, so we need to do some searching to see whether there are any sales rows we can salvage. This can be done manually, which is in fact how I would proceed most of the time: looking at the articles that ended up with null values and checking what might have caused the problem. Just looking at the data often gives me some insight into what went wrong.

But let's speed up the process. Let's write a small loop that iterates through the rows and, whenever it finds a row with an empty article_name column, iterates through the dataframe again to look for an unmatched sales row (one with an empty name column) whose article_name begins with the same five characters as the unmatched name. This should at least give us a sense of closure, even if we do not find any additional rows we can reconcile.

In [31]:
for index, row in df.iterrows():
  # check if article_name is NaN
  # DO NOT use the dot notation in this loop: row.name returns the
  # row's index label, not the value of the 'name' column
  if pd.isnull(row['article_name']):
    # iterate through the dataframe again
    for index2, row2 in df.iterrows():
      # is the name column NaN?
      if pd.isnull(row2['name']):
        # are the beginning 5 characters of the first name and the
        # second article_name equivalent?
        if (row['name'][:5] == row2['article_name'][:5]):
          # print it as a possible match
          print('Possible match: \n{}\nmight be equivalent to:\n{}\n{}\n'.format(
              row['name'], row2['article_name'], '*'*60
          ))

Possible match: 
Geschirrtuch aus Halbleinen mit Bio-Baumwolle 50 x 70 cm
might be equivalent to:
Geschirrtuch LARA, Biobaumwolle, GOTS-zertifiziert, 50 x 70 cm
************************************************************

Possible match: 
Geschirrtuch aus Halbleinen mit Bio-Baumwolle 50 x 70 cm
might be equivalent to:
Geschirrtuch LARA, Biobaumwolle, GOTS-zertifiziert, 50 x 70 cm
************************************************************

Possible match: 
Rasierpinsel KUNSTHAAR mit Olivenholzgriff (vegan)
might be equivalent to:
Rasierhobel KLASSIK mit Griff Watzmann aus Olivenholz
************************************************************

Possible match: 
HALM Strohhalme aus Glas Trinkhalme 3x 23 cm (gebogen) + 3x 20 cm (gerade) + Reinigungsbürste
might be equivalent to:
HALM Glasstrohhalme  6x 20 cm (gerade)
************************************************************

Possible match: 
ShowerBit - festes Duschgel Meeresfrische - 60g
might be equivalent to:
ShowerBit - festes Du
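
The nested loop compares every unmatched product with every unmatched sales row. The same five-character prefix comparison can be sketched as a merge on a temporary key. The two toy frames below are made up; in the notebook they would presumably be the rows of `df` with a missing `article_name` and the rows with a missing `name`.

```python
import pandas as pd

# Made-up unmatched names standing in for the two sides of the merge.
unmatched_products = pd.DataFrame({"name": [
    "Geschirrtuch aus Halbleinen", "Rasierpinsel KUNSTHAAR", "Mousepad aus Leder"]})
unmatched_sales = pd.DataFrame({"article_name": [
    "Geschirrtuch LARA", "Rasierhobel KLASSIK", "HALM Glasstrohhalme"]})

# Use the first five characters as a temporary join key.
unmatched_products["prefix"] = unmatched_products["name"].str[:5]
unmatched_sales["prefix"] = unmatched_sales["article_name"].str[:5]

# An inner merge keeps only the pairs that share a prefix.
candidates = pd.merge(unmatched_products, unmatched_sales, on="prefix")
print(candidates[["name", "article_name"]])
```

As with the loop, a shared prefix only flags candidates; each pair still needs to be checked by eye.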

This is quite successful. It seems that there are two articles whose names are not compatible because a different dash character is used in the two dataframes: the 'EcoYou Vegane Zahnseide' and the 'Interdentalbürsten aus Bambus' articles. I propose to alter the original sales data, replacing the '—' character with the '-' character to make them compatible; then we can merge the two dataframes again. Rather than rerunning the loop above, we will just check the dimensions of the resulting dataframe to see if the two extra rows disappeared.

In [36]:
sales.article_name = sales.article_name.str.replace(r'—', '-')
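
Scraped text can contain several dash variants, not only the one replaced above. As a hedged sketch, a character class can map a handful of them to the plain hyphen in one pass. The `names` Series is made up, and the list of code points in `dash_pattern` is an illustrative selection rather than a complete one.

```python
import pandas as pd

# Made-up article names using different dash characters.
names = pd.Series([
    "Zahnseide \u2014 vegan",   # em dash
    "Zahnseide \u2013 vegan",   # en dash
    "Zahnseide - vegan",        # plain hyphen-minus
])

# Map a few common dash variants to the ASCII hyphen-minus.
dash_pattern = "[\u2010\u2011\u2012\u2013\u2014\u2015]"
normalised = names.str.replace(dash_pattern, "-", regex=True)
print(normalised.nunique())
```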

In [37]:
df_improved = pd.merge(product_views, sales, left_on='name', right_on='article_name', how='outer')

In [38]:
df.shape

(246, 17)

In [39]:
df_improved.shape

(245, 17)

Well, one row indeed disappeared. What about the other? I must've missed something...

In [40]:
for index, row in df_improved.iterrows():
  # check if article_name is NaN
  # DO NOT use the dot notation in this loop: row.name returns the
  # row's index label, not the value of the 'name' column
  if pd.isnull(row['article_name']):
    # iterate through the dataframe again
    for index2, row2 in df_improved.iterrows():
      # is the name column NaN?
      if pd.isnull(row2['name']):
        # are the beginning 5 characters of the first name and the
        # second article_name equivalent?
        if (row['name'][:5] == row2['article_name'][:5]):
          # print it as a possible match
          print('Possible match: \n{}\nmight be equivalent to:\n{}\n{}\n'.format(
              row['name'], row2['article_name'], '*'*60
          ))

Possible match: 
Geschirrtuch aus Halbleinen mit Bio-Baumwolle 50 x 70 cm
might be equivalent to:
Geschirrtuch LARA, Biobaumwolle, GOTS-zertifiziert, 50 x 70 cm
************************************************************

Possible match: 
Geschirrtuch aus Halbleinen mit Bio-Baumwolle 50 x 70 cm
might be equivalent to:
Geschirrtuch LARA, Biobaumwolle, GOTS-zertifiziert, 50 x 70 cm
************************************************************

Possible match: 
Rasierpinsel KUNSTHAAR mit Olivenholzgriff (vegan)
might be equivalent to:
Rasierhobel KLASSIK mit Griff Watzmann aus Olivenholz
************************************************************

Possible match: 
HALM Strohhalme aus Glas Trinkhalme 3x 23 cm (gebogen) + 3x 20 cm (gerade) + Reinigungsbürste
might be equivalent to:
HALM Glasstrohhalme  6x 20 cm (gerade)
************************************************************

Possible match: 
ShowerBit - festes Duschgel Meeresfrische - 60g
might be equivalent to:
ShowerBit - festes Du

Nah, I was paranoid. Both rows are now incorporated into the main dataframe. Let's truncate the main df and fill the missing values with 0s; then we can export it to a CSV.

In [41]:
df_final = df_improved[:144]

In [43]:
# drop the article_name column, since it duplicates the name column
df_final = df_final.drop('article_name', axis=1)

In [44]:
# fill nan values with zeroes
df_final = df_final.fillna(0)

In [45]:
df_final.tail(2)

Unnamed: 0,name,link,price,postage,brand,metadescription,description,criteria,category,pageviews,unique_pageviews,time_on_page,articles_sold,unique_sales,total_revenue,average_revenue
142,SoulSpice Sweet Kashmir Mango Curry Gewürzmix,/223037-soulspice-sweet-kashmir-mango-curry-ge...,"\n8,20\n€\n(16,40€/100g)\n","\nVersand 3,95 €\n/\nVersandkostenfrei ab 50,0...",KONTOR 1710,"Mit edler Bio-Mango, Ingwer, Kurkuma, Zimt und...",\nSoulSpice Sweet Kashmir Mango Curry Gewürzmi...,\nAvocadostore-Kriterien\nRohstoffe aus Bioanb...,wohnen,0.0,0.0,0.0,0.0,0.0,0.0,0.0
143,"Baumschmuck aus Holz, verschiedene Motive",/182279-baumschmuck-aus-holz-verschiedene-moti...,"\n5,90\n€\n6,50\n€\n","\nVersand 4,90 €\n/\nVersandkostenfrei ab 99,0...",Mitienda Shop,Baumschmuck aus Holz. Handgemachter Christbaum...,"\nBaumschmuck aus Holz, verschiedene Motive\nH...",\nAvocadostore-Kriterien\nHaltbar\nMit der ide...,wohnen,0.0,0.0,0.0,0.0,0.0,0.0,0.0
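
Slicing the first 144 rows assumes that the merge produced exactly one row per product, in the original order. The tail shown here ends with 'Baumschmuck' at index 143, whereas `product_views.tail()` ended with 'Handcreme' at index 143, which suggests that the assumption may not hold. One possible cause is repeated keys on either side of the merge. The sketch below, on made-up toy frames, shows how the `validate` parameter of `pd.merge` can surface such duplicates.

```python
import pandas as pd
from pandas.errors import MergeError

products_toy = pd.DataFrame({"name": ["A", "B"]})
# "A" appears twice on the sales side, so a merge would duplicate that product row.
sales_toy = pd.DataFrame({"article_name": ["A", "A"], "articles_sold": [1, 2]})

try:
    # validate='one_to_one' checks that the merge keys are unique on both sides.
    pd.merge(products_toy, sales_toy, left_on="name", right_on="article_name",
             how="left", validate="one_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```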


In [46]:
df_final.to_csv('merged_data.csv', index=False)
!cp merged_data.csv "drive/My Drive/deite/avocado/"

And that's it: we have a neat CSV file with article information and descriptions, but also with pageview and sales data, giving us some additional insights. The file we've produced is very imperfect, since, as we've seen, many products lack pageview and sales data. This is of course an uncomfortable result, but it is also something that happens all too often when dealing with real-world data. I hope that this exercise was as interesting for you as it was for me.