# Classifying recipe posts

In this notebook I investigate around 6,000 posts scraped from Instagram and try to determine which of them contain recipes.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# reading in csv of scraped posts
posts = pd.read_csv("delicouslyella_posts.csv")
posts

Unnamed: 0,postUrl,description,commentCount,likeCount,pubDate,likedByViewer,isSidecar,type,videoUrl,viewCount,...,taggedFullName4,taggedUsername4,location,locationId,taggedFullName5,taggedUsername5,taggedFullName6,taggedUsername6,taggedFullName7,taggedUsername7
0,https://www.instagram.com/p/Cf6xvReoMjO/,So excited to share this with you 💕🌱 What do y...,223,4969,2022-07-12T16:09:40.000Z,False,False,Video,https://scontent-lhr8-1.cdninstagram.com/v/t50...,97193.0,...,,,,,,,,,,
1,https://www.instagram.com/p/Cc2WQ4LDqyn/,Today marks ten years since @ella.mills__ foun...,424,14521,2022-04-27T09:28:48.000Z,False,False,Video,https://scontent-lhr8-1.cdninstagram.com/v/t50...,203404.0,...,,,,,,,,,,
2,https://www.instagram.com/p/CgCMVs3jk4L/,Introducing @ella.mills__ favourite salad - cr...,144,12121,2022-07-15T13:17:13.000Z,False,False,Video,https://scontent-lhr8-2.cdninstagram.com/v/t50...,287491.0,...,,,,,,,,,,
3,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
4,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6076,https://www.instagram.com/p/VRCCTUIuVj/,Morning berry banana #smoothie with acai berri...,5,122,2013-02-03T10:51:11.000Z,False,False,Photo,,,...,,,,,,,,,,
6077,https://www.instagram.com/p/VPDxU5ouZ0/,"Rice crispy treats! Brown rice, almond butter,...",5,181,2013-02-02T16:27:51.000Z,False,False,Photo,,,...,,,,,,,,,,
6078,https://www.instagram.com/p/VJpdOGIubY/,Nutty berry cheesecake for dinner tonight! #in...,14,217,2013-01-31T14:01:43.000Z,False,False,Photo,,,...,,,,,,,,,,
6079,https://www.instagram.com/p/UtMcf7IuRy/,,1,240,2013-01-20T12:49:29.000Z,False,False,Photo,,,...,,,,,,,,,,


In [3]:
posts.columns

Index(['postUrl', 'description', 'commentCount', 'likeCount', 'pubDate',
       'likedByViewer', 'isSidecar', 'type', 'videoUrl', 'viewCount',
       'caption', 'profileUrl', 'username', 'taggedFullName1',
       'taggedUsername1', 'imgUrl', 'postId', 'timestamp', 'query',
       'taggedFullName2', 'taggedUsername2', 'taggedFullName3',
       'taggedUsername3', 'taggedFullName4', 'taggedUsername4', 'location',
       'locationId', 'taggedFullName5', 'taggedUsername5', 'taggedFullName6',
       'taggedUsername6', 'taggedFullName7', 'taggedUsername7'],
      dtype='object')

In [6]:
# what info do we have on each post?

posts.iloc[0]

postUrl                     https://www.instagram.com/p/Cf6xvReoMjO/
description        So excited to share this with you 💕🌱 What do y...
commentCount                                                     223
likeCount                                                       4969
pubDate                                     2022-07-12T16:09:40.000Z
likedByViewer                                                  False
isSidecar                                                      False
type                                                           Video
videoUrl           https://scontent-lhr8-1.cdninstagram.com/v/t50...
viewCount                                                    97193.0
caption                                                          NaN
profileUrl                    https://www.instagram.com/ella.mills__
username                                                ella.mills__
taggedFullName1                                     Deliciously Ella
taggedUsername1                   

In [14]:
# How many posts have a location?

len(posts[posts.location.notna()])

50
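
The same check can be run for every column at once. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not the scraped data); it counts non-missing values per column and the share of rows that have a value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped posts; all values are invented for illustration
toy = pd.DataFrame({
    "postUrl": ["a", "b", "c", "d"],
    "location": ["London", np.nan, np.nan, np.nan],
    "videoUrl": [np.nan, "v1", "v2", np.nan],
})

# Number of non-missing values in each column
non_missing = toy.notna().sum()

# Share of rows that have a value, per column (between 0 and 1)
coverage = toy.notna().mean()

print(non_missing)
print(coverage)
```

On the real `posts` frame, replacing `toy` with `posts` should give a quick overview of which columns are mostly empty.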

In [19]:
# Date range of posts

pub_dates = pd.to_datetime(posts.pubDate)  # parse the ISO timestamps once
print(f"Date range\n-------------\nOldest post: {pub_dates.min()}\nNewest post: {pub_dates.max()}")

Date range
-------------
Oldest post: 2013-01-17 00:57:28+00:00
Newest post: 2022-07-15 13:17:13+00:00
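
Once the timestamps are parsed, it is easy to see how posts are spread over time. This is a rough sketch using two `pubDate` strings copied from the table above in a toy frame (the `toy` name is hypothetical); it parses them as UTC and counts posts per year.

```python
import pandas as pd

# Two pubDate strings taken from the rows shown above, in a toy frame
toy = pd.DataFrame({
    "pubDate": ["2022-07-12T16:09:40.000Z", "2013-01-20T12:49:29.000Z"],
})

# Parse the ISO 8601 strings into timezone-aware timestamps
toy["pubDate"] = pd.to_datetime(toy["pubDate"], utc=True)

oldest = toy["pubDate"].min()
newest = toy["pubDate"].max()

# Number of posts per calendar year, in chronological order
posts_per_year = toy["pubDate"].dt.year.value_counts().sort_index()

print(oldest, newest)
print(posts_per_year)
```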


### Sidecar posts cover multiple rows

Sidecar is an Instagram feature that lets a single post contain multiple pictures. The scrape stores one row per picture, so a sidecar post is spread across several rows that share the same `postUrl` and description.
Since we are only interested in each unique post (we use the descriptions rather than the images), we can drop the duplicate sidecar rows and keep exactly one row per post.
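
To make the idea concrete, here is a minimal sketch on an invented toy frame (the `toy` data and URLs like `p1` are assumptions, not real posts). It counts how many sidecar rows are redundant and then keeps the first row for each `postUrl`.

```python
import pandas as pd

# Toy frame: p1 is a normal post, p2 has three pictures, p3 has two
toy = pd.DataFrame({
    "postUrl": ["p1", "p2", "p2", "p2", "p3", "p3"],
    "isSidecar": [False, True, True, True, True, True],
    "description": ["solo", "trio", "trio", "trio", "pair", "pair"],
})

sidecar_rows = toy[toy["isSidecar"]]
n_rows = len(sidecar_rows)                    # rows belonging to sidecar posts
n_posts = sidecar_rows["postUrl"].nunique()   # distinct sidecar posts
n_duplicates = n_rows - n_posts               # rows that can be dropped

# Keep only the first row for every post URL
deduped = toy.drop_duplicates(subset="postUrl", keep="first")

print(n_rows, n_posts, n_duplicates)
print(deduped)
```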

In [36]:
posts[posts.isSidecar].head()

Unnamed: 0,postUrl,description,commentCount,likeCount,pubDate,likedByViewer,isSidecar,type,videoUrl,viewCount,...,taggedFullName4,taggedUsername4,location,locationId,taggedFullName5,taggedUsername5,taggedFullName6,taggedUsername6,taggedFullName7,taggedUsername7
3,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
4,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
5,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
14,https://www.instagram.com/p/Cft1VFwjp78/,July’s 5-day “Summer staples” meal plan is now...,19,2003,2022-07-07T15:30:31.000Z,False,True,Photo,,,...,,,,,,,,,,
15,https://www.instagram.com/p/Cft1VFwjp78/,July’s 5-day “Summer staples” meal plan is now...,19,2003,2022-07-07T15:30:31.000Z,False,True,Photo,,,...,,,,,,,,,,


In [44]:
# Note: the first count is rows (one per sidecar picture), not posts
sidecar = posts[posts.isSidecar]
n_rows, n_unique = len(sidecar), sidecar.postUrl.nunique()
print(f"Posts with sidecar: {n_rows}")
print(f"Unique posts with sidecar: {n_unique}")
print(f"---\nSo we need to remove {n_rows - n_unique} duplicates.")

Posts with sidecar: 546
Unique posts with sidecar: 187
---
So we need to remove 359 duplicates.
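
Keeping only the first row per post assumes that every row of a post carries the same description. The sketch below shows one way that assumption could be checked before dropping anything; the toy data is invented, and `inconsistent` is a hypothetical name for the list of post URLs whose rows disagree.

```python
import pandas as pd

# Toy frame: p2's rows agree on the description, p3's rows do not
toy = pd.DataFrame({
    "postUrl": ["p1", "p2", "p2", "p3", "p3"],
    "description": ["solo", "same text", "same text", "first", "different"],
})

# Number of distinct descriptions per post (a missing value counts as its own value)
distinct_per_post = toy.groupby("postUrl")["description"].nunique(dropna=False)

# Posts where keep="first" would silently discard a differing description
inconsistent = distinct_per_post[distinct_per_post > 1].index.tolist()

print(inconsistent)
```

If the list comes back empty on the real data, dropping duplicates by `postUrl` loses no description text.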


In [48]:
# Keep the first row for each post URL so that every post appears once
posts = posts.drop_duplicates(subset="postUrl", keep="first")
posts

Unnamed: 0,postUrl,description,commentCount,likeCount,pubDate,likedByViewer,isSidecar,type,videoUrl,viewCount,...,taggedFullName4,taggedUsername4,location,locationId,taggedFullName5,taggedUsername5,taggedFullName6,taggedUsername6,taggedFullName7,taggedUsername7
0,https://www.instagram.com/p/Cf6xvReoMjO/,So excited to share this with you 💕🌱 What do y...,223,4969,2022-07-12T16:09:40.000Z,False,False,Video,https://scontent-lhr8-1.cdninstagram.com/v/t50...,97193.0,...,,,,,,,,,,
1,https://www.instagram.com/p/Cc2WQ4LDqyn/,Today marks ten years since @ella.mills__ foun...,424,14521,2022-04-27T09:28:48.000Z,False,False,Video,https://scontent-lhr8-1.cdninstagram.com/v/t50...,203404.0,...,,,,,,,,,,
2,https://www.instagram.com/p/CgCMVs3jk4L/,Introducing @ella.mills__ favourite salad - cr...,144,12121,2022-07-15T13:17:13.000Z,False,False,Video,https://scontent-lhr8-2.cdninstagram.com/v/t50...,287491.0,...,,,,,,,,,,
3,https://www.instagram.com/p/Cf_u2NnDJGf/,"How it started and how it’s going ✨ 2016, cold...",84,8007,2022-07-14T14:20:12.000Z,False,True,Photo,,,...,,,,,,,,,,
6,https://www.instagram.com/p/Cf_W3qpDf80/,"Sunshine smoothie for the heatwave. Fresh, fru...",60,6350,2022-07-14T10:55:10.000Z,False,False,Video,https://scontent-lhr8-1.cdninstagram.com/v/t50...,203527.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6076,https://www.instagram.com/p/VRCCTUIuVj/,Morning berry banana #smoothie with acai berri...,5,122,2013-02-03T10:51:11.000Z,False,False,Photo,,,...,,,,,,,,,,
6077,https://www.instagram.com/p/VPDxU5ouZ0/,"Rice crispy treats! Brown rice, almond butter,...",5,181,2013-02-02T16:27:51.000Z,False,False,Photo,,,...,,,,,,,,,,
6078,https://www.instagram.com/p/VJpdOGIubY/,Nutty berry cheesecake for dinner tonight! #in...,14,217,2013-01-31T14:01:43.000Z,False,False,Photo,,,...,,,,,,,,,,
6079,https://www.instagram.com/p/UtMcf7IuRy/,,1,240,2013-01-20T12:49:29.000Z,False,False,Photo,,,...,,,,,,,,,,


In [53]:
# Print the first 100 descriptions to get a feel for the text
for i in range(100):
    print(posts.description.iloc[i])
    print("\n--------\n")

So excited to share this with you 💕🌱 What do you think? I felt there was no one-stop-shop for going/being plant-based or flexitarian, a definitive guide with all the resources and expertise that you needed, so I spent the last eighteen months putting it together and it’s finally here 💃 It’s half price on Waterstones and Amazon right now if you want to pre-order it and it launches on the 18th August!
@plantpowerdoctor @dr.alandesmond @plantbasedhealthprofessionals @rohinibajekal @plantdietitianrosie @plantbasedkids.uk @shahroo_izadi

--------

Today marks ten years since @ella.mills__ founded the company, writing our very first recipe post on deliciously ella.com - it was a simple recipe, roasted sweet potatoes with an avocado dip. We’ve evolved a lot since then but the mission hasn’t changed, to help as many people feel better every day. It’s been an adventure, to say the least, and we couldn’t have done it without you. Thank you for the incredible support and for being a part of every