# Language Detection

This notebook aims to identify the best method for filtering out non-English posts.

In [27]:
import pandas as pd
from datetime import datetime

In [46]:
posts = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts.csv")
posts.head(2)

Unnamed: 0,postUrl,profileUrl,username,fullName,commentCount,likeCount,pubDate,description,location,imgUrl,postId,ownerId,type,query,timestamp,isSidecar,sidecarMedias,videoUrl,viewCount
0,https://www.instagram.com/p/CgWcU2fss5n/,https://www.instagram.com/jadesbites,jadesbites,JADE | Recipes & Food,47,1648,2022-07-23T10:00:53.000Z,🧄 ~ A G L I O ~ E ~ O L I O ~ 🧄\n\nSimply mean...,Liverpool,https://scontent-lhr8-2.cdninstagram.com/v/t51...,2888620789210467943,3656910377,Photo,#recipeoftheday,2022-07-26T08:27:52.635Z,False,,,
1,https://www.instagram.com/p/CgckMjTpSPs/,https://www.instagram.com/sweettreatsyt,sweettreatsyt,Ania | SweetTreats,4,3,2022-07-25T19:05:06.000Z,NEW! Condensed Milk Brownies. These brownies a...,"Toronto, Ontario",https://scontent-lhr8-1.cdninstagram.com/v/t51...,2890344253083689964,633730791,Carousel,#recipeoftheday,2022-07-26T08:27:52.635Z,True,3.0,,


### Using SpaCy's language detector

In [8]:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('language_detector', last=True)
print(nlp("This is an english text.")._.language)

{'language': 'en', 'score': 0.9999981485694562}


In [24]:
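def detect_language(post):
    # str() coerces non-string values; a missing caption (NaN) becomes the text "nan".
    lang_dict = nlp(str(post))._.language
    # Return a Series so that apply() yields two columns: language code and score.
    return pd.Series([lang_dict['language'], lang_dict['score']])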
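
One edge case in `detect_language` is worth a sketch: `str(post)` turns a missing description (which pandas reads from the CSV as `NaN`) into the literal text `"nan"`, and that text is then passed to the detector. The helper below is a hypothetical addition, not part of the original notebook; the name `clean_description` is an assumption. It returns an empty string for missing values so they can be handled separately.

```python
import math

def clean_description(value):
    """Return the caption as a stripped string, or "" when it is missing."""
    # None can appear if the column was built by hand rather than read from CSV.
    if value is None:
        return ""
    # pandas represents empty CSV cells as float NaN.
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()

print(repr(clean_description(float("nan"))))
print(repr(clean_description("  Condensed Milk Brownies  ")))
```

Posts whose cleaned description is empty could then be dropped before detection rather than classified from the word "nan".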

In [28]:
now = datetime.now()
posts[['language', 'score']] = posts['description'].apply(detect_language)
print(f"Completed!\nTime taken: {datetime.now()-now}")

Completed!
Time taken: 0:00:56.144062


### Analysis of results

In [31]:
# Different languages in dataset
posts['language'].unique()

array(['en', 'fr', 'tr', 'it', 'de', 'hu', 'nl', 'ro', 'id', 'ja', 'pl',
       'ko', 'th', 'hr', 'ar', 'cs', 'es', 'mk', 'tl', 'bg', 'et', 'lt',
       'el', 'ru', 'sl', 'sv', 'sq', 'da', 'pt'], dtype=object)

In [36]:
# Let's look at some posts with low confidence scores. These are mainly posts that mix several languages or consist mostly of hashtags.
for index, post in posts[posts['score']<0.6].iterrows():
    print(post['description'], "\n\n", post['language'], post['score'])

Auntie Kai’s Coco Pipinu (Cucumber Coco) 😋🔥❤️ 

#nettycee #auntiekaiscucumbercoco #cocorecipe #coco #cucumber #cocopipinu #recipe #easyrecipe #easyrecipes #saipan #cnmi #marianas #themarianas #pickledcucumber #foodie #tiktokfoodie 

 ro 0.5714267746615391
#brownies #homemade#chocolate #delicious #2022 #easyrecipes 

 en 0.5714268193276484
Sabudana Vada 🥰 Tag your bestie 😍 
Follow @foodbook_by_aditi 

#sabudanavada #meduvada #vadapav #pakoda #easyrecipes #quickrecipes #recipe #recipes 

 es 0.5654219400703555
Niku-maki bento ! 🍱(veggies wrapped in pork) ✿豚の肉巻き弁当✿！

Bento main:
-Veggies, shiso and cheese wrapped in pork 🥕🐷
-Brown rice

Bento sides:
-Broccoli dressed in sesame 🥦
-Bell pepper, tomato and basil salad 🍅🌿
-Hard boiled egg 🥚 

✿豚の肉巻き弁当✿
◎豚の肉巻き (にんじん、インゲン、チーズ、大葉入り)
◎玄米
◎ブロッコリーの胡麻和え
◎ピーマン、トマトとバジルの…
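
Several of the low-scoring captions above are dominated by hashtags and @mentions, which give the detector very little running text. One possible preprocessing step, which this notebook does not apply, is to strip those tokens before detection. The sketch below is an assumption for illustration: the name `strip_tags` and the regular expression are hypothetical, and the sample caption is abridged from the output above with emoji removed.

```python
import re

# A '#' or '@' followed by word characters or dots (covers hashtags and mentions).
TAG_PATTERN = re.compile(r"[#@][\w.]+")

def strip_tags(caption):
    """Remove hashtags and mentions, then collapse leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", caption)
    return re.sub(r"\s+", " ", without_tags).strip()

caption = (
    "Sabudana Vada Tag your bestie\n"
    "Follow @foodbook_by_aditi\n\n"
    "#sabudanavada #easyrecipes #recipe"
)
print(strip_tags(caption))
```

A caption made up only of hashtags reduces to an empty string, so such posts would need their own rule (keep or drop) instead of a language score.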

In [44]:
# Look at posts classified as English but with confidence scores below 0.95.
for index, post in posts[(posts['score']<0.95) & (posts['language']=='en')].iterrows():
    print(post['description'], "\n\n", post['score'])

Tried this Creamy burger steak version 🥰

Follow on:
Tiktok: @mjkevsvlog
Instagram: @mjkevss
YT: MJKevss

#burgersteak #jollibeeburgersteak #foodlover #food #foodstagram #foodie #mjkevss #recipe #recipeoftheday #fyp 

 0.8571390987604018
Spicy chikan Tika mecroni |colourful mecroni 

Ingredients

#chikan
#alo 
#gajar 
#mecroni 
#tikamasla 
#chatmasla 
#kalimirch powder
#namak 
#soyasauce 
#sirka 
#chilisauce 
#sabaz mirch

#tikamecroni#chikanmecroni#colorful
#mecroni #spicy #tikamasla #chikan #alo #gajar #soyasauce #chilisauce #reels #instgram #recipe #recipeoftheday #vedio #foodinstgram 

 0.8571413169423909
Meat Balls With Zucchini Zoodles & Veges 🍝🥕🧄🧅🍅

Recipe now up over at @saharskitchenette 

 0.8571404725726666
A good day starts with a good breakfast 🥰♥️🥰

صباح الخير و السعادة 🌹

طبق سهل متوازن سعراته قليله 🥗✅✅.

—————————————————————#healthyfood #healthylifest

Based on the analysis above, the following posts should be removed:

1. Any post not classified as English.

2. Any post classified as English with a confidence score below 0.95, since such posts are likely to contain multiple languages.
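
The two rules can be sketched as a single predicate. This is a minimal illustration in plain Python so that it runs without the dataset; the function name `is_confident_english` is hypothetical, and the `(language, score)` pairs are taken from this notebook's outputs with the scores rounded.

```python
def is_confident_english(language, score, target="en", threshold=0.95):
    """True when the detected language matches and the score reaches the threshold."""
    return language == target and score >= threshold

# (language, score) pairs from the notebook outputs, rounded for readability.
detections = [
    ("ro", 0.5714),
    ("en", 0.5714),
    ("es", 0.5654),
    ("en", 0.8571),
    ("en", 0.999995),
]

kept = [pair for pair in detections if is_confident_english(*pair)]
print(kept)  # [('en', 0.999995)]
```

Only the last pair survives: the first and third fail rule 1, and the second and fourth fail rule 2.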

In [45]:
def get_english_posts(df, language='en', confidence=0.95):
    
    print("Detecting language of each post...")
    now = datetime.now()
    
    # Note: this adds the 'language' and 'score' columns to df in place.
    df[['language', 'score']] = df['description'].apply(detect_language)
    
    print(f"Language detection complete.\nTime taken: {datetime.now()-now}")
    
    # Filter with the function arguments instead of hard-coded values.
    return df[(df['score']>=confidence) & (df['language']==language)]

In [47]:
get_english_posts(posts)

Detecting language of each post...
Language detection complete.
Time taken: 0:01:04.993261


Unnamed: 0,postUrl,profileUrl,username,fullName,commentCount,likeCount,pubDate,description,location,imgUrl,...,ownerId,type,query,timestamp,isSidecar,sidecarMedias,videoUrl,viewCount,language,score
0,https://www.instagram.com/p/CgWcU2fss5n/,https://www.instagram.com/jadesbites,jadesbites,JADE | Recipes & Food,47,1648,2022-07-23T10:00:53.000Z,🧄 ~ A G L I O ~ E ~ O L I O ~ 🧄\n\nSimply mean...,Liverpool,https://scontent-lhr8-2.cdninstagram.com/v/t51...,...,3656910377,Photo,#recipeoftheday,2022-07-26T08:27:52.635Z,False,,,,en,0.999995
1,https://www.instagram.com/p/CgckMjTpSPs/,https://www.instagram.com/sweettreatsyt,sweettreatsyt,Ania | SweetTreats,4,3,2022-07-25T19:05:06.000Z,NEW! Condensed Milk Brownies. These brownies a...,"Toronto, Ontario",https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,633730791,Carousel,#recipeoftheday,2022-07-26T08:27:52.635Z,True,3.0,,,en,0.999996
3,https://www.instagram.com/p/Cgdy9wBjeYl/,https://www.instagram.com/nicolejcooks,nicolejcooks,Nicole Jain,4,46,2022-07-26T06:33:24.000Z,HOT & COLD\n.\nRoasted peppers & tomatoes pair...,,https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,10645378983,Carousel,#recipeoftheday,2022-07-26T08:27:52.635Z,True,5.0,,,en,0.999997
4,https://www.instagram.com/p/Cgdp-q1vao_/,https://www.instagram.com/its_shreyajoshi,its_shreyajoshi,Food Blogger,7,3,2022-07-26T05:14:53.000Z,Chocolate Bar\n\n#chocolate #recipes #recipeof...,,https://scontent-lhr8-2.cdninstagram.com/v/t51...,...,40803727237,Carousel,#recipeoftheday,2022-07-26T08:27:52.635Z,True,2.0,,,en,0.999997
6,https://www.instagram.com/p/CgRpagzj1in/,https://www.instagram.com/alwayshungryinlondon,alwayshungryinlondon,𝐇𝐚𝐧𝐧𝐚𝐡 𝐃𝐉,62,2821,2022-07-21T13:19:03.000Z,Roasted Vegetable Lentil Bowl\n———————————————...,United Kingdom,https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,4204503727,Photo,#recipeoftheday,2022-07-26T08:27:52.635Z,False,,,,en,0.999995
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266,https://www.instagram.com/p/CgcTdXLI0Dd/,https://www.instagram.com/liz_sw_weightloss_jo...,liz_sw_weightloss_journey,,1,13,2022-07-25T16:38:51.000Z,Fancied burger pasta tonight but without the p...,,https://scontent-lhr8-2.cdninstagram.com/v/t51...,...,34220359606,Photo,#easymeals,2022-07-26T08:31:32.893Z,False,,,,en,0.999997
1267,https://www.instagram.com/p/CgcS-dFJRmx/,https://www.instagram.com/_nanny_gram_,_nanny_gram_,Felicia,0,3,2022-07-25T16:34:38.000Z,Nothing better than starting a Monday with one...,,https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,54519380490,Photo,#easymeals,2022-07-26T08:31:32.893Z,False,,,,en,0.999997
1268,https://www.instagram.com/p/CgcSs3GjRSE/,https://www.instagram.com/goodiesfoodhall,goodiesfoodhall,Goodies Food Hall,2,18,2022-07-25T16:32:14.000Z,Happy summer holidays! 🌞🎉 We hope you survived...,Pulham Market,https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,2680681440,Carousel,#easymeals,2022-07-26T08:31:32.893Z,True,3.0,,,en,0.999997
1269,https://www.instagram.com/p/CgcStMdv2vK/,https://www.instagram.com/smoothie.smart,smoothie.smart,Smoothies | Weight Los | Diet,3,3,2022-07-25T16:32:16.000Z,AMAZING tranformation results from a customer ...,,https://scontent-lhr8-1.cdninstagram.com/v/t51...,...,31927134756,Carousel,#easymeals,2022-07-26T08:31:32.893Z,True,2.0,,,en,0.999997
