# Language Detection

This notebook aims to identify the best method for filtering out non-English posts.

In [27]:
import pandas as pd
from datetime import datetime

In [3]:
posts = pd.read_csv("/Users/maxkirwan/Desktop/Uni/Data Science MSc/Data Science Project/nutrition-insta/Instagram Data Scraping/Phantom Buster/recipe_posts.csv")
posts.head(2)

Unnamed: 0,postUrl,profileUrl,username,fullName,commentCount,likeCount,pubDate,description,location,imgUrl,postId,ownerId,type,query,timestamp,isSidecar,sidecarMedias,videoUrl,viewCount
0,https://www.instagram.com/p/CgWcU2fss5n/,https://www.instagram.com/jadesbites,jadesbites,JADE | Recipes & Food,47,1648,2022-07-23T10:00:53.000Z,🧄 ~ A G L I O ~ E ~ O L I O ~ 🧄\n\nSimply mean...,Liverpool,https://scontent-lhr8-2.cdninstagram.com/v/t51...,2888620789210467943,3656910377,Photo,#recipeoftheday,2022-07-26T08:27:52.635Z,False,,,
1,https://www.instagram.com/p/CgckMjTpSPs/,https://www.instagram.com/sweettreatsyt,sweettreatsyt,Ania | SweetTreats,4,3,2022-07-25T19:05:06.000Z,NEW! Condensed Milk Brownies. These brownies a...,"Toronto, Ontario",https://scontent-lhr8-1.cdninstagram.com/v/t51...,2890344253083689964,633730791,Carousel,#recipeoftheday,2022-07-26T08:27:52.635Z,True,3.0,,


### Using spaCy's language detector

In [8]:
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('language_detector', last=True)
print(nlp("This is an english text.")._.language)

{'language': 'en', 'score': 0.9999981485694562}


In [24]:
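def detect_language(post):
    # str() lets non-string values (e.g. NaN for a missing description) pass through the pipeline
    lang_dict = nlp(str(post))._.language
    # Returning a Series makes .apply() expand the result into two columns
    return pd.Series([lang_dict['language'], lang_dict['score']])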
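Many captions in this dataset end in long runs of hashtags and @mentions, and the low-score examples printed further down are dominated by them. One option worth trying is to strip these before detection. The sketch below is an assumption, not part of the original notebook: `clean_caption` and `TAG_OR_MENTION` are hypothetical names, and whether this improves the scores has not been measured here.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches a '#' or '@' together with any word characters or dots that follow it.
TAG_OR_MENTION = re.compile(r"[#@][\w.]*")

def clean_caption(text):
    """Drop hashtags and @mentions, then collapse runs of whitespace."""
    without_tags = TAG_OR_MENTION.sub(" ", str(text))
    return " ".join(without_tags.split())

print(clean_caption("Tag your bestie\nFollow @foodbook_by_aditi\n\n#sabudanavada #recipe"))
```

A caption made up only of hashtags comes back as an empty string, so it may need separate handling (for example, being marked as unknown) before it reaches `detect_language`.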

In [26]:
# Sanity check of the timing pattern: a 5-second sleep should report roughly 5 seconds
from datetime import datetime
import time
now = datetime.now()
time.sleep(5)
print(f"Time taken: {datetime.now()-now}")

Time taken: 0:00:05.000990
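The same timing pattern can be wrapped in a small reusable helper. The sketch below is an assumption rather than the notebook's own code: `timer` is a hypothetical name, and it swaps `datetime.now()` for `time.perf_counter()`, a high-resolution clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label="Time taken"):
    # Hypothetical helper: yields a dict that is filled in when the block exits
    result = {"seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} s")

# Short sleep used only to show the helper working
with timer() as t:
    time.sleep(0.1)
```

Any cell body can go inside the `with` block, for instance the `posts['description'].apply(detect_language)` call.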


In [28]:
now = datetime.now()
posts[['language', 'score']] = posts['description'].apply(detect_language)
print(f"Completed!\nTime taken: {datetime.now()-now}")

Completed!
Time taken: 0:00:56.144062


In [31]:
# Different languages in dataset
posts['language'].unique()

array(['en', 'fr', 'tr', 'it', 'de', 'hu', 'nl', 'ro', 'id', 'ja', 'pl',
       'ko', 'th', 'hr', 'ar', 'cs', 'es', 'mk', 'tl', 'bg', 'et', 'lt',
       'el', 'ru', 'sl', 'sv', 'sq', 'da', 'pt'], dtype=object)
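Since the goal is to filter out non-English posts, the `language` and `score` columns can be combined into a simple keep/drop rule. This is a minimal sketch under stated assumptions: `is_confident_english` and `MIN_SCORE` are hypothetical names, the 0.6 threshold simply mirrors the cut-off this notebook uses when inspecting low-score posts, and the rows are made up for illustration.

```python
# Hypothetical threshold; an assumption, not a tuned value
MIN_SCORE = 0.6

def is_confident_english(language, score, min_score=MIN_SCORE):
    """True when the detector reports English with at least min_score confidence."""
    return language == "en" and score >= min_score

# (language, score) pairs; made-up rows for illustration only
detections = [("en", 0.99), ("ro", 0.57), ("en", 0.57), ("fr", 0.99)]
kept = [pair for pair in detections if is_confident_english(*pair)]
print(kept)
```

On the notebook's frame the same rule could be written as a boolean mask, e.g. `posts[(posts['language'] == 'en') & (posts['score'] >= MIN_SCORE)]`.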

In [36]:
# Let's look at some posts with low scores. These are mainly posts that mix languages or are dominated by hashtags.
for index, post in posts[posts['score']<0.6].iterrows():
    print(post['description'], "\n\n", post['language'], post['score'])

Auntie Kai’s Coco Pipinu (Cucumber Coco) 😋🔥❤️ 

#nettycee #auntiekaiscucumbercoco #cocorecipe #coco #cucumber #cocopipinu #recipe #easyrecipe #easyrecipes #saipan #cnmi #marianas #themarianas #pickledcucumber #foodie #tiktokfoodie 

 ro 0.5714267746615391
#brownies #homemade#chocolate #delicious #2022 #easyrecipes 

 en 0.5714268193276484
Sabudana Vada 🥰 Tag your bestie 😍 
Follow @foodbook_by_aditi 

#sabudanavada #meduvada #vadapav #pakoda #easyrecipes #quickrecipes #recipe #recipes 

 es 0.5654219400703555
Niku-maki bento ! 🍱(veggies wrapped in pork) ✿豚の肉巻き弁当✿！

Bento main:
-Veggies, shiso and cheese wrapped in pork 🥕🐷
-Brown rice

Bento sides:
-Broccoli dressed in sesame 🥦
-Bell pepper, tomato and basil salad 🍅🌿
-Hard boiled egg 🥚 

✿豚の肉巻き弁当✿
◎豚の肉巻き (にんじん、インゲン、チーズ、大葉入り)
◎玄米
◎ブロッコリーの胡麻和え
◎ピーマン、トマトとバジルのサラダ
◎ゆで卵

Reel coming soon 😊▶️
.
.
.

#mizuseats#cooking#food#foodstagram #easyrecipes##自炊#自炊記録 #bentobox#bento#本日のお弁当#お弁当#弁当#japanese弁当 #肉巻き#lunch##recipes#簡単レシピ #簡単料理 #肉巻き弁当#豚の肉巻き#弁当記