<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setting-up-Environment" data-toc-modified-id="Setting-up-Environment-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setting up Environment</a></span></li><li><span><a href="#Initializing-Storage-Variables" data-toc-modified-id="Initializing-Storage-Variables-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Initializing Storage Variables</a></span></li><li><span><a href="#Wikipedia-Scraping" data-toc-modified-id="Wikipedia-Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wikipedia Scraping</a></span><ul class="toc-item"><li><span><a href="#Extrating-Chinese-food-names" data-toc-modified-id="Extrating-Chinese-food-names-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Extracting Chinese food names</a></span></li><li><span><a href="#Extracting-all-other-cuisines" data-toc-modified-id="Extracting-all-other-cuisines-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extracting all other cuisines</a></span></li><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Data Cleaning</a></span></li><li><span><a href="#Exporting-Corpus" data-toc-modified-id="Exporting-Corpus-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Exporting Corpus</a></span></li></ul></li><li><span><a href="#Importing-Dataset" data-toc-modified-id="Importing-Dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Importing Dataset</a></span><ul class="toc-item"><li><span><a href="#Data-Pre-processing" data-toc-modified-id="Data-Pre-processing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Data Pre-processing</a></span></li></ul></li></ul></div>

# Setting up Environment

In [158]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import wikipedia

# Initializing Storage Variables

In [159]:
df_corpus = pd.DataFrame(columns=["Food", "Cuisine"])
df_corpus

Unnamed: 0,Food,Cuisine


In [160]:
cuisines = ["Chinese", "Malay", "Indian", "Cross-cultural",
            "Seafood", "Fruit", "Desserts", "Drinks and beverages"]

# Wikipedia Scraping
Documentation at https://wikipedia.readthedocs.io/en/latest/code.html

In [161]:
wiki = wikipedia.page("Singaporean cuisine")
links = wiki.links

In [162]:
wiki.sections

[]

## Extracting Chinese food names

In [163]:
txt_chinese = wiki.section("Chinese") #section(section_title)
txt_chinese

'The dishes that comprise "Singaporean Chinese cuisine" today were originally brought to Singapore by the early southern Chinese immigrants (Hokkien, Teochew, Cantonese, Hakka and Hainanese). They were then adapted to suit the local availability of ingredients, while absorbing influences from Malay, Indian and other cooking traditions.\nMost of the names of Singaporean Chinese dishes were derived from dialects of southern China, Hokkien (Min Nan) being the most common. As there was no common system for transliterating these dialects into the Latin alphabet, it is common to see different variants on the same name for a single dish. For example, bah kut teh may also be spelt bak kut teh, and char kway tiao may also be spelt char kuay teow.\n\nBak kut teh (肉骨茶; ròu gǔ chá), pork rib soup made with a variety of Chinese herbs and spices.\nBeef kway teow (牛肉粿条; niú ròu guǒ tiáo), flat rice noodles stir-fried with beef, served dry or with soup.\nBak chang (肉粽; ròu zòng), glutinous rice dumpli

In [164]:
len(txt_chinese)

6763

In [165]:
def get_corpus(txt, start_chars, end_chars):
    """Return the lower-cased substrings of txt that start just after a
    character in start_chars and end just before the next character in
    end_chars."""
    start, end = 0, 0
    corpus = []
    new = False
    for i in range(len(txt)):
        if txt[i] in start_chars:
            start = i+1 # start of food name just after start_char
            new = True # found start of new word flag
        if txt[i] in end_chars:
            end = i # end of food name, not inclusive of txt[i]
        if new and end > start:
            # strip surrounding spaces, then store the name in lower case
            corpus.append(txt[start:end].strip(" ").lower())
            new = False # word copied out, prevents duplicates
    return corpus

Chinese cuisine is the only section in which every food name is bounded by ```'\n'``` and ```'('``` characters (e.g. \nSliced fish soup (鱼片汤; yú piàn tāng)), so we extract it step by step as an example.
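
To see how these boundary characters isolate a name, here is a minimal sketch on a short made-up string. The `sample` text is hypothetical (it only imitates the layout of the section), and the regular expression is an alternative formulation for illustration, not the `get_corpus` function used in this notebook.

```python
import re

# Hypothetical sample laid out like the Chinese section: each entry starts
# on a new line and the food name ends at the first '('.
sample = (
    "Short introduction to the section.\n"
    "Bak kut teh (肉骨茶; ròu gǔ chá), pork rib soup.\n"
    "Sliced fish soup (鱼片汤; yú piàn tāng), fish slices in broth.\n"
)

# Capture the non-empty text between a newline and the next '(' or '/'.
pattern = re.compile(r"\n([^\n(/]+)[(/]")
sample_names = [name.strip().lower() for name in pattern.findall(sample)]
print(sample_names)
```

The introduction line is not captured here because nothing precedes it with a newline; in the real section text, a summary paragraph that follows a newline and contains a parenthesis would be picked up as well.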

In [166]:
corpus_chinese = get_corpus(txt_chinese, ['\n'], ['(', '/'])
corpus_chinese

['most of the names of singaporean chinese dishes were derived from dialects of southern china, hokkien',
 'bak kut teh',
 'beef kway teow',
 'bak chang',
 'bak chor mee',
 'ban mian',
 'chai tow kway',
 'char kway teow',
 'char siu',
 'crab bee hoon',
 'drunken prawns',
 'duck rice',
 'fish ball noodles',
 'fish soup bee hoon',
 'frog leg porridge',
 'hae mee',
 'hainanese chicken rice',
 'har cheong gai',
 'hokkien mee',
 'hum chim peng',
 'kuay chap',
 'mee pok',
 'min chiang kueh',
 "pig's brain soup",
 "pig's organ soup",
 'popiah',
 'shredded chicken noodles',
 'sliced fish soup',
 'soon kway',
 'teochew porridge',
 'turtle soup',
 'vegetarian bee hoon',
 'yong tau foo',
 'youtiao']

In [167]:
corpus_chinese = corpus_chinese[1:]
corpus_chinese_size = len(corpus_chinese)
print(corpus_chinese_size)

33


In [168]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": corpus_chinese,
                                     "Cuisine": ["Chinese"]*len(corpus_chinese)})])
df_corpus

Unnamed: 0,Food,Cuisine
0,bak kut teh,Chinese
1,beef kway teow,Chinese
2,bak chang,Chinese
3,bak chor mee,Chinese
4,ban mian,Chinese
5,chai tow kway,Chinese
6,char kway teow,Chinese
7,char siu,Chinese
8,crab bee hoon,Chinese
9,drunken prawns,Chinese


## Extracting all other cuisines

Viewing the Wikipedia page shows that in all other cuisine sections the food names are bounded by ```'\n'```, ```','``` and ```'/'``` characters.

As with the Chinese section, we expect each section's summary text to contain substrings bounded by the same characters, and those will need to be removed.

In [169]:
for cuisine in cuisines[1:]: # exclude "Chinese"
    txt = wiki.section(cuisine) #section(section_title)
    corpus = get_corpus(txt, ['\n'], [',', '(', '/'])
    print("\n", cuisine, ", Corpus size =", len(corpus))
    print(corpus)


 Malay , Corpus size = 27
['acar', 'assam pedas', 'ayam penyet', 'bakso', 'begedil', 'curry puff', 'dendeng paru', 'goreng pisang', 'gulai daun ubi', 'keropok', 'ketupat', 'lemak siput', 'lontong', 'nagasari', 'nasi goreng', 'nasi padang', 'otak-otak', 'pecel lele', 'rawon', 'rojak bandung', 'roti john', 'sambal', 'satay', 'sayur lodeh', 'soto', 'soto ayam', 'tumpeng']

 Indian , Corpus size = 10
['appam', 'dosa', 'murtabak', 'naan', 'roti prata', 'soup kambing', 'soup tulang', 'soup tulang merah', 'tandoori chicken', 'vadai']

 Cross-cultural , Corpus size = 20
['ayam buah keluak', 'biryani', 'cereal prawns', 'chili crab pasta', 'curry laksa', 'fish head curry', 'kari debal', 'kari lemak ayam', 'katong laksa', 'kueh pie tee', 'kway teow goreng', 'mee rebus', 'mee siam', 'mee goreng', 'mee soto', 'rojak', 'sambal kangkong', 'satay bee hoon', 'tauhu goreng', '"western food" in hawker centres where "singapore-style" chicken chop']

 Seafood , Corpus size = 5
['black pepper crab', 'chill

The results above show that only two entries need removing: the summary text (the first element of the corpus) for "Fruit", and the last element for "Cross-cultural".

However, to make our code reusable, we shall filter the food names by length instead, keeping only names of at most 40 characters.

Running the above loop again, we have:

In [170]:
corpus_size = corpus_chinese_size
for cuisine in cuisines[1:]: # exclude "Chinese"
    txt = wiki.section(cuisine) #section(section_title)
    corpus = get_corpus(txt, ['\n'], [',', '(', '/'])
    corpus = list(filter(lambda x: len(x) <= 40, corpus))
    print("\n", cuisine, ", Corpus size =", len(corpus))
    print(corpus)
    
    corpus_size += len(corpus)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_corpus = pd.concat([df_corpus,
                           pd.DataFrame({"Food": corpus,
                                         "Cuisine": [cuisine]*len(corpus)})])


 Malay , Corpus size = 27
['acar', 'assam pedas', 'ayam penyet', 'bakso', 'begedil', 'curry puff', 'dendeng paru', 'goreng pisang', 'gulai daun ubi', 'keropok', 'ketupat', 'lemak siput', 'lontong', 'nagasari', 'nasi goreng', 'nasi padang', 'otak-otak', 'pecel lele', 'rawon', 'rojak bandung', 'roti john', 'sambal', 'satay', 'sayur lodeh', 'soto', 'soto ayam', 'tumpeng']

 Indian , Corpus size = 10
['appam', 'dosa', 'murtabak', 'naan', 'roti prata', 'soup kambing', 'soup tulang', 'soup tulang merah', 'tandoori chicken', 'vadai']

 Cross-cultural , Corpus size = 19
['ayam buah keluak', 'biryani', 'cereal prawns', 'chili crab pasta', 'curry laksa', 'fish head curry', 'kari debal', 'kari lemak ayam', 'katong laksa', 'kueh pie tee', 'kway teow goreng', 'mee rebus', 'mee siam', 'mee goreng', 'mee soto', 'rojak', 'sambal kangkong', 'satay bee hoon', 'tauhu goreng']

 Seafood , Corpus size = 5
['black pepper crab', 'chilli crab', 'oyster omelette', 'sambal lala', 'sambal stingray']

 Fruit , C
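
The length filter can also be checked in isolation. The sketch below applies the same 40-character cut-off to three entries copied from the "Cross-cultural" output above; the names `candidates`, `MAX_NAME_LENGTH`, `kept` and `dropped` are illustrative and do not appear in the notebook.

```python
# Entries taken from the "Cross-cultural" output above; the last one is
# stray prose rather than a food name.
candidates = [
    'ayam buah keluak',
    'tauhu goreng',
    '"western food" in hawker centres where "singapore-style" chicken chop',
]

MAX_NAME_LENGTH = 40  # same cut-off as in the loop above

# Keep short entries; collect the long ones separately for inspection.
kept = [name for name in candidates if len(name) <= MAX_NAME_LENGTH]
dropped = [name for name in candidates if len(name) > MAX_NAME_LENGTH]
print(kept)
print(dropped)
```

Printing the dropped entries alongside the kept ones makes it easy to spot a genuine food name that happens to exceed the cut-off.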

To verify that the corpus was created correctly, we check that the number of rows in the dataframe equals ```corpus_size```.

In [171]:
print("Dataframe size =", df_corpus.shape)
print("Expected corpus size =", corpus_size)

Dataframe size = (112, 2)
Expected corpus size = 112
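
The same consistency check can be written as an assertion so that a mismatch stops the notebook instead of relying on a visual comparison. This is a minimal sketch under stated assumptions: `chunks` is a small hypothetical stand-in for the scraped lists, and `pd.concat` is used to build the frame because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for the scraped lists, keyed by cuisine.
chunks = {
    "Indian": ["appam", "dosa", "murtabak"],
    "Seafood": ["black pepper crab", "chilli crab"],
}

# One small frame per cuisine, then a single concatenation.
frames = [
    pd.DataFrame({"Food": foods, "Cuisine": [cuisine] * len(foods)})
    for cuisine, foods in chunks.items()
]
df_check = pd.concat(frames, ignore_index=True)

expected_rows = sum(len(foods) for foods in chunks.values())
# Compare the row count (shape[0]), not the whole shape tuple.
assert df_check.shape[0] == expected_rows
print(df_check.shape[0], expected_rows)
```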


## Data Cleaning

From a visual inspection of the corpus, we found that the following food names can be modified:

**Chinese**
- 'beef kway teow', 'hainanese chicken rice', 'shredded chicken noodles' *- add alternative name*

**Malay**
- 'nasi goreng', 'nasi padang', 'roti john' *- add alternative name*

**Indian**
- 'roti prata' *- add alternative name*

**Desserts**
- 'kuih or kueh' *- change to alternative name*
- 'kueh lapis is a rich' *- trim to the food name*
- 'lapis sagu is also a popular kueh with layers of alternating colour and a sweet' *- trim to the food name*

**Drinks and beverages**
- 'chin chow drink', 'yuenyeung coffee' *- add alternative name*

In [172]:
df_corpus["alternative names"] = df_corpus["Food"]

In [173]:
wrong_names = ['kueh lapis is a rich',
               'lapis sagu is also a popular kueh with layers of alternating colour and a sweet',
               'kuih or kueh']

# Dropping wrong rows
df_corpus = df_corpus[~df_corpus["Food"].isin(wrong_names)]
df_corpus.shape

(110, 3)

In [175]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0

# Adding corrected names
names = ['kueh lapis', 'lapis sagu']
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": names,
                                     "Cuisine": ['Desserts']*2})])

# Adding Alternative names
cus = 'Chinese'
names = ['beef kway teow', 'hainanese chicken rice', 'shredded chicken noodles']
alternative = ['kway teow', 'chicken rice', 'chicken noodles']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Malay'
names = ['nasi goreng', 'nasi padang', 'nasi goreng', 'nasi padang', 'roti john']
alternative = ['nasigoreng', 'nasipadang', 'nasi-goreng', 'nasi-padang', 'roti-john']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Indian'
names = ['roti prata', 'roti prata', 'roti prata']
alternative = ['rotiprata', 'roti-prata', 'prata']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Desserts'
names = ['kueh', 'kueh']
alternative = ['kuih', 'kueh']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Drinks and beverages'
names = ['chin chow drink', 'chin chow drink', 'chin chow drink', 'yuenyeung coffee']
alternative = ['chin chow', 'chinchow', 'chin-chow', 'yuenyeung']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

df_corpus.shape

(144, 3)
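
The blocks above all follow the same pattern, so the extra rows could also be generated from a single mapping. A minimal sketch, assuming a hypothetical `alternatives` dictionary (only two of the entries above are reproduced) and plain Python records that can later be passed to `pd.DataFrame`:

```python
# Hypothetical mapping: (cuisine, food name) -> alternative spellings.
alternatives = {
    ("Indian", "roti prata"): ["rotiprata", "roti-prata", "prata"],
    ("Drinks and beverages", "yuenyeung coffee"): ["yuenyeung"],
}

# One record per alternative spelling, using the notebook's column names.
records = [
    {"Food": food, "alternative names": alt, "Cuisine": cuisine}
    for (cuisine, food), alts in alternatives.items()
    for alt in alts
]

for record in records:
    print(record)
```

Keeping the spellings in one dictionary means a new alternative name is a one-line change rather than another copied block.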

Finally, we generate some summary statistics for our corpus.

In [145]:
print(df_corpus.shape)
print(df_corpus["Cuisine"].value_counts())

(113, 2)
Chinese                 33
Malay                   27
Cross-cultural          19
Desserts                10
Indian                  10
Drinks and beverages     9
Seafood                  5
Name: Cuisine, dtype: int64
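
Once alternative names are added, each spelling occupies its own row, so `value_counts()` on `Cuisine` counts rows rather than distinct dishes. The sketch below shows one way to report both figures, using a small hypothetical frame (`df_small`) with the same column names as `df_corpus`.

```python
import pandas as pd

# Hypothetical miniature of df_corpus after alternative names are added.
df_small = pd.DataFrame({
    "Food": ["roti prata", "roti prata", "roti prata", "dosa", "satay"],
    "alternative names": ["roti prata", "rotiprata", "prata", "dosa", "satay"],
    "Cuisine": ["Indian", "Indian", "Indian", "Indian", "Malay"],
})

# Rows per cuisine (every spelling counts once).
rows_per_cuisine = df_small["Cuisine"].value_counts()
# Distinct dishes per cuisine (spellings of the same dish count once).
dishes_per_cuisine = df_small.groupby("Cuisine")["Food"].nunique()

print(rows_per_cuisine)
print(dishes_per_cuisine)
```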


## Exporting Corpus

In [147]:
df_corpus.to_csv("../Instagram/" + "corpus_wikipedia.csv", index=False)

# Importing Dataset

In [148]:
filepath = "../Instagram/Cleaned data/"
filename = "All_posts.csv"

In [149]:
df = pd.read_csv(filepath + filename)  
df.head()

Unnamed: 0,timestamp,Username,caption,no. of likes,no. of comments,comments
0,30/6/2020,8days_eat,Katong’s famous Hokkien mee and fried mee sua ...,326,7,"['@lauren.khoury still can’t crack a smile 😬',..."
1,30/6/2020,8days_eat,"For the first time in Singapore, Heytea will b...",258,3,"['Awesomeeeee neeed this now', '@becauseitsdon..."
2,29/6/2020,8days_eat,Korean “fat-carons” — supersized macarons stuf...,453,3,['#8dayseat #sgfoodies #instafood #yum #sgfood...
3,29/6/2020,8days_eat,Online ordering system Oddle has launched a ne...,287,3,"['THANK YOU 👀📸🔥', 'Absolutely love this tinkat..."
4,29/6/2020,8days_eat,A canelé is a bite-sized French pastry from Bo...,431,3,"['@jet8food , I see it! 😁', '@brave_nic', '🙏🏼💗💗']"


In [150]:
numposts = df.shape[0]

## Data Pre-processing

In [153]:
df["caption"] = df["caption"].apply(lambda x: x.lower())
df.head()

Unnamed: 0,timestamp,Username,caption,no. of likes,no. of comments,comments
0,30/6/2020,8days_eat,katong’s famous hokkien mee and fried mee sua ...,326,7,"['@lauren.khoury still can’t crack a smile 😬',..."
1,30/6/2020,8days_eat,"for the first time in singapore, heytea will b...",258,3,"['Awesomeeeee neeed this now', '@becauseitsdon..."
2,29/6/2020,8days_eat,korean “fat-carons” — supersized macarons stuf...,453,3,['#8dayseat #sgfoodies #instafood #yum #sgfood...
3,29/6/2020,8days_eat,online ordering system oddle has launched a ne...,287,3,"['THANK YOU 👀📸🔥', 'Absolutely love this tinkat..."
4,29/6/2020,8days_eat,a canelé is a bite-sized french pastry from bo...,431,3,"['@jet8food , I see it! 😁', '@brave_nic', '🙏🏼💗💗']"
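
If any caption is missing, calling `x.lower()` inside `apply` raises an `AttributeError`, because pandas represents a missing value as a float. The vectorised `.str.lower()` accessor leaves missing values untouched. A minimal sketch with a hypothetical two-row frame (`df_demo`), not the Instagram dataset itself:

```python
import pandas as pd

# Hypothetical captions; None stands in for a post without a caption.
df_demo = pd.DataFrame({"caption": ["Katong's famous Hokkien mee", None]})

# .str.lower() lower-cases strings and propagates missing values.
df_demo["caption"] = df_demo["caption"].str.lower()
print(df_demo["caption"].tolist())
```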
