# Northfork case

This notebook covers the Northfork case. The input data are to be structured so that similar ingredients are linked to the same connector, referred to in this notebook as the `keyword`. For a full description of the task, see the [dashboard](https://miro.com/app/board/uXjVMIq3jj0=/).

## Data 
The data are taken from https://docs.google.com/spreadsheets/d/1YO2FcP3H049GrztXx0ATOqC2Y-f5kNk61kquD_k7dos/edit#gid=1528949334. Two sheets can be found at the link: English and French. The data are downloaded into two CSV files named `data_english.csv` and `data_french.csv`.

## Script overview
The notebook starts by loading the spaCy model for the input language, which the user sets by hand in the code (marked `FIXME`). This should be the only change needed to run the notebook on either the English or the French input data.

The data from the CSV file are loaded into the notebook as a dataframe. The data are then cleaned: all letters are lowercased and all digits are removed. Outlook: digit removal should be revisited if quantities are needed as part of the ingredient.
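One possible way to address the outlook item is to separate the digits from the text instead of discarding them. The following is a minimal sketch under that assumption, not part of the notebook; the helper name `split_amount` is hypothetical.

```python
import re

def split_amount(text):
    """Return the lowercased text without digits, plus the numbers that were found."""
    numbers = re.findall(r'\d+', text)          # keep the numbers for later use
    cleaned = re.sub(r'\d+', '', text).lower().strip()
    return cleaned, numbers

print(split_amount('Une banane très délicieuse et bien mûre1'))
# ('une banane très délicieuse et bien mûre', ['1'])
```

Whether the trailing numbers in the data are quantities or just row labels is not stated in the source, so how they should be used afterwards is left open.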

Looking at the first column of the dataframe (Ingredients), the ingredient in each row is tokenised. The connector, or keyword, is extracted by selecting the tokens whose dependency label is `ROOT` or `compound`. This was found to be more inclusive than selecting the tokens tagged as nouns. The keywords are collected in a dictionary whose keys are the connectors and whose values count the rows from which each keyword was extracted. For example, `banane: 99` means that `banane` is the keyword of 99 ingredient rows.

Since the extraction function is inclusive, a validation is needed to check that the results are correct. Two checks are presented:  
1) Check that the sum of the dictionary values equals the number of rows in the input data.  
2) Check whether two keywords appear in the same row. If they do not, the keyword extraction is considered correct.
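Both checks can also be sketched directly on the extraction output. The example below is an illustration only: it assumes, as in `extract_keyword` further down, that several keywords found in one row are joined by spaces into a single string. The helper `validate` and the sample list are made up.

```python
from collections import Counter

def validate(extracted, n_rows):
    """Return (counts_match, suspicious_rows) for a list of per-row keyword strings."""
    freq = Counter(extracted)
    counts_match = sum(freq.values()) == n_rows
    # Flag rows whose extraction is empty or holds more than one keyword
    suspicious = [i for i, kw in enumerate(extracted) if len(kw.split()) != 1]
    return counts_match, suspicious

extracted = ['banane', 'banane', 'crème fouet', '', 'avocat']  # made-up output
print(validate(extracted, n_rows=5))
# (True, [2, 3])
```

When exactly one string is extracted per row, the first check mainly confirms that no row was dropped along the way, so the per-row flagging carries most of the information.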

## Outlook
Note that the script stops at the validation step. The remaining step is to assign the keyword to the Connector column in the CSV file. The assignment should be done so that one keyword is assigned once for the group of rows that share similar ingredients; the size of that group is the value stored under the corresponding dictionary key. If two keywords appear in the same row, the one with more occurrences is selected as the keyword, together with its corresponding value. Since this is plain Python, it should be straightforward and is skipped for now for the sake of time.
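The selection rule for rows with two keywords could look roughly like the sketch below. It is not the notebook's implementation: it assumes each row comes with a list of candidate keywords, and the helper `pick_connector` and the sample rows are hypothetical.

```python
from collections import Counter

def pick_connector(candidates, freq):
    """Return the candidate keyword with the highest overall count."""
    return max(candidates, key=lambda k: freq.get(k, 0))

# Made-up per-row candidates; the second row yields two keywords
row_keywords = [['banane'], ['crème', 'fouet'], ['crème'], ['crème'], ['fouet']]
freq = Counter(k for row in row_keywords for k in row)

connectors = [pick_connector(row, freq) for row in row_keywords]
print(connectors)
# ['banane', 'crème', 'crème', 'crème', 'fouet']
```

The resulting list has one entry per row, so it could be written to the Connector column as is, or reduced to one entry per group depending on how the column is meant to be filled.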

Note that the English data use `Full fat cream` as the connector for the cream-related ingredients. This script, however, extracts the word `cream` instead. One option in this case is to add a condition that passes the first ingredient of that category as the connector.
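A minimal sketch of that condition follows, assuming the rows are ordered so that the desired connector is the first ingredient of its category. Apart from `Full fat cream`, the sample rows are made up, and `first_ingredient_labels` is a hypothetical helper.

```python
def first_ingredient_labels(ingredients, keywords):
    """Map each keyword to the first ingredient text from which it was extracted."""
    labels = {}
    for text, kw in zip(ingredients, keywords):
        labels.setdefault(kw, text)  # only the first occurrence is stored
    return labels

ingredients = ['Full fat cream', 'Whipped cream', 'Banana']  # illustrative rows
keywords = ['cream', 'cream', 'banana']                      # assumed extraction result

labels = first_ingredient_labels(ingredients, keywords)
print([labels[kw] for kw in keywords])
# ['Full fat cream', 'Full fat cream', 'Banana']
```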

## Table of contents
1. [Data cleaning](#Data-cleaning)
2. [Keyword extraction](#Keyword-extraction)
3. [Result](#Result)
4. [Validation](#Validation)

//
Prim Pasuwan

In [1]:
import pandas as pd
import spacy
import re
from collections import Counter

In [2]:
# Specify the desired language
#lang = 'english' # FIXME
lang = 'french' # FIXME

# Load the Spacy language model
if lang == 'english':
    nlp = spacy.load('en_core_web_sm')
elif lang == 'french':
    nlp = spacy.load('fr_core_news_sm')
else:
    raise ValueError(f'Unsupported language: {lang}')

In [3]:
# Read the CSV file into a dataframe
df = pd.read_csv('data_'+lang+'.csv') # FIXME

In [13]:
# A quick glance at the data
df.head()

| | Ingredients |
|---|---|
| 0 | Une banane très délicieuse et bien mûre1 |
| 1 | Une banane très délicieuse et bien mûre2 |
| 2 | Une banane très délicieuse et bien mûre3 |
| 3 | Une banane très délicieuse et bien mûre4 |
| 4 | Une banane très délicieuse et bien mûre5 |


## Data cleaning

In [15]:
# Define a function to remove all digits from a list of strings
def remove_numbers(s):
    pattern = '[0-9]'
    s = [re.sub(pattern, '', i) for i in s]
    return s

In [138]:
# Make a new df out of the first column of the data
first_column = pd.DataFrame(df[df.columns.values[0]])

In [145]:
# Convert all letters to lowercase
col_low = pd.DataFrame(first_column[first_column.columns.values[0]].str.lower())

In [144]:
col_low.head()

| | Ingredients |
|---|---|
| 0 | une banane très délicieuse et bien mûre1 |
| 1 | une banane très délicieuse et bien mûre2 |
| 2 | une banane très délicieuse et bien mûre3 |
| 3 | une banane très délicieuse et bien mûre4 |
| 4 | une banane très délicieuse et bien mûre5 |


In [140]:
# Convert the df column to a flat list of strings
list_low_ingr = col_low[col_low.columns.values[0]].tolist()

In [127]:
# Remove all numbers from the ingredients
ingr = remove_numbers(list_low_ingr)

## Keyword extraction

In [43]:
# Define a function to extract the keyword from a string
def extract_keyword(text):
    doc = nlp(text)
    keywords = []
    for token in doc:
        # Alternative (less inclusive): token.pos_ == 'NOUN' and token.text not in keywords
        if token.dep_ in ('ROOT', 'compound'):
            keywords.append(token.text)
    return ' '.join(keywords)

In [128]:
# Extract the keyword from each ingredient i
extracted_keyword = [extract_keyword(i) for i in ingr]

In [129]:
# Define a function to count the frequency of the keyword
# The function takes a list as input and returns a dictionary
def count_keyword(k):
    return dict(Counter(k))

# To see the keys or the values of the dictionary:
# my_dict.keys() or my_dict.values()

In [130]:
# Get the keyword and its count
freq = count_keyword(extracted_keyword)

## Result

In [131]:
freq

{'banane': 99,
 'avocat': 68,
 'avoca': 1,
 'advocat': 1,
 'crème': 180,
 'fouet': 55,
 'filets': 58,
 'cueillies': 19,
 'pommes': 24}
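The counts above include `avoca` and `advocat`, each seen once, which look like misspellings of `avocat` (an assumption about the data, not something the source confirms). One way to fold such rare keys into a frequent neighbour is sketched below with `difflib` from the standard library; the helper `merge_rare_keywords` and the thresholds `min_count=5` and `cutoff=0.8` are arbitrary choices for illustration.

```python
from difflib import get_close_matches

def merge_rare_keywords(freq, min_count=5, cutoff=0.8):
    """Fold keywords seen fewer than min_count times into a close frequent match."""
    frequent = [k for k, v in freq.items() if v >= min_count]
    merged = {k: freq[k] for k in frequent}
    for key, count in freq.items():
        if count >= min_count:
            continue
        match = get_close_matches(key, frequent, n=1, cutoff=cutoff)
        target = match[0] if match else key  # unmatched rare keys are kept as-is
        merged[target] = merged.get(target, 0) + count
    return merged

freq = {'banane': 99, 'avocat': 68, 'avoca': 1, 'advocat': 1, 'crème': 180,
        'fouet': 55, 'filets': 58, 'cueillies': 19, 'pommes': 24}
merged = merge_rare_keywords(freq)
print(merged['avocat'], sum(merged.values()))
# 70 505
```

The total stays at 505, so the row-count check in the validation section is unaffected by the merge.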

## Validation

In [133]:
# Get the sum of all the counts
vsum = sum(freq.values())
vsum

505

In [134]:
df.shape

(505, 1)

In [158]:
# Go through consecutive pairs of keywords and count the rows of Ingredients
# in which both keywords appear (substring match).
# If 0, the two keywords never share a row and the extraction is consistent.
# If not 0, the two keywords appear in the same ingredient.

col = col_low[col_low.columns.values[0]]
list_keys = list(freq.keys())
for first, second in zip(list_keys, list_keys[1:]):
    n_both = (col.str.contains(first) & col.str.contains(second)).sum()
    print(f'{first} & {second:10} {n_both} ')

banane & avocat     0 
avocat & avoca      68 
avoca & advocat    0 
advocat & crème      0 
crème & fouet      180 
fouet & filets     0 
filets & cueillies  0 
cueillies & pommes     19 
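The count of 68 for `avocat & avoca` is most likely a side effect of substring matching: every row that contains `avocat` also contains the shorter string `avoca`. A hedged sketch of a whole-word variant is given below, using plain `re` on made-up rows; the helper `rows_with_both` is hypothetical.

```python
import re

def rows_with_both(rows, first, second):
    """Count the rows in which both keywords occur as whole words."""
    pat_a = re.compile(rf'\b{re.escape(first)}\b')
    pat_b = re.compile(rf'\b{re.escape(second)}\b')
    return sum(1 for r in rows if pat_a.search(r) and pat_b.search(r))

rows = ['une banane bien mûre', 'un avocat mûr', 'un avoca mûr']  # made-up rows

print(rows_with_both(rows, 'avocat', 'avoca'))             # whole-word matching
# 0
print(sum('avocat' in r and 'avoca' in r for r in rows))   # substring matching
# 1
```

The same `\b...\b` pattern could be passed to `Series.str.contains`, which treats its argument as a regular expression by default. The nonzero counts for the other pairs may reflect genuine co-occurrence and would need to be inspected row by row.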
