# Introduction
In Part I, we retrieved the research paper's dataset and converted the .txt files to .csv. 

In this Part, we will replicate the "Data and Methods" portion of the research to the best of our ability. More specifically, you will:
1. Plot distribution of ingredients per cuisine
2. Replicate Table 1 
3. Chart the cumulative frequency distribution of the ingredient usage
4. See if Heaps' law applies to the dataset

<font color='red'><strong>Disclaimer: some of our reproduced results may not match what you see in the paper. The UpLevel team did its best to replicate the results of the study.</strong></font>

### Step 1: Import libraries
Let's start with importing the libraries
- pandas as pd
- matplotlib.pyplot as plt
- seaborn as sns
- glob
- numpy as np

In [None]:
# Step 1: Import libraries

### Step 2: Get a list of files in the "cuisine_recipe_ingredient_CSV" folder
In the subsequent Steps, you'll be looping through each of the CSVs in the folder for analysis.

As such, get a list containing the xxx.csv files in the "cuisine_recipe_ingredient_CSV" folder.

glob is useful here. 
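Here is a minimal sketch of how glob collects filenames, run against a throwaway folder so that it works anywhere. The helper name `list_csv_files` and the demo filenames are made up for illustration; in the notebook you would point it at your own "cuisine_recipe_ingredient_CSV" folder.

```python
import glob
import os
import tempfile


def list_csv_files(folder):
    """Return the sorted paths of all .csv files directly inside `folder`."""
    pattern = os.path.join(folder, "*.csv")
    return sorted(glob.glob(pattern))


# Demo on a temporary folder (hypothetical filenames, not the real dataset)
demo_dir = tempfile.mkdtemp()
for name in ["b.csv", "a.csv", "notes.txt"]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("recipe_id,ingredient\n")

demo_files = list_csv_files(demo_dir)
demo_names = [os.path.basename(p) for p in demo_files]
print(demo_names)
```

Sorting is optional, but glob does not guarantee any particular order, so sorting keeps the list reproducible between runs.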

In [None]:
# Step 2: Get a list of .CSV filenames

## Replicate "Data collection"
![DataCollection](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/DataCollection.png)

For starters, we will replicate what is mentioned in the article - that there are 8,498 recipes and 2,911 ingredients.

### Step 3: Get the total number of recipes
We'll first validate the fact that there are 8,498 recipes in the dataset. 

You will need to get the total number of unique "recipe_id" values across all 20 CSVs in "cuisine_recipe_ingredient_CSV".

In [None]:
# Step 3: Get the total number of recipes

<details>
    <summary><strong>Click here once for pseudocode if you're stuck</strong></summary>
    <ol>
        <li>Declare a variable containing an empty list</li>
        <li>Use a for loop to loop through the filenames in the list you got from Step 2. In each loop:</li>
        <ul>
            <li>Create a DataFrame using the current loop's filename</li>
            <li>Declare a variable and get the "recipe_id" column</li>
            <li>Call .unique on the column to get the unique items in the column</li>
            <li>Get the length of the list of unique recipe_id</li>
            <li>Append the length into the list</li>
        </ul>
        <li>Sum the lengths in the list</li>
    </ol>
</details>
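The pseudocode above can be sketched on toy data as follows. The two small DataFrames are made-up stand-ins for cuisine CSVs; with the real files you would build each DataFrame with `pd.read_csv(filename)` inside the loop. The sketch assumes, as the pseudocode does, that no recipe_id is shared between files.

```python
import pandas as pd

# Toy stand-ins for two cuisine CSVs (made-up values)
toy_frames = [
    pd.DataFrame({"recipe_id": [1, 1, 2], "ingredient": [10, 11, 10]}),
    pd.DataFrame({"recipe_id": [3, 3, 3, 4], "ingredient": [10, 12, 13, 12]}),
]

recipe_counts = []
for df in toy_frames:
    unique_ids = df["recipe_id"].unique()   # distinct recipes in this file
    recipe_counts.append(len(unique_ids))

total_recipes = sum(recipe_counts)
print(recipe_counts, total_recipes)
```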

### Step 4: Get the total number of ingredients
We'll also get the total number of ingredients in the dataset. 

We can do this in two ways:
1. Get the number of unique items from all of the 'ingredient' columns in the CSVs
2. Get the length of the DataFrame from "component_id.txt" in the "dataset" folder 

#### Approach 1: Get the number of unique items in all CSVs
This approach is similar to Step 3, where you loop through all the CSVs and get the unique items from "ingredient" column instead of "recipe_id".
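One difference from Step 3 is worth a sketch: the same ingredient can appear in several cuisines, so summing the per-file unique counts would count it more than once. A set avoids that. The toy DataFrames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for two cuisine CSVs; ingredient 10 appears in both
toy_frames = [
    pd.DataFrame({"recipe_id": [1, 1, 2], "ingredient": [10, 11, 10]}),
    pd.DataFrame({"recipe_id": [3, 3, 4], "ingredient": [10, 12, 12]}),
]

all_ingredients = set()
per_file_counts = []
for df in toy_frames:
    unique_ingredients = df["ingredient"].unique()
    per_file_counts.append(len(unique_ingredients))
    all_ingredients.update(unique_ingredients)   # the set drops repeats across files

total_ingredients = len(all_ingredients)
print(per_file_counts, total_ingredients)
```

Here the per-file counts add up to 4, but only 3 distinct ingredients exist overall.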

In [None]:
# Step 4 Approach 1: Get the number of unique items in "ingredient"

#### Approach 2: Get the length of the DataFrame from "component_id.txt"
This is both easy and hard to do, because you will encounter a problem with reading the file. 

The problem lies in the encoding of the text, which you will have to figure out. 

The <a href = "https://chardet.readthedocs.io/en/latest/usage.html">chardet</a> library will be useful for you.
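To see what an encoding problem looks like, here is a small sketch that does not use chardet: it tries a short list of candidate encodings on a file's raw bytes. The helper `first_working_encoding`, the demo file, and the candidate list are all assumptions made for illustration, and the demo file is not the real component_id.txt.

```python
import os
import tempfile


def first_working_encoding(raw_bytes, candidates):
    """Return the first encoding in `candidates` that decodes `raw_bytes`, else None."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Hypothetical demo file written in GB18030; NOT the real component_id.txt
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="gb18030") as f:
    f.write("1\t豆腐\n2\t辣椒\n")

with open(demo_path, "rb") as f:
    raw = f.read()

found = first_working_encoding(raw, ["utf-8", "gb18030"])
print(found)
```

Once you know the encoding, whether from a trial like this or from the guess that `chardet.detect` returns, you can pass it to pandas through the `encoding` argument of `pd.read_csv`.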

In [None]:
# Step 4 Approach 2: Read component_id.txt into a DataFrame and get the DataFrame length

## Replicate Figure 2
In the research publication, Figure 2 contains a probability distribution of the number of ingredients per recipe.

![Figure2](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure2.png)

Amongst the twenty cuisines, the authors chose eight to plot. 

In this Section, you will replicate the plot.

### Step 5: Get the ingredient count per recipe for the chuancai DataFrame
We'll start with chuancai.csv, and get the number of ingredients per recipe_id.

![ChuancaiDFIngredientCount](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/ChuancaiDFIngredientCount.png)

Again, there are several ways to do it:
1. Perform a .groupby operation on 'recipe_id' and get the count of the 'ingredient'
2. Perform a .value_counts operation on 'recipe_id'
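Both approaches can be sketched on a toy DataFrame (made-up values standing in for chuancai.csv):

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2, 3],
    "ingredient": [10, 11, 12, 10, 13, 11],
})

# Approach 1: group rows by recipe and count the ingredients in each group
by_groupby = toy.groupby("recipe_id")["ingredient"].count()

# Approach 2: count how many rows each recipe_id occupies
by_value_counts = toy["recipe_id"].value_counts().sort_index()

print(by_groupby.to_dict())
print(by_value_counts.to_dict())
```

The two agree as long as every row is one (recipe, ingredient) pair and no ingredient value is missing, since `.count()` skips missing values while `.value_counts()` on 'recipe_id' counts rows.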

In [None]:
# Step 5: Get a count of the ingredients per recipe_id

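### Step 6: Plot a distplot of the ingredient counts
Whichever approach you used for Step 5, you can use seaborn's distplot method to plot your list of ingredient counts. Note that distplot has been deprecated since seaborn 0.11; in newer versions, sns.kdeplot draws the equivalent density curve to a distplot with hist=False.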

You will get something like this:

![ChuancaiIngredientCountDistplot](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/ChuancaiIngredientCountDistplot.png)

In [None]:
# Step 6: Plot a distplot on the ingredient count data

### Step 7: Repeat and combine the plot together for multiple cuisines
Now that you have successfully plotted the distplot for chuancai.csv, let's repeat Steps 5-6 with the following cuisines:
- lucai.csv
- chuancai.csv
- yuecai.csv
- sucai.csv
- mincai.csv
- zhecai.csv
- xiangcai.csv
- huicai.csv

You won't be able to use the list from Step 2, so you'll have to declare a new list with these files in order. Use the list from Step 2 to copy and paste the full paths. 

Don't forget to put hist=False when you plot the distplot so the resultant plot resembles Figure 2.

In [None]:
# Step 7a: Declare a new list containing the filenames

In [None]:
# Step 7b: Loop through the list and plot the distplots on the same plot

<details>
    <summary><strong>Click here once to see what we got</strong></summary>
    <img src="https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure2Replicated.png">
</details>

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Use a for loop to loop through the list of filenames you declared in Step 7a. In each loop:</li>
        <ul>
            <li>Create a temporary DataFrame from the current filename</li>
            <li>Get the count of ingredients per 'recipe_id' (as in Step 5) and assign it to a variable</li>
            <li>Call .distplot with the counts, and don't forget to put hist=False</li>
        </ul>
    </ol>
</details>
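As a numeric cross-check on the curves, the empirical probability of each recipe size can be computed directly from the ingredient counts. This is not what distplot draws (that is a smoothed density estimate); it is the raw proportion of recipes with each number of ingredients, sketched here on made-up data.

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2, 3, 3, 4],
    "ingredient": [10, 11, 12, 10, 13, 11, 12, 10],
})

# Number of ingredients in each recipe (same idea as Step 5)
ingredient_counts = toy.groupby("recipe_id")["ingredient"].count()

# Share of recipes that have exactly K ingredients
size_probability = ingredient_counts.value_counts(normalize=True).sort_index()
print(size_probability.to_dict())
```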

## Replicate Table 1
Once we have replicated Figure 2 successfully, let's take a look at the next thing to replicate - Table 1.

![Table1](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Table1.png)

Table 1 has four columns:

<font color='red'>$N_{r}$</font> = number of recipes

<font color='green'>$N_{i}$</font> = number of ingredients

<font color='blue'>$\hat{N}_{i}$</font> = number of ingredients used only in the cuisine

<font color='purple'>$\langle K \rangle$</font> = average number of ingredients in a recipe
    
If this Section looks intimidating, please do not worry - you actually have figured out the way to get these numbers based on your work in Steps 3-7.
    
We'll start with getting the four lists, and then building a DataFrame out of the lists to replicate Table 1. 

### Step 8: Get the list of number of recipes for cuisines <font color='red'>$N_{r}$</font>
In this Step, we'll count how many recipes there are in each cuisine, storing them as a single list. 

Loop through the full list of CSVs that you got from Step 2 and append the number of unique recipes into a list. 

Don't sweat it - you've done this for Step 3 before but this time you do not sum the numbers in the list.

In [None]:
# Step 8: Get the list of number of recipes per cuisine

### Step 9: Get the list of number of ingredients for cuisines <font color='green'>$N_{i}$</font>
In this Step, we'll count how many unique ingredients there are for cuisines, storing them as a single list as well.

Similar to Step 8, loop through the list of CSV filenames and append the number of ingredients into a list.

In [None]:
# Step 9: Get the list of number of ingredients per cuisine

### Step 10: Get the number of ingredients unique to each cuisine <font color='blue'>$\hat{N}_{i}$</font>
In this Step, we identify how many ingredients are used in one cuisine and in no other.

To illustrate this, we can take a look at the Venn diagram below:

![UniqueIngredients](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/UniqueIngredients.png)

There are a few ways to get the number of unique items per cuisine. 

That said, we recommend implementing <a href="https://stackoverflow.com/questions/17035577/unique-features-between-multiple-lists">this code</a> and modifying it for this specific use case.

In [None]:
# Step 10: Get the number of unique ingredients in each cuisine

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Create a list containing lists of unique ingredients in each cuisine</li>
        <li>Declare an empty list to store the length of unique ingredients in each cuisine</li>
        <li>Copy and paste the code from the best Stack Overflow answer in the URL</li>
        <li>Replace [A_1, A_2, A_3] with your list of lists for the variable <strong>all_lists</strong></li>
        <li>Replace (A_1, A_2, A_3) with your list of lists for the for loop</li>
        <li>Replace print(sorted(uniques)) with appending the length of <strong>sorted(uniques)</strong> into the empty list above</li>
    </ol>
</details>
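If you would rather not adapt the linked code, here is an alternative sketch of the same idea. It is a different technique from the Stack Overflow answer: count in how many cuisines each ingredient appears, then keep the ingredients that appear in exactly one. The sets below are made-up stand-ins for the per-cuisine lists of unique ingredients.

```python
from collections import Counter

# Toy stand-in: one set of ingredient ids per cuisine (made-up values)
cuisine_ingredients = [
    {10, 11, 12},
    {10, 13},
    {11, 14, 15},
]

# Count in how many cuisines each ingredient appears
cuisine_membership = Counter()
for ingredients in cuisine_ingredients:
    cuisine_membership.update(ingredients)

# An ingredient is unique to a cuisine if it appears in exactly one cuisine
unique_counts = []
for ingredients in cuisine_ingredients:
    only_here = [i for i in ingredients if cuisine_membership[i] == 1]
    unique_counts.append(len(only_here))

print(unique_counts)
```

Each cuisine must be a set (or a list without repeats) for this to work; otherwise an ingredient repeated within one cuisine would be counted as appearing in several.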

### Step 11: Get the mean number of ingredients per recipe in each cuisine <font color='purple'>$\langle K \rangle$</font>
In this Step, we will get the mean number of ingredients per recipe in each cuisine by:
1. Performing a groupby operation on the recipe_id and counting the ingredients
2. Getting the mean of the 'ingredient' column
3. Appending the mean to a list
    
![AverageIngredientPerCuisineExample](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/AverageIngredientPerCuisineExample.png)
    
In this case, averaging the ingredient count of chuancai cuisine will yield 10.512.
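For a single cuisine, the groupby and mean can be sketched like this, using a made-up toy DataFrame in place of a real cuisine CSV:

```python
import pandas as pd

toy = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2, 3],
    "ingredient": [10, 11, 12, 10, 13, 11],
})

# One row per recipe; the 'ingredient' column now holds the ingredient count
per_recipe = toy.groupby("recipe_id").count()

# Average number of ingredients per recipe: (3 + 2 + 1) / 3
mean_ingredients = per_recipe["ingredient"].mean()
print(mean_ingredients)
```

Inside your loop over the CSV filenames, you would append `mean_ingredients` to a list instead of printing it.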

In [None]:
# Step 11: Get the list containing the average ingredient per recipe in each cuisine

### Step 12: Put the lists into a DataFrame
Now that you have obtained all four lists, it's time to create Table 1.

![image.png](attachment:image.png)

In [None]:
# Step 12: Recreate Table 1 as DataFrame

### Step 13: Reorder the rows in the DataFrame
If you used the list from Step 2, you'll notice that the order of the cuisines is not the same as the original Table 1. 

Reorder the rows using .reindex so that you can match Table 1. 

Note that Muslim in Table 1 is qingzhen, HK is gangtai.
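The mechanics of Steps 12 and 13 can be sketched with placeholder data. Every name and number below is made up purely to show how the lists become a DataFrame and how .reindex reorders rows; none of it comes from the dataset or from Table 1.

```python
import pandas as pd

# Made-up cuisine names and numbers, purely to show the mechanics
cuisines = ["alpha", "beta", "gamma"]
n_recipes = [10, 20, 30]
n_ingredients = [5, 6, 7]
n_unique = [1, 0, 2]
mean_k = [3.0, 4.5, 5.0]

# Step 12: one column per list, one row per cuisine
table = pd.DataFrame(
    {"N_r": n_recipes, "N_i": n_ingredients, "N_i_hat": n_unique, "K": mean_k},
    index=cuisines,
)

# Step 13: reorder the rows by passing the labels in the order you want
desired_order = ["gamma", "alpha", "beta"]
reordered = table.reindex(desired_order)
print(reordered)
```

A label passed to .reindex that is not in the index produces a row of missing values rather than an error, so check the spelling of each cuisine name.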

What do you notice about the DataFrame after reordering the rows?

In [None]:
# Step 13: Rearrange the rows of the DataFrame from Step 12

<details>
    <summary><strong>Click here once to see what we noticed</strong></summary>
    <img src="https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Table1Comparison.png">
    <br>
    <div>Our Table looks identical to the paper's for <font color='red'>$N_{r}$</font> and <font color='green'>$N_{i}$</font>, but not for the rest. <font color='blue'>$\hat{N}_{i}$</font> seems to be mixed up, while <font color='purple'>$\langle K \rangle$</font> should be similar between the two tables. We double-checked our side for possible errors, and our numbers appear to be correct. No big deal, no worries.</div>
</details>

## Replicate Figure 3
Figure 3 is the cumulative frequency distribution of ingredient usage.

![CumulativeFrequencyDistribution](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/CumulativeFrequencyDistribution.png)

Based on the paper, the distribution of ingredient usage follows a power law, which results in a somewhat straight line in the plot. We'll examine whether this claim is true by replicating the figure.

<font color='red'>Note: Bear in mind that we won't be able to replicate this image exactly, because of unforeseen factors and the fact that the project was executed in MATLAB.</font>

### Step 14: Install and import powerlaw
To draw a cumulative frequency plot, we'll need to use the powerlaw library. 

Go ahead and install and import powerlaw.

More details here: https://github.com/jeffalstott/powerlaw

In [None]:
# Step 14: Import powerlaw

### Step 15: Get the frequency of every single ingredient
Before we can plot the frequency distribution, we need to know how many times each ingredient id appears in ALL of the cuisines.

There are many ways to get the frequency, and here are some suggestions on what to do after getting the list of ingredients from all cuisines:
1. Create a Counter object using the list, followed by a DataFrame using the Counter object's keys and values
2. Create DataFrame using the list of ingredients, followed by a .value_counts() operation

![GettingIngredientValueCount](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/GettingIngredientValueCount.png)
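Both suggestions can be sketched on a short made-up list of ingredient ids:

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the combined list of ingredients from all cuisines
all_ingredients = [10, 11, 12, 10, 13, 11, 10]

# Suggestion 1: Counter, then a DataFrame from its keys and values
counter = Counter(all_ingredients)
freq_from_counter = pd.DataFrame(
    {"ingredient": list(counter.keys()), "count": list(counter.values())}
)

# Suggestion 2: a pandas object built from the list, then .value_counts()
freq_from_value_counts = pd.Series(all_ingredients).value_counts()

print(freq_from_counter.sort_values("count", ascending=False))
print(freq_from_value_counts)
```

The column names "ingredient" and "count" are our own choice here; use whatever names you find convenient for Step 17.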

In [None]:
# Step 15: Get the frequency of every single ingredient

### Step 16: Plot the ingredient frequency distribution
Now that you have the counts of the ingredients, let's plot the frequency distribution using the powerlaw library.

Implement the code that you see in https://nbviewer.jupyter.org/github/jeffalstott/powerlaw/blob/master/manuscript/Manuscript_Code.ipynb

More specifically, we can use the code in the <strong>code cell directly after "PDF Linear vs Logarithmic Bins"</strong>.

Don't worry if your plot is different from the paper's.
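If you want to sanity-check the powerlaw plot, a cumulative distribution can also be computed by hand. The sketch below uses plain numpy rather than the powerlaw library, and it assumes one common convention: for each usage level k, the share of ingredients used at least k times. The frequencies are made up.

```python
import numpy as np

# Toy stand-in for the per-ingredient frequencies from Step 15
ingredient_frequency = np.array([1, 1, 2, 5])

# Sorted distinct usage levels
usage_levels = np.unique(ingredient_frequency)

# Share of ingredients that are used at least k times, for each level k
share_at_least = np.array(
    [(ingredient_frequency >= k).mean() for k in usage_levels]
)
print(usage_levels, share_at_least)
```

Plotting `share_at_least` against `usage_levels` with both axes in log scale gives a curve you can compare against the library's output.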

<details>
    <summary><strong>Click once to see what we have</strong></summary>
    <img src="https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/PowerLawPlot.png">
    <br>
    <div>Looks like the plot is roughly a straight line, which suggests that the frequency of our ingredients follows a power law.</div>
</details>

In [None]:
# Step 16: Plot the power law plot

### Step 17: Export the frequency count as CSV
We'll need the frequency count later in Part IV where we will create a simulation for ingredient evolution.

The frequency count that we obtained from Step 15 - be it a DataFrame or Series - should be exported into CSV.
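A sketch of the export and the optional read-back check, assuming the frequency count is a Series. The values, the names "ingredient" and "count", and the output filename are made up for illustration, and the file is written to a temporary folder.

```python
import os
import tempfile

import pandas as pd

# Toy frequency count: index = ingredient id, values = number of appearances
frequency = pd.Series([3, 2, 1], index=[10, 11, 12], name="count")
frequency.index.name = "ingredient"

# Export, keeping the index so the ingredient ids are not lost
out_path = os.path.join(tempfile.mkdtemp(), "ingredient_frequency_demo.csv")
frequency.to_csv(out_path, header=True)

# Read it back and confirm nothing changed
round_trip = pd.read_csv(out_path, index_col="ingredient")["count"]
print(round_trip)
```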

In [None]:
# Step 17: Export the ingredient frequency count as CSV

In [None]:
# Optional: Use pandas to read the exported CSV to make sure you got the export right

## Replicate Figure 4
Figure 4 shows the number of distinct ingredients discovered vs the number of recipes scanned. 

![Figure4](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure4.png)

Intuitively, as we go through more and more recipes, we should expect to encounter fewer and fewer new ingredients.

As such, consider this:
1. x-axis: cumulative number of ingredients
2. y-axis: number of unique ingredients

The article mentions the averaging of 100 implementations of "independently random sequences" of recipes. We're unable to ascertain what that means, so we will just use the entire dataset to construct Figure 4.

### Step 18: Create lists for Figure 4
As mentioned, we'll need to create two lists:
1. x-axis: cumulative number of ingredients
2. y-axis: number of unique ingredients

You'll have to loop through each ingredient in all of the recipes in all cuisines (a list of 88,929 items), and in each loop you have to keep track of:
- how many ingredients you have looped through
- whether the ingredient in the current loop is new

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Declare an empty list to store the number of ingredients (List_1)</li>
        <li>Declare an empty list to store the number of unique ingredients (List_2)</li>
        <li>Declare a variable that is an empty set (Set_1)</li>
        <li>Declare a variable that holds the running count of ingredients, starting at 0 (Var_1)</li>
        <li>Use a for loop to loop through the combined list of ingredient items from all cuisines (88,929 items). In each loop:</li>
        <ul>
            <li>Increase the value of Var_1 by 1</li>
            <li>Append Var_1 into List_1</li>
            <li>Add the current loop's ingredient into Set_1 with .add</li>
            <li>Append the length of Set_1 into List_2</li>
        </ul>
    </ol>
</details>
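The pseudocode above can be sketched on a short made-up list; the comments map each variable to the names used in the pseudocode.

```python
# Toy stand-in for the combined list of ingredients (88,929 items in the real data)
all_ingredients = [10, 11, 10, 12, 11, 13]

cumulative_counts = []   # List_1: ingredients scanned so far
unique_counts = []       # List_2: distinct ingredients seen so far
seen = set()             # Set_1
scanned = 0              # Var_1

for ingredient in all_ingredients:
    scanned += 1
    cumulative_counts.append(scanned)
    seen.add(ingredient)              # adding a repeat leaves the set unchanged
    unique_counts.append(len(seen))

print(cumulative_counts)
print(unique_counts)
```

Notice that the second list only grows when a new ingredient shows up, which is the flattening behaviour Figure 4 is about.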

In [None]:
# Step 18: Create the lists for Figure 4

### Step 19: Create a line of best fit between the two lists
Before we recreate Figure 4 with the two lists from Step 18, we'll need to create a line of best fit between them. 

We'll be doing a log y vs log x fit, more specifically

$\log(y) = m \cdot \log(x) + c$

where $m$ is the gradient and $c$ is the y-intercept.

Use numpy's polyfit function to fit the two log-transformed lists with a degree of 1, and get m and c.

After that, generate a list of values y_fit using $m$ and $c$, where

$y\_fit = e^{m \cdot \log(x) + c}$

Here, log is the natural logarithm, which is why raising $e$ to the fitted value brings us back to the original scale.

Gentle reminder - x is List 1 from Step 18.
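To see the fit in action before applying it to your own lists, here is a sketch on synthetic data that follows an exact power law, so the recovered m and c can be checked by eye. The data and the constants 2.0 and 0.5 are made up; only `np.polyfit`, `np.log`, and `np.exp` are assumed.

```python
import numpy as np

# Synthetic data: y = 2 * x^0.5, so we expect m = 0.5 and c = log(2)
x = np.arange(1, 101)
y = 2.0 * x ** 0.5

# Degree-1 fit on the log-transformed values; polyfit returns [gradient, intercept]
m, c = np.polyfit(np.log(x), np.log(y), 1)

# Transform the fitted line back to the original scale
y_fit = np.exp(m * np.log(x) + c)
print(m, c)
```

With real data the points will not sit exactly on the line, so expect m and c to describe the overall trend rather than every point.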

In [None]:
# Step 19a: Get m and c from np.polyfit

In [None]:
# Step 19b: Create y_fit values

### Step 20: Plot the log-log plot with the line of best fit
Now that you have the line of best fit, plot the actual values along with the line of best fit on a log-log plot. 

You'll have to figure out how to transform the scales of the x- and y-axes to be in log scale.

If all goes well, you should see a plot like this.

![LogLogWithBestFitLine](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/LogLogWithBestFitLine.png)

In [None]:
# Step 20: Plot the figure

<details>
    <summary><strong>Is the line of best fit legit? Click once to see whether it is</strong></summary>
    <div>Yes it is! Let's take a different look at our plot, without the log scale</div>
    <img src = "https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/LogLogWithBestFitLineNormalScale.png">
    <br>
    <div><p>If you plot the graph without changing the axes' scale to log scale, you'll see that the curve that you generated fits the actual values pretty well.</p><p>It only looks like the lines do not align in the log-scale plot because of the log scale, which makes differences look big at the smaller quantities, e.g. 10^1 to 10^2.</p></div>
</details>

### End of Part II
Wow, what a part. In this Part, you have successfully replicated, to a certain extent, the Tables and Figures found in the "Data and Methods" part in the research paper. 

More specifically, you:
1. Plotted probability distribution of number of ingredients per recipe for selected cuisines
2. Calculated basic statistics of the various cuisines, and compiled the results into a Table
3. Plotted a cumulative frequency distribution of ingredient usage in recipes
4. Plotted a log-log plot of the number of distinct ingredients vs number of ingredients in total

Next Part, we'll continue and replicate the results in the "Results" section of the research paper.