# Introduction
In Part II, we went on a long journey to piece together the information in each cuisine and replicated the figures found in the "Data and Method" section.

In this Part, we will continue replicating the results found by the authors of the research paper in the "Results" section.

More specifically, you will:
1. Prepare a TF-IDF vector of ingredient usage across recipes
2. Perform Principal Component Analysis (PCA) for dimensionality reduction on the dataset
3. Combine data from different sources together for visualization

<font color = 'red'><strong>Disclaimer: there will be sections that we won't be able to replicate exactly, but that is ok because the coding language differs from the authors' and some of the source instructions are ambiguous. For the most part, the conclusions are consistent and valid.</strong></font>

### Step 1: Import libraries
For this Part, these are the libraries we need:
- pandas as pd
- glob
- matplotlib.pyplot as plt
- seaborn as sns
- numpy as np
- TfidfVectorizer from sklearn.feature_extraction.text
- PCA from sklearn.decomposition

In [None]:
# Step 1: Import libraries

### Step 2: Get a list of filenames in 'cuisine_recipe_ingredient_CSV' folder
Similar to the last Part's Step 2, we'll need to get a list of all the .csv files in the 'cuisine_recipe_ingredient_CSV' folder using glob.


In [None]:
# Step 2: Get a list of .CSV filenames
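
As a rough illustration of the glob pattern, the sketch below builds a throwaway folder with made-up filenames instead of using the real 'cuisine_recipe_ingredient_CSV' folder. Wrapping the result in `sorted()` is our own choice here, since glob itself does not promise any particular order.

```python
import glob
import os
import tempfile

# Build a temporary folder with hypothetical files to search through
with tempfile.TemporaryDirectory() as folder:
    for name in ["yue.csv", "chuan.csv", "notes.txt"]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("")

    # '*.csv' matches every file ending in .csv inside the folder
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    csv_basenames = [os.path.basename(path) for path in csv_files]

print(csv_basenames)
```

For the project itself, the same call would point at the real folder, e.g. a pattern like 'cuisine_recipe_ingredient_CSV/*.csv'.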

## Replicate Figure 5
We'll start with replicating Figure 5, the first result in the "Results" section.

![Figure5](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/Figure5.png)

Here, the authors created a TF-IDF based ingredient matrix and performed Principal Component Analysis (PCA) for plotting.

We'll do the following:
1. Read the CSVs and get a list containing ingredients from each cuisine
2. Join each list of cuisine ingredients into a string
3. Create a DataFrame with the string version of the ingredients
4. Perform vectorization with TfidfVectorizer
5. Perform PCA
6. Plot the PCA components

![Figure5Approach](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure5Approach.png)

### Step 3: Get a list of lists of ingredients
Loop through the filenames from Step 2, read the corresponding CSVs into DataFrames, and append each 'ingredient' column into a list.

Your list should only have 20 items.

In [None]:
# Step 3: Get a list of lists of ingredients

### Step 4: Get a list of cuisine names
We'll need a list of the cuisine names, so derive a list containing the cuisine names from the list in Step 2.

Alternatively, you can also manually declare a list containing the cuisine names.

In [None]:
# Step 4: Get a list of cuisine names
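
One possible way to derive the names is sketched below. The filenames are hypothetical stand-ins for the list from Step 2 ('yue' in particular is only an illustrative name), and using `pathlib` is our own choice rather than something the project prescribes.

```python
from pathlib import Path

# Hypothetical filenames, standing in for the list returned by glob in Step 2
filenames = [
    "cuisine_recipe_ingredient_CSV/yue.csv",
    "cuisine_recipe_ingredient_CSV/chuan.csv",
    "cuisine_recipe_ingredient_CSV/other.csv",
]

# Path(...).stem drops both the folder and the '.csv' extension
cuisine_names = [Path(name).stem for name in sorted(filenames)]
print(cuisine_names)
```

Sorting the filenames first keeps the cuisine names in the same order as any other list built from the sorted filenames.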

### Step 5: Join the items in each list
We will then join the items in each list into a string, and append the strings into a new list.

![IngredientTextExample](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/IngredientTextExample.png)

You will have a list containing 20 strings of ingredient ids.

In [None]:
# Step 5: Join the items in each list
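
A minimal sketch of the join, assuming the ingredient ids are stored as numbers. The ids below are made up, and the variable names are only illustrative.

```python
# Toy stand-in for the list of lists from Step 3 (ingredient ids are made up)
ingredient_lists = [[12, 305, 12, 44], [305, 44, 78]]

# str.join only accepts strings, so every id is converted with str() first
ingredient_texts = [" ".join(str(i) for i in ids) for ids in ingredient_lists]
print(ingredient_texts)
```

The same comprehension should work if each inner item is a pandas Series instead of a plain list, because iterating over a Series yields its values.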

### Step 6: Create a DataFrame for cuisine name and text
Combine the two lists you created in Steps 4-5 to create a DataFrame.

![IngredientTextDataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/IngredientTextDataFrame.png)

It should have:
- 20 rows
- 2 columns

In [None]:
# Step 6: Create an ingredient text DataFrame

### Step 7: Vectorize 'text' column with TfidfVectorizer
We'll turn the column of text into a DataFrame containing the TF-IDF scores of each ingredient. 

In the context of our project, important and unique ingredients will get a higher score compared to ingredients that are found in recipes across many cuisines.

After TF-IDF vectorization with the TfidfVectorizer object, you'll get a sparse matrix.

Turn that matrix into a DataFrame that contains:
- 20 rows
- 2,902 columns

P.S. You might have noticed something odd by now, but we'll talk about that in the Optional step further below.

<details>
    <summary><strong>Click here once for a hint</strong></summary>
    <div>Google "Append tfidf to pandas dataframe"</div>
</details>

In [None]:
# Step 7: Get a DataFrame of the vectorized ingredients

### Step 8: Perform PCA on the TF-IDF DataFrame
Now that you've obtained the vectorized ingredient DataFrame, it's time to perform principal component analysis to reduce the dimensions of the data, from 2,000+ columns to only 2 columns/components.

![PCAArrays](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/PCAArrays.png)

You should get values close to the ones above after PCA.

In addition, your components should account for 40.2% of the variance in the data when you sum the values in .explained_variance_ratio_ (note the trailing underscore).

In [None]:
# Step 8a: Declare a PCA object with n_components of 2

In [None]:
# Step 8b: Fit transform the DataFrame from Step 7

In [None]:
# Step 8c: Check the sum of the .explained_variance_ratio_
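
To illustrate Steps 8a-8c together, here is a small sketch on a random toy matrix. The numbers are not the project's data, so the explained variance will not be anywhere near 40.2%.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy matrix standing in for the TF-IDF DataFrame (20 rows, 50 columns)
rng = np.random.default_rng(0)
toy_tfidf = rng.random((20, 50))

# 8a: declare the PCA object, 8b: fit and transform in one call
pca = PCA(n_components=2)
components = pca.fit_transform(toy_tfidf)  # array with one row per cuisine

# 8c: fraction of the total variance captured by the two components together
explained = pca.explained_variance_ratio_.sum()
print(components.shape, explained)
```

`fit_transform` accepts a DataFrame as well as a NumPy array, so the TF-IDF DataFrame from Step 7 can be passed in directly.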

### Step 9: Turn the PCA array into a DataFrame
Now that you have the PCA array, we'll have to tidy it up into a proper DataFrame so we can plot and annotate the data nicely. 

More specifically, aim for a DataFrame that looks like this:

![PCADataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/PCADataFrame.png)

In [None]:
# Step 9: Create a DataFrame using the PCA array

### Step 10: Plot the two components
Time to plot the graph, and see if it matches Figure 5! 

Gentle reminder to plot these in a scatterplot:
- x-axis: PC-1
- y-axis: PC-2

It'll be hard to figure out which point is which, so don't forget to <strong>annotate</strong> the points with the names of the corresponding cuisine.

In [None]:
# Step 10: Plot a scatterplot of PC-2 against PC-1
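
A possible shape for the annotated scatterplot is sketched below with plain matplotlib. The DataFrame is a toy stand-in for the one from Step 9, and its values and the column name 'cuisine' are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the DataFrame from Step 9 (values are made up)
pca_df = pd.DataFrame({
    "cuisine": ["chuan", "yue", "other"],
    "PC-1": [0.10, -0.25, 0.05],
    "PC-2": [0.30, 0.02, -0.15],
})

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(pca_df["PC-1"], pca_df["PC-2"])

# Label every point with its cuisine name, nudged slightly off the marker
for name, x, y in zip(pca_df["cuisine"], pca_df["PC-1"], pca_df["PC-2"]):
    ax.annotate(name, (x, y), xytext=(4, 4), textcoords="offset points")

ax.set_xlabel("PC-1")
ax.set_ylabel("PC-2")
```

The small offset in `xytext` keeps the labels from sitting directly on top of the points.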

<details>
    <summary><strong>Did you replicate Figure 5? Click here once to see what we got</strong></summary>
    <br>
    <img src="https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/PCAScatterPlot.png">
    <br>
    <div>We managed to get a plot that is similar, though not identical, to Figure 5. That's ok because the general patterns are still observed.</div>
</details>

### Optional: Fix the vectorization
If you noticed, in Step 7, you got a DataFrame with 20 rows and 2,902 columns. Do you see what's wrong? 

There are 2,911 unique ingredients, which means you should get a DataFrame with 2,911 columns. 

9 columns are missing, and you will have to investigate. 

This part is optional because it does not affect the Section's conclusion, i.e. the replication is ok.

For the investigation of the columns, there are a few ways you can do it, but here's what we recommend:
1. Take the columns from the TF-IDF DataFrame
2. Turn it into a set (Set 1)
3. Take the list of unique ingredients in the entire dataset
4. Turn it into a set (Set 2)
5. Get the difference between the two sets

In [None]:
# Find out what are the missing 9 ingredients
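
The set comparison described above can be sketched as follows. The ids are made up and do not hint at the real answer; the one detail worth noting is that TF-IDF column labels are strings, so the ingredient ids need the same type before comparing.

```python
# Toy stand-ins for the TF-IDF column labels and the full list of ingredient ids
tfidf_columns = ["12", "305", "44"]
all_ingredient_ids = [12, 44, 305, 501, 678]

vectorized = set(tfidf_columns)                   # Set 1
expected = {str(i) for i in all_ingredient_ids}   # Set 2, converted to strings

# Ids that exist in the data but never became a TF-IDF column
missing = sorted(expected - vectorized, key=int)
print(missing)
```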

### Modify Step 3 to add an underscore to the single digit ingredients
That's right! The missing 9 ingredients are numbered 1-9. 

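This happens because TfidfVectorizer's default token_pattern only keeps tokens with two or more characters, so the single-digit ids 1-9 were dropped during vectorization. One way to avoid this is to add an underscore to those numbers, e.g., '1_', '2_', ... , '9_'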

The easiest place to perform this modification is Step 3, when you were taking the 'ingredient' column from the cuisine DataFrames.

In [None]:
# Perform the modified Step 3
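
Here is a rough sketch of the underscore fix on a toy 'ingredient' column with made-up ids. To show the effect without running the vectorizer, it applies the regular expression that TfidfVectorizer uses as its default `token_pattern` by hand.

```python
import re

import pandas as pd

# Toy 'ingredient' column (ids are made up)
ingredients = pd.Series([1, 23, 4, 305])

# Cast to string, then append '_' to single-character ids only
as_text = ingredients.astype(str)
as_text = as_text.where(as_text.str.len() > 1, as_text + "_")

# TfidfVectorizer's default token_pattern, applied by hand to the joined text
default_pattern = r"(?u)\b\w\w+\b"
tokens_before = re.findall(default_pattern, " ".join(ingredients.astype(str)))
tokens_after = re.findall(default_pattern, " ".join(as_text))
print(tokens_before)
print(tokens_after)
```

The underscore counts as a word character, so '1_' is two characters long and survives tokenization. Passing a custom `token_pattern` such as `r"(?u)\b\w+\b"` to TfidfVectorizer should also keep single-character tokens, although that is a different route from the one this step describes.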

### Repeat Steps 5-10 with modified data
Now that you're done adding an underscore to the single digits, this is what you'll see after Step 6:

![IngredientTextDataFrameModified](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/IngredientTextDataFrameModified.png)

Once you get this, proceed with the subsequent steps and you'll get a vectorized DataFrame with 20 rows and 2,911 columns. After that, plot the scatterplot of the PCA components again. 

Is there any difference?

In [None]:
# Repeat Steps 5-10

## Replicate Figure 6
Now that we have replicated Figure 5 successfully, it's time to analyze Figure 6. 

In Figure 6, the authors wanted to see the relationship between topological distance of cuisine nodes and physical distance. 

![Figure6](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure6.png)

The result is a boxplot that shows a correlation between the two. 

We will replicate Figure 6 using the data from the following files:
- ```Chinese-cuisine-master/code/data/real_result/geographic distance.txt```
- ```Chinese-cuisine-master/code/data/real_result/topological distance.txt```

You can find these files if you unzipped the entire folder from the GitHub repository.

### Step 11: Read "geographic distance.txt" into a DataFrame
Use .read_csv to read the geographic distance file (yes you can use that to read txt files).

![GeographicDistanceDataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/GeographicalDistanceDataFrame.png)

Figure out what's a good character to use for the <strong>sep</strong> parameter, and make sure you don't have any header in your DataFrame.

You'll have a 20 x 20 symmetrical DataFrame of the distances. 

In [None]:
# Step 11: Read 'geographic distance.txt' into a DataFrame
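
The sketch below reads a toy 3 x 3 matrix from an in-memory text buffer. The tab separator is only an assumption for the toy data, so check the real file before reusing it, and apart from 1309.9 the values are made up.

```python
import io

import pandas as pd

# Toy 3 x 3 distance matrix; tab separation is an assumption about the layout
toy_file = io.StringIO("0\t1309.9\t850.2\n1309.9\t0\t1422.7\n850.2\t1422.7\t0\n")

# header=None stops pandas from treating the first row as column names
geo_df = pd.read_csv(toy_file, sep="\t", header=None)
print(geo_df.shape)
```

A quick symmetry check such as `(geo_df.values == geo_df.values.T).all()` can confirm that the file was parsed as expected.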

### Step 12: Read "topological distance.txt" into a DataFrame
You know the drill - next read 'topological distance.txt' into a DataFrame.

![TopologicalDistanceDataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/TopologicalDistanceDataFrame.png)

You'll get a 20 x 20 DataFrame containing the topological distances, and this time the DataFrame is asymmetrical.

In [None]:
# Step 12: Read 'topological distance.txt' into a DataFrame

### Step 13: Combine the two DataFrames together
The value at position $(i, j)$ in one DataFrame corresponds to the value at the same position in the other.

For example, $geodist_{1,0}$ corresponds to $topodist_{1,0}$, with 1309.90 and 3 respectively.

![TopoGeoDataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/TopoGeoDataFrame.png)

We'll have to create a DataFrame that contains the topological distance of the cuisines and the corresponding geographical distance.

It will have:
- 190 rows
- 2 columns

In [None]:
# Step 13: Combine the two DataFrames together

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Declare an empty list (List_1)</li>
        <li>Declare another empty list (List_2)</li>
        <li>Use a for loop in the range with topo/geo DataFrame's length. In each loop (Loop_1):</li>
        <ol>
            <li>Use another for loop in the range with topo/geo DataFrame's length. In each loop (Loop_2)</li>
            <ol>
                <li>If Loop_1's current value (Value_1) is more than Loop_2's current value (Value_2):</li>
                <ul>
                    <li>Use iloc to find what the value is at position (Value_1, Value_2) for the topo DataFrame</li>
                    <li>Append that value into List_1</li>
                    <li>Do the same for position (Value_1, Value_2) for geo DataFrame</li>
                    <li>Append that value into List_2</li>
                </ul>
            </ol>
        </ol>
        <li>Use List_1 and List_2 to create a DataFrame</li>
    </ol>
</details>
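
As an alternative to the nested loop, the same pairs can be picked out with NumPy indexing. This is only a sketch on toy 4 x 4 matrices with made-up values, and the column names 'topo_dist' and 'geo_dist' are our own.

```python
import numpy as np
import pandas as pd

# Toy 4 x 4 matrices standing in for the 20 x 20 DataFrames
topo_df = pd.DataFrame([[0, 1, 2, 3], [3, 0, 1, 2], [2, 1, 0, 1], [1, 2, 3, 0]])
geo_df = pd.DataFrame([[0.0, 5.0, 6.0, 7.0], [5.0, 0.0, 8.0, 9.0],
                       [6.0, 8.0, 0.0, 4.0], [7.0, 9.0, 4.0, 0.0]])

# Positions strictly below the diagonal, i.e. every (i, j) with i > j
rows, cols = np.tril_indices(len(topo_df), k=-1)

combined_df = pd.DataFrame({
    "topo_dist": topo_df.to_numpy()[rows, cols],
    "geo_dist": geo_df.to_numpy()[rows, cols],
})
print(combined_df.shape)
```

A 4 x 4 matrix has 4 x 3 / 2 = 6 such positions. By the same count, a 20 x 20 matrix has 20 x 19 / 2 = 190, which matches the expected number of rows.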

### Step 14: Remove rows that contain 0
Before we plot the boxplot, let's remove any rows that contain 0. These rows exist because they fall under the "other" cuisine. 

You'll end up with a DataFrame with:
- 171 rows
- 2 columns

In [None]:
# Step 14: Remove rows containing 0
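
One way to drop such rows is a boolean mask, sketched here on a toy DataFrame with assumed column names.

```python
import pandas as pd

# Toy stand-in for the combined DataFrame from Step 13
combined_df = pd.DataFrame({
    "topo_dist": [3, 0, 1, 2],
    "geo_dist": [1309.9, 0.0, 850.2, 0.0],
})

# Keep a row only if none of its values is 0
no_zero_mask = (combined_df != 0).all(axis=1)
filtered_df = combined_df[no_zero_mask].reset_index(drop=True)
print(filtered_df)
```

`all(axis=1)` checks each row across both columns, so a row is removed if either distance is 0.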

### Step 15: Plot a boxplot
The moment of truth.

Plot your boxplot with topological distance on the x-axis and physical distance on the y-axis.

In [None]:
# Step 15: Plot a boxplot

### Step 16: Modify physical distance
The boxplot looks like Figure 6, but the values are slightly off.

To match Figure 6 completely, you'll have to multiply the values in the list with physical distance by 2.

In [None]:
# Step 16: Multiply list with physical distance by 2

### Step 17: Plot the boxplot again
Now that you've multiplied the values of the physical distance by 2, it's time to replot the boxplot and see if it matches Figure 6.

In [None]:
# Step 17: Plot boxplot

<details>
    <summary><strong>What do you think? Click once to see what we think</strong></summary>
    <div>While we were able to replicate the Figure completely, it's also useful to investigate if it is "correct" to multiply the distances like that.</div>
    <br>
    <div>We won't give anything away, but you can try to investigate which of the distances is correct, i.e. pre-multiplication or post-multiplication.</div>
</details>

## Replicate Figure 7
Next up, the authors plotted the number of spices per recipe vs mean annual temperature of the locations of the cuisines.

![Figure7](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure7.png)

The data that we need are located in four places:
- ```Chinese-cuisine-master/code/data/real_result/climate.txt```
- The .txt files in ```Chinese-cuisine-master/code/data/network/meat-based recipe/```
- The list of spice IDs in ```Chinese-cuisine-master/code/cal_spices_temperature.m```
- The CSV files in ```cuisine_recipe_ingredient_CSV```

### Step 18: Read climate.txt into a DataFrame
Firstly, read climate.txt into a DataFrame. You might have to figure out what additional parameters to use if you're stuck.

If you do it successfully, you'll have a DataFrame that has:
- 20 rows
- 2 columns

In [None]:
# Step 18: Read climate.txt into a DataFrame

<details>
    <summary><strong>Click here once for a hint</strong></summary>
    <div>Google "How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?"</div>
</details>
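
The idea behind the hint can be sketched as follows. The toy text, its values and the absence of a header row are all assumptions, since the real layout of climate.txt is not shown here.

```python
import io

import pandas as pd

# Toy file with uneven runs of spaces and tabs between two columns
toy_file = io.StringIO("15.8   1030\n22.1\t 1720\n0  0\n")

# The regex r"\s+" treats any run of whitespace as a single separator
climate_df = pd.read_csv(toy_file, sep=r"\s+", header=None)
print(climate_df.shape)
```

Column names can be supplied afterwards, or at read time through the `names` parameter of `read_csv`.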

### Step 19: Get lists from cal_spices_temperature 
We'll also need to get two lists from ```cal_spices_temperature.m```:
- spice_id
- caixi_name

spice_id contains the ingredient ids of spices, which we will need later. This list will have 213 items.

caixi_name contains the names of the cuisines. This list will have 20 items.

In [None]:
# Step 19: Get lists from cal_spices_temperature

### Step 20: Add caixi_name into Step 18 DataFrame and sort
Add caixi_name into the DataFrame from Step 18 as a column called "cuisine".

After that, sort the DataFrame by the "cuisine" column in alphabetical order. 

![TemperatureDataFrame](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/TemperatureDataFrame.png)

In [None]:
# Step 20a: Add a column named "cuisine" using caixi_name

In [None]:
# Step 20b: Sort the DataFrame by 'cuisine'
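
A small sketch of both sub-steps on toy data. The column name 'temperature', the values and the name 'yue' are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the Step 18 DataFrame and the caixi_name list
climate_df = pd.DataFrame({"temperature": [17.2, 22.1, 0.0]})
caixi_name = ["yue", "chuan", "other"]

# 20a: the list must be in the same order as the rows it labels
climate_df["cuisine"] = caixi_name

# 20b: sort alphabetically and renumber the rows from 0
climate_df = climate_df.sort_values("cuisine").reset_index(drop=True)
print(climate_df)
```

Sorting presumably matters because later steps build lists from alphabetically ordered filenames, so the rows need to line up with them.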

### Step 21: Get lists of filenames
Using glob, get a list of filenames for two folders:
- ```cuisine_recipe_ingredient_CSV```
- ```Chinese-cuisine-master/code/data/network/meat-based recipe```

In [None]:
# Step 21: Get lists of filenames for both folders

![Figure7Strategy](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure7Strategy.png)

Why have we done all of this so far? According to the publication, the analysis works as follows:
1. Go through each cuisine
2. In each cuisine, pick only the recipe_ids that contain meat
3. In each recipe_id that contains meat, count how many spices appear
4. Get the average number of spices per cuisine

### Step 22: Read the first meat-based recipe txt
We'll take a look at the first meat-based recipe txt. glob does not guarantee any particular order, so sort the list of filenames first. Once it is sorted alphabetically, the first txt file is for the chuan cuisine.

Read the .txt file into a DataFrame. You'll have a DataFrame that has:
- 890 rows
- 1 column

In [None]:
# Step 22: Read the first meat-based recipe txt

### Step 23: Filter the first DataFrame
Read the first CSV in the cuisine list - chuan.csv - into a DataFrame. 

Filter the DataFrame using the list of recipe ids found in Step 22.

The chuan DataFrame starts with 12,068 rows but ends up with 9,753 rows after filtering for recipe_id that contains meat.

In [None]:
# Step 23: Filter the chuan DataFrame 

<details>
    <summary><strong>Click here once for a hint</strong></summary>
    <div>Google "Filter dataframe rows if value in column is in a set list of values"</div>
</details>
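
The filter the hint points to can be sketched with `isin`. The DataFrame and the list of ids below are toy data, not the real chuan files.

```python
import pandas as pd

# Toy cuisine DataFrame: one row per (recipe, ingredient) pair
cuisine_df = pd.DataFrame({
    "recipe_id": [101, 101, 102, 103, 103],
    "ingredient": [12, 305, 44, 12, 78],
})

# Toy list of recipe ids that contain meat
meat_recipe_ids = [101, 103]

# Keep only the rows whose recipe_id appears in the list
meat_df = cuisine_df[cuisine_df["recipe_id"].isin(meat_recipe_ids)]
print(len(meat_df))
```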

### Step 24: Get a list of filtered meat DataFrames
Now that you've got the idea, let's loop through all of the CSVs, create DataFrames, and filter them before appending them into a list.

You'll have a list of 20 filtered DataFrames.

In [None]:
# Step 24: Get a list of filtered DataFrames

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Declare an empty list (List_1)</li>
        <li>Use a for loop in the range with cuisine list length. In each loop (Loop_1):</li>
        <ul>
            <li>Read the cuisine csv into a DataFrame (DF_1)</li>
            <li>Read the meat-recipe into a DataFrame (DF_2)</li>
            <li>Get the list of recipe id with meat from DF_2</li>
            <li>Filter DF_1 with the list (DF_3)</li>
            <li>Append DF_3 into List_1</li>
        </ul>
    </ol>
</details>

### Step 25: Get average number of spices per recipe in one filtered cuisine
Let's check out the average number of spices per recipe in the first filtered cuisine - chuan cuisine with meat only.

There are a few ways to do it, and the easiest approach you can consider is:
- get a list of unique recipe_id
- loop through the unique ids
- filter the DataFrame to contain the current id only
- from the filtered DataFrame, keep count of how many spices are found among the ingredients
- get the average value of the spice count

Using the first item in the list from Step 24, i.e. chuan-meat, you will have:
- 890 unique recipe_ids of dishes with meat
- 3.275 spices used per recipe_id

In [None]:
# Step 25: Get the average number of spices per recipe in one cuisine

<details>
    <summary><strong>Click here once for pseudocode</strong></summary>
    <ol>
        <li>Declare a variable to store the first item in the list of DataFrames from Step 24</li>
        <li>Get the 'recipe_id' column, and get a list of unique recipe_id with .unique() (List_1)</li>
        <li>Declare an empty list (List_2) to store the number of spice per recipe_id</li>
        <li>Use a for loop to loop through each item in List_1. In each loop:</li>
        <ul>
            <li>Declare a variable containing an initial value of zero (Var_1)</li>
            <li>Filter the DataFrame to contain only the current recipe_id value and assign it to a variable (Var_2)</li>
            <li>Get the list of ingredients from Var_2's 'ingredient' column (List_3)</li>
            <li>Use a for loop to loop through List_3. In each loop:</li>
            <ul>
                <li>Check if the current ingredient is in the list of spices from Step 19. If it is:</li>
                <ul>
                    <li>Increment Var_1 by 1</li>
                </ul>
            </ul>
            <li>Append Var_1 into List_2</li>
        </ul>
        <li>Take the sum of List_2 and divide it by the length of List_2 to get your average number of spices used per recipe_id in a cuisine</li>
    </ol>
</details>
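
The loop in the pseudocode can also be written as a groupby. The sketch below is that alternative on toy data with made-up ids, not the approach the pseudocode spells out.

```python
import pandas as pd

# Toy filtered DataFrame: recipe 101 has 2 spices, 103 has 1, 104 has none
meat_df = pd.DataFrame({
    "recipe_id": [101, 101, 101, 103, 103, 104],
    "ingredient": [12, 305, 44, 12, 78, 990],
})
spice_id = [12, 44]

# True for every row whose ingredient is a spice
is_spice = meat_df["ingredient"].isin(spice_id)

# Count the True values per recipe, then average over all recipes
spices_per_recipe = is_spice.groupby(meat_df["recipe_id"]).sum()
average_spices = spices_per_recipe.mean()
print(average_spices)
```

Recipes without any spice still count as 0 in the average, which matches the loop version. If the ingredient ids were turned into strings earlier, the spice ids need the same type for `isin` to match.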

### Step 26: Get average number of spices used per recipe for all cuisines
Now that you've successfully completed the calculation for one cuisine, it's time to do it for the rest of the cuisines.

![AverageSpicePerCuisineList](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/AverageSpicePerCuisineList.png)

You'll end up with a list containing 20 averages. 

In [None]:
# Step 26: Get average number of spices used per recipe for cuisines

### Step 27: Add the spice average to temperature DataFrame
Create a new column in the DataFrame that you got from Step 20 called "average_spice_num" using the list you got from Step 26.

In [None]:
# Step 27: Update your Step 20 DataFrame

### Step 28: Remove the row with 'other' data
Before we plot the temperature and average spice number, remove the row with "other". 

That is because the temperature for 'other' is 0, which will mess your scatterplot up later.

In [None]:
# Step 28: Remove 'other' data

### Step 29: Plot average_spice_num against temperature
The last Step before we call it a day for this Part - plotting average_spice_num against temperature in a scatterplot.

In [None]:
# Step 29: Plot average_spice_num against temperature

<details>
    <summary><strong>What'd you get? Click here to see what we got</strong></summary>
    <img src = 'https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectCuisineNetwork/Figure7vsReproduction.png'>
    <br>
    <div>The reproduction is slightly different from Figure 7, but the pattern remains the same, i.e. there is no correlation between the number of spices in the meat dishes in cuisines and the temperature of the different regions.</div>
</details>

### End of Part III
What a Part! If you have successfully (sort of) replicated all of the figures in the paper, well done! If you haven't yet, no worries: come back to the different Sections when you can.

To recap, you have successfully:
1. Prepared a TF-IDF vector of ingredient usage across recipes
2. Performed Principal Component Analysis (PCA) for dimensionality reduction on the dataset
3. Combined data from different sources together for visualization

In the next Part, we will work on replicating the most exciting (and hardest) part of the paper - a simulation that models cuisine evolution in a geographical region.

Put on your best Python hat and head on to Part IV.