# Downloading Data

## Getting Text of Dataset

In [1]:
with open("DS163 PROJECT DATASET.txt", mode="r", encoding="ISO-8859-1") as data_file:
    data_text = data_file.read()

In [2]:
data_text

' \n\n\n1  \n all purpose flour  \n 3.0 cups all purpose flour  \n AR_1  \n 0.920724583  \n 3  \n cup\n2  \n all purpose flour  \n 2.8000000000000003 cups all purpose flour  \n AR_10  \n 0.905162048  \n 2.8  \n cup\n3  \n all purpose flour  \n 1.1076923076923078 cups all purpose flour  \n AR_101  \n 0.6  \n 1.107692308  \n cup\n4  \n all purpose flour  \n 3.333333333333333 cups sifted all purpose flour  \n AR_102  \n 0.9375  \n 3.333333333  \n cup\n5  \n all purpose flour  \n 2.0 cups all purpose flour  \n AR_103  \n 0.88125  \n 2  \n cup\n6  \n all purpose flour  \n 9.0 cups unbleached all purpose flour  \n AR_107  \n 0.927272701  \n 9  \n cup\n7  \n all purpose flour  \n 0.8571428571428571 cups all purpose flour  \n AR_108  \n 0.867741966  \n 0.857142857  \n cup\n8  \n all purpose flour  \n 2.6666666666666665 cups all purpose flour  \n AR_109  \n 0.871794891  \n 2.666666667  \n cup\n9  \n all purpose flour  \n 3.333333333333333 cups all purpose flour  \n AR_11  \n 0.902891922  \n 3.3

## Cleaning Lines of Data

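Some line breaks in the file are stored as the literal two-character sequence `\n` (a backslash followed by `n`) rather than as an actual newline. So, the researchers replaced each literal `\n` with a real newline character.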

In [3]:
import re

data_text = re.sub(r"\\n", '\n', data_text)
data_text

' \n\n\n1  \n all purpose flour  \n 3.0 cups all purpose flour  \n AR_1  \n 0.920724583  \n 3  \n cup\n2  \n all purpose flour  \n 2.8000000000000003 cups all purpose flour  \n AR_10  \n 0.905162048  \n 2.8  \n cup\n3  \n all purpose flour  \n 1.1076923076923078 cups all purpose flour  \n AR_101  \n 0.6  \n 1.107692308  \n cup\n4  \n all purpose flour  \n 3.333333333333333 cups sifted all purpose flour  \n AR_102  \n 0.9375  \n 3.333333333  \n cup\n5  \n all purpose flour  \n 2.0 cups all purpose flour  \n AR_103  \n 0.88125  \n 2  \n cup\n6  \n all purpose flour  \n 9.0 cups unbleached all purpose flour  \n AR_107  \n 0.927272701  \n 9  \n cup\n7  \n all purpose flour  \n 0.8571428571428571 cups all purpose flour  \n AR_108  \n 0.867741966  \n 0.857142857  \n cup\n8  \n all purpose flour  \n 2.6666666666666665 cups all purpose flour  \n AR_109  \n 0.871794891  \n 2.666666667  \n cup\n9  \n all purpose flour  \n 3.333333333333333 cups all purpose flour  \n AR_11  \n 0.902891922  \n 3.3
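The difference between a literal backslash followed by `n` and a real newline is easy to miss in a printed string. The following is a minimal sketch on a made-up string; the `sample` value is an assumption for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample: the first separator is a literal backslash + 'n'
# (two characters), the second one is a real newline character.
sample = "1  \\n all purpose flour  \n 3.0 cups all purpose flour"

# The raw pattern r"\\n" matches a backslash followed by the letter n;
# the replacement "\n" is a real newline character.
fixed = re.sub(r"\\n", "\n", sample)

print(sample.count("\n"))  # number of real newlines before the substitution
print(fixed.count("\n"))   # number of real newlines after the substitution
```

After the substitution, splitting on a newline yields one element per field.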

Then, the researchers split the data by `\n`.

In [4]:
data_lines = data_text.split('\n')

Once they fixed the separation of lines, they cleaned the lines themselves: they dropped blank lines, collapsed repeated whitespace, and removed trailing spaces.

In [5]:
# Drop empty and whitespace-only lines
data_lines = [ line for line in data_lines if line.strip() ]
# Collapse each run of whitespace into a single space
data_lines = [ re.sub(r"\s+", ' ', line) for line in data_lines ]
# Remove the trailing space, if any; the leading space is kept
data_lines = [ line[:-1] if line[-1] == ' ' else line for line in data_lines ]
data_lines

['1',
 ' all purpose flour',
 ' 3.0 cups all purpose flour',
 ' AR_1',
 ' 0.920724583',
 ' 3',
 ' cup',
 '2',
 ' all purpose flour',
 ' 2.8000000000000003 cups all purpose flour',
 ' AR_10',
 ' 0.905162048',
 ' 2.8',
 ' cup',
 '3',
 ' all purpose flour',
 ' 1.1076923076923078 cups all purpose flour',
 ' AR_101',
 ' 0.6',
 ' 1.107692308',
 ' cup',
 '4',
 ' all purpose flour',
 ' 3.333333333333333 cups sifted all purpose flour',
 ' AR_102',
 ' 0.9375',
 ' 3.333333333',
 ' cup',
 '5',
 ' all purpose flour',
 ' 2.0 cups all purpose flour',
 ' AR_103',
 ' 0.88125',
 ' 2',
 ' cup',
 '6',
 ' all purpose flour',
 ' 9.0 cups unbleached all purpose flour',
 ' AR_107',
 ' 0.927272701',
 ' 9',
 ' cup',
 '7',
 ' all purpose flour',
 ' 0.8571428571428571 cups all purpose flour',
 ' AR_108',
 ' 0.867741966',
 ' 0.857142857',
 ' cup',
 '8',
 ' all purpose flour',
 ' 2.6666666666666665 cups all purpose flour',
 ' AR_109',
 ' 0.871794891',
 ' 2.666666667',
 ' cup',
 '9',
 ' all purpose flour',
 ' 3.3333
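As a small illustration of the cleaning steps, the sketch below applies the same three ideas (drop blank lines, collapse whitespace, trim the trailing space) to a few made-up lines. The `raw_lines` values are assumptions chosen to resemble the split text, and `rstrip` is used here as a compact alternative to slicing off the last character.

```python
import re

# Hypothetical raw lines, similar in shape to the split dataset text
raw_lines = [" ", "", "1  ", " all   purpose flour  ", " cup"]

# Drop empty and whitespace-only lines
lines = [line for line in raw_lines if line.strip()]
# Collapse every run of whitespace into a single space
lines = [re.sub(r"\s+", " ", line) for line in lines]
# Remove trailing spaces but keep the leading one
lines = [line.rstrip(" ") for line in lines]

print(lines)
```

The leading space is left in place because the later grouping step uses it to tell field lines apart from entry numbers.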

In [6]:
len(data_lines)

14079

After cleaning the lines, they realized that not every ingredient entry occupied exactly 7 lines, because some text fields span multiple lines.

In [7]:
list_list_ingredients = []

for line in data_lines:
    if line.startswith(' '):
        # A line with a leading space is a field of the current entry
        list_list_ingredients[-1].append(line)
    else:
        # A line without a leading space is an entry number and starts a new entry
        list_list_ingredients.append([line])

In [8]:
ingredient_lengths = [ len(list_ingredients) for list_ingredients in list_list_ingredients ]

for i in range(len(ingredient_lengths)):
    if ingredient_lengths[i] != 7:
        print("Not all ingredients occupy 7 lines.")
        print(list_list_ingredients[i])
        break

Not all ingredients occupy 7 lines.
['105', ' all purpose flour', ' "18.75 ounces all purpose flour', ' such as gold medal"', ' Misc_110', ' NA', ' 2.34375', ' cup']
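The grouping rule used above (a line without a leading space starts a new entry, every other line belongs to the current entry) can be sketched on a handful of made-up lines. The `lines` list is an assumption for illustration, and the extra `and records` guard is a small defensive addition that is not in the original code.

```python
# Hypothetical cleaned lines: entry numbers have no leading space,
# every other field starts with one.
lines = [
    "1", " flour", " 3.0 cups flour",
    "2", " flour", ' "18.75 ounces flour', ' such as gold medal"',
]

records = []
for line in lines:
    if line.startswith(" ") and records:
        # Field line: attach it to the most recent entry
        records[-1].append(line)
    else:
        # Entry number: start a new entry
        records.append([line])

# Entries with differing lengths reveal fields that were split across lines
print([len(record) for record in records])
```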


The researchers noticed that when a text field is split across several lines, its first line starts with a quotation mark and its last line ends with one.

They used this pattern to merge the split lines back together while iterating through every list of lines for the ingredients.

In [9]:
for i in range(len(ingredient_lengths)-1, -1, -1):
    if ingredient_lengths[i] != 7:
        num_of_lines = len(list_list_ingredients[i])
        for j in range(num_of_lines-1, 2, -1):
            if list_list_ingredients[i][j][-1] == '"' and list_list_ingredients[i][j][0] != '"':
                list_list_ingredients[i][j-1] += list_list_ingredients[i][j]
                del list_list_ingredients[i][j]

In [10]:
ingredient_lengths = [ len(list_ingredients) for list_ingredients in list_list_ingredients ]

for i in range(len(ingredient_lengths)):
    if ingredient_lengths[i] != 7:
        print("Not all ingredients occupy 7 lines.")
        print(list_list_ingredients[i])
        break

In [11]:
list_list_ingredients[104]

['105',
 ' all purpose flour',
 ' "18.75 ounces all purpose flour such as gold medal"',
 ' Misc_110',
 ' NA',
 ' 2.34375',
 ' cup']
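The merge can be traced on the single entry shown above. This is a minimal sketch rather than the notebook's exact code: it works on one `record` list, and it strips the leading space before checking for an opening quotation mark, which is an assumed variation.

```python
# Entry 105 as it looked before merging (copied from the output above)
record = [
    "105", " all purpose flour", ' "18.75 ounces all purpose flour',
    ' such as gold medal"', " Misc_110", " NA", " 2.34375", " cup",
]

# Walk backwards so that deleting an element does not shift
# the indices that are still to be visited.
for j in range(len(record) - 1, 2, -1):
    closes_quote = record[j].endswith('"')
    opens_quote = record[j].lstrip().startswith('"')
    if closes_quote and not opens_quote:
        # Continuation line: append it to the previous line and remove it
        record[j - 1] += record[j]
        del record[j]

print(len(record))
print(record[2])
```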

The researchers also had to remove the quotation marks surrounding the merged text.

In [12]:
for i in range(len(list_list_ingredients)):
    if list_list_ingredients[i][2].startswith(' "'):
        line = list_list_ingredients[i][2]
        list_list_ingredients[i][2] = re.findall(r'"(.+?)"', line)[0]

In [13]:
list_list_ingredients[104]

['105',
 ' all purpose flour',
 '18.75 ounces all purpose flour such as gold medal',
 ' Misc_110',
 ' NA',
 ' 2.34375',
 ' cup']
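The quote removal relies on a non-greedy capture group between two quotation marks. A minimal sketch on the line shown above follows; the fallback to `strip()` when no quoted text is found is an assumption added for safety and is not part of the original cell.

```python
import re

line = ' "18.75 ounces all purpose flour such as gold medal"'

# "(.+?)" captures the shortest run of characters between two quotation marks
match = re.search(r'"(.+?)"', line)
text = match.group(1) if match else line.strip()

print(text)
```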

## Putting Lines into List of Dictionaries

In [14]:
import numpy as np

ingredient_dicts = []

for ingredient in list_list_ingredients:
    # A rating such as 'NA' cannot be converted to a float, so it is stored as NaN
    try:
        rating = float(ingredient[4])
    except ValueError:
        rating = np.nan

    ingredient_dicts.append({
        "entry_no": int(ingredient[0]),
        "ingredient": ingredient[1].strip(),
        "text": ingredient[2].strip(),
        "recipe_index": ingredient[3].strip(),
        "rating": rating,
        "quantity": float(ingredient[5].strip()),
        "unit": ingredient[6].strip()
    })

In [15]:
ingredient_dicts[200:]

[{'entry_no': 201,
  'ingredient': 'baking powder',
  'text': '0.7384615384615385 teaspoons baking powder',
  'recipe_index': 'AR_101',
  'rating': 0.6,
  'quantity': 0.738461538,
  'unit': 'teaspoon'},
 {'entry_no': 202,
  'ingredient': 'baking powder',
  'text': '1.0 teaspoon baking powder',
  'recipe_index': 'AR_119',
  'rating': 0.845714283,
  'quantity': 1.0,
  'unit': 'teaspoon'},
 {'entry_no': 203,
  'ingredient': 'baking powder',
  'text': '0.6666666666666666 teaspoon baking powder',
  'recipe_index': 'AR_14',
  'rating': 0.90666666,
  'quantity': 0.666666667,
  'unit': 'teaspoon'},
 {'entry_no': 204,
  'ingredient': 'baking powder',
  'text': '1.1428571428571428 teaspoons baking powder',
  'recipe_index': 'AR_16',
  'rating': 0.85882349,
  'quantity': 1.142857143,
  'unit': 'teaspoon'},
 {'entry_no': 205,
  'ingredient': 'baking powder',
  'text': '1.3333333333333333 teaspoon baking powder',
  'recipe_index': 'AR_169',
  'rating': 0.873728848,
  'quantity': 1.333333333,
  'uni
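The handling of missing ratings can be isolated in a small helper. The function name `to_float_or_nan` is hypothetical, and `math.nan` from the standard library is used here in place of `np.nan` to keep the sketch free of dependencies.

```python
import math

def to_float_or_nan(value):
    """Hypothetical helper: parse a float, returning NaN for text such as 'NA'."""
    try:
        return float(value)
    except ValueError:
        return math.nan

print(to_float_or_nan(" 0.920724583"))  # float() ignores surrounding whitespace
print(to_float_or_nan(" NA"))           # 'NA' is not a valid float literal
```

Catching only `ValueError` keeps unrelated problems, such as an entry with too few fields, visible instead of silently turning them into missing ratings.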

## Creating a Pandas DataFrame for Ingredients

In [16]:
import pandas as pd

In [17]:
df_ingredients = pd.DataFrame(ingredient_dicts)
df_ingredients.head()

| | entry_no | ingredient | text | recipe_index | rating | quantity | unit |
|---|---|---|---|---|---|---|---|
| 0 | 1 | all purpose flour | 3.0 cups all purpose flour | AR_1 | 0.920725 | 3.0 | cup |
| 1 | 2 | all purpose flour | 2.8000000000000003 cups all purpose flour | AR_10 | 0.905162 | 2.8 | cup |
| 2 | 3 | all purpose flour | 1.1076923076923078 cups all purpose flour | AR_101 | 0.6 | 1.107692 | cup |
| 3 | 4 | all purpose flour | 3.333333333333333 cups sifted all purpose flour | AR_102 | 0.9375 | 3.333333 | cup |
| 4 | 5 | all purpose flour | 2.0 cups all purpose flour | AR_103 | 0.88125 | 2.0 | cup |


The researchers sorted the ingredients by recipe index using the `sort_values` method, so that the rows of each recipe sit next to each other.

In [18]:
df_ingredients = df_ingredients.sort_values(by = "recipe_index")
df_ingredients.head()

| | entry_no | ingredient | text | recipe_index | rating | quantity | unit |
|---|---|---|---|---|---|---|---|
| 0 | 1 | all purpose flour | 3.0 cups all purpose flour | AR_1 | 0.920725 | 3.0 | cup |
| 1783 | 821 | semisweet chocolate chip | 2.0 cups semisweet chocolate chips | AR_1 | 0.920725 | 2.0 | cup |
| 250 | 251 | baking soda | 1.0 teaspoon baking soda | AR_1 | 0.920725 | 1.0 | teaspoon |
| 1576 | 447 | light brown sugar | 1.0 cup packed brown sugar | AR_1 | 0.920725 | 1.0 | cup |
| 1565 | 1980 | water | 2.0 teaspoons hot water | AR_1 | 0.920725 | 0.041 | cup |
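If sorting on both rating and recipe index is wanted, `sort_values` accepts a list of column names. The sketch below uses a small made-up DataFrame; the rows and the choice of descending order for the rating are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
df = pd.DataFrame({
    "recipe_index": ["AR_10", "AR_1", "AR_101", "AR_1"],
    "rating": [0.905, 0.921, 0.6, 0.921],
    "text": ["2.8 cups flour", "3.0 cups flour",
             "1.1 cups flour", "1.0 teaspoon baking soda"],
})

# Sort on rating first (highest first), then on recipe index within equal ratings
df_sorted = df.sort_values(by=["rating", "recipe_index"], ascending=[False, True])

print(df_sorted["recipe_index"].tolist())
```

Note that `recipe_index` holds strings, so it is ordered lexicographically: `AR_10` sorts before `AR_2`.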


# Writing File

In [19]:
recipe_indices = list(df_ingredients["recipe_index"].unique())

out_file_text = ""

for i_recipe in recipe_indices:
    new_recipe = "\n" + i_recipe
    recipe = df_ingredients[df_ingredients["recipe_index"] == i_recipe]

    for ingredient_name in recipe["text"].unique():
        new_recipe += "\n " + ingredient_name

    new_recipe += "\n " + str(recipe.iloc[0]["rating"])

    out_file_text += new_recipe

# Skip the leading newline that precedes the first recipe
with open("recipe.txt", mode='w', encoding="iso-8859-1") as out_file:
    out_file.write(out_file_text[1:])
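The output layout (recipe index on its own line, then each ingredient text and finally the rating, each preceded by one space) can also be built with `str.join`, which avoids trimming a leading newline afterwards. This is a sketch under stated assumptions: the `recipes` dictionary and the file name `recipe_sample.txt` are hypothetical, and the file is written to a temporary directory.

```python
import os
import tempfile

# Hypothetical recipes: recipe index -> (ingredient texts, rating)
recipes = {
    "AR_1": (["3.0 cups all purpose flour", "1.0 teaspoon baking soda"], 0.920725),
    "AR_10": (["2.8 cups all purpose flour"], 0.905162),
}

blocks = []
for recipe_index, (texts, rating) in recipes.items():
    # Index line first, then indented ingredient lines, then the indented rating
    lines = [recipe_index] + [" " + text for text in texts] + [" " + str(rating)]
    blocks.append("\n".join(lines))

out_text = "\n".join(blocks)

path = os.path.join(tempfile.mkdtemp(), "recipe_sample.txt")
with open(path, mode="w", encoding="iso-8859-1") as out_file:
    out_file.write(out_text)

# Read the file back to confirm the round trip
with open(path, encoding="iso-8859-1") as in_file:
    round_trip = in_file.read()

print(round_trip == out_text)
```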