<a href="https://colab.research.google.com/github/micheknows/Coupons_Python/blob/main/Coupons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Retrieve data from the API 


In [45]:
# Get the data from Spoonacular

import pandas as pd
import requests


url = "https://api.spoonacular.com/recipes/random?number=100&limitLicense=true&apiKey=5d979222ba32497c9b0eb6f964db59e9"


# Check if the dataframe already exists
try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    df = pd.DataFrame()

# Retrieve new data
response = requests.get(url)
new_data = pd.read_json(response.content)['recipes'][0]

# Convert the new data into a DataFrame
new_data_df = pd.json_normalize(new_data)

# Concatenate new data with existing data (if any)
df = pd.concat([df, new_data_df], axis=0)



In [47]:
df.shape

(201, 37)

In [50]:
# view the head of id
column_name = 'title'
print(df[[column_name]])

                    title
0                     NaN
1                     NaN
2                     NaN
3                     NaN
4                     NaN
..                    ...
196                   NaN
197                   NaN
198                   NaN
199                   NaN
0    Pumpkin French Toast

[201 rows x 1 columns]


In [40]:
# Check for the data type of each column
print(df.dtypes)



recipes                      object
vegetarian                   object
vegan                        object
glutenFree                   object
dairyFree                    object
veryHealthy                  object
cheap                        object
veryPopular                  object
sustainable                  object
lowFodmap                    object
weightWatcherSmartPoints    float64
gaps                         object
preparationMinutes          float64
cookingMinutes              float64
aggregateLikes              float64
healthScore                 float64
creditsText                  object
license                      object
sourceName                   object
pricePerServing             float64
extendedIngredients          object
id                          float64
title                        object
readyInMinutes              float64
servings                    float64
sourceUrl                    object
image                        object
imageType                   

Look at the first few rows of data to get a general idea.


In [41]:
# Get a general overview of teh data retrieved

df.head



<bound method NDFrame.head of                                                recipes vegetarian  vegan  \
0    {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
1    {'vegetarian': True, 'vegan': True, 'glutenFre...        NaN    NaN   
2    {'vegetarian': True, 'vegan': False, 'glutenFr...        NaN    NaN   
3    {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
4    {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
..                                                 ...        ...    ...   
196  {'vegetarian': True, 'vegan': True, 'glutenFre...        NaN    NaN   
197  {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
198  {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
199  {'vegetarian': False, 'vegan': False, 'glutenF...        NaN    NaN   
0                                                  NaN      False  False   

    glutenFree dairyFree veryHealthy  cheap veryPopular s

In [43]:
print("Size before:  " + str(df.shape))

# remove duplicates from the DataFrame based on the id column
df.drop_duplicates(subset=['id'], inplace=True)


print("Size after drop duplicates:  " + str(df.shape))


Size before:  (201, 37)
Size after drop duplicates:  (2, 37)


Get the column names to see what is in the data.

In [8]:
# Get the column names
df.columns

Index(['recipes', 'vegetarian', 'vegan', 'glutenFree', 'dairyFree',
       'veryHealthy', 'cheap', 'veryPopular', 'sustainable', 'lowFodmap',
       'weightWatcherSmartPoints', 'gaps', 'preparationMinutes',
       'cookingMinutes', 'aggregateLikes', 'healthScore', 'creditsText',
       'license', 'sourceName', 'pricePerServing', 'extendedIngredients', 'id',
       'title', 'readyInMinutes', 'servings', 'sourceUrl', 'image',
       'imageType', 'summary', 'cuisines', 'dishTypes', 'diets', 'occasions',
       'instructions', 'analyzedInstructions', 'originalId',
       'spoonacularSourceUrl'],
      dtype='object')

Take a closer look at what some of these columns are:


In [15]:
# view the head of gaps
column_name = 'gaps'
print(df[[column_name]].head())

  gaps
0  NaN
1  NaN
2  NaN
3  NaN
4  NaN


In [18]:
# view the head of extendedIngredients
column_name = 'extendedIngredients'
print(df[[column_name]].head())

  extendedIngredients
0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN


In [19]:
# view the head of analyzedInstructions
column_name = 'analyzedInstructions'
print(df[[column_name]].head())

  analyzedInstructions
0                  NaN
1                  NaN
2                  NaN
3                  NaN
4                  NaN


We won't need the following columns for this project:  pricePerServing

In [11]:
# Check for missing values
print(df.isnull().sum())



recipes                       1
vegetarian                  200
vegan                       200
glutenFree                  200
dairyFree                   200
veryHealthy                 200
cheap                       200
veryPopular                 200
sustainable                 200
lowFodmap                   200
weightWatcherSmartPoints    200
gaps                        200
preparationMinutes          200
cookingMinutes              200
aggregateLikes              200
healthScore                 200
creditsText                 200
license                     200
sourceName                  200
pricePerServing             200
extendedIngredients         200
id                          200
title                       200
readyInMinutes              200
servings                    200
sourceUrl                   200
image                       200
imageType                   200
summary                     200
cuisines                    200
dishTypes                   200
diets   

In [12]:
# Check for outliers and data distribution in each column
print(df.describe())

       weightWatcherSmartPoints  preparationMinutes  cookingMinutes  \
count                       1.0                 1.0             1.0   
mean                        8.0                -1.0            -1.0   
std                         NaN                 NaN             NaN   
min                         8.0                -1.0            -1.0   
25%                         8.0                -1.0            -1.0   
50%                         8.0                -1.0            -1.0   
75%                         8.0                -1.0            -1.0   
max                         8.0                -1.0            -1.0   

       aggregateLikes  healthScore  pricePerServing        id  readyInMinutes  \
count             1.0          1.0             1.00       1.0             1.0   
mean             25.0         15.0           141.51  640238.0            45.0   
std               NaN          NaN              NaN       NaN             NaN   
min              25.0         15.0  

In [None]:


# remove any rows with missing values
df.dropna(inplace=True)

# remove any unnecessary columns
df.drop(['column_name_1', 'column_name_2'], axis=1, inplace=True)

# remove any outliers
df = df[(df['column_name'] > lower_bound) & (df['column_name'] < upper_bound)]

# convert the data types of any columns as needed
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

# normalize the data as needed
df['column_name'] = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()


In [51]:
# Save updated dataframe to file
df.to_csv('data.csv', index=False)