
# **Introduction: Mount and Import**

In [70]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

filename = "/content/drive/MyDrive/01 Week 1: Python/sales_predictions.csv"
df = pd.read_csv(filename)
df.head()

In [None]:
df.info()

### **Duplicate and Missing Values**


In [85]:
# Check for duplicated rows: duplicated() flags repeats, sum() counts them
df.duplicated().sum()

0
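
The dataset has no duplicates, so nothing needs removing here. For reference, a minimal sketch of what the cleanup could look like if the count were non-zero. The `toy` frame and its values are made up for illustration and are not taken from the sales data.

```python
import pandas as pd

# Hypothetical toy frame with one repeated row (illustration only)
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT049"],
})

# Count rows that repeat an earlier row exactly
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} rows remain")
```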

In [67]:
# Check for missing values

# Sum of missing values in item weight and outlet size columns
num_missing_item_weight = df['Item_Weight'].isna().sum()
num_missing_outlet_size = df['Outlet_Size'].isna().sum()

total_rows = df.shape[0]

percent_missing_item_weight = (num_missing_item_weight/total_rows)*100
percent_missing_outlet_size = (num_missing_outlet_size/total_rows)*100

print(f'{percent_missing_item_weight:.2f}% of the data in the Item Weight column is missing\n')
print(f'{percent_missing_outlet_size:.2f}% of the data in the Outlet Size column is missing')

17.17% of the data in the Item Weight column is missing

28.28% of the data in the Outlet Size column is missing
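
The same percentages can be computed for every column at once, which avoids naming each column by hand. This is a sketch on a made-up `toy` frame (an assumption, not the real data); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sales data (values are made up)
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, None],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# isna() yields a boolean frame; the column-wise mean is the share of missing values
percent_missing = toy.isna().mean() * 100

# Keep only the columns that actually have gaps, largest share first
percent_missing = percent_missing[percent_missing > 0].sort_values(ascending=False)

print(percent_missing.round(2))
```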


In [106]:
# Handling missing values

# Drop Item_Weight and Outlet_Size in one call - details in Notes below
df.drop(columns=['Item_Weight', 'Outlet_Size'], inplace=True)

# Alternative considered and rejected: fill Outlet_Size with its mode
# df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])

# Confirm columns were dropped - no more missing data
df.isna().sum()


Item_Identifier              0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [112]:
# Confirm columns were dropped
df.head()

| | Item_Identifier | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | Tier 3 | Supermarket Type1 | 994.7052 |


## **Correcting Inconsistent Categories**

In [39]:
# Correct inconsistent categories - Item_Fat_Content

# All changes for Low Fat as new column value standard
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('LF', 'Low Fat')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('low fat', 'Low Fat')


# All changes for Regular as new column value standard
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg', 'Regular')

df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
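
The three separate `replace` calls above can also be written as a single dictionary mapping, which keeps every label fix in one place. A minimal sketch on a made-up `fat` series (illustrative values only); `fat_map` is a name introduced here for the example.

```python
import pandas as pd

# Hypothetical sample of the raw labels seen in Item_Fat_Content
fat = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"],
                name="Item_Fat_Content")

# One mapping from every variant label to its standard form
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}

# Labels not listed in the mapping are left unchanged
cleaned = fat.replace(fat_map)

counts = cleaned.value_counts()
print(counts)
```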

In [104]:
# Standardize category labels - Item_Type

# All changes for Fruits & Vegetables as new column value standard
df['Item_Type'] = df['Item_Type'].replace('Fruits and Vegetables', 'Fruits & Vegetables')
# All changes for Health & Hygiene as new column value standard
df['Item_Type'] = df['Item_Type'].replace('Health and Hygiene', 'Health & Hygiene')


df['Item_Type'].value_counts()

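The saved output below is labeled `Outlet_Size` (see its `Name:` line), so it reflects an earlier `Outlet_Size` count rather than the `Item_Type` counts requested above.

Medium    5203
Small     2388
High       932
Name: Outlet_Size, dtype: int64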

### **Summary Statistics**

In [111]:

df.agg(
    {
        "Item_Visibility": ["min", "max", "mean"],
        "Item_MRP": ["min", "max", "mean"],
        "Outlet_Establishment_Year": ["min", "max", "mean"],
        "Item_Outlet_Sales": ["min", "max", "mean"],
    }
)



| | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|
| min | 0.0 | 31.29 | 1985.0 | 33.29 |
| max | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |
| mean | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
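
Listing each numeric column by hand works for four columns. A shorter variant, sketched here on a made-up `toy` frame (an assumption, not the real data), selects the numeric columns automatically and applies the same three statistics to all of them.

```python
import pandas as pd

# Hypothetical toy frame mixing text and numeric columns (values are made up)
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Meat", "Household"],
    "Item_MRP": [10.0, 20.0, 30.0],
    "Outlet_Establishment_Year": [1999, 2009, 1987],
})

# select_dtypes keeps only numeric columns; agg applies each statistic to every one
stats = toy.select_dtypes(include="number").agg(["min", "max", "mean"])

print(stats)
```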


# **Conclusion**

In [107]:
df.head(20)

| | Item_Identifier | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | Tier 3 | Supermarket Type1 | 994.7052 |
| 5 | FDP36 | Regular | 0.0 | Baking Goods | 51.4008 | OUT018 | 2009 | Tier 3 | Supermarket Type2 | 556.6088 |
| 6 | FDO10 | Regular | 0.012741 | Snack Foods | 57.6588 | OUT013 | 1987 | Tier 3 | Supermarket Type1 | 343.5528 |
| 7 | FDP10 | Low Fat | 0.12747 | Snack Foods | 107.7622 | OUT027 | 1985 | Tier 3 | Supermarket Type3 | 4022.7636 |
| 8 | FDH17 | Regular | 0.016687 | Frozen Foods | 96.9726 | OUT045 | 2002 | Tier 2 | Supermarket Type1 | 1076.5986 |
| 9 | FDU28 | Regular | 0.09445 | Frozen Foods | 187.8214 | OUT017 | 2007 | Tier 2 | Supermarket Type1 | 4710.535 |


### **Notes**


**Missing Values**
*   The Item Weight and Outlet Size columns were both missing significant amounts of data (17% and 28%, respectively). Keeping these gaps in our dataset would introduce much room for error. I was initially inclined to replace the missing Outlet Size values with the mode. After consideration, I believe the share of missing data was too large and filling it in would skew the analysis. Dropping both columns should leave less room for error.
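
To make the skew concern concrete, here is a small sketch of how mode imputation inflates the most common category when many values are missing. The `size` series is made up for illustration (half of its entries are missing, more than in the real column) and is not drawn from the dataset.

```python
import pandas as pd

# Hypothetical column: 4 observed values and 4 missing ones
size = pd.Series(["Medium", "Medium", "Small", "High", None, None, None, None])

# Category shares among the observed values only (value_counts skips NaN)
share_before = size.value_counts(normalize=True)

# Fill every gap with the most frequent observed value
filled = size.fillna(size.mode()[0])
share_after = filled.value_counts(normalize=True)

print(f"Medium share before imputation: {share_before['Medium']:.2f}")
print(f"Medium share after imputation:  {share_after['Medium']:.2f}")
```

In this toy case the share of "Medium" rises from 0.50 to 0.75 purely as a side effect of the fill, which is the kind of distortion the note above is worried about.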



--------------------------------------------

1. There are 8523 rows and 12 columns.
2. Data types include objects, integers, and floats.

--------------------------------------------

**Notes**

*   No duplicates in the data to remove
*   Data types appear correct for the data they represent
*   Inconsistent values in the ***Item Fat Content*** column: 3 different labels for **Low Fat** (Low Fat, LF, and low fat) and 2 for **Regular** (Regular and reg). Will keep capitalized first letters for consistency across all columns.
*   `.value_counts()` shows no inconsistent values in any other categorical column
*   **Item Weight** and **Outlet Size** are missing ~17% and ~28% of their values, respectively
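
The label check described in these notes was done by reading `.value_counts()` output. A rough sketch of a partial automation follows, assuming a made-up `toy` frame in which every column is text. It flags labels that differ only by case or surrounding whitespace; the `suspicious` dictionary is a name introduced for this example. Abbreviations such as `LF` are not caught this way and still need a manual look.

```python
import pandas as pd

# Hypothetical toy frame; every column here holds text labels (values are made up)
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "low fat", "LF", "Regular"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type1"],
})

suspicious = {}
for col in toy.columns:
    # Distinct labels that appear in this column
    labels = pd.Series(toy[col].dropna().unique())
    # Normalized form: trimmed and lowercased
    normalized = labels.str.strip().str.lower()
    # Collect the original labels that share one normalized form
    groups = labels.groupby(normalized).agg(list)
    clashes = groups[groups.map(len) > 1]
    if not clashes.empty:
        suspicious[col] = clashes.tolist()

print(suspicious)
```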






