<a href="https://colab.research.google.com/github/hdtran103/Prediction-of-Product-Sales/blob/main/Loading_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Mount Google Drive so that Colab can read files stored there.

# **Loading Data**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import pandas, the data-analysis library used throughout this notebook.

In [2]:
# Import necessary library
import pandas as pd

Load the CSV file into a DataFrame using the read_csv() function from the pandas library.

In [3]:
# Load the data into a DataFrame using Pandas
filename = "/content/drive/MyDrive/Data/sales_predictions.csv"
df = pd.read_csv(filename)


- df.head() returns the first five rows, which is useful for quickly checking that the DataFrame holds the data you expect.
- df.info() prints a summary of the DataFrame: index dtype, column names and dtypes, non-null counts, and memory usage.

In [4]:
# Preview the first few rows of the data and a summary of the columns
print(df.head())
df.info()  # info() prints its summary itself and returns None, so no print() is needed

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

# **Data Cleaning**

In [5]:
# Data Cleaning
# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)

Number of duplicate rows: 0


- Find the number of rows and columns in the DataFrame using the .shape attribute.

In [6]:
# Get the dimensions of the DataFrame
print(df.shape)

(8523, 12)


- What is the datatype of each variable?
- df.dtypes lists the datatype of every column.

In [7]:
# Print df.dtypes to see the datatype of each column.
print(df.dtypes)

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object


- df.duplicated() is a pandas method that returns a boolean Series indicating which rows of a DataFrame repeat an earlier row.
- df.drop_duplicates() is another pandas method that removes duplicate rows from a DataFrame.

In [8]:
# Count duplicate rows and drop them, if any
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)

0
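
Because the real dataset has no duplicate rows, the effect of these two methods is not visible above. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and not taken from sales_predictions.csv) showing what duplicated() and drop_duplicates() return when a repeated row is present.

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first one exactly.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Item_MRP": [249.8, 48.3, 249.8],
})

dup_mask = toy.duplicated()          # boolean Series; True marks a repeat of an earlier row
n_duplicates = int(dup_mask.sum())   # True counts as 1, so the sum is the number of repeats
deduped = toy.drop_duplicates()      # returns a new DataFrame, keeping the first occurrence

print(dup_mask.tolist())
print(n_duplicates, deduped.shape)
```

By default both methods compare all columns and treat the first occurrence as the original.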


- Use the isnull() method to count the missing values in each column of the DataFrame.

In [9]:
# Identifying missing values
print(df.isnull().sum())

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
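
Raw counts are easier to judge as a share of all rows: with 8523 rows, the counts above correspond to roughly 17% of Item_Weight and 28% of Outlet_Size. As a small sketch on made-up data (the toy values below are assumptions for illustration only), the mean of the boolean mask gives that fraction directly.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of four weights and one of four sizes are missing.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "High"],
})

missing_counts = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()    # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```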


In [10]:
# Fill missing Outlet_Size values with the placeholder category "Unknown"
df["Outlet_Size"] = df["Outlet_Size"].fillna("Unknown")

- Print the missing-value counts again to confirm the fill. Outlet_Size now has none, while Item_Weight still shows 1463 because only Outlet_Size was filled.

In [11]:
# Print the remaining missing values per column
print(df.isnull().sum())

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                     0
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
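
Item_Weight still has missing values in the output above. One common option for a numeric column, which this notebook does not apply, is to fill the gaps with the column median. The sketch below shows the idea on made-up weights; whether median imputation suits this dataset is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the Item_Weight column.
toy = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5, 5.92, np.nan]})

# median() skips NaN by default, so it is computed from the three known weights.
median_weight = toy["Item_Weight"].median()

# Assign the result back instead of using inplace=True on a selected column.
toy["Item_Weight"] = toy["Item_Weight"].fillna(median_weight)

print(median_weight)
print(toy["Item_Weight"].isnull().sum())
```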


In [12]:
print(df["Item_Fat_Content"].value_counts())

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


The "Item_Fat_Content" column contains inconsistent labels for the same two categories: "Low Fat", "LF", and "low fat" all mean low fat, while "Regular" and "reg" both mean regular.
- To fix this, use the 'replace()' method to map "LF" and "low fat" to "Low Fat", and "reg" to "Regular".
- Then print the updated value counts for the "Item_Fat_Content" column using the '.value_counts()' method.


In [13]:
# Map the inconsistent labels in "Item_Fat_Content" onto "Low Fat" and "Regular"
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace({"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})
print(df["Item_Fat_Content"].value_counts())

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
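
After a relabelling step like this, it can help to check that no unexpected labels remain. The following is a minimal sketch on a small made-up Series; the names fat_map, expected, and leftover are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy Series (hypothetical) containing every label variant seen in the column.
fat = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"], name="Item_Fat_Content")

# Mapping from each inconsistent label to its canonical form.
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = fat.replace(fat_map)

# Any label outside the expected set would show up here.
expected = {"Low Fat", "Regular"}
leftover = set(cleaned.unique()) - expected

print(cleaned.value_counts())
print(leftover)
```

An empty leftover set means every value now belongs to one of the two expected categories.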


- To obtain summary statistics (count, mean, standard deviation, min, quartiles, max) for each numerical column in the DataFrame, use the '.describe()' method.

In [14]:
# Print summary statistics for numerical columns
print(df.describe())

       Item_Weight  Item_Visibility     Item_MRP  Outlet_Establishment_Year  \
count  7060.000000      8523.000000  8523.000000                8523.000000   
mean     12.857645         0.066132   140.992782                1997.831867   
std       4.643456         0.051598    62.275067                   8.371760   
min       4.555000         0.000000    31.290000                1985.000000   
25%       8.773750         0.026989    93.826500                1987.000000   
50%      12.600000         0.053931   143.012800                1999.000000   
75%      16.850000         0.094585   185.643700                2004.000000   
max      21.350000         0.328391   266.888400                2009.000000   

       Item_Outlet_Sales  
count        8523.000000  
mean         2181.288914  
std          1706.499616  
min            33.290000  
25%           834.247400  
50%          1794.331000  
75%          3101.296400  
max         13086.964800  
