<a href="https://colab.research.google.com/github/mumnsa/DAC-Curriculum-2025/blob/main/Student_Copy_EDA_DAC_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**INTRODUCTION**

***What is Exploratory Data Analysis ?***
- Understanding a data set by summarizing its main characteristics.
- Often done by plotting the data visually.
- This step is especially important before we model the data and apply MACHINE LEARNING.
- Common EDA plots include histograms, box plots, scatter plots, and many more. Exploring the data often takes a lot of time.

Through the process of EDA, we can also define the problem statement for our data set, which is very important.

***What data are we exploring today ?***

We will start off our EDA journey with a simple data set regarding a specific store's coffee sales. This dataset includes Date, Time, Payment method, Type of Coffee Sold, and Money Earned.

**1 . IMPORTING LIBRARIES**

***WHAT are Libraries & WHY do we need them?***

- Libraries in programming are pre-written collections of code.
- They provide useful functions, classes, and modules that can be reused in your own code.
- Libraries act as toolkits or packages to avoid rewriting common functionality.
- They help you avoid reinventing the wheel by offering ready-made solutions for common tasks.

Libraries can be compared to recipe books in cooking: Instead of baking a cake from scratch, you can follow a well-known recipe (i.e., pre-written code).

In summary, libraries are essential because they help us work faster, ensure our code is reliable, and allow us to tackle more complex problems easily by using tools that others have already built and perfected!

**How to install :**
NOTE : for macOS use pip3 , for Windows use pip

COMMAND : " !pip install pandas "

In [None]:
# Example
# Install packages first in order to be able to import libraries
!pip3 install pandas

In [None]:
# You will now be able to import required libraries.
import pandas as pd         # data manipulation & analysis
import numpy as np          # numerical & mathematical operations
import matplotlib.pyplot as plt         # for creating visualisations
import seaborn as  sns          # for more advanced visualisations

**Here’s a simple explanation of the purpose of each import statement:**

***1. import pandas as pd***
- **Purpose:** Loads the **pandas** library, which is used for data manipulation and analysis.
- **What it does:** It allows you to work with data in tables, similar to Excel, called DataFrames, and to perform operations like filtering, grouping, and merging data.

***2. import numpy as np***
- **Purpose:** Loads the **NumPy** library, which is essential for numerical and mathematical operations.
- **What it does:** It helps with handling arrays (lists of numbers) and performing calculations like averages, sums, and matrix operations.

***3. import matplotlib.pyplot as plt***
- **Purpose:** Loads **matplotlib**, a library for creating visualizations.
- **What it does:** It helps in making basic plots like line charts, bar charts, and scatter plots.

***4. import seaborn as sns***
- **Purpose:** Loads Seaborn, a data visualization library built on top of matplotlib.
- **What it does:** It makes it easier to create more advanced and visually appealing statistical plots, like heatmaps, box plots, and violin plots.
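
As a small illustration of the first two libraries, here is a minimal sketch on a made-up table (the `toy` DataFrame and its values are hypothetical, not the coffee sales file). It filters and groups rows with pandas, then computes an average with NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the coffee sales dataset
toy = pd.DataFrame({
    "drink": ["Latte", "Espresso", "Latte", "Mocha"],
    "price": [3.5, 2.0, 3.5, 4.0],
})

# pandas: keep only the Latte rows, then total the price per drink
lattes = toy[toy["drink"] == "Latte"]
total_by_drink = toy.groupby("drink")["price"].sum()

# NumPy: turn the column into an array and take its mean
prices = np.array(toy["price"])
average_price = np.mean(prices)

print(lattes)
print(total_by_drink)
print(average_price)
```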

**2. LOADING THE DATA into a DATAFRAME using pandas**

**How this works :** You can use pandas to read your data file and organise its contents into a DataFrame (made up of ROWS & COLUMNS).
- pandas can read CSV, Excel, SQL, etc.

What is a CSV?
- A CSV (Comma-Separated Values) file is a text file in which each line represents a row of data, and the values within a row are separated by commas.
- Other delimited text formats include Tab-Separated Values (TSV) and semicolon-separated files.
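
To make the CSV idea concrete, the sketch below reads a few lines of made-up CSV text with `pd.read_csv`, then reads the same data as tab-separated text by passing `sep`. The text and its values are invented for illustration, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Made-up CSV text: the first line is the header, each later line is one row
csv_text = "date,cash_type,money\n2024-03-01,card,38.7\n2024-03-01,cash,40.0\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same data separated by tabs: tell pandas which separator to expect
tsv_text = csv_text.replace(",", "\t")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv)
print(df_csv.equals(df_tsv))  # both reads should give the same table
```

With a real file, the `io.StringIO(...)` argument would be replaced by your own file path.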

In [None]:
# Place Coffee_Sales.csv in a Google Drive folder and remember where you have kept it
# Mount your Google Drive in the Google Colab notebook (you will be required to sign in to your Google account)


In [None]:
# Now that it is mounted, we can access it via the file path (different for everyone, use your file path)



**3. VIEWING THE DATA'S CONTENTS**
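
Before trying this on the coffee data, here is a minimal sketch of the usual viewing methods on a made-up table. The `toy` DataFrame is hypothetical and only borrows the `cash_type` and `money` column names.

```python
import pandas as pd

# Hypothetical toy table, NOT the coffee sales dataset
toy = pd.DataFrame({
    "cash_type": ["card", "card", "cash", "card", "cash", "card", "card"],
    "money": [38.7, 28.9, 40.0, 33.8, 30.0, 38.7, 28.9],
})

first_rows = toy.head()        # first 5 rows by default
last_rows = toy.tail(2)        # last 2 rows (the default is also 5)
payment_methods = toy["cash_type"].unique()        # distinct values in the column
payment_counts = toy["cash_type"].value_counts()   # number of rows per value

print(first_rows)
print(last_rows)
print(payment_methods)
print(payment_counts)
```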

In [None]:
# To view FIRST 5 rows

In [None]:
# To view LAST 5 rows

In [None]:
# To view the unique elements in cash_type (we can call it payment method)

In [None]:
# To find out number of payments by cash or card

**4. VIEWING THE DATA'S GENERAL INFO & STATS**


In [None]:
# It will return the (number of rows, number of columns)

In [None]:
# It will print all the column names of our dataset

In [None]:
# df.dtypes shows the types of data in our Dataframe

In [None]:
# df.info() shows a summary of our data set

In [None]:
# show_counts=True shows the number of non-null rows
# you will notice that card has only 1660 non-null rows -> we need to remove them!

**int64** : This data type is used to **represent integer values**. The int64 type indicates that each element in the column is a 64-bit integer.

**float64** : This data type is used to represent floating-point values, which are **numbers that can have decimal places**. The float64 type indicates that each element in the column is a 64-bit floating-point number.

**object** : This data type is a catch-all for columns that contain mixed types or are not easily classified as numerical. Columns with the data type object can **contain strings, mixed types, or even Python objects**.
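
The three data types can be seen on a small made-up table. This is a sketch with hypothetical column names; the `mixed` column deliberately combines a string, an integer, and a decimal so that pandas falls back to `object`.

```python
import pandas as pd

# Hypothetical toy table, NOT the coffee sales dataset
toy = pd.DataFrame({
    "cups": [1, 2, 3],             # whole numbers
    "money": [38.7, 28.9, 33.8],   # numbers with decimal places
    "mixed": ["Latte", 2, 3.5],    # mixed Python objects
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # column names
print(toy.dtypes)          # data type of each column
```

A column holding only strings is typically reported as `object` as well, although this may differ between pandas versions.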

In [None]:
# Display a STATISTICAL summary of our data, shown to 5 decimal places
# (only for columns with numerical values)
pd.options.display.float_format = '{:.5f}'.format
# .describe() is a pandas method that returns the statistical summary of our dataset
df.describe()

**5. CLEANING THE DATA**

**NULL DATA**



*   For this dataset, we will not remove the null rows yet, because removing them would leave us with no cash payments.
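
To see what dropping null rows does, here is a minimal sketch on a made-up table in which the cash payments have no `card` value. The data and the card IDs are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: cash payments have no card ID
toy = pd.DataFrame({
    "cash_type": ["card", "cash", "card", "cash"],
    "card": ["A1", None, "A2", None],
    "money": [38.7, 40.0, 28.9, 30.0],
})

# Number of null values in each column
null_counts = toy.isnull().sum()

# Drop every row that contains at least one null (axis=0 is the default)
cleaned = toy.dropna()

print(null_counts)
print(cleaned)
```

In this toy table, every cash row disappears after `dropna()`, which is why it is worth checking which rows are null before removing them.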



In [None]:
# View the number of null values in each column
# To remove all rows that contain a null value:
# axis = 0 ==> rows
# axis = 1 ==> columns

## df = df.dropna()

# You would then notice that your dataframe has only 1660 rows instead of the initial 1748 rows.

In [None]:
# using the same command, now we notice there are no more null rows!

In [None]:
## df.count()
# the number of rows in every column is now 1660


**DUPLICATED DATA**

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

In this case, there are no duplicated rows.

The output shows two numbers, which are the shape of `duplicate_rows_df`:

0: The number of duplicate rows found.

6: The number of columns in the DataFrame.
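
The same check can be sketched on a made-up table that does contain a duplicate (the `toy` data is hypothetical). Here the shape's first number is the count of duplicate rows, and `drop_duplicates()` keeps the first copy of each row.

```python
import pandas as pd

# Hypothetical toy table where the second row repeats the first
toy = pd.DataFrame({
    "coffee_name": ["Latte", "Latte", "Mocha"],
    "money": [3.5, 3.5, 4.0],
})

# duplicated() marks rows that repeat an earlier row
duplicate_rows = toy[toy.duplicated()]
print("number of duplicate rows: ", duplicate_rows.shape[0])
print("shape (rows, columns): ", duplicate_rows.shape)

# Remove the repeats, keeping the first occurrence
deduped = toy.drop_duplicates()
print(deduped)
```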

In [None]:
# if there are duplicated rows, we use
# df = df.drop_duplicates()

**6. EXPLORE OUR DATA !**

In [None]:
#converting ['date'] into date-time format => more uniform data => easier plotting => better analysis
df['date'] = pd.to_datetime(df['date'])
df['datetime'] = pd.to_datetime(df['datetime'])

**Commonly used functions!**

**df['date']:**

* This accesses the 'date' column from the DataFrame df, which contains the dates of the sales transactions.

**value_counts():**

* This function counts the occurrence of each unique date, effectively calculating how many sales occurred on each date. The result is a Series where the index represents the dates and the values represent the sales count.

**sort_index():**

* The output of value_counts() is sorted by date (index) to ensure the sales data is ordered chronologically, so the line chart reflects the correct time sequence.

**plot(kind='line', title='Daily Coffee Sales'):**

* This plots the sales count (values) against the dates (index) as a line plot, with the title "Daily Coffee Sales".

* The kind='line' specifies that a line chart should be used.

* You can also choose other kinds of chart, such as 'bar', 'box', 'hist', or 'pie'.

**plt.rcParams['figure.figsize'] = (14, 6):**

* This configures the default size of the plot, making it 14 inches wide and 6 inches tall. This ensures that the chart is large enough for clear visualization.
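
The counting and sorting steps can be sketched on a handful of made-up dates (the `toy` data is hypothetical). The plotting call is left as a comment so that the example runs without opening a figure.

```python
import pandas as pd

# Hypothetical toy sales: one row per sale, dates deliberately out of order
toy = pd.DataFrame({
    "date": ["2024-03-02", "2024-03-01", "2024-03-02", "2024-03-03", "2024-03-02"],
})
toy["date"] = pd.to_datetime(toy["date"])

# Count the sales on each date, then order the result chronologically
daily_sales = toy["date"].value_counts().sort_index()
print(daily_sales)

# In a notebook, the Series could then be drawn with:
# daily_sales.plot(kind='line', title='Daily Coffee Sales')
```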

In [None]:
# To check daily sales across March 2024 to November 2024

**TO ADD COLOURS TO VISUALISATIONS :**

* Matplotlib - use color = ' ' (pie charts take colors = ' ')
* Seaborn - use palette = ' '

In [None]:

# you can play around with the colours!


# note: if the null rows had been removed, there would be no cash payments left!!!

In [None]:
# Which day of the week has the highest sales?
# 0=Monday, 6=Sunday
df['weekday'] = df['date'].dt.weekday
plt.rcParams['figure.figsize'] = (14,6)

df['weekday'].value_counts().sort_index().plot(kind='bar', title='Sales by Weekday', color='pink')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Which hour has the highest coffee sales?
df['hour'] = df['datetime'].dt.hour
plt.rcParams['figure.figsize'] = (14, 6)
df['hour'].value_counts().sort_index().plot(kind='bar', title='Sales by Hour of Day', color='red')
plt.show()

In [None]:
# Boxplot: To check for outliers in the sale amounts.
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (14, 6)
sns.boxplot(x=df['money'])
plt.title('Sale Amount Boxplot')
plt.show()

In [None]:
# Which coffee is the most popular?
# Set the figure size before plotting so that it applies to this chart
plt.rcParams['figure.figsize'] = (14, 6)
df['coffee_name'].value_counts().plot(kind='bar', title='Top-Selling Coffee Types')
plt.xticks(rotation=45)  # rotate the labels on the x axis
plt.show()

In [None]:
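# Countplot: number of sales per coffee type, with the count labelled on each bar
plt.rcParams['figure.figsize'] = (14, 6)
Coffee_Popularity = sns.countplot(x = 'coffee_name', data = df, palette = 'colorblind')

for bars in Coffee_Popularity.containers:
    Coffee_Popularity.bar_label(bars)

plt.xticks(rotation=45)  # rotate the labels once, after the loop
plt.show()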

In [None]:
# Total money earned per coffee type, sorted from highest to lowest
Coffee_Sales = df.groupby(['coffee_name'], as_index = False)['money'].sum().sort_values(by = 'money', ascending = False)
sns.barplot(x = 'coffee_name', y = 'money', data = Coffee_Sales, palette = 'Blues')
plt.show()