# Task 1 - Exploratory Data Analysis
---
## Section 1 - Setup
Mounting Google Drive in the notebook in order to access the CSV data file.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Installing and importing `pandas`, which provides the DataFrame used to view, analyse and manipulate the dataset.

In [None]:
!pip install pandas
import pandas as pd



---
## Section 2 - Data loading
Setting the `path` variable to the location of the CSV file, then reading the file into a pandas DataFrame and dropping the leftover `Unnamed: 0` index column if it is present.

In [None]:
path = "/content/drive/MyDrive/cognizant/sample_sales_data.csv"
df = pd.read_csv(path)
df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
df.head()

|   | transaction_id | timestamp | product_id | category | customer_type | unit_price | quantity | total | payment_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a1c82654-c52c-45b3-8ce8-4c2a1efe63ed | 2022-03-02 09:51:38 | 3bc6c1ea-0198-46de-9ffd-514ae3338713 | fruit | gold | 3.99 | 2 | 7.98 | e-wallet |
| 1 | 931ad550-09e8-4da6-beaa-8c9d17be9c60 | 2022-03-06 10:33:59 | ad81b46c-bf38-41cf-9b54-5fe7f5eba93e | fruit | standard | 3.99 | 1 | 3.99 | e-wallet |
| 2 | ae133534-6f61-4cd6-b6b8-d1c1d8d90aea | 2022-03-04 17:20:21 | 7c55cbd4-f306-4c04-a030-628cbe7867c1 | fruit | premium | 0.19 | 2 | 0.38 | e-wallet |
| 3 | 157cebd9-aaf0-475d-8a11-7c8e0f5b76e4 | 2022-03-02 17:23:58 | 80da8348-1707-403f-8be7-9e6deeccc883 | fruit | gold | 0.19 | 4 | 0.76 | e-wallet |
| 4 | a81a6cd3-5e0c-44a2-826c-aea43e46c514 | 2022-03-05 14:32:43 | 7f5e86e6-f06f-45f6-bf44-27b095c9ad1d | fruit | basic | 4.49 | 2 | 8.98 | debit card |


### Key
- `transaction_id` = a unique ID that is assigned to each transaction
- `timestamp` = the datetime at which the transaction was made
- `product_id` = an ID that is assigned to the product that was sold. Each product has a unique ID
- `category` = the category that the product is contained within
- `customer_type` = the type of customer that made the transaction
- `unit_price` = the price that 1 unit of this item sells for
- `quantity` = the number of units sold for this product within this transaction
- `total` = the total amount payable by the customer
- `payment_type` = the payment method used by the customer
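
The key suggests, without stating it outright, that `total` should equal `unit_price * quantity`. A minimal sketch of that sanity check follows, run on the five rows shown above rather than on the full file. The names `sample`, `total_matches` and `tol` are hypothetical, and the relationship itself is an assumption to be confirmed against the full dataset.

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the five rows displayed by df.head() above.
sample = pd.DataFrame({
    "unit_price": [3.99, 3.99, 0.19, 0.19, 4.49],
    "quantity": [2, 1, 2, 4, 2],
    "total": [7.98, 3.99, 0.38, 0.76, 8.98],
})

def total_matches(data: pd.DataFrame, tol: float = 0.005) -> pd.Series:
    """Flag rows where total is within `tol` of unit_price * quantity."""
    expected = data["unit_price"] * data["quantity"]
    # np.isclose avoids false mismatches caused by floating-point rounding.
    return pd.Series(np.isclose(data["total"], expected, atol=tol), index=data.index)

matches = total_matches(sample)
print(f"{matches.sum()} of {len(matches)} rows are consistent")
```

Passing the full `df` instead of `sample` would apply the same check to every transaction; any rows flagged `False` would be worth inspecting before modelling.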

---
## Section 3 - Descriptive statistics
Describing the data: which columns are present, how many null values exist, and which data type each column holds.

Descriptive statistics are then computed for the numerical columns in the dataset.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7829 entries, 0 to 7828
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   transaction_id  7829 non-null   object 
 1   timestamp       7829 non-null   object 
 2   product_id      7829 non-null   object 
 3   category        7829 non-null   object 
 4   customer_type   7829 non-null   object 
 5   unit_price      7829 non-null   float64
 6   quantity        7829 non-null   int64  
 7   total           7829 non-null   float64
 8   payment_type    7829 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 550.6+ KB
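
The output above lists `timestamp` with dtype `object`, meaning the values are stored as plain strings. A possible next step, sketched here on the five timestamps shown in Section 2, is to parse them with `pd.to_datetime` so that date-based operations become available. The `sample` frame, the `hour` column and the explicit format string are assumptions for illustration only.

```python
import pandas as pd

# Timestamps copied from the five rows shown in Section 2.
sample = pd.DataFrame({
    "timestamp": [
        "2022-03-02 09:51:38",
        "2022-03-06 10:33:59",
        "2022-03-04 17:20:21",
        "2022-03-02 17:23:58",
        "2022-03-05 14:32:43",
    ]
})

# Parse the strings into datetimes; the format is inferred from the rows above
# and is assumed to hold for the whole file.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Datetime columns expose the .dt accessor, e.g. to extract the hour of day.
sample["hour"] = sample["timestamp"].dt.hour
print(sample.dtypes)
```

The same conversion on the full data would be `df["timestamp"] = pd.to_datetime(df["timestamp"])`.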


In [None]:
print(df.describe())

        unit_price     quantity        total
count  7829.000000  7829.000000  7829.000000
mean      7.819480     2.501597    19.709905
std       5.388088     1.122722    17.446680
min       0.190000     1.000000     0.190000
25%       3.990000     1.000000     6.570000
50%       7.190000     3.000000    14.970000
75%      11.190000     4.000000    28.470000
max      23.990000     4.000000    95.960000
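
By default `describe()` on a mixed DataFrame summarises only the numeric columns, so the text columns are absent from the table above. A rough sketch of an equivalent summary for the categorical columns is given below, again using only the five rows shown in Section 2. The names `sample` and `summary` are hypothetical.

```python
import pandas as pd

# Categorical columns copied from the five rows shown in Section 2.
sample = pd.DataFrame({
    "category": ["fruit", "fruit", "fruit", "fruit", "fruit"],
    "customer_type": ["gold", "standard", "premium", "gold", "basic"],
    "payment_type": ["e-wallet", "e-wallet", "e-wallet", "e-wallet", "debit card"],
})

summary = pd.DataFrame({
    # Number of distinct values in each column
    "unique": sample.nunique(),
    # Most frequent value in each column (first mode if there are ties)
    "top": sample.mode().iloc[0],
    # How many times that most frequent value occurs
    "freq": sample.apply(lambda col: col.value_counts().iloc[0]),
})
print(summary)
```

Replacing `sample` with `df[["category", "customer_type", "payment_type"]]` would produce the same summary for the full dataset.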


---
## Section 4 - Visualisation
Loading the `seaborn` and `matplotlib` packages for visualisations.

In [None]:
!pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt



### Analysing the dataset
`Unique values present within a column`

In [None]:
def get_unique_values(data, column):
    """Print and plot how often each unique value of `column` occurs."""
    num_unique_values = len(data[column].unique())
    value_counts = data[column].value_counts()
    print(f"Column: {column} has {num_unique_values} unique values\n")
    print(value_counts)
    # Bar chart of the counts, ordered from most to least frequent value
    plt.figure(figsize=(10, 6))
    sns.countplot(x=data[column], order=value_counts.index)
    plt.title(f'Unique Values Distribution for {column}')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.show()

get_unique_values(data=df, column='unit_price')
get_unique_values(data=df, column='quantity')
get_unique_values(data=df, column='total')

`Distribution of categorical columns`

In [None]:
def plot_categorical_distribution(data: pd.DataFrame = None, column: str = None, height: int = 8, aspect: int = 2):
    """Plot the number of rows in each level of a categorical column."""
    sns.set(style="whitegrid")  # Optional: set a seaborn style
    plot = sns.catplot(data=data, x=column, kind='count', height=height, aspect=aspect)
    plot.set(title=f'Distribution of {column}')
    plt.show()

# Applied to the categorical columns of the dataset
plot_categorical_distribution(data=df, column='category')
plot_categorical_distribution(data=df, column='customer_type')
plot_categorical_distribution(data=df, column='payment_type')

`Correlations between the numeric columns within the data`

In [None]:
def correlation_plot(data: pd.DataFrame = None, columns: list = None):
    if columns is None:
        raise ValueError("Please provide a list of column names.")
    corr_subset = data[columns].corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_subset, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
    plt.title(f'Correlation Matrix for {", ".join(columns)}')
    plt.show()
correlation_plot(data=df, columns=['unit_price', 'quantity', 'total'])

`Distribution of numeric columns`

In [None]:
def plot_continuous_distributions(data: pd.DataFrame = None, columns: list = None, height: int = 8):
    if columns is None:
        raise ValueError("Please provide a list of column names.")

    plt.figure(figsize=(height, height/5 * len(columns)))

    for i, column in enumerate(columns, 1):
        plt.subplot(len(columns), 1, i)
        sns.histplot(data[column], kde=True)
        plt.title(f'Distribution of {column}')
    plt.tight_layout()
    plt.show()
plot_continuous_distributions(data=df, columns=['unit_price', 'quantity', 'total'])

---

## Section 5 - Summary

The exploratory data analysis gives a solid understanding of the data.
The client wants to know:

```
"How to better stock the items that they sell!"
```

From this dataset, it is impossible to answer that question. In order to make the next step on this project with the client, it is clear that:

- More rows of data are needed, since the current sample covers only 1 store and 1 week's worth of data.
- The problem statement needs to be framed more specifically. The current business problem is too broad, so the focus must be narrowed in order to deliver a valuable end product.
- Depending on the problem statement chosen, more columns (features) are also needed to help understand the outcome being solved for.
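
The limited time span mentioned in the first point can be checked directly from the `timestamp` column. The sketch below is illustrative only: it uses the five timestamps shown in Section 2, and the helper name `time_coverage` is hypothetical. On the full DataFrame it would report the first and last transaction and the number of distinct days covered.

```python
import pandas as pd

# Timestamps copied from the five rows shown in Section 2.
timestamps = pd.to_datetime(pd.Series([
    "2022-03-02 09:51:38",
    "2022-03-06 10:33:59",
    "2022-03-04 17:20:21",
    "2022-03-02 17:23:58",
    "2022-03-05 14:32:43",
]))

def time_coverage(ts: pd.Series) -> dict:
    """Summarise the period covered by a Series of datetimes."""
    return {
        "first": ts.min(),
        "last": ts.max(),
        # normalize() drops the time of day, leaving one value per calendar day
        "distinct_days": ts.dt.normalize().nunique(),
    }

coverage = time_coverage(timestamps)
print(coverage)
```

For the full data, `time_coverage(pd.to_datetime(df["timestamp"]))` would give the equivalent figures and make the case for requesting a longer history from the client.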