For this assignment, you must select a dataset that meets the following criteria:
1. It must be a labeled dataset with a clearly defined target variable. A labeled dataset lets you explore relationships between the features and the target, which keeps the analysis focused and actionable.

2. It must contain at least 1,000 rows, so that there is enough data for meaningful analysis.

3. Most of the features should be numerical, which allows a range of visualizations such as scatter plots, histograms, and heat maps.
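
The three criteria can be checked programmatically before starting the analysis. The following is a minimal sketch, not part of the assignment itself: the helper name `check_dataset_criteria`, the toy frame, and the choice to count boolean columns as numerical are all assumptions made for illustration.

```python
import pandas as pd

def check_dataset_criteria(df, target, min_rows=1000):
    """Return pass/fail flags for the three criteria (hypothetical helper)."""
    has_target = target in df.columns
    features = df.drop(columns=[target]) if has_target else df
    # Assumption: boolean (one-hot) columns are counted as numerical here
    numeric = features.select_dtypes(include=["number", "bool"])
    return {
        "has_labeled_target": bool(has_target and df[target].notna().all()),
        "enough_rows": bool(len(df) >= min_rows),
        "mostly_numeric": bool(numeric.shape[1] > features.shape[1] / 2),
    }

# Toy data: two numeric features, one text feature, one binary target
toy = pd.DataFrame({
    "x1": range(1200),
    "x2": [0.5] * 1200,
    "city": ["A"] * 1200,
    "y": [0, 1] * 600,
})
report = check_dataset_criteria(toy, target="y")
print(report)
```

Calling the same helper on the final DataFrame with `target="Purchase"` would give a quick confirmation that the generated dataset satisfies the assignment.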


In [2]:
!pip install scikit-learn
!pip install pandas
!pip install tensorflow
!pip install matplotlib

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
streamlit 1.37.1 requires protobuf<6,>=3.20, but you have protobuf 6.32.0 which is incompatible.




In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import os

In [9]:
# 1. GENERATE SYNTHETIC DATA
np.random.seed(0)
data_size = 2000

In [10]:
# Base numeric features
df = pd.DataFrame({
    "VisitDuration":  np.random.rand(data_size),                 # uniform in [0, 1), read as a normalized duration
    "PagesVisited":   np.random.randint(1, 21, size=data_size),  # 1–20 pages
    "ItemsViewed":    np.random.randint(1, 16, size=data_size),  # 1–15 items
    "DeviceType":     np.random.choice([0, 1], size=data_size),  # 0=Desktop, 1=Mobile
    "AdClicks":       np.random.randint(0, 6, size=data_size),   # 0–5 ads
    "CartAdds":       np.random.randint(0, 4, size=data_size),   # 0–3 items added
})

In [11]:
# Demographics
df["Age"] = np.clip(np.random.normal(loc=33, scale=10, size=data_size).round(), 16, 75).astype(int)
df["Gender"] = np.random.choice([0, 1], size=data_size)  # 0=woman, 1=man

In [12]:
# Locations around Vancouver (with mild weights)
locations = ["New Westminster", "Downtown", "Surrey", "Burnaby", "Kitsilano"]
weights   = [0.18,               0.26,        0.22,     0.20,      0.14]
df["Location"] = np.random.choice(locations, size=data_size, p=weights)

# One-hot encode location
loc_dummies = pd.get_dummies(df["Location"], prefix="Loc")
df = pd.concat([df.drop(columns=["Location"]), loc_dummies], axis=1)
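
`pd.get_dummies` can produce boolean columns (True/False), which is what the printed head of this dataset shows. If 0/1 integers are preferred so that the location columns match the other binary features, the `dtype` argument can be passed. A small sketch on a toy Series (the four sample values are made up for illustration):

```python
import pandas as pd

# Toy location column; the values here are illustrative only
locations = pd.Series(["Surrey", "Burnaby", "Downtown", "Surrey"], name="Location")

# dtype=int yields 0/1 integer indicators instead of booleans
loc_dummies = pd.get_dummies(locations, prefix="Loc", dtype=int)
print(loc_dummies)

# Each row has exactly one active location column
rows_have_one_location = bool((loc_dummies.sum(axis=1) == 1).all())
print(rows_have_one_location)
```

Either representation works for plotting and correlation; the integer form mainly keeps the column types consistent.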

In [13]:
# Target (Purchase): 1 if ANY of the engagement rules below holds, otherwise 0
df["Purchase"] = (
    (df["VisitDuration"] + (df["PagesVisited"]/20.0) > 1.0) |
    (df["CartAdds"] > 0) |
    (df["AdClicks"] >= 3) |
    (df["ItemsViewed"] >= 8)
).astype(int)
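
Because the four rules are combined with OR, a session is labeled 0 only when all of them fail at once. With the independent distributions used above, that probability is roughly 0.475 × 0.25 × 0.5 × 7/15 ≈ 0.028, so the target is expected to be heavily skewed toward 1. The sketch below checks the class balance on a freshly generated sample. It uses `np.random.default_rng` rather than the notebook's `np.random.seed` calls (an assumption made to keep the example self-contained), so the exact counts will differ from the notebook's DataFrame.

```python
import numpy as np
import pandas as pd

# Assumption: a separate generator, so values differ from the notebook's df
rng = np.random.default_rng(0)
n = 2000

demo = pd.DataFrame({
    "VisitDuration": rng.random(n),               # uniform in [0, 1)
    "PagesVisited":  rng.integers(1, 21, size=n),  # 1-20
    "ItemsViewed":   rng.integers(1, 16, size=n),  # 1-15
    "AdClicks":      rng.integers(0, 6, size=n),   # 0-5
    "CartAdds":      rng.integers(0, 4, size=n),   # 0-3
})

# Same OR-combined rule as the notebook's Purchase definition
demo["Purchase"] = (
    (demo["VisitDuration"] + (demo["PagesVisited"] / 20.0) > 1.0) |
    (demo["CartAdds"] > 0) |
    (demo["AdClicks"] >= 3) |
    (demo["ItemsViewed"] >= 8)
).astype(int)

# Share of each class; a strong skew matters when reading later plots
class_share = demo["Purchase"].value_counts(normalize=True).sort_index()
print(class_share)
```

If a more balanced target is wanted, one option is to require two or more rules to hold instead of any single one; that would be a change to the dataset design rather than a fix.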

In [14]:
# column order (features first, then target)
feature_cols = [
    "VisitDuration","PagesVisited","ItemsViewed","DeviceType","AdClicks","CartAdds",
    "Age","Gender",
    # one-hot locations
    "Loc_Burnaby","Loc_Downtown","Loc_Kitsilano","Loc_New Westminster","Loc_Surrey"
]
df = df[feature_cols + ["Purchase"]]

In [15]:
print(df.head())

   VisitDuration  PagesVisited  ItemsViewed  DeviceType  AdClicks  CartAdds  \
0       0.548814            12            2           0         5         0   
1       0.715189            16            2           1         0         0   
2       0.602763             5            4           1         5         0   
3       0.544883            18            7           0         2         0   
4       0.423655            13            6           1         4         2   

   Age  Gender  Loc_Burnaby  Loc_Downtown  Loc_Kitsilano  Loc_New Westminster  \
0   35       1         True         False          False                False   
1   43       0         True         False          False                False   
2   18       0         True         False          False                False   
3   21       0        False         False          False                False   
4   33       1        False         False          False                 True   

   Loc_Surrey  Purchase  
0       Fals

In [16]:
print("\nShape:", df.shape)


Shape: (2000, 14)


In [23]:
# Save the dataset to CSV
output = 'C:/Users/MuriloFarias/Desktop/GitHub/Python/CLASSE_DEA109/Assignment/vancouver_shopping_dataset.csv'
df.to_csv(output, index=False)
output

'C:/Users/MuriloFarias/Desktop/GitHub/Python/CLASSE_DEA109/Assignment/vancouver_shopping_dataset.csv'
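
The absolute path above only works on one machine. A more portable option is to build the path with `pathlib` relative to a folder that exists wherever the notebook runs. The sketch below is illustrative: it writes a two-row stand-in frame (`df_demo`, a made-up name) into a temporary directory and reads it back to confirm the round trip; in the notebook, the temporary directory would be replaced by the project folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for the real DataFrame (illustrative values only)
df_demo = pd.DataFrame({"VisitDuration": [0.1, 0.9], "Purchase": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Path objects join with "/" and work on Windows, macOS, and Linux
    output = Path(tmp) / "vancouver_shopping_dataset.csv"
    df_demo.to_csv(output, index=False)
    reloaded = pd.read_csv(output)

# The reloaded file should have the same shape and column names
round_trip_ok = (
    reloaded.shape == df_demo.shape
    and list(reloaded.columns) == list(df_demo.columns)
)
print(round_trip_ok)
```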

This dataset simulates online shopping behavior for 2,000 customer sessions across five locations in the Vancouver area. It is labeled, with a target variable Purchase that indicates whether a session resulted in a purchase (1) or not (0).


Features
VisitDuration (float) → Normalized session duration (0–1).
PagesVisited (int) → Number of pages visited during the session (1–20).
ItemsViewed (int) → Number of product items viewed (1–15).
DeviceType (int) → 0 = Desktop, 1 = Mobile.
AdClicks (int) → Number of ads clicked during the session (0–5).
CartAdds (int) → Number of items added to the shopping cart (0–3).
Age (int) → Customer’s age, drawn from a normal distribution (mean 33, standard deviation 10), then rounded and clipped to the range 16–75.
Gender (int) → 0 = Woman, 1 = Man.
Loc_Burnaby, Loc_Downtown, Loc_Kitsilano, Loc_New Westminster, Loc_Surrey (boolean) → One-hot encoded location of the customer session; exactly one of the five columns is True in each row.

Target Variable
Purchase (binary) → 1 if the session resulted in a purchase, 0 otherwise.
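
The ranges documented above can be turned into a simple validation step. The following is a sketch under stated assumptions: the helper `find_range_violations` and the three-row `sample` frame are hypothetical, and the ranges are copied from the feature descriptions in this notebook.

```python
import pandas as pd

# Inclusive (low, high) bounds taken from the feature descriptions
expected_ranges = {
    "VisitDuration": (0.0, 1.0),
    "PagesVisited": (1, 20),
    "ItemsViewed": (1, 15),
    "DeviceType": (0, 1),
    "AdClicks": (0, 5),
    "CartAdds": (0, 3),
    "Age": (16, 75),
    "Gender": (0, 1),
    "Purchase": (0, 1),
}

def find_range_violations(df, ranges):
    """Count out-of-range values per column (hypothetical helper)."""
    violations = {}
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            continue  # skip columns missing from this frame
        bad = int((~df[col].between(low, high)).sum())
        if bad:
            violations[col] = bad
    return violations

# Toy frame with one deliberate error: PagesVisited = 25
sample = pd.DataFrame({
    "PagesVisited": [1, 20, 25],
    "Age": [16, 33, 75],
    "Purchase": [0, 1, 1],
})
violations = find_range_violations(sample, expected_ranges)
print(violations)  # {'PagesVisited': 1}
```

Running `find_range_violations(df, expected_ranges)` on the generated dataset should return an empty dictionary if the data matches the descriptions.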