STEP 1 ‚Äî Create a Small Sample Dataset

Before generating 200,000 synthetic customers, we need a small dataset that represents the market.
To learn properly, we will create a tiny sample of 10 customers manually using code.
Because if you understand the small dataset, you understand the distribution, and then you can scale that up confidently.

In [4]:
import pandas as pd

sample_data = {
    "Gender":  [1, 2, 1, 1, 2, 2, 1, 2, 1, 2],
    "Age_Group": [1, 2, 2, 3, 1, 3, 2, 2, 1, 3],
    "Income_Level":   [1, 2, 3, 2, 1, 3, 2, 1, 3, 2],
    "Shop_Frequency": [1, 3, 2, 2, 1, 3, 3, 2, 1, 2],
    "Product_Type":   [1, 3, 2, 1, 1, 3, 2, 2, 1, 3]
}

sample_df = pd.DataFrame(sample_data)
print(sample_df)


   Gender  Age_Group  Income_Level  Shop_Frequency  Product_Type
0       1          1             1               1             1
1       2          2             2               3             3
2       1          2             3               2             2
3       1          3             2               2             1
4       2          1             1               1             1
5       2          3             3               3             3
6       1          2             2               3             2
7       2          2             1               2             2
8       1          1             3               1             1
9       2          3             2               2             3


STEP 2 ‚Äî Compute frequencies of all combinations

We want to answer: ‚ÄúHow common is each customer type in the sample?‚Äù
For example:
-(Male, young, low income, shops rarely, buys electronics)
-(Female, middle-aged, high income, shops frequently, buys clothing)

In [6]:
freqs = sample_df.groupby(
    ["Gender", "Age_Group", "Income_Level", "Shop_Frequency", "Product_Type"]
).size().to_dict()
freqs

{(1, 1, 1, 1, 1): 1,
 (1, 1, 3, 1, 1): 1,
 (1, 2, 2, 3, 2): 1,
 (1, 2, 3, 2, 2): 1,
 (1, 3, 2, 2, 1): 1,
 (2, 1, 1, 1, 1): 1,
 (2, 2, 1, 2, 2): 1,
 (2, 2, 2, 3, 3): 1,
 (2, 3, 2, 2, 3): 1,
 (2, 3, 3, 3, 3): 1}

STEP 3 ‚Äî Scaling frequencies up to a 200,000-customer population
This step teaches you how to take a small sample distribution and expand it correctly.

Given:
-sample size = 10
-target synthetic population = 200,000

How many synthetic customers should correspond to each combination?
For each combination:
ùëõ =(sample count/sample size)√ótarget population
ùëõ=(1/10)√ó200000=20000
Because each sample entry represents 10% of the data.

In [9]:
sample_size=len(sample_df)
target_pop = 200000
scaled_freqs={}
for combination,count in freqs.items():
    n = int((count/sample_size)*target_pop)
    scaled_freqs[combination]=n
scaled_freqs

{(1, 1, 1, 1, 1): 20000,
 (1, 1, 3, 1, 1): 20000,
 (1, 2, 2, 3, 2): 20000,
 (1, 2, 3, 2, 2): 20000,
 (1, 3, 2, 2, 1): 20000,
 (2, 1, 1, 1, 1): 20000,
 (2, 2, 1, 2, 2): 20000,
 (2, 2, 2, 3, 3): 20000,
 (2, 3, 2, 2, 3): 20000,
 (2, 3, 3, 3, 3): 20000}

STEP 4 ‚Äî Build the Synthetic Population DataFrame
We are going to:
-Create an empty table with 200,000 rows
-Fill in the rows according to scaled_freqs
-Do this combination-by-combination

Step 4A ‚Äî Create an empty DataFrame

In [19]:
import numpy as np

synthetic_population = pd.DataFrame(
    index=range(target_pop),
    columns = ["Gender", "Age_Group", "Income_Level", "Shop_Frequency", "Product_Type"]
)
synthetic_population[:]=np.nan
synthetic_population.head()

Unnamed: 0,Gender,Age_Group,Income_Level,Shop_Frequency,Product_Type
0,,,,,
1,,,,,
2,,,,,
3,,,,,
4,,,,,


Step 4B ‚Äî Insert each combination into the DataFrame

In [20]:
for combination, n in scaled_freqs.items():
     # Find the indices of 'n' rows in synthetic_population where "Gender" is missing (NaN)
    empty_indices = synthetic_population[synthetic_population["Gender"].isna()].sample(n=n, replace=False).index
    # Assign the current combination of attributes to the selected rows
    synthetic_population.loc[empty_indices, ["Gender", "Age_Group", "Income_Level", "Shop_Frequency", "Product_Type"]] = combination
# Check how many missing values remain in each column after filling
synthetic_population.isna().sum()

Gender            0
Age_Group         0
Income_Level      0
Shop_Frequency    0
Product_Type      0
dtype: int64

In [17]:
synthetic_population.head()

Unnamed: 0,Gender,Age_Group,Income_Level,Shop_Frequency,Product_Type
0,1,3,2,2,1
1,2,1,1,1,1
2,1,3,2,2,1
3,1,1,3,1,1
4,2,3,3,3,3


STEP 5 ‚Äî Add Realistic Features (Noise)
In this step, we'll take your population and add a bit of real-world variability to make it feel more like actual data. We'll introduce some randomness, like:
-Salaries (based on income group)
-Number of purchases (based on shopping frequency)
-Product purchase patterns (based on product type)
Adding noise makes the data more realistic and also helps with testing machine learning models later.

What We‚Äôll Do:
1. Generate random salaries based on income group (low, middle, high).
2. Add random purchase counts based on shopping frequency (low, medium, high).
3. Add variability to product types based on user groupings.

Step 5A ‚Äî Add Random Salaries
For simplicity, let‚Äôs assume:
-Low income = salary range between 10,000 and 30,000
-Middle income = salary range between 30,000 and 60,000
-High income = salary range between 60,000 and 100,000
We will use the np.random.randint function to generate these values.

In [None]:
def generate_salary(income_level):
    if income_level == 1:
        return np.random.randint(10000, 30000)
    elif income_level == 2:
        return np.random.randint(30000, 60000)
    elif income_level == 3:
        return np.random.randint(60000, 100000)

synthetic_population["Salary"]=synthetic_population["Income_Level"].apply(generate_salary)
synthetic_population.head()

Unnamed: 0,Gender,Age_Group,Income_Level,Shop_Frequency,Product_Type,Salary
0,1,1,3,1,1,73247
1,2,3,3,3,3,79965
2,2,2,2,3,3,45093
3,2,3,2,2,3,40471
4,2,2,2,3,3,33832


Step 5B ‚Äî Random Purchase Counts
We'll assign purchase counts based on Shop_Frequency:
Shop_Frequency  Meaning     Number of Purchases (per month)
    1           Rarely          0 ‚Äì 2
    2           Sometimes       3 ‚Äì 6
    3           Frequently      7 ‚Äì 15
We'll again use np.random.randint to generate some randomness.

In [26]:
def generate_purchases(freq):
    if freq==1:
        return np.random.randint(0,2)
    elif freq==2:
        return np.random.randint(3,6)
    elif freq==3:
        return np.random.randint(7,15)

synthetic_population["Monthly_Purchase"]=synthetic_population["Shop_Frequency"].apply(generate_purchases)
print(synthetic_population.head())
synthetic_population["Monthly_Purchase"].describe()

  Gender Age_Group Income_Level Shop_Frequency Product_Type  Salary  \
0      1         1            3              1            1   73247   
1      2         3            3              3            3   79965   
2      2         2            2              3            3   45093   
3      2         3            2              2            3   40471   
4      2         2            2              3            3   33832   

   Monthly_Purchase  
0                 0  
1                10  
2                11  
3                 5  
4                12  


count    200000.00000
mean          4.90125
std           4.17720
min           0.00000
25%           1.00000
50%           4.00000
75%           8.00000
max          14.00000
Name: Monthly_Purchase, dtype: float64

Step 5C ‚Äî Add Randomness to Product Preferences. This step introduces some variability in the Product_Type choices based on Age_Group and Income_Level to make the data feel more lifelike.

-Young customers (Age_Group = 1) might prefer tech gadgets, fashion items, or gaming products.
-Middle-aged customers (Age_Group = 2) may prefer home appliances, furniture, or gadgets.
-Older customers (Age_Group = 3) could be more interested in health products, books, or home decor.
-Low-income customers might prefer lower-cost items, while high-income customers might opt for premium products.

We will use random choice within these groups for the Product_Type column.

Product type:
1- Fashion
2- Gaming
3- Tech, Home Appliances
4- Luxury Gadgets, Kitchen Items, Furniture
5- Home Appliances, Gadgets
6- Luxury Home, Health Products
7- Books, Luxury Books

In [30]:
def generate_product_type(age, income):
    if age==1:
        if income == 1:
            return np.random.choice([1,2])
        elif income ==2:
            return np.random.choice([1, 2, 3])
        elif income == 3: 
            return np.random.choice([3, 4, 5])
    elif age==2:
        if income == 1:
            return np.random.choice([3, 4, 5])
        elif income ==2:
            return np.random.choice([1, 3, 4])
        elif income == 3: 
            return np.random.choice([6, 7])
    elif age==3:
        if income == 1:
            return np.random.choice([1, 3, 5])
        elif income ==2:
            return np.random.choice([2, 4, 6])
        elif income == 3: 
            return np.random.choice([5, 6, 7])
synthetic_population["Product_Type"]= synthetic_population.apply(
    lambda row: generate_product_type(row["Age_Group"], row["Income_Level"]), axis=1
)
synthetic_population.head()

Unnamed: 0,Gender,Age_Group,Income_Level,Shop_Frequency,Product_Type,Salary,Monthly_Purchase
0,1,1,3,1,4,73247,0
1,2,3,3,3,6,79965,10
2,2,2,2,3,3,45093,11
3,2,3,2,2,4,40471,5
4,2,2,2,3,4,33832,12


| Column           | Description                                   |
| ---------------- | --------------------------------------------- |
| Gender           | 1 = Male, 2 = Female                          |
| Age_Group        | 1 = Young, 2 = Middle-aged, 3 = Older         |
| Income_Level     | 1 = Low, 2 = Medium, 3 = High                 |
| Shop_Frequency   | 1 = Rarely, 2 = Sometimes, 3 = Frequently     |
| Product_Type     | Randomized based on Age_Group & Income_Level  |
| Salary           | Random salary based on Income_Level           |
| Monthly_Purchase | Random purchase count based on Shop_Frequency |

Step 6: Save Your Synthetic Dataset

In [31]:
synthetic_population.to_csv("Synthetic_customer_data.csv", index=False)
print("Synthetic Data saved successfully")

Synthetic Data saved successfully


Optional Next Steps:
-Analyze & Visualize
|-Plot distributions of Age_Group, Income_Level, Product_Type, etc.
|-Look at correlations between Salary, Purchases, and Product_Type.

-Segment Customers with K-Means
|-Cluster customers into groups based on Salary, Monthly_Purchases, and Product_Type.
|-Discover patterns like ‚ÄúHigh-income frequent buyers‚Äù or ‚ÄúYoung occasional shoppers.‚Äù

-Machine Learning Simulations
|-Predict Monthly_Purchases based on features using regression.
|-Classify likely Product_Type using classification algorithms.