---

### *Title: Introduction to Decision Trees for Marketing Segmentation*
Subtitle: Identifying Target Customers Using Demographic Data
Content:
- Decision trees enable marketers to predict customer behavior based on demographic attributes
- Nodes are split based on metrics like entropy to create pure child nodes
  - *Pure nodes*: all data points have same value for dependent variable
  - *Impure nodes*: data points have different values, require further splitting
- Goal is to create a classification rule determining likelihood of a customer action (e.g. purchasing a product)


### *Title: Constructing a Decision Tree*
Subtitle: A Step-by-Step Process Using Entropy and Impurity Metrics
Content: 
- Begin with a root node containing all data points
- For each independent variable, calculate entropy of potential child nodes 
  - *Entropy*: measures node impurity; with two outcome classes it is 0 for a pure node and 1 at maximum impurity (a 50/50 split)
- Choose split that minimizes weighted average impurity across child nodes
  - *Impurity*: Σ (Entropy(child node) * fraction of observations in child node)
- Continue splitting impure nodes until all terminal nodes are pure
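
The entropy and weighted-impurity definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a full tree builder: the class counts are hypothetical, and `entropy` and `split_impurity` are illustrative helper names rather than library functions.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a node given its class counts, e.g. [buyers, non_buyers]."""
    total = sum(counts)
    # Classes with a zero count are skipped (0 * log2(0) is treated as 0)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def split_impurity(children):
    """Weighted average entropy of the child nodes produced by a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)

# Hypothetical splits of 10 customers into two child nodes: [buys, does not buy]
pure_split = [[4, 0], [0, 6]]    # both children pure
mixed_split = [[2, 2], [3, 3]]   # both children split 50/50

print(split_impurity(pure_split))
print(split_impurity(mixed_split))
```

A split that yields only pure children scores 0, while one whose children are all 50/50 scores 1, so the lower-scoring split is preferred.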


### *Lab Slide 1 - Excel Steps*
Title: Calculating Impurity Metrics in Excel
Content:
1. Use COUNTIFS to tally number of people in each category who do/don't buy yogurt 
   - Example: =COUNTIFS($C$3:$C$12,$C15,$F$3:$F$12,D$14) 
2. Use SUM to total number of people for each attribute value
   - Example: =SUM(D15:E15)
3. Calculate fraction of observations for each attribute value 
   - Example: =F15/SUM($F$15:$F$16)
4. Compute entropy components: P(i|X=a)*Log2(P(i|X=a))
   - Example: =IFERROR((D15/$F15)*LOG(D15/$F15,2),0)
5. Sum entropy components to get total entropy for each node split
   - Example: =SUM(H15:I15) 
6. Calculate split impurity as weighted avg of child node entropies
   - Example: =-SUMPRODUCT(G15:G16,J15:J16)
7. Choose split with lowest impurity and continue until all terminal nodes are pure
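
The same tally-and-compare procedure can be sketched in plain Python. The ten records below are invented for illustration (they are not the contents of the workbook), and the attribute names and positions are assumptions.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical records: (gender, income, buys)
records = [
    ("F", "High", "Yes"), ("F", "High", "Yes"), ("F", "Low", "No"),
    ("F", "Low", "No"),   ("F", "High", "Yes"), ("M", "High", "Yes"),
    ("M", "Low", "No"),   ("M", "Low", "No"),   ("M", "High", "Yes"),
    ("M", "Low", "Yes"),
]
ATTRIBUTES = {"gender": 0, "income": 1}  # attribute name -> tuple position
TARGET = 2                               # position of the outcome

def entropy(class_counts):
    total = sum(class_counts.values())
    return sum(-(c / total) * log2(c / total)
               for c in class_counts.values() if c > 0)

def split_impurity(rows, col):
    # Steps 1-2: tally outcomes per attribute value (the COUNTIFS/SUM step)
    tallies = defaultdict(Counter)
    for row in rows:
        tallies[row[col]][row[TARGET]] += 1
    # Steps 3-6: weight each child's entropy by its share of observations
    n = len(rows)
    return sum(sum(t.values()) / n * entropy(t) for t in tallies.values())

impurities = {name: split_impurity(records, col)
              for name, col in ATTRIBUTES.items()}
best = min(impurities, key=impurities.get)  # step 7: lowest impurity wins
print(impurities, best)
```

In this toy data every high-income customer buys, so the income split leaves one pure child and scores lower than the gender split.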


In [None]:
import numpy as np
import pandas as pd
import sqlite3

# Read Excel data into pandas DataFrame
data = pd.read_excel('Greekyogurt.xlsx', sheet_name='Sheet1', 
                     usecols='A:F', skiprows=2, nrows=10)

# Create SQLite3 database and write DataFrame to table
conn = sqlite3.connect('greekyogurt.db')
data.to_sql('customers', conn, if_exists='replace', index=False)  # replace so the cell can be re-run

# Compute entropy for gender split
cur = conn.cursor()
cur.execute('''
    SELECT 
        Gender,
        SUM(CASE WHEN Buys = 'Yes' THEN 1 ELSE 0 END) AS Buys_Yes,
        SUM(CASE WHEN Buys = 'No' THEN 1 ELSE 0 END) AS Buys_No,
        COUNT(*) AS Total
    FROM customers
    GROUP BY Gender
''')

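gender_counts = pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])

# Class proportions (buys / does not buy) within each gender node
p = gender_counts[['Buys_Yes', 'Buys_No']].div(gender_counts['Total'], axis=0)
# Entropy per node; where(p > 0, 1) makes any 0*log2(0) term contribute 0
node_entropy = -(p * np.log2(p.where(p > 0, 1))).sum(axis=1)
# Split impurity: each node's entropy weighted by its share of observations
weights = gender_counts['Total'] / gender_counts['Total'].sum()
gender_impurity = (weights * node_entropy).sum()

print(f"Impurity for gender split: {gender_impurity:.3f}")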

# Similar process for income and marital status splits
# ...

conn.close()

---

### *Title: Understanding S Curves in Marketing Analytics*
Subtitle: Analyzing product adoption and sales over time
Content:
• What are S curves and why are they important in marketing?
   - S curves show cumulative sales or adoption over time
   - Examples: VAX minicomputers, cars in Italy, railroads in US
• Key insights from S curves:
   - **Upper limit of sales** - maximum potential 
   - **Inflection point** - when sales growth starts slowing
• Strategically, products before inflection point have more growth potential


### *Title: The Mathematics Behind S Curves*
Subtitle: Normal distributions and percentile calculations
Content: 
• Why do S curves occur? Individual adoption times follow a normal distribution
   - Mean time to adopt: 5 years
   - Standard deviation: 1.25 years
• Recreating the S curve:
   - Calculate percentiles of adoption times 
      - e.g. the 10th percentile is the expected time at which the 10th of 100 people adopts
   - Count cumulative adoptions at each time point
• Inflection point occurs at mean adoption time (t=5)
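
The percentile-and-count construction can be sketched with Python's standard library alone. This is a minimal sketch assuming 99 adopters at percentiles 1 through 99 and integer time points 0 through 10; those choices are illustrative, not taken from the workbook.

```python
from statistics import NormalDist

# Mean and standard deviation of adoption time, as given on the slide
adoption = NormalDist(mu=5, sigma=1.25)

# Adoption time at each percentile 1..99 (the 100th percentile is unbounded)
times = [adoption.inv_cdf(p / 100) for p in range(1, 100)]

# Cumulative number of adopters at integer time points t = 0..10
cumulative = {t: sum(1 for x in times if x <= t) for t in range(11)}

for t, count in cumulative.items():
    print(t, count)
```

Plotting `cumulative` against time gives the S shape: counts rise slowly at first, fastest around the mean adoption time, then level off as the remaining adopters trickle in.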


### *Title: Recreating the S Curve in Excel*
Content:
1. In H9:H107, compute average adoption time for each percentile
   - Use NORMINV function: `=NORMINV(G9/100,$I$4,$I$5)`
   - G9 is percentile, $I$4 is mean, $I$5 is standard deviation 
2. In K11:K70, count cumulative adoptions at each time point
   - Use COUNTIF: `=COUNTIF($H$9:$H$107,"<="&J11)`
   - Counts adoptions up to time in J11
3. Graph J10:K70 as scatter chart to produce S curve 
   - Inflection point around t=5 (mean adoption time)



In [None]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from scipy.stats import norm

# Load Excel data into pandas
data = pd.read_excel('Scurvenormal.xlsx', sheet_name='Sheet1', usecols='G:I', skiprows=3, nrows=100)

# Calculate adoption times using normal distribution
mean = 5
std = 1.25
data['Time'] = norm.ppf(data['Person']/100, loc=mean, scale=std)

# Create SQLite database and write data
conn = sqlite3.connect('adoption.db')
data.to_sql('adoptions', conn, if_exists='replace', index=False)  # replace so the cell can be re-run

# Query cumulative adoptions up to a given time
query = '''
SELECT COUNT(*) AS People
FROM adoptions
WHERE Time <= ?
'''

# Initialize plot
fig, ax = plt.subplots()

# Loop through time points and plot the cumulative count at each one
for t in range(11):
    count = conn.execute(query, (t,)).fetchone()[0]
    ax.scatter(t, count)

# Customize and display plot        
ax.set_xlim([0,10])        
ax.set_ylim([0,100])
ax.set_xlabel('Time')
ax.set_ylabel('Cumulative Sales')
ax.set_title('S Curve of Product Adoption')
plt.show()

conn.close()

---

### *Title: Modeling Product Diffusion with the Logistic Curve*
Subtitle: Understanding the path of product adoption over time
Content:
• What is the Logistic (Pearl) curve and why is it useful in marketing?
   - Models cumulative sales, market penetration, or sales per capita 
   - Defined by equation: x(t) = L / (1 + a·e^(-bt))
      - *L*: upper limit; *a* and *b* determine the curve's position and steepness
• Key insights from Logistic curve:
   - **Upper limit (L)** - maximum potential sales or adoption
   - **Inflection point** - when growth rate starts slowing 
      - Occurs at t = Ln(a) / b, where x(t) = L/2
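
The inflection formula can be checked numerically. The sketch below borrows the parameter values quoted later in this deck for the cell phone example; at t = Ln(a)/b the term a·e^(-bt) equals 1, so the curve sits at half of its upper limit. The result should land close to the t ≈ 7.67 quoted for that example, and any small difference presumably reflects rounding of the displayed parameters.

```python
from math import exp, log

def logistic(t, L, a, b):
    """Logistic (Pearl) curve value at time t."""
    return L / (1 + a * exp(-b * t))

# Parameter values quoted for the cell phone example in this deck
L, a, b = 118.17, 11.618, 0.319

t_star = log(a) / b                      # inflection time
half_limit = L / 2

print(f"inflection at t = {t_star:.2f}")
print(f"x(t*) = {logistic(t_star, L, a, b):.3f}, L/2 = {half_limit:.3f}")

# Growth over a unit interval is largest when the interval is centered on t*
growth_at = lambda t: logistic(t + 0.5, L, a, b) - logistic(t - 0.5, L, a, b)
print(growth_at(t_star - 3), growth_at(t_star), growth_at(t_star + 3))
```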


### *Title: Estimating Logistic Curve Parameters in Excel*
Subtitle: Using Solver to fit real-world data
Content:
• How can we estimate the L, a, and b parameters of a Logistic curve?
   - Input historical data and set up Logistic equation in Excel
   - Use Solver to minimize squared error by changing L, a, b
      - *Squared error*: (Actual - Estimated)^2
• Example: Modeling global cell phone adoption per 100 people
   - Estimated equation: x(t) = 118.17 / (1 + 11.618·e^(-0.319t))
   - Inflection point occurred in 2008 (t=7.67 years after 2001)
   - Forecast 2012-2014 adoption by extending model
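
For readers without Solver, the least-squares idea can be sketched using only the standard library. Note that this swaps in a different technique: a grid search over integer values of L combined with a straight-line fit to ln(L/x - 1) = ln(a) - b·t, which is not what Solver does and will generally give slightly different estimates on noisy data. The observations are synthetic, generated from the parameter values quoted above, and are not the real adoption figures; the time index 1 to 11 is also an assumption.

```python
from math import exp, log

def logistic(t, L, a, b):
    return L / (1 + a * exp(-b * t))

# Synthetic, noise-free observations (NOT the real data), t = 1..11
ts = list(range(1, 12))
ys = [logistic(t, 118.17, 11.618, 0.319) for t in ts]

def fit_for_fixed_L(L):
    """Least-squares line through ln(L/x - 1) = ln(a) - b*t; returns (a, b)."""
    zs = [log(L / y - 1) for y in ys]
    n = len(ts)
    t_mean, z_mean = sum(ts) / n, sum(zs) / n
    slope = (sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = z_mean - slope * t_mean
    return exp(intercept), -slope

def sse(L, a, b):
    """Sum of squared errors: (Actual - Estimated)^2 over all observations."""
    return sum((y - logistic(t, L, a, b)) ** 2 for t, y in zip(ts, ys))

# Candidate limits: integers above the largest observation, capped at 200
candidates = range(int(max(ys)) + 1, 201)
best_L = min(candidates, key=lambda L: sse(L, *fit_for_fixed_L(L)))
best_a, best_b = fit_for_fixed_L(best_L)
print(best_L, round(best_a, 2), round(best_b, 3))
```

Because the grid only tries integer limits, the recovered L is the nearest integer rather than the exact value; a finer grid or a general-purpose optimizer would tighten it.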


### *Title: Fitting a Logistic Curve in Excel*
Content:
1. Enter historical data (year in column C, actuals in column E)
2. In F2:H2, input initial guesses for L, a, and b parameters
3. In F5:F15, compute the model estimate for each year using the Logistic formula
   - Formula in F5: `=L/(1+a*EXP(-b*C5))`
4. In G5:G15, calculate squared error for each estimate 
   - Formula in G5: `=(E5-F5)^2`
5. In C3, sum squared errors with `=SUM(G5:G15)`  
6. Use Solver to minimize C3 by changing cells $F$2:$H$2
   - Set constraints: $F$2:$H$2 >= 0, L <= 200
7. Extend model by copying formula in F15 to F16:F18


In [None]:
import pandas as pd
import sqlite3
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

# Load data from Excel into pandas DataFrame
data = pd.read_excel('worldcellpearl.xlsx', sheet_name='Sheet1', 
                     usecols='C,E', skiprows=4, nrows=11)

# Rename columns and reset index to year
data.columns = ['Year', 'Actual']
data.set_index('Year', inplace=True)
 
# Create SQLite database and write data
conn = sqlite3.connect('cellphones.db')
data.to_sql('adoption', conn, if_exists='replace')  # replace so the cell can be re-run

# Define the logistic function
def logistic(t, L, a, b):
    return L / (1 + a * np.exp(-b*t))

# Extract year and actuals from database
query = 'SELECT * FROM adoption'
t, y = zip(*conn.execute(query))

# Fit logistic curve
popt, pcov = curve_fit(logistic, t, y, p0=(100, 10, 0.1))
L, a, b = popt

print(f"Estimated equation: {L:.2f} / (1 + {a:.2f} * exp(-{b:.2f} * t))")

# Calculate inflection point, in the same time units as column C
# (converting it to a calendar year depends on how that column encodes time)
inflection_t = np.log(a) / b
print(f"Inflection point occurs at t = {inflection_t:.2f}")

# Plot data and fitted curve
t_ext = np.append(t, [t[-1]+1, t[-1]+2, t[-1]+3])  # Extend for forecast
y_ext = logistic(t_ext, L, a, b)

plt.figure()
plt.plot(t, y, 'o', label='Actual')  
plt.plot(t_ext, y_ext, label='Logistic Model')
plt.xlabel('Year')
plt.ylabel('Cell Phones per 100 People')
plt.legend()
plt.show()

conn.close()

---

### *Title: Incorporating Seasonality into S-Curve Fitting*
Subtitle: Adapting logistic curves for quarterly or monthly data
Content:
- When fitting a logistic curve to quarterly/monthly data, seasonality must be incorporated
  - Multiply S-curve forecast by appropriate seasonal index
  - Add seasonal indices as changing cells
  - Choose forecast parameters to minimize sum of squared errors
- Example: Fitting seasonal logistic curve to 2002-2006 quarterly iPod sales
  - Data in "iPodsseasonal.xls" file
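
The seasonal adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and the index and parameter values are placeholders, not estimates fitted to the iPod data.

```python
from math import exp

def logistic(t, L, a, b):
    return L / (1 + a * exp(-b * t))

def normalize(indices):
    """Rescale seasonal indices so they average exactly 1."""
    mean = sum(indices) / len(indices)
    return [s / mean for s in indices]

def seasonal_forecast(t, quarter, L, a, b, indices):
    """S-curve value for period t, scaled by the index of its quarter (1-4)."""
    return logistic(t, L, a, b) * indices[quarter - 1]

def sum_squared_errors(actuals, L, a, b, indices):
    """actuals: list of (t, quarter, observed) tuples."""
    return sum((obs - seasonal_forecast(t, q, L, a, b, indices)) ** 2
               for t, q, obs in actuals)

# Placeholder values for illustration only
indices = normalize([0.9, 0.8, 0.8, 1.5])
L, a, b = 3.0, 1000.0, 0.5

print(seasonal_forecast(12, 4, L, a, b, indices))
```

An optimizer would then adjust L, a, b, and the four indices together to minimize `sum_squared_errors`, keeping the indices averaging 1 so that seasonal effects redistribute sales across quarters without changing the overall level.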


### *Title: Fitting a Seasonal S-Curve in Excel*
Subtitle: Step-by-step process using iPod sales data
Content: 
1. Compute sales per 100 people for each quarter
2. Calculate forecast by multiplying S-curve value by seasonal index
3. Compute squared error for each prediction  
4. Calculate sum of squared errors
5. Add constraint for seasonal indices to average 1
6. Use Solver to find optimal parameters (e.g., a=1000)


### *Title: Fitting a Seasonal S-Curve in Excel*
Content:
1. In cells F5:F19, compute sales per 100 people for each quarter using the formula: 
   =100*D5/E5
2. In cells G5:G19, calculate the forecast for each quarter using the formula:
   =(L/(1+a*EXP(-b*A5)))*HLOOKUP(C5,seaslook,2)
3. In cells H5:H19, compute the squared error for each prediction with the formula:
   =(F5-G5)^2  
4. In cell H3, calculate the sum of squared errors using:
   =SUM(H5:H19)
5. Add the constraint $N$2=1 to ensure seasonal indices average to 1
6. Use Solver with the GRG MultiStart Engine:
   - Set objective to H3 (sum of squared errors)
   - By changing variable cells $G$2:$M$2  
   - Subject to constraint $N$2=1
   - Raise upper bound of 'a' to 10,000 if needed



In [None]:
import pandas as pd
import sqlite3
import numpy as np
from sklearn.metrics import mean_squared_error

# Load data from Excel into a pandas DataFrame
data = pd.read_excel('iPodsseasonal.xlsx', sheet_name='Sheet1', usecols='A:F', skiprows=3, nrows=16)

# Connect to SQLite3 database
conn = sqlite3.connect('iPodsseasonal.db')

# Write DataFrame to SQLite3 table
data.to_sql('sales_data', conn, if_exists='replace', index=False)

# Query data from SQLite3
query = '''
    SELECT * 
    FROM sales_data
'''
df = pd.read_sql_query(query, conn)

# Compute sales per 100 people
df['Sales_per_100'] = 100 * df['Sales'] / df['Pop']

# Define the logistic function
def logistic(x, L, a, b):
    return L / (1 + a * np.exp(-b * x))

# Seasonal indices for Q1-Q4, held fixed here (the Excel model fits them with Solver)
seasonal_indices = [0.9, 0.83, 0.77, 1.49]

# Optimize parameters using least squares
from scipy.optimize import curve_fit

quarters = df.index.tolist()
sales_per_100 = df['Sales_per_100'].tolist()

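def logistic_seasonal(x, L, a, b):
    # curve_fit passes x as an array, so look up each quarter's index element-wise
    x = np.asarray(x)
    return logistic(x, L, a, b) * np.asarray(seasonal_indices)[x.astype(int) % 4]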

params, _ = curve_fit(logistic_seasonal, quarters, sales_per_100, p0=[3, 1000, 0.5])

L, a, b = params

# Generate predictions
predictions = [logistic_seasonal(x, L, a, b) for x in quarters]

# Calculate sum of squared errors
sse = mean_squared_error(sales_per_100, predictions) * len(sales_per_100)
print(f"Sum of Squared Errors: {sse:.2f}")

# Plot actual vs predicted values
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(quarters, sales_per_100, 'bo-', label='Actual')
plt.plot(quarters, predictions, 'r*-', label='Predicted') 
plt.xlabel('Quarter')
plt.ylabel('Sales per 100 People')
plt.title('iPod Sales: Actual vs Predicted')
plt.xticks(quarters, [f"Q{(q%4)+1} {2002+(q//4)}" for q in quarters], rotation=45)
plt.legend()
plt.tight_layout()
plt.show()