# Multiple Linear Regression Analysis

## Sales and Marketing Data Analysis

*Analysis of marketing channel effectiveness on sales performance*

---

## Project Overview

In this analysis, I'll explore the relationship between different marketing channels and sales performance using multiple linear regression. The goal is to understand which marketing investments drive the most sales and to build a predictive model.

The dataset contains information about:

- TV advertising spend levels
- Radio advertising spend
- Social Media advertising spend
- Influencer marketing levels
- Resulting Sales figures

Let's dive into the analysis!

## 1. Data Import and Setup

First, I'll import the necessary libraries for my analysis.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings

warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

### Loading the Dataset

Now I'll load the sales and marketing data to begin my analysis.

In [None]:
# Load the dataset
data = pd.read_csv('sales_marketing_data.csv')

# Display basic information about the dataset
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
data.head()

In [None]:
# Get basic information about the data
print("Dataset Info:")
data.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\nSummary Statistics:")
data.describe()

## 2. Exploratory Data Analysis

Let me explore the data to understand the relationships between variables and identify any patterns.

### Data Overview and Cleaning

First, I'll check for any data quality issues.

In [None]:
# Check for missing values
print("Missing values per column:")
print(data.isnull().sum())

# Check for duplicates
print(f"\nNumber of duplicate rows: {data.duplicated().sum()}")

# Look at unique values for categorical variables
print("\nUnique values in TV column:", data['TV'].unique())
print("Unique values in Influencer column:", data['Influencer'].unique())
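
The cell above only reports missing values and duplicates; it does not remove them. As a minimal sketch of one possible cleaning step, the example below drops incomplete and duplicated rows. The toy frame and its category labels (`Low`, `Mega`, and so on) are hypothetical stand-ins, not values taken from the real dataset, and dropping rows is just one option (imputation is another).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the columns used in this notebook.
toy = pd.DataFrame({
    "TV": ["Low", "High", None, "Medium", "Low"],
    "Radio": [3.5, 30.1, 12.0, np.nan, 3.5],
    "Social Media": [1.2, 6.9, 2.5, 3.1, 1.2],
    "Influencer": ["Micro", "Mega", "Nano", "Macro", "Micro"],
    "Sales": [55.0, 330.2, 190.4, 201.7, 55.0],
})

# Drop rows with any missing value, then drop exact duplicate rows.
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

print("Rows before cleaning:", len(toy))
print("Rows after cleaning:", len(cleaned))
```

Whether dropping rows is acceptable depends on how many rows are affected; if a large share of the data is incomplete, imputing values may be a better choice.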

### Visualizing Relationships

I'll create visualizations to understand how each marketing channel relates to sales.

In [None]:
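# Create a pairplot to see pairwise relationships between the numeric columns.
# sns.pairplot builds its own figure, so a preceding plt.figure() call would
# only leave an empty extra figure behind.
grid = sns.pairplot(data, hue='TV', diag_kind='hist')
grid.figure.suptitle('Pairplot of Marketing Channels and Sales', y=1.02)
plt.show()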

In [None]:
# Create correlation heatmap for numerical variables
numerical_cols = ['Radio', 'Social Media', 'Sales']

plt.figure(figsize=(8, 6))
correlation_matrix = data[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix: Marketing Channels vs Sales')
plt.tight_layout()
plt.show()
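
To make the numbers in the heatmap concrete, here is a small sketch of what `DataFrame.corr()` computes by default: the Pearson correlation coefficient. The five data points are made up for illustration and are not from the sales dataset.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, not taken from the real dataset.
toy = pd.DataFrame({
    "Radio": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Sales": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# Pearson r: covariance of the centered variables divided by the
# product of their standard deviations.
x = toy["Radio"] - toy["Radio"].mean()
y = toy["Sales"] - toy["Sales"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas uses the same definition by default (method='pearson').
r_pandas = toy["Radio"].corr(toy["Sales"])

print(f"Manual Pearson r: {r_manual:.4f}")
print(f"pandas  Pearson r: {r_pandas:.4f}")
```

Pearson correlation only measures linear association, which is why the scatter plots later in this section are a useful complement to the heatmap.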

### Analyzing Categorical Variables

Let me examine how the categorical variables (TV and Influencer) affect sales.

In [None]:
# Analyze sales by TV advertising level
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
tv_sales = data.groupby('TV')['Sales'].mean().sort_values(ascending=False)
tv_sales.plot(kind='bar', color='skyblue')
plt.title('Average Sales by TV Advertising Level')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
sns.boxplot(data=data, x='TV', y='Sales')
plt.title('Sales Distribution by TV Advertising Level')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

print("Average Sales by TV Level:")
print(tv_sales)

In [None]:
# Analyze sales by Influencer marketing level
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
influencer_sales = data.groupby('Influencer')['Sales'].mean().sort_values(ascending=False)
influencer_sales.plot(kind='bar', color='lightcoral')
plt.title('Average Sales by Influencer Marketing Level')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
sns.boxplot(data=data, x='Influencer', y='Sales')
plt.title('Sales Distribution by Influencer Marketing Level')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

print("Average Sales by Influencer Level:")
print(influencer_sales)

### Scatter Plots for Continuous Variables

Let me examine the relationships between continuous marketing spend and sales.

In [None]:
# Create scatter plots for continuous variables
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Radio vs Sales
axes[0].scatter(data['Radio'], data['Sales'], alpha=0.6, color='green')
axes[0].set_xlabel('Radio Advertising Spend')
axes[0].set_ylabel('Sales')
axes[0].set_title('Radio Advertising vs Sales')
axes[0].grid(True, alpha=0.3)

# Social Media vs Sales
axes[1].scatter(data['Social Media'], data['Sales'], alpha=0.6, color='purple')
axes[1].set_xlabel('Social Media Advertising Spend')
axes[1].set_ylabel('Sales')
axes[1].set_title('Social Media Advertising vs Sales')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Data Preparation for Modeling

Before building the regression model, I need to prepare the data by encoding categorical variables.

In [None]:
# Create dummy variables for categorical features.
# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which makes X.values an object array and can break the statsmodels calls below.

# TV advertising levels
tv_dummies = pd.get_dummies(data['TV'], prefix='TV', drop_first=True, dtype=float)

# Influencer marketing levels
influencer_dummies = pd.get_dummies(data['Influencer'], prefix='Influencer',
                                    drop_first=True, dtype=float)

# Combine all features
X = pd.concat([
    data[['Radio', 'Social Media']],
    tv_dummies,
    influencer_dummies
], axis=1)

# Target variable
y = data['Sales']

print("Feature matrix shape:", X.shape)
print("\nFeature columns:")
print(X.columns.tolist())
print("\nFirst few rows of feature matrix:")
X.head()
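
The `drop_first=True` argument matters for interpretation, so here is a small sketch of its effect. The level names `Low`, `Medium`, and `High` are assumed for illustration; the real TV levels are whatever `data['TV'].unique()` reports.

```python
import pandas as pd

# Assumed level names, for illustration only.
levels = pd.Series(["Low", "Medium", "High", "Low"], name="TV")

# Full one-hot encoding: one column per level, and each row sums to 1.
full = pd.get_dummies(levels, prefix="TV", dtype=float)

# drop_first=True removes the first level in sorted order ("High" here),
# which becomes the baseline that the other levels are compared against.
reduced = pd.get_dummies(levels, prefix="TV", drop_first=True, dtype=float)

print("Full encoding columns:   ", full.columns.tolist())
print("Reduced encoding columns:", reduced.columns.tolist())
```

Because the full encoding's columns always sum to 1, they are perfectly collinear with the model's intercept; dropping one level avoids that. Each remaining dummy coefficient is then read as the difference in expected sales relative to the dropped baseline level.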

### Checking for Multicollinearity

I'll calculate the Variance Inflation Factor (VIF) to check for multicollinearity issues.

In [None]:
# Calculate VIF for each feature
def calculate_vif(df):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    # The constant is needed for a correct calculation, but its own VIF is not
    # meaningful, so it is excluded from the report and from the threshold check.
    vif_data = vif_data[vif_data["Feature"] != "const"]
    return vif_data.sort_values('VIF', ascending=False)

# Add constant for VIF calculation; cast to float so df.values is a numeric array
X_with_const = sm.add_constant(X.astype(float))
vif_results = calculate_vif(X_with_const)

print("Variance Inflation Factors:")
print(vif_results)

# Check if any VIF values are concerning (typically > 5 or 10)
high_vif = vif_results[vif_results['VIF'] > 5]
if len(high_vif) > 0:
    print("\nFeatures with high VIF (>5):")
    print(high_vif)
else:
    print("\nNo concerning multicollinearity detected (all VIF <= 5)")

## 4. Multiple Linear Regression Model

Now I'll build and evaluate the multiple linear regression model.

### Model Training

I'll split the data and train the regression model.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print("\nModel training completed!")

### Model Evaluation

Let me evaluate the model performance using various metrics.

In [None]:
# Calculate performance metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("Model Performance Metrics:")
print(f"Training R²: {train_r2:.4f}")
print(f"Testing R²: {test_r2:.4f}")
print(f"Training RMSE: {train_rmse:.4f}")
print(f"Testing RMSE: {test_rmse:.4f}")

# Check for overfitting: only a training score well above the test score is a
# warning sign, so the gap is not taken as an absolute value. The 0.1 threshold
# is a rule of thumb.
if train_r2 - test_r2 > 0.1:
    print("\nWarning: Potential overfitting detected (training R² well above test R²)")
else:
    print("\nGood: No significant overfitting detected")
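
As a quick sanity check on what these two metrics mean, the sketch below computes R² and RMSE by hand on four made-up values and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

# R² = 1 - SS_res / SS_tot: the share of variance the predictions explain.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared error, in the units of the target.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R²   manual={r2_manual:.4f}  sklearn={r2_score(y_true, y_pred):.4f}")
print(f"RMSE manual={rmse_manual:.4f}  "
      f"sklearn={np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
```

Here every prediction is off by 0.5, so the RMSE is 0.5, and R² is 0.95 because the squared errors sum to 1 against a total sum of squares of 20.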

### Detailed Statistical Analysis

I'll use statsmodels to get detailed statistical information about the model.

In [None]:
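# Use statsmodels for detailed statistical analysis.
# Note: this model is fit on the full dataset, so its coefficients will differ
# slightly from the scikit-learn model, which was fit on the training split only.
# Casting to float guards against boolean dummy columns.
X_with_const = sm.add_constant(X.astype(float))
sm_model = sm.OLS(y, X_with_const).fit()

# Display comprehensive model summary
print("Detailed Model Summary:")
print(sm_model.summary())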

### Model Coefficients Interpretation

Let me analyze and interpret the model coefficients to understand the impact of each marketing channel.

In [None]:
# Extract and display model coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_,
    'Abs_Coefficient': np.abs(model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("Model Coefficients (sorted by absolute value):")
print(coefficients)

# Visualize coefficients
plt.figure(figsize=(10, 6))
plt.barh(coefficients['Feature'], coefficients['Coefficient'])
plt.xlabel('Coefficient Value')
plt.title('Multiple Linear Regression Coefficients')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nIntercept: {model.intercept_:.4f}")
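
Sorting by absolute coefficient can mislead when features are measured on different scales, because a raw coefficient is expressed per unit of its feature. One common remedy, sketched below on synthetic data, is to standardize the features before fitting so that each coefficient reflects a one-standard-deviation change. The data and the choice of `StandardScaler` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(1)
n = 500
X_toy = pd.DataFrame({
    "Radio": rng.normal(20, 10, n),         # wide spread
    "Social Media": rng.normal(3, 0.5, n),  # narrow spread
})
y_toy = 2.0 * X_toy["Radio"] + 10.0 * X_toy["Social Media"] + rng.normal(0, 1, n)

# Raw coefficients: change in target per one unit of each feature.
raw = LinearRegression().fit(X_toy, y_toy)

# Standardized coefficients: change in target per one standard deviation.
X_scaled = StandardScaler().fit_transform(X_toy)
std = LinearRegression().fit(X_scaled, y_toy)

comparison = pd.DataFrame({
    "raw_coef": raw.coef_,
    "standardized_coef": std.coef_,
}, index=X_toy.columns)
print(comparison.round(3))
```

In this toy setup the raw coefficient of `Social Media` is larger, yet `Radio` has the larger standardized coefficient because it varies far more. Dummy variables are usually left unscaled, since their coefficients already have a direct reading as a shift relative to the baseline level.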

### Residual Analysis

I'll analyze the residuals to check model assumptions.

In [None]:
# Residual analysis (computed on the held-out test set)
from scipy import stats

residuals = y_test - y_test_pred

# Create residual plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Residuals vs Fitted Values
axes[0, 0].scatter(y_test_pred, residuals, alpha=0.6)
axes[0, 0].axhline(y=0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted Values')
axes[0, 0].grid(True, alpha=0.3)

# Q-Q plot for normality check
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot of Residuals')
axes[0, 1].grid(True, alpha=0.3)

# Histogram of residuals
axes[1, 0].hist(residuals, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)

# Actual vs Predicted, with the y = x reference line
axes[1, 1].scatter(y_test, y_test_pred, alpha=0.6)
axes[1, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
                color='red', linestyle='--')
axes[1, 1].set_xlabel('Actual Sales')
axes[1, 1].set_ylabel('Predicted Sales')
axes[1, 1].set_title('Actual vs Predicted Sales')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Key Findings and Conclusions

Based on my analysis, here are the key insights:

### Marketing Channel Effectiveness

From the regression analysis, I can draw several conclusions about marketing channel effectiveness:

1. **Most Impactful Channels**: The coefficients show which marketing channels have the strongest relationship with sales. For the TV and Influencer dummy variables, each coefficient is the difference relative to the dropped baseline level.
2. **Statistical Significance**: The p-values in the statsmodels summary indicate which relationships are statistically significant.
3. **Practical Significance**: The magnitude of each coefficient shows the practical impact of a channel, keeping in mind that it is expressed per unit of that feature.

### Model Performance

The model's R-squared value indicates how well the marketing variables explain the variance in sales, providing insight into:

- The predictive power of the current marketing mix
- Areas where additional factors might be needed
- The overall effectiveness of the marketing strategy

### Recommendations

Based on these findings, I would recommend:

- Focusing investment on the most effective channels
- Considering the interaction effects between channels
- Monitoring performance over time to validate these relationships

## 6. Future Improvements

To enhance this analysis, I could consider:

1. **Feature Engineering**: Creating interaction terms between marketing channels
2. **Non-linear Relationships**: Exploring polynomial or other non-linear transformations
3. **Time Series Analysis**: If temporal data is available, analyzing trends over time
4. **Advanced Modeling**: Trying ensemble methods or regularized regression techniques
5. **External Factors**: Including economic indicators, seasonality, or competitive data

This analysis provides a solid foundation for understanding marketing effectiveness and can guide strategic decision-making for future marketing investments.
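
As a rough sketch of the first and fourth ideas in the list above, the example below adds an interaction term between two channels and compares plain and ridge regression using 5-fold cross-validation. The synthetic data deliberately contains an interaction effect; the column name `Radio_x_Social` and the `alpha=1.0` penalty are illustrative assumptions, not tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data in which the two channels reinforce each other.
rng = np.random.default_rng(3)
n = 400
X_toy = pd.DataFrame({
    "Radio": rng.uniform(0, 10, n),
    "Social Media": rng.uniform(0, 10, n),
})
y_toy = (X_toy["Radio"] + X_toy["Social Media"]
         + 0.5 * X_toy["Radio"] * X_toy["Social Media"]
         + rng.normal(0, 1, n))

# Feature engineering: add the product of the two channels as a new column.
X_inter = X_toy.assign(Radio_x_Social=X_toy["Radio"] * X_toy["Social Media"])

# Compare mean cross-validated R² with and without the interaction term.
base_r2 = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2").mean()
inter_r2 = cross_val_score(LinearRegression(), X_inter, y_toy, cv=5, scoring="r2").mean()
ridge_r2 = cross_val_score(Ridge(alpha=1.0), X_inter, y_toy, cv=5, scoring="r2").mean()

print(f"Linear, no interaction:   {base_r2:.4f}")
print(f"Linear, with interaction: {inter_r2:.4f}")
print(f"Ridge,  with interaction: {ridge_r2:.4f}")
```

On the real dataset the interaction term may or may not help; cross-validated scores like these are one way to check before adding complexity to the model.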