This cell imports necessary libraries and loads the dataset "Advertising Budget and Sales.csv" into a pandas DataFrame. It then displays the first few rows and provides information about the DataFrame, such as column names and data types.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('Advertising Budget and Sales.csv')

# Display the first few rows
print("First 5 rows of the dataset:")
display(df.head())

# Get information about the dataframe
print("\nDataFrame Info:")
df.info()

This cell checks for missing values in each column of the DataFrame and displays the count of missing values for each column.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values in each column:")
display(missing_values)

This cell checks for duplicate rows in the DataFrame and prints the total number of duplicate rows found.

In [None]:
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

This cell renames the column 'Unnamed: 0' to 'ID' and removes the ' ($)' suffix from the column names.

In [None]:
# Rename the unnamed index column to 'ID'
df = df.rename(columns={'Unnamed: 0': 'ID'})
# Drop the literal ' ($)' suffix from the column names
df.columns = df.columns.str.replace(' ($)', '', regex=False)
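
The effect of the renaming step can be checked on a small frame. This is a minimal sketch: the `toy` frame and its values are hypothetical, and its column names are assumed to mimic the raw CSV header.

```python
import pandas as pd

# Hypothetical toy frame; the column names are assumed to resemble the raw CSV header.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "TV Ad Budget ($)": [230.1, 44.5],
    "Sales ($)": [22.1, 10.4],
})

toy = toy.rename(columns={"Unnamed: 0": "ID"})
# regex=False treats ' ($)' as a literal string, so the parentheses
# and the dollar sign need no escaping.
toy.columns = toy.columns.str.replace(" ($)", "", regex=False)

print(list(toy.columns))  # ['ID', 'TV Ad Budget', 'Sales']
```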

This cell generates visualizations to explore the distribution of numerical columns ('TV Ad Budget', 'Radio Ad Budget', 'Newspaper Ad Budget', and 'Sales') using histograms. It also visualizes the relationships between advertising budgets and sales using scatter plots and displays the correlation matrix using a heatmap.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of numerical columns
numerical_cols = ['TV Ad Budget', 'Radio Ad Budget', 'Newspaper Ad Budget', 'Sales']

plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols):
    plt.subplot(2, 2, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# Visualize the relationships between advertising budgets and sales using scatter plots
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
sns.scatterplot(x='TV Ad Budget', y='Sales', data=df)
plt.title('TV Ad Budget vs Sales')

plt.subplot(1, 3, 2)
sns.scatterplot(x='Radio Ad Budget', y='Sales', data=df)
plt.title('Radio Ad Budget vs Sales')

plt.subplot(1, 3, 3)
sns.scatterplot(x='Newspaper Ad Budget', y='Sales', data=df)
plt.title('Newspaper Ad Budget vs Sales')
plt.tight_layout()
plt.show()

# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(8, 6))
correlation_matrix = df[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

This cell visualizes the distribution of numerical columns using box plots to identify potential outliers.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of numerical columns using box plots
numerical_cols = ['TV Ad Budget', 'Radio Ad Budget', 'Newspaper Ad Budget', 'Sales']

plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

This cell removes outliers from the 'Newspaper Ad Budget' column using the Interquartile Range (IQR) method and prints the shape of the DataFrame after removing the outliers.

In [None]:
# Remove outliers using IQR
Q1 = df['Newspaper Ad Budget'].quantile(0.25)
Q3 = df['Newspaper Ad Budget'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df = df[(df['Newspaper Ad Budget'] >= lower_bound) & (df['Newspaper Ad Budget'] <= upper_bound)].copy()

print(f"Shape after removing outliers: {df.shape}")
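
To see how the IQR fences behave, the same rule can be applied to a short series. This is an illustrative sketch with made-up values (the series `s` is not taken from the dataset), in which 100.0 is a deliberate outlier.

```python
import pandas as pd

# Hypothetical values; 100.0 is a deliberate outlier.
s = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 100.0])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Series.between is inclusive on both ends by default,
# matching the >= and <= comparisons used in the cell above.
kept = s[s.between(lower, upper)]

print(f"fences: [{lower}, {upper}], kept {len(kept)} of {len(s)} values")
```

The 1.5 multiplier is a convention rather than a statistical law; a wider fence (for example 3 × IQR) removes fewer points.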

This cell visualizes the distribution of numerical columns using box plots again after outlier removal to show the updated distributions.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of numerical columns using box plots
numerical_cols = ['TV Ad Budget', 'Radio Ad Budget', 'Newspaper Ad Budget', 'Sales']

plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

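This cell standardizes the numerical features ('TV Ad Budget', 'Radio Ad Budget', and 'Newspaper Ad Budget') to zero mean and unit variance using StandardScaler from scikit-learn. The scaler is fitted on all rows, before the train/test split, so its means and standard deviations include the rows later used for testing.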

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns to scale (excluding 'ID' and 'Sales')
numerical_cols_to_scale = ['TV Ad Budget', 'Radio Ad Budget', 'Newspaper Ad Budget']

# Initialize the StandardScaler
scaler = StandardScaler()

df[numerical_cols_to_scale] = scaler.fit_transform(df[numerical_cols_to_scale])
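
A common alternative is to fit the scaler on the training rows only, so that no statistics from the test rows enter preprocessing. The sketch below shows one way to do that with a scikit-learn pipeline. It is not the notebook's method: the `demo` frame is synthetic, and the coefficients used to generate it are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the advertising data (assumed, not the real CSV).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TV Ad Budget": rng.uniform(0, 300, 200),
    "Radio Ad Budget": rng.uniform(0, 50, 200),
    "Newspaper Ad Budget": rng.uniform(0, 100, 200),
})
demo["Sales"] = (
    3.0
    + 0.05 * demo["TV Ad Budget"]
    + 0.2 * demo["Radio Ad Budget"]
    + rng.normal(0, 1, 200)
)

features = ["TV Ad Budget", "Radio Ad Budget", "Newspaper Ad Budget"]
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[features], demo["Sales"], test_size=0.2, random_state=3
)

# The pipeline fits the scaler on X_tr only, then reuses those
# training statistics when transforming X_te inside score/predict.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

train_means = pipe.named_steps["standardscaler"].mean_
test_r2 = pipe.score(X_te, y_te)
print(f"Test R-squared on synthetic data: {test_r2:.3f}")
```

For ordinary least squares with an intercept, predictions do not depend on how the features are scaled, so this mainly matters for models that are scale-sensitive, such as regularized regression.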

This cell selects the features (independent variables) and the target (dependent variable) for the linear regression model. 'Radio Ad Budget', 'TV Ad Budget', and 'Newspaper Ad Budget' are selected as features (X) and 'Sales' is selected as the target (y).

In [None]:
X = df[['Radio Ad Budget', 'TV Ad Budget', 'Newspaper Ad Budget']]
y = df['Sales']

This cell splits the dataset into training and testing sets using `train_test_split` from scikit-learn. 80% of the data is used for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
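
The role of `test_size` and `random_state` can be illustrated on placeholder arrays. In this sketch `X_demo` and `y_demo` are hypothetical and only their shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 rows, 3 features (hypothetical, not the real dataset).
X_demo = np.arange(600).reshape(200, 3)
y_demo = np.arange(200)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=3)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=3)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = split_a
print(X_demo_train.shape, X_demo_test.shape)  # (160, 3) (40, 3)

# The same random_state reproduces the same shuffle, and hence the same test rows.
same_test_rows = np.array_equal(split_a[1], split_b[1])
```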

This cell builds a linear regression model using `LinearRegression` from scikit-learn and trains it on the training data (`X_train` and `y_train`).

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

This cell displays the intercept and coefficients of the trained linear regression model. Because the features were standardized, the intercept is the expected value of Sales when every advertising budget is at its mean (scaled value 0), and each coefficient is the expected change in Sales for a one-standard-deviation increase in the corresponding advertising budget, holding the other budgets fixed. The coefficients follow the column order of `X`: Radio, TV, Newspaper.

In [None]:
# Display the intercept and coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
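
Because the features were standardized before fitting, the printed coefficients are per standard deviation rather than per unit of budget. The sketch below shows one way to convert them back to per-unit effects, using the scaler's `mean_` and `scale_` attributes. The data are synthetic and the names `X_raw`, `y_syn`, `sc`, `m_std`, and `m_raw` are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with an assumed linear relationship (illustration only).
rng = np.random.default_rng(1)
X_raw = rng.uniform(0, 100, size=(150, 3))
y_syn = 5.0 + X_raw @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 0.5, 150)

# Fit on standardized features, as the notebook does.
sc = StandardScaler().fit(X_raw)
m_std = LinearRegression().fit(sc.transform(X_raw), y_syn)

# z = (x - mean) / scale, so a coefficient per unit of x is coef / scale,
# and the intercept at x = 0 subtracts the contribution of the means.
coef_per_unit = m_std.coef_ / sc.scale_
intercept_at_zero = m_std.intercept_ - np.sum(m_std.coef_ * sc.mean_ / sc.scale_)

# Cross-check against a model fitted directly on the unscaled features.
m_raw = LinearRegression().fit(X_raw, y_syn)
print(coef_per_unit, m_raw.coef_)
print(intercept_at_zero, m_raw.intercept_)
```

In this sketch the scaler and the model see the same rows, so the standardized intercept equals the mean of `y_syn`; in the notebook the scaler was fitted on all rows and the model on the training rows, so that equality holds only approximately there.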

This cell uses the trained linear regression model to make predictions on the testing data (`X_test`). The predictions are stored in the variable `y_pred`.

In [None]:
y_pred = model.predict(X_test)

This cell evaluates the performance of the trained linear regression model using common regression metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. These metrics quantify how well the model's predictions match the actual sales values.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print the metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")
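
The three metrics can also be computed directly from their definitions, which makes clear what each one measures. This is a small worked sketch on made-up numbers (`y_true` and `y_hat` are hypothetical), cross-checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 11.0, 16.0, 18.0])

residuals = y_true - y_hat               # [-1, 1, -1, 2]

# MSE: mean of squared residuals, (1 + 1 + 1 + 4) / 4 = 1.75
mse_manual = np.mean(residuals ** 2)
# RMSE: square root of MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```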

This cell generates a scatter plot comparing the actual sales values (`y_test`) against the predicted sales values (`y_pred`). A diagonal line is also plotted to represent the ideal scenario where predicted values exactly match actual values. This plot helps visualize the model's performance and identify any patterns in the errors.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k', lw=2)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Actual Sales vs. Predicted Sales')
plt.show()

## Summary:

### Data Analysis Key Findings

*   A simple linear regression of Sales on the unscaled TV Ad Budget has an intercept of approximately 7.12 and a coefficient of approximately 0.0465: each one-unit increase in TV Ad Budget is associated with a predicted increase of about 0.0465 in Sales. These values do not describe the three-feature model fitted above, whose inputs are standardized.
*   The performance metrics reported for that TV-only model on the test set are:
    *   Mean Squared Error (MSE): approximately 10.20
    *   Root Mean Squared Error (RMSE): approximately 3.19
    *   R-squared (R2): approximately 0.677
*   The R-squared value of 0.677 indicates that approximately 67.7% of the variance in test-set Sales is explained by TV Ad Budget in that simple linear regression model.
*   The final plot compares actual and predicted Sales on the test set; points close to the diagonal line correspond to predictions with small errors.

### Insights or Next Steps

*   An R-squared of 0.677 shows that TV ad budget alone explains much of the variation in sales, but roughly a third of the variance remains unexplained by the simple model. R-squared by itself does not establish statistical significance.
*   Compare these figures with the metrics of the multiple linear regression fitted above, which adds the Radio and Newspaper budgets, to see whether model performance improves.
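
One hedged way to compare a TV-only model with a three-feature model is 5-fold cross-validation, which averages R-squared over several splits instead of relying on a single one. The sketch below uses synthetic data generated from an assumed linear relationship, so its scores say nothing about the real dataset; `synthetic` and `mean_cv_r2` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the advertising data (assumed coefficients and noise).
rng = np.random.default_rng(42)
n = 200
synthetic = pd.DataFrame({
    "TV Ad Budget": rng.uniform(0, 300, n),
    "Radio Ad Budget": rng.uniform(0, 50, n),
    "Newspaper Ad Budget": rng.uniform(0, 100, n),
})
synthetic["Sales"] = (
    3.0
    + 0.05 * synthetic["TV Ad Budget"]
    + 0.2 * synthetic["Radio Ad Budget"]
    + rng.normal(0, 1.5, n)
)


def mean_cv_r2(columns):
    """Return the mean 5-fold cross-validated R-squared for the given feature columns."""
    scores = cross_val_score(
        LinearRegression(), synthetic[columns], synthetic["Sales"],
        cv=5, scoring="r2",
    )
    return scores.mean()


r2_tv_only = mean_cv_r2(["TV Ad Budget"])
r2_all = mean_cv_r2(["TV Ad Budget", "Radio Ad Budget", "Newspaper Ad Budget"])

print(f"TV only:        {r2_tv_only:.3f}")
print(f"All three ads:  {r2_all:.3f}")
```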