# DATA PREPROCESSING

In [373]:
#Library Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import csv
import scipy
from datetime import datetime
from io import StringIO


### Data Loading Decision

For this project, only rows with exactly the expected number of comma-separated fields (10) were loaded from the CSV file. Rows with missing or extra fields, typically caused by data entry errors or formatting inconsistencies, were filtered out before parsing so that values cannot shift into the wrong columns. The skipped rows are listed below so that the filtering step stays transparent and reproducible.

**Skipped rows due to column mismatch:**  
- Line 1003: expected 10 fields, saw 11  
- Line 1006: expected 10 fields, saw 12  
- Line 1008: expected 10 fields, saw 11  
- Line 1012: expected 10 fields, saw 11  
- Line 1014: expected 10 fields, saw 13

In [374]:
# Load the CSV file
# Read the CSV, keeping all rows as raw text
input_path = "/Users/patriciajaquez/Documents/GitHub/module1_project/data/raw/marketingcampaigns.csv"
rows = []
expected_columns = 10

with open(input_path, 'r', encoding='utf-8') as infile:
    for line in infile:
        if len(line.strip().split(',')) == expected_columns:
            rows.append(line)

# Join the clean rows and load into pandas
clean_data = pd.read_csv(StringIO(''.join(rows)))
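
Counting fields with `split(',')` treats a comma inside a quoted value as a separator, so a valid row such as `"Spring sale, phase 2",1000,paid` would be rejected. The sketch below shows one way to count fields with the standard-library `csv` module instead. The helper name `split_rows_by_width` and the sample text are hypothetical, and this is not the loading code used in this notebook.

```python
import csv
from io import StringIO


def split_rows_by_width(text, expected_columns):
    """Separate CSV rows that have the expected field count from those that do not."""
    kept, skipped = [], []
    reader = csv.reader(StringIO(text))
    for fields in reader:
        if not fields:
            continue  # ignore completely blank lines
        if len(fields) == expected_columns:
            kept.append(fields)
        else:
            # reader.line_num is the number of physical lines read so far
            skipped.append((reader.line_num, len(fields)))
    return kept, skipped


# Made-up sample: the second row has a quoted comma, the third has an extra field
sample = (
    "campaign_name,budget,channel\n"
    '"Spring sale, phase 2",1000,paid\n'
    "Broken row,500,paid,extra\n"
)
kept, skipped = split_rows_by_width(sample, expected_columns=3)
for line_number, n_fields in skipped:
    print(f"Line {line_number}: expected 3 fields, saw {n_fields}")
```

The kept rows are lists of strings, so numeric and date columns would still need to be converted afterwards.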

In [375]:
# Display the first few rows of the DataFrame
#This prints the first 5 rows
clean_data.head()


index,campaign_name,start_date,end_date,budget,roi,type,target_audience,channel,conversion_rate,revenue
0,Public-key multi-tasking throughput,2023-04-01,2024-02-23,8082.3,0.35,email,B2B,organic,0.4,709593.48
1,De-engineered analyzing task-force,2023-02-15,2024-04-22,17712.98,0.74,email,B2C,promotion,0.66,516609.1
2,Balanced solution-oriented Local Area Network,2022-12-20,2023-10-11,84643.1,0.37,podcast,B2B,paid,0.28,458227.42
3,Distributed real-time methodology,2022-09-26,2023-09-27,14589.75,0.47,webinar,B2B,organic,0.19,89958.73
4,Front-line executive infrastructure,2023-07-07,2024-05-15,39291.9,0.3,social media,B2B,promotion,0.81,47511.35


In [376]:
# Display the last few rows of the DataFrame
#This prints the last 5 rows
clean_data.tail()

index,campaign_name,start_date,end_date,budget,roi,type,target_audience,channel,conversion_rate,revenue
1027,No revenue campaign,2023-02-01,2023-08-01,20000,0.3,social media,B2B,organic,0.5,
1028,Random mess,2023-06-06,,100000,,podcast,,referral,,300000.0
1029,Invalid budget,2022-12-01,2023-06-01,abc,,email,B2C,promotion,0.2,50000.0
1030,Overlapping dates,2023-03-01,2022-12-31,60000,0.6,webinar,B2B,paid,0.7,90000.0
1031,Too many conversions,2023-05-01,2023-11-01,40000,0.8,social media,B2C,organic,1.5,120000.0


In [377]:
# DataFrame info: number of entries, column names, non-null counts, and data types (Dtype)
clean_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1032 entries, 0 to 1031
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   campaign_name    1032 non-null   object 
 1   start_date       1031 non-null   object 
 2   end_date         1030 non-null   object 
 3   budget           1029 non-null   object 
 4   roi              1028 non-null   float64
 5   type             1031 non-null   object 
 6   target_audience  1030 non-null   object 
 7   channel          1031 non-null   object 
 8   conversion_rate  1028 non-null   float64
 9   revenue          1029 non-null   float64
dtypes: float64(3), object(7)
memory usage: 80.8+ KB


In [378]:
#Count of empty values per column
empty_values = clean_data.isnull().sum()
print(empty_values)

print("Total of empty values: ", sum(empty_values))

# Rows with at least one missing value
rows_with_missing = clean_data.isnull().any(axis=1).sum()
print(f"Rows with at least one missing value: {rows_with_missing}")

# Empty rows
empty_rows = clean_data[clean_data.isnull().all(axis=1)]
print(f"Number of empty rows: {empty_rows.shape[0]}")

campaign_name      0
start_date         1
end_date           2
budget             3
roi                4
type               1
target_audience    2
channel            1
conversion_rate    4
revenue            3
dtype: int64
Total of empty values:  21
Rows with at least one missing value: 11
Number of empty rows: 0


In [379]:
# Descriptive statistics and possible outliers
clean_data.describe()


statistic,roi,conversion_rate,revenue
count,1028.0,1028.0,1029.0
mean,0.533804,0.541936,511591.195277
std,0.261869,0.267353,287292.729847
min,-0.2,0.0,108.21
25%,0.31,0.3,267820.25
50%,0.53,0.55,518001.77
75%,0.76,0.77,765775.14
max,0.99,1.5,999712.49


In [380]:
# Review unique values in categorical columns
cat_cols = ['type', 'target_audience', 'channel']
for col in cat_cols:
    print(f"\nUnique values in {col}:")
    print(clean_data[col].unique())


Unique values in type:
['email' 'podcast' 'webinar' 'social media' nan 'event' 'B2B']

Unique values in target_audience:
['B2B' 'B2C' 'social media' nan]

Unique values in channel:
['organic' 'promotion' 'paid' 'referral' nan]


## Data Issues Identified

During data preprocessing, the following data quality issues were found:

1. **Rows with Incorrect Number of Columns**  
    - Some rows in the raw CSV did not match the expected number of columns and were excluded.

2. **Missing Values**  
    - Every column except `campaign_name` contained missing values: `start_date`, `end_date`, `budget`, `roi`, `type`, `target_audience`, `channel`, `conversion_rate`, and `revenue`.

3. **Invalid (Non-numeric) Values in Numeric Columns**  
    - The `budget` column contained a non-numeric entry (`abc`), which is why pandas loaded it as `object`. The other numeric columns (`roi`, `conversion_rate`, `revenue`) loaded as `float64`.

4. **Incorrect Column Data Type**  
    - `budget` was loaded as `object` but is expected to be `float`.

5. **Unexpected Categorical Values**  
    - The `type` column contained `B2B` and the `target_audience` column contained `social media`, which suggests the two values were misplaced.

6. **Outliers**  
    - Outliers were present in numeric columns: `conversion_rate` (a value of 1.5, i.e. above 100%), `roi` (a negative value of -0.2), and `budget` (a value of 9,999,999.0, far above the rest).


## Cleaning Process

### Handling Missing Values
The cell below computes the percentage of missing values per column. This matters for choosing a strategy: columns with a high share of missing values may need different treatment than columns with only a few missing entries.

In [381]:
#Percentage of empty values per column
empty_values_percentage = (empty_values / len(clean_data)) * 100
print(empty_values_percentage)

campaign_name      0.000000
start_date         0.096899
end_date           0.193798
budget             0.290698
roi                0.387597
type               0.096899
target_audience    0.193798
channel            0.096899
conversion_rate    0.387597
revenue            0.290698
dtype: float64


Checked the row where `type` is 'B2B' and `target_audience` is 'social media'.  
**Resolution:** This row was dropped from the dataset because most of its fields are null (`end_date`, `budget`, `roi`, `channel`, `conversion_rate`, and `revenue` are all missing), so it carries no usable information for analysis.

In [382]:
# Print all rows where type is 'B2B' or target_audience is 'social media'
misplaced_rows = clean_data[
    (clean_data['type'] == 'B2B') | (clean_data['target_audience'] == 'social media')
]
print(misplaced_rows)

# Drop the rows where both fields are misplaced (type is 'B2B' and target_audience is 'social media')
clean_data = clean_data[
    ~((clean_data['type'] == 'B2B') & (clean_data['target_audience'] == 'social media'))
]

            campaign_name  start_date end_date budget  roi type  \
1024  Null-heavy campaign  2023-01-01      NaN    NaN  NaN  B2B   

     target_audience channel  conversion_rate  revenue  
1024    social media     NaN              NaN      NaN  


Converted the `budget` column from object to float, ensuring all values are numeric. Invalid entries were coerced to NaN for consistent data processing.

In [383]:
# Convert 'budget' to numeric, setting errors='coerce' will turn invalid values (like 'abc') into NaN
clean_data['budget'] = pd.to_numeric(clean_data['budget'], errors='coerce')

# Check 'budget' data type after conversion
print(f"budget column is now: {clean_data['budget'].dtype}")

budget column is now: float64
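
With `errors='coerce'`, invalid entries disappear into NaN without any message, so it can help to list them before overwriting the column. The sketch below is one way to do that on a small made-up Series; the variable names and sample values are hypothetical.

```python
import pandas as pd

# Made-up raw values: one valid number, one invalid string, one genuinely missing entry
raw_budget = pd.Series(["8082.3", "abc", None, "20000"])

converted = pd.to_numeric(raw_budget, errors="coerce")

# Entries that were present in the raw data but became NaN during conversion
coerced_to_nan = raw_budget[converted.isna() & raw_budget.notna()]
print(coerced_to_nan)
```

Only the string `abc` is reported here; the entry that was already missing is not counted as a conversion failure.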


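Checked whether any values in the float columns (`budget`, `roi`, `conversion_rate`, `revenue`) contain a comma (`,`), which could indicate improper formatting or parsing issues. Note that all four columns are already `float64` at this point, and the string form of a float never contains a comma, so this check can only confirm the conversion; to catch formatting problems it would need to run on the raw string values before conversion.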

In [384]:
# List of float columns to check
float_cols = ['budget', 'roi', 'conversion_rate', 'revenue']

for col in float_cols:
    # Convert to string and check for commas
    has_comma = clean_data[col].astype(str).str.contains(',', na=False).any()
    print(f"Column '{col}' has values with commas: {has_comma}")
    
    # Optionally, display some examples
    if has_comma:
        print(f"Examples from '{col}' with commas:")
        print(clean_data[clean_data[col].astype(str).str.contains(',', na=False)][col].head())

Column 'budget' has values with commas: False
Column 'roi' has values with commas: False
Column 'conversion_rate' has values with commas: False
Column 'revenue' has values with commas: False
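
A comma check is most informative on the raw text, before the values are converted to floats. The sketch below runs the same kind of check on hypothetical raw strings and removes the commas under the assumption that they are thousands separators; neither the sample values nor that assumption comes from the dataset.

```python
import pandas as pd

# Hypothetical raw strings, as they might appear before numeric conversion
raw_budget = pd.Series(["1,250.50", "300", "abc"])

# Literal comma search on the string values
has_comma = raw_budget.str.contains(",", regex=False)

# Assumption: commas are thousands separators, so they can simply be removed
cleaned = pd.to_numeric(raw_budget.str.replace(",", "", regex=False), errors="coerce")
print(has_comma.any())
print(cleaned)
```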


All float columns (`budget`, `roi`, `conversion_rate`, `revenue`) were standardized to have two decimal places for consistency and easier interpretation in analysis and reporting.

In [385]:
# Standardize all float columns to have 2 decimal places

# Convert columns to numeric (if not already), then round to 2 decimals
for col in float_cols:
    clean_data[col] = pd.to_numeric(clean_data[col], errors='coerce').round(2)

The following code identifies outliers in each numeric column of the dataset using two methods: the Interquartile Range (IQR) method and the Modified Z-score method. For each column in `float_cols`, it calculates the IQR and flags values outside 1.0 times the IQR from the first and third quartiles as outliers. It also computes the median absolute deviation (MAD) and uses it to calculate the modified Z-score, flagging values with an absolute Z-score greater than 3.5 as outliers. The results from both methods are combined and displayed for each column, providing a comprehensive view of potential outliers in the data.
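
Written out, the two rules applied in the next cell are as follows, where $Q_1$ and $Q_3$ are the quartiles of a column, $\tilde{x}$ is its median, and $k = 1.0$ is the IQR multiplier used in that cell:

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad x_i \text{ is flagged if } x_i < Q_1 - k\,\mathrm{IQR} \ \text{ or } \ x_i > Q_3 + k\,\mathrm{IQR}
$$

$$
\mathrm{MAD} = \operatorname{median}_i\,\lvert x_i - \tilde{x} \rvert, \qquad M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}, \qquad x_i \text{ is flagged if } \lvert M_i \rvert > 3.5
$$

The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution, which makes $M_i$ comparable to an ordinary Z-score when the data are roughly normal. The multiplier most often quoted for the IQR rule is 1.5; using 1.0 makes the rule stricter.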


In [386]:
# Detect outliers in each float column using the IQR and modified Z-score methods

from scipy.stats import median_abs_deviation

for col in float_cols:
    # IQR method with a lower threshold
    Q1 = clean_data[col].quantile(0.25)
    Q3 = clean_data[col].quantile(0.75)
    IQR = Q3 - Q1
    iqr_outliers = clean_data[(clean_data[col] < Q1 - 1.0 * IQR) | (clean_data[col] > Q3 + 1.0 * IQR)]
    
    # Modified Z-score method
    median = clean_data[col].median()
    mad = median_abs_deviation(clean_data[col], nan_policy='omit')
    if mad > 0:
        mod_z = 0.6745 * (clean_data[col] - median) / mad
        z_outliers = clean_data[mod_z.abs() > 3.5]
    else:
        z_outliers = pd.DataFrame()
    
    # Combine outliers
    combined_outliers = pd.concat([iqr_outliers, z_outliers]).drop_duplicates()
    print(f"\nOutliers in {col}:")
    print(combined_outliers[[col]])


Outliers in budget:
         budget
1008  9999999.0

Outliers in roi:
      roi
1023 -0.2

Outliers in conversion_rate:
      conversion_rate
1031              1.5

Outliers in revenue:
Empty DataFrame
Columns: [revenue]
Index: []


In [387]:
# Check rows with outliers

def get_outlier_rows(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Statistical outliers
    stat_outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
    return stat_outliers

# Budget outliers (statistical + negative values)
budget_outliers = get_outlier_rows(clean_data, 'budget')
budget_negatives = clean_data[clean_data['budget'] < 0]
print("Rows with outlier Budget (statistical or negative):")
print(pd.concat([budget_outliers, budget_negatives]).drop_duplicates())

# ROI outliers (statistical + negative + >1)
roi_outliers = get_outlier_rows(clean_data, 'roi')
roi_negatives = clean_data[clean_data['roi'] < 0]
roi_above_one = clean_data[clean_data['roi'] > 1]
print("Rows with outlier ROI (statistical, negative, or >1):")
print(pd.concat([roi_outliers, roi_negatives, roi_above_one]).drop_duplicates())

# Conversion Rate outliers (statistical + >1 or <0)
conv_outliers = get_outlier_rows(clean_data, 'conversion_rate')
conv_above_one = clean_data[clean_data['conversion_rate'] > 1]
conv_below_zero = clean_data[clean_data['conversion_rate'] < 0]
print("Rows with outlier Conversion Rate (statistical, >1, or <0):")
print(pd.concat([conv_outliers, conv_above_one, conv_below_zero]).drop_duplicates())



Rows with outlier Budget (statistical or negative):
          campaign_name  start_date    end_date     budget  roi     type  \
1008     Outlier Budget  2023-07-01  2024-03-01  9999999.0  0.1    email   
1023  Negative ROI test  2022-10-10  2023-05-05   -10000.0 -0.2  podcast   

     target_audience    channel  conversion_rate  revenue  
1008             B2B  promotion              0.2  50000.0  
1023             B2C   referral              0.1      NaN  
Rows with outlier ROI (statistical, negative, or >1):
          campaign_name  start_date    end_date   budget  roi     type  \
1023  Negative ROI test  2022-10-10  2023-05-05 -10000.0 -0.2  podcast   

     target_audience   channel  conversion_rate  revenue  
1023             B2C  referral              0.1      NaN  
Rows with outlier Conversion Rate (statistical, >1, or <0):
             campaign_name  start_date    end_date   budget  roi  \
1031  Too many conversions  2023-05-01  2023-11-01  40000.0  0.8   

              type ta

Negative values in the `budget` column were corrected by converting them to their absolute values. This assumes negative budgets are data entry errors, ensuring all budgets are positive and meaningful for analysis.

In [388]:
# Fix negative values in the 'budget' column by converting them to positive (absolute value)
neg_budget_mask = clean_data['budget'] < 0
clean_data.loc[neg_budget_mask, 'budget'] = clean_data.loc[neg_budget_mask, 'budget'].abs()

A new column, `roi_recalculated`, was created using the standard ROI formula in decimal format to validate the consistency of the reported ROI values. By comparing the original and recalculated ROI (the absolute difference is stored in `roi_diff`), we identified rows with significant discrepancies. This validation step was needed before correcting outliers in the `budget` and `roi` columns: it ensured that any adjustments (such as recalculating budgets from revenue and ROI, or correcting negative or implausible values) were based on accurate relationships between these metrics.

Ensuring the accuracy and consistency of ROI and budget values is essential for marketing analysis, as these metrics directly impact the evaluation of campaign effectiveness, resource allocation, and strategic decision-making. Reliable ROI calculations allow for meaningful comparisons across campaigns and support data-driven recommendations for future marketing investments.

In [389]:
# Calculate ROI using the standard formula and store in a new column as decimal
# ROI = (revenue - budget) / budget
clean_data['roi_recalculated'] = ((clean_data['revenue'] - clean_data['budget']) / clean_data['budget']).round(2)

# Calculate the absolute difference between the original and recalculated ROI
clean_data['roi_diff'] = (clean_data['roi'] - clean_data['roi_recalculated']).abs()

# Set a threshold to define what is considered a significant difference
threshold = 0.01  # You can adjust this value based on your analysis needs

# Find rows where the difference between original and recalculated ROI is significant
diff_rows = clean_data[clean_data['roi_diff'] > threshold]

# Print the rows with significant differences for review
print(diff_rows[['roi', 'roi_recalculated', 'roi_diff', 'budget', 'revenue']])

       roi  roi_recalculated  roi_diff    budget    revenue
0     0.35             86.80     86.45   8082.30  709593.48
1     0.74             28.17     27.43  17712.98  516609.10
2     0.37              4.41      4.04  84643.10  458227.42
3     0.47              5.17      4.70  14589.75   89958.73
4     0.30              0.21      0.09  39291.90   47511.35
...    ...               ...       ...       ...        ...
1022  0.45              2.50      2.05  25000.00   87500.00
1025  0.90              1.67      0.77  75000.00  200000.00
1026  0.25              0.50      0.25  30000.00   45000.00
1030  0.60              0.50      0.10  60000.00   90000.00
1031  0.80              2.00      1.20  40000.00  120000.00

[1024 rows x 5 columns]


The comparison above flagged 1024 rows, nearly the entire dataset, so the reported `roi` values evidently do not follow the formula (`revenue` - `budget`) / `budget`. For all subsequent analysis, the recalculated ROI (`roi_recalculated`) was used in place of the original ROI values. This keeps all ROI figures consistent with the cleaned `budget` and `revenue` data. The original `roi` column was overwritten so that only one ROI definition remains in the dataset.

In [390]:
# Replace the original 'roi' column with the recalculated ROI values for consistency
clean_data['roi'] = clean_data['roi_recalculated']

# Remove the temporary columns used for ROI validation to clean up the DataFrame
clean_data = clean_data.drop(columns=['roi_recalculated', 'roi_diff'])

In [391]:
clean_data.describe()

statistic,budget,roi,conversion_rate,revenue
count,1028.0,1026.0,1028.0,1029.0
mean,59015.44,24.9596,0.541936,511591.195277
std,311691.3,61.564319,0.267353,287292.729847
min,1052.57,-1.0,0.0,108.21
25%,24735.49,4.41,0.3,267820.25
50%,46948.24,9.375,0.55,518001.77
75%,74923.65,20.04,0.77,765775.14
max,9999999.0,884.76,1.5,999712.49


### Outlier Correction

Outliers in the `budget` column (such as 9,999,999.0 or negative values) were corrected by recalculating the budget using available `revenue` and `roi` values, or by converting the budget to its absolute value if recalculation was not possible. ROI was recalculated for these rows to maintain consistency.

**Formulas used for recalculation:**
- `budget` = `revenue` / (`roi` + 1)
- `roi` = (`revenue` - `budget`) / `budget`

Out-of-range values in `roi` and `conversion_rate` were not removed or set to NaN, so every row stays available for further analysis and transparent decision-making. The flag columns `roi_outlier` and `conversion_rate_outlier` intended for this purpose are not created by the cells in this section; the cleaned DataFrame keeps its original 10 columns.
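
A minimal sketch of how such flag columns could be added is shown here. The column names `roi_outlier` and `conversion_rate_outlier` come from the text above; the sample values and the valid range of 0 to 1 for both metrics are assumptions (the range mirrors the checks in the earlier outlier cell), not the notebook's actual implementation.

```python
import pandas as pd

# Made-up sample values, including one missing ROI
df = pd.DataFrame({
    "roi": [0.35, -0.2, 86.8, None],
    "conversion_rate": [0.4, 1.5, 0.1, 0.7],
})

# Assumed valid range for both metrics: 0 to 1 inclusive
for col in ["roi", "conversion_rate"]:
    in_range = df[col].between(0, 1)
    # Flag only values that are present and out of range; missing values stay unflagged
    df[f"{col}_outlier"] = df[col].notna() & ~in_range

print(df)
```

Flagging keeps the original values intact, so a later analysis step can decide whether to exclude, cap, or keep the flagged rows.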

In [392]:
# Fixing outliers and recalculating values for budget, ROI, and conversion_rate

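# 1. Fix budget outlier (e.g., 9999999.0) by recalculating from revenue and ROI if possible
# Caveat: 'roi' was already replaced by the value recalculated from this same budget, so
# revenue / (roi + 1) reproduces the current budget apart from the effect of rounding roi
# to two decimals; it is not an independent estimate of the budget.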
budget_outlier_mask = clean_data['budget'] == 9999999.0
for idx in clean_data[budget_outlier_mask].index:
    row = clean_data.loc[idx]
    if pd.notnull(row['revenue']) and pd.notnull(row['roi']) and row['roi'] != -1:
        new_budget = row['revenue'] / (row['roi'] + 1)
        clean_data.at[idx, 'budget'] = new_budget
        # Recalculate ROI to ensure consistency
        clean_data.at[idx, 'roi'] = (row['revenue'] - new_budget) / new_budget

# 2. Fix negative budgets (should not be negative)
neg_budget_mask = clean_data['budget'] < 0
for idx in clean_data[neg_budget_mask].index:
    row = clean_data.loc[idx]
    if pd.notnull(row['revenue']) and pd.notnull(row['roi']) and row['roi'] != -1:
        new_budget = row['revenue'] / (row['roi'] + 1)
        clean_data.at[idx, 'budget'] = new_budget
        # Recalculate ROI to ensure consistency
        clean_data.at[idx, 'roi'] = (row['revenue'] - new_budget) / new_budget
    else:
        # If not possible to fix, set budget to its absolute value (remove negative sign)
        clean_data.at[idx, 'budget'] = abs(row['budget'])
        # Recalculate ROI to ensure consistency
        if pd.notnull(row['revenue']):
            clean_data.at[idx, 'roi'] = (row['revenue'] - clean_data.at[idx, 'budget']) / clean_data.at[idx, 'budget']


Conversion rates greater than 1.0 were corrected by dividing by 100, assuming they were entered as percentages. Any remaining values above 1.0 were capped at 1.0 to ensure all conversion rates are within the valid range [0, 1].

In [393]:
# Fix conversion_rate values greater than 1.0
# Assumption: such values were entered as percentages (e.g., 1.5 means 1.5%), so divide by 100 to get 0.015
mask = clean_data['conversion_rate'] > 1
clean_data.loc[mask, 'conversion_rate'] = clean_data.loc[mask, 'conversion_rate'] / 100

# Optionally, cap any remaining values above 1.0 to 1.0
clean_data.loc[clean_data['conversion_rate'] > 1, 'conversion_rate'] = 1.0

### Fixing missing values for Budget, Revenue and ROI

Missing values in the `budget`, `revenue`, and `roi` columns were filled by recalculating them using the available values from the other columns, regardless of whether the values were negative or positive. The condition `roi != -1` was included to avoid division by zero when recalculating budget. This approach ensures that as much data as possible is retained for analysis, while maintaining consistency with the mathematical relationship between these variables.
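
The three fill rules in the next cell are rearrangements of a single relation. Starting from the ROI definition used earlier in this notebook:

$$
\mathrm{roi} = \frac{\mathrm{revenue} - \mathrm{budget}}{\mathrm{budget}}
\quad\Longleftrightarrow\quad
\mathrm{roi} + 1 = \frac{\mathrm{revenue}}{\mathrm{budget}}
$$

$$
\mathrm{budget} = \frac{\mathrm{revenue}}{\mathrm{roi} + 1} \quad (\mathrm{roi} \neq -1), \qquad
\mathrm{revenue} = \mathrm{budget}\,(\mathrm{roi} + 1)
$$

As a check with the values of row 1030 shown earlier (budget 60000, revenue 90000): roi = 30000 / 60000 = 0.5, and 90000 / (0.5 + 1) = 60000 recovers the budget. An roi of exactly -1 corresponds to zero revenue, in which case the budget cannot be recovered from the other two values.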

In [394]:
# Fix missing values in budget, revenue, and roi by recalculating when possible, regardless of sign

# Fill missing budget where revenue and roi are present and roi != -1
mask_budget = clean_data['budget'].isnull() & clean_data['revenue'].notnull() & clean_data['roi'].notnull() & (clean_data['roi'] != -1)
clean_data.loc[mask_budget, 'budget'] = clean_data.loc[mask_budget, 'revenue'] / (clean_data.loc[mask_budget, 'roi'] + 1)

# Fill missing revenue where budget and roi are present
mask_revenue = clean_data['revenue'].isnull() & clean_data['budget'].notnull() & clean_data['roi'].notnull()
clean_data.loc[mask_revenue, 'revenue'] = clean_data.loc[mask_revenue, 'budget'] * (clean_data.loc[mask_revenue, 'roi'] + 1)

# Fill missing roi where budget and revenue are present and budget != 0
mask_roi = clean_data['roi'].isnull() & clean_data['budget'].notnull() & clean_data['revenue'].notnull() & (clean_data['budget'] != 0)
clean_data.loc[mask_roi, 'roi'] = (clean_data.loc[mask_roi, 'revenue'] - clean_data.loc[mask_roi, 'budget']) / clean_data.loc[mask_roi, 'budget']

Rows with missing, invalid, or nonsensical dates (such as a `start_date` after `end_date`) were removed. All date fields were standardized to the `yyyy-mm-dd` format to ensure consistency and reliability for time-based analyses. This process improves data quality and ensures accurate temporal comparisons.

Before cleaning, rows where the `start_date` was after the `end_date` were identified and printed for review. Detecting and handling such inconsistencies is essential to ensure the reliability of any time-based analysis.

In [395]:
# Print rows where start date is after end date
invalid_date_range = clean_data[clean_data['start_date'] > clean_data['end_date']]
print(f"Invalid date ranges (start date after end date): {invalid_date_range[['start_date', 'end_date']]}")


Invalid date ranges (start date after end date):       start_date    end_date
1030  2023-03-01  2022-12-31
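
The comparison above works on strings, which orders dates correctly only while every value is in `yyyy-mm-dd` form. The sketch below shows a comparison on parsed dates and one possible way to exclude inverted ranges. The first two rows reuse dates from the output above; the third row, the variable names, and the decision to drop are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2023-03-01", "2023-05-01", None],
    "end_date": ["2022-12-31", "2023-11-01", "2023-06-01"],
})

# Parse to datetime; unparseable or missing values become NaT
start = pd.to_datetime(df["start_date"], errors="coerce")
end = pd.to_datetime(df["end_date"], errors="coerce")

# Comparisons involving NaT evaluate to False, so missing dates are not flagged here
inverted = start > end

# One possible treatment: keep only rows whose range is not inverted
valid_ranges = df[~inverted]
print(valid_ranges)
```

Rows with missing dates pass this filter and need to be handled separately.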


To identify rows with invalid date formats or impossible dates, the `start_date` and `end_date` columns were parsed using `pd.to_datetime()` with `errors='coerce'`. This approach converts any unparseable or non-existent dates (such as February 30th) to `NaT`. Rows where the parsed date is `NaT` were then printed for review, allowing for targeted correction or removal of problematic date entries.

In [396]:
# Try to convert to datetime, but keep the original for comparison
start_date_parsed = pd.to_datetime(clean_data['start_date'], errors='coerce')
invalid_start_dates = clean_data[start_date_parsed.isna()]
print("Rows with invalid start_date format:")
print(invalid_start_dates[['start_date']])

end_date_parsed = pd.to_datetime(clean_data['end_date'], errors='coerce')
invalid_end_dates = clean_data[end_date_parsed.isna()]
print("Rows with invalid end_date format:")
print(invalid_end_dates[['end_date']])

Rows with invalid start_date format:
      start_date
1006  2023-13-01
1021         NaN
1022  2023-13-01
Rows with invalid end_date format:
        end_date
1006  2024-02-30
1028         NaN


Invalid date entries such as `2023-13-01` (nonexistent month) and `2024-02-30` (nonexistent day in February) were identified and manually corrected to valid dates based on context. After correction, the date columns were re-parsed to ensure all entries are valid and usable for analysis.

In [397]:
# Manually correct the invalid start_date if you know the intended value
clean_data.loc[clean_data['start_date'] == '2023-13-01', 'start_date'] = '2023-01-13'

# Manually correct the invalid end_date, February 30 is not a valid date
# 2024 is a leap year, so it has to be corrected to 2024-02-29
clean_data.loc[clean_data['end_date'] == '2024-02-30', 'end_date'] = '2024-02-29'

# Parse start_date and end_date to datetime, coercing invalid entries to NaT
clean_data['start_date'] = pd.to_datetime(clean_data['start_date'], errors='coerce')
clean_data['end_date'] = pd.to_datetime(clean_data['end_date'], errors='coerce')

After the invalid date entries were corrected and the values parsed to datetime, the dates were formatted as `yyyy-mm-dd` strings for consistency throughout the dataset.

In [398]:
# Ensure all dates are consistent

# Format dates as yyyy-mm-dd strings for consistency
clean_data['start_date'] = clean_data['start_date'].dt.strftime('%Y-%m-%d')
clean_data['end_date'] = clean_data['end_date'].dt.strftime('%Y-%m-%d')

Missing values in the 'type' and 'target_audience' columns were filled with 'Unknown' to maintain data integrity without introducing potentially misleading assumptions.

In [399]:
# Fill missing values in 'type' and 'target_audience' with 'Unknown' to avoid introducing artificial categories
clean_data['type'] = clean_data['type'].fillna('Unknown')
clean_data['target_audience'] = clean_data['target_audience'].fillna('Unknown')

In [400]:
# Remaining columns with missing values and their counts
remaining_missing = clean_data.isnull().sum()
print(f"Remaining columns with missing values: {remaining_missing}")

# Count of remaining rows with missing values
remaining_missing_rows = clean_data.isnull().any(axis=1).sum()
print(f"Remaining rows with missing values: {remaining_missing_rows}")

# Print rows with missing values
missing_rows = clean_data[clean_data.isnull().any(axis=1)]
print(f"Rows with missing values: {missing_rows}")

# Percentage of missing values in total
missing_percentage = (clean_data.isnull().sum() / len(clean_data)) * 100
print(f"Percentage of missing values in total: {missing_percentage}")

Remaining columns with missing values: campaign_name      0
start_date         1
end_date           1
budget             3
roi                5
type               0
target_audience    0
channel            0
conversion_rate    3
revenue            2
dtype: int64
Remaining rows with missing values: 8
Rows with missing values:                                campaign_name  start_date    end_date  \
1003  Upgradable transitional data-warehouse  2023-06-29  2023-12-13   
1005           NEW CAMPAIGN - Missing Budget  2023-10-01  2024-01-15   
1021           Cloud-based scalable solution         NaN  2023-12-31   
1022                    Broken-date campaign  2023-01-13  2024-01-01   
1023                       Negative ROI test  2022-10-10  2023-05-05   
1027                     No revenue campaign  2023-02-01  2023-08-01   
1028                             Random mess  2023-06-06         NaN   
1029                          Invalid budget  2022-12-01  2023-06-01   

        budget  roi      

Rows with missing critical values (`start_date`, `end_date`, `budget`, `revenue`, `roi`, or `conversion_rate`) were removed from the dataset. This ensures that all remaining data is complete and reliable for analysis, and avoids introducing bias or errors due to incomplete records. Only a small number of rows were affected, so the overall integrity of the dataset is maintained.
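
The next cell calls `dropna()` without arguments, which drops a row if any column is missing. To restrict the check to the critical columns named above, the `subset` parameter can be used, as in this sketch on made-up data. In this notebook the remaining columns no longer contain missing values at this point, so both calls should give the same result.

```python
import pandas as pd

critical_cols = ["start_date", "end_date", "budget", "revenue", "roi", "conversion_rate"]

# Made-up rows: the first misses only a non-critical field, the second misses budget and roi
df = pd.DataFrame({
    "campaign_name": ["Keeps row", "Drops row"],
    "start_date": ["2023-02-01", "2023-10-01"],
    "end_date": ["2023-08-01", "2024-01-15"],
    "budget": [20000.0, None],
    "revenue": [30000.0, 45000.0],
    "roi": [0.5, None],
    "conversion_rate": [0.5, 0.3],
    "channel": [None, "paid"],
})

# Drop a row only if one of the critical columns is missing
complete = df.dropna(subset=critical_cols)
print(complete)
```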

In [401]:
# Drop rows with remaining missing values
clean_data = clean_data.dropna()

### Confirming progress:

- All critical columns (`start_date`, `end_date`, `budget`, `revenue`, `roi`, `conversion_rate`) have no missing values after cleaning.
- Data types are consistent: numeric columns are floats, date columns are formatted as `yyyy-mm-dd` strings.
- Categorical columns (`type`, `target_audience`, `channel`) contain only valid, expected values.
- Outliers and invalid values in `budget`, `roi`, and `conversion_rate` have been handled.

In [402]:
# Checking progress: no missing values or invalid data types

# Check data types
print(clean_data.info())

# Check for missing values in each column
print(clean_data.isnull().sum())

# Review unique values in categorical columns
cat_cols = ['type', 'target_audience', 'channel']
for col in cat_cols:
    print(f"\nUnique values in {col}:")
    print(clean_data[col].unique())

# Get summary statistics for float columns
print(clean_data.describe())

# Ensure dates are in correct format and order
print(clean_data[['start_date', 'end_date']].head())

<class 'pandas.core.frame.DataFrame'>
Index: 1023 entries, 0 to 1031
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   campaign_name    1023 non-null   object 
 1   start_date       1023 non-null   object 
 2   end_date         1023 non-null   object 
 3   budget           1023 non-null   float64
 4   roi              1023 non-null   float64
 5   type             1023 non-null   object 
 6   target_audience  1023 non-null   object 
 7   channel          1023 non-null   object 
 8   conversion_rate  1023 non-null   float64
 9   revenue          1023 non-null   float64
dtypes: float64(4), object(6)
memory usage: 87.9+ KB
None
campaign_name      0
start_date         0
end_date           0
budget             0
roi                0
type               0
target_audience    0
channel            0
conversion_rate    0
revenue            0
dtype: int64

Unique values in type:
['email' 'podcast' 'webinar' 'social medi

### Handling exact duplicated rows

To ensure data integrity, all exact duplicate rows were identified and removed from the dataset. The count printed before removal uses `keep=False`, so it includes every copy of a duplicated row, the first occurrence as well as its repeats. The row count after dropping duplicates is printed last. This step keeps each campaign record unique and prevents repeated rows from skewing the analysis.
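
The `keep` argument of `duplicated()` changes what the printed count means. The sketch below, on made-up data, contrasts `keep=False` (every copy is marked) with the default `keep='first'` (only the repeats are marked, which is the number of rows that `drop_duplicates()` removes).

```python
import pandas as pd

# Made-up data: "A" appears twice, "C" three times, "B" once
df = pd.DataFrame({
    "campaign_name": ["A", "A", "B", "C", "C", "C"],
    "budget": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
})

all_copies = int(df.duplicated(keep=False).sum())  # every row that has a twin
extra_copies = int(df.duplicated().sum())          # repeats after the first occurrence
deduped = df.drop_duplicates()

print(all_copies, extra_copies, len(deduped))
```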

In [403]:
# Find all exact duplicate rows (keep=False marks every occurrence, including the first)
duplicates = clean_data[clean_data.duplicated(keep=False)]

print(f"Number of exact duplicate rows: {duplicates.shape[0]}")
print(duplicates)

# Drop exact duplicate rows
clean_data = clean_data.drop_duplicates()
print(f"Number of rows after dropping duplicates: {clean_data.shape[0]}")

Number of exact duplicate rows: 27
                                      campaign_name  start_date    end_date  \
0               Public-key multi-tasking throughput  2023-04-01  2024-02-23   
1                De-engineered analyzing task-force  2023-02-15  2024-04-22   
2     Balanced solution-oriented Local Area Network  2022-12-20  2023-10-11   
3                 Distributed real-time methodology  2022-09-26  2023-09-27   
4               Front-line executive infrastructure  2023-07-07  2024-05-15   
5            Upgradable transitional data-warehouse  2023-06-29  2023-12-13   
6            Innovative context-sensitive framework  2023-03-01  2024-02-23   
7          User-friendly client-driven service-desk  2023-01-06  2023-12-11   
8                     Proactive neutral methodology  2022-09-06  2024-01-11   
9                      Intuitive responsive support  2022-11-25  2024-04-04   
10                Multi-lateral dedicated workforce  2023-06-15  2024-06-15   
11            Cro