# Robust Data Profiler - Session 3 Mini Project

## Project Goals
Build a data profiler that:
- Accepts a file path as input
- Handles file-not-found and parse errors
- Reads CSV data
- Writes a summary report to a text file

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
# Load tips dataset from seaborn
tips = sns.load_dataset('tips')

# Save to CSV
tips.to_csv('tips_data.csv', index=False)
print("Tips dataset saved!")
print("Shape:", tips.shape)
tips.head()

Tips dataset saved!
Shape: (244, 7)


   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [3]:
def data_profiler(df):
    """
    Analyze a dataframe and return a formatted report string.
    """
    data_output = ''
    
    # Shape as (rows, columns); appended to the report, not printed
    data_output += 'Shape\n'
    data_output += str(df.shape)

    # Lists column names and dtypes
    data_output += '\n\nColumns\n'
    data_output += str(df.dtypes)

    # Counts missing values per column
    data_output += '\n\nMissing Values\n'
    data_output += str(df.isnull().sum())

    # Shows basic statistics for numeric columns (includes count, mean, std)
    data_output += '\n\nNumerical Statistics\n'
    data_output += str(df.describe())

    # Shows statistics for categorical columns
    data_output += '\n\nCategorical Statistics\n'
    data_output += str(df.describe(include='object'))

    # Shows first 5 rows
    data_output += '\n\nFirst 5 rows\n'
    data_output += str(df.head())

    # return text
    return data_output


def robust_data_profiler(file_name, output_path='data_profile.txt'):
    """
    Read a CSV file and create a profile report.
    Handles errors and writes results to a text file.
    """
    # pandas read inside of try block
    try:
        df = pd.read_csv(file_name)
        print("File loaded successfully!")
        
    except FileNotFoundError:
        print("ERROR: File not found!")
        print("Please check the file path.")
        return None
    
    except pd.errors.ParserError:
        print("ERROR: Could not parse the file!")
        print("The file may be corrupted or not a valid CSV.")
        return None
    
    except Exception as e:
        print("ERROR: Something went wrong!")
        print("Error message:", str(e))
        return None

    # Build output header (str() so a pathlib.Path input also works)
    data_output = 'Data Source: '
    data_output += str(file_name)
    data_output += '\n\n'
    
    # Call data profiler
    data_output += data_profiler(df)

    # Write results to file
    with open(output_path, "w") as f:
        f.write(data_output)
    
    print(f"Report written to: {output_path}")
    
    # Return text
    return data_output
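
One edge case worth noting: `df.describe(include='object')` raises a `ValueError` when the DataFrame has no object columns, so `data_profiler` would fail on an all-numeric CSV. A minimal sketch of a guard follows. The helper name `categorical_summary`, the choice to treat both `object` and `category` dtypes as categorical, and the placeholder message are all assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def categorical_summary(df):
    """Hypothetical helper: describe object/category columns, or report that there are none."""
    # Assumption: object and category dtypes are what counts as "categorical" here
    cat_cols = df.select_dtypes(include=['object', 'category'])

    # With zero matching columns there is nothing for describe() to summarize
    if cat_cols.shape[1] == 0:
        return 'No categorical columns'

    # include='all' reports count/unique/top/freq for the selected columns
    return str(cat_cols.describe(include='all'))


numeric_only = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})
mixed = pd.DataFrame({
    'a': [1, 2, 3],
    'day': pd.Series(['Sat', 'Sun', 'Sat'], dtype=object),
})

print(categorical_summary(numeric_only))
print(categorical_summary(mixed))
```

Inside `data_profiler`, the categorical step could call a helper like this instead of calling `describe(include='object')` directly.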

In [4]:
# Run the profiler with default output path
result = robust_data_profiler('tips_data.csv')
print(result)

File loaded successfully!
Report written to: data_profile.txt
Data Source: tips_data.csv

Shape
(244, 7)

Columns
total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

Missing Values
total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

Numerical Statistics
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

Categorical Statistics
         sex smoker  day    time
count    244    244  244     244
unique     2      2    4       2
top     Male     No  Sat  Dinner
freq     157    151  

In [5]:
# Read and display the report file
with open('data_profile.txt', 'r') as f:
    report = f.read()
    print(report)

Data Source: tips_data.csv

Shape
(244, 7)

Columns
total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object

Missing Values
total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

Numerical Statistics
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

Categorical Statistics
         sex smoker  day    time
count    244    244  244     244
unique     2      2    4       2
top     Male     No  Sat  Dinner
freq     157    151   87     176

First 5 rows
   total_bill   tip     sex smoker  

## Test Error Handling

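The cells below check how `robust_data_profiler` responds to a missing file and to an empty file, then confirm that a custom output path works.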

In [6]:
# Test with a file that doesn't exist
print("Test 1: File not found")
print("=" * 50)
result = robust_data_profiler('nonexistent_file.csv')
print("Result:", result)

Test 1: File not found
==================================================
ERROR: File not found!
Please check the file path.
Result: None


In [7]:
# Test with an empty file
print("Test 2: Empty file")
print("=" * 50)

# Create an empty file
with open('empty.csv', 'w') as f:
    pass

result = robust_data_profiler('empty.csv')
print("Result:", result)

Test 2: Empty file
==================================================
ERROR: Something went wrong!
Error message: No columns to parse from file
Result: None
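
The message above comes from `pd.errors.EmptyDataError`. It is not a subclass of `pd.errors.ParserError`, so the empty file falls through to the generic `except Exception` branch. If a more specific message is wanted, the exception can be caught by name. A minimal sketch, assuming a hypothetical helper `classify_read_error` that is not part of the notebook:

```python
import os
import tempfile

import pandas as pd


def classify_read_error(path):
    """Hypothetical helper: return a short label for the outcome of pd.read_csv(path)."""
    try:
        pd.read_csv(path)
    except FileNotFoundError:
        return 'not found'
    except pd.errors.EmptyDataError:
        # Raised for a zero-byte file ("No columns to parse from file")
        return 'empty'
    except pd.errors.ParserError:
        return 'parse error'
    return 'ok'


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, 'empty.csv')
    open(empty_path, 'w').close()  # create a zero-byte file

    empty_label = classify_read_error(empty_path)
    missing_label = classify_read_error(os.path.join(tmp, 'missing.csv'))

print(empty_label, missing_label)
```

The same `except pd.errors.EmptyDataError` clause could be placed before the generic handler in `robust_data_profiler`.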


In [8]:
# Test with custom output path
print("\nTest 3: Custom output path")
print("=" * 50)
result = robust_data_profiler('tips_data.csv', output_path='custom_report.txt')
print("\nReport saved to custom location!")


Test 3: Custom output path
==================================================
File loaded successfully!
Report written to: custom_report.txt

Report saved to custom location!
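
The tests above do not exercise the `pd.errors.ParserError` branch. One way to trigger it, sketched here under the assumption that pandas is using its default C parser, is a CSV in which a data row has more fields than the header. The file name `malformed.csv` and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# The header declares two fields, but the last row has three
malformed = "a,b\n1,2\n3,4,5\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'malformed.csv')
    with open(path, 'w') as f:
        f.write(malformed)

    try:
        pd.read_csv(path)
        outcome = 'parsed'
    except pd.errors.ParserError as e:
        outcome = 'ParserError'
        print("Caught:", e)

print("Outcome:", outcome)
```

Passing a file like this to `robust_data_profiler` should reach the "Could not parse the file!" branch and return `None`.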
