# Data Cleaning and Exploratory Data Analysis (EDA)

This notebook performs data cleaning and exploratory data analysis (EDA) on the cruise ship dataset. It uses a cleaning function to prepare the data for both Vessel 1 and Vessel 2; the cleaned data can then be used in other notebooks for specific analyses.




**Note:** To explore the data, run the EDA cells below. To produce the cleaned data, go to the Data Cleaning Function section; that cell must be executed before any further processing.


### Exploratory Data Analysis (EDA)




#### Instructions:
1. Set `file_path` in the cell below to the location of the dataset file.
2. Check that the path is valid on your local system before running the remaining cells.

The following cells will perform EDA on the data for both vessels. The analysis includes:
1. Plotting the number of missing values per column.
2. Plotting histograms for numerical columns (these calls are commented out in the cell below; uncomment them to enable the plots).
3. Plotting boxplots for numerical columns.
4. Creating a correlation heatmap.
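
The numbers behind the missing-value plot and the correlation heatmap can also be inspected directly with pandas. The sketch below is a minimal illustration, not the code in `utilities.py`: `eda_tables` is a hypothetical helper, and the toy values are invented for the example.

```python
import pandas as pd


def eda_tables(df: pd.DataFrame):
    """Return missing-value counts per column and the numeric correlation matrix.

    Hypothetical helper for illustration; not part of utilities.py.
    """
    # Count missing entries in every column, largest first
    missing = df.isna().sum().sort_values(ascending=False)
    # Correlations are only defined for numeric columns
    corr = df.select_dtypes(include="number").corr()
    return missing, corr


# Toy data (invented values) using column names from the dataset
toy = pd.DataFrame({
    "Vessel Name": ["Vessel 1"] * 4,
    "Propulsion Power (MW)": [1.0, 2.0, None, 4.0],
    "Speed Through Water (knots)": [5.0, 10.0, 15.0, 20.0],
})

missing, corr = eda_tables(toy)
print(missing)
print(corr)
```

Looking at these tables alongside the plots makes it easier to read exact values that are hard to judge from bar heights or colour shades.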

In [2]:
file_path = r"C:\Users\jeeva\Downloads\Cruises_performance_analysis\Cruise_ship_analysis\data\data.csv"

In [None]:

# Import the helper functions from utilities.py

from utilities import *

# Load the dataset
try:
    data = load_data(file_path)
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    raise  # Stop here: the steps below need `data` to be defined

# Split the data by vessel
vessel1_data = data[data['Vessel Name'] == 'Vessel 1']
vessel2_data = data[data['Vessel Name'] == 'Vessel 2']

# Perform EDA for vessel 1
plot_missing_values(vessel1_data, 'Number of Missing Values per Column Vessel1 data')
# Perform EDA for vessel 2
plot_missing_values(vessel2_data, 'Number of Missing Values per Column Vessel2 data')

# Plotting for Vessel 1
#plot_histograms(vessel1_data, 'Histograms for Vessel 1 Data')
plot_boxplots(vessel1_data, 'Boxplots for Vessel 1 Data')
plot_correlation_heatmap(vessel1_data, 'Correlation Heatmap for Vessel 1 Data')

# Plotting for Vessel 2
#plot_histograms(vessel2_data, 'Histograms for Vessel 2 Data')
plot_boxplots(vessel2_data, 'Boxplots for Vessel 2 Data')
plot_correlation_heatmap(vessel2_data, 'Correlation Heatmap for Vessel 2 Data')

### EDA Summary
*Correlations*

*Assumptions* (interpretations of the correlations seen in the heatmaps):
Diesel Generator Power and Main Engine Fuel Flow Rate: Reflects essential energy conversion efficiency.

Diesel Generator Power with Propulsion, Speed, and Power Service: Critical support for ship-wide power demands.

Sea Temperature and HVAC Chillers Power: Illustrates increased cooling demand with rising temperatures.

Scrubber Power with Propulsion and Speed: Highlights enhanced scrubber system usage during high propulsion scenarios.

Power Service with Propulsion and Speed: Shows heightened power service requirements during increased propulsion operations.

Power Service and Boiler Flow Rates: Indicates an integrated energy management system optimizing efficiency.

Scrubber Power and Boiler Flow Rates: Boilers reduce scrubber power demand by generating steam.

Diesel Generator Power and Boiler Flow Rates: Intensive boiler use reduces reliance on diesel generators.

Propulsion Power and Boiler Flow Rates: High propulsion demands prioritize main engines over boilers.

Speed Through Water and Boiler Flow Rates: Increased speeds reduce boiler use for auxiliary functions.

Main Engine Fuel Flow Rates and Boiler Flow Rates: Higher fuel consumption for propulsion decreases boiler operations.
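
To check interpretations like these against the numbers, it can help to list the most strongly correlated column pairs together with their signs. The following is a minimal sketch under stated assumptions: `strongest_pairs` is a hypothetical helper (not part of `utilities.py`), and the toy values are invented.

```python
from itertools import combinations

import pandas as pd


def strongest_pairs(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """List column pairs ranked by absolute Pearson correlation.

    Hypothetical helper for illustration only.
    """
    corr = df.select_dtypes(include="number").corr()
    # Visit each unordered pair of columns once
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "correlation"])
    # Rank by strength, keeping the sign so negative relationships stay visible
    pairs = pairs.dropna().sort_values(
        "correlation", key=lambda s: s.abs(), ascending=False
    )
    return pairs.head(top_n).reset_index(drop=True)


# Toy data (invented values) using column names from the dataset
toy = pd.DataFrame({
    "Propulsion Power (MW)": [1.0, 2.0, 3.0, 4.0],
    "Speed Through Water (knots)": [5.0, 10.0, 15.0, 20.0],
    "Boiler 1 Fuel Flow Rate (L/h)": [8.0, 6.0, 5.0, 2.0],
})

pairs = strongest_pairs(toy)
print(pairs)
```

A correlation only shows that two signals move together; the explanations listed above remain assumptions until checked against how the systems are actually operated.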


*Selected Columns and Missing Value Handling*

Chosen Columns: The selected columns are those needed to categorize the data by vessel and time interval, monitor power consumption, and track operational efficiency.

Missing Values Handling Strategy: Missing values are filled with the column mean. This keeps every row available for analysis, although it reduces the variance of the imputed columns.
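
The mean-fill strategy can be sketched in a few lines of pandas. This is an illustration only: `fill_numeric_with_mean` is a hypothetical name, the actual implementation inside `clean_data` is not shown in this notebook, and the toy values are invented.

```python
import pandas as pd


def fill_numeric_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in numeric columns with each column's mean.

    Hypothetical helper; non-numeric columns are left unchanged.
    """
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    # fillna with a Series of means fills each column with its own mean
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


# Toy data (invented values)
toy = pd.DataFrame({
    "Vessel Name": ["Vessel 1", None, "Vessel 1"],
    "Scrubber Power (MW)": [0.2, None, 0.4],
})

filled = fill_numeric_with_mean(toy)
print(filled)
```

Working on a copy keeps the raw frame intact, so the imputed and original values can still be compared afterwards.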



## Data Cleaning Function

We use the `clean_data` function from `utilities`, which takes the raw data and returns cleaned data. It is applied separately to Vessel 1 and Vessel 2.
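
The internals of `clean_data` are not shown in this notebook. As a rough sketch of what a cleaning step with the same three arguments could look like, the example below selects the required columns and parses the date columns. `clean_vessel_frame` is a hypothetical name, and the specific steps (dropping rows with unparseable timestamps, sorting by start time) are assumptions rather than the behaviour of the real function.

```python
import pandas as pd


def clean_vessel_frame(df, necessary_columns, date_columns):
    """Select columns, parse dates, and drop rows with invalid timestamps.

    Hypothetical stand-in for illustration; not the clean_data from utilities.py.
    """
    cleaned = df[necessary_columns].copy()
    for col in date_columns:
        # Unparseable timestamps become NaT instead of raising an error
        cleaned[col] = pd.to_datetime(cleaned[col], errors="coerce")
    # Assumption: rows without valid timestamps cannot be placed in time, so drop them
    cleaned = cleaned.dropna(subset=date_columns)
    return cleaned.sort_values(date_columns[0]).reset_index(drop=True)


# Toy data (invented values)
raw = pd.DataFrame({
    "Start Time": ["2023-01-01 00:05", "not a date", "2023-01-01 00:00"],
    "End Time": ["2023-01-01 00:10", "2023-01-01 00:10", "2023-01-01 00:05"],
    "Vessel Name": ["Vessel 1"] * 3,
    "Scrubber Power (MW)": [0.2, 0.3, 0.4],
    "Unused Column": [1, 2, 3],
})

columns = ["Start Time", "End Time", "Vessel Name", "Scrubber Power (MW)"]
cleaned = clean_vessel_frame(raw, columns, ["Start Time", "End Time"])
print(cleaned)
```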


#### Instructions:
1. Set `file_path` in the cell below to the location of the dataset file.
2. Check that the path is valid on your local system before running the cleaning cell.

In [3]:
file_path = r"C:\Users\jeeva\Downloads\Cruises_performance_analysis\Cruise_ship_analysis\data\data.csv"

In [4]:
# Import the helper functions from utilities.py

from utilities import *

# Define the necessary columns and date columns
necessary_columns = [
    'Start Time', 'End Time', 'Vessel Name',
    'Power Galley 1 (MW)', 'Power Galley 2 (MW)', 'Power Service (MW)',
    'HVAC Chiller 1 Power (MW)', 'HVAC Chiller 2 Power (MW)', 'HVAC Chiller 3 Power (MW)',
    'Scrubber Power (MW)', 'Sea Temperature (Celsius)',
    'Boiler 1 Fuel Flow Rate (L/h)', 'Boiler 2 Fuel Flow Rate (L/h)',
    'Incinerator 1 Fuel Flow Rate (L/h)', 'Diesel Generator 1 Power (MW)',
    'Diesel Generator 2 Power (MW)', 'Diesel Generator 3 Power (MW)',
    'Diesel Generator 4 Power (MW)',
    'Speed Over Ground (knots)', 'Speed Through Water (knots)',
    'Propulsion Power (MW)', 'Port Side Propulsion Power (MW)',
    'Starboard Side Propulsion Power (MW)', 'Bow Thruster 1 Power (MW)',
    'Bow Thruster 2 Power (MW)', 'Bow Thruster 3 Power (MW)',
    'Stern Thruster 1 Power (MW)', 'Stern Thruster 2 Power (MW)',
    'Main Engine 1 Fuel Flow Rate (kg/h)', 'Main Engine 2 Fuel Flow Rate (kg/h)',
    'Main Engine 3 Fuel Flow Rate (kg/h)', 'Main Engine 4 Fuel Flow Rate (kg/h)'
]

date_columns = ['Start Time', 'End Time']

# Load the dataset
data = load_data(file_path)

# Keep only the necessary columns
data = filter_columns(data, necessary_columns)

# Clean the data
vessel1_data_cleaned = clean_data(data[data['Vessel Name'] == 'Vessel 1'], necessary_columns, date_columns)
vessel2_data_cleaned = clean_data(data[data['Vessel Name'] == 'Vessel 2'], necessary_columns, date_columns)



# Save the cleaned data to CSV files for use in other notebooks
vessel1_data_cleaned.to_csv('../data/vessel1_cleaned.csv', index=False)
vessel2_data_cleaned.to_csv('../data/vessel2_cleaned.csv', index=False)

print("Data cleaning complete. Cleaned data saved for Vessel 1 and Vessel 2.")


Data cleaning complete. Cleaned data saved for Vessel 1 and Vessel 2.
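
When the cleaned files are read back in another notebook, the date columns come back as plain strings unless they are parsed again, because CSV does not store column types. The sketch below shows one way to reload them; `load_cleaned` is a hypothetical helper, and the round trip uses a temporary file with invented values rather than the real `../data/vessel1_cleaned.csv`.

```python
import os
import tempfile

import pandas as pd


def load_cleaned(path, date_columns=("Start Time", "End Time")):
    """Read a cleaned CSV and restore the datetime type of the date columns.

    Hypothetical helper for illustration only.
    """
    return pd.read_csv(path, parse_dates=list(date_columns))


# Round trip with toy data (invented values) in a temporary directory
sample = pd.DataFrame({
    "Start Time": pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:05"]),
    "End Time": pd.to_datetime(["2023-01-01 00:05", "2023-01-01 00:10"]),
    "Scrubber Power (MW)": [0.2, 0.3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vessel1_cleaned.csv")
    sample.to_csv(path, index=False)
    reloaded = load_cleaned(path)

print(reloaded.dtypes)
```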


_____