
# **CS5901 - Assignment 2 - Stage 1**
*This .py file provides functions for cleaning the data in the CSV file.*

### **Stage 1.1** - Import Data
*Import the CSV file from the GitHub repository and load it into a Pandas DataFrame.*

In [31]:
# Initialize Google Drive when saved to G-Drive
#from google.colab import drive
#drive.mount('/content/drive')

import pandas as pd

def import_data(file_path):
  """
  Import data from a CSV file and load it into a Pandas DataFrame.

  Args:
    file_path: location path of the CSV file
  Returns:
    df_loaded: a Pandas DataFrame of the data in the CSV file,
      or None (after printing an error message) if loading fails
  """

  # Load the CSV file into a Pandas DataFrame using a tab delimiter
  try:
    df_loaded = pd.read_csv(file_path, delimiter='\t')
    # Report the shape of the frame just loaded (not the global df)
    print(f"Data loaded successfully. Dimensions:{df_loaded.shape}")
    return df_loaded
  except Exception as e:
    print(f"Error loading data: {e}")
    return None

# File path of CSV file in Google-Drive
#file_path = '/content/drive/My Drive/P2data6332.csv'

#GitHub filepath
file_path = 'https://raw.githubusercontent.com/jsp289/CS5901_Assignment2/refs/heads/main/P2data6332.csv'

df = import_data(file_path)
df.head()

Data loaded successfully. Dimensions:(482, 5)


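| Unnamed: 0 | Level | T4 | T3 | T3adjusted | T4adjusted |
|---|---|---|---|---|---|
| 0 | 5 | 8.1 | 2.1 | 2.008299 | 1.280579 |
| 1 | 5 | 8.7 | NaN | 2.05671 | NaN |
| 2 | 20 | 7.9 | 4.6 | 1.991632 | 1.663103 |
| 3 | 30 | 2.3 | 0.4 | 1.320006 | 0.736806 |
| 4 | 20 | 5.4 | 2.6 | 1.754411 | 1.375069 |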
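
The preview above contains missing values, so it can help to count them right after loading. The following is a minimal sketch on a hypothetical in-memory sample (`sample_tsv` is made up for illustration and is not the assignment file). It assumes the first column of the tab-delimited data is an unnamed row index, which `index_col=0` turns into the DataFrame index.

```python
import io
import pandas as pd

# Hypothetical tab-delimited sample mimicking the layout of the preview (not the real file)
sample_tsv = "\tLevel\tT4\tT3\n0\t5\t8.1\t2.1\n1\t5\t8.7\t\n"

# sep='\t' splits on tabs; index_col=0 uses the first (unnamed) column as the row index
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

# Count missing values per column before any cleaning
missing_per_column = sample_df.isna().sum()

print(sample_df.shape)
print(missing_per_column)
```

Because `import_data` returns `None` on failure, a check such as `if df is not None:` before calling `df.head()` would avoid an `AttributeError` when the file cannot be read.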


---
### **Stage 1.2** - Remove Nonsensical Rows
*Here we drop rows that contain negative values or outliers, where outliers are identified with the interquartile range (IQR) method.*

In [43]:
def remove_nonsensical_rows(df):
  """
  Remove rows containing negative values or outliers (interquartile range method)
  from the data frame generated in Stage 1.1.

  Args:
    df: the Pandas dataframe generated in Stage 1.1
  Returns:
    df_cleaned: a Pandas dataframe with negative-value and outlier rows removed
    invalid_rows: a Pandas dataframe of the removed rows
  """

  # Measurement columns checked for negative values and outliers
  cols = ['T3', 'T4', 'T3adjusted', 'T4adjusted']

  # Flag rows with a negative value in any measurement column
  negative_mask = (df[cols] < 0).any(axis=1)

  # Compute the per-column quartiles and interquartile range (IQR)
  Q1 = df[cols].quantile(0.25)
  Q3 = df[cols].quantile(0.75)
  IQR = Q3 - Q1

  # Calculate outlier boundaries
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR

  # Flag rows with a value outside the boundaries in any measurement column
  outlier_mask = ((df[cols] < lower_bound) | (df[cols] > upper_bound)).any(axis=1)

  # Combine both masks so a row that is both negative and an outlier is listed only once
  invalid_rows = df[negative_mask | outlier_mask]

  # Drop the invalid rows
  df_cleaned = df.drop(invalid_rows.index)

  return df_cleaned, invalid_rows

#df_cleaned, invalid_rows = remove_nonsensical_rows(df)
#df_cleaned.shape
#invalid_rows.shape
#df_cleaned.head()
#invalid_rows.head()


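| Unnamed: 0 | Level | T4 | T3 | T3adjusted | T4adjusted |
|---|---|---|---|---|---|
| 11 | 10 | -1.6 | 0.1 | -1.169607 | 0.464159 |
| 33 | 5 | -6.4 | -1.5 | -1.856636 | -1.144714 |
| 35 | 40 | -20.0 | -38.4 | -2.714418 | -3.373731 |
| 36 | 20 | -2.8 | -1.2 | -1.40946 | -1.062659 |
| 50 | 30 | -1.6 | 0.5 | -1.169607 | 0.793701 |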
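
To make the IQR boundaries concrete, here is a small worked sketch on a single hypothetical series (`values` is toy data, not taken from the assignment file). It applies the same 1.5 × IQR rule used in `remove_nonsensical_rows`, relying on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Hypothetical toy values, chosen so the quartiles are easy to check by hand
values = pd.Series([1, 2, 3, 4, 5, 100])

# Quartiles with pandas' default linear interpolation
q1 = values.quantile(0.25)   # 2.25
q3 = values.quantile(0.75)   # 4.75
iqr = q3 - q1                # 2.5

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are treated as outliers
lower_bound = q1 - 1.5 * iqr  # -1.5
upper_bound = q3 + 1.5 * iqr  # 8.5

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers.tolist())  # [100]
```

In this toy case the lower bound is negative, so the IQR rule alone would not flag a small negative reading. That is one reason to keep the explicit negative-value check as a separate step.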
