# 🛠️ Mini Project: Clean and Reshape Cybersecurity Data

**Instructions:**

1. Load the provided dataset: `mini_project_threat_logs.csv`
2. Explore the dataset and identify any missing or inconsistent values.
3. Clean the data:
   - Handle missing values
   - Rename columns if needed
   - Change data types where appropriate
4. Create a new column `duration` = `end_time` - `start_time`
5. Share your final cleaned DataFrame and a brief Markdown summary below.
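The three sub-steps of step 3 can be sketched on a tiny made-up frame before touching the real file. Everything here is an assumption for illustration: the column names (`Log ID`, `Event Type`, `Severity`), the values, and the choice of a median fill, which is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Hypothetical raw frame; names and values are illustrative, not from the dataset
raw = pd.DataFrame({
    "Log ID": [1, 2, 3],
    "Event Type": ["login", "scan", "login"],
    "Severity": [2.0, None, 4.0],
})

# Rename columns to snake_case
tidy = raw.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

# Handle missing values: fill severity with the column median (one option among several)
tidy["severity"] = tidy["severity"].fillna(tidy["severity"].median())

# Change data types: whole-number severity, categorical event labels
tidy["severity"] = tidy["severity"].astype("int64")
tidy["event_type"] = tidy["event_type"].astype("category")

print(tidy.dtypes)
```

Filling keeps every row at the cost of inventing a value; dropping rows keeps only observed values at the cost of a smaller dataset. Which trade-off fits depends on the analysis.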


In [16]:
import pandas as pd

# Load dataset
df = pd.read_csv("mini_project_threat_logs.csv")

# Preview data
df.head()


|   | log_id | start_time | end_time | source_ip | event_type | severity |
|---|---|---|---|---|---|---|
| 0 | 101 | 2024-01-01 08:00:00 | 2024-01-01 08:30:00 | 192.168.1.2 | login | 2.0 |
| 1 | 102 | 2024-01-01 09:15:00 | 2024-01-01 09:45:00 | 192.168.1.3 | login | 3.0 |
| 2 | 103 | 2024-01-01 10:30:00 | NaN | 192.168.1.4 | scan | NaN |
| 3 | 104 | NaN | 2024-01-01 11:30:00 | 192.168.1.2 | malware | 4.0 |
| 4 | 105 | 2024-01-01 12:00:00 | 2024-01-01 12:45:00 | NaN | scan | 2.0 |


In [17]:
print("First 5 rows:")
print(df.head())

print("\nData types:")
print(df.dtypes)

print("\nSummary statistics:")
print(df.describe(include="all"))

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nDuplicate rows:", df.duplicated().sum())

print("\nDataset shape:", df.shape)


First 5 rows:
   log_id           start_time             end_time    source_ip event_type  \
0     101  2024-01-01 08:00:00  2024-01-01 08:30:00  192.168.1.2      login   
1     102  2024-01-01 09:15:00  2024-01-01 09:45:00  192.168.1.3      login   
2     103  2024-01-01 10:30:00                  NaN  192.168.1.4       scan   
3     104                  NaN  2024-01-01 11:30:00  192.168.1.2    malware   
4     105  2024-01-01 12:00:00  2024-01-01 12:45:00          NaN       scan   

   severity  
0       2.0  
1       3.0  
2       NaN  
3       4.0  
4       2.0  

Data types:
log_id          int64
start_time     object
end_time       object
source_ip      object
event_type     object
severity      float64
dtype: object

Summary statistics:
            log_id           start_time             end_time    source_ip  \
count     5.000000                    4                    4            4   
unique         NaN                    4                    4            3   
top            N
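The checks above cover missing values and duplicates; step 2 also asks about inconsistent values. A minimal sketch of a few such checks, run on a stand-in frame that copies the five previewed rows. The name `sample` and the helper `is_valid_ip` are hypothetical, and the three checks are only suggestions.

```python
import ipaddress
import pandas as pd

# Stand-in frame mirroring the five rows previewed above
sample = pd.DataFrame({
    "log_id": [101, 102, 103, 104, 105],
    "start_time": ["2024-01-01 08:00:00", "2024-01-01 09:15:00",
                   "2024-01-01 10:30:00", None, "2024-01-01 12:00:00"],
    "end_time": ["2024-01-01 08:30:00", "2024-01-01 09:45:00",
                 None, "2024-01-01 11:30:00", "2024-01-01 12:45:00"],
    "source_ip": ["192.168.1.2", "192.168.1.3", "192.168.1.4", "192.168.1.2", None],
    "event_type": ["login", "login", "scan", "malware", "scan"],
    "severity": [2.0, 3.0, None, 4.0, 2.0],
})

def is_valid_ip(value):
    """Hypothetical helper: True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except (ValueError, TypeError):
        return False

start = pd.to_datetime(sample["start_time"], errors="coerce")
end = pd.to_datetime(sample["end_time"], errors="coerce")

# Events that end before they start; comparisons with NaT are False,
# so rows with a missing timestamp are not flagged here
reversed_times = sample[end < start]

# Non-missing source_ip values that do not parse as IP addresses
present_ip = sample["source_ip"].dropna()
bad_ips = present_ip[~present_ip.map(is_valid_ip)]

# Label counts, useful for spotting variants such as "Login" vs "login"
event_counts = sample["event_type"].value_counts(dropna=False)

print("Reversed intervals:", len(reversed_times))
print("Unparseable IPs:", len(bad_ips))
print(event_counts)
```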

In [18]:
# Create cleaned copy
df_cleaned = df.copy()

# Convert to datetime format
df_cleaned["start_time"] = pd.to_datetime(df_cleaned["start_time"], errors="coerce")
df_cleaned["end_time"] = pd.to_datetime(df_cleaned["end_time"], errors="coerce")

# Remove rows with missing critical values
df_cleaned = df_cleaned.dropna(subset=[
    "start_time",
    "end_time",
    "severity",
    "source_ip"
])

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Reset index
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned.head()


|   | log_id | start_time | end_time | source_ip | event_type | severity |
|---|---|---|---|---|---|---|
| 0 | 101 | 2024-01-01 08:00:00 | 2024-01-01 08:30:00 | 192.168.1.2 | login | 2.0 |
| 1 | 102 | 2024-01-01 09:15:00 | 2024-01-01 09:45:00 | 192.168.1.3 | login | 3.0 |
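Step 4 of the instructions asks for a `duration` column, but the cell that creates it does not appear here. A minimal sketch of how it could be computed, assuming both time columns are already datetime. The frame `demo` is a hypothetical stand-in built from the two rows shown above, so it does not overwrite `df_cleaned`.

```python
import pandas as pd

# Stand-in for the cleaned frame, using the two rows shown above
demo = pd.DataFrame({
    "log_id": [101, 102],
    "start_time": pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 09:15:00"]),
    "end_time": pd.to_datetime(["2024-01-01 08:30:00", "2024-01-01 09:45:00"]),
})

# Subtracting two datetime columns yields a timedelta column
demo["duration"] = demo["end_time"] - demo["start_time"]

# Convert the timedelta to minutes as a plain float for easier filtering and plotting
demo["duration_minutes"] = demo["duration"].dt.total_seconds() / 60

print(demo[["log_id", "duration", "duration_minutes"]])
```

Applied to the real data, the same two assignments would target `df_cleaned` instead of `demo`.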


## ✍️ Markdown Summary:
_Summarize your data cleaning steps and what the final dataset looks like. Mention any decisions made._

In [20]:
"""
Data Cleaning Summary
1. Loaded the dataset from the CSV file with pandas.
2. Created a copy of the original DataFrame so the raw data stays untouched during cleaning.
3. Converted start_time and end_time to datetime so time-based calculations work correctly; values that cannot be parsed become NaT.
4. Dropped every row missing a critical value (start_time, end_time, severity, or source_ip) so that each remaining record is complete. This left 2 of the original 5 rows.
5. Removed duplicate rows so repeated log entries are not counted twice.
6. Reset the index so it runs consecutively after the rows were dropped.
7. Created two new columns, duration and duration_minutes, to record how long each event lasted.
The final dataset contains only complete, deduplicated records with datetime columns, ready for cybersecurity threat analysis.
"""