# Lab-3 project - Vehicle Data Analysis and Visualization
### Team member(s):
- Orchlon Chinbat (50291063)
### Submission date:
- 10/19/2026

## Introduction
We are working with a large CSV file (`vehicles_data.csv`) containing traffic violation records from Washington, D.C., covering the years 2013 to 2016. The loaded file contains 803,889 records in total. Our goal is to analyze this data and present the results through various charts and graphs.

During the analysis phase, we will find the 6th newest model year and the most expensive vehicle(s) from that year, as well as the 4th oldest model year and the least expensive vehicle(s) from that year. We will also count and display the missing values in each column, then save the analysis results to new CSV files.

After cleaning, we will create at least five distinct visualizations to highlight statistical insights and key findings. These graphs may include comparisons such as maximum year or price against vehicle make or model, with each visualization accompanied by a markdown explanation describing what it represents.


# Part 1: Data analysis section

### Find the 6th newest production year among all vehicles, then within that year, determine the vehicle(s) that have the highest price.

In [30]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('vehicles_data.csv', encoding = "ISO-8859-1")

# Convert Year to numeric, coercing errors to NaN
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

# Filter out unrealistic years (keep only years between 1900 and 2025)
df_year_clean = df[(df['Year'] >= 1900) & (df['Year'] <= 2025)].copy()

# The filtered years still contain duplicates and are unsorted, so take the unique values and sort them newest first.
df_sorted_unique_years = sorted(df_year_clean['Year'].unique(), reverse=True)

# Get the 6th newest year
sixth_newest_year = df_sorted_unique_years[5]
sixth_newest_year




np.float64(2020.0)
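The indexing step above assumes that at least six distinct years survive the filter. As a minimal sketch, the same logic can be wrapped in a helper that returns `None` instead of raising an `IndexError` when there are too few distinct years. The function name `nth_distinct_year` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

def nth_distinct_year(years, n, newest=True, lo=1900, hi=2025):
    """Return the n-th newest (or oldest) distinct year, or None if fewer than n exist."""
    # Coerce non-numeric entries to NaN, then drop them
    numeric = pd.to_numeric(years, errors="coerce").dropna()
    # Keep only plausible years (same bounds as the cell above)
    plausible = numeric[(numeric >= lo) & (numeric <= hi)]
    # Distinct values, sorted newest-first or oldest-first
    distinct = sorted(plausible.unique(), reverse=newest)
    return distinct[n - 1] if len(distinct) >= n else None

# Toy data (hypothetical): includes a string year, a bad value and two out-of-range years
toy_years = pd.Series([2015, "2014", 2015, "n/a", 3000, 2013, 2012, 1899])

second_newest = nth_distinct_year(toy_years, 2)
second_oldest = nth_distinct_year(toy_years, 2, newest=False)
too_far = nth_distinct_year(toy_years, 10)
print(second_newest, second_oldest, too_far)
```

On the real data, `nth_distinct_year(df['Year'], 6)` would be expected to match `sixth_newest_year`, since it applies the same coercion and bounds.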

Having found the 6th newest year (2020), we now find the vehicle(s) with the highest price in that year.

In [69]:
# Find the vehicle(s) with the highest price in the 6th newest year
sixth_newest_year_cars_df = df_year_clean[(df_year_clean['Year'] == sixth_newest_year)]

# Find the maximum price for vehicles of this year
sixth_newest_most_expensive = sixth_newest_year_cars_df['Price'].max()

# Find the vehicle(s) with the maximum price
sixth_newest_expensive_df = sixth_newest_year_cars_df[(sixth_newest_year_cars_df['Price'] == sixth_newest_most_expensive)]

sixth_newest_expensive_df


Unnamed: 0,Date Of Stop,Time Of Stop,Description,Location,VehicleType,Year,Make,Model,Color,Violation Type,Race,Gender,Price
103322,8/23/15,21:53:00,FAILURE TO DISPLAY TWO LIGHTED FRONT LAMPS WHE...,RIVER RD / CARDEROCK SPRINGS DR,02 - Automobile,2020.0,HONDA,CIVI,SILVER,Warning,BLACK,M,9322.0


### Identify the vehicle(s) with the 4th oldest production year and determine their minimum price.

In [70]:
# Find 4th oldest year
fourth_oldest_year = sorted(df_year_clean['Year'].unique(), reverse=False)[3]

# Find the vehicle(s) with the 4th oldest production year
fourth_oldest_cars_df = df_year_clean[(df_year_clean['Year'] == fourth_oldest_year)]

# Find the minimum price for vehicles of this year
fourth_oldest_min_price = fourth_oldest_cars_df['Price'].min()

# Find the vehicle(s) with the minimum price
fourth_min_price_car_df = fourth_oldest_cars_df[(fourth_oldest_cars_df['Price'] == fourth_oldest_min_price)]

fourth_min_price_car_df

Unnamed: 0,Date Of Stop,Time Of Stop,Description,Location,VehicleType,Year,Make,Model,Color,Violation Type,Race,Gender,Price
746093,3/24/14,22:11:00,DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC...,GEORGIA AND UNIVERSITY BLVD,02 - Automobile,1925.0,HONDA,CIVIC,GRAY,Warning,WHITE,F,500.0
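Both lookups above follow the same pattern: filter to one year, compute the extreme price, then keep every row that matches it, so ties are preserved. A minimal sketch of that pattern as a single helper is shown below. The name `rows_with_extreme_price` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def rows_with_extreme_price(frame, year, how="max"):
    """Return every row of `year` whose Price equals that year's max (or min)."""
    in_year = frame[frame["Year"] == year]
    # Series.max() and Series.min() skip NaN by default, so missing prices are ignored
    target = in_year["Price"].max() if how == "max" else in_year["Price"].min()
    # Comparing with == keeps all tied rows, not just the first one
    return in_year[in_year["Price"] == target]

# Toy data (hypothetical): two vehicles tie for the highest 2020 price
toy = pd.DataFrame({
    "Year":  [2020, 2020, 2020, 1925, 1925],
    "Make":  ["HONDA", "FORD", "TOYOTA", "HONDA", "FORD"],
    "Price": [9000.0, 9000.0, None, 500.0, 700.0],
})

priciest_2020 = rows_with_extreme_price(toy, 2020, "max")
cheapest_1925 = rows_with_extreme_price(toy, 1925, "min")
print(priciest_2020)
print(cheapest_1925)
```

Using an equality filter rather than `idxmax()` is a deliberate choice here: `idxmax()` returns only the first matching row, while the task asks for the vehicle(s), which may be more than one.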


### Identify and count the number of missing values.

In [83]:
# Count missing values for each column
missing_values = df.isnull().sum()
# Calculate the percentage of missing values for each column (reusing the counts above)
missing_percentage = ((missing_values / len(df)) * 100).round(2)

# Calculate the total number of missing values
total_missing_value = missing_values.sum()

# Find the rows with missing values
rows_with_missing = df[df.isnull().any(axis=1)]
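The total number of missing values counts individual cells, while `rows_with_missing` counts rows, and the two can differ when one row is missing several columns. The sketch below illustrates the difference on a small toy DataFrame. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 1 is missing three cells, row 3 is missing one
toy = pd.DataFrame({
    "Make":  ["HONDA", None, "FORD", None],
    "Year":  [2015.0, np.nan, 2013.0, 2012.0],
    "Price": [9000.0, np.nan, 500.0, 700.0],
})

# Number of missing cells across the whole table
missing_cells = int(toy.isnull().sum().sum())
# Number of rows with at least one missing cell
incomplete_rows = int(toy.isnull().any(axis=1).sum())
# Rows with no missing cell at all
complete_rows = len(toy) - incomplete_rows

print(missing_cells, incomplete_rows, complete_rows)
```

Subtracting `missing_cells` from the row count would give 0 complete rows here, although two rows are fully populated, so a count of complete records should be based on rows rather than cells.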


### Convert the findings into DataFrame tables and print them.

In [107]:
# Build separate DataFrames for the different analysis results (previewed here, saved to CSV later)

# 1. Vehicle Analysis Results (6th newest and 4th oldest)
vehicle_results = []

# Add 6th newest vehicle information
for idx, row in sixth_newest_expensive_df.iterrows():
    vehicle_results.append({
        'Analysis_Type': '6th Newest Year - Most Expensive',
        'Vehicle_Type': row['VehicleType'],
        'Make': row['Make'],
        'Year': row['Year'],
        'Model': row['Model'],
        'Color': row['Color'],
        'Price': row['Price'],
        'Year_Rank': '6th Newest',
    })

# Add 4th oldest vehicle information
for idx, row in fourth_min_price_car_df.iterrows():
    vehicle_results.append({
        'Analysis_Type': '4th Oldest Year - Least Expensive',
        'Vehicle_Type': row['VehicleType'],
        'Make': row['Make'],
        'Year': row['Year'],
        'Model': row['Model'],
        'Color': row['Color'],
        'Price': row['Price'],
        'Year_Rank': '4th Oldest',
    })

# Create vehicle analysis dataframe
vehicle_df = pd.DataFrame(vehicle_results)

# 2. Missing Values Analysis Results
missing_results = []

# Add missing values summary for each column
for col, missing_count, missing_pct in zip(missing_values.index, missing_values.values, missing_percentage.values):
    missing_results.append({
        'Column_Name': col,
        'Missing_Count': missing_count,
        'Missing_Percentage': missing_pct,
        'Total_Records': len(df),
        'Complete_Records': len(df) - missing_count,
    })

# Add overall summary
missing_results.append({
    'Column_Name': 'TOTAL_DATASET',
    'Missing_Count': total_missing_value,
    'Missing_Percentage': round((total_missing_value / (len(df) * len(df.columns))) * 100, 2),
    'Total_Records': len(df),
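    # Complete records are rows with no missing value in any column;
    # total_missing_value counts cells, so it cannot be subtracted from the row count.
    'Complete_Records': len(df) - len(rows_with_missing),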
})

# Create missing values dataframe
missing_df = pd.DataFrame(missing_results)


# Display preview of both files
print("\nVEHICLE ANALYSIS PREVIEW")
print(vehicle_df.to_string(index=False))

print("\n\nMISSING VALUES ANALYSIS PREVIEW")
print(missing_df.to_string(index=False))



VEHICLE ANALYSIS PREVIEW
                    Analysis_Type    Vehicle_Type  Make   Year Model  Color  Price  Year_Rank
 6th Newest Year - Most Expensive 02 - Automobile HONDA 2020.0  CIVI SILVER 9322.0 6th Newest
4th Oldest Year - Least Expensive 02 - Automobile HONDA 1925.0 CIVIC   GRAY  500.0 4th Oldest


MISSING VALUES ANALYSIS PREVIEW
   Column_Name  Missing_Count  Missing_Percentage  Total_Records  Complete_Records
  Date Of Stop              0                0.00         803889            803889
  Time Of Stop              0                0.00         803889            803889
   Description              5                0.00         803889            803884
      Location            831                0.10         803889            803058
   VehicleType            829                0.10         803889            803060
          Year           6844                0.85         803889            797045
          Make            872                0.11         803889            8
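The per-column summary above is built row by row in a loop. As an alternative sketch, the same columns can be assembled in one step from the result of `isnull().sum()`. The toy DataFrame below is hypothetical, and this is only one possible way to write it, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "Make":  ["HONDA", None, "FORD", None],
    "Year":  [2015.0, np.nan, 2013.0, 2012.0],
    "Price": [9000.0, np.nan, 500.0, 700.0],
})

# Missing cells per column, indexed by column name
counts = toy.isnull().sum()

# Build the summary table directly from arrays; the scalar is broadcast to every row
summary = pd.DataFrame({
    "Column_Name": counts.index,
    "Missing_Count": counts.values,
    "Missing_Percentage": (counts.values / len(toy) * 100).round(2),
    "Total_Records": len(toy),
    "Complete_Records": len(toy) - counts.values,
})

print(summary.to_string(index=False))
```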

### Save the analysis results to CSV files

In [109]:
# Create separate CSV files for different analysis results

# 1. Vehicle Analysis Results (6th newest and 4th oldest)
vehicle_results = []

# Add 6th newest vehicle information
for idx, row in sixth_newest_expensive_df.iterrows():
    vehicle_results.append({
        'Analysis_Type': '6th Newest Year - Most Expensive',
        'Vehicle_Type': row['VehicleType'],
        'Make': row['Make'],
        'Year': row['Year'],
        'Model': row['Model'],
        'Color': row['Color'],
        'Price': row['Price'],
    })

# Add 4th oldest vehicle information
for idx, row in fourth_min_price_car_df.iterrows():
    vehicle_results.append({
        'Analysis_Type': '4th Oldest Year - Least Expensive',
        'Vehicle_Type': row['VehicleType'],
        'Make': row['Make'],
        'Year': row['Year'],
        'Model': row['Model'],
        'Color': row['Color'],
        'Price': row['Price'],
    })

# Create vehicle analysis dataframe
vehicle_df = pd.DataFrame(vehicle_results)

# 2. Missing Values Analysis Results
missing_results = []

# Add missing values summary for each column
for col, missing_count, missing_pct in zip(missing_values.index, missing_values.values, missing_percentage.values):
    missing_results.append({
        'Column_Name': col,
        'Missing_Count': missing_count,
        'Missing_Percentage': missing_pct,
        'Total_Records': len(df),
        'Complete_Records': len(df) - missing_count,
    })

# Add overall summary
missing_results.append({
    'Column_Name': 'TOTAL_DATASET',
    'Missing_Count': total_missing_value,
    'Missing_Percentage': round((total_missing_value / (len(df) * len(df.columns))) * 100, 2),
    'Total_Records': len(df),
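    # Complete records are rows with no missing value in any column;
    # total_missing_value counts cells, so it cannot be subtracted from the row count.
    'Complete_Records': len(df) - len(rows_with_missing),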
})

# Create missing values dataframe
missing_df = pd.DataFrame(missing_results)

# Save both files
vehicle_df.to_csv('lab3_vehicle_analysis.csv', index=False)
missing_df.to_csv('lab3_missing_values_analysis.csv', index=False)

print("Vehicle analysis results saved to 'lab3_vehicle_analysis.csv'")
print("Missing values analysis saved to 'lab3_missing_values_analysis.csv'")


Vehicle analysis results saved to 'lab3_vehicle_analysis.csv'
Missing values analysis saved to 'lab3_missing_values_analysis.csv'
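A quick way to confirm that a saved file is usable is to read it back and compare it with the DataFrame that was written. The sketch below does this with a small stand-in table written to a temporary directory, so it does not touch the real output files. The file name `toy_vehicle_analysis.csv` and the table contents are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for vehicle_df (hypothetical values mirroring the preview above)
results = pd.DataFrame({
    "Analysis_Type": ["6th Newest Year - Most Expensive",
                      "4th Oldest Year - Least Expensive"],
    "Year": [2020.0, 1925.0],
    "Price": [9322.0, 500.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vehicle_analysis.csv")
    # index=False keeps the row index out of the file, as in the cell above
    results.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare the structure and the values after the round trip
same_columns = list(reloaded.columns) == list(results.columns)
same_prices = reloaded["Price"].tolist() == results["Price"].tolist()
print(same_columns, same_prices)
```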
