In [18]:
import pandas as pd
import numpy as np
import re #gpt suggested

df = pd.read_csv("/content/train.csv")

# drop the stray index column ('Unnamed: 0'), if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])

# turn null-like strings into NaN before parsing
for col in ['Mileage','Engine','Power','New_Price']:
    if col in df.columns:
        df[col] = df[col].replace(['null','Null','NULL','',' '], np.nan)

#A
#missing counts
missing_before = df.isna().sum()
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

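# Intent: leave New_Price unimputed because most of it is missing.
# Caveat: New_Price is still text at this point, so it sits in categorical_cols,
# the check below never fires, and the mode fill further down fills it.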
if 'New_Price' in numeric_cols:
    numeric_cols.remove('New_Price')

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

missing_after = df.isna().sum()

(a) handling missing values


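In this step I looked for missing values using the isna() function. Mileage, Engine, Power, and Seats each had a small number of missing values (2, 36, 36, and 38), while New_Price was missing in 5,032 rows. I filled numeric columns with the median, because the median is not affected much by extreme values, and text columns with the most common value (the mode). Because this step runs before the units are removed, Mileage, Engine, and Power are still text at this point, so they are filled with the mode, and the median is only used for columns that are already numeric, such as Seats. I meant to leave New_Price alone, since most of its values are missing and filling it could make the data unrealistic, but it is also still text here, so the mode fill applies to it as well. That is why the "Missing after" output shows 0 for New_Price. After this step no column reports missing values, although the parsing in the next step turns any value with no digits into NaN again.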
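
A minimal sketch of an alternative order, assuming the goal is a median fill for the measured columns and no fill for New_Price: parse the numbers first, then impute. The three toy rows and the `first_number` helper are hypothetical and only illustrate the order of operations.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Engine": ["1000 CC", np.nan, "2000 CC"],
    "Seats": [5.0, np.nan, 7.0],
    "New_Price": [np.nan, np.nan, "8.5 Lakh"],
})


def first_number(x):
    """Return the first number found in x, or NaN if x is missing or has no digits."""
    if pd.isna(x):
        return np.nan
    m = re.search(r"\d+(?:\.\d+)?", str(x))
    return float(m.group(0)) if m else np.nan


# 1) Parse text columns to floats so they count as numeric.
toy["Engine"] = toy["Engine"].apply(first_number)
toy["New_Price"] = toy["New_Price"].apply(first_number)

# 2) Median-fill numeric columns, explicitly skipping New_Price.
skip = {"New_Price"}
numeric_cols = [c for c in toy.select_dtypes(include=[np.number]).columns if c not in skip]
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

missing_after = toy.isna().sum()
```

With this order, Engine and Seats end up with no missing values, while New_Price keeps its two gaps.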

In [19]:
#B
def extract_number(x):
    """Return first float-like number from a string or NaN."""
    if pd.isna(x):
        return np.nan
    m = re.search(r'(\d+(\.\d+)?)', str(x))
    return float(m.group(1)) if m else np.nan

df['Mileage'] = df['Mileage'].apply(extract_number)
df['Engine'] = df['Engine'].apply(extract_number)
df['Power'] = df['Power'].apply(extract_number)


def parse_new_price(x):  # GPT assisted
    if pd.isna(x):
        return np.nan
    s = str(x).strip()
    num = extract_number(s)
    if num is None or np.isnan(num):
        return np.nan
    if 'Cr' in s or 'crore' in s.lower():
        return num * 100.0   # 1 Cr = 100 Lakh
    # default assume Lakh
    return num

df['New_Price'] = df['New_Price'].apply(parse_new_price)

(b) removing units


Some columns contained both a number and a unit, such as “19.67 kmpl” or “1582 cc”. pandas stores these as text, so calculations cannot be done on them directly. I used a regular expression to keep only the first number in each value and dropped the rest; values with no digits become NaN. For New_Price I also put everything on one scale: values marked “Cr” (crore) are multiplied by 100, since 1 crore = 100 lakh, and everything else is assumed to be in lakh already. After this cleaning, Mileage, Engine, Power, and New_Price are all stored as numbers, which makes them easier to analyze later.
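
To make the conversion concrete, here is a small self-contained sketch that mirrors the two helpers on a few sample strings. The sample values are illustrative and are not rows taken from the dataset.

```python
import re

import numpy as np
import pandas as pd


def extract_number(x):
    """Return the first number in x as a float, or NaN."""
    if pd.isna(x):
        return np.nan
    m = re.search(r"\d+(?:\.\d+)?", str(x))
    return float(m.group(0)) if m else np.nan


def parse_new_price(x):
    """Return the price in lakh; values marked Cr/crore are multiplied by 100."""
    num = extract_number(x)
    if np.isnan(num):
        return np.nan
    s = str(x)
    if "Cr" in s or "crore" in s.lower():
        return num * 100.0  # 1 crore = 100 lakh
    return num  # assumed to be in lakh already


mileage = extract_number("19.67 kmpl")      # unit dropped, number kept
engine = extract_number("1582 CC")
no_digits = extract_number("unknown")       # nothing to extract, so NaN
price_lakh = parse_new_price("8.61 Lakh")   # stays on the lakh scale
price_crore = parse_new_price("1.28 Cr")    # converted from crore to lakh
```

Note that only the first number is read, so a value with a thousands separator such as "1,582 CC" would be parsed as 1.0. That case would need extra handling if it occurs in the data.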

In [20]:
#C GPT assisted
cols_to_encode = [c for c in ['Fuel_Type','Transmission'] if c in df.columns]
df = pd.get_dummies(df, columns=cols_to_encode, drop_first=True)

(c) encoding categorical variables

The dataset had some text columns, such as Fuel_Type and Transmission. Since most analysis methods and machine learning models need numeric input, I converted these two columns with one-hot encoding. This created indicator columns such as Fuel_Type_Petrol and Transmission_Manual, which hold 1 (or True) when the car belongs to that category and 0 (or False) otherwise. Because I used drop_first=True, the first category of each column gets no column of its own and is represented by all of its indicators being 0. Other text columns, such as Name, Location, and Owner_Type, were not encoded, so the dataset is not fully numeric yet.
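
The effect of drop_first=True is easiest to see on a tiny example. This is a sketch on four made-up rows; depending on the pandas version, the indicator values display as True/False or as 1/0.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

encoded = pd.get_dummies(toy, columns=["Fuel_Type", "Transmission"], drop_first=True)

# The alphabetically first category of each column (CNG, Automatic) has no
# indicator column; it is implied when all indicators for that column are 0.
encoded_columns = list(encoded.columns)
first_row = encoded.iloc[0]  # the CNG / Manual car
```

Here the CNG car has both fuel indicators off, which is how the dropped category is still recoverable.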

In [21]:
#D
df['car_age'] = 2025 - df['Year']

(d) creating a new feature


To add more useful information, I created a new column called car_age by subtracting each car's Year from the current year (2025). This shows how old each car is, which is an important factor in the price of a used car. The year is hard-coded, so the ages will be off if the notebook is rerun in a later year.
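
As a quick check of the arithmetic, here is a sketch on three made-up years. The constant name `REFERENCE_YEAR` is hypothetical; it just keeps the hard-coded year in one visible place.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # same year the notebook uses; the constant name is hypothetical

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({"Year": [2012, 2019, 2024]})
toy["car_age"] = REFERENCE_YEAR - toy["Year"]

ages = toy["car_age"].tolist()
```

A car from 2012 gets an age of 13, so it would be excluded by a "less than ten years old" filter.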

In [24]:
#E
keep_cols = [c for c in ['Name','Location','Year','Kilometers_Driven','Mileage',
                         'Engine','Power','Seats','New_Price','Price','car_age',
                         'Fuel_Type_Diesel','Fuel_Type_Petrol','Transmission_Manual'] if c in df.columns]
cars_selected = df[keep_cols].copy()

cars_filtered = cars_selected.loc[(cars_selected['car_age'] < 10) & (~cars_selected['Price'].isna())].copy()
cars_filtered = cars_filtered.rename(columns={'Price': 'Used_Price_Lakh'})
cars_filtered['Price_per_1000cc'] = cars_filtered['Used_Price_Lakh'] / cars_filtered['Engine'].replace(0, np.nan) * 1000
cars_sorted = cars_filtered.sort_values(by='Used_Price_Lakh', ascending=False)

summary_by_owner = (
    df.groupby('Owner_Type', dropna=False)['Price']
      .agg(mean_price='mean', median_price='median', n='count')
      .reset_index()
)

#GPT-assisted summary

#cars_sorted.to_csv('cars_sorted.csv', index=False)
#summary_by_owner.to_csv('summary_by_owner.csv', index=False)

print("Missing before:\n", missing_before)
print("\nMissing after (note New_Price intentionally left as-is):\n", missing_after)
print("\nGroup summary by Owner_Type:\n", summary_by_owner.head())
print("\nPreview cleaned rows:\n", cars_sorted.head(3))


Missing before:
 Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  38
New_Price            5032
Price                   0
dtype: int64

Missing after (note New_Price intentionally left as-is):
 Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
New_Price            0
Price                0
dtype: int64

Group summary by Owner_Type:
        Owner_Type  mean_price  median_price     n
0           First   10.105076         5.990  4811
1  Fourth & Above    3.415000         3.125     8
2          Second    7.839719         4.500   925
3           Third    5.348058 

(e) data wrangling operations


In this part I used basic data manipulation steps. First, I selected only the columns that were most useful for analysis. Then I filtered the data to keep only cars less than ten years old that have a recorded price. I renamed the Price column to Used_Price_Lakh to make the unit clear. I created a new column called Price_per_1000cc, the price in lakh divided by the engine size in cc and multiplied by 1000, with an engine size of 0 treated as missing to avoid dividing by zero. Next, I sorted the rows so that the most expensive cars appear first. Finally, I grouped the full dataset by Owner_Type and calculated the mean price, the median price, and the number of cars in each group. These steps organized and summarized the dataset so that it is cleaner and easier to understand.
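
The same sequence of steps can be traced on a four-row toy table. This is a sketch with made-up values, chosen so that each step visibly does something: one car is too old, one has no price, and one has an engine size of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Owner_Type": ["First", "Second", "First", "First"],
    "Engine": [1000.0, 2000.0, 0.0, 1500.0],
    "car_age": [3, 12, 5, 8],
    "Price": [4.0, 6.0, 9.0, np.nan],
})

# Filter: keep cars under ten years old that have a price.
filtered = toy.loc[(toy["car_age"] < 10) & toy["Price"].notna()].copy()

# Rename, then derive price per 1000 cc (engine size 0 becomes NaN first).
filtered = filtered.rename(columns={"Price": "Used_Price_Lakh"})
filtered["Price_per_1000cc"] = (
    filtered["Used_Price_Lakh"] / filtered["Engine"].replace(0, np.nan) * 1000
)

# Sort: most expensive first.
ranked = filtered.sort_values(by="Used_Price_Lakh", ascending=False)

# Group the unfiltered table by owner type; "count" ignores missing prices.
summary = (
    toy.groupby("Owner_Type", dropna=False)["Price"]
       .agg(mean_price="mean", median_price="median", n="count")
       .reset_index()
)
```

Only rows 0 and 2 survive the filter, and the zero-engine car gets a missing Price_per_1000cc rather than an infinite one. In the summary, n counts cars with a non-missing price, so the "First" group reports 2 even though it has three rows.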