In [None]:
from google.colab import files

~This line imports a specific module named files from the google.colab library.

->Let's break it down:
1. from: This keyword names the package or module to import from, here google.colab.
2. google.colab: This is the package that ships with Google Colaboratory, a cloud-based environment for running Python code, often used for machine learning and data analysis.
3. import: This keyword brings the named module or function into the current namespace.
4. files: This is the specific module within google.colab being imported; it provides helpers for uploading and downloading files.

In [None]:
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"harshagnihotri11","key":"80c38aa0f0fd8ba614e583a45f7f799e"}'}

~Triggers a file upload dialog: When you execute this line of code within a Google Colab notebook, it prompts you to select files from your local computer to upload to the Colab environment.
~Uploads to Colab's file system: The chosen files are then transferred from your local machine to the Colab environment, where they become accessible for use in your code.

In [None]:
!mkdir ~/.kaggle

~This line of code, !mkdir ~/.kaggle, uses the shell command mkdir within a Colab notebook to create a directory named **.kaggle** within your home directory.

In [None]:
!mv /content/kaggle.json ~/.kaggle

~This shell command moves kaggle.json from /content, where Colab saved the upload, into ~/.kaggle, the directory where the Kaggle command-line tool looks for its credentials.

In [None]:
!chmod 600 ~/.kaggle/kaggle.json

~This shell command changes the permissions on kaggle.json so that only its owner can access it.

~chmod: The shell command used to change file permissions.
~600: The numeric (octal) form of the permissions being set:
1. Owner read/write: The owner of the file (you) can read and write to the file.
2. No group or other access: Other users and groups on the system cannot read, write, or execute the file.
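
~The three shell steps above (create the directory, move the token, restrict its permissions) can be mirrored in plain Python. This is a minimal sketch, assuming a POSIX system; `install_token` is a hypothetical helper name, and a temporary directory with a dummy token stands in for the real home directory and credentials.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_token(token_path: Path, home: Path) -> Path:
    """Move a Kaggle token into <home>/.kaggle and make it owner-only."""
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`, safe to re-run
    target = kaggle_dir / "kaggle.json"
    shutil.move(str(token_path), str(target))      # like `mv`
    os.chmod(target, 0o600)                        # like `chmod 600`
    return target


# Demo in a throwaway directory so no real credentials are touched.
workdir = Path(tempfile.mkdtemp())
dummy = workdir / "kaggle.json"
dummy.write_text('{"username": "example", "key": "placeholder"}')

installed = install_token(dummy, home=workdir / "home")
mode = stat.S_IMODE(installed.stat().st_mode)  # permission bits only
print(oct(mode))
```

~Using `exist_ok=True` means the cell can be re-run without failing when the directory already exists, which plain `mkdir` would not allow.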

In [None]:
!pip install kaggle



~This line installs the kaggle package, which provides the Kaggle command-line tool used in the next step.

In [None]:
!kaggle datasets download -d austinreese/craigslist-carstrucks-data

Downloading craigslist-carstrucks-data.zip to /content
 98% 256M/262M [00:02<00:00, 165MB/s]
100% 262M/262M [00:02<00:00, 125MB/s]


~This line of code, executed as a shell command in Colab, initiates the download of a specific dataset from Kaggle directly into the Colab environment.

In [None]:
!unzip /content/craigslist-carstrucks-data.zip

Archive:  /content/craigslist-carstrucks-data.zip
  inflating: vehicles.csv            


~This shell command extracts the contents of the ZIP archive into the current working directory of the Colab environment, producing vehicles.csv.
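
~The same extraction can be done from Python with the standard-library zipfile module. A minimal sketch, assuming a tiny archive built on the fly stands in for the downloaded dataset (the file contents here are made up for illustration):

```python
import tempfile
import zipfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
archive = workdir / "demo.zip"

# Build a tiny archive to stand in for the downloaded dataset.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("vehicles.csv", "price,year\n6000,2013\n")

# Extract everything into a target directory, as `unzip` does.
out_dir = workdir / "extracted"
with zipfile.ZipFile(archive) as zf:
    members = zf.namelist()   # names of the files stored in the archive
    zf.extractall(out_dir)

print(members)
```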

In [None]:
import pandas as pd

~Imports the Pandas library, designed for data analysis and manipulation, particularly with tabular data.

In [None]:
import numpy as np

~Imports the NumPy library, which is foundational for numerical computations and array operations in Python.

In [None]:
data = pd.read_csv("/content/vehicles.csv")

~Reads the CSV file named "vehicles.csv" located in the "/content" directory.
Stores the loaded data as a Pandas DataFrame object, named data.

In [None]:
data.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,7222695916,https://prescott.craigslist.org/cto/d/prescott...,prescott,https://prescott.craigslist.org,6000,,,,,,...,,,,,,,az,,,
1,7218891961,https://fayar.craigslist.org/ctd/d/bentonville...,fayetteville,https://fayar.craigslist.org,11900,,,,,,...,,,,,,,ar,,,
2,7221797935,https://keys.craigslist.org/cto/d/summerland-k...,florida keys,https://keys.craigslist.org,21000,,,,,,...,,,,,,,fl,,,
3,7222270760,https://worcester.craigslist.org/cto/d/west-br...,worcester / central MA,https://worcester.craigslist.org,1500,,,,,,...,,,,,,,ma,,,
4,7210384030,https://greensboro.craigslist.org/cto/d/trinit...,greensboro,https://greensboro.craigslist.org,4900,,,,,,...,,,,,,,nc,,,


~Displays the first few rows (5 by default) of the DataFrame data.
Offers a quick glimpse into the data's structure, content, and format.

In [None]:
data.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long',
       'posting_date'],
      dtype='object')

~Provides a clear overview of the features available in the dataset.
~Helps identify potential areas for feature selection or engineering.

In [None]:
data.shape

(426880, 26)

~Retrieves the dimensions of the DataFrame data as a tuple containing two elements:
1. The number of rows (observations) in the dataset, here 426880.
2. The number of columns (features) in the dataset, here 26.

In [None]:
data.isna().sum()

id                   0
url                  0
region               0
region_url           0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
VIN             161042
drive           130567
size            306361
type             92858
paint_color     130203
image_url           68
description         70
county          426880
state                0
lat               6549
long              6549
posting_date        68
dtype: int64

~Counts the missing values (NaNs) in each column of the DataFrame data.
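
~A small sketch of how this count behaves, using a made-up three-column DataFrame (the values are illustrative, not taken from the dataset). Taking the mean of the same boolean mask gives the share of missing rows per column, which can be easier to compare across columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [6000, 11900, 21000, 1500],
    "year": [2013.0, np.nan, 2016.0, np.nan],
    "county": [np.nan, np.nan, np.nan, np.nan],
})

missing_counts = toy.isna().sum()   # number of NaNs per column
missing_share = toy.isna().mean()   # fraction of rows that are NaN, per column

print(missing_counts)
print(missing_share)
```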

In [None]:
data.drop(labels=data.columns[0:4],axis=1,inplace=True)

~Drops the first four columns (id, url, region, and region_url), which will not be used to build the model.
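
~Dropping by position depends on the column order. As a hedged alternative, the same columns can be dropped by name; the sketch below uses a made-up DataFrame with the same leading columns to show that both forms give the same result.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "url": ["a", "b"],
    "region": ["x", "y"],
    "region_url": ["u", "v"],
    "price": [6000, 11900],
})

# Positional form, as in the cell above: drop the first four columns.
by_position = toy.drop(labels=toy.columns[0:4], axis=1)

# Name-based form: same result, but it keeps working if the column order changes.
by_name = toy.drop(columns=["id", "url", "region", "region_url"])

print(by_name.columns.tolist())
```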

In [None]:
data.shape

(426880, 22)

In [None]:
data.head()

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,6000,,,,,,,,,,...,,,,,,,az,,,
1,11900,,,,,,,,,,...,,,,,,,ar,,,
2,21000,,,,,,,,,,...,,,,,,,fl,,,
3,1500,,,,,,,,,,...,,,,,,,ma,,,
4,4900,,,,,,,,,,...,,,,,,,nc,,,


In [None]:
data.drop(labels="title_status",axis=1,inplace=True)

~This line removes the **"title_status"** column from the data.

In [None]:
data.head()

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,VIN,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
0,6000,,,,,,,,,,...,,,,,,,az,,,
1,11900,,,,,,,,,,...,,,,,,,ar,,,
2,21000,,,,,,,,,,...,,,,,,,fl,,,
3,1500,,,,,,,,,,...,,,,,,,ma,,,
4,4900,,,,,,,,,,...,,,,,,,nc,,,


In [None]:
data.columns

Index(['price', 'year', 'manufacturer', 'model', 'condition', 'cylinders',
       'fuel', 'odometer', 'transmission', 'VIN', 'drive', 'size', 'type',
       'paint_color', 'image_url', 'description', 'county', 'state', 'lat',
       'long', 'posting_date'],
      dtype='object')

In [None]:
data.drop(labels=data.columns[14:],axis=1,inplace=True)

In [None]:
data.head()

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,VIN,drive,size,type,paint_color
0,6000,,,,,,,,,,,,,
1,11900,,,,,,,,,,,,,
2,21000,,,,,,,,,,,,,
3,1500,,,,,,,,,,,,,
4,4900,,,,,,,,,,,,,


In [None]:
data.drop(labels="VIN",axis=1,inplace=True)

In [None]:
data.head()

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,drive,size,type,paint_color
0,6000,,,,,,,,,,,,
1,11900,,,,,,,,,,,,
2,21000,,,,,,,,,,,,
3,1500,,,,,,,,,,,,
4,4900,,,,,,,,,,,,


In [None]:
data.shape

(426880, 13)

In [None]:
data.columns

Index(['price', 'year', 'manufacturer', 'model', 'condition', 'cylinders',
       'fuel', 'odometer', 'transmission', 'drive', 'size', 'type',
       'paint_color'],
      dtype='object')

In [None]:
data.dropna(axis=0,thresh=12,inplace=True)

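~Removes rows with too many missing values (NaNs) from the DataFrame data. thresh=12 keeps only rows that have at least 12 non-missing values; with 13 columns, any row missing two or more values is dropped.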
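
~Because thresh counts non-missing values rather than missing ones, it is easy to misread. A minimal sketch on a made-up three-column DataFrame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, 3.0],
})

# thresh counts non-missing values: keep rows with at least 2 of the 3 filled.
kept = toy.dropna(axis=0, thresh=2)

print(kept.index.tolist())
```

~Row 0 has three filled values and row 1 has two, so both are kept; row 2 has only one and is dropped.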

In [None]:
data.shape

(144017, 13)

In [None]:
data.head()

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,transmission,drive,size,type,paint_color
31,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,automatic,rwd,full-size,truck,black
32,27990,2012.0,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,other,4wd,,pickup,black
33,34590,2016.0,chevrolet,silverado 1500 double,good,6 cylinders,gas,29499.0,other,4wd,,pickup,silver
34,35000,2019.0,toyota,tacoma,excellent,6 cylinders,gas,43000.0,automatic,4wd,,truck,grey
35,29990,2016.0,chevrolet,colorado extended cab,good,6 cylinders,gas,17302.0,other,4wd,,pickup,red


In [None]:
data.isna().sum()

price               0
year                0
manufacturer     4059
model             975
condition       13570
cylinders        1766
fuel                0
odometer          408
transmission       11
drive            2491
size            36793
type             1893
paint_color      2856
dtype: int64

In [None]:
data["manufacturer"].value_counts()

ford               24663
chevrolet          21040
toyota             12227
honda               8115
nissan              7261
jeep                6783
gmc                 5635
ram                 5199
dodge               4886
bmw                 4699
mercedes-benz       4443
hyundai             3202
subaru              3136
lexus               2874
volkswagen          2750
kia                 2429
chrysler            2293
cadillac            2065
buick               1810
infiniti            1673
audi                1633
mazda               1597
acura               1494
lincoln             1410
pontiac             1014
mitsubishi           982
volvo                947
mini                 715
rover                593
mercury              583
saturn               459
jaguar               450
porsche              432
fiat                 203
alfa-romeo            88
tesla                 61
harley-davidson       49
ferrari               28
datsun                24
land rover             8


~Counts the occurrences of each unique value in the "manufacturer" column of the DataFrame data. Returns a Series whose index holds the unique values and whose values hold their counts, sorted from most to least frequent.

In [None]:
data["odometer"].fillna(value=data["odometer"].mean(),inplace=True)

~Fills missing values (NaNs) in the "odometer" column of the DataFrame data with the mean value of that column.
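
~A hedged variant of the same idea on a made-up column: the filled Series is assigned back to the DataFrame instead of using inplace=True on a selected column, since newer pandas releases may warn about, or not apply, in-place changes made through such a selection.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"odometer": [100000.0, np.nan, 50000.0, np.nan]})

# mean() skips NaNs by default, so the fill value is based on observed rows only.
fill_value = toy["odometer"].mean()

# Assigning the result back avoids relying on in-place modification of a
# column selection.
toy["odometer"] = toy["odometer"].fillna(fill_value)

print(toy["odometer"].tolist())
```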

In [None]:
data["condition"].unique()

array(['excellent', 'good', 'like new', 'new', 'fair', nan, 'salvage'],
      dtype=object)

~Extracts unique values from the "condition" column of the DataFrame data and returns them as a NumPy array.

In [None]:
data["condition"].value_counts()

excellent    59097
good         51580
like new     14128
fair          4547
new            732
salvage        363
Name: condition, dtype: int64

In [None]:
data["condition"].fillna(value=data["condition"].value_counts().index[data["condition"].value_counts().argmax()],
                         inplace=True)

~Fills missing values (NaNs) in the "condition" column of the DataFrame data with the most frequent value in that column.

~Breaking the expression down:
1. data["condition"]: Selects the "condition" column as a Series.
2. .fillna(): A Series method that fills missing values.
3. value=: Specifies the fill value, which is computed from the column itself.
4. data["condition"].value_counts(): Counts occurrences of each unique value in the column, excluding NaNs.
5. .index[data["condition"].value_counts().argmax()]: Looks up, in the index of the value counts, the label at the position of the largest count, which is the most frequent value ("excellent" here).
6. inplace=True: Modifies the column in place rather than returning a new Series.
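
~The notebook repeats this most-frequent-value fill for several columns. One way to avoid the repetition is a small helper; the sketch below is an assumption rather than part of the original notebook, `fill_with_most_frequent` is a hypothetical name, and the data is made up. It uses value_counts().idxmax(), which returns the same label as the index[...argmax()] lookup.

```python
import numpy as np
import pandas as pd


def fill_with_most_frequent(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace NaNs in `column` with its most frequent non-missing value."""
    counts = df[column].value_counts()   # NaNs are excluded by default
    most_frequent = counts.idxmax()      # label with the highest count
    df[column] = df[column].fillna(most_frequent)
    return df


toy = pd.DataFrame({
    "condition": ["good", "excellent", np.nan, "excellent", np.nan],
    "paint_color": ["black", np.nan, "black", "red", "black"],
})

for col in ["condition", "paint_color"]:
    toy = fill_with_most_frequent(toy, col)

print(toy)
```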

In [None]:
data["paint_color"].fillna(value=data["paint_color"].value_counts().index[data["paint_color"].value_counts().argmax()],
                           inplace=True)

In [None]:
data["transmission"].fillna(value=data["transmission"].value_counts().index[data["transmission"].value_counts().argmax()],
                           inplace=True)

In [None]:
crosstab_df_mdl_yr_mfc = pd.crosstab(data["model"],
 [data["year"],data["manufacturer"]],rownames=["model"],
                                     colnames=["year","manufacturer"])

~Creates a cross-tabulation of the DataFrame data with one row per model and one column per (year, manufacturer) pair; each cell counts the listings with that model, year, and manufacturer.
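
~A minimal sketch of the same call on a made-up five-row DataFrame, to make the shape of the result easier to see:

```python
import pandas as pd

toy = pd.DataFrame({
    "model": ["f-150", "f-150", "mustang", "tacoma", "camry"],
    "year": [2013.0, 2013.0, 2013.0, 2019.0, 2019.0],
    "manufacturer": ["ford", "ford", "ford", "toyota", "toyota"],
})

table = pd.crosstab(
    toy["model"],
    [toy["year"], toy["manufacturer"]],
    rownames=["model"],
    colnames=["year", "manufacturer"],
)

# Each column is a (year, manufacturer) pair; each cell is a listing count.
print(table)
```

~Only (year, manufacturer) pairs that occur in the data become columns, so this toy table has four rows and two columns.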

In [None]:
crosstab_df_mdl_yr_mfc

year,1900.0,1905.0,1913.0,1913.0,1918.0,1923.0,1924.0,1924.0,1925.0,1926.0,...,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2022.0,2022.0,2022.0,2022.0
manufacturer,acura,chevrolet,cadillac,ford,ford,ford,dodge,ford,ford,ford,...,ram,rover,subaru,toyota,volkswagen,volvo,chevrolet,ford,mitsubishi,toyota
model,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
"""t""",0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
& altima,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
(210),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
(300),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
(cng) 2500 express van,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zl1 camaro,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zr2 sonoma,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zx2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
♿,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


~Imports the drive module from the google.colab library and mounts Google Drive at /content/drive.
~This module provides functions for mounting your Google Drive within a Colab notebook, allowing you to access and work with files stored in your Drive.

In [None]:
data["cylinders"].fillna(value=data["cylinders"].value_counts().index[data["cylinders"].value_counts().argmax()],
                           inplace=True)

In [None]:
mapping_dict = dict()

for single_col in crosstab_df_mdl_yr_mfc.columns:
  mapping_dict[single_col] = crosstab_df_mdl_yr_mfc[single_col].index[crosstab_df_mdl_yr_mfc[single_col].argmax()]

~Builds a dictionary called mapping_dict to store mappings between specific year-manufacturer combinations and the most common model associated with them.
~Provides a quick way to look up the most prevalent model for a given year and manufacturer.
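
~The loop above can also be written with DataFrame.idxmax, which returns for each column the row label holding the largest value. A hedged sketch on made-up data; `most_common_model` is an illustrative name, not one used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "model": ["f-150", "f-150", "mustang", "tacoma", "tacoma", "camry"],
    "year": [2013.0, 2013.0, 2013.0, 2019.0, 2019.0, 2019.0],
    "manufacturer": ["ford", "ford", "ford", "toyota", "toyota", "toyota"],
})

table = pd.crosstab(toy["model"], [toy["year"], toy["manufacturer"]])

# idxmax() returns, for each column, the row label holding the largest count.
most_common_model = table.idxmax().to_dict()

print(most_common_model)
```

~As with argmax, ties are resolved in favor of the first row, so the choice among equally common models depends on row order.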

In [None]:
mapping_dict

{(1900.0, 'acura'): 'rdx',
 (1905.0, 'chevrolet'): 'astro cargo',
 (1913.0, 'cadillac'): 'touring car',
 (1913.0, 'ford'): '"t"',
 (1918.0, 'ford'): 'model t',
 (1923.0, 'ford'): 't bucket',
 (1924.0, 'dodge'): 'phaeton',
 (1924.0, 'ford'): 'model t',
 (1925.0, 'ford'): 'model t',
 (1926.0, 'ford'): 'model t',
 (1927.0, 'chevrolet'): 'coupe',
 (1927.0, 'chrysler'): '60 series',
 (1927.0, 'ford'): 'model t',
 (1928.0, 'cadillac'): 'coupe',
 (1928.0, 'chevrolet'): '4x4 pickup',
 (1928.0, 'dodge'): 'hotrod',
 (1928.0, 'ford'): 'model a',
 (1929.0, 'dodge'): 'da coupe',
 (1929.0, 'ford'): 'model a',
 (1930.0, 'chevrolet'): 'coupe',
 (1930.0, 'dodge'): 'dc8',
 (1930.0, 'ford'): 'model a',
 (1931.0, 'chevrolet'): '5 window coupe',
 (1931.0, 'ford'): 'model a',
 (1932.0, 'chevrolet'): 'coupe',
 (1932.0, 'chrysler'): 'rat rod',
 (1932.0, 'ford'): 'roadster',
 (1932.0, 'pontiac'): 'coupe',
 (1933.0, 'chevrolet'): 'coupe',
 (1933.0, 'ford'): '3 window pro street coupe',
 (1934.0, 'chevrolet'): '

In [None]:
for k in mapping_dict.keys():
  boolean_mask = (data["year"] == k[0]) & (data["manufacturer"] == k[1])
  data.loc[boolean_mask,"model"] = data.loc[boolean_mask,"model"].fillna(value=mapping_dict[k],inplace=False)

~Fills missing values (NaNs) in the "model" column of the DataFrame data using the most common models identified in the mapping_dict.
~Imputation is performed based on matching year and manufacturer combinations.
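
~An alternative to building a crosstab and looping over its columns is a group-wise fill with groupby and transform. This is a sketch of a swapped-in technique, not the notebook's method; `most_frequent` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [2013.0, 2013.0, 2013.0, 2019.0, 2019.0],
    "manufacturer": ["ford", "ford", "ford", "toyota", "toyota"],
    "model": ["f-150", "f-150", np.nan, "tacoma", np.nan],
})


def most_frequent(series: pd.Series):
    """Most frequent non-missing value, or NaN if the group has none."""
    counts = series.value_counts()
    return counts.idxmax() if not counts.empty else np.nan


# Broadcast each group's most frequent model back onto its rows...
group_mode = toy.groupby(["year", "manufacturer"])["model"].transform(most_frequent)

# ...and use it only where the model is missing.
toy["model"] = toy["model"].fillna(group_mode)

print(toy["model"].tolist())
```

~Groups with no known model stay NaN, which matches the behavior seen in the notebook, where a few missing models remain after this step.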

In [None]:
data.isna().sum()

price               0
year                0
manufacturer     4059
model              27
condition           0
cylinders           0
fuel                0
odometer            0
transmission        0
drive            2491
size            36793
type             1893
paint_color         0
dtype: int64

In [None]:
crosstab_df_typ_mdl_mfc = pd.crosstab(data["type"],
 [data["model"],data["manufacturer"]],rownames=["type"],
                                     colnames=["model","manufacturer"])

~Creates another multi-level cross-tabulation (pivot table) that summarizes the counts of vehicle types for each combination of model and manufacturer in the DataFrame data.

In [None]:
crosstab_df_typ_mdl_mfc

model,"""t""",& altima,(210),(300),(cng) 2500 express van,(s)port (s)edan,* vmi * ♿,- 240d,- 328i - convertible,- benz sprinter,...,z4 sdrive35i,z4 sdrive35is,z71,zdx,zephyr,zl1 camaro,zr2 sonoma,zx2,♿,♿ vmi
manufacturer,ford,chevrolet,chevrolet,chrysler,chevrolet,chevrolet,chrysler,mercedes-benz,bmw,mercedes-benz,...,bmw,bmw,chevrolet,acura,lincoln,chevrolet,gmc,ford,chrysler,chrysler
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
SUV,0,0,0,0,0,0,0,0,0,0,...,0,0,0,4,0,0,0,0,0,0
bus,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
convertible,1,0,0,0,0,0,0,0,1,0,...,3,1,0,0,0,0,0,0,0,0
coupe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,1,0,0
hatchback,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
mini-van,0,1,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,1,2
offroad,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
other,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
pickup,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
sedan,0,0,1,0,0,1,0,1,0,0,...,0,0,0,0,5,0,0,0,0,0


In [None]:
mapping_dict = dict()

for single_col in crosstab_df_typ_mdl_mfc.columns:
  mapping_dict[single_col] = crosstab_df_typ_mdl_mfc[single_col].index[crosstab_df_typ_mdl_mfc[single_col].argmax()]

~Creates a dictionary mapping_dict to store mappings between model-manufacturer pairs and their most common vehicle types, similar to the previous mapping for models.

In [None]:
mapping_dict

{('"t"', 'ford'): 'convertible',
 ('& altima', 'chevrolet'): 'mini-van',
 ('(210)', 'chevrolet'): 'sedan',
 ('(300)', 'chrysler'): 'other',
 ('(cng) 2500 express van', 'chevrolet'): 'truck',
 ('(s)port (s)edan', 'chevrolet'): 'sedan',
 ('* vmi * ♿', 'chrysler'): 'mini-van',
 ('- 240d', 'mercedes-benz'): 'sedan',
 ('- 328i - convertible', 'bmw'): 'convertible',
 ('- benz sprinter', 'mercedes-benz'): 'van',
 ('- santa fe', 'hyundai'): 'SUV',
 ('-150 xlt', 'ford'): 'truck',
 ('-benz e350', 'mercedes-benz'): 'sedan',
 ('-benz s430', 'mercedes-benz'): 'sedan',
 ('/ accord', 'honda'): 'sedan',
 ('/ bertone x1/9', 'fiat'): 'convertible',
 ('/ braun', 'dodge'): 'mini-van',
 ('/ durango sport', 'dodge'): 'SUV',
 ('/ vmi / ♿', 'chrysler'): 'mini-van',
 ('// vmi // ♿', 'chrysler'): 'mini-van',
 ('// vmi ♿', 'chrysler'): 'mini-van',
 ('/mercury comet', 'lincoln'): 'coupe',
 ('/vmi conversion van', 'ford'): 'van',
 ('037', 'chevrolet'): 'coupe',
 ("08' mkz 79,ooo mi.", 'lincoln'): 'sedan',
 ('1 ser

In [None]:
for k in mapping_dict.keys():
  boolean_mask = (data["model"] == k[0]) & (data["manufacturer"] == k[1])
  data.loc[boolean_mask,"type"] = data.loc[boolean_mask,"type"].fillna(value=mapping_dict[k],inplace=False)

~Fills missing values (NaNs) in the "type" column of the DataFrame data using the most common types identified in the mapping_dict.
~Imputation is based on matching model and manufacturer combinations.

In [None]:
data.isna().sum()

price               0
year                0
manufacturer     4059
model              27
condition           0
cylinders           0
fuel                0
odometer            0
transmission        0
drive            2491
size            36793
type              129
paint_color         0
dtype: int64

In [None]:
data["drive"].fillna(value=data["drive"].value_counts().index[data["drive"].value_counts().argmax()],
                           inplace=True)

In [None]:
data["size"].fillna(value=data["size"].value_counts().index[data["size"].value_counts().argmax()],
                           inplace=True)

In [None]:
data.isna().sum()

price              0
year               0
manufacturer    4059
model             27
condition          0
cylinders          0
fuel               0
odometer           0
transmission       0
drive              0
size               0
type             129
paint_color        0
dtype: int64

In [None]:
data["type"].fillna(value=data["type"].value_counts().index[data["type"].value_counts().argmax()],
                           inplace=True)

In [None]:
data["manufacturer"].fillna(value=data["manufacturer"].value_counts().index[data["manufacturer"].value_counts().argmax()],
                           inplace=True)

In [None]:
data["model"].fillna(value=data["model"].value_counts().index[data["model"].value_counts().argmax()],
                           inplace=True)

In [None]:
data.isna().sum()

price           0
year            0
manufacturer    0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
drive           0
size            0
type            0
paint_color     0
dtype: int64

~Finally, every missing value in the dataset has been filled.