[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mfurio93/Used-Cars-Dataset/blob/main/Notebook.ipynb)

# File import and pre-cleaning

We now import the .csv file obtained from https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data.

It contains all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, title status, and 18 other categories.

Because of size constraints, several columns were deleted in Excel beforehand, which shrinks the .csv from 1.34 GB to 41 MB.

Pandas reports a fairly large dataset: 426880 entries across 16 columns.

In [None]:
import pandas as pd

# Show floats without decimals in printed output
pd.set_option('display.float_format', lambda x: '%.0f' % x)
df = pd.read_csv("train.csv")
print(df.shape)

(426880, 16)
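
The column trimming described above was done by hand in Excel. As a possible alternative, the sketch below shows how the same step could be done at load time with the `usecols` parameter of `pd.read_csv`. The helper name `load_selected_columns` and the tiny in-memory CSV (including its `description` column) are hypothetical and only stand in for the raw Kaggle file; the column list is the one reported by this notebook.

```python
import io
import pandas as pd

# The 16 columns this notebook works with.
KEEP_COLUMNS = [
    "region", "price", "year", "manufacturer", "model", "condition",
    "cylinders", "fuel", "odometer", "title_status", "transmission",
    "drive", "size", "type", "paint_color", "state",
]

def load_selected_columns(source, columns=KEEP_COLUMNS):
    """Read only the requested columns, skipping the rest at parse time."""
    # A callable lets usecols ignore names that are absent from the file.
    return pd.read_csv(source, usecols=lambda name: name in columns)

# Hypothetical stand-in for the raw file: two wanted columns, one unwanted.
raw = io.StringIO("region,price,description\nauburn,15000,long text\n")
small = load_selected_columns(raw)
print(small.columns.tolist())
```

Filtering at read time keeps the trimming step reproducible inside the notebook instead of depending on a manual edit.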


# Analysis of missing data

Using pandas, we can see that several columns are missing a large share of their values (from about 22% of the rows for type up to about 72% for size), so we'll delete those columns entirely.

The columns with a negligible amount of missing data will be kept, but the rows where those columns are missing a value will be deleted.

Last but not least, we delete all duplicate entries within the dataframe.

In [None]:
df.isna().sum()

region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
drive           130567
size            306361
type             92858
paint_color     130203
state                0
dtype: int64
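
The counts above are easier to compare as fractions of the row count. The following is a minimal sketch of picking sparse columns by threshold; the helper `columns_above_missing_threshold` and the toy frame are hypothetical. The 0.2 default is an assumption: with the counts listed above, it separates the six sparsest columns (each missing in more than 20% of the 426880 rows) from the rest (each below about 5%).

```python
import pandas as pd

def columns_above_missing_threshold(frame, threshold=0.2):
    """Return column names whose share of missing values exceeds threshold."""
    missing_share = frame.isna().mean()  # fraction of NaN per column
    return missing_share[missing_share > threshold].index.tolist()

# Hypothetical toy frame: 'size' is 75% missing, 'fuel' 25%, 'price' complete.
toy = pd.DataFrame({
    "price": [1000, 2000, 3000, 4000],
    "fuel": ["gas", None, "gas", "diesel"],
    "size": [None, None, None, "compact"],
})
sparse = columns_above_missing_threshold(toy, threshold=0.5)
print(sparse)
```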

In [None]:
# Drop the columns with too much missing data
df.drop(columns=['condition', 'cylinders', 'drive', 'size', 'paint_color', 'type'], inplace=True)

In [None]:
# Use 0 as a sentinel for missing values, then drop every row that has a 0
# in one of these columns. Note that this also removes listings whose price
# or odometer is genuinely 0.
df.fillna(0, inplace=True)
for col in ['price', 'year', 'manufacturer', 'model', 'fuel', 'odometer', 'title_status', 'transmission']:
    df.drop(df[df[col] == 0].index, inplace=True)
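
The cell above uses 0 as a sentinel for missing values, so it cannot tell a missing odometer reading from a genuine reading of 0. As a hedged alternative, `DataFrame.dropna` with `subset` removes only rows that are actually missing a value. This is a sketch, not what the notebook ran, so it would not necessarily reproduce the row counts shown further down; the helper `drop_incomplete_rows` and the toy frame are hypothetical.

```python
import pandas as pd

# Columns kept in the notebook that contain missing values.
REQUIRED = ["year", "manufacturer", "model", "fuel",
            "odometer", "title_status", "transmission"]

def drop_incomplete_rows(frame, required=REQUIRED):
    """Drop rows with NaN in any required column that exists in the frame."""
    present = [col for col in required if col in frame.columns]
    return frame.dropna(subset=present)

# Hypothetical toy frame: row 1 lacks a year, row 0 has a real 0 odometer.
toy = pd.DataFrame({
    "price": [5000, 7000, 9000],
    "year": [2010, None, 2015],
    "odometer": [0, 120000, 80000],
})
clean = drop_incomplete_rows(toy)
print(clean)
```

In this toy example the row with a 0 odometer survives, while the row with a missing year is removed.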

In [None]:
df.drop_duplicates(inplace=True)
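
Before dropping duplicates it can be useful to know how many rows are affected. A minimal sketch on a hypothetical toy frame, using `DataFrame.duplicated`, which flags every repeat of an earlier row:

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first exactly.
toy = pd.DataFrame({
    "model": ["civic", "civic", "f-150"],
    "price": [5000, 5000, 20000],
})

n_dupes = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```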

# Analysis of target variable

Now we take a look at how our target variable behaves.

Here we can see several values that seem either off or excessive: the minimum price is \$1 and the maximum is over \$3.7 billion. To curb their effect, we'll delete all rows whose price falls outside the range of \$1000 (one thousand) to \$100000 (one hundred thousand).

After this cleaning process, we have lost roughly 120000 entries, or about 28% of our dataset. However, we now have a dataset with no missing values and no duplicate rows.

In [None]:
df.price.describe()

count       315296
mean         76518
std       12726399
min              1
25%           7500
50%          15990
75%          27990
max     3736928711
Name: price, dtype: float64

In [None]:
df.drop(df[df['price'] > 100000].index, inplace=True)
df.drop(df[df['price'] < 1000].index, inplace=True)
df.price.describe()

count   307166
mean     19492
std      14165
min       1000
25%       7995
50%      16495
75%      28223
max     100000
Name: price, dtype: float64
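
The same range filter can also be written as a single boolean mask. The sketch below uses `Series.between`, which includes both bounds by default and therefore keeps prices of exactly 1000 and 100000, as the two `drop` calls do. The helper `keep_price_range` and the toy prices are hypothetical.

```python
import pandas as pd

def keep_price_range(frame, low=1000, high=100000):
    """Keep rows with low <= price <= high (both bounds inclusive)."""
    return frame[frame["price"].between(low, high)]

# Hypothetical toy prices, including the extremes seen in describe().
toy = pd.DataFrame({"price": [1, 1000, 16495, 100000, 3736928711]})
kept = keep_price_range(toy)
print(kept["price"].tolist())
```

Returning a filtered copy rather than dropping in place leaves the original frame available for comparison.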

In [None]:
#from google.colab import files

#df.to_csv('output.csv', encoding = 'utf-8-sig') 
#files.download('output.csv')
