# Exploratory Data Analysis 

### by George Levis

>This dataset comprises about 370,000 listings of used cars for sale from various online platforms (371,528 rows as loaded below). The data was downloaded from a comprehensive used car dataset that describes each vehicle's type, age, condition, and more. The full dataset can be found at https://www.kaggle.com/datasets/thedevastator/uncovering-factors-that-affect-used-car-prices


#### Preliminary Data Wrangling

In [27]:
# import necessary libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

In [28]:
cars_df = pd.read_csv('autos.csv', index_col = 0)
cars_df.head()

dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
2016-03-24 11:52:17,Golf_3_1.6,private,Angebot,480,test,,1993,manual,0,golf,150000,0,gasoline,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,private,Angebot,18300,test,coupe,2011,manual,190,,125000,5,diesel,audi,yes,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",private,Angebot,9800,test,suv,2004,automatic,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,private,Angebot,1500,test,small car,2001,manual,75,golf,150000,6,gasoline,volkswagen,no,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,private,Angebot,3600,test,small car,2008,manual,69,fabia,90000,7,diesel,skoda,no,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
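
The timestamp columns (`dateCrawled`, `dateCreated`, `lastSeen`) are read as plain strings unless they are converted. The following is a minimal sketch of one way to convert them, using a two-row inline CSV in place of `autos.csv`; the name `toy_df` and the derived `days_online` value are hypothetical and only illustrate what the conversion makes possible.

```python
import io

import pandas as pd

# Two rows copied from the preview above, reduced to a few columns.
csv_text = (
    "dateCrawled,name,price,dateCreated,lastSeen\n"
    "2016-03-24 11:52:17,Golf_3_1.6,480,2016-03-24 00:00:00,2016-04-07 03:16:57\n"
    "2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300,2016-03-24 00:00:00,2016-04-07 01:46:50\n"
)
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Convert the index and the two date columns from strings to datetimes.
toy_df.index = pd.to_datetime(toy_df.index)
for col in ["dateCreated", "lastSeen"]:
    toy_df[col] = pd.to_datetime(toy_df[col])

# With real datetimes, differences between columns become simple arithmetic.
days_online = (toy_df["lastSeen"] - toy_df["dateCreated"]).dt.days
print(toy_df.dtypes)
print(days_online)
```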


##### Understanding the Data


In [29]:
#get the shape of the dataframe
print("rows: ", cars_df.shape[0]) 
print("columns: ", cars_df.shape[1])

rows:  371528
columns:  19


In [30]:
#get the information about the dataframe (info() prints directly and returns None)
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 371528 entries, 2016-03-24 11:52:17 to 2016-03-07 19:39:19
Data columns (total 19 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   name                 371528 non-null  object
 1   seller               371528 non-null  object
 2   offerType            371528 non-null  object
 3   price                371528 non-null  int64 
 4   abtest               371528 non-null  object
 5   vehicleType          333659 non-null  object
 6   yearOfRegistration   371528 non-null  int64 
 7   gearbox              351319 non-null  object
 8   powerPS              371528 non-null  int64 
 9   model                351044 non-null  object
 10  kilometer            371528 non-null  int64 
 11  monthOfRegistration  371528 non-null  int64 
 12  fuelType             338142 non-null  object
 13  brand                371528 non-null  object
 14  notRepairedDamage    299468 non-null  object
 15  dateCrea

In [31]:
#get the sum of null values in each column
print("Missing values in the data: ", cars_df.isnull().sum()) 

Missing values in the data:  name                       0
seller                     0
offerType                  0
price                      0
abtest                     0
vehicleType            37869
yearOfRegistration         0
gearbox                20209
powerPS                    0
model                  20484
kilometer                  0
monthOfRegistration        0
fuelType               33386
brand                      0
notRepairedDamage      72060
dateCreated                0
nrOfPictures               0
postalCode                 0
lastSeen                   0
dtype: int64


In [32]:
#percentage of missing values in each column
blank_percent = cars_df.isnull().sum() * 100 / len(cars_df)
print("Percentage of missing values in each column: ", blank_percent)

Percentage of missing values in each column:  name                    0.000000
seller                  0.000000
offerType               0.000000
price                   0.000000
abtest                  0.000000
vehicleType            10.192771
yearOfRegistration      0.000000
gearbox                 5.439429
powerPS                 0.000000
model                   5.513447
kilometer               0.000000
monthOfRegistration     0.000000
fuelType                8.986133
brand                   0.000000
notRepairedDamage      19.395577
dateCreated             0.000000
nrOfPictures            0.000000
postalCode              0.000000
lastSeen                0.000000
dtype: float64
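
The counts and percentages above can also be read side by side in a single table. This is a minimal sketch on a small made-up frame rather than on `cars_df`; the helper name `missing_summary` and the toy values are assumptions for illustration.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts * 100 / len(df)).round(2),
    })
    # Keep only the columns that actually contain missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for cars_df (values are illustrative only).
toy_df = pd.DataFrame({
    "gearbox": ["manual", None, "automatic", "manual"],
    "notRepairedDamage": [None, None, "no", "yes"],
    "price": [480, 18300, 9800, 1500],
})
summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(cars_df)` would apply the same logic to the full dataset.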


In [33]:
#check for duplicate rows
print("Number of duplicate rows: ", cars_df.duplicated().sum())

Number of duplicate rows:  29
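
Note that `duplicated()` compares column values only and ignores the index, so two listings crawled at different times still count as duplicates if every column matches. Below is a small sketch of inspecting and dropping such rows; the toy frame is made up, and whether duplicates should be dropped at all is a judgment call this notebook has not made yet.

```python
import pandas as pd

# Toy frame: the first two rows have identical columns but different index values.
toy_df = pd.DataFrame(
    {
        "name": ["Golf_3_1.6", "Golf_3_1.6", "A5_Sportback_2.7_Tdi"],
        "price": [480, 480, 18300],
    },
    index=["2016-03-24 11:52:17", "2016-03-25 09:00:00", "2016-03-24 10:58:45"],
)

# Mark every repeat after the first occurrence as a duplicate.
dup_mask = toy_df.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# Drop the repeats, keeping the first occurrence of each row.
deduped_df = toy_df.drop_duplicates(keep="first")
print("duplicates:", n_duplicates, "rows kept:", len(deduped_df))
```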


In [42]:
#count the rows in which every value is missing
print("Rows with all values missing: ", cars_df.isna().all(axis=1).sum())

Rows with all values missing:  0


In [35]:
#descriptive statistics of the data
cars_df.describe()

,price,yearOfRegistration,powerPS,kilometer,monthOfRegistration,nrOfPictures,postalCode
count,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0
mean,17295.14,2004.577997,115.549477,125618.688228,5.734445,0.0,50820.66764
std,3587954.0,92.866598,192.139578,40112.337051,3.712412,0.0,25799.08247
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1150.0,1999.0,70.0,125000.0,3.0,0.0,30459.0
50%,2950.0,2003.0,105.0,150000.0,6.0,0.0,49610.0
75%,7200.0,2008.0,150.0,150000.0,9.0,0.0,71546.0
max,2147484000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0
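
The mean price (about 17,295) sits far above the median (2,950), and the maximum exceeds two billion, which points to a handful of extreme listings distorting the summary. The sketch below counts values outside a plausible range on a made-up series; the bounds `PRICE_MIN` and `PRICE_MAX` and the helper `count_outside` are assumptions, not values taken from the dataset.

```python
import pandas as pd


def count_outside(series: pd.Series, lower: float, upper: float) -> int:
    """Count values strictly below `lower` or strictly above `upper`."""
    return int(((series < lower) | (series > upper)).sum())


# Illustrative prices, including a zero and one absurdly large value.
toy_prices = pd.Series([0, 480, 1500, 2950, 7200, 18300, 2_147_483_647], name="price")

# Assumed plausibility bounds; adjust after inspecting the real distribution.
PRICE_MIN, PRICE_MAX = 100, 150_000

n_implausible = count_outside(toy_prices, PRICE_MIN, PRICE_MAX)
median_price = toy_prices.median()
print("implausible prices:", n_implausible)
print("median (robust to the extremes):", median_price)
```

The same check could be run on `powerPS`, whose maximum of 20,000 also looks implausible.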


#### Investigate some of the variables

In [36]:
# Check unique values in 'gearbox' column
print("Unique values in 'gearbox' column: ", cars_df['gearbox'].unique())

# Check the gearbox values
print("Number of values in 'gearbox' column: ", cars_df['gearbox'].value_counts())

Unique values in 'gearbox' column:  ['manual' 'automatic' nan]
Number of values in 'gearbox' column:  manual       274214
automatic     77105
Name: gearbox, dtype: int64
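
`value_counts()` leaves out missing entries by default, which is why the counts above do not add up to the 371,528 rows. A short sketch of including them and reporting shares instead of raw counts, using a made-up series in place of `cars_df['gearbox']`:

```python
import pandas as pd

# Toy series standing in for cars_df['gearbox'] (values are illustrative only).
toy_gearbox = pd.Series(["manual", "manual", "automatic", None], name="gearbox")

# dropna=False keeps missing values as their own category;
# normalize=True returns fractions of the total instead of counts.
shares = toy_gearbox.value_counts(dropna=False, normalize=True)
print(shares)
```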


In [37]:
#check the vehicle type values
print("Number of values in 'vehicleType' column: ", cars_df['vehicleType'].value_counts())

Number of values in 'vehicleType' column:  limousine        95894
small car        80023
station wagon    67564
bus              30201
convertible      22898
coupe            19015
suv              14707
other             3357
Name: vehicleType, dtype: int64


In [38]:
#check the brand values (15 most common)
print("Number of values in 'brand' column: ", cars_df['brand'].value_counts().nlargest(15))

Number of values in 'brand' column:  volkswagen       79640
bmw              40274
opel             40136
mercedes_benz    35309
audi             32873
ford             25573
renault          17969
peugeot          11027
fiat              9676
seat              7022
mazda             5695
skoda             5641
smart             5249
citroen           5182
nissan            5037
Name: brand, dtype: int64


In [39]:
# Check unique values in 'seller' column
print("Unique values in 'seller' column: ", cars_df['seller'].unique())

print(cars_df['seller'].value_counts())

Unique values in 'seller' column:  ['private' 'dealer']
private    371525
dealer          3
Name: seller, dtype: int64
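
With only 3 dealer listings, `seller` carries almost no information, and `nrOfPictures` is 0 in every row according to the descriptive statistics. One way to flag such columns is to measure the share of the most frequent value in each. This is a sketch on made-up data; the helper `dominant_share` and the threshold `DOMINANT_THRESHOLD` are assumptions.

```python
import pandas as pd


def dominant_share(series: pd.Series) -> float:
    """Return the fraction of rows taken up by the most frequent value."""
    return float(series.value_counts(normalize=True, dropna=False).iloc[0])


# Toy frame standing in for cars_df (values are illustrative only).
toy_df = pd.DataFrame({
    "seller": ["private", "private", "private", "dealer"],
    "nrOfPictures": [0, 0, 0, 0],
    "price": [480, 18300, 9800, 1500],
})

# Assumed cut-off; a real analysis would likely use something closer to 0.99.
DOMINANT_THRESHOLD = 0.75

shares = pd.Series({col: dominant_share(toy_df[col]) for col in toy_df.columns})
low_info = shares[shares >= DOMINANT_THRESHOLD].index.tolist()
print(shares)
print("near-constant columns:", low_info)
```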


In [43]:
# Check for unusual values in yearOfRegistration
print("Earliest registration year: ", cars_df["yearOfRegistration"].min())
print("Latest registration year: ", cars_df["yearOfRegistration"].max())

# Check for unusual values in monthOfRegistration
print("Earliest registration month: ", cars_df["monthOfRegistration"].min())
print("Latest registration month: ", cars_df["monthOfRegistration"].max())


Earliest registration year:  1000
Latest registration year:  9999
Earliest registration month:  0
Latest registration month:  12
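
Registration years of 1000 and 9999 cannot be real, and a month of 0 falls outside the calendar range of 1 to 12. The sketch below separates plausible from implausible values on a made-up frame. The bounds `YEAR_MIN` and `YEAR_MAX` are assumptions (the upper bound follows the 2016 crawl dates seen in the preview), and treating month 0 as "unknown" is also an assumption.

```python
import pandas as pd

# Toy frame standing in for cars_df (values are illustrative only).
toy_df = pd.DataFrame({
    "yearOfRegistration": [1000, 1993, 2011, 2016, 9999],
    "monthOfRegistration": [0, 5, 8, 12, 6],
})

# Assumed plausible range for registration years.
YEAR_MIN, YEAR_MAX = 1900, 2016

# between() is inclusive on both ends by default.
valid_year = toy_df["yearOfRegistration"].between(YEAR_MIN, YEAR_MAX)
valid_month = toy_df["monthOfRegistration"].between(1, 12)

plausible_df = toy_df[valid_year]
n_bad_years = int((~valid_year).sum())
n_bad_months = int((~valid_month).sum())
print("rows with plausible year:", len(plausible_df))
print("implausible years:", n_bad_years, "months outside 1-12:", n_bad_months)
```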


### Notes on data


#### Data issues