# eBay car sales analysis
### Hemanth Soni, July 2020

---

The goal of this project is to clean and analyze a subset of eBay used car listings. The [original database](https://www.kaggle.com/orgesleka/used-cars-database/data) was uploaded to Kaggle, but [Dataquest](https://dataquest.io) has created a version that is smaller (50K rows) and dirtier to help practice data cleaning.

## Importing Data

I'll start by setting up the working environment.

In [39]:
import pandas as pd
import numpy as np

autos = pd.read_csv('car_data/autos.csv',encoding='Latin-1')

Next, I'll quickly examine the dataset to see if anything stands out.

In [40]:
autos.info()
print('')
autos.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

| | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.0 | 47320 | 50000.0 | 47242 | 50000 | 50000.0 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.0 | 50000 |
| unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | | 2 | | 245 | 13 | | 7 | 40 | 2 | 76 | | | 39481 |
| top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | | manuell | | golf | 150,000km | | benzin | volkswagen | nein | 2016-04-03 00:00:00 | | | 2016-04-07 06:17:27 |
| freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | | 36993 | | 4024 | 32424 | | 30107 | 10687 | 35232 | 1946 | | | 8 |
| mean | | | | | | | | 2005.07328 | | 116.35592 | | | 5.72336 | | | | | 0.0 | 50813.6273 | |
| std | | | | | | | | 105.712813 | | 209.216627 | | | 3.711984 | | | | | 0.0 | 25779.747957 | |
| min | | | | | | | | 1000.0 | | 0.0 | | | 0.0 | | | | | 0.0 | 1067.0 | |
| 25% | | | | | | | | 1999.0 | | 70.0 | | | 3.0 | | | | | 0.0 | 30451.0 | |
| 50% | | | | | | | | 2003.0 | | 105.0 | | | 6.0 | | | | | 0.0 | 49577.0 | |
| 75% | | | | | | | | 2008.0 | | 150.0 | | | 9.0 | | | | | 0.0 | 71540.0 | |


From these summaries, a few data quality issues become apparent:
* The maximum year of registration is 9999 and the minimum is 1000; both are impossible.
* The maximum powerPS (metric horsepower) is impossibly high, and the minimum of 0 is too low (unless the car isn't operational).
* There are some 4-digit postal codes. German postal codes have five digits, so these have most likely lost a leading zero by being read as integers, but that still needs confirming.
* Some values that could be int or float are stored as objects (e.g. price and odometer, i.e. distance travelled).
* Some columns have null data (notRepairedDamage, fuelType, gearbox, vehicleType); I'll explore later whether those can be filled or fixed.
* Some columns are useless (e.g. offerType, seller), with exactly the same value in almost every row (49999 out of 50K).
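
As a rough illustration of how the range issues above could be counted, here is a minimal sketch. The sample rows, the value 17700 and the plausibility bounds (years 1900 to 2016, power 1 to 1000 PS) are assumptions made up for this example, not values taken from the analysis.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook would use the `autos` DataFrame.
sample = pd.DataFrame({
    "yearOfRegistration": [1000, 1999, 2008, 9999],
    "powerPS": [0, 70, 150, 17700],
    "postalCode": [1067, 30451, 49577, 71540],
})

# Assumed plausibility bounds, chosen for illustration only.
year_ok = sample["yearOfRegistration"].between(1900, 2016)
power_ok = sample["powerPS"].between(1, 1000)

print((~year_ok).sum(), "rows with an implausible registration year")
print((~power_ok).sum(), "rows with an implausible powerPS")

# Zero-padding shows what a 4-digit code looks like as a 5-character string.
postal_str = sample["postalCode"].astype(str).str.zfill(5)
print(postal_str.tolist())
```

Counting the flagged rows first, rather than dropping them straight away, makes it easier to judge how much data a given bound would remove.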

## Cleaning data

### Renaming columns

I'll start by renaming the columns to follow typical Python naming conventions.

In [41]:
# Renaming from camelCase to snake_case

print(autos.columns) # to get an output that can be copied, pasted and modified

# Writing new column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'ps_power', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pics', 'postal',
       'last_seen']

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
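
The names above were written by hand, which also allowed rewording (for example `yearOfRegistration` became `registration_year`). For a purely mechanical camelCase to snake_case conversion, a small helper like the one below could be used instead. `camel_to_snake` is a hypothetical name introduced for this sketch, and its output differs from the hand-picked names.

```python
import re

def camel_to_snake(name):
    """Insert an underscore at each lowercase-to-uppercase boundary, then lowercase."""
    # "powerPS" -> "power_PS", "nrOfPictures" -> "nr_Of_Pictures"
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.lower()

original = ["dateCrawled", "offerType", "powerPS", "notRepairedDamage"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Applied to a DataFrame, this would look like `autos.columns = [camel_to_snake(c) for c in autos.columns]`.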


### Removing anomalies

In [42]:
# Checking in on the rows that have unique values in the seller or offer_type columns (only one of each)
print(autos[autos['seller'] != 'privat'])
print(autos[autos['offer_type'] != 'Angebot'])

             date_crawled                                         name  \
7738  2016-03-15 18:06:22  Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten   

          seller offer_type price   abtest vehicle_type  registration_year  \
7738  gewerblich    Angebot  $100  control        kombi               2000   

      gearbox  ps_power   model   odometer  registration_month fuel_type  \
7738  manuell         0  megane  150,000km                   8    benzin   

        brand unrepaired_damage           ad_created  num_pics  postal  \
7738  renault               NaN  2016-03-15 00:00:00         0   65232   

                last_seen  
7738  2016-04-06 17:15:37  
              date_crawled                  name  seller offer_type price  \
17541  2016-04-03 15:48:33  Suche_VW_T5_Multivan  privat     Gesuch    $0   

      abtest vehicle_type  registration_year gearbox  ps_power        model  \
17541   test          bus               2005     NaN         0  transporter   

        odometer  regi

In [None]:
# It's unclear why these two rows differ (they don't appear to be obvious errors). Since nearly every row shares the same value, I'll drop both columns.
autos = autos.drop(columns=['seller','offer_type'])
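
Columns like these can also be found programmatically instead of by reading the summary table. The sketch below flags columns where a single value dominates. The sample data and the 80% threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the listings table.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "offer_type": ["Angebot"] * 10,
    "brand": ["volkswagen", "bmw", "opel", "audi", "ford"] * 2,
})

# Share of rows taken by each column's most frequent value.
top_share = sample.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Assumed threshold: flag columns where one value covers at least 80% of rows.
near_constant = top_share[top_share >= 0.8].index.tolist()
print(near_constant)
```

A column flagged this way is only a candidate for removal; whether it is really uninformative still depends on what the rare values mean.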

### Changing numerical columns to int/float values

Price and odometer (distance) are stored as objects; they are better represented as numbers, so I'll start by stripping the non-numeric characters from both.

In [48]:
print(autos['price'].value_counts())

# From what I can see, the price uses a $XXX,XXX format (dollar sign at the beginning, and commas separating every three digits).

# regex=False makes sure '$' is treated as a literal character, not as the end-of-string anchor
autos['price'] = autos['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)

0        1421
500       781
1500      734
2500      643
1200      639
         ... 
66964       1
21690       1
4335        1
7420        1
23790       1
Name: price, Length: 2357, dtype: int64


In [None]:
print(autos['odometer'].value_counts())

# From what I can see, commas separate every three digits, with 'km' at the end. I'll use string replace to clean this up.

autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','')
autos.rename(columns={'odometer':'odometer_km'}, inplace=True)

In [57]:
print('Overview of price column')
print('')
print(autos['price'].unique().shape)
print('')
print(autos['price'].describe())
print('')
print(autos['price'].value_counts())
print('')
print('Overview of odometer column')
print('')
print(autos['odometer_km'].unique().shape)
print('')
print(autos['odometer_km'].describe())
print('')
print(autos['odometer_km'].value_counts())

Overview of price column

(2357,)

count     50000
unique     2357
top           0
freq       1421
Name: price, dtype: object

0        1421
500       781
1500      734
2500      643
1200      639
         ... 
66964       1
21690       1
4335        1
7420        1
23790       1
Name: price, Length: 2357, dtype: int64

Overview of odometer column

(13,)

count      50000
unique        13
top       150000
freq       32424
Name: odometer_km, dtype: object

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
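
The `describe()` summaries above still report `dtype: object` for both columns, which suggests the cleaned values are still strings. A minimal sketch of finishing the conversion follows; it uses hypothetical sample rows in the raw format and assumes every value is a whole number once the symbols are removed.

```python
import pandas as pd

# Hypothetical rows in the same raw format as the price and odometer columns.
sample = pd.DataFrame({
    "price": ["$0", "$1,500", "$23,790"],
    "odometer": ["150,000km", "5,000km", "125,000km"],
})

# regex=False treats "$" as a literal character rather than an end-of-string anchor.
sample["price"] = (
    sample["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
sample["odometer_km"] = (
    sample["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
sample = sample.drop(columns="odometer")

print(sample.dtypes)
print(sample["price"].describe())
```

Once the columns are numeric, `describe()` reports mean, min, max and quartiles instead of count, unique, top and freq, which makes outliers such as the many zero prices easier to spot.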
