# eBay car sales analysis
### Hemanth Soni, July 2020

---

The goal of this project is to clean and analyze a subset of eBay used car listings. The [original database](https://www.kaggle.com/orgesleka/used-cars-database/data) was uploaded to Kaggle, but [Dataquest](https://dataquest.io) has created a version that is smaller (50K rows) and dirtier, which makes it better suited to practicing data cleaning.

## Importing data

I'll start by setting up the working environment.

In [5]:
# Importing libraries

import numpy as np
import pandas as pd

# Loading the CSV file into a DataFrame

autos = pd.read_csv('car_data/autos.csv', encoding='Latin-1')

In [None]:
# Quick exploration

autos.info()
print()
print(autos.head())
print()
autos.describe(include='all')

From this quick scan, a few things become apparent:
* Most columns contain string types, but some (e.g. odometer) could probably be integers or floats, and some (e.g. damage) could probably be booleans
* The data looks mostly complete; a few columns have some null values, but nothing stands out too much
* In the 'seller' and 'offerType' columns, every row except one contains the same value, so both are candidates to be dropped
* There seem to be some errors in the yearOfRegistration column; the minimum is too early and the maximum is too late, but most values seem right
* The maximum power value also seems too high
* A registration month of "0" doesn't make sense given that there is also a 12th month of registration; I'll need to dig into this error
* There are some 4-digit postal codes; I need to check whether that makes sense
* The number of pictures is 0 across the board; I should dig into that column as well
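
A few of these observations can be checked directly rather than read off the summary table. The snippet below is a minimal sketch on a made-up three-column sample, not the real file: the column names `monthOfRegistration` and `nrOfPictures` and the 1900 to 2016 year window are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real listings; all values are made up.
sample = pd.DataFrame({
    "yearOfRegistration": [1000, 2004, 2011, 9999],
    "monthOfRegistration": [0, 3, 12, 7],
    "nrOfPictures": [0, 0, 0, 0],
})

# Rows whose registration year falls outside a plausible window.
# The 1900-2016 bounds are an assumption, not something taken from the data.
bad_years = sample[~sample["yearOfRegistration"].between(1900, 2016)]

# How often each month value occurs; a 0 next to 1-12 hints at a placeholder.
month_counts = sample["monthOfRegistration"].value_counts().sort_index()

# Columns with a single distinct value carry no information.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]

print(bad_years)
print(month_counts)
print(constant_cols)
```

Running the same three checks against `autos` would show how many rows each issue actually affects before deciding how to handle it.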

## Cleaning data

Starting with some basic header hygiene:

In [15]:
# Replacing the column names in place with snake_case names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'ps_power', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pics', 'postal',
       'last_seen']
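
The list above was written by hand, which also allowed some columns to be reworded. For comparison, here is a minimal sketch of a purely mechanical camelCase to snake_case conversion. The helper `camel_to_snake` is a hypothetical name introduced for this example, and its output differs from the hand-written names wherever those were reworded.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Column names mentioned earlier in this notebook.
original = ["seller", "offerType", "yearOfRegistration"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Note that `yearOfRegistration` becomes `year_of_registration` here, whereas the manual list uses `registration_year`, so the mechanical approach is a starting point rather than a drop-in replacement.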

Next, cleaning up the columns whose values can be stored as integers:

In [None]:
# Taking a look at what the price format typically looks like
print(autos['price'].value_counts())

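# Removing dollar signs and commas to leave only digits
# (regex=False so that '$' is matched literally rather than as a regex anchor)
autos['price'] = autos['price'].str.replace('$','', regex=False).str.replace(',','', regex=False)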

In [None]:
# Doing the same for the odometer column
print(autos['odometer'].value_counts())

# Removing commas and "km" from each item
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','')
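
Stripping the extra characters still leaves both columns as strings. One possible way to finish the conversion is sketched below on made-up values in the formats described above. It assumes there are no missing or non-numeric entries, and the `odometer_km` column name is an assumption introduced here so the unit is not lost.

```python
import pandas as pd

# Hypothetical values in the same formats as the real columns; not from the file.
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

# regex=False makes pandas treat "$" as a literal character, not a regex anchor.
sample["price"] = (
    sample["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Keep the unit in the column name once it is removed from the values.
sample["odometer_km"] = (
    sample["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
sample = sample.drop(columns="odometer")

print(sample.dtypes)
```

If the real columns contain nulls or stray text, `astype(int)` will raise, which is a useful signal that more cleaning is needed first.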

In [None]:
# Exploring the near-constant 'seller' and 'offer_type' columns noted above

print(autos['seller'].value_counts())
print()
print(autos['offer_type'].value_counts())
print()
print(autos[autos['seller'] != 'privat'])
print()
print(autos[autos['offer_type'] != 'Angebot'])

In [29]:
# It's unclear why these rows differ (they don't appear to be obvious errors); since the columns are otherwise constant, I'll drop them.
autos = autos.drop(columns=['seller','offer_type'])
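
The same decision can be made programmatically instead of by inspecting `value_counts()` one column at a time. The following is a rough sketch on a made-up sample: the placeholder value `'other'`, the `brand` values, and the 0.8 threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: two near-constant columns and one varied column.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["other"],
    "offer_type": ["Angebot"] * 9 + ["other"],
    "brand": ["bmw", "audi", "ford", "opel", "bmw",
              "audi", "ford", "opel", "bmw", "vw"],
})

# Share of rows taken up by the most common value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Flag columns dominated by one value; the 0.8 cutoff is an arbitrary assumption.
near_constant = top_share[top_share > 0.8].index.tolist()

trimmed = sample.drop(columns=near_constant)
print(near_constant)
print(trimmed.columns.tolist())
```

On the full dataset a much stricter cutoff would be appropriate, since columns such as gearbox or fuel type may be dominated by one value while still being informative.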