# Initialization

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import plotly.express as px
import math as mt
import requests 
from io import BytesIO

# Load Data

In [2]:
# Try to read the dataset from a Google Sheet I made public; if that fails, load it from the project directory

try:
    sheet_url = "https://docs.google.com/spreadsheets/d/1jzihXIadik_fkLF3u5dtNW2ELG0Ts3CNjOMFAVnrPL0/edit?usp=sharing"
    r = requests.get(sheet_url)
    r.raise_for_status()  # treat HTTP errors as a failed download
    df = pd.read_csv(BytesIO(r.content))
except Exception:
    # Fall back to the local copy of the dataset
    df = pd.read_csv("/Users/juansiliezar/sprint-4-software-development-tools/datasets/vehicles_us.csv")
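
A share link ending in `/edit?usp=sharing` normally returns the HTML of the sheet editor rather than CSV, so reading it with `pd.read_csv` may fail and silently fall through to the local file. The sketch below shows one way to derive a CSV download link from the share link. It assumes the sheet is public and that Google's `/export?format=csv` endpoint applies to it; `to_csv_export_url` is a hypothetical helper name, not part of any library.

```python
import re


def to_csv_export_url(share_url: str) -> str:
    """Turn a Google Sheets share link into a CSV export link (hypothetical helper)."""
    # The sheet ID is the path segment that follows /spreadsheets/d/
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"Not a Google Sheets URL: {share_url}")
    sheet_id = match.group(1)
    # Assumption: the sheet is public, so this endpoint serves the first tab as CSV
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"


sheet_url = "https://docs.google.com/spreadsheets/d/1jzihXIadik_fkLF3u5dtNW2ELG0Ts3CNjOMFAVnrPL0/edit?usp=sharing"
csv_url = to_csv_export_url(sheet_url)
print(csv_url)
```

The resulting `csv_url` could then be passed to `requests.get` in place of `sheet_url`, keeping the local file as the fallback.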
    

# Data Preparation 

In [6]:
df.info()
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44652 | 15900 | 2018.0 | dodge grand caravan | good | 6.0 | gas | 51051.0 | automatic | van | grey | NaN | 2018-09-14 | 11 |
| 42998 | 3500 | 2007.0 | chevrolet impala | good | 6.0 | hybrid | 153000.0 | automatic | sedan | black | NaN | 2019-01-22 | 30 |
| 42817 | 3200 | 1992.0 | gmc sierra | good | 8.0 | gas | 145700.0 | automatic | truck | NaN | 1.0 | 2018-06-18 | 39 |
| 41892 | 2500 | 1997.0 | ram 1500 | fair | 8.0 | gas | 187000.0 | automatic | pickup | black | 1.0 | 2018-07-24 | 13 |
| 488 | 10495 | 2013.0 | toyota camry | excellent | 4.0 | gas | 111337.0 | automatic | other | grey | NaN | 2019-01-15 | 17 |


In [9]:
# Checking for duplicate rows
df.duplicated().sum()

0

In [20]:
# Investigating missing values in model year column
missing_model_year = df[df['model_year'].isna()]
missing_model_year.shape

(3619, 13)

In [21]:
# Investigating is_4wd column
df['is_4wd'].value_counts(dropna=False)

is_4wd
NaN    25953
1.0    25572
Name: count, dtype: int64

In [23]:
# Investigating cylinders column
df['cylinders'].value_counts(dropna=False)

cylinders
8.0     15844
6.0     15700
4.0     13864
NaN      5260
10.0      549
5.0       272
3.0        34
12.0        2
Name: count, dtype: int64

In [25]:
# Investigating paint_color column
df['paint_color'].value_counts(dropna=False)

paint_color
white     10029
NaN        9267
black      7692
silver     6244
grey       5037
blue       4475
red        4421
green      1396
brown      1223
custom     1153
yellow      255
orange      231
purple      102
Name: count, dtype: int64

In [26]:
# Investigating odometer column
df['odometer'].value_counts(dropna=False)

odometer
NaN         7892
0.0          185
140000.0     183
120000.0     179
130000.0     178
            ... 
87836.0        1
172625.0       1
103597.0       1
167239.0       1
139573.0       1
Name: count, Length: 17763, dtype: int64
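
The per-column checks above can also be condensed into a single missing-value summary. This is a minimal sketch on a toy frame with made-up values standing in for the vehicles data; the names `toy` and `missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the vehicles data (values are illustrative only)
toy = pd.DataFrame({
    "price": [15900, 3500, 3200, 2500],
    "model_year": [2018.0, np.nan, 1992.0, 1997.0],
    "paint_color": ["grey", "black", None, None],
    "is_4wd": [np.nan, np.nan, 1.0, np.nan],
})

# Count missing values per column, add the share of rows affected,
# and list the most incomplete columns first
missing = (
    toy.isna().sum()
    .to_frame("n_missing")
    .assign(pct_missing=lambda t: (t["n_missing"] / len(toy) * 100).round(1))
    .sort_values("n_missing", ascending=False)
)

# Keep only the columns that actually have gaps
missing = missing[missing["n_missing"] > 0]
print(missing)
```

Running the same chain on `df` instead of `toy` would give the counts for all 13 columns in one table.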

<div class='alert alert-info'> <b>Initial observations</b>

The dataset has 51,525 rows and 13 columns. The index is a standard zero-based index, which suits this dataset because there is no other unique identifier for each car. There are no duplicate rows. Missing values appear in the `paint_color` (9,267), `odometer` (7,892), `cylinders` (5,260), `model_year` (3,619), and `is_4wd` (25,953) columns. However, the missing values in the `is_4wd` column are not truly missing: the only other value in that column is `1`, which is commonly used to represent "yes", so the blank rows are most likely cars that are not four-wheel drive.

For the other columns with missing values, I think dropping those rows would be detrimental to the analysis because (1) the gaps reflect the real used car market, where people are forced to make decisions with the information available to them, and (2) the main variables of interest, in my opinion, are `model` and `price`, which contain no missing values. We will therefore keep these rows. Some of these columns do not have appropriate data types, so I will correct those. I will also fill in the missing values for `cylinders` and `paint_color` with the value "Unknown", as I am treating these as categorical variables.

1. Fill in missing values in the `is_4wd` column with the value `0` to represent 'no'
2. Fill in missing values for `paint_color` and `cylinders` with the placeholder value 'Unknown'
3. Change `model_year` dtype to `int64`
4. Change `cylinders` dtype to `int64`
5. Change `date_posted` dtype to `datetime`

Questions about the data

- What is the range of years that cars in this dataset were sold?
- What are the most common car models in the used car market?
    
</div>
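
The cleaning steps listed above can be sketched on a toy frame as follows. This is an illustration under stated assumptions, not the notebook's final code: `clean_vehicles` is a hypothetical helper, the rows are made up, and the integer columns use pandas' nullable `Int64` dtype because plain `int64` cannot hold missing values. The plan lists `cylinders` under both the 'Unknown' placeholder and the integer conversion; a single column cannot hold both, so this sketch assumes it stays numeric.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the vehicles data (illustrative values only)
toy = pd.DataFrame({
    "model_year": [2018.0, np.nan, 1992.0],
    "cylinders": [6.0, 8.0, np.nan],
    "paint_color": ["grey", None, "black"],
    "is_4wd": [np.nan, 1.0, np.nan],
    "date_posted": ["2018-09-14", "2019-01-22", "2018-06-18"],
})


def clean_vehicles(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the planned fixes to a copy of df (hypothetical helper)."""
    out = df.copy()
    # A missing is_4wd value is read as "not four-wheel drive"
    out["is_4wd"] = out["is_4wd"].fillna(0).astype("int64")
    # The text column gets an explicit placeholder
    out["paint_color"] = out["paint_color"].fillna("Unknown")
    # Assumption: nullable Int64 keeps missing years and cylinders as <NA>
    out["model_year"] = out["model_year"].astype("Int64")
    out["cylinders"] = out["cylinders"].astype("Int64")
    # Parse the posting date from its ISO-style string
    out["date_posted"] = pd.to_datetime(out["date_posted"], format="%Y-%m-%d")
    return out


cleaned = clean_vehicles(toy)
print(cleaned.dtypes)
```

Working on a copy leaves the raw frame untouched, which makes it easy to compare before and after while the cleaning rules are still being decided.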

In [None]:
# Correcting inappropriate data types
# Nullable 'Int64' is used because both columns still contain NaN, which plain 'int64' cannot hold
df['model_year'] = df['model_year'].astype('Int64')
df['cylinders'] = df['cylinders'].astype('Int64')
df['date_posted'] = pd.to_datetime(df['date_posted'], format='%Y-%m-%d')