# Searching Data Lab

### Introduction

In this lesson, we'll work with the [Craigslist cars and trucks dataset](https://www.kaggle.com/austinreese/craigslist-carstrucks-data).  The dataset tracks the price of each car along with various details about it.  We'll use our skills to explore the dataset and look for columns that we could coerce to a numeric type.

### Loading our Data

In [13]:
import pandas as pd

url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/vehicles_top_thousand.csv"
df = pd.read_csv(url, index_col=0)

In [21]:
df[:2]

| | id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | vin | drive | size | type | paint_color | image_url | description | state | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7088746062 | https://greensboro.craigslist.org/ctd/d/cary-2... | greensboro | https://greensboro.craigslist.org | 10299 | 2012.0 | acura | tl | NaN | NaN | ... | 19UUA8F22CA003926 | NaN | NaN | other | blue | https://images.craigslist.org/01414_3LIXs9EO33... | 2012 Acura TL Base 4dr Sedan Offered by: B... | nc | 35.7636 | -78.7443 |
| 1 | 7088745301 | https://greensboro.craigslist.org/ctd/d/bmw-3-... | greensboro | https://greensboro.craigslist.org | 0 | 2011.0 | bmw | 335 | NaN | 6 cylinders | ... | NaN | rwd | NaN | convertible | blue | https://images.craigslist.org/00S0S_1kTatLGLxB... | BMW 3 Series 335i Convertible Navigation Dakot... | nc | NaN | NaN |


Let's start by checking whether any column holds the same value in every row.  Here is the `find_all_same` function, which returns the columns that contain only one unique value.  Use it to identify any such columns in `df`.

In [15]:
def find_all_same(df):
    return [col for col in df.columns if len(df[col].unique()) == 1]

In [23]:
# []

So none of the columns have only one value.
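
As a rough illustration of what `find_all_same` picks up, here is a minimal sketch on a small made-up dataframe.  The name `toy_df` and its values are hypothetical and are not taken from the vehicles data.

```python
import pandas as pd

def find_all_same(df):
    # Keep the columns that hold exactly one distinct value
    return [col for col in df.columns if len(df[col].unique()) == 1]

# Hypothetical toy data: only `country` repeats a single value in every row
toy_df = pd.DataFrame({
    "country": ["us", "us", "us"],
    "state": ["nc", "ny", "nc"],
    "price": [10299, 0, 4500],
})

same_cols = find_all_same(toy_df)
print(same_cols)  # ['country']
```

On the vehicles dataframe the same call returned an empty list, since every column there has more than one distinct value.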

### Coercing Columns

Next, let's see if any of the non-numeric columns actually contain numbers.

In [12]:
df.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'vin', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'state', 'lat', 'long'],
      dtype='object')

First, select the columns of type `object` from our dataframe.

In [27]:
obj_df = None

In [29]:
obj_df.columns

# Index(['url', 'region', 'region_url', 'manufacturer', 'model', 'condition',
#        'cylinders', 'fuel', 'title_status', 'transmission', 'vin', 'drive',
#        'size', 'type', 'paint_color', 'image_url', 'description', 'state'],
#       dtype='object')
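
One possible way to pick out the object columns is pandas' `select_dtypes`.  The sketch below runs on a made-up dataframe (`toy_df` and `obj_toy_df` are hypothetical names); the text columns are given an explicit `object` dtype to mirror how the text columns appear in the lesson's output.

```python
import pandas as pd

# Hypothetical toy data with two numeric and two text columns
toy_df = pd.DataFrame({
    "price": [10299, 0],
    "year": [2012.0, 2011.0],
    "model": pd.Series(["tl", "335"], dtype="object"),
    "state": pd.Series(["nc", "nc"], dtype="object"),
})

# Keep only the columns whose dtype is object
obj_toy_df = toy_df.select_dtypes(include="object")
print(list(obj_toy_df.columns))  # ['model', 'state']
```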

Then loop through those columns to see if any of them contain numbers.

In [22]:
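def contains_numbers(column):
    # matches values that contain a digit but no letters, spaces, hyphens, slashes, or "www"
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    # na=False treats missing values as non-matches; any() flags the column if at least one value matches
    return column.str.contains(regex_string, na=False).any()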

> Notice that we switched the function to end with `any`, which makes it more inclusive: a column is flagged if at least one of its values matches the pattern.
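
To get a feel for what the pattern accepts, here is a small sketch that runs `contains_numbers` on two made-up series.  The sample values are hypothetical, and passing `na=False` is an assumption added here so that missing values count as non-matches.

```python
import pandas as pd

def contains_numbers(column):
    # A value matches if it has a digit and no letters, spaces, hyphens, slashes, or "www"
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    # any() returns True as soon as one value in the column matches
    return column.str.contains(regex_string, na=False).any()

# Hypothetical sample values
models = pd.Series(["tl", "335", "xf"], dtype="object")
regions = pd.Series(["greensboro", "new york", None], dtype="object")

print(contains_numbers(models))   # True, because "335" matches
print(contains_numbers(regions))  # False, every value has letters or is missing
```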

In [25]:
number_cols = None
number_cols

# url             False
# region          False
# region_url      False
# manufacturer    False
# model            True
# condition       False
# cylinders       False
# fuel            False
# title_status    False
# transmission    False
# vin              True
# drive           False
# size            False
# type            False
# paint_color     False
# image_url       False
# description     False
# state           False
# dtype: bool
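
One way to run the check over every column is `DataFrame.apply`, which calls the function once per column and collects the results into a series.  This is only a sketch on made-up data; `obj_toy_df` and `number_toy_cols` are hypothetical names, and `na=False` is an added assumption.

```python
import pandas as pd

def contains_numbers(column):
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    return column.str.contains(regex_string, na=False).any()

# Hypothetical toy data made only of text columns
obj_toy_df = pd.DataFrame({
    "region": pd.Series(["greensboro", "greensboro", "raleigh"], dtype="object"),
    "model": pd.Series(["tl", "335", "xf"], dtype="object"),
    "state": pd.Series(["nc", "nc", "nc"], dtype="object"),
})

# apply passes each column to contains_numbers and returns one boolean per column
number_toy_cols = obj_toy_df.apply(contains_numbers)
print(number_toy_cols)
```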



Let's select just the values that are true, to identify our almost-numeric columns and see if we should coerce them.

In [30]:
has_num_cols = None

has_num_cols

# model    True
# vin      True
# dtype: bool
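
A boolean series can be used to filter itself, which keeps only the entries that are `True`.  Here is a minimal sketch with a hand-built series (the name `number_toy_cols` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical result of checking four columns for numbers
number_toy_cols = pd.Series(
    {"region": False, "model": True, "state": False, "vin": True}
)

# Indexing a boolean series with itself keeps only the True entries
has_num_toy_cols = number_toy_cols[number_toy_cols]
print(list(has_num_toy_cols.index))  # ['model', 'vin']
```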

Now select the columns named in `has_num_cols` from our `obj_df`.

In [31]:
potential_num_df = None
potential_num_df[:3]


# model	vin
# 0	tl	19UUA8F22CA003926
# 1	335	NaN
# 2	xf	NaN
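
Since the filtered series is indexed by column name, its index can be passed straight to the dataframe to select those columns.  The sketch below uses made-up data, and all of the names in it are hypothetical.

```python
import pandas as pd

# Hypothetical text-only dataframe
obj_toy_df = pd.DataFrame({
    "region": pd.Series(["greensboro", "greensboro", "raleigh"], dtype="object"),
    "model": pd.Series(["tl", "335", "xf"], dtype="object"),
    "vin": pd.Series(["19UUA8F22CA003926", None, None], dtype="object"),
})

# Hypothetical filtered result: the columns flagged as containing numbers
has_num_toy_cols = pd.Series({"model": True, "vin": True})

# Use the column names held in the index to select from the dataframe
potential_num_toy_df = obj_toy_df[has_num_toy_cols.index]
print(list(potential_num_toy_df.columns))  # ['model', 'vin']
```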



Well, things are not looking good.  Still, let's loop through the columns and select the top five values from each one (use `value_counts` inside the loop).

In [31]:
# [f-150             0.020724
#  silverado 1500    0.011727
#  1500              0.010311
#  silverado         0.010008
#  wrangler          0.008290
#  Name: model, dtype: float64,
#  RUNS GREAT           0.000894
#  1G4PW5SKXG4132450    0.000745
#  1G6AA1RX2G0153368    0.000745
#  O                    0.000745
#  FINANCING            0.000745
#  Name: vin, dtype: float64]
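
The fractions in the output above suggest `value_counts` was called with `normalize=True`, though that is an assumption.  A minimal sketch of such a loop on made-up data (`potential_toy_df` and `top_values` are hypothetical names) could look like this:

```python
import pandas as pd

# Hypothetical toy data for the two candidate columns
potential_toy_df = pd.DataFrame({
    "model": pd.Series(
        ["f-150", "f-150", "f-150", "1500", "1500", "wrangler"], dtype="object"
    ),
    "vin": pd.Series(
        ["RUNS GREAT", "RUNS GREAT", None, "1G4PW5SKXG4132450", None, None],
        dtype="object",
    ),
})

# For each column, compute the share of each value and keep the five most common.
# value_counts ignores missing values by default, so shares are over non-null rows.
top_values = [
    potential_toy_df[col].value_counts(normalize=True).head(5)
    for col in potential_toy_df.columns
]

for counts in top_values:
    print(counts)
```

Looking at the most frequent values this way makes it easy to judge whether a column is mostly numbers or mostly labels.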

Well, these values look more categorical than numeric.  It looks like we don't have any columns to coerce to numbers in this lesson.

### Summary

In this lesson, we went through the procedure of looking for columns to coerce.  Along the way, we saw how to loop through our columns to check whether their values are all the same, and whether any of them hold values that we should make numeric.  We then explored the candidate columns by looping through them and viewing their top values.