# Acquire the data and handle a large dataset (~1.45 GB)

#### Depending on where you're reading this notebook, the large 'vehicles.csv' file may not be available alongside it. If so, the original dataset can be found at https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data

A good way to handle large datasets without having to learn a more complex distributed framework, such as PySpark, is to read the file with Dask, a flexible library for parallel computing in Python.

In [1]:
from dask import dataframe as dd
import time

start = time.time()
dask_df = dd.read_csv('vehicles.csv', dtype='object')
end = time.time()
print("Read csv with dask: ",(end-start),"sec")

Read csv with dask:  0.023141145706176758 sec


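As the timing shows, the call returns in a fraction of a second. That is because `dd.read_csv` is lazy: it only inspects the start of the file and builds a task graph, and the data itself is read later, when a result is actually requested. Dask DataFrames also implement only a subset of the Pandas API, so Pandas is more convenient for data manipulation. Converting the whole Dask DF to a Pandas DF would defeat the purpose given the dataset's size, so we'll take a random sample of it (~5%) and convert that sample to a Pandas DF.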

In [2]:
# len(dask_df) forces a full pass over the file to count its rows
dask_sample = dask_df.sample(frac=21334/len(dask_df), replace=False, random_state=10)
len(dask_sample)

21335
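
The sample has 21335 rows, not the 21334 the fraction was built from. A likely explanation, assuming Dask applies `frac` to each partition separately and rounds each result, is that the per-partition roundings do not add up to the rounding of the total. The sketch below reproduces that effect with plain Pandas on made-up partitions; the partition sizes and the fraction are illustrative only.

```python
import pandas as pd

# Three hypothetical "partitions" of 10 rows each (30 rows in total).
partitions = [pd.DataFrame({"x": range(10)}) for _ in range(3)]
whole = pd.concat(partitions, ignore_index=True)

frac = 0.16

# Sampling the whole frame: round(0.16 * 30) rows.
n_whole = len(whole.sample(frac=frac, random_state=10))

# Sampling each partition on its own: round(0.16 * 10) rows per partition.
n_per_partition = sum(len(p.sample(frac=frac, random_state=10)) for p in partitions)

print(n_whole, n_per_partition)
```

The two counts differ by one row here, the same kind of small discrepancy seen above.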

In [3]:
import pandas as pd

To convert the sample into a Pandas DF we need to call the Dask DF method _compute()_. Simply assigning the Dask sample to a new variable won't work, because it would still be a lazy Dask object.

In [4]:
pandas_df = dask_sample.compute()

And here we go! A random sample of our large dataset in a Pandas DataFrame.

In [5]:
pandas_df.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
3724,7302715773,https://huntsville.craigslist.org/ctd/d/nashvi...,huntsville / decatur,https://huntsville.craigslist.org,23989,,,a RAV4,,4 cylinders,...,,SUV,black,https://images.craigslist.org/00C0C_bFW2ZNfMHB...,"2018 *Toyota* *RAV4* 2018 TOYOTA RAV4, XLE, AW...",,al,36.138037,-86.731163,2021-04-06T13:12:50-0500
4084,7310465657,https://mobile.craigslist.org/ctd/d/mobile-zer...,mobile,https://mobile.craigslist.org,199,2017.0,dodge,charger,,,...,,,,https://images.craigslist.org/00i0i_ChcqziILO2...,2017 Dodge Charger 20120 miles. 1 owner and cl...,,al,30.7309,-88.0789,2021-04-21T13:39:20-0500
2007,7308463415,https://dothan.craigslist.org/ctd/d/alachua-20...,dothan,https://dothan.craigslist.org,20820,2020.0,ford,ecosport titanium,,,...,,,,https://images.craigslist.org/00j0j_7F1cS2oKnD...,2020 FORD ECOSPORT TITANIUM ~ Hundreds of NEW ...,,al,29.802071,-82.530799,2021-04-17T15:48:18-0500
7877,7311916610,https://fairbanks.craigslist.org/cto/d/north-p...,fairbanks,https://fairbanks.craigslist.org,3995,1984.0,chevrolet,k5 blazer,good,8 cylinders,...,full-size,pickup,black,https://images.craigslist.org/00a0a_87FrYF5Sfv...,"1984 K5 Bkazer, 32.000 miles on rebuilt motor....",,ak,64.7805,-147.3694,2021-04-24T08:42:31-0800
3858,7314950350,https://mobile.craigslist.org/ctd/d/foley-2012...,mobile,https://mobile.craigslist.org,14890,2012.0,chevrolet,silverado 1500 ls,excellent,8 cylinders,...,full-size,truck,black,https://images.craigslist.org/00m0m_4uWlVspa23...,SILVERADO 1500.....VERY CLEAN TRUCK CLEAN HIST...,,al,30.429711,-87.683198,2021-04-30T13:57:40-0500


In [6]:
len(pandas_df)

21335

Great. So that we don't have to run all of this again, and so that the project is easy to export, we'll save this Pandas DF as a much smaller CSV file.

In [7]:
pandas_df.to_csv('vehicles_reduced.csv', index=False) 

Let's import our CSV file and make sure it was properly saved.

In [8]:
df = pd.read_csv('vehicles_reduced.csv')
df.sample(10)

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,size,type,paint_color,image_url,description,county,state,lat,long,posting_date
16987,7314052565,https://providence.craigslist.org/cto/d/provid...,rhode island,https://providence.craigslist.org,2800,2003.0,mini,cooper,good,,...,,coupe,yellow,https://images.craigslist.org/00K0K_8QnLofzwRn...,• hit me up with offers •looking to get rid of...,,ri,41.8226,-71.4139,2021-04-28T17:28:00-0400
20749,7310151399,https://yakima.craigslist.org/ctd/d/lynnwood-2...,yakima,https://yakima.craigslist.org,31999,2015.0,chevrolet,colorado,,4 cylinders,...,,truck,,https://images.craigslist.org/00Y0Y_7hPv1NkcBy...,FREE SHIPPING TO YAKIMAASK US HOW IT WORKSCALL...,,wa,47.81247,-122.32164,2021-04-20T18:32:39-0700
15523,7313789917,https://oklahomacity.craigslist.org/ctd/d/okla...,oklahoma city,https://oklahomacity.craigslist.org,18990,2018.0,mitsubishi,eclipse cross es,good,,...,,hatchback,red,https://images.craigslist.org/00q0q_7T5On2tjjA...,Carvana is the safer way to buy a car During t...,,ok,35.46,-97.51,2021-04-28T09:01:08-0500
18081,7311364208,https://beaumont.craigslist.org/ctd/d/beaumont...,beaumont / port arthur,https://beaumont.craigslist.org,6500,2011.0,subaru,legacy,,,...,,,,https://images.craigslist.org/00E0E_idq5DRNRT3...,"Clean inside/out and runs smooth, recently ser...",,tx,30.0211,-94.1157,2021-04-23T10:31:45-0500
7675,7311590728,https://desmoines.craigslist.org/ctd/d/carroll...,des moines,https://desmoines.craigslist.org,33800,2000.0,,plymouth prowler,excellent,6 cylinders,...,,coupe,black,https://images.craigslist.org/00g0g_aAgEJMZQ1o...,Check out this 2000 Plymouth Prowler Roadster ...,,ia,42.064092,-94.861561,2021-04-23T16:36:46-0500
3407,7309539504,https://stockton.craigslist.org/ctd/d/sacramen...,stockton,https://stockton.craigslist.org,16288,2010.0,lexus,rx 350 4x2 with navigation,good,6 cylinders,...,,other,,https://images.craigslist.org/00w0w_6ixwqwEISt...,2010 * Lexus * * RX 350 4x2 With Navigation ...,,ca,38.610767,-121.422557,2021-04-19T15:41:00-0700
688,7311555472,https://fayar.craigslist.org/ctd/d/tulsa-2013-...,fayetteville,https://fayar.craigslist.org,8500,2013.0,ford,edge,excellent,6 cylinders,...,mid-size,SUV,white,https://images.craigslist.org/00h0h_4G8OUa7Jru...,2013 Ford Edge Limited FREE WARRANTY!! Ocean ...,,ar,36.192874,-95.75898,2021-04-23T15:32:59-0500
5139,7314541668,https://orlando.craigslist.org/ctd/d/kissimmee...,orlando,https://orlando.craigslist.org,11900,2016.0,volkswagen,jetta,,,...,compact,sedan,,https://images.craigslist.org/00N0N_8nZ44fojLl...,2016 Volkswagen Jetta SEL 6A Offered by: W...,,fl,28.329026,-81.404237,2021-04-29T17:39:37-0400
19548,7307224846,https://vermont.craigslist.org/ctd/d/south-bur...,vermont,https://vermont.craigslist.org,25995,2016.0,ford,explorer,,,...,,,silver,https://images.craigslist.org/00O0O_7uyhqmwzEi...,Second Street Auto - 1000 Second Street - Manc...,,vt,44.461201,-73.193703,2021-04-15T12:09:28-0400
14968,7313048517,https://toledo.craigslist.org/ctd/d/wapakoneta...,toledo,https://toledo.craigslist.org,20580,2017.0,buick,enclave,excellent,,...,,SUV,,https://images.craigslist.org/00c0c_aBBsSQdTii...,2017 *** Buick Enclave Convenience Group SUV *...,,oh,40.573231,-84.184755,2021-04-26T17:34:38-0400


In [9]:
len(df)

21335
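
A matching row count is a good first check. A slightly stronger one, sketched below on made-up data, also compares the shape and column names after the round trip. Note that a frame read with `dtype='object'` holds strings, while `pd.read_csv` with default settings infers numeric types, so the dtypes are expected to differ after reloading. The file name `toy_reduced.csv` and the columns are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the sampled frame: every value stored as a string.
original = pd.DataFrame(
    {"id": ["1", "2", "3"], "price": ["100", "250", "75"]}, dtype="object"
)

# Write and reload, mirroring the to_csv / read_csv steps above.
path = os.path.join(tempfile.mkdtemp(), "toy_reduced.csv")
original.to_csv(path, index=False)
reloaded = pd.read_csv(path)

same_shape = reloaded.shape == original.shape
same_columns = list(reloaded.columns) == list(original.columns)
print(same_shape, same_columns)

# The reloaded columns are inferred as numeric, unlike the string originals.
print(reloaded.dtypes)
```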