#  **Price Analysis**

Load the data from the CSV file.

In [None]:
#Let's import useful libraries

import pandas as pd
import numpy as np
import re

In [None]:
df = pd.read_csv('preprocessed_data_craigslist.csv')
#Drop the index column that was saved with the CSV (columns= already implies axis=1)
df.drop(columns='Unnamed: 0', inplace=True)
df.head()

Unnamed: 0,price,location,url,date,title,numimage,text,condition,makemanufacturer,modelnamenumber,...,electricassist,framesize,handlebartype,suspension,wheelsize,sizedimensions,serialnumber,paintcolor,yearmanufactured,days
0,1000.0,auburn,https://auburn.craigslist.org/mcd/d/fayettevil...,2022-04-05 15:32,"BAD CREDIT, NO CREDIT, OK! WE WORK WITH EVERYONE!",10.0,"WE SHIP NATIONWIDE, FINANCE NATIONWIDE! YOU SE...",unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,16
1,1.0,auburn,https://auburn.craigslist.org/for/d/defuniak-s...,2022-04-17 13:32,Atv’s,23.0,"Used and New atv’s , go karts and dirt bikes f...",unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,4
2,8795.0,auburn,https://auburn.craigslist.org/fod/d/fairburn-2...,2022-04-16 14:51,2017 Club Car Precedent 4 Seater Gas Alabama C...,20.0,Very Nice 2017 Club Car Precedent EFI Gas Alab...,4,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
3,30.0,auburn,https://auburn.craigslist.org/bop/d/opelika-th...,2022-04-16 09:07,Thule 961XT Speedway Bike Strap Rear Rack Carrier,6.0,Thule 961XT Speedway - Bike Strap Rear Rack Ca...,unknown,1.0,1.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
4,112.0,auburn,https://auburn.craigslist.org/sgd/d/full-garag...,2022-04-15 12:52,"Full Garage Gym-Squat Rack, Dumbbells- Financi...",4.0,"Full Garage Gym Setup- Squat Rack, Adjustable ...",4,1.0,1.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,6


In [None]:
df.price.describe()

count      3437.000000
mean       4513.562118
std       24026.705163
min           1.000000
25%          75.000000
50%         275.000000
75%        2295.000000
max      999992.000000
Name: price, dtype: float64

The price distribution has a very large standard deviation (about $24,000, for a median of $275). The minimum of $1 corresponds to listings where the seller had to enter something in the price field in order to publish, but actually wants to negotiate the price. The very high prices (up to $999,992) are presumably placeholders of the same kind. In such cases the description may contain the real price, which is what we look at in the next section.
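As a rough illustration of why a handful of placeholder prices is enough to distort the summary statistics, here is a minimal sketch using only the standard library. The five "plausible" prices are made up for the example; only the two placeholder values ($1 and $999,992) come from the `describe()` output above.

```python
from statistics import mean, median, stdev

# Made-up prices standing in for ordinary listings (illustrative assumption).
plausible = [75, 120, 275, 400, 2295]

# The minimum and maximum reported by describe() above, treated as placeholders.
with_placeholders = plausible + [1, 999992]

for label, values in [("plausible only", plausible),
                      ("with placeholders", with_placeholders)]:
    # The median barely moves, while the mean and standard deviation explode.
    print(f"{label}: mean={mean(values):.0f}, "
          f"median={median(values)}, std={stdev(values):.0f}")
```

The median stays the same in both cases, which is why it is a more reliable first indicator than the mean when placeholder prices are suspected.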

Let's filter the data to isolate the listings with incoherent prices. The thresholds used here (below $10 or above $3,000) are arbitrary.

In [None]:
df_filter = df[(df['price']>3000) | (df['price']<10)]
print(f'The number of listings with an incoherent price is: {len(df_filter)}')
df_filter

The number of listings with an incoherent price is: 922


Unnamed: 0,price,location,url,date,title,numimage,text,condition,makemanufacturer,modelnamenumber,...,electricassist,framesize,handlebartype,suspension,wheelsize,sizedimensions,serialnumber,paintcolor,yearmanufactured,days
1,1.0,auburn,https://auburn.craigslist.org/for/d/defuniak-s...,2022-04-17 13:32,Atv’s,23.0,"Used and New atv’s , go karts and dirt bikes f...",unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,4
2,8795.0,auburn,https://auburn.craigslist.org/fod/d/fairburn-2...,2022-04-16 14:51,2017 Club Car Precedent 4 Seater Gas Alabama C...,20.0,Very Nice 2017 Club Car Precedent EFI Gas Alab...,4,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
9,1.0,auburn,https://bham.craigslist.org/bik/d/springville-...,2022-04-18 09:36,Mountain bikes / children bikes,18.0,I have a Vertical PK7 21 speed mountain bike w...,unknown,0.0,0.0,...,0.0,3,unknown,0.0,unknown,0,0,0,0,3
27,4800.0,auburn,https://atlanta.craigslist.org/sat/mcy/d/lagra...,2022-04-15 19:58,2009/2014 Kawasaki Bikes,12.0,Blue-2009. Asking 3800.00 Damage on light. No ...,unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
40,1.0,auburn,https://atlanta.craigslist.org/atl/bik/d/atlan...,2022-04-05 08:38,Multiple bikes for sale,3.0,multiple bikes for sale - cannondale R600 (60...,unknown,0.0,0.0,...,0.0,3,unknown,0.0,29,0,0,0,0,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3355,1.0,seattle,https://seattle.craigslist.org/sno/bik/d/lynnw...,2022-04-18 12:15,6 Ladies bikes-$50 to $75,0.0,"Roadmaster Mountain Sport, 18 speed, 24"" tires...",3,1.0,0.0,...,0.0,unknown,unknown,0.0,26,0,0,0,0,3
3356,3750.0,seattle,https://seattle.craigslist.org/skc/mpo/d/black...,2022-04-18 11:34,KTM Dirt Bikes,9.0,2017 KTM 85 SX less than 5 hrs $5000 2012 KT...,unknown,1.0,1.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,3
3382,1.0,seattle,https://seattle.craigslist.org/see/wan/d/seatt...,2022-04-15 10:52,CASH for wrecked or damaged sport bikes,0.0,I am in the market to purchase wrecked or dama...,unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,6
3418,5.0,seattle,https://seattle.craigslist.org/oly/tag/d/olymp...,2022-04-04 11:06,Motormax Super Bikes Kawasaki 1/24 scale,0.0,Motormax Super Bikes Kawasaki motorcycle I c...,4,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,17


In [None]:
print(f"The price for this listing on Craigslist was: ${df_filter.loc[3423, 'price']}")
print(f"Here is the description associated with this listing: {df_filter.loc[3423, 'text']}")

The price for this listing on Craigslist was: $1.0
Here is the description associated with this listing: SCHWINN CLASSIC BIKES, SELLING AS A PAIR. ONLY $100 FOR BOTH!


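Let's use a regular expression to extract the price from the listing description. When a description contains several dollar amounts, we take their mean (truncated to an integer); when it contains none, we set the price to NaN.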

In [None]:
#Dollar amounts such as $100, $ 100 or $1,250
price_pattern = re.compile(r'\$(\x20?\d+(?:,\d{0,3})?)')

index_list_final = list()
new_price_list = list()

for i in df.index:
  if i in df_filter.index:
    try:
      matches = price_pattern.findall(df_filter['text'][i])
      new_price_int = [int(j.replace(',', '')) for j in matches]
    except (TypeError, ValueError):
      #The description is missing or an amount cannot be parsed
      new_price_int = []
    if new_price_int:
      #Several amounts may appear in one description: we keep their mean
      new_price_list.append(int(sum(new_price_int)/len(new_price_int)))
    else:
      #No usable amount: we set the price to NaN to eliminate these rows easily from our database
      new_price_list.append(float('nan'))
  else:
    #The listed price is considered coherent: we keep it
    new_price_list.append(df['price'][i])
  index_list_final.append(i)

df_new_price = pd.DataFrame({'price' : new_price_list, 'Index':index_list_final})
df_new_price.set_index('Index', inplace = True)
df_new_price

Index,price
0,1000.0
1,
2,2995.0
3,30.0
4,112.0
...,...
3432,75.0
3433,100.0
3434,110.0
3435,50.0
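To see what the pattern used above actually captures, here is a small standalone sketch that applies it to a few strings. The first string is the description printed earlier; the other two are made up for illustration, and `extract_mean_price` is a hypothetical helper name, not part of the notebook.

```python
import re

# Same pattern as in the extraction cell, written as a raw string.
PRICE_PATTERN = re.compile(r'\$(\x20?\d+(?:,\d{0,3})?)')

def extract_mean_price(text):
    """Return the integer mean of the dollar amounts in text, or NaN if none."""
    if not isinstance(text, str):
        return float('nan')
    amounts = [int(m.replace(',', '')) for m in PRICE_PATTERN.findall(text)]
    if not amounts:
        return float('nan')
    return int(sum(amounts) / len(amounts))

samples = [
    # Description of listing 3423, shown above.
    "SCHWINN CLASSIC BIKES, SELLING AS A PAIR. ONLY $100 FOR BOTH!",
    # Made-up example with two amounts, one with a space and one with a comma.
    "Two bikes: $ 50 for the small one, $1,250 for the large one",
    # Made-up example without any dollar amount.
    "Call for price",
]

for s in samples:
    print(extract_mean_price(s), "<-", s)
```

Note that averaging is a simplification: when a description lists several items at different prices, the mean does not correspond to any single item.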


In [None]:
#We update our dataframe with the new prices

df['price'] = df_new_price['price']
df.head()

Unnamed: 0,price,location,url,date,title,numimage,text,condition,makemanufacturer,modelnamenumber,...,electricassist,framesize,handlebartype,suspension,wheelsize,sizedimensions,serialnumber,paintcolor,yearmanufactured,days
0,1000.0,auburn,https://auburn.craigslist.org/mcd/d/fayettevil...,2022-04-05 15:32,"BAD CREDIT, NO CREDIT, OK! WE WORK WITH EVERYONE!",10.0,"WE SHIP NATIONWIDE, FINANCE NATIONWIDE! YOU SE...",unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,16
1,,auburn,https://auburn.craigslist.org/for/d/defuniak-s...,2022-04-17 13:32,Atv’s,23.0,"Used and New atv’s , go karts and dirt bikes f...",unknown,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,4
2,2995.0,auburn,https://auburn.craigslist.org/fod/d/fairburn-2...,2022-04-16 14:51,2017 Club Car Precedent 4 Seater Gas Alabama C...,20.0,Very Nice 2017 Club Car Precedent EFI Gas Alab...,4,0.0,0.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
3,30.0,auburn,https://auburn.craigslist.org/bop/d/opelika-th...,2022-04-16 09:07,Thule 961XT Speedway Bike Strap Rear Rack Carrier,6.0,Thule 961XT Speedway - Bike Strap Rear Rack Ca...,unknown,1.0,1.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,5
4,112.0,auburn,https://auburn.craigslist.org/sgd/d/full-garag...,2022-04-15 12:52,"Full Garage Gym-Squat Rack, Dumbbells- Financi...",4.0,"Full Garage Gym Setup- Squat Rack, Adjustable ...",4,1.0,1.0,...,0.0,unknown,unknown,0.0,unknown,0,0,0,0,6
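The loop and the assignment above can also be written without an explicit Python loop. The following is only a sketch on a made-up four-row dataframe (the `toy` data and the `mean_of` helper are assumptions for illustration); it uses a boolean mask with `.loc` and `Series.str.findall` to reproduce the same logic.

```python
import numpy as np
import pandas as pd

pattern = r'\$(\x20?\d+(?:,\d{0,3})?)'

# Made-up rows mimicking the 'price' and 'text' columns of the notebook.
toy = pd.DataFrame({
    'price': [1.0, 30.0, 8795.0, 1.0],
    'text': ['ONLY $100 FOR BOTH!', 'Rack for sale', 'Asking $2,995', 'Call me'],
})

# Same arbitrary thresholds as in the filtering cell.
suspicious = (toy['price'] > 3000) | (toy['price'] < 10)

def mean_of(matches):
    """Integer mean of the matched amounts, or NaN when nothing was matched."""
    if not isinstance(matches, list) or not matches:
        return np.nan
    values = [int(m.replace(',', '')) for m in matches]
    return int(sum(values) / len(values))

# Extract all amounts for suspicious rows only, then overwrite their price.
found = toy.loc[suspicious, 'text'].str.findall(pattern)
toy.loc[suspicious, 'price'] = found.map(mean_of)

print(toy)
```

Rows whose price was already within the thresholds are left untouched, exactly as in the `else` branch of the loop.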


In [None]:
df.price.describe()

count      2944.000000
mean       1632.134511
std        9448.964930
min           0.000000
25%          80.000000
50%         225.000000
75%        1000.000000
max      317422.000000
Name: price, dtype: float64

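After cleaning, the mean price is about $1,632, down from $4,514, and the standard deviation falls from about $24,027 to $9,449.
This is much more coherent with the price of bikes, although the maximum ($317,422) and the minimum ($0) show that some outliers remain.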

In [None]:
#Let's check how many rows with a usable (non-NaN) price remain in our dataframe

df.price.count() 

2944
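Since `count()` ignores NaN values, the number above is also the number of rows that would survive if the listings without a usable price were dropped. A minimal sketch of that step, on a made-up dataframe (the `toy` data is an assumption for illustration), could look like this:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the notebook's dataframe.
toy = pd.DataFrame({
    'price': [1000.0, np.nan, 2995.0, 30.0, np.nan],
    'title': ['a', 'b', 'c', 'd', 'e'],
})

# count() skips NaN, so it equals the number of rows kept by dropna().
n_valid = toy['price'].count()
toy_clean = toy.dropna(subset=['price']).reset_index(drop=True)

print(n_valid, len(toy_clean))
```

Using `subset=['price']` keeps rows that have missing values in other columns, so only the price decides whether a listing is discarded.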