# Lemon or Peach?

For those who have taken ECON 101, you should already be quite familar with the term **lemon car** that refers to a car found to be defective only after it has been bought. In 1970, economist George Akerlof published a paper "The Market for Lemons: Quality Uncertainty and the Market Mechanism" that explored how the quality of goods traded in a market can degrade in the presence of *information asymmetry* between buyers and sellers. This leads to the **adverse selection** problem that  sellers will sell only when they hold "lemons" and they will leave the market when they hold "peaches". As such, Adverse selection is a market mechanism that can lead to a market collapse. In 2001, Akerlof, along with Michael Spence, and Joseph Stiglitz, jointly received the Nobel Memorial Prize in Economic Sciences, for their research on issues related to asymmetric information.

![lemoncar](https://blog.drivetime.com/wp-content/uploads/2014/06/lemon-car.png)

Online information system creates platform for buyers and sellers to trade their goods with information about the goods. However, lemon problem becomes even worse as buyers have no way to verify and examine the goods (such as used cars) and have to rely on information posted by the sellers. 

In order to address this problem, large-scale historical purchase data would allow online retailers to alleviate the issue. For example, OLX group is a global online marketplace operating in 45 countries. The OLX marketplace is a platform for buying and selling services and goods such as electronics, fashion items, furniture, household goods, Properties, cars and bikes. Their business model is to charge sellers listing fees when they post the advertisement. Their business objective is to facilitate transactions on the platform such that buyers would be willing to repeately purchase quality goods with reasonable price and sellers would be willing to list more goods for profits. 

As the analyst, you are given the task **to examine the information about listings of used cars on OLX Portugal that a car is considered to be sold or not if the days of listing is within the 30 days.** Otherwise, the seller has to pay extra to relist the advertisement until it is sold or withdrawn. 



**OLX Car Dataset (olx_car_dataset.csv)**
All car listings are contained in the file olx_car_dataset.csv. Each line of this file after the header row represents one listing of car on the OLX platform, and has the following format:
**`'ID', 'region_id', 'private_business', 'price', 'make', 'model', 'fuel_type', 'mileage', 'reg_year', 'eng_capacity', 'color', 'capacity', 'dayslive'`**

The columns are quite self-explained. `dayslive` is the days of the listing on the OLX until the card is sold or withdrawn. 

In [1]:
import pandas as pd
import numpy as np

The dataset can store up to 10,000 car listings and therefore in total contains 3 parts. 

Meanwhile, each part of the data contains a few errors that need to be resolved:

- the ```price``` column of part1 is mistakenly encoded as ```prices``` 
- the ```region_id``` column of part 1 is all in the wrong sign, e.g. the region_id of 11 is wrongly encoded as -11
- the ```private_business``` column of part2 is mistakenly encoded as ```private```
- the ```mileage``` column of part2 is miscalculated by taking a logarithm of its original value
- the ```capacity``` column of part3 is wrongly set as missing value when the capacity is 5
- the ```model``` column of part3 is wrongly encoded with also the ```make``` columns, e.g. BMW 320 should have BMW in the column ```make``` and 320 in the column ```model```.

Read the datasets and concatenate them all into one dataset and show how many car listings are in the entire dataset?

In [91]:
cars_1 = pd.read_csv('cars_dataset_part1.csv', index_col = 0)
cars_1.drop(columns = 'Unnamed: 0.1', inplace = True)

cars_1.rename(columns = {'prices':'price'}, inplace=True) # 1.1
cars_1['region_id'] = cars_1['region_id'].abs() # 1.2


cars_2 = pd.read_csv('cars_dataset_part2.csv', index_col = 0)
cars_2.drop(columns = 'Unnamed: 0.1', inplace = True)

cars_2.rename(columns = {'private':'private_business'}, inplace=True) # 1.3
cars_2.mileage = round(cars_2.mileage * 10000, 0) # 1.4


cars_3 = pd.read_csv('cars_dataset_part3.csv', index_col = 0)
cars_3.drop(columns = 'Unnamed: 0.1', inplace = True)

cars_3[['capacity']] = cars_3[['capacity']].fillna(value = 5) # 1.5


dataframes = [cars_1, cars_2, cars_3]
df = pd.concat(dataframes)

Show the average price of all diesel cars

In [95]:
print(round(df.loc[df.fuel_type == 'diesel']['price'].mean(), 2))

17820.63


Show the median mileage of all eletric vehicles from region 13:

In [98]:
print(df.loc[(df.fuel_type == 'electric') & (df.region_id == 13)]['mileage'].median())

54000.0


Show the top 5 most popular car make listed by private owners and registered since 2017: 

In [166]:
pd.DataFrame(df.loc[(df.private_business == 'business') & (df.reg_year == 2017)]['make'].value_counts()).head(5)

Unnamed: 0,make
renault,553
nissan,328
mercedes-benz,310
peugeot,296
bmw,283


Show the difference of average price of diesel cars with mileage larger than 100,000 kms and smaller than 5,000 kms. What do you find? How about gaz cars? 

In [200]:
# diesel
p_diesel_over_100k = df.loc[(df.fuel_type == 'diesel') & (df.mileage > 100000)]['price'].mean()
p_diesel_under_5k = df.loc[(df.fuel_type == 'diesel') & (df.mileage < 5000)]['price'].mean()

print(f'''The difference of average price of diesel cars with mileage > 100,000 kms and < 5,000 kms is:
\tEUR {round(p_diesel_under_5k - p_diesel_over_100k, 1)}''')

# answer is below

The difference of average price of diesel cars with mileage > 100,000 kms and < 5,000 kms is:
	EUR 11859.4


In [202]:
# gaz
p_gaz_over_100k = df.loc[(df.fuel_type == 'gaz') & (df.mileage > 100000)]['price'].mean()
p_gaz_under_5k = df.loc[(df.fuel_type == 'gaz') & (df.mileage < 5000)]['price'].mean()

print(f'''The difference of average price of diesel cars with mileage > 100,000 kms and < 5,000 kms is:
\tEUR {round(p_gaz_under_5k - p_gaz_over_100k, 1)}''')

# answer is below

The difference of average price of diesel cars with mileage > 100,000 kms and < 5,000 kms is:
	EUR 13442.0


**ANSWER:**
Mileage has a greater impact on price on gaz cars than on diesel cars, since the difference in average price is greater. In other words, gaz cars devaluate faster.

For cars with the most popular color, how many of them are sold by business sellers in region 11 and with capacity of 5 passengers?

In [205]:
most_popular_color_number = df['color'].value_counts()

top_color = list(most_popular_color_number.index)[0]

# most popular color is white - 11 180 cars - stored in variable top_color

result = df.loc[(df.color == top_color) & (df.private_business == 'business') & (df.region_id == 11) & (df.capacity == 5)]['color'].value_counts()

print(f'''Out of {most_popular_color_number[0]} cars of the most popular color:
{result[0]} are sold by business sellers in region 11 and with capacity of 5 passengers''')

Out of 11180 cars of the most popular color:
2101 are sold by business sellers in region 11 and with capacity of 5 passengers


Which model is listed the most expensive? On average, how many days does this model stay on the listing?

In [195]:
most_expensive_model = df.sort_values(by='price', ascending=False).head(1)
make = list(most_expensive_model['make'])[0]
model = list(most_expensive_model['model'])[0]

avg_days_listed = df[(df.make == make) & (df.model == model)]['dayslive'].mean()

print(f'Most expensive make and model, respectively: {make} {model}.')
print(f'This model stays listed an average of {round(avg_days_listed)} days.')

Most expensive make and model, respectively: audi a6-avant.
This model stays listed an average of 62 days.


For cars listed in region of 11 and 13, can you identify the top 5 model and make of cars that have the highest sales (i.e. the listing days smaller or less than 30 days)?

In [181]:
a = df.loc[((df.region_id == 11) | (df.region_id == 13)) & (df.dayslive < 30)].sort_values('model')

df_8 = pd.DataFrame(a.groupby(a['model']).size().sort_values(ascending = False).head(5))
df_8.columns = ['count of cars']

df_8

Unnamed: 0_level_0,count of cars
model,Unnamed: 1_level_1
clio,203
ibiza,159
megane,146
fortwo,127
qashqai,119


Show the correlation between mileage, reg_year, eng_capacity and price. What do you observe? 

In [164]:
df[['mileage', 'reg_year', 'eng_capacity', 'price']].corr()

Unnamed: 0,mileage,reg_year,eng_capacity,price
mileage,1.0,-0.537396,0.145914,-0.122086
reg_year,-0.537396,1.0,-0.138543,0.148113
eng_capacity,0.145914,-0.138543,1.0,0.083119
price,-0.122086,0.148113,0.083119,1.0


**ANSWER:** There is a significant negative correlation between mileage and reg_year, meaning cars that have high mileages tend to have been listed earlier, and in the past years it is cars with lower mileages that are being listed.

Other than that, there are no relevant positive or negative correlations.

Show the pivot table of mean and standard deviation for car prices  across the type of sellers and region (as the row), and fuel_types (as the column). If there are no available cars in each category, set the value to be 0.

In [184]:
pivot_t = df.pivot_table(index = ['region_id','ID'], columns = 'fuel_type', values = 'price', aggfunc = 'mean')
pivot_t.fillna(0, inplace = True)

pivot_t

Unnamed: 0_level_0,fuel_type,diesel,electric,gaz,gpl,hibride-diesel,hibride-gaz
region_id,ID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,000dc445ec70114ac054e0d5bb3eb5212d497974,0.0,0.0,20500.0,0.0,0.0,0.0
0,0024e85227b5903203d5fb61e53065a7a99b6496,14600.0,0.0,0.0,0.0,0.0,0.0
0,005b23c5838a36ebafbad8142ca2a28cb7f48b84,21800.0,0.0,0.0,0.0,0.0,0.0
0,0073d137d9cc705aefce9d4992c19cf2d754fd09,25900.0,0.0,0.0,0.0,0.0,0.0
0,019603341aa89cd03b778f520bac00a4d15bd434,20700.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
43,344d3de2948c6428d42ccc1e24c2db12b9cde1cd,0.0,0.0,43862.0,0.0,0.0,0.0
43,475686a27d99abe184e6e5f26c7c6a24fe6369d4,0.0,0.0,0.0,0.0,0.0,5526125.0
43,f12c83cb987d79b6090f1896fdc6f57490746858,0.0,0.0,4800.0,0.0,0.0,0.0
45,f29d8127dd4061b7c653c8a8f18891ce4dedda77,22000.0,0.0,0.0,0.0,0.0,0.0
