# **Simple Data Exploration**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv('../input/train.tsv',sep='\t')
test = pd.read_csv('../input/test.tsv',sep='\t')
sub = pd.read_csv('../input/sample_submission.csv')

## Let's look at training data

In [None]:
print(train.columns)
print(train.shape)
train.head()

## 1. **Price -** Since we are predicting the price of an item, let's start with the price distribution, i.e. which price range (0-100, 100-200, etc.) is most common in the data.

In [None]:
plt.figure(figsize=(15, 10))
plt.hist(x = train['price'], bins=30,range = [1,350])
plt.show()

### From the histogram above, the majority of the products sold appear to be priced in the 0-50 range.
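
To put a number on that visual impression, one option is to bucket the prices and compute the share of items per bucket. The sketch below is only an illustration: the helper `price_bucket_shares`, the bucket edges and the toy prices are assumptions, not part of the original notebook. Prices outside the outermost edges are left out of the shares, much like the `range` argument of the histogram.

```python
import pandas as pd

def price_bucket_shares(prices, edges=(0, 50, 100, 200, 350)):
    """Return the fraction of in-range items that fall in each price bucket."""
    # right=False makes each bucket [low, high), so a price of 50 lands in 50-100.
    buckets = pd.cut(prices, bins=list(edges), right=False)
    # normalize=True turns counts into fractions; sort_index keeps bucket order.
    return buckets.value_counts(normalize=True).sort_index()

# Toy prices, made up purely for illustration (not taken from the dataset).
toy_prices = pd.Series([3, 10, 12, 25, 49, 60, 120, 300])
shares = price_bucket_shares(toy_prices)
print(shares)
```

On the real data this would be called as `price_bucket_shares(train['price'])`.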

## 2. **Item Condition:** - The description provided for this field is *the condition of the items provided by the seller*. How should we interpret it: is 1 the best and 5 the worst, or vice versa?

In [None]:
sns.countplot(x=train['item_condition_id'])

### Looking at the distribution above, it seems that condition deteriorates from 1 --> 5. To confirm this, let's look at condition against price; the expectation is that more items are sold where the condition is 1.

In [None]:
plt.scatter(x = train['price'],y = train['item_condition_id'],s=25, edgecolors='k')

### The chart above confirms our assumption. That is not to say that condition 1 items are the most sold items: as seen above, items with conditions 1-3 all have decent volumes, with some outliers.
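
A scatter plot with only five distinct y-values overplots heavily, so a small table of counts and typical prices per condition can be easier to read. The sketch below is one possible way to build it; the helper name `summarize_price_by_condition` and the toy frame are assumptions for illustration, and only the column names come from the notebook.

```python
import pandas as pd

def summarize_price_by_condition(df):
    """Count items and summarize price for each item_condition_id."""
    return (
        df.groupby('item_condition_id')['price']
          .agg(['count', 'median', 'mean'])  # volume plus two views of typical price
          .sort_index()
    )

# Toy rows, made up purely for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'item_condition_id': [1, 1, 1, 2, 2, 3, 5],
    'price': [10.0, 20.0, 30.0, 15.0, 25.0, 12.0, 8.0],
})
summary = summarize_price_by_condition(toy)
print(summary)
```

On the real data this would be `summarize_price_by_condition(train)`.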

## 2. **Brand Name:** - Let's explore the brand names in the data. Questions I am asking myself:
* How many brands are involved in the data?
* Which brands have the most sales?
* Which brands have the most expensive items?
* Brands vs Categories

### How many brands are involved in the data?

In [None]:
print(len(train['brand_name'].unique()))
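
One thing to keep in mind when counting brands: `unique()` returns a missing value (NaN) as one of its entries, while `nunique()` skips it by default. The sketch below shows the difference on a toy series (the values are made up for illustration) and also computes the share of rows with no brand at all.

```python
import numpy as np
import pandas as pd

# Toy brand column, made up purely for illustration (not taken from the dataset).
toy_brands = pd.Series(['Nike', 'PINK', np.nan, 'Nike', np.nan])

n_with_nan = len(toy_brands.unique())    # NaN is counted as one of the values
n_brands = toy_brands.nunique()          # NaN is excluded by default
missing_share = toy_brands.isna().mean() # fraction of rows without a brand

print(n_with_nan, n_brands, missing_share)
```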

### Which brand has the most sales?

In [None]:
train.groupby(['brand_name'])[['price']].sum().sort_values('price', ascending=False).head().plot(kind='barh',figsize=(15, 10))

### Nike seems to be the brand with the highest total sales (summed price), with PINK coming in second.
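
"Most sales" can mean the highest summed price or the largest number of items sold, and the two rankings need not agree. A minimal sketch that shows both side by side is given below; the helper `brand_sales_summary`, its output column names and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def brand_sales_summary(df, top_n=5):
    """Total price and number of items per brand, ranked by total price."""
    return (
        df.groupby('brand_name')['price']
          .agg(total_price='sum', n_items='count')
          .sort_values('total_price', ascending=False)
          .head(top_n)
    )

# Toy rows, made up purely for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'brand_name': ['Nike', 'Nike', 'Nike', 'PINK', 'PINK', 'Apple'],
    'price': [20.0, 30.0, 25.0, 15.0, 20.0, 70.0],
})
sales = brand_sales_summary(toy)
print(sales)
```

In the toy frame, a single expensive item is enough to rank a brand above one with more but cheaper items, which is why it helps to look at both columns.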

### Which brands have the most expensive items?

In [None]:
train.sort_values('price',ascending=False).head(100).groupby('brand_name').count()[['price']].sort_values('price',ascending = False).plot(kind='barh',figsize = (15,10))

### Of the 100 most expensive items, LV, Chanel and Apple together account for more than half.
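
Counting brands among the top 100 items is one way to answer the question. A complementary view, sketched below under assumptions, is the median price per brand, restricted to brands with a minimum number of listings so that a single extreme item does not dominate. The helper `priciest_brands`, the `min_items` threshold and the toy rows (including the hypothetical brand `OneOff`) are illustrative only.

```python
import pandas as pd

def priciest_brands(df, min_items=2, top_n=5):
    """Rank brands by median price, ignoring brands with too few items."""
    stats = df.groupby('brand_name')['price'].agg(['median', 'count'])
    stats = stats[stats['count'] >= min_items]  # drop brands with too few listings
    return stats.sort_values('median', ascending=False).head(top_n)

# Toy rows, made up purely for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'brand_name': ['LV', 'LV', 'Chanel', 'Chanel', 'Nike', 'Nike', 'Nike', 'OneOff'],
    'price': [500.0, 700.0, 400.0, 600.0, 20.0, 30.0, 40.0, 2000.0],
})
ranking = priciest_brands(toy)
print(ranking)
```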

## 3. **Category Names**: As seen in the brand name analysis, while there are tech companies in the mix, the brand with the highest sales was Nike and the brand with the most highly priced items was LV. A category deep dive should therefore give us interesting insight into what sort of things people buy the most on Mercari.

In [None]:
cat_list = [str(x) for x in train['category_name'].unique()]
cat_list.sort()
print(len(cat_list))

### As seen above, there are ~1.2k categories; each label is further divided into multiple, more detailed category levels separated by "/". Let's create two lists from these categories: 1. one based on the high-level category, i.e. the first segment only, and 2. one based on the last segment.

In [None]:
cat_list_1 = [str(x).split("/")[0] for x in train['category_name']]
print(set(cat_list_1))
print(len(set(cat_list_1)))

In [None]:
cat_list_2 = [str(x).split("/")[-1] for x in train['category_name']]
print(len(set(cat_list_2)))  # number of distinct last-level categories
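
The same split can also be done with pandas string methods instead of list comprehensions. The sketch below is one way to do it under a few assumptions: the helper `split_category`, the output column names and the `'missing'` placeholder are illustrative choices (the cells above turn a missing category into the string `'nan'` instead), and the toy labels are made up.

```python
import numpy as np
import pandas as pd

def split_category(cat, placeholder='missing'):
    """Split 'a/b/c' labels into their first and last segments."""
    # Replace missing labels first so that every row yields a list of parts.
    parts = cat.fillna(placeholder).str.split('/')
    return pd.DataFrame(
        {'top_level': parts.str[0], 'leaf_level': parts.str[-1]},
        index=cat.index,
    )

# Toy labels, made up purely for illustration (not taken from the dataset).
toy_cat = pd.Series([
    'Women/Tops & Blouses/Blouse',
    'Men/Tops/T-shirts',
    np.nan,
    'Electronics/Video Games & Consoles/Games',
])
levels = split_category(toy_cat)
print(levels)
print(levels['top_level'].nunique())
```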

In [None]:
train['category_detail_name'] = cat_list_1

In [None]:
# Select 'price' before summing so that the text columns are not aggregated
train.groupby('category_detail_name')[['price']].sum().head(10).sort_values('price',ascending = False).plot(kind = 'pie',y='price',figsize = (10,10))

### This doesn't seem to stack up with the earlier chart: Nike was the highest-selling brand, so the expectation was to see a sports apparel related category on top, but the highest-selling category above seems to be Women.

In [None]:
train[train['brand_name']=='Nike'].groupby(['category_detail_name'])[['price']].sum().sort_values('price',ascending = False)

In [None]:
train[(train['brand_name']=='Nike') & (train['category_detail_name'] == 'Women')].head(10)

### This makes sense: it seems most of the Nike items sold were women's sports apparel.
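
For the "Brands vs Categories" question, the same drill-down can be done for several brands at once with a pivot table. The sketch below is a rough illustration: the helper `brand_category_totals` and the toy rows are assumptions, and only the column names `brand_name`, `category_detail_name` and `price` come from the cells above.

```python
import pandas as pd

def brand_category_totals(df, brands):
    """Summed price per (brand, high-level category) for the given brands."""
    subset = df[df['brand_name'].isin(brands)]
    return subset.pivot_table(
        index='brand_name',
        columns='category_detail_name',
        values='price',
        aggfunc='sum',
        fill_value=0,  # brand/category pairs with no items show as 0
    )

# Toy rows, made up purely for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'brand_name': ['Nike', 'Nike', 'Nike', 'PINK', 'Apple'],
    'category_detail_name': ['Women', 'Women', 'Men', 'Women', 'Electronics'],
    'price': [30.0, 20.0, 25.0, 15.0, 70.0],
})
totals = brand_category_totals(toy, ['Nike', 'PINK'])
print(totals)
```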

### To be continued