# Content
1. [Imports](#1.-Imports)
1. [Functions](#2.-Functions)
1. [Loading Data](#3.-Loading-Data)
1. [Data Understanding](#4.-Data-Understanding)
  1. [Data Description](#4.1.-Data-Description)
  1. [Univariate Analysis](#4.2.-Univariate-Analysis)
    1. [Item condition](#4.2.1.-Item-condition)
    1. [Brand name](#4.2.2.-Brand-name)
    1. [Shipping](#4.2.3.-Shipping)
    1. [Price](#4.2.4.-Price)
    1. [Category name](#4.2.5.-Category-name)
       1. Extracting categories
       1. Category 1
       1. Category 2
       1. Category 3
       1. Category 4
       1. Category 5
    1. [Listing name and Item description](#4.2.6.-Listing-name-and-Item-description)
  1. [EDA](#4.3.-EDA)
    1. [What are the relationship between the items' conditions and their price ?](#4.3.1.-What-are-the-relationship-between-the-items'-conditions-and-their-price-?)
    1. [What are the most expensive brands ?](#4.3.2.-What-are-the-most-expensive-brands-?)
    1. [Does the shipping fee influence the price ?](#4.3.3.-Does-the-shipping-fee-influence-the-price-?)
    1. [What are the most expensive listing categories ?](#4.3.4.-What-are-the-most-expensive-listing-categories-?)
       1. Category 1
       1. Category 2
       1. Category 3
       1. Category 4
       1. Category 5
    1. [Which listings are concerned by each main category ?](#4.3.5.-Which-listings-are-concerned-by-each-main-category-?)

# 1. Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk

%matplotlib inline

In [None]:
sns.set(style='darkgrid')

# 2. Functions

In [None]:
def data_description(df):
    """
    Returns a dataframe with some information about the variables of the input dataframe.
    """
    data = pd.DataFrame(index=df.columns)
    
    # the numeric data types
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

    for i in data.index:
        data.loc[i, 'count'] = df[i].count()
        data.loc[i, 'missing values'] = df[i].shape[0] - df[i].count()
        data.loc[i, 'unique values'] = len(df[i].unique())
        data.loc[i, 'type'] = df[i].dtypes
        
        # if the type is numeric compute statistical properties
        if df[i].dtypes in numerics: 
            data.loc[i, 'mean'] = df[i].mean()     
            data.loc[i, 'std'] = df[i].std()
            data.loc[i, 'min'] = df[i].min()
            data.loc[i, '25%'] = df[i].quantile(0.25)
            data.loc[i, 'median'] = df[i].quantile(0.5)
            data.loc[i, '75%'] = df[i].quantile(0.75)
            data.loc[i, 'max'] = df[i].max()
        else:
            count = df[i].str.count('[a-zA-Z]+')
            # mean, std, quartiles,  min and max of the number of words
            data.loc[i, 'mean_w'] = count.mean()
            data.loc[i, 'std_w'] = count.std()
            data.loc[i, 'min_w'] = count.min()
            data.loc[i, '25%_w'] = count.quantile(0.25)
            data.loc[i, 'median_w'] = count.quantile(0.5)
            data.loc[i, '75%_w'] = count.quantile(0.75)
            data.loc[i, 'max_w'] = count.max()
            
    return data.transpose()

def countplot(x, data, figsize=(10,5)):
    """
    Wraps seaborn's countplot function and allows specifying the size of the figure.
    """ 
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    sns.countplot(x=x, data=data, ax=ax, order=data[x].value_counts().index)
    for tick in ax.get_xticklabels():
        tick.set_rotation(90)
          
def subplots(x, y, z, data, hue=None, showfliers=False, figsize=(16,5)):
    """
    Bar plots and box plots of y and z against x. Wraps seaborn's barplot and boxplot functions.
    """ 
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
    sns.barplot(x=x, y=y, data=data, order=data[x].value_counts().index, hue=hue, ax=ax1)
    sns.boxplot(x=x, y=y, data=data, order=data[x].value_counts().index, hue=hue, ax=ax2, showfliers=showfliers)
    for tick1, tick2 in zip(ax1.get_xticklabels(), ax2.get_xticklabels()):
        tick1.set_rotation(90)
        tick2.set_rotation(90)
        
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)    
    sns.barplot(x=x, y=z, data=data, order=data[x].value_counts().index, hue=hue, ax=ax1)
    sns.boxplot(x=x, y=z, data=data, order=data[x].value_counts().index, hue=hue, ax=ax2, showfliers=showfliers)
    for tick1, tick2 in zip(ax1.get_xticklabels(), ax2.get_xticklabels()):
        tick1.set_rotation(90)
        tick2.set_rotation(90)

# 3. Loading Data

In [None]:
# training set
train = pd.read_csv('../input/train.tsv', sep='\t')
print(train.shape)

# test set
test = pd.read_csv('../input/test.tsv', sep='\t')
print(test.shape)

# 4. Data Understanding

## 4.1. Data Description

In [None]:
train.sample(10)

The dataset is relatively small in terms of number of columns.

On the one hand, __item_condition_id__, __brand_name__ and __shipping__ look like categorical variables. __category_name__ groups the category and sub-categories of each listing, so more categorical variables can be extracted from it.

On the other hand, __name__ and __item_description__ are free-text variables, and __item_description__ has more content than __name__. A lot of information can be extracted from these two variables.

The main difficulty will be extracting enough information from these texts to predict the price, which is a continuous variable.

In [None]:
data_description(train)

In [None]:
data_description(test)

### Missing values
Only __category_name__, __brand_name__ and __item_description__ have missing values.

### Unique values
There are many (4810) brands. Some of them must be rare.
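
How many brands are rare can be checked by counting listings per brand. The following is a minimal sketch on a toy series standing in for `train['brand_name']`; the brand values and the `min_listings` threshold are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy stand-in for train['brand_name']; the values are illustrative only
brands = pd.Series(['Nike', 'Nike', 'Nike', 'Apple', 'Apple', 'Acme', None])

# Number of listings per brand (missing values are ignored by value_counts)
brand_counts = brands.value_counts()

# Hypothetical rarity threshold: a brand is "rare" if it has fewer listings than this
min_listings = 2
rare_brands = brand_counts[brand_counts < min_listings]

print(len(rare_brands), 'rare brand(s) out of', len(brand_counts))
```

On the real data, the same two lines applied to `train['brand_name']` would show how much of the brand vocabulary sits below a chosen threshold.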

### Distribution
The average price is around \$26.7 and the maximum price is \$2009, but the 3rd quartile is only \$29. We can conclude that most listings have relatively low prices.

### Text 
The textual information is quite rich, with few duplicates. On average, __name__ and __item_description__ contain about 4 and 25 words respectively.  
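
The word counts come from the pattern `'[a-zA-Z]+'` used in `data_description`. Below is a small sketch of how that pattern behaves, on toy strings that are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for a free-text column; the last entry mimics a missing value
text = pd.Series(['Nike running shoes', 'iPhone 6 case', None])

# Count runs of letters, the same pattern used in data_description
n_words = text.str.count('[a-zA-Z]+')

print(n_words)
print(n_words.mean())
```

Note that purely numeric tokens such as `6` are not counted as words by this pattern, and missing values are skipped when averaging.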


## 4.2. Univariate Analysis

Here we visualize the distributions of the main variables.

### 4.2.1. Item condition

In [None]:
countplot('item_condition_id', train, figsize=(8,4))

The count decreases as the id increases. This suggests that the item condition may be an ordinal variable rather than a purely categorical one. A higher condition id might mean a better condition.
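
If the condition is treated as ordinal, pandas can encode the order explicitly. This is a sketch under the assumption that the ids run from 1 to 5 and that their numeric order is the meaningful one; the sample values are made up.

```python
import pandas as pd

# Toy stand-in for train['item_condition_id'] (ids assumed to run from 1 to 5)
condition = pd.Series([1, 3, 2, 1, 5, 4, 1, 2])

# Encode the ids as an ordered categorical so sorting and comparisons respect the order
levels = [1, 2, 3, 4, 5]
condition_ord = pd.Series(pd.Categorical(condition, categories=levels, ordered=True))

# Counts listed in id order rather than by frequency
counts_by_id = condition_ord.value_counts().sort_index()
print(counts_by_id)
```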

### 4.2.2. Brand name

In [None]:
values = train['brand_name'].value_counts()
print(values)
countplot('brand_name', train[train['brand_name'].isin(values.index[0:50])] , figsize=(20,5))

Judging by the most frequent brands, most listings appear to be clothes, electronic products and video games.

### 4.2.3. Shipping

In [None]:
plt.figure(figsize=(6,5))
sns.countplot(x='shipping', data=train)

### 4.2.4. Price

In [None]:
# we plot the distribution of price
plt.figure(figsize=(10,5))
sns.distplot(train['price'], kde=False)

Here we verify that most prices are under \$100.

In [None]:
# distribution of g = log(1+price)   (price=exp(g)-1)
# price = 0 <=> log(1+price) = 0
# this transformation might be useful later
plt.figure(figsize=(10,5))
train['log_price'] = np.log1p(train['price'])
sns.distplot(train['log_price'], kde=False)
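
The relation price = exp(g) - 1 stated in the comments can be checked numerically. A minimal sketch, assuming NumPy's `log1p` and `expm1` as the transformation pair; the prices are illustrative.

```python
import numpy as np

# Illustrative prices, including 0 to show that it maps to 0
price = np.array([0.0, 10.0, 26.7, 2009.0])

# g = log(1 + price)
log_price = np.log1p(price)

# price = exp(g) - 1 recovers the original values
recovered = np.expm1(log_price)

print(log_price)
print(np.allclose(recovered, price))
```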

### 4.2.5. Category name

First, we need to extract the sub-category names.

#### Extracting categories

In [None]:
train['category_name'].str.contains('/').fillna(False).value_counts()

In [None]:
test['category_name'].str.contains('/').fillna(False).value_counts()

The number of listings without a slash in their category equals the number of missing values. We can deduce that a slash is always present when the category is given, i.e. "/" is the delimiter used to separate the sub-categories. 
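
The comparison between missing values and categories without a slash can be made explicit. This is a sketch on a toy series standing in for `train['category_name']`; the category strings are hypothetical.

```python
import pandas as pd

# Toy stand-in for train['category_name']; None mimics a missing category
category_name = pd.Series(['Men/Tops/T-shirts', 'Women/Jewelry/Necklaces', None,
                           'Electronics/Media/Other'])

n_missing = category_name.isna().sum()

# na=False makes missing values count as "no slash"
has_slash = category_name.str.contains('/', regex=False, na=False)
n_without_slash = (~has_slash).sum()

print(n_missing, n_without_slash)
```

When the two numbers match, every non-missing category contains at least one slash.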

In [None]:
# How many sub-categories in the training set
(train['category_name'].str.count('/')+1).value_counts()

In [None]:
# How many sub-categories in the test set
(test['category_name'].str.count('/')+1).value_counts()

Most listings have 3 category levels.

In [None]:
# Extract the categories: split on '/' and take the k-th level (missing when absent)
levels = train['category_name'].str.split('/')
for k in range(5):
    train['category_%d' % (k + 1)] = levels.str[k]
    
print(train[['category_1','category_2','category_3','category_4','category_5']].count(axis=0))
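
Another way to obtain the five columns is `str.split` with `expand=True`. The sketch below uses a toy series with hypothetical category strings; `n=4` and the `reindex` call are choices made here so that exactly five columns come out whatever the data contains.

```python
import pandas as pd

# Toy stand-in for train['category_name'] with 3 levels, a missing value and 5 levels
category_name = pd.Series(['Men/Tops/T-shirts', None, 'A/B/C/D/E'])

# n=4 caps the number of splits, so at most 5 columns are produced
parts = category_name.str.split('/', n=4, expand=True)

# Guarantee exactly 5 columns even if no row has 5 levels
parts = parts.reindex(columns=range(5))
parts.columns = ['category_%d' % (k + 1) for k in range(5)]

# Non-missing values per level
print(parts.count())
```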

We can now compute and visualize the distributions.

#### Category 1

In [None]:
countplot('category_1', train, figsize=(10,5))

Most listings concern women.

#### Category 2

In [None]:
countplot('category_2', train, figsize=(30,5))

#### Category 3

In [None]:
values = train['category_3'].value_counts()
print(values)
countplot('category_3', train[train['category_3'].isin(values.index[0:50])] , figsize=(20,5))

#### Category 4

In [None]:
countplot('category_4', train, figsize=(5,5))

#### Category 5

In [None]:
countplot('category_5', train, figsize=(6,5))

### 4.2.6. Listing name and Item description


In [None]:
# randomly print 10 names
for i in range(10):
    print(train['name'].sample(1).iloc[0])
    print()

In [None]:
# randomly print 10 descriptions
for i in range(10):
    print(train['item_description'].sample(1).iloc[0])
    print()

These two fields are free text. Before constructing any feature from them, we look at which kinds of words are present. 

These are some of the part-of-speech tags used by [nltk](http://www.nltk.org/book/ch05.html) for the English language:
- DT: determiner
- JJ: adjective
- NN: singular common noun
- NNS: plural common noun
- IN: preposition
- VBD: past tense verb
- VBZ: present tense verb, 3rd person singular
- VBP: present tense verb, not 3rd person singular
- PRP: personal pronoun
- RB: adverb

Let's see which tags are most used for each textual variable.
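
To make the counting step concrete, here is a sketch that tallies tags from a list of (word, tag) pairs, the shape returned by `nltk.pos_tag`. The pairs are written by hand for illustration and are not actual tagger output; `collections.Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape returned by nltk.pos_tag;
# the tags are illustrative, not actual tagger output
tagged = [('new', 'JJ'), ('leather', 'NN'), ('boots', 'NNS'),
          ('in', 'IN'), ('great', 'JJ'), ('condition', 'NN')]

# Frequency of each tag, ignoring the words themselves
tag_freq = Counter(tag for word, tag in tagged)
print(tag_freq.most_common())
```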

In [None]:
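# Compute the number of occurrences of each tag in a random sample of names
# nltk.pos_tag expects a list of tokens, so the sampled names are split on whitespace first
names_tokens = ' '.join(train['name'].sample(10000)).split()
names_tags = nltk.pos_tag(names_tokens)
names_tags_freq = pd.Series(nltk.FreqDist(tag for (word, tag) in names_tags))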

In [None]:
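# Compute the number of occurrences of each tag in a random sample of item descriptions
# missing descriptions are dropped before sampling; tokens are obtained by splitting on whitespace
descrip_tokens = ' '.join(train['item_description'].dropna().sample(10000)).split()
descrip_tags = nltk.pos_tag(descrip_tokens)
descrip_tags_freq = pd.Series(nltk.FreqDist(tag for (word, tag) in descrip_tags))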

In [None]:
# reorder the series by decreasing frequency
names_tags_freq = names_tags_freq.sort_values(ascending=False)
descrip_tags_freq = descrip_tags_freq.sort_values(ascending=False)

In [None]:
plt.figure(figsize=(15,5))
plt.bar(range(names_tags_freq.shape[0]), names_tags_freq, tick_label=names_tags_freq.index)

In [None]:
plt.figure(figsize=(15,5))
plt.bar(range(descrip_tags_freq.shape[0]), descrip_tags_freq, tick_label=descrip_tags_freq.index)

These distributions suggest that we should focus on extracting information from proper and common nouns, adjectives and verbs. We can now look at the most frequent words.

In [None]:
# Generate a word cloud image for name
wordcloud = WordCloud().generate((train['name'].sample(100000) + ' ').sum())
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

In [None]:
# Generate a word cloud image for item description
wordcloud = WordCloud().generate((train['item_description'].sample(20000) + ' ').sum())
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

The word clouds also suggest that the most frequent words are nouns and adjectives.
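
A word cloud is hard to read precisely, so a plain frequency table can complement it. This is a minimal sketch on toy names that are not taken from the dataset; lowercasing and whitespace splitting are simplifying assumptions.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for train['name']
names = pd.Series(['Pink Nike shorts', 'Nike shoes size 8', 'pink dress'])

# Lowercase, split on whitespace and flatten into one list of tokens
tokens = [tok for name in names.dropna() for tok in name.lower().split()]

# Most frequent tokens with their counts
word_freq = Counter(tokens)
print(word_freq.most_common(5))
```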

## 4.3. EDA

### 4.3.1. What are the relationship between the items' conditions and their price ? 


In [None]:
subplots('item_condition_id', 'price', 'log_price', train, hue='shipping', showfliers=True)

The average price increases with the condition id. Higher ids probably mean better conditions.
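
The trend read from the plots can also be put into numbers with a grouped summary. The sketch below uses a toy frame whose prices are arbitrary and do not reflect the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Toy frame with the notebook's column names; the prices are arbitrary
df = pd.DataFrame({
    'item_condition_id': [1, 1, 2, 2, 3, 3],
    'price': [10.0, 20.0, 12.0, 30.0, 8.0, 50.0],
})

# Count, mean and median price per condition id
summary = df.groupby('item_condition_id')['price'].agg(['count', 'mean', 'median'])
print(summary)
```

Reporting the median next to the mean is useful here because the price distribution is strongly skewed.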

### 4.3.2. What are the most expensive brands ?

In [None]:
results = train.groupby('brand_name').price.agg(['count','mean'])
results = results[results['count']>1000].sort_values(by='mean', ascending=False)
results.head(30)

We find luxury brands among the most expensive ones, but we also find shoe, electronics and sports brands.

In [None]:
values = train['brand_name'].value_counts().index[0:30]
subplots('brand_name', 'price', 'log_price', train[train['brand_name'].isin(values)], hue='shipping', 
         showfliers=True, figsize=(20,5))

We can see that prices are higher when the shipping fees are paid by the buyer.
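
The gap between the two shipping groups can be tabulated per brand. This is a sketch on a toy frame with placeholder brands and arbitrary prices; the column name `diff_0_minus_1` is introduced here for illustration, and no assumption is made about which shipping value means the buyer pays.

```python
import pandas as pd

# Toy frame with the notebook's column names; brands and prices are placeholders
df = pd.DataFrame({
    'brand_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'shipping':   [0, 0, 1, 0, 1, 1],
    'price':      [30.0, 50.0, 20.0, 12.0, 8.0, 10.0],
})

# Mean price per brand (rows) and shipping flag (columns)
pivot = df.pivot_table(index='brand_name', columns='shipping',
                       values='price', aggfunc='mean')

# Difference between the mean prices of the two shipping groups
pivot['diff_0_minus_1'] = pivot[0] - pivot[1]
print(pivot)
```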

### 4.3.3. Does the shipping fee influence the price ?

In [None]:
subplots('shipping', 'price', 'log_price', train, figsize=(10,3))

### 4.3.4. What are the most expensive listing categories ?

#### Category 1

In [None]:
subplots('category_1', 'price', 'log_price', train, hue='shipping', figsize=(16,5))

In [None]:
subplots('category_1', 'price', 'log_price', train, hue='item_condition_id', figsize=(20,5))

#### Category 2

In [None]:
results = train.groupby('category_2').price.agg(['count','mean'])
results = results[results['count']>1000].sort_values(by='mean', ascending=False)
results.head(30)

The most expensive listings are electronic products and fashion accessories.

In [None]:
values = train['category_2'].value_counts().index[0:30]
subplots('category_2', 'price', 'log_price', train[train['category_2'].isin(values)], figsize=(20,5))

#### Category 3

In [None]:
values = train['category_3'].value_counts().index[0:30]
subplots('category_3', 'price', 'log_price', train[train['category_3'].isin(values)], figsize=(20,5))

#### Category 4

In [None]:
subplots('category_4', 'price', 'log_price', train, figsize=(10,3))

#### Category 5

In [None]:
subplots('category_5', 'price', 'log_price', train, figsize=(10,3))

### 4.3.5. Which listings are concerned by each main category ?

In [None]:
fig, axes = plt.subplots(10, 2, figsize=(16,40))
cats = train['category_1'].value_counts().index
for i in range(10):
    ax1, ax2 = axes[i]    
    cat = cats[i]
    wordcloud1 = WordCloud().generate((train[train['category_1']==cat]['name'].sample(10000) + ' ').sum())
    wordcloud2 = WordCloud().generate((train[train['category_1']==cat]['item_description'].sample(10000) + ' ').sum())
    ax1.imshow(wordcloud1, interpolation='bilinear')
    ax2.imshow(wordcloud2, interpolation='bilinear')
    ax1.axis("off")
    ax2.axis("off")
    ax1.set_title('Most frequent words of name for category ' + cat)
    ax2.set_title('Most frequent words of the description for category ' + cat)

Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.