# Flipkart E-commerce Data of various products

# This project was made by Kaushik Ghosh, @kgkaushik on GitHub

### Context
This is a pre-crawled dataset, taken as a subset of a bigger dataset (more than 5.8 million products) that was created by extracting data from Flipkart.com, a leading Indian eCommerce store.

### Content
This dataset has the following fields:

- product_url
- product_name
- product_category_tree
- pid
- retail_price
- discounted_price
- image
- is_FK_Advantage_product
- description
- product_rating
- overall_rating
- brand
- product_specifications

### Acknowledgements
This dataset was created by PromptCloud's in-house web-crawling service.

### Inspiration
The dataset supports analyses of pricing, product specifications and brands.

In [2]:
import numpy as np
import pandas as pd


In [3]:
df = pd.read_csv('flipkart.csv')

In [3]:
df.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


In [4]:
df.describe()

Unnamed: 0,retail_price,discounted_price
count,19922.0,19922.0
mean,2979.206104,1973.401767
std,9009.639341,7333.58604
min,35.0,35.0
25%,666.0,350.0
50%,1040.0,550.0
75%,1999.0,999.0
max,571230.0,571230.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(12)
memory usage: 1.2+ MB
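The non-null counts reported by `df.info()` can also be turned into explicit missing-value counts. The snippet below is a minimal sketch on a tiny made-up frame (the name `sample` and its values are assumptions, not rows from `flipkart.csv`); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values, not taken from flipkart.csv)
sample = pd.DataFrame({
    "product_name": ["A Shirt", "B Mug", "C Ring", "D Sofa"],
    "retail_price": [999.0, np.nan, 430.0, 220.0],
    "brand": ["A", None, None, "D"],
})

# Number of missing values per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)

# Share of missing values per column, as a percentage
missing_pct = (sample.isna().mean() * 100).round(1)

print(missing_counts)
print(missing_pct)
```

For the real data, the `info()` output above implies 20000 - 14136 = 5864 missing values in `brand`.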


## Around 6000 missing brand names

Checking how the existing brand names relate to pid, since a first look suggested that products of the same brand share the first seven characters of their pid.

In [5]:
dfnew = df.copy()

In [6]:
dfnew.drop(['overall_rating'], axis = 1, inplace = True)

In [7]:
# Observed pattern: the first 7 characters of pid appear to correspond to the brand name
def first_seven(pid):
    """Return the first seven characters of a product id."""
    return pid[:7]


In [8]:
dfnew['newid'] = dfnew['pid'].apply(first_seven)


In [9]:
#dfnew.drop(['pid'],axis=1,inplace=True)
dfnew


Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,brand,product_specifications,newid
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati...",SBEEH3Q
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""...",SHOEH4G
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",...",PSOEH3Z
5,c2a17313954882c1dba461863e98adf2,2016-03-25 22:59:23 +0000,http://www.flipkart.com/eternal-gandhi-super-s...,Eternal Gandhi Super Series Crystal Paper Weig...,"[""Eternal Gandhi Super Series Crystal Paper We...",PWTEB7H2E4KCYUE3,430.0,430.0,"[""http://img5a.flixcart.com/image/paper-weight...",False,Key Features of Eternal Gandhi Super Series Cr...,No rating available,Eternal Gandhi,"{""product_specification""=>[{""key""=>""Model Name...",PWTEB7H
6,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
7,8542703ca9e6ebdf6d742638dfb1f2ca,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGYGHFUEXN,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/e/x...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati...",SBEEH3Q
8,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals","[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",False,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",...",SHOEH3D
9,4044c0ac52c1ee4b28777417651faf42,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVUHAAVH9X,1199.0,479.0,"[""http://img5a.flixcart.com/image/short/5/z/c/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F


In [10]:
dfnew['newid'].value_counts()

CRTECN2    464
ACCE9Y6     97
TSHE993     87
RNGE7M9     81
RNGEDAK     76
MUGEAGG     72
STIE7KF     65
ACCE9DJ     62
RNGE7GV     61
CRNEE84     58
STIE9F5     55
MUGEACY     54
STIEYZ5     52
RNGEDAP     52
ACCE6SJ     43
NKCDU4R     43
ACCE6GF     42
ACCDVHJ     40
BRAE3TS     40
CRNEEAB     37
BLAEAWA     36
NKCE9PF     36
MUGEBFG     35
PCSEC86     34
SNDEAN4     33
ACCEAZC     33
BRAEBBM     32
ACCE9DK     32
ACCDUV7     31
BRAEGJF     31
          ... 
CRNEYW6      1
RUGEJ3N      1
TROEJYT      1
PBXEBWY      1
JCKE2AF      1
SPMDHEZ      1
SHTDVZK      1
KLSEJ3V      1
TSHE62Q      1
SHOEG52      1
TPMEE6Z      1
TSHE7FU      1
NKCDXH8      1
BRAE8NW      1
KRTEE8S      1
ACCEJR2      1
FSEEGBR      1
KLSEGUA      1
BBOEGGV      1
USGEDST      1
SHIEB59      1
JWSEFBQ      1
PTGECWD      1
BRAEA6Z      1
CAGE9B7      1
LBXEHH5      1
TSHE4CD      1
SHIEB65      1
SWTEA38      1
NKCEGZZ      1
Name: newid, Length: 9253, dtype: int64

In [11]:
dfnewnodup = dfnew.drop_duplicates()

In [12]:
dfnewnodup

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,brand,product_specifications,newid
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati...",SBEEH3Q
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""...",SHOEH4G
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",...",PSOEH3Z
5,c2a17313954882c1dba461863e98adf2,2016-03-25 22:59:23 +0000,http://www.flipkart.com/eternal-gandhi-super-s...,Eternal Gandhi Super Series Crystal Paper Weig...,"[""Eternal Gandhi Super Series Crystal Paper We...",PWTEB7H2E4KCYUE3,430.0,430.0,"[""http://img5a.flixcart.com/image/paper-weight...",False,Key Features of Eternal Gandhi Super Series Cr...,No rating available,Eternal Gandhi,"{""product_specification""=>[{""key""=>""Model Name...",PWTEB7H
6,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F
7,8542703ca9e6ebdf6d742638dfb1f2ca,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGYGHFUEXN,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/e/x...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati...",SBEEH3Q
8,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals","[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",False,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",...",SHOEH3D
9,4044c0ac52c1ee4b28777417651faf42,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVUHAAVH9X,1199.0,479.0,"[""http://img5a.flixcart.com/image/short/5/z/c/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ...",SRTEH2F


In [13]:
df.head()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


## The pid prefix does not reliably identify the brand, so this approach is discarded.
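One way to put a number on this kind of conclusion is to count how many distinct brands each pid prefix maps to. The following is only a sketch on made-up rows (the `sample` frame, its pid values and the `PREFIX_LEN` constant are assumptions for illustration); it does not reproduce the result for the real data.

```python
import pandas as pd

# Illustrative rows only; the pid and brand values are made up for this sketch
sample = pd.DataFrame({
    "pid": ["AAAE111X1", "AAAE111X2", "AAAE111X3", "BBBE222Y1", "BBBE222Y2"],
    "brand": ["Alpha", "Alpha", "Beta", "Gamma", None],
})

PREFIX_LEN = 7  # assumption: same prefix length as in the cells above
sample["newid"] = sample["pid"].str[:PREFIX_LEN]

# Count distinct non-missing brands per prefix
brands_per_prefix = (
    sample.dropna(subset=["brand"]).groupby("newid")["brand"].nunique()
)

# Fraction of prefixes that map to exactly one brand
share_consistent = (brands_per_prefix == 1).mean()

print(brands_per_prefix)
print(share_consistent)
```

A share close to 1 would support filling missing brands from the prefix; a low share would support discarding the idea.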


In [15]:
df.tail()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
19995,7179d2f6c4ad50a17d014ca1d2815156,2015-12-01 10:15:43 +0000,http://www.flipkart.com/walldesign-small-vinyl...,WallDesign Small Vinyl Sticker,"[""Baby Care >> Baby & Kids Gifts >> Stickers >...",STIE7KFJAKSTDY9G,1500.0,730.0,"[""http://img6a.flixcart.com/image/wall-decorat...",False,Buy WallDesign Small Vinyl Sticker for Rs.730 ...,No rating available,No rating available,WallDesign,"{""product_specification""=>[{""key""=>""Number of ..."
19996,71ac419198359d37b8fe5e3fffdfee09,2015-12-01 10:15:43 +0000,http://www.flipkart.com/wallmantra-large-vinyl...,Wallmantra Large Vinyl Stickers Sticker,"[""Baby Care >> Baby & Kids Gifts >> Stickers >...",STIE9F5URNQGJCGH,1429.0,1143.0,"[""http://img6a.flixcart.com/image/sticker/z/g/...",False,Buy Wallmantra Large Vinyl Stickers Sticker fo...,No rating available,No rating available,Wallmantra,"{""product_specification""=>[{""key""=>""Number of ..."
19997,93e9d343837400ce0d7980874ece471c,2015-12-01 10:15:43 +0000,http://www.flipkart.com/elite-collection-mediu...,Elite Collection Medium Acrylic Sticker,"[""Baby Care >> Baby & Kids Gifts >> Stickers >...",STIE7VAYDKQZEBSD,1299.0,999.0,"[""http://img5a.flixcart.com/image/sticker/b/s/...",False,Buy Elite Collection Medium Acrylic Sticker fo...,No rating available,No rating available,Elite Collection,"{""product_specification""=>[{""key""=>""Number of ..."
19998,669e79b8fa5d9ae020841c0c97d5e935,2015-12-01 10:15:43 +0000,http://www.flipkart.com/elite-collection-mediu...,Elite Collection Medium Acrylic Sticker,"[""Baby Care >> Baby & Kids Gifts >> Stickers >...",STIE8YSVEPPCZ42Y,1499.0,1199.0,"[""http://img5a.flixcart.com/image/sticker/4/2/...",False,Buy Elite Collection Medium Acrylic Sticker fo...,No rating available,No rating available,Elite Collection,"{""product_specification""=>[{""key""=>""Number of ..."
19999,cb4fa87a874f715fff567f7b7b3be79c,2015-12-01 10:15:43 +0000,http://www.flipkart.com/elite-collection-mediu...,Elite Collection Medium Acrylic Sticker,"[""Baby Care >> Baby & Kids Gifts >> Stickers >...",STIE88KN9ZDSGZKY,1499.0,999.0,"[""http://img6a.flixcart.com/image/sticker/z/k/...",False,Buy Elite Collection Medium Acrylic Sticker fo...,No rating available,No rating available,Elite Collection,"{""product_specification""=>[{""key""=>""Number of ..."


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(12)
memory usage: 1.2+ MB


In [17]:
dfnew = df.copy()

In [18]:
dfnew.drop_duplicates()

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,7f7036a6d550aaa89d34c77bd39a5e48,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGU7MFYJFY,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/j/f...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,f449ec65dcbc041b6ae5e6a32717d01b,2016-03-25 22:59:23 +0000,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH4GRSUBJGZXE,999.0,499.0,"[""http://img5a.flixcart.com/image/shoe/7/z/z/r...",False,Key Features of AW Bellies Sandals Wedges Heel...,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,0973b37acd0c664e3de26e97e5571454,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2F6HUZMQ6SJ,699.0,267.0,"[""http://img5a.flixcart.com/image/short/6/2/h/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,bc940ea42ee6bef5ac7cea3fb5cfbee7,2016-03-25 22:59:23 +0000,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",PSOEH3ZYDMSYARJ5,220.0,210.0,"[""http://img5a.flixcart.com/image/pet-shampoo/...",False,Specifications of Sicons All Purpose Arnica Do...,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."
5,c2a17313954882c1dba461863e98adf2,2016-03-25 22:59:23 +0000,http://www.flipkart.com/eternal-gandhi-super-s...,Eternal Gandhi Super Series Crystal Paper Weig...,"[""Eternal Gandhi Super Series Crystal Paper We...",PWTEB7H2E4KCYUE3,430.0,430.0,"[""http://img5a.flixcart.com/image/paper-weight...",False,Key Features of Eternal Gandhi Super Series Cr...,No rating available,No rating available,Eternal Gandhi,"{""product_specification""=>[{""key""=>""Model Name..."
6,ce5a6818f7707e2cb61fdcdbba61f5ad,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVVKRBAXHB,1199.0,479.0,"[""http://img6a.flixcart.com/image/short/p/j/z/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
7,8542703ca9e6ebdf6d742638dfb1f2ca,2016-03-25 22:59:23 +0000,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",SBEEH3QGYGHFUEXN,32157.0,22646.0,"[""http://img6a.flixcart.com/image/sofa-bed/e/x...",False,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
8,29c8d290caa451f97b1c32df64477a2c,2016-03-25 22:59:23 +0000,http://www.flipkart.com/dilli-bazaaar-bellies-...,"dilli bazaaar Bellies, Corporate Casuals, Casuals","[""Footwear >> Women's Footwear >> Ballerinas >...",SHOEH3DZBFR88SCK,699.0,349.0,"[""http://img6a.flixcart.com/image/shoe/b/p/n/p...",False,"Key Features of dilli bazaaar Bellies, Corpora...",No rating available,No rating available,dilli bazaaar,"{""product_specification""=>[{""key""=>""Occasion"",..."
9,4044c0ac52c1ee4b28777417651faf42,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FVUHAAVH9X,1199.0,479.0,"[""http://img5a.flixcart.com/image/short/5/z/c/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(12)
memory usage: 1.2+ MB


In [20]:
dfnew = dfnew.drop_duplicates()

In [21]:
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 15 columns):
uniq_id                    20000 non-null object
crawl_timestamp            20000 non-null object
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
pid                        20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
image                      19997 non-null object
is_FK_Advantage_product    20000 non-null bool
description                19998 non-null object
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(12)
memory usage: 1.4+ MB


In [22]:
dfnew.drop(['crawl_timestamp','uniq_id','image'],axis=1,inplace=True)

In [23]:
dfnew.head(2)
dfnew.drop(['pid'],axis=1,inplace=True)

#### pid, crawl_timestamp, uniq_id and image are dropped since they are irrelevant in the current context

## Analysing the relationship of brand with description, product_url and product_name

In [24]:
dfnew['product_url'][0]
dfnew['description'][20]

'Specifications of Sicons Conditioning Conditoner Dog Shampoo (200 ml) General Pet Type Dog Brand Sicons Quantity 200 ml Model Number SH.DF-02 Type Conditioning Fragrance Conditoner Form Factor Gel In the Box Sales Package Shampoo Sicons Dog Fashion Conditioner Aloe Rinse'

In [25]:
dfnew.drop(['description'],axis=1,inplace=True)
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 10 columns):
product_url                20000 non-null object
product_name               20000 non-null object
product_category_tree      20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
is_FK_Advantage_product    20000 non-null bool
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(7)
memory usage: 1.5+ MB


In [26]:
#dfnew.info()
dfnew['product_url'][18000]
#dfnew['product_name'][0]

'http://www.flipkart.com/wildcraft-bonk-25-l-backpack/p/itme9dkby8j6cryh?pid=BKPE9DKAA8TSKFKW'

In [27]:
dfnew['product_name'][18000]

'Wildcraft Bonk 25 L Backpack'

In [28]:
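# The original cell compared against NaN with `==`, which is always False (NaN != NaN),
# so it produced the empty frame recorded in the next output.
# isna() selects the rows whose brand is missing.
dfnew1 = dfnew[dfnew['brand'].isna()]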

In [29]:
dfnew1

Unnamed: 0,product_url,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications
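An empty result is what any equality test against NaN produces, because NaN never compares equal to anything, including itself. The short sketch below illustrates the difference on a small made-up Series (the name `brands` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

brands = pd.Series(["Alisha", np.nan, "Sicons", np.nan], name="brand")

# Equality against NaN is False for every element, including the NaN ones
eq_mask = brands == np.nan

# isna() flags the missing entries
na_mask = brands.isna()

print(eq_mask.sum())  # number of rows matched by the equality test
print(na_mask.sum())  # number of rows with a missing brand
```

Selecting the rows with a missing brand therefore needs `isna()` (or its alias `isnull()`) rather than `==`.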


In [30]:
dfnew.head()

Unnamed: 0,product_url,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications
0,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",999.0,379.0,False,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
1,http://www.flipkart.com/fabhomedecor-fabric-do...,FabHomeDecor Fabric Double Sofa Bed,"[""Furniture >> Living Room Furniture >> Sofa B...",32157.0,22646.0,False,No rating available,No rating available,FabHomeDecor,"{""product_specification""=>[{""key""=>""Installati..."
2,http://www.flipkart.com/aw-bellies/p/itmeh4grg...,AW Bellies,"[""Footwear >> Women's Footwear >> Ballerinas >...",999.0,499.0,False,No rating available,No rating available,AW,"{""product_specification""=>[{""key""=>""Ideal For""..."
3,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",699.0,267.0,False,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."
4,http://www.flipkart.com/sicons-all-purpose-arn...,Sicons All Purpose Arnica Dog Shampoo,"[""Pet Supplies >> Grooming >> Skin & Coat Care...",220.0,210.0,False,No rating available,No rating available,Sicons,"{""product_specification""=>[{""key""=>""Pet Type"",..."


### product_url and pid are unique to each product, so dropping them lets otherwise identical rows be detected as duplicates.

### Approach 1: brand names that are two words long

In [31]:
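def stripbrand(x):
    # Join the first two words of a product name without a space
    # (e.g. 'AW Bellies' -> 'AWBellies'); slicing avoids an IndexError on one-word names
    words = x.split()
    return ''.join(words[:2])
dfnew.drop(['product_url'],axis=1,inplace=True)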
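Before taking the leading words of `product_name` as the brand, it may help to check how often a known brand actually appears at the start of the name. This is a hedged sketch: the first three rows are copied from the outputs above, the fourth row and the helper `name_starts_with_brand` are made up for illustration.

```python
import pandas as pd

# First three rows come from the outputs above; the last row is made up
sample = pd.DataFrame({
    "product_name": [
        "Alisha Solid Women's Cycling Shorts",
        "dilli bazaaar Bellies, Corporate Casuals, Casuals",
        "AW Bellies",
        "Cotton Bedsheet by Acme",
    ],
    "brand": ["Alisha", "dilli bazaaar", "AW", "Acme"],
})

def name_starts_with_brand(row):
    """True when the product name begins with the brand (case-insensitive)."""
    return row["product_name"].lower().startswith(row["brand"].lower())

# Only rows with a known brand can be used for the check
known = sample.dropna(subset=["brand"])
matches = known.apply(name_starts_with_brand, axis=1)

print(matches.mean())  # share of names that start with their brand
```

A high share on the rows with a known brand would suggest that a prefix of the name is a reasonable guess for the rows where `brand` is missing; the number of words to take remains the open question.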

In [32]:
#dfnew['brand_refined'] = dfnew[dfnew['product_name']].apply(stripbrand,axis=1)
#type(dfnew['product_name'])
#dfnew['product_name']
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 9 columns):
product_name               20000 non-null object
product_category_tree      20000 non-null object
retail_price               19922 non-null float64
discounted_price           19922 non-null float64
is_FK_Advantage_product    20000 non-null bool
product_rating             20000 non-null object
overall_rating             20000 non-null object
brand                      14136 non-null object
product_specifications     19986 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 1.4+ MB


In [33]:
dfnew.drop_duplicates(inplace=True)

In [34]:
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19449 entries, 0 to 19999
Data columns (total 9 columns):
product_name               19449 non-null object
product_category_tree      19449 non-null object
retail_price               19371 non-null float64
discounted_price           19371 non-null float64
is_FK_Advantage_product    19449 non-null bool
product_rating             19449 non-null object
overall_rating             19449 non-null object
brand                      13791 non-null object
product_specifications     19435 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 930.7+ KB


### 551 duplicate rows dropped
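
A small sketch of why dropping the identifier column matters, using a made-up three-row frame (the rows, the URLs and the `toy` name are illustrative assumptions, not taken from the dataset). With `product_url` present no row is an exact duplicate; once it is dropped, the repeated listing can be counted with `duplicated()` before it is removed.

```python
import pandas as pd

# Hypothetical mini-dataset: the first two rows are the same product under two URLs.
toy = pd.DataFrame({
    'product_url': ['http://example.com/a1', 'http://example.com/a2', 'http://example.com/b'],
    'product_name': ['Shorts', 'Shorts', 'Bellies'],
    'retail_price': [999.0, 999.0, 499.0],
})

# With the identifier column every row is unique.
dupes_with_url = int(toy.duplicated().sum())

# Without it, the repeated listing shows up as a duplicate.
no_url = toy.drop(['product_url'], axis=1)
dupes_without_url = int(no_url.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduped = no_url.drop_duplicates()

print(dupes_with_url, dupes_without_url, len(deduped))
```

Counting with `duplicated().sum()` first gives the number of rows that `drop_duplicates` will remove, without having to compare two `info()` outputs.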

## Approach 2. Analysing the brand's relationship with product_category_tree

In [35]:
x = dfnew['product_category_tree'][100]

In [36]:
x

'["Watches >> Wrist Watches >> Rorlig Wrist Watches"]'

In [37]:
x = x.strip('[')

In [38]:
x = x.strip(']')

In [39]:
x = x.replace('"','')
m = x.split(">>")


In [40]:
m = x.split(">>")
def subtract(x,y):
    if y in x:
        x = x.replace(y,"")
    return x
subtract('kaushik','ik')


'kaush'

In [41]:
m

['Watches ', ' Wrist Watches ', ' Rorlig Wrist Watches']

In [42]:
for i in range(len(m)):
    m[i] = m[i].replace(' ','')


In [43]:
m

['Watches', 'WristWatches', 'RorligWristWatches']

In [44]:
for i in range(len(m)-2,0,-1):
    if m[i] in m[-1]:
        m[-1] = m[-1].replace(m[i],'')

In [45]:
m

['Watches', 'WristWatches', 'Rorlig']

In [46]:
m[-1]

'Rorlig'

In [47]:
def brandprediction(x):
    x = x.strip('[')
    x = x.strip(']')
    x = x.replace('"','')
    m = x.split(">>")
    for i in range(len(m)):
        m[i] = m[i].replace(' ','')
    for i in range(len(m)-2,0,-1):
        if m[i] in m[-1]:
            m[-1] = m[-1].replace(m[i],'')
    return m[-1]

In [48]:
brandprediction(dfnew['product_category_tree'][876])

'LifebyShoppersStop'
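
Because `brandprediction` removes every space, multi-word brands come back glued together. A possible variant, shown here only as a sketch, strips the parent category from the end of the last level instead, so spaces inside the brand survive. The name `brand_from_tree` and the suffix-matching rule are assumptions for illustration, not part of the notebook's method.

```python
def brand_from_tree(tree):
    """Hypothetical variant of brandprediction that keeps spaces in the brand."""
    # Remove the surrounding [" and "] and split the tree into its levels.
    levels = [part.strip() for part in tree.strip('[]"').split('>>')]
    leaf = levels[-1]
    # Walk the parent categories from nearest to farthest and peel each one
    # off the end of the leaf when the leaf ends with it.
    for parent in reversed(levels[:-1]):
        if parent and leaf != parent and leaf.endswith(parent):
            leaf = leaf[:-len(parent)].strip()
    return leaf


print(brand_from_tree('["Watches >> Wrist Watches >> Rorlig Wrist Watches"]'))
print(brand_from_tree('["Footwear >> Women\'s Footwear >> Casual Shoes >> Boots"]'))
```

Like the original function, this sketch still returns a bare category word when the tree has no brand in its last level, so such rows need a separate fallback.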

In [49]:
dfnew['brand_predicted'] = dfnew['product_category_tree'].apply(brandprediction)

In [50]:
dfnew['brand_predicted'].value_counts()
dfnew['brand'].value_counts()

Allure Auto                        469
Regular                            311
Voylla                             299
Slim                               279
TheLostPuppy                       229
Karatcraft                         211
Black                              164
DailyObjects                       144
White                              143
Speedwav                           141
Radiant Bay                        132
Red                                102
Enthopia                           101
BlueStone                           99
Pink                                98
HomeeHub                            95
Wallmantra                          80
Purple                              79
AdroitZ                             74
Blue                                71
Jewelizer                           67
Regular Fit                         66
Easy Gardening                      59
Beige                               54
Hotpiper                            53
Raymond                  

## The prediction fits reasonably well

In [51]:
dfnew.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19449 entries, 0 to 19999
Data columns (total 10 columns):
product_name               19449 non-null object
product_category_tree      19449 non-null object
retail_price               19371 non-null float64
discounted_price           19371 non-null float64
is_FK_Advantage_product    19449 non-null bool
product_rating             19449 non-null object
overall_rating             19449 non-null object
brand                      13791 non-null object
product_specifications     19435 non-null object
brand_predicted            19449 non-null object
dtypes: bool(1), float64(2), object(7)
memory usage: 1.5+ MB


In [52]:
dfnew['brand'].nunique()

3499

In [53]:
dfnew['brand_predicted'].nunique()

5985

In [54]:
brandprediction(dfnew['product_category_tree'][154])

'Boots'

### In some cases the product category tree is anomalous, e.g. the brand name is missing from the last level

In [55]:
dfnew['product_category_tree'][154]

'["Footwear >> Women\'s Footwear >> Casual Shoes >> Boots"]'

In [56]:
dfnew['brand'][154]

nan

In [57]:
df['product_url'][154]

'http://www.flipkart.com/shuz-touch-boots/p/itmecxpm57rcspy4?pid=SHOECXPMAQXZ3QPH'

In [58]:
df['product_name'][154]

'Shuz Touch Boots'

In [59]:
df['product_name'][155]

'Kielz Ladies Boots'

In [60]:
dfnew['brand'][155]

nan

In [61]:
type(dfnew['brand'][155])

float

### The missing brand is NaN, which is a float, and NaN never compares equal to np.nan, so an equality test cannot be used as the criterion for separating the dataframes

In [62]:
type(dfnew['brand'][154])

float

In [63]:
str(dfnew['brand'][155])

'nan'

In [64]:
def convertstring(x):
    return str(x)

In [65]:
dfnew['brand_refined_into_string'] = df['brand'].apply(convertstring)
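
Converting to strings works, but pandas also provides `isna()` for exactly this case. The sketch below uses a made-up four-row frame (the `toy` name and its values are illustrative assumptions) to show why comparing with `np.nan` finds nothing, and how `isna()` splits the rows without an extra helper column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing brands.
toy = pd.DataFrame({'brand': ['Alisha', np.nan, 'AW', np.nan]})

# NaN is never equal to anything, including itself, so this mask is all False.
eq_mask = toy['brand'] == np.nan

# isna() is the supported way to detect missing values.
na_mask = toy['brand'].isna()

missing = toy[na_mask]    # rows whose brand still has to be predicted
present = toy[~na_mask]   # rows that already have a brand

print(int(eq_mask.sum()), int(na_mask.sum()))
```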

## Approach 3.0 separating dataframes and working on just the one having nan

In [90]:
dfbrandnan = dfnew[dfnew['brand_refined_into_string'] == 'nan']
dff = dfnew[dfnew['brand_refined_into_string'] != 'nan']

In [91]:
dfbrandnan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5658 entries, 25 to 19962
Data columns (total 11 columns):
product_name                 5658 non-null object
product_category_tree        5658 non-null object
retail_price                 5645 non-null float64
discounted_price             5645 non-null float64
is_FK_Advantage_product      5658 non-null bool
product_rating               5658 non-null object
overall_rating               5658 non-null object
brand                        0 non-null object
product_specifications       5658 non-null object
brand_predicted              5658 non-null object
brand_refined_into_string    5658 non-null object
dtypes: bool(1), float64(2), object(8)
memory usage: 314.9+ KB


In [92]:
dfbrandnan.head()

Unnamed: 0,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications,brand_predicted,brand_refined_into_string
25,Glus Wedding Lingerie Set,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",1299.0,699.0,False,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Number of ...",Glus,
26,Veelys Shiny White Quad Roller Skates - Size 4...,"[""Sports & Fitness >> Other Sports >> Skating ...",3199.0,2499.0,False,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Playing Le...",Veelys,
27,Bulaky vanity case Jewellery Vanity Case,"[""Beauty and Personal Care >> Makeup >> Vanity...",499.0,390.0,False,3,3,,"{""product_specification""=>{""key""=>""Body Materi...",Bulaky,
28,FDT Women's Leggings,"[""Clothing >> Women's Clothing >> Fusion Wear ...",699.0,309.0,False,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Number of ...",FDT,
29,Madcaps C38GR30 Men's Cargos,"[""Clothing >> Men's Clothing >> Cargos, Shorts...",2199.0,1699.0,False,No rating available,No rating available,,"{""product_specification""=>[{""key""=>""Number of ...",Madcaps,


## Approach 3.1 reapplying all the logics one by one to get the best possible brand prediction

In [93]:
dfbrandnan['brand'] = dfbrandnan['product_name'].apply(lambda x: x.split()[0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
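
The warning above appears because `dfbrandnan` was created by filtering `dfnew`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on made-up rows (the `toy` and `missing` names are assumptions), is to take an explicit `.copy()` of the filtered rows before assigning to them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two of the three products have no brand.
toy = pd.DataFrame({
    'product_name': ['Glus Wedding Lingerie Set', 'AW Bellies', "FDT Women's Leggings"],
    'brand': [np.nan, 'AW', np.nan],
})

# An explicit copy makes the subset an independent frame.
missing = toy[toy['brand'].isna()].copy()

# Same first-word rule as in the cell above.
missing['brand'] = missing['product_name'].apply(lambda x: x.split()[0])

print(missing['brand'].tolist())
```

The original frame is left untouched, which matches what the notebook relies on later when it concatenates the two halves.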


In [94]:
def stringcut(x):
    # Join the first two words; slicing keeps single-word names from raising IndexError.
    m = x.split()
    y = ''.join(m[:2])
    return y
stringcut

<function __main__.stringcut>

In [95]:
dfbrandnan1 = dfbrandnan.copy()

In [5]:
#dfbrandnan1['brand_name2'] = dfbrandnan1['product_name'].apply(stringcut)

In [97]:
dfbrandnan1.head()

Unnamed: 0,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications,brand_predicted,brand_refined_into_string
25,Glus Wedding Lingerie Set,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",1299.0,699.0,False,No rating available,No rating available,Glus,"{""product_specification""=>[{""key""=>""Number of ...",Glus,
26,Veelys Shiny White Quad Roller Skates - Size 4...,"[""Sports & Fitness >> Other Sports >> Skating ...",3199.0,2499.0,False,No rating available,No rating available,Veelys,"{""product_specification""=>[{""key""=>""Playing Le...",Veelys,
27,Bulaky vanity case Jewellery Vanity Case,"[""Beauty and Personal Care >> Makeup >> Vanity...",499.0,390.0,False,3,3,Bulaky,"{""product_specification""=>{""key""=>""Body Materi...",Bulaky,
28,FDT Women's Leggings,"[""Clothing >> Women's Clothing >> Fusion Wear ...",699.0,309.0,False,No rating available,No rating available,FDT,"{""product_specification""=>[{""key""=>""Number of ...",FDT,
29,Madcaps C38GR30 Men's Cargos,"[""Clothing >> Men's Clothing >> Cargos, Shorts...",2199.0,1699.0,False,No rating available,No rating available,Madcaps,"{""product_specification""=>[{""key""=>""Number of ...",Madcaps,


In [98]:
dfbrandnan.head()

Unnamed: 0,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications,brand_predicted,brand_refined_into_string
25,Glus Wedding Lingerie Set,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",1299.0,699.0,False,No rating available,No rating available,Glus,"{""product_specification""=>[{""key""=>""Number of ...",Glus,
26,Veelys Shiny White Quad Roller Skates - Size 4...,"[""Sports & Fitness >> Other Sports >> Skating ...",3199.0,2499.0,False,No rating available,No rating available,Veelys,"{""product_specification""=>[{""key""=>""Playing Le...",Veelys,
27,Bulaky vanity case Jewellery Vanity Case,"[""Beauty and Personal Care >> Makeup >> Vanity...",499.0,390.0,False,3,3,Bulaky,"{""product_specification""=>{""key""=>""Body Materi...",Bulaky,
28,FDT Women's Leggings,"[""Clothing >> Women's Clothing >> Fusion Wear ...",699.0,309.0,False,No rating available,No rating available,FDT,"{""product_specification""=>[{""key""=>""Number of ...",FDT,
29,Madcaps C38GR30 Men's Cargos,"[""Clothing >> Men's Clothing >> Cargos, Shorts...",2199.0,1699.0,False,No rating available,No rating available,Madcaps,"{""product_specification""=>[{""key""=>""Number of ...",Madcaps,


## Approach 3.2 checking for common nouns from product_category_tree and subtracting it from product_name  

In [99]:
dfbrandnan['product_name'][25]

'Glus Wedding Lingerie Set'

In [100]:
dfbrandnan['product_category_tree'][25]

'["Clothing >> Women\'s Clothing >> Lingerie, Sleep & Swimwear >> Lingerie Sets >> Glus Lingerie Sets"]'

In [101]:
def commons(x):
    x = x.strip('[')
    x = x.strip(']')
    x = x.replace('"','')
    #x = x.replace('\'',' ')
    m = x.split(">>")
    
    return m

In [102]:
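def gettingbrand(x, name):
    # x: product_category_tree string, name: the matching product_name
    n = commons(x)
    l = n[0:-1]
    # Collect every word used in the parent category levels
    xl = []
    for i in l:
        m = i.split()
        for j in m:
            xl.append(j)
    ylist = name.split()
    brand = ''
    for i in range(len(ylist)):
        # Compare whole words (ylist[i]), not single characters of the name
        if ylist[i] not in xl:
            brand = brand + ' ' + ylist[i]
    return brand.strip()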
    

In [103]:
brand_list = []
for row_index,row in dfbrandnan.iterrows():
    x = row['product_category_tree']
    n = commons(x)
    y = dfbrandnan['product_name'][row_index]
    ylist = y.split()
    for j in range(len(n)):
        if n[j] in ylist:
            l = n[0:-1]
        else:
            l = n
    xl = []
    for i in l:
        m = i.split()
        for j in m:
            xl.append(j)
    #y = dfbrandnan['product_name'][row_index]
    #ylist = y.split()
    #print(ylist)
    brand = ''
    #brand_list = []
    for i in range(len(ylist)):
        #print(ylist[i])
        if ylist[i] not in xl:
            brand = brand + ' ' + ylist[i]
    brand = brand.strip()
    brand_list.append(brand)
   

In [104]:
brand_list

['Wedding Set',
 'Shiny White Quad Roller - Size 4.5 UK',
 'vanity case Jewellery Case',
 '',
 'C38GR30',
 'CO6394A1 Analog Watch - For Men, Boys',
 'TEN TEN Black Knee Length',
 'G 729 S-BK Analog Watch - For Men, Boys',
 'Carlton',
 'R8851116001 Analog Watch - For Boys',
 '8503B-1RED Cold Light Digital Watch - For Boys, Girls',
 'WM64 Elegance Analog Watch - For Men, Boys',
 'Quechua Arpenaz Novadry',
 'COLAT_MW20 Sheen Analog Watch - For Men, Women, Boys, Girls',
 'Steppings Trendy',
 'RW38 Analog Watch - For Boys',
 'RR-028 Expedition Analog Watch - For Men, Boys',
 'Catwalk',
 'Magnum Lifestyle',
 'UFT-TSW-005-BK-BR Analog Watch - For Boys',
 'Rialto',
 'Kielz Ladies',
 'WY16B Youth Digital Watch - For Men, Boys',
 'La Briza Ashley',
 'CAU1116.BA0858 Formula 1 Analog Watch - For Boys, Men',
 'Salt N Pepper 13-019 Femme Black',
 'Shuz Touch',
 'Steppings Trendy',
 'Salt N Pepper 14-664 Denny Black',
 'CS-2001 Analog Watch - For Boys, Men',
 'Crocs',
 'AB011010/BB08 131S Chronomat 4

In [105]:
dfbrandnan2 = dfbrandnan.copy()

In [106]:
dfbrandnan2['final_brand'] = brand_list

In [107]:
dfbrandnan2['final_brand'].value_counts()

Printed Round Neck T-Shirt                            242
                                                      129
A-line Dress                                          120
Solid Polo Neck T-Shirt                               112
Solid Round Neck T-Shirt                              110
Graphic Print Round Neck T-Shirt                      100
Striped Polo Neck T-Shirt                              73
Casual Printed Kurti                                   70
Casual Sleeveless Solid Top                            69
Shift Dress                                            68
Casual Short Sleeve Solid Top                          56
Solid V-neck T-Shirt                                   51
Combo Set                                              46
Full Sleeve Solid Sweatshirt                           46
Solid Casual Shirt                                     39
Gathered Dress                                         39
Maxi Dress                                             39
Printed V-neck

### Approach 3.2 resulted in anomalies: it mostly returns product descriptors (e.g. 'Printed Round Neck T-Shirt') instead of brand names

In [108]:
dfbrandnan2.head(5000)

Unnamed: 0,product_name,product_category_tree,retail_price,discounted_price,is_FK_Advantage_product,product_rating,overall_rating,brand,product_specifications,brand_predicted,brand_refined_into_string,final_brand
25,Glus Wedding Lingerie Set,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",1299.0,699.0,False,No rating available,No rating available,Glus,"{""product_specification""=>[{""key""=>""Number of ...",Glus,,Wedding Set
26,Veelys Shiny White Quad Roller Skates - Size 4...,"[""Sports & Fitness >> Other Sports >> Skating ...",3199.0,2499.0,False,No rating available,No rating available,Veelys,"{""product_specification""=>[{""key""=>""Playing Le...",Veelys,,Shiny White Quad Roller - Size 4.5 UK
27,Bulaky vanity case Jewellery Vanity Case,"[""Beauty and Personal Care >> Makeup >> Vanity...",499.0,390.0,False,3,3,Bulaky,"{""product_specification""=>{""key""=>""Body Materi...",Bulaky,,vanity case Jewellery Case
28,FDT Women's Leggings,"[""Clothing >> Women's Clothing >> Fusion Wear ...",699.0,309.0,False,No rating available,No rating available,FDT,"{""product_specification""=>[{""key""=>""Number of ...",FDT,,
29,Madcaps C38GR30 Men's Cargos,"[""Clothing >> Men's Clothing >> Cargos, Shorts...",2199.0,1699.0,False,No rating available,No rating available,Madcaps,"{""product_specification""=>[{""key""=>""Number of ...",Madcaps,,C38GR30
88,"Cobra Paris CO6394A1 Analog Watch - For Men, ...","[""Watches >> Wrist Watches >> Cobra Paris Wris...",18995.0,15195.0,False,No rating available,No rating available,Cobra,"{""product_specification""=>[{""key""=>""Mechanism""...",CobraParis,,"CO6394A1 Analog Watch - For Men, Boys"
89,TEN TEN Women's Black Knee Length Boots Boots,"[""Footwear >> Women's Footwear >> Casual Shoes...",4995.0,1995.0,False,No rating available,No rating available,TEN,"{""product_specification""=>[{""key""=>""Ideal For""...",Boots,,TEN TEN Black Knee Length
90,"Aries Gold G 729 S-BK Analog Watch - For Men,...","[""Watches >> Wrist Watches >> Aries Gold Wrist...",13699.0,13099.0,False,No rating available,No rating available,Aries,"{""product_specification""=>[{""key""=>""Chronograp...",AriesGold,,"G 729 S-BK Analog Watch - For Men, Boys"
91,Carlton Boots,"[""Footwear >> Women's Footwear >> Casual Shoes...",3495.0,1223.0,False,No rating available,No rating available,Carlton,"{""product_specification""=>[{""key""=>""Occasion"",...",Boots,,Carlton
92,Maserati Time R8851116001 Analog Watch - For ...,"[""Watches >> Wrist Watches >> Maserati Time Wr...",24400.0,24400.0,False,No rating available,No rating available,Maserati,"{""product_specification""=>[{""key""=>""Chronograp...",MaseratiTime,,R8851116001 Analog Watch - For Boys


In [109]:
dfbrandnan2['brand'].nunique()

1546

In [110]:
dfbrandnan2['brand_predicted'].nunique()

1466

### This shows that 'brand' has more unique values (1546 vs 1466), but 'brand_predicted' gives better results

### Approach 3.3 searching for common nouns in brand_predicted and replacing it by brand[row_index]

In [116]:
dfbrandnan3 = dfbrandnan2.copy()
brand_name_final = []

In [118]:
dfbrandnan2.reset_index()
brande_final = []

In [119]:
# Values of brand_predicted that are not usable brands; fall back to 'brand' for these
generic_terms = ('Boots', 'Heels', 'Loafers', 'La', 'Flats', 'Foot')
for row_index,row in dfbrandnan2.iterrows():
    if row['brand_predicted'] in generic_terms:
        brande_final.append(row['brand'])
    else:
        brande_final.append(row['brand_predicted'])

In [120]:
brande_final

['Glus',
 'Veelys',
 'Bulaky',
 'FDT',
 'Madcaps',
 'CobraParis',
 'TEN',
 'AriesGold',
 'Carlton',
 'MaseratiTime',
 'Vizion',
 'Camerii',
 'Quechua',
 'Colat',
 'Steppings',
 'Rochees',
 'Rorlig',
 'Catwalk',
 'Magnum',
 'TSTAR',
 'Rialto',
 'Kielz',
 'Alfajr',
 'La',
 'TAGHeuer',
 'Salt',
 'Shuz',
 'Steppings',
 'Salt',
 'CostaSwiss',
 'Crocs',
 'Breitling',
 'Lyc',
 'Myra',
 'Calibro',
 'Get',
 'Kielz',
 'Kielz',
 'Rochees',
 'Kielz',
 'Salt',
 'Fluid',
 'Steppings',
 'Rialto',
 'Rorlig',
 'Steppings',
 'Disney',
 'Salt',
 'Cartier',
 'Bruno',
 'Stylistry',
 'LoisCaron',
 'Felix',
 'Kielz',
 'Kielz',
 'Sakay',
 'La',
 'KoolKidz',
 'FranckBella',
 'Steppings',
 'KoolKidz',
 'Vizion',
 'VencerStella',
 'La',
 'Casela',
 'Sneha',
 'Timer',
 'Shuz',
 'Kielz',
 'Colat',
 'Kielz',
 'Belle',
 'Titan',
 'Selfie',
 'Clincher',
 'SrushtiArtJewelry',
 'Kielz',
 'Q&Q',
 'Belle',
 'Escort',
 'Roxy',
 'Lee',
 'Bruno',
 'Estilo',
 'Jackklein',
 'NorthMoon',
 'Shuz',
 'Foot',
 'Credos',
 'RichClub
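
The row-by-row fallback can also be written without a loop. This is a sketch on three made-up rows that mirror the table above (the `toy` name and the `generic` list name are assumptions): `isin` flags the unusable predictions and `where` keeps the prediction everywhere else.

```python
import pandas as pd

# Hypothetical rows: two predictions are category words, one is a real brand.
toy = pd.DataFrame({
    'brand': ['TEN', 'Carlton', 'Cobra'],
    'brand_predicted': ['Boots', 'Boots', 'CobraParis'],
})

# Same list of values that the loop treats as unusable.
generic = ['Boots', 'Heels', 'Loafers', 'La', 'Flats', 'Foot']
is_generic = toy['brand_predicted'].isin(generic)

# where() keeps brand_predicted where the condition holds and takes 'brand' otherwise.
toy['Brand_names'] = toy['brand_predicted'].where(~is_generic, toy['brand'])

print(toy['Brand_names'].tolist())
```

Assigning the result as a column also keeps it aligned with the frame's index, so no separate list has to be attached afterwards.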

### This is the analysis that helped frame the brand-prediction function

In [121]:
#dfbrandnan3['final_brand_hel']

In [123]:
#dfbrandnan2['brand_name_final'] = dfbrandnan2[dfbrandnan2['']

In [124]:
dfbrandnan2['product_category_tree'][19962]

'["Footwear >> Women\'s Footwear >> Heels"]'

In [125]:
dfbrandnan2['final_brand'][19962]

'Stylistry Women'

In [126]:
v = dfbrandnan2['product_name'][19962]

In [127]:
n = commons('["Footwear >> Women\'s Footwear >> Heels"]')

In [128]:
l = n
xl = []
for i in l:
    m = i.split()
    for j in m:
        xl.append(j)

In [129]:
xl

['Footwear', "Women's", 'Footwear', 'Heels']

In [130]:
y = v
ylist = y.split()
print(ylist)
brand = ''
brand_list = []
for i in range(len(ylist)):
    print(ylist[i])
    if ylist[i] not in xl:
        brand = brand + ' ' + ylist[i]
brand = brand.strip()
brand_list.append(brand)

['Stylistry', 'Women', 'Heels']
Stylistry
Women
Heels


In [131]:
brand_list

['Stylistry Women']

In [132]:
n = commons(dfbrandnan['product_category_tree'][25])

In [134]:
#b = gettingbrand(dfbrandnan['product_category_tree'][25])

In [135]:
l = n[0:-1]

In [136]:
l

['Clothing ',
 " Women's Clothing ",
 ' Lingerie, Sleep & Swimwear ',
 ' Lingerie Sets ']

In [137]:
xl = []
for i in l:
    m = i.split()
    for j in m:
        xl.append(j)
        

In [138]:
i.split()

['Lingerie', 'Sets']

In [139]:
xl

['Clothing',
 "Women's",
 'Clothing',
 'Lingerie,',
 'Sleep',
 '&',
 'Swimwear',
 'Lingerie',
 'Sets']

In [140]:
y = dfbrandnan['product_name'][25]
ylist = y.split()

In [141]:
brand = ''
for i in range(len(ylist)):
    if y[i] not in xl:
        brand = brand + y[i]
        
        

In [142]:
ylist

['Glus', 'Wedding', 'Lingerie', 'Set']

In [143]:
brand

'Glus'
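
The result `'Glus'` deserves a caution: the loop indexes the string `y`, not the list `ylist`, so it looks at the first four characters of the product name. It returns the brand here only because 'Glus' happens to be four letters long and the name has four words. A minimal sketch of the difference, reusing the values printed above (the names `char_level` and `word_level` are illustrative):

```python
xl = ['Clothing', "Women's", 'Clothing', 'Lingerie,', 'Sleep', '&',
      'Swimwear', 'Lingerie', 'Sets']
y = 'Glus Wedding Lingerie Set'
ylist = y.split()

# What the loop does: test the characters y[0]..y[3] against the word list.
char_level = ''.join(y[i] for i in range(len(ylist)) if y[i] not in xl)

# What a word-level comparison does: test each word of the product name.
word_level = ' '.join(w for w in ylist if w not in xl)

print(repr(char_level), repr(word_level))
```

The word-level version keeps 'Wedding' and 'Set' as well, which is the same kind of leftover text that shows up in `final_brand`.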

### After the analysis, the final brand-extraction algorithm

In [144]:
dfbrandnan2['Brand_names'] = brande_final

In [145]:
dfcleaned = dfbrandnan2.copy()

In [146]:
dfcleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5658 entries, 25 to 19962
Data columns (total 13 columns):
product_name                 5658 non-null object
product_category_tree        5658 non-null object
retail_price                 5645 non-null float64
discounted_price             5645 non-null float64
is_FK_Advantage_product      5658 non-null bool
product_rating               5658 non-null object
overall_rating               5658 non-null object
brand                        5658 non-null object
product_specifications       5658 non-null object
brand_predicted              5658 non-null object
brand_refined_into_string    5658 non-null object
final_brand                  5658 non-null object
Brand_names                  5658 non-null object
dtypes: bool(1), float64(2), object(10)
memory usage: 359.2+ KB


In [158]:
dfcleaned.drop(['brand','brand_predicted','brand_refined_into_string','final_brand'],axis=1,inplace=True)

In [159]:
dfcleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5658 entries, 25 to 19962
Data columns (total 9 columns):
product_name               5658 non-null object
product_category_tree      5658 non-null object
retail_price               5645 non-null float64
discounted_price           5645 non-null float64
is_FK_Advantage_product    5658 non-null bool
product_rating             5658 non-null object
overall_rating             5658 non-null object
product_specifications     5658 non-null object
Brand_names                5658 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 270.7+ KB


In [160]:
dff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13791 entries, 0 to 19999
Data columns (total 11 columns):
product_name                 13791 non-null object
product_category_tree        13791 non-null object
retail_price                 13726 non-null float64
discounted_price             13726 non-null float64
is_FK_Advantage_product      13791 non-null bool
product_rating               13791 non-null object
overall_rating               13791 non-null object
brand                        13791 non-null object
product_specifications       13777 non-null object
brand_predicted              13791 non-null object
brand_refined_into_string    13791 non-null object
dtypes: bool(1), float64(2), object(8)
memory usage: 767.7+ KB


In [161]:
b_already_cleaned = dff['brand']
dff1 = dff.copy()

In [162]:
dff1['Brand_names'] = b_already_cleaned

In [165]:
# Drop the intermediate brand columns so dff1 has the same 9 columns as dfcleaned
dff1.drop(['brand','brand_predicted','brand_refined_into_string'],axis=1,inplace=True)
dff1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13791 entries, 0 to 19999
Data columns (total 9 columns):
product_name               13791 non-null object
product_category_tree      13791 non-null object
retail_price               13726 non-null float64
discounted_price           13726 non-null float64
is_FK_Advantage_product    13791 non-null bool
product_rating             13791 non-null object
overall_rating             13791 non-null object
product_specifications     13777 non-null object
Brand_names                13791 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 659.9+ KB


In [166]:
dfcleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5658 entries, 25 to 19962
Data columns (total 9 columns):
product_name               5658 non-null object
product_category_tree      5658 non-null object
retail_price               5645 non-null float64
discounted_price           5645 non-null float64
is_FK_Advantage_product    5658 non-null bool
product_rating             5658 non-null object
overall_rating             5658 non-null object
product_specifications     5658 non-null object
Brand_names                5658 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 270.7+ KB


In [279]:
dff1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13791 entries, 0 to 19999
Data columns (total 9 columns):
product_name               13791 non-null object
product_category_tree      13791 non-null object
retail_price               13726 non-null float64
discounted_price           13726 non-null float64
is_FK_Advantage_product    13791 non-null bool
product_rating             13791 non-null object
overall_rating             13791 non-null object
product_specifications     13777 non-null object
Brand_names                13791 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 983.1+ KB


In [286]:
dl = dff1.merge(dfcleaned)

In [167]:
frames = [dff1,dfcleaned]

In [168]:
cleanedflipkart = pd.concat(frames)

In [169]:
cleanedflipkart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19449 entries, 0 to 19962
Data columns (total 9 columns):
product_name               19449 non-null object
product_category_tree      19449 non-null object
retail_price               19371 non-null float64
discounted_price           19371 non-null float64
is_FK_Advantage_product    19449 non-null bool
product_rating             19449 non-null object
overall_rating             19449 non-null object
product_specifications     19435 non-null object
Brand_names                19449 non-null object
dtypes: bool(1), float64(2), object(6)
memory usage: 930.7+ KB
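
A quick sanity check after concatenating the two halves, sketched on tiny made-up frames (the `known` and `filled` names and their rows are assumptions): the combined frame should have exactly as many rows as both parts together, no index label should appear twice, and `sort_index` restores the original row order, since `concat` simply stacks one frame under the other.

```python
import pandas as pd

# Hypothetical halves: rows that already had a brand, and rows whose brand was filled in.
known = pd.DataFrame({'Brand_names': ['Alisha', 'AW']}, index=[0, 2])
filled = pd.DataFrame({'Brand_names': ['Glus']}, index=[1])

combined = pd.concat([known, filled])

# No rows lost or gained, and no index label shared by the two halves.
assert len(combined) == len(known) + len(filled)
assert combined.index.is_unique

# Stacking leaves the index out of order; sorting restores the original sequence.
combined = combined.sort_index()
print(combined)
```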


In [171]:
cleanedflipkart.to_csv('cleanedflipkart.csv',encoding='utf-8')

The CSV file now contains predicted brands for the 5,658 rows whose brand was NaN, which reduces the anomalies in the dataset to some extent.
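
One detail to keep in mind when reading the file back: `to_csv` writes the index as an unnamed first column by default. The sketch below round-trips a made-up two-row frame through an in-memory buffer (the `frame` name, its rows and the use of `io.StringIO` instead of a file are assumptions) to show the effect and how `index_col=0` restores the index.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left by the concatenation.
frame = pd.DataFrame({'Brand_names': ['Alisha', 'Glus']}, index=[0, 25])

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as a first column with an empty header

# Reading back naively turns the old index into a column called 'Unnamed: 0'.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores it as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns), list(restored.index))
```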

Next: see the product_specifications cleaning.