# Python DS-ao-Dev

## Business Problem

**Star Jeans Company**

- Eduardo and Marcelo are two Brazilians, friends and business partners. After several successful ventures, they are planning to enter the US fashion market with an e-commerce business model.

- The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and scale as they acquire customers.

- However, even with the entry product and the audience defined, the two partners have no experience in the fashion market and therefore do not know how to define basics such as the price, the type of pants, and the material used to manufacture each piece.

- So the two partners hired a Data Science consultancy to answer the following questions:
    1. What is the best selling price for the pants?
    2. How many types of pants, and in which colors, should the initial product have?
    3. Which raw materials are needed to make the pants?
    
- Star Jeans' main competitors are the American companies H&M and Macy's.

## Solution Planning (Input-Output-Tasks)

**Business Question**

- What is the best price for jeans?

1. Input:
    1. Data sources
        - H&M website: https://www2.hm.com/en_us/men/products/jeans.html
        - Macy's website: https://www.macys.com/shop/mens-clothing/mens-jeans
    2. Tools
        - Python 3.8.0
        - Web scraping libraries ( BS4, Selenium )
        - PyCharm
        - Jupyter Notebook ( analysis and prototyping )
        - Cron job, Airflow
        - Streamlit
    
2. Output:
    1. The answer to the question.
        - Median of the competitors' prices.
    2. Delivery format
        - Table or chart
    3. Delivery location
        - Streamlit app
    
3. Tasks:
    1. Step by step to build the median or mean calculation
        - Compute the median by product, type, and color
    2. Define the delivery format ( visualization, table, sentence )
        - Bar chart with the median product price, by type and color, over the last 30 days
        - Table with the following columns: id | product_name | product_type | product_color | produ
        - Schema definition: columns and their types
        - Storage infrastructure definition ( SQLITE3 )
        - ETL design ( extraction, transformation, and load scripts )
        - Script scheduling plan ( dependencies between the scripts )
        - Build the visualizations
        - Deliver the final product
    3. Decide the delivery location ( Power BI, Telegram, Email, Streamlit, Intranet )
        - App with Streamlit
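
The median calculation planned in the tasks above can be sketched with pandas. This is a minimal illustration, not the project's final implementation: the sample rows are invented, the column names `product_type`, `product_color`, `product_price`, and `scrapy_datetime` are assumed from the planned table, and the reference date is arbitrary.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped competitor table;
# the values are made up for illustration.
prices = pd.DataFrame({
    "product_type": ["skinny", "skinny", "skinny", "slim", "slim"],
    "product_color": ["black", "black", "blue", "blue", "blue"],
    "product_price": [39.99, 29.99, 19.99, 29.99, 49.99],
    "scrapy_datetime": pd.to_datetime(
        ["2021-06-01", "2021-05-20", "2021-05-25", "2021-03-01", "2021-05-30"]
    ),
})

# Keep only rows scraped in the 30 days before the (assumed) reference date.
reference_date = pd.Timestamp("2021-06-01")
cutoff = reference_date - pd.Timedelta(days=30)
recent = prices[prices["scrapy_datetime"] >= cutoff]

# Median price per product type and color.
median_prices = (
    recent.groupby(["product_type", "product_color"])["product_price"]
    .median()
    .reset_index(name="median_price")
)
print(median_prices)
```

The resulting frame has one row per type and color pair, which is the shape a bar chart or a Streamlit table would consume.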

## Business Models

“How you plan to make money”, Michael Lewis

“A business model describes the rationale of how an organization creates, delivers, and captures value”, Alexander Osterwalder

- E-commerce:
    1. Revenue: Sales of a product.
    2. Examples: Lojas Riachuelo, Submarino, Magazine Luiza, etc.

- Software as a Service ( SaaS ):
    1. Revenue: Monthly or annual subscription, charged for usage or per user.
    2. Examples: Looker, Asana, Gmail, Salesforce.

- Service:
    1. Revenue: Services billed by time or by project.
    2. Examples: Sul América, Porto Seguro, Mapfre.

- Mobile App:
    1. Revenue: Sale of upgrades.
    2. Examples: Wildlife, Ubisoft, mobile games.

- Media Site:
    1. Revenue: Charging per click or per view of a given ad.
    2. Examples: Facebook, Google, UOL, G1, etc.

- Marketplace:
    1. Revenue: Fee on the transaction between the two sides, for example the passenger and the driver.
    2. Examples: Uber, Ifood, 99, Elo7, Submarino.

## E-commerce Metrics

- **Growth Metrics**:
    1. Market share percentage
    2. Number of new customers
- **Revenue Metrics**:
    1. Number of sales
    2. Average ticket
    3. LTV ( Lifetime Value )
    4. Average recency
    5. Average basket size
    6. Average markup
- **Cost Metrics**:
    1. CAC ( Customer Acquisition Cost )
    2. Average discount
    3. Production cost
    4. Return rate
    5. Fixed costs ( payroll, office, software )
    6. Taxes
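
To make a few of these metrics concrete, the sketch below computes average ticket, average basket size, CAC, and markup from invented monthly figures. The formulas are the commonly used definitions, assumed here because the notebook does not spell them out, and every number and variable name is hypothetical.

```python
# Hypothetical monthly figures, invented for illustration only.
revenue = 12_000.00          # total sales revenue in the period
orders = 300                 # number of orders placed
items_sold = 450             # number of items across all orders
marketing_spend = 1_500.00   # spend used to acquire customers
new_customers = 100          # customers acquired in the period
unit_price = 40.00           # selling price of one pair of jeans
unit_cost = 25.00            # production cost of one pair of jeans

average_ticket = revenue / orders               # revenue per order
average_basket_size = items_sold / orders       # items per order
cac = marketing_spend / new_customers           # cost per new customer
markup = (unit_price - unit_cost) / unit_cost   # price increase over cost

print(f"Average ticket: {average_ticket:.2f}")
print(f"Average basket size: {average_basket_size:.2f}")
print(f"CAC: {cac:.2f}")
print(f"Markup: {markup:.0%}")
```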

In [None]:
from IPython.display import Image
Image(filename='/home/marxcerqueira/repos/Data-Science-Projects/pa005_insiders_clustering/pa005_marx_cerqueira/reports/figures/mapa_metricas_e_commerce.png')

# Imports

In [1]:
import re
import numpy as np
import pandas as pd

In [6]:
pwd

'/home/marxcerqueira/repos/python-ds-ao-dev'

# Loading Data (Web Scraping)

In [25]:
data = pd.read_csv( '/home/marxcerqueira/repos/python-ds-ao-dev/products_hm.csv' )

In [10]:
data.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,Fit,Composition,Size,Product safety
0,0,690449022,men_jeans_skinny,Skinny Jeans,$ 39.99,2021-06-01 08:57:37,690449,22,Light denim blue/trashed,Skinny fit,Lining: Polyester 100%,"The model is 184cm/6'0"" and wears a size 31/32",
1,1,690449022,men_jeans_skinny,Skinny Jeans,$ 39.99,2021-06-01 08:57:37,690449,22,Light denim blue/trashed,Skinny fit,"Cotton 98%, Elastane 2%","The model is 184cm/6'0"" and wears a size 31/32",
2,2,690449022,men_jeans_skinny,Skinny Jeans,$ 39.99,2021-06-01 08:57:37,690449,22,Denim blue,Skinny fit,Lining: Polyester 100%,"The model is 184cm/6'0"" and wears a size 31/32",
3,3,690449022,men_jeans_skinny,Skinny Jeans,$ 39.99,2021-06-01 08:57:37,690449,22,Denim blue,Skinny fit,"Cotton 98%, Elastane 2%","The model is 184cm/6'0"" and wears a size 31/32",
4,4,690449022,men_jeans_skinny,Skinny Jeans,$ 39.99,2021-06-01 08:57:37,690449,22,Black/washed,Skinny fit,Lining: Polyester 100%,"The model is 184cm/6'0"" and wears a size 31/32",
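
The `product_price` and `Size` values in the output above are plain strings, so the numbers have to be extracted before any analysis. A minimal sketch using only the standard library and two sample strings copied from that output; the variable names are illustrative and not part of the notebook.

```python
import re

# Sample raw values copied from the table above.
raw_price = "$ 39.99"
raw_size = "The model is 184cm/6'0\" and wears a size 31/32"

# Price: drop the currency symbol and surrounding spaces, then convert.
price = float(raw_price.replace("$", "").strip())

# Model height: three digits immediately followed by "cm".
height_match = re.search(r"(\d{3})cm", raw_size)
model_height_cm = int(height_match.group(1)) if height_match else None

# Jeans size worn by the model: two numbers separated by a slash.
size_match = re.search(r"\d+/\d+", raw_size)
model_size = size_match.group(0) if size_match else None

print(price, model_height_cm, model_size)
```

Guarding each match with `if ... else None` keeps the extraction from raising on rows whose text does not follow this pattern.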


In [45]:
cols = ['product_id','product_category','product_name','product_price','scrapy_datetime','style_id','color_id','color_name','fit','composition','size', 'product_safety']

In [11]:
# check for NA values
data.isna().sum()

Unnamed: 0             0
product_id             0
product_category      96
product_name           0
product_price          0
scrapy_datetime        0
style_id               0
color_id               0
color_name             1
Fit                    1
Composition            1
Size                1032
Product safety      1832
dtype: int64

In [None]:
# inspect the product category from row 471 onward
data['product_category'].iloc[471:]

In [20]:
data[data['product_category'].isna()]

Unnamed: 0.1,Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,Fit,Composition,Size,Product safety
481,481,985159001,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,1,Denim blue,Skinny fit,"Shell: Cotton 99%, Elastane 1%","The model is 187cm/6'2"" and wears a size 31/32",
482,482,985159001,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,1,Denim blue,Skinny fit,Pocket lining: Cotton 100%,"The model is 187cm/6'2"" and wears a size 31/32",
483,483,985159001,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,1,Dark gray,Skinny fit,"Shell: Cotton 99%, Elastane 1%","The model is 187cm/6'2"" and wears a size 31/32",
484,484,985159001,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,1,Dark gray,Skinny fit,Pocket lining: Cotton 100%,"The model is 187cm/6'2"" and wears a size 31/32",
485,485,985159001,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,1,Light denim blue,Skinny fit,"Shell: Cotton 99%, Elastane 1%","The model is 187cm/6'2"" and wears a size 31/32",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1343,1343,985159002,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,2,Black,Skinny fit,"Shell: Cotton 99%, Elastane 1%","The model is 187cm/6'2"" and wears a size 31/32",
1344,1344,985159002,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,2,Dark gray,Skinny fit,Pocket lining: Cotton 100%,"The model is 187cm/6'2"" and wears a size 31/32",
1345,1345,985159002,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,2,Dark gray,Skinny fit,"Shell: Cotton 99%, Elastane 1%","The model is 187cm/6'2"" and wears a size 31/32",
1346,1346,985159002,,Skinny Jeans,$ 19.99,2021-06-01 08:57:37,985159,2,Light denim blue,Skinny fit,Pocket lining: Cotton 100%,"The model is 187cm/6'2"" and wears a size 31/32",


In [18]:
data['product_category'].unique()

array(['men_jeans_skinny', 'men_jeans_slim', nan, 'men_jeans_tapered',
       'men_jeans_regular'], dtype=object)

In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1765 entries, 0 to 1860
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   product_id        1765 non-null   int64         
 1   product_category  1765 non-null   object        
 2   product_name      1765 non-null   object        
 3   product_price     1765 non-null   float64       
 4   scrapy_datetime   1765 non-null   datetime64[ns]
 5   style_id          1765 non-null   int64         
 6   color_id          1765 non-null   int64         
 7   color_name        1764 non-null   object        
 8   Fit               1764 non-null   object        
 9   Composition       1764 non-null   object        
 10  Size              781 non-null    object        
 11  Product safety    29 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(7)
memory usage: 179.3+ KB


In [63]:
data['composition'].unique()

array(['Cotton 98%, Elastane 2%',
       'Cotton 88%, Polyester 10%, Elastane 2%',
       'Cotton 89%, Polyester 10%, Elastane 1%',
       'Cotton 93%, Polyester 6%, Elastane 1%', 'Cotton 99%, Elastane 1%',
       'Cotton 80%, Polyester 19%, Elastane 1%',
       'Cotton 73%, Polyester 26%, Elastane 1%',
       'Cotton 90%, Polyester 8%, Elastane 2%', 'Cotton 100%',
       'Cotton 90%, Elasterell-P 8%, Elastane 2%', nan], dtype=object)
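
Composition strings like the ones above can be turned into one numeric column per material. The sketch below is one possible approach using `Series.str.extract`, not necessarily the one this notebook ends up with. The sample values are copied from the output above, the `materials` mapping and the `parsed` frame are hypothetical names, and treating an absent material as 0 is an assumption.

```python
import pandas as pd

# Sample composition values copied from the output above, plus a missing one.
composition = pd.Series([
    "Cotton 98%, Elastane 2%",
    "Cotton 88%, Polyester 10%, Elastane 2%",
    "Cotton 90%, Elasterell-P 8%, Elastane 2%",
    "Cotton 100%",
    None,
])

# Output column name -> label used in the raw text (hypothetical mapping).
materials = {
    "cotton": "Cotton",
    "polyester": "Polyester",
    "elastane": "Elastane",
    "elasterell": "Elasterell-P",
}

parsed = pd.DataFrame(index=composition.index)
for column, label in materials.items():
    # Capture the integer percentage that follows the material label.
    pct = composition.str.extract(rf"{label} (\d+)%", expand=False)
    # Assumption: a material that is not listed counts as 0.
    parsed[column] = pct.astype(float).fillna(0) / 100

print(parsed)
```

Rows that describe linings or pockets (for example `Pocket lining: Cotton 100%`) would also match this pattern, so they need to be filtered out beforehand.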

In [137]:
data = pd.read_csv( '/home/marxcerqueira/repos/python-ds-ao-dev/products_hm.csv' )
data = data.drop(['Unnamed: 0'], axis = 1)
data.columns = cols

# product id
data['product_id'] = data['product_id'].astype(int)

# product category
data = data.dropna(subset = ['product_category'])
# product name
data['product_name'] = data['product_name'].apply(lambda x: x.replace(' ', '_').lower())
 
# product price
data['product_price'] = data['product_price'].apply(lambda x: x.replace('$ ', '')).astype(float)
 
# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime(data['scrapy_datetime'], format = '%Y-%m-%d %H:%M:%S')

# style id

# color id
 
# color name
data['color_name'] = data['color_name'].apply(lambda x: x.replace(' ','_').lower() if pd.notnull(x) else x)

# fit
data['fit'] = data['fit'].apply(lambda x: x.replace(' ', '_').lower() if pd.notnull(x) else x)

# size
data['size_number'] = data['size'].apply( lambda x: re.search( r'\d{3}cm', x ).group(0) if pd.notnull( x ) else x )
data['size_number'] = data['size_number'].apply(lambda x: re.search(r'\d+', x).group(0) if pd.notnull(x) else x)

# size model
data['size_model'] = data['size'].str.extract(r'(\d+/\d+)')


# # product safety
#     #dropped


# # composition
# data = data[~data['composition'].str.contains('Pocket lining:', na = False)]
# data = data[~data['composition'].str.contains('Lining:', na = False)]
# data = data[~data['composition'].str.contains('Shell:', na = False)]

# #break composition by comma
# df1 = data['composition'].str.split(',', expand = True)

# # cotton / polyester / elastane / elasterell
# df_ref = pd.DataFrame( index = np.arange( len(data) ), columns = ['cotton', 'polyester', 'elastane', 'elasterell'])

# # cotton
# df_cotton = df1[0]
# df_cotton.name = 'cotton'

# df_ref = pd.concat([df_ref, df_cotton], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] # drop duplicated columns, keeping the last occurrence

# # polyester
# df_polyester = df1.loc[df1[1].str.contains('Polyester', na=True), 1]
# df_polyester.name = 'polyester'

# df_ref =pd.concat([df_ref, df_polyester], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]

# # elastane
# df_elastane = df1.loc[df1[1].str.contains('Elastane', na=True), 1]
# df_elastane.name = 'elastane'

# df_ref = pd.concat([df_ref, df_elastane], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]

# # elasterell
# df_elasterell = df1.loc[df1[1].str.contains( 'Elasterell', na=True ), 1]
# df_elasterell.name = 'elasterell'

# df_ref = pd.concat( [df_ref, df_elasterell], axis=1 )
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]

# # final join
# data = pd.concat([data, df_ref], axis = 1)

# # format composition data
# data['cotton'] = data['cotton'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
# data['polyester'] = data['polyester'].apply( lambda x: int( re.search( '\d+', x).group(0) ) / 100 if pd.notnull( x ) else x )
# data['elastane'] = data['elastane'].apply( lambda x: int( re.search( '\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
# data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search('\d+',x ).group(0) ) / 100 if pd.notnull( x ) else x )

## Drop columns
# data = data.drop(['size', 'product_safety'], axis = 1)

# Drop duplicates

# data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,polyester,elastane,elasterell
0,,,,,NaT,,,,,,,,,,,
1,690449022.0,men_jeans_skinny,skinny_jeans,39.99,2021-06-01 08:57:37,690449.0,22.0,light_denim_blue/trashed,skinny_fit,"Cotton 98%, Elastane 2%",184.0,31/32,0.98,,0.02,
2,,,,,NaT,,,,,,,,,,,
3,690449022.0,men_jeans_skinny,skinny_jeans,39.99,2021-06-01 08:57:37,690449.0,22.0,denim_blue,skinny_fit,"Cotton 98%, Elastane 2%",184.0,31/32,0.98,,0.02,
4,,,,,NaT,,,,,,,,,,,


In [70]:
data.shape

(972, 12)


In [84]:
df1.head()

Unnamed: 0,0,1,2
1,Cotton 98%,Elastane 2%,
3,Cotton 98%,Elastane 2%,
5,Cotton 98%,Elastane 2%,
7,Cotton 98%,Elastane 2%,
9,Cotton 98%,Elastane 2%,


In [136]:
data.tail(30)

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model,cotton,polyester,elastane,elasterell
1831,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,denim_blue,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1832,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,black/washed_out,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1833,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,dark_blue_denim,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1834,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,dark_gray,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1835,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,light_gray,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1836,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,dark_blue,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1837,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,dark_denim_blue,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1838,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,denim_blue,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1839,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,light_blue,slim_fit,"Cotton 98%, Elastane 2%",187.0,31/32,0.98,,0.02,
1840,751994009.0,men_jeans_slim,slim_jeans,29.99,2021-06-01 08:57:37,751994.0,9.0,dark_gray,slim_fit,"Cotton 98%, Elastane 2%",188.0,31/32,0.98,,0.02,


In [None]:
data = pd.read_csv( '/home/marxcerqueira/repos/ds-ao-dev/products_hm.csv' )

# normalize column names to lowercase: the raw CSV shown earlier uses 'Fit', 'Composition', 'Size' and 'Product safety'
data.columns = data.columns.str.lower()

# product id
data = data.dropna( subset=['product_id'] )
data['product_id'] = data['product_id'].astype( int )

# product name
data['product_name'] = data['product_name'].apply( lambda x: x.replace( ' ', '_' ).lower() )

# product price
data['product_price'] = data['product_price'].apply( lambda x: x.replace( '$ ', '' ) ).astype( float )

# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime( data['scrapy_datetime'], format='%Y-%m-%d %H:%M:%S' )

# style id
data['style_id'] = data['style_id'].astype( int )

# color id
data['color_id'] = data['color_id'].astype( int )

# color name
data['color_name'] = data['color_name'].apply( lambda x: x.replace( ' ', '_' ).replace( '/', '_' ).lower() if pd.notnull( x ) else x )

# fit
data['fit'] = data['fit'].apply( lambda x: x.replace( ' ', '_' ).lower() if pd.notnull( x ) else x )
                   
# size number
data['size_number'] = data['size'].apply( lambda x: re.search( r'\d{3}cm', x ).group(0) if pd.notnull( x ) else x )
data['size_number'] = data['size_number'].apply( lambda x: re.search( r'\d+', x ).group(0) if pd.notnull( x ) else x )

# size model
data['size_model'] = data['size'].str.extract( r'(\d+/\d+)' )
                   
# composition
data = data[~data['composition'].str.contains( 'Pocket lining:', na=False )]
data = data[~data['composition'].str.contains( 'Lining:', na=False )]
data = data[~data['composition'].str.contains( 'Shell:', na=False )]
                   
# drop duplicates
data = data.drop_duplicates( subset=['product_id', 'product_category','product_name', 'product_price',
'scrapy_datetime', 'style_id', 'color_id','color_name', 'fit'], keep='last' )
                   
# reset index
data = data.reset_index( drop=True )
                   
# break composition by comma
df1 = data['composition'].str.split( ',', expand=True )
                   
# cotton | polyester | elastane | elasterell
df_ref = pd.DataFrame( index=np.arange( len( data ) ), columns=['cotton', 'polyester', 'elastane', 'elasterell'] )
                   
# cotton
df_cotton = df1[0]
                   
df_cotton.name = 'cotton'
df_ref = pd.concat( [df_ref, df_cotton ], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last')]
df_ref['cotton'] = df_ref['cotton'].fillna( 'Cotton 0%' )
                   
# polyester
                
df_polyester = df1.loc[df1[1].str.contains( 'Polyester', na=True ), 1]
df_polyester.name = 'polyester'
df_ref = pd.concat( [df_ref, df_polyester], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['polyester'] = df_ref['polyester'].fillna( 'Polyester 0%' )
                   
# elastane
df_elastane = df1.loc[df1[1].str.contains( 'Elastane', na=True ), 1]
df_elastane.name = 'elastane'
                   
# combine elastane from both columns 1 and 2
df_elastane = df_elastane.combine_first( df1[2] )
df_ref = pd.concat( [df_ref, df_elastane], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['elastane'] = df_ref['elastane'].fillna( 'Elastane 0%' )
                   
# elasterell
df_elasterell = df1.loc[df1[1].str.contains( 'Elasterell', na=True ), 1]
df_elasterell.name = 'elasterell'
df_ref = pd.concat( [df_ref, df_elasterell], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['elasterell'] = df_ref['elasterell'].fillna( 'Elasterell-P 0%' )
                   
# final join
                
data = pd.concat( [data, df_ref], axis=1 )
                   
# format composition data
        
data['cotton'] = data['cotton'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['polyester'] = data['polyester'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elastane'] = data['elastane'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
                   
# Drop columns
data = data.drop( columns=['size', 'product safety', 'composition'] )

# Drop duplicates
data = data.drop_duplicates()
data.shape
        