# Data Understanding

## EN
The purpose of this notebook is to explore and understand the dataset structure, variables, and potential data quality issues.

This step is essential to define meaningful hypotheses and guide further analysis.

## PT
O objetivo deste notebook é explorar e compreender a estrutura do dataset, suas variáveis e possíveis problemas de qualidade dos dados.

Essa etapa é fundamental para definir hipóteses relevantes e orientar as próximas análises.



## Initial Hypotheses

### EN
- Users with higher engagement (more views and cart additions) are more likely to convert.
- Longer session duration is associated with higher purchase probability.
- Certain product categories may have higher abandonment rates.
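
A first pass at these hypotheses can be sketched with a group-by and a rank correlation. The block below is a minimal illustration on an invented toy frame, not the notebook's actual analysis. The column names come from the dataset's column listing, but two things are assumptions: using `weekly_purchases` as a stand-in for conversion, and the helper names `abandonment_by_category` and `engagement_vs_purchases`, which are hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "product_category_preference": ["Books", "Books", "Toys", "Toys", "Toys"],
        "cart_abandonment_rate": [20, 40, 60, 80, 70],
        "cart_items_average": [2, 4, 6, 10, 8],
        "weekly_purchases": [1, 2, 3, 5, 4],
    }
)


def abandonment_by_category(df: pd.DataFrame) -> pd.Series:
    """Mean cart abandonment rate per preferred category, highest first."""
    return (
        df.groupby("product_category_preference")["cart_abandonment_rate"]
        .mean()
        .sort_values(ascending=False)
    )


def engagement_vs_purchases(df: pd.DataFrame) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumption: weekly_purchases is used as a rough proxy for conversion.
    """
    return df["cart_items_average"].rank().corr(df["weekly_purchases"].rank())


by_category = abandonment_by_category(toy)
rho = engagement_vs_purchases(toy)
print(by_category)
print(f"Spearman rho: {rho:.2f}")
```

On the real data the same two calls would take `df` instead of `toy`. A correlation found this way only describes association; it does not show that engagement causes purchases.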

### PT
- Usuários com maior engajamento (mais visualizações e adições ao carrinho) tendem a converter mais.
- Sessões com maior duração estão associadas a maior probabilidade de compra.
- Algumas categorias de produtos podem apresentar maiores taxas de abandono.


In [3]:
# Import the data

import pandas as pd
from pathlib import Path

# Path expected in the repository
repo_data_path = Path("../data/ecommerce_shopper_behavior.csv")
# Local fallback (not tracked in GitHub)
local_data_path = Path(r"c:\Users\Jenifer\Downloads\e_commerce_shopper_behaviour_and_lifestyle.csv")

# Use the repository copy when it exists, otherwise fall back to the local file
data_path = repo_data_path if repo_data_path.exists() else local_data_path
df = pd.read_csv(data_path)

# df.info() prints its report and returns None, so it is called on its own line
# instead of being packed into a tuple with df.head()
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 60 columns):
 #   Column                           Non-Null Count    Dtype 
---  ------                           --------------    ----- 
 0   user_id                          1000000 non-null  int64 
 1   age                              1000000 non-null  int64 
 2   gender                           1000000 non-null  object
 3   country                          1000000 non-null  object
 4   urban_rural                      1000000 non-null  object
 5   income_level                     1000000 non-null  int64 
 6   employment_status                1000000 non-null  object
 7   education_level                  1000000 non-null  object
 8   relationship_status              1000000 non-null  object
 9   has_children                     1000000 non-null  int64 
 10  household_size                   1000000 non-null  int64 
 11  occupation                       1000000 non-null  object
 12  e

(   user_id  age  gender  country urban_rural  income_level employment_status  \
 0        1   56  Female  Germany    Suburban         90860     Self-employed   
 1        2   69    Male    Japan    Suburban         35423        Unemployed   
 2        3   46  Female    India       Urban         21467     Self-employed   
 3        4   32    Male   Canada       Urban         41770     Self-employed   
 4        5   60  Female    Japan       Urban        183882          Employed   
 
     education_level relationship_status  has_children  ...  \
 0  Associate Degree              Single             0  ...   
 1          Bachelor              Single             1  ...   
 2  Associate Degree             Married             1  ...   
 3          Bachelor             Widowed             0  ...   
 4  Associate Degree             Widowed             1  ...   
 
    cart_items_average checkout_abandonments_per_month  \
 0                  10                               2   
 1              

In [5]:
df.columns

Index(['user_id', 'age', 'gender', 'country', 'urban_rural', 'income_level',
       'employment_status', 'education_level', 'relationship_status',
       'has_children', 'household_size', 'occupation', 'ethnicity',
       'language_preference', 'device_type', 'weekly_purchases',
       'monthly_spend', 'cart_abandonment_rate', 'review_writing_frequency',
       'average_order_value', 'preferred_payment_method',
       'coupon_usage_frequency', 'loyalty_program_member', 'referral_count',
       'product_category_preference', 'shopping_time_of_day',
       'weekend_shopper', 'impulse_purchases_per_month', 'browse_to_buy_ratio',
       'return_frequency', 'budgeting_style', 'brand_loyalty_score',
       'impulse_buying_score', 'environmental_consciousness',
       'health_conscious_shopping', 'travel_frequency', 'hobby_count',
       'social_media_influence_score', 'reading_habits', 'exercise_frequency',
       'stress_from_financial_decisions', 'overall_stress_level',
       'sleep_quali

In [6]:
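# Count missing values per column (dtypes are already listed by df.info() above).
# A bare df.dtypes line before the last expression would not be displayed by the notebook.
df.isnull().sum()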

user_id                            0
age                                0
gender                             0
country                            0
urban_rural                        0
income_level                       0
employment_status                  0
education_level                    0
relationship_status                0
has_children                       0
household_size                     0
occupation                         0
ethnicity                          0
language_preference                0
device_type                        0
weekly_purchases                   0
monthly_spend                      0
cart_abandonment_rate              0
review_writing_frequency           0
average_order_value                0
preferred_payment_method           0
coupon_usage_frequency             0
loyalty_program_member             0
referral_count                     0
product_category_preference        0
shopping_time_of_day               0
weekend_shopper                    0
i
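
Since the missing-value counts above are all zero, other quality checks (distinct values per column, duplicated rows, repeated identifiers) are worth running next. The following is a minimal sketch on an invented toy frame; `quality_summary` is a hypothetical helper name, and the toy values are not taken from the dataset.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Toy frame: one missing gender and one fully duplicated row (the last two rows).
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 3],
        "gender": ["Female", None, "Male", "Male"],
        "age": [56, 69, 46, 46],
    }
)

summary = quality_summary(toy)
n_duplicate_rows = int(toy.duplicated().sum())
n_duplicate_ids = int(toy["user_id"].duplicated().sum())

print(summary)
print("duplicate rows:", n_duplicate_rows, "| duplicate ids:", n_duplicate_ids)
```

Applied to the full data, `quality_summary(df)` returns one row per column, which is easier to scan than separate `dtypes` and `isnull()` outputs.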

## Variables Overview

### EN
Below is an initial understanding of the dataset variables based on their names and observed values.

This interpretation may be refined during the analysis.
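
One way to look at observed values, sketched here under assumptions, is to split columns by type: summary statistics for numeric columns and the most frequent levels for the rest. The five toy rows reuse the values shown in the `df.head()` output; `describe_by_type` and its `max_levels` parameter are hypothetical names introduced for illustration.

```python
import pandas as pd


def describe_by_type(df: pd.DataFrame, max_levels: int = 5):
    """Return (numeric summary table, dict of top value counts per non-numeric column)."""
    numeric = df.select_dtypes(include="number").describe().T
    categorical = {
        col: df[col].value_counts(dropna=False).head(max_levels)
        for col in df.select_dtypes(exclude="number").columns
    }
    return numeric, categorical


# First five rows of four columns, as shown in the df.head() output.
toy = pd.DataFrame(
    {
        "age": [56, 69, 46, 32, 60],
        "income_level": [90860, 35423, 21467, 41770, 183882],
        "gender": ["Female", "Male", "Female", "Male", "Female"],
        "country": ["Germany", "Japan", "India", "Canada", "Japan"],
    }
)

numeric, categorical = describe_by_type(toy)
print(numeric[["min", "mean", "max"]])
for name, counts in categorical.items():
    print(f"\n{name}:\n{counts}")
```

Integer-coded flags such as `has_children` land in the numeric table with this split, so they may need to be reviewed as categories instead.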

### PT
Abaixo está um entendimento inicial das variáveis do dataset com base em seus nomes e valores observados.

Essa interpretação pode ser refinada ao longo da análise.


In [8]:
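# Re-check column names and dtypes before grouping the variables.
# df.head() is dropped here: only the last expression of a cell is displayed.
df.info()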


## Variables Overview

The dataset contains demographic, behavioral, transactional, and engagement-related variables describing e-commerce users.

Below is a high-level overview of the main variable groups.

### Demographic Information
Variables related to user profile and socio-demographic characteristics, such as age, gender, country, education level, income level, and household composition.

Examples:
- user_id
- age
- gender
- country
- income_level
- education_level
- household_size

### Behavioral and Usage Data
Variables describing how users interact with the e-commerce platform, including purchasing frequency, browsing behavior, and cart abandonment patterns.

Examples:
- weekly_purchases
- monthly_spend
- cart_abandonment_rate
- average_order_value
- device_type

### Engagement and Loyalty Indicators
Variables related to user engagement, loyalty, and long-term relationship with the platform.

Examples:
- review_writing_frequency
- premium_subscription
- return_rate

### Socioeconomic and Lifestyle Attributes
Variables representing employment, occupation, family status, language preference, and other lifestyle indicators that may influence purchasing behavior.

Examples:
- employment_status
- occupation
- relationship_status
- has_children
- language_preference
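
The grouping above can be kept as a plain dictionary and checked against the loaded columns, so that a misspelled or absent example column is caught early. This is a sketch under assumptions: the group keys and the helper `missing_group_columns` are hypothetical, and the toy frame deliberately omits one column to show what a mismatch looks like. It says nothing about which columns the real file contains.

```python
import pandas as pd

# Group labels are hypothetical shorthand; the column names are the examples listed above.
VARIABLE_GROUPS = {
    "demographic": [
        "user_id", "age", "gender", "country",
        "income_level", "education_level", "household_size",
    ],
    "behavioral": [
        "weekly_purchases", "monthly_spend", "cart_abandonment_rate",
        "average_order_value", "device_type",
    ],
    "engagement": ["review_writing_frequency", "premium_subscription", "return_rate"],
    "lifestyle": [
        "employment_status", "occupation", "relationship_status",
        "has_children", "language_preference",
    ],
}


def missing_group_columns(df: pd.DataFrame, groups: dict) -> dict:
    """Map each group to its listed columns that are absent from df (empty groups omitted)."""
    present = set(df.columns)
    missing = {}
    for group, columns in groups.items():
        absent = [col for col in columns if col not in present]
        if absent:
            missing[group] = absent
    return missing


# Toy frame with every listed column except "return_rate", left out on purpose.
all_columns = [col for cols in VARIABLE_GROUPS.values() for col in cols]
toy = pd.DataFrame(columns=[col for col in all_columns if col != "return_rate"])

print(missing_group_columns(toy, VARIABLE_GROUPS))
```

Running `missing_group_columns(df, VARIABLE_GROUPS)` on the real data should return an empty dictionary if every example column exists under the name used here.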

## Visão Geral das Variáveis

O dataset contém variáveis demográficas, comportamentais, transacionais e de engajamento que descrevem usuários de e-commerce.

Abaixo está uma visão geral dos principais grupos de variáveis.

### Informações Demográficas
Variáveis relacionadas ao perfil do usuário e características sociodemográficas, como idade, gênero, país, nível educacional, renda e composição familiar.

Exemplos:
- user_id
- age
- gender
- country
- income_level
- education_level
- household_size

### Dados Comportamentais e de Uso
Variáveis que descrevem como os usuários interagem com a plataforma de e-commerce, incluindo frequência de compras, comportamento de navegação e abandono de carrinho.

Exemplos:
- weekly_purchases
- monthly_spend
- cart_abandonment_rate
- average_order_value
- device_type

### Indicadores de Engajamento e Fidelidade
Variáveis relacionadas ao engajamento do usuário, fidelidade e relacionamento de longo prazo com a plataforma.

Exemplos:
- review_writing_frequency
- premium_subscription
- return_rate

### Atributos Socioeconômicos e de Estilo de Vida
Variáveis que representam situação profissional, ocupação, estado civil, preferências de idioma e outros fatores de estilo de vida que podem influenciar o comportamento de compra.

Exemplos:
- employment_status
- occupation
- relationship_status
- has_children
- language_preference
