- __Stages Overview__ <br>
[&gt; 1. IDA](01-initial-data-analysis.ipynb) [&gt; 2. EDA](02-exploratory-data-analysis.ipynb) [&gt; 3. Feature Eng.](03-feature-engineering.ipynb) [&gt; 4. Modeling & Evaluation](04-modeling-and-evaluation.ipynb)

# 1. INITIAL DATA ANALYSIS: Preparation

## Short Description

### Briefing

- #### Identify and resolve potential quality issues in the raw research data, ensuring the dataset is ready to progress through the research stages.
    - Summary: identify > resolve > quality issues > raw data > research progression

### Debriefing

- #### Data issues were discussed, identified and successfully resolved in the following subjects:
    - Duplicate Data;
    - Missing Values/Data Loss;
    - Format/Type of Data;
    - Outliers.
- #### Pre-processed datasets of sufficient quality were generated, allowing the research to progress to the next stages.
    - Summary: pre-processed > raw data > with quality > allowed > research progression

## Materials and Methods

### Data

- __Data overview__
    - The Online Retail II UCI dataset is used in this data science research in its raw (original) state, without any prior treatment, cleaning or inspection.
- __Online Retail II UCI description__ 
    - "A real online retail transaction data set of two years. Dataset contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers."
- __Online Retail II UCI data dictionary__
    - __Invoice__: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation;
    - __StockCode__: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product;
    - __Description__: Product (item) name. Nominal;
    - __Quantity__: The quantities of each product (item) per transaction. Numeric;
    - __InvoiceDate__: Invoice date and time. Numeric. The day and time when a transaction was generated;
    - __UnitPrice__: Unit price. Numeric. Product price per unit in sterling (£);
    - __CustomerID__: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer;
    - __Country__: Country name. Nominal. The name of the country where a customer resides.
- __Dataset source | Dataset info and dictionary source__
    - https://www.kaggle.com/datasets/mashlyn/online-retail-ii-uci

In [None]:
raw_data_path = './data/raw_data.csv'

### Technologies

- __Pandas__
    - DataFrames and Series for data manipulation, analysis and cleaning.

In [None]:
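import os

import pandas as pd
pd.set_option('display.max_colwidth', None)

from IPython.display import display

from src.utils.analytics import DataIssues

# If the notebook is running from the `notebooks` folder, move one level up
# so that relative paths such as `raw_data_path` resolve from the parent folder.
current_directory = os.getcwd()
if os.path.basename(current_directory) == 'notebooks':
    os.chdir('..')
    from utils.analytics import DataIssues

# Load the raw dataset used by every cell below.
raw_data = pd.read_csv(raw_data_path)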

### Procedures

1. The official data dictionary for the raw data provided essential information about the data structure, making it easier to understand its format and content efficiently. This resource was crucial in guiding the subsequent data preparation and analysis. Checking whether the raw data conform to the official dictionary is a critical step in ensuring data quality and integrity.

2. Pandas was used to verify the conformity of the data structure, carry out assessments and implement the necessary treatment. The processes performed included the detection of duplicate data, missing values, inadequate formats and outliers. Several sources and citations emphasized the importance of preparing data properly, since poorly prepared data may not be accepted by machine learning algorithms and can complicate data analysis.

3. The raw data for the study consists of over one million observations. Due to the large volume and quality issues identified in the data structure, the primary approach adopted was to remove the compromised data entirely, without applying significance tests. This approach aims to prioritize obtaining data of sufficient quality for analysis and modeling, while balancing compliance with the study’s timelines. While removal without significance analysis may have limitations, steps were taken to keep all data and processes thoroughly documented in the study repository. This will allow the documentation to provide a solid foundation for future investigations and contribute to the accumulated knowledge about the business problem at hand.
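
The conformity check described in step 1 can be sketched as follows. This is only an illustration under assumptions: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names, and the expected column list is transcribed from the data dictionary, using `Price` and `Customer ID` (the names this notebook's code and results refer to) in place of the dictionary's `UnitPrice` and `CustomerID`.

```python
import pandas as pd

# Hypothetical expected schema, transcribed from the data dictionary.
# `Price` and `Customer ID` follow the names used later in this notebook,
# which differ from the dictionary's `UnitPrice` and `CustomerID`.
EXPECTED_COLUMNS = [
    "Invoice", "StockCode", "Description", "Quantity",
    "InvoiceDate", "Price", "Customer ID", "Country",
]


def check_schema(data: pd.DataFrame, expected: list[str]) -> dict[str, list[str]]:
    """Report expected columns that are absent and columns that are not expected."""
    actual = list(data.columns)
    return {
        "missing": [column for column in expected if column not in actual],
        "unexpected": [column for column in actual if column not in expected],
    }


# Toy frame (not the real dataset) that still uses the dictionary's `UnitPrice`.
sample = pd.DataFrame(
    {"Invoice": ["489434"], "StockCode": ["85048"], "Quantity": [12], "UnitPrice": [6.95]}
)
report = check_schema(sample, EXPECTED_COLUMNS)
print(report)
```

Running `check_schema(raw_data, EXPECTED_COLUMNS)` on the loaded dataset would flag any mismatch between the file and the dictionary before further processing.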

## Results and Discussion

In [None]:
display(raw_data)

In [None]:
rdi = DataIssues(raw_data)

In [None]:
rdi.count_missing_data_issues()

In [None]:
rdi.get_data_types()

In [None]:
rdi.count_outliers()

## OUTPUTS (__issues overview__, __resolving__ actions and full __preprocessing__ implementation)

- ### The raw data has __six quality issues__ across __five variables__ that need to be __resolved__.

- ### Identifi**ed** Data Issues In:
    - #### Missing Values/Data Loss:
        - `Description` (4,382 missing values)
        - `Customer ID` (243,007 missing values)
    - #### Format/Type of Data:
        - `InvoiceDate` (needs to be converted to datetime)
        - `Customer ID` (needs to be converted to object)
    - #### Outliers:
        - `Quantity` (has 116,489 outliers)
        - `Price` (has 68,105 outliers)
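
The implementation of `DataIssues` is not shown in this notebook, so the following is only a minimal sketch of how per-column counts of this kind could be obtained with plain Pandas. It assumes the same 1.5 × IQR rule used by the outlier removal further down; the function names and the toy data are hypothetical.

```python
import pandas as pd


def count_missing(data: pd.DataFrame) -> pd.Series:
    """Number of missing values in each column."""
    return data.isna().sum()


def count_iqr_outliers(data: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Number of values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = data.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # The per-column bounds (Series) align with the DataFrame columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Quantity": [1, 2, 2, 3, 3, 4, 500],
    "Price": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    "Description": ["a", None, "c", "d", "e", None, "g"],
})
missing_counts = count_missing(toy)
outlier_counts = count_iqr_outliers(toy)
print(missing_counts)
print(outlier_counts)
```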

In [None]:
raw_data_to_preprocess__missings_types_outliers = raw_data.copy()

In [None]:
def resolve_missing_data_issues(data: pd.DataFrame) -> pd.DataFrame:
    # Drop every row that has a missing value in any column; the input is left untouched.
    return data.dropna()

raw_data_to_preprocess__types_outliers = resolve_missing_data_issues(raw_data_to_preprocess__missings_types_outliers)

In [None]:
def resolve_data_types(data: pd.DataFrame) -> pd.DataFrame:
    df = data.copy()
    if not pd.api.types.is_datetime64_any_dtype(df['InvoiceDate']):
        df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    if not pd.api.types.is_string_dtype(df['Customer ID']):
        df['Customer ID'] = df['Customer ID'].astype(int)
        df['Customer ID'] = df['Customer ID'].astype(str) 
    return df
raw_data_to_preprocess__outliers = resolve_data_types(raw_data_to_preprocess__types_outliers)

In [None]:
# Reorganize the data
preprocessed_data__with_outliers = raw_data_to_preprocess__outliers

del raw_data_to_preprocess__missings_types_outliers
del raw_data_to_preprocess__types_outliers
del raw_data_to_preprocess__outliers

In [None]:
def resolve_outliers_iteratively(data: pd.DataFrame, max_iterations=100) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Repeatedly remove rows that fall outside the 1.5 * IQR fences of any numeric
    # column, recomputing the fences on the remaining rows after each pass.
    df = data.copy()
    all_outliers = pd.DataFrame()
    for i in range(max_iterations):
        df_with_outliers = df.copy()

        for column in df.columns:
            if pd.api.types.is_numeric_dtype(df[column]):
                Q1 = df[column].quantile(0.25)
                Q3 = df[column].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                print(f'i{i + 1} | variable: {column} - lower bound: {lower_bound} <-> upper bound: {upper_bound}')

                df_with_outliers[f'outlier_{column}'] = (df[column] < lower_bound) | (df[column] > upper_bound)

        # A row is an outlier if it is flagged in at least one numeric column.
        outlier_columns = [col for col in df_with_outliers.columns if col.startswith('outlier_')]
        df_with_outliers['is_outlier'] = df_with_outliers[outlier_columns].any(axis=1)

        df_without_outliers = df_with_outliers[~df_with_outliers['is_outlier']].copy()
        new_isolated_outliers = df_with_outliers[df_with_outliers['is_outlier']].copy()

        # Keep the removed rows (with their flag columns) for later inspection.
        all_outliers = pd.concat([all_outliers, new_isolated_outliers], ignore_index=True)

        df_without_outliers.drop(outlier_columns + ['is_outlier'], axis=1, inplace=True)

        # Stop as soon as a pass removes nothing.
        if df.shape[0] == df_without_outliers.shape[0]:
            print(f'No more outliers detected in iteration {i + 1}.')
            print('Dataset is now outlier-free.')
            break

        df = df_without_outliers

    return df, all_outliers
preprocessed_data, preprocessed_data_isolated_outliers = resolve_outliers_iteratively(preprocessed_data__with_outliers)
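
Why the removal is repeated can be illustrated on a single toy series. This is a simplified sketch, not the notebook's function: `trim_iqr_outliers` is a hypothetical helper that applies the same 1.5 × IQR fences to one `Series`, and the values are invented for the example.

```python
import pandas as pd


def trim_iqr_outliers(
    values: pd.Series, k: float = 1.5, max_iterations: int = 100
) -> tuple[pd.Series, int]:
    """Drop values outside the IQR fences, recomputing the fences after each pass.

    Returns the trimmed series and the number of passes performed,
    counting the final pass that found nothing to remove.
    """
    current = values
    for n_passes in range(1, max_iterations + 1):
        q1, q3 = current.quantile(0.25), current.quantile(0.75)
        iqr = q3 - q1
        # Values exactly on a fence are kept, matching strict < / > comparisons.
        keep = current.between(q1 - k * iqr, q3 + k * iqr)
        if keep.all():
            return current, n_passes
        current = current[keep]
    return current, max_iterations


toy = pd.Series([1, 2, 3, 4, 5, 6, 9, 16, 100])
trimmed, n_passes = trim_iqr_outliers(toy)
print(trimmed.tolist(), n_passes)
```

On this toy series the first pass removes only `100`; the value `16` falls outside the fences only after they are recomputed on the remaining values, which a single pass would miss. The same effect means the iterative approach can remove more rows than a one-off outlier count suggests.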

## FINAL RESULTS - INITIAL (RAW) DATA ANALYSIS: Wrangling

- ### The raw data were __successfully preprocessed__ and reached a __sufficient level of quality__ to be __ready__ for progression to the next stages of the study.

### __Resolved__ Data Issues In:
- #### Missing Values/Data Loss:
    - `Description` (4,382 missing values) -> __dropped__
    - `Customer ID` (243,007 missing values) -> __dropped__
- #### Format/Type of Data:
    - `InvoiceDate` (needs to be converted to datetime) -> __converted__
    - `Customer ID` (needs to be converted to object) -> __converted__
- #### Outliers:
    - `Quantity` (has 116,489 outliers) -> __resolved__
    - `Price` (has 68,105 outliers) -> __resolved__

- #### Newly generated preprocessed datasets:
    - `preprocessed_data` -> __Final pre-processed data__
    - `preprocessed_data_isolated_outliers` -> Isolated outliers from the final pre-processed data
    - `preprocessed_data__with_outliers` -> Final pre-processed data (__WITH__ outliers)
- #### Limitations:
    - Although the transformation performed in this initial data preparation is essential to advance the analysis and modeling, it can lead to the loss of valuable information, introduce bias, and reduce the representativeness of the data, compromising the precision of the results. In a future study on this business problem and the data involved, it is essential to assess the impact of data loss using reliable and well-established statistical techniques. However, despite the removals performed, the study preserved 61.69% of the original observations, keeping all variables and excluding 408,871 observations that contained missing data or outliers in any of their variables.
- #### Implications:
    - During the initial analysis of the data, over 160,000 outliers were identified. None of these outliers were discarded. Although they were not used in the current modeling, they were isolated and stored in the study repository and may serve as a basis for analysis in future research.
- #### Next steps:
    - __Exploratory Data Analysis (EDA)__
    - __Modeling__
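
The retention figures quoted under Limitations are a simple ratio of kept to raw observations. The sketch below shows that arithmetic with a hypothetical helper, `retention_summary`, and toy counts; the exact raw row count is not stated in this notebook, so in practice one would pass `len(raw_data)` and `len(raw_data) - len(preprocessed_data)`.

```python
def retention_summary(n_raw: int, n_removed: int) -> dict[str, float]:
    """Kept row count plus kept/removed shares, as percentages rounded to 2 decimals."""
    if n_raw <= 0 or not 0 <= n_removed <= n_raw:
        raise ValueError("expected n_raw > 0 and 0 <= n_removed <= n_raw")
    n_kept = n_raw - n_removed
    return {
        "kept": n_kept,
        "kept_pct": round(100 * n_kept / n_raw, 2),
        "removed_pct": round(100 * n_removed / n_raw, 2),
    }


# Toy counts chosen for illustration only.
summary = retention_summary(n_raw=1000, n_removed=383)
print(summary)
```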

In [None]:
# Short comparison between raw and preprocessed data
print('RAW DATA:')
display(raw_data)
print()
print('PREPROCESSED DATA:')
display(preprocessed_data)

In [None]:
prdi = DataIssues(preprocessed_data)

In [None]:
prdi.count_missing_data_issues()

In [None]:
prdi.get_data_types()

In [None]:
prdi.count_outliers()

In [None]:
display(preprocessed_data)

In [None]:
def export_data(data: pd.DataFrame, path: str) -> None:
    data.to_csv(path, index=False)
    print('A dataset was exported successfully to: ', path)
    return None

export_data(preprocessed_data, './data/preprocessed_data.csv')
export_data(preprocessed_data_isolated_outliers, './data/preprocessed_data_isolated_outliers.csv')
export_data(preprocessed_data__with_outliers, './data/preprocessed_data__with_outliers.csv')