## \[Data Science\] Full stack training

- key field for joining the tables: `grower_document`
- base table: `md_grower_report`
  - corresponds to Bayer's grower (farmer) registry

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
INPUT_FOLDER = Path.cwd().parent / "inputs"

In [4]:
# materialize the generator so the loading cell below can be re-run
csv_files = list(INPUT_FOLDER.glob('**/*.csv'))

In [5]:
# key each DataFrame by its file name minus the first 10 characters and the ".csv" extension
df = {csv_file.name[10:-4]: pd.read_csv(csv_file) for csv_file in csv_files}



In [6]:
df.keys()

dict_keys(['sales', 'disque_intacta', 'md_grower_report', 'saved_seeds', 'entered_area'])

## Tables:

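Shape of each table as (rows, columns), with a dot as the thousands separator, followed by its join key columns where they are used below:

- **sales** => (871.264, 79) - keys: (a_seller_document, a_buyer_document)
- **disque_intacta** => (385.716, 51)
- **md_grower_report** => (514.182, 38) - key: (a_grower_document)
- **saved_seeds** => (19.648, 72)
- **entered_area** => (51.262, 40)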

## Dealing with duplicated data

An initial `left outer join` between `sales` and `md_grower_report` returned more rows than `sales` itself has.

This raised a red flag: a left join only gains rows when a key on the left matches more than one row on the right, so `md_grower_report` probably contained duplicated keys.
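The effect is easy to reproduce on toy data. The sketch below uses made-up values (the `amount` and `grower_name` columns are hypothetical, only the key column names come from this notebook) and shows one repeated key on the right side adding one extra row to the result:

```python
import pandas as pd

# Toy stand-ins for `sales` and `md_grower_report`; all values are made up.
sales = pd.DataFrame({"a_seller_document": ["A", "B", "C"], "amount": [10, 20, 30]})
growers = pd.DataFrame({
    "a_grower_document": ["A", "A", "B"],  # key "A" appears twice
    "grower_name": ["Ana", "Ana (old record)", "Bruno"],
})

merged = sales.merge(
    growers,
    how="left",
    left_on="a_seller_document",
    right_on="a_grower_document",
)

# "A" matches two grower rows, "B" matches one, "C" matches none (kept with NaN).
print(len(sales), len(merged))
```

Every left row is kept at least once, so the only way for the row count to grow is a repeated key on the right.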

Then, I started investigating

In [7]:
df_dup_detected = pd.merge(
    df["sales"],
    df["md_grower_report"],
    how="left",
    # on=None,
    left_on=["a_seller_document"],
    right_on=["a_grower_document"],
    # left_index=False,
    # right_index=False,
    # sort=True,
    suffixes=("_seller", "_grower"),
)
df_dup_detected.shape

(896150, 117)

`896.150` **>** `871.264` :: `24.886` extra rows (**!!!**)

In [8]:
grower_docs = df["md_grower_report"]["a_grower_document"]
grower_docs_unq = grower_docs.unique()  # note: unique() counts NaN as one value
print(f"Comparing grower_docs.size with grower_docs_unq.size => {len(grower_docs)} vs {len(grower_docs_unq)}")
# fails if any key appears more than once (repeated missing keys included)
assert len(grower_docs) == len(grower_docs_unq)

Comparing grower_docs.size with grower_docs_unq.size => 514182 vs 461935


AssertionError: 
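The failed assertion mixes two different problems: missing keys and repeated keys. A minimal sketch on a toy `Series` (the values are made up) shows how they can be told apart, and why `unique()` reports one more value than the number of real keys when missing values are present:

```python
import numpy as np
import pandas as pd

# Toy key column: one repeated key and two missing keys.
keys = pd.Series(["A", "A", "B", np.nan, np.nan])

n_unique_with_nan = len(keys.unique())  # NaN is counted as one distinct value
n_unique_no_nan = keys.nunique()        # NaN is excluded by default (dropna=True)
n_missing = int(keys.isna().sum())      # rows with no key at all

# A key column is safe to join on when nothing is missing and nothing repeats.
is_clean_key = bool(keys.notna().all() and keys.is_unique)

print(n_unique_with_nan, n_unique_no_nan, n_missing, is_clean_key)
```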

In [9]:
df["md_grower_report"].dropna(subset=["a_grower_document"]).shape

(475945, 38)

In [10]:
# dropping rows with empty keys
df["md_grower_report"] = df["md_grower_report"].dropna(subset=["a_grower_document"])

In [11]:
ids = df["md_grower_report"]["a_grower_document"]
df_grower_report_dup = df["md_grower_report"][ids.isin(ids[ids.duplicated()])].sort_values("a_grower_document")
df_grower_report_dup.shape

(28011, 38)

In [12]:
# dropping duplicated keys (the first occurrence of each key is kept)
df["md_grower_report"] = df["md_grower_report"].drop_duplicates(subset=["a_grower_document"], keep='first')
df["md_grower_report"].shape

(461934, 38)
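The row counts printed so far can be reconciled with plain arithmetic. This is only a sanity check on the numbers shown above, not part of the original analysis; the variable names are mine:

```python
# Row counts taken from the cell outputs above.
rows_total = 514_182          # md_grower_report as loaded
rows_with_key = 475_945       # after dropna on a_grower_document
rows_in_dup_groups = 28_011   # rows whose key occurs more than once
rows_after_dedup = 461_934    # after drop_duplicates(keep="first")

rows_missing_key = rows_total - rows_with_key            # rows dropped for having no key
rows_dropped_as_dup = rows_with_key - rows_after_dedup   # rows dropped as repeats

# keep="first" keeps exactly one row per repeated key, so the number of
# distinct repeated keys is the group rows minus the rows that were dropped.
distinct_repeated_keys = rows_in_dup_groups - rows_dropped_as_dup

print(rows_missing_key, rows_dropped_as_dup, distinct_repeated_keys)
```

The final count is also one less than the `461935` reported by `unique()` earlier, which is consistent with `unique()` having counted the missing key as one extra value.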

In [14]:
df_cleaned = pd.merge(
    df["sales"],
    df["md_grower_report"],
    how="left",
    # on=None,
    left_on=["a_seller_document"],
    right_on=["a_grower_document"],
    # left_index=False,
    # right_index=False,
    # sort=True,
    suffixes=("_seller", "_grower"),
)
df_cleaned.shape

(871264, 117)

`871.264` **=** `871.264` :: **ZERO** extra rows (*first mission accomplished*)
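Instead of comparing shapes by hand, `pd.merge` can enforce this itself through its `validate` argument. The sketch below is not what this notebook does; it uses toy data and a hypothetical helper `safe_left_join` to show `validate="many_to_one"` rejecting a right table with repeated keys and accepting it after deduplication:

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins; all values are made up.
sales = pd.DataFrame({"a_seller_document": ["A", "B", "C"]})
growers = pd.DataFrame({
    "a_grower_document": ["A", "A", "B"],  # key "A" appears twice
    "grower_name": ["Ana", "Ana (old record)", "Bruno"],
})


def safe_left_join(left, right):
    """Left join that raises if the right-hand keys are not unique."""
    return left.merge(
        right,
        how="left",
        left_on="a_seller_document",
        right_on="a_grower_document",
        validate="many_to_one",
    )


try:
    safe_left_join(sales, growers)
    raised = False
except MergeError:
    raised = True  # repeated keys on the right side were detected

deduped = growers.drop_duplicates(subset=["a_grower_document"], keep="first")
joined = safe_left_join(sales, deduped)

print(raised, len(joined) == len(sales))
```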

In [15]:
df_final = pd.merge(
    df_cleaned,
    df["md_grower_report"],
    how="left",
    # on=None,
    left_on=["a_buyer_document"],
    right_on=["a_grower_document"],
    # left_index=False,
    # right_index=False,
    # sort=True,
    suffixes=("_buyer", "_grower_2"),
)
df_final.shape

(871264, 155)

`871.264` **=** `871.264` :: **ZERO** extra rows (*double check done*)

Note that the number of columns equals `79` **+** `38` **+** `38` (**=** `155`)

number of columns in `sales` = 79 / number of columns in `md_grower_report` = 38 (joined twice)
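One thing worth checking when the same table is joined twice is which side receives which suffix. In the toy sketch below (made-up values, hypothetical `grower_name` column), the first element of `suffixes` goes to the columns already in the left frame, meaning the ones brought in by the seller join, so with the suffixes used above the seller's grower columns would end up labelled `_buyer`:

```python
import pandas as pd

# Toy stand-ins; all values are made up.
sales = pd.DataFrame({"a_seller_document": ["A"], "a_buyer_document": ["B"]})
growers = pd.DataFrame({
    "a_grower_document": ["A", "B"],
    "grower_name": ["Ana", "Bruno"],
})

# First join: no column names clash, so no suffix is applied.
step1 = sales.merge(
    growers, how="left",
    left_on="a_seller_document", right_on="a_grower_document",
    suffixes=("_seller", "_grower"),
)

# Second join: the grower columns now exist on both sides and get suffixed.
step2 = step1.merge(
    growers, how="left",
    left_on="a_buyer_document", right_on="a_grower_document",
    suffixes=("_buyer", "_grower_2"),
)

print(step2.shape)           # columns: 2 (sales) + 2 (growers) + 2 (growers)
print(list(step2.columns))
```

If that reading is right for the real data, swapping the suffixes to something like `("_seller", "_buyer")` in the second merge might give clearer column names; this is a suggestion, not something verified against the full tables.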