# Data Quality Check

This section contains the code used to verify that all available data has been downloaded. Individual missing values may still occur, but the checks here look for whole chunks of missing data. Several cells identify these gaps and write the affected match IDs to CSV files (one file per check and season) so that those matches can be re-downloaded.

Once these files are created, return to the `fill_data.py` script, comment out the parts you do not need, and run it to fetch the missing data.

After a successful download, proceed to Part 2 of this notebook to merge the newly retrieved data.

**🔧 Integration of this functionality into fill_data.py is planned to reduce manual effort.**

## Part 1

In [3]:
import pandas as pd
import os

In [4]:
names = ["20_21_final", "21_22_final", "22_23_final", "23_24_final"]
# Forward slashes work on every OS and avoid invalid escape sequences such as "\d" and "\e"
dfs = [pd.read_csv(f"../data/extracted_data/{name}.csv") for name in names]

os.makedirs("../data/match_ids", exist_ok=True)

# Helper to save problematic match_ids to file
def save_ids_to_file(ids, chunk_name, df_name):
    file_path = f"../data/match_ids/e{chunk_name}_{df_name}.csv"
    with open(file_path, 'w') as f:
        for match_id in ids:
            f.write(f"{match_id}\n")

In [5]:
# CHUNK 1 — Check goals (Exception 1)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        if pd.isna(row['goals_1']) or pd.isna(row['goals_2']):
            ids.append(row['match_id'])
    save_ids_to_file(ids, 1, name)
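
The row-by-row loop above can also be written as a boolean mask. The following is a minimal sketch, not the notebook's own method: `ids_with_missing` is a hypothetical helper and the toy values are made up. It treats a column that is absent from the DataFrame as missing for every match, which mirrors how the `row.get(...)` checks in the later chunks behave.

```python
import pandas as pd

# Toy stand-in for one season's DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "match_id": [101, 102, 103],
    "goals_1": [2.0, None, 1.0],
    "goals_2": [0.0, 1.0, None],
})

def ids_with_missing(df, cols):
    """Return match_ids of rows where any column in `cols` is NaN or absent."""
    if any(col not in df.columns for col in cols):
        # A required column does not exist at all, so every match is affected
        return df["match_id"].tolist()
    mask = df[cols].isna().any(axis=1)
    return df.loc[mask, "match_id"].tolist()

missing_goal_ids = ids_with_missing(toy, ["goals_1", "goals_2"])
print(missing_goal_ids)  # [102, 103]
```

The resulting list can be passed to `save_ids_to_file` exactly like the `ids` list built by the loop.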

In [6]:
# CHUNK 2 — Check possession (Exception 2)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        if pd.isna(row.get('possession_1')) or pd.isna(row.get('possession_2')):
            ids.append(row['match_id'])
    save_ids_to_file(ids, 2, name)

In [8]:
# CHUNK 3 — Check formations (Exception 3)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        if pd.isna(row.get('team_1_formation')) or pd.isna(row.get('team_2_formation')):
            ids.append(row['match_id'])
    save_ids_to_file(ids, 3, name)

In [10]:
# CHUNK 4 — Check ratings (Exception 4)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        if pd.isna(row.get('team_1_line_1')) or pd.isna(row.get('team_2_line_1')):
            ids.append(row['match_id'])
    save_ids_to_file(ids, 4, name)

In [11]:
# CHUNK 5 — Check standard bets (Exception 5)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        cols = ['bet_1', 'bet_x', 'bet_2']
        if sum(1 for col in cols if not pd.isna(row.get(col)) and isinstance(row[col], (int, float))) < 3:
            ids.append(row['match_id'])
    save_ids_to_file(ids, 5, name)
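
Chunks 5 to 8 all repeat the same "present, not NaN, and numeric" test. One way to make that condition easier to read is to factor it into a helper. This is only a sketch: `count_valid_numbers` is a hypothetical name, and the toy odds are made up.

```python
import pandas as pd

def count_valid_numbers(row, cols):
    """Count the columns in `cols` whose value in `row` is a non-NaN int or float."""
    count = 0
    for col in cols:
        value = row.get(col)  # None if the column does not exist
        if isinstance(value, (int, float)) and not pd.isna(value):
            count += 1
    return count

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "match_id": [1, 2, 3],
    "bet_1": [1.8, 2.1, None],     # match 3 has no home odds
    "bet_x": [3.4, "n/a", 3.0],    # match 2 has a non-numeric value
    "bet_2": [4.2, 3.3, 2.5],
})

flagged = [
    row["match_id"]
    for _, row in toy.iterrows()
    if count_valid_numbers(row, ["bet_1", "bet_x", "bet_2"]) < 3
]
print(flagged)
```

Matches 2 and 3 are flagged here: one has a string where a number is expected, the other has a NaN.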

In [12]:
# CHUNK 6 — Check double chance bets (Exception 6)
for df, name in zip(dfs, names):
    ids = []
    for _, row in df.iterrows():
        cols = ['bet_1x', 'bet_12', 'bet_x2']
        if sum(1 for col in cols if not pd.isna(row.get(col)) and isinstance(row[col], (int, float))) < 3:
            ids.append(row['match_id'])
    save_ids_to_file(ids, 6, name)

In [13]:
# CHUNK 7 — Check over/under bets (Exception 7)
for df, name in zip(dfs, names):
    ids = []
    over_cols = [col for col in df.columns if col.startswith('bet_above_')]
    under_cols = [col for col in df.columns if col.startswith('bet_below_')]
    for _, row in df.iterrows():
        above_valid = sum(1 for col in over_cols if not pd.isna(row.get(col)) and isinstance(row[col], (int, float)))
        below_valid = sum(1 for col in under_cols if not pd.isna(row.get(col)) and isinstance(row[col], (int, float)))
        if above_valid < 2 or below_valid < 2:
            ids.append(row['match_id'])
    save_ids_to_file(ids, 7, name)
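
For the prefix-based checks, the per-row counts can also be computed column-wise. The sketch below is an alternative under stated assumptions, not a drop-in replacement: the column names `bet_above_2.5` and so on are hypothetical, and `pd.to_numeric(..., errors="coerce")` also accepts numeric strings such as `"1.9"`, which the `isinstance` test above would reject.

```python
import pandas as pd

# Toy data with hypothetical over/under column names
toy = pd.DataFrame({
    "match_id": [1, 2],
    "bet_above_2.5": [1.9, None],
    "bet_above_3.5": [3.1, 2.8],
    "bet_below_2.5": [1.9, 1.7],
    "bet_below_3.5": [1.3, 1.4],
})

over_cols = [c for c in toy.columns if c.startswith("bet_above_")]
under_cols = [c for c in toy.columns if c.startswith("bet_below_")]

def valid_counts(df, cols):
    """Per-row count of values in `cols` that can be read as numbers."""
    numeric = df[cols].apply(pd.to_numeric, errors="coerce")
    return numeric.notna().sum(axis=1)

# Flag matches with fewer than two usable values on either side
too_few = (valid_counts(toy, over_cols) < 2) | (valid_counts(toy, under_cols) < 2)
flagged = toy.loc[too_few, "match_id"].tolist()
print(flagged)  # [2]
```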

In [15]:
# CHUNK 8 — Check handicap bets (Exception 8)
for df, name in zip(dfs, names):
    ids = []
    handicap_cols = [col for col in df.columns if col.startswith('bet_handicap')]
    for _, row in df.iterrows():
        valid_handicaps = sum(1 for col in handicap_cols if not pd.isna(row.get(col)) and isinstance(row[col], (int, float)))
        if valid_handicaps < 2:
            ids.append(row['match_id'])
    save_ids_to_file(ids, 8, name)
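
After all eight checks have run, each season has up to eight ID files named `e<chunk>_<season>.csv`, and the same match can appear in several of them. How `fill_data.py` consumes these files is not shown here, so the following is only a sketch of one way to collapse them into a single de-duplicated list per season. `collect_ids` is a hypothetical helper, and the demo writes throwaway files to a temporary directory instead of touching `../data/match_ids`.

```python
from pathlib import Path
import tempfile

def collect_ids(id_dir, season):
    """Return the sorted union of match IDs across all e<chunk>_<season>.csv files."""
    ids = set()
    for path in sorted(Path(id_dir).glob(f"e*_{season}.csv")):
        for line in path.read_text().splitlines():
            line = line.strip()
            if line:  # skip blank lines
                ids.add(line)
    return sorted(ids)

# Demo with made-up IDs in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "e1_20_21_final.csv").write_text("101\n102\n")
    Path(tmp, "e2_20_21_final.csv").write_text("102\n103\n")
    Path(tmp, "e1_21_22_final.csv").write_text("999\n")
    demo_ids = collect_ids(tmp, "20_21_final")

print(demo_ids)  # ['101', '102', '103']
```

IDs are kept as strings because the files are read as plain text; convert them if the download script expects another type.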

## Part 2

This part enriches the main dataset (the CSV at `file1`) with additional data from a secondary dataset (the CSV at `file2`) by merging the two on the `match_id` column.

To use it:

- Set the `file1` and `file2` variables to the desired input CSV file paths.

- Adjust the output path in `to_csv()` to specify where the merged result should be saved.

This is helpful when you want to add new columns or features to your primary dataset from an auxiliary data source.
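
The join type decides which matches survive the merge, so it is worth checking on a small example before running it on a full season. The sketch below uses made-up toy frames. `validate="one_to_one"` is an optional `pd.merge` argument that raises an error if `match_id` is duplicated in either input.

```python
import pandas as pd

# Toy frames, made up for illustration
main = pd.DataFrame({"match_id": [1, 2, 3], "goals_1": [2, 0, 1]})
extra = pd.DataFrame({"match_id": [2, 3, 4], "possession_1": [55.0, 48.0, 60.0]})

# Inner join: only match_ids present in both frames are kept
inner = pd.merge(main, extra, on="match_id", how="inner", validate="one_to_one")

# Left join: every row of `main` is kept; unmatched rows get NaN in the new columns
left = pd.merge(main, extra, on="match_id", how="left", validate="one_to_one")

print(inner["match_id"].tolist())  # [2, 3]
print(left["match_id"].tolist())   # [1, 2, 3]
```

If every match in the primary dataset must remain in the output, a left join is the safer choice. An inner join silently drops matches that have no counterpart in the secondary file.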

In [None]:
import pandas as pd

In [None]:
# Forward slashes work on every OS; the raw string with "\\" kept doubled backslashes in the path
directory = "../data/extracted_data/"
file1 = f"{directory}20_21_final.csv"
file2 = f"{directory}20_21_exceptions.csv"

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

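# Inner join: only match_ids present in both files end up in the result.
# Columns with the same name in both files are suffixed with _x and _y.
merged_df = pd.merge(df1, df2, on='match_id', how='inner')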

merged_df.to_csv(f'{directory}merged.csv', index=False)
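
If the re-downloaded file contains the same columns as the main file (an assumption; the layout of the exceptions file is not shown here), a column-wise merge produces `_x`/`_y` duplicates instead of filling the gaps. In that case, one possible alternative is `DataFrame.combine_first`, sketched below on made-up data.

```python
import pandas as pd

# Toy frames, made up for illustration
main = pd.DataFrame({
    "match_id": [1, 2, 3],
    "goals_1": [2.0, None, 1.0],
    "goals_2": [0.0, None, 3.0],
})
refetched = pd.DataFrame({"match_id": [2], "goals_1": [4.0], "goals_2": [1.0]})

# Align both frames on match_id; values from `main` win, and only its NaN
# cells are filled from `refetched`
filled = (
    main.set_index("match_id")
        .combine_first(refetched.set_index("match_id"))
        .reset_index()
)

print(filled["goals_1"].tolist())  # [2.0, 4.0, 1.0]
```

Existing values in the main dataset are never overwritten by this approach.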