### Data usually requires cleaning and preparation so that the insights drawn from it can be relied upon.
### Here are several reasons why data may contain errors:

- Typographical Errors:
    Transcription Errors,
    Incomplete Data

- Missing Values:
    Null or NaN Values,
    Inconsistent Data

- Inconsistent Units:
    Inconsistent Formatting,
    Outliers

- Data Entry Errors:
    Genuine Outliers,
    Duplications

- Incorrect Data Types:
    Mixed Data Types,
    Inaccurate Sources

- Errors in Source Data:
    Measurement Errors


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../Data/fifa.csv') # load your data

1. Check the Data in Columns:

Examine the contents of columns 14 and 26 to understand why they have mixed data types. Use df.iloc[:, 14] and df.iloc[:, 26] to inspect the data.

In [None]:
df.iloc[:, 14]

The data in the Free kick accuracy column should be numeric (int or float). pandas stores a column as object when it holds mixed data types (strings and numbers), so we can safely infer that these columns are of object dtype.

More on object data types: https://pandas.pydata.org/docs/reference/arrays.html#objects
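As a small illustration of the point above, the sketch below builds a toy Series that mixes integers with strings and confirms that pandas stores it as object. The values are made up for the example and are not taken from fifa.csv. Counting the Python type of each element is one way to see the mix.

```python
import pandas as pd

# Hypothetical values standing in for a column such as Free kick accuracy
mixed = pd.Series([70, 85, "81+1", "64"], dtype=object)

print(mixed.dtype)

# Count how many elements of each Python type the column holds
type_counts = mixed.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

The same `map` call could be run on `df.iloc[:, 14]` to see how many entries were parsed as numbers and how many as strings.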

In [None]:
df.iloc[0, 14]

Looking closely, the data in that column is represented as strings.

In [None]:
df.iloc[:, 26]

2. Specify Data Types Explicitly:

When reading the CSV file, explicitly specify the data types for these columns using the dtype parameter in pd.read_csv. For example, if the columns contain numbers, you can set them to a numeric type. Reloading the data with explicit types reveals something interesting: the load fails, and the last line of the Traceback shows the value that could not be converted.

In [None]:
df2 = pd.read_csv('../Data/fifa.csv', dtype={'Free kick accuracy': int, 'Penalties': int})

In [None]:
# find the row with '81+1' as it appears within the columns
df[df.iloc[:, 14].isin(['81+1'])]
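Looking up one known value works once the Traceback has named it. To list every entry that blocks numeric conversion, one possible approach is to coerce the column with pd.to_numeric and keep the entries that turn into NaN. The sample values below are hypothetical stand-ins for `df.iloc[:, 14]`.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df.iloc[:, 14]
col = pd.Series(["70", "85", "81+1", "64", "79+2"], dtype=object)

# Values that to_numeric cannot parse become NaN under errors='coerce'
converted = pd.to_numeric(col, errors="coerce")

# Keep the original entries that failed to convert and were not missing to begin with
bad_values = col[converted.isna() & col.notna()]
print(bad_values.tolist())
```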

- Values like "79+2" and "81+1" in a FIFA dataset likely represent a player's rating for an attribute together with a boost. In FIFA video games, a rating is often composed of a base value plus potential boosts or modifiers. The format "79+2" suggests a base rating of 79 with an additional boost of 2.

- The additional values (e.g., "+2" or "+1") are usually modifiers that reflect temporary improvements or boosts to certain attributes. These boosts may be applied based on the player's performance in recent matches, achievements, or other in-game factors.

- As for the data type, when you see values like "79+2" in a dataset, the column containing these values is likely of type object or string. Pandas might interpret such columns as containing mixed data types if some values are purely numeric, and others have additional characters like '+'.
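The "base plus modifier" reading described above can be sketched with a regular expression that separates the two parts. This is only an illustration on made-up values; the group names `base` and `modifier` are chosen for the example and do not come from the dataset.

```python
import pandas as pd

# Hypothetical ratings in the same format as the dataset
ratings = pd.Series(["79+2", "81+1", "64", "70-3"], dtype=object)

# 'base' is the leading number, 'modifier' the optional signed part
parts = ratings.str.extract(r"^(?P<base>\d+)(?P<modifier>[+-]\d+)?$")

base = pd.to_numeric(parts["base"])
# Entries without a modifier get 0
modifier = pd.to_numeric(parts["modifier"]).fillna(0)

print(pd.DataFrame({"base": base, "modifier": modifier}))
```

Keeping the two parts in separate columns leaves the choice open between using the base value alone and adding the modifier to it.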

3. Several other columns represent observations this way. Using the .info() method on the DataFrame shows the data types of all columns in the dataset.

In [None]:
df.info()
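The output of .info() can be long for a wide dataset. A shorter way to get just the names of the object columns is select_dtypes, sketched here on a toy DataFrame whose columns and values are invented for the example.

```python
import pandas as pd

# Toy frame: two object columns and one integer column
toy = pd.DataFrame({
    "Name": pd.Series(["A", "B"], dtype=object),
    "Age": [30, 25],
    "Penalties": pd.Series(["79+2", "64"], dtype=object),
})

# Names of the columns stored as object
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```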

In [None]:
# check for the sum of missing values
df.isna().sum()

In [None]:
# create a copy of dataframe for manipulation
df_copy = df.copy()

In [None]:
# take a part of the data without data type issues
proper_cols = ['Name', 'Age', 'Nationality', 'Preferred Positions', 'Overall']
proper_df = df_copy[proper_cols]

In [None]:
proper_df

In [None]:
# select the columns to focus on
numeric_df = df_copy.drop(proper_cols, axis=1)

In [None]:
numeric_df.info()

In [None]:
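# Method 1: Convert columns to numeric type after removing boost values.
# astype(str) keeps entries already parsed as numbers, because .str methods return NaN for non-strings.
numeric_df = numeric_df.apply(lambda x: pd.to_numeric(x.astype(str).str.split('[+-]').str[0], errors='coerce'))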

Note: The boosts can be additions as well as subtractions.
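One caveat when stripping modifiers through the .str accessor: in a column that mixes real integers with strings, .str methods return NaN for the non-string entries. The sketch below, on made-up values, compares splitting directly with splitting after a cast to str.

```python
import pandas as pd

# Hypothetical column mixing real integers with strings
col = pd.Series([70, "81+1", "64-2", 55], dtype=object)

# .str methods return NaN for elements that are not strings
direct = pd.to_numeric(col.str.split("[+-]").str[0])

# Casting to str first keeps the integer entries
via_str = pd.to_numeric(col.astype(str).str.split("[+-]").str[0])

print(direct.tolist())
print(via_str.tolist())
```

In the direct version the two integer entries are lost, while the cast keeps all four values and handles both "+" and "-" modifiers.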

In [None]:
numeric_df.dtypes

In [None]:
final_df = pd.concat([proper_df, numeric_df], axis=1)

In [None]:
# creating a checkpoint
final_df.to_csv('../Data/final.csv', index=False)

In [None]:
# Method 2: Convert to numeric while 'eval'uating the boost values
numeric_df2 = df_copy.drop(proper_cols, axis=1)

In [None]:
for col in numeric_df2.columns:
    print(col)
    numeric_df2[col] = [eval(value) if isinstance(value, str) else value for value in numeric_df2[col]]
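Because eval executes any expression it is given, a narrower alternative is to parse the "base plus modifier" pattern explicitly. The sketch below is one possible way to do that, not the method used in this notebook. The helper name `add_boost` and the sample values are hypothetical.

```python
import re
import pandas as pd

# A number, optionally followed by a signed modifier such as +2 or -1
_PATTERN = re.compile(r"^(\d+)([+-]\d+)?$")

def add_boost(value):
    """Return base plus modifier for strings like '79+2'; pass non-strings through."""
    if not isinstance(value, str):
        return value
    match = _PATTERN.match(value)
    if match is None:
        return float("nan")  # unexpected text is treated as missing
    base = int(match.group(1))
    modifier = int(match.group(2)) if match.group(2) else 0
    return base + modifier

sample = pd.Series([70, "79+2", "81-1", "64", "oops"], dtype=object)
result = sample.map(add_boost)
print(result.tolist())
```

Strings that do not match the pattern become NaN instead of being executed, so they surface later in `isna().sum()`.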

In [None]:
### The dataset contains some spurious values. Using DataFrame.describe(), we find that some maximum values are well above 100.
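Reading the maxima off the describe() table by eye is easy to get wrong when there are many columns. A possible shortcut, sketched on an invented toy DataFrame, is to take the 'max' row of describe() and filter it.

```python
import pandas as pd

# Toy frame with one spurious rating (812) in the Penalties column
toy = pd.DataFrame({
    "Name": pd.Series(["A", "B", "C"], dtype=object),
    "Penalties": [64, 812, 70],
    "Finishing": [55, 60, 99],
})

# describe() summarises the numeric columns; the 'max' row holds each column's maximum
max_values = toy.describe().loc["max"]

# Columns whose maximum exceeds the expected upper bound of 100
suspicious = max_values[max_values > 100].index.tolist()
print(suspicious)
```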

In [None]:
df = pd.read_csv('../Data/final.csv')

In [None]:
df.info()

In [None]:
# spurious values: replace anything above 100 in the numeric columns with NaN
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].where(df[col] <= 100)

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace=True)

In [None]:
df.describe()

In [None]:
# creating a checkpoint
df.to_csv('../Data/final2.csv', index=False)