# Data Cleaning Exercise

Cleaning your data is a crucial first step in a new data engineering project because it makes the dataset accurate, consistent, and reliable. Dirty data (duplicates, missing values, and errors) leads to incorrect analysis and, in turn, to poor decisions. Cleaning identifies and corrects these issues, giving data models and analytics a solid foundation. Clean data also tends to improve the performance of algorithms and makes results more trustworthy and actionable.

Use Python, ```numpy```, ```pandas``` and/or ```matplotlib``` to analyse and clean your batch data:

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load Data

Link to data source: ```<TODO>```

In [None]:
df = pd.read_csv('AAPL.csv')

## Understand the Data

View the first few rows, check the data types, and get summary statistics.

In [None]:
df.head(10)

In [None]:
print(df.dtypes)

In [None]:
print("General information:")
df.info()

print("\nSummary statistics:")
print(df.describe(include='all'))

## Handle Missing Data

Identify missing values, then fill or drop them.

In [None]:
print("Missing values per column:")
print(df.isnull().sum())
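
The cell above only counts missing values. The fill-or-drop step could look roughly like the sketch below. It runs on a small made-up frame rather than `AAPL.csv`; the sample values, the `cleaned` name, and the choice of strategy per column (drop rows without a date, forward-fill prices, fill volume with the median) are assumptions for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real data
sample = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-04", None],
    "close": [185.6, np.nan, 181.9, 181.2],
    "volume": [82_000_000, 58_000_000, np.nan, 62_000_000],
})

# Drop rows that lack the key column: a price without a date is unusable
cleaned = sample.dropna(subset=["date"]).copy()

# Forward-fill prices: carry the last known close into the gap
cleaned["close"] = cleaned["close"].ffill()

# Fill missing volume with the column median (one of several reasonable choices)
cleaned["volume"] = cleaned["volume"].fillna(cleaned["volume"].median())

print(cleaned.isnull().sum())
```

Forward-filling assumes the rows are already sorted by date; on unsorted data it would copy values from unrelated days.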

## Handle Duplicates

Identify duplicates and remove them

In [None]:
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

In [None]:
# Remove duplicate rows, if there are any
df = df.drop_duplicates()
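
`df.duplicated()` only flags rows that match in every column. In a daily price series, two rows with the same date but slightly different values are often duplicates too. A minimal sketch of checking for both cases, on made-up data; treating `date` as the unique key and keeping the last row per date are assumptions:

```python
import pandas as pd

# Hypothetical sample: one exact duplicate and one same-date row with a different close
sample = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"],
    "close": [185.6, 185.6, 184.2, 184.3],
})

# Rows identical in every column
exact = sample.duplicated().sum()

# Rows that repeat an earlier date, regardless of the other columns
same_date = sample.duplicated(subset=["date"]).sum()

print(f"Exact duplicates: {exact}, repeated dates: {same_date}")

# Keep one row per date (here: the last one seen)
deduped = sample.drop_duplicates(subset=["date"], keep="last").reset_index(drop=True)
print(deduped)
```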

## Handle Outliers

Identify outliers and remove or correct them.

In [None]:
numeric_cols = ['open', 'high', 'low', 'close', 'volume', 'adjclose', 'dividends']

# Draw one boxplot per numeric column
for col in numeric_cols:
    plt.figure()
    df.boxplot(column=col)
    plt.title(f'Boxplot of {col}')
    plt.show()
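
The boxplots show outliers visually but do not remove or correct them. One common way to do so is the 1.5 × IQR rule, which matches the default whisker length of matplotlib boxplots. The sketch below applies it to a made-up series; the sample values and the names `lower`, `upper`, and `clipped` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier
sample = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 50.0], name="close")

# Interquartile range and the usual 1.5 * IQR fences
q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (sample < lower) | (sample > upper)
print(sample[is_outlier])

# Option 1: drop the outliers
without_outliers = sample[~is_outlier]

# Option 2: correct them by clipping to the fences
clipped = sample.clip(lower=lower, upper=upper)
```

For a stock that has trended over many years, fences computed over the whole history can flag legitimate prices, so flagged values are best inspected before anything is dropped.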

## Handle Incorrect Data Types

In [None]:
# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
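
By default `pd.to_datetime` raises an error on the first value it cannot parse. If the raw file might contain malformed entries, `errors="coerce"` turns them into missing values that can then be handled like any other gap. A minimal sketch on made-up data (the malformed values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample with one unparseable date and one badly formatted number
sample = pd.DataFrame({
    "date": ["2024-01-02", "not a date", "2024-01-04"],
    "volume": ["82000000", "58,000,000", "71000000"],
})

# Unparseable entries become NaT / NaN instead of raising
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")
sample["volume"] = pd.to_numeric(sample["volume"], errors="coerce")

print(sample.dtypes)
print(sample.isnull().sum())
```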

## Visualize Data

Use graphs, plots, and/or diagrams to visualize the data.

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['close'])
plt.title('AAPL closing price over time')
plt.xlabel('Date')
plt.ylabel('Closing price ($)')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(12, 4))
plt.bar(df['date'], df['volume'], width=1.0)
plt.title('Trading volume over time')
plt.xlabel('Date')
plt.ylabel('Volume')
plt.show()

## Save Cleaned Data

In [None]:
df.to_json('AAPL_cleaned.json', orient='records', lines=True)

print("Saved as AAPL_cleaned.json")
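
It can be worth reading the file back to confirm that it round-trips. Depending on the pandas version and its defaults, datetime columns may be written as epoch timestamps; passing `date_format="iso"` makes the format explicit. The sketch below uses a made-up frame and a temporary file (`sample_cleaned.json` is a hypothetical name) so that it does not touch `AAPL_cleaned.json`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the cleaned data
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "close": [185.6, 184.2],
})

# Write JSON Lines (one record per line) with ISO-formatted dates
path = os.path.join(tempfile.mkdtemp(), "sample_cleaned.json")
sample.to_json(path, orient="records", lines=True, date_format="iso")

# Read the file back and check shape and dtypes
restored = pd.read_json(path, lines=True, convert_dates=["date"])
print(restored.shape)
print(restored.dtypes)
```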