## Data Cleaning

Let's load the 2021 data for AAPL this time. Unfortunately, this data is not quite as *clean* as the 2020 data, so we'll need to do some data wrangling. The file we're looking to load is `AAPL_2021_raw.csv`.

In [None]:
# New imports for a new notebook!
import pandas as pd

In [None]:
# Loading the data, setting the index, and having a look
url = "https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/main/data/AAPL_2021_raw.csv"
df = pd.read_csv(url)
df["Date"] = pd.to_datetime(df["Date"])
df.set_index("Date", inplace=True)
df

Can you see what we mean by messy? How many issues can you spot?

- Dates out of order
- Duplicate rows
- Missing values

Let's start by sorting the index.

In [None]:
# Checking for ascending order in the index (print so it shows mid-cell)
print(df.index.is_monotonic_increasing)

# Sorting the index in place
df.sort_index(inplace=True)
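
To see what the check and the sort do in isolation, here is a minimal sketch on a tiny DataFrame. The name `toy` and its values are made up for illustration and are not taken from the AAPL file.

```python
import pandas as pd

# Toy data (made up for illustration), deliberately out of date order
toy = pd.DataFrame(
    {"Close": [3.0, 1.0, 2.0]},
    index=pd.to_datetime(["2021-01-06", "2021-01-04", "2021-01-05"]),
)

# False here, because the dates are shuffled
before = toy.index.is_monotonic_increasing

# sort_index() without inplace=True returns a new, sorted DataFrame
toy_sorted = toy.sort_index()

# True once the dates run from earliest to latest
after = toy_sorted.index.is_monotonic_increasing

print(before, after)
```

Running the check again after sorting is a cheap way to confirm that the sort did what you expected.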

Now let's focus on duplicates:

In [None]:
# Counting duplicate rows
print(df.duplicated().sum())

# drop_duplicates() returns a new DataFrame; df itself is unchanged
df.drop_duplicates()

# So make sure we assign the result back to the variable to save our work!
df = df.drop_duplicates()

# Or we could have written:
# df.drop_duplicates(inplace=True)
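
One detail worth knowing: `duplicated()` and `drop_duplicates()` compare column values only and ignore the index. Because `Date` is now the index, two different days with identical values would be treated as duplicates. Whether that happens in this dataset is not something we have checked. The sketch below, on made-up data in a hypothetical `toy` frame, shows the difference and one possible way to include the date in the comparison.

```python
import pandas as pd

# Toy data (made up): rows 1 and 2 are a true duplicate (same date, same
# values); row 3 has the same values but falls on a different date.
toy = pd.DataFrame(
    {"Close": [10.0, 10.0, 10.0], "Volume": [100, 100, 100]},
    index=pd.to_datetime(["2021-01-04", "2021-01-04", "2021-01-05"]),
)
toy.index.name = "Date"

# The index is ignored, so both later rows are flagged as duplicates
n_by_values = toy.duplicated().sum()

# Moving the index into a column makes the date part of the comparison
with_date = toy.reset_index()
n_by_date_and_values = with_date.duplicated().sum()

# Drop the true duplicate only, then restore Date as the index
deduped = with_date.drop_duplicates().set_index("Date")

print(n_by_values, n_by_date_and_values, len(deduped))
```

If identical values on different dates are impossible in your data, the simpler `df.drop_duplicates()` is enough.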

Let's look at the missing values next. Previously, we saw that `info()` gives some insight into how many missing values we have, but we can also combine `isnull()` with `sum()`, which returns just the count of missing values per column, with less noise.

In [None]:
df.isnull().sum()
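
The per-column counts tell us how many values are missing, but not where. As a small sketch on made-up data (the `toy` frame is hypothetical), `isnull().any(axis=1)` can be used as a mask to pull out the affected rows.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one gap in each column
toy = pd.DataFrame(
    {"Close": [1.0, np.nan, 3.0], "Volume": [100.0, 200.0, np.nan]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

# One count of missing values per column
missing_per_column = toy.isnull().sum()

# Boolean mask: True for rows with at least one NaN in any column
rows_with_gaps = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```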

Now that we've identified some missing values, the big question is how to handle them. There are many approaches, and the right one depends on the data and on the analysis you plan to carry out.

In [None]:
# We can just drop any row that contains NaN in any column
df.dropna()

# Drop rows that contain NaN in a specific column
df.dropna(subset=["Close"])

# We can fill in NaNs with the average of a column
df["Volume"].fillna(df["Volume"].mean())

# We can interpolate (a point on the straight line connecting the values of the days before and after)
df["Adj Close"].interpolate(method="linear")

# We can forward fill, and use the value from the day before
df["Adj Close"].ffill()
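
To compare what these strategies actually produce, here is a minimal sketch on a made-up four-day Series with a single gap. The name `s` and the numbers are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy series (made up) with one missing value on the second day
s = pd.Series(
    [10.0, np.nan, 30.0, 50.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
)

# Mean of the non-missing values: (10 + 30 + 50) / 3 = 30
mean_filled = s.fillna(s.mean())

# Halfway between the neighbouring values 10 and 30, i.e. 20
interpolated = s.interpolate(method="linear")

# Copies the previous day's value, i.e. 10
forward_filled = s.ffill()

# Side-by-side view of the three results
comparison = pd.DataFrame(
    {"mean": mean_filled, "linear": interpolated, "ffill": forward_filled}
)
print(comparison)
```

The same gap is filled with 30, 20, or 10 depending on the method, which is why the choice should follow from what the column represents. Note that `method="linear"` treats rows as equally spaced and does not account for gaps in the dates.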

### Exercise 3

Notice how above we didn't update the variable or use `inplace`, so our DataFrame `df` is still full of missing values. Fix all missing values by applying the following rules:
- Interpolate missing values in the Close and Adj Close columns
- Drop any rows with NaN in the Volume column
- Forward fill missing values in the Open column

Your DataFrame `df` should have no missing values when done. Use `info()` to confirm.

In [None]:
## YOUR CODE GOES HERE

## Saving Data

Now that we've cleaned our data, let's save it by writing it to a new CSV file. We can use pandas' `to_csv()`.

In [None]:
df.to_csv("AAPL_2021_clean.csv")
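
When the file is read back later, the `Date` index comes back as an ordinary column of strings unless we ask pandas to restore it. The sketch below shows one way to do the round trip. It uses made-up data and an in-memory buffer instead of a real file so that it runs anywhere; with a real file you would pass the filename in place of `buffer`.

```python
import io

import pandas as pd

# Toy data (made up) with a named datetime index, like our cleaned df
toy = pd.DataFrame(
    {"Close": [1.0, 2.0]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]),
)
toy.index.name = "Date"

# Write to an in-memory text buffer; the index is written as the first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores the index, parse_dates converts it back to datetimes
reloaded = pd.read_csv(buffer, index_col="Date", parse_dates=["Date"])

print(reloaded)
print(type(reloaded.index))
```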