# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

## Import Data and Drop Missing Values

In this notebook, you will learn how to import the data and remove missing values from your data. The key steps are:
1. [Import the Libraries](#import)
2. [Read Data from a CSV File](#read)
3. [Check and Drop NaN Values](#drop)


<a id='import'></a>
## Import the Libraries
First, we will import the `pandas` and `numpy` libraries for data manipulation and analysis.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

<a id='read'></a>
## Read Data from a CSV File

A CSV (Comma-Separated Values) file stores tabular content as plain text, with the values in each row separated by commas. You can use `pandas.read_csv()` to read a CSV file. We have saved the Gold ETF (GLD) data in OHLC format in a CSV file named `gold_prices.csv`.

To import the data into our notebook, we will use the following lines of code.

Syntax:
```python
import pandas as pd
pd.read_csv(filename, parse_dates=parse_dates, index_col=index_col)
```
1. **filename**: Name of the file in string format, if the file is in the same folder as the notebook. If the CSV file is located in a different folder, specify its relative or full path in string format.
2. **parse_dates**: The column or columns to parse as datetime objects. By default, date columns are read as strings when loading data from a CSV file. To read the date column correctly, pass a list of column names to `parse_dates`, for example `parse_dates=['Date']`. Alternatively, `parse_dates=True` asks pandas to try parsing the index column set through `index_col`.
3. **index_col**: The column name or position that you want to set as the index (positions start at 0).
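
As a minimal sketch of what `parse_dates` and `index_col` change, the snippet below reads the same CSV text twice. It assumes an in-memory string (`csv_text`, a hypothetical stand-in for a file on disk) holding the first two GLD rows shown in this notebook, so it runs without `gold_prices.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
# The two rows are the first two GLD rows shown in this notebook.
csv_text = """Date,Open,High,Low,Close
2013-04-15,136.0,136.75,130.509995,131.309998
2013-04-16,134.899994,135.110001,131.759995,132.800003
"""

# Without parse_dates, the Date column is read as plain strings
# and pandas adds a default integer index
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates and index_col, Date is parsed and becomes the index
parsed = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Date'], index_col='Date')

print(raw['Date'].dtype)
print(type(parsed.index))
```

In the second call the index is a `DatetimeIndex`, which is what allows date-based selection and resampling later on.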

In [2]:
# Read the data
gold_prices = pd.read_csv('../data_modules/gold_prices.csv',
                          parse_dates=['Date'], index_col='Date')
# Display the first five rows of the dataframe
gold_prices.head()

| Date | Open | High | Low | Close |
|---|---|---|---|---|
| 2013-04-15 | 136.0 | 136.75 | 130.509995 | 131.309998 |
| 2013-04-16 | 134.899994 | 135.110001 | 131.759995 | 132.800003 |
| 2013-04-17 | 133.809998 | 134.949997 | 132.320007 | 132.869995 |
| 2013-04-18 | 134.119995 | 135.309998 | 133.619995 | 134.300003 |
| 2013-04-19 | 136.0 | 136.020004 | 134.600006 | 135.470001 |


<a id='drop'></a>
## Check and Drop NaN Values

To detect missing values in a dataframe, you can use the `isna()` function.

Syntax:
```python
df.isna()
```
**df**: the dataframe to be checked for missing values.

This method returns a dataframe of boolean values with the same shape as `df`. Each entry is `True` (counted as 1) where the value is missing and `False` (counted as 0) otherwise. We will chain the `sum()` method to count the missing values in each column, i.e. the number of `True` entries that `isna()` returns per column.

In [3]:
# Count the NaN values in each column
gold_prices.isna().sum()

Open     0
High     0
Low      0
Close    0
dtype: int64

We can see that the output is 0 for all columns. Hence, the `gold_prices` dataset does not have any `NaN` values.
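
Because `gold_prices` has no missing values, every count above is zero. As a rough sketch of what a non-zero result looks like, the snippet below uses a small toy dataframe (`toy`, with hypothetical values that are not taken from `gold_prices.csv`) containing three `NaN` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from gold_prices.csv) with three missing values
toy = pd.DataFrame({
    'Open': [136.0, np.nan, 133.8],
    'Close': [131.3, np.nan, np.nan],
})

# Boolean dataframe: True where a value is missing
mask = toy.isna()

# Missing values per column (each True counts as 1)
missing_per_column = mask.sum()

# Total number of missing values across the whole dataframe
total_missing = int(mask.sum().sum())

print(missing_per_column)
print(total_missing)
```

Here `Open` has one missing value and `Close` has two, so the total is three. Calling `sum()` once gives the per-column counts; calling it a second time adds those counts together.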

If a dataset includes `NaN` values, you can remove them with the `dropna()` method.

Syntax:
```python
df.dropna(axis=0, how='any', inplace=True)
```
1. **axis**: Determines whether rows or columns containing missing values are removed. `0` or `'index'` drops rows that contain missing values; `1` or `'columns'` drops columns that contain missing values.
2. **how**: Specifies how the method decides whether to drop a row or column from the dataframe. `how='any'` drops the row (or column) if any of its values are missing. `how='all'` drops it only if all of its values are missing.
3. **inplace**: bool, default False. With `inplace=True`, the original dataframe is modified "in place" and the method returns `None`. With `inplace=False`, the original dataframe is left unchanged and the method returns a new dataframe, which you need to assign to a variable to keep.
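
The effect of `how` and `inplace` can be sketched on a toy dataframe. The `toy` dataframe and its values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely missing, row 2 is partly missing
toy = pd.DataFrame({
    'Open': [136.0, np.nan, 133.8],
    'Close': [131.3, np.nan, np.nan],
})

# how='any': drop every row with at least one missing value.
# inplace defaults to False, so a new dataframe is returned.
any_dropped = toy.dropna(axis=0, how='any')

# how='all': drop only the rows in which every value is missing
all_dropped = toy.dropna(axis=0, how='all')

# inplace=True modifies toy itself and returns None
toy.dropna(axis=0, how='any', inplace=True)

print(len(any_dropped), len(all_dropped), len(toy))  # prints: 1 2 1
```

With `how='any'` only the first row survives, while `how='all'` also keeps the third row because its `Open` value is present.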

## Conclusion
In this notebook, we learned how to import the libraries and read data from CSV files. We also learned how to check and remove `NaN` values from our dataset.