# EDA Exercise

To see a completed version of this exercise, refer to [`examples/eda-exercise`](https://datasci.rice.edu/deep/curriculum/examples/eda-exercise/).

## Prerequisites

You need a working Python installation. The most convenient distribution for us to use is the Anaconda Distribution, which you can install from https://docs.anaconda.com/anaconda/install/.

After installation, ensure that you can follow these instructions to open Jupyter:
https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html

Once you've launched Jupyter, you should be able to double-click on a `.ipynb` file to open a new kernel.

!!! warning "Working Directory"

    Be careful about where you launch Jupyter and where you download data.
    It's best practice to create a folder for your deep project, keep your data
    there, and launch Jupyter from that folder. If Jupyter is started from a
    different directory, it may be hard to find your data and notebooks!

In [1]:
# it's best practice to have your imports up top so others can immediately know what to install
# if you import more modules, add them here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
plt.rcParams["figure.figsize"] = 10, 6
plt.rcParams["figure.dpi"] = 150

## Acquiring Data

By now you've selected a dataset for this semester.  See below for examples of reading tabular data into Pandas:

In [4]:
# reading from a CSV: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# df = pd.read_csv("data/foo.csv")

# reading from a CSV without a header row: pass the column names via names=
# df = pd.read_csv("data/foo.csv", header=None, names=["date", "company", "valuation"])
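
As a minimal sketch of the headerless case, the snippet below reads a small in-memory CSV instead of a real file. The contents of `raw` are made up for illustration and stand in for `data/foo.csv`.

```python
import io

import pandas as pd

# Hypothetical headerless CSV contents standing in for data/foo.csv.
raw = io.StringIO(
    "2020-01-01,Acme,100\n"
    "2020-02-01,Globex,250\n"
)

# names= supplies the column labels; header=None says the first row is data.
foo = pd.read_csv(
    raw,
    header=None,
    names=["date", "company", "valuation"],
    parse_dates=["date"],  # parse the date column instead of leaving it as text
)

print(foo.dtypes)
```

Passing `parse_dates` is optional here, but it saves a separate conversion step later.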

### Warning

Be sure that you've installed all the dependencies for this project! From your terminal run:

pip:
```shell
$ # make a virtual environment, e.g. mkvirtualenv deep
$ # activate your virtual environment, e.g. workon deep
$ pip install -r requirements.txt
```

conda:
```shell
$ conda env create --name deep --file requirements.txt
$ conda activate deep
```

In [20]:
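# source: https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm
import io
import zipfile

import requests

url = "https://www.cftc.gov/files/dea/history/fut_disagg_txt_hist_2006_2016.zip"

# download the archive into memory and stop early on an HTTP error
r = requests.get(url)
r.raise_for_status()

# unzip straight from memory into ../data
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("../data")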
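
If you download several archives, it may help to wrap the unzip step from the cell above in a small function. This is only a sketch: `extract_zip_bytes` is a hypothetical helper name, not part of `requests` or `zipfile`.

```python
import io
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, dest: Path) -> list:
    """Extract an in-memory zip archive into dest and return the member names."""
    # Create the target folder if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

With the response from the cell above, this would be called as `extract_zip_bytes(r.content, Path("../data"))`. The returned names tell you which files to read next.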

In [21]:
!ls ../data/

F_Disagg06_16.txt


In [23]:
!head -n1 ../data/F_Disagg06_16.txt

"Market_and_Exchange_Names","As_of_Date_In_Form_YYMMDD","Report_Date_as_YYYY-MM-DD","CFTC_Contract_Market_Code","CFTC_Market_Code","CFTC_Region_Code","CFTC_Commodity_Code","Open_Interest_All","Prod_Merc_Positions_Long_All","Prod_Merc_Positions_Short_All","Swap_Positions_Long_All","Swap__Positions_Short_All","Swap__Positions_Spread_All","M_Money_Positions_Long_All","M_Money_Positions_Short_All","M_Money_Positions_Spread_All","Other_Rept_Positions_Long_All","Other_Rept_Positions_Short_All","Other_Rept_Positions_Spread_All","Tot_Rept_Positions_Long_All","Tot_Rept_Positions_Short_All","NonRept_Positions_Long_All","NonRept_Positions_Short_All","Open_Interest_Old","Prod_Merc_Positions_Long_Old","Prod_Merc_Positions_Short_Old","Swap_Positions_Long_Old","Swap__Positions_Short_Old","Swap__Positions_Spread_Old","M_Money_Positions_Long_Old","M_Money_Positions_Short_Old","M_Money_Positions_Spread_Old","Other_Rept_Positions_Long_Old","Other_Rept_Positions_Short_Old","Other_Rept_Positions_Spread_Old

In [25]:
# low_memory=False makes pandas read the whole file before inferring column types,
# which avoids the mixed-type DtypeWarning that messy data can trigger.
# it uses more memory, but we're not worried about performance right now.
df = pd.read_csv("../data/F_Disagg06_16.txt", low_memory=False)

## Structured EDA

### What features are in your dataset?

List of Columns:
https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalViewable/CFTC_023168.html

In [26]:
df.columns

Index(['Market_and_Exchange_Names', 'As_of_Date_In_Form_YYMMDD',
       'Report_Date_as_YYYY-MM-DD', 'CFTC_Contract_Market_Code',
       'CFTC_Market_Code', 'CFTC_Region_Code', 'CFTC_Commodity_Code',
       'Open_Interest_All', 'Prod_Merc_Positions_Long_All',
       'Prod_Merc_Positions_Short_All',
       ...
       'Conc_Net_LE_4_TDR_Long_Other', 'Conc_Net_LE_4_TDR_Short_Other',
       'Conc_Net_LE_8_TDR_Long_Other', 'Conc_Net_LE_8_TDR_Short_Other',
       'Contract_Units', 'CFTC_Contract_Market_Code_Quotes',
       'CFTC_Market_Code_Quotes', 'CFTC_Commodity_Code_Quotes',
       'CFTC_SubGroup_Code', 'FutOnly_or_Combined'],
      dtype='object', length=191)

Note that the dataset is in a "wide" format: related measurements are spread across separate columns, and the column name itself carries a value, e.g. the 4 in "Conc...4" vs. the 8 in "Conc...8". More info on data organization:

* https://en.wikipedia.org/wiki/Wide_and_narrow_data
* https://vita.had.co.nz/papers/tidy-data.pdf
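
One way to move from wide to long is `DataFrame.melt`. The sketch below uses a toy frame with two of the real column names and made-up values. The name `n_traders` is an assumption about what the 4 and 8 mean (presumably a number of largest traders), so check the column documentation before relying on it.

```python
import pandas as pd

# Toy stand-in for two of the wide concentration columns; the values are made up.
wide = pd.DataFrame(
    {
        "Report_Date_as_YYYY-MM-DD": ["2016-01-05", "2016-01-12"],
        "Conc_Net_LE_4_TDR_Long_Other": [10.0, 12.5],
        "Conc_Net_LE_8_TDR_Long_Other": [18.0, 20.5],
    }
)

# Stack the two value columns into (date, column, concentration) rows.
long = wide.melt(
    id_vars="Report_Date_as_YYYY-MM-DD",
    var_name="column",
    value_name="concentration",
)

# Pull the 4 or 8 out of the old column name into its own field (assumed meaning).
long["n_traders"] = long["column"].str.extract(r"LE_(\d+)_", expand=False).astype(int)

print(long)
```

In the long layout, comparing the 4 and 8 variants becomes a filter or a `groupby` on `n_traders` rather than a choice between column names.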

### What type is each feature?
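
A possible starting point, sketched on a toy frame with made-up values. In the notebook you would call the same methods on `df` directly.

```python
import pandas as pd

# Toy frame with made-up values; the column names come from the dataset.
sample = pd.DataFrame(
    {
        "Market_and_Exchange_Names": ["WHEAT", "CORN"],
        "Open_Interest_All": [400000, 1300000],
        "Report_Date_as_YYYY-MM-DD": ["2016-01-05", "2016-01-05"],
    }
)

# Dates usually load as plain strings, so convert them explicitly.
sample["Report_Date_as_YYYY-MM-DD"] = pd.to_datetime(sample["Report_Date_as_YYYY-MM-DD"])

# One dtype per column, then the names of the numeric columns only.
print(sample.dtypes)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
```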

### Distribution of each feature?
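
For a numeric feature, summary statistics and bin counts are a quick first look. The values below are made up and skewed on purpose. The log scale is one reasonable choice for heavy-tailed quantities, not a requirement.

```python
import numpy as np
import pandas as pd

# Made-up open-interest values with one large outlier.
values = pd.Series([120, 150, 160, 180, 200, 220, 250, 900], name="Open_Interest_All")

# Count, mean, standard deviation, min, quartiles, and max.
summary = values.describe()

# Bin counts on a log10 scale, which spreads out the small values.
counts, edges = np.histogram(np.log10(values), bins=4)

print(summary)
print(counts)
```

In the notebook, `values.hist(bins=4)` draws the histogram directly. For categorical features, `Series.value_counts()` plays the same role.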

### What do the numeric features represent? Counts? Measurements?
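
The column documentation is the real answer here, but a rough check can flag candidates. The helper below is a hypothetical heuristic, not a pandas function: it treats a column as count-like if every non-missing value is a non-negative whole number.

```python
import pandas as pd


def looks_like_count(series: pd.Series) -> bool:
    """Heuristic: True if every non-missing value is a non-negative whole number."""
    # Coerce non-numeric entries to NaN, then ignore missing values.
    values = pd.to_numeric(series, errors="coerce").dropna()
    if values.empty:
        return False
    return bool(((values >= 0) & (values % 1 == 0)).all())


print(looks_like_count(pd.Series([0, 3, 7])))
print(looks_like_count(pd.Series([0.5, 2.0])))
```

A column that passes may still be a code or an identifier rather than a count, so treat the result as a prompt to read the documentation.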

### What are the pairwise relationships between numeric features?
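
A correlation matrix is one compact summary of pairwise relationships. The sketch below uses three real column names with made-up values chosen to be perfectly linear.

```python
import pandas as pd

# Made-up values for three numeric columns.
sample = pd.DataFrame(
    {
        "Open_Interest_All": [100, 200, 300, 400],
        "Prod_Merc_Positions_Long_All": [10, 20, 30, 40],
        "Prod_Merc_Positions_Short_All": [40, 30, 20, 10],
    }
)

# Pearson correlation between every pair of numeric columns.
corr = sample.select_dtypes(include="number").corr()

print(corr.round(2))
```

Correlation only captures linear relationships, so it is worth pairing with scatter plots, e.g. `pd.plotting.scatter_matrix(sample)` in the notebook.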

## Brainstorming

Let's take a step back and connect your dataset to its real-world context.  

Consider what these features and values actually represent:

* Is there anything unexpected about the features?
* What biases or thoughts did you have about this topic before exploring the data? List off some "facts" that you think are true about your topic.
* What motivated you to choose this dataset?
* What insights or questions are you investigating with this dataset?
* Now that you've explored each of the features, which might be useful to you in your investigation?

Our goal in EDA is to reconcile your perspective of the data / topic with the *truth* of the dataset.

## Open-Ended EDA

EDA is an iterative process.  It begins with answering initial questions which lead to more questions.  Using some of the brainstorming above, come up with at least one concrete investigation into your dataset.  This might be inspecting a specific irregularity, questioning a personal bias, or identifying a specific relationship between two features.  

To do this, you'll likely need to select a subset of your dataset, transform it into a simpler format, and finally visualize or summarize it.  Visualizations are **highly** encouraged at this point!  It's much easier to understand relationships visually.
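
The select, transform, summarize pattern might look like the sketch below. The rows are made up, and `net_position` is a hypothetical derived column (long minus short), chosen only to illustrate the steps.

```python
import pandas as pd

# Made-up rows with two markets and two report dates.
sample = pd.DataFrame(
    {
        "Market_and_Exchange_Names": ["WHEAT", "WHEAT", "CORN", "CORN"],
        "Report_Date_as_YYYY-MM-DD": pd.to_datetime(
            ["2016-01-05", "2016-01-12", "2016-01-05", "2016-01-12"]
        ),
        "M_Money_Positions_Long_All": [50, 70, 90, 80],
        "M_Money_Positions_Short_All": [30, 20, 60, 85],
    }
)

# 1. Select: keep the rows for one market.
wheat = sample.loc[sample["Market_and_Exchange_Names"] == "WHEAT"].copy()

# 2. Transform: derive a net position (long minus short).
wheat["net_position"] = (
    wheat["M_Money_Positions_Long_All"] - wheat["M_Money_Positions_Short_All"]
)

# 3. Summarize: a date-indexed series, ready for plotting.
net_by_date = wheat.set_index("Report_Date_as_YYYY-MM-DD")["net_position"]

print(net_by_date)
```

In the notebook, `net_by_date.plot()` turns the result into a line chart.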