# Example Notebook

As mentioned in the readme file, our first task is to create a repository of reproducible data science code-snippets. 
We'd like to methodically download datasets, Jupyter notebooks, and all relevant metadata from Kaggle.com (an online community of data scientists). <br>
Then, we'd like to parse this repository into tsv files that will be used as input to our models in the following stages. <br>
This notebook's purpose is to demonstrate the data gathering process - downloading the notebooks and data, and parsing it into usable files.

Keep in mind that it's better to use the relevant functions from the python files directly.

All of the functions that are used in this notebook are well-documented inside the relevant python files. 

## Step 1 - Create a list of relevant Kaggle datasets and competitions

First we need to decide which datasets we would like to download from Kaggle.com. <br>
The datasets in Kaggle are tagged with related subjects.
We set the wanted tags in ``search_terms`` (a string in which the tags are separated by spaces).
The tags from ``search_terms`` are searched via the Kaggle API to collect the datasets that match all of the tags.
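
The "match all of the tags" rule can be illustrated with a small sketch. The helper name `matches_all_tags` is hypothetical and is not part of `kaggleapi`; the real search is performed by ```get_kaggle_metadata```, which may treat tags differently.

```python
def matches_all_tags(search_terms, dataset_tags):
    """Return True if every tag in the space-separated search string
    appears among the dataset's tags (case-insensitive)."""
    wanted = set(search_terms.lower().split())
    available = {tag.lower() for tag in dataset_tags}
    return wanted.issubset(available)


# Illustrative tag list for a single competition.
tags = ["finance", "time series", "money", "news agencies"]

print(matches_all_tags("finance", tags))         # one tag
print(matches_all_tags("finance money", tags))   # both tags are present
print(matches_all_tags("finance sports", tags))  # "sports" is missing
```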

Using ```get_kaggle_metadata``` we create 2 txt files for later use:
* **consts.NOT_VALID_DATA_LINKS_NAME** - list of non-relevant datasets, so we won't search for them again
* **consts.DATA_LINKS_NAME** - list of relevant datasets that will be downloaded using "download.py"

You may change the paths to those files (and others) in [consts.py](https://github.com/TAU-DB/guided-ds/blob/master/data_gathering/consts.py).

Our files are currently located in the [assets](https://github.com/TAU-DB/guided-ds/tree/master/data_gathering/assests) folder.

We can decide whether to get info on regular datasets, competitions, or both, by setting the ```comp``` and ```ds``` parameters.<br>
We can also set the minimum number of notebooks that a dataset must have in order to be considered, using ```num```.
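
As a rough sketch of how a minimum-notebook threshold could split candidates into the two lists described above: the function name, the example counts, and the choice of an inclusive threshold (`>=`) are all assumptions made for illustration, not the behavior of ```get_kaggle_metadata```.

```python
def split_by_notebook_count(notebook_counts, num):
    """Split dataset names into (relevant, not_relevant) lists.

    Assumption: a dataset is kept when it has at least `num` notebooks.
    """
    relevant, not_relevant = [], []
    for name, count in notebook_counts.items():
        if count >= num:
            relevant.append(name)
        else:
            not_relevant.append(name)
    return relevant, not_relevant


# Hypothetical notebook counts, for illustration only.
counts = {"dataset-a": 250, "dataset-b": 4, "dataset-c": 10}
relevant, not_relevant = split_by_notebook_count(counts, num=10)

print("relevant:", relevant)
print("not relevant:", not_relevant)
```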


In [1]:
import kaggleapi
search_terms = "finance"    # example for multiple tags: "finance money"
print("collecting relevant datasets links...")
print("***************************")
kaggleapi.get_kaggle_metadata(search_terms, comp=True, ds=False, num=10)   # we only want to get competitions (to make example shorter)
print("Finished collecting links")

consts imported
collecting relevant datasets links...
***************************
Found 12 competitions
V - 0: two-sigma-financial-news
V - 1: santander-value-prediction-challenge
V - 2: two-sigma-financial-modeling
X - 3: Not enough notebooks
V - 4: donorschoose-application-screening
V - 5: elo-merchant-category-recommendation
V - 6: santander-customer-transaction-prediction
V - 7: bnp-paribas-cardif-claims-management
V - 8: santander-customer-satisfaction
V - 9: santander-product-recommendation
V - 10: home-credit-default-risk
V - 11: sberbank-russian-housing-market
Finished collecting links


Let's look at some of the datasets that were written to the data_links file:

In [2]:
import consts

with open(consts.DATA_LINKS_NAME, "r") as f:
    lines = f.readlines()
print(lines[1])
print(lines[2])
print(lines[3])

/c/santander-value-prediction-challenge/

/c/two-sigma-financial-modeling/

/c/donorschoose-application-screening/
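
Each line of the links file is a site-relative path that ends with a newline, which is why the printed entries above are followed by blank lines. A minimal sketch of turning such a line into a full URL and a dataset name follows; the helper names are hypothetical, and ```read_links()``` may do this differently.

```python
KAGGLE_BASE_URL = "https://www.kaggle.com"


def link_to_url(line):
    """Turn one line of the links file into a full Kaggle URL."""
    path = line.strip().strip("/")  # drop the newline and the surrounding slashes
    return f"{KAGGLE_BASE_URL}/{path}"


def link_to_name(line):
    """Return the last path component, e.g. the competition name."""
    return line.strip().strip("/").split("/")[-1]


line = "/c/two-sigma-financial-news/\n"
print(link_to_url(line))
print(link_to_name(line))
```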



## Step 2 - Download datasets, notebooks and metadata

Using the lists that we created, we get the download links using ```read_links()```. <br>
Then, in ```download_data_and_kernels_link``` we use WebDriver to scrape the Kaggle website, download the datasets and relevant metadata, and get links for all of the Jupyter notebooks that are associated with the datasets.
Finally, in ```pull_kernels()``` we use the Kaggle API to pull (download) all of the notebooks (kernels).

<em>**Important Note** - You need to set up your Kaggle username and password in [consts.py](https://github.com/TAU-DB/guided-ds/blob/master/data_gathering/consts.py) as ```KAGGLE_USER``` and ```KAGGLE_PW```.</em>
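
For orientation, pulling a single kernel with the official Kaggle command-line tool looks roughly like the command built below. This is only a sketch: the helper name and the example kernel reference and target folder are assumptions, and ```pull_kernels()``` may call the API in a different way.

```python
import shlex


def build_pull_command(kernel_ref, target_dir):
    """Build (but do not run) a Kaggle CLI command that pulls one kernel.

    kernel_ref has the form "<username>/<kernel-name>".
    """
    return ["kaggle", "kernels", "pull", kernel_ref, "-p", target_dir]


# Example reference and folder, used here only for illustration.
command = build_pull_command("oriormeir/xgboost-2-market-news", "kernels")
print(shlex.join(command))
```

To actually run the command, pass the list to `subprocess.run(command, check=True)` on a machine where the Kaggle CLI is installed and configured.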


In [3]:
from download import read_links
from download import download_data_and_kernels_link
from download import pull_kernels

# we set earlystop for the datasets and notebooks downloading functions
# otherwise it will be a very long run

link_list = read_links()
print("Datasets links were read...")
print("***************************")
print("Downloading datasets and metadata...")
print("***************************")
download_data_and_kernels_link(link_list, stop_nb=False, earlystop=1)    # we only download all data for 1 dataset (to make example shorter)
print("***************************")
print("Downloading notebooks")
print("***************************") 
pull_kernels(earlystop=10)                         # we only download 10 notebooks for this dataset (to make example shorter)
print("***************************")
print("Finished example downloading")

consts imported
Datasets links were read...
***************************
Downloading datasets and metadata...
***************************
Got total of 11 datasets
Starting with dataset 1
Dataset name is two-sigma-financial-news
Getting data from https://www.kaggle.com/c/two-sigma-financial-news
Chrome will save to - C:\Workspace\guided-ds\Example_datasets\two-sigma-financial-news\input
Opening https://www.kaggle.com/c/two-sigma-financial-news
Browsing again to requested link (After login may site may redirect)
['finance', 'time series', 'money', 'news agencies']
Got competition description
Got competition evaluation
Chrome will save to - C:\Workspace\guided-ds\Example_datasets\two-sigma-financial-news\input
Opening https://www.kaggle.com/c/two-sigma-financial-news/data
Browsing again to requested link (After login may site may redirect)
Downloading two-sigma-financial-news - Browser will remain open
Chrome will save to - C:\Workspace\guided-ds\Example_datasets\two-sigma-financial-news
O

Our downloaded datasets and notebooks, along with the relevant metadata, are stored in the [Datasets](https://github.com/TAU-DB/guided-ds/tree/master/datasets) folder. The notebooks are located within the 'kernels' folder inside the folder of the relevant dataset.

## Step 3 - Parse the downloaded data into usable tsv files

Now, we would like to generate usable tsv files to train our models (for the [workflow stage classifier](https://github.com/TAU-DB/guided-ds/tree/master/Classification)). We'll create three files:
* **datasets.tsv** - contains for each dataset: its name, description, evaluation method (if exists), tags
* **notebooks.tsv** - contains for each notebook: its name, username (of author), the relevant dataset name, score (if exists)
* **cells.tsv** - contains for each cell: unique cell id, relevant notebook, username (of author), cell's source code, output (initially empty, needs to be executed), execution count.
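
To make the shape of cells.tsv more concrete, here is a minimal sketch that extracts the code cells of a parsed `.ipynb` file and writes one tab-separated row per cell. The column names, the example notebook, and the use of the `csv` module are assumptions; the project's own logic lives in `nb_csv_generator`.

```python
import csv
import io


def code_cells(notebook_json):
    """Yield (source, execution_count) for each code cell of a parsed .ipynb file."""
    for cell in notebook_json.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        yield source, cell.get("execution_count")


# A tiny in-memory notebook; json.load() on a real .ipynb file gives the same structure.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "source": ["import pandas as pd\n", "df = pd.read_csv('train.csv')"],
        },
    ]
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
# Hypothetical column names that mirror the fields listed above.
writer.writerow(["cell_id", "notebook", "username", "source", "output", "execution_count"])
for cell_id, (source, count) in enumerate(code_cells(notebook)):
    writer.writerow([cell_id, "example-notebook", "example-user", source, "", count])

# Read the rows back to check that multi-line source code survives the round trip.
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
print(rows[1])
```

The `csv` module quotes fields that contain newlines, so multi-line cells stay in a single row.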

To train our models we mainly used cells.tsv.
We later add more data to this file such as the code's AST and masked representation (see [Documentation](https://github.com/TAU-DB/guided-ds/tree/master/Documentation)).
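
As an illustration of what "the code's AST" can mean for a cell, the sketch below uses Python's standard `ast` module. The helper name is hypothetical, and the exact AST and masked representations used in the project are described in the Documentation folder and may differ from this dump format.

```python
import ast


def source_to_ast_text(source):
    """Return a one-line textual dump of the AST, or None if the source does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # For example, cells that contain IPython magics such as "%matplotlib inline".
        return None
    return ast.dump(tree)


ast_text = source_to_ast_text("import pandas as pd")
print(ast_text)
print(source_to_ast_text("%matplotlib inline"))
```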

In [4]:
from ds_csv_generator import generate_ds_tsv
from nb_csv_generator import generate_nb_tsv

print("***************************")
print("Creating datasets.tsv...")
print("***************************")
generate_ds_tsv()
print("***************************")
print("Creating notebooks.tsv , cells.tsv ...")
print("***************************")

# again, we set earlystop for the datasets and notebooks tsvs generator
# otherwise it will be a very long run

generate_nb_tsv(earlystop=3)
print("***************************")
print("Finished creating example tsvs")

consts imported
consts imported
***************************
Creating datasets.tsv...
***************************
Done, Saved at C:\Workspace\guided-ds\Example_Data\datasets.tsv
***************************
Creating notebooks.tsv , cells.tsv ...
***************************
Loading datasets from:  C:\Workspace\guided-ds\Example_datasets
0 --- Start with C:\Workspace\guided-ds\Example_datasets\two-sigma-financial-news 

Starting notebook 0 2.7857---oriormeir---xgboost-2-market-news.ipynb
Found 10 cells
Finished notebook number 0  - with total cells of 10
So far gathered 10 of cells
Starting notebook 1 2.89408---alluxia---lb-0-6326-tuned-xgboost-baseline.ipynb
Found 25 cells
Finished notebook number 1  - with total cells of 21
So far gathered 31 of cells
Starting notebook 2 3.01095---charleslandau---iterative-approach.ipynb
Found 16 cells
Finished notebook number 2  - with total cells of 15
***************************
Finished creating example tsvs


You can see what the parsed files look like in the [Data](https://github.com/TAU-DB/guided-ds/tree/master/Data) folder.
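
The notebook file names in the output above appear to follow a `<score>---<username>---<name>.ipynb` pattern. Assuming that reading is correct, a small sketch for splitting such a name into its parts could look like this (the helper is hypothetical and not part of `nb_csv_generator`):

```python
def parse_notebook_filename(filename):
    """Split '<score>---<username>---<name>.ipynb' into (score, username, name).

    Assumption: the three parts are separated by '---'; a score that is
    missing or not numeric is returned as None.
    """
    stem = filename[: -len(".ipynb")] if filename.endswith(".ipynb") else filename
    score_text, username, name = stem.split("---", 2)
    try:
        score = float(score_text)
    except ValueError:
        score = None
    return score, username, name


score, username, name = parse_notebook_filename(
    "2.7857---oriormeir---xgboost-2-market-news.ipynb"
)
print(score, username, name)
```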

## Step 4 - Workflow stage classification

In the next stage of our project, we used the generated cells.tsv file to train a workflow stage classifier.
See the [Classification](https://github.com/TAU-DB/guided-ds/tree/master/Classification) folder for more details.

Using the classifier, we assigned each cell to its relevant data science workflow stage, adding a 'Label' for each cell in the cells.tsv file.

For future use, you may use the pretrained classifier model to tag new cells [Here](https://github.com/TAU-DB/guided-ds/blob/master/Classification/Classification.ipynb).

## Step 5 - Create chatbot's input files

The final (and main) stage of our project is to create a chatbot that will generate next-line recommendations for data scientists.
In order to train a chatbot model we need an input file of input and output pairs (separated by a tab).
During our research we used many different representations as input and output (see [Documentation](https://github.com/TAU-DB/guided-ds/tree/master/Documentation)). 

Here, we create the following files:
* **Source code pairs** - input cell source code \t output cell source code \n
* **AST pairs** - input cell code's AST \t output cell code's AST \n
* **Masked pairs** - input cell masked representation \t output cell masked representation \n
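
The pair format above can be sketched as follows. Two things here are assumptions made for illustration: that each pair is built from two consecutive cells of the same notebook, and that tabs and newlines inside a cell are escaped so that every pair fits on one line. The functions in `convert_to_chatbot_input` may handle both points differently.

```python
def escape(text):
    """Keep a cell on one line: raw tabs and newlines would break the pair format."""
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def consecutive_pairs(cells):
    """Return one 'input<TAB>output' line for each pair of neighbouring cells."""
    return [f"{escape(a)}\t{escape(b)}" for a, b in zip(cells, cells[1:])]


# Illustrative cell sources from a single notebook.
cells = [
    "import pandas as pd",
    "df = pd.read_csv('train.csv')\ndf.head()",
    "df.describe()",
]
pairs = consecutive_pairs(cells)
for pair in pairs:
    print(pair)
```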

For our final recommendation engine we created a file of: 3-lines masked representation \t next-line masked representation \n <br>using ```cells_to_masked_lines```.
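
The "3 lines in, next line out" format amounts to a sliding window over the lines of a notebook. In the sketch below the lines are placeholder strings standing in for already-masked lines, and joining the three context lines with a space is an assumption; ```cells_to_masked_lines``` defines the actual format.

```python
def line_windows(lines, context=3):
    """Yield (context_lines, next_line) for every position with enough history."""
    for i in range(context, len(lines)):
        yield lines[i - context:i], lines[i]


# Placeholder strings standing in for masked code lines.
masked_lines = ["line_a", "line_b", "line_c", "line_d", "line_e"]

# Assumption: the context lines are joined with a single space.
samples = [f"{' '.join(ctx)}\t{nxt}" for ctx, nxt in line_windows(masked_lines)]
for sample in samples:
    print(sample)
```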

You may create any input and output pairs, separated by tabs, and try to train our chatbot using them [here](https://github.com/TAU-DB/guided-ds/blob/master/Chatbot/Jupyter_Cells_Chatbot_Model.ipynb).

<em> Note: execution for all notebooks and cells that were downloaded may take a while</em> 

In [1]:
from convert_to_chatbot_input import cells_to_masked_pairs
from convert_to_chatbot_input import cells_to_ast_pairs
from convert_to_chatbot_input import cells_to_source_pairs
from convert_to_chatbot_input import cells_to_masked_lines

print("***************************")
print("Creating Source code pairs...")
print("***************************")
cells_to_source_pairs()
print("***************************")
print("Creating AST pairs...")
print("***************************")
cells_to_ast_pairs()

print("***************************")
print("Creating Masked pairs...")
print("***************************")

# we create the masked pairs file here for input cells with 'Explore' label, and we don't keep just unique pairs
cells_to_masked_pairs(label='Explore', unique=False, ptg=False) 

print("***************************")
print("Creating Masked line pairs...")
print("***************************")

# again, we set earlystop for the line pairs generator
# otherwise it will be a very long run
cells_to_masked_lines(load=False, earlystop=3, ptg=False)
print("***************************")
print("Finished creating chatbot input example files")

consts imported
***************************
Creating Source code pairs...
***************************
Cells number before cleaning: 43
Cells number after cleaning: 38
Found diffrent sources of: 0
Done. Kept: 0.0 %
***************************
Creating AST pairs...
***************************
Cells number before cleaning: 43
Cells number after cleaning: 40
Found diffrent sources of: 0
Done. Kept: 0.0 %
***************************
Creating Masked pairs...
***************************
Starts with 43
After initial cleaning 43
Cells number before cleaning: 43
Before filtering by label - total cells was 15
Cells number after cleaning: 3
***************************
Creating Masked line pairs...
***************************
Saving notebook and labels lists...
Notebooks size is 3
Labels list size is 3
Total 3 notebooks
Starts with notebook number -  1
Starts with notebook number -  2
Starts with notebook number -  3
Clearing any remains inside buffer -Prep. Remaining length is955
Clearing any rema

Again, you can see what the parsed files should look like in the [Data](https://github.com/TAU-DB/guided-ds/tree/master/Data) folder. <br>

The files are saved to the paths defined in consts.
Specifically, the files that we used are generated by the ```cells_to_masked_lines``` function and are stored in ```consts.LOAD_TSV```, ```consts.PREP_TSV```, etc. (one for each stage).

More explanations about the content of each file are in the [project report](https://github.com/TAU-DB/guided-ds/blob/master/Documentation/ProjectReport_18-1-1-1638.pdf) or in the [chatbot training notebook](https://github.com/TAU-DB/guided-ds/blob/master/Chatbot/Jupyter_Cells_Chatbot_Model.ipynb).

## Extra Step - Execute notebooks to get output

We didn't use the cell's output to train our models, but it could be useful as an additional feature for future purposes.
The following script automatically executes each notebook. It was tested on a Windows 10 machine (unfortunately, there is no guarantee that it works on Linux/Unix machines).

<em>**Important note** - the first argument of the function is the URL of an active Jupyter server on localhost (for example, one started from your conda environment). Start it by running the "jupyter notebook" command from the main folder of the project.</em>

We set ```to_execute=1```, meaning this will execute all of the notebooks for 1 dataset.
Selenium will control your browser and run the notebooks.
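
To show how a server URL with a token relates to the address of a single notebook, here is a hedged sketch. It assumes the classic Jupyter Notebook interface, which serves notebooks under `/notebooks/<path>`; the helper name, the example token, and the example path are hypothetical, and how ```execute_notebooks``` navigates the browser is not shown here.

```python
from urllib.parse import parse_qs, quote, urlsplit


def notebook_url(server_url, notebook_path):
    """Build the URL of one notebook from a server URL that carries a token."""
    parts = urlsplit(server_url)
    token = parse_qs(parts.query).get("token", [""])[0]
    # quote() keeps "/" and percent-encodes characters such as spaces.
    url = f"{parts.scheme}://{parts.netloc}/notebooks/{quote(notebook_path)}"
    return f"{url}?token={token}" if token else url


# Placeholder token and path, for illustration only.
url = notebook_url(
    "http://localhost:8889/?token=abc123",
    "datasets/example/kernels/example notebook.ipynb",
)
print(url)
```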

In [None]:
from execute_nb import execute_notebooks

execute_notebooks("http://localhost:8889/?token=f3dec8fa35423b846ef3ecccfb6d584ac1ef844974af0c65", to_execute=1)