# Purpose
This notebook is a tutorial on the steps, formatting, and other conventions for the lung disease detection project. It is split into multiple sections to make it easier to read, reuse, and navigate. You can reuse the headings of the following sections as-is in an actual notebook. 

# About this notebook
The first step when writing a new notebook is to give it a **meaningful name**. Change the name of the notebook so that it has the format 'dataset_architecture_trainingStyle'. Assuming that this notebook is for the pneumonia dataset and implements an AlexNet model from scratch, it would be called "pneumonia_alexnet_custom". <br><br>
The next step is to state the same details clearly at the very beginning of the notebook. Use a **table** for these details. Text cells in Colab support both Markdown and HTML, so use whichever is more comfortable for you. It should look something like the next cell. 

<table width = "80%">
  <thead>
    <th>Characteristic</th>
    <th>Value</th>
  </thead>
  <tr>
    <td> Dataset </td>
    <td> Pneumonia </td>
  </tr>
  <tr>
    <td> Architecture </td>
    <td> AlexNet </td>
  </tr>
  <tr>
    <td> Training </td>
    <td> Custom </td>
  </tr>
</table>

Make sure to include **links** for things like the dataset source and architecture paper. 

# Main Setup
The next step is to import important **libraries** and declare **global variables** that you'll use throughout the code. <br><br>
You don't have to know *all* the libraries that you might need for the project at the very beginning; you can come back and add them later. Just make sure that they're all in the same cell. <br><br>
It can also be helpful to add a **print statement** at the end of cells that don't otherwise produce output, so you know whether and when the cell ran correctly. The next cell is an example. 

In [None]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt

var1 = "some global string"
path1 = "./directory/subdirectory/file_name"

print("setup complete")

# Initial requirements
The pneumonia dataset is on Kaggle but can be **downloaded** into a Colab notebook pretty quickly. Check this [link](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/) for more information. The datasets are linked in the proposal. Download the pneumonia dataset for this tutorial. <br> <br>
Once the dataset is loaded, you might notice that the train-val split is badly unbalanced: the train directory has over 5000 examples and the val directory has only 16. This needs to be changed. Here are the steps that you can take to do that (if you think you can do it more efficiently, please do!):
1. Create a directory for the 'working_dataset' by running the command-line instruction **mkdir** with a `!` in front of it (shown in the example cell after these steps). Check out this [notebook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb) on IPython shell commands for more information. 
2. Create subdirectories for train and val images. In each of those, create subdirectories for NORMAL and PNEUMONIA. 
3. Move the test directory to the 'working_dataset' as is. You can use the **mv** command for this. 
4. Make a list of all the files in the train and val directories of the Kaggle dataset. Each element of the list should be the full path of a file. 
5. Shuffle this list randomly (use the random library functions).
6. Split the list in an appropriate ratio. About 500 images for the val dataset seems reasonable, but pick whatever you feel makes sense. 
7. Transfer the files to the appropriate locations in the 'working_dataset'.
8. **Move the working dataset to your drive**. Make sure your drive is mounted to the notebook first. This step is important: without it, you'll have to redo the split every time you run the notebook. 
9. Create a folder in your drive to store the weights. 

This step is really important to get right, but the good news is that you only have to do it in one notebook. Once the 'working_dataset' is on your drive, you can use it everywhere else.<br><br>Remember to add **comments** in these cells so you can come back to them later if required and know what they're doing.
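
Steps 4 to 7 above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's required implementation: the function name `split_dataset`, its parameters, the fixed seed, and the list of image extensions are all made up for illustration. It copies files rather than moving them so the original download stays intact, and it assumes file names are unique within each class. The demo at the bottom runs on a throwaway directory of empty placeholder files.

```python
import os
import random
import shutil
import tempfile

CLASSES = ("NORMAL", "PNEUMONIA")
# Assumption: images use one of these extensions; any other file is skipped.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def split_dataset(source_root, dest_root, val_size=500, seed=0):
  '''
  Function to re-split the train and val images into a new directory.
  Parameters:
    source_root => directory holding train/ and val/, each with NORMAL/ and PNEUMONIA/
    dest_root => directory in which the new train/ and val/ folders are created
    val_size => number of images to place in the new val split
    seed => seed for the shuffle, so the same split can be reproduced
  Returns:
    counts => dictionary with the number of files copied into each split
  '''
  # Step 4: collect the full path of every image in the original train and val folders
  files = []
  for split in ("train", "val"):
    for cls in CLASSES:
      folder = os.path.join(source_root, split, cls)
      if not os.path.isdir(folder):
        continue
      for name in sorted(os.listdir(folder)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
          files.append(os.path.join(folder, name))

  # Step 5: shuffle the list (a seeded generator keeps the result repeatable)
  random.Random(seed).shuffle(files)

  # Step 6: the first val_size files go to val, the rest to train
  new_splits = {"train": files[val_size:], "val": files[:val_size]}

  # Step 7: copy each file into dest_root/<split>/<class>/
  counts = {}
  for split, paths in new_splits.items():
    for cls in CLASSES:
      os.makedirs(os.path.join(dest_root, split, cls), exist_ok=True)
    for path in paths:
      cls = os.path.basename(os.path.dirname(path))  # class is the parent folder name
      shutil.copy2(path, os.path.join(dest_root, split, cls, os.path.basename(path)))
    counts[split] = len(paths)
  return counts


# Demo on a throwaway directory of empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
  source = os.path.join(tmp, "kaggle_dataset")
  for split, n in (("train", 10), ("val", 2)):
    for cls in CLASSES:
      folder = os.path.join(source, split, cls)
      os.makedirs(folder)
      for i in range(n):
        open(os.path.join(folder, f"{split}_{cls}_{i}.jpeg"), "w").close()

  print(split_dataset(source, os.path.join(tmp, "working_dataset"), val_size=6))
```

In a real notebook you would drop the demo, point `source_root` at the downloaded dataset, and keep the default `val_size` of 500 or pick your own. Because the shuffle mixes both classes in one list, the class balance of the new val split is only approximately that of the full dataset.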

In [None]:
# example to make a directory and then move it
! mkdir some_directory
! mv /content/some_directory /content/drive/MyDrive

When writing a function for anything in a notebook, make sure you include a description of the function at the beginning. This should cover the parameters it requires and the value(s) it returns. See the next cell for an example. 

In [None]:
def some_function(some_parameter):
  '''
  Function to find the square of an integer. 
  Parameters:
    some_parameter => an integer whose square we need to find
  Returns:
    res => the square number
  '''
  res = some_parameter**2
  return res

ans = some_function(3)
print(ans)