# Unit 1: Interactive notebooks & tabular data objects
---

## Goals of this lecture:
* Understand what interactive computing means
* Understand how Jupyter Notebook works
* Tabular data building blocks: NumPy arrays, pandas Series, and pandas DataFrames


1. [Introducing Notebooks and Markdown](#section1)
2. [NumPy arrays](#section2)
3. [Introducing Pandas](#section3)


<a id='section1'></a>

## 1. Interactive computing: Jupyter Notebooks

* Interactive computing means running a program with a "human in the loop": you run some code, look at the result, and decide what to run next.
* A notebook combines text (written in Markdown) with code that you can run (written in Python).
* Run code, display and explain results.
* **Tell a story**.  

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/tell%20a%20story.jpg?raw=true" width="500"/></center> 
</div>



### ipynb

This notebook is in ipynb format.  
> ipynb stands for: Interactive Python NoteBook

We'll run ipynb files in JupyterLab. You can also run them using [Colab](https://colab.research.google.com/).

Advanced content:
* ipynb files are stored as JSON and can be opened with any text editor
* If you want to read more, [look here](https://jupyterlab.readthedocs.io/en/latest/user/notebook.html#notebook).
* Learn more here: [Jupyter: Thinking and Storytelling With Code and Data](https://ieeexplore.ieee.org/abstract/document/9387490)
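
To make the JSON point above concrete, here is a minimal sketch. It builds a tiny, hand-written notebook (hypothetical content, not a real course file), turns it into JSON text, and reads the cell types back. The field names follow the notebook format version 4; a real `.ipynb` file contains more metadata than shown here.

```python
import json

# A hypothetical, minimal notebook written by hand for illustration.
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hello world')"],
         "outputs": [], "execution_count": None},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

# Parse the JSON text, as you could with the contents of a real .ipynb file.
notebook = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)  # ['markdown', 'code']
```

With a real file you would read the text from disk first and pass it to `json.loads` in the same way.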

### Markdown

> Markdown is an easy markup language - a language for formatting text. 

Double-click a cell to enter **edit mode**.  
Press Shift + Enter to run it.

Advanced content: 
* [More on Markdown](http://daringfireball.net/projects/markdown/syntax)
* When Markdown is not enough, use HTML


### Keyboard shortcuts - command mode

* Scroll up and down your cells with your Up and Down keys.  
* Press A or B to insert a new cell above or below the active cell.  
* M will transform the active cell to a Markdown cell.  
* Y will set the active cell to a code cell.  
* D + D (D twice) will delete the active cell.  
* Z will undo cell deletion.  
* Hold Shift and press Up or Down to select multiple cells at once.
* With multiple cells selected, Shift + M will merge your selection. 

### Keyboard shortcuts - edit mode

* Ctrl + Enter to run the current cell.  
* Shift + Enter to run the current cell and move to the next cell (or create a new one if there isn’t a next cell)  
* Alt + Enter to run the current cell and insert a new cell below.  
* Ctrl + Shift + - will split the active cell at the cursor.  
* Ctrl + Click to create multiple cursors within a cell.  

### Markdown syntax

#  level 1 heading

##  level 2 heading

### level 3 heading

---

> a blockquote
>> nested

**bold emphasis**  *italic emphasis* 

1. Numbered list.
2. Another item.

[hyperlinks](https://www.dataquest.io)


Easy way to add an image: ![Alt text](https://github.com/nlihin/data-analytics/blob/main/images/LogoArielUniversity.jpg?raw=true)

### Code cells


In [1]:
print('hello world')

hello world


The value of the last line of a code cell is displayed,  
even if you don't print it.

In [2]:
a = 2 + 2 # The result of this line will not be displayed
b = 3 + 3 # The result of this line will be displayed, because it is the last line of the cell
a
b

6

### <span style="color:blue"> Exercise:</span>
> How would you display both `a` and `b`? Try it:
>

### Notebook Kernel

In [3]:
class_name = "Intro to Data Analysis"

In [4]:
message = class_name + " is great!!!!!!"

In [5]:
message

'Intro to Data Analysis is great!!!!!!'

### <span style="color:blue"> Exercise:</span>
> Change the value of `class_name` (remove the `#` sign first)
>
> What does message display now?

In [6]:
# class_name = 
print(class_name)
print(message)

Intro to Data Analysis
Intro to Data Analysis is great!!!!!!


The kernel is the computational engine that runs the code in a Jupyter notebook.

- **Key Features:**
  - **Run Code**: Executes one cell at a time.
  - **Keeps Memory**: Remembers variables/functions until you restart.
  - **Execution Numbers**: The number next to the cell (e.g., `[3]`) shows the order it was run.

* Often, after changing a cell, you'll want to rerun all the cells below it
  * "Run -> Run Selected Cell and All Below".
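
The "keeps memory" behaviour described above can be sketched in plain Python. Each block below stands in for a separate notebook cell; the new value `"Data Analytics"` is a made-up example. The point is that changing `class_name` does not update `message` until the cell that builds `message` runs again.

```python
# "Cell A": define the variable.
class_name = "Intro to Data Analysis"

# "Cell B": build a message from it.
message = class_name + " is great!!!!!!"

# Later, "Cell A" is edited and rerun with a hypothetical new value.
class_name = "Data Analytics"

# "Cell B" was not rerun, so message still holds the old text.
stale = message

# Rerunning "Cell B" brings message up to date.
message = class_name + " is great!!!!!!"
fresh = message

print(stale)
print(fresh)
```

This is why rerunning the cells below a change matters: the kernel only remembers the values from the last time each cell actually ran.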

**Before submitting your exam/project/homework**  
**"Kernel -> Restart Kernel and Run All Cells"** 

<a id='section2'></a>

## 2. NumPy arrays

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/escher.PNG?raw=true" width="300"/>
    <p style="text-align: center;"><em>Learning to do the impossible. "Waterfall" by M.C. Escher, 1961.</em></p></center>
</div>




Most of our data is numerical, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc.  
The NumPy library provides specialized data structures, functions, and other tools for numerical computing in Python.  

### The difference between a list and a np.array

In [7]:
student1 = [100, 88, 97]
student2 = [100, 88, 60]

In [None]:
type(student1)

In [None]:
student1 == student2

In [None]:
student1 + student2

Same data, but in a np.array:

In [11]:
import numpy as np

In [12]:
student1_np = np.array([100, 88, 97])
student2_np = np.array([100, 88, 60])

In [None]:
type(student1_np)

In [None]:
student1_np == student2_np
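
As a side-by-side summary of the cells above, here is a minimal sketch using the same scores. The variable names for the results (`lists_equal`, `arrays_added`, and so on) are introduced here only for illustration.

```python
import numpy as np

student1 = [100, 88, 97]
student2 = [100, 88, 60]

# Lists: == compares the lists as a whole, + glues them together.
lists_equal = student1 == student2         # a single bool
lists_added = student1 + student2          # one longer list

student1_np = np.array(student1)
student2_np = np.array(student2)

# Arrays: == and + work element by element.
arrays_equal = student1_np == student2_np  # one bool per exam
arrays_added = student1_np + student2_np   # one sum per exam

print(lists_equal, lists_added)
print(arrays_equal, arrays_added)
```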

### Numerical operations with np.array

NumPy arrays are better suited than lists for operating on numerical data:

- **Ease of use:** small, concise, and intuitive mathematical expressions instead of loops and custom functions.

- **Performance:** NumPy operations and functions are implemented internally in compiled code (mostly C), which makes them much faster than Python statements and loops that are interpreted at runtime.
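
One way to see the performance difference for yourself is to time both approaches. The sketch below doubles 100,000 made-up numbers with a Python loop and with a NumPy expression. The exact timings depend on your machine, so none are shown here.

```python
import timeit

import numpy as np

values = list(range(100_000))   # hypothetical data
values_np = np.array(values)


def double_with_loop():
    # Pure Python: visits the items one by one.
    return [v * 2 for v in values]


def double_with_numpy():
    # NumPy: one expression applied to the whole array.
    return values_np * 2


loop_seconds = timeit.timeit(double_with_loop, number=20)
numpy_seconds = timeit.timeit(double_with_numpy, number=20)
print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```

Both functions produce the same numbers; only the way they are computed differs.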

Compute the average score in each exam:

In [None]:
(student1_np + student2_np)/2

Compare two arrays:

In [16]:
x = np.array([1,2,"cat"])
y = np.array([1,3,"cat"])

In [None]:
x == y
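
A NumPy array holds a single data type, so mixing numbers and text converts every item to a string. The sketch below repeats the arrays above and inspects the result; the name `comparison` is introduced only for illustration.

```python
import numpy as np

x = np.array([1, 2, "cat"])
y = np.array([1, 3, "cat"])

# kind 'U' means the array stores Unicode strings, not numbers.
print(x.dtype.kind)
print(x)

# The comparison is still element by element, but it compares strings.
comparison = x == y
print(comparison)
```

Because the items are strings, arithmetic such as `x + 1` will not work on these arrays.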

<a id='section3'></a>


## 3. Introducing Pandas
---

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/pandas.JPG?raw=true" width="400"/></center>
</div>


If you want to read some more:
* [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)  
* [Intro to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html)


### The pandas library 
Pandas is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet).

There are two main data structures used by pandas:
- Series: a one-dimensional labeled array
- DataFrame: a two-dimensional labeled data structure (think of a spreadsheet, a table, or a dictionary of Series objects)

Pandas is efficient for **data manipulation and analysis**.
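
To illustrate the two structures described above, here is a minimal sketch with a few made-up rows. The names `prices`, `table`, and `column` are hypothetical.

```python
import pandas as pd

# A Series: one-dimensional values with labels (here the default 0, 1, 2).
prices = pd.Series([1.33, 1.35, 0.93], name="price")

# A DataFrame: built here from a dictionary, one entry per column.
table = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston"],
    "price": [1.33, 1.35, 0.93],
})

# Selecting one column of a DataFrame gives back a Series.
column = table["price"]
print(type(prices).__name__, type(table).__name__, type(column).__name__)
```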

### The pandas DataFrame object
First, import pandas:

In [18]:
import pandas as pd

Reading a DataFrame from an online file:

In [19]:
file_name = "https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/avocado.csv"
df = pd.read_csv(file_name)
df

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,total_sold,small_sold,large_sold,sma,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,27/12/2015,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,20/12/2015,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,13/12/2015,0.93,118220.22,794.70,109149.67,130.50,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,06/12/2015,1.08,78992.15,1132.00,71976.41,72.58,5811.16,5677.40,133.76,0.0,conventional,2015,Albany
4,4,29/11/2015,1.28,51039.60,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,7,04/02/2018,1.63,17074.83,2046.96,1529.20,0.00,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,28/01/2018,1.71,13888.04,1191.70,3431.50,0.00,9264.84,8940.04,324.80,0.0,organic,2018,WestTexNewMexico
18246,9,21/01/2018,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.80,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,14/01/2018,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.00,0.0,organic,2018,WestTexNewMexico


This file contains data on avocado sales, for example, the number of bags sold. 
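
The `Unnamed: 0` columns in the table above look like row numbers that were saved into the file during an earlier export (an assumption, not something the file states). The sketch below shows one possible way to handle such a column and the day/month/year dates. It uses a tiny in-memory CSV with two rows copied from the table instead of the real file, and the names `csv_text` and `small_df` are hypothetical.

```python
import io

import pandas as pd

# A tiny stand-in for the real file: the first column has no header.
csv_text = """,Date,AveragePrice,region
0,27/12/2015,1.33,Albany
1,20/12/2015,1.35,Albany
"""

# index_col=0 uses the first column as the row index
# instead of keeping it as an "Unnamed: 0" data column.
small_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Convert the text dates (day/month/year) into real dates.
small_df["Date"] = pd.to_datetime(small_df["Date"], format="%d/%m/%Y")

print(small_df.dtypes)
```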

### The pandas Series object

Each column of a `DataFrame` is a `Series`.

We can select a single column:

In [20]:
df["region"]

0                  Albany
1                  Albany
2                  Albany
3                  Albany
4                  Albany
               ...       
18244    WestTexNewMexico
18245    WestTexNewMexico
18246    WestTexNewMexico
18247    WestTexNewMexico
18248    WestTexNewMexico
Name: region, Length: 18249, dtype: object

Ask for its type:

In [21]:
type(df["region"])

pandas.core.series.Series

### <span style="color:blue"> Exercise:</span>
> Display the `Small Bags` column
>

#### Pandas Series

A Series behaves like an `np.array`, not like a `list`: operations are applied element by element.  
For example, to check whether `Total Bags` is correct:

In [22]:
df["Small Bags"]+ df["Large Bags"] + df["XLarge Bags"] == df["Total Bags"]

0         True
1         True
2         True
3         True
4         True
         ...  
18244     True
18245     True
18246    False
18247     True
18248     True
Length: 18249, dtype: bool

We see that it is not always correct. In how many rows is it correct? Since `True` counts as 1, `sum()` gives the answer:

In [23]:
(df["Small Bags"]+ df["Large Bags"] + df["XLarge Bags"] == df["Total Bags"]).sum()

14213
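
Some mismatches in a check like this can come from floating-point rounding rather than from real errors in the data. Whether that explains the mismatches in the avocado file has not been verified here. The sketch below uses a small made-up table to show the difference between an exact comparison and a tolerant one with `np.isclose`.

```python
import numpy as np
import pandas as pd

# Hypothetical numbers: row 0 differs only by rounding, row 2 is really wrong.
bags = pd.DataFrame({
    "Small Bags": [0.1, 1.0, 2.0],
    "Large Bags": [0.2, 2.0, 3.0],
    "Total Bags": [0.3, 3.0, 6.0],
})
computed_total = bags["Small Bags"] + bags["Large Bags"]

# Exact comparison: 0.1 + 0.2 is not exactly 0.3 in floating point.
exact_matches = (computed_total == bags["Total Bags"]).sum()

# Tolerant comparison: tiny rounding differences are ignored.
close_matches = np.isclose(computed_total, bags["Total Bags"]).sum()

print(exact_matches, close_matches)  # 1 2
```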

## End of lesson - what you need to know:

* How to work with a Jupyter notebook
* What NumPy arrays, pandas Series, and pandas DataFrames are
* How to read a file and do some simple manipulations (see Exercise 1)


<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/hawaii.PNG?raw=true" width="700"/> </center>
</div>
