# First Steps with Python and Jupyter 

![](https://i.imgur.com/gvSnw4A.png)



This tutorial covers the following topics:

* Performing arithmetic operations using Python
* Solving multi-step problems using variables
* Evaluating conditions using Python
* Combining conditions with logical operators
* Adding text styles using Markdown

### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai) (don't worry if these terms seem unfamiliar; we'll learn more about them soon). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc. instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

## Performing Arithmetic Operations using Python

Let's begin by using Python as a calculator. You can write and execute Python using a code cell within Jupyter. 

> **Working with cells**: To create a new cell within Jupyter, you can select "Insert > Insert Cell Below" from the menu bar or just press the "+" button on the toolbar. You can also use the keyboard shortcut `Esc+B` to create a new cell. Once a cell is created, click on it to select it. You can then change the cell type to code or markdown (text) using the "Cell > Cell Type" menu option. You can also use the keyboard shortcuts `Esc+Y` and `Esc+M`. Double-click a cell to edit the content within the cell. To apply your changes and run a cell, use the "Cell > Run Cells" menu option or click the "Run" button on the toolbar or just use the keyboard shortcut `Shift+Enter`. You can see a full list of keyboard shortcuts using the "Help > Keyboard Shortcuts" menu option.

Run the code cells below to perform calculations and view their results. Try changing the numbers and running the modified cells again to see updated results. Can you guess what the `//`, `%`, and `**` operators are used for?

In [None]:
2 + 3 + 9

In [None]:
99 - 73

In [None]:
23.54 * -1432

In [None]:
100 / 7

In [None]:
100 // 7

In [None]:
100 % 7

In [None]:
5 ** 3

As you might expect, operators like `/` and `*` take precedence over other operators like `+` and `-` as per mathematical conventions. You can use parentheses, i.e. `(` and `)`, to specify the order in which operations are performed.

In [None]:
((2 + 5) * (17 - 3)) / (4 ** 3)
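
As a small illustration of precedence (the variable names are made up for this sketch), the two expressions below use the same numbers but give different results depending on where the parentheses are placed:

```python
# Without parentheses, * is evaluated before + and -
without_parentheses = 2 + 5 * 17 - 3    # same as 2 + (5 * 17) - 3

# Parentheses force the addition and subtraction to happen first
with_parentheses = (2 + 5) * (17 - 3)   # same as 7 * 14

print(without_parentheses)
print(with_parentheses)
```

The first line prints `84` and the second prints `98`.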

Python supports the following arithmetic operators:

| Operator   | Purpose           | Example     | Result    |
|------------|-------------------|-------------|-----------|
| `+`        | Addition          | `2 + 3`     | `5`       |
| `-`        | Subtraction       | `3 - 2`     | `1`       |
| `*`        | Multiplication    | `8 * 12`    | `96`      |
| `/`        | Division          | `100 / 7`   | `14.28..` |
| `//`       | Floor Division    | `100 // 7`  | `14`      |    
| `%`        | Modulus/Remainder | `100 % 7`   | `2`       |
| `**`       | Exponent          | `5 ** 3`    | `125`     |
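
One way to see how `//` and `%` relate, sketched here with illustrative variable names: multiplying the floor-division result by the divisor and adding the remainder gives back the original number.

```python
dividend = 100
divisor = 7

quotient = dividend // divisor    # how many whole times 7 fits into 100
remainder = dividend % divisor    # what is left over after that

# Putting the two pieces back together recovers the dividend
reconstructed = quotient * divisor + remainder

print(quotient, remainder, reconstructed)
```

This prints `14 2 100`.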


Try solving some simple problems from this page: https://www.math-only-math.com/worksheet-on-word-problems-on-four-operations.html

You can use the empty cells below and add more cells if required.

## Solving multi-step problems using variables

Let's try solving the following word problem using Python: 

> A grocery store sells a bag of ice for $1.25 and makes a 20% profit. If it sells 500 bags of ice, how much total profit does it make?

We can list out the information provided and gradually convert the word problem into a mathematical expression that can be evaluated using Python. 

*Cost of ice bag ($)* = 1.25

*Profit margin* = 20% = .2

*Profit per bag ($)* = profit margin * cost of ice bag = .2 * 1.25

*No. of bags* = 500

*Total profit* = no. of bags * profit per bag = 500 * (.2 * 1.25)

In [None]:
500 * (.2 * 1.25)

Thus, the grocery store makes a total profit of $125. While this is a reasonable way to solve a problem, it's not entirely clear by looking at the code cell what the numbers represent. We can give names to each of the numbers by creating Python *variables*.

> **Variables**: While working with a programming language such as Python, information is stored in *variables*. You can think of variables as containers for storing data. The data stored within a variable is called its *value*.

In [None]:
cost_of_ice_bag = 1.25
profit_margin = .2
number_of_bags = 500

The variables `cost_of_ice_bag`, `profit_margin`, and `number_of_bags` now contain the information provided in the word problem. We can check the value of a variable by typing its name into a cell. We can combine variables using arithmetic operations to create other variables.

> **Code completion**: While typing the name of an existing variable in a code cell within Jupyter, just type the first few characters and press the `Tab` key to autocomplete the variable's name. Try typing `pro` in a code cell below and press `Tab` to autocomplete to `profit_margin`.

In [None]:
profit_margin

In [None]:
profit_per_bag = cost_of_ice_bag * profit_margin
profit_per_bag

In [None]:
total_profit = number_of_bags * profit_per_bag
total_profit

If you try to view the value of a variable that has not been *defined*, i.e., given a value using the assignment statement `variable_name = value`, Python shows an error.

In [None]:
net_profit

Storing and manipulating data using appropriately named variables is a great way to explain what your code does.

Let's display the result of the word problem using a friendly message. We can do this using the `print` *function*.

> **Functions**: A function is a reusable set of instructions. It takes one or more inputs, performs certain operations, and often returns an output. Python provides many in-built functions like `print` and also allows us to define our own functions.

In [1]:
print("The grocery store makes a total profit of $", total_profit)


> **`print`**: The `print` function is used to display information. It takes one or more inputs, which can be text (within quotes, e.g., `"this is some text"`), numbers, variables, mathematical expressions, etc. We'll learn more about variables & functions in the next tutorial.

Creating a code cell for each variable or mathematical operation can get tedious. Fortunately, Jupyter allows you to write multiple lines of code within a single code cell. The result of the last line of code within the cell is displayed as the output. 

Let's rewrite the solution to our word problem within a single cell.

In [None]:
# Store input data in variables
cost_of_ice_bag = 1.25
profit_margin = .2
number_of_bags = 500

# Perform the required calculations
profit_per_bag = cost_of_ice_bag * profit_margin
total_profit = number_of_bags * profit_per_bag

# Display the result
print("The grocery store makes a total profit of $", total_profit)

Note that we're using the `#` character to add *comments* within our code. 

> **Comments**: Comments and blank lines are ignored during execution, but they are useful for providing information to humans (including yourself) about what the code does. Comments can be inline (at the end of some code), on a separate line, or even span multiple lines. 

Inline and single-line comments start with `#`. Python has no dedicated multi-line comment syntax, but a string enclosed in three quotes, i.e. `"""`, that isn't assigned to a variable is commonly used as one. Here are some examples of code comments:

In [None]:
my_favorite_number = 1 # an inline comment

In [None]:
# This comment gets its own line
my_least_favorite_number = 3

In [None]:
"""This is a multi-line comment.
Write as little or as much as you'd like.

Comments are really helpful for people reading
your code, but try to keep them short & to-the-point.

Also, if you use good variable names, then your code is
often self explanatory, and you may not even need comments!
"""
a_neutral_number = 5



> **EXERCISE**: A travel company wants to fly a plane to the Bahamas. Flying the plane costs 5000 dollars. So far, 29 people have signed up for the trip. If the company charges 200 dollars per ticket, what is the profit made by the company? Create variables for each numeric quantity and use appropriate arithmetic operations.

## Evaluating conditions using Python

Apart from arithmetic operations, Python also provides several operations for comparing numbers & variables.

| Operator    | Description                                                     |
|-------------|-----------------------------------------------------------------|
| `==`        | Check if operands are equal                                     |
| `!=`        | Check if operands are not equal                                 |
| `>`         | Check if left operand is greater than right operand             |
| `<`         | Check if left operand is less than right operand                |
| `>=`        | Check if left operand is greater than or equal to right operand |
| `<=`        | Check if left operand is less than or equal to right operand    |

The result of a comparison operation is either `True` or `False` (note the uppercase `T` and `F`). These are special keywords in Python. Let's try out some experiments with comparison operators.

In [None]:
my_favorite_number = 1
my_least_favorite_number = 5
a_neutral_number = 3

In [None]:
# Equality check - True
my_favorite_number == 1

In [None]:
# Equality check - False
my_favorite_number == my_least_favorite_number

In [None]:
# Not equal check - True
my_favorite_number != a_neutral_number

In [None]:
# Not equal check - False
a_neutral_number != 3

In [None]:
# Greater than check - True
my_least_favorite_number > a_neutral_number

In [None]:
# Greater than check - False
my_favorite_number > my_least_favorite_number

In [None]:
# Less than check - True
my_favorite_number < 10

In [None]:
# Less than check - False
my_least_favorite_number < my_favorite_number

In [None]:
# Greater than or equal check - True
my_favorite_number >= 1

In [None]:
# Greater than or equal check - False
my_favorite_number >= 3

In [None]:
# Less than or equal check - True
3 + 6 <= 9

In [None]:
# Less than or equal check - False
my_favorite_number + a_neutral_number <= 3

Just like arithmetic operations, the result of a comparison operation can also be stored in a variable.

In [None]:
cost_of_ice_bag = 1.25
is_ice_bag_expensive = cost_of_ice_bag >= 10
print("Is the ice bag expensive?", is_ice_bag_expensive)

## Combining conditions with logical operators

The logical operators `and`, `or` and `not` operate upon conditions and `True` & `False` values (also known as *booleans*). `and` and `or` operate on two conditions, whereas `not` operates on a single condition.

The `and` operator returns `True` when both the conditions evaluate to `True`. Otherwise, it returns `False`.

| `a`     | `b`    | `a and b` |
|---------|--------|-----------|
|  `True` | `True` | `True`    |
|  `True` | `False`| `False`   |
|  `False`| `True` | `False`   |
|  `False`| `False`| `False`   |

In [None]:
my_favorite_number

In [None]:
my_favorite_number > 0 and my_favorite_number <= 3

In [None]:
my_favorite_number < 0 and my_favorite_number <= 3

In [None]:
my_favorite_number > 0 and my_favorite_number >= 3

In [None]:
True and False

In [None]:
True and True

The `or` operator returns `True` if at least one of the conditions evaluates to `True`. It returns `False` only if both conditions are `False`.

| `a`     | `b`    | `a or b`  |
|---------|--------|-----------|
|  `True` | `True` | `True`    |
|  `True` | `False`| `True`    |
|  `False`| `True` | `True`    |
|  `False`| `False`| `False`   |


In [None]:
a_neutral_number = 3

In [None]:
a_neutral_number == 3 or my_favorite_number < 0

In [None]:
a_neutral_number != 3 or my_favorite_number < 0

In [None]:
my_favorite_number < 0 or True

In [None]:
False or False

The `not` operator returns `False` if a condition is `True` and `True` if the condition is `False`.

In [None]:
not a_neutral_number == 3

In [None]:
not my_favorite_number < 0

In [None]:
not False

In [None]:
not True

Logical operators can be combined to form complex conditions. Use round brackets or parentheses `(` and `)` to indicate the order in which logical operators should be applied.

In [None]:
(2 > 3 and 4 <= 5) or not (my_favorite_number < 0 and True)

In [None]:
not (True and 0 < 1) or (False and True)
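
To make the evaluation order easier to follow, here is a sketch that breaks the two expressions above into named steps. The intermediate variable names are introduced only for illustration, and `my_favorite_number` is assumed to be `1`, as set earlier.

```python
my_favorite_number = 1  # assumed value, matching the earlier cells

# (2 > 3 and 4 <= 5) or not (my_favorite_number < 0 and True)
left = 2 > 3 and 4 <= 5                         # False and True -> False
right = not (my_favorite_number < 0 and True)   # not (False and True) -> True
first_result = left or right                    # False or True -> True

# not (True and 0 < 1) or (False and True)
negated = not (True and 0 < 1)                  # not (True and True) -> False
second_result = negated or (False and True)     # False or False -> False

print(first_result, second_result)
```

Note that `not` is applied before `or`, so the second expression is read as `(not (...)) or (...)`.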

Experiment with arithmetic, conditional and logical operators in Python using the interactive nature of Jupyter notebooks. We will learn more about variables and functions in future tutorials.

There is also a ternary operator in Python. Learn about it here: https://data-flair.training/blogs/python-ternary-operator/

## Adding text styles using Markdown

Adding explanations using text cells (like this one) is a great way to make your notebook informative for other readers. It is also useful if you need to refer back to it in the future. You can double-click on a text cell within Jupyter to edit it. In edit mode, you'll notice that the text looks slightly different (for instance, the heading has a `##` prefix). This text is written using Markdown, a simple way to add styles to your text. Execute this cell to see the output without the special characters. You can switch back and forth between the source and the output to apply a specific style.

For instance, you can use one or more `#` characters at the start of a line to create headers of different sizes:

# Header 1

## Header 2

### Header 3

#### Header 4

To create a bulleted or numbered list, simply start a line with `*` or `1.`.

A bulleted list:

* Item 1
* Item 2
* Item 3

A numbered list:

1. Apple
2. Banana
3. Pineapple

You can make some text bold using `**`, e.g., **some bold text**, or make it italic using `*`, e.g., *some italic text.* You can also create links, e.g., [a link](https://jovian.ai). Images are easily embedded too:

![](https://i.imgur.com/3gjZMYK.png)

Another really nice feature of Markdown is the ability to include blocks of code. Note that code blocks inside Markdown cells cannot be executed.

```
# Perform the required calculations
profit_per_bag = cost_of_ice_bag * profit_margin
total_profit = number_of_bags * profit_per_bag

# Display the result
print("The grocery store makes a total profit of $", total_profit)

```

You can learn the full syntax of Markdown here: https://learnxinyminutes.com/docs/markdown/

## Save and upload your notebook

Whether you're running this Jupyter notebook online or on your computer, it's essential to save your work from time to time. You can continue working on a saved notebook later or share it with friends and colleagues to let them execute your code. [Jovian](https://jovian.ai/platform-features) offers an easy way of saving and sharing your Jupyter notebooks online.

First, you need to install the Jovian Python library if it isn't already installed.

## Further Reading and References

The following resources cover more arithmetic, conditional, and logical operations in Python:

* Python Tutorial at W3Schools: https://www.w3schools.com/python/
* Practical Python Programming: https://dabeaz-course.github.io/practical-python/Notes/Contents.html
* Python official documentation: https://docs.python.org/3/tutorial/index.html

Now that you have taken your first steps with Python, you are ready to move on to the next tutorial: ["A Quick Tour of Variables and Data Types in Python"](https://jovian.ml/aakashns/python-variables-and-data-types).

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is a Jupyter notebook? 
2. How do you add a new code cell below an existing cell?
3. How do you add a new Markdown cell below an existing cell?
4. How do you convert a code cell to a Markdown cell or vice versa?
5. How do you execute a code cell within Jupyter?
6. What are the different arithmetic operations supported in Python?
7. How do you perform arithmetic operations using Python?
8. What is the difference between the `/` and the `//` operators?
9. What is the difference between the `*` and the `**` operators?
10. What is the order of precedence for arithmetic operators in Python?
11. How do you specify the order in which arithmetic operations are performed in an expression involving multiple operators?
12. How do you solve a multi-step arithmetic word problem using Python?
13. What are variables? Why are they useful?
14. How do you create a variable in Python?
15. What is the assignment operator in Python?
16. What are the rules for naming a variable in Python?
17. How do you view the value of a variable?
18. How do you store the result of an arithmetic expression in a variable?
19. What happens if you try to access a variable that has not been defined?
20. How do you display messages in Python?
21. What type of inputs can the print function accept?
22. What are code comments? How are they useful?
23. What are the different ways of creating comments in Python code?
24. What are the different comparison operations supported in Python?
25. What is the result of a comparison operation?
26. What is the difference between `=` and `==` in Python?
27. What are the logical operators supported in Python?
28. What is the difference between the `and` and `or` operators?
29. Can you use comparison and logical operators in the same expression?
30. What is the purpose of using parentheses in arithmetic or logical expressions?
31. What is Markdown? Why is it useful?
32. How do you create headings of different sizes using Markdown?
33. How do you create bulleted and numbered lists using Markdown?
34. How do you create bold or italic text using Markdown?
35. How do you include links & images within Markdown cells?
36. How do you include code blocks within Markdown cells?
37. Is it possible to execute the code blocks within Markdown cells?
38. How do you upload and share your Jupyter notebook online using Jovian?
39. What is the purpose of the API key requested by `jovian.commit`? Where can you find the API key?
40. Where can you learn about arithmetic, conditional and logical operations in Python?

## Solution for Exercise


> **EXERCISE**: A travel company wants to fly a plane to the Bahamas. Flying the plane costs 5000 dollars. So far, 29 people have signed up for the trip. If the company charges 200 dollars per ticket, what is the profit made by the company? Create variables for each numeric quantity and use appropriate arithmetic operations.

In [None]:
plane_cost = 5000
total_people_signed_up = 29
ticket_cost = 200

In [None]:
amount_received_by_company = total_people_signed_up * ticket_cost
total_profit = amount_received_by_company - plane_cost

In [None]:
total_profit

# A Quick Tour of Variables and Data Types in Python


![](https://i.imgur.com/6cg2E9Q.png)

These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself.

This tutorial covers the following topics:

- Storing information using variables
- Primitive data types in Python: Integer, Float, Boolean, None and String
- Built-in data structures in Python: List, Tuple and Dictionary
- Methods and operators supported by built-in data types

## Storing information using variables

Computers are useful for two purposes: storing information (also known as data) and performing operations on stored data. While working with a programming language such as Python, data is stored in variables. You can think of variables as containers for storing data. The data stored within a variable is called its value. Creating variables in Python is pretty easy, as we've already seen in the [previous tutorial](https://jovian.ml/aakashns/first-steps-with-python/v/4#C15).


In [5]:
my_favorite_color = "blue"

my_favorite_color

'blue'

A variable is created using an assignment statement. It begins with the variable's name, followed by the assignment operator `=` followed by the value to be stored within the variable.  Note that the assignment operator `=` is different from the equality comparison operator `==`.

You can also assign values to multiple variables in a single statement by separating the variable names and values with commas.

In [None]:
color1, color2, color3 = "red", "green", "blue"

In [None]:
color1

In [None]:
color2

In [None]:
color3

You can assign the same value to multiple variables by chaining multiple assignment operations within a single statement.

In [None]:
color4 = color5 = color6 = "magenta"

In [None]:
color4

In [None]:
color5

In [None]:
color6

You can change the value stored within a variable by assigning a new value to it using another assignment statement. Be careful while reassigning variables: when you assign a new value to the variable, the old value is lost and no longer accessible.

In [6]:
my_favorite_color = "red"
my_favorite_color 

'red'

While reassigning a variable, you can also use the variable's previous value to compute the new value.

In [None]:
counter = 10

In [None]:
counter = counter + 1
counter 

The pattern `var = var op something` (where `op` is an arithmetic operator like `+`, `-`, `*`, `/`) is very common, so Python provides a *shorthand* syntax for it.

In [None]:
counter = 10

In [None]:
# Same as `counter = counter + 4`
counter += 4
counter
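
The same shorthand works with the other arithmetic operators. The sketch below reuses the `counter` variable and shows its value after each step in a comment:

```python
counter = 10

counter += 4   # same as counter = counter + 4, counter is now 14
counter -= 2   # same as counter = counter - 2, counter is now 12
counter *= 3   # same as counter = counter * 3, counter is now 36
counter /= 4   # same as counter = counter / 4, counter is now 9.0

print(counter)
```

The final value is the float `9.0` rather than the integer `9`, because the `/` operator always produces a float.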

Variable names can be short (`a`, `x`, `y`, etc.) or descriptive ( `my_favorite_color`, `profit_margin`, `the_3_musketeers`, etc.). However, you must follow these rules while naming Python variables:

* A variable's name must start with a letter or the underscore character `_`. It cannot begin with a number.
* A variable name can only contain lowercase (small) or uppercase (capital) letters, digits, or underscores (`a`-`z`, `A`-`Z`, `0`-`9`, and `_`).
* Variable names are case-sensitive, i.e., `a_variable`, `A_Variable`, and `A_VARIABLE` are all different variables.

Here are some valid variable names:

In [None]:
a_variable = 23
is_today_Saturday = False
my_favorite_car = "Delorean"
the_3_musketeers = ["Athos", "Porthos", "Aramis"] 

Let's try creating some variables with invalid names. Python prints a syntax error if your variable's name is invalid.

> **Syntax**: The syntax of a programming language refers to the rules that govern the structure of a valid instruction or *statement*. If a statement does not follow these rules, Python stops execution and informs you that there is a *syntax error*. You can think of syntax as the rules of grammar for a programming language.

In [None]:
a variable = 23
is_today_$aturday = False
my-favorite-car = "Delorean"
3_musketeers = ["Athos", "Porthos", "Aramis"]
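
If you'd like to check a name without triggering an error, one option is the string method `isidentifier`, which reports whether a piece of text is a valid Python name (strings and methods are covered later in this tutorial). Treat this as a rough sketch: `isidentifier` does not flag reserved keywords such as `if`, and it also accepts some non-English letters that the rules above don't mention.

```python
# Each name is written as text (in quotes) so that Python does not try to run it
print("a variable".isidentifier())          # contains a space
print("is_today_$aturday".isidentifier())   # contains a $ character
print("my-favorite-car".isidentifier())     # contains hyphens
print("3_musketeers".isidentifier())        # starts with a digit
print("the_3_musketeers".isidentifier())    # follows all the rules
```

The first four lines print `False`, and the last one prints `True`.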

## Built-in data types in Python

Any data or information stored within a Python variable has a *type*. You can view the type of data stored within a variable using the `type` function.

In [None]:
a_variable

In [None]:
type(a_variable)

In [None]:
is_today_Saturday

In [None]:
type(is_today_Saturday)

In [None]:
my_favorite_car

In [None]:
type(my_favorite_car)

In [None]:
the_3_musketeers

In [None]:
type(the_3_musketeers)

Python has several built-in data types for storing different kinds of information in variables. Following are some commonly used data types:

1. Integer
2. Float
3. Boolean
4. None
5. String
6. List
7. Tuple
8. Dictionary

Integer, float, boolean, None, and string are *primitive data types* because they represent a single value. Other data types like list, tuple, and dictionary are often called *data structures* or *containers* because they hold multiple pieces of data together.

### Integer

Integers represent positive or negative whole numbers, from negative infinity to infinity. Note that integers should not include decimal points. Integers have the type `int`.

In [None]:
current_year = 2020
current_year 

In [None]:
type(current_year)

Unlike some other programming languages, integers in Python can be arbitrarily large (or small). There's no lowest or highest value for integers, and there's just one `int` type (as opposed to `short`, `int`, `long`, `long long`, `unsigned int`, etc. in C/C++/Java).

In [None]:
a_large_negative_number = -23374038374832934334234317348343
a_large_negative_number

In [None]:
type(a_large_negative_number)
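
As a quick check that large integers stay exact, the sketch below (with an illustrative variable name) computes a number well beyond the range of a 64-bit integer and does arithmetic with it:

```python
a_huge_number = 2 ** 100   # far larger than a 64-bit integer can hold

print(a_huge_number)
print(type(a_huge_number))

# Arithmetic on large integers is exact: adding 1 and subtracting
# the original number leaves exactly 1
difference = a_huge_number + 1 - a_huge_number
print(difference)
```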

### Float

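Floats (or floating-point numbers) are numbers with a decimal point. Python floats are usually implemented as 64-bit double-precision values, so they hold roughly 15 to 17 significant digits and have a largest representable magnitude of about 1.8 × 10^308; digits beyond that precision are rounded off. Floating-point numbers have the type `float`.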

In [None]:
pi = 3.141592653589793238
pi

In [None]:
type(pi)
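
Floats are stored with limited precision, which shows up in two ways in the sketch below: extra digits in a literal are rounded off, and some decimal fractions cannot be represented exactly. The variable names `same_value` and `sum_matches` are introduced only for this illustration.

```python
pi = 3.141592653589793238
print(pi)   # the last few digits typed above are rounded off

# Both literals are stored as the same float
same_value = pi == 3.141592653589793
print(same_value)

# 0.1 and 0.2 are not stored exactly, so their sum is slightly off
print(0.1 + 0.2)
sum_matches = 0.1 + 0.2 == 0.3
print(sum_matches)
```

Here `same_value` is `True` and `sum_matches` is `False`, which is why exact equality checks on floats should be used with care.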

Note that a whole number is treated as a float if written with a decimal point, even though the decimal portion of the number is zero.

In [None]:
a_number = 3.0
a_number 

In [None]:
type(a_number)

In [None]:
another_number = 4.
another_number 

In [None]:
type(another_number)

Floating point numbers can also be written using the scientific notation with an "e" to indicate the power of 10.

In [1]:
one_hundredth = 1e-2
one_hundredth 

0.01

In [None]:
type(one_hundredth)

In [None]:
avogadro_number =  6.02214076e23
avogadro_number

In [None]:
type(avogadro_number)

You can convert floats into integers and vice versa using the `float` and `int` functions. The operation of converting one type of value into another is called casting.

In [None]:
float(current_year)
float(a_large_negative_number)
int(pi)
int(avogadro_number)
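
One detail worth seeing in action: `int` drops the decimal part instead of rounding. The sketch below uses made-up values and the built-in `round` function for comparison.

```python
print(float(2020))   # 2020.0
print(int(3.99))     # 3, the decimal part is dropped rather than rounded
print(int(-3.99))    # -3, truncation is toward zero
print(round(3.99))   # 4, use round if you want the nearest whole number
```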

While performing arithmetic operations, integers are automatically converted to `float`s if any of the operands is a `float`. Also, the division operator `/` always returns a `float`, even if both operands are integers. Use the `//` operator if you want the result of the division to be an `int`.

In [None]:
type(45 * 3.0)
type(45 * 3)
type(10//2)
type(10/3)
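
Since only the last line of a cell is displayed, wrapping each expression in `print` makes it easier to compare the results side by side. This sketch also adds two extra cases (not from the cell above) showing that `/` gives a float even when the division is exact, and that `//` gives a float when one operand is a float.

```python
print(type(45 * 3.0))   # float, because one operand is a float
print(type(45 * 3))     # int
print(type(10 // 2))    # int
print(type(10 / 3))     # float

print(10 / 2)           # 5.0, a float even though the division is exact
print(10 // 2.0)        # 5.0, floor division with a float operand gives a float
```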

### Boolean

Booleans represent one of 2 values: `True` and `False`. Booleans have the type `bool`.

In [None]:
is_today_Sunday = True
is_today_Sunday 

In [None]:
type(is_today_Sunday)

Booleans are generally the result of a comparison operation, e.g., `==`, `>=`, etc.

In [None]:
cost_of_ice_bag = 1.25
is_ice_bag_expensive = cost_of_ice_bag >= 10

In [None]:
is_ice_bag_expensive

In [None]:
type(is_ice_bag_expensive)

Booleans are automatically converted to `int`s when used in arithmetic operations. `True` is converted to `1` and `False` is converted to `0`.

In [None]:
5 + False

In [None]:
3. + True
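
A practical consequence, sketched here with hypothetical variable names, is that adding booleans counts how many of them are `True`:

```python
passed_math = True
passed_physics = False
passed_chemistry = True

# Each True counts as 1 and each False counts as 0
subjects_passed = passed_math + passed_physics + passed_chemistry

print(subjects_passed)
print(type(subjects_passed))
```

The result is the integer `2`.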

Any value in Python can be converted to a Boolean using the `bool` function. 

The following commonly used values evaluate to `False` (they are often called *falsy* values):

1. The value `False` itself
2. The integer `0`
3. The float `0.0`
4. The empty value `None`
5. The empty text `""`
6. The empty list `[]`
7. The empty tuple `()`
8. The empty dictionary `{}`
9. The empty set `set()`
10. The empty range `range(0)`

Other numeric zeros and empty containers (e.g., `0j` and `frozenset()`) are falsy too. Most other values evaluate to `True` (a value that evaluates to `True` is often called a *truthy* value).

In [None]:
bool(False)
bool(0)
bool(0.0)
bool(None)
bool("")
bool([])
bool({})
bool(())
bool(set())
bool(range(0))
bool(True), bool(1), bool(2.0), bool("hello"), bool([1,2]), bool((2,3)), bool(range(10))

### None

The None type includes a single value `None`, used to indicate the absence of a value. `None` has the type `NoneType`. It is often used to declare a variable whose value may be assigned later.

In [None]:
nothing = None

In [None]:
type(nothing)

### String

A string is used to represent text (*a string of characters*) in Python. Strings must be enclosed in quotation marks (either the single quote `'` or the double quote `"`). Strings have the type `str`.

In [None]:
today = "Saturday"
today 

In [None]:
type(today)

You can use single quotes inside a string written with double quotes, and vice versa.

In [None]:
my_favorite_movie = "One Flew over the Cuckoo's Nest" 
my_favorite_movie

In [None]:
my_favorite_pun = 'Thanks for explaining the word "many" to me, it means a lot.'
my_favorite_pun

To use a double quote within a string written with double quotes, *escape* the inner quotes by prefixing them with the `\` character.

In [None]:
another_pun = "The first time I got a universal remote control, I thought to myself \"This changes everything\"."
another_pun

Strings created using single or double quotes must begin and end on the same line. To create multiline strings, use three single quotes `'''` or three double quotes `"""` to begin and end the string. Line breaks are represented using the newline character `\n`.

In [2]:
yet_another_pun = '''Son: "Dad, can you tell me what a solar eclipse is?" 
Dad: "No sun."'''
yet_another_pun

'Son: "Dad, can you tell me what a solar eclipse is?" \nDad: "No sun."'

Multiline strings are best displayed using the `print` function.

In [3]:
print(yet_another_pun)

Son: "Dad, can you tell me what a solar eclipse is?" 
Dad: "No sun."


In [4]:
a_music_pun = """
Two windmills are standing in a field and one asks the other, 
"What kind of music do you like?"  

The other says, 
"I'm a big metal fan."
"""

In [5]:
print(a_music_pun)


Two windmills are standing in a field and one asks the other, 
"What kind of music do you like?"  

The other says, 
"I'm a big metal fan."



You can check the length of a string using the `len` function.

In [None]:
len(my_favorite_movie)

Note that special characters like `\n` and escaped characters like `\"` count as a single character, even though they are written and sometimes printed as two characters.

In [6]:
multiline_string = """a
b"""
print(multiline_string)

a
b


In [7]:
len(multiline_string)

3
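
The same counting rule applies to escaped quotes. In this short sketch with made-up strings, each escape sequence contributes one character to the length:

```python
print(len("a\nb"))          # 3: a, newline, b
print(len("\""))            # 1: a single double-quote character
print(len("say \"hi\""))    # 8: the backslashes are not counted
```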

A string can be converted into a list of characters using the `list` function.

In [None]:
list(multiline_string)

Strings also support several list operations, which are discussed in the next section. We'll look at a couple of examples here.

You can access individual characters within a string using the `[]` indexing notation. Note the character indices go from `0` to `n-1`, where `n` is the length of the string.

In [None]:
today = "Saturday"

In [None]:
today[0]
today[3]
today[5:8]
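
Printing each expression shows what the indices select. The last line is an extra case, not from the cell above, showing that negative indices count from the end of the string.

```python
today = "Saturday"

print(today[0])     # S, the first character
print(today[3])     # u, the fourth character
print(today[5:8])   # day, the characters at indices 5, 6 and 7
print(today[-1])    # y, the last character
```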

You can also check whether a string contains some text using the `in` operator.

In [None]:
'day' in today
'sun ' in today

Two or more strings can be joined or *concatenated* using the `+` operator. Be careful while concatenating strings: sometimes you may need to add a space character `" "` between words.

In [None]:
full_name = "Derek O'Brien"

In [None]:
greeting = "Hello"

In [None]:
greeting + full_name

In [None]:
greeting + " " + full_name + "!" # additional space

Strings in Python have many built-in *methods* that are used to manipulate them. Let's try out some common string methods.

> **Methods**: Methods are functions associated with data types and are accessed using the `.` notation e.g. `variable_name.method()` or `"a string".method()`. Methods are a powerful technique for associating common operations with values of specific data types.

The `.lower()`, `.upper()` and `.capitalize()` methods are used to change the case of the characters.

In [None]:
today.lower()

In [None]:
"saturday".upper()

In [None]:
"monday".capitalize() # changes first character to uppercase

The `.replace` method replaces a part of the string with another string. It takes the portion to be replaced and the replacement text as *inputs* or *arguments*.

In [None]:
another_day = today.replace("Satur", "Wednes")
another_day

Note that `replace` returns a new string, and the original string is not modified.

In [None]:
today

The `.split` method splits a string into a list of strings at every occurrence of provided character(s).

In [8]:
"Sun,Mon,Tue,Wed,Thu,Fri,Sat".split(",")

['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

The `.strip` method removes whitespace characters from the beginning and end of a string.

In [9]:
a_long_line = "       This is a long line with some space before, after,     and some space in the middle..    "
a_long_line_stripped = a_long_line.strip()
a_long_line_stripped

'This is a long line with some space before, after,     and some space in the middle..'

The `.format` method combines values of other data types, e.g., integers, floats, booleans, lists, etc. with strings. You can use `format` to construct output messages for display.

In [None]:
# Input variables
cost_of_ice_bag = 1.25
profit_margin = .2
number_of_bags = 500

# Template for output message
output_template = """If a grocery store sells ice bags at $ {} per bag, with a profit margin of {} %, 
then the total profit it makes by selling {} ice bags is $ {}."""

print(output_template)

In [None]:
# Inserting values into the string
total_profit = cost_of_ice_bag * profit_margin * number_of_bags
output_message = output_template.format(cost_of_ice_bag, profit_margin*100, number_of_bags, total_profit)

print(output_message)

Notice how the placeholders `{}` in the `output_template` string are replaced with the arguments provided to the `.format` method.
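
As a minimal sketch of the same idea, placeholders can also be given names and filled using keyword arguments, and a format specification such as `:.2f` displays a float with two decimal places. The variable names `price` and `quantity` and the placeholder names below are hypothetical, chosen only for illustration.

```python
# Hypothetical values, not part of the ice bag example above
price = 1.25
quantity = 3

# Named placeholders are filled using keyword arguments.
# The ":.2f" specification shows a float with two decimal places.
message = "{count} bags at $ {unit:.2f} each cost $ {total:.2f}".format(
    count=quantity, unit=price, total=price * quantity
)

print(message)  # 3 bags at $ 1.25 each cost $ 3.75
```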

It is also possible to use the string concatenation operator `+` to combine strings with other values. However, those values must first be converted to strings using the `str` function.

In [None]:
"If a grocery store sells ice bags at $ " + str(cost_of_ice_bag) + ", with a profit margin of " + str(profit_margin)

You can use `str` to convert a value of any data type into a string.

In [None]:
str(23)
str(23.563)
str(True)
the_3_musketeers = ["Athos", "Porthos", "Aramis"]
str(the_3_musketeers)

Note that all string methods return new values and DO NOT change the existing string. You can find a full list of string methods here: https://www.w3schools.com/python/python_ref_string.asp. 

Strings also support the comparison operators `==` and `!=` for checking whether two strings are equal.

In [None]:
first_name = "John"

In [None]:
first_name == "Doe"

In [None]:
first_name == "John"

In [None]:
first_name != "Jane"

We've looked at the primitive data types in Python. We're now ready to explore non-primitive data structures, also known as containers.

Before continuing, let us run `jovian.commit` once again to record another snapshot of our notebook.

### List

A list in Python is an ordered collection of values. Lists can hold values of different data types and support operations to add, remove, and change values. Lists have the type `list`.

To create a list, enclose a sequence of values within square brackets `[` and `]`, separated by commas.

In [None]:
fruits = ['apple', 'banana', 'cherry']
fruits

In [None]:
type(fruits)

Let's try creating a list containing values of different data types, including another list.

In [None]:
a_list = [23, 'hello', None, 3.14, fruits, 3 <= 5]
a_list

In [None]:
empty_list = []
empty_list

To determine the number of values in a list, use the `len` function. You can use `len`  to determine the number of values in several other data types.

In [None]:
len(fruits)

In [None]:
print("Number of fruits:", len(fruits))

In [None]:
len(a_list)

In [None]:
len(empty_list)

You can access an element from the list using its *index*, e.g., `fruits[2]` returns the element at index 2 within the list `fruits`. The starting index of a list is 0.

In [None]:
fruits[0]

In [None]:
fruits[2]

If you try to access an index equal to or higher than the length of the list, Python raises an `IndexError`.

In [None]:
fruits[3]

You can use negative indices to access elements from the end of a list, e.g., `fruits[-1]` returns the last element, `fruits[-2]` returns the second last element, and so on.

In [None]:
fruits[-1]

In [None]:
fruits[-3]

In [None]:
fruits[-4]

You can also access a range of values from the list. The result is itself a list. Let us look at some examples.

In [None]:
a_list[2:5]

Note that the range `2:5` includes the element at the start index `2` but does not include the element at the end index `5`. So, the result has 3 values (index `2`, `3`, and `4`).

Here are some experiments you should try out (use the empty cells below):

* Try setting one or both indices of the range to be larger than the size of the list, e.g., `a_list[2:10]`
* Try setting the start index of the range to be larger than the end index, e.g., `a_list[12:10]`
* Try leaving out the start or end index of a range, e.g., `a_list[2:]` or `a_list[:5]`
* Try using negative indices for the range, e.g., `a_list[-2:-5]` or `a_list[-5:-2]` (can you explain the results?)
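
The experiments above can be sketched on a separate list so that the results are easy to predict. The list `letters` below is hypothetical and is not used elsewhere in this notebook; try the same ranges on `a_list` to compare.

```python
# A hypothetical list with six elements (indices 0 to 5)
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# End index larger than the list: the range simply stops at the last element
print(letters[2:10])   # ['c', 'd', 'e', 'f']

# Start index larger than the end index: the result is an empty list
print(letters[12:10])  # []

# Leaving out the end index means "until the end of the list"
print(letters[2:])     # ['c', 'd', 'e', 'f']

# Leaving out the start index means "from the beginning of the list"
print(letters[:5])     # ['a', 'b', 'c', 'd', 'e']

# Negative indices count from the end: -5 is index 1 and -2 is index 4 here
print(letters[-2:-5])  # [] (the start lies after the end)
print(letters[-5:-2])  # ['b', 'c', 'd']
```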

> The flexible and interactive nature of Jupyter notebooks makes them an excellent tool for learning and experimentation. If you are new to Python, you can resolve most questions as soon as they arise simply by typing the code into a cell and executing it. Let your curiosity run wild, discover what Python is capable of and what it isn't! 


You can also change the value at a specific index within a list using the assignment operation.

In [None]:
fruits

In [None]:
fruits[1] = 'blueberry'
fruits

A new value can be added to the end of a list using the `append` method.

In [None]:
fruits.append('dates')
fruits

A new value can also be inserted at a specific index using the `insert` method.

In [None]:
fruits.insert(1, 'banana')
fruits

You can remove a value from a list using the `remove` method.

In [None]:
fruits.remove('blueberry')
fruits

What happens if a list has multiple instances of the value passed to `.remove`? Try it out.
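
Here is a minimal sketch of that experiment, using a hypothetical list `colors` so that `fruits` is left untouched. It suggests that `remove` deletes only the first matching value.

```python
# A hypothetical list containing the value 'red' twice
colors = ['red', 'green', 'red', 'blue']

# Only the first occurrence of 'red' is removed
colors.remove('red')

print(colors)  # ['green', 'red', 'blue']
```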

To remove an element from a specific index, use the `pop` method. The method also returns the removed element.

In [None]:
fruits

In [None]:
fruits.pop(1)
fruits

If no index is provided, the `pop` method removes the last element of the list.

In [None]:
fruits.pop()
fruits

You can test whether a list contains a value using the `in` operator.

In [None]:
'pineapple' in fruits

In [None]:
'cherry' in fruits

To combine two or more lists, use the `+` operator. This operation is also called *concatenation*.

In [None]:
fruits

In [None]:
more_fruits = fruits + ['pineapple', 'tomato', 'guava'] + ['dates', 'banana']
more_fruits

To create a copy of a list, use the `copy` method. Modifying the copied list does not affect the original.

In [None]:
more_fruits_copy = more_fruits.copy()
more_fruits_copy 

In [None]:
# Modify the copy
more_fruits_copy.remove('pineapple')
more_fruits_copy.pop()
more_fruits_copy

In [None]:
# Original list remains unchanged
more_fruits

Note that you cannot create a copy of a list by simply creating a new variable using the assignment operator `=`. The new variable will point to the same list, and any modifications performed using either variable will affect the other.

In [None]:
more_fruits_not_a_copy = more_fruits

In [None]:
more_fruits_not_a_copy.remove('pineapple')
more_fruits_not_a_copy.pop()

In [None]:
more_fruits_not_a_copy

In [None]:
more_fruits

Just like strings, lists have several built-in methods. However, unlike string methods, most list methods modify the original list rather than returning a new one. Check out some common list operations here: https://www.w3schools.com/python/python_ref_list.asp .


Following are some exercises you can try out with list methods (use the blank code cells below):

* Reverse the order of elements in a list
* Add the elements of one list at the end of another list
* Sort a list of strings in alphabetical order
* Sort a list of numbers in decreasing order
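
One possible way to approach these exercises is sketched below, using the list methods `reverse`, `extend`, and `sort`. The lists `numbers` and `names` are hypothetical. Note that each of these methods modifies the list in place.

```python
# Hypothetical lists used only for this sketch
numbers = [4, 1, 3]
names = ['cherry', 'apple', 'banana']

# Reverse the order of elements in a list
numbers.reverse()
print(numbers)  # [3, 1, 4]

# Add the elements of one list at the end of another list
numbers.extend([10, 7])
print(numbers)  # [3, 1, 4, 10, 7]

# Sort a list of strings in alphabetical order
names.sort()
print(names)    # ['apple', 'banana', 'cherry']

# Sort a list of numbers in decreasing order
numbers.sort(reverse=True)
print(numbers)  # [10, 7, 4, 3, 1]
```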

### Tuple

A tuple is an ordered collection of values, similar to a list. However, it is not possible to add, remove, or modify values in a tuple. A tuple is created by enclosing values within parentheses `(` and `)`, separated by commas.

> Any data structure that cannot be modified after creation is called *immutable*. You can think of tuples as immutable lists.

Let's try some experiments with tuples.

In [None]:
fruits = ('apple', 'cherry', 'dates')

In [None]:
# check no. of elements
len(fruits)

In [None]:
# get an element (positive index)
fruits[0]

In [None]:
# get an element (negative index)
fruits[-2]

In [None]:
# check if it contains an element
'dates' in fruits

In [None]:
# try to change an element
fruits[0] = 'avocado'

In [None]:
# try to append an element
fruits.append('blueberry')

In [None]:
# try to remove an element
fruits.remove('apple')

You can also skip the parentheses `(` and `)` while creating a tuple. Python automatically converts comma-separated values into a tuple.

In [None]:
the_3_musketeers = 'Athos', 'Porthos', 'Aramis'
the_3_musketeers

You can also create a tuple with just one element by typing a comma after it. Just wrapping it with parentheses `(` and `)` won't make it a tuple.

In [None]:
single_element_tuple = 4,
single_element_tuple

In [None]:
another_single_element_tuple = (4,)
another_single_element_tuple 

In [None]:
not_a_tuple = (4)
not_a_tuple 

Tuples are often used to create multiple variables with a single statement.

In [None]:
point = (3, 4)

In [None]:
point_x, point_y = point

In [None]:
point_x

In [None]:
point_y

You can convert a list into a tuple using the `tuple` function, and vice versa using the `list` function.

In [None]:
tuple(['one', 'two', 'three'])

In [None]:
list(('Athos', 'Porthos', 'Aramis'))

Tuples have just two built-in methods: `count` and `index`. Can you figure out what they do? While you could look for documentation and examples online, there's an easier way to check a method's documentation: using the `help` function.

In [None]:
a_tuple = 23, "hello", False, None, 23, 37, "hello"

In [None]:
help(a_tuple.count)

Within a Jupyter notebook, you can also start a code cell with `?` and type the name of a function or method. When you execute this cell, you will see the function/method's documentation in a pop-up window.

In [None]:
?a_tuple.index

Try using `count` and `index` with `a_tuple` in the code cells below.
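
As a starting point, here is a short sketch using the same values as `a_tuple`. The names `times_23`, `first_hello`, and `times_bye` are hypothetical.

```python
a_tuple = 23, "hello", False, None, 23, 37, "hello"

# count returns the number of times a value occurs in the tuple
times_23 = a_tuple.count(23)
print(times_23)     # 2

# index returns the position of the first occurrence of a value
first_hello = a_tuple.index("hello")
print(first_hello)  # 1

# count returns 0 for a value that is not present
times_bye = a_tuple.count("bye")
print(times_bye)    # 0
```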

### Dictionary

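A dictionary is a collection of items in which each item has a key and a value. Items are looked up by key rather than by position (since Python 3.7, dictionaries also preserve the order in which items were inserted). You can use a key to retrieve the corresponding value from the dictionary. Dictionaries have the type `dict`.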

Dictionaries are often used to store many pieces of information e.g. details about a person, in a single variable. Dictionaries are created by enclosing key-value pairs within braces or curly brackets `{` and `}`.

In [None]:
person1 = {
    'name': 'John Doe',
    'sex': 'Male',
    'age': 32,
    'married': True
}
person1 

Dictionaries can also be created using the `dict` function.

In [None]:
person2 = dict(name='Jane Judy', sex='Female', age=28, married=False)
person2

In [None]:
type(person1)

Keys can be used to access values using square brackets `[` and `]`.

In [None]:
person1['name']

In [None]:
person1['married']

If a key isn't present in the dictionary, then a `KeyError` is *raised*.

In [None]:
person1['address']

You can also use the `get` method to access the value associated with a key.

In [None]:
person2.get("name")

The `get` method also accepts a default value, returned if the key is not present in the dictionary.

In [None]:
person2.get("address", "Unknown")

You can check whether a key is present in a dictionary using the `in` operator.

In [None]:
'address' in person1

In [None]:
'name' in person1

You can change the value associated with a key using the assignment operator.

In [None]:
person2['married']


In [None]:
person2['married'] = True
person2['married']

The assignment operator can also be used to add new key-value pairs to the dictionary.

In [None]:
person1['address'] = '1, Penny Lane'
person1

To remove a key and the associated value from a dictionary, use the `pop` method.

In [None]:
person1.pop('address')
person1

Dictionaries also provide methods to view the list of keys, values, or key-value pairs inside it.

In [None]:
person1.keys()

In [None]:
person1.values()

In [None]:
person1.items()

Dictionaries provide many other methods. You can learn more about them here: https://www.w3schools.com/python/python_ref_dictionary.asp .

Here are some experiments you can try out with dictionaries (use the empty cells below):
* What happens if you use the same key multiple times while creating a dictionary?
* How can you create a copy of a dictionary (modifying the copy should not change the original)?
* Can the value associated with a key itself be a dictionary?
* How can you add the key-value pairs from one dictionary into another dictionary? Hint: See the `update` method.
* Can the dictionary's keys be something other than a string, e.g., a number, boolean, list, etc.?
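
The sketch below works through these experiments with hypothetical dictionaries (`scores`, `scores_copy`, `student`, and `mixed_keys` are not defined elsewhere in this notebook). Treat it as one possible set of answers and verify each of them yourself.

```python
# Using the same key multiple times: the last value is kept
scores = {'math': 70, 'math': 95}
print(scores)  # {'math': 95}

# The copy method creates a copy; changing the copy leaves the original as is
scores_copy = scores.copy()
scores_copy['art'] = 80
print(scores)  # {'math': 95}

# A value can itself be a dictionary
student = {'name': 'Ada', 'scores': scores}
print(student['scores']['math'])  # 95

# The update method adds the key-value pairs of another dictionary
scores.update({'physics': 88})
print(scores)  # {'math': 95, 'physics': 88}

# Keys can be numbers or tuples, among other immutable values.
# A list cannot be used as a key: Python raises a TypeError if you try.
mixed_keys = {1: 'an integer', 2.5: 'a float', (3, 4): 'a tuple'}
print(mixed_keys[(3, 4)])  # a tuple
```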

## Further Reading

We've now completed our exploration of variables and common data types in Python. Following are some resources to learn more about data types in Python:

* Python official documentation: https://docs.python.org/3/tutorial/index.html
* Python Tutorial at W3Schools: https://www.w3schools.com/python/
* Practical Python Programming: https://dabeaz-course.github.io/practical-python/Notes/Contents.html

You are now ready to move on to the next tutorial: [Branching using conditional statements and loops in Python](https://jovian.ml/aakashns/python-branching-and-loops)

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is a variable in Python?
2. How do you create a variable?
3. How do you check the value within a variable?
4. How do you create multiple variables in a single statement?
5. How do you create multiple variables with the same value?
6. How do you change the value of a variable?
7. How do you reassign a variable by modifying the previous value?
8. What does the statement `counter += 4` do?
9. What are the rules for naming a variable?
10. Are variable names case-sensitive? Do `a_variable`, `A_Variable`, and `A_VARIABLE` represent the same variable or different ones?
11. What is Syntax? Why is it important?
12. What happens if you execute a statement with invalid syntax?
13. How do you check the data type of a variable?
14. What are the built-in data types in Python?
15. What is a primitive data type? 
16. What are the primitive data types available in Python?
17. What is a data structure or container data type?
18. What are the container types available in Python?
19. What kind of data does the Integer data type represent?
20. What are the numerical limits of the integer data type?
21. What kind of data does the float data type represent?
22. How does Python decide if a given number is a float or an integer?
23. How can you create a variable which stores a whole number, e.g., 4 but has the float data type?
24. How do you create floats representing very large (e.g., 6.023 x 10^23) or very small numbers (0.000000123)?
25. What does the expression `23e-12` represent?
26. Can floats be used to store numbers with unlimited precision?
27. What are the differences between integers and floats?
28. How do you convert an integer to a float?
29. How do you convert a float to an integer?
30. What is the result obtained when you convert 1.99 to an integer?
31. What are the data types of the results of the division operators `/` and `//`?
32. What kind of data does the Boolean data type represent?
33. Which types of Python operators return booleans as a result?
34. What happens if you try to use a boolean in arithmetic operation?
35. How can any value in Python be converted to a boolean?
36. What are truthy and falsy values?
37. What are the values in Python that evaluate to False?
38. Give some examples of values that evaluate to True.
39. What kind of data does the None data type represent?
40. What is the purpose of None?
41. What kind of data does the String data type represent?
42. What are the different ways of creating strings in Python?
43. What is the difference between strings created using single quotes, i.e. `'` and `'` vs. those created using double quotes, i.e. `"` and `"`?
44. How do you create multi-line strings in Python?
45. What is the newline character, `\n`?
46. What are escaped characters? How are they useful?
47. How do you check the length of a string?
48. How do you convert a string into a list of characters?
49. How do you access a specific character from a string?
50. How do you access a range of characters from a string?
51. How do you check if a specific character occurs in a string?
52. How do you check if a smaller string occurs within a bigger string?
53. How do you join two or more strings?
54. What are "methods" in Python? How are they different from functions?
55. What do the `.lower`, `.upper` and `.capitalize` methods on strings do?
56. How do you replace a specific part of a string with something else?
57. How do you split the string "Sun,Mon,Tue,Wed,Thu,Fri,Sat" into a list of days?
58. How do you remove whitespace from the beginning and end of a string?
59. What is the string `.format` method used for? Can you give an example?
60. What are the benefits of using the `.format` method instead of string concatenation?
61. How do you convert a value of another type to a string?
62. How do you check if two strings have the same value?
63. Where can you find the list of all the methods supported by strings?
64. What is a list in Python?
65. How do you create a list?
66. Can a Python list contain values of different data types?
67. Can a list contain another list as an element within it?
68. Can you create a list without any values?
69. How do you check the length of a list in Python?
70. How do you retrieve a value from a list?
71. What is the smallest and largest index you can use to access elements from a list containing five elements?
72. What happens if you try to access an index equal to or larger than the size of a list?
73. What happens if you try to access a negative index within a list?
74. How do you access a range of elements from a list?
75. How many elements does the list returned by the expression `a_list[2:5]` contain?
76. What do the ranges `a_list[:2]` and `a_list[2:]` represent?
77. How do you change the item stored at a specific index within a list?
78. How do you insert a new item at the beginning, middle, or end of a list?
79. How do you remove an item from a list?
80. How do you remove the item at a given index from a list?
81. How do you check if a list contains a value?
82. How do you combine two or more lists to create a larger list?
83. How do you create a copy of a list?
84. Does the expression `a_new_list = a_list` create a copy of the list `a_list`?
85. Where can you find the list of all the methods supported by lists?
86. What is a Tuple in Python?
87. How is a tuple different from a list?
88. Can you add or remove elements in a tuple?
89. How do you create a tuple with just one element?
90. How do you convert a tuple to a list and vice versa?
91. What are the `count` and `index` methods of a Tuple used for?
92. What is a dictionary in Python?
93. How do you create a dictionary?
94. What are keys and values?
95. How do you access the value associated with a specific key in a dictionary?
96. What happens if you try to access the value for a key that doesn't exist in a dictionary?
97. What is the `.get` method of a dictionary used for?
98. How do you change the value associated with a key in a dictionary?
99. How do you add or remove a key-value pair in a dictionary?
100. How do you access the keys, values, and key-value pairs within a dictionary?

# Branching using Conditional Statements and Loops in Python

![](https://i.imgur.com/7RfcHV0.png)

This tutorial covers the following topics:

- Branching with `if`, `else` and `elif`
- Nested conditions and `if` expressions
- Iteration with `while` loops
- Iterating over containers with `for` loops
- Nested loops, `break` and `continue` statements

## Branching with `if`, `else` and `elif`

One of the most powerful features of programming languages is *branching*: the ability to make decisions and execute a different set of statements based on whether one or more conditions are true.

### The `if` statement

In Python, branching is implemented using the `if` statement, which is written as follows:

```
if condition:
    statement1
    statement2
```

The `condition` can be a value, variable or expression. If the condition evaluates to `True`, then the statements within the *`if` block* are executed. Notice the four spaces before `statement1`, `statement2`, etc. The spaces inform Python that these statements are associated with the `if` statement above. This technique of structuring code by adding spaces is called *indentation*.

> **Indentation**: Python relies heavily on *indentation* (white space before a statement) to define code structure. This makes Python code easy to read and understand. You can run into problems if you don't use indentation properly. Indent your code by placing the cursor at the start of the line and pressing the `Tab` key once to add 4 spaces. Pressing `Tab` again will indent the code further by 4 more spaces, and pressing `Shift+Tab` will reduce the indentation by 4 spaces. 


For example, let's write some code to check and print a message if a given number is even.

In [None]:
a_number = 34
if a_number % 2 == 0:
    print("We're inside an if block")
    print('The given number {} is even.'.format(a_number))

We use the modulus operator `%` to calculate the remainder from the division of `a_number` by `2`. Then, we use the comparison operator `==` to check if the remainder is `0`, which tells us whether the number is even, i.e., divisible by 2.

Since `34` is divisible by `2`, the expression `a_number % 2 == 0` evaluates to `True`, so both `print` statements under the `if` statement are executed. Also, note that we are using the string `format` method to include the number within the message.

Let's try the above again with an odd number.

In [2]:
another_number = 33
if another_number % 2 == 0:
    print('The given number {} is even.'.format(another_number))

As expected, since the condition `another_number % 2 == 0` evaluates to `False`, no message is printed. 

### The `else` statement

We may want to print a different message if the number is not even in the above example. This can be done by adding the `else` statement. It is written as follows:

```
if condition:
    statement1
    statement2
else:
    statement4
    statement5

```

If `condition` evaluates to `True`, the statements in the `if` block are executed. If it evaluates to `False`, the statements in the `else` block are executed.

In [None]:
a_number = 34
if a_number % 2 == 0:
    print('The given number {} is even.'.format(a_number))
else:
    print('The given number {} is odd.'.format(a_number))

In [3]:
another_number = 33
if another_number % 2 == 0:
    print('The given number {} is even.'.format(another_number))
else:
    print('The given number {} is odd.'.format(another_number))

The given number 33 is odd.


Here's another example, which uses the `in` operator to check membership within a tuple.

In [None]:
the_3_musketeers = ('Athos', 'Porthos', 'Aramis')

In [None]:
a_candidate = "D'Artagnan"
if a_candidate in the_3_musketeers:
    print("{} is a musketeer".format(a_candidate))
else:
    print("{} is not a musketeer".format(a_candidate))

### The `elif` statement

Python also provides an `elif` statement (short for "else if") to chain a series of conditional blocks. The conditions are evaluated one by one. For the first condition that evaluates to `True`, the block of statements below it is executed. The remaining conditions and statements are not evaluated. So, in an `if`, `elif`, `elif`... chain, at most one block of statements is executed, the one corresponding to the first condition that evaluates to `True`. 

In [None]:
today = 'Wednesday'

if today == 'Sunday':
    print("Today is the day of the sun.")
elif today == 'Monday':
    print("Today is the day of the moon.")
elif today == 'Tuesday':
    print("Today is the day of Tyr, the god of war.")
elif today == 'Wednesday':
    print("Today is the day of Odin, the supreme deity.")
elif today == 'Thursday':
    print("Today is the day of Thor, the god of thunder.")
elif today == 'Friday':
    print("Today is the day of Frigga, the goddess of beauty.")
elif today == 'Saturday':
    print("Today is the day of Saturn, the god of fun and feasting.")

In the above example, the first 3 conditions evaluate to `False`, so none of the first 3 messages are printed. The fourth condition evaluates to `True`, so the corresponding message is printed. The remaining conditions are skipped. Try changing the value of `today` above and re-executing the cells to print all the different messages.


To verify that the remaining conditions are skipped, let us try another example.

In [None]:
a_number = 15

if a_number % 2 == 0:
    print('{} is divisible by 2'.format(a_number))
elif a_number % 3 == 0:
    print('{} is divisible by 3'.format(a_number))
elif a_number % 5 == 0:
    print('{} is divisible by 5'.format(a_number))
elif a_number % 7 == 0:
    print('{} is divisible by 7'.format(a_number))

Note that the message `15 is divisible by 5` is not printed because the condition `a_number % 5 == 0` isn't evaluated, since the previous condition `a_number % 3 == 0` evaluates to `True`. This is the key difference between using a chain of `if`, `elif`, `elif`... statements vs. a chain of `if` statements, where each condition is evaluated independently.

In [None]:
if a_number % 2 == 0:
    print('{} is divisible by 2'.format(a_number))
if a_number % 3 == 0:
    print('{} is divisible by 3'.format(a_number))
if a_number % 5 == 0:
    print('{} is divisible by 5'.format(a_number))
if a_number % 7 == 0:
    print('{} is divisible by 7'.format(a_number))

### Using `if`, `elif`, and `else` together

You can also include an `else` statement at the end of a chain of `if`, `elif`... statements. The code within the `else` block is executed when none of the conditions hold true.

In [None]:
a_number = 49

if a_number % 2 == 0:
    print('{} is divisible by 2'.format(a_number))
elif a_number % 3 == 0:
    print('{} is divisible by 3'.format(a_number))
elif a_number % 5 == 0:
    print('{} is divisible by 5'.format(a_number))
else:
    print('All checks failed!')
    print('{} is not divisible by 2, 3 or 5'.format(a_number))

Conditions can also be combined using the logical operators `and`, `or` and `not`. Logical operators are explained in detail in the [first tutorial](https://jovian.ml/aakashns/first-steps-with-python/v/4#C49).

In [None]:
a_number = 12

if a_number % 3 == 0 and a_number % 5 == 0:
    print("The number {} is divisible by 3 and 5".format(a_number))
elif not a_number % 5 == 0:
    print("The number {} is not divisible by 5".format(a_number))

### Non-Boolean Conditions

Note that conditions do not necessarily have to be booleans. In fact, a condition can be any value. The value is converted into a boolean automatically, as if by the `bool` function. This means that falsy values like `0`, `''`, `{}`, `[]`, `None`, etc. evaluate to `False` and all other values evaluate to `True`.

In [None]:
if '':
    print('The condition evaluated to True')
else:
    print('The condition evaluated to False')

In [None]:
if 'Hello':
    print('The condition evaluated to True')
else:
    print('The condition evaluated to False')

In [None]:
if { 'a': 34 }:
    print('The condition evaluated to True')
else:
    print('The condition evaluated to False')

In [None]:
if None:
    print('The condition evaluated to True')
else:
    print('The condition evaluated to False')

### Nested conditional statements

The code inside an `if` block can also include an `if` statement inside it. This pattern is called *nesting* and is used to check for another condition after a particular condition holds true.

In [None]:
a_number = 15

if a_number % 2 == 0:
    print("{} is even".format(a_number))
    if a_number % 3 == 0:
        print("{} is also divisible by 3".format(a_number))
    else:
        print("{} is not divisible by 3".format(a_number))
else:
    print("{} is odd".format(a_number))
    if a_number % 5 == 0:
        print("{} is also divisible by 5".format(a_number))
    else:
        print("{} is not divisible by 5".format(a_number))

Notice how the `print` statements are indented by 8 spaces to indicate that they are part of the inner `if`/`else` blocks.

> Nested `if`, `else` statements are often confusing to read and prone to human error. It's good to avoid nesting whenever possible, or limit the nesting to 1 or 2 levels.
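
One way to reduce nesting, sketched below, is to combine conditions with `and` in a single `if`, `elif`, `else` chain. The variables `number` and `message` are hypothetical, and the messages are worded slightly differently from the nested example above.

```python
number = 15

# Each branch checks a combined condition instead of nesting a second "if"
if number % 2 == 0 and number % 3 == 0:
    message = "{} is even and divisible by 3".format(number)
elif number % 2 == 0:
    message = "{} is even but not divisible by 3".format(number)
elif number % 5 == 0:
    # This branch is reached only for odd numbers, since both checks above failed
    message = "{} is odd and divisible by 5".format(number)
else:
    message = "{} is odd and not divisible by 5".format(number)

print(message)  # 15 is odd and divisible by 5
```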

### Shorthand `if` conditional expression

A frequent use case of the `if` statement involves testing a condition and setting a variable's value based on the condition.

In [None]:
a_number = 13

if a_number % 2 == 0:
    parity = 'even'
else:
    parity = 'odd'

print('The number {} is {}.'.format(a_number, parity))

Python provides a shorter syntax, which allows writing such conditions in a single line of code. It is known as a *conditional expression*, sometimes also referred to as a *ternary operator*. It has the following syntax:

```
x = true_value if condition else false_value
```

It has the same behavior as the following `if`-`else` block:

```
if condition:
    x = true_value
else:
    x = false_value
```

Let's try it out for the example above.

In [None]:
parity = 'even' if a_number % 2 == 0 else 'odd'
print('The number {} is {}.'.format(a_number, parity))

### Statements and Expressions

The conditional expression highlights an essential distinction between *statements* and *expressions* in Python. 

> **Statements**: A statement is an instruction that can be executed. Every line of code we have written so far is a statement e.g. assigning a variable, calling a function, conditional statements using `if`, `else`, and `elif`, loops using `for` and `while` etc.

> **Expressions**: An expression is some code that evaluates to a value. Examples include values of different data types, arithmetic expressions, conditions, variables, function calls, conditional expressions, etc. 


Most expressions can be executed as statements, but not all statements are expressions. For example, the regular `if` statement is not an expression since it does not evaluate to a value. It merely performs some branching in the code. Similarly, loops and function definitions are not expressions (we'll learn more about these in later sections).

As a rule of thumb, an expression is anything that can appear on the right side of the assignment operator `=`. You can use this as a test for checking whether something is an expression or not. You'll get a syntax error if you try to assign something that is not an expression.

In [None]:
# if statement
result = if a_number % 2 == 0: 
    'even'
else:
    'odd'

In [None]:
# if expression
result = 'even' if a_number % 2 == 0 else 'odd'

### The `pass` statement

`if` statements cannot be empty: there must be at least one statement in every `if` and `elif` block. Leaving a block empty, as in the first cell below, causes an error. You can use the `pass` statement to do nothing and avoid getting an error.

In [None]:
a_number = 9

if a_number % 2 == 0:
elif a_number % 3 == 0:
    print('{} is divisible by 3 but not divisible by 2'.format(a_number))

In [None]:
if a_number % 2 == 0:
    pass
elif a_number % 3 == 0:
    print('{} is divisible by 3 but not divisible by 2'.format(a_number))

## Iteration with `while` loops

Another powerful feature of programming languages, closely related to branching, is running one or more statements multiple times. This feature is often referred to as *iteration* or *looping*, and there are two ways to do this in Python: using `while` loops and `for` loops.

`while` loops have the following syntax:

```
while condition:
    statement(s)
```

Statements in the code block under `while` are executed repeatedly as long as the `condition` evaluates to `True`. Generally, one of the statements under `while` makes some change to a variable that causes the condition to evaluate to `False` after a certain number of iterations.

Let's try to calculate the factorial of `100` using a `while` loop. The factorial of a number `n` is the product (multiplication) of all the numbers from `1` to `n`, i.e., `1*2*3*...*(n-2)*(n-1)*n`.

In [None]:
result = 1
i = 1

while i <= 100:
    result = result * i
    i = i+1

print('The factorial of 100 is: {}'.format(result))

Here's how the above code works:

* We initialize two variables, `result` and `i`. `result` will contain the final outcome, and `i` keeps track of the next number to be multiplied with `result`. Both are initialized to 1 (can you explain why?).

* The condition `i <= 100` holds true (since `i` is initially `1`), so the `while` block is executed.

* The `result` is updated to `result * i`, `i` is increased by `1` and it now has the value `2`.

* At this point, the condition `i <= 100` is evaluated again. Since it continues to hold true, `result` is again updated to `result * i`, and `i` is increased to `3`.

* This process is repeated till the condition becomes false, which happens when `i` holds the value `101`. Once the condition evaluates to `False`, the execution of the loop ends, and the `print` statement below it is executed. 

Can you see why `result` contains the value of the factorial of 100 at the end? If not, try adding `print` statements inside the `while` block to print `result` and `i` in each iteration.
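
As a minimal sketch of that suggestion, the cell below repeats the loop for a smaller number (`5` instead of `100`, chosen here only to keep the output short) and prints `i` and `result` in each iteration. The final check against `math.factorial` from the standard library is an addition for verification and is not part of the original example.

```python
import math

result = 1
i = 1

while i <= 5:
    result = result * i
    # Log the state after each multiplication
    print('i: {}, result: {}'.format(i, result))
    i = i + 1

# Cross-check against the standard library implementation
print(result == math.factorial(5))  # True
```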


> Iteration is a powerful technique because it gives computers a massive advantage over human beings in performing thousands or even millions of repetitive operations really fast. With just 4-5 lines of code, we were able to multiply 100 numbers almost instantly. The same code can be used to multiply a thousand numbers (just change the condition to `i <= 1000`) in well under a second.

You can check how long a cell takes to execute by adding the *magic* command `%%time` at the top of a cell. Try checking how long it takes to compute the factorial of `100`, `1000`, `10000`, `100000`, etc. 

In [None]:
%%time

result = 1
i = 1

while i <= 1000:
    result *= i # same as result = result * i
    i += 1 # same as i = i+1

print(result)
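
The `%%time` magic command is available only in Jupyter (IPython). As a rough sketch for plain Python scripts, the standard library's `time.perf_counter` function can measure elapsed time instead. The variable names `start` and `elapsed` are illustrative, and the measured time will differ from machine to machine.

```python
import time

start = time.perf_counter()  # high-resolution timer

result = 1
i = 1

while i <= 1000:
    result *= i
    i += 1

elapsed = time.perf_counter() - start
print('Computed the factorial of 1000 in {:.6f} seconds'.format(elapsed))
```

This sketch prints only the elapsed time. Recent Python versions (3.11 and later) limit, by default, how many digits are produced when a very large integer is converted to text, so printing the factorial of `10000` itself may raise a `ValueError` even though the computation succeeds.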

Here's another example that uses two `while` loops to create an interesting pattern.

In [5]:
line = '*'
max_length = 10

while len(line) < max_length:
    print(line)
    line += "*"
    
while len(line) > 0:
    print(line)
    line = line[:-1]

*
**
***
****
*****
******
*******
********
*********
**********
*********
********
*******
******
*****
****
***
**
*


Can you see how the above example works? As an exercise, try printing the following pattern using a while loop (Hint: use string concatenation):

```
          *
         **
        ***
       ****
      *****
     ******
      *****
       ****
        ***
         **
          *
```

Here's another one, putting the two together:


```
          *
         ***
        *****
       *******
      *********
     ***********
      *********
       *******
        *****
         ***
          *
```

Watch this playlist to learn how to create the above patterns: https://www.youtube.com/watch?v=2Icpbawb-vw&list=PLyMom0n-MBrpVcMqVV9kbA-hq2ygir0uW

### Infinite Loops

Suppose the condition in a `while` loop always holds true. In that case, Python repeatedly executes the code within the loop forever, and the execution of the code never completes. This situation is called an infinite loop. It generally indicates that you've made a mistake in your code. For example, you may have provided the wrong condition or forgotten to update a variable within the loop, so the condition never becomes false.

If your code is *stuck* in an infinite loop during execution, just press the "Stop" button on the toolbar (next to "Run") or select "Kernel > Interrupt" from the menu bar. This will *interrupt* the execution of the code. The following two cells both lead to infinite loops and need to be interrupted.

In [None]:
# INFINITE LOOP - INTERRUPT THIS CELL

result = 1
i = 1

while i <= 100:
    result = result * i
    # forgot to increment i

In [None]:
# INFINITE LOOP - INTERRUPT THIS CELL

result = 1
i = 1

while i > 0 : # wrong condition
    result *= i
    i += 1

### `break` and `continue` statements

You can use the `break` statement within the loop's body to immediately stop the execution and *break* out of the loop (even if the condition provided to `while` still holds true).

In [None]:
i = 1
result = 1

while i <= 100:
    result *= i
    if i == 42:
        print('Magic number 42 reached! Stopping execution..')
        break
    i += 1
    
print('i:', i)
print('result:', result)

As you can see above, the value of `i` at the end of execution is 42. This example also shows how you can use an `if` statement within a `while` loop.
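
`break` can also serve as a safety net against the infinite loops described earlier. The following is a minimal sketch, assuming a hypothetical cap `max_iterations` that is not part of the original examples: the loop keeps the faulty condition `i > 0` but stops once the cap is exceeded.

```python
result = 1
i = 1
max_iterations = 1000  # hypothetical safety cap, chosen arbitrarily

while i > 0:  # wrong condition: always true for positive i
    result *= i
    i += 1
    if i > max_iterations:
        print('Safety cap reached, stopping the loop')
        break

print('i:', i)
```

A cap like this only limits the damage; the real fix is still to correct the loop's condition.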

Sometimes you may not want to end the loop entirely, but simply skip the remaining statements in the loop and *continue* to the next iteration. You can do this using the `continue` statement.

In [None]:
i = 1
result = 1

while i < 20:
    i += 1
    if i % 2 == 0:
        print('Skipping {}'.format(i))
        continue
    print('Multiplying with {}'.format(i))
    result = result * i
    
print('i:', i)
print('result:', result)

In the example above, the statement `result = result * i` inside the loop is skipped when `i` is even, as indicated by the messages printed during execution.

> **Logging**: The process of adding `print` statements at different points in the code (often within loops and conditional statements) for inspecting the values of variables at various stages of execution is called logging. As our programs get larger, they naturally become prone to human errors. Logging can help in verifying the program is working as expected. In many cases, `print` statements are added while writing & testing some code and are removed later.

## Iteration with `for` loops

A `for` loop is used for iterating or looping over sequences and other collections, e.g., lists, tuples, dictionaries, strings, and *ranges*. `for` loops have the following syntax:

```
for value in sequence:
    statement(s)
```

The statements within the loop are executed once for each element in `sequence`. Here's an example that prints all the elements of a list.

In [None]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

for day in days:
    print(day)

Let's try using `for` loops with some other data types.

In [None]:
# Looping over a string
for char in 'Monday':
    print(char)

In [None]:
# Looping over a tuple
for fruit in ('Apple', 'Banana', 'Guava'):
    print("Here's a fruit:", fruit)

In [None]:
# Looping over a dictionary
person = {
    'name': 'John Doe',
    'sex': 'Male',
    'age': 32,
    'married': True
}

for key in person:
    print("Key:", key, ",", "Value:", person[key])

Note that while using a dictionary with a `for` loop, the iteration happens over the dictionary's keys. The key can be used within the loop to access the value. You can also iterate directly over the values using the `.values` method or over key-value pairs using the `.items` method.

In [None]:
for value in person.values():
    print(value)

In [None]:
for key_value_pair in person.items():
    print(key_value_pair)

Since a key-value pair is a tuple, we can also extract the key & value into separate variables.

In [None]:
for key, value in person.items():
    print("Key:", key, ",", "Value:", value)

### Iterating using `range` and `enumerate`

The `range` function is used to create a sequence of numbers that can be iterated over using a `for` loop. It can be used in 3 ways:
 
* `range(n)` - Creates a sequence of numbers from `0` to `n-1`
* `range(a, b)` - Creates a sequence of numbers from `a` to `b-1`
* `range(a, b, step)` - Creates a sequence of numbers from `a` to `b-1` with increments of `step`

Let's try it out.

In [None]:
for i in range(7):
    print(i)

In [None]:
for i in range(3, 10):
    print(i)

In [None]:
for i in range(3, 14, 4):
    print(i)
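
A range does not print its numbers when displayed directly, so a quick way to inspect one is to convert it to a list. The sketch below does this for the three forms shown above. The last line, which uses a negative `step` to count downwards, is an extra illustration not covered in the list above, and the variable names are made up for this example.

```python
up_to_seven = list(range(7))
three_to_nine = list(range(3, 10))
every_fourth = list(range(3, 14, 4))
counting_down = list(range(10, 0, -3))  # negative step counts downwards

print(up_to_seven)    # [0, 1, 2, 3, 4, 5, 6]
print(three_to_nine)  # [3, 4, 5, 6, 7, 8, 9]
print(every_fourth)   # [3, 7, 11]
print(counting_down)  # [10, 7, 4, 1]
```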

Ranges are used for iterating over lists when you need to track the index of elements while iterating.

In [None]:
a_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

for i in range(len(a_list)):
    print('The value at position {} is {}.'.format(i, a_list[i]))

Another way to achieve the same result is by using the `enumerate` function with `a_list` as an input, which produces, for each item in the list, a tuple containing the index and the corresponding element.

In [None]:
for i, val in enumerate(a_list):
    print('The value at position {} is {}.'.format(i, val))
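
To see the tuples themselves, the result of `enumerate` can be converted to a list. The sketch below also uses the optional `start` argument of `enumerate`, which is not covered above, to begin counting from `1` instead of `0`. The names `pairs` and `position` are illustrative.

```python
a_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Each item produced by enumerate is an (index, element) tuple
pairs = list(enumerate(a_list))
print(pairs[:2])  # [(0, 'Monday'), (1, 'Tuesday')]

# The optional start argument changes the first index
for position, val in enumerate(a_list, start=1):
    print('Day {} of the work week is {}.'.format(position, val))
```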

### `break`, `continue` and `pass` statements

Similar to `while` loops, `for` loops also support the `break` and `continue` statements. `break` is used for breaking out of the loop and `continue` is used for skipping ahead to the next iteration.

In [None]:
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

In [None]:
for day in weekdays:
    print('Today is {}'.format(day))
    if (day == 'Wednesday'):
        print("I don't work beyond Wednesday!")
        break

In [None]:
for day in weekdays:
    if (day == 'Wednesday'):
        print("I don't work on Wednesday!")
        continue
    print('Today is {}'.format(day))

Like `if` statements, `for` loops cannot be empty, so you can use a `pass` statement if you don't want to execute any statements inside the loop.

In [None]:
for day in weekdays:
    pass

### Nested `for` and `while` loops

Similar to conditional statements, loops can be nested inside other loops. This is useful for looping over lists of lists, dictionaries, etc.

In [None]:
persons = [{'name': 'John', 'sex': 'Male'}, {'name': 'Jane', 'sex': 'Female'}]

for person in persons:
    for key in person:
        print(key, ":", person[key])
    print(" ")

In [6]:
days = ['Monday', 'Tuesday', 'Wednesday']
fruits = ['apple', 'banana', 'guava']

for day in days:
    for fruit in fruits:
        print(day, fruit)

Monday apple
Monday banana
Monday guava
Tuesday apple
Tuesday banana
Tuesday guava
Wednesday apple
Wednesday banana
Wednesday guava
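
The examples above nest `for` loops, but `while` loops can be nested in the same way. Here's a small sketch (the variable names `row`, `col`, `cells`, and `table` are illustrative) that builds a 3x3 multiplication table using one `while` loop inside another.

```python
table = []
row = 1

while row <= 3:
    col = 1  # reset the inner counter for every row
    cells = []
    while col <= 3:
        cells.append(row * col)
        col += 1  # update the inner counter
    table.append(cells)
    print(cells)
    row += 1  # update the outer counter
```

Note that each loop updates its own counter. Forgetting either update would lead to an infinite loop.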


## Further Reading and References

We've covered a lot of ground in just 3 tutorials. 

Following are some resources to learn more about conditional statements and loops in Python:

* Python Tutorial at W3Schools: https://www.w3schools.com/python/
* Practical Python Programming: https://dabeaz-course.github.io/practical-python/Notes/Contents.html
* Python official documentation: https://docs.python.org/3/tutorial/index.html

You are now ready to move on to the next tutorial: [Writing Reusable Code Using Functions in Python](https://jovian.ai/aakashns/python-functions-and-scope)

Let's save a snapshot of our notebook one final time using `jovian.commit`.

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is branching in programming languages?
2. What is the purpose of the `if` statement in Python?
3. What is the syntax of the `if` statement? Give an example.
4. What is indentation? Why is it used?
5. What is an indented block of statements?
6. How do you perform indentation in Python?
7. What happens if some code is not indented correctly?
8. What happens when the condition within the `if` statement evaluates to `True`? What happens if the condition evaluates to `False`?
9. How do you check if a number is even?
10. What is the purpose of the `else` statement in Python?
11. What is the syntax of the `else` statement? Give an example.
12. Write a program that prints different messages based on whether a number is positive or negative.
13. Can the `else` statement be used without an `if` statement?
14. What is the purpose of the `elif` statement in Python?
15. What is the syntax of the `elif` statement? Give an example.
16. Write a program that prints different messages for different months of the year.
17. Write a program that uses `if`, `elif`, and `else` statements together.
18. Can the `elif` statement be used without an `if` statement?
19. Can the `elif` statement be used without an `else` statement?
20. What is the difference between a chain of `if`, `elif`, `elif`… statements and a chain of `if`, `if`, `if`… statements? Give an example.
21. Can non-boolean conditions be used with `if` statements? Give some examples.
22. What are nested conditional statements? How are they useful?
23. Give an example of nested conditional statements.
24. Why is it advisable to avoid nested conditional statements?
25. What is the shorthand `if` conditional expression? 
26. What is the syntax of the shorthand `if` conditional expression? Give an example.
27. What is the difference between the shorthand `if` expression and the regular `if` statement?
28. What is a statement in Python?
29. What is an expression in Python?
30. What is the difference between statements and expressions?
31. Is every statement an expression? Give an example or counterexample.
32. Is every expression a statement? Give an example or counterexample.
33. What is the purpose of the `pass` statement in `if` blocks?
34. What is iteration or looping in programming languages? Why is it useful?
35. What are the two ways for performing iteration in Python?
36. What is the purpose of the `while` statement in Python?
37. What is the syntax of the `while` statement in Python? Give an example.
38. Write a program to compute the sum of the numbers 1 to 100 using a while loop. 
39. Repeat the above program for numbers up to 1000, 10000, and 100000. How long does it take each loop to complete?
40. What is an infinite loop?
41. What causes a program to enter an infinite loop?
42. How do you interrupt an infinite loop within Jupyter?
43. What is the purpose of the `break` statement in Python? 
44. Give an example of using a `break` statement within a while loop.
45. What is the purpose of the `continue` statement in Python?
46. Give an example of using the `continue` statement within a while loop.
47. What is logging? How is it useful?
48. What is the purpose of the `for` statement in Python?
49. What is the syntax of `for` loops? Give an example.
50. How are for loops and while loops different?
51. How do you loop over a string? Give an example.
52. How do you loop over a list? Give an example.
53. How do you loop over a tuple? Give an example.
54. How do you loop over a dictionary? Give an example.
55. What is the purpose of the `range` function? Give an example.
56. What is the purpose of the `enumerate` function? Give an example.
57. How are the `break`, `continue`, and `pass` statements used in for loops? Give examples.
58. Can loops be nested within other loops? How is nesting useful?
59. Give an example of a for loop nested within another for loop.
60. Give an example of a while loop nested within another while loop.
61. Give an example of a for loop nested within a while loop.
62. Give an example of a while loop nested within a for loop.


# Writing Reusable Code using Functions in Python

This tutorial is a part of [Data Analysis with Python: Zero to Pandas](https://jovian.ai/learn/data-analysis-with-python-zero-to-pandas) and [Zero to Data Analyst Science Bootcamp](https://jovian.ai/learn/zero-to-data-analyst-bootcamp).

![](https://i.imgur.com/TvNf5Jp.png)

These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. 

This tutorial covers the following topics:

- Creating and using functions in Python
- Local variables, return values, and optional arguments
- Reusing functions and using Python library functions
- Exception handling using `try`-`except` blocks
- Documenting functions using docstrings

## Creating and using functions

A function is a reusable set of instructions that takes zero or more inputs, performs some operations, and often returns an output. Python contains many built-in functions like `print`, `len`, etc., and provides the ability to define new ones.

In [None]:
today = "Saturday"
print("Today is", today)

You can define a new function using the `def` keyword.

In [None]:
def say_hello():
    print('Hello there!')
    print('How are you?')

Note the round brackets or parentheses `()` and colon `:` after the function's name. Both are essential parts of the syntax. The function's *body* contains an indented block of statements. The statements inside a function's body are not executed when the function is defined. To execute the statements, we need to *call* or *invoke* the function.

In [None]:
say_hello()

### Function arguments

Functions can accept zero or more values as *inputs* (also known as *arguments* or *parameters*). Arguments help us write flexible functions that can perform the same operations on different values. Further, functions can return a result that can be stored in a variable or used in other expressions.

Here's a function that picks out the even numbers from a list and returns them as a new list using the `return` keyword.

In [2]:
def filter_even(number_list):
    result_list = []
    for number in number_list:
        if number % 2 == 0:
            result_list.append(number)
    return result_list

Can you understand what the function does by looking at the code? If not, try executing each line of the function's body separately within a code cell with an actual list of numbers in place of `number_list`.

In [3]:
even_list = filter_even([1, 2, 3, 4, 5, 6, 7])
even_list

[2, 4, 6]

## Writing great functions in Python

As a programmer, you will spend most of your time writing and using functions. Python offers many features to make your functions powerful and flexible. Let's explore some of these by solving a problem:

> Radha is planning to buy a house that costs `$1,260,000`. She is considering two options to finance her purchase:
>
> * Option 1: Make an immediate down payment of `$300,000`, and take an 8-year loan with an interest rate of 10% (compounded monthly) for the remaining amount.
> * Option 2: Take a 10-year loan with an interest rate of 8% (compounded monthly) for the entire amount.
>
> Both these loans have to be paid back in equal monthly installments (EMIs). Which loan has a lower EMI among the two?


Since we need to compare the EMIs for two loan options, defining a function to calculate the EMI for a loan would be a great idea. The inputs to the function would be the cost of the house, the down payment, the duration of the loan, the rate of interest, etc. We'll build this function step by step.

First, let's write a simple function that calculates the EMI on the entire cost of the house, assuming that the loan must be paid back in one year, and there is no interest or down payment.

In [None]:
def loan_emi(amount):
    emi = amount / 12
    print('The EMI is ${}'.format(emi))

In [None]:
loan_emi(1260000)

### Local variables and scope

Let's add a second argument to account for the duration of the loan in months.

In [None]:
def loan_emi(amount, duration):
    emi = amount / duration
    print('The EMI is ${}'.format(emi))

Note that the variable `emi` defined inside the function is not accessible outside. The same is true for the parameters `amount` and `duration`. These are all *local variables* that lie within the *scope* of the function.

> **Scope**: Scope refers to the region within the code where a particular variable is visible. Every function (or class definition) defines a scope within Python. Variables defined in this scope are called *local variables*. Variables that are available everywhere are called *global variables*. Scope rules allow you to use the same variable names in different functions without sharing values from one to the other. 
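
Here's a short sketch of these scope rules. The function names `double` and `triple` and the global variable `unit` are hypothetical, made up for illustration: both functions use a local variable called `result` without interfering with each other, and both can read the global variable `unit`.

```python
unit = 'USD'  # global variable, visible inside functions too

def double(value):
    result = value * 2  # local to double
    return '{} {}'.format(result, unit)

def triple(value):
    result = value * 3  # a different local variable with the same name
    return '{} {}'.format(result, unit)

doubled = double(10)
tripled = triple(10)

print(doubled)  # 20 USD
print(tripled)  # 30 USD
```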

In [None]:
emi

In [None]:
amount

In [None]:
duration

In [None]:
loan_emi(1260000, 8*12)

In [None]:
loan_emi(1260000, 10*12)

### Return values

As you might expect, the EMI for the 8-year loan is higher compared to the 10-year loan. Right now, we're printing out the result. It would be better to return it and store the results in variables for easier comparison. We can do this using the `return` statement.

In [None]:
def loan_emi(amount, duration):
    emi = amount / duration
    return emi

In [None]:
emi1 = loan_emi(1260000, 8*12)

In [None]:
emi2 = loan_emi(1260000, 10*12)

In [None]:
emi1

In [None]:
emi2

### Optional arguments

Next, let's add another argument to account for the immediate down payment. We'll make this an *optional argument* with a default value of 0.

In [None]:
def loan_emi(amount, duration, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount / duration
    return emi

In [None]:
emi1 = loan_emi(1260000, 8*12, 3e5)
emi1
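
To see the default value in action, the sketch below calls the same function with and without the `down_payment` argument. The function is repeated so that the cell runs on its own, and the variable names are illustrative.

```python
def loan_emi(amount, duration, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount / duration
    return emi

# down_payment omitted: the default value 0 is used
emi_without_down_payment = loan_emi(1260000, 8*12)

# down_payment provided: the default value is overridden
emi_with_down_payment = loan_emi(1260000, 8*12, 3e5)

print(emi_without_down_payment)  # 13125.0
print(emi_with_down_payment)     # 10000.0
```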

Next, let's add the interest calculation into the function. Here's the formula used to calculate the EMI for a loan:

<img src="https://i.imgur.com/iKujHGK.png" style="width:240px">

where:

* `P` is the loan amount (principal)
* `n` is the no. of months
* `r` is the rate of interest per month

The derivation of this formula is beyond the scope of this tutorial. See this video for an explanation: https://youtu.be/Coxza9ugW4E .

In [None]:
def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    return emi

Note that while defining the function, required arguments like `amount`, `duration` and `rate` must appear before optional arguments like `down_payment`.

Let's calculate the EMI for Option 1.

In [None]:
loan_emi(1260000, 8*12, 0.1/12, 3e5)

While calculating the EMI for Option 2, we need not include the `down_payment` argument.

In [None]:
loan_emi(1260000, 10*12, 0.08/12)
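
One way to gain confidence in the formula without deriving it is to simulate the loan month by month: each month the outstanding balance grows by the monthly interest and shrinks by the EMI, and after the last installment it should be approximately zero. The helper `remaining_balance` below is a hypothetical function written for this check; it is a sketch under that simple repayment model, not part of the original tutorial.

```python
def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    return emi

def remaining_balance(principal, duration, rate, emi):
    # Hypothetical helper: apply interest, then subtract the EMI, every month
    balance = principal
    for _ in range(duration):
        balance = balance * (1 + rate) - emi
    return balance

emi = loan_emi(1260000, 10*12, 0.08/12)
balance = remaining_balance(1260000, 10*12, 0.08/12, emi)

print('EMI:', round(emi, 2))
print('Balance after the final installment is close to zero:', abs(balance) < 0.01)
```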

### Named arguments

Invoking a function with many arguments can often get confusing and is prone to human errors. Python provides the option of invoking functions with *named* arguments for better clarity. You can also split function invocation into multiple lines.

In [None]:
emi1 = loan_emi(
    amount=1260000, 
    duration=8*12, 
    rate=0.1/12, 
    down_payment=3e5
)
emi1

In [None]:
emi2 = loan_emi(amount=1260000, duration=10*12, rate=0.08/12)
emi2
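
With named arguments, the order in which the arguments are written no longer matters, and positional and named arguments can be mixed as long as the positional ones come first. The sketch below checks that all three calling styles give the same result. The function is repeated so that the cell runs on its own, and the variable names are illustrative.

```python
def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    return emi

# Positional arguments: the order matters
emi_positional = loan_emi(1260000, 8*12, 0.1/12, 3e5)

# Named arguments: the order does not matter
emi_named = loan_emi(rate=0.1/12, down_payment=3e5, amount=1260000, duration=8*12)

# Positional arguments first, followed by named arguments
emi_mixed = loan_emi(1260000, 8*12, rate=0.1/12, down_payment=3e5)

print(emi_positional == emi_named == emi_mixed)  # True
```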

### Modules and library functions

We can already see that the EMI for Option 1 is lower than the EMI for Option 2. However, it would be nice to round up the amount to full dollars, rather than showing digits after the decimal. To achieve this, we might want to write a function that can take a number and round it up to the next integer (e.g., 1.2 is rounded up to 2). That would be a great exercise to try out!

However, since rounding numbers is a fairly common operation, Python provides a function for it (along with thousands of other functions) as part of the [Python Standard Library](https://docs.python.org/3/library/). Functions are organized into *modules* that need to be imported to use the functions they contain. 

> **Modules**: Modules are files containing Python code (variables, functions, classes, etc.). They provide a way of organizing the code for large Python projects into files and folders. The key benefit of using modules is _namespaces_: you must import the module to use its functions within a Python script or notebook. Namespaces provide encapsulation and avoid naming conflicts between your code and a module or across modules.

We can use the `ceil` function (short for *ceiling*) from the `math` module to round up numbers. Let's import the module and use it to round up the number `1.2`. 

In [None]:
import math
help(math.ceil)

In [None]:
math.ceil(1.2)
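
For comparison, here's a quick sketch contrasting `math.ceil` with two related functions that are not used elsewhere in this tutorial: `math.floor`, which rounds down, and the built-in `round`, which rounds to the nearest integer.

```python
import math

print(math.ceil(1.2))   # 2: rounds up to the next integer
print(math.floor(1.8))  # 1: rounds down to the previous integer
print(round(1.2))       # 1: rounds to the nearest integer
print(round(1.8))       # 2
print(math.ceil(3.0))   # 3: whole numbers are left unchanged
```

Rounding up is the appropriate choice for an EMI, since rounding down would leave a small part of the loan unpaid.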

Let's now use the `math.ceil` function within the `loan_emi` function to round up the EMI amount.

> Using functions to build other functions is a great way to reuse code and implement complex business logic while still keeping the code small, understandable, and manageable. Ideally, a function should do one thing and one thing only. If you find yourself writing a function that does too many things, consider splitting it into multiple smaller, independent functions. As a rule of thumb, try to limit your functions to 10 lines of code or less. Good programmers always write short, simple, and readable functions.


In [None]:
def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    emi = math.ceil(emi)
    return emi

In [None]:
emi1 = loan_emi(
    amount=1260000, 
    duration=8*12, 
    rate=0.1/12, 
    down_payment=3e5
)

In [None]:
emi1

In [None]:
emi2 = loan_emi(amount=1260000, duration=10*12, rate=0.08/12)
emi2

Let's compare the EMIs and display a message for the option with the lower EMI.

In [None]:
if emi1 < emi2:
    print("Option 1 has the lower EMI: ${}".format(emi1))
else:
    print("Option 2 has the lower EMI: ${}".format(emi2))

### Reusing and improving functions 

Now we know for sure that "Option 1" has the lower EMI among the two options. But what's even better is that we now have a handy function `loan_emi` that we can use to solve many other similar problems with just a few lines of code. Let's try it with a couple more questions.

> **Q**: Shaun is currently paying back a home loan for a house he bought a few years ago. The cost of the house was `$800,000`. Shaun made a down payment of `25%` of the price. He financed the remaining amount using a 6-year loan with an interest rate of `7%` per annum (compounded monthly). Shaun is now buying a car worth `$60,000`, which he is planning to finance using a 1-year loan with an interest rate of `12%` per annum. Both loans are paid back in EMIs. What is the total monthly payment Shaun makes towards loan repayment?

This question is now straightforward to solve, using the `loan_emi` function we've already defined.

In [None]:
cost_of_house = 800000
home_loan_duration = 6*12 # months
home_loan_rate = 0.07/12 # monthly
home_down_payment = .25 * 800000

emi_house = loan_emi(amount=cost_of_house,
                     duration=home_loan_duration,
                     rate=home_loan_rate, 
                     down_payment=home_down_payment)

emi_house

In [None]:
cost_of_car = 60000
car_loan_duration = 1*12 # months
car_loan_rate = .12/12 # monthly

emi_car = loan_emi(amount=cost_of_car, 
                   duration=car_loan_duration, 
                   rate=car_loan_rate)

emi_car

In [None]:
print("Shaun makes a total monthly payment of ${} towards loan repayments.".format(emi_house+emi_car))

### Exceptions and `try`-`except`

> Q: If you borrow `$100,000` using a 10-year loan with an interest rate of 9% per annum, what is the total amount you end up paying as interest?

One way to solve this problem is to compare the EMIs for two loans: one with the given rate of interest and another with a 0% rate of interest. The total interest paid is then simply the sum of monthly differences over the duration of the loan.

In [None]:
emi_with_interest = loan_emi(amount=100000, duration=10*12, rate=0.09/12)
emi_with_interest

In [None]:
emi_without_interest = loan_emi(amount=100000, duration=10*12, rate=0./12)
emi_without_interest

Something seems to have gone wrong! If you look at the error message above carefully, Python tells us precisely what is wrong. Python *throws* a `ZeroDivisionError` with a message indicating that we're trying to divide a number by zero. `ZeroDivisionError` is an *exception* that stops further execution of the program.

> **Exception**: Even if a statement or expression is syntactically correct, it may cause an error when the Python interpreter tries to execute it. Errors detected during execution are called exceptions. Exceptions typically stop further execution of the program unless handled within the program using `try`-`except` statements.

Python provides many built-in exceptions *thrown* when built-in operators, functions, or methods are used incorrectly: https://docs.python.org/3/library/exceptions.html#built-in-exceptions. You can also define your own custom exceptions by extending the `Exception` class (more on that later).

You can use the `try` and `except` statements to *handle* an exception. Here's an example:

In [None]:
try:
    print("Now computing the result..")
    result = 5 / 0
    print("Computation was completed successfully")
except ZeroDivisionError:
    print("Failed to compute result because you were trying to divide by zero")
    result = None

print(result)

When an exception occurs inside a `try` block, the block's remaining statements are skipped. The `except` block is executed if the type of exception thrown matches that of the exception being handled. After executing the `except` block, the program execution returns to the normal flow.

You can also handle more than one type of exception using multiple `except` statements. Learn more about exceptions here: https://www.w3schools.com/python/python_try_except.asp .
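
Here's a minimal sketch of handling two types of exceptions with separate `except` blocks. The function `safe_divide` is a hypothetical example written for this illustration: it returns `None` when the division fails, either because the divisor is zero or because one of the arguments is not a number.

```python
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        print('Cannot divide by zero')
    except TypeError:
        print('Both arguments must be numbers')
    return None  # reached only if an exception was handled

print(safe_divide(10, 4))    # 2.5
print(safe_divide(10, 0))    # prints a message, then None
print(safe_divide('10', 4))  # prints a message, then None
```

Only the first `except` block whose exception type matches is executed; the others are skipped.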

Let's enhance the `loan_emi` function to use `try`-`except` to handle the scenario where the interest rate is 0%. It's common practice to make changes/enhancements to functions over time as new scenarios and use cases come up. It makes functions more robust & versatile.

In [None]:
def loan_emi(amount, duration, rate, down_payment=0):
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi

We can use the updated `loan_emi` function to solve our problem.

> **Q**: If you borrow `$100,000` using a 10-year loan with an interest rate of 9% per annum, what is the total amount you end up paying as interest?


In [None]:
emi_with_interest = loan_emi(amount=100000, duration=10*12, rate=0.09/12)
emi_with_interest

In [None]:
emi_without_interest = loan_emi(amount=100000, duration=10*12, rate=0)
emi_without_interest

In [None]:
total_interest = (emi_with_interest - emi_without_interest) * 10*12
print("The total interest paid is ${}.".format(total_interest))

### Documenting functions using Docstrings

We can add some documentation within our function using a *docstring*. A docstring is simply a string that appears as the first statement within the function body, and is used by the `help` function. A good docstring describes what the function does, and provides some explanation about the arguments.

In [None]:
def loan_emi(amount, duration, rate, down_payment=0):
    """Calculates the equal monthly installment (EMI) for a loan.
    
    Arguments:
        amount - Total amount to be spent (loan + down payment)
        duration - Duration of the loan (in months)
        rate - Rate of interest (monthly)
        down_payment (optional) - Optional initial payment (deducted from amount)
    """
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi

In the docstring above, we've provided some additional information that the `duration` and `rate` are measured in months. You might even consider naming the arguments `duration_months` and `rate_monthly`, to avoid any confusion whatsoever. Can you think of some other ways to improve the function?

In [None]:
help(loan_emi)
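
As a sketch of the renaming suggested above, the variant below uses the hypothetical names `loan_emi_v2`, `duration_months`, and `rate_monthly` (none of these appear in the original function). It also reads the docstring back through the function's `__doc__` attribute, which holds the same text that `help` displays.

```python
import math

def loan_emi_v2(amount, duration_months, rate_monthly, down_payment=0):
    """Calculates the equal monthly installment (EMI) for a loan.

    Arguments:
        amount - Total amount to be spent (loan + down payment)
        duration_months - Duration of the loan (in months)
        rate_monthly - Rate of interest per month (e.g., 0.09/12)
        down_payment (optional) - Initial payment (deducted from amount)
    """
    loan_amount = amount - down_payment
    try:
        growth = (1 + rate_monthly) ** duration_months
        emi = loan_amount * rate_monthly * growth / (growth - 1)
    except ZeroDivisionError:
        # A zero interest rate makes the denominator 0; split the loan evenly
        emi = loan_amount / duration_months
    return math.ceil(emi)

# The docstring is stored on the function object itself
print(loan_emi_v2.__doc__)
print(loan_emi_v2(amount=100000, duration_months=10*12, rate_monthly=0.09/12))
```

With unit-bearing names, a call like `loan_emi_v2(100000, duration_months=120, rate_monthly=0.0075)` documents itself, at the cost of slightly longer argument names.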

## Exercise - Data Analysis for Vacation Planning

You're planning a vacation, and you need to decide which city you want to visit. You have shortlisted four cities and identified the return flight cost, daily hotel cost, and weekly car rental cost. While renting a car, you need to pay for entire weeks, even if you return the car sooner.


| City | Return Flight (`$`) | Hotel per day (`$`) | Weekly Car Rental  (`$`) | 
|------|--------------------------|------------------|------------------------|
| Paris|       200                |       20         |          200           |
| London|      250                |       30         |          120           |
| Dubai|       370                |       15         |          80           |
| Mumbai|      450                |       10         |          70           |         


Answer the following questions using the data above:

1. If you're planning a 1-week long trip, which city should you visit to spend the least amount of money?
2. How does the answer to the previous question change if you change the trip's duration to four days, ten days or two weeks?
3. If your total budget for the trip is `$1000`, which city should you visit to maximize the duration of your trip? Which city should you visit if you want to minimize the duration?
4. How does the answer to the previous question change if your budget is `$600`, `$2000`, or `$1500`?

*Hint: To answer these questions, it will help to define a function `cost_of_trip` with relevant inputs like flight cost, hotel rate, car rental rate, and duration of the trip. You may find the `math.ceil` function useful for calculating the total cost of car rental.*

## Summary and Further Reading

With this, we complete our discussion of functions in Python. We've covered the following topics in this tutorial:

* Creating and using functions
* Functions with one or more arguments
* Local variables and scope
* Returning values using `return`
* Using default arguments to make a function flexible
* Using named arguments while invoking a function
* Importing modules and using library functions
* Reusing and improving functions to handle new use cases
* Handling exceptions with `try`-`except`
* Documenting functions using docstrings

This tutorial on functions in Python is by no means exhaustive. Here are a few more topics to learn about:

* Functions with an arbitrary number of arguments using (`*args` and `**kwargs`)
* Defining functions inside functions (and closures)
* A function that invokes itself (recursion)
* Functions that accept other functions as arguments or return other functions
* Functions that enhance other functions (decorators)
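
To give a first taste of these topics, here is a minimal sketch. All the names in it (`total`, `factorial`, `log_calls`, `square`) are made up for illustration, and each idea deserves a fuller treatment than this.

```python
import functools

def total(*args, **kwargs):
    """Adds any number of positional values, plus an optional keyword `start`."""
    return kwargs.get("start", 0) + sum(args)

def factorial(n):
    """Recursion: the function calls itself on a smaller input."""
    return 1 if n <= 1 else n * factorial(n - 1)

def log_calls(func):
    """Decorator: accepts a function and returns an enhanced version of it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # `wrapper` is defined inside `log_calls` and remembers `func` (a closure)
        result = func(*args, **kwargs)
        print("{}{} returned {}".format(func.__name__, args, result))
        return result
    return wrapper

@log_calls
def square(x):
    return x * x

print(total(1, 2, 3, start=10))
print(factorial(5))
square(4)
```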

Here are some resources to learn more about functions in Python:

* Python Tutorial at W3Schools: https://www.w3schools.com/python/
* Practical Python Programming: https://dabeaz-course.github.io/practical-python/Notes/Contents.html
* Python official documentation: https://docs.python.org/3/tutorial/index.html

You are ready to move on to the next tutorial: ["Reading from and writing to files using Python"](https://jovian.ml/aakashns/python-os-and-filesystem).

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is a function?
2. What are the benefits of using functions?
3. What are some built-in functions in Python?
4. How do you define a function in Python? Give an example.
5. What is the body of a function?
6. When are the statements in the body of a function executed?
7. What is meant by calling or invoking a function? Give an example.
8. What are function arguments? How are they useful?
9. How do you store the result of a function in a variable?
10. What is the purpose of the `return` keyword in Python?
11. Can you return multiple values from a function?
12. Can a `return` statement be used inside an `if` block or a `for` loop?
13. Can the `return` keyword be used outside a function?
14. What is scope in a programming language?
15. How do you define a variable inside a function?
16. What are local & global variables?
17. Can you access the variables defined inside a function outside its body? Why or why not?
18. What do you mean by the statement "a function defines a scope within Python"?
19. Do for and while loops define a scope, like functions?
20. Do if-else blocks define a scope, like functions?
21. What are optional function arguments & default values? Give an example.
22. Why should the required arguments appear before the optional arguments in a function definition?
23. How do you invoke a function with named arguments? Illustrate with an example.
24. Can you split a function invocation into multiple lines?
25. Write a function that takes a number and rounds it up to the nearest integer.
26. What are modules in Python?
27. What is a Python library?
28. What is the Python Standard Library?
29. Where can you learn about the modules and functions available in the Python standard library?
30. How do you install a third-party library?
31. What is a module namespace? How is it useful?
32. What problems would you run into if Python modules did not provide namespaces?
33. How do you import a module?
34. How do you use a function from an imported module? Illustrate with an example.
35. Can you invoke a function inside the body of another function? Give an example.
36. What is the single responsibility principle, and how does it apply while writing functions?
37. What are some characteristics of well-written functions?
38. Can you use if statements or while loops within a function? Illustrate with an example.
39. What are exceptions in Python? When do they occur?
40. How are exceptions different from syntax errors?
41. What are the different types of in-built exceptions in Python? Where can you learn about them?
42. How do you prevent the termination of a program due to an exception?
43. What is the purpose of the `try`-`except` statements in Python?
44. What is the syntax of the `try`-`except` statements? Give an example.
45. What happens if an exception occurs inside a `try` block?
46. How do you handle two different types of exceptions using `except`? Can you have multiple `except` blocks under a single `try` block?
47. How do you create an `except` block to handle any type of exception?
48. Illustrate the usage of `try`-`except` inside a function with an example.
49. What is a docstring? Why is it useful?
50. How do you display the docstring for a function?
51. What are `*args` and `**kwargs`? How are they useful? Give an example.
52. Can you define functions inside functions? 
53. What is function closure in Python? How is it useful? Give an example.
54. What is recursion? Illustrate with an example.
55. Can functions accept other functions as arguments? Illustrate with an example.
56. Can functions return other functions as results? Illustrate with an example.
57. What are decorators? How are they useful?
58. Implement a function decorator which prints the arguments and result of wrapped functions.
59. What are some in-built decorators in Python?
60. What are some popular Python libraries?

## Solution for Exercise

### Exercise - Data Analysis for Vacation Planning

You're planning a vacation, and you need to decide which city you want to visit. You have shortlisted four cities and identified the return flight cost, daily hotel cost, and weekly car rental cost. While renting a car, you need to pay for entire weeks, even if you return the car sooner.


| City | Return Flight (`$`) | Hotel per day (`$`) | Weekly Car Rental  (`$`) | 
|------|--------------------------|------------------|------------------------|
| Paris|       200                |       20         |          200           |
| London|      250                |       30         |          120           |
| Dubai|       370                |       15         |          80           |
| Mumbai|      450                |       10         |          70           |         


Answer the following questions using the data above:

1. If you're planning a 1-week long trip, which city should you visit to spend the least amount of money?
2. How does the answer to the previous question change if you change the trip's duration to four days, ten days or two weeks?
3. If your total budget for the trip is `$600`, which city should you visit to maximize the duration of your trip? Which city should you visit if you want to minimize the duration?
4. How does the answer to the previous question change if your budget is `$1000`, `$2000`, or `$1500`?

*Hint: To answer these questions, it will help to define a function `cost_of_trip` with relevant inputs like flight cost, hotel rate, car rental rate, and duration of the trip. You may find the `math.ceil` function useful for calculating the total cost of car rental.*

In [None]:
import math

In [None]:
Paris=[200,20,200,'Paris']
London = [250,30,120,'London']
Dubai = [370,15,80,'Dubai']
Mumbai = [450,10,70,'Mumbai']
Cities = [Paris,London,Dubai,Mumbai]

In [None]:
def cost_of_trip(flight,hotel_cost,car_rent,num_of_days=0):
    return flight+(hotel_cost*num_of_days)+(car_rent*math.ceil(num_of_days/7))

In [None]:
def days_to_visit(days):
    costs=[]
    for city in Cities:
        cost=cost_of_trip(city[0],city[1],city[2],days)
        costs.append((cost,city[3]))
    min_cost = min(costs)
    return min_cost

In [None]:
days_to_visit(7)

In [None]:
days_to_visit(4)

In [None]:
days_to_visit(10)

In [None]:
days_to_visit(14)

In [None]:
def given_budget(budget,less_days=False):
    days=1
    cost=0
    while cost<budget:
        #copy of city cost 
        cost_before=cost
        try:
            #copy of costs dictionary, if exists
            costs_before=costs.copy()
        except:
            #if costs dictionary doesn't exist, create an empty dictionary
            costs_before={}
        costs={}
        for city in Cities:
            cost = cost_of_trip(city[0],city[1],city[2],days)
            costs[cost] = city[3]
        if less_days:
            cost=max(list(costs.keys()))
            # The loop only stops once the cost reaches the budget. By then,
            # `costs` holds the costs for a trip that is one day too long, so
            # we return the city from the previous day's dictionary, `costs_before`.
            if cost>=budget:
                return costs_before[cost_before],days-1
        else:   
            cost=min(list(costs.keys()))
            if cost>=budget:
                return costs_before[cost_before],days-1
        days+=1

In [None]:
city_to_stay_maximum_days=given_budget(600)
print(city_to_stay_maximum_days)

In [None]:
city_to_stay_minimum_days=given_budget(600,less_days=True)
print(city_to_stay_minimum_days)

> 4. How does the answer to the previous question change if your budget is `$1000`, `$2000`, or `$1500`?

In [None]:
city_to_stay_maximum_days=given_budget(1000)
print(city_to_stay_maximum_days)

In [None]:
city_to_stay_minimum_days=given_budget(1000,less_days=True)
print(city_to_stay_minimum_days)

In [None]:
city_to_stay_maximum_days=given_budget(2000)
print(city_to_stay_maximum_days)

In [None]:
city_to_stay_maximum_days=given_budget(1500)
print(city_to_stay_maximum_days)

In [None]:
city_to_stay_minimum_days=given_budget(1500,less_days=True)
print(city_to_stay_minimum_days)
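
As an alternative way to look at the budget questions, the sketch below computes the longest affordable trip for every city separately, so you can compare all four at once. The names `CITY_COSTS`, `trip_cost`, and `max_days_within_budget` are hypothetical. Unlike `given_budget`, this version assumes that spending exactly the budget is allowed.

```python
import math

# (return flight, hotel per day, weekly car rental), taken from the table above
CITY_COSTS = {
    'Paris': (200, 20, 200),
    'London': (250, 30, 120),
    'Dubai': (370, 15, 80),
    'Mumbai': (450, 10, 70),
}

def trip_cost(flight, hotel_per_day, car_per_week, days):
    # Car rental is charged for whole weeks, hence math.ceil
    return flight + hotel_per_day * days + car_per_week * math.ceil(days / 7)

def max_days_within_budget(budget):
    """Returns {city: longest trip in days whose cost is at most `budget`}."""
    result = {}
    for city, (flight, hotel, car) in CITY_COSTS.items():
        days = 0
        # Keep extending the trip while one more day is still affordable
        while trip_cost(flight, hotel, car, days + 1) <= budget:
            days += 1
        result[city] = days
    return result

print(max_days_within_budget(1000))
```

Returning a dictionary for all cities also makes ties visible, which a single `(city, days)` result would hide.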

# Numerical Computing with Python and Numpy

![](https://i.imgur.com/mg8O3kd.png)

This tutorial series is a beginner-friendly introduction to programming and data analysis using the Python programming language. These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. 

This tutorial covers the following topics:

- Working with numerical data in Python
- Going from Python lists to Numpy arrays
- Multi-dimensional Numpy arrays and their benefits
- Array operations, broadcasting, indexing, and slicing
- Working with CSV data files using Numpy

## Working with numerical data

The "data" in *Data Analysis* typically refers to numerical data, e.g., stock prices, sales figures, sensor measurements, sports scores, database tables, etc. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in  millimeters) & average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [7]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [None]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

We can now substitute these variables into the linear equation to predict the yield of apples.

In [None]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

In [None]:
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [None]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively. 

We can also represent the set of weights used in the formula as a vector.

In [None]:
weights = [w1, w2, w3]

In [None]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [None]:
crop_yield(kanto, weights)

In [None]:
crop_yield(johto, weights)

In [None]:
crop_yield(unova, weights)

## Going from Python lists to Numpy arrays


The calculation performed by the `crop_yield` function (element-wise multiplication of two vectors and taking a sum of the results) is also called the *dot product*. Learn more about the dot product here: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length . 

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

Let's install the Numpy library using the `pip` package manager.

In [None]:
!pip install numpy --upgrade --quiet

Next, let's import the `numpy` module. It's common practice to import numpy with the alias `np`.

In [None]:
import numpy as np

We can now use the `np.array` function to create Numpy arrays.

In [None]:
kanto = np.array([73, 67, 43])
kanto

In [None]:
weights = np.array([w1, w2, w3])
weights

Numpy arrays have the type `ndarray`.

In [None]:
type(kanto)

In [None]:
type(weights)

Just like lists, Numpy arrays support the indexing notation `[]`.
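
Here is a short sketch of indexing, using the same values as the `kanto` and `weights` arrays defined earlier. The arrays are recreated so that the snippet runs on its own.

```python
import numpy as np

kanto = np.array([73, 67, 43])
weights = np.array([0.3, 0.2, 0.5])

print(kanto[2])     # third element (humidity)
print(weights[0])   # first weight
print(kanto[-1])    # negative indices count from the end, as with lists
```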

## Operating on Numpy arrays

We can now compute the dot product of the two vectors using the `np.dot` function.

In [None]:
np.dot(kanto, weights)

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.

In [None]:
(kanto * weights).sum()

The `*` operator performs an element-wise multiplication of two arrays if they have the same size. The `sum` method calculates the sum of numbers in an array.

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [None]:
arr1 * arr2

In [None]:
arr2.sum()

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in optimized, compiled C code, which makes them much faster than using Python statements & loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [None]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [None]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

In [None]:
%%time
np.dot(arr1_np, arr2_np)

As you can see, using `np.dot` is roughly 100 times faster than using a `for` loop, although the exact speedup depends on your machine and Numpy version. This makes Numpy especially useful while working with really large datasets with tens of thousands or millions of data points.
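
The `%%time` magic only works inside Jupyter. If you'd like to repeat the comparison in a plain Python script, one possible sketch uses `time.perf_counter` from the standard library. The variable names below are our own, and the timings you see will differ from run to run.

```python
import time
import numpy as np

n = 1_000_000
list1 = list(range(n))
list2 = list(range(n, 2 * n))
# int64 is requested explicitly so the large sum does not overflow
# on platforms where the default integer type is 32-bit
arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)

start = time.perf_counter()
loop_result = 0
for x1, x2 in zip(list1, list2):
    loop_result += x1 * x2
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = np.dot(arr1_np, arr2_np)
numpy_seconds = time.perf_counter() - start

print("loop:  {:.4f} s".format(loop_seconds))
print("numpy: {:.4f} s".format(numpy_seconds))
print("same result:", loop_result == numpy_result)
```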

## Multi-dimensional Numpy arrays 

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [None]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
climate_data

If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.

<img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" width="420">


In [None]:
# 2D array (matrix)
climate_data.shape

In [None]:
weights

In [None]:
# 1D array (vector)
weights.shape

In [None]:
# 3D array 
arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])

In [None]:
arr3.shape

All the elements in a numpy array have the same data type. You can check the data type of an array using the `.dtype` property.

In [None]:
weights.dtype

In [None]:
climate_data.dtype

If an array contains even a single floating point number, all the other elements are also converted to floats.

In [None]:
arr3.dtype
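
The following sketch shows this conversion on two small arrays (`ints` and `mixed` are illustrative names), along with the `astype` method for converting an array to another data type explicitly.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])   # one float is enough to change the dtype

print(ints.dtype)    # an integer type; the exact width depends on the platform
print(mixed.dtype)
print(mixed)         # the integers 1 and 2 are stored as floats

# astype returns a converted copy; the original array is unchanged
as_float = ints.astype(np.float64)
print(as_float.dtype)
```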

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between `climate_data` (a 5x3 matrix) and `weights` (a vector of length 3). Here's what it looks like visually:

<img src="https://i.imgur.com/LJ2WKSI.png" width="240">

You can learn about matrices and matrix multiplication by watching the first 3-4 videos of this playlist: https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0&index=1 .

We can use the `np.matmul` function or the `@` operator to perform matrix multiplication.

In [None]:
np.matmul(climate_data, weights)

In [None]:
climate_data @ weights
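
To see what the matrix multiplication is doing, the sketch below compares it with computing one dot product per row. The arrays repeat the values used above so that the snippet runs on its own; `row_by_row` and `all_at_once` are illustrative names.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row (region), computed one at a time
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same five numbers from a single matrix-vector product
all_at_once = climate_data @ weights

print(all_at_once.shape)                     # one predicted yield per region
print(np.allclose(row_by_row, all_at_once))
```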

## Working with CSV data files

Numpy also provides helper functions for reading from & writing to files. Let's download a file `climate.txt`, which contains 10,000 climate measurements (temperature, rainfall & humidity) in the following format:


```
temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
84.00,63.00,38.00
66.00,50.00,52.00
41.00,94.00,77.00
91.00,57.00,96.00
49.00,96.00,99.00
67.00,20.00,28.00
...
```

This format of storing data is known as *comma-separated values* or CSV. 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)


To read this file into a numpy array, we can use the `genfromtxt` function.

In [8]:
import urllib.request

urllib.request.urlretrieve(
    'https://gist.github.com/BirajCoder/a4ffcb76fd6fb221d76ac2ee2b8584e9/raw/4054f90adfd361b7aa4255e99c2e874664094cea/climate.csv', 
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x107352ad0>)

In [None]:
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)
climate_data

In [None]:
climate_data.shape

We can now perform a matrix multiplication using the `@` operator to predict the yield of apples for the entire dataset using a given set of weights.

In [None]:
weights = np.array([0.3, 0.2, 0.5])

In [None]:
yields = climate_data @ weights
yields

In [None]:
yields.shape

Let's add the `yields` to `climate_data` as a fourth column using the [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function.

In [None]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)
climate_results

There are a couple of subtleties here:

* Since we wish to add new columns, we pass the argument `axis=1` to `np.concatenate`. The `axis` argument specifies the dimension for concatenation.

*  The arrays should have the same number of dimensions, and the same length along every dimension except the one used for concatenation. We use the [`np.reshape`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) function to change the shape of `yields` from `(10000,)` to `(10000,1)`.

Here's a visual explanation of `np.concatenate` along `axis=1` (can you guess what `axis=0` results in?):

<img src="https://www.w3resource.com/w3r_images/python-numpy-image-exercise-58.png" width="300">

The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments & return values. Use the cells below to experiment with `np.concatenate` and `np.reshape`.
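
As a starting point for that experimentation, here is a small sketch on arrays that are easy to inspect by eye. The names `a`, `col`, and `row` are made up for this example. Passing `-1` to `reshape` asks Numpy to infer that dimension, which avoids hard-coding a row count.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
col = np.array([10, 20])           # shape (2,)
row = np.array([[7, 8, 9]])        # shape (1, 3)

# -1 lets Numpy work out the number of rows from the data
col_2d = col.reshape(-1, 1)        # shape (2, 1)

wider = np.concatenate((a, col_2d), axis=1)   # adds a column
taller = np.concatenate((a, row), axis=0)     # adds a row

print(wider.shape, taller.shape)
print(wider)
print(taller)
```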

Let's write the final results from our computation above back to a file using the `np.savetxt` function.

In [None]:
climate_results

In [None]:
np.savetxt('climate_results.txt', 
           climate_results, 
           fmt='%.2f', 
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples', 
           comments='')

The results are written back in the CSV format to the file `climate_results.txt`. 

```
temperature,rainfall,humidity,yield_apples
25.00,76.00,99.00,72.20
39.00,65.00,70.00,59.70
59.00,45.00,77.00,65.20
84.00,63.00,38.00,56.80
...
```
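
A quick way to check a file written by `np.savetxt` is to read it back with `np.genfromtxt`. The sketch below does this for two sample rows in a temporary directory, so it doesn't touch your real files. The file name `sample_results.txt` is an arbitrary choice.

```python
import os
import tempfile
import numpy as np

results = np.array([[25.0, 76.0, 99.0, 72.2],
                    [39.0, 65.0, 70.0, 59.7]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_results.txt')
    np.savetxt(path, results, fmt='%.2f', delimiter=',',
               header='temperature,rainfall,humidity,yield_apples', comments='')
    # skip_header=1 ignores the line containing the column names
    loaded = np.genfromtxt(path, delimiter=',', skip_header=1)

print(loaded.shape)
print(np.allclose(loaded, results))
```

Keep in mind that `fmt='%.2f'` rounds values to two decimal places, so data with more precision will not round-trip exactly.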


Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:


* Mathematics: `np.sum`, `np.exp`, `np.round`, arithmetic operators 
* Array manipulation: `np.reshape`, `np.stack`, `np.concatenate`, `np.split`
* Linear Algebra: `np.matmul`, `np.dot`, `np.transpose`, `np.linalg.eigvals`
* Statistics: `np.mean`, `np.median`, `np.std`, `np.max`

> **How to find the function you need?** The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to [this tutorial on array concatenation](https://cmdlinetips.com/2018/04/how-to-concatenate-arrays-in-numpy/). 

You can find a full list of array functions here: https://numpy.org/doc/stable/reference/routines.html
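
Here is a brief sketch that tries out a few of the functions listed above on a small array. The array `data` reuses three rows of the climate data from earlier. Many of these functions accept an `axis` argument to operate along rows or columns.

```python
import numpy as np

data = np.array([[73, 67, 43],
                 [91, 88, 64],
                 [87, 134, 58]])

print(np.mean(data, axis=0))        # column-wise mean (one value per column)
print(np.max(data))                 # largest element overall
print(np.round(np.std(data), 2))    # standard deviation, rounded to 2 places
print(np.transpose(data).shape)     # rows and columns swapped

# Eigenvalue routines live in the np.linalg submodule
print(np.linalg.eigvals(np.array([[2.0, 0.0],
                                  [0.0, 3.0]])))
```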


## Arithmetic operations, broadcasting and comparison

Numpy arrays support arithmetic operators like `+`, `-`, `*`, etc. You can perform an arithmetic operation with a single number (also called scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.

In [None]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [None]:
arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

In [None]:
# Adding a scalar
arr2 + 3

In [None]:
# Element-wise subtraction
arr3 - arr2

In [None]:
# Division by scalar
arr2 / 2

In [None]:
# Element-wise multiplication
arr2 * arr3

In [None]:
# Modulus with scalar
arr2 % 4

### Array Broadcasting

Numpy arrays also support *broadcasting*, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

In [None]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [None]:
arr2.shape

In [None]:
arr4 = np.array([4, 5, 6, 7])

In [None]:
arr4.shape

In [None]:
arr2 + arr4

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs the replication without actually creating three copies of the smaller array, thus improving performance and using less memory.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" width="360">

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

In [None]:
arr5 = np.array([7, 8])

In [None]:
arr5.shape

In [None]:
arr2 + arr5
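
The expression `arr2 + arr5` fails because Numpy compares shapes from the last dimension backwards, and two lengths are compatible only if they are equal or one of them is 1. Here, 4 and 2 are neither. The sketch below catches the error and then shows a shape that does broadcast. The names `col` and `broadcast_failed` are made up for this example.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])          # shape (3, 4)
arr5 = np.array([7, 8])                  # shape (2,)

broadcast_failed = False
try:
    arr2 + arr5
except ValueError as e:
    # Trailing dimensions 4 and 2 differ, and neither is 1
    broadcast_failed = True
    print("Cannot broadcast:", e)

# A column of shape (3, 1) is compatible: it is stretched across the 4 columns
col = np.array([[10], [20], [30]])
print(arr2 + col)
```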

### Array Comparison

Numpy arrays also support comparison operations like `==`, `!=`, `>` etc. The result is an array of booleans.

In [None]:
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

In [None]:
arr1 == arr2

In [None]:
arr1 != arr2

In [None]:
arr1 >= arr2

In [None]:
arr1 <= arr2

In [None]:
arr1 < arr2

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [None]:
(arr1 == arr2).sum()

## Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [None]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [None]:
arr3.shape

In [None]:
# Single element
arr3[1, 1, 2]

In [None]:
# Subarray using ranges
arr3[1:, 0:1, :2]

In [None]:
# Mixing indices and ranges
arr3[1:, 1, 3]

In [None]:
# Mixing indices and ranges
arr3[1:, 1, :3]

In [None]:
# Using fewer indices
arr3[1]

In [None]:
# Using fewer indices
arr3[:2, 1]

In [None]:
# Using too many indices
arr3[1,3,2,1]

The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. Use the cells below to try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:

<img src="https://scipy-lectures.org/_images/numpy_indexing.png" width="360">
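
One rule of thumb that may help: each plain index removes a dimension from the result, while each range keeps it. The sketch below checks this by printing the shapes of the selections used above, and it catches the error raised by using too many indices.

```python
import numpy as np

arr3 = np.array([
    [[11, 12, 13, 14],
     [13, 14, 15, 19]],
    [[15, 16, 17, 21],
     [63, 92, 36, 18]],
    [[98, 32, 81, 23],
     [17, 18, 19.5, 43]]])          # shape (3, 2, 4)

print(arr3[1, 1, 2])                # three indices: a single element
print(arr3[1:, 0:1, :2].shape)      # three ranges: still 3 dimensions
print(arr3[1:, 1, 3].shape)         # one range, two indices: 1 dimension
print(arr3[:2, 1].shape)            # missing trailing indices select everything

try:
    arr3[1, 3, 2, 1]                # four indices for a 3-dimensional array
except IndexError as e:
    print("IndexError:", e)
```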


## Other ways of creating Numpy arrays

Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the [official documentation](https://numpy.org/doc/stable/reference/routines.array-creation.html) or use the `help` function to learn more.

In [None]:
# All zeros
np.zeros((3, 2))

In [None]:
# All ones
np.ones([2, 2, 3])

In [None]:
# Identity matrix
np.eye(3)

In [None]:
# Random vector
np.random.rand(5)

In [None]:
# Random matrix
np.random.randn(2, 3) # rand vs. randn - what's the difference?

In [None]:
# Fixed value
np.full([2, 3], 42)

In [None]:
# Range with start, end and step
np.arange(10, 90, 3)

In [None]:
# Equally spaced numbers in a range
np.linspace(3, 27, 9)
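
Two pairs of functions above are easy to mix up. The sketch below contrasts them: `np.arange` takes a step size and excludes the end value, while `np.linspace` takes a number of points and includes both ends. Similarly, `np.random.rand` draws from a uniform distribution over `[0, 1)`, while `np.random.randn` draws from a standard normal distribution. The variable names are our own.

```python
import numpy as np

# arange: you choose the step; the end value (90) is excluded
print(np.arange(10, 90, 3))

# linspace: you choose how many points; both ends (3 and 27) are included
print(np.linspace(3, 27, 9))

uniform = np.random.rand(1000)    # uniform samples from the interval [0, 1)
normal = np.random.randn(1000)    # normal samples with mean 0 and std 1
print(uniform.min() >= 0, uniform.max() < 1)
print((normal < 0).any())         # randn can produce negative values
```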

## Exercises

Try the following exercises to become familiar with Numpy arrays and practice your skills:

- Assignment on Numpy array functions: https://jovian.ai/aakashns/numpy-array-operations
- (Optional) 100 numpy exercises: https://jovian.ai/aakashns/100-numpy-exercises

## Summary and Further Reading

With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this tutorial:

- Going from Python lists to Numpy arrays
- Operating on Numpy arrays
- Benefits of using Numpy arrays over lists
- Multi-dimensional Numpy arrays
- Working with CSV data files
- Arithmetic operations and broadcasting
- Array indexing and slicing
- Other ways of creating Numpy arrays


Check out the following resources for learning more about Numpy:

- Official tutorial: https://numpy.org/devdocs/user/quickstart.html
- Numpy tutorial on W3Schools: https://www.w3schools.com/python/numpy_intro.asp
- Advanced Numpy (exploring the internals): http://scipy-lectures.org/advanced/advanced_numpy/index.html

You are ready to move on to the next tutorial: [Analyzing Tabular Data using Pandas](https://jovian.ai/aakashns/python-pandas-data-analysis).

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is a vector?
2. How do you represent vectors using a Python list? Give an example.
3. What is a dot product of two vectors?
4. Write a function to compute the dot product of two vectors.
5. What is Numpy?
6. How do you install Numpy?
7. How do you import the `numpy` module?
8. What does it mean to import a module with an alias? Give an example.
9. What is the commonly used alias for `numpy`?
10. What is a Numpy array?
11. How do you create a Numpy array? Give an example.
12. What is the type of Numpy arrays?
13. How do you access the elements of a Numpy array?
14. How do you compute the dot product of two vectors using Numpy?
15. What happens if you try to compute the dot product of two vectors which have different sizes?
16. How do you compute the element-wise product of two Numpy arrays?
17. How do you compute the sum of all the elements in a Numpy array?
18. What are the benefits of using Numpy arrays over Python lists for operating on numerical data?
19. Why do Numpy array operations have better performance compared to Python functions and loops?
20. Illustrate the performance difference between Numpy array operations and Python loops using an example.
21. What are multi-dimensional Numpy arrays? 
22. Illustrate the creation of Numpy arrays with 2, 3, and 4 dimensions.
23. How do you inspect the number of dimensions and the length along each dimension in a Numpy array?
24. Can the elements of a Numpy array have different data types?
25. How do you check the data type of the elements of a Numpy array?
26. What is the data type of a Numpy array?
27. What is the difference between a matrix and a 2D Numpy array?
28. How do you perform matrix multiplication using Numpy?
29. What is the `@` operator used for in Numpy?
30. What is the CSV file format?
31. How do you read data from a CSV file using Numpy?
32. How do you concatenate two Numpy arrays?
33. What is the purpose of the `axis` argument of `np.concatenate`?
34. When are two Numpy arrays compatible for concatenation?
35. Give an example of two Numpy arrays that can be concatenated.
36. Give an example of two Numpy arrays that cannot be concatenated.
37. What is the purpose of the `np.reshape` function?
38. What does it mean to “reshape” a Numpy array?
39. How do you write a numpy array into a CSV file?
40. Give some examples of Numpy functions for performing mathematical operations.
41. Give some examples of Numpy functions for performing array manipulation.
42. Give some examples of Numpy functions for performing linear algebra.
43. Give some examples of Numpy functions for performing statistical operations.
44. How do you find the right Numpy function for a specific operation or use case?
45. Where can you see a list of all the Numpy array functions and operations?
46. What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.
47. What is array broadcasting? How is it useful? Illustrate with an example.
48. Give some examples of arrays that are compatible for broadcasting.
49. Give some examples of arrays that are not compatible for broadcasting.
50. What are the comparison operators supported by Numpy arrays? Illustrate with examples.
51. How do you access a specific subarray or slice from a Numpy array?
52. Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.
53. How do you create a Numpy array with a given shape containing all zeros?
54. How do you create a Numpy array with a given shape containing all ones?
55. How do you create an identity matrix of a given shape?
56. How do you create a random vector of a given length?
57. How do you create a Numpy array with a given shape with a fixed value for each element?
58. How do you create a Numpy array with a given shape containing randomly initialized elements?
59. What is the difference between `np.random.rand` and `np.random.randn`? Illustrate with examples.
60. What is the difference between `np.arange` and `np.linspace`? Illustrate with examples.

# Analyzing Tabular Data using Python and Pandas

![](https://i.imgur.com/zfxLzEv.png)


This tutorial series is a beginner-friendly introduction to programming and data analysis using the Python programming language. These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. 


## Reading a CSV file using Pandas

[Pandas](https://pandas.pydata.org/) is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more. Let's download a file `italy-covid-daywise.csv` which contains day-wise Covid-19 data for Italy in the following format:

```
date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
2020-04-24,2646.0,464.0,95273.0
2020-04-25,3021.0,420.0,38676.0
2020-04-26,2357.0,415.0,24113.0
2020-04-27,2324.0,260.0,26678.0
2020-04-28,1739.0,333.0,37554.0
...
```

This format of storing data is known as *comma-separated values* or CSV. 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)
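
To make the definition concrete, here is a minimal sketch that splits a few lines of CSV text into records and fields using Python's built-in `csv` module. The variable names (`csv_text`, `header`, `records`) are just illustrative, and the rows are copied from the sample shown above.

```python
import csv
import io

# A few lines in the same format as the sample shown above
csv_text = """date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
"""

# io.StringIO lets csv.reader treat the string like an open file
rows = list(csv.reader(io.StringIO(csv_text)))

# The first line holds the field names; every other line is one data record
header, records = rows[0], rows[1:]

print(header)
print(records[0])
```

Note that every field is read as a string here. Converting the fields to numbers is one of the things Pandas takes care of for us.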


We'll download this file using the `urlretrieve` function from the `urllib.request` module.

In [None]:
from urllib.request import urlretrieve

In [None]:
italy_covid_url = 'https://gist.githubusercontent.com/aakashns/f6a004fa20c84fec53262f9a8bfee775/raw/f309558b1cf5103424cef58e2ecb8704dcd4d74c/italy-covid-daywise.csv'

urlretrieve(italy_covid_url, 'italy-covid-daywise.csv')

To read the file, we can use the `read_csv` method from Pandas. First, let's install the Pandas library.

In [None]:
!pip install pandas --upgrade --quiet

We can now import the `pandas` module. As a convention, it is imported with the alias `pd`.

In [None]:
import pandas as pd

In [None]:
covid_df = pd.read_csv('italy-covid-daywise.csv')

Data from the file is read and stored in a `DataFrame` object - one of the core data structures in Pandas for storing and working with tabular data. We typically use the `_df` suffix in the variable names for dataframes.

In [None]:
type(covid_df)

In [None]:
covid_df

Here's what we can tell by looking at the dataframe:

- The file provides three day-wise counts for COVID-19 in Italy
- The metrics reported are new cases, deaths, and tests
- Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020

Keep in mind that these are officially reported numbers. The actual number of cases & deaths may be higher, as not all cases are diagnosed. 

We can view some basic information about the data frame using the `.info` method.

In [None]:
covid_df.info()

It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the `.describe` method.

In [None]:
covid_df.describe()

The `columns` property contains the list of columns within the data frame.

In [None]:
covid_df.columns

You can also retrieve the number of rows and columns in the data frame using the `.shape` property.

In [None]:
covid_df.shape

Here's a summary of the functions & methods we've looked at so far:

* `pd.read_csv` - Read data from a CSV file into a Pandas `DataFrame` object
* `.info()` - View basic information about rows, columns & data types
* `.describe()` - View statistical information about numeric columns
* `.columns` - Get the list of column names
* `.shape` - Get the number of rows & columns as a tuple
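
If you'd like to try these without downloading anything, the sketch below runs the same calls on a tiny data frame built from an inline string. The name `mini_df` and the three rows are only for illustration (the last `new_tests` field is left empty on purpose).

```python
import io
import pandas as pd

csv_text = """date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,
"""

# read_csv accepts file-like objects as well as file names
mini_df = pd.read_csv(io.StringIO(csv_text))

mini_df.info()                    # column names, non-null counts & data types
print(mini_df.describe())         # statistics for the numeric columns only
print(mini_df.columns.tolist())   # column names as a plain Python list
print(mini_df.shape)              # (rows, columns)
```

The empty field becomes a missing value, so `new_tests` reports one fewer non-null entry than the other columns.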

## Retrieving data from a data frame

The first thing you might want to do is retrieve data from this data frame, e.g., the counts of a specific day or the list of values in a particular column. To do this, it might help to understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns. 

In [10]:
# Pandas format is similar to this
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}

Representing data in the above format has a few benefits:

* All values in a column typically have the same type, so it's more efficient to store them in a single array.
* Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.
* The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).

In [None]:
# Pandas format is not similar to this
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]
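
Although Pandas stores data column by column, `pd.DataFrame` accepts either layout as input. Here's a small sketch (with shortened, illustrative versions of the two structures above) showing that both produce the same table, and that a missing key or a `None` ends up as a missing value.

```python
import pandas as pd

# Dictionary of lists: one list per column
data_dict = {
    'date': ['2020-09-01', '2020-09-02'],
    'new_cases': [996, 975],
    'new_tests': [54395, None],
}

# List of dictionaries: one dictionary per row
data_list = [
    {'date': '2020-09-01', 'new_cases': 996, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975},   # 'new_tests' key is missing
]

df_from_dict = pd.DataFrame(data_dict)
df_from_list = pd.DataFrame(data_list)

print(df_from_dict)
print(df_from_list)
```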

With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the `[]` indexing notation.

In [None]:
covid_data_dict['new_cases']

In [None]:
covid_df['new_cases']

Each column is represented using a data structure called `Series`, which is essentially a numpy array with some extra methods and properties.

In [None]:
type(covid_df['new_cases'])

In [None]:
covid_df['new_cases'][246]

In [None]:
covid_df['new_tests'][240]

In [None]:
covid_df.at[246, 'new_cases']

In [None]:
covid_df.at[240, 'new_tests']
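
The cells above show two ways of reaching a single value: indexing into a column's `Series` with `[]`, or asking the data frame for a row & column directly with `.at`. The sketch below repeats both on a small made-up data frame (`toy_df` is a hypothetical name) so that you can see they agree.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'new_cases': [1444, 1365, 996],
    'new_tests': [53541.0, 42583.0, None],
})

cases = toy_df['new_cases']         # a Series holding one column
print(cases[1])                     # value at index label 1
print(toy_df.at[1, 'new_cases'])    # same value, looked up in a single step
print(toy_df.at[2, 'new_tests'])    # a missing value is returned as nan
```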

Instead of using the indexing notation `[]`, Pandas also allows accessing columns as properties of the dataframe using the `.` notation. However, this method only works for columns whose names do not contain spaces or special characters.

In [None]:
covid_df.new_cases

Further, you can also pass a list of columns within the indexing notation `[]` to access a subset of the data frame with just the given columns.

In [None]:
cases_df = covid_df[['date', 'new_cases']]
cases_df

The new data frame `cases_df` contains just the `date` and `new_cases` columns of `covid_df`. Depending on the operation and the Pandas version, a data frame created from another one may be a "view" that points to the same data in the computer's memory, or it may hold its own copy of the data. Selecting a list of columns, as we did above, generally gives you a copy, so don't rely on changes made to `cases_df` showing up in `covid_df` (or the other way around). Where Pandas can share data between data frames, it avoids the overhead of copying thousands or millions of rows every time you create a new data frame by operating on an existing one.

Sometimes you might need a full copy of the data frame, in which case you can use the `copy` method.

In [None]:
covid_df_copy = covid_df.copy()

The data within `covid_df_copy` is completely separate from `covid_df`, and changing values inside one of them will not affect the other.
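
Here's a quick way to convince yourself of this, using a small made-up data frame (the names `original_df` and `copy_df` are only for illustration).

```python
import pandas as pd

original_df = pd.DataFrame({'new_cases': [10, 20, 30]})
copy_df = original_df.copy()

# Change a value in the copy only
copy_df.at[0, 'new_cases'] = 999

print(original_df.at[0, 'new_cases'])   # 10
print(copy_df.at[0, 'new_cases'])       # 999
```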


To access a specific row of data, Pandas provides the `.loc` method.

In [None]:
covid_df

In [None]:
covid_df.loc[243]

In [None]:
type(covid_df.loc[243])

We can use the `.head` and `.tail` methods to view the first or last few rows of data.

In [None]:
covid_df.head(5)

In [None]:
covid_df.tail(4)

Notice above that while the first few values in the `new_cases` and `new_deaths` columns are `0`, the corresponding values within the `new_tests` column are `NaN`. That is because the CSV file does not contain any data for the `new_tests` column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.

In [None]:
covid_df.at[0, 'new_tests']

In [None]:
type(covid_df.at[0, 'new_tests'])

The distinction between `0` and `NaN` is subtle but important. In this dataset, `NaN` indicates that daily test numbers were not reported on specific dates. Italy started reporting daily tests on Apr 19, 2020. By then, 935,310 tests had already been conducted.

We can find the first index that doesn't contain a `NaN` value using a column's `first_valid_index` method.

In [None]:
covid_df.new_tests.first_valid_index()

Let's look at a few rows before and after this index to verify that the values change from `NaN` to actual numbers. We can do this by passing a range to `loc`.

In [None]:
covid_df.loc[108:113]
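
The same two steps can be tried on a small made-up data frame. The sketch below (where `toy_df` and its values are hypothetical) finds the first row with a test count and then looks at the rows around it. One detail worth noticing: unlike list slicing in Python, a range passed to `.loc` includes both of its end points.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'new_cases': [0, 0, 5, 8, 13],
    'new_tests': [np.nan, np.nan, np.nan, 100.0, 250.0],
})

first = toy_df.new_tests.first_valid_index()
print(first)   # 3

# One row before and one row after the first valid index (both ends included)
window = toy_df.loc[first - 1:first + 1]
print(window)
```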

We can use the `.sample` method to retrieve a random sample of rows from the data frame.

In [None]:
covid_df.sample(10)

Notice that even though we have taken a random sample, each row's original index is preserved - this is a useful property of data frames.
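
As a small illustration, the sketch below samples three rows from a made-up data frame and checks each one against the original using its index label. The `random_state` argument is optional; it is passed here only so that the same rows are picked on every run.

```python
import pandas as pd

toy_df = pd.DataFrame({'new_cases': range(100, 110)})

# random_state makes the random choice repeatable
picked = toy_df.sample(3, random_state=42)
print(picked)

# Each sampled row can be traced back to the original using its index label
for label in picked.index:
    print(label, picked.at[label, 'new_cases'] == toy_df.at[label, 'new_cases'])
```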


Here's a summary of the functions & methods we looked at in this section:

- `covid_df['new_cases']` - Retrieving columns as a `Series` using the column name
- `new_cases[243]` - Retrieving values from a `Series` using an index
- `covid_df.at[243, 'new_cases']` - Retrieving a single value from a data frame
- `covid_df.copy()` - Creating a deep copy of a data frame
- `covid_df.loc[243]` - Retrieving a row or range of rows of data from the data frame
- `head`, `tail`, and `sample` - Retrieving multiple rows of data from the data frame
- `covid_df.new_tests.first_valid_index` - Finding the first non-empty index in a series


## Analyzing data from data frames

Let's try to answer some questions about our data.

**Q: What are the total numbers of reported cases and deaths related to Covid-19 in Italy?**

Similar to Numpy arrays, a Pandas series supports the `sum` method to answer these questions.

In [None]:
total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

In [None]:
print('The number of reported cases is {} and the number of reported deaths is {}.'.format(int(total_cases), int(total_deaths)))

**Q: What is the overall death rate (ratio of reported deaths to reported cases)?**

In [None]:
death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

print("The overall reported death rate in Italy is {:.2f} %.".format(death_rate*100))

**Q: What is the overall number of tests conducted? A total of 935310 tests were conducted before daily test numbers were reported.**

In [None]:
initial_tests = 935310
total_tests = initial_tests + covid_df.new_tests.sum()


In [None]:
total_tests

**Q: What fraction of tests returned a positive result?**

In [None]:
positive_rate = total_cases / total_tests

In [None]:
print('{:.2f}% of tests in Italy led to a positive diagnosis.'.format(positive_rate*100))
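
It's worth checking this kind of arithmetic on numbers small enough to verify by hand. The sketch below uses a made-up data frame and a hypothetical `tests_before` count; it also shows that `sum` skips `NaN` values by default, which is why the missing test counts don't turn the total into `NaN`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'new_cases': [10, 20, 30, 40],
    'new_tests': [np.nan, np.nan, 500.0, 500.0],
})

# Hypothetical number of tests conducted before daily reporting began
tests_before = 1000

total_cases = toy_df.new_cases.sum()                  # 10 + 20 + 30 + 40 = 100
total_tests = tests_before + toy_df.new_tests.sum()   # NaN is skipped: 1000 + 1000
positive_rate = total_cases / total_tests

print('{:.2f}% of tests were positive.'.format(positive_rate * 100))   # 5.00%
```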

## Querying and sorting rows

Let's say we only want to look at the days which had more than 1000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.

In [None]:
high_new_cases = covid_df.new_cases > 1000
high_new_cases

The boolean expression returns a series containing `True` and `False` boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the `True` values in the series.

In [None]:
covid_df[high_new_cases]

We can write this succinctly on a single line by passing the boolean expression as an index to the data frame.

In [None]:
high_cases_df = covid_df[covid_df.new_cases > 1000]

In [None]:
high_cases_df

The data frame contains 72 rows, but only the first & last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.

In [None]:
from IPython.display import display
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases > 1000])
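
Boolean series can also be combined before they are used for selection. The sketch below is an illustration on made-up numbers (not part of the Italy dataset): it uses the `&` operator to keep only the rows that satisfy two conditions at once.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'new_cases':  [500, 1500, 2500, 800],
    'new_deaths': [10, 40, 90, 5],
})

# Each comparison produces a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds more tightly than > and <.
mask = (toy_df.new_cases > 1000) & (toy_df.new_deaths < 50)

print(mask.tolist())   # [False, True, False, False]
print(toy_df[mask])
```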

We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall `positive_rate`.

In [None]:
positive_rate

In [None]:
high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests > positive_rate]

In [None]:
high_ratio_df

The result of performing an operation on two columns is a new series.

In [None]:
covid_df.new_cases / covid_df.new_tests

We can use this series to add a new column to the data frame.

In [None]:
covid_df['positive_rate'] = covid_df.new_cases / covid_df.new_tests

In [None]:
covid_df

However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this `positive_rate` column is likely to be incorrect. It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.

For now, let's remove the `positive_rate` column using the `drop` method.

In [None]:
covid_df.drop(columns=['positive_rate'], inplace=True)

Can you figure out the purpose of the `inplace` argument?
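
One way to investigate is to call `drop` with and without the argument on a small data frame and compare the results. The sketch below does this with a made-up `toy_df`; the other variable names are illustrative too.

```python
import pandas as pd

toy_df = pd.DataFrame({'new_cases': [10, 20], 'positive_rate': [0.1, 0.2]})

# Without inplace=True, drop returns a new data frame and leaves toy_df as it was
trimmed_df = toy_df.drop(columns=['positive_rate'])
columns_before = toy_df.columns.tolist()
print(columns_before)                 # ['new_cases', 'positive_rate']
print(trimmed_df.columns.tolist())    # ['new_cases']

# With inplace=True, toy_df itself is modified
toy_df.drop(columns=['positive_rate'], inplace=True)
print(toy_df.columns.tolist())        # ['new_cases']
```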

### Sorting rows using column values

The rows can also be sorted by a specific column using `.sort_values`. Let's sort to identify the days with the highest number of cases, then chain it with the `head` method to list just the first ten results.

In [None]:
covid_df.sort_values('new_cases', ascending=False).head(10)

It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.

In [None]:
covid_df.sort_values('new_deaths', ascending=False).head(10)

It appears that daily deaths hit a peak just about a week after the peak in daily new cases.
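
If you'd like to see exactly what `sort_values` and `head` do when chained, try them on a data frame small enough to check by eye. The values below are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'new_cases': [30, 50, 20, 40],
})

# Sort from highest to lowest, then keep the first two rows
top2 = toy_df.sort_values('new_cases', ascending=False).head(2)
print(top2)

# sort_values returns a new data frame; toy_df keeps its original order
print(toy_df.index.tolist())   # [0, 1, 2, 3]
```

The rows in `top2` keep their original index labels, which makes it easy to locate them in the unsorted data.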

Let's also look at the days with the least number of cases. We might expect to see the first few days of the year on this list.

In [None]:
covid_df.sort_values('new_cases').head(10)

It seems like the count of new cases on Jun 20, 2020, was `-148`, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. Can you dig through news articles online and figure out why the number was negative?

Let's look at some days before and after Jun 20, 2020.

In [None]:
covid_df.loc[169:175]

For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:

1. Replace it with `0`.
2. Replace it with the average of the entire column.
3. Replace it with the average of the values on the previous & next date.
4. Discard the row entirely.

Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.

You can use the `.at` method to modify a specific value within the dataframe.

In [None]:
covid_df.at[172, 'new_cases'] = (covid_df.at[171, 'new_cases'] + covid_df.at[173, 'new_cases'])/2
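
Here is the same fix applied to a small made-up series of daily counts, so that the result can be checked by hand. The name `toy_df`, the index `bad`, and the values are all hypothetical; the column holds floats, like `new_cases` in our data frame.

```python
import pandas as pd

toy_df = pd.DataFrame({'new_cases': [250.0, 260.0, -150.0, 270.0, 240.0]})

bad = 2   # index label of the faulty value

# Replace the faulty value with the average of its two neighbours
toy_df.at[bad, 'new_cases'] = (toy_df.at[bad - 1, 'new_cases'] + toy_df.at[bad + 1, 'new_cases']) / 2

print(toy_df.new_cases.tolist())   # [250.0, 260.0, 265.0, 270.0, 240.0]
```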

Here's a summary of the functions & methods we looked at in this section:

- `covid_df.new_cases.sum()` - Computing the sum of values in a column or series
- `covid_df[covid_df.new_cases > 1000]` - Querying a subset of rows satisfying the chosen criteria using boolean expressions
- `df['pos_rate'] = df.new_cases/df.new_tests` - Adding new columns by combining data from existing columns
- `covid_df.drop(columns=['positive_rate'])` - Removing one or more columns from the data frame
- `sort_values` - Sorting the rows of a data frame using column values
- `covid_df.at[172, 'new_cases'] = ...` - Replacing a value within the data frame

## Working with dates

While we've looked at overall numbers for the cases, tests, positive rate, etc., it would also be useful to study these numbers on a month-by-month basis. The `date` column might come in handy here, as Pandas provides many utilities for working with dates.

In [None]:
covid_df.date

The data type of date is currently `object`, so Pandas does not know that this column is a date. We can convert it into a `datetime` column using the `pd.to_datetime` method.

In [None]:
covid_df['date'] = pd.to_datetime(covid_df.date)
covid_df['date']

You can see that it now has the datatype `datetime64`. We can now extract different parts of the date into separate columns, using the `DatetimeIndex` class ([view docs](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html)).

In [None]:
covid_df['year'] = pd.DatetimeIndex(covid_df.date).year
covid_df['month'] = pd.DatetimeIndex(covid_df.date).month
covid_df['day'] = pd.DatetimeIndex(covid_df.date).day
covid_df['weekday'] = pd.DatetimeIndex(covid_df.date).weekday

In [None]:
covid_df
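
As an aside, a `datetime` column also exposes the same parts through its `.dt` accessor, which is an alternative to wrapping the column in `pd.DatetimeIndex`. The sketch below uses a small made-up data frame (`toy_df` is a hypothetical name) with three dates.

```python
import pandas as pd

toy_df = pd.DataFrame({'date': ['2020-05-31', '2020-06-01', '2020-06-07']})
toy_df['date'] = pd.to_datetime(toy_df['date'])

toy_df['year'] = toy_df['date'].dt.year
toy_df['month'] = toy_df['date'].dt.month
toy_df['day'] = toy_df['date'].dt.day
toy_df['weekday'] = toy_df['date'].dt.weekday   # Monday is 0, Sunday is 6

print(toy_df)
```

Jun 1, 2020 was a Monday, so the `weekday` values for these three dates are `6`, `0`, and `6`.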

Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the `sum` method to aggregate each selected column's values.

In [None]:
# Query the rows for May
covid_df_may = covid_df[covid_df.month == 5]

# Extract the subset of columns to be aggregated
covid_df_may_metrics = covid_df_may[['new_cases', 'new_deaths', 'new_tests']]

# Get the column-wise sum
covid_may_totals = covid_df_may_metrics.sum()

In [None]:
covid_may_totals

In [None]:
type(covid_may_totals)

We can also combine the above operations into a single statement.

In [None]:
covid_df[covid_df.month == 5][['new_cases', 'new_deaths', 'new_tests']].sum()

As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the `.mean` method.

In [None]:
# Overall average
covid_df.new_cases.mean()

In [None]:
# Average for Sundays
covid_df[covid_df.weekday == 6].new_cases.mean()

It seems like more cases were reported on Sundays compared to other days.

Try asking and answering some more date-related questions about the data using the cells below.

## Grouping and aggregation

As a next step, we might want to summarize the day-wise data and create a new dataframe with month-wise data. We can use the `groupby` function to create a group for each month, select the columns we wish to aggregate, and aggregate them using the `sum` method. 

In [None]:
covid_month_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].sum()
covid_month_df

The result is a new data frame that uses unique values from the column passed to `groupby` as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.

Instead of aggregating by sum, you can also aggregate by other measures like mean. Let's compute the average number of daily new cases, deaths, and tests for each month.

In [None]:
covid_month_mean_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].mean()
covid_month_mean_df
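
To see what grouping does row by row, here's a sketch with five made-up rows spread over two months. The sums and means are easy to verify by hand.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'month':      [3, 3, 4, 4, 4],
    'new_cases':  [10, 30, 5, 10, 15],
    'new_deaths': [1, 3, 0, 2, 1],
})

# One row per unique month; the month values become the index
month_sum = toy_df.groupby('month')[['new_cases', 'new_deaths']].sum()
month_mean = toy_df.groupby('month')[['new_cases', 'new_deaths']].mean()

print(month_sum)    # month 3: 40 cases, 4 deaths; month 4: 30 cases, 3 deaths
print(month_mean)   # month 3: 20.0 and 2.0;       month 4: 10.0 and 1.0
```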

Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the `cumsum` method to compute the cumulative sum of a column as a new series. Let's add three new columns: `total_cases`, `total_deaths`, and `total_tests`.

In [None]:
covid_df['total_cases'] = covid_df.new_cases.cumsum()
covid_df['total_deaths'] = covid_df.new_deaths.cumsum()
covid_df['total_tests'] = covid_df.new_tests.cumsum() + initial_tests


We've also included the initial test count in `total_tests` to account for tests conducted before daily reporting was started. 

In [None]:
covid_df

Notice how the `NaN` values in the `total_tests` column remain unaffected.
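
The sketch below shows this behavior on a short made-up series (the `tests_before` value is hypothetical). By default, `cumsum` skips `NaN` values when accumulating but leaves them in place in the result.

```python
import numpy as np
import pandas as pd

new_tests = pd.Series([np.nan, np.nan, 100.0, 50.0, np.nan, 25.0])
tests_before = 1000   # hypothetical count from before daily reporting began

total_tests = new_tests.cumsum() + tests_before
print(total_tests.tolist())   # [nan, nan, 1100.0, 1150.0, nan, 1175.0]
```

The running total carries on past the gap at index `4`: the last value is `1175.0`, not `NaN`.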

## Merging data from multiple sources

To determine other metrics like tests per million, cases per million, etc., we require some more information about the country, viz. its population. Let's download another file `locations.csv` that contains health-related information for many countries, including Italy.

In [None]:
urlretrieve('https://gist.githubusercontent.com/aakashns/8684589ef4f266116cdce023377fc9c8/raw/99ce3826b2a9d1e6d0bde7e9e559fc8b6e9ac88b/locations.csv', 
            'locations.csv')

In [None]:
locations_df = pd.read_csv('locations.csv')
locations_df

In [None]:
locations_df[locations_df.location == "Italy"]

We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a `location` column in the `covid_df` dataframe with all values set to `"Italy"`.

In [None]:
covid_df['location'] = "Italy"
covid_df

We can now add the columns from `locations_df` into `covid_df` using the `.merge` method.

In [None]:
merged_df = covid_df.merge(locations_df, on="location")
merged_df

The location data for Italy is appended to each row within `covid_df`. If the `covid_df` data frame contained data for multiple locations, then the respective country's location data would be appended for each row.
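
The multi-location case can be sketched with two small made-up data frames. The names `daily_df` and `places_df` are hypothetical, and the population figures are round numbers chosen for easy arithmetic, not real data.

```python
import pandas as pd

daily_df = pd.DataFrame({
    'location':  ['Italy', 'Italy', 'Spain'],
    'new_cases': [100, 200, 300],
})

# Made-up round numbers, for illustration only
places_df = pd.DataFrame({
    'location':   ['Italy', 'Spain', 'France'],
    'population': [60_000_000, 47_000_000, 65_000_000],
})

# Each row of daily_df picks up the population of its own location
merged = daily_df.merge(places_df, on='location')
merged['cases_per_million'] = merged.new_cases * 1e6 / merged.population

print(merged)
```

By default, `merge` keeps only the locations present in both data frames, so the row for France does not appear in the result.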

We can now calculate metrics like cases per million, deaths per million, and tests per million.

In [None]:
merged_df['cases_per_million'] = merged_df.total_cases * 1e6 / merged_df.population

In [None]:
merged_df['deaths_per_million'] = merged_df.total_deaths * 1e6 / merged_df.population

In [None]:
merged_df['tests_per_million'] = merged_df.total_tests * 1e6 / merged_df.population

In [None]:
merged_df

## Writing data back to files

After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down. Before writing to file, let us first create a data frame containing just the columns we wish to record.

In [None]:
result_df = merged_df[['date',
                       'new_cases', 
                       'total_cases', 
                       'new_deaths', 
                       'total_deaths', 
                       'new_tests', 
                       'total_tests', 
                       'cases_per_million', 
                       'deaths_per_million', 
                       'tests_per_million']]

result_df

To write the data from the data frame into a file, we can use the `to_csv` function. 

In [None]:
result_df.to_csv('results.csv', index=False)

By default, `to_csv` also writes an additional column containing the index of the dataframe. We pass `index=False` to turn off this behavior. You can now verify that the `results.csv` file is created and contains data from the data frame in CSV format:

```
date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
2020-02-27,78.0,400.0,1.0,12.0,,,6.61574439992122,0.1984723319976366,
2020-02-28,250.0,650.0,5.0,17.0,,,10.750584649871982,0.28116913699665186,
2020-02-29,238.0,888.0,4.0,21.0,,,14.686952567825108,0.34732658099586405,
2020-03-01,240.0,1128.0,8.0,29.0,,,18.656399207777838,0.47964146899428844,
2020-03-02,561.0,1689.0,6.0,35.0,,,27.93498072866735,0.5788776349931067,
2020-03-03,347.0,2036.0,17.0,52.0,,,33.67413899559901,0.8600467719897585,
...
```
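
To see the effect of the `index` argument without creating any files, you can call `to_csv` with no file name, in which case it returns the CSV text as a string. The sketch below compares the two outputs for a tiny made-up data frame.

```python
import pandas as pd

toy_df = pd.DataFrame({'date': ['2020-09-01', '2020-09-02'], 'new_cases': [996, 975]})

# With no file name, to_csv returns the CSV text instead of writing a file
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)

print(with_index)      # the first column holds the index: 0, 1
print(without_index)   # only the 'date' and 'new_cases' columns
```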

You can attach the `results.csv` file to our notebook while uploading it to [Jovian](https://jovian.ai) using the `outputs` argument to `jovian.commit`.


## Bonus: Basic Plotting with Pandas

We generally use a library like `matplotlib` or `seaborn` to plot graphs within a Jupyter notebook. However, Pandas dataframes & series provide a handy `.plot` method for quick and easy plotting.

Let's plot a line graph showing how the number of daily cases varies over time.

In [None]:
result_df.new_cases.plot();

While this plot shows the overall trend, it's hard to tell where the peak occurred, as there are no dates on the X-axis. We can use the `date` column as the index for the data frame to address this issue.

In [None]:
result_df.set_index('date', inplace=True)
result_df

Notice that the index of a data frame doesn't have to be numeric. Using the date as the index also allows us to get the data for a specific date using `.loc`.

In [None]:
result_df.loc['2020-09-01']
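
A date index also supports ranges. The sketch below builds a three-row data frame (values taken from the last few days shown earlier; `toy_df` and `by_date` are illustrative names), turns the `date` column into the index, and then selects a single date and a range of dates.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-08-31', '2020-09-01', '2020-09-02']),
    'new_cases': [1365, 996, 975],
})

# Without inplace=True, set_index returns a new data frame
by_date = toy_df.set_index('date')

row = by_date.loc['2020-09-01']                 # a single date
span = by_date.loc['2020-09-01':'2020-09-02']   # a range, both ends included

print(row)
print(span)
```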

In [None]:
result_df.new_cases.plot()
result_df.new_deaths.plot();

We can also compare the total cases vs. total deaths.

In [None]:
result_df.total_cases.plot()
result_df.total_deaths.plot();

Let's see how the death rate and positive testing rates vary over time.

In [None]:
death_rate = result_df.total_deaths / result_df.total_cases

In [None]:
death_rate.plot(title='Death Rate');

In [None]:
positive_rates = result_df.total_cases / result_df.total_tests
positive_rates.plot(title='Positive Rate');

Finally, let's plot some month-wise data using a bar chart to visualize the trend at a higher level.

In [None]:
covid_month_df.new_cases.plot(kind='bar');

In [None]:
covid_month_df.new_tests.plot(kind='bar');

## Exercises

Try the following exercises to become familiar with Pandas dataframe and practice your skills:

* Assignment on Pandas dataframes: https://jovian.ml/aakashns/pandas-practice-assignment
* Additional exercises on Pandas: https://github.com/guipsamora/pandas_exercises
* Try downloading and analyzing some data from Kaggle: https://www.kaggle.com/datasets


## Summary and Further Reading


We've covered the following topics in this tutorial:

- Reading a CSV file into a Pandas data frame
- Retrieving data from Pandas data frames
- Querying, sorting, and analyzing data
- Merging, grouping, and aggregation of data
- Extracting useful information from dates
- Basic plotting using line and bar charts
- Writing data frames to CSV files


Check out the following resources to learn more about Pandas:

* User guide for Pandas: https://pandas.pydata.org/docs/user_guide/index.html
* Python for Data Analysis (book by Wes McKinney - creator of Pandas): https://www.oreilly.com/library/view/python-for-data/9781491957653/

You are ready to move on to the next tutorial: [Data Visualization using Matplotlib & Seaborn](https://jovian.ai/aakashns/python-matplotlib-data-visualization).

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is Pandas? What makes it useful?
2. How do you install the Pandas library?
3. How do you import the `pandas` module?
4. What is the common alias used while importing the `pandas` module?
5. How do you read a CSV file using Pandas? Give an example.
6. What are some other file formats you can read using Pandas? Illustrate with examples.
7. What are Pandas dataframes? 
8. How are Pandas dataframes different from Numpy arrays?
9. How do you find the number of rows and columns in a dataframe?
10. How do you get the list of columns in a dataframe?
11. What is the purpose of the `describe` method of a dataframe?
12. How are the `info` and `describe` dataframe methods different?
13. Is a Pandas dataframe conceptually similar to a list of dictionaries or a dictionary of lists? Explain with an example.
14. What is a Pandas `Series`? How is it different from a Numpy array?
15. How do you access a column from a dataframe?
16. How do you access a row from a dataframe?
17. How do you access an element at a specific row & column of a dataframe?
18. How do you create a subset of a dataframe with a specific set of columns?
19. How do you create a subset of a dataframe with a specific range of rows?
20. Does changing a value within a dataframe affect other dataframes created using a subset of the rows or columns? Why is it so?
21. How do you create a copy of a dataframe?
22. Why should you avoid creating too many copies of a dataframe?
23. How do you view the first few rows of a dataframe?
24. How do you view the last few rows of a dataframe?
25. How do you view a random selection of rows of a dataframe?
26. What is the "index" in a dataframe? How is it useful?
27. What does a `NaN` value in a Pandas dataframe represent?
28. How is `NaN` different from `0`?
29. How do you identify the first non-empty row in a Pandas series or column?
30. What is the difference between `df.loc` and `df.at`?
31. Where can you find a full list of methods supported by Pandas `DataFrame` and `Series` objects?
32. How do you find the sum of numbers in a column of a dataframe?
33. How do you find the mean of numbers in a column of a dataframe?
34. How do you find the number of non-empty numbers in a column of a dataframe?
35. What is the result obtained by using a Pandas column in a boolean expression? Illustrate with an example.
36. How do you select a subset of rows where a specific column's value meets a given condition? Illustrate with an example.
37. What is the result of the expression `df[df.new_cases > 100]` ?
38. How do you display all the rows of a pandas dataframe in a Jupyter cell output?
39. What is the result obtained when you perform an arithmetic operation between two columns of a dataframe? Illustrate with an example.
40. How do you add a new column to a dataframe by combining values from two existing columns? Illustrate with an example.
41. How do you remove a column from a dataframe? Illustrate with an example.
42. What is the purpose of the `inplace` argument in dataframe methods?
43. How do you sort the rows of a dataframe based on the values in a particular column?
44. How do you sort a pandas dataframe using values from multiple columns?
45. How do you specify whether to sort by ascending or descending order while sorting a Pandas dataframe?
46. How do you change a specific value within a dataframe?
47. How do you convert a dataframe column to the `datetime` data type?
48. What are the benefits of using the `datetime` data type instead of `object`?
49. How do you extract different parts of a date column like the year, month, day, weekday, etc., into separate columns? Illustrate with an example.
50. How do you aggregate multiple columns of a dataframe together?
51. What is the purpose of the `groupby` method of a dataframe? Illustrate with an example.
52. What are the different ways in which you can aggregate the groups created by `groupby`?
53. What do you mean by a running or cumulative sum? 
54. How do you create a new column containing the running or cumulative sum of another column?
55. What are other cumulative measures supported by Pandas dataframes?
56. What does it mean to merge two dataframes? Give an example.
57. How do you specify the columns that should be used for merging two dataframes?
58. How do you write data from a Pandas dataframe into a CSV file? Give an example.
59. What are some other file formats you can write to from a Pandas dataframe? Illustrate with examples.
60. How do you create a line plot showing the values within a column of a dataframe?
61. How do you convert a column of a dataframe into its index?
62. Can the index of a dataframe be non-numeric?
63. What are the benefits of using a non-numeric dataframe index? Illustrate with an example.
64. How do you create a bar plot showing the values within a column of a dataframe?
65. What are some other types of plots supported by Pandas dataframes and series?

# Data Visualization using Python, Matplotlib and Seaborn


![](https://i.imgur.com/9i806Rh.png)


This tutorial series is a beginner-friendly introduction to programming and data analysis using the Python programming language. These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. 

## Introduction

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an essential part of data analysis and machine learning. We'll use Python libraries [Matplotlib](https://matplotlib.org) and [Seaborn](https://seaborn.pydata.org) to learn and apply some popular data visualization techniques. We'll use the words _chart_, _plot_, and _graph_ interchangeably in this tutorial.

To begin, let's install and import the libraries. We'll use the `matplotlib.pyplot` module for basic plots like line & bar charts. It is often imported with the alias `plt`. We'll use the `seaborn` module for more advanced plots. It is commonly imported with the alias `sns`. 

In [None]:
!pip install matplotlib seaborn --upgrade --quiet

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Notice that we also include the special command `%matplotlib inline` to ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up windows.

## Line Chart

The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays information as a series of data points or markers connected by straight lines. You can customize the shape, size, color, and other aesthetic elements of the lines and markers for better visual clarity.

Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called Kanto.

In [None]:
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the `plt.plot` function.

In [None]:
plt.plot(yield_apples)

Calling the `plt.plot` function draws the line chart as expected. It also returns a list of the lines drawn, `[<matplotlib.lines.Line2D at 0x7ff70aa20760>]`, which is shown within the output. We can include a semicolon (`;`) at the end of the last statement in the cell to avoid showing the output and display just the graph.

In [None]:
plt.plot(yield_apples);
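To see what `plt.plot` actually returns, we can capture the result in a variable instead of hiding it. The snippet below is a small sketch; the names `lines` and `line` are chosen here for illustration and are not used elsewhere in this tutorial.

```python
import matplotlib.pyplot as plt

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# plt.plot returns a list containing one Line2D object per line drawn
lines = plt.plot(yield_apples)
line = lines[0]

# With a single argument, the values are used for the Y-axis and
# the X values default to the list indexes 0, 1, 2, ...
print(len(lines))
print(list(line.get_xdata()))
print(list(line.get_ydata()))
```

The `Line2D` object can also be used later to inspect or change the line's properties.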

Let's enhance this plot step-by-step to make it more informative and beautiful.

### Customizing the X-axis

The X-axis of the plot currently shows list element indexes 0 to 5. The plot would be more informative if we could display the year for which we're plotting the data. We can do this by passing two arguments to `plt.plot`: the X values followed by the Y values.

In [None]:
years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

In [None]:
plt.plot(years, yield_apples)

### Axis Labels

We can add labels to the axes to show what each axis represents using the `plt.xlabel` and `plt.ylabel` methods.

In [None]:
plt.plot(years, yield_apples)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

### Plotting Multiple Lines

You can invoke the `plt.plot` function once for each line to plot multiple lines in the same graph. Let's compare the yields of apples vs. oranges in Kanto.

In [None]:
years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896, ]

In [None]:
plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

### Chart Title and Legend

To differentiate between multiple lines, we can include a legend within the graph using the `plt.legend` function. We can also set a title for the chart using the `plt.title` function.

In [None]:
plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);
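Passing a list to `plt.legend` relies on the names being in the same order as the `plt.plot` calls. As an alternative sketch (not the approach used in the rest of this tutorial), each line can carry its own name through the `label` argument of `plt.plot`, and `plt.legend()` can then be called without arguments. The shortened lists below are only for illustration.

```python
import matplotlib.pyplot as plt

years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Attach a label to each line at the time it is drawn
plt.plot(years, apples, label='Apples')
plt.plot(years, oranges, label='Oranges')

plt.title("Crop Yields in Kanto")

# With no arguments, the legend is built from the labels above
legend = plt.legend()
print([text.get_text() for text in legend.get_texts()])
```

Keeping the label next to the data makes it harder to mix up names when lines are added or reordered.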

### Line Markers

We can also show markers for the data points on each line using the `marker` argument of `plt.plot`. Matplotlib provides many different markers, like a circle, cross, square, diamond, etc. You can find the full list of marker types here: https://matplotlib.org/3.1.1/api/markers_api.html .

In [None]:
plt.plot(years, apples, marker='o')
plt.plot(years, oranges, marker='x')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

### Styling Lines and Markers

The `plt.plot` function supports many arguments for styling lines and markers:

- `color` or `c`: Set the color of the line ([supported colors](https://matplotlib.org/3.1.0/gallery/color/named_colors.html))
- `linestyle` or `ls`: Choose between a solid or dashed line
- `linewidth` or `lw`: Set the width of a line
- `markersize` or `ms`: Set the size of markers
- `markeredgecolor` or `mec`: Set the edge color for markers
- `markeredgewidth` or `mew`: Set the edge width for markers
- `markerfacecolor` or `mfc`: Set the fill color for markers
- `alpha`: Opacity of the plot


Check out the documentation for `plt.plot` to learn more: [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) .

In [None]:
plt.plot(years, apples, marker='s', c='b', ls='-', lw=2, ms=8, mew=2, mec='navy')
plt.plot(years, oranges, marker='o', c='r', ls='--', lw=3, ms=10, alpha=.5)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

The `fmt` argument provides a shorthand for specifying the marker shape, line style, and line color. It can be provided as the third argument to `plt.plot`.

```
fmt = '[marker][line][color]'
```


In [None]:
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

If you don't specify a line style in `fmt`, only markers are drawn.

In [None]:
plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");
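To check how Matplotlib interprets a `fmt` string, we can inspect the `Line2D` objects returned by `plt.plot`. This is a small sketch with shortened data; the variable names `with_line` and `markers_only` are made up for this example.

```python
import matplotlib.pyplot as plt

years = range(2000, 2006)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# 'o--r' means: circle markers, dashed line, red color
with_line, = plt.plot(years, oranges, 'o--r')

# 'or' means: circle markers, red color, and no line style
markers_only, = plt.plot(years, oranges, 'or')

for line in (with_line, markers_only):
    print(line.get_marker(), line.get_linestyle(), line.get_color())
```

When the line style is left out of `fmt`, Matplotlib records the line style as the string `'None'`, which is why only the markers are drawn.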

### Changing the Figure Size

You can use the `plt.figure` function to change the size of the figure.

In [None]:
plt.figure(figsize=(12, 6))

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");
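The `figsize` argument is a `(width, height)` pair measured in inches. The sketch below reads the size back from the figure object to confirm this; the size in pixels depends on the figure's `dpi` setting, which can differ between environments.

```python
import matplotlib.pyplot as plt

# Create a figure that is 12 inches wide and 6 inches tall
fig = plt.figure(figsize=(12, 6))

width, height = fig.get_size_inches()
print(width, height)

# Pixel dimensions are the size in inches multiplied by the dots per inch
print(width * fig.dpi, height * fig.dpi)
```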

### Improving Default Styles using Seaborn

An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. These can be applied globally using the `sns.set_style` function. You can see a full list of predefined styles here: https://seaborn.pydata.org/generated/seaborn.set_style.html .

In [None]:
sns.set_style("whitegrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

In [None]:
sns.set_style("darkgrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

In [None]:
plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");

You can also edit default styles directly by modifying the `matplotlib.rcParams` dictionary. Learn more: https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams .

In [None]:
import matplotlib

In [None]:
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## Scatter Plot

In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can also use a third variable to determine the size or color of the points. Let's try out an example.

The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) provides sample measurements of sepals and petals for three species of flowers. The Iris dataset is included with the Seaborn library and can be loaded as a Pandas data frame.

In [None]:
# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")
flowers_df

In [None]:
flowers_df.species.unique()

Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a line chart using `plt.plot`.

In [None]:
plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);

The output is not very informative as there are too many combinations of the two properties within the dataset. There doesn't seem to be a simple relationship between them.

We can use a scatter plot to visualize how sepal length & sepal width vary using the `scatterplot` function from the `seaborn` module (imported as `sns`).

In [None]:
sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);

### Adding Hues

Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots using the flower species as a `hue`. We can also make the points larger using the `s` argument.

In [None]:
sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=100);


Adding hues makes the plot more informative. We can immediately tell that Setosa flowers have a smaller sepal length but higher sepal widths. In contrast, the opposite is true for Virginica flowers. 

### Customizing Seaborn Figures

Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like `plt.figure` and `plt.title` to modify the figure.

In [None]:
plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=100);

### Plotting using Pandas Data Frames

Seaborn has in-built support for Pandas data frames. Instead of passing each column as a series, you can provide column names and use the `data` argument to specify a data frame.

In [None]:
plt.title('Sepal Dimensions')
sns.scatterplot(x='sepal_length', 
                y='sepal_width', 
                hue='species',
                s=100,
                data=flowers_df);

## Histogram

A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and showing vertical bars to indicate the number of observations in each bin.

For example, let's visualize the distribution of values of sepal width in the flowers dataset. We can use the `plt.hist` function to create a histogram.

In [None]:
# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")
flowers_df.sepal_width

In [None]:
plt.title("Distribution of Sepal Width")
plt.hist(flowers_df.sepal_width);

We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 - 3.1, which seems to be the most populous bin.

### Controlling the size and number of bins

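We can control the number of bins or the size of each one using the `bins` argument. An integer sets the number of equal-width bins, while a list or array sets the bin boundaries explicitly.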

In [None]:
# Specifying the number of bins
plt.hist(flowers_df.sepal_width, bins=5);

In [None]:
import numpy as np

# Specifying the boundaries of each bin
plt.hist(flowers_df.sepal_width, bins=np.arange(2, 5, 0.25));

In [None]:
# Bins of unequal sizes
plt.hist(flowers_df.sepal_width, bins=[1, 3, 4, 4.5]);
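To see the numbers behind a histogram, we can compute the bin counts directly with `np.histogram`, which accepts the same kind of `bins` argument. The sketch below uses a short, made-up list of widths (not the Iris data) so that the counts are easy to verify by hand.

```python
import numpy as np

# Hypothetical sepal widths, used only for illustration
widths = np.array([2.0, 2.2, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# An integer asks for that many equal-width bins across the data range
counts, edges = np.histogram(widths, bins=5)
print(counts, edges)

# A list gives the bin boundaries explicitly, so bins can be unequal
counts_custom, edges_custom = np.histogram(widths, bins=[1, 3, 4, 4.5])
print(counts_custom)
```

Note that `n` bins always have `n + 1` edges, and each bin includes its left edge but not its right edge, except for the last bin, which includes both.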

### Multiple Histograms

Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'.

Let's draw separate histograms for each species of flowers.

In [None]:
setosa_df = flowers_df[flowers_df.species == 'setosa']
versicolor_df = flowers_df[flowers_df.species == 'versicolor']
virginica_df = flowers_df[flowers_df.species == 'virginica']

In [None]:
plt.hist(setosa_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));
plt.hist(versicolor_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));

We can also stack multiple histograms on top of one another.

In [None]:
plt.title('Distribution of Sepal Width')

plt.hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(2, 5, 0.25), 
         stacked=True);

plt.legend(['Setosa', 'Versicolor', 'Virginica']);

## Bar Chart

Bar charts are quite similar to line charts, i.e., they show a sequence of values. However, a bar is shown for each value, rather than points connected by lines. We can use the `plt.bar` function to draw a bar chart.

In [None]:
years = range(2000, 2006)
apples = [0.35, 0.6, 0.9, 0.8, 0.65, 0.8]
oranges = [0.4, 0.8, 0.9, 0.7, 0.6, 0.8]

In [None]:
plt.bar(years, oranges);

Like histograms, we can stack bars on top of one another. We use the `bottom` argument of `plt.bar` to achieve this.

In [None]:
plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
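The `bottom` argument sets the height at which each bar starts. To confirm that the orange bars sit exactly on top of the apple bars, we can inspect the rectangles returned by `plt.bar`. This is a small sketch; the names `orange_bars` and `first` are introduced only for this example.

```python
import matplotlib.pyplot as plt

years = range(2000, 2006)
apples = [0.35, 0.6, 0.9, 0.8, 0.65, 0.8]
oranges = [0.4, 0.8, 0.9, 0.7, 0.6, 0.8]

plt.bar(years, apples)

# plt.bar returns a container holding one rectangle per bar
orange_bars = plt.bar(years, oranges, bottom=apples)

# The first orange bar starts at the apple value for that year
# and is as tall as the orange value
first = orange_bars[0]
print(first.get_y(), first.get_height())
```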

### Bar Plots with Averages

Let's look at another sample dataset included with Seaborn, called `tips`. The dataset contains information about the sex, time of day, total bill, and tip amount for customers visiting a restaurant over a week.

In [None]:
tips_df = sns.load_dataset("tips")
tips_df

We might want to draw a bar chart to visualize how the average bill amount varies across different days of the week. One way to do this would be to compute the day-wise averages and then use `plt.bar` (try it as an exercise).

However, since this is a very common use case, the Seaborn library provides a `barplot` function which can automatically compute averages.

In [None]:
sns.barplot(x='day', y='total_bill', data=tips_df);

The lines cutting each bar represent the amount of variation in the values. For instance, it seems like the variation in the total bill is relatively high on Fridays and low on Saturdays.
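For comparison, the manual approach mentioned earlier (compute the day-wise averages, then call `plt.bar`) might look roughly like the sketch below. It uses a tiny, made-up data frame named `bills_df` instead of the real `tips` dataset, so the numbers are purely illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for a few rows of the tips dataset
bills_df = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 15.0, 25.0, 30.0, 50.0],
})

# Average bill per day; sort=False keeps the days in order of appearance
day_means = bills_df.groupby('day', sort=False)['total_bill'].mean()
print(day_means)

# One bar per day, with the average bill as the bar height
plt.bar(list(day_means.index), day_means.values)
```

Unlike `sns.barplot`, this version draws only the averages and no lines indicating variation.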

We can also specify a `hue` argument to compare bar plots side-by-side based on a third feature, e.g., sex.

In [None]:
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df);

You can make the bars horizontal simply by switching the axes.

In [None]:
sns.barplot(x='total_bill', y='day', hue='sex', data=tips_df);

## Heatmap

A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand it is by looking at an example. We'll use another sample dataset from Seaborn, called `flights`, to visualize monthly passenger footfall at an airport over 12 years.

In [None]:
# Recent versions of Pandas require keyword arguments for `pivot`
flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")
flights_df

`flights_df` is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the `sns.heatmap` function to visualize the footfall at the airport.
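The `pivot` method is what turns a long table (one row per month-year pair) into this matrix form. Here is a minimal sketch on a hand-written data frame; `long_df` and `wide_df` are names chosen for this example, and the four rows are illustrative values rather than the full dataset.

```python
import pandas as pd

# Long format: one row per (year, month) combination
long_df = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [112, 118, 115, 126],
})

# Wide format: one row per month, one column per year
wide_df = long_df.pivot(index='month', columns='year', values='passengers')
print(wide_df)

# A single cell can be looked up by month and year
print(wide_df.loc['Jan', 1950])
```

Each unique value of the `index` column becomes a row label, and each unique value of the `columns` column becomes a column label.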

In [None]:
plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df);

The brighter colors indicate a higher footfall at the airport. By looking at the graph, we can infer two things:

- The footfall at the airport in any given year tends to be the highest around July & August.
- The footfall at the airport in any given month tends to grow year by year.

We can also display the actual values in each block by specifying `annot=True`, and change the color palette using the `cmap` argument.

In [None]:
plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df, fmt="d", annot=True, cmap='Blues');

## Images

We can also use Matplotlib to display images. Let's download an image from the internet.

In [10]:
from urllib.request import urlretrieve
urlretrieve('https://i.imgur.com/SkPbq.jpg', 'chart.jpg');

Before displaying an image, it has to be read into memory using the `PIL` module.

In [None]:
from PIL import Image

In [None]:
img = Image.open('chart.jpg')

An image loaded using PIL can be converted into a 3-dimensional numpy array containing pixel intensities for the red, green & blue (RGB) channels of the image. We can perform the conversion using `np.array`.

In [None]:
img_array = np.array(img)

In [None]:
img_array.shape

We can display the PIL image using `plt.imshow`.

In [None]:
plt.imshow(img);

We can turn off the axes & grid lines and show a title using the relevant functions.

In [None]:
plt.grid(False)
plt.title('A data science meme')
plt.axis('off')
plt.imshow(img);

To display a part of the image, we can simply select a slice from the numpy array.

In [None]:
plt.grid(False)
plt.axis('off')
plt.imshow(img_array[125:325,105:305]);
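The slicing above works because the first axis of the array indexes rows (height), the second indexes columns (width), and the third holds the color channels. The sketch below illustrates this on a tiny synthetic array, so that it does not depend on the downloaded image; `tiny_img` and `crop` are hypothetical names.

```python
import numpy as np

# Hypothetical 4x6 RGB image: 4 rows, 6 columns, 3 color channels
tiny_img = np.zeros((4, 6, 3), dtype=np.uint8)
tiny_img[:, :, 0] = 255  # set the red channel to full intensity

print(tiny_img.shape)  # (rows, columns, channels)

# Select rows 1-2 and columns 2-4; the channel axis is left untouched
crop = tiny_img[1:3, 2:5]
print(crop.shape)
```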

## Plotting multiple charts in a grid

Matplotlib and Seaborn also support plotting multiple charts in a grid, using `plt.subplots`, which returns a set of axes for plotting. 

Here's a single grid showing the different types of charts we've covered in this tutorial.

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(16, 8))

# Use the axes for plotting
axes[0,0].plot(years, apples, 's-b')
axes[0,0].plot(years, oranges, 'o--r')
axes[0,0].set_xlabel('Year')
axes[0,0].set_ylabel('Yield (tons per hectare)')
axes[0,0].legend(['Apples', 'Oranges']);
axes[0,0].set_title('Crop Yields in Kanto')


# Pass the axes into seaborn
axes[0,1].set_title('Sepal Length vs. Sepal Width')
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=100, 
                ax=axes[0,1]);

# Use the axes for plotting
axes[0,2].set_title('Distribution of Sepal Width')
axes[0,2].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(2, 5, 0.25), 
         stacked=True);

axes[0,2].legend(['Setosa', 'Versicolor', 'Virginica']);

# Pass the axes into seaborn
axes[1,0].set_title('Restaurant bills')
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df, ax=axes[1,0]);

# Pass the axes into seaborn
axes[1,1].set_title('Flight traffic')
sns.heatmap(flights_df, cmap='Blues', ax=axes[1,1]);

# Plot an image using the axes
axes[1,2].set_title('Data Science Meme')
axes[1,2].imshow(img)
axes[1,2].grid(False)
axes[1,2].set_xticks([])
axes[1,2].set_yticks([])

plt.tight_layout(pad=2);
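When called with more than one row and column, `plt.subplots` returns the axes as a 2-dimensional numpy array, which is why they are indexed as `axes[row, column]`. The short sketch below shows the shape of that array and one way to loop over every axis; the titles are placeholders.

```python
import matplotlib.pyplot as plt

# A grid with 2 rows and 3 columns of empty charts
fig, axes = plt.subplots(2, 3, figsize=(16, 8))
print(axes.shape)

# axes.flat visits the six axes one row at a time
for i, ax in enumerate(axes.flat):
    ax.set_title(f'Chart {i + 1}')

plt.tight_layout(pad=2)
```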


### Pair plots with Seaborn

Seaborn also provides a helper function `sns.pairplot` to automatically plot several different charts for pairs of features within a dataframe.

In [None]:
sns.pairplot(flowers_df, hue='species');

In [None]:
sns.pairplot(tips_df, hue='sex');

## Summary and Further Reading

We have covered the following topics in this tutorial: 

- Creating and customizing line charts using Matplotlib
- Visualizing relationships between two or more variables using scatter plots
- Studying distributions of variables using histograms & bar charts
- Visualizing two-dimensional data using heatmaps
- Displaying images using Matplotlib's `plt.imshow`
- Plotting multiple Matplotlib and Seaborn charts in a grid

In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check out these references to learn and discover more:

* Data Visualization cheat sheet: https://jovian.ml/aakashns/dataviz-cheatsheet
* Seaborn gallery: https://seaborn.pydata.org/examples/index.html
* Matplotlib gallery: https://matplotlib.org/3.1.1/gallery/index.html
* Matplotlib tutorial: https://github.com/rougier/matplotlib-tutorial

You are now ready to move on to the next tutorial: [Exploratory Data Analysis - A Case Study](https://jovian.ai/aakashns/python-eda-stackoverflow-survey)

## Questions for Revision

Try answering the following questions to test your understanding of the topics covered in this notebook:

1. What is data visualization?
2. What is Matplotlib?
3. What is Seaborn?
4. How do you install Matplotlib and Seaborn?
5. How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?
6. What is the purpose of the magic command `%matplotlib inline`?
7. What is a line chart?
8. How do you plot a line chart in Python? Illustrate with an example.
9. How do you specify values for the X-axis of a line chart?
10. How do you specify labels for the axes of a chart?
11. How do you plot multiple line charts on the same axes?
12. How do you show a legend for a line chart with multiple lines?
13. How do you set a title for a chart?
14. How do you show markers on a line chart?
15. What are the different options for styling lines & markers in line charts? Illustrate with examples.
16. What is the purpose of the `fmt` argument to `plt.plot`?
17. How do you show markers without a line using `plt.plot`?
18. Where can you see a list of all the arguments accepted by `plt.plot`?
19. How do you change the size of the figure using Matplotlib?
20. How do you apply the default styles from Seaborn globally for all charts?
21. What are the predefined styles available in Seaborn? Illustrate with examples.
22. What is a scatter plot?
23. How is a scatter plot different from a line chart?
24. How do you draw a scatter plot using Seaborn? Illustrate with an example.
25. How do you decide when to use a scatter plot vs. a line chart?
26. How do you specify the colors for dots on a scatter plot using a categorical variable?
27. How do you customize the title, figure size, legend, etc., for Seaborn plots?
28. How do you use a Pandas dataframe with `sns.scatterplot`?
29. What is a histogram?
30. When should you use a histogram vs. a line chart?
31. How do you draw a histogram using Matplotlib? Illustrate with an example.
32. What are "bins" in a histogram?
33. How do you change the sizes of bins in a histogram?
34. How do you change the number of bins in a histogram?
35. How do you show multiple histograms on the same axes?
36. How do you stack multiple histograms on top of one another?
37. What is a bar chart?
38. How do you draw a bar chart using Matplotlib? Illustrate with an example.
39. What is the difference between a bar chart and a histogram?
40. What is the difference between a bar chart and a line chart?
41. How do you stack bars on top of one another?
42. What is the difference between `plt.bar` and `sns.barplot`?
43. What do the lines cutting the bars in a Seaborn bar plot represent?
44. How do you show bar plots side-by-side?
45. How do you draw a horizontal bar plot?
46. What is a heat map?
47. What type of data is best visualized with a heat map?
48. What does the `pivot` method of a Pandas dataframe do?
49. How do you draw a heat map using Seaborn? Illustrate with an example.
50. How do you change the color scheme of a heat map?
51. How do you show the original values from the dataset on a heat map?
52. How do you download images from a URL in Python?
53. How do you open an image for processing in Python?
54. What is the purpose of the `PIL` module in Python?
55. How do you convert an image loaded using PIL into a Numpy array?
56. How many dimensions does a Numpy array for an image have? What does each dimension represent?
57. What are "color channels" in an image?
58. What is RGB?
59. How do you display an image using Matplotlib?
60. How do you turn off the axes and gridlines in a chart?
61. How do you display a portion of an image using Matplotlib?
62. How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.
63. What is the purpose of the `plt.subplots` function?
64. What are pair plots in Seaborn? Illustrate with an example.
65. How do you export a plot into a PNG image file using Matplotlib?
66. Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?

# Exploratory Data Analysis using Python - A Case Study

*Analyzing responses from the Stack Overflow Annual Developer Survey 2020*

![](https://i.imgur.com/qXhHKqv.png)

### Part 9 of "Data Analysis with Python: Zero to Pandas"

This tutorial series is a beginner-friendly introduction to programming and data analysis using the Python programming language. These tutorials take a practical and coding-focused approach. The best way to learn the material is to execute the code and experiment with it yourself. Check out the full series here: 

1. [First Steps with Python and Jupyter](https://jovian.ai/aakashns/first-steps-with-python)
2. [A Quick Tour of Variables and Data Types](https://jovian.ai/aakashns/python-variables-and-data-types)
3. [Branching using Conditional Statements and Loops](https://jovian.ai/aakashns/python-branching-and-loops)
4. [Writing Reusable Code Using Functions](https://jovian.ai/aakashns/python-functions-and-scope)
5. [Reading from and Writing to Files](https://jovian.ai/aakashns/python-os-and-filesystem)
6. [Numerical Computing with Python and Numpy](https://jovian.ai/aakashns/python-numerical-computing-with-numpy)
7. [Analyzing Tabular Data using Pandas](https://jovian.ai/aakashns/python-pandas-data-analysis)
8. [Data Visualization using Matplotlib & Seaborn](https://jovian.ai/aakashns/python-matplotlib-data-visualization)
9. [Exploratory Data Analysis - A Case Study](https://jovian.ai/aakashns/python-eda-stackoverflow-survey)

The following topics are covered in this tutorial:

- Selecting and downloading a dataset
- Data preparation and cleaning
- Exploratory analysis and visualization
- Asking and answering interesting questions
- Summarizing inferences and drawing conclusions


## Introduction

In this tutorial, we'll analyze the StackOverflow developer survey dataset. The dataset contains responses to an annual survey conducted by StackOverflow. You can find the raw data & official analysis here: https://insights.stackoverflow.com/survey.

There are several options for getting the dataset into Jupyter:

- Download the CSV manually and upload it via Jupyter's GUI
- Use the `urlretrieve` function from the `urllib.request` module to download CSV files from a raw URL
- Use a helper library, e.g., [`opendatasets`](https://github.com/JovianML/opendatasets), which contains a collection of curated datasets and provides a helper function for direct download.

We'll use the `opendatasets` helper library to download the files.

In [None]:
import opendatasets as od
od.download('stackoverflow-developer-survey-2020')

You can look through the downloaded files using the "File" > "Open" menu option in Jupyter. It seems like the dataset contains three files:

- `README.txt` - Information about the dataset
- `survey_results_schema.csv` - The list of questions, and shortcodes for each question
- `survey_results_public.csv` - The full list of responses to the questions 

Let's load the CSV files using the Pandas library. We'll use the name `survey_raw_df` for the data frame to indicate this is unprocessed data that we might clean, filter, and modify to prepare a data frame ready for analysis.

In [None]:
import pandas as pd

In [None]:
survey_raw_df = pd.read_csv('stackoverflow-developer-survey-2020/survey_results_public.csv')
survey_raw_df

The dataset contains over 64,000 responses to 60 questions (although many questions are optional). The responses have been anonymized to remove personally identifiable information, and each respondent has been assigned a randomized respondent ID.

Let's view the list of columns in the data frame. 

In [None]:
survey_raw_df.columns

It appears that shortcodes for questions have been used as column names. 

We can refer to the schema file to see the full text of each question. The schema file contains only two columns: `Column` and `QuestionText`. We can load it as a Pandas Series with `Column` as the index and `QuestionText` as the value.

In [None]:
schema_fname = 'stackoverflow-developer-survey-2020/survey_results_schema.csv'
schema_raw = pd.read_csv(schema_fname, index_col='Column').QuestionText

In [None]:
schema_raw

In [None]:
schema_raw['YearsCodePro']
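The same loading pattern can be tried out on a tiny in-memory CSV. In the sketch below, `schema_csv` and its two rows are invented for illustration; only the column names `Column` and `QuestionText` match the real schema file.

```python
import io
import pandas as pd

# Hypothetical two-row schema with the same column names as the real file
schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "Age,What is your age?\n"
    "Country,Where do you live?\n"
)

# index_col makes `Column` the index; selecting QuestionText gives a Series
questions = pd.read_csv(schema_csv, index_col='Column').QuestionText

# The Series now works like a lookup table from shortcode to question text
print(questions['Age'])
```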

We've now loaded the dataset. We're ready to move on to the next step of preprocessing & cleaning the data for our analysis.


## Data Preparation & Cleaning

While the survey responses contain a wealth of information, we'll limit our analysis to the following areas:

- Demographics of the survey respondents and the global programming community
- Distribution of programming skills, experience, and preferences
- Employment-related information, preferences, and opinions

Let's select a subset of columns with the relevant data for our analysis.

In [None]:
selected_columns = [
    # Demographics
    'Country',
    'Age',
    'Gender',
    'EdLevel',
    'UndergradMajor',
    # Programming experience
    'Hobbyist',
    'Age1stCode',
    'YearsCode',
    'YearsCodePro',
    'LanguageWorkedWith',
    'LanguageDesireNextYear',
    'NEWLearn',
    'NEWStuck',
    # Employment
    'Employment',
    'DevType',
    'WorkWeekHrs',
    'JobSat',
    'JobFactors',
    'NEWOvertime',
    'NEWEdImpt'
]

In [None]:
len(selected_columns)

Let's extract a copy of the data from these columns into a new data frame `survey_df`. We can continue to modify it further without affecting the original data frame.

In [None]:
survey_df = survey_raw_df[selected_columns].copy()

In [None]:
schema = schema_raw[selected_columns]

Let's view some basic information about the data frame.

In [None]:
survey_df.shape

In [None]:
survey_df.info()

Most columns have the data type `object`, either because they contain values of different types or contain empty values (`NaN`). It appears that every column contains some empty values since the Non-Null count for every column is lower than the total number of rows (64461). We'll need to deal with empty values and manually adjust the data type for each column on a case-by-case basis. 

Only two of the columns were detected as numeric columns (`Age` and `WorkWeekHrs`), even though a few other columns have mostly numeric values. To make our analysis easier, let's convert some other columns into numeric data types while ignoring any non-numeric values. With `errors='coerce'`, non-numeric values are converted to `NaN`.

In [None]:
survey_df['Age1stCode'] = pd.to_numeric(survey_df.Age1stCode, errors='coerce')
survey_df['YearsCode'] = pd.to_numeric(survey_df.YearsCode, errors='coerce')
survey_df['YearsCodePro'] = pd.to_numeric(survey_df.YearsCodePro, errors='coerce')
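Here is a small sketch of what `errors='coerce'` does, using a short hand-written Series. The text values are illustrative examples of non-numeric responses, not an exhaustive list of what appears in the survey.

```python
import pandas as pd

# A mix of numeric strings and free-text responses (illustrative values)
years = pd.Series(['5', '12', 'Less than 1 year', 'More than 50 years'])

# Values that cannot be parsed as numbers become NaN instead of raising an error
numeric_years = pd.to_numeric(years, errors='coerce')
print(numeric_years)
print(numeric_years.isna().sum())
```

Keep in mind that coercion silently discards the text responses, so the resulting column slightly under-represents very new and very experienced programmers.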

Let's now view some basic statistics about numeric columns.

In [None]:
survey_df.describe()

There seems to be a problem with the age column, as the minimum value is 1 and the maximum is 279. This is a common issue with surveys: responses may contain invalid values due to accidental or intentional errors while responding. A simple fix would be to ignore the rows where the age is higher than 100 years or lower than 10 years as invalid survey responses. We can do this using the `.drop` method, [as explained here](https://www.geeksforgeeks.org/drop-rows-from-the-dataframe-based-on-certain-condition-applied-on-a-column/). 

In [None]:
survey_df.drop(survey_df[survey_df.Age < 10].index, inplace=True)
survey_df.drop(survey_df[survey_df.Age > 100].index, inplace=True)

The same holds for `WorkWeekHrs`. Let's ignore entries where the value for the column is higher than 140 hours (~20 hours per day).

In [None]:
survey_df.drop(survey_df[survey_df.WorkWeekHrs > 140].index, inplace=True)
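As an aside, the same kind of filtering can also be written with a single boolean mask instead of repeated `.drop` calls. The sketch below uses a tiny made-up data frame (`toy_df` and its values are assumptions for illustration) and checks that both approaches keep the same rows, including rows where the age is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one implausibly low age, one implausibly high age, one missing
toy_df = pd.DataFrame({'Age': [1, 25, 279, np.nan, 40]})

# Approach used in this notebook: drop the rows matching each "invalid" condition
dropped_df = toy_df.copy()
dropped_df.drop(dropped_df[dropped_df.Age < 10].index, inplace=True)
dropped_df.drop(dropped_df[dropped_df.Age > 100].index, inplace=True)

# Alternative: build one boolean mask and keep everything that is not flagged
# (comparisons with NaN are False, so rows with a missing age are kept)
is_invalid = (toy_df.Age < 10) | (toy_df.Age > 100)
filtered_df = toy_df[~is_invalid]

print(filtered_df)
```

Either form works; the mask version returns a new data frame rather than modifying the original in place.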

The gender column also allows for picking multiple options. We'll remove values containing more than one option to simplify our analysis.

In [None]:
survey_df['Gender'].value_counts()

In [None]:
import numpy as np

In [None]:
survey_df.where(~(survey_df.Gender.str.contains(';', na=False)), np.nan, inplace=True)
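To make the effect of `where` easier to see, here is a small sketch on a made-up data frame (`toy_df` is hypothetical). It uses the non-inplace form and relies on the default replacement value, which is `NaN`:

```python
import pandas as pd

# Hypothetical toy data with one multi-option response and one missing response
toy_df = pd.DataFrame({
    'Gender': ['Man', 'Woman;Man', None, 'Woman'],
    'Age': [30.0, 41.0, 22.0, 35.0],
})

# True where more than one option was selected; na=False treats missing values as "not multiple"
multiple_selected = toy_df.Gender.str.contains(';', na=False)

# Keep values in rows where the condition is True; everything else becomes NaN
cleaned_df = toy_df.where(~multiple_selected)

print(cleaned_df)
```

Note that the condition is applied to whole rows: in this sketch the `Age` of the multi-option row is blanked out along with its `Gender`, and the row itself stays in the data frame.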

We've now cleaned up and prepared the dataset for analysis. Let's take a look at a sample of rows from the data frame.

In [None]:
survey_df.sample(10)

## Exploratory Analysis and Visualization

Before we ask questions about the survey responses, it would help to understand the respondents' demographics, i.e., country, age, gender, education level, employment level, etc. It's essential to explore these variables to understand how representative the survey is of the worldwide programming community. A survey of this scale generally tends to have some [selection bias](https://en.wikipedia.org/wiki/Selection_bias).

Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

### Country

Let's look at the number of countries from which there are responses in the survey and plot the fifteen countries with the highest number of responses.

In [None]:
schema.Country

In [None]:
survey_df.Country.nunique()

We can identify the countries with the highest number of respondents using the `value_counts` method.

In [None]:
top_countries = survey_df.Country.value_counts().head(15)
top_countries

We can visualize this information using a bar chart.

In [None]:
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title(schema.Country)
sns.barplot(x=top_countries.index, y=top_countries);

It appears that a disproportionately high number of respondents are from the US and India, probably because the survey is in English, and these countries have the highest English-speaking populations. We can already see that the survey may not be representative of the global programming community: programmers from non-English-speaking countries are almost certainly underrepresented.

**Exercise**:
Try finding the percentage of responses from English-speaking vs. non-English speaking countries. You can use [this list of languages spoken in different countries](https://github.com/JovianML/opendatasets/blob/master/data/countries-languages-spoken/countries-languages.csv).


### Age

The distribution of respondents' age is another crucial factor to look at. We can use a histogram to visualize it. 

In [None]:
plt.figure(figsize=(12, 6))
plt.title(schema.Age)
plt.xlabel('Age')
plt.ylabel('Number of respondents')

plt.hist(survey_df.Age, bins=np.arange(10,80,5), color='purple');
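It can help to check which ages the `bins` argument actually covers. The sketch below uses `np.histogram` with the same bin edges on a few made-up ages (the `ages` array is hypothetical):

```python
import numpy as np

# Same bin edges as in the histogram: 10, 15, ..., 75 (the stop value 80 is excluded)
bin_edges = np.arange(10, 80, 5)

# Hypothetical ages, including two that fall beyond the last edge
ages = np.array([12, 24, 24, 31, 75, 79, 99])

counts, _ = np.histogram(ages, bins=bin_edges)

print(bin_edges)
print(counts)
print(counts.sum(), 'of', len(ages), 'values fall inside the bins')
```

With these edges, values above 75 are not counted in any bin, so a small number of older respondents will not appear in the chart.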

It appears that a large percentage of respondents are 20-45 years old. It's somewhat representative of the programming community in general. Many young people have taken up computer science as their field of study or profession in the last 20 years.

**Exercise**: You may want to filter responses by age (or age group) if you'd like to analyze and compare the survey results for different age groups. Create a new column called `AgeGroup` containing values like `Less than 10 years`, `10-18 years`, `18-30 years`, `30-45 years`, `45-60 years` and `Older than 60 years`. Then, repeat the analysis in the rest of this notebook for each age group.
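One possible starting point for this exercise is `pd.cut`, which assigns each value to a labeled interval. The following is only a sketch on made-up ages; the choice of bin edges and of `right=False` (so that an age of exactly 18 lands in `18-30 years`) is an assumption you may want to change:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value
ages = pd.Series([8, 16, 25, 38, 52, 67, np.nan], name='Age')

bins = [0, 10, 18, 30, 45, 60, np.inf]
labels = ['Less than 10 years', '10-18 years', '18-30 years',
          '30-45 years', '45-60 years', 'Older than 60 years']

# right=False makes each interval include its lower edge: [10, 18), [18, 30), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_group)
```

Missing ages stay missing in the result, so they can be excluded or handled separately in any grouped analysis.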


### Gender

Let's look at the distribution of responses for the `Gender` column. It's a well-known fact that women and non-binary genders are underrepresented in the programming community, so we might expect to see a skewed distribution here.

In [None]:
schema.Gender

In [None]:
gender_counts = survey_df.Gender.value_counts()
gender_counts

A pie chart would be a great way to visualize the distribution.

In [None]:
plt.figure(figsize=(12,6))
plt.title(schema.Gender)
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=180);

Only about 8% of survey respondents who have answered the question identify as women or non-binary. This number is lower than the overall percentage of women & non-binary genders in the programming community - which is estimated to be around 12%. 

**Exercise**: It would be interesting to compare the survey responses & preferences across genders. Repeat this analysis with these breakdowns. How do the relative education levels differ across genders? How do the salaries vary? You may find this analysis on the [Gender Divide in Data Science](https://medium.com/datadriveninvestor/exploratory-data-analysis-eda-understanding-the-gender-divide-in-data-science-roles-9faa5da44f5b) useful.
    
### Education Level

Formal education in computer science is often considered an essential requirement for becoming a programmer. However, there are many free resources & tutorials available online to learn programming. Let's compare the education levels of respondents to gain some insight into this. We'll use a horizontal bar plot here.

In [None]:
sns.countplot(y=survey_df.EdLevel)
plt.xticks(rotation=75);
plt.title(schema['EdLevel'])
plt.ylabel(None);

It appears that well over half of the respondents hold a bachelor's or master's degree, so most programmers seem to have some college education. However, it's not clear from this graph alone if they hold a degree in computer science.

**Exercises**: The graph currently shows the number of respondents for each option. Can you modify it to show the percentage instead? Further, try comparing the percentages for each degree for men vs. women. 


Let's also plot undergraduate majors, but this time we'll convert the numbers into percentages and sort the values to make it easier to visualize the order.

In [None]:
schema.UndergradMajor

In [None]:
undergrad_pct = survey_df.UndergradMajor.value_counts() * 100 / survey_df.UndergradMajor.count()

sns.barplot(x=undergrad_pct, y=undergrad_pct.index)

plt.title(schema.UndergradMajor)
plt.ylabel(None);
plt.xlabel('Percentage');

It turns out that 40% of programmers holding a college degree have a field of study other than computer science - which is very encouraging. It seems to suggest that while a college education is helpful in general, you do not need to pursue a major in computer science to become a successful programmer.

**Exercises**: Analyze the `NEWEdImpt` column for respondents who hold some college degree vs. those who don't. Do you notice any difference in opinion?


### Employment

Freelancing or contract work is a common choice among programmers, so it would be interesting to compare the breakdown between full-time, part-time, and freelance work. Let's visualize the data from the `Employment` column.

In [None]:
schema.Employment

In [None]:
(survey_df.Employment.value_counts(normalize=True, ascending=True)*100).plot(kind='barh', color='g')
plt.title(schema.Employment)
plt.xlabel('Percentage');

It appears that close to 10% of respondents are employed part-time or as freelancers.

**Exercise**: Add a new column `EmploymentType` containing the values `Enthusiast` (student or not employed but looking for work), `Professional` (employed full-time, part-time or freelancing), and `Other` (not employed or retired). For each of the graphs that follow, show a comparison between `Enthusiast` and `Professional`.

The `DevType` field contains information about the roles held by respondents. Since the question allows multiple answers, the column contains lists of values separated by a semi-colon `;`, making it a bit harder to analyze directly.


In [None]:
schema.DevType

In [None]:
survey_df.DevType.value_counts()

Let's define a helper function that turns a column containing lists of values (like `survey_df.DevType`) into a data frame with one column for each possible option.

In [None]:
def split_multicolumn(col_series):
    result_df = col_series.to_frame()
    options = []
    # Iterate over the column
    for idx, value in col_series[col_series.notnull()].items():
        # Break each value into list of options
        for option in value.split(';'):
            # Add the option as a column to result
            if option not in result_df.columns:
                options.append(option)
                result_df[option] = False
            # Mark the value in the option column as True
            result_df.at[idx, option] = True
    return result_df[options]

In [None]:
dev_type_df = split_multicolumn(survey_df.DevType)
dev_type_df

The `dev_type_df` has one column for each option that can be selected as a response. If a respondent has chosen an option, the corresponding column's value is `True`. Otherwise, it is `False`.
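As a point of comparison, pandas also offers `Series.str.get_dummies`, which produces a similar indicator table in one call. This is an alternative sketch on made-up values (the `roles` series is hypothetical), not the helper used in this notebook:

```python
import pandas as pd

# Hypothetical multi-select responses separated by ';', including a missing response
roles = pd.Series([
    'Developer, back-end;Developer, front-end',
    'Data scientist or machine learning specialist',
    None,
    'Developer, back-end',
])

# str.get_dummies splits on the separator and returns one 0/1 column per option;
# converting to bool gives the same True/False layout described above
indicator_df = roles.str.get_dummies(sep=';').astype(bool)

print(indicator_df)
```

One visible difference is column order: `get_dummies` sorts the option names, while `split_multicolumn` keeps them in the order they are first encountered.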

We can now use the column-wise totals to identify the most common roles.

In [None]:
dev_type_totals = dev_type_df.sum().sort_values(ascending=False)
dev_type_totals

As one might expect, the most common roles include "Developer" in the name. 

**Exercises**: 

* Can you figure out what percentage of respondents work in roles related to data science? 
* Which positions have the highest percentage of women?

We've only explored a handful of columns from the 20 columns that we selected. Explore and visualize the remaining columns using the empty cells below.

## Asking and Answering Questions

We've already gained several insights about the respondents and the programming community by exploring individual columns of the dataset. Let's ask some specific questions and try to answer them using data frame operations and visualizations.

#### Q: What are the most popular programming languages in 2020? 

To answer this, we can use the `LanguageWorkedWith` column. Similar to `DevType`, respondents were allowed to choose multiple options here.

In [None]:
survey_df.LanguageWorkedWith

In [None]:
languages_worked_df = split_multicolumn(survey_df.LanguageWorkedWith)
languages_worked_df

It appears that a total of 25 languages were included among the options. Let's aggregate these to identify the percentage of respondents who selected each language.

In [None]:
languages_worked_percentages = languages_worked_df.mean().sort_values(ascending=False) * 100
languages_worked_percentages
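The aggregation works because the mean of a boolean column is the fraction of `True` values. Here is a minimal sketch with a made-up table of four respondents (`toy_languages_df` is hypothetical):

```python
import pandas as pd

# Hypothetical indicator table: one row per respondent, one column per language
toy_languages_df = pd.DataFrame({
    'Python': [True, False, True, True],
    'Rust': [False, False, True, False],
})

# True counts as 1 and False as 0, so the mean is the share of respondents who chose the language
percentages = toy_languages_df.mean().sort_values(ascending=False) * 100

print(percentages)
```

Keep in mind that rows that are entirely `False` (for example, respondents who skipped the question) still count toward the denominator.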

We can plot this information using a horizontal bar chart.

In [None]:
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_worked_percentages, y=languages_worked_percentages.index)
plt.title("Languages used in the past year");
plt.xlabel('Percentage');

Perhaps unsurprisingly, JavaScript & HTML/CSS come out at the top, as web development is one of today's most sought-after skills. It also happens to be one of the easiest to get started with. SQL is necessary for working with relational databases, so it's no surprise that most programmers work with SQL regularly. Python seems to be the popular choice for other forms of development, beating out Java, which was the industry standard for server & application development for over two decades.

**Exercises**:

* What are the most common languages used by students? How does the list compare with the most common languages used by professional developers?
* What are the most common languages among respondents who do not describe themselves as "Developer, front-end"?
* What are the most common languages among respondents who work in fields related to data science?
* What are the most common languages used by developers older than 35 years of age? 
* What are the most common languages used by developers in your home country?

#### Q: Which languages are most people interested in learning over the next year?

For this, we can use the `LanguageDesireNextYear` column, with similar processing as the previous one.

In [None]:
languages_interested_df = split_multicolumn(survey_df.LanguageDesireNextYear)
languages_interested_percentages = languages_interested_df.mean().sort_values(ascending=False) * 100
languages_interested_percentages

In [None]:
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_interested_percentages, y=languages_interested_percentages.index)
plt.title("Languages people are interested in learning over the next year");
plt.xlabel('Percentage');

Once again, it's not surprising that Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for a variety of domains: application development, numerical computing, data analysis, machine learning, big data, cloud automation, web scraping, scripting, etc. We're using Python for this very analysis, so we're in good company!

**Exercises**: Repeat the exercises from the previous question, replacing "most common languages" with "languages people are interested in learning/using."

#### Q:  Which are the most loved languages, i.e., a high percentage of people who have used the language want to continue learning & using it over the next year?

While this question may seem tricky at first, it's straightforward to solve using Pandas array operations. Here's what we can do:

- Create a new data frame `languages_loved_df` that contains a `True` value for a language only if the corresponding values in `languages_worked_df` and `languages_interested_df` are both `True`
- Take the column-wise sum of `languages_loved_df` and divide it by the column-wise sum of `languages_worked_df` to get the percentage of respondents who "love" the language
- Sort the results in decreasing order and plot a horizontal bar graph

In [None]:
languages_loved_df = languages_worked_df & languages_interested_df

In [None]:
languages_loved_percentages = (languages_loved_df.sum() * 100/ languages_worked_df.sum()).sort_values(ascending=False)
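To check the logic of these steps on something small, here is a sketch with two made-up indicator tables for four respondents (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical answers: which languages each respondent has used ...
worked_df = pd.DataFrame({'Python': [True, True, False, True],
                          'Java':   [True, True, True, False]})
# ... and which ones they want to use next year
interested_df = pd.DataFrame({'Python': [True, True, True, False],
                              'Java':   [False, True, False, False]})

# Element-wise AND: True only if the language was both used and is still wanted
loved_df = worked_df & interested_df

# Share of a language's current users who want to keep using it
loved_pct = (loved_df.sum() * 100 / worked_df.sum()).sort_values(ascending=False)

print(loved_pct.round(1))
```

Dividing by the number of current users, rather than by all respondents, is what distinguishes "loved" from simply "popular".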

In [None]:
plt.figure(figsize=(12, 12))
sns.barplot(x=languages_loved_percentages, y=languages_loved_percentages.index)
plt.title("Most loved languages");
plt.xlabel('Percentage');

[Rust](https://www.rust-lang.org) has been StackOverflow's most-loved language for [four years in a row](https://stackoverflow.blog/2020/01/20/what-is-rust-and-why-is-it-so-popular/). The second most-loved language is TypeScript, a popular alternative to JavaScript for web development.

Python features at number 3, despite already being one of the most widely-used languages in the world. Python has a solid foundation, is easy to learn & use, has a large ecosystem of domain-specific libraries, and a massive worldwide community.

**Exercises:** What are the most dreaded languages, i.e., languages that people have used in the past year but do not want to learn/use over the next year? Hint: `~languages_interested_df`.

#### Q: In which countries do developers work the highest number of hours per week? Consider countries with more than 250 responses only.

To answer this question, we'll need to use the `groupby` data frame method to aggregate the rows for each country. We'll also need to filter the results to only include the countries with more than 250 respondents.

In [None]:
countries_df = survey_df.groupby('Country')[['WorkWeekHrs']].mean().sort_values('WorkWeekHrs', ascending=False)

In [None]:
high_response_countries_df = countries_df.loc[survey_df.Country.value_counts() > 250].head(15)
high_response_countries_df
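An alternative way to express the same idea is to compute the mean and the number of responses in one `groupby` call and then filter on the count. The sketch below uses a made-up data frame and a threshold of 1 instead of 250 (`toy_df`, `summary_df`, and `min_responses` are hypothetical names):

```python
import pandas as pd

# Hypothetical responses from three countries
toy_df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'WorkWeekHrs': [40.0, 50.0, 45.0, 38.0, 42.0, 80.0],
})

# Named aggregation: average hours and number of responses per country
summary_df = toy_df.groupby('Country').agg(
    mean_hours=('WorkWeekHrs', 'mean'),
    responses=('WorkWeekHrs', 'size'),
)

min_responses = 1  # the notebook uses 250 for the real survey
result_df = (summary_df[summary_df.responses > min_responses]
             .sort_values('mean_hours', ascending=False))

print(result_df)
```

Country `C` has the highest average but only a single response, so it is filtered out; this is exactly the kind of outlier the response threshold is meant to exclude.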

Asian countries like Iran, China, and Israel have the highest working hours, followed by the United States. However, there isn't too much variation overall, and the average working hours seem to be around 40 hours per week.

**Exercises:**

* How do the average work hours compare across continents? You may find this list of [countries in each continent](https://hub.jovian.ml/wp-content/uploads/2020/09/countries.csv) useful.
* Which role has the highest average number of hours worked per week? Which one has the lowest?
* How do the hours worked compare between freelancers and developers working full-time?

#### Q: How important is it to start young to build a career in programming?

Let's create a scatter plot of `Age` vs. `YearsCodePro` (i.e., years of professional coding experience) to answer this question.

In [None]:
schema.YearsCodePro

In [None]:
sns.scatterplot(x='Age', y='YearsCodePro', hue='Hobbyist', data=survey_df)
plt.xlabel("Age")
plt.ylabel("Years of professional coding experience");

You can see points all over the graph, which indicates that you can **start programming professionally at any age**. Many people who have been coding for several decades professionally also seem to enjoy it as a hobby.

We can also view the distribution of the `Age1stCode` column to see when the respondents tried programming for the first time.

In [None]:
plt.title(schema.Age1stCode)
sns.histplot(x=survey_df.Age1stCode, bins=30, kde=True);

As you might expect, most people seem to have had some exposure to programming before the age of 40. However, there are people of all ages and walks of life learning to code.

**Exercises**:

* How does programming experience change opinions & preferences? Repeat the entire analysis while comparing the responses of people who have more than ten years of professional programming experience vs. those who don't. Do you see any interesting trends?
* Compare the years of professional coding experience across different genders. 

Hopefully, you are already thinking of many more questions you'd like to answer using this data. Use the empty cells below to ask and answer more questions.

## Inferences and Conclusions

We've drawn many inferences from the survey. Here's a summary of a few of them:

- Based on the survey respondents' demographics, we can infer that the survey is somewhat representative of the overall programming community. However, it has fewer responses from programmers in non-English-speaking countries and women & non-binary genders.

- The programming community is not as diverse as it can be. Although things are improving, we should make more efforts to support & encourage underrepresented communities, whether in terms of age, country, race, gender, or otherwise.


- Although most programmers hold a college degree, a reasonably large percentage did not have computer science as their college major. Hence, a computer science degree isn't compulsory for learning to code or building a career in programming.

- A significant percentage of programmers either work part-time or as freelancers, which can be a great way to break into the field, especially when you're just getting started.

- JavaScript & HTML/CSS are the most used programming languages in 2020, closely followed by SQL & Python.

- Python is the language most people are interested in learning - since it is an easy-to-learn general-purpose programming language well suited for various domains.

- Rust and TypeScript are the most "loved" languages in 2020, both of which have small but fast-growing communities. Python is a close third, despite already being a widely used language.

- Programmers worldwide seem to be working for around 40 hours a week on average, with slight variations by country.

- You can learn and start programming professionally at any age. You're likely to have a long and fulfilling career if you also enjoy programming as a hobby.

## Exercises

There's a wealth of information to be discovered using the survey, and we've barely scratched the surface. Here are some ideas for further exploration:

- Repeat the analysis for different age groups & genders, and compare the results
- Pick a different set of columns (we chose 20 out of 65) to analyze other facets of the data
- Prepare an analysis focusing on diversity - and identify areas where underrepresented communities are at par with the majority (e.g., education) and where they aren't (e.g., salaries)
- Compare the results of this year's survey with the previous years and identify interesting trends

## References and Future Work

Check out the following resources to learn more about the dataset and tools used in this notebook:

- Stack Overflow Developer Survey: https://insights.stackoverflow.com/survey
- Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html
- Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html
- `opendatasets` Python library: https://github.com/JovianML/opendatasets

As a next step, you can try out a project on another dataset of your choice: https://jovian.ml/aakashns/zerotopandas-course-project-starter .