# Python for Data Analysis

Based on Greg Hamel's sessions: https://www.kaggle.com/hamelg/python-for-data-analysis-index

## 1. Getting started
### Shortcuts
* "A" to create a new cell above the current cell 
* "B" to create a new cell below the current cell 
* "M" to convert the current cell to Markdown 
* "Y" to convert the current cell to code 
* "DD" (press "D" twice) to delete the current cell

## 2. Python Arithmetic
### Operations
* Subtraction: 10 - 3
* Addition: 10 + 3
* Multiplication: 10 * 3
* Decimal division: 10 / 3
* Floor division: 10 // 3
* Exponentiation: 10 ** 3
* Modulus (produces the remainder you'd get when dividing two numbers): 100 % 75 = 25
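
The operators listed above can be tried together in a single cell. This is a minimal sketch; the variable names are illustrative and do not come from the original sessions.

```python
# Illustrative names, not part of the original sessions
a, b = 10, 3

difference = a - b       # 7
total = a + b            # 13
product = a * b          # 30
quotient = a / b         # Decimal division always returns a float
floored = a // b         # 3, floor division rounds down
power = a ** b           # 1000
remainder = 100 % 75     # 25

# Floor division and modulus fit together: a == (a // b) * b + (a % b)
print(floored, a % b, floored * b + a % b)
```

Note that floor division rounds towards negative infinity, so -10 // 3 returns -4 rather than -3.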

### Math module

In [2]:
import math
# Logarithm of argument
math.log(2.7182)

0.9999698965391098

In [3]:
# Add a second argument to specify the log base
math.log(100,10)

2.0

In [4]:
# Raise e to the power of its argument
math.exp(10)

22026.465794806718

In [5]:
# Take the square root of a number
math.sqrt(64)

8.0

In [7]:
# Absolute value of a number
abs(-30)

30

In [8]:
# Constant of pi
math.pi

3.141592653589793

In [9]:
# Round to nearest whole number
round(10.6)

11

In [10]:
# Add a second number to specify number of decimals
round(233.4678, 2)

233.47

In [11]:
# Round down to nearest whole number
math.floor(100.5)

100

## 3. Basic data types


* Integers: whole-numbered numeric values (positive or negative)
* Floats: numbers with decimal values. Infinite values (float("inf") and float("-inf")) are also floats
* Booleans: True/False values that result from logical statements. bool(1) returns True and bool(0) returns False
* Strings: text value ('') or ("")
* None: represents a missing value
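
As a quick check of the types listed above, the built-in type() function reports the data type of any value. The snippet below is a small sketch with made-up example values.

```python
import math

# type() reports the data type of any value
print(type(7))          # <class 'int'>
print(type(7.5))        # <class 'float'>
print(type(True))       # <class 'bool'>
print(type("text"))     # <class 'str'>
print(type(None))       # <class 'NoneType'>

# Infinite values can be built from a string or taken from the math module
positive_inf = float("inf")
negative_inf = -math.inf
print(type(positive_inf))       # <class 'float'>
print(positive_inf > 10**100)   # True
```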

## 4. Variables
A variable is a name you assign to a value or object.

In [1]:
x = 10
y = "Python is fun"
z = 144**0.5 == 12

print(x,y,z)

10 Python is fun True


In [3]:
# 'Tuple unpacking' is the method of extracting variables from a comma separated sequence

x, y, z = (10 ,20 ,30)

print(x)
print(y)
print(z)

10
20
30


In [4]:
# Swap values of two variables

(x, y) = (y, x)

print(x)
print(y)

20
10


When you assign a variable in Python, the variable is a reference to a specific object in the computer's memory.
Reassigning a variable simply switches the reference to a different object in memory.
If the object a variable refers to is altered in place, every variable that refers to that object reflects the change.
All of the data types seen so far are immutable (they cannot be changed after they are created).
If an operation appears to alter an immutable object, it is actually creating a new object rather than altering the existing one.

In [8]:
x = "Hello"
y = x        # Assign y the same object as x
y = y.lower() # Assign y the result of lower()

# Strings are immutable, so Python creates an entirely new string "hello" and stores it separately from "Hello"
# x and y now refer to different objects in memory

print(x)
print(y)

Hello
hello


Lists are a mutable data structure that can hold multiple objects. When you alter a list, Python doesn't make an entirely new list in memory; it changes the list object itself.

In [10]:
x = [1, 2, 3]  # Create a list
y = x          # Assign y the same object as x
y.append(4)    # Add 4 to the end of the list
print(x)
print(y)

# x and y have the same value, even though it may appear that 4 was only added to y

[1, 2, 3, 4]
[1, 2, 3, 4]
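
One way to confirm that two names refer to the same object is the is operator, which compares identity rather than contents. The sketch below contrasts a shared reference with an independent list; the name z is an illustrative addition.

```python
x = [1, 2, 3]
y = x              # y refers to the same list object as x
z = list(x)        # z is a new list with the same contents

print(y is x)      # True: one object, two names
print(z is x)      # False: separate objects
print(z == x)      # True: equal contents

y.append(4)
print(x)           # [1, 2, 3, 4] because x and y share the object
print(z)           # [1, 2, 3] because z is independent
```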


## 5. Lists
### List basics
Lists are one of the most common sequence data types in Python.
* A list is a mutable, ordered collection of objects: **it can be altered after it is created**
* Lists are heterogeneous: they can hold objects of different types
* A list with no content is an empty list, written as []

In [11]:
my_list = ["Lesson", 5, "Is fun?", True]
print(my_list)

['Lesson', 5, 'Is fun?', True]


You can also construct a list by passing an iterable to the list() function. An **iterable** is an object you can step through one item at a time (lists, tuples, strings...)

In [22]:
second_list = list("Life is awesome")
print(second_list)

['L', 'i', 'f', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e']


In [13]:
# Add an item to an existing list with the list.append() function

second_list.append("!!!")
print(second_list)

['L', 'i', 'f', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!!!']


In [21]:
# Remove a matching item from a list with list.remove()
# It deletes the first matching item only

second_list.remove('i')
print(second_list)

['L', 'f', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!!!']


In [23]:
# Join two lists together with the + operator

combined_list = my_list + second_list
print(combined_list)

['Lesson', 5, 'Is fun?', True, 'L', 'f', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!!!']


In [24]:
# Add the items of an iterable to the end of an existing list with the list.extend() function

combined_list.extend([1, 2, 3])
print(combined_list)

['Lesson', 5, 'Is fun?', True, 'L', 'f', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!!!', 1, 2, 3]


In [27]:
# Check min, max, length

num_list = [1, 3, 5, 7, 9]
print(len(num_list))  # Get length of list
print(min(num_list))  # Get min of list
print(max(num_list))  # Get max of list
print(sum(num_list))  # Get the sum of items in list
print(sum(num_list)/len(num_list))  # Get the mean

5
1
9
25
5.0


In [28]:
# Check whether a list contains a certain object with "in"

1 in num_list

True

In [29]:
# Check whether an object is not in a list with "not in"
8 not in num_list

True

In [30]:
# Count the occurrences of an object in a list with list.count()
num_list.count(1)

1

In [34]:
# Reverse and sort list

new_list = [1, 2, 3, 4]
new_list.reverse()
print("Reversed list: ", new_list)

new_list.sort()
print("Sorted list: ", new_list)

Reversed list:  [4, 3, 2, 1]
Sorted list:  [1, 2, 3, 4]


### List indexing and slicing
* Lists are indexed: each position in the sequence has a corresponding number called the "index", used to look up the value at that position
* The first element of a sequence in Python has index 0

In [37]:
another_list = ["Hello","my","name","is","Micaela"]

print(another_list[0]) # Get the first object in list

Hello


In [38]:
# A negative index accesses items from the other end of the list, counting backwards
print(another_list[-1])

Micaela


In [39]:
# IndexError happens when supplying an index outside of the list's range
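
As a small illustration of that error, the sketch below requests an index beyond the end of the list and catches the resulting exception. The index 10 is an arbitrary out-of-range choice.

```python
another_list = ["Hello", "my", "name", "is", "Micaela"]

try:
    another_list[10]            # Valid indexes run from -5 to 4
except IndexError as error:
    message = str(error)
    print("IndexError:", message)
```

Slicing is more forgiving: another_list[3:10] simply returns the items that exist instead of raising an error.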

In [40]:
# When a list contains other indexed objects, such as nested lists, you can supply additional indexes to get the items within them

nested_list = [[0, 1, 2], [3, 4, 5]]
print(nested_list[0][1])

1


You can slice a list using the syntax [start:stop:step]:
* Start: starting index (included in the slice)
* Stop: ending index (the slice stops before this index, so the item at it is not included)
* Step: controls how frequently to sample values along the slice. The default step size is one

In [41]:
another_list[0:3]

['Hello', 'my', 'name']

In [42]:
another_list[0:3:2]

['Hello', 'name']

In [47]:
# Leave the starting and ending index blank to slice from the beginning or up to the end of the list
print(another_list[:4]) # End index is 4
print(another_list[3:]) # Start index is 3
print(another_list[:])  # To slice the entire list
print(another_list[::-1]) # Slices and reverses the list

['Hello', 'my', 'name', 'is']
['is', 'Micaela']
['Hello', 'my', 'name', 'is', 'Micaela']
['Micaela', 'is', 'name', 'my', 'Hello']


In [50]:
# Replace items and delete items
another_list[3] = "new" # Replace the item at index 3
print(another_list)
del another_list[3] # Delete the item at index 3
print(another_list)

['Hello', 'my', 'name', 'new', 'Micaela']
['Hello', 'my', 'name', 'Micaela']


In [52]:
# pop() function removes the final item in a list and returns it
another_list = ["Hello", "here", "I", "am"]

final_item = another_list.pop()
print(final_item)
print(another_list)

am
['Hello', 'here', 'I']


### Copying Lists

In [54]:
# Copy a list with the list.copy() function
list1 = [1, 2, 3]
list2 = list1.copy() # Copy list
list1.append(4)      # Add item to list 1
print("List1: ", list1)
print("List2: ", list2)

List1:  [1, 2, 3, 4]
List2:  [1, 2, 3]


list2 (the copy) is not affected by the append() call on list1. Note that list.copy() makes a **'shallow copy'**: a new list in which each element refers to the same object as the element at that position in the original list. Shallow copies can have *undesired effects* when they copy lists that contain mutable objects, like other lists.

In [56]:
list1 = [1, 2, 3]
list2 = ["The list", list1] # Nest list in another list
list3 = list2.copy()        # Shallow copy list2

print("Before appending to list1: ")
print("List2: ", list2)
print("List3: ", list3, "\n")

list1.append(4)
print("After appending to list1: ")
print("List2: ", list2)
print("List3: ", list3)

Before appending to list1: 
List2:  ['The list', [1, 2, 3]]
List3:  ['The list', [1, 2, 3]] 

After appending to list1: 
List2:  ['The list', [1, 2, 3, 4]]
List3:  ['The list', [1, 2, 3, 4]]


When list1 is altered, both list2 and its shallow copy list3 change, because both refer to the same nested list.
**When working with nested lists, you have to make a deep copy if you want to truly copy the nested objects in the original and avoid this behavior.**

In [57]:
import copy      # Load the copy module

list1 = [1, 2, 3]

list2 = ["List within a list", list1]   # Nest list1 into another list
list3 = copy.deepcopy(list2)            # Deep copy list 2

print("Before appending to list1:")
print("List2:", list2)
print("List3:", list3, "\n")

list1.append(4)                        # Add an item to list1
print("After appending to list1:")
print("List2:", list2)
print("List3:", list3)

Before appending to list1:
List2: ['List within a list', [1, 2, 3]]
List3: ['List within a list', [1, 2, 3]] 

After appending to list1:
List2: ['List within a list', [1, 2, 3, 4]]
List3: ['List within a list', [1, 2, 3]]


list3 isn't altered by the change to list1, because the deep copy holds its own copy of the nested list rather than a reference to list1.
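
The difference between the two kinds of copy can also be checked directly with the is operator. The following is a minimal sketch; the names inner, outer, shallow and deep are illustrative.

```python
import copy

inner = [1, 2, 3]
outer = ["The list", inner]

shallow = outer.copy()          # New outer list, same inner list
deep = copy.deepcopy(outer)     # New outer list and new inner list

print(shallow[1] is inner)      # True: the nested list is shared
print(deep[1] is inner)         # False: the nested list was duplicated

inner.append(4)
print(shallow[1])               # [1, 2, 3, 4]
print(deep[1])                  # [1, 2, 3]
```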

## 6. Tuples and Strings
### Tuples
Tuples are an immutable sequence data type used to hold short collections of related data (example: latitude and longitude coordinates). They are recommended for values that are related and not likely to change. Tuples can store objects of different types.

In [1]:
my_tuple = (1, 3, 5)
print(my_tuple)

(1, 3, 5)


In [2]:
# Change from list to tuple with the tuple() function

my_list = [2, 3, 1]
another_tuple = tuple(my_list)
another_tuple

(2, 3, 1)

Tuples have the same indexing and slicing operations as lists and some of the same functions. **However, tuples cannot be changed once created, so we can't append new values to them or remove values from them.** 
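
To see what "cannot be changed" means in practice, the sketch below attempts an item assignment on a tuple and then builds a new tuple instead. The name longer_tuple is an illustrative addition.

```python
another_tuple = (2, 3, 1)

try:
    another_tuple[0] = 10       # Item assignment is not supported
except TypeError as error:
    print("TypeError:", error)

# "Changing" a tuple means building a new one
longer_tuple = another_tuple + (4,)
print(longer_tuple)             # (2, 3, 1, 4)
print(another_tuple)            # (2, 3, 1)
```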

In [3]:
another_tuple[2] # Index into tuples

1

In [5]:
another_tuple[0:1] # Slice tuples

(2,)

You can sort the objects in a tuple using the sorted() function, but doing so creates a new list containing the result rather than sorting the original tuple in place as list.sort() does. Although a tuple is immutable, it can hold mutable objects such as lists; to copy a tuple without sharing those nested objects, make a deep copy using the copy module.

In [7]:
sorted(another_tuple)

[1, 2, 3]
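
If a sorted tuple is needed rather than a list, one option is to pass the result of sorted() back to tuple(). A short sketch with illustrative names:

```python
another_tuple = (2, 3, 1)

sorted_list = sorted(another_tuple)            # sorted() always returns a new list
sorted_tuple = tuple(sorted(another_tuple))    # Convert back to get a tuple

print(sorted_list)      # [1, 2, 3]
print(sorted_tuple)     # (1, 2, 3)
print(another_tuple)    # (2, 3, 1): the original is unchanged
```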

In [10]:
import copy
list1 = [1, 2, 3]

tuple1 = ("Tuples are immutable", list1)

tuple2 = copy.deepcopy(tuple1) # Make a deep copy

list1.append("But lists are mutable") # The append won't change tuple2 because it is a deep copy, not a shallow one

print(tuple2)

('Tuples are immutable', [1, 2, 3])


### Strings
Strings are immutable sequences of text characters. They support indexing operations, and their indexes start at 0.
**Because strings are immutable, every time you transform a string with a function, Python makes a new string object rather than altering the original one in your computer's memory.**

In [11]:
my_string = "Hello world"
print(my_string[3]) # Get the character at index 3
print(my_string[3:]) # Slice from index 3 to the end
print(my_string[::-1]) # Reverse the string

l
lo world
dlrow olleH


In [13]:
# Functions like len() and count() work on strings
print(len(my_string))
print(my_string.count("l"))

11
3


In [15]:
# Some other functions

print(my_string.lower())    # Make all characters lowercase
print(my_string.upper())    # Make all characters uppercase
print(my_string.title())    # Make the first letter of each word uppercase

# None of these functions change the original my_string

# Find the index of the first occurrence of a substring within a string with the str.find() function
print(my_string.find("w"))   # This one will work because "w" appears in the original my_string
print(my_string.find("W"))   # This will return -1 because it won't find it in my_string

hello world
HELLO WORLD
Hello World
6
-1


In [16]:
# Find and replace with the str.replace() function
my_string.replace("world", "friend")

'Hello friend'

In [17]:
# Split a string with str.split() function
print(my_string.split())    # By default splits on spaces
print(my_string.split("l")) # Supply a substring to split on other values

['Hello', 'world']
['He', '', 'o wor', 'd']


In [18]:
# Split a multi-line string into a list of lines using str.splitlines()

multiline_string = """"I am
a multiline"
string!
"""

multiline_string.splitlines()

['"I am', 'a multiline"', 'string!']

In [19]:
# Strip leading and trailing characters from both ends of a string with str.strip()

print("  strip white space  ".strip())   # Removes white space by default

strip white space


In [2]:
# Join strings with + or with join() function
print("Hello"+"world")
print(" ".join(["Hello","world","please","join","me"]))

Helloworld
Hello world please join me


In [3]:
# To insert values into a string, it is recommended to use str.format()
name = "Joe"
age = 29
city = "Paris"
template_string = "My name is {} I am {} and I live in {}"
template_string.format(name, age, city)

'My name is Joe I am 29 and I live in Paris'
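
The same template can be written in a few equivalent ways. The sketch below compares positional placeholders, named placeholders and f-strings (available from Python 3.6); the placeholder names n, a and c are illustrative.

```python
name = "Joe"
age = 29
city = "Paris"

# Positional placeholders, filled in order
positional = "My name is {} I am {} and I live in {}".format(name, age, city)

# Named placeholders make longer templates easier to read
named = "My name is {n} I am {a} and I live in {c}".format(n=name, a=age, c=city)

# f-strings evaluate the expressions inline
inline = f"My name is {name} I am {age} and I live in {city}"

print(positional == named == inline)   # True
```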

## 7. Data dictionaries and sets

Lists, tuples and strings are ordered sequences. Ordering comes at a price: when searching through a sequence, the computer goes through each element one at a time to find the object you're looking for.

Dictionaries and sets are **unordered** in the sense that their items are not accessed by position (although, since Python 3.7, dictionaries preserve the order in which keys were inserted). They use a technique called *hashing*, which lets you check whether they contain a certain object without having to search each element one at a time.

### Dictionaries
A dictionary, or dict, maps a set of indexes called keys to a set of corresponding values. Dictionaries are **mutable**, so you can add or remove keys and their associated values. **A dictionary's keys must be immutable**, but otherwise they can be of any type (strings, tuples, ints).
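
The restriction on keys can be seen by trying a tuple and a list as keys. This is a small sketch; the coordinates dictionary and its values are made up for illustration.

```python
coordinates = {(48.86, 2.35): "Paris"}    # A tuple works as a key
print(coordinates[(48.86, 2.35)])         # Paris

try:
    bad_dict = {[48.86, 2.35]: "Paris"}   # A list is mutable, so it cannot be hashed
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```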

In [7]:
my_dict = {"name" : "Joe",
          "age" : 29,
          "city" : "Paris"}
print(my_dict)

{'name': 'Joe', 'age': 29, 'city': 'Paris'}


In [9]:
# Index into a dictionary using keys rather than indexes
my_dict["name"]

'Joe'

In [10]:
# Add new items to an existing dictionary
my_dict["new key"] = "new_value"
print(my_dict)

{'name': 'Joe', 'age': 29, 'city': 'Paris', 'new key': 'new_value'}


In [12]:
# Delete existing keys with del
del my_dict["new key"]
print(my_dict)

{'name': 'Joe', 'age': 29, 'city': 'Paris'}


In [13]:
# Check number of items with len()
len(my_dict)

3

In [14]:
# Check wether a certain key exists in the dictionary
"name" in my_dict

True

In [15]:
# Access keys
print(my_dict.keys())

# Access values
print(my_dict.values())

# Access items
print(my_dict.items())

dict_keys(['name', 'age', 'city'])
dict_values(['Joe', 29, 'Paris'])
dict_items([('name', 'Joe'), ('age', 29), ('city', 'Paris')])


In [18]:
# Store data in a dictionary

my_table_dict = {"name" : ["Joe", "Emily", "Lupin"],
                "age" : [29, 27, 30],
                "city" : ["Paris", "Paris", "Paris"]}
print(my_table_dict)

{'name': ['Joe', 'Emily', 'Lupin'], 'age': [29, 27, 30], 'city': ['Paris', 'Paris', 'Paris']}


### Sets
Sets are **unordered, mutable collections of immutable objects that cannot contain duplicates.** Sets are useful for storing and performing operations on data where each value is unique. Sets do not support indexing, but they do support max(), min(), len() and the in operator.

In [19]:
# Create a set with {}
my_set = {1,2,3,4,5,6}
type(my_set)

set

In [20]:
# Add items to a set

my_set.add(7)

print(my_set)

{1, 2, 3, 4, 5, 6, 7}


In [22]:
# Remove items from set

my_set.remove(7)
print(my_set)

{1, 2, 3, 4, 5, 6}


One of the main uses of sets is to perform set operations that compare or combine different sets: operations like union, intersection, difference and checking whether one set is a subset of another.

In [2]:
set1 = {1,3,5,6}
set2 = {1,2,3,4}

# Get the union of two sets
set1.union(set2)

{1, 2, 3, 4, 5, 6}

In [3]:
# Get the intersection of two sets

set1.intersection(set2)

{1, 3}

In [5]:
# Get the difference between two sets

set1.difference(set2)

{5, 6}

In [6]:
# Check whether set 1 is a subset of set 2

set1.issubset(set2)

False
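
The same set operations are also available as operators, which some readers find more compact than the method calls. A minimal sketch using the same two sets:

```python
set1 = {1, 3, 5, 6}
set2 = {1, 2, 3, 4}

print(set1 | set2)        # Union, same as set1.union(set2)
print(set1 & set2)        # Intersection, same as set1.intersection(set2)
print(set1 - set2)        # Difference, same as set1.difference(set2)
print(set1 <= set2)       # Subset test, same as set1.issubset(set2): False
print({1, 3} <= set1)     # True: every item of {1, 3} is in set1
```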

To convert a list into a set use the set() function. Converting a list into a set **drops any duplicate elements** in the list. This is a useful way of dropping duplicates or counting the unique items in a list. **Membership lookups are faster with sets than with lists**, which matters when you plan to look up items repeatedly.

In [7]:
my_list = [1,2,2,2,3,3,4,5,5,5,6]
set(my_list)

{1, 2, 3, 4, 5, 6}
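
The speed difference in membership lookups can be measured with the standard timeit module. The sketch below is only indicative: the collection size and repetition count are arbitrary choices, and the timings depend on the machine.

```python
import timeit

values_list = list(range(100_000))
values_set = set(values_list)
target = 99_999                      # Worst case for the list: the last element

# Time 100 membership checks against each structure
list_seconds = timeit.timeit(lambda: target in values_list, number=100)
set_seconds = timeit.timeit(lambda: target in values_set, number=100)
print(f"list: {list_seconds:.5f} s, set: {set_seconds:.5f} s")

# Count the unique items in a list by converting it to a set
my_list = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6]
unique_count = len(set(my_list))
print(unique_count)                  # 6
```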

## 8. Numpy arrays

### Numpy and array basics
Numpy implements a data structure called N-dimensional array or ndarray. ndarrays contain a collection of items that can be accessed via indexes. They are **homogeneous**, meaning they can only contain objects of the same type and they can be multi-dimensional, making it easy to store 2-dimensional tables or matrices. 
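
Because ndarrays are homogeneous, numpy converts mixed inputs to a single common type when it builds the array. The sketch below shows this with made-up inputs; the exact integer width (int32 or int64) depends on the platform, so only the kind of dtype is printed for the integer case.

```python
import numpy as np

int_array = np.array([1, 2, 3])
mixed_array = np.array([1, 2.5, 3])      # The integers are converted to floats
text_array = np.array([1, "two", 3.0])   # Everything is converted to strings

print(int_array.dtype.kind)     # 'i' (integer)
print(mixed_array.dtype)        # float64
print(text_array.dtype.kind)    # 'U' (Unicode string)
```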

In [4]:
import numpy as np

# Create an ndarray by passing a list to np.array()
my_list = [1, 2, 3, 4]          # Define a list
my_array = np.array(my_list)   # Pass the list to np.array()
type(my_array)

numpy.ndarray

In [7]:
# Create an array with more than one dimension, pass a nested list to np.array()

second_list = [5, 6, 7, 8]
two_d_array = np.array([my_list,second_list])
type(two_d_array)

numpy.ndarray

An ndarray is defined by the number of dimensions it has, the size of each dimension and the type of data it holds. You can check the number and size of dimensions of an ndarray with the shape attribute.

In [8]:
# Check number and size of dimensions

two_d_array.shape

(2, 4)

In [9]:
# Check total number of items (total size) in an array with the size attribute
two_d_array.size

8

In [11]:
# Check the type with the dtype attribute
two_d_array.dtype

dtype('int32')

In [12]:
# np.identity() to create a square 2D array with 1's across the diagonal

np.identity(n=5)  # size of the array

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [15]:
# np.eye() to create a 2D array with 1's across a specified diagonal

np.eye(N = 3,  # Number of rows
       M = 5,  # Number of columns
       k = 1)  # Index of the diagonal (main diagonal (0) is default)

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.]])

In [16]:
# np.ones() to create an array filled with ones:

np.ones(shape = [2, 4])

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [17]:
# np.zeros() to create an array filled with zeros:

np.zeros(shape = [4, 6])

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

### Array indexing and slicing
Array indexing and slicing mirror the syntax of Python lists.

In [19]:
# Get the item at index 3

one_d_array = np.array([1, 2, 3, 4, 5, 6])
one_d_array[3]

4

In [21]:
# Get a slice from index 3 to the end INCLUDES 3
one_d_array[3:]

array([4, 5, 6])

In [22]:
# Slice backwards to reverse the array
one_d_array[::-1]

array([6, 5, 4, 3, 2, 1])

In [24]:
# If an ndarray has more than one dimension, separate indexes for each dimension with a comma
two_d_array = np.array([one_d_array, one_d_array + 6, one_d_array + 12])
print(two_d_array)

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]


In [25]:
# Get the element at row index 1, column index 4
two_d_array[1, 4] 

11

In [26]:
# Slice elements starting at row index 1 and column index 4
two_d_array[1:,4:]

array([[11, 12],
       [17, 18]])

In [27]:
# Reverse both dimensions (180 degree rotation)
two_d_array[::-1,::-1]

array([[18, 17, 16, 15, 14, 13],
       [12, 11, 10,  9,  8,  7],
       [ 6,  5,  4,  3,  2,  1]])

### Reshaping arrays
Numpy provides functions to manipulate arrays quickly without complicated indexing operations.

In [28]:
# Reshape an array into a different shape while keeping the same data with np.reshape()
np.reshape(two_d_array,    # Array to reshape
           (6, 3))         # Dimensions of the new array

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

In [30]:
# Unravel a multidimensional array into 1 dimension with np.ravel()
np.ravel(a = two_d_array,
        order = 'C')    # Use C-style unraveling (by rows)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])

In [31]:
# Unravel a multidimensional array into 1 dimension with np.ravel()
np.ravel(a = two_d_array,
        order = 'F')    # Use Fortran-style unraveling (by columns)

array([ 1,  7, 13,  2,  8, 14,  3,  9, 15,  4, 10, 16,  5, 11, 17,  6, 12,
       18])

In [32]:
# Flatten a multidimensional array into 1 dimension and return a copy of the result with ndarray.flatten()
two_d_array.flatten()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])
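
The practical difference between np.ravel() and ndarray.flatten() is that flatten() always returns a copy, while ravel() returns a view of the original data when it can. The sketch below checks this with np.shares_memory(); the array is rebuilt with np.arange() so that the block runs on its own.

```python
import numpy as np

two_d_array = np.arange(1, 19).reshape(3, 6)

raveled = np.ravel(two_d_array)      # Returns a view when no copy is needed
flattened = two_d_array.flatten()    # Always returns a copy

print(np.shares_memory(two_d_array, raveled))     # True for this contiguous array
print(np.shares_memory(two_d_array, flattened))   # False

flattened[0] = 99
print(two_d_array[0, 0])             # 1: the original is untouched
```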

In [33]:
# Get the Transpose of an array with ndarray.T
two_d_array.T

array([[ 1,  7, 13],
       [ 2,  8, 14],
       [ 3,  9, 15],
       [ 4, 10, 16],
       [ 5, 11, 17],
       [ 6, 12, 18]])

In [40]:
# Flip an array vertically
print(two_d_array)
print(np.flipud(two_d_array))

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]
[[13 14 15 16 17 18]
 [ 7  8  9 10 11 12]
 [ 1  2  3  4  5  6]]


In [41]:
# Flip an array horizontally
print(two_d_array)
print(np.fliplr(two_d_array))

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]
[[ 6  5  4  3  2  1]
 [12 11 10  9  8  7]
 [18 17 16 15 14 13]]


In [42]:
# Rotate an array 90 degrees counter-clockwise with np.rot90()
np.rot90(two_d_array, 
         k=1)    # Number of 90 degree rotations

array([[ 6, 12, 18],
       [ 5, 11, 17],
       [ 4, 10, 16],
       [ 3,  9, 15],
       [ 2,  8, 14],
       [ 1,  7, 13]])

In [44]:
# Shift elements in an array along a given dimension with np.roll()
np.roll(a=two_d_array,
       shift = 2,     # Shift elements two positions
       axis = 1)      # In each row

array([[ 5,  6,  1,  2,  3,  4],
       [11, 12,  7,  8,  9, 10],
       [17, 18, 13, 14, 15, 16]])

In [46]:
# Leave the axis argument empty to shift on a flattened version of the array (shift across all dimensions)
np.roll(a=two_d_array,
       shift = 2)     # Shift elements two positions

array([[17, 18,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15, 16]])

In [48]:
# Join arrays along an axis with np.concatenate()
array_to_join = np.array([[10,20,30],[40,50,60],[70,80,90]])
np.concatenate((two_d_array, array_to_join),   # Arrays to join
              axis = 1)                        # Axis to join upon

array([[ 1,  2,  3,  4,  5,  6, 10, 20, 30],
       [ 7,  8,  9, 10, 11, 12, 40, 50, 60],
       [13, 14, 15, 16, 17, 18, 70, 80, 90]])

### Array math operations
Basic math operations can be performed with ndarrays.

Math operations: https://numpy.org/doc/stable/reference/routines.math.html

Linear algebra operations: https://numpy.org/doc/stable/reference/routines.linalg.html

In [49]:
two_d_array + 100  # Add 100 to each element

array([[101, 102, 103, 104, 105, 106],
       [107, 108, 109, 110, 111, 112],
       [113, 114, 115, 116, 117, 118]])

In [50]:
two_d_array - 100  # Subtract 100 from each element

array([[-99, -98, -97, -96, -95, -94],
       [-93, -92, -91, -90, -89, -88],
       [-87, -86, -85, -84, -83, -82]])

In [51]:
two_d_array * 2   # Multiply each element by 2

array([[ 2,  4,  6,  8, 10, 12],
       [14, 16, 18, 20, 22, 24],
       [26, 28, 30, 32, 34, 36]])

In [52]:
two_d_array ** 2  # Square each element 

array([[  1,   4,   9,  16,  25,  36],
       [ 49,  64,  81, 100, 121, 144],
       [169, 196, 225, 256, 289, 324]], dtype=int32)

In [58]:
two_d_array % 2   # Take modulus of each element

array([[1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0]], dtype=int32)

You can also use the basic math operators on two arrays of the same shape. When operating on two arrays, the basic math operators function in an element-wise fashion, returning an array with the same shape as the original.

In [59]:
small_array1 = np.array([[1,2],[3,4]])

small_array1 + small_array1

array([[2, 4],
       [6, 8]])

In [60]:
small_array1 - small_array1

array([[0, 0],
       [0, 0]])

In [61]:
small_array1 * small_array1

array([[ 1,  4],
       [ 9, 16]])

In [62]:
small_array1 ** small_array1

array([[  1,   4],
       [ 27, 256]], dtype=int32)

In [63]:
# Get the mean of all elements in an array with np.mean()
np.mean(two_d_array)

9.5

In [64]:
# Provide an axis argument to get means across a dimension
np.mean(two_d_array,
       axis = 1)    # Get means of each row

array([ 3.5,  9.5, 15.5])

In [65]:
# Get the standard deviation of all the elements in an array with np.std()
np.std(two_d_array)

5.188127472091127

In [66]:
# Provide an axis argument to get std across a dimension
np.std(two_d_array,
      axis = 1)     # Get std of each row

array([1.70782513, 1.70782513, 1.70782513])

In [67]:
# Provide an axis argument to get std across a dimension
np.std(two_d_array,
      axis = 0)     # Get std of each column

array([4.89897949, 4.89897949, 4.89897949, 4.89897949, 4.89897949,
       4.89897949])

In [68]:
# Sum of the elements of an array across an axis with np.sum()
np.sum(two_d_array,
      axis=1)      # Get the row sums

array([21, 57, 93])

In [69]:
np.sum(two_d_array,
      axis=0)     # Get the column sums

array([21, 24, 27, 30, 33, 36])

In [70]:
# Take the log of each element in an array with np.log()
np.log(two_d_array)

array([[0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
        1.79175947],
       [1.94591015, 2.07944154, 2.19722458, 2.30258509, 2.39789527,
        2.48490665],
       [2.56494936, 2.63905733, 2.7080502 , 2.77258872, 2.83321334,
        2.89037176]])

In [71]:
# Take the square root of each element with np.sqrt()
np.sqrt(two_d_array)

array([[1.        , 1.41421356, 1.73205081, 2.        , 2.23606798,
        2.44948974],
       [2.64575131, 2.82842712, 3.        , 3.16227766, 3.31662479,
        3.46410162],
       [3.60555128, 3.74165739, 3.87298335, 4.        , 4.12310563,
        4.24264069]])

Take the dot product of two arrays with np.dot(). This function performs an element-wise multiply and then a sum for 1-dimensional arrays (vectors) and a matrix multiplication for 2-dimensional arrays.

In [72]:
# Take the vector dot product of row 0 and row 1
np.dot(two_d_array[0,0:],  # Slice row 0
       two_d_array[1,0:])  # Slice row 1

217

In [74]:
# Do a matrix multiply
np.dot(small_array1, small_array1)

array([[ 7, 10],
       [15, 22]])
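
Both behaviors of np.dot() can be cross-checked against simpler expressions. In the sketch below, the vector case is compared with an explicit multiply-and-sum, and the matrix case with the @ operator; the names row0 and row1 are illustrative stand-ins for the two sliced rows.

```python
import numpy as np

row0 = np.array([1, 2, 3, 4, 5, 6])
row1 = row0 + 6

# For vectors, the dot product is the sum of the element-wise products
vector_dot = np.dot(row0, row1)
print(vector_dot)                  # 217
print(np.sum(row0 * row1))         # 217

small_array1 = np.array([[1, 2], [3, 4]])

# For 2D arrays, the @ operator performs the same matrix multiplication
matrix_product = small_array1 @ small_array1
print(matrix_product)
```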

## 9. Pandas DataFrames
Numpy arrays fall short when it comes to dealing with heterogeneous data sets. To store data from an external source like an Excel workbook or a database, we need a data structure that can hold different data types. It is also desirable to be able to refer to rows and columns in the data by custom labels rather than numbered indexes.

The Pandas library offers data structures designed with this in mind: the series and the DataFrame. 
* **Series**: 1-dimensional labeled arrays similar to numpy's ndarrays
* **DataFrames**: 2-dimensional labeled structures, that essentially function as spreadsheet tables

### Pandas Series
Series are very similar to ndarrays: the main difference between them is that with series, you can provide custom index labels and then operations you perform on series automatically align the data based on the labels. 

A series is a valid argument to many of the numpy array functions.

In [2]:
import numpy as np
import pandas as pd

# Define a new series by passing a collection of homogeneous data like ndarray or list, along with a list of associated indexes

my_series = pd.Series(data = [2,3,5,4],           # Data
                      index = ['a','b','c','d'])  # Indexes
my_series

a    2
b    3
c    5
d    4
dtype: int64

In [77]:
# Create a series from a dictionary. The dictionary keys act as the labels and the values act as the data
my_dict = {'x':2, 'a':5, 'b':4, 'c':8}
my_series2 = pd.Series(my_dict)

my_series2

x    2
a    5
b    4
c    8
dtype: int64

In [78]:
# You can access items in a series by the labels:
my_series['a']

2

In [81]:
# Numeric (positional) indexing also works, through .iloc
my_series.iloc[0]

2

In [82]:
# If you take a slice of a series, you get both the values and the labels contained in the slice:
my_series[1:3]

b    3
c    5
dtype: int64

In [83]:
# As mentioned earlier, operations performed on two series align by label:
my_series + my_series

a     4
b     6
c    10
d     8
dtype: int64

In [84]:
# If you perform an operation with 2 series that have different labels, the unmatched labels will return a value of NaN
my_series + my_series2

a     7.0
b     7.0
c    13.0
d     NaN
x     NaN
dtype: float64
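
When NaN is not the desired result for unmatched labels, one option is the Series.add() method with its fill_value parameter, which substitutes a value for a label missing from one of the series. The sketch below uses 0 as the fill value, which is an illustrative choice.

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# Treat a label that is missing from one series as 0 instead of producing NaN
filled = my_series.add(my_series2, fill_value=0)
print(filled)
```

With this approach, d and x keep the values 4.0 and 2.0 from the series that contains them.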

In [85]:
# numpy array functions generally work on series
np.mean(my_series)

3.5

### DataFrame creation and indexing
A DataFrame is a 2D table with labeled columns that can each hold a different type of data. DataFrames are essentially the kind of table you'd see in an Excel workbook or a SQL database. They're the de facto standard data structure for working with tabular data in Python.

You can create a DataFrame (df) out of a variety of data sources, such as dictionaries, 2D numpy arrays and series, using the pd.DataFrame() function. Dictionaries provide an intuitive way to create a df: when passed to pd.DataFrame(), a dictionary's keys become column labels and the values become the columns themselves.

More indexing operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

In [86]:
# Create a dictionary with some different data types as values
my_dict = {'name' : ['Joe','Bob','Frans'],
           'age' : np.array([10,15,20]),
           'weight' : (75,123,239),
           'height' : pd.Series([4.5,5,6.1],
                               index = ['Joe','Bob','Frans']),
           'siblings' : 1,
           'gender' : 'M'}

df = pd.DataFrame(my_dict)     # Convert the dict to DataFrame
df

        name  age  weight  height  siblings  gender
Joe      Joe   10      75     4.5         1       M
Bob      Bob   15     123     5.0         1       M
Frans  Frans   20     239     6.1         1       M


When passing a single value, the value is duplicated to every row in the DataFrame. The rows were automatically given indexes that align with the indexes of the series we passed in for the 'height' column. If we had not used a series with index labels to create the df, it would have been given numeric row index labels by default.

In [88]:
# Dictionary without index labels
my_dict2 = {'name' : ['Joe','Bob','Frans'],
            'age' : np.array([10,15,20]),
           'weight' : (75,123,239),
           'height' : [4.5,5,6.1],
           'siblings' : 1,
           'gender' : 'M'}

df2 = pd.DataFrame(my_dict2)     # Convert the dict to DataFrame
df2 

    name  age  weight  height  siblings  gender
0    Joe   10      75     4.5         1       M
1    Bob   15     123     5.0         1       M
2  Frans   20     239     6.1         1       M


In [99]:
# You can provide custom row labels when creating a df by adding the index argument:
df2 = pd.DataFrame(my_dict2,
                   index = my_dict2['name'])
df2

        name  age  weight  height  siblings  gender
Joe      Joe   10      75     4.5         1       M
Bob      Bob   15     123     5.0         1       M
Frans  Frans   20     239     6.1         1       M


A df behaves like a dictionary of series objects that each have the same length and indexes. This means we can get, add and delete columns in a df the same way we would with a dictionary.

In [92]:
# Get a column by name
df2['weight']

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64

In [93]:
# You can also get a column with dot (attribute) notation
df2.weight

Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64

In [100]:
# Add a column
df2['IQ'] =[130,105,115]
df2

Unnamed: 0,name,age,weight,height,siblings,gender,IQ
Joe,Joe,10,75,4.5,1,M,130
Bob,Bob,15,123,5.0,1,M,105
Frans,Frans,20,239,6.1,1,M,115


In [98]:
# Delete a column
del df2['age']
df2

Unnamed: 0,weight,height,siblings,gender,IQ
Joe,75,4.5,1,M,130
Bob,123,5.0,1,M,105
Frans,239,6.1,1,M,115


In [101]:
# Inserting a single value into a df causes it to populate across all the rows:
df2['Married'] = False
df2

Unnamed: 0,name,age,weight,height,siblings,gender,IQ,Married
Joe,Joe,10,75,4.5,1,M,130,False
Bob,Bob,15,123,5.0,1,M,105,False
Frans,Frans,20,239,6.1,1,M,115,False


In [102]:
# When inserting a series into a df, rows are matched by index. Unmatched rows will be filled with NaN
df2['College'] = pd.Series(['Harvard'],
                            index = ['Frans'])
df2

Unnamed: 0,name,age,weight,height,siblings,gender,IQ,Married,College
Joe,Joe,10,75,4.5,1,M,130,False,
Bob,Bob,15,123,5.0,1,M,105,False,
Frans,Frans,20,239,6.1,1,M,115,False,Harvard


In [103]:
# Select rows and columns by label with df.loc[row,column]
df2.loc['Joe']   # Select row 'Joe'

name          Joe
age            10
weight         75
height        4.5
siblings        1
gender          M
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object

In [104]:
df2.loc['Joe','IQ']  # Select row 'Joe' and column 'IQ'

130

In [105]:
# Slice by label
df2.loc['Joe':'Bob','IQ':'College']

Unnamed: 0,IQ,Married,College
Joe,130,False,
Bob,105,False,


In [106]:
# Select rows or columns by numeric index with df.iloc[row,column]
df2.iloc[0] # Get row 0

name          Joe
age            10
weight         75
height        4.5
siblings        1
gender          M
IQ            130
Married     False
College       NaN
Name: Joe, dtype: object

In [107]:
df2.iloc[0,5] # Get row 0 and column 5

'M'

In [108]:
df2.iloc[0:2,5:8] # Slice by numeric row and column index

Unnamed: 0,gender,IQ,Married
Joe,M,130,False
Bob,M,105,False
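One difference between the two slicing styles is easy to miss: a label slice with df.loc includes the end label, while a position slice with df.iloc excludes the end position, like regular Python slicing. A minimal sketch on a made-up table (the name scores is illustrative):

```python
import pandas as pd

scores = pd.DataFrame({"IQ": [130, 105, 115],
                       "age": [10, 15, 20]},
                      index=["Joe", "Bob", "Frans"])

by_label = scores.loc["Joe":"Bob"]     # Includes 'Bob', the end label
by_position = scores.iloc[0:1]         # Excludes position 1, the end position

print(by_label)
print(by_position)
```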


In [109]:
# Select rows by passing boolean values. Only the rows where the bool is true are returned
boolean_index = [False, True, True]
df2[boolean_index]

Unnamed: 0,name,age,weight,height,siblings,gender,IQ,Married,College
Bob,Bob,15,123,5.0,1,M,105,False,
Frans,Frans,20,239,6.1,1,M,115,False,Harvard


In [110]:
# Create a boolean sequence with a logical comparison
boolean_index = df2['age'] > 12
# Use the index to get the rows where age > 12
df2[boolean_index]

Unnamed: 0,name,age,weight,height,siblings,gender,IQ,Married,College
Bob,Bob,15,123,5.0,1,M,105,False,
Frans,Frans,20,239,6.1,1,M,115,False,Harvard


In [111]:
# Indexing all in one operation
df2[df2['age'] > 12]

Unnamed: 0,name,age,weight,height,siblings,gender,IQ,Married,College
Bob,Bob,15,123,5.0,1,M,105,False,
Frans,Frans,20,239,6.1,1,M,115,False,Harvard
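Several logical checks can be combined into one boolean index. The sketch below uses a made-up table (the name people is illustrative); each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

people = pd.DataFrame({"age": [10, 15, 20],
                       "weight": [75, 123, 239]},
                      index=["Joe", "Bob", "Frans"])

# & means "and", | means "or", ~ means "not"
older_and_lighter = people[(people["age"] > 12) & (people["weight"] < 200)]
young_or_heavy = people[(people["age"] < 12) | (people["weight"] > 200)]
not_older = people[~(people["age"] > 12)]

print(older_and_lighter)
print(young_or_heavy)
print(not_older)
```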


### Exploring DataFrames
Many ndarray functions work on DataFrames.

In [7]:
titanic_train = pd.read_csv("C:/Users/Micaela Rodriguez/Mentorías/Python for Data Analysis/train.csv")
type(titanic_train)

pandas.core.frame.DataFrame

In [8]:
# Check shape of dataframe 
titanic_train.shape

(891, 12)

Titanic training data has 891 rows and 12 columns.

In [9]:
# Check the first n rows with df.head(n)
titanic_train.head(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [11]:
# Check the last n rows with df.tail(n)
titanic_train.tail(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


With large datasets, head() and tail() help get a sense of the data without printing hundreds of rows. 

Each line represents a different passenger, so we'll set the row indexes to the passenger's name.

In [12]:
# You can assign new row indexes with df.index
titanic_train.index = titanic_train["Name"]   # Set index to name
del titanic_train["Name"]                     # Delete name column
print(titanic_train.index[0:10])              # Print new indexes

Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)'],
      dtype='object', name='Name')
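Assigning to df.index and then deleting the column can also be done in one step with set_index(). This is a sketch on a small stand-in table, not the Titanic data; the passenger names and ages are made up.

```python
import pandas as pd

# Small stand-in table; the names and values are illustrative only
passengers = pd.DataFrame({"Name": ["Passenger A", "Passenger B"],
                           "Age": [22.0, 38.0]})

# set_index() moves the column into the index and returns a new DataFrame
passengers_by_name = passengers.set_index("Name")

print(passengers_by_name)
print(passengers_by_name.loc["Passenger B", "Age"])
```

Because set_index() returns a new DataFrame by default, the original table keeps its "Name" column unless you reassign the result.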


In [15]:
titanic_train.head(2)

Name,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
"Braund, Mr. Owen Harris",1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C


In [13]:
# Access the column labels with df.columns
titanic_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [17]:
# Get a quick statistical summary of the data set with df.describe()
titanic_train.describe()  # Summarize the numeric columns

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [18]:
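# Keep only the numeric columns first: newer pandas versions no longer skip text columns automatically
np.mean(titanic_train.select_dtypes(include = "number"),
        axis = 0)      # Get the mean of each numeric column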

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

In [19]:
# Get an overview of the overall structure of a DataFrame with df.info()
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, Braund, Mr. Owen Harris to Dooley, Mr. Patrick
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.5+ KB
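The non-null counts reported by df.info() hint at missing data (for instance, Age and Cabin above have fewer non-null entries than rows). To count missing values directly, one option is isna() followed by sum(). The table below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with some missing entries
sample = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan],
                       "Cabin": ["C85", None, None, None],
                       "Fare": [7.25, 71.28, 7.93, 8.05]})

missing_per_column = sample.isna().sum()   # Number of missing values in each column
print(missing_per_column)
```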


## 10. Reading and writing data
Reading data into dataframes is the first step when conducting data analysis in Python.

### Python Working Directory and File Paths
When you launch Python it starts in a default location in your computer's file system, known as the **working directory**. You can check your current working directory by importing the os module and then using os.getcwd()

In [20]:
# Get current directory
import os
os.getcwd()

'C:\\Users\\Micaela Rodriguez\\Documents\\GitHub\\PythonForDataAnalysis'

The working directory acts as your starting point for accessing files on your computer from within Python. To load a dataset, you either need to put the file in your working directory, change your working directory to the folder containing the data or supply the data file's file path to the data reading function.  
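The third option, supplying a file path, can use a path relative to the working directory. A minimal sketch with the os module; the folder "data" and the file "train.csv" are hypothetical names that may not exist on your machine.

```python
import os

# "data" and "train.csv" are hypothetical names used only for illustration
relative_path = os.path.join("data", "train.csv")

# A relative path is resolved against the current working directory
absolute_path = os.path.abspath(relative_path)

print(relative_path)
print(absolute_path)
print(os.path.exists(absolute_path))   # Whether the file actually exists on this machine
```

Building paths with os.path.join() avoids typing the separators by hand, which differ between Windows and other systems.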

In [21]:
# Change your working directory by supplying a new file path in quotes to os.chdir()
os.chdir("C:\\Users\\Micaela Rodriguez\\Documents\\GitHub")
os.getcwd()    # Check the working directory again

'C:\\Users\\Micaela Rodriguez\\Documents\\GitHub'

In [22]:
# Change back to original directory
os.chdir("C:\\Users\\Micaela Rodriguez\\Documents\\GitHub\\PythonForDataAnalysis")
os.getcwd()    # Check the working directory again

'C:\\Users\\Micaela Rodriguez\\Documents\\GitHub\\PythonForDataAnalysis'

### Reading CSV and TSV files
Data is commonly stored in CSV (comma-separated values) and TSV (tab-separated values) files.

* CSV:

You can read CSV files with the pd.read_csv() function.

* TSV:

You can load a TSV file with the pd.read_table() function, a general file reading function that reads TSV files by default, but you can use it to read flat text files separated by any delimiting character by setting the "sep" argument to a different character. More options here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html
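To try the readers without any files on disk, the sketch below feeds small made-up strings to pd.read_csv() and pd.read_table() through io.StringIO, which lets a string stand in for a file.

```python
import io
import pandas as pd

csv_text = "name,age\nJoe,10\nBob,15\n"
tsv_text = "name\tage\nJoe\t10\nBob\t15\n"
semicolon_text = "name;age\nJoe;10\nBob;15\n"

from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))                       # Tab is the default separator
from_semicolon = pd.read_table(io.StringIO(semicolon_text), sep=";")  # Custom separator

print(from_csv)
print(from_csv.equals(from_tsv), from_csv.equals(from_semicolon))
```

All three calls produce the same df; only the delimiter in the source text differs.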

### Reading Excel files
Pandas is capable of loading data directly from Excel file formats. 
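To load data from Excel, pandas relies on a separate reader module such as "xlrd", which comes with the Anaconda distribution. Note that recent versions of xlrd only read the older .xls format; for .xlsx files pandas uses the "openpyxl" module.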

In [31]:
# Load data from an Excel file to a dataframe with pd.read_excel() supplying the file path and the name of the worksheet
draft = pd.read_excel("C://Users//Micaela Rodriguez//Documents//Alia//Documentos Alia//Master//Costos.xlsx",
                     sheet_name = "Hoja1")  # Add name of the sheet
draft.head(6)

Unnamed: 0,Program,University,City,Total tuition (USD),Application fee,Dorm cost per month,Language,Duration,Admission requirements,Undergraduate GPA,TOEFL iBT,Recommendation letters,Website
0,MA Statistics,University of Haifa,Haifa,11339,140,465,English,1 year (non thesis),"2 recommendation letters, personal statement, ...",0.8,89,2.0,https://uhaifa.org/academics/graduate-programs...
1,MA Human Computer Interaction,Reichman University,Hertzliya,22000,95 (application) and 2500 down payment,880 USD single / 680 USD roomie (without elect...,English,1 year (non thesis),Interview on skype or person,0.85,,2.0,https://www.idc.ac.il/en/schools/rris/graduate...
2,MSc Machine Learning & Data Science,Reichman University,Hertzliya,22000,100 (application) and 2500 down payment,880 USD single / 680 USD roomie (without elect...,English,2 years,Interview on skype or person,0.85,,,https://www.idc.ac.il/en/schools/rris/graduate...
3,Msc in Industrial Engineering (Operations & Lo...,Tel Aviv University,Tel Aviv,5500 USD per year,100,600 - 800 USD,English,2 years (it depends),Class ranking,0.8,79 (B2),,https://en-engineering.tau.ac.il/industrial/ms...
4,Certificate of completion of Y-DATA program,Yandex School of Data Science,Tel Aviv,9000,,,English,1 year (non thesis),"Online test of programming skills, interview",,,,https://yandexdataschool.com/israel/


### Reading Web data
Pandas comes with a read_html() function to read data directly from web pages. However, to use this function you need the html5lib package. Install it by opening a command console and running "pip install html5lib" (without quotes). Note that HTML can have all sorts of nested structures and formatting quirks, which makes parsing it to extract data troublesome. The read_html() function does its best to draw out tabular data in web pages, but the results aren't always perfect.

### Writing Data
Most reading functions in pandas have a corresponding writer function that lets you write data back into the format it came from. Saving data in CSV format is recommended.

In [32]:
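# Write a dataframe to CSV in the working directory by passing the desired file name to df.to_csv()
draft.to_csv("draft_saved.csv")
# List the contents of a folder with os.listdir(); the CSV is saved in the working directory, so it is not listed in this folder
os.listdir("C://Users//Micaela Rodriguez//Documents//Alia//Documentos Alia//Master")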

['bar-ilan-university_en.pdf',
 'Cartas recomendación',
 'Costos.xlsx',
 'Expediente académico.xlsx',
 'IDC Herzliya',
 'ITC',
 'MA Statistics - University of Haifa',
 'Preguntas Sojnut.docx',
 'sample-recommendation-letter.pdf',
 'TAU',
 'Tel_Aviv_Adobe.jpg',
 'TOEFL']
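To confirm that a file was written where you expect, you can list the folder it was saved to and read the file back. This sketch uses a temporary folder and a made-up table so it leaves nothing behind; the file name "table_saved.csv" is hypothetical.

```python
import os
import tempfile
import pandas as pd

table = pd.DataFrame({"Program": ["A", "B"], "Cost": [100, 250]})

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "table_saved.csv")   # Hypothetical file name
    table.to_csv(file_path, index=False)                  # index=False leaves the row labels out of the file
    saved_files = os.listdir(folder)                      # The new file shows up in the folder listing
    reloaded = pd.read_csv(file_path)                     # Read the file back into a df

print(saved_files)
print(reloaded.equals(table))
```

Without index=False, the row labels are written as an extra first column, which then shows up as "Unnamed: 0" when the file is read back.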

## 11. Control Flow
When running code, statements are executed in the order in which they appear. **Control flow statements let you alter the order in which code executes.**

### If, Else and Elif
* **If**: Checks whether some logical expression evaluates to true or false and executes a code block if the expression is true. It is often accompanied by an else statement.
* **Else**: Comes after an if statement and executes code in the event that the logical expression checked by the if statement is false.
* **Elif**: Else if. Performs an additional logical check when the preceding checks are false and executes its code if the check is true.

In [1]:
y = 25
x = 10
if x > y :
    print("x is greater than y")
else: 
    print("y is greater than x")

y is greater than x


In [2]:
y = 10
if x > y :
    print("x is greater than y")
elif x == y :
    print("x and y are equal")
else :
    print("y is greater than x")

x and y are equal


### For loops
A for loop is a programming construct that lets you go through each item in a sequence and perform some operation on each one. For loops execute their contents, at most, a number of times equal to the length of the sequence you are looping over.
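As a minimal sketch of the basic pattern, the loop below visits each item of a made-up list once and keeps a running total; the names ages, total and iterations are illustrative.

```python
ages = [10, 15, 20]

total = 0
iterations = 0
for age in ages:                 # The body runs once per item in the list
    total = total + age
    iterations = iterations + 1

print(total, iterations)
```

The number of iterations equals the length of the list, since nothing in the body skips or stops the loop.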

In [5]:
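# Assumed definition (my_sequence is not defined in these notes): the multiples of 10 from 0 to 100
my_sequence = list(range(0, 101, 10))

# The continue keyword causes a for loop to skip the current iteration and go to the next one:
for number in my_sequence :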
    if number < 50 :
        continue              # Skip numbers less than 50
    print(number)

50
60
70
80
90
100


In [7]:
# The break keyword halts the execution of a for loop entirely. Use break to "break out" of a for loop.
# It is best to break out of loops early if possible to reduce execution time.
for number in my_sequence :
    if number > 50 :
        break                 # Break out of the for loop if number > 50
    print(number)

0
10
20
30
40
50


### While loops
While loops let you execute code over and over again. They keep executing their contents as long as a logical expression you supply remains true.

It is important to provide a way to break out of the while loop or it will run forever.

While loops should be reserved for cases where you don't know how many times you will need to execute a loop.

In [11]:
x = 5
iters = 0

while iters < x :        # Execute the contents as long as iters < x
    print("Study")
    iters = iters + 1      # Increment iters by 1 each time the loop executes

Study
Study
Study
Study
Study


In [12]:
while True:        # True is always true!
    print("Study!")
    break          # But we break out of the loop here

Study!
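A case where the number of iterations is not obvious in advance makes a more natural fit for a while loop. This sketch, with made-up starting values, counts how many times a number must be halved before it drops below 1:

```python
value = 1000
halvings = 0

while value >= 1:          # Keep going until the value drops below 1
    value = value / 2
    halvings = halvings + 1

print(halvings)
```

The loop ends on its own because value shrinks on every pass, so the condition eventually becomes false.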


### The np.where() function
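Use np.where() when you want to perform the same conditional operation on each object in a numpy or pandas data structure without writing an explicit loop. The cell below first shows the loop-based approach for comparison.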

In [14]:
import numpy as np

# Draw 25 random numbers from -1 to 1
my_data = np.random.uniform(-1,1,25)

for index, number in enumerate(my_data) :
    if number < 0:
        my_data[index] = 0     # Sets numbers less than 0 to 0
print(my_data)

[0.64369151 0.63229363 0.96965457 0.         0.37463731 0.
 0.05840826 0.         0.         0.53593743 0.         0.
 0.         0.37507229 0.57790315 0.94994099 0.12102382 0.32119269
 0.8061548  0.         0.         0.19777092 0.         0.
 0.        ]


Enumerate() takes a sequence and turns it into a sequence of (index,value) tuples. It lets you loop over the items in a sequence while also having access to each item's index.

In [16]:
# The np.where() function lets you perform an if/else check on a sequence with less code:
# np.where() is more efficient than a for loop

my_data = np.random.uniform(-1,1,25)   # Generate new random numbers

my_data = np.where(my_data < 0,        # Logical test
                   0,                  # Value to set if the test is true
                   my_data)            # Value to set if the test is false
print(my_data)

[0.         0.55424429 0.         0.29525008 0.         0.
 0.         0.03563538 0.         0.         0.09606753 0.80310516
 0.         0.         0.         0.         0.         0.05605358
 0.         0.81305239 0.95402976 0.         0.97868369 0.
 0.        ]
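np.where() also accepts a pandas series as the logical test, and the two outcomes do not have to be numbers. A small sketch with made-up ages; the labels "child" and "teen or older" are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([10, 15, 20], index=["Joe", "Bob", "Frans"])

# The result is a plain numpy array with one label per element
age_group = np.where(ages >= 15, "teen or older", "child")

print(age_group)
```

Since the result is a numpy array rather than a series, the index labels are not carried over.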


## 12. Functions

### Defining functions
Define a function using the "def" keyword followed by the function's name, a list of function arguments in parentheses and then a colon. After defining a function, you can call it using the name you assigned to it.

The "return" keyword specifies what the function produces as its output. When a function reaches a return statement, it immediately exits and returns the specified value. 

In [2]:
def my_function(arg1,arg2):     # Defines a new function
    return arg1 + arg2         # Function body

In [3]:
my_function(5,10)

15
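The fact that a function exits as soon as it reaches a return statement can be used to stop work early. In this sketch, first_negative is a made-up function that stops looping at the first match:

```python
def first_negative(numbers):
    """Return the first negative number in a sequence, or None if there is none."""
    for number in numbers:
        if number < 0:
            return number      # Exits the function immediately; the loop stops here
    return None                # Reached only if no negative number was found

print(first_negative([3, -1, -7]))
print(first_negative([3, 1, 7]))
```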

In [4]:
# Set a default value with the argument_name = argument_value syntax
def sum_3_items(x, y, z, print_args = False):
    if print_args:
        print(x,y,z)
    return x + y + z

In [5]:
sum_3_items(5,10,20)     # By default the arguments will not be printed

35

In [6]:
sum_3_items(5,10,20,True)   # Changing the default prints the arguments 

5 10 20


35

A function can be set up to accept any number of named or unnamed arguments. Accept extra unnamed arguments by including "*args" in the argument list. The unnamed arguments are accessible within the function body as a tuple.

In [7]:
def sum_many_args(*args):
    print(type(args))
    return (sum(args))

sum_many_args(1, 2, 3, 4, 5)

<class 'tuple'>


15

Accept additional keyword arguments by putting "**kwargs" in the argument list. The keyword arguments are accessible as a dictionary:

In [8]:
def sum_keywords(**kwargs):
    print(type(kwargs))
    return (sum(kwargs.values()))

sum_keywords(mynum=100, yournum=200)

<class 'dict'>


300
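Both forms can appear in the same function, and the same * and ** symbols can be used when calling a function to unpack an existing sequence or dictionary. A minimal sketch; describe_call, numbers and options are made-up names.

```python
def describe_call(*args, **kwargs):
    """Return how many unnamed arguments and which keyword names were received."""
    return len(args), sorted(kwargs)

numbers = [1, 2, 3]
options = {"mynum": 100, "yournum": 200}

# * unpacks a sequence into unnamed arguments; ** unpacks a dict into keyword arguments
result = describe_call(*numbers, **options)
print(result)
```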

### Function documentation 
It is useful to place some documentation to explain how the function works. You can include documentation below the function definition statement as a multi-line string. Documentation typically includes a short description of what the function does, a summary of the arguments and a description of the value the function returns. Documentation should provide enough information that the user doesn't have to read the code in the body of the function to use it.

In [9]:
import numpy as np
def rmse(predicted, targets):
    """
    Computes the root mean squared error of two numpy ndarrays
    Args: 
        predicted: an ndarray of predictions
        targets: an ndarray of target values
    Returns:
        The root mean squared error as a float
    """
    return (np.sqrt(np.mean((targets-predicted)**2)))
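The documentation string is stored on the function itself, so it can be read back with help() or the __doc__ attribute. The sketch below repeats the function so that it runs on its own and calls it on made-up arrays:

```python
import numpy as np

def rmse(predicted, targets):
    """
    Computes the root mean squared error of two numpy ndarrays
    Args:
        predicted: an ndarray of predictions
        targets: an ndarray of target values
    Returns:
        The root mean squared error as a float
    """
    return np.sqrt(np.mean((targets - predicted) ** 2))

predicted = np.array([2.0, 4.0, 6.0])
targets = np.array([2.0, 4.0, 9.0])

error = rmse(predicted, targets)   # sqrt((0 + 0 + 9) / 3), the square root of 3
print(error)
print(rmse.__doc__)                # The documentation string written above
```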

### Lambda functions
Named functions are great for code that you are going to reuse several times, but sometimes you only need a simple function once. Python provides a shorthand for defining unnamed (anonymous) functions, called lambda functions, which are typically used in situations where you only plan to use a function in one part of your code.

In [10]:
# Create a lambda function
lambda x, y: x + y

<function __main__.<lambda>(x, y)>

"Lambda" is similar to "def". The values x and y are the arguments of the function and the code after the colon is the value that the function returns.

You can assign a lambda function a variable name and use it just like a normal function.

In [11]:
my_function2 = lambda x,y: x + y
my_function2(5,10)

15

The main purpose of a lambda function is for use in situations where you need to create an unnamed function on the fly, such as when using functions that take other functions as input.

In [12]:
# Example of using map() without a lambda function
def square(x):   # Define a function
    return x**2
my_map = map(square, [1,2,3,4,5])   # Pass the function to map()

for item in my_map:
    print(item)

1
4
9
16
25


In [13]:
# Example using map() with a lambda function
my_map = map(lambda x: x**2, [1,2,3,4,5])

for item in my_map:
    print(item)

1
4
9
16
25
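map() is not the only built-in that takes a function as input. As a sketch on made-up data, sorted() accepts a function through its key argument and filter() keeps only the items for which the function returns True:

```python
people = [("Joe", 10), ("Frans", 20), ("Bob", 15)]

# key receives each tuple and returns the value to sort by (the age)
by_age = sorted(people, key=lambda person: person[1])

# filter() keeps the items for which the lambda returns True
older_than_12 = list(filter(lambda person: person[1] > 12, people))

print(by_age)
print(older_than_12)
```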


## 13. List Comprehensions
List comprehensions let you populate lists in one line of code by taking the logic from a for loop and moving it inside the list brackets.

Do not create convoluted one liners when a series of a few shorter, more readable operations will yield the same result.

In [1]:
my_list2 = [number for number in range(0,101)]
# range() creates a sequence of numbers from some specified starting number up to but not including an ending number

print(my_list2)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]


In a list comprehension, the value that you want to append to the list comes first, in this case "number", followed by a for statement. You can also include if clauses after the for clause to filter the results based on some logical check. 

In [2]:
# Add an if statement to filter out odd numbers
my_list3 = [number for number in range(0,101) if number % 2 == 0]
print(my_list3)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]
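To see how the pieces of a comprehension map onto a regular loop, the sketch below builds the same list both ways, using a shorter range to keep the output small:

```python
# Explicit loop version
evens_loop = []
for number in range(0, 11):
    if number % 2 == 0:
        evens_loop.append(number)

# Equivalent list comprehension: value first, then the for clause, then the if clause
evens_comprehension = [number for number in range(0, 11) if number % 2 == 0]

print(evens_loop)
print(evens_loop == evens_comprehension)
```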


It is possible to put more than one for loop in a list comprehension, such as to construct a list from 2 iterables.

In [3]:
# Make a list of each combination of 2 letters in 2 different strings

combined = [a + b for a in "life" for b in "study"]
print(combined)

['ls', 'lt', 'lu', 'ld', 'ly', 'is', 'it', 'iu', 'id', 'iy', 'fs', 'ft', 'fu', 'fd', 'fy', 'es', 'et', 'eu', 'ed', 'ey']


In [4]:
# Nest one list comprehension inside of another

nested = [letters[1] for letters in [a + b for a in "life" for b in "study"]]
print(nested)

['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y']


In [5]:
# Use more lines, it makes the code more readable
combined = [a + b for a in "life" for b in "study"]
non_nested = [letters[1] for letters in combined]
print(non_nested)

['s', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y', 's', 't', 'u', 'd', 'y']


### Dictionary comprehensions
You can create dictionaries quickly in one line using a syntax that mirrors list comprehensions. 

In [6]:
words = ["life","is","study"]

word_length_dict = {}

for word in words:
    word_length_dict[word] = len(word)

print(word_length_dict)

{'life': 4, 'is': 2, 'study': 5}
